#525 — Issues in section 6


(1) see bug 524.

(2) Replace "The phrase “Unicode character” refer to the abstract linguistic or typographical unit represented by a single Unicode scalar value." with "All Unicode code point values from U+0000 to U+10FFFF, including surrogate code points, may occur in source text where permitted by the grammar."

(3) Remove "Any well defined encoding such as UTF-32 or UTF-16 may be used. Source text might even be externally represented using a non-Unicode character encoding." Since the sentence before says it's not relevant, we don't need to discuss it.

(4) Remove "Each source character being an abstract Unicode characters with a corresponding Unicode scalar value." Code point is all we need.

(5) Remove the paragraph "The phrase “code point” ....". It's partially wrong (Unicode scalar values are a proper subset of Unicode code points), and we don't need it.

(6) The description of Unicode escape sequences has to allow for expression of a single (supplementary) code point by a sequence of two old-style Unicode escape sequences, e.g., "\u{10000}" === "\uD800\uDC00".
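The equivalence claimed in item (6), and the scalar-value distinction from item (5), can be checked with a short sketch. It assumes an engine that already accepts the braced `\u{...}` escape form; `isScalarValue` is a hypothetical helper written for illustration and is not part of any draft.

```javascript
// The two spellings from item (6): one braced escape versus two old-style escapes.
const braced = "\u{10000}";
const paired = "\uD800\uDC00";

const sameString = braced === paired;     // strings compare by UTF-16 code units
const codeUnits = paired.length;          // number of UTF-16 code units in the string
const codePoint = paired.codePointAt(0);  // code point decoded from the surrogate pair

// Hypothetical helper: a scalar value is any code point outside the
// surrogate range U+D800..U+DFFF, so scalar values are a proper subset.
function isScalarValue(cp) {
  return cp >= 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

console.log(sameString, codeUnits, codePoint.toString(16));
console.log(isScalarValue(0xD800), isScalarValue(0x10000));
```

In a string literal both spellings yield the same two code units, which is why a strict 1:1 reading of "escape sequence" and "code point" does not hold there.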


(In reply to comment #0)
...
>
> (6) The description of Unicode escape sequences has to allow for expression of
> a single (supplementary) code point by a sequence of two old-style Unicode
> escape sequences, e.g., "\u{10000}" === "\uD800\uDC00".

I don't necessarily agree. See discussion under https://bugs.ecmascript.org/show_bug.cgi?id=469

Basically, the only places in current implementations where a sequence of old-style Unicode escape sequences that forms a valid surrogate pair is equivalent to the literal appearance of the corresponding supplementary character are string and regexp literals. In particular, such a sequence in an IdentifierName always results in a syntax error, even if the supplementary character is an ID_Start or ID_Continue character. For example,
var x\uD87E\uDC00;

reportedly produces an early error in current browsers, even though U+2F800 is a valid ID_Continue character as of Unicode 3.1. So this doesn't appear to be a backwards-compatibility issue.
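One way to probe this behavior in a given engine is to hand the source text to the `Function` constructor and see whether it parses. This is only a sketch: `parses` is a hypothetical helper, the results depend on the engine being tested, and the braced `\u{2F800}` form is included purely for comparison on engines that support it.

```javascript
// Hypothetical helper: reports whether `source` parses as a function body.
function parses(source) {
  try {
    new Function(source); // parses the body without running it
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything other than a parse failure is unexpected
  }
}

// Surrogate pair written as two old-style escapes inside an identifier.
const viaPair = parses("var x\\uD87E\\uDC00;");
// The same code point written with a single braced escape (engine-dependent).
const viaBraced = parses("var x\\u{2F800};");
// The same pair of escapes inside a string literal.
const inString = parses('var s = "\\uD87E\\uDC00";');

console.log(viaPair, viaBraced, inString);
```

The doubled backslashes keep the escapes as literal text in the probed source, so the parser under test sees them rather than the enclosing string literal.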

The current draft deals with such sequences in string literals and quasis and, in the future, RegExp literals. It isn't clear that they need to be allowed outside of such literal contexts.


(In reply to comment #1)

This bug is about clause 6, which covers source text in general. Its third paragraph says "In string literals, regular expression literals,quasi literals and identifiers, any Unicode characters may also be expressed as a Unicode escape sequence that explicitly express a code point’s numeric value. "

The text assumes a 1:1 match between Unicode escape sequences and code points. I think we have agreement that in at least some of the situations mentioned a pair of Unicode escape sequences can be used to express a code point.
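To make the mismatch concrete, the following sketch spells a code point using only old-style four-digit escapes, following the standard UTF-16 surrogate arithmetic. `toOldStyleEscapes` is a hypothetical helper for illustration and assumes a valid code point as input.

```javascript
// Hypothetical helper: spell a code point as old-style \uXXXX escape text.
function toOldStyleEscapes(cp) {
  const hex4 = (n) => "\\u" + n.toString(16).toUpperCase().padStart(4, "0");
  if (cp <= 0xFFFF) return hex4(cp);        // BMP: a single escape suffices
  const offset = cp - 0x10000;              // 20-bit offset into the supplementary planes
  const high = 0xD800 + (offset >> 10);     // top 10 bits select the high surrogate
  const low = 0xDC00 + (offset & 0x3FF);    // bottom 10 bits select the low surrogate
  return hex4(high) + hex4(low);            // supplementary: two escapes
}

console.log(toOldStyleEscapes(0x41));
console.log(toOldStyleEscapes(0x10000));
console.log(toOldStyleEscapes(0x2F800));
```

Any code point above U+FFFF needs two such escapes, so text that counts one escape sequence per code point cannot describe the old-style form.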


fixed in rev20 editor's draft


fixed in rev20 draft, Oct. 28, 2013


Verified in rev 26 draft.