
#469 — Clarify whether supplementary Unicode symbols are allowed in Identifiers or not


ECMAScript 5.1:

> Throughout the rest of this document, the phrase “code unit” and the word “character” will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text.

This effectively meant that supplementary Unicode characters (e.g. 丽, i.e. U+2F800 CJK Compatibility Ideograph, which is listed in the [Lo] category) are disallowed in identifier names, as JavaScript interprets them as two individual surrogate halves (e.g. \uD87E\uDC00) which don’t match any of the allowed Unicode categories.
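For illustration, a minimal sketch of how an ES5 engine sees this character at the code unit level (using only standard string methods):

```js
// U+2F800 is stored as two UTF-16 code units (a surrogate pair).
var s = '\uD87E\uDC00';
s.length;                      // 2: two code units, not one character
s.charCodeAt(0).toString(16);  // "d87e", the high (lead) surrogate
s.charCodeAt(1).toString(16);  // "dc00", the low (trail) surrogate
// Neither 0xD87E nor 0xDC00 is in [Lo] (or any identifier category),
// so an ES5 lexer rejects the character in an identifier.
```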

Did this change now that the spec defines “Unicode character” as a code point (rather than a code unit)? Specifically:

UnicodeIDStart ::
any Unicode character with the Unicode property “ID_Start”.
UnicodeIDStart ::
any Unicode character with the Unicode property “ID_Continue”

If I understood correctly, this production matches non-BMP symbols as well. Following that logic, e.g. `\uD87E\uDC00` would be disallowed as an IdentifierName (because these surrogate halves don’t have these Unicode properties) although `\u{2F800}` would be allowed. If this is the case, it’s very confusing. And then, what would happen if the raw Unicode symbol were used? Would it behave similarly to the Unicode code point escape or to the surrogates-based Unicode escape sequence?
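For illustration, a sketch of the asymmetry being described here (assuming an engine that implements this draft):

```js
var \u{2F800} = 42;       // valid: U+2F800 has the ID_Start property
// var \uD87E\uDC00 = 42; // SyntaxError: neither U+D87E nor U+DC00 is ID_Start
```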

Either way, could this be clarified please?


(Note that the second `UnicodeIDStart` occurrence should say `UnicodeIDContinue` instead; see bug 465.)


(In reply to comment #0)
> ECMAScript 5.1:
>
> > Throughout the rest of this document, the phrase “code unit” and the word “character” will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text.
>
> This effectively meant that supplementary Unicode characters (e.g. 丽, i.e.
> U+2F800 CJK Compatibility Ideograph, which is listed in the [Lo] category) are
> disallowed in identifier names, as JavaScript interprets them as two individual
> surrogate halves (e.g. \uD87E\uDC00) which don’t match any of the allowed
> Unicode categories.

Yes, the language of ES5.1 and also ES<5 (using slightly different language) seems to disallow supplementary Unicode characters in identifiers. I don't know whether or not any implementations allowed them. Do you have any data on that?

>
> Did this change now that the spec defines “Unicode character” as a code point
> (rather than a code unit)? Specifically:

Yes, supplementary characters are now allowed in identifiers if they are ID_Start or ID_Continue characters.

> ...
> If I understood correctly, this production matches non-BMP symbols as well.
> Following that logic, e.g. `\uD87E\uDC00` would be disallowed as an
> IdentifierName (because these surrogate halves don’t have these Unicode
> properties) although `\u{2F800}` would be allowed. If this is the case, it’s
> very confusing.

That is exactly the case, according to this draft. Logically, the alphabet of the lexical grammar consists of abstract Unicode characters, not UTF-16 code units. Each \uNNNN escape represents a single Unicode character, and neither U+D87E nor U+DC00 is a Unicode ID character. No UTF-16 decoding of pairs of \uNNNN escapes is done as part of lexical recognition. If somebody wants to use an escape to insert a supplementary character into an identifier, they should use the \u{} form.

This doesn't create a backwards compatibility problem, assuming that implementations currently don't recognize supplementary character UTF-16 escape sequences such as \uD87E\uDC00 as valid identifier characters.

I'm also not sure why you find this confusing. The general rule is that logically source code consists of abstract Unicode characters and that Unicode escapes (both \uNNNN and \u{}) represent a single Unicode character (code point).

We could explicitly define the identifier grammar to recognize surrogate pair escape sequences (e.g., \uD87E\uDC00) in identifiers, but why would that be less confusing? Instead of the simple rule that source text is logically full Unicode characters, we would be adding an exception: in identifiers you can also use UTF-16 escape sequences. Isn't that more complex and potentially more confusing? The best approach will be for people to only use \u{} notation and consider \uNNNN obsolete.

> And then, what would happen if the raw Unicode symbol were
> used? Would it behave similarly to the Unicode code point escape or to
> the surrogates-based Unicode escape sequence?

It would be treated as the Unicode character that it is.

The key point is that we are defining ECMAScript source code in terms of true Unicode characters. The language definition doesn't care about encodings such as UTF-8, UTF-16, UTF-32, or even whether source files are stored using some other character set such as Big5. It is up to an ECMAScript implementation to decide which external character encodings it is able to process and to logically translate source text in those encodings into abstract Unicode characters.

>
> Either way, could this be clarified please?

I can work on that, but is the specification text actually unclear, or are you just applying your own assumptions when reading it? What specific statements in the spec do you find to actually be unclear?


(In reply to comment #2)
> Yes, the language of ES5.1 and also ES<5 (using slightly different language)
> seems to disallow supplementary Unicode characters in identifiers. I don't
> know whether or not any implementations allowed them. Do you have any data on
> that?

AFAIK no implementation allows this, which is a good thing indeed. (Tested in recent versions of Firefox, Opera, Chrome, Safari.)

> I'm also not sure why you find this confusing. The general rule is that
> logically source code consists of abstract Unicode characters and that Unicode
> escapes (both \uNNNN and \u{}) represent a single Unicode character (code
> point).

What’s confusing IMHO is that with the current spec, a supplementary Unicode character is both valid and invalid in an Identifier, depending on how it’s represented (as a surrogate pair escape: not allowed; as a Unicode code point escape or raw character: allowed). With ES5 you could say “this symbol may / may not be used in an identifier”, but now it gets harder to explain.

If you don’t think that’s a problem, feel free to close this issue.


I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF, as a single code point in all situations. There are tools around that convert any non-ASCII characters into (old-style) Unicode escapes. These escape sequence pairs have been used to represent supplementary characters in ES5 string literals, regular expression literals, and JSON input, and work in many cases (although not in all parts of regular expressions).
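For reference, this is the standard UTF-16 arithmetic such treatment relies on; the helper name below is illustrative, not a spec or library API:

```js
// Combine a high/low surrogate pair into the single code point it encodes.
function surrogatePairToCodePoint(high, low) {
  // Preconditions: 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF.
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

surrogatePairToCodePoint(0xD87E, 0xDC00).toString(16); // "2f800"
```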


(In reply to comment #4)
> I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF,
> as a single code point in all situations. There are tools around that convert
> any non-ASCII characters into (old-style) Unicode escapes. These escape
> sequence pairs have been used to represent supplementary characters in ES5
> string literals, regular expression literals, and JSON input, and work in many
> cases (although not in all parts of regular expressions).

All the situations you name above are literals that are already covered in the July 8 draft (except for RegExps, but that is simply because that section has not yet been updated). What is currently excluded by the draft is recognizing \uDxxx\uDyyy sequences in non-literal contexts such as IdentifierName or Punctuators. It appears that existing browsers report errors in such situations, so there are no backwards compatibility issues in that regard.


(In reply to comment #5)
See the ongoing discussion on es-discuss, including
https://mail.mozilla.org/pipermail/es-discuss/2012-July/024097.html


(In reply to comment #4)
> I think we have to treat \uDxxx\uDyyy, where 0x800 ≤ xxx < 0xC00 ≤ yyy ≤ 0xFFF,
> as a single code point in all situations. There are tools around that convert
> any non-ASCII characters into (old-style) Unicode escapes. These escape
> sequence pairs have been used to represent supplementary characters in ES5
> string literals, regular expression literals, and JSON input, and work in many
> cases (although not in all parts of regular expressions).

+9001 to this.

Making this happen would make it possible to create an ES5 polyfill for `String.isIdentifier{Start,Part}`. As long as `String.isIdentifierStart('\uD87E\uDC00')` and `String.isIdentifierStart('\u{2F800}')` are expected to return different results, this is impossible.
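(`String.isIdentifier{Start,Part}` is a proposed API here, not part of any published spec.) The reason the polyfill is impossible is that the two notations are indistinguishable at runtime:

```js
// Both escape forms evaluate to the same string value, so a function
// receiving the string cannot tell which notation appeared in the source.
'\uD87E\uDC00' === '\u{2F800}'; // true
```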


Just to summarize what is specified in the ES6 draft:

Within string literals, regular expression literals, and template literals, non-BMP characters can be expressed directly as SourceCharacters, by using \u{xxxxx} escapes, or by using a sequence of two \uxxxx escapes that form a valid surrogate pair.

Within identifiers and keywords (though there are no keywords containing non-BMP code points), non-BMP characters can be expressed directly or by \u{xxxxx} escapes. However, a \uxxxx\uxxxx sequence will not be recognized as a valid non-BMP identifier character. If you want to put an escaped non-BMP character into an identifier, use the \u{xxxxx} form.
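A compact sketch of these rules (assuming an ES6-conformant engine):

```js
// String literals: all escape forms are accepted and produce the same value.
var a = '\u{2F800}';           // code point escape
var b = '\uD87E\uDC00';        // surrogate pair escape, decoded in literals
a === b;                       // true

// Identifiers: only the raw character or the code point escape is accepted.
var \u{2F800} = 1;             // valid
// var \uD87E\uDC00 = 1;       // SyntaxError: pair escapes are not
                               // recognized in identifiers
```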