#2256 — 21.2.2.8.2 Canonicalize: Non-unicode canonicalization not compatible with ES5


21.2.2.8.2 Runtime Semantics: Canonicalize Abstract Operation, steps 2-3:

> 2. If the file CaseFolding.txt of the Unicode Character Database does not provides a simple or common case folding mapping for ch, return ch.
> 3. Else, Let u be the result of apply that mapping to ch.

These steps are not compatible with ES5 for non-unicode, case-insensitive matching. Example:

- CaseFolding-6.3.0.txt, simple/common case folding entry for u1E9E is u00DF.
- CaseFolding-6.3.0.txt, simple/common case folding entry for u00DF is not explicitly defined, that means it defaults to u00DF.

So Canonicalize(u00DF) == u00DF and Canonicalize(u1E9E) == u00DF, but for ES5 compatibility Canonicalize(u1E9E) must not be mapped to u00DF, but instead to u1E9E (if ignoreCase=true and unicode=false).


For the equivalent of steps 2 and 3, the ES5 spec. indirects through String.prototype.toUpperCase (which itself is defined in terms of toLowerCase).

The ES5 toLowerCase language that corresponds to steps 2 and 3 is:

toLowerCase step 3: "Let L be a String where each character of L is either the Unicode lowercase equivalent of the corresponding character of S or the actual corresponding character of S if no Unicode lowercase equivalent exists."

and

"The result must be derived according to the case mappings in the Unicode character database (this includes not only the UnicodeData.txt file, but also the SpecialCasings.txt file..."

I read the ES5 language and ES6 steps 2 and 3 as equivalent semantic statements. If that is correct, then this is an ES5 compatibility issue.

Perhaps your implementation doesn't conform to ES5 or perhaps ES5 differs from consensus web reality and needs to be brought back into conformance. Or, perhaps ES5 specifies an acceptable breaking change to web reality...

It probably needs to be discussed on es-discuss where the Unicode experts may see it.


(In reply to comment #1)
> I read the ES5 language and ES6 steps 2 and 3 as equivalent semantic
> statements. If that is correct, then this is an ES5 compatibility issue.
>

The mapping in CaseFolding.txt maps code points to their lower-case counterparts, which means it behaves (more or less) like String.prototype.toLowerCase(). ES5's Canonicalize operation, by contrast, uses String.prototype.toUpperCase(). Simply substituting toLowerCase() for toUpperCase() does not give the same semantics, at least for 'u00df', which is a special case.
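
The asymmetry can be checked directly. This is only a minimal sketch, assuming an engine whose toLowerCase() and toUpperCase() follow the Unicode data quoted in this report; the variable names are illustrative and do not come from any spec.

```javascript
// The two sharp s characters discussed in this report.
const small = "\u00DF";   // LATIN SMALL LETTER SHARP S
const capital = "\u1E9E"; // LATIN CAPITAL LETTER SHARP S

// Lower-casing puts both characters into the same equivalence class.
const sameByLower = small.toLowerCase() === capital.toLowerCase();

// Upper-casing keeps them apart: U+00DF expands to "SS" (SpecialCasing),
// while U+1E9E has no upper-case mapping and is returned unchanged.
const sameByUpper = small.toUpperCase() === capital.toUpperCase();

console.log(sameByLower, sameByUpper);
```

Under that assumption the first comparison holds and the second does not, which is why swapping one method for the other changes which characters match.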


(In reply to comment #2)
> (In reply to comment #1)
> > I read the ES5 language and ES6 steps 2 and 3 as equivalent semantic
> > statements. If that is correct, then this is an ES5 compatibility issue.
> >
>
> The mapping in CaseFolding.txt maps code points to their lower-case
> counterparts, which means it behaves (more or less) like
> String.prototype.toLowerCase(). ES5's Canonicalize operation, by contrast,
> uses String.prototype.toUpperCase(). Simply substituting toLowerCase() for
> toUpperCase() does not give the same semantics, at least for 'u00df', which
> is a special case.

But the ES5 spec. for toUpperCase (and the ES3 spec. before it) says to do that exact substitution (and to use the Unicode uppercase mappings). Also, don't the Unicode mapping tables (including the specialcase mappings) take u+00df into account?

Are you suggesting that the base ES3/5 spec. language is wrong? If so, what should it be?


(In reply to comment #3)
> But, the ES5 spec. for toUpperCase (and the ES3 spec. before it) says to do
> that exact substitution (and to use the Unicode uppercase mappings). Also,
> don't the Unicode mapping tables (including the specialcase mappings) take
> u+00df into account?

Here's my reasoning for why the current definition in the ES6 draft is not compatible with ES5.

First of all, the relevant data from UnicodeData, SpecialCasing and CaseFolding:

UnicodeData-6.3.0.txt:
<code>; <name>; fields 2...11; <simple-uppercase-mapping>; <simple-lowercase-mapping>; <simple-titlecase-mapping>
00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;
1E9E;LATIN CAPITAL LETTER SHARP S;Lu;0;L;;;;;N;;;;00DF;

SpecialCasing-6.3.0.txt:
<code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S

CaseFolding-6.3.0.txt:
<code>; <status>; <mapping>; # <name>
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S

ES5-Canonicalize using the data from UnicodeData + SpecialCasing gives the following results:
(1) u+00df:
- no upper-case mapping entry in UnicodeData
- upper-case mapping in SpecialCasing to "0053 0053", which is rejected because |length| > 1
=> result: u+00df

(2) u+1e9e:
- no upper-case mapping entry in UnicodeData
- no upper-case mapping entry in SpecialCasing
=> result: u+1e9e
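
The ES5 walkthrough above can be sketched in a few lines. This is an illustration, not the spec's wording: canonicalizeES5 is a hypothetical name, the IgnoreCase = true case is assumed, and the engine's toUpperCase() is assumed to follow UnicodeData plus SpecialCasing. The ASCII guard corresponds to step 5 of ES5.1 Canonicalize, which is not quoted in this report.

```javascript
// Sketch of ES5.1 Canonicalize, assuming IgnoreCase is true.
// "canonicalizeES5" is a hypothetical helper name.
function canonicalizeES5(ch) {
  // Step 2: upper-case the one-character string.
  const u = ch.toUpperCase();
  // Step 3: reject multi-character results such as "SS".
  if (u.length !== 1) return ch;
  // Step 5: a non-ASCII character must not canonicalize to an ASCII one.
  if (ch.charCodeAt(0) >= 128 && u.charCodeAt(0) < 128) return ch;
  return u;
}

// Both sharp s characters canonicalize to themselves.
console.log(canonicalizeES5("\u00DF") === "\u00DF");
console.log(canonicalizeES5("\u1E9E") === "\u1E9E");
```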


ES6-Canonicalize using the data from CaseFolding gives the following results:
(1) u+00df:
- mapping to "0073 0073" is rejected, because it is not a simple or common case folding
=> result: u+00df

(2) u+1e9e:
- mapping to "0073 0073" is rejected, because it is not a simple or common case folding
- mapping to "00DF" is accepted, because it is a simple case folding
=> result: u+00df
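
The draft ES6 steps can be sketched the same way. The table below holds only the three CaseFolding-6.3.0.txt rows quoted above; canonicalizeDraft and caseFoldingExcerpt are hypothetical names, and a real implementation would consult the full file.

```javascript
// The three CaseFolding-6.3.0.txt rows quoted above: [code, status, mapping].
const caseFoldingExcerpt = [
  ["00DF", "F", "0073 0073"],
  ["1E9E", "F", "0073 0073"],
  ["1E9E", "S", "00DF"],
];

// Sketch of draft ES6 Canonicalize steps 2-3 against that excerpt only.
function canonicalizeDraft(ch) {
  const code = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  // Accept only simple (S) or common (C) foldings; full (F) ones are skipped.
  const entry = caseFoldingExcerpt.find(
    ([c, status]) => c === code && (status === "S" || status === "C")
  );
  // Step 2: no simple or common mapping, so return ch unchanged.
  if (!entry) return ch;
  // Step 3: apply the single-code-point mapping.
  return String.fromCodePoint(parseInt(entry[2], 16));
}

// U+00DF stays put, but U+1E9E is folded to U+00DF.
console.log(canonicalizeDraft("\u00DF") === "\u00DF");
console.log(canonicalizeDraft("\u1E9E") === "\u00DF");
```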


That means ES6-Canonicalize gives a different result than ES5-Canonicalize for u+1e9e.


(In reply to comment #4)
...
>
> ES5-Canonicalize using the data from UnicodeData + SpecialCasing gives the
> following results:
> (1) u+00df:
> - no upper-case mapping entry in UnicodeData
> - upper-case mapping in SpecialCasing to "0053 0053", which is rejected because
> |length| > 1
> => result: u+00df
>

ES5 doesn't have that length restriction. There is an explicit note that the length of the result string may not be the same as the input string.

But I'm not sure that really changes your point. Norbert and the other Unicode champions need to address this.


(In reply to comment #5)
> ES5 doesn't have that length restriction. There is an explicit note that the
> length of the result string may not be the same as the input string.

No? ES5.1, p.189, abstract operation Canonicalize steps 2-3:
> 2. Let u be ch converted to upper case as if by calling the standard built-in method String.prototype.toUpperCase on the one-character String ch.
> 3. If u does not consist of a single character, return ch.


André is right: case-insensitive matching using Unicode case folding behaves differently in a few cases from matching using toUpperCase. Case folding should therefore only be used in Unicode mode, as noted in my proposal in the second bullet list under Regular Expressions:
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#RegExp

In non-Unicode mode, toUpperCase restricted to results with length 1 has to be used for compatibility with ES5.
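
The split between the two modes is observable from script. This is a minimal sketch, assuming an engine that implements the "u" flag and the mode-dependent canonicalization described here; the variable names are illustrative only.

```javascript
const sharpS = "\u00DF";        // LATIN SMALL LETTER SHARP S
const capitalSharpS = "\u1E9E"; // LATIN CAPITAL LETTER SHARP S

// Non-Unicode mode: toUpperCase-based canonicalization keeps the two apart.
const legacyMatch = new RegExp(capitalSharpS, "i").test(sharpS);

// Unicode mode: simple case folding maps U+1E9E to U+00DF, so they match.
const unicodeMatch = new RegExp(capitalSharpS, "iu").test(sharpS);

console.log(legacyMatch, unicodeMatch);
```

Under that assumption the non-Unicode match fails and the Unicode-mode match succeeds.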


Note that this also affects the non-normative last paragraph of 21.2.2.8.2.


fixed in rev23 editor's draft


fixed in rev23 draft