#2360 — 10.1, 21.2.2: Unicode code units -> UTF-16 code units


The Unicode Standard defines code units for UTF-8, UTF-16, and UTF-32, but no Unicode code units.
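As a rough illustration of the three kinds of code unit, the sketch below counts how many units one supplementary character occupies in each encoding form. The variable names are made up for this example, and it assumes a runtime with a global TextEncoder (current browsers and Node.js) and a well-formed string, so that the UTF-32 count equals the code point count.

```javascript
// U+1F600 lies outside the BMP, so the three encoding forms disagree on its length.
const sample = "\u{1F600}";

// UTF-8: 8-bit code units. TextEncoder always encodes to UTF-8.
const utf8Units = new TextEncoder().encode(sample).length;

// UTF-16: 16-bit code units. String length in ECMAScript counts these.
const utf16Units = sample.length;

// UTF-32: one 32-bit code unit per code point. String iteration walks code points.
const utf32Units = Array.from(sample).length;

console.log(utf8Units, utf16Units, utf32Units); // 4 2 1
```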


fixed in rev23 editor's draft


fixed in rev23 draft


Problem still exists in rev 25 draft.


fixed in rev26 editor's draft


fixed in rev26


Looked at rev 26 draft:

- The text simplification in 10.1 that got rid of "Unicode code unit" looks good.

- The change in 21.2.2 is not so good. The old-style RegExp system knows nothing about code points: it doesn't support them in a meaningful way, but it also doesn't limit them to the BMP. It simply operates on 16-bit units. A number of people have used it successfully to support full Unicode, but only by mapping regular expressions based on Unicode code points to regular expressions based on UTF-16 code units. You can't talk about "a sequence of 16-bit values that are Unicode code points in the range of the Basic Multilingual Plane" because neither the Unicode Standard nor ECMAScript defines a mapping from sequences of 16-bit values to Unicode code points whose range is limited to the BMP.

When describing the old-style RegExp system, we should not use the term "code point" at all. If you don't like "UTF-16 code unit", then "16-bit code unit" or such is fine.
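The mapping described above, from a pattern over code points to a pattern over UTF-16 code units, might look roughly like the following. This is only a sketch: the helper names are hypothetical, and it assumes the whole range shares a single high surrogate, which a general-purpose converter could not assume.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

// Format one 16-bit code unit as a \uXXXX escape for pattern source text.
function escapeUnit(unit) {
  return "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical helper: code-unit pattern for a supplementary range whose
// endpoints share the same high surrogate (an assumption of this sketch).
function supplementaryRangeToPattern(first, last) {
  const [h1, l1] = toSurrogatePair(first);
  const [h2, l2] = toSurrogatePair(last);
  if (h1 !== h2) {
    throw new RangeError("range spans more than one high surrogate");
  }
  return escapeUnit(h1) + "[" + escapeUnit(l1) + "-" + escapeUnit(l2) + "]";
}

// U+1F600..U+1F64F becomes \uD83D[\uDE00-\uDE4F].
const emoticonSource = supplementaryRangeToPattern(0x1F600, 0x1F64F);
const emoticon = new RegExp("^" + emoticonSource + "$");
const face = "\u{1F600}";

console.log(emoticon.test(face)); // true
console.log(/^.$/.test(face));    // false: "." matches a single 16-bit unit
console.log(/^..$/.test(face));   // true: the character is two 16-bit units
```

The last two lines show the old-style behavior directly: without any notion of code points, the pattern sees two 16-bit units where a reader sees one character.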


I'm sure I see your problem with 21.2.2. The official Unicode glossary defines the term "BMP code point": http://www.unicode.org/glossary/#BMP_code_point

To me, the language in that paragraph seems consistent with that definition.

It also doesn't seem relevant that the old ES5 RegExp could be made to do some useful work with UTF-16 encoded strings. This paragraph is fundamentally about defining the term "Pattern" as used within the rest of the specification. It isn't about the relative utility of "BMP patterns" and "Unicode patterns".


(In reply to Allen Wirfs-Brock from comment #7)
> I'm sure I see your problem with 21.2.2. the official Unicode glossary
      ^
      not
> defines the term "BMP code point"
> http://www.unicode.org/glossary/#BMP_code_point
>
> To me, the language in that paragraph seems consistent with that definition.
>
> It also doesn't seem relevant that the old ES5 RegExp could be made to do
> some useful work with UTF-16 encoded strings. This paragraph is
> fundamentally about defining the term "Pattern" as using within the rest of
> the specification. It isn't about the relative utility of "BMP patterns" and
> "Unicode patterns".


Reclosing as worksforme, as I'm not planning on making any more changes.

If you would like to suggest some specific changes, I'll see if I can get them in.