#2373 — Specify String in terms of mapping from string tokens to string values


The prose in Section 9 could be improved by clearly framing it in terms of a mapping from string tokens to string values, where a string value is a sequence of code points.

At the moment, the prose is rather unclear. Sometimes it uses the term "code point" or "character" to refer to code points occurring literally in the token ("A string is a sequence of Unicode code points wrapped with quotation marks"), and sometimes it uses those terms to refer to code points in the string value ("Any code point may be represented as a hexadecimal number").

I suggest it should say something like this:

- a string token represents a string value, which is a sequence of code points

- an unescaped code point within the string token represents itself

- a pair of escapes \uXXXX\uYYYY, where U+XXXX is a high surrogate code point and U+YYYY is a low surrogate code point, represents the code point that the surrogate pair U+XXXX, U+YYYY would represent in UTF-16

- in any other case, an escape \uXXXX represents the code point U+XXXX

- any other escape \C represents a single code point, as follows ....
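
The rules above can be sketched as a function from a string token to a list of code points. This is only an illustration of the proposed framing, not text from the specification. The names `string_value`, `_read_hex4` and `SIMPLE_ESCAPES` are hypothetical, and the contents of `SIMPLE_ESCAPES` are an assumption, since the list of \C escapes is left open above.

```python
# Hypothetical table for the single-character escapes \C. The list is left
# open in the proposal, so these four entries are illustrative assumptions.
SIMPLE_ESCAPES = {'"': '"', '\\': '\\', 'n': '\n', 't': '\t'}

HEX_DIGITS = "0123456789abcdefABCDEF"


def _read_hex4(body, i):
    """Return the value of the \\uXXXX escape whose backslash is at index i."""
    digits = body[i + 2:i + 6]
    if len(digits) != 4 or any(c not in HEX_DIGITS for c in digits):
        raise ValueError(f"malformed \\u escape at offset {i}")
    return int(digits, 16)


def string_value(token):
    """Map a string token (quotation marks included) to a list of code points."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError("token must be wrapped in quotation marks")
    body = token[1:-1]
    value = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '"':
            raise ValueError(f"unescaped quotation mark at offset {i}")
        if ch != '\\':
            # An unescaped code point within the token represents itself.
            value.append(ord(ch))
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("backslash at end of token")
        kind = body[i + 1]
        if kind == 'u':
            first = _read_hex4(body, i)
            i += 6
            # A high surrogate escape immediately followed by a low surrogate
            # escape represents the code point the pair encodes in UTF-16.
            if 0xD800 <= first <= 0xDBFF and body[i:i + 2] == '\\u':
                second = _read_hex4(body, i)
                if 0xDC00 <= second <= 0xDFFF:
                    value.append(
                        0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
                    )
                    i += 6
                    continue
            # In any other case, \uXXXX represents the code point U+XXXX.
            value.append(first)
        elif kind in SIMPLE_ESCAPES:
            value.append(ord(SIMPLE_ESCAPES[kind]))
            i += 2
        else:
            raise ValueError(f"unknown escape at offset {i}")
    return value
```

Under this reading, `string_value(r'"\uD83D\uDE00"')` yields the single code point U+1F600, while a lone `\uD83D` yields the surrogate code point U+D83D itself. Whether lone surrogates should be permitted in a string value is a separate question that the sketch does not try to settle.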

Switching from syntax diagrams to BNF would make it easier to express this rigorously.