#2373 — Specify String in terms of mapping from string tokens to string values

bug_id: 2373
creation_ts: 2013-12-10 19:45:00 -0800
short_desc: Specify String in terms of mapping from string tokens to string values
delta_ts: 2013-12-10 23:32:13 -0800
product: ECMA-404 JSON
component: 1st Edition
version: unspecified
rep_platform: All
op_sys: All
bug_status: CONFIRMED
priority: Normal
bug_severity: normal
everconfirmed: true
reporter: James Clark
assigned_to: Douglas Crockford
cc: ["allen", "jjc", "pfpschneider"]

commentid: 6913
comment_count: 0
who: James Clark
bug_when: 2013-12-10 19:45:09 -0800

The prose in Section 9 could be improved by clearly framing it in terms of a mapping from string tokens to string values, where a string value is a sequence of code points.

At the moment, the prose is rather unclear. Sometimes it uses the term "code point" or "character" to refer to code points occurring literally in the token ("A string is a sequence of Unicode code points wrapped with quotation marks"), and sometimes it uses those terms to refer to code points in the string value ("Any code point may be represented as a hexadecimal number").

I suggest it should say something like this:

- a string token represents a string value, which is a sequence of code points

- an unescaped code point within the string token represents itself

- a pair of escapes \uXXXX\uYYYY, where U+XXXX is a high surrogate code point and U+YYYY is a low surrogate code point, represents the code point that the surrogate pair U+XXXX, U+YYYY would represent in UTF-16

- in any other case, an escape \uXXXX represents the code point U+XXXX

- other escapes \C represent a single code point as follows ....

Switching from syntax diagrams to BNF would make it easier to express this rigorously.
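
As an illustration only, the mapping suggested in the list above could be sketched in Python roughly as follows. The names `string_value`, `_hex_escape`, and `SIMPLE_ESCAPES` are hypothetical and not taken from ECMA-404. The sketch assumes its input is the text between the quotation marks of a string token, models a string value as a list of integer code points, and takes the single-character escapes to be JSON's usual set. It does not check that unescaped code points are ones the grammar permits.

```python
# Hypothetical sketch: map the body of a string token (the text between the
# quotation marks) to a string value, modelled as a list of code points.

# Assumed table for the single-character escapes \C.
SIMPLE_ESCAPES = {
    '"': 0x22, '\\': 0x5C, '/': 0x2F, 'b': 0x08,
    'f': 0x0C, 'n': 0x0A, 'r': 0x0D, 't': 0x09,
}

HEX_DIGITS = set('0123456789abcdefABCDEF')


def _hex_escape(body, i):
    """Return XXXX as an int if body[i:] starts with \\uXXXX, else None."""
    digits = body[i + 2:i + 6]
    if body[i:i + 2] == '\\u' and len(digits) == 4 and set(digits) <= HEX_DIGITS:
        return int(digits, 16)
    return None


def string_value(body):
    code_points = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != '\\':
            # An unescaped code point represents itself.
            code_points.append(ord(ch))
            i += 1
            continue
        first = _hex_escape(body, i)
        if first is not None:
            # Only look for a second escape when the first is a high surrogate.
            second = _hex_escape(body, i + 6) if 0xD800 <= first <= 0xDBFF else None
            if second is not None and 0xDC00 <= second <= 0xDFFF:
                # \uXXXX\uYYYY forming a surrogate pair: combine as UTF-16 would.
                code_points.append(0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00))
                i += 12
            else:
                # In any other case, \uXXXX represents U+XXXX (even a lone surrogate).
                code_points.append(first)
                i += 6
            continue
        # Remaining escapes \C represent a single code point via the table.
        esc = body[i + 1:i + 2]
        if esc not in SIMPLE_ESCAPES:
            raise ValueError(f"invalid escape at offset {i}")
        code_points.append(SIMPLE_ESCAPES[esc])
        i += 2
    return code_points
```

Under these assumptions, `string_value('\\uD834\\uDD1E')` yields `[0x1D11E]`, while an unpaired `string_value('\\uD834')` yields `[0xD834]`, which matches the "in any other case" rule.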
