#2371 — JSON cannot be transmitted using any Unicode encoding form

bug_id: 2371
creation_ts: 2013-12-10 19:24:00 -0800
short_desc: JSON cannot be transmitted using any Unicode encoding form
delta_ts: 2013-12-11 09:57:04 -0800
product: ECMA-404 JSON
component: 1st Edition
version: unspecified
rep_platform: All
op_sys: All
bug_status: CONFIRMED
priority: Normal
bug_severity: normal
everconfirmed: true
reporter: Peter F. Patel-Schneider
assigned_to: Douglas Crockford
cc: ["allen", "james.h.manger", "pfpschneider"]

commentid: 6911
comment_count: 0
who: Peter F. Patel-Schneider
bug_when: 2013-12-10 19:24:51 -0800

According to ECMA-404, 1st edition / October 2013, a JSON text is a sequence
of Unicode code points. In particular, the code point sequence <0022,
DEAD, 0022> is a valid JSON text.

However, this code point sequence cannot be represented in UTF-8, UTF-16, or
UTF-32, as it is not a sequence of Unicode scalar values, and Unicode
encoding forms are only defined on Unicode scalar values.

As JSON is a text format designed to facilitate data interchange, this is a
bug that should be fixed.
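
The encoding claim above can be checked with a small sketch. It assumes Python 3.4 or later, whose strict UTF-8, UTF-16, and UTF-32 codecs reject surrogate code points; the helper name `try_encode` is hypothetical and is not part of ECMA-404 or of this report.

```python
# The code point sequence <0022, DEAD, 0022> from the report:
# a quotation mark, the unpaired surrogate U+DEAD, a quotation mark.
text = "\u0022\udead\u0022"


def try_encode(s: str, encoding: str) -> bool:
    """Return True if s can be encoded with the given strict codec."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        # Strict codecs refuse surrogates because encoding forms are
        # defined only on Unicode scalar values.
        return False


# Try each of the three Unicode encoding forms named in the report.
results = {enc: try_encode(text, enc) for enc in ("utf-8", "utf-16", "utf-32")}
print(results)
```

Under these assumptions every entry in `results` is `False`, while an ordinary JSON text such as `"ok"` encodes without error.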

commentid: 6926
comment_count: 1
who: James Manger
bug_when: 2013-12-11 05:44:44 -0800

If ECMA-404 defined JSON text (and JSON strings) as a sequence of Unicode scalar values it would work well.

Additional rules mandating specific support for unpaired surrogates (as 16-bit code units and as \uDxxx escapes) could be put in ECMAScript.

commentid: 6927
comment_count: 2
who: Allen Wirfs-Brock
bug_when: 2013-12-11 09:21:26 -0800

(In reply to comment #1)
> If ECMA-404 defined JSON text (and JSON strings) as a sequence of Unicode
> scalar values it would work well.
>
> Additional rules mandating specific support for unpaired surrogates (as 16-bit
> code units and as \uDxxx escapes) could be put in ECMAScript.

That would force ECMA-262 to define an extended JSON grammar, and that is something we are trying to avoid. We want to be able to just reference the normative grammar in ECMA-404.

ECMAScript/JavaScript is far from being the only language that has a string data type that allows the encoding of unpaired surrogates. Embedded JSON parsers for those languages all have to deal with them, one way or another.

It's better for ECMA-404 to define this rather than each language doing its own thing.

Specifications such as 4627bis that define encodings used to transmit/interchange JSON texts are free to require the use of a subset of Unicode code points: for example, only scalar values or, at an extreme, only ASCII values.

Basically, the ECMA-404 approach is to start with a more general specification of the JSON format with the expectation that other specifications will specialize it with restrictions.
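
As a rough illustration of the kind of restriction a transmission specification might layer on top, the sketch below checks whether a text consists only of Unicode scalar values. The function name `is_scalar_value_text` is hypothetical, and the parsing behaviour shown is that of Python's standard `json` module, not something ECMA-404 or 4627bis is stated to mandate in this thread.

```python
import json


def is_scalar_value_text(s: str) -> bool:
    """Return True if every code point in s is a Unicode scalar value,
    i.e. none falls in the surrogate range U+D800..U+DFFF."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


# A JSON text that spells the unpaired surrogate as an escape sequence.
# The text itself is pure ASCII, so it passes the scalar-value check
# and can be transmitted in any Unicode encoding form.
json_text = '"\\uDEAD"'
text_ok = is_scalar_value_text(json_text)

# Python's json module accepts the escape and yields a one-character
# string holding the unpaired surrogate, which the host language must
# then deal with one way or another.
value = json.loads(json_text)
value_ok = is_scalar_value_text(value)
```

Here `text_ok` is true and `value_ok` is false: restricting the transmitted text to scalar values does not by itself keep unpaired surrogates out of the parsed strings, because the `\uDxxx` escape form remains available.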

commentid: 6928
comment_count: 3
who: Peter F. Patel-Schneider
bug_when: 2013-12-11 09:57:04 -0800

Comment 2 is quite astonishing to me.

I had thought that the idea of ECMA-404 was to have a useful, stable definition of JSON as it is used to interchange data. I thus expected that ECMA-404 would permit natural transmittal between senders and receivers that use different languages or toolsets.

If ECMA-404 has to be subsetted so that it can be reliably used for data exchange then why would it get any use outside of the ECMAScript community?
