#2371 — JSON cannot be transmitted using any Unicode encoding form


According to ECMA-404, 1st edition / October 2013, a JSON text is a sequence
of Unicode code points. In particular, the code point sequence <0022,
DEAD, 0022> is a valid JSON text.

However, this code point sequence cannot be represented in UTF-8, UTF-16, or
UTF-32, as it is not a sequence of Unicode scalar values, and Unicode
encoding forms are only defined on Unicode scalar values.
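As an illustration of the point above, here is a minimal Python sketch. It assumes CPython 3.4 or later, whose `utf-8`, `utf-16`, and `utf-32` codecs refuse to encode surrogate code points; the helper name `can_encode` is made up for this example.

```python
# The code point sequence <0022, DEAD, 0022>:
# a quoted JSON string holding one unpaired surrogate.
json_text = '"\udead"'


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be serialized with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        # Raised because U+DEAD is a surrogate, not a Unicode scalar value.
        return False


results = {enc: can_encode(json_text, enc) for enc in ("utf-8", "utf-16", "utf-32")}
print(results)
```

All three entries should come out `False`, while the same check on a string made only of scalar values, such as `'"\u00e9"'`, succeeds.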

As JSON is a text format designed to facilitate data interchange, this is a
bug that should be fixed.


If ECMA-404 defined JSON text (and JSON strings) as a sequence of Unicode scalar values it would work well.

Additional rules mandating specific support for unpaired surrogates (as 16-bit code units and as \uDxxx escapes) could be put in ECMAScript.


(In reply to comment #1)
> If ECMA-404 defined JSON text (and JSON strings) as a sequence of Unicode
> scalar values it would work well.
>
> Additional rules mandating specific support for unpaired surrogates (as 16-bit
> code units and as \uDxxx escapes) could be put in ECMAScript.

That would force ECMA-262 to define an extended JSON grammar, and that is something we are trying to avoid. We want to be able to just reference the normative grammar in ECMA-404.

ECMAScript/JavaScript is far from being the only language that has a string data type that allows the encoding of unpaired surrogates. Embedded JSON parsers for those languages all have to deal with them, one way or another.

It's better for ECMA-404 to define this rather than each language doing its own thing.
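To make "deal with them, one way or another" concrete, here is how one such embedded parser behaves. This is only a sketch of what CPython's standard `json` module does with an escaped unpaired surrogate; other languages and libraries may reject the input or substitute U+FFFD instead.

```python
import json

# A JSON text whose string value is the single escape \uDEAD.
parsed = json.loads('"\\uDEAD"')

# CPython keeps the unpaired surrogate as one code point in the str.
code_points = [hex(ord(ch)) for ch in parsed]

# Serializing with the default ensure_ascii=True writes it back as an escape.
round_tripped = json.dumps(parsed)

print(code_points, round_tripped)
```

The value survives a parse/serialize round trip inside the process, but only because it is written as an escape; the resulting `str` still cannot be encoded directly as UTF-8.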

Specifications such as 4627bis that define encodings used to transmit/interchange JSON texts are free to require the use of a subset of Unicode code points: for example, only scalar values or, at an extreme, only ASCII values.
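A rough sketch of what such restrictions could look like in practice, in Python. The function names `is_scalar_only` and `to_ascii_json` are hypothetical and do not come from ECMA-404 or 4627bis; the first checks a single string only, not nested values.

```python
import json


def is_scalar_only(s: str) -> bool:
    """True if no code point in `s` is a surrogate (U+D800..U+DFFF)."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def to_ascii_json(value) -> str:
    """Serialize with every non-ASCII code point written as a \\uXXXX escape."""
    return json.dumps(value, ensure_ascii=True)


print(is_scalar_only("caf\u00e9"), is_scalar_only("\udead"))
print(to_ascii_json("caf\u00e9"))
```

A scalar-values-only profile would reject the second string before transmission, whereas an ASCII-only profile changes the output form rather than the set of strings that can be represented.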

Basically, the ECMA-404 approach is to start with a more general specification of the JSON format, with the expectation that other specifications will specialize it with restrictions.


Comment 2 is quite astonishing to me.

I had thought that the idea of ECMA-404 was to have a useful, stable definition of JSON as it is used to interchange data. I thus expected that ECMA-404 would permit natural transmittal between senders and receivers that use different languages or toolsets.

If ECMA-404 has to be subsetted before it can be reliably used for data exchange, then why would it get any use outside of the ECMAScript community?