#610 — Web harness breaks test cases with supplementary characters

The following test case contains a Unicode supplementary character as the first character of the first string:

It looks like those characters break Bugzilla too, as this bug report appears to be truncated :-/

Can you articulate the issue without characters this tool can't handle (or just email me directly)? Thanks.

The characters in the Unicode supplementary planes are those at code points U+10000 or higher, i.e. outside the Basic Multilingual Plane (BMP).
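For illustration, here is a small sketch of how such a code point maps to a UTF-16 surrogate pair, which is how JavaScript strings store it. The helper name `toSurrogatePair` is hypothetical and not part of the harness.

```javascript
// Hypothetical helper: split a supplementary code point into its
// UTF-16 surrogate pair (high surrogate first, then low surrogate).
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point: " + codePoint);
  }
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

// U+1D306 occupies two UTF-16 code units in a JavaScript string.
var pair = toSurrogatePair(0x1D306);
var text = String.fromCharCode(pair[0], pair[1]);
```

Here `pair` is `[0xD834, 0xDF06]` and `text.length` is 2, even though the string holds a single character.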

The issue with Bugzilla itself is likely due to what is discussed here:

Actually, I've reproduced the issue. If I put a character outside the BMP (Unicode code point > 0xFFFF) inside a test case, running the test fails with "Uncaught SyntaxError: Unexpected token ILLEGAL". It looks like we're not encoding surrogate pairs properly in the JSON files. I'll dig into it.

Looks like a bug in jquery.base64.js, which can't handle UTF-8 encodings longer than 3 bytes (required for code points above U+FFFF). I've made a fix and sent the patch to Norbert to verify.
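A minimal sketch of what handling four-byte sequences involves. This is not the actual patch to jquery.base64.js; the function name `utf8Decode` is hypothetical, and the sketch assumes well-formed input and performs no validation.

```javascript
// Hypothetical sketch, not the actual jquery.base64.js patch.
// Decodes an array of UTF-8 byte values into a JavaScript string.
// Assumes well-formed input; no validation is performed.
function utf8Decode(bytes) {
  var out = "";
  var i = 0;
  while (i < bytes.length) {
    var b0 = bytes[i];
    var cp;
    if (b0 < 0x80) {            // 1 byte: U+0000..U+007F
      cp = b0;
      i += 1;
    } else if (b0 < 0xE0) {     // 2 bytes: U+0080..U+07FF
      cp = ((b0 & 0x1F) << 6) | (bytes[i + 1] & 0x3F);
      i += 2;
    } else if (b0 < 0xF0) {     // 3 bytes: U+0800..U+FFFF
      cp = ((b0 & 0x0F) << 12) |
           ((bytes[i + 1] & 0x3F) << 6) |
           (bytes[i + 2] & 0x3F);
      i += 3;
    } else {                    // 4 bytes: U+10000..U+10FFFF
      cp = ((b0 & 0x07) << 18) |
           ((bytes[i + 1] & 0x3F) << 12) |
           ((bytes[i + 2] & 0x3F) << 6) |
           (bytes[i + 3] & 0x3F);
      i += 4;
    }
    if (cp > 0xFFFF) {
      // Supplementary code points need a surrogate pair in a JS string.
      cp -= 0x10000;
      out += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    } else {
      out += String.fromCharCode(cp);
    }
  }
  return out;
}

// "A" followed by U+1D306, whose UTF-8 form is F0 9D 8C 86.
var decoded = utf8Decode([0x41, 0xF0, 0x9D, 0x8C, 0x86]);
```

A decoder that stops at the three-byte branch would misread the lead byte 0xF0 and produce garbage, which matches the symptom described above.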

I've added the test file for posterity.

The data loss in Bugzilla is apparently caused by this bug:

Bill's fix addresses the immediate problem, but isn't fully compliant with the UTF-8 spec (Unicode 6.1, section 3.9, pages 94-97). Not all byte sequences are allowed in UTF-8: code points end at U+10FFFF, so sequences representing higher values are illegal, and "overlong" sequences, which encode a code point using more bytes than necessary, are also illegal. Accepting illegal sequences can be a security issue, so a parser should throw exceptions for them, or at least replace them with U+FFFD.
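As a rough sketch of what such checks could look like (not the committed code; `utf8DecodeStrict` is a hypothetical name), the decoder below throws on overlong sequences and on values above U+10FFFF. It also rejects encoded surrogates and malformed continuation bytes, which the comment above doesn't mention but which the UTF-8 definition also rules out.

```javascript
// Hypothetical sketch of a strict UTF-8 decoder; not the committed code.
// Returns an array of code points and throws on ill-formed input.
function utf8DecodeStrict(bytes) {
  var codePoints = [];
  var i = 0;
  while (i < bytes.length) {
    var b0 = bytes[i];
    var length, cp, min; // min = smallest code point legal for this length
    if (b0 < 0x80)                 { length = 1; cp = b0;        min = 0x0; }
    else if ((b0 & 0xE0) === 0xC0) { length = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) === 0xE0) { length = 3; cp = b0 & 0x0F; min = 0x800; }
    else if ((b0 & 0xF8) === 0xF0) { length = 4; cp = b0 & 0x07; min = 0x10000; }
    else { throw new Error("invalid lead byte at index " + i); }

    if (i + length > bytes.length) {
      throw new Error("truncated sequence at index " + i);
    }
    for (var k = 1; k < length; k++) {
      var b = bytes[i + k];
      if ((b & 0xC0) !== 0x80) {
        throw new Error("invalid continuation byte at index " + (i + k));
      }
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min) throw new Error("overlong sequence at index " + i);
    if (cp > 0x10FFFF) throw new Error("code point above U+10FFFF at index " + i);
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      throw new Error("surrogate code point at index " + i);
    }
    codePoints.push(cp);
    i += length;
  }
  return codePoints;
}
```

For example, `utf8DecodeStrict([0xC0, 0x80])` throws because it is an overlong encoding of U+0000, and `utf8DecodeStrict([0xF4, 0x90, 0x80, 0x80])` throws because it would decode to U+110000. A more lenient variant could push 0xFFFD instead of throwing.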

The one thing we definitely need is a limitation of code points to U+10FFFF.

Committed Bill's fix with an additional check for valid code points.