#156 — Spec is confusing on what constitutes unicode whitespace


+++ This bug was initially created as a clone of Bug #123 +++

Section 7.2 of ES5.1 contains Table 2, which lists all valid white space characters. Note that the Unicode character \u0085 does not fall under any category in this table. \u0085 is "NEXT LINE" and is referred to generically as a "line break" character in the Unicode 3.0 standard; see ftp://ftp.unicode.org/Public/3.0-Update/LineBreak-5.txt. Also note that Unicode 3.0 is the version of the standard that ES5.1 aligns with as far as character encodings are concerned.

Section 7.3 of ES5.1 contains Table 3, which lists all valid line terminator characters. "NEXT LINE" does not fall under any category in this table either. 7.3 then goes on to state that:
Only the characters in Table 3 are treated as line
terminators. Other new line or line breaking characters
are treated as white space but not as line terminators.
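
To make the two tables concrete, here is a minimal sketch of a classifier built from the code points listed in Table 2 and Table 3 of ES5.1. The helper name `classify` and the two set names are hypothetical and exist only for illustration. The `\p{Zs}` check uses whatever Unicode version the running engine ships, not Unicode 3.0, so membership in category Zs may differ slightly from what ES5.1 intended.

```javascript
// Explicit code points from ES5.1 Table 2 (white space):
// TAB, VT, FF, SP, NBSP, BOM.
const WHITE_SPACE = new Set([0x0009, 0x000B, 0x000C, 0x0020, 0x00A0, 0xFEFF]);

// Code points from ES5.1 Table 3 (line terminators): LF, CR, LS, PS.
const LINE_TERMINATORS = new Set([0x000A, 0x000D, 0x2028, 0x2029]);

// Hypothetical helper: classify one character the way the two tables do.
function classify(ch) {
  const cp = ch.codePointAt(0);
  if (LINE_TERMINATORS.has(cp)) return "line terminator";
  // Table 2 also admits any other character in Unicode category Zs.
  if (WHITE_SPACE.has(cp) || /\p{Zs}/u.test(ch)) return "white space";
  return "neither";
}

console.log(classify("\u0085")); // "neither" (NEL is category Cc, not Zs)
console.log(classify("\u2028")); // "line terminator"
console.log(classify("\u00A0")); // "white space"
```

Under this reading of the tables, \u0085 lands in neither group, which is the gap the quoted sentence from 7.3 leaves open.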


Based on this information, a reasonable person would infer that \u0085, a "line breaking" character as defined by Unicode 3.0, is treated as white space by conforming implementations. In fact, the JavaScript community documented exactly this reading at http://en.wikipedia.org/wiki/Newline:
ECMAScript[5] accepts LS and PS as line breaks, but
considers U+0085 (NEL) white space, not a line break.
This also confused the author of several test262 tests, who had assumed \u0085 was considered white space.
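
The mismatch can be observed from script. The following is a rough sketch, assuming an engine that follows Table 2 and Table 3 as written; the variable names are hypothetical. It probes three places where white space matters: `String.prototype.trim`, the `\s` character class, and the tokenizer (via `new Function`, which may be unavailable in environments that forbid dynamic code evaluation).

```javascript
const NEL = "\u0085";

// trim() strips WhiteSpace and LineTerminator characters only.
const trimStripsNel = (NEL + "x" + NEL).trim() === "x";

// \s matches the same WhiteSpace and LineTerminator sets.
const regexMatchesNel = /\s/.test(NEL);

// If NEL were white space, "var<NEL>x;" would parse like "var x;".
let parsesAsSeparator;
try {
  new Function("var" + NEL + "x;");
  parsesAsSeparator = true;
} catch (e) {
  parsesAsSeparator = false; // SyntaxError: NEL is an illegal token character
}

console.log(trimStripsNel);     // false
console.log(regexMatchesNel);   // false
console.log(parsesAsSeparator); // false
```

All three probes report that \u0085 is not handled as white space, contrary to what the sentence in 7.3 suggests.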


ES5.2/ES6 should clarify this situation by changing the text in section 7.3 to be something like:
Only the characters in Table 3 are treated as line
terminators. Other new line or line breaking characters
are treated as white space but not as line terminators. Note
however that being treated as white space in this case is not
the same as being added to Table 2.


fixed in rev23 draft

added note clarifying that a Unicode whitespace is not necessarily an ES whitespace