#967 — 15.5.4.25 codePointAt usability issue

bug_id: 967
creation_ts: 2012-11-14 01:31:00 -0800
short_desc: 15.5.4.25 codePointAt usability issue
delta_ts: 2014-04-18 12:54:29 -0700
product: Draft for 6th Edition
component: technical issue
version: Rev 11: October 26, 2012 Draft
rep_platform: All
op_sys: All
bug_status: RESOLVED
resolution: WONTFIX
priority: Normal
bug_severity: enhancement
everconfirmed: true
reporter: Roger Andrews
assigned_to: Allen Wirfs-Brock
cc: ["ecmascriptbugs", "roger.andrews"]

commentid: 2466
comment_count: 0
who: Roger Andrews
bug_when: 2012-11-14 01:31:13 -0800

The definition of codePointAt has results:
out-of-bounds -> Undefined
normal BMP char -> the codepoint
lead surrogate of a good pair -> the codepoint
trail surrogate of a good pair -> codeunit in [0xDC00:0xDFFF] !!ambiguous
bad trail surrogate -> codeunit in [0xDC00:0xDFFF]
bad lead surrogate -> codeunit in [0xD800:0xDBFF]

Note that a well-paired trail surrogate still results in a value even though the previous codeunit "subsumed" it. So, if the caller is indexing down a string then it should take the well-paired trail surrogate value out of the sequence.
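
For illustration, the cases listed above can be checked directly in an engine that implements codePointAt as drafted. This is only a sketch: the sample strings and the `samples` object are chosen here for the example and are not part of the report. U+1F600 is used because it is encoded as the surrogate pair D83D DE00.

```javascript
// Sample inputs are assumptions made for this illustration.
const pair = "\uD83D\uDE00"; // U+1F600 as a well-formed surrogate pair

const samples = {
  outOfBounds: "a".codePointAt(5),      // undefined
  bmp:         "a".codePointAt(0),      // 0x61
  leadOfPair:  pair.codePointAt(0),     // 0x1F600, the whole code point
  trailOfPair: pair.codePointAt(1),     // 0xDE00, same value as a lone trail
  loneTrail:   "\uDE00".codePointAt(0), // 0xDE00
  loneLead:    "\uD83D".codePointAt(0), // 0xD83D
};
```

The `trailOfPair` and `loneTrail` results are identical, which is the ambiguity flagged in the list.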

UTF-16 experts can write code to check these possibilities; but for general usability let's have:
Undefined for the trail surrogate of a good pair, and
NaN for bad surrogate.

Then codePointAt would do the work for the casual user and experts can probe the string with charCodeAt (or codeUnitAt if it exists) if they really want to know the situation of bad surrogates.

========================
If the definition stays unchanged, users are called upon to write messy code patterns like this:

// if the indexed position is part of a well-formed surrogate pair
// then result is either the entire code-point (for lead surrogates)
// or undefined (for trail surrogates)
// result is NaN for bad surrogates
// (result is always undefined for out-of-bounds position)

cp = str.charPointAt( pos );
if (0xDC00 <= cp && cp <= 0xDFFF) {
    cu = str.charCodeAt( pos-1 );
    if (0xD800 <= cu && cu <= 0xDBFF) {
        cp = undefined; // trail surrogate of good pair
    }
}
if (0xD800 <= cp && cp <= 0xDFFF) {
    cp = NaN; // bad surrogate
}
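
As a rough sketch, the pattern above can be wrapped in a function so the proposed results are easier to try out. The names `proposedCodePointAt` and `codePoints` are hypothetical and appear in no draft; the sketch assumes an engine that already provides codePointAt with the behaviour described in this report, and it spells the method name correctly.

```javascript
// Hypothetical helper: the name and signature are assumptions for illustration.
function proposedCodePointAt(str, pos) {
  const cp = str.codePointAt(pos);
  if (cp === undefined) {
    return undefined; // out-of-bounds position
  }
  if (0xDC00 <= cp && cp <= 0xDFFF) {
    // charCodeAt(-1) yields NaN, so the range test fails when pos is 0.
    const cu = str.charCodeAt(pos - 1);
    if (0xD800 <= cu && cu <= 0xDBFF) {
      return undefined; // trail surrogate of a good pair
    }
  }
  if (0xD800 <= cp && cp <= 0xDFFF) {
    return NaN; // bad (unpaired) surrogate
  }
  return cp; // BMP character, or the full code point of a good pair
}

// Hypothetical usage: index down a string and skip subsumed trail surrogates.
function codePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const cp = proposedCodePointAt(str, i);
    if (cp !== undefined) {
      out.push(cp);
    }
  }
  return out;
}
```

With this, a caller indexing down "a\uD83D\uDE00b" collects 0x61, 0x1F600 and 0x62 without handling the trail surrogate itself, and an unpaired surrogate shows up as NaN.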

commentid: 2467
comment_count: 1
who: Roger Andrews
bug_when: 2012-11-14 01:43:13 -0800

(Typo in my example code above: for 'charPointAt' read 'codePointAt')

commentid: 2909
comment_count: 2
who: Norbert
bug_when: 2012-11-29 13:46:56 -0800

See discussion at
https://mail.mozilla.org/pipermail/es-discuss/2012-November/thread.html#26340

commentid: 7822
comment_count: 3
who: Allen Wirfs-Brock
bug_when: 2014-04-18 12:54:29 -0700

It's time to put ES6 to bed. Norbert made a good response to this proposal and nobody has further championed these changes within TC39, so at this point in time it doesn't look like we are going to make further ES6 changes in this area.

Proposals are being made for post-ES6 features (see https://github.com/tc39/ecma262), so you may want to consider re-proposing some of the additional String functions.
