#3145 — 21.2.2.8.2: clarify whether this affects `\w` and `\W` or not


https://people.mozilla.org/~jorendorff/es6-draft.html#sec-runtime-semantics-canonicalize-abstract-operation

“In case-insignificant matches when Unicode is true, all characters are implicitly case-folded using the simple mapping provided by the Unicode standard immediately before they are compared. The simple mapping always maps to a single code point, so it does not map, for example, "ß" (U+00DF) to "SS". It may however map a code point outside the Basic Latin range to a character within, for example, “ſ” (U+017F) to “s”. Such characters are not mapped if Unicode is false. This prevents Unicode code points such as U+017F and U+212A from matching regular expressions such as /[a‑z]/i, but they will match /[a‑z]/ui.”
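The difference the note describes can be checked directly in any engine that supports the `u` flag. The following is a minimal sketch; the variable names are illustrative and not part of the spec text.

```javascript
// U+017F LATIN SMALL LETTER LONG S and U+212A KELVIN SIGN
const longS = '\u017F';
const kelvin = '\u212A';

// Without `u`, these code points are not mapped into the Basic Latin range.
const withoutU = [longS, kelvin].map((ch) => /[a-z]/i.test(ch));

// With `u`, simple case folding maps them to "s" and "k" before comparison.
const withU = [longS, kelvin].map((ch) => /[a-z]/iu.test(ch));

// The simple mapping never expands one code point into several,
// so U+00DF is not treated as "SS".
const sharpSMatchesSS = /^\u00DF$/iu.test('SS');

console.log({ withoutU, withU, sharpSMatchesSS });
```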

Does this case-folding apply before or after evaluating `CharacterClassEscape`s as per https://people.mozilla.org/~jorendorff/es6-draft.html#sec-characterclassescape?

This matters because of `\w` and `\W`. `/\w/` is equivalent to `/[0-9A-Z_a-z]/`. Is `/\w/iu` the same, or should it match U+017F and U+212A too? In case of the latter, does that mean `/\W/iu` must not match U+017F and U+212A?


Note that this is only an informative note, so the answer to your question needs to come from the normative algorithms.

In this case, in https://people.mozilla.org/~jorendorff/es6-draft.html#sec-atomescape the last production evaluates CharacterClassEscape to produce the CharSet. There is no /u sensitivity specified for w or W.

However, the CharacterSet matcher that is subsequently produced applies Canonicalize both to the members of that character set and to the characters being matched. Canonicalize is where the case folding takes place (depending upon the i flag). So, /\w/iu will match U+017F because U+017F canonicalizes to 'S' which is in the match set. /\W/iu also matches U+017F (and strangely 'S') because \W includes U+017F which matches either canonicalized U+017F or 'S'.
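The reasoning in this comment can be modelled in a few lines. This is only a toy model of the draft algorithm as described here, not what any particular engine does: `FOLD`, `canonicalize`, `matchesSet` and `universe` are hypothetical names, and the fold table covers only the characters under discussion.

```javascript
// Hypothetical, deliberately tiny fold table. The comment writes the canonical
// form as 'S'; for the outcome it only matters that 'S', 's' and U+017F share
// one canonical form (likewise 'K', 'k' and U+212A).
const FOLD = new Map([['S', 's'], ['\u017F', 's'], ['K', 'k'], ['\u212A', 'k']]);
const canonicalize = (ch) => (FOLD.has(ch) ? FOLD.get(ch) : ch);

// \w evaluated to a plain CharSet, with no /u or /i sensitivity.
const isWordChar = (ch) => /^[0-9A-Za-z_]$/.test(ch);
const isNonWordChar = (ch) => !isWordChar(ch);

// Model of the matcher: a character matches if some member of the set has the
// same canonical form as the character being matched.
function matchesSet(inSet, candidates, ch) {
  const target = canonicalize(ch);
  return candidates.some((a) => inSet(a) && canonicalize(a) === target);
}

// A small universe of candidate set members is enough for the illustration.
const universe = ['S', 's', '\u017F', 'K', 'k', '\u212A', '!', '_'];

const wordMatchesLongS = matchesSet(isWordChar, universe, '\u017F');
const nonWordMatchesLongS = matchesSet(isNonWordChar, universe, '\u017F');
const nonWordMatchesS = matchesSet(isNonWordChar, universe, 'S');
const nonWordMatchesUnderscore = matchesSet(isNonWordChar, universe, '_');

console.log({ wordMatchesLongS, nonWordMatchesLongS, nonWordMatchesS, nonWordMatchesUnderscore });
```

In this model both the `\w` set and the `\W` set accept U+017F, and the `\W` set also accepts 'S', because U+017F is a member of `\W` and shares a canonical form with 'S'.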


Norbert, is this what you actually intended for /\W/iu?


Another related question:

`/[a-z]/iu` is equivalent to `/[a-z\u017F\u212A]/i`. Is `/[A-Z]/iu` the same, or should that be equivalent to just `/[A-Z]/i` instead (since it’s the lowercase `s` and `k` that case-fold to U+017F and U+212A, and not the uppercase `S` and `K` symbols)?

I’m guessing it’s the latter, but it’s slightly confusing either way.


(In reply to Mathias Bynens from comment #2)
> Another related question:
>
> `/[a-z]/iu` is equivalent to `/[a-z\u017F\u212A]/i`. Is `/[A-Z]/iu` the
> same, or should that be equivalent to just `/[A-Z]/i` instead (since it’s
> the lowercase `s` and `k` that case-fold to U+017F and U+212A, and not the
> uppercase `S` and `K` symbols)?
>
> I’m guessing it’s the latter, but it’s slightly confusing either way.

/[A-Z]/iu will match U+017F (and U+212A) because case folding is applied to both the pattern and the string being matched. So, U+017F in the match string turns into 'S'.

I think what is happening is clear enough at the algorithm level. I agree it is challenging to craft a good conceptual explanation. You probably need to emphasize that under /iu Unicode case folding takes place before any comparisons and that on the pattern side it is applied to match sets rather than the actual pattern text.

For example, /[\u017F-\u0181]/iu is equivalent to /[S\u0180\u0181]/u rather than /[S-\u0181]/u


(In reply to Allen Wirfs-Brock from comment #3)
> (In reply to Mathias Bynens from comment #2)
> > Another related question:
> >
> > `/[a-z]/iu` is equivalent to `/[a-z\u017F\u212A]/i`. Is `/[A-Z]/iu` the
> > same, or should that be equivalent to just `/[A-Z]/i` instead (since it’s
> > the lowercase `s` and `k` that case-fold to U+017F and U+212A, and not the
> > uppercase `S` and `K` symbols)?
> >
> > I’m guessing it’s the latter, but it’s slightly confusing either way.
>
> /[A-Z]/iu will match U+017F (and U+212A) because case folding is applied to
> both the pattern and the string being matched. So, U+017F in the match
> string turns into 'S'.

Thanks for clarifying. What do you think would be the best way to approach this for an ES6-to-ES5 transpiler? It’s tempting to rewrite both `/[a-z]/iu` and `/[A-Z]/iu` into `/[a-z\u017F\u212A]/i`, which should behave exactly the same in ES5 environments (unless I’m missing something).
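One way to test the proposed rewrite is to compare the three patterns over every BMP code unit. This is a rough sketch rather than a proof of equivalence; it anchors the patterns so that each test looks at exactly one character, and the variable names are illustrative.

```javascript
const es6Lower = /^[a-z]$/iu;
const es6Upper = /^[A-Z]$/iu;
const es5Rewrite = /^[a-z\u017F\u212A]$/i; // no `u` flag, so valid in ES5

// Collect every BMP code unit on which the rewrite disagrees with either
// ES6 pattern. Astral code points are skipped: without `u` they are seen as
// two code units, so a single-character class cannot match them anyway.
const mismatches = [];
for (let cu = 0; cu <= 0xFFFF; cu++) {
  const ch = String.fromCharCode(cu);
  const expected = es6Lower.test(ch);
  if (es6Upper.test(ch) !== expected || es5Rewrite.test(ch) !== expected) {
    mismatches.push(cu.toString(16));
  }
}

console.log(mismatches.length === 0 ? 'no differences found' : mismatches);
```

This only covers a single character class on its own; it says nothing about patterns where the class interacts with other parts of the regular expression.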

> I think what is happening is clear enough at the algorithm level. I agree it
> is challenging to craft a good conceptual explanation. You probably need to
> emphasize that under /iu Unicode case folding takes place before any
> comparisons and that on the pattern side it is applied to match sets rather
> than the actual pattern text.
>
> For example, /[\u017F-\u0181]/iu is equivalent to /[S\u0180\u0181]/u rather
> than /[S-\u0181]/u

Shouldn’t that be `/[S\u017F-\u0181]/` (with or without `u` flag) instead? IIUC, `/[S\u0180\u0181]/u` wouldn’t match `\u017F`.
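The candidate patterns in this exchange can be probed side by side. A small sketch follows, with illustrative variable names; it tests a handful of characters only and does not establish full equivalence between any two patterns.

```javascript
const ranged = /^[\u017F-\u0181]$/iu;     // the original /iu pattern
const enumerated = /^[S\u0180\u0181]$/u;  // first proposed equivalent
const wideRange = /^[S-\u0181]$/u;        // what folding the pattern text would give
const withLongS = /^[S\u017F-\u0181]$/u;  // proposed correction

// Format a character as U+XXXX for display.
const label = (ch) =>
  'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');

const probes = ['S', 'T', '\u017F', '\u0180'];
const results = probes.map((ch) => ({
  ch: label(ch),
  ranged: ranged.test(ch),
  enumerated: enumerated.test(ch),
  wideRange: wideRange.test(ch),
  withLongS: withLongS.test(ch),
}));

console.log(results);
```

'T' separates the folded match set from the range `S-\u0181`, and U+017F separates the enumerated form from the form that keeps U+017F in the class.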


(In reply to Mathias Bynens from comment #4)
>
> Thanks for clarifying. What do you think would be the best way to approach
> this for an ES6-to-ES5 transpiler? It’s tempting to rewrite both `/[a-z]/iu`
> and `/[A-Z]/iu` into `/[a-z\u017F\u212A]/i`, which should behave exactly the
> same in ES5 environments (unless I’m missing something).

But see line 5 of the ES5 Canonicalize algorithm (15.10.2.8) and line 3.h of ES6 21.2.2.8.2.

I think your general problem is with /S/iu: in order to correctly match U+017F in ES5, the pattern needs to be transformed into /[S\u017F]/i. Basically, when transpiling a /iu pattern, all implicit or explicit occurrences of S in the pattern need to be translated to [S\u017F].
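A rough sketch of that translation for a single literal character is shown below. `EXTRA` and `expandLiteral` are hypothetical names, the table lists only the two cases raised in this thread, and implicit occurrences (ranges, `\w`) are not handled.

```javascript
// Hypothetical table: code points that /iu folds together with an ASCII
// letter but that ES5 /i keeps apart.
const EXTRA = new Map([['s', '\u017F'], ['k', '\u212A']]);

// Hypothetical helper: expand one literal letter into an ES5 character class.
// Assumes `ch` is a single character that is not a regex metacharacter.
function expandLiteral(ch) {
  const extra = EXTRA.get(ch.toLowerCase());
  return extra ? '[' + ch + extra + ']' : ch;
}

const plainEs5 = /^S$/i;                                               // ES5, untranslated
const es6 = /^S$/iu;                                                   // ES6 original
const translated = new RegExp('^' + expandLiteral('S') + '$', 'i');   // ES5, translated

console.log(
  plainEs5.test('\u017F'),   // untranslated ES5 pattern against U+017F
  es6.test('\u017F'),        // ES6 pattern against U+017F
  translated.test('\u017F')  // translated ES5 pattern against U+017F
);
```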

>
> > I think what is happening is clear enough at the algorithm level. I agree it
> > is challenging to craft a good conceptual explanation. You probably need to
> > emphasize that under /iu Unicode case folding takes place before any
> > comparisons and that on the pattern side it is applied to match sets rather
> > than the actual pattern text.
> >
> > For example, /[\u017F-\u0181]/iu is equivalent to /[S\u0180\u0181]/u rather
> > than /[S-\u0181]/u
>
> Shouldn’t that be `/[S\u017F-\u0181]/` (with or without `u` flag) instead?
> IIUC, `/[S\u0180\u0181]/u` wouldn’t match `\u017F`.

Probably. It's tricky to try to map from /i to no /i in this manner. You will need to verify that none of the match algorithms will produce a wrong result after your translation.


Is there any real-life use case where this matters, or are we just talking about test262? In the examples so far, if you don't want Unicode case folding you can just remove the "i" flag and replace a-z with a-zA-Z.


(In reply to Norbert from comment #6)
> Is there any real-life use case where this matters, or are we just talking
> about test262? In the examples so far, if you don't want Unicode case
> folding you can just remove the "i" flag and replace a-z with a-zA-Z.

The reason I asked for clarification is because I wanted to confirm my transpiler for ES6 `u` regexps (https://mths.be/regexpu) was working correctly.


Marking as wontfix because the specified behavior seems technically correct in context and nobody has come up with anything better.