#3157 — Reserve `\p{}` and `\P{}` within `/u` RegExp patterns

bug_id: 3157
creation_ts: 2014-08-26 10:53:00 -0700
short_desc: Reserve `\p{}` and `\P{}` within `/u` RegExp patterns
delta_ts: 2014-10-17 05:47:33 -0700
product: Draft for 6th Edition
component: technical issue
version: Rev 27: August 24, 2014 Draft
rep_platform: All
op_sys: All
bug_status: RESOLVED
resolution: FIXED
priority: Normal
bug_severity: enhancement
everconfirmed: true
reporter: Mathias Bynens
assigned_to: Allen Wirfs-Brock
cc: ["andrebargull", "claude.pache", "ecmascriptbugs", "mathias"]

commentid: 9992
comment_count: 0
who: Mathias Bynens
bug_when: 2014-08-26 10:53:54 -0700

Reserve the syntax `\p{…}` and `\P{…}` within `/u` RegExp patterns. https://mail.mozilla.org/pipermail/es-discuss/2014-August/039033.html

commentid: 9996
comment_count: 1
who: Claude Pache
bug_when: 2014-08-26 18:43:54 -0700

We may also want to reserve \X for "grapheme cluster", for example.

More generally, one should disallow interpreting \<char> as <char>, where <char> is one of 0-9, A-Z, a-z, with a view to attaching more useful meaning to these sequences later.

commentid: 9997
comment_count: 2
who: Mathias Bynens
bug_when: 2014-08-27 00:55:54 -0700

+1 to Claude’s proposal.

Another example: in addition to the standard notation e.g. `\p{L}`, Java, Perl, and PCRE allow you to use the shorthand `\pL`. The shorthand only works with single-letter Unicode properties. `\pLl` is not the equivalent of `\p{Ll}`. It is the equivalent of `\p{L}l` which matches `Al` or `àl` or any Unicode letter followed by a literal `l`.

I’m not saying we should support this in ECMAScript but it’d be nice to keep our options open. For that, we’d have to do what Claude suggested and reserve `\p…` and `\P…` in addition to `\p{…}` and `\P{…}`.
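To make the difference between `\p{L}l` and `\p{Ll}` concrete, here is a minimal sketch. It assumes an engine that implements `\p{…}` property escapes under the `u` flag, which is a later addition and not part of the draft discussed in this bug. The helper name `compiles` is hypothetical.

```javascript
// Hypothetical helper: does the pattern source compile with the given flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Braced form: any Unicode letter followed by a literal "l".
const letterThenL = new RegExp("^\\p{L}l$", "u");
const matchesAl = letterThenL.test("Al");
const matchesAgraveL = letterThenL.test("\u00E0l"); // "àl", precomposed

// `\p{Ll}` is a different pattern: exactly one lowercase letter.
const lowercaseOnly = new RegExp("^\\p{Ll}$", "u");
const lowerMatchesA = lowercaseOnly.test("a");
const lowerMatchesAl = lowercaseOnly.test("Al");

// The brace-less shorthand `\pL` is not accepted under the u flag.
const shorthandCompiles = compiles("\\pL", "u");
```

Because the brace-less form fails to compile rather than silently matching a literal `p`, it stays available for a future meaning.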

commentid: 9998
comment_count: 3
who: Mathias Bynens
bug_when: 2014-08-27 01:00:49 -0700

(In reply to Claude Pache from comment #1)
> We may also want to reserve \X for "grapheme cluster", for example.
>
> More generally, one should disallow interpreting \<char> as <char>, where
> <char> is one of 0-9, A-Z, a-z, with a view to attaching more useful
> meaning to these sequences later.

`\` followed by `0` already has special meaning (it’s equivalent to `\x00`) and `\` followed by a digit from `1` to `9` is already used for back-references. Just A-Z & a-z sounds good.
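A short sketch of the two existing meanings mentioned here, for illustration only. The variable names are made up for this example.

```javascript
// `\0` matches the NUL character, the same as `\x00`.
const nulViaZero = /^\0$/.test("\x00");
const nulViaHex = /^\x00$/.test("\x00");

// `\1` refers back to whatever the first capture group matched.
const doubled = /^(a|b)\1$/;
const aaMatches = doubled.test("aa"); // group matched "a", then "a" again
const abMatches = doubled.test("ab"); // group matched "a", but "b" follows
```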

commentid: 9999
comment_count: 4
who: Norbert
bug_when: 2014-08-27 11:17:49 -0700

This was already decided by TC39 at the March 2012 meeting, and if I read the spec correctly, it’s already specified:

IdentityEscape[U] ::
[+U] SyntaxCharacter
[~U] SourceCharacter but not IdentifierPart
[~U] <ZWJ>
[~U] <ZWNJ>

https://mail.mozilla.org/pipermail/es-discuss/2012-March/021919.html
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-patterns

commentid: 10000
comment_count: 5
who: Claude Pache
bug_when: 2014-08-28 06:15:25 -0700

(In reply to Norbert from comment #4)
>
> IdentityEscape[U] ::
> [+U] SyntaxCharacter
> [~U] SourceCharacter but not IdentifierPart
> [~U] <ZWJ>
> [~U] <ZWNJ>

Yes, that defines what implementations must accept, but it doesn't define what implementations must reject.

For instance, the sequence \p is not (and has never been) part of the specced syntax of regular expressions, because `p` is included in IdentifierPart, which is excluded from IdentityEscape. However, most (all?) implementations extend the syntax and treat \p as a synonym of a literal `p`.

In fact, it is absolutely fine to keep the old ES5.1 definition, namely:

IdentityEscape ::
SourceCharacter but not IdentifierPart
<ZWJ>
<ZWNJ>

because digits and letters are not part of IdentityEscape, and it is all we need. It is even better to revert to that definition, because otherwise it would create an *unnecessary* discrepancy between u- and non-u-regexps.

What is needed is to explicitly forbid implementations from extending the syntax with identity escape sequences other than those specced. Because of backward-compatibility (BC) constraints, we could require that only when the u flag is set.
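The behaviour being proposed can be sketched as follows. This assumes an engine with the web-compatibility extensions (as in browsers and Node.js) that also enforces the restriction under the `u` flag. The variable names are illustrative.

```javascript
// Without the u flag, web-compatible engines accept `\p` as a literal "p".
const legacy = new RegExp("^\\p$");
const legacyMatchesP = legacy.test("p");

// With the u flag, the same source is expected to be rejected outright,
// which keeps `\p` free for a future meaning.
let unicodeError = null;
try {
  new RegExp("^\\p$", "u");
} catch (e) {
  unicodeError = e;
}
const rejectedUnderU = unicodeError instanceof SyntaxError;
```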

commentid: 10001
comment_count: 6
who: André Bargull
bug_when: 2014-08-28 09:00:13 -0700

(In reply to Claude Pache from comment #5)
> For instance, the sequence \p is not (and has never been) part of the
> specced syntax of regular expressions, because `p` is included in
> IdentifierPart, which is excluded from IdentityEscape. However, most (all?)
> implementations extend the syntax and treat \p as a synonym of a literal `p`.

Engines are generally required to implement "B.1.4 Regular Expressions Patterns" instead of "21.2.1 Patterns" because of interoperability reasons, and B.1.4 allows \p.

>
> In fact, it is absolutely fine to keep the old ES5.1 definition, namely:
>
> IdentityEscape ::
> SourceCharacter but not IdentifierPart
> <ZWJ>
> <ZWNJ>
>
> because digits and letters are not part of IdentityEscape, and it is all we
> need. It is even better to revert to that definition, because otherwise it
> would create an *unnecessary* discrepancy between u- and non-u-regexps.

Note that IdentifierPart includes $, so strictly speaking /\$/ is not a valid regular expression according to 21.2.1. To properly escape $, you either need to use character classes or unicode-/hex-escape sequences. On my todo list is an item to request changing IdentityEscape to:

IdentityEscape ::
[+U] SyntaxCharacter
[~U] SourceCharacter but not UnicodeIDContinue or _
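For illustration, the alternatives mentioned above for matching a literal `$` look like this. The last line relies on the direct escape `\$`, which engines accept in practice even though the 21.2.1 grammar as quoted does not strictly allow it. All names are illustrative.

```javascript
// Character class: `$` has no special meaning inside brackets.
const viaCharacterClass = /^[$]$/.test("$");

// Hex and Unicode escapes for U+0024 DOLLAR SIGN.
const viaHexEscape = /^\x24$/.test("$");
const viaUnicodeEscape = /^\u0024$/.test("$");

// Direct identity escape, the form most code uses.
const viaIdentityEscape = /^\$$/.test("$");
```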

>
> What is needed is to explicitly forbid implementations from extending the
> syntax with identity escape sequences other than those specced. Because of
> backward-compatibility (BC) constraints, we could require that only when
> the u flag is set.

Sounds good, hopefully implementations adhere to this restriction. :)

commentid: 10005
comment_count: 7
who: Norbert
bug_when: 2014-08-28 18:12:17 -0700

(In reply to André Bargull from comment #6)

> Engines are generally required to implement "B.1.4 Regular Expressions
> Patterns" instead of "21.2.1 Patterns" because of interoperability reasons,
> and B.1.4 allows \p.

Where does it do that? Our intent was certainly that, with the "u" flag set, it would not.

> Sounds good, hopefully implementations adhere to this restriction. :)

Hope is good, conformance test cases are better.

commentid: 10008
comment_count: 8
who: André Bargull
bug_when: 2014-08-29 04:22:15 -0700

(In reply to Norbert from comment #7)
> Where does it do that? Our intent was certainly that, with the "u" flag set,
> it would not.

Are any of the web compatibility extensions allowed for Unicode regular expressions at all? I thought Unicode mode means no web extensions, so my comment in #6 implied non-Unicode mode. I should have made that more clear, sorry!

commentid: 10016
comment_count: 9
who: Allen Wirfs-Brock
bug_when: 2014-08-29 13:13:01 -0700

fixed in rev28 editor's draft

Added a 16.1 restriction forbidding extensions of IdentifyEscape that include a-z and A-Z for /u patterns.

Added text to B.1.4 that clarifies that the Annex B extensions don't change the syntax or semantics of Unicode RegExps.
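One way to check an engine against this restriction is to probe every ASCII letter escape, in the spirit of the conformance tests asked for in comment 7. This is a rough sketch, not an official test. The function name `rejectedLetterEscapes` is hypothetical, and the exact set of rejected letters depends on which escapes the engine defines.

```javascript
// Hypothetical helper: list the ASCII letters for which `\<letter>` fails
// to compile as a complete pattern with the given flags.
function rejectedLetterEscapes(flags) {
  const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
  const rejected = [];
  for (const letter of letters) {
    try {
      new RegExp("\\" + letter, flags);
    } catch (e) {
      if (!(e instanceof SyntaxError)) throw e;
      rejected.push(letter);
    }
  }
  return rejected;
}

// Letters with no defined meaning (such as "a" or "X") should be listed
// only when the u flag is set; defined escapes such as `\d` never are.
const rejectedWithU = rejectedLetterEscapes("u");
const rejectedWithoutU = rejectedLetterEscapes("");
```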

commentid: 10462
comment_count: 10
who: Allen Wirfs-Brock
bug_when: 2014-10-14 15:17:58 -0700

fixed in rev28

commentid: 10517
comment_count: 11
who: Mathias Bynens
bug_when: 2014-10-17 05:47:33 -0700

(In reply to Allen Wirfs-Brock from comment #10)
> fixed in rev28

Looks like there’s a typo: `IdentifyEscape` (should be `IdentityEscape`).
