#3157 — Reserve `\p{}` and `\P{}` within `/u` RegExp patterns


Reserve the syntax `\p{…}` and `\P{…}` within `/u` RegExp patterns. https://mail.mozilla.org/pipermail/es-discuss/2014-August/039033.html


We may also want to reserve \X for "grapheme cluster", for example.

More generally, one should disallow to interpret \<char> as <char>, where <char> is one of 0-9, A-Z, a-z, at the prospect to attach more useful meaning to these sequences.


+1 to Claude’s proposal.

Another example: in addition to the standard notation e.g. `\p{L}`, Java, Perl, and PCRE allow you to use the shorthand `\pL`. The shorthand only works with single-letter Unicode properties. `\pLl` is not the equivalent of `\p{Ll}`. It is the equivalent of `\p{L}l` which matches `Al` or `àl` or any Unicode letter followed by a literal `l`.
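
For comparison, here is a minimal JavaScript sketch of what the braced form `\p{L}l` matches. It assumes an engine that supports Unicode property escapes under the `u` flag, which was standardized after this thread (ES2018). The shorthand `\pL` is not shown because ECMAScript does not accept it. The name `letterThenL` is illustrative only.

```javascript
// Assumes an ES2018+ engine with Unicode property escapes under /u.
// `letterThenL` is a hypothetical name used only for this illustration.
const letterThenL = /^\p{L}l$/u;

// "\u00E0l" is a precomposed "à" followed by "l".
const samples = ["Al", "\u00E0l", "1l", "Ll"];

// Any Unicode letter followed by a literal "l" matches; a digit does not.
const results = samples.map((s) => letterThenL.test(s));
console.log(results);
```

Note that `Ll` matches here only because `L` is itself a letter followed by `l`, not because of the `Ll` (lowercase letter) category.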

I’m not saying we should support this in ECMAScript but it’d be nice to keep our options open. For that, we’d have to do what Claude suggested and reserve `\p…` and `\P…` in addition to `\p{…}` and `\P{…}`.


(In reply to Claude Pache from comment #1)
> We may also want to reserve \X for "grapheme cluster", for example.
>
> More generally, one should disallow to interpret \<char> as <char>, where
> <char> is one of 0-9, A-Z, a-z, at the prospect to attach more useful
> meaning to these sequences.

`\` followed by `0` already has special meaning (it’s equivalent to `\x00`) and `\` followed by a digit from `1` to `9` is already used for back-references. Just A-Z & a-z sounds good.
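
A short sketch of the two existing digit meanings mentioned above, using only standard RegExp syntax; the variable names are illustrative.

```javascript
// \0 (not followed by another digit) matches U+0000, the same as \x00.
const nul = String.fromCharCode(0);
const zeroMatchesNul = /^\0$/.test(nul);
const hexMatchesNul = /^\x00$/.test(nul);

// \1 refers back to whatever the first capture group matched.
const doubled = /^(ab)\1$/;
const backrefHit = doubled.test("abab");  // group captured "ab", repeated
const backrefMiss = doubled.test("abba"); // second half differs

console.log(zeroMatchesNul, hexMatchesNul, backrefHit, backrefMiss);
```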


This was already decided by TC39 at the March 2012 meeting, and if I read the spec correctly, it’s already specified:

IdentityEscape[U] ::
[+U] SyntaxCharacter
[~U] SourceCharacter but not IdentifierPart
[~U] <ZWJ>
[~U] <ZWNJ>

https://mail.mozilla.org/pipermail/es-discuss/2012-March/021919.html
http://people.mozilla.org/~jorendorff/es6-draft.html#sec-patterns


(In reply to Norbert from comment #4)
>
> IdentityEscape[U] ::
> [+U] SyntaxCharacter
> [~U] SourceCharacter but not IdentifierPart
> [~U] <ZWJ>
> [~U] <ZWNJ>

Yes, that defines what implementations must accept, but that doesn't define what implementations don't accept.

For instance, the sequence \p is not (and has never been) part of the specced syntax of regular expression: for `p` is included in IdentifierPart, which is excluded from IdentityEscape. However, most (all?) implementations extend the syntax and treat \p as a synonym of a literal `p`.

In fact, it is absolutely fine to keep the old ES5.1 definition, namely:

IdentityEscape ::
SourceCharacter but not IdentifierPart
<ZWJ>
<ZWNJ>

because digits and letters are not part of IdentityEscape, and it is all we need. It is even better to revert to that definition, because otherwise it would create an *unnecessary* discrepancy between u- and non-u-regexps.

What is needed, is to explicitly forbid implementations to extend the syntax by including other identity sequences than those specced. Because of BC constraints, we could require that only when the u-flag is set.
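
The difference between the two modes can be sketched as follows. This reflects how current engines behave now that the restriction is in place, and is not a claim about engines at the time of this comment. The helper `compilesWithU` is a hypothetical name.

```javascript
// Non-Unicode mode: the web-compat extension treats \p as a literal "p".
const legacy = new RegExp("\\p");
const legacyMatchesP = legacy.test("p");

// Hypothetical helper: does this pattern source compile with the u flag?
function compilesWithU(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e; // anything other than a SyntaxError is unexpected
  }
}

// Unicode mode: a bare letter escape with no defined meaning is rejected.
const barePCompiles = compilesWithU("\\p");
const bareXCompiles = compilesWithU("\\X");

console.log(legacyMatchesP, barePCompiles, bareXCompiles);
```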


(In reply to Claude Pache from comment #5)
> For instance, the sequence \p is not (and has never been) part of the
> specced syntax of regular expression: for `p` is included in IdentifierPart,
> which is excluded from IdentityEscape. However, most (all?) implementations
> extend the syntax and treat \p as a synonym of a literal `p`.

Engines are generally required to implement "B.1.4 Regular Expressions Patterns" instead of "21.2.1 Patterns" because of interoperability reasons, and B.1.4 allows \p.

>
> In fact, it is absolutely fine to keep the old ES5.1 definition, namely:
>
> IdentityEscape ::
> SourceCharacter but not IdentifierPart
> <ZWJ>
> <ZWNJ>
>
> because digits and letters are not part of IdentityEscape, and it is all we
> need. It is even better to revert to that definition, because otherwise it
> would create an *unnecessary* discrepancy between u- and non-u-regexps.

Note that IdentifierPart includes $, so strictly speaking /\$/ is not a valid regular expression according to 21.2.1. To properly escape $, you either need to use character classes or unicode-/hex-escape sequences. On my todo list is an item to request changing IdentityEscape to:

IdentityEscape ::
[+U] SyntaxCharacter
[~U] SourceCharacter but not UnicodeIDContinue or _
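
As a sketch of the alternatives mentioned above for matching a literal `$`: a character class, a hex escape, and a Unicode escape. The last two lines show `\$` itself, which engines accept in practice and which is valid under the `u` flag because `$` is a SyntaxCharacter. Variable names are illustrative.

```javascript
// Ways to match a literal "$" without relying on \$ as an identity escape.
const viaClass = /^[$]$/.test("$");
const viaHex = /^\x24$/.test("$");
const viaUnicode = /^\u0024$/.test("$");

// The identity escape itself, without and with the u flag.
const viaIdentityEscape = /^\$$/.test("$");
const viaIdentityEscapeU = /^\$$/u.test("$");

console.log(viaClass, viaHex, viaUnicode, viaIdentityEscape, viaIdentityEscapeU);
```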

>
> What is needed, is to explicitly forbid implementations to extend the syntax
> by including other identity sequences than those specced. Because of BC
> constraints, we could require that only when the u-flag is set.

Sounds good, hopefully implementations adhere to this restriction. :)


(In reply to André Bargull from comment #6)

> Engines are generally required to implement "B.1.4 Regular Expressions
> Patterns" instead of "21.2.1 Patterns" because of interoperability reasons,
> and B.1.4 allows \p.

Where does it do that? Our intent was certainly that, with the "u" flag set, it would not.

> Sounds good, hopefully implementations adhere to this restriction. :)

Hope is good, conformance test cases are better.
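
A conformance-style check along those lines might look like the sketch below. It is not taken from any existing test suite. The `DEFINED` set is an assumption based on the current specification: it lists the letters whose escape is complete on its own under `/u` (others, such as `\c`, `\x`, `\u`, `\k`, `\p`, need further characters). The names `DEFINED`, `compilesWithU`, and `violations` are hypothetical.

```javascript
// Assumption: letters whose single-character escape is valid by itself under /u.
const DEFINED = new Set("bBdDsSwWfnrtv");

// Hypothetical helper: does this pattern source compile with the u flag?
function compilesWithU(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Collect A-Z and a-z.
const letters = [];
for (let c = 65; c <= 90; c++) {
  letters.push(String.fromCharCode(c), String.fromCharCode(c + 32));
}

// A violation is any letter whose escape compiles when it should not,
// or fails to compile when it should.
const violations = letters.filter(
  (ch) => compilesWithU("\\" + ch) !== DEFINED.has(ch)
);
console.log(violations);
```

On an engine that enforces the restriction, `violations` is expected to be empty.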


(In reply to Norbert from comment #7)
> Where does it do that? Our intent was certainly that, with the "u" flag set,
> it would not.

Are any of the web compatibility extensions allowed for Unicode regular expressions at all? I thought Unicode mode means no web extensions, so my comment in #6 implied non-Unicode mode. I should have made that clearer, sorry!


fixed in rev28 editor's draft

Added a 16.1 restriction forbidding extending IdentifyEscape to include a-z and A-Z for /u patterns.

Added text to B.1.4 that clarifies that the Annex B extensions don't change the syntax or semantics of Unicode RegExps.


fixed in rev28


(In reply to Allen Wirfs-Brock from comment #10)
> fixed in rev28

Looks like there’s a typo: `IdentifyEscape` (should be `IdentityEscape`).