Stage 1 Draft / December 4, 2021

Regular Expression '\R' Escape for ECMAScript

Introduction

Forthcoming

See the proposal repository for background material and discussion.

1 Text Processing

1.1 RegExp (Regular Expression) Objects

A RegExp object contains a regular expression and the associated flags.

Note

The form and functionality of regular expressions is modelled after the regular expression facility in the Perl 5 programming language.

1.1.1 Patterns

The RegExp constructor applies the following grammar to the input pattern String. An error occurs if the grammar cannot interpret the String as an expansion of Pattern.

Syntax

Pattern[UnicodeMode, N] ::
  Disjunction[?UnicodeMode, ?N]

Disjunction[UnicodeMode, N] ::
  Alternative[?UnicodeMode, ?N]
  Alternative[?UnicodeMode, ?N] | Disjunction[?UnicodeMode, ?N]

Alternative[UnicodeMode, N] ::
  [empty]
  Alternative[?UnicodeMode, ?N] Term[?UnicodeMode, ?N]

Term[UnicodeMode, N] ::
  Assertion[?UnicodeMode, ?N]
  Atom[?UnicodeMode, ?N]
  Atom[?UnicodeMode, ?N] Quantifier

Assertion[UnicodeMode, N] ::
  ^
  $
  \ b
  \ B
  ( ? = Disjunction[?UnicodeMode, ?N] )
  ( ? ! Disjunction[?UnicodeMode, ?N] )
  ( ? <= Disjunction[?UnicodeMode, ?N] )
  ( ? <! Disjunction[?UnicodeMode, ?N] )

Quantifier ::
  QuantifierPrefix
  QuantifierPrefix ?

QuantifierPrefix ::
  *
  +
  ?
  { DecimalDigits[~Sep] }
  { DecimalDigits[~Sep] , }
  { DecimalDigits[~Sep] , DecimalDigits[~Sep] }

Atom[UnicodeMode, N] ::
  PatternCharacter
  .
  \ AtomEscape[?UnicodeMode, ?N]
  CharacterClass[?UnicodeMode]
  ( GroupSpecifier[?UnicodeMode] Disjunction[?UnicodeMode, ?N] )
  ( ? : Disjunction[?UnicodeMode, ?N] )

SyntaxCharacter :: one of
  ^ $ \ . * + ? ( ) [ ] { } |

PatternCharacter ::
  SourceCharacter but not SyntaxCharacter

AtomEscape[UnicodeMode, N] ::
  DecimalEscape
  [+UnicodeMode] R
  CharacterClassEscape[?UnicodeMode]
  CharacterEscape[?UnicodeMode]
  [+N] k GroupName[?UnicodeMode]

CharacterEscape[UnicodeMode] ::
  ControlEscape
  c ControlLetter
  0 [lookahead ∉ DecimalDigit]
  HexEscapeSequence
  RegExpUnicodeEscapeSequence[?UnicodeMode]
  IdentityEscape[?UnicodeMode]

ControlEscape :: one of
  f n r t v

ControlLetter :: one of
  a b c d e f g h i j k l m n o p q r s t u v w x y z
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

GroupSpecifier[UnicodeMode] ::
  [empty]
  ? GroupName[?UnicodeMode]

GroupName[UnicodeMode] ::
  < RegExpIdentifierName[?UnicodeMode] >

RegExpIdentifierName[UnicodeMode] ::
  RegExpIdentifierStart[?UnicodeMode]
  RegExpIdentifierName[?UnicodeMode] RegExpIdentifierPart[?UnicodeMode]

RegExpIdentifierStart[UnicodeMode] ::
  IdentifierStartChar
  \ RegExpUnicodeEscapeSequence[+UnicodeMode]
  [~UnicodeMode] UnicodeLeadSurrogate UnicodeTrailSurrogate

RegExpIdentifierPart[UnicodeMode] ::
  IdentifierPartChar
  \ RegExpUnicodeEscapeSequence[+UnicodeMode]
  [~UnicodeMode] UnicodeLeadSurrogate UnicodeTrailSurrogate

RegExpUnicodeEscapeSequence[UnicodeMode] ::
  [+UnicodeMode] u HexLeadSurrogate \u HexTrailSurrogate
  [+UnicodeMode] u HexLeadSurrogate
  [+UnicodeMode] u HexTrailSurrogate
  [+UnicodeMode] u HexNonSurrogate
  [~UnicodeMode] u Hex4Digits
  [+UnicodeMode] u{ CodePoint }

UnicodeLeadSurrogate ::
  any Unicode code point in the inclusive range 0xD800 to 0xDBFF

UnicodeTrailSurrogate ::
  any Unicode code point in the inclusive range 0xDC00 to 0xDFFF

Each \u HexTrailSurrogate for which the choice of associated u HexLeadSurrogate is ambiguous shall be associated with the nearest possible u HexLeadSurrogate that would otherwise have no corresponding \u HexTrailSurrogate.

HexLeadSurrogate ::
  Hex4Digits but only if the MV of Hex4Digits is in the inclusive range 0xD800 to 0xDBFF

HexTrailSurrogate ::
  Hex4Digits but only if the MV of Hex4Digits is in the inclusive range 0xDC00 to 0xDFFF

HexNonSurrogate ::
  Hex4Digits but only if the MV of Hex4Digits is not in the inclusive range 0xD800 to 0xDFFF

IdentityEscape[UnicodeMode] ::
  [+UnicodeMode] SyntaxCharacter
  [+UnicodeMode] /
  [~UnicodeMode] SourceCharacter but not UnicodeIDContinue

DecimalEscape ::
  NonZeroDigit DecimalDigits[~Sep]opt [lookahead ∉ DecimalDigit]

CharacterClassEscape[UnicodeMode] ::
  d
  D
  s
  S
  w
  W
  [+UnicodeMode] p{ UnicodePropertyValueExpression }
  [+UnicodeMode] P{ UnicodePropertyValueExpression }

UnicodePropertyValueExpression ::
  UnicodePropertyName = UnicodePropertyValue
  LoneUnicodePropertyNameOrValue

UnicodePropertyName ::
  UnicodePropertyNameCharacters

UnicodePropertyNameCharacters ::
  UnicodePropertyNameCharacter UnicodePropertyNameCharactersopt

UnicodePropertyValue ::
  UnicodePropertyValueCharacters

LoneUnicodePropertyNameOrValue ::
  UnicodePropertyValueCharacters

UnicodePropertyValueCharacters ::
  UnicodePropertyValueCharacter UnicodePropertyValueCharactersopt

UnicodePropertyValueCharacter ::
  UnicodePropertyNameCharacter
  DecimalDigit

UnicodePropertyNameCharacter ::
  ControlLetter
  _

CharacterClass[UnicodeMode] ::
  [ [lookahead ≠ ^] ClassRanges[?UnicodeMode] ]
  [ ^ ClassRanges[?UnicodeMode] ]

ClassRanges[UnicodeMode] ::
  [empty]
  NonemptyClassRanges[?UnicodeMode]

NonemptyClassRanges[UnicodeMode] ::
  ClassAtom[?UnicodeMode]
  ClassAtom[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode]
  ClassAtom[?UnicodeMode] - ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode]

NonemptyClassRangesNoDash[UnicodeMode] ::
  ClassAtom[?UnicodeMode]
  ClassAtomNoDash[?UnicodeMode] NonemptyClassRangesNoDash[?UnicodeMode]
  ClassAtomNoDash[?UnicodeMode] - ClassAtom[?UnicodeMode] ClassRanges[?UnicodeMode]

ClassAtom[UnicodeMode] ::
  -
  ClassAtomNoDash[?UnicodeMode]

ClassAtomNoDash[UnicodeMode] ::
  SourceCharacter but not one of \ or ] or -
  \ ClassEscape[?UnicodeMode]

ClassEscape[UnicodeMode] ::
  b
  [+UnicodeMode] -
  CharacterClassEscape[?UnicodeMode]
  CharacterEscape[?UnicodeMode]

Note

A number of productions in this section are given alternative definitions in section A.1.1.
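
For orientation, the set of sequences accepted by the [+UnicodeMode] R alternative of AtomEscape can be approximated with syntax that engines already accept. The following is only a sketch and is not part of this proposal: the names LINE_BREAK and splitLines are hypothetical, <NL> is assumed to denote U+0085, and, unlike the matcher specified in 1.1.2.1, the alternation can backtrack from <CR><LF> to a lone <CR>.

```javascript
// Hypothetical emulation of \R with existing syntax (an approximation only).
// The two-character sequence <CR><LF> is listed first so it is preferred
// over a lone <CR>. U+0085 for <NL> is an assumption.
const LINE_BREAK = /\r\n|[\n\v\f\r\x85\u2028\u2029]/u;

// Hypothetical helper: split text on any of the recognised line breaks.
function splitLines(text) {
  return text.split(LINE_BREAK);
}

// Logs the four segments "a", "b", "c" and "d".
console.log(splitLines("a\r\nb\u2028c\nd"));
```

Listing <CR><LF> before the character class keeps a Windows-style line ending from being split into two separate breaks.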

1.1.2 Pattern Semantics

1.1.2.1 Runtime Semantics: CompileAtom

The syntax-directed operation CompileAtom takes argument direction (forward or backward). It returns a Matcher.

AtomEscape :: R
  1. Return a new Matcher with parameters (x, c) that captures direction and performs the following steps when called:
    1. Assert: x is a State.
    2. Assert: c is a Continuation.
    3. Let e be x's endIndex.
    4. If direction is forward, let f be e + 1.
    5. Else, let f be e - 1.
    6. If f < 0 or f > InputLength, return failure.
    7. Let index be min(e, f).
    8. Let ch be the character Input[index].
    9. Let cc be Canonicalize(ch).
    10. Let A be a CharSet containing the characters <LF>, <VT>, <FF>, <CR>, <NL>, <LS>, and <PS>.
    11. If there does not exist a member a of A such that Canonicalize(a) is cc, return failure.
    12. If direction is forward and cc is the character <CR> and index + 1 < InputLength, then
      1. Let nextCh be the character Input[index + 1].
      2. Let nextCc be Canonicalize(nextCh).
      3. If nextCc is the character <LF>, set f to f + 1.
    13. Else, if direction is backward and cc is the character <LF> and index - 1 ≥ 0, then
      1. Let prevCh be the character Input[index - 1].
      2. Let prevCc be Canonicalize(prevCh).
      3. If prevCc is the character <CR>, set f to f - 1.
    14. Let cap be x's captures List.
    15. Let y be the State(f, cap).
    16. Return c(y).
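
The steps above can be sketched in ECMAScript as follows. This is an illustration under stated assumptions, not normative text: matchLineBreak and LINE_BREAKS are hypothetical names, the continuation c and the captures List are omitted, failure is represented as -1, Input is indexed by code unit, <NL> is assumed to be U+0085, and Canonicalize is treated as the identity because none of these characters is assumed to have a case mapping. The backward branch accepts index - 1 equal to 0 so that a <CR><LF> pair at the very start of the input is consumed as one unit.

```javascript
// CharSet A from step 10. U+0085 for <NL> is an assumption.
const LINE_BREAKS = new Set(["\n", "\v", "\f", "\r", "\x85", "\u2028", "\u2029"]);

// Hypothetical helper: returns the new endIndex, or -1 for failure.
function matchLineBreak(input, e, direction) {
  // Steps 4-5: move one position in the matching direction.
  let f = direction === "forward" ? e + 1 : e - 1;
  // Step 6: fail when the new position is outside the input.
  if (f < 0 || f > input.length) return -1;
  // Steps 7-8: the character being examined.
  const index = Math.min(e, f);
  const ch = input[index];
  // Steps 10-11: it must be one of the line-break characters.
  if (!LINE_BREAKS.has(ch)) return -1;
  if (direction === "forward" && ch === "\r" && index + 1 < input.length) {
    // Step 12: consume a following <LF> together with the <CR>.
    if (input[index + 1] === "\n") f += 1;
  } else if (direction === "backward" && ch === "\n" && index - 1 >= 0) {
    // Step 13: consume a preceding <CR> together with the <LF>.
    if (input[index - 1] === "\r") f -= 1;
  }
  // Steps 14-16 would build State(f, cap) and call the continuation.
  return f;
}

console.log(matchLineBreak("a\r\nb", 1, "forward"));  // 3
console.log(matchLineBreak("a\r\nb", 3, "backward")); // 1
console.log(matchLineBreak("ab", 0, "forward"));      // -1
```

Because f is adjusted before the continuation would be called, the <CR><LF> pair is consumed as a unit and the sketch never retries with the lone <CR>.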

A Additional ECMAScript Features for Web Browsers

A.1 Additional Syntax

A.1.1 Regular Expressions Patterns

The syntax of 1.1.1 is modified and extended as follows. These changes introduce ambiguities that are broken by the ordering of grammar productions and by contextual information. When parsing using the following grammar, each alternative is considered only if previous production alternatives do not match.

This alternative pattern grammar and semantics only changes the syntax and semantics of BMP patterns. The following grammar extensions include productions parameterized with the [UnicodeMode] parameter. However, none of these extensions change the syntax of Unicode patterns recognized when parsing with the [UnicodeMode] parameter present on the goal symbol.

Syntax

Term[UnicodeMode, N] ::
  [+UnicodeMode] Assertion[+UnicodeMode, ?N]
  [+UnicodeMode] Atom[+UnicodeMode, ?N] Quantifier
  [+UnicodeMode] Atom[+UnicodeMode, ?N]
  [~UnicodeMode] QuantifiableAssertion[?N] Quantifier
  [~UnicodeMode] Assertion[~UnicodeMode, ?N]
  [~UnicodeMode] ExtendedAtom[?N] Quantifier
  [~UnicodeMode] ExtendedAtom[?N]

Assertion[UnicodeMode, N] ::
  ^
  $
  \ b
  \ B
  [+UnicodeMode] ( ? = Disjunction[+UnicodeMode, ?N] )
  [+UnicodeMode] ( ? ! Disjunction[+UnicodeMode, ?N] )
  [~UnicodeMode] QuantifiableAssertion[?N]
  ( ? <= Disjunction[?UnicodeMode, ?N] )
  ( ? <! Disjunction[?UnicodeMode, ?N] )

QuantifiableAssertion[N] ::
  ( ? = Disjunction[~UnicodeMode, ?N] )
  ( ? ! Disjunction[~UnicodeMode, ?N] )

ExtendedAtom[N] ::
  .
  \ AtomEscape[~UnicodeMode, ?N]
  \ [lookahead = c]
  CharacterClass[~UnicodeMode]
  ( Disjunction[~UnicodeMode, ?N] )
  ( ? : Disjunction[~UnicodeMode, ?N] )
  InvalidBracedQuantifier
  ExtendedPatternCharacter

InvalidBracedQuantifier ::
  { DecimalDigits[~Sep] }
  { DecimalDigits[~Sep] , }
  { DecimalDigits[~Sep] , DecimalDigits[~Sep] }

ExtendedPatternCharacter ::
  SourceCharacter but not one of ^ $ \ . * + ? ( ) [ |

AtomEscape[UnicodeMode, N] ::
  [+UnicodeMode] DecimalEscape
  [~UnicodeMode] DecimalEscape but only if the CapturingGroupNumber of DecimalEscape is ≤ NcapturingParens
  [+UnicodeMode] R
  CharacterClassEscape[?UnicodeMode]
  CharacterEscape[?UnicodeMode, ?N]
  [+N] k GroupName[?UnicodeMode]

CharacterEscape[UnicodeMode, N] ::
  ControlEscape
  c ControlLetter
  0 [lookahead ∉ DecimalDigit]
  HexEscapeSequence
  RegExpUnicodeEscapeSequence[?UnicodeMode]
  [~UnicodeMode] LegacyOctalEscapeSequence
  IdentityEscape[?UnicodeMode, ?N]

IdentityEscape[UnicodeMode, N] ::
  [+UnicodeMode] SyntaxCharacter
  [+UnicodeMode] /
  [~UnicodeMode] SourceCharacterIdentityEscape[?N]

SourceCharacterIdentityEscape[N] ::
  [~N] SourceCharacter but not c
  [+N] SourceCharacter but not one of c or k

ClassAtomNoDash[UnicodeMode, N] ::
  SourceCharacter but not one of \ or ] or -
  \ ClassEscape[?UnicodeMode, ?N]
  \ [lookahead = c]

ClassEscape[UnicodeMode, N] ::
  b
  [+UnicodeMode] -
  [~UnicodeMode] c ClassControlLetter
  CharacterClassEscape[?UnicodeMode]
  CharacterEscape[?UnicodeMode, ?N]

ClassControlLetter ::
  DecimalDigit
  _

Note

When the same left-hand side occurs with both [+UnicodeMode] and [~UnicodeMode] guards, it is to control the disambiguation priority.
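
Because the R alternative of AtomEscape is guarded by [+UnicodeMode], a non-Unicode pattern containing \R still falls through to IdentityEscape under this annex grammar. The snippet below illustrates that existing behaviour. It assumes a host that implements these web-compatibility extensions, and the variable names are hypothetical. Whether the u-flag pattern is accepted depends on whether the engine implements this proposal, so that result is only recorded and no value is claimed for it.

```javascript
// Without the u flag, \R is an identity escape and matches the letter "R".
const legacy = new RegExp("\\R");
console.log(legacy.test("R"));  // true
console.log(legacy.test("\n")); // false

// With the u flag, engines that lack this proposal reject \R.
// Under the proposal the same pattern would compile to the line-break matcher.
let unicodeModeAccepted = true;
try {
  new RegExp("\\R", "u");
} catch (err) {
  unicodeModeAccepted = false;
}
console.log(unicodeModeAccepted);
```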

B Copyright & Software License

Copyright Notice

© 2021 Ron Buckton, Ecma International

Software License

All Software contained in this document ("Software") is protected by copyright and is being made available under the "BSD License", included below. This Software may be subject to third party rights (rights from parties other than Ecma International), including patent rights, and no licenses under such third party rights are granted under this license even if the third party concerned is a member of Ecma International. SEE THE ECMA CODE OF CONDUCT IN PATENT MATTERS AVAILABLE AT https://ecma-international.org/memento/codeofconduct.htm FOR INFORMATION REGARDING THE LICENSING OF PATENT CLAIMS THAT ARE REQUIRED TO IMPLEMENT ECMA INTERNATIONAL STANDARDS.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. Neither the name of the authors nor Ecma International may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE ECMA INTERNATIONAL "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL ECMA INTERNATIONAL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.