Stage 3 Draft / December 12, 2017

Named capture groups in regular expressions

1 Patterns (#sec-patterns)

Syntax

Atom[U, N] ::
  PatternCharacter
  .
  \ AtomEscape[?U, ?N]
  CharacterClass[?U, ?N]
  ( GroupSpecifier Disjunction[?U, ?N] )
  ( ? : Disjunction[?U, ?N] )

AtomEscape[U, N] ::
  DecimalEscape
  CharacterClassEscape
  CharacterEscape[?U]
  [+N] k GroupName[?U]

GroupSpecifier[U] ::
  [empty]
  ? GroupName[?U]

GroupName[U] ::
  < RegExpIdentifierName[?U] >

RegExpIdentifierName[U] ::
  RegExpIdentifierStart[?U]
  RegExpIdentifierName[?U] RegExpIdentifierPart[?U]

RegExpIdentifierStart[U] ::
  UnicodeIDStart
  $
  _
  \ RegExpUnicodeEscapeSequence[?U]

RegExpIdentifierPart[U] ::
  UnicodeIDContinue
  $
  _
  \ RegExpUnicodeEscapeSequence[?U]
  <ZWNJ>
  <ZWJ>
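As an illustration of the grammar above, the following sketch probes a few patterns in an engine that implements this proposal. The helper name `parses` and the sample patterns are assumptions made for the example; they are not part of the specification text.

```javascript
// Returns true when the engine accepts `source` as a RegExp pattern,
// false when it reports a SyntaxError.
function parses(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// GroupSpecifier :: ? GroupName, with GroupName :: < RegExpIdentifierName >
const named = parses("(?<year>\\d{4})");

// RegExpIdentifierStart and RegExpIdentifierPart both allow $ and _
const dollarUnderscore = parses("(?<$_id9>x)");

// A decimal digit is not a RegExpIdentifierStart, so this name is rejected
const digitFirst = parses("(?<9lives>x)");

// GroupSpecifier :: [empty] is an ordinary capturing group
const unnamed = parses("(x)");
```

In such an engine, `named`, `dollarUnderscore`, and `unnamed` are expected to be true, and `digitFirst` false.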

2 Static Semantics: Early Errors (#sec-patterns-static-semantics-early-errors)

Pattern :: Disjunction

AtomEscape[U] :: [+N] k GroupName

RegExpIdentifierStart[U] :: \ RegExpUnicodeEscapeSequence[?U]

RegExpIdentifierPart[U] :: \ RegExpUnicodeEscapeSequence[?U]
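The productions above carry early errors. The sketch below shows two cases that engines implementing this proposal are understood to reject at construction time: two groups with the same name in one sequence, and a `\k<name>` reference to a name that no group defines. The helper `hasEarlyError` is hypothetical, and the two conditions are stated here as an assumption about engine behaviour rather than as a quotation of the rules.

```javascript
// Returns true when constructing the RegExp throws a SyntaxError.
function hasEarlyError(source, flags = "") {
  try {
    new RegExp(source, flags);
    return false;
  } catch (err) {
    if (err instanceof SyntaxError) return true;
    throw err;
  }
}

// Two GroupSpecifiers with the same StringValue in one sequence
const duplicateName = hasEarlyError("(?<a>x)(?<a>y)");

// \k<b> names a group that does not exist in the pattern
const danglingReference = hasEarlyError("(?<a>x)\\k<b>");

// \k<a> names an existing group, so no error is reported
const validReference = hasEarlyError("(?<a>x)\\k<a>");
```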

3 Static Semantics: StringValue

RegExpIdentifierName[U] ::
  RegExpIdentifierStart[?U]
  RegExpIdentifierName[?U] RegExpIdentifierPart[?U]
  1. Return the String value consisting of the sequence of code units corresponding to RegExpIdentifierName. In determining the sequence any occurrences of \ RegExpUnicodeEscapeSequence are first replaced with the code point represented by the RegExpUnicodeEscapeSequence and then the code points of the entire RegExpIdentifierName are converted to code units by UTF16Encoding each code point.
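The StringValue step can be sketched roughly as follows. The function name `regExpIdentifierStringValue` is hypothetical. The sketch assumes its input is the already-parsed source text of a RegExpIdentifierName, and it does not check that the escapes are valid for the current [U] setting or that the resulting code points are legal identifier characters.

```javascript
// Replace each \uXXXX or \u{X...} escape with the code point it represents,
// then rely on JavaScript strings being UTF-16 to obtain the code units.
function regExpIdentifierStringValue(nameSource) {
  const escape = /\\u\{([0-9A-Fa-f]+)\}|\\u([0-9A-Fa-f]{4})/g;
  return nameSource.replace(escape, (whole, braced, fixed) =>
    String.fromCodePoint(parseInt(braced || fixed, 16))
  );
}

const plain = regExpIdentifierStringValue("year");        // no escapes to replace
const escaped = regExpIdentifierStringValue("a\\u0062c"); // \u0062 is "b"
const astral = regExpIdentifierStringValue("\\u{1D4D0}"); // encoded as a surrogate pair
```

Because the comparison in later sections is made on StringValue, a group written as `a\u0062c` and a reference written as `abc` name the same group.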

4 Runtime Semantics: BackreferenceMatcher Abstract Operation

The abstract operation BackreferenceMatcher takes one argument, an integer n, and performs the following steps:

  1. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
    1. Let cap be x's captures List.
    2. Let s be cap[n].
    3. If s is undefined, return c(x).
    4. Let e be x's endIndex.
    5. Let len be the number of elements in s.
    6. Let f be e+len.
    7. If f>InputLength, return failure.
    8. If there exists an integer i between 0 (inclusive) and len (exclusive) such that Canonicalize(s[i]) is not the same character value as Canonicalize(Input[e+i]), return failure.
    9. Let y be the State (f, cap).
    10. Call c(y) and return its result.
Note
This abstract operation is extracted from the runtime semantics of AtomEscape :: DecimalEscape, and when this text is integrated into the main specification, it would be called from there as well.
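The steps of BackreferenceMatcher can be sketched in JavaScript as below. This is a minimal model under stated assumptions, not engine code: a State is a plain object `{ endIndex, captures }`, `captures` is an array indexed from 1 (slot 0 unused), Input is passed in explicitly as an array of characters, `FAILURE` is a hypothetical sentinel, and `canonicalize` defaults to the identity function (the case-sensitive behaviour).

```javascript
const FAILURE = Symbol("failure");

function backreferenceMatcher(n, input, canonicalize = (ch) => ch) {
  // Step 1: return a Matcher closure taking a State x and a Continuation c.
  return function matcher(x, c) {
    const cap = x.captures;                   // 1.1
    const s = cap[n];                         // 1.2
    if (s === undefined) return c(x);         // 1.3: unset capture matches empty
    const e = x.endIndex;                     // 1.4
    const len = s.length;                     // 1.5
    const f = e + len;                        // 1.6
    if (f > input.length) return FAILURE;     // 1.7
    for (let i = 0; i < len; i++) {           // 1.8
      if (canonicalize(s[i]) !== canonicalize(input[e + i])) return FAILURE;
    }
    const y = { endIndex: f, captures: cap }; // 1.9
    return c(y);                              // 1.10
  };
}

// Capture 1 holds "ab" and the match has reached index 2 of "abab".
const input = Array.from("abab");
const state = { endIndex: 2, captures: [undefined, ["a", "b"]] };
const accept = (finalState) => finalState;
const matched = backreferenceMatcher(1, input)(state, accept);
```

Here `matched.endIndex` is 4, because the backreference consumes the second "ab".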

5 AtomEscape

The production AtomEscape[U] :: [+N] k GroupName evaluates as follows:

  1. Search the enclosing RegExp for an instance of a GroupSpecifier for a RegExpIdentifierName which has a StringValue equal to the StringValue of the RegExpIdentifierName contained in GroupName.
  2. Assert: A unique such GroupSpecifier is found.
  3. Let parenIndex be the number of left capturing parentheses in the entire regular expression that occur to the left of the located GroupSpecifier. This is the total number of times the Atom :: ( GroupSpecifier Disjunction ) production is expanded prior to that production's Term plus the total number of Atom :: ( GroupSpecifier Disjunction ) productions enclosing this Term.
  4. Call BackreferenceMatcher(parenIndex) and return its Matcher result.
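A short usage sketch of the `\k<name>` production follows. The pattern and sample strings are illustrative assumptions. Since the named reference resolves to a parenIndex, it behaves like the equivalent numbered backreference.

```javascript
// \k<quote> must match the same text that the group named "quote" captured.
const quoted = /(?<quote>['"])(.*?)\k<quote>/;
const byName = quoted.exec('say "hi" ok');

// The same group is also capture 1, so \1 is an equivalent backreference.
const byIndex = /(?<quote>['"])(.*?)\1/.exec('say "hi" ok');

// An opening ' closed by " does not match.
const mismatched = quoted.test("'hi\"");
```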

6 Runtime Semantics: RegExpInitialize ( obj, pattern, flags )

When the abstract operation RegExpInitialize with arguments obj, pattern, and flags is called, the following steps are taken:

  1. If pattern is undefined, let P be the empty String.
  2. Else, let P be ? ToString(pattern).
  3. If flags is undefined, let F be the empty String.
  4. Else, let F be ? ToString(flags).
  5. If F contains any code unit other than "g", "i", "m", "u", or "y" or if it contains the same code unit more than once, throw a SyntaxError exception.
  6. If F contains "u", let BMP be false; else let BMP be true.
  7. If BMP is true, then
    1. Parse P using the grammars in 1 and interpreting each of its 16-bit elements as a Unicode BMP code point. UTF-16 decoding is not applied to the elements. The goal symbol for the parse is Pattern[~U, ~N]. If the result of parsing contains a GroupName, reparse with the goal symbol Pattern[~U, +N] and use this result instead. Throw a SyntaxError exception if P did not conform to the grammar in either parsing attempt, if any elements of P were not matched by the parse, or if any Early Error conditions exist.
    2. Let patternCharacters be a List whose elements are the code unit elements of P.
  8. Else,
    1. Parse P using the grammars in 1 and interpreting P as UTF-16 encoded Unicode code points (6.1.4). The goal symbol for the parse is Pattern[+U, +N]. Throw a SyntaxError exception if P did not conform to the grammar, if any elements of P were not matched by the parse, or if any Early Error conditions exist.
    2. Let patternCharacters be a List whose elements are the code points resulting from applying UTF-16 decoding to P's sequence of elements.
  9. Set obj.[[OriginalSource]] to P.
  10. Set obj.[[OriginalFlags]] to F.
  11. Set obj.[[RegExpMatcher]] to the internal procedure that evaluates the above parse of P by applying the semantics provided in 21.2.2 using patternCharacters as the pattern's List of SourceCharacter values and F as the flag parameters.
  12. Perform ? Set(obj, "lastIndex", 0, true).
  13. Return obj.
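The effect of the two-pass parse in step 7 can be observed from script. The sketch below assumes an engine that implements this proposal together with the Annex B grammar; `tryCompile` is a hypothetical helper that returns either the RegExp or the thrown error.

```javascript
function tryCompile(source, flags) {
  try {
    return new RegExp(source, flags);
  } catch (err) {
    return err;
  }
}

// No GroupName in the pattern: the first parse (Pattern[~U, ~N]) stands,
// and \k is treated as an identity escape for the letter "k".
const legacy = tryCompile("\\k<a>", "");
const legacyMatches = legacy.test("k<a>");

// With the u flag the goal symbol is Pattern[+U, +N], and \k<a> has no group to name.
const unicodeMode = tryCompile("\\k<a>", "u");

// A GroupName is present, so the pattern is reparsed with +N and \k<a> is a backreference.
const reparsed = tryCompile("(?<a>x)\\k<a>", "");

// Step 5: a repeated flag is a SyntaxError.
const repeatedFlag = tryCompile("x", "gg");
```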

7 Runtime Semantics: RegExpBuiltinExec ( R, S )

The abstract operation RegExpBuiltinExec with arguments R and S performs the following steps:

  1. Assert: R is an initialized RegExp instance.
  2. Assert: Type(S) is String.
  3. Let length be the number of code units in S.
  4. Let flags be R.[[OriginalFlags]].
  5. If flags contains "g", let global be true, else let global be false.
  6. If flags contains "y", let sticky be true, else let sticky be false.
  7. If global is false and sticky is false, let lastIndex be 0.
  8. Else, let lastIndex be ? ToLength(? Get(R, "lastIndex")).
  9. Let matcher be R.[[RegExpMatcher]].
  10. If flags contains "u", let fullUnicode be true, else let fullUnicode be false.
  11. Let matchSucceeded be false.
  12. Repeat, while matchSucceeded is false
    1. If lastIndex > length, then
      1. If global is true or sticky is true, then
        1. Perform ? Set(R, "lastIndex", 0, true).
      2. Return null.
    2. Let r be matcher(S, lastIndex).
    3. If r is failure, then
      1. If sticky is true, then
        1. Perform ? Set(R, "lastIndex", 0, true).
        2. Return null.
      2. Let lastIndex be AdvanceStringIndex(S, lastIndex, fullUnicode).
    4. Else,
      1. Assert: r is a State.
      2. Set matchSucceeded to true.
  13. Let e be r's endIndex value.
  14. If fullUnicode is true, then
    1. e is an index into the Input character list, derived from S, matched by matcher. Let eUTF be the smallest index into S that corresponds to the character at element e of Input. If e is greater than or equal to the length of Input, then eUTF is the number of code units in S.
    2. Let e be eUTF.
  15. If global is true or sticky is true, then
    1. Perform ? Set(R, "lastIndex", e, true).
  16. Let n be the length of r's captures List. (This is the same value as 21.2.2.1's NcapturingParens.)
  17. Let A be ArrayCreate(n + 1).
  18. Assert: The value of A's "length" property is n + 1.
  19. Let matchIndex be lastIndex.
  20. Perform ! CreateDataProperty(A, "index", matchIndex).
  21. Perform ! CreateDataProperty(A, "input", S).
  22. Let matchedSubstr be the matched substring (i.e. the portion of S between offset lastIndex inclusive and offset e exclusive).
  23. Perform ! CreateDataProperty(A, "0", matchedSubstr).
  24. If R contains any GroupName, then
    1. Let groups be ObjectCreate(null).
  25. Else,
    1. Let groups be undefined.
  26. Perform ! CreateDataProperty(A, "groups", groups).
  27. For each integer i such that i > 0 and i ≤ n
    1. Let captureI be the ith element of r's captures List.
    2. If captureI is undefined, let capturedValue be undefined.
    3. Else if fullUnicode is true, then
      1. Assert: captureI is a List of code points.
      2. Let capturedValue be a string whose code units are the UTF16Encoding of the code points of captureI.
    4. Else fullUnicode is false,
      1. Assert: captureI is a List of code units.
      2. Let capturedValue be a string consisting of the code units of captureI.
    5. Perform ! CreateDataProperty(A, ! ToString(i), capturedValue).
    6. If the ith capture of R was defined with a GroupName, then
      1. Let s be the StringValue of the corresponding RegExpIdentifierName.
      2. Perform ! CreateDataProperty(groups, s, capturedValue).
  28. Return A.
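Steps 24 to 27 determine the shape of the result array. The following sketch shows what they produce for a few illustrative patterns; the patterns and variable names are assumptions made for the example.

```javascript
// Named captures appear both by index and on the groups object.
const dateMatch = /(?<year>\d{4})-(?<month>\d{2})/.exec("released 2017-12");
const year = dateMatch.groups.year;
const sameAsIndexed = dateMatch[1] === dateMatch.groups.year;

// groups is created with ObjectCreate(null), so it has no prototype.
const groupsProto = Object.getPrototypeOf(dateMatch.groups);

// Without any GroupName the property still exists, with the value undefined.
const noNames = /(\d+)/.exec("42");
const hasGroupsProperty = Object.prototype.hasOwnProperty.call(noNames, "groups");

// A named group that did not participate is present with the value undefined.
const alternation = /(?<a>x)|(?<b>y)/.exec("y");
const unmatchedIsListed = "a" in alternation.groups;
```

Because `groups` has a null prototype, names such as `toString` or `constructor` can be used as group names without colliding with inherited properties.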

8 String.prototype.replace ( searchValue, replaceValue )

When the replace method is called with arguments searchValue and replaceValue, the following steps are taken:

  1. Let O be ? RequireObjectCoercible(this value).
  2. If searchValue is neither undefined nor null, then
    1. Let replacer be ? GetMethod(searchValue, @@replace).
    2. If replacer is not undefined, then
      1. Return ? Call(replacer, searchValue, « O, replaceValue »).
  3. Let string be ? ToString(O).
  4. Let searchString be ? ToString(searchValue).
  5. Let functionalReplace be IsCallable(replaceValue).
  6. If functionalReplace is false, then
    1. Let replaceValue be ? ToString(replaceValue).
  7. Search string for the first occurrence of searchString and let pos be the index within string of the first code unit of the matched substring and let matched be searchString. If no occurrences of searchString were found, return string.
  8. If functionalReplace is true, then
    1. Let replValue be ? Call(replaceValue, undefined, « matched, pos, string »).
    2. Let replStr be ? ToString(replValue).
  9. Else,
    1. Let captures be a new empty List.
    2. Let replStr be GetSubstitution(matched, string, pos, captures, undefined, replaceValue).
  10. Let tailPos be pos + the number of code units in matched.
  11. Let newString be the String formed by concatenating the first pos code units of string, replStr, and the trailing substring of string starting at index tailPos. If pos is 0, the first element of the concatenation will be the empty String.
  12. Return newString.
Note

The replace function is intentionally generic; it does not require that its this value be a String object. Therefore, it can be transferred to other kinds of objects for use as a method.
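When searchValue is a string, the algorithm above passes undefined for namedCaptures, so `$<` in the replacement is copied literally. A brief sketch with illustrative inputs:

```javascript
// Only the first occurrence is replaced; $& stands for the matched text.
const firstOnly = "a-b-a".replace("a", "[$&]");

// namedCaptures is undefined here, so "$<x>" is not treated as a group reference.
const noNamedCaptures = "abc".replace("b", "$<x>");

// A function replaceValue is called with (matched, pos, string).
const calls = [];
const viaFunction = "abc".replace("b", (...args) => {
  calls.push(args);
  return "B";
});
```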

8.1 Runtime Semantics: GetSubstitution ( matched, str, position, captures, namedCaptures, replacement )

The abstract operation GetSubstitution performs the following steps:

  1. Assert: Type(matched) is String.
  2. Let matchLength be the number of code units in matched.
  3. Assert: Type(str) is String.
  4. Let stringLength be the number of code units in str.
  5. Assert: position is a nonnegative integer.
  6. Assert: position ≤ stringLength.
  7. Assert: captures is a possibly empty List of Strings.
  8. Assert: Type(replacement) is String.
  9. Let tailPos be position + matchLength.
  10. Let m be the number of elements in captures.
  11. If namedCaptures is not undefined, then
    1. Let namedCaptures be ? ToObject(namedCaptures).
  12. Let result be a String value derived from replacement by copying code unit elements from replacement to result while performing replacements as specified in Table 1. These $ replacements are done left-to-right, and, once such a replacement is performed, the new replacement text is not subject to further replacements.
  13. Return result.
Table 1: Replacement Text Symbol Substitutions
Code units | Unicode Characters | Replacement text
0x0024, 0x0024 | $$ | $
0x0024, 0x0026 | $& | matched
0x0024, 0x0060 | $` | If position is 0, the replacement is the empty String. Otherwise the replacement is the substring of str that starts at index 0 and whose last code unit is at index position - 1.
0x0024, 0x0027 | $' | If tailPos ≥ stringLength, the replacement is the empty String. Otherwise the replacement is the substring of str that starts at index tailPos and continues to the end of str.
0x0024, N where 0x0031 ≤ N ≤ 0x0039 | $n where n is one of 1 2 3 4 5 6 7 8 9 and $n is not followed by a decimal digit | The nth element of captures, where n is a single digit in the range 1 to 9. If n≤m and the nth element of captures is undefined, use the empty String instead. If n>m, the result is implementation-defined.
0x0024, N, N where 0x0030 ≤ N ≤ 0x0039 | $nn where n is one of 0 1 2 3 4 5 6 7 8 9 | The nnth element of captures, where nn is a two-digit decimal number in the range 01 to 99. If nn≤m and the nnth element of captures is undefined, use the empty String instead. If nn is 00 or nn>m, no replacement is done.
0x0024, 0x003C | $< | As determined by the following steps:
  1. If namedCaptures is undefined, the replacement text is the String "$<".
  2. Otherwise,
    1. Scan until the next >.
    2. If none is found, the replacement text is the String "$<".
    3. Otherwise,
      1. Let the enclosed substring be groupName.
      2. Let capture be ? Get(namedCaptures, groupName).
      3. If capture is undefined, replace the text through > with the empty string.
      4. Otherwise, replace the text through this following > with ? ToString(capture).
0x0024 | $ in any context that does not match any of the above. | $
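The table can be read as a left-to-right scanner over replacement. The sketch below is one possible rendering under stated assumptions: `captures` is a 0-based array holding captures 1 to m, the type assertions are omitted, and for the `$n` and `$nn` rows it tries the two-digit form first, falls back to the single digit, and otherwise copies the `$` literally. That last choice is an assumption, since the table leaves the n>m case implementation-defined.

```javascript
function getSubstitution(matched, str, position, captures, namedCaptures, replacement) {
  const tailPos = position + matched.length;
  const m = captures.length;
  // An undefined capture is replaced by the empty String.
  const capture = (n) => (captures[n - 1] === undefined ? "" : captures[n - 1]);
  let result = "";
  let i = 0;
  while (i < replacement.length) {
    const next = replacement[i + 1];
    if (replacement[i] !== "$" || next === undefined) {
      result += replacement[i];
      i += 1;
    } else if (next === "$") {
      result += "$";
      i += 2;
    } else if (next === "&") {
      result += matched;
      i += 2;
    } else if (next === "`") {
      result += str.slice(0, position);
      i += 2;
    } else if (next === "'") {
      result += str.slice(tailPos);
      i += 2;
    } else if (next >= "0" && next <= "9") {
      const twoDigits = /^\d\d/.exec(replacement.slice(i + 1, i + 3));
      const nn = twoDigits ? Number(twoDigits[0]) : 0;
      const n = Number(next);
      if (nn >= 1 && nn <= m) {
        result += capture(nn);
        i += 3;
      } else if (n >= 1 && n <= m) {
        result += capture(n);
        i += 2;
      } else {
        result += "$"; // assumption: leave the text unreplaced
        i += 1;
      }
    } else if (next === "<") {
      const close = namedCaptures === undefined ? -1 : replacement.indexOf(">", i + 2);
      if (close === -1) {
        result += "$<";
        i += 2;
      } else {
        const value = Object(namedCaptures)[replacement.slice(i + 2, close)];
        result += value === undefined ? "" : String(value);
        i = close + 1;
      }
    } else {
      result += "$";
      i += 1;
    }
  }
  return result;
}

// Usage with a real match result.
const source = "on 2017-12!";
const match = /(?<y>\d{4})-(?<m>\d{2})/.exec(source);
const replacementText = getSubstitution(
  match[0], source, match.index, match.slice(1), match.groups, "$<m>/$<y>"
);
```

For this input, `replacementText` is "12/2017".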

9 RegExp.prototype [ @@replace ] ( string, replaceValue )

When the @@replace method is called with arguments string and replaceValue, the following steps are taken:

  1. Let rx be the this value.
  2. If Type(rx) is not Object, throw a TypeError exception.
  3. Let S be ? ToString(string).
  4. Let lengthS be the number of code unit elements in S.
  5. Let functionalReplace be IsCallable(replaceValue).
  6. If functionalReplace is false, then
    1. Let replaceValue be ? ToString(replaceValue).
  7. Let global be ToBoolean(? Get(rx, "global")).
  8. If global is true, then
    1. Let fullUnicode be ToBoolean(? Get(rx, "unicode")).
    2. Perform ? Set(rx, "lastIndex", 0, true).
  9. Let results be a new empty List.
  10. Let done be false.
  11. Repeat, while done is false
    1. Let result be ? RegExpExec(rx, S).
    2. If result is null, set done to true.
    3. Else result is not null,
      1. Append result to the end of results.
      2. If global is false, set done to true.
      3. Else,
        1. Let matchStr be ? ToString(? Get(result, "0")).
        2. If matchStr is the empty String, then
          1. Let thisIndex be ? ToLength(? Get(rx, "lastIndex")).
          2. Let nextIndex be AdvanceStringIndex(S, thisIndex, fullUnicode).
          3. Perform ? Set(rx, "lastIndex", nextIndex, true).
  12. Let accumulatedResult be the empty String value.
  13. Let nextSourcePosition be 0.
  14. Repeat, for each result in results,
    1. Let nCaptures be ? ToLength(? Get(result, "length")).
    2. Let nCaptures be max(nCaptures - 1, 0).
    3. Let matched be ? ToString(? Get(result, "0")).
    4. Let matchLength be the number of code units in matched.
    5. Let position be ? ToInteger(? Get(result, "index")).
    6. Let position be max(min(position, lengthS), 0).
    7. Let n be 1.
    8. Let captures be a new empty List.
    9. Repeat, while n ≤ nCaptures
      1. Let capN be ? Get(result, ! ToString(n)).
      2. If capN is not undefined, then
        1. Let capN be ? ToString(capN).
      3. Append capN as the last element of captures.
      4. Let n be n+1.
    10. Let namedCaptures be ? Get(result, "groups").
    11. If functionalReplace is true, then
      1. Let replacerArgs be « matched ».
      2. Append in list order the elements of captures to the end of the List replacerArgs.
      3. Append position and S to replacerArgs.
      4. If namedCaptures is not undefined, then
        1. Append namedCaptures as the last element of replacerArgs.
      5. Let replValue be ? Call(replaceValue, undefined, replacerArgs).
      6. Let replacement be ? ToString(replValue).
    12. Else,
      1. Let replacement be GetSubstitution(matched, S, position, captures, namedCaptures, replaceValue).
    13. If position ≥ nextSourcePosition, then
      1. NOTE position should not normally move backwards. If it does, it is an indication of an ill-behaving RegExp subclass or use of an access triggered side-effect to change the global flag or other characteristics of rx. In such cases, the corresponding substitution is ignored.
      2. Let accumulatedResult be the String formed by concatenating the code units of the current value of accumulatedResult with the substring of S consisting of the code units from nextSourcePosition (inclusive) up to position (exclusive) and with the code units of replacement.
      3. Let nextSourcePosition be position + matchLength.
  15. If nextSourcePosition ≥ lengthS, return accumulatedResult.
  16. Return the String formed by concatenating the code units of accumulatedResult with the substring of S consisting of the code units from nextSourcePosition (inclusive) up through the final code unit of S (inclusive).

The value of the name property of this function is "[Symbol.replace]".
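From the caller's side, the algorithm above has two visible effects: `$<name>` works in string replacements, and a replacer function receives the groups object as an extra final argument when the pattern has named groups. The inputs below are illustrative assumptions.

```javascript
// String replacement with named references, applied globally.
const isoDate = /(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})/g;
const reordered = "2017-12-12, 2018-01-02".replace(isoDate, "$<d>/$<m>/$<y>");

// Functional replacement: arguments are matched, captures..., position, S, groups.
const withNames = [];
"k=v".replace(/(?<key>\w)=(?<val>\w)/, (...args) => {
  withNames.push(args);
  return "";
});

// Without named groups, namedCaptures is undefined and nothing extra is appended.
const withoutNames = [];
"k=v".replace(/(\w)=(\w)/, (...args) => {
  withoutNames.push(args);
  return "";
});
```

Since the groups object is appended after position and S, existing replacer functions that read only the leading arguments are unaffected.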

A Regular Expressions Patterns

The syntax of 1 is modified and extended as follows. These changes introduce ambiguities that are broken by the ordering of grammar productions and by contextual information. When parsing using the following grammar, each alternative is considered only if previous production alternatives do not match.

This alternative pattern grammar and semantics only changes the syntax and semantics of BMP patterns. The following grammar extensions include productions parameterized with the [U] parameter. However, none of these extensions change the syntax of Unicode patterns recognized when parsing with the [U] parameter present on the goal symbol.

Syntax

Term[U, N] ::
  [+U] Assertion[+U, ?N]
  [+U] Atom[+U, ?N]
  [+U] Atom[+U, ?N] Quantifier
  [~U] QuantifiableAssertion Quantifier
  [~U] Assertion[~U, ?N]
  [~U] ExtendedAtom[?N] Quantifier
  [~U] ExtendedAtom[?N]

Assertion[U, N] ::
  ^
  $
  \ b
  \ B
  [+U] ( ? = Disjunction[+U, ?N] )
  [+U] ( ? ! Disjunction[+U, ?N] )
  [~U] QuantifiableAssertion[N]

QuantifiableAssertion[N] ::
  ( ? = Disjunction[~U, ?N] )
  ( ? ! Disjunction[~U, ?N] )

ExtendedAtom[N] ::
  .
  \ AtomEscape[~U, ?N]
  CharacterClass[~U, ?N]
  ( Disjunction[~U, ?N] )
  ( ? : Disjunction[~U, ?N] )
  InvalidBracedQuantifier
  ExtendedPatternCharacter

InvalidBracedQuantifier ::
  { DecimalDigits }
  { DecimalDigits , }
  { DecimalDigits , DecimalDigits }

ExtendedPatternCharacter ::
  SourceCharacter but not one of ^ $ . * + ? ( ) [ |

AtomEscape[U, N] ::
  [+U] DecimalEscape
  [~U] DecimalEscape but only if the integer value of DecimalEscape is <= NcapturingParens
  CharacterClassEscape
  CharacterEscape[~U, ?N]
  [+N] k GroupName

CharacterEscape[U, N] ::
  ControlEscape
  c ControlLetter
  0 [lookahead ∉ DecimalDigit]
  HexEscapeSequence
  RegExpUnicodeEscapeSequence[?U]
  [~U] LegacyOctalEscapeSequence
  IdentityEscape[?U, ?N]

IdentityEscape[U, N] ::
  [+U] SyntaxCharacter
  [+U] /
  [~U] SourceCharacter but not c
  [~U] SourceCharacterIdentityEscape[?N]

SourceCharacterIdentityEscape[N] ::
  [~N] SourceCharacter but not c
  [+N] SourceCharacter but not one of c or k

ClassAtomNoDash[U, N] ::
  SourceCharacter but not one of \ or ] or -
  \ ClassEscape[?U, ?N]
  \ [lookahead = c]

ClassEscape[U, N] ::
  b
  [+U] -
  [~U] c ClassControlLetter
  CharacterClassEscape
  CharacterEscape[?U, ?N]

ClassControlLetter ::
  DecimalDigit
  _

Note

When the same left-hand side occurs with both [+U] and [~U] guards, it is to control the disambiguation priority.
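The SourceCharacterIdentityEscape production is what keeps `\k` backward compatible in BMP patterns. A sketch of the observable consequences follows, assuming an engine that implements this annex; `compiles` is a hypothetical helper.

```javascript
function compiles(source, flags = "") {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// [~N]: with no named groups, \k is an identity escape that matches "k".
const bareKMatchesK = compiles("\\k") && new RegExp("\\k").test("k");

// [+N]: once the pattern contains a GroupName, k is excluded from identity escapes,
// so a \k that is not followed by a GroupName no longer parses.
const bareKWithNamedGroup = compiles("(?<a>x)\\k");

// The N parameter applies to the whole pattern, so a reference may precede its group.
const forwardReference = new RegExp("\\k<a>(?<a>x)").test("x");

// Unicode patterns never treated \k as an identity escape.
const bareKUnicode = compiles("\\k", "u");
```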