Stage 3 Draft / August 20, 2019

RegExp Match Indices

Contributing to this Proposal

This proposal is developed on GitHub with the help of the ECMAScript community. There are a number of ways to contribute to the development of this specification:

1Text Processing

1.1RegExp (Regular Expression) Objects

1.1.1Pattern Semantics

1.1.1.1Notation

The descriptions below use the following variables:

  • Input is a List consisting of all of the characters, in order, of the String being matched by the regular expression pattern. Each character is either a code unit or a code point, depending upon the kind of pattern involved. The notation Input[n] means the nth character of Input, where n can range between 0 (inclusive) and InputLength (exclusive).
  • InputLength is the number of characters in Input.
  • NcapturingParens is the total number of left-capturing parentheses (i.e. the total number of Atom::(GroupSpecifierDisjunction) Parse Nodes) in the pattern. A left-capturing parenthesis is any ( pattern character that is matched by the ( terminal of the Atom::(GroupSpecifierDisjunction) production.
  • DotAll is true if the RegExp object's [[OriginalFlags]] internal slot contains "s" and otherwise is false.
  • IgnoreCase is true if the RegExp object's [[OriginalFlags]] internal slot contains "i" and otherwise is false.
  • Multiline is true if the RegExp object's [[OriginalFlags]] internal slot contains "m" and otherwise is false.
  • Unicode is true if the RegExp object's [[OriginalFlags]] internal slot contains "u" and otherwise is false.

Furthermore, the descriptions below use the following internal data structures:

  • A CharSet is a mathematical set of characters, either code units or code points depending up the state of the Unicode flag. “All characters” means either all code unit values or all code point values also depending upon the state of Unicode.
  • A Range is an ordered pair (startIndex, endIndex) that represents the range of characters included in a capture, where startIndex is an integer representing the start index (inclusive) of the range within Input and endIndex is an integer representing the end index (exclusive) of the range within Input. For any Range, these indices must satisfy the invariant that startIndexendIndex.
  • A State is an ordered pair (endIndex, captures) where endIndex is an integer representing an index within Input and captures is a List of NcapturingParens values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex is one plus the index of the last input character matched so far by the pattern, while captures holds the results of capturing parentheses. The nth element of captures is either a List that represents the valueRange obtained by the nth set of capturing parentheses or undefined if the nth set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
  • A MatchResult is either a State or the special token failure that indicates that the match failed.
  • A Continuation procedure is an internal closure (i.e. an internal procedure with some arguments already bound to values) that takes one State argument and returns a MatchResult result. If an internal closure references variables which are bound in the function that creates the closure, the closure uses the values that these variables had at the time the closure was created. The Continuation attempts to match the remaining portion (specified by the closure's already-bound arguments) of the pattern against Input, starting at the intermediate state given by its State argument. If the match succeeds, the Continuation returns the final State that it reached; if the match fails, the Continuation returns failure.
  • A Matcher procedure is an internal closure that takes two arguments — a State and a Continuation — and returns a MatchResult result. A Matcher attempts to match a middle subpattern (specified by the closure's already-bound arguments) of the pattern against Input, starting at the intermediate state given by its State argument. The Continuation argument should be a closure that matches the rest of the pattern. After matching the subpattern of a pattern to obtain a new State, the Matcher then calls Continuation on that new State to test if the rest of the pattern can match as well. If it can, the Matcher returns the State returned by Continuation; if not, the Matcher may try different choices at its choice points, repeatedly calling Continuation until it either succeeds or all possibilities have been exhausted.
  • An AssertionTester procedure is an internal closure that takes a State argument and returns a Boolean result. The assertion tester tests a specific condition (specified by the closure's already-bound arguments) against the current place in Input and returns true if the condition matched or false if not.

1.1.1.2Atom

With parameter direction.

The production Atom::(GroupSpecifierDisjunction) evaluates as follows:

  1. Evaluate Disjunction with argument direction to obtain a Matcher m.
  2. Let parenIndex be the number of left-capturing parentheses in the entire regular expression that occur to the left of this Atom. This is the total number of Atom::(GroupSpecifierDisjunction) Parse Nodes prior to or enclosing this Atom.
  3. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
    1. Let d be an internal Continuation closure that takes one State argument y and performs the following steps:
      1. Let cap be a copy of y's captures List.
      2. Let xe be x's endIndex.
      3. Let ye be y's endIndex.
      4. If direction is equal to +1, then
        1. Assert: xeye.
        2. Let s be a new List whose elements are the characters of Input at indices xe (inclusive) through ye (exclusive).
        3. Let r be the Range (xe, ye).
      5. Else,
        1. Assert: direction is equal to -1.
        2. Assert: yexe.
        3. Let s be a new List whose elements are the characters of Input at indices ye (inclusive) through xe (exclusive).
        4. Let r be the Range (ye, xe).
      6. Set cap[parenIndex+1] to sr.
      7. Let z be the State (ye, cap).
      8. Call c(z) and return its result.
    2. Call m(x, d) and return its result.

1.1.1.3AtomEscape

1.1.1.3.1Runtime Semantics: BackreferenceMatcher ( n, direction )

The abstract operation BackreferenceMatcher takes two arguments, an integer n and an integer direction, and performs the following steps:

  1. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
    1. Let cap be x's captures List.
    2. Let sr be cap[n].
    3. If sr is undefined, return c(x).
    4. Let e be x's endIndex.
    5. Let rs be r's startIndex.
    6. Let re be r's endIndex.
    7. Let len be the number of elements in sre - rs.
    8. Let f be e + direction × len.
    9. If f < 0 or f > InputLength, return failure.
    10. Let g be min(e, f).
    11. If there exists an integer i between 0 (inclusive) and len (exclusive) such that Canonicalize(s[i]Input[rs + i]) is not the same character value as Canonicalize(Input[g + i]), return failure.
    12. Let y be the State (f, cap).
    13. Call c(y) and return its result.

1.1.2RegExp Abstract Operations

1.1.2.1Match Records

A Match is a Record value used to encapsulate the start and end indices of a regular expression match or capture.

Match Records have the fields listed in Table 1.

Table 1: Match Record Fields
Field Name Value Meaning
[[StartIndex]] An integer ≥ 0. The number of code units from the start of a string at which the match begins (inclusive).
[[EndIndex]] An integer ≥ [[StartIndex]]. The number of code units from the start of a string at which the match ends (exclusive).

1.1.3Properties of the RegExp Prototype Object

1.1.3.1RegExp.prototype.exec ( string )

1.1.3.1.1Runtime Semantics: RegExpBuiltinExec ( R, S )

The abstract operation RegExpBuiltinExec with arguments R and S performs the following steps:

  1. Assert: R is an initialized RegExp instance.
  2. Assert: Type(S) is String.
  3. Let length be the number of code units in S.
  4. Let lastIndex be ? ToLength(? Get(R, "lastIndex")).
  5. Let flags be R.[[OriginalFlags]].
  6. If flags contains "g", let global be true, else let global be false.
  7. If flags contains "y", let sticky be true, else let sticky be false.
  8. If global is false and sticky is false, set lastIndex to 0.
  9. Let matcher be R.[[RegExpMatcher]].
  10. If flags contains "u", let fullUnicode be true, else let fullUnicode be false.
  11. Let matchSucceeded be false.
  12. Let Input be a List consisting of all of the characters, in order, of S. If fullUnicode is true, each character is a code unit, otherwise each character is a code point.
  13. Repeat, while matchSucceeded is false
    1. If lastIndex > length, then
      1. If global is true or sticky is true, then
        1. Perform ? Set(R, "lastIndex", 0, true).
      2. Return null.
    2. Let r be matcher(Input, lastIndex).
    3. If r is failure, then
      1. If sticky is true, then
        1. Perform ? Set(R, "lastIndex", 0, true).
        2. Return null.
      2. Set lastIndex to AdvanceStringIndex(S, lastIndex, fullUnicode).
    4. Else,
      1. Assert: r is a State.
      2. Set matchSucceeded to true.
  14. Let e be r's endIndex value.
  15. If fullUnicode is true, then
    1. e is an index into the Input character list, derived from S, matched by matcher. Let eUTF be the smallest index into S that corresponds to the character at element e of Input. If e is greater than or equal to the number of elements in Input, then eUTF is the number of code units in S.
    2. Set e to eUTF.
  16. If fullUnicode is true, set e to ! GetStringIndex(S, Input, e).
  17. If global is true or sticky is true, then
    1. Perform ? Set(R, "lastIndex", e, true).
  18. Let n be the number of elements in r's captures List. (This is the same value as 1.1.1.1's NcapturingParens.)
  19. Assert: n < 232-1.
  20. Let A be ! ArrayCreate(n + 1).
  21. Assert: The value of A's "length" property is n + 1.
  22. Perform ! CreateDataProperty(A, "index", lastIndex).
  23. Perform ! CreateDataProperty(A, "input", S).
  24. Let matchedSubstr be the matched substring (i.e. the portion of S between offset lastIndex inclusive and offset e exclusive).
  25. Let indices be a new empty List.
  26. Let match be the Match { [[StartIndex]]: lastIndex, [[EndIndex]]: e }.
  27. Add match as the last element of indices.
  28. Let matchedValue be ! GetMatchString(S, match).
  29. Perform ! CreateDataProperty(A, "0", matchedSubstrmatchedValue).
  30. If R contains any GroupName, then
    1. Let groups be ObjectCreate(null).
    2. Let groupNames be a new empty List.
  31. Else,
    1. Let groups be undefined.
    2. Let groupNames be undefined.
  32. Perform ! CreateDataProperty(A, "groups", groups).
  33. Let getIndices be MakeGetIndices(S).
  34. Perform ! DefinePropertyOrThrow(A, "indices", PropertyDescriptor { [[Get]]: getIndices, [[Set]]: undefined, [[Enumerable]]: false, [[Configurable]]: true }).
  35. For each integer i such that i > 0 and in, in ascending order, do
    1. Let captureI be ith element of r's captures List.
    2. If captureI is undefined, let capturedValue be undefined.
    3. Else if fullUnicode is true, then
      1. Assert: captureI is a List of code points.
      2. Let capturedValue be the String value whose code units are the UTF16Encoding of the code points of captureI.
    4. Else fullUnicode is false,
      1. Assert: captureI is a List of code units.
      2. Let capturedValue be the String value consisting of the code units of captureI.
    5. If captureI is undefined, then
      1. Let capturedValue be undefined.
      2. Add undefined as the last element of indices.
    6. Else,
      1. Let captureStart be captureI's startIndex.
      2. Let captureEnd be captureI's endIndex.
      3. If fullUnicode is true, then
        1. Set captureStart to ! GetStringIndex(S, Input, captureStart).
        2. Set captureEnd to ! GetStringIndex(S, Input, captureEnd).
      4. Let capture be the Match { [[StartIndex]]: captureStart, [[EndIndex]:: captureEnd }.
      5. Append capture to indices.
      6. Let capturedValue be ! GetMatchString(S, capture).
    7. Perform ! CreateDataProperty(A, ! ToString(i), capturedValue).
    8. If the ith capture of R was defined with a GroupName, then
      1. Let s be the StringValue of the corresponding RegExpIdentifierName.
      2. Perform ! CreateDataProperty(groups, s, capturedValue).
      3. Assert: groupNames is a List.
      4. Append s to groupNames.
    9. Else,
      1. If groupNames is a List, append undefined to groupNames.
  36. Set getIndices.[[MatchIndices]] to indices.
  37. Set getIndices.[[GroupNames]] to groupNames.
  38. Return A.

1.1.3.1.2GetStringIndex ( S, Input, e )

The abstract operation GetStringIndex with with arguments S, Input, and e performs the following steps:

  1. Assert: Type(S) is String.
  2. Assert: Input is a List of the code points of S interpreted as a UTF-16 encoded string.
  3. Assert: e is an integer value ≥ 0 and < the number of elements in Input.
  4. Let eUTF be the smallest index into S that corresponds to the character at element e of Input. If e is greater than or equal to the number of elements in Input, then eUTF is the number of code units in S.
  5. Return eUTF.

1.1.3.1.3GetMatchString ( S, match )

The abstract operation GetMatchString with arguments S and match performs the following steps:

  1. Assert: Type(S) is String.
  2. Assert: match is a Match Record.
  3. Assert: match.[[StartIndex]] is an integer value ≥ 0 and < the length of S.
  4. Assert: match.[[EndIndex]] is an integer value ≥ match.[[StartIndex]] and ≤ the length of S.
  5. Return the portion of S between offset match.[[StartIndex]] inclusive and offset match.[[EndIndex]] exclusive.

1.1.3.1.4GetMatchIndicesArray ( S, match )

The abstract operation GetMatchIndicesArray with arguments S and match performs the following steps:

  1. Assert: Type(S) is String.
  2. Assert: match is a Match Record.
  3. Assert: match.[[StartIndex]] is an integer value ≥ 0 and < the length of S.
  4. Assert: match.[[EndIndex]] is an integer value ≥ match.[[StartIndex]] and ≤ the length of S.
  5. Return CreateArrayFromListmatch.[[StartIndex]], match.[[EndIndex]] »).

1.1.3.1.5MakeGetIndices ( S )

The abstract operation MakeGetIndices with argument S performs the following steps:

  1. Assert: Type(S) is String.
  2. Let steps be the steps of a GetIndices function as specified below.
  3. Let getter be CreateBuiltinFunction(steps, « [[InputString]], [[GroupNames]], [[MatchIndices]], [[MatchIndicesArray]] »).
  4. Set getter.[[InputString]] to S.
  5. Set getter.[[MatchIndicesArray]] to undefined.
  6. Return getter.

A GetIndices function is an anonymous built-in function with [[InputString]], [[GroupNames]], [[MatchIndices]], and [[MatchIndicesArray]] internal slots. When a GetIndices function that expects no arguments is called it performs the following steps:

  1. Let f be the active function object.
  2. Let A be f.[[MatchIndicesArray]].
  3. If A is not undefined, then
    1. Return A.
  4. Let S be f.[[InputString]].
  5. Let indices be f.[[MatchIndices]].
  6. Assert: indices is a List.
  7. Let groupNames be f.[[GroupNames]].
  8. Assert: groupNames is a List or is undefined.
  9. Let n be the number of elements in indices.
  10. Assert: n < 232-1.
  11. Set A to ! ArrayCreate(n).
  12. Assert: The value of A's "length" property is n.
  13. If groupNames is not undefined, then
    1. Let groups be ! ObjectCreate(null).
  14. Else,
    1. Let groups be undefined.
  15. Perform ! CreateDataProperty(A, "groups", groups).
  16. For each integer i such that i ≥ 0 and i < n, do
    1. Let matchIndices be indices[i].
    2. If matchIndices is not undefined, then
      1. Let matchIndicesArray be ! GetMatchIndicesArray(S, matchIndices).
    3. Else,
      1. Let matchIndicesArray be undefined.
    4. Perform ! CreateDataProperty(A, ! ToString(i), matchIndicesArray).
    5. If groupNames is not undefined and groupNames[i] is not undefined, then
      1. Perform ! CreateDataProperty(groups, groupNames[i], matchIndicesArray).
  17. Set f.[[MatchIndicesArray]] to A.
  18. Return A.

ACopyright & Software License

Copyright Notice

© 2019 Ron Buckton, Ecma International

Software License

All Software contained in this document ("Software") is protected by copyright and is being made available under the "BSD License", included below. This Software may be subject to third party rights (rights from parties other than Ecma International), including patent rights, and no licenses under such third party rights are granted under this license even if the third party concerned is a member of Ecma International. SEE THE ECMA CODE OF CONDUCT IN PATENT MATTERS AVAILABLE AT https://ecma-international.org/memento/codeofconduct.htm FOR INFORMATION REGARDING THE LICENSING OF PATENT CLAIMS THAT ARE REQUIRED TO IMPLEMENT ECMA INTERNATIONAL STANDARDS.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. Neither the name of the authors nor Ecma International may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE ECMA INTERNATIONAL "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL ECMA INTERNATIONAL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.