Stage 3 Draft / January 15, 2021

RegExp Match Indices

Contributing to this Proposal

This proposal is developed on GitHub with the help of the ECMAScript community. There are a number of ways to contribute to the development of this specification:

GitHub Repository: https://github.com/tc39/proposal-regexp-match-indices
Issues: All Issues, File a New Issue
Pull Requests: All Pull Requests, Create a New Pull Request
Community:
- Discourse: https://es.discourse.group
- IRC: #tc39 on freenode

1 Text Processing

1.1 RegExp (Regular Expression) Objects

1.1.1 Pattern Semantics

1.1.1.1 Notation

The descriptions below use the following variables:

Input is a List consisting of all of the characters, in order, of the String being matched by the regular expression pattern. Each character is either a code unit or a code point, depending upon the kind of pattern involved. The notation Input[n] means the n^th character of Input, where n can range between 0 (inclusive) and InputLength (exclusive).
InputLength is the number of characters in Input.
NcapturingParens is the total number of left-capturing parentheses (i.e. the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes) in the pattern. A left-capturing parenthesis is any ( pattern character that is matched by the ( terminal of the Atom :: ( GroupSpecifier Disjunction ) production.
DotAll is true if the RegExp object's [[OriginalFlags]] internal slot contains "s" and otherwise is false.
IgnoreCase is true if the RegExp object's [[OriginalFlags]] internal slot contains "i" and otherwise is false.
Multiline is true if the RegExp object's [[OriginalFlags]] internal slot contains "m" and otherwise is false.
Unicode is true if the RegExp object's [[OriginalFlags]] internal slot contains "u" and otherwise is false.

Furthermore, the descriptions below use the following internal data structures:

A CharSet is a mathematical set of characters, either code units or code points depending up the state of the Unicode flag. “All characters” means either all code unit values or all code point values also depending upon the state of Unicode.
A Range is an ordered pair (startIndex, endIndex) that represents the range of characters included in a capture, where startIndex is an integer representing the start index (inclusive) of the range within Input and endIndex is an integer representing the end index (exclusive) of the range within Input. For any Range, these indices must satisfy the invariant that startIndex ≤ endIndex.
A State is an ordered pair (endIndex, captures) where endIndex is an integer representing an index within Input and captures is a List of NcapturingParens values. States are used to represent partial match states in the regular expression matching algorithms. The endIndex is one plus the index of the last input character matched so far by the pattern, while captures holds the results of capturing parentheses. The n^th element of captures is either a List that represents the ~~value~~Range obtained by the n^th set of capturing parentheses or undefined if the n^th set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
A MatchResult is either a State or the special token failure that indicates that the match failed.
A Continuation procedure is an internal closure (i.e. an internal procedure with some arguments already bound to values) that takes one State argument and returns a MatchResult result. If an internal closure references variables which are bound in the function that creates the closure, the closure uses the values that these variables had at the time the closure was created. The Continuation attempts to match the remaining portion (specified by the closure's already-bound arguments) of the pattern against Input, starting at the intermediate state given by its State argument. If the match succeeds, the Continuation returns the final State that it reached; if the match fails, the Continuation returns failure.
A Matcher procedure is an internal closure that takes two arguments — a State and a Continuation — and returns a MatchResult result. A Matcher attempts to match a middle subpattern (specified by the closure's already-bound arguments) of the pattern against Input, starting at the intermediate state given by its State argument. The Continuation argument should be a closure that matches the rest of the pattern. After matching the subpattern of a pattern to obtain a new State, the Matcher then calls Continuation on that new State to test if the rest of the pattern can match as well. If it can, the Matcher returns the State returned by Continuation; if not, the Matcher may try different choices at its choice points, repeatedly calling Continuation until it either succeeds or all possibilities have been exhausted.
An AssertionTester procedure is an internal closure that takes a State argument and returns a Boolean result. The assertion tester tests a specific condition (specified by the closure's already-bound arguments) against the current place in Input and returns true if the condition matched or false if not.

1.1.1.2 Atom

With parameter direction.

The production Atom :: ( GroupSpecifier Disjunction ) evaluates as follows:

Evaluate Disjunction with argument direction to obtain a Matcher m.
Let parenIndex be the number of left-capturing parentheses in the entire regular expression that occur to the left of this Atom. This is the total number of Atom :: ( GroupSpecifier Disjunction ) Parse Nodes prior to or enclosing this Atom.
Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
1. Let d be an internal Continuation closure that takes one State argument y and performs the following steps:
  1. Let cap be a copy of y's captures List.
  2. Let xe be x's endIndex.
  3. Let ye be y's endIndex.
  4. If direction is equal to +1, then
    1. Assert: xe ≤ ye.
    2. ~~Let s be a new List whose elements are the characters of Input at indices xe (inclusive) through ye (exclusive).~~
    3. Let r be the Range (xe, ye).
  5. Else,
    1. Assert: direction is equal to -1.
    2. Assert: ye ≤ xe.
    3. ~~Let s be a new List whose elements are the characters of Input at indices ye (inclusive) through xe (exclusive).~~
    4. Let r be the Range (ye, xe).
  6. Set cap[parenIndex+1] to ~~s~~r.
  7. Let z be the State (ye, cap).
  8. Call c(z) and return its result.
2. Call m(x, d) and return its result.

1.1.1.3 AtomEscape

1.1.1.3.1 Runtime Semantics: BackreferenceMatcher ( `n`, `direction` )

The abstract operation BackreferenceMatcher takes two arguments, an integer n and an integer direction, and performs the following steps:

Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
1. Let cap be x's captures List.
2. Let ~~s~~r be cap[n].
3. If ~~s~~r is undefined, return c(x).
4. Let e be x's endIndex.
5. Let rs be r's startIndex.
6. Let re be r's endIndex.
7. Let len be ~~the number of elements in s~~re - rs.
8. Let f be e + direction × len.
9. If f < 0 or f > InputLength, return failure.
10. Let g be min(e, f).
11. If there exists an integer i between 0 (inclusive) and len (exclusive) such that Canonicalize(~~s[i]~~Input[rs + i]) is not the same character value as Canonicalize(Input[g + i]), return failure.
12. Let y be the State (f, cap).
13. Call c(y) and return its result.

1.1.2 RegExp Abstract Operations

1.1.2.1 Match Records

A Match is a Record value used to encapsulate the start and end indices of a regular expression match or capture.

Match Records have the fields listed in Table 1.

Table 1: Match Record Fields

Field Name	Value	Meaning
[[StartIndex]]	An integer ≥ 0.	The number of code units from the start of a string at which the match begins (inclusive).
[[EndIndex]]	An integer ≥ [[StartIndex]].	The number of code units from the start of a string at which the match ends (exclusive).

1.1.3 The RegExp Constructor

1.1.3.1 Abstract Operations for the RegExp Constructor

1.1.3.1.1 RegExpInitialize ( `obj`, `pattern`, `flags` )

The abstract operation RegExpInitialize takes arguments obj, pattern, and flags. It performs the following steps when called:

If pattern is undefined, let P be the empty String.
Else, let P be ? ToString(pattern).
If flags is undefined, let F be the empty String.
Else, let F be ? ToString(flags).
If F contains any code unit other than "d", "g", "i", "m", "s", "u", or "y" or if it contains the same code unit more than once, throw a SyntaxError exception.
If F contains "u", let u be true; else let u be false.
If u is true, then
1. Let patternText be ! StringToCodePoints(P).
2. Let patternCharacters be a List whose elements are the code points of patternText.
Else,
1. Let patternText be the result of interpreting each of P's 16-bit elements as a Unicode BMP code point. UTF-16 decoding is not applied to the elements.
2. Let patternCharacters be a List whose elements are the code unit elements of P.
Let parseResult be ParsePattern(patternText, u).
If parseResult is a non-empty List of SyntaxError objects, throw a SyntaxError exception.
Assert: parseResult is a Parse Node for Pattern.
Set obj.[[OriginalSource]] to P.
Set obj.[[OriginalFlags]] to F.
Set obj.[[RegExpMatcher]] to the Abstract Closure that evaluates parseResult by applying the semantics provided in 1.1.1 using patternCharacters as the pattern's List of SourceCharacter values and F as the flag parameters.
Perform ? Set(obj, "lastIndex", +0_𝔽, true).
Return obj.

1.1.4 Properties of the RegExp Prototype Object

1.1.4.1 RegExp.prototype.exec ( `string` )

1.1.4.1.1 Runtime Semantics: RegExpBuiltinExec ( `R`, `S` )

The abstract operation RegExpBuiltinExec with arguments R and S performs the following steps:

Assert: R is an initialized RegExp instance.
Assert: Type(S) is String.
Let length be the number of code units in S.
Let lastIndex be ? ToLength(? Get(R, "lastIndex")).
Let flags be R.[[OriginalFlags]].
If flags contains "g", let global be true, else let global be false.
If flags contains "y", let sticky be true, else let sticky be false.
If flags contains "d", let hasIndices be true, else let hasIndices be false.
If global is false and sticky is false, set lastIndex to 0.
Let matcher be R.[[RegExpMatcher]].
If flags contains "u", let fullUnicode be true, else let fullUnicode be false.
Let matchSucceeded be false.
Let Input be a List consisting of all of the characters, in order, of S. If fullUnicode is true, each character is a code unit, otherwise each character is a code point.
Repeat, while matchSucceeded is false
1. If lastIndex > length, then
  1. If global is true or sticky is true, then
    1. Perform ? Set(R, "lastIndex", 0, true).
  2. Return null.
2. Let r be matcher(Input, lastIndex).
3. If r is failure, then
  1. If sticky is true, then
    1. Perform ? Set(R, "lastIndex", 0, true).
    2. Return null.
  2. Set lastIndex to AdvanceStringIndex(S, lastIndex, fullUnicode).
4. Else,
  1. Assert: r is a State.
  2. Set matchSucceeded to true.
Let e be r's endIndex value.
~~If fullUnicode is true, then~~
1. e is an index into the Input character list, derived from S, matched by matcher. Let eUTF be the smallest index into S that corresponds to the character at element e of Input. If e is greater than or equal to the number of elements in Input, then eUTF is the number of code units in S.
2. ~~Set e to eUTF.~~
If fullUnicode is true, set e to ! GetStringIndex(S, Input, e).
If global is true or sticky is true, then
1. Perform ? Set(R, "lastIndex", e, true).
Let n be the number of elements in r's captures List. (This is the same value as 1.1.1.1's NcapturingParens.)
Assert: n < 2³²-1.
Let A be ! ArrayCreate(n + 1).
Assert: The value of A's "length" property is n + 1.
Perform ! CreateDataProperty(A, "index", lastIndex).
Perform ! CreateDataProperty(A, "input", S).
~~Let matchedSubstr be the matched substring (i.e. the portion of S between offset lastIndex inclusive and offset e exclusive).~~
Let match be the Match { [[StartIndex]]: lastIndex, [[EndIndex]]: e }.
let indices be a new empty List
let groupNames be a new empty List
Add match as the last element of indices.
Let matchedValue be ! GetMatchString(S, match).
Perform ! CreateDataProperty(A, "0", ~~matchedSubstr~~matchedValue).
If R contains any GroupName, then
1. Let groups be ObjectCreate(null).
2. Let hasGroups be true.
Else,
1. Let groups be undefined.
2. Let hasGroups be false.
Perform ! CreateDataProperty(A, "groups", groups).
For each integer i such that i > 0 and i ≤ n, in ascending order, do
1. Let captureI be i^th element of r's captures List.
2. ~~If captureI is undefined, let capturedValue be undefined.~~
3. ~~Else if fullUnicode is true, then~~
  1. ~~Assert: captureI is a List of code points.~~
  2. ~~Let capturedValue be the String value whose code units are the UTF16Encoding of the code points of captureI.~~
4. ~~Else fullUnicode is false,~~
  1. ~~Assert: captureI is a List of code units.~~
  2. ~~Let capturedValue be the String value consisting of the code units of captureI.~~
5. If captureI is undefined, then
  1. Let capturedValue be undefined.
  2. Append undefined to indices.
6. Else,
  1. Let captureStart be captureI's startIndex.
  2. Let captureEnd be captureI's endIndex.
  3. If fullUnicode is true, then
    1. Set captureStart to ! GetStringIndex(S, Input, captureStart).
    2. Set captureEnd to ! GetStringIndex(S, Input, captureEnd).
  4. Let capture be the Match { [[StartIndex]]: captureStart, [[EndIndex]:: captureEnd }.
  5. Let capturedValue be ! GetMatchString(S, capture).
  6. Append capture to indices.
7. Perform ! CreateDataProperty(A, ! ToString(i), capturedValue).
8. If the ith capture of R was defined with a GroupName, then
  1. Let s be the StringValue of the corresponding RegExpIdentifierName.
  2. Perform ! CreateDataProperty(groups, s, capturedValue).
  3. Append s to groupNames.
9. Else,
  1. Append undefined to groupNames.
If hasIndices is true, then
1. Let indicesArray be ! MakeIndicesArray(S, indices, groupNames, hasGroups).
2. Perform ! CreateDataProperty(A, "indices", indicesArray).
Return A.

1.1.4.1.2 GetStringIndex ( `S`, `Input`, `e` )

The abstract operation GetStringIndex with with arguments S, Input, and e performs the following steps:

Assert: Type(S) is String.
Assert: Input is a List of the code points of S interpreted as a UTF-16 encoded string.
Assert: e is an integer value ≥ 0.
If S is the empty String, return 0.
Let eUTF be the smallest index into S that corresponds to the character at element e of Input. If e is greater than or equal to the number of elements in Input, then eUTF is the number of code units in S.
Return eUTF.

1.1.4.1.3 GetMatchString ( `S`, `match` )

The abstract operation GetMatchString with arguments S and match performs the following steps:

Assert: Type(S) is String.
Assert: match is a Match Record.
Assert: match.[[StartIndex]] is an integer value ≥ 0 and ≤ the length of S.
Assert: match.[[EndIndex]] is an integer value ≥ match.[[StartIndex]] and ≤ the length of S.
Return the portion of S between offset match.[[StartIndex]] inclusive and offset match.[[EndIndex]] exclusive.

1.1.4.1.4 GetMatchIndicesArray ( `S`, `match` )

The abstract operation GetMatchIndicesArray with arguments S and match performs the following steps:

Assert: Type(S) is String.
Assert: match is a Match Record.
Assert: match.[[StartIndex]] is an integer value ≥ 0 and ≤ the length of S.
Assert: match.[[EndIndex]] is an integer value ≥ match.[[StartIndex]] and ≤ the length of S.
Return CreateArrayFromList(« match.[[StartIndex]], match.[[EndIndex]] »).

1.1.4.1.5 MakeIndicesArray ( `S` , `indices`, `groupNames`, `hasGroups` )

The abstract operation MakeIndicesArray with arguments S, indices, groupNames, and hasGroups performs the following steps:

Assert: Type(S) is String.
Assert: indices is a List.
Assert: Type(hasGroups) is Boolean.
Let n be the number of elements in indices.
Assert: n < 2³²-1.
Assert: groupNames is a List with n - 1 elements.
NOTE: The groupNames List contains elements aligned with the indices List starting at indices[1].
Set A to ! ArrayCreate(n).
Assert: The value of A's "length" property is n.
If hasGroups is true, then
1. Let groups be ! ObjectCreate(null).
Else,
1. Let groups be undefined.
Perform ! CreateDataProperty(A, "groups", groups).
For each integer i such that i ≥ 0 and i < n, do
1. Let matchIndices be indices[i].
2. If matchIndices is not undefined, then
  1. Let matchIndicesArray be ! GetMatchIndicesArray(S, matchIndices).
3. Else,
  1. Let matchIndicesArray be undefined.
4. Perform ! CreateDataProperty(A, ! ToString(i), matchIndicesArray).
5. If i > 0 and groupNames[i - 1] is not undefined, then
  1. Perform ! CreateDataProperty(groups, groupNames[i - 1], matchIndicesArray).
Return A.

1.1.4.2 get RegExp.prototype.flags

RegExp.prototype.flags is an accessor property whose set accessor function is undefined. Its get accessor function performs the following steps:

Let R be the this value.
If Type(R) is not Object, throw a TypeError exception.
Let result be the empty String.
Let hasIndices be ! ToBoolean(? Get(R, "hasIndices")).
If hasIndices is true, append the code unit 0x0064 (LATIN SMALL LETTER D) as the last code unit of result.
Let global be ! ToBoolean(? Get(R, "global")).
If global is true, append the code unit 0x0067 (LATIN SMALL LETTER G) as the last code unit of result.
Let ignoreCase be ! ToBoolean(? Get(R, "ignoreCase")).
If ignoreCase is true, append the code unit 0x0069 (LATIN SMALL LETTER I) as the last code unit of result.
Let multiline be ! ToBoolean(? Get(R, "multiline")).
If multiline is true, append the code unit 0x006D (LATIN SMALL LETTER M) as the last code unit of result.
Let dotAll be ! ToBoolean(? Get(R, "dotAll")).
If dotAll is true, append the code unit 0x0073 (LATIN SMALL LETTER S) as the last code unit of result.
Let unicode be ! ToBoolean(? Get(R, "unicode")).
If unicode is true, append the code unit 0x0075 (LATIN SMALL LETTER U) as the last code unit of result.
Let sticky be ! ToBoolean(? Get(R, "sticky")).
If sticky is true, append the code unit 0x0079 (LATIN SMALL LETTER Y) as the last code unit of result.
Return result.

1.1.4.3 get RegExp.prototype.hasIndices

RegExp.prototype.hasIndices is an accessor property whose set accessor function is undefined. Its get accessor function performs the following steps:

Let R be the this value.
If Type(R) is not Object, throw a TypeError exception.
If R does not have an [[OriginalFlags]] internal slot, then
1. If SameValue(R, %RegExp.prototype%) is true, return undefined.
2. Otherwise, throw a TypeError exception.
Let flags be R.[[OriginalFlags]].
If flags contains the code unit 0x0064 (LATIN SMALL LETTER D), return true.
Return false.

A Copyright & Software License

Copyright Notice

Software License

All Software contained in this document ("Software") is protected by copyright and is being made available under the "BSD License", included below. This Software may be subject to third party rights (rights from parties other than Ecma International), including patent rights, and no licenses under such third party rights are granted under this license even if the third party concerned is a member of Ecma International. SEE THE ECMA CODE OF CONDUCT IN PATENT MATTERS AVAILABLE AT https://ecma-international.org/memento/codeofconduct.htm FOR INFORMATION REGARDING THE LICENSING OF PATENT CLAIMS THAT ARE REQUIRED TO IMPLEMENT ECMA INTERNATIONAL STANDARDS.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the authors nor Ecma International may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE ECMA INTERNATIONAL "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL ECMA INTERNATIONAL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.