#3477 — Fully specify the behaviour of backslash followed by digits in string literals and template literals

bug_id: 3477
creation_ts: 2014-12-18 06:49:00 -0800
short_desc: Fully specify the behaviour of backslash followed by digits in string literals and template literals
delta_ts: 2015-07-10 08:34:59 -0700
product: Draft for 7th Edition
component: Deferred from 6th edition
version: unspecified
rep_platform: All
op_sys: All
bug_status: CONFIRMED
priority: Normal
bug_severity: enhancement
everconfirmed: true
reporter: Claude Pache
assigned_to: Allen Wirfs-Brock
cc: ["bugs.ecmascript", "caitpotter88", "erights", "erik.arvidsson", "gabelevi", "jorendorff", "mathias", "oliver"]

commentid: 11094
comment_count: 0
who: Claude Pache
bug_when: 2014-12-18 06:49:07 -0800

This is a follow-up of Bug 1553 and Bug 3212.

Let's consider the general case of a backslash followed by one or more digits in a string literal. We can distinguish four cases:

A. "\0"
-------
This is interpreted as the NUL character.

B. "\[89][0-9]*"
----------------
This is interpreted as literal characters (i.e., as if the backslash wasn't there).

C. "\0[89][0-9]*"
-----------------
This is interpreted as NUL followed by literal characters.

D. All other cases: "\[1-7][0-9]*" and "\0[1-7][0-9]*"
------------------------------------------------------
This is interpreted as a legacy octal escape sequence producing a character of code 0-255, possibly followed by literal characters.

--

In sloppy mode, engines seem to accept all the forms above and interpret them as described (although I have not fully tested them).
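As a minimal sketch of how the four forms can be checked in sloppy mode: the helper below (the names `evalSloppy` and `codes` are made up for illustration) builds the literal from source text with the Function constructor, whose body is sloppy unless it carries its own "use strict" directive. The code units in the comments are what the descriptions above predict.

```javascript
// Evaluate the body of a double-quoted string literal as sloppy-mode code.
// Code created by the Function constructor does not inherit the caller's strictness.
function evalSloppy(literalBody) {
  return new Function('return "' + literalBody + '";')();
}

// List the code units of a string so that NUL and control characters are visible.
const codes = (s) => Array.from(s, (c) => c.charCodeAt(0));

console.log(codes(evalSloppy("\\0")));    // form A: [0]
console.log(codes(evalSloppy("\\89")));   // form B: [56, 57]
console.log(codes(evalSloppy("\\089")));  // form C: [0, 56, 57]
console.log(codes(evalSloppy("\\101")));  // form D: [65], i.e. "A"
console.log(codes(evalSloppy("\\0101"))); // form D: [8, 49], i.e. \010 then "1"
```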

In strict mode (and in template literals, which follow the rules of strict mode), form A is explicitly allowed, and form D is explicitly forbidden. For forms B and C, according to Bug 1553 Comment 2, the behaviour of engines varies between accepting both (V8/Presto), accepting only B (IE/SpiderMonkey), and throwing a SyntaxError on both (JSC).
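A rough way to probe an engine for the strict-mode and template-literal behaviour is sketched below. The helper name `probe` is hypothetical. The outcome for forms B and C is deliberately not stated, since that is exactly where engines differ; only forms A and D are expected to give the same answer everywhere.

```javascript
// Evaluate an expression as strict-mode code and report either the code units
// of the resulting string or the name of the error that was thrown.
function probe(source) {
  try {
    const value = new Function('"use strict"; return ' + source + ';')();
    return Array.from(value, (c) => c.charCodeAt(0));
  } catch (e) {
    return e.name;
  }
}

const forms = { A: "\\0", B: "\\8", C: "\\08", D: "\\07" };
for (const [name, esc] of Object.entries(forms)) {
  console.log(
    name,
    "string:", probe('"' + esc + '"'),
    "template:", probe("`" + esc + "`")
  );
}
```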

As for regexps, engines seem to be very sloppy even in strict mode :-(
If something is to be said about them, I think it is better to open a separate bug.

--

I believe that forms A and D are fully specified. I plan to check the current spec and implementations, and to propose an update of the spec text of annex B that also includes cases B and C (for string literals in sloppy mode) before the end of the year. The only formal decision to be taken here is whether forms B and C should be explicitly forbidden in strict mode and/or in template literals.

commentid: 11096
comment_count: 1
who: Claude Pache
bug_when: 2014-12-18 06:54:05 -0800

*** Bug 1553 has been marked as a duplicate of this bug. ***

commentid: 11098
comment_count: 2
who: Claude Pache
bug_when: 2014-12-18 06:54:35 -0800

*** Bug 3212 has been marked as a duplicate of this bug. ***

commentid: 11099
comment_count: 3
who: Claude Pache
bug_when: 2014-12-18 06:56:10 -0800

IMHO, the best behaviour in strict mode is the one of JSC, that is disallowing forms B and C. The reason is that, in these cases, if you replace the first 8 or 9 with any digit between 1 and 7, you fall into case D, and it is better to have something that behaves consistently across all digits.

commentid: 11100
comment_count: 4
who: Caitlin Potter [:caitp]
bug_when: 2014-12-18 07:15:33 -0800

Erik's position (in https://crrev.com/811113002/) is that \0 followed by any digit should be legal (in template literals), such that

EscapeSequence :: 0 [lookahead ∉ DecimalDigit]

becomes

EscapeSequence :: 0

And \[1-9] would then need to match CharacterEscapeSequence

It's different from the current strict mode behaviour in all engines, but it doesn't seem unreasonable --- just not what is specified

commentid: 11101
comment_count: 5
who: Erik Arvidsson
bug_when: 2014-12-18 07:49:13 -0800

I wish B, C and D were all syntax errors in strict mode, but I'm afraid that changing this might lead to breaking web sites. However, if JSC gets away with it, I'm fine speccing that and hope that we can make changes to the other engines.

commentid: 11105
comment_count: 6
who: Claude Pache
bug_when: 2014-12-18 08:50:26 -0800

(In reply to Caitlin Potter [:caitp] from comment #4)
> Erik's position (in https://crrev.com/811113002/) is that \0 followed by any
> digit should be legal (in template literals), such that
>
> EscapeSequence :: 0 [lookahead ∉ DecimalDigit]
>
> becomes
>
> EscapeSequence :: 0
>
> And \[1-9] would then need to match CharacterEscapeSequence
>
> It's different from the current strict mode behaviour in all engines, but it
> doesn't seem unreasonable --- just not what is specified

Right, and if we want to be precise, we should make a distinction between:
1. forbidding a given extension (e.g., legacy octal escape sequence);
2. requiring a SyntaxError to be thrown on some escape sequences.
This will make a difference in what engines are allowed to do with \07, for example.

commentid: 11107
comment_count: 7
who: Allen Wirfs-Brock
bug_when: 2014-12-18 09:14:54 -0800

added Jason and Oliver to the CC list

commentid: 11134
comment_count: 8
attachid: 77
who: Claude Pache
bug_when: 2014-12-23 05:06:45 -0800

Created attachment 77
Tests for string literals in sloppy and strict modes

Here are tests of the behaviour of implementations in string literals.

The last two tests check whether forms C and B respectively throw a SyntaxError; this is where current implementations differ. (The results are those announced in Comment 0 and Bug 1553 Comment 2.)

commentid: 11135
comment_count: 9
who: Claude Pache
bug_when: 2014-12-23 06:43:37 -0800

Completing the spec in sloppy mode is remarkably simple: It suffices to replace the following definition of EscapeCharacter (11.8.4)

EscapeCharacter ::
SingleEscapeCharacter
DecimalDigit
x
u

by the following alternative one (to be introduced in B.1.2 probably, although it will depend on what we want for strict mode and templates):

EscapeCharacter ::
SingleEscapeCharacter
OctalDigit
x
u

That would effectively add 8 and 9 to the NonEscapeCharacter production. Other cases are just special cases of LegacyOctalEscapeSequence.
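To make the effect of that one-line change concrete, here is a small sketch that classifies the character following the backslash under both definitions. All names in it are invented for illustration, and it models only the EscapeCharacter and NonEscapeCharacter productions, not the whole string grammar.

```javascript
const SINGLE_ESCAPE = new Set(["'", '"', "\\", "b", "f", "n", "r", "t", "v"]);
const LINE_TERMINATORS = new Set(["\n", "\r", "\u2028", "\u2029"]);

const DECIMAL_DIGIT = /^[0-9]$/; // current 11.8.4 definition
const OCTAL_DIGIT = /^[0-7]$/;   // alternative definition suggested above

// EscapeCharacter :: SingleEscapeCharacter | <digits> | x | u
function isEscapeCharacter(ch, digits) {
  return SINGLE_ESCAPE.has(ch) || digits.test(ch) || ch === "x" || ch === "u";
}

// NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator
function isNonEscapeCharacter(ch, digits) {
  return !isEscapeCharacter(ch, digits) && !LINE_TERMINATORS.has(ch);
}

for (const ch of ["7", "8", "9", "a"]) {
  console.log(ch,
    "DecimalDigit variant:", isNonEscapeCharacter(ch, DECIMAL_DIGIT),
    "OctalDigit variant:", isNonEscapeCharacter(ch, OCTAL_DIGIT));
}
```

With the OctalDigit variant, 8 and 9 become NonEscapeCharacter, so \8 and \9 denote the digits themselves, while 0-7 stay out of NonEscapeCharacter in both variants.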

commentid: 11136
comment_count: 10
who: Claude Pache
bug_when: 2014-12-23 07:18:37 -0800

In case there is an interest in fully specifying the behaviour in strict mode, we need a decision of what to do with the following:

(a) \07
(b) \7
(c) \08
(d) \8

Options are:

(1) leave undefined (but do not implement the legacy octal escapes);
(2) throw a SyntaxError;
(3) interpret as \0 = NUL, \1 = 1, etc.

The current spec uses (1); implementations use (2) or (3), depending (or not) on the case.

The same question arises for template literals. According to Comment 4, the answer might be different.

commentid: 11137
comment_count: 11
who: Caitlin Potter [:caitp]
bug_when: 2014-12-23 07:50:35 -0800

(1) and (2) are essentially equivalent, as there is no valid production to be made from each of those (in the current spec).

It comes down to two things: A) should octals be parsed and result in a syntax error in strict mode, or B) should they not be parsed at all, with related productions changed to accommodate.

The rationale for B) is that it doesn't make sense to throw a syntax error; it's strange that the legacy octal escapes are being considered at all (that the grammar is defined in this way explicitly to make it a syntax error to use something that looks like a numeric literal).

So here's a proposal:

```
11.8.4

EscapeSequence ::
CharacterEscapeSequence
0 (previously `0 [lookahead ∉ DecimalDigit]`)
HexEscapeSequence
UnicodeEscapeSequence

EscapeCharacter ::
SingleEscapeCharacter
0 (previously `DecimalDigit`)
x
u

B.1.2

EscapeSequence ::
CharacterEscapeSequence
LegacyOctalEscapeSequence
HexEscapeSequence
UnicodeEscapeSequence

EscapeCharacter ::
SingleEscapeCharacter
OctalDigit (previously replacing `0` in strict mode)
x
u
```

Following this,

- "\0" :: U+0000 (strict and sloppy)
- "\1" :: U+0031 (strict), U+0001 (sloppy)
- "\2" :: U+0032 (strict), U+0002 (sloppy)
- "\3" :: U+0033 (strict), U+0003 (sloppy)
- "\4" :: U+0034 (strict), U+0004 (sloppy)
- "\5" :: U+0035 (strict), U+0005 (sloppy)
- "\6" :: U+0036 (strict), U+0006 (sloppy)
- "\7" :: U+0037 (strict), U+0007 (sloppy)
- "\8" :: U+0038 (strict and sloppy)
- "\9" :: U+0039 (strict and sloppy)
- "\00" :: U+0000 U+0030 (strict), U+0000 (sloppy)
- "\01" :: U+0000 U+0031 (strict), U+0001 (sloppy)
- "\02" :: U+0000 U+0032 (strict), U+0002 (sloppy)
- "\03" :: U+0000 U+0033 (strict), U+0003 (sloppy)
- "\04" :: U+0000 U+0034 (strict), U+0004 (sloppy)
- "\05" :: U+0000 U+0035 (strict), U+0005 (sloppy)
- "\06" :: U+0000 U+0036 (strict), U+0006 (sloppy)
- "\07" :: U+0000 U+0037 (strict), U+0007 (sloppy)
- "\08" :: U+0000 U+0038 (strict and sloppy)
- "\09" :: U+0000 U+0039 (strict and sloppy)

In my mind, it's low-risk because non-interoperable legacy octal behaviour in strict mode is probably not something many applications depend on, and it seems to make sense, but does prevent introducing octal escape sequences to strict mode later.
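The table above can be reproduced with a small decoder. This is only a sketch of the proposed semantics, not spec text: `decodeDigitEscape` and `fmt` are hypothetical names, the input is the run of characters after the backslash (assumed to start with a digit), and the sloppy branch follows the legacy octal rules (at most three octal digits, value at most 0o377).

```javascript
function decodeDigitEscape(body, strict) {
  const first = body[0];
  if (strict) {
    // Proposal: \0 is always NUL; \1 to \9 are the digit itself.
    const head = first === "0" ? "\u0000" : first;
    return head + body.slice(1);
  }
  // Sloppy mode: \8 and \9 are plain characters.
  if (first === "8" || first === "9") return body;
  // Sloppy mode: legacy octal escape, longest match.
  // A leading 0-3 allows three octal digits, a leading 4-7 only two.
  const maxLen = first <= "3" ? 3 : 2;
  let len = 1;
  while (len < maxLen && len < body.length && body[len] >= "0" && body[len] <= "7") {
    len++;
  }
  return String.fromCharCode(parseInt(body.slice(0, len), 8)) + body.slice(len);
}

// Format a string as a list of U+XXXX code units.
const fmt = (s) =>
  Array.from(s, (c) => "U+" + c.charCodeAt(0).toString(16).toUpperCase().padStart(4, "0")).join(" ");

for (const body of ["0", "7", "8", "00", "07", "08"]) {
  console.log("\\" + body, "::",
    fmt(decodeDigitEscape(body, true)), "(strict),",
    fmt(decodeDigitEscape(body, false)), "(sloppy)");
}
```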

commentid: 11138
comment_count: 12
who: Claude Pache
bug_when: 2014-12-23 08:21:01 -0800

(In reply to Caitlin Potter [:caitp] from comment #11)
> (1) and (2) are essentially equivalent, as there is no valid production to
> be made from each of those (in the current spec).

They are not equivalent if implementations are allowed to extend the specced syntax. (And they often do extend.)

> (...) it's strange that the legacy octal escapes are being considered at all (that
> the grammar is defined in this way explicitly to make it a syntax error to
> use something that looks like a numeric literal).

It's useful for avoiding refactoring hazards. For example, if you put "use strict" at the top of the file, you are loudly notified that something needs to be amended in your string literal 80 lines below, instead of having its semantics silently modified.
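A minimal illustration of that hazard, assuming an engine that rejects legacy octal escapes in strict code as the spec requires (the variable names are made up): the same function body is compiled once as is and once with a "use strict" directive prepended.

```javascript
// Source text of the body is:  return "\101";
const body = 'return "\\101";';

// Sloppy: \101 is a legacy octal escape for U+0041.
const sloppyResult = new Function(body)();

// Strict: the literal is rejected when the function is compiled,
// so the problem is reported instead of the meaning changing silently.
let strictResult;
try {
  strictResult = new Function('"use strict"; ' + body)();
} catch (e) {
  strictResult = e.name;
}

console.log(sloppyResult); // "A"
console.log(strictResult); // "SyntaxError"
```

Under option (3) of comment 10, the strict variant would instead evaluate without error to a different string, which is the silent change being warned about here.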

commentid: 12919
comment_count: 13
who: Allen Wirfs-Brock
bug_when: 2015-02-18 12:01:47 -0800

I like Caitlin's proposal in comment 11, but I think it needs to be formally considered by TC39.

Too late to do that for ES6 but we can do it for ES7.

Changing this ticket to ES7 deferred
