11 ECMAScript Language: Source Text

11.1 Source Text

Syntax

SourceCharacter :: any Unicode code point

ECMAScript code is expressed using Unicode. ECMAScript source text is a sequence of code points. All Unicode code point values from U+0000 to U+10FFFF, including surrogate code points, may occur in source text where permitted by the ECMAScript grammars. The actual encodings used to store and interchange ECMAScript source text is not relevant to this specification. Regardless of the external source text encoding, a conforming ECMAScript implementation processes the source text as if it was an equivalent sequence of SourceCharacter values, each SourceCharacter being a Unicode code point. Conforming ECMAScript implementations are not required to perform any normalization of source text, or behave as though they were performing normalization of source text.

The components of a combining character sequence are treated as individual Unicode code points even though a user might think of the whole sequence as a single character.

Note

In string literals, regular expression literals, template literals and identifiers, any Unicode code point may also be expressed using Unicode escape sequences that explicitly express a code point's numeric value. Within a comment, such an escape sequence is effectively ignored as part of the comment.

ECMAScript differs from the Java programming language in the behaviour of Unicode escape sequences. In a Java program, if the Unicode escape sequence \u000A, for example, occurs within a single-line comment, it is interpreted as a line terminator (Unicode code point U+000A is LINE FEED (LF)) and therefore the next code point is not part of the comment. Similarly, if the Unicode escape sequence \u000A occurs within a string literal in a Java program, it is likewise interpreted as a line terminator, which is not allowed within a string literal—one must write \n instead of \u000A to cause a LINE FEED (LF) to be part of the String value of a string literal. In an ECMAScript program, a Unicode escape sequence occurring within a comment is never interpreted and therefore cannot contribute to termination of the comment. Similarly, a Unicode escape sequence occurring within a string literal in an ECMAScript program always contributes to the literal and is never interpreted as a line terminator or as a code point that might terminate the string literal.

11.1.1 Static Semantics: UTF16EncodeCodePoint ( cp )

The abstract operation UTF16EncodeCodePoint takes argument cp (a Unicode code point) and returns a String. It performs the following steps when called:

  1. Assert: 0 ≤ cp ≤ 0x10FFFF.
  2. If cp ≤ 0xFFFF, return the String value consisting of the code unit whose value is cp.
  3. Let cu1 be the code unit whose value is floor((cp - 0x10000) / 0x400) + 0xD800.
  4. Let cu2 be the code unit whose value is ((cp - 0x10000) modulo 0x400) + 0xDC00.
  5. Return the string-concatenation of cu1 and cu2.

11.1.2 Static Semantics: CodePointsToString ( text )

The abstract operation CodePointsToString takes argument text (a sequence of Unicode code points) and returns a String. It converts text into a String value, as described in 6.1.4. It performs the following steps when called:

  1. Let result be the empty String.
  2. For each code point cp of text, do
    1. Set result to the string-concatenation of result and UTF16EncodeCodePoint(cp).
  3. Return result.

11.1.3 Static Semantics: UTF16SurrogatePairToCodePoint ( lead, trail )

The abstract operation UTF16SurrogatePairToCodePoint takes arguments lead (a code unit) and trail (a code unit) and returns a code point. Two code units that form a UTF-16 surrogate pair are converted to a code point. It performs the following steps when called:

  1. Assert: lead is a leading surrogate and trail is a trailing surrogate.
  2. Let cp be (lead - 0xD800) × 0x400 + (trail - 0xDC00) + 0x10000.
  3. Return the code point cp.

11.1.4 Static Semantics: CodePointAt ( string, position )

The abstract operation CodePointAt takes arguments string (a String) and position (a non-negative integer) and returns a Record with fields [[CodePoint]] (a code point), [[CodeUnitCount]] (a positive integer), and [[IsUnpairedSurrogate]] (a Boolean). It interprets string as a sequence of UTF-16 encoded code points, as described in 6.1.4, and reads from it a single code point starting with the code unit at index position. It performs the following steps when called:

  1. Let size be the length of string.
  2. Assert: position ≥ 0 and position < size.
  3. Let first be the code unit at index position within string.
  4. Let cp be the code point whose numeric value is that of first.
  5. If first is not a leading surrogate or trailing surrogate, then
    1. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: false }.
  6. If first is a trailing surrogate or position + 1 = size, then
    1. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
  7. Let second be the code unit at index position + 1 within string.
  8. If second is not a trailing surrogate, then
    1. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
  9. Set cp to UTF16SurrogatePairToCodePoint(first, second).
  10. Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 2, [[IsUnpairedSurrogate]]: false }.

11.1.5 Static Semantics: StringToCodePoints ( string )

The abstract operation StringToCodePoints takes argument string (a String) and returns a List of code points. It returns the sequence of Unicode code points that results from interpreting string as UTF-16 encoded Unicode text as described in 6.1.4. It performs the following steps when called:

  1. Let codePoints be a new empty List.
  2. Let size be the length of string.
  3. Let position be 0.
  4. Repeat, while position < size,
    1. Let cp be CodePointAt(string, position).
    2. Append cp.[[CodePoint]] to codePoints.
    3. Set position to position + cp.[[CodeUnitCount]].
  5. Return codePoints.

11.1.6 Static Semantics: ParseText ( sourceText, goalSymbol )

The abstract operation ParseText takes arguments sourceText (a sequence of Unicode code points) and goalSymbol (a nonterminal in one of the ECMAScript grammars) and returns a Parse Node or a non-empty List of SyntaxError objects. It performs the following steps when called:

  1. Attempt to parse sourceText using goalSymbol as the goal symbol, and analyse the parse result for any early error conditions. Parsing and early error detection may be interleaved in an implementation-defined manner.
  2. If the parse succeeded and no early errors were found, return the Parse Node (an instance of goalSymbol) at the root of the parse tree resulting from the parse.
  3. Otherwise, return a List of one or more SyntaxError objects representing the parsing errors and/or early errors. If more than one parsing error or early error is present, the number and ordering of error objects in the list is implementation-defined, but at least one must be present.
Note 1

Consider a text that has an early error at a particular point, and also a syntax error at a later point. An implementation that does a parse pass followed by an early errors pass might report the syntax error and not proceed to the early errors pass. An implementation that interleaves the two activities might report the early error and not proceed to find the syntax error. A third implementation might report both errors. All of these behaviours are conformant.

Note 2

See also clause 17.

11.2 Types of Source Code

There are four types of ECMAScript code:

Note 1

Function code is generally provided as the bodies of Function Definitions (15.2), Arrow Function Definitions (15.3), Method Definitions (15.4), Generator Function Definitions (15.5), Async Function Definitions (15.8), Async Generator Function Definitions (15.6), and Async Arrow Functions (15.9). Function code is also derived from the arguments to the Function constructor (20.2.1.1), the GeneratorFunction constructor (27.3.1.1), and the AsyncFunction constructor (27.7.1.1).

Note 2

The practical effect of including the BindingIdentifier in function code is that the Early Errors for strict mode code are applied to a BindingIdentifier that is the name of a function whose body contains a "use strict" directive, even if the surrounding code is not strict mode code.

11.2.1 Directive Prologues and the Use Strict Directive

A Directive Prologue is the longest sequence of ExpressionStatements occurring as the initial StatementListItems or ModuleItems of a FunctionBody, a ScriptBody, or a ModuleBody and where each ExpressionStatement in the sequence consists entirely of a StringLiteral token followed by a semicolon. The semicolon may appear explicitly or may be inserted by automatic semicolon insertion (12.9). A Directive Prologue may be an empty sequence.

A Use Strict Directive is an ExpressionStatement in a Directive Prologue whose StringLiteral is either of the exact code point sequences "use strict" or 'use strict'. A Use Strict Directive may not contain an EscapeSequence or LineContinuation.

A Directive Prologue may contain more than one Use Strict Directive. However, an implementation may issue a warning if this occurs.

Note

The ExpressionStatements of a Directive Prologue are evaluated normally during evaluation of the containing production. Implementations may define implementation specific meanings for ExpressionStatements which are not a Use Strict Directive and which occur in a Directive Prologue. If an appropriate notification mechanism exists, an implementation should issue a warning if it encounters in a Directive Prologue an ExpressionStatement that is not a Use Strict Directive and which does not have a meaning defined by the implementation.

11.2.2 Strict Mode Code

An ECMAScript syntactic unit may be processed using either unrestricted or strict mode syntax and semantics (4.3.2). Code is interpreted as strict mode code in the following situations:

ECMAScript code that is not strict mode code is called non-strict code.

11.2.3 Non-ECMAScript Functions

An ECMAScript implementation may support the evaluation of function exotic objects whose evaluative behaviour is expressed in some host-defined form of executable code other than via ECMAScript code. Whether a function object is an ECMAScript code function or a non-ECMAScript function is not semantically observable from the perspective of an ECMAScript code function that calls or is called by such a non-ECMAScript function.