
#1574 — 15.10.2 summary of "web reality" regexp syntax extensions

There has been plenty of discussion of "web reality" syntax extensions to the ES5.1 regexp syntax. Over the past few years the main browsers seem to have mostly converged on how to interpret malformed patterns.

There seem to be 5 "web reality" extensions:
- Handling of malformed {n,m} quantifiers
- Quantifiable lookahead assertions
- Octal escapes and their interaction with Decimal escapes (back references)
- Malformed class ranges
- Malformed control escapes (\cA)
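
As a rough illustration of the five categories above, here is a minimal sketch with one sample pattern per extension. The sample patterns, the `samples` array and the `trySample` helper are my own choices for illustration and are not taken from any browser test suite. An engine that implements the extensions should report a match for each sample; a strict ES5.1 parser would be expected to reject them.

```javascript
// One illustrative pattern per extension. Patterns are built with
// new RegExp so that an engine rejecting one reports "SyntaxError"
// instead of failing to parse the whole script.
const samples = [
  { name: "malformed {n,m} quantifier", source: "a{,2}", input: "a{,2}" },
  { name: "quantified lookahead", source: "(?=a)*b", input: "b" },
  { name: "octal escape", source: "\\101", input: "A" },
  { name: "malformed class range", source: "[\\d-x]", input: "-" },
  { name: "malformed control escape", source: "\\c", input: "\\c" },
];

// Hypothetical helper: compile the pattern and test it against the input.
function trySample({ source, input }) {
  try {
    return new RegExp(source).test(input) ? "matches" : "no match";
  } catch (e) {
    return "SyntaxError";
  }
}

const results = samples.map((s) => ({ name: s.name, result: trySample(s) }));
for (const r of results) {
  console.log(`${r.name}: ${r.result}`);
}
```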

Rather than writing a lengthy prose description, I find that the Yarr parser captures the "web reality" extensions as implemented by IE, FF, Chrome and Safari, and does so in fewer than 1000 lines of heavily commented code.

In this area, there seems to be only one browser disagreement left:
- FF and Safari accept the octal escape \0000 as identical to \0
- Chrome and IE understand \0000 as \000 (i.e. \0) followed by '0'
The latter implementation seems to make more sense, as it matches the handling of octal escapes in string literals in all 4 browsers. The Yarr parser implements the former.

The handling of \0000 seems to be a Yarr implementation bug:
- Yarr limits octals to \0377
- Pre-Yarr Firefox has the same behaviour as IE and Chrome today
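
The two readings of \0000 can be told apart with a small probe. This is only a sketch: the function name `classifyFourDigitOctal` and the result labels are made up for illustration, and the outcome depends on the engine the script runs in.

```javascript
// Classify how the running engine reads the pattern \0000.
// NUL is written as \x00 because octal escapes in string literals
// are rejected in strict-mode code.
function classifyFourDigitOctal() {
  const re = new RegExp("^\\0000$");
  // Reading 1: \000 (NUL) followed by the literal character '0'.
  if (re.test("\x00" + "0")) return "\\000 followed by '0'";
  // Reading 2: all four digits consumed as a single NUL.
  if (re.test("\x00")) return "same as \\0";
  return "other";
}

console.log(classifyFourDigitOctal());
```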

There is another difference: the handling of \8 and \9 in patterns when there is no eighth or ninth capture group:
- IE and Chrome say \8 equals '8' (identity escape, as in literal strings)
- Safari and FF say \8 equals '\' + '8' (malformed pattern taken as literal atoms)

In this case it is not a Yarr issue, as pre-Yarr FF handles this the same way as current FF. I guess either interpretation could be argued to conform to the principle of least surprise.
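
The \8 difference can be checked in the same way. Again a sketch only: `classifyEscapedEight` and its labels are hypothetical names, and the result is whatever the running engine does.

```javascript
// Classify how the running engine reads \8 when the pattern
// has no capture groups at all.
function classifyEscapedEight() {
  const re = new RegExp("^\\8$");
  // Identity escape: \8 stands for the single character '8'.
  if (re.test("8")) return "identity escape";
  // Literal atoms: a backslash character followed by '8'.
  if (re.test("\\8")) return "literal backslash then 8";
  return "other";
}

console.log(classifyEscapedEight());
```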

My gut feeling is that this pattern is very rarely seen in "web reality", but I have no data to support this.

Rev20 Annex B now contains the "web reality" spec prepared by Luke Hoban.

If you see any issues with it, you should report them as separate bugs.