#2367 — 21.2.2: BMP pattern and Unicode pattern


Section 21.2.2 defines a regular expression pattern as either a "BMP pattern" or a "Unicode pattern". These names are a bit odd because both kinds of pattern are expressed as Unicode strings, and the BMP is a subset of Unicode. I think the more relevant distinction is that in one mode the pattern (and the strings matched against it) are processed code unit by code unit, while in the other mode they're processed code point by code point.
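
To illustrate the distinction, here is a minimal sketch, assuming an engine that implements the "u" flag. The sample string and the variable names are illustrative only and do not come from the specification text.

```javascript
// U+1F600 lies outside the BMP, so it occupies two UTF-16 code units.
const s = "\u{1F600}";

// Without the "u" flag, "." consumes one code unit at a time,
// so each half of the surrogate pair is matched separately.
const byCodeUnit = s.match(/./g);

// With the "u" flag, "." consumes one whole code point.
const byCodePoint = s.match(/./gu);

console.log(s.length);           // 2
console.log(byCodeUnit.length);  // 2
console.log(byCodePoint.length); // 1
console.log(/^.$/.test(s));      // false
console.log(/^.$/u.test(s));     // true
```

Both patterns are written with the same source text; only the unit of processing differs, which is the property the proposed names try to capture.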

I suggest calling them "code unit pattern" and "code point pattern".


Note we use "u" to designate a "code point pattern" even though a "code unit pattern" is also a subset of Unicode (and even has a word in its description that starts with "u").

We are defining terminology for internal clarity within the specification. From that perspective, "code point pattern" and "code unit pattern" are too visually and cognitively similar, and are likely to be misread occasionally.

"BMP pattern" and "Unicode pattern" are more distinguishable in this regard.