#240 — UTF-8 BOM makes parseTestRecord fail for certain tests


When running the tests for ch07 or ch15, the command line test runner fails with this message:

Traceback (most recent call last):
  File "tools/packaging/test262.py", line 453, in <module>
    Main()
  File "tools/packaging/test262.py", line 448, in Main
    options.full_summary)
  File "tools/packaging/test262.py", line 412, in Run
    cases = self.EnumerateTests(tests)
  File "tools/packaging/test262.py", line 363, in EnumerateTests
    strict_case = TestCase(self, name, full_path, True)
  File "tools/packaging/test262.py", line 185, in __init__
    del testRecord["commentary"]
KeyError: 'commentary'

The problem seems to be that some files start with a UTF-8 Byte Order Mark (which ends up in the source string) causing the parseTestRecord function to fail.

My guess is that there are two approaches to fixing this:

1. Removing the BOM from the offending files. Using the following command, I produced a list of these files:

find ./test/suite -name '*.js' | xargs file | grep BOM

./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-2-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-3-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-4-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-5-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-6-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-7-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-8-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-9-s.js
./test/suite/ch15/15.11/15.11.4/15.11.4.2/15.11.4.2-1.js
./test/suite/ch15/15.11/15.11.4/15.11.4.3/15.11.4.3-1.js

I'm not sure if the BOM is there for any reason. Probably not.

2. Changing the way the files are read.

As a quick hack, I changed TestCase.__init__ in tools/packaging/test262.py by replacing the call to 'open' with 'codecs.open':

-f = open(self.full_path)
+f = codecs.open(self.full_path, "r", "utf-8-sig")

With this change, Python skips the BOM when reading the file. This also needs an extra line 'import codecs' at the top of the file.
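
The same idea can be sketched as a small standalone helper. This is only an illustration: `read_source` is a hypothetical name, not a function in test262.py, and it uses `io.open` with an `encoding` argument instead of `codecs.open` so that it runs on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def read_source(path):
    """Return the file's text, dropping a leading UTF-8 BOM if present.

    Hypothetical helper for illustration; not part of test262.py.
    """
    # "utf-8-sig" decodes UTF-8 and skips a BOM at the start of the file.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


if __name__ == "__main__":
    # Write a small file that starts with a UTF-8 BOM, then read it back.
    fd, path = tempfile.mkstemp(suffix=".js")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\xef\xbb\xbf// test")
        print(repr(read_source(path)))
    finally:
        os.remove(path)
```

Like `codecs.open`, this returns a unicode string under Python 2, so any downstream code that assumes byte strings may need adjusting.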

Tested with Python 2.7.1 on OS X 10.7.2 against revision 282, changeset 99ff7d59530c.


Changing open(self.full_path) to open(self.full_path, 'rb') might be a smaller fix.


(In reply to comment #1)
> Changing open(self.full_path) to open(self.full_path, 'rb') might be a smaller
> fix.

In fact, specifying 'rb' does not change the way the BOM is handled. To illustrate this, I created a text file containing the word 'test' and saved it as UTF-8 with a BOM.

Reading this file in different ways then yields these results:

Python 2.7.1 (r271:86832, Jun 25 2011, 05:09:01)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> open("withbom.txt").read()
'\xef\xbb\xbftest'
>>> open("withbom.txt", "rb").read()
'\xef\xbb\xbftest'
>>> import codecs
>>> codecs.open("withbom.txt", "r", "utf-8-sig").read()
u'test'
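
If the file is going to be read as raw bytes anyway, the BOM has to be stripped explicitly. A minimal sketch, assuming a hypothetical helper named `strip_utf8_bom` (not something that exists in test262.py):

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    # codecs.BOM_UTF8 is the three-byte sequence '\xef\xbb\xbf'.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

It could be applied as `strip_utf8_bom(open("withbom.txt", "rb").read())`. Unlike the 'utf-8-sig' codec, this leaves the result as a byte string.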


Update: as of revision 324, a single file still starts with a UTF-8 Byte Order Mark. Run from the root directory, the command on the first line below prints the filename on the second line:

> find ./test/suite -name '*.js' | xargs file | grep BOM
> ./test/suite/ch07/7.8/7.8.5/S7.8.5_A3.1_T7.js

This is also the only file that results in a KeyError on this line from test262.py:

> del testRecord["commentary"]

If the BOM is removed from this single file, https://bugs.ecmascript.org/show_bug.cgi?id=271 will also be resolved, since the BOM is the only thing that causes the KeyError on the commentary key.
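
For checking the tree without relying on the `file` utility, a rough Python counterpart of the `find` pipeline might look like the sketch below. The name `find_bom_files` is made up for this example, and the check only looks at the first three bytes of each `.js` file.

```python
import codecs
import os


def find_bom_files(root):
    """Yield paths of .js files under root that start with a UTF-8 BOM."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".js"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                # Three bytes are enough to detect the UTF-8 BOM.
                if f.read(3) == codecs.BOM_UTF8:
                    yield path


if __name__ == "__main__":
    for path in find_bom_files("./test/suite"):
        print(path)
```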


Should be fixed now.


Thanks! Issue is indeed resolved, the KeyError has disappeared.