#240 — UTF-8 BOM makes parseTestRecord fail for certain tests

bug_id: 240
creation_ts: 2012-01-10 04:54:00 -0800
short_desc: UTF-8 BOM makes parseTestRecord fail for certain tests
delta_ts: 2012-03-28 16:31:53 -0700
product: Test262
component: Test Harness
version: unspecified
rep_platform: Macintosh
op_sys: Mac OS
bug_status: RESOLVED
resolution: FIXED
priority: Normal
bug_severity: normal
everconfirmed: true
reporter: Joost-Wim Boekesteijn
assigned_to: Dave Fugate

commentid: 551
comment_count: 0
who: Joost-Wim Boekesteijn
bug_when: 2012-01-10 04:54:01 -0800

When running the tests for ch07 or ch15, the command line test runner fails with this message:

Traceback (most recent call last):
  File "tools/packaging/test262.py", line 453, in <module>
    Main()
  File "tools/packaging/test262.py", line 448, in Main
    options.full_summary)
  File "tools/packaging/test262.py", line 412, in Run
    cases = self.EnumerateTests(tests)
  File "tools/packaging/test262.py", line 363, in EnumerateTests
    strict_case = TestCase(self, name, full_path, True)
  File "tools/packaging/test262.py", line 185, in __init__
    del testRecord["commentary"]
KeyError: 'commentary'

The problem seems to be that some files start with a UTF-8 Byte Order Mark (which ends up in the source string) causing the parseTestRecord function to fail.

My guess is that there are two approaches to fixing this:

1. Removing the BOM from the offending files. Using the following command, I produced a list of these files:

find ./test/suite -name '*.js' | xargs file | grep BOM

./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-2-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-3-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-4-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-5-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-6-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-7-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-8-s.js
./test/suite/ch07/7.6/7.6.1/7.6.1.2/7.6.1.2-9-s.js
./test/suite/ch15/15.11/15.11.4/15.11.4.2/15.11.4.2-1.js
./test/suite/ch15/15.11/15.11.4/15.11.4.3/15.11.4.3-1.js

I'm not sure if the BOM is there for any reason. Probably not.
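
For illustration, approach 1 could be scripted roughly as follows. This is a minimal sketch, not part of the Test262 tooling: `strip_bom` is a hypothetical helper name, and it assumes each file is small enough to read into memory and may be rewritten in place.

```python
import codecs


def strip_bom(path):
    """Remove a leading UTF-8 BOM from the file at `path`.

    Hypothetical helper (not from test262.py). Returns True if a BOM
    was found and removed, False if the file was left untouched.
    """
    # Read raw bytes so that no decoding step hides or alters the BOM.
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Write the content back without the three BOM bytes (EF BB BF).
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Calling `strip_bom` on each path in the list above would rewrite only the files that actually start with a BOM.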

2. Changing the way the files are read.

As a quick hack, I changed TestCase.__init__ in tools/packaging/test262.py by replacing the call to 'open' with 'codecs.open':

-f = open(self.full_path)
+f = codecs.open(self.full_path, "r", "utf-8-sig")

With this change, Python skips the BOM when reading the file. This also needs an extra line 'import codecs' at the top of the file.

Tested with Python 2.7.1 on OS X 10.7.2 against revision 282, changeset 99ff7d59530c.
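
The effect of the "utf-8-sig" codec described in approach 2 can be sketched as a small standalone function. `read_source` is a hypothetical name, not a function in test262.py. The sketch uses `io.open` instead of `codecs.open` on the assumption that the same code should also run on Python 3; both accept the "utf-8-sig" encoding.

```python
import io


def read_source(path):
    """Return the text of `path`, dropping a leading UTF-8 BOM if present.

    Hypothetical helper (not from test262.py). The "utf-8-sig" codec
    decodes UTF-8 and skips the BOM only when it is the first thing in
    the file; files without a BOM are read unchanged.
    """
    with io.open(path, "r", encoding="utf-8-sig") as f:
        return f.read()
```

One consequence of this approach is that the function returns decoded text rather than raw bytes, so on Python 2 any code consuming the result receives a `unicode` object instead of a `str`.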

commentid: 552
comment_count: 1
who: Dave Fugate
bug_when: 2012-01-10 09:42:18 -0800

Changing open(self.full_path) to open(self.full_path, 'rb') might be a smaller fix.

commentid: 553
comment_count: 2
who: Joost-Wim Boekesteijn
bug_when: 2012-01-10 09:56:25 -0800

(In reply to comment #1)
> Changing open(self.full_path) to open(self.full_path, 'rb') might be a smaller
> fix.

In fact, specifying 'rb' does not change the way the BOM is handled. To illustrate this, I created a text file containing the word 'test' and saved it as UTF-8 with a BOM.

Reading this file in different ways then yields these results:

Python 2.7.1 (r271:86832, Jun 25 2011, 05:09:01)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> open("withbom.txt").read()
'\xef\xbb\xbftest'
>>> open("withbom.txt", "rb").read()
'\xef\xbb\xbftest'
>>> import codecs
>>> codecs.open("withbom.txt", "r", "utf-8-sig").read()
u'test'

commentid: 755
comment_count: 3
who: Joost-Wim Boekesteijn
bug_when: 2012-03-13 06:14:09 -0700

Update: in revision 324, a single file that starts with a UTF-8 Byte Order Mark is left. Run from the root directory, the command on the first line below returns the filename on the second line:

> find ./test/suite -name '*.js' | xargs file | grep BOM
> ./test/suite/ch07/7.8/7.8.5/S7.8.5_A3.1_T7.js

This is also the only file that results in a KeyError on this line from test262.py:

> del testRecord["commentary"]

If the BOM is removed from this single file, https://bugs.ecmascript.org/show_bug.cgi?id=271 will also be solved since the BOM is the only thing that causes the KeyError on the commentary key.
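
On systems without the `file` utility, the same check can be approximated in Python. This is a sketch under the assumption that looking at the first three bytes of each file is enough. `has_utf8_bom` and `find_bom_files` are hypothetical names and not part of the Test262 tooling.

```python
import codecs
import os


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM bytes."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


def find_bom_files(root, suffix=".js"):
    """Walk `root` and return the sorted paths of files that start with a BOM."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(suffix) and has_utf8_bom(path):
                found.append(path)
    return sorted(found)
```

Running `find_bom_files("./test/suite")` from the root directory would list any remaining offenders, which makes it usable as a regression check after the BOMs are removed.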

commentid: 758
comment_count: 4
who: Dave Fugate
bug_when: 2012-03-14 14:25:01 -0700

Should be fixed now.

commentid: 760
comment_count: 5
who: Joost-Wim Boekesteijn
bug_when: 2012-03-14 14:57:13 -0700

Thanks! Issue is indeed resolved, the KeyError has disappeared.
