#762 — String segmentation

bug_id: 762
creation_ts: 2012-10-09 13:56:00 -0700
short_desc: String segmentation
delta_ts: 2016-02-15 16:17:15 -0800
product: Internationalization - ECMA-402
component: Specification
version: Edition 2.0 proposals
rep_platform: All
op_sys: All
bug_status: CONFIRMED
priority: Normal
bug_severity: enhancement
everconfirmed: true
reporter: Nebojša Ćirić
assigned_to: Richard Gillam
cc: ["addison", "caridy", "princexcess69"]

commentid: 1897
comment_count: 0
who: Nebojša Ćirić
bug_when: 2012-10-09 13:56:32 -0700

Add a break iterator for strings. It is very useful for CJK languages, for example in editors and for word boundaries in regular expressions.

Sample implementation/documentation in Chrome - http://code.google.com/p/v8-i18n/wiki/BreakIterator

commentid: 2415
comment_count: 1
who: Nebojša Ćirić
bug_when: 2012-11-05 09:02:04 -0800

One more use case we actually encountered - offline indexing, e.g. emails, docs.

Should we up the importance to High?

commentid: 2994
comment_count: 2
who: Addison Phillips
bug_when: 2012-12-05 15:27:48 -0800

I think this is of high importance. Word and grapheme boundary analysis are things that we do a fair amount of in our applications (for features such as dictionary lookup, scripted text highlighting, etc.). For some languages, such as Japanese, Chinese, and Thai, this can't be done trivially and requires a large data set to do a high-quality job. There exist some smaller, statistically based implementations, but these are not accurate enough for our needs.
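
A minimal sketch of the word and grapheme boundary analysis described above, assuming a runtime that provides `Intl.Segmenter` (the API that later ECMA-402 editions standardized; it did not exist when this bug was filed). The helper names `words` and `graphemes` are hypothetical and are not taken from any proposal in this thread.

```javascript
// Hypothetical helpers; assumes a runtime with Intl.Segmenter
// (for example a recent Node.js or browser build with ICU data).

// Return only the word-like segments of `text` for the given locale.
function words(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const result = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // Whitespace and punctuation segments have isWordLike === false.
    if (isWordLike) result.push(segment);
  }
  return result;
}

// Return the user-perceived characters (grapheme clusters) of `text`.
function graphemes(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

console.log(words("Hello, world!", "en")); // [ 'Hello', 'world' ]
console.log(graphemes("e\u0301a", "en").length); // 2
```

For Japanese, Chinese, or Thai text, the quality of the word boundaries depends on the dictionary data shipped with the runtime, which is the data-size concern raised in this comment.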

commentid: 3018
comment_count: 3
who: Norbert
bug_when: 2012-12-17 16:15:37 -0800

At the 2012-12-14 internationalization meeting Rich was asked to write a strawman.

commentid: 3108
comment_count: 4
who: Richard Gillam
bug_when: 2013-01-04 18:07:28 -0800

See http://wiki.ecmascript.org/doku.php?id=globalization:text_segmentation, such as it is. This is a lot different from what Nebojsa has implemented, but I thought it might be useful to go in a different direction and spark a little discussion.

commentid: 14925
comment_count: 5
who: Caridy Patiño
bug_when: 2016-02-15 16:17:15 -0800

Update: @littledan will champion the breakIterator for 4th edition.
