
#762 — String segmentation


A break iterator for strings. Very useful for CJK languages: it is used in editors and for finding word boundaries in regular expressions.

Sample implementation/documentation in Chrome - http://code.google.com/p/v8-i18n/wiki/BreakIterator


One more use case we actually encountered - offline indexing, e.g. emails, docs.

Should we up the importance to High?


I think this is of high importance. Word and grapheme boundary analysis are things that we do a fair amount of in our applications (for features such as dictionary lookup, scripted text highlighting, etc.). For some languages, such as Japanese, Chinese, and Thai, this can't be done trivially and requires a large data set to do well. There exist some smaller, statistically based implementations, but these are not accurate enough for our needs.
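To illustrate why the data set matters, here is a minimal sketch of dictionary-based word segmentation using greedy forward maximum matching. This is not the algorithm used by any implementation mentioned in this thread; the function name `segment`, the `max_len` parameter, and the three-word `TOY_DICTIONARY` are all hypothetical and chosen only for illustration.

```python
def segment(text, dictionary, max_len=4):
    """Split text into words by greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in the dictionary. Characters with no dictionary match are emitted
    as single-character tokens.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; a real one needs many thousands of entries.
TOY_DICTIONARY = {"我们", "喜欢", "编程"}

print(segment("我们喜欢编程", TOY_DICTIONARY))
```

With an empty dictionary the same call falls back to one token per character, which is roughly why output quality tracks the size and coverage of the data set.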


At the 2012-12-14 internationalization meeting Rich was asked to write a strawman.


See http://wiki.ecmascript.org/doku.php?id=globalization:text_segmentation, such as it is. This is a lot different from what Nebojsa has implemented, but I thought it might be useful to go in a different direction and spark a little discussion.


Update: @littledan will champion the breakIterator for the 4th edition.