I keep hoping a string API will catch on in which combining marks are mostly tre...

barrkel · on Feb 8, 2011

It's not so simple: it depends on what you're doing with the text. If you're not trying to do analysis with it, encoded text is more or less a program written in a DSL that, when interpreted by a font renderer, draws symbols in some graphical context. Depending on the analysis you want to do, you need varying amounts of knowledge. Perhaps you only need to know about word boundaries; perhaps you're trying to look things up in a normalized dictionary; maybe even decompose a word into phonemes to try and pronounce it. These require different levels of analysis, and one size won't fit all.
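As a rough illustration of the "you only need word boundaries" case, here is a minimal Java sketch. The class and method names (`WordBoundarySketch`, `words`) are made up for this example; `java.text.BreakIterator` is the standard JDK class. It only finds word boundaries and does nothing toward normalization or pronunciation, which would need separate, heavier machinery.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, named only for this illustration.
final class WordBoundarySketch {

    // Splits text at locale-aware word boundaries and keeps only the
    // segments that start with a letter or digit (drops spaces, punctuation).
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                out.add(text.substring(start, end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!", Locale.ENGLISH)); // [Hello, world]
    }
}
```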

fedd · on Feb 8, 2011

Did you mean the situation where, for example, "ä" can be transmitted either as U+00E4 alone or as U+0061 "a" followed by U+0308 (the combining diaeresis, i.e. the "umlaut")?
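To make the two encodings concrete, here is a small Java sketch using `java.text.Normalizer`. The class and method names (`NormalizationSketch`, `canonicallyEqual`) are invented for illustration. It assumes that comparing NFC forms is an acceptable notion of "same text" for your use case, which may not hold for every kind of analysis.

```java
import java.text.Normalizer;

// Hypothetical helper class, named only for this illustration.
final class NormalizationSketch {

    // True when both strings have the same NFC (composed) form,
    // i.e. they are canonically equivalent.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E4";  // one code point: a with diaeresis
        String decomposed = "a\u0308";  // two code points: a + combining diaeresis

        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true

        // NFD splits the precomposed form; NFC joins the decomposed one.
        System.out.println(Normalizer.normalize(precomposed, Normalizer.Form.NFD).length()); // 2
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).length());  // 1
    }
}
```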

fedd · on Feb 8, 2011

Look what I found and now plan to use:

http://download.oracle.com/javase/6/docs/api/java/text/Norma...

Update: for some purposes, consider the Collator class instead. It builds collation keys (byte arrays) from strings, taking into account the locale, case sensitivity, and Unicode decomposition. (At least so says the doc: http://download.oracle.com/javase/6/docs/api/java/text/Colla... )
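A minimal sketch of how that might look, based on my reading of the `java.text.Collator` docs. The class and method names (`CollatorSketch`, `equivalent`, `sortKey`) and the choice of `Locale.US` are assumptions made for this example, not anything the docs prescribe.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, named only for this illustration.
final class CollatorSketch {

    // Builds a collator that decomposes canonical variants before comparing.
    // Locale.US is an arbitrary choice for the example.
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.US);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        c.setStrength(strength);
        return c;
    }

    // True when the collator treats the two strings as equal at this strength.
    static boolean equivalent(String a, String b, int strength) {
        return collator(strength).compare(a, b) == 0;
    }

    // Collation key as a byte array, usable for repeated comparisons or sorting.
    static byte[] sortKey(String s, int strength) {
        return collator(strength).getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        // Precomposed vs. decomposed forms of the same letter.
        System.out.println(equivalent("\u00E4", "a\u0308", Collator.TERTIARY)); // true
        // Case differences are ignored at PRIMARY strength, kept at TERTIARY.
        System.out.println(equivalent("abc", "ABC", Collator.PRIMARY));  // true
        System.out.println(equivalent("abc", "ABC", Collator.TERTIARY)); // false
        System.out.println(sortKey("abc", Collator.PRIMARY).length > 0);
    }
}
```

Collation keys pay off mainly when the same strings are compared many times (sorting a large list, say); for a one-off comparison, calling `compare` directly is simpler.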