My latest Flickr photograph

Java does not support unicode in programming

Sunday 2nd of November 2008, 02:59:16 am

If you've ever programmed in Java, and have some knowledge of UTF-8, you're probably going "eh? yes it does?" at this claim. However, I can assure you it is quite true. And well documented, in fact.

Let me quote the javadoc for Character, from the most recent version on the JRE 6.0 API:

"Character information is based on the Unicode Standard, version 4.0."

Unicode 4.0 was released in April of 2003. That's over five years ago. After that, 4.1 was released March 2005, 5.0 in June 2006, and 5.1 in April 2008. With the transition from 4.0 to 5.0, the UTF-16 specification had to be extended to allow for the vast number of new characters, and (love it or leave it) UTF-16 received special reserved codepoints that said "if you see me, I am to be considered an offset for the character encoded in the next 2 bytes, because it has a codepoint that is so high that it cannot be represented in only 2 bytes".

The Java crew never upgraded the Java primitived to work with this new UTF-16 specification, which is a problem, as can be easily illustrated:

Clearly, ⿱𠂉乙 is three characters. It's a decompositional character for top-down component arrangement, the 'radical' 𠂉, and then the kanji 乙. The first and the last are not a problem for Java, since they fit in the old UTF-16 encoding - the middle one is a severe problem, though, since it does not fit in the older version of UTF-16. Let's look at the underlying cause:

"The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes." "A char value [...] represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points."

(both cited from the javadoc for Character mentioned above)

That's awesome, but for one thing: String and StringBuffer rely on char arrays, not int arrays, which means that String and Stringbuffer get a hell of a lot wrong: string length and substring won't give the correct result, and split behaves incorrect because it relies on finite state evaluation of char primitives.

Plus, and this is from a design point of view, the high/low surrogate codepoints are not characters, so allowing char to be just the high/low surrogate codepoint without a follow-up of real character information is plain wrong. Unicode specifically states that these things only have 'character' meaning when used in combination with a second pair of bytes to indicate which character is encoded:

"They are called surrogates, since they do not represent characters directly, but only as a pair"

(http://www.unicode.org/faq/utf_bom.html#34)

So where does that put us? Well, for those of us who rely on java to do the right thing, we're in the computing equivalent of the stone age, really. As long as the char primitive stays ignorant with respects to the current - beg pardon, almost two and a half year old - unicode standard, java simply does not support unicode, but makes people believe it does. That's a problem. If you support something, proudly proclaim this. If you don't, very explicitly and unmistakably mention this at the very first opportunity for it. And keep saying it.

Java does not support unicode...

and it turns out I'm one of those few who really need it to -_-

(And yes, luckily I'm one of those people who file a bug report when they happen across things like this ;)

edit: yes, I am aware of the String.codePointCount() method. Guess what: you still can't actually get to any of those codepoints. Brilliantly, String.codePointAt(int n) interprets the passed index as refering to the index in the char array, not the nth codepoint in the "looked at as a series of codepoints" sequence. And no, there is no "getCodePoint(int n)" method. Which is just gross negligence on the part of Lee Boynton and Arthur van Hoff, authors of the String class.