Tuesday, February 13, 2007

Jython unicodedata - how complete is java.lang.Character anyway?

Aye, there's the rub! What strikes me first about java.lang.Character is how heavily the API relies on char. A 16-bit char obviously can't cover all of Unicode. So in the best case, Character covers the BMP plus a bit, and something extra will need to be implemented to support the rest.
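To make the char limitation concrete, here is a small sketch in Python 3 (not the Python 2.3 of this post, and not Jython itself). It counts how many 16-bit UTF-16 code units a character needs, which is how many Java chars it would occupy. The helper name utf16_code_units and the sample characters are my own choices for illustration.

```python
def utf16_code_units(ch):
    """Return how many 16-bit code units UTF-16 needs for a one-character string."""
    # utf-16-be emits no byte order mark, so every 2 bytes is one code unit.
    return len(ch.encode("utf-16-be")) // 2

bmp_char = "\u2468"                # inside the BMP: fits in a single 16-bit unit
supplementary_char = "\U00010000"  # first code point beyond the BMP

print(utf16_code_units(bmp_char))
print(utf16_code_units(supplementary_char))
```

Anything that needs two code units is out of reach for a method that takes a single char, which is the part that would need the extra implementation.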

A related issue is that all of the nice int-overloaded versions of the methods arrived in Java 5, and that's not what I'm targeting here. I'm hoping to get away with a target environment of Java 4, since hasn't Sun end-of-lifed Java 3? Character in Java 4 only implements Unicode 3.0 anyway, whereas Python 2.3 contains version 3.2.0 of UnicodeData, so there's a gap that I need to fill somewhere.

Thirdly, the API that Character offers seems to be rather different from how Python and I have interpreted the Unicode specification.

>>> unicodedata.digit(u'\u2468')
9

but then in Java:

assertEquals(true, Character.isDigit('\u2468'));

fails. Closer inspection of the isDigit API documentation shows that this is in fact a test for a decimal digit (general category DECIMAL_DIGIT_NUMBER), so it equates to unicodedata.decimal rather than unicodedata.digit. Hopefully, Character will allow me to get a fair way into implementing unicodedata and making some of the tests run before I have to start thinking too hard about creating lookup tables and bit-masking the 11 most significant bits.
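The digit versus decimal distinction is easy to see from the Python side. The following is a minimal sketch using CPython 3's unicodedata module (so plain string literals rather than u'...'); the describe helper is a hypothetical name of mine, not part of any API.

```python
import unicodedata

def describe(ch):
    """Collect the number-related properties of a one-character string."""
    # Passing None as the default avoids a ValueError when a property is absent.
    return {
        "category": unicodedata.category(ch),
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }

print(describe("\u2468"))  # the character from the example above
print(describe("9"))       # an ordinary ASCII digit, for comparison
```

For u'\u2468' the category is No and decimal comes back as the default, while digit is still 9. Only characters in category Nd, such as the ASCII '9', have a decimal value, and that is the property Character.isDigit lines up with.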
