Aye, there's the rub! My initial reaction to java.lang.Character is to the extensive use of char in the API. A 16-bit char obviously can't cover all of Unicode. So in the best case, Character can cover the BMP plus a bit, and something extra would need to be implemented to support the rest.
A related issue is that all of the nice int overloads of the methods arrived in Java 5, and that's not what I'm targeting here. I'm hoping to get away with a target environment of Java 4, since hasn't Sun end-of-lifed Java 3? Java 4's Character only implements Unicode 3.0 anyway, whereas Python 2.3 contains version 3.2.0 of UnicodeData, so there's going to be a gap that I need to fill somewhere.
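As a rough illustration of the "something extra" needed beyond char: a code point above U+FFFF is stored in a Java String as a surrogate pair, and without the Java 5 int overloads the pair has to be combined by hand. The sketch below uses only the arithmetic defined by UTF-16; the class and method names (SurrogateSketch, toCodePoint, codePointAt) are hypothetical and not part of any existing API on Java 4.

```java
public class SurrogateSketch {

    // Hypothetical helper: true if c is a UTF-16 high (leading) surrogate.
    public static boolean isHighSurrogate(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    // Hypothetical helper: true if c is a UTF-16 low (trailing) surrogate.
    public static boolean isLowSurrogate(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    // Combine a surrogate pair into a single code point using the
    // standard UTF-16 decoding arithmetic.
    public static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Return the code point starting at index, falling back to the
    // plain char value when there is no well-formed pair at that position.
    public static int codePointAt(String s, int index) {
        char c = s.charAt(index);
        if (isHighSurrogate(c) && index + 1 < s.length()) {
            char d = s.charAt(index + 1);
            if (isLowSurrogate(d)) {
                return toCodePoint(c, d);
            }
        }
        return c;
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, written as its surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println(Integer.toHexString(codePointAt(clef, 0))); // prints "1d11e"
    }
}
```

Anything in the BMP could still be handed to Character as a char; only values of 0x10000 and above would need separate lookup data.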
Thirdly, the API that Character offers seems to be rather different from how Python and I have interpreted the Unicode specification. For example, in Python:
>>> unicodedata.digit(u'\u2468')
9
but then in Java:
assertEquals(true, Character.isDigit('\u2468'));
fails. Closer inspection of the isDigit API documentation shows that this is in fact a test for a DECIMAL DIGIT (general category Nd), so it equates to unicodedata.decimal rather than unicodedata.digit. Hopefully, Character will allow me to get a fair way into implementing and making some of the tests run, before I have to start thinking too hard about creating lookup tables and bit-masking the 11 most significant bits.