Thursday, August 02, 2007

Jython UnicodeData hacking in the CDS

Got numeric and decimal working today, in between contractions in the Central Delivery Suite at Frimley Park Hospital. Nearly got categories working as well, except I'm not clear how CPython has implemented unicodedata.category for undefined codepoints.

e.g. Python 2.5 uses Unicode 4.1 for its Unicode database. The integer codepoint 13313 (decimal) / 3401 (hex) is not defined within Unicode 4.1; the nearest entries in UnicodeData.txt are:

3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

It isn't defined in Unicode 5.0 either, which is what I've been using to do the Jython implementation.

3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
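
To make the gap concrete, here is a minimal Python sketch that parses the two records quoted above and checks whether 13313 has a record of its own. The parse_unicode_data helper is a made-up name for illustration, the name field (second column) is left empty because the sketch doesn't use it, and the code is written for Python 3 rather than the Python 2.5 / Jython discussed here.

```python
# The two UnicodeData.txt records around the codepoint in question.
# The name field (second column) is left empty; this sketch ignores it.
RECORDS = """\
3400;;Lo;0;L;;;;;N;;;;;
4DB5;;Lo;0;L;;;;;N;;;;;
"""


def parse_unicode_data(text):
    """Map each explicitly listed codepoint to its general category.

    Hypothetical helper: field 0 is the hex codepoint, field 2 the category.
    """
    categories = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        categories[int(fields[0], 16)] = fields[2]
    return categories


categories = parse_unicode_data(RECORDS)

print(hex(13313))           # 0x3401
print(13313 in categories)  # False: no record of its own
```

So a plain dictionary keyed on the listed codepoints has nothing to return for 13313, which is exactly where the Jython implementation needs some extra rule.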

So how does CPython define unicodedata.category(unichr(13313)) to be 'Lo'? It doesn't seem to return 'Lo' for every undefined codepoint, either. I'm speculating that it falls back to the category of the preceding valid codepoint. I think I need to post to a CPython list to confirm.
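
The fallback guess can be sketched in a few lines of Python. This is only the speculation above turned into code, not confirmed CPython behaviour: category_with_fallback and CATEGORIES are hypothetical names, the table holds just the two records quoted earlier, and returning 'Cn' (the Unicode category for unassigned codepoints) when nothing precedes the codepoint is an assumption of the sketch.

```python
import bisect

# Categories of the explicitly listed codepoints (only the two from the post).
CATEGORIES = {0x3400: "Lo", 0x4DB5: "Lo"}


def category_with_fallback(codepoint, categories):
    """Return the category of the nearest listed codepoint at or below
    `codepoint`. Hypothetical rule, sketching the guess in the post."""
    listed = sorted(categories)
    # Index of the first listed codepoint strictly greater than `codepoint`.
    i = bisect.bisect_right(listed, codepoint)
    if i == 0:
        # Nothing listed at or below this codepoint: assume "unassigned".
        return "Cn"
    return categories[listed[i - 1]]


print(category_with_fallback(13313, CATEGORIES))   # Lo
print(category_with_fallback(0x33FF, CATEGORIES))  # Cn
```

Note that this rule would also report 'Lo' for codepoints just after 4DB5, so comparing its answers with CPython's there would be one way to test the guess.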
