Monday, February 12, 2007

Jython unicodedata initial overview

So it looks like the problem breaks down into creating a suitable data structure from the contents of the UnicodeData.txt file. CPython uses a Python script (what else?) to generate a C header file containing the various data structures. So I probably need to do one of the following:

  1. Use the same approach, generate a very similar structure and port the existing C code that accesses the data structures into Java. Not very appealing.

  2. Do something similar, and create a class for each code point. Probably not very good from a resource perspective (OK, potentially premature optimisation since I haven't measured it, but the UnicodeData.txt file is 817k, so that's some data structure). It could be useful from a LearningTest perspective though; e.g. can I use java.lang.Character, or do I need something else entirely?

  3. Use something in existing core Java libraries.

  4. Use a third-party library.

  5. Something else that I haven't thought of yet.
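Whichever option wins, the first step is the same: turning the lines of UnicodeData.txt into something that can be looked up by code point. Here is a minimal sketch of that step in Python, not the CPython generator script and not a proposal for the Jython implementation. The names `FIELDS`, `Record`, `parse_line` and `load_unicode_data` are made up for illustration, and the sketch does not expand the `First`/`Last` range pairs that the file uses for large blocks.

```python
from collections import namedtuple

# Column order as documented for UnicodeData.txt (15 semicolon-separated
# fields). The Python attribute names are illustrative, not official.
FIELDS = (
    "code", "name", "category", "combining", "bidi", "decomposition",
    "decimal", "digit", "numeric", "mirrored", "unicode1_name",
    "iso_comment", "uppercase", "lowercase", "titlecase",
)
Record = namedtuple("Record", FIELDS)


def parse_line(line):
    """Split one UnicodeData.txt line into a Record of raw strings."""
    parts = line.rstrip("\r\n").split(";")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    return Record(*parts)


def load_unicode_data(lines):
    """Build a dict mapping integer code point -> Record, skipping blank lines."""
    table = {}
    for line in lines:
        if not line.strip():
            continue
        record = parse_line(line)
        table[int(record.code, 16)] = record
    return table


if __name__ == "__main__":
    sample = ["2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;"]
    table = load_unicode_data(sample)
    record = table[0x2468]
    print(record.name, repr(record.decimal), repr(record.digit), repr(record.numeric))
```

With the real file, `load_unicode_data(open("UnicodeData.txt"))` would give one record per listed code point, which is also a cheap way to measure how big the naive structure really is before worrying about it.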

Just one problem: I don't fully understand what is required yet. From reading the UnicodeData commentary, I think I can see why the tests below are fine.

2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;

(UnicodeData.txt entry for code-point 0x2468)

verify(unicodedata.decimal(u'\u2468',None) is None)
verify(unicodedata.digit(u'\u2468') == 9)
verify(unicodedata.numeric(u'\u2468') == 9.0)

and those tests pass (for CPython - I haven't implemented the Jython version yet!). From the file entry and commentary, that code-point appears to have no decimal digit value, a digit value of 9 and a numeric value of 9. The tests confirm that. What I don't understand is why the following tests don't also pass.

325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;

(UnicodeData.txt entry for code-point 0x325F)

verify(unicodedata.decimal(u'\u325F',None) is None)
verify(unicodedata.digit(u'\u325F', None) is None)
verify(unicodedata.numeric(u'\u325F') == 35.0)

The last one fails with:
Traceback (most recent call last):
File "", line 1, in ?
ValueError: not a numeric character

Evidently I need to delve deeper into the spec, or start asking more knowledgeable people some questions.
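One way to narrow this down, offered as a sketch rather than a diagnosis: read the three numeric fields directly from the two entries quoted above and compare them with what the interpreter's own unicodedata module returns, alongside `unicodedata.unidata_version`, which reports the version of the Unicode database the interpreter was built with. The helper names `from_entry` and `from_module` are made up for illustration, the decomposition fields are written with their `<circle>` tag, and the code is written for a current Python 3 interpreter rather than the one used in the tests above.

```python
import unicodedata
from fractions import Fraction

# The two UnicodeData.txt entries discussed above, keyed by character.
ENTRIES = {
    "\u2468": "2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;",
    "\u325f": "325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;",
}


def from_entry(line):
    """Read (decimal, digit, numeric) from 0-based fields 6, 7 and 8."""
    fields = line.split(";")
    decimal = int(fields[6]) if fields[6] else None
    digit = int(fields[7]) if fields[7] else None
    # The numeric field may hold a fraction such as "1/2", so parse it
    # with Fraction before converting to float.
    numeric = float(Fraction(fields[8])) if fields[8] else None
    return decimal, digit, numeric


def from_module(ch):
    """Ask the interpreter's unicodedata, using None as the default."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    for ch, line in ENTRIES.items():
        print("U+%04X" % ord(ch), "file:", from_entry(line), "module:", from_module(ch))
```

If the two tuples differ for a code point, the first thing to check is whether the UnicodeData.txt being read is the same version as the database the interpreter reports.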

1 comment:

Charlie Groves said...

I'd be interested to see how much of unicodedata you can support just using java.lang.Character. I know not all of the functionality of unicodedata is in Character, but you might be able to use it to replace some of the larger data structures. It'd be nice to not have to ship them along with Jython since we already have them from Java.