Thursday, June 14, 2007

Jython Update - actually doing some work on UnicodeData

So I had put this on hold to finish reading Josh Bloch and Neal Gafter's Java Puzzlers. Great book; it highlighted some new things for me, which is all I ask of any book. Partly that's because I'm still not doing Java 5, apart from at home for various minor things, but it was good.

So now back to Jython, finally. Well, reasons for my procrastination first:
  1. New Job - busy as hell.

  2. The new job comes with a new laptop, which I was hoping would be a big step up from my nearly six-year-old self-built machine. The new laptop has a reasonably impressive hardware specification, but it's running Vista. What an absolute pile of shit. I honestly don't know how developers are productive using that OS. I've given it nearly two months, just to be sure it's not the fact that I've been off Windows for three years that is causing me all of the problems, but really. It's got to the point where I'm looking seriously at Xen on Ubuntu for the odd application that I do need Windows for. The other alternative would be to install XP and put up with the half-life cost and initial downtime of getting the laptop set up for development all over again. I'll enumerate my grievances in a separate post. I don't think XP would get in my way as much as Vista does (I knocked up the xmlunit XMLSchema validation patch on my wife's XP machine over Christmas and it wasn't that painful), but I'm not a fan of Windows after using GNU/Linux exclusively for three years.

So I've been doing a little work on UnicodeData again. Since I've not touched it for so long, I wanted to get some code up and running to start seeing how many tests were failing. Until Jython moves to Java 5 and above, I can't use java.lang.Character to do parts of it. I could take a piecemeal approach of using java.lang.Character for the BMP and then implementing a new part for supplementary characters, or try to provide behaviour based on the running JVM. All a bit more work than I wanted to do, laziness and hubris being key. Instead, I went for the brute force approach of the simplest thing that could possibly work. So I wrote a Python script (what else - it's a nice way of bootstrapping this problem) to parse the UnicodeData.txt file and generate some Java classes. The initial approach was to partition UnicodeData.txt into a class for each plane in Unicode. Anyone who knows Unicode and the assigned codepoints will know that the BMP will take up most of this, but I was interested in getting something working, and then maybe refining it once the tests are passing. Well, my first cut was to have a simple interface:

interface UnicodePlane {

    /**
     * Return a UnicodeCodepoint for the specified codepoint.
     *
     * @param codepoint the Unicode codepoint
     * @return a UnicodeCodepoint, or null if there is no match
     */
    UnicodeCodepoint getCodepoint(int codepoint);
}


I would have a class implementing this interface for each plane, with a static initializer in each class that fills a Map of UnicodeCodepoint instances keyed by Integer codepoint.
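The generator script itself isn't reproduced here, but a minimal sketch of its parsing and partitioning side might look something like this. The field layout follows the UnicodeData.txt format (semicolon-separated, with the hex codepoint in the first field); the function names and output shape are my own invention:

    # Sketch of the parsing side of a UnicodeData.txt -> per-plane generator.
    # Field positions follow the UnicodeData.txt file format; everything else
    # (function names, the grouping structure) is illustrative.
    from collections import defaultdict

    def parse_unicode_data(path):
        """Yield (codepoint, fields) pairs from a UnicodeData.txt file."""
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split(";")
                yield int(fields[0], 16), fields

    def partition_by_plane(path):
        """Group entries by Unicode plane (codepoint >> 16)."""
        planes = defaultdict(list)
        for codepoint, fields in parse_unicode_data(path):
            planes[codepoint >> 16].append((codepoint, fields))
        return planes

Each per-plane group then gets emitted as a Java class whose static initializer registers its codepoints, which is where the trouble starts.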

Eclipse gives me this error:

The code for the static initializer is exceeding the 65535 bytes limit

Whereas Ant gave me this variation:

[javac] Compiling 2 source files to c:\Users\jabley\work\eclipse\workspaces\personal\jython-trunk\jython\build
[javac] c:\Users\jabley\work\eclipse\workspaces\personal\jython-trunk\jython\UnicodeData\generated-src\org\python\modules\unicodedata\UnicodeCharacterDataBasicMultilingualPlane
.java:11: code too large
[javac] private static final Map CODEPOINTS = new HashMap();
[javac] ^
[javac] 1 error

c:\Users\jabley\work\eclipse\workspaces\personal\jython-trunk\jython\build.xml:456: Compile failed; see the compiler error output for details.

So I need to think a bit harder about the data structures. Turning to Bentley's Programming Pearls for inspiration, the sparseness of certain properties stands out, such as the mirrored property.

I'll have a think. At least with the Python script that I have to cut up the UnicodeData.txt file, it's very easy to add another list comprehension to see how many items in the file exhibit a certain property. The other approach I'm considering is to just generate a properties file and lazily populate a Map as required. That's probably what I'll try next, rather than thinking too hard about how to compress a 1,038,607-byte data file into something more reasonable.
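For instance, counting how many entries have the mirrored flag set is a one-liner. Field 9 in UnicodeData.txt is the mirrored property ("Y" or "N"); the function name here is just for illustration:

    # Count how many codepoints in UnicodeData.txt are mirrored.
    # Field index 9 is the Mirrored property ("Y"/"N") in the file format.
    def count_mirrored(path):
        with open(path) as f:
            return len([line for line in f if line.split(";")[9] == "Y"])

If that count is tiny relative to the whole file, the mirrored codepoints can be stored as a small exception list rather than a field on every entry.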
