Monday, February 26, 2007

Jython stalled

I haven't made as much progress with this as I would like, since I'm having some troubles. Not ones of a technical nature, but instead ones of a legal nature. My employment contract is apparently fairly standard and says that my employer owns all of my thoughts (even the ones when I'm writing this!). As what I thought was a courtesy, I asked my manager whether it was OK to contribute to open source projects, so that there would be no shady areas about who owned what code. Well, I'm still waiting for an answer and a bit of paper from my employer. So until I get that, I'm holding off writing any real code.

What I do have is a bit of sketching around the general area. First off, I wasn't sure about some aspects of the CPython implementation, so I asked. It got bounced by the python-dev moderator with the advice that I should post on the python-list. Which I did, but as I expected, it was more of a question for the python-dev list and the only (private) response that I've had is from Martin V. Loewis, who did the last change to that part of CPython. Maybe people are busy with PyCon. I was pointed at the C implementation, which doesn't generate that part from the UnicodeData.txt file, but instead is a horrible case statement, which looks bad. I think I need to raise a bug.

The other thing I've done with it is write some Learning Tests about java.lang.Character, to see what it offers me. Obviously, this is attractive since it's a core library, is well tested, debugged and used by millions of people, plus all the other reasons Josh Bloch enumerates in Effective Java. It seems to have a few little idiosyncrasies, which I have captured in my tests. (Note to self: I haven't seen any JUnit (or TestNG - Cedric!) tests in the Jython source tree. Must ask the dev list about that.) Then maybe I should check whether JRuby needs this sort of thing, and make it re-usable, with a Jython wrapper for the API that it needs, etc.

Tuesday, February 13, 2007

Jython unicodedata - how complete is java.lang.Character anyway?

Aye, there's the rub! My initial reaction to java.lang.Character is the extensive use of char in the API. That obviously wouldn't cover all of Unicode. So as a best scenario, maybe the BMP plus a bit can be covered by Character, and then something extra would need to be implemented to support the rest.
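To make the char limitation concrete: a Java char is a single 16-bit UTF-16 code unit, so any code point above U+FFFF has to be split across two of them. Here is a minimal sketch of that arithmetic, written in Python for brevity. The function name to_utf16_units is my own invention, not part of any library.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units for a Unicode code point as a tuple of ints."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points have no UTF-16 encoding")
    if code_point < 0x10000:
        # Basic Multilingual Plane: fits in one 16-bit unit (one Java char).
        return (code_point,)
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return (high, low)


# U+2468 fits in one unit; U+1D7D8 needs a surrogate pair.
print([hex(unit) for unit in to_utf16_units(0x2468)])
print([hex(unit) for unit in to_utf16_units(0x1D7D8)])
```

Anything that returns two units here is out of reach of the char-based Character methods, which is the part that would need something extra.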

A related issue is that all of the nice int-overloaded versions of the methods are Java 5, and that's not what I'm targeting here. I'm hoping to get away with a target environment of Java 4, since hasn't Sun end-of-lifed Java 3? Java 4's Character only implements Unicode 3.0 anyway, while Python 2.3 contains version 3.2.0 of UnicodeData, so there's going to be a gap that I need to fill somewhere.

Thirdly, the API that Character offers seems to be rather different from what Python and I have taken from the Unicode specification.

>>> unicodedata.digit(u'\u2468')
9

but then in Java:

assertEquals(true, Character.isDigit('\u2468'));

fails. Closer inspection of the isDigit API documentation shows that this is in fact a test for is DECIMAL DIGIT, so it equates to unicodedata.decimal rather than unicodedata.digit. Hopefully, Character will allow me to get a fair way into implementing and making some of the tests run, before I have to start thinking too hard about creating lookup tables and bit-masking the 11 most significant bits.
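The decimal versus digit distinction can also be seen from the Python side. The sketch below assumes a current Python 3 interpreter rather than the Python 2.3 level that Jython is aiming at, and numeric_profile is a hypothetical helper name of mine.

```python
import unicodedata


def numeric_profile(ch):
    """Return (decimal, digit, numeric) for a character, with None for missing values."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


# ASCII nine is a decimal digit; CIRCLED DIGIT NINE is a digit but not a decimal digit.
for ch in ("9", "\u2468"):
    print(unicodedata.name(ch), numeric_profile(ch), ch.isdecimal(), ch.isdigit())
```

If Character.isDigit really is the decimal-digit test, then it lines up with the first value in each tuple (and with str.isdecimal), not the second.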

Monday, February 12, 2007

Jython unicodedata initial overview

So it looks like the problem breaks down into creating a suitable data structure from the contents of the UnicodeData.txt file. CPython uses a python script (what else?) to create a C header file with the contents of various data structures. So all I need to do is probably one of the following:


  1. Use the same approach, generate a very similar structure and port the existing C code that accesses the data structures into Java. Not very appealing.

  2. Do something similar, and create a class for each code point. Probably not very good from a resource perspective (OK, potentially premature optimisation since I haven't measured it, but the UnicodeData.txt file is 817k, so that's some data structure). That could be useful from a LearningTest perspective though; e.g. can I use java.lang.Character, or do I need something else entirely?

  3. Use something in existing core Java libraries.

  4. Use a third-party library.

  5. Something else, that I haven't bothered to think about what it could be.
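Whichever option wins, the starting point is turning the semicolon-separated records into a table. A rough sketch of one possible shape, in Python: property tuples shared by many code points are stored once and referenced by index. The names load_records and SAMPLE are hypothetical, only two of the fields are kept, and this is not a claim about how CPython's generated tables are laid out.

```python
def load_records(lines):
    """Build (names, prop_index, unique_props) from UnicodeData.txt style lines."""
    names = {}          # code point -> character name
    prop_index = {}     # code point -> position in unique_props
    unique_props = []   # each distinct (category, combining class) stored once
    positions = {}      # property tuple -> position in unique_props
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        props = (fields[2], int(fields[3]))
        if props not in positions:
            positions[props] = len(unique_props)
            unique_props.append(props)
        names[code_point] = fields[1]
        prop_index[code_point] = positions[props]
    return names, prop_index, unique_props


# Two real-looking records; the decomposition field carries a <circle> tag.
SAMPLE = [
    "2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;",
    "325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;",
]

names, prop_index, unique_props = load_records(SAMPLE)
print(names[0x2468], unique_props[prop_index[0x2468]])
```

Both sample code points end up pointing at the same stored tuple, which is the kind of sharing that would keep the memory cost down compared with an object per code point.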

Just one problem. I don't fully understand what is required yet. Reading the UnicodeData commentary, I think I can see why the tests below are fine.


2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;

(UnicodeData.txt entry for code-point 0x2468)

verify(unicodedata.decimal(u'\u2468',None) is None)
verify(unicodedata.digit(u'\u2468') == 9)
verify(unicodedata.numeric(u'\u2468') == 9.0)

and those tests pass (for CPython - I haven't implemented the Jython version yet!). From the file entry and commentary, that code-point appears to have no decimal digit value, a digit value of 9 and a numeric value of 9. The tests confirm that. I don't understand why these don't also pass.

325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;

(UnicodeData.txt entry for code-point 0x325F)

verify(unicodedata.decimal(u'\u325F',None) is None)
verify(unicodedata.digit(u'\u325F', None) is None)
verify(unicodedata.numeric(u'\u325F') == 35.0)

The last one fails with:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: not a numeric character

Evidently I need to delve deeper into the spec, or start asking more knowledgeable people some questions.
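As a sanity check on my reading of the file format, here is a small sketch that pulls the decimal, digit and numeric fields (the seventh, eighth and ninth, counting from one) straight out of the two records above. The helper name numeric_fields is mine, and the records are written with the <circle> tag in the decomposition field, as I believe they appear in UnicodeData.txt.

```python
from fractions import Fraction


def numeric_fields(record):
    """Return (decimal, digit, numeric) from one UnicodeData.txt line; None where empty."""
    fields = record.split(";")
    decimal = int(fields[6]) if fields[6] else None
    digit = int(fields[7]) if fields[7] else None
    # The numeric field may hold a fraction such as "1/2", so go via Fraction.
    numeric = float(Fraction(fields[8])) if fields[8] else None
    return decimal, digit, numeric


NINE = "2468;CIRCLED DIGIT NINE;No;0;EN;<circle> 0039;;9;9;N;;;;;"
THIRTY_FIVE = "325F;CIRCLED NUMBER THIRTY FIVE;No;0;ON;<circle> 0033 0035;;;35;N;;;;;"

print(numeric_fields(NINE))
print(numeric_fields(THIRTY_FIVE))
```

Going by the data file alone, U+325F has a numeric value of 35 and nothing in the other two fields, which is what my failing test expected.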

Friday, February 09, 2007

Jython unicodedata proceedings

So a little background as to why I'm doing this. Well, I don't know Unicode as well as I'd like, and I know Python a lot better than I know Ruby, so no temptation to start hacking JRuby at this point (well, maybe just a little).

I've implemented the methods that were missing and now I'm getting failures in the test. For the first implementation, I grabbed the existing Python source. Shit, C programming rots your brain. I learned C at Uni and then again via K&R, but this is a little different. But it's enough to give me the method signatures for everything that I need to stub.

*sys-package-mgr*: processing modified jar, '/home/jabley/work/workspaces/main/jython/dist/jython.jar'
Testing Unicode Database...
Methods: 38ef24ef104d52e24f9b7c942676c6961f9233cc
Functions: 97f3b4a034c7d9a0d0c1f387e216d6b8bf309442
API:Traceback (innermost last):
File "dist/Lib/test/test_unicodedata.py", line 91, in ?
File "/home/jabley/work/workspaces/main/jython/dist/Lib/test/test_support.py", line 125, in verify
TestFailed: test failed

Bit tired tonight (Connor's been throwing up the last two days), so I won't dig much into this. It feels slightly weird to be running the tests as Python tests against a Java implementation, but that's what you get for implementing a library like this. No JUnit / TestNG in sight. I have a feeling that it's going to require a lot of reading, which is good in that I might learn something, but I also wanted to get back to Stefan with a DocBook example for an xmlunit proposal.

Thursday, February 08, 2007

Jython 101

Inspired by Joe Gregorio, I thought I'd take a gander at Jython.
The proposed JythonSprint seemed to suggest that unicodedata was required, so I thought I'd have a play and maybe even contribute something. We'll see.

So, grab the main trunk from subversion and off we go. TDD all the way, as much as possible. Are you sitting comfortably? Then I'll begin...

Step 1 - create an ant.properties file. This is in .cvsignore, so no worries about clashing with anyone else's settings. Straight off the developer guide.

build.compiler=modern
debug=on
optimize=off

Step 2 - build it using Ant.

Step 3 - make it accessible. I don't have jython on my machine already, so I take the dirty approach.

sudo vi /usr/bin/jython
#!/bin/sh

export JYTHON_HOME=/home/jabley/work/workspaces/main/jython
exec java -Dpython.home="${JYTHON_HOME}/dist/" -jar "${JYTHON_HOME}/dist/jython.jar" "$@"

Slight tweak from the development guide, and yes, I'm using Eclipse.


sudo chmod 755 /usr/bin/jython

Now see what it looks like:


jython
*sys-package-mgr*: processing new jar, '/home/jabley/work/workspaces/main/jython/dist/jython.jar'
*sys-package-mgr*: processing new jar, '/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/jre/lib/rt.jar'
*sys-package-mgr*: processing new jar, '/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/jre/lib/jsse.jar'
*sys-package-mgr*: processing new jar, '/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/jre/lib/jce.jar'
*sys-package-mgr*: processing new jar, '/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/jre/lib/charsets.jar'
*sys-package-mgr*: processing new jar, '/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/jre/lib/ext/sunjce_provider.jar'
*sys-package-mgr*: processing new jar, '/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/jre/lib/ext/sunpkcs11.jar'
*sys-package-mgr*: processing new jar, '/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/jre/lib/ext/localedata.jar'
*sys-package-mgr*: processing new jar, '/usr/lib/jvm/java-1.5.0-sun-1.5.0.06/jre/lib/ext/dnsns.jar'
Jython 2.2b1 on java1.5.0_06 (JIT: null)
Type "copyright", "credits" or "license" for more information.
>>>

Cool.

Next run the tests.


jython dist/Lib/test/test_unicodedata.py
Testing Unicode Database...
Methods: 38ef24ef104d52e24f9b7c942676c6961f9233cc
Traceback (innermost last):
File "dist/Lib/test/test_unicodedata.py", line 84, in ?
ImportError: no module named unicodedata

That's expected. So I added a class org.python.modules.unicodedata and added an entry in org.python.modules.Setup to have a unicodedata module. I'll probably revisit the class name and maybe give it its own package longer-term, but that's the simplest thing for now. Tests again.


jython dist/Lib/test/test_unicodedata.py
*sys-package-mgr*: processing modified jar, '/home/jabley/work/workspaces/main/jython/dist/jython.jar'
Testing Unicode Database...
Methods: 38ef24ef104d52e24f9b7c942676c6961f9233cc
Functions:Traceback (innermost last):
File "dist/Lib/test/test_unicodedata.py", line 86, in ?
File "dist/Lib/test/test_unicodedata.py", line 62, in test_unicodedata
AttributeError: class 'org.python.modules.unicodedata' has no attribute 'digit'

I seem to recall reading about some script that will generate stubs for these things - gexpose.py? I'll have a look, but otherwise it looks like my next step will be implementing stubs for the required methods and seeing where the tests fail next.

Saturday, February 03, 2007

Roaches in the sun

Managed to head up north for a day, hooked up with the Banks, Laurie and Dutch. Cracking day, too hot really (and this is February!).

Mooching around under Inertial Reel discussing who had climbed it. Martin Dearden was just next to us, apparently enjoying being the subject of our historical wanderings when we tried to remember who it was that had taken eight years on it. Mind you, that's positively fast given that I first tried Stefan Grossman eleven years ago, before DB had done it. I came close six years ago, and haven't really been back. There's a long-term goal!

Did a bit, but felt a bit fat and wasn't moving well; I think I was a bit tired from introducing the weight belt back into the training programme. The Banks did well, getting Teck Crack Superdirect third go or thereabouts. I couldn't get my foot on the starting hold - bit of flexibility work required maybe!

Backed off Stretch and Mantel - I can't mantel and it was a bit high. Then puntered about on Calcutta Buttress before team effort on the world's hardest 6a+ slab. Andy came close; Adam Long complained about old boots not being up to it, Fiona crimped like a beast but couldn't quite get it.

Flashed Stretch Armstrong then it was time to go. I didn't really do much, but had a good day out with the guys. On the way home, Al rang and said it was OK for me to stay over, but I was already halfway home. Will get an overnight pass properly sorted out for the future.