Showing posts with label widefinder. Show all posts
Showing posts with label widefinder. Show all posts

Thursday, February 07, 2008

Minor casualties from the disk crash

There was some loss - mainly source code for personal hacking. My Jython UnicodeData implementation was lost, although I did have a copy of that in my GMail Sent folder which has been contributed to the committers in the hope that someone with more time and enthusiasm than me finds it useful. Apparently they have. The other thing lost was my WideFinder sketching. I surprisingly did have an early warning that the disk was close to failure, when my implementation (which was clocking ten seconds or something) suddenly started taking twenty minutes to complete. Obviously I didn't pick up on this warning; instead blaming it on some horrible mistake in my code. Some learnings from my WideFinder experience.

  • NIO can be tricky, and should probably have an easier API in comparison to classical IO. It was nice to look at this anyway, and I'll probably have a read of the Jython and JRuby IO stuff to see how others do it, and also the Apache NIO stuff.
  • Java string matching is slow using regexp. I'd not got as far as implementing Boyer-Moore or similar, but I expected to get a big hike out of my 10 second time when switching to that.
  • CAS is good, versus synchronized blocks. Brian Goetz has talked about the advantages of non-blocking forms elsewhere.
  • Java is quite restrictive. I must have been doing too much Common Lisp, Erlang and Scheme recently!

Sunday, November 25, 2007

WideFinder - late entrant

Snagged me a log file and starting cranking out some code tonight. I had some free time to think about this a couple of weeks ago, and got some sketches down, but I've only recently got the data to start seeing how the code can fly. So, I'm starting out on my work Dell laptop, Dual Core Pentium 2GHz with 2GB RAM and a really shitty disk, judging by the slowness and noises it makes (or is that just Vista?) [stay on topic! - Ed]. The Ruby version runs in just over a minute, once all of the caches are warmed up. My initial naive Java version runs in 14 seconds (I haven't figured out yet how to run it using time as per *nix environments - Cygwin says it can't find the time command when I pipe zcat output into it).

Now to start implementing my ideas. I have what I think is the shared update of the accumulator as well as I'm going to get it. I'm hypothesising that most of the updates are uncontended and so don't require the full weight of Java's locking capabilities. Now I just need to parallelize the I/O and determine the most efficient matching algorithm, which seems to be Boyer-Moore from reading the Wide-Finder series. That particular algorithm seems to pop up fairly regularly in searching. Might be interesting to see what else is available in that field, but it should be in a library, surely?