Snagged me a log file and started cranking out some code tonight. I had some free time to think about this a couple of weeks ago and got some sketches down, but I've only recently got the data to start seeing how the code can fly. So, I'm starting out on my work Dell laptop: dual-core Pentium at 2GHz, 2GB of RAM and a really shitty disk, judging by its slowness and the noises it makes (or is that just Vista?) [stay on topic! - Ed]. The Ruby version runs in just over a minute, once all of the caches are warmed up. My initial naive Java version runs in 14 seconds (I haven't yet figured out how to time it with the time command as you would on *nix; Cygwin says it can't find the time command when I pipe zcat output into it).
Now to start implementing my ideas. I think I have the shared update of the accumulator about as good as I'm going to get it. I'm hypothesising that most of the updates are uncontended and so don't need the full weight of Java's locking capabilities. Now I just need to parallelize the I/O and settle on the most efficient matching algorithm, which, going by the Wide Finder series, seems to be Boyer-Moore. That particular algorithm pops up fairly regularly in string searching. It might be interesting to see what else is available in that field, but it should be in a library, surely?
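For the curious, here is a minimal sketch of what a lock-light accumulator could look like. It is not my actual code, just one way to express the idea: the class name HitCounter and its methods are made up for illustration. It assumes the accumulator is a map from a key to a running count, and it leans on ConcurrentHashMap and AtomicLong from java.util.concurrent so that an uncontended update costs a compare-and-swap rather than a full synchronized block.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical shared accumulator: counts hits per key without synchronized blocks. */
final class HitCounter {
    private final ConcurrentMap<String, AtomicLong> counts =
            new ConcurrentHashMap<String, AtomicLong>();

    /** Adds one hit for the key and returns the new total for that key. */
    public long increment(String key) {
        AtomicLong counter = counts.get(key);
        if (counter == null) {
            // First sighting of this key: race to install a fresh counter.
            AtomicLong fresh = new AtomicLong();
            counter = counts.putIfAbsent(key, fresh);
            if (counter == null) {
                counter = fresh; // nobody beat us to it, so ours is the one in the map
            }
        }
        // Compare-and-swap under the hood; cheap when no other thread touches this key.
        return counter.incrementAndGet();
    }

    /** Returns the current total for the key, or 0 if it has never been seen. */
    public long get(String key) {
        AtomicLong counter = counts.get(key);
        return counter == null ? 0L : counter.get();
    }

    /** Hammers a single key from several threads and returns the final count. */
    public static long hammer(final String key, int threads, final int perThread) {
        final HitCounter counter = new HitCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int n = 0; n < perThread; n++) {
                        counter.increment(key);
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get(key);
    }

    public static void main(String[] args) {
        // Worst case for contention: every thread updates the same key.
        System.out.println(hammer("some-key", 4, 100000));
    }
}
```

Running main should print 400000, since no increments are lost even when all four threads fight over one key. Whether this actually beats plain locking on real log data is exactly the hypothesis that still needs measuring.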