Only fast languages are interesting
If this isn’t a Zawinski quote, it should be.
I have avoided the JVM my entire life. I am presently confronted with problems which fit in the JVM; JVM libraries, concurrency, giant data: all that good stuff. Rather than doing something insane like learning Java, I figured I’d learn me some Clojure. Why not? It’s got everything I need: JVM guts, lispy goodness; what is not to love?
Well, as it turns out, one enormous, gaping lacuna is Clojure’s numerics performance. Let’s say you want to do something simple, like sum up 3 million numbers in a vector. I do shit like this all the time. My entire life is summing up a million numbers in a vector. Usually, my life is like this:
    (let* ((tmp (rand (idx-ones 3000000))))
      (cputime (idx-sum tmp)))
    0.02
20 milliseconds to sum 3 million random numbers enclosed in a nice tight vector datatype I can’t get into too much trouble with. This is how life should be. Hell, let me show off a little:
    (let* ((tmp (rand (idx-ones 30000000))))
      (cputime (idx-sum tmp)))
    0.18
180 milliseconds to sum up 30 million numbers. Not bad. 60 times worse than I’d like it to be (my computer runs at 2 GHz), but I can live with something like that.
Now, let’s try it in Clojure:
    (def rands (repeatedly rand))
    (def tmp (take 3000000 rands))
    (time (reduce + tmp))

    Java heap space
      [Thrown class java.lang.OutOfMemoryError]

    Restarts:
     0: [QUIT] Quit to the SLIME top level

    Backtrace:
      0: clojure.lang.RT.cons(RT.java:552)
    (blah blah blah java saying fuck you java blah)
Oh. Shit. Adding 3 million numbers makes Clojure puke. OK. How well does it do at adding, erm, 1/10 of that, using my piddly little default JVM with apparently not enough heap space (around 130 MB)?
    (time (reduce + tmp))
    "Elapsed time: 861.283 msecs"
Um, holy shit. Well, there is this hotspot thing I keep hearing about…
    user> (def ^doubles tmp (take 300000 rands))
    user> (time (reduce + tmp))
    "Elapsed time: 371.451 msecs"
    149958.38785575028
    user> (time (reduce + tmp))
    "Elapsed time: 107.619 msecs"
    149958.38785575028
    user> (time (reduce + tmp))
    "Elapsed time: 46.096 msecs"
    149958.38785575028
    user> (time (reduce + tmp))
    "Elapsed time: 43.776 msecs"
Great; now I’m only a factor of 20 away from Lush speed … assuming I run the same code multiple times, which has a probability close to zero. Otherwise, even with a type hint, I’m a factor of 200 away.
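To watch that warm-up happen without retyping `time` by hand, a rough sketch like the one below records each run separately. `time-runs` is a made-up helper name, not something in clojure.core, and the numbers it prints will vary with the JVM, its flags, and the machine.

```clojure
;; Hypothetical helper (not part of clojure.core): call f n times and
;; record how long each individual call took, in milliseconds.
(defn time-runs
  [n f]
  (vec (repeatedly n
                   (fn []
                     (let [start (System/nanoTime)]
                       (f)
                       (/ (- (System/nanoTime) start) 1e6))))))

;; A fully realized vector of 300000 boxed random doubles.
(def xs (vec (repeatedly 300000 rand)))

;; The first entries include compilation and warm-up cost; later entries
;; are usually closer to steady state.
(println (time-runs 10 #(reduce + xs)))
```

If only the first run matters, as it does for one-off REPL work, the first number in that vector is the one to compare against Lush.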
Maybe I should try using Incanter? I mean, they’re using parallel Colt guts in that. Maybe it’s better? Them particle physicists at CERN are pretty smart, right?
    user> (def tmp (sample-uniform 300000 :mean 0))
    #'user/tmp
    user> (time (sum tmp))
    "Elapsed time: 97.398 msecs"
    150158.83021894982
    user> (def tmp (sample-uniform 3000000 :mean 0))
    #'user/tmp
    user> (time (sum tmp))
    java.lang.OutOfMemoryError: Java heap space (NO_SOURCE_FILE:0)
    user>
A bit of hope, then …. Yaaargh!
Let’s look into that heap issue: firing up jconsole and jacking into fresh swank and clojure repl processes, I see … this:
I can’t really tell what’s going on here. I don’t really want to know. But it seems pretty weird to me that an idle Clojure process is sitting around filling up the heap, then garbage collecting. Presumably this has something to do with lein swank (it doesn’t do it so much with lein repl). Either way, this isn’t the kind of thing I like seeing.
Now, I’m not being real fair to Clojure here. If I define my random vector as a list in Lush (which isn’t really fair to Lush), and do an apply + on it, the stack will blow up also. The point is, Lush has datatypes for fast numerics: it’s designed to do fast numerics. Clojure doesn’t have such datatypes, and as a result, its numeric abilities are limited.
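One plausible contributor to the heap blowup, and this is an assumption since nothing here was profiled, is that `rands` and `tmp` are top-level defs, so every realized element stays reachable as a boxed Double hanging off a cons cell. The sketch below shows two ways around that: reducing over a lazy sequence nobody holds on to, and filling a primitive array. `sum-lazy` and `random-doubles` are illustrative names, not library functions, and the return type hint assumes Clojure 1.3 or later.

```clojure
;; Nothing keeps a reference to the head of the lazy seq, so cells that
;; reduce has already consumed can be garbage collected.
(defn sum-lazy
  [n]
  (reduce + (repeatedly n rand)))

;; A primitive double[]: 8 bytes per element, no boxing, no cons cells.
(defn random-doubles
  ^doubles [n]
  (let [ds (double-array n)]
    (dotimes [i n]
      (aset ds i (Math/random)))
    ds))

(println (sum-lazy 3000000))
(println (alength (random-doubles 3000000)))
```

The first form is still slow because every number is boxed; it only avoids keeping all three million of them alive at once.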
Clojure is neat, lein is very neat, and I’ve learned a lot about Java guts from playing with these tools. Maybe I can use it for glue code somewhere. I’m not going to be using it for numerics. Yeah, I probably should have listened to Mischa, but then if I had, I’d be writing things in numeric Perl.
Thanks to Rob and Mike for showing me the way, and thanks to everyone else for demonstrating my n00bness and 4am retardation:
    (let [ds (double-array 30000000)]
      (dotimes [i 30000000] (aset ds i (Math/random)))
      (time (areduce ds i res 0.0 (+ res (aget ds i)))))
    "Elapsed time: 65.018392 msecs"
I daresay this makes Clojure “interesting”, or at least more interesting than it was a few hours ago. It would be nice if someone had already written a package that makes taking the sum of 3 million numbers a bit less of a chore (à la idx-sum). I mean, what’s going to happen when I have to multiply two matrices together?
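For what it's worth, here is a minimal sketch of what such helpers might look like when rolled by hand on primitive double arrays. `asum` and `mat-mul` are hypothetical names, not functions from Clojure, Incanter, or Colt; the return type hints assume Clojure 1.3 or later; and the matrix product is the naive triple loop over flat row-major arrays, not anything tuned.

```clojure
;; Hypothetical helpers, not from any library.

(defn asum
  "Sum of a primitive double array."
  ^double [^doubles xs]
  (areduce xs i acc 0.0 (+ acc (aget xs i))))

(defn mat-mul
  "Naive product of row-major matrices a (n x m) and b (m x p), each stored
  as a flat double array. Returns a flat n x p double array."
  ^doubles [^doubles a ^doubles b n m p]
  (let [n (long n)
        m (long m)
        p (long p)
        c (double-array (* n p))]
    (dotimes [i n]
      (dotimes [j p]
        ;; Accumulate the dot product of row i of a and column j of b.
        (loop [k 0 acc 0.0]
          (if (< k m)
            (recur (inc k)
                   (+ acc (* (aget a (+ (* i m) k))
                             (aget b (+ (* k p) j)))))
            (aset c (+ (* i p) j) acc)))))
    c))

(println (asum (double-array [1.0 2.0 3.5])))
;; [[1 2] [3 4]] times [[5 6] [7 8]]
(println (vec (mat-mul (double-array [1 2 3 4])
                       (double-array [5 6 7 8])
                       2 2 2)))
```

The two calls at the end print 6.5 and [19.0 22.0 43.0 50.0]. Anything serious would want a real linear algebra library underneath rather than this loop.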