Only fast languages are interesting

Posted in Clojure, Lush, tools by Scott Locklin on November 30, 2011

If this isn’t a Zawinski quote, it should be.

I have avoided the JVM my entire life. I am presently confronted with problems which fit in the JVM; JVM libraries, concurrency, giant data: all that good stuff. Rather than doing something insane like learning Java, I figured I’d learn me some Clojure. Why not? It’s got everything I need: JVM guts, lispy goodness; what is not to love?

Well, as it turns out, one enormous, gaping lacuna is Clojure’s numerics performance. Let’s say you want to do something simple, like sum up 3 million numbers in a vector. I do shit like this all the time. My entire life is summing up a million numbers in a vector. Usually, my life is like this:

 (let* ((tmp (rand (idx-ones 3000000))))
    (cputime (idx-sum tmp)))

0.02

20 milliseconds to sum 3 million random numbers enclosed in a nice tight vector datatype I can’t get into too much trouble with. This is how life should be. Hell, let me show off a little:

(let* ((tmp (rand (idx-ones 30000000))))
    (cputime (idx-sum tmp)))

0.18

180 milliseconds to sum up 30 million numbers. Not bad. 60 times worse than I’d like it to be (my computer runs at 2Ghz), but I can live with something like that.

Now, let’s try it in Clojure:

(def rands (repeatedly rand))
(def tmp (take 3000000 rand))
(time (reduce + tmp))

Java heap space
[Thrown class java.lang.OutOfMemoryError]

Restarts:
0: [QUIT] Quit to the SLIME top level

Backtrace:
0: clojure.lang.RT.cons(RT.java:552)
(blah blah blah java saying fuck you java blah)

Oh. Shit. Adding 3 million numbers makes Clojure puke. OK. How well does it do at adding, erm, 1/10 of that using my piddley little default JVM with apparently not enough heap space (@130mb).

(time (reduce + tmp)) "Elapsed time: 861.283 msecs"

Um, holy shit. Well, there is this hotspot thing I keep hearing about…

 

user> (def ^doubles tmp (take 300000 rands))
user> (time (reduce + tmp))
  "Elapsed time: 371.451 msecs" 149958.38785575028 

user> (time (reduce + tmp))
  "Elapsed time: 107.619 msecs" 149958.38785575028 

user> (time (reduce + tmp))
  "Elapsed time: 46.096 msecs" 149958.38785575028 

user> (time (reduce + tmp))
  "Elapsed time: 43.776 msecs"

Great; now I’m only a factor of 20 away from Lush speed … assuming I run the same code multiple times, which has a probability close to zero. Otherwise, with a typedef, I’m a factor of 200 away.

Maybe I should try using Incanter? I mean, they’re using parallel Colt guts in that. Maybe it’s better? Them particle physicists at CERN are pretty smart, right?

user> (def tmp (sample-uniform 300000 :mean 0))
#'user/tmp
user> (time (sum tmp))
"Elapsed time: 97.398 msecs" 150158.83021894982
user> (def tmp (sample-uniform 3000000 :mean 0))
#'user/tmp
user> (time (sum tmp))
java.lang.OutOfMemoryError: Java heap space (NO_SOURCE_FILE:0)
user>

A bit of hope, then …. Yaaargh!

Let’s look into that heap issue: firing up jconsole and jacking into fresh swank and clojure repl processes, I see … this:

I can’t really tell what’s going on here. I don’t really want to know. But it seems pretty weird to me than an idle Clojure process is sitting around filling up the heap, then garbage collecting. Presumably this has something to do with lein swank (it doesn’t do it so much with lein repl). Either way, this isn’t the kind of thing I like seeing.

Now, I’m not being real fair to Clojure here. If I define my random vector as a list in Lush (which isn’t really fair to Lush), and do an apply + on it, the stack will blow up also. The point is, Lush has datatypes for fast numerics: it’s designed to do fast numerics. Clojure doesn’t have such datatypes, and as a result, its numeric abilities are limited.

Clojure is neat, lein is very neat, and I’ve learned a lot about Java guts from playing with these tools. Maybe I can use it for glue code somewhere. I’m not going to be using it for numerics. Yeah, I probably should have listened to Mischa, but then if I had, I’d be writing things in numeric Perl.

Edit Add:

Thanks to Rob and Mike for showing me the way, and thanks everyone else for demonstrating my n00bness and 4am retardation

(let [ds (double-array 30000000)]
(dotimes [i 30000000] (aset ds i (Math/random)))
(time (areduce ds i res 0.0 (+ res (aget ds i)))))

"Elapsed time: 65.018392 msecs"

I daresay, this makes clojure “interesting” -or at least more interesting than it was a few hours ago. It would be nice if someone had already written some package which makes taking the sum of 3 million numbers a bit less of a chore (a la idx-sum). I mean, what’s going to happen when I have to multiply two matrices together?

117 comments

117 Responses

Subscribe to comments with RSS.

Terry A. davis said, on November 30, 2011 at 10:06 am

MMX SSE can be used in matrix math. That’s about it.

Benchmark LoseThos printf-ing numbers from 1 to a million. Chuckle. It’s probably an order of magnitude faster.

Reply
- Scott Locklin said, on November 30, 2011 at 10:18 am
  
  I’m not sure what you mean by benchmarking …. LoseThos printf? It wouldn’t surprise me if Clojure was fast at something like that, but then … adding is a lot more important to me that writing!
  
  Reply
joel9 said, on November 30, 2011 at 10:32 am

Get the job done…man…

You are using wordpress.com, which is powered by PHP, do you care?

Reply
- Scott Locklin said, on November 30, 2011 at 10:45 am
  
  Java + fast numerics would be pretty useful for getting jobs done.
  
  Reply
  - HC said, on November 30, 2011 at 12:33 pm
    
    Remember, idiot, next time “[do] something insane like learning Java”. “keep hearing about ..”. Stupid fool. What is a fool like you adding up numbers for, anyway? “doh, I don’t know, sumptin!”.
    
    Reply
    - Scott Locklin said, on November 30, 2011 at 9:12 pm
      
      Um, prop trading?
      
      Reply
    - John Haugeland said, on December 1, 2011 at 4:07 pm
      
      Yeah, because nobody ever does math on large sets of numbers.
      
      Don’t worry, though, when you call people idiots and then say “who’d ever do this common fundamental thing,” we’re not thinking it’s actually you.
      
      At all.
      
      Really.
      
      Reply
- BS Footprint said, on November 30, 2011 at 4:42 pm
  
  It all depends on what it means to “get the job done” — if part of getting the job done is to crunch a boatload of numbers efficiently, then the language and environment performance *does* matter, doesn’t it?
  
  Reply
Maurizio said, on November 30, 2011 at 10:55 am

For the sake of argument, on ruby 1.9.3p0 (Intel i5 @ 2.67 Ghz):

require ‘benchmark’
puts Benchmark.measure { ([rand]*3_000_000).reduce(&:+) } => 3.342504
puts Benchmark.measure { ([rand]*30_000_000).reduce(&:+) } => 313.705162

Reply
- Scott Locklin said, on November 30, 2011 at 11:02 am
  
  You might want to measure only the reduce operation. Generating the numbers is probably going to take some time (it’s rather time consuming in Clojure).
  
  I’m taking that as being seconds. Ruby is known to be slow at numerics operations. Nice of it to not explode on the reduce operation though!
  
  Reply
  - Maurizio said, on November 30, 2011 at 11:54 am
    
    Ok, let’s try:
    
    three_millions_of_random_numbers = [rand]*3_000_000
    puts Benchmark.measure { three_millions_of_random_numbers.reduce(&:+) } => 3.274554
    
    thirty_millions_of_random_numbers = [rand]*30_000_000
    puts Benchmark.measure { thirty_millions_of_random_numbers.reduce(&:+) } => 286.569942
    
    Comparing them with the previous data, we have ~ 0.7 and ~ 27.2 seconds for generating respectively 3 millions and 30 millions of random numbers.
    
    So generating random numbers in ruby is not very expensive (I think because Kernel.rand is a C function).
    
    Yeah, ruby is certainly another world comparing to functional languages… but I am a curious person! 😉 I must say that, in my experience, ruby is slow and memory expensive, but a crash is not a common thing.
    
    Reply
    - Jeremy Roman said, on November 30, 2011 at 2:39 pm
      
      Actually, you’re only generating one random number, and then repeating it 3 million (or 30 million) times. If you do something else, like (1…3_000_000).map { rand }, you’ll see that it takes a bit longer.
      
      Reply
      - Jeremy Roman said, on November 30, 2011 at 2:41 pm
        
        Whoops, that was mentioned below.
        
        Reply
    - ekosz said, on November 30, 2011 at 3:49 pm
      
      BTW in my testing I found that three_millions_of_random_numbers.inject(:+) is faster than using reduce(&:+).
      
      Reply
    - François Beausoleil said, on November 30, 2011 at 4:13 pm
      
      Do you want different numbers, or the same number 3 million times?
      
      [rand]*3_000_000 repeats the same number 3 million times. You want to generate 3 million *different* numbers:
      
      three_million_different_numbers = []
      3_000_000.times { three_million_different_numbers << rand }
      
      Reply
      - Mon-Ouie said, on November 30, 2011 at 8:01 pm
        
        I’d use Array.new(3e6) { rand }.
        
        Reply
- citizen428 said, on November 30, 2011 at 11:48 am
  
  That creates an array of the same number over and over again.
  
  [rand] * 3 #=> [0.8462455656914812, 0.8462455656914812, 0.8462455656914812]
  
  Not necessarily important for the benchmark, but also not what you want.
  
  Reply
  - Maurizio said, on November 30, 2011 at 12:29 pm
    
    Oh god you’re right >.<
    
    Reply
- Julian Raschke said, on November 30, 2011 at 11:50 am
  
  This will generate an array of 3 or 30 million equal numbers, respectively. What you mean is Array.new(3_000_000) { rand }.
  
  FWIW, on 1.8.7 with a C2D Mac, I get 2.27 seconds for summing up 3 million Floats, and 1.8 seconds for summing up 3 million Fixnums.
  
  Reply
  - Maurizio said, on November 30, 2011 at 12:55 pm
    
    With
    
    three_millions_of_random_numbers = Array.new(3_000_000){ rand }
    puts Benchmark.measure { three_millions_of_random_numbers.reduce(&:+) }
    
    I get 0.448 seconds, while with
    
    thirty_millions_of_random_numbers = Array.new(30_000_000){ rand }
    puts Benchmark.measure { thirty_millions_of_random_numbers.reduce(&:+) }
    
    4.41 seconds.
    
    This is very different from 3.27 and 286… 4.41 vs 286, because values are different each other? I think I am missing something
    
    Reply
    - Julian Raschke said, on November 30, 2011 at 1:34 pm
      
      182 seconds for the second test here, which is more in line with your first round of testing. (1.8.7 again)
      
      Reply
- Alan said, on November 30, 2011 at 11:59 am
  
  With similar specs on Python 2.7 I got 0.05s for 3 million and 0.5s for 30 million, using sum() on a list of random numbers. With reduce() (which Python is bad at) I got 0.4s and 4s respectively. I think the results might be even better using NumPy.
  
  Reply
  - Jorge said, on March 25, 2012 at 10:02 am
    
    I just reproduced your plain Python results. The same code in PyPy took 50ms to add 30 million random numbers, and 44ms with NumPy.
    
    Reply
- Brendan Ribera (@abscondment) said, on November 30, 2011 at 7:08 pm
  
  This is very interesting. I get a similar jump on 1.9.3 (~2s to ~140s), but 1.9.2 is *much* better.
  
  ruby-1.9.2-p290:
  puts Benchmark.measure { ([rand]*3_000_000).reduce(&:+) } => 0.684735
  puts Benchmark.measure { ([rand]*30_000_000).reduce(&:+) } => 8.908247
  
  That seems like a significant regression.
  
  Reply
  - Maurizio said, on December 1, 2011 at 10:07 am
    
    you are comparing 3_000_000 vs 30_000_000
    
    Reply
Neuromancer said, on November 30, 2011 at 11:06 am

all i have to say is FORTAN 🙂

Reply
Neuromancer said, on November 30, 2011 at 11:07 am

Bum FORTRAN not fortan 🙂

Reply
- Scott Locklin said, on November 30, 2011 at 11:14 am
  
  Har. It’s still a good choice for fast numerics. Unfortunately, that’s about all it can do.
  
  Reply
  - Neuromancer said, on December 6, 2011 at 9:28 pm
    
    Hey I wrote billing systems in FORTRAN 🙂 well a big chunk was PL/1G (which worked in the same way as some of the core bits of map reduce does)
    
    Reply
    - Scott Locklin said, on December 6, 2011 at 9:43 pm
      
      Now there’s a blast from the past!
      Computer science is weird. A lot of neat stuff gets forgotten, then rediscovered again as a great innovative breakthrough! I’ve engaged in a bit of CS archeology myself and found a few neat things from back in the day when performance and code size mattered.
      
      Reply
mon key said, on November 30, 2011 at 11:14 am

How much memory did you give the JVM?

Reply
- Scott Locklin said, on November 30, 2011 at 11:21 am
  
  Whatever the defaults are. The Jconsole had some numbers on it n the image capture. It shouldn’t matter on that small a calculation, though, really.
  
  Reply
  - mon key said, on November 30, 2011 at 11:32 am
    
    I don’t know. You should at least try to give it 1GB and see what happens.
    
    Reply
Rob said, on November 30, 2011 at 11:26 am

What is idx-ones? I’m trying to run your Lush code on my computer but it doesn’t recognize that function and the only Google result is this blog entry.

Reply
- Scott Locklin said, on November 30, 2011 at 11:28 am
  
  Sorry: it’s in my includes:
  (libload “libidx/idx-double.lsh”)
  #? (idx-ones )
  ;; makes an idx vector filled with 1’s
  (de idx-ones (n)
  (declare (-double-) n)
  (let* ((nn (int n))
  (out (double-array nn)))
  (idx-d1fill out 1)
  out))
  
  (compile it if you like; I did)
  
  Reply
  - cpcloud said, on November 30, 2011 at 5:58 pm
    
    You could also do do something like:
    
    (defmacro idx-ones (&rest rest)
    `(incr (double-array ,@rest)))
    
    Reply
    - cpcloud said, on November 30, 2011 at 6:01 pm
      
      Whoops sorry replace incr with 1+
      
      Reply
  - tce said, on December 6, 2011 at 10:55 pm
    
    and how is idx-d1fill defined?
    
    Reply
    - Scott Locklin said, on December 6, 2011 at 11:50 pm
      
      (libload “libidx/idx-double.lsh”)
      
      Or download bronzekopf.sf.net and
      (libload “includes.lsh”) to get the function I’m using to make an idx1 of 1’s for making random numbers. I think idx-ones might be compiled under a different namespace. Didn’t expect to be HNNed on this one; it was more of a 4am bleat.
      
      FWIIW, I figured out a way to import JBLAS into Clojure and wrote a lushy looking wrapper for some of the core functionality. It’s quite acceptably fast even without getting rid of reflection. Factor of 10 faster than COLT/Incanter at MMul. Factor of 14 on EV decomp depending on the size. Fortran FTW!
      
      Reply
Mike said, on November 30, 2011 at 11:32 am

You’re not exactly comparing like with like: your Clojure code is building a large linked list of boxed doubles which is blowing your heap. You can do that if you like, but it’s probably not what you’re trying to benchmark….

I get 44ms for summing 3,000,000 numbers in Clojure on my laptop with the following code:

(defn foo []
(let [ds (double-array 3000000)]
(dotimes [i 3000000] (aset ds i (mikera.util.Rand/nextDouble)))
(areduce ds i res 0.0 (+ res (aget ds i)))))

(time (foo))
= > “Elapsed time: 44.259865 msecs”

Reply
- Scott Locklin said, on November 30, 2011 at 11:53 am
  
  Well, I certainly admitted as much. I just couldn’t see any obvious way to do it differently. I think I see how your code works, though I lack the mikera.util class to make it go without thinking about it. And what about the Incanter bits? They certainly seem to have the same problem; one I don’t find in, say, R. Are they not using Java mutables inside?
  
  I admit, I probably should have tried to use native Java types (and I see now I screwed up casting the list to a vector), but well, it wasn’t terribly obvious. I appreciate the pointer in what might be a productive direction. Maybe (double-array can work the same in Clojure.
  
  FWIIW, this came about trying to speed up someone’s code (which is tens of thousands of times slower than something in CLush); I noticed a reduce + bit in it, and got suspicious.
  
  Reply
  - Mike said, on November 30, 2011 at 12:11 pm
    
    Yeah, Clojure optimisation for numerical code is a bit fiddly – you have to know a bit about what is going on under the hood in order to make it really fast. In general, the trick is to make sure that it is getting compiled down to Java arrays and unboxed primitive operations. The JIT can do its magic after that point.
    
    The good news is a) once you’ve got it working it’s pretty fast – certainly close to what you can achieve in pure Java / other statically compiled languages and b) as it’s a Lisp you can quickly hide the ugliness behind a nice DSL and get on with solving higher level problems.
    
    I’ve been doing a lot of numerical code recently and I’ve convinced myself overall that Clojure is a pretty decent choice providing you’re willing to invest some time in learning the tricks.
    
    Reply
    - Scott Locklin said, on November 30, 2011 at 12:36 pm
      
      Mike: I’ve googled in vain for “tricks.” I’ve found out about =/==, some simple unboxing things and stuff like ^:static and type hints. Can you recommend some resources, or some code to look at for ideas? Other than reading the last two years worth of forums on google groups.
      
      It seems to me that breaking the functional paradigm and using a lot of mutables like happens in Lush land is going to be the way to go. It would be a damn shame if this weren’t done in some standard way the way Yann and Leon did with the idx type in Lush. I mean, I feel pretty retarded for not putting my rands in a double-array, but then I don’t feel so bad when I notice that the Incanter guy apparently made the same mistake!
      
      Reply
      - Jan Rychter said, on November 30, 2011 at 1:36 pm
        
        You should feel bad — starting a poorly-informed article with “Only fast languages are interesting” and then mentioning jwz sets certain expectations. Mine were not met, not by a long shot. Even a first cursory glance at your code shows that you’re measuring both the addition and random number generation, because what you constructed were lazy sequences. I tried, and just using a Clojure vector filled with your numbers gets me significantly better performance than what you posted. So it isn’t really about the “tricks” (or at least not only).
        
        Reply
        
        Scott Locklin said, on November 30, 2011 at 8:04 pm
        
        Yep;it’s probably the fastest and worstest article I’ve written yet. The lazy seqs part; yargh!
        Then again, the nice thing about the internets is people telling you when, and exactly how you’re being stupid.
        
        Reply
        
        John Flanagan said, on December 5, 2011 at 3:30 pm
        
        I wish I could remember more clearly, but I have a vague memory of somebody saying “The best way to crowdsource a solution to your problem is to publicly post that solving it is impossible, and then wait for all the internet pedants to come and ‘prove you wrong’.”
        
        Mission accomplished!
        
        Reply
        
        Scott Locklin said, on December 5, 2011 at 10:21 pm
        
        Indeed. Clojure still sucks at numerics though. With JBLAS+Atlas, it’s pretty OK, but it’s going to be a bear doing this sort of thing every time I need a wavelet or something. I might end up taking Jon’s advice below and looking into Mono.
        
        Reply
        
        John Flanagan said, on December 6, 2011 at 3:00 am
        
        I had a similar experience recently looking at LMAX Disruptor ( http://code.google.com/p/disruptor/ ), which is a project that attempts to solve a very similar problem to the problem that I am personally working on. They have built an extremely low latency messaging interface, which is even more impressive because they managed to do it in Java. They claim 50 ns message latency, which is pretty spectacular. My current best effort is in the 280-300 ns range, which is admittedly still spectacular, particularly since my architecture is interprocess rather than just interthread. It still stings to be slower than a Java implementation though, no matter how smart they were in designing it!
        
        Reply
        
        Scott Locklin said, on December 6, 2011 at 4:20 am
        
        There was some gigantor crossing thing done up in Java … some pretty interesting presentations on it… alas, i can’t remember the name of the thing, but they did some tricky things with writing their own threading system. I’m guessing Java is OK at networky things; possibly even optimized for it if you know what you are doing.
        
        Numerics: not so much. There isn’t even a decent GSL JNI. The only large thing I’ve seen which is numerically oriented is Weka. Everything else, Apache Commons, Colt, whatever: not so much, and not so good. Numbers dudes are using something else.
        
        tr8dr said, on December 6, 2011 at 2:22 pm
        
        J Flanagan said “It still stings to be slower than a Java implementation though, no matter how smart they were in designing it”.
        
        Well, Java certainly earned its reputation for being slow back in the day before Sun brought in the team that did HotSpot. That said, to think that a JIT / VM language should be slower is a fallacy, JIT can employ even more interesting optimisations based on observed working-set behavior than a static precompilation can afford. Java server is *very* competitive with some of the best C compilers.
        
        The main issue that the LMAX folks (and anyone else interested in consistent latency) needed to deal with was uncontrolled GC, not code performance. Their concurrent datastructures were an excellent prescription for high-volume low-latency though and took GC out of the picture.
        
        Of course Java and the JVM are hobbled with lack of value types and a number of other “features”. It means that the hotspot optimiser needs to be a lot more clever than a C++ compiler when dealing with classes vs structs, for example. Java’s safety features (such as array bounds checking) are also difficult optimisation problems (however HotSpot does an amazing job of proving containment and avoiding range checks).
        
        I think in terms of VMs there are others out there that stand a better chance of being more “optimizable”, for instance the CLR. The VM was designed with value types, ability to do pointer ops, and support for a much larger set of language constructs.
        
        Aside from the rich environment afforded by VMs, the main draw for me is that I can stay away from C++, which I find to be distasteful (though I spent many years coding with it in the past).
        
        John Flanagan said, on December 6, 2011 at 4:57 pm
        
        Not letting me respond directly to tr8dr, presumably limited comment depth? I always wander into these conversations late.
        
        Yes, at my last job they were getting repeatedly bitten in the ass by java GC. From the latency standpoint, GC is a terrible misfeature. All that development effort that GC is supposed to save by “not worrying about memory allocation” is thrown away by… you guessed it! HAVING TO WORRY ABOUT YOUR MEMORY ALLOCATION. And since it’s so tightly interwoven with the language, you have to bend over backwards to avoid allocation. I’ve heard of a number of trading shops that use Java and simply throw a ridiculous amount of memory at the app, and restart the app nightly to clean everything up. Which effectively means they just LEAK all that memory, in order to avoid GC. We all could have done that 20 years ago in C, never needed Java for that. 🙂
        
        I’ve heard about the JIT working set optimization thing before, but it seems a bit overblown to me. How is that different from running against production loads with profiling, and then doing profile-feedback optimization? Well, for one thing, you get it automatically. Yes, I agree that’s nice. But those types of optimization are still available in C++ if you’re willing to put forth the effort.
        
        tr8dr said, on December 6, 2011 at 7:14 pm
        
        @J Flanagan : It is a very specialized application (such as HF trading) that requires a very consistent response profile (in not going away to GC at incovenient times). For most applications GC is not important, and in fact is certainly much more efficient than the default sort of memory alloc scheme one might use with stl auto_ptr or shared_ptr.
        
        The hoops one will jump through to avoid GC are almost exactly the same hoops one will jump through in constructing a C++ application for the same purpose: basically, avoid allocations that need to be destroyed.
        
        So in the specialized application domain, this has to be constructed carefully in either setting. In the more general setting, the VM wins out in terms of amount of code and often in terms of total dealloc cost.
        
        Scott Locklin said, on December 6, 2011 at 9:53 pm
        
        FWIIW, J. Flanagan slings code for HFT firms, so he’s talking about exactly what you think he’s talking about. Oddly, I met him in a spooky nightclub. You meet some interesting people in spooky nightclubs.
        Sorry: wordpress comment stacksize is 10 deep.
        
        ObClujureBitching: JBLAS is pretty good, but I fear for the next time I have to reach for something on the JNI. Hopefully, it and the JGSL thing will make a good template. It seems like one should be able to dummy up tools which will help the SWIG process from within Clojure itself. I’m hoping I can. After a glance at IKVM, I’m skeered enough to stick with Clojure for a while longer.
    - sigviper said, on November 30, 2011 at 3:05 pm
      
      Hacker News today featured an article about Scala having the same problems:
      http://codahale.com/downloads/email-to-donald.txt
      
      It seems that knowing what happends with lang freatures running on JVM is very importand.
      
      I hate Clojure, but arguments presented here are not really convincing.
      
      Reply
- Mike said, on November 30, 2011 at 11:58 am
  
  p.s. the random number generation is actually taking the majority of the time – if I do the above without the RNG calls the addition itself is 16ms.
  
  Reply
Ahmet said, on November 30, 2011 at 11:40 am

Try pure Java.

Reply

Beuno Deferrari. said, on November 30, 2011 at 11:47 am

I think you should wrap your sequence with a double-array call, here are my results:

	> clj
	Clojure 1.3.0
	user=> (def rands (repeatedly rand))
	#'user/rands
	user=> (def ^doubles tmp (take 300000 rands))
	#'user/tmp
	user=> (time (reduce + tmp))
	"Elapsed time: 433.227 msecs"
	149973.6787182885
	user=> (time (reduce + tmp))
	"Elapsed time: 36.465 msecs"
	149973.6787182885
	user=> (time (reduce + tmp))
	"Elapsed time: 30.094 msecs"
	149973.6787182885
	user=> (time (reduce + tmp))
	"Elapsed time: 13.557 msecs"
	149973.6787182885
	user=> (time (reduce + tmp))
	"Elapsed time: 13.408 msecs"
	149973.6787182885
	user=> (def tmp2 (double-array tmp))
	#'user/tmp2
	user=> (time (reduce + tmp2))
	"Elapsed time: 15.851 msecs"
	149973.6787182885
	user=> (time (reduce + tmp2))
	"Elapsed time: 6.072 msecs"
	149973.6787182885
	user=> (time (reduce + tmp2))
	"Elapsed time: 2.824 msecs"
	149973.6787182885
	user=> (time (reduce + tmp2))
	"Elapsed time: 3.009 msecs"
	149973.6787182885
	user=> (def tmp3 (double-array 3000000 rands))
	OutOfMemoryError Java heap space clojure.core/repeatedly (core.clj:4512)

view raw

gistfile1.txt

hosted with ❤ by GitHub

Scott Locklin said, on November 30, 2011 at 11:55 am

Thanks Bruno; I guess I screwed up my casting in clojure. Doh! Also, running out of memory on a 3 million element thingee makes me sad.

Reply
- Bruno Deferrari said, on November 30, 2011 at 1:02 pm
  
  What double-array does is more about constructing a complete array in memory instead of using a lazy sequence (in which case you are doing more than just adding the numbers of the sequence, because you have to construct it as you go, and it’s internal implementation may be less efficient than an array).
  
  Reply

Hiccup said, on November 30, 2011 at 12:29 pm

Have you tried doing a benchmark of lush along side SciPy? I would love to know the comparisons.

Reply
Rob said, on November 30, 2011 at 12:36 pm

Building off of what Mike did, here’s a version that only times the addition and uses only “stock” functions, so anyone should be able to run it.

(defn add-rands []
(let [ds (double-array 30000000)]
(dotimes [i 30000000] (aset ds i (Math/random)))
(time (areduce ds i res 0.0 (+ res (aget ds i))))))

As written, this successfully creates 30,000,000 million random numbers and adds them up (in 73ms on my machine).

The reason that Beuno’s version choked on 3,000,000 million numbers is that it had to build the full linked list of boxed numbers before putting it into the primitive array. Mike’s method doesn’t need to build the full list first.

Scott, I wonder if you could run this code, so we could compare the run times of the different languages/methods on the same machine. (I never got the Lush code running. I got an error about double-array after the last problem.)

Reply
- Bruno Deferrari said, on November 30, 2011 at 1:00 pm
  
  Hi Rob. I still run out of space with your version:
  
  OutOfMemoryError Java heap space clojure.lang.Numbers.double_array (Numbers.java:1047)
  
  How do I increase the heap size of the JVM? I’m exporting JAVA_OPTS=”-Xmx2048m” but that doesn’t seem to work (sorry, I’m not a Clojure user, my knowledge there is superficial).
  
  Reply
- Scott Locklin said, on November 30, 2011 at 8:10 pm
  
  Trying this on my laptop, since I’m away from my old desktop:
  Your function returns in 64ms. Lush (measuring the random number generation also): 60ms. Pretty good, and thanks to you and Mike for showing me the way.
  
  Reply
André Thieme said, on November 30, 2011 at 12:39 pm

OMG.
First of all: a language can not be fast. Specific code for a concrete *implementation* can be perceived as fast.

Is your 0.2 s Lush code also measuring the time required to generate the random numbers? In Clojure you are measuring it, and this takes nearly all of your computation time. Also, do you use Clojure 1.3? In the repl eval *clojure-version* and see.
Then you store the numbers in a lazy seq, not in an array, so your ^double type hint was wrong. This needs to be realized first. So, why not store the 3 mio numbers in a double array?
(def x (double-array (repeatedly 3000000 rand)))

That requires quite a bit of RAM, and the out-of-memory exception you got was not Clojures fault.
So instead of adding 30 mio random numbers just add your 3 mios ten times:
(time (dotimes [i 10] (reduce + x)))
==> “Elapsed time: 312.968732 msecs”

This is admittedly a bit slower, but I also don’t have the fastest machine.

Reply
- Scott Locklin said, on November 30, 2011 at 8:11 pm
  
  Thanks for the concrete explanation of why I was being a weenie!
  
  Reply
George Wolfe said, on November 30, 2011 at 1:06 pm

Mathematica has lispy goodness and a lot more. Here is the same test:

Timing[Total[RandomReal[1, 3000000]]]
{0.047, 1.50001*10^6}

which took 0.047 seconds. You can do a lot of other things pretty easily, for example, find out how many of the first 3 million integers are primes:

Timing[Length[Select[Range[3000000], PrimeQ[#] &]]]
{2.761, 216816}

It took 2.761 seconds to count the 216,816 primes.

Reply
Bruno Deferrari said, on November 30, 2011 at 1:10 pm

Oh, btw, for some reason areduce works much slower than plain ‘reduce +’ for me (I tried that before posting my first comment, but removed it from the example because it was much slower).

Do you know why is this? I expected areduce to be the faster of the options (inlined, specialized, etc).

Can you try using a plain reduce and see if it is faster for you than areduce?

Reply
Albert Cardona said, on November 30, 2011 at 1:12 pm

When you can’t find a way to formulate the problem in a way that all the cruft is removed by JIT, you are still left with running java code inside clojure. For example:

(ns my.numeric.test
(:import [fiji.scripting Weaver]))

(def a (ref nil))

; Compile once
(def w (Weaver/inline
“int n = ((Number)ref.deref()).intValue();
double sum = 0;
for (int i=0; i<n; ++i) {
sum += Math.random();
}
return sum;"
{"ref" a}
Double true))

; Run many times by altering 'a' to new numbers (here the same is used every time)
(println
(dotimes [i 10]
(dosync (alter a (fn [_] 300000)))
(time (.call w))))

"Elapsed time: 15.05816 msecs"
"Elapsed time: 14.85436 msecs"
"Elapsed time: 11.91884 msecs"
"Elapsed time: 11.85584 msecs"
"Elapsed time: 11.81192 msecs"
"Elapsed time: 11.99564 msecs"
"Elapsed time: 11.84916 msecs"
"Elapsed time: 11.81324 msecs"
"Elapsed time: 17.2514 msecs"
"Elapsed time: 12.896 msecs"

But for most operations you don't need to drop into java. For example, the above can be done with a function that has primitive arguments and a primitive return type (still has a cost), or with a loop:

(ns my.numeric.test)

(def a (ref nil))

(defn ^double sum
[^double a
^double b]
(+ a b))

(defn w1
[n]
(reduce sum (take n (repeatedly rand))))

(defn w1
[n]
(reduce + (take n (repeatedly rand))))

(defn w3
[^long n]
(loop [i (long 0)
sum (double 0)]
(if (< i n)
(recur (inc i) (+ sum (rand))))))

(println "w1:")
(dotimes [i 5]
(time (w1 300000)))

(println "w2:")
(dotimes [i 5]
(time (w2 300000)))

(println "w3:")
(dotimes [i 5]
(time (w3 300000)))

w1:
"Elapsed time: 114.56856 msecs"
"Elapsed time: 112.19752 msecs"
"Elapsed time: 114.77348 msecs"
"Elapsed time: 117.44124 msecs"
"Elapsed time: 111.204 msecs"
w2:
"Elapsed time: 14.44688 msecs"
"Elapsed time: 17.13656 msecs"
"Elapsed time: 14.2618 msecs"
"Elapsed time: 13.70596 msecs"
"Elapsed time: 17.40376 msecs"
w3:
"Elapsed time: 19.48052 msecs"
"Elapsed time: 17.22828 msecs"
"Elapsed time: 13.95 msecs"
"Elapsed time: 18.7912 msecs"
"Elapsed time: 13.95832 msecs"

Reply

Albert Cardona said, on November 30, 2011 at 1:15 pm

Better formatted:

	(ns my.numeric.test
	(:import [fiji.scripting Weaver]))

	(def a (ref nil))

	; Compile once
	(def w (Weaver/inline
	"int n = ((Number)ref.deref()).intValue();
	double sum = 0;
	for (int i=0; i<n; ++i) {
	sum += Math.random();
	}
	return sum;"
	{"ref" a}
	Double true))

	; Run many times by altering 'a' to new numbers (here the same is used every time)
	(println
	(dotimes [i 10]
	(dosync (alter a (fn [_] 300000)))
	(time (.call w))))

view raw

gistfile1.clj

hosted with ❤ by GitHub

	(ns my.numeric.test)

	(def a (ref nil))

	(defn ^double sum
	[^double a
	^double b]
	(+ a b))

	(defn w1
	[n]
	(reduce sum (take n (repeatedly rand))))

	(defn w1
	[n]
	(reduce + (take n (repeatedly rand))))

	(defn w3
	[^long n]
	(loop [i (long 0)
	sum (double 0)]
	(if (< i n)
	(recur (inc i) (+ sum (rand))))))

	(println "w1:")
	(dotimes [i 5]
	(time (w1 300000)))

	(println "w2:")
	(dotimes [i 5]
	(time (w2 300000)))


	(println "w3:")
	(dotimes [i 5]
	(time (w3 300000)))

view raw

gistfile1.clj

hosted with ❤ by GitHub

George Wolfe said, on November 30, 2011 at 1:27 pm

Totally off topic. Sorry.

Reply
testanon said, on November 30, 2011 at 1:38 pm

I just summed up 30M random integer elements of a std::vector in c++ and printed the result. It took 13ms not measuring the creation of the vecor (essentially 30M rand() due to v.reserve(..)).

All I’m trying to say is that there is more to interesting languages than being “fast”. If there isn’t, anything not comparing against (amongst others) C or C++, is meaningless.

Reply
George Wolfe said, on November 30, 2011 at 1:44 pm

I guess it’s not off topic. Scott, please delete both the “off topic” comment and also this request.

Reply
yvdriess said, on November 30, 2011 at 1:46 pm

If I could put forward some constructive criticism: it’s dangerous to compare straight-up native functions.

For example you use clojure’s reduce with a specific summing reduce.

Using common lisp (sbcl), the generic reduce will take around 0.2 secs.

CL-USER>
(defun test ()
(let ((rands (make-array ‘(3000000) :element-type ‘single-float)))
(loop for i from 0 below 3000000
do (setf (aref rands i) (random 1.0)))
(time (reduce #’+ rands))))

CL-USER> (test)
Evaluation took:
0.190 seconds of real time
0.189877 seconds of total run time (0.188758 user, 0.001119 system)
[ Run times consist of 0.012 seconds GC time, and 0.178 seconds non-GC time. ]
100.00% CPU
505,536,865 processor cycles
96,008,176 bytes consed

1499890.0

The same code using a regular accumulating loop takes 0.02 secs.
(defun test2 ()
(let ((rands (make-array ‘(3000000))))
(loop for i from 0 below 3000000
do (setf (aref rands i) (random 1.0)))
(time (loop for rand across rands summing rand))))
Evaluation took:
0.026 seconds of real time
0.025775 seconds of total run time (0.025759 user, 0.000016 system)
100.00% CPU
68,541,277 processor cycles
0 bytes consed

The reason why is the same as Rob commented above: some optimizations will not trigger, real lists or closures need to be constructed etc. Compare the bytes consed in both examples for example.

Reply
fogus said, on November 30, 2011 at 2:05 pm

Another data point:

(def ds (double-array 3000000))

(dotimes [i 3000000] (aset ds i (rand)))

(defn go [^doubles d] (reduce +’ d))

(time (go ds))
; “Elapsed time: 44.434 msecs”
;=> 1501092.228777185

Reply
RyanH said, on November 30, 2011 at 2:19 pm

In Q this would be the code and timing it shows 35 milliseconds:
q)\t sum 3000000?100
35

Or splitting up the rand and the summing (29 and 6 milliseconds):
q)\t a:3000000?100
29
q)\t sum a
6

You can download q here http://kx.com/trialsoftware.php 🙂

Reply
- Scott Locklin said, on November 30, 2011 at 8:13 pm
  
  Q is awesome, and more or less the experience I was looking for in Clojure. Looks like I can get it, but I need to build me a library which does basic matrix operations in a way a little more Lushy a fashion.
  
  Reply
David Nolen said, on November 30, 2011 at 3:18 pm

As far as Incanter – did you try (sum (sample-uniform 3000000 :mean 0)) ?

If that works, then you’ve encountered the head-holding problem. You don’t generally want to store potentially large (lazy) data structures at the top level with a variable like tmp. The whole things is kept in memory instead of being garbage collected as you compute the value that you actually care about.

Clojure arithmetic code as of 1.3.0 can usually achieve the performance of 64 primitive Java arithmetic code with little effort.

Reply
Aaron said, on November 30, 2011 at 3:57 pm

If you are new to the JVM, I just wanted to let you know there are different VM’s. I remember getting better performance for -server (vs -client). It’s worth giving a shot if you haven’t already.

Reply
- Scott Locklin said, on November 30, 2011 at 8:15 pm
  
  I managed to figure out this much. I see now my laptop is using openJDK server, which is all kinds of wrong. I know my desktop is using Sun 1.6.x server.
  
  Reply
atc said, on November 30, 2011 at 4:43 pm

Did anyone say Clojure should be used for numerical processing?

Reply
Video Anak Kecil « CITRO MADURA said, on November 30, 2011 at 5:25 pm

[…] Only fast languages are interesting […]

Reply
cpcloud said, on November 30, 2011 at 5:39 pm

NICE! I’m always shocked at how rare it is to find a Lush user. It’s such a wonderful little language.

Reply
- Scott Locklin said, on November 30, 2011 at 8:34 pm
  
  It’s the coolest thing I’ve ever used for numerics. Everything is as it should be, and all batteries are included.
  
  Reply

Joao said, on November 30, 2011 at 5:55 pm

Hi,

This is the first time I’ve been able to grasp a little the way Lisp-y syntax works. With such a simple test case and meaningful statements I could see what was going on. That inspired me to create a version of it in the Go programming language. For the test I used 30,000,000 size for the list.

	$ 8g append.go && 8l -o goappend append.8 && time ./goappend
	Elapsed time: 0.18
	1.5000704643886663e+07

	real 0m4.944s
	user 0m4.664s
	sys 0m0.236s
	$ cat append.go
	package main

	import (
	"fmt"
	"rand"
	"time"
	)

	const _LIST_SIZE int64 = 30000000

	func main() {
	nn := make([]float64, _LIST_SIZE)
	for i := int64(0); i < _LIST_SIZE; i++ {
	nn[i] = rand.Float64()
	}
	sum := float64(0.0)
	time1 := time.Nanoseconds()
	for _, v := range nn {
	sum += v
	}
	elapsed := float32(time.Nanoseconds() – time1) / 1e9
	fmt.Printf("Elapsed time: %.2f\n", elapsed)
	fmt.Println(sum)
	}
	$

view raw

gistfile1.txt

hosted with ❤ by GitHub

I had also tried creating a version of it for the Dart programming language but it’s early days for Dart and it was having issues dealing with the generated memory and its garbage collector, which happens to be a known issue at the moment. When converting it to Javascript to run on the Node.JS runtime it was working though. With 3,000,000 list size:

	$ frogsh append.dart
	Elapsed time: 0.04
	1500237.4288328215
	$ cat append.dart
	main() {
	var nn = [];
	for (var i = 0; i < 3000000; i++) {
	nn.add(Math.random());
	}
	var sum = 0;
	var sw = new Stopwatch();
	sw.start();
	for (var n in nn) {
	sum += n;
	}
	sw.stop();
	print("Elapsed time: ${sw.elapsedInMs() / 1000}");
	print(sum);
	}
	$

view raw

gistfile1.txt

hosted with ❤ by GitHub

In Go we can try to make a list more dynamic by appending to a slice and reassigning it to the variable. Like so: list = append(list, aNumber)

Only problem is that it works extra time to keep it growing and also blows up with out of memory for this test. 🙂

Adam H said, on November 30, 2011 at 7:06 pm

For a quick test, I tried running these tests in C++, because that’s the thing I use all day. Of course, wall-clock times aren’t directly comparable because I have a different machine, but I figured it’d be interesting to at least get a ballpark figure.

Generating the vector of random numbers seems to take quite a while. 3,000,000 numbers took about 64ms, and 30,000,000 numbers took about 640ms. Much of the cost was probably in the hand-rolled random function, because C++ doesn’t have a built-in way get random numbers into a double. I did manage to roughly halve these times by using a dynamically allocated array of doubles instead of a vector.

The summation results are a lot better. Summing 30,000,000 numbers took 28ms, and summing 3,000,000 numbers took 3ms.

Reply
- Scott Locklin said, on November 30, 2011 at 8:17 pm
  
  Thanks!
  
  Reply
Sean Nilan said, on November 30, 2011 at 7:58 pm

The Java double array is definitely the way to go if you want performance.

However, I was curious and tried to use a lazy sequence (calculates the numbers only as needed). The macro for generates such a lazy sequence.

user=> (def a (for [i (range 3000000)] (Math/random))) ; try 3 million number
#’user/a

user=> (time (reduce + a))
“Elapsed time: 451.658 msecs”
1500419.8606559571

user=> (def a (for [i (range 30000000)] (Math/random))) ; now 30 million numbers
#’user/a

user=> (time (reduce + a))
java.lang.OutOfMemoryError: Java heap space

So my results for a lazy sequence were similar to what you got above. At some point, the program crashes when you feed it too many numbers.

I figured this was because I had a reference to the lazy sequence so the compiler could not perform any garbage collection during the reduce. So I next tried to do an anonymous lazy sequence:

First start a new REPL session to get rid of heap allocations.

user=> (time (reduce + (for [i (range 3000000)] (Math/random)))) ;; 3 million numbers
“Elapsed time: 327.562 msecs”
1500673.092587838

user=> (time (reduce + (for [i (range 30000000)] (Math/random)))) ;; 30 million numbers
“Elapsed time: 2094.698 msecs”
1.5000632187204706E7

I then tried it for 300 million numbers, and it took about 20 seconds, but it did not crash due to a memory error. 3 billion numbers took about 219 seconds.

Using Java primitive types is definitely the way to go performance-wise, but I thought you might find it interesting that using an unreferenced lazy sequence avoids the heap error.

Reply
akashchakrawarti said, on November 30, 2011 at 10:17 pm

Ohh Java , Great Language it is !

Like it and the post is too helpful for me as Java Professional .

Thanks For writing !

Reply
kaffiene said, on December 1, 2011 at 3:24 am

Just to note: doing the ‘insane’ thing of using plain old evil Java gives me 30 million doubles summed in 0.05s.

So, if only fast languages are interesting, I presume you’ll be switching sometime soon?

Reply
- Scott Locklin said, on December 1, 2011 at 4:46 am
  
  I suspect I’ll be screwing around in Java, actually. I can’t really avoid it at this point. All things considered, I’d rather be screwing around in plain old C.
  
  Reply
  - kaffiene said, on December 1, 2011 at 8:35 pm
    
    I really love C – one of my favourite languages. I don’t get your Java hate though. Performance wise, it compares well to C. The one area I’ve always found C faster though is anything that’s particularly memory intensive.
    
    Reply
    - Scott Locklin said, on December 1, 2011 at 9:28 pm
      
      Well, it ain’t hate exactly. More like, “I do not know this, and have invested years of my life in other things: why should I learn it?” Java is wordy, like C++; everything is OO, and everything needs declarations. It’s good that I can’t get into as much trouble as C++, but sometimes I’m looking for trouble. I also have an unfortunate experience asking Java dudes to do stuff like multiply two matrices together (back in 2004 I think) and having them say, “huh?” I know you can do it now, but such experiences imply a cultural disconnect which indicate Java-land doesn’t care about my issues (which are about 90% matrix multiplication, accompanied by the occasional very memory intensive things sometimes). Clojure land doesn’t really care about my issues either.
      
      Finally, there is shit like this, which again tells me that Java-land doesn’t care about my issues:
      http://stackoverflow.com/questions/529457/performance-of-java-matrix-math-libraries
      I had a peek at Parallel Colt’s source. It is impressive as a parallel implementation of matrix multiplication, but as fast numerics code, it is an abomination unto all that is fast and holy. I mean, they used motherfucking FOR LOOPS to multiply matrices together. Maybe this is the only way to do things on Java; that’s part of the problem. I think they’re thinking about humongous matrices multiplied across the cloud. OK, that’s nice: the ratio of people who need this, compared to the people who need their matrix multiplications optimized for cache hits is very small. Sure, some day maybe the JVM will optimize them threads and for loops away on something like Colt, but meanwhile, Atlas beats them by a factor of 50, which is pretty serious, and I got problems that need to be solved, like, yesterday. Basically, Java don’t care about me: why should I care about Java? I’m sure Java is a great tool for many people; it’s just not focused on the issues I have.
      
      Remember: Lush is doing this faster than Java, and I have stuff like macros and an interactive repl. I also have a compiler with which I can freely interact with C (and probably go faster still). If I could figure out how to interact with those Java libs I want without invoking lucifer, I’d probably keep using it and be happy. But alas, nothing is perfect.
      
      Language wars are kind of pointless. Everyone has different needs. Maybe I’ll get BLAS working on Clojure and be happy. Meanwhile, “only fast languages are interesting.”
      
      Reply
      - kaffiene said, on December 4, 2011 at 10:41 pm
        
        “Remember: Lush is doing this faster than Java”
        
        Huh. When I checked. the 0.05sec to run your test in Java seemed less than the 0.18sec to run that in Lush, but maybe I’m mathematically impaired and don;t understand greater-than.
        
        If you don’t like the style of Java, well fine, I can appreciate that. Lots of people don’t like LISP either so that’s clearly a matter of personal taste. (FWIW, I love LISP)
        
        “I mean, they used motherfucking FOR LOOPS”
        
        ….and I think this gets to the root of the issue. I don’t have a religious issue about for loops. I *like* functional programming language features but if a for loop is the quickest way to get stuff done, then bring it on. Obviously your mileage varies, and I do understand and acknowledge that a few less characters used in writing your solution really can be an issue for some people.
        
        Personally, I just like getting stuff done.
        
        Reply
        
        Scott Locklin said, on December 5, 2011 at 12:21 am
        
        Ah, somehow I thought you were summing 3 million, and anyway, I have no idea what kind of computer machine you have. The one which produced that number is an old 2Ghz 2-core. If I want to go through the overhead of declaring variables and such like you had to do, Lush also comes with both a native and a C compiler (aka, you can just write C in the middle of a Lush function and compile it). Were this a match to the death as to who could sum 30 million values together quickest, I am pretty sure Lush would win. No matter how awesome you are at Java numerics, I’d also win using Lush to code up a solution to a complex numerics problem on a tight deadline, just because it’s built for that sort of thing.
        
        As I said, I don’t hate Java. It’s just not the right tool for the types of jobs I do. I wouldn’t be looking at Clojure if it were not for Java interop. Java apparently allows you to write all kinds of useful code which I have no real skill or interest in writing, like Esper or one of them NoSQL DBs. Clojure + Java is great: I get to call all that stuff from a REPL. It’s really awesomely easy and cool; better than the Lush/C abilities in its own way. But … the question of whether or not I can do what I need done is still open. Matrix math: yes, JBLAS + ATLAS (yay Fortran and C!) is a winner. Other stuff…. mmmm, maybe.
        
        The “for loop” thing is not a language syntax thing. I’m fine with for loops. Frankly, I am more comfortable with a for loop than many functional looping constructions. The issue is: you can’t build fast code out of for loops in most language specifications. The overhead is big, and you can’t optimize for cache hits, which is of supreme importance in “real” numerics code. When I look inside COLT and see a sophomoric threaded matrix multiply, I can not help but sneer at it a little. You can’t get shit done that way. That’s why ATLAS beats COLT by factors of 50 or 100. It’s a mindset, and a set of possibilities. I am mind boggled that people use things like COLT and are happy with the results.
        
        Reply
        
        Tu said, on June 7, 2013 at 8:49 pm
        
        I’m just curious. How matrix multiplication be implemented without for loop in imperative languages like C or Java?
        
        Reply
        
        Scott Locklin said, on June 7, 2013 at 9:10 pm
        
        For C: macros which pipeline the matrix multiply. You can see examples of this in ATLAS, and probably APL sources.
        I have no idea how they’d do it in Java; I wrote a Clojure wrapper for ATLAS, and it’s a whole lot faster than Colt. Optimization in Java seems to be “hope for the best from your JVM,” but as matrix math is not a priority, something like ATLAS wins. Maybe some day this will not be the case, but I’ve abandoned the Java platform for what I was trying to do anyway.
        
        Reply
Stanislav Datskovskiy said, on December 1, 2011 at 6:25 am

Why not Common Lisp?

SBCL to be exact. 100% native code compilation, full numeric tower.

Reply
- Scott Locklin said, on December 1, 2011 at 6:52 am
  
  No Java interop, for starters (though I guess there is ABCL). I could go on: fragmented user community (though it looks like SBCL is becoming something of a standard), no useful native numerics libs to speak of, no stats engine, no useful plotting libraries (McClim may offer something by now: it didn’t exist when I started this project), no concurrency, no proven capacity for large data problems, no software for interfacing with my broker (Java), data vendor (Java) or Esper (Java). I’ll stick with Lush until I can figure out how to make Clojure do what I need done.
  
  As a language, CL is much more mature, and I like familiar technology like compilers that do what I tell them to. When I started my project, it was a choice between Lush, CL and OCaML. Lush had the most bits which did what I need, though OCaML was reasonably close. Now I need moar stuff. Had I gone with CL, and trusted in Allah to provide, well, I’d have spent a lot of time building up support libs rather than solving problems, and I’d still be in the position I’m in now. CL was my first Lispy girlfriend, but I didn’t get married.
  
  Reply
  - Stanislav Datskovskiy said, on December 1, 2011 at 5:00 pm
    
    Only one nitpick – SBCL under sane OS (almost any but Windows) has native threads.
    
    Reply
MATLAB said, on December 1, 2011 at 1:53 pm

Ha ha ha! Guys, drop your toys. Here comes the real thing:

>> x = rand(3000000, 1);
>> timeit(@()sum(x))
ans =
0.0049
>> x = rand(30000000, 1);
>> timeit(@()sum(x))
ans =
0.0478

Results (in seconds) obtained on a Core2 Duo @2.4 GHz, DDR2-800. 32-bit MATLAB running inside a virtual machine (VMware Workstation).

As a bonus modern MATLAB versions can use Java objects natively.

Reply
- Scott Locklin said, on December 1, 2011 at 6:01 pm
  
  Matlab isn’t a terrible choice if you’re not doing any computer science; a lot of the annoying busy work has been done. On the other hand, a lot of the annoying busy work has been done in R as well, and it’s almost as fast.
  
  Reply
- kaffiene said, on December 1, 2011 at 8:37 pm
  
  That’s identical to the performance I got doing it in Java
  
  Reply
  - MATLAB said, on December 2, 2011 at 3:20 pm
    
    Come on, dude – don’t cheat. You just forgot to add that you’re on a faster machine.
    
    I did the homework for you: in JVM 6u29 launched with default opts I get 0,0085 s to sum the whole double[3000000]. That’s almost 75% slower than in MATLAB.
    If I force MATLAB to use only one CPU core I get 0,0053 s – Java is still 60% slower.
    
    Reply
    - Scott Locklin said, on December 3, 2011 at 12:23 am
      
      Isn’t Matlab some kind of Java thing these days? I remember back in R12, when it was still a bunch of EISPACK calls… Then …. noooo!
      
      Reply
    - kaffiene said, on December 4, 2011 at 10:45 pm
      
      I don’t have Matlab so I can’t compare, but this machine isn’t especially fast.
      
      If you are running with default options then your code will start off as interpretted until it becomes a hot spot and is then compiled. That’s hardly going to be fast nor a fair comparison. You need to run with the -server option to compile first.
      
      Also, are you timing process launch as well or just the time to do the actual summation?
      
      Reply
Jonathan Shore said, on December 1, 2011 at 2:52 pm

The problems you see with performance are specific to Clojure and not the JVM. That said, I no longer subscribe to using the JVM because there are very few JVM languages that I find palatable and the ones that are, do not have the performance characteristics needed.

I’ve moved to F# / C#, running on OSX & Linux. If you need to access java libraries for big data, etc, these can also be used from the .NET CLR through IKVM compilation. I’ve moved over entirely to using Mono with LLVM. The LLVM (which is also the backend for C compilers on OSX) offers very good performance, as opposed to the mono native JIT. One can now select between LLVM and mono native backends.

C# / Mono has 0 or near 0 overhead in interacting with native libraries (such as BLAS), unlike Java, which has an unnatural and expensive JNI interface into native libraries. One can also do low-level operations with pointers and direct memory access if one likes in “unsafe” mode directly from some of the .NET languages.

If you really like, instead of F# can use something like IronScheme. However, I suspect it has a very small community and not sure how well it interacts objects outside of the scheme ecosystem. The CLR does have facilities in the VM that make functional language implementations easier to do and more performant (such as tail call optimisations).

Reply
- Scott Locklin said, on December 1, 2011 at 6:23 pm
  
  This is a very interesting post. I think reading about some of your essays (don’t worry; lots of other people as well) led me to consider Java in the first place. I’m suspecting IKVM compilation isn’t going to work right, but this is mostly because this is the first I’ve heard of it, and I’m suspicious -I’ve been suspicious of Mono since it came out in general. Then again, I had started screwing around with SWIG/JNI and didn’t like what I saw at all. Now, Mono + LLVM, that’s very interesting. I didn’t even know all this was possible.
  
  I think I mention somewhere above that I was originally considering doing the whole thing in OCaML. IronScheme might be interesting as well, if it plays well with others, though really F# would be the logical choice.
  
  I broke out some of Incanter core last night -wasn’t even sure this was possible. I’ll have to run a few benchmarks on the Colt/matrix stuff. I’ve already figured out I can do things with Lush speed; if I can match up performance on matrix ops (and a few other things, like high-D NN searches), it’s a pretty straightforward translation. Otherwise, well, I always wanted to get good at OCaML/F#.
  
  Reply
  - Jonathan Shore said, on December 1, 2011 at 8:15 pm
    
    Unless you want your code to run on google’s app engine (where it must be JVM bytecode I believe), the IKVM approach to get access to java libraries should work well. I’ve been using IKVM for some years now and have to say that it is excellent. Jeroen Frijters, the author of IKVM is very responsive to issues I had in the past (even very subtle ones).
    
    Back to Mono: Since mono 2.8 the mono team put some focus on performance. There were arguments as to whether LLVM should be the path of the future for JIT, where people such as myself argued in favor of using the larger base of optimisations and eyeballs on the LLVM project. The Mono team has now confirmed that LLVM is their JIT approach going forward for performance oriented apps.
    
    The difference in performance between mono JIT and mono-LLVM is about 4x, where LLVM performs at or near the C / Java server level (and sometimes better). I put together a number of numerical benchmarks and was happy to note that C# / LLVM was within +/- 5% or 10% of -O3 C++ compiled code.
    
    Another positive thing about Mono is that it can be used in embedded settings easily. The required runtime is C# / Mono and sometimes F# -> C# / Mono libraries.
    
    Reply
    - Jonathan Shore said, on December 1, 2011 at 8:29 pm
      
      Somehow the last paragraph was garbled (missing lines). Should have read. Another positive thing about Mono is that it can be used in embedded settings easily. The basic library set and VM can be fit into 3-5MB. Can also be compiled into 1 binary using ATO (ahead-of-time) compilation unique to mono.
      
      I use F# -> C# libraries on Mono, but more often either straight C# for server stuff and R -> C# libraries for research. There are so many resources in R it is much easier to use R as a R&D env than, say, the F# REPL. I’m not a huge fan of R as a language, where would much rather see a ML variant or perhaps lisp, but it has the momentum.
      
      Reply
      - Scott Locklin said, on December 1, 2011 at 9:44 pm
        
        Thanks for the detailed breakdown: that’s very helpful. I’m going to try a few more things in Clojure first: probably classic loss aversion bias, but if I can get it fast in the right places, it would be a lot easier to port the stuff I already have.
        
        Speaking of R-like things in an ML dialect. There was an OCaML thing which gave very matlab like interactivity:
        http://home.mindspring.com/~dmcclain1/nmlpromo.htm
        Unfortunately, it seems to be dead as a project. It looked very promising for doing research in, then maybe (maybe) optimizing down to some simple OCaML primitives for fast applications. As I recall, it did the very necessary thing of quieting down the type system for when you just want to screw around with some data and try different things. I agree with you about R; not a huge fan of the language, if you can even call it that, but it’s hard to beat for trying out new ideas. Lush is OK too, though much lower level: sort of Octave level, with a compiler and defmacro.
        
        I was initially very excited about Clojure and Incanter as an eventual R replacement. It will never beat R’s existing library, but for hacking out new ideas, it might have been a Lush beater.
        
        Reply
- kaffiene said, on December 1, 2011 at 8:46 pm
  
  JNI makes more expensive calls to native code than general Java method calls, but if you put your data in native memory via a byte buffer and make ONE call out out to native code to do your work then the overhead is effectively eliminated.
  
  Obviously, if the work absolutely has to be done with millions of little native side transactions then the JNI overhead might be a real issue.
  
  Reply
  - Jonathan Shore said, on December 1, 2011 at 9:06 pm
    
    it is not only performance cost, but the substantial pain in developing JNI. I can, for example, create structures in C# that map exactly to some C data structure or file format. Better yet, can call into C functions with no JNI at all.
    
    It is nice not having to work out how to make JNI perform (I’ve spent many yrs with java and had to deal with this).
    
    Java as a language has been stagnant for quite some time. The have been some small improvements, but it has been slow going, and languages such as C# are far ahead. Anders Hejlsberg has been an excellent steward of the C# language and CLR runtime. They did have a chance to look back and see what java did right and wrong. Still there is no excuse, IMO, for the lack of Java language progress.
    
    Finally, best yet, is that there is an excellent functional, well supported language on the CLR (F#). Clojure has a very good community, but I found inadequate for numerical work. Scala is definitely functional, but overburdened with complexity (and also has performance issues with some basic constructs).
    
    Reply
    - kaffiene said, on December 4, 2011 at 10:51 pm
      
      JNI isn’t straightforward, that’s true. But it shouldn’t be a performance issue and that’s all I was posting about. Whether you like the flavour of it or think Java is neat or not doesn’t really have much of a baring on that.
      
      I haven’t used F# but I have looked at it before and it looks like a really nice functional language which expresses functional ideas in a way which is more readable for most programmers than oCaml or Haskell.
      
      Reply
Jonathan Shore said, on December 1, 2011 at 8:19 pm

With regard to IronScheme, here is a stackoverflow query which should be informative. Sounds like it has good linkage into the CLR object system, but has a miniscuie user base.

http://programmers.stackexchange.com/questions/107368/is-ironscheme-complete-enough-or-stable-enough-to-be-worth-learning

Reply
- Scott Locklin said, on December 1, 2011 at 9:50 pm
  
  Yeah, I saw that after you mentioned IronScheme. It might have some win to it. I’ve been hacking away in Lush for a while now, and the user community is about as small is it gets there. I’m not a religious man, but for the last two years I’ve been praying that Yann and Ralf remain healthy and look both ways before crossing the street. I guess it wouldn’t be a big change adding the IronScheme guy to my novenas.
  You did mention momentum, and Clojure has some which IronScheme doesn’t at present. I’m using that to justify my loss aversion bias. Not that I’ve invested a lot of time in Clojure, but enough to spend another week or two on it before trying out IronScheme.
  
  Reply
Marcus said, on January 18, 2012 at 9:11 am

I am pretty ignorant about java myself but this may be worth a look for matrix multiplication.
http://jblas.org/
Incanter (http://incanter.org/) may also be worth checking out.

Reply
- Scott Locklin said, on January 18, 2012 at 9:25 am
  
  Thank you kindly, Marcus. As a matter of fact, I thought of that, and implemented it a few weeks ago. It’s ludicrously faster (factors of 20-100) than Colt.
  
  https://github.com/locklin/clushjr
  
  Reply