Locklin on science

Choose your weapon: Matlab, R or something else?

Posted in tools by Scott Locklin on May 8, 2009

As a data sort of guy, I use three programming tools on a daily basis, or at least every week. One is Lush, a version of Lisp. Another is Matlab. Lastly, there is the R project.

I don’t want to use three tools for dealing with data, but it’s actually necessary right now. I don’t think it will be necessary forever.

Lush is my general purpose programming language. It’s insanely great. Parts of it are wonky and slow, and parts of it are broken or missing, but it’s a Lisp, it’s fast where I need it, and I like it a lot. More on this in a future entry. I use Lush for speed and original research. If the complex algorithm I need hasn’t already been written in Matlab or R, I might as well write it in Lush. Lush is a high level language with low level speed when you need it. It would be perfect if it had more libraries. The only thing I may potentially like better is OCaML/F#, and frankly, I find that the type inferencer there gets in the way more than it helps. If they made an OCaML where you could turn the type safety off most of the time, that would be better. Or, I could just be like everyone else and use Python or Java for this sort of thing. Not that there is anything wrong with that.

Matlab would be my second choice for hacking out original research. Why Matlab? Matlab is reasonably fast, but one of its main selling points is that it is extremely intuitive if you’ve used Fortran or C, and if you don’t know how to do something, the help system is very informative. Matlab code is also extremely well supported. The debugger, profiler and editor are all excellent; some of the best I’ve used. Sure, someone will argue that they have a more powerful debugger, but Matlab’s is the most handy I’ve yet used. I don’t need to read a manual to use it; I just use it. Sure, emacs is way more powerful than the Matlab editor, but it isn’t as handy as Matlab’s editor. You can use Matlab to do just about anything. I’ve used it to code up embedded systems using xPC target and Real Time Workshop. I’ve used it to code up trading systems, from data feed to broker interface. I’ve embedded it in Excel for end users. I’ve deployed it in Enterprise software used by Fortune 100 companies. It’s amazingly useful stuff, especially if you have the proper toolbox to accomplish your tasks. You can build reasonably good numeric software with it as long as you don’t need fancy “programmy” features like concurrency. If Matlab had a way of making fast compiled code, it would be close to perfect for the type of thing I do. I wouldn’t bother with Lush any more, except when I was trying to write interpreter type things. Alas, Matlab’s way of doing this is to write code for your time critical pieces in C, and embed it into your code in a fairly laborious process. The only real drawbacks to Matlab are speed, plotting and expense.

What is R good for then? Well, R is free, so many academics use it to share their latest econometric or machine learning software with everyone else. As such, just about everything statistical under the sun exists in R. And it’s free! What is not to love? Well, sadly, there is plenty not to love about R. First off, there is speed. R doesn’t seem to have anything that makes it inherently slow for an interpreted language: it should be comparable to Matlab in this regard. But it’s slow enough that most people do their heavy work in other languages. Most of the modules written for it have most of the code written in C or Fortran. This is somewhat true of Matlab also, and for the same reasons, but Matlab has a trivial way of telling you what you need to speed up, so R will always end up slower in practice. Second, there is debugging. R is hard to debug. First off, it doesn’t drop you into an interactive top level the way Matlab (or Lush, or Python or anything where you write Real Programs) does. That sucks a lot, and removes a bunch of the utility of using an interpreted language. Oh, sure, there is a debugger, but it is buggy, poorly documented, and doesn’t work in the simple way that Matlab’s does. Thirdly, there is the syntax. Personally, I like the syntax; it’s a lot like OCaML. But most people don’t. What is more, the help system is very close to worthless if you’re trying to remember a simple command. People may say this is unfair, as I am just not used to R, but the fact is, I’ll never get as used to it as Matlab, and neither will anyone else. Oh, it’s OK for finding packages you want if you can think of the right keyword for them. But compared to Matlab, or even something like Lush, its online help is pretty worthless. Fourthly: for programming, while it should be better than Matlab in many ways, I haven’t ever seen a legible R program which was over 100 lines. I don’t know how they manage this. Part of it is doubtless that the IDEs are rather bad. I don’t know anyone who claims they can write good, large pieces of software for R. I once asked a guy how he wrote big pieces of software, and he said, “very carefully.”

This sounds pretty bad, but there are solid reasons to use R. For one thing, it’s free. There is a lot to be said for free. Among other things, if you want to give some code away for others to play with, R is going to be a better vehicle than distributing raw C or a Matlab package. For another thing, it has a tremendous amount of work done on various hard numeric problems, and installation is trivial: just press a button. Want to wire the latest AdaBoost up to your database, and plot some nice results? Pretty easy in R. I might be able to do all this in Matlab, with the correct packages and so on, but in R, it’s the work of seconds. Another thing: it’s a lot easier to make fancy plots in R than it is in Matlab. Matlab’s plotting utility is from the dark ages. It’s insanely bad. You can abstract some of its badness away with objects, but … you shouldn’t have to. Finally, for interacting with data, R wins. Matlab’s matrix paradigm makes it easy to use, but data.frames are more powerful.

Here’s how my decision tree works. When I first heard about Benford’s law, I decided it was simple enough; I’d hack it out in Lush. I did. It worked, and I fiddled with it. Then I realized that a goodness-of-fit test against Benford’s distribution might be nice. I had chi-squared distributions already coded up in Lush, and some curve fitting stuff … but wiring it all together, then fiddling with the plotting routines: ugh. So, Google informed me that some nice statistician had done all that work for me in R. So I used R. Probably, someone did it in Matlab also (actually, someone did), but it’s a pain to fire up my Windows laptop with Matlab on it, so I just went with R. That’s what R is good for. At some point, I’ll get Lush talking to R, at which point I may cease using Matlab unless someone pays me to do so. It will never be as slick as Matlab, and I will miss all the great user productivity features that Matlab offers, but it will get the job done better and quicker, I think.
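
For anyone wondering what that goodness-of-fit step amounts to, here is a minimal sketch in Python. It is not the Lush code or the R package mentioned above, neither of which appears in this post; the function names and the two test datasets are made up for illustration. It compares observed leading-digit counts against Benford’s probabilities, log10(1 + 1/d), using a Pearson chi-squared statistic, and judges the result against the standard 5% critical value for 8 degrees of freedom rather than computing a p-value.

```python
import math
from collections import Counter

# Chi-squared critical value for 8 degrees of freedom at the 5% level
# (nine leading digits minus one constraint).
CHI2_CRIT_DF8_05 = 15.507


def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for leading digit d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """Leading decimal digit of a positive integer."""
    return int(str(x)[0])


def benford_chi2(values):
    """Pearson chi-squared statistic of observed leading digits vs. Benford."""
    counts = Counter(leading_digit(v) for v in values)
    n = sum(counts.values())
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )


# Hypothetical test data: powers of 2 are a textbook Benford-like sequence,
# while the integers 1..999 have uniformly distributed leading digits.
powers = [2 ** k for k in range(1000)]
uniform = list(range(1, 1000))

stat_powers = benford_chi2(powers)
stat_uniform = benford_chi2(uniform)

print("powers of 2:", stat_powers, "reject?", stat_powers > CHI2_CRIT_DF8_05)
print("1..999:     ", stat_uniform, "reject?", stat_uniform > CHI2_CRIT_DF8_05)
```

A statistic below the critical value means the data are consistent with Benford’s distribution at that level; a proper package would also hand back the p-value and a plot, which is exactly the wiring I didn’t feel like doing by hand.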

I use the cheat sheets in R a lot, for lack of a better help system, so if you want to fool around with it:
A cheat sheet
A better cheat sheet
Other R documents


23 Responses


  1. R and data « Erehweb’s Blog said, on May 27, 2009 at 5:42 am

    [...] and data By erehweb My fellow bloggers John and Scott have posted recently about the free statistical programming language R.  How does it compare to an [...]

  2. adam said, on August 6, 2009 at 2:11 pm

    Hi – Just found this blog and I’m really enjoying your writing.

    In regards to R’s debugger – I’ll agree it’s poorly documented, but I haven’t found it to be that bad. Have you tried “options(error=recover)” and “withCallingHandlers(fun(), warning=function(c) recover())”? Also, I haven’t tried it (and it may be what you were talking about as ‘buggy’) but the debug package (install.packages("debug")) looks promising in terms of what you want.

    • Scott Locklin said, on August 6, 2009 at 8:47 pm

      The last time I tried debug, it crashed my system and made me sad. I was going to try it again for this article, but I didn’t see it available for R2.9/OS-X at the time. It seems to be available now. Fiddling for 30 seconds, I’m remembering issues like, “I had to remember to call mtrace() on everything that might crash.” Still, it’s better than what I was doing before.
      I don’t like it as much as Matlab’s debugger (or what Lisp does by default), but it comes a lot closer to making me happy; maybe it will grow on me. Thanks for pointing it out.

  3. Win-Vector Blog » Survive R said, on September 29, 2009 at 6:12 am

    [...] at Win-Vector LLC appear to like R a bit more than some of our, perhaps wiser, colleagues ( see: Choose your weapon: Matlab, R or something else? and R and data ). While we do like R (see: Exciting Technique #1: The “R” language ) we also [...]

  4. Arthur said, on September 29, 2009 at 11:21 pm

    Great blog post. Still waiting for my work’s IT group to approve my download of R, but I have experience with MATLAB and SPlus.

    My absolute favorite environment for rapid and powerful coding is Dyalog APL. The language allows really powerful abstraction, and the IDE’s debugger is the best I’ve ever seen. You can step backwards and forwards, and add/modify code without having to exit debug mode. Most Dyalog users find themselves writing code within the debugger.

    • Scott Locklin said, on September 30, 2009 at 12:39 am

      I had a very brief and scary encounter with APL in learning linear algebra when I was 20 or so. It certainly looks powerful, and has a great pedigree, but trying to read it: ouch. Probably Lisp looks the same way to the uninitiated. While Lisp lacks decent IDEs (beyond emacs + SLIME, which is admittedly pretty good), you can certainly sling code in the debugger: it’s one of Lisp’s superpowers. Anyhow, I’m pretty sure at some point somebody is going to pay me to sling K, which is a sort of vectorized APL variant by one of the original authors.

      http://en.wikipedia.org/wiki/K_(programming_language)

      R is crazy frustrating at times, but with the helpful “cheat sheets” you can get a lot done. I’m sure your Splus will serve you well.

  5. Rob Brown said, on October 5, 2009 at 5:47 pm

    I love R, but quite frankly, the best modeling language/environment is Analytica published by Lumina. Here’s a review I wrote about it last year for OR/MS Today:

    http://lionhrtpub.com/orms/orms-6-08/frswr.html

  6. lew burton said, on October 9, 2009 at 6:04 am

    Both of those look neat. The connection with Babcock and Brown is quite funny.

  7. Helmingstay said, on October 21, 2009 at 11:06 pm

    options(error=recover) and options(warn=2) are the most helpful (non-default) settings that I’ve found for run-of-the-mill errors. Just to clarify, by “crash” do you mean the R interpreter crashes? I’ve heard many more complaints about Matlab “giving up the ghost” than R.

    An ongoing weakness in R seems to be an odd/poor set of default choices (as with the options above). Google “stringsasfactors” and witness a long list of novice users ready to do violence to their computers. R’s factor-handling has brought me to the verge of tears, only to discover a single pithy sentence in the docs that clarifies all.

    In short, R’s public relations team isn’t likely to win any awards today, tomorrow, or ever…

    • Scott Locklin said, on October 21, 2009 at 11:38 pm

      I think I’ve encountered “stringsasfactors” before. I’ve been living in the R debugger for the last few days. When I say it crashes, I mean crash as in crashes R; sometimes with pagefault, sometimes it gets stuck in some kind of wacky ESS REPL->somewhere else loop, probably with the debugger’s TK bits. Either way, I lose whatever I was doing with a kill -9.

      Other fun: keeping track of which libraries you have loaded from where. I found a fun “bug” in my code which couldn’t be reproduced on different installs of R; apparently the old version of XTS (or ZOO, I never figured out which was at fault) allowed you to subset pretty sloppily. New version requires everything be just so. Finding out which lib R was pointing to … any of 3-4 in Framework or my home directory: insanity. In the end, I’m going to have to maintain an R distribution along with my code, because the libraries change so much underneath the code, I can’t rely on CRAN to do it for me for anything resembling software. Shoulda done it in Lush.

  8. Stefan said, on September 29, 2011 at 1:42 pm

    I haven’t read the discussion thoroughly, but just to chime in about R: I think that it has its place as data analysis/statistics software, and for that it is superior to MATLAB. I had the same problems with IDEs and especially the fact that the help system is hard to use, but Rstudio and (to some extent) eclipse are quite OK, I think; the autocompletion and function argument hints help a lot (sounds a little childish, but it’s important).
    Lately the performance (with 2.13) is getting better, and I would argue that the CRAN package system is incredibly efficient in helping me find the right package and get whatever stat. methods up and running in minutes.
    Then again, I was trying to use it in an F# project, and the whole COM interop is not working for the current version, so it’s bound to make you mad… I guess the debugging is a similar story.

    • Scott Locklin said, on September 29, 2011 at 4:33 pm

      Two and a half more years of experience, and I’ve come to terms with the debugger (it still sucks compared to Matlab or Lisp). I’ve gotten better at vectorizing my code as well, which improves performance (which still sucks compared to Lisp).

      ESS+emacs is pretty OK. ESS+Tramp is pretty helpful too, when working on the cloud: dunno if you can do that trick using Rstudio or eclipse.

      CRAN is OK, but … for example: which KD-tree package should I use? There are 6 of them last time I checked!

      My favorite Lisp is having problems with linking to 64 bit C++ objects these days, so I may end up an OCaML nerd after all…

  9. mike said, on November 3, 2011 at 8:46 am

    I saw your comment on hdf5 w/Lush in http://bronzekopf.sourceforge.net/ & would like to try it.

    I’ve enjoyed XLispStat (Vista) & have thought Lush very interesting, but don’t see enough dev in either lately. I mostly use sbcl.org right now. It (Common Lisp) has the start of connections to R, Octave, hdf5 & even scidb, though only rclg worked right away.

  10. Helmingstay said, on October 10, 2012 at 5:10 am

    Great comments thread. I’m an inveterate R user, and I’m helping teach a lab in Matlab this semester (which is how I ended up here).

    A few comments:

    * I’ve noticed that R has a relatively large vocabulary, which makes the learning curve rather steep. It reminds me a bit of vim — nearly a decade later, and I’m still learning commands/functions that accelerate my productivity/clarity.

    It’s funny, I like the R docs more and more as I use them. At least the core docs state *precisely* what they do (though packages are another story), but there’s not much hand-holding. Cross-listing and search functions could be better.

    * Re: speed. Again, there’s a *big* learning curve here: R makes it easy to write slow code, while simple habits yield order-of-magnitude speedups. Burns Statistics’ “The R Inferno” does a great job of pointing out the worst offenders.

    I’m really curious to see a problem where Lisp beats R in runtime. I know of a few examples of explicit looping, MCMC-type stuff, that I’ve gotten into the habit of pushing into C++ with the amazing Rcpp package. Can you think of a pseudo-code example where there’s a 2x speed difference betwixt the two?

    * R’s “core” is still evolving (sometimes rapidly), which has cleaned a few things up. Packages now *require* namespaces, which really cuts down on naming collisions and confusions, for example.

    * One major remaining weakness in R is memory management (and threads). The traditional answer has been “buy more RAM”, which generally works well. On the other hand, many problems are amenable and arguably more efficient with divide-and-conquer MapReduce type processing anyway. Any thoughts on memory limitations and parallelism in Lush & Matlab?

    thanks!

    • Scott Locklin said, on October 10, 2012 at 5:23 am

      One of the things about Lush is the ability to natively call C, or compile to C with simple numeric code. This means it virtually always wins over R, by factors of 10-100. Regression, for example, is that much faster. I’ve managed to go even faster using the multicore capabilities of Clojure and stuff like tuned JBlas. They’re not fair comparisons, as the lm() gizmo does way too much stuff, but most of the time, I don’t want that damn stuff, nor do I want to figure out any of the many ways of calling something which might be faster. Usually there is a way to get “fast enough” in R, but you have to figure it out, and if you stack up a bunch of things it is bad at (say, storing timeseries in a DB, then regressing a bunch of subsets of the ts against other ts), you get some annoying delays.

      FWIW, I have been fiddling with J recently, and if I could figure the FFI out, I might like it almost as much as Lush. Not as well documented, and the source is often impenetrable, but it’s very handy at certain things, and very fast.

