Choose your weapon: Matlab, R or something else?
As a data sort of guy, I use three programming tools on a daily basis, or at least every week. One is Lush, a version of Lisp. Another is Matlab. Lastly, there is the R project.
I don’t want to use three tools for dealing with data, but it’s actually necessary right now. I don’t think it will be necessary forever.
Lush is my general purpose programming language. It’s insanely great. Parts of it are wonky and slow, and parts of it are broken or missing, but it’s a Lisp, it’s fast where I need it, and I like it a lot. More on this in a future entry. I use Lush for speed and original research. If the complex algorithms I need aren’t already written in Matlab or R, I might as well write them in Lush. Lush is a high level language with low level speed when you need it. It would be perfect if it had more libraries. The only thing I might like better is OCaml/F#, and frankly, I find the type inferencer there gets in the way more than it helps. If they made an OCaml where you could turn the type safety off most of the time, that would be better. Or, I could just be like everyone else and use Python or Java for this sort of thing. Not that there is anything wrong with that.
Matlab would be my second choice for hacking out original research. Why Matlab? Matlab is reasonably fast, but one of its main value adds is that it is extremely intuitive if you’ve used Fortran or C, and if you don’t know how to do something, the help system is very informative. Matlab code is also extremely well supported. The debugger, profiler and editor are all excellent; some of the best I’ve used. Sure, someone will argue that they have a more powerful debugger, but Matlab’s is the handiest I’ve yet used. I don’t need to read a manual to use it; I just use it. Sure, emacs is way better than the Matlab editor, but it isn’t as handy as Matlab’s. You can use Matlab to do just about anything. I’ve used it to code up embedded systems using xPC Target and Real-Time Workshop. I’ve used it to code up trading systems, from data feed to broker interface. I’ve embedded it in Excel for end users. I’ve deployed it in enterprise software used by Fortune 100 companies. It’s amazingly useful stuff, especially if you have the proper toolbox for your tasks. You can build reasonably good numeric software with it as long as you don’t need fancy “programmy” features like concurrency. If Matlab had a way of making fast compiled code, it would be close to perfect for the type of thing I do; I wouldn’t bother with Lush any more, except when writing interpreter type things. Alas, Matlab’s way of doing this is to write your time critical pieces in C and embed them into your code in a fairly laborious process. The only real drawbacks to Matlab are speed, plotting and expense.
What is R good for, then? Well, R is free, so many academics use it to share their latest econometric or machine learning software with everyone else. As such, just about everything statistical under the sun exists in R. And it’s free! What is not to love? Well, sadly, there is plenty not to love about R. First off, there is speed. R doesn’t seem to have anything that makes it inherently slow for an interpreted language: it should be comparable to Matlab in this regard. But it’s slow enough that most people do their heavy work in other languages; most of the modules written for it have the bulk of their code in C or Fortran. This is somewhat true of Matlab also, and for the same reasons, but Matlab has a trivial way of telling you what you need to speed up, so R will always end up slower in practice. Second, there is debugging. R is hard to debug. It doesn’t drop you into an interactive top level the way Matlab (or Lush, or Python, or anything where you write Real Programs) does. That sucks a lot, and removes a bunch of the utility of using an interpreted language. Oh, sure, there is a debugger, but it is buggy, poorly documented, and doesn’t work in the simple way that Matlab’s does. Third, there is the syntax. Personally, I like the syntax; it’s a lot like OCaml. But most people don’t. What’s more, the help system is very close to worthless if you’re trying to remember a simple command. People may say this is unfair, since I am just not used to R, but the fact is, I’ll never get as used to it as Matlab, and neither will anyone else. Oh, it’s OK for finding packages you want if you can think of the right keyword for them. But compared to Matlab, or even something like Lush, its online help is pretty worthless. Fourth: for programming, while it should be better than Matlab in many ways, I haven’t ever seen a legible R program which was over 100 lines. I don’t know how they manage this. Part of it is doubtless that the IDEs are rather bad.
I don’t know anyone who claims they can write good, large pieces of software for R. I once asked a guy how he wrote big pieces of software, and he said, “very carefully.”
This sounds pretty bad, but there are solid reasons to use R. For one thing, it’s free. There is a lot to be said for free. Among other things, if you want to give some code away for others to play with, R is going to be a better vehicle than distributing raw C or a Matlab package. For another thing, it has a tremendous amount of work done on various hard numeric problems, and installation is trivial: just press a button. Want to wire the latest AdaBoost up to your database and plot some nice results? Pretty easy in R. I might be able to do all this in Matlab, with the correct packages and so on, but in R, it’s the work of seconds. Another thing: it’s a lot easier to make fancy plots in R than it is in Matlab. Matlab’s plotting utility is from the dark ages. It’s insanely bad. You can abstract some of its badness away with objects, but … you shouldn’t have to. Finally, for interacting with data, R wins. Matlab’s matrix paradigm makes it easy to use, but data.frames are more powerful.
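To make the data.frame point concrete, here is a minimal sketch (the column names are made up for illustration):

```r
# A matrix holds one type; a data.frame mixes types per column,
# which is what real data sets actually look like.
df <- data.frame(ticker = c("A", "B", "C"),
                 price  = c(10.1, 12.3, 9.8),
                 volume = c(100L, 250L, 75L),
                 stringsAsFactors = FALSE)
# Subset rows by a condition on one column, keeping every column:
liquid <- df[df$volume > 90, ]
```

In Matlab of this era you would typically carry the tickers in a separate cell array and keep the row indices aligned by hand.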
Here’s how my decision tree works. When I first heard about Benford’s law, I decided it was simple enough; I’d hack it out in Lush. I did. It worked, and I fiddled with it. Then I realized that goodness of fit to Benford’s distribution might be nice. I had chi-squared distributions already coded up in Lush, and some curve fitting stuff … but wiring it all together, then fiddling with the plotting routines: ugh. So, Google informed me that some nice statistician had done all that work for me in R. So I used R. Probably, someone did it in Matlab also (actually, someone did), but it’s a pain to fire up my Windows laptop with Matlab on it, so I just went with R. That’s what R is good for. At some point, I’ll get Lush talking to R, at which point I may cease using Matlab unless someone pays me to do so. It will never be as slick as Matlab, and I will miss all the great user productivity features that Matlab offers, but it will get the job done better and quicker, I think.
I use the cheat sheets in R a lot, for lack of a better help system, so if you want to fool around with it:
A cheat sheet
A better cheat sheet
Other R documents
[…] and data By erehweb My fellow bloggers John and Scott have posted recently about the free statistical programming language R. How does it compare to an […]
Hi – Just found this blog and I’m really enjoying your writing.
Regarding R’s debugger – I’ll agree it’s poorly documented, but I haven’t found it to be that bad. Have you tried “options(error=recover)” and “withCallingHandlers(fun(), warning=function(c) recover())”? Also, I haven’t tried it (and it may be what you were talking about as ‘buggy’), but the debug package (install.packages("debug")) looks promising in terms of what you want.
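For reference, a minimal sketch of those settings in one place (`myfun` is a toy stand-in for whatever function you suspect):

```r
# A toy function that warns and then fails:
myfun <- function(x) {
  if (x < 0) warning("negative input")
  log(x) + x
}

options(error = recover)  # drop into an interactive frame browser on error
options(warn = 2)         # promote warnings to errors, so they stop you too
debug(myfun)              # step through myfun line by line on its next call
# myfun(-1)               # uncomment at the console; recover() then
#                         # prompts you to pick a frame to inspect
```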
The last time I tried debug, it crashed my system and made me sad. I was going to try it again for this article, but I didn’t see it available for R2.9/OS-X at the time. It seems to be available now. Fiddling for 30 seconds, I’m remembering issues like, “I had to remember to call mtrace() on everything that might crash.” Still, it’s better than what I was doing before.
I don’t like it as much as Matlab’s debugger (or what Lisp does by default), but it comes a lot closer to making me happy; maybe it will grow on me. Thanks for pointing it out.
Nice blog. I wanted to add a couple of comments to this post. First, a discussion related to this topic is available on Stack Overflow. There clearly isn’t a single “right” language: to each his own. I can’t vouch much for Lush, although I appreciate Bottou as a researcher. It doesn’t seem to have much of a user base. The user community and the so-called ecosystem is fundamental for the viability of the language.
Regarding Matlab, it is a poorly designed language. Its object system is bolted on. It is weakly typed and passes by value, with the resulting speed penalties. Maintaining Matlab code is unwieldy. It really never outgrew its roots in numerical algebra.
Regarding R, my impression is that you haven’t tried hard enough. I was a heavy Matlab user, but after some grad school learning curve I got used to R. R is actually a much better designed language than Matlab. I never had problems with debugging using browser() or debug(). Its performance in linear algebra operations is very similar to Matlab’s (using BLAS, or better, ATLAS on Linux). And of course, you don’t want to do loops in either language. The set of packages available in R, from wavelets to shrinkage methods to ensemble methods, SVMs, and lattice/ggplot2, is just not comparable to anything SAS, SPSS, or Matlab has to offer.
The issues I have with R are speed and multicore scalability. I can use C for speed, but not scalability (unless I get a second job to debug multithreads). I think F# has by far the best chances to succeed as a scientific, fast, scalable language, albeit not truly multiplatform.
Thanks for your detailed comment. You pegged me: I’m least familiar with R, though I’ve done a fair amount at this point. I have many complaints though … for example, most of what I do is timeseries based. R’s six different kinds of TS, which don’t always cast properly … I don’t like them as much as the one I wrote in Lush; only xts comes close to my TS class’s capabilities, and my Lush TS has far more useful functions written for it. Matlab certainly sucks: objects in it are useless. The thing is, it sucks a lot less than a lot of the alternatives. The UI and associated documentation are outstanding.
Lush’s user base sucks; however, compared to the community of people who do numerics in Common Lisp, Lush’s user base is awesome and enormous. As a language it also has an advantage over Common Lisp: it’s very small and easily taken in within a couple of days. It also comes with useful source you can look at and imitate. When I picked Lush for my Frankenstein’s monster, I was considering OCaml instead (which I agree is a great language, even in the F# version), but I went with Lush because a lot of the hard work was already done in Lush. I’m basically a machine learning dude, Lush is designed for ML, so it’s a nice fit. Python was also a consideration: it certainly would have made my life easier from a POV of having stuff already written for it, but SWIG+C isn’t a very good solution for speeding up the bits that need to be fast, and what they did with Python 3.0 is totally unacceptable. Another one which has come to my attention is Chicken: very fast, very configurable, and it doesn’t have the namespace problems Lush does. Still, I’m comfortable in my choice: there are no Chicken images with lots of math libraries in ’em.
Scott,
a suggestion and a question:
suggestion (re: time series): have you tried the package zoo? I use it regularly. It has many features, like missing data imputation. more here
question: I am intrigued by Lush. Is there a newsgroup or a blog or any community focal point for Lush? The only thing I could see is that the last announcement on the Lush news page dates back two years, and the latest SourceForge image is dated Nov 2006 and was downloaded less than 5,000 times. That’s when I got discouraged.
XTS is a superset of ZOO. Like I said, it’s pretty good. I like mine better; but that’s probably because I wrote it. I haven’t gotten my class to be compilable yet, but it is an eventual goal.
Lush is a very small Lisp interpreter married to a compiled Lispy/Fortrany language that you can intersperse with C or C++. It’s also got all the basic fast matrix stuff you need built into it. As I said, the user community is tiny. There is a SourceForge mailing list. The biggest downside, besides the size of the user base, is the fact that there is no DB interface. I wrote a cheap interface to netCDF to shove timeseries in, but it’s too slow on writes, so I just dump objects to file for now. At some point, I may get around to writing a proper TS database in HDF, and a MySQL interface. Though I am also considering making Lush callable from R, and vice versa. Meanwhile I get paid to do something else.
Why it rules: it is exactly the level of abstraction you need. Most of the time you can write sloppy high level interpreter code. When you need to go faster, or have decided on a basic design, you can optimize down to the metal. Theoretically you can do stuff like this in Python + SWIG (something becoming more common at enlightened hedge funds) or OCaml (if only I could turn off the type inferencer when I don’t need to go fast/safe), but I liked the way Yann and Leon did stuff.
There is a new version of it being worked on by Ralf Juengling, but it’s not ready for prime time. The old version is pretty solid.
Even though I agree that there are many design flaws in the Matlab language, it is not true that Matlab always passes by value; it only creates copies when necessary, i.e., when modifying the data. For matrix operations it is also difficult to match Matlab’s speed, since it is based on BLAS; loops, on the other hand, are extremely slow in Matlab.
[…] at Win-Vector LLC appear to like R a bit more than some of our, perhaps wiser, colleagues ( see: Choose your weapon: Matlab, R or something else? and R and data ). While we do like R (see: Exciting Technique #1: The “R” language ) we also […]
Great blog post. Still waiting for my work’s IT group to approve my download of R, but I have experience with MATLAB and SPlus.
My absolute favorite environment for rapid and powerful coding is Dyalog APL. The language allows really powerful abstraction, and the IDE’s debugger is the best I’ve ever seen. You can step backwards and forwards, and add/modify code without having to exit debug mode. Most Dyalog users find themselves writing code within the debugger.
I had a very brief and scary encounter with APL while learning linear algebra when I was 20 or so. It certainly looks powerful, and has a great pedigree, but trying to read it: ouch. Probably Lisp looks the same way to the uninitiated. While Lisp lacks decent IDEs (beyond emacs + SLIME, which is admittedly pretty good), you can certainly sling code in the debugger: it’s one of Lisp’s superpowers. Anyhow, I’m pretty sure at some point somebody is going to pay me to sling K, which is a sort of vectorized APL variant by one of the original authors.
http://en.wikipedia.org/wiki/K_(programming_language)
R is crazy frustrating at times, but with the helpful “cheat sheets” you can get a lot done. I’m sure your Splus will serve you well.
APL’s The BEST!
I love R, but quite frankly, the best modeling language/environment is Analytica published by Lumina. Here’s a review I wrote about it last year for OR/MS Today:
http://lionhrtpub.com/orms/orms-6-08/frswr.html
I’ve heard good things about that, and am generally in favor of such things. Some friends of mine wrote something which may even be better:
http://www.advantageforanalysts.com/
Both of those look neat. The connection with Babcock and Brown is quite funny.
options(error=recover) and options(warn=2) are the most helpful (non-default) settings that I’ve found for run-of-the-mill errors. Just to clarify, by “crash” do you mean the R interpreter crashes? I’ve heard many more complaints about Matlab “giving up the ghost” than R.
An ongoing weakness in R seems to be an odd/poor set of default choices (as with the options above). Google “stringsAsFactors” and witness a long list of novice users ready to do violence to their computers. R’s factor handling has brought me to the verge of tears, only to discover a single pithy sentence in the docs that clarifies all.
In short, R’s public relations team isn’t likely to win any awards today, tomorrow, or ever…
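For anyone who hasn’t hit it yet, the stringsAsFactors trap looks roughly like this sketch (column names made up for illustration):

```r
# With stringsAsFactors = TRUE -- the default in the R of this era --
# character columns silently become factors:
d1 <- data.frame(name = c("alice", "bob"), stringsAsFactors = TRUE)
as.integer(d1$name)    # level codes 1 2, nothing to do with the text

# What you almost always wanted:
d2 <- data.frame(name = c("alice", "bob"), stringsAsFactors = FALSE)
nchar(d2$name)         # string lengths, as expected
```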
I think I’ve encountered “stringsAsFactors” before. I’ve been living in the R debugger for the last few days. When I say it crashes, I mean crash as in crashes R; sometimes with a page fault, sometimes it gets stuck in some kind of wacky ESS REPL->somewhere-else loop, probably with the debugger’s TK bits. Either way, I lose whatever I was doing with a kill -9.
Other fun: keeping track of which libraries you have loaded from where. I found a fun “bug” in my code which couldn’t be reproduced on different installs of R; apparently the old version of XTS (or ZOO, I never figured out which was at fault) allowed you to subset pretty sloppily. New version requires everything be just so. Finding out which lib R was pointing to … any of 3-4 in Framework or my home directory: insanity. In the end, I’m going to have to maintain an R distribution along with my code, because the libraries change so much underneath the code, I can’t rely on CRAN to do it for me for anything resembling software. Shoulda done it in Lush.
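A few base-R calls help with the which-library-is-R-actually-pointing-to question (sketched with the stats package only because it always ships with R):

```r
# The library search path, in the order R consults it:
.libPaths()
# Where a given installed package actually lives, and which version:
find.package("stats")
packageVersion("stats")
# Everything attached right now, with versions and paths:
sessionInfo()
```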
Scott, have you encountered any stability issues in Lush? I don’t have many issues with core R stability; in fact I think the base distribution is stable. Libraries, however, vary wildly. The ones that work better are usually ports of independently developed, solid C/Fortran code. Also, those that are older and more used work better (usually). Aside from this, I am exploring Lisp dialects. Lush is very intuitive. However, I am using Clojure, and I really like it. It doesn’t come with batteries included, though. Incanter is a project that aims at building a stat/numeric platform on top of Clojure. It’s still in the early stages.
You can easily crash Lush with null pointers and whatnot if you’re screwing around in C. There’s not much you can do about that. The idea is to stay away from C: just encapsulate the bits you need and leave the rest of it alone. I’ve stumped the garbage collector with some weird recursion as well. My comments only apply to 1.x Lush; 2.0 is fairly different, and I don’t have enough experience with it yet to comment. It’s a lot better in several obvious ways; I suspect it will be more stable too.
Clojure is interesting, though I feel better about resorting to C than I do about resorting to Java, even if the latter is safer. Incanter looks incredibly weaksauce though.
I suspect most people would find Lush pretty DIY and clunky. Compared to, say, OCaml, it isn’t as solid or well developed. But it is incredibly handy to get stuff done in. I may some day regret spending the time in Lush rather than OCaml, but I doubt it.
I haven’t read the discussion thoroughly, but just to chime in about R: I think that it has its place as data analysis/statistics software, and for that it is superior to MATLAB. I had the same problems with IDEs, and especially with the fact that the help system is hard to use, but RStudio and (to some extent) Eclipse are quite OK, I think; the autocompletion and function argument hints help a lot (sounds a little childish, but it’s important).
Lately the performance (with 2.13) is getting better, and I would argue that the CRAN package system is incredibly efficient in helping me find the right package and get whatever statistical methods up and running in minutes.
Then again, I was trying to use it in an F# project, and the whole COM interop is not working for the current version, so it’s bound to make you mad… I guess the debugging is a similar story.
Two and a half more years experience, and I’ve come to terms with the debugger (it still sucks compared to Matlab or Lisp). I’ve gotten better at vectorizing my code as well, which improves performance (which still sucks compared to Lisp).
ESS+emacs is pretty OK. ESS+Tramp is pretty helpful too, when working on the cloud: dunno if you can do that trick using Rstudio or eclipse.
CRAN is OK, but … for example: which KD-tree package should I use? There are 6 of them last time I checked!
My favorite Lisp is having problems linking to 64-bit C++ objects these days, so I may end up an OCaml nerd after all…
I saw your comment on HDF5 with Lush at http://bronzekopf.sourceforge.net/ and would like to try it.
I’ve enjoyed XLispStat (Vista) and have thought Lush very interesting, but I don’t see enough dev in either lately. I mostly use sbcl.org right now. It (Common Lisp) has the start of connections to R, Octave, HDF5 and even SciDB, though only rclg worked right away.
Great comments thread. I’m an inveterate R user, and I’m helping teach a lab in Matlab this semester (which is how I ended up here).
A few comments:
* I’ve noticed that R has a relatively large vocabulary, which makes the learning curve rather steep. It reminds me a bit of vim — nearly a decade later, and I’m still learning commands/functions that accelerate my productivity/clarity.
It’s funny, I like the R docs more and more as I use them. At least the core docs state *precisely* what they do (though packages are another story), but there’s not much hand-holding. Cross-listing and search functions could be better.
* Re: speed: again, there’s a *big* learning curve here. R makes it easy to write slow code, while simple habits yield order-of-magnitude speedups. Burns Statistics’ “The R Inferno” does a great job of pointing out the worst offenders.
I’m really curious to see a problem where Lisp beats R in runtime. I know of a few examples of explicit looping, MCMC-type stuff, that I’ve gotten into the habit of pushing into C++ with the amazing Rcpp package. Can you think of a pseudo-code example where there’s a 2x speed difference betwixt the two?
* R’s “core” is still evolving (sometimes rapidly), which has cleaned a few things up. Packages now *require* namespaces, which really cuts down on naming collisions and confusions, for example.
* One major remaining weakness in R is memory management (and threads). The traditional answer has been “buy more RAM”, which generally works well. On the other hand, many problems are amenable to, and arguably more efficient with, divide-and-conquer MapReduce type processing anyway. Any thoughts on memory limitations and parallelism in Lush & Matlab?
thanks!
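The speed point above can be sketched quickly: growing a result vector inside an explicit loop versus a single vectorized call.

```r
# Slow: c() copies the whole vector on every iteration, O(n^2) overall
slow_sq <- function(x) {
  out <- c()
  for (i in seq_along(x)) out <- c(out, x[i]^2)
  out
}

# Fast: one vectorized operation, O(n)
fast_sq <- function(x) x^2

x <- 1:1000
stopifnot(identical(slow_sq(x), fast_sq(x)))  # same answer either way
```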
One of the things about Lush is the ability to natively call C, or compile simple numeric code to C. This means it virtually always wins over R, by factors of 10-100. Regression, for example, is that much faster. I’ve managed to go even faster using the multicore capabilities of Clojure and stuff like tuned JBlas. They’re not fair comparisons, as the lm() gizmo does way too much stuff, but most of the time, I don’t want that damn stuff, nor do I want to figure out any of the many ways of calling something which might be faster. Usually there is a way to get “fast enough” in R, but you have to figure it out, and if you stack up a bunch of things it is bad at (say, storing timeseries in a DB, then regressing a bunch of subsets of the ts against other ts), you get to some annoying delays.
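To illustrate the lm() overhead point: when all you want is coefficients, the lower-level lm.fit() skips the formula and model-frame machinery and just does the QR solve. A sketch with made-up data (the speed ratio will vary by problem size):

```r
set.seed(1)
X <- cbind(1, matrix(rnorm(1000 * 3), ncol = 3))  # design matrix, intercept included
y <- drop(X %*% c(2, -1, 0.5, 3)) + rnorm(1000)

# Full-service: formula parsing, model frames, and hooks for summary() etc.
fit_full <- lm(y ~ X - 1)
# Bare QR solve on the raw matrices, much less overhead:
fit_fast <- lm.fit(X, y)

# Same coefficients either way:
all.equal(unname(coef(fit_full)), unname(fit_fast$coefficients))
```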
FWIW, I have been fiddling with J recently, and if I could figure out the FFI, I might like it almost as much as Lush. Not as well documented, and the source is often impenetrable, but it’s very handy at certain things, and very fast.
I very much understand the imperfections of *R*, but, as said, the CRAN libraries are something I can’t do without. These are not only good for statistics; there’s no place in numerical computing, shy of having access to NAG or Mathematica, with as rich a set of options. MATLAB’s collection is pretty good, even that from the MATLAB community, but they are slow rolling theirs out in comparison to CRAN.
My biggest complaint about CRAN is when packages I’ve used get obsoleted and withdrawn, and the occasional package that is either incomplete or broken, which is discovered only after using it for a time.
Some of the speed shortcomings can be overcome by (a) pre-allocating data structures rather than building them on the fly, (b) learning to write vectorized code (as one might in J and can in MATLAB and in Python), (c) learning how to use multiple cores, principally using the _parallel_ package facility, and (d) learning some of the specialized facilities supporting big memory and crunching, notably the “big” series of packages, like _biglm_. As far as an IDE goes, I have a good editor, as do many, and editing R files and then sourcing them is pretty painless. There are some editors which permit you to configure them to automatically submit a source file to *R* (or anywhere) and will even capture the console output within the editor. The editor I use, _ultraedit_, can do that, but I found it just easier to use two different open windows.
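A compressed sketch of points (a) through (c) above (the parallel part assumes a Unix-like system, since mclapply works by forking):

```r
n <- 1e5
# (a) Pre-allocate the result instead of growing it inside the loop:
out <- numeric(n)
for (i in seq_len(n)) out[i] <- sqrt(i)

# (b) Better still, vectorize -- one call, same answer:
out2 <- sqrt(seq_len(n))

# (c) Spread independent chunks across cores with the parallel package
#     (mclapply forks, so on Windows fall back to mc.cores = 1 or parLapply):
library(parallel)
chunks  <- split(seq_len(n), cut(seq_len(n), 4))
partial <- mclapply(chunks, function(ix) sum(sqrt(ix)), mc.cores = 2)
```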
CRAN is an irreplaceable resource. I go back and forth about the disappearance of CRAN libs. I always wanted to fiddle with GTM, so I dug up the old sources. Unfucking them was a nightmare; the guy who wrote them didn’t want to continue. Writing my own in J was a lot easier.
Parallelism in R is pretty bad; it’s just the old fork-a-memory-image trick. Works OK for optimization or Monte Carlo things. For most data problems, you can get things done in R. Much of it isn’t good code, but it almost always gives the right answers, and some way of getting things done.