## Numeric linear algebra code: an appreciation

Most computer science guys and “software engineers” never deal with the floating point end of their computer machines. This is too bad, as the computer was invented basically to do floating point calculations, whether to build nuclear weapons or to intercept Rooskie bombers. The present use cases of using computers to watch TV, sell underpants on the internet, mining bitcoins, and sending email to the guy across the cubicle from you are later developments: all integer oriented.

Because the subject is so old and obscure, you have to be a sort of historian of science to understand why things are the way they are on the floating point side of the computer machine. You also have to know Fortran, because that’s more or less what most of it is written in. In fact, you have to know old Fortran, because an awful lot of it was written a long time ago.

It is an exaggeration to say everything you can do to simulate or model the world happens using numeric linear algebra of some kind, but it isn’t much of an exaggeration. Numeric linear algebra can be dense, meaning all numbers are defined, and a matrix looks like a matrix in memory, or sparse meaning most of the numbers are 0 and it looks like a hash table in memory. Most problems are inherently dense, but sparsification can make impossible things possible, and there are many data science applications (recommendation engines for example) where sparsity can be assumed if you think your matrix of data has any meaning to it.

You’d think this sort of thing is a trivial problem, but it isn’t. The reason it is two fold. One is the operations are hard. A naive dense square matrix multiply with no trickery involved is O(n^3). If it’s dense, and you need to keep all three matrices around, it’s 3 * O(n^2) in memory. There is an additional problem here which “big O” doesn’t say much about. When you have big-O memory problems, you have IO problems. Your computer doesn’t operate directly on memory chips you install on your computer; it operates directly on registers. While matrix multiply is just vector multiply and sum, the conga dance of doing this in a way that maximizes the data moving in and out of the registers, caches and main memory is extremely intricate and usually *machine dependent* (because everyone’s computer has a slightly different memory profile).

Most of you have never thought about these matters, because you haven’t tried to simulate a quantum mechanical system, or solved differential equations using finite element analysis. If you build your linear algebra library right, you can get factors of 100 or 1000 speedup. That’s not a big-O speedup, but it sure as hell matters. Computer dorks worry a lot about big-O. Of course, big-O matters very much for both memory and processor cycles. But other things matter too, and computer dorks rarely explicitly worry about them, in part because they have no way of really understanding when they’re happening. I’d posit for any interestingly “large” problem, you’re dealing with something I call “big-IO.” Once you get past having a correct linear algebra algorithm, big-IO is screechingly important on practical problems. It’s the difference between being able to solve a problem and not solve a problem.

I think I’ve convinced any sane reader that matrix multiplies are hard. But this is usually the easiest thing you have to deal with, as it is an actual array operation; something that can be done without explicit loops and branches. QR decompositions, Givens rotations, Eigenvalues, Eigenvectors, Singular Value Decompositions, LU decomposition, inverses, partial inverses; these are more algorithmically challenging. Depending on the properties of the matrix you’re working with (and there’s no simple way to make matrices introspective to tell you this), there may be cheap solutions to these problems, and there may be very expensive ones. Lots of interesting problems are in the O(n^3) range. Regardless of what you use, you have “big-IO” bottlenecks everywhere.

Further complicating your problem, linear algebra on a computer also takes place on several types of floating point data: 32bit/float, 64bit/double and the short and long versions of complex numbers. Some problems are inherently only 32 bit, some are 64 bit; some are inherently real numbers, and some complex numbers. So already, we’re talking about every kind of operation having 4 different types of implementation, all of which must be written and maintained and optimized for your IO and data locality requirements.

Most dense linear algebra work presently happens in something called LAPACK. If you want to calculate eigenvectors in Matlab, Incanter/Clojure, Python, R, Julia and what have you: it’s almost certainly getting piped through LAPACK. If you’re working in C, C++ or Fortran, you’re still probably doing in via LAPACK. If you’re doing linear algebra in Java, you are a lost soul, and you should seriously consider killing yourself, but the serious people I know who do this, they allocate blobs of memory and run LAPACK on the blobs. The same thing was true 25 years ago. In fact, people have basically been running their numerics code on LAPACK for over 40 years when it was called EISPACK and LINPACK. Quite a lot of it conforms with… Fortran 66; a language invented 50 years ago which is specifically formatted for punched cards. That is an astounding fact.

LAPACK is a sort of layercake on top of something called the BLAS (“Basic Linear Algebra Subroutines”). I think BLAS were originally designed as a useful abstraction so your Eigendecomposition functions have more elementary operations in common with your QR decomposition functions. This made LAPACK better than its ancestors from a code management and complexity point of view. It also made LAPACK faster, as it allowed one to use machine optimized BLAS when they are available, and BLAS are the type of thing a Fortran compiler could tune pretty well by itself, especially on old-timey architectures without complex cache latency. While most of these functions seem boring, they are in fact extremely interesting. Just to give a hint: there is a function in there for doing matrix-multiply via the Winograd version of the Strassen algorithm, which multiplies matrices in O(N^2.8) instead of O(N^3). It’s the GEMMS family of functions if you want to go look at it. Very few people end up using GEMMS, because you have to be a numerical linear algebraist to understand when it is useful enough to justify the O(N^0.2) speedup. For example, there are exactly no Lapack functions which call GEMMS. But GEMMS is there, implemented for you, just waiting for the moment when you might need to solve … I dunno, a generalized Sylvester equation where Strassen’s algorithm applies.

BLAS can be hyperoptimized for an individual machine by specialized packages like ATLAS or GotoBLAS (and descendants BLIS and OpenBLAS). Usually, it’s worth a factor of 5 or so, so I usually do it, even though it is a pain in the ass. Those packages with your Linux distribution that call themselves ATLAS BLAS; they are technically, but they’re never as fast as rolling your own. The original GotoBLAS were mostly hand tuned assembler, meta-compiled into very fast BLAS for various permutations and combinations of supported architecture. They’re not called GotoBLAS because they use goto statements (though they do); they’re named after the author, Goto Kazushige, who appears to be some kind of superhuman mentat.

LAPACK is intensely important. Computer people who are not physicists or electrical engineers have probably never given it a second thought. The fact that they’re old Fortran code has implications. The most obvious one is Fortran arrays are column major, just like Linear Algebra itself is. I have no idea why C decided to be row-major; probably because they weren’t doing linear algebra. There are ways of handling this, but it ‘s one more thing you have to keep track of, and it can prevent you from doing things on the C side without expensive memory copies. There is also the matter that because old school Fortran looks like stack allocation of arrays (in code appearance only), you have to do weird things like instantiate your arrays and bizarro looking “workspaces” way up the calling pipeline when you’re calling in a C-like language. This is painful, and efforts have been made (LAPACKE) to make the API a little more friendly where this is done under the covers.

The problem with stuff like this: if you’ve already gone through the trouble of porting a version of C or Fortran LAPACK to your system, LAPACKE makes you do it all over again. The only thing you get out of this exercise is a nicer calling convention, except you have to write the code for your new and improved calling convention, and you don’t get anything extra from this work. So the old calling conventions persist, and probably will persist for another 20-30 years. People will probably be coding up errors that amount to “punched card rejected” in 2050 or 2060.

It’s weird that we still use LAPACK, but considering the difficulty of writing such a thing and moving it from place to place, not terribly surprising. There have been attempts at rewrites, some of which look pretty good to me. FLENS was one of them. I’ve never actually used this, though I have struggled with some vintage of its peculiar build system when examining a neural net package I was interested in. Considering the fact that I found it easier to write my own neural net than to get an old Python package based on FLENS working, I don’t see this project taking over any time soon. While calling FLENS from C++/Python might have been sweet, it didn’t seem useful enough to justify its existence. Another attempt was LibFLAME, which I have never attempted to use. LibFLAME was claiming some performance improvements, which is a reason to think about using it. It was clean and easy to read and came with Goto-san’s enhanced BLAS. Unfortunately, the last major release was in 2010, and the last point release in March of 2014, so I’m presuming the authors are doing other things now. As a historical point, LibFLAME actually employed Goto-san back when he was writing GotoBLAS. Unfortunately for the libFLAME team, he is now at Intel, presumably working on MKL.

For all this, you’d think there has been no progress in numeric dense linear algebra. This isn’t remotely true. It is one of the most interesting areas in computer science research at present. All kinds of crazy things have happened since LAPACK was written. They’re scattered to the winds, existing as scraps of C/Matlab/Fortran. In fact, I would even go so far as to say LAPACK doesn’t even contain the most interesting subset of available dense numeric linear algebra routines, though it inarguably contains the most useful subset. Just one obvious example of its deficiencies; the standard eigen-decomposition routine in LAPACK calculates the complete eigen-decompostion of a matrix. Very often, you only need the top eigenvectors, and that’s a lot cheaper; you can use a Lanczos algorithm. What does one do when this happens? Well, there’s this other ancient Fortran numeric linear algebra package called ARPACK out there. Usually, this gets used, very often for this one function. What about sparse methods? Well, that’s another package. In Fortran days it was IMSL or SparseKit. Now a days, there are many choices.

If I had to bet on a future linear algebra package taking over, it would be libelemental. I only heard about this last year (thanks Ryan), and I have yet to use it for anything important, but it appears to be a very good place to stick new numeric linear algebra algorithms. The problem with it is it depends on a lot of other stuff; LAPACK, libFLAME (optionally for the good parts anyway), Scalapack (old Fortran code for distributed matrix math) and the kitchen sink. It would be nice if they eventually replace these old pieces, but in the meanwhile, it has a decent framework for fetching and building some version of them. A very big upside to libelemental is it gives you access to a lot of useful old code, via a nice modern C API and a modern C++ API. It also gives you access to some useful things built on top of these guys, such as non-negative matrix factorization and skeleton decomposition. It includes convex optimization routines, logistic regression, and quite few other things that are nice to have kicking around. With luck, Jack Poulson, the primary author, will dedicate enough of his time to keep this going and turn it into something that lots of other code depends on out of the box. Seems like there is enough interest at Stanford to keep this moving forward.

Lynn Zaringsaid, on May 26, 2016 at 3:16 amThis is a trip. Great article. A walk-through of my life, from sitting at an 029 key punch in college, and asking the Dean of Engineering for permission to hog 32 K of core on a 360. No self-respecting engineer would use WATFIV. That was for the newb’s. FORTRAN 66.

Scott Locklinsaid, on May 26, 2016 at 3:33 amcode that runs in 32k of core

Steve Masseysaid, on May 26, 2016 at 8:17 amTo what extent can this stuff run on CUDA/OpenCL? You would think that would be a natural fit.

Scott Locklinsaid, on May 26, 2016 at 5:15 pmCUDA comes with it. OpenCL didn’t the last time I checked (2 years ago -probably has something by now), which is one of the reasons everyone uses CUDA.

Svetlozar *(Zari) Rachevsaid, on June 13, 2016 at 4:04 amScott ,

my name is Zari Rachev https://en.wikipedia.org/wiki/Svetlozar_Rachev. I saw your article

Locklin on science

Nassim Taleb: clown of quantitative finance

I had quick encounter with Taleb, being affiliated in one of his seminar. He is insane, arrogant . completely ignorant about real science> I was dismaid of his popularity – people are ignorant, I guess, but then I saw very intelligent guys, who liked what said about risk , despite the stupid interruptions of Taleb (trying to be helpful, by making mistake what is convex and concave function and trying to explain that the Hurst ibdex can be negative).

Anyway I heard this gibrish , and the way how he curses all fundamental notions of finance and respected academics (all with Nobel ) and then I understood. People like porn and porn sells. Taleb sells porn. He showed be proudly an invitation from Russia for for 3 days talk to be paid 70,000 after taxes. I am being In teh Russian Academy of science for 7 yeard and have seen the giants in Russia science, how is possible that this prick-in-science can be listen by intelligent people. It is beyond me. In that seminar organized by him came 70 people paying 7,000 for 5 days , none of them was stupid by all means. Yes, they had no clue about real finance or probability theory, or statisticians. I was seek hearing about Taleb’s statistical. concepts – his knowledge of mathematics is below that of my high school granddaughter. And that again, I understood – this is scientific porn. This is not pseudo-science, see wiki, 2. Pseudoscience is a claim, belief, or practice presented as scientific, but which does not adhere to the scientific method, because pseudoscience does not make more “involved” in pleasuring themselves. Then the only thing that I have found close was porn:Pornography (often abbreviated as “porn” or “porno” in informal usage) is

the portrayal of sexual subject matter for the purpose of sexual arousal.

Pornography may be presented in a variety of media, including books,

magazines, postcards, photographs, sculpture, drawing, painting,

animation, sound recording, film, video, and video games.

Reading your article about Taleb’s “writing” it occurs to me that we can have a joint project for a book on “Scientific Porn”. Taleb is not the only one example of scientific porn.

Another is indeed

https://en.wikipedia.org/wiki/New_Chronology_(Fomenko)

I have worked with Fomenko, he is great mathematician and artist,and still his New Chronology is a scientofic porn -very popular indeed.

I guess, I can find few more.

Let me know if you can consider my proposal.

Thank you.

Zari

PS. BTW, Taleb is poor/cheap guy, despite boasting that his books are sold 6 million copies and more each, he was terribly envy that I have a driver and a corporate car, asking me at least twice, how much I paid per year for the driver. I do not want to give more details on that, but this is something that pushed the guy to do scientific porn.

**********************************************************************

Svetlozar (Zari) Rachev, Research Professor,

Applied Mathematics and Statistics, Stony Brook University

Stony Brook, NY 11733

President, Bravo Risk Management Group LLC

office: 631-632-8121 (Math Tower 2-108)

secretary: 631-632-9125,cel. 631-662-6516

**********************************************************************

Oleh Danylivsaid, on May 26, 2016 at 9:00 amVery good artilcle. I was surprised to learn that R is using LAPACK underneath. It is nice to know that good old FORTRAN still rules. How do they solve problem that FORTRAN preallocates memory uprfront? Do they compile it on FORTRAN90?

I was working project called MOLCAS (quantum chemistry) and they wrote C library to allocate memory on the fly and then were passing FOTRAN77 a raw pointer to use it.

Scott Locklinsaid, on May 26, 2016 at 5:13 pmThat’s usually how it is done. I know in J, this is how it is done.

The R source uses Fortran everywhere, and for a lot of things (the hierarchical clustering algorithm for one small example), and it’s almost entirely old timey F77 as far as I can tell. It’s really hard to beat a piece of 40 year old working Fortran code. The compiled code is always fast with minimal tuning, and if it’s 40 years old, the bugs are probably worked out.

Actually a tour of the R source might be fun some time. It’s a horrific language as a language, but it’s like having a very nice textbook on numerics and is surprisingly easy to read if you speak C and Fortran.

Navaneethansaid, on May 27, 2016 at 9:35 amReally interesting history. Does a lot of people still write Fortran code? Wonder what will happen to these libraries when there’s no longer any active Fortran programmers.

Scott Locklinsaid, on May 27, 2016 at 6:58 pmNo shortage of Fortran coders out there. It’s easy to learn anyway. The math is the hard part. My biggest concern is the lack of useful fortran compilers. G77 is a C compiler with a Fortran front end, which is not going to get maximum performance (knowing total allocated memory at compile time allows for lots of performance optimizations G77 can’t do)

Waltsaid, on May 28, 2016 at 6:14 pmThe big-I/O issue you raise is particularly relevant to processing radar data cubes. Having fooled-around with F77, I remember most of the compilers were written for Linux. I don’t know much about them. Can you target modern computer processors with them? What about writing linear algebra routines that target FPGAs? The latter seems like a more-modern application of linear algebra.

What’s your favorite linear algebra book.

Scott Locklinsaid, on May 28, 2016 at 9:40 pmFPGA synthesis is probably not amenable to Fortran; I have no idea how VHDL/Verilog works.

G77 isn’t a real fortran compiler, I would say. The Intel one is probably OK, but the DEC one of yore was pretty awesome.

For numeric linear algebra, the holy trinity is Golub, Trefethen & Bau (my personal favorite, though not as complete as the others) and Demmel.

Natesaid, on June 3, 2016 at 4:06 pmTrefethen & Bau is a wonderful book! As you say, it’s not as complete, but as an introduction to numerical computing it’s superb. I came to it after reading Axler’s Linear Algebra Done Right, and although I loved Axler, his book was so theoretical that it left me badly wanting to work with some actual numbers.

Natesaid, on June 3, 2016 at 4:04 pmMy guess is that C was made row-major because most arrays in C are strings.

bobbybobbobsaid, on June 27, 2016 at 2:58 pmEverybody in C++ land uses Eigen or Armadillo or the proprietary intel stuff. The days of wrapped Fortran are long gone.

Scott Locklinsaid, on June 27, 2016 at 7:03 pmIntel code uses the Fortran Lapack API, Armadillo still wraps Lapack. Eigen, maybe: I’ll believe it when something like numpy is based on it.

Juanlu001said, on December 29, 2016 at 9:57 amIt’s coming:

http://quantstack.net/xtensor

Tonysaid, on July 13, 2016 at 9:31 pmBoy, this hits home. I do computational statistics for genetics, and it’s basically linear algebra. I have been compiling with Intel’s MKL, AMD’s acml and Apple’s vecLib. All of these give several-fold, sometimes 50X speed-ups.

Rudisaid, on August 13, 2016 at 10:18 pmMind explaining why it makes a difference if the matrix is stored column major rather than row major? Matrix multiplication is done row by column, so it seems to me like whatever advantage you’d get from the implementation working with one matrix would be cancelled by the other one (as you can’t traverse both in the same way).

Scott Locklinsaid, on August 13, 2016 at 11:58 pmMath is column major. So you should store the data in the path the algorithm (which is a math algo) happens on, otherwise you’re cache stalling all the time.