## Numeric linear algebra code: an appreciation

Most computer science guys and “software engineers” never deal with the floating point end of their computer machines. This is too bad, as the computer was invented basically to do floating point calculations, whether to build nuclear weapons or to intercept Rooskie bombers. The present-day uses of the computer, watching TV, selling underpants on the internet, mining bitcoins, and sending email to the guy across the cubicle from you, are all later developments: all integer oriented.

Because the subject is so old and obscure, you have to be a sort of historian of science to understand why things are the way they are on the floating point side of the computer machine. You also have to know Fortran, because that’s more or less what most of it is written in. In fact, you have to know old Fortran, because an awful lot of it was written a long time ago.

It is an exaggeration to say everything you can do to simulate or model the world happens using numeric linear algebra of some kind, but it isn’t much of an exaggeration. Numeric linear algebra can be dense, meaning all numbers are defined and a matrix looks like a matrix in memory, or sparse, meaning most of the numbers are 0 and it looks like a hash table in memory. Most problems are inherently dense, but sparsification can make impossible things possible, and there are many data science applications (recommendation engines for example) where sparsity can be assumed if you think your matrix of data has any meaning to it.

You’d think this sort of thing is a trivial problem, but it isn’t. The reason is twofold. One is that the operations are hard. A naive dense square matrix multiply with no trickery involved is O(n^3). If it’s dense, and you need to keep all three matrices around, it’s 3 * O(n^2) in memory. There is an additional problem here which “big O” doesn’t say much about. When you have big-O memory problems, you have IO problems. Your computer doesn’t operate directly on the memory chips installed in it; it operates directly on registers. While matrix multiply is just vector multiply and sum, the conga dance of doing this in a way that maximizes the data moving in and out of the registers, caches and main memory is extremely intricate and usually *machine dependent* (because everyone’s computer has a slightly different memory profile).
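To make the cost concrete, here is a toy sketch in Python/numpy (illustrative only, and obviously not how production code is written): the textbook triple loop next to whatever machine-tuned BLAS numpy was built against. Both give the same answer; the tuned version is faster by factors that grow enormous as n grows, which is exactly the register/cache conga dance at work.

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook O(n^3) dense multiply: n^3 multiply-adds, no cache blocking."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2
    C = np.zeros((n, p))
    for i in range(n):
        for k in range(m):          # i-k-j order streams contiguously over rows of B and C
            aik = A[i, k]
            for j in range(p):
                C[i, j] += aik * B[k, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 60))
B = rng.standard_normal((60, 60))

C_naive = naive_matmul(A, B)
C_blas = A @ B                      # dispatches to whatever BLAS numpy links against
assert np.allclose(C_naive, C_blas)
```

Even the loop ordering inside the naive version matters for speed, which is a small taste of why the real kernels are machine dependent.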

Most of you have never thought about these matters, because you haven’t tried to simulate a quantum mechanical system, or solved differential equations using finite element analysis. If you build your linear algebra library right, you can get factors of 100 or 1000 speedup. That’s not a big-O speedup, but it sure as hell matters. Computer dorks worry a lot about big-O. Of course, big-O matters very much for both memory and processor cycles. But other things matter too, and computer dorks rarely explicitly worry about them, in part because they have no way of really understanding when they’re happening. I’d posit for any interestingly “large” problem, you’re dealing with something I call “big-IO.” Once you get past having a correct linear algebra algorithm, big-IO is screechingly important on practical problems. It’s the difference between being able to solve a problem and not solve a problem.

I think I’ve convinced any sane reader that matrix multiplies are hard. But this is usually the easiest thing you have to deal with, as it is an actual array operation; something that can be done without explicit loops and branches. QR decompositions, Givens rotations, Eigenvalues, Eigenvectors, Singular Value Decompositions, LU decomposition, inverses, partial inverses; these are more algorithmically challenging. Depending on the properties of the matrix you’re working with (and there’s no simple way to make matrices introspective to tell you this), there may be cheap solutions to these problems, and there may be very expensive ones. Lots of interesting problems are in the O(n^3) range. Regardless of what you use, you have “big-IO” bottlenecks everywhere.
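For flavor, here is an illustrative Python/numpy sketch (not Fortran, but numpy is a thin layer over these very LAPACK routines): each one-liner below is a serious algorithm underneath, and the reconstruction checks are the kind of identities those algorithms guarantee.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50))

# QR: orthogonal Q times upper-triangular R (LAPACK's dgeqrf/dorgqr underneath)
Q, R = np.linalg.qr(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.T @ Q, np.eye(50))

# SVD: A = U @ diag(s) @ Vt (LAPACK's dgesdd)
U, s, Vt = np.linalg.svd(A)
assert np.allclose(U @ np.diag(s) @ Vt, A)

# A symmetric matrix gets a cheaper, specialized eigensolver (dsyevd),
# one example of "depending on the properties of the matrix"
S = A + A.T
w, V = np.linalg.eigh(S)
assert np.allclose(V @ np.diag(w) @ V.T, S)
```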

Further complicating your problem, linear algebra on a computer also takes place on several types of floating point data: 32bit/float, 64bit/double and the short and long versions of complex numbers. Some problems are inherently only 32 bit, some are 64 bit; some are inherently real numbers, and some complex numbers. So already, we’re talking about every kind of operation having 4 different types of implementation, all of which must be written and maintained and optimized for your IO and data locality requirements.

Most dense linear algebra work presently happens in something called LAPACK. If you want to calculate eigenvectors in Matlab, Incanter/Clojure, Python, R, Julia and what have you: it’s almost certainly getting piped through LAPACK. If you’re working in C, C++ or Fortran, you’re still probably doing it via LAPACK. If you’re doing linear algebra in Java, you are a lost soul, and you should seriously consider killing yourself, but the serious people I know who do this allocate blobs of memory and run LAPACK on the blobs. The same thing was true 25 years ago. In fact, people have basically been running their numerics code on LAPACK for over 40 years, back when it was called EISPACK and LINPACK. Quite a lot of it conforms to… Fortran 66; a language invented 50 years ago which is specifically formatted for punched cards. That is an astounding fact.
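If you want to see the plumbing for yourself, scipy (assuming you have it installed) exposes the raw Fortran entry points under their real LAPACK names; this sketch calls one directly and checks it against the friendly wrapper everyone normally uses.

```python
import numpy as np
from scipy.linalg import lapack     # thin f2py wrappers over the Fortran routines

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 30))
S = A + A.T                          # symmetric, so eigenvalues are real

# Call the Fortran routine dsyev directly: d = double, sy = symmetric, ev = eigen
w, v, info = lapack.dsyev(S)
assert info == 0                     # Fortran-style status code; 0 means success

# numpy's eigvalsh is the same LAPACK code reached through a friendlier door
assert np.allclose(w, np.linalg.eigvalsh(S))
```

The `info` return code is the Fortran calling convention leaking through, forty-plus years later.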

LAPACK is a sort of layercake on top of something called the BLAS (“Basic Linear Algebra Subroutines”). I think BLAS were originally designed as a useful abstraction so your Eigendecomposition functions have more elementary operations in common with your QR decomposition functions. This made LAPACK better than its ancestors from a code management and complexity point of view. It also made LAPACK faster, as it allowed one to use machine optimized BLAS when they are available, and BLAS are the type of thing a Fortran compiler could tune pretty well by itself, especially on old-timey architectures without complex cache latency. While most of these functions seem boring, they are in fact extremely interesting. Just to give a hint: there is a function in there for doing matrix-multiply via the Winograd version of the Strassen algorithm, which multiplies matrices in O(N^2.8) instead of O(N^3). It’s the GEMMS family of functions if you want to go look at it. Very few people end up using GEMMS, because you have to be a numerical linear algebraist to understand when it is useful enough to justify the O(N^0.2) speedup. For example, there are exactly no Lapack functions which call GEMMS. But GEMMS is there, implemented for you, just waiting for the moment when you might need to solve … I dunno, a generalized Sylvester equation where Strassen’s algorithm applies.
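scipy (again, assuming it is installed) also exposes the BLAS layer directly. This sketch calls plain GEMM, the workhorse double-precision multiply, not the Strassen GEMMS variant discussed above, which scipy does not expose.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 40))
B = rng.standard_normal((40, 40))

# GEMM computes alpha*A*B + beta*C in one fused call; the "d" prefix means double.
# This single routine is where most of the world's floating point cycles go.
C = blas.dgemm(alpha=1.0, a=A, b=B)
assert np.allclose(C, A @ B)
```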

BLAS can be hyperoptimized for an individual machine by specialized packages like ATLAS or GotoBLAS (and its descendants BLIS and OpenBLAS). Usually it’s worth a factor of 5 or so, so I usually do it, even though it is a pain in the ass. Those packages that ship with your Linux distribution calling themselves ATLAS BLAS: technically they are, but they’re never as fast as rolling your own. The original GotoBLAS were mostly hand-tuned assembler, meta-compiled into very fast BLAS for various permutations and combinations of supported architectures. They’re not called GotoBLAS because they use goto statements (though they do); they’re named after the author, Goto Kazushige, who appears to be some kind of superhuman mentat.

LAPACK is intensely important. Computer people who are not physicists or electrical engineers have probably never given it a second thought. The fact that it’s old Fortran code has implications. The most obvious one is that Fortran arrays are column major, just like Linear Algebra itself is. I have no idea why C decided to be row-major; probably because they weren’t doing linear algebra. There are ways of handling this, but it’s one more thing you have to keep track of, and it can prevent you from doing things on the C side without expensive memory copies. There is also the matter that because old school Fortran looks like stack allocation of arrays (in code appearance only), you have to do weird things like instantiate your arrays and bizarro looking “workspaces” way up the calling pipeline when you’re calling in a C-like language. This is painful, and efforts have been made (LAPACKE) to make the API a little more friendly, where this is done under the covers.
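The row-major versus column-major business is easy to see from numpy, which speaks both conventions. A small sketch (the flags and ravel machinery here is numpy’s, not LAPACK’s):

```python
import numpy as np

A = np.arange(6, dtype=np.float64).reshape(2, 3)   # C (row-major) order
F = np.asfortranarray(A)                           # same matrix, column-major copy

assert A.flags['C_CONTIGUOUS'] and F.flags['F_CONTIGUOUS']

# Walk each array in raw memory order: rows end-to-end vs. columns end-to-end
assert list(A.ravel(order='K')) == [0., 1., 2., 3., 4., 5.]
assert list(F.ravel(order='K')) == [0., 3., 1., 4., 2., 5.]

# Handing the C-ordered array to Fortran code implies exactly the kind of
# expensive memory copy complained about above; asfortranarray *is* that copy.
```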

The problem with stuff like this: if you’ve already gone through the trouble of porting a version of C or Fortran LAPACK to your system, LAPACKE makes you do it all over again. The only thing you get out of this exercise is a nicer calling convention, except you have to write the code for your new and improved calling convention, and you don’t get anything extra from this work. So the old calling conventions persist, and probably will persist for another 20-30 years. People will probably be coding up errors that amount to “punched card rejected” in 2050 or 2060.

It’s weird that we still use LAPACK, but considering the difficulty of writing such a thing and moving it from place to place, not terribly surprising. There have been attempts at rewrites, some of which look pretty good to me. FLENS was one of them. I’ve never actually used this, though I have struggled with some vintage of its peculiar build system when examining a neural net package I was interested in. Considering the fact that I found it easier to write my own neural net than to get an old Python package based on FLENS working, I don’t see this project taking over any time soon. While calling FLENS from C++/Python might have been sweet, it didn’t seem useful enough to justify its existence. Another attempt was LibFLAME, which I have never attempted to use. LibFLAME was claiming some performance improvements, which is a reason to think about using it. It was clean and easy to read and came with Goto-san’s enhanced BLAS. Unfortunately, the last major release was in 2010, and the last point release in March of 2014, so I’m presuming the authors are doing other things now. As a historical point, the LibFLAME project actually employed Goto-san back when he was writing GotoBLAS. Unfortunately for the LibFLAME team, he is now at Intel, presumably working on MKL.

For all this, you’d think there has been no progress in numeric dense linear algebra. This isn’t remotely true. It is one of the most interesting areas in computer science research at present. All kinds of crazy things have happened since LAPACK was written. They’re scattered to the winds, existing as scraps of C/Matlab/Fortran. In fact, I would even go so far as to say LAPACK doesn’t even contain the most interesting subset of available dense numeric linear algebra routines, though it inarguably contains the most useful subset. Just one obvious example of its deficiencies: the standard eigen-decomposition routine in LAPACK calculates the complete eigen-decomposition of a matrix. Very often, you only need the top eigenvectors, and that’s a lot cheaper; you can use a Lanczos algorithm. What does one do when this happens? Well, there’s this other ancient Fortran numeric linear algebra package called ARPACK out there. Usually, this gets used, very often for this one function. What about sparse methods? Well, that’s another package. In Fortran days it was IMSL or SparseKit. Nowadays, there are many choices.
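The ARPACK route is what scipy’s sparse eigensolver wraps, so the “only the top eigenvectors” computation can be sketched as follows (assuming scipy is installed; `eigsh` drives the Fortran ARPACK code):

```python
import numpy as np
from scipy.sparse.linalg import eigsh   # wraps the Fortran ARPACK routines

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 200))
S = A + A.T                              # symmetric test matrix

# Lanczos iteration: only the k=3 largest-magnitude eigenpairs,
# never touching the full decomposition
w_top, v_top = eigsh(S, k=3, which='LM')

# Agreement with the dense LAPACK answer on those three eigenvalues
w_all = np.linalg.eigvalsh(S)
top3 = sorted(w_all, key=abs)[-3:]
assert np.allclose(sorted(w_top), sorted(top3))
```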

If I had to bet on a future linear algebra package taking over, it would be libelemental. I only heard about this last year (thanks Ryan), and I have yet to use it for anything important, but it appears to be a very good place to stick new numeric linear algebra algorithms. The problem with it is it depends on a lot of other stuff; LAPACK, libFLAME (optionally for the good parts anyway), Scalapack (old Fortran code for distributed matrix math) and the kitchen sink. It would be nice if they eventually replace these old pieces, but in the meanwhile, it has a decent framework for fetching and building some version of them. A very big upside to libelemental is it gives you access to a lot of useful old code, via a nice modern C API and a modern C++ API. It also gives you access to some useful things built on top of these guys, such as non-negative matrix factorization and skeleton decomposition. It includes convex optimization routines, logistic regression, and quite a few other things that are nice to have kicking around. With luck, Jack Poulson, the primary author, will dedicate enough of his time to keep this going and turn it into something that lots of other code depends on out of the box. Seems like there is enough interest at Stanford to keep this moving forward.

## String Interning Done Right

**Refresher**

The common method for interning strings breaks in fantastic ways. In Kerf, we’ve taken the old method and revised it for success in the current generation of languages.

If you’ve forgotten what string interning is from your time with Java, it’s a simple way of ensuring that any string appears only once. So for instance, if I have ten objects whose type is the string “lion”, I don’t have ten copies of the string “lion” in memory. What I have instead is a single copy of “lion” and then ten links which all point to the single copy somehow.

Most often these links are pointers (raw addresses in memory). We’ll discuss how this breaks soon. The reference copy of “lion” is never allowed to be moved, changed, or released. It’s permanent. The other implementation detail to figure out is how to keep the strings unique. The next time we create a “lion” string we need to trade our string in for a link, and what is usually done is that the reference copy is stored in a hashtable or some other deduplicating data structure. This lets us figure out that lion already exists, and when we perform the check we can walk away with our link at the same time. If the phrase is “bullmation”, and it doesn’t already exist, then we can add the initial copy to the data structure at the same time.
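The mechanism can be sketched in a few lines of Python (illustrative only; real interners like Python’s own `sys.intern`, or the one in Kerf, live at a much lower level):

```python
class InternTable:
    """One canonical copy per distinct string; every request returns the same object."""
    def __init__(self):
        self._table = {}            # the deduplicating structure: str -> canonical str

    def intern(self, s):
        # A single hash lookup both answers "have we seen this?" and hands back the link
        canonical = self._table.get(s)
        if canonical is None:
            self._table[s] = s      # first sighting: this copy becomes the permanent one
            canonical = s
        return canonical

t = InternTable()
a = t.intern("lion")
b = t.intern("".join(["li", "on"]))   # equal contents, built as a separate object
assert a is b                          # both "links" point at the single stored copy
assert t.intern("bullmation") is t.intern("bullmation")
```

The identity check `a is b` is the whole point: ten objects typed “lion” cost one string plus ten cheap references.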

…read the rest… http://getkerf.wordpress.com/2016/02/22/string-interning-done-right/

## Tatra 603: wacky commie hot rod

Every now and then I run into a piece of technology which I find completely mind boggling. Something that shouldn’t really exist, but does anyway. The Tatra 603 is one of these things.

For one thing, it’s a communist automobile from the former Czechoslovakia, released in 1956. You know: the communists, the people who brought us the Trabant and the Lada. The first thing you notice is that, unlike the Trabant or Lada, or even a Skoda, the Tatra is pretty.

Looking under the hood, well, you’ll find … nothing, because it’s a rear engined car, like an old Porsche. Looking in the trunk, you find … an *air cooled V-8* which is insane and amazing. The only air cooled cars most people ever see are Porsches. So basically what we have here is a 6-passenger Porsche with a rumbley motor in it.

Apparently it handled like a giant Porsche also. It was also hand-made like an old Porsche. It was only a 100 horsepower V-8, but it was also a light car with a stick shift. Sort of like one of the 1930s era Jaguar sedans, except with a rear engine and the power curve of a V-8, rather than a straight six.

This mind blowing 1962 communist ad for the Tatra 603 … well, gear heads have to promise to take 13 minutes of their lives to watch this. First off: consider the fact that this was a car only allowed to high communist officials who got professional chauffeurs. I guess high communist officials just sat around all day and watched 13 minute long commercials about the glorious products of the people’s Tatra factory. Second … I mean, look at the driving insanity. Road hogging, drifting … in a rear engined car, reckless (I’ve been on the very same roads; these guys are nuts) Steve McQueen style hot-dogging, off road mud-bogging, outrunning them silly Boss Hoggski policemen, hill climbing, driving on sidewalks, and doing doughnuts in Cesky Krumlov: they even **rolled the damn car down a hill and drove away; just to show it could be done.** What the hell, communist bloc leaders? Either these guys had more fun being communist officials than any other group of people in all of human history … or I don’t know what to think. Either way, try to imagine any of this in an American car ad at any point in history. And then, remember this was *communism*; communism was never sold as a fun ideology; it was a grim and serious ideology covered in human blood. Just skip to the middle if you don’t have the same amount of free time as a high communist party official.

The vague resemblance to the VW bug is no coincidence. The 1930s Tatras were innovators in streamlined cars. The Tatra-77 was a direct ancestor, and the designer (Paul Jaray) was involved with Zeppelin design before he started fooling with cars. The aerodynamics of old Tatras were often better than modern cars, and the VW bug design was lifted directly from Tatra economy cars such as the V570 and the T97.

The communists had only been running the country for a few years when this thing came out in 1956, so it’s really an old capitalist/Paul Jaray design that ended up being made by commies, but it’s pretty damn cool that they kept it going until 1976. Also, the commercial makes me want to study dialectical materialism, so I can have a chauffeur and decorous, refined bimbo to drive around like a maniac with. I’m presuming that everyone in the car was completely schnockered on pivo and slivovitz, and am just a bit disappointed they weren’t all smoking like chimneys through the whole adventure.

http://jalopnik.com/316038/bobash-road-tests-the-1965-tatra-603

## Timestamps done right

I’ve used a lot of tools meant for dealing with time series. Heck, I’ve written a few at this point. The most fundamental piece of dealing with timeseries is a timestamp type. Under the covers, a timestamp is just a number which can be indexed. Normal humans have a hard time dealing with a number that represents seconds of the epoch, or nanoseconds since whenever. Humans need to see things which look like the ISO format for timestamps.

Very few programming languages have timestamps as a native type. Some SQLs do, but SQL isn’t a very satisfactory programming language by itself. At some point you want to pull your data into something like R or Matlab and deal with your timestamps in an environment that you can do linear regressions in. Kerf is the exception.

Consider the case where you have a bunch of 5 minute power meter readings (say, from a factory) with timestamps. You’re probably storing your data in a database somewhere, because it won’t fit into memory in R. Every time you query your data for a useful chunk, you have to parse the stamps in the chunk into a useful type; timeDate in the case of R. Because the guys who wrote R didn’t think to include a useful timestamp data type, the DB package doesn’t know about timeDate (it is an add on package), and so each timestamp for each query has to be parsed. This seems trivial, but a machine learning gizmo I built was entirely performance bound by this process. Instead of parsing the timestamps once in an efficient way into the database, and passing the timestamp type around as if it were an int or a float, you end up parsing them every time you run the forecast, and in a fairly inefficient way. I don’t know of any programming languages other than Kerf which get this right. I mean, just try it in Java.
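Here is the idea in miniature, using numpy’s datetime64 (not Kerf, and only a column type rather than a true language type, but it shows what “parse once, then it’s just an int64” means):

```python
import numpy as np

# Parse once into a 64-bit count of nanoseconds; from then on the column
# moves around as cheaply as any int64 vector, with no re-parsing per query.
stamps = np.array(['2012-01-01T00:00:00', '2012-01-01T00:05:00',
                   '2012-01-01T00:10:00'], dtype='datetime64[ns]')

raw = stamps.view('int64')                  # the very same bytes, seen as integers
assert raw[1] - raw[0] == 300 * 10**9       # the 5 minute meter interval, in ns

# Arithmetic stays in integer land
later = stamps + np.timedelta64(1, 'D')
assert later.view('int64')[0] - raw[0] == 86400 * 10**9
```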

Kerf gets around this by integrating the database with the language.

Kerf also has elegant ways of dealing with timestamps within the language itself.

Consider a timestamp in R’s timeDate. R’s add-on packages timeDate + zoo or xts are my favorite way of doing such things in R, and it’s the one I know best, so this will be my comparison class.

```
require(timeDate)
a = as.timeDate("2012-01-01")
GMT
[1] [2012-01-01]
```

In Kerf, we can just write the timestamp down

```
a: 2012.01.01
  2012.01.01
```

A standard problem is figuring out what a date is relative to a given day. In R, you have to know that it’s basically storing seconds, so:

```
as.timeDate("2012-01-01") + 3600*24
GMT
[1] [2012-01-02]
```

In Kerf, just tell it to add a day:

```
2012.01.01 + 1d
  2012.01.02
```

This gets uglier when you have to do something more complex. Imagine you have to add a month and a day. To do this in general in R is complex and involves writing functions.

In Kerf, this is easy:

```
2012.01.10 + 1m1d
  2012.02.02
```

Same story with hours, minutes and seconds:

```
2012.01.01 + 1m1d + 1h15i17s
  2012.02.02T01:15:17.000
```

And if you have to find a bunch of times which are a month, day, hour and 15 minutes and 17 seconds away from the original date, you can do a little Kerf combinator magic:

```
b: 2012.01.01 + (1m1d + 1h15i17s) times mapright range(10)
  [2012.01.01, 2012.02.02T01:15:17.000, 2012.03.03T02:30:34.000,
   2012.04.04T03:45:51.000, 2012.05.05T05:01:08.000, 2012.06.06T06:16:25.000,
   2012.07.07T07:31:42.000, 2012.08.08T08:46:59.000, 2012.09.09T10:02:16.000,
   2012.10.10T11:17:33.000]
```

The mapright combinator runs the verb and noun to its right on the vector which is to the left. So you’re multiplying (1m1d + 1h15i17s) by range(10) (which is the usual [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] ), then adding it to 2012.01.01.
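For comparison, here is what that line costs in a general-purpose language: a Python stdlib sketch in which the calendar-aware month arithmetic (Kerf’s `1m`) has to be written by hand. The `add_months` helper is mine, not a library function.

```python
import calendar
from datetime import datetime, timedelta

def add_months(dt, months):
    """Calendar-aware month shift, clamping the day (Jan 31 + 1 month -> Feb 29/28)."""
    m = dt.month - 1 + months
    year = dt.year + m // 12
    month = m % 12 + 1
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)

start = datetime(2012, 1, 1)
step_hms = timedelta(days=1, hours=1, minutes=15, seconds=17)   # the 1d + 1h15i17s part

# The mapright line: start + i * (1m1d + 1h15i17s) for i in 0..9
b = [add_months(start, i) + i * step_hms for i in range(10)]
assert b[1] == datetime(2012, 2, 2, 1, 15, 17)
assert b[2] == datetime(2012, 3, 3, 2, 30, 34)
```

In practice one reaches for a third-party package like dateutil for the month part, which is the point: there is no literal like `1m1d` in the language itself.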

You can’t actually do this in a simple way in R. Since there is no convenient token to add a month, you have to generate a time sequence with monthly periods. The rest is considerably less satisfying as well, since you have to build the day, hour, minute and second offsets yourself out of raw counts of seconds. In my opinion, this is vastly harder to read and maintain than the Kerf line.

```
b = timeSequence(from=as.timeDate("2012-01-01"), length.out=10, by="month") +
    (3600*24 + 3600 + 15*60 + 17) * 0:9
 [2012-01-01 00:00:00] [2012-02-02 01:15:17] [2012-03-03 02:30:34]
 [2012-04-04 03:45:51] [2012-05-05 05:01:08] [2012-06-06 06:16:25]
 [2012-07-07 07:31:42] [2012-08-08 08:46:59] [2012-09-09 10:02:16]
 [2012-10-10 11:17:33]
```

This represents a considerable achievement in language design; an APL which is easier to read than a commonly used programming language for data scientists. I am not tooting my own horn here, Kevin did it.

If I wanted to know what week or second these times occur at, I can subset the implied fields in a simple way in Kerf:

```
b['week']
  [1, 6, 10, 15, 19, 24, 28, 33, 37, 42]
b['second']
  [0, 17, 34, 51, 8, 25, 42, 59, 16, 33]
```

I think the way to do this in R is with the “.endpoints” function, but it doesn’t seem to do the right thing:

```
sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04 LTS
other attached packages:
[1] xts_0.9-7  zoo_1.7-12  timeDate_3012.100

.endpoints(b, on="week")
 [1]  0  1  2  3  4  5  6  7  8  9 10
.endpoints(b, on="second")
 [1]  0  1  2  3  4  5  6  7  8  9 10
```

You can cast to a POSIXlt and get the second at least, but no week of year.

```
as.POSIXlt(b)$week
NULL
as.POSIXlt(b)$sec
 [1]  0 17 34 51  8 25 42 59 16 33
```

Maybe doing this using one of the other date classes, like as.Date…

```
weekGizmo <- function(x) { as.numeric(format(as.Date(time(x)) + 3, "%U")) }
```

Not exactly clear, but it does work. If you have ever done things with time in R, you will have had an experience like this. I’m already reaching for a few different kinds of date and time objects in R. There are probably a dozen kinds of timestamps in R which do different subsets of things, because whoever wrote them wasn’t happy with what was available at the time. One good one is better. That way, when you have some complex problem, you don’t have to look at 10 different R manuals and add-on packages to get your problem solved.

Here’s a more complex problem. Let’s say you had a million long timeseries with some odd periodicities and you want to find the values which occur at week 10, second 33 of any hour.

```
ts:{{pwr: rand(1000000, 1.0), time: (2012.01.01 + (1h15i17s times mapright range(1000000)))}}
timing(1)
select *, time['second'] as seconds, time['week'] as weeks from ts where time['second']=33, time['week']=10

┌────────┬───────────────────────┬───────┬─────┐
│pwr     │time                   │seconds│weeks│
├────────┼───────────────────────┼───────┼─────┤
│0.963167│2012.03.01T01:40:33.000│     33│   10│
│0.667559│2012.03.04T04:57:33.000│     33│   10│
│0.584127│2013.03.06T05:06:33.000│     33│   10│
│0.349303│2013.03.09T08:23:33.000│     33│   10│
│0.397669│2014.03.05T01:58:33.000│     33│   10│
│0.850102│2014.03.08T05:15:33.000│     33│   10│
│0.733821│2015.03.03T22:50:33.000│     33│   10│
│0.179552│2015.03.07T02:07:33.000│     33│   10│
│       ⋮│                      ⋮│      ⋮│    ⋮│
└────────┴───────────────────────┴───────┴─────┘
314 ms
```

In R, I’m not sure how to do this in an elegant way; you’d have to use a function that outputs the week of year, then something like the following function (which, FWIW, is fairly slow) to do the query.

```
require(xts)
require(timeDate)
ts = xts(runif(1000000), as.timeDate("2012-01-01") + (3600 + 15*60 + 17) * 0:999999)
weekGizmo <- function(x) { as.numeric(format(as.Date(time(x)) + 3, "%U")) }
queryGizmo <- function(x) {
  wks  = weekGizmo(x)               # week of year for each stamp
  secs = as.POSIXlt(time(x))$sec    # second within the minute
  newx = cbind(x, wks, secs)
  newx[(wks == 10) & (secs == 33)]
}
system.time(queryGizmo(ts))
   user  system elapsed
  4.215   0.035   4.254
```
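For what it’s worth, the same query can be written in plain Python (stdlib only; scaled down to 100,000 rows here to keep it quick, and note that Python’s ISO week numbering may not match Kerf’s `week` field exactly at year boundaries):

```python
import random
from datetime import datetime, timedelta

# Build the series: one reading every 1h15m17s, as in the Kerf example
random.seed(0)
start = datetime(2012, 1, 1)
step = timedelta(hours=1, minutes=15, seconds=17)
rows = [(random.random(), start + i * step) for i in range(100_000)]

# The query: second 33 of the minute, week 10 of the year (ISO numbering)
hits = [(pwr, t) for pwr, t in rows
        if t.second == 33 and t.isocalendar()[1] == 10]

assert hits                         # some readings do land on week 10, second 33
assert all(t.second == 33 and t.isocalendar()[1] == 10 for _, t in hits)
```

Readable enough, but the week and second fields had to be dug out of method calls rather than subset off the timestamp column, and nothing here is a database.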

The way R does timestamps isn’t terrible for a language designed in the 1980s, and the profusion of time classes is to be expected from a language that has been around that long. Still, it is 2016, and there is nothing appreciably better out there other than Kerf.

Lessons for future language authors:

(continues at official Kerf blog)
