Most computer science guys and “software engineers” never deal with the floating point end of their computer machines. This is too bad, as the computer was invented basically to do floating point calculations, whether to build nuclear weapons or to intercept Rooskie bombers. The present use cases of using computers to watch TV, sell underpants on the internet, mining bitcoins, and sending email to the guy across the cubicle from you are later developments: all integer oriented.
Because the subject is so old and obscure, you have to be a sort of historian of science to understand why things are the way they are on the floating point side of the computer machine. You also have to know Fortran, because that’s more or less what most of it is written in. In fact, you have to know old Fortran, because an awful lot of it was written a long time ago.
It is an exaggeration to say everything you can do to simulate or model the world happens using numeric linear algebra of some kind, but it isn’t much of an exaggeration. Numeric linear algebra can be dense, meaning all numbers are defined, and a matrix looks like a matrix in memory, or sparse meaning most of the numbers are 0 and it looks like a hash table in memory. Most problems are inherently dense, but sparsification can make impossible things possible, and there are many data science applications (recommendation engines for example) where sparsity can be assumed if you think your matrix of data has any meaning to it.
You’d think this sort of thing is a trivial problem, but it isn’t. The reason it is two fold. One is the operations are hard. A naive dense square matrix multiply with no trickery involved is O(n^3). If it’s dense, and you need to keep all three matrices around, it’s 3 * O(n^2) in memory. There is an additional problem here which “big O” doesn’t say much about. When you have big-O memory problems, you have IO problems. Your computer doesn’t operate directly on memory chips you install on your computer; it operates directly on registers. While matrix multiply is just vector multiply and sum, the conga dance of doing this in a way that maximizes the data moving in and out of the registers, caches and main memory is extremely intricate and usually machine dependent (because everyone’s computer has a slightly different memory profile).
Most of you have never thought about these matters, because you haven’t tried to simulate a quantum mechanical system, or solved differential equations using finite element analysis. If you build your linear algebra library right, you can get factors of 100 or 1000 speedup. That’s not a big-O speedup, but it sure as hell matters. Computer dorks worry a lot about big-O. Of course, big-O matters very much for both memory and processor cycles. But other things matter too, and computer dorks rarely explicitly worry about them, in part because they have no way of really understanding when they’re happening. I’d posit for any interestingly “large” problem, you’re dealing with something I call “big-IO.” Once you get past having a correct linear algebra algorithm, big-IO is screechingly important on practical problems. It’s the difference between being able to solve a problem and not solve a problem.
I think I’ve convinced any sane reader that matrix multiplies are hard. But this is usually the easiest thing you have to deal with, as it is an actual array operation; something that can be done without explicit loops and branches. QR decompositions, Givens rotations, Eigenvalues, Eigenvectors, Singular Value Decompositions, LU decomposition, inverses, partial inverses; these are more algorithmically challenging. Depending on the properties of the matrix you’re working with (and there’s no simple way to make matrices introspective to tell you this), there may be cheap solutions to these problems, and there may be very expensive ones. Lots of interesting problems are in the O(n^3) range. Regardless of what you use, you have “big-IO” bottlenecks everywhere.
Further complicating your problem, linear algebra on a computer also takes place on several types of floating point data: 32bit/float, 64bit/double and the short and long versions of complex numbers. Some problems are inherently only 32 bit, some are 64 bit; some are inherently real numbers, and some complex numbers. So already, we’re talking about every kind of operation having 4 different types of implementation, all of which must be written and maintained and optimized for your IO and data locality requirements.
Most dense linear algebra work presently happens in something called LAPACK. If you want to calculate eigenvectors in Matlab, Incanter/Clojure, Python, R, Julia and what have you: it’s almost certainly getting piped through LAPACK. If you’re working in C, C++ or Fortran, you’re still probably doing in via LAPACK. If you’re doing linear algebra in Java, you are a lost soul, and you should seriously consider killing yourself, but the serious people I know who do this, they allocate blobs of memory and run LAPACK on the blobs. The same thing was true 25 years ago. In fact, people have basically been running their numerics code on LAPACK for over 40 years when it was called EISPACK and LINPACK. Quite a lot of it conforms with… Fortran 66; a language invented 50 years ago which is specifically formatted for punched cards. That is an astounding fact.
LAPACK is a sort of layercake on top of something called the BLAS (“Basic Linear Algebra Subroutines”). I think BLAS were originally designed as a useful abstraction so your Eigendecomposition functions have more elementary operations in common with your QR decomposition functions. This made LAPACK better than its ancestors from a code management and complexity point of view. It also made LAPACK faster, as it allowed one to use machine optimized BLAS when they are available, and BLAS are the type of thing a Fortran compiler could tune pretty well by itself, especially on old-timey architectures without complex cache latency. While most of these functions seem boring, they are in fact extremely interesting. Just to give a hint: there is a function in there for doing matrix-multiply via the Winograd version of the Strassen algorithm, which multiplies matrices in O(N^2.8) instead of O(N^3). It’s the GEMMS family of functions if you want to go look at it. Very few people end up using GEMMS, because you have to be a numerical linear algebraist to understand when it is useful enough to justify the O(N^0.2) speedup. For example, there are exactly no Lapack functions which call GEMMS. But GEMMS is there, implemented for you, just waiting for the moment when you might need to solve … I dunno, a generalized Sylvester equation where Strassen’s algorithm applies.
BLAS can be hyperoptimized for an individual machine by specialized packages like ATLAS or GotoBLAS (and descendants BLIS and OpenBLAS). Usually, it’s worth a factor of 5 or so, so I usually do it, even though it is a pain in the ass. Those packages with your Linux distribution that call themselves ATLAS BLAS; they are technically, but they’re never as fast as rolling your own. The original GotoBLAS were mostly hand tuned assembler, meta-compiled into very fast BLAS for various permutations and combinations of supported architecture. They’re not called GotoBLAS because they use goto statements (though they do); they’re named after the author, Goto Kazushige, who appears to be some kind of superhuman mentat.
LAPACK is intensely important. Computer people who are not physicists or electrical engineers have probably never given it a second thought. The fact that they’re old Fortran code has implications. The most obvious one is Fortran arrays are column major, just like Linear Algebra itself is. I have no idea why C decided to be row-major; probably because they weren’t doing linear algebra. There are ways of handling this, but it ‘s one more thing you have to keep track of, and it can prevent you from doing things on the C side without expensive memory copies. There is also the matter that because old school Fortran looks like stack allocation of arrays (in code appearance only), you have to do weird things like instantiate your arrays and bizarro looking “workspaces” way up the calling pipeline when you’re calling in a C-like language. This is painful, and efforts have been made (LAPACKE) to make the API a little more friendly where this is done under the covers.
The problem with stuff like this: if you’ve already gone through the trouble of porting a version of C or Fortran LAPACK to your system, LAPACKE makes you do it all over again. The only thing you get out of this exercise is a nicer calling convention, except you have to write the code for your new and improved calling convention, and you don’t get anything extra from this work. So the old calling conventions persist, and probably will persist for another 20-30 years. People will probably be coding up errors that amount to “punched card rejected” in 2050 or 2060.
It’s weird that we still use LAPACK, but considering the difficulty of writing such a thing and moving it from place to place, not terribly surprising. There have been attempts at rewrites, some of which look pretty good to me. FLENS was one of them. I’ve never actually used this, though I have struggled with some vintage of its peculiar build system when examining a neural net package I was interested in. Considering the fact that I found it easier to write my own neural net than to get an old Python package based on FLENS working, I don’t see this project taking over any time soon. While calling FLENS from C++/Python might have been sweet, it didn’t seem useful enough to justify its existence. Another attempt was LibFLAME, which I have never attempted to use. LibFLAME was claiming some performance improvements, which is a reason to think about using it. It was clean and easy to read and came with Goto-san’s enhanced BLAS. Unfortunately, the last major release was in 2010, and the last point release in March of 2014, so I’m presuming the authors are doing other things now. As a historical point, LibFLAME actually employed Goto-san back when he was writing GotoBLAS. Unfortunately for the libFLAME team, he is now at Intel, presumably working on MKL.
For all this, you’d think there has been no progress in numeric dense linear algebra. This isn’t remotely true. It is one of the most interesting areas in computer science research at present. All kinds of crazy things have happened since LAPACK was written. They’re scattered to the winds, existing as scraps of C/Matlab/Fortran. In fact, I would even go so far as to say LAPACK doesn’t even contain the most interesting subset of available dense numeric linear algebra routines, though it inarguably contains the most useful subset. Just one obvious example of its deficiencies; the standard eigen-decomposition routine in LAPACK calculates the complete eigen-decompostion of a matrix. Very often, you only need the top eigenvectors, and that’s a lot cheaper; you can use a Lanczos algorithm. What does one do when this happens? Well, there’s this other ancient Fortran numeric linear algebra package called ARPACK out there. Usually, this gets used, very often for this one function. What about sparse methods? Well, that’s another package. In Fortran days it was IMSL or SparseKit. Now a days, there are many choices.
If I had to bet on a future linear algebra package taking over, it would be libelemental. I only heard about this last year (thanks Ryan), and I have yet to use it for anything important, but it appears to be a very good place to stick new numeric linear algebra algorithms. The problem with it is it depends on a lot of other stuff; LAPACK, libFLAME (optionally for the good parts anyway), Scalapack (old Fortran code for distributed matrix math) and the kitchen sink. It would be nice if they eventually replace these old pieces, but in the meanwhile, it has a decent framework for fetching and building some version of them. A very big upside to libelemental is it gives you access to a lot of useful old code, via a nice modern C API and a modern C++ API. It also gives you access to some useful things built on top of these guys, such as non-negative matrix factorization and skeleton decomposition. It includes convex optimization routines, logistic regression, and quite few other things that are nice to have kicking around. With luck, Jack Poulson, the primary author, will dedicate enough of his time to keep this going and turn it into something that lots of other code depends on out of the box. Seems like there is enough interest at Stanford to keep this moving forward.
Kerf is primarily designed as a tool for dealing with data streams. The most obvious streams of data which bring in customers are stock market data, but there are lots of interesting data streams in the world. Enough that people are starting to have a name for this “vertical” -aka “internet of things.” One of the “things of the internet” I’ve worked with in the past is smartmeter data. This is a stream of electrical power consumption data from individual power meters in homes, commercial sites, factories, electric car chargers and what have you. While everyone knows that markets create big streams of potentially interesting data, the data from power grids is also big and interesting. It’s big because a typical city might have millions of power meters in it, recording data at minute or quarter hour increments. It’s interesting because power companies need to know what is going on with their networks to provide uninterrupted electrical service that we all take for granted. It’s also very interesting to power companies who rely on “renewables” such as wind or solar which are not steady or controllable producers of power. If you have a power grid which relies on mostly renewables, you have all kinds of complicated operations research problems to deal with during times when it is cloudy, dark or the breezes are calm.
The data volume of “internet of things” applications can be unwieldy to manage in standard tools, but there are analytics challenges as well. A standard challenge for smartgrid stuff is doing forecasts. Such models can get really hairy. Every smart meter installation is quite different from every other smart meter. A power meter for a car charging station is vastly different from that of a school, an apartment or a house, and all of those things are different from one another. The “users” of each apartment and house have different schedules, appliances and devices. Getting this right in an automated way can pay big dividends; you can provide services to the people with the meter, and you can do more accurate forecasts on a city wide level if you understand the generating process at the individual meter level.
If you want to get something like this right in an automated fashion, it’s a fairly large project involving paying people like me to think about it for a few months. Doing all this in an automated, general and reliable way involves feature creation, feature selection algorithms, machine learning, data gathering, data cleansing, weather forecasting, wind forecasting, optimization and the kitchen sink. I’d love to eventually have the full suite of such tools to do such a project in Kerf as primitives, but we don’t know if customers would appreciate or use such things, so I’ll make do with some small steps for this blog.
To my mind, the most mathematically interesting thing about roulette is the betting system you should use to maximize your wins. Bet sizing systems are important in all probabilistic games, and the types of lessons learned from a winning game of roulette are the same types of lessons you need to learn in betting on other things, like success in trading, or having an edge on the wiener dog races. The nice thing about a game of roulette is it is relatively easy to characterize your edge. Most people’s edge over the roulette wheel is negative, so you should not bet. If you built one of the computer gizmos I went over in part 2, you have a positive edge over the roulette wheel.
We know from results in information theory, that sequential bets in the presence of an edge should be sized according to the Kelly Criterion to maximize bankroll growth rate.
or, in more probabilistic terms;
where is probability of success.
It’s probably not immediately obvious why this is so, but consider a biased coin toss at even odds ($1 payoff for $1 bet). If your coin’s edge is 100%, you gain money fastest by betting your whole bankroll. If you have 0% edge, you shouldn’t bet anything. If you have a 1% edge, you should bet 1% of your bankroll.
Daniel Bernoulli came up with the same fraction a long time before by maximizing the geometric mean.
Kelly’s original paper figured this out by modeling how a better would place bets assuming he had insider information transmitted over a noisy wire transmitting a binary code; a beautiful way of thinking about predictions in the presence of noise. Kelly is a guy I wish had lived longer. He dropped dead at the young age of 41; in his short life he was a Naval aviator in WW-2, invented computer speech synthesis, made huge contributions to information theory, mentored important mathematicians (Elwyn Berlekamp, who went on to found Axcom/Rentech, based in part on Kelly’s insights) and had the kind of life that would be considered hyperbole if he was in a science fiction novel. They make big men in Texas. Kelly was a giant.
I’ve been known to take sadistic glee in making fun of economists. One of the most mockable economists in American history is (Nobelist -the Swedes have dry humor) Paul Samuelson. One could write entire books on the ways in which Samuelson was a scoundrel and a numskull who set back human knowledge by decades. One fact will suffice for this essay: Samuelson didn’t believe in Kelly betting. Explaining why he thought this, and why he’s wrong would be pointless; debugging an economist’s faulty thought processes is as pointless as explaining why a crazy lady is breaking dishes in the kitchen. If you’re interested, Ed Thorp is your man here also.
Following Ed Thorp’s original essay in the Gambling Times, as good little experimental physicists, we need to build up an error budget to figure out our edge. Thorp breaks down the errors in his and Shannon’s Roulette system into several kinds.
- E1 Rotor speed measurement error
- E2 Ball speed measurement error
- E3 Ball rotor path randomness
- E4 Ball stator path randomness
- E5 Fret scatter
- E6 Rotor tilt (discovered by Shannon and Thorp)
Uncorrelated errors add up as the sum of squares, so the total error budget is
The Thorp/Shannon roulette system had a 44% edge on the most favored number; single number payouts in Vegas are 35:1, making the correct bet on one number 0.44/ 35 = 0.01256. Since nobody in 1960s Vegas suspected the mathematical machinators of having a physics edge on the wheel, they were able to place larger bets on parts of the quadrant. While Thorp describes it as “diversification” in his exposition. Another way of thinking about it: he’s just playing more games at once. A friend and former customer explained his trend following method as working in much the same way. The more bets you place, the more likely you’ll hit a winning trend.
Kelly betting isn’t a perfect solution in all cases; fixed fraction betting has certain disadvantages when you can’t exactly characterize your edge, or the payout odds, or you have a limited number of bets before you have to cash in your chips. However, in the case of a machine to beat Roulette, it’s difficult to think of a better technique.
Of course, Kelly betting and things like it figure in other sorts of betting; people do use it in Markets where it is appropriate. Supposedly it was part of Axcom/Rentech’s early secret sauce, and certainly folks who have thought about trading need a bet sizing and risk management strategy that makes sense. Kelly is often a good place to start, depending on your situation. But that’s a topic for another blog post. One more coming on modern techniques to beat Roulette, including the one I came up with in 2010 (which, in case you were holding your breath, didn’t really work, which is why I have to work, and am willing to talk about such things in blogs).
Kelly criterion resources
Kelly’s original paper:
The common method for interning strings breaks in fantastic ways. In Kerf, we’ve taken the old method and revised it for success in the current generation of languages.
If you’ve forgotten what string interning is from your time with Java, it’s a simple way of ensuring that any string appears only once. So for instance, if I have ten objects whose type is the string “lion”, I don’t have ten copies of the string “lion” in memory. What I have instead is a single copy of “lion” and then ten links which all point to the single copy somehow.
Most often these links are pointers (raw addresses in memory). We’ll discuss how this breaks soon. The reference copy of “lion” is never allowed to be moved, changed, or released. It’s permanent. The other implementation detail to figure out is how to keep the strings unique. The next time we create a “lion” string we need to trade our string in for a link, and what is usually done is that the reference copy is stored in a hashtable or some other deduplicating data structure. This lets us figure out that lion already exists, and when we perform the check we can walk away with our link at the same time. If the phrase is “bullmation”, and it doesn’t already exist, then we can add the initial copy to the data structure at the same time.