Ruins of forgotten empires: APL languages
One of the problems with modern computer technology: programmers don’t learn from the great masters. There is such a thing as a Beethoven or Mozart of software design. Modern programmers seem more familiar with Lady Gaga. It’s not just a matter of taste and an appreciation for genius. It’s a matter of forgetting important things.
There is a reason I use “old” languages like J or Lush. It’s not a retro affectation; I save that for my suits. These languages are designed better than modern ones. There is some survivor bias here; nobody slings PL/1 or Cobol willingly, but modern language and package designers don’t seem to learn much from the masters. Modern code monkeys don’t even recognize mastery; mastery is measured in dollars or number of users, which is a poor substitute for distinguishing between what is good and what is dumb. Lady Gaga made more money than Beethoven, but, like, so what?
Comparing, say, Kx systems Q/KDB (80s technology which still sells for upwards of $100k a CPU, and is worth every penny) to Hive or Reddis is an exercise in high comedy. Q does what Hive does. It does what Reddis does. It does both, several other impressive things modern “big data” types haven’t thought of yet, and it does them better, using only a few pages of tight C code, and a few more pages of tight K code.
APL languages were developed a long time ago, when memory was tiny compared to the modern day, and disks much slower. They use memory wisely. Arrays are the basic data type, and most APL language primitives are designed to deal with arrays. Unlike the situation in many languages, APL arrays are just a tiny header specifying their rank and shape, and a big pool of memory. Figuring out what to do with the array happens when the verb/function reads the first couple of bytes of the header. No mess, no fuss, and no mucking about with pointless loops.
Code can be confusing if you don’t drink the APL kool-aide, but the concept of rank makes it very reusable. It also relegates idiotic looping constructs to the wastebin of history. How many more for() loops do you want to write in your lifetime? I, personally, would prefer to never write another one. Apply() is the right way for grown-assed men do things. Bonus: if you can write an apply(), you can often parallelize things. For(), you have to make too many assumptions.
One of the great tricks of the APL languages: using mmap instead of scanf. Imagine you have some big chunk of data. The dreary way most languages do things, you vacuum the data in with scanf, grab what is useful, and if you’re smart, throw away the useless bits. If you’re dealing with data which is bigger than core, you have to do some complex conga dance, splitting it up into manageable chunks, processing, writing it out somewhere, then vacuuming the result back in again. With mmap, you just point to the data you want. If it’s bigger than memory …. so what? You can get at it as quickly as the file system gets it to you. If it’s an array, you can run regressions on big data without changing any code. That’s how the bigmemory package in R works. Why wasn’t this built into native R from the start? Because programmers don’t learn from the masters. Thanks a lot, Bell Labs!
This also makes timeseries databases simple. Mmap each column to a file; selects and joins are done along pointed indexes. Use a file for each column to save memory when you read the columns; usually you only need one or a couple of them. Most databases force you to read all the columns. When you get your data and close the files, the data image is still there. Fast, simple and with a little bit of socket work, infinitely scalable. Sure, it’s not concurrent, and it’s not an RDBMS (though both can be added relatively simply). So what? Big data problems are almost all inherently columnar and non-concurrent; RDBMS and concurrency should be an afterthought when dealing with data which is actually big, and, frankly, in general. “Advanced” databases such as Amazon’s Redshift (which is pretty good shit for something which came out a few months ago) are only catching onto these 80s era ideas now.
Crap like Hive spends half its time reading the damn data in, using some godforsaken text format that is not a mmaped file. Hive wastefully writes intermediate files, and doesn’t use a column approach, forcing giant unnecessary disk reads. Hive also spends its time dealing with multithreaded locking horse shit. APL uses one thread per CPU, which is how sane people do things. Why have multiple threads tripping all over each other when a query is inherently one job? If you’re querying 1, 10 or 100 terabytes, do you really want to load new data into the schema while you’re doing this? No, you don’t. If you have new data streaming in, save it somewhere else, and do that save in its own CPU and process if it is important. Upload to the main store later, when you’re not querying the data. The way Q does it.
The APL family also has a near-perfect level of abstraction for data science. Function composition is trivial, and powerful paradigms and function modifications via adverbs are available to make code terse. You can afflict yourself with for loops if that makes you feel better, but the terse code will run faster. APL languages are also interactive and interpreted: mandatory for dealing with data. Because APL languages are designed to fit data problems, and because they were among the first interpreters, there is little overhead to slow them down. As a result, J or Q code is not only interactive: it’s also really damn fast.
It seems bizarre that all of this has been forgotten, except for a few old guys, deep pocketed quants, and historical spelunkers such as myself. People painfully recreate the past, and occasionally, agonizingly, come to solutions established 40 years ago. I suppose one of the reasons things might have happened this way is the old masters didn’t leave behind obvious clues, beyond, “here’s my code.” They left behind technical papers and software, but people often don’t understand the whys of the software until they run into similar problems.
Some of these guys are still around. You can actually have a conversation with mighty pioneers like Roger Hui, Allen Rose or Rohan J (maybe in the comments) if you are so inclined. They’re nice people, and they’re willing to show you the way. Data science types and programmers wanting to improve their craft and increase the power of their creations should examine the works of these masters. You’re going to learn more from studying a language such as J than you will studying the latest Hadoop design atrocity. I’m not the only one who thinks so; Wes McKinney of Pandas fame is studying J and Q for guidance on his latest invention. If you know J or Q, he might hire you. He’s not the only one. If “big data” lives up to its promise, you’re going to have a real edge knowing about the masters.
Start here for more information on the wonders of J.