Locklin on science

Notation as a tool for thought: Wavelets in J

Posted in J, statistical tools by Scott Locklin on November 14, 2014

I’ve recently had a need for Wavelets in my work in J. There is no wavelet toolbox in J, so, I wrote my own (to be released to pacman and github, eventually, along with a version of this blog as a “lab”). This exercise was extremely valuable in two ways. One, it improved my J-ing: I am, and will probably always remain, relatively weak of the J sauce, though  this exercise leveled up my game somewhat. Two, it vastly improved my understanding of the discrete wavelet transform algorithm. Arthur Whitney said something to the effect that you can really understand a piece of code when it all fits on one page. Well, it turns out that’s very true.

Wavelets are filters recursively applied to a timeseries. In the DWT, you’re effectively recursively subsampling and filtering a time series with a low pass filter and a high pass filter. The resulting elements are at different levels, with a remaining low pass filtered time series at the end. The operations at each level are the same. This is the well-known “pyramid algorithm” of Stephane Mallat.

In C, the way you would do the individual levels is by allocating a vector with half the number of elements, then filling up each element with the correct subsampled, filtered values. In J, the way you do this is by projecting the elements and the filters in the correct way, then multiplying them out and getting the answer. J allocates more memory to do this, but it’s something that could be mitigated if it becomes a problem. As is often the case with J, the runtime of the J version, without spending a moment to optimize for space or time, is quite comparable to that of the C version; for the wavelet transform, within about a factor of 2 of the C results. Here’s what a wavelet level calculation looks like in C. I took it from the excellent waveslim package in R, FWIW:



void dwt(double *Vin, int M, int L, double *h, double *g, 
	 double *Wout, double *Vout)
{
  /* Vin: input series of length M; L: filter length;
     h: high pass (wavelet) filter; g: low pass (scaling) filter;
     Wout: wavelet coefficients; Vout: scaling (decimated) output */
  int n, t, u;
  for(t = 0; t < M/2; t++) {    /* one output value per pair of inputs */
    u = 2 * t + 1;
    Wout[t] = h[0] * Vin[u];
    Vout[t] = g[0] * Vin[u];
    for(n = 1; n < L; n++) {
      u -= 1;
      if(u < 0) u = M - 1;      /* wrap around: the filter is applied circularly */
      Wout[t] += h[n] * Vin[u];
      Vout[t] += g[n] * Vin[u];
    }
  }
}

J doesn’t have functions; it has verbs. “dyad define” is for defining verbs which operate on two ‘nouns.’ The first one comes before the verb, the second one comes after the verb. They have implied names x (left noun) and y (right noun) inside the verb. “dyad define” is almost never used by J hackers, as it evaluates to the string ‘4 : 0’ (don’t ask me; you get used to it). Similarly the monad define in the wdict verb evaluates to ‘3 : 0’; and monadic verbs only have y (right) noun inputs. The other thing to remember about J: it evaluates from right to left, like Hebrew. The following verb is for evaluating the compiled C function above, using J’s very cool FFI.
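
A tiny throwaway example (mine, not part of the wavelet code; addmul is a made-up name) shows the x/y convention and the right-to-left rule:

   2 * 3 + 4          NB. right to left: 2 * (3 + 4), not (2 * 3) + 4
14
   addmul=: dyad define
x * x + y
)
   2 addmul 4         NB. x is 2, y is 4: 2 * (2 + 4)
12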

LDWT=: '/path/to/dwt.so'   NB. where your compiled C lives
dwtc=: dyad define
 'hpf lpf'=.wdict x
 Newy=.((#y)%2)#(2.2-2.2)  NB. half-length vector of float zeros for the scaling output
 Wout=.((#y)%2)#(2.2-2.2)  NB. half-length vector of float zeros for the wavelet output
 cmd=. LDWT, ' dwt n *d i i *d *d *d *d'
 cmd cd (y+2.2-2.2);(<.(#y));(#hpf);hpf;lpf;Wout;Newy
 Wout;Newy
)

For purposes of this exercise, wdict is a simple dictionary that returns the boxed high pass and low pass filter. Boxing is literally the same thing as a struct in C, though in J the use of structs ends up being quite different. To save typing, I know I can derive the HPF from the LPF for a given wavelet type, which I accomplish with the HpLp verb. To get the High Pass filter from the Low Pass you multiply the odd elements by -1, then rotate their order. So, in J, remembering J executes from right to left, a monadic # with a noun to the right is the tally operation: it returns the number of elements. i. generates an integer index, so i.# is a constructor of a sequence on the number of elements in y. [: is the cap which makes the i.# group of verbs into a sequential operation, preventing it from being evaluated as a fork. _1 is -1, ^ is power (so _1 ^ i.# gives the alternating signs 1 _1 1 _1 …), * is dyadic times as it is in most programming languages, and |. is the “rotate vector” verb (used monadically here, it simply reverses the order). ;~ does magic which flips the arguments of the box verb ; around, so the new high pass filter gets boxed together with the original low pass value, which is represented by ].


NB. get the high pass from the low pass, return a box list of both
HpLp =: ] ;~ |. * _1 ^ [: i. #
wdict=: monad define
 select. y
  case. 'haar' do.
   HpLp 2 # %: % ] 2
  case. 'db4' do.
   HpLp 0.482962913144534, 0.836516303737808, 0.224143868042013,  _0.12940952255126
  end.
)
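
A quick sanity check of the haar case, with the boxed result pulled apart into hpf and lpf (this is just my own console session, not part of the library):

   'hpf lpf' =. wdict 'haar'
   hpf
0.707107 _0.707107
   lpf
0.707107 0.707107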

If you compile the C piece and run it with the J FFI on a 2^n long vector, you’ll get the right answer. But the J implementation is cooler, and runs within a factor of 4 of the C.


oddx=: ] {~ ([: (] #~ 0 1 $~ #) [: i. [: # ]) -/ [: i. [: # [
dwt=: dyad define
'hpf lpf'=.wdict x
 yvals=. hpf oddx y
 (yvals +/ . * hpf);yvals +/ . * lpf
)

Since we used “dyad define”, we know that it takes a left and right noun argument (x and y in the verb). The way this one is used looks like this:


'db4'dwt myts

Same calling convention and comments apply to the C version. The real action of the dwt verb happens with oddx. oddx is a tacit verb. I wrote it this way for conciseness and speed. It’s not a very good tacit verb, because I am not a very good slinger of J sentences, but it does what I want it to. Unless someone smarter comes along and gives me a better idea, that’s what we’re stuck with. I am not going to give you a breakdown of how it works inside, but I will show you what it does, and how it helps encapsulate a lot of information into a very small place. oddx is a dyad; it takes x and y values. In tacit programming, the x’s and y’s are represented by [ and ] respectively, with [: as described above, as a sort of spacer between chunks. I will resist explaining all the pieces; feel free to examine it in J 8’s excellent sentence dissector. I have already explained the i. verb. The values 0:15 are created by i.16. So, a useful way to examine what this verb does is to run it on something like i.16.

  (1,2)oddx i.16
1 0
3 2
5 4
7 6

Huh. So, it’s taking the y input, which is 0:15, and making two columns, which seem to be the odd and even sequence of the first 4 elements… You might guess that it is taking the size of the x noun and operating on the y input. You would guess correctly. Let’s try it on something bigger:


  (1,2,3,4)oddx i.16
 1  0 15 14
 3  2  1  0
 5  4  3  2
 7  6  5  4
 9  8  7  6
11 10  9  8
13 12 11 10
15 14 13 12

Aha, now it is starting to make sense. Let’s try something bigger still:


  (i.6)oddx i.16
 1  0 15 14 13 12
 3  2  1  0 15 14
 5  4  3  2  1  0
 7  6  5  4  3  2
 9  8  7  6  5  4
11 10  9  8  7  6
13 12 11 10  9  8
15 14 13 12 11 10

Now you can see what it does. The columns alternate between odd and even values of y. The rows are the rotated versions of the first two. The more values you add, the deeper the rotation becomes. Why did I do this? Well, a wavelet is these very index elements, multiplied by the high pass and low pass filters. If you look at the last two lines of the dwt verb, you can see it:


 yvals=. hpf oddx y
(yvals +/ . * hpf);yvals +/ . * lpf
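
As an aside, +/ . * is J’s matrix product (a sum of elementwise products); a one-line check on made-up numbers:

   (2 2 $ 1 2 3 4) +/ . * 5 6
17 39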

yvals contains the rotated elements. The next line contains two pieces. The one on the right is the rotated yvalues array multiplied (+/ . * is matrix multiplication) by the low pass filter. The invocation of oddx happens with hpf, but it could take lpf; it is just using the size of the filter to construct the rotated matrix for multiplication. This is the wavelet decimated y, which you need to reconstruct the time series. The other piece is the rotated yvalues multiplied by the high pass filter. This is the level 1 wavelet. They are separated by a ‘;’ which puts them in “boxed” form (remember: typeless structs), so the decimated yvalues and the level 1 wavelet are kept separate. There is probably a way of making this a one liner using forks and such, but at my present state of enlightenment, it’s not obvious, and this suits me just fine for clarity and runtime’s sake. Now, almost nobody cares about the level 1 wavelets; usually you want a couple of levels of wavelets, so you need a way to recursively apply this guy to the results. I used the power conjunction. Raul Miller fixed up my messy verb into a very clear adverb on the J-programming list.


dwtL=: adverb define
:
 'wn yn'=. m dwt y
  wn; (<:x) (m dwtL)^:(x>1) yn
)

You’d call it like this:

2 'db4' dwtL y

Adverbs return verbs based on modifiers. In this case the adverb returns a dyadic verb modified by m, which here is the kind of wavelet (‘db4’ and ‘haar’ are the only ones defined in the wdict monad above). dwtL is a dyad: the left noun x is the number of levels, and y is the data. You can see dwt being used to calculate the decimated yn and wn wavelet. The tricky piece is the last line. This is the power conjunction; the ^:(x>1) piece. Dyadic ‘>’ does what it does in regular programming languages (in this case anyway). It checks if x is > 1, and the ^: power conjunction calls dwtL recursively as long as this is true. More or less, this is a while loop. At the front of the dwtL call, you can see wn being prepended to the output with a ; to box it. You can also see a (<:x). This is a decrement on the level noun x. So, the verb keeps being called with a decremented level until the level is at 1. The wavelet levels are prepended, and you end up with a boxed set of all the wavelets from 1:Level, with the decimated y stuck at the end. Once you grasp how the power conjunction works, this is a really obvious and sweet implementation of Mallat’s “pyramid algorithm” for calculating the levels of wavelets.
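
If the power conjunction is new to you, two toy lines (mine, nothing to do with wavelets) show its counted and conditional flavors:

   +:^:3 ] 2              NB. double, applied 3 times to 2
16
   >:^:(<&10)^:_ ] 1      NB. increment while less than 10: a while loop
10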

Taking the inverse wavelet transform is conceptually more difficult, since you’re stitching together the values of y with the different levels of wavelet.  There is more than one way to skin this cat, but the way I was thinking when I wrote these verbs was “how do I go from the smaller/subsampled values to the next level up in y in the reverse pyramid algorithm?”

First, I needed to rearrange my mother wavelets to do the inverse. I needed a verb like this:


filtrot =: [: |: [: |. (2 ,~ 2 %~ [: # [) $ [: |. ]

You can see what it does if you apply it to i.4 or i.8


filtrot i.4
1 3
0 2
filtrot i.8
1 3 5 7
0 2 4 6

Next, I need a verb which rearranges the values of the input such that it is aware of the shape of the rotated mother wavelet pattern. This was hairy, but I did it with the following verb, which takes the rotated wavelet as the x value, and the yval or wavelet level as the y value:

drot =:   ] {~ ([: i. [: # ]) +/ 0 , (_1 * [: # ]) + [: }. [: i. [: # [: {. [

You can see how it works like this


(filtrot i.4) drot i.8

0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 0

(filtrot i.6) drot i.8
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 0
7 0 1

As you can see, the pattern is to rotate the original values in the opposite direction from the one used in the forward wavelet transform.

Finally, I need to double these guys up, and element-wise multiply them against the correct number of rotated wavelets. For the double up part, ‘2 #’ is the correct verb. Notice it is # called dyadically. Dyadic overloadings of monadic verbs are confusing at first, but you get used to it.


2 # (filtrot i.4) drot i.8
0 1
0 1
1 2
1 2
2 3
2 3
3 4
3 4
4 5
4 5
5 6
5 6
6 7
6 7
7 0
7 0

Putting it all together with the element-wise multiplication, we use the verb “reducer.” I’m getting tired of explaining what all the pieces do, so I’ll write it in explicit to tacit (13 : gives the explicit-to-tacit conversion) form to show all the implied x’s and y’s, and point you to the documents for $/shape:


13 : ' (2 # (x drot y)) * ((2*#y)$x)'
reducer=: (2 # drot) * [ $~ 2 * [: # ]

Finally combining into the following:


idwt=: dyad define
'wv yl' =. y
'hpf lpf'=. filtrot each wdict x
+/"1 (lpf reducer yl) + hpf reducer wv
)

So, the inverse transform runs the reducer on the wavelet and on the decimated y, adds the two together, and then the +/"1 at the front sums along the rows to give back the series one level up.

The real sweet part is running the inverse on all the levels. When I first encountered the / adverb, I thought it was sort of like “apply.” This is wrong; what it does exactly is stick a copy of the verb which comes before it in between all the values of what comes after it. Since you’ve stacked all these wavelets together in a convenient way, you can use the / adverb to stick the inverse transform operator between all the wavelets. You need a few extra pieces to make it evaluate properly, but since J evaluates from right to left, and the decimated values sit at the far right of the boxed list, this is a very satisfying inverse.
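
Here is / in miniature (a throwaway example of mine), with subtraction to make the right-to-left order visible:

   +/ 1 2 3 4          NB. 1 + 2 + 3 + 4
10
   -/ 1 2 3 4          NB. 1 - (2 - (3 - 4))
_2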


idwtL=: dyad define
idw =. [: < [: x&idwt ,
> idw/y
)

I’m sure dwtL and idwtL  could be compressed into a one liner, and it could be read by someone familiar enough with tacit J programming. But this is a good level of abstraction for me. It’s remarkable that the performance is as good as it is, as I basically just wrote down the first verbs which occurred to me (the inverse is fairly inefficient: OCaML speed instead of C speed). I’ve used wavelets for years; they’re very handy things for signal processing, and I have even written code to calculate them before. Writing them down in J really brought my understanding of what they are to a new level. In an ideal world, people would write algorithms in this compressed form, so that mathematicians, engineers and computer scientists could communicate ideas more clearly and concisely. Alas, Iverson’s dream of notation as a tool of thought hasn’t yet caught on. Maybe some day. While something like J is confusing to people who have been programming for years, there is nothing inherently difficult about it. Various educators have been teaching J to kids in grammar school. It’s actually very simple compared to something like C, and since it has many similarities to ordinary written language, for kids it might even be simpler. Maybe some day.

Shannon information, compression and psychic powers

Posted in information theory, J by Scott Locklin on October 31, 2013

Fun thing I found in the J-list, a game of code golf. Generate the following pattern with the least number of bytes of code:

_____*_____
_____*_____
____*_*____
___*___*___
__*_____*__
**_______**
__*_____*__
___*___*___
____*_*____
_____*_____
_____*_____

The  J solution by Marshall Lochbaum is the best one I’ve seen, taking only 18 bytes of code.

'_*'{~4=+/~ 4<.|i:5

I’ll break this down into components for the folks who don’t speak J, so you can see what’s going on:

i:5
NB. -5 to 5;

_5 _4 _3 _2 _1 0 1 2 3 4 5

|i:5
NB. make the list positive

5 4 3 2 1 0 1 2 3 4 5

4<.|i:5
NB. cap it at 4

4 4 3 2 1 0 1 2 3 4 4

+/~ 4<.|i:5
NB. ~ is magic which feeds the list to the verb on its left as
NB. both its left and right arguments: (list) +/ (list)
NB. dyadic +/ builds the addition table: every pairwise sum of
NB. the two lists

8 8 7 6 5 4 5 6 7 8 8
8 8 7 6 5 4 5 6 7 8 8
7 7 6 5 4 3 4 5 6 7 7
6 6 5 4 3 2 3 4 5 6 6
5 5 4 3 2 1 2 3 4 5 5
4 4 3 2 1 0 1 2 3 4 4
5 5 4 3 2 1 2 3 4 5 5
6 6 5 4 3 2 3 4 5 6 6
7 7 6 5 4 3 4 5 6 7 7
8 8 7 6 5 4 5 6 7 8 8
8 8 7 6 5 4 5 6 7 8 8

4=+/~ 4<.|i:5
NB. Which of these is equal to 4?

0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 1 0 1 0 0 0 0
0 0 0 1 0 0 0 1 0 0 0
0 0 1 0 0 0 0 0 1 0 0
1 1 0 0 0 0 0 0 0 1 1
0 0 1 0 0 0 0 0 1 0 0
0 0 0 1 0 0 0 1 0 0 0
0 0 0 0 1 0 1 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0

'_*'{~4=+/~ 4<.|i:5
NB. { is an array selector, ~ pops the previous result in 1's
NB. and 0's in front of  the string '_*', which creates the
NB. required array of characters and voila!

_____*_____
_____*_____
____*_*____
___*___*___
__*_____*__
**_______**
__*_____*__
___*___*___
____*_*____
_____*_____
_____*_____

I asked some facebook pals for a solution, and got some neat ones:

Python from Bram in 94 bytes:

for i in range(-5,6):print ''.join(['*_'[min(abs(i),4)+min(abs(j),4)!=4] for j in range(-5,6)])

A clever 100 byte Perl solution from Mischa (to be run with -Mbigint):

for$i(0..10){print($_%12==11?"\n":1<<$i*11+$_&41558682057254037406452518895026208?'*':'_')for 0..11}

Some more efficient solutions by Mischa in perl and assembler.

Do feel free to add clever solutions in the comments! I’d like to see a recursive solution in an ML variant, and I’m pretty sure this can be done in only a few C characters.

It’s interesting in that the display of characters is only 132 bytes (counting carriage returns, 121 without), yet it is hard in most languages to do this in under 100 bytes. The winner of the contest used a 73 byte shell script with gzipped contents attached. Pretty good idea and a praiseworthy hack. Of course, gzip assumes any old kind of byte might need to be packed, so it’s not the most efficient packing mechanism for this kind of data.

This gets me to an interesting point. What is the actual information content of the pattern? Claude Shannon told us this in 1948. The information content of a character in a string is approximately \sum\limits_{i=1}^{n} prob_i \log_2\left(\frac{1}{prob_i}\right)

If you run this string through a gizmo for calculating Shannon entropy (I have one in J, but use the R infotheo package), you’ll find out that it is completely specified by 3 bits sent 11 times, for a total of 33 bits or 4.125 bytes. This sounds completely insane, but look at this coding for the star pattern:

000 _____*_____
000 _____*_____
001 ____*_*____
010 ___*___*___
011 __*_____*__
100 **_______**
011 __*_____*__
010 ___*___*___
001 ____*_*____
000 _____*_____
000 _____*_____

So, if you were to send this sucker on the wire, an optimal encoding gives 33 bits total. Kind of neat that the J solution is only 144 bits. It is tantalizing that there are only 5 distinct messages here, leaving room for 3 more messages. The Shannon information calculation notices this; if you do the calculation, it’s only 2.2 bits. But alas, you need some kind of wacky pants quantum computer that hasn’t been invented yet to send fractions of bits, so you’re stuck with a 3 bit code.
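
If you want to check that number, the calculation is a one-liner in J (my own throwaway line, using the row frequencies 4, 2, 2, 2, 1 out of 11):

   p =. 4 2 2 2 1 % 11
   +/ p * 2 ^. % p      NB. Shannon entropy in bits, about 2.19
2.1867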

The amount of data that can be encoded in a handful of bits is generally underestimated. Consider the 20 question game. Go click on the link. It’s a computer program that can “read your mind.” It’s an extremely clever realization of Shannon information for a practical purpose. The thing can guess what you’re thinking about by asking you 20 questions. You only give it 20 bits of information, and it knows what you’re thinking about. The reason this is possible is that 20 bits is actually a huge amount of information. If you divide the universe of things a person can think of into 2^20 pieces (over a million of them), each piece is pretty small, and is probably close enough to something that it seems like the machine can read your mind. This is also how “mentalism” and psychic powers work. Psychics and mentalists get their bits by asking questions, and either processing the answers, or noticing the audience reaction when they say something (effectively getting a bit when they’re “warm”). They also only have to deal with the limited number of things people tend to be worried about when they talk to psychics and TV mentalists in the first place.

Lots of machine learning algorithms work like optimal codes, mentalists and the 20 questions game. Decision trees, for example. Split the universe of learned things into a tree, which is effectively 1’s and 0’s (and of course, the split criterion at each node).

It’s little appreciated that compression algorithms are effectively a form of decision tree. In fact, they’re pretty good as decision trees, in particular for “online” learning, and sequence prediction. Look at a recent sequence, see if it’s like any past sequences. If it is a novel sequence, stick it in the right place in the compression tree. If it occurred before, you know what the probability of the next character in the string is by traversing the tree. Using such gizmos, you can generate artificial sequences of characters based on what is in the tree. If you encode “characters” as “words,” you can generate fake texts based on other texts. If you encode “characters” as “musical notes” you can generate fake Mozart based on Mozart. There are libraries out there which do this, but most folks don’t seem to know about them. I’ve noodled around with Ron Begleiter’s VMM code which does this (you can find some primitive R hooks I wrote here) to get a feel for it. There’s another R package called VLMC which is broadly similar, and another relative in the PST package.

One of the interesting properties of compression learning is, if you have a good compression algorithm, you have a good forecasting algorithm.  Compression is a universal predictor. You don’t need to do statistical tests to pick certain models; compression should always work, assuming the string isn’t pure noise. One of the main practical problems with them is picking a discretization (discretizations … can also be seen or used as an ML algorithm). If the data is already discrete, a compression algorithm which compresses well on that data will also forecast well on the data. Taking this a step further into crazytown, one can almost always think of a predictive algorithm as a form of compression. If you consider something like EWMA, you’re reducing a potentially very large historical time series to one or two numbers.
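
As a toy illustration of that last point, here is a hypothetical one-verb EWMA in J (ewma is a name I made up; it has nothing to do with the compression libraries above); it boils an arbitrarily long series down to one number:

NB. exponentially weighted moving average, folded from the oldest value forward
ewma=: dyad define
 step=. (x & *)@[ + ((1-x) & *)@]   NB. new observation on the left, running average on the right
 step/ |. y
)

   0.1 ewma 1 2 3 4 5
1.9049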

While the science isn’t well established yet, and it certainly isn’t widely known (though the gizmo which guesses what google search you’re going to do is a prefix tree compression learner), it looks like one could use compression algorithms  to do timeseries forecasting in the general case, hypothesis testing, decryption, classification, regression and all manner of interesting things involving data. I consider this idea to be one of the most exciting areas of machine learning research, and compression learners to be one of the most interesting ways of thinking about data. If I free up some time, I hope to write more on the subject, perhaps using Ron’s code for demonstrations.

Fun people to google on this subject:

Boris Ryabko & Son (featured in the ants thing)

Vladimir Vovk

Paul Algoet

Nicolò Cesa-Bianchi

Gabor Lugosi

Marcus Hutter

The ubiquitous Thomas Cover

Ruins of forgotten empires: APL languages

Posted in Design, J, Lush by Scott Locklin on July 28, 2013

One of the problems with modern computer technology: programmers don’t learn from the great masters. There is such a thing as a Beethoven or Mozart of software design. Modern programmers seem more familiar with Lady Gaga. It’s not just a matter of taste and an appreciation for genius. It’s a matter of forgetting important things.

talk to the hand that made APL

There is a reason I use “old” languages like J or Lush. It’s not a retro affectation; I save that for my suits. These languages are designed better than modern ones. There is some survivor bias here; nobody slings PL/1 or Cobol willingly, but modern language and package designers don’t seem to learn much from the masters. Modern code monkeys don’t even recognize mastery; mastery is measured in dollars or number of users, which is a poor substitute for distinguishing between what is good and what is dumb.  Lady Gaga made more money than Beethoven, but, like, so what?

Comparing, say, Kx Systems’ Q/KDB+ (80s technology which still sells for upwards of $100k a CPU, and is worth every penny) to Hive or Redis is an exercise in high comedy. Q does what Hive does. It does what Redis does. It does both, plus several other impressive things modern “big data” types haven’t thought of yet, and it does them better, using only a few pages of tight C code, and a few more pages of tight K code.

This man’s software is superior to yours

APL languages were developed a long time ago, when memory was tiny compared to the modern day, and disks much slower. They use memory wisely. Arrays are the basic data type, and most APL language primitives are designed to deal with arrays. Unlike the situation in many languages, APL arrays are just a tiny header specifying their rank and shape, and a big pool of memory. Figuring out what to do with the array happens when the verb/function reads the first couple of bytes of the header. No mess, no fuss, and no mucking about with pointless loops.

Code can be confusing if you don’t drink the APL kool-aid, but the concept of rank makes it very reusable. It also relegates idiotic looping constructs to the wastebin of history. How many more for() loops do you want to write in your lifetime? I, personally, would prefer to never write another one. Apply() is the right way for grown-assed men to do things. Bonus: if you can write an apply(), you can often parallelize things. With for(), you have to make too many assumptions.
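
To make that concrete, here is a two-line J session of mine: the same +/ runs over a whole table with no loop in sight, and the rank conjunction picks which axis it runs along:

   m =. i. 2 3      NB. a 2 by 3 table: rows 0 1 2 and 3 4 5
   +/ m             NB. sum down the columns
3 5 7
   +/"1 m           NB. sum along the rows
3 12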

Roger Hui, also constructed of awesomeness

One of the great tricks of the APL languages: using mmap instead of scanf. Imagine you have some big chunk of data. The dreary way most languages do things, you vacuum the data in with scanf, grab what is useful, and if you’re smart, throw away the useless bits. If you’re dealing with data which is bigger than core, you have to do some complex conga dance, splitting it up into manageable chunks, processing, writing it out somewhere, then vacuuming the result back in again. With mmap, you just point to the data you want. If it’s bigger than memory …. so what? You can get at it as quickly as the file system gets it to you. If it’s an array, you can run regressions on big data without changing any code. That’s how the bigmemory package in R works. Why wasn’t this built into native R from the start? Because programmers don’t learn from the masters. Thanks a lot, Bell Labs!

Fred Brooks, Larry Breed, Joey Tuttle, Arthur Whitney, Eugene McDonnell, Paul Berry: none of these men can be held responsible for inflicting the horrors of S+ on the world

This also makes timeseries databases simple. Mmap each column to a file; selects and joins are done along pointed indexes. Use a file for each column to save memory when you read the columns; usually you only need one or a couple of them. Most databases force you to read all the columns. When you get your data and close the files, the data image is still there. Fast, simple and with a little bit of socket work, infinitely scalable.  Sure, it’s not concurrent, and it’s not an RDBMS (though both can be added relatively simply). So what? Big data problems are almost all inherently columnar and non-concurrent; RDBMS and concurrency should be an afterthought when dealing with data which is actually big, and, frankly, in general. “Advanced” databases such as Amazon’s Redshift (which is pretty good shit for something which came out a few months ago) are only catching onto these 80s era ideas now.

Crap like Hive spends half its time reading the damn data in, using some godforsaken text format that is not a mmaped file. Hive wastefully writes intermediate files, and doesn’t use a column approach, forcing giant unnecessary disk reads. Hive also spends its time dealing with multithreaded locking horse shit. APL uses one thread per CPU, which is how sane people do things. Why have multiple threads tripping all over each other when a query is inherently one job? If you’re querying 1, 10 or 100 terabytes, do you really want to load new data into the schema while you’re doing this? No, you don’t. If you have new data streaming in, save it somewhere else, and do that save in its own CPU and process if it is important. Upload to the main store later, when you’re not querying the data. The way Q does it.

The APL family also has a near-perfect level of abstraction for data science. Function composition is trivial, and powerful paradigms and function modifications via adverbs are available to make code terse. You can afflict yourself with for loops if that makes you feel better, but the terse code will run faster. APL languages are also interactive and interpreted: mandatory for dealing with data. Because APL languages are designed to fit data problems, and because they were among the first interpreters, there is little overhead to slow them down. As a result, J or Q code is not only interactive: it’s also really damn fast.

It seems bizarre that all of this has been forgotten, except for a few old guys, deep pocketed quants, and historical spelunkers such as myself. People painfully recreate the past, and occasionally, agonizingly, come to solutions established 40 years ago. I suppose one of the reasons things might have happened this way is the old masters didn’t leave behind obvious clues, beyond, “here’s my code.” They left behind technical papers and software, but people often don’t understand the whys of the software until they run into similar problems.

Some of these guys are still around. You can actually have a conversation with mighty pioneers like Roger Hui, Allen Rose or Rohan J (maybe in the comments) if you are so inclined. They’re nice people, and they’re willing to show you the way. Data science types and programmers wanting to improve their craft and increase the power of their creations should examine the works of these masters. You’re going to learn more from studying a language such as J than you will studying the latest Hadoop design atrocity. I’m not the only one who thinks so; Wes McKinney of Pandas fame is studying J and Q for guidance on his latest invention. If you know J or Q, he might hire you. He’s not the only one. If “big data” lives up to its promise, you’re going to have a real edge knowing about the masters.

Start here for more information on the wonders of J.

http://conceptualorigami.blogspot.com/2010/12/vector-processing-languages-future-of.html

A look at the J language: the fine line between genius and insanity

Posted in J, tools by Scott Locklin on September 18, 2012

I’ve been looking for a decent TSDB for years now. Took a shot at writing one in Lush using HDF5 (as others have done), but the experiments I did raised more questions than I got answers. I’m sure it can be done; I’m also sure it will be a compromise and source of endless suffering. Since many people are using Q/KDB+ to store order book and tick data, I figured I’d have a look at the APL family of languages, in case the same trick is possible elsewhere. I’ve fiddled in Q before; it’s pretty good, and the APL-ness doesn’t scare me. Anyone who has fiddled with functional programming can do things in Q. The problem is, the price is not right for me at this stage.

The first APL I looked at was A+, by complete accident, because I’m writing some funny stuff for Taki on some self-regarding numskulls who call themselves by the same name. A+ is a venerable language, apparently still used at Morgan Stanley. It seems to be old school APL, using the wacky character set and everything. There’s something to be said for the wacky character set, but I don’t want to have to memorize keystrokes or deal with weird fonts, so I quickly moved on.

Second one I looked at was Kona, which is a copy of the K3 language (an earlier version of the thing that Kx systems KDB+ is based on; they’re now up to K4). The C source code for this is intensely beautiful, and very concise. Go look at it! Reading his source will make you a better person, even if you don’t understand what is going on. I was hoping it would have some doodads built in it, and that I could recycle work done in K3 into K4/Q/KDB+, but it’s not there yet.

A wise old futures trader (who has been using old school APL since the punched card era) told me about J and JDB. J is an ancestor of K. It’s now up to J7. JDB is a columnar database written in J; pretty much a free version of KDB+. I was expecting the usual dead language experience, but found a small, friendly and patient community of very smart people. Not the mixture of programmy smartness and smarm you’ll find in Lisp-land. These guys are math smart, stats smart, programmy smart, and just plain smart smart! I guess writing code that looks like line noise means, you have to be smart. They’re also extremely helpful.  I mean, I asked a fairly n00b question, and got this in response. They didn’t just feed me a rote answer; they gave a damn if I understood what’s going on, and what the right way to do it was.  No tinfoil helmeted one-true wayism involved.  These fellas have a good tool; if you’re interested, they’ll show you how to use it. If you’re not, well, that’s OK too. They certainly seem to have a sense of humour; self-deprecation is a nice break from the galloping narcissism of many programmer communities.

Community is important. Ecosystem is more important. What did I find there? The first thing I noticed is the installation package; it’s a shar archive. That caused me to cock an eyebrow: old school. The second thing I found is I could not use my beloved emacs in an easy way with the latest version of J (you can with the last version). On the other hand, they wrote a very good interactive development environment; jgtk; it has all the standard knobs and buzzers; CPAN-like package installer, debugger, source control hooks, console, project manager; the works. They also included an excellent tool (JHS) for running J things in your browser. Why would you want to do that? Well, to go through examples in the  excellent tutorials, wiki examples and labs that it links to. My first impressions are, both the browser and GTK IDEs are very good. I’m not used to them yet, but they are very well thought out. Most such things are thrown together in a slapdash way, and have obvious flaws on initial inspection. This one has no obvious flaws.

Why no big  flaws? I’m guessing because this is a language that demands attention and mindfulness. Every character carries a lot of meaning. A line of J could replace a page of just about anything else. In any other language, your brain ain’t working half the time; instantiating things, making iterators go, building brain dead switch statements, dealing with preposterous function call overhead or declarations, writing dumb helper patterns that are a tiny variation on something you have done 100 times before. With J you have to pay attention at all times. The line-noise look of the language makes you think better. I’m hoping it also makes you more productive; I figure a page of code a day is a decent amount of output; a page of J will do a lot of useful work.

The language: I don’t know it very well yet, so I can’t say anything too clever. It is definitely a data oriented language, assuming your data fits into an array. It has boxed cells for things that are not arrays. Since it is an array language, it has built-in sparse arrays, which are a big help for serious numerics work. It also does OO type things using namespaces when you need that sort of thing. The “lists” are vectors, though since it’s an array language, you shouldn’t miss linked lists much. One thing about J and the APL languages which is fairly different: it is structured like a spoken language (at least, like an Indo-European language). It has verbs (more or less like functions), nouns (data), adverbs (things which change functions) and compound verbs. The verbs can function on things which come before or after, more or less like real-language verbs. Loops are very much deprecated: noun and verb rank dictate what happens when you want to “apply a function to many things,” and you can modify what happens by using conjunctions. This sounds trivial, but it’s not. This means you can use the same code on things of radically different “shape.”
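
A small example of what that buys you (a throwaway session of mine): the square verb, unchanged, runs on a scalar, a vector and a table:

   *: 3
9
   *: 1 2 3
1 4 9
   *: i. 2 2
0 1
4 9
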
Learning: the best quick intro I’ve seen so far is the primer. Deeper (I’m only halfway through myself) is J for C Programmers. Reading and altering code didn’t work for me out of the gates, as there are no familiar landmarks to the syntax, which really does look like line noise. It’s pretty easy to put bits of it together, once you know the basics.

One thing which isn’t well documented in the learning process: foreigns and global settings. For example, 9!:3 ] 2 5 has found its way into my defaults, as it helpfully prints out a sort of graphical s-expression of verb expansions (you can do it in tree or parenthesis format as well, but this “boxed” format makes the most sense to me). “Foreigns” like this are useful and it’s an appealing way of controlling things, much more so than using R’s options settings. Intuitive? No. Neither are R’s bewildering options, which are a continual source of misery. There are all manner of neat things accessible in this way. For example, want to know what the pool allocator is doing? Type 7!:3 '' . Want to know how much memory an object p uses? Type 7!:5 <'p' . It’s not obvious at first, but it is documented and helpful.

One thing I found perplexing: the special forms used to define verbs. Why is:

myverb=. 3 : 0   (stuff) equivalent to

myverb=. verb define (stuff)

I guess it saves some typing, and you do get used to it, but that’s just WEIRD. I’m assuming this is historical stuff, programming IBM 360 registers directly or something, the way cdr/car used to mean something physical on the computing machine. There’s lots of tricks like that; it would be nice if they were all documented in one handy place, with the more conversational alternatives.
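
For what it’s worth, you can convince yourself the two spellings are the same thing with a throwaway definition (mean1 and mean2 are names I made up):

   mean1=: 3 : '(+/ y) % # y'
   mean2=: verb define
(+/ y) % # y
)
   mean1 1 2 3 4
2.5
   mean2 1 2 3 4
2.5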

Language feel: J is metal. It’s spare and powerful, and for an interpreted language, it feels very close to the hardware. It’s small enough to understand the intestines, and most of it seems to be very lean and speedy. Memory management is malloc/free,  reference counting for arrays and a pool for smaller objects; it works very well and thus far I haven’t been able to make it burp or run out of memory embarrassingly, even when working with data much larger than the memory on my laptop. I’m pretty good at running out of memory in R or Lisp; I’ll probably figure out a way to do so in J eventually, but so far, so good. Of course, J is excellent at dealing with large amounts of data on the disc without thinking about it too much; something most languages suck at. The FFI also looks dirt simple, and it seems to have decent facilities for calling foreign libraries.

Packages: there is a good set of packages. It isn’t anywhere near what CPAN or CRAN is, but it’s got some helpful tools in it, and they’re easily installed using the package installer in the IDE. One of note is the plot package, which produces output as nice or nicer than what R does. Examples from the plot demo below. I’m told the plot package is useful enough that it is used as an adjunct to Q, which lacks such a thing. There are several other plotting packages with different capabilities, though I can’t see myself exploring them much.

Other useful stuff I’ve explored a bit: an excellent profiler (load 'jpm'), the aforementioned data/JDB columnar database, a decent date/time class in types/datetime, various other database interfaces, some primitive optimization routines in math/deoptim, math/fftw for Fourier transforms, lapack (not supported on 64 bit apparently), tools for talking to R in stats/r, some basic statistical distributions in stats/distrib, a lint system, and the excellent set of “labs” designed to help the user learn about J, generally while teaching some interesting piece of math.

Will I actually use this thing to solve useful problems? Hard to say at present; I’m having fun with it for now.  The potential killer app which could keep me in J-land is the JDB database. I haven’t developed a test script for it yet to really put it through its paces, nor do I have a big machine capable of acting as a ticker plant, but early experiments are encouraging, and such things will eventually be explored more fully. It probably doesn’t offer any significant performance advantages over a home made HDF5 type thing, and probably even has drawbacks on very large data. On the other hand, most of the hard work is done, and that counts for a lot. I probably won’t be doing things like reaching for J to write new kinds of Hidden Markov models (R has more doodads for that), but I might use it to code up a Kalman filter or two.  Certainly, finding new ways of using it to talk to R will be mandatory (calling J from R, rather than the other way around seems useful).  If you’re interested in numerics or different programming paradigms, it is worth a look. There is a reason it has lasted as long as it has. It is really a shame things like Matlab ended up taking over this problem space; the APL family is a much more elegant solution to this sort of problem. Yes, J is kind of bonkers, but it’s a good kind of bonkers. Even if I never use it, J is a fascinating view into how a very smart group of folks solve hard problems.

Cool things to look at:

http://www.jsoftware.com/papers/elegant2.htm

J dudes like puzzles.

Lots of helpful articles at the British APL association’s publication Vector

All kinds of educational J/mathematics essays