# Locklin on science

## Notation as a tool for thought: Wavelets in J

Posted in J, statistical tools by Scott Locklin on November 14, 2014

I’ve recently had a need for Wavelets in my work in J. There is no wavelet toolbox in J, so, I wrote my own (to be released to pacman and github, eventually, along with a version of this blog as a “lab”). This exercise was extremely valuable in two ways. One, it improved my J-ing: I am, and will probably always remain, relatively weak of the J sauce, though  this exercise leveled up my game somewhat. Two, it vastly improved my understanding of the discrete wavelet transform algorithm. Arthur Whitney said something to the effect that you can really understand a piece of code when it all fits on one page. Well, it turns out that’s very true.

Wavelets are filters recursively applied to a timeseries. In the DWT, you’re effectively recursively subsampling and filtering a time series with a low pass filter and a high pass filter. The resulting elements are at different levels, with a remaining low pass filtered time series at the end. The operations at each level are the same. This is the well-known “pyramid algorithm” of Stephane Mallat.

In C, the way you would do the individual levels is by allocating vector with half the number of elements, then filling up each element with the correct subsampled filtered values. In J, the way you do this is by projecting the elements and the filters in the correct way, then multiplying them out and getting the answer.  J allocates more memory to do this, but it’s something that could be mitigated if it becomes a problem. As is often the case with J, the runtime of the J version, without spending a moment to optimize for space or time, is quite comparable to that of the C version; for the wavelet transform, about a factor of 2 within the C results. Here’s what a wavelet level calculation looks like in C. I took it from the excellent waveslim package in R, FWIIW:

``````

void dwt(double *Vin, int M, int L, double *h, double *g,
double *Wout, double *Vout)
{

int n, t, u;
for(t = 0; t < M/2; t++) {
u = 2 * t + 1;
Wout[t] = h[0] * Vin[u];
Vout[t] = g[0] * Vin[u];
for(n = 1; n < L; n++) {
u -= 1;
if(u < 0) u = M - 1;
Wout[t] += h[n] * Vin[u];
Vout[t] += g[n] * Vin[u];
}
}
}
``````

J doesn’t have functions; it has verbs. “dyad define” is for defining verbs which operate on  two ‘nouns.’ The first one comes before the verb, the second one comes after the verb. They have implied names x (left noun) and y (right noun) inside the verb. “dyad define” is almost never used by J hackers, as it evaluates to the string ‘4 : 0′ (don’t ask me; you get used to it). Similarly the monad define in the wdict verb evaluates to ‘3 : 0′; and monadic verbs only have y (right) noun inputs. The other thing to remember about J; it evaluates from right to left, like Hebrew. The following verb is for evaluating the compiled C function above, using J’s very cool FFI.

``````LDWT=: '/path/to/dwt.so'   NB. where your compiled C lives
'hpf lpf'=.wdict x
Newy=.((#y)%2)#(2.2-2.2)
Wout=.((#y)%2)#(2.2-2.2)
cmd=. LDWT, ' dwt n *d i i *d *d *d *d'
cmd cd (y+2.2-2.2);(<.(#y));(#hpf);hpf;lpf;Wout;Newy
Wout;Newy
)

``````

For purposes of this exercise, wdict is a simple dictionary that returns the boxed high pass and low pass filter. Boxing is literally the same thing as a struct in C, though in J the use of structs ends up being quite different. To save typing, I know I can derive the HPF from the LPF for a given wavelet type, which I accomplish with the HpLp verb. To get the High Pass filter from Low Pass you multiply the odd elements by -1, then rotate their order. So, in J, remembering J executes from right to left,  a monadic # with a noun to the right is the tally operation: it returns the number of elements. i. generates an integer index, so i.# is a constructor of a sequence on the number of elements in y. [: is the cap which makes the i.# group of verbs into a sequential operation, preventing it from being evaluated as a fork. _1 is -1, * is dyadic times as it is in most programming languages, and |. is the “rotate vector” verb. ;~ does magic which flips this result around, and sticks a box between original low pass value, which is represented by ]

``````
NB. get the high pass from the low pass, return a box list of both
HpLp =: ] ;~ |. * _1 ^ [: i. #
select. y
case. 'haar' do.
HpLp 2 # %: % ] 2
case. 'db4' do.
HpLp 0.482962913144534, 0.836516303737808, 0.224143868042013,  _0.12940952255126
end.
)
``````

If you compile the C piece and run it with the J FFI on a 2^n long vector, you’ll get the right answer. But the J implementation is cooler, and runs within a factor of 4 of the C.

``````
oddx=: ] {~ ([: (] #~ 0 1 \$~ #) [: i. [: # ]) -/ [: i. [: # [
'hpf lpf'=.wdict x
yvals=. hpf oddx y
(yvals +/ . * hpf);yvals +/ . * lpf
)
``````

Since we used “dyad define” -we know that it takes a left and right noun argument (x and y in the verb). The way this one is used looks like this:

``````
'db4'dwt myts
``````

Same calling convention and comments apply to the C version. The real action of the dwt verb happens with oddx. oddx is a tacit verb. I wrote it this way for conciseness and speed. It’s not a very good tacit verb, because I am not a very good slinger of J sentences, but it does what I want it to. Unless someone smarter comes along and gives me a better idea, that’s what we’re stuck with. I am not going to give you a breakdown of how it works inside, but I will show you what it does, and how it helps encapsulate a lot of information into a very small place. oddx is a dyad; it takes x and y values. In tacit programming, the x’s and y’s are represented by [ and ] respectively, with [: as described above, as a sort of spacer between chunks. I will resist explaining all the pieces; feel free to examine it in J 8’s excellent sentence dissector. I have already explained the i. verb. The values 0:15 are created by i.16. So, a useful way to examine what this very does is to run it on something like i.16.

``````  (1,2)oddx i.16
1 0
3 2
5 4
7 6
``````

Huh. So, it’s taking the y input, which is 0:15, and making two columns, which seem to be the odd and even sequence of the first 4 elements… You might guess that it is taking the size of the x noun and operating on the y input. You would guess correctly. Let’s try it on something bigger:

``````
(1,2,3,4)oddx i.16
1  0 15 14
3  2  1  0
5  4  3  2
7  6  5  4
9  8  7  6
11 10  9  8
13 12 11 10
15 14 13 12
``````

Aha, now it is starting to make sense. Let’s try something bigger still;

``````
(i.6)oddx i.16
1  0 15 14 13 12
3  2  1  0 15 14
5  4  3  2  1  0
7  6  5  4  3  2
9  8  7  6  5  4
11 10  9  8  7  6
13 12 11 10  9  8
15 14 13 12 11 10
``````

Now you can see what it does. The columns alternate between odd and even values of y. The rows are the rotated versions of the first two. The more values you add, the deeper the rotation becomes. Why did I do this? Well, a wavelet is these very index elements, multiplied by the high pass and low pass filters. If you look at the last two lines of the dwt verb, you can see it:

``````
yvals=. hpf oddx y
(yvals +/ . * hpf);yvals +/ . * lpf
``````

yvals contains the rotated elements. The next line contains two pieces. The  one on the right is the rotated yvalues array multiplied (+/ . * is matrix multiplication) by the low pass filter. The invocation of oddx happens with hpf, but it could take lpf; it is just using the size of the filter to construct the rotated matrix for multiplication.  This is the wavelet decimated y, which you need to reconstruct the time series.  The next one is the rotated yvalues multiplied by the high pass filter. This is the level 1 wavelet. They are separated by a ‘;’ which puts them in “boxed” form (remember: typeless structs), so the decimated yvalues and the level 1 wavelet are separated. There is probably a way of making this a one liner using forks and such, but at my present state of enlightenment, it’s not obvious, and this suits me just fine for clarity and runtime sake. Now, almost nobody cares about the level1 wavelets; usually you want a couple of levels of wavelets, so you need a way to recursively apply this guy to the results. I used the power conjunction.  Raul Miller fixed up my messy verb into a very clear adverb on the J-programming list.

``````
:
'wn yn'=. m dwt y
wn; (<:x) (m dwtL)^:(x>1) yn
)
``````

You’d call it like this

2 ‘db4′ dwtL  y

Adverbs  return verbs based on modifiers. In this case the adverb returns a dyadic verb modified by m, which in this case is the kind of wavelet.  dwtL is a dyad.  Kind is the kind of wavelet (‘db4′ and ‘haar’ are the only ones defined in wdict monad above). You can see dwt being used to calculate the decimated yn and wn wavelet. The tricky piece is the last line. This is the power conjunction; the ^:(lev>1) piece. dyadic ‘>’ does what it does in regular programming languages (in this case anyway). It checks if lev is > 1, and the ^: power verb calls dwtL recursively as long as this is true. More or less, this is a while loop. At the front of the dwtL call, you can see wn being prepended to the output with a ; to box it. You can also see a (<:lev). This is a decrement operator on the noun ‘lev.’ So, the level is called until it is at 1. The wavelet levels are prepended, and you end up with a boxed set of all the wavelets from 1:Level, with the decimated y stuck at the end. Once you grasp how the power conjunction works, this is a really obvious and sweet implementation of Mallat’s “pyramid algorithm” for calculating the levels of wavelets.

Taking the inverse wavelet transform is conceptually more difficult, since you’re stitching together the values of y with the different levels of wavelet.  There is more than one way to skin this cat, but the way I was thinking when I wrote these verbs was “how do I go from the smaller/subsampled values to the next level up in y in the reverse pyramid algorithm?”

First, I needed to rearrange my mother wavelets to do the inverse. I needed a verb like this:

``````
filtrot =: [: |: [: |. (2 ,~ 2 %~ [: # [) \$ [: |. ]
``````

You can see what it does if you apply it to i.4 or i.8

``````
filtrot i.4
1 3
0 2
filtrot i.8
1 3 5 7
0 2 4 6
``````

Next, I need a verb which rearranges the values of the input such that it is aware of the shape of the rotated mother wavelet pattern. This was hairy, but I did it with the following verb, which takes the rotated wavelet as the x value, and the yval or wavelet level as the y value:

` drot =:   ] {~ ([: i. [: # ]) +/ 0 , (_1 * [: # ]) + [: }. [: i. [: # [: {. [ `

You can see how it works like this

``````
(filtrot i.4) drot i.8

0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 0

(filtrot i.6) drot i.8
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 0
7 0 1
``````

As you can see, the pattern is to rotate the original values in the opposite direction as you did with the wavelet transform.

Finally, I need to double these guys up, and element-wise multiply them to the correct number of rotated wavelets. For the double up part, ‘2 #’ is the correct verb. Notice it is # called dyadically. Dyadic overloadings of monadic verbs is confusing at first, but you get used to it.

``````
2 #   2 # (filtrot i.4) drot i.8
0 1
0 1
1 2
1 2
2 3
2 3
3 4
3 4
4 5
4 5
5 6
5 6
6 7
6 7
7 0
7 0
``````

Putting it all together with the element wise multiplication, we use the verb “reducer.” I’m getting tired of explaining what all the pieces do, so I’ll write it in explicit to tacit (13: gives explicit to tacit) form to show all the implied x’s and y’s, and point you to the documents for \$/shape:

``````
13 : ' (2 # (x drot y)) * ((2*#y)\$x)'
reducer=: (2 # drot) * [ \$~ 2 * [: # ]
``````

Finally combining into the following:

``````
'wv yl' =. y
'hpf lpf'=. filtrot each wdict x
+/"1 (lpf reducer yl) + hpf reducer wv
)
``````

So, the inverse transform runs the reducer on the wavelet, the decimated y, and then adds them together.

The real sweet part is running the inverse on all the levels. When I first encountered the / adverb, I thought it was sort of like “apply.” This is wrong; what it does exactly is stick a copy of the verb which comes before it in between all the values of what comes after it. Since you’ve stacked all these wavelets together in a convenient way, you can use the / adverb to stick the inverse transform operator between all the wavelets. You need a few extra pieces to make it evaluate properly, but since J evaluates from right to left, and the decimated values are on the far left; this is  a very satisfying inverse.

``````
idw =. [: < [: x&idwt ,
> idw/y
)
``````

I’m sure dwtL and idwtL  could be compressed into a one liner, and it could be read by someone familiar enough with tacit J programming. But this is a good level of abstraction for me. It’s remarkable that the performance is as good as it is, as I basically just wrote down the first verbs which occurred to me (the inverse is fairly inefficient: OCaML speed instead of C speed). I’ve used wavelets for years; they’re very handy things for signal processing, and I have even written code to calculate them before. Writing them down in J really brought my understanding of what they are to a new level. In an ideal world, people would write algorithms in this compressed form, so that mathematicians, engineers and computer scientists could communicate ideas more clearly and concisely. Alas, Iverson’s dream of notation as a tool of thought hasn’t yet caught on. Maybe some day. While something like J is confusing to people who have been programming for years, there is nothing inherently difficult about it. Various educators have been teaching J to kids in grammar school. It’s actually very simple compared to something like C, and since it has many similarities to ordinary written language, for kids it might even be simpler. Maybe some day.

## Optalysys and Optical computing

Posted in non-standard computer architectures by Scott Locklin on August 11, 2014

Years and years ago, when I was writing up my dissertation and thinking about what I was going to do with my life, I gave some thought to non von-Neumann computing architectures. This was more or less the dawn of the quantum computing/quantum information era as a field, when it turned from an obscure idea to a sort of cottage industry one could get tenure in.  I had the idea in my mind that the Grover Algorithm could be done on a classical optical architecture, and wrote many equations in a fat notebook I now keep in my kitchen between a cookbook and a book on Cold War Submarines.  I suppose had I written some more equations and discovered I wasn’t full of baloney, I might have made a nice academic career for myself as a designer of novel computing architectures, rather than the underemployed constructor of mathematical models for businesses and sporadic blogger I have turned into. Be that as it may, my erstwhile interests have equipped me with some modest background in how the future gizmos might work.

While quantum computing appears to be turning into a multi-decade over-hyped boondoggle of a field, there is no reason why optical computers might not become important in the near future. Years before my attempted non von-Neumann punt into the intellectual void, my old boss Dave Snoke handed me a book called “Fundamentals of Photonics” which implied optical computers might one day be very important. The book contained a few chapters sketching outlines of how future photonic computers might work. Armed with this knowledge, and the fact that a new startup called Optalysys is claiming practical optical computers are right around the corner, perhaps these endless hours of wasted time can be converted into a useful blog post.

The idea for optical computing has been around for decades. Some trace it back to Zernike’s phase contrast microscopes. Certainly by the 60s, Vander Lugt [pdf link] and company were doing thinking about optics in terms of signal processing and computation. It began to be a real technology in the 70s with the invention of “spatial light modulators” (SLM) of various kinds. Historical SLMs have been all kinds of weird things you run into in the optics world; Pockels cells, phototitus converters, Kerr rotation; but the main technology used  is the familiar LCD. Interestingly, some MIT researchers have recently come up with LCDs that are capable of very fast switching. This could be the innovation which makes optical computing real.

Things Optalysis isn’t: This  certainly isn’t an “all optical computer.” The “all optical computer” is sort of the philosopher’s stone of the optical computing world. If they accomplished this, they’d certainly claim it, and people like me would be plagued with doubt. It is also not any kind of “quantum computer” for the same reasons. In fact, they more or less say it isn’t even a digital computer.

What Optalysis is: From the vague descriptions on their website, this is a  standard analog optical computing architecture consisting of a high resolution detector,  at least one SLM, a lens and/or mirror or two, and an array of laser diodes. Presumably there has been some improvement which has made this a practicable commercial item; the MIT fast LCD might eventually be one of them. This should be seen as a sort of “math co-processor,” rather than a general purpose computer, though it does some important math.

What can it do? Matrix  math, basically. Even thinking about this in the most hand-wavey manner, optical computers should be good at  image processing. When you break down what is involved in image processing, matrix math and Fourier transforms come to mind. We know how to do matrix math with optical computers. The limiting aspect is how many elements you can encode and modulate quickly. An optical matrix multiplication doodad will do it in $O(const)$ time, up to the limiting size of the matrices that can be encoded in such a machine. This is huge, as most matrix multiplications are $O(N^3)$ for square matrices of size N. There is all manner of work done on improving matrix multiplication and other forms of matrix math via statistical methods (subject of a coming blog post); the problem is very important, as  $O( N^3 )$ is pretty bad on large problems. People who have studied the history of computer science know how important the  Fast Fourier Transform was. It reduced the 1-d Fourier transform from $O(N^2)$ operations to $O(N log(N))$, opening up vast new applications to signal processing techniques. Optical computing can in principle do Fourier transforms in $O(const)$ [pdf link ] up to some maximum encoding size. Even better, it can do the 2-dimensional version in the same $O(const)$ time. Even the FFT is, of course $O(N^2 log(N))$ for two dimensional problems, so this is a potentially significant speedup over conventional computing.

Most of programmers probably have glazed eyes at this point. Who cares, right? You care more than you think. Many of the elementary operations  used by programmers have something matrix related at their cores. Pagerank is an algorithm that has become essential to modern life. It is effectively a very efficient way of finding the most important eigenvector of a very large matrix in $O(N)$ rather than the $O(N^3 )$ it takes to find all the eigenvalues and then find the biggest one. It is an underappreciated fact that Brin and Page turned a nice thesis topic in linear algebra into one of the great companies of the world, but this sort of thing demonstrates why linear algebra and matrix math are so important. Other database search algorithms are related to Singular Value Decomposition (SVD).  SVD is also used in many scalable forms of data mining and machine learning, and I’m pretty sure it works faster on an optical computer than on an ordinary one.  There are also pattern recognition techniques that only make sense on optical computers. Since nobody sells optical computers, it is not clear where they might be useful, but they might very well be helpful to certain kinds of problem; certainly pattern matching on images could use a speedup. The company is shooting for the notoriously computationally intractable field of computational fluid dynamics (CFD) first. CFD involves a lot of matrix math, including 2-d FTs in some realizations.

Scale: right now, Optalysis is claiming a fairly humble 40Gflops in their prototype, which is far from exceptional (laptops do about the same in a much more general way). I think a lot of people pooped their pants at the “eventual exabyte processor” claims in their press release. In principle they can scale this technology, but of course, a lot of things are possible “in principle.” The practical details are the most important thing. Their technology demonstrator is supposed to have 500×500 pixels in the SLM element. This is way too small for important problems; they might even be using an overhead projector LCD for the prototype doodad (the 20Hz frame rate kind of implies this might be the case): this is quite possible to do if you know some optics and have a good detector. How to scale a 500×500 element device to something on an exabyte scale is far from clear, but the most obvious way is to add more and finer elements, and run the thing faster. A factor of 2.5 million or more increase sounds challenging, but increasing the frame rate by a factor of a thousand and the number of elements by a factor of 1000 seems like something that actually could be done on a desktop as they claim.

Problems to look out for:

1. LCDs: I believe the LCD technology is fast enough to be useful, but lifespan, quality and reliability will be limiting factors. If the fast LCD gizmo only lasts a few million cycles of shining lasers through it, this will be a problem. If the pixels randomly burn out or die, well, you have to know about this, and the more that die, the worse the calculation will be. I guess at this point, I also believe it makes small enough pixels to be useful. On the other hand, if they are using an overhead projector LCD and have no plans to radically upgrade the number of pixels in some way, and the completed machine is fairly large, it probably won’t be improving to Exascales any time soon.
2. Detectors: I’m guessing there haven’t been optical computers yet mostly because the output detectors have been too noisy  for low power lasers. This is a guess; I’m not that interested in such things to investigate deeply, but it is a pretty good guess based on experience working with things optical. So, if the output detector system isn’t very good, this won’t work very well. It will be very different from writing code on a floating point processor. For one thing, you’ll be using a lot fewer than 64 bits per number. I’m guessing, based on the fast detectors used in streak cameras, this will be something more like 16 bits; maybe fewer. This is fine for a lot of physical problems, assuming it works at 16 bits resolution on a fast time scale. If you have to sit around and integrate for a long time to encode more than 16 bits (or, say, increase the laser power used for more dynamic range), well, it might not work so well.
3. Data loading and memory: this thing could do very fast Fourier transforms and array math even if the data load operation is relatively slow, and the “registers” are fairly narrow. They are  only claiming a frame rate of 20Hz on their prototype, as I said above. On the other hand, if it’s going to do something like a pentabyte pagerank type thing, loading data quickly and being able to load lots of data is going to be really important. It’s not clear to me how this sort of thing will scale in general, or how to deal with values that are inherently more like 64 bits than 16 bits. If the frame rate increases by a factor of a million, which looks possible with the MIT breakthrough, assuming their detectors are up to it, a lot of things become possible.
4. Generality: analog computers have been used for “big O” difficult calculations until relatively recently; often on much the same kinds of CFD problems as Optalysis hopes to compete in. I suppose they still are in some fields, using various kinds of ASIC, and FPAAs. The new IBM neural net chip is a recent example if I understand it correctly. The problem with them is the fact that they are very difficult to program. If this thing isn’t generally good at matrix mathematics at least, or requires some very expensive “regular computer” operations to do general computing, you might not be gaining much.

I’m hoping this actually takes off, because I’d love to sling code for such a machine.

Decent history of optical computing:

http://www.hindawi.com/journals/aot/2010/372652/

Their cool video, Narrated by Heinz Wolff:

Kinds of hard problems this thing might work well on:

http://view.eecs.berkeley.edu/wiki/Dwarfs

One of their patents (I was unable to find any last night for some reason), which includes commentary on the utility of this technique in problems with fractional derivatives. Fractional derivatives are the bane of finite element analysis; I ran into this most recently trying to help a friend do microwave imaging on human brains. I’ll add some more commentary as I go through the patent later this evening.

http://www.freepatentsonline.com/8610839.html

## Neglected machine learning ideas

Posted in machine learning, statistical tools, tools by Scott Locklin on July 22, 2014

This post is inspired by the “metacademy” suggestions for “leveling up your machine learning.” They make some halfway decent suggestions for beginners.  The problem is, these suggestions won’t give you a view of machine learning as a field; they’ll only teach you about the subjects of interest to authors of machine learning books, which is different. The level-3 and level-4 suggestions they make are not super useful either: they just reflect the tastes of the author.

The machine learning literature is vast, techniques are bewilderingly diverse, multidisciplinary and seemingly unrelated. It is extremely difficult to know what is important and useful. While “metacademy” has the horse sense to suggest reading some books, the problem is, there is no book which can even give you a survey of what is available, or make you aware of things which might be helpful. The best guide for the perplexed, in my not at all humble opinion, is Peter Flach’s introductory text, “Machine Learning: the Art and Science of Algorithms that Make Sense of Data” which at least mentions some of the more obscure techniques, and makes pointers to other resources. Most books are just a collection of the popular techniques. They all mention regression models, logistic regression, neural nets, trees, ensemble methods, graphical models and SVM type things. Most of the time, they don’t even bother telling you what each technique is actually good for, and when you should choose one over the other for an approach (Flach does; that’s one of many reasons you should read his book). Sometimes I am definitely just whining that people don’t pay enough attention to the things I find interesting, or that I don’t have a good book or review article on the topic. Sleep deprivation will do that to a man. Sometimes I am probably putting together things that have no clearly unifying feature, perhaps because they’re “not done yet.” I figure that’s OK, subjects such as “deep learning” are also a bunch of ideas that have no real unifying theme and aren’t done yet; this doesn’t stop people from writing good treatments of the subject. Perhaps my list is a “send me review articles and book suggestions” cry for help, but perhaps it is useful to others as an overview of neat things.

Stuff I think is egregiously neglected in books, and in academia in unranked semi-clustered listing below:

Online learning: not the “Khan academy” kind, the “exposing your learners to data, one piece at a time, the way the human  brain works” kind. This is hugely important for “big data” and timeseries, but there are precious few ML texts which go beyond mentioning the existence of online learning in passing. Almost all textbooks concentrate on batch learning. Realistically, when you’re dealing with timeseries or very large data sets, you’re probably doing things online in some sense. If you’re not thinking about how you’re exposing your learners to sequentially generated data, you’re probably leaving information on the table, or overfitting to irrelevant data. I can think of zero books which are actually helpful here. Cesa-Bianchi and Lugosi wrote a very interesting book on some recent proofs for online learners and “universal prediction” which strike me as being of extreme importance, though this is a presentation of new ideas, rather than an exposition of established ones.  Vowpal Wabbit is a useful and interesting piece of software with OK documentation, but there should be a book which takes you from online versions of linear regression (they exist! I can show you one!)  to something like Vowpal Wabbit. Such a book does not exist. Hell, I am at a loss to think of a decent review article, and the subject is unfortunately un-googleable, thanks to the hype over the BFD of “watching lectures and taking tests over the freaking internets.” Please correct me if I am wrong: I’d love to have a good review article on the subject for my own purposes.

Reinforcement learning: a form of online learning which has become a field unto its own. One of the great triumphs of machine learning is teaching computers to win at Backgammon. This was done via a form of reinforcement learning known as TD-learning. Reinforcement learning is a large field, as it has been used with great success in control systems theory and robotics. The problem is, the guys who do reinforcement learning are generally in control systems theory and robotics, making the literature impenetrable to machine learning researchers and engineers. Something oriented towards non robotics problems would be nice (Sutton and Barto doesn’t suffice here; Norvig’s chapter is the best general treatment I have thus far seen). There are papers on applications of the idea to ideas which do not involve robots, but none which unify the ideas into something comprehensible and utile to a ML engineer.

“Compression” sequence prediction techniques: this is another form of online learning, though it can also be done in batch mode. We’re all familiar with this; when google tries to guess what you’re going to search for, it is using a primitive form of this called the Trie. Such ideas are related to standard compression techniques like LZW, and have deep roots in information theory and signal processing. Really, Claude Shannon wrote the first iterations of this idea. I can’t give you a good reference for this subject in general, though Ron Begleiter and friends wrote a very good paper on some classical compression learning implementations and their uses. I wrote an R wrapper for their Java lib if you want to fool around with their tool. Boris Ryabko and son have also written numerous interesting papers on the subject. Complearn is a presumably useful library which encapsulates some of these ideas, and is available everywhere Linux is sold. Some day I’ll expound on these ideas in more detail.

Time series oriented techniques in general: a large fraction of  industry applications have a time component. Even in marketing problems dealing with survival techniques, there is a time component, and you should know about it.In situations where there are non-linear relationships in the time series, classical regression and time-series techniques will fail. In situations where you must discover the underlying non-linear model yourself, well, you’re in deep shit if you don’t know some time-series oriented machine learning techniques.  There was much work done in the 80s and 90s on tools like recurrent ANNs and feedforward ANNs for starters, and there has been much work in this line since then. There are plenty of other useful tools and techniques.  Once in a while someone will mention dynamic time warping in a book, but nobody seems real happy about this technique.  Many books mention Hidden Markov Models, which are important, but they’re only useful when the data is at least semi-Markov, and you have some idea of how to characterize it as a sequence of well defined states. Even in this case, I daresay not even the natural language recognition textbooks are real helpful (though Rabiner and Juang is OK, it’s also over 20 years old). Similarly, there are no review papers  treating this as a general problem. I guess we TS guys are too busy racking in the lindens to write one.

Conformal prediction: I will be surprised if anyone reading this has even heard of conformal prediction. There are no wikipedia entries. There is a website and a book. The concept is simple: it would be nice to well motivated put error bars on a machine learning prediction. If you read the basic books, stuff like k-fold cross validation and the jackknife  trick are the entire story. OK, WTF do I do when my training is online? What do I do in the presence of different kinds of noise? Conformal prediction is a step towards this, and hopefully a theory of machine learning confidence intervals in general. It seems to mostly be the work of a small group of researchers who were influenced by Kolomogorov, but others are catching on. I’m interested. Not interested enough to write one, as of yet, but I’d sure like to play with one.

ML in the presence of lots of noise: The closest thing to a book on it is the bizarro (and awesomely cool) “Pattern Theory: The Stochastic Analysis of Real World Signals” by Mumford and Desolneux, or perhaps something in the corpus of speech recognition and image processing books. This isn’t exactly a cookbook or exposition, mind you: more of a thematic manifesto with a few applications.  Obviously, signal processing has something to say about the subject, but what about learners which are designed to function usefully when we know that most of the data is noise?  Fields such as natural language processing and image processing are effectively ML in the presence of lots of noise and confounding signal, but the solutions you will find in their textbooks are specifically oriented to the problems at hand.  Once in a while something like vector quantization will be reused across fields, but it would be nice if we had an “elements of statistical learning in the presence of lots of noise” type book or review paper. Missing in action, and other than the specific subfields mentioned above, there are no research groups which study the problem as an engineering subject. New stuff is happening all the time; part of the success of “Deep Learning” is attributable to the Drop Out technique to prevent overfitting. Random forests could be seen as a technique which at genuflects at  “ML in the presence of noise” without worrying about it too much. Marketing guys are definitely thinking about this. I know for a fact that there are very powerful learners for picking signal out of shitloads of noise: I’ve written some. It would have been a lot easier if somebody wrote a  review paper on the topic. The available knowledge can certainly be systematized and popularized better than it has been.

Feature engineering: feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. Sometimes the features are obvious; sometimes not. Much of the success of machine learning is actually success in engineering features that a learner can understand. I daresay document classification would be awfully difficult without td-idf representation of document features. Latent Dirichlet allocation is a form of “graphical model” which works wonders on such data, but it wouldn’t do a thing without td-idf. [correction to this statement from Brendan below] Similarly, image processing has a bewildering variety of feature extraction algorithms which are of towering importance for that field; the SIFT descriptor, the GIST and HOG descriptors, the Hough transform, vector quantization, tangent distance [pdf link]. The Winner Take All hash [pdf link] is an extremely simple and related idea… it makes a man wonder if such ideas could be used in higher (or lower) dimensions. Most of these engineered features are histograms in some sense, but just saying “use a histogram” isn’t helpful. A review article or a book chapter on this sort of thing, thinking through the relationships of these ideas, and helping the practitioner to engineer new kinds of feature for broad problems would be great. Until then, it falls to the practitioner to figure all this crap out all by their lonesome.

Unsupervised and semi-supervised learning in general: almost all books, and even tools like R inherently assume that you are doing supervised learning, or else you’re doing something real simple, like hierarchical clustering, kmeans or PCA.  In the presence of a good set of features, or an interesting set of data, unsupervised techniques can be very helpful. Such techniques may be crucial. They may even help you to engineer new features, or at least reduce the dimensionality of your data. Many interesting data sets are only possible to analyze using semi-supervised techniques; recommendation engines being an obvious beneficiary of such tricks. “Deep learning” is also connected with unsupervised and semi-supervised approaches. I am pretty sure the genomics community does a lot of work with this sort of thing for dimensionality reduction. Supposedly Symbolic Regression (generalized additive models picked using genetic algorithms) is pretty cool too, and it’s in my org-emacs TODO lists to look at this more. Lots of good unsupervised techniques such as Kohonen Self Organizing Maps have fallen by the wayside. They’re still useful: I use them. I’d love a book or review article which concentrates on the topic, or just provides a bestiary of things which are broadly unsupervised. I suppose Oliver Chapelle’s book is an OK start for semi-supervised ideas, but again, not real unified or complete.

Images by one of my heroes, the Ukrainian-American artist Boris Artzybasheff. You can find more of it here.

## Can the Su-25 intercept and shoot down a 777?

Posted in big machines, War nerding by Scott Locklin on July 21, 2014

Personal background: I’ve flown Malaysian Airlines and declare it better and more civilized than any US airline. I’ve been to Ukraine on a business-vacation. I’m sympathetic to the aspirations of the long suffering Ukrainian people. I’m also sympathetic to the position of the Russian government with respect to Ukraine, which is, after all, sort of like their version of Canada, if Canada had annexed part of New England in 1991. I am not sympathetic to the claque of sinister war mongers and imperial Gauleiters in the US State department with respect to their activities in Ukraine and towards Russia. If I had my way, creeps like Vicky “fuck the EU” Nuland and Geoff Pyatt would be facing prison and the firing squad for what they’ve done over there. In my opinion, US policy towards Russia since the fall of the Soviet Union has been knavish, evil and disgusting. My opinion isn’t a mere slavophilic eccentricity; George Kennan, our greatest Cold War diplomat, said more or less the same things before he died.

If this was a shoot down by Donetsk separatists, and even if the Russians supplied the missiles to the separatists (who could have captured them from Ukrainian forces, or simply borrowed a couple from the local arms factories), this doesn’t make the Russians culpable for the tragedy. By that logic, the US is responsible for all the bad things done with weapons it supplies to its proxies, such as ISIS in Syria and Iraq, which is arguably worse. Certainly the US is responsible for the escalation of the situation in Ukraine. I say all this, because passions are high, and the war drums are beating. I am not a  war monger, or apologist for anybody; in fact, I’m the closest thing you’re going to get to an unbiased observer in this disaster. I have no horse in this race. I wish they’d all learn to get along.

So, the Rooskies are now implying that a Ukrainian Su-25 may have shot down flight MH17. Facts and objective reality seem to be in short supply in Western coverage of the Ukraine crisis; I aim to supply some. I am going with the assumption that the Rooskies are telling the truth, and that there was indeed a Ukrainian Su-25 where they said there was. They said the Su-25 came within 2 to 3 miles of the 777.

Everyone agrees that the Boeing 777-200ER was flying over the separatist region at 33,000 feet. A Boeing 777’s cruising speed is about 560mph or Mach 0.84. Its mass is about 500,000 pounds, and it has a wingspan and length of about 200 feet each. The MH17 was flying from West to East, more or less.

The Su-25 Frogfoot is a ground attack aircraft; a modern Sturmovik or, if you like, a Rooskie version of the A-10 Warthog. The wingspan and length of the Su-25 is about 50 feet each, and the mass is about 38,000lbs with a combat load. The ceiling of an unladen Su-25 is about 23,000 feet. With full combat load, an Su-25 can only make it to 16,000 feet. This low combat ceiling was actually a problem in the Soviet-Afghanistan war; the hot air and the tall mountains made it less useful than it could have been. At altitude, the maximum speed of the unladen Su-25 is Mach 0.82; probably considerably lower with combat loads. For air to air armament, it has a pair of 30mm cannons and carries the R-60 missile. The Su-25 is also capable of carrying the Kh-13, though it is not clear that the Ukrainians deploy this missile on their Su-25s. For the sake of argument, we’ll talk about it anyway.

Since it was a Ukrainian Su-25, we can also assume it was heading West to East; more or less the same trajectory as flight MH17. It could have been traveling in some other trajectory, but we can already see the problem with an Su-25 intercepting a 777; it’s too low, and too slow. If you want to believe  the crackpot idea that Ukrainian government were a bunch of sinister schemers who shot down MH17 on purpose, an Su-25 is pretty much the worst armed military aircraft you can imagine for such a task. The Ukrainian air force has a dozen Su-27s and two-dozen Mig-29s perfectly capable of intercepting and shooting down a 777. They also have the Buk missile, and are  capable of placing it somewhere near the Donetsk separatists if they wanted to make them look bad. So, the theory that the evil Ukrainians shot down a 777 with a Su-25 on purpose is … extremely unlikely.

Could an Su-25 have shot down a 777 by accident? Fog of war and all that? Perhaps they thought it was a Russian  plane? Well, let’s see how likely that is. The weapons of the Su-25 capable of doing this are the cannons, the R-60 missile (and its later evolutions, such as the R-73E) and the  K-13 missile.

Cannons: impossible. The Su-25 was at minimum 10,000 feet below the 777. This means simply pointing the cannon at the 777 without stalling would have been a challenge. The ballistic trajectory of the cannon fire would have made this worse. The Gsh-30-2 cannon fires a round which travels at only 2800 feet per second, significantly lower than, say, the round fired by a  338 Lapua sniper rifle. Imagine trying to shoot down an airplane with a rifle, from 2-3 miles away using your eyeball, in a plane, at a ballistic angle. If the MH17 was somehow taken out by cannon fire, it will have obvious 30mm holes in the fuselage. None have been spotted so far.

K-13 missile: extremely unlikely. The K-13 is a Soviet copy of the 50s era AIM-9 sidewinder; an infrared homing missile. Amusingly, the Soviets obtained the AIM-9 design during a skirmish between China and Taiwan in 1958; a dud got stuck in a Mig-17. It is not clear that the Ukrainian air force fields these weapons with their Su-25’s; they’re out of date, and mostly considered useless. Worse, the effective range of a K-13 is only about 1.2 miles, putting the 777 out of effective range. Sure, a K-13 miiiight have made it to a big lumbering 777 with its two big, hot turbofans, but it seems pretty unlikely; a lucky shot. The 16lb of the K-13 warhead is certainly capable of doing harm to a 777’s engines. Maybe it would have even taken out the whole airliner. Doubtful though.

The K-13 AA missile

R-60 missile: extremely unlikely. If a Su-25 was firing missiles at a 777, this is probably what it was using. The R-60 is also an IR guided missile, though some of the later models use radar proximity fuzing.  Unlike the K-13, this is a modern missile, and it is more likely to  have hit its target if fired. Why is it unlikely? Well, first off, it is unlikely the Ukrainian Su-25s were armed with them in the first place: these are ground attack planes, fighting in a region where the enemy has no aircraft. More importantly, the R-60 has a tiny little 6lb warhead, which is only really dangerous to fragile fighter aircraft. In 1988, an R-60 was fired at a BAe-125 in Botswana. The BAe-125 being a sort of Limey Lear jet, which weighs a mere 25,000lbs; this aircraft is 20 times smaller than a 777 by mass. The BAe-125 was inconvenienced by the R-60, which knocked one of its engines off, but it wasn’t shot down; it landed without further incident. A 777 is vastly larger and more sturdy than any Limey Lear jet. People may recall the KAL007 incident where an airliner was shot down by a Soviet interceptor. The Su-15 flagon interceptor which accomplished this used a brobdingnagian K-8 missile, with an 88lb warhead, which was designed to take out large aircraft. Not a shrimpy little R-60. The R-60 is such a pipsqueak of a missile, it is referred to as the “aphid.”

The R-60 aphid

That’s it; those are the only tools available to the Su-25 for air to air combat. The other available  weapons are bombs and air to surface missiles, which are even more incapable of shooting down anything which is  10,000 feet above the Su-25.

My guess as to what happened … somebody … probably the Donetsk separatists (the least experienced, least well trained, and least well plugged into a military information network), fired a surface to air missile at something they thought was an enemy plane. It could have been the Buk SA-11/17 with its 150lb warhead and 75,000 foot range, just like everyone is reporting. Another candidate is the Kub SAM, which is an underrated SAM platform also in use in that part of the world. Yet another possibility is the S-125 Pechora, which isn’t deployed in Ukraine or Russia, but it is probably still manufactured in the Donbass region. A less likely candidate is the S-75 Dvina (the same thing that took out Gary Powers), though the primitive guidance system and probable lack of deployed installations in Ukraine and Russia make this unlikely. The fact that the MH17 disappeared from radar at 33,000 feet, and the condition of the wreckage indicates it was something really big that hit flight MH17; not a piddly little aphid missile. The pictures of the wreckage don’t indicate any sort of little missile strike which might have knocked off an engine; it looks like the whole plane was shredded. Both engines came down in the same area, more or less in one piece.

Whatever it was, it wasn’t an Su-25. There is also no use going all “Guns of August” on the Russians over something that was very likely beyond their control. Here’s hoping all parties concerned learn to resolve their differences in a civilized manner.

Interesting links from the rumor mill (as they come in):

http://theaviationist.com/2014/07/21/su-27s-escorted-mh17/

Update July 22:
Nobody else has yet noticed that Donetsk manufactures SAMs, or that there are several other potential sources and varieties of such weapons. The Russians are sticking with the Su-25 idea, and haven’t corroborated the Su-27 story, making it seem much less likely.

“Blame the Rooskie” war mongers would do well to remember the Vincennes incident, where the US shot down an Iranian air liner over Iranian airspace, killing a comparable number of innocent civilians.

Update July 23:
A run down of some of the capabilities of the Buk system from “The National Interest” (one of the few sane US foreign policy periodicals):