Locklin on science

Predicting with confidence: the best machine learning idea you never heard of

Posted in machine learning by Scott Locklin on December 5, 2016

One of the disadvantages of machine learning as a discipline is the lack of reasonable confidence intervals on a given prediction. There are all kinds of reasons you might want such a thing, but I think machine learning and data science practitioners are so drunk with their newfound powers, they forget where such a thing might be useful. If you're really confident, for example, that someone will click on an ad, you probably want to serve one that pays a nice click-through rate. If you have some kind of gambling engine, you want to bet more money on the predictions you are more confident of. Or if you're diagnosing an illness in a patient, it would be awfully nice to be able to tell the patient how certain you are of the diagnosis, and what the confidence in the prognosis is.

There are various ad hoc ways that people do this sort of thing. The one you run into most often is some variation on cross validation, which produces an average confidence interval. I've always found this to be dissatisfying (as are PAC approaches). Some people fiddle with their learners in hopes of making the prediction normally distributed, then build confidence intervals from that (or, for the classification version, Platt scaling using logistic regression). There are a number of ad hoc ways of generating confidence intervals using resampling methods and generating a distribution of predictions. You're kind of hosed, though, if your prediction is in online mode. Some people build learners that they hope will produce a sort of estimate of the conditional probability distribution of the forecast; aka quantile regression forests and friends. If you're a Bayesian, or use a model with confidence intervals baked in, you may be in pretty good shape. But let's face it, Bayesian techniques assume your prior is correct, and that new points are drawn from your prior. If your prior is wrong, so are your confidence intervals, and you have no way of knowing this. Same story with heteroscedasticity. Wouldn't it be nice to have some tool to tell you how uncertain your prediction is when you're not certain of your priors, or your algorithm for that matter?

 

Mondrian painting

Well, it turns out, humanity possesses such a tool, but you probably don’t know about it. I’ve known about this trick for a few years now, through my studies of online and compression based learning as a general subject. It is a good and useful bag of tricks, and it verifies many of the “seat of the pants” insights I’ve had in attempting to build ad-hoc confidence intervals in my own predictions for commercial projects.  I’ve been telling anyone who listens for years that this stuff is the future, and it seems like people are finally catching on. Ryan Tibshirani, who I assume is the son of the more famous Tibshirani, has published a neat R package on the topic along with colleagues at CMU. There is one other R package out there and one in python. There are several books published in the last two years. I’ll do my part in bringing this basket of ideas to a more general audience, presumably of practitioners, but academics not in the know should also pay attention.

The name of this basket of ideas is "conformal prediction." The provenance of the ideas is quite interesting, and should induce people to pay attention. Vladimir Vovk is a former Kolmogorov student who has had all kinds of cool ideas over the years. Glenn Shafer is also well known for his co-development of Dempster-Shafer theory, a brewing alternative to standard measure-theoretic probability theory which is quite useful in sensor fusion and, I think, some machine learning frameworks. Alexander Gammerman is a former physicist from Leningrad who, like Shafer, has done quite a bit of work in the past with Bayesian belief networks. Just to reiterate who these guys are: Vovk and Shafer have also previously developed a probability theory based on game theory, which has ended up being very influential in machine learning pertaining to sequence prediction. To invent one new form of probability theory is clever. Two is just showing off! The conformal prediction framework comes from deep results in probability theory, and is inspired by Kolmogorov and Martin-Löf's ideas on algorithmic complexity theory.

Mondrian painting

The advantages of conformal prediction are manifold. These ideas assume very little about the thing you are trying to forecast, the tool you're using to forecast, or how the world works, and they still produce a pretty good confidence interval. Even if you're an unrepentant Bayesian, using some of the machinery of conformal prediction, you can tell when things have gone wrong with your prior. The learners work online, and, with some modifications and considerations, in batch learning. One of the nice things about calculating confidence intervals as part of your learning process is that they can actually lower error rates, or be used in semi-supervised learning as well. Honestly, I think this is the best bag of tricks since boosting; everyone should know about and use these ideas.

The essential idea is that a "conformity function" exists. Effectively you are constructing a sort of multivariate cumulative distribution function for your machine learning gizmo using the conformity function. Such CDFs exist for classical stuff like ARIMA and linear regression under the correct circumstances; CP brings the idea to machine learning in general, and to models like ARIMA when the standard parametric confidence intervals won't work. Within the framework, the conformity function, whatever it may be, when used correctly can be guaranteed to give confidence intervals to within a probabilistic tolerance. The original proofs and treatments of conformal prediction, defined for sequences, are extremely computationally inefficient. The conditions can be relaxed in many cases, and the conformity function is in principle arbitrary, though good ones will produce narrower confidence regions. Somewhat confusingly, these good conformity functions are referred to as "efficient," though they may not be computationally efficient.

Mondrian, Composition II in Red, Blue, and Yellow

The original research and proofs were done on so-called “transductive conformal prediction.” I’ll sketch this out below.

Suppose you have a data set Z := z_1,...,z_N, with z_i := (x_i,y_i), where x_i has the usual meaning of a feature vector, and y_i the variable to be predicted. If the N! different possible orderings are equally likely, the data set Z is exchangeable. For the purposes of this argument, most data sets are exchangeable or can be made so. Call a multiset of points drawn from Z with replacement a "bag" B.

The conformal predictor is \Gamma^{\epsilon}(Z,x) := \{y | p^{y} > \epsilon \}, where Z is the training set, x is a test object, and \epsilon \in (0,1) is a chosen significance level (so the confidence in a prediction is 1-\epsilon). To build it, we need a nonconformity function A(B,z_i), which measures how different a point z_i is from the bag B.

Example: If we have a forecast technique which works on exchangeable data, \phi(B) , then a very simple function is the distance between the new point and the forecast based on the bag set. A(B,z_i):=d(\phi(B), z_i)  .

Simplifying the notation a little bit, let's call A_i := A(B^{-i},z_i), where B^{-i} is the bag, missing z_i. Since exchangeability makes every ordering of Z equally likely, the p-value p^{y} can be defined from the nonconformity measures: p^{y} := \frac{\#\{i=1,...,n | A_i \geq A_n \} }{n}. This can be proved in a fairly straightforward way. You can find the proof in any of the books and most of the tutorials.
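To make the counting formula above concrete, here is a tiny sketch in python. The helper function is my own illustration, not from any conformal prediction package; it assumes some nonconformity function has already produced the scores.

```python
import numpy as np

def conformal_p_value(scores):
    """p-value for the candidate point, given nonconformity scores.

    `scores` holds A_1, ..., A_n, with scores[-1] = A_n the score of the
    candidate point; the p-value is the fraction of scores at least as
    large as A_n, exactly the counting formula above.
    """
    return np.sum(scores >= scores[-1]) / len(scores)

# The candidate's score (0.9) is the largest of the five, so only one
# score is >= it, and the p-value is 1/5: this y would be excluded from
# the prediction region at any significance level above 0.2.
print(conformal_p_value(np.array([0.10, 0.20, 0.15, 0.05, 0.90])))  # 0.2
```

A well-conforming candidate gets a large p-value and stays inside the prediction region; an outlandish one gets a small p-value and is excluded.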

Practically speaking, this kind of transductive prediction is computationally prohibitive, and not how most practitioners confront the world. Practical people use inductive prediction, where we use training examples and then see how they do on a test set. I won't go through the general framework for this, at least this time around; go read the book or one of the tutorials listed below. For what it is worth, one of the forms of Inductive Conformal Prediction is called Mondrian Conformal Prediction, a framework which allows for different error rates for different categories; hence all the Mondrian paintings I decorated this blog post with.

Mondrian tree painting

For many forms of inductive CP, the main trick is that you must subdivide your training set into two pieces. One piece you use to train your model: the proper training set. The other piece you use to calculate your confidence region: the calibration set. You compute the non-conformity scores on the calibration set, and use them on the predictions generated by the model trained on the proper training set. There are other, blended approaches. Whenever you use sampling or bootstrapping in your prediction algorithm, you have the chance to build a conformal predictor using the parts of the data not used by the base learner in its prediction. So, favorites like Random Forest and Gradient Boosting Machines have potentially computationally efficient conformity measures. There are also flavors using a CV type process, though the proofs seem weaker for these. There are also reasonably computationally efficient Inductive CP measures for KNN, SVM and decision trees. The inductive "split conformal predictor" has an R package associated with it, defined for general regression problems, so it is worth going over in a little bit of detail.
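Before that walkthrough, the out-of-bag idea is worth a sketch. This toy example is my own construction, not any package's implementation: a hand-rolled bagged ensemble of least-squares lines, where each training point's nonconformity score is its residual against the learners that never saw it, so no separate calibration set is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + noise. All names and constants here are illustrative.
x = rng.uniform(0, 10, 200)
y = 2 * x + rng.normal(0, 1, 200)

n, B = len(x), 50
oob_sum = np.zeros(n)     # running sum of out-of-bag predictions
oob_count = np.zeros(n)   # how many learners held each point out
models = []
for _ in range(B):
    idx = rng.integers(0, n, n)             # bootstrap sample (a "bag")
    oob = np.setdiff1d(np.arange(n), idx)   # points this learner never saw
    coef = np.polyfit(x[idx], y[idx], 1)    # fit a line to the bag
    models.append(coef)
    oob_sum[oob] += np.polyval(coef, x[oob])
    oob_count[oob] += 1

# Nonconformity score for each training point: its residual against the
# learners that did NOT train on it -- the bootstrap gives this for free.
scores = np.abs(y - oob_sum / oob_count)
d = np.quantile(scores, 0.9)   # threshold for roughly 90% coverage

# Interval for a new point: ensemble prediction +/- d
x_new = 5.0
y_hat = np.mean([np.polyval(c, x_new) for c in models])
print(y_hat - d, y_hat + d)
```

The same trick is what makes Random Forests attractive here: the out-of-bag machinery already exists, so the conformity scores come almost for free.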
For coverage at level 1-\epsilon, using a prediction algorithm \phi and training data set Z_i, i=1,...,n, randomly split the indices i=1,...,n into two subsets, which, as above, we will call the proper training set and the calibration set: I_1, I_2.

Train the learner using the data in the proper training set I_1:

\phi_{trained}:=\phi(Z_i); i \in I_1 . Then, using the trained learner, find the residuals in the calibration set:

R_i := |Y_i - \phi_{trained}(X_i)|, i \in I_2
d := the k-th smallest value in \{R_i : i \in I_2\}, where
k = \lceil (n/2 + 1)(1-\epsilon) \rceil

The prediction interval for a new point x is [\phi_{trained}(x)-d, \phi_{trained}(x)+d].
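Here is what the recipe above looks like in code: a minimal sketch of the split conformal procedure with ordinary least squares as the base learner. The function and its signature are my own invention, not the R package's API.

```python
import numpy as np

def split_conformal(x, y, fit, predict, x_new, eps=0.1, seed=0):
    """Split conformal interval for x_new at roughly (1 - eps) coverage.

    `fit` and `predict` wrap an arbitrary regression method; nothing here
    is specific to any model. A sketch of the recipe, not a package API.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    perm = rng.permutation(n)
    train, cal = perm[: n // 2], perm[n // 2:]       # I_1 and I_2
    model = fit(x[train], y[train])                  # train on I_1
    resid = np.abs(y[cal] - predict(model, x[cal]))  # residuals on I_2
    k = int(np.ceil((len(cal) + 1) * (1 - eps)))     # rank of the cutoff
    d = np.sort(resid)[min(k, len(cal)) - 1]         # k-th smallest residual
    y_hat = predict(model, x_new)
    return y_hat - d, y_hat + d

# Usage, with plain least squares as the base learner:
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 400)
y = 3 * x + rng.normal(0, 1, 400)
lo, hi = split_conformal(
    x, y,
    fit=lambda xs, ys: np.polyfit(xs, ys, 1),
    predict=np.polyval,
    x_new=4.0,
)
print(lo, hi)
```

Notice the base learner is passed in as a pair of callables: the coverage guarantee doesn't care what model you use, only that the data are exchangeable.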

This type of thing may seem unsatisfying, as technically the bounds only hold for one predicted point. But there are workarounds using leave one out in the ranking. The leave one out version is a little difficult to follow in a lightweight blog, so I'll leave it as an exercise for those who are interested to read more about it in the R documentation for the package.

Conformal prediction is about 10 years old now: still in its infancy. While forecasting with confidence intervals is inherently useful, the applications and extensions of the idea are what really tantalize me about the subject. New forms of feature selection, new forms of loss function which integrate the confidence region, new forms of optimization to deal with conformal loss functions, completely new and different machine learning algorithms, new ways of thinking about data and probabilistic prediction in general. Specific problems which CP has had success with: face recognition, nuclear fusion research, design optimization, anomaly detection, network traffic classification and forecasting, medical diagnosis and prognosis, computer security, chemical properties/activities prediction, and computational geometry. It's probably only been used on a few thousand different data sets. Imagine being at the very beginning of Bayesian data analysis, when things like the expectation maximization algorithm were just being invented, or neural nets before backpropagation: I think this is where the CP basket of ideas is at. It's an exciting field at an exciting time, and while it is quite useful now, all kinds of great new results will come of it.

There is a website and a book. Other papers and books can be found in the usual way. This paper goes with the R package mentioned above, and is particularly clearly written for the split and leave one out conformal prediction flavors. Here is a presentation with some open problems and research directions if you want to get to work on something interesting. Only 19 packages on github so far.

Get your Conformal Predictions here.


The Leduc ramjet

Posted in big machines by Scott Locklin on November 29, 2016

I ran across this gizmo from looking at Yann LeCun’s google plus profile, and wondering at the preposterous gadget sitting next to him at the top. Figuring out what it was, I realized the genius of the metaphor. Yann builds things (convolutional networks and deep learning in general) which might very well be the Leduc ramjets of machine learning or even “AI” if we are lucky. Unmistakably Phrench, as in both French and physics-based, original in conception, and the rough outlines of something that might become common in the future, even if the engineering of the insides eventually changes.


Rene Leduc was working on practical ramjet engines back in the 1930s. His research was interrupted by the war, but he was able to test his first ramjet in flight in 1946. The ramjet seems like a crazy idea for a military aircraft; ramjets don't work until the plane is moving. A ramjet is essentially a tube you squirt fuel into and light on fire. The fire won't propel the engine forward unless there is already a great deal of air passing through. It isn't that crazy if you can give it a good kick to get it into motion. If we stop to think about how practical supersonic aircraft worked from the 1950s on: they used afterburners. Afterburners, to some approximation, operate much like inefficient ramjets; you squirt some extra fuel into the afterburning component of the engine and get a nice increase in thrust. Leduc wasn't the only ramjet guy in town; the idea was in the proverbial air, if not the literal air. Alexander Lippisch (the German designer responsible for the rocket powered Komet, whose delta wing work influenced the F-106 and the B-58 Hustler) had actually sketched a design for a supersonic coal burning interceptor during WW-2, and his engine designer was eventually responsible for a supersonic ramjet built by another French company. The US also attempted a ramjet powered nuclear cruise missile, the SM-64 Navaho, which looks about as bizarre as the Leduc ramjets.

Navaho SM-64

In fact, early naval anti-aircraft missiles such as the RIM-8 Talos used ramjets for later stages as well. The bleeding edge Russian air to air missile, the R-77, also uses ramjets, as does a whole host of extremely effective Russian antiship missiles. Ramjets can do better than rockets for long range missilery, as they are comparably simple, and hydrocarbon ramjets can have longer range than rockets. Sticking a man in a ramjet powered airframe isn't that crazy an idea. It works for missiles.

The Leduc ramjets didn’t quite work as a practical military technology, in part due to aerodynamic problems, in part because they needed turbojets to get off the ground anyway, but they were important in developing further French fighter planes.  They were promising at the time and jam packed with innovative ideas; the first generation of them was much faster in both climb and final speed than contemporary turbojets.


Ultimately, their technology was a dead end, but what fascinates about them is how different, yet familiar, they were. They look like modern aircraft from an alternate steampunk future. Consider a small detail of the airframe, such as the nose. The idea was that a canopy bubble would cause aerodynamic drag. Since ramjets operate best without any internal turbulence, the various scoops and side inlets you see in modern jets were non-starters. So they put the poor pilot in a little tin can in the front of the thing. The result was that the earliest Leduc ramjet (the 0.10) looked like a Flash Gordon spaceship. The pilot was buried somewhere in the intake and had only tiny little portholes for visibility.

Leduc 0.10

Later models incorporated more visibility by making a large plexiglass tube for the pilot to sit in. Get a load of the look of epic Gallic bemusement on the pilot's "avoir du cran" mug:

faire la moue


Leduc 0.22 ramjet

The later model shown above, the Leduc 0.22, actually had a turbojet which got it into the air. It was supposed to hit Mach 2, but never did: in part because the airframe didn't take into account the "area rule" effect which made supersonic flight practical in early aircraft, but also in part because the French government withdrew funding from the project in favor of the legendary Dassault Mirage III, an aircraft so good it is still in service today.

The Leduc designs are tantalizing in that they were almost there. They produced 15,000 lbs of thrust, which was plenty for supersonic flight. A later ramjet fighter design, the Nord Griffon, actually achieved supersonic flight, more or less by using a more conventional looking airframe. Alas, turbojets were ultimately less complex (and less interesting looking), so they ended up ruling the day.


As I keep saying, early technological development and innovative technology often look very interesting indeed. In the early days people take big risks, and don't really know what works yet. Radios from the 1920s are completely fascinating, with doodads sticking out all over the place. Radios in the 50s and 60s, when radio was down to a science, were boring looking (and radios today are invisible). Innovative technologies look a certain way. They look surprising to the eye, because they're actually something new. They look like science fiction because, when you make something new, you're basically taking science fiction and turning it into technology.

Some videos:

My favorite photo of this wacky election

Posted in stats jackass of the month, Uncategorized by Scott Locklin on November 9, 2016

This dope got lucky in 2012, essentially using "take the mean," and was hailed as a prophet. He was wrong about virtually everything, and if someone were to make a table of his predictions over time and calculate Brier scores, I'm pretty sure he'd get a higher score than a Magic 8-Ball (with Brier scores, lower is better). Prediction is difficult, as the sage said, especially regarding the future. Claiming you can do prediction when you can't is irresponsible, and can actually be dangerous.
Nate Silver

While he richly deserves to find his proper station in life as an opinionated taxi driver, this clown is unfortunately likely to be with us for years to come, bringing shame on the profession of quantitative analysis of data. We’ll be watching, Nate.

 

 

The Future ain’t what it used to be

Posted in fun, Progress by Scott Locklin on November 1, 2016

I came across this video recently. It is a think piece by the Ford motor company and a long dead electronics firm called Philco, showing what the future will be like from the perspective of 1967. It's a nice imaginative vista from a time of great technological optimism: they were close to accomplishing the moon shot, and the Mach-3 Boeing SST had only recently been announced. From the perspective of a technologist alive in those days, life could have ended up like this. The set of things people wanted technological solutions for, and the conveniences they thought would be cool, is also interesting. It is kind of sad comparing this bold imagined future (only 32 years away from when the video was made) to our actually existing shabby 1999+17y future. It's 21 minutes, so if you don't have 21 minutes to watch the whole thing, you can read my comments.

The husband of the house is an astrophysicist (working a remote day job on Mars colonization, no less) with a hobby doing … botany. He's got a lab at home and is trying to breed a super peach with a protective tangerine skin. This is wildly unrealistic, even if they had thought of genetic engineering back then, and as far as I know, nobody is breeding crazy fruits today, let alone doing so as a hobby. Obviously nobody is colonizing Mars. Still, food and novelty were apparently considered important in 1967, so it is kind of endearing that they gave the astrophysicist this kind of hobby. Most astrophysicists I know work 80 hour weeks and have hobbies like looking at youtube videos and grousing about funding levels.

home botany experiments


The house of tomorrow has a central computer where all kinds of stuff is stored in its "memory banks." There is really no reason why people distribute their data all over creation the way they do now; the future from 1967 looked a lot more sane and safe in this regard. Memory banks and computers in this video look a lot like the computers, TVs and radios of 1967. They're kind of cool looking, like a big CAMAC crate or IBM mainframe.

memory banks of the future have lots of dip switches

The kid (single child to upper middle class parents; good prediction) seems to be homeschooled by  teaching machines. This is quite technically feasible these days, but not so many people work at home in our shabby future of 2016 that this is done regularly.

home schooling technology


They chat with each other electronically. Their future used a sort of video intercom, which is a lot more interesting than our actual crummy future, where people furiously thumb-type text messages to each other from across the dinner table, rather than video calling from the other room. They also didn’t predict chatroulette.

1967 era instant messaging


Dinner is pre-processed and stored in some kind of central silo which microwaves dinner for everybody, based on their nutritional requirements and how fat they're getting; all done in less than 2 minutes. The upside to our shabby present is that people don't like icky but futuristic seeming TV dinners as much as they did in the 60s. In our shitty future equivalent, I guess, at least in the Bay Area, we have "services" which deliver food to your house unmade, and you have a bonding experience with your significant other following the directions and making the food. Or we just go to the grocery store like they did in 1967. There are probably apps which claim to track calories for people, but in the shitty future now, pretty much everyone is disgustingly fat. Oh yeah, in the future of 1967, dishwashers are obsolete; everyone throws their (seemingly perfectly reusable) plates away. Little did they know in 1967 that landfills would become a political problem.

making dinner using technology


Lots of clothes in the 1967 future will be as disposable as the plates and silverware. The ones that you want to keep are ultrasound dry cleaned using a frightening closet which seems quite exposed to the rest of the house, despite shooting visible clouds of noxious chemicals all over the place. People in the 1967 future weren’t as petrified of chemicals as we are now. Frankly their self cleaning closet gives even me the creeps. I don’t even like using moth balls. Hot link to scary cleaning closet here.

In the 1967 future, the Mrs. of the household can buy stuff “online,” which was a pretty good guess. Of course, their “online” is from some kind of live video feed. The idea of a website (or a mouse or keyboard) hadn’t occurred to them yet. And the bank is also accessible through some other kind of computerized console, as is a “home post office” which I guess was a form of email. Though their email system works in cursive in this example. I am guessing that typewriter style keyboards were seen as a specialized skill in those days, and “push button” was seen as more futuristic.

Amazon shopping in the future


The house is powered by a hydrogen fuel cell for some reason, and "pure water" is a useful byproduct. Maybe in the 1967 future, plumbing will be deprecated. In their 1967-vantage future, despite breeding crazy peaches and eating all their food from the microwave-refrigerator food dispensing machine, they'll get strange undersea fruits from hydro-cultured underwater farms. 1960s futurology was filled with fantasies of growing things under water; science fiction from those days seemed to think we'd all be algae eaters in the future. I was never able to figure that out. I guess humanity obviated this with the green revolution, which was not particularly predictable from those days.

The home gym is fun. It features a medical scanner which scans you while you recline on an Eames style couch and makes exercise suggestions; something that doesn't exist anywhere, and probably never will, despite all the DARPA requests for such a thing. Pretty much the same thing as Star Trek's "sick bay." There's lots of funny old timey exercise equipment in the gym, some of which has made a recent comeback: exercise clubs, gymnastic rings, the chest expander. I don't think they predicted the comeback of such devices: those were probably cutting edge in 1967. Oh yeah, the medical scanner sends data back to the community medical center: HIPAA records apparently don't apply in the 1967 future, as opposed to our present shitty future, because people didn't think of themselves as living in a sinister oligopoly careening towards totalitarianism, as we do now.

gym


In the 1967 future, you video call your far away buddy to make travel plans, just like now on skype. But in the 1967 future you could pick between a golf course in Monterey and one in Mexico City for a casual afternoon of golf, depending on the weather forecast. Because in those days, it seemed inevitable that supersonic or even hypersonic air travel would be cheap and convenient. They had no way of knowing the oil crisis would come, just as they had no way of knowing you'd need to arrive 3 hours early to the airport because of imbecile US foreign policy hubris. Remember, you didn't even need a photo ID to get on a plane until 1999 or so; you could go to the airport with a bundle of cash and fly anywhere you wanted to, just like in 1967. In a later scene in the video, pals from the Philippines and Paris show up for a house party because, again, supersonic (maybe hypersonic) flight is super cheap in the 1967 future.

skype in the future


Hobbies in the future: the lady of the house has a fine arts degree and makes pots at home. I actually know a few people like this, and suspect there were people like this in 1967, but it’s really more of an upper middle class thing than a future thing. It’s arguably more upper middle class now for the missus to work for a non-profit. Video games in 1967 future seemed to be restricted to chess. 1999 shabby future had stuff like Castle Wolfenstein and was legitimately less shitty than the imagined 1967 future. It’s probably better for kids to play computer chess though.

chess

Parties in the 1967 future looked better than modern parties; people dressed stylishly and listened to decent music while having enlightened conversation. This is pretty rare these days, though I suppose people do often have “parties” centered around the TV the way they did.

party!


The 1999 future as envisioned in 1967 seemed like a nice place. Everything is convenient. People spent a lot of time bettering themselves with productive hobbies; making artistic pots and breeding interesting plants when they're not doing a man's work sending people to colonize Mars, or playing duets with their child on a giant synthesizer. Friendships were cultivated all over the world, and travel was trivial and cheap. People in the 1967 envisioned future were apparently very worried about getting fat; I can only speculate that this was an actual concern of 1967, which is probably why everyone looks so slim in those old timey "people watching the moon shot" photos. I'm not sure what happened to that; perhaps cheap insulin has made people worry about it less. People in 1967 were also very concerned with overpopulation and foodstuffs to feed the teeming masses, which is why food came up so much in the video, and why the future family only had one offspring. While the 1967 envisioned future seemed preternaturally clean and environmentally sound, upper middle class neuroses nowadays are more overtly concerned with pollution and environmental issues. I am guessing the household conveniences of disposable dishes, self-cleaning closets and pre-made meals were some technical reflection of the cultural changes between the sexes brewing in the 60s. In 1967 it probably seemed like you could solve these looming cultural upheavals using technology; just give the missus some self-cleaning closets and a machine which does the cooking. I couldn't help but think that the housewife of the future seemed a little bored. Honestly, the whole family seemed pretty spaced out and lost, but perhaps that's because plot, characterization and motivation are not always a priority in industrial videos.

They did guess that computers would be important in the home, which was far from obvious at that point. They also guessed that some kind of networked computer system would be routine, which was a very good guess, as computer networks were entirely military up to that point. Oh yeah, and unlike lots of science fiction movies, the screens of the future were flat, rather than CRT based.

It would be interesting to find a modern "home of the future" video by a modern industrial concern; maybe there is one by Microsoft or Apple. I doubt their future is as interesting and healthy seeming as this one. Perhaps some visionary should attempt it, if only for aspirational purposes.