# Locklin on science

## Predicting with confidence: the best machine learning idea you never heard of

Posted in machine learning by Scott Locklin on December 5, 2016

One of the disadvantages of machine learning as a discipline is the lack of reasonable confidence intervals on a given prediction. There are all kinds of reasons you might want such a thing, but I think machine learning and data science practitioners are so drunk with newfound powers, they forget where such a thing might be useful. If you’re really confident, for example, that someone will click on an ad, you probably want to serve one that pays a nice click through rate. If you have some kind of gambling engine, you want to bet more money on the predictions you are more confident of. Or if you’re diagnosing an illness in a patient, it would be awfully nice to be able to tell the patient how certain you are of the diagnosis and what the confidence in the prognosis is.

There are various ad hoc ways that people do this sort of thing.  The one you run into most often is some variation on cross validation, which produces an average confidence interval. I’ve always found this to be dissatisfying (as are PAC approaches). Some people fiddle with their learners and in hopes of making sure the prediction is normally distributed, then build confidence intervals from that (or for the classification version, Platt scaling using logistic regression).  There are a number of ad hoc ways of generating confidence intervals using resampling methods and generating a distribution of predictions. You’re kind of hosed though, if your prediction is in online mode.  Some people build learners that they hope will produce a sort of estimate of the conditional probability distribution of the forecast; aka quantile regression forests and friends. If you’re a Bayesian, or use a model with confidence intervals baked in, you may be in pretty good shape. But let’s face it, Bayesian techniques assume your prior is correct, and that new points are drawn from your prior. If your prior is wrong, so are your confidence intervals, and you have no way of knowing this.  Same story with heteroscedasticity. Wouldn’t it be nice to have some tool to tell you how uncertain your prediction when you’re not certain of your priors or your algorithm, for that matter?

Well, it turns out, humanity possesses such a tool, but you probably don’t know about it. I’ve known about this trick for a few years now, through my studies of online and compression based learning as a general subject. It is a good and useful bag of tricks, and it verifies many of the “seat of the pants” insights I’ve had in attempting to build ad-hoc confidence intervals in my own predictions for commercial projects.  I’ve been telling anyone who listens for years that this stuff is the future, and it seems like people are finally catching on. Ryan Tibshirani, who I assume is the son of the more famous Tibshirani, has published a neat R package on the topic along with colleagues at CMU. There is one other R package out there and one in python. There are several books published in the last two years. I’ll do my part in bringing this basket of ideas to a more general audience, presumably of practitioners, but academics not in the know should also pay attention.

The name of this basket of ideas is “conformal prediction.” The provenance of the ideas is quite interesting, and should induce people to pay attention. Vladimir Vovk is a former Kolmogorov student, who has had all kind of cool ideas over the years. Glenn Shafer is also well known for his co-development of Dempster-Shafer theory, which is a brewing alternative to standard measure-theoretic probability theory which is quite useful in sensor fusion, and I think some machine learning frameworks. Alexander Gammerman is a former physicist from Leningrad, who, like Shafer, has done quite a bit of work in the past with Bayesian belief networks. Just to reiterate who these guys are: Vovk and Shafer have also previously developed a probability theory based on game theory which has ended up being very influential in machine learning pertaining to sequence prediction. To invent one new form of probability theory is clever. Two is just showing off! The conformal prediction framework comes from deep results in probability theory and is inspired by Kolmogorov and Martin-Lof’s ideas on algorithmic complexity theory.

The advantages of conformal prediction are many fold. These ideas assume very little about the thing you are trying to forecast, the tool you’re using to forecast or how the world works, and they still produce a pretty good confidence interval. Even if you’re an unrepentant Bayesian, using some of the machinery of conformal prediction, you can tell when things have gone wrong with your prior. The learners work online, and with some modifications and considerations, with batch learning. One of the nice things about calculating confidence intervals as a part of your learning process is they can actually lower error rates or use in semi-supervised learning as well. Honestly, I think this is the best bag of tricks since boosting; everyone should know about and use these ideas.

The essential idea is that a “conformity function” exists. Effectively you are constructing a sort of multivariate cumulative distribution function for your machine learning gizmo using the conformity function. Such CDFs exist for classical stuff like ARIMA and linear regression under the correct circumstances; CP brings the idea to machine learning in general, and to models like ARIMA  when the standard parametric confidence intervals won’t work. Within the framework, the conformity function, whatever may be, when used correctly can be guaranteed to give confidence intervals to within a probabilistic tolerance. The original proofs and treatments of conformal prediction, defined for sequences, is extremely computationally inefficient. The conditions can be relaxed in many cases, and the conformity function is in principle arbitrary, though good ones will produce narrower confidence regions. Somewhat confusingly, these good conformity functions are referred to as “efficient” -though they may not be computationally efficient.

The original research and proofs were done on so-called “transductive conformal prediction.” I’ll sketch this out below.

Suppose you have a data set $Z:= z_1,...,z_N$, with $z_i:=(x_i,y_i)$ where $x_i$ has the usual meaning of a feature vector, and $y_i$ the variable to be predicted. If the $N!$ different possible orderings are equally likely, the data set $Z$ is exchangeable. For the purposes of this argument, most data sets are exchangeable or can be made so. Call the set of all bags of points from $Z$ with replacement a “bag” $B$.

The conformal predictor $\Gamma^{\epsilon}(Z,x) := \{y | y^{p} > \epsilon \}$ where $Z$ is the training set and $x$ is a test object and $\epsilon \in (0,1)$ is a defined probability of confidence in a prediction. If we have a function $A(B,z_i)$ which measures how different a point $z_i$ is the bag set of $B$.

Example: If we have a forecast technique which works on exchangeable data, $\phi(B)$, then a very simple function is the distance between the new point and the forecast based on the bag set. $A(B,z_i):=d(\phi(B), z_i)$.

Simplifying the notation a little bit, let’s call $A_i := A(B^{-i},z_i)$ where $B^{-i}$ is the bag set, missing $z_i$. Remembering that bag sets $B$ are sets of all the orderings of $Z$ we can see that our $p^y$ can be defined from the nonconformity measures; $p^{y} := \frac{\#\{i=1,...,n|A_i \geq A_n \} }{n}$ This can be proved in a fairly straightforward way. You can find it in any of the books and most of the tutorials.

Practically speaking, this kind of transductive prediction is computationally prohibitive and not how most practitioners confront the world. Practical people use inductive prediction, where we use training examples and then see how they do in a test set. I won’t go through the general framework for this, at least this time around; go read the book or one of the tutorials listed below. For one it is worth, one of the forms of Inductive Conformal Prediction is called Mondrian Conformal Prediction; a framework which allows for different error rates for different categories, hence all the Mondrian paintings I decorated this blog post with.

For many forms of inductive CP, the main trick is you must subdivide your training set into two pieces. One piece you use to train your model, the proper training set. The other piece you use to calculate your confidence region, the calibration set. You compute the non-conformity scores on the calibration set, and use them on the predictions generated by the proper training set. There are
other blended approaches. Whenever you use sampling or bootstrapping in  your prediction algorithm, you have the chance to build a conformal predictor using the parts of the data not used in the prediction in the base learner. So, favorites like Random Forest and Gradient Boosting Machines have computationally potentially efficient conformity measures. There are also flavors using a CV type process, though the proofs seem more weak for these. There are also reasonably computationally efficient Inductive CP measures for KNN, SVM and decision trees. The inductive “split conformal predictor” has an R package associated with it defined for general regression problems, so it is worth going over in a little bit of detail.
For coverage at $\epsilon$ confidence, using a prediction algorithm $\phi$ and training data set $Z_i,i=1,...,n$, randomly split the index $i=1,...,n$ into two subsets, which as above, we will call the proper training set and the calibration set $I_1,I_2$.

Train the learner using data on the proper training set $I_1$

$\phi_{trained}:=\phi(Z_i); i \in I_1$. Then, using the trained learner, find the residuals in the calibration set:

$R_i := |Y_i - \phi(X_i)|, i \in I_2$
$d :=$ the $k$th smallest value in $\{R_i :i \in I_2\}$ where
$k=(n/2 + 1)(1-\epsilon)$

The prediction interval for a new point $x$ is $\phi(x)-d,\phi(x)+d$

This type of thing may seem unsatisfying, as technically the bounds on it only exist for one predicted point. But there are workarounds using leave one out in the ranking. The leave one out version is a little difficult to follow in a lightweight blog, so I’ll leave it up as an exercise for those who are interested to read more about it in the R documentation for the package.

Conformal prediction is about 10 years old now: still in its infancy.  While forecasting with confidence intervals is inherently useful, the applications and extensions of the idea are what really tantalizes me about the subject. New forms of feature selection, new forms of loss function which integrate the confidence region, new forms of optimization to deal with conformal loss functions, completely new and different machine learning algorithms, new ways of thinking about data and probabilistic prediction in general. Specific problems which CP has had success with; face recognition, nuclear fusion research, design optimization, anomaly detection, network traffic classification and forecasting, medical diagnosis and prognosis, computer security, chemical properties/activities prediction and computational geometry. It’s probably only been used on a few thousand different data sets. Imagine being at the very beginning of Bayesian data analysis where things like the expectation maximization algorithm are just being invented, or neural nets before backpropagation: I think this is where the CP basket of ideas is at.  It’s an exciting field at an exciting time, and while it is quite useful now, all kinds of great new results will come of it.

There is a website and a book. Other papers and books can be found in the usual way. This paper goes with the R package mentioned above, and is particularly clearly written for the split and leave one out conformal prediction flavors. Here is a presentation with some open problems and research directions if you want to get to work on something interesting. Only 19 packages on github so far.

Get your Conformal Predictions here.

## Neglected machine learning ideas

Posted in machine learning, statistical tools, tools by Scott Locklin on July 22, 2014

This post is inspired by the “metacademy” suggestions for “leveling up your machine learning.” They make some halfway decent suggestions for beginners.  The problem is, these suggestions won’t give you a view of machine learning as a field; they’ll only teach you about the subjects of interest to authors of machine learning books, which is different. The level-3 and level-4 suggestions they make are not super useful either: they just reflect the tastes of the author.

The machine learning literature is vast, techniques are bewilderingly diverse, multidisciplinary and seemingly unrelated. It is extremely difficult to know what is important and useful. While “metacademy” has the horse sense to suggest reading some books, the problem is, there is no book which can even give you a survey of what is available, or make you aware of things which might be helpful. The best guide for the perplexed, in my not at all humble opinion, is Peter Flach’s introductory text, “Machine Learning: the Art and Science of Algorithms that Make Sense of Data” which at least mentions some of the more obscure techniques, and makes pointers to other resources. Most books are just a collection of the popular techniques. They all mention regression models, logistic regression, neural nets, trees, ensemble methods, graphical models and SVM type things. Most of the time, they don’t even bother telling you what each technique is actually good for, and when you should choose one over the other for an approach (Flach does; that’s one of many reasons you should read his book). Sometimes I am definitely just whining that people don’t pay enough attention to the things I find interesting, or that I don’t have a good book or review article on the topic. Sleep deprivation will do that to a man. Sometimes I am probably putting together things that have no clearly unifying feature, perhaps because they’re “not done yet.” I figure that’s OK, subjects such as “deep learning” are also a bunch of ideas that have no real unifying theme and aren’t done yet; this doesn’t stop people from writing good treatments of the subject. Perhaps my list is a “send me review articles and book suggestions” cry for help, but perhaps it is useful to others as an overview of neat things.

Stuff I think is egregiously neglected in books, and in academia in unranked semi-clustered listing below:

Online learning: not the “Khan academy” kind, the “exposing your learners to data, one piece at a time, the way the human  brain works” kind. This is hugely important for “big data” and timeseries, but there are precious few ML texts which go beyond mentioning the existence of online learning in passing. Almost all textbooks concentrate on batch learning. Realistically, when you’re dealing with timeseries or very large data sets, you’re probably doing things online in some sense. If you’re not thinking about how you’re exposing your learners to sequentially generated data, you’re probably leaving information on the table, or overfitting to irrelevant data. I can think of zero books which are actually helpful here. Cesa-Bianchi and Lugosi wrote a very interesting book on some recent proofs for online learners and “universal prediction” which strike me as being of extreme importance, though this is a presentation of new ideas, rather than an exposition of established ones.  Vowpal Wabbit is a useful and interesting piece of software with OK documentation, but there should be a book which takes you from online versions of linear regression (they exist! I can show you one!)  to something like Vowpal Wabbit. Such a book does not exist. Hell, I am at a loss to think of a decent review article, and the subject is unfortunately un-googleable, thanks to the hype over the BFD of “watching lectures and taking tests over the freaking internets.” Please correct me if I am wrong: I’d love to have a good review article on the subject for my own purposes.

Reinforcement learning: a form of online learning which has become a field unto its own. One of the great triumphs of machine learning is teaching computers to win at Backgammon. This was done via a form of reinforcement learning known as TD-learning. Reinforcement learning is a large field, as it has been used with great success in control systems theory and robotics. The problem is, the guys who do reinforcement learning are generally in control systems theory and robotics, making the literature impenetrable to machine learning researchers and engineers. Something oriented towards non robotics problems would be nice (Sutton and Barto doesn’t suffice here; Norvig’s chapter is the best general treatment I have thus far seen). There are papers on applications of the idea to ideas which do not involve robots, but none which unify the ideas into something comprehensible and utile to a ML engineer.

“Compression” sequence prediction techniques: this is another form of online learning, though it can also be done in batch mode. We’re all familiar with this; when google tries to guess what you’re going to search for, it is using a primitive form of this called the Trie. Such ideas are related to standard compression techniques like LZW, and have deep roots in information theory and signal processing. Really, Claude Shannon wrote the first iterations of this idea. I can’t give you a good reference for this subject in general, though Ron Begleiter and friends wrote a very good paper on some classical compression learning implementations and their uses. I wrote an R wrapper for their Java lib if you want to fool around with their tool. Boris Ryabko and son have also written numerous interesting papers on the subject. Complearn is a presumably useful library which encapsulates some of these ideas, and is available everywhere Linux is sold. Some day I’ll expound on these ideas in more detail.

Time series oriented techniques in general: a large fraction of  industry applications have a time component. Even in marketing problems dealing with survival techniques, there is a time component, and you should know about it.In situations where there are non-linear relationships in the time series, classical regression and time-series techniques will fail. In situations where you must discover the underlying non-linear model yourself, well, you’re in deep shit if you don’t know some time-series oriented machine learning techniques.  There was much work done in the 80s and 90s on tools like recurrent ANNs and feedforward ANNs for starters, and there has been much work in this line since then. There are plenty of other useful tools and techniques.  Once in a while someone will mention dynamic time warping in a book, but nobody seems real happy about this technique.  Many books mention Hidden Markov Models, which are important, but they’re only useful when the data is at least semi-Markov, and you have some idea of how to characterize it as a sequence of well defined states. Even in this case, I daresay not even the natural language recognition textbooks are real helpful (though Rabiner and Juang is OK, it’s also over 20 years old). Similarly, there are no review papers  treating this as a general problem. I guess we TS guys are too busy racking in the lindens to write one.

Conformal prediction: I will be surprised if anyone reading this has even heard of conformal prediction. There are no wikipedia entries. There is a website and a book. The concept is simple: it would be nice to well motivated put error bars on a machine learning prediction. If you read the basic books, stuff like k-fold cross validation and the jackknife  trick are the entire story. OK, WTF do I do when my training is online? What do I do in the presence of different kinds of noise? Conformal prediction is a step towards this, and hopefully a theory of machine learning confidence intervals in general. It seems to mostly be the work of a small group of researchers who were influenced by Kolomogorov, but others are catching on. I’m interested. Not interested enough to write one, as of yet, but I’d sure like to play with one.

ML in the presence of lots of noise: The closest thing to a book on it is the bizarro (and awesomely cool) “Pattern Theory: The Stochastic Analysis of Real World Signals” by Mumford and Desolneux, or perhaps something in the corpus of speech recognition and image processing books. This isn’t exactly a cookbook or exposition, mind you: more of a thematic manifesto with a few applications.  Obviously, signal processing has something to say about the subject, but what about learners which are designed to function usefully when we know that most of the data is noise?  Fields such as natural language processing and image processing are effectively ML in the presence of lots of noise and confounding signal, but the solutions you will find in their textbooks are specifically oriented to the problems at hand.  Once in a while something like vector quantization will be reused across fields, but it would be nice if we had an “elements of statistical learning in the presence of lots of noise” type book or review paper. Missing in action, and other than the specific subfields mentioned above, there are no research groups which study the problem as an engineering subject. New stuff is happening all the time; part of the success of “Deep Learning” is attributable to the Drop Out technique to prevent overfitting. Random forests could be seen as a technique which at genuflects at  “ML in the presence of noise” without worrying about it too much. Marketing guys are definitely thinking about this. I know for a fact that there are very powerful learners for picking signal out of shitloads of noise: I’ve written some. It would have been a lot easier if somebody wrote a  review paper on the topic. The available knowledge can certainly be systematized and popularized better than it has been.

Feature engineering: feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. Sometimes the features are obvious; sometimes not. Much of the success of machine learning is actually success in engineering features that a learner can understand. I daresay document classification would be awfully difficult without td-idf representation of document features. Latent Dirichlet allocation is a form of “graphical model” which works wonders on such data, but it wouldn’t do a thing without td-idf. [correction to this statement from Brendan below] Similarly, image processing has a bewildering variety of feature extraction algorithms which are of towering importance for that field; the SIFT descriptor, the GIST and HOG descriptors, the Hough transform, vector quantization, tangent distance [pdf link]. The Winner Take All hash [pdf link] is an extremely simple and related idea… it makes a man wonder if such ideas could be used in higher (or lower) dimensions. Most of these engineered features are histograms in some sense, but just saying “use a histogram” isn’t helpful. A review article or a book chapter on this sort of thing, thinking through the relationships of these ideas, and helping the practitioner to engineer new kinds of feature for broad problems would be great. Until then, it falls to the practitioner to figure all this crap out all by their lonesome.

Unsupervised and semi-supervised learning in general: almost all books, and even tools like R inherently assume that you are doing supervised learning, or else you’re doing something real simple, like hierarchical clustering, kmeans or PCA.  In the presence of a good set of features, or an interesting set of data, unsupervised techniques can be very helpful. Such techniques may be crucial. They may even help you to engineer new features, or at least reduce the dimensionality of your data. Many interesting data sets are only possible to analyze using semi-supervised techniques; recommendation engines being an obvious beneficiary of such tricks. “Deep learning” is also connected with unsupervised and semi-supervised approaches. I am pretty sure the genomics community does a lot of work with this sort of thing for dimensionality reduction. Supposedly Symbolic Regression (generalized additive models picked using genetic algorithms) is pretty cool too, and it’s in my org-emacs TODO lists to look at this more. Lots of good unsupervised techniques such as Kohonen Self Organizing Maps have fallen by the wayside. They’re still useful: I use them. I’d love a book or review article which concentrates on the topic, or just provides a bestiary of things which are broadly unsupervised. I suppose Oliver Chapelle’s book is an OK start for semi-supervised ideas, but again, not real unified or complete.

Images by one of my heroes, the Ukrainian-American artist Boris Artzybasheff. You can find more of it here.