Locklin on science

Data is not the new oil: a call for a Butlerian Jihad against technocrat data ding dongs

Posted in econo-blasphemy, machine learning, Progress by Scott Locklin on November 5, 2020

I tire of the dialog on “big data” and “AI.” AI is an actual subject, but as used in marketing and press releases and in the babbling by ideologues and think tank dipshits, the term is a sort of grandiose malapropism meaning “statistics and machine learning.” As far as I can tell “big data” just means the data at one point lived in something other than a spreadsheet.

“BigDataAI” ideology is a continuation of the program of the technocratic managerial “elite.” For those of you who are unfamiliar with the work of James Burnham: there is a social class of technocratic “experts” who have largely taken over the workings of society in the West; a process which took place in the first half of the 20th century. While there have always been bureaucrats in civilized societies, the ones since around the time of Herbert Hoover have aped “scientific” solutions even where no such thing is actually possible. This social class of bureaucrats has had some mild successes: the creation of the American highway system, public health initiatives against trichinosis, US WW-2 production. But they have mostly discredited themselves for decades: aka the shitty roads in America, the unaffordable housing in major urban centers, a hundred million fat diabetics, deindustrialization because muh free market reasons, the covidiocy, and most recently, the failure of every noteworthy technocrat in the world’s superpower to predict election outcomes, or even to honestly count the votes. Similar social classes interested in central planning also failed spectacularly in the Soviet Union, and led to the cultural revolution in China. There are reasons both obvious and deep as to why these social classes have failed.

The obvious reason is that mandarinates are inherently prone to corruption when there are no consequences for their failures. Bureaucrats are wielders of power and have the extreme privilege of collecting a pension at public expense. Various successful cultures had different ways of keeping them honest; the Prussians and pre-Soviet Russian bureaucracies recruited from honor cultures. Classical China and the early Soviets did it via fear. The Soviet Union actually worked pretty well when the guys from Gosplan could be sent to the Gulag for their failings (or because Stalin didn’t like their neckties -keeps them on their toes). It progressively fell apart as it grew more civilized; by the 1980s, nobody was afraid of the late night knock on the door, and the Soviet system fell apart when the US faked like it was going to build ridiculous space battleships. The rise of China has largely been the story of bureaucratic reforms by Deng where accountability (and vigorous punishment for malefactors) were the order of the day. Singapore makes bureaucrats meet regularly with their constituents; seems reasonable -don’t know why every society doesn’t make this a requirement. It is beyond question that the American equivalent of the Gosplan mandarinate is almost unimaginably corrupt at this point, and the country is falling apart as a result.

While it gives policy-makers a sense of agency having a data project, consider that there isn’t a single large scale data project beyond the search engine that has improved the lives of human beings. Mind you, the actual civilizational utility of the search engine is highly questionable. What improvement in human living standards has come of the advent of google in the last 20 years? The only valuable content on the internet is stuff made by human beings. Google effectively steals or destroys most of the revenue of content creators who made the stuff worth looking at in the first place. Otherwise, library science worked just fine without blue haired Mountain View dipshits running SVD on a link graph. INSPEC (more or less; dmoz for research) is 120 years old and is still vastly better for research than google scholar. Science made more progress between 1898 and 2005 or so, when google more or less replaced it: and the news wasn’t socially toxic clickfarming idiocy back when the CIA censored the news instead of google komissars with facial piercings. These days google even sucks at being google; I generally have more luck with runaroo or just going directly to things on internet archive.

If “AIBigData” were so wonderful, you’d see its salutary effects everywhere. Instead, a visit to the center of these ideas, San Francisco, is a visit to a real life dystopia. There are thousands of data projects which have made life obviously worse for people. Pretty much all of nutrition and public health research after the discovery of vitamins and the statisticians telling people not to drink toilet water is worthless or actively harmful (look at all those fat people waddling around). Most biomedical research is false, and most commonly prescribed drugs are snake oil or worse. Various “pre-crime” models used to justify setting bail or prison sentences are an abomination. The advertising surveillance hellscape we’ve created for ourselves is both aesthetically awful and a gigantic waste of time. The intelligence surveillance hellscape we’ve created mostly keeps its crimes secret, and does nothing obviously helpful. Annoying advertising invades every empty space; I don’t want to watch ads to pump gas or get money from my ATM. Show me something good these dorks have done for us; I’m not seeing it. Most of it is moronic overfitting to noise, evil, or both.

It’s less obvious, but can’t be stated often enough: often “there is no data in your data.” The technocracy’s mathematical tools boil down to versions of the t-test being applied to poorly sampled and/or heteroskedastic data where they may not be meaningful. The hypothesis under test may not have a meaningful null no matter how much data you collect. When they talk about “AI” I think it’s mostly aspirational; a way out of heteroskedasticity and actual randomness. It’s not; there are no “AI” t-tests in common use by these knuckleheads, and if there were, the upshot wouldn’t look that much different from 1970s era stats results. When they talk about big data, they don’t talk about $\frac{1}{\sqrt{n}}$, or issues like ROC curves and the bias-variance tradeoff. They certainly never talk about data which is heteroskedastic or simply random, which is most of it.
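A toy simulation of the heteroskedasticity point (all numbers invented for illustration): two groups with identical means, one small and noisy, one large and quiet. The naive pooled t-test "finds" an effect at an absurd rate under the true null; Welch's correction mostly behaves.

```python
# Two groups with the SAME mean; only the variances and sample sizes differ.
# The pooled t-test's false positive rate blows up; Welch's version doesn't.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trials = 2000
naive_rejections = 0
welch_rejections = 0
for _ in range(trials):
    a = rng.normal(0.0, 10.0, size=10)    # small, high-variance group
    b = rng.normal(0.0, 1.0, size=100)    # large, low-variance group
    _, p_naive = stats.ttest_ind(a, b, equal_var=True)   # pooled t-test
    _, p_welch = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    naive_rejections += p_naive < 0.05
    welch_rejections += p_welch < 0.05

print(f"false positive rate, pooled t-test: {naive_rejections / trials:.2f}")
print(f"false positive rate, Welch t-test:  {welch_rejections / trials:.2f}")
```

The pooled test's standard error is dominated by the quiet group, so pure noise in the loud group looks like signal; a nominal 5% test rejects the true null roughly half the time here.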

In reality, data collection is mostly useless. In intelligence work, in marketing, in political work: most of it is completely useless, and collecting it and acting on it is a sort of cargo cult for DBAs, cloud computing saleslizards, technocratic managerial nerds, economists, Nate Silver and other such human refuse. Once in a while it pays off. More often, the technocrat will take credit when things go his way and make complicated excuses when they don’t; just look at Nate Silver’s career, for example: a clown with a magic 8-ball. There’s an entire social class of “muh science” nerds who think it a sort of moral imperative to collect and act on data even if it is obviously useless. The very concept that their KPIs and databases might be filled with the sheerest gorp… or that you might not be able to achieve marketing uplift no matter what you do… doesn’t compute for some people.

Technocratic data people are mostly parasitic vermin, and their extermination, while it would cut into my P/L, would probably be good for society. At the very least we should make their salaries proportional to (1 − Brier) scores; that would require them to put error bars on their predictions, reward the competent, and bankrupt the useless. Really though, they should all be sent to Idaho to pick potatoes. Or ….
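For concreteness, a minimal sketch of the proposed compensation scheme. The Brier score is just the mean squared error between a predicted probability and the 0/1 outcome: 0 is perfect, 0.25 is what a shrugging "50% chance" forecaster earns, and confidently wrong forecasters approach 1. The forecasts and outcomes below are invented.

```python
def brier_score(probs, outcomes):
    """Mean squared error of probabilistic forecasts against binary outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

confident_and_right = brier_score([0.9, 0.8, 0.95], [1, 1, 1])
hedger = brier_score([0.5, 0.5, 0.5], [1, 0, 1])
confident_and_wrong = brier_score([0.9, 0.8, 0.95], [0, 0, 0])

# Pay proportional to (1 - Brier): calibrated confidence gets rewarded,
# hedging earns a mediocre wage, confident wrongness approaches bankruptcy.
print(1 - confident_and_right, 1 - hedger, 1 - confident_and_wrong)
```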

AI is not eliminating jobs

Posted in econo-blasphemy, machine learning by Scott Locklin on February 9, 2019

Midwits keep asserting that “AI” is going to eliminate jobs. They say things like “those jobs aren’t coming back because of AI” (or screws or whatever other dumb excuse: but they’re real sure about those jobs not coming back).  These are not statements of scientific or technological fact, or even a reasonable prediction based on present trends. These are ideological political statements. “AI” is a soundbite/fnord excuse for not doing anything about the policy problems of the present.

The ruling caste of American tech and FIRE lizard people continue to make these statements, not because they are inevitable, but because this is their desired future. Their preferred future is a population consisting of powerless, preferably drugged up serfs on the “universal basic income” dole, ruled over by our present ruling class of grifters, rentiers, pyramid scheme salesmen, watched over by a surveillance hellscape.  The lizard people would like to continue our present policy of de-industrializing the country, breaking what little labor negotiating power US citizens have, and atomizing people to their raw protoplasm. It’s almost like a Freudian slip. Don’t bother agitating for any rights, slave, we soon will have electric Golems and won’t need you!

The most murderous drug dealers who ever existed … Google’s AI dude thinks they’re great! https://twitter.com/JeffDean/status/1093953731756867584

Oh I am sure the Google doofs would like to develop and control some strong AI, and perhaps a robot maid to replace Juanita. I, too, would like to have a magic technology which gives me infinite power, and a robot maid to iron my shirts. If I were a Silicon Valley oligarch rather than a humble nerd, I might develop delusions the pile of C++, Javascript and tech drones which made me rich could become a Golem of infinite power. Personally,  I would build rockets. At least I could get away from lizard people who want to turn the world into a soy dystopia.

If these clowns really believed that “AI” were something actually like an “AI” which could replace humans in general tasks, they’d use it to replace computer programmers. At one point in history, people believed CASE tools would eliminate most programmer jobs. How’s that working out for the AI geniuses at Google? They can’t even automate devops jobs; devops being one of the most automatable roles in tech companies. Devops tasks don’t seem much different from a computer strategy game.

Google’s “AI” team can’t do this useful thing, which, even by my lights, actually seems  achievable. Yet somehow, google boobs think they’re going to violate Moravec’s paradox and replace drivers. Think about that for a minute. It’s becoming clear that autonomous vehicle “technology” as sold to people for the last 10 years is basically fraud, and is still stuck in the 1980s when Ernst Dickmanns was driving around the autobahn with Sun Workstations in his back seat. Demonstrations of this tech always have a human in the loop (remote or in vehicle), because moving automobiles without human control are death machines under most circumstances.

Inside of the UniBwM autonomous experimental vehicle VaMP, at the rear bench where the computing system was installed for easy access and monitoring. This was at the PROMETHEUS demonstration in Paris in October 1994 | Photo by Reinhold Behringer

Even assuming I’m wrong and the media hyperbole is right and full level 5 autonomous vehicles are “right around the corner” Google also has zero business interest in “disrupting” driving. Google is a tech driven advertising company with a  collection of loss leaders. Yet they go after this preposterously difficult, possibly impossible task. Why not disrupt a business they presumably know how to disrupt, like that of the lowly ops engineer? At least this would be good for their bottom line, and it would be a real step forward in “AI” rather than a parlour trick perpetuated by marketing nerds and started by obvious mountebanks.

From a semiotics point of view, this shows astounding hostility to the types of people who drive cars and trucks for a living. Drivers are … ordinary, usually uneducated, salt of the earth people who have a fairly independent lifestyle and make a decent living. Google overlords must really hate such people, since they’re dumping all this skrilla into ruining their lives for no sane business reason. They will almost certainly fail, but man, why would you try to blow up those people’s lives? If this country really wanted to get rid of driving, or considered it a serious problem that there are too many cars on the road, or thought that people now employed as drivers should do something else, we had a solution to this problem invented in the late 1800s.

The other professions people “think” will be replaced always seem to be low caste irritations or lawyers (lol). You regularly hear “experts” talking about how presently common jobs won’t exist in 20 years because of “AI.” I’ve said multiple times now that all estimates for delivery of something in 20 years are bullshit. A prediction that a technology will do X in 20 years means “we don’t know how to do this, but we want your money to fool around with anyway.” Controlled nuclear fusion research is the most amusing case of the perpetual 20 year rice bowl. 20 years is a magic number, as it’s plenty of time for a technological mountebank to retire; and it’s at least 2-3 generations of tenured academics, which is enough to turn a scam subject like “quantum computing” or “nanotech” into an actual field.

“AI” doesn’t exist. Machine learning is a force multiplier and productivity enhancer for statisticians. If you believe the “automation”=”no more jobs” ding dongs, machine learning should have at least automated away the job of statistician. Yet somehow, the  statistician (aka “data scientist”) jobs are among the best paid and most in-demand jobs out there at present.

The last job category I can think of which was automated away is Flight Engineer on airliners. It mostly went away because of automation of airliners, but it wasn’t even computer related; just normal improvements in systems monitoring and reliability; good old mechanical and systems engineering. Despite 1/3 fewer seats in airliner cockpits, there are more people with airline flight officer jobs now than ever before. Planes got cheaper and there are more of them, servicing vastly more people.

The example of Flight Engineer is how the world works. Technological advances increase human power over nature and make more things possible. Actual “AI” advances, should any eventually materialize, will work exactly like this.

AI has eliminated exactly zero professions, and essentially no jobs. Since the best prediction tool for a market is generally a random walk, my forecast is, barring giant breakthroughs, this trend of “nothing important actually happened” regarding AI job destruction will continue. If you disagree with me and have an alternate prediction on a normal human (aka 5 or 10 year) timescale, I am happy to entertain any long bets on whatever platform you care to use.

Machine learning & data science: what to worry about in the near future

Posted in machine learning by Scott Locklin on July 9, 2018

Henry Kissinger  recently opined about machine learning. OK, he used the ridiculously overblown phrase “AI” rather than “machine learning” but the latter is what he seemed to be talking about. I’m not a fan of the old reptile, but it is a reasonably thoughtful piece of gaseous bloviation from a politician. Hopefully whoever wrote it for him was well compensated.

There are obvious misapprehensions here; for example, noticing that chess programs are pretty good. You’d expect them to be good by now; we’ve been doing computer chess since 1950. To put this in perspective; steel belted radial tires and transistor radios were invented 3 years after computer chess -we’re pretty good at those as well. It is very much worth noting the first important computer chess paper (Shannon of course) had this sentence in it:

“Although of no practical importance, the question is of theoretical interest, and it is hoped that…this problem will act as a wedge in attacking other problems—of greater significance.”

The reality is, computer chess largely hasn’t been a useful wedge in attacking problems of greater significance. Kissinger also mentioned AlphaGo; a recent achievement, but one which isn’t conceptually much different from TD-Gammon, done in the 1990s.

Despite all the marketing hype coming out of Mountain View, there really hasn’t been much in the way of conceptual breakthroughs in machine learning since the 1990s. Improvements in neural networks have caused excitement, and the ability of deep learning to work more efficiently on images is an improvement in capabilities. Stuff like gradient boosting machines has also been a considerable technical improvement in usable machine learning. These don’t really count as big conceptual breakthroughs; just normal improvements for a field of engineering that has a poor theoretical substructure. As for actual “AI” -almost nobody is really working on this.

Nonetheless, there has been progress in machine learning and data science. I’m betting on some of the improvements having a significant impact on society, particularly now that the information on these techniques is out there and commodified in reasonably decent software packages. Most of these things have not been spoken about by government policy-maker types like Kissinger, and are virtually never mentioned in dopey “news” articles on the subject, mostly because nobody bothers asking people who do this for a living.

I’d say most of these things haven’t quite reached the danger point for ordinary people who do not live in totalitarian societies, though national security agency type organizations and megacorps are already using these techniques or could be if they weren’t staffed with dimwits. There are also areas which we are still very bad at, which are to a certain extent keeping us safe.

The real dangers out there are pretty pedestrian looking, but people don’t think through the implications. I keep using the example, but numskull politicians were harping on the dangers of Nanotech about 15 years ago, and nothing came of that either. There were obvious dangerous trends happening in the corporeal world 15 years ago which had nothing to do with nanotech. The obesity rate was an obvious problem back then, whether from chemicals in the environment, the food supply, or the various cocktails of mind altering pharmies that fat people need to get through the day. The US was undergoing a completely uncommented upon and vast demographic, industrial and economic shift. Also, there was an enormous real estate bubble brewing. I almost think numskull politicians talk about bullshit like nanotech to avoid talking about real problems. Similarly politicians and marketers prefer talking about “AI” to issues in data science which may cause real problems in society.

The biggest issue we face has a real world example most people have seen by now. There exist various systems for road toll collection. To replace toll takers, people are encouraged to get radio tags for their car, like “ezpass.” Not everyone will have one of these, so governments can either continue to employ toll takers, removing most of the benefit of having such tags, or use an image recognition system to read license plates and send people a bill. The technology which underlies this system is pretty much what we’re up against as a society. As should be obvious: not many workers were replaced. Arguably none were; though uneducated toll takers were somewhat replaced by software engineers. The real danger we face from this system isn’t job replacement; it is Orwellian dystopia.

Here is a list of  obvious dangers in “data science” I’m flagging over the next 10-20 years as worth worrying about as a society.

1) Face recognition software (and to a lesser extent voice recognition) is getting quite good. Viola-Jones (a form of boosted machine) is great at picking out faces, and sticking them in classifiers which label them has become routine. Shitbirds like Facebook also have one of the greatest self-owned labeled data sets in the world, and are capable of much evil with it. Governments potentially have very good data sets also. It isn’t quite at the level where we can all be instantly recognized, like, say, with those spooky automobile license plate readers, but it’s probably not far away either. Plate readers are a much simpler problem; one theoretically mostly solved in the 90s when Yann LeCun and Leon Bottou developed convolutional nets for check-reading at ATMs.
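The "boosted machine" at the core of Viola-Jones is AdaBoost over very weak learners: decision stumps, each thresholding one cheap Haar-like feature. A hedged sketch of that core idea using scikit-learn on synthetic stand-in features (no actual faces or Haar features here; the cascade structure and image pipeline of the real detector are omitted):

```python
# AdaBoost over depth-1 decision stumps -- scikit-learn's default weak
# learner -- on invented feature vectors standing in for Haar responses.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 20))                    # fake "Haar feature" vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # fake face / non-face labels

# Many weak stumps, combined by boosting, yield a strong classifier.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X[:800], y[:800])
print("holdout accuracy:", clf.score(X[800:], y[800:]))
```

The real detector stacks many such boosted stages into a cascade so that easy non-faces are rejected after a couple of cheap feature evaluations; that is what made it fast enough to scan every window of an image in 2001.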

2) Machine learning and statistics on large data are getting quite respectable. For quite a while I didn’t care that Facebook, google and the advertisers had all my data, because it was too expensive to process it down into something useful enough to say anything about me. That’s no longer true. Once you manage to beat the data cleaning problems, you can make sense of lots of disparate data. Even unsophisticated old school stuff like Eclat is pretty helpful, and various implementations of this sort of thing are efficient enough to be dangerous.
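Eclat itself is almost embarrassingly simple: frequent itemset mining by intersecting vertical transaction-id lists. A toy sketch on invented transactions, with the recursion flattened into a level-by-level loop:

```python
# Eclat: store, for each itemset, the set of transaction ids containing it;
# extend frequent itemsets by intersecting tid-lists.
from itertools import combinations

transactions = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "salsa"},
    {"beer", "chips", "diapers"},
]

def eclat(transactions, min_support=2):
    """Return frequent itemsets as {frozenset: set of transaction ids}."""
    # Vertical layout: single item -> ids of transactions containing it.
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(frozenset([item]), set()).add(tid)
    frequent = {k: v for k, v in tidlists.items() if len(v) >= min_support}
    frontier = dict(frequent)
    while frontier:
        next_frontier = {}
        for (a, ta), (b, tb) in combinations(frontier.items(), 2):
            candidate, tids = a | b, ta & tb   # the whole algorithm is here
            if len(candidate) == len(a) + 1 and len(tids) >= min_support:
                next_frontier[candidate] = tids
        frequent.update(next_frontier)
        frontier = next_frontier
    return frequent

freq = eclat(transactions, min_support=2)
print(sorted((sorted(k), len(v)) for k, v in freq.items()))
```

Support counting by set intersection is exactly the sort of thing that, done with bitmaps over a few hundred million "transactions" of browsing data, becomes efficient enough to be dangerous.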

3) Community detection. This is an interesting bag of ideas that has grown powerful over the years. Interestingly, I’m not sure there is a good book on the subject, and it seems virtually unknown among practitioners who do not specialize in it. A lot of it is “just” graph theory or un/semi-supervised learning of various kinds.
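One standard tool from this bag of ideas, sketched with networkx: greedy modularity maximization on a toy graph of two cliques joined by a single edge. The graph is invented; the point is only how little code it takes.

```python
# Greedy modularity maximization (Clauset-Newman-Moore) on a barbell graph:
# two 5-cliques connected by one edge should split into the two cliques.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.barbell_graph(5, 0)   # nodes 0-4 and 5-9 are cliques; edge 4-5 joins them
communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```

Swap the toy graph for a call graph, a follower graph, or co-location records and the same dozen lines partition a population into social cells, which is why the technique deserves more notoriety than it has.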

4) Human/computer interfaces are getting better. Very often a machine learning algorithm is more like a filter that sends vastly smaller lists of problems for human analysts to solve. Palantir originated to do stuff like this, and while very little stuff on human computer interfaces is open source, the software is pretty good at this point.

5) Labels are becoming ubiquitous. Most people do supervised learning, which … requires labels for supervision. Unfortunately with various kinds of cookies out there, people using nerd dildos for everything, networked GPS, IOT, radio tags and so on; there are labels for all kinds of things which didn’t exist before. I’m guessing as of now or very soon, you won’t need to be a government agency to track individuals in truly Orwellian ways based on the trash data in your various devices; you’ll just need a few tens of millions of dollars worth of online ad company. Pretty soon this will be offered as a service.

Ignorance of these topics is keeping us safe

1) Database software is crap. Databases are OK for some purposes; they’re nowhere near their theoretical capabilities in solving these kinds of problems. Database researchers are, oddly enough, generally not interested in solving real data problems. So you get mediocre crap like Postgres; bleeding edge designs from the 1980s. You have total horse shit like Spark, laughably insane things like Hive, and… sort of OK designs like bigtables. These will keep database engineers and administrators employed for decades to come, and prevent the solution of all kinds of important problems. There are people and companies out there that know what they’re doing. One to watch is 1010data; people who understand basic computing facts, like “latency.” Hopefully they won’t be badly managed by their new owners; their engineering team is probably the best positioned to meet this challenge. The problem with databases is multifold: getting at the data you need is important. Keeping it close to the learning algorithms is also important. None of these things are done well by any existing publicly available database engine. Most of what exists in terms of database technology is suitable for billing systems, not data science. Usually people build custom tools to solve specific problems; like the high frequency trader guys who built custom data tee-offs and backtesting frameworks instead of buying a more general tool like Kx. This is fine by me; perpetual employment. Lots of companies do have big data storage, but most of them still can’t get at their data in any useful way. If you’ve ever seen these things, and actually knew what you were doing, even at the level of a 1970s DBA, you would laugh hysterically. Still, enough spergs have built pieces of Kx-type things that eventually someone will get it right.

2) Database metadata is hard to deal with. One of the most difficult problems for any data scientist is the data preparation phase. There’s much to be said about preparation of data, but one of the most important tasks in preparing data for analysis is joining data gathered in different databases. The very simple example is the data from the ad server and the data from the sales database not talking to each other. So, when I click around Amazon and buy something, the imbecile ad-server will continue to serve me ads for the thing that Amazon knows it has already sold me. This is a trivial example: one that Amazon could solve in principle, but in practice it is difficult and hairy enough that it isn’t worth the money for Amazon to fix (I have a hack which fixes the ad serving problem, but it doesn’t solve the general problem). This is a pervasive problem, and it’s a huge, huge thing preventing more data being used against the average individual. If “AI” were really a thing, this is where it would be applied. This is actually a place where machine learning potentially could be used, but I think there are several reasons it won’t be, and this will remain a big impediment to tracking and privacy invasions in 20 years. FWIIW, back to my ezpass license plate photographer thing: it sticks a billing system together with at least two government databases per state in which something like ezpass operates (unless they all use the same system, which is possible) -a clever thing which hits this bullet point.
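The ad-server/sales-database join, sketched with invented records. The hard part in real life is that the keys don't line up across systems; here the "entity resolution" is nothing but lowercasing an email, which is exactly the kind of hack that doesn't scale past a toy.

```python
# Two "databases" that don't talk to each other: ad impressions keyed one
# way, purchases keyed another. The join suppresses ads for things already
# bought. All records are invented.

ad_server = [  # who saw an ad for what
    {"email": "Alice@Example.com", "product": "toaster"},
    {"email": "bob@example.com", "product": "toaster"},
    {"email": "bob@example.com", "product": "kayak"},
]
sales_db = [   # purchases, with different field names and key hygiene
    {"customer_email": "alice@example.com", "sku": "toaster"},
]

# Normalize the key, then set-subtract: don't re-advertise a completed sale.
purchased = {(r["customer_email"].lower(), r["sku"]) for r in sales_db}
to_serve = [r for r in ad_server
            if (r["email"].lower(), r["product"]) not in purchased]
print(to_serve)
```

Multiply the two tables by a few thousand mutually inconsistent schemas and you have the metadata problem: the join logic is trivial, knowing which columns mean the same thing is not.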

3) Most commonly used forms of machine learning require many examples. People have been concentrating on Deep Learning, which almost inherently requires many, many examples. This is good for the private minded; most data science teams are too dumb to use techniques which don’t require a lot of examples. These techniques exist; some of them have for a long time. For the sake of this discussion, I’ll call these “sort of like Bayesian” -which isn’t strictly true, but which will shut people up. I think it’s great the average sperglord is spending all his time on Deep Learning, which is 0.2% more shiny, assuming you have Google’s data sets. If a company like google had techniques which required few examples, they’d actually be even more dangerous.
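One honest example of a "sort of like Bayesian" technique that needs few examples: a Beta-Bernoulli conjugate update, which gives a usable estimate with honest uncertainty attached after five observations, no giant labeled data set required. The numbers are invented.

```python
# Beta-Bernoulli conjugate update: each 0/1 observation just increments a
# counter, and the posterior has a closed form.
from math import sqrt

alpha, beta = 1.0, 1.0            # flat Beta(1,1) prior on a click-through rate
observations = [1, 0, 1, 1, 0]    # five trials: click / no-click

for clicked in observations:
    alpha += clicked
    beta += 1 - clicked

mean = alpha / (alpha + beta)     # posterior mean of the rate
var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))
print(f"posterior mean {mean:.3f} +/- {sqrt(var):.3f}")
```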

4) Most people can only do supervised learning. (For that matter, non-batch learning terrifies most “data scientists” -just like Kalman filters terrify statisticians, even though they are the same damn thing as linear regression.) There is some work on stuff like reinforcement learning being mentioned in the funny papers. I guess reinforcement learning is interesting, but it is not really all that useful for anything practical. The real interesting stuff is semi-supervised, unsupervised, online and weak learning. Of course, all of these things are actually hard, in that they mostly do not exist as prepackaged tools in R you can use in a simple recipe. So, the fact that most domain “experts” are actually kind of shit at machine learning is keeping us safe.
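The Kalman-filters-are-linear-regression parenthetical, in code: a scalar Kalman filter estimating a constant (identity state transition, zero process noise) reproduces the least-squares estimate, which for this model is just the sample mean. The data are synthetic.

```python
# A one-state Kalman filter run over noisy observations of a constant.
# With a diffuse prior, the recursive update converges to the batch OLS
# answer: the sample mean.
import numpy as np

rng = np.random.default_rng(0)
y = 3.0 + rng.normal(0.0, 1.0, size=50)   # noisy observations of a constant

x, P, R = 0.0, 1e12, 1.0   # state estimate, (diffuse) prior variance, obs noise
for obs in y:
    K = P / (P + R)        # Kalman gain
    x = x + K * (obs - x)  # measurement update
    P = (1.0 - K) * P      # posterior variance shrinks each step

print(x, y.mean())   # the two estimates agree
```

The filter is recursive least squares: same estimator, computed one observation at a time, which is why it belongs in the "online learning" bucket people find so scary.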

A shockingly sane exposition of what to expect from machine learning, which I even more shockingly found on a VC’s website:

Predicting with confidence: the best machine learning idea you never heard of

Posted in machine learning by Scott Locklin on December 5, 2016

One of the disadvantages of machine learning as a discipline is the lack of reasonable confidence intervals on a given prediction. There are all kinds of reasons you might want such a thing, but I think machine learning and data science practitioners are so drunk with newfound powers, they forget where such a thing might be useful. If you’re really confident, for example, that someone will click on an ad, you probably want to serve one that pays a nice click-through rate. If you have some kind of gambling engine, you want to bet more money on the predictions you are more confident of. Or if you’re diagnosing an illness in a patient, it would be awfully nice to be able to tell the patient how certain you are of the diagnosis, and what the confidence in the prognosis is.
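"Bet more on the predictions you're more confident of," made concrete with the Kelly criterion, the standard answer for sizing bets by confidence (this is textbook material, not something from the post): for an even-money bet with win probability p, the optimal bankroll fraction is 2p − 1.

```python
# Kelly criterion: stake fraction f* = (p*(b+1) - 1)/b for win probability p
# at net fractional odds b. The example probabilities are invented.

def kelly_fraction(p, odds=1.0):
    """Optimal bankroll fraction for win probability p at given net odds."""
    f = (p * (odds + 1.0) - 1.0) / odds
    return max(f, 0.0)   # never bet on a negative-edge proposition

for p in (0.50, 0.55, 0.70, 0.90):
    print(f"confidence {p:.2f} -> stake {kelly_fraction(p):.2f} of bankroll")
```

The catch, and the motivation for the rest of this post: the formula is only as good as your p, which is exactly the confidence estimate most machine learning models don't give you.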

There are various ad hoc ways that people do this sort of thing. The one you run into most often is some variation on cross validation, which produces an average confidence interval. I’ve always found this to be dissatisfying (as are PAC approaches). Some people fiddle with their learners in hopes of making sure the prediction is normally distributed, then build confidence intervals from that (or, for the classification version, Platt scaling using logistic regression). There are a number of ad hoc ways of generating confidence intervals using resampling methods and generating a distribution of predictions. You’re kind of hosed, though, if your prediction is in online mode. Some people build learners that they hope will produce a sort of estimate of the conditional probability distribution of the forecast; aka quantile regression forests and friends. If you’re a Bayesian, or use a model with confidence intervals baked in, you may be in pretty good shape. But let’s face it: Bayesian techniques assume your prior is correct, and that new points are drawn from your prior. If your prior is wrong, so are your confidence intervals, and you have no way of knowing this. Same story with heteroskedasticity. Wouldn’t it be nice to have some tool to tell you how uncertain your prediction is when you’re not certain of your priors, or your algorithm for that matter?
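One of the ad hoc fixes mentioned above, Platt scaling, sketched by hand rather than through any particular library helper: fit an uncalibrated scorer (a linear SVM here, an arbitrary choice), then fit a logistic regression mapping its decision scores to probabilities on held-out data. All data are synthetic.

```python
# Platt scaling by hand: SVM decision scores -> logistic regression ->
# calibrated-ish probabilities. Train / calibration / test splits are disjoint.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)

svm = LinearSVC().fit(X[:400], y[:400])               # uncalibrated scorer
scores = svm.decision_function(X[400:500]).reshape(-1, 1)
platt = LogisticRegression().fit(scores, y[400:500])  # fit on calibration set

test_scores = svm.decision_function(X[500:]).reshape(-1, 1)
probs = platt.predict_proba(test_scores)[:, 1]        # now in [0, 1]
print(probs.min(), probs.max())
```

This is an average calibration over the score distribution; like cross-validation intervals, it says nothing about how trustworthy any individual prediction is, which is the gap conformal prediction fills.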

Well, it turns out, humanity possesses such a tool, but you probably don’t know about it. I’ve known about this trick for a few years now, through my studies of online and compression based learning as a general subject. It is a good and useful bag of tricks, and it verifies many of the “seat of the pants” insights I’ve had in attempting to build ad-hoc confidence intervals in my own predictions for commercial projects.  I’ve been telling anyone who listens for years that this stuff is the future, and it seems like people are finally catching on. Ryan Tibshirani, who I assume is the son of the more famous Tibshirani, has published a neat R package on the topic along with colleagues at CMU. There is one other R package out there and one in python. There are several books published in the last two years. I’ll do my part in bringing this basket of ideas to a more general audience, presumably of practitioners, but academics not in the know should also pay attention.

The name of this basket of ideas is “conformal prediction.” The provenance of the ideas is quite interesting, and should induce people to pay attention. Vladimir Vovk is a former Kolmogorov student who has had all kinds of cool ideas over the years. Glenn Shafer is also well known for his co-development of Dempster-Shafer theory, which is a brewing alternative to standard measure-theoretic probability theory, quite useful in sensor fusion and, I think, some machine learning frameworks. Alexander Gammerman is a former physicist from Leningrad who, like Shafer, has done quite a bit of work in the past with Bayesian belief networks. Just to reiterate who these guys are: Vovk and Shafer have also previously developed a probability theory based on game theory which has ended up being very influential in machine learning pertaining to sequence prediction. To invent one new form of probability theory is clever. Two is just showing off! The conformal prediction framework comes from deep results in probability theory and is inspired by Kolmogorov and Martin-Löf’s ideas on algorithmic complexity theory.

The advantages of conformal prediction are manifold. These ideas assume very little about the thing you are trying to forecast, the tool you’re using to forecast, or how the world works, and they still produce a pretty good confidence interval. Even if you’re an unrepentant Bayesian, using some of the machinery of conformal prediction, you can tell when things have gone wrong with your prior. The learners work online, and, with some modifications and considerations, in batch learning. One of the nice things about calculating confidence intervals as a part of your learning process is they can actually lower error rates, or be used in semi-supervised learning as well. Honestly, I think this is the best bag of tricks since boosting; everyone should know about and use these ideas.

The essential idea is that a “conformity function” exists. Effectively you are constructing a sort of multivariate cumulative distribution function for your machine learning gizmo using the conformity function. Such CDFs exist for classical stuff like ARIMA and linear regression under the correct circumstances; CP brings the idea to machine learning in general, and to models like ARIMA when the standard parametric confidence intervals won’t work. Within the framework, the conformity function, whatever it may be, when used correctly can be guaranteed to give confidence intervals to within a probabilistic tolerance. The original proofs and treatments of conformal prediction, defined for sequences, are extremely computationally inefficient. The conditions can be relaxed in many cases, and the conformity function is in principle arbitrary, though good ones will produce narrower confidence regions. Somewhat confusingly, these good conformity functions are referred to as “efficient” -though they may not be computationally efficient.

The original research and proofs were done on so-called “transductive conformal prediction.” I’ll sketch this out below.

Suppose you have a data set $Z:= z_1,...,z_N$, with $z_i:=(x_i,y_i)$, where $x_i$ has the usual meaning of a feature vector and $y_i$ the variable to be predicted. If the $N!$ different possible orderings are equally likely, the data set $Z$ is exchangeable. For the purposes of this argument, most data sets are exchangeable or can be made so. Call a multiset of points drawn from $Z$ with replacement a “bag” $B$.

The conformal predictor is $\Gamma^{\epsilon}(Z,x) := \{y | p^{y} > \epsilon \}$, where $Z$ is the training set, $x$ is a test object, and $\epsilon \in (0,1)$ is a chosen significance level (so the prediction set aims at coverage $1-\epsilon$). To build it, we need a function $A(B,z_i)$ which measures how different a point $z_i$ is from the bag $B$.

Example: if we have a forecast technique $\phi(B)$ which works on exchangeable data, then a very simple conformity function is the distance between the new point and the forecast based on the bag: $A(B,z_i):=d(\phi(B), z_i)$.

Simplifying the notation a little bit, let’s call $A_i := A(B^{-i},z_i)$, where $B^{-i}$ is the bag missing $z_i$. Since exchangeability makes every ordering of $Z$ equally likely, the p-value $p^{y}$ can be defined from the nonconformity measures: $p^{y} := \frac{\#\{i=1,...,n | A_i \geq A_n \} }{n}$. That this gives valid coverage can be proved in a fairly straightforward way; you can find the proof in any of the books and most of the tutorials.
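The transductive recipe above fits in a few lines of code. Here is a minimal numpy sketch, where I’ve taken the bag forecast $\phi$ to be the mean of the remaining points and absolute distance as the nonconformity measure — toy choices of mine for illustration, not anything canonical from the literature:

```python
import numpy as np

def transductive_p_value(z_train, z_new):
    """p-value of a candidate point z_new relative to training points z_train.

    Nonconformity A_i: distance from z_i to the mean of the other points --
    the mean plays the role of the forecast phi on the bag B^{-i}.
    """
    z = np.append(z_train, z_new)
    n = len(z)
    scores = np.array([abs(z[i] - np.delete(z, i).mean()) for i in range(n)])
    # fraction of points at least as nonconforming as the candidate (last index)
    return np.sum(scores >= scores[-1]) / n

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=100)
p_typical = transductive_p_value(train, 0.1)   # a plausible candidate value
p_outlier = transductive_p_value(train, 10.0)  # an implausible one
```

A typical point gets a large p-value and lands inside the prediction set $\Gamma^{\epsilon}$ for any reasonable $\epsilon$; the outlier scores worst of all $n$ points, so its p-value is $1/n$ and it gets excluded. Note the $O(n^2)$ work per candidate $y$ — this is the computational pain referred to below.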

Practically speaking, this kind of transductive prediction is computationally prohibitive, and not how most practitioners confront the world. Practical people use inductive prediction, where we use training examples and then see how we do on a test set. I won’t go through the general framework for this, at least this time around; go read the book or one of the tutorials listed below. For what it is worth, one of the forms of Inductive Conformal Prediction is called Mondrian Conformal Prediction: a framework which allows for different error rates for different categories, hence all the Mondrian paintings I decorated this blog post with.

For many forms of inductive CP, the main trick is that you must subdivide your training set into two pieces. One piece, the proper training set, you use to train your model. The other piece, the calibration set, you use to calculate your confidence region: you compute the non-conformity scores on the calibration set, and apply them to the predictions generated by the model fit on the proper training set. There are other blended approaches. Whenever you use sampling or bootstrapping in your prediction algorithm, you have the chance to build a conformal predictor using the parts of the data not used by the base learner. So favorites like Random Forests and Gradient Boosting Machines have potentially computationally efficient conformity measures. There are also flavors using a CV-type process, though the proofs seem weaker for these. There are also reasonably computationally efficient Inductive CP measures for KNN, SVM and decision trees. The inductive “split conformal predictor” has an R package associated with it, defined for general regression problems, so it is worth going over in a little bit of detail.
For coverage $1-\epsilon$, using a prediction algorithm $\phi$ and training data set $Z_i, i=1,...,n$: randomly split the index $i=1,...,n$ into two subsets which, as above, we will call the proper training set and the calibration set, $I_1, I_2$.

Train the learner on the proper training set $I_1$:

$\phi_{trained}:=\phi(Z_i), i \in I_1$. Then, using the trained learner, find the residuals in the calibration set:

$R_i := |Y_i - \phi_{trained}(X_i)|, i \in I_2$
$d :=$ the $k$th smallest value in $\{R_i : i \in I_2\}$, where
$k=\lceil (n/2 + 1)(1-\epsilon) \rceil$

The prediction interval for a new point $x$ is $[\phi_{trained}(x)-d, \phi_{trained}(x)+d]$.
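The whole split conformal recipe is only a few lines of code. A numpy sketch, using an ordinary least-squares line via np.polyfit as a stand-in for the black-box learner $\phi$ (learner and data are my toy choices):

```python
import numpy as np

def split_conformal_interval(x, y, x_new, epsilon=0.1):
    """Split conformal prediction interval around phi(x_new), aiming for
    coverage 1 - epsilon.  phi here is a least-squares line, but any
    regression algorithm can be swapped in without changing the guarantee."""
    n = len(x)
    idx = np.random.default_rng(42).permutation(n)
    I1, I2 = idx[: n // 2], idx[n // 2:]      # proper training / calibration sets
    coeffs = np.polyfit(x[I1], y[I1], 1)      # train phi on the proper training set
    phi = lambda t: np.polyval(coeffs, t)
    R = np.sort(np.abs(y[I2] - phi(x[I2])))   # calibration residuals, ascending
    k = int(np.ceil((len(I2) + 1) * (1 - epsilon)))
    d = R[min(k, len(R)) - 1]                 # k-th smallest residual
    return phi(x_new) - d, phi(x_new) + d

# toy data: y = 2x + unit gaussian noise, so the truth at x=5 is 10
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + rng.normal(0.0, 1.0, size=200)
lo, hi = split_conformal_interval(x, y, x_new=5.0, epsilon=0.1)
```

With unit gaussian noise the 90% interval comes out roughly $\pm 1.6$ around the fitted line — but the coverage guarantee would hold just the same if the noise were something nasty and non-gaussian, which is the whole point.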

This type of thing may seem unsatisfying, as technically the bounds only hold for a single predicted point. But there are workarounds using leave-one-out in the ranking. The leave-one-out version is a little difficult to follow in a lightweight blog post, so I’ll leave it as an exercise for those who are interested to read more about it in the R documentation for the package.
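The out-of-bag flavor mentioned earlier is also easy to sketch. Here is a hand-rolled bootstrap ensemble of least-squares lines in numpy, scoring each training point only with the replicates that never saw it — a real Random Forest would use the forest’s own out-of-bag predictions, so treat everything below as a toy stand-in of my own devising:

```python
import numpy as np

def oob_conformal_interval(x, y, x_new, epsilon=0.1, n_boot=50):
    """Conformal interval using out-of-bag residuals from a bootstrap
    ensemble of least-squares lines -- the trick that makes bagged
    learners like Random Forests cheap to conformalize."""
    rng = np.random.default_rng(1)
    n = len(x)
    preds = np.full((n_boot, n), np.nan)        # OOB predictions per replicate
    fits = []
    for b in range(n_boot):
        in_bag = rng.integers(0, n, size=n)     # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), in_bag)  # points this replicate never saw
        coeffs = np.polyfit(x[in_bag], y[in_bag], 1)
        fits.append(coeffs)
        preds[b, oob] = np.polyval(coeffs, x[oob])
    oob_mean = np.nanmean(preds, axis=0)        # aggregate OOB prediction per point
    R = np.sort(np.abs(y - oob_mean))           # OOB residuals as conformity scores
    k = int(np.ceil((n + 1) * (1 - epsilon)))
    d = R[min(k, n) - 1]
    center = np.mean([np.polyval(c, x_new) for c in fits])
    return center - d, center + d

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + rng.normal(0.0, 1.0, size=200)
lo, hi = oob_conformal_interval(x, y, x_new=5.0)
```

Nothing is held out: every point does double duty as training data for some replicates and calibration data for the others, which is why bagged learners get their conformity scores nearly for free.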

Conformal prediction is about 10 years old now: still in its infancy. While forecasting with confidence intervals is inherently useful, the applications and extensions of the idea are what really tantalize me about the subject. New forms of feature selection, new forms of loss function which integrate the confidence region, new forms of optimization to deal with conformal loss functions, completely new and different machine learning algorithms, new ways of thinking about data and probabilistic prediction in general. Specific problems where CP has had success: face recognition, nuclear fusion research, design optimization, anomaly detection, network traffic classification and forecasting, medical diagnosis and prognosis, computer security, chemical properties/activities prediction, and computational geometry. It’s probably only been used on a few thousand different data sets. Imagine being at the very beginning of Bayesian data analysis, where things like the expectation-maximization algorithm are just being invented, or neural nets before backpropagation: that is where I think the CP basket of ideas is at. It’s an exciting field at an exciting time, and while it is quite useful now, all kinds of great new results will come of it.

There is a website and a book. Other papers and books can be found in the usual way. This paper goes with the R package mentioned above, and is particularly clearly written on the split and leave-one-out conformal prediction flavors. Here is a presentation with some open problems and research directions if you want to get to work on something interesting. Only 19 packages on github so far.