Locklin on science

Machine learning & data science: what to worry about in the near future

Posted in machine learning by Scott Locklin on July 9, 2018

Henry Kissinger recently opined about machine learning. OK, he used the ridiculously overblown phrase “AI” rather than “machine learning,” but the latter is what he seemed to be talking about. I’m not a fan of the old reptile, but it is a reasonably thoughtful piece of gaseous bloviation from a politician. Hopefully whoever wrote it for him was well compensated.


There are obvious misapprehensions here; for example, marveling that chess programs are pretty good. You’d expect them to be good by now; we’ve been doing computer chess since 1950. To put this in perspective: steel-belted radial tires and transistor radios were invented three years after computer chess, and we’re pretty good at those as well. It is very much worth noting that the first important computer chess paper (Shannon’s, of course) had this sentence in it:

“Although of no practical importance, the question is of theoretical interest, and it is hoped that…this problem will act as a wedge in attacking other problems—of greater significance.”

The reality is, computer chess largely hasn’t been a useful wedge in attacking problems of greater significance. Kissinger also mentioned AlphaGo; a recent achievement, but one which isn’t conceptually much different from TD-Gammon, which was done in the 1990s.

Despite all the marketing hype coming out of Mountain View, there really hasn’t been much in the way of conceptual breakthroughs in machine learning since the 1990s. Improvements in neural networks have caused excitement, and the ability of deep learning to work more efficiently on images is an improvement in capabilities. Stuff like gradient boosting machines has also been a considerable technical improvement in usable machine learning. These don’t really count as big conceptual breakthroughs; just normal improvements for a field of engineering with a poor theoretical substructure. As for actual “AI”: almost nobody is really working on it.

Nonetheless, there has been progress in machine learning and data science. I’m betting on some of the improvements having a significant impact on society, particularly now that the information on these techniques is out there and commodified in reasonably decent software packages. Most of these things have not been spoken about by government policy-maker types like Kissinger, and are virtually never mentioned in dopey “news” articles on the subject, mostly because nobody bothers asking people who do this for a living.

I’d say most of these things haven’t quite reached the danger point for ordinary people who do not live in totalitarian societies, though national security agency type organizations and megacorps are already using these techniques or could be if they weren’t staffed with dimwits. There are also areas which we are still very bad at, which are to a certain extent keeping us safe.

The real dangers out there are pretty pedestrian looking, but people don’t think through the implications. I keep using the example, but numskull politicians were harping on the dangers of nanotech about 15 years ago, and nothing came of that either. There were obvious dangerous trends happening in the corporeal world 15 years ago which had nothing to do with nanotech. The obesity rate was an obvious problem back then, whether from chemicals in the environment, the food supply, or the various cocktails of mind-altering pharmies that fat people need to get through the day. The US was undergoing a vast and completely uncommented-upon demographic, industrial and economic shift. Also, there was an enormous real estate bubble brewing. I almost think numskull politicians talk about bullshit like nanotech to avoid talking about real problems. Similarly, politicians and marketers prefer talking about “AI” to the issues in data science which may cause real problems in society.

The biggest issue we face has a real world example most people have seen by now. There exist various systems for road toll collection. To replace toll takers, people are encouraged to get radio tags for their cars, like “ezpass.” Not everyone will have one of these, so governments’ choices are to continue to employ toll takers, removing most of the benefit of having such tools, or to use an image recognition system to read license plates and send people a bill. The technology which underlies this system is pretty much what we’re up against as a society. As should be obvious: not many workers were replaced. Arguably none were, though uneducated toll takers were somewhat replaced by software engineers. The real danger we face from this system isn’t job replacement; it is Orwellian dystopia.

Here is a list of obvious dangers in “data science” which I’m flagging as worth worrying about, as a society, over the next 10-20 years.

1) Face recognition software (and to a lesser extent voice recognition) is getting quite good. Viola-Jones (a form of boosted machine) is great at picking out faces, and sticking them in classifiers which label them has become routine. Shitbirds like Facebook also have one of the greatest self-owned labeled data sets in the world, and are capable of much evil with it. Governments potentially have very good data sets also. It isn’t quite at the level where we can all be instantly recognized, like, say, with those spooky automobile license plate readers, but it’s probably not far away either. Plate readers are a much simpler problem; one theoretically mostly solved in the 90s, when Yann LeCun and Léon Bottou developed convolutional nets for ATMs.
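The cascade idea that makes Viola-Jones fast is simple enough to sketch: a chain of cheap tests rejects most non-face windows early, so the expensive checks only run on promising candidates. Here is a toy illustration of that attentional-cascade structure; the “features” and thresholds are invented for the example and bear no resemblance to real Haar features:

```python
# Toy sketch of the Viola-Jones attentional cascade: each stage is a
# cheap test, and a window survives only if it passes every stage.
# Features and thresholds here are invented for illustration only.

def make_stage(feature, threshold):
    """A stage passes if the (cheap) feature score clears its threshold."""
    return lambda window: feature(window) >= threshold

def cascade_detect(window, stages):
    """Reject as soon as any stage fails; most non-faces exit early."""
    return all(stage(window) for stage in stages)

# Pretend features: mean brightness, and a crude light/dark contrast.
mean_brightness = lambda w: sum(w) / len(w)
contrast = lambda w: max(w) - min(w)

stages = [make_stage(mean_brightness, 0.2),  # stage 1: not too dark
          make_stage(contrast, 0.5)]         # stage 2: some structure

print(cascade_detect([0.1, 0.1, 0.1], stages))   # fails stage 1 -> False
print(cascade_detect([0.3, 0.9, 0.2], stages))   # passes both -> True
```

The real detector learns each stage with AdaBoost over tens of thousands of rectangle features, but the early-rejection control flow is exactly this.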


2) Machine learning and statistics on large data sets are getting quite respectable. For quite a while I didn’t care that Facebook, Google and the advertisers had all my data, because it was too expensive to process it down into something useful enough to say anything about me. That’s no longer true. Once you manage to beat the data cleaning problems, you can make sense of lots of disparate data. Even unsophisticated old-school stuff like Eclat is pretty helpful, and various implementations of this sort of thing are efficient enough to be dangerous.
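For the curious, Eclat fits in a few lines: store the data vertically (item → set of transaction ids) and get the support of an itemset union by intersecting tid-lists. A minimal sketch with an invented market-basket data set, extended only to pairs for brevity (the full algorithm recurses to larger itemsets):

```python
# Minimal Eclat sketch: frequent-itemset mining via vertical tid-lists.
# The transactions are invented for illustration.
from itertools import combinations

transactions = [{"beer", "diapers"}, {"beer", "chips"},
                {"beer", "diapers", "chips"}, {"diapers", "milk"}]

# Vertical layout: item -> set of transaction ids containing it.
tidlists = {}
for tid, items in enumerate(transactions):
    for item in items:
        tidlists.setdefault(item, set()).add(tid)

min_support = 2
frequent = {frozenset([i]): tids for i, tids in tidlists.items()
            if len(tids) >= min_support}

# Eclat's key move: support of a union = intersection of tid-lists.
for (a, ta), (b, tb) in combinations(list(frequent.items()), 2):
    both = ta & tb
    if len(both) >= min_support:
        frequent[a | b] = both

print(sorted(tuple(sorted(k)) for k in frequent))
```

Set intersection is cheap and parallelizes well, which is why this 1990s-era idea is efficient enough to be dangerous on modern data volumes.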

3) Community detection. This is an interesting bag of ideas that has grown powerful over the years. Interestingly, I’m not sure there is a good book on the subject, and it seems virtually unknown among practitioners who do not specialize in it. A lot of it is “just” graph theory or un/semi-supervised learning of various kinds.
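One of the simplest algorithms in this bag is label propagation: every node repeatedly adopts the most common label among its neighbors, and densely connected groups converge to a shared label. A self-contained sketch on an invented graph of two cliques joined by a bridge edge (real implementations randomize update order; this one uses synchronous sweeps with a deterministic tie-break to stay reproducible):

```python
# Sketch of community detection by label propagation: nodes repeatedly
# adopt the most common label among their neighbors.
# Toy graph (invented): two 4-cliques joined by a single bridge edge.
from collections import Counter

edges = ([(a, b) for a in range(4) for b in range(a + 1, 4)] +     # clique A
         [(a, b) for a in range(4, 8) for b in range(a + 1, 8)] +  # clique B
         [(3, 4)])                                                 # bridge

neighbors = {n: set() for n in range(8)}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

labels = {n: n for n in neighbors}   # start: every node is its own community
for _ in range(5):                   # synchronous sweeps; converges quickly here
    counts = {n: Counter(labels[m] for m in neighbors[n]) for n in neighbors}
    # deterministic tie-break: smallest label among the most common ones
    labels = {n: min(l for l in c if c[l] == max(c.values()))
              for n, c in counts.items()}

print(labels)   # the two cliques end up carrying two distinct labels
```

No training data, no parameter tuning: community structure falls out of the graph topology alone, which is what makes this family of techniques interesting for the surveillance applications discussed above.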


4) Human-computer interfaces are getting better. Very often a machine learning algorithm acts more like a filter that sends a vastly smaller list of problems to human analysts to solve. Palantir originated to do stuff like this, and while very little work on human-computer interfaces is open source, the software is pretty good at this point.

5) Labels are becoming ubiquitous. Most people do supervised learning, which … requires labels for supervision. Unfortunately, with various kinds of cookies out there, people using nerd dildos for everything, networked GPS, IoT, radio tags and so on, there are labels for all kinds of things which didn’t exist before. I’m guessing that as of now, or very soon, you won’t need to be a government agency to track individuals in truly Orwellian ways based on the trash data in your various devices; you’ll just need a few tens of millions of dollars’ worth of online ad company. Pretty soon this will be offered as a service.


Ignorance of these topics is keeping us safe

1) Database software is crap. Databases are … OK for some purposes; they’re nowhere near their theoretical capabilities in solving these kinds of problems. Database researchers are, oddly enough, generally not interested in solving real data problems. So you get mediocre crap like Postgres: bleeding-edge designs from the 1980s. You have total horse shit like Spark, laughably insane things like Hive, and … sort of OK designs like Bigtable. These will keep database engineers and administrators employed for decades to come, and prevent the solution of all kinds of important problems. There are people and companies out there that know what they’re doing. One to watch is 1010data; people who understand basic computing facts, like “latency.” Hopefully they will be badly managed by their new owners; their engineering team is probably the best positioned to beat this challenge.

The problem with databases is multifold: getting at the data you need is important, and keeping it close to the learning algorithms is also important. Neither is done well by any existing publicly available database engine. Most of what exists in terms of database technology is suitable for billing systems, not data science. Usually people build custom tools to solve specific problems, like the high frequency trading guys who built custom data tee-offs and backtesting frameworks instead of buying a more general tool like Kx. This is fine by me; perpetual employment. Lots of companies do have big data storages, but most of them still can’t get at their data in any useful way. If you’ve ever seen these things, and actually did know what you were doing, even at the level of a 1970s DBA, you would laugh hysterically. Still, enough spergs have built pieces of Kx-type things that eventually someone will get it right.


2) Database metadata is hard to deal with. One of the most difficult problems for any data scientist is the data preparation phase. There’s much to be said about preparation of data, but one of the most important tasks in preparing data for analysis is joining data gathered in different databases. The very simple example is the data from the ad server and the data from the sales database not talking to each other. So, when I click around Amazon and buy something, the imbecile ad server will continue to serve me ads for the thing that Amazon knows it has already sold me. This is a trivial example: one that Amazon could solve in principle, but in practice it is difficult and hairy enough that it isn’t worth the money for Amazon to fix (I have a hack which fixes the ad serving problem, but it doesn’t solve the general problem). This is a pervasive problem, and it’s a huge, huge thing preventing more data from being used against the average individual. If “AI” were really a thing, this is where it would be applied. This is actually a place where machine learning potentially could be used, but I think there are several reasons it won’t be, and this will remain a big impediment to tracking and privacy invasions in 20 years. FWIIW, back to my ezpass license-plate-photographer example: sticking a billing system together with at least two government databases per state in which something like ezpass operates (unless they all used the same system, which is possible) was a clever piece of work which hits this bullet point.
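The ad-server/sales mismatch above boils down to a join across two key spaces with no shared identifier. A toy sketch of why the join fails in practice; all the records and the `cookie_to_account` mapping are invented, and that mapping is precisely the piece that usually doesn’t exist:

```python
# Toy sketch of the cross-database join problem: an ad server and a
# sales system log the same user under different keys, so a naive join
# misses the purchase and the ads keep serving. Data is invented.

ad_impressions = [{"cookie_id": "c123", "product": "toaster"},
                  {"cookie_id": "c456", "product": "kettle"}]
sales = [{"account_id": "a999", "product": "toaster"}]

# The missing piece in practice: a mapping between the two key spaces.
# Without it (the common case), the lookup below fails for everyone
# and the already-sold toaster keeps getting advertised.
cookie_to_account = {"c123": "a999"}   # assume some identity resolution

purchased = {(s["account_id"], s["product"]) for s in sales}
still_serving = [ad for ad in ad_impressions
                 if (cookie_to_account.get(ad["cookie_id"]), ad["product"])
                 not in purchased]

print(still_serving)   # only the kettle ad survives; the toaster ad is culled
```

Building and maintaining `cookie_to_account` across dozens of internal systems is the hairy, expensive part; the join itself is trivial once you have it.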

3) Most commonly used forms of machine learning require many examples. People have been concentrating on deep learning, which almost inherently requires many, many examples. This is good for the privacy minded; most data science teams are too dumb to use techniques which don’t require a lot of examples. These techniques exist; some of them have for a long time. For the sake of this discussion, I’ll call these “sort of like Bayesian,” which isn’t strictly true, but which will shut people up. I think it’s great the average sperglord is spending all his time on deep learning, which is 0.2% more shiny, assuming you have Google’s data sets. If a company like Google had techniques which required few examples, they’d actually be even more dangerous.
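To make the “sort of like Bayesian” point concrete: a multinomial Naive Bayes classifier with Laplace smoothing does useful work with a handful of labeled examples, where a deep net would be hopeless. The four-document corpus below is invented for illustration:

```python
# Sketch of a few-example "sort of Bayesian" classifier: multinomial
# Naive Bayes with Laplace smoothing. The tiny corpus is invented.
import math
from collections import Counter

train = [("spam", "free money now"), ("spam", "free offer now"),
         ("ham", "meeting at noon"), ("ham", "lunch at noon")]

class_docs = Counter(label for label, _ in train)
word_counts = {c: Counter() for c in class_docs}
for label, text in train:
    word_counts[label].update(text.split())

vocab = {w for c in word_counts for w in word_counts[c]}

def log_posterior(text, label):
    """log P(label) + sum of log P(word|label), Laplace-smoothed."""
    logp = math.log(class_docs[label] / len(train))
    total = sum(word_counts[label].values())
    for w in text.split():
        logp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return logp

def classify(text):
    return max(class_docs, key=lambda c: log_posterior(text, c))

print(classify("free money"))     # -> spam
print(classify("noon meeting"))   # -> ham
```

Four training documents, no GPUs, and it generalizes to unseen word combinations; this is the sort of small-data capability that would make a data-rich company genuinely scary.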

4) Most people can only do supervised learning. (For that matter, non-batch learning terrifies most “data scientists,” just like Kalman filters terrify statisticians, even though they are the same damn thing as linear regression.) There is some work on stuff like reinforcement learning being mentioned in the funny papers. I guess reinforcement learning is interesting, but it is not really all that useful for anything practical. The really interesting stuff is semi-supervised, unsupervised, online and weak learning. Of course, all of these things are actually hard, in that they mostly do not exist as prepackaged tools in R that you can use in a simple recipe. So, the fact that most domain “experts” are actually kind of shit at machine learning is keeping us safe.
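The Kalman-filter-is-linear-regression crack is easy to demonstrate numerically: a Kalman filter with a static state and no process noise, fed noisy observations of a constant, recursively reproduces the batch least-squares estimate (here, the sample mean). The measurements are invented; the vague prior stands in for having no initial information:

```python
# Sketch of the Kalman/regression kinship: a Kalman filter with static
# state x_k = x_{k-1} and no process noise, observing z_k = x_k + noise,
# converges to the ordinary least-squares estimate (the sample mean).

measurements = [2.1, 1.9, 2.0, 2.2, 1.8]   # invented noisy readings

x, P = 0.0, 1e6          # vague prior: huge initial variance
R = 1.0                  # measurement noise variance
for z in measurements:
    K = P / (P + R)      # Kalman gain
    x = x + K * (z - x)  # nudge estimate toward the new measurement
    P = (1 - K) * P      # shrink uncertainty after each observation

batch_ols = sum(measurements) / len(measurements)   # least squares = mean
print(x, batch_ols)      # the two estimates agree (up to the vague prior)
```

The online version never stores past data and can be extended to a drifting state by adding process noise; that dynamic model is the only real difference from offline regression.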



A shockingly sane exposition of what to expect from machine learning, which I even more shockingly found on a VC’s website:


20 Responses


  1. Robert Smith said, on July 9, 2018 at 11:06 am

    Scott, thank you for another fascinating post. For some reason it gave me a desire to try and understand a bit about Kalman filters. It’s not and never has been my field but back in the 80’s I almost met someone who worked on them. I was working on Prolog programs to understand if planes were doing suspicious things like firing missiles and dispensing chaff or flares, and he was supposedly looking at integrating TV, IR and radar data to track and predict interesting things in the environment which would feed into my stuff. I say supposedly because I only met him once after a year – he’d taken the trouble to find out which building I was in, and came to say goodbye because he was leaving the company that day. Anyway, I shall definitely have a proper read later on – I did try just now but after checking that bloviate really exists I got carried away on other words first cited in 1845, followed by 13th century words followed by the Merriam-Webster API for returning these in JSON.

    • Scott Locklin said, on July 11, 2018 at 3:05 am

Ya, that’s the kind of thing Kalman can be used for. I first encountered the idea building an inertial nav thing for G3 cell phones.

  2. Raul Miller said, on July 9, 2018 at 3:07 pm

    I did a quick search on practical uses of concepts from chess programs, and found this: http://www.cse.nd.edu/~cseprog/proj99/final/htm/kmurphy1.ttt.htm

    (Ok, technically, I was searching on uses of alpha/beta pruning – but that’s one of the mechanisms used by chess playing programs…) Anyways, history is a messy subject, and cause-effect trains are at least partially perceptual rather than repeatable.

    • Scott Locklin said, on July 11, 2018 at 3:33 am

      I’m sure humanity learned a thing or two from computer chess; pretty sure none of it amounts to brain in a can though.

FWIIW I started looking at computer chess a couple of years ago, hoping to solve a chess-like game from antiquity (Hnefatafl and relatives). One of the neat things about it is the data structures which represent the board.

      There was this great resource here at wikispaces which I guess is moving here:

      Anyway, when I get the time, I will fool around with this sort of thing myself. Probably in J.

  3. PuceNoise said, on July 9, 2018 at 5:17 pm

    But the FAANMG stocks (Facebook Amazon Apple Netflix Microsoft Google) pushed the S&P up 2.66%, while the other 494 stocks were down 0.4% since November 2017, and at least three of those are authentic AI companies.

    Clearly, Scott has sour grapes over his poor investment decisions, as this cannot possibly be a bubble.

    • Scott Locklin said, on July 10, 2018 at 2:35 am


      Precisely zero of those are authentic “AI” companies. FWIIW I don’t invest in US stocks.

      • pucenoise said, on July 11, 2018 at 3:02 pm

        /sarcasm 😉

  4. maggette said, on July 11, 2018 at 12:45 pm

Interesting article. And also a quite nice post by Ben Evans. Liked the “robot” vs “fridge” analogy. Could you expand a bit on why 1010data has a chance of becoming a game changer?

    And please stop bashing Spark and Postgres. I contributed to the former so you are hurting my feelings:).

Kidding aside, I think Spark pretty much delivers what it is supposed to do. If you ignore MLlib and Spark Streaming (which are either crap or broken by design), it is quite decent.

I re-train close to 12 thousand ML models using Spark and scikit once a month. I process a couple of TB every day in several ETL projects for some DAX customers, in a couple of hours (only 30% of the pre-existing MapReduce solution). My customers are also very happy with Impala on Kudu. It might get faster…but not much cheaper:)

  5. Iron Sheik Klass said, on July 12, 2018 at 9:44 pm

    It may have gotten better since 2014, but it’s still made by Jabrones.

    • maggette said, on July 15, 2018 at 3:17 pm

Since I am a Jabrone, I would never ever dare to talk back to the Iron Sheik 🙂

  6. S3 said, on July 21, 2018 at 3:58 am

    You say databases are crap. I am just a javascript monkey, so what do I know? But I would like to know your opinion on this guy’s claims:


    • Scott Locklin said, on July 22, 2018 at 3:06 am

      “the underlying data types have no order and are not set-wise shardable, eliminating almost every technique a typical database has for efficiently representing and manipulating data models at scale.” -lol, yeah, whatever dude; I guess you never heard of kd-trees.

      “Continuously insert a billion GeoJSON documents per minute through disk. This includes parsing JSON and indexing polygons in 3-space.” -pretty sure I can do that even in a language that is shit at parsing JSON.

      He’s probably got something interesting, but it reads like he’s never read Hanan Samet’s books. To be fair, I have never written one of what he’s selling there, and have no idea what scales he’s talking about, so it’s possible he’s talking over my head here, but I doubt it. I’m a metric space guy; always looking for cheap lookups.

  7. TonyC said, on July 23, 2018 at 8:42 pm

enlighten an old fart who (fulfilling your cliche) wrote his own half-assed Kx clone in APL. …
… how is Kalman filtering like linear regression? (and please be gentle)

    • Scott Locklin said, on July 23, 2018 at 8:56 pm

Linear and Gaussian online estimates of state based on minimizing least squares, rather than linear and Gaussian offline estimates of state by minimizing least squares. The only difference is that the model can dynamically change in a Kalman setting.

      This is pretty good:

      Click to access Sorenson1970.pdf

  8. Rodrigo Rivera said, on August 11, 2018 at 9:03 am

The field you mention in point 3 is actually known as network science. There is some good material available; for example https://www.amazon.com/Network-Science-Applications-Ted-Lewis/dp/0470331887/
There is still little work done on machine learning for network science, but there are works such as https://www.amazon.com/Machine-Learning-Complex-Networks-Christiano/dp/3319172891/ More broadly, it would fit into “clustering.”

Most of the work is done on supervised learning, because that’s where the data is available. Online learning is very exciting and getting a lot of attention, but unless you’re Amazon or some other player with significant web traffic, it is hard to put your models to the test, and thus the community has to work on developing theoretical guarantees. Unsupervised learning comes and goes; about 10 years ago Bayesian nonparametrics was a huge thing, and now it has become an almost obscure niche. At the moment you can see the deep learning community shifting its focus again to unsupervised techniques. As you say, for most interesting problems you do not have many labeled examples.

    • Scott Locklin said, on August 11, 2018 at 3:59 pm

      I’m pretty sure nobody wants the label “network science” any more. Barabasi and company are lookin’ pretty mountebank at this point.
      Never looked at the book you suggest; most of the stuff I am thinking of probably falls into graph theory or applied topology.

      • pucenoise said, on August 11, 2018 at 9:05 pm

        I think Bayesian non-parametrics are being revived in a variety of fields, such as biochemistry and economics, but how useful they are is another question.

        Not all network science is fluff, Mark Newman has done some great stuff.

      • Adithya said, on October 27, 2018 at 9:59 am

        On a related note, what do you think of “complexity science”?

  9. S3 said, on August 21, 2018 at 9:33 am

    You said database researchers are not interested in solving real world problems. What about this paper:

    The Complete Story of Joins (in HyPer)

    Click to access b9c6f3c71e6c10f12e45f600af72c2cd57eb.pdf

  10. Javier Acuna said, on October 18, 2019 at 9:16 am

    Hi Scott,

I found your sentence “Similarly politicians and marketers prefer talking about ‘AI’ to issues in data science which may cause real problems in society” really insightful. Either consciously or not, the fixation on AGI is distracting from more immediate dangers; until now I didn’t make the connection.

    Cheers from Germany,
