Locklin on science

Timestamps done right

Posted in Design, Kerf by Scott Locklin on January 19, 2016

(Crossposted to Kerf blog)

I’ve used a lot of tools meant for dealing with time series. Heck, I’ve written a few at this point. The most fundamental piece of dealing with timeseries is a timestamp type. Under the covers, a timestamp is just a number which can be indexed. Normal humans have a hard time dealing with a number that represents seconds of the epoch, or nanoseconds since whenever. Humans need to see things which look like the ISO format for timestamps.
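To make that concrete, here is a toy illustration in base R (not Kerf; just an illustration): the machine-friendly form is seconds since the Unix epoch, and the human-friendly form is the ISO rendering of the same instant.

x <- 1325376000                              # seconds since 1970-01-01T00:00:00 GMT
format(as.POSIXct(x, origin = "1970-01-01", tz = "GMT"), "%Y-%m-%dT%H:%M:%S")
# [1] "2012-01-01T00:00:00"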

Very few programming languages have timestamps as a native type. Some SQLs do, but SQL isn’t a very satisfactory programming language by itself. At some point you want to pull your data into something like R or Matlab and deal with your timestamps in an environment that you can do linear regressions in. Kerf is the exception.

Consider the case where you have a bunch of 5 minute power meter readings (say, from a factory) with timestamps. You’re probably storing your data in a database somewhere, because it won’t fit into memory in R. Every time you query your data for a useful chunk, you have to parse the stamps in the chunk into a useful type; timeDate in the case of R. Because the guys who wrote R didn’t think to include a useful timestamp data type, the DB package doesn’t know about timeDate (it is an add-on package), and so each timestamp for each query has to be parsed. This seems trivial, but a machine learning gizmo I built was entirely performance bound by this process. Instead of parsing the timestamps efficiently, once, as they go into the database, and then passing the timestamp type around as if it were an int or a float, you end up parsing them every time you run the forecast, and in a fairly inefficient way. I don’t know of any programming languages other than Kerf which get this right. I mean, just try it in Java.
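You can get a feel for the cost in R itself. This is a rough, hypothetical benchmark (nothing to do with the original gizmo), but the shape of the result is typical: parsing a pile of timestamp strings is far slower than reinterpreting timestamps that are already stored as plain numbers.

stamps <- rep("2012-01-01 00:00:00", 100000)
system.time(parsed <- as.POSIXct(stamps, tz = "GMT"))     # parse strings on every query: slow
secs <- as.numeric(parsed)                                # what an integrated system would hand you instead
system.time(again <- as.POSIXct(secs, origin = "1970-01-01", tz = "GMT"))  # reinterpret numbers: fast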

Kerf gets around this by integrating the database with the language.

Kerf also has elegant ways of dealing with timestamps within the language itself.

Consider a timestamp in R’s timeDate. The add-on packages timeDate + zoo or xts are my favorite way of doing such things in R, and the combination I know best, so this will be my comparison class.


 

require(timeDate) 
a=as.timeDate("2012-01-01")
a
GMT
[1] [2012-01-01]

 

In Kerf, we can just write the timestamp down:


 

a:2012.01.01
  2012.01.01

 

A standard problem is figuring out what date falls some interval away from a given day. In R, you have to know that it’s basically storing seconds, so:


 

as.timeDate("2012-01-01") + 3600*24
GMT
[1] [2012-01-02]

 

In Kerf, you just tell it to add a day:


 

2012.01.01 + 1d
  2012.01.02

 

This gets uglier when you have to do something more complex. Imagine you have to add a month and a day. To do this in general in R is complex and involves writing functions.
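For instance, here is a minimal sketch of “add a month and a day” using base R’s Date class (a hypothetical helper; months have no fixed length in seconds, so you have to lean on something month-aware like seq(), and even that behaves oddly near the ends of months):

add_month_day <- function(d) {
  seq(d, by = "1 month", length.out = 2)[2] + 1   # step forward one calendar month, then one day
}
add_month_day(as.Date("2012-01-01"))
# [1] "2012-02-02"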

In Kerf, this is easy:


 

2012.01.01 + 1m1d
  2012.02.02

 

Same story with hours, minutes and seconds:


 

2012.01.01 + 1m1d + 1h15i17s
  2012.02.02T01:15:17.000

 

And if you have to find a bunch of times which are a month, day, hour and 15 minutes and 17 seconds away from the original date, you can do a little Kerf combinator magic:


 

b: 2012.01.01 + (1m1d + 1h15i17s) times mapright  range(10)
  [2012.01.01, 2012.02.02T01:15:17.000, 2012.03.03T02:30:34.000, 2012.04.04T03:45:51.000, 2012.05.05T05:01:08.000, 2012.06.06T06:16:25.000, 2012.07.07T07:31:42.000, 2012.08.08T08:46:59.000, 2012.09.09T10:02:16.000, 2012.10.10T11:17:33.000]

 

The mapright combinator applies the verb and the noun on its left to each element of the vector on its right. So you’re multiplying (1m1d + 1h15i17s) by range(10) (which is the usual [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), then adding the result to 2012.01.01.

You can’t actually do this in a simple way in R. Since there is no convenient token to add a month, you have to generate a time sequence with monthly periods. The rest is considerably less satisfying as well, since you have to spell out the day, hour, minute and second offsets in raw seconds. In my opinion, this is vastly harder to read and maintain than the Kerf line.


 

b=timeSequence(from=as.timeDate("2012-01-01"),length.out=10,by="month") + (3600*24 + 3600 + 15*60 + 17) *0:9
 [2012-01-01 00:00:00] [2012-02-02 01:15:17] [2012-03-03 02:30:34] [2012-04-04 03:45:51] [2012-05-05 05:01:08] [2012-06-06 06:16:25] [2012-07-07 07:31:42] [2012-08-08 08:46:59] [2012-09-09 10:02:16] [2012-10-10 11:17:33]

 

This represents a considerable achievement in language design: an APL which is easier to read than a commonly used programming language for data scientists. I am not tooting my own horn here; Kevin did it.

If I wanted to know what week or second these times occur at, I can subset the implied fields in a simple way in Kerf:


 

b['week']
  [1, 6, 10, 15, 19, 24, 28, 33, 37, 42]
b['second']
  [0, 17, 34, 51, 8, 25, 42, 59, 16, 33]

 

I think the way to do this in R is with the “.endpoints” function, but it doesn’t seem to do the right thing: it returns the positions of the last observation in each period, rather than the week or second itself.


 

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04 LTS
other attached packages:
[1] xts_0.9-7         zoo_1.7-12        timeDate_3012.100

.endpoints(b, on="week")
 [1]  0  1  2  3  4  5  6  7  8  9 10
.endpoints(b, on="second")
 [1]  0  1  2  3  4  5  6  7  8  9 10

 

You can cast to a POSIXlt and get the second at least, but no week of year.


 

as.POSIXlt(b)$week
NULL
as.POSIXlt(b)$sec
 [1]  0 17 34 51  8 25 42 59 16 33

 

Maybe it can be done using one of the other date classes, like as.Date…


 

# week of year: coerce the times to Date, shift by 3 days, and format with %U (weeks starting on Sunday)
weekGizmo<-function(x){ as.numeric(format(as.Date(time(x))+3,"%U")) }

 

Not exactly clear, but it does work. If you have ever done things with time in R, you will have had an experience like this. I’m already reaching for a few different kinds of date and time objects in R. There are probably a dozen kinds of timestamps in R which do different subsets of things, because whoever wrote them wasn’t happy with what was available at the time. One good one is better. That way when you have some complex problem, you don’t have to look at 10 different R manuals and add-on packages to get your problem solved.

Here’s a more complex problem. Let’s say you had a timeseries a million points long with some odd periodicities, and you want to find the values which occur at week 10, second 33 of any hour.


 

ts:{{pwr:rand(1000000,1.0),time:(2012.01.01 + (1h15i17s times mapright  range(1000000)))}}
timing(1)
select *,time['second'] as seconds,time['week'] as weeks from ts where time['second']=33 ,time['week'] =10

┌────────┬───────────────────────┬───────┬─────┐
│pwr     │time                   │seconds│weeks│
├────────┼───────────────────────┼───────┼─────┤
│0.963167│2012.03.01T01:40:33.000│     33│   10│
│0.667559│2012.03.04T04:57:33.000│     33│   10│
│0.584127│2013.03.06T05:06:33.000│     33│   10│
│0.349303│2013.03.09T08:23:33.000│     33│   10│
│0.397669│2014.03.05T01:58:33.000│     33│   10│
│0.850102│2014.03.08T05:15:33.000│     33│   10│
│0.733821│2015.03.03T22:50:33.000│     33│   10│
│0.179552│2015.03.07T02:07:33.000│     33│   10│
│       ⋮│                      ⋮│      ⋮│    ⋮│
└────────┴───────────────────────┴───────┴─────┘
    314 ms

 

In R, I’m not sure how to do this in an elegant way … you’d have to use a function that outputs the week of year, then something like the following (which, FWIW, is fairly slow) to do the query.


 

require(xts)
ts=xts(runif(1000000), as.timeDate("2012-01-01") + (3600 + 15*60 + 17) *0:999999)
weekGizmo<-function(x){ as.numeric(format(as.Date(time(x))+3,"%U")) }
queryGizmo <- function(x) { 
 wks= weekGizmo(x)              # week of year of each observation
 secs=as.POSIXlt(time(x))$sec   # second within the minute
 cbind(x,wks,secs)->newx
 newx[(wks==10) & (secs==33)]   # keep the rows at week 10, second 33
}
system.time(queryGizmo(ts))
   user  system elapsed 
  4.215   0.035   4.254

 

The way R does timestamps isn’t terrible for a language designed in the 1980s, and the profusion of time classes is to be expected from a language that has been around that long. Still, it is 2016, and there is nothing appreciably better out there other than Kerf.

Lessons for future language authors:

(continues at official Kerf blog)

 


Putin’s nuclear torpedo and Project Pluto

Posted in big machines by Scott Locklin on December 31, 2015

There was some wanking among the US  foreign policy wonkosphere about the  nuclear torpedo “accidentally” mentioned in a Russian news video.


The device described in the leak is a megaton class long range nuclear torpedo. The idea is, if you build a big enough bomb and blow it off in coastal waters, it will create a 1000 foot high nuclear tidal wave that will physically wipe out coastal cities and Naval installations, as well as pollute them with radioactive fallout. If the Rooskies are working on such a thing, rather than trolling the twittering pustules in our foreign policy “elite,” it is certainly nothing new. Such a device was considered in the Soviet Union in the 1950s, and the original November class submarine design (the first nuclear sub not built by the US) was designed around it. It was called the T-15 “land attack” torpedo. Oddly, this idea originated with America’s favorite Soviet dissident, Andrei Sakharov, when he was thinking about delivery systems for his 100 megaton class devices. People forget that young Sakharov was kind of a dick. Mind you, the Soviet Navy sank this idea, in part because it only had a range of 25 miles (meaning it was basically a suicide mission), but also, according to Sakharov’s autobiography, because, as some grizzled old Admiral put it, “we are Navy; we don’t make war on civilian populations…”


Notice the big hole in the front: that’s where the original doomsday torpedo went

The gizmo shown in this recent Russian leak is  a modern incarnation of the T-15 land attack torpedo without the Project 627/November class submarine delivery system. Same 1.6 meter caliber, megaton class warhead and everything. The longer range  of 5000 miles versus the 25 of the T-15 could be considered an innovation, and is certainly possible, but it only has tactical implications. From a strategic point of view: they had the same idea  years ago, for roughly the same reasons. Fifties era Soviet nuclear weapons delivery systems were not as reliable as American ones. In the 50s it was because Soviet bombers of the era were junk (mostly copies of the B-29). If they’re building this now, it’s because they’re worried about US missile defense.

 

Various analysts have been speculating that the thing is wrapped in cobalt or something to make it more dirty, because the rooskie power point talks about area denial. While it’s entirely possible, these dopes posing as analysts have some weird ideas about what a nuclear weapon is, and what it does. Nobody seems to have noticed that there’s a nuclear reactor pushing the thing around; presumably one using liquid metal coolants like the Alfa class submarines. I’m pretty sure lighting off a nuke next to a nuclear reactor will make some nasty and long lived fallout. At 1 megaton, just the bomb casing and tamper makes a few hundred pounds of nasty long lived radioactive stuff. The physics package the Russians would likely use (SS-18 Mod-6 rated at 20Mt, recently retired from deployment atop SS-18 Satan missiles) is a fission-fusion-fission bomb, and inherently quite “dirty” since most of the energy is released from U-238. Worse still: blowing up a 1-100 megaton device in coastal mud will make lots of nasty fallout. Sodium-24 (from the salt in the water) is deadly. Half life is around 15 hours, meaning it would be clear in a few days, but being around it while it is active…. Then there is sodium-22, which has a half life of two and a half years; nukes in the water make less of this than sodium-24, but, well, go look it up. There is all kinds of other stuff in soil and muck which makes for unpleasant fallout. There’s an interesting book (particularly the 1964 edition) called “The Effects of Nuclear Weapons” available on archive.org. Chapter 9 shows some of the fallout patterns you can expect from blowing something like this up. Or, you could use this calculator thing; a 1Mt device makes a lethal fallout cloud over thousands of square kilometers.


 

The twittering pustules who pass for our foreign policy elite are horrified, just horrified that the rooskies would spook us with such a device.  As if this were somehow a morally inferior form of megadeath to lobbing a couple thousand half megaton nuclear missile warheads at your least favorite country. Apparently this is how civilized countries who do not possess enemies with a plurality of coastal cities exterminate their foes. I don’t understand such people. Nuclear war is bad in general, m’kay? Mass slaughter with a nuclear torpedo is not morally inferior to mass slaughter with an ICBM. More to the point, getting along with Russians is easy and vodka is cheaper and more effective than ABM (and doomsday torpedo) defenses. If we hired actual diplomats and people who study history, instead of narcissistic toadies and sinister neocon apparatchiks to labor in our foreign services … maybe the Russians wouldn’t troll us with giant nuke torpedoes.

Doomsday engineering is often stranger than any science fiction. The things they built back in the cold war were weird.  While the US never admitted to building any 100 megaton land torpedoes (probably because Russia doesn’t have as many important coastal cities as the US does), we certainly worked on some completely bonkers nuclear objects.


Imagine a locomotive-sized cruise missile, powered by a nuclear ramjet, cruising at Mach 3 at tree level. The cruise missile showers the landscape with two dozen hydrogen bombs of the megaton class, or one big one in the 20 megaton class. When it has finished its job of raining electric death mushrooms all over the enemy, it cruises around farting deadly radioactive dust and flattening cities with the sheer power of the sonic boom… for months. In principle, such a device can go on practically forever. If I were to use such a contraption as a plot device, you’d probably think it was far-fetched. Such a thing was almost built by the Vought corporation 50 years ago. Click on the link. The Vought corporation thought it was cool enough to brag about it on their website (please don’t take it down guys; anyway if you do, I’ll put it back up).


65,000 lbs, 80 feet long, with the terrifying code name SLAM (Supersonic Low Altitude Missile), or … “project Pluto.” This thing was perilously close to being built. They tested the engines at full scale and full power at Jackass Flats, and the guidance system was good enough that they used essentially the same thing in the Tomahawk cruise missile. The problem wasn’t technical … but how to test it? The fact that it was an enormous nuclear ramjet made it inherently rather dangerous. Someone suggested flight testing it on a tether in the desert. That would have been quite a tether to hold a Mach 3 locomotive in place. Fortunately, we had rocket scientists who built ICBMs that worked. Of course, having an ICBM class booster would have been necessary to make the thing work in the first place (nuclear ramjets don’t start working until they’re moving at a decent velocity), which makes you wonder why they ever thought this was a good idea. Probably because people who dream these things up are barking loonies. Not that I wouldn’t have worked on this project, given the chance.


The ceramic matrix for the reactor was actually made by the Coors Porcelain company. Yes, the same company that makes shitty beer has been (and continues to be) an innovator in ceramics; this originated with the founder needing good materials for beer bottles, and later inventing beer cans. According to Jalopnik, they used exhaust header paint ordered from Hot Rod magazine to protect some of the electronic components. Apparently when they lit the reactor off at full power for the first time, they got so shitfaced that the project director (Merkle; yes, nano-dude’s father) had vitamin B shots issued to the celebrants the following day. Yes, I would have worked on project SLAM: as far as I can tell, it was the most epic redneck project ever funded by the US government. Not that we should have built such a thing, but holy radioactive doomsday smoke, Batman, it would have been a fun job for a few years.

I wouldn’t blame the Russians if they wanted to build a giant nuclear torpedo-codpiece when the US sends Russophobic dipshits like Michael McFaul to represent us in Russia (look at his twitter feed; it is completely bonkers). I certainly hope they don’t build such a thing. It would be nice if the US stopped screwing around with crap like that as well. Pretty sure it’s a giant troll, but the T-15 and Project Pluto were not.

Interesting pdf on Project Pluto:

http://www.amug.us/downloads/Pluto-Phoenix%20Facility%20at%20the%20NTS.pdf

Edit to add: a fascinating Russian Wikipedia page MichaelMoser123 posted to Hacker News:

https://ru.wikipedia.org/wiki/%D0%A1%D1%82%D0%B0%D1%82%D1%83%D1%81-6

An introduction to Kerf

Posted in Design, Kerf by Scott Locklin on December 15, 2015

My pals generally act impressed when I show them my noodlings in the J language. I’m pretty sure they’re impressed with the speed and power of J because it is inarguably fast and powerful, but I’ve also always figured they mostly saw it as an exercise in obfuscated coding; philistines! While I can generally read my own J code, I must confess some of the more dense tacit style isn’t something I can read naturally without J’s code dissector. I have also been at it for a while, and for a long time went on faith that this skill would come. Notation as a tool of thought is one of the most powerful ideas I’ve come across. The problem becomes talking people into adopting your notation. Building important pieces of your company around a difficult mathematical notation is a gamble which most companies are not willing to take.

Everyone knows about Arthur Whitney and K because of Kx Systems’ database, KDB. Having fiddled around with KDB, and with Eric Iverson and Jsoftware’s Jd, the mind-boggling power of these things on time series and data problems in general makes me wonder why everyone doesn’t use these tools. Then I remember the first time I looked at things like this:

 

wavg:{(+/x*y)%+/x}        // K version
wavg=: +/ .* % +/@]        NB. J version

 

Oh yeah, that’s why J and K adoption is not universal. I mean, I can read it. That doesn’t mean everyone can read it. And I certainly can understand people’s reluctance to learn how to read things like this. It’s not easy.

For the last year and a half, my partner Kevin Lawler has been trying to fix this problem. You may know of him as the author of Kona, the open source version of K3. Kevin’s latest creation is Kerf. Kerf is basically an APL that humans can read, along with one of the highest performance time series databases money can buy. I liked it so much, I quit my interesting and promising day job doing Topological Data Analysis at Ayasdi, and will be dedicating the next few years of my life to this technology.

We know the above code fragments are weighted averages, but mostly because that’s what they’re called in the verb definitions. Mischievous programmers (the types who write code in K and J) might have called them d17 or something. Kerf looks a lot more familiar.

 

function wavg(x,y) {
  sum(x*y) / sum(x)
}

 

This is cheating a bit, since K & J don’t have a sum primitive, but it begins to show the utility of organizing your code in a more familiar way. Notice that x * y is done vector-wise; no stinking loops necessary. Expressing the same thing in more primitive Kerf functions looks like this:

 

function wavg(x,y) {
  (+ fold x*y) / (+ fold x)
}

 

In J and K, the ‘/’ adverb sticks the verb on its left between all the elements of the array on its right (so +/ 1 2 3 means 1+2+3). In Kerf, we call that operation “fold” (we also call adverbs “combinators,” which we think is more descriptive of what they do in Kerf; I think John Earnest came up with the term).

You could also write the whole thing out in terms of for loops if you wanted to, but fold is easier to write, easier to read, and runs faster.
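If the fold idiom is unfamiliar, here is roughly the same contrast sketched in R (illustrative only; the function names are made up, and Reduce is R’s fold):

# the weighted average spelled out with an explicit loop
wavg_loop <- function(x, y) {
  num <- 0; den <- 0
  for (i in seq_along(x)) {
    num <- num + x[i] * y[i]
    den <- den + x[i]
  }
  num / den
}

# folding `+` over the vectors, as in the Kerf version
wavg_fold <- function(x, y) Reduce(`+`, x * y) / Reduce(`+`, x)

wavg_loop(c(1, 2, 3), c(10, 20, 30))   # 23.33333
wavg_fold(c(1, 2, 3), c(10, 20, 30))   # 23.33333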

There are a few surprises with Kerf. One is the assignment operator.

a: range(5);
b: repeat(5,1);
KeRF> b
b
  [1, 1, 1, 1, 1]
KeRF> a
a
  [0, 1, 2, 3, 4]

 

Seems odd. On the other hand, it looks a lot like json. In fact, you can compose things into a map in a very json-like syntax:

 

aa:{a: 1 2 3, 
    b:'a bit of data', 
    c:range(10)};

KeRF> aa['a']
aa['a']
  [1, 2, 3]

KeRF> aa
aa
  {a:[1, 2, 3], 
   b:"a bit of data", 
   c:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}

 

This seems like syntactic sugar, but it actually helps. For example, if I had to feed variable ‘aa’ to something that likes to digest json representations of data, it pops it out in ascii json:

 

json_from_kerf(aa)
  "{\"a\":[1,2,3],\"b\":\"a bit of data\",\"c\":[0,1,2,3,4,5,6,7,8,9]}"

 

OK, no big deal; a language that has some APL qualities which speaks json. This is pretty good, but we’d be crazy to attempt to charge money for something like this (Kerf is not open source; Kevin and I have to eat). The core technology is a clean APL that speaks json, but the thing which is worth something is the database engine. Tables in Kerf look like interned maps and are queried in the usual SQL way.

 

u: {{numbers: 19 17 32 8 2 -1 7, 
     strings: ["A","B","C","D","H","B","Q"]}}
select * from u where numbers>18

┌───────┬───────┐
│numbers│strings│
├───────┼───────┤
│     19│      A│
│     32│      C│
└───────┴───────┘

select numbers from u where strings="B"

┌───────┐
│numbers│
├───────┤
│     17│
│     -1│
└───────┘

 

Now the business with ‘:’ starts to make more sense. Since SQL is part of the language, the ‘=’ sign is busy doing equality tests, so it isn’t available for assignment. Your eyes don’t have to make out any contextual differences or look for ‘==’ versus ‘=’: everything with an ‘=’ is an equality test, and everything with a ‘:’ is setting a name somewhere.

Standard joins are available with left join:

 

v:{{a: 1 2 2 3, numbers: 19 17 1 99}}
left_join(v,u,"numbers")

┌─┬───────┬───────┐
│a│numbers│strings│
├─┼───────┼───────┤
│1│     19│      A│
│2│     17│      B│
│2│      1│   null│
│3│     99│   null│
└─┴───────┴───────┘

 

For timeseries, having a good time type is important: preferably first class, and with the ability to look at nanoseconds. So are asof joins.
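If asof joins are new to you: for each row of the left table, you take the most recent row of the right table at or before that row’s timestamp, rather than demanding exact equality on the key. Here is a rough base-R sketch of that lookup (a hypothetical helper, assuming the right table is a data frame with a sorted “date” column):

# for each left-hand time, pick the latest right-hand row at or before it
asof_pick <- function(left_times, right) {
  idx <- findInterval(left_times, right$date)   # 0 means nothing at or before this time
  out <- right[pmax(idx, 1), , drop = FALSE]
  out[idx == 0, ] <- NA                         # no earlier row: fill with NA
  out
}
# schematically: cbind(left, asof_pick(left$date, right)) gives the joined rows

Kerf’s asof_join below does this kind of lookup natively inside the database: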

 

qq:{{nums: range(10), 
    date: 1999.01.01+ (24 * 3600000000000)  * range(10), 
    strg:["a","b","c","d","e","f","g","h","i","j"]}}
vv:{{nums: 10+range(10), 
    date: 1999.01.01+ (12 * 3600000000000)  * range(10), 
    strg:["a","b","c","d","e","f","g","h","i","j"]}}

select nums,nums1,mavg(3,nums1),strg,strg1,date from asof_join(vv,qq,[],"date")

┌────┬─────┬────────┬────┬─────┬───────────────────────┐
│nums│nums1│nums11  │strg│strg1│date                   │
├────┼─────┼────────┼────┼─────┼───────────────────────┤
│  10│    0│     0.0│   a│    a│             1999.01.01│
│  11│    0│     0.0│   b│    a│1999.01.01T12:00:00.000│
│  12│    1│0.333333│   c│    b│             1999.01.02│
│  13│    1│0.666667│   d│    b│1999.01.02T12:00:00.000│
│  14│    2│ 1.33333│   e│    c│             1999.01.03│
│  15│    2│ 1.66667│   f│    c│1999.01.03T12:00:00.000│
│  16│    3│ 2.33333│   g│    d│             1999.01.04│
│  17│    3│ 2.66667│   h│    d│1999.01.04T12:00:00.000│
│  ⋮│    ⋮│      ⋮│   ⋮│   ⋮│                      ⋮│
└────┴─────┴────────┴────┴─────┴───────────────────────┘

 

Kerf is still young and occasionally rough around the edges, but it is quite useful as it exists now: our customers and partners think so anyway. The only things comparable to it from an engineering standpoint are the other APL-based databases, such as KDB and Jd. We think we have some obvious advantages in usability, and less obvious advantages in the intestines of Kerf. Columnar databases like Vertica and Redshift are great for some kinds of problems, but they don’t really compare: they can’t be extended in the same way that Kerf can, nor are they general purpose programming systems, which Kerf is.

We also have a lot of crazy ideas for building out Kerf as a large scale distributed analytics system. Kerf is already a suitable terascale database system; we think we could usefully expand out to hundreds of terabytes on data which isn’t inherently time oriented if someone needs such a thing. There is no reason for things like Hadoop and Spark to form the basis of large scale analytic platforms; people simply don’t know any better and make do with junk that doesn’t really work right, because it is already there.

You can download a time-limited version of Kerf from github here.

John Earnest has been doing some great work on the documentation as well.

I’ve also set up a rudimentary way to work with Kerf in emacs.

Also, for quick and dirty exposition of the core functions: a two page refcard

Keep up with Kerf at our company website:
www.kerfsoftware.com

Kerf official blog:
getkerf.wordpress.com

A visionary who outlined a nice picture of a sort of “Cloud Kerf”:
http://conceptualorigami.blogspot.com/2010/12/vector-processing-languages-future-of.html


Advice to a young social scientist

Posted in five minute university by Scott Locklin on August 28, 2015

A comment which woke me from my long nap:

” What areas of mathematics or technical knowledge would you consider necessary for a hedgie analyst or academic researcher in economics /pol science /anthropology / history? I’m not interested in bits, bolts, DNA or mechanical things, but would like to apply more rigor to social, business and economic problems. “

Simple answer: statistics (and ideas in probability). Not the baby stats rubbish where they give you a recipe and hope for the best. Not even the stuff they teach you in an experimental physics course: real statistics, like they use on Wall Street to make money.

If you want to be bleeding edge, or do some exploration on your own, there are interesting results in information theory and machine learning which can help you, but what will help you more than this is a deep understanding of plain old statistics. Frequentist, Bayesian, Topological; whatever: just learn some stats to the point where you understand how they work, what they’re good for and where they break down.

My formal training was in physics, where, generally speaking, statistical sophistication is fairly low. Physicists have the luxury of being able to construct experiments where the observation of one or two photons or some preposterously small amount of torque on a magnetometer is meaningful. Pretty much nobody but physicists have this luxury.

Physicists no longer have this luxury for the most interesting problems these days. Unfortunately nobody told them, which is why physics has been languishing in the swamplands, with “physicists” working on non-falsifiable noodle theory, cosmology and writing software for computer architectures which will probably never exist. I think it was Rutherford who said, “in science there is only physics, all the rest is stamp collecting.” When Rutherford said it, this was true: because nobody had bothered to invent statistics yet. Physics was the only real Baconian science.

Now, we have statistics: a flawed quasi-mathematical technique which is effectively how we know anything about everything that isn’t pre-1950s physics. Yes, yes, Disraeli and Mark Twain said there are “lies, damn lies and statistics.” They should have said “bad statistics,” but that’s all there was in those days. Before we had the adding machine, statistics was the purview of Gauss and people who mostly were not doing it right.

Guys like Fisher, Pearson (both of them), Kolmogorov, Neyman, de Finetti, Jeffreys, Savage, Cramér and the lot are as important to our understanding of the world as Heisenberg and Darwin. Indeed, at this point I would go so far as to say that statistics invented in the 1930s is arguably more important than physics done in the 1930s. Most of the useful new knowledge of the last 60 years is directly attributable to such men. They don’t get enough respect.

Kolmogorov getting respect

Doing statistics well is the essence of all useful social science. As you probably have noticed, most social science is not done well. Much of social “science” isn’t very scientific; it’s often merely ideological gorp. The statistics used in the social sciences (and biological sciences and drug discovery and …) is abused preposterously to the point where they appear to be mathematical and methodological jokes rather than results which must be taken seriously. If social sciences took themselves seriously, they would be sciences rather than shaggy dog stories.

Consider psychology: according to a recent Science article, the majority of results of a sample of psychology papers can’t be reproduced. Let that sink in for a moment: more than half the results of these psychology papers are anecdotes. Part of this is because the researchers in that field are quacks and morons. Part of it is because they are evil quacks and morons.  I sit in a cafe which is near the UC Berkeley psychology building, and often overhear conversations by professors, grad students and post-docs from this place. Once in a while I overhear something intelligent and salubrious. For example, I  was grateful to overhear a conversation about this paper a few months ago.

However, I have often heard learned psychology department dunderheads stating what the result of their paper will be, and instructing their underlings to mine the data for p-values. I suppose they may have thought themselves speaking over the heads of the rabble, since nobody else from their department was visible. Mind you, they did this in a public place, in a town which is filled to the nostrils with people with training in rigorous subjects, like, you know, me, the buxom Russian girl reading Dirac in the corner, the options trader eating a sandwich, and the girl pouring the coffee, who is studying mathematics. This indicates to me that such people are so abysmally stupid and unaware of their own deficiencies, they couldn’t achieve a scientific result if they actually tried to do so.  Have a click on this link for the UCB psychology department: at least two people on this list are cretinous scientific frauds. If the Science paper mentioned above is a representative sample, most of them are.

Should I ever strike it rich enough to endow a foundation, I would pay legions of trained statisticians to go through the literature and eviscerate the mountains of bad “research” and arrive at the truth. If Universities were interested in advancing human knowledge, rather than advancing a tenured circle jerk which fields a football team, they’d fund entire departments of people who do nothing but act as Inquisitors about their research findings. Meanwhile I will have to content myself with instigating ambitious young people to arm themselves with the best statistical weapons they can muster, and go forth to slay dragons.

It can be done, and at this point, it can be good for your career.  Examples here, here and here. There is plenty of bullshit out there, and as Thucydides (also worth a look for young social scientists) said, “the society that separates its scholars from its warriors will have its thinking done by cowards, and its fighting by fools” so get to work!
