Locklin on science

Does anyone at The Atlantic understand statistics?

Posted in finance journalism, stats jackass of the month by Scott Locklin on March 24, 2011

At least they’re not pimping talking points for the oligarchy this time: it’s just general journalistic imbecility. No, Anne Hathaway news does not drive Berkshire Hathaway price changes. No, I haven’t tried to do this regression, nor will I ever try to do this regression, because I’m not as statistically retarded as people who think the Huffington Post is anything but an exotic white noise signal generated by the amygdalas of neurotic liberal arts majors.

If anyone reading my blog has fallen victim to the latest installment of the Atlantic’s regularly scheduled moronathon, please see David Leinweber’s excellent and hysterically funny paper, Stupid Data Miner Tricks: Overfitting the S&P500. In it, he uses almost twice as many data points as the HuffPo mouth breather to show a nearly perfect correlation between Bangladeshi butter production and the S&P500. Adding in sheep population also made the regression better. Professor Leinweber’s paper is a classic in the field: anyone who cares about doing statistics properly should read and internalize its lessons.

As for the hedge fund consultant guy they interviewed, John Bates: I hope you are suitably embarrassed, or else they egregiously misquoted you. Otherwise, I hope I never have to fix one of your messes. You should hang your head in epic shame for your apparent donkey-like lack of understanding of even the most rudimentary ideas about spurious correlation. Until you make amends and grovel in shame before your professional peers for misleading the public about a six-data-point spurious correlation in exchange for a little publicity, I hereby award you the very first “Locklin on science statistical jackass of the month” prize:

Enjoy your prize. If I were in charge of the guild, it would be the stocks and rotten cabbages for you. Progress Software? Not the guys that make Apama? Some pimp tried to recruit me to that outfit. I couldn’t understand why a CEP would be written in Java, just as I now can’t even imagine working for a company whose CTO doesn’t understand regression.

30 Responses


  1. wburzyns said, on March 25, 2011 at 9:51 am

    And why wouldn’t a CEP be written in Java?

    • Scott Locklin said, on March 25, 2011 at 9:58 am

      Java is made out of pure suffering: only people who think SQL or outsourcing your development team to India are clever ideas would willingly subject people intelligent enough to write a CEP to such torments.
      If I weren’t a Lisp fiend, I’d write such things in OCaml, the way Goldman did, or C the way everyone else does it.

      • Chris said, on March 25, 2011 at 5:25 pm

        What is a CEP?

        • Scott Locklin said, on March 25, 2011 at 7:22 pm

          A dumb trend, really.

          http://en.wikipedia.org/wiki/Complex_event_processing

        • Chris said, on March 25, 2011 at 7:37 pm

          Oh, ok.

          You don’t really know anything about java, do you?

          • Scott Locklin said, on March 25, 2011 at 7:42 pm

            Java is one of those things, like acupuncture and homeopathy and other transparently stupid ideas, that I have developed as much deliberate ignorance of as possible. It’s better than it used to be anyway, sort of like C++. It’s still stupid and designed to appeal to the lowest common denominator. I’d rather use .NET.

          • Chris said, on March 25, 2011 at 7:49 pm

            Interesting. You reply to my comment that you don’t know anything about java by telling me that you intentionally don’t know anything about java. I already knew that. Well, I guess the fact that you know nothing about it by choice is new information, though.

            There must be a word to describe people who are willfully ignorant and proud of it. I don’t know what that word is, though. Maybe you could sponsor a contest or a poll or something here on your blog where people could make up a word that means that and we could vote on the best one. Then you could change the name of your blog.

            • Scott Locklin said, on March 25, 2011 at 7:59 pm

              Java is C++ without pointers and with garbage collection. It is designed to appeal to mediocrities who learned to code in a diploma mill. That, and the fact that it is used everywhere by the same type of people who used to insist upon C++ (which is, of course, worse) are all any intelligent person need know about it. I have designed my career around avoidance of shitty herd technologies like Java and C++. I’m sorry if you’ve dedicated your life to this atrocity, but that says more about you than it does about me.

              Also, admit it: you’re some woman I picked up in a bar and deliberately forgot to call back, aren’t you?

              • Chris said, on March 25, 2011 at 8:08 pm

                Wow, wrong on so much, difficult to focus on any one thing. The fact that you are talking about java, which is a language, and say that you would prefer to use .NET, which is not, says plenty about your ignorance. I’d say that this is one instance where you don’t have to state again that you don’t know what you are talking about. Like I mentioned, we can tell.

                • Scott Locklin said, on March 25, 2011 at 8:15 pm

                  You just keep telling yourself that, while you try to focus, Chris.
                  Seriously dude: you put up websites for a living. Do you actually think you know something worth knowing which I don’t?

                  • Chris said, on March 25, 2011 at 8:46 pm

                    java: language or not? .NET: language or not?

                    The question isn’t if I think it. Do you think it? I get the feeling you spent some time trying to stalk me online. No reason to say more…

                    • Scott Locklin said, on March 25, 2011 at 9:01 pm

                      It takes 2 seconds to look at your email address, website monkey. How long have you been stalking me? How many dozen furiously typed missives from your sad cube farm job, or sitting at home in your urine-stained underpants, wishing your life didn’t suck that bad?

                      Yes, Chris, .NET is a platform; I’d probably end up using blended C# and F# on it, if my life were ever to suck that bad. Much like Java, which in addition to being a platform for many other languages (can you guess my favorite?), is also a very shitty language. I’d rather go work on greasy automobiles again than subject myself to such intellectual indignities. There is no reason, short of gambling debts, for anyone’s life to suck that badly. Even if you had gambling debts to the Russian mob, you could always get plastic surgery and move to Brazil: anything is better than wasting your life slinging Java.

                    • Chris said, on March 25, 2011 at 9:15 pm

                      You were better off with the sort-of-experienced, cynical persona. I haven’t actually been stalking you, just reading your blog. Like I said, I started reading you because you had a few funny, interesting posts. You just didn’t get when to stop. You don’t need to act smarter than everyone to appear clever. It actually makes you look more clever when you don’t try so hard to be so superior.

                      But then, for no apparent reason, you just chose silly topics for a few posts. Your blog stopped being clever and insightful. I didn’t say you stopped being clever and insightful, just your blog. But for some reason (insecurity|arrogance?), you couldn’t even say “Oops, I misspoke. Of course you wouldn’t write something like a CEP with a platform, you would use a language”. I haven’t tried to learn anything about you by visiting a website using the domain of your email address, just read the stuff you put out here.

                      I’m “getting it” that you don’t want me reading your blog anymore. Fair enough. I’m sure the rest of your faithful readers love reading your comments about my urine-stained underwear, so they will miss that. It was interesting and clever for a while. So long.

                    • Scott Locklin said, on March 25, 2011 at 9:25 pm

            • John Flanagan said, on March 27, 2011 at 7:44 pm

              Speaking from the perspective of the HFT industry, Java is an absolutely terrible idea for anything that is latency sensitive. I actually threw together an entire trading system in Java back in 2005, which wound up making ridiculous piles of money. They were still making money (although less ridiculously large amounts) when I last heard from them in 2010.

              But they were competing against banks, on 10s-of-milliseconds timescales, and Java can kinda-sorta keep up in that environment. You don’t need to run very fast when you are preying on the fat and slow.

              However, today in 2011, an HFT system can’t afford to paralyze itself for a few milliseconds every few minutes to do garbage collection. Sure, you can do your java coding to minimize object allocation/destruction et cetera, but the whole point of GC is to DO THAT FOR YOU, so there goes one of the major “advertised benefits” of the language right off the bat. Even the best state-of-the-art Java GC engines still run into significant, frequent paralysis issues unless they are also paired with draconian object lifetime management code techniques… at which point you aren’t really using the GC, and doing an awful lot of extra work to avoid a supposedly beneficial feature.
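
              To make that concrete: the “draconian object lifetime management” in question is basically object pooling. Preallocate everything up front, recycle it yourself, and never touch the allocator (or, in Java, give the GC anything to collect) on the hot path. A minimal sketch of the pattern, in C++ for brevity (the Order type and sizes are made up):

              ```cpp
              #include <cstddef>
              #include <vector>

              // Pool-everything pattern: one big allocation at startup,
              // then recycle slots through a free list so the hot path
              // never calls the allocator.
              struct Order { long id; double px; int qty; Order* next_free; };

              class OrderPool {
                  std::vector<Order> storage_;   // single up-front allocation
                  Order* free_head_ = nullptr;
              public:
                  explicit OrderPool(std::size_t n) : storage_(n) {
                      for (auto& o : storage_) {
                          o.next_free = free_head_;
                          free_head_ = &o;
                      }
                  }
                  Order* acquire() {             // O(1), allocation-free
                      Order* o = free_head_;
                      if (o) free_head_ = o->next_free;
                      return o;                  // nullptr if exhausted
                  }
                  void release(Order* o) {       // recycle the slot
                      o->next_free = free_head_;
                      free_head_ = o;
                  }
              };
              ```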

              Please note that this is not theoretical armchair discussion. At my current gig (at least until my two weeks’ notice finishes, thank Jebus), we have a large number of fairly expert people dealing with Java GC hell on a daily basis, trying to get rid of every-few-minutes 10s-of-milliseconds latency spikes in production systems due to GC. This shit happens, and it’s a pain in the ass to avoid in Java. If erratic but unavoidable 50-millisecond latency will ruin your day (and it ruins ours, or rather it ruins our customers’), Java is the wrong tool. I keep telling them to cut their losses and go back to doing it in C++, but at this point it’s a corporate politics issue, since they’d then have to figure out what to do with the resulting bunch of surplus Java programmers.

              In the HFT space, the current state of the art is in the microsecond domain. Sometimes things happen (mostly OS related) to blow you out to millisecond level unless you go through the effort of dealing with an RT operating system, but that tends to be more of a pain in the ass than it’s worth. The typical real-life solution is to try to minimize the OS pain through setting up processor affinity to give your RT apps their own cores to play with, and that gets you most of the way there. Coded in C++ typically, which is just as brain-damaged as Java in its own ways, of course, but it’s far easier to get predictable low-latency performance out of.
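
              The affinity trick, at least, is cheap to pull off. A minimal Linux/pthreads sketch (the core number is arbitrary, and error handling is pared to the bone):

              ```cpp
              #include <pthread.h>
              #include <sched.h>
              #include <cstdio>

              // Pin the calling thread to one core so the scheduler
              // can't migrate the hot loop mid-flight.
              static int pin_to_core(int core) {
                  cpu_set_t set;
                  CPU_ZERO(&set);
                  CPU_SET(core, &set);
                  // Affects only the calling thread, not the process.
                  return pthread_setaffinity_np(pthread_self(),
                                                sizeof(set), &set);
              }

              int main() {
                  if (int rc = pin_to_core(2))
                      std::fprintf(stderr, "affinity failed: %d\n", rc);
                  // ... latency-critical loop runs here, alone on core 2 ...
                  return 0;
              }
              ```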

              So there, now you’ve had Java trashed by somebody familiar with it. Happy now?

              • Scott Locklin said, on March 28, 2011 at 1:50 am

                My mind reels that they tried to do that, just as my mind reeled at Apama being written in Java. I know people who trade on Java code, but their problems are anything but HFT.

                I’m also kind of amazed people who are doing HFT use Linux or Slowlaris. While I’ve never benchmarked Linux for OS latency issues, I know that QNX and VxWorks do things in a much more orderly fashion. The differences between coding on a QNX or VxWorks machine versus Linux aren’t big enough to justify not using an RTOS, IMO.

                • John Flanagan said, on March 28, 2011 at 7:24 pm

                  One of my jokes is that Solaris is only still around because it even DIES slowly.

                  I can see people finally getting fed up with the non-RT aspects of Linux eventually, but we’re really only just starting to reach the performance levels that would justify that kind of move. When I did that Java system, I scoffed at the thought of worrying about sub-millisecond ANYTHING, and that was just a few years ago. OS latency wasn’t even on my radar.

                  Running Linux is handy from the general-purpose aspect. For your important apps, you can get pretty close to OS-independent operation if you know what you’re doing, but you can still “shell out” to the fully featured normal OS whenever you have something that needs it.

                  The things people have been getting hardons about lately are GPUs and Cell processors for doing the mathematical heavy lifting. I personally think those people are smoking crack. For most HFT, the math is not the hard part- the hard part is delivering all the inputs to the procedure that does the math, and getting the result back out and sent somewhere useful*. Cramming the math routines into running on a GPU card just adds to the difficulty of getting the inputs on and outputs off. Why don’t you just put the theoretical calculators onto other locally networked servers, genius? If you’re gonna go through the trouble of writing custom drivers to get data in and out of your theoretical engines, might as well just write some Infiniband RDMA drivers instead and run everything on standard servers.

                  * Getting data into and out of calculation processes with high bandwidth and low latency is basically THE ENTIRE SUPERCOMPUTING PROBLEM. If it’s still a slow and cumbersome pain in the ass to shove data in and out of your fancy billion-core GPU (hint: IT IS), then you have completely failed to understand or solve your actual problem. But hey, at least you justified getting your employer to buy some sexy ATI Crossfire cards for you to play with, and also presumably swipe to bring home when the company fails to make any actual money.

                  • Scott Locklin said, on March 28, 2011 at 9:27 pm

                    I figure they must be using VxWorks for arbing at least.

                    I’ve argued with people about the GPU crap. Nobody listens to me. It’s trendy, it’s hip, it’s … the GPU … Unless your problem is embarrassingly parallel, and, as you say, you can squirt data in which stays there and hopefully only need a few numbers out: what is the freaking point? Maintaining code for such crap is almost certainly a nightmare even if there is some reason to do it: it certainly was back when I was using VxWorks with DSPs.

                    It’s like people who love the cloud. Remember the idjits who thought you could do HFT using magic cloud technologies? I’m presently running on the cloud. The instance is about as powerful as my netbook, though the disk access is actually slower (since it happens across their dumb network drive thing), and the network latency is worse than in my favorite coffee shop.

                    I briefly considered the cloud for my personal trading problems: a decent instance costs as much as a server every month. I even considered adding a GPU to my server to screw around with, but decided I didn’t hate life that much, and so I loaded up on RAM and saved the money for a flash drive should I need some speed where it is important.

                    I do know a guy who is concentrating on the network hardware end of things. I haven’t pinged him in a while: probably time to see what he’s come up with.

                    • John Flanagan said, on March 28, 2011 at 10:19 pm

                      Solarflare network cards are an interesting thing, on the network side. They come with special drivers on Linux, which bypass the kernel for most of what the NIC needs to do. Apps can use LD_PRELOAD magic to intercept the appropriate system calls and deal with them directly on the NIC instead of going through the kernel. This turns the typical 11usec latency for transmitting/receiving an Ethernet packet into around 5usec. Avoiding the kernel entirely for the fast path is an amazing optimization. And this is a drop-in replacement that doesn’t require code changes- just the right cards, drivers, and some LD_PRELOAD crap for your process startup. Impressive.
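
                      The LD_PRELOAD part is no black magic, by the way. A toy interposer looks like the sketch below; it just logs send() calls and forwards them, where the Solarflare drivers would instead service the call directly on the NIC:

                      ```cpp
                      // Toy LD_PRELOAD interposer for send().
                      // Build: g++ -shared -fPIC shim.cpp -o libshim.so -ldl
                      // Run:   LD_PRELOAD=./libshim.so ./your_app
                      #include <dlfcn.h>
                      #include <sys/socket.h>
                      #include <cstdio>

                      extern "C" ssize_t send(int fd, const void* buf,
                                              size_t len, int flags) {
                          using send_fn = ssize_t (*)(int, const void*,
                                                      size_t, int);
                          // Locate the libc send() we are shadowing, once.
                          static send_fn real_send =
                              (send_fn)dlsym(RTLD_NEXT, "send");
                          // A kernel-bypass stack would handle the packet
                          // here; this sketch just observes and passes it on.
                          std::fprintf(stderr, "send(fd=%d, %zu bytes)\n",
                                       fd, len);
                          return real_send(fd, buf, len, flags);
                      }
                      ```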

                      For the brave souls willing to leave the safety of Ethernet waters, Infiniband RDMA can get you down to 1usec latency server-to-server if done properly. Note that this isn’t the easy path of IP-over-IB, which isn’t that much better than normal Ethernet (not least because it’s still going through the kernel). You actually have to rewrite (or possibly even redesign) your application to really be able to handle Infiniband.

                      If your problem is embarrassingly parallel, then you can just as easily (or, actually, much MORE easily) throw more actual servers at the problem. A hardcore trading shop would think nothing of throwing a thousand 12-core servers into a colo, if they needed the horsepower.

                    • Scott Locklin said, on March 29, 2011 at 6:08 am

                      Wasn’t aware of those. I think “the guy I know” is doing something similar; sticking a DSP or something like a DSP in a network card, and having it do a lot of the work.

                      I agree: I have never found anything you could do with a GPU which couldn’t be done cheaper and a lot easier on more than one computer. I mean, there may be some cases (Monte Carlo type things, particle filters) where a GPU is easier and works better, but I’m pretty sure they’re corner cases.

  2. Chris said, on March 25, 2011 at 7:40 pm

    Scott, this is another post where you honestly seem like you are trying too hard. I read the article that you linked to, and I can’t imagine you misrepresenting the content any more than you did. It’s kinda like your post about how engineering is so terrible these days, and as proof you linked to an article that stated that Apple engineers tried to keep the business managers from doing something stupid. Did you even read the article you linked to?

    • Scott Locklin said, on March 25, 2011 at 7:46 pm

      You must have very special insights which should be shared with the world in a blog of your own, Chris.

      • Chris said, on March 25, 2011 at 7:51 pm

        Not really. I, at least, recognize when I don’t have anything special to add. At those times, I try to see if I can learn something by reading the thoughts of others. At first, I thought that your blog was a place where things can be learned. I was wrong. You seem to go out of your way to mention multiple times that you try to learn as little as possible about things. You then write things that are incorrect and misleading about the universe. In some ways, your blog is a tool for destroying knowledge. If that was your intent, bravo!

        • Scott Locklin said, on March 25, 2011 at 8:01 pm

          I can lead a donkey to water, but I can’t make him drink.

  3. Chris said, on March 25, 2011 at 8:11 pm

    Perhaps you could misrepresent some article about donkeys or water in the meantime. Not the same, I’ll admit, but still good fun.

  4. maggette said, on March 26, 2011 at 1:32 pm

    I still do not see why java is not c++ without pointers AND with garbage collection. That’s what Scott said and I tend to agree with him, even though I don’t and won’t call myself an expert. So Chris, maybe you can tell me what java can do for me that c++ can’t… or why it cannot be replaced by C# in a .NET environment (which I will have to work with for the next two years… not my choice)?

    If I get Scott right, these are the reasons he is declining to spend any time on Java. And from a noob point of view (my POV) it seems to make sense. You seem to see this differently. Can you tell me why?

    And I think this is not “ignorant”.

    Let’s have a look at prototyping/scientific computing. Let’s assume you are good in MATLAB, R and Python. Do you spend any time learning Mathematica, SAS and GAUSS? Is it ignorant to focus your efforts on a set of tools and ignore the rest? I think it makes sense.
    THX

  5. HankScorpio said, on March 30, 2011 at 4:50 pm

    I have written way too much C and C++ code, especially on real-time embedded systems, you know, the sort of shyeah that makes you go grey prematurely.

    Many moons ago, a major super-evil American Investment Bank hired me to re-engineer their analytics. By the time I resigned my then-current job and relocated to their neck of the woods, they had decided that Java was king.

    My experience then was that it was still half-baked, and after a few years of that I yearned to write some real code. In particular, the J2EE abomination made me want to vomit. Since that day I swore I’d never work on J2EE again. I have done some Spring work, but no J2EE.

    These days it is C++ and C#. Now, having said that, I must admit that the inexperienced developers on the team were far more productive using Java than they would have been using C++ – but that is simply a personal opinion. I kinda see the logic of going back to pure C. I never thought I’d say that, but it is true.

    Also, allow me to play devil’s advocate – and I’m not referring to the pinball machine from that Simpsons episode. These dudes here:

    http://www.infoq.com/presentations/LMAX

    are apparently doing some serious shyeah using Java. It would appear that Java has definitely improved since the 1.3/1.4 days, and I’m not just talking generics (what the phudge were they thinking in the bad old days, container classes storing everything as type Object).

    I’ve also recently been hassled by the local pimps to go work for some order management system implementations, in Java, but I have not given in to the temptation.

    Fact is that Java is widespread in the financial industry. What effect the Oracle factor will have is pure conjecture at this point, but many shops – locally (Oz) – are using Java in front-office, I’d say on par with C#. Most analytics however are still written in C++, and then there is the CUDA/GPU thing.

    However, having recently discovered the awesomeness of functional programming, I believe I understand what Dr Locklin is on about.

    Cheers, Hank.

  6. Scott Locklin said, on March 31, 2011 at 6:33 pm

    Interesting presentation. Nothing really Java-specific about using ring buffers, avoiding linked lists, or taking advantage of your cache, though. And their preallocation tricks and their handling of low-lifespan objects are basically just writing C code in Java. Let’s face it: if you want to go fast on a systems level, you’re going to be writing some C, even if it’s implemented in Java, just like if you want to go fast in floats, you’re probably writing Fortran. Like their partner said, “oh, you wrote an ethernet card in Java: cute.”
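
    Strip away the Java contortions and the core trick is something like this: a bare-bones single-producer/single-consumer ring, preallocated once, no locks, nothing allocated on the hot path. (A sketch of the general idea, not the actual Disruptor.)

    ```cpp
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // Single-producer/single-consumer ring buffer: the slots are
    // preallocated once, the hot path never locks or allocates,
    // and N being a power of two turns modulo into a mask.
    template <typename T, std::size_t N>
    class SpscRing {
        static_assert((N & (N - 1)) == 0, "N must be a power of two");
        T slots_[N];
        std::atomic<std::uint64_t> head_{0};  // written by producer only
        std::atomic<std::uint64_t> tail_{0};  // written by consumer only
    public:
        bool push(const T& v) {
            auto h = head_.load(std::memory_order_relaxed);
            if (h - tail_.load(std::memory_order_acquire) == N)
                return false;                  // ring is full
            slots_[h & (N - 1)] = v;
            head_.store(h + 1, std::memory_order_release);
            return true;
        }
        bool pop(T& out) {
            auto t = tail_.load(std::memory_order_relaxed);
            if (t == head_.load(std::memory_order_acquire))
                return false;                  // ring is empty
            out = slots_[t & (N - 1)];
            tail_.store(t + 1, std::memory_order_release);
            return true;
        }
    };
    ```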

    Funny, I figured Erlang might do better without worrying about a lot of the crap they worried about by dumping a ton of CPUs on the problem. Google on low latency Erlang … Sergei Aleynikov’s name comes up. Poor Serge.

    • John Flanagan said, on March 31, 2011 at 7:56 pm

      The main concurrency advantage that Erlang has is that it is shared-nothing: threads don’t need to mutex anything because for a thread to have access to a piece of data, it had to get it via a message in its message loop.

      Or, to put it another way, you could get similar concurrency in any other language if you enforced a shared-nothing design. My current R&D work has been developing a shared memory architecture to do something like this in C++.

      When shared (or messaged) state updates relatively infrequently compared to how often it is accessed, this is a gigantic win. If, however, the shared/messaged state is used just once (or only a handful of times) then the additional overhead of creating/sending/destroying messages overtakes the relative cost of mutex locks.
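
      In C++ terms, the shared-nothing discipline amounts to ownership transfer: a message lands in a thread’s inbox and nobody else keeps a pointer to it, so the payload never needs a lock of its own. A crude sketch (a real implementation would swap the mutexed deque for a ring buffer):

      ```cpp
      #include <deque>
      #include <memory>
      #include <mutex>

      // Shared-nothing inbox: ownership of each message moves to
      // the receiving thread, so the payload itself is never
      // accessed concurrently and needs no lock of its own.
      template <typename Msg>
      class Inbox {
          std::mutex m_;                        // guards the queue only
          std::deque<std::unique_ptr<Msg>> q_;
      public:
          void post(std::unique_ptr<Msg> msg) { // sender gives up ownership
              std::lock_guard<std::mutex> lk(m_);
              q_.push_back(std::move(msg));
          }
          std::unique_ptr<Msg> poll() {         // receiver owns it outright
              std::lock_guard<std::mutex> lk(m_);
              if (q_.empty()) return nullptr;
              auto msg = std::move(q_.front());
              q_.pop_front();
              return msg;
          }
      };
      ```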

      I had already seen this LMAX presentation (a coworker went to QCon and brought back slides), and I had similar comments about “Any sufficiently advanced Java is indistinguishable from C++!” It’s nothing short of amazing how much performance they squeezed out of Java, though. I was impressed.

      I see that they just finished making an example out of Mr. Aleynikov. Poor bastard. Remember kids: Be safe- don’t piss off rich people!

      • Scott Locklin said, on April 2, 2011 at 12:42 am

        My main attraction to Erlang is that it is a high-level programming language that does my thinking for me. I wasn’t aware of, or rather had never thought of, the mutex issue. Sounds like a rediscovery of MPI. I don’t know why more people don’t just use MPI (or even PVM) for parallel tasks. I think a lot of computer scientist types have this “right way to do stuff” picture in their heads. Floating point nerds don’t have this preconception, so we end up using tech like MPI which isn’t “right,” but which works pretty damn well. Maybe your new framework will do it even better. Seems like there is lots of room to grow on multi-core machines.
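
        For anyone who hasn’t seen it, the MPI model really is about as unceremonious as parallelism gets. A minimal two-rank ping (any MPI implementation, OpenMPI or MPICH say, will run it):

        ```cpp
        // Minimal MPI sketch: rank 0 sends a number, rank 1 prints it.
        // Build/run: mpic++ ping.cpp -o ping && mpirun -np 2 ./ping
        #include <mpi.h>
        #include <cstdio>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            double px = 42.0;
            if (rank == 0) {
                // Send one double to rank 1 with message tag 0.
                MPI_Send(&px, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&px, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                std::printf("rank 1 got %f\n", px);
            }
            MPI_Finalize();
            return 0;
        }
        ```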

        Sergei did it in Erlang + OCaml (there was a code listing on a fileshare for a while, though no actual code I know of). I’m pretty sure the way this ended up working, the OCaml ran the speedy bits which performed the actual trades, with Erlang as a way of spreading it out over lots of markets and perhaps looking for arbs or funky dark crossings. I have a major soft spot for Serge for doing it in OCaml, as that’s how I would have done it if I hadn’t discovered the wonders of Lush (comparably fast, or faster, and a lot nicer to work with in an interactive, Matlab-type development environment than OCaml). Some day, maybe I’ll look into Erlang, if my life ever becomes that awesome, but right now it isn’t.

