Locklin on science

Please stop writing new serialization protocols

Posted in Design, fun by Scott Locklin on April 2, 2017

It seems that every day, some computer monkey comes up with a new and more groovy serialization protocol.

In the beginning, there was ASN.1 and XDR, and it was good. I think ASN.1 came first, and like many old things, it was very efficient. XDR was easier to use. At some point, probably before ASN.1, people noticed you could serialize things using stuff like s-expressions for a human-readable, JSON-like format.
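To make the two classic flavors concrete, here is a minimal sketch in Python. It assumes the basic XDR rules (4-byte big-endian integers; strings sent as a 4-byte length followed by the bytes, zero-padded to a multiple of 4) and a toy s-expression printer. The function names and the sample record are made up for illustration and are not from any particular library.

```python
import struct

def xdr_int(value: int) -> bytes:
    # XDR signed integer: 4 bytes, big-endian.
    return struct.pack(">i", value)

def xdr_string(text: str) -> bytes:
    # XDR string: 4-byte length, then the bytes, zero-padded to a 4-byte boundary.
    raw = text.encode("ascii")
    padding = b"\x00" * (-len(raw) % 4)
    return struct.pack(">I", len(raw)) + raw + padding

def to_sexp(obj) -> str:
    # Toy human-readable form: nested lists become parenthesized lists.
    if isinstance(obj, (list, tuple)):
        return "(" + " ".join(to_sexp(item) for item in obj) + ")"
    if isinstance(obj, str):
        return '"' + obj.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return str(obj)

record = ("tick", "IBM", 42)           # hypothetical sample record
binary = xdr_string("IBM") + xdr_int(42)
readable = to_sexp(record)
```

The binary form is 12 bytes for this record, and the readable form is the same data as text; neither needs anything beyond a few lines of code to produce.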

Today, we have an insane profusion of serializers. CORBA (which always sucked), Thrift, protocol buffers, Messagepack, Avro, BSON, BERT, Property Lists, Bencode (Bram … how could you?), Hessian, ICE, Etch, CapnProto (because he didn’t get it right the first time), SNAC, Dbus, MUSCLE, YAML, SXDF, XML-RPC, MIME, FIX, FAST, JSON, serialization in Python, R, PHP, ROOT and Perl… Somehow this is seen as progress.

Like many modern evils, I trace this one to Java and Google. You see, Google needed a serialization protocol across thousands of machines which had versioning. They probably did the obvious thing of tinkering with XDR by sticking a required header on it which allowed for versioning, then noticed that Intel chips are not Big Endian the way Sun chips were, and decided to write their own  semi shitty versioning version of XDR … along with their own (unarguably shitty) version of RPC. Everything has been downhill since then. Facebook couldn’t possibly use something written at Google, so they built “Thrift,” which hardly lives up to its name, but at least has a less shitty version of RPC in it. Java monkeys eventually noticed how slow XML was between garbage collects and wrote the slightly less shitty but still completely missing the point Avro. From there, every ambitious and fastidious programmer out there seems to have come up with something which suits their particular use case, but doesn’t really differ much in performance or capabilities from the classics.

The result of all this is that, instead of having a computer ecosystem where anything can talk to anything else, we have a veritable Tower of Babel where nothing talks to anything else. Imagine if there were 40 competing and completely mutually unintelligible versions of HTML or text encodings: that’s how I see the state of serialization today. Having all these choices isn’t good for anything: it’s just anarchy. There really should be a one-size-fits-all minimal serialization protocol, the same way there is a one-size-fits-all network protocol which moves data around the entire internet, and a near-universal text encoding in UTF-8. You can have two flavors of the same thing: one S-exp-like which a human can read, and one which is more efficient. I guess it should be little-endian, since we all live in Intel’s world now, but otherwise, it doesn’t need to do anything but run everywhere.

IMO, this is a social problem, not a computer science problem. The actual problem was solved in the 80s with crap like XDR and S-expressions which provide fast binary and human readable/self describable representations of data. Everything else is just commentary on this, and it only gets written because it’s kind of easy for a guy with a bachelor’s degree in CS to write one, and more fun to dorks than solving real problems like fixing bugs. Ultimately this profusion creates more problems than creating a new one solves: you have to make the generator/parser work on multiple languages and platforms, and each implementation on each language/platform will be of varying quality.

I’m a huge proponent of XDR, because it’s the first one I used (along with RPC and rpcgen), because it is Unixy, and because most of the important pieces of the internet and unix ecosystem were based on it. A little endian superset of this with a JSON style human semi-readable form, and an optional self-description field, and you’ve solved all possible serialization problems which sane people are confronted with. People can then concentrate on writing correct super-XDR extensions to get all their weird corner cases covered, and I will not be grouchy any more.
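A rough sketch of what such a "super-XDR" might look like, purely as an illustration: nothing below is an existing format. The layout (a leading flag word, little-endian fields, 4-byte padding, a JSON-encoded schema as the optional self-description) and every name in it are assumptions made up for this example.

```python
import json
import struct

def _put_bytes(raw: bytes) -> bytes:
    # Length-prefixed bytes, zero-padded to a 4-byte boundary (XDR style, little-endian).
    return struct.pack("<I", len(raw)) + raw + b"\x00" * (-len(raw) % 4)

def _get_bytes(buf: bytes, pos: int):
    (n,) = struct.unpack_from("<I", buf, pos)
    start = pos + 4
    return buf[start:start + n], start + n + (-n % 4)

def encode(schema, record, self_describe=False) -> bytes:
    # Hypothetical layout: flag word, optional embedded schema, then the fields in order.
    out = struct.pack("<I", 1 if self_describe else 0)
    if self_describe:
        out += _put_bytes(json.dumps(schema).encode("utf-8"))
    for name, kind in schema:
        if kind == "int32":
            out += struct.pack("<i", record[name])
        elif kind == "float64":
            out += struct.pack("<d", record[name])
        elif kind == "string":
            out += _put_bytes(record[name].encode("utf-8"))
        else:
            raise ValueError(f"unknown field type {kind!r}")
    return out

def decode(buf: bytes, schema=None) -> dict:
    (flag,) = struct.unpack_from("<I", buf, 0)
    pos = 4
    if flag:  # the message carries its own description
        raw, pos = _get_bytes(buf, pos)
        schema = json.loads(raw.decode("utf-8"))
    if schema is None:
        raise ValueError("no schema supplied and none embedded")
    record = {}
    for name, kind in schema:
        if kind == "int32":
            (record[name],) = struct.unpack_from("<i", buf, pos)
            pos += 4
        elif kind == "float64":
            (record[name],) = struct.unpack_from("<d", buf, pos)
            pos += 8
        elif kind == "string":
            raw, pos = _get_bytes(buf, pos)
            record[name] = raw.decode("utf-8")
        else:
            raise ValueError(f"unknown field type {kind!r}")
    return record

def to_text(record: dict) -> str:
    # JSON-style human semi-readable form of the same record.
    return json.dumps(record, sort_keys=True)

schema = [("sym", "string"), ("px", "float64"), ("qty", "int32")]
tick = {"sym": "IBM", "px": 174.5, "qty": 100}
compact = encode(schema, tick)                         # schema agreed out of band
described = encode(schema, tick, self_describe=True)   # schema travels with the data
```

The compact form needs both ends to agree on the schema ahead of time, as with XDR and rpcgen; the self-described form trades bytes for the ability to decode without prior agreement.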

It also bugs the hell out of me that people idiotically serialize data when they don’t have to (I’m looking at you, Spark jackanapes), but that’s another rant.

Oh yeah, I do like Messagepack; it’s pretty cool.

19 Responses

  1. fpoling said, on April 2, 2017 at 4:30 pm

    Facebook has too many former Google engineers that literally cannot use Google stuff unless it is fully open sourced. That inevitably leads to writing version 2 from scratch at Facebook.

    The current mess of serialization is way better than the situation 10-15 years ago, when things were serialized in XML that a sane human cannot read. At least people pay attention to performance.

  2. Olbap Ratnacla Selarom said, on April 2, 2017 at 6:00 pm

    Just a little point: Anarchy != Chaos

  3. Ant said, on April 2, 2017 at 6:14 pm

    I agree. We should standardise on Loon for the S-expression human-readable data.

  4. Joe Hildebrand said, on April 2, 2017 at 8:28 pm

    CBOR (http://cbor.io/) is on standards-track at the IETF, and meets or exceeds your needs for the binary serialization format. No schema required to parse, reasonably compact on the wire, small code size for encoders and decoders, superset of JSON (+ sane strings, binary blobs, real floats, dates, etc.)
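For readers who have not seen CBOR, a minimal encoder sketch shows why no schema is needed to parse it: every item starts with a byte carrying a type tag and a length or small value. This covers only a subset (integers up to 64 bits, floats as doubles, strings, bytes, lists, maps, booleans, None), and the function names are made up for this example rather than taken from any CBOR library.

```python
import struct

def _head(major: int, n: int) -> bytes:
    # Initial byte: 3-bit major type, 5 bits of "additional info" (value or size marker).
    if n < 24:
        return bytes([major << 5 | n])
    if n < 1 << 8:
        return bytes([major << 5 | 24, n])
    if n < 1 << 16:
        return bytes([major << 5 | 25]) + struct.pack(">H", n)
    if n < 1 << 32:
        return bytes([major << 5 | 26]) + struct.pack(">I", n)
    return bytes([major << 5 | 27]) + struct.pack(">Q", n)

def cbor_encode(obj) -> bytes:
    if obj is None:
        return b"\xf6"
    if isinstance(obj, bool):          # must be checked before int
        return b"\xf5" if obj else b"\xf4"
    if isinstance(obj, int):
        return _head(0, obj) if obj >= 0 else _head(1, -1 - obj)
    if isinstance(obj, float):
        return b"\xfb" + struct.pack(">d", obj)   # always a 64-bit double in this sketch
    if isinstance(obj, bytes):
        return _head(2, len(obj)) + obj
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(obj, (list, tuple)):
        return _head(4, len(obj)) + b"".join(cbor_encode(item) for item in obj)
    if isinstance(obj, dict):
        body = b"".join(cbor_encode(k) + cbor_encode(v) for k, v in obj.items())
        return _head(5, len(obj)) + body
    raise TypeError(f"cannot encode {type(obj).__name__}")

message = cbor_encode({"a": 1, "b": [2, 3]})
```

Note that CBOR puts multi-byte values in network (big-endian) byte order, so it does not satisfy the little-endian wish in the post.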

  5. Gregory said, on April 3, 2017 at 11:27 am

    Why MessagePack?

  6. Anonymous said, on April 3, 2017 at 5:06 pm

    https://github.com/cognitect/transit-format

  7. Mike Burke said, on April 3, 2017 at 10:52 pm

    I agree with all this, but find it amusing that the answer to too many serialization protocols is more serialization protocols.

    • pindash91 said, on April 5, 2017 at 12:07 pm

      https://xkcd.com/927/

      Also, why message pack? What happened to your love of vectors? Especially if you have the strings interned in the object anyway.

      Keep Posting!

      • Scott Locklin said, on April 5, 2017 at 4:51 pm

        Because the guy who sent me the email that triggered this response was touting messagepack.

        To be honest the main thing I like about it is the J interface has a hash table. I haven’t had need for it yet, but it’s more convenient than the J-json interface.

  8. flanagan314 said, on April 5, 2017 at 2:19 pm

    Meanwhile, I’m one of the revanchist reprobates who’s perfectly happy slinging raw C structs over the wire. Yes, I’m quite aware of what a bad idea this is.

    • Scott Locklin said, on April 5, 2017 at 4:48 pm

      Such a terrible idea! What will happen if someone tries to connect to your ticker plant with their cell phone?

      • flanagan314 said, on April 8, 2017 at 8:18 pm

        Well for starters, I would be very interested in hearing how exactly they talked one of our switches into accepting their multicast group join. We have filters for that.

        • Scott Locklin said, on April 8, 2017 at 10:53 pm

          But muh endian! What if your company invests in an antiquated SPARC workstation?

          • flanagan314 said, on April 11, 2017 at 5:43 pm

            Hey, if that’s what they want to use for a paperweight, I’m not going to micromanage them. I would also tell them “You’re gonna need a 10G ethernet card to plug that thing into our network, good luck with that.”
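The endianness joke in this thread can be shown in a few lines. This is a sketch using Python's struct module to stand in for a raw C struct; the tick layout (uint32 sequence number, float64 price, int32 size) is a made-up example, not anyone's actual wire format.

```python
import struct

LAYOUT = "Idi"  # hypothetical tick: uint32 sequence, float64 price, int32 size

def pack_tick(seq: int, price: float, size: int) -> bytes:
    # What a little-endian (Intel) sender puts on the wire, with no padding.
    return struct.pack("<" + LAYOUT, seq, price, size)

def read_tick(payload: bytes, byte_order: str):
    # byte_order is "<" for a little-endian reader, ">" for a big-endian one.
    return struct.unpack(byte_order + LAYOUT, payload)

payload = pack_tick(7, 101.25, 300)
same_endian = read_tick(payload, "<")   # the sender's values come back intact
wrong_endian = read_tick(payload, ">")  # same bytes, silently different numbers
```

Nothing fails loudly in the mismatched case: the big-endian reader just gets a sequence number of 117440512 instead of 7. A real C struct adds a second hazard, since the compiler may insert padding between fields that differs across platforms.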

  9. Five Daarstens said, on April 7, 2017 at 2:22 am

    How about a post about how Vax was actually a much better OS than anything now? Long uptimes (10 years or more), no Date problems (1970/2000/2038). Have you ever tried to get the time in Windows? Vax had nanosecond timing, nothing like this exists now or even comes close.

    • Scott Locklin said, on April 8, 2017 at 10:55 pm

      I have had insanely ‘orrible experiences with Winders time, despite its alleged provenance to VMS.
      I confess despite all the Astronomers in the Physics department, the difficulty in doing things like changing directories made me not like VMS. The uptime and such was obviously top notch though. Bloody shame what happened to DEC in general. Great hardware, fantabuloso software (even the search engine), bought by a shitty consumer company and now sleeps with the fishes.

  10. fazalmajid said, on April 7, 2017 at 6:03 pm

    ASN.1 is the worst. The lack of word alignment means poor performance, and I can’t begin to mention the number of security flaws introduced by buggy ASN.1 parsers (there is no other kind) in things like SNMP.

    XDR did the right thing with alignment, but the big-endian bit is a drag nowadays. Corba’s CDR is not that different from XDR and it did not suck, as long as you ignored all the designed-by-committee cruft outside the core object-RPC functionality. The less said about grotesquely inefficient XML and JSON formats, the better. msgpack is OK, but oversold.

    I don’t know why every generation of programmers feels compelled to reinvent the RPC square wheel, but George Santayana probably had something to say about it.

    • Scott Locklin said, on April 8, 2017 at 10:57 pm

      No direct experience, but a colleague who was involved with SNMP said something similar about ASN.1 when I posted this.
      Santayana was an optimist.

    • flanagan314 said, on April 11, 2017 at 5:48 pm

      On the newer Intel architectures, alignment is very nearly a non-issue. You need proper alignment for vector operations, and when your misaligned read/write crosses cache lines you can wind up with more than your share of memory latency. But other than that, a misaligned read/write is basically as fast as aligned. It’s pretty impressive. I didn’t believe it at first, but we tested it and I couldn’t measure any significant performance differences from misalignment.

