Locklin on science

An introduction to Kerf

Posted in Design, Kerf by Scott Locklin on December 15, 2015

My pals generally act impressed when I show them my noodlings in the J language. I’m pretty sure they’re impressed with the speed and power of J because it is inarguably fast and powerful, but I’ve also always figured they saw it more as an exercise in obfuscated coding; philistines! While I can generally read my own J code, I must confess some of the more dense tacit style isn’t something I can read naturally without J’s code dissector. I have also been at it for a while, and for a long time went on faith that this skill would come. Notation as a tool of thought is one of the most powerful ideas I’ve come across. The problem becomes talking people into adopting your notation. Building important pieces of your company around a difficult mathematical notation is a gamble which most companies are not willing to take.

Everyone knows about Arthur Whitney and K because of Kx Systems’ database KDB. Having fiddled around with KDB and with Jd from Eric Iverson’s Jsoftware, the mind-boggling power of these things on time series and data problems in general makes me wonder why everyone doesn’t use these tools. Then I remember the first time I looked at things like this:


wavg:{(+/x*y)%+/x}        // K version
wavg=: +/ .* % +/@]        NB. J version


Oh yeah, that’s why J and K adoption are not universal. I mean, I can read it. That doesn’t mean everyone can read it. And I certainly can understand people’s reluctance to learn how to read things like this. It’s not easy.

For the last year and a half, my partner Kevin Lawler has been trying to fix this problem. You may know of him as the author of Kona, the open source version of K3. Kevin’s latest creation is Kerf. Kerf is basically an APL that humans can read, along with one of the highest performance time series databases money can buy. I liked it so much, I quit my interesting and promising day job doing Topological Data Analysis at Ayasdi, and will be dedicating the next few years of my life to this technology.

We know the above code fragments are weighted averages, but mostly because that’s what they’re called in the verb definitions. Mischievous programmers (the types who write code in K and J) might have called them d17 or something. Kerf looks a lot more familiar.


function wavg(x,y) {
  sum(x*y) / sum(x)
}


This is cheating a bit, since K & J don’t have a sum primitive, but it begins to show the utility of organizing your code in a more familiar way. Notice that x * y is done element-wise; no stinking loops necessary. Expressing the same thing in more primitive Kerf functions looks like this:


function wavg(x,y) {
  (+ fold x*y) / (+ fold x)
}


In J and K, the ‘/’ adverb sticks the left-hand verb between all the elements on the right-hand side. In Kerf, we call that operation “fold” (we also call adverbs “combinators,” which we think is more descriptive of what they do in Kerf; I think John Earnest came up with the term).

You could also write the whole thing out in terms of for loops if you wanted to, but fold is easier to write, easier to read, and runs faster.
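For readers without a J or K interpreter handy, the fold idea can be sketched in Python. This is an illustrative analogue, not Kerf itself: numpy stands in for the loop-free element-wise multiply, and functools.reduce plays the role of `+ fold`, sticking `+` between the elements.

```python
from functools import reduce
import operator

import numpy as np

def wavg(x, y):
    # "+ fold x*y": put + between all the elements of the element-wise product
    num = reduce(operator.add, x * y)
    # "+ fold x": put + between all the elements of the weights
    den = reduce(operator.add, x)
    return num / den

w = np.array([1.0, 2.0, 3.0])    # weights
v = np.array([10.0, 20.0, 30.0])  # values
print(wavg(w, v))  # (10 + 40 + 90) / (1 + 2 + 3) = 23.333...
```

In practice you would just write `np.sum` here, exactly as the Kerf version uses `sum`; the reduce form only makes the correspondence with fold explicit.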

There are a few surprises with Kerf. One is the assignment operator.

a: range(5);
b: repeat(5,1);
KeRF> b
b
  [1, 1, 1, 1, 1]
KeRF> a
a
  [0, 1, 2, 3, 4]


Seems odd. On the other hand, it looks a lot like JSON. In fact, you can compose things into a map in a very JSON-like syntax:


aa:{a: 1 2 3, 
    b:'a bit of data', 
    c:range(10)};

KeRF> aa['a']
aa['a']
  [1, 2, 3]

KeRF> aa
aa
  {a:[1, 2, 3], 
   b:"a bit of data", 
   c:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}


This seems like syntax sugar, but it actually helps. For example, if I had to feed variable ‘aa’ to something that likes to digest JSON representations of data, it pops it out in ASCII JSON:


json_from_kerf(aa)
  "{\"a\":[1,2,3],\"b\":\"a bit of data\",\"c\":[0,1,2,3,4,5,6,7,8,9]}"

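The same round trip is easy to mimic in plain Python, for comparison (json_from_kerf is Kerf’s; json.dumps is the stand-in here):

```python
import json

# The same map as 'aa' above, as a Python dict
aa = {"a": [1, 2, 3],
      "b": "a bit of data",
      "c": list(range(10))}

# Serialize to compact ASCII JSON, much as json_from_kerf does
print(json.dumps(aa, separators=(",", ":")))
# {"a":[1,2,3],"b":"a bit of data","c":[0,1,2,3,4,5,6,7,8,9]}
```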

OK, no big deal; a language with some APL qualities that speaks JSON. This is pretty good, but we’d be crazy to attempt to charge money for something like this (Kerf is not open source; Kevin and I have to eat). The core technology is a clean APL that speaks JSON, but the thing which is worth something is the database engine. Tables in Kerf look like interned maps and are queried in the usual SQL way.


u: {{numbers: 19 17 32 8 2 -1 7, 
     strings: ["A","B","C","D","H","B","Q"]}}
select * from u where numbers>18

┌───────┬───────┐
│numbers│strings│
├───────┼───────┤
│     19│      A│
│     32│      C│
└───────┴───────┘

select numbers from u where strings="B"

┌───────┐
│numbers│
├───────┤
│     17│
│     -1│
└───────┘


Now the business with ‘:’ starts to make more sense. Since SQL is part of the language, the ‘=’ sign is busy doing other things; it can’t also perform assignment. Your eyes no longer have to distinguish ‘==’ from ‘=’ by context: everything with an ‘=’ is an equality test, and everything with a ‘:’ is setting a name somewhere.

Standard joins are available with left join:


v:{{a: 1 2 2 3, numbers: 19 17 1 99}}
left_join(v,u,"numbers")

┌─┬───────┬───────┐
│a│numbers│strings│
├─┼───────┼───────┤
│1│     19│      A│
│2│     17│      B│
│2│      1│   null│
│3│     99│   null│
└─┴───────┴───────┘
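As a point of comparison, here is a rough pandas sketch of the same left join (pandas is an assumption here; note it fills unmatched rows with NaN where Kerf shows null):

```python
import pandas as pd

# The tables u and v from the examples above
u = pd.DataFrame({"numbers": [19, 17, 32, 8, 2, -1, 7],
                  "strings": list("ABCDHBQ")})
v = pd.DataFrame({"a": [1, 2, 2, 3],
                  "numbers": [19, 17, 1, 99]})

# left_join(v, u, "numbers"): keep every row of v, pull matches in from u
out = v.merge(u, on="numbers", how="left")
print(out)
# rows with numbers 1 and 99 get NaN strings, matching the nulls above
```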


For time series, a good time type is important: preferably first class, and with the ability to look at nanoseconds. So are asof joins.


qq:{{nums: range(10), 
    date: 1999.01.01+ (24 * 3600000000000)  * range(10), 
    strg:["a","b","c","d","e","f","g","h","i","j"]}}
vv:{{nums: 10+range(10), 
    date: 1999.01.01+ (12 * 3600000000000)  * range(10), 
    strg:["a","b","c","d","e","f","g","h","i","j"]}}

select nums,nums1,mavg(3,nums1),strg,strg1,date from asof_join(vv,qq,[],"date")

┌────┬─────┬────────┬────┬─────┬───────────────────────┐
│nums│nums1│nums11  │strg│strg1│date                   │
├────┼─────┼────────┼────┼─────┼───────────────────────┤
│  10│    0│     0.0│   a│    a│             1999.01.01│
│  11│    0│     0.0│   b│    a│1999.01.01T12:00:00.000│
│  12│    1│0.333333│   c│    b│             1999.01.02│
│  13│    1│0.666667│   d│    b│1999.01.02T12:00:00.000│
│  14│    2│ 1.33333│   e│    c│             1999.01.03│
│  15│    2│ 1.66667│   f│    c│1999.01.03T12:00:00.000│
│  16│    3│ 2.33333│   g│    d│             1999.01.04│
│  17│    3│ 2.66667│   h│    d│1999.01.04T12:00:00.000│
│  ⋮│    ⋮│      ⋮│   ⋮│   ⋮│                      ⋮│
└────┴─────┴────────┴────┴─────┴───────────────────────┘
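The shape of this query can be approximated in pandas with merge_asof, for readers who want a familiar reference point (a sketch only; the tables below mimic vv and qq, and pandas requires both to be sorted on the key):

```python
import pandas as pd

# qq: one row per day; vv: one row per 12 hours, like the tables above
qq = pd.DataFrame({
    "nums": range(10),
    "date": pd.Timestamp("1999-01-01") + pd.to_timedelta(range(10), unit="D"),
})
vv = pd.DataFrame({
    "nums": [n + 10 for n in range(10)],
    "date": pd.Timestamp("1999-01-01") + pd.to_timedelta([12 * n for n in range(10)], unit="h"),
})

# asof_join(vv, qq, [], "date"): for each vv row, the most recent qq row
out = pd.merge_asof(vv, qq, on="date", suffixes=("", "1"))
print(out[["nums", "nums1", "date"]])
# nums1 runs 0, 0, 1, 1, 2, 2, ... exactly as in the Kerf table
```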


Kerf is still young and occasionally rough around the edges, but it is quite useful as it exists now: our customers and partners think so, anyway. The only things comparable to it from an engineering standpoint are the other APL-based databases, such as KDB and Jd. We think we have some obvious advantages in usability, and less obvious advantages in the intestines of Kerf. Columnar databases like Vertica and Redshift are great for some kinds of problems, but they don’t really compare: they can’t be extended in the same way that Kerf can, nor are they general-purpose programming systems, which Kerf is.

We also have a lot of crazy ideas for building out Kerf as a large scale distributed analytics system. Kerf is already a suitable terascale database system; we think we could usefully expand out to hundreds of terabytes on data which isn’t inherently time oriented if someone needs such a thing. There is no reason for things like Hadoop and Spark to form the basis of large scale analytic platforms; people simply don’t know any better and make do with junk that doesn’t really work right, because it is already there.

You can download a time-limited version of Kerf from GitHub here.

John Earnest has been doing some great work on the documentation as well.

I’ve also set up a rudimentary way to work with Kerf in emacs.

Also, for quick and dirty exposition of the core functions: a two-page refcard.

Keep up with Kerf at our company website:
www.kerfsoftware.com

Kerf official blog:
getkerf.wordpress.com

A visionary piece outlining a sort of “Cloud Kerf”:
http://conceptualorigami.blogspot.com/2010/12/vector-processing-languages-future-of.html



7 Responses


  1. pd said, on December 15, 2015 at 10:44 pm

    Finally, a blog post of yours that I can put into practice at work (and also not get a headache trying to comprehend the syntax)!

    The company I work at is finally getting to the scale where processing of event/stat data is becoming too unwieldy to do in a general-purpose way, and we’ve had a lot of trouble with Hadoop. That said, as much of an uphill battle as Kerf would have faced anyway, a closed-source language is a non-starter, as I imagine it is most elsewhere in SV (for better or worse). Even if it weren’t, I would have had to bring in a prototype/demonstration before bringing it up, and I’m sure I can’t afford a full license on my lonesome. As you said yourself, the real money is in the database engine. It would be awesome to see the language itself open-sourced and the db engine offered as a paid plugin/library/separate binary/what have you (although my ignorance of the underlying technology might give me false hope to this possibility). Not trying to begrudge you well-deserved profit, but I’d be remiss not to point out that many open-source language/database creators/maintainers have found good money via that route. Even just a free license for non-commercial work would be awesome, and could prove to be a great gateway drug into the APL/J world.

    All that said, I’d love a blog post of your thoughts on Hadoop/Spark as I’ve been beating that drum for a while. Unfortunately the status quo in SV is harder to break than people would have you believe 😉

    • Scott Locklin said, on December 15, 2015 at 11:46 pm

      The way Kerf is written, you can’t really remove one thing from the other, and in any case, let’s face it: someone would want that for free as well.

      We are definitely aware of the reluctance of SV to pay for software, and wish those kinds of companies well, making the data centers and power companies of the world rich, wastefully heating up the world with the JVM. Data centers and power companies are among our potential verticals (along with Wall Street, who are happy to throw a little money at us to make problems go away).

      Meanwhile, if you want to play with Kerf, I encourage you to download the documentation and binary and play with it; it will expire after a while, but for the moment, you can simply download a new copy when this happens. The “Cloud Kerf” piece doesn’t exist yet, but you should be able to work on decent sized problems on a beefy server.

      I had a very good experience with Jd back in May in comparison to a small Spark cluster, if you want that as a data point.
      http://jsoftware.com/pipermail/database/2015-May/000076.html
      There will be more such data points compared to Kerf as I go along. Hopefully with something you can download and try yourself.

  2. AB said, on December 16, 2015 at 4:57 am

    Exciting work, Scott; please keep posting progress here. I’ve been looking left and right for a less prohibitive alternative to KDB without success (yet). I think Kx Systems’ pricing model is short-sighted and out of reach except for bulge-bracket banks; Kerf could fill the void if it gets the value proposition right.

  3. Chris Stucchio said, on December 17, 2015 at 12:32 am

    I left a comment on Hacker News, and I’m going to leave it here as well, since I’m hoping you can explain why I’m wrong.

    I don’t agree with you that “The only thing comparable to it from an engineering standpoint are the other APL based databases such as KDB and Jd” is correct, at least not based on this blog post. Numpy+pandas also seems pretty similar, as is breeze, and Julia has similar features also.

    Idiomatic numpy:

    def wavg(x, y):
        return sum(x*y) / sum(x)

    Less idiomatic, but analogous to Kerf or J:

    def wavg(x, y):
        return add.reduce(x*y) / add.reduce(x)

    arange(5) is equivalent to his range(5), and full((5,), 1) is equivalent to repeat(5,1).

    The SQL-ish engine looks fairly similar to pandas:
    df = pandas.DataFrame({'a': arange(5), 'b': ['a', 'b', 'c', 'd', 'e']})
    #Result is:
    a b
    0 0 a
    1 1 b
    2 2 c
    3 3 d
    4 4 e

    df[df.a > 2]
    #Result is
    a b
    3 3 d
    4 4 e

    Pandas similarly has joins, averages and other aggregates, etc. E.g.:

    pandas.merge(df1, df2, how='inner', left_on='a', right_on='b')

    is equivalent to:

    SELECT * FROM df1 INNER JOIN df2 ON df1.a = df2.b;

    Syntactically it’s a bit different (embedded SQL is pretty cool) but from this blog post I don’t see any major semantic differences. However, if there are important differences under the hood, I’d love to hear about them.

    • Scott Locklin said, on December 17, 2015 at 12:43 am

      Already replied over there.

      There are lots of “table” like solutions out there: R’s data.frame and data.table are obvious examples. Speed, the ability to persist data as an actual database system, and the ability to work with data “out of core” would be the advantages over Pandas. I’ll be releasing various performance comparisons over coming weeks to demonstrate these advantages.

      Anyway, yes, of course I will be discussing the DB engine and all kinds of other doodads we’re cooking up. But I have doodad cooking to do.

  4. John Baker said, on December 17, 2015 at 2:47 pm

    An interesting and intriguing post.

    Merging SQL like facilities into programming languages is something that many have tried but your approach seems less contrived and afterthoughty.

    I’ll have to study Kerf some more.

  5. atp said, on December 30, 2015 at 8:11 pm

    Scott, I haven’t done anything with them myself, but you might find these projects interesting: The Remora static type system for J or other APL-like languages, and the Riposte trace-driven JIT compiler for R:

    An Array-Oriented Language with Static Rank Polymorphism
    http://www.ccs.neu.edu/home/pete/pub/esop-2014.pdf
    https://scholar.google.com/citations?view_op=view_citation&citation_for_view=4FcaMpgAAAAJ:4DMP91E08xMC

    Riposte: A Trace-Driven Compiler and Parallel VM for Vector Code in R
    http://adv-r.had.co.nz/Performance.html#faster-r
    https://github.com/jtalbot/riposte

