Locklin on science

Atlas data structure: columnar NoSQL for Kerf

Posted in Kerf, tools by Scott Locklin on August 26, 2016

Kerf introduces a new data structure called an “Atlas” that we think is a helpful thing to have inside of a programming language. We’ve found it very useful to have database Tables inside the language, and the Atlas concept is a natural extension of that. Be careful about jumping to conclusions, though, since an Atlas does a few special things that make it unlike what you might expect. I’ll give a preview of the major points here:

Automatic indexing
Responds to SQL
First-class language type
Columnar implementation

We didn’t spend much time talking about Atlases originally, because interest in NoSQL-style computation seemed to be on the wane. Recently, however, we’ve had a lot of requests for those sorts of applications, and so a detailed description seems in order. (If you’ve come across any interesting problems like that in your work, tell us about them.)

If a Table is a way of including a SQL table inside of the language, then an Atlas is a way of including a NoSQL store: none of the rows have to have the same columns. In other words, an Atlas is a schemaless Table, a generalization of the Table where the rows don’t have to be uniform. In one sense, an Atlas is a sparse Table; you could also think of a Table as a vectorized Atlas.
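To make “schemaless, sparse and columnar” concrete, here is a minimal sketch in Python of the storage idea. This is an illustration of the concept only, not Kerf’s implementation, and the class and method names are invented for the example:

class AtlasSketch:
    def __init__(self):
        self.nrows = 0
        self.columns = {}  # column name -> (row ids, values), two parallel lists

    def insert(self, row):
        # Append a dict with arbitrary keys; only the columns it mentions grow.
        rid = self.nrows
        self.nrows += 1
        for name, value in row.items():
            ids, vals = self.columns.setdefault(name, ([], []))
            ids.append(rid)
            vals.append(value)
        return rid

    def column(self, name):
        # Dense view of one column, with None where a row lacks it.
        ids, vals = self.columns.get(name, ([], []))
        out = [None] * self.nrows
        for rid, v in zip(ids, vals):
            out[rid] = v
        return out

a = AtlasSketch()
a.insert({"name": "lion", "legs": 4})
a.insert({"name": "kerf", "paradigm": "vector"})
print(a.column("legs"))  # [4, None]: rows needn't share columns, storage stays columnar

The real thing adds the automatic indexing and SQL interface from the list above; the sketch only shows how rows with different columns can share one column-oriented store.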

read the full post here.


Kerf meets smartgrid data

Posted in Kerf, Uncategorized by Scott Locklin on May 16, 2016

Kerf is primarily designed as a tool for dealing with data streams. The most obvious streams of data which bring in customers are stock market data, but there are lots of interesting data streams in the world; enough that people are starting to have a name for this vertical, aka the “internet of things.” One of the “things of the internet” I’ve worked with in the past is smartmeter data: a stream of electrical power consumption data from individual power meters in homes, commercial sites, factories, electric car chargers and what have you. While everyone knows that markets create big streams of potentially interesting data, the data from power grids is also big and interesting. It’s big because a typical city might have millions of power meters in it, recording data at minute or quarter-hour increments. It’s interesting because power companies need to know what is going on with their networks to provide the uninterrupted electrical service we all take for granted. It’s also very interesting to power companies which rely on “renewables” such as wind or solar, which are not steady or controllable producers of power. If your power grid relies mostly on renewables, you have all kinds of complicated operations research problems to deal with during times when it is cloudy, dark or the breezes are calm.

Problems

The data volume of “internet of things” applications can be unwieldy to manage in standard tools, but there are analytics challenges as well. A standard challenge for smartgrid work is doing forecasts, and such models can get really hairy. Every smart meter installation is quite different from every other: a power meter for a car charging station is vastly different from that of a school, an apartment or a house, and all of those are different from one another. The “users” of each apartment and house have different schedules, appliances and devices. Getting this right in an automated way can pay big dividends: you can provide services to the people with the meter, and you can make more accurate city-wide forecasts if you understand the generating process at the individual meter level.

If you want to get something like this right in an automated fashion, it’s a fairly large project involving paying people like me to think about it for a few months. Doing all this in an automated, general and reliable way involves feature creation, feature selection algorithms, machine learning, data gathering, data cleansing, weather forecasting, wind forecasting, optimization and the kitchen sink. I’d love to eventually have the full suite of tools for such a project available in Kerf as primitives, but we don’t know whether customers would appreciate or use such things, so I’ll make do with some small steps for this blog.

continued at the Kerf blog


String Interning Done Right

Posted in Design, Kerf by Scott Locklin on February 22, 2016

Refresher

The common method for interning strings breaks in fantastic ways. In Kerf, we’ve taken the old method and revised it for success in the current generation of languages.

If you’ve forgotten what string interning is from your time with Java, it’s a simple way of ensuring that any string appears only once. So for instance, if I have ten objects whose type is the string “lion”, I don’t have ten copies of the string “lion” in memory. What I have instead is a single copy of “lion” and then ten links which all point to the single copy somehow.

Most often these links are pointers (raw addresses in memory); we’ll discuss how this breaks soon. The reference copy of “lion” is never allowed to be moved, changed, or released: it’s permanent. The other implementation detail to work out is how to keep the strings unique. The next time we create a “lion” string we need to trade our string in for a link, and what is usually done is to store the reference copy in a hashtable or some other deduplicating data structure. This lets us discover that “lion” already exists, and when we perform the check we can walk away with our link at the same time. If the phrase is “bullmation” and it doesn’t already exist, then we can add the initial copy to the data structure at the same time.
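Here is the classic scheme sketched in Python, with integer handles standing in for the raw pointers described above. This is illustrative only (the names are invented for the example), not how Kerf does it:

_table = {}    # string -> handle: the deduplicating structure
_strings = []  # handle -> reference copy; entries are never moved or removed

def intern_str(s):
    # Trade a string for a link; add the reference copy if it is new.
    h = _table.get(s)
    if h is None:
        h = len(_strings)
        _strings.append(s)
        _table[s] = h
    return h

lion_a = intern_str("lion")
lion_b = intern_str("lion")
assert lion_a == lion_b   # ten "lion"s share one reference copy
intern_str("bullmation")  # new phrase: added to the structure on first sight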

…read the rest… http://getkerf.wordpress.com/2016/02/22/string-interning-done-right/

An introduction to Kerf combinators

Posted in Kerf by Scott Locklin on February 9, 2016

One of the most powerful features of Kerf is its set of combinators. Kerf is a vector language: you can operate on vectors and matrices as if they were natural data units, achieving interpreter speedups over looping constructs. While everyone is used to for loops, interpreters do very badly on them, which is why people programming in R and Matlab are always encouraged to use the vector constructs when available. If you think about what goes on in various kinds of interpreter, you realize that there is a lot going on inside of a for loop. Depending on how the interpreter is implemented, you may have to parse each line in the loop on every iteration; you have to evaluate test conditions, maintain state and so on.

Even if your loop is trivial and does some element-wise operation on vectors, it ends up going slower than it would as a vector operation. Interpreters need to check things, build stacks and move things around inside a for loop. For example, in R:

a <- rnorm(1000000)
sumint <- function(v) { s <- 0; for (x in v) { s <- s + x }; s }  # naive for-loop sum
system.time(sumint(a))  # the interpreted loop is orders of magnitude slower...
system.time(sum(a))     # ...than the vectorized primitive

Compilers can generally optimize down to the metal, so that’s one way out. In Kerf, you can use combinators to help the interpreter avoid these problems. For example, summation amounts to putting + in between all the elements of a vector. Fold puts the function to its left in between all of the elements of the thing on the right.

timing(1)
a:rand(1000000,1.0)
+ fold a
1ms

This is a trivial application, but it illustrates the power of the idea, and its speed in action. Tell the interpreter to do a lot of things at once, and it will be done at close to machine speed, even if there is no compiled primitive to accomplish the task. With a good set of combinators you can achieve the satori known as “No Stinking Loops.”
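Fold is what most languages call reduce. For readers who want the analogy in a more familiar language, here is a rough Python equivalent of the “+ fold a” example above (illustrative only, and without Kerf’s speed):

from functools import reduce
import operator
import random

a = [random.random() for _ in range(1000000)]  # like a: rand(1000000, 1.0)
total = reduce(operator.add, a)                # like + fold a: put + between all the elements
print(total)

In CPython, functools.reduce runs its loop in compiled code rather than interpreted bytecode, which is the same trick on a smaller scale: hand the runtime one big job instead of a million small ones.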

Sum is a trivial example (there is a primitive in Kerf for this), but one which illustrates the power… continued at the Kerf blog