Locklin on science

Timestamps done right

Posted in Design, Kerf by Scott Locklin on January 19, 2016

(Crossposted to Kerf blog)

I’ve used a lot of tools meant for dealing with time series. Heck, I’ve written a few at this point. The most fundamental piece of dealing with timeseries is a timestamp type. Under the covers, a timestamp is just a number which can be indexed. Normal humans have a hard time dealing with a number that represents seconds since the epoch, or nanoseconds since whenever. Humans need to see things which look like the ISO format for timestamps.
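
To make that concrete, here is a minimal sketch in base R (my example, not from the post or the Kerf docs) of the raw number underneath a timestamp and the ISO-style form a human wants to see:

a_stamp <- as.POSIXct("2012-01-01 00:00:00", tz = "GMT")
as.numeric(a_stamp)                   # 1325376000 -- seconds since the Unix epoch
format(a_stamp, "%Y-%m-%dT%H:%M:%S")  # "2012-01-01T00:00:00" -- what a human wants to see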

Very few programming languages have timestamps as a native type. Some SQLs do, but SQL isn’t a very satisfactory programming language by itself. At some point you want to pull your data into something like R or Matlab and deal with your timestamps in an environment that you can do linear regressions in. Kerf is the exception.

Consider the case where you have a bunch of 5 minute power meter readings (say, from a factory) with timestamps. You’re probably storing your data in a database somewhere, because it won’t fit into memory in R. Every time you query your data for a useful chunk, you have to parse the stamps in the chunk into a useful type; timeDate in the case of R. Because the guys who wrote R didn’t think to include a useful timestamp data type, the DB package doesn’t know about timeDate (it is an add on package), and so each timestamp for each query has to be parsed. This seems trivial, but a machine learning gizmo I built was entirely performance bound by this process. Instead of parsing the timestamps once in an efficient way into the database, and passing the timestamp type around as if it were an int or a float, you end up parsing them every time you run the forecast, and in a fairly inefficient way. I don’t know of any programming languages other than Kerf which get this right. I mean, just try it in Java.
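
To get a feel for the overhead, here is a hedged sketch in R (not the actual forecasting system; the row count, variable names, and five-minute spacing are invented for illustration) comparing re-parsing timestamp strings on every query with arithmetic on an already-parsed type:

n <- 1e6
# five-minute meter readings, stored as strings the way a DB driver hands them back
stamps <- format(as.POSIXct("2012-01-01", tz = "GMT") + 300 * (0:(n - 1)))
system.time(parsed <- as.POSIXct(stamps, tz = "GMT"))  # parsing a million strings: the slow part
system.time(parsed + 300)                              # arithmetic on the parsed type: cheap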

Kerf gets around this by integrating the database with the language.

Kerf also has elegant ways of dealing with timestamps within the language itself.

Consider a timestamp in R’s timeDate. The add-on packages timeDate plus zoo or xts are my favorite way of doing such things in R, and the combination I know best, so this will be my comparison class.


 

require(timeDate) 
a=as.timeDate("2012-01-01")
GMT
[1] [2012-01-01]

 

In Kerf, we can just write the timestamp down


 

a:2012.01.01
  2012.01.01

 

A standard problem is figuring out what a date is relative to a given day. In R, you have to know that it’s basically storing seconds, so:


 

as.timeDate("2012-01-01") + 3600*24
GMT
[1] [2012-01-02]

 

In Kerf, just tell it to add a day:


 

2012.01.01 + 1d
  2012.01.02

 

This gets uglier when you have to do something more complex. Imagine you have to add a month and a day. Doing this in general in R is complex and involves writing functions; a sketch of one such function follows.
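
For illustration (a sketch of my own, not anything from the original post), one way to write such a function in base R is to bump the POSIXlt month and day fields and let the conversion back to POSIXct normalize any overflow; month-end behavior is whatever that normalization gives you:

# A sketch of calendar arithmetic in base R via POSIXlt.
addMonthsDays <- function(x, months = 0, days = 0) {
  lt <- as.POSIXlt(x)
  lt$mon  <- lt$mon  + months   # out-of-range months roll over into the next year
  lt$mday <- lt$mday + days     # out-of-range days roll over into the next month
  as.POSIXct(lt)
}
addMonthsDays(as.POSIXct("2012-01-01", tz = "GMT"), months = 1, days = 1)
# [1] "2012-02-02 GMT"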

In Kerf, this is easy:


 

2012.01.01 + 1m1d
  2012.02.02

 

Same story with hours, minutes, and seconds:


 

2012.01.01 + 1m1d + 1h15i17s
  2012.02.02T01:15:17.000

 

And if you have to find a bunch of times which are a month, day, hour and 15 minutes and 17 seconds away from the original date, you can do a little Kerf combinator magic:


 

b: 2012.01.01 + (1m1d + 1h15i17s) times mapright  range(10)
  [2012.01.01, 2012.02.02T01:15:17.000, 2012.03.03T02:30:34.000, 2012.04.04T03:45:51.000, 2012.05.05T05:01:08.000, 2012.06.06T06:16:25.000, 2012.07.07T07:31:42.000, 2012.08.08T08:46:59.000, 2012.09.09T10:02:16.000, 2012.10.10T11:17:33.000]

 

The mapright combinator runs the verb and noun on its left against each element of the vector on its right. So you’re multiplying (1m1d + 1h15i17s) by range(10) (which is the usual [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), then adding the results to 2012.01.01.

You can’t actually do this in a simple way in R. Since there is no convenient token to add a month, you have to generate a time sequence with monthly periods. The rest is considerably less satisfying as well, since you have to add the remaining offsets by hand, in raw seconds. In my opinion, this is vastly harder to read and maintain than the Kerf line.


 

b=timeSequence(from=as.timeDate("2012-01-01"),length.out=10,by="month") + (3600*24 + 3600 + 15*60 + 17) *0:9
 [2012-01-01 00:00:00] [2012-02-02 01:15:17] [2012-03-03 02:30:34] [2012-04-04 03:45:51] [2012-05-05 05:01:08] [2012-06-06 06:16:25] [2012-07-07 07:31:42] [2012-08-08 08:46:59] [2012-09-09 10:02:16] [2012-10-10 11:17:33]

 

This represents a considerable achievement in language design: an APL which is easier to read than a commonly used programming language for data scientists. I am not tooting my own horn here; Kevin did it.

If I wanted to know what week or second these times occur at, I can subset the implied fields in a simple way in Kerf:


 

b['week']
  [1, 6, 10, 15, 19, 24, 28, 33, 37, 42]
b['second']
  [0, 17, 34, 51, 8, 25, 42, 59, 16, 33]

 

I think the way to do this in R is with the “.endpoints” function, but it doesn’t seem to do the right thing:


 

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04 LTS
other attached packages:
[1] xts_0.9-7         zoo_1.7-12        timeDate_3012.100

.endpoints(b, on="week")
 [1]  0  1  2  3  4  5  6  7  8  9 10
.endpoints(b, on="second")
 [1]  0  1  2  3  4  5  6  7  8  9 10

 

You can cast to a POSIXlt and get the second at least, but no week of year.


 

as.POSIXlt(b)$week
NULL
as.POSIXlt(b)$sec
 [1]  0 17 34 51  8 25 42 59 16 33

 

Maybe you can do it using one of the other date classes, like as.Date:


 

weekGizmo<-function(x){ as.numeric(format(as.Date(time(x))+3,"%U")) }

 

Not exactly clear, but it does work. If you have ever done things with time in R, you will have had an experience like this. I’m already reaching for a few different kinds of date and time objects in R. There are probably a dozen kinds of timestamps in R which do different subsets of things, because whoever wrote them wasn’t happy with what was available at the time. One good one is better. That way, when you have some complex problem, you don’t have to look at 10 different R manuals and add-on packages to get your problem solved.

Here’s a more complex problem. Let’s say you had a million-point timeseries with some odd periodicities, and you want to find the values which occur in week 10 of the year, at second 33 of the minute.


 

ts:{{pwr:rand(1000000,1.0),time:(2012.01.01 + (1h15i17s times mapright  range(1000000)))}}
timing(1)
select *,time['second'] as seconds,time['week'] as weeks from ts where time['second']=33 ,time['week'] =10

┌────────┬───────────────────────┬───────┬─────┐
│pwr     │time                   │seconds│weeks│
├────────┼───────────────────────┼───────┼─────┤
│0.963167│2012.03.01T01:40:33.000│     33│   10│
│0.667559│2012.03.04T04:57:33.000│     33│   10│
│0.584127│2013.03.06T05:06:33.000│     33│   10│
│0.349303│2013.03.09T08:23:33.000│     33│   10│
│0.397669│2014.03.05T01:58:33.000│     33│   10│
│0.850102│2014.03.08T05:15:33.000│     33│   10│
│0.733821│2015.03.03T22:50:33.000│     33│   10│
│0.179552│2015.03.07T02:07:33.000│     33│   10│
│       ⋮│                      ⋮│      ⋮│    ⋮│
└────────┴───────────────────────┴───────┴─────┘
    314 ms

 

In R, I’m not sure how to do this in an elegant way. You’d have to use a function that outputs the week of the year, then something like the following (which, FWIW, is fairly slow) to do the query.


 

require(xts)
ts=xts(runif(1000000), as.timeDate("2012-01-01") + (3600 + 15*60 + 17) *0:999999)
weekGizmo<-function(x){ as.numeric(format(as.Date(time(x))+3,"%U")) }
queryGizmo <- function(x) {
 wks  = weekGizmo(x)              # week of year for each observation in x
 secs = as.POSIXlt(time(x))$sec   # seconds-within-minute for each observation
 cbind(x, wks, secs) -> newx
 newx[(wks == 10) & (secs == 33)]
}
system.time(queryGizmo(ts))
   user  system elapsed 
  4.215   0.035   4.254

 

The way R does timestamps isn’t terrible for a language designed in the 1980s, and the profusion of time classes is to be expected from a language that has been around that long. Still, it is 2016, and there is nothing appreciably better out there other than Kerf.

Lessons for future language authors:

(continues at official Kerf blog)

 


14 Responses


  1. jmount said, on January 19, 2016 at 6:25 pm

    Scott: thought you would enjoy this bit of utter nonsense from Python’s date/time libraries. Seems this glitch is “well known” but will never be fixed. The claim is the library looks up a different time zone (like one defined back in 1901) depending on calling path (even though I have a time zone object and not a mere time zone name).

    import datetime
    import pytz

    # data from external application, timestamp and timezone
    shangaiTime = "2015/10/01 08:45:00.183455"
    tz = pytz.timezone('Asia/Shanghai')

    # parse
    print("timeZone " + str(tz))
    ## timeZone Asia/Shanghai
    unaware = datetime.datetime.strptime(shangaiTime, '%Y/%m/%d %H:%M:%S.%f')
    print("raw timestamp " + str(unaware))
    ## raw timestamp 2015-10-01 08:45:00.183455

    # The WRONG way to apply the timezone
    aware = unaware.replace(tzinfo=tz)
    print('tz "aware" object ' + str(aware))
    ## tz "aware" object 2015-10-01 08:45:00.183455+08:06
    utctime = aware.astimezone(pytz.UTC)
    print('wrong utctime ' + str(utctime))
    ## wrong utctime 2015-10-01 00:39:00.183455+00:00

    # The right way to apply the timezone
    aware = tz.localize(unaware)
    print('tz aware object ' + str(aware))
    ## tz aware object 2015-10-01 08:45:00.183455+08:00
    utctime = pytz.utc.normalize(aware)
    print('utctime ' + str(utctime))
    ## utctime 2015-10-01 00:45:00.183455+00:00

  2. John Baker said, on January 19, 2016 at 7:52 pm

    It was Kerf’s native atomic time-stamp data type that caught my attention when I first heard about the language. This is something that I believe occurs in Arthur Whitney’s recent Q but is absent in J, Nial, all APLs, and every other language I am familiar with. The utility of such a type is undeniable. It will certainly find greater use than say complex numbers in J.

    I am not sure why it has taken so long to introduce time in this fashion. I suspect it’s because the way we measure time is a bit of a kludge. Living on a planet that has an inconvenient orbit has forced us to divide time into messy units. It’s interesting to note that “decimalization” of time, briefly tried in the French revolutionary period, did not stick like the other decimal units.

    When it came time to define fundamental types, I’d be willing to bet that many language designers loathed the idea of contrived time units, like days and months, being accorded the same fundamental status as integers. Regardless of the reasons, Kerf gets this very right. So right I expect others will follow in the years to come.

    • Scott Locklin said, on January 19, 2016 at 9:48 pm

      There is some kind of timestamp type in Jd, but it appears to be undocumented. Not sure if it uses the pieces in the types/datetime addon; if it does, they’re not optimal. Q/K has timestamps; we like ours a little better.

      Most of what you need in a timestamp type exists in POSIX, so it isn’t difficult to do. I think it’s just one of those things where a language designer considers it as an afterthought, if at all. Most people’s day-to-day work in a programming language doesn’t involve timestamps. Just like the designers of Java forgot that there were these weird people who use matrix math to accomplish things; these matters tend to be left out, because it’s what people are thinking about at any given moment that drives language design. At least that is what it seems like to me.

  3. flanagan314 said, on January 20, 2016 at 3:17 am

    Ugh, timestamps. I spent the entire day up to my eyeballs in timestamp code, admittedly in the clumsy and random language of C++ rather than your more elegant tool for a more civilized age. I have the unenviable task of changing our entire codebase from microsecond timestamps to nanosecond timestamps, as well as making the timestamps int64_t instead of uint64_t. What the fuck, Mr. Previous Maintainer? 64 signed bits of microseconds doesn’t overflow for millennia, WTF is the point of making it unsigned? Was he TRYING to make timestamp math harder than it has to be? And even in nanoseconds, 64 signed bits doesn’t wrap around until the 23rd century. Let Spock and Scotty deal with the overflows.
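
    (For reference, the arithmetic behind that claim, sketched in R; these are just back-of-the-envelope figures:)

    2^63 / (1e9 * 60 * 60 * 24 * 365.25)         # ~292.3 years of nanoseconds fit in a signed 64-bit integer
    1970 + 2^63 / (1e9 * 60 * 60 * 24 * 365.25)  # so a Unix-epoch nanosecond counter overflows around the year 2262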

    • Scott Locklin said, on January 20, 2016 at 3:24 am

      That comment, almost verbatim (re: ns overflow), is in our documentation somewhere.

      • flanagan314 said, on January 20, 2016 at 3:38 pm

        There is a certain poetic justice in the expiration being in the 23rd Century that way. They can fix it by waving their magic technobabble wand over the source code and be on their merry way.

        This would actually make an AMAZING plot for a Star Trek episode:

        “Captain, sensors indicate that we have suddenly warped approximately 584 years into the past!”

        “Nonsense, Ensign. The starbase is still right there.”

        “But Captain, the sensors CLEARLY indicate that the current year is 1677!”

        “Data, what is your analysis of the situation?”

        “As you know, Captain, I have previous experience with time travel, and in the course of those experiences, I found it necessary to modify my timestamp processing to use 128 signed bits of precision, in order to accommodate any reasonable nanosecond time measurement within the bounds of the observable universe. I learned the technique from, of all things, a robot engaged in parking cars in a far-future entertainment venue of some kind. He had been required to deal with this problem from first principles, after personally having to deal with the universe itself wrapping around to zero.”

    • Scott Locklin said, on January 20, 2016 at 5:26 am

      If I had infinite time, I’d look for signs of a bad time representation (say, encoding millisecond timestamps as floats) on the order books. I bet you can spot and trade against such things. Certainly there is plenty of juice in looking for people with bad clocks in other ways (people using nbbo clocks or whatever), but there would be more justice in this.

  4. jon spencer said, on January 20, 2016 at 4:19 pm

    One of the better uses of a variation of the timestamp is in Major League Baseball and National Hockey League reviews. Different camera angles are timestamped to see what, where, and when.

    • Scott Locklin said, on January 20, 2016 at 6:34 pm

      Traders and power companies buy more software.

  5. Mr. Tumnis said, on January 20, 2016 at 8:16 pm

    Off-topic, but what do you think of Duncan’s paradox and the alleged resulting experimental breakdown of the 2nd Law? See 1. https://en.wikipedia.org/wiki/Duncan%27s_Paradox 2. http://fqmt.fzu.cz/15/func/viewpdf.php?reg=534&num=1

    • Scott Locklin said, on January 21, 2016 at 6:58 pm

      Looks interesting, but I’d bet on the second law ultimately winning.

      • Mr. Tumnis said, on January 21, 2016 at 11:52 pm

        I couldn’t possibly disagree. I recall someone (was it Eddington?) who essentially said anything was up for grabs, but if your theory violates the 2nd Law, forget about it.

        Strange, though, that this hasn’t garnered greater attention and scrutiny.

        • Scott Locklin said, on January 22, 2016 at 12:47 am

          A seat-of-the-pants guess (between writing emails and marketing docs; sorry, startupland is tough): this system, despite being a black box, is out of equilibrium somehow (heated catalysts?). Catalysis is generally not understood. I keep asking my thesis advisor if someone has figured it out yet, and not hearing anything promising.

