Locklin on science

Search engines for grownups

Posted in semantic web, tools by Scott Locklin on March 15, 2013

Google is an amazing company. It is so all-pervasive it has become a verb. It also annoys the hell out of me, and I avoid it whenever I can. No matter how annoying their interface becomes, or how many weird and privacy invading things they do, no matter how many crypto-religious fruitcakes they hire, they’re  the only game in town for most people.  I don’t like monopolies. I think monopolies are inherently evil and should be shunned by people  with a conscience, or tamed by the judicial system. Since the US government is presently composed of ninnyhammers obsessed with irrelevant things, and geldings who have forgotten about the anti-trust laws, it falls to the individual to do something about it. Where is Teddy Roosevelt when you need him?

There are alternatives available. The problem is, nobody knows about them. Google dominates people’s thoughts about search the way Microsoft used to dominate people’s ideas about computers in general. Some of the alternatives are very much worth knowing about, even if you are happy with using Google.

For most people, the best alternative is Yandex.com. Yandex is the biggest player in the Russian market. It’s been around for longer than Google has, it is run by mature computer scientists who specialize in machine learning, and it is one of the best search engines you have never heard of. The English-language version of their search engine is considered experimental, but the results are very good. For general search, it is as good as or better than Google: the results are uncannily accurate, and the page is refreshingly free of clutter. The English-language page is missing some “searchy” features at present; for example, there is no English-language news aggregator (which means no news results in the basic search either). This feature exists in Russian, so I assume it is coming. Multimedia? Well, they’re not so hot here, but searching for funny pictures is a rare task for me. Google has a marginal win on maps for the US, mostly for the public transit option that works (Yandex seems OK for driving maps). The Russian-language translation facilities at Yandex are, of course, excellent: much better than Google’s. As a slavophile, I find this invaluable.

One privacy advantage Yandex has which Google never will: Yandex does not do business with American intelligence agencies. I do not like the fact that Google has become an arm of US intelligence agencies. It is to their credit that Google discloses their relationship with the US government (most of Silicon Valley is in bed with the spooks, but they don’t talk about it). It is the surveillance state that I abhor. Yandex may very well be doing the same thing with the Russian government, but the FSB is a much smaller threat to American civil rights than our own spooks. While I see no imminent dangers from the all-seeing eye, and I am far from paranoid, the US is going through a weird time right now, and history is a dark and bloody subject. Do I really want a future government to know what web searches I was doing in 2010? No, thanks, tovarich.


As a crypto-academic consultant, I end up doing a lot of searches for technical papers. Google is OK at this (I have found no utility in “Google Scholar”; the regular search results are equivalent). Yandex actually does significantly better. Of course, these kinds of searches cast a broad net. If you have a decent idea of what you’re looking for, INSPEC is still the gold standard. You have to pay for INSPEC, or walk to a university library, but that is what serious people use for deep search in an academic subject.

Yandex does fail one important use case for me. One of the fundamental ways people get work done on computers is searching for error messages, bugs, and “how-tos” on message boards. If you’re dealing with a computer problem, chances are good that someone else had the problem and asked about it on an online forum, whether it is a compiler directive or a wonky KDE feature. This is a tremendously helpful knowledge base. Google beats everyone at this at present, mostly because you can sort by date. Close behind Google for this use is duckduckgo.com.

I have high hopes for Yandex. While Google hires a lot of rock-star programmers and well-known computer scientists, Google also seems unfocused and adolescent (read the Takimag article for more concrete criticisms). The Yandex guys: they’re grownups. They have succeeded in a country of flinty hard men. People actually died trying to do business in Russia in the 90s; these guys made it. They’ve only been doing English for a little while, and they’re already better than Google at quite a few things. Search in Russian is much harder than search in English, as the language is strongly inflected. So, Yandex solved a much harder problem than Google did at the outset. Google wastes its time with nonsense like Google+ or attempts to bring about the “singularity” by hiring Crazy Ray Kurzweil. Meanwhile, Yandex is using its technology to assist particle physicists at CERN, which seems a bit more impressive. I’ve seen significant improvements in Yandex search results over the past few months. It is very exciting to watch a complex contraption like this improving so quickly. Consider this: they have achieved all this on revenues which are 1/60 of what Google takes in. The flabby marshmallows at Google may not be worried now, but these guys are coming for them. If I had a bunch of steel-hard, brainy Russian Cossacks in my rear-view mirror, I’d be nervous.


On a slightly different topic: one of the hardest things a technical or fact-oriented person looks for on the internets is data. Most search engines are completely useless for this type of thing. It’s really a different type of problem from ordinary search. I have only found two search engines which do this well.

One is Wolfram Alpha, which I made fun of at one point. I now find it indispensable for looking up simple facts and figures, using an English-language query. It doesn’t have large amounts of data, but it’s easy to get to the data it does have: just tell it what you need. Kudos to them for getting this right. It ain’t bad for doing integrals and such either; certainly more convenient than using some long-in-the-tooth open source computer algebra system like Axiom or Maxima. While it kind of sucked when it first came out, the suck is all gone: this is an excellent product every numerate individual should avail themselves of.
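If you want the same one-line-integral convenience locally and scriptably, a minimal sketch using SymPy, an open-source Python CAS (my choice for illustration; the post itself only mentions Axiom and Maxima), looks like this:

```python
# A local, scriptable stand-in for Alpha-style "do this integral" queries,
# using the open-source SymPy library (illustrative choice, not the author's).
from sympy import symbols, integrate, sin, exp, oo

x = symbols('x')

# Indefinite integral of x*sin(x)
print(integrate(x * sin(x), x))           # -> -x*cos(x) + sin(x)

# Definite Gaussian integral from 0 to infinity
print(integrate(exp(-x**2), (x, 0, oo)))  # -> sqrt(pi)/2
```

No natural language front end, of course: you phrase the query in Python rather than pidgin English.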

The other is quandl.com. I have been using it for only a few weeks, and I don’t know how I lived without it. I had a lot less data to work with, and I went through a lot more trouble to obtain it. For quants, this is an indispensable tool for historical economic data. For datanauts in general: ditto. Before Quandl, you had to scrape publicly available data from myriad websites. Post-Quandl, well, it’s easy to get at, and if you register with them, you can download dynamically updated data in easily parsed CSV format all damn day. Hooray for Quandl! Please don’t sell out to some gigantor corp that will make you suck. If you must, sell out to Yandex!



Wolfram Alpha, Semantic Web, and back to the pre AI winter future

Posted in semantic web, Wolfram Alpha by Scott Locklin on May 20, 2009

The latest nerd buzz has been about Stephen Wolfram’s entry into the search engine business. I’ll draw the conclusion before giving you the meat in case you want the executive summary: it kind of sucks.

This isn’t a big surprise. Most things suck. As my friend Philip says, simple solutions are often better. I remember when we worked together, he was fond of pointing this out in more specific cases; for example, “you have to get up pretty early in the morning to beat linear regression!”

So, what is it? Basically, Wolfram took the Mathematica engine and added a cheap natural language interface to it. Online Mathematica is pretty helpful, as it’s still one of the most powerful computer algebra systems in the world. The natural language interface? Well, it is an extremely cheap natural language interface; not even up to the very mediocre standards of the Ask Jeeves search engine, which was the first popular natural language engine hooked up to the web. As far as I can tell from fiddling with it, Wolfram added the mathematical equivalent of M-x doctor in Emacs. This is a type of code with a long and hoary history of niftiness, but fundamental uselessness. It’s also a type of code with an old history of being used in Mathematica-type systems: for example, the last commercial version of Macsyma (I think released in 1998 or so) had a very advanced natural language interface. These sorts of things are easy to write in functional programming languages; they’re sort of what Lisp and ML-type languages were invented for in the first place. If you want some classic examples of how this works, you can look at the source code for M-x doctor in Emacs (in /usr/share/emacs/lisp/play/) or go look at Peter Norvig’s book, Paradigms of Artificial Intelligence Programming. The relevant programs are in this handy link. Have a look at Eliza and Mycin. They’re both what used to be called “expert system shells.” What they really are, are interpreters where it is easy to update the rules, or to have them write rules for themselves.
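The Eliza trick really is this cheap: pattern-match the input, then echo a fragment of it back inside a canned template. A minimal sketch (the rules here are my own toy examples, not Weizenbaum’s original script):

```python
# A minimal Eliza-style responder: match a pattern, substitute the captured
# fragment of the user's input into a canned template. Toy rules of my own
# devising, not Weizenbaum's original script.
import re

RULES = [
    (re.compile(r"i feel (.*)", re.I),   "Why do you feel {0}?"),
    (re.compile(r"i am (.*)", re.I),     "How long have you been {0}?"),
    (re.compile(r"my (\w+) (.*)", re.I), "Tell me more about your {0}."),
]

def respond(line: str) -> str:
    line = line.rstrip(".!?")
    for pattern, template in RULES:
        m = pattern.search(line)
        if m:
            return template.format(*m.groups())
    return "Please go on."  # the classic non-committal fallback

print(respond("I feel tired"))          # -> Why do you feel tired?
print(respond("My compiler hates me"))  # -> Tell me more about your compiler.
```

Note that the rule table is just data: adding a rule is appending a pair to the list, which is exactly the “interpreter where it is easy to update the rules” property described above.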

Expert system shells were hot shit in the pre-AI-winter days. If you read Advances in Computers (my lodestone for this sort of historical context: I read the whole damn thing in the LBNL library while avoiding writing my dissertation), you can see all the excitement from the time when people first wrote such things, dating from around 1957, when they invented the first one, the “General Problem Solver.” Most of early AI research up until the AI winter (early 1980s) consisted of riffs on this basic theme. One of the great inventions which came out of this sort of thing was, in fact, the computer algebra system (CAS), of which Wolfram is the foremost vendor at present. While a CAS does a lot more than the primitive expert system shells, it is effectively a subset of the old General Problem Solver.

So, it’s no surprise that Wolfram was able to cobble together a natural language system that understands very simple mathematical commands in a sort of pidgin English. I guess the real “breakthrough” of Wolfram Alpha is that it uses data found on the web. What would be really impressive is if he had something like an augmented transition network (ATN, to AI nerds) to parse data he found online and place it in context. Briefly, an ATN was a pre-AI-winter technique used to parse grammars which work like the English language. The place you’re most likely to have heard of it is in Gödel, Escher, Bach by Hofstadter, wherein he makes the now hilarious claim that ATNs will eventually become powerful enough to form a sort of sentient AI. This is hilarious because ATNs are useless on languages which are unlike English in sentence structure. So, if you could build a sentient computer program (an idea which itself seems hopelessly funny now) using only ATNs, as Hofstadter thought we might one day, it would imply that people who speak a declined language like Latin or Russian are not sentient. Putting aside the ethnic jokes this makes possible, there are all kinds of other parsing problems which humans easily solve which ATNs haven’t got the remotest chance with. One example is parsing HTML. I mean, we don’t parse HTML directly unless we’re HTML nerds, but our browsers easily turn it into stuff we can read and make sense of. ATNs can’t help us do this, as the language structure of HTML doesn’t map to ATNs any better than Arabic does. I’m guessing, since Wolfram is a smart guy, he must have something like an ATN for some kinds of data-bearing HTML. If he can get it to work properly, this would be an important breakthrough. It obviously doesn’t work right yet, and if he does have something like an ATN to help parse information found in HTML, it probably requires lots of human intervention.
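To make the idea concrete, here is a toy parser in the transition-network spirit: each grammatical network is a function that walks the word list, and, in the “augmented” style, fills registers as it goes. The grammar and lexicon are my own toy examples, not from any historical ATN system:

```python
# A toy transition-network parser in the ATN spirit: each network consumes
# words and fills registers (SUBJ/VERB/OBJ). Toy grammar and lexicon of my
# own devising, not from any historical ATN system.
LEXICON = {
    "the": "DET", "a": "DET",
    "dog": "NOUN", "cat": "NOUN", "bone": "NOUN",
    "sees": "VERB", "chases": "VERB",
}

def parse_np(words, i):
    """NP network: optional determiner, then a noun. Returns (noun, next_i) or None."""
    if i < len(words) and LEXICON.get(words[i]) == "DET":
        i += 1
    if i < len(words) and LEXICON.get(words[i]) == "NOUN":
        return words[i], i + 1
    return None

def parse_sentence(words):
    """S network: NP VERB NP, filling the SUBJ/VERB/OBJ registers."""
    registers = {}
    np = parse_np(words, 0)
    if not np:
        return None
    registers["SUBJ"], i = np
    if i < len(words) and LEXICON.get(words[i]) == "VERB":
        registers["VERB"], i = words[i], i + 1
    else:
        return None
    obj = parse_np(words, i)
    if obj and obj[1] == len(words):
        registers["OBJ"] = obj[0]
        return registers
    return None

print(parse_sentence("the dog chases a cat".split()))
# -> {'SUBJ': 'dog', 'VERB': 'chases', 'OBJ': 'cat'}
```

Notice that the sentence network hard-wires subject-verb-object order. That hard-wiring is exactly why this family of parsers degrades on free-word-order, heavily inflected languages, and why it is hopeless on something like HTML, whose structure looks nothing like English clauses.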

“Semantic web” is the sort of “next big thing” for solving this problem from the other end. The idea of the semantic web is to solve the problem by phrasing web data in ways which computers (rather than people armed with browsers) can understand more easily. I have always been confused by the idea of the semantic web. The problem with getting everyone in the world to adopt your standard is one of motivating them. HTML is a worldwide standard because it solves lots of problems. Semantic web markup only solves the problems of search engine engineers; it doesn’t solve any content-creator problems, so I can’t see why any of them would take the trouble to use it. The only types of content creators who would want to use something like this are basically advertisers, who are pretty much useless to search engines. In fact, advertisers steal money from search engines if they appear in an unsponsored search! I mean, that’s how Google makes money! There are of course niche applications of semantic-web-enabling technologies; they could be very useful for internal databases. But I suspect simple HTML tags and ordinary search engines will work just as well for internal databases. So, to solve the problem of how to make computers able to think about information on the web, you need the right kind of parse engine for natural language HTML processing.

Does Wolfram have this “HTML parsing special sauce”? Evidently not yet. There are forms where you can submit data to the thing, Wikipedia-style; this is probably how most data gets loaded. Maybe he never will grow special HTML-parsing sauce. The problem is actually much harder than teaching computers to read and understand books in natural languages, which they are still largely incapable of. Context is hard. Still, it’s a valiant effort, and a pleasant throwback to a set of largely forgotten technologies. Why were these technologies forgotten in the first place? Mostly: K&R gave us C, and Intel invented useful commodity microprocessors. There are tons of useful things you can do with C and a commodity microprocessor. These old AI techniques are not among them. They required much higher-level computer languages, and the hardware to support such things. This form of AI also made a sort of unfortunate detour into technologies like Prolog, which made it really easy to ask computers for solutions to NP-hard problems, without realizing that you’re asking the computer for something impossible to solve. Finally, there was a serious AI software bubble which popped in the 80s. There were many AI startups which promised big business the world. They failed in that economic apocalypse because they were largely unable to deliver on their promises. PC-style machines and the C programming language made real improvements in business productivity that all the Lisp-AI propeller heads were unable to match with the tools they were using at the time. As such, much “AI” research since the 1980s has looked a lot like signal processing and statistics: fields which map much better onto procedural C and limited-memory microprocessor machines. Most of the “AI” technologies before 1982 were forgotten and abandoned.

I used to think I could code up an expert system shell for something useful at work. I like forgotten technologies, and I like Lisp. The last time I had this thought, I was plagued by support questions from people I worked with, and considered writing an expert system shell to answer their questions. Why didn’t I follow through with it? Well, it’s back to Philip’s saying: the simple solution is generally hard to beat with a technologically advanced one. I put together a searchable wiki for support questions instead. Sure, it would have saved me seconds a day if I had all that content loaded into an expert system shell, but it probably would have taken me months to build the tailored expert system shell and make it work. And it might not have worked, whereas the wiki worked and was useful immediately. So, you have to give Wolfram some credit for reviving some neat technologies. Minus points for not hiring Philip as a consultant beforehand.

Fun Wolfram Alpha Easter Eggs which show its “Eliza” intestines:
Fun observations (to be updated as I make more of them):

  1. Alpha doesn’t parallelize in any useful ways: when you do a query, you get popped to one of a couple hundred servers on a farm, presumably each running identical instances of Mathematica + the language parser.
  2. A speech pathologist relative once pointed out that profoundly brain-damaged people are still capable of “cocktail talk” or “small talk”; this can often surprise doctors, as much of social interaction is apparently small talk. The fact that this was so, and that my Emacs editor had a creditable Rogerian psychotherapist coded up in it, gave me misanthropic ideas for helping profoundly brain-damaged people reenter society in high-paying jobs. Sort of like Being There.
  3. Cyc is probably the most impressive expert system shell yet written. Unfortunately, it doesn’t seem to parse the web. Probably because this is a really hard problem.
  4. Why doesn’t he wire a standard search engine up to the thing for things Alpha doesn’t recognize (which is the larger subset of questions I have asked it)? A friend of mine wrote a search engine for things pertaining to his project with a very small engineer head count. Search qua search is actually pretty easy! If nothing else, partner with someone else’s search engine for those non numeric questions!
  5. Since Alpha doesn’t do regular search … are they looking to become some kind of Wikipedia for data? That would also be incredibly useful. But they have not made this decision in any obvious way yet. If they want to be the Wikipedia of data plus a data engine, they should probably be more overt about that, and cut out the HAL 9000 jokes.