Facebook’s controversial Graph Search feature has been two years in the making, and was announced live a couple of months ago. Facebook has on average one billion new posts added every day, with their posts index containing more than one trillion total posts, altogether comprising hundreds of terabytes of data. Graph Search indexes this data and returns real-time results to queries.
I’m talking to Dr Jim Webber about Neo4j, the highly scalable open source graph database that is powering both oil producers in Scandinavia and Silicon Roundabout startups.
Dr Webber – Chief Scientist for Neo4j – is frank about how the major players’ adoption of graphs has meant more attention for graph databases.
“I think they have done a lot of PR for graphs we couldn’t have managed as a smaller company. A lot of people draw inspiration from what those guys are doing, and would like to try and replicate some of those features in their own systems”.
But he’s quick to point out that Neo4j has been around a long time, longer than Facebook’s graph team has, and that the company has burned a lot of shoe leather getting graph databases out into the mainstream, and not just for social media startups.
To the uninitiated, a graph database is literally a database storing data in a graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way. In a graph, every element contains direct pointers to its adjacent elements, avoiding costly global index lookups.
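The idea of direct pointers between adjacent elements can be sketched in a few lines of Python. This is only an illustration of the principle, not Neo4j’s storage format; the `Node` class and the sample names are hypothetical.

```python
class Node:
    """A graph element that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct pointers, no global index involved

    def connect(self, other):
        # Store the reference on both sides so either end can reach the other.
        self.neighbours.append(other)
        other.neighbours.append(self)


alice = Node("Alice")
bob = Node("Bob")
carol = Node("Carol")
alice.connect(bob)
bob.connect(carol)

# Hopping from Alice to her friends of friends only follows pointers;
# nothing is looked up by key in a table or index.
friends_of_friends = sorted(
    {fof.name
     for friend in alice.neighbours
     for fof in friend.neighbours
     if fof is not alice}
)
print(friends_of_friends)
```

Each hop costs the same however many other nodes exist, because the query never consults a structure that grows with the whole data set.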
But who would use a graph database? Anyone and everyone who has connected data. From a two-person bootstrapping social networking tech startup in someone’s shed, to Global 500 companies including HP and Cisco. Graph databases are being implemented by everyone from high-profile blue-chip organisations to what Dr Webber refers to as “the startup end of the spectrum”.
When people ask why they would want to use a graph database like Neo4j over a more traditional database, they only need to look as far as Shutl. Shutl are a London-based technology start-up (recently acquired by eBay), offering same-day delivery within minutes of purchase or delivery in any one-hour slot on a day of the customer’s choice.
From a service user’s point of view, the change in database for Shutl wasn’t about having an answer 10 milliseconds faster because, as Dr Webber points out, in human time you can’t really tell the difference. But all of those additional milliseconds that were available to the database meant that customers were able to get a far richer experience. Shutl used the extra headroom from working in a graph to deliver customers 50 delivery options, compared to just one.
In a recent Bioinformatics paper entitled “Are graph databases ready for bioinformatics?”, authors Christian Theil Have and Lars Juhl Jensen found that Neo4j was up to nearly 2,500 times faster than a relational database (Postgres) running the same kind of queries.
Dr Webber is also a big fan of Doctor Who, to the point where he uses the enormous Doctor Who universe to demonstrate the power of Neo4j. The thing about Neo4j and graphs, Dr Webber explains, is that “you just pour facts into them”. In the case of Doctor Who, you can pour in facts like “The Doctor is from Gallifrey” and you would have a node that represents the Doctor and a node that represents Gallifrey, with a relationship between them, such as “comes from”.
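As a rough sketch of what “pouring in facts” amounts to, each fact can be treated as a pair of nodes joined by a named relationship. The Python below is an in-memory toy, not Neo4j code; the `pour` helper and the relationship names other than “comes from” are assumptions made for illustration.

```python
from collections import defaultdict

# Each fact is a (start node, relationship name, end node) triple.
facts = [
    ("The Doctor", "COMES_FROM", "Gallifrey"),
    ("Rose Tyler", "COMPANION_OF", "The Doctor"),
    ("Daleks", "ENEMY_OF", "The Doctor"),
]

nodes = set()
outgoing = defaultdict(list)  # node -> [(relationship name, other node)]


def pour(start, relationship, end):
    """Add one fact: create both nodes if needed and link them."""
    nodes.update((start, end))
    outgoing[start].append((relationship, end))


for fact in facts:
    pour(*fact)

print(sorted(nodes))
print(outgoing["The Doctor"])
```

Adding another fact never requires a schema change: it is one more call to `pour`, which is the property the Doctor Who demonstration relies on.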
Sounds simple enough so far. And in the case of Doctor Who, you can pour in facts in their hundreds, thousands and millions, and what develops is a multidimensional data structure, covering everything from the enemies of the Doctor and the Doctor’s companions, to places visited, and all of the episodes they appear in.
Because each relationship is named, you can effectively treat relationship types as dimensions, and bring as many or as few of them into your query as you like.
Dr Webber says: “You can pour millions of data items into your graph and you will not be penalised for it at run time, because in a graph database the latency of your query is proportional to how much of the graph you choose to explore. It is not proportional to how much data is in the database, as it is in a relational data store.”
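A small experiment makes the point concrete. The sketch below is a toy model in Python, not a Neo4j benchmark: it counts how many relationships a depth-limited traversal follows, first in a small graph and then in one padded with 100,000 unrelated nodes. The `explore` and `build` helpers and the sample data are hypothetical.

```python
from collections import defaultdict, deque


def explore(graph, start, max_depth):
    """Breadth-first walk from start; returns how many relationships were followed."""
    seen = {start}
    queue = deque([(start, 0)])
    followed = 0
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbour in graph[node]:
            followed += 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return followed


def build(unrelated_nodes):
    """Build the same small neighbourhood, plus data the query never touches."""
    graph = defaultdict(list)
    graph["The Doctor"] = ["Gallifrey", "TARDIS"]
    graph["TARDIS"] = ["Gallifrey"]
    for i in range(unrelated_nodes):
        graph[f"other-{i}"].append(f"other-{i + 1}")
    return graph


small = explore(build(10), "The Doctor", max_depth=2)
large = explore(build(100_000), "The Doctor", max_depth=2)
print(small, large)
```

Both runs follow the same number of relationships, because the traversal only ever touches the neighbourhood it was asked to explore.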
You can write your query so that it only follows “appeared in” relationships, for example where actors playing the Doctor appeared in a particular episode, and ignores companions. You can bring in or leave out these dimensions as needed.
What this means is that traversing a graph is lightning fast. You can expect to traverse a million relationships per second per thread per core, which means you can explore deeply within a graph, or very broadly, or perform a lot of shallower searches per second.
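The following Python sketch shows one way to picture a query that follows only certain named relationships. It is not Neo4j’s query language; the relationship names and the `incoming` and `outgoing` helpers are assumptions, and a flat list scan stands in for a real graph traversal purely for brevity.

```python
# Relationships as (start, relationship name, end) triples; sample data only.
relationships = [
    ("David Tennant", "PLAYED", "The Doctor"),
    ("David Tennant", "APPEARED_IN", "Blink"),
    ("Matt Smith", "PLAYED", "The Doctor"),
    ("Matt Smith", "APPEARED_IN", "The Eleventh Hour"),
    ("Rose Tyler", "COMPANION_OF", "The Doctor"),
]


def incoming(end, name):
    """Nodes that point at `end` through relationships called `name`."""
    return [s for s, n, e in relationships if e == end and n == name]


def outgoing(start, name):
    """Nodes that `start` points at through relationships called `name`."""
    return [e for s, n, e in relationships if s == start and n == name]


# Follow PLAYED and APPEARED_IN only; COMPANION_OF is never brought in.
episodes = {
    actor: outgoing(actor, "APPEARED_IN")
    for actor in incoming("The Doctor", "PLAYED")
}
print(episodes)
```

Bringing companions into the query would mean adding one more relationship name to follow, without changing the data.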
In Neo4j the longer and looser you make your matches, the more latency will be in your queries. If you can make your matches shorter and more precise, your latency can be very small — as is the case with Shutl.
Another company making waves with a graph database, this time in oil and gas exploration and production, is one of the largest operators on the Norwegian continental shelf.
This organisation is finding ways to push gas and oil through its network while minimising the water it has to use to flush its pipes. Because a network of gas and oil pipelines is a physical graph, the operator is using Neo4j and genetic algorithms to figure out optimal routes for the flow of gas and oil through the network, minimising waste water and maximising fuel throughput.
Lightning-fast queries also happen to be what you need when you want to optimise your gas pipeline, deliver faster to customers, or find out how many incarnations of the Doctor fought the Daleks.