Why Use A Graph DB?

I like my inner joins, dammit

Relational databases like PostgreSQL and SQL Server are battle-tested, well-understood technologies that work well for a lot of things. Some things they do well include:

  • Things that require strong constraints enforced on the DB level

    Relational databases impose a rigid schema. Tables have the columns they have; columns have the data types they have; various constraints can be applied to keep values unique, or to only allow a subset of data. Relationships themselves are implemented as foreign key constraints.

  • Tabulating stuff

    Not surprisingly, things based on ledgers are good at adding things up. If you need to sum all the values in a column, they do that performantly.

  • Low-complexity, unchanging data relations

    Relational databases handle simple sets of related data well. If you know that a User entity will only ever relate to a Blog Post entity, that sort of one-hop relationship is handled pretty well by SQL joins. The same goes for cases where you already know all the questions you will ever ask: if what you are trying to do is a known quantity and will never, ever change, relational databases work well. If it isn't, schema and query changes are often required to maintain performance.
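
The three strengths above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the users and blog_posts tables, their columns, and the sample rows are made up for the example and are not taken from any real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: the constraints live in the database itself.
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE blog_posts (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        words   INTEGER NOT NULL CHECK (words >= 0)
    );
""")

conn.execute("INSERT INTO users (id, email) VALUES (1, 'ada@example.com')")
conn.executemany(
    "INSERT INTO blog_posts (user_id, words) VALUES (?, ?)",
    [(1, 500), (1, 750)],
)

# Strong constraints: a post pointing at a user that does not exist is refused.
try:
    conn.execute("INSERT INTO blog_posts (user_id, words) VALUES (99, 100)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Tabulating: summing a column is a one-liner.
total_words = conn.execute("SELECT SUM(words) FROM blog_posts").fetchone()[0]

# One-hop relationship: a single JOIN from users to their posts.
posts_by_author = conn.execute("""
    SELECT u.email, COUNT(p.id)
    FROM users u
    JOIN blog_posts p ON p.user_id = u.id
    GROUP BY u.email
""").fetchall()

print(orphan_rejected, total_words, posts_by_author)
```

Everything here is decided up front: the columns, the one foreign key, and the single join path. That is exactly the situation where a relational database is comfortable.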

There are, of course, other types of databases, which we tend to group under "NoSQL." Most address performance or complexity issues with traditional relational DBs. Document DBs like MongoDB avoid rigid schemas, and tend to map better to the hashes and objects used in most programming languages. Wide-column databases like Cassandra look similar to traditional relational DBs, but group and distribute rows by a partition key, which improves performance with very large datasets.

Graph databases are different in that relationships are “first-class” objects, just like nodes:

  • relationships can have properties and labels
  • relationships can be added or removed at will, without schema changes
  • relationships can be queried (e.g. "Give me all the CHILD_OF relationships with the created_at date between X and Y, and the nodes at each end")
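
To make the list above concrete, here is a toy in-memory model in Python. It is a sketch of the idea, not how any real graph database is implemented: the Node, Relationship and Graph classes, the find method, and the sample people and dates are all invented for this example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Node:
    name: str


@dataclass
class Relationship:
    # A relationship is its own object with a type and arbitrary properties.
    start: Node
    rel_type: str
    end: Node
    properties: dict = field(default_factory=dict)


class Graph:
    def __init__(self):
        self.relationships = []

    def relate(self, start, rel_type, end, **properties):
        # Any relationship type can be added at any time; there is no schema to alter.
        rel = Relationship(start, rel_type, end, properties)
        self.relationships.append(rel)
        return rel

    def find(self, rel_type, created_from, created_to):
        # Query the relationships themselves, filtering on one of their properties.
        return [
            r for r in self.relationships
            if r.rel_type == rel_type
            and "created_at" in r.properties
            and created_from <= r.properties["created_at"] <= created_to
        ]


graph = Graph()
alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
graph.relate(bob, "CHILD_OF", alice, created_at=date(2016, 3, 1))
graph.relate(carol, "CHILD_OF", alice, created_at=date(2016, 9, 15))
graph.relate(bob, "FRIEND_OF", carol, created_at=date(2016, 4, 1))

# "All CHILD_OF relationships created between X and Y, and the nodes at each end."
matches = graph.find("CHILD_OF", date(2016, 1, 1), date(2016, 6, 30))
pairs = [(r.start.name, r.end.name) for r in matches]
print(pairs)
```

The FRIEND_OF relationship was added without touching anything that already existed, and the query filters on a property that belongs to the relationship rather than to either node.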

This gives graph DBs an advantage over most other kinds of database when dealing with highly related data. Graph DBs end up being particularly good at these kinds of things:

  • Answering questions based on the relationships between data

    With a graph DB, asking the question "how many people who bought a toaster in Kansas and have a criminal record used yesterday’s coupon" is straightforward, as long as you have the data and the relationships in place. If you don't, adding the relationships is trivial. With a relational DB, even if the question doesn't require schema changes, the performance of multiple JOINs is likely to be poor and to require optimization and possibly de-normalization.

  • Maintaining performance while querying against complex relationships

    As a query grows more complex, the SQL behind it typically grows more complex too, and performance generally suffers as the number of JOINs goes up. Graph databases aren't immune to this, but in general they can perform complex queries on related data without the same level of performance hit, and frequently without lengthy, complex query statements. This article at DZone shows some examples.

  • Discovering connections you didn’t expect

    Through some exploratory querying, patterns and connections you didn't expect often become clear. A dramatic example of this is the analysis of the "Panama Papers" using Neo4j and Linkurious, a graph visualization tool. Investigative journalists didn't know what they would find, but were able to explore a massive dataset and discover connections that would have been effectively hidden with other tools.

  • Answering questions you didn't anticipate

    As database admins and developers, we've learned to dread questions that weren't in our initial spec, because we might have to make significant schema and query changes to pull out different data. Graph databases, by comparison, are relatively easy to modify: relationships can be added or removed without significantly affecting performance. So if you get asked the question "how many people who bought a toaster in Kansas and have a criminal record used yesterday’s coupon," and it wasn't in the spec, it's much easier to come up with a useful result.
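
The toaster question can be walked through with a small Python stand-in for a graph. It is a rough sketch only: the edges structure, the relate and has helpers, the relationship names and the three sample people are all hypothetical, and a plain dictionary says nothing about how a real graph database performs.

```python
from collections import defaultdict

# edges[node][relationship type] -> set of nodes at the other end
edges = defaultdict(lambda: defaultdict(set))


def relate(start, rel_type, end):
    edges[start][rel_type].add(end)


def has(start, rel_type, end):
    # Read-only lookup, so asking a question never creates empty entries.
    return end in edges.get(start, {}).get(rel_type, set())


people = ["ann", "bo", "cy"]

# Relationships that were in the original spec.
for person, state in [("ann", "Kansas"), ("bo", "Kansas"), ("cy", "Missouri")]:
    relate(person, "BOUGHT", "toaster")
    relate(person, "LIVES_IN", state)
relate("ann", "HAS", "criminal record")
relate("bo", "HAS", "criminal record")

# The question nobody anticipated: coupon usage arrives later as one more
# relationship type, added without changing anything that already exists.
relate("ann", "USED", "yesterday's coupon")


def matches(person):
    return (
        has(person, "BOUGHT", "toaster")
        and has(person, "LIVES_IN", "Kansas")
        and has(person, "HAS", "criminal record")
        and has(person, "USED", "yesterday's coupon")
    )


answer = [p for p in people if matches(p)]
print(len(answer), answer)
```

The point of the sketch is the shape of the change: supporting the new question meant adding one kind of relationship, not reworking tables or rewriting existing queries.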

As datasets and their complexity grow, the advantages of graph DBs become more and more evident, because other kinds of DBs struggle to maintain performance with heavily inter-related datasets.