What Is A Graph DB?

There is no math required, we promise

While they sound fancy and complex, graph databases are actually a pretty simple idea. A graph as applied to databases is just a different way of structuring data than we are used to.

Most of us have used or are familiar with traditional "relational" databases. These databases are modeled after ledgers and forms, and store our data in tables. To break it down in a simplified way:

  • A row in a table represents a thing
  • Properties of the thing are columns in the table
  • Different types of things are stored in different tables
  • Two things are said to be related when they have the same values in "key" columns. So an item in table Foo is related to an item in table Bar when Foo.Bar_ID holds the same value as Bar.ID

[Diagram: two relational database tables. Two rows are "related" because they share a value in "key" columns.]
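
To make the "key column" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The Foo and Bar table names come from the example above; the name and title columns and all of the row values are made up for illustration.

```python
import sqlite3

# In-memory database; tables are named after the Foo/Bar example in the text.
# The "name" and "title" columns and every row value are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Bar (ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Foo (
        ID INTEGER PRIMARY KEY,
        title TEXT,
        Bar_ID INTEGER REFERENCES Bar(ID)  -- the "key" column
    );
    INSERT INTO Bar (ID, name) VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO Foo (ID, title, Bar_ID) VALUES
        (10, 'First post', 1),
        (11, 'Second post', 1),
        (12, 'Third post', 2);
""")

# Two rows are "related" only because Foo.Bar_ID holds the same value as Bar.ID.
related = conn.execute("""
    SELECT Foo.title, Bar.name
    FROM Foo JOIN Bar ON Foo.Bar_ID = Bar.ID
    ORDER BY Foo.ID
""").fetchall()

for title, name in related:
    print(title, "->", name)
```

Nothing in either table stores the relationship itself; it only appears when the JOIN compares the two key columns.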

This is how most database work has been done for a long time. It's familiar to us, and you can do a lot with it.

Graph databases are actually a bit simpler. There are only two things to worry about: nodes (sometimes called vertices) and relationships (sometimes called edges). Nodes are dots, and relationships are lines between the dots. It's all dots and lines.

Nodes are the things in graph databases. In most graph DBs, nodes have properties, which are a set of keys and values. In many graph DBs, nodes also have labels, which are used to categorize and group nodes. This way you know what type of thing a node is.

So, to break it down:

  • A node represents a thing
  • Properties of the thing are properties on the node
  • The type of a thing is set by a label on the node
  • Two things are related when a relationship is created between them – a line is drawn between the nodes

[Diagram: two nodes in a graph, with two relationships between them. Each node has a Person label and a set of property key/value pairs. The relationships also have labels on them, to indicate the type of relationship, and they have properties as well.]
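
As a rough illustration of that model, the sketch below builds two Person nodes and two relationships as plain Python objects. This is not how any particular graph DB stores data; the class names, property keys, relationship types, and values are all assumptions made up for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                     # e.g. {"Person"}
    properties: dict = field(default_factory=dict)  # key/value pairs

@dataclass
class Relationship:
    start: Node
    end: Node
    type: str                                       # the relationship's label
    properties: dict = field(default_factory=dict)  # relationships have properties too

# Two hypothetical Person nodes.
alice = Node({"Person"}, {"name": "Alice", "age": 34})
bob = Node({"Person"}, {"name": "Bob", "age": 36})

# Two relationships between the same pair of nodes, one in each direction.
relationships = [
    Relationship(alice, bob, "MARRIED_TO", {"since": 2010}),
    Relationship(bob, alice, "WORKS_WITH", {"project": "Graphs"}),
]

def related_to(node, rels):
    """List (type, name) for each relationship that starts at `node`."""
    return [(r.type, r.end.properties["name"]) for r in rels if r.start is node]

print(related_to(alice, relationships))
```

The relationship is stored as its own object that points at both nodes, rather than being implied by matching key values.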

It's pretty cool, because it's close to how we often think about entities and the relationships between them. We don't have to translate between how we would draw it on a whiteboard and how it will actually work in the database.

That's really all a graph is: a collection of nodes connected by relationships. That simplicity, however, allows it to scale up to big sets of data, and adapt to changing needs very well.

Why Use A Graph DB?

I like my inner joins, dammit

Relational databases like PostgreSQL and SQL Server are battle-tested, well-understood technologies that work well for a lot of things. Some things they do well include:

  • Things that require strong constraints enforced on the DB level

    Relational databases impose a rigid schema. Tables have the columns they have; columns have the data types they have; various constraints can be applied to keep values unique, or to only allow a subset of data. Relationships themselves are implemented as a foreign key constraint.

  • Tabulating stuff

    Not surprisingly, things based on ledgers are good at adding things up. If you need to sum all the values in a column, they do that performantly.

  • Low-complexity, unchanging data relations

    Relational databases can handle simple sets of related data well. If you know that a User entity will always only relate to a Blog Post entity, that sort of one-hop relationship can be handled pretty well with SQL joins.

  • When you already know all the questions you will ever ask

    If what you are trying to do is a known quantity and will never, ever change, relational databases work well. If it isn't, schema and query changes are often required to maintain performance.

There are, of course, other types of databases, which we tend to group under "NoSQL." Most address performance issues or complexity issues with traditional relational DBs. Document DBs like MongoDB avoid rigid schemas, and tend to map better to hashes and objects used in most programming languages. Column databases like Cassandra are similar to traditional relational DBs, but store entities in columns instead of rows, and improve performance with very large datasets.

Graph databases are different in that relationships are “first-class” objects, just like nodes:

  • relationships can have properties and labels
  • relationships can be added or removed at will, without schema changes
  • relationships can be queried (e.g. "Give me all the CHILD_OF relationships with the created_at date between X and Y, and the nodes at each end")
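
The CHILD_OF question above can be sketched in a few lines of Python. This only models the idea of filtering on the relationship itself; the names, dates, and the find_relationships helper are hypothetical, and a real graph DB would answer this with a query rather than a list scan.

```python
from datetime import date

# Each relationship is a (start, TYPE, end, properties) tuple. All data is made up.
relationships = [
    ("Sam", "CHILD_OF", "Pat", {"created_at": date(2016, 1, 15)}),
    ("Lee", "CHILD_OF", "Pat", {"created_at": date(2016, 6, 1)}),
    ("Sam", "FRIEND_OF", "Lee", {"created_at": date(2016, 2, 1)}),
]

def find_relationships(rels, rel_type, start_date, end_date):
    """Return (start, end, properties) for relationships of the given type
    whose created_at property falls within [start_date, end_date]."""
    return [
        (start, end, props)
        for start, kind, end, props in rels
        if kind == rel_type and start_date <= props["created_at"] <= end_date
    ]

# "All the CHILD_OF relationships created in Q1 2016, and the nodes at each end."
matches = find_relationships(
    relationships, "CHILD_OF", date(2016, 1, 1), date(2016, 3, 31)
)
print(matches)
```

The filter runs on the relationship's own type and properties, which is what treating relationships as first-class objects makes possible.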

This gives graph DBs an advantage over any other kind of database when dealing with related data. Graph DBs end up being particularly good at these kinds of things:

  • Answering questions based on the relationships between data

    With a graph DB, asking the question "how many people who bought a toaster in Kansas and have a criminal record used yesterday’s coupon" is straightforward, as long as you have the data and the relationships in place. If you don't, adding the relationships is trivial. With a relational DB, even if the question doesn't require schema changes, the performance of multiple JOINs would likely be poor and would require optimization and possibly denormalization.

  • Maintaining performance while querying against complex relationships

    As the complexity of a query grows, it's typical that the complexity of the SQL used grows as well, and performance generally suffers as the number of JOINs goes up. Graph databases aren't immune to this, but in general they can perform complex queries on related data without the same level of performance hit, and frequently without lengthy, complex query statements. This article at DZone shows some examples.

  • Discovering connections you didn’t expect

    We often find that, through some exploratory querying, patterns and connections we didn't expect become clear. A dramatic example of this is the analysis of the "Panama Papers" using Neo4j and Linkurious, a graph visualization tool. Investigative journalists didn't know what they would find, but were able to explore a massive dataset and discover connections that would have been effectively hidden with other tools.

  • Answering questions you didn't anticipate

    As database admins and developers, we've learned to dread answering questions that weren't in our initial spec, because we might have to make significant schema and query changes to pull out different data. Graph databases, by comparison, are relatively easy to modify, adding or removing relationships without significantly affecting performance. So if you get asked the question "how many people who bought a toaster in Kansas and have a criminal record used yesterday’s coupon," and it wasn't in the spec, it's much easier to come up with a useful result.
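
To give a feel for why multi-hop questions suit a graph, here is a small sketch that walks relationships outward from one node using a breadth-first search. It is a toy model, not how Neo4j or any other graph DB is implemented; the FRIEND_OF data and the within_hops helper are made up for the example.

```python
from collections import deque

# Hypothetical FRIEND_OF relationships, stored as an adjacency list.
friends = {
    "Ann": ["Bob", "Cy"],
    "Bob": ["Dee"],
    "Cy": ["Dee", "Eve"],
    "Dee": ["Fay"],
    "Eve": [],
    "Fay": [],
}

def within_hops(graph, start, max_hops):
    """Return {name: hops} for every node reachable from `start`
    in at most `max_hops` hops (breadth-first search)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(current, []):
            if neighbor not in seen:
                seen[neighbor] = seen[current] + 1
                queue.append(neighbor)
    del seen[start]  # the start node is not its own result
    return seen

print(within_hops(friends, "Ann", 2))
```

Going from two hops to three is a change to one argument. In SQL, each additional hop typically means another JOIN (or a recursive query) against the same tables.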

As the datasets and complexity grow, the advantages of graph DBs become more and more evident, because other kinds of DBs struggle to maintain performance with inter-related datasets.

How Do I Use A Graph DB?

So is this thing web scale?

We primarily use Neo4j at Graph Story, so we'll talk specifically about getting up to speed with that. There are, of course, other graph DBs out there, and some of these will apply to them as well.

  1. Learn Cypher

    Cypher is a query language similar to SQL that was created for use with Neo4j. It's now an open standard, so we may see it available with other graph systems in the future. The appeal of Cypher is in its conceptual similarity to SQL, and in its "ASCII art"-style approach to describing relationships.

    Here's a straightforward Cypher example:

    // $value is a query parameter supplied at run time
    MATCH (n1:Label1)-[rel:TYPE]->(n2:Label2)
    WHERE rel.property > $value
    RETURN rel.property, type(rel)

    You can check out the Cypher intro at Neo's web site, but a great way to get started quickly is to visit the web UI of any Neo4j install. A couple of nice tutorials are built in that step you through using Cypher against built-in data sets.

  2. Take on a project

    A lot of us learn best by making something. If you haven't already been tasked with one, a side project can be a great way to learn about graphs in a fun way. Ed Finkler, one of our team members, wrote a blog post about starting a graph side project.

  3. Learn some things about data modeling for graphs

    Best practices for how to structure your graph are, not surprisingly, different from those for a relational database. Without getting too detailed here, there's a nice intro to the basics on Neo's web site.

  4. Learn how to write performant queries

    One of the most common problems we see is poorly written queries that hobble the database, something that is significantly easier to do with Neo4j than with, say, MySQL or PostgreSQL. You'll want to learn and use the EXPLAIN and PROFILE keywords in Cypher to break down the execution plan for a query that seems slow. The most common issues we see are:

    1. Not using unique constraints and indexes on node properties
    2. Not specifying labels in the query, so it scans every node instead of a subset
    3. Returning entire nodes or relationships, instead of just the data needed

    The section on Query Tuning in the Neo4j developer docs is a must-read on this topic. We also recommend the Tuning Cypher Queries talk by Petra Selmer and Mark Needham from GraphConnect 2015.

  5. Find a library for your language

    There are Neo4j drivers and libraries for most popular programming languages, and we list many of them on our Docs page (login required). These vary from low-level libs that send handwritten Cypher queries to complex OGMs (Object-Graph Mappers).

  6. Ask us questions!

    We're proud at Graph Story to have the best support in the industry. We want to help you be successful incorporating the power of graphs into your processes. We are happy to answer any questions you have using our support chat (see the bottom-right of the window), or email us at support@graphstory.com.
