Managing Surrogate Keys in a VLDB

When using most RDBMS’s one thing they seem to offer is the ability to add a surrogate key column, wich is usually an automatic increment.  For most systems this is easy to implement and allows very fast primary key lookups.  When architecting a system that will potentially have billions of rows, relational data design kinda goes out the window.  You start using a basic star schema if you can.  You also lose the ability to use these nice surrogate key indexes as they tend have a severe performance impact during loading.

Enter the natural key. Typically people use a simple hashing algorithm to create a natural key.  This is usually just an MD5 sum of the concatenated field values.  Using this hash as your primary key can lead to serious performance related issues as string comparisons are much slower that integer comparisons.

Since systems like this can rarely take multiple fast inserts, the common practice is to do bulk loading of data via an ETL (extract, transform, load) process.  What this does is take raw data (ie logs) from a location (extract), write them in a format that the RDBMS can read (transform) and loads it using the RDBMS bulk loading mechanism.

To get the performance of having an ordered surrogate key index without suffering a hit to your load time, the key value store is a great tool.  I personally have used Redis, but others exist and are also great.  The idea here is that the key value store is all within memory, and extremely fast at key lookups.

On to the architecture.  Using the key value store, we can decide if we have a row for a particular dimension or if we need to create one.  During the ETL process for each row we do the following:

  • Create a dimension
  • Get the natural key for this dimension
  • Lookup in the KV Store for the surrogate key
  • If none exists: get the latest number from the KV Store and increment it, then store the new key
  • Load the data with the surrogate key in the dimension table
  • Reference the surrogate key in the fact table
As a side note, I would also recommend storing the natural key in a separate row, just in case you lose your KV store, it can be reloaded.  This minor change to our ETL process increased our join performance by over an order of magnitude and reduced our memory footprint by about 10% on a system that was growing by about a billion rows per day.