Implementing Data Vault 2.0 ghost records
During the development of Data Vault, from the first iteration to its latest Data Vault 2.0, we’ve mentioned the two terms “ghost records” and “zero keys” in our literature as well as in our Data Vault 2.0 Boot Camps. And since then, we’ve noticed these concepts oftentimes being referenced to interchangeably.
In this blog entry, we’ll discuss the implementation of ghost records in Data Vault 2.0. Please note, that this article is part one of a multi-part blog series clarifying Ghost records vs. Zero Keys.
Why implement ghost records?
The concept of ghost records is usually brought up together with the implementation of point-in-time (PIT) tables. PIT tables are used as query assistant objects as part of the Business Vault, in which snapshots of data are created for certain time intervals specified by the data consumers. It’s important to note that these intervals can be daily, weekly, even real-time, etc. Each entry in a PIT table materializes joins from a Data Vault spine object (either a Hub or a Link) to its surrounding Satellite structures with the goal to reduce joins while querying against the Data Vault and thus boosting query performance.
In some instances, however, upon joining e.g. a Hub to one of its Satellites, there can be no corresponding Satellite delta for certain snapshots. The reason behind this could be that the business key was not available or unknown by the data source at that given time.
Reference to a ghost record in a PIT table
To combat this issue, ghost records are added to Satellite entities to virtually fill up gaps in the beginning of the timeline, so that equal joins are made possible in ad-hoc queries against the Raw Vault. Equal joins (a.k.a. equi-joins) are joins that only use equality comparators and arguably the most efficient/fastest SQL-join type.
What does a ghost record look like?
A ghost record can be understood as a dummy record that contains default values. In the previous iteration of Data Vault (DV1), the solution was to create a ghost record per key per satellite structure. This would still do the job of filling up gaps at the beginning of the timeline. However, this solution didn’t scale well on higher volumes of data. Imagine a hub that contains 10 million business keys and there are three satellites attached to it. Every satellite then contains 10 million ghost records, resulting in 30 million records across all three satellites. In addition, every time a business key is added to the hub, a corresponding ghost record needs to be added to each satellite. The sheer amount of ghost records in this case would defeat the whole purpose of trying to achieve equi-joins, to enable faster queries.
Thus, since the introduction of DV2.0, it is only required to insert one single ghost record per Satellite structure.
Example: Ghost record with attributes of different data types
The ghost record typically contains a constant hash key 00000000000000000000000000000000 (32 times the character “0”). This hash key is also known as a Zero key – more on Zero keys coming up in the next part of this blog series. Its load timestamp is usually set to the earliest possible timestamp within the DBMS, indicating the “beginning point of time”. The record source “SYSTEM” simply means the record is artificially generated.
Then, follows a list of default NULL values for every descriptive attribute within the Satellite structure. For each data type, we define a default value for the ghost record. For example, attributes with numeric data types can be filled with (numeric) zero, string attributes can be filled with either “(unknown)” or “?” depending on the length definition of the attribute.
It is recommended that the ghost record is filled with default values, as opposed to filling it with NULL/empty values, since these default values can be used and displayed further downstream. A good example for this can be seen in dimensions, an “(unknown)” string is arguably way more descriptive than a mere NULL value.
How to insert ghost records
There are a couple of ways to insert ghost records into Satellite structures.
The first variation is to insert the ghost records upon object creation as a one-time operation and then forget about it. Simple as that!
Another way is to insert the ghost record during the process of loading Satellites. The loading procedure should start with inserting a ghost record into the target object, if it does not yet exist. Then, the procedure can proceed with loading the Satellite with incoming data as normal. This variation might be viewed as rather excessive. However, it ensures that the ghost record is always available and that it gets inserted back into each Satellite – in case for whatever reason, objects are truncated or the ghost record itself is accidentally deleted, for example during development.
Both variations can be fully automated within your project’s Data Vault automation tool of choice.
We hope that this blog post helps to clarify the implementation of ghost records in a Data Vault 2.0 solution. Coming up next, we’d like to discuss with you about “the-other-technical-term” Zero keys and the difference between Ghost records and them – which has been rather confusing to many fellow Data Vault practitioners.
Feel free to share this blog post to your colleagues and make sure to leave a comment on how your project implements ghost records!
-by Trung Ta (Scalefree)