Zero Key References and Optional Data in Data Vault Modeling
A nuanced modeling question came in recently about how to handle a source table that delivers relationships between two business objects, where one specific role type has no related second object — just additional attributes in the same record. The proposed model was a solid starting point, and the discussion that followed touched on several important Data Vault principles: zero key handling, CDC Satellites vs. multi-active Satellites, effectivity tracking, and why filter conditions in Raw Data Vault loading are a risk worth avoiding. This post walks through each of these in turn.
In this article:
- Zero Key References: Modeling Optional Relationships
- Multi-Active Satellites vs. CDC Satellites: Choosing the Right Approach
- Effectivity Tracking and the Status Tracking Satellite
- GDPR and Customer Re-Registration
- Why Filter Conditions in Raw Data Vault Loading Are Risky
- Validating the Model with the JEDI Test
- Watch the Video
Zero Key References: Modeling Optional Relationships
The scenario involves a Link between two Hubs — Object One and Object Two — with a role type as part of the Link structure. For most role types, both Hub references are populated. For one specific role type, Object Two doesn’t exist; the source provides additional descriptive attributes instead of a foreign key.
The correct handling for the missing Object Two reference is straightforward: use the all-zeros key. When a foreign key in the source is null, the Link refers to the zero key in the Hub rather than storing an actual null. This keeps the model queryable with inner joins, avoids null-handling complexity downstream, and is entirely consistent with Data Vault null business key handling. Both Hubs should have their two zero key rows — the all-zeros key for unknown or null references, and the all-Fs key for error cases — deployed as standard practice.
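To make the zero key substitution concrete, here is a minimal Python sketch. It assumes MD5-based hash keys (32 hex characters), so the all-zeros and all-Fs keys are 32-character strings; the `CUST-1001` business key and the `STANDALONE` role type are hypothetical examples, not from the source scenario.

```python
import hashlib

ZERO_KEY = "0" * 32   # all-zeros key: unknown / null business key reference
ERROR_KEY = "f" * 32  # all-Fs key: error cases

def hub_hash_key(business_key):
    """Return the Hub hash key, substituting the zero key for null references."""
    if business_key is None or business_key.strip() == "":
        return ZERO_KEY
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# A Link row for the role type without an Object Two reference still joins
# cleanly against the Hub's pre-deployed zero key row -- no NULLs stored:
link_row = {
    "hk_object_one": hub_hash_key("CUST-1001"),
    "hk_object_two": hub_hash_key(None),  # -> all-zeros key, not NULL
    "role_type": "STANDALONE",            # hypothetical role type name
}
```

Because the zero key row is deployed in the Hub as standard practice, an inner join from the Link always finds a match, even for the role type with no Object Two.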
The role type sits in the Link structure alongside the two Hub references, which means it participates in the hash key computation for the Link. A change in role type therefore produces a new Link entry, and the effectivity Satellite captures when the old entry stopped being active. That’s the expected behavior.
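The hash key participation can be sketched as follows. This assumes MD5 hashing with a semicolon as the hard delimiter between components; both are common conventions, not details confirmed by the source scenario, and the role type values are invented for illustration.

```python
import hashlib

def link_hash_key(hk_object_one, hk_object_two, role_type):
    """Link hash key computed over both Hub references plus the role type.
    ';' is used here as a hypothetical hard delimiter between components."""
    composite = ";".join([hk_object_one, hk_object_two, role_type.strip().upper()])
    return hashlib.md5(composite.encode("utf-8")).hexdigest()

# Same Hub pair, different role type -> a different hash key, i.e. a new Link entry:
hk_a = link_hash_key("a" * 32, "0" * 32, "STANDALONE")
hk_b = link_hash_key("a" * 32, "0" * 32, "ARCHIVED")
```

Since the role type is part of the composite, the same pair of Hub references with a changed role type yields a new Link hash key, which is exactly why the effectivity Satellite is needed to close out the old entry.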
Multi-Active Satellites vs. CDC Satellites: Choosing the Right Approach
The proposed model used multi-active Satellites to capture multiple rows with different valid_from and valid_to dates. Whether this is the right choice depends on one key question: are those records all active at the same time in the source system? Can a source system user see all of them simultaneously?
If yes — multiple records with different validity periods are all visible and active concurrently in the source — then a multi-active Satellite is the appropriate choice. The multi-active attribute should be a subsequence from staging rather than a business-supplied date, keeping control of uniqueness on the Data Vault side rather than trusting the source.
If no — the records represent sequential changes, not concurrent active states — then a CDC Satellite is a cleaner fit. A CDC Satellite is structurally a standard Satellite, but the load date is modified by adding the sequence number from the CDC package as microseconds. This means only one row is active at any given moment, which simplifies PIT Table construction (two columns per Satellite, the hash key and load date, instead of three with an additional subsequence) and improves join performance. The choice between the two comes down to how the source system actually manages these records.
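The load date modification is easy to demonstrate. A minimal sketch, assuming the CDC package delivers a monotonically increasing sequence number per change within a batch:

```python
from datetime import datetime, timedelta

def cdc_load_date(batch_load_date, cdc_sequence):
    """Make the load date unique per change by adding the CDC sequence
    number as microseconds, preserving change order within one batch."""
    return batch_load_date + timedelta(microseconds=cdc_sequence)

# Two changes to the same business key arriving in the same batch:
base = datetime(2024, 1, 15, 6, 0, 0)
first_change = cdc_load_date(base, 1)
second_change = cdc_load_date(base, 2)
```

Each change gets a distinct, correctly ordered load date, so the Satellite behaves like a standard Satellite with exactly one active row per key at any point in time.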
A third alternative worth noting: for multi-active scenarios, a JSON array stored in a standard Satellite can replace a traditional multi-active Satellite in some cases. It depends on the loading mechanism and the downstream consumption requirements, but it’s a valid option that avoids the multi-active complexity entirely by capturing multiple active rows as a structured JSON payload.
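A sketch of the JSON-payload alternative, with invented attribute names: the set of concurrently active rows is serialized deterministically so that the same set always produces the same payload, and therefore the same hash diff, regardless of the order in which the source delivers the rows.

```python
import hashlib
import json

# Hypothetical multi-active delivery: several rows active at once for one parent key.
active_rows = [
    {"valid_from": "2024-01-01", "valid_to": "2024-06-30", "rate": "A"},
    {"valid_from": "2024-01-01", "valid_to": "9999-12-31", "rate": "B"},
]

# Sort deterministically before serializing, so delivery order cannot
# produce a spurious change in the hash diff.
payload = json.dumps(
    sorted(active_rows, key=lambda r: (r["valid_from"], r["valid_to"], r["rate"])),
    separators=(",", ":"),
)
hash_diff = hashlib.md5(payload.encode("utf-8")).hexdigest()
```

The whole set of active rows becomes one descriptive attribute in a standard Satellite, and a new Satellite row is written only when the set as a whole changes.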
Effectivity Tracking and the Status Tracking Satellite
The proposed model included a separate status tracking Satellite alongside an effectivity Satellite. A cleaner and more storage-efficient approach is to merge these by adding a deletion timestamp directly to the effectivity Satellite.
The deletion timestamp works simply: when a record exists in the source, the deletion timestamp is set to end-of-all-times. When a deletion is detected — through a comparison of the current load against the target — the deletion timestamp is updated to the current load date, marking the record as no longer physically present in the source. If the record reappears, the timestamp reverts to end-of-all-times.
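The transition rule can be written as a small pure function. This is a sketch of the logic described above, not a specific implementation; the datetime value used for end-of-all-times is a common convention.

```python
from datetime import datetime

END_OF_ALL_TIMES = datetime(9999, 12, 31)

def next_deletion_ts(present_in_load, load_date, current_deletion_ts):
    """Transition rule for the deletion timestamp in the effectivity Satellite."""
    if present_in_load:
        return END_OF_ALL_TIMES            # record exists (or reappeared) in the source
    if current_deletion_ts == END_OF_ALL_TIMES:
        return load_date                   # deletion first detected in this load
    return current_deletion_ts             # already deleted earlier; keep that timestamp

load1 = datetime(2024, 3, 1)
load2 = datetime(2024, 4, 1)
ts = next_deletion_ts(False, load1, END_OF_ALL_TIMES)  # deletion detected in load1
ts = next_deletion_ts(False, load2, ts)                # still gone: timestamp unchanged
ts_back = next_deletion_ts(True, load2, ts)            # record reappeared
```

Note that a record staying absent across multiple loads keeps its original deletion timestamp, so the Satellite preserves when the deletion was first observed.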
All timelines — valid_from, valid_to, deletion timestamps — belong together in the effectivity Satellite. This consolidation reduces the number of Satellites to manage and makes the model more straightforward to query.
GDPR and Customer Re-Registration
A related question came up about GDPR: if a customer requests data deletion and later re-registers, are they treated as a new record? The answer is yes, and it’s an important distinction from standard soft-delete handling.
Soft deletes in the Raw Data Vault are used to track hard deletes from the source for non-legal reasons — products removed, records archived, relationships ended. The history is preserved in the Vault even when it’s gone from the source.
GDPR is different. When a deletion is legally required, the personal data must be genuinely removed — a hard delete in the target. The non-personal data associated with that customer may be retained, but the link between the old history and the re-registering customer is permanently severed. If that customer returns and creates a new record, there’s no way to reconnect them to their previous history, because that connection no longer exists in the model. This is by design: the loss of that relationship is the point.
Why Filter Conditions in Raw Data Vault Loading Are Risky
One of the more important principles raised in this discussion: never apply filter conditions when loading the Raw Data Vault. The specific question was whether it’s acceptable to filter the source by role type when loading the additional attributes Satellite — loading only rows where the role type matches the one that doesn’t reference Object Two.
The answer is no, and the reasoning is worth understanding clearly. Applying a WHERE condition or a filtering join in the Raw Data Vault loading process is an application of business logic. The Raw Data Vault is supposed to capture raw data as delivered, without interpretation. Any filter condition that depends on the content of the data — rather than purely technical checks like delta detection or null replacement — violates that principle.
The practical risk is concrete: source system behavior changes. A new role type is introduced that also lacks an Object Two reference. The filter condition doesn’t know about it, so those records get skipped. Or dirty data arrives where an unexpected combination of fields appears — both an Object Two reference and additional attributes for a role type that wasn’t supposed to have both. A filter condition handles this incorrectly or drops data silently.
The only conditions permitted in Raw Data Vault loading are technical ones: checking whether a business key or relationship already exists in the target (the delta check), and replacing null values with zero keys. Everything else — including role-type-based filtering — belongs in the Business Vault, where it can be applied as explicit, versioned, testable business logic.
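The distinction is easiest to see side by side. A minimal sketch of a Hub load with only the two permitted technical conditions, using invented business keys and a hypothetical textual representation of the zero key:

```python
# Technical-only conditions in a Hub load: null replacement and the delta check.
ZERO_KEY_BK = "(unknown)"  # hypothetical representation of the zero-key business key

existing_hub_keys = {"CUST-1001", "CUST-1002", ZERO_KEY_BK}  # zero key row pre-deployed
staged_keys = ["CUST-1001", "CUST-1003", None]

to_load = []
for bk in staged_keys:
    bk = ZERO_KEY_BK if bk is None else bk  # null replacement: content-independent
    if bk not in existing_hub_keys:         # delta check: content-independent
        to_load.append(bk)
# No condition inspects descriptive values such as role_type -- a filter like
# `if role_type == "STANDALONE"` would be business logic and belongs in the
# Business Vault, not here.
```

Both checks depend only on key existence, never on attribute content, which is what keeps them on the technical side of the line.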
Validating the Model with the JEDI Test
When uncertain about a modeling decision — whether to use a multi-active Satellite or a CDC Satellite, whether to split or consolidate — the JEDI test provides a reliable check. The test is simple: try to reconstruct the original source delivery from the Raw Data Vault. Join everything back together and verify that no records are lost, no columns are missing, and no artificial records have been generated that didn’t exist in the source.
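The reconstruction idea can be sketched with radically simplified structures. Here Hubs, the Link, and the Satellite are modeled as plain Python collections with invented keys and attributes; the point is only the round trip, not the physical table design.

```python
# JEDI test sketch: load a staged delivery into simplified Hub/Link/Satellite
# structures, reconstruct the delivery by joining them back, and compare.
staged = [
    {"bk_one": "A1", "bk_two": "B1", "role_type": "NORMAL", "attr": "x"},
    {"bk_one": "A2", "bk_two": None, "role_type": "STANDALONE", "attr": "y"},
]

hub_one = {r["bk_one"] for r in staged}
hub_two = {r["bk_two"] for r in staged if r["bk_two"] is not None}
link = [(r["bk_one"], r["bk_two"], r["role_type"]) for r in staged]
sat = {(r["bk_one"], r["bk_two"], r["role_type"]): {"attr": r["attr"]} for r in staged}

# Join Link back to its Satellite rows to rebuild the original delivery:
reconstructed = [
    {"bk_one": o, "bk_two": t, "role_type": rt, **sat[(o, t, rt)]}
    for (o, t, rt) in link
]
assert reconstructed == staged  # no lost rows, no missing columns, no artificial records
```

If this round trip fails — a row drops out of a join, a column cannot be recovered, or the join multiplies rows that never existed — the model has a gap, regardless of how reasonable it looks on paper.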
If the reconstruction succeeds without data loss or artificial inflation, the model is valid. Whether it’s the best model depends on the data and the downstream consumption patterns — but validity is the baseline, and the JEDI test is how you prove it.
To explore these modeling patterns in depth — including zero key handling, effectivity Satellites, CDC loading, and the JEDI test — check out our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical copy or ebook.
