Skip to main content
search
0
Scalefree Knowledge Webinars Data Vault Friday Pre-Joining Data Vault Business Keys During Load

Data Vault Business Keys

Data Vault architects often encounter a common challenge when their source systems mix surrogate technical IDs with true business keys. In one real-world scenario, an Employee table used a technical ID as its primary key, while the legitimate business identifier was NBR. Downstream tables—even a self-referencing manager hierarchy—used that ID as a foreign key. When building Hubs on both ID and NBR and linking them with a Same-As Link, the result resembles a Source Vault design—and it quickly becomes unwieldy to repeat ID→NBR lookups for every referencing table.

In this article, we’ll explore the best practice for handling mismatched keys: pre-joining business keys in your staging (or view) layer before loading the Raw Data Vault. You’ll learn why this denormalization is fully compliant with Data Vault principles, how to implement it (even with CDC feeds), and when you might need alternate approaches.



Why Separate Business Keys from Technical IDs?

Surrogate keys (ID) offer stable integer references, simple indexing, and isolation from business rule changes. But they aren’t meaningful outside the operational schema. True business keys (NBR)—like employee numbers, order numbers, or product SKUs—carry real-world meaning and ensure consistency across downstream BI and analytics models.

  • Business Key (NBR): Immutable identifier used in reporting, cross-system integration, and audit.
  • Technical ID (ID): Auto-generated surrogate for OLTP performance and referential integrity.

When you build a Hub on the business key, all Links and Satellites must reference that same key. Mixing in surrogate IDs without conversion violates business lineage and forces repetitive lookups.

The Pitfall: Source Vault by Accident

Faced with source tables referencing ID, some teams create:

  1. A Hub on ID (surrogate),
  2. A Hub on NBR (business key),
  3. A Same-As Link between them to tie ID↔NBR.

This “Source Vault” pattern captures technical IDs as though they were business keys—contradicting the principle that your Vault’s integration key must be a shared business identifier. Moreover, every time you load any Link or Satellite that uses ID, you must look up NBR via the Same-As Link. Tedious, error-prone, and defeating the agility of your Data Vault.

Best Practice: Pre-Join Business Keys in Staging

Data Vault training explicitly allows you to reshape your staging area—denormalizing or normalizing source data to simplify Raw Vault loads. Pre-joining means: before your load process begins, enrich every source record with the true business key (NBR) rather than the surrogate ID. You then feed the Hub/Satellite/Link loaders with business keys directly—no Same-As Link gymnastics required.

How It Works

  1. Create a view or staging query that joins your Employee table (on ID) to itself or to the hierarchy table to retrieve NBR. Add the NBR field into every downstream staging record.
  2. Use that pre-joined staging view as the source for your Data Vault loaders. All Hubs, Links, and Satellites can now reference NBR consistently.
  3. Drop the accidental Hub on ID and Same-As Link—your Vault only contains the true business key Hub (Hub_Employee on NBR).

This approach turns the repeated lookup problem into a one-time denormalization, improving performance and maintainability.

Tool Support: dbt & FlowBI

Modern Data Vault toolkits recognize this pattern:

  • datavault4dbt package: Offers a pre_join feature to automatically enrich staging tables with business keys.
  • FlowBI: Includes configuration options to map surrogate IDs to business keys before Vault loads.

When your staging area resides on a cloud data lake (e.g., AWS S3 + Redshift Spectrum, Azure Data Lake + Synapse), these tools can reference external tables and materialize pre-joined views seamlessly.

Handling CDC & Empty Deltas with Forward Lookup

Change Data Capture (CDC) introduces a nuance: sometimes the Employee record doesn’t change (NBR remains the same), but the hierarchy table (manager assignment) does. A pure inner-join staging view would omit the hierarchy change because no new employee row appeared.

Forward Lookup Pattern

  1. When your CDC batch contains only hierarchy changes, load your staging view with the ID column but no accompanying NBR.
  2. Instead of joining to the source Employee table (which has no new row), perform a forward lookup against the target Satellite in your Raw Vault. That Satellite already stores every historic mapping of ID→NBR.
  3. Retrieve the latest NBR value for each ID and inject it into your staging records—just as if you had joined to the source.

This assumes your CDC infrastructure and initial loads correctly populated the Satellite. If CDC reliability is questionable, you may need to capture ID first and resolve to NBR later in a Business Vault layer.

Caveats: Pre-Join Within a Single Source Only

While you can (and should) pre-join within one source system, avoid chaining pre-joins across multiple systems. If you first wait for System A’s staging load to produce NBR from its ID, and then join System B’s staging to A’s data, you introduce cross-system load dependencies. That forces you to serialize loads—waiting for one system’s batch to finish before you can process another.

To maintain parallel ingestion, each source should be pre-joined only to its own business keys. If two systems share a business key, let that intersection happen downstream in your Vault (via Hub on the shared business key), not in the staging layer.

When You Can’t Pre-Join: Source Vault as Last Resort

In rare environments where you cannot reliably pre-join—legacy databases with locked-down permissions or untrustworthy CDC—you may fall back to a Source Vault. In this design:

  • Your Hubs use surrogate ID as the primary key (capturing the technical ID).
  • You defer mapping to true business keys into the Business Vault layer, after all sources land.

Source Vaults make sense only when staging denormalization is impossible. Otherwise, they sacrifice business clarity for expedience.

Step-By-Step Implementation Guide

  1. Identify Business Keys: Catalog each table’s true business key(s)—not the surrogate PKs.
  2. Build Staging Views: For each source, create a view that LEFT JOINs the “owner” table back to itself (or to its lookup tables) to pull in NBR wherever ID appears.
  3. Validate Keys: Ensure every staging record includes a non-null business key. Flag or quarantine any orphans (IDs without known business key).
  4. Configure Load Scripts: Point your Vault loaders (Hubs, Links, Satellites) at these staging views. Remove any loaders that target surrogate key Hubs or Same-As Links.
  5. Implement Forward Lookup: For CDC batches that may omit source changes, add a fallback join to the Employee Satellite in your load script to fetch the last known NBR for each ID.

Benefits of Pre-Joining Business Keys

  • Simplicity: One denormalization step replaces dozens of repetitive lookups.
  • Performance: Staging views optimize key enrichment in set-based SQL rather than row-by-row Link loads.
  • Lineage: Your Raw Vault contains only true business keys, preserving clear end-to-end lineage.
  • Maintainability: Future source schemas that reference ID get mapped automatically via the shared staging view logic.

Conclusion

Mismatched surrogate and business keys need not derail your Data Vault design. By embracing pre-joining business keys in your staging layer—along with forward lookups for CDC edge cases—you preserve a clean, business-centric Vault model without cumbersome Same-As Links. Reserve Source Vaults only for environments where staging denormalization simply cannot occur. With these best practices, your Vault remains performant, transparent, and aligned with true business identifiers.

Watch the Video

Meet the Speaker

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

The Data Vault Handbook

Build your path to a scalable and resilient Data Platform

The Data Vault Handbook is an accessible introduction to Data Vault. Designed for data practitioners, this guide provides a clear and cohesive overview of Data Vault principles.

Read it for Free

Leave a Reply

Close Menu