
Split Hubs Are a Data Vault Anti-Pattern: Here’s Why

A practice that occasionally surfaces in Data Vault projects — though it doesn’t appear in the official methodology — is splitting Hubs by source system in the Raw Data Vault, then consolidating them into a “golden record” Hub in the Business Vault. The idea seems intuitive: keep SAP customers and Oracle customers separate at the raw layer, then unify them later. In practice, this approach undermines one of Data Vault’s most powerful features. This post explains why split Hubs are an anti-pattern and what the correct approach looks like.



Split Hubs: Why They Contradict the Purpose of a Hub

To understand why splitting Hubs by source system is a problem, start with the fundamental purpose of a Hub in Data Vault 2.0: a Hub represents a business concept. Not an SAP customer. Not an Oracle customer. A customer. Full stop.

One of the most valuable properties of the Raw Data Vault is that it serves as the integration layer for business keys. This is called passive integration: when two source systems share the same business key for the same real-world entity — a customer number that exists in both SAP and Oracle, for example — loading both into the same Hub causes integration to happen automatically at load time. The moment the same business key is hashed and loaded from both systems, it maps to the same Hub record. No additional logic required.
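Passive integration can be illustrated with a minimal Python sketch. It is not a production loader: the `hash_key` and `load_hub` helpers, the dictionary standing in for the Hub table, the sample customer numbers, and the choice of MD5 with trim-and-uppercase normalization are all assumptions made for illustration.

```python
import hashlib


def hash_key(business_key: str) -> str:
    """Hash a normalized business key (MD5 is used here only for illustration)."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def load_hub(hub: dict, business_keys, record_source: str) -> None:
    """Insert only those business keys the Hub does not contain yet."""
    for bk in business_keys:
        hk = hash_key(bk)
        if hk not in hub:
            hub[hk] = {
                "business_key": bk.strip().upper(),
                "record_source": record_source,  # first system that delivered the key
            }


# One Hub for the business concept "customer", fed by both source systems.
hub_customer = {}
load_hub(hub_customer, ["C-1001", "C-1002"], "SAP")
load_hub(hub_customer, ["c-1002 ", "C-1003"], "ORACLE")

# C-1002 arrives from both systems but exists only once in the Hub.
print(len(hub_customer))  # 3
```

The second load finds that the hash for C-1002 already exists and skips it, so integration happens at load time without any matching logic.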

When you split Hubs by source system, you bypass this integration entirely. HUB_SAP_CUSTOMERS and HUB_ORACLE_CUSTOMERS are two separate entities in the model, and any integration between them has to be built explicitly later — which is exactly the kind of work the Raw Data Vault was designed to handle for you. You’ve taken a passive, automatic process and made it a manual, deferred one.

Business Key Identification: The Real Work

The split Hub pattern often appears in projects where the business key selection process hasn’t been given enough attention. Identifying the right business key is one of the most important — and underestimated — tasks in a Data Vault implementation. It’s a topic that deserves its own dedicated discussion, but the key hierarchy is worth understanding at a high level.

At the top are global business keys: identifiers that are recognized universally, like a VIN for vehicles or an ISBN for books. These are ideal because they enable integration not just across internal systems but with external data sources as well. Below that are company-wide business keys, identifiers shared across multiple internal source systems. These are the keys that enable cross-system Hub integration. At the bottom are system-specific keys, known only to a single source system.

The temptation for data engineers under time pressure is to reach for whatever unique key is most readily available — often a surrogate key or a system-generated sequence. These keys reliably identify records within their source system, but they were never designed to integrate across systems. Using them as Hub business keys produces technically valid Hubs that miss the entire integration value of the Raw Data Vault.

Investing time upfront in identifying a company-wide or global business key — even if it requires conversations with business stakeholders and source system specialists — pays back significantly in the quality and simplicity of the resulting model. Our Data Vault 2.1 Training & Certification covers business key identification as a core modeling skill.

The Data Vault Handbook:
Core Concepts and Modern Applications

Build Your Path to a Scalable and Resilient Data Platform

The Data Vault Handbook is an accessible introduction to Data Vault. Designed for data practitioners, this guide provides a clear and cohesive overview of Data Vault principles.

Read it for Free

When Two Systems Use Different Keys for the Same Entity

What if SAP and Oracle genuinely use different, unrelated keys for the same customer? This is a common real-world scenario, and the solution is not to create separate Hubs. Both keys still go into the same customer Hub — because a Hub is a distinct list of business keys, not a distinct list of business objects. Two different keys can represent the same customer in the Hub without causing a problem.

The tool for recording that two different keys identify the same customer is the Same-as-Link (SAL). A Same-as-Link references the same Hub twice, one side for the master record and one side for the duplicate, and establishes the relationship between them. The golden record logic, the master record calculation, the determination of which key takes precedence: all of that belongs in the Business Vault, expressed as an explicit business rule through the SAL. In some cases, the source system itself provides a key mapping, for example a master data management system that already knows which keys refer to the same entity, and that mapping can be loaded directly into a SAL in the Raw Data Vault.
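The shape of a Same-as-Link can be sketched in a few lines of Python. Everything named here is hypothetical: the sample keys, the `key_mapping` list standing in for a delivered key mapping, the column names, and the `resolve_master` helper. MD5 and the `||` delimiter are illustrative choices, not requirements.

```python
import hashlib


def hash_key(*parts: str) -> str:
    """Hash one or more normalized key parts joined by a delimiter."""
    joined = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# A single customer Hub holding the keys of both systems.
hub_customer = {hash_key(bk): bk for bk in ["SAP-4711", "ORA-98"]}

# Hypothetical key mapping: (master key, duplicate key).
key_mapping = [("SAP-4711", "ORA-98")]

# The Same-as-Link references the same Hub twice.
sal_customer = []
for master_bk, duplicate_bk in key_mapping:
    master_hk = hash_key(master_bk)
    duplicate_hk = hash_key(duplicate_bk)
    # Both sides must already exist in the one customer Hub.
    assert master_hk in hub_customer and duplicate_hk in hub_customer
    sal_customer.append({
        "sal_hk": hash_key(master_bk, duplicate_bk),
        "master_customer_hk": master_hk,
        "duplicate_customer_hk": duplicate_hk,
    })


def resolve_master(customer_hk: str, sal: list) -> str:
    """Return the master hash key for a customer, or the key itself if unmapped."""
    for row in sal:
        if row["duplicate_customer_hk"] == customer_hk:
            return row["master_customer_hk"]
    return customer_hk
```

Downstream queries can call something like `resolve_master` to roll duplicates up to their master record, while the Hub itself keeps every key exactly as delivered.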

This approach keeps the Raw Data Vault clean and close to the source, while giving the Business Vault a precise, auditable place to implement the integration logic. For a deeper look at how SALs enable enterprise-wide deduplication, see our post on Data Vault in modern architecture patterns.

Handling Surrogate Key Collisions

Surrogate keys — sequence numbers used as primary keys in source systems — introduce a specific risk: the same number in SAP and Oracle might refer to two completely different customers. Customer 1042 in SAP is not the same entity as Customer 1042 in Oracle, but if both are loaded into the same Hub using just the sequence number as the business key, they hash to the same value and collapse into a single Hub record. That’s a data integrity problem.

The fix is not to create separate Hubs. The fix is to include a source system identifier in the hash key calculation. The business key fed into the hash function becomes a combination of the source system identifier and the sequence number — SAP + 1042 and Oracle + 1042 hash to different values and produce separate Hub records. One Hub, two distinct records, no collision. The source system becomes part of the key definition rather than a reason to fragment the model.
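A short Python sketch shows the collision and the fix. The `hash_key` helper, the `||` delimiter, the source system codes, and the use of MD5 are assumptions for illustration. The delimiter matters: without one, the parts ("A", "B1") and ("AB", "1") would concatenate to the same string and collide.

```python
import hashlib


def hash_key(*parts: str, delimiter: str = "||") -> str:
    """Hash normalized key parts joined by a delimiter (MD5 for illustration only)."""
    joined = delimiter.join(p.strip().upper() for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Sequence number alone: both systems produce the same hash key.
naive_sap = hash_key("1042")
naive_oracle = hash_key("1042")
print(naive_sap == naive_oracle)  # True: two customers collapse into one record

# Source system identifier included: the hash keys differ.
sap_hk = hash_key("SAP", "1042")
oracle_hk = hash_key("ORACLE", "1042")
print(sap_hk == oracle_hk)  # False

# Still one Hub, now with two distinct records.
hub_customer = {
    sap_hk: {"source_system": "SAP", "customer_id": "1042"},
    oracle_hk: {"source_system": "ORACLE", "customer_id": "1042"},
}
print(len(hub_customer))  # 2
```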

What Correct Hub Loading Looks Like

To bring this together: if SAP and Oracle share the same company-wide business key for customers, load both into a single customer Hub and add separate Satellites per source system. The integration happens automatically at load time — no golden record logic required in the Raw Data Vault.
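The first scenario, one Hub with a Satellite per source system, might look like the following sketch. The table and column names (`sat_customer_sap`, `customer_no`, and so on) and the sample attributes are hypothetical, and the delta detection a real Satellite load would perform is left out for brevity.

```python
import hashlib
from datetime import datetime, timezone


def hash_key(business_key: str) -> str:
    """Hash a normalized business key (MD5 for illustration only)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


hub_customer = {}          # hash key -> business key
sat_customer_sap = []      # descriptive attributes delivered by SAP
sat_customer_oracle = []   # descriptive attributes delivered by Oracle


def load(record: dict, satellite: list) -> None:
    """Load the business key into the shared Hub and the attributes into a Satellite."""
    hk = hash_key(record["customer_no"])
    hub_customer.setdefault(hk, record["customer_no"])
    attributes = {k: v for k, v in record.items() if k != "customer_no"}
    satellite.append({
        "customer_hk": hk,
        "load_date": datetime.now(timezone.utc),
        **attributes,
    })


load({"customer_no": "C-1002", "name": "Acme GmbH"}, sat_customer_sap)
load({"customer_no": "C-1002", "credit_limit": 5000}, sat_customer_oracle)

# One Hub record; each system's attributes stay in their own Satellite.
print(len(hub_customer))  # 1
```

Both Satellite rows carry the same `customer_hk`, so joining them through the Hub returns a combined view of the customer without any matching rule.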

If they use different keys, load both into the same Hub and create a Same-as-Link to express the relationship between them: in the Business Vault when the match is derived by a business rule, or in the Raw Data Vault when a source system delivers the key mapping. If surrogate keys create collision risk, include the source system identifier in the hash key computation to ensure uniqueness while still maintaining a single Hub.

In all three scenarios, the answer is one Hub per business concept. Split Hubs trade short-term convenience for long-term complexity — and they give up the passive integration capability that makes Data Vault worth using in the first place.

To go deeper on Hub design, business key identification, and the full Raw Data Vault methodology, explore our Data Vault certification program. The Data Vault Handbook is also available as a free physical copy or ebook for a solid grounding in the core concepts.
