Data Vault Friday

Split Hubs Are a Data Vault Anti-Pattern: Here’s Why

A practice that occasionally surfaces in Data Vault projects — though it doesn’t appear in the official methodology — is splitting Hubs by source system in the Raw Data Vault, then consolidating them into a “golden record” Hub in the Business Vault. The idea seems intuitive: keep SAP customers and Oracle customers separate at the raw layer, then unify them later. In practice, this approach undermines one of Data Vault’s most powerful features. This post explains why split Hubs are an anti-pattern and what the correct approach looks like.

Split Hubs: Why They Contradict the Purpose of a Hub

To understand why splitting Hubs by source system is a problem, start with the fundamental purpose of a Hub in Data Vault 2.0: a Hub represents a business concept. Not an SAP customer. Not an Oracle customer. A customer. Full stop.

One of the most valuable properties of the Raw Data Vault is that it serves as the integration layer for business keys. This is called passive integration: when two source systems share the same business key for the same real-world entity — a customer number that exists in both SAP and Oracle, for example — loading both into the same Hub causes integration to happen automatically at load time. The moment the same business key is hashed and loaded from both systems, it maps to the same Hub record. No additional logic required.
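
To make passive integration concrete, here is a minimal sketch of a Hub load in generic SQL. The table and column names (hub_customer, stg_sap_customers, customer_number) are illustrative rather than taken from the post; the pattern, hashing the business key and inserting only keys not yet present, is the standard Hub loading template.

```sql
-- Hub load from the SAP staging table. Pointing the same statement
-- at stg_oracle_customers integrates Oracle customers into the very
-- same Hub: a key that already exists is simply skipped.
INSERT INTO hub_customer (hub_customer_hk, customer_number, load_date, record_source)
SELECT DISTINCT
    MD5(UPPER(TRIM(stg.customer_number))),   -- hash key derived from the business key
    stg.customer_number,
    CURRENT_TIMESTAMP,
    'SAP'
FROM stg_sap_customers stg
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_customer h
    WHERE h.hub_customer_hk = MD5(UPPER(TRIM(stg.customer_number)))
);
```

If SAP and Oracle share customer number 4711, the second system's load finds the hash already present and inserts nothing: the two sources are integrated without any dedicated logic.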

When you split Hubs by source system, you bypass this integration entirely. HUB_SAP_CUSTOMERS and HUB_ORACLE_CUSTOMERS are two separate entities in the model, and any integration between them has to be built explicitly later — which is exactly the kind of work the Raw Data Vault was designed to handle for you. You’ve taken a passive, automatic process and made it a manual, deferred one.

Business Key Identification: The Real Work

The split Hub pattern often appears in projects where the business key selection process hasn’t been given enough attention. Identifying the right business key is one of the most important — and underestimated — tasks in a Data Vault implementation. It’s a topic that deserves its own dedicated discussion, but the key hierarchy is worth understanding at a high level.

At the top are global business keys: identifiers that are recognized universally, like a VIN for vehicles or an ISBN for books. These are ideal because they enable integration not just across internal systems but with external data sources as well. Below that are company-wide business keys — identifiers shared across multiple internal source systems. These are the keys that enable cross-system Hub integration. At the bottom are system-specific keys, known only to a single source system.

The temptation for data engineers under time pressure is to reach for whatever unique key is most readily available — often a surrogate key or a system-generated sequence. These keys reliably identify records within their source system, but they were never designed to integrate across systems. Using them as Hub business keys produces technically valid Hubs that miss the entire integration value of the Raw Data Vault.

Investing time upfront in identifying a company-wide or global business key — even if it requires conversations with business stakeholders and source system specialists — pays back significantly in the quality and simplicity of the resulting model. Our Data Vault 2.1 Training & Certification covers business key identification as a core modeling skill.

When Two Systems Use Different Keys for the Same Entity

What if SAP and Oracle genuinely use different, unrelated keys for the same customer? This is a common real-world scenario, and the solution is not to create separate Hubs. Both keys still go into the same customer Hub — because a Hub is a distinct list of business keys, not a distinct list of business objects. Two different keys can represent the same customer in the Hub without causing a problem.

The tool for resolving that ambiguity is the Same-as-Link (SAL). A Same-as-Link references the same Hub twice — one side for the master record, one side for the duplicate — and establishes the relationship between them. The golden record logic, the master record calculation, the determination of which key takes precedence: all of that belongs in the Business Vault, expressed as an explicit business rule through the SAL. In some cases, the source system itself provides a key mapping — a master data management system that already knows which keys refer to the same entity — and that mapping can be loaded directly into the SAL in the Raw Data Vault.
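
As a sketch of the structure, a Same-as-Link for the customer Hub might be declared like this; the table and column names are illustrative, and the exact metadata columns vary by implementation standard.

```sql
-- A Same-as-Link references the same Hub twice: one side carries the
-- master record's hash key, the other the duplicate's hash key.
CREATE TABLE sal_customer (
    sal_customer_hk     CHAR(32)  NOT NULL,  -- hash over both referenced keys
    hub_customer_hk     CHAR(32)  NOT NULL,  -- master record
    hub_customer_dup_hk CHAR(32)  NOT NULL,  -- duplicate record
    load_date           TIMESTAMP NOT NULL,
    record_source       VARCHAR   NOT NULL,
    PRIMARY KEY (sal_customer_hk),
    FOREIGN KEY (hub_customer_hk)     REFERENCES hub_customer (hub_customer_hk),
    FOREIGN KEY (hub_customer_dup_hk) REFERENCES hub_customer (hub_customer_hk)
);
```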

This approach keeps the Raw Data Vault clean and close to the source, while giving the Business Vault a precise, auditable place to implement the integration logic. For a deeper look at how SALs enable enterprise-wide deduplication, see our post on Data Vault in modern architecture patterns.

Handling Surrogate Key Collisions

Surrogate keys — sequence numbers used as primary keys in source systems — introduce a specific risk: the same number in SAP and Oracle might refer to two completely different customers. Customer 1042 in SAP is not the same entity as Customer 1042 in Oracle, but if both are loaded into the same Hub using just the sequence number as the business key, they hash to the same value and collapse into a single Hub record. That’s a data integrity problem.

The fix is not to create separate Hubs. The fix is to include a source system identifier in the hash key calculation. The business key fed into the hash function becomes a combination of the source system identifier and the sequence number — SAP + 1042 and Oracle + 1042 hash to different values and produce separate Hub records. One Hub, two distinct records, no collision. The source system becomes part of the key definition rather than a reason to fragment the model.
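
In code, the difference is nothing more than what is fed into the hash function. A hedged sketch (the delimiter and concatenation syntax vary by platform):

```sql
-- Without a source identifier, key 1042 collides across systems:
SELECT MD5('1042');                       -- same value for SAP and Oracle

-- With the source identifier as part of the key definition:
SELECT MD5('SAP'    || '||' || '1042');   -- distinct value
SELECT MD5('ORACLE' || '||' || '1042');   -- distinct value
```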

What Correct Hub Loading Looks Like

To bring this together: if SAP and Oracle share the same company-wide business key for customers, load both into a single customer Hub and add separate Satellites per source system. The integration happens automatically at load time — no golden record logic required in the Raw Data Vault.

If they use different keys, load both into the same Hub and create a Same-as-Link in the Business Vault to express the relationship between them. If surrogate keys create collision risk, include the source system identifier in the hash key computation to ensure uniqueness while still maintaining a single Hub.

In all three scenarios, the answer is one Hub per business concept. Split Hubs trade short-term convenience for long-term complexity — and they give up the passive integration capability that makes Data Vault worth using in the first place.

To go deeper on Hub design, business key identification, and the full Raw Data Vault methodology, explore our Data Vault certification program. The Data Vault Handbook is also available as a free physical copy or ebook for a solid grounding in the core concepts.

Watch the Video

Hash Keys in Data Vault on Modern Data Platforms: Snowflake, Fabric, and Beyond

A question that comes up regularly — especially from teams working on cloud-native platforms like Snowflake — is whether hash keys are still necessary, or whether sequences or raw business keys might be more efficient. It’s a fair question, and the answer depends on understanding what hash keys actually solve, what the alternatives cost, and how modern massively parallel processing (MPP) platforms change the performance equation. This post covers all three options and explains why hash keys remain the recommended approach even on modern platforms.

Hash Keys on Modern Data Platforms: Why Not Sequences?

Sequences are the first alternative most people consider — integers are small, fast to compare, and familiar. But they come with a fundamental structural problem: they require lookups. To load a Link, you need the sequence values for the Hubs it references, which means Hubs must be loaded before Links, Links before their Satellites, and so on. In small, single-environment setups, this ordering constraint is manageable. In large-scale or distributed environments, it becomes a serious obstacle.

Consider a setup where facts and real-time feeds live in the cloud while customer master data lives on-premise. To load a fact with a sequence-based key, you need to look up the sequence for each customer from the on-premise system — through a firewall, across a network, under latency and security constraints. In practice, this doesn’t scale. It introduces tight loading dependencies between systems that should be able to operate independently.

Hash keys and business keys don’t have this problem. Hash the same business key on two different systems and you get the same hash value. Both environments can load independently and join cleanly without cross-environment lookups. At Scalefree, the only clients currently using sequences in their Data Vault are on migration projects — migrating away from sequences. That’s worth keeping in mind before choosing them.
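
To see why hash keys remove the lookup dependency, consider a hedged sketch of a Link load (names illustrative). Every reference is computed directly from the business keys available in staging, so no Hub needs to be queried, or even loaded yet:

```sql
-- Every hash key is derived from staging data alone; with sequences,
-- each row would instead require a join to the Hubs to fetch the
-- sequence values, forcing a strict loading order.
INSERT INTO link_customer_order (link_customer_order_hk, hub_customer_hk, hub_order_hk, load_date, record_source)
SELECT DISTINCT
    MD5(UPPER(TRIM(stg.customer_number)) || '||' || UPPER(TRIM(stg.order_number))),
    MD5(UPPER(TRIM(stg.customer_number))),   -- derived, not looked up
    MD5(UPPER(TRIM(stg.order_number))),      -- derived, not looked up
    CURRENT_TIMESTAMP,
    'SHOP'
FROM stg_orders stg;
```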

Business Keys: When They Work and When They Don’t

Business keys are the other alternative. On the surface, a business key stored directly in a Hub seems simpler than hashing it — one less step, shorter values. And on modern MPP platforms like Snowflake, Fabric, or BigQuery, the join performance argument for hash keys is less compelling than it used to be. These platforms distribute and index data across thousands of nodes in ways that make business key joins perform reasonably well.

The problem shows up in Links. A Link referencing three or four Hubs combines multiple business keys into its primary key. A VIN alone is 17 characters; combine it with a customer number, a transaction ID, and a location code and you’ve already exceeded the 32 characters of an MD5 hash. Business keys are also often variable-length, which matters on traditional row-based database systems: fixed-length fields are guaranteed to stay in the primary page during a join, while variable-length fields may be offloaded to a secondary page, turning a two-page join into a four-page operation.

On Non-Historized Links and their attached Satellites — where volume is high and the primary key is replicated across every row — wide, variable-length business key combinations compound quickly into a storage and performance problem. As you dig deeper into the Data Vault model with more complex queries and more joins, the size of the join conditions grows with the business keys.

The other practical constraint is tool stack consistency. If your environment mixes a cloud MPP platform with an on-premise Postgres derivative, a data lake for staging, and various Business Vault loading tools, using business keys means different query patterns depending on which systems are involved. Sometimes you join on the business key, sometimes on the hash key, sometimes on a combination. The query logic becomes metadata-driven and harder to read. Hash keys simplify this: always one column, always the same join pattern, regardless of platform.

Binary vs. Character Hash Values

Once you’ve decided to use hash keys, the next question is storage format: character (32 chars for MD5, 40 for SHA-1) or binary (16 or 20 bytes respectively). Binary is half the size, joins faster, and produces smaller join conditions in the dimensional layer — all genuine advantages, especially when materializing data into OLAP cubes or columnar tools like QlikSense.

The reason most projects still use character-based hash values is tool compatibility. Strings are universally supported. Binary data types are not — many real-time processing tools, data mining platforms, and AI/ML frameworks work with basic data types only. If an external script, a RapidMiner workflow, or a streaming processor needs to write into the Business Vault, a binary hash key may not be supported without explicit conversion logic.

The practical recommendation: use character-based hash values in the Raw Data Vault and Business Vault for maximum compatibility. In the Information Mart, if the data is being materialized into a tool that benefits from smaller keys — an OLAP cube, a QlikView dataset — convert to binary in the view layer. That keeps the core model flexible while capturing the storage and join benefits where they actually matter.
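
On Snowflake, for instance, that view-layer conversion is a one-liner: TO_BINARY with the HEX format parses the 32-character MD5 string into 16 bytes. A sketch, with illustrative schema and column names:

```sql
-- The core model keeps the character hash; only the mart view converts.
CREATE OR REPLACE VIEW information_mart.dim_customer AS
SELECT
    TO_BINARY(h.hub_customer_hk, 'HEX') AS customer_dim_key,  -- 32 chars -> 16 bytes
    s.customer_name,
    s.customer_segment
FROM raw_vault.hub_customer h
JOIN raw_vault.sat_customer s
  ON s.hub_customer_hk = h.hub_customer_hk;
```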

Hashdiffs on Modern Platforms: Still Worth It

A related question is whether hashdiffs are still valuable on column-based platforms like Snowflake, where column compression already reduces redundant data significantly. The answer is yes, and the reason lies in how compute is distributed across loads rather than in the cost of a single load.

The hashdiff is calculated when a record is first loaded into a Satellite. On subsequent loads, the comparison is between the freshly calculated staging hashdiff and the already-materialized Satellite hashdiff — which was computed during a previous load, not the current one. This means the compute cost of delta detection is spread across the load history: roughly half the work happens in prior loads, and the current load only handles the staging side. Over time, especially on high-volume Satellites with relatively low change rates, this distribution of compute is a meaningful performance gain.

Column-by-column comparison without a hashdiff moves all of that computation into the current load and requires fetching additional column pages for each comparison on column-based storage. The hashdiff collapses the entire comparison into a single column join, which scales much better as Satellite width and data volume grow. This is why tools like datavault4dbt no longer offer hashdiff as an optional feature — it’s simply on by default, because the performance case is consistent enough that disabling it isn’t worth the option overhead.
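
A hedged sketch of the resulting delta check (names illustrative, Snowflake-style QUALIFY used to pick the latest Satellite row): the staging hashdiff is computed in the current load, while the Satellite side was materialized by earlier loads, so the comparison itself collapses into one column.

```sql
INSERT INTO sat_customer (hub_customer_hk, load_date, hashdiff, customer_name, customer_segment)
SELECT
    stg.hub_customer_hk,
    CURRENT_TIMESTAMP,
    stg.hashdiff,                  -- MD5 over all descriptive columns, computed in staging
    stg.customer_name,
    stg.customer_segment
FROM stg_customers stg
LEFT JOIN (
    -- latest Satellite row per key; its hashdiff was computed in a prior load
    SELECT hub_customer_hk, hashdiff
    FROM sat_customer
    QUALIFY ROW_NUMBER() OVER (PARTITION BY hub_customer_hk ORDER BY load_date DESC) = 1
) cur ON cur.hub_customer_hk = stg.hub_customer_hk
WHERE cur.hub_customer_hk IS NULL      -- brand-new key
   OR cur.hashdiff <> stg.hashdiff;    -- or a changed payload
```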

The Case for Staying with Hash Keys

Modern MPP platforms do reduce some of the traditional arguments for hash keys — join performance on business keys is no longer the clear-cut problem it was on row-based on-premise systems. But hash keys still deliver consistent advantages that matter in real projects: single-column join conditions that work the same way everywhere, independence from loading order, full compatibility across distributed environments, and a query pattern simple enough to generate automatically from metadata.

For teams building on Databricks, Snowflake, Fabric, or any other modern platform, hash keys remain the recommended approach. Not because the alternatives are impossible, but because the consistency and operational simplicity they provide across varied tool stacks and deployment patterns is worth more than the marginal gains from switching.

To explore hash key design, hashdiff patterns, and the full Data Vault modeling approach in depth, check out our Data Vault 2.1 Training & Certification. And for a solid introduction to the core concepts, the Data Vault Handbook is available as a free physical copy or ebook.

Watch the Video

SCD Type 2 Dimension Keys in Data Vault: Hash Keys, Sequences, and the PIT Table

Defining dimension keys in a Data Vault solution is one of those topics that seems straightforward until you get to Type 2 dimensions — and then the options multiply quickly. Should you use hash keys or sequences? Where do Type 2 keys come from, and how do they connect back to your facts? This post walks through the full picture, from the simplest Type 1 case all the way to the Dimension Hash Key pattern used for Type 2 slowly changing dimensions.

SCD Type 2 Dimension Keys: Starting with the Simple Case

For Type 0 and Type 1 dimensions — dimensions without history — the dimension key question is easy. Every Hub already contains exactly one hash key per business entity, and every Link contains one hash key per relationship. These Type 1 hash keys are already present throughout your model: in Non-Historized Links, Dependent Child Links, and Bridge Tables. You can use them directly as dimension keys in your view layer without generating anything new. It’s the lowest-effort, highest-compatibility option.

Hash keys also have a significant advantage over sequences in distributed environments. If your facts live in the cloud and your dimensions are generated on-premise, you can’t easily synchronize integer sequences between systems — the lookup dependencies alone make it impractical. Hash keys don’t have this problem. Hashing the same business key on two different systems produces the same hash value. A distributed Information Mart works cleanly with hash keys; with sequences, it becomes a coordination problem.

For more on how hash keys work in Data Vault and why they’re designed the way they are, the Scalefree blog covers the topic in depth.

When Sequences Make Sense — and How to Generate Them

The case for sequences is primarily storage. An MD5 hash value stored as a character string takes 32 bytes; a SHA-1 takes 40. A big integer takes 8 bytes. If storage is a genuine concern, converting character-based hash values to binary in the view layer is the first option to consider — it cuts the size in half with minimal effort and no structural changes.

If you still want integer sequences after that, there are two places to generate them. You can add a sequence column directly to the Hub or Link structure, used purely as a downstream dimension key rather than as an identifier. This works but creates a conceptual tension: after spending effort explaining why sequences aren’t used as Hub identifiers, reintroducing them in the same structure is confusing for anyone reading the model.

The cleaner approach is a Computed Satellite in the Business Vault, attached to the Hub or Link, that generates a new sequence value for every new record in the parent. It’s a simple business rule — new parent record, new sequence — and it keeps the sequence generation in the layer designed for computed values. The trade-off is an additional join when consuming the sequence downstream, but the design is explicit and the logic is easy to understand and maintain.
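
A minimal sketch of such a Computed Satellite load, assuming a platform sequence object and illustrative names: every parent key not yet present in the Satellite receives the next sequence value.

```sql
-- Business rule: new parent record -> new sequence value.
INSERT INTO business_vault.csat_customer_seq (hub_customer_hk, customer_seq, load_date)
SELECT
    h.hub_customer_hk,
    seq_customer_dim.NEXTVAL,      -- Snowflake-style sequence; syntax varies by platform
    CURRENT_TIMESTAMP
FROM raw_vault.hub_customer h
WHERE NOT EXISTS (
    SELECT 1
    FROM business_vault.csat_customer_seq c
    WHERE c.hub_customer_hk = h.hub_customer_hk
);
```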

The Type 2 Challenge: Why Hub Hash Keys Aren’t Enough

Type 1 hash keys work for dimensions without history because the granularity is one row per business entity. Type 2 dimensions need finer granularity — one row per business entity per version over time. The hash key from the Hub doesn’t capture that; it’s the same value regardless of when you’re looking at the data.

What you need for a Type 2 dimension is a key that is unique not just per entity but per entity per point in time. In Data Vault, that key already exists — it’s generated as part of the PIT Table.

The Dimension Hash Key from the PIT Table

When producing a Type 2 dimension, you need a PIT Table anyway — it provides the snapshot-based granularity that drives the dimension’s history. The PIT Table’s alternate key is the combination of the parent’s business key (not the hash key — never hash a hash) and the snapshot date. The primary key of the PIT Table is a hash value computed from those two inputs: business key plus snapshot date.

At Scalefree, this value is called the Dimension Hash Key. It is unique per row in the PIT Table, which means it is unique per entity per point in time — exactly what a Type 2 dimension key needs to be. This Dimension Hash Key becomes the primary key of your Type 2 dimension and the foreign key that your fact entities need to reference in order to join to the correct dimension member at the correct point in time.
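
In a sketch (names illustrative), the Dimension Hash Key is simply one more hash computed while building the PIT Table. A real PIT load also carries one load-date pointer per attached Satellite, omitted here for brevity:

```sql
-- One PIT row per business key per snapshot date; the primary key is
-- the Dimension Hash Key: hash(business key, snapshot date).
INSERT INTO pit_customer (dim_customer_hk, customer_number, snapshot_date, hub_customer_hk)
SELECT
    MD5(UPPER(TRIM(h.customer_number)) || '||' || TO_CHAR(sn.snapshot_date, 'YYYY-MM-DD')),
    h.customer_number,
    sn.snapshot_date,
    h.hub_customer_hk
FROM hub_customer h
CROSS JOIN snapshot_dates sn;
```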

Connecting Facts to Type 2 Dimensions

The remaining challenge is on the fact side. Bridge Tables and Non-Historized Links — the typical foundations for fact entities — contain Type 1 hash keys from Hubs and Links, not Type 2 Dimension Hash Keys. So how does a fact row know which Type 2 dimension member to reference?

The solution is a join through the PIT Table’s alternate key inside the fact view. A Bridge Table typically contains the Type 1 hash key from the relevant Hub and a snapshot date. Those two values together form the alternate key of the PIT Table. Inside the fact view, you join the Bridge Table to the PIT Table using the hash key and snapshot date, retrieve the Dimension Hash Key from the PIT Table’s primary key, and surface that as the dimension reference in the fact entity.

The result: the fact entity contains a single column — the Dimension Hash Key — that points to exactly one Type 2 dimension member. The dashboard tool and end users never need to know how it was derived. The join logic is handled in the view layer, the keys match between fact and dimension, and the relationship resolves cleanly. This is the preferred approach rather than exposing a composite key (hash key plus snapshot date) from the fact side, which would complicate the dimensional model unnecessarily.
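
A hedged sketch of that fact view (names illustrative): the Bridge supplies the Type 1 hash key and the snapshot date, the join resolves them against the PIT, and only the Dimension Hash Key is surfaced.

```sql
CREATE OR REPLACE VIEW fact_orders AS
SELECT
    p.dim_customer_hk,            -- single-column reference to the Type 2 dimension
    b.order_number,
    b.order_amount,
    b.snapshot_date
FROM bridge_orders b
JOIN pit_customer p
  ON  p.hub_customer_hk = b.hub_customer_hk   -- Type 1 hash key from the Bridge
  AND p.snapshot_date   = b.snapshot_date;    -- together: the PIT's alternate key
```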

For teams using datavault4dbt premium, PIT Table generation and the Dimension Hash Key pattern are handled through the automation framework, which significantly reduces the manual effort involved in implementing this correctly at scale.

Putting It Together: Key Decisions for Dimension Keys

To summarize the decision framework: for Type 0 and Type 1 dimensions, use the Type 1 hash keys from Hubs and Links directly — they’re already available throughout the model and work cleanly in distributed environments. If storage is a concern, convert to binary hash values in the view layer before considering sequences. If sequences are genuinely required, generate them in a Computed Satellite in the Business Vault rather than embedding them in Hub or Link structures.

For Type 2 dimensions, use the Dimension Hash Key from the PIT Table as the primary key of the dimension. Connect facts to Type 2 dimensions by joining the Bridge Table or Link to the PIT Table’s alternate key inside the fact view, surfacing the Dimension Hash Key as the dimension reference. This keeps the dimensional model clean, the keys stable, and the join logic encapsulated where it belongs.

To go deeper on PIT Tables, dimension modeling, and the full Data Vault delivery layer, explore our Data Vault certification program. And for a concise introduction to the core concepts, the Data Vault Handbook is available as a free physical copy or ebook.

Watch the Video

Using BEAM to Accelerate Data Vault Implementation

BEAM — Business Event Analysis and Modeling — has been around for a long time, but it doesn’t come up often in Data Vault conversations. That’s a missed opportunity, because the two methodologies are more aligned than most practitioners realize. This post explores how BEAM and Data Vault complement each other, where BEAM fits in the project timeline, and why using BEAM as a starting point can make your Data Vault modeling faster, more business-aligned, and easier to communicate across teams.

BEAM and Data Vault: A Natural Alignment

BEAM is a business modeling methodology focused on understanding and documenting what actually happens in an organization. Rather than starting from data structures or technical schemas, BEAM starts from business events: a customer places an order, a payment is processed, a product is shipped. Each event is analyzed through what BEAM calls the 7 Ws — who, what, when, where, why, how, and how many or how much. The goal is a complete, business-driven understanding of the processes, entities, and relationships that drive the organization.

When you lay that alongside the core concepts of Data Vault 2.0, the structural similarities are hard to miss. Data Vault models three fundamental things: business keys (captured in Hubs), relationships between business entities (captured in Links), and descriptive context (captured in Satellites). BEAM produces exactly those three things — business entities, relationships, and context — expressed in business language rather than technical schema.

The mapping is direct: BEAM entities become Hubs. BEAM relationships and events become Links. BEAM descriptive context becomes Satellite payloads. The conceptual model that BEAM produces translates naturally into the physical Data Vault model that will implement it.

Where BEAM Fits in the Project Lifecycle

BEAM typically happens before the data warehouse work begins — it’s a business analysis and modeling activity, not a technical one. Teams use it to answer the foundational questions: what processes exist in the business, what events drive those processes, what entities are involved, and how are they related? This is exactly the kind of understanding that Data Vault modeling requires, and it’s often the hardest part of starting a new implementation.

Without this upfront business understanding, Data Vault projects tend to become purely data-driven: modelers look at source tables, identify columns, and build Hubs and Satellites based on what the data looks like rather than what the business actually means. The result is technically valid but often misses the business semantics — relationships that should be Links end up embedded in Satellites, business concepts that deserve their own Hub get collapsed into another entity, or important events go unmodeled because they weren’t visible in the source data at first glance.

A BEAM model built with stakeholders from across the business gives the Data Vault team a map before they start navigating. It surfaces hidden relationships, clarifies which entities are truly distinct business concepts, and creates a shared vocabulary between business users and technical implementers. For teams building an enterprise data warehouse, that shared vocabulary is often as valuable as the model itself.

Translating BEAM to Data Vault: What to Watch For

The translation from BEAM to Data Vault is not mechanical. A one-to-one mapping from a BEAM model to a Data Vault schema without looking at the actual source data will create problems. Business models describe how things should work; source data reflects how things actually work — and those two realities frequently diverge.

A BEAM model might show a clean customer-order-product event with well-defined identifiers. The source data might deliver that same event across three systems with different keys, inconsistent structures, and occasional nulls where the business model assumed complete data. The BEAM model is the target to aim for; the source data is the reality to model from. Both perspectives are necessary.

The practical approach is to use the BEAM model as a starting point and then validate it against the actual data. Does the business key identified in the BEAM model exist in the source? Is it unique? Are the relationships the BEAM model describes actually present as foreign keys, or do they need to be inferred? Does the granularity of the source data match the granularity of the BEAM event? These questions require looking at real data, not just the business model.

This is also where tools like datavault4dbt become relevant — once the BEAM-to-Data Vault translation is validated against the source data, automation tools can significantly accelerate the physical implementation, turning a well-defined model into deployable code much faster than manual development.

BEAM as a Bridge Between Business and IT

One of the persistent challenges in data warehouse projects is the gap between what business stakeholders need and what technical teams build. Business users describe their world in terms of events, customers, products, and transactions. Technical teams describe it in tables, columns, joins, and load patterns. These vocabularies don’t naturally translate, and the gap is where requirements get lost.

BEAM and Data Vault together help close that gap. BEAM produces a model that business users can understand and validate — it speaks their language. Data Vault implements that model in a way that is technically rigorous, scalable, and auditable. When both sides can see their perspective reflected in the same project, alignment improves and the risk of building something technically correct but business-irrelevant decreases.

The 7 Ws framework that BEAM uses to analyze events also maps well to the questions a Data Vault modeler asks when building Links: who are the participants in this relationship, what happened, when, where, and under what conditions? These aren’t just modeling questions — they’re the questions that produce a model business users recognize as a reflection of their actual processes.

Practical Takeaways

BEAM and Data Vault are not competing methodologies — they operate at different levels of the project. BEAM works at the business understanding level, producing a clear picture of events, entities, and relationships from the business perspective. Data Vault works at the technical implementation level, structuring that understanding into a scalable, auditable physical data model.

Used together, they create a stronger foundation than either provides alone. BEAM accelerates the modeling phase by giving the Data Vault team a validated business context to work from. Data Vault gives the BEAM model a rigorous technical home. The combination shortens the distance between business requirements and implemented data structures, reduces rework caused by misunderstood requirements, and produces a model that both sides of the organization can engage with.

If you’re starting a new Data Vault implementation or looking to improve alignment between your business and technical teams, considering BEAM as part of your discovery and modeling process is worth the investment. And to go deeper on Data Vault modeling patterns — including how to translate business concepts into Hubs, Links, and Satellites — our Data Vault 2.1 Training & Certification covers the full methodology. The Data Vault Handbook is also available as a free physical copy or ebook for a solid introduction to the core concepts.

Watch the Video

Same-as-Links: Enterprise-Wide Deduplication Across Multiple Sources

Same-as-Links: Enterprise-Wide Deduplication Across Multiple Sources

One of the more powerful but nuanced constructs in Data Vault is the Same-as-Link (SAL). Two questions came in recently that get at the heart of how SALs work across source systems: can a Same-as-Link have multiple sources, and can it span keys from different source systems? The answers differ depending on whether you’re working in the Raw Data Vault or the Business Vault — and understanding why reveals something fundamental about how Data Vault handles enterprise-wide deduplication and integration.



Same-as-Links and Multiple Sources in the Raw Data Vault

The first question — can a Same-as-Link have multiple sources — is straightforward. Like any Link in the Raw Data Vault, a SAL can receive records from multiple source systems. Hubs consolidate business keys from different sources into the same entity, and Links do the same for relationships. As long as the relationship has the same semantic meaning and the same granularity across those sources, loading them into the same Link is valid and correct. So yes, a SAL in the Raw Data Vault can have multiple source systems contributing records to it.

The second question is more nuanced: can a SAL span keys from multiple sources — meaning one Hub reference on one side of the relationship comes from System A, and the other comes from System B?

In the Raw Data Vault, the answer is generally no — with one important exception. A core principle of Raw Data Vault loading is that each row comes from exactly one source system. Loading a single row that requires joining data from two independent source systems introduces a loading dependency: you have to wait for System A before you can load data from System B. That’s precisely the kind of tight coupling the Raw Data Vault is designed to avoid. Independent source systems should load independently.

The exception is when a single source system already knows both business keys. An ERP system, for example, might reference customers by a customer number that originates in a CRM system. The ERP system carries that key as a known reference — it’s available in a single source record without requiring a cross-system join at load time. In that case, a SAL row sourced from the ERP system can legitimately reference a business key that conceptually originates elsewhere. The single-source-per-row rule still holds; the integration happened upstream, inside the source system itself.
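
A minimal loading sketch of this exception — assuming a hypothetical ERP staging table stg_erp_customer whose records carry both the ERP's own key and the CRM customer number, and a Same-as-Link sal_customer whose structure is sketched further below:

INSERT INTO sal_customer (hk_sal_customer, hk_customer_master, hk_customer_duplicate, ldts, rsrc)
SELECT
    MD5(UPPER(TRIM(crm_customer_number)) || '|' || UPPER(TRIM(erp_customer_id))) AS hk_sal_customer,
    MD5(UPPER(TRIM(crm_customer_number))) AS hk_customer_master,     -- CRM key, carried inside the ERP record
    MD5(UPPER(TRIM(erp_customer_id)))     AS hk_customer_duplicate,  -- the ERP system's own key
    ldts,
    rsrc                                   -- single source per row: the ERP system
FROM stg_erp_customer
WHERE crm_customer_number IS NOT NULL;     -- only rows where the ERP record carries the CRM reference
-- (the usual insert-only check against already-loaded SAL rows is omitted for brevity)

Every column comes from one staged ERP record — no cross-system join, no loading dependency.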

Same-as-Links in the Business Vault: Cross-Source Deduplication

In the Business Vault, the picture is quite different — and this is where SALs really show their value. When two independent source systems use completely different, unrelated business keys for what is actually the same real-world entity, there’s no source-level relationship to load. The Raw Data Vault captures both sets of keys in the same Hub (since they represent the same business concept), but there’s nothing in the source data to connect them.

This is where calculated Same-as-Links come in. Using descriptive data from both systems — names, addresses, contact details — fuzzy matching logic can identify that business key A from System A and business key B from System B refer to the same entity. That determination is a business rule. It belongs in the Business Vault. The result is a SAL entry that spans two business keys from completely independent source systems, calculated from the data rather than loaded from any single source.
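
A minimal sketch of such a calculation — assuming hypothetical current-state views sat_crm_customer_current and sat_erp_customer_current, and a Jaro-Winkler similarity function, which most platforms offer under some name (JAROWINKLER_SIMILARITY on Snowflake, returning 0 to 100). The blocking criterion and the threshold of 95 are illustrative choices, not fixed rules:

INSERT INTO bv_sal_customer (hk_customer_master, hk_customer_duplicate, match_score, ldts, rsrc)
SELECT
    crm.hk_customer,
    erp.hk_customer,
    JAROWINKLER_SIMILARITY(crm.full_name, erp.full_name) AS match_score,
    CURRENT_TIMESTAMP,
    'BUSINESS_RULE.CUSTOMER_MATCH_V1'       -- record source names the rule, not a source system
FROM sat_crm_customer_current AS crm
JOIN sat_erp_customer_current AS erp
  ON crm.postal_code = erp.postal_code      -- cheap blocking criterion to limit comparisons
WHERE crm.hk_customer <> erp.hk_customer
  AND JAROWINKLER_SIMILARITY(crm.full_name, erp.full_name) >= 95;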

This is one of the primary use cases for Same-as-Links: not just deduplicating records within a single source system, but integrating and deduplicating entities across the enterprise. Two CRM systems, two customer databases, two product catalogs — wherever the same real-world object appears under different identifiers in different systems, a Business Vault SAL can establish the connection and enable unified reporting and analysis across all of them.

For organizations dealing with complex multi-source environments, this kind of cross-system entity resolution is one of the most tangible business value deliverables a Data Vault implementation can produce. If you’re building or evaluating an enterprise data warehouse, the SAL pattern is worth understanding deeply — it’s the mechanism that turns a collection of source-aligned Hubs into a genuinely integrated enterprise model.

Why the Raw and Business Vault Distinction Matters Here

The contrast between how SALs work in the Raw Data Vault versus the Business Vault illustrates a broader principle that runs through all of Data Vault 2.0 design: the Raw Data Vault captures what the sources deliver, as they deliver it, without interpretation. The Business Vault is where judgment, calculation, and business logic are applied.

Fuzzy matching is business logic. Deciding that two records represent the same entity is a business decision. Those decisions belong in the Business Vault — not because the Raw Data Vault can’t technically store the result, but because embedding that logic at the raw layer makes it invisible, untestable, and hard to change when the matching rules evolve.

By keeping the SAL calculation in the Business Vault, you get a clear audit trail of how the deduplication was performed, the ability to update matching logic without reloading source data, and a separation between “what the source said” and “what we believe to be true across sources.” That separation is one of the most operationally valuable properties of a well-structured Data Vault.

Practical Implications for Modeling

When modeling SALs in practice, a few things are worth keeping in mind. In the Raw Data Vault, SALs are appropriate when a single source system provides an explicit deduplication or matching relationship — a master data management export, a merge table, a golden record mapping from a source MDM system. The loading process remains clean and dependency-free.

In the Business Vault, SALs are the right tool when the matching logic needs to be calculated — whether through exact key matching across systems, probabilistic matching, fuzzy string comparison, or any other form of entity resolution. The SAL lives in the Business Vault, references the appropriate Hub twice (master and duplicate), and is populated by whatever calculation or mapping process produces the match.

In both cases, the hash keys in the SAL reference the same Hub, since by definition the master and the duplicate represent the same type of business object. This is what makes the SAL structurally elegant: it reuses existing Hub infrastructure to express an enterprise-wide identity resolution without requiring new structural entities.
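
Structurally, that double reference is all there is to it. A minimal sketch, with illustrative names matching the loading example above:

CREATE TABLE sal_customer (
    hk_sal_customer       CHAR(32)     NOT NULL PRIMARY KEY,  -- hash over both referenced keys
    hk_customer_master    CHAR(32)     NOT NULL REFERENCES hub_customer (hk_customer),
    hk_customer_duplicate CHAR(32)     NOT NULL REFERENCES hub_customer (hk_customer),
    ldts                  TIMESTAMP    NOT NULL,
    rsrc                  VARCHAR(100) NOT NULL
);

Both foreign keys point at the same Hub — one row simply states that this key and that key identify the same customer.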

To go deeper on Same-as-Links, Business Vault patterns, and enterprise integration strategies in Data Vault, explore our Data Vault 2.1 Training & Certification. And for a concise introduction to the full methodology, the Data Vault Handbook is available as a free physical copy or ebook.

Watch the Video

Set-Based Multi-Active Satellite Derived From Record-Level Multi-Active Satellite

Multi-Active Satellites: Handling Delta Loads and Set-Based Derivation

A detailed modeling question came in recently about deriving a set-based Multi-Active Satellite from a record-level Multi-Active Satellite, and whether using the resulting Business Satellite as input for a parent PIT Table is valid Data Vault practice. The answer involves a few clarifications on terminology and a practical approach to delta loading that makes the Business Vault layer largely unnecessary for this use case. This post breaks it down.



Multi-Active Satellites: The Full Load vs. Delta Load Problem

A Multi-Active Satellite captures multiple active records per business key at the same point in time — phone numbers are the classic example. A person can have a home number, a mobile number, and a work number all active simultaneously. In the Satellite, each row is uniquely identified by the hash key, the Load Date Timestamp, and the Multi-Active Key (in this case, the phone type).

The Hashdiff in a Multi-Active Satellite is calculated across the entire group — all active records for that business key at that load date — not per individual row. This means when any record in the group changes, the Hashdiff changes for the whole group, and the entire group is re-inserted with a new Load Date Timestamp. The old group is virtually end-dated. This works cleanly when you receive full loads: every batch contains all active records, so you always have the complete group to work with.
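
A common implementation of the group-level Hashdiff — sketched here with LISTAGG, which most warehouse platforms provide in some form, over a hypothetical staging table stg_phone_numbers — concatenates the group’s payload in a deterministic order before hashing:

SELECT
    hk_contact,
    MD5(
        LISTAGG(COALESCE(phone_type, '') || '|' || COALESCE(phone_number, ''), '||')
            WITHIN GROUP (ORDER BY phone_type)  -- deterministic order is essential for stable hashes
    ) AS group_hashdiff
FROM stg_phone_numbers   -- assumed to hold exactly one load batch
GROUP BY hk_contact;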

The challenge arises with delta loads. If the source only sends what changed — say, a new work number is added, but the home and mobile numbers are not re-sent because they didn’t change — you can’t calculate the correct group-level Hashdiff from the incoming batch alone. The group is incomplete.

Reconstructing the Full Group from Delta Loads in the Raw Data Vault

The solution is to reconstruct the full Multi-Active group before loading, without moving this logic into the Business Vault. The approach is straightforward: derive the most recent Multi-Active group from the existing Satellite, combine it with the incoming delta records, and use the resulting complete set to calculate the Hashdiff and load the Satellite as if it were a full batch for that group.

In practice, a staging table acts as the assembly point. The latest group from the Satellite is pulled into staging alongside the incoming delta. Together, they form the complete current group. From there, the standard Multi-Active Satellite loading pattern applies — the Hashdiff is calculated over the full group, a new Load Date Timestamp is assigned, and all records in the group are inserted together.
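
A sketch of that assembly step — assuming a hypothetical delta table stg_phone_delta and an existing Satellite sat_contact_phone with the Multi-Active Key phone_type; QUALIFY is Snowflake-style syntax, and a subquery filter works elsewhere:

WITH latest_group AS (
    -- the most recent complete group per hash key, from the Satellite
    SELECT s.hk_contact, s.phone_type, s.phone_number
    FROM sat_contact_phone AS s
    WHERE s.ldts = (SELECT MAX(i.ldts)
                    FROM sat_contact_phone AS i
                    WHERE i.hk_contact = s.hk_contact)
),
combined AS (
    SELECT hk_contact, phone_type, phone_number, 1 AS priority
    FROM stg_phone_delta          -- incoming delta: new or changed records win
    UNION ALL
    SELECT hk_contact, phone_type, phone_number, 2 AS priority
    FROM latest_group             -- stored group fills in the unchanged records
)
SELECT hk_contact, phone_type, phone_number
FROM combined
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY hk_contact, phone_type   -- the Multi-Active Key decides the winner
    ORDER BY priority
) = 1;

Note that this assumes the delta carries additions and changes only; records deleted at the source need a separate deletion signal before they can be dropped from the reconstructed group.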

This approach handles delta loads cleanly in the Raw Data Vault, which means there’s no need for a Business PIT Table or a Business Satellite just to reconstruct the full set. The reconstruction happens at load time, not at query time.

Using the PIT Table to Manage Granularity

The second part of the question was about reducing the Multi-Active group to a single record per business key per Load Date Timestamp — storing that reduced result in a Business Satellite and using it as input for the parent PIT Table.

This is valid, but there’s a lighter alternative worth considering: handle the granularity reduction directly in the PIT Table rather than creating a dedicated Business Satellite for it.

A standard PIT Table references a Multi-Active Satellite via the hash key and Load Date Timestamp, which points to the entire group. If you want the full group available for querying, this is all you need — the PIT gives you the reference, and the join returns all active records for that timestamp.

If you only want one specific record from the group — say, just the mobile number — you add the Multi-Active Key as an additional column in the PIT Table. The PIT row then carries the hash key, the Load Date Timestamp, and the specific Multi-Active Key value you want. The join returns exactly one record. Selecting which Multi-Active record to surface is business logic, and the PIT Table is a clean place to encode it without materializing an intermediate Business Satellite.

If you need to surface multiple specific records — home number and mobile number separately — you add additional column sets to the PIT Table, one per record type. Each set carries its own hash key, Load Date Timestamp, and Multi-Active Key reference. This keeps everything in one structure and avoids unnecessary materialization.
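
A minimal sketch of that layout — table and column names hypothetical, ghost-record handling omitted — with one column set each for the mobile and home numbers:

CREATE TABLE pit_contact (
    hk_contact        CHAR(32)    NOT NULL,
    snapshot_date     DATE        NOT NULL,
    -- first column set: the mobile number
    mobile_sat_ldts   TIMESTAMP   NOT NULL,
    mobile_phone_type VARCHAR(10) NOT NULL,   -- Multi-Active Key value, e.g. 'MOBILE'
    -- second column set: the home number
    home_sat_ldts     TIMESTAMP   NOT NULL,
    home_phone_type   VARCHAR(10) NOT NULL,   -- e.g. 'HOME'
    PRIMARY KEY (hk_contact, snapshot_date)
);

-- Each equi-join returns exactly one record per column set
SELECT p.hk_contact,
       p.snapshot_date,
       m.phone_number AS mobile_number,
       h.phone_number AS home_number
FROM pit_contact AS p
JOIN sat_contact_phone AS m
  ON  m.hk_contact = p.hk_contact
  AND m.ldts       = p.mobile_sat_ldts
  AND m.phone_type = p.mobile_phone_type
JOIN sat_contact_phone AS h
  ON  h.hk_contact = p.hk_contact
  AND h.ldts       = p.home_sat_ldts
  AND h.phone_type = p.home_phone_type;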

When a Business Satellite Does Make Sense

A Business Satellite for this purpose isn’t wrong — it’s just not always necessary. If the reduced, single-record-per-key result is consumed by multiple downstream processes and materializing it improves performance or simplifies maintenance, building a Business Satellite is a reasonable choice. But if the only goal is to filter down to one record for a specific downstream view, doing it in the PIT Table is simpler and avoids creating an entity whose sole purpose is granularity reduction.

The key principle: keep the Raw Data Vault responsible for capturing the full, accurate group, and make granularity and selection decisions in the PIT Table or Business Vault based on what downstream consumption actually requires.

To go deeper on Multi-Active Satellites, PIT Table design, and the full Data Vault methodology, explore our Data Vault 2.1 Training & Certification. The Data Vault Handbook is also available as a free physical copy or ebook.

Watch the Video
