Skip to main content
search
0
All Posts By

Michael Olschimke

Michael Olschimke is the Co-Founder and CEO of Scalefree and a "Data Vault 2.0 Pioneer" with over 20 years of IT experience. A Fulbright scholar and co-author of Building a Scalable Data Warehouse with Data Vault 2.0, Michael is a global authority on AI, Big Data, and scalable Lakehouse design across sectors like banking, automotive, and state security.

Refactoring a Data Vault Model

Refactoring a Data Vault Model: Options, Risks, and Best Practices

Source systems change. Columns get added or removed, structures evolve, and sometimes entire business key definitions are overhauled. When that happens to a system feeding your Data Vault, the question isn’t just technical — it’s strategic. Do you modify what you have, or do you build alongside it? This post walks through the main scenarios and the practical options available for each, along with a clear recommendation on where to draw the line between low-risk and high-risk approaches.



When a Column Changes: The Simple Case

The least disruptive scenario is a column-level change in a source table — a new attribute appears, or an existing one disappears. For this, you have a few options depending on your project constraints.

Option 1 — Modify the existing Satellite. If your project allows structural changes, you can add the new column to the existing Satellite with an ALTER TABLE statement. Historical rows will show null values for the new column before its introduction, and a log entry can record exactly when the column was added. Removing a column from a Satellite is generally not done — historical data lives in that column, and dropping it means losing that history.

Option 2 — Create a new Satellite. If you’re not allowed to touch existing structures, or simply prefer not to, you create a new Satellite to capture the new or changed attributes. This Satellite gets added to the relevant PIT Tables. The trade-off is an additional join in your queries, but the existing Satellite and its data remain completely untouched.

Option 3 — Close and replace the Satellite. A slightly more thorough approach: close the existing Satellite (stop loading it) and create a brand new one that reflects the updated structure. The new Satellite starts with a full load from the source, which means some data overlap with the old Satellite. This is handled cleanly at query time using an IIF statement — prefer data from the new Satellite where it exists, fall back to the old one for earlier history. The redundancy is not a problem; it resolves itself during query execution.

The bigger the structural change, the more this third option makes sense. If a source table is overhauled dramatically — many columns removed, many added — creating a fresh Satellite to capture the new shape is often the cleanest path forward.

When the Business Key Changes: The Complex Case

Column-level changes are manageable. Business key changes are where things get genuinely complex — and where the risk calculus shifts significantly.

A business key in Data Vault must be unique over time and across the enterprise. If the current key no longer meets that standard — say, a customer number that was once reliable is now duplicated across regions — you have a structural problem that can cascade through the model. Changing the business key means potentially changing the Hub itself, which in turn affects every Link that references that Hub, and every Satellite attached to those Hubs and Links. The impact can be wide.

At this point, you have two main strategic choices.

Option A: Keep Old and New Structures Separate

The lower-risk approach — and the one most commonly recommended — is to leave the historical Raw Data Vault exactly as it is and build a new Raw Data Vault to capture data under the new structure and key definition.

The reasoning is rooted in a core Data Vault principle: the Raw Data Vault should model data close to how the source systems use it. The business had one structure in the past and a different one going forward. That’s two different realities, and it makes sense to model them separately.

The two Raw Data Vaults then get reconciled in the Business Vault, where business logic handles the combination of old and new data. This might be straightforward — a simple union — or it might be complex, especially if field definitions have changed. For example, if an address field was previously structured (street, house number, zip, city) and is now a free-text memo field that may contain addresses from multiple countries, the logic to normalize and combine that data belongs in the Business Vault. That’s exactly what the Business Vault is designed for.

This approach carries the lowest risk. Historical data is never touched. Nothing can go wrong with data that hasn’t been moved.

Option B: Refactor the Raw Data Vault

The more ambitious option is to refactor the existing Raw Data Vault into a new version — modifying Hubs, Links, and Satellites to reflect the new structure — and then reconstruct historical data within that new model.

This is technically possible, but it comes with a hard requirement: you must be able to reconstruct every historical delivery from the new structure without any data loss. In Data Vault practice, this is validated through what’s known as the “Jedi test” — deriving the old structures from the new ones and verifying the output matches the original data exactly. If the test passes, you can safely drop the old tables and replace them with views that expose the old structure as a backward-compatible interface.

Those views give existing queries time to continue working while users migrate. But they’re a transitional tool, not a permanent one. You’ll want to communicate a clear deprecation timeline — 90 or 180 days is typical — and give users explicit guidance on how to update their queries before the views are dropped.

A word of warning: when those views eventually get dropped, expect complaints. Not because the communication failed, but because, as a rule, nobody reads emails. Plan for it.

Handling Non-Unique Business Keys in the Interim

If a business key loses its uniqueness mid-project and a full refactoring effort will take several sprints, there’s a practical interim solution: a Record Source Tracking Satellite. This allows you to continue working with the existing model while the refactoring is planned and executed in the background. It buys time without requiring an immediate structural overhaul, and it keeps the data pipeline running cleanly during the transition.

Communication: The Overlooked Part of Refactoring

Technical decisions aside, refactoring a Data Vault model is also an organizational event. Users who query your data warehouse need to know when structures change — whether that’s a modified Satellite, a new Hub, or a deprecated view that will be removed in three months.

A simple data warehouse changelog or newsletter goes a long way. When you modify existing entities, inform users. When you introduce views as backward-compatible bridges, tell them the timeline. When the views are going away, tell them what to query instead. This isn’t just good practice — it’s the difference between a smooth migration and a flood of support tickets.

The Bottom Line on Refactoring

Data Vault is built to absorb change, and it does so gracefully at the column level. Descriptive attribute changes — new columns, removed columns, restructured Satellites — are handled with well-defined options and minimal risk. The real challenge arrives when business keys change, because the ripple effects can touch Hubs, Links, and Satellites across the model.

In those cases, the recommended approach is to preserve historical data in the original Raw Data Vault and build a new one for the new structure, using the Business Vault as the reconciliation layer. It’s the lowest-risk path, it keeps your historical data intact, and it puts complex transformation logic exactly where it belongs.

To learn more about Data Vault modeling principles, refactoring strategies, and Business Vault patterns in depth, explore our Data Vault 2.1 Training & Certification. And if you’re new to the methodology, the free Data Vault handbook is a great starting point — available as a hard copy or digital download.

Watch the Video

Source of Data for Business Vault Entities

Business Vault Entities: Using Cross-Satellite Joins for Business Logic

One of the questions that comes up regularly among Data Vault practitioners — especially those working on the Business Vault layer — is whether a Computed Satellite can draw its input data from multiple, unrelated Satellites across different Hubs. It sounds like it might break the rules. It doesn’t. In fact, it’s a very common and entirely valid pattern once you understand what actually constrains a Business Vault Satellite.

Let’s break it down.



Business Vault Computed Satellites: What Are the Real Rules?

When practitioners first encounter the Business Vault, a natural assumption is that a Computed Satellite must only use data from Satellites attached to its own parent Hub. The logic seems sound — keep things connected, keep things clean. But this is not actually a constraint in Data Vault. It’s a misconception.

The real rule for a Computed Satellite is much simpler: what do your results describe? Whatever entity your calculated output describes — that’s your parent. That’s what the Satellite attaches to.

For example, if you’re calculating customer lifetime value using data pulled from order records, transaction histories, product tables, and behavioral data — the result still describes the customer. So your Computed Satellite attaches to the Customer Hub. The source of the input data is largely irrelevant to the attachment decision. What matters is the semantic meaning of the output.

Cross-Satellite Joins: Yes, You Can

So back to the original question: can you build a Business Vault Satellite that joins Satellites from two completely different subjects — even if there’s no Link connecting their parent Hubs?

The answer is yes.

Your business rule query can pull data from anywhere in the model. You can join Satellites on the same Hub, traverse a Link into another Hub and pull Satellite data from there, use business key relationships to establish a match, or even join across entirely disconnected model areas if a descriptive relationship exists in the data itself.

The absence of a Link between two Hubs doesn’t prevent you from joining their Satellite data in a query. If there’s a way to match records — even informally through a shared descriptive attribute in a Satellite — you can use that to establish the relationship in your query logic and proceed.

When to Consider a Business Vault Link Instead

That said, there are situations where formalizing the relationship makes sense. If two Hubs have no Link between them but your business logic consistently requires joining their data, it may be worth creating a Business Link or an Exploration Link in the Business Vault to make that relationship explicit and reusable.

This approach can improve query efficiency and make the model easier to understand for future developers. However, it also adds complexity — a Business Link typically requires an Effectivity Satellite to track the validity of the relationship over time, which may or may not be worth the overhead depending on your use case.

The pragmatic answer: if it’s a one-off calculation and the join works, just write the query. If the relationship is foundational to multiple business rules, formalize it with a Link. Performance and maintainability should guide that decision, not a strict modeling rule.

The Only Real Constraints

In practice, there are two genuine constraints on what you can use as input for a Business Vault Computed Satellite:

  • Query performance: Joining across disconnected parts of your model may not always be efficient. If the join is expensive and runs frequently, optimization — through a Business Link, materialization, or other techniques — is worth considering. But it’s an engineering concern, not a modeling violation.
  • Data Governance: Are you actually allowed to combine those datasets? In organizations with strict data ownership boundaries, combining Satellites from different subject areas or different data owners may require governance approval. That’s a policy question, not a technical one — but it’s one worth asking before you build.

Outside of those two factors, the query is yours to write however you need.

A Practical Example: Customer Lifetime Value

To make this concrete: imagine you need to calculate customer lifetime value (CLV) and attach the result as a Computed Satellite on the Customer Hub. Your inputs might include:

  • Customer profile data from a Satellite on the Customer Hub
  • Order totals from a Satellite on an Order Hub (connected via a Customer-Order Link)
  • Return history from a Satellite on a Returns Hub
  • Promotional sensitivity data from a completely separate marketing model with no direct Link to Customer

Even if the marketing model has no formal Link back to Customer, if there’s a way to match the records — say, via a shared customer identifier in a descriptive Satellite — you can join it. Your CLV business rule can consume all of this data in a single query. The result — a calculated CLV score — describes the customer, so the Computed Satellite attaches to the Customer Hub. Clean, valid, and practical.

This kind of cross-domain business logic is exactly what the Business Vault is designed for. It’s the layer where raw, integrated data gets transformed into meaningful, decision-ready information — and the flexibility to query across the model is one of its most powerful features.

Key Takeaway

When building Computed Satellites in the Business Vault, don’t let the structure of your Raw Data Vault artificially limit your business logic. The query that feeds your Satellite can draw from anywhere in the model. The only questions that matter are: what do your results describe, and is your query performant and governance-compliant?

Get those right, and the rest is just SQL.

To go deeper on Business Vault patterns, Computed Satellites, and the full Data Vault methodology, explore our Data Vault 2.1 Training & Certification — and grab the free Data Vault handbook for a concise overview of the core concepts.

Watch the Video

How to Derive Dimensions and Facts from a Data Vault Model

Patterns for Deriving Data Vault Dimensions and Facts

If you’ve gone through a Data Vault training, you know that Data Vault is a pattern-driven methodology — from loading the Raw Data Vault all the way through business logic implementation in the Business Vault. But one question that consistently comes up, especially among practitioners bridging the gap between modeling and delivery, is this: How do you actually derive Dimensions and Facts from a Data Vault model?

It’s a great question, and the good news is that just like the rest of Data Vault, there are clear, repeatable patterns for doing it well. Let’s walk through them.



The Pattern-Based Nature of Data Vault Delivery

Data Vault is not just a modeling technique — it’s an entire methodology built around patterns. There are patterns for loading the Raw Data Vault, patterns for modeling it, and patterns for implementing business logic in the Business Vault. The same is true for the information delivery layer, where you produce standard target entities like Dimensions and Facts (or 3NF entities, flat/wide schemas, and more).

Once you internalize these patterns, deriving your target entities becomes a predictable and even automatable process. That’s one of the most underappreciated strengths of Data Vault: the patterns repeat, and the more you use them, the faster and more confident you become.

Deriving Dimensions: The 80/20 Rule

Roughly 80% of your Dimensions will follow one core pattern: they are derived from a combination of a Hub, one or more Satellites, and — critically — a PIT Table (Point-In-Time Table).

The basic query pattern looks like this:

  1. Start with the PIT Table as your base
  2. Join the Hub to retrieve the business key
  3. Join the relevant Satellites for descriptive attributes
  4. Cherry-pick the attributes you need for your Dimension

The remaining 20% of Dimensions come from Links rather than Hubs. In these cases, you start with a PIT Table built on a Link, join the Link itself for hash keys and Dimension references (often called Bridge Dimensions), and then join the Satellites to describe the relationship.

In practice, this covers close to 99% of all Dimension derivations. There are always edge cases, but this two-pattern approach handles the vast majority of real-world scenarios.

Why PIT Tables? The Case for Point-In-Time

If you’re new to PIT Tables, they can look intimidating at first — a structure full of hash keys and timestamps. But once you understand the pattern, it’s actually one of the most elegant tools in the Data Vault toolkit. And here’s the key insight: in modern implementations, PIT Tables are generated by automation tooling, so you rarely build them by hand. You learn how to use them, not how to construct them from scratch.

PIT Tables are essential for several reasons:

Supporting All SCD Types

A PIT Table provides the foundation for Slowly Changing Dimension (SCD) types 0, 1, and 2 — all from the same structure. The snapshot day timestamp embedded in the PIT Table allows you to control the grain and timeline of your Dimension flexibly.

Delivering Stable Reports

Here’s a scenario every analyst will recognize: you pull a report at 9:00 AM and expect it to remain consistent throughout the day. But if your Dimension view is simply joining the latest delta from a Satellite directly, any incoming data update after 9:00 AM will silently change your report’s results.

A PIT Table solves this by identifying the latest valid delta for all hash keys at a specific snapshot timestamp — say, 9:00 AM. This decouples your information delivery from live, incoming data pipelines. Users get stable, predictable reports. The data warehouse and the source pipelines can operate independently without stepping on each other.

This is why using a PIT Table, though it may appear more complex at first, is actually simpler and more aligned with what business users actually need: consistent data at a known point in time, not real-time volatility bleeding into their dashboards.

Decoupling Delivery from Ingestion

Beyond stability, the PIT Table serves a broader architectural purpose: it creates a clean separation between the data loading process and the data delivery process. This is a hallmark of mature data warehouse design — and Data Vault makes it a first-class pattern rather than an afterthought.

Want to deepen your understanding of PIT Tables, Satellites, and the complete Data Vault modeling toolkit? Explore our Data Vault 2.1 Training & Certification to get hands-on with these patterns.

Deriving Facts: Links, Non-Historized Links, and Bridge Tables

Facts are a different story. Where Dimensions describe who and what, Facts capture events, transactions, and messages — think banking transactions, call records, IoT sensor readings, or logistics movements.

For Facts, your primary sources in a Data Vault model are:

  • Non-Historized Links (NH Links) — ideal for transactional events that don’t change
  • Dependent Child Links — for subordinate, granular transactional data
  • Bridge Tables — for pre-aggregated or grain-shifted data
  • Joined Link combinations — for cross-domain Fact derivations

If your underlying Link structure already matches the grain you need for your Fact, you can build a simple view on top of it and you’re done. Clean and efficient.

Grain Management: The Core Challenge of Fact Derivation

Things get more interesting when your source data doesn’t naturally sit at the right grain for your target Fact. This is where Bridge Tables and grain shift operations come into play.

Coarsening Grain with GROUP BY

Imagine an NH Link that captures real-time vehicle positions — five messages per second. Your Fact table only needs one position per hour. You need to reduce the granularity.

You do this with a GROUP BY operation on a subset of the Hub references within the Link. The fewer Hub references you group by, the coarser your grain. The result of this aggregation is materialized into a Bridge Table, which then becomes the base for your Fact view.

Refining Grain by Joining Links

The inverse is also possible. If you need a finer grain than any single Link provides, you join multiple Links on their shared Hub references. For example, if one Link connects Customer and Store, and another connects Store and Product, joining them on the shared Store reference produces a result with three Hub references — a finer granularity than either Link alone.

Other Grain Shift Operations

Beyond GROUP BY and joins, there are other grain-shifting operations worth knowing:

  • DISTINCT operations — to remove duplicate rows at a desired level of aggregation
  • JSON array expansion — splitting a JSON array into multiple rows is itself a grain shift, making rows finer

All of these operations can produce results that are materialized in Bridge Tables for performance and reuse, which then serve as the clean input layer for your Fact views.

PIT Tables for Snapshot Facts

There is one important exception to the “Links and Bridge Tables for Facts” rule: snapshot-based Facts. For periodic snapshots — daily inventory levels, end-of-month balances, weekly pipeline status — a PIT Table can serve as an excellent base for your Fact entity.

The catch is that PIT Tables are sometimes pruned over time to keep them manageable. If you need to preserve a long history of snapshots for your Fact, consider using a dedicated Bridge Table to maintain those snapshots independently from your operational PIT Tables. This gives you full control over retention and history without disrupting the PIT Table’s primary purpose.

Beyond Dimensions and Facts: Wider Applicability

The same patterns extend naturally to other target model types:

  • 3NF (Third Normal Form) entities — derived using similar logic from Hubs, Links, and Satellites
  • Flat and wide schemas — built by joining PIT Tables to Bridge Tables and then joining Satellites to pre-join all descriptive attributes into a single denormalized entity
  • Fully denormalized Fact entities — start with a Bridge Table or Link, join PITs, then join the relevant Satellites to fold all Dimension attributes directly into the Fact

This last pattern — the fully denormalized Fact — is particularly powerful for downstream consumers like machine learning pipelines or self-service analytics tools that benefit from wide, flat tables. And it follows the exact same building blocks.

Summary: The Patterns at a Glance

Target Entity Primary Source Pattern
Standard Dimension (80%) PIT + Hub + Satellites Join PIT → Hub → Satellites, select attributes
Bridge/Relationship Dimension (20%) PIT on Link + Satellites Join PIT → Link → Satellites
Transactional Fact NH Links / Dependent Child Links View on top of Link at correct grain
Aggregated Fact Bridge Table (GROUP BY result) Grain shift → materialize → view
Snapshot Fact PIT Table or Bridge Table Snapshot timestamps → Fact view
Denormalized Fact Bridge + PIT + Satellites Pre-join all descriptive attributes

Learn the Patterns, Apply Them Everywhere

One of the most satisfying aspects of working with Data Vault is that once you’ve internalized these patterns, they compound. Each new project reinforces the same building blocks: Hubs, Links, Satellites, PIT Tables, Bridge Tables. The same logic for deriving a Dimension applies to deriving a 3NF entity. The same grain management principles for Facts carry over to denormalized wide tables.

Data Vault doesn’t ask you to reinvent the wheel for every project — it gives you a set of reliable, tested patterns and asks you to apply them consistently.

If you’re ready to master these patterns in depth — including hands-on work with PIT Tables, Bridge Tables, and full information delivery models — check out our Data Vault 2.1 Certification and Training. And if you’re just getting started, there’s a free Data Vault handbook available as a concise primer covering the core concepts, pros and cons, and when to use Data Vault in your architecture.

The patterns are there. Once you learn them, they’ll follow you everywhere.

Watch the Video

How to Capture CDC Data in Data Vault Satellites

Capture CDC Data

Capturing Change Data Capture (CDC) data in Data Vault can be tricky, especially when the source mostly sends inserts but occasionally produces duplicates or deletions. Understanding how to handle these cases ensures historical accuracy and avoids data inconsistencies in your hubs and satellites.



The Scenario

Consider the following behavior of your source system:

  • Most of the time, rows are insert-only.
  • During initial load, the same row may arrive twice (once in the bulk load and once as an insert within the same batch).
  • Deleted rows may occasionally appear.

These patterns can lead to duplicates if not handled correctly. At first glance, it might look like you need a Non-Historized Link, but duplicates must still be managed properly.

Why Standard Non-Historized Links May Fail

A standard non-historized Link assumes a single row per combination of hubs. When duplicates arrive, either due to CDC or multiple inserts during initial load, the Link cannot naturally distinguish them, leading to primary key conflicts or overwritten data.

A common—but sometimes problematic—solution is adding counter rows to differentiate duplicates. However, this often requires a GROUP BY in the Information Mart, which can cause performance issues, particularly on non-columnar databases.

Recommended Approach: Capture Technical History in Satellites

Instead of modifying the Link, the recommended approach is to handle duplicates in satellites, preserving the raw source events and their arrival order.

Step 1: Use a Satellite with a Load-Date Sequence

For each incoming batch:

  • Assign the CDC load timestamp to the first row of a given parent.
  • If multiple rows for the same parent exist in the batch, increment the timestamp by a small unit (microsecond, millisecond, or nanosecond) for each subsequent row.

This creates a unique ordering of changes while preserving the technical history, without touching the original raw data.

Step 2: Maintain Historical Order

By adding a microsecond increment to the load date for each row:

  • The first row in the CDC batch gets the base timestamp.
  • The second row gets base timestamp + 1 microsecond, the third row +2 microseconds, etc.

This ensures the latest row has the highest load timestamp, which can be used to drive Point-In-Time (PIT) tables and type-1 dimension replacements.

Step 3: Preserve Batch or CDC Metadata

If your CDC source provides a batch ID or subsequence number, include it in the satellite. This allows for:

  • Tracking which records arrived together
  • Reconstructing the technical timeline of changes

If no metadata exists, the microsecond sequencing on the load date is sufficient to order the rows.

Handling Non-Historized Links with Duplicates

In rare cases, a non-historized Link may receive multiple rows for the same key combination. To handle this safely:

  • Extend the alternate key to include the load date (or other sequencing attribute) in the hash key calculation.
  • This ensures each row has a unique primary key without modifying the raw data.

Key points:

  • No need to use counter rows in the raw link.
  • Duplicates are captured and preserved for historical accuracy.
  • Aggregations in PIT or Bridge Tables can be used for reporting, ensuring performance optimization.

Optional: Bridge Tables for Performance

If your Information Mart requires grouping or deduplication and your database struggles with performance:

  • Create a Bridge Table that pre-aggregates or resolves duplicates.
  • The Bridge Table stores only the latest row (or the aggregated result) for reporting.
  • You maintain the raw satellite history in case full lineage or historical reconstruction is needed.

Summary

  • Capture all incoming CDC events in a satellite, including duplicates, without modifying the raw data.
  • Use microsecond increments on the load date to order multiple rows per parent.
  • Include CDC batch metadata if available to preserve groupings and arrival order.
  • For non-historized Links receiving multiple rows, include the load date in the hash calculation.
  • Bridge Tables or PIT tables handle reporting and aggregation efficiently, while maintaining full historical traceability.

This approach preserves auditability, ensures correct historical ordering, and avoids performance issues in the Information Mart.

Watch the Video

Data Vault Link Naming Convention

Link Naming Convention

One of the most common questions we hear from Data Vault practitioners—especially once a model grows beyond a few hubs and links—is this: how do you define a clear link naming convention, and how do you avoid getting lost in all the different link types when querying current versus historical data? It’s a very practical concern, and if you don’t address it early, your Data Vault can quickly become hard to understand, even for experienced developers.

In this article, we’ll walk through a pragmatic approach to Data Vault link naming conventions. We’ll cover how to name links in a business-friendly way, how to encode technical meaning into the name without overcomplicating it, and how naming helps users understand which links to use for historical data and which are more event- or transaction-oriented.



The First Question: How Do You Name a Link?

Let’s start with the most basic aspect of the question: how do you name a link at all? Imagine a simple business scenario. A customer walks into a store and buys a product. From a Data Vault perspective, this creates a relationship between three business keys: customer, store, and product. This relationship is represented by a link.

Now you have several naming options. You could name the link something like customer_store_product_link, purely describing the hubs it connects. Or you could give it a business-oriented name such as retail_transaction_link or sales_transaction_link.

Our strong recommendation is to prefer business-driven names whenever possible. If the relationship clearly represents a real business concept, then the link should be named after that concept. Calling it a retail transaction link immediately tells business users and developers what this link represents, without forcing them to interpret a technical list of hubs.

That said, there are cases where a business name simply doesn’t exist. Sometimes a link truly represents nothing more than a structural relationship, such as a customer being associated with a store, without a clear transactional or event-based meaning. In those cases, it is perfectly fine to fall back to a composite name like customer_store_link.

The key point here is consistency and intent. If there is a meaningful business name, use it. If not, use a descriptive composite name that clearly shows which hubs are involved.

Ordering Hub Names in Composite Links

When you do use composite names, the next question is often about ordering. Should it be customer_store or store_customer? From a pure Data Vault perspective, links are end-to-end relationships with no inherent direction. Technically, the order does not matter.

However, from a modeling and readability perspective, a consistent ordering rule helps a lot. One practical guideline is to follow the natural business hierarchy if one exists. For example, if you model an industry body and the organizations that belong to it, the industry body is conceptually “above” the organization. In that case, naming the link industrybody_organization_link feels natural and intuitive.

This approach mirrors how people think about hierarchies: from root to leaf, from higher-level concept to lower-level concept. Again, the most important part is not which rule you choose, but that you standardize it across the entire Data Vault.

Standardization Beats Perfection

No naming convention will ever be perfect. What matters far more is that everyone follows the same rules. At Scalefree, for example, we deliberately leave some freedom to developers when choosing link names, with a clear preference for business-oriented naming. At the same time, we provide documented guidelines that explain how to make those decisions consistently.

This balance allows teams to model complex domains without being blocked by overly rigid naming rules, while still keeping the model understandable and navigable.

Dealing with the Many Types of Links

Once you move beyond basic modeling, the real challenge begins: Data Vault doesn’t just have one type of link. Over time, a model accumulates several varieties, each serving a different purpose.

Some of the most common link types include standard links in the Raw Data Vault, business or exploration links in the Business Vault, non-historized links (also known as transactional links in Data Vault 1.0), hierarchical links, same-as links, dependent child links, and even computed or aggregated links used to reuse logic across multiple bridges.

Without a clear convention, these different link types quickly become indistinguishable, making it hard to know which one to use for a specific query or use case.

Prefix or Suffix? Why Suffixes Usually Win

One common design decision is whether to encode link type information as a prefix or a suffix. While prefixes like link_retailtransaction are sometimes used, we generally recommend suffixes instead.

The reason is simple: suffixes group related objects together more naturally. If everything related to a customer starts with “customer”, then the hub, its satellites, and related links appear together when browsing schemas or metadata. This makes the model easier to explore and understand.

For example, a customer hub might be named customer_h, while a satellite could be customer_data_s. Using suffixes ensures that all customer-related entities are visually grouped.

A Compact Suffix-Based Pattern

To keep naming both expressive and machine-readable, we recommend a compact suffix pattern. In this approach, every link ends with an L, indicating that it is a link. Additional single-character markers are added before the L to indicate special link types.

For example, a standard link in the Raw Data Vault simply ends with _l. A non-historized link ends with _nl. A hierarchical link ends with _hl. A same-as link ends with _sl. A dependent child link ends with _dl.

When working with the Business Vault, an additional B is added. A standard Business Vault link ends with _bl. A dependent child link in the Business Vault becomes _bdl. In this scheme, the Raw Data Vault is the default, so there is no explicit “R” marker.

This compact notation might look cryptic at first, but it has significant advantages. It keeps names short, consistent, and easy to parse automatically.

Why Short Codes Are Powerful

Using single-character indicators is not just about aesthetics. It enables powerful automation and governance capabilities. With consistent suffixes, you can use regular expressions on metadata to identify entity types automatically.

For example, you can quickly validate that all dependent child links contain the required hub references, hash keys, and record source attributes. You can also apply automated tests and checks depending on the link type, without relying on manual inspection.

This becomes especially valuable in larger Data Vault environments where hundreds or thousands of entities exist.

Putting It All Together: Naming Examples

Let’s revisit the earlier example of a relationship between customer and store. In the Raw Data Vault, a simple link could be named customer_store_l. If this relationship is modeled in the Business Vault, the name could become customer_store_bl.

If the relationship represents a business transaction and is non-historized, a name like retailtransaction_nl or retailtransaction_bnl (for the Business Vault) clearly communicates both the business meaning and the technical behavior.

These names immediately tell an experienced user what kind of data to expect and how the link should be used.

Which Links to Use for Current vs. Historical Data?

The second part of the original question is just as important: how do you know which links to use when querying current versus historical data?

The key insight here is that almost all Data Vault entities are historical by nature. Links record which relationships have ever existed during the lifetime of the data warehouse. On their own, they usually do not tell you whether a relationship is currently active.

To answer questions about “current” relationships, you typically need additional structures such as effectivity satellites or deletion indicators. These tell you whether a relationship is active, inactive, or deleted at a given point in time.

For example, if customers can change their preferred store, the link captures all store assignments that ever existed. An effectivity satellite tells you which assignment is valid right now.

Standard Links vs. Non-Historized Links

Another important distinction is between long-lived relationships and point-in-time events. Standard links are often used for relationships that exist over a period of time, such as employment, ownership, or assignments.

Non-historized links, on the other hand, are typically used for events, messages, or transactions that occur at a specific moment. A customer purchasing a product, an invoice being issued, or a sensor sending a reading are all examples of point-in-time events.

Even though they are called “non-historized,” these links can still store large volumes of historical data. The name simply reflects that the relationship itself does not persist over time—it happens at a single point in time.

Dependent child links are a special case of this pattern, often used for structures like invoice line items, where detailed data depends on a parent transaction.

Teaching the Model to Users

Finally, no naming convention works unless users understand it. A clear, consistent naming scheme allows you to teach users how to recognize link types, understand their purpose, and choose the right entities for their queries.

Once users know that “_l” means Raw Vault link, “_bl” means Business Vault link, and “_nl” indicates an event-based relationship, they can navigate even large models with confidence.

Final Thoughts

A good Data Vault link naming convention is not about memorizing rules; it’s about reducing cognitive load. Business-driven names improve readability, suffix-based patterns improve structure, and compact codes enable automation and governance.

If you invest the time to define and standardize these conventions early, you will save countless hours later—both for developers and for business users trying to understand and trust your data warehouse.

Watch the Video

Data Vault Link Effectivity

Link Effectivity

One of the most common — and most misunderstood — challenges in Data Vault modeling is how to correctly handle changing relationships. Especially when source systems insert, delete, and even reinsert the same records over time, many teams struggle to answer a simple but critical question:

“How do I reliably determine the latest valid state of a relationship?”

In this article, we will walk through exactly how to model this scenario using Links and Effectivity Satellites in Data Vault, why this pattern is essential, and how it allows you to extract both the current and historical state of relationships in a clean, auditable way.



The Scenario: A Source Table with Changing Relationships

Let’s start by restating the problem in simple terms.

You have a source table with the following structure:

  • COL_1 + COL_2 → Business Key for Hub A
  • COL_3 + COL_4 → Business Key for Hub B
  • COL_5, COL_6 → Descriptive attributes
  • ROW_OPERATION → Insert / Delete indicator

This tells us something important right away:

  • Two business keys appear in the same row
  • Therefore, there is a relationship between those business keys
  • That relationship belongs in a Link

The complexity arises because the operational system:

  • Inserts records
  • Deletes records
  • Sometimes reinserts the exact same records later

All while the actual business keys and attributes remain unchanged.

So how do we model this in a way that preserves history, supports auditing, and still lets us easily answer:

“Which relationships are valid right now?”

Why Links Alone Are Not Enough

In Data Vault, a Link represents the existence of a relationship — not its state.

Once a relationship between Hub A and Hub B exists, the Link record itself is immutable.
It simply tells us:

“At some point in time, this relationship existed.”

But the Link alone cannot tell us:

  • When the relationship became active
  • When it was removed
  • When it was reintroduced

This is where many implementations go wrong. Teams try to encode relationship state directly into the Link or manage it downstream in reporting logic. Neither approach scales, and both break Data Vault principles.

The correct solution is to model the effectivity of the relationship separately.

Enter the Effectivity Satellite

An Effectivity Satellite attached to a Link tracks the physical existence of that relationship over time.

Think of it as a timeline that answers one simple question:

“Is this relationship active or inactive at a given point in time?”

Typical Structure of a Link Effectivity Satellite

  • Link Hash Key
  • Load Date Timestamp (LDTS)
  • Record Source
  • Deletion Timestamp

The deletion timestamp is the key attribute. It defines until when a relationship is considered active.

A common and proven pattern is:

  • Active relationship → Deletion date = 8888-12-31
  • Inactive relationship → Deletion date = actual load timestamp when deletion was detected

(Using 8888-12-31 instead of 9999-12-31 avoids date overflow issues and is widely adopted in practice.)

Walking Through the Lifecycle of a Relationship

Let’s walk through the example step by step.

Day 1: Relationship Is Inserted

The source system delivers a row linking business keys A and B.

  • A Link record (A–B) is created if it does not yet exist
  • An Effectivity Satellite record is inserted
Deletion_Timestamp = 8888-12-31

This means: the relationship is active.

Day 2: Relationship Is Deleted

The source system no longer contains the A–B row.

We do not delete the Link. Instead:

  • A new Effectivity Satellite record is inserted
  • The deletion timestamp is set to the current load timestamp

Now the timeline clearly shows when the relationship ended.

Day 3: Relationship Is Reinserted

The same A–B relationship appears again in the source.

Important point:

  • The Link already exists → no new Link row
  • The Effectivity Satellite needs a new delta
Deletion_Timestamp = 8888-12-31

This marks the relationship as active again — without losing the history of the previous deletion.

Handling Updates: Why Updates Are Two Events

Another area where teams often struggle is updates.

In Data Vault terms:

An update is a deletion of the old version and an insertion of a new version.

For example, if a relationship changes from A–B to A–C:

  • A–C is inserted as a new Link relationship
  • The Effectivity Satellite marks A–C as active
  • The Effectivity Satellite marks A–B as deleted

This ensures that:

  • No ambiguity exists about which relationship is active
  • Historical reporting remains accurate
  • Auditors can trace every change

How to Query the Latest Valid State

Once modeled correctly, querying the current state becomes straightforward.

Current Active Relationships

  1. Join the Link to its Effectivity Satellite
  2. Select the latest satellite record per Link hash key
  3. Filter where Deletion_Timestamp = 8888-12-31

That’s it.

This pattern works consistently across:

  • Current-state reporting
  • Type 1 dimensions
  • Snapshot-based reporting using PIT tables

What About ROW_OPERATION and Valid-From / Valid-To?

It’s important to clearly separate concepts:

ROW_OPERATION

Indicators like Insert/Delete from the source system are:

  • Descriptive metadata
  • Useful for loading logic
  • Not the source of truth for effectivity timelines

They can be stored in a descriptive Satellite, but effectivity is driven by load detection.

Valid-From / Valid-To

Business validity dates are not deletion timelines.

  • They describe business meaning (contracts, subscriptions, agreements)
  • They belong in separate descriptive Satellites
  • They are typically applied in the Business Vault or downstream

Mixing business timelines with technical effectivity is a common modeling mistake.

Why Effectivity Satellites Are Not Optional

A final, often overlooked argument: dirty data.

What happens if:

  • A relationship is accidentally loaded
  • Used in reports
  • Then later corrected or removed

Without an Effectivity Satellite:

  • You cannot explain historical results
  • You cannot prove when data changed
  • You cannot support audit requirements

That’s why, in real-world Data Vault implementations:

  • Most standard Links have Effectivity Satellites
  • Most Hubs have them as well
  • Only a few special cases are exceptions

Key Takeaways

  • Links capture existence, not state
  • Effectivity Satellites capture when relationships are active
  • Deletes and reinserts are modeled as satellite deltas
  • Updates are modeled as delete + insert
  • Current state is derived by filtering on open deletion timelines

Once you adopt this pattern consistently, handling complex relationship lifecycles becomes simple, scalable, and auditable.

Watch the Video

How to Refactor Your Raw Data Vault: Proven Strategies for Scalable Multi-Tenant Data Warehouses

Refactoring the Raw Data Vault

Refactoring is a natural part of any evolving data architecture. Whether you’re adding a new entity, integrating additional source systems, or moving toward a multi-tenant Data Warehouse, change is inevitable. In this article, we’ll explore what it means to refactor your Raw Data Vault (RDV), why it’s essential, and how to do it safely using strategies like Hub-it-out.

The discussion is based on a real-world scenario where a company’s recruitment data model needed to evolve. Initially, the model tracked requisitions but didn’t fully capture recruiter details. With new requirements, the team needed to refactor their RDV to introduce a new Hub Recruiter entity — without breaking historical data or existing queries.



What Refactoring Really Means

Before diving into the technical details, let’s clarify what refactoring is — and what it isn’t.

Refactoring means changing the internal structure of your data model without altering its external behavior. Think of it as improving or extending the foundation — adding new components, reorganizing relationships, or optimizing loading logic — while keeping the overall functionality intact.

Redesign, on the other hand, means making user-visible changes. This might involve restructuring dashboards, altering reports, or redefining business logic in ways that affect your end users.

Since the Raw Data Vault is an internal structure (not typically user-facing), most adjustments here fall under refactoring. These changes are iterative and low-risk if your base model is valid.

Deploy First, Refactor Later

Many data teams fall into the trap of seeking the “perfect” model before deployment. But as experts at Scalefree emphasize — perfection is the enemy of progress. Instead of spending months debating the right business key or model structure, deploy a valid model quickly, gather feedback, and improve it through incremental refactoring.

“Don’t aim for the best model on day one. Aim for a valid model that you can continuously improve.”

This agile mindset ensures your team delivers value early and often. Over time, with each sprint, you evolve closer to the “best” model for your organization — even though, in truth, a perfect model doesn’t exist.

The Recruiter Example: When Business Requirements Change

Let’s revisit our use case: your company is in the recruiting business. Each job requisition is led by a single recruiter. Your original model included a RecruiterID in the requisition table but didn’t have a dedicated Hub Recruiter. Now, the business wants to integrate additional recruiter data from new source systems.

So how should you adapt your Data Vault model?

  • Do you close the old Link and create a new one that includes the Recruiter Hub?
  • Do you split existing Links into smaller, recruiter-focused relationships?
  • How do you handle historical backfills and multi-tenant variations?

The answer: you refactor using the Hub-it-out strategy.

Understanding the “Hub-it-Out” Strategy

“Hub-it-out” is a practical approach to introducing new Hubs and Links into your existing Raw Data Vault without reloading everything from the source systems.

Here’s how it works:

  1. Identify the business key in your existing Satellite. For example, you already have RecruiterID stored as an attribute, even if it wasn’t modeled as a Hub initially.
  2. Create a new Hub (Hub Recruiter) by selecting all distinct recruiter IDs from your existing Satellite data.
  3. Generate hash keys for each business key value and assign load dates and record sources from the Satellite’s first occurrence.
  4. Create a Link between the existing parent Hub (e.g., Requisition) and the new Hub (Recruiter) based on relationships already present in the Satellite.

With this approach, you can build new structures directly from your existing Data Vault — no need to reload historical data from source systems, which may not even be available anymore. If you have a data lake or Persistent Staging Area (PSA), you can also load from there as an alternative.

Why a Valid Model Is Everything

One key prerequisite for refactoring success is having a valid Raw Data Vault model. A “valid” model means that:

  • Data is captured consistently and completely.
  • No business keys or relationships have been lost during earlier transformations.
  • Hubs, Links, and Satellites follow proper Data Vault design rules.

If your model is valid, you can refactor it safely — adding Hubs, Links, or Satellites — without touching your original data sources. This makes evolution faster, cheaper, and much less disruptive.

Multi-Tenant Data Vault Considerations

When dealing with multiple clients or tenants, you should introduce a TenantID attribute in every Hub, Link, and Satellite. This ensures that data from one tenant never overwrites another’s records, since the hash keys will differ.

Typically, the first few tenants may require adjustments as you generalize your model. But after integrating the second or third tenant, the structure stabilizes. Each new tenant may add Satellites or minor extensions — but the overall architecture remains consistent.

Avoid Overloading Links

Each source system should define its own Unit of Work — a consistent relationship between business keys. If one system defines a Link between three business keys (e.g., Requisition, Candidate, Recruiter) and another defines it between two, treat them as separate Links. Avoid “overloading” Links by mixing different granularities or structures. That’s a common source of data inconsistency and confusion.

Transitioning Without Breaking Queries

Refactoring can be disruptive for power users who query the Raw Data Vault directly. Their queries may break when Hubs or Links are renamed, split, or replaced. To manage this transition smoothly:

  • Load both old and new entities temporarily. Keep the old Link active for a while, even if it’s limited to a single tenant.
  • Mark deprecated objects clearly. Add a “deprecated” flag or comment in your metadata catalog.
  • Communicate proactively. Notify users via email or release notes, giving them 90–180 days to adjust queries.

Despite communication efforts, some users will inevitably miss the deadline — but maintaining transparency and clear documentation helps minimize friction.

Using Virtual Views for Legacy Support

Instead of maintaining redundant tables, you can recreate deprecated entities as virtual views on top of your new Data Vault structures. This approach saves storage while still supporting legacy queries.

Yes, there’s a small performance trade-off, but since these views are temporary and clearly marked as deprecated, it’s an effective bridge strategy. Inform users that performance may decrease slightly and that these views will be removed after a defined period (e.g., one year).

When and How to Clean Up

Every new entity adds maintenance overhead — documentation, metadata management, refactoring complexity. Once your transition period ends, remove deprecated tables and views to keep your model clean and manageable.

Set a clear timeline with your business stakeholders, archive necessary backups, and drop the obsolete entities once the window closes. This keeps your RDV lean and reduces technical debt.

Key Takeaways

  • Deploy early — perfection isn’t required for value.
  • Refactor continuously through small, validated improvements.
  • Use the Hub-it-out strategy to extend your model safely.
  • Always maintain a valid Data Vault foundation to enable future flexibility.
  • Manage user expectations through clear communication and deprecation policies.

Final Thoughts

Refactoring your Raw Data Vault isn’t about redoing work — it’s about evolving intelligently. With the right strategies and mindset, you can adapt to new business requirements, integrate additional tenants, and maintain consistency without painful re-engineering.

Whether you’re adding a Recruiter Hub, optimizing Link structures, or modernizing your multi-tenant DWH, remember: the goal isn’t perfection. It’s progress — sprint by sprint, improvement by improvement.

Watch the Video

Multi-Tenant Data Vault

Multi-Tenant Environment

Designing and maintaining a Data Vault in a multi-tenant environment presents unique challenges. When a data warehouse must handle not just internal data, but also data from dozens of external clients with slightly different processes and systems, the complexity increases dramatically.

A recent question we received highlighted this exact situation:

“I’m struggling with link management and the evolution process in a multi-tenant warehouse, especially putting all data together in the Information Mart. Our Data Warehouse contains internal data as well as shared data from our clients, for which we perform job requisition processes using their internal systems. We plan to onboard 50–60 clients in the next 2–3 years. Right now, we’re still in the MVP phase, supporting just a few clients. How should I manage links with so many different systems, such a large number of source tables, and processes that are similar but not identical? The goal is to have one common Information Mart design for all clients to enable standardized reporting.”

This is a classic question in modern data architecture. Let’s explore how to approach Raw Vault, Business Vault, and Information Mart design in a multi-tenant context.



Multi-Tenancy in the Raw Data Vault

A cornerstone principle in Data Vault modeling is that each Satellite is sourced from a single source system. However, in a multi-tenant setup, this guideline needs some adaptation. Many tenants use the same source systems (e.g., Salesforce, SAP) with similar core structures. In such cases, you can load multiple tenants into the same Satellite as long as you introduce a Tenant ID as part of the key.

Why Add a Tenant ID?

  • Ensures uniqueness of business keys across tenants (e.g., Customer 42 in Tenant A ≠ Customer 42 in Tenant B).
  • Partitions data naturally, so Satellites contain subsets per tenant without overwriting each other’s records.
  • Provides a straightforward way to filter or secure records by tenant.

By combining the local business key with the Tenant ID, you create a unique enterprise-wide business key. This guarantees data integrity while simplifying downstream querying and reporting.

Where to Add the Tenant ID

In multi-tenant designs, the Tenant ID should ideally appear:

  • Hubs: As part of the business key or alternate key, ensuring uniqueness across tenants.
  • Links: As part of the Hub references, ensuring uniqueness in combined relationships.
  • Satellites: As a payload field for convenience, even if the hash key already includes the Tenant ID.

With this approach, every record in the Raw Vault can always be traced back to a specific tenant, which simplifies not only modeling but also governance and security.

Defining the Tenant ID

A natural question arises: what exactly is a “tenant”? The answer depends on your business context:

  • It could be a client organization you serve.
  • It could be a business unit, country, or factory in large enterprises.
  • It might also be defined by data ownership—who is responsible for the dataset.

In some cases, you may also need a reserved Tenant ID for global or shared data that is not owned by any specific tenant. This ensures consistency and supports role-based access control.

Staging and Tenant Assignment

The Tenant ID is typically introduced already in the staging layer. How it’s assigned depends on the source system:

  • Tenant-dedicated systems: Assign a constant Tenant ID for all data from that system.
  • Multi-tenant systems (e.g., SAP, Salesforce): Extract and map the Tenant ID from existing fields (e.g., business unit, org ID).
  • Global systems: Use a reserved Tenant ID (e.g., “GLOBAL”) when ownership is shared or unclear.

This is a hard rule (constant assignment), not a conditional transformation, which ensures repeatability and traceability.

Business Vault in Multi-Tenant Contexts

Once Tenant IDs are embedded in the Raw Vault, the Business Vault becomes much easier to design. Business rules can be applied consistently across tenants, while preserving tenant-specific contexts.

  • Same-as Links: Crucial for resolving duplicate entities across tenants (e.g., the same customer appears in different client systems).
  • Custom Satellites: Standardize where possible, but add additional Satellites for tenant-specific customizations.
  • Wide PIT Tables: Be prepared for them—multiple tenants and diverse source systems naturally lead to broader structures.

At this stage, the goal is harmonization without oversimplification. A balance must be struck between common modeling and tenant-specific flexibility.

Designing the Information Mart

The Information Mart is where tenants—or the enterprise as a whole—derive insights. The challenge is to provide both:

  • Enterprise-wide views: Merging data from all tenants for global reporting.
  • Tenant-specific views: Allowing clients or business units to see only their data.

Common Mart Design

A single common dimensional model for all tenants reduces development overhead and supports standardized reporting. By including the Tenant ID in dimensions and facts, you can apply row-level security to restrict access per tenant.

When Separate Marts Are Needed

In some cases, specific tenants may require custom Information Marts. This is typically justified when:

  • Unique KPIs or processes cannot be expressed in the common model.
  • Legal or contractual reasons require strict separation of data.

However, these should remain exceptions. A well-designed common mart, filtered by Tenant ID, is usually sufficient for most tenants.

Role of Same-as Links in Reporting

To unify data across systems and tenants, Same-as Links are critical. These resolve entity duplicates across different tenants and systems (e.g., a product appearing under different codes in SAP and Salesforce).

Same-as Links can be sourced from:

  • Raw data: Mapping tables provided by business or source systems.
  • Calculated logic: Fuzzy matching, soundex, or other deduplication algorithms.

This harmonization enables the creation of enterprise-wide dimensions that span multiple tenants.

Security and Governance in Multi-Tenant Data Vaults

By embedding Tenant IDs throughout the model, row-level security becomes straightforward. Each record can be tied to a tenant, and access can be granted or denied accordingly. This simplifies compliance with data privacy regulations and contractual obligations.

Governance practices should also establish clear rules for:

  • Defining and maintaining Tenant IDs.
  • Managing ownership of global vs. tenant-specific data.
  • Regular audits of access controls and Same-as Links.

Best Practices for Multi-Tenant Data Vaults

  1. Add Tenant IDs early: Introduce them in staging to ensure consistency across the pipeline.
  2. Unify where possible: Standardize Satellites for common structures, customize only when necessary.
  3. Reserve global IDs: Create special identifiers for shared or unclear ownership data.
  4. Secure with Tenant IDs: Use row-level security tied directly to the Tenant ID field.
  5. Leverage Same-as Links: Resolve duplicates to support enterprise-wide reporting.
  6. Design one common mart: Rely on row-level filtering instead of duplicating models per tenant.
  7. Scale incrementally: Start with MVP, refine the model as you onboard new tenants.

Conclusion

Multi-tenant Data Vault design requires careful thought about uniqueness, ownership, and harmonization. By embedding Tenant IDs consistently across Hubs, Links, and Satellites, you not only preserve data integrity but also simplify governance and security. The Business Vault and Information Mart can then be designed to support both tenant-specific and enterprise-wide perspectives.

As organizations grow and onboard more clients or business units, this approach ensures scalability without overwhelming complexity. With clear governance, Same-as Links, and standardized mart designs, you can build a robust multi-tenant data warehouse that serves diverse needs while staying maintainable and secure.

Watch the Video

Complex Computed Satellites in Data Vault

Complex Computed Satellites

When people first learn about computed satellites in Data Vault, they often encounter very simple examples: concatenating first and last names into a full name, or applying a basic calculation within a satellite. While these examples are valid, they don’t capture the full breadth of what computed satellites can do. In reality, computed satellites are a powerful mechanism for integrating, transforming, and enriching data across your vault — enabling business-driven insights while maintaining the Data Vault principles of auditability and traceability.

This article will walk through the broader concept of computed satellites, discuss how they are designed, and provide practical implementation patterns for handling more complex use cases.



What is a Computed Satellite?

At its core, a satellite in Data Vault is a structure that describes a business object (a hub or link) by holding descriptive attributes over time. A computed satellite differs from a raw satellite because its data does not come directly from the source system but is derived through business logic.

Examples include:

  • Concatenating FirstName and LastName into FullName.
  • Deriving an age from a birthdate.
  • Producing calculated scores, risk categories, or classifications.
  • Integrating attributes from multiple satellites across different hubs via links.
  • Creating artificial relationships, such as product recommendations based on purchase history.

Importantly, a computed satellite isn’t just about the calculation itself — it’s about what the result describes and where it logically belongs in your model.

Step 1: Defining the Parent Entity

Before you build a computed satellite, you must answer a critical question: What does the result describe?

Every satellite attaches to either a hub (a business key) or a link (a relationship between keys). If your calculation produces attributes describing a customer, then the computed satellite belongs on the Customer Hub. If it describes a relationship between customers and products, it belongs on the respective link.

For example:

  • A Full Name attribute describes a Customer Hub.
  • A product recommendation score describes a Customer–Product Link.
  • A risk category for an account describes an Account Hub.

This step ensures that your computed satellite stays aligned with the business meaning of your Data Vault model.

Step 2: Designing the Structure

Once you know the parent, the next step is to decide the structure of your results. Computed satellites can contain:

  • Simple attributes (e.g., strings, numbers, dates).
  • Multiple descriptive fields derived from logic.
  • Semi-structured data, such as JSON or XML.

For example, you might calculate a JSON object capturing a customer’s segmentation profile, or an XML document describing a product configuration.

The important point: the satellite reflects the structure of your results, not the mechanics of how you implemented them.

Step 3: Implementing the Business Logic

After modeling comes implementation. Computed satellites can be populated in several ways:

SQL Views

The most common approach is to implement a computed satellite as a SQL view. Here, the SQL query both expresses the logic (e.g., joins, transformations, calculations) and defines the result structure. If SQL is sufficient for your business rules, this is often the simplest and most maintainable approach.

External Scripts (Python, R, etc.)

For more advanced transformations, machine learning, or statistical processing, you may use external code. A Python script, for example, could pick up data from raw satellites, apply complex algorithms, and write results back into a computed satellite.

The golden rule: the implementation must remain under your control. Even if a data scientist creates an initial model using tools like Azure ML or RapidMiner, once it becomes part of your Business Vault, the deployment and maintenance are governed centrally. This ensures auditability and consistency.

Materialized Tables

Sometimes, business logic requires intermediate storage. In this case, you may materialize computed satellites as physical tables populated via INSERT statements or stored procedures. This is useful for performance optimization or managing dependency chains in cascading business rules.

Complex Use Cases for Computed Satellites

1. Filtering or Subsetting Business Keys

Imagine a Partner Hub with a single satellite. Business users may want to see only clients, employees, or vendors. Computed satellites can create filtered subsets that bring the model closer to business expectations. While not always the cleanest design, this is a practical option in some industries, such as insurance.

2. Artificial Links

A link doesn’t always need to come directly from a source system. You can create artificial links based on computed relationships. For example, by analyzing purchase history, you might generate product recommendations — effectively creating a Customer–Product Recommendation Link.

3. Cascading Business Rules

A powerful pattern is to break complex logic into smaller, reusable steps:

  1. Create a simple computed satellite that performs data cleansing or a basic calculation.
  2. Use that result in a second computed satellite to apply additional rules.
  3. Join results with other business vault entities to build richer attributes.

This cascading approach makes rules easier to maintain, document, and reuse — and avoids giant, unmanageable SQL queries filled with dozens of CTEs.

Best Practices

  • Start with the business meaning: Always clarify what the result describes before modeling.
  • Keep business logic in the Business Vault, not in downstream marts.
  • Favor cascading rules over monolithic transformations — it improves maintainability and reusability.
  • Control the code: All scripts, views, and procedures must be owned by the data warehouse team, not end-users.
  • Support multiple technologies: SQL for straightforward logic, external scripts for advanced logic, and materialized tables where necessary.

Dependencies and Execution

When you cascade rules or materialize results, you introduce dependencies. One entity must load before another. To manage this, many teams implement dependency tables that track loading order. This enables recursive or automated job scheduling, ensuring consistency across the Business Vault.

Virtualized approaches (SQL views) are often easier, since query optimizers can resolve dependencies dynamically. Materialized approaches, however, provide better performance and control at scale.

Why Computed Satellites Matter

Computed satellites are more than “extra calculated fields.” They enable organizations to:

  • Bridge the gap between raw data and business expectations.
  • Implement business rules in a controlled, auditable environment.
  • Support advanced analytics and machine learning workflows inside the Data Vault framework.
  • Enable modular, reusable logic that scales across domains and use cases.

By treating computed satellites as first-class citizens in your Business Vault, you ensure that business logic is not scattered in marts, reports, or ad hoc scripts — but is instead centralized, governed, and reusable.

Conclusion

Computed satellites in Data Vault can be as simple as a concatenated name, or as complex as multi-step cascading business rules that derive artificial relationships. The key is to start by identifying what your result describes, attach the satellite to the correct parent, design the structure of your attributes, and then implement the logic in a controlled, maintainable way.

Whether implemented via SQL, Python scripts, or materialized processes, computed satellites should remain under the stewardship of your data warehouse team. By following best practices, you’ll unlock the full potential of the Business Vault — keeping it business-aligned, auditable, and ready for advanced analytics.

Watch the Video

Data Vault Hashing on Databricks with XXHASH64

Hashing on Databricks

Hashing is a core element of Data Vault modeling. Hash keys are used to uniquely identify Hubs, Links, and Satellites, and they need to be consistent, reproducible, and efficient. A common debate is whether to use widely adopted hashing algorithms like MD5 or SHA-1, or to opt for faster and smaller hash functions such as xxhash64, which Databricks natively supports.

The question is simple: What if we stored hash keys as 64-bit integers (int64) using xxhash64 instead of 128-bit MD5 values? On the surface, this looks attractive — faster generation, better join performance, and reduced storage. But as we’ll explore in this article, the trade-offs around collisions, scalability, and platform independence make this a risky choice in Data Vault 2.0 architectures.



Why Smaller Hash Keys Look Tempting

There are clear benefits to using smaller hash values such as int64:

  • Speed of generation: Algorithms like xxhash64 or CRC-64 are significantly faster than MD5 or SHA-1. OpenSSL and Linux benchmarks consistently show xxhash64 outperforming older cryptographic functions.
  • Join performance: Joining 64-bit integers is naturally faster than joining 128-bit binary or 32-character string columns.
  • Storage efficiency: An int64 hash key requires only 8 bytes, compared to 16 bytes for binary MD5 or 32 bytes for an MD5 stored as a string.

From a performance perspective, the appeal is undeniable. But in Data Vault, performance is not the only concern. The fundamental question is: How safe are these smaller hashes when used as surrogate keys at scale?

The Risk of Collisions

A hash collision occurs when two different inputs produce the same hash value. In Data Vault, this means two different business keys could be treated as the same Hub, Link, or Satellite record — corrupting your data integrity.

With 64-bit hashes, the number of possible unique values is 2^64. While that sounds huge, probability tells a different story when you start loading millions or billions of rows. The so-called birthday paradox makes collisions far more likely than intuition suggests.

For example:

  • With just 10,000 records, a 32-bit hash already has a 1 in 100 chance of collision.
  • A 64-bit hash greatly reduces the risk, but at large scales (hundreds of millions of rows), the probability becomes uncomfortably high.
  • A 128-bit hash (MD5) pushes collision risk into the realm of trillions of rows before it becomes statistically relevant.

That’s why MD5 and SHA-1 — despite being slower — are still standard in Data Vault: they provide mathematically safe keyspace sizes for large enterprise datasets.

Performance vs. Integrity

The argument for int64 hashing often emphasizes query performance:

  • Joins on numeric columns are faster than joins on strings or binary.
  • xxhash64 is faster to compute than MD5.

These points are true. But in practice, the cost of a collision far outweighs the performance gains. A single collision can undermine the entire lineage of your data warehouse. Once data integrity is compromised, every downstream analytic and report is suspect.

As a result, most Data Vault practitioners will gladly accept the slightly higher CPU cost of MD5 or SHA-1 in exchange for peace of mind.

Platform Portability Matters

Another issue with xxhash64 is platform availability. While Databricks supports it, you also need to consider:

  • Can you compute xxhash64 in Snowflake, BigQuery, SQL Server, or Oracle?
  • Can you reproduce xxhash64 consistently in Python, Java, or ETL tools?
  • Will the algorithm be supported 5–10 years from now?

One of the strengths of MD5 and SHA-1 is their ubiquity. They are implemented in nearly every database, programming language, and ETL platform. This cross-platform reproducibility is essential when building a Data Vault that may span multiple systems. By contrast, xxhash64 locks you into Databricks (or requires custom implementations elsewhere).

Why 128-Bit is the Safe Minimum

Let’s compare hash sizes:

  • CRC-32: Very fast, but collisions appear after ~10,000 records. Unusable for Data Vault.
  • xxhash64 / CRC-64: Better, but collisions become likely as datasets grow into the hundreds of millions.
  • MD5 (128-bit): Standard choice. Safe up to trillions of rows.
  • SHA-1 (160-bit): Provides even more headroom for extremely large datasets.

For most enterprises, MD5 hits the sweet spot: fast enough, widely supported, and statistically collision-free at realistic data volumes.

Alternative Approaches

If performance is a serious concern, consider these approaches instead of shrinking hash size:

  • Binary storage: Store MD5 as a 16-byte binary instead of a 32-character string. This cuts storage in half and improves join performance.
  • Partitioning strategies: Optimize joins by partitioning your Data Vault tables, reducing the need for full-table joins.
  • Hardware acceleration: Modern CPUs have optimized instructions for MD5 and SHA-1, making them faster than you might expect.
  • Consider SHA-256 only if required: While SHA-2 offers stronger guarantees, it’s rarely necessary in Data Vault and adds performance overhead.

Summary: Should You Use xxhash64 in Data Vault?

While xxhash64 looks attractive in Databricks because of its speed and smaller footprint, it’s not a good fit for Data Vault 2.0:

  • Collision risk is too high for large-scale data warehouses.
  • Portability is limited — you risk vendor lock-in.
  • Long-term maintainability suffers if your algorithm isn’t standardized across platforms.

In Data Vault, data integrity always comes first. That’s why MD5 (128-bit) or SHA-1 (160-bit) remain the recommended standards for hash keys. They provide the balance of performance, portability, and collision safety needed for enterprise-scale solutions.

Conclusion

If you’re working in Databricks, it might be tempting to adopt xxhash64 for hash keys. But resist that temptation. The risks of collisions and platform lock-in far outweigh the benefits. Stick with MD5 or SHA-1 for your Data Vault hash keys, store them efficiently as binary values, and optimize performance through storage and join strategies.

Remember: a faster broken key is still a broken key. In Data Vault, correctness and consistency are always the highest priority.

Watch the Video

Joining SCD2 Tables Using Data Vault

Joining SCD2 Tables

When working with Data Vault 2.0, one of the most common challenges is how to handle Slowly Changing Dimension type 2 (SCD2) tables when loading Link Satellites. Imagine a scenario where you need to join three SCD2 tables, each with valid_from and valid_to dates, and bring them together into a single Satellite hanging from a Link. At first, this might sound straightforward, but the details matter — and depending on your approach, you could make your warehouse harder to maintain, less performant, or less flexible.

In this article, we’ll walk through the best practices for handling this situation, based on Data Vault principles. We’ll discuss why you shouldn’t rush into joining multiple SCD2 tables into one Satellite, how to handle business timelines, the role of PIT and T-PIT tables, and strategies for keeping your design scalable and future-proof.



Understanding SCD2 in the Context of Data Vault

Slowly Changing Dimensions type 2 are a way to track historical changes in dimensional data. Each record typically has valid_from and valid_to dates that describe its period of effect. In a Data Vault model, however, the focus isn’t on interpreting those business dates upfront. Instead, the Raw Data Vault stage is all about capturing what the source gives you, as-is.

That means when we first bring data into the Raw Vault, the valid_from and valid_to fields should be stored as descriptive attributes — not as part of the Satellite’s primary key. Trying to interpret and align them too early will only create unnecessary complexity.

Why Not Load One Satellite From Three SCD2 Tables?

At first glance, it may seem attractive to combine all three SCD2 tables directly into one Satellite. But this goes against Data Vault best practices:

  • Each Satellite should usually source from a single table or source system.
  • Denormalizing multiple sources into one Satellite complicates your Raw Vault and makes it harder to maintain.
  • You risk having to undo the denormalization later when new requirements come in.

The recommended approach is to create at least three Satellites — one per SCD2 table. You may even need more Satellites if you have to split them based on privacy, rate of change, or security rules. This simplifies your Raw Vault and sets you up for flexibility later.

Handling Validity Dates in Satellites

Once your Satellites are created, the question is how to handle valid_from and valid_to. Here are two key approaches:

1. Multi-Active Satellites

If your SCD2 records represent multiple simultaneously valid states (for example, different price lists where both current and future prices are valid), you can use a multi-active Satellite. In this design:

  • The primary key is composed of hash_key + load_date + subsequence.
  • valid_from and valid_to are stored as descriptive attributes, not as key parts.
  • A staging-generated subsequence ensures uniqueness within a load.

2. Standard (CDC) Satellites with Subsequence

If only one record is active at a time (the classic SCD2 case), then you don’t need a multi-active design. Instead:

  • Stick with the standard Satellite primary key: hash_key + load_date.
  • Handle multiple intraday changes by adding micro- or nanosecond subsequences to load_date.
  • This ensures only one active row per parent key at any given time.

Choosing between multi-active and CDC-style Satellites depends entirely on your data. Do you need multiple simultaneously valid records? Or does one replace the other in sequence? Your answer determines the right design.

Aligning Data with PIT Tables

Once the three Satellites are loaded, you’ll need to bring them together for reporting. This is where PIT (Point-in-Time) tables come in. PIT tables align deltas across Satellites to a common snapshot date, making it possible to present a unified view of related changes.

There are two main flavors:

  • Standard PIT – aligns data based on load_date (technical historization).
  • Temporal PIT (T-PIT) – aligns data based on business timelines like valid_from and valid_to.

A T-PIT lets you activate rows based on both the technical and business timelines. However, it comes with maintenance challenges: whenever valid_from or valid_to dates change, you may need to rebuild PIT partitions — which can be expensive if large date ranges are altered.

Where Should You Apply Business Timelines?

There are three strategies for applying business timelines like valid_from and valid_to:

  1. Downstream in reporting or dashboards – simplest to maintain, and business users can define how timelines should be applied.
  2. In dimension views – apply filters and conditions directly in the SQL layer that feeds reports.
  3. Upstream in PIT/T-PIT – most performant but requires heavier maintenance whenever business timelines change.

A practical approach is to start downstream and only move timeline application upstream if performance issues demand it.

Building a Business Satellite

Once your PIT table aligns the Satellites, you may still need a Business Satellite (or Computed Satellite). This Satellite:

  • Uses the PIT table as its foundation.
  • Combines attributes from the three original Satellites using COALESCE or other business rules.
  • Optionally applies T-PIT logic if strict business timeline alignment is required.

By separating Raw Satellites (which store raw source data) from Business Satellites (which apply interpretation and business logic), you preserve Data Vault’s flexibility while still meeting analytical requirements.

Best Practices Recap

  • Create one Satellite per SCD2 source table — don’t denormalize too early.
  • Treat valid_from and valid_to as descriptive attributes in the Raw Vault.
  • Use multi-active Satellites only when multiple records are simultaneously valid; otherwise, stick with standard CDC Satellites.
  • Align Satellites with PIT tables; consider T-PITs only if business timelines must be applied upstream.
  • Whenever possible, push business timeline interpretation downstream to dimensions or reports for easier maintenance.
  • Use Business Satellites when you need to merge multiple sources into a single unified view.

Conclusion

Joining multiple SCD2 tables into a single Satellite is rarely the right first step in a Data Vault 2.0 implementation. Instead, build your Raw Vault Satellites closely aligned with their sources, use PIT tables to align changes, and apply business timelines carefully — starting as far downstream as possible. This approach keeps your architecture maintainable, flexible, and scalable while still supporting complex historical analysis.

By following these principles, you’ll not only simplify your data model but also give your business users the power to interpret validity ranges in ways that make sense for them — without locking your warehouse into rigid rules that are hard to maintain.

Watch the Video

Defining Snapshot Dates in Data Vault

Defining Snapshot Dates in Data Vault

When working with Data Vault, one of the most common questions practitioners face is: how do we define snapshot dates? While load dates are tied to when data arrives from the source, snapshot dates serve a different and equally important purpose — they allow us to deliver stable, consistent, and predictable datasets to end users. In this article, we’ll break down the concept of snapshot dates, explain how they differ from load dates, and walk through practical examples of how they are implemented in a Data Vault architecture. By the end, you’ll understand how to design and manage snapshot dates effectively, and how they fit into the broader picture of multi-temporal data management.


Why Snapshot Dates Matter

Imagine you are a business user who arrives at the office at 9 AM, expecting your reports to be ready. You want those reports to be stable — not changing throughout the day unless a new cycle of data refresh is scheduled. Snapshot dates exist to decouple data delivery from data ingestion.

  • Load Date: Indicates when data from the source system arrived in your warehouse. It is system-driven and depends on the source’s delivery schedule.
  • Snapshot Date: Defines when the data is frozen for reporting and analysis. It is business-driven and follows a regular schedule (daily, hourly, every 5 minutes, etc.).

This separation allows data teams to provide users with predictable datasets, regardless of how irregularly or frequently source systems deliver new data.

The Three Timelines in Data Vault

To fully grasp snapshot dates, it’s useful to understand that Data Vault designs operate across three different timelines:

  1. Data Warehouse Timeline: Driven by load datetime stamps, representing when data batches (or real-time messages) arrive from the source system.
  2. Snapshot Timeline: Driven by the business delivery cycle. This is when data is made available to users in stable form.
  3. Business Timeline: Driven by business events (contract start dates, end dates, valid-from/valid-to fields, modified timestamps, etc.) and stored in Satellites or other model components.

For this article, we’ll focus on the second timeline: the snapshot timeline.

Defining Snapshot Dates with Examples

Daily Snapshot Example

Suppose a company wants its reports refreshed once per day, every morning at 7 AM UTC. This means that:

  • Regardless of when the source delivers data, the snapshot timestamp is always set to 7 AM.
  • Users querying the data warehouse at 9 AM will see a stable version of the data that won’t change until the next snapshot is generated.
  • In the control table, a row is inserted daily with the snapshot timestamp (e.g., 2025-09-16 07:00:00 UTC).

Hourly Snapshot Example

For real-time dashboards, an hourly or even 5-minute snapshot might be necessary. Let’s say hourly snapshots are generated:

  • A new snapshot timestamp is inserted every hour (e.g., 2025-09-16 01:00:00, 2025-09-16 02:00:00, etc.).
  • Older hourly snapshots may be discarded after a week to save storage, while daily snapshots are retained for a year.
  • End users can query either the most recent hourly snapshot or the daily snapshot depending on their needs.

Mixed Use Case

Some users might want hourly updates, while others only need a daily snapshot. In such cases:

  • The control table holds all snapshots (hourly + daily).
  • Boolean flags are used to mark whether a snapshot is “hourly,” “daily,” or “real-time.”
  • Users can filter based on these flags when running queries.

How Control Tables Help Manage Snapshot Dates

In practice, snapshot dates are managed using control tables. There are typically two types of control tables in Data Vault projects:

  1. Load Control Table: Tracks load datetime stamps for each source, indicating whether the batch has been processed into staging, raw Data Vault, business vault, or marts.
  2. Snapshot Control Table: Stores snapshot datetime stamps generated by the warehouse. These define the stable reporting layers that users can query.

When building PIT (Point-in-Time) tables and bridge tables, the snapshot control table plays a critical role. It ensures that snapshots align with user expectations, and it provides metadata for filtering (e.g., “latest snapshot,” “daily snapshot,” etc.).

Implementing Snapshot Dates in PIT Tables

PIT tables act like indexes into your Data Vault, enabling efficient query performance. The loading process of PIT tables typically involves:

  1. Checking the snapshot control table to see which snapshots should exist.
  2. Loading the required snapshots into the PIT table if they’re missing.
  3. Ensuring that the PIT table and snapshot control table remain in sync after each refresh cycle.

When a user queries a PIT table, they can join it to the snapshot control table based on the snapshot datetime stamp. From there, they can filter by flags (e.g., latest, daily, hourly) to get the version of data they need.

Best Practices for Defining Snapshot Dates

  • Work with business users: Define snapshot frequencies based on real business needs (daily, hourly, real-time).
  • Be consistent: Establish a standard snapshot time (e.g., 7 AM UTC daily) to simplify reporting.
  • Use control tables: Automate the insertion of snapshot timestamps and track them for PIT/bridge table loading.
  • Retain wisely: Keep high-frequency snapshots (hourly/5-minute) for a short time, but retain daily snapshots for longer historical analysis.
  • Decouple load and snapshot timelines: Remember that load datetime depends on the source, but snapshot datetime depends on user requirements.

Snapshot Dates vs. Business Dates

It’s worth highlighting again that snapshot dates are not the same as business dates like “contract start” or “valid-to.” Business dates come directly from source systems and are stored as part of the business timeline in Satellites or Links. Snapshot dates, on the other hand, are warehouse-generated and serve as reference points for reporting and querying.

Conclusion

Snapshot dates are a cornerstone of Data Vault’s multi-temporal design. By providing a regular, predictable timestamp for reporting and queries, they ensure stability and trust in data delivery — even as source systems deliver data at unpredictable times. With the help of control tables, PIT/bridge tables, and well-defined retention policies, snapshot dates give both IT teams and business users the clarity and consistency they need. Whether your business requires daily snapshots for stable reports or high-frequency snapshots for real-time dashboards, the principles remain the same: define, control, and communicate your snapshot strategy clearly.

Watch the Video

Close Menu