Skip to main content
search
0

How to Derive Dimensions and Facts from a Data Vault Model

Patterns for Deriving Data Vault Dimensions and Facts

If you’ve gone through a Data Vault training, you know that Data Vault is a pattern-driven methodology — from loading the Raw Data Vault all the way through business logic implementation in the Business Vault. But one question that consistently comes up, especially among practitioners bridging the gap between modeling and delivery, is this: How do you actually derive Dimensions and Facts from a Data Vault model?

It’s a great question, and the good news is that just like the rest of Data Vault, there are clear, repeatable patterns for doing it well. Let’s walk through them.



The Pattern-Based Nature of Data Vault Delivery

Data Vault is not just a modeling technique — it’s an entire methodology built around patterns. There are patterns for loading the Raw Data Vault, patterns for modeling it, and patterns for implementing business logic in the Business Vault. The same is true for the information delivery layer, where you produce standard target entities like Dimensions and Facts (or 3NF entities, flat/wide schemas, and more).

Once you internalize these patterns, deriving your target entities becomes a predictable and even automatable process. That’s one of the most underappreciated strengths of Data Vault: the patterns repeat, and the more you use them, the faster and more confident you become.

Deriving Dimensions: The 80/20 Rule

Roughly 80% of your Dimensions will follow one core pattern: they are derived from a combination of a Hub, one or more Satellites, and — critically — a PIT Table (Point-In-Time Table).

The basic query pattern looks like this:

  1. Start with the PIT Table as your base
  2. Join the Hub to retrieve the business key
  3. Join the relevant Satellites for descriptive attributes
  4. Cherry-pick the attributes you need for your Dimension

The remaining 20% of Dimensions come from Links rather than Hubs. In these cases, you start with a PIT Table built on a Link, join the Link itself for hash keys and Dimension references (often called Bridge Dimensions), and then join the Satellites to describe the relationship.

In practice, this covers close to 99% of all Dimension derivations. There are always edge cases, but this two-pattern approach handles the vast majority of real-world scenarios.

Why PIT Tables? The Case for Point-In-Time

If you’re new to PIT Tables, they can look intimidating at first — a structure full of hash keys and timestamps. But once you understand the pattern, it’s actually one of the most elegant tools in the Data Vault toolkit. And here’s the key insight: in modern implementations, PIT Tables are generated by automation tooling, so you rarely build them by hand. You learn how to use them, not how to construct them from scratch.

PIT Tables are essential for several reasons:

Supporting All SCD Types

A PIT Table provides the foundation for Slowly Changing Dimension (SCD) types 0, 1, and 2 — all from the same structure. The snapshot day timestamp embedded in the PIT Table allows you to control the grain and timeline of your Dimension flexibly.

Delivering Stable Reports

Here’s a scenario every analyst will recognize: you pull a report at 9:00 AM and expect it to remain consistent throughout the day. But if your Dimension view is simply joining the latest delta from a Satellite directly, any incoming data update after 9:00 AM will silently change your report’s results.

A PIT Table solves this by identifying the latest valid delta for all hash keys at a specific snapshot timestamp — say, 9:00 AM. This decouples your information delivery from live, incoming data pipelines. Users get stable, predictable reports. The data warehouse and the source pipelines can operate independently without stepping on each other.

This is why using a PIT Table, though it may appear more complex at first, is actually simpler and more aligned with what business users actually need: consistent data at a known point in time, not real-time volatility bleeding into their dashboards.

Decoupling Delivery from Ingestion

Beyond stability, the PIT Table serves a broader architectural purpose: it creates a clean separation between the data loading process and the data delivery process. This is a hallmark of mature data warehouse design — and Data Vault makes it a first-class pattern rather than an afterthought.

Want to deepen your understanding of PIT Tables, Satellites, and the complete Data Vault modeling toolkit? Explore our Data Vault 2.1 Training & Certification to get hands-on with these patterns.

Deriving Facts: Links, Non-Historized Links, and Bridge Tables

Facts are a different story. Where Dimensions describe who and what, Facts capture events, transactions, and messages — think banking transactions, call records, IoT sensor readings, or logistics movements.

For Facts, your primary sources in a Data Vault model are:

  • Non-Historized Links (NH Links) — ideal for transactional events that don’t change
  • Dependent Child Links — for subordinate, granular transactional data
  • Bridge Tables — for pre-aggregated or grain-shifted data
  • Joined Link combinations — for cross-domain Fact derivations

If your underlying Link structure already matches the grain you need for your Fact, you can build a simple view on top of it and you’re done. Clean and efficient.

Grain Management: The Core Challenge of Fact Derivation

Things get more interesting when your source data doesn’t naturally sit at the right grain for your target Fact. This is where Bridge Tables and grain shift operations come into play.

Coarsening Grain with GROUP BY

Imagine an NH Link that captures real-time vehicle positions — five messages per second. Your Fact table only needs one position per hour. You need to reduce the granularity.

You do this with a GROUP BY operation on a subset of the Hub references within the Link. The fewer Hub references you group by, the coarser your grain. The result of this aggregation is materialized into a Bridge Table, which then becomes the base for your Fact view.

Refining Grain by Joining Links

The inverse is also possible. If you need a finer grain than any single Link provides, you join multiple Links on their shared Hub references. For example, if one Link connects Customer and Store, and another connects Store and Product, joining them on the shared Store reference produces a result with three Hub references — a finer granularity than either Link alone.

Other Grain Shift Operations

Beyond GROUP BY and joins, there are other grain-shifting operations worth knowing:

  • DISTINCT operations — to remove duplicate rows at a desired level of aggregation
  • JSON array expansion — splitting a JSON array into multiple rows is itself a grain shift, making rows finer

All of these operations can produce results that are materialized in Bridge Tables for performance and reuse, which then serve as the clean input layer for your Fact views.

PIT Tables for Snapshot Facts

There is one important exception to the “Links and Bridge Tables for Facts” rule: snapshot-based Facts. For periodic snapshots — daily inventory levels, end-of-month balances, weekly pipeline status — a PIT Table can serve as an excellent base for your Fact entity.

The catch is that PIT Tables are sometimes pruned over time to keep them manageable. If you need to preserve a long history of snapshots for your Fact, consider using a dedicated Bridge Table to maintain those snapshots independently from your operational PIT Tables. This gives you full control over retention and history without disrupting the PIT Table’s primary purpose.

Beyond Dimensions and Facts: Wider Applicability

The same patterns extend naturally to other target model types:

  • 3NF (Third Normal Form) entities — derived using similar logic from Hubs, Links, and Satellites
  • Flat and wide schemas — built by joining PIT Tables to Bridge Tables and then joining Satellites to pre-join all descriptive attributes into a single denormalized entity
  • Fully denormalized Fact entities — start with a Bridge Table or Link, join PITs, then join the relevant Satellites to fold all Dimension attributes directly into the Fact

This last pattern — the fully denormalized Fact — is particularly powerful for downstream consumers like machine learning pipelines or self-service analytics tools that benefit from wide, flat tables. And it follows the exact same building blocks.

Summary: The Patterns at a Glance

Target Entity Primary Source Pattern
Standard Dimension (80%) PIT + Hub + Satellites Join PIT → Hub → Satellites, select attributes
Bridge/Relationship Dimension (20%) PIT on Link + Satellites Join PIT → Link → Satellites
Transactional Fact NH Links / Dependent Child Links View on top of Link at correct grain
Aggregated Fact Bridge Table (GROUP BY result) Grain shift → materialize → view
Snapshot Fact PIT Table or Bridge Table Snapshot timestamps → Fact view
Denormalized Fact Bridge + PIT + Satellites Pre-join all descriptive attributes

Learn the Patterns, Apply Them Everywhere

One of the most satisfying aspects of working with Data Vault is that once you’ve internalized these patterns, they compound. Each new project reinforces the same building blocks: Hubs, Links, Satellites, PIT Tables, Bridge Tables. The same logic for deriving a Dimension applies to deriving a 3NF entity. The same grain management principles for Facts carry over to denormalized wide tables.

Data Vault doesn’t ask you to reinvent the wheel for every project — it gives you a set of reliable, tested patterns and asks you to apply them consistently.

If you’re ready to master these patterns in depth — including hands-on work with PIT Tables, Bridge Tables, and full information delivery models — check out our Data Vault 2.1 Certification and Training. And if you’re just getting started, there’s a free Data Vault handbook available as a concise primer covering the core concepts, pros and cons, and when to use Data Vault in your architecture.

The patterns are there. Once you learn them, they’ll follow you everywhere.

Watch the Video

How to Deal With Late Arriving Data

Late Arriving Data

Late arriving or backdated data is a common challenge in data warehousing. In Data Vault, it is important to distinguish between the technical timeline used for loading data and the business timeline representing when events actually occurred in the real world.



1. Technical Timeline vs Business Timeline

When loading data into the Raw Vault, always use a Load Date Timestamp (LDTS):

  • Set when the record first arrives in your target system (landing zone, data lake, or Raw Vault).
  • Never backdate this timestamp—it should always move forward.
  • Used for incremental loading, delta detection, and reproducibility of snapshots.

This timestamp does not reflect the real-world timing of the data. It is purely a technical artifact to track ingestion order.

2. Capturing the Business Timeline

To handle late arriving or backdated data, use descriptive business dates stored in your satellites, such as:

  • Apply Date / Effective Date: When the data became valid in the source system or real world.
  • Last Modified Date: When the record was last changed in the source system.

These business timestamps allow you to create snapshots or temporal views that reflect the true order of events.

3. Timeline Corrections Without an Extended Tracking Satellite

You can correct timelines without adding additional satellites by leveraging the business timestamps stored in your existing satellites:

  1. Create temporal PIT tables or snapshots based on the business timeline, not the load date.
  2. When late-arriving data is detected:
    • Option 1: Rebuild the affected snapshots to include the late data.
    • Option 2: Apply counter transactions to reverse previous measures and apply the updated values.
  3. Always keep the load date unchanged—it only tracks ingestion, not validity.

This approach ensures that your historical reports reflect the correct business sequence without complicating the Raw Vault model.

4. Practical Guidelines

  • Do not order or aggregate data using the load date when interpreting or reporting; always use business dates.
  • Maintain separate timelines:
    • Load Date: Technical, for data ingestion and reproducibility.
    • Business Date: For interpretation, analysis, and handling late arrivals.
  • Rebuild snapshots or use counter transactions as necessary when late data affects measures or aggregates.

Summary

Late arriving data can be handled in Data Vault without adding extra tracking satellites by clearly separating technical and business timelines. Load Date timestamps remain forward-only, while satellites store business dates to drive temporal snapshots and corrections. Using temporal PIT tables, counter transactions, or snapshot rebuilding ensures your analytics reflect the real-world timeline accurately.

Watch the Video

How to Capture CDC Data in Data Vault Satellites

Capture CDC Data

Capturing Change Data Capture (CDC) data in Data Vault can be tricky, especially when the source mostly sends inserts but occasionally produces duplicates or deletions. Understanding how to handle these cases ensures historical accuracy and avoids data inconsistencies in your hubs and satellites.



The Scenario

Consider the following behavior of your source system:

  • Most of the time, rows are insert-only.
  • During initial load, the same row may arrive twice (once in the bulk load and once as an insert within the same batch).
  • Deleted rows may occasionally appear.

These patterns can lead to duplicates if not handled correctly. At first glance, it might look like you need a Non-Historized Link, but duplicates must still be managed properly.

Why Standard Non-Historized Links May Fail

A standard non-historized Link assumes a single row per combination of hubs. When duplicates arrive, either due to CDC or multiple inserts during initial load, the Link cannot naturally distinguish them, leading to primary key conflicts or overwritten data.

A common—but sometimes problematic—solution is adding counter rows to differentiate duplicates. However, this often requires a GROUP BY in the Information Mart, which can cause performance issues, particularly on non-columnar databases.

Recommended Approach: Capture Technical History in Satellites

Instead of modifying the Link, the recommended approach is to handle duplicates in satellites, preserving the raw source events and their arrival order.

Step 1: Use a Satellite with a Load-Date Sequence

For each incoming batch:

  • Assign the CDC load timestamp to the first row of a given parent.
  • If multiple rows for the same parent exist in the batch, increment the timestamp by a small unit (microsecond, millisecond, or nanosecond) for each subsequent row.

This creates a unique ordering of changes while preserving the technical history, without touching the original raw data.

Step 2: Maintain Historical Order

By adding a microsecond increment to the load date for each row:

  • The first row in the CDC batch gets the base timestamp.
  • The second row gets base timestamp + 1 microsecond, the third row +2 microseconds, etc.

This ensures the latest row has the highest load timestamp, which can be used to drive Point-In-Time (PIT) tables and type-1 dimension replacements.

Step 3: Preserve Batch or CDC Metadata

If your CDC source provides a batch ID or subsequence number, include it in the satellite. This allows for:

  • Tracking which records arrived together
  • Reconstructing the technical timeline of changes

If no metadata exists, the microsecond sequencing on the load date is sufficient to order the rows.

Handling Non-Historized Links with Duplicates

In rare cases, a non-historized Link may receive multiple rows for the same key combination. To handle this safely:

  • Extend the alternate key to include the load date (or other sequencing attribute) in the hash key calculation.
  • This ensures each row has a unique primary key without modifying the raw data.

Key points:

  • No need to use counter rows in the raw link.
  • Duplicates are captured and preserved for historical accuracy.
  • Aggregations in PIT or Bridge Tables can be used for reporting, ensuring performance optimization.

Optional: Bridge Tables for Performance

If your Information Mart requires grouping or deduplication and your database struggles with performance:

  • Create a Bridge Table that pre-aggregates or resolves duplicates.
  • The Bridge Table stores only the latest row (or the aggregated result) for reporting.
  • You maintain the raw satellite history in case full lineage or historical reconstruction is needed.

Summary

  • Capture all incoming CDC events in a satellite, including duplicates, without modifying the raw data.
  • Use microsecond increments on the load date to order multiple rows per parent.
  • Include CDC batch metadata if available to preserve groupings and arrival order.
  • For non-historized Links receiving multiple rows, include the load date in the hash calculation.
  • Bridge Tables or PIT tables handle reporting and aggregation efficiently, while maintaining full historical traceability.

This approach preserves auditability, ensures correct historical ordering, and avoids performance issues in the Information Mart.

Watch the Video

dbt Source Freshness: Ensuring Reliable and Timely Data in Your Pipeline

dbt Source Freshness

Data teams rely on timely, accurate, and complete data to support dashboards, KPIs, reporting, and data-driven decision making. But even the most advanced data models and transformation logic cannot fix one critical issue: stale or outdated upstream data. This is where dbt Source Freshness becomes one of the most valuable quality checks in your analytics engineering toolkit.

In this article, we take a close look at what dbt Source Freshness is, why it matters, how it works under the hood, and how you can configure and run freshness checks both locally and in dbt Cloud. If your organization depends on reliable data pipelines—or if you’ve ever discovered too late that a report was built on old data—this guide will help you avoid those costly surprises.



What Is dbt Source Freshness?

Source freshness in dbt is a built-in mechanism that measures how up-to-date data is in your defined source tables. While data transformations can apply logic, aggregations, and business rules, they inherently depend on data arriving on time. If source data is delayed, incomplete, or entirely outdated, every model downstream will reflect that delay.

dbt Source Freshness provides a simple, reliable indicator of whether the data you are working with is fresh enough to support your operational and analytical processes. It helps you answer one crucial question:

“Is the data I’m transforming actually the latest data available?”

When enabling freshness checks, dbt evaluates the most recent timestamp from a specified column in your source table and determines whether that timestamp violates your defined freshness thresholds. These thresholds act as data SLAs for your pipeline.

Why Source Freshness Matters

The importance of monitoring source data freshness cannot be overstated. When upstream data is stale, the consequences cascade throughout your entire analytics ecosystem. Dashboards may show outdated KPIs. Operational teams may make decisions based on incomplete numbers. Forecasts and reports may misrepresent the true state of the business.

One scenario that many data teams have encountered illustrates the problem perfectly: a business report runs on what everyone assumes is the latest data. After a few weeks, the team discovers that the upstream system had stopped updating its tables entirely. What appeared to be fresh data was actually months old. As a result, the report generated incorrect metrics for an extended period.

With source freshness monitoring in place, delays like these can be caught immediately. dbt highlights them clearly, allowing teams to:

  • Detect upstream system failures.
  • Identify delays in ingestion or replication pipelines.
  • Enforce data delivery SLAs with source system owners.
  • Stop inaccurate transformations from running on stale data.

Freshness checks turn what could be a hidden issue into a transparent, actionable signal.

How dbt Source Freshness Works

Source freshness configuration lives directly inside the YAML file where your source is defined. This design decision is intentional—freshness belongs to the source, not to downstream models. Each source or table can have its own customized freshness rules.

A typical source block with freshness configuration looks like this:

sources:
  - name: my_source
    tables:
      - name: orders
        freshness:
          warn_after: {hours: 24}
          error_after: {hours: 48}
        loaded_at_field: updated_at

Let’s break down the key components.

loaded_at_field

This is the timestamp column dbt uses to determine when the most recent record arrived. dbt queries this field, finds the newest timestamp, and calculates its age relative to the current time.

Important: dbt always evaluates freshness in UTC time. If your source system stores local timestamps (e.g., CET, EST), the value in loaded_at_field must be converted to UTC.

Thresholds: warn_after and error_after

Freshness thresholds define what “fresh enough” means. dbt compares the age of the newest record with these time limits and returns one of three statuses:

  • pass – the data is within the acceptable freshness window.
  • warn – the data is late but not critically late.
  • error – the data is beyond the maximum acceptable age.

These thresholds effectively act as SLAs, helping teams formalize expectations about data arrival. For example:

  • Warn after 24 hours.
  • Error after 48 hours.

If the source table hasn’t received new records in over 48 hours, dbt marks the freshness check as an error, signaling that the table is unreliable until updated.

What Happens During a Freshness Check?

When you run a freshness check, dbt performs a straightforward but effective procedure:

  1. dbt queries the loaded_at_field and finds the most recent timestamp.
  2. It calculates the time difference between that timestamp and the current UTC time.
  3. It compares the age of the data to your defined thresholds.
  4. It returns a pass, warn, or error result.

This process is intentionally lightweight and fast. It avoids unnecessary complexity while giving teams a dependable, high-value signal about upstream data timeliness.

How to Run Freshness Checks in dbt

Running a freshness check in dbt is simple. The main command is:

dbt source freshness

This command evaluates freshness for all sources that have freshness configurations defined. You can also target a specific source or table:

dbt source freshness --select source:my_source
dbt source freshness --select source:my_source.orders

When executed, dbt displays the freshness status for each table along with metadata such as:

  • The latest timestamp found.
  • The calculated age of the data.
  • The threshold values used.

Running Freshness Checks in dbt Cloud

dbt Cloud makes managing freshness checks even easier. You can create a dedicated job that runs only freshness checks, or you can add freshness as a step in a larger job. This enables automatic monitoring without requiring manual execution.

Once the job completes, results appear directly in the dbt Cloud UI. For each table, you can see:

  • The age of the most recent record.
  • Whether the table passed, warned, or errored.
  • When the freshness check was last executed.

You can also inspect the detailed logs to understand exactly how dbt evaluated each source.

Why Freshness Checks Should Be a Standard Practice

In modern analytics engineering, data reliability is just as important as transformation logic. Freshness checks are a lightweight yet powerful way to ensure that your source systems are delivering data on time.

Without freshness checks, data issues may go unnoticed until they have already impacted dashboards, stakeholder decisions, or downstream processes. With freshness monitoring enabled, you gain visibility into problems early, allowing your team to respond quickly and prevent incorrect reporting.

As data ecosystems grow more complex—with multiple ingestion pipelines, third-party APIs, and event-based systems—freshness checks provide a simple, standardized way to maintain trust in your data.

Watch the Video

How to Model Multi-Type Business Keys in Data Vault

Multi-Type Business Keys

One of the more subtle challenges in Data Vault modeling appears when a single source delivers multiple types of business keys in one structure. Things get even more complex when those keys:

  • Point to different business objects depending on a type indicator
  • Have different formats than the “real” business keys used in existing hubs
  • Require cleansing or transformation before they can be linked correctly

In this article, we’ll walk through how to model such a scenario in Data Vault, how to think about Links and key types, and—most importantly—where business logic and transformations should live.



The Problem Statement

Let’s restate the scenario in simple terms.

Your source table looks something like this:

| ItemBK | KeyType | KeyValue |
|--------|---------|----------|
| ITEM1  | A       | T123     |
| ITEM1  | B       | 999-45   |

From a business perspective:

  • ItemBK identifies an item (one clear Hub candidate)
  • KeyType determines which business object the key refers to
  • KeyValue contains the actual identifier

The complication comes from the target data model:

  • For KeyType A, the key belongs to Hub A, but the real business key is 123, not T123
  • For KeyType B, the key belongs to Hub B and can be used as-is

So the value T123 must eventually link to Hub A using the cleaned value 123.

This raises three classic Data Vault questions:

  • How should this be modeled from staging to the Raw Data Vault?
  • Should the key type be stored in the Link?
  • Where should key cleansing and transformation happen?

First Principle: Preserve the Source

Before jumping into modeling options, let’s clarify a core Data Vault principle:

The Raw Data Vault must always allow you to recreate the original source delivery.

This means:

  • No irreversible transformations in staging
  • No hidden business rules applied too early
  • Full auditability back to the source

Every modeling decision below is guided by that principle.

Identifying the Core Business Objects

From the example, we can already identify several business concepts:

  • Item → clearly a Hub (Hub Item)
  • KeyValue → a business key, but its meaning depends on KeyType
  • Relationships → Item is related to different business objects depending on KeyType

This is a common pattern: a single source column contains references to different hubs.

Modeling Option 1: Pivot the Source (When Key Types Are Fixed)

If—and this is a big if—you can guarantee that:

  • The number of key types is small
  • The set of key types is stable (e.g. only A and B)

Then a pivoted staging approach can work.

What This Looks Like

You pivot KeyType into separate columns:

| ItemBK | KeyValue_A | KeyValue_B |
|--------|------------|------------|
| ITEM1  | T123       | NULL       |
| ITEM1  | NULL       | 999-45     |

This allows you to:

  • Load Hub Item from ItemBK
  • Load Hub A from KeyValue_A
  • Load Hub B from KeyValue_B
  • Create Links between Item and each Hub

Limitations

This approach breaks down quickly when:

  • A new key type appears
  • Key types are dynamic or unknown
  • The source evolves frequently

In those cases, pivoting creates a brittle staging layer and forces frequent refactoring.

Modeling Option 2: Generic Hub for Typed Keys

When key types are dynamic or extensible, a more robust pattern is required.

In this case, treat the combination of KeyType + KeyValue as a business key.

Create a Generic “Key” Hub

This hub represents “external identifiers” regardless of their semantic meaning.

  • Hub Key
    • Business Key = KeyType + KeyValue

Why composite?

  • Key values may overlap across types
  • Only the combination uniquely identifies the business object

You then:

  • Load Hub Item from ItemBK
  • Load Hub Key from (KeyType, KeyValue)
  • Create a Link between Item and Hub Key

This guarantees:

  • Full preservation of the source
  • No assumptions about future key types
  • Clean separation of concerns

But What About the Existing Hubs (Hub A, Hub B)?

This is where many teams are tempted to inject business logic too early.

Yes, you already have Hub A and Hub B. But the mapping logic that says:

  • “If KeyType = A, this belongs to Hub A”
  • “If KeyType = B, this belongs to Hub B”

is business logic.

Business logic does not belong in the Raw Data Vault.

The Correct Place: Business Vault

In the Business Vault, you can:

  • Create derived business hubs
  • Filter by KeyType
  • Union keys from different raw sources

For example:

  • Business Hub A = existing Hub A UNION keys from Hub Key where KeyType = ‘A’

This approach ensures that:

  • Raw data remains unchanged and auditable
  • Business rules are explicit and versionable
  • Changes in logic don’t force Raw Vault reloads

Should the Key Type Be Stored in the Link?

The answer depends on the modeling option.

Option 1 (Pivoted Source)

No.

Once pivoted, the key type is implicit in the structure. There is no longer a “KeyType” concept in the Link.

Option 2 (Generic Key Hub)

Again, no.

The key type becomes part of the Hub business key. The Link simply connects Item to the generic Key Hub.

If—and only if—key values are guaranteed to be unique across types, an alternative is:

  • Store KeyType in a Satellite on the Hub

But in most real-world cases, using a composite business key is safer.

Where Should Key Cleansing and Transformation Happen?

This is the most important question.

In the example:

  • T123 must become 123 to link to Hub A

This is a business rule.

What Not to Do

  • Do not strip the “T” in staging
  • Do not load cleaned values directly into the Raw Hub

Why?

  • The rule may change
  • You lose the original source value
  • You break auditability

The Recommended Pattern: Same-As Link

Instead, load the raw value T123 as-is and create a Same-As Link.

This link connects:

  • The “raw” business key (T123)
  • The “master” business key (123)

Benefits:

  • Full traceability
  • Explicit business logic
  • Flexible remapping if rules change

Same-As Links are purpose-built for this kind of semantic deduplication.

Putting It All Together

A robust end-to-end solution looks like this:

  • Stage data exactly as delivered
  • Load ItemBK into Hub Item
  • Load (KeyType + KeyValue) into a generic Hub
  • Link Item to the generic Hub
  • Apply business logic in the Business Vault
  • Use Same-As Links to handle key transformations

Key Takeaways

  • Multi-type business keys are common—and tricky
  • Preserve the source structure in the Raw Data Vault
  • Avoid early cleansing and transformation
  • Use composite business keys when uniqueness is not guaranteed
  • Apply business rules in the Business Vault
  • Use Same-As Links for key normalization

Handled correctly, this pattern gives you maximum flexibility, auditability, and long-term stability—exactly what Data Vault was designed for.

Watch the Video

Data Vault Link Naming Convention

Link Naming Convention

One of the most common questions we hear from Data Vault practitioners—especially once a model grows beyond a few hubs and links—is this: how do you define a clear link naming convention, and how do you avoid getting lost in all the different link types when querying current versus historical data? It’s a very practical concern, and if you don’t address it early, your Data Vault can quickly become hard to understand, even for experienced developers.

In this article, we’ll walk through a pragmatic approach to Data Vault link naming conventions. We’ll cover how to name links in a business-friendly way, how to encode technical meaning into the name without overcomplicating it, and how naming helps users understand which links to use for historical data and which are more event- or transaction-oriented.



The First Question: How Do You Name a Link?

Let’s start with the most basic aspect of the question: how do you name a link at all? Imagine a simple business scenario. A customer walks into a store and buys a product. From a Data Vault perspective, this creates a relationship between three business keys: customer, store, and product. This relationship is represented by a link.

Now you have several naming options. You could name the link something like customer_store_product_link, purely describing the hubs it connects. Or you could give it a business-oriented name such as retail_transaction_link or sales_transaction_link.

Our strong recommendation is to prefer business-driven names whenever possible. If the relationship clearly represents a real business concept, then the link should be named after that concept. Calling it a retail transaction link immediately tells business users and developers what this link represents, without forcing them to interpret a technical list of hubs.

That said, there are cases where a business name simply doesn’t exist. Sometimes a link truly represents nothing more than a structural relationship, such as a customer being associated with a store, without a clear transactional or event-based meaning. In those cases, it is perfectly fine to fall back to a composite name like customer_store_link.

The key point here is consistency and intent. If there is a meaningful business name, use it. If not, use a descriptive composite name that clearly shows which hubs are involved.

Ordering Hub Names in Composite Links

When you do use composite names, the next question is often about ordering. Should it be customer_store or store_customer? From a pure Data Vault perspective, links are end-to-end relationships with no inherent direction. Technically, the order does not matter.

However, from a modeling and readability perspective, a consistent ordering rule helps a lot. One practical guideline is to follow the natural business hierarchy if one exists. For example, if you model an industry body and the organizations that belong to it, the industry body is conceptually “above” the organization. In that case, naming the link industrybody_organization_link feels natural and intuitive.

This approach mirrors how people think about hierarchies: from root to leaf, from higher-level concept to lower-level concept. Again, the most important part is not which rule you choose, but that you standardize it across the entire Data Vault.

Standardization Beats Perfection

No naming convention will ever be perfect. What matters far more is that everyone follows the same rules. At Scalefree, for example, we deliberately leave some freedom to developers when choosing link names, with a clear preference for business-oriented naming. At the same time, we provide documented guidelines that explain how to make those decisions consistently.

This balance allows teams to model complex domains without being blocked by overly rigid naming rules, while still keeping the model understandable and navigable.

Dealing with the Many Types of Links

Once you move beyond basic modeling, the real challenge begins: Data Vault doesn’t just have one type of link. Over time, a model accumulates several varieties, each serving a different purpose.

Some of the most common link types include standard links in the Raw Data Vault, business or exploration links in the Business Vault, non-historized links (also known as transactional links in Data Vault 1.0), hierarchical links, same-as links, dependent child links, and even computed or aggregated links used to reuse logic across multiple bridges.

Without a clear convention, these different link types quickly become indistinguishable, making it hard to know which one to use for a specific query or use case.

Prefix or Suffix? Why Suffixes Usually Win

One common design decision is whether to encode link type information as a prefix or a suffix. While prefixes like link_retailtransaction are sometimes used, we generally recommend suffixes instead.

The reason is simple: suffixes group related objects together more naturally. If everything related to a customer starts with “customer”, then the hub, its satellites, and related links appear together when browsing schemas or metadata. This makes the model easier to explore and understand.

For example, a customer hub might be named customer_h, while a satellite could be customer_data_s. Using suffixes ensures that all customer-related entities are visually grouped.

A Compact Suffix-Based Pattern

To keep naming both expressive and machine-readable, we recommend a compact suffix pattern. In this approach, every link ends with an L, indicating that it is a link. Additional single-character markers are added before the L to indicate special link types.

For example, a standard link in the Raw Data Vault simply ends with _l. A non-historized link ends with _nl. A hierarchical link ends with _hl. A same-as link ends with _sl. A dependent child link ends with _dl.

When working with the Business Vault, an additional B is added. A standard Business Vault link ends with _bl. A dependent child link in the Business Vault becomes _bdl. In this scheme, the Raw Data Vault is the default, so there is no explicit “R” marker.

This compact notation might look cryptic at first, but it has significant advantages. It keeps names short, consistent, and easy to parse automatically.

Why Short Codes Are Powerful

Using single-character indicators is not just about aesthetics. It enables powerful automation and governance capabilities. With consistent suffixes, you can use regular expressions on metadata to identify entity types automatically.

For example, you can quickly validate that all dependent child links contain the required hub references, hash keys, and record source attributes. You can also apply automated tests and checks depending on the link type, without relying on manual inspection.

This becomes especially valuable in larger Data Vault environments where hundreds or thousands of entities exist.

Putting It All Together: Naming Examples

Let’s revisit the earlier example of a relationship between customer and store. In the Raw Data Vault, a simple link could be named customer_store_l. If this relationship is modeled in the Business Vault, the name could become customer_store_bl.

If the relationship represents a business transaction and is non-historized, a name like retailtransaction_nl or retailtransaction_bnl (for the Business Vault) clearly communicates both the business meaning and the technical behavior.

These names immediately tell an experienced user what kind of data to expect and how the link should be used.

Which Links to Use for Current vs. Historical Data?

The second part of the original question is just as important: how do you know which links to use when querying current versus historical data?

The key insight here is that almost all Data Vault entities are historical by nature. Links record which relationships have ever existed during the lifetime of the data warehouse. On their own, they usually do not tell you whether a relationship is currently active.

To answer questions about “current” relationships, you typically need additional structures such as effectivity satellites or deletion indicators. These tell you whether a relationship is active, inactive, or deleted at a given point in time.

For example, if customers can change their preferred store, the link captures all store assignments that ever existed. An effectivity satellite tells you which assignment is valid right now.

Standard Links vs. Non-Historized Links

Another important distinction is between long-lived relationships and point-in-time events. Standard links are often used for relationships that exist over a period of time, such as employment, ownership, or assignments.

Non-historized links, on the other hand, are typically used for events, messages, or transactions that occur at a specific moment. A customer purchasing a product, an invoice being issued, or a sensor sending a reading are all examples of point-in-time events.

Even though they are called “non-historized,” these links can still store large volumes of historical data. The name simply reflects that the relationship itself does not persist over time—it happens at a single point in time.

Dependent child links are a special case of this pattern, often used for structures like invoice line items, where detailed data depends on a parent transaction.

Teaching the Model to Users

Finally, no naming convention works unless users understand it. A clear, consistent naming scheme allows you to teach users how to recognize link types, understand their purpose, and choose the right entities for their queries.

Once users know that “_l” means Raw Vault link, “_bl” means Business Vault link, and “_nl” indicates an event-based relationship, they can navigate even large models with confidence.

Final Thoughts

A good Data Vault link naming convention is not about memorizing rules; it’s about reducing cognitive load. Business-driven names improve readability, suffix-based patterns improve structure, and compact codes enable automation and governance.

If you invest the time to define and standardize these conventions early, you will save countless hours later—both for developers and for business users trying to understand and trust your data warehouse.

Watch the Video

Data Vault Link Effectivity

Link Effectivity

One of the most common — and most misunderstood — challenges in Data Vault modeling is how to correctly handle changing relationships. Especially when source systems insert, delete, and even reinsert the same records over time, many teams struggle to answer a simple but critical question:

“How do I reliably determine the latest valid state of a relationship?”

In this article, we will walk through exactly how to model this scenario using Links and Effectivity Satellites in Data Vault, why this pattern is essential, and how it allows you to extract both the current and historical state of relationships in a clean, auditable way.



The Scenario: A Source Table with Changing Relationships

Let’s start by restating the problem in simple terms.

You have a source table with the following structure:

  • COL_1 + COL_2 → Business Key for Hub A
  • COL_3 + COL_4 → Business Key for Hub B
  • COL_5, COL_6 → Descriptive attributes
  • ROW_OPERATION → Insert / Delete indicator

This tells us something important right away:

  • Two business keys appear in the same row
  • Therefore, there is a relationship between those business keys
  • That relationship belongs in a Link

The complexity arises because the operational system:

  • Inserts records
  • Deletes records
  • Sometimes reinserts the exact same records later

All while the actual business keys and attributes remain unchanged.

So how do we model this in a way that preserves history, supports auditing, and still lets us easily answer:

“Which relationships are valid right now?”

Why Links Alone Are Not Enough

In Data Vault, a Link represents the existence of a relationship — not its state.

Once a relationship between Hub A and Hub B exists, the Link record itself is immutable.
It simply tells us:

“At some point in time, this relationship existed.”

But the Link alone cannot tell us:

  • When the relationship became active
  • When it was removed
  • When it was reintroduced

This is where many implementations go wrong. Teams try to encode relationship state directly into the Link or manage it downstream in reporting logic. Neither approach scales, and both break Data Vault principles.

The correct solution is to model the effectivity of the relationship separately.

Enter the Effectivity Satellite

An Effectivity Satellite attached to a Link tracks the physical existence of that relationship over time.

Think of it as a timeline that answers one simple question:

“Is this relationship active or inactive at a given point in time?”

Typical Structure of a Link Effectivity Satellite

  • Link Hash Key
  • Load Date Timestamp (LDTS)
  • Record Source
  • Deletion Timestamp

The deletion timestamp is the key attribute. It defines until when a relationship is considered active.

A common and proven pattern is:

  • Active relationship → Deletion date = 8888-12-31
  • Inactive relationship → Deletion date = actual load timestamp when deletion was detected

(Using 8888-12-31 instead of 9999-12-31 avoids date overflow issues and is widely adopted in practice.)

Walking Through the Lifecycle of a Relationship

Let’s walk through the example step by step.

Day 1: Relationship Is Inserted

The source system delivers a row linking business keys A and B.

  • A Link record (A–B) is created if it does not yet exist
  • An Effectivity Satellite record is inserted
Deletion_Timestamp = 8888-12-31

This means: the relationship is active.

Day 2: Relationship Is Deleted

The source system no longer contains the A–B row.

We do not delete the Link. Instead:

  • A new Effectivity Satellite record is inserted
  • The deletion timestamp is set to the current load timestamp

Now the timeline clearly shows when the relationship ended.

Day 3: Relationship Is Reinserted

The same A–B relationship appears again in the source.

Important point:

  • The Link already exists → no new Link row
  • The Effectivity Satellite needs a new delta
Deletion_Timestamp = 8888-12-31

This marks the relationship as active again — without losing the history of the previous deletion.

Handling Updates: Why Updates Are Two Events

Another area where teams often struggle is updates.

In Data Vault terms:

An update is a deletion of the old version and an insertion of a new version.

For example, if a relationship changes from A–B to A–C:

  • A–C is inserted as a new Link relationship
  • The Effectivity Satellite marks A–C as active
  • The Effectivity Satellite marks A–B as deleted

This ensures that:

  • No ambiguity exists about which relationship is active
  • Historical reporting remains accurate
  • Auditors can trace every change

How to Query the Latest Valid State

Once modeled correctly, querying the current state becomes straightforward.

Current Active Relationships

  1. Join the Link to its Effectivity Satellite
  2. Select the latest satellite record per Link hash key
  3. Filter where Deletion_Timestamp = 8888-12-31

That’s it.

This pattern works consistently across:

  • Current-state reporting
  • Type 1 dimensions
  • Snapshot-based reporting using PIT tables

What About ROW_OPERATION and Valid-From / Valid-To?

It’s important to clearly separate concepts:

ROW_OPERATION

Indicators like Insert/Delete from the source system are:

  • Descriptive metadata
  • Useful for loading logic
  • Not the source of truth for effectivity timelines

They can be stored in a descriptive Satellite, but effectivity is driven by load detection.

Valid-From / Valid-To

Business validity dates are not deletion timelines.

  • They describe business meaning (contracts, subscriptions, agreements)
  • They belong in separate descriptive Satellites
  • They are typically applied in the Business Vault or downstream

Mixing business timelines with technical effectivity is a common modeling mistake.

Why Effectivity Satellites Are Not Optional

A final, often overlooked argument: dirty data.

What happens if:

  • A relationship is accidentally loaded
  • Used in reports
  • Then later corrected or removed

Without an Effectivity Satellite:

  • You cannot explain historical results
  • You cannot prove when data changed
  • You cannot support audit requirements

That’s why, in real-world Data Vault implementations:

  • Most standard Links have Effectivity Satellites
  • Most Hubs have them as well
  • Only a few special cases are exceptions

Key Takeaways

  • Links capture existence, not state
  • Effectivity Satellites capture when relationships are active
  • Deletes and reinserts are modeled as satellite deltas
  • Updates are modeled as delete + insert
  • Current state is derived by filtering on open deletion timelines

Once you adopt this pattern consistently, handling complex relationship lifecycles becomes simple, scalable, and auditable.

Watch the Video

How to Build Data Vault Satellites in Coalesce.io

Data Vault Satellite in Coalesce.io

Building Data Vault structures can feel complex when you first begin working with them, especially when implementing them inside a modern transformation platform such as Coalesce.io. In today’s article, we will walk through the full lifecycle of creating Satellites, from source data to stage layers, hash key generation, private and non-private satellites, and finally V1 satellites that support historization and Data Vault 2.0 best practices.

This guide is based on a hands-on demo scenario using supplier data. The goal is not only to show you how to technically create satellites in Coalesce.io but also to explain why these steps matter and how they fit into the larger Data Vault methodology.



Why Satellites Matter in Data Vault Modeling

In Data Vault, satellites store descriptive attributes about business entities. They sit alongside hubs and links, extending these structures with the contextual information that typically changes over time. Because descriptive data can evolve—names, addresses, account balances, and other attributes—satellites allow us to capture the full history of changes while keeping hubs and links stable and lean.

A typical Data Vault satellite includes:

  • A hub hash key to tie satellite rows to a business key
  • A hash diff to detect attribute changes
  • Descriptive attributes such as names or phone numbers
  • Load timestamps and record source metadata
  • Optional historization fields such as load end date timestamps and an is_current flag

Coalesce.io makes it easier to generate these components through its Data Vault package. The platform handles much of the boilerplate SQL, letting you focus on modeling rather than syntax.

Understanding the Supplier Data Example

In the example used throughout the demo, our source system provides several fields:

  • Supplier key (a numerical unique identifier)
  • Name
  • Address
  • Nation key (used in links)
  • Phone number
  • Account balance
  • Comments

To keep things realistic, imagine the name and address values contain personal identifiable information. Because Data Vault supports privacy-aware modeling, we split the satellite into:

  • A private satellite for sensitive fields like name and address
  • A non-private satellite containing the remaining descriptive data

Separating data in this way supports compliance, access control, and sensitive-data masking—common requirements in real-world deployments.

Step 1: Building the Stage Layer

Before building satellites, we must create a stage table in Coalesce. The stage prepares the data for Data Vault modeling by generating the hash keys and hash diffs we will need later.

Inside Coalesce, we right-click the supplier source node and select Add Node → Stage. The purpose of this stage is to normalize the structure and add the necessary technical metadata.

What we generate in the stage:

  • Hub Hash Key (HK_Supplier): created from the Supplier Key
  • Hash Diff for Private Satellite: created from Name and Address
  • Hash Diff for Non-Private Satellite: created from Phone, Account Balance, and Comment
  • Load Date Timestamp: represents when the data was loaded
  • Record Source: tracks where the data came from

After saving and creating the stage node, Coalesce generates a view containing all the original source data plus the new hash fields and technical metadata. This prepares us to build the satellites cleanly and consistently.

Step 2: Creating the Private Satellite (V0)

Next, we right-click the stage and add a new node, selecting the V0 Satellite from the Data Vault package. This satellite will contain only the sensitive columns.

Private satellite includes:

  • Hub hash key
  • Private hash diff (Name + Address)
  • Name
  • Address
  • Load Date Timestamp
  • Record Source

We remove the non-private columns and keep only what belongs to the private satellite. After configuring the node properties and Data Vault options, we create and run the satellite.

The private satellite now holds the sensitive data, along with the hash diff that allows Coalesce to detect changes over time.

Step 3: Creating the Non-Private Satellite (V0)

We repeat the process for the second satellite, this time focusing on non-sensitive attributes such as phone number, account balance, and comments.

Non-private satellite includes:

  • Hub hash key
  • Non-private hash diff
  • Phone
  • Account Balance
  • Comments
  • Load Date Timestamp
  • Record Source

Once configured and created, this satellite is also loaded and ready for historization.

Step 4: Creating V1 Satellites for Historization

V0 satellites store the raw history and all versions of descriptive data. To simplify querying, Coalesce supports generating a V1 Satellite, which is a view layered on top of the V0 Satellite.

The V1 Satellite adds:

  • Load End Date Timestamp (LEDT): identifies until when a record was valid
  • Is_Current Flag: marks the latest version of each record

These fields allow analysts to easily filter for the “current” state of descriptive attributes or build temporal reports when needed.

Creating a V1 Satellite is straightforward: right-click the V0 Satellite, add a node, and select the V1 Satellite template. Coalesce automatically generates all required SQL and fields.

After running the V1 satellites for both private and non-private data, you now have a complete Data Vault satellite layer: historized, query-friendly, and fully compliant with Data Vault 2.0 standards.

Why This Matters in Real Data Vault Implementations

This workflow demonstrates the core principles of Data Vault modeling:

  • Separation of concerns: Private data stays protected.
  • Change detection through hash diffs: Efficiently track what changed.
  • Historization: V1 satellites provide easy access to current and historical states.
  • Consistent metadata: Load timestamps and record sources support auditability.

Coalesce automates much of the repetitive work required in Data Vault, enabling teams to produce reliable, scalable models faster and with less manual SQL.

Final Thoughts

Creating Data Vault satellites in Coalesce.io becomes a smooth process once you understand the core concepts: preparing the stage, generating hash keys and hash diffs, organizing attributes into private and non-private structures, and finally adding V1 satellites for historization. With Coalesce’s Data Vault package, this modeling pattern becomes not only repeatable but highly efficient.

If you’re new to Data Vault or want to deepen your understanding of how its components work together, consider reviewing a Data Vault handbook or exploring more Coalesce transformation sessions. Each layer builds on the previous one, ultimately forming a flexible, auditable, and future-proof data warehouse architecture.

Watch the Video

What’s New in Data Vault 2.1 and Why It Matters for Modern Data Warehousing

What’s New in the Data Vault 2.1 Training

In the world of data warehousing and business intelligence, the Data Vault methodology has long been a trusted foundation for scalable and agile data architectures. With the release of Data Vault 2.1, the methodology has evolved to address new challenges in modern data environments — from handling semi-structured data to aligning with concepts like Data Mesh and Data Lakehouse.

In this article, we summarize what’s new in Data Vault 2.1 compared to 2.0, what these updates mean for practitioners, and how you can take advantage of the new training materials to become officially certified.



1. A Major Expansion in Content and Learning Resources

One of the most visible improvements in Data Vault 2.1 is the expansion of the official training content. The updated course now includes extensive video material featuring Dan Linstedt himself, who explains and demonstrates key Data Vault principles in depth.

Participants can now spend several hours watching recorded theoretical sessions and hands-on demonstrations. The new format combines the benefits of self-paced learning with the engagement of instructor-led sessions. You can also download the official SQL loading patterns for all Data Vault entities from the Data Vault Alliance (DVA) training portal.

Another highlight is access to the Data Vault Alliance community — a global network of Data Vault practitioners where members exchange best practices, discuss implementations, and share insights from real-world projects.

2. Enhanced Instructor-Led Training Experience

The well-known three-day instructor-led training remains a cornerstone of the certification path, but it has been optimized to deliver even more value. Trainers now dedicate more time to practical case studies, group discussions, and collaborative modeling workshops.

Instead of spending large portions of class time on theory, participants focus on applying concepts to real-world data challenges. Trainers provide direct feedback on Data Vault models, encourage peer review, and help attendees explore different architectural scenarios.

This redesign creates a more interactive, productive learning experience — especially valuable for consultants, data architects, and engineers who want to strengthen their practical Data Vault expertise.

3. Better Preparation for Certification

Preparing for the official Certified Data Vault 2.1 Practitioner (CDVP2.1) exam is now easier and more structured. The course includes integrated live quizzes during training sessions, allowing participants to test their understanding and interact directly with the instructor.

In addition, a practice exam has been introduced to help you assess your readiness before attempting the final certification. This makes it easier to identify knowledge gaps and feel confident on exam day.

4. Dealing with JSON and Semi-Structured Data

One of the most exciting updates in Data Vault 2.1 is the new module on handling JSON data and other semi-structured sources. As modern data platforms increasingly deal with variable data structures, the methodology now provides clear guidance for integrating such data efficiently.

The course introduces a set of rules and best practices for balancing performance, flexibility, and complexity. You’ll learn when to apply a schema-on-read approach instead of schema-on-write, how to maintain stability as source structures evolve, and how to preserve governance and traceability in semi-structured environments.

Dan Linstedt often refers to this as the “JSON Dilemma” — the challenge of maximizing flexibility without sacrificing performance or clarity. Data Vault 2.1 equips you with the methodology and patterns to solve that dilemma effectively.

5. Stronger Differentiation Between Logical and Physical Modeling

Another core enhancement in Data Vault 2.1 is the clearer separation between logical and physical modeling. While Data Vault 2.0 touched on this concept, version 2.1 makes it explicit: the logical model represents the business concept, while the physical model depends on the underlying technology and performance needs.

For example, on some platforms normalization works best, while on others (such as document-oriented databases), denormalization might be more efficient. The physical implementation should adapt to these realities — but the logical model remains consistent as the blueprint for the business layer.

This separation provides greater flexibility to evolve with technology without compromising the integrity of the business model. It also helps teams align architecture decisions with specific database or cloud platform requirements.

6. Introducing Ontologies and Taxonomies

In line with the growing emphasis on semantic data integration, Data Vault 2.1 introduces the use of ontologies and taxonomies as essential tools for business modeling. These concepts allow organizations to connect business terms, hierarchies, and relationships in a way that supports consistent data integration across departments and systems.

By embedding ontologies and taxonomies into the modeling process, organizations can improve data understanding, reduce ambiguity, and strengthen the link between data structures and business meaning.

7. Extended Business Key Collision Code Concept

The Business Key Collision Code concept has been extended in Data Vault 2.1 to better support cross-system integration. This improvement helps resolve conflicts that arise when business keys overlap or differ across systems — a common challenge in enterprise data integration.

With enhanced rules and examples, the training now guides you through best practices for identifying, classifying, and merging business keys, ensuring a consistent, high-quality data foundation.

8. Merging Satellites Without PIT Tables

Data Vault 2.1 introduces new approaches for handling historical data when traditional Point-In-Time (PIT) tables or snapshot techniques are not required. In cases where you need to maintain very long data histories or join multiple satellites describing the same business object, version 2.1 outlines methods for merging satellites without relying on PIT tables.

This allows for greater flexibility in data retrieval strategies and helps optimize performance in long-term historical scenarios.

9. Alignment with Modern Industry Terminology

To stay relevant with the evolving data landscape, Data Vault 2.1 integrates current industry concepts such as Data Mesh, Data Fabric, and Data Lakehouse. These paradigms are mapped to the Data Vault framework, demonstrating how the methodology fits within modern data architectures.

This update ensures that Data Vault practitioners can easily connect the methodology to the broader trends and technologies shaping the data industry today.

10. Unlock the Full Potential with Constructor-Led Training

If you’re ready to deepen your knowledge and apply these updates in practice, the constructor-led Data Vault 2.1 training offered by Scalefree is the next step. This hands-on training combines theoretical knowledge, real-world exercises, and guided discussions to help you implement Data Vault successfully in your organization.

Visit the training page to find more information, view upcoming training dates, and begin your journey toward CDVP2.1 certification.

Final Thoughts

Data Vault 2.1 represents a significant step forward for data professionals seeking a future-proof methodology. With improved training content, better integration of semi-structured data, a sharper focus on modeling concepts, and alignment with modern architectural trends, Data Vault continues to be a robust choice for building scalable, flexible, and business-aligned data warehouses.

Whether you are transitioning from Data Vault 2.0 or starting fresh, the new version provides the tools, knowledge, and community support to take your data architecture to the next level.

Watch the Video

Is the Data Warehouse Dead?

From Data Warehouse to Data Platform

Every few years, a new buzzword hits the data industry — and suddenly, the tools and methods we’ve relied on for decades are declared obsolete. Today, that target seems to be the data warehouse. Blogs and conferences proclaim its death, replaced by the data lake, data lakehouse, or even the elusive “data mesh.” But is the data warehouse really dead? Or has it simply evolved into something new?



The “Death” of the Data Warehouse: Where It All Began

For years, the data warehouse has been the foundation of enterprise analytics. It provided a structured, trusted, and governed environment where business data could be collected, cleansed, and analyzed. However, as data volumes exploded and new types of unstructured data emerged, traditional warehouses started showing their age.

Slow ETL processes, rigid schemas, and scalability issues led many to look for alternatives. Enter the data lake — a more flexible, schema-on-read environment that could store raw, unstructured data cheaply and at scale. Suddenly, the industry narrative shifted: data lakes were the future, and warehouses were history.

But as many organizations soon learned, simply dumping everything into a lake didn’t magically solve all their problems. Without governance, context, and structure, data lakes quickly turned into data swamps — massive pools of untrustworthy, undocumented information. And that’s when the story started to change again.

From Data Warehouse to Data Platform

From Warehouse vs. Lake to Warehouse + Lake

The debate shouldn’t be “data warehouse or data lake?” but rather “how do we combine them effectively?” Each serves a different purpose, and modern data platforms are proving that the most successful architectures leverage both.

The data lake is perfect for collecting raw, varied, and large-scale data — structured, semi-structured, or unstructured. It enables exploration, data science, and machine learning. But the data warehouse is still essential for delivering consistent, trusted, and audited data for business reporting and regulatory needs.

As one of our experts put it, the data lake can act as the source system for the data warehouse. The lake is where all data lands. The warehouse sits on top — a refined, curated layer where the most critical data is modeled, governed, and exposed to business users. Together, they form the backbone of a modern data platform.

Why the Data Warehouse Still Matters

Despite the hype around newer architectures, data warehouses provide several key benefits that data lakes alone can’t match:

  • Data Quality: Warehouses enforce rules and transformations that ensure accuracy and consistency across business domains.
  • Auditability and Compliance: Especially in industries governed by GDPR, HIPAA, or SOX, traceability is non-negotiable — something data warehouses excel at.
  • Performance and Optimization: Data warehouses are designed for analytical workloads and provide fast query performance on structured data.
  • Trust: Business users need reliable, validated data for decision-making. Data warehouses remain the single source of truth for that.

So no, the warehouse isn’t dead. It’s simply no longer alone.

Adapting to New Requirements: The Rise of Data Platforms

What has changed, however, is how organizations think about architecture. We’ve moved away from seeing data warehousing as a single monolithic system. Instead, the focus is now on building data platforms — unified ecosystems that combine the strengths of data lakes, data warehouses, and modern cloud technologies.

In this model, the data lake is used as an ingestion and exploration layer, capturing data from across the enterprise. The warehouse, meanwhile, becomes a downstream layer that provides refined, high-quality, and business-ready datasets.

This layered approach is often seen in Data Vault 2.0 architectures. The raw data is first stored in the lake (the “landing zone”), then structured into a raw vault for traceability, and finally transformed into a business vault for analytics and reporting. This methodology blends the flexibility of a lake with the governance of a warehouse — a best-of-both-worlds approach.

AI, Machine Learning, and the New Data Landscape

Another reason the “data warehouse is dead” narrative persists is the rise of AI and machine learning. These applications demand vast quantities of raw and semi-structured data — something traditional warehouses weren’t built to handle efficiently. However, this doesn’t mean warehouses are obsolete; it means they play a different role.

In AI-driven organizations, data scientists use the lake to experiment and train models. Once insights are validated, curated datasets are pushed into the warehouse to ensure they’re governed, standardized, and auditable. This workflow creates a feedback loop between the lake and the warehouse, ensuring agility without sacrificing control.

Modern data warehouses, especially cloud-native ones like Snowflake, Azure Synapse, and Google BigQuery, have also evolved. They now support semi-structured data, elastic scalability, and real-time processing — bridging the gap between lakes and traditional warehouses.

Lessons from the Field: It’s Not About Technology, It’s About Strategy

When companies struggle with data warehousing, it’s rarely because of the technology itself. More often, it’s about poor design, lack of governance, or outdated processes. As many experienced data engineers know, legacy warehouses often become complex, undocumented systems — “historically grown” solutions that no one fully understands.

The real issue isn’t whether to abandon the warehouse. It’s about how to modernize it. That means introducing automation, adopting agile data modeling techniques, and leveraging modern tools that eliminate manual maintenance work.

It also means changing the way organizations think about data. Instead of treating governance as a roadblock, teams should see it as a foundation for scalability. Instead of building massive, inflexible ETL pipelines, they should adopt modular data vault or ELT-based approaches that evolve as business needs change.

Practical Takeaways for Modern Data Teams

  • Stop chasing buzzwords. Data lakes, meshes, and fabrics are valuable, but none are silver bullets. Understand the business problem first.
  • Combine technologies strategically. Use data lakes for exploration and AI, data warehouses for governance and trust.
  • Modernize your warehouse, don’t replace it. Adopt cloud platforms and automation to remove legacy bottlenecks.
  • Think in terms of platforms. Build an integrated data ecosystem instead of disconnected tools.
  • Embrace continuous evolution. The future of data is hybrid, agile, and adaptive — not one-size-fits-all.

Conclusion: The Data Warehouse Is Evolving — Not Dead

The data warehouse isn’t a relic of the past. It’s a vital component of the modern data platform. What’s changing is the way we design, use, and integrate it. By combining the strengths of data lakes and warehouses, organizations can unlock the full potential of their data — balancing flexibility with governance, and innovation with reliability.

The future of data architecture isn’t about replacing one system with another. It’s about convergence. The warehouse, the lake, the lakehouse — all of them are part of a single, connected platform designed to empower both business users and data scientists. So no, the data warehouse isn’t dead. It’s alive, evolving, and more relevant than ever.

Watch the Video

How to Publish Power BI Reports with the dbt Semantic Layer

Publishing Power BI Reports with the dbt Semantic Layer

Welcome back to our two-part guide on connecting Power BI to the dbt Semantic Layer. In Part 1, we demonstrated how to connect Power BI Desktop to dbt Cloud and build a simple dashboard using semantic metrics. That was just the beginning.

In this article, we continue the journey by publishing that report to Power BI Service and configuring the On-premises Data Gateway to ensure the connection to the dbt Semantic Layer remains alive. This is a crucial step for moving from a personal development workflow to a shared, production-ready BI environment.



Why Power BI Service Needs a Gateway

When you publish a Power BI Desktop report connected to the dbt Semantic Layer, the connection details don’t automatically carry over to Power BI Service. Instead, you’ll likely see an error message such as “The model cannot be loaded”. This happens because:

  • Power BI Service cannot directly use the desktop connection credentials.
  • A secure, always-online connection is required to keep queries alive.

The solution is to use the On-premises Data Gateway. This component securely bridges Power BI Service with your data sources—whether on-premises or cloud-based—so your dashboards can refresh seamlessly.

Step 1: Publish the Report to Power BI Service

We’ll start where we left off in Part 1, with a working report in Power BI Desktop named DEMO – dbt Semantic Layer.

  1. Open the report in Power BI Desktop.
  2. Click the Publish button.
  3. Choose a workspace (for this demo, we’ll use “My workspace”).
  4. Open Power BI Service and locate the published report.

At this stage, you’ll notice that the report cannot load data. This is expected and is exactly why we need to configure the gateway.

Step 2: Install the On-Premises Data Gateway

To enable the connection, download and install the On-premises Data Gateway in Standard mode. The dbt Semantic Layer connector is not supported in Personal mode at this time.

Download link: Get the Power BI Gateway (Microsoft).

During installation:

  • Choose Standard mode.
  • Follow the guided steps and complete the setup.
  • Launch the gateway after installation.
  • Sign in with your Power BI Service account to register the gateway.

At this point, your gateway should show as online and ready to use.

Step 3: Configure Custom Connector Support

The dbt Semantic Layer connection is enabled through a custom Power BI connector. This requires a manual file copy:

  1. Locate the connector file on your machine. By default, it is stored in:
    C:\Users\<YourUsername>\Documents\Power BI Desktop\Custom Connectors\dbtSemanticLayer.pqx
  2. Copy this file into the gateway’s custom connector directory:
    C:\Windows\ServiceProfiles\PBIEgwService\Documents\Power BI Desktop\Custom Connectors
  3. Restart the gateway if needed. The connector should now be detected under the Connectors section of the gateway.

This step is critical—without copying the connector, the gateway will not recognize the dbt Semantic Layer.

Step 4: Configure the Gateway in Power BI Service

With the gateway online and the custom connector installed, it’s time to configure the connection inside Power BI Service:

  1. In Power BI Service, navigate to Settings → Manage Connections and Gateways.
  2. Select your gateway from the list of available gateways.
  3. Go to the gateway Settings and allow the use of Custom Connectors. Without this option, the connection will fail.
  4. Next, go to Settings → Power BI Settings → Semantic Models.
  5. Locate your published report (DEMO – dbt Semantic Layer).
  6. Under Gateway and Cloud Connections, add your connector to the gateway and provide the required host, environment ID, and service token values.

Once these steps are complete, your Power BI Service report should start loading live data from the dbt Semantic Layer.

Step 5: Validate the Connection

Open your report in Power BI Service. This time, instead of the “model cannot be loaded” error, you should see the data loading and the dashboard visuals appearing. This confirms that the gateway is successfully maintaining the connection to the dbt Semantic Layer.

Summary of the Workflow

Let’s recap what we’ve achieved across both parts of this series:

  1. Part 1: Connected Power BI Desktop to the dbt Semantic Layer and built a working dashboard using semantic metrics.
  2. Part 2: Published the report to Power BI Service, configured the On-premises Data Gateway in Standard mode, enabled custom connectors, and restored connectivity to the dbt Semantic Layer.

The end result? A Power BI dashboard that queries dbt metrics live, end-to-end, even in the Service environment.

Best Practices

While the demo used a local laptop, in production environments the On-premises Data Gateway should be installed on a reliable, always-on machine, such as:

  • An on-premises server.
  • A cloud-hosted virtual machine (e.g., Azure VM, Amazon EC2).

This ensures the connection remains stable, and report refreshes happen as scheduled without interruption.

Conclusion

Connecting Power BI Service to the dbt Semantic Layer is a powerful way to extend governed, consistent metrics to your entire organization. By configuring the On-premises Data Gateway and enabling custom connectors, you ensure that your published reports continue to deliver real-time insights based on dbt’s semantic definitions.

If this guide was helpful, make sure to check out Part 1 if you haven’t already, and stay tuned for more deep dives into the dbt platform and BI integrations.

Additional Resources

Watch the Video

Data Governance Made Simple with dbt Platform

Data Contract Data Pipeline

Why governance is important when working with data products

As data analytics grows within a company, it becomes more important to prevent changes in one part of the system from breaking things in another. This is important for reports and dashboards that depend on shared dbt models. To manage this growing complexity, dbt now offers key data governance features like

  • model contracts,
  • model versioning, and
  • access control.

These tools help ensure data remains consistent and reliable as projects scale. By using these features, data teams can work together more smoothly, avoid surprises from model changes, and improve overall data quality and trust across the business.

This article takes a closer look at what these features do, why they matter, and how they can support smoother collaboration and stronger data quality across the business.

Data Governance Made Simple with dbt Platform

Scaling your data projects shouldn’t mean sacrificing reliability. This session will tackle the critical issue of data governance in dbt, showing you how to stop data chaos and prevent upstream changes from breaking downstream models. You’ll learn to implement dbt features like Model Contracts, Versioning, and Access Control with a practical, hands-on demonstration. Register for our free webinar, December 4th, 2025!

Register Me Now
Data Contract Data Pipeline

Figure 1: Data contracts act as a safeguard across the entire data pipeline.

Imagine this scenario: A data team has a pipeline to load customer data into a warehouse, and the marketing team relies on it for campaigns. So the downstream consumer, in this case the marketing team, expects fields like customer_name, but an upstream change renames this field to customer_full_name. The marketing team tries to consume this break without anyone immediately noticing. The result? The query would fail since the renamed field is not known to the downstream team. Without a dbt contract, this kind of schema change slips through and causes these downstream failures. When contracts are implemented, however, such a change would trigger an alert or validation error in the pipeline, catching the discrepancy between expected and actual data and preventing the issue.

Data Contracts – Providing Schema Consistency

Data contracts in dbt are a way to ensure what a model is expected to produce. In other words: a contract defines the exact structure of the output table – including which columns are present, what types of data they are or other structural rules. It’s similar to how API contracts work in software: downstream users can count on a consistent format, and if something breaks that expectation, dbt throws an error or a warning during runtime instead of letting flawed data flow further downstream.

With a contract in place, teams can define:

  • Which columns must be present – and block unexpected extras or missing fields.
  • The type of data (e.g. strings, integers, timestamps).
  • Null rules or other constraints, like if a field is allowed to contain null values or can be unique.

These rules are defined in a separate YAML schema file (not directly in the SQL model). For example, a model called orders might have a contract that lists all required columns and their data types, ensuring that every time the model runs, the structure matches what’s been agreed:

models:
- name: orders
  config:
    contract:
      enforced: true
  columns:
    - name: order_id
      data_type: integer
      constraints:
        - type: not_null
    - name: customer_id
      data_type: integer
    - name: order_date
      data_type: date

Once the variable enforced: true (line 5) is set on a model, dbt will validate the output of the SQL before the model is built. It checks that the result includes exactly the expected columns, in the right data types – no extras, no missing fields, no mismatches. If something does not align, dbt throws an error or warning and stops the process before the issue can affect anything downstream. This gives teams a safety net against accidental changes to a model’s structure – whether that’s someone renaming a column, dropping it, or introducing a mismatch in data types.

From a business perspective, this adds a layer of reliability. Teams working with downstream dashboards, reports, or models can rely on a consistent structure without fear that a seemingly small upstream change will quietly break their work. Take the example of a renamed field in a customer model: with a contract in place, that kind of change would’ve been caught during development or CI (Continuous Integration) – before it ever reached production and disrupted reporting. Data contracts bring the discipline of software development (similar to API interface contracts) into analytics work, helping everyone stay aligned on how the data is structured. That is especially valuable in environments where many teams are working from shared models – a clear contract reduces confusion and avoids broken pipelines.

Tip

Not every model needs a contract. It’s good to start with critical, high-impact models (those feeding important dashboards or reports). Applying contracts does add some overhead and rigidity, so dbt’s guidance is to ensure your project is sufficiently mature and the model’s schema is relatively stable before enforcing a contract.

Model Versioning – Managing Change Gracefully

In the previous section it became clear that contracts are able to keep the model’s structure. Even with contracts in place, there are situations where besides the model’s structure, the logic needs to be changed. This could go beyond renaming/removing columns or changes in data types and more into e.g. changing the calculation of columns. On a small team, you could probably just make the change and tell everyone to update their queries. But in a larger organization, that kind of quick shift is risky – one change could break a lot of downstream applications like dashboards or other downstream models.

That’s where model versioning comes in (introduced in dbt v1.5+). Model versions bring structure to model evolution. Instead of forcing immediate changes, versioning allows multiple versions of a model to live side by side, giving teams time to transition.

Here’s how it works:

  1. Create a new version of the model: You start by creating a new version of the model – normally by just duplicating the file with a _v2 suffix – and make your changes there.
  2. In the model’s YAML config file, you declare all existing versions and mark which one is currently latest.
  3. Test the new version: During testing, both versions can be run in production. For testing purposes, consumers can point to a specific version explicitly.
  4. Promote the new version: Once the adjustments are done, the YAML can be updated to make it the latest one.
  5. Deprecate and remove the old version: Eventually, the old version is deprecated. You can even set a deprecation_date so it’s clear when it will be removed for good. But to be clear, the deprecation_date is just an indicator/piece of information during runtime, the model won’t be deactivated automatically after the date passes.

This approach helps teams migrate in a controlled way. Instead of breaking queries over night, there’s a shared window where both old and new versions exist. Teams have time to update at their own pace, and model authors don’t need to hold back on necessary improvements. It’s a clean parallel to versioned APIs – the idea being that a dbt model, once shared, is similiar to a public interface others rely on.

By treating models as versioned products, data teams avoid having broken queries and are able to regain control over how changes are introduced. It makes collaboration safer and more sustainable, especially when many teams depend on shared upstream models. Therefore, model versioning brings real change management to the day to day work with data.

Note: Model versioning is a concept meant for mature datasets that have multiple dependents. If your project is small or models change rapidly in early development, you might not need to formally version every change. At smaller scale, it’s often acceptable that downstream analyses update in lockstep with model changes. But as you “scale up” the number of models, consumers, or even adopt a data mesh approach with multiple projects, versioning becomes indispensable for stability.

Access Control – Modularizing and Securing Your Models

The third pillar of dbt governance is access control for models. In complex organizations, it’s possible that not every data model is meant to be used everywhere by everyone. Some models are intermediate or meant for a specific team, while others are data products meant for wider consumption. To define what data is accessible for who (ref() function), dbt now allows access levels for each model.

To enable this, dbt introduces the concept of groups and access modifiers:

  • Groups: Models can be organized into groups and assign each group an owner. Groups are essentially labels for a set of models that share a logical theme or ownership. This helps turn implicit relationships into explicit boundaries – leading to a clearer model ownership.
  • Access modifiers: Each model can be marked as private , protected , or public to indicate its accessibility to other models.
    • Private: only other models in the same group can reference it. It’s hidden from the rest of the project.
    • Protected: (default) any model within the same project can reference it, but models in other projects cannot. This is how models behaved historically in dbt – accessible project-wide but not exposed to the outside by default.
    • Public: any model, even in other projects (or installed packages), can reference it. This explicitly marks the model as part of a public interface for cross-project use.

By default, models are considered protected for backward compatibility. Using groups and access modifiers together, you can make certain models truly private to a specific team while safely sharing others as needed. For example, a finance team’s intermediate calculation model might be tagged private to finance, so only finance models can use it. If a marketing model tried to ref(‘finance_model’) that was private, dbt would throw an error during parsing, blocking the unauthorized dependency. Meanwhile, a carefully designed model of “Customers” might be marked public so that it can be referenced by models in any project across the company.

Why is this valuable?

It enforces modularity and ownership. Teams can develop models without fear that another team will depend on their internal logic. It prevents the problem, that anything can depend on anything, which becomes hard to manage. Instead, only well defined public models become the integration points between different groups or projects. This kind of access governance improves security (sensitive data can be kept in private models) and maintainability, since changes to a private model won’t ripple out beyond its group.

Governance vs. User Permissions: It’s important to differentiate between model access as described above and user-level permissions. The features that was discussed is about design-time and run-time control of model dependencies in dbt projects. In practice, this is used in conjunction with your data warehouse’s security (and dbt Cloud’s user roles) to ensure only the right people or tools can run or query these models. Model governance is about structuring the project for safe collaboration; it complements (but does not replace) standard data access controls.

Conclusion – Toward Reliable and Scalable Data Projects

In summary, dbt’s governance capabilities like contracts, versioning, and access control provide a framework for reliable, scalable data transformation workflows. They bring proven software engineering principles into the analytics engineering realm:

  • Data contracts ensure that upstream changes don’t unknowingly break downstream models/consumers by enforcing schemas and data integrity rules at build time.
  • Model versioning treats important datasets as stable interfaces, allowing teams to implement improvements without sudden disruptions – enabling change with graceful deprecation instead of chaos.
  • Access control (with groups and private/protected/public models) modularizes your project, so teams can work peacefully within their domain while providing clear interfaces and preventing unintended coupling between projects.

Adopting these practices means fewer hotfixes caused by broken pipelines. As one blog put it, when scaling up dbt or moving towards a multi-team data mesh, these governance features become non-negotiable – they let you treat your data models “like products: stable, predictable, and version-controlled”. In short, they help you scale with confidence.

Finally, it’s important that governance features should be introduced thoughtfully. It’s possible to overengineer too early – adding contracts or strict version control to models that are still rapidly evolving may slow the process down. The best approach is to apply these tools to the most critical and stable parts of your data model, and expand as the project needs grow.

Close Menu