How Do You Model External Business Logic In Data Vault?

Modeling External Business Logic in Data Vault: APIs, Scripts, and Source System Thinking

A question that comes up regularly in Data Vault training is how to handle external business logic — specifically, what happens when your data pipeline includes a call to an external API or service that returns enriched or cleansed data. Where does that fit in the model? How do you capture the response? And how do you integrate an external script cleanly into your enterprise data platform? This post walks through a concrete example: address cleansing via an external REST API.



Modeling External Business Logic: The Full Flow

The scenario starts simply enough. You have CRM data — let’s say customer records with addresses — that gets staged and broken down into the Raw Data Vault in the usual way: Hubs for business concepts, Satellites for descriptive attributes. The raw address from the CRM system lands in a Satellite.

Now comes the complication. You need to cleanse and standardize those addresses using an external REST API. A Python script handles the call: it pulls data from the platform, formats it into the required input — perhaps a single string or a calculated key — and sends it to the external service. The service returns a JSON response with the standardized address and additional metadata.

This flow touches several layers of the Data Vault architecture, and each layer has a distinct role.

The Business Vault Prepares the API Call

Before the external call can be made, the Business Vault does preparatory work. If the REST API requires the address in a specific format or needs a calculated key, that computation belongs in the Business Vault — it’s business logic, applied to raw data, to produce the input for an external process.

The external Python script then queries this prepared data — either directly from the Business Vault or via an Interface Mart (more on that below) — and performs the REST call. The script itself may be under version control and within your organization’s control. The external service is not.
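To make the preparation step concrete, here is a rough sketch of a Business Vault view that builds the API input string. All entity and column names (sat_crm_address, bv_address_api_input, the individual address parts) are illustrative assumptions, not a prescribed structure:

-- Business Vault view preparing the input for the external REST call:
-- concatenate the raw CRM address parts into the single string the service expects.
CREATE VIEW bv_address_api_input AS
SELECT
    s.hk_address,
    CONCAT_WS(', ', s.street, s.house_number, s.zip_code, s.city, s.country) AS api_request_string
FROM sat_crm_address s
WHERE s.ldts = (
    SELECT MAX(ldts) FROM sat_crm_address s2
    WHERE s2.hk_address = s.hk_address   -- latest Satellite version only
);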

Treat the External Service as a Source System

This is the key modeling decision: because the external API is outside your control, you treat its responses exactly as you would treat any other source system. You don’t trust it implicitly. You stage its output and break it into the Raw Data Vault.

If your Raw Data Vault already has an Address Hub from the CRM dataset, and the external service returns identifiers that qualify as business keys — unique, stable identifiers for addresses — those can be added to the Address Hub. The JSON response from the API then gets captured in a Satellite in the Raw Data Vault, associated with the appropriate Hub.

This approach gives you a clean audit trail. You know exactly what the external service returned, when it returned it, and what key was used to make the call. If the external service changes its response structure or returns unexpected data, your Raw Data Vault captures that reality as-is, and your downstream Business Vault logic handles interpretation.

Handling JSON Responses: Two Practical Options

API responses typically come back as JSON — sometimes well-structured, sometimes semi-structured with varying schemas between messages. There are two main approaches for capturing this in the Raw Data Vault, and the right choice depends on how structured the response is and how many attributes you actually need.

Option 1 — Extract what you need, keep the rest as JSON. If the JSON is relatively consistent and you only need a subset of its attributes — say, five out of fifty — extract those five into relational columns in the Satellite. Keep the full JSON (or the remaining payload) as a JSON or JSONB attribute in the same Satellite. You get fast, typed access to the attributes you use regularly, and the full document is available for future needs without requiring a reload.

Option 2 — Keep everything in JSON, extract in the Business Vault. If you’re unsure which attributes you’ll need, or if the structure varies, capture the raw JSON in the Satellite and handle extraction later in the Business Vault. Technically, extracting fields from JSON is a structural transformation — a hard rule, not a business rule — so it could sit in the Raw Data Vault. But if the extraction is straightforward and tied to specific downstream calculations, doing it in the Business Vault view is a reasonable and common practice.

In practice, the hybrid approach from Option 1 is most common: extract the attributes you know you need into relational columns, keep the JSON alongside them. When a new attribute is needed later — and it will be — you can pull it directly from the JSON in your Business Vault view using native JSON functions, without touching the Raw Data Vault or reloading any data.
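A minimal sketch of this hybrid Satellite and a later extraction, assuming PostgreSQL-style JSONB (JSON syntax differs per platform) and hypothetical names:

-- Raw Data Vault Satellite for the API response: a few typed columns plus the full payload.
CREATE TABLE sat_address_api_response (
    hk_address       CHAR(32)     NOT NULL,
    ldts             TIMESTAMP    NOT NULL,
    rsrc             VARCHAR(100) NOT NULL,
    std_street       VARCHAR(200),           -- extracted, regularly used
    std_city         VARCHAR(100),
    std_zip_code     VARCHAR(20),
    response_payload JSONB,                  -- full JSON kept for future needs
    PRIMARY KEY (hk_address, ldts)
);

-- Later, a new attribute is pulled straight from the JSON in a Business Vault view,
-- without touching or reloading the Raw Data Vault.
CREATE VIEW bv_address_geocode AS
SELECT
    hk_address,
    ldts,
    (response_payload ->> 'latitude')::NUMERIC  AS latitude,
    (response_payload ->> 'longitude')::NUMERIC AS longitude
FROM sat_address_api_response;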

Integrating the External Script: Dependencies and Interface Marts

When an external script queries your data platform — whether from the Raw Data Vault or the Business Vault — it creates a dependency. The entities that script relies on can’t be freely refactored without risking a broken integration. This is worth flagging explicitly in your metadata: mark those entities as part of the operational vault, indicating that external applications depend on them.

A cleaner long-term solution is to introduce an Interface Mart — a stable, versioned view layer that the external script queries instead of the Raw or Business Vault directly. When you refactor a Satellite or restructure a Business Vault entity, you update the Interface Mart view to maintain the same output structure. The external script sees no change. This decouples your internal model evolution from external integrations, which is especially valuable in organizations where multiple scripts and applications consume data from the platform.
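As a sketch, an Interface Mart can be as simple as a view with a guaranteed column list; the names below are hypothetical and reuse the example entities from above:

-- Interface Mart view consumed by the external cleansing script. When the underlying
-- Business Vault or Raw Data Vault entities are refactored, only this view definition
-- changes; the output structure the script depends on stays stable.
CREATE VIEW im_address_cleansing_input AS
SELECT
    i.hk_address         AS address_key,
    i.api_request_string AS address_string
FROM bv_address_api_input i;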

Combining Two Sources in the Business Vault

At this point, you have two sources describing the same concept: the CRM system with the original, non-standardized address, and the external address standardizer with the cleansed version. Both are captured in the Raw Data Vault as separate source inputs. The Business Vault is where you bring them together.

The pre-computed key used to make the API call serves as the joining mechanism. Based on that key, you can establish a relationship — via a Link or a direct join in a Business Vault view — between the raw CRM address and the standardized version returned by the external service. The Business Vault then exposes the combined, cleansed address data to downstream consumers: reports, dashboards, or further downstream application scripts.
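A hedged sketch of such a Business Vault view, using the same hypothetical entities as above and ignoring the handling of multiple Satellite versions for brevity:

-- Combine the raw CRM address with the standardized version returned by the
-- external service, joined on the hub hash key used to make the API call.
CREATE VIEW bv_address_cleansed AS
SELECT
    crm.hk_address,
    crm.street                            AS crm_street,
    crm.city                              AS crm_city,
    api.std_street,
    api.std_city,
    api.std_zip_code,
    COALESCE(api.std_street, crm.street)  AS best_street   -- prefer the cleansed value
FROM sat_crm_address crm
LEFT JOIN sat_address_api_response api
       ON api.hk_address = crm.hk_address;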

The exact modeling decisions at this stage depend heavily on how the CRM data is structured and what the business actually needs from the cleansed address. But the principle holds regardless: raw inputs from both the CRM and the external API live in the Raw Data Vault; the logic that combines and interprets them lives in the Business Vault.

A Pattern Worth Generalizing

Address cleansing is one example, but the same pattern applies to any external enrichment service: geocoding APIs, credit scoring services, entity resolution services, tax calculation engines. Whenever your pipeline includes a call to an external system that returns data you need to capture and use, the approach is the same — treat the response as a source, stage it, load it into the Raw Data Vault, and apply interpretation and combination logic in the Business Vault.

It’s also worth noting that this pattern integrates naturally into data-driven organizations where information is consumed not just through reports and dashboards but through application scripts and automated processes. The enterprise data platform becomes a hub for both analytical and operational consumers — and Data Vault’s layered architecture handles both cleanly.

To explore these patterns in depth — including Business Vault design, Interface Marts, and integrating external sources — check out our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical copy or ebook for a solid introduction to the core methodology.

Watch the Video

Capturing Changing Inventory Levels in Data Vault

Capturing Changing Inventory Levels in Data Vault with Non-Historized Links

Inventory data presents a specific challenge in Data Vault modeling: it arrives as snapshots, it can be corrected retroactively, and deletions need to be traceable back to a specific snapshot date. A user recently proposed a solution using a multi-active Snapshot Satellite — a reasonable starting point — but the question of how to handle corrections and logical deletes at the snapshot level pointed toward a cleaner approach. This post walks through the recommended pattern: a Non-Historized Link with technical counter transactions.



Capturing Changing Inventory Levels: Understanding the Source Data

The scenario is this: a source system sends inventory data as full snapshots. Each extract may contain data for multiple snapshot dates, and if a correction exists for a previously loaded snapshot, the source resends the complete data for that snapshot date. The inventory level represents the count of a product at a given location and store at the end of a specified day.

This is what Kimball calls a periodic snapshot fact — a full count of all products in a warehouse, day by day. The complication is that those snapshots can be corrected. Three days after an initial count, a miscalculation might be identified and the inventory level for that past snapshot date needs to be updated — without affecting the two more recent snapshots.

Why the Multi-Active Satellite Approach Has Limitations

The proposed solution used a Link referencing Product, Store, and Location Hubs, with a multi-active Satellite attached. The snapshot date was nominated as the multi-active attribute, added to the Satellite’s primary key alongside the hash key and Load Date Timestamp.

The instinct is sound — you need to track multiple snapshots and their corrections over time. But there’s a problem with using a business-supplied date as the multi-active attribute in the Raw Data Vault: it means trusting the source. In the Raw Data Vault, that’s a risk worth avoiding.

A more robust approach is to use the subsequence from staging as the multi-active attribute instead. The subsequence is assigned during staging and is guaranteed to be unique within each incoming batch. Because it’s unique within the batch, it’s also unique within any subgroup of that batch — including a multi-active group. If something goes wrong with it, it’s a problem you control and can fix. The snapshot date from the source then becomes what it actually is: a descriptive business timeline, added to the Satellite payload like any other descriptive attribute.

But even with this correction, the multi-active Satellite approach doesn’t cleanly solve the core problem: how do you capture a logical delete at the snapshot level — when an item disappears from a re-sent snapshot that was already loaded?

The Non-Historized Link Approach

The cleaner solution is to use a Non-Historized Link rather than a standard Link with a Satellite. Non-Historized Links are designed precisely for capturing facts coming from a source — transactions, events, and inventory levels. They sit close to the incoming data structure and are loaded incrementally using an Alternate Key.

For this scenario, the Non-Historized Link contains the following fields:

  • Product Key, Store Key, Location Key — the Hub references (hash keys)
  • Load Date Timestamp (LDTS) — when the record was received
  • Snapshot Date (SD) — the business timeline from the source, indicating when the inventory was counted
  • Snapshot Date Timestamp (SDTS) — the Data Vault snapshot date used in information delivery
  • A counter column (X) — with a value of 1 for records coming from the source, and -1 for technical counter transactions

The Alternate Key — used to identify whether a record already exists in the target — is the combination of Product Key, Store Key, Location Key, Snapshot Date, and Load Date Timestamp.
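A minimal DDL sketch of that structure; the names and data types are assumptions, not a mandated layout:

CREATE TABLE nhl_inventory_level (
    hk_product      CHAR(32)   NOT NULL,  -- hub reference: Product
    hk_store        CHAR(32)   NOT NULL,  -- hub reference: Store
    hk_location     CHAR(32)   NOT NULL,  -- hub reference: Location
    ldts            TIMESTAMP  NOT NULL,  -- Load Date Timestamp
    sd              DATE       NOT NULL,  -- Snapshot Date (business timeline from the source)
    sdts            TIMESTAMP  NOT NULL,  -- Data Vault snapshot date timestamp
    inventory_level INTEGER,              -- the measure delivered by the source
    x               SMALLINT   NOT NULL   -- counter: 1 = source record, -1 = counter transaction
);
-- Alternate Key used for existence checks during loading:
-- (hk_product, hk_store, hk_location, sd, ldts)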

Two INSERT Statements: Inserts and Counter Transactions

Loading this Non-Historized Link uses two parallel INSERT statements, not an UPDATE pattern.

INSERT statement one loads all records from the Staging Area where the Alternate Key does not yet exist in the target. This covers both genuinely new records and new versions of corrected records — because the corrected version, with its new Load Date Timestamp, doesn’t exist in the target yet and therefore qualifies as new.

INSERT statement two handles deletions and replaced versions. It identifies records that exist in the target but are absent from the current full-load snapshot in the Staging Area. These records are either hard-deleted in the source or represent the old version of an updated record. For each of these, a technical counter transaction is inserted — an identical row with a counter value of -1 instead of 1.

This means an update in the source produces two rows in the target: a -1 counter transaction that cancels the old version, and a 1 insert for the new version. The data itself is never modified — only the structure changes. This aligns with the hard rule in Data Vault: the Raw Data Vault transforms structure, not content.
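The two statements might look roughly as follows. This is a schematic sketch only: a production template additionally scopes the comparison to the current batch, skips rows that have already been cancelled, and decides which Load Date Timestamp the counter row carries. All names are assumptions:

-- INSERT 1: stage records whose Alternate Key does not exist in the target yet.
INSERT INTO nhl_inventory_level
    (hk_product, hk_store, hk_location, ldts, sd, sdts, inventory_level, x)
SELECT stg.hk_product, stg.hk_store, stg.hk_location,
       stg.ldts, stg.sd, stg.sdts, stg.inventory_level, 1
FROM stg_inventory stg
WHERE NOT EXISTS (
    SELECT 1 FROM nhl_inventory_level t
    WHERE t.hk_product  = stg.hk_product
      AND t.hk_store    = stg.hk_store
      AND t.hk_location = stg.hk_location
      AND t.sd          = stg.sd
      AND t.ldts        = stg.ldts);

-- INSERT 2: a -1 counter transaction for every active target row of a re-delivered
-- snapshot date whose content no longer appears in staging (hard delete or replaced version).
INSERT INTO nhl_inventory_level
    (hk_product, hk_store, hk_location, ldts, sd, sdts, inventory_level, x)
SELECT t.hk_product, t.hk_store, t.hk_location,
       t.ldts, t.sd, t.sdts, t.inventory_level, -1
FROM nhl_inventory_level t
WHERE t.x = 1
  AND t.sd IN (SELECT sd FROM stg_inventory)
  AND NOT EXISTS (
    SELECT 1 FROM stg_inventory stg
    WHERE stg.hk_product      = t.hk_product
      AND stg.hk_store        = t.hk_store
      AND stg.hk_location     = t.hk_location
      AND stg.sd              = t.sd
      AND stg.inventory_level = t.inventory_level);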

How Aggregation Reveals the True Inventory Level

The power of this pattern becomes clear at query time. By summing the counter column across all rows matching a given Alternate Key, you can determine the current state of any inventory record at any point in time.

Here’s how it plays out across several days for a single product-store-location combination:

  • Day 2: Initial snapshot for Day 1 is loaded. Counter value: 1. Sum = 1. One active inventory record.
  • Day 4: A correction arrives for Day 1. The old version gets a -1 counter transaction; the corrected version gets a 1 insert. Sum = 1 - 1 + 1 = 1. Still one active record, now at the corrected level.
  • Day 5: The record is deleted from the source for Day 1. A -1 counter transaction is inserted. Sum = 1 - 1 + 1 - 1 = 0. Zero active records. The inventory level for that snapshot date is gone.

By filtering on the Snapshot Date and aggregating the counter column, you always know exactly how many active facts exist for a given snapshot — and what the current inventory level is. By filtering on the Load Date Timestamp, you can also travel back in time and see what the inventory looked like from the perspective of any earlier date. This gives you both a corrected view of inventory history and a full audit trail of when corrections were made.
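A hedged sketch of that delivery query, using the hypothetical structure from above. Summing the measure weighted by the counter yields the corrected level, and the HAVING clause keeps only combinations that are still active:

SELECT hk_product, hk_store, hk_location, sd,
       SUM(x)                   AS active_facts,
       SUM(x * inventory_level) AS inventory_level
FROM nhl_inventory_level
WHERE sd = DATE '2024-01-01'              -- the snapshot date of interest
  -- AND ldts <= TIMESTAMP '2024-01-04'   -- optional: state as seen from an earlier load date
GROUP BY hk_product, hk_store, hk_location, sd
HAVING SUM(x) > 0;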

Performance and Reporting

On a column-based storage engine, this pattern performs extremely well. Aggregating a counter column across large volumes of inventory records is fast, and the resulting structure integrates cleanly with dashboards. Business users see dimension references for Product, Store, and Location; a snapshot date for time-based aggregation; the inventory level measure; and the counter column for filtering active records. The behavior is exactly what they’d expect — current inventory levels that reflect corrections, with history available on demand.

Where This Pattern Also Applies

Technical counter transactions in Non-Historized Links are not specific to inventory data. The same pattern applies anywhere you have full-load snapshot data with corrections — any scenario where an update in the source needs to be represented as a deletion of the old version and an insertion of the new one, without modifying existing rows.

The pattern can also be applied in the Business Vault, particularly in Bridge Tables, when the counter transactions involve calculated values that need to be corrected rather than raw source data. The principle is the same; the layer where it’s applied depends on whether the transformation is structural (Raw Data Vault) or involves business logic (Business Vault).

For more recordings on Non-Historized Links and technical counter transactions, the Scalefree YouTube channel has several dedicated sessions on the topic. And to master these patterns hands-on, explore our Data Vault 2.1 Training & Certification. The free Data Vault handbook — available as a physical copy or ebook — is also a great starting point for the core concepts.

Watch the Video

Model Access in dbt: Governing Analytics at Scale with Groups

Model Access in dbt

As dbt projects grow, so do the challenges around collaboration, ownership, and reuse. What starts as a small analytics codebase can quickly evolve into a complex ecosystem of models shared across teams, domains, and even projects. Without clear boundaries, it becomes difficult to understand who owns what, which models are safe to reuse, and how to scale analytics without breaking downstream consumers.

This is where model access and groups come into play. Together, they form a powerful governance mechanism in dbt that helps teams structure responsibility, control visibility, and safely enable cross-project data sharing.

In this article, we’ll break down what groups are, how model access works, and how these features support scalable analytics and data mesh architectures.



What Are Groups in dbt?

A group in dbt is a named collection of nodes within a project. Groups provide a way to logically organize resources and define ownership, which becomes increasingly important as more people and teams contribute to the same dbt codebase.

Groups can include the following node types:

  • Models
  • Tests
  • Seeds
  • Snapshots
  • Analyses
  • Metrics

It’s important to note that sources and exposures are not included in groups.

There are a few key rules to understand:

  • Each node can belong to only one group
  • Every group must have a name
  • Every group must have an owner

The owner definition requires at least a name and an email address. Typically, the owner represents a team rather than an individual, such as an Analytics or Finance team. This explicit ownership makes responsibilities visible and helps clarify who to contact when questions or issues arise.

Beyond organization, groups play a critical role in how dbt enforces model access rules. In fact, model access is defined and evaluated in the context of groups.

What Is Model Access in dbt?

Model access is a governance feature in dbt that controls how and where a model can be referenced. It works alongside other governance capabilities such as model contracts and model versions.

By assigning an access level to each model, teams can define which models are internal implementation details and which are intended for broader reuse.

dbt provides three access levels:

Private

Private models are the most restrictive. They can only be referenced by other models within the same group.

This is ideal for intermediate or helper models that support a specific team’s logic but are not meant to be consumed outside that context.

Protected

Protected is the default access level in dbt. Protected models can be referenced by any group within the same project, or by other projects if the project is installed as a package.

This level supports collaboration within a single dbt project while still preventing accidental exposure across project boundaries.

Public

Public models are designed to be consumed outside the project they are defined in. These models can be referenced from other dbt projects using cross-project references.

Public access is especially important for organizations adopting a data mesh approach, where teams expose trusted data products for others to consume.

One important limitation to keep in mind: models materialized as ephemeral cannot be public. Since ephemeral models are not materialized in the warehouse, they cannot be safely shared across projects.

Cross-project public access requires the Enterprise tier of dbt Cloud.

How Groups and Model Access Work Together

Groups and model access complement each other. While groups define ownership and responsibility, access modifiers define visibility and usage.

Consider the following example configuration:

groups:
  - name: analytics
    owner:
      name: Analytics team
      email: [email protected]

models:
  - name: orders_per_supplier_country_customer
    config:
      access: private
      group: analytics

  - name: orders_per_customer
    config:
      access: protected
      group: analytics

  - name: orders_per_country
    config:
      access: public
      group: analytics

In this setup, all models belong to the analytics group, but each has a different access level.

The private model can only be referenced within the analytics group. The protected model can be referenced anywhere inside the same project. The public model can be referenced by other projects entirely.

In the dbt DAG and lineage views, private and public models are visually marked, making it easier to understand boundaries at a glance.

This combination allows teams to clearly communicate intent:

  • Which models are internal building blocks
  • Which models are safe for internal reuse
  • Which models are stable data products

Public Models and Cross-Project References

Public models enable collaboration across dbt projects. This is particularly valuable when different teams manage separate projects but still need to share data in a controlled way.

To make public models available to downstream projects, the upstream project must meet two conditions:

  • An environment defined as PROD or STG
  • At least one successful job run in that environment

These requirements ensure that dbt has generated the necessary metadata for downstream reference resolution.

In the downstream project, the upstream project is added as a dependency:

# dependencies.yml
projects:
  - name: hub_speak_dmytro_base

Once declared, public models from the upstream project can be referenced using a two-argument ref() function:

{{ ref('hub_speak_dmytro_base', 'dim_customer') }}

This pattern allows downstream teams to consume only the models explicitly marked as public, while all internal logic remains private or protected in the upstream project.

Why Model Access Matters for Data Mesh

Model access is a foundational capability for implementing data mesh with dbt.

By treating public models as stable interfaces, teams can publish data products with clear contracts and ownership. Downstream consumers rely on these models without needing to understand or depend on upstream implementation details.

At the same time, teams retain full control over what they expose. Internal experimentation and refactoring can happen safely behind private and protected boundaries.

This approach enables autonomy without chaos — a key principle of data mesh.

Conclusion

As dbt projects scale, governance becomes essential. Groups and model access provide simple but powerful tools to define ownership, enforce boundaries, and enable safe reuse of analytics models.

By thoughtfully combining groups with private, protected, and public access levels, teams can scale collaboration, support cross-project data sharing, and build reliable data products without sacrificing flexibility.

In upcoming sessions, we’ll demonstrate these concepts hands-on in dbt Cloud, showing how producer and consumer projects interact and how these relationships appear in the Catalog and lineage views.

If you’re working toward a scalable analytics or data mesh architecture, mastering model access in dbt is a crucial step.

Watch the Video

Unit of Work (UOW) Links in Data Vault

Unit of Work Links in Data Vault: Purpose, Strategy, and Flow.BI

As Flow.BI — the GenAI tool that generates Raw Data Vault models by analyzing source data — becomes more widely used, questions about its specific modeling decisions are coming up more frequently. One of those questions is about a particular Link type that Flow.BI produces: the Unit of Work Link. What is it, why does it exist, and when should you use it? This post breaks it down.



Unit of Work Links: What They Are and Why They Matter

In Data Vault modeling, a Link captures a relationship between two or more Hubs — a business event or association expressed through a combination of Business Keys. When modeling a source table that contains multiple Hub references, a natural modeling instinct is to split those relationships into separate, more focused Links. One Link for a customer-product relationship, another for an employee-customer relationship, and so on.

Splitting Links is perfectly valid in Data Vault — but it comes with a constraint. Certain Business Keys must remain together to preserve what is called the Unit of Work: the set of Business Keys that belong together as a single, consistent business event. Separate them incorrectly, and you risk breaking data consistency in your model. This is one of the more subtle but consequential mistakes a modeler can make, and it’s discussed in detail in the Data Vault training and certification curriculum.

The Unit of Work Link is the answer to that risk. It is a Link that spans all Hub references coming from a single source table — unsplit, unfiltered, capturing the full set of Business Keys exactly as they appear together in the source. Think of it as the safe baseline: a Link that guarantees the Unit of Work is preserved, no matter what splitting decisions get made on top of it.

The Human Modeler Problem — and How Flow.BI Handles It

Flow.BI works by analyzing source data profiles and generating the Hubs, Links, and Satellites needed to capture that data correctly in a Raw Data Vault — one that passes the JEDI test. It integrates with automation tools including Data Vault Builder, Datavault4dbt, and others.

But here’s the design challenge Flow.BI had to address: it imitates a human data modeler. And human data modelers make mistakes. One of the most common is splitting Links incorrectly and inadvertently violating the Unit of Work. So the question during development was: do you try to eliminate these mistakes entirely — at the cost of no longer truly imitating human modeling behavior — or do you accept that errors can happen and build a counter-strategy?

Flow.BI chose the counter-strategy. The Unit of Work Link is that strategy. By always generating a Link that preserves the full set of Hub references from a source table, Flow.BI ensures there is always a valid, consistent fallback in the Raw Data Vault — even if additional, more specific Links turn out to have been split incorrectly.

The Practical Modeling Strategy Behind It

This isn’t just an AI design decision — it reflects a sound modeling practice for human modelers as well. The approach works like this:

When working with a source table that contains multiple Hub references, always create one Link that spans all of them. This is your Unit of Work Link. Then, if your understanding of the source data is strong enough — if you’ve been able to validate the relationships and run the appropriate tests — you can introduce additional, more specific Links that extract individual relationships from the source. A hierarchy Link, an employee-to-customer Link, a product-to-order Link, and so on.

The key point is that the Unit of Work Link remains. It doesn’t get replaced. If one of the more specific Links turns out to violate the Unit of Work — because the source data behaved differently than expected, or because access to production data was limited during modeling — you still have the original Link to fall back on. The model remains valid. You can retrieve the correct relationship from the Unit of Work Link and fix the problem without reloading the Raw Data Vault.

The less familiar you are with a source dataset — particularly in cases where you can’t access production data directly or can’t run comprehensive validation tests — the more valuable this approach becomes. When in doubt, don’t split. Protect the Unit of Work first.

Splitting in the Business Vault Instead

If a Unit of Work Link exists in the Raw Data Vault and you later want to split it into more focused relationships, that work belongs in the Business Vault — not the Raw Vault. You use the Unit of Work Link as the basis for either virtual or materialized Links in the Business Vault, where the split is applied.

The advantage of this approach is containment. If the split is wrong — if it turns out to violate the Unit of Work — the fix happens in the Business Vault. The Raw Data Vault doesn’t need to be reloaded. The source of truth stays intact. You correct the business logic without touching the foundation.
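As a rough illustration of this containment, assume a hypothetical Unit of Work Link uow_link_crm_order spanning Employee, Customer, Product, and Order; a narrower relationship can then be exposed as a virtual Link in the Business Vault:

-- Virtual Business Vault Link: the split is only a view on top of the Unit of Work Link.
-- If the split turns out to violate the Unit of Work, the view is corrected or dropped;
-- the Raw Data Vault is never reloaded.
CREATE VIEW bl_employee_customer AS
SELECT DISTINCT
    uow.hk_employee,
    uow.hk_customer,
    uow.ldts,
    uow.rsrc
FROM uow_link_crm_order uow;
-- A materialized variant would additionally compute a dedicated link hash key for the pair.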

This is exactly the kind of separation of concerns that makes Data Vault resilient. Raw data is captured as-is, close to the source. Business logic — including relationship refinement — happens in the layer designed for it.

A Rule Worth Adopting for Any Modeler

The Unit of Work Link isn’t just a Flow.BI artifact. It’s a principle any Data Vault modeler can and should apply. If you’re splitting Links in your Raw Data Vault, ask yourself: where is your Unit of Work? Is there a Link in your model that preserves the full set of Business Keys from each source table, regardless of how you’ve split them elsewhere?

If the answer is no, you’re relying on every split being correct — and on your understanding of the source data being complete. That’s a reasonable bet when you know the data well. It’s a riskier one when you don’t. The Unit of Work Link costs very little to include and provides a meaningful safety net in return.

To go deeper on Link modeling, Unit of Work concepts, and the full Data Vault methodology, explore our Data Vault certification program. And if you’re new to Data Vault, the free handbook — available as a hard copy or ebook — is a solid introduction to the core concepts.

Watch the Video

The AI-Enabling Data Platform: Unlocking Scalable, High-Quality AI Applications

AI Enabling Data Platform

Is your company building an AI time bomb?

Many businesses are rushing to deploy AI prototypes that look impressive during a demo but hide massive, systemic risks. From “hallucinating” bots that give dangerous advice to customers to catastrophic legal liabilities, simple AI setups can quickly become a corporate nightmare.

If your AI strategy depends on unorganized data and ungoverned workflows, you aren’t just experimenting; you are creating a “data debt” that could bankrupt your project or compromise your company’s reputation. If you want to move beyond these risky experiments and build AI that is efficient, scalable, trusted, and actually works for your business, you need a different approach. Learn how an AI-Enabling Data Platform protects your company while unlocking the true power of high-quality, scalable AI.

The AI-enabling Data Platform – Unlocking high-quality AI Applications

To scale AI effectively, organizations must move beyond unmanaged prototypes toward an AI-Enabling Data Platform that addresses security risks and poor data governance. By transforming fragmented data into governed Feature Marts, this architecture ensures the high-quality, compliant data foundation necessary for reliable AI workflows. This shift ultimately solves the maintenance and liability issues that typically hinder AI return on investment. Learn more in our upcoming webinar on February 17th, 2026!

Watch Webinar Recording

Moving Beyond the Prototype

It usually starts with a spark of excitement. You build a small AI tool or workflow using a Large Language Model (LLM), and it works! It answers questions, summarizes text, and saves your team hours of manual labor. This is the “honeymoon phase,” where everything feels possible and the technology seems like magic.

But then, you try to scale. You move from a single user to a whole department, or from a small test folder to your entire company database. Suddenly, things get quite complex. The AI starts making mistakes it didn’t make before, so you extend your AI workflows with data adjustments and exceptions, and the system starts breaking regularly. The legal team finds out about the project and starts asking difficult questions regarding data privacy and “black box” decision-making.

Does this sound familiar? You may have seen this in your own projects: A demo that looks great in a controlled environment but cannot handle the pressure of real, messy business use, and gets stuck in PoC purgatory. Without a professional foundation, your AI applications quickly change from being a business asset to becoming a massive liability.

Why Your Current AI Setup is Failing

To understand the solution, we must first look at why most AI initiatives fail when they leave the lab. The problem is almost always the same: a total lack of governance and messy (non-cleansed, non-standardized, or non-integrated) data.

While major LLMs are trained on broad, general data, they often lack access to the specific “facts” of your business in a way they can understand. This leads to several major threats:

  • The “Hallucination” Risk: If the AI isn’t connected to a “Single Source of Facts,” it guesses. It makes up facts about your product features, delivery times, or prices. In a business setting, a wrong answer isn’t just a mistake but a breach of trust that can quickly destroy a customer relationship.
  • The Maintenance Nightmare: Without a central data platform, every time your source data structure or business logic changes, you have to manually update every single AI tool and workflow you’ve built that touches this piece of data. This makes long-term maintenance impossible and kills the hoped-for ROI of your new AI application.
  • The Legal Challenge: Legal frameworks don’t magically disappear when working with AI. Furthermore, additional frameworks like the EU AI Act are adding new layers of regulatory compliance requirements. If you cannot explain why your AI gave a specific answer or which data it used, you could face massive fines. Using sensitive data without a clear audit trail is a gamble most companies cannot afford.

The Two Traps of Modern AI Development

After the honeymoon phase of the LLM era, companies want to adapt quickly. However, they almost always fall into one of two typical traps. You might recognize these patterns in your own organization:

Trap 1: The “AI Spaghetti” Trap

In the rush to be “AI-First,” many teams use a mix of different AI workflow tools and agents, connecting them piece-by-piece to solve individual problems. While each piece works, the overall system becomes a tangled mess, which I like to call AI Spaghetti. 

In this trap, there is no central “brain” or data control. Each agent has its own way of looking at data, leading to zero consistency. If you change a price in your main database, some agents might see it, while others are still using an old PDF they found in a different folder.

This “spaghetti” is impossible to maintain, secure, and scale. You spend 90% of your time fixing broken connections, integrations or calculations instead of creating new value. 

The dangerous part is that this doesn’t happen on day one; it builds itself as you add more functionalities and exceptions. Often, these workflows are already in production as they grow, and the only way out is building everything from scratch the right way while maintaining the spaghetti in parallel, which makes the “escape route” quite expensive.

Trap 2: The “Lone Wolf” Liability Trap

To bypass what they see as “slow corporate IT,” some teams or individuals start building their own AI applications and workflows. This is not inherently concerning for basic operational efficiency, but the trap appears when teams go deeper and start building workflows and applications that consume and transform bigger chunks of company data.

These “Lone Wolves” work around IT to quickly “get the job done,” ignoring necessary governance processes and exposing the company to major risks. When a Lone Wolf uploads a customer list or a trade secret to a public model, that data might be used to train future versions of the model, making your secrets public property. Furthermore, with zero oversight, legal frameworks like GDPR, internal data sharing protocols, and IT security are often ignored.

The Solution: The AI-Enabling Data Platform

To escape these traps and unlock real sustainable value, you must move away from “messy” setups. The answer is the AI-Enabling Data Platform. This is not just a place to store data. It is a professional system that transforms raw, fragmented information into high-quality “fuel” for AI.

The platform acts as a protective layer between your messy company data (emails, databases, PDFs, spreadsheets) and your AI applications. Its main job is to provide Feature Marts.

What are Feature Marts?

Think of a Feature Mart as a library of trusted information. Instead of asking the AI to search through a giant, messy database, you provide it with specific “Features”, which essentially are data points that have been cleaned, integrated, and approved by your data experts.

For example, instead of the AI trying to guess a customer’s loyalty status from thousands of raw interaction logs, it simply asks the Feature Mart for the “Customer_Loyalty_Score.” The result is instant, accurate, and governed.

How do they fit into our data architecture?

This is aligned with how we provide data to business users for standard reporting and analytics. We don’t throw non-integrated, uncleaned data without descriptions at business users and ask them to find the perfect KPI. This is why the principles behind a quality data platform stay mostly the same. You can simply build Feature Marts on top of your existing data platform. Instead of “Information Marts,” you now add Feature Marts.


You build Feature Marts on top of your integrated data layer as part of your “Gold Layer,” since they are data assets ready for consumption by your AI applications, workflows, and agents. These, in turn, are responsible for automating your operations and supporting your business in a variety of tasks.

What becomes critical for high-quality results is a semantic layer. Definitions of your data, calculations, and business meaning can be maintained in modern data cataloging tools, which work for business users as well as data specialists. A well-constructed Feature Mart, combined with this descriptive metadata, is the perfect recipe for high-quality results from your AI layer.

If you are interested in more details about the data architecture, check out my article about Data Fabric architecture here: Data Vault, Data Mesh & Data Fabric Guide

What You Achieve: Quality, Speed, Cost Efficiency and Trust

When you invest in an AI-Enabling Data Platform, you achieve four critical business outcomes: quality, speed, cost efficiency, and trust.

The Path to Success

Building high-quality AI is a journey. You can achieve better results and avoid the risks by following these steps:

  • Stop the “Lone Wolves”: Ensure all major AI projects use a central data platform so they stay safe and governed. Which AI usage is allowed outside IT and where guardrails are necessary should be defined in your organization’s AI strategy.
  • Stop the “AI Spaghetti”: Simple AI use cases can be achieved with basic workflow tools (e.g., n8n, Zapier) without a dedicated platform. Complex AI use cases that build on company data should not live entirely in such tools; use workflow tools only for orchestration.
  • Build Feature Marts: Don’t just give the AI raw data. Turn your important business data into ready-to-use “features” to increase trust, speed, security and governance.
  • Focus on Governance: Use the platform to control who (and which AI) can see your data. Audit inputs and outputs to ensure quality stays high.
  • Create Cross-functional Teams: The real impact is in automating everyday business processes, which is best achieved through combined teams of data engineers, AI engineers, and business users.
  • Assess and Plan: Get an overview of how AI is currently used, where the biggest risks are, and where the biggest opportunities lie. Create a roadmap including team structure, team skills, architecture, processes, governance and security.

If you want to profit from external expertise, read about our Scalefree Review & Assessment service and reach out to us for a customized review fitting your exact needs.

Conclusion: Real Value is Built on Trust

The AI revolution is not about who has the most expensive model or the flashiest chatbot. It is about who can automate their business most efficiently leveraging AI without losing trust in operations, results, and decisions.
When your AI applications are accurate, safe, and governed, they stop being “risky experiments” and become the engine of your company’s success.
Start by identifying your “Lone Wolves” and bringing them into a governed environment. Look at your most valuable AI use cases and start building the Feature Marts they need to survive in the real world.

What do you think?

Have you seen the “AI Spaghetti” trap in your own company? Are you worried about “Lone Wolves” creating legal risks? I would love to hear your experiences and challenges in the comments below or on social media postings (probably only LinkedIn)!

Refactoring a Data Vault Model

Refactoring a Data Vault Model: Options, Risks, and Best Practices

Source systems change. Columns get added or removed, structures evolve, and sometimes entire business key definitions are overhauled. When that happens to a system feeding your Data Vault, the question isn’t just technical — it’s strategic. Do you modify what you have, or do you build alongside it? This post walks through the main scenarios and the practical options available for each, along with a clear recommendation on where to draw the line between low-risk and high-risk approaches.



When a Column Changes: The Simple Case

The least disruptive scenario is a column-level change in a source table — a new attribute appears, or an existing one disappears. For this, you have a few options depending on your project constraints.

Option 1 — Modify the existing Satellite. If your project allows structural changes, you can add the new column to the existing Satellite with an ALTER TABLE statement. Historical rows will show null values for the new column before its introduction, and a log entry can record exactly when the column was added. Removing a column from a Satellite is generally not done — historical data lives in that column, and dropping it means losing that history.

Option 2 — Create a new Satellite. If you’re not allowed to touch existing structures, or simply prefer not to, you create a new Satellite to capture the new or changed attributes. This Satellite gets added to the relevant PIT Tables. The trade-off is an additional join in your queries, but the existing Satellite and its data remain completely untouched.

Option 3 — Close and replace the Satellite. A slightly more thorough approach: close the existing Satellite (stop loading it) and create a brand new one that reflects the updated structure. The new Satellite starts with a full load from the source, which means some data overlap with the old Satellite. This is handled cleanly at query time using an IIF statement — prefer data from the new Satellite where it exists, fall back to the old one for earlier history. The redundancy is not a problem; it resolves itself during query execution.

The bigger the structural change, the more this third option makes sense. If a source table is overhauled dramatically — many columns removed, many added — creating a fresh Satellite to capture the new shape is often the cleanest path forward.
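As a sketch of the query-time fallback described under Option 3, with hypothetical Satellite names and the temporal handling left out for brevity (COALESCE plays the same role here as the IIF mentioned above):

SELECT
    h.hk_customer,
    COALESCE(new_s.email, old_s.email) AS email,   -- prefer the new Satellite where it has data
    COALESCE(new_s.city,  old_s.city)  AS city     -- fall back to the closed Satellite for older history
FROM hub_customer h
LEFT JOIN sat_customer_v2 new_s ON new_s.hk_customer = h.hk_customer
LEFT JOIN sat_customer_v1 old_s ON old_s.hk_customer = h.hk_customer;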

When the Business Key Changes: The Complex Case

Column-level changes are manageable. Business key changes are where things get genuinely complex — and where the risk calculus shifts significantly.

A business key in Data Vault must be unique over time and across the enterprise. If the current key no longer meets that standard — say, a customer number that was once reliable is now duplicated across regions — you have a structural problem that can cascade through the model. Changing the business key means potentially changing the Hub itself, which in turn affects every Link that references that Hub, and every Satellite attached to those Hubs and Links. The impact can be wide.

At this point, you have two main strategic choices.

Option A: Keep Old and New Structures Separate

The lower-risk approach — and the one most commonly recommended — is to leave the historical Raw Data Vault exactly as it is and build a new Raw Data Vault to capture data under the new structure and key definition.

The reasoning is rooted in a core Data Vault principle: the Raw Data Vault should model data close to how the source systems use it. The business had one structure in the past and a different one going forward. That’s two different realities, and it makes sense to model them separately.

The two Raw Data Vaults then get reconciled in the Business Vault, where business logic handles the combination of old and new data. This might be straightforward — a simple union — or it might be complex, especially if field definitions have changed. For example, if an address field was previously structured (street, house number, zip, city) and is now a free-text memo field that may contain addresses from multiple countries, the logic to normalize and combine that data belongs in the Business Vault. That’s exactly what the Business Vault is designed for.

This approach carries the lowest risk. Historical data is never touched. Nothing can go wrong with data that hasn’t been moved.

Option B: Refactor the Raw Data Vault

The more ambitious option is to refactor the existing Raw Data Vault into a new version — modifying Hubs, Links, and Satellites to reflect the new structure — and then reconstruct historical data within that new model.

This is technically possible, but it comes with a hard requirement: you must be able to reconstruct every historical delivery from the new structure without any data loss. In Data Vault practice, this is validated through what’s known as the “Jedi test” — deriving the old structures from the new ones and verifying the output matches the original data exactly. If the test passes, you can safely drop the old tables and replace them with views that expose the old structure as a backward-compatible interface.
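One way to express that check, assuming a hypothetical view v_sat_customer_old that reconstructs the old Satellite structure from the new model: compare it against the original table in both directions, and require both queries to return zero rows before the old table may be dropped.

SELECT * FROM sat_customer_old
EXCEPT
SELECT * FROM v_sat_customer_old;

SELECT * FROM v_sat_customer_old
EXCEPT
SELECT * FROM sat_customer_old;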

Those views give existing queries time to continue working while users migrate. But they’re a transitional tool, not a permanent one. You’ll want to communicate a clear deprecation timeline — 90 or 180 days is typical — and give users explicit guidance on how to update their queries before the views are dropped.

A word of warning: when those views eventually get dropped, expect complaints. Not because the communication failed, but because, as a rule, nobody reads emails. Plan for it.

Handling Non-Unique Business Keys in the Interim

If a business key loses its uniqueness mid-project and a full refactoring effort will take several sprints, there’s a practical interim solution: a Record Source Tracking Satellite. This allows you to continue working with the existing model while the refactoring is planned and executed in the background. It buys time without requiring an immediate structural overhaul, and it keeps the data pipeline running cleanly during the transition.

Communication: The Overlooked Part of Refactoring

Technical decisions aside, refactoring a Data Vault model is also an organizational event. Users who query your data warehouse need to know when structures change — whether that’s a modified Satellite, a new Hub, or a deprecated view that will be removed in three months.

A simple data warehouse changelog or newsletter goes a long way. When you modify existing entities, inform users. When you introduce views as backward-compatible bridges, tell them the timeline. When the views are going away, tell them what to query instead. This isn’t just good practice — it’s the difference between a smooth migration and a flood of support tickets.

The Bottom Line on Refactoring

Data Vault is built to absorb change, and it does so gracefully at the column level. Descriptive attribute changes — new columns, removed columns, restructured Satellites — are handled with well-defined options and minimal risk. The real challenge arrives when business keys change, because the ripple effects can touch Hubs, Links, and Satellites across the model.

In those cases, the recommended approach is to preserve historical data in the original Raw Data Vault and build a new one for the new structure, using the Business Vault as the reconciliation layer. It’s the lowest-risk path, it keeps your historical data intact, and it puts complex transformation logic exactly where it belongs.

To learn more about Data Vault modeling principles, refactoring strategies, and Business Vault patterns in depth, explore our Data Vault 2.1 Training & Certification. And if you’re new to the methodology, the free Data Vault handbook is a great starting point — available as a hard copy or digital download.

Watch the Video

Source of Data for Business Vault Entities

Business Vault Entities: Using Cross-Satellite Joins for Business Logic

One of the questions that comes up regularly among Data Vault practitioners — especially those working on the Business Vault layer — is whether a Computed Satellite can draw its input data from multiple, unrelated Satellites across different Hubs. It sounds like it might break the rules. It doesn’t. In fact, it’s a very common and entirely valid pattern once you understand what actually constrains a Business Vault Satellite.

Let’s break it down.



Business Vault Computed Satellites: What Are the Real Rules?

When practitioners first encounter the Business Vault, a natural assumption is that a Computed Satellite must only use data from Satellites attached to its own parent Hub. The logic seems sound — keep things connected, keep things clean. But this is not actually a constraint in Data Vault. It’s a misconception.

The real rule for a Computed Satellite is much simpler: what do your results describe? Whatever entity your calculated output describes — that’s your parent. That’s what the Satellite attaches to.

For example, if you’re calculating customer lifetime value using data pulled from order records, transaction histories, product tables, and behavioral data — the result still describes the customer. So your Computed Satellite attaches to the Customer Hub. The source of the input data is largely irrelevant to the attachment decision. What matters is the semantic meaning of the output.

Cross-Satellite Joins: Yes, You Can

So back to the original question: can you build a Business Vault Satellite that joins Satellites from two completely different subjects — even if there’s no Link connecting their parent Hubs?

The answer is yes.

Your business rule query can pull data from anywhere in the model. You can join Satellites on the same Hub, traverse a Link into another Hub and pull Satellite data from there, use business key relationships to establish a match, or even join across entirely disconnected model areas if a descriptive relationship exists in the data itself.

The absence of a Link between two Hubs doesn’t prevent you from joining their Satellite data in a query. If there’s a way to match records — even informally through a shared descriptive attribute in a Satellite — you can use that to establish the relationship in your query logic and proceed.

When to Consider a Business Vault Link Instead

That said, there are situations where formalizing the relationship makes sense. If two Hubs have no Link between them but your business logic consistently requires joining their data, it may be worth creating a Business Link or an Exploration Link in the Business Vault to make that relationship explicit and reusable.

This approach can improve query efficiency and make the model easier to understand for future developers. However, it also adds complexity — a Business Link typically requires an Effectivity Satellite to track the validity of the relationship over time, which may or may not be worth the overhead depending on your use case.

The pragmatic answer: if it’s a one-off calculation and the join works, just write the query. If the relationship is foundational to multiple business rules, formalize it with a Link. Performance and maintainability should guide that decision, not a strict modeling rule.

The Only Real Constraints

In practice, there are two genuine constraints on what you can use as input for a Business Vault Computed Satellite:

  • Query performance: Joining across disconnected parts of your model may not always be efficient. If the join is expensive and runs frequently, optimization — through a Business Link, materialization, or other techniques — is worth considering. But it’s an engineering concern, not a modeling violation.
  • Data Governance: Are you actually allowed to combine those datasets? In organizations with strict data ownership boundaries, combining Satellites from different subject areas or different data owners may require governance approval. That’s a policy question, not a technical one — but it’s one worth asking before you build.

Outside of those two factors, the query is yours to write however you need.

A Practical Example: Customer Lifetime Value

To make this concrete: imagine you need to calculate customer lifetime value (CLV) and attach the result as a Computed Satellite on the Customer Hub. Your inputs might include:

  • Customer profile data from a Satellite on the Customer Hub
  • Order totals from a Satellite on an Order Hub (connected via a Customer-Order Link)
  • Return history from a Satellite on a Returns Hub
  • Promotional sensitivity data from a completely separate marketing model with no direct Link to Customer

Even if the marketing model has no formal Link back to Customer, if there’s a way to match the records — say, via a shared customer identifier in a descriptive Satellite — you can join it. Your CLV business rule can consume all of this data in a single query. The result — a calculated CLV score — describes the customer, so the Computed Satellite attaches to the Customer Hub. Clean, valid, and practical.
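A hedged sketch of what that business rule query could look like; the entity names, the join paths, and the scoring formula itself are all illustrative assumptions:

-- Feeds the CLV Computed Satellite on the Customer Hub.
SELECT
    c.hk_customer,
    SUM(o.order_total)
      - COALESCE(SUM(r.refund_amount), 0)
      + COALESCE(MAX(m.promo_uplift), 0) AS clv_score
FROM hub_customer c
JOIN link_customer_order lco     ON lco.hk_customer = c.hk_customer
JOIN sat_order_totals o          ON o.hk_order      = lco.hk_order
LEFT JOIN sat_return_history r   ON r.hk_customer   = c.hk_customer   -- simplified: joined on customer here
LEFT JOIN sat_customer_profile p ON p.hk_customer   = c.hk_customer
LEFT JOIN sat_marketing_promo m  ON m.customer_no   = p.customer_no   -- no Link; matched on a shared identifier
GROUP BY c.hk_customer;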

This kind of cross-domain business logic is exactly what the Business Vault is designed for. It’s the layer where raw, integrated data gets transformed into meaningful, decision-ready information — and the flexibility to query across the model is one of its most powerful features.

Key Takeaway

When building Computed Satellites in the Business Vault, don’t let the structure of your Raw Data Vault artificially limit your business logic. The query that feeds your Satellite can draw from anywhere in the model. The only questions that matter are: what do your results describe, and is your query performant and governance-compliant?

Get those right, and the rest is just SQL.

To go deeper on Business Vault patterns, Computed Satellites, and the full Data Vault methodology, explore our Data Vault 2.1 Training & Certification — and grab the free Data Vault handbook for a concise overview of the core concepts.

Watch the Video

How to Monitor Data Reliability with dbt Source Freshness in dbt Cloud

How to Validate Data Freshness in dbt Cloud

Ensuring that your data is fresh, reliable, and aligned with your SLAs is one of the most important responsibilities of any analytics engineering or BI team. In modern data stacks, dbt Source Freshness plays a key role in validating that upstream systems are loading data as expected. When used properly, it helps teams identify delays, pipeline failures, or missing updates before they impact models and downstream reporting.

This article walks through a full demo of how to configure, run, and monitor Source Freshness checks in dbt Cloud. It builds on the fundamentals introduced in the first video of our series, where we explained what source freshness is, why it matters, and how dbt evaluates freshness. If you haven’t watched that introduction yet, we recommend doing so first.

In this second part, we go hands-on: reviewing source configurations, running freshness checks using both fields and custom SQL queries, applying optional filters, triggering warnings and failures, and inspecting the results in the dbt Cloud UI and Catalog. By the end, you’ll have a clear understanding of how to integrate freshness checks into your workflows and jobs to maintain a highly trustworthy data foundation.



Understanding the Source Freshness Configuration

The demo starts inside a dbt Cloud project with a YAML file containing our source definitions. For this walkthrough, we are working with a source called dbt_talk_demo_sources, which includes two tables:

  • customer_source
  • employee_source

Inside the configuration block, we define the core freshness thresholds:

  • warn_after: 30 minutes
  • error_after: 60 minutes

These settings tell dbt when to flag a source as slightly stale (warning) or critically outdated (error). They are typically aligned with SLAs and expectations for how often upstream data should be updated.

Using loaded_at_field

For the customer_source table, we use a basic configuration: the table includes an updated_at timestamp column, which we set as the loaded_at_field so dbt can calculate freshness directly from it. However, the timestamps in this demo are recorded in CET (Europe/Berlin), which means dbt converts them to UTC before evaluating freshness. This highlights a common real-world consideration: time zones must always be handled consistently in freshness checks.

Using loaded_at_query

For employee_source, we use a different approach. This table does not store a timestamp column. Instead, the load timestamps are stored in a metadata table. To handle this, we configure a loaded_at_query—a SQL query that retrieves the latest load time externally. This method is often used when:

  • Timestamps come from an ETL metadata or logging table
  • Data loads use high-watermark patterns
  • You want more control over how freshness timestamps are calculated

In the demo, the query simply selects the MAX(updated_at) value from the metadata table. While simple, it demonstrates how flexible dbt is when working with custom data loading patterns.
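
As a sketch of what such a loaded_at_query might contain, the query below returns a single timestamp for dbt to use as the latest load time; the metadata table and column names (etl_load_log, table_name, updated_at) are assumptions, not objects from the demo project.

-- Return a single timestamp that dbt treats as the most recent load time
-- for employee_source. Table and column names are illustrative.
SELECT MAX(updated_at)
FROM etl_load_log
WHERE table_name = 'employee_source'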

Using Optional Filters

dbt also supports an optional filter configuration, which lets you skip certain rows when evaluating freshness. For example, if a table contains soft-deleted records or historical rows that should not count toward freshness checks, you can filter them out. In our demo, the filter excludes rows where deleted = TRUE, ensuring only active records contribute to the freshness calculation.

This becomes particularly useful when deleted or otherwise irrelevant records carry newer timestamps than the latest valid ones, which could make the source appear fresher than it really is and hide actual issues.

Running Freshness Checks in dbt Cloud

With the configuration in place, we run our first freshness check via the CLI:

dbt source freshness

Before running the check, we insert new rows into the source tables so that the latest data delay is around 20 minutes. Since this is below both the warning and error thresholds, the run reports everything as green.

This confirms that both types of configurations—loaded_at_field and loaded_at_query—are working as expected.

Triggering a Warning

Next, we enable the earlier-mentioned filter configuration on customer_source. After filtering out the deleted rows, the next valid record has a delay of about 50 minutes. When we run:

dbt source freshness -s source:dbt_talk_demo_sources.customer_source

dbt reports a warning state, because the threshold of 30 minutes is exceeded. This demonstrates how filtering can impact the evaluation in meaningful ways.

Triggering a Failure

To understand how a failed freshness check behaves, we insert data with delays exceeding the 60-minute error threshold. Running the same command again produces an error state. The dbt output also shows the exact SQL query it executed to determine freshness—useful when troubleshooting unexpected results.

Including Freshness in dbt Cloud Jobs

dbt Cloud provides two ways to incorporate freshness checks into scheduled jobs:

Option 1: “Run Source Freshness” Checkbox

With this option enabled, dbt automatically runs dbt source freshness as the first step of the job. However, failures do not stop the rest of the job from executing. This mode is ideal when you want visibility but don’t want freshness violations to block model builds.

Option 2: Adding Freshness as a Job Step

Alternatively, you can include freshness checks as an explicit job step. In this case, if freshness fails, subsequent steps are skipped and the job fails. This is the preferred option when:

  • Data reliability is critical
  • Your models depend on up-to-date sources
  • You want strong enforcement of data SLAs

The demo shows examples of both approaches, so you can choose which one best fits your project needs.

Monitoring Freshness in the dbt Cloud Catalog

dbt Cloud makes it easy to monitor freshness results long after the run completes. In the Catalog, you can drill down into each source and see the most recent freshness status, including warnings and errors. This gives data teams better visibility into upstream issues without needing to dive into logs.

For example, in our demo environment, the Catalog displays a warning icon for dbt_talk_demo_sources. Opening the source reveals the individual freshness statuses for each table. This is especially helpful in larger projects where tracking freshness manually would be impractical.

Key Takeaways

This demo highlights the full power and flexibility of dbt Source Freshness in real-world analytics environments. Here are the main lessons:

  • Freshness thresholds provide an essential guardrail for data reliability.
  • loaded_at_field is simple when the timestamp is in the table.
  • loaded_at_query enables more advanced scenarios using external metadata.
  • Filters help refine which rows count toward freshness.
  • dbt distinguishes between OK, warning, and error states in a clear, actionable way.
  • Freshness checks can run as part of your dbt Cloud jobs with configurable strictness levels.
  • The dbt Cloud Catalog provides ongoing visibility into the freshness of all sources.

By combining these tools, you can ensure your source data stays timely, trustworthy, and perfectly aligned with your organization’s SLAs. This ultimately improves downstream analytics quality, enhances user confidence, and reduces the risk of building insights on outdated data.

Watch the Video

How to Derive Dimensions and Facts from a Data Vault Model

Patterns for Deriving Data Vault Dimensions and Facts

If you’ve gone through a Data Vault training, you know that Data Vault is a pattern-driven methodology — from loading the Raw Data Vault all the way through business logic implementation in the Business Vault. But one question that consistently comes up, especially among practitioners bridging the gap between modeling and delivery, is this: How do you actually derive Dimensions and Facts from a Data Vault model?

It’s a great question, and the good news is that just like the rest of Data Vault, there are clear, repeatable patterns for doing it well. Let’s walk through them.



The Pattern-Based Nature of Data Vault Delivery

Data Vault is not just a modeling technique — it’s an entire methodology built around patterns. There are patterns for loading the Raw Data Vault, patterns for modeling it, and patterns for implementing business logic in the Business Vault. The same is true for the information delivery layer, where you produce standard target entities like Dimensions and Facts (or 3NF entities, flat/wide schemas, and more).

Once you internalize these patterns, deriving your target entities becomes a predictable and even automatable process. That’s one of the most underappreciated strengths of Data Vault: the patterns repeat, and the more you use them, the faster and more confident you become.

Deriving Dimensions: The 80/20 Rule

Roughly 80% of your Dimensions will follow one core pattern: they are derived from a combination of a Hub, one or more Satellites, and — critically — a PIT Table (Point-In-Time Table).

The basic query pattern looks like this:

  1. Start with the PIT Table as your base
  2. Join the Hub to retrieve the business key
  3. Join the relevant Satellites for descriptive attributes
  4. Cherry-pick the attributes you need for your Dimension

The remaining 20% of Dimensions come from Links rather than Hubs. In these cases, you start with a PIT Table built on a Link, join the Link itself for hash keys and Dimension references (often called Bridge Dimensions), and then join the Satellites to describe the relationship.

In practice, this covers close to 99% of all Dimension derivations. There are always edge cases, but this two-pattern approach handles the vast majority of real-world scenarios.
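
To make the core pattern concrete, here is a minimal SQL sketch. The object and column names (pit_customer, hub_customer, sat_customer_crm, and the PIT's per-Satellite load-date pointer) are assumptions; your automation tooling will generate its own naming.

-- Standard Dimension derived from a PIT Table, a Hub, and one Satellite.
CREATE VIEW dim_customer AS
SELECT
    p.customer_hk        AS customer_key,   -- surrogate key carried by the PIT
    p.snapshot_ts        AS snapshot_ts,    -- snapshot timestamp controls the timeline
    h.customer_id        AS customer_id,    -- business key from the Hub
    s.first_name,                           -- cherry-picked descriptive attributes
    s.last_name,
    s.city
FROM pit_customer     AS p                  -- 1) start with the PIT Table
JOIN hub_customer     AS h                  -- 2) join the Hub for the business key
  ON h.customer_hk = p.customer_hk
JOIN sat_customer_crm AS s                  -- 3) join the Satellite via the PIT pointers
  ON s.customer_hk = p.customer_hk
 AND s.load_ts     = p.sat_customer_crm_lds;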

Why PIT Tables? The Case for Point-In-Time

If you’re new to PIT Tables, they can look intimidating at first — a structure full of hash keys and timestamps. But once you understand the pattern, it’s actually one of the most elegant tools in the Data Vault toolkit. And here’s the key insight: in modern implementations, PIT Tables are generated by automation tooling, so you rarely build them by hand. You learn how to use them, not how to construct them from scratch.

PIT Tables are essential for several reasons:

Supporting All SCD Types

A PIT Table provides the foundation for Slowly Changing Dimension (SCD) types 0, 1, and 2, all from the same structure. The snapshot date timestamp embedded in the PIT Table allows you to control the grain and timeline of your Dimension flexibly.

Delivering Stable Reports

Here’s a scenario every analyst will recognize: you pull a report at 9:00 AM and expect it to remain consistent throughout the day. But if your Dimension view is simply joining the latest delta from a Satellite directly, any incoming data update after 9:00 AM will silently change your report’s results.

A PIT Table solves this by identifying the latest valid delta for all hash keys at a specific snapshot timestamp — say, 9:00 AM. This decouples your information delivery from live, incoming data pipelines. Users get stable, predictable reports. The data warehouse and the source pipelines can operate independently without stepping on each other.

This is why a PIT Table, though it may appear more complex at first, is ultimately the simpler choice and more aligned with what business users actually need: consistent data at a known point in time, rather than real-time volatility bleeding into their dashboards.

Decoupling Delivery from Ingestion

Beyond stability, the PIT Table serves a broader architectural purpose: it creates a clean separation between the data loading process and the data delivery process. This is a hallmark of mature data warehouse design — and Data Vault makes it a first-class pattern rather than an afterthought.

Want to deepen your understanding of PIT Tables, Satellites, and the complete Data Vault modeling toolkit? Explore our Data Vault 2.1 Training & Certification to get hands-on with these patterns.

Deriving Facts: Links, Non-Historized Links, and Bridge Tables

Facts are a different story. Where Dimensions describe who and what, Facts capture events, transactions, and messages — think banking transactions, call records, IoT sensor readings, or logistics movements.

For Facts, your primary sources in a Data Vault model are:

  • Non-Historized Links (NH Links) — ideal for transactional events that don’t change
  • Dependent Child Links — for subordinate, granular transactional data
  • Bridge Tables — for pre-aggregated or grain-shifted data
  • Joined Link combinations — for cross-domain Fact derivations

If your underlying Link structure already matches the grain you need for your Fact, you can build a simple view on top of it and you’re done. Clean and efficient.

Grain Management: The Core Challenge of Fact Derivation

Things get more interesting when your source data doesn’t naturally sit at the right grain for your target Fact. This is where Bridge Tables and grain shift operations come into play.

Coarsening Grain with GROUP BY

Imagine an NH Link that captures real-time vehicle positions — five messages per second. Your Fact table only needs one position per hour. You need to reduce the granularity.

You do this with a GROUP BY operation on a subset of the Hub references within the Link. The fewer Hub references you group by, the coarser your grain. The result of this aggregation is materialized into a Bridge Table, which then becomes the base for your Fact view.
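
A hedged sketch of that grain shift follows, assuming an NH Link (nhl_vehicle_position) with a vehicle Hub reference, an event timestamp, and position columns; the names, the aggregation functions, and the DATE_TRUNC syntax are illustrative and vary by platform.

-- Coarsen five-messages-per-second positions to one row per vehicle and hour.
-- Grouping only by the vehicle Hub reference (plus the derived hour) drops the
-- finer-grained references, producing the coarser grain for the Fact.
CREATE TABLE bridge_vehicle_position_hourly AS
SELECT
    nhl.vehicle_hk,
    DATE_TRUNC('hour', nhl.event_ts) AS position_hour,
    AVG(nhl.latitude)  AS avg_latitude,
    AVG(nhl.longitude) AS avg_longitude,
    COUNT(*)           AS message_count
FROM nhl_vehicle_position AS nhl
GROUP BY
    nhl.vehicle_hk,
    DATE_TRUNC('hour', nhl.event_ts);

The Fact view then reads directly from this Bridge Table.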

Refining Grain by Joining Links

The inverse is also possible. If you need a finer grain than any single Link provides, you join multiple Links on their shared Hub references. For example, if one Link connects Customer and Store, and another connects Store and Product, joining them on the shared Store reference produces a result with three Hub references — a finer granularity than either Link alone.
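
A minimal sketch of that join, again with assumed Link names (link_customer_store, link_store_product); the result carries three Hub references and therefore a finer grain than either Link alone.

-- Join two Links on their shared Store reference:
-- Customer x Store from one Link, Store x Product from the other.
SELECT
    lcs.customer_hk,
    lcs.store_hk,
    lsp.product_hk
FROM link_customer_store AS lcs
JOIN link_store_product  AS lsp
  ON lsp.store_hk = lcs.store_hk;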

Other Grain Shift Operations

Beyond GROUP BY and joins, there are other grain-shifting operations worth knowing:

  • DISTINCT operations — to remove duplicate rows at a desired level of aggregation
  • JSON array expansion — splitting a JSON array into multiple rows is itself a grain shift, making rows finer

All of these operations can produce results that are materialized in Bridge Tables for performance and reuse, which then serve as the clean input layer for your Fact views.

PIT Tables for Snapshot Facts

There is one important exception to the “Links and Bridge Tables for Facts” rule: snapshot-based Facts. For periodic snapshots — daily inventory levels, end-of-month balances, weekly pipeline status — a PIT Table can serve as an excellent base for your Fact entity.

The catch is that PIT Tables are sometimes pruned over time to keep them manageable. If you need to preserve a long history of snapshots for your Fact, consider using a dedicated Bridge Table to maintain those snapshots independently from your operational PIT Tables. This gives you full control over retention and history without disrupting the PIT Table’s primary purpose.

Beyond Dimensions and Facts: Wider Applicability

The same patterns extend naturally to other target model types:

  • 3NF (Third Normal Form) entities — derived using similar logic from Hubs, Links, and Satellites
  • Flat and wide schemas — built by joining PIT Tables to Bridge Tables and then joining Satellites to pre-join all descriptive attributes into a single denormalized entity
  • Fully denormalized Fact entities — start with a Bridge Table or Link, join PITs, then join the relevant Satellites to fold all Dimension attributes directly into the Fact

This last pattern — the fully denormalized Fact — is particularly powerful for downstream consumers like machine learning pipelines or self-service analytics tools that benefit from wide, flat tables. And it follows the exact same building blocks.

Summary: The Patterns at a Glance

Target Entity | Primary Source | Pattern
Standard Dimension (80%) | PIT + Hub + Satellites | Join PIT → Hub → Satellites, select attributes
Bridge/Relationship Dimension (20%) | PIT on Link + Satellites | Join PIT → Link → Satellites
Transactional Fact | NH Links / Dependent Child Links | View on top of Link at correct grain
Aggregated Fact | Bridge Table (GROUP BY result) | Grain shift → materialize → view
Snapshot Fact | PIT Table or Bridge Table | Snapshot timestamps → Fact view
Denormalized Fact | Bridge + PIT + Satellites | Pre-join all descriptive attributes

Learn the Patterns, Apply Them Everywhere

One of the most satisfying aspects of working with Data Vault is that once you’ve internalized these patterns, they compound. Each new project reinforces the same building blocks: Hubs, Links, Satellites, PIT Tables, Bridge Tables. The same logic for deriving a Dimension applies to deriving a 3NF entity. The same grain management principles for Facts carry over to denormalized wide tables.

Data Vault doesn’t ask you to reinvent the wheel for every project — it gives you a set of reliable, tested patterns and asks you to apply them consistently.

If you’re ready to master these patterns in depth — including hands-on work with PIT Tables, Bridge Tables, and full information delivery models — check out our Data Vault 2.1 Certification and Training. And if you’re just getting started, there’s a free Data Vault handbook available as a concise primer covering the core concepts, pros and cons, and when to use Data Vault in your architecture.

The patterns are there. Once you learn them, they’ll follow you everywhere.

Watch the Video

How to Deal With Late Arriving Data

Late Arriving Data

Late arriving or backdated data is a common challenge in data warehousing. In Data Vault, it is important to distinguish between the technical timeline used for loading data and the business timeline representing when events actually occurred in the real world.



1. Technical Timeline vs Business Timeline

When loading data into the Raw Vault, always use a Load Date Timestamp (LDTS):

  • Set when the record first arrives in your target system (landing zone, data lake, or Raw Vault).
  • Never backdate this timestamp—it should always move forward.
  • Used for incremental loading, delta detection, and reproducibility of snapshots.

This timestamp does not reflect the real-world timing of the data. It is purely a technical artifact to track ingestion order.

2. Capturing the Business Timeline

To handle late arriving or backdated data, use descriptive business dates stored in your satellites, such as:

  • Apply Date / Effective Date: When the data became valid in the source system or real world.
  • Last Modified Date: When the record was last changed in the source system.

These business timestamps allow you to create snapshots or temporal views that reflect the true order of events.

3. Timeline Corrections Without an Extended Tracking Satellite

You can correct timelines without adding additional satellites by leveraging the business timestamps stored in your existing satellites:

  1. Create temporal PIT tables or snapshots based on the business timeline, not the load date.
  2. When late-arriving data is detected:
    • Option 1: Rebuild the affected snapshots to include the late data.
    • Option 2: Apply counter transactions to reverse previous measures and apply the updated values.
  3. Always keep the load date unchanged—it only tracks ingestion, not validity.

This approach ensures that your historical reports reflect the correct business sequence without complicating the Raw Vault model.
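
As a rough sketch of option 1, the query below builds a snapshot driven by the business timeline; the table and column names (sat_order, effective_date, load_ts, amount) and the cut-off date are assumptions, and in practice the result would typically be materialized into a temporal PIT or snapshot table.

-- Latest valid row per order as of a business cut-off date. Late-arriving rows
-- with older effective dates automatically fall into the correct period, while
-- the load date is used only to break ties between corrections.
SELECT order_hk, amount, effective_date
FROM (
    SELECT
        s.order_hk,
        s.amount,
        s.effective_date,
        ROW_NUMBER() OVER (
            PARTITION BY s.order_hk
            ORDER BY s.effective_date DESC, s.load_ts DESC
        ) AS rn
    FROM sat_order AS s
    WHERE s.effective_date <= DATE '2024-06-30'   -- business timeline drives the cut-off
) AS ranked
WHERE rn = 1;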

4. Practical Guidelines

  • Do not order or aggregate data using the load date when interpreting or reporting; always use business dates.
  • Maintain separate timelines:
    • Load Date: Technical, for data ingestion and reproducibility.
    • Business Date: For interpretation, analysis, and handling late arrivals.
  • Rebuild snapshots or use counter transactions as necessary when late data affects measures or aggregates.

Summary

Late arriving data can be handled in Data Vault without adding extra tracking satellites by clearly separating technical and business timelines. Load Date timestamps remain forward-only, while satellites store business dates to drive temporal snapshots and corrections. Using temporal PIT tables, counter transactions, or snapshot rebuilding ensures your analytics reflect the real-world timeline accurately.

Watch the Video

Meet the Speaker

Marc Winkelmann
Managing Consultant

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.

How to Capture CDC Data in Data Vault Satellites

Capture CDC Data

Capturing Change Data Capture (CDC) data in Data Vault can be tricky, especially when the source mostly sends inserts but occasionally produces duplicates or deletions. Understanding how to handle these cases ensures historical accuracy and avoids data inconsistencies in your hubs and satellites.



The Scenario

Consider the following behavior of your source system:

  • Most of the time, rows are insert-only.
  • During initial load, the same row may arrive twice (once in the bulk load and once as an insert within the same batch).
  • Deleted rows may occasionally appear.

These patterns can lead to duplicates if not handled correctly. At first glance, it might look like you need a Non-Historized Link, but duplicates must still be managed properly.

Why Standard Non-Historized Links May Fail

A standard non-historized Link assumes a single row per combination of hubs. When duplicates arrive, either due to CDC or multiple inserts during initial load, the Link cannot naturally distinguish them, leading to primary key conflicts or overwritten data.

A common—but sometimes problematic—solution is adding counter rows to differentiate duplicates. However, this often requires a GROUP BY in the Information Mart, which can cause performance issues, particularly on non-columnar databases.

Recommended Approach: Capture Technical History in Satellites

Instead of modifying the Link, the recommended approach is to handle duplicates in satellites, preserving the raw source events and their arrival order.

Step 1: Use a Satellite with a Load-Date Sequence

For each incoming batch:

  • Assign the CDC load timestamp to the first row of a given parent.
  • If multiple rows for the same parent exist in the batch, increment the timestamp by a small unit (microsecond, millisecond, or nanosecond) for each subsequent row.

This creates a unique ordering of changes while preserving the technical history, without touching the original raw data.

Step 2: Maintain Historical Order

By adding a microsecond increment to the load date for each row:

  • The first row in the CDC batch gets the base timestamp.
  • The second row gets base timestamp + 1 microsecond, the third row +2 microseconds, etc.

This ensures the latest row has the highest load timestamp, which can be used to drive Point-In-Time (PIT) tables and type-1 dimension replacements.
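
A sketch of this sequencing on a hypothetical staging table (stg_cdc_customer with a parent hash key, a per-batch load timestamp, and an arrival-order column); the interval arithmetic syntax varies by database, so treat the details as illustrative.

-- First row per parent keeps the batch load timestamp; every further row in
-- the same batch adds one more microsecond, giving a unique, ordered load date.
SELECT
    stg.customer_hk,
    stg.batch_load_ts
      + (ROW_NUMBER() OVER (
             PARTITION BY stg.customer_hk
             ORDER BY stg.arrival_order
         ) - 1) * INTERVAL '1 microsecond' AS load_ts,
    stg.payload
FROM stg_cdc_customer AS stg;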

Step 3: Preserve Batch or CDC Metadata

If your CDC source provides a batch ID or subsequence number, include it in the satellite. This allows for:

  • Tracking which records arrived together
  • Reconstructing the technical timeline of changes

If no metadata exists, the microsecond sequencing on the load date is sufficient to order the rows.

Handling Non-Historized Links with Duplicates

In rare cases, a non-historized Link may receive multiple rows for the same key combination. To handle this safely:

  • Extend the alternate key to include the load date (or another sequencing attribute) in the hash key calculation, as sketched after the key points below.
  • This ensures each row has a unique primary key without modifying the raw data.

Key points:

  • No need to use counter rows in the raw link.
  • Duplicates are captured and preserved for historical accuracy.
  • Aggregations in PIT or Bridge Tables can be used for reporting, ensuring performance optimization.
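
A sketch of the extended hash key calculation described above, using invented staging column names and MD5 purely for illustration; your platform's hash function, delimiter, and timestamp formatting will differ.

-- Include the load date in the hash input so duplicate key combinations
-- within one batch still produce distinct Link hash keys.
SELECT
    MD5(
        COALESCE(customer_id, '') || '||' ||
        COALESCE(order_id, '')    || '||' ||
        TO_CHAR(load_ts, 'YYYY-MM-DD HH24:MI:SS.US')
    ) AS customer_order_hk,
    customer_id,
    order_id,
    load_ts
FROM stg_customer_order;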

Optional: Bridge Tables for Performance

If your Information Mart requires grouping or deduplication and your database struggles with performance:

  • Create a Bridge Table that pre-aggregates or resolves duplicates.
  • The Bridge Table stores only the latest row (or the aggregated result) for reporting.
  • You maintain the raw satellite history in case full lineage or historical reconstruction is needed.
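
A minimal sketch of such a Bridge Table, assuming a satellite sat_customer_cdc whose load dates already carry the microsecond sequencing described earlier; names are illustrative.

-- Keep only the latest row per parent so the Information Mart can read the
-- Bridge Table directly instead of grouping over the full satellite history.
CREATE TABLE bridge_customer_current AS
SELECT customer_hk, load_ts, payload
FROM (
    SELECT
        s.customer_hk,
        s.load_ts,
        s.payload,
        ROW_NUMBER() OVER (
            PARTITION BY s.customer_hk
            ORDER BY s.load_ts DESC
        ) AS rn
    FROM sat_customer_cdc AS s
) AS ranked
WHERE rn = 1;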

Summary

  • Capture all incoming CDC events in a satellite, including duplicates, without modifying the raw data.
  • Use microsecond increments on the load date to order multiple rows per parent.
  • Include CDC batch metadata if available to preserve groupings and arrival order.
  • For non-historized Links receiving multiple rows, include the load date in the hash calculation.
  • Bridge Tables or PIT tables handle reporting and aggregation efficiently, while maintaining full historical traceability.

This approach preserves auditability, ensures correct historical ordering, and avoids performance issues in the Information Mart.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
