Skip to main content
search
0

Data Vault in a Microservices Architecture

Microservices Architecture and Data Vault: Managing Satellites at Scale

Microservices architectures create a specific modeling challenge for Data Vault practitioners. When services are ephemeral — spinning up and down as Docker or Kubernetes containers — each with its own message structure, the standard advice to split Satellites by source system quickly leads to hundreds or thousands of Satellites. At that scale, the real question isn’t about metadata management overhead. It’s about how to consume all that data without joining 500 tables every time you need an answer. This post walks through a practical approach to handling high-volume, highly varied source structures in a Data Vault model.



Microservices Architecture: Why Satellite Splits Become a Problem

The conventional Satellite splitting rules — by rate of change, source system, security, and privacy — exist for good reasons. But in a microservices context, applying them strictly leads to an explosion of Satellites. A new Docker image with a new message structure technically deserves its own Satellite. Automate that process and you accumulate hundreds or thousands of Satellites quickly, most of which may never be queried by anyone.

The issue isn’t that databases can’t handle 500 tables — they can. The issue is the consumption side: joining 500 Satellites to produce a target model is expensive, complex to maintain, and in many cases unnecessary. The real challenge is finding a modeling approach that captures the variety of incoming structures without creating an unmanageable query layer downstream.

Rate of Change Splits: Still Relevant, but Less So for Now

The rate of change split was designed to reduce storage consumption by separating high-frequency attributes from stable ones. Every delta insert copies all columns in the Satellite, so a single change on one attribute in a wide Satellite wastes a lot of storage on unchanged data.

For most modern analytical database systems, compression makes this largely unnecessary. Insert-only tables with lots of redundant data compress extremely well, and virtually all modern analytical platforms support this. The storage cost of skipping the rate of change split is manageable with compression turned on.

That said, this is worth watching. In pay-per-query environments like Athena querying row-based Avro files, or systems that charge based on uncompressed data scanned, the rate of change split becomes economically relevant again. BigQuery’s columnar storage sidesteps this because you only pay for the columns you query — but other managed infrastructure doesn’t work that way. The rate of change split isn’t obsolete; it’s just less pressing for now, and likely to become more relevant as managed, consumption-based pricing models become more common.

The Flip-Flop Effect: Why Source System Splits Still Matter

The source system split is a different matter. Loading data from two different source systems into the same Satellite creates a well-known problem: the flip-flop effect.

Consider a customer whose address is known to both an ERP system (California) and a CRM system (Hannover, Germany). The two systems have different knowledge and potentially different structures for representing the same data. If both load into the same Satellite, the Satellite ends up recording two deltas per day — not because the customer moved, but because two systems loaded sequentially with different values. The data flips between California and Hannover with every load cycle, consuming storage and making it impossible to determine the actual address without applying business logic. Worse, the order of loading determines what the Satellite shows at any given moment — a purely technical artifact with no business meaning.

The fix is straightforward: one Satellite per source. This keeps each system’s view of the data independent and equally available, so business logic in the Business Vault can reconcile them deliberately rather than having the Raw Data Vault collapse them accidentally.

The Gray Area: Millions of Sources, One Practical Solution

The flip-flop rule works cleanly when you have a manageable number of distinct source systems. It breaks down at the extreme end — IoT deployments with millions of sensors, or microservices architectures with hundreds of ephemeral containers — where creating one Satellite per source is operationally impractical.

The solution here depends on two conditions being met. First, you need a key in the parent entity that partitions the data by source — a sensor ID, a Docker image ID, a tenant ID, something that creates independent delta streams within the same Satellite. With this in place, deltas from source A can’t replace or invalidate deltas from source B, which eliminates the flip-flop effect without requiring separate Satellites. Second, the structure of the incoming data must be consistent enough to fit in a shared target — which in practice usually means JSON.

When messages from different microservices or sensors all arrive as JSON — even with different internal structures — you can load them all into a single Satellite or Non-Historized Link with a JSON or JSONB payload column. The structure differences are captured inside the JSON document. You add the partitioning key to the parent, and you’re done. Instead of 500 Satellites with 500 different schemas, you have one entity with a JSON payload and a key that tells you which source produced each record.

Non-Historized Links for Real-Time Messages

For real-time message streams from microservices, a Non-Historized Link with a JSON payload is often the right structure. Real-time messages are events — they don’t update, they accumulate. The flip-flop concern largely disappears because you’re capturing messages as they arrive, not loading full snapshots that might overwrite each other. A Non-Historized Link captures the event, the relevant Hub references, and the message payload in a structure that’s fast to load and straightforward to query.

This same pattern was applied at Scalefree for an investment banking client with 500 different source systems delivering asset data in different CSV formats. Rather than creating 500 entities, a single Non-Historized Link and Satellite captured everything — different CSV structures serialized as JSON strings, distinguished by a load source identifier. Two entities replaced 500, and the consumption layer handled the structural variety through filtering and extraction rather than joins.

Consuming Semi-Structured Data Without Joining 500 Tables

Loading everything into a JSON payload doesn’t eliminate the structural variety — it defers it to query time. When you need data from a specific message type, you need to identify records with the right structure among all the records in the same target entity.

The approach here is filtering rather than joining. Instead of joining 500 Satellites, you query one entity and filter for records that contain specific JSON keys or values that uniquely identify the message type you care about. Email messages, for example, always have a subject, body, sender, and recipient — keys that distinguish them from other message types. A specific transaction type might always carry an ID starting with a known prefix. These structural signatures let you extract subsets of the JSON stream efficiently.

Once filtered, you extract the attributes you need from each subset and UNION the results if you need to combine multiple message types. A UNION of 500 filtered queries on one table is significantly faster than a JOIN of 500 separate tables, and it scales much better as the number of source types grows.

Choosing the Right Approach for Your Context

The right answer depends on where you sit on the spectrum between a small number of structurally distinct source systems and a very large number of structurally similar ones. For a handful of systems with genuinely different schemas and different business semantics — CRM, ERP, financial systems — separate Satellites per source is the right call. The flip-flop effect and structural differences make consolidation risky and introduce business logic where it doesn’t belong.

For microservices, IoT devices, or any scenario where you have many sources with similar structures and a partitioning key available, consolidating into a small number of JSON-payload entities is usually the better trade-off. It simplifies loading, reduces metadata overhead, and keeps the consumption layer manageable — at the cost of pushing structural interpretation into filtering and extraction logic downstream.

To go deeper on Satellite design, source system splits, and Data Vault modeling patterns for modern architectures, explore our Data Vault certification program. The free Data Vault handbook is also available as a physical copy or ebook for a solid grounding in the core methodology.

Watch the Video

Salesforce Spring’26 Update in a Nutshell

The Salesforce Spring’26 release has officially arrived, introducing a significant wave of functional updates designed to increase administrative efficiency and improve the end-user experience.

At Scalefree, we have analyzed the extensive release documentation alongside insights from our ecosystem partners to identify the most impactful changes for your business. Here is a professional summary of the key features now available in your environment.

Enhanced Reporting: Dashboard Table Integration

The gap between reports and dashboards has narrowed. You can now utilize native Report Table settings directly within Dashboard components. This update ensures that your dashboard tables automatically respect conditional highlighting, manual column widths, and summary rows previously defined in your source reports, providing a more consistent and professional data visualization experience.

Proactive Governance: Security Health Check Advancements

Salesforce has expanded the Health Check suite to offer more granular visibility into org vulnerabilities. New features include real-time monitoring of session settings and enhanced credential auditing. These tools transition security from a periodic review to a proactive, continuous defense strategy for your business data.

Note for Marketing & Operations: While native tools provide a strong foundation, maintaining an optimized environment requires a strategic approach. If you are looking to validate your configuration against industry best practices, we recommend the Scalefree Admin Buddy. It is designed to help organizations maintain a lean, high-performing Salesforce instance. Learn more about Scalefree Admin Buddy

Automation Excellence: The Evolution of Flow

The Spring ’26 release continues to prioritize Flow as the central engine for business logic, specifically bridging the gap between manual intervention and automated efficiency.

  • Integrated Approval Components: You can now embed approval processes directly within Screen Flows. This allows users to review, comment on, and approve records within a single interface, significantly reducing context switching and improving process velocity.
  • File-Triggered Automation: A long-awaited update, both ContentVersion and ContentDocument are now available as entry criteria for Record-Triggered Flows. This allows for immediate automated actions—such as notifications or status updates—the moment a document is uploaded.

Operational Transparency: Data 360 and Logging

To support more complex automation, Salesforce has introduced Data 360 for Flow Logging. This provides a comprehensive audit trail for your automated processes, functioning as a diagnostic “black box” to provide clear visibility into flow execution and simplify troubleshooting.

Additional Resources

This summary covers only a fraction of the Spring ’26 capabilities, which also include advancements in AI-driven agents and developer productivity tools.

If you would like a deeper technical analysis or wish to discuss how these specific features can be applied to your current business processes, please reach out to one of our experts. We are happy to share our detailed sources and internal testing experiences with you.

Book Your Expert Call

Scalefree @ IT-Tage 2025: Global Innovation, Local Trust

Salesforce IT-Tage Travel Report

How to combine Salesforce innovation with European data standards for a competitive edge in 2026

Fresh back from Kap Europa in Frankfurt, the Scalefree Salesforce team has unpacked a lot of inspiration. While our daily focus is on evolving Customer Relationship Management, IT-Tage 2025 offered an impressive 360-degree view of the IT landscape.

One thing became clear immediately: The “Sovereignty Blues” of recent years has evaporated. The industry is no longer complaining about dependencies; it is in full execution mode.

Here are our top takeaways for decision-makers and admins on how to navigate the future of Salesforce and Digital Sovereignty.

Salesforce IT-Tage Travel Report

Digital Sovereignty: The New “Freedom of Choice”

Sovereignty was the red thread running through the keynotes (Jutta Horstmann’s talk on digital strategy was a standout).

Salesforce Digital Sovereignty IT Days Travel Report

The Scalefree Take: Sovereignty is not a “Cloud Ban.” It is the ability to actively manage your technological dependencies.

  • Best of Breed: The 2025 standard is combining global innovation (SaaS giants like Salesforce) with local security standards.
  • Data Control: In our discussions, it became clear that “Data Residency” isn’t enough. True sovereignty comes from Process Excellence. You are only sovereign if you understand your data architecture and own your business logic.
  • The Result: We can use modern integration architectures to give you the best of both worlds—US innovation with EU compliance.

Architecture & GenAI: Looking into the Crystal Ball

The sessions on Generative AI and software architecture were eye-openers. But we also saw the need for a reality check.

The Scalefree Take: AI is not a strategy; it’s a tool.

  • Lifecycle over Code: LLMs change how we write code, but they don’t replace the need for planning. If anything, the ethical responsibility and security of the generated logic matter more than the programming language itself.
  • No AI without Clean Core: For our Salesforce clients, this confirms our mantra: AI-driven automation must be embedded in a robust architecture. If you automate chaos with AI, you just get faster chaos. Data Quality is the prerequisite for AI success.

Sustainability: More Than Just Marketing

“Green IT” wasn’t just a buzzword this year; it was about hard metrics. We saw deep dives into how energy-efficient algorithms can minimize CO2 footprints.

The Scalefree Take: Efficiency equals Stability.

  • Resource Intensity: A reflective talk by Hannah Herbst highlighted the resource cost of LLMs.
  • Scale Smart: For Salesforce Admins and CFOs, this is crucial. A bloated Salesforce instance isn’t just expensive to license; it’s technically “unhealthy.” Optimizing your Org isn’t just good for the planet—it’s good for your bottom line.
Salesforce IT Days Travel Report

Conclusion: Our Roadmap for 2026

IT-Tage 2025 confirmed that we are on the right track.

Digital Sovereignty and high-efficiency Customer Management with Salesforce do not contradict each other—they condition each other.

Whoever maintains control over their architecture (Sovereignty) and enforces strict Data Quality (Excellence) can use the best and most innovative tools (Salesforce) safely and sustainably.

Frankfurt, see you in 2026!

Salesforce IT Days Travel Report

How Do You Model External Business Logic In Data Vault?

Modeling External Business Logic in Data Vault: APIs, Scripts, and Source System Thinking

A question that comes up regularly in Data Vault training is how to handle external business logic — specifically, what happens when your data pipeline includes a call to an external API or service that returns enriched or cleansed data. Where does that fit in the model? How do you capture the response? And how do you integrate an external script cleanly into your enterprise data platform? This post walks through a concrete example: address cleansing via an external REST API.



Modeling External Business Logic: The Full Flow

The scenario starts simply enough. You have CRM data — let’s say customer records with addresses — that gets staged and broken down into the Raw Data Vault in the usual way: Hubs for business concepts, Satellites for descriptive attributes. The raw address from the CRM system lands in a Satellite.

Now comes the complication. You need to cleanse and standardize those addresses using an external REST API. A Python script handles the call: it pulls data from the platform, formats it into the required input — perhaps a single string or a calculated key — and sends it to the external service. The service returns a JSON response with the standardized address and additional metadata.

This flow touches several layers of the Data Vault architecture, and each layer has a distinct role.

The Business Vault Prepares the API Call

Before the external call can be made, the Business Vault does preparatory work. If the REST API requires the address in a specific format or needs a calculated key, that computation belongs in the Business Vault — it’s business logic, applied to raw data, to produce the input for an external process.

The external Python script then queries this prepared data — either directly from the Business Vault or via an Interface Mart (more on that below) — and performs the REST call. The script itself may be under version control and within your organization’s control. The external service is not.

Treat the External Service as a Source System

This is the key modeling decision: because the external API is outside your control, you treat its responses exactly as you would treat any other source system. You don’t trust it implicitly. You stage its output and break it into the Raw Data Vault.

If your Raw Data Vault already has an Address Hub from the CRM dataset, and the external service returns identifiers that qualify as business keys — unique, stable identifiers for addresses — those can be added to the Address Hub. The JSON response from the API then gets captured in a Satellite in the Raw Data Vault, associated with the appropriate Hub.

This approach gives you a clean audit trail. You know exactly what the external service returned, when it returned it, and what key was used to make the call. If the external service changes its response structure or returns unexpected data, your Raw Data Vault captures that reality as-is, and your downstream Business Vault logic handles interpretation.

Handling JSON Responses: Two Practical Options

API responses typically come back as JSON — sometimes well-structured, sometimes semi-structured with varying schemas between messages. There are two main approaches for capturing this in the Raw Data Vault, and the right choice depends on how structured the response is and how many attributes you actually need.

Option 1 — Extract what you need, keep the rest as JSON. If the JSON is relatively consistent and you only need a subset of its attributes — say, five out of fifty — extract those five into relational columns in the Satellite. Keep the full JSON (or the remaining payload) as a JSON or JSONB attribute in the same Satellite. You get fast, typed access to the attributes you use regularly, and the full document is available for future needs without requiring a reload.

Option 2 — Keep everything in JSON, extract in the Business Vault. If you’re unsure which attributes you’ll need, or if the structure varies, capture the raw JSON in the Satellite and handle extraction later in the Business Vault. Technically, extracting fields from JSON is a structural transformation — a hard rule, not a business rule — so it could sit in the Raw Data Vault. But if the extraction is straightforward and tied to specific downstream calculations, doing it in the Business Vault view is a reasonable and common practice.

In practice, the hybrid approach from Option 1 is most common: extract the attributes you know you need into relational columns, keep the JSON alongside them. When a new attribute is needed later — and it will be — you can pull it directly from the JSON in your Business Vault view using native JSON functions, without touching the Raw Data Vault or reloading any data.

Integrating the External Script: Dependencies and Interface Marts

When an external script queries your data platform — whether from the Raw Data Vault or the Business Vault — it creates a dependency. The entities that script relies on can’t be freely refactored without risking a broken integration. This is worth flagging explicitly in your metadata: mark those entities as part of the operational vault, indicating that external applications depend on them.

A cleaner long-term solution is to introduce an Interface Mart — a stable, versioned view layer that the external script queries instead of the Raw or Business Vault directly. When you refactor a Satellite or restructure a Business Vault entity, you update the Interface Mart view to maintain the same output structure. The external script sees no change. This decouples your internal model evolution from external integrations, which is especially valuable in organizations where multiple scripts and applications consume data from the platform.

Combining Two Sources in the Business Vault

At this point, you have two sources describing the same concept: the CRM system with the original, non-standardized address, and the external address standardizer with the cleansed version. Both are captured in the Raw Data Vault as separate source inputs. The Business Vault is where you bring them together.

The pre-computed key used to make the API call serves as the joining mechanism. Based on that key, you can establish a relationship — via a Link or a direct join in a Business Vault view — between the raw CRM address and the standardized version returned by the external service. The Business Vault then exposes the combined, cleansed address data to downstream consumers: reports, dashboards, or further downstream application scripts.

The exact modeling decisions at this stage depend heavily on how the CRM data is structured and what the business actually needs from the cleansed address. But the principle holds regardless: raw inputs from both the CRM and the external API live in the Raw Data Vault; the logic that combines and interprets them lives in the Business Vault.

A Pattern Worth Generalizing

Address cleansing is one example, but the same pattern applies to any external enrichment service: geocoding APIs, credit scoring services, entity resolution services, tax calculation engines. Whenever your pipeline includes a call to an external system that returns data you need to capture and use, the approach is the same — treat the response as a source, stage it, load it into the Raw Data Vault, and apply interpretation and combination logic in the Business Vault.

It’s also worth noting that this pattern integrates naturally into data-driven organizations where information is consumed not just through reports and dashboards but through application scripts and automated processes. The enterprise data platform becomes a hub for both analytical and operational consumers — and Data Vault’s layered architecture handles both cleanly.

To explore these patterns in depth — including Business Vault design, Interface Marts, and integrating external sources — check out our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical copy or ebook for a solid introduction to the core methodology.

Watch the Video

Capturing Changing Inventory Levels in Data Vault

Capturing Changing Inventory Levels in Data Vault with Non-Historized Links

Inventory data presents a specific challenge in Data Vault modeling: it arrives as snapshots, it can be corrected retroactively, and deletions need to be traceable back to a specific snapshot date. A user recently proposed a solution using a multi-active Snapshot Satellite — a reasonable starting point — but the question of how to handle corrections and logical deletes at the snapshot level pointed toward a cleaner approach. This post walks through the recommended pattern: a Non-Historized Link with technical counter transactions.



Capturing Changing Inventory Levels: Understanding the Source Data

The scenario is this: a source system sends inventory data as full snapshots. Each extract may contain data for multiple snapshot dates, and if a correction exists for a previously loaded snapshot, the source resends the complete data for that snapshot date. The inventory level represents the count of a product at a given location and store at the end of a specified day.

This is what Kimball defines as a snapshot-based fact — a periodic full count of all products in a warehouse, day by day. The complication is that those snapshots can be corrected. Three days after an initial count, a miscalculation might be identified and the inventory level for that past snapshot date needs to be updated — without affecting the two more recent snapshots.

Why the Multi-Active Satellite Approach Has Limitations

The proposed solution used a Link referencing Product, Store, and Location Hubs, with a multi-active Satellite attached. The snapshot date was nominated as the multi-active attribute, added to the Satellite’s primary key alongside the hash key and Load Date Timestamp.

The instinct is sound — you need to track multiple snapshots and their corrections over time. But there’s a problem with using a business-supplied date as the multi-active attribute in the Raw Data Vault: it means trusting the source. In the Raw Data Vault, that’s a risk worth avoiding.

A more robust approach is to use the subsequence from staging as the multi-active attribute instead. The subsequence is assigned during staging and is guaranteed to be unique within each incoming batch. Because it’s unique within the batch, it’s also unique within any subgroup of that batch — including a multi-active group. If something goes wrong with it, it’s a problem you control and can fix. The snapshot date from the source then becomes what it actually is: a descriptive business timeline, added to the Satellite payload like any other descriptive attribute.

But even with this correction, the multi-active Satellite approach doesn’t cleanly solve the core problem: how do you capture a logical delete at the snapshot level — when an item disappears from a re-sent snapshot that was already loaded?

The Non-Historized Link Approach

The cleaner solution is to use a Non-Historized Link rather than a standard Link with a Satellite. Non-Historized Links are designed precisely for capturing facts coming from a source — transactions, events, and inventory levels. They sit close to the incoming data structure and are loaded incrementally using an Alternate Key.

For this scenario, the Non-Historized Link contains the following fields:

  • Product Key, Store Key, Location Key — the Hub references (hash keys)
  • Load Date Timestamp (LDTS) — when the record was received
  • Snapshot Date (SD) — the business timeline from the source, indicating when the inventory was counted
  • Snapshot Date Timestamp (SDTS) — the Data Vault snapshot date used in information delivery
  • A counter column (X) — with a value of 1 for records coming from the source, and -1 for technical counter transactions

The Alternate Key — used to identify whether a record already exists in the target — is the combination of Product Key, Store Key, Location Key, Snapshot Date, and Load Date Timestamp.

Two INSERT Statements: Inserts and Counter Transactions

Loading this Non-Historized Link uses two parallel INSERT statements, not an UPDATE pattern.

INSERT statement one loads all records from the Staging Area where the Alternate Key does not yet exist in the target. This covers both genuinely new records and new versions of corrected records — because the corrected version, with its new Load Date Timestamp, doesn’t exist in the target yet and therefore qualifies as new.

INSERT statement two handles deletions and replaced versions. It identifies records that exist in the target but are absent from the current full-load snapshot in the Staging Area. These records are either hard-deleted in the source or represent the old version of an updated record. For each of these, a technical counter transaction is inserted — an identical row with a counter value of -1 instead of 1.

This means an update in the source produces two rows in the target: a -1 counter transaction that cancels the old version, and a 1 insert for the new version. The data itself is never modified — only the structure changes. This aligns with the hard rule in Data Vault: the Raw Data Vault transforms structure, not content.

How Aggregation Reveals the True Inventory Level

The power of this pattern becomes clear at query time. By summing the counter column across all rows matching a given Alternate Key, you can determine the current state of any inventory record at any point in time.

Here’s how it plays out across several days for a single product-store-location combination:

  • Day 2: Initial snapshot for Day 1 is loaded. Counter value: 1. Sum = 1. One active inventory record.
  • Day 4: A correction arrives for Day 1. The old version gets a -1 counter transaction; the corrected version gets a 1 insert. Sum = 1 - 1 + 1 = 1. Still one active record, now at the corrected level.
  • Day 5: The record is deleted from the source for Day 1. A -1 counter transaction is inserted. Sum = 1 - 1 + 1 - 1 = 0. Zero active records. The inventory level for that snapshot date is gone.

By filtering on the Snapshot Date and aggregating the counter column, you always know exactly how many active facts exist for a given snapshot — and what the current inventory level is. By filtering on the Load Date Timestamp, you can also travel back in time and see what the inventory looked like from the perspective of any earlier date. This gives you both a corrected view of inventory history and a full audit trail of when corrections were made.

Performance and Reporting

On a column-based storage engine, this pattern performs extremely well. Aggregating a counter column across large volumes of inventory records is fast, and the resulting structure integrates cleanly with dashboards. Business users see dimension references for Product, Store, and Location; a snapshot date for time-based aggregation; the inventory level measure; and the counter column for filtering active records. The behavior is exactly what they’d expect — current inventory levels that reflect corrections, with history available on demand.

Where This Pattern Also Applies

Technical counter transactions in Non-Historized Links are not specific to inventory data. The same pattern applies anywhere you have full-load snapshot data with corrections — any scenario where an update in the source needs to be represented as a deletion of the old version and an insertion of the new one, without modifying existing rows.

The pattern can also be applied in the Business Vault, particularly in Bridge Tables, when the counter transactions involve calculated values that need to be corrected rather than raw source data. The principle is the same; the layer where it’s applied depends on whether the transformation is structural (Raw Data Vault) or involves business logic (Business Vault).

For more recordings on Non-Historized Links and technical counter transactions, the Scalefree YouTube channel has several dedicated sessions on the topic. And to master these patterns hands-on, explore our Data Vault 2.1 Training & Certification. The free Data Vault handbook — available as a physical copy or ebook — is also a great starting point for the core concepts.

Watch the Video

Model Access in dbt: Governing Analytics at Scale with Groups

Model Access in dbt

As dbt projects grow, so do the challenges around collaboration, ownership, and reuse. What starts as a small analytics codebase can quickly evolve into a complex ecosystem of models shared across teams, domains, and even projects. Without clear boundaries, it becomes difficult to understand who owns what, which models are safe to reuse, and how to scale analytics without breaking downstream consumers.

This is where model access and groups come into play. Together, they form a powerful governance mechanism in dbt that helps teams structure responsibility, control visibility, and safely enable cross-project data sharing.

In this article, we’ll break down what groups are, how model access works, and how these features support scalable analytics and data mesh architectures.



What Are Groups in dbt?

A group in dbt is a named collection of nodes within a project. Groups provide a way to logically organize resources and define ownership, which becomes increasingly important as more people and teams contribute to the same dbt codebase.

Groups can include the following node types:

  • Models
  • Tests
  • Seeds
  • Snapshots
  • Analyses
  • Metrics

It’s important to note that sources and exposures are not included in groups.

There are a few key rules to understand:

  • Each node can belong to only one group
  • Every group must have a name
  • Every group must have an owner

The owner definition requires at least a name and an email address. Typically, the owner represents a team rather than an individual, such as an Analytics or Finance team. This explicit ownership makes responsibilities visible and helps clarify who to contact when questions or issues arise.

Beyond organization, groups play a critical role in how dbt enforces model access rules. In fact, model access is defined and evaluated in the context of groups.

What Is Model Access in dbt?

Model access is a governance feature in dbt that controls how and where a model can be referenced. It works alongside other governance capabilities such as model contracts and model versions.

By assigning an access level to each model, teams can define which models are internal implementation details and which are intended for broader reuse.

dbt provides three access levels:

Private

Private models are the most restrictive. They can only be referenced by other models within the same group.

This is ideal for intermediate or helper models that support a specific team’s logic but are not meant to be consumed outside that context.

Protected

Protected is the default access level in dbt. Protected models can be referenced by any group within the same project, or by other projects if the project is installed as a package.

This level supports collaboration within a single dbt project while still preventing accidental exposure across project boundaries.

Public

Public models are designed to be consumed outside the project they are defined in. These models can be referenced from other dbt projects using cross-project references.

Public access is especially important for organizations adopting a data mesh approach, where teams expose trusted data products for others to consume.

One important limitation to keep in mind: models materialized as ephemeral cannot be public. Since ephemeral models are not materialized in the warehouse, they cannot be safely shared across projects.

Cross-project public access requires the Enterprise tier of dbt Cloud.

How Groups and Model Access Work Together

Groups and model access complement each other. While groups define ownership and responsibility, access modifiers define visibility and usage.

Consider the following example configuration:

groups:
  - name: analytics
    owner:
      name: Analytics team
      email: [email protected]

models:
  - name: orders_per_supplier_country_customer
    config:
      access: private
      group: analytics

  - name: orders_per_customer
    config:
      access: protected
      group: analytics

  - name: orders_per_country
    config:
      access: public
      group: analytics

In this setup, all models belong to the analytics group, but each has a different access level.

The private model can only be referenced within the analytics group. The protected model can be referenced anywhere inside the same project. The public model can be referenced by other projects entirely.

In the dbt DAG and lineage views, private and public models are visually marked, making it easier to understand boundaries at a glance.

This combination allows teams to clearly communicate intent:

  • Which models are internal building blocks
  • Which models are safe for internal reuse
  • Which models are stable data products

Public Models and Cross-Project References

Public models enable collaboration across dbt projects. This is particularly valuable when different teams manage separate projects but still need to share data in a controlled way.

To make public models available to downstream projects, the upstream project must meet two conditions:

  • An environment defined as PROD or STG
  • At least one successful job run in that environment

These requirements ensure that dbt has generated the necessary metadata for downstream reference resolution.

In the downstream project, the upstream project is added as a dependency:

# dependencies.yml
projects:
  - name: hub_speak_dmytro_base

Once declared, public models from the upstream project can be referenced using a two-argument ref() function:

{{ ref('hub_speak_dmytro_base', 'dim_customer') }}

This pattern allows downstream teams to consume only the models explicitly marked as public, while all internal logic remains private or protected in the upstream project.

Why Model Access Matters for Data Mesh

Model access is a foundational capability for implementing data mesh with dbt.

By treating public models as stable interfaces, teams can publish data products with clear contracts and ownership. Downstream consumers rely on these models without needing to understand or depend on upstream implementation details.

At the same time, teams retain full control over what they expose. Internal experimentation and refactoring can happen safely behind private and protected boundaries.

This approach enables autonomy without chaos — a key principle of data mesh.

Conclusion

As dbt projects scale, governance becomes essential. Groups and model access provide simple but powerful tools to define ownership, enforce boundaries, and enable safe reuse of analytics models.

By thoughtfully combining groups with private, protected, and public access levels, teams can scale collaboration, support cross-project data sharing, and build reliable data products without sacrificing flexibility.

In upcoming sessions, we’ll demonstrate these concepts hands-on in dbt Cloud, showing how producer and consumer projects interact and how these relationships appear in the Catalog and lineage views.

If you’re working toward a scalable analytics or data mesh architecture, mastering model access in dbt is a crucial step.

Watch the Video

Unit of Work (UOW) Links in Data Vault

Unit of Work Links in Data Vault: Purpose, Strategy, and Flow.BI

As FlowBI — the GenAI tool that generates Raw Data Vault models by analyzing source data — becomes more widely used, questions about its specific modeling decisions are coming up more frequently. One of those questions is about a particular Link type that FlowBI produces: the Unit of Work Link. What is it, why does it exist, and when should you use it? This post breaks it down.



Unit of Work Links: What They Are and Why They Matter

In Data Vault modeling, a Link captures a relationship between two or more Hubs — a business event or association expressed through a combination of Business Keys. When modeling a source table that contains multiple Hub references, a natural modeling instinct is to split those relationships into separate, more focused Links. One Link for a customer-product relationship, another for an employee-customer relationship, and so on.

Splitting Links is perfectly valid in Data Vault — but it comes with a constraint. Certain Business Keys must remain together to preserve what is called the Unit of Work: the set of Business Keys that belong together as a single, consistent business event. Separate them incorrectly, and you risk breaking data consistency in your model. This is one of the more subtle but consequential mistakes a modeler can make, and it’s discussed in detail in the Data Vault training and certification curriculum.

The Unit of Work Link is the answer to that risk. It is a Link that spans all Hub references coming from a single source table — unsplit, unfiltered, capturing the full set of Business Keys exactly as they appear together in the source. Think of it as the safe baseline: a Link that guarantees the Unit of Work is preserved, no matter what splitting decisions get made on top of it.

The Human Modeler Problem — and How FlowBI Handles It

FlowBI works by analyzing source data profiles and generating the Hubs, Links, and Satellites needed to capture that data correctly in a Raw Data Vault — one that passes the JEDI test. It integrates with automation tools including Data Vault Builder, Datavault4dbt, and others.

But here’s the design challenge FlowBI had to address: it imitates a human data modeler. And human data modelers make mistakes. One of the most common is splitting Links incorrectly and inadvertently violating the Unit of Work. So the question during development was: do you try to eliminate these mistakes entirely — at the cost of no longer truly imitating human modeling behavior — or do you accept that errors can happen and build a counter-strategy?

FlowBI chose the counter-strategy. The Unit of Work Link is that strategy. By always generating a Link that preserves the full set of Hub references from a source table, FlowBI ensures there is always a valid, consistent fallback in the Raw Data Vault — even if additional, more specific Links turn out to have been split incorrectly.

The Practical Modeling Strategy Behind It

This isn’t just an AI design decision — it reflects a sound modeling practice for human modelers as well. The approach works like this:

When working with a source table that contains multiple Hub references, always create one Link that spans all of them. This is your Unit of Work Link. Then, if your understanding of the source data is strong enough — if you’ve been able to validate the relationships and run the appropriate tests — you can introduce additional, more specific Links that extract individual relationships from the source. A hierarchy Link, an employee-to-customer Link, a product-to-order Link, and so on.

The key point is that the Unit of Work Link remains. It doesn’t get replaced. If one of the more specific Links turns out to violate the Unit of Work — because the source data behaved differently than expected, or because access to production data was limited during modeling — you still have the original Link to fall back on. The model remains valid. You can retrieve the correct relationship from the Unit of Work Link and fix the problem without reloading the Raw Data Vault.

The less familiar you are with a source dataset — particularly in cases where you can’t access production data directly or can’t run comprehensive validation tests — the more valuable this approach becomes. When in doubt, don’t split. Protect the Unit of Work first.

Splitting in the Business Vault Instead

If a Unit of Work Link exists in the Raw Data Vault and you later want to split it into more focused relationships, that work belongs in the Business Vault — not the Raw Vault. You use the Unit of Work Link as the basis for either virtual or materialized Links in the Business Vault, where the split is applied.

The advantage of this approach is containment. If the split is wrong — if it turns out to violate the Unit of Work — the fix happens in the Business Vault. The Raw Data Vault doesn’t need to be reloaded. The source of truth stays intact. You correct the business logic without touching the foundation.

This is exactly the kind of separation of concerns that makes Data Vault resilient. Raw data is captured as-is, close to the source. Business logic — including relationship refinement — happens in the layer designed for it.

A Rule Worth Adopting for Any Modeler

The Unit of Work Link isn’t just a FlowBI artifact. It’s a principle any Data Vault modeler can and should apply. If you’re splitting Links in your Raw Data Vault, ask yourself: where is your Unit of Work? Is there a Link in your model that preserves the full set of Business Keys from each source table, regardless of how you’ve split them elsewhere?

If the answer is no, you’re relying on every split being correct — and on your understanding of the source data being complete. That’s a reasonable bet when you know the data well. It’s a riskier one when you don’t. The Unit of Work Link costs very little to include and provides a meaningful safety net in return.

To go deeper on Link modeling, Unit of Work concepts, and the full Data Vault methodology, explore our Data Vault certification program. And if you’re new to Data Vault, the free handbook — available as a hard copy or ebook — is a solid introduction to the core concepts.

Watch the Video

The AI-Enabling Data Platform: Unlocking Scalable, High-Quality AI Applications

AI Enabling Data Platform

Is your company building an AI time bomb?

Many businesses are rushing to deploy AI prototypes that look impressive during a demo but hide massive, systemic risks. From “hallucinating” bots that give dangerous advice to customers to catastrophic legal liabilities, simple AI setups can quickly become a corporate nightmare.

If your AI strategy depends on unorganized data and ungoverned workflows, you aren’t just experimenting, you are creating a “data debt” that could bankrupt your project or compromise your company’s reputation. If you want to move beyond these risky experiments and build AI that is efficient, scalable, trusted, and actually works for your business, you need a different approach. Learn how an AI-Enabling Data Platform protects your company while unlocking the true power of high-quality, scalable AI.

The AI-enabling Data Platform – Unlocking high-quality AI Applications

To scale AI effectively, organizations must move beyond unmanaged prototypes toward an AI-Enabling Data Platform that addresses security risks and poor data governance. By transforming fragmented data into governed Feature Marts, this architecture ensures the high-quality, compliant data foundation necessary for reliable AI workflows. This shift ultimately solves the maintenance and liability issues that typically hinder AI return on investment. Learn more in our upcoming webinar on February 17th, 2026!

Watch Webinar Recording

Moving Beyond the Prototype

It usually starts with a spark of excitement. You build a small AI tool or workflow using a Large Language Model (LLM), and it works! It answers questions, summarizes text, and saves your team hours of manual labor. This is the “honeymoon phase,” where everything feels possible and the technology seems like magic.

But then, you try to scale. You move from a single user to a whole department, or from a small test folder to your entire company database. Suddenly, things get quite complex. The AI starts making mistakes it didn’t make before so you extend your AI workflows with data adjustments and exceptions, and the system starts breaking regularly. The legal team finds out about the project and starts asking difficult questions regarding data privacy and “black box” decision-making.

Does this sound familiar? You may have seen this in your own projects: A demo that looks great in a controlled environment but cannot handle the pressure of real, messy business use, and gets stuck in PoC purgatory. Without a professional foundation, your AI applications quickly change from being a business asset to becoming a massive liability.

Why Your Current AI Setup is Failing

To understand the solution, we must first look at why most AI initiatives fail when they leave the lab. The problem is almost always the same: a total lack of governance and messy (non-cleansed, non-standardized, or non-integrated) data.

While major LLM models are “trained” generally, they often lack access to the specific “facts” of your business in a way they can understand. This leads to several major threats:

  • The “Hallucination” Risk: If the AI isn’t connected to a “Single Source of Facts,” it guesses. It makes up facts about your product features, delivery times, or prices. In a business setting, a wrong answer isn’t just a mistake but a breach of trust that can quickly destroy a customer relationship.
  • The Maintenance Nightmare: Without a central data platform, every time your source data structure or business logic changes, you have to manually update every single AI tool and workflow you’ve built that touches this piece of data. This makes long-term maintenance impossible and kills the hoped-for ROI of your new AI application.
  • The Legal Challenge: Legal frameworks don’t magically disappear when working with AI. Furthermore, additional frameworks like the EU AI Act are adding new layers of regulatory compliance requirements. If you cannot explain why your AI gave a specific answer or which data it used, you could face massive fines. Using sensitive data without a clear audit trail is a gamble most companies cannot afford.

The Two Traps of Modern AI Development

After the honeymoon phase of the LLM era, companies want to adapt quickly. However, they almost always fall into one of two typical traps. You might recognize these patterns in your own organization:

Trap 1: The “AI Spaghetti” Trap

In the rush to be “AI-First,” many teams use a mix of different AI workflow tools and agents, connecting them piece-by-piece to solve individual problems. While each piece works, the overall system becomes a tangled mess, which I like to call AI Spaghetti. 

In this trap, there is no central “brain” or data control. Each agent has its own way of looking at data, leading to zero consistency. If you change a price in your main database, some agents might see it, while others are still using an old PDF they found in a different folder.

This “spaghetti” is impossible to maintain, secure, and scale. You spend 90% of your time fixing broken connections, integrations or calculations instead of creating new value. 

The dangerous part is that this doesn’t happen on day one; it builds itself as you add more functionalities and exceptions. Often, these workflows are already in production as they grow, and the only way out is building everything from scratch the right way while maintaining the spaghetti in parallel making the “escape route” quite expensive.

Trap 2: The “Lone Wolf” Liability Trap

To bypass what they see as “slow corporate IT,” some teams or individuals start building their own AI applications and workflows. This is not inherently concerning for basic operational efficiency, but the trap is found when teams go deeper and start building workflows and applications consuming and transforming bigger junks of company data.

These “Lone Wolves” work around IT and expose the company to major risks to quickly “get the job done,” ignoring necessary governance processes. When a Lone Wolf uploads a customer list or a trade secret to a public model, that data might be used to train future versions of the model, making your secrets public property. Furthermore, with zero oversight, legal frameworks like GDPR, internal data sharing protocols, and IT security are often ignored.

The Solution: The AI-Enabling Data Platform

To escape these traps and unlock real sustainable value, you must move away from “messy” setups. The answer is the AI-Enabling Data Platform. This is not just a place to store data. It is a professional system that transforms raw, fragmented information into high-quality “fuel” for AI.

The platform acts as a protective layer between your messy company data (emails, databases, PDFs, spreadsheets) and your AI applications. Its main job is to provide Feature Marts.

What are Feature Marts?

Think of a Feature Mart as a library of trusted information. Instead of asking the AI to search through a giant, messy database, you provide it with specific “Features”, which essentially are data points that have been cleaned, integrated, and approved by your data experts.

For example, instead of the AI trying to guess a customer’s loyalty status from thousands of raw interaction logs, it simply asks the Feature Mart for the “Customer_Loyalty_Score.” The result is instant, accurate, and governed.

How do they fit into our data architecture?

This is aligned with how we provide data to business users for standard reporting and analytics. We don’t throw non-integrated, uncleaned data without descriptions at business users and ask them to find the perfect KPI. This is why the principles behind a quality data platform stay mostly the same. You can simply build Feature Marts on top of your existing data platform. Instead of “Information Marts,” you now add Feature Marts.

AI Enabling Data Platform

You build feature marts on top of your integrated data layer as part of your “Gold Layer” as it is a data asset ready for consumption by your AI applications, workflows and agents. Those are responsible for automating your operations supporting your business in a variety of tasks.

What becomes critical for high-quality results is a semantic layer. Nowadays, definitions for your data, calculations, and meaning can be added in modern data cataloging tools. These are excellent as they can be used by business users as well as data specialists. A well-constructed Feature Mart, combined with descriptive data, is the perfect recipe for high-quality results from your AI layer.

If you are interested in more details about the data architecture, check out my article about Data Fabric architecture here: Data Vault, Data Mesh & Data Fabric Guide

What You Achieve: Quality, Speed, Cost Efficiency and Trust

When you invest in an AI-Enabling Data Platform, you achieve four critical business outcomes:

AI Enabling Data Platform Key Points

The Path to Success

Building high-quality AI is a journey. You can achieve better results and avoid the risks by following these steps:

  • Stop the “Lone Wolves”: Ensure all major AI projects use a central data platform so they stay safe and governed. Which AI usage is allowed outside IT and where guardrails are necessary should be defined in your organization’s AI strategy.
  • Stop the “AI Spaghetti”: Simple AI use cases can be achieved with basic workflow tools (e.g., n8n, Zapier) without a dedicated platform. Complex AI use cases building on company data should not and only use workflows tools for orchestration. 
  • Build Feature Marts: Don’t just give the AI raw data. Turn your important business data into ready-to-use “features” to increase trust, speed, security and governance.
  • Focus on Governance: Use the platform to control who (and which AI) can see your data. Audit inputs and outputs to ensure quality stays high.
  • Create Cross-functional Teams: The real impact is in automating everyday business processes, which is best achieved through combined teams of data engineers, AI engineers, and business users.
  • Assess and Plan: Get an overview of how AI is currently used, where the biggest risks are, and where the biggest opportunities lie. Create a roadmap including team structure, team skills, architecture, processes, governance and security.

If you want to profit from external expertise, read about our Scalefree Review & Assessment service and reach out to us for a customized review fitting your exact needs.

Conclusion: Real Value is Built on Trust

The AI revolution is not about who has the most expensive model or the flashiest chatbot. It is about who can automate their business most efficiently leveraging AI without losing trust in operations, results, and decisions.
When your AI applications are accurate, safe, and governed, they stop being “risky experiments” and become the engine of your company’s success.
Start by identifying your “Lone Wolves” and bringing them into a governed environment. Look at your most valuable AI use cases and start building the Feature Marts they need to survive in the real world.

What do you think?

Have you seen the “Agentic Spaghetti” trap in your own company? Are you worried about “Lone Wolves” creating legal risks? I would love to hear your experiences and challenges in the comments below or on social media postings (probably only LinkedIn)!

Business Analyst and Data Modeler Collaboration in Data Vault

Business Analyst and Data Modeler Collaboration in Data Vault Projects

One of the most common sources of friction in Data Vault projects isn’t technical — it’s organizational. The collaboration between Business Analysts and Data Modelers is arguably the most important working relationship in the entire delivery chain, yet it’s also one of the least clearly defined. Who does what? Where does one role end and the other begin? What information needs to change hands, and in what format? This post walks through a practical approach to structuring that collaboration, drawn from real project experience.



Why the Business Analyst and Data Modeler Collaboration Is So Critical

When Business Analysts and Data Modelers don’t collaborate effectively, the symptoms show up in the Raw Data Vault. Surrogate keys get nominated as Business Keys. Source system logic bleeds into what should be a raw, business-concept-driven model. Gaps in the information provided to modelers lead to design decisions based on assumptions rather than actual business understanding.

It’s worth clarifying one important point here: the Raw Data Vault is not where business perspectives live. Business logic, business rules, and the way the organization interprets its data — all of that belongs in the Business Vault. The Raw Data Vault should reflect the raw data as it comes from the source, structured around business concepts and Business Keys. Keeping that distinction clear is fundamental to a healthy collaboration between the two roles.

Forget the Line — Work Together

A common instinct is to draw a clean boundary: the Business Analyst works until a certain point, then hands off to the Data Modeler. In practice, this handoff model is where projects run into trouble. Information gets lost in translation. The Data Modeler receives documentation that makes sense from a business perspective but leaves key modeling questions unanswered. The Business Analyst doesn’t know what the Data Modeler actually needs.

A better approach: put everyone in the same room. Business Analysts, Data Modelers, Data Engineers, and dashboard designers all working toward the same deliverable — a report, a KPI, a business process automation. The business user doesn’t care about Data Vault; they care about the output. Build toward that output together.

This doesn’t mean everyone needs to be available full-time. But especially at the start of a project, physical or virtual co-location matters. When the Data Modeler hits a question the Business Analyst’s documentation doesn’t answer, the answer needs to be one conversation away — not a ticket in a queue.

Two additional roles are particularly valuable to have accessible during this phase: a source system specialist who knows the source data structure deeply, and a business user who can validate what’s being built against actual reporting needs. They’re typically time-constrained, so plan interactions with them carefully and make the most of the time you have.

Starting with Concept Classification

Before diving into source tables and column mappings, it pays to start at a higher level. A concept classification session — sometimes called a concept analysis — asks a deceptively simple question: what is your business model?

In a meeting with stakeholders from different departments, you map out the core business objects: customers, products, purchases, factories, whatever is central to how the business operates. You’re not focused on relationships at this stage — you’re building a vocabulary. A taxonomy of the concepts that matter to the business.

The second part of this conversation — often in the same meeting or the next one — asks: how do you identify each of these concepts? This is where it gets interesting. If you have people from finance, production, and sales in the room, you’ll typically get different answers. Finance uses an Oracle ID. Sales uses a Salesforce account key. Production uses an SAP number. Different systems, different keys, all referring to the same underlying concept.

This gives you a set of Business Key candidates. From there, you can examine the actual source data: do these keys exist in the dataset? Are they unique? Do any of them appear across multiple source systems in a way that could serve as a shared integration key? That analysis — even if limited to the data you have in front of you — is enough to identify a strong candidate and move forward. It won’t be perfect. A full analysis of every source system across the enterprise is rarely funded. But a well-reasoned candidate key is enough to start building, and it can be refined as the project progresses.

The Collaboration Spreadsheet: Simple and Effective

Once you’ve identified your concepts and Business Key candidates, the next step is mapping source tables to those concepts and classifying every column. The tool for this doesn’t need to be sophisticated — a spreadsheet works well, and works well precisely because everyone can use it.

The process looks like this: before the meeting, a developer imports the source system metadata into the sheet — column names, data types, lengths, source table. One row per column. Then, in the meeting with the business user and source system specialist, you go through each column and answer a simple question: what is this?

The annotations don’t need to be elaborate. Common classifications include:

  • Business Key — the identified key for this concept
  • Descriptive attribute — goes into a Satellite
  • Link reference — indicates a relationship to another Hub, requires a Link
  • Surrogate Key — captured as descriptive, not used as the Business Key
  • Ignore — not needed for this model

Additional classification dimensions — rate of change, security classification, privacy flags — can be added as columns in the same sheet. Satellite split decisions (which attributes group together into which Satellite) can be noted in comments. The goal is to give the developer enough context to build the metadata for the automation tool without needing another round of meetings.

The key discipline here is consistency. Keep comments patternized. The same type of note should look the same every time. A free-form comment field is useful; a completely unstructured one becomes noise.

From Spreadsheet to Automation Tool Metadata

Once the spreadsheet is complete, the developer translates it into the metadata format required by the automation tool — whether that’s Data Vault Builder, VaultSpeed, Datavault4dbt, or another platform. This translation step takes time and precision: automation tools produce exactly what their metadata specifies. Bad metadata produces bad results. But with a well-annotated spreadsheet as the source, the developer has a clear reference and can resolve most questions independently.

Some projects also require terminology translation at this stage. Source systems — especially SAP — often use abbreviated, language-specific field names that don’t belong in a Data Vault intended for a broader audience. The spreadsheet can include an English translation column, which the business user or source system specialist can complete asynchronously, keeping the meeting time focused on classification rather than translation.

Where AI Is Starting to Help

The concept classification and Business Key identification process described above is time-intensive, and it’s largely limited by how much source system analysis you can afford to fund. This is one area where AI tooling is beginning to make a difference.

Tools like FLOW.BI — developed at Scalefree — can attach to source systems, profile the data automatically, classify attributes, and identify Business Key candidates that appear across multiple systems as potential shared integration keys. The manual process described in this post becomes a validation and refinement step rather than a ground-up analysis. The fundamentals are the same; the speed is different.

The Information Requirement: Starting from the End

One final principle worth emphasizing: start with the target. Before analyzing source systems, ask what needs to be produced. What KPI needs to be calculated? What report needs to be built? What data does that require, and where does it come from?

An information requirement document — a structured template that captures what the business user wants, what they need, and where the data lives — is the ideal starting point for any new delivery. It won’t always be complete. Business users often know what they want but not where the data comes from. That’s fine. The Business Analyst and Data Modeler work together to fill in the gaps. But having even a partial information requirement is better than starting from raw source tables and working backwards.

Scalefree has published a template for information requirements on their blog — searching for “information requirement Scalefree” will bring it up — which can serve as a starting point for teams building this practice.

Making Collaboration Work in Practice

There’s no single formula for Business Analyst and Data Modeler collaboration that works across every project and every team. But a few principles hold consistently: work toward the same deliverable together, use simple tools that everyone can engage with, start from the business concept before diving into source data, and keep the meeting time focused on decisions — not documentation.

The spreadsheet approach is unglamorous. It’s also fast, inclusive, and produces the output the developer actually needs. Sometimes the best collaboration tool is the one everybody already knows how to use.

To learn more about Data Vault modeling practices, Business Key identification, and the full Raw and Business Vault methodology, explore our Data Vault 2.1 Training & Certification. And for a concise introduction to the core concepts, the free Data Vault handbook is available as a physical copy or digital download.

Watch the Video

Refactoring a Data Vault Model

Refactoring a Data Vault Model: Options, Risks, and Best Practices

Source systems change. Columns get added or removed, structures evolve, and sometimes entire business key definitions are overhauled. When that happens to a system feeding your Data Vault, the question isn’t just technical — it’s strategic. Do you modify what you have, or do you build alongside it? This post walks through the main scenarios and the practical options available for each, along with a clear recommendation on where to draw the line between low-risk and high-risk approaches.



When a Column Changes: The Simple Case

The least disruptive scenario is a column-level change in a source table — a new attribute appears, or an existing one disappears. For this, you have a few options depending on your project constraints.

Option 1 — Modify the existing Satellite. If your project allows structural changes, you can add the new column to the existing Satellite with an ALTER TABLE statement. Historical rows will show null values for the new column before its introduction, and a log entry can record exactly when the column was added. Removing a column from a Satellite is generally not done — historical data lives in that column, and dropping it means losing that history.

Option 2 — Create a new Satellite. If you’re not allowed to touch existing structures, or simply prefer not to, you create a new Satellite to capture the new or changed attributes. This Satellite gets added to the relevant PIT Tables. The trade-off is an additional join in your queries, but the existing Satellite and its data remain completely untouched.

Option 3 — Close and replace the Satellite. A slightly more thorough approach: close the existing Satellite (stop loading it) and create a brand new one that reflects the updated structure. The new Satellite starts with a full load from the source, which means some data overlap with the old Satellite. This is handled cleanly at query time using an IIF statement — prefer data from the new Satellite where it exists, fall back to the old one for earlier history. The redundancy is not a problem; it resolves itself during query execution.

The bigger the structural change, the more this third option makes sense. If a source table is overhauled dramatically — many columns removed, many added — creating a fresh Satellite to capture the new shape is often the cleanest path forward.

When the Business Key Changes: The Complex Case

Column-level changes are manageable. Business key changes are where things get genuinely complex — and where the risk calculus shifts significantly.

A business key in Data Vault must be unique over time and across the enterprise. If the current key no longer meets that standard — say, a customer number that was once reliable is now duplicated across regions — you have a structural problem that can cascade through the model. Changing the business key means potentially changing the Hub itself, which in turn affects every Link that references that Hub, and every Satellite attached to those Hubs and Links. The impact can be wide.

At this point, you have two main strategic choices.

Option A: Keep Old and New Structures Separate

The lower-risk approach — and the one most commonly recommended — is to leave the historical Raw Data Vault exactly as it is and build a new Raw Data Vault to capture data under the new structure and key definition.

The reasoning is rooted in a core Data Vault principle: the Raw Data Vault should model data close to how the source systems use it. The business had one structure in the past and a different one going forward. That’s two different realities, and it makes sense to model them separately.

The two Raw Data Vaults then get reconciled in the Business Vault, where business logic handles the combination of old and new data. This might be straightforward — a simple union — or it might be complex, especially if field definitions have changed. For example, if an address field was previously structured (street, house number, zip, city) and is now a free-text memo field that may contain addresses from multiple countries, the logic to normalize and combine that data belongs in the Business Vault. That’s exactly what the Business Vault is designed for.

This approach carries the lowest risk. Historical data is never touched. Nothing can go wrong with data that hasn’t been moved.

Option B: Refactor the Raw Data Vault

The more ambitious option is to refactor the existing Raw Data Vault into a new version — modifying Hubs, Links, and Satellites to reflect the new structure — and then reconstruct historical data within that new model.

This is technically possible, but it comes with a hard requirement: you must be able to reconstruct every historical delivery from the new structure without any data loss. In Data Vault practice, this is validated through what’s known as the “Jedi test” — deriving the old structures from the new ones and verifying the output matches the original data exactly. If the test passes, you can safely drop the old tables and replace them with views that expose the old structure as a backward-compatible interface.

Those views give existing queries time to continue working while users migrate. But they’re a transitional tool, not a permanent one. You’ll want to communicate a clear deprecation timeline — 90 or 180 days is typical — and give users explicit guidance on how to update their queries before the views are dropped.

A word of warning: when those views eventually get dropped, expect complaints. Not because the communication failed, but because, as a rule, nobody reads emails. Plan for it.

Handling Non-Unique Business Keys in the Interim

If a business key loses its uniqueness mid-project and a full refactoring effort will take several sprints, there’s a practical interim solution: a Record Source Tracking Satellite. This allows you to continue working with the existing model while the refactoring is planned and executed in the background. It buys time without requiring an immediate structural overhaul, and it keeps the data pipeline running cleanly during the transition.

Communication: The Overlooked Part of Refactoring

Technical decisions aside, refactoring a Data Vault model is also an organizational event. Users who query your data warehouse need to know when structures change — whether that’s a modified Satellite, a new Hub, or a deprecated view that will be removed in three months.

A simple data warehouse changelog or newsletter goes a long way. When you modify existing entities, inform users. When you introduce views as backward-compatible bridges, tell them the timeline. When the views are going away, tell them what to query instead. This isn’t just good practice — it’s the difference between a smooth migration and a flood of support tickets.

The Bottom Line on Refactoring

Data Vault is built to absorb change, and it does so gracefully at the column level. Descriptive attribute changes — new columns, removed columns, restructured Satellites — are handled with well-defined options and minimal risk. The real challenge arrives when business keys change, because the ripple effects can touch Hubs, Links, and Satellites across the model.

In those cases, the recommended approach is to preserve historical data in the original Raw Data Vault and build a new one for the new structure, using the Business Vault as the reconciliation layer. It’s the lowest-risk path, it keeps your historical data intact, and it puts complex transformation logic exactly where it belongs.

To learn more about Data Vault modeling principles, refactoring strategies, and Business Vault patterns in depth, explore our Data Vault 2.1 Training & Certification. And if you’re new to the methodology, the free Data Vault handbook is a great starting point — available as a hard copy or digital download.

Watch the Video

Source of Data for Business Vault Entities

Business Vault Entities: Using Cross-Satellite Joins for Business Logic

One of the questions that comes up regularly among Data Vault practitioners — especially those working on the Business Vault layer — is whether a Computed Satellite can draw its input data from multiple, unrelated Satellites across different Hubs. It sounds like it might break the rules. It doesn’t. In fact, it’s a very common and entirely valid pattern once you understand what actually constrains a Business Vault Satellite.

Let’s break it down.



Business Vault Computed Satellites: What Are the Real Rules?

When practitioners first encounter the Business Vault, a natural assumption is that a Computed Satellite must only use data from Satellites attached to its own parent Hub. The logic seems sound — keep things connected, keep things clean. But this is not actually a constraint in Data Vault. It’s a misconception.

The real rule for a Computed Satellite is much simpler: what do your results describe? Whatever entity your calculated output describes — that’s your parent. That’s what the Satellite attaches to.

For example, if you’re calculating customer lifetime value using data pulled from order records, transaction histories, product tables, and behavioral data — the result still describes the customer. So your Computed Satellite attaches to the Customer Hub. The source of the input data is largely irrelevant to the attachment decision. What matters is the semantic meaning of the output.

Cross-Satellite Joins: Yes, You Can

So back to the original question: can you build a Business Vault Satellite that joins Satellites from two completely different subjects — even if there’s no Link connecting their parent Hubs?

The answer is yes.

Your business rule query can pull data from anywhere in the model. You can join Satellites on the same Hub, traverse a Link into another Hub and pull Satellite data from there, use business key relationships to establish a match, or even join across entirely disconnected model areas if a descriptive relationship exists in the data itself.

The absence of a Link between two Hubs doesn’t prevent you from joining their Satellite data in a query. If there’s a way to match records — even informally through a shared descriptive attribute in a Satellite — you can use that to establish the relationship in your query logic and proceed.

When to Consider a Business Vault Link Instead

That said, there are situations where formalizing the relationship makes sense. If two Hubs have no Link between them but your business logic consistently requires joining their data, it may be worth creating a Business Link or an Exploration Link in the Business Vault to make that relationship explicit and reusable.

This approach can improve query efficiency and make the model easier to understand for future developers. However, it also adds complexity — a Business Link typically requires an Effectivity Satellite to track the validity of the relationship over time, which may or may not be worth the overhead depending on your use case.

The pragmatic answer: if it’s a one-off calculation and the join works, just write the query. If the relationship is foundational to multiple business rules, formalize it with a Link. Performance and maintainability should guide that decision, not a strict modeling rule.

The Only Real Constraints

In practice, there are two genuine constraints on what you can use as input for a Business Vault Computed Satellite:

  • Query performance: Joining across disconnected parts of your model may not always be efficient. If the join is expensive and runs frequently, optimization — through a Business Link, materialization, or other techniques — is worth considering. But it’s an engineering concern, not a modeling violation.
  • Data Governance: Are you actually allowed to combine those datasets? In organizations with strict data ownership boundaries, combining Satellites from different subject areas or different data owners may require governance approval. That’s a policy question, not a technical one — but it’s one worth asking before you build.

Outside of those two factors, the query is yours to write however you need.

A Practical Example: Customer Lifetime Value

To make this concrete: imagine you need to calculate customer lifetime value (CLV) and attach the result as a Computed Satellite on the Customer Hub. Your inputs might include:

  • Customer profile data from a Satellite on the Customer Hub
  • Order totals from a Satellite on an Order Hub (connected via a Customer-Order Link)
  • Return history from a Satellite on a Returns Hub
  • Promotional sensitivity data from a completely separate marketing model with no direct Link to Customer

Even if the marketing model has no formal Link back to Customer, if there’s a way to match the records — say, via a shared customer identifier in a descriptive Satellite — you can join it. Your CLV business rule can consume all of this data in a single query. The result — a calculated CLV score — describes the customer, so the Computed Satellite attaches to the Customer Hub. Clean, valid, and practical.

This kind of cross-domain business logic is exactly what the Business Vault is designed for. It’s the layer where raw, integrated data gets transformed into meaningful, decision-ready information — and the flexibility to query across the model is one of its most powerful features.

Key Takeaway

When building Computed Satellites in the Business Vault, don’t let the structure of your Raw Data Vault artificially limit your business logic. The query that feeds your Satellite can draw from anywhere in the model. The only questions that matter are: what do your results describe, and is your query performant and governance-compliant?

Get those right, and the rest is just SQL.

To go deeper on Business Vault patterns, Computed Satellites, and the full Data Vault methodology, explore our Data Vault 2.1 Training & Certification — and grab the free Data Vault handbook for a concise overview of the core concepts.

Watch the Video

How to Monitor Data Reliability with dbt Source Freshness in dbt Cloud

How to Validate Data Freshness in dbt Cloud

Ensuring that your data is fresh, reliable, and aligned with your SLAs is one of the most important responsibilities of any analytics engineering or BI team. In modern data stacks, dbt Source Freshness plays a key role in validating that upstream systems are loading data as expected. When used properly, it helps teams identify delays, pipeline failures, or missing updates before they impact models and downstream reporting.

This article walks through a full demo of how to configure, run, and monitor Source Freshness checks in dbt Cloud. It builds on the fundamentals introduced in the first video of our series, where we explained what source freshness is, why it matters, and how dbt evaluates freshness. If you haven’t watched that introduction yet, we recommend doing so first.

In this second part, we go hands-on: reviewing source configurations, running freshness checks using both fields and custom SQL queries, applying optional filters, triggering warnings and failures, and inspecting the results in the dbt Cloud UI and Catalog. By the end, you’ll have a clear understanding of how to integrate freshness checks into your workflows and jobs to maintain a highly trustworthy data foundation.



Understanding the Source Freshness Configuration

The demo starts inside a dbt Cloud project with a YAML file containing our source definitions. For this walkthrough, we are working with a source called dbt_talk_demo_sources, which includes two tables:

  • customer_source
  • employee_source

Inside the configuration block, we define the core freshness thresholds:

  • warning_after: 30 minutes
  • error_after: 60 minutes

These settings tell dbt when to flag a source as slightly stale (warning) or critically outdated (error). They are typically aligned with SLAs and expectations for how often upstream data should be updated.

Using loaded_at_field

For the customer_source table, we use a basic configuration: the table includes an updated_at timestamp column, which dbt uses directly to calculate the freshness. However, the timestamps in this demo are recorded in CET (Europe/Berlin), which means dbt converts them to UTC before evaluating freshness. This highlights a common real-world consideration: time zones must always be handled consistently in freshness checks.

Using loaded_at_query

For employee_source, we use a different approach. This table does not store a timestamp column. Instead, the load timestamps are stored in a metadata table. To handle this, we configure a loaded_at_query—a SQL query that retrieves the latest load time externally. This method is often used when:

  • Timestamps come from an ETL metadata or logging table
  • Data loads use high-watermark patterns
  • You want more control over how freshness timestamps are calculated

In the demo, the query simply selects the MAX(updated_at) value from the metadata table. While simple, it demonstrates how flexible dbt is when working with custom data loading patterns.

Using Optional Filters

dbt also supports an optional filter configuration, which lets you skip certain rows when evaluating freshness. For example, if a table contains soft-deleted records or historical rows that should not count toward freshness checks, you can filter them out. In our demo, the filter excludes rows where deleted = TRUE, ensuring only active records contribute to the freshness calculation.

This becomes particularly useful when old records appear fresher than the latest valid ones, which could skew your results or hide actual issues.

Running Freshness Checks in dbt Cloud

With the configuration in place, we run our first freshness check via the CLI:

dbt source freshness

Before running the check, we insert new rows into the source tables so that the latest data delay is around 20 minutes. Since this is below both the warning and error thresholds, the run reports everything as green.

This confirms that both types of configurations—loaded_at_field and loaded_at_query—are working as expected.

Triggering a Warning

Next, we enable the earlier-mentioned filter configuration on customer_source. After filtering out the deleted rows, the next valid record has a delay of about 50 minutes. When we run:

dbt source freshness -s source:dbt_talk_demo_sources.customer_source

dbt reports a warning state, because the threshold of 30 minutes is exceeded. This demonstrates how filtering can impact the evaluation in meaningful ways.

Triggering a Failure

To understand how a failed freshness check behaves, we insert data with delays exceeding the 60-minute error threshold. Running the same command again produces an error state. The dbt output also shows the exact SQL query it executed to determine freshness—useful when troubleshooting unexpected results.

Including Freshness in dbt Cloud Jobs

dbt Cloud provides two ways to incorporate freshness checks into scheduled jobs:

Option 1: “Run Source Freshness” Checkbox

With this option enabled, dbt automatically runs dbt source freshness as the first step of the job. However, failures do not stop the rest of the job from executing. This mode is ideal when you want visibility but don’t want freshness violations to block model builds.

Option 2: Adding Freshness as a Job Step

Alternatively, you can include freshness checks as an explicit job step. In this case, if freshness fails, subsequent steps are skipped and the job fails. This is the preferred option when:

  • Data reliability is critical
  • Your models depend on up-to-date sources
  • You want strong enforcement of data SLAs

The demo shows examples of both approaches, so you can choose which one best fits your project needs.

Monitoring Freshness in the dbt Cloud Catalog

dbt Cloud makes it easy to monitor freshness results long after the run completes. In the Catalog, you can drill down into each source and see the most recent freshness status, including warnings and errors. This gives data teams better visibility into upstream issues without needing to dive into logs.

For example, in our demo environment, the Catalog displays a warning icon for dbt_talk_demo_sources. Opening the source reveals the individual freshness statuses for each table. This is especially helpful in larger projects where tracking freshness manually would be impractical.

Key Takeaways

This demo highlights the full power and flexibility of dbt Source Freshness in real-world analytics environments. Here are the main lessons:

  • Freshness thresholds provide an essential guardrail for data reliability.
  • loaded_at_field is simple when the timestamp is in the table.
  • loaded_at_query enables more advanced scenarios using external metadata.
  • Filters help refine which rows count toward freshness.
  • dbt distinguishes between OK, warning, and error states in a clear, actionable way.
  • Freshness checks can run as part of your dbt Cloud jobs with configurable strictness levels.
  • The dbt Cloud Catalog provides ongoing visibility into the freshness of all sources.

By combining these tools, you can ensure your source data stays timely, trustworthy, and perfectly aligned with your organization’s SLAs. This ultimately improves downstream analytics quality, enhances user confidence, and reduces the risk of building insights on outdated data.

Watch the Video

Close Menu