Building a scalable Data Platform?

Whether you're implementing Data Vault 2.1 or modernizing your analytics architecture, our experts help you turn complex data challenges into practical, future-proof solutions. From hands-on implementation to in-depth training, we support your team every step of the way.

Orchestration of Agentic Workflows

The Shift from Prompts to Autonomous Systems

For years, organizations have focused on mastering “prompt engineering”, the art of writing precise instructions to extract useful outputs from Large Language Models (LLMs). While highly effective for simple, singular tasks, the prompt-based approach has inherent limitations when faced with complex, multi-step business problems.

The next paradigm shift in enterprise AI is the move toward Agentic Workflows.

An “Agent” is more than just an LLM. It is an autonomous or semi-autonomous system that combines reasoning capability (the LLM) with access to tools, memory, and the ability to act on its environment. Instead of answering a question, an agent performs a role, acting as an analyst, a software engineer, or a project manager, handling sequential professional tasks until a goal is achieved.

Orchestration of Agentic Workflows

Master the art of building multi-step autonomous systems by integrating the LangChain ecosystem with powerful tools like Zapier. This session provides a practical roadmap for evolving from simple prompts to sophisticated, coordinated architectures that execute complex professional tasks with ease. Learn more in our upcoming webinar on April 21st, 2026!

Watch Webinar Recording

Why Agents Require Orchestration

The premise of agentic workflows is powerful, but deployment is difficult. In a complex scenario, you may need a system to:

  1. Analyze a business request.
  2. Search a database.
  3. Process results.
  4. Consult a second specialized agent (e.g., a “Coder Agent”).
  5. Revise the plan based on output and finally provide a summary.

Without proper coordination, this series of steps breaks down. The model might hallucinate a tool execution, forget crucial data from step one by step four, or enter an endless loop of unhelpful actions.

Orchestration is the framework that manages this complexity. It is the conductor of the agentic orchestra, defining how different agents, tools, and memory systems interact, ensuring reliability, traceability, and successful execution of the business objective.

Anatomy of an Agentic Stack

To build a reliable orchestrator for autonomous systems, your architecture must unite three fundamental components (a minimal sketch follows the list):

  • Intelligence Layer (The Brain): The reasoning core, usually an LLM, capable of taking input, breaking it into smaller tasks, and evaluating progress.
  • Action Layer (The Tools): A library of external integrations, such as databases, web scrapers, computational engines, and business APIs, that the agent can use to gather real-world data or execute actions.
  • Coordination Layer (The Orchestrator): The logic that manages state, standardizes how agents exchange data, handles errors, and ensures loops are terminated when goals are met.
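
To make these layers concrete, here is a minimal, framework-free Python sketch of an orchestration loop. Every name in it (TOOLS, AgentState, plan_next_step) is hypothetical and only illustrates how the intelligence, action, and coordination layers fit together; a real system would replace the stubs with an LLM and real integrations.

```python
from dataclasses import dataclass, field

# Action layer (hypothetical): a registry of callable tools the agent may invoke.
TOOLS = {
    "search_db": lambda query: f"rows matching '{query}'",     # stand-in for a database call
    "summarize": lambda text: f"summary of: {text[:40]}...",    # stand-in for an LLM summarization call
}

@dataclass
class AgentState:
    """Coordination layer: shared state passed between steps."""
    goal: str
    history: list = field(default_factory=list)
    done: bool = False

def plan_next_step(state: AgentState) -> dict:
    """Intelligence layer (stub): in a real system an LLM decides the next action."""
    if not state.history:
        return {"tool": "search_db", "input": state.goal}
    if len(state.history) == 1:
        return {"tool": "summarize", "input": state.history[-1]}
    return {"tool": None, "input": None}  # signal that the goal is met

def run(state: AgentState, max_steps: int = 5) -> AgentState:
    """Coordination layer: executes steps, records results, and terminates the loop."""
    for _ in range(max_steps):
        step = plan_next_step(state)
        if step["tool"] is None:
            state.done = True
            break
        result = TOOLS[step["tool"]](step["input"])
        state.history.append(result)
    return state

print(run(AgentState(goal="open invoices per customer")).history)
```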

Tools of the Trade: Navigating the Lang Ecosystem

As organizations move from proof-of-concept to production, the ecosystem of framework tools is rapidly evolving. The “Lang” suite has emerged as a particularly dominant force in defining how agents are built and orchestrated. During our workshop, we will explore several critical tools within this stack:

LangChain

While often used for simple prompt chaining, LangChain’s core contribution to agentic architecture is standardizing integrations and chain creation. It provides the interface to connect the LLM to dozens of external systems. Crucially, it allows us to define custom “tools” for the agent. These are specialized, user-created functions that give the agent specific capabilities, such as querying a proprietary data warehouse or executing an internal Python script. By wrapping these functions in LangChain’s tool abstraction, the agent can autonomously decide when and how to invoke them to solve complex problems.
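
As a hedged illustration of this tool abstraction, assuming a recent langchain-core release, a custom warehouse tool might look like the following; run_warehouse_query is a hypothetical stand-in for a governed database client.

```python
from langchain_core.tools import tool

def run_warehouse_query(sql: str, params: list) -> list:
    """Hypothetical stand-in for a governed warehouse client; returns dummy rows."""
    return [(1_250_000,)]

@tool
def revenue_by_region(region: str) -> str:
    """Return total revenue for a given region from the data warehouse."""
    rows = run_warehouse_query(
        "SELECT SUM(revenue) FROM sales WHERE region = ?", [region]
    )
    return f"Total revenue for {region}: {rows[0][0]}"

# The tool is then handed to a tool-calling model, for example:
# llm_with_tools = llm.bind_tools([revenue_by_region])
```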

LangGraph

Managing complex agentic workflows requires a different mental model: graphs. LangGraph extends LangChain by allowing developers to model agentic flows as stateful graphs. Unlike simple linear chains or strictly acyclic pipelines, these graphs can contain cycles, which is crucial for systems that require robust loops, iterative processes, and complex state management, ensuring that “Agent A” always knows what state “Agent B” left the system in.
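
A minimal, hedged sketch of such a stateful graph, assuming a recent langgraph release; the node logic and state fields here are placeholders rather than production code:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    draft: str
    approved: bool

def research(state: State) -> dict:
    # Placeholder node: a real node would call an LLM or a tool.
    return {"draft": f"draft answer to: {state['question']}"}

def review(state: State) -> dict:
    # Placeholder reviewer: approve once a draft exists.
    return {"approved": bool(state.get("draft"))}

def route(state: State) -> str:
    # Loop back to research until the reviewer approves.
    return "done" if state["approved"] else "retry"

graph = StateGraph(State)
graph.add_node("research", research)
graph.add_node("review", review)
graph.set_entry_point("research")
graph.add_edge("research", "review")
graph.add_conditional_edges("review", route, {"retry": "research", "done": END})

app = graph.compile()
print(app.invoke({"question": "Which customers are at churn risk?", "draft": "", "approved": False}))
```

The conditional edge is what turns the flow into a controlled loop: the orchestrator, not the model, decides when the cycle ends.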

Langfuse

Orchestrating agents is messy, and you need visibility. While not officially developed by the creators of LangChain, Langfuse is an essential open-source operational companion that integrates seamlessly with the ecosystem. It provides a robust platform for debugging, testing, and monitoring agentic systems without vendor lock-in. Langfuse allows teams to “trace” the entire multi-step process, viewing every prompt, tool call, and internal decision, making it possible to identify bottlenecks, reduce costs, and debug failures in production.

Complementary Orchestration Tools

While the Lang ecosystem excels at managing LLM logic, a true enterprise solution often requires integration with generalized orchestration and automation tools (like Zapier or n8n). These tools excel at managing event triggers, parallel processes, and standard API interactions that do not require LLM reasoning, complementing the Lang stack in a complete enterprise architecture.

Final Thoughts

Moving from single prompts to coordinated, agentic systems is a necessary evolutionary step for organizations aiming to unlock true operational efficiency with AI. Mastery of these systems requires shifting your perspective from “engineering a prompt” to “engineering a system.”

Want to see how this works in practice?

This article provides a conceptual blueprint of agentic workflows and the essential role of orchestration. To gain hands-on experience in building these systems, we invite you to join our upcoming webinar on the Orchestration of Agentic Workflows. During the session, we will demonstrate how to build multi-step autonomous systems by integrating these platforms into a single architecture, providing a practical guide for moving from simple prompts to coordinated AI systems that handle professional tasks.

Register for free

– Hernan Revale (Scalefree)

Data Sovereignty: The Sovereign Data Platform as the Path to Independent Data and Secure AI

Data sovereignty is often dismissed as a purely political buzzword or a mere compliance exercise, for example under the GDPR and the AI Act. In reality, it is a hard economic necessity. In an era in which data is no longer just visualized in dashboards but serves as the foundation for automated business processes and artificial intelligence, your own infrastructure becomes the strategic bottleneck.

Whoever hands control entirely to external technology companies at this stage loses not only independence but also the capacity to innovate. If data is trapped in closed systems, it is ultimately the vendor who decides what may be connected and which AI models can be used. The path to true data sovereignty begins with the realization that the convenient “all-in-one” promises of many cloud providers come at a high, often hidden price.

What Loss of Control Looks Like in Practice

To understand how data sovereignty can be regained, one must first look at how companies lose it in the first place. This loss of control rarely happens overnight. Rather, it is a gradual process rooted deep in the architecture of traditional and modern cloud data platforms.

Once the decision is made in favor of a proprietary data platform, the raw data is handed over entirely to the vendor’s system.

Proprietary Formats

To deliver the promised performance, closed platforms convert the ingested data into vendor-specific, proprietary storage formats. From that moment on, the data can only be read and processed by the compute engine of that one specific vendor.

Lack of Interoperability

If a new, innovative solution is to be connected, such as a specialized analytics engine or third-party reporting software, or a particular (open-source) AI is to be used, companies often hit a wall. External tools cannot natively read the proprietary formats, or the required interface is simply not provided.

The Cost Trap (“Egress Fees”)

To make the data usable for other applications, or in the worst case to switch providers entirely, it must be exported at considerable effort. This is where the so-called “egress fees” (charges for moving data out) hit hard. Large cloud providers often make ingest (loading data in) very cheap but punish export with high fees.

Loss of Pricing Power

Once historical company data is anchored in a closed system and switching costs have been artificially inflated, companies are at the mercy of the vendor’s future price increases and license changes.

In short: the company still bears full legal and business responsibility for its data, but has lost direct, physical access to it. It merely rents access to its own knowledge.

At this point, ask yourself honestly:

Do you know exactly in which format and on which infrastructure your core data resides at this very moment?
And even more importantly: how do you get to your data if access through your vendor’s portal suddenly stops working tomorrow morning, or prices are dictated unexpectedly overnight?

The Data Lakehouse and Open Standards as the Way Out

The technological way out of this dependency is a fundamental architectural realignment. Today, the answer to proprietary data silos is the Data Lakehouse. This architectural approach combines the flexibility of a data lake with the structure and reliability of a classic data warehouse, under one decisive premise: the consistent separation of storage and compute.

This separation allows companies to build their architecture according to the best-of-breed principle:

Your Own Infrastructure

Instead of loading data into external providers’ systems and “locking” it there, the data remains in the company’s own cloud storage (for example Amazon S3, Azure Data Lake, or Google Cloud Storage). The company effectively and legally holds the only key to its own data.

Open Data Formats as the Foundation

A key lever for data sovereignty is the storage format. In a modern data lakehouse, data is stored exclusively in open-source standards such as Apache Iceberg, Hudi, or Delta Lake. These formats are not owned by any single software vendor and are not subject to proprietary licensing.

Interoperability (“Bring Your Own Engine”)

Because the company’s data now sits structured and in an open format in its own storage, it can be read by a wide range of processing engines (such as Databricks, Trino, Spark, etc.). The decisive advantage: the data does not need to be copied or moved for this.
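
As a minimal illustration of the “bring your own engine” idea, here is a hedged Python sketch using DuckDB as the engine against plain Parquet files in a hypothetical S3 bucket. Table formats such as Iceberg or Delta would additionally use the corresponding DuckDB extension, and object-store credentials are assumed to be configured.

```python
import duckdb

con = duckdb.connect()
# The httpfs extension lets DuckDB read directly from object storage
# (endpoint and credential configuration are assumed to be in place).
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")

# Hypothetical path to open-format files in the company's own storage.
result = con.sql("""
    SELECT customer_id, SUM(amount) AS total_revenue
    FROM read_parquet('s3://my-company-lakehouse/sales/*.parquet')
    GROUP BY customer_id
    ORDER BY total_revenue DESC
    LIMIT 10
""")
print(result)
```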

The result of this architecture is true digital sovereignty. If a software vendor drastically raises prices or falls behind technologically, the compute engine can be swapped out or complemented in parallel with other tools. The valuable data foundation remains completely untouched.

No Secure AI Without a Sovereign Data Platform

This architectural independence is not just a question of cost control; it is a key prerequisite for the productive and secure use of artificial intelligence. Right now, nearly every industry sector is under pressure to introduce AI-driven automation. At the same time, there is a legitimate concern about leaking sensitive trade secrets to US “black box” language models, or about making business-critical wrong decisions based on faulty AI answers (hallucinations).

A messy data foundation and closed SaaS systems systematically stall AI initiatives here. A sovereign AI approach requires a different way of working.

Querying Instead of Embedding

Many early AI attempts fail because company data is embedded directly into language models. This not only carries massive data protection risks but inevitably leads to dangerous hallucinations. A Large Language Model (LLM) is primarily a language tool, not a relational database.

Agentic AI on an Open-Source Basis

The solution lies in using so-called “agentic AI” combined with open-source language models (open-source LLMs) that are operated locally and securely in the company’s own (cloud) environment. The data never leaves the company’s infrastructure. Even more importantly, the AI is configured not to memorize the data but to act as an intelligent agent: it uses its semantic understanding of context to issue targeted, direct queries (for example via SQL) against the open data formats of the lakehouse whenever needed.

“Talk to Your Data” in Practice

Thanks to the direct connection to the central data platform, the system delivers hard, verifiable facts instead of stochastically computed probabilities. This approach enables entirely new business processes: business departments without deep programming or SQL skills can interact with their data in direct dialogue. Complex analyses and reports can be automated and queried reliably using natural language.

However, for this smooth dialogue between business user, AI agent, and data platform not to end in chaos, the AI must understand exactly how the data is structured and what its semantic meaning is. Technology alone is not enough for this, which brings us to the often underestimated core of data sovereignty.

Data Governance: From Rulebook to Strategic Enabler

With data platforms, too, a lesson keeps proving true that has become almost universal in many other fields: technology alone is no guarantee of success. A modern data lakehouse and advanced agentic AI come to nothing if the underlying data quality is poor or the semantic meaning of the data remains unclear. At this point, data governance transforms from an often unloved control instrument into a genuine strategic enabler.

If an AI agent is supposed to translate a user input into a precise database query, it needs more than access to tables. It needs context. Without a maintained business glossary, clear metadata, and defined responsibilities (data ownership), the risk is high that the AI delivers results that are syntactically correct but factually wrong. “Garbage in, garbage out” applies more than ever in the age of artificial intelligence.

A clean governance structure solves this problem at the root:

Central Truth, Decentralized Use

Clear quality rules and defined data products create a foundation of trust. Business departments can rely on the information provided being correct, up to date, and legally compliant.

True Democratization

Only this trust makes self-service analytics possible. Once the guardrails of governance are in place, data can be democratized and made securely available across the entire company without the IT department having to manually approve every single report. AI results can likewise be accepted and reused without headaches about hallucinations or legal concerns.

Compliance as the Standard

With strict European regulations such as the GDPR and the AI Act in mind, integrated governance ensures that access rights, anonymization, and traceability (data lineage) are anchored in the architecture from the start.

Whoever takes responsibility for their data internally in this way creates the essential precondition for scalability.

How Does the Migration Succeed?

The advantages of open standards and a sovereign architecture are obvious. Yet many IT leaders shy away from the step out of vendor lock-in because they fear a risky, multi-year IT mega-project. But breaking free of closed systems does not require a risky “big bang”.

Successful migration projects in practice prove that the move to an open, more sovereign architecture can be made in an agile, incremental way:

Use-Case-Driven Migration

Instead of replacing the entire historical data warehouse at once, the new, open platform is built in parallel. The migration proceeds along prioritized, business-critical use cases.

Fast Return on Investment (ROI)

By first migrating the data domains that deliver the highest immediate value, for example to realize new use cases that previously seemed impossible, the rebuild often pays for itself during the project itself.

Risk Minimization

This step-by-step approach ensures that day-to-day operations (reporting and ongoing analyses) continue completely undisturbed while the future-proof foundation grows iteratively in the background.

The transition to open software and vendor-independent data formats is therefore not an end in itself for IT, but a plannable, low-risk investment in the company’s ability to act.

Actively Shaping Sovereignty

Truly sovereign is only the company that fully controls the architecture, quality, and location of its data and is aware of this responsibility. If you want to break free from dependency, leave expensive license models behind, and create a legally sound basis for artificial intelligence, the path inevitably leads through open standards.

Take back full responsibility for your data. Turn your IT infrastructure from a pure cost factor into the decisive competitive advantage of your industry.

As experts in big data and the development of modern data platforms, Scalefree supports European companies in taking this path successfully. We plan and implement end-to-end data and AI solutions at any scale, from strategic architecture consulting to implementation, including agentic AI.

Is Your Data Ready for the Future?

Let us take a look at your current architecture in a no-obligation conversation. Find out how a tailor-made data lakehouse based on open standards can secure your data sovereignty for the long term.

Schedule a free initial consultation

– Ole Bause (Scalefree)

About the Author

Ole Bause has been working at Scalefree since 2021 in business intelligence, data engineering, and enterprise data warehousing with Data Vault 2. He is a certified Data Vault 2.0 Practitioner and has extensive experience with various cloud-based data warehouse services. Data warehouse automation is also one of his core competencies.

How Model Access Works in dbt Cloud: Groups, Permissions & Cross-Project References

Model Access in dbt Cloud

As organizations scale their data platforms, controlling data access and ownership becomes increasingly important. Teams need clear rules around who can use specific datasets, how models are shared across projects, and how data governance is enforced without slowing down collaboration.

dbt Cloud provides powerful model access features that help analytics engineers and data teams manage visibility, ownership, and cross-project collaboration. By defining groups and access levels such as private, protected, and public, organizations can build scalable and secure data architectures aligned with modern data mesh principles.

This article explains how model access works in dbt Cloud using a producer-consumer project setup. We will explore access levels, group configuration, cross-project references, and how model relationships appear in the Catalog and lineage graph.



Why Model Access Matters in Modern Data Platforms

In a growing data ecosystem, multiple teams often build and consume data models simultaneously. Without clear access control, this can lead to:

  • Unclear data ownership
  • Breaking changes across teams
  • Data quality risks
  • Limited governance
  • Uncontrolled dependencies

dbt addresses these challenges by allowing teams to:

  • Control who can reference specific models
  • Define ownership through groups
  • Enable safe collaboration between teams
  • Support data mesh architectures
  • Manage cross-project dependencies

With proper configuration, organizations can ensure reliable and scalable data pipelines while maintaining flexibility.

Understanding Model Access Levels in dbt

dbt provides three main model access levels that control visibility and usage across projects and teams.

Private Models

Private models are restricted to a specific group. Only models within the same group can reference them.

Key characteristics:

  • Limited visibility
  • Strong access control
  • Internal use within a team
  • Prevents external dependencies

Private models are useful for intermediate transformations or sensitive data that should not be widely accessible.

Protected Models

Protected models allow broader usage while maintaining controlled ownership.

Key characteristics:

  • Can be referenced outside their group
  • Still managed by a specific owner
  • Suitable for shared internal data
  • Balanced control and accessibility

By default, dbt models are typically configured with protected access unless specified otherwise.

Public Models

Public models are designed for cross-project collaboration and wider consumption.

Key characteristics:

  • Accessible across projects
  • Supports data sharing
  • Enables data mesh architecture
  • Clear ownership boundaries

Public models are commonly used as trusted data products that other teams depend on.

Producer and Consumer Project Setup

To demonstrate model access behavior, we use two dbt Cloud projects:

  • Producer Project: Provides models for consumption
  • Consumer Project: References public models from the producer

This setup reflects real-world scenarios where one team publishes data assets and another team consumes them.

Configuring Groups in dbt

Groups in dbt define ownership and control access to models. They help manage responsibilities and enforce governance.

A group can be configured in YAML files and assigned to specific models or entire folders.

Assigning Groups to Individual Models

For example, an analytics group can be defined and assigned to models such as:

  • Orders per supplier country customer
  • Orders per customer
  • Orders per supplier
  • Orders per country

This configuration ensures that models follow consistent ownership and access rules.

Assigning Groups to Folders

Instead of configuring each model individually, teams can assign a group to an entire folder using the project configuration file. This approach simplifies governance and ensures consistent access settings.

Testing Private Model Access

Private models can only be referenced by models within the same group. If a model outside the group attempts to reference a private model, dbt returns an error.

To resolve this, the referencing model must be added to the same group configuration.

This behavior ensures:

  • Clear ownership boundaries
  • Reduced risk of unintended dependencies
  • Improved data security

Private access is particularly useful for internal transformations or staging logic that should remain hidden from external teams.

Using Protected Models

Protected models offer more flexibility. They can be referenced outside their group without requiring additional configuration.

This makes them ideal for:

  • Shared business logic
  • Reusable transformations
  • Internal reporting models
  • Organization-wide metrics

Protected access balances governance with usability.

Configuring Public Models for Cross-Project Use

Public models allow other projects to reference them, enabling collaboration across teams.

For public models to be visible to other projects, dbt Cloud requires:

  • A successful job run
  • An environment defined as staging or production
  • Metadata resolution by the dbt Cloud service

Once these conditions are met, public models become available to consumer projects.

This mechanism ensures that only validated and production-ready models are shared.

Star Schema Example with Mixed Access Levels

A typical data modeling setup might include a star schema with dimensions and fact tables.

In this scenario:

  • Most models are configured as public
  • Specific models, such as certain dimensions, may remain protected
  • Protected models are used internally by other models
  • Only necessary data is exposed externally

This design prevents unnecessary data exposure while maintaining efficient dependencies.

Exploring Model Ownership in the dbt Catalog

The dbt Catalog provides visibility into model ownership, access levels, and dependencies.

Within the Catalog, users can:

  • View model owners
  • Filter by access level
  • Explore group information
  • See associated models
  • Understand data lineage

This transparency improves governance and helps teams understand data responsibilities.

Referencing Public Models from Consumer Projects

Consumer projects can reference public models from producer projects using cross-project references.

These models appear in the development environment and lineage graph as external dependencies. However, consumer teams cannot build or modify them. The producer team retains full ownership.

This separation provides:

  • Clear responsibility boundaries
  • Reduced operational risk
  • Reliable shared data products
  • Improved collaboration

Understanding Lineage Graphs and Dependencies

The lineage graph provides a visual representation of how models relate to one another across projects.

It helps teams:

  • Track upstream and downstream dependencies
  • Understand data flow
  • Identify external data sources
  • Analyze impact of changes
  • Improve system reliability

For cross-project references, the lineage graph clearly shows the project and environment where external models originate.

Benefits of Model Access Control in dbt

Implementing structured model access provides significant advantages:

  • Strong data governance
  • Clear ownership structure
  • Secure data sharing
  • Scalable collaboration
  • Support for data mesh architecture
  • Reduced risk of breaking changes
  • Improved transparency

These capabilities help organizations scale their analytics operations effectively.

Best Practices for Managing Model Access

To maximize the benefits of dbt model access, organizations should follow these practices:

  • Define clear ownership groups
  • Use private models for internal logic
  • Expose only necessary data through public models
  • Validate models before sharing
  • Monitor dependencies using lineage graphs
  • Document access rules consistently

Following these guidelines helps maintain a reliable and scalable data ecosystem.

Conclusion

Model access in dbt Cloud enables organizations to control data visibility, manage ownership, and support cross-team collaboration. By configuring groups and defining access levels such as private, protected, and public, teams can build secure and scalable data architectures.

When combined with Catalog visibility and lineage tracking, these features provide a strong foundation for data governance and modern analytics workflows.

As data platforms continue to grow in complexity, structured access control becomes essential for ensuring trust, reliability, and collaboration across the organization.

Watch the Video

Meet the Speaker


Dmytro Polishchuk
Senior BI Consultant

Dmytro Polishchuk has 7 years of experience in business intelligence and works as a Senior BI Consultant for Scalefree. He is a proven Data Vault 2.0 expert with excellent knowledge of various (cloud) architectures, data modeling, and the implementation of automation frameworks. He excels in team integration and structured project work and holds a bachelor’s degree in Finance and Financial Management.

Predictive Analytics on the Modern Data Platform


From BI to AI: Operationalizing Predictive Analytics where your Data already lives

Traditional Business Intelligence and reporting are incredibly good at telling what happened yesterday. How much revenue was generated last quarter? How many users logged in this week? But while understanding the past is important, today’s businesses need to know what will happen tomorrow.

This is where Predictive Analytics comes in. At its core, predictive analytics simply uses historical data to forecast future outcomes. Instead of asking how many customers canceled last month, predictive analytics asks:

“Which specific customers are most likely to cancel next week?”

Many organizations understand this value and eagerly hire data scientists to build these models. Yet, time and time again, these predictive initiatives fail to make it out of the PoC phase and into daily business operations because of how teams and data architectures are fundamentally structured.

Predictive Analytics on the Modern Data Platform

Learn how to bridge the gap between your data platform and actionable AI by building predictive models directly where your business data lives. This webinar covers practical strategies for transforming warehouse data into features, deploying models, and automating the flow of insights back into your daily operational workflows. Learn more in our upcoming webinar on March 17th, 2026!

Watch Webinar Recording

The Problem: The “Two Silos” of Data

In many companies, Data Engineering and Data Science exist in two entirely different worlds.

Data Engineers and Data Warehouse Developers spend their days building the Modern Data Platform. They carefully extract, clean, conform, and govern data from dozens of sources to create a “Single Source of Truth.” When a business analyst looks at a revenue dashboard, they know they can trust the numbers because the data platform enforces strict business logic.

Data Scientists, on the other hand, often work in isolated environments like standalone Jupyter notebooks. Because they need massive amounts of data to train their machine learning models, they often bypass the data platform entirely, pulling raw, unstructured data directly from a data lake.

This disconnect creates some challenges:

  • Duplicated Effort: Data Scientists waste up to 80% of their time cleaning and prepping raw data, work the Data Engineering team has already done in the platform
  • Inconsistent Metrics: Because models are built on raw data, a model’s definition of “Active Customer” might completely contradict the official definition used in the data platform
  • The “Wall of Production”: A model might look perfectly accurate on a data scientist’s laptop, but because it relies on disconnected, ungoverned data pipelines, integrating its predictions back into the daily workflows of sales or support teams becomes an IT nightmare
  • Exclusivity: Data analysts are often limited to classic descriptive analytics, which slows time-to-insight. The optimal solution is to democratize data science, empowering analysts to implement predictive use cases directly.

The Solution: Bring the Machine Learning to the Data

The fix to this problem requires a fundamental shift in how we think about machine learning architecture. Instead of moving data out of governed systems to feed external ML models, we need to bring the ML workflows closer to the data.

By positioning the Modern Data Platform as the foundation for predictive analytics, you ensure that every prediction is built on the same trusted, cleansed, and governed business data used for your daily reporting. The Data Platform becomes the Feature Store, a centralized hub where data is prepared once and used everywhere, whether for a BI dashboard or training a predictive model.


When the data platform serves as the single source of truth for both analysts and algorithms, magic happens. Data science teams stop wrestling with raw data pipelines, data engineering teams maintain governance, and the business gets predictions they can actually trust and operationalize.

Two Architecture Approaches

So, how do we actually bring the machine learning to the data? There isn’t a one-size-fits-all answer. Depending on the team’s skillset and the complexity of the models, teams typically adopt one of two foundational patterns:

Pattern 1: In-Warehouse Machine Learning (Democratizing ML)

Modern cloud data platforms have evolved beyond just storing and querying data; many now have machine learning engines built directly into them.

  • How it works: Using standard SQL, Data Analysts and Analytics Engineers can train, evaluate, and deploy models entirely inside the data platform (for example BigQuery ML, Snowflake Cortex, or Databricks); a hedged sketch follows this list
  • The Benefit: This radically democratizes predictive analytics. You don’t need to know Python or manage complex infrastructure to build a model. If you know SQL, you can generate predictions using the exact same tables you use for your BI dashboards
  • The Trade-off: While perfect for standard tasks like regression or classification, you are limited to the specific algorithms supported by the data platform
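
As a hedged sketch of Pattern 1, assuming Google BigQuery with BigQuery ML, the google-cloud-bigquery Python client, and a hypothetical customer_features table, the training and scoring logic lives entirely in warehouse SQL; Python only submits the statements.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials are configured

# Train a logistic regression churn model directly in the warehouse (BigQuery ML).
client.query("""
    CREATE OR REPLACE MODEL `my_project.analytics.churn_model`
    OPTIONS (model_type = 'LOGISTIC_REG', input_label_cols = ['churned']) AS
    SELECT login_frequency, open_tickets, months_active, churned
    FROM `my_project.analytics.customer_features`
""").result()

# Score every active customer with the trained model.
scores = client.query("""
    SELECT *
    FROM ML.PREDICT(
        MODEL `my_project.analytics.churn_model`,
        (SELECT * FROM `my_project.analytics.customer_features` WHERE is_active)
    )
""").result()

for row in scores:
    print(dict(row))
```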

Pattern 2: The Data Platform as a Feature Store (The Hybrid Approach)

For organizations with dedicated Data Science teams building highly complex or custom models, the data platform takes on a different role: the Feature Store.

  • How it works: Data Scientists continue to work in their preferred external ML platforms (like Vertex AI, Databricks, or others). However, instead of pulling messy data from a data lake, they connect directly to the data platform to pull curated, business-approved data (“features”) for training
  • The Benefit: Data Scientists retain maximum flexibility to use advanced Python libraries and deep learning frameworks, while ensuring the models are trained on governed, accurate data
  • The Trade-off: It requires a bit more orchestration to manage the pipeline between the data platform and the ML platform and to ensure predictions are accurately written back to the data platform

Example: Predicting Customer Churn

To understand the benefits of these approaches, let’s look at a classic business challenge: Customer Churn Prevention.

Imagine a SaaS company trying to figure out which customers are likely to cancel their subscriptions. In a siloed environment, predicting this can be a messy, manual science project. But on a modern data platform, it becomes an automated operational workflow:

  1. The Foundation (Data): Because of the Data Engineering team’s work, the data platform already contains all necessary historical information about the customer. CRM data (company size), financial records (billing history), product logs (login frequency), and Zendesk tickets (recent complaints) are all cleaned, joined, and sitting in governed tables including a full history of changes
  2. The Prediction (Modeling): An analyst uses In-Warehouse ML (Pattern 1) to run a classification model against this historical data. The model identifies the hidden patterns of a churning customer and generates a “Churn Risk Score” between 0 and 100 for every active user
  3. The Operationalization (Action): This is the crucial step. The predictions aren’t left in a notebook. The risk scores are written directly back into a new table in the data platform. Through reverse ETL, these scores can be automatically synced to the CRM, and dashboards and reports can easily be built on top of the results (a hedged modeling sketch follows this list).
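
For comparison, the modeling step could also run outside the warehouse in the feature-store style of Pattern 2. Here is a hedged scikit-learn sketch, with a small synthetic frame standing in for curated features pulled from the governed platform; column names and values are purely illustrative.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for curated features pulled from the governed data platform.
features = pd.DataFrame({
    "login_frequency": [30, 2, 15, 1, 22, 0],
    "open_tickets":    [0, 4, 1, 6, 0, 3],
    "months_active":   [24, 3, 12, 2, 36, 1],
    "churned":         [0, 1, 0, 1, 0, 1],
})

X = features[["login_frequency", "open_tickets", "months_active"]]
y = features["churned"]

model = LogisticRegression().fit(X, y)

# Churn risk score between 0 and 100 for every customer, ready to be written
# back into a table in the data platform (the operationalization step).
features["churn_risk_score"] = (model.predict_proba(X)[:, 1] * 100).round(1)
print(features[["churn_risk_score"]])
```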

Conclusion

Predictive analytics shouldn’t be an isolated science experiment. It should be a living, breathing part of your operational reality. By treating your modern data platform as the foundation for your machine learning workflows, you eliminate data silos, empower your analysts, and ensure your predictions are built on the trusted business data that matters most.

It is time to operationalize predictive insights where your business data already lives.

Want to see how this works in practice?

Join our upcoming webinar: Predictive Analytics on the Modern Data Platform. We will explore how to build and run predictive analytics directly on top of your data platform using trusted, governed business data as the foundation. You’ll learn practical patterns for turning warehouse models into features, training and deploying predictions, and integrating results back into reporting and operational workflows. Join me on March 17th.

Register for free

– Ole Bause (Scalefree)

Scalefree @ IT-Tage 2025: Global Innovation, Local Trust


How to combine Salesforce innovation with European data standards for a competitive edge in 2026

Fresh back from Kap Europa in Frankfurt, the Scalefree Salesforce team has unpacked a lot of inspiration. While our daily focus is on evolving Customer Relationship Management, IT-Tage 2025 offered an impressive 360-degree view of the IT landscape.

One thing became clear immediately: The “Sovereignty Blues” of recent years has evaporated. The industry is no longer complaining about dependencies; it is in full execution mode.

Here are our top takeaways for decision-makers and admins on how to navigate the future of Salesforce and Digital Sovereignty.


Digital Sovereignty: The New “Freedom of Choice”

Sovereignty was the common thread running through the keynotes (Jutta Horstmann’s talk on digital strategy was a standout).


The Scalefree Take: Sovereignty is not a “Cloud Ban.” It is the ability to actively manage your technological dependencies.

  • Best of Breed: The 2025 standard is combining global innovation (SaaS giants like Salesforce) with local security standards.
  • Data Control: In our discussions, it became clear that “Data Residency” isn’t enough. True sovereignty comes from Process Excellence. You are only sovereign if you understand your data architecture and own your business logic.
  • The Result: We can use modern integration architectures to give you the best of both worlds—US innovation with EU compliance.

Architecture & GenAI: Looking into the Crystal Ball

The sessions on Generative AI and software architecture were eye-openers. But we also saw the need for a reality check.

The Scalefree Take: AI is not a strategy; it’s a tool.

  • Lifecycle over Code: LLMs change how we write code, but they don’t replace the need for planning. If anything, the ethical responsibility and security of the generated logic matter more than the programming language itself.
  • No AI without Clean Core: For our Salesforce clients, this confirms our mantra: AI-driven automation must be embedded in a robust architecture. If you automate chaos with AI, you just get faster chaos. Data Quality is the prerequisite for AI success.

Sustainability: More Than Just Marketing

“Green IT” wasn’t just a buzzword this year; it was about hard metrics. We saw deep dives into how energy-efficient algorithms can minimize CO2 footprints.

The Scalefree Take: Efficiency equals Stability.

  • Resource Intensity: A reflective talk by Hannah Herbst highlighted the resource cost of LLMs.
  • Scale Smart: For Salesforce Admins and CFOs, this is crucial. A bloated Salesforce instance isn’t just expensive to license; it’s technically “unhealthy.” Optimizing your Org isn’t just good for the planet—it’s good for your bottom line.

Conclusion: Our Roadmap for 2026

IT-Tage 2025 confirmed that we are on the right track.

Digital Sovereignty and high-efficiency Customer Management with Salesforce do not contradict each other—each enables the other.

Whoever maintains control over their architecture (Sovereignty) and enforces strict Data Quality (Excellence) can use the best and most innovative tools (Salesforce) safely and sustainably.

Frankfurt, see you in 2026!

– Markus Lewandowski (Scalefree)


How to Deal With Late Arriving Data


Late arriving or backdated data is a common challenge in data warehousing. In Data Vault, it is important to distinguish between the technical timeline used for loading data and the business timeline representing when events actually occurred in the real world.



1. Technical Timeline vs Business Timeline

When loading data into the Raw Vault, always use a Load Date Timestamp (LDTS):

  • Set when the record first arrives in your target system (landing zone, data lake, or Raw Vault).
  • Never backdate this timestamp—it should always move forward.
  • Used for incremental loading, delta detection, and reproducibility of snapshots.

This timestamp does not reflect the real-world timing of the data. It is purely a technical artifact to track ingestion order.

2. Capturing the Business Timeline

To handle late arriving or backdated data, use descriptive business dates stored in your satellites, such as:

  • Apply Date / Effective Date: When the data became valid in the source system or real world.
  • Last Modified Date: When the record was last changed in the source system.

These business timestamps allow you to create snapshots or temporal views that reflect the true order of events.
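
To illustrate the separation of the two timelines, here is a hedged pandas sketch (column names are hypothetical) that builds a snapshot as of a business date while ignoring when the rows technically arrived:

```python
import pandas as pd

# Satellite rows: technical load date (ldts) vs. business effective date.
sat = pd.DataFrame({
    "hub_hashkey":    ["A", "A", "A"],
    "ldts":           pd.to_datetime(["2024-01-05", "2024-02-01", "2024-03-10"]),
    "effective_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),  # last row arrived late
    "status":         ["new", "active", "on_hold"],
})

def snapshot(df: pd.DataFrame, as_of: str) -> pd.DataFrame:
    """Latest record per business key as of a business date (not the load date)."""
    valid = df[df["effective_date"] <= pd.Timestamp(as_of)]
    return (valid.sort_values("effective_date")
                 .groupby("hub_hashkey", as_index=False)
                 .tail(1))

# As of 2024-01-31 the late-arriving 'on_hold' row (effective 2024-01-15) wins,
# even though it was technically loaded only on 2024-03-10.
print(snapshot(sat, "2024-01-31"))
```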

3. Timeline Corrections Without an Extended Tracking Satellite

You can correct timelines without adding additional satellites by leveraging the business timestamps stored in your existing satellites:

  1. Create temporal PIT tables or snapshots based on the business timeline, not the load date.
  2. When late-arriving data is detected:
    • Option 1: Rebuild the affected snapshots to include the late data.
    • Option 2: Apply counter transactions to reverse previous measures and apply the updated values.
  3. Always keep the load date unchanged—it only tracks ingestion, not validity.

This approach ensures that your historical reports reflect the correct business sequence without complicating the Raw Vault model.

4. Practical Guidelines

  • Do not order or aggregate data using the load date when interpreting or reporting; always use business dates.
  • Maintain separate timelines:
    • Load Date: Technical, for data ingestion and reproducibility.
    • Business Date: For interpretation, analysis, and handling late arrivals.
  • Rebuild snapshots or use counter transactions as necessary when late data affects measures or aggregates.

Summary

Late arriving data can be handled in Data Vault without adding extra tracking satellites by clearly separating technical and business timelines. Load Date timestamps remain forward-only, while satellites store business dates to drive temporal snapshots and corrections. Using temporal PIT tables, counter transactions, or snapshot rebuilding ensures your analytics reflect the real-world timeline accurately.

Watch the Video

Meet the Speaker


Marc Winkelmann
Managing Consultant

Marc is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.

How to Capture CDC Data in Data Vault Satellites


Capturing Change Data Capture (CDC) data in Data Vault can be tricky, especially when the source mostly sends inserts but occasionally produces duplicates or deletions. Understanding how to handle these cases ensures historical accuracy and avoids data inconsistencies in your hubs and satellites.



The Scenario

Consider the following behavior of your source system:

  • Most of the time, rows are insert-only.
  • During initial load, the same row may arrive twice (once in the bulk load and once as an insert within the same batch).
  • Deleted rows may occasionally appear.

These patterns can lead to duplicates if not handled correctly. At first glance, it might look like you need a Non-Historized Link, but duplicates must still be managed properly.

Why Standard Non-Historized Links May Fail

A standard non-historized Link assumes a single row per combination of hubs. When duplicates arrive, either due to CDC or multiple inserts during initial load, the Link cannot naturally distinguish them, leading to primary key conflicts or overwritten data.

A common—but sometimes problematic—solution is adding counter rows to differentiate duplicates. However, this often requires a GROUP BY in the Information Mart, which can cause performance issues, particularly on non-columnar databases.

Recommended Approach: Capture Technical History in Satellites

Instead of modifying the Link, the recommended approach is to handle duplicates in satellites, preserving the raw source events and their arrival order.

Step 1: Use a Satellite with a Load-Date Sequence

For each incoming batch:

  • Assign the CDC load timestamp to the first row of a given parent.
  • If multiple rows for the same parent exist in the batch, increment the timestamp by a small unit (microsecond, millisecond, or nanosecond) for each subsequent row.

This creates a unique ordering of changes while preserving the technical history, without touching the original raw data.

Step 2: Maintain Historical Order

By adding a microsecond increment to the load date for each row:

  • The first row in the CDC batch gets the base timestamp.
  • The second row gets base timestamp + 1 microsecond, the third row +2 microseconds, etc.

This ensures the latest row has the highest load timestamp, which can be used to drive Point-In-Time (PIT) tables and type-1 dimension replacements.
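
A hedged pandas sketch of this sequencing (column names are hypothetical): each additional row for the same parent within a batch receives the base load timestamp plus a growing microsecond offset.

```python
import pandas as pd

batch = pd.DataFrame({
    "parent_hashkey": ["H1", "H1", "H2", "H1"],
    "payload":        ["v1", "v2", "x1", "v3"],
})

base_ldts = pd.Timestamp("2024-06-01 02:00:00")

# Position of each row within its parent group (0, 1, 2, ...) in arrival order.
offset = batch.groupby("parent_hashkey").cumcount()

# First row per parent keeps the base timestamp; later rows get +1µs, +2µs, ...
batch["ldts"] = base_ldts + pd.to_timedelta(offset, unit="us")

print(batch)
# H1 rows get 02:00:00.000000, .000001 and .000002; H2 keeps the base timestamp.
```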

Step 3: Preserve Batch or CDC Metadata

If your CDC source provides a batch ID or subsequence number, include it in the satellite. This allows for:

  • Tracking which records arrived together
  • Reconstructing the technical timeline of changes

If no metadata exists, the microsecond sequencing on the load date is sufficient to order the rows.

Handling Non-Historized Links with Duplicates

In rare cases, a non-historized Link may receive multiple rows for the same key combination. To handle this safely:

  • Extend the alternate key to include the load date (or other sequencing attribute) in the hash key calculation.
  • This ensures each row has a unique primary key without modifying the raw data (a small hashing sketch follows the key points below).

Key points:

  • No need to use counter rows in the raw link.
  • Duplicates are captured and preserved for historical accuracy.
  • Aggregations in PIT or Bridge Tables can be used for reporting, ensuring performance optimization.
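
Below is a hedged Python sketch of extending the hash key with the load date, using hashlib; the delimiter, casing, and choice of hash function are illustrative rather than a prescribed standard.

```python
import hashlib

def link_hashkey(*parts: str) -> str:
    """Hash the concatenated key parts with a delimiter, returning a hex digest."""
    payload = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Standard link hash key: business keys only -> duplicate arrivals collide.
hk_plain = link_hashkey("CUST-42", "ORDER-7")

# Extended alternate key: include the load date so each arrival stays unique.
hk_row_1 = link_hashkey("CUST-42", "ORDER-7", "2024-06-01 02:00:00.000000")
hk_row_2 = link_hashkey("CUST-42", "ORDER-7", "2024-06-01 02:00:00.000001")

print(hk_plain, hk_row_1, hk_row_2, sep="\n")
```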

Optional: Bridge Tables for Performance

If your Information Mart requires grouping or deduplication and your database struggles with performance:

  • Create a Bridge Table that pre-aggregates or resolves duplicates.
  • The Bridge Table stores only the latest row (or the aggregated result) for reporting.
  • You maintain the raw satellite history in case full lineage or historical reconstruction is needed.

Summary

  • Capture all incoming CDC events in a satellite, including duplicates, without modifying the raw data.
  • Use microsecond increments on the load date to order multiple rows per parent.
  • Include CDC batch metadata if available to preserve groupings and arrival order.
  • For non-historized Links receiving multiple rows, include the load date in the hash calculation.
  • Bridge Tables or PIT tables handle reporting and aggregation efficiently, while maintaining full historical traceability.

This approach preserves auditability, ensures correct historical ordering, and avoids performance issues in the Information Mart.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Data Vault Link Effectivity


One of the most common — and most misunderstood — challenges in Data Vault modeling is how to correctly handle changing relationships. Especially when source systems insert, delete, and even reinsert the same records over time, many teams struggle to answer a simple but critical question:

“How do I reliably determine the latest valid state of a relationship?”

In this article, we will walk through exactly how to model this scenario using Links and Effectivity Satellites in Data Vault, why this pattern is essential, and how it allows you to extract both the current and historical state of relationships in a clean, auditable way.



The Scenario: A Source Table with Changing Relationships

Let’s start by restating the problem in simple terms.

You have a source table with the following structure:

  • COL_1 + COL_2 → Business Key for Hub A
  • COL_3 + COL_4 → Business Key for Hub B
  • COL_5, COL_6 → Descriptive attributes
  • ROW_OPERATION → Insert / Delete indicator

This tells us something important right away:

  • Two business keys appear in the same row
  • Therefore, there is a relationship between those business keys
  • That relationship belongs in a Link

The complexity arises because the operational system:

  • Inserts records
  • Deletes records
  • Sometimes reinserts the exact same records later

All while the actual business keys and attributes remain unchanged.

So how do we model this in a way that preserves history, supports auditing, and still lets us easily answer:

“Which relationships are valid right now?”

Why Links Alone Are Not Enough

In Data Vault, a Link represents the existence of a relationship — not its state.

Once a relationship between Hub A and Hub B exists, the Link record itself is immutable.
It simply tells us:

“At some point in time, this relationship existed.”

But the Link alone cannot tell us:

  • When the relationship became active
  • When it was removed
  • When it was reintroduced

This is where many implementations go wrong. Teams try to encode relationship state directly into the Link or manage it downstream in reporting logic. Neither approach scales, and both break Data Vault principles.

The correct solution is to model the effectivity of the relationship separately.

Enter the Effectivity Satellite

An Effectivity Satellite attached to a Link tracks the physical existence of that relationship over time.

Think of it as a timeline that answers one simple question:

“Is this relationship active or inactive at a given point in time?”

Typical Structure of a Link Effectivity Satellite

  • Link Hash Key
  • Load Date Timestamp (LDTS)
  • Record Source
  • Deletion Timestamp

The deletion timestamp is the key attribute. It defines until when a relationship is considered active.

A common and proven pattern is:

  • Active relationship → Deletion date = 8888-12-31
  • Inactive relationship → Deletion date = actual load timestamp when deletion was detected

(Using 8888-12-31 instead of 9999-12-31 avoids date overflow issues and is widely adopted in practice.)

Walking Through the Lifecycle of a Relationship

Let’s walk through the example step by step.

Day 1: Relationship Is Inserted

The source system delivers a row linking business keys A and B.

  • A Link record (A–B) is created if it does not yet exist
  • An Effectivity Satellite record is inserted with Deletion_Timestamp = 8888-12-31

This means: the relationship is active.

Day 2: Relationship Is Deleted

The source system no longer contains the A–B row.

We do not delete the Link. Instead:

  • A new Effectivity Satellite record is inserted
  • The deletion timestamp is set to the current load timestamp

Now the timeline clearly shows when the relationship ended.

Day 3: Relationship Is Reinserted

The same A–B relationship appears again in the source.

Important point:

  • The Link already exists → no new Link row
  • The Effectivity Satellite needs a new delta with Deletion_Timestamp = 8888-12-31

This marks the relationship as active again — without losing the history of the previous deletion.

Handling Updates: Why Updates Are Two Events

Another area where teams often struggle is updates.

In Data Vault terms:

An update is a deletion of the old version and an insertion of a new version.

For example, if a relationship changes from A–B to A–C:

  • A–C is inserted as a new Link relationship
  • The Effectivity Satellite marks A–C as active
  • The Effectivity Satellite marks A–B as deleted

This ensures that:

  • No ambiguity exists about which relationship is active
  • Historical reporting remains accurate
  • Auditors can trace every change

How to Query the Latest Valid State

Once modeled correctly, querying the current state becomes straightforward.

Current Active Relationships

  1. Join the Link to its Effectivity Satellite
  2. Select the latest satellite record per Link hash key
  3. Filter where Deletion_Timestamp = 8888-12-31

That’s it.
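
To make these steps concrete, here is a hedged pandas sketch (column names are hypothetical) that derives the currently active relationships; the deletion timestamp is kept as a string so the far-future sentinel stays simple.

```python
import pandas as pd

# Effectivity satellite rows for two link hash keys.
eff_sat = pd.DataFrame({
    "link_hashkey": ["L1", "L1", "L1", "L2"],
    "ldts": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-01"]),
    "deletion_ts": ["8888-12-31", "2024-01-02", "8888-12-31", "2024-01-05"],
})

# Step 2: latest satellite record per link hash key.
latest = (eff_sat.sort_values("ldts")
                 .groupby("link_hashkey", as_index=False)
                 .tail(1))

# Step 3: keep only relationships whose latest record is still open.
active = latest[latest["deletion_ts"] == "8888-12-31"]

print(active)  # only L1 survives: it was deleted and later reinserted
```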

This pattern works consistently across:

  • Current-state reporting
  • Type 1 dimensions
  • Snapshot-based reporting using PIT tables

What About ROW_OPERATION and Valid-From / Valid-To?

It’s important to clearly separate concepts:

ROW_OPERATION

Indicators like Insert/Delete from the source system are:

  • Descriptive metadata
  • Useful for loading logic
  • Not the source of truth for effectivity timelines

They can be stored in a descriptive Satellite, but effectivity is driven by load detection.

Valid-From / Valid-To

Business validity dates are not deletion timelines.

  • They describe business meaning (contracts, subscriptions, agreements)
  • They belong in separate descriptive Satellites
  • They are typically applied in the Business Vault or downstream

Mixing business timelines with technical effectivity is a common modeling mistake.

Why Effectivity Satellites Are Not Optional

A final, often overlooked argument: dirty data.

What happens if:

  • A relationship is accidentally loaded
  • Used in reports
  • Then later corrected or removed

Without an Effectivity Satellite:

  • You cannot explain historical results
  • You cannot prove when data changed
  • You cannot support audit requirements

That’s why, in real-world Data Vault implementations:

  • Most standard Links have Effectivity Satellites
  • Most Hubs have them as well
  • Only a few special cases are exceptions

Key Takeaways

  • Links capture existence, not state
  • Effectivity Satellites capture when relationships are active
  • Deletes and reinserts are modeled as satellite deltas
  • Updates are modeled as delete + insert
  • Current state is derived by filtering on open deletion timelines

Once you adopt this pattern consistently, handling complex relationship lifecycles becomes simple, scalable, and auditable.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

What’s New in Data Vault 2.1 and Why It Matters for Modern Data Warehousing

What’s New in the Data Vault 2.1 Training

In the world of data warehousing and business intelligence, the Data Vault methodology has long been a trusted foundation for scalable and agile data architectures. With the release of Data Vault 2.1, the methodology has evolved to address new challenges in modern data environments — from handling semi-structured data to aligning with concepts like Data Mesh and Data Lakehouse.

In this article, we summarize what’s new in Data Vault 2.1 compared to 2.0, what these updates mean for practitioners, and how you can take advantage of the new training materials to become officially certified.



1. A Major Expansion in Content and Learning Resources

One of the most visible improvements in Data Vault 2.1 is the expansion of the official training content. The updated course now includes extensive video material featuring Dan Linstedt himself, who explains and demonstrates key Data Vault principles in depth.

Participants can now spend several hours watching recorded theoretical sessions and hands-on demonstrations. The new format combines the benefits of self-paced learning with the engagement of instructor-led sessions. You can also download the official SQL loading patterns for all Data Vault entities from the Data Vault Alliance (DVA) training portal.

Another highlight is access to the Data Vault Alliance community — a global network of Data Vault practitioners where members exchange best practices, discuss implementations, and share insights from real-world projects.

2. Enhanced Instructor-Led Training Experience

The well-known three-day instructor-led training remains a cornerstone of the certification path, but it has been optimized to deliver even more value. Trainers now dedicate more time to practical case studies, group discussions, and collaborative modeling workshops.

Instead of spending large portions of class time on theory, participants focus on applying concepts to real-world data challenges. Trainers provide direct feedback on Data Vault models, encourage peer review, and help attendees explore different architectural scenarios.

This redesign creates a more interactive, productive learning experience — especially valuable for consultants, data architects, and engineers who want to strengthen their practical Data Vault expertise.

3. Better Preparation for Certification

Preparing for the official Certified Data Vault 2.1 Practitioner (CDVP2.1) exam is now easier and more structured. The course includes integrated live quizzes during training sessions, allowing participants to test their understanding and interact directly with the instructor.

In addition, a practice exam has been introduced to help you assess your readiness before attempting the final certification. This makes it easier to identify knowledge gaps and feel confident on exam day.

4. Dealing with JSON and Semi-Structured Data

One of the most exciting updates in Data Vault 2.1 is the new module on handling JSON data and other semi-structured sources. As modern data platforms increasingly deal with variable data structures, the methodology now provides clear guidance for integrating such data efficiently.

The course introduces a set of rules and best practices for balancing performance, flexibility, and complexity. You’ll learn when to apply a schema-on-read approach instead of schema-on-write, how to maintain stability as source structures evolve, and how to preserve governance and traceability in semi-structured environments.

Dan Linstedt often refers to this as the “JSON Dilemma” — the challenge of maximizing flexibility without sacrificing performance or clarity. Data Vault 2.1 equips you with the methodology and patterns to solve that dilemma effectively.
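
As a rough illustration of the two sides of that dilemma, using Snowflake-style JSON path syntax and invented table and column names (the course covers the actual decision rules):

  -- Schema-on-write: attributes are parsed into typed columns while loading the satellite
  insert into sat_order (hk_order, load_date, customer_name, order_status)
  select hk_order,
         load_date,
         raw_json:customer.name::string,
         raw_json:status::string
  from   stg_orders_json;

  -- Schema-on-read: the raw document is stored as-is (e.g. in a VARIANT column)
  -- and only interpreted when it is queried
  create view order_details as
  select hk_order,
         load_date,
         raw_json:customer.name::string as customer_name,
         raw_json:status::string        as order_status
  from   sat_order_raw_json;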

5. Stronger Differentiation Between Logical and Physical Modeling

Another core enhancement in Data Vault 2.1 is the clearer separation between logical and physical modeling. While Data Vault 2.0 touched on this concept, version 2.1 makes it explicit: the logical model represents the business concept, while the physical model depends on the underlying technology and performance needs.

For example, on some platforms normalization works best, while on others (such as document-oriented databases), denormalization might be more efficient. The physical implementation should adapt to these realities — but the logical model remains consistent as the blueprint for the business layer.

This separation provides greater flexibility to evolve with technology without compromising the integrity of the business model. It also helps teams align architecture decisions with specific database or cloud platform requirements.

6. Introducing Ontologies and Taxonomies

In line with the growing emphasis on semantic data integration, Data Vault 2.1 introduces the use of ontologies and taxonomies as essential tools for business modeling. These concepts allow organizations to connect business terms, hierarchies, and relationships in a way that supports consistent data integration across departments and systems.

By embedding ontologies and taxonomies into the modeling process, organizations can improve data understanding, reduce ambiguity, and strengthen the link between data structures and business meaning.

7. Extended Business Key Collision Code Concept

The Business Key Collision Code concept has been extended in Data Vault 2.1 to better support cross-system integration. This improvement helps resolve conflicts that arise when business keys overlap or differ across systems — a common challenge in enterprise data integration.

With enhanced rules and examples, the training now guides you through best practices for identifying, classifying, and merging business keys, ensuring a consistent, high-quality data foundation.

8. Merging Satellites Without PIT Tables

Data Vault 2.1 introduces new approaches for handling historical data when traditional Point-In-Time (PIT) tables or snapshot techniques are not required. In cases where you need to maintain very long data histories or join multiple satellites describing the same business object, version 2.1 outlines methods for merging satellites without relying on PIT tables.

This allows for greater flexibility in data retrieval strategies and helps optimize performance in long-term historical scenarios.
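
The training covers the concrete patterns; as a rough, hypothetical sketch of the underlying idea, two satellites on the same Hub can be combined at query time by aligning their load dates and carrying the latest known value forward. Satellite and column names are invented, and the placement of IGNORE NULLS varies between platforms (Snowflake-style shown here):

  with change_timeline as (
      -- every point in time at which either satellite recorded a change
      select hk_customer, load_date from sat_customer_core
      union
      select hk_customer, load_date from sat_customer_address
  )
  select t.hk_customer,
         t.load_date,
         last_value(c.customer_name) ignore nulls over (
             partition by t.hk_customer order by t.load_date
             rows between unbounded preceding and current row) as customer_name,
         last_value(a.city) ignore nulls over (
             partition by t.hk_customer order by t.load_date
             rows between unbounded preceding and current row) as city
  from   change_timeline t
  left join sat_customer_core    c on c.hk_customer = t.hk_customer and c.load_date = t.load_date
  left join sat_customer_address a on a.hk_customer = t.hk_customer and a.load_date = t.load_date;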

9. Alignment with Modern Industry Terminology

To stay relevant with the evolving data landscape, Data Vault 2.1 integrates current industry concepts such as Data Mesh, Data Fabric, and Data Lakehouse. These paradigms are mapped to the Data Vault framework, demonstrating how the methodology fits within modern data architectures.

This update ensures that Data Vault practitioners can easily connect the methodology to the broader trends and technologies shaping the data industry today.

10. Unlock the Full Potential with Instructor-Led Training

If you’re ready to deepen your knowledge and apply these updates in practice, the instructor-led Data Vault 2.1 training offered by Scalefree is the next step. This hands-on training combines theoretical knowledge, real-world exercises, and guided discussions to help you implement Data Vault successfully in your organization.

Visit the training page to find more information, view upcoming training dates, and begin your journey toward CDVP2.1 certification.

Final Thoughts

Data Vault 2.1 represents a significant step forward for data professionals seeking a future-proof methodology. With improved training content, better integration of semi-structured data, a sharper focus on modeling concepts, and alignment with modern architectural trends, Data Vault continues to be a robust choice for building scalable, flexible, and business-aligned data warehouses.

Whether you are transitioning from Data Vault 2.0 or starting fresh, the new version provides the tools, knowledge, and community support to take your data architecture to the next level.

Watch the Video

Meet the Speaker


Marc Winkelmann
Managing Consultant

Marc is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.

Is the Data Warehouse Dead?

From Data Warehouse to Data Platform

Every few years, a new buzzword hits the data industry — and suddenly, the tools and methods we’ve relied on for decades are declared obsolete. Today, that target seems to be the data warehouse. Blogs and conferences proclaim its death, replaced by the data lake, data lakehouse, or even the elusive “data mesh.” But is the data warehouse really dead? Or has it simply evolved into something new?



The “Death” of the Data Warehouse: Where It All Began

For years, the data warehouse has been the foundation of enterprise analytics. It provided a structured, trusted, and governed environment where business data could be collected, cleansed, and analyzed. However, as data volumes exploded and new types of unstructured data emerged, traditional warehouses started showing their age.

Slow ETL processes, rigid schemas, and scalability issues led many to look for alternatives. Enter the data lake — a more flexible, schema-on-read environment that could store raw, unstructured data cheaply and at scale. Suddenly, the industry narrative shifted: data lakes were the future, and warehouses were history.

But as many organizations soon learned, simply dumping everything into a lake didn’t magically solve all their problems. Without governance, context, and structure, data lakes quickly turned into data swamps — massive pools of untrustworthy, undocumented information. And that’s when the story started to change again.


From Warehouse vs. Lake to Warehouse + Lake

The debate shouldn’t be “data warehouse or data lake?” but rather “how do we combine them effectively?” Each serves a different purpose, and modern data platforms are proving that the most successful architectures leverage both.

The data lake is perfect for collecting raw, varied, and large-scale data — structured, semi-structured, or unstructured. It enables exploration, data science, and machine learning. But the data warehouse is still essential for delivering consistent, trusted, and audited data for business reporting and regulatory needs.

As one of our experts put it, the data lake can act as the source system for the data warehouse. The lake is where all data lands. The warehouse sits on top — a refined, curated layer where the most critical data is modeled, governed, and exposed to business users. Together, they form the backbone of a modern data platform.

Why the Data Warehouse Still Matters

Despite the hype around newer architectures, data warehouses provide several key benefits that data lakes alone can’t match:

  • Data Quality: Warehouses enforce rules and transformations that ensure accuracy and consistency across business domains.
  • Auditability and Compliance: Especially in industries governed by GDPR, HIPAA, or SOX, traceability is non-negotiable — something data warehouses excel at.
  • Performance and Optimization: Data warehouses are designed for analytical workloads and provide fast query performance on structured data.
  • Trust: Business users need reliable, validated data for decision-making. Data warehouses remain the single source of truth for that.

So no, the warehouse isn’t dead. It’s simply no longer alone.

Adapting to New Requirements: The Rise of Data Platforms

What has changed, however, is how organizations think about architecture. We’ve moved away from seeing data warehousing as a single monolithic system. Instead, the focus is now on building data platforms — unified ecosystems that combine the strengths of data lakes, data warehouses, and modern cloud technologies.

In this model, the data lake is used as an ingestion and exploration layer, capturing data from across the enterprise. The warehouse, meanwhile, becomes a downstream layer that provides refined, high-quality, and business-ready datasets.

This layered approach is often seen in Data Vault 2.0 architectures. The raw data is first stored in the lake (the “landing zone”), then structured into a raw vault for traceability, and finally transformed into a business vault for analytics and reporting. This methodology blends the flexibility of a lake with the governance of a warehouse — a best-of-both-worlds approach.

AI, Machine Learning, and the New Data Landscape

Another reason the “data warehouse is dead” narrative persists is the rise of AI and machine learning. These applications demand vast quantities of raw and semi-structured data — something traditional warehouses weren’t built to handle efficiently. However, this doesn’t mean warehouses are obsolete; it means they play a different role.

In AI-driven organizations, data scientists use the lake to experiment and train models. Once insights are validated, curated datasets are pushed into the warehouse to ensure they’re governed, standardized, and auditable. This workflow creates a feedback loop between the lake and the warehouse, ensuring agility without sacrificing control.

Modern data warehouses, especially cloud-native ones like Snowflake, Azure Synapse, and Google BigQuery, have also evolved. They now support semi-structured data, elastic scalability, and real-time processing — bridging the gap between lakes and traditional warehouses.

Lessons from the Field: It’s Not About Technology, It’s About Strategy

When companies struggle with data warehousing, it’s rarely because of the technology itself. More often, it’s about poor design, lack of governance, or outdated processes. As many experienced data engineers know, legacy warehouses often become complex, undocumented systems — “historically grown” solutions that no one fully understands.

The real issue isn’t whether to abandon the warehouse. It’s about how to modernize it. That means introducing automation, adopting agile data modeling techniques, and leveraging modern tools that eliminate manual maintenance work.

It also means changing the way organizations think about data. Instead of treating governance as a roadblock, teams should see it as a foundation for scalability. Instead of building massive, inflexible ETL pipelines, they should adopt modular data vault or ELT-based approaches that evolve as business needs change.

Practical Takeaways for Modern Data Teams

  • Stop chasing buzzwords. Data lakes, meshes, and fabrics are valuable, but none are silver bullets. Understand the business problem first.
  • Combine technologies strategically. Use data lakes for exploration and AI, data warehouses for governance and trust.
  • Modernize your warehouse, don’t replace it. Adopt cloud platforms and automation to remove legacy bottlenecks.
  • Think in terms of platforms. Build an integrated data ecosystem instead of disconnected tools.
  • Embrace continuous evolution. The future of data is hybrid, agile, and adaptive — not one-size-fits-all.

Conclusion: The Data Warehouse Is Evolving — Not Dead

The data warehouse isn’t a relic of the past. It’s a vital component of the modern data platform. What’s changing is the way we design, use, and integrate it. By combining the strengths of data lakes and warehouses, organizations can unlock the full potential of their data — balancing flexibility with governance, and innovation with reliability.

The future of data architecture isn’t about replacing one system with another. It’s about convergence. The warehouse, the lake, the lakehouse — all of them are part of a single, connected platform designed to empower both business users and data scientists. So no, the data warehouse isn’t dead. It’s alive, evolving, and more relevant than ever.

Watch the Video

Meet the Speakers


Lorenz Kindling
Senior Consultant

Lorenz is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies in various industries for Scalefree International. Prior to Scalefree, he also worked as a consultant in the field of data analytics. This allowed him to gain a comprehensive overview of data warehousing projects and common issues that arise.


Lennart Busche
Senior Consultant

Lennart is working in Business Intelligence and Enterprise Data Warehousing (EDW) and has been supporting Scalefree International as a BI Consultant since the beginning of 2023. Prior to Scalefree, he gained over eight years of experience in the financial IT sector with a focus on project management, IT service management, and client management. This gave him broad knowledge of business requirements, the needs of customers dealing with IT, and communication with different customer groups.

Data Vault on dbt Snapshots


In recent years, dbt has become one of the most popular tools in modern data stacks. At the same time, Data Vault continues to be a proven methodology for building scalable, auditable, and historically complete data warehouses.

It is therefore no surprise that questions arise at the intersection of both worlds. One question we recently received perfectly captures this:

“Can you build a Data Vault view downstream off of dbt snapshots?
I feel dbt snapshots are safer because they capture data ‘as is’, and a Data Vault might be designed wrong.”

This is a great question—and one that touches architecture, performance, data modeling, and risk management at the same time. In this article, we’ll unpack the topic step by step and give a clear, practical answer.



First Things First: What Are dbt Snapshots?

Before we compare dbt snapshots with Data Vault concepts, let’s align on what dbt snapshots actually are.

According to dbt’s own documentation, snapshots are used to implement Type 2 Slowly Changing Dimensions (SCD Type 2) on mutable source tables.

If you’re familiar with dimensional modeling, this should sound very familiar. SCD Type 2 means:

  • Whenever a record changes, a new row is inserted.
  • The old version of the record is kept for historical analysis.
  • Validity timestamps define from when to when a record version was valid.

In a typical example, a source table might only store the current state of an order:

  • January 1st: Order status = pending
  • January 2nd: Order status = shipped

The source system overwrites the status, so you only ever see the latest value. But in analytics and data warehousing, we usually want to know how the data looked at a specific point in time.

That’s where dbt snapshots come in. They store multiple versions of the same business key and enrich the data with technical columns such as:

  • dbt_valid_from
  • dbt_valid_to

Whenever dbt detects a change, it:

  • Inserts a new row for the new version.
  • Updates the dbt_valid_to of the previous version.

From a functional perspective, this is classic SCD Type 2 behavior.
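
For readers who have not used the feature yet, a classic SQL-file snapshot definition looks roughly like this; the source, schema, and column names are invented for the example:

  {% snapshot orders_snapshot %}

  {{
      config(
          target_schema='snapshots',
          unique_key='order_id',
          strategy='timestamp',
          updated_at='updated_at'
      )
  }}

  select * from {{ source('shop', 'orders') }}

  {% endsnapshot %}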

dbt Snapshots vs. Data Vault Satellites

Now let’s compare dbt snapshots with Data Vault modeling. This is where things get interesting.

In Data Vault, Satellites are responsible for storing descriptive attributes and tracking changes over time. In other words:

  • Satellites are also SCD Type 2 structures.
  • They store full history.
  • They insert a new row for every detected change.

So at first glance, dbt snapshots and Data Vault satellites look almost identical. And conceptually, they are very close.

However, there is one important difference.

Insert-Only vs. Update-Based Modeling

Modern Data Vault implementations follow a strict insert-only approach. That means:

  • No updates to existing records.
  • No physical valid_to column updates.
  • History is reconstructed using window functions or Point-in-Time (PIT) tables.

dbt snapshots, on the other hand, do update the previous record to set the dbt_valid_to timestamp.

From a pure modeling perspective, both approaches are valid. But from a platform and performance perspective, insert-only has some strong advantages—especially in cloud data warehouses like Snowflake, BigQuery, or Redshift.
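
To illustrate the difference: in an insert-only satellite, the valid-to timeline is not stored at all but derived at query time, for example with a window function (table and column names are illustrative):

  select hk_order,
         load_date                                  as valid_from,
         lead(load_date) over (partition by hk_order
                               order by load_date)  as valid_to,   -- NULL = current version
         order_status
  from   sat_order;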

Why Insert-Only Matters in the Cloud

Cloud-native data warehouses are optimized for append-heavy workloads.

For example:

  • Snowflake uses micro-partitions that are immutable.
  • Updates often result in copy-on-write operations.
  • Insert-only workloads scale better and are cheaper.

This is one of the reasons why Data Vault adopted insert-only patterns years ago. It’s not just about modeling philosophy—it’s about performance and scalability.

That doesn’t mean dbt snapshots are “wrong”. It just means they were designed with a slightly different use case in mind.

Where dbt Snapshots Shine

From a practical standpoint, dbt snapshots are extremely useful in specific scenarios.

One very common use case is a persistent staging area.

Imagine you receive:

  • Full data extracts every day from a source system.
  • No CDC (Change Data Capture).
  • Large tables where storing daily full loads would be wasteful.

In this case, dbt snapshots allow you to:

  • Store only the changes between loads.
  • Keep historical versions.
  • Reduce storage and processing overhead.

From this perspective, dbt snapshots act like a slim persistent staging layer. They capture the source data “as is” and preserve history.

If you already receive proper CDC data from upstream systems, then dbt snapshots are often unnecessary. The change tracking has already been done for you.

Back to the Core Question

So let’s return to the original question:

Can you build a Data Vault view downstream of dbt snapshots?

Technically and conceptually, the answer is:

Yes, you can.

If your dbt snapshots contain all source changes, you have everything you need to:

  • Identify business keys.
  • Track attribute changes.
  • Build hubs, links, and satellites.

In theory, you could build a fully virtualized Data Vault layer on top of snapshots:

  • Virtual hubs
  • Virtual links
  • Virtual satellites

From a data completeness perspective, nothing is missing.
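
As a sketch of how minimal such a virtual entity can be, a Hub could be little more than a view that deduplicates business keys from the snapshot; the hash function, table, and column names below are assumptions, not a prescribed pattern:

  create view hub_order as
  select md5(upper(trim(order_id)))  as hk_order,        -- hash key derived from the business key
         order_id                    as order_bk,
         min(dbt_valid_from)         as load_date,       -- first time the key was seen
         'shop.orders'               as record_source
  from   orders_snapshot
  group by order_id;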

The Real Challenge: Performance and Cost

Unfortunately, theory and reality often diverge.

While a fully virtualized Data Vault sounds elegant, it usually doesn’t work well in practice—at least not today.

Why?

  • Large historical datasets require heavy joins and window functions.
  • Virtualization pushes computation to query time.
  • Cloud compute costs increase rapidly.

In most real-world environments, fully virtualizing the Data Vault on top of snapshots leads to:

  • Slow queries
  • High compute bills
  • Poor user experience

That’s why most architectures still materialize the Data Vault at some point.

Does a “Wrong” Data Vault Design Mean Data Loss?

Another concern in the question is the fear of designing the Data Vault “wrong”.

This fear is understandable—but largely unfounded.

One of the core promises of Data Vault is:

You do not lose data due to modeling decisions.

Even if:

  • You split satellites too much.
  • You group attributes differently than you would today.
  • You later realize a better modeling pattern.

You can always:

  • Refactor satellites.
  • Split or merge them.
  • Reload data from existing Data Vault tables.

This is possible because Data Vault stores raw, historized data—not business logic.

So while a persistent staging area can be helpful, it is not a safety net you absolutely must have. A properly loaded Data Vault already is that safety net.

A Pragmatic Recommendation

So what does a pragmatic architecture look like today?

  • Use dbt snapshots if you need a persistent staging layer and don’t have CDC.
  • Materialize the Raw Data Vault for performance and scalability.
  • Virtualize downstream layers (Business Vault, Information Marts) where possible.

This approach balances:

  • Data safety
  • Performance
  • Cost efficiency

As data volumes grow and histories span years or decades, full virtualization simply becomes inefficient. Materialization at the Raw Vault level is still the sweet spot in most projects.
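
In dbt terms, that split can be expressed through per-model materializations. The following sketch uses invented model and column names and only hints at the idea:

  -- models/raw_vault/sat_order.sql: materialized and loaded incrementally (insert-only)
  {{ config(materialized='incremental') }}
  select hk_order, load_date, order_status
  from {{ ref('stg_orders') }}
  {% if is_incremental() %}
  where load_date > (select max(load_date) from {{ this }})
  {% endif %}

  -- models/information_marts/dim_order.sql: kept virtual as a view
  {{ config(materialized='view') }}
  select hk_order, order_status
  from {{ ref('sat_order') }}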

Final Thoughts

dbt snapshots are a powerful feature and fit nicely into modern data stacks. They can absolutely support Data Vault architectures—especially as a persistent staging layer.

However, they don’t eliminate the need for a materialized Data Vault. Nor do they replace the robustness and flexibility that Data Vault modeling provides.

Used together, dbt and Data Vault can form a strong, future-proof foundation for enterprise analytics—when each tool is applied where it makes the most sense.

Watch the Video

Meet the Speaker


Marc Winkelmann
Managing Consultant

Marc is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.
