The Battle Of Table Formats: Iceberg vs Delta vs Hudi


Selecting the right open-source table format is about securing your infrastructure strategy. Making the right choice helps you save development costs and minimize risks. A well-chosen format lowers your Total Cost of Ownership (TCO) and ensures a future-proof, sustainable architecture. Let’s dive into three popular formats, so you can quickly deliver results without getting locked into a bad ecosystem.

Open table formats bring database-like ACID transactions to your data lake. They reduce storage costs by minimizing data duplication. Here is how Iceberg, Delta, and Hudi compare on the technical essentials.

The Battle of Table Formats: Iceberg vs Delta vs Hudi

Stop risking costly vendor lock-in and future-proof your data lakehouse today. In this deep dive, we cut through the noise to compare the big three open table formats: Apache Iceberg, Delta Lake, and Apache Hudi. We’ll analyze infrastructure fit, real-world performance, and Data Vault integration to help you drive down your TCO. Join us to find the exact format your architecture needs—before you commit to an expensive, irreversible path. Learn more in our upcoming webinar on May 19th, 2026!

Sign Up For Free

Performance Under Pressure

Performance depends directly on your compute engine and use case. Delta Lake is highly optimized for Apache Spark, providing efficient read performance for Spark-heavy workloads. Apache Hudi is specifically built for streaming-first architectures that require handling massive amounts of updates and deletes (upserts). Apache Iceberg utilizes an engine-agnostic architecture, maintaining consistent query performance across multiple different engines like Trino, Flink, and Spark.

Note that the choice of query engine often matters more than the table format itself: a well-calibrated format-engine pair will perform similarly well in most workloads.

Community Support

Community maturity directly impacts long-term risk minimization. Delta Lake is supported by a large user base, primarily driven by Databricks. Apache Iceberg currently holds the ultimate multi-vendor momentum. It receives active contributions from multiple major cloud providers and data vendors, offering broad ecosystem support. Apache Hudi’s community centers on data engineering for real-time ingestion and streaming pipelines.

Time Travel Capabilities

Time travel enables querying historical data, auditing changes, or reverting accidental deletions, serving as a critical mechanism for risk minimization. All three formats offer some type of “time travel”.

Delta uses a straightforward transaction log. It replays JSON commits and Parquet checkpoints to reconstruct a table’s exact state at a specific timestamp or version.

Iceberg uses a tree of immutable metadata snapshots. Instead of processing a heavy transaction log, a query references a past snapshot ID. This approach scales efficiently for massive tables without performance degradation.

Hudi tracks changes via a chronological action timeline. It maintains a granular history of operations, enabling strict point-in-time queries that map directly to its streaming architecture.
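
For a concrete sense of how these mechanisms surface to the user, here is a minimal PySpark sketch of a time-travel read against each format. The table paths, snapshot ID, and commit instant are placeholders, and exact option names can vary between format and Spark versions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel-demo").getOrCreate()

# Delta Lake: read a previous table state by version number or timestamp.
delta_v1 = (
    spark.read.format("delta")
    .option("versionAsOf", 1)              # or .option("timestampAsOf", "2025-01-01")
    .load("s3://my-bucket/tables/orders")  # placeholder path
)

# Apache Iceberg: read a past snapshot by its snapshot ID.
iceberg_snap = (
    spark.read.format("iceberg")
    .option("snapshot-id", 4925903428186536992)  # placeholder snapshot ID
    .load("lakehouse.sales.orders")              # placeholder catalog.schema.table
)

# Apache Hudi: query the table as of a commit instant on its timeline.
hudi_asof = (
    spark.read.format("hudi")
    .option("as.of.instant", "20250101000000")   # placeholder commit time
    .load("s3://my-bucket/tables/orders_hudi")
)
```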

Interoperability

Infrastructure strategy must account for evolving workloads. The industry is currently shifting toward cross-format compatibility. Projects like Apache XTable and Delta UniForm act as interoperability layers. Data written in one format (e.g., Delta) can be read natively as Iceberg or Hudi. This reduces vendor lock-in risks and lowers pipeline reengineering costs. Additionally, Apache Paimon offers an alternative for dynamic tables with native Apache Flink integration for high-throughput streaming workloads.

Architecture Insight: Data Vault

Table formats and modeling methodologies like Data Vault 2.0 are complementary. While Iceberg, Delta, or Hudi provide the optimized storage layer and ACID transactions, Data Vault provides the business alignment and agility. For optimal performance on a Data Lakehouse, you can materialize your Raw Vault core entities as physical Delta or Iceberg tables to serve as high-speed indexes.

A note on Time Travel vs. Historization: While format-level “time travel” is useful for quick rollbacks, long-term enterprise historization should still rely on Data Vault’s insert-only architecture. Relying solely on table formats risks permanent data loss during routine storage maintenance, such as Delta’s VACUUM command.

Key Points for Your Data Strategy

  • Choose Delta for deep Spark integration.
  • Choose Iceberg for maximum tool flexibility and a massive open ecosystem.
  • Choose Hudi for heavy streaming and continuous upserts.

There is no single winner in the battle of table formats, only the right tool for your specific infrastructure strategy. By aligning your choice with your engine preference and streaming needs, you ensure high team agility and keep storage costs manageable.

dbt Fusion: The Next Generation of dbt Execution

dbt Fusion

dbt is evolving rapidly, and with the introduction of dbt Fusion, data teams are entering a new era of performance, efficiency, and intelligence. Built from the ground up, dbt Fusion represents a fundamental shift in how dbt projects are executed, validated, and optimized.

In this article, we’ll explore what dbt Fusion is, why it matters, and how its core capabilities—dialect-aware validation and state-aware orchestration—are changing the way modern data platforms operate.



What is dbt Fusion?

dbt Fusion is a next-generation execution engine for dbt, designed to overcome the limitations of dbt Core and unlock new capabilities for data teams. Rather than incrementally improving the existing engine, dbt Labs rebuilt the execution layer entirely.

One of the most important differences lies in its foundation: dbt Fusion is written in Rust, while dbt Core is built in Python. This change enables significantly better performance, especially for large-scale projects with complex dependency graphs.

But performance is only part of the story.

dbt Fusion introduces a native understanding of SQL across multiple dialects, allowing it to analyze queries more deeply than ever before. This enables advanced features like early error detection, improved lineage tracking, and smarter orchestration.

Importantly, dbt Fusion is designed to support the full dbt Core framework. Most existing dbt projects can run on Fusion with minimal changes, making adoption straightforward for many teams.

Note: Deprecated dbt Core functionality is not supported.

Why dbt Fusion Matters

dbt Fusion introduces two major innovations that directly impact day-to-day data work:

  • Dialect-aware SQL validation
  • State-aware orchestration

Together, these features significantly improve developer productivity, reduce execution time, and lower compute costs.

Dialect-Aware SQL Validation

Static SQL Analysis

One of the most powerful capabilities of dbt Fusion is its ability to perform static SQL analysis. Instead of simply rendering SQL and sending it to the data warehouse, Fusion builds a logical execution plan for every query during compilation.

This means that SQL correctness can be validated before any warehouse resources are used. As a result, many errors are caught early in the development process rather than during execution.

Handling Introspective Models

Not all SQL can be fully analyzed ahead of time. Some models rely on database-dependent macros, often referred to as introspective macros. Examples include:

  • get_column_values
  • star
  • unpivot

In these cases, dbt Fusion may defer part of the validation to the database itself, since the final structure depends on runtime information.

Why This Matters

Dialect-aware validation provides several key benefits:

  • Early error detection: Catch issues before execution
  • Improved developer experience: Faster feedback in the IDE
  • Precise column-level lineage: Better understanding of data flow
  • Foundation for advanced features: Enables orchestration and optimization

In practice, this means fewer failed runs, faster debugging, and more confidence in your transformations.

State-Aware Orchestration

The second major innovation in dbt Fusion is state-aware orchestration, which fundamentally changes how dbt jobs are executed.

Build Only What Changed

Traditionally, a dbt run rebuilds its models even when nothing has changed. dbt Fusion eliminates this inefficiency by detecting changes in both code and upstream data.

If no changes are detected, the model is skipped and the existing version is reused.

This results in:

  • Faster execution times
  • Reduced compute usage
  • Lower cloud costs

Shared Model State

dbt Fusion maintains a shared, real-time state at the model level. All jobs within the same environment can read and write to this shared state.

This allows dbt to determine whether a model has already been built and whether rebuilding it would produce a different result.
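
Conceptually, the decision boils down to a cache check over code and upstream fingerprints. The sketch below only illustrates that idea and is not Fusion's actual implementation; the state store, fingerprints, and function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ModelState:
    code_fingerprint: str      # fingerprint of the model's compiled SQL
    upstream_fingerprint: str  # combined fingerprint of all upstream models

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def should_rebuild(model: str, compiled_sql: str, upstream: list,
                   state_store: dict) -> bool:
    """Rebuild only if the model's code or any upstream input changed (conceptual sketch)."""
    current = ModelState(
        code_fingerprint=fingerprint(compiled_sql),
        upstream_fingerprint=fingerprint("|".join(
            state_store[u].code_fingerprint for u in upstream if u in state_store
        )),
    )
    if state_store.get(model) == current:
        return False  # nothing changed: skip the model and reuse the existing table
    state_store[model] = current
    return True       # code or upstream data changed: rebuild

# Example: a second invocation with identical code and upstream state is skipped.
store = {}
print(should_rebuild("stg_orders", "select * from raw.orders", [], store))  # True  -> build
print(should_rebuild("stg_orders", "select * from raw.orders", [], store))  # False -> skip
```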

Concurrent Job Handling

In modern data platforms, multiple jobs often run at the same time. dbt Fusion is designed to handle this safely and efficiently.

It avoids unnecessary duplication by:

  • Preventing warehouse collisions
  • Reusing models across concurrent jobs
  • Ensuring consistency across executions

Works Out of the Box

One of the strengths of dbt Fusion is its ease of use. State-aware orchestration works automatically in most cases, without requiring additional configuration.

For advanced use cases, teams can still fine-tune behavior with more granular controls.

Efficient Testing (Beta)

dbt Fusion also introduces efficient testing, a feature currently in beta that optimizes how tests are executed.

Key improvements include:

  • Test result reuse: Avoid rerunning tests when results are unchanged
  • Query aggregation: Combine multiple tests into a single query
  • Reduced warehouse load: Lower compute costs

This makes testing faster and more cost-efficient, especially in large projects with extensive test coverage.

Performance and Cost Benefits

By combining Rust-based execution, advanced SQL analysis, and intelligent orchestration, dbt Fusion delivers measurable improvements:

  • Significantly faster runtimes
  • Reduced warehouse usage
  • Lower infrastructure costs
  • Improved developer productivity

For organizations managing complex data pipelines, these benefits can translate into substantial operational savings.

Compatibility with dbt Projects

dbt Fusion is designed to integrate seamlessly with existing dbt workflows.

Most projects can be migrated without major changes, as Fusion supports the core dbt framework. However, teams should be aware that deprecated features from dbt Core are not supported.

This makes it important to review and modernize older projects before transitioning.

Current State of dbt Fusion

At the time of writing, dbt Fusion is still in preview. While its capabilities are already impressive, some features may evolve as the engine matures.

Organizations considering adoption should monitor updates and test Fusion in controlled environments before full deployment.

Conclusion

dbt Fusion represents a major step forward in the evolution of dbt. By rethinking the execution engine from the ground up, it introduces powerful capabilities that go beyond incremental improvements.

With dialect-aware SQL validation, state-aware orchestration, and efficient testing, data teams can build pipelines that are not only faster, but also smarter and more cost-effective.

As the modern data stack continues to evolve, dbt Fusion is positioned to play a key role in shaping the future of analytics engineering.

Watch the Video

Meet the Speaker


Dmytro Polishchuk
Senior BI Consultant

Dmytro Polishchuk has 7 years of experience in business intelligence and works as a Senior BI Consultant for Scalefree. Dmytro is a proven Data Vault 2.0 expert and has excellent knowledge of various (cloud) architectures, data modeling, and the implementation of automation frameworks. Dmytro excels in team integration and structured project work. Dmytro has a bachelor’s degree in Finance and Financial Management.

Orchestration of Agentic Workflows

The Shift from Prompts to Autonomous Systems

For years, organizations have focused on mastering “prompt engineering”, the art of writing precise instructions to extract useful outputs from Large Language Models (LLMs). While highly effective for simple, singular tasks, the prompt-based approach has inherent limitations when faced with complex, multi-step business problems.

The next paradigm shift in enterprise AI is the move toward Agentic Workflows.

An “Agent” is more than just an LLM. It is an autonomous or semi-autonomous system that combines reasoning capability (the LLM) with access to tools, memory, and the ability to act on its environment. Instead of answering a question, an agent performs a role, acting as an analyst, a software engineer, or a project manager, handling sequential professional tasks until a goal is achieved.

Orchestration of Agentic Workflows

Master the art of building multi-step autonomous systems by integrating the LangChain ecosystem with powerful tools like Zapier. This session provides a practical roadmap for evolving from simple prompts to sophisticated, coordinated architectures that execute complex professional tasks with ease. Learn more in our upcoming webinar on April 21st, 2026!

Watch Webinar Recording

Why Agents Require Orchestration

The premise of agentic workflows is powerful, but deployment is difficult. In a complex scenario, you may need a system to:

  1. Analyze a business request.
  2. Search a database.
  3. Process results.
  4. Consult a second specialized agent (e.g., a “Coder Agent”).
  5. Revise the plan based on output and finally provide a summary.

Without proper coordination, this series of steps breaks down. The model might hallucinate a tool execution, forget crucial data from step one by step four, or enter an endless loop of unhelpful actions.

Orchestration is the framework that manages this complexity. It is the conductor of the agentic orchestra, defining how different agents, tools, and memory systems interact, ensuring reliability, traceability, and successful execution of the business objective.

Anatomy of an Agentic Stack

To build a reliable orchestrator for autonomous systems, your architecture must unite three fundamental components:

  • Intelligence Layer (The Brain): The reasoning core, usually an LLM, capable of taking input, breaking it into smaller tasks, and evaluating progress.
  • Action Layer (The Tools): A library of external integrations, such as databases, web scrapers, computational engines, and business APIs, that the agent can use to gather real-world data or execute actions.
  • Coordination Layer (The Orchestrator): The logic that manages state, standardizes how agents exchange data, handles errors, and ensures loops are terminated when goals are met.

Tools of the Trade: Navigating the Lang Ecosystem

As organizations move from proof-of-concept to production, the ecosystem of framework tools is rapidly evolving. The “Lang” suite has emerged as a particularly dominant force in defining how agents are built and orchestrated. During our workshop, we will explore several critical tools within this stack:

LangChain

While often used for simple prompt chaining, LangChain’s core contribution to agentic architecture is standardizing integration and chain creation. It provides the interface to connect the LLM to dozens of external systems. Crucially, it allows us to define custom “tools” for the agent. These are specialized, user-created functions that give the agent specific capabilities, such as querying a proprietary data warehouse or executing an internal Python script. By wrapping these functions in LangChain’s tool abstraction, the agent can autonomously decide when and how to invoke them to solve complex problems.
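
As a small, hedged example of what such a custom tool can look like, the sketch below wraps a placeholder warehouse lookup with LangChain's tool decorator and lets a chat model decide when to call it. The model name, provider, and the revenue function are illustrative only.

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI  # any chat model with tool-calling support works

@tool
def query_revenue(country: str) -> str:
    """Return total revenue for a given country from the internal warehouse."""
    # Placeholder: in a real setup this would run a SQL query against your warehouse.
    return f"Revenue for {country}: 1.2M EUR"

llm = ChatOpenAI(model="gpt-4o-mini")            # placeholder model
llm_with_tools = llm.bind_tools([query_revenue])

response = llm_with_tools.invoke("What was our revenue in Germany last quarter?")
print(response.tool_calls)  # the model decides whether and how to call the tool
```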

LangGraph

Managing complex agentic workflows requires a different mental model: graphs. LangGraph extends LangChain by allowing developers to model agentic flows as stateful graphs. Unlike a simple linear chain or a strict DAG, these graphs can contain cycles, which is crucial for systems that require robust loops, cyclical processes, and complex state management, ensuring that “Agent A” always knows what state “Agent B” left the system in.
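
A minimal LangGraph sketch of such a stateful, cyclic flow might look like the following; the node logic is a placeholder for a real LLM call, and the state fields are purely illustrative.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    task: str
    attempts: int
    done: bool

def agent_step(state: AgentState) -> AgentState:
    # Placeholder for an LLM call that works on the task and decides whether it is finished.
    attempts = state["attempts"] + 1
    return {"task": state["task"], "attempts": attempts, "done": attempts >= 3}

def route(state: AgentState) -> str:
    # Loop back to the agent node until the goal is met, then terminate.
    return "finish" if state["done"] else "continue"

graph = StateGraph(AgentState)
graph.add_node("agent", agent_step)
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", route, {"continue": "agent", "finish": END})

app = graph.compile()
result = app.invoke({"task": "summarize quarterly sales", "attempts": 0, "done": False})
print(result)
```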

Langfuse

Orchestrating agents is messy, and you need visibility. While not officially developed by the creators of LangChain, Langfuse is an essential open-source operational companion that integrates seamlessly with the ecosystem. It provides a robust platform for debugging, testing, and monitoring agentic systems without vendor lock-in. Langfuse allows teams to “trace” the entire multi-step process, viewing every prompt, tool call, and internal decision, making it possible to identify bottlenecks, reduce costs, and debug failures in production.

Complementary Orchestration Tools

While the Lang ecosystem excels at managing LLM logic, a true enterprise solution often requires integration with generalized orchestration and automation tools (like Zapier or n8n). These tools excel at managing event triggers, parallel processes, and standard API interactions that do not require LLM reasoning, complementing the Lang stack in a complete enterprise architecture.

Final Thoughts

Moving from single prompts to coordinated, agentic systems is a necessary evolutionary step for organizations aiming to unlock true operational efficiency with AI. Mastery of these systems requires shifting your perspective from “engineering a prompt” to “engineering a system.”

Want to see how this works in practice?

This article provides a conceptual blueprint of agentic workflows and the essential role of orchestration. To gain hands-on experience in building these systems, we invite you to join our upcoming webinar on the Orchestration of Agentic Workflows. During the session, we will demonstrate how to build multi-step autonomous systems by integrating these platforms into a single architecture, providing a practical guide for moving from simple prompts to coordinated AI systems that handle professional tasks.

Register for free

– Hernan Revale (Scalefree)

Data Sovereignty: The Sovereign Data Platform as the Path to Independent Data and Secure AI

Data sovereignty is often dismissed as a purely political buzzword or a mere compliance exercise, for example in the context of the GDPR and the AI Act. In reality, it is a hard economic necessity. In an era in which data is no longer just visualized in dashboards but serves as the foundation for automated business processes and artificial intelligence, your own infrastructure becomes a strategic bottleneck.

Whoever hands control entirely to external technology companies at this stage loses not only independence but also the capacity to innovate. If data is trapped in closed systems, it is ultimately the vendor who decides what may be connected and which AI models may be used. The path to genuine data sovereignty begins with the realization that the convenient “all-in-one” promises of many cloud providers come at a high, often hidden price.

What Loss of Control Looks Like in Practice

To understand how data sovereignty can be regained, one first has to look at how companies lose it in the first place. This loss of control rarely happens overnight. Rather, it is a gradual process rooted deep in the architecture of traditional and modern cloud data platforms.

Once the decision falls on a proprietary data platform, the raw data is handed over entirely to the vendor’s system.

Proprietary Formats

To deliver the promised performance, closed platforms convert the ingested data into vendor-specific, proprietary storage formats. From that moment on, the data can only be read and processed by the compute engine of that one vendor.

Lack of Interoperability

If a new, innovative solution is to be connected, such as a specialized analytics engine or third-party reporting software, or a specific (open-source) AI model is to be used, companies often hit a wall. External tools cannot read the proprietary formats natively, or the required interface is simply never provided.

The Cost Trap (“Egress Fees”)

To make the data usable for other applications, or in the worst case to switch providers entirely, it has to be exported at considerable effort. This is where the so-called egress fees (charges for data leaving the platform) hit hard. Large cloud providers often make ingest (loading data in) very cheap but punish export with high fees.

Loss of Pricing Power

Once historical company data is anchored in a closed system and the switching costs have been artificially inflated, companies are at the mercy of the vendor’s future price increases and licensing changes.

In short: the company still carries the full legal and business responsibility for its data, but has lost direct, physical access to it. It merely rents access to its own knowledge.

At this point, ask yourself honestly:

Do you know exactly in which format and on which infrastructure your core data resides at this very moment?
And even more importantly: how do you get to your data if access through your provider’s portal suddenly stops working tomorrow morning, or if prices are dictated unexpectedly overnight?

The Data Lakehouse and Open Standards as the Way Out

The technological way out of this dependency is a fundamental architectural realignment. Today, the answer to proprietary data silos is the Data Lakehouse. This architectural approach combines the flexibility of a data lake with the structure and reliability of a classic data warehouse, under one decisive premise: the consistent separation of storage and compute.

This separation allows companies to build their architecture according to a best-of-breed principle:

Your Own Infrastructure

Instead of loading data into the systems of external service providers and “locking” it there, it remains in the company’s own cloud storage (for example Amazon S3, Azure Data Lake, or Google Cloud Storage). The company effectively and legally holds the only key to its own data.

Open Data Formats as the Foundation

A key lever of data sovereignty is the storage format. In a modern Data Lakehouse, data is stored exclusively in open-source standards such as Apache Iceberg, Hudi, or Delta Lake. These formats belong to no single software vendor and are not subject to proprietary licensing.

Interoperability (“Bring Your Own Engine”)

Since the company’s data now sits in its own storage, structured and in an open format, it can be read by a wide range of processing engines (such as Databricks, Trino, Spark, and others). The decisive advantage: the data never has to be copied or moved for this to work.
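
To illustrate the principle, here is a minimal sketch of how a Spark session can be pointed at Iceberg tables that live entirely in company-owned object storage. The bucket path, catalog name, and table are placeholders; the configuration keys follow the standard Iceberg Spark integration, and the same tables could equally be queried by Trino or another engine.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("byoe-demo")
    # Iceberg SQL extensions and a catalog pointing at company-owned object storage
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-company-bucket/warehouse")  # placeholder
    .getOrCreate()
)

# The data stays in the company's own bucket in an open format;
# Spark is just one interchangeable engine reading it.
spark.sql(
    "SELECT customer_id, SUM(amount) FROM lake.sales.orders GROUP BY customer_id"
).show()
```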

The result of this architecture is genuine digital sovereignty. If a software vendor drastically raises its prices or falls behind technologically, the compute engine can be swapped out or supplemented in parallel with other tools. The valuable data foundation remains completely untouched.

No Secure AI Without a Sovereign Data Platform

This architectural independence is not just a matter of cost control; it is a key prerequisite for the productive and secure use of artificial intelligence. Virtually every industry is currently under pressure to introduce AI-driven automation. At the same time, there is justified concern about sensitive trade secrets leaking into US “black-box” language models, or about business-critical decisions being made on the basis of faulty AI answers (hallucinations).

A messy data foundation and closed SaaS systems systematically hold AI initiatives back. A sovereign AI approach requires a different way of working.

Querying Instead of Embedding

Many early AI attempts fail because company data is embedded directly into language models. This not only creates massive data protection risks, it inevitably leads to dangerous hallucinations. A Large Language Model (LLM) is primarily a language tool, not a relational database.

Agentic AI on an Open-Source Basis

The solution lies in so-called agentic AI combined with open-source language models (open-source LLMs) that are operated locally and securely in the company’s own (cloud) environment. The data never leaves the company’s infrastructure. More importantly, the AI is configured not to memorize the data but to act as an intelligent agent: it uses its semantic understanding of context to issue targeted, direct queries (for example via SQL) against the open data formats of the lakehouse whenever needed.

“Talk to Your Data” in Practice

Through the direct connection to the central data platform, the system delivers hard, verifiable facts instead of stochastically computed probabilities. This approach enables entirely new business processes: business departments without deep programming or SQL skills can interact with their data in a direct dialogue, and complex analyses and reports can be automated and queried reliably using natural language.
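
As a rough illustration of the pattern, the following sketch uses DuckDB to run an agent-generated query directly against open-format files in the company's own storage. The generate_sql function is only a stub standing in for a locally hosted open-source LLM, and the file path and columns are hypothetical.

```python
import duckdb

def generate_sql(question: str) -> str:
    """Stub for a locally hosted open-source LLM that translates a question into SQL.
    In practice this call goes to your own model endpoint; the hard-coded SQL below
    only illustrates the expected output."""
    return """
        SELECT product_category, SUM(revenue) AS total_revenue
        FROM read_parquet('data/marts/sales/*.parquet')  -- placeholder path to open-format files
        GROUP BY product_category
        ORDER BY total_revenue DESC
    """

question = "Which product categories generated the most revenue?"
sql = generate_sql(question)      # the agent queries the data instead of memorizing it
result = duckdb.sql(sql).df()     # executed directly against files in your own storage
print(result)
```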

For this smooth dialogue between business user, AI agent, and data platform not to end in chaos, the AI must understand exactly how the data is structured and what it means semantically. Technology alone is not enough here, which brings us to the often underestimated core of data sovereignty.

Data Governance: From Rulebook to Strategic Enabler

Data platforms keep confirming an insight that has become a near-universal truth in many other fields as well: technology alone is no guarantee of success. A modern Data Lakehouse and advanced agentic AI come to nothing if the underlying data quality is poor or the semantic meaning of the data remains unclear. At this point, data governance transforms from an often unloved control instrument into a genuine strategic enabler.

If an AI agent is to translate a user request into a precise database query, it needs more than access to tables. It needs context. Without a maintained business glossary, clear metadata, and defined responsibilities (data ownership), the risk is high that the AI delivers results that are syntactically correct but factually wrong. “Garbage in, garbage out” applies more than ever in the age of artificial intelligence.

A clean governance structure solves this problem at the root:

Central Truth, Decentralized Use

Clear quality rules and defined data products create a foundation of trust. Business departments can rely on the information provided being correct, current, and legally sound.

Genuine Democratization

Only this trust makes self-service analytics possible. Once the guardrails of governance are in place, data can be democratized and made securely available across the entire company without the IT department having to approve every single report manually. AI results, too, can be adopted and reused without headaches about hallucinations or legal concerns.

Compliance as the Standard

With strict European regulations such as the GDPR and the AI Act in mind, integrated governance ensures that access rights, anonymization, and traceability (data lineage) are anchored in the architecture from the very beginning.

Companies that take on responsibility for their data internally in this way create the essential precondition for scalability.

How Does the Migration Succeed?

The advantages of open standards and a sovereign architecture are obvious. Still, many IT leaders shy away from stepping out of vendor lock-in because they fear a risky, multi-year IT megaproject. Yet breaking free from closed systems does not require a risky “big bang”.

Successful migration projects in practice prove that the move to an open, more sovereign architecture can be made in an agile, incremental way:

Use-Case-Driven Migration

Instead of replacing the entire historical data warehouse at once, the new, open platform is built up in parallel. Migration proceeds along prioritized, business-critical use cases.

Fast Return on Investment (ROI)

By first migrating the data domains that deliver the highest immediate value, for example to enable new use cases that previously seemed impossible, the transformation often pays for itself during the project itself.

Risk Minimization

This step-by-step approach ensures that day-to-day operations (reporting and ongoing analyses) continue completely undisturbed while the future-proof foundation grows iteratively in the background.

The transition to open software and vendor-independent data formats is therefore not an IT end in itself, but a plannable, low-risk investment in the company’s ability to act.

Actively Shaping Sovereignty

Truly sovereign is only the company that fully controls the architecture, quality, and location of its data and is aware of this responsibility. If you want to break free from dependency, leave expensive licensing models behind, and create a legally sound basis for artificial intelligence, the path inevitably leads through open standards.

Take back full responsibility for your data. Turn your IT infrastructure from a pure cost factor into the decisive competitive advantage in your industry.

As experts in big data and the development of modern data platforms, Scalefree supports European companies in taking this path successfully. We plan and deliver end-to-end data and AI solutions at any scale, from strategic architecture consulting to implementation, including agentic AI.

Is Your Data Ready for the Future?

Let us take a look at your current architecture together in a no-obligation conversation. Find out how a tailored Data Lakehouse based on open standards can secure your data sovereignty for the long term.

Schedule a Free Initial Consultation

– Ole Bause (Scalefree)

About the Author

Ole Bause has been with Scalefree since 2021, working in business intelligence, data engineering, and enterprise data warehousing with Data Vault 2.0. He is a certified Data Vault 2.0 Practitioner and has extensive experience with a variety of cloud-based data warehouse services. Data warehouse automation is also one of his core competencies.

Set-Based Multi-Active Satellite Derived from a Record-Level Multi-Active Satellite

Multi-Active Satellites: Handling Delta Loads and Set-Based Derivation

A detailed modeling question came in recently about deriving a set-based Multi-Active Satellite from a record-level Multi-Active Satellite, and whether using the resulting Business Satellite as input for a parent PIT Table is valid Data Vault practice. The answer involves a few clarifications on terminology and a practical approach to delta loading that makes the Business Vault layer largely unnecessary for this use case. This post breaks it down.



Multi-Active Satellites: The Full Load vs. Delta Load Problem

A Multi-Active Satellite captures multiple active records per business key at the same point in time — phone numbers are the classic example. A person can have a home number, a mobile number, and a work number all active simultaneously. In the Satellite, each row is uniquely identified by the hash key, the Load Date Timestamp, and the Multi-Active Key (in this case, the phone type).

The Hashdiff in a Multi-Active Satellite is calculated across the entire group — all active records for that business key at that load date — not per individual row. This means when any record in the group changes, the Hashdiff changes for the whole group, and the entire group is re-inserted with a new Load Date Timestamp. The old group is virtually end-dated. This works cleanly when you receive full loads: every batch contains all active records, so you always have the complete group to work with.

The challenge arises with delta loads. If the source only sends what changed — say, a new work number is added, but the home and mobile numbers are not re-sent because they didn’t change — you can’t calculate the correct group-level Hashdiff from the incoming batch alone. The group is incomplete.

Reconstructing the Full Group from Delta Loads in the Raw Data Vault

The solution is to reconstruct the full Multi-Active group before loading, without moving this logic into the Business Vault. The approach is straightforward: derive the most recent Multi-Active group from the existing Satellite, combine it with the incoming delta records, and use the resulting complete set to calculate the Hashdiff and load the Satellite as if it were a full batch for that group.

In practice, a staging table acts as the assembly point. The latest group from the Satellite is pulled into staging alongside the incoming delta. Together, they form the complete current group. From there, the standard Multi-Active Satellite loading pattern applies — the Hashdiff is calculated over the full group, a new Load Date Timestamp is assigned, and all records in the group are inserted together.

This approach handles delta loads cleanly in the Raw Data Vault, which means there’s no need for a Business PIT Table or a Business Satellite just to reconstruct the full set. The reconstruction happens at load time, not at query time.
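
The following pandas sketch illustrates the idea with hypothetical column names: the latest group from the Satellite and the incoming delta are combined in staging, and the Hashdiff is then calculated over the complete group, just as in a full load.

```python
import hashlib
import pandas as pd

def group_hashdiff(group: pd.DataFrame) -> str:
    """Hashdiff over the whole multi-active group: sort by the multi-active key,
    then hash the concatenated descriptive values (illustrative column names)."""
    ordered = group.sort_values("phone_type")
    payload = "|".join(ordered["phone_type"] + ";" + ordered["phone_number"])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Latest group already stored in the Satellite for this hash key (last full group)
latest_sat_group = pd.DataFrame({
    "hk_person":    ["a1", "a1"],
    "phone_type":   ["home", "mobile"],
    "phone_number": ["0511-111", "0171-222"],
})

# Incoming delta: only the new work number is delivered
delta = pd.DataFrame({
    "hk_person":    ["a1"],
    "phone_type":   ["work"],
    "phone_number": ["0511-999"],
})

# Staging assembles the complete current group: existing records plus the delta,
# with the delta winning on conflicts per multi-active key.
full_group = (
    pd.concat([latest_sat_group, delta])
    .drop_duplicates(subset=["hk_person", "phone_type"], keep="last")
)

new_hashdiff = group_hashdiff(full_group)
# If new_hashdiff differs from the stored group Hashdiff, the whole group is inserted
# with a new Load Date Timestamp, exactly as in a full load.
```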

Using the PIT Table to Manage Granularity

The second part of the question was about reducing the Multi-Active group to a single record per business key per Load Date Timestamp — storing that reduced result in a Business Satellite and using it as input for the parent PIT Table.

This is valid, but there’s a lighter alternative worth considering: handle the granularity reduction directly in the PIT Table rather than creating a dedicated Business Satellite for it.

A standard PIT Table references a Multi-Active Satellite via the hash key and Load Date Timestamp, which points to the entire group. If you want the full group available for querying, this is all you need — the PIT gives you the reference, and the join returns all active records for that timestamp.

If you only want one specific record from the group — say, just the mobile number — you add the Multi-Active Key as an additional column in the PIT Table. The PIT row then carries the hash key, the Load Date Timestamp, and the specific Multi-Active Key value you want. The join returns exactly one record. Selecting which Multi-Active record to surface is business logic, and the PIT Table is a clean place to encode it without materializing an intermediate Business Satellite.

If you need to surface multiple specific records — home number and mobile number separately — you add additional column sets to the PIT Table, one per record type. Each set carries its own hash key, Load Date Timestamp, and Multi-Active Key reference. This keeps everything in one structure and avoids unnecessary materialization.

When a Business Satellite Does Make Sense

A Business Satellite for this purpose isn’t wrong — it’s just not always necessary. If the reduced, single-record-per-key result is consumed by multiple downstream processes and materializing it improves performance or simplifies maintenance, building a Business Satellite is a reasonable choice. But if the only goal is to filter down to one record for a specific downstream view, doing it in the PIT Table is simpler and avoids creating an entity whose sole purpose is granularity reduction.

The key principle: keep the Raw Data Vault responsible for capturing the full, accurate group, and make granularity and selection decisions in the PIT Table or Business Vault based on what downstream consumption actually requires.

To go deeper on Multi-Active Satellites, PIT Table design, and the full Data Vault methodology, explore our Data Vault training and certification programs. The free Data Vault handbook is also available as a physical copy or ebook.

Watch the Video

Handling Zero Key References and Optional Data

Zero Key References and Optional Data in Data Vault Modeling

A nuanced modeling question came in recently about how to handle a source table that delivers relationships between two business objects, where one specific role type has no related second object — just additional attributes in the same record. The proposed model was a solid starting point, and the discussion that followed touched on several important Data Vault principles: zero key handling, CDC Satellites vs. multi-active Satellites, effectivity tracking, and why filter conditions in Raw Data Vault loading are a risk worth avoiding. This post walks through each of these in turn.



Zero Key References: Modeling Optional Relationships

The scenario involves a Link between two Hubs — Object One and Object Two — with a role type as part of the Link structure. For most role types, both Hub references are populated. For one specific role type, Object Two doesn’t exist; the source provides additional descriptive attributes instead of a foreign key.

The correct handling for the missing Object Two reference is straightforward: use the all-zeros key. When a foreign key in the source is null, the Link refers to the zero key in the Hub rather than storing an actual null. This keeps the model queryable with inner joins, avoids null-handling complexity downstream, and is entirely consistent with Data Vault null business key handling. Both Hubs should have their two zero key rows — the all-zeros key for unknown or null references, and the all-Fs key for error cases — deployed as standard practice.

The role type sits in the Link structure alongside the two Hub references, which means it participates in the hash key computation for the Link. When the role type changes, the Link sees it as a new entry. The effectivity Satellite then captures when the old record was no longer active. That’s the expected behavior.

Multi-Active Satellites vs. CDC Satellites: Choosing the Right Approach

The proposed model used multi-active Satellites to capture multiple rows with different valid_from and valid_to dates. Whether this is the right choice depends on one key question: are those records all active at the same time in the source system? Can a source system user see all of them simultaneously?

If yes — multiple records with different validity periods are all visible and active concurrently in the source — then a multi-active Satellite is the appropriate choice. The multi-active attribute should be a sub-sequence number generated in staging rather than a business-supplied date, keeping control of uniqueness on the Data Vault side rather than trusting the source.

If no — the records represent sequential changes, not concurrent active states — then a CDC Satellite is a cleaner fit. A CDC Satellite is structurally a standard Satellite, but the load date is modified by adding a sequence number from the CDC package as microseconds. This means only one row is active at any given moment, which simplifies PIT Table construction (two columns instead of three per Satellite) and improves join performance. The choice between the two comes down to how the source system actually manages these records.

A third alternative worth noting: for multi-active scenarios, a JSON array stored in a standard Satellite can replace a traditional multi-active Satellite in some cases. It depends on the loading mechanism and the downstream consumption requirements, but it’s a valid option that avoids the multi-active complexity entirely by capturing multiple active rows as a structured JSON payload.

Effectivity Tracking and the Status Tracking Satellite

The proposed model included a separate status tracking Satellite alongside an effectivity Satellite. A cleaner and more storage-efficient approach is to merge these by adding a deletion timestamp directly to the effectivity Satellite.

The deletion timestamp works simply: when a record exists in the source, the deletion timestamp is set to end-of-all-times. When a deletion is detected — through a comparison of the current load against the target — the deletion timestamp is updated to the current load date, marking the record as no longer physically present in the source. If the record reappears, the timestamp reverts to end-of-all-times.
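
A simplified sketch of that state transition is shown below, using hypothetical column names. Note that a real Data Vault implementation would record these changes as new insert-only Satellite rows rather than updating in place; the in-place updates here only illustrate the intended states.

```python
from datetime import datetime
import pandas as pd

END_OF_ALL_TIMES = pd.Timestamp("9999-12-31")

def update_deletion_timestamps(effectivity_sat: pd.DataFrame,
                               current_load_keys: set,
                               load_date: datetime) -> pd.DataFrame:
    """Illustrative sketch: compare the current load against the target and maintain
    the deletion timestamp in the effectivity Satellite (column names are hypothetical)."""
    sat = effectivity_sat.copy()

    # Key no longer delivered by the source -> mark as deleted with the current load date
    vanished = ~sat["hk_link"].isin(current_load_keys) & (sat["deleted_ts"] == END_OF_ALL_TIMES)
    sat.loc[vanished, "deleted_ts"] = pd.Timestamp(load_date)

    # Key reappears in the source -> revert the deletion timestamp to end-of-all-times
    reappeared = sat["hk_link"].isin(current_load_keys) & (sat["deleted_ts"] != END_OF_ALL_TIMES)
    sat.loc[reappeared, "deleted_ts"] = END_OF_ALL_TIMES

    return sat
```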

All timelines — valid_from, valid_to, deletion timestamps — belong together in the effectivity Satellite. This consolidation reduces the number of Satellites to manage and makes the model more straightforward to query.

GDPR and Customer Re-Registration

A related question came up about GDPR: if a customer requests data deletion and later re-registers, are they treated as a new record? The answer is yes, and it’s an important distinction from standard soft-delete handling.

Soft deletes in the Raw Data Vault are used to track hard deletes from the source for non-legal reasons — products removed, records archived, relationships ended. The history is preserved in the Vault even when it’s gone from the source.

GDPR is different. When a deletion is legally required, the personal data must be genuinely removed — a hard delete in the target. The non-personal data associated with that customer may be retained, but the link between the old history and the re-registering customer is permanently severed. If that customer returns and creates a new record, there’s no way to reconnect them to their previous history, because that connection no longer exists in the model. This is by design: the loss of that relationship is the point.

Why Filter Conditions in Raw Data Vault Loading Are Risky

One of the more important principles raised in this discussion: never apply filter conditions when loading the Raw Data Vault. The specific question was whether it’s acceptable to filter the source by role type when loading the additional attributes Satellite — loading only rows where the role type matches the one that doesn’t reference Object Two.

The answer is no, and the reasoning is worth understanding clearly. Applying a WHERE condition or a filtering join in the Raw Data Vault loading process is an application of business logic. The Raw Data Vault is supposed to capture raw data as delivered, without interpretation. Any filter condition that depends on the content of the data — rather than purely technical checks like delta detection or null replacement — violates that principle.

The practical risk is concrete: source system behavior changes. A new role type is introduced that also lacks an Object Two reference. The filter condition doesn’t know about it, so those records get skipped. Or dirty data arrives where an unexpected combination of fields appears — both an Object Two reference and additional attributes for a role type that wasn’t supposed to have both. A filter condition handles this incorrectly or drops data silently.

The only conditions permitted in Raw Data Vault loading are technical ones: checking whether a business key or relationship already exists in the target (the delta check), and replacing null values with zero keys. Everything else — including role-type-based filtering — belongs in the Business Vault, where it can be applied as explicit, versioned, testable business logic.

Validating the Model with the JEDI Test

When uncertain about a modeling decision — whether to use a multi-active Satellite or a CDC Satellite, whether to split or consolidate — the JEDI test provides a reliable check. The test is simple: try to reconstruct the original source delivery from the Raw Data Vault. Join everything back together and verify that no records are lost, no columns are missing, and no artificial records have been generated that didn’t exist in the source.

If the reconstruction succeeds without data loss or artificial inflation, the model is valid. Whether it’s the best model depends on the data and the downstream consumption patterns — but validity is the baseline, and the JEDI test is how you prove it.
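
As a rough illustration, the sketch below rebuilds a single-Hub, single-Satellite delivery with pandas and compares it against staging. The table and column names are hypothetical, and a real JEDI test would of course span the full model, including Links and all Satellites.

```python
import pandas as pd

def jedi_test(stage: pd.DataFrame, hub: pd.DataFrame, sat: pd.DataFrame) -> bool:
    """Rebuild the source delivery from Hub + latest Satellite rows and compare to staging."""
    latest_sat = (
        sat.sort_values("load_date")
           .groupby("hk_customer")
           .tail(1)
    )
    reconstructed = (
        hub.merge(latest_sat, on="hk_customer", how="inner")
           .loc[:, ["customer_id", "name", "city"]]
           .sort_values("customer_id")
           .reset_index(drop=True)
    )
    original = (
        stage.loc[:, ["customer_id", "name", "city"]]
             .sort_values("customer_id")
             .reset_index(drop=True)
    )
    # No lost records, no missing columns, no artificial rows that were never delivered
    return reconstructed.equals(original)
```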

To explore these modeling patterns in depth — including zero key handling, effectivity Satellites, CDC loading, and the JEDI test — check out our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical copy or ebook.

Watch the Video

How Model Access Works in dbt Cloud: Groups, Permissions & Cross-Project References

Model Access in dbt Cloud

As organizations scale their data platforms, controlling data access and ownership becomes increasingly important. Teams need clear rules around who can use specific datasets, how models are shared across projects, and how data governance is enforced without slowing down collaboration.

dbt Cloud provides powerful model access features that help analytics engineers and data teams manage visibility, ownership, and cross-project collaboration. By defining groups and access levels such as private, protected, and public, organizations can build scalable and secure data architectures aligned with modern data mesh principles.

This article explains how model access works in dbt Cloud using a producer-consumer project setup. We will explore access levels, group configuration, cross-project references, and how model relationships appear in the Catalog and lineage graph.



Why Model Access Matters in Modern Data Platforms

In a growing data ecosystem, multiple teams often build and consume data models simultaneously. Without clear access control, this can lead to:

  • Unclear data ownership
  • Breaking changes across teams
  • Data quality risks
  • Limited governance
  • Uncontrolled dependencies

dbt addresses these challenges by allowing teams to:

  • Control who can reference specific models
  • Define ownership through groups
  • Enable safe collaboration between teams
  • Support data mesh architectures
  • Manage cross-project dependencies

With proper configuration, organizations can ensure reliable and scalable data pipelines while maintaining flexibility.

Understanding Model Access Levels in dbt

dbt provides three main model access levels that control visibility and usage across projects and teams.

Private Models

Private models are restricted to a specific group. Only models within the same group can reference them.

Key characteristics:

  • Limited visibility
  • Strong access control
  • Internal use within a team
  • Prevents external dependencies

Private models are useful for intermediate transformations or sensitive data that should not be widely accessible.

Protected Models

Protected models allow broader usage while maintaining controlled ownership.

Key characteristics:

  • Can be referenced outside their group
  • Still managed by a specific owner
  • Suitable for shared internal data
  • Balanced control and accessibility

By default, dbt models are protected unless another access level is specified.

Public Models

Public models are designed for cross-project collaboration and wider consumption.

Key characteristics:

  • Accessible across projects
  • Supports data sharing
  • Enables data mesh architecture
  • Clear ownership boundaries

Public models are commonly used as trusted data products that other teams depend on.

Producer and Consumer Project Setup

To demonstrate model access behavior, we use two dbt Cloud projects:

  • Producer Project: Provides models for consumption
  • Consumer Project: References public models from the producer

This setup reflects real-world scenarios where one team publishes data assets and another team consumes them.

Configuring Groups in dbt

Groups in dbt define ownership and control access to models. They help manage responsibilities and enforce governance.

A group can be configured in YAML files and assigned to specific models or entire folders.

Assigning Groups to Individual Models

For example, an analytics group can be defined and assigned to models such as:

  • Orders per supplier country customer
  • Orders per customer
  • Orders per supplier
  • Orders per country

This configuration ensures that models follow consistent ownership and access rules.

Assigning Groups to Folders

Instead of configuring each model individually, teams can assign a group to an entire folder using the project configuration file. This approach simplifies governance and ensures consistent access settings.

Testing Private Model Access

Private models can only be referenced by models within the same group. If a model outside the group attempts to reference a private model, dbt returns an error.

To resolve this, the referencing model must be added to the same group configuration.

This behavior ensures:

  • Clear ownership boundaries
  • Reduced risk of unintended dependencies
  • Improved data security

Private access is particularly useful for internal transformations or staging logic that should remain hidden from external teams.

Using Protected Models

Protected models offer more flexibility. They can be referenced outside their group without requiring additional configuration.

This makes them ideal for:

  • Shared business logic
  • Reusable transformations
  • Internal reporting models
  • Organization-wide metrics

Protected access balances governance with usability.

Configuring Public Models for Cross-Project Use

Public models allow other projects to reference them, enabling collaboration across teams.

For public models to be visible to other projects, dbt Cloud requires:

  • A successful job run
  • An environment defined as staging or production
  • Metadata resolution by the dbt Cloud service

Once these conditions are met, public models become available to consumer projects.

This mechanism ensures that only validated and production-ready models are shared.

Star Schema Example with Mixed Access Levels

A typical data modeling setup might include a star schema with dimensions and fact tables.

In this scenario:

  • Most models are configured as public
  • Specific models, such as certain dimensions, may remain protected
  • Protected models are used internally by other models
  • Only necessary data is exposed externally

This design prevents unnecessary data exposure while maintaining efficient dependencies.

Exploring Model Ownership in the dbt Catalog

The dbt Catalog provides visibility into model ownership, access levels, and dependencies.

Within the Catalog, users can:

  • View model owners
  • Filter by access level
  • Explore group information
  • See associated models
  • Understand data lineage

This transparency improves governance and helps teams understand data responsibilities.

Referencing Public Models from Consumer Projects

Consumer projects can reference public models from producer projects using cross-project references.

These models appear in the development environment and lineage graph as external dependencies. However, consumer teams cannot build or modify them. The producer team retains full ownership.

This separation provides:

  • Clear responsibility boundaries
  • Reduced operational risk
  • Reliable shared data products
  • Improved collaboration

Understanding Lineage Graphs and Dependencies

The lineage graph provides a visual representation of how models relate to one another across projects.

It helps teams:

  • Track upstream and downstream dependencies
  • Understand data flow
  • Identify external data sources
  • Analyze impact of changes
  • Improve system reliability

For cross-project references, the lineage graph clearly shows the project and environment where external models originate.

Benefits of Model Access Control in dbt

Implementing structured model access provides significant advantages:

  • Strong data governance
  • Clear ownership structure
  • Secure data sharing
  • Scalable collaboration
  • Support for data mesh architecture
  • Reduced risk of breaking changes
  • Improved transparency

These capabilities help organizations scale their analytics operations effectively.

Best Practices for Managing Model Access

To maximize the benefits of dbt model access, organizations should follow these practices:

  • Define clear ownership groups
  • Use private models for internal logic
  • Expose only necessary data through public models
  • Validate models before sharing
  • Monitor dependencies using lineage graphs
  • Document access rules consistently

Following these guidelines helps maintain a reliable and scalable data ecosystem.

Conclusion

Model access in dbt Cloud enables organizations to control data visibility, manage ownership, and support cross-team collaboration. By configuring groups and defining access levels such as private, protected, and public, teams can build secure and scalable data architectures.

When combined with Catalog visibility and lineage tracking, these features provide a strong foundation for data governance and modern analytics workflows.

As data platforms continue to grow in complexity, structured access control becomes essential for ensuring trust, reliability, and collaboration across the organization.

Watch the Video

Meet the Speaker


Dmytro Polishchuk
Senior BI Consultant

Dmytro Polishchuk has 7 years of experience in business intelligence and works as a Senior BI Consultant for Scalefree. Dmytro is a proven Data Vault 2.0 expert and has excellent knowledge of various (cloud) architectures, data modeling, and the implementation of automation frameworks. Dmytro excels in team integration and structured project work. Dmytro has a bachelor’s degree in Finance and Financial Management.

Understanding Error Keys in Data Vault

Error Keys in Data Vault: Understanding Zero Keys and Null Business Key Handling

One of the more subtle but important concepts in Data Vault is the handling of null business keys — known as zero keys in Data Vault 2.0 and formally called null business key handling in Data Vault 2.1. Most practitioners understand the first zero key intuitively, but the second one — and where it actually earns its value — is less commonly understood. This post explains both, and where each one belongs in practice.



Error Keys Explained: The Two Zero Keys

Every Hub and Link in a Data Vault model is deployed with two special rows pre-loaded: one with a hash key of all zeros, and one with a hash key of all Fs. These are the two zero keys, and they exist to handle null business keys cleanly throughout the model.

The all-zeros hash key is the more commonly understood of the two. It replaces null values in Links — specifically, null references to business keys. When a relationship is received with a missing or null Hub reference, that null gets replaced by the all-zeros key rather than being stored as an actual null. This allows the model to rely on inner joins consistently when querying the Data Vault, without having to handle nulls case by case through left joins or null checks. When you join from a Link to a Hub, you always hit a record — either a real business key or the zero key. Clean, fast, and predictable.

The all-Fs hash key serves a distinct and more specific purpose: it marks bad data, as opposed to merely missing or ugly data. Understanding the difference between those two things is the key to understanding why two zero keys exist at all.

Ugly Data vs. Bad Data: Why the Distinction Matters

Consider a transaction record where the store reference is null. In a brick-and-mortar retail context, this seems wrong — every sale happens somewhere. But in a business that also runs an online store, a null store value might simply mean the transaction happened online. The data is incomplete by conventional standards, but it’s not incorrect. It reflects a real business scenario. This is what you might call ugly data: not ideal, not the most descriptive, but not an error.

Now consider a different scenario: the interface specification for a source system explicitly states that a particular foreign key is non-nullable. The data arrives anyway with null values in that field. Here, either the data is genuinely corrupted or the specification is wrong. Either way, something has gone wrong. This is bad data — data that shouldn’t exist in the form it arrived.

The all-zeros key handles the ugly case. The all-Fs key is reserved for the bad case. Having both allows the model to preserve the distinction rather than collapsing all null situations into a single catch-all placeholder.

Where the All-Fs Key Is Actually Used in Practice

In theory, the all-Fs key could be applied in the Raw Data Vault whenever a null value violates an interface specification. In practice, this rarely happens. Analyzing every interface description, identifying which nulls represent violations, and modifying the Raw Data Vault mappings accordingly is a significant effort — and most projects don’t invest in it at the raw layer. The all-Fs rows exist in every Hub and Link as a structural feature, but they tend to sit unused in the Raw Data Vault itself.

Where the all-Fs key genuinely earns its place is in the Business Vault and Information Marts. The pattern looks like this: during the construction of a Fact view or a Bridge Table, business logic identifies records that reference Hub keys which shouldn’t exist — store locations that were never valid, product codes that are clearly erroneous, data that passed through the raw layer but doesn’t belong in the dimensional model. Instead of passing those records through to the Dimension with a misleading or nonsensical member, the business logic replaces their hash keys with the all-Fs value.

In the resulting Dimension, those records map to an explicitly erroneous member — a designated “error” row — rather than polluting actual dimension members with bad data. Business users and analysts can see that certain facts are associated with an error case, filter them out, investigate them, or handle them according to reporting requirements. The data is quarantined and labeled, not silently dropped or mixed in with valid records.
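
As a rough illustration of that Business Vault pattern, the sketch below redirects facts with invalid store references to the error member. Everything in it is hypothetical: the view and table names, the rule that decides which store keys are invalid, and the 32-character all-Fs literal, which assumes an MD5-based hash key.

-- Hypothetical Business Vault view: facts referencing invalid stores are mapped
-- to the all-Fs error member instead of polluting the Store dimension.
CREATE VIEW biz_fact_sales AS
SELECT
    l.sale_hk,
    CASE
        WHEN v.store_hk IS NULL THEN REPEAT('F', 32)   -- all-Fs key; length depends on the hash function
        ELSE l.store_hk
    END AS store_hk,
    s.sale_amount
FROM link_sale_store AS l
JOIN sat_sale AS s
    ON s.sale_hk = l.sale_hk
LEFT JOIN ref_valid_stores AS v
    ON v.store_hk = l.store_hk;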

Ghost Records in Satellites

The zero key pattern extends to Satellites as well, through what are called ghost records. At minimum, one ghost record exists in each Satellite — associated with the all-zeros hash key — to ensure that joins from a Hub or Link to a Satellite always return a result, even for the zero key case.

In implementations using the datavault4dbt package, two ghost records are created: one for the all-zeros key and one for the all-Fs key. Beyond making the implementation consistent, this has a practical benefit in the dimensional layer. The two ghost records can carry different descriptive values — for example, “Unknown Customer” for the all-zeros case and “Erroneous Customer” for the all-Fs case. This makes the distinction visible and user-friendly in reports and dashboards, giving analysts a clear signal about what they’re looking at rather than a generic placeholder for both missing and bad data.
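
Conceptually, the two ghost records in a customer Satellite carry rows like the sketch below. datavault4dbt creates them automatically; the column names, dates, and descriptive values shown here are purely illustrative.

-- Illustrative content of the two ghost records in a customer Satellite.
INSERT INTO sat_customer (customer_hk, load_date, record_source, customer_name)
VALUES
    (REPEAT('0', 32), '0001-01-01', 'SYSTEM', 'Unknown Customer'),    -- all-zeros ghost record
    (REPEAT('F', 32), '0001-01-01', 'SYSTEM', 'Erroneous Customer');  -- all-Fs ghost record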

Because the ghost records share their hash keys with the zero keys in the parent Hub and Link, they join naturally without any special handling. It’s a side effect of the design that works elegantly in practice.

Should You Drop the All-Fs Key If You’re Not Using It?

The question occasionally comes up: if the all-Fs key isn’t being used in the Raw Data Vault, can it simply be dropped? Technically, yes. But in most implementations it stays, for a few reasons. It costs almost nothing to maintain — it’s two rows per Hub and Link. It provides a structural home for bad data classification if the need arises later. And its real value, as described above, is realized downstream in the Business Vault and Information Mart, where it’s actively useful for handling erroneous data in business logic and dimensional modeling.

Dropping it from the Raw Data Vault to save minimal overhead would mean losing a precise and semantically meaningful tool at exactly the layer where it’s most needed.

To go deeper on null business key handling, ghost records, and the full Data Vault 2.1 methodology, explore our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical or digital copy for a concise introduction to the core concepts.

Watch the Video

Predictive Analytics on the Modern Data Platform

Predictive Analytics on the Modern Data Warehouse

From BI to AI: Operationalizing Predictive Analytics where your Data already lives

Traditional Business Intelligence and reporting are incredibly good at telling you what happened yesterday. How much revenue was generated last quarter? How many users logged in this week? But while understanding the past is important, today’s businesses need to know what will happen tomorrow.

This is where Predictive Analytics comes in. At its core, predictive analytics simply uses historical data to forecast future outcomes. Instead of asking how many customers canceled last month, predictive analytics asks:

“Which specific customers are most likely to cancel next week?”

Many organizations understand this value and eagerly hire data scientists to build these models. Yet, time and time again, these predictive initiatives fail to make it out of the PoC phase and into daily business operations because of how teams and data architectures are fundamentally structured.

Predictive Analytics on the Modern Data Platform

Learn how to bridge the gap between your data platform and actionable AI by building predictive models directly where your business data lives. This webinar covers practical strategies for transforming warehouse data into features, deploying models, and automating the flow of insights back into your daily operational workflows. Learn more in our upcoming webinar on March 17th, 2026!

Watch Webinar Recording

The Problem: The “Two Silos” of Data

In many companies, Data Engineering and Data Science exist in two entirely different worlds.

Data Engineers and Data Warehouse Developers spend their days building the Modern Data Platform. They carefully extract, clean, conform, and govern data from dozens of sources to create a “Single Source of Truth.” When a business analyst looks at a revenue dashboard, they know they can trust the numbers because the data platform enforces strict business logic.

Data Scientists, on the other hand, often work in isolated environments like standalone Jupyter notebooks. Because they need massive amounts of data to train their machine learning models, they often bypass the data platform entirely, pulling raw, unstructured data directly from a data lake.

This disconnect creates some challenges:

  • Duplicated Effort: Data Scientists waste up to 80% of their time cleaning and prepping raw data, work the Data Engineering team has already done in the platform
  • Inconsistent Metrics: Because models are built on raw data, a model’s definition of “Active Customer” might completely contradict the official definition used in the data platform
  • The “Wall of Production”: A model might look perfectly accurate on a data scientist’s laptop, but because it relies on disconnected, ungoverned data pipelines, integrating its predictions back into the daily workflows of sales or support teams becomes an IT nightmare
  • Exclusivity: Data analysts are often limited to classic descriptive analytics, which slows time-to-insight. The optimal solution is to democratize data science, empowering analysts to implement predictive use cases directly.

The Solution: Bring the Machine Learning to the Data

The fix to this problem requires a fundamental shift in how we think about machine learning architecture. Instead of moving data out of governed systems to feed external ML models, we need to bring the ML workflows closer to the data.

By positioning the Modern Data Platform as the foundation for predictive analytics, you ensure that every prediction is built on the same trusted, cleansed, and governed business data used for your daily reporting. The Data Platform becomes the Feature Store, a centralized hub where data is prepared once and used everywhere, whether for a BI dashboard or training a predictive model.

Predictive Analytics on the Modern Data Warehouse

When the data platform serves as the single source of truth for both analysts and algorithms, magic happens. Data science teams stop wrestling with raw data pipelines, data engineering teams maintain governance, and the business gets predictions they can actually trust and operationalize.

Two Architecture Approaches

So, how do we actually bring the machine learning to the data? There isn’t a one-size-fits-all answer. Depending on the team’s skill set and the complexity of the models, teams typically adopt one of two foundational patterns:

Pattern 1: In-Warehouse Machine Learning (Democratizing ML)

Modern cloud data platforms have evolved beyond just storing and querying data: many now have machine learning engines built directly into them.

  • How it works: Using standard SQL, Data Analysts and Analytics Engineers can train, evaluate, and deploy models entirely inside the data platform (for example, BigQuery ML, Snowflake Cortex, or Databricks); see the sketch after this list
  • The Benefit: This radically democratizes predictive analytics. You don’t need to know Python or manage complex infrastructure to build a model. If you know SQL, you can generate predictions using the exact same tables you use for your BI dashboards
  • The Trade-off: While perfect for standard tasks like regression or classification, you are limited to the specific algorithms supported by the data platform
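
To give a feel for Pattern 1, here is a minimal BigQuery ML sketch for the churn scenario discussed later in this post. The dataset, table, and column names are assumptions made for illustration.

-- Train a churn classifier directly in the warehouse using plain SQL.
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS (
    model_type = 'logistic_reg',
    input_label_cols = ['churned']
) AS
SELECT
    login_frequency,
    open_tickets,
    months_since_signup,
    churned                -- historical label: did this customer cancel?
FROM analytics.customer_features;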

Pattern 2: The Data Platform as a Feature Store (The Hybrid Approach)

For organizations with dedicated Data Science teams building highly complex or custom models, the data platform takes on a different role: the Feature Store.

  • How it works: Data Scientists continue to work in their preferred external ML platforms (like Vertex AI, Databricks, or others). However, instead of pulling messy data from a data lake, they connect directly to the data platform to pull curated, business-approved data (“features”) for training
  • The Benefit: Data Scientists retain maximum flexibility to use advanced Python libraries and deep learning frameworks, while ensuring the models are trained on governed, accurate data
  • The Trade-off: It requires a bit more orchestration to manage the pipeline between the data platform and the ML platform, and to ensure predictions are accurately written back to the data platform
Architecture Feature Store

Example: Predicting Customer Churn

To understand the benefits of these approaches, let’s look at a classic business challenge: Customer Churn Prevention.

Imagine a SaaS company trying to figure out which customers are likely to cancel their subscriptions. In a siloed environment, predicting this can be a messy, manual science project. But on a modern data platform, it becomes an automated operational workflow:

  1. The Foundation (Data): Because of the Data Engineering team’s work, the data platform already contains all the necessary historical information about the customer. CRM data (company size), financial records (billing history), product logs (login frequency), and Zendesk tickets (recent complaints) are all cleaned, joined, and sitting in governed tables, including a full history of changes
  2. The Prediction (Modeling): An analyst uses In-Warehouse ML (Pattern 1) to run a classification model against this historical data. The model identifies the hidden patterns of a churning customer and generates a “Churn Risk Score” between 0 and 100 for every active user
  3. The Operationalization (Action): This is the crucial step. The predictions aren’t left in a notebook. The risk scores are written directly back into a new table in the data platform. Through reverse-ETL, these scores can be automatically synced to the CRM, and dashboards and reports can easily be built on top of the results (see the sketch after this list).
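
A minimal sketch of steps 2 and 3 in BigQuery ML, reusing the hypothetical churn_model and customer_features names from the Pattern 1 sketch above:

-- Score active customers with the trained model and persist the results,
-- so reverse-ETL jobs and dashboards can read them from a governed table.
CREATE OR REPLACE TABLE analytics.churn_risk_scores AS
SELECT
    customer_id,
    predicted_churned,           -- predicted class label
    predicted_churned_probs      -- per-class probabilities, usable as a 0-100 risk score
FROM ML.PREDICT(
    MODEL analytics.churn_model,
    (SELECT * FROM analytics.customer_features WHERE is_active)
);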

Conclusion

Predictive analytics shouldn’t be an isolated science experiment. It should be a living, breathing part of your operational reality. By treating your modern data platform as the foundation for your machine learning workflows, you eliminate data silos, empower your analysts, and ensure your predictions are built on the trusted business data that matters most.

It is time to operationalize predictive insights where your business data already lives.

Want to see how this works in practice?

Join our upcoming webinar: Predictive Analytics on the Modern Data Platform. We will explore how to build and run predictive analytics directly on top of your data platform using trusted, governed business data as the foundation. You’ll learn practical patterns for turning warehouse models into features, training and deploying predictions, and integrating results back into reporting and operational workflows. Join me on March 17th.

Register for free

– Ole Bause (Scalefree)

Data Vault in a Microservices Architecture

Microservices Architecture and Data Vault: Managing Satellites at Scale

Microservices architectures create a specific modeling challenge for Data Vault practitioners. When services are ephemeral — spinning up and down as Docker or Kubernetes containers — each with its own message structure, the standard advice to split Satellites by source system quickly leads to hundreds or thousands of Satellites. At that scale, the real question isn’t about metadata management overhead. It’s about how to consume all that data without joining 500 tables every time you need an answer. This post walks through a practical approach to handling high-volume, highly varied source structures in a Data Vault model.



Microservices Architecture: Why Satellite Splits Become a Problem

The conventional Satellite splitting rules — by rate of change, source system, security, and privacy — exist for good reasons. But in a microservices context, applying them strictly leads to an explosion of Satellites. A new Docker image with a new message structure technically deserves its own Satellite. Automate that process and you accumulate hundreds or thousands of Satellites quickly, most of which may never be queried by anyone.

The issue isn’t that databases can’t handle 500 tables — they can. The issue is the consumption side: joining 500 Satellites to produce a target model is expensive, complex to maintain, and in many cases unnecessary. The real challenge is finding a modeling approach that captures the variety of incoming structures without creating an unmanageable query layer downstream.

Rate of Change Splits: Still Relevant, but Less So for Now

The rate of change split was designed to reduce storage consumption by separating high-frequency attributes from stable ones. Every delta insert copies all columns in the Satellite, so a single change on one attribute in a wide Satellite wastes a lot of storage on unchanged data.

For most modern analytical database systems, compression makes this largely unnecessary. Insert-only tables with lots of redundant data compress extremely well, and virtually all modern analytical platforms support this. The storage cost of skipping the rate of change split is manageable with compression turned on.

That said, this is worth watching. In pay-per-query environments like Athena querying row-based Avro files, or systems that charge based on uncompressed data scanned, the rate of change split becomes economically relevant again. BigQuery’s columnar storage sidesteps this because you only pay for the columns you query — but other managed infrastructure doesn’t work that way. The rate of change split isn’t obsolete; it’s just less pressing for now, and likely to become more relevant as managed, consumption-based pricing models become more common.

The Flip-Flop Effect: Why Source System Splits Still Matter

The source system split is a different matter. Loading data from two different source systems into the same Satellite creates a well-known problem: the flip-flop effect.

Consider a customer whose address is known to both an ERP system (California) and a CRM system (Hannover, Germany). The two systems have different knowledge and potentially different structures for representing the same data. If both load into the same Satellite, the Satellite ends up recording two deltas per day — not because the customer moved, but because two systems loaded sequentially with different values. The data flips between California and Hannover with every load cycle, consuming storage and making it impossible to determine the actual address without applying business logic. Worse, the order of loading determines what the Satellite shows at any given moment — a purely technical artifact with no business meaning.

The fix is straightforward: one Satellite per source. This keeps each system’s view of the data independent and equally available, so business logic in the Business Vault can reconcile them deliberately rather than having the Raw Data Vault collapse them accidentally.

The Gray Area: Millions of Sources, One Practical Solution

The flip-flop rule works cleanly when you have a manageable number of distinct source systems. It breaks down at the extreme end — IoT deployments with millions of sensors, or microservices architectures with hundreds of ephemeral containers — where creating one Satellite per source is operationally impractical.

The solution here depends on two conditions being met. First, you need a key in the parent entity that partitions the data by source — a sensor ID, a Docker image ID, a tenant ID, something that creates independent delta streams within the same Satellite. With this in place, deltas from source A can’t replace or invalidate deltas from source B, which eliminates the flip-flop effect without requiring separate Satellites. Second, the structure of the incoming data must be consistent enough to fit in a shared target — which in practice usually means JSON.

When messages from different microservices or sensors all arrive as JSON — even with different internal structures — you can load them all into a single Satellite or Non-Historized Link with a JSON or JSONB payload column. The structure differences are captured inside the JSON document. You add the partitioning key to the parent, and you’re done. Instead of 500 Satellites with 500 different schemas, you have one entity with a JSON payload and a key that tells you which source produced each record.
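
A minimal sketch of such a consolidated target, using PostgreSQL-style types; all names are illustrative and the hash key length assumes MD5.

-- One Non-Historized Link for all service messages. The source identifier acts as the
-- partitioning key, and the JSONB payload absorbs the structural variety of the messages.
CREATE TABLE nhl_service_message (
    message_hk       CHAR(32)     NOT NULL,   -- hash key of the message event
    customer_hk      CHAR(32)     NOT NULL,   -- reference to a related Hub
    source_image_id  VARCHAR(255) NOT NULL,   -- partitioning key: which container produced the record
    load_date        TIMESTAMP    NOT NULL,
    record_source    VARCHAR(255) NOT NULL,
    payload          JSONB        NOT NULL    -- full message body, whatever its internal structure
);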

Non-Historized Links for Real-Time Messages

For real-time message streams from microservices, a Non-Historized Link with a JSON payload is often the right structure. Real-time messages are events — they don’t update, they accumulate. The flip-flop concern largely disappears because you’re capturing messages as they arrive, not loading full snapshots that might overwrite each other. A Non-Historized Link captures the event, the relevant Hub references, and the message payload in a structure that’s fast to load and straightforward to query.

This same pattern was applied at Scalefree for an investment banking client with 500 different source systems delivering asset data in different CSV formats. Rather than creating 500 entities, a single Non-Historized Link and Satellite captured everything — different CSV structures serialized as JSON strings, distinguished by a load source identifier. Two entities replaced 500, and the consumption layer handled the structural variety through filtering and extraction rather than joins.

Consuming Semi-Structured Data Without Joining 500 Tables

Loading everything into a JSON payload doesn’t eliminate the structural variety — it defers it to query time. When you need data from a specific message type, you need to identify records with the right structure among all the records in the same target entity.

The approach here is filtering rather than joining. Instead of joining 500 Satellites, you query one entity and filter for records that contain specific JSON keys or values that uniquely identify the message type you care about. Email messages, for example, always have a subject, body, sender, and recipient — keys that distinguish them from other message types. A specific transaction type might always carry an ID starting with a known prefix. These structural signatures let you extract subsets of the JSON stream efficiently.

Once filtered, you extract the attributes you need from each subset and UNION the results if you need to combine multiple message types. A UNION of 500 filtered queries on one table is significantly faster than a JOIN of 500 separate tables, and it scales much better as the number of source types grows.
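
As a rough illustration of the filter-then-UNION approach (PostgreSQL JSONB syntax; the table name, JSON keys, and the 'TRX' prefix are assumptions), two message types are extracted from the same payload column by their structural signatures:

-- Filter one entity by structural signature instead of joining hundreds of Satellites.
SELECT
    'email' AS message_type,
    message_hk,
    payload ->> 'subject' AS detail
FROM nhl_service_message
WHERE payload ? 'subject'
  AND payload ? 'recipient'                        -- keys that identify an email message

UNION ALL

SELECT
    'transaction' AS message_type,
    message_hk,
    payload ->> 'transaction_id' AS detail
FROM nhl_service_message
WHERE payload ->> 'transaction_id' LIKE 'TRX%';    -- ID prefix that identifies a transaction message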

Choosing the Right Approach for Your Context

The right answer depends on where you sit on the spectrum between a small number of structurally distinct source systems and a very large number of structurally similar ones. For a handful of systems with genuinely different schemas and different business semantics — CRM, ERP, financial systems — separate Satellites per source is the right call. The flip-flop effect and structural differences make consolidation risky and introduce business logic where it doesn’t belong.

For microservices, IoT devices, or any scenario where you have many sources with similar structures and a partitioning key available, consolidating into a small number of JSON-payload entities is usually the better trade-off. It simplifies loading, reduces metadata overhead, and keeps the consumption layer manageable — at the cost of pushing structural interpretation into filtering and extraction logic downstream.

To go deeper on Satellite design, source system splits, and Data Vault modeling patterns for modern architectures, explore our Data Vault certification program. The free Data Vault handbook is also available as a physical copy or ebook for a solid grounding in the core methodology.

Watch the Video

Salesforce Spring’26 Update in a Nutshell

The Salesforce Spring’26 release has officially arrived, introducing a significant wave of functional updates designed to increase administrative efficiency and improve the end-user experience.

At Scalefree, we have analyzed the extensive release documentation alongside insights from our ecosystem partners to identify the most impactful changes for your business. Here is a professional summary of the key features now available in your environment.

Enhanced Reporting: Dashboard Table Integration

The gap between reports and dashboards has narrowed. You can now utilize native Report Table settings directly within Dashboard components. This update ensures that your dashboard tables automatically respect conditional highlighting, manual column widths, and summary rows previously defined in your source reports, providing a more consistent and professional data visualization experience.

Proactive Governance: Security Health Check Advancements

Salesforce has expanded the Health Check suite to offer more granular visibility into org vulnerabilities. New features include real-time monitoring of session settings and enhanced credential auditing. These tools transition security from a periodic review to a proactive, continuous defense strategy for your business data.

Note for Marketing & Operations: While native tools provide a strong foundation, maintaining an optimized environment requires a strategic approach. If you are looking to validate your configuration against industry best practices, we recommend the Scalefree Admin Buddy. It is designed to help organizations maintain a lean, high-performing Salesforce instance. Learn more about Scalefree Admin Buddy

Automation Excellence: The Evolution of Flow

The Spring ’26 release continues to prioritize Flow as the central engine for business logic, specifically bridging the gap between manual intervention and automated efficiency.

  • Integrated Approval Components: You can now embed approval processes directly within Screen Flows. This allows users to review, comment on, and approve records within a single interface, significantly reducing context switching and improving process velocity.
  • File-Triggered Automation: A long-awaited update. Both ContentVersion and ContentDocument are now available as entry criteria for Record-Triggered Flows. This allows for immediate automated actions—such as notifications or status updates—the moment a document is uploaded.

Operational Transparency: Data 360 and Logging

To support more complex automation, Salesforce has introduced Data 360 for Flow Logging. This provides a comprehensive audit trail for your automated processes, functioning as a diagnostic “black box” to provide clear visibility into flow execution and simplify troubleshooting.

Additional Resources

This summary covers only a fraction of the Spring ’26 capabilities, which also include advancements in AI-driven agents and developer productivity tools.

If you would like a deeper technical analysis or wish to discuss how these specific features can be applied to your current business processes, please reach out to one of our experts. We are happy to share our detailed sources and internal testing experiences with you.

Book Your Expert Call

Scalefree @ IT-Tage 2025: Global Innovation, Local Trust

Salesforce IT-Tage Travel Report

How to combine Salesforce innovation with European data standards for a competitive edge in 2026

Fresh back from Kap Europa in Frankfurt, the Scalefree Salesforce team has unpacked a lot of inspiration. While our daily focus is on evolving Customer Relationship Management, IT-Tage 2025 offered an impressive 360-degree view of the IT landscape.

One thing became clear immediately: The “Sovereignty Blues” of recent years has evaporated. The industry is no longer complaining about dependencies; it is in full execution mode.

Here are our top takeaways for decision-makers and admins on how to navigate the future of Salesforce and Digital Sovereignty.

Salesforce IT-Tage Travel Report

Digital Sovereignty: The New “Freedom of Choice”

Sovereignty was the red thread running through the keynotes (Jutta Horstmann’s talk on digital strategy was a standout).

Salesforce Digital Sovereignty IT Days Travel Report

The Scalefree Take: Sovereignty is not a “Cloud Ban.” It is the ability to actively manage your technological dependencies.

  • Best of Breed: The 2025 standard is combining global innovation (SaaS giants like Salesforce) with local security standards.
  • Data Control: In our discussions, it became clear that “Data Residency” isn’t enough. True sovereignty comes from Process Excellence. You are only sovereign if you understand your data architecture and own your business logic.
  • The Result: We can use modern integration architectures to give you the best of both worlds—US innovation with EU compliance.

Architecture & GenAI: Looking into the Crystal Ball

The sessions on Generative AI and software architecture were eye-openers. But we also saw the need for a reality check.

The Scalefree Take: AI is not a strategy; it’s a tool.

  • Lifecycle over Code: LLMs change how we write code, but they don’t replace the need for planning. If anything, the ethical responsibility and security of the generated logic matter more than the programming language itself.
  • No AI without Clean Core: For our Salesforce clients, this confirms our mantra: AI-driven automation must be embedded in a robust architecture. If you automate chaos with AI, you just get faster chaos. Data Quality is the prerequisite for AI success.

Sustainability: More Than Just Marketing

“Green IT” wasn’t just a buzzword this year; it was about hard metrics. We saw deep dives into how energy-efficient algorithms can minimize CO2 footprints.

The Scalefree Take: Efficiency equals Stability.

  • Resource Intensity: A reflective talk by Hannah Herbst highlighted the resource cost of LLMs.
  • Scale Smart: For Salesforce Admins and CFOs, this is crucial. A bloated Salesforce instance isn’t just expensive to license; it’s technically “unhealthy.” Optimizing your Org isn’t just good for the planet—it’s good for your bottom line.
Salesforce IT Days Travel Report

Conclusion: Our Roadmap for 2026

IT-Tage 2025 confirmed that we are on the right track.

Digital Sovereignty and high-efficiency Customer Management with Salesforce do not contradict each other—they reinforce each other.

Whoever maintains control over their architecture (Sovereignty) and enforces strict Data Quality (Excellence) can use the best and most innovative tools (Salesforce) safely and sustainably.

Frankfurt, see you in 2026!

– Markus Lewandowski (Scalefree)
