
Hash Keys and Modern Data Platforms

Hash Keys in Data Vault on Modern Data Platforms: Snowflake, Fabric, and Beyond

A question that comes up regularly — especially from teams working on cloud-native platforms like Snowflake — is whether hash keys are still necessary, or whether sequences or raw business keys might be more efficient. It’s a fair question, and the answer depends on understanding what hash keys actually solve, what the alternatives cost, and how modern massively parallel processing (MPP) platforms change the performance equation. This post covers all three options and explains why hash keys remain the recommended approach even on modern platforms.



Hash Keys on Modern Data Platforms: Why Not Sequences?

Sequences are the first alternative most people consider — integers are small, fast to compare, and familiar. But they come with a fundamental structural problem: they require lookups. To load a Link, you need the sequence values for the Hubs it references, which means Hubs must be loaded before Links, Links before their Satellites, and so on. In small, single-environment setups, this ordering constraint is manageable. In large-scale or distributed environments, it becomes a serious obstacle.

Consider a setup where facts and real-time feeds live in the cloud while customer master data lives on-premise. To load a fact with a sequence-based key, you need to look up the sequence for each customer from the on-premise system — through a firewall, across a network, under latency and security constraints. In practice, this doesn’t scale. It introduces tight loading dependencies between systems that should be able to operate independently.

Hash keys and business keys don’t have this problem. Hash the same business key on two different systems and you get the same hash value. Both environments can load independently and join cleanly without cross-environment lookups. At Scalefree, the only clients currently using sequences in their Data Vault are on migration projects — migrating away from sequences. That’s worth keeping in mind before choosing them.
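As a minimal sketch of that property, the snippet below hashes the same business key in two places and compares the results. The function name `hash_key`, the `||` delimiter, and the trim-and-upper-case normalisation are illustrative assumptions, not a prescribed standard; what matters is that both systems apply the same rules.

```python
import hashlib


def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Return a 32-character MD5 hash key for one or more business key parts."""
    # Assumed normalisation: trim and upper-case so every system hashes identical input.
    normalised = [part.strip().upper() for part in business_key_parts]
    digest = hashlib.md5(delimiter.join(normalised).encode("utf-8"))
    return digest.hexdigest().upper()


# Two environments hash the same customer number without talking to each other.
cloud_key = hash_key("c-1001 ")
on_prem_key = hash_key("C-1001")

print(cloud_key == on_prem_key)  # True: identical input after normalisation
```

No lookup against the other environment is needed: each side derives the key from data it already holds.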

Business Keys: When They Work and When They Don’t

Business keys are the other alternative. On the surface, a business key stored directly in a Hub seems simpler than hashing it — one less step, shorter values. And on modern MPP platforms like Snowflake, Fabric, or BigQuery, the join performance argument for hash keys is less compelling than it used to be. These platforms distribute and index data across thousands of nodes in ways that make business key joins perform reasonably well.

The problem shows up in Links. A Link referencing three or four Hubs combines multiple business keys into its primary key. A VIN alone is 17 characters; combine it with a customer number, a transaction ID, and a location code and you’ve already exceeded the 32 characters of an MD5 hash. Business keys are also often variable-length, which matters on traditional row-based database systems: fixed-length fields are guaranteed to stay in the primary page during a join, while variable-length fields may be offloaded to a secondary page, turning a two-page join into a four-page operation.

On Non-Historized Links and their attached Satellites — where volume is high and the primary key is replicated across every row — wide, variable-length business key combinations compound quickly into a storage and performance problem. As you dig deeper into the Data Vault model with more complex queries and more joins, the size of the join conditions grows with the business keys.
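To make the width argument concrete, here is a small sketch that compares a concatenated business key with its hashed equivalent. All sample values and the `||` delimiter are made up for illustration.

```python
import hashlib

# Made-up sample business keys for a Link that references four Hubs.
business_keys = {
    "vin": "1HGCM82633A004352",
    "customer_number": "CUST-00012345",
    "transaction_id": "TX-2024-000987654",
    "location_code": "DE-HAJ-01",
}

# Joining on raw business keys means carrying all four values (plus delimiters).
composite_key = "||".join(business_keys.values())

# The hashed Link key has the same fixed width no matter how many Hubs are involved.
link_hash_key = hashlib.md5(composite_key.encode("utf-8")).hexdigest()

print(len(composite_key), len(link_hash_key))
```

The composite grows with every additional Hub reference, while the hash key stays at 32 characters.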

The other practical constraint is tool stack consistency. If your environment mixes a cloud MPP platform with an on-premise Postgres derivative, a data lake for staging, and various Business Vault loading tools, using business keys means different query patterns depending on which systems are involved. Sometimes you join on the business key, sometimes on the hash key, sometimes on a combination. The query logic becomes metadata-driven and harder to read. Hash keys simplify this: always one column, always the same join pattern, regardless of platform.

Binary vs. Character Hash Values

Once you’ve decided to use hash keys, the next question is storage format: character (32 chars for MD5, 40 for SHA-1) or binary (16 or 20 bytes respectively). Binary is half the size, joins faster, and produces smaller join conditions in the dimensional layer. These are all genuine advantages, especially when materializing data into OLAP cubes or columnar tools like Qlik Sense.

The reason most projects still use character-based hash values is tool compatibility. Strings are universally supported. Binary data types are not — many real-time processing tools, data mining platforms, and AI/ML frameworks work with basic data types only. If an external script, a RapidMiner workflow, or a streaming processor needs to write into the Business Vault, a binary hash key may not be supported without explicit conversion logic.

The practical recommendation: use character-based hash values in the Raw Data Vault and Business Vault for maximum compatibility. In the Information Mart, if the data is being materialized into a tool that benefits from smaller keys — an OLAP cube, a QlikView dataset — convert to binary in the view layer. That keeps the core model flexible while capturing the storage and join benefits where they actually matter.
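The size difference and the conversion step can be sketched in a few lines. This is only an illustration in Python; in a real Information Mart the conversion would happen in the view layer using whatever hex-to-binary function the platform provides.

```python
import hashlib

# Character representation: 32 hexadecimal characters for MD5.
char_hash = hashlib.md5("C-1001".encode("utf-8")).hexdigest()

# Binary representation: the same value in 16 raw bytes.
binary_hash = bytes.fromhex(char_hash)

# The conversion is lossless, so the character form can always be restored.
restored = binary_hash.hex()

print(len(char_hash), len(binary_hash))  # 32 16
```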

Hashdiffs on Modern Platforms: Still Worth It

A related question is whether hashdiffs are still valuable on column-based platforms like Snowflake, where column compression already reduces redundant data significantly. The answer is yes, and the reason is about how compute is distributed across loads rather than the cost of a single load.

The hashdiff is calculated when a record is first loaded into a Satellite. On subsequent loads, the comparison is between the freshly calculated staging hashdiff and the already-materialized Satellite hashdiff — which was computed during a previous load, not the current one. This means the compute cost of delta detection is spread across the load history: roughly half the work happens in prior loads, and the current load only handles the staging side. Over time, especially on high-volume Satellites with relatively low change rates, this distribution of compute is a meaningful performance gain.

Column-by-column comparison without a hashdiff moves all of that computation into the current load and requires fetching additional column pages for each comparison on column-based storage. The hashdiff collapses the entire comparison into a single column join, which scales much better as Satellite width and data volume grow. This is why tools like datavault4dbt no longer offer hashdiff as an optional feature — it’s simply on by default, because the performance case is consistent enough that disabling it isn’t worth the option overhead.
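A rough sketch of hashdiff-based delta detection follows. The column list, the sample rows, and the helper names `hashdiff` and `detect_deltas` are hypothetical; the point is that the Satellite side of the comparison is a stored value, and only the staging side is computed during the current load.

```python
import hashlib

DESCRIPTIVE_COLUMNS = ["name", "city", "segment"]


def hashdiff(record: dict, columns: list[str]) -> str:
    """Hash all descriptive columns of a record into a single comparison value."""
    values = ["" if record.get(c) is None else str(record[c]).strip() for c in columns]
    return hashlib.md5("||".join(values).encode("utf-8")).hexdigest()


# Hashdiffs already materialised in the Satellite by earlier loads.
satellite_current = {
    "HK1": hashdiff({"name": "Ada", "city": "Hannover", "segment": "B2B"}, DESCRIPTIVE_COLUMNS),
    "HK2": hashdiff({"name": "Grace", "city": "Berlin", "segment": "B2C"}, DESCRIPTIVE_COLUMNS),
}

staging_rows = [
    {"hk": "HK1", "name": "Ada", "city": "Hannover", "segment": "B2B"},   # unchanged
    {"hk": "HK2", "name": "Grace", "city": "Munich", "segment": "B2C"},   # changed city
    {"hk": "HK3", "name": "Linus", "city": "Hamburg", "segment": "B2B"},  # new entity
]


def detect_deltas(rows: list[dict], current: dict[str, str]) -> list[str]:
    """Return hash keys whose staged hashdiff differs from the stored one."""
    deltas = []
    for row in rows:
        staged = hashdiff(row, DESCRIPTIVE_COLUMNS)  # computed in the current load
        if current.get(row["hk"]) != staged:         # single-value comparison
            deltas.append(row["hk"])
    return deltas


print(detect_deltas(staging_rows, satellite_current))  # ['HK2', 'HK3']
```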

The Case for Staying with Hash Keys

Modern MPP platforms do reduce some of the traditional arguments for hash keys — join performance on business keys is no longer the clear-cut problem it was on row-based on-premise systems. But hash keys still deliver consistent advantages that matter in real projects: single-column join conditions that work the same way everywhere, independence from loading order, full compatibility across distributed environments, and a query pattern simple enough to generate automatically from metadata.

For teams building on Databricks, Snowflake, Fabric, or any other modern platform, hash keys remain the recommended approach. Not because the alternatives are impossible, but because the consistency and operational simplicity they provide across varied tool stacks and deployment patterns is worth more than the marginal gains from switching.

To explore hash key design, hashdiff patterns, and the full Data Vault modeling approach in depth, check out our Data Vault 2.1 Training & Certification. And for a solid introduction to the core concepts, the Data Vault Handbook is available as a free physical copy or ebook.


How to Define SCD Type 2 Dimension Keys in a Data Vault Solution

SCD Type 2 Dimension Keys in Data Vault: Hash Keys, Sequences, and the PIT Table

Defining dimension keys in a Data Vault solution is one of those topics that seems straightforward until you get to Type 2 dimensions — and then the options multiply quickly. Should you use hash keys or sequences? Where do Type 2 keys come from, and how do they connect back to your facts? This post walks through the full picture, from the simplest Type 1 case all the way to the Dimension Hash Key pattern used for Type 2 slowly changing dimensions.



SCD Type 2 Dimension Keys: Starting with the Simple Case

For Type 0 and Type 1 dimensions — dimensions without history — the dimension key question is easy. Every Hub already contains exactly one hash key per business entity, and every Link contains one hash key per relationship. These Type 1 hash keys are already present throughout your model: in Non-Historized Links, Dependent Child Links, and Bridge Tables. You can use them directly as dimension keys in your view layer without generating anything new. It’s the lowest-effort, highest-compatibility option.

Hash keys also have a significant advantage over sequences in distributed environments. If your facts live in the cloud and your dimensions are generated on-premise, you can’t easily synchronize integer sequences between systems — the lookup dependencies alone make it impractical. Hash keys don’t have this problem. Hashing the same business key on two different systems produces the same hash value. A distributed Information Mart works cleanly with hash keys; with sequences, it becomes a coordination problem.

For more on how hash keys work in Data Vault and why they’re designed the way they are, the Scalefree blog covers the topic in depth.

When Sequences Make Sense — and How to Generate Them

The case for sequences is primarily storage. An MD5 hash value stored as a character string takes 32 bytes; a SHA-1 takes 40. A big integer takes 8 bytes. If storage is a genuine concern, converting character-based hash values to binary in the view layer is the first option to consider — it cuts the size in half with minimal effort and no structural changes.

If you still want integer sequences after that, there are two places to generate them. You can add a sequence column directly to the Hub or Link structure, used purely as a downstream dimension key rather than as an identifier. This works but creates a conceptual tension: after spending effort explaining why sequences aren’t used as Hub identifiers, reintroducing them in the same structure is confusing for anyone reading the model.

The cleaner approach is a Computed Satellite in the Business Vault, attached to the Hub or Link, that generates a new sequence value for every new record in the parent. It’s a simple business rule — new parent record, new sequence — and it keeps the sequence generation in the layer designed for computed values. The trade-off is an additional join when consuming the sequence downstream, but the design is explicit and the logic is easy to understand and maintain.
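The business rule behind such a Computed Satellite could look roughly like this. The function name `assign_sequences` and the in-memory dictionaries are stand-ins for illustration; in practice this would be a load pattern on the platform itself.

```python
from itertools import count


def assign_sequences(parent_hash_keys: list[str], existing: dict[str, int]) -> dict[str, int]:
    """Give every parent record without a sequence the next free integer."""
    next_value = count(max(existing.values(), default=0) + 1)
    result = dict(existing)  # sequences that were already assigned never change
    for hash_key in parent_hash_keys:
        if hash_key not in result:
            result[hash_key] = next(next_value)  # new parent record, new sequence
    return result


existing_sequences = {"HK_A": 1, "HK_B": 2}
hub_hash_keys = ["HK_A", "HK_B", "HK_C", "HK_D"]

print(assign_sequences(hub_hash_keys, existing_sequences))
# {'HK_A': 1, 'HK_B': 2, 'HK_C': 3, 'HK_D': 4}
```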

The Type 2 Challenge: Why Hub Hash Keys Aren’t Enough

Type 1 hash keys work for dimensions without history because the granularity is one row per business entity. Type 2 dimensions need finer granularity — one row per business entity per version over time. The hash key from the Hub doesn’t capture that; it’s the same value regardless of when you’re looking at the data.

What you need for a Type 2 dimension is a key that is unique not just per entity but per entity per point in time. In Data Vault, that key already exists — it’s generated as part of the PIT Table.

The Dimension Hash Key from the PIT Table

When producing a Type 2 dimension, you need a PIT Table anyway, because it provides the snapshot-based granularity that drives the dimension’s history. The PIT Table’s alternate key is the combination of the parent’s hash key and the snapshot date. Its primary key is a separate hash value computed from two inputs: the parent’s business key (not the hash key: never hash a hash) and the snapshot date.

At Scalefree, this value is called the Dimension Hash Key. It is unique per row in the PIT Table, which means it is unique per entity per point in time — exactly what a Type 2 dimension key needs to be. This Dimension Hash Key becomes the primary key of your Type 2 dimension and the foreign key that your fact entities need to reference in order to join to the correct dimension member at the correct point in time.
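A minimal sketch of how such a key could be derived is shown below. The function name, the `||` delimiter, the normalisation, and the ISO date format are assumptions for illustration; an automation tool will have its own conventions.

```python
import hashlib
from datetime import date


def dimension_hash_key(business_key: str, snapshot_date: date) -> str:
    """Hash the business key together with the snapshot date (not the Hub hash key)."""
    payload = f"{business_key.strip().upper()}||{snapshot_date.isoformat()}"
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Same entity, two snapshots: two distinct Type 2 dimension keys.
key_day_1 = dimension_hash_key("C-1001", date(2024, 1, 1))
key_day_2 = dimension_hash_key("C-1001", date(2024, 1, 2))

print(key_day_1 != key_day_2)  # True: different snapshot date, different key
```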

Connecting Facts to Type 2 Dimensions

The remaining challenge is on the fact side. Bridge Tables and Non-Historized Links — the typical foundations for fact entities — contain Type 1 hash keys from Hubs and Links, not Type 2 Dimension Hash Keys. So how does a fact row know which Type 2 dimension member to reference?

The solution is a join through the PIT Table’s alternate key inside the fact view. A Bridge Table typically contains the Type 1 hash key from the relevant Hub and a snapshot date. Those two values together form the alternate key of the PIT Table. Inside the fact view, you join the Bridge Table to the PIT Table using the hash key and snapshot date, retrieve the Dimension Hash Key from the PIT Table’s primary key, and surface that as the dimension reference in the fact entity.

The result: the fact entity contains a single column — the Dimension Hash Key — that points to exactly one Type 2 dimension member. The dashboard tool and end users never need to know how it was derived. The join logic is handled in the view layer, the keys match between fact and dimension, and the relationship resolves cleanly. This is the preferred approach rather than exposing a composite key (hash key plus snapshot date) from the fact side, which would complicate the dimensional model unnecessarily.
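The lookup inside the fact view can be sketched as a join on hash key plus snapshot date. The table contents, column names, and the placeholder key values such as `DHK_A` are hypothetical; in a real solution this would be a SQL join in the view layer.

```python
from datetime import date

# PIT Table indexed by its alternate key: (Hub hash key, snapshot date).
pit_index = {
    ("HK_CUST_1", date(2024, 1, 1)): "DHK_A",
    ("HK_CUST_1", date(2024, 1, 2)): "DHK_B",
}

# Bridge Table rows carry the Type 1 hash key and a snapshot date.
bridge_rows = [
    {"customer_hk": "HK_CUST_1", "snapshot_date": date(2024, 1, 1), "amount": 80.0},
    {"customer_hk": "HK_CUST_1", "snapshot_date": date(2024, 1, 2), "amount": 120.0},
]


def build_fact_rows(bridge: list[dict], pit: dict) -> list[dict]:
    """Replace (hash key, snapshot date) with the single Dimension Hash Key."""
    facts = []
    for row in bridge:
        dimension_key = pit.get((row["customer_hk"], row["snapshot_date"]))
        facts.append({"customer_dimension_hk": dimension_key, "amount": row["amount"]})
    return facts


for fact in build_fact_rows(bridge_rows, pit_index):
    print(fact)
```

The fact rows expose only one key column per dimension, and the composite lookup stays hidden in the view.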

For teams using datavault4dbt premium, PIT Table generation and the Dimension Hash Key pattern are handled through the automation framework, which significantly reduces the manual effort involved in implementing this correctly at scale.

Putting It Together: Key Decisions for Dimension Keys

To summarize the decision framework: for Type 0 and Type 1 dimensions, use the Type 1 hash keys from Hubs and Links directly — they’re already available throughout the model and work cleanly in distributed environments. If storage is a concern, convert to binary hash values in the view layer before considering sequences. If sequences are genuinely required, generate them in a Computed Satellite in the Business Vault rather than embedding them in Hub or Link structures.

For Type 2 dimensions, use the Dimension Hash Key from the PIT Table as the primary key of the dimension. Connect facts to Type 2 dimensions by joining the Bridge Table or Link to the PIT Table’s alternate key inside the fact view, surfacing the Dimension Hash Key as the dimension reference. This keeps the dimensional model clean, the keys stable, and the join logic encapsulated where it belongs.

To go deeper on PIT Tables, dimension modeling, and the full Data Vault delivery layer, explore our Data Vault certification program. And for a concise introduction to the core concepts, the Data Vault Handbook is available as a free physical copy or ebook.


dbt Fusion: The Next Generation of dbt Execution

dbt Fusion

dbt is evolving rapidly, and with the introduction of dbt Fusion, data teams are entering a new era of performance, efficiency, and intelligence. Built from the ground up, dbt Fusion represents a fundamental shift in how dbt projects are executed, validated, and optimized.

In this article, we’ll explore what dbt Fusion is, why it matters, and how its core capabilities—dialect-aware validation and state-aware orchestration—are changing the way modern data platforms operate.



What is dbt Fusion?

dbt Fusion is a next-generation execution engine for dbt, designed to overcome the limitations of dbt Core and unlock new capabilities for data teams. Rather than incrementally improving the existing engine, dbt Labs rebuilt the execution layer entirely.

One of the most important differences lies in its foundation: dbt Fusion is written in Rust, while dbt Core is built in Python. This change enables significantly better performance, especially for large-scale projects with complex dependency graphs.

But performance is only part of the story.

dbt Fusion introduces a native understanding of SQL across multiple dialects, allowing it to analyze queries more deeply than ever before. This enables advanced features like early error detection, improved lineage tracking, and smarter orchestration.

Importantly, dbt Fusion is designed to support the full dbt Core framework. Most existing dbt projects can run on Fusion with minimal changes, making adoption straightforward for many teams.

Note: Deprecated dbt Core functionality is not supported.

Why dbt Fusion Matters

dbt Fusion introduces two major innovations that directly impact day-to-day data work:

  • Dialect-aware SQL validation
  • State-aware orchestration

Together, these features significantly improve developer productivity, reduce execution time, and lower compute costs.

Dialect-Aware SQL Validation

Static SQL Analysis

One of the most powerful capabilities of dbt Fusion is its ability to perform static SQL analysis. Instead of simply rendering SQL and sending it to the data warehouse, Fusion builds a logical execution plan for every query during compilation.

This means that SQL correctness can be validated before any warehouse resources are used. As a result, many errors are caught early in the development process rather than during execution.

Handling Introspective Models

Not all SQL can be fully analyzed ahead of time. Some models rely on database-dependent macros, often referred to as introspective macros. Examples include:

  • get_column_values
  • star
  • unpivot

In these cases, dbt Fusion may defer part of the validation to the database itself, since the final structure depends on runtime information.

Why This Matters

Dialect-aware validation provides several key benefits:

  • Early error detection: Catch issues before execution
  • Improved developer experience: Faster feedback in the IDE
  • Precise column-level lineage: Better understanding of data flow
  • Foundation for advanced features: Enables orchestration and optimization

In practice, this means fewer failed runs, faster debugging, and more confidence in your transformations.

State-Aware Orchestration

The second major innovation in dbt Fusion is state-aware orchestration, which fundamentally changes how dbt jobs are executed.

Build Only What Changed

Traditionally, dbt runs rebuild models even if nothing has changed. dbt Fusion eliminates this inefficiency by detecting changes in both code and upstream data.

If no changes are detected, the model is skipped and the existing version is reused.

This results in:

  • Faster execution times
  • Reduced compute usage
  • Lower cloud costs
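The general idea of skipping unchanged models can be sketched as a fingerprint comparison. This is a conceptual illustration only, not how dbt Fusion is implemented: the functions `fingerprint` and `plan_run`, the version strings, and the state dictionary are all hypothetical, and the sketch ignores propagation of changes to downstream models.

```python
import hashlib


def fingerprint(sql: str, upstream_versions: dict[str, str]) -> str:
    """Combine model code and upstream data versions into one comparable value."""
    parts = [sql] + [f"{name}={version}" for name, version in sorted(upstream_versions.items())]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def plan_run(models: dict, state: dict) -> tuple[list[str], list[str]]:
    """Split models into those to build and those to skip, updating the stored state."""
    to_build, skipped = [], []
    for name, (sql, upstream) in models.items():
        current = fingerprint(sql, upstream)
        if state.get(name) == current:
            skipped.append(name)   # neither code nor upstream data changed
        else:
            to_build.append(name)
            state[name] = current  # remember what was built
    return to_build, skipped


models = {
    "stg_orders": ("select * from raw.orders", {"raw.orders": "v1"}),
    "stg_customers": ("select * from raw.customers", {"raw.customers": "v1"}),
}
state: dict[str, str] = {}

first_run = plan_run(models, state)    # everything is new, so everything builds
second_run = plan_run(models, state)   # nothing changed, so everything is skipped

# New data arrives upstream of one model only.
models["stg_orders"] = ("select * from raw.orders", {"raw.orders": "v2"})
third_run = plan_run(models, state)

print(first_run, second_run, third_run, sep="\n")
```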

Shared Model State

dbt Fusion maintains a shared, real-time state at the model level. All jobs within the same environment can read and write to this shared state.

This allows dbt to determine whether a model has already been built and whether rebuilding it would produce a different result.

Concurrent Job Handling

In modern data platforms, multiple jobs often run at the same time. dbt Fusion is designed to handle this safely and efficiently.

It avoids unnecessary duplication by:

  • Preventing warehouse collisions
  • Reusing models across concurrent jobs
  • Ensuring consistency across executions

Works Out of the Box

One of the strengths of dbt Fusion is its ease of use. State-aware orchestration works automatically in most cases, without requiring additional configuration.

For advanced use cases, teams can still fine-tune behavior with more granular controls.

Efficient Testing (Beta)

dbt Fusion also introduces efficient testing, a feature currently in beta that optimizes how tests are executed.

Key improvements include:

  • Test result reuse: Avoid rerunning tests when results are unchanged
  • Query aggregation: Combine multiple tests into a single query
  • Reduced warehouse load: Lower compute costs

This makes testing faster and more cost-efficient, especially in large projects with extensive test coverage.
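To illustrate what query aggregation means in principle, the sketch below folds several not-null checks on one table into a single statement. It is an assumption-laden illustration, not the SQL that dbt Fusion generates; the helper name and the table and column names are invented.

```python
def aggregate_not_null_tests(table: str, columns: list[str]) -> str:
    """Build one query that counts NULLs for several columns in a single table scan."""
    checks = ",\n".join(
        f"  SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS {column}_null_count"
        for column in columns
    )
    return f"SELECT\n{checks}\nFROM {table}"


query = aggregate_not_null_tests("customer_h", ["hk_customer_h", "customer_id"])
print(query)
```

Running each check separately would scan the table once per test; the combined query scans it once for all of them.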

Performance and Cost Benefits

By combining Rust-based execution, advanced SQL analysis, and intelligent orchestration, dbt Fusion delivers measurable improvements:

  • Significantly faster runtimes
  • Reduced warehouse usage
  • Lower infrastructure costs
  • Improved developer productivity

For organizations managing complex data pipelines, these benefits can translate into substantial operational savings.

Compatibility with dbt Projects

dbt Fusion is designed to integrate seamlessly with existing dbt workflows.

Most projects can be migrated without major changes, as Fusion supports the core dbt framework. However, teams should be aware that deprecated features from dbt Core are not supported.

This makes it important to review and modernize older projects before transitioning.

Current State of dbt Fusion

At the time of writing, dbt Fusion is still in preview. While its capabilities are already impressive, some features may evolve as the engine matures.

Organizations considering adoption should monitor updates and test Fusion in controlled environments before full deployment.

Conclusion

dbt Fusion represents a major step forward in the evolution of dbt. By rethinking the execution engine from the ground up, it introduces powerful capabilities that go beyond incremental improvements.

With dialect-aware SQL validation, state-aware orchestration, and efficient testing, data teams can build pipelines that are not only faster, but also smarter and more cost-effective.

As the modern data stack continues to evolve, dbt Fusion is positioned to play a key role in shaping the future of analytics engineering.


Meet the Speaker

Dmytro Polishchuk
Senior BI Consultant

Dmytro Polishchuk has 7 years of experience in business intelligence and works as a Senior BI Consultant for Scalefree. Dmytro is a proven Data Vault 2.0 expert and has excellent knowledge of various (cloud) architectures, data modeling, and the implementation of automation frameworks. Dmytro excels in team integration and structured project work. Dmytro has a bachelor’s degree in Finance and Financial Management.

Using BEAM to Accelerate Data Vault Implementation

BEAM — Business Event Analysis and Modeling — has been around for a long time, but it doesn’t come up often in Data Vault conversations. That’s a missed opportunity, because the two methodologies are more aligned than most practitioners realize. This post explores how BEAM and Data Vault complement each other, where BEAM fits in the project timeline, and why using BEAM as a starting point can make your Data Vault modeling faster, more business-aligned, and easier to communicate across teams.



BEAM and Data Vault: A Natural Alignment

BEAM is a business modeling methodology focused on understanding and documenting what actually happens in an organization. Rather than starting from data structures or technical schemas, BEAM starts from business events: a customer places an order, a payment is processed, a product is shipped. Each event is analyzed through what BEAM calls the 7 Ws — who, what, when, where, why, how, and how many or how much. The goal is a complete, business-driven understanding of the processes, entities, and relationships that drive the organization.

When you lay that alongside the core concepts of Data Vault 2.0, the structural similarities are hard to miss. Data Vault models three fundamental things: business keys (captured in Hubs), relationships between business entities (captured in Links), and descriptive context (captured in Satellites). BEAM produces exactly those three things — business entities, relationships, and context — expressed in business language rather than technical schema.

The mapping is direct: BEAM entities become Hubs. BEAM relationships and events become Links. BEAM descriptive context becomes Satellite payloads. The conceptual model that BEAM produces translates naturally into the physical Data Vault model that will implement it.
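As a first-draft illustration of that mapping, the sketch below turns a simplified BEAM event into candidate Data Vault structures. The `BeamEvent` class, its fields, and the `hub_`, `link_`, and `sat_` naming prefixes are hypothetical simplifications, and the output is a starting point for modeling rather than a finished design.

```python
from dataclasses import dataclass


@dataclass
class BeamEvent:
    name: str            # the business event, e.g. an order being placed
    entities: list[str]  # participants identified by the "who", "what", "where" questions
    details: list[str]   # descriptive context such as "how many" or "why"


def to_data_vault(event: BeamEvent) -> dict:
    """Derive candidate Hubs, a Link, and a Satellite from one BEAM event."""
    return {
        "hubs": [f"hub_{entity}" for entity in event.entities],
        "link": f"link_{event.name}",
        "satellite": {"name": f"sat_{event.name}", "payload": list(event.details)},
    }


order_placed = BeamEvent(
    name="order_placed",
    entities=["customer", "product", "store"],
    details=["quantity", "order_channel", "discount_reason"],
)

candidate_model = to_data_vault(order_placed)
print(candidate_model)
```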

Where BEAM Fits in the Project Lifecycle

BEAM typically happens before the data warehouse work begins — it’s a business analysis and modeling activity, not a technical one. Teams use it to answer the foundational questions: what processes exist in the business, what events drive those processes, what entities are involved, and how are they related? This is exactly the kind of understanding that Data Vault modeling requires, and it’s often the hardest part of starting a new implementation.

Without this upfront business understanding, Data Vault projects tend to become purely data-driven: modelers look at source tables, identify columns, and build Hubs and Satellites based on what the data looks like rather than what the business actually means. The result is technically valid but often misses the business semantics — relationships that should be Links end up embedded in Satellites, business concepts that deserve their own Hub get collapsed into another entity, or important events go unmodeled because they weren’t visible in the source data at first glance.

A BEAM model built with stakeholders from across the business gives the Data Vault team a map before they start navigating. It surfaces hidden relationships, clarifies which entities are truly distinct business concepts, and creates a shared vocabulary between business users and technical implementers. For teams building an enterprise data warehouse, that shared vocabulary is often as valuable as the model itself.

Translating BEAM to Data Vault: What to Watch For

The translation from BEAM to Data Vault is not mechanical. A one-to-one mapping from a BEAM model to a Data Vault schema without looking at the actual source data will create problems. Business models describe how things should work; source data reflects how things actually work — and those two realities frequently diverge.

A BEAM model might show a clean customer-order-product event with well-defined identifiers. The source data might deliver that same event across three systems with different keys, inconsistent structures, and occasional nulls where the business model assumed complete data. The BEAM model is the target to aim for; the source data is the reality to model from. Both perspectives are necessary.

The practical approach is to use the BEAM model as a starting point and then validate it against the actual data. Does the business key identified in the BEAM model exist in the source? Is it unique? Are the relationships the BEAM model describes actually present as foreign keys, or do they need to be inferred? Does the granularity of the source data match the granularity of the BEAM event? These questions require looking at real data, not just the business model.
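The first two of those checks, whether the business key exists and whether it is unique, can be sketched as a small profiling function. The function name, the column name, and the sample rows are made up for illustration; real profiling would run against the source system or staging area.

```python
from collections import Counter


def profile_business_key(rows: list[dict], key_column: str) -> dict:
    """Report missing and duplicated values for a candidate business key."""
    values = [row.get(key_column) for row in rows]
    present = [value for value in values if value not in (None, "")]
    duplicates = sorted(value for value, n in Counter(present).items() if n > 1)
    return {
        "missing": len(values) - len(present),
        "duplicates": duplicates,
        "is_unique": not duplicates,
    }


source_rows = [
    {"customer_number": "C-1001", "name": "Ada"},
    {"customer_number": "C-1002", "name": "Grace"},
    {"customer_number": "C-1002", "name": "Grace H."},  # same key delivered twice
    {"customer_number": None, "name": "Unknown"},       # key missing in the source
]

print(profile_business_key(source_rows, "customer_number"))
# {'missing': 1, 'duplicates': ['C-1002'], 'is_unique': False}
```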

This is also where tools like datavault4dbt become relevant — once the BEAM-to-Data Vault translation is validated against the source data, automation tools can significantly accelerate the physical implementation, turning a well-defined model into deployable code much faster than manual development.

BEAM as a Bridge Between Business and IT

One of the persistent challenges in data warehouse projects is the gap between what business stakeholders need and what technical teams build. Business users describe their world in terms of events, customers, products, and transactions. Technical teams describe it in tables, columns, joins, and load patterns. These vocabularies don’t naturally translate, and the gap is where requirements get lost.

BEAM and Data Vault together help close that gap. BEAM produces a model that business users can understand and validate — it speaks their language. Data Vault implements that model in a way that is technically rigorous, scalable, and auditable. When both sides can see their perspective reflected in the same project, alignment improves and the risk of building something technically correct but business-irrelevant decreases.

The 7 Ws framework that BEAM uses to analyze events also maps well to the questions a Data Vault modeler asks when building Links: who are the participants in this relationship, what happened, when, where, and under what conditions? These aren’t just modeling questions — they’re the questions that produce a model business users recognize as a reflection of their actual processes.

Practical Takeaways

BEAM and Data Vault are not competing methodologies — they operate at different levels of the project. BEAM works at the business understanding level, producing a clear picture of events, entities, and relationships from the business perspective. Data Vault works at the technical implementation level, structuring that understanding into a scalable, auditable physical data model.

Used together, they create a stronger foundation than either provides alone. BEAM accelerates the modeling phase by giving the Data Vault team a validated business context to work from. Data Vault gives the BEAM model a rigorous technical home. The combination shortens the distance between business requirements and implemented data structures, reduces rework caused by misunderstood requirements, and produces a model that both sides of the organization can engage with.

If you’re starting a new Data Vault implementation or looking to improve alignment between your business and technical teams, considering BEAM as part of your discovery and modeling process is worth the investment. And to go deeper on Data Vault modeling patterns — including how to translate business concepts into Hubs, Links, and Satellites — our Data Vault 2.1 Training & Certification covers the full methodology. The Data Vault Handbook is also available as a free physical copy or ebook for a solid introduction to the core concepts.


Orchestration of Agentic Workflows

The Shift from Prompts to Autonomous Systems

For years, organizations have focused on mastering “prompt engineering”, the art of writing precise instructions to extract useful outputs from Large Language Models (LLMs). While highly effective for simple, singular tasks, the prompt-based approach has inherent limitations when faced with complex, multi-step business problems.

The next paradigm shift in enterprise AI is the move toward Agentic Workflows.

An “Agent” is more than just an LLM. It is an autonomous or semi-autonomous system that combines reasoning capability (the LLM) with access to tools, memory, and the ability to act on its environment. Instead of answering a question, an agent performs a role, acting as an analyst, a software engineer, or a project manager, handling sequential professional tasks until a goal is achieved.

Orchestration of Agentic Workflows

Master the art of building multi-step autonomous systems by integrating the LangChain ecosystem with powerful tools like Zapier. This session provides a practical roadmap for evolving from simple prompts to sophisticated, coordinated architectures that execute complex professional tasks with ease. Learn more in our upcoming webinar on April 21st, 2026!

Why Agents Require Orchestration

The premise of agentic workflows is powerful, but deployment is difficult. In a complex scenario, you may need a system to:

  1. Analyze a business request.
  2. Search a database.
  3. Process results.
  4. Consult a second specialized agent (e.g., a “Coder Agent”).
  5. Revise the plan based on output and finally provide a summary.

Without proper coordination, this series of steps breaks down. The model might hallucinate a tool execution, forget crucial data from step one by step four, or enter an endless loop of unhelpful actions.

Orchestration is the framework that manages this complexity. It is the conductor of the agentic orchestra, defining how different agents, tools, and memory systems interact, ensuring reliability, traceability, and successful execution of the business objective.

Anatomy of an Agentic Stack

To build a reliable orchestrator for autonomous systems, your architecture must unite three fundamental components:

  • Intelligence Layer (The Brain): The reasoning core, usually an LLM, capable of taking input, breaking it into smaller tasks, and evaluating progress.
  • Action Layer (The Tools): A library of external integrations, such as databases, web scrapers, computational engines, and business APIs, that the agent can use to gather real-world data or execute actions.
  • Coordination Layer (The Orchestrator): The logic that manages state, standardizes how agents exchange data, handles errors, and ensures loops are terminated when goals are met.
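The three layers can be sketched in plain Python. This is a minimal illustration and not a production orchestrator: the tool names, the `State` class, and the rule-based `plan_next_step` function (a stand-in for an LLM call) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Action layer: a hypothetical tool registry of plain functions keyed by name.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search_database": lambda query: f"3 rows matching '{query}'",
    "summarize": lambda text: f"Summary: {text}",
}

@dataclass
class State:
    """Shared memory that the orchestrator carries from step to step."""
    request: str
    history: List[Tuple[str, str]] = field(default_factory=list)
    done: bool = False

def plan_next_step(state: State) -> Tuple[str, str]:
    """Intelligence layer stand-in; a real system would call an LLM here."""
    if not state.history:
        return "search_database", state.request
    if len(state.history) == 1:
        return "summarize", state.history[-1][1]
    return "finish", ""

def run(request: str, max_steps: int = 5) -> State:
    """Coordination layer: keeps state, validates tool names, bounds the loop."""
    state = State(request=request)
    for _ in range(max_steps):
        tool_name, tool_input = plan_next_step(state)
        if tool_name == "finish":
            state.done = True
            break
        if tool_name not in TOOLS:
            # Guard against a hallucinated tool call instead of executing it.
            state.history.append((tool_name, "error: unknown tool"))
            continue
        state.history.append((tool_name, TOOLS[tool_name](tool_input)))
    return state

result = run("open invoices for customer 42")
```

The `max_steps` bound and the tool-name check are the two places where the coordination layer prevents endless loops and hallucinated tool executions in this sketch.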

Tools of the Trade: Navigating the Lang Ecosystem

As organizations move from proof-of-concept to production, the ecosystem of framework tools is rapidly evolving. The “Lang” suite has emerged as a particularly dominant force in defining how agents are built and orchestrated. During our workshop, we will explore several critical tools within this stack:

LangChain

While often used for simple prompt chaining, LangChain’s core contribution to agentic architecture is standardizing integration and chain creation. It provides the interface to connect the LLM to dozens of external systems. Crucially, it allows us to define custom “tools” for the agent. These are specialized, user-created functions that give the agent specific capabilities, such as querying a proprietary data warehouse or executing an internal Python script. By wrapping these functions in LangChain’s tool abstraction, the agent can autonomously decide when and how to invoke them to solve complex problems.

LangGraph

Managing complex agentic workflows requires a different mental model: graphs. LangGraph extends LangChain by allowing developers to model agentic flows as stateful graphs. Unlike Directed Acyclic Graphs (DAGs), these graphs may contain cycles. This is crucial for systems that require robust loops, cyclical processes, and complex state management, ensuring that “Agent A” always knows what state “Agent B” left the system in.
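To make the idea of a stateful graph with a loop concrete, here is a small sketch in plain Python. It deliberately does not use the LangGraph API; the node names, the routing function, and the approval rule are assumptions chosen only for illustration.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]

def draft(state: State) -> State:
    # Each pass through this node produces a new revision.
    return {**state, "revision": state["revision"] + 1}

def review(state: State) -> State:
    # Hypothetical acceptance rule: approve from the third revision onwards.
    return {**state, "approved": state["revision"] >= 3}

def route_after_review(state: State) -> str:
    # Conditional edge: loop back to "draft" until the review approves.
    return "END" if state["approved"] else "draft"

NODES: Dict[str, Callable[[State], State]] = {"draft": draft, "review": review}
EDGES: Dict[str, Callable[[State], str]] = {
    "draft": lambda state: "review",
    "review": route_after_review,
}

def run_graph(state: State, entry: str = "draft", max_steps: int = 20) -> State:
    """Walk the graph, passing the state from node to node until END."""
    node = entry
    for _ in range(max_steps):
        if node == "END":
            return state
        state = NODES[node](state)
        node = EDGES[node](state)
    raise RuntimeError("step limit reached before the graph terminated")

final_state = run_graph({"revision": 0, "approved": False})
```

The edge from `review` back to `draft` is a cycle, which is why a strictly acyclic model would not be enough for this kind of workflow.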

Langfuse

Orchestrating agents is messy, and you need visibility. While not officially developed by the creators of LangChain, Langfuse is an essential open-source operational companion that integrates seamlessly with the ecosystem. It provides a robust platform for debugging, testing, and monitoring agentic systems without vendor lock-in. Langfuse allows teams to “trace” the entire multi-step process, viewing every prompt, tool call, and internal decision, making it possible to identify bottlenecks, reduce costs, and debug failures in production.

Complementary Orchestration Tools

While the Lang ecosystem excels at managing LLM logic, a true enterprise solution often requires integration with generalized orchestration and automation tools (like Zapier or n8n). These tools excel at managing event triggers, parallel processes, and standard API interactions that do not require LLM reasoning, complementing the Lang stack in a complete enterprise architecture.

Final Thoughts

Moving from single prompts to coordinated, agentic systems is a necessary evolutionary step for organizations aiming to unlock true operational efficiency with AI. Mastery of these systems requires shifting your perspective from “engineering a prompt” to “engineering a system.”

Want to see how this works in practice?

This article provides a conceptual blueprint of agentic workflows and the essential role of orchestration. To gain hands-on experience in building these systems, we invite you to join our upcoming webinar on the Orchestration of Agentic Workflows. During the session, we will demonstrate how to build multi-step autonomous systems by integrating these platforms into a single architecture, providing a practical guide for moving from simple prompts to coordinated AI systems that handle professional tasks.

Data Sovereignty: The Sovereign Data Platform as the Path to Independent Data and Secure AI

Data sovereignty is often dismissed as a purely political buzzword or as a mere compliance task, for example in the context of the GDPR and the AI Act. In reality, it is a hard economic necessity. In an era in which data is no longer just visualized in dashboards but serves as the foundation for automated business processes and artificial intelligence, your own infrastructure becomes a strategic bottleneck.

Organizations that hand control entirely to external technology companies at this stage lose not only independence but also innovative capacity. When data is trapped in closed systems, the vendor ultimately decides what can be connected and which AI models can be used. The path to real data sovereignty begins with the recognition that the convenient “all-in-one” promises of many cloud providers come at a high, often hidden price.

What Loss of Control Looks Like in Practice

To understand how data ownership can be regained, it helps to first look at how companies lose it in the first place. This loss of control rarely happens overnight. It is a gradual process, deeply rooted in the architecture of traditional and modern cloud data platforms.

When the decision falls on a proprietary data platform, the raw data is handed over entirely to the vendor’s system.

Proprietary Formats

To deliver the promised performance, closed platforms convert ingested data into vendor-specific, proprietary storage formats. From that moment on, the data can only be read and processed by the compute engine of that one vendor.

Lack of Interoperability

When a new, innovative solution is to be connected, such as a specialized analytics engine or third-party reporting software, or when a particular (open-source) AI model is to be used, companies often hit a wall. External tools cannot read the proprietary formats natively, or the required interface is not provided at all.

Cost Trap (“Egress Fees”)

To make the data usable for other applications, or in the worst case to switch vendors entirely, it has to be exported at considerable effort. This is where so-called egress fees (charges for data leaving the platform) add up heavily. Large cloud providers often make ingest (loading data in) very cheap but penalize export with high fees.

Loss of Pricing Power

Once historical company data is anchored in a closed system and switching costs have been artificially driven up, companies are at the mercy of the vendor’s future price increases and license changes.

In short: the company still bears full legal and business responsibility for its data but has lost direct, physical access to it. It is merely renting access to its own knowledge.

At this point, ask yourself honestly:

Do you know exactly in which format and on which infrastructure your core data resides at this moment?
And more importantly: how would you get to your data if access through your vendor’s portal suddenly stopped working tomorrow morning, or if prices were unexpectedly dictated overnight?

The Data Lakehouse and Open Standards as the Way Out

The technological way out of this dependency is a fundamental architectural realignment. Today, the answer to proprietary data silos is the data lakehouse. This architectural approach combines the flexibility of a data lake with the structure and reliability of a classic data warehouse, under one decisive premise: the consistent separation of storage and compute.

This separation allows companies to build their architecture on a best-of-breed principle:

Own Infrastructure

Instead of loading data into the systems of external service providers and locking it in there, the data remains in the company’s own cloud storage (for example Amazon S3, Azure Data Lake, or Google Cloud Storage). The company holds, in fact and in law, the only key to its own data.

Open Data Formats as the Foundation

An important lever for data sovereignty is the storage format. In a modern data lakehouse, data is stored exclusively in open-source standards such as Apache Iceberg, Hudi, or Delta Lake. These formats do not belong to any single software vendor and are not subject to proprietary licensing.

Interoperability (“Bring Your Own Engine”)

Because the company data now sits in the company’s own storage, structured and in an open format, it can be read by a wide range of processing engines (such as Databricks, Trino, Spark, and others). The decisive advantage: the data does not have to be copied or moved to do so.

The result of this architecture is real digital sovereignty. If a software vendor raises prices drastically or falls behind technologically, the compute engine can be replaced or supplemented in parallel with other tools. The valuable data foundation remains completely untouched.

No Secure AI Without a Sovereign Data Platform

This architectural independence is not only a matter of cost control but an important prerequisite for the productive and secure use of artificial intelligence. In almost every industry there is currently pressure to introduce AI-supported automation. At the same time, there is growing and justified concern about sensitive business secrets leaking to US “black-box” language models, or about making business-critical wrong decisions because of faulty AI answers (hallucinations).

A messy data foundation and closed SaaS systems systematically slow down AI initiatives here. A sovereign AI approach requires a different way of working.

Querying Instead of Embedding

Many early AI attempts fail because company data is embedded directly into language models. This not only carries massive data protection risks but also inevitably leads to dangerous hallucinations. A Large Language Model (LLM) is primarily a language tool, not a relational database.

Agentic AI on an Open-Source Basis

The solution lies in using so-called “Agentic AI” in combination with open-source language models (open-source LLMs) that are operated locally and securely in the company’s own (cloud) environment. The data never leaves the company’s infrastructure. More importantly, the AI is configured so that it does not memorize the data but acts as an intelligent agent. It uses its semantic understanding of context to issue targeted, direct queries (for example via SQL) against the open data formats of the lakehouse when needed.
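The query-instead-of-embed pattern can be illustrated with a small sketch. Everything here is an assumption for demonstration purposes: an in-memory SQLite database stands in for a lakehouse table, the table and column names are invented, and the SQL string is hard-coded where a real agent would have an LLM generate it.

```python
import sqlite3

# In-memory stand-in for a lakehouse table; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

def run_readonly_query(sql: str) -> list:
    """Tool the agent may call: runs a single SELECT and returns the rows."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    return conn.execute(sql).fetchall()

# In a real system the LLM would generate this SQL from the user's question.
generated_sql = (
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
)
answer = run_readonly_query(generated_sql)
```

The figures in `answer` come from the database, not from the model's memory, which is the point of the pattern. The prefix check is only a crude guard; a real deployment would rely on read-only database permissions.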

“Talk to Your Data” in Practice

Through the direct connection to the central data platform, the system delivers hard, verifiable facts instead of stochastically computed probabilities. This approach enables entirely new business processes: business departments without deep programming or SQL skills can interact with their data in direct dialogue. Complex analyses and reports can be automated and reliably queried in natural language.

For this dialogue between business user, AI agent, and data platform not to end in chaos, however, the AI must understand exactly how the data is structured and what it means semantically. Technology alone is not enough for this, which brings us to the often underestimated core of data sovereignty.

Data Governance: From Rulebook to Strategic Enabler

With data platforms, too, an insight that holds in many other areas proves true again and again: technology alone is no guarantee of success. A modern data lakehouse and advanced Agentic AI come to nothing if the underlying data quality is poor or the semantic meaning of the data remains unclear. At this point, data governance changes from an often unloved control instrument into a real strategic enabler.

When an AI agent has to translate user input into a precise database query, it needs more than access to tables. It needs context. Without a maintained business glossary, clear metadata, and defined responsibilities (data ownership), there is a high risk that the AI delivers results that are syntactically correct but wrong in business terms. “Garbage in, garbage out” applies more than ever in the age of artificial intelligence.

A clean governance structure solves this problem at the root:

Central Truth, Decentralized Use

Clear quality rules and defined data products create a foundation of trust. Business departments can rely on the information provided being correct, current, and legally sound.

Real Democratization

Only this trust makes self-service analytics possible. Once the guardrails of governance are in place, data can be democratized and made available securely across the entire company without the IT department having to approve every single report manually. AI results can likewise be accepted and reused without headaches about hallucinations or legal concerns.

Compliance as Standard

With a view to strict European regulations such as the GDPR or the AI Act, integrated governance ensures that access rights, anonymization, and traceability (data lineage) are anchored in the architecture from the start.

Taking internal responsibility for your data in this way creates the essential prerequisite for scalability.

How Does the Migration Succeed?

The advantages of open standards and a sovereign architecture are clear. Still, many IT leaders shy away from leaving vendor lock-in because they fear a risky, years-long major IT project. But breaking free from closed systems does not require a risky “big bang”.

Successful migration projects show in practice that the move to an open, more sovereign architecture can be agile and incremental:

Use-Case-Driven Migration

Instead of replacing the entire historical data warehouse at once, the new, open platform is built in parallel. Migration proceeds along prioritized, business-critical use cases.

Fast Return on Investment (ROI)

By first migrating the data areas that offer the highest immediate value, for example to implement new use cases that previously seemed impossible, the rebuild often pays for itself during the project.

Risk Minimization

This step-by-step approach ensures that day-to-day business (reporting and ongoing analyses) continues completely undisturbed while the future-proof foundation grows iteratively in the background.

The transition to open software and vendor-independent data formats is therefore not IT for its own sake but a plannable, low-risk investment in the company’s ability to act.

Actively Shaping Sovereignty

Only a company that fully controls the architecture, quality, and whereabouts of its data, and is aware of this responsibility, is truly sovereign. If you want to free yourself from dependency, leave expensive license models behind, and create a legally sound basis for artificial intelligence, the path inevitably leads through open standards.

Take back full responsibility for your data. Turn your IT infrastructure from a pure cost factor into the decisive competitive advantage in your industry.

As experts in big data and the development of modern data platforms, Scalefree supports European companies in following this path successfully. We plan and deliver end-to-end data and AI solutions at any scale, from strategic architecture consulting to implementation, as well as Agentic AI.

Is Your Data Ready for the Future?

Let’s examine your current architecture in a non-binding conversation. Find out how a tailored data lakehouse based on open standards can secure your data sovereignty for the long term.

Book a free initial consultation

Same-as-Links: Enterprise-Wide Deduplication Across Multiple Sources

One of the more powerful but nuanced constructs in Data Vault is the Same-as-Link (SAL). Two questions came in recently that get at the heart of how SALs work across source systems: can a Same-as-Link have multiple sources, and can it span keys from different source systems? The answers differ depending on whether you’re working in the Raw Data Vault or the Business Vault — and understanding why reveals something fundamental about how Data Vault handles enterprise-wide deduplication and integration.



Same-as-Links and Multiple Sources in the Raw Data Vault

The first question — can a Same-as-Link have multiple sources — is straightforward. Like any Link in the Raw Data Vault, a SAL can receive records from multiple source systems. Hubs consolidate business keys from different sources into the same entity, and Links do the same for relationships. As long as the relationship has the same semantic meaning and the same granularity across those sources, loading them into the same Link is valid and correct. So yes, a SAL in the Raw Data Vault can have multiple source systems contributing records to it.

The second question is more nuanced: can a SAL span keys from multiple sources — meaning one Hub reference on one side of the relationship comes from System A, and the other comes from System B?

In the Raw Data Vault, the answer is generally no — with one important exception. A core principle of Raw Data Vault loading is that each row comes from exactly one source system. Loading a single row that requires joining data from two independent source systems introduces a loading dependency: you have to wait for System A before you can load data from System B. That’s precisely the kind of tight coupling the Raw Data Vault is designed to avoid. Independent source systems should load independently.

The exception is when a single source system already knows both business keys. An ERP system, for example, might reference customers by a customer number that originates in a CRM system. The ERP system carries that key as a known reference — it’s available in a single source record without requiring a cross-system join at load time. In that case, a SAL row sourced from the ERP system can legitimately reference a business key that conceptually originates elsewhere. The single-source-per-row rule still holds; the integration happened upstream, inside the source system itself.

Same-as-Links in the Business Vault: Cross-Source Deduplication

In the Business Vault, the picture is quite different — and this is where SALs really show their value. When two independent source systems use completely different, unrelated business keys for what is actually the same real-world entity, there’s no source-level relationship to load. The Raw Data Vault captures both sets of keys in the same Hub (since they represent the same business concept), but there’s nothing in the source data to connect them.

This is where calculated Same-as-Links come in. Using descriptive data from both systems — names, addresses, contact details — fuzzy matching logic can identify that business key A from System A and business key B from System B refer to the same entity. That determination is a business rule. It belongs in the Business Vault. The result is a SAL entry that spans two business keys from completely independent source systems, calculated from the data rather than loaded from any single source.
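A calculated Same-as-Link of this kind can be sketched as follows. This is only an illustration under stated assumptions: the sample records, the MD5-based `hash_key` convention, the use of Python's `difflib.SequenceMatcher` as the fuzzy matcher, and the 0.85 threshold are all hypothetical choices, not a prescribed matching rule.

```python
import hashlib
from difflib import SequenceMatcher

def hash_key(business_key: str) -> str:
    # Assumed convention: MD5 over the trimmed, upper-cased business key.
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def similarity(a: dict, b: dict) -> float:
    # Compare lower-cased name and address strings; result lies in [0, 1].
    left = f"{a['name']} {a['address']}".lower()
    right = f"{b['name']} {b['address']}".lower()
    return SequenceMatcher(None, left, right).ratio()

# Hypothetical descriptive data from two independent source systems.
system_a = [
    {"key": "CRM-1001", "name": "Jane Miller", "address": "12 Main Street, Springfield"},
]
system_b = [
    {"key": "ERP-77", "name": "Jane Miler", "address": "12 Main St, Springfield"},
    {"key": "ERP-78", "name": "Robert Chen", "address": "9 Harbor Road, Portland"},
]

THRESHOLD = 0.85  # assumed cut-off; a real business rule would be tuned and tested

# Each SAL row references the same Hub twice: once as master, once as duplicate.
sal_rows = [
    {
        "hk_master": hash_key(a["key"]),
        "hk_duplicate": hash_key(b["key"]),
        "score": round(similarity(a, b), 2),
    }
    for a in system_a
    for b in system_b
    if similarity(a, b) >= THRESHOLD
]
```

Keeping the score on the row is optional, but it makes the matching decision auditable when the threshold or the comparison logic changes later.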

This is one of the primary use cases for Same-as-Links: not just deduplicating records within a single source system, but integrating and deduplicating entities across the enterprise. Two CRM systems, two customer databases, two product catalogs — wherever the same real-world object appears under different identifiers in different systems, a Business Vault SAL can establish the connection and enable unified reporting and analysis across all of them.

For organizations dealing with complex multi-source environments, this kind of cross-system entity resolution is one of the most tangible business value deliverables a Data Vault implementation can produce. If you’re building or evaluating an enterprise data warehouse, the SAL pattern is worth understanding deeply: it’s the mechanism that turns a collection of source-aligned Hubs into a genuinely integrated enterprise model.

Why the Raw and Business Vault Distinction Matters Here

The contrast between how SALs work in the Raw Data Vault versus the Business Vault illustrates a broader principle that runs through all of Data Vault 2.0 design: the Raw Data Vault captures what the sources deliver, as they deliver it, without interpretation. The Business Vault is where judgment, calculation, and business logic are applied.

Fuzzy matching is business logic. Deciding that two records represent the same entity is a business decision. Those decisions belong in the Business Vault — not because the Raw Data Vault can’t technically store the result, but because embedding that logic at the raw layer makes it invisible, untestable, and hard to change when the matching rules evolve.

By keeping the SAL calculation in the Business Vault, you get a clear audit trail of how the deduplication was performed, the ability to update matching logic without reloading source data, and a separation between “what the source said” and “what we believe to be true across sources.” That separation is one of the most operationally valuable properties of a well-structured Data Vault.

Practical Implications for Modeling

When modeling SALs in practice, a few things are worth keeping in mind. In the Raw Data Vault, SALs are appropriate when a single source system provides an explicit deduplication or matching relationship — a master data management export, a merge table, a golden record mapping from a source MDM system. The loading process remains clean and dependency-free.

In the Business Vault, SALs are the right tool when the matching logic needs to be calculated — whether through exact key matching across systems, probabilistic matching, fuzzy string comparison, or any other form of entity resolution. The SAL lives in the Business Vault, references the appropriate Hub twice (master and duplicate), and is populated by whatever calculation or mapping process produces the match.

In both cases, the hash keys in the SAL reference the same Hub, since by definition the master and the duplicate represent the same type of business object. This is what makes the SAL structurally elegant: it reuses existing Hub infrastructure to express an enterprise-wide identity resolution without requiring new structural entities.

To go deeper on Same-as-Links, Business Vault patterns, and enterprise integration strategies in Data Vault, explore our Data Vault 2.1 Training & Certification. And for a concise introduction to the full methodology, the Data Vault Handbook is available as a free physical copy or ebook.

Set-Based Multi-Active Satellite Derived From Record-Level Multi-Active Satellite

Multi-Active Satellites: Handling Delta Loads and Set-Based Derivation

A detailed modeling question came in recently about deriving a set-based Multi-Active Satellite from a record-level Multi-Active Satellite, and whether using the resulting Business Satellite as input for a parent PIT Table is valid Data Vault practice. The answer involves a few clarifications on terminology and a practical approach to delta loading that makes the Business Vault layer largely unnecessary for this use case. This post breaks it down.



Multi-Active Satellites: The Full Load vs. Delta Load Problem

A Multi-Active Satellite captures multiple active records per business key at the same point in time — phone numbers are the classic example. A person can have a home number, a mobile number, and a work number all active simultaneously. In the Satellite, each row is uniquely identified by the hash key, the Load Date Timestamp, and the Multi-Active Key (in this case, the phone type).

The Hashdiff in a Multi-Active Satellite is calculated across the entire group — all active records for that business key at that load date — not per individual row. This means when any record in the group changes, the Hashdiff changes for the whole group, and the entire group is re-inserted with a new Load Date Timestamp. The old group is virtually end-dated. This works cleanly when you receive full loads: every batch contains all active records, so you always have the complete group to work with.
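A group-level Hashdiff can be sketched like this. The column names, the delimiters, and the choice of MD5 are assumptions for illustration; the essential points are that all rows of the group feed one hash and that the rows are sorted first so the result does not depend on delivery order.

```python
import hashlib

def group_hashdiff(rows: list[dict]) -> str:
    """Hash all active rows of one business key as a single unit."""
    # Sort by the Multi-Active Key so row order does not affect the result.
    ordered = sorted(rows, key=lambda r: r["phone_type"])
    # Assumed delimiters: ';' between columns and '|' between rows.
    payload = "|".join(f"{r['phone_type']};{r['phone_number']}" for r in ordered)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

before = [
    {"phone_type": "home", "phone_number": "555-0100"},
    {"phone_type": "mobile", "phone_number": "555-0101"},
]
after = before + [{"phone_type": "work", "phone_number": "555-0102"}]

# Same rows in a different order give the same Hashdiff.
unchanged = group_hashdiff(before) == group_hashdiff(list(reversed(before)))
# Adding one row changes the Hashdiff for the whole group.
changed = group_hashdiff(before) != group_hashdiff(after)
```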

The challenge arises with delta loads. If the source only sends what changed — say, a new work number is added, but the home and mobile numbers are not re-sent because they didn’t change — you can’t calculate the correct group-level Hashdiff from the incoming batch alone. The group is incomplete.

Reconstructing the Full Group from Delta Loads in the Raw Data Vault

The solution is to reconstruct the full Multi-Active group before loading, without moving this logic into the Business Vault. The approach is straightforward: derive the most recent Multi-Active group from the existing Satellite, combine it with the incoming delta records, and use the resulting complete set to calculate the Hashdiff and load the Satellite as if it were a full batch for that group.

In practice, a staging table acts as the assembly point. The latest group from the Satellite is pulled into staging alongside the incoming delta. Together, they form the complete current group. From there, the standard Multi-Active Satellite loading pattern applies — the Hashdiff is calculated over the full group, a new Load Date Timestamp is assigned, and all records in the group are inserted together.
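The assembly step in staging can be sketched as an overlay of the delta on the latest stored group. The function name and column names are hypothetical, and the sketch assumes the delta only carries inserts and updates; deletions would need a separate signal from the source.

```python
def rebuild_group(latest_group: list[dict], delta: list[dict],
                  ma_key: str = "phone_type") -> list[dict]:
    """Overlay delta rows on the latest group, keyed by the Multi-Active Key."""
    merged = {row[ma_key]: row for row in latest_group}
    for row in delta:
        merged[row[ma_key]] = row  # a new key is added, an existing key is replaced
    return sorted(merged.values(), key=lambda row: row[ma_key])

# Latest group as currently stored in the Satellite.
latest_group = [
    {"phone_type": "home", "phone_number": "555-0100"},
    {"phone_type": "mobile", "phone_number": "555-0101"},
]
# Incoming delta: only the newly added work number is delivered.
delta = [{"phone_type": "work", "phone_number": "555-0102"}]

full_group = rebuild_group(latest_group, delta)
```

The resulting `full_group` is what the standard loading pattern then hashes and inserts under a new Load Date Timestamp.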

This approach handles delta loads cleanly in the Raw Data Vault, which means there’s no need for a Business PIT Table or a Business Satellite just to reconstruct the full set. The reconstruction happens at load time, not at query time.

Using the PIT Table to Manage Granularity

The second part of the question was about reducing the Multi-Active group to a single record per business key per Load Date Timestamp — storing that reduced result in a Business Satellite and using it as input for the parent PIT Table.

This is valid, but there’s a lighter alternative worth considering: handle the granularity reduction directly in the PIT Table rather than creating a dedicated Business Satellite for it.

A standard PIT Table references a Multi-Active Satellite via the hash key and Load Date Timestamp, which points to the entire group. If you want the full group available for querying, this is all you need — the PIT gives you the reference, and the join returns all active records for that timestamp.

If you only want one specific record from the group — say, just the mobile number — you add the Multi-Active Key as an additional column in the PIT Table. The PIT row then carries the hash key, the Load Date Timestamp, and the specific Multi-Active Key value you want. The join returns exactly one record. Selecting which Multi-Active record to surface is business logic, and the PIT Table is a clean place to encode it without materializing an intermediate Business Satellite.

If you need to surface multiple specific records — home number and mobile number separately — you add additional column sets to the PIT Table, one per record type. Each set carries its own hash key, Load Date Timestamp, and Multi-Active Key reference. This keeps everything in one structure and avoids unnecessary materialization.
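The difference between a two-column and a three-column PIT reference can be shown with a small in-memory sketch. The row layouts and column names such as `sat_ldts` and `sat_ma_key` are assumptions; in a real implementation this would be an equi-join in SQL.

```python
from datetime import datetime

# Hypothetical Satellite rows: (hk, ldts, phone_type) is unique.
satellite = [
    {"hk": "a1", "ldts": datetime(2024, 1, 1), "phone_type": "home", "phone_number": "555-0100"},
    {"hk": "a1", "ldts": datetime(2024, 1, 1), "phone_type": "mobile", "phone_number": "555-0101"},
]

# PIT row that points at the whole group (hash key and load date only).
pit_group = {"hk": "a1", "sat_ldts": datetime(2024, 1, 1)}
# PIT row that also carries the Multi-Active Key.
pit_single = {"hk": "a1", "sat_ldts": datetime(2024, 1, 1), "sat_ma_key": "mobile"}

def join_pit(pit_row: dict, sat_rows: list) -> list:
    """Equi-join on every reference column the PIT row carries."""
    return [
        s for s in sat_rows
        if s["hk"] == pit_row["hk"]
        and s["ldts"] == pit_row["sat_ldts"]
        and ("sat_ma_key" not in pit_row or s["phone_type"] == pit_row["sat_ma_key"])
    ]

group_rows = join_pit(pit_group, satellite)    # all active rows of the group
single_rows = join_pit(pit_single, satellite)  # only the mobile number
```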

When a Business Satellite Does Make Sense

A Business Satellite for this purpose isn’t wrong — it’s just not always necessary. If the reduced, single-record-per-key result is consumed by multiple downstream processes and materializing it improves performance or simplifies maintenance, building a Business Satellite is a reasonable choice. But if the only goal is to filter down to one record for a specific downstream view, doing it in the PIT Table is simpler and avoids creating an entity whose sole purpose is granularity reduction.

The key principle: keep the Raw Data Vault responsible for capturing the full, accurate group, and make granularity and selection decisions in the PIT Table or Business Vault based on what downstream consumption actually requires.

To go deeper on Multi-Active Satellites, PIT Table design, and the full Data Vault methodology, explore our Data Vault training and certification programs. The free Data Vault handbook is also available as a physical copy or ebook.

Handling Zero Key References and Optional Data

Zero Key References and Optional Data in Data Vault Modeling

A nuanced modeling question came in recently about how to handle a source table that delivers relationships between two business objects, where one specific role type has no related second object — just additional attributes in the same record. The proposed model was a solid starting point, and the discussion that followed touched on several important Data Vault principles: zero key handling, CDC Satellites vs. multi-active Satellites, effectivity tracking, and why filter conditions in Raw Data Vault loading are a risk worth avoiding. This post walks through each of these in turn.



Zero Key References: Modeling Optional Relationships

The scenario involves a Link between two Hubs — Object One and Object Two — with a role type as part of the Link structure. For most role types, both Hub references are populated. For one specific role type, Object Two doesn’t exist; the source provides additional descriptive attributes instead of a foreign key.

The correct handling for the missing Object Two reference is straightforward: use the all-zeros key. When a foreign key in the source is null, the Link refers to the zero key in the Hub rather than storing an actual null. This keeps the model queryable with inner joins, avoids null-handling complexity downstream, and is entirely consistent with Data Vault null business key handling. Both Hubs should have their two zero key rows — the all-zeros key for unknown or null references, and the all-Fs key for error cases — deployed as standard practice.
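The zero key substitution can be sketched as follows. The 32-character key length, the MD5 hash, the `;` delimiter, and the function names are assumptions for illustration; the point is that a null reference resolves to the all-zeros key instead of a null, and that the role type takes part in the Link hash key.

```python
import hashlib

ZERO_KEY = "0" * 32   # unknown or null reference
ERROR_KEY = "f" * 32  # error cases

def hub_hash_key(business_key):
    """Return the Hub hash key, mapping null or empty keys to the zero key."""
    if business_key is None or business_key.strip() == "":
        return ZERO_KEY
    # Assumed convention: MD5 over the trimmed, upper-cased business key.
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def link_hash_key(object_one, object_two, role_type):
    # The role type is part of the Link, so it is hashed with both Hub references.
    parts = [(p or "").strip().upper() for p in (object_one, object_two, role_type)]
    return hashlib.md5(";".join(parts).encode("utf-8")).hexdigest()

# A source row for the role type that has no Object Two.
link_row = {
    "hk_link": link_hash_key("OBJ-1", None, "OWNER"),
    "hk_object_one": hub_hash_key("OBJ-1"),
    "hk_object_two": hub_hash_key(None),  # resolves to the zero key, never NULL
}
```

Because `hk_object_two` always holds a value that exists in the Hub, downstream queries can keep using inner joins.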

The role type sits in the Link structure alongside the two Hub references, which means it participates in the hash key computation for the Link. When the role type changes, the Link sees it as a new entry. The effectivity Satellite then captures when the old record was no longer active. That’s the expected behavior.

Multi-Active Satellites vs. CDC Satellites: Choosing the Right Approach

The proposed model used multi-active Satellites to capture multiple rows with different valid_from and valid_to dates. Whether this is the right choice depends on one key question: are those records all active at the same time in the source system? Can a source system user see all of them simultaneously?

If yes — multiple records with different validity periods are all visible and active concurrently in the source — then a multi-active Satellite is the appropriate choice. The multi-active attribute should be a subsequence from staging rather than a business-supplied date, keeping control of uniqueness on the Data Vault side rather than trusting the source.

If no — the records represent sequential changes, not concurrent active states — then a CDC Satellite is a cleaner fit. A CDC Satellite is structurally a standard Satellite, but the load date is modified by adding a sequence number from the CDC package as microseconds. This means only one row is active at any given moment, which simplifies PIT Table construction (two columns instead of three per Satellite) and improves join performance. The choice between the two comes down to how the source system actually manages these records.
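The load date adjustment for a CDC Satellite can be sketched in a few lines of Python. This is only an illustration: the function name `cdc_load_date` and the sample changes are hypothetical, and it assumes the CDC package delivers an integer sequence number per change.

```python
from datetime import datetime, timedelta


def cdc_load_date(batch_load_date, cdc_sequence):
    """Add the CDC sequence number to the batch load date as microseconds."""
    return batch_load_date + timedelta(microseconds=cdc_sequence)


batch = datetime(2024, 1, 1, 6, 0, 0)
changes = [(1, "created"), (2, "updated"), (3, "updated again")]

# Each change in the same batch gets its own load date, in CDC order.
rows = [
    {"load_date": cdc_load_date(batch, sequence), "payload": payload}
    for sequence, payload in changes
]
```

Every change within one batch ends up with a distinct, ordered load date, so at most one row per parent key is active at any point in time.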

A third alternative worth noting: for multi-active scenarios, a JSON array stored in a standard Satellite can replace a traditional multi-active Satellite in some cases. It depends on the loading mechanism and the downstream consumption requirements, but it’s a valid option that avoids the multi-active complexity entirely by capturing multiple active rows as a structured JSON payload.

Effectivity Tracking and the Status Tracking Satellite

The proposed model included a separate status tracking Satellite alongside an effectivity Satellite. A cleaner and more storage-efficient approach is to merge these by adding a deletion timestamp directly to the effectivity Satellite.

The deletion timestamp works simply: when a record exists in the source, the deletion timestamp is set to end-of-all-times. When a deletion is detected — through a comparison of the current load against the target — the deletion timestamp is updated to the current load date, marking the record as no longer physically present in the source. If the record reappears, the timestamp reverts to end-of-all-times.
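The comparison logic can be sketched as follows. This is a simplified Python illustration, not a loading template: the `END_OF_ALL_TIMES` value, the function name, and the in-memory dictionaries standing in for the effectivity Satellite are all assumptions, and a real load would write the changes as new Satellite rows.

```python
from datetime import datetime

END_OF_ALL_TIMES = datetime(8888, 12, 31)  # assumed placeholder value


def deletion_timestamps(current_state, source_keys, load_date):
    """Compare the current load against the target.

    current_state maps each known key to its latest deletion timestamp.
    Returns only the keys whose deletion timestamp has to change.
    """
    changes = {}
    for key, deleted_at in current_state.items():
        present = key in source_keys
        if not present and deleted_at == END_OF_ALL_TIMES:
            changes[key] = load_date          # deletion detected
        elif present and deleted_at != END_OF_ALL_TIMES:
            changes[key] = END_OF_ALL_TIMES   # record reappeared
    for key in source_keys:
        if key not in current_state:
            changes[key] = END_OF_ALL_TIMES   # new record, exists in source
    return changes


state = {
    "k1": END_OF_ALL_TIMES,        # still in the source
    "k2": END_OF_ALL_TIMES,        # missing from this load
    "k3": datetime(2024, 1, 1),    # previously deleted, now back
}
changes = deletion_timestamps(state, {"k1", "k3", "k4"}, datetime(2024, 2, 1))
```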

All timelines — valid_from, valid_to, deletion timestamps — belong together in the effectivity Satellite. This consolidation reduces the number of Satellites to manage and makes the model more straightforward to query.

GDPR and Customer Re-Registration

A related question came up about GDPR: if a customer requests data deletion and later re-registers, are they treated as a new record? The answer is yes, and it’s an important distinction from standard soft-delete handling.

Soft deletes in the Raw Data Vault are used to track hard deletes from the source for non-legal reasons — products removed, records archived, relationships ended. The history is preserved in the Vault even when it’s gone from the source.

GDPR is different. When a deletion is legally required, the personal data must be genuinely removed — a hard delete in the target. The non-personal data associated with that customer may be retained, but the link between the old history and the re-registering customer is permanently severed. If that customer returns and creates a new record, there’s no way to reconnect them to their previous history, because that connection no longer exists in the model. This is by design: the loss of that relationship is the point.

Why Filter Conditions in Raw Data Vault Loading Are Risky

One of the more important principles raised in this discussion: never apply filter conditions when loading the Raw Data Vault. The specific question was whether it’s acceptable to filter the source by role type when loading the additional attributes Satellite — loading only rows where the role type matches the one that doesn’t reference Object Two.

The answer is no, and the reasoning is worth understanding clearly. Applying a WHERE condition or a filtering join in the Raw Data Vault loading process is an application of business logic. The Raw Data Vault is supposed to capture raw data as delivered, without interpretation. Any filter condition that depends on the content of the data — rather than purely technical checks like delta detection or null replacement — violates that principle.

The practical risk is concrete: source system behavior changes. A new role type is introduced that also lacks an Object Two reference. The filter condition doesn’t know about it, so those records get skipped. Or dirty data arrives where an unexpected combination of fields appears — both an Object Two reference and additional attributes for a role type that wasn’t supposed to have both. A filter condition handles this incorrectly or drops data silently.

The only conditions permitted in Raw Data Vault loading are technical ones: checking whether a business key or relationship already exists in the target (the delta check), and replacing null values with zero keys. Everything else — including role-type-based filtering — belongs in the Business Vault, where it can be applied as explicit, versioned, testable business logic.

Validating the Model with the JEDI Test

When uncertain about a modeling decision — whether to use a multi-active Satellite or a CDC Satellite, whether to split or consolidate — the JEDI test provides a reliable check. The test is simple: try to reconstruct the original source delivery from the Raw Data Vault. Join everything back together and verify that no records are lost, no columns are missing, and no artificial records have been generated that didn’t exist in the source.

If the reconstruction succeeds without data loss or artificial inflation, the model is valid. Whether it’s the best model depends on the data and the downstream consumption patterns — but validity is the baseline, and the JEDI test is how you prove it.
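The comparison step of the JEDI test can be sketched in Python. This assumes the reconstruction (the joins across Hubs, Links, and Satellites) has already produced rows with the same column names as the source delivery; the function name `jedi_test` and the sample rows are hypothetical.

```python
from collections import Counter


def jedi_test(source_rows, reconstructed_rows):
    """Compare source and reconstruction as multisets of rows."""
    source = Counter(tuple(sorted(row.items())) for row in source_rows)
    rebuilt = Counter(tuple(sorted(row.items())) for row in reconstructed_rows)
    missing = source - rebuilt       # rows lost on the way through the Vault
    artificial = rebuilt - source    # rows that never existed in the source
    return {
        "missing": sum(missing.values()),
        "artificial": sum(artificial.values()),
        "passed": not missing and not artificial,
    }


source = [
    {"object_one": "A", "object_two": "B", "role_type": "owner"},
    {"object_one": "A", "object_two": None, "role_type": "guest"},
]

ok_result = jedi_test(source, list(reversed(source)))  # row order is irrelevant
bad_result = jedi_test(source, source[:1])             # one row was lost
```

A row with a missing or renamed column no longer matches its source row, so it is reported as both a missing and an artificial record.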

To explore these modeling patterns in depth — including zero key handling, effectivity Satellites, CDC loading, and the JEDI test — check out our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical copy or ebook.


How Model Access Works in dbt Cloud: Groups, Permissions & Cross-Project References

Model Access in dbt Cloud

As organizations scale their data platforms, controlling data access and ownership becomes increasingly important. Teams need clear rules around who can use specific datasets, how models are shared across projects, and how data governance is enforced without slowing down collaboration.

dbt Cloud provides powerful model access features that help analytics engineers and data teams manage visibility, ownership, and cross-project collaboration. By defining groups and access levels such as private, protected, and public, organizations can build scalable and secure data architectures aligned with modern data mesh principles.

This article explains how model access works in dbt Cloud using a producer-consumer project setup. We will explore access levels, group configuration, cross-project references, and how model relationships appear in the Catalog and lineage graph.



Why Model Access Matters in Modern Data Platforms

In a growing data ecosystem, multiple teams often build and consume data models simultaneously. Without clear access control, this can lead to:

  • Unclear data ownership
  • Breaking changes across teams
  • Data quality risks
  • Limited governance
  • Uncontrolled dependencies

dbt addresses these challenges by allowing teams to:

  • Control who can reference specific models
  • Define ownership through groups
  • Enable safe collaboration between teams
  • Support data mesh architectures
  • Manage cross-project dependencies

With proper configuration, organizations can ensure reliable and scalable data pipelines while maintaining flexibility.

Understanding Model Access Levels in dbt

dbt provides three main model access levels that control visibility and usage across projects and teams.

Private Models

Private models are restricted to a specific group. Only models within the same group can reference them.

Key characteristics:

  • Limited visibility
  • Strong access control
  • Internal use within a team
  • Prevents external dependencies

Private models are useful for intermediate transformations or sensitive data that should not be widely accessible.

Protected Models

Protected models allow broader usage while maintaining controlled ownership.

Key characteristics:

  • Can be referenced outside their group
  • Still managed by a specific owner
  • Suitable for shared internal data
  • Balanced control and accessibility

By default, dbt models have protected access unless configured otherwise.

Public Models

Public models are designed for cross-project collaboration and wider consumption.

Key characteristics:

  • Accessible across projects
  • Supports data sharing
  • Enables data mesh architecture
  • Clear ownership boundaries

Public models are commonly used as trusted data products that other teams depend on.
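The three access levels can be summarized as a small rule check. The Python sketch below is a mental model only, not dbt's implementation: the `Model` class, the `can_reference` function, and the sample model names are hypothetical. It also assumes that protected models are limited to their own project, which matches dbt's documented behavior but is not spelled out above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    project: str
    access: str = "protected"    # dbt's default access level
    group: Optional[str] = None


def can_reference(consumer, producer):
    """Return True if `consumer` may reference `producer`."""
    same_project = consumer.project == producer.project
    if producer.access == "public":
        return True
    if producer.access == "protected":
        return same_project
    if producer.access == "private":
        return (
            same_project
            and producer.group is not None
            and consumer.group == producer.group
        )
    raise ValueError(f"unknown access level: {producer.access}")


staging = Model("stg_orders", "producer", access="private", group="analytics")
dimension = Model("dim_customer", "producer", access="protected", group="analytics")
fact = Model("fact_orders", "producer", access="public", group="analytics")

finance_report = Model("finance_report", "producer", group="finance")
external_report = Model("sales_report", "consumer")
```

In this sketch, `finance_report` can use `dim_customer` but not `stg_orders`, while `external_report` in the consumer project can only reach `fact_orders`.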

Producer and Consumer Project Setup

To demonstrate model access behavior, we use two dbt Cloud projects:

  • Producer Project: Provides models for consumption
  • Consumer Project: References public models from the producer

This setup reflects real-world scenarios where one team publishes data assets and another team consumes them.

Configuring Groups in dbt

Groups in dbt define ownership and control access to models. They help manage responsibilities and enforce governance.

A group can be configured in YAML files and assigned to specific models or entire folders.

Assigning Groups to Individual Models

For example, an analytics group can be defined and assigned to models such as:

  • Orders per supplier country customer
  • Orders per customer
  • Orders per supplier
  • Orders per country

This configuration ensures that models follow consistent ownership and access rules.

Assigning Groups to Folders

Instead of configuring each model individually, teams can assign a group to an entire folder using the project configuration file. This approach simplifies governance and ensures consistent access settings.

Testing Private Model Access

Private models can only be referenced by models within the same group. If a model outside the group attempts to reference a private model, dbt returns an error.

To resolve this, the referencing model must be added to the same group configuration.

This behavior ensures:

  • Clear ownership boundaries
  • Reduced risk of unintended dependencies
  • Improved data security

Private access is particularly useful for internal transformations or staging logic that should remain hidden from external teams.

Using Protected Models

Protected models offer more flexibility. Any model in the same project can reference them, including models outside their group, without additional configuration.

This makes them ideal for:

  • Shared business logic
  • Reusable transformations
  • Internal reporting models
  • Organization-wide metrics

Protected access balances governance with usability.

Configuring Public Models for Cross-Project Use

Public models allow other projects to reference them, enabling collaboration across teams.

For public models to be visible to other projects, dbt Cloud requires:

  • A successful job run
  • An environment defined as staging or production
  • Metadata resolution by the dbt Cloud service

Once these conditions are met, public models become available to consumer projects.

This mechanism ensures that only validated and production-ready models are shared.

Star Schema Example with Mixed Access Levels

A typical data modeling setup might include a star schema with dimensions and fact tables.

In this scenario:

  • Most models are configured as public
  • Specific models, such as certain dimensions, may remain protected
  • Protected models are used internally by other models
  • Only necessary data is exposed externally

This design prevents unnecessary data exposure while maintaining efficient dependencies.

Exploring Model Ownership in the dbt Catalog

The dbt Catalog provides visibility into model ownership, access levels, and dependencies.

Within the Catalog, users can:

  • View model owners
  • Filter by access level
  • Explore group information
  • See associated models
  • Understand data lineage

This transparency improves governance and helps teams understand data responsibilities.

Referencing Public Models from Consumer Projects

Consumer projects can reference public models from producer projects using cross-project references.

These models appear in the development environment and lineage graph as external dependencies. However, consumer teams cannot build or modify them. The producer team retains full ownership.

This separation provides:

  • Clear responsibility boundaries
  • Reduced operational risk
  • Reliable shared data products
  • Improved collaboration

Understanding Lineage Graphs and Dependencies

The lineage graph provides a visual representation of how models relate to one another across projects.

It helps teams:

  • Track upstream and downstream dependencies
  • Understand data flow
  • Identify external data sources
  • Analyze impact of changes
  • Improve system reliability

For cross-project references, the lineage graph clearly shows the project and environment where external models originate.

Benefits of Model Access Control in dbt

Implementing structured model access provides significant advantages:

  • Strong data governance
  • Clear ownership structure
  • Secure data sharing
  • Scalable collaboration
  • Support for data mesh architecture
  • Reduced risk of breaking changes
  • Improved transparency

These capabilities help organizations scale their analytics operations effectively.

Best Practices for Managing Model Access

To maximize the benefits of dbt model access, organizations should follow these practices:

  • Define clear ownership groups
  • Use private models for internal logic
  • Expose only necessary data through public models
  • Validate models before sharing
  • Monitor dependencies using lineage graphs
  • Document access rules consistently

Following these guidelines helps maintain a reliable and scalable data ecosystem.

Conclusion

Model access in dbt Cloud enables organizations to control data visibility, manage ownership, and support cross-team collaboration. By configuring groups and defining access levels such as private, protected, and public, teams can build secure and scalable data architectures.

When combined with Catalog visibility and lineage tracking, these features provide a strong foundation for data governance and modern analytics workflows.

As data platforms continue to grow in complexity, structured access control becomes essential for ensuring trust, reliability, and collaboration across the organization.


Understanding Error Keys in Data Vault

Error Keys in Data Vault: Understanding Zero Keys and Null Business Key Handling

One of the more subtle but important concepts in Data Vault is the handling of null business keys — known as zero keys in Data Vault 2.0 and formally called null business key handling in Data Vault 2.1. Most practitioners understand the first zero key intuitively, but the second one — and where it actually earns its value — is less commonly understood. This post explains both, and where each one belongs in practice.



Error Keys Explained: The Two Zero Keys

Every Hub and Link in a Data Vault model is deployed with two special rows pre-loaded: one with a hash key of all zeros, and one with a hash key of all Fs. These are the two zero keys, and they exist to handle null business keys cleanly throughout the model.

The all-zeros hash key is the more commonly understood of the two. It replaces null values in Links — specifically, null references to business keys. When a relationship is received with a missing or null Hub reference, that null gets replaced by the all-zeros key rather than being stored as an actual null. This allows the model to rely on inner joins consistently when querying the Data Vault, without having to handle nulls case by case through left joins or null checks. When you join from a Link to a Hub, you always hit a record — either a real business key or the zero key. Clean, fast, and predictable.

The all-Fs hash key serves a distinct and more specific purpose: it marks bad data, as opposed to merely missing or ugly data. Understanding the difference between those two things is the key to understanding why two zero keys exist at all.

Ugly Data vs. Bad Data: Why the Distinction Matters

Consider a transaction record where the store reference is null. In a brick-and-mortar retail context, this seems wrong — every sale happens somewhere. But in a business that also runs an online store, a null store value might simply mean the transaction happened online. The data is incomplete by conventional standards, but it’s not incorrect. It reflects a real business scenario. This is what you might call ugly data: not ideal, not the most descriptive, but not an error.

Now consider a different scenario: the interface specification for a source system explicitly states that a particular foreign key is non-nullable. The data arrives anyway with null values in that field. Here, either the data is genuinely corrupted or the specification is wrong. Either way, something has gone wrong. This is bad data — data that shouldn’t exist in the form it arrived.

The all-zeros key handles the ugly case. The all-Fs key is reserved for the bad case. Having both allows the model to preserve the distinction rather than collapsing all null situations into a single catch-all placeholder.

Where the All-Fs Key Is Actually Used in Practice

In theory, the all-Fs key could be applied in the Raw Data Vault whenever a null value violates an interface specification. In practice, this rarely happens. Analyzing every interface description, identifying which nulls represent violations, and modifying the Raw Data Vault mappings accordingly is a significant effort — and most projects don’t invest in it at the raw layer. The all-Fs rows exist in every Hub and Link as a structural feature, but they tend to sit unused in the Raw Data Vault itself.

Where the all-Fs key genuinely earns its place is in the Business Vault and Information Marts. The pattern looks like this: during the construction of a Fact view or a Bridge Table, business logic identifies records that reference Hub keys which shouldn’t exist — store locations that were never valid, product codes that are clearly erroneous, data that passed through the raw layer but doesn’t belong in the dimensional model. Instead of passing those records through to the Dimension with a misleading or nonsensical member, the business logic replaces their hash keys with the all-Fs value.

In the resulting Dimension, those records map to an explicitly erroneous member — a designated “error” row — rather than polluting actual dimension members with bad data. Business users and analysts can see that certain facts are associated with an error case, filter them out, investigate them, or handle them according to reporting requirements. The data is quarantined and labeled, not silently dropped or mixed in with valid records.
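A minimal Python sketch of this pattern follows. The key values, the store names, the "Unknown Store" and "Erroneous Store" labels, and the set of invalid stores are all hypothetical; in a real project the rule that decides which keys are invalid is business logic maintained in the Business Vault.

```python
ZERO_KEY = "0" * 32   # missing reference ("ugly" data)
ERROR_KEY = "f" * 32  # erroneous reference ("bad" data)

BERLIN_HK = "a1" * 16   # hypothetical hash key of a valid store
BOGUS_HK = "b2" * 16    # hypothetical hash key of a store that never existed

dim_store = {
    ZERO_KEY: "Unknown Store",
    ERROR_KEY: "Erroneous Store",
    BERLIN_HK: "Berlin Flagship",
}


def build_fact(link_rows, invalid_store_hks):
    """Replace hash keys flagged by business logic with the all-Fs key."""
    return [
        {**row, "store_hk": ERROR_KEY if row["store_hk"] in invalid_store_hks else row["store_hk"]}
        for row in link_rows
    ]


def join_to_dimension(fact_rows, dimension):
    """Inner-join semantics: every fact row must find a dimension member."""
    return [{**row, "store_name": dimension[row["store_hk"]]} for row in fact_rows]


link_rows = [
    {"sale_id": 1, "store_hk": BERLIN_HK},
    {"sale_id": 2, "store_hk": ZERO_KEY},   # for example an online sale
    {"sale_id": 3, "store_hk": BOGUS_HK},   # flagged as erroneous below
]

fact = build_fact(link_rows, invalid_store_hks={BOGUS_HK})
report = join_to_dimension(fact, dim_store)
```

No fact row is dropped: the flagged sale lands on the designated error member, where analysts can filter or investigate it.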

Ghost Records in Satellites

The zero key pattern extends to Satellites as well, through what are called ghost records. At minimum, one ghost record exists in each Satellite — associated with the all-zeros hash key — to ensure that joins from a Hub or Link to a Satellite always return a result, even for the zero key case.

In implementations using the datavault4dbt package, two ghost records are created: one for the all-zeros key and one for the all-Fs key. Beyond making the implementation consistent, this has a practical benefit in the dimensional layer. The two ghost records can carry different descriptive values — for example, “Unknown Customer” for the all-zeros case and “Erroneous Customer” for the all-Fs case. This makes the distinction visible and user-friendly in reports and dashboards, giving analysts a clear signal about what they’re looking at rather than a generic placeholder for both missing and bad data.

Because the ghost records share their hash keys with the zero keys in the parent Hub and Link, they join naturally without any special handling. It’s a side effect of the design that works elegantly in practice.

Should You Drop the All-Fs Key If You’re Not Using It?

The question occasionally comes up: if the all-Fs key isn’t being used in the Raw Data Vault, can it simply be dropped? Technically, yes. But in most implementations it stays, for a few reasons. It costs almost nothing to maintain: it’s a single extra row in each Hub and Link. It provides a structural home for bad data classification if the need arises later. And its real value, as described above, is realized downstream in the Business Vault and Information Mart, where it’s actively useful for handling erroneous data in business logic and dimensional modeling.

Dropping it from the Raw Data Vault to save minimal overhead would mean losing a precise and semantically meaningful tool at exactly the layer where it’s most needed.

To go deeper on null business key handling, ghost records, and the full Data Vault 2.1 methodology, explore our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical or digital copy for a concise introduction to the core concepts.


Predictive Analytics on the Modern Data Platform

Predictive Analytics on the Modern Data Warehouse

From BI to AI: Operationalizing Predictive Analytics where your Data already lives

Traditional Business Intelligence and reporting are incredibly good at telling you what happened yesterday. How much revenue was generated last quarter? How many users logged in this week? But while understanding the past is important, today’s businesses need to know what will happen tomorrow.

This is where Predictive Analytics comes in. At its core, predictive analytics simply uses historical data to forecast future outcomes. Instead of asking how many customers canceled last month, predictive analytics asks:

“Which specific customers are most likely to cancel next week?”

Many organizations understand this value and eagerly hire data scientists to build these models. Yet, time and time again, these predictive initiatives fail to make it out of the PoC phase and into daily business operations because of how teams and data architectures are fundamentally structured.

Predictive Analytics on the Modern Data Platform

Learn how to bridge the gap between your data platform and actionable AI by building predictive models directly where your business data lives. This webinar covers practical strategies for transforming warehouse data into features, deploying models, and automating the flow of insights back into your daily operational workflows. Learn more in our upcoming webinar on March 17th, 2026!


The Problem: The “Two Silos” of Data

In many companies, Data Engineering and Data Science exist in two entirely different worlds.

Data Engineers and Data Warehouse Developers spend their days building the Modern Data Platform. They carefully extract, clean, conform, and govern data from dozens of sources to create a “Single Source of Truth.” When a business analyst looks at a revenue dashboard, they know they can trust the numbers because the data platform enforces strict business logic.

Data Scientists, on the other hand, often work in isolated environments like standalone Jupyter notebooks. Because they need massive amounts of data to train their machine learning models, they often bypass the data platform entirely, pulling raw, unstructured data directly from a data lake.

This disconnect creates some challenges:

  • Duplicated Effort: Data Scientists waste up to 80% of their time cleaning and prepping raw data, work the Data Engineering team has already done in the platform
  • Inconsistent Metrics: Because models are built on raw data, a model’s definition of “Active Customer” might completely contradict the official definition used in the data platform
  • The “Wall of Production”: A model might look perfectly accurate on a data scientist’s laptop, but because it relies on disconnected, ungoverned data pipelines, integrating its predictions back into the daily workflows of sales or support teams becomes an IT nightmare
  • Exclusivity: Data analysts are often limited to classic descriptive analytics, which slows time-to-insight. The optimal solution is to democratize data science, empowering analysts to implement predictive use cases directly

The Solution: Bring the Machine Learning to the Data

The fix to this problem requires a fundamental shift in how we think about machine learning architecture. Instead of moving data out of governed systems to feed external ML models, we need to bring the ML workflows closer to the data.

By positioning the Modern Data Platform as the foundation for predictive analytics, you ensure that every prediction is built on the same trusted, cleansed, and governed business data used for your daily reporting. The Data Platform becomes the Feature Store, a centralized hub where data is prepared once and used everywhere, whether for a BI dashboard or training a predictive model.


When the data platform serves as the single source of truth for both analysts and algorithms, magic happens. Data science teams stop wrestling with raw data pipelines, data engineering teams maintain governance, and the business gets predictions they can actually trust and operationalize.

Two Architecture Approaches

So, how do we actually bring the machine learning to the data? There isn’t a one-size-fits-all answer. Depending on the team’s skillset and the complexity of the models, teams typically adopt one of two foundational patterns:

Pattern 1: In-Warehouse Machine Learning (Democratizing ML)

Modern cloud data platforms have evolved beyond just storing and querying data; many now have machine learning engines built directly into them.

  • How it works: Using standard SQL, Data Analysts and Analytics Engineers can train, evaluate, and deploy models entirely inside the data platform (for example BigQuery ML, Snowflake Cortex or Databricks)
  • The Benefit: This radically democratizes predictive analytics. You don’t need to know Python or manage complex infrastructure to build a model. If you know SQL, you can generate predictions using the exact same tables you use for your BI dashboards
  • The Trade-off: While perfect for standard tasks like regression or classification, you are limited to the specific algorithms supported by the data platform

Pattern 2: The Data Platform as a Feature Store (The Hybrid Approach)

For organizations with dedicated Data Science teams building highly complex or custom models, the data platform takes on a different role: the Feature Store.

  • How it works: Data Scientists continue to work in their preferred external ML platforms (like Vertex AI, Databricks, or others). However, instead of pulling messy data from a data lake, they connect directly to the data platform to pull curated, business-approved data (“features”) for training
  • The Benefit: Data Scientists retain maximum flexibility to use advanced Python libraries and deep learning frameworks, while ensuring the models are trained on governed, accurate data
  • The Trade-off: It requires a bit more orchestration to manage the pipeline between the data platform and the ML platform, and to ensure predictions are accurately written back to the data platform

Example: Predicting Customer Churn

To understand the benefits of these approaches, let’s look at a classic business challenge: Customer Churn Prevention.

Imagine a SaaS company trying to figure out which customers are likely to cancel their subscriptions. In a siloed environment, predicting this can be a messy, manual science project. But on a modern data platform, it becomes an automated operational workflow:

  1. The Foundation (Data): Because of the Data Engineering team’s work, the data platform already contains all the necessary historical information about the customer. CRM data (company size), financial records (billing history), product logs (login frequency), and Zendesk tickets (recent complaints) are all cleaned, joined, and sitting in governed tables, including a full history of changes
  2. The Prediction (Modeling): An analyst uses In-Warehouse ML (Pattern 1) to run a classification model against this historical data. The model identifies the hidden patterns of a churning customer and generates a “Churn Risk Score” between 0 and 100 for every active user
  3. The Operationalization (Action): This is the crucial step. The predictions aren’t left in a notebook. The risk scores are written directly back into a new table in the data platform. Through reverse-ETL, these scores can be synced automatically to the CRM, and dashboards and reports can easily be built on top of the results.
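The write-back step can be sketched in Python. Everything here is a stand-in: an in-memory SQLite database plays the role of the data platform, the table and column names are hypothetical, and the weights are made up for illustration rather than learned. A real setup would train the model with the platform's own ML features or an external ML platform.

```python
import math
import sqlite3

# Hypothetical weights; a real model would learn these from historical data.
WEIGHTS = {"logins_last_30d": -0.15, "open_tickets": 0.8, "late_payments": 0.6}
BIAS = 0.5


def churn_risk_score(features):
    """Logistic score scaled to an integer between 0 and 100."""
    z = BIAS + sum(weight * features[name] for name, weight in WEIGHTS.items())
    probability = 1.0 / (1.0 + math.exp(-z))
    return round(probability * 100)


conn = sqlite3.connect(":memory:")  # stand-in for the data platform
conn.row_factory = sqlite3.Row

# Step 1: governed feature table, already prepared in the platform.
conn.execute(
    "CREATE TABLE customer_features ("
    "customer_id TEXT PRIMARY KEY, logins_last_30d INTEGER, "
    "open_tickets INTEGER, late_payments INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_features VALUES (?, ?, ?, ?)",
    [("C1", 40, 0, 0), ("C2", 1, 3, 2)],
)

# Step 2: score every customer from the governed features.
rows = conn.execute("SELECT * FROM customer_features").fetchall()
scores = [(row["customer_id"], churn_risk_score(dict(row))) for row in rows]

# Step 3: write the scores back into a new table in the platform.
conn.execute(
    "CREATE TABLE churn_scores (customer_id TEXT PRIMARY KEY, churn_risk_score INTEGER)"
)
conn.executemany("INSERT INTO churn_scores VALUES (?, ?)", scores)
conn.commit()

stored = dict(conn.execute("SELECT customer_id, churn_risk_score FROM churn_scores"))
```

Once the scores sit in a regular table, reverse-ETL tools and dashboards can read them like any other governed data.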

Conclusion

Predictive analytics shouldn’t be an isolated science experiment. It should be a living, breathing part of your operational reality. By treating your modern data platform as the foundation for your machine learning workflows, you eliminate data silos, empower your analysts, and ensure your predictions are built on the trusted business data that matters most.

It is time to operationalize predictive insights where your business data already lives.

Want to see how this works in practice?

Join our upcoming webinar: Predictive Analytics on the Modern Data Platform. We will explore how to build and run predictive analytics directly on top of your data platform using trusted, governed business data as the foundation. You’ll learn practical patterns for turning warehouse models into features, training and deploying predictions, and integrating results back into reporting and operational workflows. Join me on March 17th.

