Skip to main content
search
0
All Posts By

Michael Olschimke

Michael Olschimke is the Co-Founder and CEO of Scalefree and a "Data Vault 2.0 Pioneer" with over 20 years of IT experience. A Fulbright scholar and co-author of Building a Scalable Data Warehouse with Data Vault 2.0, Michael is a global authority on AI, Big Data, and scalable Lakehouse design across sectors like banking, automotive, and state security.

Architectural Limitations of PostgreSQL for Enterprise Data Vault Workloads

Architectural Limitations of PostgreSQL for Enterprise Data Vault Workloads

PostgreSQL stands as the industry standard for open-source transactional (OLTP) database engines. Its reliability, strict ACID compliance, and robust extension ecosystem make it a primary choice for operational workloads. It is the foundation that other open-source alternatives build upon.

PostgreSQL derivatives and extensions modify this core engine to handle heavy analytical (OLAP) and Data Vault workloads while preserving the existing PostgreSQL expertise.

Given this ubiquity, engineering organizations often aim to use vanilla PostgreSQL as their initial data warehousing platform. Vanilla” PostgreSQL refers to the core, single-node version of the database system, which serves as the blueprint for its derivatives and extensions. It is published as an open source project at https://www.postgresql.org/. This approach is commonly paired with Data Vault to achieve agile, historized data modeling.

While theoretically sound, this architectural combination regularly encounters a severe performance threshold as data volumes scale into the mid-to-high gigabyte and terabyte range.

I have created multiple applications that relied on PostgreSQL for operational workloads during my career as a software engineer (before my Data Vault career). With this brief history in mind, it is no surprise that Scalefree relied on PostgreSQL to some extent: we built our first internal data warehouse on PostgreSQL, even knowing its limitations. It was good to start with, when the internal data volume at Scalefree was low, and query complexity was limited. We chose vanilla PostgreSQL because of its derivatives; once we experience the database’s limitations, we would migrate to one of its commercial derivatives.



Strategic Context: Digital Sovereignty, Open Source, and Transatlantic Tech

Before diving into the technical mechanics, it is essential to address why open-source infrastructure matters so deeply to the future of enterprise architecture.

At Scalefree, our commitment to open-source software is foundational. We actively maintain and contribute to the community, which you can explore directly via the Scalefree GitHub repository. A prime example of this dedication is our open-source project, datavault4dbt, which brings robust Data Vault generation capabilities to standard PostgreSQL as well as its derivatives and cloud warehouse counterparts like Amazon Redshift.

Beyond pure engineering efficiency, a broader geopolitical driver is at play: digital sovereignty. As a Fulbright Scholar deeply invested in transatlantic affairs, I closely observe the shifting dynamics between the United States and Europe. True strategic alignment requires balance. The US does not benefit from a technologically dependent Europe; rather, the long-term stability of the transatlantic alliance relies on the US having a strong, digitally independent, and technologically sovereign ally in Europe. This calls for a pragmatic approach to digital sovereignty: use it when available and feasible, build capabilities, and create alternatives without going into panic mode on political trends that change every four or eight years.

The Open Source Sovereignty Paradox

A common counterargument often arises here: If many of these major open-source projects are driven, maintained, and funded by US developers and tech giants, how do they actually provide a sovereign alternative to US commercial technology?

The answer lies in the fundamental nature of open-source licensing and governance, which fundamentally alters the power dynamic:

  1. Elimination of Vendor Lock-In and Extraterritorial Jurisdiction: Proprietary cloud solutions are bound by commercial licenses, corporate terms of service, and the host nation’s domestic laws (such as the US Cloud Act). If a geopolitical or regulatory shift occurs, access can be restricted or altered. Open-source code, once published under licenses such as Apache 2.0 or MIT, is in the public domain. It cannot be revoked, repatriated, or shut off by a foreign corporate or state actor. Those licenses are our preferred choice for our open source projects at Scalefree.
  2. The Right to Fork and Host Locally: Open source grants European enterprises the ultimate sovereign right: the ability to fork the code, host it completely on local infrastructure, and maintain it independently. Even if the primary contributor base remains in Silicon Valley, European engineers have full visibility into the source code, allowing them to audit for security, eliminate telemetry, and adapt it to localized regulatory compliance (like GDPR) without foreign oversight.
  3. Decoupling Innovation from Capital Concentration: Open source democratizes access to state-of-the-art software architecture. It allows a European ecosystem to build sovereign, high-performance platforms without being forced to route massive capital into proprietary foreign hyper-scaler ecosystems.

Building enterprise data architecture on open, vendor-neutral infrastructure is a critical pillar of that sovereignty. However, to achieve true independence, our open-source tools must be architected to handle enterprise-scale workloads.

The following analysis details the structural reasons why vanilla PostgreSQL is fundamentally ill-suited for large-scale analytical (OLAP) processing, with a specific focus on how Data Vault methodology exacerbates these limitations and on open-source alternatives to remedy them.

1. Storage Paradigm Conflict: Row-Oriented vs. Columnar

The primary bottleneck stems from PostgreSQL’s native storage architecture. As a traditional relational database, it utilizes a row-oriented storage engine (heap tables), where complete records are written sequentially to disk blocks.

OLTP Optimization: This architecture is optimal for point-write and point-read operations, such as fetching a single entity profile via a unique identifier.

OLAP Inefficiency: Analytical queries typically perform aggregations over hundreds of millions of rows while restricting the projection to a minimal subset of attributes (e.g., computing a sum over a single numeric column).

Because PostgreSQL reads data row-by-row, it must scan every byte of every unrequested column within the target disk blocks into memory. This introduces immense, non-value-add disk I/O bottlenecks that degrade query performance at scale.

2. Structural Degradation Under Data Vault

Data Vault provides exceptional flexibility and auditability by decomposing business domains into discrete structural components:

  • Hubs: Unique business keys.
  • Links: Relationships and associations between keys.
  • Satellites: Contextual attributes, descriptive states, and historical tracking.

While highly effective for parallelized ingestion and decoupling business logic, this normalization strategy creates severe friction for a row-oriented relational engine.

The Multi-Way Join Complexity

To reconstruct a coherent business entity for downstream consuming layers (such as Business Intelligence tools), data engineers must reverse this decomposition. A single analytical query often requires a 10-to-20-way join across multiple large Hubs, Links, and versioned Satellites.

As dataset sizes expand, this structural complexity impacts the engine in two ways:

  1. Optimizer Limitations: The PostgreSQL query planner struggles to generate precise cardinality estimations across deeply nested join trees, frequently reverting to inefficient join strategies (e.g., nested loops instead of parallel hash joins).
  2. Memory Subsystem Pressure: Executing these multi-way joins requires significant memory allocation (work_mem). When a query’s requirements exceed physical memory allocations, PostgreSQL spills intermediate operations to disk, decelerating execution speeds by orders of magnitude.

The Lack of Inner Join Elimination (Join Reduction)

A particularly acute limitation of the PostgreSQL query optimizer in Data Vault architectures is its inability to perform Inner Join Elimination (also known as Join Reduction).

In complex Data Vault environments, users or Business Intelligence tools frequently query comprehensive, multi-table views or abstract layers that automatically stitch Hubs, Links, and Satellites together. If an end-user runs a report that only requests attributes from a single Satellite and its parent Hub, the remaining 10 tables in that view are technically redundant to the final output.

Modern OLAP query optimizers recognize declarative primary/foreign key constraints and automatically eliminate these unnecessary tables from the execution plan. The PostgreSQL query optimizer cannot eliminate redundant Inner Joins. Even if no columns are selected from a joined table, and even if a valid foreign key guarantees a 1:1 match, Postgres will stubbornly execute every single inner join defined in the query or view. This results in massive, redundant table scans and CPU cycles spent processing joins that have zero impact on the final result set.

Accumulation of Dead Tuples (Bloat)

Data Vault architecture is inherently append-only; changes in source systems trigger the insertion of a new row within the corresponding Satellite to preserve history. This high-frequency append behavior, coupled with PostgreSQL’s Multi-Version Concurrency Control (MVCC), accelerates the creation of dead tuples. Under continuous analytical workloads, the native AUTOVACUUM process frequently falls behind, resulting in table and index bloat that degrades scan performance.

3. The Window Function Bottleneck (And the Fallacy of Recent Releases)

Beyond complex joins, Data Vault workloads rely extensively on Window Functions (LEAD, LAG, RANK, ROW_NUMBER) to determine active states, calculate durations between historical intervals, or isolate the latest record within a Satellite stream.

In vanilla PostgreSQL, window functions are computationally expensive when executed against vast datasets. The engine must sort the partition keys in memory or spill them to temporary disk files to establish the window bounds.

The Limits of Engine Optimizations in Recent Versions

Proponents of the ecosystem often point to performance enhancements introduced in recent iterations—specifically PostgreSQL 17 (which introduced advanced streaming I/O optimizations) and PostgreSQL 18 (which implemented an asynchronous I/O (AIO) framework for sequential and bitmap heap scans).

While these enhancements represent significant milestones for core engine efficiency, they do not resolve the analytical window function bottleneck:

  • I/O vs. Execution Model: These upgrades optimize the database’s ability to read blocks from storage more rapidly. However, the underlying execution model for window functions remains bound to single-node, row-by-row compute processing.
  • Diminishing Returns at Scale: Increasing disk I/O throughput offers negligible relief when a LAG() or ROW_NUMBER() function across hundreds of millions of rows forces a massive, single-threaded disk-sort operation. The bottleneck is merely shifted from storage I/O to compute saturation.

4. Single-Node Architectural Constraints

Enterprise analytical data platforms utilize Massively Parallel Processing (MPP) architectures. They shard and distribute datasets across a cluster of compute nodes, allowing complex joins and window calculations to execute concurrently.

Vanilla PostgreSQL is a single-node database system. While it supports intra-query parallelism across multiple CPU cores, it remains bound by the hardware limits of a single virtual or physical machine. Scaling a vanilla PostgreSQL instance to meet growing OLAP demands requires vertical hardware scaling (scale-up), which scales linearly in cost but delivers diminishing returns in performance.

Open-Source Alternatives and PostgreSQL Derivatives

If your organization is committed to maintaining a sovereign, open-source stack while preserving existing PostgreSQL expertise, you do not need to migrate to proprietary cloud data warehouses. Several open-source derivatives and extensions modify the Postgres engine specifically to handle heavy OLAP and Data Vault workloads:

1. Apache Cloudberry (Incubating, Open-Source MPP)

For enterprise Data Vaults, Apache Cloudberry stands out as an incredibly powerful evolutionary step. It is an open-source Massively Parallel Processing (MPP) database derived from the Greenplum 7 codebase but aggressively modernized.

Why Apache Cloudberry outclasses Greenplum for Data Vaults:

  • Modern Upstream Kernel: Apache Cloudberry remedies legacy tracking issues by building on a much newer PostgreSQL 14 kernel (whereas older Greenplum installations remain anchored to legacy Postgres backends).
  • Superior Analytical Feature Set: Cloudberry supports a massive list of critical performance and optimization features, including incremental sort for window functions, run-time filters for joins, aggregation pushdowns, query pipelining, and advanced BRIN indexing.
  • True Community-Driven Open Source: Following Broadcom’s acquisition of Greenplum’s parent company, the source for Greenplum was closed. As a project under the Apache Software Foundation (ASF), Cloudberry offers vendor-neutral governance under an Apache 2.0 license, completely free from mandatory vendor lock-in.

2. Citus (Open-Source Extension)

How it helps: Citus transforms PostgreSQL into an MPP database by distributing tables and queries across a cluster of multiple nodes.

Data Vault Impact: Citus allows you to shard Hubs, Links, and Satellites by a common business key. When a multi-way join occurs, the compute overhead is distributed across the entire cluster, breaking through the single-node hardware ceiling.

3. Hydra & pg_analytics (Open-Source Columnar Extensions)

How it helps: These extensions add a native columnar storage engine directly inside the PostgreSQL kernel.

Data Vault Impact: By enabling columnar storage for non-historized links, dependent child links, and bridge tables, queries only scan the exact attributes requested by a BI tool, eliminating the I/O tax of scanning entire rows.

Conclusion

Vanilla PostgreSQL is an exceptional transactional engine, but its row-based architecture, single-node limitations, lack of inner join reduction, and row-by-row window function execution create structural barriers for large-scale Data Vault implementations.

Postgres should be protected for what it excels at: transaction management and operational metadata serving. When transitioning to heavy Data Vault modeling and analytical workloads, engineers must decouple their compute to an internal data lake or look toward specialized open-source derivatives like Apache Cloudberry that preserve both technological sovereignty and high-performance scalability.

 

Cover Image designed by Magnific

Data Vault Glossary: Hub, Link, Satellite, Business Vault, and More

Data Vault Glossary

The Essential Data Vault Glossary

Data Vault has its own precise vocabulary. Whether you are evaluating the methodology for the first time or preparing for Data Vault certification, understanding what each term means — and why it exists — is the foundation for everything else. This glossary covers the core concepts of Data Vault 2.0 and 2.1, defined at the conceptual level for data engineers, architects, and IT leaders building or modernising a data platform.



Business Key

A business key is the identifier that the business actually uses to recognise and track a business object — a customer number, a product code, an account number, an ISBN. It is the natural, meaningful key that appears in source systems and that business users refer to in their daily work.

In Data Vault, the business key is the fundamental organising principle of the entire model. Every Hub is built around business keys. The goal is to find keys that are unique across the enterprise and stable over time — keys that different source systems share, enabling integration between them.

Business keys sit above surrogate keys (technical IDs generated by a source system). A surrogate key is unique within one system but carries no meaning outside it. A business key has meaning across the organisation, making it suitable for integration. The hierarchy runs from global business keys (universally unique, like a Vehicle Identification Number), through organisational business keys (assigned by the enterprise, like a customer number), down to system-wide surrogate keys where no better option exists.

Hub

A Hub is one of the three fundamental entity types in Data Vault. It stores a distinct list of business keys for a single type of business object — all customer numbers, all product codes, all account numbers. The Hub identifies. It records which business keys have ever existed in the data platform, alongside when they were first seen (the load date) and where they came from (the record source).

The Hub does not describe anything about the business object — that is the Satellite’s job. It does not store relationships — that is the Link’s job. A Hub is insert-only: once a business key is recorded, it is never updated or deleted (except under legal obligation). This permanence is what makes Data Vault historically complete.

Link

A Link is the second fundamental entity type. It stores a distinct list of relationships between business keys — the fact that a customer purchased a product, that an employee was assigned a vehicle, that a booking involved a passenger and a flight. Like the Hub, the Link is insert-only and records when the relationship was first identified and from which source.

The Link does not describe the relationship — it only establishes that it existed. All descriptive information (when it started, when it ended, what conditions applied) lives in Satellites attached to the Link. Importantly, Links can connect more than two Hubs: a purchase transaction might link a customer, a product, and a store simultaneously. This is entirely normal in Data Vault design.

Satellite

A Satellite is the third fundamental entity type, and where the actual data warehousing happens. It stores descriptive data — the attributes that describe a business object or relationship over time. A customer’s name and address. A product’s description and list price. The start and end dates of an employment contract.

Every time an attribute changes in the source, a new row is inserted into the Satellite. No rows are ever updated. This insert-only behaviour is what gives Data Vault its complete historical record. Each Satellite has exactly one parent — either a Hub or a Link — and Satellites are typically split by source system, by security or privacy classification, and sometimes by rate of change.

The combination of Hub, Link, and Satellite reflects the three fundamental components present in all enterprise data: business keys, relationships, and descriptive attributes. For a deeper treatment of how these entities are modelled and loaded, Data Vault 2.1 Training & Certification covers the full methodology in detail.

Raw Vault

The Raw Vault (also called the Raw Data Vault) is the layer of the Data Vault architecture that stores unmodified source data. It consists of Hubs, Links, and Satellites that capture data exactly as it arrived — no cleansing, no business rules, no filtering, no conditional logic of any kind.

The Raw Vault is the single point of facts. Because no business interpretation has been applied, the data it holds is fully auditable: you can demonstrate precisely what any source system delivered on any given date. This auditability is one of the primary reasons Data Vault is adopted in regulated industries such as banking, insurance, and government.

Business Vault

The Business Vault is the layer above the Raw Vault where business logic is applied. It uses the same Hub-Link-Satellite structures, but its purpose is to transform and enrich the raw data — cleansing records, resolving duplicates, applying currency conversions, tagging data quality levels, and deriving calculated attributes.

The Business Vault is not a mandatory pass-through layer. Data that is already clean and ready for reporting can flow directly from the Raw Vault to an Information Mart. In practice, organisations typically maintain multiple Business Vault schemas — one per department or domain — each expressing the business rules and definitions relevant to that context. This is how Data Vault delivers multiple versions of the truth from a single set of facts: different teams can apply their own definitions without touching the shared Raw Vault underneath. Learn more about the full Data Vault 2.0 methodology and how Scalefree applies it in client projects.

Information Mart

An Information Mart is the delivery layer that presents data to end users and reporting tools. Unlike the Raw Vault and Business Vault — which use Hub-Link-Satellite structures — Information Marts use dimensional models such as star schemas, snowflake schemas, or flat wide tables, in whatever structure the consuming tool requires.

Information Marts are usually virtualised (SQL views) rather than materialised tables, making them lightweight and easy to modify. The recommended approach is many small, focused Information Marts — one per report or use case — rather than a single large mart. Several specialised mart types exist for specific purposes:

  • Error Mart — captures records rejected by a loading process due to hard rule violations. Should always be empty in a healthy system.
  • Raw Mart — presents raw data in a reportable dimensional form without applying business rules. Used during agile requirements gathering to help business users articulate what they need.
  • Quality Mart — shows only the bad or suspect records, giving data stewards visibility into data quality issues so they can be fixed at the source.
  • Source Mart — reconstructs the original structure of a source system from the Data Vault model, with the added benefit of historical versioning and built-in GDPR data removal.
  • Interface Mart — designed for machine-to-machine consumption, used when a downstream application needs to read from the platform or receive cleansed data back from it.
  • AI Feature Mart — a specialised Interface Mart designed for AI and machine learning model consumption, typically wide, flat, and enriched with semantic field descriptions.

Hash Key

A Hash key is a fixed-length value derived by applying a hashing algorithm (typically MD5 or SHA-256) to one or more business key columns. In Data Vault, Hash keys serve as the primary keys of Hubs and Links, and as the foreign key references connecting Satellites to their parents.

The key advantage of Hash keys is that they can be computed independently: any system, given the same business key input, will always produce the same Hash key. This enables parallel loading, makes the model portable across environments, and simplifies join logic. The actual business key columns remain stored alongside the Hash key in the Hub or Link. For a detailed look at how Hash keys are implemented in practice, see Scalefree’s article on Hash Keys in the Data Vault.

Load Date

The load date timestamp is a technical metadata attribute on every Hub, Link, and Satellite row. It records the moment the record was loaded into the data platform — not when the event occurred in the source system, but when the data arrived in the vault. The load date is always a full timestamp, never just a date, since data platforms often receive deliveries multiple times per day.

Combined with the record source, the load date answers two fundamental audit questions for every piece of data: when was it received, and from where?

Record Source

The record source identifies which source system a particular record came from. It is stored on every Hub, Link, and Satellite row alongside the load date. Its primary audience is the development and engineering team — when investigating a data issue, the record source points directly to the originating system and delivery batch. It is not used for business reporting or compliance auditing in the same way as the load date.

PIT Table

A PIT table (Point-in-Time table) is a helper structure that makes querying historical data across multiple Satellites significantly more efficient. Without a PIT table, reconstructing the complete state of a business object at a specific historical moment requires complex, expensive joins across Satellites with different load dates. A PIT table pre-computes the correct Satellite row timestamps for each point in time, so downstream queries can join the PIT table directly rather than re-solving the temporal logic on every run.

PIT tables are derived structures — generated from Raw Vault data and rebuildable at any time. They are not part of the core Data Vault model but are standard production companions to it.

Bridge Table

A Bridge table is a helper structure that simplifies querying across multiple Links. Where PIT tables solve the temporal complexity of Satellites, Bridge tables solve the structural complexity of navigating a chain of linked Hubs — for example, tracing from a customer through their orders, through their order lines, to the products. Bridge tables are pre-joined snapshots of relationship paths that would otherwise require multiple sequential joins. See also: Bridge Tables 101 on the Scalefree blog.

Ghost Record

A ghost record (also called a default record or zero key record) is a placeholder row inserted into a Hub or Satellite to handle situations where a foreign key reference exists in the source data but the referenced record itself does not. It prevents referential integrity violations and allows the data platform to load records completely even when source data is incomplete. Ghost records are technical placeholders, not real business data, and are distinguishable by their defined default key values.

Effectivity Satellite

An Effectivity Satellite tracks the active or inactive status of a Hub record or a Link relationship over time. It records when a business object or relationship became active in the source system and when it was deactivated or deleted. When a source system deletes a record, the Hub retains the business key permanently — the Effectivity Satellite gains a new row reflecting the deletion, preserving the complete history while making the current active state queryable.

Persistent Staging Area

The Persistent Staging Area (PSA) is the layer where raw source data is stored before it enters the Raw Vault. Unlike a transient staging area (which holds only the most recent delivery), a PSA retains every historical delivery — a complete, time-stamped archive of everything ever received from every source system. In modern Data Vault architectures, the PSA role is typically fulfilled by a data lake, organised in a folder structure partitioned by source system, table, and load date.

Unit of Work

The unit of work is a concept from the Data Vault agile methodology that defines the smallest deliverable increment of business value in a sprint. It consists of a complete data flow from source to Information Mart — staging the required source data, loading the Raw Vault entities, applying business rules in the Business Vault, and delivering the result in a mart that a business user can consume. Organising development around units of work ensures every sprint delivers something tangible to the business rather than invisible infrastructure.

Data Aging

Data aging refers to the practice of identifying and marking historical records in the Raw Vault or Business Vault that are no longer operationally relevant — records that have not been updated or referenced over a significant period. Data aging strategies help manage storage costs and query performance over time. In keeping with Data Vault’s insert-only philosophy, aged records are flagged or moved to archival storage rather than deleted, preserving the completeness of the historical record.

CDVP2.1

CDVP2.1 stands for Certified Data Vault Practitioner 2.1 — the professional certification awarded by the Data Vault Alliance upon passing the certification examination. It validates that a practitioner understands and can apply the Data Vault 2.1 methodology across architecture, modeling, and implementation.

Scalefree is an authorised Data Vault Alliance training partner. The Data Vault 2.1 Training & Certification is the official path to CDVP2.1, combining instructor-led training with exam preparation and two included exam attempts. If you are building or modernising a data platform and want to understand how Data Vault fits into a broader enterprise architecture, explore the free Data Vault Handbook or get in touch with Scalefree directly.

Hash Keys and Modern Data Platforms

Hash Keys in Data Vault on Modern Data Platforms: Snowflake, Fabric, and Beyond

A question that comes up regularly — especially from teams working on cloud-native platforms like Snowflake — is whether hash keys are still necessary, or whether sequences or raw business keys might be more efficient. It’s a fair question, and the answer depends on understanding what hash keys actually solve, what the alternatives cost, and how modern massively parallel processing (MPP) platforms change the performance equation. This post covers all three options and explains why hash keys remain the recommended approach even on modern platforms.



Hash Keys on Modern Data Platforms: Why Not Sequences?

Sequences are the first alternative most people consider — integers are small, fast to compare, and familiar. But they come with a fundamental structural problem: they require lookups. To load a Link, you need the sequence values for the Hubs it references, which means Hubs must be loaded before Links, Links before their Satellites, and so on. In small, single-environment setups, this ordering constraint is manageable. In large-scale or distributed environments, it becomes a serious obstacle.

Consider a setup where facts and real-time feeds live in the cloud while customer master data lives on-premise. To load a fact with a sequence-based key, you need to look up the sequence for each customer from the on-premise system — through a firewall, across a network, under latency and security constraints. In practice, this doesn’t scale. It introduces tight loading dependencies between systems that should be able to operate independently.

Hash keys and business keys don’t have this problem. Hash the same business key on two different systems and you get the same hash value. Both environments can load independently and join cleanly without cross-environment lookups. At Scalefree, the only clients currently using sequences in their Data Vault are on migration projects — migrating away from sequences. That’s worth keeping in mind before choosing them.

Business Keys: When They Work and When They Don’t

Business keys are the other alternative. On the surface, a business key stored directly in a Hub seems simpler than hashing it — one less step, shorter values. And on modern MPP platforms like Snowflake, Fabric, or BigQuery, the join performance argument for hash keys is less compelling than it used to be. These platforms distribute and index data across thousands of nodes in ways that make business key joins perform reasonably well.

The problem shows up in Links. A Link referencing three or four Hubs combines multiple business keys into its primary key. A VIN number alone can be 20 characters; combine it with a customer number, a transaction ID, and a location code and you’ve already exceeded the 32 characters of an MD5 hash. Business keys are also often variable-length, which matters on traditional row-based database systems: fixed-length fields are guaranteed to stay in the primary page during a join, while variable-length fields may be offloaded to a secondary page, turning a two-page join into a four-page operation.

On Non-Historized Links and their attached Satellites — where volume is high and the primary key is replicated across every row — wide, variable-length business key combinations compound quickly into a storage and performance problem. As you dig deeper into the Data Vault model with more complex queries and more joins, the size of the join conditions grows with the business keys.

The other practical constraint is tool stack consistency. If your environment mixes a cloud MPP platform with an on-premise Postgres derivative, a data lake for staging, and various Business Vault loading tools, using business keys means different query patterns depending on which systems are involved. Sometimes you join on the business key, sometimes on the hash key, sometimes on a combination. The query logic becomes metadata-driven and harder to read. Hash keys simplify this: always one column, always the same join pattern, regardless of platform.

Binary vs. Character Hash Values

Once you’ve decided to use hash keys, the next question is storage format: character (32 chars for MD5, 40 for SHA-1) or binary (16 or 20 bytes respectively). Binary is half the size, joins faster, and produces smaller join conditions in the dimensional layer — all genuine advantages, especially when materializing data into OLAP cubes or columnar tools like QlikSense.

The reason most projects still use character-based hash values is tool compatibility. Strings are universally supported. Binary data types are not — many real-time processing tools, data mining platforms, and AI/ML frameworks work with basic data types only. If an external script, a RapidMiner workflow, or a streaming processor needs to write into the Business Vault, a binary hash key may not be supported without explicit conversion logic.

The practical recommendation: use character-based hash values in the Raw Data Vault and Business Vault for maximum compatibility. In the Information Mart, if the data is being materialized into a tool that benefits from smaller keys — an OLAP cube, a QlikView dataset — convert to binary in the view layer. That keeps the core model flexible while capturing the storage and join benefits where they actually matter.

Hashdiffs on Modern Platforms: Still Worth It

A related question is whether hashdiffs are still valuable on column-based platforms like Snowflake, where column compression already reduces redundant data significantly. The answer is yes, and the reason is about how compute is distributed across loads rather than the cost of a single load.

The hashdiff is calculated when a record is first loaded into a Satellite. On subsequent loads, the comparison is between the freshly calculated staging hashdiff and the already-materialized Satellite hashdiff — which was computed during a previous load, not the current one. This means the compute cost of delta detection is spread across the load history: roughly half the work happens in prior loads, and the current load only handles the staging side. Over time, especially on high-volume Satellites with relatively low change rates, this distribution of compute is a meaningful performance gain.

Column-by-column comparison without a hashdiff moves all of that computation into the current load and requires fetching additional column pages for each comparison on column-based storage. The hashdiff collapses the entire comparison into a single column join, which scales much better as Satellite width and data volume grow. This is why tools like datavault4dbt no longer offer hashdiff as an optional feature — it’s simply on by default, because the performance case is consistent enough that disabling it isn’t worth the option overhead.

The Case for Staying with Hash Keys

Modern MPP platforms do reduce some of the traditional arguments for hash keys — join performance on business keys is no longer the clear-cut problem it was on row-based on-premise systems. But hash keys still deliver consistent advantages that matter in real projects: single-column join conditions that work the same way everywhere, independence from loading order, full compatibility across distributed environments, and a query pattern simple enough to generate automatically from metadata.

For teams building on Databricks, Snowflake, Fabric, or any other modern platform, hash keys remain the recommended approach. Not because the alternatives are impossible, but because the consistency and operational simplicity they provide across varied tool stacks and deployment patterns is worth more than the marginal gains from switching.

To explore hash key design, hashdiff patterns, and the full Data Vault modeling approach in depth, check out our Data Vault 2.1 Training & Certification. And for a solid introduction to the core concepts, the Data Vault Handbook is available as a free physical copy or ebook.

Watch the Video

How to Define SCD Type 2 Dimension Keys in a Data Vault Solution

SCD Type 2 Dimension Keys in Data Vault: Hash Keys, Sequences, and the PIT Table

Defining dimension keys in a Data Vault solution is one of those topics that seems straightforward until you get to Type 2 dimensions — and then the options multiply quickly. Should you use hash keys or sequences? Where do Type 2 keys come from, and how do they connect back to your facts? This post walks through the full picture, from the simplest Type 1 case all the way to the Dimension Hash Key pattern used for Type 2 slowly changing dimensions.



SCD Type 2 Dimension Keys: Starting with the Simple Case

For Type 0 and Type 1 dimensions — dimensions without history — the dimension key question is easy. Every Hub already contains exactly one hash key per business entity, and every Link contains one hash key per relationship. These Type 1 hash keys are already present throughout your model: in Non-Historized Links, Dependent Child Links, and Bridge Tables. You can use them directly as dimension keys in your view layer without generating anything new. It’s the lowest-effort, highest-compatibility option.

Hash keys also have a significant advantage over sequences in distributed environments. If your facts live in the cloud and your dimensions are generated on-premise, you can’t easily synchronize integer sequences between systems — the lookup dependencies alone make it impractical. Hash keys don’t have this problem. Hashing the same business key on two different systems produces the same hash value. A distributed Information Mart works cleanly with hash keys; with sequences, it becomes a coordination problem.

For more on how hash keys work in Data Vault and why they’re designed the way they are, the Scalefree blog covers the topic in depth.

When Sequences Make Sense — and How to Generate Them

The case for sequences is primarily storage. An MD5 hash value stored as a character string takes 32 bytes; a SHA-1 takes 40. A big integer takes 8 bytes. If storage is a genuine concern, converting character-based hash values to binary in the view layer is the first option to consider — it cuts the size in half with minimal effort and no structural changes.

If you still want integer sequences after that, there are two places to generate them. You can add a sequence column directly to the Hub or Link structure, used purely as a downstream dimension key rather than as an identifier. This works but creates a conceptual tension: after spending effort explaining why sequences aren’t used as Hub identifiers, reintroducing them in the same structure is confusing for anyone reading the model.

The cleaner approach is a Computed Satellite in the Business Vault, attached to the Hub or Link, that generates a new sequence value for every new record in the parent. It’s a simple business rule — new parent record, new sequence — and it keeps the sequence generation in the layer designed for computed values. The trade-off is an additional join when consuming the sequence downstream, but the design is explicit and the logic is easy to understand and maintain.

The Type 2 Challenge: Why Hub Hash Keys Aren’t Enough

Type 1 hash keys work for dimensions without history because the granularity is one row per business entity. Type 2 dimensions need finer granularity — one row per business entity per version over time. The hash key from the Hub doesn’t capture that; it’s the same value regardless of when you’re looking at the data.

What you need for a Type 2 dimension is a key that is unique not just per entity but per entity per point in time. In Data Vault, that key already exists — it’s generated as part of the PIT Table.

The Dimension Hash Key from the PIT Table

When producing a Type 2 dimension, you need a PIT Table anyway — it provides the snapshot-based granularity that drives the dimension’s history. The PIT Table’s alternate key is the combination of the parent’s business key (not the hash key — never hash a hash) and the snapshot date. The primary key of the PIT Table is a hash value computed from those two inputs: business key plus snapshot date.

At Scalefree, this value is called the Dimension Hash Key. It is unique per row in the PIT Table, which means it is unique per entity per point in time — exactly what a Type 2 dimension key needs to be. This Dimension Hash Key becomes the primary key of your Type 2 dimension and the foreign key that your fact entities need to reference in order to join to the correct dimension member at the correct point in time.

Connecting Facts to Type 2 Dimensions

The remaining challenge is on the fact side. Bridge Tables and Non-Historized Links — the typical foundations for fact entities — contain Type 1 hash keys from Hubs and Links, not Type 2 Dimension Hash Keys. So how does a fact row know which Type 2 dimension member to reference?

The solution is a join through the PIT Table’s alternate key inside the fact view. A Bridge Table typically contains the Type 1 hash key from the relevant Hub and a snapshot date. Those two values together form the alternate key of the PIT Table. Inside the fact view, you join the Bridge Table to the PIT Table using the hash key and snapshot date, retrieve the Dimension Hash Key from the PIT Table’s primary key, and surface that as the dimension reference in the fact entity.

The result: the fact entity contains a single column — the Dimension Hash Key — that points to exactly one Type 2 dimension member. The dashboard tool and end users never need to know how it was derived. The join logic is handled in the view layer, the keys match between fact and dimension, and the relationship resolves cleanly. This is the preferred approach rather than exposing a composite key (hash key plus snapshot date) from the fact side, which would complicate the dimensional model unnecessarily.

For teams using datavault4dbt premium, PIT Table generation and the Dimension Hash Key pattern are handled through the automation framework, which significantly reduces the manual effort involved in implementing this correctly at scale.

Putting It Together: Key Decisions for Dimension Keys

To summarize the decision framework: for Type 0 and Type 1 dimensions, use the Type 1 hash keys from Hubs and Links directly — they’re already available throughout the model and work cleanly in distributed environments. If storage is a concern, convert to binary hash values in the view layer before considering sequences. If sequences are genuinely required, generate them in a Computed Satellite in the Business Vault rather than embedding them in Hub or Link structures.

For Type 2 dimensions, use the Dimension Hash Key from the PIT Table as the primary key of the dimension. Connect facts to Type 2 dimensions by joining the Bridge Table or Link to the PIT Table’s alternate key inside the fact view, surfacing the Dimension Hash Key as the dimension reference. This keeps the dimensional model clean, the keys stable, and the join logic encapsulated where it belongs.

To go deeper on PIT Tables, dimension modeling, and the full Data Vault delivery layer, explore our Data Vault certification program. And for a concise introduction to the core concepts, the Data Vault Handbook is available as a free physical copy or ebook.

Watch the Video

Same-as-Links: Enterprise-Wide Deduplication Across Multiple Sources

Same-as-Links: Enterprise-Wide Deduplication Across Multiple Sources

One of the more powerful but nuanced constructs in Data Vault is the Same-as-Link (SAL). Two questions came in recently that get at the heart of how SALs work across source systems: can a Same-as-Link have multiple sources, and can it span keys from different source systems? The answers differ depending on whether you’re working in the Raw Data Vault or the Business Vault — and understanding why reveals something fundamental about how Data Vault handles enterprise-wide deduplication and integration.



Same-as-Links and Multiple Sources in the Raw Data Vault

The first question — can a Same-as-Link have multiple sources — is straightforward. Like any Link in the Raw Data Vault, a SAL can receive records from multiple source systems. Hubs consolidate business keys from different sources into the same entity, and Links do the same for relationships. As long as the relationship has the same semantic meaning and the same granularity across those sources, loading them into the same Link is valid and correct. So yes, a SAL in the Raw Data Vault can have multiple source systems contributing records to it.

The second question is more nuanced: can a SAL span keys from multiple sources — meaning one Hub reference on one side of the relationship comes from System A, and the other comes from System B?

In the Raw Data Vault, the answer is generally no — with one important exception. A core principle of Raw Data Vault loading is that each row comes from exactly one source system. Loading a single row that requires joining data from two independent source systems introduces a loading dependency: you have to wait for System A before you can load data from System B. That’s precisely the kind of tight coupling the Raw Data Vault is designed to avoid. Independent source systems should load independently.

The exception is when a single source system already knows both business keys. An ERP system, for example, might reference customers by a customer number that originates in a CRM system. The ERP system carries that key as a known reference — it’s available in a single source record without requiring a cross-system join at load time. In that case, a SAL row sourced from the ERP system can legitimately reference a business key that conceptually originates elsewhere. The single-source-per-row rule still holds; the integration happened upstream, inside the source system itself.

Same-as-Links in the Business Vault: Cross-Source Deduplication

In the Business Vault, the picture is quite different — and this is where SALs really show their value. When two independent source systems use completely different, unrelated business keys for what is actually the same real-world entity, there’s no source-level relationship to load. The Raw Data Vault captures both sets of keys in the same Hub (since they represent the same business concept), but there’s nothing in the source data to connect them.

This is where calculated Same-as-Links come in. Using descriptive data from both systems — names, addresses, contact details — fuzzy matching logic can identify that business key A from System A and business key B from System B refer to the same entity. That determination is a business rule. It belongs in the Business Vault. The result is a SAL entry that spans two business keys from completely independent source systems, calculated from the data rather than loaded from any single source.

This is one of the primary use cases for Same-as-Links: not just deduplicating records within a single source system, but integrating and deduplicating entities across the enterprise. Two CRM systems, two customer databases, two product catalogs — wherever the same real-world object appears under different identifiers in different systems, a Business Vault SAL can establish the connection and enable unified reporting and analysis across all of them.

For organizations dealing with complex multi-source environments, this kind of cross-system entity resolution is one of the most tangible business value deliverables a Data Vault implementation can produce. If you’re building or evaluating a enterprise data warehouse, the SAL pattern is worth understanding deeply — it’s the mechanism that turns a collection of source-aligned Hubs into a genuinely integrated enterprise model.

Why the Raw and Business Vault Distinction Matters Here

The contrast between how SALs work in the Raw Data Vault versus the Business Vault illustrates a broader principle that runs through all of Data Vault 2.0 design: the Raw Data Vault captures what the sources deliver, as they deliver it, without interpretation. The Business Vault is where judgment, calculation, and business logic are applied.

Fuzzy matching is business logic. Deciding that two records represent the same entity is a business decision. Those decisions belong in the Business Vault — not because the Raw Data Vault can’t technically store the result, but because embedding that logic at the raw layer makes it invisible, untestable, and hard to change when the matching rules evolve.

By keeping the SAL calculation in the Business Vault, you get a clear audit trail of how the deduplication was performed, the ability to update matching logic without reloading source data, and a separation between “what the source said” and “what we believe to be true across sources.” That separation is one of the most operationally valuable properties of a well-structured Data Vault.

Practical Implications for Modeling

When modeling SALs in practice, a few things are worth keeping in mind. In the Raw Data Vault, SALs are appropriate when a single source system provides an explicit deduplication or matching relationship — a master data management export, a merge table, a golden record mapping from a source MDM system. The loading process remains clean and dependency-free.

In the Business Vault, SALs are the right tool when the matching logic needs to be calculated — whether through exact key matching across systems, probabilistic matching, fuzzy string comparison, or any other form of entity resolution. The SAL lives in the Business Vault, references the appropriate Hub twice (master and duplicate), and is populated by whatever calculation or mapping process produces the match.

In both cases, the hash keys in the SAL reference the same Hub, since by definition the master and the duplicate represent the same type of business object. This is what makes the SAL structurally elegant: it reuses existing Hub infrastructure to express an enterprise-wide identity resolution without requiring new structural entities.

To go deeper on Same-as-Links, Business Vault patterns, and enterprise integration strategies in Data Vault, explore our Data Vault 2.1 Training & Certification. And for a concise introduction to the full methodology, the Data Vault Handbook is available as a free physical copy or ebook.

Watch the Video

Handling Zero Key References and Optional Data

Zero Key References and Optional Data in Data Vault Modeling

A nuanced modeling question came in recently about how to handle a source table that delivers relationships between two business objects, where one specific role type has no related second object — just additional attributes in the same record. The proposed model was a solid starting point, and the discussion that followed touched on several important Data Vault principles: zero key handling, CDC Satellites vs. multi-active Satellites, effectivity tracking, and why filter conditions in Raw Data Vault loading are a risk worth avoiding. This post walks through each of these in turn.



Zero Key References: Modeling Optional Relationships

The scenario involves a Link between two Hubs — Object One and Object Two — with a role type as part of the Link structure. For most role types, both Hub references are populated. For one specific role type, Object Two doesn’t exist; the source provides additional descriptive attributes instead of a foreign key.

The correct handling for the missing Object Two reference is straightforward: use the all-zeros key. When a foreign key in the source is null, the Link refers to the zero key in the Hub rather than storing an actual null. This keeps the model queryable with inner joins, avoids null-handling complexity downstream, and is entirely consistent with Data Vault null business key handling. Both Hubs should have their two zero key rows — the all-zeros key for unknown or null references, and the all-Fs key for error cases — deployed as standard practice.

The role type sits in the Link structure alongside the two Hub references, which means it participates in the hash key computation for the Link. When the role type changes, the Link sees it as a new entry. The effectivity Satellite then captures when the old record was no longer active. That’s the expected behavior.

Multi-Active Satellites vs. CDC Satellites: Choosing the Right Approach

The proposed model used multi-active Satellites to capture multiple rows with different valid_from and valid_to dates. Whether this is the right choice depends on one key question: are those records all active at the same time in the source system? Can a source system user see all of them simultaneously?

If yes — multiple records with different validity periods are all visible and active concurrently in the source — then a multi-active Satellite is the appropriate choice. The multi-active attribute should be a subsequence from staging rather than a business-supplied date, keeping control of uniqueness on the Data Vault side rather than trusting the source.

If no — the records represent sequential changes, not concurrent active states — then a CDC Satellite is a cleaner fit. A CDC Satellite is structurally a standard Satellite, but the load date is modified by adding a sequence number from the CDC package as microseconds. This means only one row is active at any given moment, which simplifies PIT Table construction (two columns instead of three per Satellite) and improves join performance. The choice between the two comes down to how the source system actually manages these records.

A third alternative worth noting: for multi-active scenarios, a JSON array stored in a standard Satellite can replace a traditional multi-active Satellite in some cases. It depends on the loading mechanism and the downstream consumption requirements, but it’s a valid option that avoids the multi-active complexity entirely by capturing multiple active rows as a structured JSON payload.

Effectivity Tracking and the Status Tracking Satellite

The proposed model included a separate status tracking Satellite alongside an effectivity Satellite. A cleaner and more storage-efficient approach is to merge these by adding a deletion timestamp directly to the effectivity Satellite.

The deletion timestamp works simply: when a record exists in the source, the deletion timestamp is set to end-of-all-times. When a deletion is detected — through a comparison of the current load against the target — the deletion timestamp is updated to the current load date, marking the record as no longer physically present in the source. If the record reappears, the timestamp reverts to end-of-all-times.

All timelines — valid_from, valid_to, deletion timestamps — belong together in the effectivity Satellite. This consolidation reduces the number of Satellites to manage and makes the model more straightforward to query.

GDPR and Customer Re-Registration

A related question came up about GDPR: if a customer requests data deletion and later re-registers, are they treated as a new record? The answer is yes, and it’s an important distinction from standard soft-delete handling.

Soft deletes in the Raw Data Vault are used to track hard deletes from the source for non-legal reasons — products removed, records archived, relationships ended. The history is preserved in the Vault even when it’s gone from the source.

GDPR is different. When a deletion is legally required, the personal data must be genuinely removed — a hard delete in the target. The non-personal data associated with that customer may be retained, but the link between the old history and the re-registering customer is permanently severed. If that customer returns and creates a new record, there’s no way to reconnect them to their previous history, because that connection no longer exists in the model. This is by design: the loss of that relationship is the point.

Why Filter Conditions in Raw Data Vault Loading Are Risky

One of the more important principles raised in this discussion: never apply filter conditions when loading the Raw Data Vault. The specific question was whether it’s acceptable to filter the source by role type when loading the additional attributes Satellite — loading only rows where the role type matches the one that doesn’t reference Object Two.

The answer is no, and the reasoning is worth understanding clearly. Applying a WHERE condition or a filtering join in the Raw Data Vault loading process is an application of business logic. The Raw Data Vault is supposed to capture raw data as delivered, without interpretation. Any filter condition that depends on the content of the data — rather than purely technical checks like delta detection or null replacement — violates that principle.

The practical risk is concrete: source system behavior changes. A new role type is introduced that also lacks an Object Two reference. The filter condition doesn’t know about it, so those records get skipped. Or dirty data arrives where an unexpected combination of fields appears — both an Object Two reference and additional attributes for a role type that wasn’t supposed to have both. A filter condition handles this incorrectly or drops data silently.

The only conditions permitted in Raw Data Vault loading are technical ones: checking whether a business key or relationship already exists in the target (the delta check), and replacing null values with zero keys. Everything else — including role-type-based filtering — belongs in the Business Vault, where it can be applied as explicit, versioned, testable business logic.

Validating the Model with the JEDI Test

When uncertain about a modeling decision — whether to use a multi-active Satellite or a CDC Satellite, whether to split or consolidate — the JEDI test provides a reliable check. The test is simple: try to reconstruct the original source delivery from the Raw Data Vault. Join everything back together and verify that no records are lost, no columns are missing, and no artificial records have been generated that didn’t exist in the source.

If the reconstruction succeeds without data loss or artificial inflation, the model is valid. Whether it’s the best model depends on the data and the downstream consumption patterns — but validity is the baseline, and the JEDI test is how you prove it.

To explore these modeling patterns in depth — including zero key handling, effectivity Satellites, CDC loading, and the JEDI test — check out our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical copy or ebook.

Watch the Video

Understanding Error Keys in Data Vault

Error Keys in Data Vault: Understanding Zero Keys and Null Business Key Handling

One of the more subtle but important concepts in Data Vault is the handling of null business keys — known as zero keys in Data Vault 2.0 and formally called null business key handling in Data Vault 2.1. Most practitioners understand the first zero key intuitively, but the second one — and where it actually earns its value — is less commonly understood. This post explains both, and where each one belongs in practice.



Error Keys Explained: The Two Zero Keys

Every Hub and Link in a Data Vault model is deployed with two special rows pre-loaded: one with a hash key of all zeros, and one with a hash key of all Fs. These are the two zero keys, and they exist to handle null business keys cleanly throughout the model.

The all-zeros hash key is the more commonly understood of the two. It replaces null values in Links — specifically, null references to business keys. When a relationship is received with a missing or null Hub reference, that null gets replaced by the all-zeros key rather than being stored as an actual null. This allows the model to rely on inner joins consistently when querying the Data Vault, without having to handle nulls case by case through left joins or null checks. When you join from a Link to a Hub, you always hit a record — either a real business key or the zero key. Clean, fast, and predictable.

The all-Fs hash key serves a distinct and more specific purpose: it marks bad data, as opposed to merely missing or ugly data. Understanding the difference between those two things is the key to understanding why two zero keys exist at all.

Ugly Data vs. Bad Data: Why the Distinction Matters

Consider a transaction record where the store reference is null. In a brick-and-mortar retail context, this seems wrong — every sale happens somewhere. But in a business that also runs an online store, a null store value might simply mean the transaction happened online. The data is incomplete by conventional standards, but it’s not incorrect. It reflects a real business scenario. This is what you might call ugly data: not ideal, not the most descriptive, but not an error.

Now consider a different scenario: the interface specification for a source system explicitly states that a particular foreign key is non-nullable. The data arrives anyway with null values in that field. Here, either the data is genuinely corrupted or the specification is wrong. Either way, something has gone wrong. This is bad data — data that shouldn’t exist in the form it arrived.

The all-zeros key handles the ugly case. The all-Fs key is reserved for the bad case. Having both allows the model to preserve the distinction rather than collapsing all null situations into a single catch-all placeholder.

Where the All-Fs Key Is Actually Used in Practice

In theory, the all-Fs key could be applied in the Raw Data Vault whenever a null value violates an interface specification. In practice, this rarely happens. Analyzing every interface description, identifying which nulls represent violations, and modifying the Raw Data Vault mappings accordingly is a significant effort — and most projects don’t invest in it at the raw layer. The all-Fs rows exist in every Hub and Link as a structural feature, but they tend to sit unused in the Raw Data Vault itself.

Where the all-Fs key genuinely earns its place is in the Business Vault and Information Marts. The pattern looks like this: during the construction of a Fact view or a Bridge Table, business logic identifies records that reference Hub keys which shouldn’t exist — store locations that were never valid, product codes that are clearly erroneous, data that passed through the raw layer but doesn’t belong in the dimensional model. Instead of passing those records through to the Dimension with a misleading or nonsensical member, the business logic replaces their hash keys with the all-Fs value.

In the resulting Dimension, those records map to an explicitly erroneous member — a designated “error” row — rather than polluting actual dimension members with bad data. Business users and analysts can see that certain facts are associated with an error case, filter them out, investigate them, or handle them according to reporting requirements. The data is quarantined and labeled, not silently dropped or mixed in with valid records.

Ghost Records in Satellites

The zero key pattern extends to Satellites as well, through what are called ghost records. At minimum, one ghost record exists in each Satellite — associated with the all-zeros hash key — to ensure that joins from a Hub or Link to a Satellite always return a result, even for the zero key case.

In implementations using the datavault4dbt package, two ghost records are created: one for the all-zeros key and one for the all-Fs key. Beyond making the implementation consistent, this has a practical benefit in the dimensional layer. The two ghost records can carry different descriptive values — for example, “Unknown Customer” for the all-zeros case and “Erroneous Customer” for the all-Fs case. This makes the distinction visible and user-friendly in reports and dashboards, giving analysts a clear signal about what they’re looking at rather than a generic placeholder for both missing and bad data.

Because the ghost records share their hash keys with the zero keys in the parent Hub and Link, they join naturally without any special handling. It’s a side effect of the design that works elegantly in practice.

Should You Drop the All-Fs Key If You’re Not Using It?

The question occasionally comes up: if the all-Fs key isn’t being used in the Raw Data Vault, can it simply be dropped? Technically, yes. But in most implementations it stays, for a few reasons. It costs almost nothing to maintain — it’s two rows per Hub and Link. It provides a structural home for bad data classification if the need arises later. And its real value, as described above, is realized downstream in the Business Vault and Information Mart, where it’s actively useful for handling erroneous data in business logic and dimensional modeling.

Dropping it from the Raw Data Vault to save minimal overhead would mean losing a precise and semantically meaningful tool at exactly the layer where it’s most needed.

To go deeper on null business key handling, ghost records, and the full Data Vault 2.1 methodology, explore our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical or digital copy for a concise introduction to the core concepts.

Watch the Video

Data Vault in a Microservices Architecture

Microservices Architecture and Data Vault: Managing Satellites at Scale

Microservices architectures create a specific modeling challenge for Data Vault practitioners. When services are ephemeral — spinning up and down as Docker or Kubernetes containers — each with its own message structure, the standard advice to split Satellites by source system quickly leads to hundreds or thousands of Satellites. At that scale, the real question isn’t about metadata management overhead. It’s about how to consume all that data without joining 500 tables every time you need an answer. This post walks through a practical approach to handling high-volume, highly varied source structures in a Data Vault model.



Microservices Architecture: Why Satellite Splits Become a Problem

The conventional Satellite splitting rules — by rate of change, source system, security, and privacy — exist for good reasons. But in a microservices context, applying them strictly leads to an explosion of Satellites. A new Docker image with a new message structure technically deserves its own Satellite. Automate that process and you accumulate hundreds or thousands of Satellites quickly, most of which may never be queried by anyone.

The issue isn’t that databases can’t handle 500 tables — they can. The issue is the consumption side: joining 500 Satellites to produce a target model is expensive, complex to maintain, and in many cases unnecessary. The real challenge is finding a modeling approach that captures the variety of incoming structures without creating an unmanageable query layer downstream.

Rate of Change Splits: Still Relevant, but Less So for Now

The rate of change split was designed to reduce storage consumption by separating high-frequency attributes from stable ones. Every delta insert copies all columns in the Satellite, so a single change on one attribute in a wide Satellite wastes a lot of storage on unchanged data.

For most modern analytical database systems, compression makes this largely unnecessary. Insert-only tables with lots of redundant data compress extremely well, and virtually all modern analytical platforms support this. The storage cost of skipping the rate of change split is manageable with compression turned on.

That said, this is worth watching. In pay-per-query environments like Athena querying row-based Avro files, or systems that charge based on uncompressed data scanned, the rate of change split becomes economically relevant again. BigQuery’s columnar storage sidesteps this because you only pay for the columns you query — but other managed infrastructure doesn’t work that way. The rate of change split isn’t obsolete; it’s just less pressing for now, and likely to become more relevant as managed, consumption-based pricing models become more common.

The Flip-Flop Effect: Why Source System Splits Still Matter

The source system split is a different matter. Loading data from two different source systems into the same Satellite creates a well-known problem: the flip-flop effect.

Consider a customer whose address is known to both an ERP system (California) and a CRM system (Hannover, Germany). The two systems have different knowledge and potentially different structures for representing the same data. If both load into the same Satellite, the Satellite ends up recording two deltas per day — not because the customer moved, but because two systems loaded sequentially with different values. The data flips between California and Hannover with every load cycle, consuming storage and making it impossible to determine the actual address without applying business logic. Worse, the order of loading determines what the Satellite shows at any given moment — a purely technical artifact with no business meaning.

The fix is straightforward: one Satellite per source. This keeps each system’s view of the data independent and equally available, so business logic in the Business Vault can reconcile them deliberately rather than having the Raw Data Vault collapse them accidentally.

The Gray Area: Millions of Sources, One Practical Solution

The flip-flop rule works cleanly when you have a manageable number of distinct source systems. It breaks down at the extreme end — IoT deployments with millions of sensors, or microservices architectures with hundreds of ephemeral containers — where creating one Satellite per source is operationally impractical.

The solution here depends on two conditions being met. First, you need a key in the parent entity that partitions the data by source — a sensor ID, a Docker image ID, a tenant ID, something that creates independent delta streams within the same Satellite. With this in place, deltas from source A can’t replace or invalidate deltas from source B, which eliminates the flip-flop effect without requiring separate Satellites. Second, the structure of the incoming data must be consistent enough to fit in a shared target — which in practice usually means JSON.

When messages from different microservices or sensors all arrive as JSON — even with different internal structures — you can load them all into a single Satellite or Non-Historized Link with a JSON or JSONB payload column. The structure differences are captured inside the JSON document. You add the partitioning key to the parent, and you’re done. Instead of 500 Satellites with 500 different schemas, you have one entity with a JSON payload and a key that tells you which source produced each record.

Non-Historized Links for Real-Time Messages

For real-time message streams from microservices, a Non-Historized Link with a JSON payload is often the right structure. Real-time messages are events — they don’t update, they accumulate. The flip-flop concern largely disappears because you’re capturing messages as they arrive, not loading full snapshots that might overwrite each other. A Non-Historized Link captures the event, the relevant Hub references, and the message payload in a structure that’s fast to load and straightforward to query.

This same pattern was applied at Scalefree for an investment banking client with 500 different source systems delivering asset data in different CSV formats. Rather than creating 500 entities, a single Non-Historized Link and Satellite captured everything — different CSV structures serialized as JSON strings, distinguished by a load source identifier. Two entities replaced 500, and the consumption layer handled the structural variety through filtering and extraction rather than joins.

Consuming Semi-Structured Data Without Joining 500 Tables

Loading everything into a JSON payload doesn’t eliminate the structural variety — it defers it to query time. When you need data from a specific message type, you need to identify records with the right structure among all the records in the same target entity.

The approach here is filtering rather than joining. Instead of joining 500 Satellites, you query one entity and filter for records that contain specific JSON keys or values that uniquely identify the message type you care about. Email messages, for example, always have a subject, body, sender, and recipient — keys that distinguish them from other message types. A specific transaction type might always carry an ID starting with a known prefix. These structural signatures let you extract subsets of the JSON stream efficiently.

Once filtered, you extract the attributes you need from each subset and UNION the results if you need to combine multiple message types. A UNION of 500 filtered queries on one table is significantly faster than a JOIN of 500 separate tables, and it scales much better as the number of source types grows.

Choosing the Right Approach for Your Context

The right answer depends on where you sit on the spectrum between a small number of structurally distinct source systems and a very large number of structurally similar ones. For a handful of systems with genuinely different schemas and different business semantics — CRM, ERP, financial systems — separate Satellites per source is the right call. The flip-flop effect and structural differences make consolidation risky and introduce business logic where it doesn’t belong.

For microservices, IoT devices, or any scenario where you have many sources with similar structures and a partitioning key available, consolidating into a small number of JSON-payload entities is usually the better trade-off. It simplifies loading, reduces metadata overhead, and keeps the consumption layer manageable — at the cost of pushing structural interpretation into filtering and extraction logic downstream.

To go deeper on Satellite design, source system splits, and Data Vault modeling patterns for modern architectures, explore our Data Vault certification program. The free Data Vault handbook is also available as a physical copy or ebook for a solid grounding in the core methodology.

Watch the Video

How Do You Model External Business Logic In Data Vault?

Modeling External Business Logic in Data Vault: APIs, Scripts, and Source System Thinking

A question that comes up regularly in Data Vault training is how to handle external business logic — specifically, what happens when your data pipeline includes a call to an external API or service that returns enriched or cleansed data. Where does that fit in the model? How do you capture the response? And how do you integrate an external script cleanly into your enterprise data platform? This post walks through a concrete example: address cleansing via an external REST API.



Modeling External Business Logic: The Full Flow

The scenario starts simply enough. You have CRM data — let’s say customer records with addresses — that gets staged and broken down into the Raw Data Vault in the usual way: Hubs for business concepts, Satellites for descriptive attributes. The raw address from the CRM system lands in a Satellite.

Now comes the complication. You need to cleanse and standardize those addresses using an external REST API. A Python script handles the call: it pulls data from the platform, formats it into the required input — perhaps a single string or a calculated key — and sends it to the external service. The service returns a JSON response with the standardized address and additional metadata.

This flow touches several layers of the Data Vault architecture, and each layer has a distinct role.

The Business Vault Prepares the API Call

Before the external call can be made, the Business Vault does preparatory work. If the REST API requires the address in a specific format or needs a calculated key, that computation belongs in the Business Vault — it’s business logic, applied to raw data, to produce the input for an external process.

The external Python script then queries this prepared data — either directly from the Business Vault or via an Interface Mart (more on that below) — and performs the REST call. The script itself may be under version control and within your organization’s control. The external service is not.

Treat the External Service as a Source System

This is the key modeling decision: because the external API is outside your control, you treat its responses exactly as you would treat any other source system. You don’t trust it implicitly. You stage its output and break it into the Raw Data Vault.

If your Raw Data Vault already has an Address Hub from the CRM dataset, and the external service returns identifiers that qualify as business keys — unique, stable identifiers for addresses — those can be added to the Address Hub. The JSON response from the API then gets captured in a Satellite in the Raw Data Vault, associated with the appropriate Hub.

This approach gives you a clean audit trail. You know exactly what the external service returned, when it returned it, and what key was used to make the call. If the external service changes its response structure or returns unexpected data, your Raw Data Vault captures that reality as-is, and your downstream Business Vault logic handles interpretation.

Handling JSON Responses: Two Practical Options

API responses typically come back as JSON — sometimes well-structured, sometimes semi-structured with varying schemas between messages. There are two main approaches for capturing this in the Raw Data Vault, and the right choice depends on how structured the response is and how many attributes you actually need.

Option 1 — Extract what you need, keep the rest as JSON. If the JSON is relatively consistent and you only need a subset of its attributes — say, five out of fifty — extract those five into relational columns in the Satellite. Keep the full JSON (or the remaining payload) as a JSON or JSONB attribute in the same Satellite. You get fast, typed access to the attributes you use regularly, and the full document is available for future needs without requiring a reload.

Option 2 — Keep everything in JSON, extract in the Business Vault. If you’re unsure which attributes you’ll need, or if the structure varies, capture the raw JSON in the Satellite and handle extraction later in the Business Vault. Technically, extracting fields from JSON is a structural transformation — a hard rule, not a business rule — so it could sit in the Raw Data Vault. But if the extraction is straightforward and tied to specific downstream calculations, doing it in the Business Vault view is a reasonable and common practice.

In practice, the hybrid approach from Option 1 is most common: extract the attributes you know you need into relational columns, keep the JSON alongside them. When a new attribute is needed later — and it will be — you can pull it directly from the JSON in your Business Vault view using native JSON functions, without touching the Raw Data Vault or reloading any data.

Integrating the External Script: Dependencies and Interface Marts

When an external script queries your data platform — whether from the Raw Data Vault or the Business Vault — it creates a dependency. The entities that script relies on can’t be freely refactored without risking a broken integration. This is worth flagging explicitly in your metadata: mark those entities as part of the operational vault, indicating that external applications depend on them.

A cleaner long-term solution is to introduce an Interface Mart — a stable, versioned view layer that the external script queries instead of the Raw or Business Vault directly. When you refactor a Satellite or restructure a Business Vault entity, you update the Interface Mart view to maintain the same output structure. The external script sees no change. This decouples your internal model evolution from external integrations, which is especially valuable in organizations where multiple scripts and applications consume data from the platform.

Combining Two Sources in the Business Vault

At this point, you have two sources describing the same concept: the CRM system with the original, non-standardized address, and the external address standardizer with the cleansed version. Both are captured in the Raw Data Vault as separate source inputs. The Business Vault is where you bring them together.

The pre-computed key used to make the API call serves as the joining mechanism. Based on that key, you can establish a relationship — via a Link or a direct join in a Business Vault view — between the raw CRM address and the standardized version returned by the external service. The Business Vault then exposes the combined, cleansed address data to downstream consumers: reports, dashboards, or further downstream application scripts.

The exact modeling decisions at this stage depend heavily on how the CRM data is structured and what the business actually needs from the cleansed address. But the principle holds regardless: raw inputs from both the CRM and the external API live in the Raw Data Vault; the logic that combines and interprets them lives in the Business Vault.

A Pattern Worth Generalizing

Address cleansing is one example, but the same pattern applies to any external enrichment service: geocoding APIs, credit scoring services, entity resolution services, tax calculation engines. Whenever your pipeline includes a call to an external system that returns data you need to capture and use, the approach is the same — treat the response as a source, stage it, load it into the Raw Data Vault, and apply interpretation and combination logic in the Business Vault.

It’s also worth noting that this pattern integrates naturally into data-driven organizations where information is consumed not just through reports and dashboards but through application scripts and automated processes. The enterprise data platform becomes a hub for both analytical and operational consumers — and Data Vault’s layered architecture handles both cleanly.

To explore these patterns in depth — including Business Vault design, Interface Marts, and integrating external sources — check out our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical copy or ebook for a solid introduction to the core methodology.

Watch the Video

Capturing Changing Inventory Levels in Data Vault

Capturing Changing Inventory Levels in Data Vault with Non-Historized Links

Inventory data presents a specific challenge in Data Vault modeling: it arrives as snapshots, it can be corrected retroactively, and deletions need to be traceable back to a specific snapshot date. A user recently proposed a solution using a multi-active Snapshot Satellite — a reasonable starting point — but the question of how to handle corrections and logical deletes at the snapshot level pointed toward a cleaner approach. This post walks through the recommended pattern: a Non-Historized Link with technical counter transactions.



Capturing Changing Inventory Levels: Understanding the Source Data

The scenario is this: a source system sends inventory data as full snapshots. Each extract may contain data for multiple snapshot dates, and if a correction exists for a previously loaded snapshot, the source resends the complete data for that snapshot date. The inventory level represents the count of a product at a given location and store at the end of a specified day.

This is what Kimball defines as a snapshot-based fact — a periodic full count of all products in a warehouse, day by day. The complication is that those snapshots can be corrected. Three days after an initial count, a miscalculation might be identified and the inventory level for that past snapshot date needs to be updated — without affecting the two more recent snapshots.

Why the Multi-Active Satellite Approach Has Limitations

The proposed solution used a Link referencing Product, Store, and Location Hubs, with a multi-active Satellite attached. The snapshot date was nominated as the multi-active attribute, added to the Satellite’s primary key alongside the hash key and Load Date Timestamp.

The instinct is sound — you need to track multiple snapshots and their corrections over time. But there’s a problem with using a business-supplied date as the multi-active attribute in the Raw Data Vault: it means trusting the source. In the Raw Data Vault, that’s a risk worth avoiding.

A more robust approach is to use the subsequence from staging as the multi-active attribute instead. The subsequence is assigned during staging and is guaranteed to be unique within each incoming batch. Because it’s unique within the batch, it’s also unique within any subgroup of that batch — including a multi-active group. If something goes wrong with it, it’s a problem you control and can fix. The snapshot date from the source then becomes what it actually is: a descriptive business timeline, added to the Satellite payload like any other descriptive attribute.

But even with this correction, the multi-active Satellite approach doesn’t cleanly solve the core problem: how do you capture a logical delete at the snapshot level — when an item disappears from a re-sent snapshot that was already loaded?

The Non-Historized Link Approach

The cleaner solution is to use a Non-Historized Link rather than a standard Link with a Satellite. Non-Historized Links are designed precisely for capturing facts coming from a source — transactions, events, and inventory levels. They sit close to the incoming data structure and are loaded incrementally using an Alternate Key.

For this scenario, the Non-Historized Link contains the following fields:

  • Product Key, Store Key, Location Key — the Hub references (hash keys)
  • Load Date Timestamp (LDTS) — when the record was received
  • Snapshot Date (SD) — the business timeline from the source, indicating when the inventory was counted
  • Snapshot Date Timestamp (SDTS) — the Data Vault snapshot date used in information delivery
  • A counter column (X) — with a value of 1 for records coming from the source, and -1 for technical counter transactions

The Alternate Key — used to identify whether a record already exists in the target — is the combination of Product Key, Store Key, Location Key, Snapshot Date, and Load Date Timestamp.

Two INSERT Statements: Inserts and Counter Transactions

Loading this Non-Historized Link uses two parallel INSERT statements, not an UPDATE pattern.

INSERT statement one loads all records from the Staging Area where the Alternate Key does not yet exist in the target. This covers both genuinely new records and new versions of corrected records — because the corrected version, with its new Load Date Timestamp, doesn’t exist in the target yet and therefore qualifies as new.

INSERT statement two handles deletions and replaced versions. It identifies records that exist in the target but are absent from the current full-load snapshot in the Staging Area. These records are either hard-deleted in the source or represent the old version of an updated record. For each of these, a technical counter transaction is inserted — an identical row with a counter value of -1 instead of 1.

This means an update in the source produces two rows in the target: a -1 counter transaction that cancels the old version, and a 1 insert for the new version. The data itself is never modified — only the structure changes. This aligns with the hard rule in Data Vault: the Raw Data Vault transforms structure, not content.

How Aggregation Reveals the True Inventory Level

The power of this pattern becomes clear at query time. By summing the counter column across all rows matching a given Alternate Key, you can determine the current state of any inventory record at any point in time.

Here’s how it plays out across several days for a single product-store-location combination:

  • Day 2: Initial snapshot for Day 1 is loaded. Counter value: 1. Sum = 1. One active inventory record.
  • Day 4: A correction arrives for Day 1. The old version gets a -1 counter transaction; the corrected version gets a 1 insert. Sum = 1 - 1 + 1 = 1. Still one active record, now at the corrected level.
  • Day 5: The record is deleted from the source for Day 1. A -1 counter transaction is inserted. Sum = 1 - 1 + 1 - 1 = 0. Zero active records. The inventory level for that snapshot date is gone.

By filtering on the Snapshot Date and aggregating the counter column, you always know exactly how many active facts exist for a given snapshot — and what the current inventory level is. By filtering on the Load Date Timestamp, you can also travel back in time and see what the inventory looked like from the perspective of any earlier date. This gives you both a corrected view of inventory history and a full audit trail of when corrections were made.

Performance and Reporting

On a column-based storage engine, this pattern performs extremely well. Aggregating a counter column across large volumes of inventory records is fast, and the resulting structure integrates cleanly with dashboards. Business users see dimension references for Product, Store, and Location; a snapshot date for time-based aggregation; the inventory level measure; and the counter column for filtering active records. The behavior is exactly what they’d expect — current inventory levels that reflect corrections, with history available on demand.

Where This Pattern Also Applies

Technical counter transactions in Non-Historized Links are not specific to inventory data. The same pattern applies anywhere you have full-load snapshot data with corrections — any scenario where an update in the source needs to be represented as a deletion of the old version and an insertion of the new one, without modifying existing rows.

The pattern can also be applied in the Business Vault, particularly in Bridge Tables, when the counter transactions involve calculated values that need to be corrected rather than raw source data. The principle is the same; the layer where it’s applied depends on whether the transformation is structural (Raw Data Vault) or involves business logic (Business Vault).

For more recordings on Non-Historized Links and technical counter transactions, the Scalefree YouTube channel has several dedicated sessions on the topic. And to master these patterns hands-on, explore our Data Vault 2.1 Training & Certification. The free Data Vault handbook — available as a physical copy or ebook — is also a great starting point for the core concepts.

Watch the Video

Unit of Work (UOW) Links in Data Vault

Unit of Work Links in Data Vault: Purpose, Strategy, and Flow.BI

As FlowBI — the GenAI tool that generates Raw Data Vault models by analyzing source data — becomes more widely used, questions about its specific modeling decisions are coming up more frequently. One of those questions is about a particular Link type that FlowBI produces: the Unit of Work Link. What is it, why does it exist, and when should you use it? This post breaks it down.



Unit of Work Links: What They Are and Why They Matter

In Data Vault modeling, a Link captures a relationship between two or more Hubs — a business event or association expressed through a combination of Business Keys. When modeling a source table that contains multiple Hub references, a natural modeling instinct is to split those relationships into separate, more focused Links. One Link for a customer-product relationship, another for an employee-customer relationship, and so on.

Splitting Links is perfectly valid in Data Vault — but it comes with a constraint. Certain Business Keys must remain together to preserve what is called the Unit of Work: the set of Business Keys that belong together as a single, consistent business event. Separate them incorrectly, and you risk breaking data consistency in your model. This is one of the more subtle but consequential mistakes a modeler can make, and it’s discussed in detail in the Data Vault training and certification curriculum.

The Unit of Work Link is the answer to that risk. It is a Link that spans all Hub references coming from a single source table — unsplit, unfiltered, capturing the full set of Business Keys exactly as they appear together in the source. Think of it as the safe baseline: a Link that guarantees the Unit of Work is preserved, no matter what splitting decisions get made on top of it.

The Human Modeler Problem — and How FlowBI Handles It

FlowBI works by analyzing source data profiles and generating the Hubs, Links, and Satellites needed to capture that data correctly in a Raw Data Vault — one that passes the JEDI test. It integrates with automation tools including Data Vault Builder, Datavault4dbt, and others.

But here’s the design challenge FlowBI had to address: it imitates a human data modeler. And human data modelers make mistakes. One of the most common is splitting Links incorrectly and inadvertently violating the Unit of Work. So the question during development was: do you try to eliminate these mistakes entirely — at the cost of no longer truly imitating human modeling behavior — or do you accept that errors can happen and build a counter-strategy?

FlowBI chose the counter-strategy. The Unit of Work Link is that strategy. By always generating a Link that preserves the full set of Hub references from a source table, FlowBI ensures there is always a valid, consistent fallback in the Raw Data Vault — even if additional, more specific Links turn out to have been split incorrectly.

The Practical Modeling Strategy Behind It

This isn’t just an AI design decision — it reflects a sound modeling practice for human modelers as well. The approach works like this:

When working with a source table that contains multiple Hub references, always create one Link that spans all of them. This is your Unit of Work Link. Then, if your understanding of the source data is strong enough — if you’ve been able to validate the relationships and run the appropriate tests — you can introduce additional, more specific Links that extract individual relationships from the source. A hierarchy Link, an employee-to-customer Link, a product-to-order Link, and so on.

The key point is that the Unit of Work Link remains. It doesn’t get replaced. If one of the more specific Links turns out to violate the Unit of Work — because the source data behaved differently than expected, or because access to production data was limited during modeling — you still have the original Link to fall back on. The model remains valid. You can retrieve the correct relationship from the Unit of Work Link and fix the problem without reloading the Raw Data Vault.

The less familiar you are with a source dataset — particularly in cases where you can’t access production data directly or can’t run comprehensive validation tests — the more valuable this approach becomes. When in doubt, don’t split. Protect the Unit of Work first.

Splitting in the Business Vault Instead

If a Unit of Work Link exists in the Raw Data Vault and you later want to split it into more focused relationships, that work belongs in the Business Vault — not the Raw Vault. You use the Unit of Work Link as the basis for either virtual or materialized Links in the Business Vault, where the split is applied.

The advantage of this approach is containment. If the split is wrong — if it turns out to violate the Unit of Work — the fix happens in the Business Vault. The Raw Data Vault doesn’t need to be reloaded. The source of truth stays intact. You correct the business logic without touching the foundation.

This is exactly the kind of separation of concerns that makes Data Vault resilient. Raw data is captured as-is, close to the source. Business logic — including relationship refinement — happens in the layer designed for it.

A Rule Worth Adopting for Any Modeler

The Unit of Work Link isn’t just a FlowBI artifact. It’s a principle any Data Vault modeler can and should apply. If you’re splitting Links in your Raw Data Vault, ask yourself: where is your Unit of Work? Is there a Link in your model that preserves the full set of Business Keys from each source table, regardless of how you’ve split them elsewhere?

If the answer is no, you’re relying on every split being correct — and on your understanding of the source data being complete. That’s a reasonable bet when you know the data well. It’s a riskier one when you don’t. The Unit of Work Link costs very little to include and provides a meaningful safety net in return.

To go deeper on Link modeling, Unit of Work concepts, and the full Data Vault methodology, explore our Data Vault certification program. And if you’re new to Data Vault, the free handbook — available as a hard copy or ebook — is a solid introduction to the core concepts.

Watch the Video

Business Analyst and Data Modeler Collaboration in Data Vault

Business Analyst and Data Modeler Collaboration in Data Vault Projects

One of the most common sources of friction in Data Vault projects isn’t technical — it’s organizational. The collaboration between Business Analysts and Data Modelers is arguably the most important working relationship in the entire delivery chain, yet it’s also one of the least clearly defined. Who does what? Where does one role end and the other begin? What information needs to change hands, and in what format? This post walks through a practical approach to structuring that collaboration, drawn from real project experience.



Why the Business Analyst and Data Modeler Collaboration Is So Critical

When Business Analysts and Data Modelers don’t collaborate effectively, the symptoms show up in the Raw Data Vault. Surrogate keys get nominated as Business Keys. Source system logic bleeds into what should be a raw, business-concept-driven model. Gaps in the information provided to modelers lead to design decisions based on assumptions rather than actual business understanding.

It’s worth clarifying one important point here: the Raw Data Vault is not where business perspectives live. Business logic, business rules, and the way the organization interprets its data — all of that belongs in the Business Vault. The Raw Data Vault should reflect the raw data as it comes from the source, structured around business concepts and Business Keys. Keeping that distinction clear is fundamental to a healthy collaboration between the two roles.

Forget the Line — Work Together

A common instinct is to draw a clean boundary: the Business Analyst works until a certain point, then hands off to the Data Modeler. In practice, this handoff model is where projects run into trouble. Information gets lost in translation. The Data Modeler receives documentation that makes sense from a business perspective but leaves key modeling questions unanswered. The Business Analyst doesn’t know what the Data Modeler actually needs.

A better approach: put everyone in the same room. Business Analysts, Data Modelers, Data Engineers, and dashboard designers all working toward the same deliverable — a report, a KPI, a business process automation. The business user doesn’t care about Data Vault; they care about the output. Build toward that output together.

This doesn’t mean everyone needs to be available full-time. But especially at the start of a project, physical or virtual co-location matters. When the Data Modeler hits a question the Business Analyst’s documentation doesn’t answer, the answer needs to be one conversation away — not a ticket in a queue.

Two additional roles are particularly valuable to have accessible during this phase: a source system specialist who knows the source data structure deeply, and a business user who can validate what’s being built against actual reporting needs. They’re typically time-constrained, so plan interactions with them carefully and make the most of the time you have.

Starting with Concept Classification

Before diving into source tables and column mappings, it pays to start at a higher level. A concept classification session — sometimes called a concept analysis — asks a deceptively simple question: what is your business model?

In a meeting with stakeholders from different departments, you map out the core business objects: customers, products, purchases, factories, whatever is central to how the business operates. You’re not focused on relationships at this stage — you’re building a vocabulary. A taxonomy of the concepts that matter to the business.

The second part of this conversation — often in the same meeting or the next one — asks: how do you identify each of these concepts? This is where it gets interesting. If you have people from finance, production, and sales in the room, you’ll typically get different answers. Finance uses an Oracle ID. Sales uses a Salesforce account key. Production uses an SAP number. Different systems, different keys, all referring to the same underlying concept.

This gives you a set of Business Key candidates. From there, you can examine the actual source data: do these keys exist in the dataset? Are they unique? Do any of them appear across multiple source systems in a way that could serve as a shared integration key? That analysis — even if limited to the data you have in front of you — is enough to identify a strong candidate and move forward. It won’t be perfect. A full analysis of every source system across the enterprise is rarely funded. But a well-reasoned candidate key is enough to start building, and it can be refined as the project progresses.

The Collaboration Spreadsheet: Simple and Effective

Once you’ve identified your concepts and Business Key candidates, the next step is mapping source tables to those concepts and classifying every column. The tool for this doesn’t need to be sophisticated — a spreadsheet works well, and works well precisely because everyone can use it.

The process looks like this: before the meeting, a developer imports the source system metadata into the sheet — column names, data types, lengths, source table. One row per column. Then, in the meeting with the business user and source system specialist, you go through each column and answer a simple question: what is this?

The annotations don’t need to be elaborate. Common classifications include:

  • Business Key — the identified key for this concept
  • Descriptive attribute — goes into a Satellite
  • Link reference — indicates a relationship to another Hub, requires a Link
  • Surrogate Key — captured as descriptive, not used as the Business Key
  • Ignore — not needed for this model

Additional classification dimensions — rate of change, security classification, privacy flags — can be added as columns in the same sheet. Satellite split decisions (which attributes group together into which Satellite) can be noted in comments. The goal is to give the developer enough context to build the metadata for the automation tool without needing another round of meetings.

The key discipline here is consistency. Keep comments patternized. The same type of note should look the same every time. A free-form comment field is useful; a completely unstructured one becomes noise.

From Spreadsheet to Automation Tool Metadata

Once the spreadsheet is complete, the developer translates it into the metadata format required by the automation tool — whether that’s Data Vault Builder, VaultSpeed, Datavault4dbt, or another platform. This translation step takes time and precision: automation tools produce exactly what their metadata specifies. Bad metadata produces bad results. But with a well-annotated spreadsheet as the source, the developer has a clear reference and can resolve most questions independently.

Some projects also require terminology translation at this stage. Source systems — especially SAP — often use abbreviated, language-specific field names that don’t belong in a Data Vault intended for a broader audience. The spreadsheet can include an English translation column, which the business user or source system specialist can complete asynchronously, keeping the meeting time focused on classification rather than translation.

Where AI Is Starting to Help

The concept classification and Business Key identification process described above is time-intensive, and it’s largely limited by how much source system analysis you can afford to fund. This is one area where AI tooling is beginning to make a difference.

Tools like FLOW.BI — developed at Scalefree — can attach to source systems, profile the data automatically, classify attributes, and identify Business Key candidates that appear across multiple systems as potential shared integration keys. The manual process described in this post becomes a validation and refinement step rather than a ground-up analysis. The fundamentals are the same; the speed is different.

The Information Requirement: Starting from the End

One final principle worth emphasizing: start with the target. Before analyzing source systems, ask what needs to be produced. What KPI needs to be calculated? What report needs to be built? What data does that require, and where does it come from?

An information requirement document — a structured template that captures what the business user wants, what they need, and where the data lives — is the ideal starting point for any new delivery. It won’t always be complete. Business users often know what they want but not where the data comes from. That’s fine. The Business Analyst and Data Modeler work together to fill in the gaps. But having even a partial information requirement is better than starting from raw source tables and working backwards.

Scalefree has published a template for information requirements on their blog — searching for “information requirement Scalefree” will bring it up — which can serve as a starting point for teams building this practice.

Making Collaboration Work in Practice

There’s no single formula for Business Analyst and Data Modeler collaboration that works across every project and every team. But a few principles hold consistently: work toward the same deliverable together, use simple tools that everyone can engage with, start from the business concept before diving into source data, and keep the meeting time focused on decisions — not documentation.

The spreadsheet approach is unglamorous. It’s also fast, inclusive, and produces the output the developer actually needs. Sometimes the best collaboration tool is the one everybody already knows how to use.

To learn more about Data Vault modeling practices, Business Key identification, and the full Raw and Business Vault methodology, explore our Data Vault 2.1 Training & Certification. And for a concise introduction to the core concepts, the free Data Vault handbook is available as a physical copy or digital download.

Watch the Video

Close Menu