Handling Zero Key References and Optional Data

Zero Key References and Optional Data in Data Vault Modeling

A nuanced modeling question came in recently about how to handle a source table that delivers relationships between two business objects, where one specific role type has no related second object — just additional attributes in the same record. The proposed model was a solid starting point, and the discussion that followed touched on several important Data Vault principles: zero key handling, CDC Satellites vs. multi-active Satellites, effectivity tracking, and why filter conditions in Raw Data Vault loading are a risk worth avoiding. This post walks through each of these in turn.



Zero Key References: Modeling Optional Relationships

The scenario involves a Link between two Hubs — Object One and Object Two — with a role type as part of the Link structure. For most role types, both Hub references are populated. For one specific role type, Object Two doesn’t exist; the source provides additional descriptive attributes instead of a foreign key.

The correct handling for the missing Object Two reference is straightforward: use the all-zeros key. When a foreign key in the source is null, the Link refers to the zero key in the Hub rather than storing an actual null. This keeps the model queryable with inner joins, avoids null-handling complexity downstream, and is entirely consistent with Data Vault null business key handling. Both Hubs should have their two zero key rows — the all-zeros key for unknown or null references, and the all-Fs key for error cases — deployed as standard practice.
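
As a minimal sketch, a staging-to-Link load might apply the zero key like this (all table and column names are hypothetical, and MD5 stands in for whatever hash function your implementation uses):

    -- Null Object Two references map to the all-zeros key instead of a stored NULL.
    INSERT INTO rdv.link_object_one_object_two
        (link_hk, object_one_hk, object_two_hk, role_type, load_dts, record_source)
    SELECT src.link_hk, src.object_one_hk, src.object_two_hk,
           src.role_type, src.load_dts, src.record_source
    FROM (
        SELECT
            MD5(object_one_bk || '||' || COALESCE(object_two_bk, '(unknown)')
                || '||' || role_type)                      AS link_hk,
            MD5(object_one_bk)                             AS object_one_hk,
            COALESCE(MD5(object_two_bk), REPEAT('0', 32))  AS object_two_hk,
            role_type, load_dts, record_source
        FROM staging.object_relationship
    ) src
    WHERE NOT EXISTS (              -- technical delta check only
        SELECT 1 FROM rdv.link_object_one_object_two tgt
        WHERE tgt.link_hk = src.link_hk
    );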

The role type sits in the Link structure alongside the two Hub references, which means it participates in the hash key computation for the Link. When the role type changes, the Link sees it as a new entry. The effectivity Satellite then captures when the old record was no longer active. That’s the expected behavior.

Multi-Active Satellites vs. CDC Satellites: Choosing the Right Approach

The proposed model used multi-active Satellites to capture multiple rows with different valid_from and valid_to dates. Whether this is the right choice depends on one key question: are those records all active at the same time in the source system? Can a source system user see all of them simultaneously?

If yes — multiple records with different validity periods are all visible and active concurrently in the source — then a multi-active Satellite is the appropriate choice. The multi-active attribute should be a subsequence from staging rather than a business-supplied date, keeping control of uniqueness on the Data Vault side rather than trusting the source.
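
A minimal sketch of that staging step, assuming a hypothetical staging view (ROW_NUMBER is standard SQL):

    -- The subsequence is generated per business key and load on the Data Vault
    -- side, not taken from a business-supplied date.
    SELECT
        MD5(customer_bk) AS hub_hk,
        load_dts,
        ROW_NUMBER() OVER (
            PARTITION BY customer_bk, load_dts
            ORDER BY valid_from, valid_to    -- any stable ordering works
        ) AS subseq,
        valid_from, valid_to                 -- now plain descriptive attributes
    FROM staging.customer_validity;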

If no — the records represent sequential changes, not concurrent active states — then a CDC Satellite is a cleaner fit. A CDC Satellite is structurally a standard Satellite, but the load date is modified by adding a sequence number from the CDC package as microseconds. This means only one row is active at any given moment, which simplifies PIT Table construction (two columns instead of three per Satellite) and improves join performance. The choice between the two comes down to how the source system actually manages these records.
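
A sketch of that load date adjustment, assuming the CDC package delivers an integer sequence number (Postgres-style interval arithmetic; all names hypothetical):

    -- Each CDC change gets a unique, ordered load date, so only one row is
    -- active at any given moment.
    INSERT INTO rdv.sat_object_cdc (hub_hk, load_dts, attr_a, attr_b, record_source)
    SELECT
        MD5(object_bk),
        load_dts + cdc_seq * INTERVAL '1 microsecond',
        attr_a, attr_b,
        record_source
    FROM staging.object_changes;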

A third alternative worth noting: for multi-active scenarios, a JSON array stored in a standard Satellite can replace a traditional multi-active Satellite in some cases. It depends on the loading mechanism and the downstream consumption requirements, but it’s a valid option that avoids the multi-active complexity entirely by capturing multiple active rows as a structured JSON payload.

Effectivity Tracking and the Status Tracking Satellite

The proposed model included a separate status tracking Satellite alongside an effectivity Satellite. A cleaner and more storage-efficient approach is to merge these by adding a deletion timestamp directly to the effectivity Satellite.

The deletion timestamp works simply: when a record exists in the source, the deletion timestamp is set to end-of-all-times. When a deletion is detected — through a comparison of the current load against the target — the deletion timestamp is updated to the current load date, marking the record as no longer physically present in the source. If the record reappears, the timestamp reverts to end-of-all-times.
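
In an insert-only implementation, the detection step might look like this sketch (names reuse the hypothetical entities from above; a production load would also exclude relationships already marked as deleted):

    -- Relationships present in the target but absent from the current full
    -- delivery get a new effectivity row carrying the deletion timestamp.
    INSERT INTO rdv.sat_link_effectivity (link_hk, load_dts, deleted_dts)
    SELECT tgt.link_hk, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
    FROM rdv.link_object_one_object_two tgt
    WHERE NOT EXISTS (
        SELECT 1
        FROM staging.object_relationship stg
        WHERE MD5(stg.object_one_bk || '||'
              || COALESCE(stg.object_two_bk, '(unknown)')
              || '||' || stg.role_type) = tgt.link_hk
    );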

All timelines — valid_from, valid_to, deletion timestamps — belong together in the effectivity Satellite. This consolidation reduces the number of Satellites to manage and makes the model more straightforward to query.

GDPR and Customer Re-Registration

A related question came up about GDPR: if a customer requests data deletion and later re-registers, are they treated as a new record? The answer is yes, and it’s an important distinction from standard soft-delete handling.

Soft deletes in the Raw Data Vault are used to track hard deletes from the source for non-legal reasons — products removed, records archived, relationships ended. The history is preserved in the Vault even when it’s gone from the source.

GDPR is different. When a deletion is legally required, the personal data must be genuinely removed — a hard delete in the target. The non-personal data associated with that customer may be retained, but the link between the old history and the re-registering customer is permanently severed. If that customer returns and creates a new record, there’s no way to reconnect them to their previous history, because that connection no longer exists in the model. This is by design: the loss of that relationship is the point.

Why Filter Conditions in Raw Data Vault Loading Are Risky

One of the more important principles raised in this discussion: never apply filter conditions when loading the Raw Data Vault. The specific question was whether it’s acceptable to filter the source by role type when loading the additional attributes Satellite — loading only rows where the role type matches the one that doesn’t reference Object Two.

The answer is no, and the reasoning is worth understanding clearly. Applying a WHERE condition or a filtering join in the Raw Data Vault loading process is an application of business logic. The Raw Data Vault is supposed to capture raw data as delivered, without interpretation. Any filter condition that depends on the content of the data — rather than purely technical checks like delta detection or null replacement — violates that principle.

The practical risk is concrete: source system behavior changes. A new role type is introduced that also lacks an Object Two reference. The filter condition doesn’t know about it, so those records get skipped. Or dirty data arrives where an unexpected combination of fields appears — both an Object Two reference and additional attributes for a role type that wasn’t supposed to have both. A filter condition handles this incorrectly or drops data silently.

The only conditions permitted in Raw Data Vault loading are technical ones: checking whether a business key or relationship already exists in the target (the delta check), and replacing null values with zero keys. Everything else — including role-type-based filtering — belongs in the Business Vault, where it can be applied as explicit, versioned, testable business logic.
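
To make the contrast concrete, here is a sketch of a Hub load that uses only the permitted technical conditions (hypothetical names, MD5 as a stand-in hash):

    INSERT INTO rdv.hub_object_two (hub_hk, object_two_bk, load_dts, record_source)
    SELECT DISTINCT
        COALESCE(MD5(stg.object_two_bk), REPEAT('0', 32)),  -- null replacement
        COALESCE(stg.object_two_bk, '(unknown)'),
        stg.load_dts,
        stg.record_source
    FROM staging.object_relationship stg
    WHERE NOT EXISTS (                                       -- delta check
        SELECT 1 FROM rdv.hub_object_two tgt
        WHERE tgt.object_two_bk = COALESCE(stg.object_two_bk, '(unknown)')
    );
    -- No role_type predicate anywhere: content-based filters belong in the
    -- Business Vault.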

Validating the Model with the JEDI Test

When uncertain about a modeling decision — whether to use a multi-active Satellite or a CDC Satellite, whether to split or consolidate — the JEDI test provides a reliable check. The test is simple: try to reconstruct the original source delivery from the Raw Data Vault. Join everything back together and verify that no records are lost, no columns are missing, and no artificial records have been generated that didn’t exist in the source.

If the reconstruction succeeds without data loss or artificial inflation, the model is valid. Whether it’s the best model depends on the data and the downstream consumption patterns — but validity is the baseline, and the JEDI test is how you prove it.
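
A sketch of the check, reusing the hypothetical entities from earlier in this post; both difference queries should return zero rows:

    WITH delivered AS (
        SELECT object_one_bk,
               COALESCE(object_two_bk, '(unknown)') AS object_two_bk,
               role_type
        FROM staging.object_relationship
    ),
    reconstructed AS (
        SELECT h1.object_one_bk, h2.object_two_bk, l.role_type
        FROM rdv.link_object_one_object_two l
        JOIN rdv.hub_object_one h1 ON h1.hub_hk = l.object_one_hk
        JOIN rdv.hub_object_two h2 ON h2.hub_hk = l.object_two_hk
    )
    -- Lost records, then artificially generated ones: both must be empty.
    SELECT * FROM (SELECT * FROM delivered     EXCEPT SELECT * FROM reconstructed) lost
    UNION ALL
    SELECT * FROM (SELECT * FROM reconstructed EXCEPT SELECT * FROM delivered) invented;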

To explore these modeling patterns in depth — including zero key handling, effectivity Satellites, CDC loading, and the JEDI test — check out our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical copy or ebook.

Watch the Video

Understanding Error Keys in Data Vault

Error Keys in Data Vault: Understanding Zero Keys and Null Business Key Handling

One of the more subtle but important concepts in Data Vault is the handling of null business keys — known as zero keys in Data Vault 2.0 and formally called null business key handling in Data Vault 2.1. Most practitioners understand the first zero key intuitively, but the second one — and where it actually earns its value — is less commonly understood. This post explains both, and where each one belongs in practice.



Error Keys Explained: The Two Zero Keys

Every Hub and Link in a Data Vault model is deployed with two special rows pre-loaded: one with a hash key of all zeros, and one with a hash key of all Fs. These are the two zero keys, and they exist to handle null business keys cleanly throughout the model.
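
Seeding them is a one-time step per entity. A sketch, assuming 32-character hexadecimal hash keys and hypothetical names:

    -- The two zero key rows, deployed with each Hub (and analogously each Link).
    INSERT INTO rdv.hub_customer (hub_hk, customer_bk, load_dts, record_source)
    VALUES
        (REPEAT('0', 32), '(unknown)', TIMESTAMP '1900-01-01 00:00:00', 'SYSTEM'),
        (REPEAT('f', 32), '(error)',   TIMESTAMP '1900-01-01 00:00:00', 'SYSTEM');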

The all-zeros hash key is the more commonly understood of the two. It replaces null values in Links — specifically, null references to business keys. When a relationship is received with a missing or null Hub reference, that null gets replaced by the all-zeros key rather than being stored as an actual null. This allows the model to rely on inner joins consistently when querying the Data Vault, without having to handle nulls case by case through left joins or null checks. When you join from a Link to a Hub, you always hit a record — either a real business key or the zero key. Clean, fast, and predictable.

The all-Fs hash key serves a distinct and more specific purpose: it marks bad data, as opposed to merely missing or ugly data. Understanding the difference between those two things is the key to understanding why two zero keys exist at all.

Ugly Data vs. Bad Data: Why the Distinction Matters

Consider a transaction record where the store reference is null. In a brick-and-mortar retail context, this seems wrong — every sale happens somewhere. But in a business that also runs an online store, a null store value might simply mean the transaction happened online. The data is incomplete by conventional standards, but it’s not incorrect. It reflects a real business scenario. This is what you might call ugly data: not ideal, not the most descriptive, but not an error.

Now consider a different scenario: the interface specification for a source system explicitly states that a particular foreign key is non-nullable. The data arrives anyway with null values in that field. Here, either the data is genuinely corrupted or the specification is wrong. Either way, something has gone wrong. This is bad data — data that shouldn’t exist in the form it arrived.

The all-zeros key handles the ugly case. The all-Fs key is reserved for the bad case. Having both allows the model to preserve the distinction rather than collapsing all null situations into a single catch-all placeholder.

Where the All-Fs Key Is Actually Used in Practice

In theory, the all-Fs key could be applied in the Raw Data Vault whenever a null value violates an interface specification. In practice, this rarely happens. Analyzing every interface description, identifying which nulls represent violations, and modifying the Raw Data Vault mappings accordingly is a significant effort — and most projects don’t invest in it at the raw layer. The all-Fs rows exist in every Hub and Link as a structural feature, but they tend to sit unused in the Raw Data Vault itself.

Where the all-Fs key genuinely earns its place is in the Business Vault and Information Marts. The pattern looks like this: during the construction of a Fact view or a Bridge Table, business logic identifies records that reference Hub keys which shouldn’t exist — store locations that were never valid, product codes that are clearly erroneous, data that passed through the raw layer but doesn’t belong in the dimensional model. Instead of passing those records through to the Dimension with a misleading or nonsensical member, the business logic replaces their hash keys with the all-Fs value.

In the resulting Dimension, those records map to an explicitly erroneous member — a designated “error” row — rather than polluting actual dimension members with bad data. Business users and analysts can see that certain facts are associated with an error case, filter them out, investigate them, or handle them according to reporting requirements. The data is quarantined and labeled, not silently dropped or mixed in with valid records.
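
A sketch of that replacement inside a Fact view (the validity check and every name here are hypothetical stand-ins for your actual business logic):

    CREATE VIEW biz.fact_sales AS
    SELECT
        l.transaction_id,
        CASE
            WHEN v.store_hk IS NULL THEN REPEAT('f', 32)  -- reroute to error member
            ELSE l.store_hk
        END AS store_hk,
        l.amount,
        l.load_dts
    FROM rdv.lnk_sale l
    LEFT JOIN biz.valid_stores v ON v.store_hk = l.store_hk;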

Ghost Records in Satellites

The zero key pattern extends to Satellites as well, through what are called ghost records. At minimum, one ghost record exists in each Satellite — associated with the all-zeros hash key — to ensure that joins from a Hub or Link to a Satellite always return a result, even for the zero key case.

In implementations using the datavault4dbt package, two ghost records are created: one for the all-zeros key and one for the all-Fs key. Beyond making the implementation consistent, this has a practical benefit in the dimensional layer. The two ghost records can carry different descriptive values — for example, “Unknown Customer” for the all-zeros case and “Erroneous Customer” for the all-Fs case. This makes the distinction visible and user-friendly in reports and dashboards, giving analysts a clear signal about what they’re looking at rather than a generic placeholder for both missing and bad data.

Because the ghost records share their hash keys with the zero keys in the parent Hub and Link, they join naturally without any special handling. It’s a side effect of the design that works elegantly in practice.

Should You Drop the All-Fs Key If You’re Not Using It?

The question occasionally comes up: if the all-Fs key isn’t being used in the Raw Data Vault, can it simply be dropped? Technically, yes. But in most implementations it stays, for a few reasons. It costs almost nothing to maintain — it’s two rows per Hub and Link. It provides a structural home for bad data classification if the need arises later. And its real value, as described above, is realized downstream in the Business Vault and Information Mart, where it’s actively useful for handling erroneous data in business logic and dimensional modeling.

Dropping it from the Raw Data Vault to save minimal overhead would mean losing a precise and semantically meaningful tool at exactly the layer where it’s most needed.

To go deeper on null business key handling, ghost records, and the full Data Vault 2.1 methodology, explore our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical or digital copy for a concise introduction to the core concepts.

Watch the Video

Data Vault in a Microservices Architecture

Microservices Architecture and Data Vault: Managing Satellites at Scale

Microservices architectures create a specific modeling challenge for Data Vault practitioners. When services are ephemeral — spinning up and down as Docker or Kubernetes containers — each with its own message structure, the standard advice to split Satellites by source system quickly leads to hundreds or thousands of Satellites. At that scale, the real question isn’t about metadata management overhead. It’s about how to consume all that data without joining 500 tables every time you need an answer. This post walks through a practical approach to handling high-volume, highly varied source structures in a Data Vault model.



Microservices Architecture: Why Satellite Splits Become a Problem

The conventional Satellite splitting rules — by rate of change, source system, security, and privacy — exist for good reasons. But in a microservices context, applying them strictly leads to an explosion of Satellites. A new Docker image with a new message structure technically deserves its own Satellite. Automate that process and you accumulate hundreds or thousands of Satellites quickly, most of which may never be queried by anyone.

The issue isn’t that databases can’t handle 500 tables — they can. The issue is the consumption side: joining 500 Satellites to produce a target model is expensive, complex to maintain, and in many cases unnecessary. The real challenge is finding a modeling approach that captures the variety of incoming structures without creating an unmanageable query layer downstream.

Rate of Change Splits: Still Relevant, but Less So for Now

The rate of change split was designed to reduce storage consumption by separating high-frequency attributes from stable ones. Every delta insert copies all columns in the Satellite, so a single change on one attribute in a wide Satellite wastes a lot of storage on unchanged data.

For most modern analytical database systems, compression makes this split largely unnecessary: insert-only tables full of redundant data compress extremely well. With compression turned on, the storage cost of skipping the rate of change split is manageable.

That said, this is worth watching. In pay-per-query environments like Athena querying row-based Avro files, or systems that charge based on uncompressed data scanned, the rate of change split becomes economically relevant again. BigQuery’s columnar storage sidesteps this because you only pay for the columns you query — but other managed infrastructure doesn’t work that way. The rate of change split isn’t obsolete; it’s just less pressing for now, and likely to become more relevant as managed, consumption-based pricing models become more common.

The Flip-Flop Effect: Why Source System Splits Still Matter

The source system split is a different matter. Loading data from two different source systems into the same Satellite creates a well-known problem: the flip-flop effect.

Consider a customer whose address is known to both an ERP system (California) and a CRM system (Hannover, Germany). The two systems have different knowledge and potentially different structures for representing the same data. If both load into the same Satellite, the Satellite ends up recording two deltas per day — not because the customer moved, but because two systems loaded sequentially with different values. The data flips between California and Hannover with every load cycle, consuming storage and making it impossible to determine the actual address without applying business logic. Worse, the order of loading determines what the Satellite shows at any given moment — a purely technical artifact with no business meaning.

The fix is straightforward: one Satellite per source. This keeps each system’s view of the data independent and equally available, so business logic in the Business Vault can reconcile them deliberately rather than having the Raw Data Vault collapse them accidentally.

The Gray Area: Millions of Sources, One Practical Solution

The flip-flop rule works cleanly when you have a manageable number of distinct source systems. It breaks down at the extreme end — IoT deployments with millions of sensors, or microservices architectures with hundreds of ephemeral containers — where creating one Satellite per source is operationally impractical.

The solution here depends on two conditions being met. First, you need a key in the parent entity that partitions the data by source — a sensor ID, a Docker image ID, a tenant ID, something that creates independent delta streams within the same Satellite. With this in place, deltas from source A can’t replace or invalidate deltas from source B, which eliminates the flip-flop effect without requiring separate Satellites. Second, the structure of the incoming data must be consistent enough to fit in a shared target — which in practice usually means JSON.

When messages from different microservices or sensors all arrive as JSON — even with different internal structures — you can load them all into a single Satellite or Non-Historized Link with a JSON or JSONB payload column. The structure differences are captured inside the JSON document. You add the partitioning key to the parent, and you’re done. Instead of 500 Satellites with 500 different schemas, you have one entity with a JSON payload and a key that tells you which source produced each record.
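
A sketch of such a consolidated entity (Postgres-style DDL; names and types are assumptions):

    CREATE TABLE rdv.sat_service_message (
        hub_hk        CHAR(32)  NOT NULL,  -- parent Hub reference
        source_id     TEXT      NOT NULL,  -- image / sensor / tenant: partitioning key
        load_dts      TIMESTAMP NOT NULL,
        hashdiff      CHAR(32)  NOT NULL,
        payload       JSONB     NOT NULL,  -- full message, whatever its structure
        record_source TEXT      NOT NULL,
        PRIMARY KEY (hub_hk, source_id, load_dts)
    );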

Non-Historized Links for Real-Time Messages

For real-time message streams from microservices, a Non-Historized Link with a JSON payload is often the right structure. Real-time messages are events — they don’t update, they accumulate. The flip-flop concern largely disappears because you’re capturing messages as they arrive, not loading full snapshots that might overwrite each other. A Non-Historized Link captures the event, the relevant Hub references, and the message payload in a structure that’s fast to load and straightforward to query.

This same pattern was applied at Scalefree for an investment banking client with 500 different source systems delivering asset data in different CSV formats. Rather than creating 500 entities, a single Non-Historized Link and Satellite captured everything — different CSV structures serialized as JSON strings, distinguished by a load source identifier. Two entities replaced 500, and the consumption layer handled the structural variety through filtering and extraction rather than joins.

Consuming Semi-Structured Data Without Joining 500 Tables

Loading everything into a JSON payload doesn’t eliminate the structural variety — it defers it to query time. When you need data from a specific message type, you need to identify records with the right structure among all the records in the same target entity.

The approach here is filtering rather than joining. Instead of joining 500 Satellites, you query one entity and filter for records that contain specific JSON keys or values that uniquely identify the message type you care about. Email messages, for example, always have a subject, body, sender, and recipient — keys that distinguish them from other message types. A specific transaction type might always carry an ID starting with a known prefix. These structural signatures let you extract subsets of the JSON stream efficiently.

Once filtered, you extract the attributes you need from each subset and UNION the results if you need to combine multiple message types. A UNION of 500 filtered queries on one table is significantly faster than a JOIN of 500 separate tables, and it scales much better as the number of source types grows.
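
A sketch of the pattern with Postgres-style JSON operators (the message signatures and names are invented for illustration):

    -- Filter one entity by structural signature instead of joining per-type tables.
    SELECT hub_hk, payload ->> 'sender' AS sender, payload ->> 'subject' AS subject
    FROM rdv.sat_service_message
    WHERE payload ? 'subject' AND payload ? 'recipient'  -- looks like an email
    UNION ALL
    SELECT hub_hk, payload ->> 'initiator', payload ->> 'txn_id'
    FROM rdv.sat_service_message
    WHERE payload ->> 'txn_id' LIKE 'TXN-%';             -- looks like a transaction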

Choosing the Right Approach for Your Context

The right answer depends on where you sit on the spectrum between a small number of structurally distinct source systems and a very large number of structurally similar ones. For a handful of systems with genuinely different schemas and different business semantics — CRM, ERP, financial systems — separate Satellites per source is the right call. The flip-flop effect and structural differences make consolidation risky and introduce business logic where it doesn’t belong.

For microservices, IoT devices, or any scenario where you have many sources with similar structures and a partitioning key available, consolidating into a small number of JSON-payload entities is usually the better trade-off. It simplifies loading, reduces metadata overhead, and keeps the consumption layer manageable — at the cost of pushing structural interpretation into filtering and extraction logic downstream.

To go deeper on Satellite design, source system splits, and Data Vault modeling patterns for modern architectures, explore our Data Vault certification program. The free Data Vault handbook is also available as a physical copy or ebook for a solid grounding in the core methodology.

Watch the Video

How Do You Model External Business Logic In Data Vault?

Modeling External Business Logic in Data Vault: APIs, Scripts, and Source System Thinking

A question that comes up regularly in Data Vault training is how to handle external business logic — specifically, what happens when your data pipeline includes a call to an external API or service that returns enriched or cleansed data. Where does that fit in the model? How do you capture the response? And how do you integrate an external script cleanly into your enterprise data platform? This post walks through a concrete example: address cleansing via an external REST API.



Modeling External Business Logic: The Full Flow

The scenario starts simply enough. You have CRM data — let’s say customer records with addresses — that gets staged and broken down into the Raw Data Vault in the usual way: Hubs for business concepts, Satellites for descriptive attributes. The raw address from the CRM system lands in a Satellite.

Now comes the complication. You need to cleanse and standardize those addresses using an external REST API. A Python script handles the call: it pulls data from the platform, formats it into the required input — perhaps a single string or a calculated key — and sends it to the external service. The service returns a JSON response with the standardized address and additional metadata.

This flow touches several layers of the Data Vault architecture, and each layer has a distinct role.

The Business Vault Prepares the API Call

Before the external call can be made, the Business Vault does preparatory work. If the REST API requires the address in a specific format or needs a calculated key, that computation belongs in the Business Vault — it’s business logic, applied to raw data, to produce the input for an external process.

The external Python script then queries this prepared data — either directly from the Business Vault or via an Interface Mart (more on that below) — and performs the REST call. The script itself may be under version control and within your organization’s control. The external service is not.

Treat the External Service as a Source System

This is the key modeling decision: because the external API is outside your control, you treat its responses exactly as you would treat any other source system. You don’t trust it implicitly. You stage its output and break it into the Raw Data Vault.

If your Raw Data Vault already has an Address Hub from the CRM dataset, and the external service returns identifiers that qualify as business keys — unique, stable identifiers for addresses — those can be added to the Address Hub. The JSON response from the API then gets captured in a Satellite in the Raw Data Vault, associated with the appropriate Hub.

This approach gives you a clean audit trail. You know exactly what the external service returned, when it returned it, and what key was used to make the call. If the external service changes its response structure or returns unexpected data, your Raw Data Vault captures that reality as-is, and your downstream Business Vault logic handles interpretation.

Handling JSON Responses: Two Practical Options

API responses typically come back as JSON — sometimes well-structured, sometimes semi-structured with varying schemas between messages. There are two main approaches for capturing this in the Raw Data Vault, and the right choice depends on how structured the response is and how many attributes you actually need.

Option 1 — Extract what you need, keep the rest as JSON. If the JSON is relatively consistent and you only need a subset of its attributes — say, five out of fifty — extract those five into relational columns in the Satellite. Keep the full JSON (or the remaining payload) as a JSON or JSONB attribute in the same Satellite. You get fast, typed access to the attributes you use regularly, and the full document is available for future needs without requiring a reload.

Option 2 — Keep everything in JSON, extract in the Business Vault. If you’re unsure which attributes you’ll need, or if the structure varies, capture the raw JSON in the Satellite and handle extraction later in the Business Vault. Technically, extracting fields from JSON is a structural transformation — a hard rule, not a business rule — so it could sit in the Raw Data Vault. But if the extraction is straightforward and tied to specific downstream calculations, doing it in the Business Vault view is a reasonable and common practice.

In practice, the hybrid approach from Option 1 is most common: extract the attributes you know you need into relational columns, keep the JSON alongside them. When a new attribute is needed later — and it will be — you can pull it directly from the JSON in your Business Vault view using native JSON functions, without touching the Raw Data Vault or reloading any data.
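
A sketch of that hybrid pattern in a Business Vault view (Postgres-style JSON access; all names assumed):

    CREATE VIEW biz.address_standardized AS
    SELECT
        address_hk,
        street, zip, city, country,                           -- extracted at load time
        response_json ->> 'geo_confidence' AS geo_confidence  -- needed later: pulled
    FROM rdv.sat_address_api_response;                        -- from the JSON, no reload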

Integrating the External Script: Dependencies and Interface Marts

When an external script queries your data platform — whether from the Raw Data Vault or the Business Vault — it creates a dependency. The entities that script relies on can’t be freely refactored without risking a broken integration. This is worth flagging explicitly in your metadata: mark those entities as part of the operational vault, indicating that external applications depend on them.

A cleaner long-term solution is to introduce an Interface Mart — a stable, versioned view layer that the external script queries instead of the Raw or Business Vault directly. When you refactor a Satellite or restructure a Business Vault entity, you update the Interface Mart view to maintain the same output structure. The external script sees no change. This decouples your internal model evolution from external integrations, which is especially valuable in organizations where multiple scripts and applications consume data from the platform.

Combining Two Sources in the Business Vault

At this point, you have two sources describing the same concept: the CRM system with the original, non-standardized address, and the external address standardizer with the cleansed version. Both are captured in the Raw Data Vault as separate source inputs. The Business Vault is where you bring them together.

The pre-computed key used to make the API call serves as the joining mechanism. Based on that key, you can establish a relationship — via a Link or a direct join in a Business Vault view — between the raw CRM address and the standardized version returned by the external service. The Business Vault then exposes the combined, cleansed address data to downstream consumers: reports, dashboards, or further downstream application scripts.

The exact modeling decisions at this stage depend heavily on how the CRM data is structured and what the business actually needs from the cleansed address. But the principle holds regardless: raw inputs from both the CRM and the external API live in the Raw Data Vault; the logic that combines and interprets them lives in the Business Vault.

A Pattern Worth Generalizing

Address cleansing is one example, but the same pattern applies to any external enrichment service: geocoding APIs, credit scoring services, entity resolution services, tax calculation engines. Whenever your pipeline includes a call to an external system that returns data you need to capture and use, the approach is the same — treat the response as a source, stage it, load it into the Raw Data Vault, and apply interpretation and combination logic in the Business Vault.

It’s also worth noting that this pattern integrates naturally into data-driven organizations where information is consumed not just through reports and dashboards but through application scripts and automated processes. The enterprise data platform becomes a hub for both analytical and operational consumers — and Data Vault’s layered architecture handles both cleanly.

To explore these patterns in depth — including Business Vault design, Interface Marts, and integrating external sources — check out our Data Vault 2.1 Training & Certification. The free Data Vault handbook is also available as a physical copy or ebook for a solid introduction to the core methodology.

Watch the Video

Capturing Changing Inventory Levels in Data Vault

Capturing Changing Inventory Levels in Data Vault with Non-Historized Links

Inventory data presents a specific challenge in Data Vault modeling: it arrives as snapshots, it can be corrected retroactively, and deletions need to be traceable back to a specific snapshot date. A user recently proposed a solution using a multi-active Snapshot Satellite — a reasonable starting point — but the question of how to handle corrections and logical deletes at the snapshot level pointed toward a cleaner approach. This post walks through the recommended pattern: a Non-Historized Link with technical counter transactions.



Capturing Changing Inventory Levels: Understanding the Source Data

The scenario is this: a source system sends inventory data as full snapshots. Each extract may contain data for multiple snapshot dates, and if a correction exists for a previously loaded snapshot, the source resends the complete data for that snapshot date. The inventory level represents the count of a product at a given location and store at the end of a specified day.

This is what Kimball calls a periodic snapshot fact — a full count of all products in a warehouse, day by day. The complication is that those snapshots can be corrected. Three days after an initial count, a miscalculation might be identified, and the inventory level for that past snapshot date needs to be updated — without affecting the two more recent snapshots.

Why the Multi-Active Satellite Approach Has Limitations

The proposed solution used a Link referencing Product, Store, and Location Hubs, with a multi-active Satellite attached. The snapshot date was nominated as the multi-active attribute, added to the Satellite’s primary key alongside the hash key and Load Date Timestamp.

The instinct is sound — you need to track multiple snapshots and their corrections over time. But there’s a problem with using a business-supplied date as the multi-active attribute in the Raw Data Vault: it means trusting the source. In the Raw Data Vault, that’s a risk worth avoiding.

A more robust approach is to use the subsequence from staging as the multi-active attribute instead. The subsequence is assigned during staging and is guaranteed to be unique within each incoming batch. Because it’s unique within the batch, it’s also unique within any subgroup of that batch — including a multi-active group. If something goes wrong with it, it’s a problem you control and can fix. The snapshot date from the source then becomes what it actually is: a descriptive business timeline, added to the Satellite payload like any other descriptive attribute.

But even with this correction, the multi-active Satellite approach doesn’t cleanly solve the core problem: how do you capture a logical delete at the snapshot level — when an item disappears from a re-sent snapshot that was already loaded?

The Non-Historized Link Approach

The cleaner solution is to use a Non-Historized Link rather than a standard Link with a Satellite. Non-Historized Links are designed precisely for capturing facts coming from a source — transactions, events, and inventory levels. They sit close to the incoming data structure and are loaded incrementally using an Alternate Key.

For this scenario, the Non-Historized Link contains the following fields:

  • Product Key, Store Key, Location Key — the Hub references (hash keys)
  • Load Date Timestamp (LDTS) — when the record was received
  • Snapshot Date (SD) — the business timeline from the source, indicating when the inventory was counted
  • Snapshot Date Timestamp (SDTS) — the Data Vault snapshot date used in information delivery
  • A counter column (X) — with a value of 1 for records coming from the source, and -1 for technical counter transactions

The Alternate Key — used to identify whether a record already exists in the target — is the combination of Product Key, Store Key, Location Key, Snapshot Date, and Load Date Timestamp.
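
Put together, the structure might look like this (Postgres-style DDL; names and types are assumptions):

    CREATE TABLE rdv.nhl_inventory (
        product_hk      CHAR(32)  NOT NULL,
        store_hk        CHAR(32)  NOT NULL,
        location_hk     CHAR(32)  NOT NULL,
        load_dts        TIMESTAMP NOT NULL,  -- LDTS
        snapshot_date   DATE      NOT NULL,  -- SD: business timeline from the source
        snapshot_dts    TIMESTAMP NOT NULL,  -- SDTS: used in information delivery
        inventory_level INTEGER   NOT NULL,
        x               SMALLINT  NOT NULL,  -- 1 from the source, -1 for counter rows
        record_source   TEXT      NOT NULL
        -- Alternate Key: (product_hk, store_hk, location_hk, snapshot_date, load_dts)
    );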

Two INSERT Statements: Inserts and Counter Transactions

Loading this Non-Historized Link uses two parallel INSERT statements, not an UPDATE pattern.

INSERT statement one loads all records from the Staging Area where the Alternate Key does not yet exist in the target. This covers both genuinely new records and new versions of corrected records — because the corrected version, with its new Load Date Timestamp, doesn’t exist in the target yet and therefore qualifies as new.

INSERT statement two handles deletions and replaced versions. It identifies records that exist in the target but are absent from the current full-load snapshot in the Staging Area. These records are either hard-deleted in the source or represent the old version of an updated record. For each of these, a technical counter transaction is inserted — an identical row with a counter value of -1 instead of 1.

This means an update in the source produces two rows in the target: a -1 counter transaction that cancels the old version, and a 1 insert for the new version. The data itself is never modified — only the structure changes. This aligns with the hard rule in Data Vault: the Raw Data Vault transforms structure, not content.
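
Sketched below against the hypothetical structure above, with two assumptions labeled in the comments: the counter statement reads the target state before the new rows land, and a redelivered snapshot date fully replaces its earlier version, which is what keeps the sum at exactly one per active fact.

    -- Statement 2: cancel every still-active version of a (re)delivered snapshot
    -- date. Hard-deleted rows are cancelled and never re-inserted; replaced rows
    -- are cancelled here and re-inserted by statement 1.
    INSERT INTO rdv.nhl_inventory
        (product_hk, store_hk, location_hk, load_dts, snapshot_date,
         snapshot_dts, inventory_level, x, record_source)
    SELECT v.product_hk, v.store_hk, v.location_hk, CURRENT_TIMESTAMP,
           v.snapshot_date, v.snapshot_dts, v.inventory_level, -1, v.record_source
    FROM (
        SELECT product_hk, store_hk, location_hk, snapshot_date,
               snapshot_dts, inventory_level, record_source
        FROM rdv.nhl_inventory
        WHERE snapshot_date IN (SELECT snapshot_date FROM staging.inventory)
        GROUP BY product_hk, store_hk, location_hk, snapshot_date,
                 snapshot_dts, inventory_level, record_source
        HAVING SUM(x) > 0                    -- not yet cancelled by a counter row
    ) v;

    -- Statement 1: anything whose Alternate Key is not yet in the target is new,
    -- covering first deliveries and corrected versions alike.
    INSERT INTO rdv.nhl_inventory
        (product_hk, store_hk, location_hk, load_dts, snapshot_date,
         snapshot_dts, inventory_level, x, record_source)
    SELECT s.product_hk, s.store_hk, s.location_hk, s.load_dts, s.snapshot_date,
           s.snapshot_dts, s.inventory_level, 1, s.record_source
    FROM staging.inventory s
    WHERE NOT EXISTS (
        SELECT 1 FROM rdv.nhl_inventory t
        WHERE t.product_hk    = s.product_hk
          AND t.store_hk      = s.store_hk
          AND t.location_hk   = s.location_hk
          AND t.snapshot_date = s.snapshot_date
          AND t.load_dts      = s.load_dts
    );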

How Aggregation Reveals the True Inventory Level

The power of this pattern becomes clear at query time. By summing the counter column across all rows matching a given Alternate Key, you can determine the current state of any inventory record at any point in time.

Here’s how it plays out across several days for a single product-store-location combination:

  • Day 2: Initial snapshot for Day 1 is loaded. Counter value: 1. Sum = 1. One active inventory record.
  • Day 4: A correction arrives for Day 1. The old version gets a -1 counter transaction; the corrected version gets a 1 insert. Sum = 1 - 1 + 1 = 1. Still one active record, now at the corrected level.
  • Day 5: The record is deleted from the source for Day 1. A -1 counter transaction is inserted. Sum = 1 - 1 + 1 - 1 = 0. Zero active records. The inventory level for that snapshot date is gone.

By filtering on the Snapshot Date and aggregating the counter column, you always know exactly how many active facts exist for a given snapshot — and what the current inventory level is. By filtering on the Load Date Timestamp, you can also travel back in time and see what the inventory looked like from the perspective of any earlier date. This gives you both a corrected view of inventory history and a full audit trail of when corrections were made.
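
A sketch of both query styles, following the structure above (the dates are invented):

    -- Current, corrected view of one snapshot date: SUM(x) counts active facts,
    -- and weighting the measure by x lets only the surviving version through.
    SELECT product_hk, store_hk, location_hk, snapshot_date,
           SUM(x)                   AS active_facts,
           SUM(inventory_level * x) AS inventory_level
    FROM rdv.nhl_inventory
    WHERE snapshot_date = DATE '2024-06-01'
    GROUP BY product_hk, store_hk, location_hk, snapshot_date
    HAVING SUM(x) > 0;

    -- Time travel: the same question as of an earlier load date.
    SELECT product_hk, store_hk, location_hk, snapshot_date,
           SUM(inventory_level * x) AS inventory_level
    FROM rdv.nhl_inventory
    WHERE snapshot_date = DATE '2024-06-01'
      AND load_dts <= TIMESTAMP '2024-06-03 23:59:59'
    GROUP BY product_hk, store_hk, location_hk, snapshot_date
    HAVING SUM(x) > 0;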

Performance and Reporting

On a column-based storage engine, this pattern performs extremely well. Aggregating a counter column across large volumes of inventory records is fast, and the resulting structure integrates cleanly with dashboards. Business users see dimension references for Product, Store, and Location; a snapshot date for time-based aggregation; the inventory level measure; and the counter column for filtering active records. The behavior is exactly what they’d expect — current inventory levels that reflect corrections, with history available on demand.

Where This Pattern Also Applies

Technical counter transactions in Non-Historized Links are not specific to inventory data. The same pattern applies anywhere you have full-load snapshot data with corrections — any scenario where an update in the source needs to be represented as a deletion of the old version and an insertion of the new one, without modifying existing rows.

The pattern can also be applied in the Business Vault, particularly in Bridge Tables, when the counter transactions involve calculated values that need to be corrected rather than raw source data. The principle is the same; the layer where it’s applied depends on whether the transformation is structural (Raw Data Vault) or involves business logic (Business Vault).

For more recordings on Non-Historized Links and technical counter transactions, the Scalefree YouTube channel has several dedicated sessions on the topic. And to master these patterns hands-on, explore our Data Vault 2.1 Training & Certification. The free Data Vault handbook — available as a physical copy or ebook — is also a great starting point for the core concepts.

Watch the Video

Unit of Work (UOW) Links in Data Vault

Unit of Work Links in Data Vault: Purpose, Strategy, and Flow.BI

As Flow.BI — the GenAI tool that generates Raw Data Vault models by analyzing source data — becomes more widely used, questions about its specific modeling decisions are coming up more frequently. One of those questions is about a particular Link type that Flow.BI produces: the Unit of Work Link. What is it, why does it exist, and when should you use it? This post breaks it down.



Unit of Work Links: What They Are and Why They Matter

In Data Vault modeling, a Link captures a relationship between two or more Hubs — a business event or association expressed through a combination of Business Keys. When modeling a source table that contains multiple Hub references, a natural modeling instinct is to split those relationships into separate, more focused Links. One Link for a customer-product relationship, another for an employee-customer relationship, and so on.

Splitting Links is perfectly valid in Data Vault — but it comes with a constraint. Certain Business Keys must remain together to preserve what is called the Unit of Work: the set of Business Keys that belong together as a single, consistent business event. Separate them incorrectly, and you risk breaking data consistency in your model. This is one of the more subtle but consequential mistakes a modeler can make, and it’s discussed in detail in the Data Vault training and certification curriculum.

The Unit of Work Link is the answer to that risk. It is a Link that spans all Hub references coming from a single source table — unsplit, unfiltered, capturing the full set of Business Keys exactly as they appear together in the source. Think of it as the safe baseline: a Link that guarantees the Unit of Work is preserved, no matter what splitting decisions get made on top of it.

The Human Modeler Problem — and How FlowBI Handles It

Flow.BI works by analyzing source data profiles and generating the Hubs, Links, and Satellites needed to capture that data correctly in a Raw Data Vault — one that passes the JEDI test. It integrates with automation tools including Data Vault Builder, datavault4dbt, and others.

But here’s the design challenge Flow.BI had to address: it imitates a human data modeler. And human data modelers make mistakes. One of the most common is splitting Links incorrectly and inadvertently violating the Unit of Work. So the question during development was: do you try to eliminate these mistakes entirely — at the cost of no longer truly imitating human modeling behavior — or do you accept that errors can happen and build a counter-strategy?

Flow.BI chose the counter-strategy. The Unit of Work Link is that strategy. By always generating a Link that preserves the full set of Hub references from a source table, Flow.BI ensures there is always a valid, consistent fallback in the Raw Data Vault — even if additional, more specific Links turn out to have been split incorrectly.

The Practical Modeling Strategy Behind It

This isn’t just an AI design decision — it reflects a sound modeling practice for human modelers as well. The approach works like this:

When working with a source table that contains multiple Hub references, always create one Link that spans all of them. This is your Unit of Work Link. Then, if your understanding of the source data is strong enough — if you’ve been able to validate the relationships and run the appropriate tests — you can introduce additional, more specific Links that extract individual relationships from the source. A hierarchy Link, an employee-to-customer Link, a product-to-order Link, and so on.

The key point is that the Unit of Work Link remains. It doesn’t get replaced. If one of the more specific Links turns out to violate the Unit of Work — because the source data behaved differently than expected, or because access to production data was limited during modeling — you still have the original Link to fall back on. The model remains valid. You can retrieve the correct relationship from the Unit of Work Link and fix the problem without reloading the Raw Data Vault.

The less familiar you are with a source dataset — particularly in cases where you can’t access production data directly or can’t run comprehensive validation tests — the more valuable this approach becomes. When in doubt, don’t split. Protect the Unit of Work first.

Splitting in the Business Vault Instead

If a Unit of Work Link exists in the Raw Data Vault and you later want to split it into more focused relationships, that work belongs in the Business Vault — not the Raw Vault. You use the Unit of Work Link as the basis for either virtual or materialized Links in the Business Vault, where the split is applied.

The advantage of this approach is containment. If the split is wrong — if it turns out to violate the Unit of Work — the fix happens in the Business Vault. The Raw Data Vault doesn’t need to be reloaded. The source of truth stays intact. You correct the business logic without touching the foundation.

This is exactly the kind of separation of concerns that makes Data Vault resilient. Raw data is captured as-is, close to the source. Business logic — including relationship refinement — happens in the layer designed for it.

A Rule Worth Adopting for Any Modeler

The Unit of Work Link isn’t just a Flow.BI artifact. It’s a principle any Data Vault modeler can and should apply. If you’re splitting Links in your Raw Data Vault, ask yourself: where is your Unit of Work? Is there a Link in your model that preserves the full set of Business Keys from each source table, regardless of how you’ve split them elsewhere?

If the answer is no, you’re relying on every split being correct — and on your understanding of the source data being complete. That’s a reasonable bet when you know the data well. It’s a riskier one when you don’t. The Unit of Work Link costs very little to include and provides a meaningful safety net in return.

To go deeper on Link modeling, Unit of Work concepts, and the full Data Vault methodology, explore our Data Vault certification program. And if you’re new to Data Vault, the free handbook — available as a hard copy or ebook — is a solid introduction to the core concepts.

Watch the Video

Refactoring a Data Vault Model

Refactoring a Data Vault Model: Options, Risks, and Best Practices

Source systems change. Columns get added or removed, structures evolve, and sometimes entire business key definitions are overhauled. When that happens to a system feeding your Data Vault, the question isn’t just technical — it’s strategic. Do you modify what you have, or do you build alongside it? This post walks through the main scenarios and the practical options available for each, along with a clear recommendation on where to draw the line between low-risk and high-risk approaches.



When a Column Changes: The Simple Case

The least disruptive scenario is a column-level change in a source table — a new attribute appears, or an existing one disappears. For this, you have a few options depending on your project constraints.

Option 1 — Modify the existing Satellite. If your project allows structural changes, you can add the new column to the existing Satellite with an ALTER TABLE statement. Historical rows will show null values for the new column before its introduction, and a log entry can record exactly when the column was added. Removing a column from a Satellite is generally not done — historical data lives in that column, and dropping it means losing that history.

Option 2 — Create a new Satellite. If you’re not allowed to touch existing structures, or simply prefer not to, you create a new Satellite to capture the new or changed attributes. This Satellite gets added to the relevant PIT Tables. The trade-off is an additional join in your queries, but the existing Satellite and its data remain completely untouched.

Option 3 — Close and replace the Satellite. A slightly more thorough approach: close the existing Satellite (stop loading it) and create a brand new one that reflects the updated structure. The new Satellite starts with a full load from the source, which means some data overlap with the old Satellite. This is handled cleanly at query time using an IIF (or CASE) expression — prefer data from the new Satellite where it exists, fall back to the old one for earlier history. The redundancy is not a problem; it resolves itself during query execution.

The bigger the structural change, the more this third option makes sense. If a source table is overhauled dramatically — many columns removed, many added — creating a fresh Satellite to capture the new shape is often the cleanest path forward.
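
A sketch of that query-time resolution, using COALESCE as a portable stand-in for IIF (the PIT structure and all names are assumed):

    -- Prefer the new Satellite where it has data; fall back to the closed one
    -- for history before the cutover.
    SELECT p.hub_hk, p.snapshot_dts,
           COALESCE(n.customer_name, o.customer_name) AS customer_name,
           COALESCE(n.email, o.email)                 AS email
    FROM biz.pit_customer p
    LEFT JOIN rdv.sat_customer_v2 n
        ON n.hub_hk = p.hub_hk AND n.load_dts = p.sat_customer_v2_ldts
    LEFT JOIN rdv.sat_customer_v1 o
        ON o.hub_hk = p.hub_hk AND o.load_dts = p.sat_customer_v1_ldts;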

When the Business Key Changes: The Complex Case

Column-level changes are manageable. Business key changes are where things get genuinely complex — and where the risk calculus shifts significantly.

A business key in Data Vault must be unique over time and across the enterprise. If the current key no longer meets that standard — say, a customer number that was once reliable is now duplicated across regions — you have a structural problem that can cascade through the model. Changing the business key means potentially changing the Hub itself, which in turn affects every Link that references that Hub, and every Satellite attached to those Hubs and Links. The impact can be wide.

At this point, you have two main strategic choices.

Option A: Keep Old and New Structures Separate

The lower-risk approach — and the one most commonly recommended — is to leave the historical Raw Data Vault exactly as it is and build a new Raw Data Vault to capture data under the new structure and key definition.

The reasoning is rooted in a core Data Vault principle: the Raw Data Vault should model data close to how the source systems use it. The business had one structure in the past and a different one going forward. That’s two different realities, and it makes sense to model them separately.

The two Raw Data Vaults then get reconciled in the Business Vault, where business logic handles the combination of old and new data. This might be straightforward — a simple union — or it might be complex, especially if field definitions have changed. For example, if an address field was previously structured (street, house number, zip, city) and is now a free-text memo field that may contain addresses from multiple countries, the logic to normalize and combine that data belongs in the Business Vault. That’s exactly what the Business Vault is designed for.

This approach carries the lowest risk. Historical data is never touched. Nothing can go wrong with data that hasn’t been moved.

Option B: Refactor the Raw Data Vault

The more ambitious option is to refactor the existing Raw Data Vault into a new version — modifying Hubs, Links, and Satellites to reflect the new structure — and then reconstruct historical data within that new model.

This is technically possible, but it comes with a hard requirement: you must be able to reconstruct every historical delivery from the new structure without any data loss. In Data Vault practice, this is validated through the JEDI test — deriving the old structures from the new ones and verifying the output matches the original data exactly. If the test passes, you can safely drop the old tables and replace them with views that expose the old structure as a backward-compatible interface.

Those views give existing queries time to continue working while users migrate. But they’re a transitional tool, not a permanent one. You’ll want to communicate a clear deprecation timeline — 90 or 180 days is typical — and give users explicit guidance on how to update their queries before the views are dropped.
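
A sketch of such a bridge view (the refactoring shown here, splitting one address column into structured fields, is invented for illustration):

    -- Exposes the old Satellite's name and shape on top of the refactored table,
    -- so existing queries keep working during the deprecation window.
    CREATE VIEW rdv.sat_customer_address AS
    SELECT hub_hk,
           load_dts,
           street || ' ' || house_number AS address_line,  -- old single-column shape
           zip,
           city,
           record_source
    FROM rdv.sat_customer_address_v2;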

A word of warning: when those views eventually get dropped, expect complaints. Not because the communication failed, but because, as a rule, nobody reads emails. Plan for it.

Handling Non-Unique Business Keys in the Interim

If a business key loses its uniqueness mid-project and a full refactoring effort will take several sprints, there’s a practical interim solution: a Record Source Tracking Satellite. This allows you to continue working with the existing model while the refactoring is planned and executed in the background. It buys time without requiring an immediate structural overhaul, and it keeps the data pipeline running cleanly during the transition.

Communication: The Overlooked Part of Refactoring

Technical decisions aside, refactoring a Data Vault model is also an organizational event. Users who query your data warehouse need to know when structures change — whether that’s a modified Satellite, a new Hub, or a deprecated view that will be removed in three months.

A simple data warehouse changelog or newsletter goes a long way. When you modify existing entities, inform users. When you introduce views as backward-compatible bridges, tell them the timeline. When the views are going away, tell them what to query instead. This isn’t just good practice — it’s the difference between a smooth migration and a flood of support tickets.

The Bottom Line on Refactoring

Data Vault is built to absorb change, and it does so gracefully at the column level. Descriptive attribute changes — new columns, removed columns, restructured Satellites — are handled with well-defined options and minimal risk. The real challenge arrives when business keys change, because the ripple effects can touch Hubs, Links, and Satellites across the model.

In those cases, the recommended approach is to preserve historical data in the original Raw Data Vault and build a new one for the new structure, using the Business Vault as the reconciliation layer. It’s the lowest-risk path, it keeps your historical data intact, and it puts complex transformation logic exactly where it belongs.

To learn more about Data Vault modeling principles, refactoring strategies, and Business Vault patterns in depth, explore our Data Vault 2.1 Training & Certification. And if you’re new to the methodology, the free Data Vault handbook is a great starting point — available as a hard copy or digital download.

Watch the Video

Source of Data for Business Vault Entities

Business Vault Entities: Using Cross-Satellite Joins for Business Logic

One of the questions that comes up regularly among Data Vault practitioners — especially those working on the Business Vault layer — is whether a Computed Satellite can draw its input data from multiple, unrelated Satellites across different Hubs. It sounds like it might break the rules. It doesn’t. In fact, it’s a very common and entirely valid pattern once you understand what actually constrains a Business Vault Satellite.

Let’s break it down.



Business Vault Computed Satellites: What Are the Real Rules?

When practitioners first encounter the Business Vault, a natural assumption is that a Computed Satellite must only use data from Satellites attached to its own parent Hub. The logic seems sound — keep things connected, keep things clean. But this is not actually a constraint in Data Vault. It’s a misconception.

The real rule for a Computed Satellite is much simpler: what do your results describe? Whatever entity your calculated output describes — that’s your parent. That’s what the Satellite attaches to.

For example, if you’re calculating customer lifetime value using data pulled from order records, transaction histories, product tables, and behavioral data — the result still describes the customer. So your Computed Satellite attaches to the Customer Hub. The source of the input data is largely irrelevant to the attachment decision. What matters is the semantic meaning of the output.

Cross-Satellite Joins: Yes, You Can

So back to the original question: can you build a Business Vault Satellite that joins Satellites from two completely different subjects — even if there’s no Link connecting their parent Hubs?

The answer is yes.

Your business rule query can pull data from anywhere in the model. You can join Satellites on the same Hub, traverse a Link into another Hub and pull Satellite data from there, use business key relationships to establish a match, or even join across entirely disconnected model areas if a descriptive relationship exists in the data itself.

The absence of a Link between two Hubs doesn’t prevent you from joining their Satellite data in a query. If there’s a way to match records — even informally through a shared descriptive attribute in a Satellite — you can use that to establish the relationship in your query logic and proceed.

When to Consider a Business Vault Link Instead

That said, there are situations where formalizing the relationship makes sense. If two Hubs have no Link between them but your business logic consistently requires joining their data, it may be worth creating a Business Link or an Exploration Link in the Business Vault to make that relationship explicit and reusable.

This approach can improve query efficiency and make the model easier to understand for future developers. However, it also adds complexity — a Business Link typically requires an Effectivity Satellite to track the validity of the relationship over time, which may or may not be worth the overhead depending on your use case.

The pragmatic answer: if it’s a one-off calculation and the join works, just write the query. If the relationship is foundational to multiple business rules, formalize it with a Link. Performance and maintainability should guide that decision, not a strict modeling rule.

The Only Real Constraints

In practice, there are two genuine constraints on what you can use as input for a Business Vault Computed Satellite:

  • Query performance: Joining across disconnected parts of your model may not always be efficient. If the join is expensive and runs frequently, optimization — through a Business Link, materialization, or other techniques — is worth considering. But it’s an engineering concern, not a modeling violation.
  • Data governance: Are you actually allowed to combine those datasets? In organizations with strict data ownership boundaries, combining Satellites from different subject areas or different data owners may require governance approval. That’s a policy question, not a technical one — but it’s one worth asking before you build.

Outside of those two factors, the query is yours to write however you need.

A Practical Example: Customer Lifetime Value

To make this concrete: imagine you need to calculate customer lifetime value (CLV) and attach the result as a Computed Satellite on the Customer Hub. Your inputs might include:

  • Customer profile data from a Satellite on the Customer Hub
  • Order totals from a Satellite on an Order Hub (connected via a Customer-Order Link)
  • Return history from a Satellite on a Returns Hub
  • Promotional sensitivity data from a completely separate marketing model with no direct Link to Customer

Even if the marketing model has no formal Link back to Customer, if there’s a way to match the records — say, via a shared customer identifier in a descriptive Satellite — you can join it. Your CLV business rule can consume all of this data in a single query. The result — a calculated CLV score — describes the customer, so the Computed Satellite attaches to the Customer Hub. Clean, valid, and practical.
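
As a rough illustration, a CLV rule like this can be written as a single query. Everything in the sketch below is an assumption for illustration: the object names, the shared customer_id match column, and the simplified scoring logic. A real business rule would be more involved.

  -- Minimal sketch: CLV inputs joined across the model, including a
  -- marketing Satellite with no Link back to Customer (all names assumed).
  -- The result describes the customer, so it loads a Computed Satellite
  -- attached to hub_customer.
  SELECT
    cust.hk_customer,
    SUM(ord_sat.order_total)     AS lifetime_revenue,
    MAX(promo.promo_sensitivity) AS promo_sensitivity  -- from the unlinked marketing model
  FROM hub_customer AS cust
  JOIN link_customer_order AS co
    ON co.hk_customer = cust.hk_customer
  JOIN sat_order AS ord_sat
    ON ord_sat.hk_order = co.hk_order
  LEFT JOIN sat_marketing_promo AS promo
    ON promo.customer_id = cust.customer_id   -- informal match via a shared identifier;
                                              -- assumes one current row per customer
  GROUP BY cust.hk_customer;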

This kind of cross-domain business logic is exactly what the Business Vault is designed for. It’s the layer where raw, integrated data gets transformed into meaningful, decision-ready information — and the flexibility to query across the model is one of its most powerful features.

Key Takeaway

When building Computed Satellites in the Business Vault, don’t let the structure of your Raw Data Vault artificially limit your business logic. The query that feeds your Satellite can draw from anywhere in the model. The only questions that matter are: what do your results describe, and is your query performant and governance-compliant?

Get those right, and the rest is just SQL.

To go deeper on Business Vault patterns, Computed Satellites, and the full Data Vault methodology, explore our Data Vault 2.1 Training & Certification — and grab the free Data Vault handbook for a concise overview of the core concepts.

Watch the Video

How to Derive Dimensions and Facts from a Data Vault Model

Patterns for Deriving Data Vault Dimensions and Facts

If you’ve gone through a Data Vault training, you know that Data Vault is a pattern-driven methodology — from loading the Raw Data Vault all the way through business logic implementation in the Business Vault. But one question that consistently comes up, especially among practitioners bridging the gap between modeling and delivery, is this: How do you actually derive Dimensions and Facts from a Data Vault model?

It’s a great question, and the good news is that just like the rest of Data Vault, there are clear, repeatable patterns for doing it well. Let’s walk through them.



The Pattern-Based Nature of Data Vault Delivery

Data Vault is not just a modeling technique — it’s an entire methodology built around patterns. There are patterns for loading the Raw Data Vault, patterns for modeling it, and patterns for implementing business logic in the Business Vault. The same is true for the information delivery layer, where you produce standard target entities like Dimensions and Facts (or 3NF entities, flat/wide schemas, and more).

Once you internalize these patterns, deriving your target entities becomes a predictable and even automatable process. That’s one of the most underappreciated strengths of Data Vault: the patterns repeat, and the more you use them, the faster and more confident you become.

Deriving Dimensions: The 80/20 Rule

Roughly 80% of your Dimensions will follow one core pattern: they are derived from a combination of a Hub, one or more Satellites, and — critically — a PIT Table (Point-In-Time Table).

The basic query pattern looks like this (a short SQL sketch follows the list):

  1. Start with the PIT Table as your base
  2. Join the Hub to retrieve the business key
  3. Join the relevant Satellites for descriptive attributes
  4. Cherry-pick the attributes you need for your Dimension
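
To make this concrete, here is a minimal SQL sketch of the pattern. It assumes a PIT Table that stores, per hash key and snapshot date, the load date of the valid delta in each Satellite; every table and column name (pit_customer, hub_customer, sat_customer_profile, and so on) is an illustrative assumption, not a prescribed standard.

  -- Minimal sketch of the PIT-based Dimension pattern (all names assumed).
  SELECT
    pit.hk_customer,
    hub.customer_number,                                 -- business key from the Hub
    prof.first_name,
    prof.last_name,
    addr.city                                            -- cherry-picked descriptive attributes
  FROM pit_customer AS pit
  JOIN hub_customer AS hub
    ON hub.hk_customer = pit.hk_customer
  LEFT JOIN sat_customer_profile AS prof
    ON prof.hk_customer = pit.hk_customer
   AND prof.load_date = pit.sat_customer_profile_ldts    -- PIT points at the valid delta
  LEFT JOIN sat_customer_address AS addr
    ON addr.hk_customer = pit.hk_customer
   AND addr.load_date = pit.sat_customer_address_ldts
  WHERE pit.snapshot_date = DATE '2024-01-01';           -- the snapshot you report on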

The remaining 20% of Dimensions come from Links rather than Hubs. In these cases, you start with a PIT Table built on a Link, join the Link itself for hash keys and Dimension references (often called Bridge Dimensions), and then join the Satellites to describe the relationship.
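
The Link-based variant follows the same shape. A sketch under the same naming assumptions, here for a hypothetical Customer-Store relationship:

  -- Minimal sketch of a Dimension derived from a Link (all names assumed).
  SELECT
    pit.hk_customer_store,                    -- Link hash key, serves as the Dimension key
    lnk.hk_customer,                          -- references into the connected Hubs
    lnk.hk_store,
    rel.relationship_type                     -- descriptive attribute of the relationship
  FROM pit_customer_store AS pit
  JOIN link_customer_store AS lnk
    ON lnk.hk_customer_store = pit.hk_customer_store
  LEFT JOIN sat_customer_store AS rel
    ON rel.hk_customer_store = pit.hk_customer_store
   AND rel.load_date = pit.sat_customer_store_ldts
  WHERE pit.snapshot_date = DATE '2024-01-01';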

In practice, this covers close to 99% of all Dimension derivations. There are always edge cases, but this two-pattern approach handles the vast majority of real-world scenarios.

Why PIT Tables? The Case for Point-In-Time

If you’re new to PIT Tables, they can look intimidating at first — a structure full of hash keys and timestamps. But once you understand the pattern, it’s actually one of the most elegant tools in the Data Vault toolkit. And here’s the key insight: in modern implementations, PIT Tables are generated by automation tooling, so you rarely build them by hand. You learn how to use them, not how to construct them from scratch.

PIT Tables are essential for several reasons:

Supporting All SCD Types

A PIT Table provides the foundation for Slowly Changing Dimension (SCD) types 0, 1, and 2 — all from the same structure. The snapshot date timestamp embedded in the PIT Table allows you to control the grain and timeline of your Dimension flexibly.

Delivering Stable Reports

Here’s a scenario every analyst will recognize: you pull a report at 9:00 AM and expect it to remain consistent throughout the day. But if your Dimension view is simply joining the latest delta from a Satellite directly, any incoming data update after 9:00 AM will silently change your report’s results.

A PIT Table solves this by identifying the latest valid delta for all hash keys at a specific snapshot timestamp — say, 9:00 AM. This decouples your information delivery from live, incoming data pipelines. Users get stable, predictable reports. The data warehouse and the source pipelines can operate independently without stepping on each other.

This is why using a PIT Table, though it may appear more complex at first, is ultimately simpler and more closely aligned with what business users actually need: consistent data at a known point in time, not real-time volatility bleeding into their dashboards.
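
To sketch the difference, using the same hypothetical names as above: a “latest delta” view recomputes its answer whenever new data arrives, while the PIT-based version pins the result to a snapshot.

  -- Unstable: always reads the newest Satellite delta, so results drift
  -- as new data loads during the day (names assumed).
  SELECT prof.*
  FROM sat_customer_profile AS prof
  WHERE prof.load_date = (SELECT MAX(p2.load_date)
                          FROM sat_customer_profile AS p2
                          WHERE p2.hk_customer = prof.hk_customer);

  -- Stable: reads the delta identified by the PIT for a fixed snapshot.
  SELECT prof.*
  FROM pit_customer AS pit
  JOIN sat_customer_profile AS prof
    ON prof.hk_customer = pit.hk_customer
   AND prof.load_date = pit.sat_customer_profile_ldts
  WHERE pit.snapshot_date = DATE '2024-01-01';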

Decoupling Delivery from Ingestion

Beyond stability, the PIT Table serves a broader architectural purpose: it creates a clean separation between the data loading process and the data delivery process. This is a hallmark of mature data warehouse design — and Data Vault makes it a first-class pattern rather than an afterthought.

Want to deepen your understanding of PIT Tables, Satellites, and the complete Data Vault modeling toolkit? Explore our Data Vault 2.1 Training & Certification to get hands-on with these patterns.

Deriving Facts: Links, Non-Historized Links, and Bridge Tables

Facts are a different story. Where Dimensions describe who and what, Facts capture events, transactions, and messages — think banking transactions, call records, IoT sensor readings, or logistics movements.

For Facts, your primary sources in a Data Vault model are:

  • Non-Historized Links (NH Links) — ideal for transactional events that don’t change
  • Dependent Child Links — for subordinate, granular transactional data
  • Bridge Tables — for pre-aggregated or grain-shifted data
  • Joined Link combinations — for cross-domain Fact derivations

If your underlying Link structure already matches the grain you need for your Fact, you can build a simple view on top of it and you’re done. Clean and efficient.

Grain Management: The Core Challenge of Fact Derivation

Things get more interesting when your source data doesn’t naturally sit at the right grain for your target Fact. This is where Bridge Tables and grain shift operations come into play.

Coarsening Grain with GROUP BY

Imagine an NH Link that captures real-time vehicle positions — five messages per second. Your Fact table only needs one position per hour. You need to reduce the granularity.

You do this with a GROUP BY operation on a subset of the Hub references within the Link. The fewer Hub references you group by, the coarser your grain. The result of this aggregation is materialized into a Bridge Table, which then becomes the base for your Fact view.
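
A sketch of such a grain shift, assuming an NH Link nh_link_vehicle_position that carries the vehicle reference, a message timestamp, and position attributes. All names are illustrative, and DATE_TRUNC is PostgreSQL/Snowflake syntax; adjust for your platform.

  -- Minimal sketch: coarsen five-messages-per-second positions to one row
  -- per vehicle and hour, materialized as a Bridge Table (names assumed).
  CREATE TABLE bridge_vehicle_position_hourly AS
  SELECT
    lnk.hk_vehicle,                                   -- the Hub reference we group by
    DATE_TRUNC('hour', lnk.message_timestamp) AS position_hour,
    AVG(lnk.latitude)  AS avg_latitude,
    AVG(lnk.longitude) AS avg_longitude,
    COUNT(*)           AS message_count
  FROM nh_link_vehicle_position AS lnk
  GROUP BY
    lnk.hk_vehicle,
    DATE_TRUNC('hour', lnk.message_timestamp);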

Refining Grain by Joining Links

The inverse is also possible. If you need a finer grain than any single Link provides, you join multiple Links on their shared Hub references. For example, if one Link connects Customer and Store, and another connects Store and Product, joining them on the shared Store reference produces a result with three Hub references — a finer granularity than either Link alone.
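
In SQL terms, with the usual caveat that all object names are illustrative assumptions:

  -- Minimal sketch: refine grain by joining two Links on the shared
  -- Store reference (all names assumed).
  SELECT
    cs.hk_customer,
    cs.hk_store,
    sp.hk_product       -- three Hub references: finer than either Link alone
  FROM link_customer_store AS cs
  JOIN link_store_product AS sp
    ON sp.hk_store = cs.hk_store;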

Other Grain Shift Operations

Beyond GROUP BY and joins, there are other grain-shifting operations worth knowing:

  • DISTINCT operations — to remove duplicate rows at a desired level of aggregation
  • JSON array expansion — splitting a JSON array into multiple rows is itself a grain shift, making rows finer (see the sketch after this list)
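
For the JSON case, here is a sketch in PostgreSQL syntax; jsonb_array_elements is PostgreSQL-specific (other platforms offer equivalents such as LATERAL FLATTEN), and the table and column names are assumptions.

  -- Minimal sketch: expand a JSON array of order items into one row per
  -- element, shifting the grain finer (PostgreSQL syntax; names assumed).
  SELECT
    lnk.hk_order,
    item.value ->> 'sku'             AS sku,
    (item.value ->> 'quantity')::int AS quantity
  FROM nh_link_order AS lnk
  CROSS JOIN LATERAL jsonb_array_elements(lnk.order_items) AS item(value);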

All of these operations can produce results that are materialized in Bridge Tables for performance and reuse, which then serve as the clean input layer for your Fact views.

PIT Tables for Snapshot Facts

There is one important exception to the “Links and Bridge Tables for Facts” rule: snapshot-based Facts. For periodic snapshots — daily inventory levels, end-of-month balances, weekly pipeline status — a PIT Table can serve as an excellent base for your Fact entity.

The catch is that PIT Tables are sometimes pruned over time to keep them manageable. If you need to preserve a long history of snapshots for your Fact, consider using a dedicated Bridge Table to maintain those snapshots independently from your operational PIT Tables. This gives you full control over retention and history without disrupting the PIT Table’s primary purpose.

Beyond Dimensions and Facts: Wider Applicability

The same patterns extend naturally to other target model types:

  • 3NF (Third Normal Form) entities — derived using similar logic from Hubs, Links, and Satellites
  • Flat and wide schemas — built by joining PIT Tables to Bridge Tables and then joining Satellites to pre-join all descriptive attributes into a single denormalized entity
  • Fully denormalized Fact entities — start with a Bridge Table or Link, join PITs, then join the relevant Satellites to fold all Dimension attributes directly into the Fact

This last pattern — the fully denormalized Fact — is particularly powerful for downstream consumers like machine learning pipelines or self-service analytics tools that benefit from wide, flat tables. And it follows the exact same building blocks.

Summary: The Patterns at a Glance

Target Entity                       | Primary Source                    | Pattern
------------------------------------|-----------------------------------|-----------------------------------------------
Standard Dimension (80%)            | PIT + Hub + Satellites            | Join PIT → Hub → Satellites, select attributes
Bridge/Relationship Dimension (20%) | PIT on Link + Satellites          | Join PIT → Link → Satellites
Transactional Fact                  | NH Links / Dependent Child Links  | View on top of Link at correct grain
Aggregated Fact                     | Bridge Table (GROUP BY result)    | Grain shift → materialize → view
Snapshot Fact                       | PIT Table or Bridge Table         | Snapshot timestamps → Fact view
Denormalized Fact                   | Bridge + PIT + Satellites         | Pre-join all descriptive attributes

Learn the Patterns, Apply Them Everywhere

One of the most satisfying aspects of working with Data Vault is that once you’ve internalized these patterns, they compound. Each new project reinforces the same building blocks: Hubs, Links, Satellites, PIT Tables, Bridge Tables. The same logic for deriving a Dimension applies to deriving a 3NF entity. The same grain management principles for Facts carry over to denormalized wide tables.

Data Vault doesn’t ask you to reinvent the wheel for every project — it gives you a set of reliable, tested patterns and asks you to apply them consistently.

If you’re ready to master these patterns in depth — including hands-on work with PIT Tables, Bridge Tables, and full information delivery models — check out our Data Vault 2.1 Training & Certification. And if you’re just getting started, there’s a free Data Vault handbook available as a concise primer covering the core concepts, pros and cons, and when to use Data Vault in your architecture.

The patterns are there. Once you learn them, they’ll follow you everywhere.

Watch the Video

Metadata Translation in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke discusses a question from the audience.

“Our EDW should use English entity names for hubs, links, and satellites. However, our sources are in a variety of languages (English and German, mostly). Where is the best place to translate everything into English?”

Michael provides guidance on maintaining consistent entity names across a multilingual landscape, exploring different translation strategies and weighing their pros and cons. Whether the translation should happen at the source level, during the ETL (Extract, Transform, Load) process, or within the EDW itself, he offers considerations to help you make an informed decision based on the specific needs and characteristics of the project.

Michael also emphasizes the importance of aligning the chosen translation strategy with the business objectives and overall goals of the data warehousing initiative. This episode offers valuable insights and best practices for handling multilingual challenges in Data Vault projects.

Why Data Vault 2.0 Is the Best Data Model for Automation

Watch the Webinar

Many data teams worry that automation won’t work on their specific data and technology stack. They’ve learned the hard way that automation doesn’t always stand up to the complexity of different source data models, taxonomies, and tech stack components.

Join this webinar to understand how Data Vault 2.0 is designed to focus on models and logic rather than complex code, and why it’s rapidly becoming the DWH standard.

We’ll explain how Data Vault has taken the best of the more traditional modeling approaches, such as Inmon or Kimball, to provide the level of abstraction, quality, and agility that automation requires.

You’ll learn how the Data Vault model, its methodology, and its architecture leverage automation, and how we use integration templates based on Data Vault standards to pave the way to fully automated data loading.

This webinar takes you from theory to practice.

Watch Webinar Recording

Webinar Agenda

1. The pros and cons of different data modeling techniques.
2. The prerequisites for automation.
3. Why Data Vault works best.
4. How to create abstractions in data warehousing.
5. Demo: how it’s applied in VaultSpeed.
