Corrupted Loads in Data Vault
One of the foundational assumptions in Data Vault modeling is that business keys must be unique. This rule underpins how we model Hubs, Links, and Satellites. But what happens when your data doesn’t play by the rules? Specifically, what should you do when your data delivery contains multiple rows with the same business key—a situation that violates the core principles of your Raw Data Vault model?
In this article, we’ll explore practical strategies for managing corrupted data in Data Vault pipelines, focusing on maintaining auditability, consistency, and data integrity—even when upstream data delivery is flawed. We’ll also look at what to do when your business key assumptions no longer hold true.
In this article:
- Understanding the Problem: Duplicate Business Keys
- Why You Can't Ignore Corrupted Loads
- Step 1: Capture Everything in a Data Lake
- Step 2: Define Automated Data Quality Checks
- Step 3: Track Rejections and Version Control Your Checks
- What If There’s No Data Lake?
- When the Business Key Assumption Breaks
- Why You Must Refactor, Not Hack
- What About Descriptive Data Errors?
- Final Thoughts: Build Resilience Into Your Pipeline
- Watch the Video
- Meet the Speaker
Understanding the Problem: Duplicate Business Keys
Let’s start with the assumption that your Raw Data Vault is modeled around unique business keys. You’ve built Hubs, split Satellites, and established Links based on the expectation that a business key like customer_id uniquely identifies a customer.
Now, you receive a new delivery from your source system. Unexpectedly, it contains multiple rows with the same business key. This isn’t just a data quality issue—it fundamentally breaks your model. The typical loading process can no longer proceed cleanly, and worse, you risk contaminating your data warehouse with incorrect records.
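To make the problem concrete, here is a tiny, invented delivery in which the assumed business key appears twice; a simple count over the key column is enough to surface the violation (column names and values are made up for illustration):

```python
from collections import Counter

# Hypothetical sample delivery: customer_id 1002 appears twice with conflicting attributes.
delivery = [
    {"customer_id": "1001", "name": "Alice", "region": "EU"},
    {"customer_id": "1002", "name": "Bob", "region": "EU"},
    {"customer_id": "1002", "name": "Bob", "region": "US"},  # same business key, different context
]

key_counts = Counter(row["customer_id"] for row in delivery)
duplicates = {key: count for key, count in key_counts.items() if count > 1}

if duplicates:
    # The Hub expects exactly one row per business key, so this delivery cannot be loaded as-is.
    print("Duplicate business keys in delivery:", duplicates)
```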
Why You Can’t Ignore Corrupted Loads
It’s tempting to just skip the bad file or fix it manually. But in a proper Data Vault setup—particularly one that adheres to full auditability and compliance standards—this isn’t acceptable. You must be able to fully reconstruct each data delivery, even if it’s flawed. Every decision—whether to reject or load—must be trackable and justifiable.
Step 1: Capture Everything in a Data Lake
Today, many modern architectures use a data lake or Persistent Staging Area (PSA) as the first layer of data capture. This becomes your insurance policy. All incoming data—valid or corrupted—is ingested and stored here as-is, giving you a perfect record of what was delivered and when.
This approach also ensures your Raw Data Vault can skip flawed deliveries without data loss. By storing the original files in the data lake, you preserve the full delivery for later inspection, validation, or correction without halting the loading process entirely.
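As a sketch of this capture step (the landing-zone path and partitioning scheme are assumptions, not a prescribed layout), the idea is to store every delivery byte-for-byte and record enough metadata to reconstruct it later:

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path

LANDING_ZONE = Path("/data-lake/landing/crm")  # hypothetical landing location

def capture_delivery(source_file: Path) -> dict:
    """Store the delivery exactly as received and return audit metadata."""
    received_at = datetime.now(timezone.utc)
    checksum = hashlib.sha256(source_file.read_bytes()).hexdigest()

    # Partition by load date so every delivery is kept, even if it later fails validation.
    target_dir = LANDING_ZONE / received_at.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / source_file.name
    shutil.copy2(source_file, target_file)

    return {
        "file": str(target_file),
        "received_at": received_at.isoformat(),
        "sha256": checksum,  # lets you prove later exactly what was delivered
    }
```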
Step 2: Define Automated Data Quality Checks
Before data is loaded into the Raw Data Vault, it must pass validation. You can implement quality checks like:
- Is the business key unique across the delivery?
- Are column data types and lengths as expected?
- Are required fields populated?
If any of these checks fail, the entire file should be rejected—not just individual records. Why? Because partial loads introduce ambiguity and audit challenges. Instead, flag the file as failed and notify the data provider to investigate and resend a corrected version.
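A simplified sketch of such a validation gate, covering only the uniqueness and required-field checks and assuming a CSV delivery with invented column names, could look like this:

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"customer_id", "name", "region"}  # assumed delivery contract
BUSINESS_KEY = "customer_id"

def validate_delivery(file_path: Path) -> list[str]:
    """Return failure reasons; an empty list means the whole file may be loaded."""
    failures: list[str] = []
    with file_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))

    # Check 1: required fields are present and populated in every row
    for column in sorted(REQUIRED_COLUMNS):
        if any(not row.get(column) for row in rows):
            failures.append(f"required field '{column}' is missing or empty")

    # Check 2: the business key is unique across the delivery
    keys = [row.get(BUSINESS_KEY) for row in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values for business key '{BUSINESS_KEY}'")

    return failures

def process_delivery(file_path: Path) -> None:
    failures = validate_delivery(file_path)
    if failures:
        # Reject the whole file rather than loading a subset of its records,
        # and notify the data provider so a corrected version can be resent.
        print(f"Delivery {file_path.name} rejected:", failures)
    else:
        print(f"Delivery {file_path.name} passed validation and may be loaded.")
```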
Step 3: Track Rejections and Version Control Your Checks
You must keep detailed logs of every load attempt. This includes:
- Which file was loaded or rejected
- Which checks were applied
- Which check failed and why
- The version of the validation rule used
This ensures complete traceability. You can prove not just what was accepted, but also what was rejected and for what reason. This is crucial for regulatory compliance, audits, and operational transparency.
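One possible shape for such a log entry (the field names and version scheme are illustrative, not a standard) is sketched below:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LoadAttempt:
    """One row in a hypothetical load audit log."""
    file_name: str
    status: str                  # "loaded" or "rejected"
    checks_applied: list[str]
    failed_check: str | None
    failure_reason: str | None
    rule_set_version: str        # version of the validation rules in force
    attempted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Example: logging a rejection caused by a duplicate business key.
attempt = LoadAttempt(
    file_name="customers_2024-05-01.csv",        # hypothetical delivery file
    status="rejected",
    checks_applied=["required_fields", "business_key_unique", "column_types"],
    failed_check="business_key_unique",
    failure_reason="duplicate values for business key 'customer_id'",
    rule_set_version="1.4.0",
)
```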
What If There’s No Data Lake?
In some cases, you may not have a data lake. You might be working with a transient relational staging area before the Raw Data Vault. Even then, you should still store failed deliveries. A separate location or table can be used to store the raw files that failed validation. Again, auditability is key—just because data isn’t valid doesn’t mean it can disappear.
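A minimal sketch of such a quarantine step, assuming a file-based staging area with an invented rejected-files location, might look like this:

```python
import json
import shutil
from pathlib import Path

REJECTED_AREA = Path("/staging/rejected")  # hypothetical quarantine location

def quarantine_delivery(file_path: Path, failures: list[str]) -> None:
    """Keep a rejected delivery and its failure reasons instead of discarding it."""
    REJECTED_AREA.mkdir(parents=True, exist_ok=True)
    shutil.move(str(file_path), REJECTED_AREA / file_path.name)

    # Store the rejection reasons next to the file so the delivery stays auditable.
    metadata_file = REJECTED_AREA / f"{file_path.name}.rejection.json"
    metadata_file.write_text(
        json.dumps({"file": file_path.name, "failures": failures}, indent=2)
    )
```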
When the Business Key Assumption Breaks
Sometimes, you dig deeper and realize that your assumption about the business key was flawed. Maybe you thought customer_id was unique, but the source system allows multiple entries per ID for different contexts. Now what?
This is where things get more complex. You need to refactor your model. Specifically, you must modify the Hub and possibly extend the business key by combining it with another column (e.g., customer_id + region) to enforce uniqueness.
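To make the refactoring concrete, here is one common way to derive a hub hash key from the extended business key; the delimiter, normalization, and hash function shown are conventions that vary between projects, so treat this as an illustration rather than the rule:

```python
import hashlib

def hub_hash_key(*business_key_parts: str) -> str:
    """Compute a hub hash key from one or more business key parts.

    Parts are trimmed, upper-cased, and joined with a delimiter before hashing,
    following a common (but project-specific) Data Vault hashing convention.
    """
    normalized = "||".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Before the refactoring: the hub hash key is derived from customer_id alone.
old_key = hub_hash_key("1002")

# After the refactoring: customer_id and region together form the business key.
new_key = hub_hash_key("1002", "US")
```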
Why You Must Refactor, Not Hack
Some might be tempted to patch the issue using a record source tracking Satellite or other technical workaround. But this introduces long-term maintenance and performance issues. Worse, it hides the real business reality behind a technical trick.
Instead, treat the business key as the central anchor of your model. If it changes, it impacts:
- The Hub structure
- All related Satellites
- Any Links pointing to the Hub
Yes, it’s a big change. But it’s limited to a specific portion of your model and keeps your architecture clean and reliable.
What About Descriptive Data Errors?
If the corrupted data only affects descriptive attributes and not the business key, the fix is simpler. You can ingest a correction load directly into the Satellites with a backdated load date—just after the original bad load. Then, rebuild your PIT (Point-In-Time) tables. This resolves the issue for downstream consumption without any need to refactor Hubs or Links.
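As a rough sketch of such a correction load (the attribute names, timestamps, and hashing conventions are invented for illustration), the correction record reuses the Hub’s hash key and carries a load date just after the flawed load:

```python
import hashlib
from datetime import datetime, timedelta

def hashdiff(attributes: dict) -> str:
    """Hash of the descriptive attributes, used for delta detection in the Satellite."""
    normalized = "||".join(str(attributes[key]).strip().upper() for key in sorted(attributes))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

bad_load_date = datetime(2024, 5, 1, 2, 0, 0)  # load date of the flawed delivery

# The correction record is inserted with a load date just after the bad load,
# so it supersedes the flawed values without rewriting history.
correction_record = {
    "hub_hash_key": "0cc175b9c0f1b6a831c399e269772661",  # placeholder value
    "load_date": bad_load_date + timedelta(seconds=1),
    "record_source": "correction_load",
    "name": "Bob",
    "region": "US",
}
correction_record["hash_diff"] = hashdiff({"name": "Bob", "region": "US"})

# After inserting the correction into the Satellite, rebuild the PIT table so
# downstream queries pick up the corrected attribute values.
```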
Final Thoughts: Build Resilience Into Your Pipeline
Corrupted data is not an exception—it’s an eventuality. Whether it’s duplicate business keys, incorrect formats, or structural changes in the source system, your data warehouse must be prepared. The best defenses are:
- A reliable data lake or staging layer to capture raw deliveries
- Automated validation and full-file rejection logic
- Detailed auditing and version control on checks
- Clear communication with source system owners
- Willingness to refactor models when business reality shifts
Following these principles ensures your Data Vault model remains robust, scalable, and trustworthy—even in the face of corrupted loads.
Watch the Video
Meet the Speaker

Michael Olschimke
Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!