Corrupted Loads in Data Vault
One of the foundational assumptions in Data Vault modeling is that business keys must be unique. This rule underpins how we model Hubs, Links, and Satellites. But what happens when your data doesn’t play by the rules? Specifically, what should you do when your data delivery contains multiple rows with the same business key—a situation that violates the core principles of your Raw Data Vault model?
In this article, we’ll explore practical strategies for managing corrupted data in Data Vault pipelines, focusing on maintaining auditability, consistency, and data integrity—even when upstream data delivery is flawed. We’ll also look at what to do when your business key assumptions no longer hold true.
In this article:
- Understanding the Problem: Duplicate Business Keys
- Why You Can't Ignore Corrupted Loads
- Step 1: Capture Everything in a Data Lake
- Step 2: Define Automated Data Quality Checks
- Step 3: Track Rejections and Version Control Your Checks
- What If There’s No Data Lake?
- When the Business Key Assumption Breaks
- Why You Must Refactor, Not Hack
- What About Descriptive Data Errors?
- Final Thoughts: Build Resilience Into Your Pipeline
- Watch the Video
- Meet the Speaker
Understanding the Problem: Duplicate Business Keys
Let’s start with the assumption that your Raw Data Vault is modeled around unique business keys. You’ve built Hubs, split Satellites, and established Links based on the expectation that a business key like customer_id uniquely identifies a customer.
Now, you receive a new delivery from your source system. Unexpectedly, it contains multiple rows with the same business key. This isn’t just a data quality issue—it fundamentally breaks your model. The typical loading process can no longer proceed cleanly, and worse, you risk contaminating your data warehouse with incorrect records.
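To make the problem concrete, here is a tiny, invented delivery in which the assumed business key appears twice; a simple count over the key column is enough to surface the violation (column names and values are made up for illustration):

```python
from collections import Counter

# Hypothetical sample delivery: customer_id 1002 appears twice with conflicting attributes.
delivery = [
    {"customer_id": "1001", "name": "Alice", "region": "EU"},
    {"customer_id": "1002", "name": "Bob", "region": "EU"},
    {"customer_id": "1002", "name": "Bob", "region": "US"},  # same business key, different context
]

key_counts = Counter(row["customer_id"] for row in delivery)
duplicates = {key: count for key, count in key_counts.items() if count > 1}

if duplicates:
    # The Hub expects exactly one row per business key, so this delivery cannot be loaded as-is.
    print("Duplicate business keys in delivery:", duplicates)
```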
Why You Can’t Ignore Corrupted Loads
It’s tempting to just skip the bad file or fix it manually. But in a proper Data Vault setup—particularly one that adheres to full auditability and compliance standards—this isn’t acceptable. You must be able to fully reconstruct each data delivery, even if it’s flawed. Every decision—whether to reject or load—must be trackable and justifiable.
Step 1: Capture Everything in a Data Lake
Today, many modern architectures use a data lake or Persistent Staging Area (PSA) as the first layer of data capture. This becomes your insurance policy. All incoming data—valid or corrupted—is ingested and stored here as-is, giving you a perfect record of what was delivered and when.
This approach also ensures your Raw Data Vault can skip flawed deliveries without data loss. By storing the original files in the data lake, you preserve the full delivery for later inspection, validation, or correction without halting the loading process entirely.
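As a sketch of this capture step (the landing-zone path and partitioning scheme are assumptions, not a prescribed layout), the idea is to store every delivery byte-for-byte and record enough metadata to reconstruct it later:

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path

LANDING_ZONE = Path("/data-lake/landing/crm")  # hypothetical landing location

def capture_delivery(source_file: Path) -> dict:
    """Store the delivery exactly as received and return audit metadata."""
    received_at = datetime.now(timezone.utc)
    checksum = hashlib.sha256(source_file.read_bytes()).hexdigest()

    # Partition by load date so every delivery is kept, even if it later fails validation.
    target_dir = LANDING_ZONE / received_at.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / source_file.name
    shutil.copy2(source_file, target_file)

    return {
        "file": str(target_file),
        "received_at": received_at.isoformat(),
        "sha256": checksum,  # lets you prove later exactly what was delivered
    }
```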
Step 2: Define Automated Data Quality Checks
Before data is loaded into the Raw Data Vault, it must pass validation. You can implement quality checks like:
- Is the business key unique across the delivery?
- Are column data types and lengths as expected?
- Are required fields populated?
If any of these checks fail, the entire file should be rejected—not just individual records. Why? Because partial loads introduce ambiguity and audit challenges. Instead, flag the file as failed and notify the data provider to investigate and resend a corrected version.
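A simplified sketch of such a validation gate, covering only the uniqueness and required-field checks and assuming a CSV delivery with invented column names, could look like this:

```python
import csv
from pathlib import Path

REQUIRED_COLUMNS = {"customer_id", "name", "region"}  # assumed delivery contract
BUSINESS_KEY = "customer_id"

def validate_delivery(file_path: Path) -> list[str]:
    """Return failure reasons; an empty list means the whole file may be loaded."""
    failures: list[str] = []
    with file_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))

    # Check 1: required fields are present and populated in every row
    for column in sorted(REQUIRED_COLUMNS):
        if any(not row.get(column) for row in rows):
            failures.append(f"required field '{column}' is missing or empty")

    # Check 2: the business key is unique across the delivery
    keys = [row.get(BUSINESS_KEY) for row in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values for business key '{BUSINESS_KEY}'")

    return failures

def process_delivery(file_path: Path) -> None:
    failures = validate_delivery(file_path)
    if failures:
        # Reject the whole file rather than loading a subset of its records,
        # and notify the data provider so a corrected version can be resent.
        print(f"Delivery {file_path.name} rejected:", failures)
    else:
        print(f"Delivery {file_path.name} passed validation and may be loaded.")
```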
Step 3: Track Rejections and Version Control Your Checks
You must keep detailed logs of every load attempt. This includes:
- Which file was loaded or rejected
- Which checks were applied
- Which check failed and why
- The version of the validation rule used
This ensures complete traceability. You can prove not just what was accepted, but also what was rejected and for what reason. This is crucial for regulatory compliance, audits, and operational transparency.
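One possible shape for such a log entry (the field names and version scheme are illustrative, not a standard) is sketched below:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LoadAttempt:
    """One row in a hypothetical load audit log."""
    file_name: str
    status: str                  # "loaded" or "rejected"
    checks_applied: list[str]
    failed_check: str | None
    failure_reason: str | None
    rule_set_version: str        # version of the validation rules in force
    attempted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Example: logging a rejection caused by a duplicate business key.
attempt = LoadAttempt(
    file_name="customers_2024-05-01.csv",        # hypothetical delivery file
    status="rejected",
    checks_applied=["required_fields", "business_key_unique", "column_types"],
    failed_check="business_key_unique",
    failure_reason="duplicate values for business key 'customer_id'",
    rule_set_version="1.4.0",
)
```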
What If There’s No Data Lake?
In some cases, you may not have a data lake. You might be working with a transient relational staging area before the Raw Data Vault. Even then, you should still store failed deliveries. A separate location or table can be used to store the raw files that failed validation. Again, auditability is key—just because data isn’t valid doesn’t mean it can disappear.
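A minimal sketch of such a quarantine step, assuming a file-based staging area with an invented rejected-files location, might look like this:

```python
import json
import shutil
from pathlib import Path

REJECTED_AREA = Path("/staging/rejected")  # hypothetical quarantine location

def quarantine_delivery(file_path: Path, failures: list[str]) -> None:
    """Keep a rejected delivery and its failure reasons instead of discarding it."""
    REJECTED_AREA.mkdir(parents=True, exist_ok=True)
    shutil.move(str(file_path), REJECTED_AREA / file_path.name)

    # Store the rejection reasons next to the file so the delivery stays auditable.
    metadata_file = REJECTED_AREA / f"{file_path.name}.rejection.json"
    metadata_file.write_text(
        json.dumps({"file": file_path.name, "failures": failures}, indent=2)
    )
```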
When the Business Key Assumption Breaks
Sometimes, you dig deeper and realize that your assumption about the business key was flawed. Maybe you thought customer_id was unique, but the source system allows multiple entries per ID for different contexts. Now what?
This is where things get more complex. You need to refactor your model. Specifically, you must modify the Hub and possibly extend the business key by combining it with another column (e.g., customer_id + region) to enforce uniqueness.
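To make the refactoring concrete, here is one common way to derive a hub hash key from the extended business key; the delimiter, normalization, and hash function shown are conventions that vary between projects, so treat this as an illustration rather than the rule:

```python
import hashlib

def hub_hash_key(*business_key_parts: str) -> str:
    """Compute a hub hash key from one or more business key parts.

    Parts are trimmed, upper-cased, and joined with a delimiter before hashing,
    following a common (but project-specific) Data Vault hashing convention.
    """
    normalized = "||".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Before the refactoring: the hub hash key is derived from customer_id alone.
old_key = hub_hash_key("1002")

# After the refactoring: customer_id and region together form the business key.
new_key = hub_hash_key("1002", "US")
```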
Why You Must Refactor, Not Hack
Some might be tempted to patch the issue using a record source tracking Satellite or other technical workaround. But this introduces long-term maintenance and performance issues. Worse, it hides the real business reality behind a technical trick.
Instead, treat the business key as the central anchor of your model. If it changes, it impacts:
- The Hub structure
- All related Satellites
- Any Links pointing to the Hub
Yes, it’s a big change. But it’s limited to a specific portion of your model and keeps your architecture clean and reliable.
What About Descriptive Data Errors?
If the corrupted data only affects descriptive attributes and not the business key, the fix is simpler. You can ingest a correction load directly into the Satellites with a backdated load date—just after the original bad load. Then, rebuild your PIT (Point-In-Time) tables. This resolves the issue for downstream consumption without any need to refactor Hubs or Links.
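As a rough sketch of such a correction load (the attribute names, timestamps, and hashing conventions are invented for illustration), the correction record reuses the Hub’s hash key and carries a load date just after the flawed load:

```python
import hashlib
from datetime import datetime, timedelta

def hashdiff(attributes: dict) -> str:
    """Hash of the descriptive attributes, used for delta detection in the Satellite."""
    normalized = "||".join(str(attributes[key]).strip().upper() for key in sorted(attributes))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

bad_load_date = datetime(2024, 5, 1, 2, 0, 0)  # load date of the flawed delivery

# The correction record is inserted with a load date just after the bad load,
# so it supersedes the flawed values without rewriting history.
correction_record = {
    "hub_hash_key": "0cc175b9c0f1b6a831c399e269772661",  # placeholder value
    "load_date": bad_load_date + timedelta(seconds=1),
    "record_source": "correction_load",
    "name": "Bob",
    "region": "US",
}
correction_record["hash_diff"] = hashdiff({"name": "Bob", "region": "US"})

# After inserting the correction into the Satellite, rebuild the PIT table so
# downstream queries pick up the corrected attribute values.
```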
Final Thoughts: Build Resilience Into Your Pipeline
Corrupted data is not an exception—it’s an eventuality. Whether it’s duplicate business keys, incorrect formats, or structural changes in the source system, your data warehouse must be prepared. The best defenses are:
- A reliable data lake or staging layer to capture raw deliveries
- Automated validation and full-file rejection logic
- Detailed auditing and version control on checks
- Clear communication with source system owners
- Willingness to refactor models when business reality shifts
Following these principles ensures your Data Vault model remains robust, scalable, and trustworthy—even in the face of corrupted loads.
Watch the Video
Meet the Speaker

Michael Olschimke
Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!