Capture CDC Data
Capturing Change Data Capture (CDC) data in Data Vault can be tricky, especially when the source mostly sends inserts but occasionally produces duplicates or deletions. Understanding how to handle these cases ensures historical accuracy and avoids data inconsistencies in your hubs and satellites.
The Scenario
Consider the following behavior of your source system:
- Most of the time, rows are insert-only.
- During initial load, the same row may arrive twice (once in the bulk load and once as an insert within the same batch).
- Deleted rows may occasionally appear.
These patterns can lead to duplicates if they are not handled correctly. At first glance, this might look like a case for a non-historized Link, but the duplicates must still be managed properly.
Why Standard Non-Historized Links May Fail
A standard non-historized Link assumes a single row per combination of hub references. When duplicates arrive, whether due to CDC or to multiple inserts during the initial load, the Link cannot naturally distinguish them, leading to primary key conflicts or overwritten data.
A common—but sometimes problematic—solution is adding counter rows to differentiate duplicates. However, this often requires a GROUP BY in the Information Mart, which can cause performance issues, particularly on non-columnar databases.
Recommended Approach: Capture Technical History in Satellites
Instead of modifying the Link, the recommended approach is to handle duplicates in satellites, preserving the raw source events and their arrival order.
Step 1: Use a Satellite with a Load-Date Sequence
For each incoming batch:
- Assign the CDC load timestamp to the first row of a given parent.
- If multiple rows for the same parent exist in the batch, increment the timestamp by a small unit (microsecond, millisecond, or nanosecond) for each subsequent row.
This creates a unique ordering of changes while preserving the technical history, without touching the original raw data.
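The following Python sketch illustrates this sequencing logic. It assumes staged rows arrive as dictionaries in arrival order; the names parent_hash_key and load_date are illustrative, not a prescribed standard:

```python
from datetime import datetime, timedelta

def sequence_load_dates(batch_rows, load_ts):
    """Assign a unique, ordered load date to every row of a CDC batch."""
    seen = {}  # parent hash key -> rows already sequenced for that parent
    for row in batch_rows:
        offset = seen.get(row["parent_hash_key"], 0)
        # The first row of a parent keeps the base timestamp; every further
        # row for the same parent is shifted by one more microsecond.
        row["load_date"] = load_ts + timedelta(microseconds=offset)
        seen[row["parent_hash_key"]] = offset + 1
    return batch_rows

batch = [
    {"parent_hash_key": "A", "amount": 10},
    {"parent_hash_key": "A", "amount": 10},  # duplicate from the initial load
    {"parent_hash_key": "B", "amount": 99},
]
sequence_load_dates(batch, datetime(2024, 1, 1, 12, 0))
# Row 1 of "A": 12:00:00.000000, row 2 of "A": 12:00:00.000001
```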
Step 2: Maintain Historical Order
By adding a microsecond increment to the load date for each row:
- The first row in the CDC batch gets the base timestamp.
- The second row gets base timestamp + 1 microsecond, the third row base timestamp + 2 microseconds, and so on.
This ensures the latest row has the highest load timestamp, which can be used to drive Point-In-Time (PIT) tables and type-1 dimension replacements.
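As an illustration of how this ordering drives a PIT table, the following sketch picks, for each parent, the row that was current at a given snapshot timestamp (continuing the hypothetical column names from the sketch above). For a type-1 dimension replacement, the same logic with the snapshot set to the current time yields the latest row per parent:

```python
def pit_snapshot(satellite_rows, snapshot_ts):
    """Return, per parent, the satellite row that was current at snapshot_ts:
    the row with the highest load date at or before the snapshot."""
    current = {}
    for row in satellite_rows:
        if row["load_date"] > snapshot_ts:
            continue  # row arrived after the snapshot; not visible yet
        key = row["parent_hash_key"]
        if key not in current or row["load_date"] > current[key]["load_date"]:
            current[key] = row
    return current
```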
Step 3: Preserve Batch or CDC Metadata
If your CDC source provides a batch ID or subsequence number, include it in the satellite. This allows for:
- Tracking which records arrived together
- Reconstructing the technical timeline of changes
If no metadata exists, the microsecond sequencing on the load date is sufficient to order the rows.
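To sketch what this could look like, a satellite row might carry the optional CDC metadata as in the following example; the field names are assumptions, not a prescribed layout. When the metadata is missing, sorting by the sequenced load date alone preserves the arrival order:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SatelliteRow:
    parent_hash_key: str
    load_date: datetime                    # sequenced load date from Step 1
    cdc_batch_id: Optional[str] = None     # batch ID from the CDC source, if provided
    cdc_subsequence: Optional[int] = None  # position within the batch, if provided

def technical_order(rows):
    """Order rows along the technical timeline: by load date first, with the
    CDC subsequence as a tie-breaker when the source provides one."""
    return sorted(rows, key=lambda r: (r.load_date, r.cdc_subsequence or 0))
```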
Handling Non-Historized Links with Duplicates
In rare cases, a non-historized Link may receive multiple rows for the same key combination. To handle this safely:
- Extend the alternate key to include the load date (or another sequencing attribute) in the hash key calculation, as sketched below.
- This ensures each row has a unique primary key without modifying the raw data.
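A minimal sketch of such a hash key calculation, under the assumption that the Link hash is an MD5 over delimited business key values (a common Data Vault convention, though the hash function, delimiter, and casing rules vary by implementation):

```python
import hashlib
from datetime import datetime

def link_hash_key(business_keys, load_date):
    """Hash the business keys of the connected hubs plus the load date,
    so duplicate key combinations still produce distinct primary keys."""
    # Delimiter, upper-casing, and MD5 are illustrative choices, not a standard.
    payload = "||".join([str(k).upper() for k in business_keys] + [load_date.isoformat()])
    return hashlib.md5(payload.encode("utf-8")).hexdigest().upper()

# Two identical key combinations with different load dates hash differently:
k1 = link_hash_key(["customer-42", "order-7"], datetime(2024, 1, 1, 12, 0, 0))
k2 = link_hash_key(["customer-42", "order-7"], datetime(2024, 1, 1, 12, 0, 0, 1))
assert k1 != k2
```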
Key points:
- No need to use counter rows in the raw link.
- Duplicates are captured and preserved for historical accuracy.
- Aggregations for reporting can be handled in PIT or Bridge Tables, keeping query performance under control.
Optional: Bridge Tables for Performance
If your Information Mart requires grouping or deduplication and your database struggles with performance:
- Create a Bridge Table that pre-aggregates or resolves duplicates.
- The Bridge Table stores only the latest row (or the aggregated result) for reporting.
- You maintain the raw satellite history in case full lineage or historical reconstruction is needed.
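A minimal sketch of such a Bridge load, assuming the "latest row wins" resolution described above (the structure and names are illustrative):

```python
def build_bridge(satellite_rows):
    """Resolve duplicates into a reporting-friendly structure holding only
    the latest row per parent; the raw satellite keeps the full history."""
    bridge = {}
    for row in satellite_rows:
        key = row["parent_hash_key"]
        best = bridge.get(key)
        if best is None or row["load_date"] > best["load_date"]:
            bridge[key] = row
    return list(bridge.values())
```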
Summary
- Capture all incoming CDC events in a satellite, including duplicates, without modifying the raw data.
- Use microsecond increments on the load date to order multiple rows per parent.
- Include CDC batch metadata if available to preserve groupings and arrival order.
- For non-historized Links receiving multiple rows, include the load date in the hash calculation.
- Bridge Tables or PIT tables handle reporting and aggregation efficiently, while maintaining full historical traceability.
This approach preserves auditability, ensures correct historical ordering, and avoids performance issues in the Information Mart.
Meet the Speaker

Michael Olschimke
Michael has more than 15 years of experience in Information Technology. For the last eight years, he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
