How to Capture CDC Data in Data Vault Satellites

Capture CDC Data

Capturing Change Data Capture (CDC) data in Data Vault can be tricky, especially when the source mostly sends inserts but occasionally produces duplicates or deletions. Understanding how to handle these cases ensures historical accuracy and avoids data inconsistencies in your hubs and satellites.



The Scenario

Consider the following behavior of your source system:

  • Most of the time, rows are insert-only.
  • During initial load, the same row may arrive twice (once in the bulk load and once as an insert within the same batch).
  • Deleted rows may occasionally appear.

These patterns can lead to duplicates if not handled correctly. At first glance, an insert-mostly feed might look like a case for a non-historized Link, but even then the duplicates must be managed properly.

A standard non-historized Link assumes a single row per combination of hub references. When duplicates arrive, whether from CDC or from multiple inserts during the initial load, the Link cannot naturally distinguish them, leading to primary key conflicts or overwritten data.

A common—but sometimes problematic—solution is adding counter rows to differentiate duplicates. However, this often requires a GROUP BY in the Information Mart, which can cause performance issues, particularly on non-columnar databases.

Instead of modifying the Link, the recommended approach is to handle duplicates in satellites, preserving the raw source events and their arrival order.

Step 1: Use a Satellite with a Load-Date Sequence

For each incoming batch:

  • Assign the CDC load timestamp to the first row of a given parent.
  • If multiple rows for the same parent exist in the batch, increment the timestamp by a small unit (microsecond, millisecond, or nanosecond) for each subsequent row.

This creates a unique ordering of changes while preserving the technical history, without touching the original raw data.
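
To make the sequencing concrete, here is a minimal Python sketch. The column names (customer_bk, status, load_date) and the row layout are illustrative assumptions, not a fixed convention; the point is only the per-parent microsecond offset within one batch.

    from collections import defaultdict
    from datetime import datetime, timedelta

    def sequence_load_dates(batch_rows, batch_load_ts):
        """Give every row a unique load date: the CDC batch timestamp for
        the first row of each parent, plus one microsecond per further row."""
        seen = defaultdict(int)  # rows already seen per parent key
        for row in batch_rows:
            # "customer_bk" is a hypothetical business key column
            offset = seen[row["customer_bk"]]
            row["load_date"] = batch_load_ts + timedelta(microseconds=offset)
            seen[row["customer_bk"]] += 1
        return batch_rows

    batch = [
        {"customer_bk": "C-001", "status": "bulk load"},
        {"customer_bk": "C-001", "status": "cdc insert"},  # duplicate parent
        {"customer_bk": "C-002", "status": "bulk load"},
    ]
    sequence_load_dates(batch, datetime(2024, 1, 1))
    # C-001 now carries 00:00:00.000000 and 00:00:00.000001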

Step 2: Maintain Historical Order

By adding a microsecond increment to the load date for each row:

  • The first row in the CDC batch gets the base timestamp.
  • The second row gets base timestamp + 1 microsecond, the third row +2 microseconds, etc.

This ensures the latest row carries the highest load timestamp, which can then drive Point-In-Time (PIT) tables and type-1 dimensions, where the most recent value replaces all earlier ones.
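
A sketch of that latest-row-wins read, reusing the hypothetical rows from above; in practice this selection would live in a PIT load or a view, but the logic is the same:

    def current_rows(satellite_rows):
        """Keep only the row with the highest load date per parent key."""
        latest = {}
        for row in satellite_rows:
            key = row["customer_bk"]  # hypothetical parent key column
            if key not in latest or row["load_date"] > latest[key]["load_date"]:
                latest[key] = row
        return list(latest.values())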

Step 3: Preserve Batch or CDC Metadata

If your CDC source provides a batch ID or subsequence number, include it in the satellite. This allows for:

  • Tracking which records arrived together
  • Reconstructing the technical timeline of changes

If no metadata exists, the microsecond sequencing on the load date is sufficient to order the rows.
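
A small sketch of that fallback, assuming hypothetical satellite columns cdc_batch_id and cdc_subsequence:

    def order_events(rows):
        """Order events by CDC metadata when the source provides it,
        otherwise fall back to the sequenced load date."""
        # Assumes the feed either always or never carries the metadata.
        if rows and rows[0].get("cdc_batch_id") is not None:
            return sorted(rows, key=lambda r: (r["cdc_batch_id"], r["cdc_subsequence"]))
        return sorted(rows, key=lambda r: r["load_date"])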

In rare cases, a non-historized Link may receive multiple rows for the same key combination. To handle this safely:

  • Extend the alternate key to include the load date (or other sequencing attribute) in the hash key calculation.
  • This ensures each row has a unique primary key without modifying the raw data (see the sketch after this list).
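
A minimal sketch of such a hash key calculation; the MD5 algorithm, the '||' delimiter, and the key preparation are illustrative choices, not a prescribed standard:

    import hashlib
    from datetime import datetime

    def link_hash_key(*business_keys, load_date):
        """Hash the business keys plus the load date, so duplicate key
        combinations still yield distinct Link primary keys."""
        parts = [str(k).strip().upper() for k in business_keys]
        parts.append(load_date.isoformat())  # the extra sequencing attribute
        return hashlib.md5("||".join(parts).encode("utf-8")).hexdigest()

    hk = link_hash_key("C-001", "ORD-42",
                       load_date=datetime(2024, 1, 1, 0, 0, 0, 1))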

Key points:

  • No counter rows are needed in the raw Link.
  • Duplicates are captured and preserved for historical accuracy.
  • Aggregations for reporting can be pushed into PIT or Bridge Tables, keeping query performance under control.

Optional: Bridge Tables for Performance

If your Information Mart requires grouping or deduplication and your database struggles with performance:

  • Create a Bridge Table that pre-aggregates or resolves duplicates (sketched below).
  • The Bridge Table stores only the latest row (or the aggregated result) for reporting.
  • You maintain the raw satellite history in case full lineage or historical reconstruction is needed.
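
A sketch of such a bridge build on the same hypothetical row layout: duplicates are resolved once at load time, and an event count is kept for lineage checks, so the Information Mart can read the bridge without a GROUP BY.

    def build_bridge(satellite_rows):
        """Collapse the satellite to one row per parent: the latest row
        wins, and event_count records how many raw events it replaces."""
        bridge = {}
        for row in sorted(satellite_rows, key=lambda r: r["load_date"]):
            key = row["customer_bk"]  # hypothetical parent key column
            count = bridge[key]["event_count"] + 1 if key in bridge else 1
            bridge[key] = dict(row, event_count=count)
        return list(bridge.values())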

Summary

  • Capture all incoming CDC events in a satellite, including duplicates, without modifying the raw data.
  • Use microsecond increments on the load date to order multiple rows per parent.
  • Include CDC batch metadata if available to preserve groupings and arrival order.
  • For non-historized Links receiving multiple rows, include the load date in the hash calculation.
  • Bridge Tables or PIT tables handle reporting and aggregation efficiently, while maintaining full historical traceability.

This approach preserves auditability, ensures correct historical ordering, and avoids performance issues in the Information Mart.


