How to Capture CDC Data in Data Vault Satellites

Capture CDC Data

Capturing Change Data Capture (CDC) data in Data Vault can be tricky, especially when the source mostly sends inserts but occasionally produces duplicates or deletions. Understanding how to handle these cases ensures historical accuracy and avoids data inconsistencies in your hubs and satellites.



The Scenario

Consider the following behavior of your source system:

  • Most of the time, rows are insert-only.
  • During initial load, the same row may arrive twice (once in the bulk load and once as an insert within the same batch).
  • Deleted rows may occasionally appear.

These patterns can lead to duplicates if not handled correctly. At first glance, an insert-mostly feed might look like a case for a non-historized Link, but even then the duplicates must be managed properly.

A standard non-historized Link assumes a single row per combination of hub references. When duplicates arrive, whether from CDC or from multiple inserts during the initial load, the Link cannot naturally distinguish them, leading to primary key conflicts or overwritten data.

A common—but sometimes problematic—solution is adding counter rows to differentiate duplicates. However, this often requires a GROUP BY in the Information Mart, which can cause performance issues, particularly on non-columnar databases.

Instead of modifying the Link, the recommended approach is to handle duplicates in satellites, preserving the raw source events and their arrival order.

Step 1: Use a Satellite with a Load-Date Sequence

For each incoming batch:

  • Assign the CDC load timestamp to the first row of a given parent.
  • If multiple rows for the same parent exist in the batch, increment the timestamp by a small unit (microsecond, millisecond, or nanosecond) for each subsequent row.

This creates a unique ordering of changes while preserving the technical history, without touching the original raw data.
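
To make the sequencing concrete, here is a minimal Python sketch. The column names (customer_bk, status, load_date) and the row layout are illustrative assumptions, not a fixed convention; the point is only the per-parent microsecond offset within one batch.

    from collections import defaultdict
    from datetime import datetime, timedelta

    def sequence_load_dates(batch_rows, batch_load_ts):
        """Give every row a unique load date: the CDC batch timestamp for
        the first row of each parent, plus one microsecond per further row."""
        seen = defaultdict(int)  # rows already seen per parent key
        for row in batch_rows:
            # "customer_bk" is a hypothetical business key column
            offset = seen[row["customer_bk"]]
            row["load_date"] = batch_load_ts + timedelta(microseconds=offset)
            seen[row["customer_bk"]] += 1
        return batch_rows

    batch = [
        {"customer_bk": "C-001", "status": "bulk load"},
        {"customer_bk": "C-001", "status": "cdc insert"},  # duplicate parent
        {"customer_bk": "C-002", "status": "bulk load"},
    ]
    sequence_load_dates(batch, datetime(2024, 1, 1))
    # C-001 now carries 00:00:00.000000 and 00:00:00.000001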

Step 2: Maintain Historical Order

By adding a microsecond increment to the load date for each row:

  • The first row in the CDC batch gets the base timestamp.
  • The second row gets base timestamp + 1 microsecond, the third row +2 microseconds, etc.

This ensures the latest row carries the highest load timestamp, which can then drive Point-In-Time (PIT) tables and type-1 dimensions, where the most recent value replaces all earlier ones.
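
A sketch of that latest-row-wins read, reusing the hypothetical rows from above; in practice this selection would live in a PIT load or a view, but the logic is the same:

    def current_rows(satellite_rows):
        """Keep only the row with the highest load date per parent key."""
        latest = {}
        for row in satellite_rows:
            key = row["customer_bk"]  # hypothetical parent key column
            if key not in latest or row["load_date"] > latest[key]["load_date"]:
                latest[key] = row
        return list(latest.values())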

Step 3: Preserve Batch or CDC Metadata

If your CDC source provides a batch ID or subsequence number, include it in the satellite. This allows for:

  • Tracking which records arrived together
  • Reconstructing the technical timeline of changes

If no metadata exists, the microsecond sequencing on the load date is sufficient to order the rows.
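
A small sketch of that fallback, assuming hypothetical satellite columns cdc_batch_id and cdc_subsequence:

    def order_events(rows):
        """Order events by CDC metadata when the source provides it,
        otherwise fall back to the sequenced load date."""
        # Assumes the feed either always or never carries the metadata.
        if rows and rows[0].get("cdc_batch_id") is not None:
            return sorted(rows, key=lambda r: (r["cdc_batch_id"], r["cdc_subsequence"]))
        return sorted(rows, key=lambda r: r["load_date"])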

In rare cases, a non-historized Link may receive multiple rows for the same key combination. To handle this safely:

  • Extend the alternate key to include the load date (or other sequencing attribute) in the hash key calculation.
  • This ensures each row has a unique primary key without modifying the raw data (see the sketch after this list).
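
A minimal sketch of such a hash key calculation; the MD5 algorithm, the '||' delimiter, and the key preparation are illustrative choices, not a prescribed standard:

    import hashlib
    from datetime import datetime

    def link_hash_key(*business_keys, load_date):
        """Hash the business keys plus the load date, so duplicate key
        combinations still yield distinct Link primary keys."""
        parts = [str(k).strip().upper() for k in business_keys]
        parts.append(load_date.isoformat())  # the extra sequencing attribute
        return hashlib.md5("||".join(parts).encode("utf-8")).hexdigest()

    hk = link_hash_key("C-001", "ORD-42",
                       load_date=datetime(2024, 1, 1, 0, 0, 0, 1))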

Key points:

  • No counter rows are needed in the raw Link.
  • Duplicates are captured and preserved for historical accuracy.
  • Aggregations for reporting can be pushed into PIT or Bridge Tables, keeping query performance under control.

Optional: Bridge Tables for Performance

If your Information Mart requires grouping or deduplication and your database struggles with performance:

  • Create a Bridge Table that pre-aggregates or resolves duplicates (sketched below).
  • The Bridge Table stores only the latest row (or the aggregated result) for reporting.
  • You maintain the raw satellite history in case full lineage or historical reconstruction is needed.
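
A sketch of such a bridge build on the same hypothetical row layout: duplicates are resolved once at load time, and an event count is kept for lineage checks, so the Information Mart can read the bridge without a GROUP BY.

    def build_bridge(satellite_rows):
        """Collapse the satellite to one row per parent: the latest row
        wins, and event_count records how many raw events it replaces."""
        bridge = {}
        for row in sorted(satellite_rows, key=lambda r: r["load_date"]):
            key = row["customer_bk"]  # hypothetical parent key column
            count = bridge[key]["event_count"] + 1 if key in bridge else 1
            bridge[key] = dict(row, event_count=count)
        return list(bridge.values())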

Summary

  • Capture all incoming CDC events in a satellite, including duplicates, without modifying the raw data.
  • Use microsecond increments on the load date to order multiple rows per parent.
  • Include CDC batch metadata if available to preserve groupings and arrival order.
  • For non-historized Links receiving multiple rows, include the load date in the hash calculation.
  • Bridge Tables or PIT tables handle reporting and aggregation efficiently, while maintaining full historical traceability.

This approach preserves auditability, ensures correct historical ordering, and avoids performance issues in the Information Mart.


