Skip to main content
search
0
All Posts By

Michael Olschimke

Michael Olschimke is the Co-Founder and CEO of Scalefree and a "Data Vault 2.0 Pioneer" with over 20 years of IT experience. A Fulbright scholar and co-author of Building a Scalable Data Warehouse with Data Vault 2.0, Michael is a global authority on AI, Big Data, and scalable Lakehouse design across sectors like banking, automotive, and state security.

Pre-Joining Data Vault Business Keys During Load

Data Vault Business Keys

Data Vault architects often encounter a common challenge when their source systems mix surrogate technical IDs with true business keys. In one real-world scenario, an Employee table used a technical ID as its primary key, while the legitimate business identifier was NBR. Downstream tables—even a self-referencing manager hierarchy—used that ID as a foreign key. When building Hubs on both ID and NBR and linking them with a Same-As Link, the result resembles a Source Vault design—and it quickly becomes unwieldy to repeat ID→NBR lookups for every referencing table.

In this article, we’ll explore the best practice for handling mismatched keys: pre-joining business keys in your staging (or view) layer before loading the Raw Data Vault. You’ll learn why this denormalization is fully compliant with Data Vault principles, how to implement it (even with CDC feeds), and when you might need alternate approaches.



Why Separate Business Keys from Technical IDs?

Surrogate keys (ID) offer stable integer references, simple indexing, and isolation from business rule changes. But they aren’t meaningful outside the operational schema. True business keys (NBR)—like employee numbers, order numbers, or product SKUs—carry real-world meaning and ensure consistency across downstream BI and analytics models.

  • Business Key (NBR): Immutable identifier used in reporting, cross-system integration, and audit.
  • Technical ID (ID): Auto-generated surrogate for OLTP performance and referential integrity.

When you build a Hub on the business key, all Links and Satellites must reference that same key. Mixing in surrogate IDs without conversion violates business lineage and forces repetitive lookups.

The Pitfall: Source Vault by Accident

Faced with source tables referencing ID, some teams create:

  1. A Hub on ID (surrogate),
  2. A Hub on NBR (business key),
  3. A Same-As Link between them to tie ID↔NBR.

This “Source Vault” pattern captures technical IDs as though they were business keys—contradicting the principle that your Vault’s integration key must be a shared business identifier. Moreover, every time you load any Link or Satellite that uses ID, you must look up NBR via the Same-As Link. Tedious, error-prone, and defeating the agility of your Data Vault.

Best Practice: Pre-Join Business Keys in Staging

Data Vault training explicitly allows you to reshape your staging area—denormalizing or normalizing source data to simplify Raw Vault loads. Pre-joining means: before your load process begins, enrich every source record with the true business key (NBR) rather than the surrogate ID. You then feed the Hub/Satellite/Link loaders with business keys directly—no Same-As Link gymnastics required.

How It Works

  1. Create a view or staging query that joins your Employee table (on ID) to itself or to the hierarchy table to retrieve NBR. Add the NBR field into every downstream staging record.
  2. Use that pre-joined staging view as the source for your Data Vault loaders. All Hubs, Links, and Satellites can now reference NBR consistently.
  3. Drop the accidental Hub on ID and Same-As Link—your Vault only contains the true business key Hub (Hub_Employee on NBR).

This approach turns the repeated lookup problem into a one-time denormalization, improving performance and maintainability.

Tool Support: dbt & FlowBI

Modern Data Vault toolkits recognize this pattern:

  • datavault4dbt package: Offers a pre_join feature to automatically enrich staging tables with business keys.
  • FlowBI: Includes configuration options to map surrogate IDs to business keys before Vault loads.

When your staging area resides on a cloud data lake (e.g., AWS S3 + Redshift Spectrum, Azure Data Lake + Synapse), these tools can reference external tables and materialize pre-joined views seamlessly.

Handling CDC & Empty Deltas with Forward Lookup

Change Data Capture (CDC) introduces a nuance: sometimes the Employee record doesn’t change (NBR remains the same), but the hierarchy table (manager assignment) does. A pure inner-join staging view would omit the hierarchy change because no new employee row appeared.

Forward Lookup Pattern

  1. When your CDC batch contains only hierarchy changes, load your staging view with the ID column but no accompanying NBR.
  2. Instead of joining to the source Employee table (which has no new row), perform a forward lookup against the target Satellite in your Raw Vault. That Satellite already stores every historic mapping of ID→NBR.
  3. Retrieve the latest NBR value for each ID and inject it into your staging records—just as if you had joined to the source.

This assumes your CDC infrastructure and initial loads correctly populated the Satellite. If CDC reliability is questionable, you may need to capture ID first and resolve to NBR later in a Business Vault layer.

Caveats: Pre-Join Within a Single Source Only

While you can (and should) pre-join within one source system, avoid chaining pre-joins across multiple systems. If you first wait for System A’s staging load to produce NBR from its ID, and then join System B’s staging to A’s data, you introduce cross-system load dependencies. That forces you to serialize loads—waiting for one system’s batch to finish before you can process another.

To maintain parallel ingestion, each source should be pre-joined only to its own business keys. If two systems share a business key, let that intersection happen downstream in your Vault (via Hub on the shared business key), not in the staging layer.

When You Can’t Pre-Join: Source Vault as Last Resort

In rare environments where you cannot reliably pre-join—legacy databases with locked-down permissions or untrustworthy CDC—you may fall back to a Source Vault. In this design:

  • Your Hubs use surrogate ID as the primary key (capturing the technical ID).
  • You defer mapping to true business keys into the Business Vault layer, after all sources land.

Source Vaults make sense only when staging denormalization is impossible. Otherwise, they sacrifice business clarity for expedience.

Step-By-Step Implementation Guide

  1. Identify Business Keys: Catalog each table’s true business key(s)—not the surrogate PKs.
  2. Build Staging Views: For each source, create a view that LEFT JOINs the “owner” table back to itself (or to its lookup tables) to pull in NBR wherever ID appears.
  3. Validate Keys: Ensure every staging record includes a non-null business key. Flag or quarantine any orphans (IDs without known business key).
  4. Configure Load Scripts: Point your Vault loaders (Hubs, Links, Satellites) at these staging views. Remove any loaders that target surrogate key Hubs or Same-As Links.
  5. Implement Forward Lookup: For CDC batches that may omit source changes, add a fallback join to the Employee Satellite in your load script to fetch the last known NBR for each ID.

Benefits of Pre-Joining Business Keys

  • Simplicity: One denormalization step replaces dozens of repetitive lookups.
  • Performance: Staging views optimize key enrichment in set-based SQL rather than row-by-row Link loads.
  • Lineage: Your Raw Vault contains only true business keys, preserving clear end-to-end lineage.
  • Maintainability: Future source schemas that reference ID get mapped automatically via the shared staging view logic.

Conclusion

Mismatched surrogate and business keys need not derail your Data Vault design. By embracing pre-joining business keys in your staging layer—along with forward lookups for CDC edge cases—you preserve a clean, business-centric Vault model without cumbersome Same-As Links. Reserve Source Vaults only for environments where staging denormalization simply cannot occur. With these best practices, your Vault remains performant, transparent, and aligned with true business identifiers.

Watch the Video

How to Sell Data Vault to Management

How to Sell Data Vault

Convincing senior executives to invest in a robust data architecture like Data Vault can feel like scaling a fortress gate. You know the technical merits inside out, but managers care about business outcomes, budgets, and timelines—not hash keys, hubs, and satellites. In this article, we’ll explore how to “sell” Data Vault to C-level stakeholders by focusing on the features and value it delivers, rather than its underlying technical mechanics.



The Danger of Technical Jargon

When building a house, a homeowner asks, “How many bedrooms? How many bathrooms? What’s the square footage?” They don’t delve into whether the builder used nails or screws. Likewise, executives don’t care if you use Data Vault, Kimball, or Inmon. They care about the end product: reliable reports, faster insights, and lower risk.

Starting a conversation with “Let me explain our Hub-and-Satellite architecture…” alienates non-technical stakeholders. Instead, frame your pitch around business capabilities and measurable outcomes.

Reframe the Conversation: It’s a Data Platform, Not a Methodology

Ask yourself: what does management really want? The answer is simple:

  • Governed, compliant data for audits and regulations
  • Fast, accurate reporting to inform decisions
  • Scalable infrastructure that grows with your business
  • Ability to integrate new sources—legacy systems, real-time feeds, and unstructured data
  • Future-proof automation and AI-driven efficiency

Position your solution as an enterprise data platform that delivers these capabilities. Only dive into Data Vault specifics when a technical stakeholder asks—then you can explain how its modular design underpins agility and auditability.

Key Business Benefits to Highlight

Below are the core value propositions you should emphasize. Each maps directly to executive priorities:

1. Integrate Any Source, Any Format

– Combine data from ERP, CRM, cloud services, IoT streams, social feeds, and Excel sheets.
– Handle conflicting or incomplete data without losing lineage.
– Accelerate time-to-insight by onboarding new sources in days, not months.

2. Auditability & Compliance

– Capture full history of every data change for regulations (GDPR, SOX, HIPAA).
– Reconstruct past reports exactly as they were delivered.
– Demonstrate data lineage and provenance to satisfy auditors and regulators.

3. Automation & Developer Productivity

– Use off-the-shelf tools (FlowBI, automation frameworks) to generate 70–80% of your pipelines code.
– Reduce manual coding errors with template-driven scaffolding.
– Free your team to focus on business logic and analytics, not plumbing.

4. Scalable Performance & Flexibility

– Scale out compute and storage independently in the cloud (AWS, Azure, Google).
– Handle spikes in data volume—batch or streaming—without re-architecting.
– Support thousands of concurrent users and complex analytics workloads.

5. Multiple Business Perspectives

– Deliver different “versions of the truth” side-by-side: sales view, finance view, marketing view.
– Maintain consistent business rules and definitions across departments.
– Enable self-service BI without sacrificing governance.

Positioning Your Pitch

Armed with these benefits, craft a narrative that aligns with your organization’s strategic goals:

  • Cost Avoidance: Showcase how auditability and automation reduce remediation costs and manual reconciliation.
  • Risk Mitigation: Emphasize regulated data lineage, reducing compliance fines and reputational damage.
  • Business Agility: Illustrate faster source onboarding to support new products, M&A, and market pivots.
  • Developer Efficiency: Quantify hours saved through code generation and reusable templates.

Handling Common Objections

“We don’t want to invest in fundamentals anymore.”

Many firms chase “silver bullet” tools—data lakes, LLMs, or generic ETL appliances—hoping to skip architecture. Explain that without a solid foundation, new tools amplify chaos, not clarity. Draw parallels: no builder skips the foundation to save budget.

“Isn’t this too complicated?”

Compare Data Vault to modular construction: pre-fabricated components assembled with repeatable processes. Complexity is hidden under the hood; what management sees is a predictable, standardized delivery pipeline.

“We already have a data lake/warehouse.”

Acknowledge existing investments. Then demonstrate: Data Vault can sit atop or alongside current environments, enhancing governability and enabling phased migration without rip-and-replace.

Engaging Different Stakeholders

Each audience has different concerns. Tailor your message accordingly:

  • CEO/COO: Focus on revenue growth, operational efficiency, and risk reduction.
  • CFO: Highlight cost avoidance, predictable budgeting through reusable components, and audit compliance.
  • CTO/CIO: Dive into scalability, cloud economics, and integration patterns.
  • Business Unit Leaders: Emphasize faster delivery of insights, tailored dashboards, and self-service BI.

Real-World Success Stories

Nothing beats concrete examples. Present case studies or internal pilots that showcase:

  • 50% reduction in data onboarding time for a new source system.
  • 30% decrease in remediation tickets due to automated auditing.
  • Consistent reporting across finance and marketing, eliminating “version conflict” meetings.

Roadmap & Phasing

Break the project into manageable phases:

  1. Pilot Phase: Integrate one or two critical sources, deliver dashboards in 4–6 weeks.
  2. Expansion Phase: Add additional systems, build out automation and governance playbooks.
  3. Optimization Phase: Operationalize PIT tables, refine performance, onboard self-service users.
  4. Continuous Evolution: Incorporate AI-driven code generation and new data sources as needed.

Phased delivery reduces risk and allows management to see progressive value, reinforcing buy-in for subsequent investments.

Conclusion

Selling Data Vault to management isn’t about cold, technical lectures on hash keys and satellites. It’s about painting a vivid picture of what the organization will achieve: faster insights, iron-clad audit trails, automated pipelines, and a flexible, scalable data platform that grows with the business. Speak their language—doors, windows, and square footage—then build your foundation behind the scenes.

Watch the Video

When to Use Reference Tables in Data Vault?

Reference Tables in Data Vault

In modern Data Vault 2.0 implementations, teams often face a recurring question: “We have dozens of small, static lookup tables—should we model them as full Hubs and Satellites, or can we use simpler reference tables?” If you’re dealing with Excel sheets containing tens or hundreds of rows of relatively stable data (like Profit Centers, Status Codes, or Region mappings), this article will help you decide when a lightweight reference table suffices—and when you need the auditability of a Hub/Satellite pattern.



Understanding Business Data vs. Reference Data

First, it helps to distinguish two broad categories of data:

  • Business Objects: Entities that your processes create and update constantly—Customers, Orders, Products, etc. You generate new keys and change descriptive attributes frequently.
  • Reference Data: Code lists and lookup tables that describe or classify business objects—Country codes, Profit Center codes, Contract types. These change infrequently and usually in small batches.

Although some tables can straddle the line (e.g., Profit Centers may be “business objects” for accounting teams), it’s often safe to treat truly stable code lists as reference data for modeling purposes.

Simple Reference Tables: Pros and Cons

A simple reference table in your Data Vault is nothing more than a flat table with:

  • Primary Key: Your reference code (e.g., profit_center_code).
  • Attributes: The 2–5 descriptive columns you need (e.g., profit_center_name, region).
  • No History: Only the current state is stored; updates overwrite existing rows.

Advantages: Easy to implement, minimal objects, straightforward joins at query time.
Disadvantages: No built-in historical tracking—updates will retroactively change past reports, and you cannot reconstruct previous descriptions.

When Simple Reference Tables Are Appropriate

Consider a flat reference table when:

  • Your business requirement only needs the latest values.
  • Updates are extremely rare (quarterly or less) and don’t require audit trails.
  • Performance of lookups is not mission-critical (small table sizes).
  • You have no regulatory or internal need to reproduce past descriptions.

If any of these criteria fail—especially auditability—then a simple reference table can become a liability.

Introducing Reference Hubs and Reference Satellites

To combine simplicity with history, use the Reference Hub & Reference Satellite pattern. This mirrors the standard Hub/Satellite design, but optimized for code lists:

  • Reference Hub:
    • business_key: the code (e.g., PROFCTR_001)
    • record_source: data origin
    • load_date: date the code list was loaded
  • Reference Satellite:
    • business_key (FK to Hub)
    • load_date (also part of PK)
    • Descriptive attributes (e.g., name, region, valid_from)

This approach captures every change to your reference data without overwriting, and still keeps your model lightweight.

How It Works in Practice

  1. Load the Hub: Insert every code once (or refresh if new codes appear).
  2. Load the Satellite: For each code, insert a new row whenever any descriptive attribute changes, tagging it with the load_date.
  3. Querying: In your dimension or Information Delivery layer, join from your business object Satellite (or Link) directly into the Reference Satellite on code, filtering to the row with the latest load_date ≤ transaction date.

Because reference tables are small, these joins remain performant even when you compare on dates.

Aligning Reference Data with Business Vault Snapshots

For organizations using a Business Vault layer with snapshot dates, you may need to “time-align” reference data. Two patterns are common:

  • On-the-fly alignment: In your reporting view, use the transaction’s snapshot_date and join to the Reference Satellite where load_date ≤ snapshot_date, picking the latest record.
  • PIT/Bridge tables: Precompute “Point-In-Time” (PIT) tables that store the reference code’s surrogate key aligned to each business object snapshot for faster querying.

Choose the pattern that balances your performance SLAs with data freshness requirements.

Auditability and Regulatory Compliance

If you operate in regulated industries (banking, telecom, government), audit trails are mandatory. The Reference Hub/Satellite pattern ensures:

  • Complete lineage and history of every code change.
  • Reproducibility of past reports with original reference descriptions.
  • Ability to support retrospective analyses without reloading or reconstructing data.

Even if your initial business users only ask for current values, future sprints or stakeholders may require historical context—so building auditability upfront can save costly refactoring.

Performance Considerations

Reference tables typically contain at most hundreds of rows. However, you should still consider:

  • Indexing: Ensure load_date and business_key are indexed for fast lookups.
  • Partitioning: Generally unnecessary for small tables but useful if your Satellite grows into thousands of deltas over years.
  • Join Strategy: In most SQL engines, joining a large transaction Satellite to a small Reference Satellite on code + latest date filter is efficient. If not, consider a PIT table.

Governance and Knowledge Transfer

Whatever pattern you choose, document and govern your reference data:

  • Maintain an authoritative data dictionary describing each code list, source, update frequency, and steward.
  • Set up automated tests (e.g., CI/CD validations) to detect unexpected code changes.
  • Implement alerts for large volumes of reference updates that may indicate data quality issues.

Decision Checklist

Use this quick checklist when evaluating a table for reference modeling:

  1. Is the table truly static or slow-changing? (Quarterly or less)
  2. Are there audit or historical requirements? (Regulatory or future use cases)
  3. Is the table small enough (< 1,000 rows) to avoid performance concerns?
  4. Do you need to reconstruct past reports with original descriptions?
  5. Would a simple change in the future (e.g., retro-active update) break historical reports if you used a flat table?

If you answered “yes” to questions 2 or 5, the Reference Hub/Satellite pattern is the safer choice. Otherwise, a simple reference table may be sufficient.

Conclusion

Static lookup tables in a Data Vault 2.0 implementation can be modeled either as simple reference tables or with a Reference Hub & Satellite pattern. While flat tables are easier to build, they lack historical tracking and auditability. By adopting the Reference Hub/Satellite approach, you gain full change history, reproducible reporting, and alignment with regulatory demands—while retaining a lightweight design.

Use the decision checklist above to guide your modeling choices, and ensure your Data Vault remains both agile and compliant as your organization’s needs evolve.

Watch the Video

Outsourcing the Data Warehouse: Finding the Right Balance

Outsourcing the Data Warehouse

In today’s data-driven world, building and maintaining a robust data warehouse is a critical strategic initiative. As organizations grow and their data complexity increases, leaders often ask: Is there a way to third-party some of the data warehouse work and infrastructure? In other words, can you “outsource the data warehouse” without jeopardizing control, security, or long-term knowledge retention?



Why Treat Your Data Warehouse as Strategic

Most organizations—especially those generating significant revenue or handling complex data flows—view their data platform as a core strategic asset. It underpins reporting, analytics, and decision-making, and drives competitive advantage. Handing over this critical function entirely to an external provider can feel like giving away the keys to your business.

  • Control & Governance: Keeping the data warehouse in-house ensures direct oversight of data quality, security, and compliance.
  • Knowledge Retention: Your internal team develops deep institutional understanding of data models, business logic, and reporting requirements.
  • Strategic Flexibility: Internal ownership allows you to pivot rapidly as business needs evolve, without waiting on external roadmaps or SLAs.

The Role of External Expertise

That said, nearly every successful data warehouse project benefits from external consulting, especially in the early stages. Consultants bring:

  • Proven Frameworks: Templates, best practices, and reference architectures refined across multiple clients.
  • Jump-Start Momentum: Hands-on help with infrastructure setup, project governance, and team organization.
  • Skill Gap Coverage: Experienced data architects, engineers, and DevOps specialists to fill temporary talent shortages.

Once your internal team is up to speed, you can scale back external support—keeping consultants focused on specific areas where they provide the most value.

Outsourcing Infrastructure: The Cloud Advantage

For many organizations, the first form of “outsourcing” is the data center itself. On-premises servers have given way to cloud platforms like AWS, Google Cloud, and Azure. By migrating your data warehouse infrastructure to the cloud, you:

  • Reduce Capital Expense: No more large upfront hardware purchases or data center maintenance costs.
  • Gain Elastic Scalability: Spin up additional compute and storage on demand to handle peaks in data processing.
  • Enhance Security & Reliability: Leverage the cloud provider’s certifications, redundancy, and disaster recovery capabilities.

This shift effectively delegates infrastructure work—provisioning, patching, and physical security—to a trusted third party, while you retain control of your data and processes.

Balancing Internal and External DevOps

DevOps for your data platform—CI/CD pipelines, automated testing, and deployment orchestration—can also be partially outsourced. However, it’s important to strike the right balance:

  • Internal DevOps Leads: Core pipeline design, approval processes, and environment governance should remain with your in-house team.
  • External Specialists: Consultants can set up complex workflows, integrate tools, and train your staff on best practices.
  • Limit External Proportion: Aim to keep no more than 50% of DevOps roles external to mitigate knowledge leakage and dependency risks.

AI-Driven Offloading: The Next Frontier

Emerging AI tools are beginning to automate aspects of data warehousing:

  • Schema Generation: AI can suggest optimized table structures based on source data profiles.
  • ETL/ELT Code Snippets: Auto-generation of transformation scripts for common data patterns.
  • Monitoring & Alerting: Machine learning models to detect anomalies and performance bottlenecks.

While these solutions can accelerate development, they work best under the guidance of your experienced internal team, who define requirements, review outputs, and ensure alignment with business objectives.

Setting Clear Boundaries

To manage the consultant-internal balance effectively, define:

  1. Scope of Work: Specify deliverables, timelines, and handover expectations for every engagement.
  2. Knowledge Transfer Plan: Require documentation, training sessions, and code reviews to embed expertise internally.
  3. Governance Model: Establish who makes architectural decisions and how change requests are processed.
  4. Resource Thresholds: Limit external headcount to a reasonable percentage (e.g., under 50%) to avoid over-reliance.

Key Benefits and Risks

Benefits:

  • Speed: Hit the ground running with proven accelerators and expert guidance.
  • Flexibility: Scale resources up or down based on budget cycles and project phases.
  • Cost Efficiency: Avoid long-term commitments; external consultants can be released when budgets tighten.

Risks:

  • Knowledge Drain: Too many external experts can lead to critical know-how walking out the door.
  • Strategic Misalignment: Consultants may optimize for short-term wins rather than your long-term roadmap.
  • Dependency: Over-outsourcing creates vendor lock-in and potential service disruptions if relationships end.

Best Practices for Effective Outsourcing

1. Start with a Pilot: Engage consultants on a small, well-defined project to validate fit and process.

2. Embed Consultants: Position them alongside your team for on-the-job training, rather than in isolation.

3. Prioritize Documentation: Ensure every architecture decision, data model, and pipeline is clearly recorded.

4. Rotate Responsibilities: Alternate tasks between internal staff and consultants to spread knowledge.

5. Review Regularly: Conduct quarterly check-ins to reassess external involvement, goals, and budget alignment.

Conclusion

Outsourcing your entire data warehouse is rarely advisable—this remains a core strategic asset that demands internal stewardship. However, judicious use of external consultants, cloud infrastructure, and AI-driven tools can accelerate your journey, fill critical skill gaps, and optimize costs. By setting clear boundaries, emphasizing knowledge transfer, and maintaining a healthy mix of internal and external talent, you can reap the benefits of outsourcing without surrendering control.

Watch the Video

Data Vault Link Temporality

Link Temporality: Handling Source Data Errors with Effectivity Satellites

In modern data warehousing, ensuring accurate historical records is paramount. The Data Vault methodology excels at capturing raw, unfiltered data changes over time. But what happens when your source system makes errors—linking an entity to the wrong counterpart—and then corrects them? Without the right approach, your Link tables can become confusing, making it hard to identify the true “current” relationship. This article explores an elegant solution: using an Effectivity Satellite to manage link temporality and error correction in your Data Vault.



The Problem: One-to-One Relationship with Source Data Fluctuations

Imagine two hubs in your Data Vault: Hub A and Hub B. A business rule dictates that each A-entity can be linked to exactly one B-entity at a time. Your Link table models these connections. The typical workflow is:

  1. Day 1: Source links A1 → B1 → you load this into your Link.
  2. Day 2: Source mistakenly links A1 → B2 → you load the new link.
  3. Day 3: Source corrects back to A1 → B1 → how do you capture this as the current, up-to-date link?

Since the Link table only records distinct relationships and ignores duplicates, reloading A1 → B1 on Day 3 won’t insert a new row or update any timestamp. You lose clarity on which relationship is active today.

Why Not Tweak the Link Table Directly?

You might be tempted to add LOAD_END_DATE or an “active” flag directly to the Link table to mark when a relationship becomes obsolete. However, this violates Data Vault best practices. The Link should remain a pure, append-only record of every relationship ever observed, without status flags or end dates. Instead, you delegate temporality to a dedicated satellite.

Introducing the Effectivity Satellite

An Effectivity Satellite sits alongside your Link and records the lifespan of each relationship. Its core columns include:

  • Link Hash Key: foreign key back to your Link record
  • LOAD_DATE_TIMESTAMP: when you first detected or ended this link
  • DELETE_DATE_TIMESTAMP: when the link was deactivated (or a far-future “end of time” for active rows)

This design cleanly separates the static relationship definition (Link) from its dynamic, time-dependent status (Effectivity Satellite).

Step-by-Step: Tracking Link Changes

Day 1: Initial Relationship

Link: A1–B1  
Effectivity Satellite:  
LOAD_DATE = D1  
DELETE_DATE = 8888-12-31

We load A1→B1 and mark it active by setting its DELETE_DATE to the end of all time.

Day 2: Erroneous Change

Link: add A1–B2  
Effectivity Satellite updates:  
– For A1–B1: DELETE_DATE = D2 (deactivated)  
– For A1–B2: LOAD_DATE = D2, DELETE_DATE = 8888-12-31

The old relationship is soft-deleted, and the new one is inserted and marked active.

Day 3: Correction Back to Original

Effectivity Satellite updates:  
– For A1–B2: DELETE_DATE = D3  
– For A1–B1: new row (reactivation): LOAD_DATE = D3, DELETE_DATE = 8888-12-31

Instead of touching the Link, we simply record two new deltas: ending B2 and re-activating B1. Querying the satellite for the active row (where DELETE_DATE = 8888-12-31) reveals the current link.

Loading Patterns: Full Loads vs. CDC vs. Incrementals

Your data delivery method influences how you detect deletions:

  • Full Loads: Compare all active links in the satellite against staging; any missing link implies a deletion. Insert a delta to end-date it.
  • Change Data Capture (CDC): Leverage the source’s delete events and timestamps as your DELETE_DATE_TIMESTAMP.
  • Incremental without Deletes: Combine staging deltas (inserts/updates) with a lightweight full load of just business keys. Missing keys signal deletions.

In all cases, the satellite becomes the single source of truth for link effectivity.

Handling Unreliable Source Deliveries with “Last Seen”

Sometimes, your source export may inadvertently drop rows (e.g., locked mainframe records). To avoid false deletions, maintain a Last Seen Date on your effectivity satellite. If a link hasn’t been seen for a configurable “grace period” (e.g., three weeks), a business rule in your Business Vault marks it deleted. This approach balances accuracy against source system quirks.

Querying Current Relationships

To retrieve only active links at any point:

SELECT L.*, S.LOAD_DATE_TIMESTAMP  
FROM Link L  
JOIN Effectivity_Sat S  
ON L.Link_HashKey = S.Link_HashKey  
WHERE S.DELETE_DATE_TIMESTAMP = '8888-12-31';

This simple filter returns the true, live relationships, abstracting away all historical noise and source-system corrections.

Benefits of the Satellite Approach

  • Auditability: Full history of when links were activated and deactivated.
  • Purity: Link tables remain simple, append-only, and free of flags/end dates.
  • Flexibility: Supports full loads, CDC, and incremental patterns seamlessly.
  • Business Rules: “Last seen” logic can live in Business Vault without polluting raw Data Vault layers.

Conclusion

Handling source data errors—especially when relationships ping-pong between states—requires a robust temporal strategy. By delegating link effectivity to a dedicated satellite, you maintain a clean Link table, capture every change, and easily identify the current relationship. Whether you’re dealing with full loads, CDC, or flaky source exports, this pattern scales, remains auditable, and adheres to Data Vault best practices. Implement effectivity satellites in your Data Vault to master link temporality and build a more resilient, transparent data platform.

Watch the Video

Data Vault Links With Just One Hub Reference

Single-Hub Links

In Data Vault modeling, links play a central role in representing relationships between business keys stored in hubs. By design, most links connect two or more hubs, capturing many-to-many relationships or associations. But what happens when an event or transaction involves only a single business key? Can you still use a link structure—and if so, which type? In this article, we’ll explore the concept of non-historized links with a single hub reference, compare alternatives, and outline best practices for real-time event modeling.



Overview of Data Vault Components

Before diving into one-hub links, let’s briefly review the core building blocks of a Data Vault model:

  • Hubs: Store unique, immutable business keys (e.g., customer IDs, order numbers).
  • Links: Represent relationships or associations between two or more hubs.
  • Satellites: Hold descriptive attributes and contextual history for hubs and links.

This three-tiered architecture ensures agility, auditability, and scalability. Hubs guarantee uniqueness; links model relationships; satellites track changes over time.

Traditional Links and Their Purpose

Most Data Vault implementations utilize links to tie together business keys from multiple hubs. Common scenarios include:

  • Customer–Order relationships (customer purchases multiple orders).
  • Order–Product line items (each order can contain multiple products).
  • Employee–Department assignments.

These historized links capture the evolution of relationships over time, recording load dates and allowing queries that include past associations. In contrast, non-historized links focus on events at a single point in time.

Defining Non-Historized Links

A non-historized link (sometimes called an “event link” or “transaction link”) stores relationships for a single event or message without maintaining full historical context. Instead of recording every change, it captures a snapshot of an event at its arrival:

  • Load timestamp identifies when the event occurred or was ingested.
  • Hub references list one or more business keys involved in the event.
  • Non-historized Satellites may attach descriptive details, but typically without tracking attribute history.

This design is ideal for real-time message processing, streaming data, or systems where full history is not required for each event.

When Only One Business Key Is Involved

While many events involve multiple business keys—such as an order linking to both customer and product—some transactions or messages involve just one key. Examples include:

  • A single-customer ping or heartbeat event in an IoT system.
  • A retail message capturing stock-level change for one product.
  • An alert triggered by a lone account reaching a threshold.

In these cases, you might wonder if a link structure still makes sense when there’s only one hub reference. The answer is yes: you can implement a non-historized link that references a single hub key to represent that event.

Advantages of Single-Hub Links

Opting for a non-historized link with one hub reference brings several benefits:

  • Consistency: Sticks to the Data Vault pattern of links for events, avoiding mixed designs.
  • Scalability: Scales out to handle high volumes of incoming messages without heavy historical tracking.
  • Clarity: Clearly separates transactional/event data from descriptive satellites and core business keys.
  • Query Simplicity: Enables straightforward point-in-time queries of events linked to the relevant hub.

Alternative: Multi-Active Satellites

Another design might involve a multi-active satellite on the hub itself, capturing different event types or message variants keyed by a load timestamp or event type. However:

  • Multi-active satellites are designed to capture multiple concurrent “active” roles or statuses rather than transient events.
  • The lack of a dedicated link table can blur semantic distinctions between relationships and descriptive attributes.
  • Query performance and partitioning strategies may suffer when trying to manage high-frequency event data in a satellite.

Therefore, for discrete, passing-through events, a non-historized link generally outperforms a multi-active satellite approach.

Designing Your Single-Hub Non-Historized Link

When modeling a non-historized link that references only one hub, follow these guidelines:

  1. Link Table Structure: Include a surrogate primary key, load timestamp, and the single hub’s surrogate key.
  2. Foreign Key Constraint: Enforce referential integrity back to the hub, ensuring the business key exists.
  3. Descriptive Satellites: If extra attributes are needed (e.g., event payload details), create a non-historized satellite keyed to the link.
  4. Partitioning Strategy: Partition by load date for efficient querying and archiving of stale event data.
  5. Retention Policy: Define sliding windows or archival processes for old events if storage growth is a concern.

Here’s an example DDL snippet for reference:


CREATE TABLE l_event_single_hub (
l_event_id        BIGINT      IDENTITY PRIMARY KEY,
hub_key_id        BIGINT      NOT NULL,
load_date         DATETIME     NOT NULL,
-- optional metadata columns
source_system     VARCHAR(50),
record_hash       CHAR(32),
CONSTRAINT fk_l_event_hub
FOREIGN KEY (hub_key_id)
REFERENCES h_hub_entity(hub_key_id)
);

Use Case Scenarios

Organizations across industries leverage single-hub links for:

  • Banking: Recording individual account balance snapshot events.
  • Retail: Capturing stock level messages for each product unit.
  • IoT: Ingesting single-device telemetry pings.
  • Telecommunications: Logging individual phone number status changes (e.g., activated/deactivated).

In each scenario, the event is tied to one core business key, and history is either ephemeral or summarized elsewhere.

Best Practices and Considerations

When implementing single-hub non-historized links, consider the following:

  • Event Granularity: Define clear semantics—what constitutes one event, and how often will it be ingested?
  • Surrogate Keys: Always use surrogate keys for hubs and links to maintain consistency.
  • Hashing Strategy: Compute a record hash if you need idempotency or change detection on message payloads.
  • Load Performance: Optimize bulk or streaming loads with batching and minimal indexes on the link table.
  • Retention and Archival: Archive stale events into cheaper storage or summarize them into aggregate tables.

By following these practices, you’ll ensure a robust, maintainable design that adheres to Data Vault principles.

Conclusion

While it might seem counter-intuitive to create a link with only one hub reference, non-historized links with a single business key are both feasible and, in many real-time event scenarios, preferable to alternative designs. They preserve the semantic clarity of link tables, ensure data integrity, and scale efficiently for high-volume event streams. When events involve only one business key, reach for a one-hub non-historized link rather than shoehorning events into satellites or hybrid structures.

Watch the Video

Dealing with Corrupted Loads in Data Vault

Corrupted Loads in Data Vault

One of the foundational assumptions in Data Vault modeling is that business keys must be unique. This rule underpins how we model Hubs, Links, and Satellites. But what happens when your data doesn’t play by the rules? Specifically, what should you do when your data delivery contains multiple rows with the same business key—a situation that violates the core principles of your Raw Data Vault model?

In this article, we’ll explore practical strategies for managing corrupted data in Data Vault pipelines, focusing on maintaining auditability, consistency, and data integrity—even when upstream data delivery is flawed. We’ll also look at what to do when your business key assumptions no longer hold true.



Understanding the Problem: Duplicate Business Keys

Let’s start with the assumption that your Raw Data Vault is modeled around unique business keys. You’ve built Hubs, split Satellites, and established Links based on the expectation that a business key like customer_id uniquely identifies a customer.

Now, you receive a new delivery from your source system. Unexpectedly, it contains multiple rows with the same business key. This isn’t just a data quality issue—it fundamentally breaks your model. The typical loading process can no longer proceed cleanly, and worse, you risk contaminating your data warehouse with incorrect records.

Why You Can’t Ignore Corrupted Loads

It’s tempting to just skip the bad file or fix it manually. But in a proper Data Vault setup—particularly one that adheres to full auditability and compliance standards—this isn’t acceptable. You must be able to fully reconstruct each data delivery, even if it’s flawed. Every decision—whether to reject or load—must be trackable and justifiable.

Step 1: Capture Everything in a Data Lake

Today, many modern architectures use a data lake or Persistent Staging Area (PSA) as the first layer of data capture. This becomes your insurance policy. All incoming data—valid or corrupted—is ingested and stored here as-is, giving you a perfect record of what was delivered and when.

This approach also ensures your Raw Data Vault can skip flawed deliveries without data loss. By storing the original files in the data lake, you preserve the full delivery for later inspection, validation, or correction without halting the loading process entirely.

Step 2: Define Automated Data Quality Checks

Before data is loaded into the Raw Data Vault, it must pass validation. You can implement quality checks like:

  • Is the business key unique across the delivery?
  • Are column data types and lengths as expected?
  • Are required fields populated?

If any of these checks fail, the entire file should be rejected—not just individual records. Why? Because partial loads introduce ambiguity and audit challenges. Instead, flag the file as failed and notify the data provider to investigate and resend a corrected version.

Step 3: Track Rejections and Version Control Your Checks

You must keep detailed logs of every load attempt. This includes:

  • Which file was loaded or rejected
  • Which checks were applied
  • Which check failed and why
  • The version of the validation rule used

This ensures complete traceability. You can prove not just what was accepted, but also what was rejected and for what reason. This is crucial for regulatory compliance, audits, and operational transparency.

What If There’s No Data Lake?

In some cases, you may not have a data lake. You might be working with a transient relational staging area before the Raw Data Vault. Even then, you should still store failed deliveries. A separate location or table can be used to store the raw files that failed validation. Again, auditability is key—just because data isn’t valid doesn’t mean it can disappear.

When the Business Key Assumption Breaks

Sometimes, you dig deeper and realize that your assumption about the business key was flawed. Maybe you thought customer_id was unique, but the source system allows multiple entries per ID for different contexts. Now what?

This is where things get more complex. You need to refactor your model. Specifically, you must modify the Hub and possibly extend the business key by combining it with another column (e.g., customer_id + region) to enforce uniqueness.

Why You Must Refactor, Not Hack

Some might be tempted to patch the issue using a record source tracking Satellite or other technical workaround. But this introduces long-term maintenance and performance issues. Worse, it hides the real business reality behind a technical trick.

Instead, treat the business key as the central anchor of your model. If it changes, it impacts:

  • The Hub structure
  • All related Satellites
  • Any Links pointing to the Hub

Yes, it’s a big change. But it’s limited to a specific portion of your model and keeps your architecture clean and reliable.

What About Descriptive Data Errors?

If the corrupted data only affects descriptive attributes and not the business key, the fix is simpler. You can ingest a correction load directly into the Satellites with a backdated load date—just after the original bad load. Then, rebuild your PIT (Point-In-Time) tables. This resolves the issue for downstream consumption without any need to refactor Hubs or Links.

Final Thoughts: Build Resilience Into Your Pipeline

Corrupted data is not an exception—it’s an eventuality. Whether it’s duplicate business keys, incorrect formats, or structural changes in the source system, your data warehouse must be prepared. The best defenses are:

  • A reliable data lake or staging layer to capture raw deliveries
  • Automated validation and full-file rejection logic
  • Detailed auditing and version control on checks
  • Clear communication with source system owners
  • Willingness to refactor models when business reality shifts

Following these principles ensures your Data Vault model remains robust, scalable, and trustworthy—even in the face of corrupted loads.

Watch the Video

Defining the Error Mart in Data Vault

Defining the Error Mart

When working with data platforms that follow the Data Vault methodology, one often hears about components like the Raw Vault, Business Vault, and Information Marts. But among these well-known layers is a lesser-discussed yet critical structure: the Error Mart.

In this blog post, we take a comprehensive look at what an Error Mart is, what its main objectives are, and the best practices for designing one. This insight is based on an informative session led by Michael Olschimke, CEO of Scalefree, during a recent Data Vault Friday.



What is an Error Mart?

In traditional data warehousing approaches like Kimball, an Error Mart is used to store metrics about errors—for example, how many ETL jobs failed or which tables didn’t load successfully. These are primarily KPIs used for monitoring and are typically stored in what’s known as a Metrics Mart.

However, in the context of Data Vault 2.0, the Error Mart has a different, more tactical role: it acts as a catch-all for rejected records that fail to load during any of the staging or integration processes.

This could be due to a mismatch in expected data types, missing columns, or unexpected structural changes in the source data. These issues most frequently arise during:

  • Initial data ingestion from files, APIs, or real-time feeds
  • Loading data into the staging area or raw Data Vault
  • Applying hard rules based on schema assumptions

The Main Goal of an Error Mart

The primary goal of the Error Mart is to ensure that all incoming data—the good, the bad, and the ugly—is captured and traceable, even if it can’t immediately be loaded into the intended layer (such as the Raw Vault).

It’s a technical safety net that provides:

  • A secure location for rejected records
  • The ability to analyze and correct issues manually
  • A reprocessing workflow that ensures full data capture

The Error Mart is not meant for business logic errors (e.g., someone underage purchasing a product); rather, it handles technical discrepancies that prevent data from moving through the pipeline.

How Is It Structured?

Traditionally, one might think of creating multiple error tables to match each data model. However, Michael Olschimke recommends a single flexible structure—a table that stores rejected records as JSON strings. This allows you to capture various unexpected formats without predefined schemas.

Each record should be accompanied by key metadata:

  • Load date – Timestamp of ingestion
  • Record source – Source system or interface
  • Process identifier – The job or transformation that failed

This setup ensures that every error is auditable, traceable, and eventually resolvable.

Best Practices for Designing an Error Mart

Here are some key considerations when building your Error Mart:

1. Flexibility in Structure

Since rejected data often doesn’t conform to expected schemas, use a structure that can handle variability. A single table using JSON or Parquet formats offers great flexibility, especially when stored in a data lake.

2. Avoid Over-Engineering

There’s no need to create one table per error type. One well-documented and meta-tagged table is usually sufficient.

3. Logging and Auditing

Implement a companion log table or file to track which records have been reprocessed. Instead of deleting processed error records, use a status flag or separate tracking log to preserve data lineage and maintain transparency.

4. Trigger Monitoring and Alerts

Your system should monitor the Error Mart for unprocessed records. Set up alerts via email, log monitoring tools like CloudWatch or Greylog, or build dashboards that notify the data team when action is required.

5. Make It the Data Team’s Responsibility

A critical mindset shift: processing records in the Error Mart is not a business responsibility—it’s yours as the data engineering team. Do not offload this to end users.

6. Reprocessing Workflow

Once the technical root cause is identified (e.g., an overly strict field length), update the hard rules, reload the rejected data from the Error Mart into the target layer, and mark it as processed in your log.

7. Error Mart in Every Layer

While most errors occur in the initial stages (staging, Raw Vault), you should prepare to capture errors at every layer—Business Vault and Information Mart included.

8. Binary Data Considerations

If your incoming data includes blob fields, you can mime-encode them and store them alongside the error JSON or separately in the data lake.

Why the Error Mart Matters in Data Vault Architecture

Data Vault is built on the premise of complete and auditable data capture. To meet this principle, you must have a strategy for handling unexpected or failed data loads. The Error Mart acts as that strategy.

It’s not just a dumping ground for bad records—it’s a crucial feedback mechanism that helps you refine your ingestion and transformation rules, ensuring every piece of data, no matter how ugly, makes it into the platform.

Without an Error Mart, you risk data loss, broken lineage, and ultimately, lower trust in your data platform.

Conclusion

In summary, the Error Mart is an essential part of a resilient Data Vault architecture. It gives your data team the tools to identify, correct, and reprocess problematic data while maintaining auditability and trustworthiness.

If you’re implementing a Data Vault, don’t treat the Error Mart as an afterthought. Design it with flexibility, transparency, and process integration in mind. And remember: it’s your job to make sure no record gets left behind.

Watch the Video

Conceptual vs Logical vs Physical Data Models

Why Are Data Models Important?

Before diving into the specifics, let’s understand the purpose of data modeling. Imagine building a house. You wouldn’t start hammering wood together randomly—you’d begin with a sketch, then a blueprint, and finally the construction. Data models serve a similar purpose for databases and data systems.

They help ensure everyone involved (business users, developers, engineers) shares the same understanding of how data is organized, connected, and accessed.



1. What Is a Conceptual Data Model?

The conceptual model is your high-level business map. It’s like the sketch of your house drawn on a napkin. It’s not concerned with technology or database structures. Instead, it focuses on the business concepts and how they relate.

In simple terms, it answers questions like:

  • What are the key things we care about? (e.g., Customers, Products, Orders)
  • How are they related? (e.g., Customers purchase Products)

Here’s a basic example:

  • Entities: Customer, Product, Purchase
  • Relationships: A Customer makes a Purchase; a Purchase involves a Product

You don’t list detailed fields or attributes yet—just the big picture. It’s usually created during the early discussions between business stakeholders and data professionals. It helps everyone align on the language and goals before jumping into technical design.

When Do You Use a Conceptual Model?

You use this at the start of a project, especially when:

  • You’re gathering requirements from the business
  • You’re building a shared understanding with non-technical stakeholders
  • You want to clarify business rules and entities

2. What Is a Logical Data Model?

Once you understand the business concepts, it’s time to turn those into a more detailed, technology-independent design. That’s the job of the logical model.

The logical model focuses on the structure of data—what fields each entity has, how they’re connected, and what kind of data they store. But it still doesn’t worry about the actual database platform or syntax.

Continuing the earlier example, a logical model might say:

  • Customer has attributes like First Name, Last Name, Email
  • Product has attributes like Product ID, Name, Price
  • Purchase links Customer and Product using unique IDs

In the context of Data Vault (a methodology for modeling enterprise data warehouses), this is where you define your:

  • Hubs: Core business entities (e.g., Customer, Product)
  • Links: Relationships between entities (e.g., Purchase)
  • Satellites: Descriptive data about Hubs and Links (e.g., customer name, address)

You also classify data here. For instance, you might mark some attributes as sensitive for privacy reasons or note how frequently they change.

When Do You Use a Logical Model?

This comes after your conceptual model, typically during solution design or system architecture planning. You use it when:

  • You need to define how data should be structured and connected
  • You’re planning your Data Vault architecture
  • You want to define metadata for automation tools like dbt, Wherescape, or Coalesce

3. What Is a Physical Data Model?

Now we get technical. The physical model is the actual implementation in a database system. It’s like building the house based on the blueprints.

This includes:

  • Create table statements
  • Insert/load scripts
  • Indexes and constraints (e.g., foreign keys)
  • Platform-specific configurations (e.g., Snowflake vs. Oracle)

In the physical model, you decide:

  • How data is stored (table structures, partitions)
  • How data is accessed (views, security layers)
  • How data is secured (row-level or column-level permissions)

For example, if your logical model says an attribute is sensitive, your physical model might enforce this by putting that attribute in a separate satellite or restricting access using database roles.

When Do You Use a Physical Model?

This comes last in the chain. You use it when:

  • You’re ready to implement in a database
  • You’re generating SQL from your metadata (often using automation tools)
  • You’re deploying the actual tables, views, and loading processes

Summary: The Three Layers Compared

Aspect Conceptual Model Logical Model Physical Model
Purpose High-level business understanding Detailed data structure without technology Actual implementation in a specific database
Focus Entities & relationships Attributes, keys, classifications SQL scripts, schemas, storage
Audience Business users & analysts Data architects & engineers DBAs & developers
Examples Customer purchases Product Customer has Name, Email; Link to Product CREATE TABLE Customer (…); GRANT SELECT ON ViewX

The Modeling Process: From Concept to Code

One of the key takeaways from Michael Olschimke’s explanation in the Data Vault Friday session is that these models are not alternatives—they’re steps in a process.

  1. Start with the conceptual model to understand the business.
  2. Create a logical model based on the incoming data and requirements.
  3. Generate the physical model using automation tools or code templates tailored to your database platform.

Each step builds on the one before it, guiding you from abstract ideas to concrete implementation.

Final Thoughts

If you’re new to data engineering, start small. Talk to the business. Sketch out their world. Then, gradually evolve those ideas into structured data models and finally into code. Understanding the differences between conceptual, logical, and physical models will make you a more effective engineer—and a better bridge between business and tech.

Watch the Video

Soft-Deleting Records in Data Vault: A Real-World Approach to Status Tracking Satellites

Soft-Deleting Records in Data Vault

When working with Data Vault in real-world enterprise data warehousing projects, managing soft-deleted records is more than just a theoretical exercise—it’s a necessity. While many books and training examples offer simplified scenarios, real-world implementations must take into account complex requirements such as status tracking from multiple source systems. In this article, we dive into how to virtualize dimensions in a Data Vault model with proper handling of status tracking satellites, ensuring that soft deletions are effectively managed and the integrity of historical records is preserved.



The Problem with Simplified Examples

Most Data Vault tutorials focus on core concepts: Hubs, Links, and standard Satellites. These foundational examples are useful for learning but fall short when we need to address data that has been logically removed or soft-deleted. In real-world systems, this is a common occurrence, and ignoring it risks producing inaccurate analytics and flawed dimension models.

The key challenge is to determine whether a business entity—like a customer, product, or concept—is still “active” in the eyes of the business. This gets even trickier when the data comes from multiple source systems, each with its own deletion logic. That’s where Status Tracking Satellites (also known as Effectivity Satellites) come in.

Scenario Overview: Multi-Source Concept with Soft Deletes

Consider a scenario where a Concept (e.g., “Customer”) is fed by two different source systems. Here’s a breakdown of the Data Vault objects involved:

  • Concept_PIT – A Point-In-Time (PIT) table indexing data across all satellites.
  • Concept_SAT_S_source1 – A standard satellite with descriptive data from Source 1.
  • Concept_SATST_source1 – A status tracking satellite for Source 1.
  • Concept_SAT_S_source2 – A standard satellite from Source 2 with more descriptive attributes.
  • Concept_SATST_source2 – A status tracking satellite for Source 2.

In this setup, each source system tracks its own deletions, independently of the other. That means a Concept could be deleted in one source but still be active in another. Properly modeling and querying this requires careful integration of all status indicators.

What Do Status Tracking Satellites Actually Do?

Contrary to some misunderstandings, status tracking satellites are not used to track the deletion of descriptive attributes. That’s the role of standard satellites. Instead, they track whether the entire business key has been logically deleted in the source system. For example, if a customer row is completely removed from a source table, the status tracking satellite records that deletion event.

This distinction is important: you might null out a field in the source system without deleting the record. The standard satellite captures that. But if the whole customer row is deleted, only the status tracking satellite will catch it.

How to Use the PIT Table

The PIT table is key to virtualizing dimensions. It acts as a bridge, linking together the various satellites—both descriptive and status tracking—by capturing the effective row per business key and timestamp. Your virtual dimension view selects from the PIT table and joins all relevant satellites using hash keys and load dates.

In this case, you treat the status tracking satellites just like standard satellites. You join them on hash keys and load dates using PIT indexes. This lets you bring in any flags, such as IsActive or deletion timestamps, as part of your view logic.

Business Rules for Determining Active Status

One of the biggest questions in designing this architecture is: How do we determine if a business key is still active?

There are a few approaches:

  • Single-source dominance: If the customer is deleted in the primary (golden) source, the business key is considered deleted.
  • All-source consensus: The business key is only considered deleted if it’s removed from all source systems.

Each approach has its pros and cons. The decision should be based on your business rules and requirements. You can use simple boolean flags like IsActive or more advanced logic combining multiple status indicators from different sources.

Should You Remove Deleted Entities from Dimensions?

This is a hot topic. Some organizations want to remove soft-deleted entities from their dimension tables entirely. While that may sound clean, it can create problems downstream—especially in fact tables where foreign key references still exist for those deleted entities.

The recommended approach? Flag them instead of deleting them. This preserves history and maintains referential integrity. It also helps analysts understand that, yes, the product or customer was deleted—but it still contributed to revenue or other KPIs in the past.

Virtualizing the Dimension View

Your final dimension view should:

  • Select from the PIT table for the appropriate concept.
  • Join to all descriptive satellites using the hash key and PIT load dates.
  • Join to all status tracking satellites similarly, treating them like descriptive sources.
  • Derive an IsActive or IsDeleted flag using business logic.

Here’s a simplified example of what that SQL might look like:

SELECT 
  pit.BusinessKey,
  sat1.Description1,
  sat2.Description2,
  CASE 
    WHEN st1.IsActive = 1 AND st2.IsActive = 1 THEN 1
    ELSE 0
  END AS IsActive
FROM Concept_PIT pit
LEFT JOIN Concept_SAT_S_source1 sat1 
  ON pit.HashKey = sat1.HashKey AND pit.LoadDate1 = sat1.LoadDate
LEFT JOIN Concept_SAT_S_source2 sat2 
  ON pit.HashKey = sat2.HashKey AND pit.LoadDate2 = sat2.LoadDate
LEFT JOIN Concept_SATST_source1 st1 
  ON pit.HashKey = st1.HashKey AND pit.LoadDateST1 = st1.LoadDate
LEFT JOIN Concept_SATST_source2 st2 
  ON pit.HashKey = st2.HashKey AND pit.LoadDateST2 = st2.LoadDate

This query structure ensures that all available data is consolidated and evaluated according to business-specific logic to determine the final dimension state.

Final Thoughts

Handling soft deletes in a Data Vault using status tracking satellites is a robust and scalable solution for real-world enterprise systems. The key is to treat these satellites as regular descriptive tables, include them in your PIT tables, and let business logic drive how deletions are interpreted in the dimension views.

Instead of deleting records from your dimensions—which can break fact table relationships—simply flag them. This gives you the full power of historical traceability while still providing clear information about current entity status. As always, your implementation should follow the business rules defined by your organization or your client.

Watch the Video

How to Model Address Data in Data Vault

Understanding the Nature of Address Data

In many systems, address data doesn’t come in a uniform format. Some systems may embed it directly in a contact or customer table—think “billing address” and “shipping address” fields in Salesforce—while others provide a separate table for addresses and even a relationship table showing links between addresses and business entities.

Let’s look at how to tackle both of these situations using Data Vault best practices.



Case 1: Addresses as Attributes Inside Another Table

If your source delivers addresses as part of another table (e.g., contact data with billing and shipping fields), the Raw Data Vault should model the data exactly as it comes. For example:

  • Create a Contact Hub with a business key for contacts.
  • Attach a Satellite containing billing and shipping address fields like city, street, and ZIP code.

Even if multiple contacts share the same address, duplication is acceptable in the Raw Vault—it’s a reflection of how the source system delivers the data. Optimization or deduplication can happen in the Business Vault or information marts.

Case 2: Addresses in a Separate Table

When your source system contains a dedicated address table, you have two main modeling options:

Option A: Treat Address as a Hub

  • Create an Address Hub using a business key. If no natural key exists, use a surrogate/technical key.
  • Attach a Satellite to store descriptive fields (e.g., street, city, ZIP).
  • Use a Link to relate addresses to other Hubs like Contact, Customer, or Lead.

This pattern is especially useful in industries like insurance, where addresses are treated as critical business objects (e.g., accident location).

Option B: Treat Address as Reference Data

  • Store addresses in a flat reference table.
  • Use an ID (like address_id = 55) as a code in a descriptive Satellite on related Hubs.

This is simpler but comes with limitations—it doesn’t track historical changes. For example, if a street name changes, the system won’t retain that history.

Tracking Changes in Reference Data

To address the historical limitation, consider modeling reference data with:

  • A Reference Hub (no hashed keys, just the code, load date, and record source).
  • A Reference Satellite to track changes over time using load date as part of the primary key.

This design allows you to maintain a history of changes in descriptive reference data without violating Data Vault principles.

Handling Many-to-Many Relationships Between Addresses and Entities

Things get more interesting when addresses are shared across multiple business objects (e.g., one address used by both a Lead and a Customer). If your source includes a bridge table (many-to-many), follow this approach:

  • Create an Address Hub.
  • Create individual Hubs for each business object (Lead, Customer, Contact).
  • Establish a Link using the bridge table to represent relationships.
  • Track relationship history using Effectivity Satellites.

If the bridge table uses a generalized object like “Customer” to reference multiple types (Contact, Account, Lead), you’ll need:

  • A Generic Customer Hub.
  • Use raw business keys or technical IDs (UUIDs).
  • Create Links between the Customer Hub and the Address Hub.
  • Use Satellites to track effectiveness (i.e., from when to when an address is associated).

Resolving Ambiguity with Conditional Relationships

Sometimes, the source system generalizes business objects (e.g., Microsoft CRM’s Customer entity could be an Account, Lead, or Contact). In these cases, build a generic Customer Hub first. Then, in the Business Vault, apply conditional logic to determine if a Customer is actually a Lead, Contact, or Account—but only in the Business Vault.

This conditional logic would take the form of queries that check whether a UUID from the generic Customer Hub exists in the Lead Hub. If so, you can establish a Business Vault Link between the generic Customer and the Lead.

Guiding Principles for Modeling Address Data

  • Stay Data-Driven: Model what you see, not what you think should be there.
  • Don’t Add Conditional Logic to the Raw Vault. It belongs in the Business Vault.
  • Use Hubs for real business objects like Address when they are shared across systems or have standalone value.
  • Use Reference Data when addresses are just descriptive codes without relationships.
  • Track History with Effectivity or Reference Satellites if needed.

Ultimately, the choice between treating address data as a business object or reference data depends on your use case. If you’re dealing with complex, shared addresses with historical importance, model them as Hubs. If not, use reference tables or Satellites. But always be consistent and avoid conditional logic in the Raw Vault.

Conclusion

Modeling address data in a Data Vault architecture isn’t one-size-fits-all. Whether it’s Salesforce, SAP, or Microsoft CRM, the goal is to be faithful to your data, follow the architecture’s guiding principles, and maintain flexibility. By doing so, you ensure scalability, auditability, and long-term maintainability of your data warehouse solution.

Watch the Video

Applying Soft-Deletes in Dimensions

Soft-Deletes in Dimensions

Data Vault practitioners often encounter the challenge of handling soft deletes in dimensional modeling. While source systems may hard delete records, a data warehouse must maintain historical integrity. This is where effectivity satellites and PIT (Point-in-Time) tables become crucial. Let’s explore how to apply soft deletes effectively in Data Vault dimensions.



Why Soft Deletes?

In a data warehouse, deleting a record outright can lead to inconsistencies, particularly when historical data or fact tables reference that entity. Instead of deleting, we mark records as inactive while preserving history.

Effectivity Satellites for Status Tracking

An effectivity satellite helps track changes in a record’s lifecycle, including soft deletes. This satellite contains:

  • A deletion flag (e.g., isDeleted or isActive).
  • A deletion timestamp (deletedDate).
  • Load date timestamps to track changes.

This setup allows toggling between active and inactive states as records are deleted and restored in the source system.

One Satellite per Source System?

Each source system may handle deletions differently. If multiple sources feed the same entity, separate effectivity satellites per source ensure accurate tracking.

Adding Effectivity Satellites to PIT Tables

The PIT table aligns different satellite timelines, allowing efficient query performance. Including effectivity satellites in PIT tables ensures:

  • Accurate status tracking across snapshots.
  • Seamless integration with dimensional views.
  • Faster lookups by pre-aligning timelines.

Virtualized Dimensions Without PIT Tables

Without a PIT table, virtualized dimensions must align effectivity data dynamically. This requires:

  • Joining effectivity satellites on-the-fly.
  • Using window functions like LEAD or LAG for tracking state changes.
  • Referencing a date dimension for time-based joins.

While possible, this approach is less efficient than using PIT tables.

Final Thoughts

Soft deletes in Data Vault dimensions are best managed through effectivity satellites and PIT tables. By storing deletion timestamps and flags, we retain history while allowing records to be marked inactive instead of being removed.

Watch the Video

Close Menu