
Julian Brunner

Julian Brunner is a Senior Business Intelligence Consultant and Project Manager at Scalefree with over six years of experience in modern data architecture. A CDVP2 and SnowPro Advanced Data Engineer, he specializes in real-time streaming and cloud migrations for leading international companies in the technology and financial sectors. Julian combines a Master’s in Business Informatics with a Scrum Master mindset to deliver structured, scalable solutions.

Set-Based Multi-Active Satellite Derived From a Record-Level Multi-Active Satellite

Multi-Active Satellites: Handling Delta Loads and Set-Based Derivation

A detailed modeling question came in recently about deriving a set-based Multi-Active Satellite from a record-level Multi-Active Satellite, and whether using the resulting Business Satellite as input for a parent PIT Table is valid Data Vault practice. The answer involves a few clarifications on terminology and a practical approach to delta loading that makes the Business Vault layer largely unnecessary for this use case. This post breaks it down.



Multi-Active Satellites: The Full Load vs. Delta Load Problem

A Multi-Active Satellite captures multiple active records per business key at the same point in time — phone numbers are the classic example. A person can have a home number, a mobile number, and a work number all active simultaneously. In the Satellite, each row is uniquely identified by the hash key, the Load Date Timestamp, and the Multi-Active Key (in this case, the phone type).

The Hashdiff in a Multi-Active Satellite is calculated across the entire group — all active records for that business key at that load date — not per individual row. This means when any record in the group changes, the Hashdiff changes for the whole group, and the entire group is re-inserted with a new Load Date Timestamp. The old group is virtually end-dated. This works cleanly when you receive full loads: every batch contains all active records, so you always have the complete group to work with.
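
For illustration, a minimal sketch of the group-level Hashdiff calculation, assuming illustrative table and column names and a dialect with LISTAGG (e.g. Snowflake or Oracle):

```sql
-- Minimal sketch: one Hashdiff per group, not per row (illustrative names).
-- A stable ordering within the group is essential; otherwise the same set of
-- records could hash differently from load to load.
SELECT
    hk_person,
    MD5(
        LISTAGG(
            phone_type || '|' || COALESCE(phone_number, ''),
            '||'
        ) WITHIN GROUP (ORDER BY phone_type)
    ) AS group_hashdiff
FROM stg_phone
GROUP BY hk_person;
```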

The challenge arises with delta loads. If the source only sends what changed — say, a new work number is added, but the home and mobile numbers are not re-sent because they didn’t change — you can’t calculate the correct group-level Hashdiff from the incoming batch alone. The group is incomplete.

Reconstructing the Full Group from Delta Loads in the Raw Data Vault

The solution is to reconstruct the full Multi-Active group before loading, without moving this logic into the Business Vault. The approach is straightforward: derive the most recent Multi-Active group from the existing Satellite, combine it with the incoming delta records, and use the resulting complete set to calculate the Hashdiff and load the Satellite as if it were a full batch for that group.

In practice, a staging table acts as the assembly point. The latest group from the Satellite is pulled into staging alongside the incoming delta. Together, they form the complete current group. From there, the standard Multi-Active Satellite loading pattern applies — the Hashdiff is calculated over the full group, a new Load Date Timestamp is assigned, and all records in the group are inserted together.
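
A minimal sketch of that assembly step, assuming illustrative names (sat_person_phone is the existing Multi-Active Satellite, delta_phone the incoming delta):

```sql
-- Minimal sketch: assemble the complete current group in staging
-- (illustrative names). Delta records win where the Multi-Active Key
-- overlaps; unchanged group members are carried over from the satellite.
INSERT INTO stg_phone (hk_person, phone_type, phone_number)
SELECT s.hk_person, s.phone_type, s.phone_number
FROM sat_person_phone s
JOIN (
    SELECT hk_person, MAX(load_dts) AS max_load_dts
    FROM sat_person_phone
    GROUP BY hk_person
) latest
  ON  s.hk_person = latest.hk_person
  AND s.load_dts  = latest.max_load_dts
WHERE s.hk_person IN (SELECT hk_person FROM delta_phone)   -- only changed groups
  AND NOT EXISTS (
      SELECT 1
      FROM delta_phone d
      WHERE d.hk_person  = s.hk_person
        AND d.phone_type = s.phone_type
  )
UNION ALL
SELECT hk_person, phone_type, phone_number
FROM delta_phone;
```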

This approach handles delta loads cleanly in the Raw Data Vault, which means there’s no need for a Business PIT Table or a Business Satellite just to reconstruct the full set. The reconstruction happens at load time, not at query time.

Using the PIT Table to Manage Granularity

The second part of the question was about reducing the Multi-Active group to a single record per business key per Load Date Timestamp — storing that reduced result in a Business Satellite and using it as input for the parent PIT Table.

This is valid, but there’s a lighter alternative worth considering: handle the granularity reduction directly in the PIT Table rather than creating a dedicated Business Satellite for it.

A standard PIT Table references a Multi-Active Satellite via the hash key and Load Date Timestamp, which points to the entire group. If you want the full group available for querying, this is all you need — the PIT gives you the reference, and the join returns all active records for that timestamp.

If you only want one specific record from the group — say, just the mobile number — you add the Multi-Active Key as an additional column in the PIT Table. The PIT row then carries the hash key, the Load Date Timestamp, and the specific Multi-Active Key value you want. The join returns exactly one record. Selecting which Multi-Active record to surface is business logic, and the PIT Table is a clean place to encode it without materializing an intermediate Business Satellite.

If you need to surface multiple specific records — home number and mobile number separately — you add additional column sets to the PIT Table, one per record type. Each set carries its own hash key, Load Date Timestamp, and Multi-Active Key reference. This keeps everything in one structure and avoids unnecessary materialization.
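
For the single-record case, the join could look like the following minimal sketch, assuming illustrative names; the multi-record variant simply repeats the column set:

```sql
-- Minimal sketch: the extra Multi-Active Key column in the PIT pins the join
-- to exactly one member of the group (illustrative names).
SELECT
    pit.hk_person,
    pit.snapshot_dts,
    sat.phone_number
FROM pit_person       AS pit
JOIN sat_person_phone AS sat
  ON  sat.hk_person  = pit.hk_person
  AND sat.load_dts   = pit.sat_phone_load_dts
  AND sat.phone_type = pit.ma_key_phone;  -- e.g. 'MOBILE', chosen by business logic
```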

When a Business Satellite Does Make Sense

A Business Satellite for this purpose isn’t wrong — it’s just not always necessary. If the reduced, single-record-per-key result is consumed by multiple downstream processes and materializing it improves performance or simplifies maintenance, building a Business Satellite is a reasonable choice. But if the only goal is to filter down to one record for a specific downstream view, doing it in the PIT Table is simpler and avoids creating an entity whose sole purpose is granularity reduction.

The key principle: keep the Raw Data Vault responsible for capturing the full, accurate group, and make granularity and selection decisions in the PIT Table or Business Vault based on what downstream consumption actually requires.

To go deeper on Multi-Active Satellites, PIT Table design, and the full Data Vault methodology, explore our Data Vault training and certification programs. The free Data Vault handbook is also available as a physical copy or ebook.

Watch the Video

How to Model Multi-Type Business Keys in Data Vault

Multi-Type Business Keys

One of the more subtle challenges in Data Vault modeling appears when a single source delivers multiple types of business keys in one structure. Things get even more complex when those keys:

  • Point to different business objects depending on a type indicator
  • Have different formats than the “real” business keys used in existing hubs
  • Require cleansing or transformation before they can be linked correctly

In this article, we’ll walk through how to model such a scenario in Data Vault, how to think about Links and key types, and—most importantly—where business logic and transformations should live.



The Problem Statement

Let’s restate the scenario in simple terms.

Your source table looks something like this:

| ItemBK | KeyType | KeyValue |
|--------|---------|----------|
| ITEM1  | A       | T123     |
| ITEM1  | B       | 999-45   |

From a business perspective:

  • ItemBK identifies an item (one clear Hub candidate)
  • KeyType determines which business object the key refers to
  • KeyValue contains the actual identifier

The complication comes from the target data model:

  • For KeyType A, the key belongs to Hub A, but the real business key is 123, not T123
  • For KeyType B, the key belongs to Hub B and can be used as-is

So the value T123 must eventually link to Hub A using the cleaned value 123.

This raises three classic Data Vault questions:

  • How should this be modeled from staging to the Raw Data Vault?
  • Should the key type be stored in the Link?
  • Where should key cleansing and transformation happen?

First Principle: Preserve the Source

Before jumping into modeling options, let’s clarify a core Data Vault principle:

The Raw Data Vault must always allow you to recreate the original source delivery.

This means:

  • No irreversible transformations in staging
  • No hidden business rules applied too early
  • Full auditability back to the source

Every modeling decision below is guided by that principle.

Identifying the Core Business Objects

From the example, we can already identify several business concepts:

  • Item → clearly a Hub (Hub Item)
  • KeyValue → a business key, but its meaning depends on KeyType
  • Relationships → Item is related to different business objects depending on KeyType

This is a common pattern: a single source column contains references to different hubs.

Modeling Option 1: Pivot the Source (When Key Types Are Fixed)

If—and this is a big if—you can guarantee that:

  • The number of key types is small
  • The set of key types is stable (e.g. only A and B)

Then a pivoted staging approach can work.

What This Looks Like

You pivot KeyType into separate columns:

| ItemBK | KeyValue_A | KeyValue_B |
|--------|------------|------------|
| ITEM1  | T123       | NULL       |
| ITEM1  | NULL       | 999-45     |

This allows you to:

  • Load Hub Item from ItemBK
  • Load Hub A from KeyValue_A
  • Load Hub B from KeyValue_B
  • Create Links between Item and each Hub
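
In staging, the pivot can be a simple conditional projection; a minimal sketch, assuming an illustrative staging table stg_item_keys:

```sql
-- Minimal sketch: pivot a fixed, known set of key types (illustrative names).
-- An optional GROUP BY ItemBK with MAX() would collapse this to one row per item.
SELECT
    ItemBK,
    CASE WHEN KeyType = 'A' THEN KeyValue END AS KeyValue_A,
    CASE WHEN KeyType = 'B' THEN KeyValue END AS KeyValue_B
FROM stg_item_keys;
```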

Limitations

This approach breaks down quickly when:

  • A new key type appears
  • Key types are dynamic or unknown
  • The source evolves frequently

In those cases, pivoting creates a brittle staging layer and forces frequent refactoring.

Modeling Option 2: Generic Hub for Typed Keys

When key types are dynamic or extensible, a more robust pattern is required.

In this case, treat the combination of KeyType + KeyValue as a business key.

Create a Generic “Key” Hub

This hub represents “external identifiers” regardless of their semantic meaning.

  • Hub Key
    • Business Key = KeyType + KeyValue

Why composite?

  • Key values may overlap across types
  • Only the combination uniquely identifies the business object

You then:

  • Load Hub Item from ItemBK
  • Load Hub Key from (KeyType, KeyValue)
  • Create a Link between Item and Hub Key

This guarantees:

  • Full preservation of the source
  • No assumptions about future key types
  • Clean separation of concerns
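
A minimal loading sketch for the generic hub, assuming illustrative names, MD5 hashing, and '|' as the delimiter between the two key parts:

```sql
-- Minimal sketch: the hash key covers BOTH parts of the composite business key
-- (illustrative names; MD5 and '|' as delimiter are assumptions).
INSERT INTO hub_key (hk_key, key_type, key_value, load_dts, record_source)
SELECT DISTINCT
    MD5(UPPER(TRIM(s.KeyType)) || '|' || UPPER(TRIM(s.KeyValue))),
    s.KeyType,
    s.KeyValue,
    CURRENT_TIMESTAMP,
    'STG_ITEM_KEYS'
FROM stg_item_keys s
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_key h
    WHERE h.key_type  = s.KeyType
      AND h.key_value = s.KeyValue
);
```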

But What About the Existing Hubs (Hub A, Hub B)?

This is where many teams are tempted to inject business logic too early.

Yes, you already have Hub A and Hub B. But the mapping logic that says:

  • “If KeyType = A, this belongs to Hub A”
  • “If KeyType = B, this belongs to Hub B”

is business logic.

Business logic does not belong in the Raw Data Vault.

The Correct Place: Business Vault

In the Business Vault, you can:

  • Create derived business hubs
  • Filter by KeyType
  • Union keys from different raw sources

For example:

  • Business Hub A = existing Hub A UNION keys from Hub Key where KeyType = ‘A’
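
As a minimal sketch, assuming illustrative names, such a derived hub could be a Business Vault view; note the hash key is recalculated on the cleansed value so it aligns with the existing Hub A:

```sql
-- Minimal sketch: derived Business Hub A as a Business Vault view
-- (illustrative names).
CREATE VIEW bv_hub_a AS
SELECT hk_a, bk_a, load_dts, record_source
FROM hub_a
UNION
SELECT
    MD5(SUBSTRING(key_value, 2)) AS hk_a,  -- hash recalculated on the cleansed key
    SUBSTRING(key_value, 2)      AS bk_a,  -- business rule: drop the leading 'T'
    load_dts,
    record_source
FROM hub_key
WHERE key_type = 'A';
```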

This approach ensures that:

  • Raw data remains unchanged and auditable
  • Business rules are explicit and versionable
  • Changes in logic don’t force Raw Vault reloads

Should the Key Type Be Stored in the Link?

The answer depends on the modeling option.

Option 1 (Pivoted Source)

No.

Once pivoted, the key type is implicit in the structure. There is no longer a “KeyType” concept in the Link.

Option 2 (Generic Key Hub)

Again, no.

The key type becomes part of the Hub business key. The Link simply connects Item to the generic Key Hub.

If—and only if—key values are guaranteed to be unique across types, an alternative is:

  • Store KeyType in a Satellite on the Hub

But in most real-world cases, using a composite business key is safer.

Where Should Key Cleansing and Transformation Happen?

This is the most important question.

In the example:

  • T123 must become 123 to link to Hub A

This is a business rule.

What Not to Do

  • Do not strip the “T” in staging
  • Do not load cleaned values directly into the Raw Hub

Why?

  • The rule may change
  • You lose the original source value
  • You break auditability

The Recommended Pattern: Same-As Link

Instead, load the raw value T123 as-is and create a Same-As Link.

This link connects:

  • The “raw” business key (T123)
  • The “master” business key (123)

Benefits:

  • Full traceability
  • Explicit business logic
  • Flexible remapping if rules change

Same-As Links are purpose-built for this kind of semantic deduplication.
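
A minimal sketch of deriving such a link, assuming illustrative names and the strip-the-'T' rule from the example:

```sql
-- Minimal sketch: Same-As Link rows mapping the raw key to the master key
-- (illustrative names; a real loader would also de-duplicate against the target).
INSERT INTO sal_key (hk_sal, hk_raw_key, hk_master_key, load_dts, record_source)
SELECT
    MD5(k.key_value || '|' || SUBSTRING(k.key_value, 2)),
    MD5(UPPER(TRIM(k.key_type)) || '|' || UPPER(TRIM(k.key_value))),  -- 'T123' as delivered
    MD5(SUBSTRING(k.key_value, 2)),                                   -- '123' after the rule
    CURRENT_TIMESTAMP,
    'RULE_STRIP_T_PREFIX'
FROM hub_key k
WHERE k.key_type = 'A';
```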

Putting It All Together

A robust end-to-end solution looks like this:

  • Stage data exactly as delivered
  • Load ItemBK into Hub Item
  • Load (KeyType + KeyValue) into a generic Hub
  • Link Item to the generic Hub
  • Apply business logic in the Business Vault
  • Use Same-As Links to handle key transformations

Key Takeaways

  • Multi-type business keys are common—and tricky
  • Preserve the source structure in the Raw Data Vault
  • Avoid early cleansing and transformation
  • Use composite business keys when uniqueness is not guaranteed
  • Apply business rules in the Business Vault
  • Use Same-As Links for key normalization

Handled correctly, this pattern gives you maximum flexibility, auditability, and long-term stability—exactly what Data Vault was designed for.

Watch the Video

Automation Options for Data Vault

How Automation Tools Are Changing the Game

Quick takeaway: Automation for Data Vault spans multiple waves — from template-driven code generation that builds the vault structure to modern generative-AI tools that help discover the model itself and AI copilots that accelerate business-rule development. Combined, these approaches dramatically cut time-to-value, reduce errors, and let teams focus on the parts that actually create business impact.



Why automation matters for Data Vault

Data Vault was designed for change: it separates raw capture from business logic, records full history and provenance, and uses standardized patterns (hubs, links, satellites). Those same patterns make the model ideal for automation. Manual Data Vault development works, but it’s slow and error-prone — especially when you must onboard many sources, handle evolving business keys or prove lineage for audits and AI projects. Automation reduces repetitive work, enforces consistency, and lets your engineers and architects spend time on modelling decisions and business rules, not boilerplate SQL.

The evolution of automation — three waves

Automation for Data Vault didn’t appear overnight. Think about it in three waves:

  1. Manual coding era: Everything by hand — raw ingestion, keys, history tracking, and the transformations. Effective, but slow and brittle.
  2. Template-driven automation (first wave): Tools that generate physical vault objects and standard loading code from a defined model. They speed up delivery and cut repetitive errors.
  3. AI-driven automation (second wave) + AI copilots: Tools that assist or even automate the model discovery itself, and AI copilots that generate business logic or transformation code — moving humans from creators to reviewers.

Understanding these waves helps you choose the right mix — existing template tools remain valuable, while AI tools are rapidly becoming practical for model discovery and logic generation.

First-wave automation: structured, reliable, repeatable

The first-wave tools are the ones most teams have used for years. Their primary job is to take a model and generate the physical implementation and ETL/ELT pipelines. Key benefits:

  • Speed: Generating hub, link and satellite structures with standard loading patterns significantly reduces delivery time.
  • Consistency: Every table and load pattern follows the same, tested template — fewer bugs and easier maintenance.
  • Orchestration and operations: Many tools build pipelines, manage hash keys, and include scheduling and error handling.

These tools are excellent when you already have a trusted logical model and want to automate the “how” of implementation. They do not solve the “what” (the model discovery) — that still requires human analysis.

Second-wave automation: AI-assisted model discovery

The real shift happens when automation starts to help with — or take over — the model discovery process itself. Instead of hand-crafting hubs and links, generative AI platforms can scan source systems and metadata to propose an initial Raw Data Vault logical model. What does that look like in practice?

  • Source scanning: The tool ingests table/field metadata, sample values and constraints.
  • Entity discovery: It suggests candidate hubs (business entities) by grouping columns and identifying recurring patterns and unique keys.
  • Key recommendation: It proposes business-key candidates and highlights primary/unique candidates derived from the source.
  • Relationship discovery: It suggests link structures where keys appear together or where foreign-key relationships are inferred.
  • Satellite design hints: The AI may split attributes into satellites based on volatility, sensitivity (PII), or update patterns.

This capability moves the needle: modelers become reviewers and validators instead of building every piece from scratch. It accelerates onboarding of new sources and shortens the path to a working Raw Vault.

AI copilots for the Business Vault and transformations

While model discovery is one hard problem, translating business requirements into transformation logic is another. This is where AI copilots shine. Integrated into developer environments, they can:

  • Generate SQL transforms from plain-English requirements (e.g., “calculate monthly churn rate by customer segment”).
  • Create complex joins, window functions and aggregations that implement business rules.
  • Suggest test cases, edge-case handling and simple data quality checks.
  • Accelerate the creation of information-marts (star schemas) by scaffolding the necessary queries and documentation.

Important caveat: copilots are accelerators, not autopilots. Generated code still needs human review for correctness, performance and governance. But they massively reduce the repetitive cognitive load and let experienced engineers focus on validation and optimisation.

Putting the technologies together: a practical workflow

Here’s a pragmatic, step-by-step workflow that mixes first-wave tools and AI capabilities into a usable process:

  1. Connect an AI discovery tool to your sources. Let it propose hubs, links and satellites.
  2. Review and refine the AI-suggested model with domain experts — confirm business keys and entity definitions.
  3. Export the logical model into a template-driven automation tool (e.g., dbt or WhereScape). Generate physical tables, load patterns and orchestration pipelines.
  4. Use AI copilots to implement Business Vault logic and information-marts — write high-level requirements and have the copilot scaffold the SQL/Python transform code.
  5. Run tests and checks: automated unit tests, data quality checks and lineage validation.
  6. Deploy and monitor: schedule pipelines, monitor failures and feed back findings into the model or automation templates.

This end-to-end process reduces the time spent in data plumbing and increases time spent on business validation and value delivery.

Governance, auditability and human-in-the-loop

Automation is powerful, but it must sit inside proper governance. Because Data Vault is often used for regulatory and audit-sensitive environments, keep these guardrails in place:

  • Human review points: AI should suggest, not decide. Model approvals and business-key selection must be explicit sign-offs by domain owners.
  • Lineage and provenance: Ensure automation tools emit metadata and lineage so every generated artifact is traceable back to sources and the AI suggestions that influenced it.
  • Testing and validation: Automatically generate tests for any AI-generated transformation and fail deployments until tests pass.
  • CI/CD and version control: Keep generated models and transformations in version control so you can audit changes over time.

When these controls exist, you get the speed of automation without sacrificing compliance or trust.

When automation is the right move — and when to hold back

Automation fits particularly well when:

  • Your landscape includes many sources and you expect change.
  • You need fast onboarding (M&A or rapid product expansion).
  • Traceability and auditability are core requirements.
  • You want to reduce repetitive developer work and scale the team’s output.

Consider holding off or using a hybrid approach when:

  • Your environment is tiny and unlikely to change — heavy automation may be overkill.
  • Source data semantics are ambiguous and require deep domain expertise that AI cannot infer reliably.
  • You lack governance and testing practices to safely validate generated models and code.

Risks, limitations and best practices

Generative AI is not perfect: it can hallucinate or misinterpret faint signals in metadata. Best practices to mitigate risk include:

  • Always pair AI output with domain validation. Treat AI suggestions as draft artefacts, not final products.
  • Enforce tests: Automated unit tests and data quality checks should gate deployment.
  • Keep humans in the loop: Use model reviewers, not model builders — domain experts must accept or correct AI outputs.
  • Capture metadata: Store which AI model/version produced which suggestion for future audits.

Final thoughts — how to get started

If you’re curious about Data Vault automation, start small: pick one source or one high-value report and run it through an AI-assisted discovery + template automation pipeline. Measure the outcomes: time saved, errors avoided, and the number of iterations required to reach stakeholder approval. Use these metrics to build a business case and expand automation incrementally.

Automation won’t replace thoughtful modelling and governance, but used correctly it turns weeks of repetitive engineering into hours and lets teams focus on the decisions that move the business forward.

Watch the Video

From Manual Scripts to Automated CI/CD Pipelines

Introduction to CI/CD Pipelines

Modern data-driven organizations rely on robust, reliable, and repeatable deployment processes to maintain high data quality and accelerate delivery of new features. Yet many teams still deploy data warehouse changes manually—running SQL scripts by hand, updating tables one by one, and praying that nothing breaks. This approach is slow, error-prone, and difficult to audit.

In this article, we’ll explore how to move from manual scripts to automated CI/CD pipelines for your data warehouse. You’ll learn what a CI/CD pipeline looks like in a data platform context, the essential stages to implement, and best practices for testing and deployment. By the end, you’ll have a roadmap to transform your ad-hoc process into an efficient, reliable pipeline that scales with your team.



What Is CI/CD for Data Warehouses?

CI/CD stands for Continuous Integration and Continuous Delivery/Deployment. In traditional software engineering, CI/CD automates building, testing, and shipping code changes, reducing manual steps and accelerating feedback. In a data warehouse context, CI/CD pipelines apply schema changes, transformations, and tests against existing data—rather than replacing an application entirely—while preserving historical data.

Rather than manually modifying tables and views in production, an automated pipeline executes SQL migrations, data model updates, and validation tests. If anything goes wrong, the pipeline stops and notifies developers, preventing faulty changes from reaching your analytics environment.

Essential Stages of a Data Warehouse CI/CD Pipeline

A comprehensive CI/CD pipeline for a data warehouse can be broken into the following core stages:

  • Version Control
  • Continuous Integration (CI)
  • Testing and Validation
  • Continuous Delivery/Deployment (CD)
  • Production Verification

1. Version Control

All data warehouse code—SQL scripts, transformation models, configuration files—should live in a version control system such as GitHub, GitLab, or Azure DevOps. Version control gives you:

  • A single source of truth for your codebase
  • An audit trail of every change with commit history
  • Isolation of work via branches and pull requests
  • Collaboration features like code review and merge approvals

By following a feature-branch workflow, multiple developers can work in parallel without stepping on each other’s toes. Only merged, reviewed changes flow downstream into the pipeline.

2. Continuous Integration (CI) Phase

The CI phase triggers automatically when a developer pushes code or opens a pull request. It provides fast feedback on code quality before any changes touch shared environments. A typical CI workflow includes:

  1. Spin up an isolated CI environment: A throwaway database instance that mirrors production schema and dependencies.
  2. Apply schema and model migrations: Execute only the changed SQL models/entities plus any downstream dependencies.
  3. Run unit tests: Validate business logic in transformation code, e.g., date calculations, aggregations, and joins.
  4. Run data tests: Check data quality constraints such as NOT NULL, uniqueness, referential integrity, and custom domain rules.
  5. Verify test coverage: Ensure every added model or table has corresponding tests defined to avoid silent gaps.

If any step fails, the pipeline halts and sends notifications to the developer. This “fail-fast” approach prevents bad code from progressing further.
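
For illustration, the data tests in step 4 often reduce to queries that must return zero rows; a minimal sketch, assuming illustrative Data Vault entity names:

```sql
-- Minimal sketch: CI data tests as zero-row queries (illustrative names).

-- Uniqueness: no duplicate business keys in the customer hub.
SELECT customer_bk, COUNT(*) AS cnt
FROM hub_customer
GROUP BY customer_bk
HAVING COUNT(*) > 1;

-- Referential integrity: every satellite row must have a parent hub row.
SELECT s.hk_customer
FROM sat_customer_details s
LEFT JOIN hub_customer h
  ON h.hk_customer = s.hk_customer
WHERE h.hk_customer IS NULL;
```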

3. Testing and Validation Best Practices

Effective testing is critical in data warehouses, where subtle data issues can propagate into analytics dashboards. Consider these recommendations:

  • Mirror production-like data: Use a representative subset of real data for quick, realistic tests.
  • Keep tests static and fast: Predefine expected results so comparisons are deterministic and quick to execute.
  • Cover load patterns: Test both initial full loads and incremental loads, since the logic often differs.
  • Leverage patterns: If you use a modeling approach like Data Vault 2.0, reuse generic tests across hubs, links, and satellites for consistency.
  • Include dependency checks: Run tests against all downstream objects—not just the changed ones—to catch regressions.

4. Continuous Delivery/Deployment (CD) Phase

Once CI passes, changes merge into the main branch, triggering the CD phase. The pipeline typically uses a sequence of environments that escalate in trust and stability:

  • Development (Dev): A shared environment where feature branches are first merged and tested.
  • User Acceptance Testing (UAT): A higher environment closely mirroring production, used for business validation.
  • Production (Prod): The live data warehouse serving analytics and reporting.

For each environment:

  1. Deploy code: Execute migrations against the target database.
  2. Run data tests: Validate changes on actual data in that environment.
  3. Promote to next stage: If tests pass, automatically advance to the next environment; if not, alert the team and block deployment.

5. Production Verification

The final stage runs smoke tests or sanity checks post-deployment in production. Typical checks include:

  • Row counts in critical tables against expected thresholds
  • Key dashboards loading without errors
  • Data freshness checks to confirm pipelines ran successfully
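
A minimal sketch of such checks, assuming illustrative names and thresholds:

```sql
-- Minimal sketch: post-deployment smoke checks (illustrative names and
-- thresholds; interval syntax varies by platform).
SELECT CASE WHEN COUNT(*) >= 1000000 THEN 'PASS' ELSE 'FAIL' END AS row_count_check
FROM fact_sales;

SELECT CASE
         WHEN MAX(load_dts) >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
         THEN 'PASS' ELSE 'FAIL'
       END AS freshness_check
FROM sat_customer_details;
```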

Coupling automated alerts with dashboards ensures that any lingering issues are caught immediately, even after deployment.

Advanced Considerations and Tips

As your data warehouse grows, you may introduce additional environments (e.g., staging, performance-testing) or use feature flags for gradual rollouts. Other best practices include:

  • Infrastructure as Code: Manage data platform resources (compute clusters, networking) declaratively.
  • Security and Permissions: Apply least-privilege principles and rotate credentials automatically.
  • Monitoring and Observability: Instrument pipelines to collect metrics on runtime, error rates, and data quality trends.
  • Documentation Automation: Generate data lineage and schema docs from your version-controlled codebase.

Getting Started: A 5-Step Roadmap

  1. Audit and Version-Control Everything: Migrate all scripts, SQL, and configs into Git.
  2. Define a Minimal CI Pipeline: Set up an isolated test database, run migrations, and execute a few core tests.
  3. Expand Your Test Suite: Add data quality checks, unit tests for business logic, and regression tests for downstream dependencies.
  4. Establish Environments: Spin up Dev, UAT, and Prod databases, ideally isolated but structurally identical.
  5. Automate CD and Verification: Wire up automated deployments between environments with post-deploy smoke tests.

Conclusion

Transitioning from manual deployment scripts to a fully automated CI/CD pipeline can seem daunting. However, by breaking the process into clear stages—version control, CI testing, CD deployments, and production verification—you can incrementally build confidence and reduce risk. The payoff? Faster delivery of data features, better data quality, and a scalable process that grows with your team.

Ready to get started? Take the first step by placing your SQL scripts in Git and setting up a basic CI job today. Your future self—and your analytics consumers—will thank you.

Watch the Video

Persistent Staging Area vs Transient Staging Area: Key Differences and When to Use Each

Persistent Staging Area vs Transient Staging Area

Data architecture decisions can make or break the efficiency, flexibility, and scalability of your analytics platform. One such decision revolves around staging areas—more specifically, whether to use a Persistent Staging Area (PSA) or a Transient Staging Area (TSA). Both serve as critical components in the data pipeline, but they serve different needs and use cases.

In this post, we’ll explore the pros and cons of each approach, examine when to use a PSA versus a TSA, and explain how they align with modern data strategies such as Data Vault 2.0. Whether you’re a data engineer, architect, or BI consultant, this guide will help you make a more informed choice for your data warehouse design.



Why Do You Need a Staging Area?

Before diving into PSA vs TSA, let’s understand the purpose of a staging area:

  • Source System Isolation: Quickly extract data from operational systems to reduce their load.
  • Performance Optimization: Decouple extraction and loading processes to improve efficiency.
  • Preprocessing: Perform data validation, cleansing, and simple transformations—though in ELT architectures like Data Vault 2.0, this is minimized.

What Is a Persistent Staging Area (PSA)?

A PSA stores multiple historical batches of source data, often indefinitely. It serves as a historical repository and is commonly implemented on a NoSQL database or data lake using formats like JSON or Parquet.

Advantages of PSA

  • Full Reload Capability: If your raw Data Vault needs rebuilding (e.g., due to modeling mistakes), PSA allows reloading from historical data without re-accessing source systems.
  • Schema Flexibility: Semi-structured formats can easily accommodate changes in source schemas without breaking processes.
  • Decoupling Ingestion and Integration: You can ingest all available data and model the Data Vault incrementally, enabling agile development.
  • Support for Advanced Analytics: Data scientists can query the data lake directly for machine learning and data mining.
  • Virtualized Raw Data Vault: Enables virtualization techniques for accessing raw historical data efficiently.

Drawbacks of PSA

  • Higher Storage Costs: Storing all historical data requires more disk space, though archiving and cold storage can help mitigate this.
  • Compliance Effort: Implementing GDPR compliance (e.g., “right to be forgotten”) is more complex with persistent data storage.
  • Setup Complexity: Requires thoughtful structure to ensure efficient incremental loading into the Data Vault.
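
A common pattern for that incremental load is a high-water mark on the load timestamp; a minimal sketch, assuming illustrative names:

```sql
-- Minimal sketch: high-water-mark load from a PSA into the Raw Data Vault
-- (illustrative names; assumes a monotonically increasing load timestamp).
SELECT *
FROM psa_customer
WHERE load_dts > (
    SELECT COALESCE(MAX(load_dts), TIMESTAMP '1900-01-01 00:00:00')
    FROM sat_customer_details
);
```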

What Is a Transient Staging Area (TSA)?

In a TSA, data is only available temporarily, usually for the duration of the ETL process. After the data is loaded into the target system, the staging area is cleared, often by truncating tables. TSA is typically implemented on relational databases.

Advantages of TSA

  • Simplicity: Easy to implement with a 1-to-1 copy of the source structure.
  • Lower Storage Requirements: Since data is temporary, there’s no need for extensive disk space.
  • Improved ETL Performance: Smaller data volumes and simpler management often lead to faster ETL cycles.
  • Minimal Compliance Burden: Personal data is automatically deleted after each cycle, simplifying GDPR requirements.

Drawbacks of TSA

  • No Historical Reload: If a problem occurs, you must re-extract data from the source, which might be unavailable or changed.
  • No History Preservation: Only one batch is available at a time—no record of previous data states.
  • Immediate Schema Adaptation Needed: If the source schema changes, you must adapt your data vault model right away or risk losing new attributes.

PSA vs TSA: A Side-by-Side Comparison

| Aspect | Persistent Staging Area (PSA) | Transient Staging Area (TSA) |
|--------|-------------------------------|------------------------------|
| Data Retention | Long-term, historical | Short-term, per batch |
| Storage Requirements | High (can be optimized) | Low |
| Complexity | Medium to High | Low |
| Compliance Effort (e.g., GDPR) | High | Low |
| Schema Change Handling | Handled flexibly | Needs immediate updates |
| Data Reload Capability | Yes, full history available | No, requires source system access |
| Auditability | High | Low |
| Tool Support | Ideal for data lakes, supports tools like dbt | Compatible with tools but limited scope |

When Should You Choose PSA Over TSA?

Choosing between PSA and TSA depends on your project requirements. Here’s a quick guide:

Choose PSA if:

  • You need historical data retention and full reload capabilities.
  • You want flexibility in schema evolution without breaking processes.
  • Your organization performs advanced analytics or machine learning.
  • You’re implementing a long-term, scalable, agile data architecture (like Data Vault 2.0).

Choose TSA if:

  • You’re working with resource-constrained environments.
  • You want a simpler, lightweight setup.
  • Data historization isn’t critical for your use case.
  • You need rapid ETL performance without added complexity.

How Does This Fit with Data Vault 2.0?

Data Vault 2.0 supports both PSA and TSA. However, the modern recommendation leans towards PSA, particularly with NoSQL or data lake implementations. The ability to decouple ingestion from modeling, manage schema changes gracefully, and maintain full history aligns well with agile and resilient data vault practices.

Tools like dbt do not explicitly require a PSA but benefit from having consistent data available in the staging area, regardless of its persistence.

Final Thoughts

Both PSA and TSA play important roles in data architectures. Understanding their strengths and trade-offs allows you to design data pipelines that meet your current and future needs. In most modern, data-driven organizations, a Persistent Staging Area offers the flexibility and robustness needed to manage data complexity, scale analytics efforts, and maintain auditability across the board.

However, don’t rule out TSA for its simplicity and speed, especially in prototyping or when storage is limited.

Ultimately, your choice should align with your business goals, compliance needs, and data maturity level.

Watch the Video

Maintaining the Hash Diff

Watch the Video

The Problem

Adopting a thoughtful approach to Hash Diff calculation can minimize manual maintenance, ensure data consistency, and optimize storage. Our question comes from a project where the source system occasionally delivers new columns for existing tables. When these columns are added to a satellite, the hash difference (hashdiff) calculation changes. As a result, new deltas are generated for all business keys during the next load—even if the actual data hasn’t changed. The manual recalculation of hashdiffs for historical records is time-consuming and prone to errors. Can this be avoided?



Understanding Hash Diff Changes

The hashdiff is a critical component in a Data Vault model, used to detect changes in descriptive attributes. Adding a new column changes the hashdiff logic, potentially creating unnecessary deltas, which consume additional storage and complicate data integrity checks. Let’s break this down:

  • When a new column is introduced, historical records often have NULL values for that column.
  • The updated hashdiff logic incorporates the new column, even if its value doesn’t contribute to meaningful changes.
  • This can result in false positives—new records that aren’t genuinely different.

Potential Solutions

There are several strategies to handle this scenario, each with varying levels of manual effort and maintenance:

1. Recalculating the Hash Diff Manually

One approach is to manually recalculate the hashdiff for all existing records. While effective, this method requires significant effort and is not scalable for large datasets. Additionally, updating historical records can disrupt the auditability of your Data Vault.

2. Minimizing Updates with Targeted Recalculation

A more focused strategy is to update only the current records in the satellite (those with an open-ended load date). These records are actively used for comparisons and would benefit most from updated hashdiffs. While this reduces the number of updates, it still involves manual intervention.

3. Ensuring Hash Diff Consistency Automatically

The most efficient solution is to design the hashdiff calculation to remain consistent, even when structural changes occur:

  • Add Columns Only at the End: Ensure new columns are appended to the end of the table structure.
  • Ignore Trailing Nulls: Use a function like RTRIM to remove trailing delimiters caused by NULL values. This keeps the hashdiff consistent when new columns are empty for historical records.

This approach eliminates the need for manual updates, provided that all structural changes adhere to these guidelines.

Practical Example

Consider a satellite linked to a company hub, containing records for a company’s name and address. Initially, the hashdiff calculation includes only the company name and address. When a new column, postal code, is added:

  • Historical records will have NULL values for postal code.
  • Using the RTRIM function ensures that the new column does not affect the hashdiff for these records.

This prevents unnecessary deltas, saving storage space and reducing maintenance overhead.
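
A minimal sketch of that calculation, assuming illustrative names and a dialect whose RTRIM accepts a set of trim characters (e.g. Snowflake, Oracle, PostgreSQL):

```sql
-- Minimal sketch: hashdiff that stays stable when trailing columns are NULL
-- (illustrative names). For historical rows postal_code is NULL, RTRIM strips
-- the trailing delimiter, and the hash input reduces to name and address only,
-- so existing hashdiff values remain valid.
SELECT
    hk_company,
    MD5(
        RTRIM(
            COALESCE(company_name, '') || '|' ||
            COALESCE(address,      '') || '|' ||
            COALESCE(postal_code,  ''),
            '|'
        )
    ) AS hashdiff
FROM stg_company;
```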

Handling Hash Diff Duplicates

Another question we received involved handling hard duplicates—records that are identical in every aspect, including hashdiff values. The recommended approach is to:

  • Move such duplicates into an Error Mart for auditability.
  • Fix pipeline issues if duplicates are caused by ingestion errors.
  • For soft duplicates (e.g., intraday changes), manipulate the load timestamp by adding microseconds based on sequence IDs to ensure unique records.
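
For the soft-duplicate case, a minimal sketch, assuming illustrative names and Snowflake-style DATEADD:

```sql
-- Minimal sketch: spread intraday (soft) duplicates by adding microseconds
-- derived from the source sequence ID (illustrative names; DATEADD syntax
-- varies by platform).
SELECT
    hk_customer,
    hashdiff,
    DATEADD(
        microsecond,
        ROW_NUMBER() OVER (
            PARTITION BY hk_customer, load_dts
            ORDER BY sequence_id
        ) - 1,
        load_dts
    ) AS load_dts
FROM stg_customer;
```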

Conclusion

By adopting a thoughtful approach to hashdiff calculation, you can minimize manual maintenance, ensure data consistency, and optimize storage in your Data Vault model. Whether you choose to recalculate selectively or implement hashdiff logic that handles changes automatically, the goal is the same: maintain the integrity of your data warehouse while reducing unnecessary effort.

Working With Semi-structured Data

Mastering Semi-Structured Data: Key Approaches and Best Practices

Semi-structured data, such as JSON, is increasingly common in modern data ecosystems. But how should you store and handle it? Should you store the data as-is or flatten its structure? Both approaches have unique advantages and limitations, and understanding these can help you make informed decisions based on your use cases.



Key Considerations

  • Expected Data Structure: Is the schema likely to change? Are nested objects (hierarchies) present?
  • Velocity & Size: How large and fast-moving is your data?
  • Database Capabilities: Does your system support efficient queries and manage large datasets?
  • Use Cases: What operations will you perform on the data?

Approach 1: Store Data As-Is

This method involves storing the data in its original format. It’s ideal for flexibility but has limitations:

  • Pros: Quick to ingest, accommodates changing schemas, suitable for unknown operations.
  • Cons: Struggles with large files and nested queries.

Approach 2: Flatten Nested Structures

Flattening the structure simplifies data querying and scalability. However, it also has trade-offs:

  • Pros: Easy querying, no file size constraints, better for fixed schemas.
  • Cons: Complexity in handling hierarchies, loss of schema flexibility.
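
For illustration, a minimal flattening sketch, assuming Snowflake's LATERAL FLATTEN and an illustrative payload structure (other platforms offer similar functions, such as OPENJSON or jsonb_array_elements):

```sql
-- Minimal sketch: flatten a nested JSON array into rows (Snowflake syntax;
-- illustrative names and payload structure).
SELECT
    r.payload:customer_id::STRING AS customer_id,
    f.value:type::STRING          AS phone_type,
    f.value:number::STRING        AS phone_number
FROM raw_customer r,
     LATERAL FLATTEN(input => r.payload:phone_numbers) f;
```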

Data Vault Modeling: A Flexible Solution

Data Vault modeling supports both approaches:

  • Storing As-Is: Store files as non-historized links or satellites, keeping the original file in a single column. Virtual structures can be built on top.
  • Flattening Before Loading: Create standard Data Vault entities while storing the original files in a Data Lake for reference.

Choosing the right strategy depends on your operational needs and database capabilities. By considering these factors, you can efficiently work with semi-structured data while optimizing performance and flexibility.

Watch the Video

Key Factors for Data Vault Automation


We are excited to announce an upcoming webinar, “Key Factors for Data Vault Automation,” where you’ll gain valuable insights into leveraging automation to optimize your data warehousing processes. This session will feature expert speakers who will explore how Datavault Builder can streamline data modeling and significantly enhance your Data Vault implementation.

Automation has become essential in data warehousing, enabling organizations to reduce manual effort, minimize errors, and boost efficiency. Our speakers will share best practices and real-world use cases that demonstrate the transformative power of automation in Data Vault projects. You’ll learn actionable strategies to ensure a smoother, faster, and more reliable data modeling process.

Whether you are new to Data Vault or seeking ways to fine-tune your existing setup, this webinar will provide practical knowledge and tools to help you succeed. Don’t miss this opportunity to discover how to make the most of automation and take your data warehousing efforts to the next level.

Register now to secure your spot and stay ahead in the ever-evolving world of data warehousing!

Webinar Details

  • Date: November 20th, 2024
  • Time: 15:00 – 16:00 CET

Watch Webinar Recording

Why Do We Need Hubs in Data Vault?

Watch the Video

Why Hubs in Data Vault are Essential

Data Vault modeling is a powerful methodology for building robust and scalable data warehouses. One of its core components, the Hub, often raises questions among practitioners and stakeholders. Why do we need hubs? Can’t we just simplify the model by putting business keys directly into satellites? In this article, we delve into the reasons behind the existence of hubs and explore scenarios where deviating from the standard practice might be acceptable.



The Role of Hubs in Data Vault

Hubs play a pivotal role in Data Vault by storing a distinct list of business keys. These keys serve as unique identifiers for real-world entities, such as customers, products, or employees. Hubs provide several critical benefits:

    1. Data Integration: Hubs act as anchors for integrating data from disparate source systems. By consolidating different representations of the same entity into a single hub, you ensure consistency and accuracy across your data warehouse.
    2. Scalability: Hubs facilitate seamless scalability. When new data sources are introduced, you can simply add the business keys to the existing hub without the need for major model refactoring. This simplifies the onboarding of new data and reduces the risk of introducing inconsistencies.
    3. Auditability: Hubs maintain a clear lineage and audit trail for your data. The load timestamp in a hub functions as a “first seen” date, making it easy to track the evolution of your data over time.
    4. Granularity: Perhaps most importantly, hubs define the granularity of multiple downstream objects, including information marts and dimensions. This granularity is crucial for accurate reporting and analysis, making hubs indispensable for many use cases.

Why Not Put Business Keys in Satellites?

While hubs are generally considered best practice, there are rare instances where storing business keys in satellites might be justifiable. One such scenario is when a business key represents an entity that currently lacks descriptive data and is not actively queried.

For example, consider an employee dataset that includes the vehicle identification number (VIN) of the employee’s company car. If there’s no additional information about the car and no immediate need to query it, treating the VIN as a descriptive attribute within the employee satellite might be acceptable.

However, if the need to query or analyze data related to company cars arises in the future, a refactoring strategy called “Hub It Out” can be employed. This involves extracting distinct VIN numbers from the employee satellite into a new hub, creating links between the employee and car hubs, and potentially adding satellites with descriptive data about the cars.
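
A minimal sketch of that first extraction step, assuming illustrative names and MD5 hashing:

```sql
-- Minimal sketch: step one of "Hub It Out" (illustrative names).
INSERT INTO hub_car (hk_car, vin, load_dts, record_source)
SELECT
    MD5(UPPER(TRIM(vin))),
    vin,
    MIN(load_dts),            -- preserves the hub's "first seen" semantics
    'SAT_EMPLOYEE_DETAILS'    -- refactoring source, not the original system
FROM sat_employee_details
WHERE vin IS NOT NULL
GROUP BY vin;
-- A link between Hub Employee and Hub Car, plus any new car satellites,
-- would follow the standard loading patterns.
```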


Important Considerations about Hubs

While the above scenario demonstrates a valid exception, it’s crucial to remember that storing business keys in satellites should be the exception, not the rule. Hubs offer numerous benefits in terms of data integration, scalability, auditability, and granularity, making them essential for most Data Vault implementations.

Before deviating from the standard practice, carefully assess whether the potential benefits of storing business keys in satellites outweigh the potential drawbacks, such as increased storage costs, redundancy, and a less elegant data model.


Conclusion

In conclusion, hubs are fundamental building blocks in Data Vault modeling, providing a range of benefits that contribute to the overall integrity, scalability, and usability of your data warehouse. While there are rare cases where storing business keys in satellites might be justifiable, it’s crucial to carefully weigh the pros and cons before adopting this approach. By adhering to Data Vault best practices and understanding the specific requirements of your use case, you can ensure that your data warehouse is optimized for performance, maintainability, and long-term success.

How to Implement Data Quality Techniques

Watch the Video

In our latest video, BI Consultant Julian Brunner tackles a pressing query: “Where to implement data quality techniques? Is it possible to clean dirty data at the entry point into the raw data vault?” Data quality is foundational for informed decision-making, and Julian’s expertise shines as he navigates this critical terrain.

Julian highlights the importance of a holistic approach to data quality management, emphasizing the need for robust frameworks spanning the entire data lifecycle. Whether it’s validation rules, data profiling, or cleansing algorithms, proactive measures at every stage can fortify data integrity.

Automating a Scalable Data Warehouse with Data Vault Builder

Watch the Webinar

Unlock the power of automation in your data warehouse with Data Vault Builder in our upcoming webinar. Dive into the intricacies of Data Vault 2.0 and discover why it’s tailor-made for automation, promising efficiency and scalability like never before. Whether you’re a seasoned data professional or just embarking on your data warehousing journey, this webinar offers invaluable insights into streamlining your processes and accelerating implementation.

During this joint webinar, you’ll delve into the core principles of Data Vault 2.0 and witness firsthand how Data Vault Builder revolutionizes the implementation process. Through a live demonstration, gain practical knowledge and actionable tips to optimize your data warehouse architecture. From overcoming common challenges to kickstarting your project with confidence, this session equips you with the tools and techniques needed to succeed in the world of data warehousing.

Don’t miss this opportunity to elevate your data warehousing game and leverage the full potential of automation with Data Vault Builder. Join us and discover how to transform your data infrastructure into a dynamic, scalable powerhouse. Whether you’re a data architect, analyst, or IT professional, this webinar promises to be a game-changer for your organization’s data strategy. Register now to secure your spot!

Watch Webinar Recording

Ensuring Data Quality in Your Data Warehouse


Ensuring Data Quality

Poor data quality can lead to inaccurate insights, flawed decision-making, and ultimately, compromised business success. In the era of big data, organizations rely heavily on data warehouses to store, manage, and analyze vast amounts of information. However, the effectiveness of a data warehouse depends on the quality of the data it contains. How can high-quality data be ensured?

In this blog article, we will dive into the significance of data quality in an enterprise data warehouse and provide practical strategies to ensure accurate, reliable, and high-quality data with Data Vault 2.0.

You might also want to check out the webinar recording on this exact topic. Watch it here for free!


What are the reasons for bad data?

Data quality refers to the accuracy, completeness, consistency, and reliability of data. In the context of a data warehouse, maintaining high data quality is crucial to derive meaningful insights and make informed decisions. Several factors contribute to the presence of poor or bad data. Identifying and understanding these reasons is essential for implementing effective data quality management strategies. Here are some common reasons for bad data in a data warehouse:

  • Incomplete or missing source data
  • Lack of standardizations
  • Data transformation issues
  • Poor data governance
  • Insufficient validation and quality checks
  • Lack of user training and awareness

Data quality techniques

A variety of data quality techniques exist, and there is no single best option for all issues. The difficulty lies in understanding the current situation and the strengths and weaknesses of the available techniques: a given technique fits well in some situations and poorly in others. Scott Ambler developed five comparison factors for assessing the effectiveness of a data quality technique. These factors, shown below, are intended to help you choose the right DQ technique for the situation you face:

Figure: Comparison factors for data quality techniques

Source: https://agiledata.org/essays/dataqualitytechniquecomparison.html

  • Timeliness: Are you reacting to a discovered DQ issue or are you applying it to proactively avoid or reduce DQ issues?
  • Level of automation: To what extent is automation possible? A continuous technique would be automatically invoked as appropriate.
  • Effect on source: Does the technique have any effect on the actual data source?
  • Benefit realization: Will the benefit of the quality improvement be immediate or is a long-term benefit to be expected?
  • Required skills: Does the technique require demanding skills that may need to be gained through training/experience or is the technique easy to learn?

The benefit of the Data Vault 2.0 approach

When bad data is detected, the first step is a root cause analysis. What if the bad data originates from the source? The best approach would be to fix the errors directly in the source system. However, this method is often rejected as too costly. Since the sources are out of scope for the data warehousing team, we need a way to clean the bad data somewhere in our architecture. In Data Vault 2.0, a data cleansing routine is considered a business rule (soft rule), and those rules are implemented in the Business Vault.

In the architecture shown in Figure 1, a Quality Layer is integrated into the Business Vault, where the data cleansing routines are performed. The purpose is to make the cleansed data highly reusable for downstream Business Vault and information mart objects. If the data quality rules change, or new knowledge about the data is obtained, the rules can be adjusted without having to reload any previous raw data.

Now the data is ready for use in any dashboarding or reporting tool. It is also possible to write the cleansed data back to the source. For this purpose, the data is provided to an Interface Mart, which in turn sends it back to the source system. In this way, business users can utilize the high-quality data inside their source applications as well. The next time the raw data is loaded into the Raw Data Vault, it is already cleansed.

Figure 1: Data quality architecture with a Quality Layer in the Business Vault

The second use case described in Figure 1 is the monitoring of bad data by a so-called quality mart. The quality mart is part of the information delivery layer and selects all the bad data instead of the cleansed data. Based on this, reports or graphical user interfaces can be created for the data steward. In addition, the data steward can leave comments on certain records that should not be considered bad data or are exceptions to the rules. This user interface stores all added data (comments, flags, etc.) in a database, which in turn serves as a source for the Data Vault. This data can be used to extend the business rules, in particular to filter out exceptions to the data cleansing rules.

Another advantage of Data Vault 2.0 is its highly pattern-based, standardized entities. This enables simple, automated development of technical tests. Once created for each Data Vault entity type, these tests can be applied to both Raw Data Vault and Business Vault entities. This ensures a consistent and auditable data warehouse. Check out this blog article if you need more in-depth information regarding technical tests in Data Vault.
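
For illustration, one such pattern-based test could check the satellite grain, with the same template applied to Raw Data Vault and Business Vault entities alike; a minimal sketch, assuming illustrative names:

```sql
-- Minimal sketch: one grain-test template reused across entities (illustrative
-- names). (hash key, load timestamp) must be unique; zero rows means pass.
SELECT hk_customer, load_dts, COUNT(*) AS cnt
FROM sat_customer_details            -- Raw Data Vault entity
GROUP BY hk_customer, load_dts
HAVING COUNT(*) > 1;

SELECT hk_customer, load_dts, COUNT(*) AS cnt
FROM bv_sat_customer_cleansed        -- Business Vault entity, same template
GROUP BY hk_customer, load_dts
HAVING COUNT(*) > 1;
```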

Common data quality techniques

In the last section, we described a few techniques for ensuring high data quality in a Data Vault 2.0 architecture. Of course, there are a number of other techniques that are independent of the modeling approach used. In addition, some techniques do not focus primarily on data quality but still have a positive influence on it. Let’s take a closer look at some of them below:

  • Validation of business rules: At this point, we have to distinguish between data quality and information quality. Data quality focuses on the intrinsic characteristics of the data, addressing issues such as errors, inconsistencies, and completeness at the granular level. Information quality is a broader concept that encompasses not only the quality of individual data elements but also the overall value and usefulness of the information derived from those data. Beyond that, what is useful information for one business case may not be sufficient for another. For this reason, the business users must be heavily involved in this process, for example through user acceptance tests.

  • Data Governance involves defining roles, responsibilities, and accountability for data quality, ensuring that data is treated as a valuable organizational asset. Develop and enforce data governance frameworks, including data quality standards, stewardship responsibilities, and documentation.

  • Data Guidance and Standardization ensures uniformity in formats, units, and values across the data warehouse, reducing the risk of errors caused by variations in data representation. Establish and enforce standardized naming conventions, units of measure, formatting rules, and data security/privacy conventions. Moreover, Data Vault 2.0 is very helpful in this regard, as all entities are highly standardized and automatable.

  • Data Steward: As part of the Data Governance practice, a data steward is an oversight/governance role within an organization and is responsible for ensuring the quality and fitness for purpose of the organization’s data.

  • Continuous Integration (CI) is a development practice where developers integrate their work frequently. Successful tests should be an obligatory condition for introducing any new change to your EDW code base. That is achievable by using DevOps tools and enabling continuous integration in your development lifecycle. Running automated tests each time code is checked in or merged ensures that any data consistency issues or bugs are detected early and fixed before they reach production.

  • A Review is a peer review of the implementation (source code, data models etc.). Developing a strong review process sets a foundation for continuous improvement and should become part of a development team’s workflow to improve quality and ensure that every piece of code has been looked at by another team member.

  • User training and Awareness: Educate users on the importance of data quality and provide training on the necessary topics and skills. Foster a culture of data quality awareness within the organization to encourage proactive identification and resolution of data quality issues.

Conclusion

There is no question that high data quality is essential for a successful data warehousing project. The process towards high data quality is not a one-time effort but an ongoing commitment. It is a multifaceted process that involves a combination of techniques, collaboration across teams, and fostering a culture of data stewardship.

In this article, we have addressed the causes of bad data and discussed various techniques for dealing with these issues. More accurately, we described how to implement data quality techniques within a Data Vault 2.0 architecture.

If you want to dive deeper into data quality, remember to watch the free webinar recording.
