
From Manual Scripts to Automated CI/CD Pipelines

Introduction to CI/CD Pipelines

Modern data-driven organizations rely on robust, reliable, and repeatable deployment processes to maintain high data quality and accelerate delivery of new features. Yet many teams still deploy data warehouse changes manually—running SQL scripts by hand, updating tables one by one, and praying that nothing breaks. This approach is slow, error-prone, and difficult to audit.

In this article, we’ll explore how to move from manual scripts to automated CI/CD pipelines for your data warehouse. You’ll learn what a CI/CD pipeline looks like in a data platform context, the essential stages to implement, and best practices for testing and deployment. By the end, you’ll have a roadmap to transform your ad-hoc process into an efficient, reliable pipeline that scales with your team.



What Is CI/CD for Data Warehouses?

CI/CD stands for Continuous Integration and Continuous Delivery/Deployment. In traditional software engineering, CI/CD automates building, testing, and shipping code changes, reducing manual steps and accelerating feedback. In a data warehouse context, CI/CD pipelines apply schema changes, transformations, and tests against existing data—rather than replacing an application entirely—while preserving historical data.

Rather than manually modifying tables and views in production, an automated pipeline executes SQL migrations, data model updates, and validation tests. If anything goes wrong, the pipeline stops and notifies developers, preventing faulty changes from reaching your analytics environment.

Essential Stages of a Data Warehouse CI/CD Pipeline

A comprehensive CI/CD pipeline for a data warehouse can be broken into the following core stages:

  • Version Control
  • Continuous Integration (CI)
  • Testing and Validation
  • Continuous Delivery/Deployment (CD)
  • Production Verification

1. Version Control

All data warehouse code—SQL scripts, transformation models, configuration files—should live in a version control system such as GitHub, GitLab, or Azure DevOps. Version control gives you:

  • A single source of truth for your codebase
  • An audit trail of every change with commit history
  • Isolation of work via branches and pull requests
  • Collaboration features like code review and merge approvals

By following a feature-branch workflow, multiple developers can work in parallel without stepping on each other’s toes. Only merged, reviewed changes flow downstream into the pipeline.

2. Continuous Integration (CI) Phase

The CI phase triggers automatically when a developer pushes code or opens a pull request. It provides fast feedback on code quality before any changes touch shared environments. A typical CI workflow includes:

  1. Spin up an isolated CI environment: A throwaway database instance that mirrors production schema and dependencies.
  2. Apply schema and model migrations: Execute only the changed SQL models/entities plus any downstream dependencies.
  3. Run unit tests: Validate business logic in transformation code, e.g., date calculations, aggregations, and joins.
  4. Run data tests: Check data quality constraints such as NOT NULL, uniqueness, referential integrity, and custom domain rules.
  5. Verify test coverage: Ensure every added model or table has corresponding tests defined to avoid silent gaps.

If any step fails, the pipeline halts and sends notifications to the developer. This “fail-fast” approach prevents bad code from progressing further.
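
To make this concrete, here is a minimal CI job sketch using GitHub Actions and dbt. The workflow trigger, adapter, schema name, and secrets are illustrative assumptions, not a prescribed setup; restricting the build to changed models ("Slim CI") would additionally require a production manifest passed via --state.

name: dbt-ci

on:
  pull_request:
    branches: [main]

jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # The Snowflake adapter is an assumption; use your platform's adapter
      - run: pip install dbt-snowflake
      - run: dbt deps
      # Build models and run tests in an isolated, PR-specific schema;
      # dbt exits non-zero on the first failure, so the pipeline fails fast
      - run: dbt build --target ci
        env:
          DBT_CI_SCHEMA: ci_pr_${{ github.event.number }}
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}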

3. Testing and Validation Best Practices

Effective testing is critical in data warehouses, where subtle data issues can propagate into analytics dashboards. Consider these recommendations:

  • Mirror production-like data: Use a representative subset of real data for quick, realistic tests.
  • Keep tests static and fast: Predefine expected results so comparisons are deterministic and quick to execute.
  • Cover load patterns: Test both initial full loads and incremental loads, since the logic often differs.
  • Leverage patterns: If you use a modeling approach like Data Vault 2.0, reuse generic tests across hubs, links, and satellites for consistency.
  • Include dependency checks: Run tests against all downstream objects—not just the changed ones—to catch regressions.
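
As an illustration, data tests are often written so that a query returns rows only when the check fails; the pipeline then fails the build if any rows come back. The table and column names below are hypothetical:

-- Uniqueness: fails if any business key appears more than once
select customer_key, count(*) as duplicates
from stg_customer
group by customer_key
having count(*) > 1;

-- Referential integrity: fails if an order references a missing customer
select o.order_key
from stg_orders o
left join stg_customer c on c.customer_key = o.customer_key
where c.customer_key is null;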

4. Continuous Delivery/Deployment (CD) Phase

Once CI passes, changes merge into the main branch, triggering the CD phase. The pipeline typically uses a sequence of environments that escalate in trust and stability:

  • Development (Dev): A shared environment where feature branches are first merged and tested.
  • User Acceptance Testing (UAT): A higher environment closely mirroring production, used for business validation.
  • Production (Prod): The live data warehouse serving analytics and reporting.

For each environment:

  1. Deploy code: Execute migrations against the target database.
  2. Run data tests: Validate changes on actual data in that environment.
  3. Promote to next stage: If tests pass, automatically advance to the next environment; if not, alert the team and block deployment.
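
Sketched as a GitHub Actions workflow (job names, targets, and the approval mechanism are assumptions for illustration; checkout and setup steps are omitted for brevity), the promotion sequence might look like this:

on:
  push:
    branches: [main]        # CD runs after merges to main

jobs:
  deploy_dev:
    runs-on: ubuntu-latest
    steps:
      - run: dbt build --target dev    # deploy and test in Dev

  deploy_uat:
    needs: deploy_dev                  # promoted only if Dev succeeded
    runs-on: ubuntu-latest
    steps:
      - run: dbt build --target uat

  deploy_prod:
    needs: deploy_uat
    environment: production            # protection rules can require approval
    runs-on: ubuntu-latest
    steps:
      - run: dbt build --target prod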

5. Production Verification

The final stage runs smoke tests or sanity checks post-deployment in production. Typical checks include:

  • Row counts in critical tables against expected thresholds
  • Key dashboards loading without errors
  • Data freshness checks to confirm pipelines ran successfully
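
A sketch of such smoke tests in SQL, following the same rows-returned-means-failure convention (table names, thresholds, and the Snowflake-style date functions are illustrative):

-- Row-count threshold on a critical table
select 'fct_sales row count below threshold' as failed_check
from fct_sales
having count(*) < 100000;

-- Freshness: fail if nothing was loaded in the last 24 hours
select 'fct_sales is stale' as failed_check
from fct_sales
having max(load_date) < dateadd(day, -1, current_timestamp());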

Coupling automated alerts with dashboards ensures that any lingering issues are caught immediately, even after deployment.

Advanced Considerations and Tips

As your data warehouse grows, you may introduce additional environments (e.g., staging, performance-testing) or use feature flags for gradual rollouts. Other best practices include:

  • Infrastructure as Code: Manage data platform resources (compute clusters, networking) declaratively.
  • Security and Permissions: Apply least-privilege principles and rotate credentials automatically.
  • Monitoring and Observability: Instrument pipelines to collect metrics on runtime, error rates, and data quality trends.
  • Documentation Automation: Generate data lineage and schema docs from your version-controlled codebase.

Getting Started: A 5-Step Roadmap

  1. Audit and Version-Control Everything: Migrate all scripts, SQL, and configs into Git.
  2. Define a Minimal CI Pipeline: Set up an isolated test database, run migrations, and execute a few core tests.
  3. Expand Your Test Suite: Add data quality checks, unit tests for business logic, and regression tests for downstream dependencies.
  4. Establish Environments: Spin up Dev, UAT, and Prod databases, ideally isolated but structurally identical.
  5. Automate CD and Verification: Wire up automated deployments between environments with post-deploy smoke tests.

Conclusion

Transitioning from manual deployment scripts to a fully automated CI/CD pipeline can seem daunting. However, by breaking the process into clear stages—version control, CI testing, CD deployments, and production verification—you can incrementally build confidence and reduce risk. The payoff? Faster delivery of data features, better data quality, and a scalable process that grows with your team.

Ready to get started? Take the first step by placing your SQL scripts in Git and setting up a basic CI job today. Your future self—and your analytics consumers—will thank you.

Differences between Data Vault 2.0 and Data Vault 2.1

As organizations continue to grapple with rapidly evolving data landscapes, Data Vault remains a leading methodology for building scalable, auditable, and flexible data warehouses. With the release of Data Vault 2.1, practitioners and architects often ask: “What’s changed since 2.0?” In this article, we’ll dive into the differences across three core areas—design principles, ETL patterns, and modeling best practices—and show you how 2.1 enhances your ability to tackle modern data challenges like data lakehouses, data mesh, and nested JSON feeds.



1. Design Principles: Staying True but Embracing Modern Architectures

Core Continuity

At its heart, Data Vault 2.1 retains all the foundational tenets of 2.0: separation of concerns (Hubs, Links, Satellites), immutable history, and decoupling of raw data capture from business transformations. If you already have a robust 2.0 implementation, there’s no need for a forklift upgrade—2.1 is evolutionary, not revolutionary.

Lakehouses, Mesh, and Fabric

Where Data Vault 2.1 shines is in explicitly addressing emerging architectures. You’ll find guidance on integrating Vaults within data lakehouses (e.g., Delta Lake, Apache Iceberg), as well as how Vault concepts align with data mesh domains and data fabric overlays. Instead of an “Enterprise Data Warehouse” monolith, 2.1 helps you embed Vault patterns into cloud-native, distributed environments.

Logical vs. Physical Modeling

With the proliferation of diverse storage engines—relational, columnar, NoSQL document stores, and graph databases—2.1 distinguishes your logical Vault model (Hubs, Links, Satellites) from its physical implementation. You now have clear guidelines on:

  • Keeping the logical model technology-agnostic
  • Adapting physical denormalization or document embedding strategies per platform capabilities
  • Optimizing storage formats (e.g., Parquet, Delta, or JSONB) while preserving auditability

This separation equips data engineers to leverage the strengths of their chosen database without sacrificing Vault integrity.

2. ETL Patterns: From Batch to Streaming and JSON

Expanded CDC Strategies

Data Vault 2.1 deepens its coverage of Change Data Capture (CDC) patterns. You’ll find refined techniques for:

  • Transactional order guarantees: Ensuring raw Vault loads adhere to source system timestamps to preserve lineage.
  • Handling late-arriving or out-of-order events: Techniques to backfill or correct Satellites without breaking immutability.
  • Parallel loading: Avoiding cross-system dependencies by pre-joining keys within each source’s staging area.

Formalized “Pre-Join” Denormalization

2.1 codifies the practice of pre-joining business keys in staging or external views—a pattern previously covered only in practitioner forums. This denormalization step enriches payload tables with true business keys upfront, eliminating repetitive lookups during Link loads and simplifying ETL script maintenance.

JSON and Nested Structures

Perhaps the most visible ETL addition is 2.1’s JSON processing module. With more sources emitting nested, semi-structured payloads, new patterns include:

  • Flatten-first loading: Initial extraction of atomic fields into raw Satellites before storing full payloads.
  • Schema evolution handling: Capturing structural changes (added arrays or nested objects) as metadata in Vault artifacts.
  • Selective shredding: Automating transformation of common sub-documents into separate Hubs/Links/Satellites.
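
A minimal flatten-first sketch in Snowflake-style SQL (the payload structure and table names are assumptions): atomic fields are extracted while the full document is kept for auditability.

select
  payload:customer.id::string  as customer_id,   -- atomic field
  payload:order_total::number  as order_total,
  li.value:sku::string         as line_item_sku, -- one row per array element
  payload                      as full_payload   -- keep the original document
from raw_orders,
     lateral flatten(input => payload:line_items) li;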

3. Modeling Best Practices: Updated Patterns for a Distributed World

Managed Self-Service BI

Data Vault 2.1 recognizes the shift toward self-service analytics within federated teams. Best practices now recommend:

  • Role-based access controls at the raw & business Vault layers, ensuring data stewards can grant fine-grained permissions.
  • Row- and column-level security patterns that can be implemented natively in cloud warehouses (Snowflake masking policies, SQL Server RLS, etc.).
  • Embedding governance metadata in Vault tables, enabling automated lineage and impact analysis for downstream consumers.

Expanded Satellite Strategies

While 2.0 introduced Point-in-Time (PIT) and Bridge tables for performance, 2.1 adds:

  • Snapshot Satellites: Prebuilt structures for frequently queried combinations of Hubs and Satellites, ideal for dimensional views.
  • Behavioural Satellites: Grouping event-driven attributes (e.g., clickstreams) separately from master-data Satellites.

Cross-Domain Linkage

Data Vault 2.1 extends guidance on managing relationships across micro-warehouse domains—a nod to data mesh. It clarifies when to use:

  • Cross-domain Links: For relationships spanning autonomous teams with separate Hubs.
  • Reference Hubs: Capturing shared code lists (e.g., currency, country) that multiple domains consume.

4. Educational & Organizational Enhancements

Rich Video & Quiz Content

Training for 2.1 now includes extensive pre-recorded modules by Dan, focusing on conceptual foundations—freeing up live classroom time for interactive labs and advanced case studies. Over 40 quizzes interspersed throughout the curriculum reinforce learning and feed directly into certification exams.

Certification & Community

Becoming a Data Vault 2.1 certified practitioner involves:

  • 5 days of combined video and onsite training (versus one day of video + three days live in 2.0).
  • An updated exam covering new ETL patterns, JSON handling, and modern architecture integration.
  • Access to an expanded Slack community and biweekly “Vault Clinics.”

Choosing When to Adopt 2.1

Given the backwards-compatible design, migration from 2.0 to 2.1 can be phased:

  1. Retain existing Hub/Link/Satellite structures in the Raw Vault.
  2. Gradually introduce new ETL patterns (JSON shredding, snapshot Satellites) in parallel.
  3. Implement enhanced governance and self-service controls in the Business Vault.
  4. Leverage certification resources to upskill architects and engineers on updated best practices.

Conclusion

Data Vault 2.1 advances the methodology by weaving in lessons from cloud-native architectures, self-service analytics, and semi-structured data sources—without discarding the proven foundation of 2.0. Whether you’re standardizing a data mesh deployment or optimizing your JSON pipelines, 2.1 provides the patterns and guardrails needed to build a modern, auditable, and flexible data platform.

Meet the Speaker

Marc Winkelmann
Managing Consultant

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.

Unlock the Intelligence Layer: LLMs in Data Warehousing and the Future of Your Data

“Stop writing complex SQL, start talking to your data?”

This provocative question highlights a growing shift in how we interact with data. For years, getting answers from a Data Warehouse meant writing SQL queries or relying on pre-built dashboards.

For many organizations, their data platforms remain underutilized because accessing insights still requires writing code or navigating complex dashboards. It’s time to go beyond static reports and unlock a true intelligence layer on top of your data warehouse. Recent advances in Large Language Models (LLMs) and Natural Language Processing (NLP) are making data warehouses smarter, faster, and easier to use for everyone. In this article, we’ll explore how LLMs can transform the way you interact with your data – from using plain English queries instead of SQL, to AI-driven discovery of hidden insights, to enriching your data pipelines – and why this shift represents the future of data analytics.


From Complex SQL to Conversational Queries

Business users often depend on data engineers or analysts to fetch answers, creating bottlenecks in decision-making. Even data professionals spend considerable time writing and optimizing SQL rather than interpreting results. What if anyone could simply ask the data warehouse a question in plain language and get the answer? This is the promise of LLMs as an “intelligence layer” that bridges complex datasets and human comprehension. Advanced LLMs can understand a user’s question and generate the appropriate SQL queries on the fly.

This technology (often called Text-to-SQL, Natural-Language-to-SQL, or NL2SQL) has evolved rapidly, and major technology players have already taken note. For example, Databricks introduced a Natural Language Query feature (LakehouseIQ) to let users ask questions of their Lakehouse, and Snowflake is also exploring LLM-driven query capabilities.

Imagine asking your data warehouse in plain English: “What were our top-selling products last quarter by region?” This text input is passed to an LLM, often enriched with company-specific context via retrieval-augmented generation (RAG), and the system then translates it into a correct, optimized SQL query that retrieves the answer.
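
For that question, the generated query might look like the following. The star-schema layout (fact and dimension names) is a hypothetical example, not any specific vendor's output:

select
  r.region_name,
  p.product_name,
  sum(f.sales_amount) as total_sales
from fct_sales f
join dim_product p on p.product_key = f.product_key
join dim_region  r on r.region_key  = f.region_key
-- restrict to the previous calendar quarter
where f.sale_date >= date_trunc('quarter', dateadd('quarter', -1, current_date()))
  and f.sale_date <  date_trunc('quarter', current_date())
group by r.region_name, p.product_name
order by total_sales desc;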

Of course, translating natural language to SQL at an enterprise scale isn’t trivial. Complex schemas, ambiguous user input, and security considerations mean the LLM has to be both smart and careful. Uber has built such an AI system that works at enterprise scale.

Uber’s QueryGPT is an NL2SQL system that uses a multi-step, RAG-based pipeline combining LLMs with retrieval and agent modules. It fetches context via similarity search over a vector database of example queries and schema information for SQL generation. To manage Uber’s vast data ecosystem, QueryGPT employs specialized agents:

  • an Intent Agent classifies requests by business domain
  • a Table Agent suggests tables for the query
  • a Column Prune Agent trims irrelevant columns to reduce prompt length

The LLM then produces the SQL query and an explanation.

This layered design allows QueryGPT to handle large schemas and reliably generate complex multi-table queries. It’s a hybrid architecture where multiple transformer calls specialize in sub-tasks, enabling scalable, accurate NL2SQL as a production service, saving thousands of Uber employees significant time by mid-2024.

AI-Augmented Data Discovery and Insights

Beyond simply fetching results for user queries, LLMs can augment data discovery by revealing insights that users might not have explicitly asked for. Traditional dashboards show you what is happening, but a smart LLM-based system can tell you why it’s happening and highlight patterns you might not notice. This is often called augmented analytics – using AI to automatically find important correlations, trends, outliers, and drivers in your data.

LLMs excel at interpreting data outputs and providing additional context. For example, rather than just displaying a chart or a table, an LLM can generate a written summary pointing out key trends or anomalies. They can explain which metrics are up or down and suggest potential reasons (for instance, detecting that “conversion rates dipped in July, possibly due to seasonality or inventory issues”), enabling quicker and more informed decision-making.

Another area where LLMs can significantly reduce manual effort is the creation and maintenance of data catalogs. Documenting data models, table structures, and especially individual column descriptions is time-consuming and easily skipped when resources are scarce, despite being crucial for effective use and accessibility of the data. LLMs can automate large parts of this process by generating descriptions based on data profiling, SQL logic, naming conventions, and metadata.

dbt Cloud recently released its dbt Copilot AI agent, which supports developers in various ways, for example by analyzing SQL code and schema metadata to automatically generate model and column descriptions.

LLMs in Your Data Pipeline: Enrichment and Efficiency

LLMs don’t just enhance how users interact with the Data Warehouse; they can also improve the data itself and the efficiency of data engineering processes. In modern ELT (Extract-Load-Transform) pipelines, a lot of time is spent cleaning, enriching, and preparing data for analysis. Here, LLMs offer new tools to automate and augment these steps.

One promising use case is the semantic enrichment of data. Large Language Models have absorbed a vast amount of world knowledge and language patterns, and they can use that to fill gaps or add context to your raw data. For example, imagine you have a dataset of customer feedback where each entry is a text comment. An LLM could automatically classify the sentiment of each comment (positive/negative), extract key themes, or even generate a summary of common issues. In this way, unstructured data becomes structured insights without manual effort. The image below illustrates how an LLM is integrated into a data pipeline: text inputs from a CustomerFeedback table are passed to an OpenAI API endpoint, where the model returns structured sentiment labels that are then stored back in the database.

Figure: Large Language Model sentiment analysis inside the data pipeline
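
As one warehouse-native variant of this pattern (an assumption for illustration; the pipeline shown above calls an OpenAI endpoint instead), Snowflake's Cortex functions can score sentiment directly in SQL. The thresholds and column names below are illustrative:

with scored as (
  select
    feedback_id,
    comment_text,
    snowflake.cortex.sentiment(comment_text) as sentiment_score  -- roughly -1 to 1
  from CustomerFeedback
)
select
  feedback_id,
  case
    when sentiment_score >  0.3 then 'positive'
    when sentiment_score < -0.3 then 'negative'
    else 'neutral'
  end as sentiment_label   -- structured label stored back in the warehouse
from scored;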

In a practical case study, LLMs were used to enrich an academic dataset by inferring missing attributes (like guessing a person’s gender from their name with high accuracy), which outperformed dedicated API services. This showcases how LLMs can bring external knowledge and reasoning to enhance your data.

Another area is metadata enrichment. Enterprise data is often filled with cryptic column names and jargon that hinder usability. LLMs can intelligently expand abbreviations and annotate fields with business-friendly descriptions. For instance, an LLM-driven catalog might take a column labeled “CUST_ID” and annotate it as “Customer Identifier, unique ID for each customer record”.

LLMs can also assist in the coding and transformation process itself. Data engineers can leverage LLMs to generate boilerplate code or SQL for transformations, document pipeline logic in plain English, or even detect anomalies and data quality issues through pattern analysis. By automating tedious parts of data preparation and providing AI-generated suggestions, LLMs free up engineers to focus on higher-level architecture and problem-solving.

Conclusion

While the promise of an LLM-powered intelligence layer is exciting, it’s important to approach it with a clear strategy. Successful implementation requires considering a few key challenges and best practices. Data quality and governance are more crucial than ever. If your underlying data is inaccurate or poorly structured, the AI’s answers will be unreliable. As the saying goes, “garbage in, garbage out.”

Ensuring clean, well-organized data (and maintaining a robust data governance program) will help the LLM produce meaningful and correct insights. Additionally, organizations may need to fine-tune or configure their LLMs to understand industry-specific terminology or business context. This reduces the chance of the AI misinterpreting what a user asks or generating an incorrect query.

Privacy and security are another important consideration. If your data includes sensitive information, you must ensure that any AI tool accessing it complies with your security requirements. This might involve using self-hosted models or secure APIs, and setting up proper access controls.

The dream of a self-service analytics experience, “just talk to the data and get answers,” is quickly becoming a reality. This evolution may redefine roles (enabling analysts and engineers alike to focus on higher-value tasks) and open up analytics to a wider audience than ever before. It’s an exciting time to be a data professional, but also one that demands staying informed and ready to adapt.

– Ole Bause (Scalefree)

Get Started with dbt: A Quick 15-Minute Guide Using Snowflake

Introduction to dbt

dbt (data build tool) revolutionizes the way teams build and maintain analytics workflows by bringing software‐engineering best practices to SQL‐based data transformations. Instead of ad‐hoc scripts, dbt encourages version control, modular models, testing, documentation, and lineage graphs. In this guide, you’ll learn how to go from zero to your first dbt models—running entirely in the cloud on Snowflake—in under 15 minutes.



Why Choose dbt Cloud + Snowflake?

  • Fully managed: No local install or complex orchestration; dbt Cloud handles hosting.
  • Quick setup: Snowflake’s partner connector spins up a dbt trial, pre-configured with your credentials.
  • Best practices out of the box: Built-in IDE, job scheduling, Git integration, and documentation.
  • Scalable performance: Leverages Snowflake’s compute power for fast model builds.

Prerequisites

  1. A Snowflake trial account (free, 1-minute setup).
  2. A modern browser (Chrome, Firefox) or VS Code for remote development.
  3. Basic familiarity with SQL.

1. Launching dbt Cloud from Snowflake

Once logged into your Snowflake trial, navigate to the Data Products → Partner Connect pane. Scroll to find the dbt tile and click Connect → Launch. This will automatically:

  • Provision a dbt Cloud trial account
  • Create a new Snowflake database and warehouse
  • Inject Snowflake credentials into your dbt Cloud connection

You’ll land in the dbt Cloud dashboard, ready to start your first project.

2. Exploring the dbt Cloud UI

In dbt Cloud’s left navigation bar you’ll find:

  • Develop: Interactive IDE for coding models, sources, tests, documentation.
  • Deploy: Job definitions, environments, and run history.
  • Documentation: Auto‐generated docs site with lineage graphs.
  • Settings: Account, project, and Git integration.

Click Develop → IDE (hosted in the browser), or connect your VS Code. Let’s initialize our dbt project.

3. Initializing Your dbt Project

  1. In the IDE, open the Version Control pane and click Initialize dbt Project.
  2. Accept the defaults; dbt creates a dbt_project.yml and folder structure (models/, macros/, etc.).
  3. Commit the auto‐generated files in a new Git branch: “initialized dbt project”.

Your Git pane now shows uncommitted files; click Commit & Push to save the project baseline.

4. Defining Your Source Data

dbt doesn’t load data from external systems—you must point it to existing tables. Snowflake’s sample database (SNOWFLAKE_SAMPLE_DATA) contains TPC-H tables you can use.

Create a new YAML file under models/ named sources.yml with:

version: 2

sources:
  - name: tpch
    database: SNOWFLAKE_SAMPLE_DATA
    schema: TPCH_SF1
    tables:
      - name: CUSTOMER
      - name: ORDERS

Save to see the lineage graph update with two new source nodes.

5. Building a Staging Model

Staging models standardize raw tables and prepare them for downstream transformations. In models/, delete the example/ folder and instead:

  1. Create a folder called models/staging/tpch.
  2. In that folder, create stg_tpch_customer.sql:
{{ config(materialized='view') }}

select
  C_CUSTKEY   as customer_key,
  C_NAME      as customer_name,
  C_ACCTBAL   as account_balance,
  C_COMMENT   as comment
from {{ source('tpch', 'CUSTOMER') }}

Notes:

  • The source() macro resolves to the fully qualified table.
  • materialized='view' tells dbt to build a view by default.

Save and click Run → dbt run to build just this model. In seconds you’ll see a view in your Snowflake UI under the dev schema.

6. Creating a Production-Ready Dimension

Dimensions (Gold layer) contain curated, business-ready tables. Let’s filter for customers with positive balances.

  1. Create models/marts/customer_dim.sql:
{{ config(materialized='table') }}

select
  customer_key,
  customer_name,
  account_balance
from {{ ref('stg_tpch_customer') }}
where account_balance > 0

Here, ref() links to another model, ensuring correct build order and clear lineage.

Click Compile to preview generated SQL, then Run → dbt run to create the table in Snowflake.

7. Testing and Documentation

dbt encourages tests to enforce data quality:

  • Add to models/staging/tpch/schema.yml:
version: 2

models:
  - name: stg_tpch_customer
    columns:
      - name: customer_key
        tests:
          - not_null
          - unique

Run dbt test to validate your models. Any failures will be reported in the UI.

Generate documentation with dbt docs generate, then preview via dbt docs serve. Explore your project’s lineage graph and column descriptions.

8. Version Control & Deployment

dbt Cloud integrates Git for collaboration. After feature development:

  1. Commit your branch and open a pull request (GitHub, GitLab, Bitbucket).
  2. Merge into main.

In Deploy → Environments, create a production environment. Under Jobs, define a job that runs:

  • dbt seed (if you have local CSV seeds)
  • dbt run
  • dbt test

Schedule the job (e.g., hourly, daily) or trigger it on Git commits. Monitor run history and logs directly in dbt Cloud.

9. Best Practices & Next Steps

  • Modularize models: Break complex logic into smaller models.
  • Document extensively: Use YAML descriptions for sources, models, and columns.
  • Implement CI/CD: Integrate dbt Cloud jobs with your team’s CI pipeline.
  • Leverage analyses: Create analyses/ for ad-hoc queries and charts.
  • Scale with packages: Reuse community packages (e.g., dbt_utils).
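
For example, adding dbt_utils is a matter of declaring it in packages.yml (the version range below is illustrative) and running dbt deps:

packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.1.0", "<2.0.0"]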

For deeper dives, explore the official dbt documentation and the dbt Hub for community packages and best practices.

Conclusion

In just a few steps, you’ve:

  • Provisioned dbt Cloud via Snowflake
  • Initialized a dbt project with Git
  • Defined raw sources and built staging models
  • Created a production dimension with ref()
  • Tested data quality and generated documentation
  • Set up a CI/CD job for automated deployment

Now you have a repeatable, maintainable analytics pipeline. Keep building new models, add tests, and document as you go—your future self (and team!) will thank you.

Pre-Joining Data Vault Business Keys During Load

Data Vault architects often encounter a common challenge when their source systems mix surrogate technical IDs with true business keys. In one real-world scenario, an Employee table used a technical ID as its primary key, while the legitimate business identifier was NBR. Downstream tables—even a self-referencing manager hierarchy—used that ID as a foreign key. When building Hubs on both ID and NBR and linking them with a Same-As Link, the result resembles a Source Vault design—and it quickly becomes unwieldy to repeat ID→NBR lookups for every referencing table.

In this article, we’ll explore the best practice for handling mismatched keys: pre-joining business keys in your staging (or view) layer before loading the Raw Data Vault. You’ll learn why this denormalization is fully compliant with Data Vault principles, how to implement it (even with CDC feeds), and when you might need alternate approaches.



Why Separate Business Keys from Technical IDs?

Surrogate keys (ID) offer stable integer references, simple indexing, and isolation from business rule changes. But they aren’t meaningful outside the operational schema. True business keys (NBR)—like employee numbers, order numbers, or product SKUs—carry real-world meaning and ensure consistency across downstream BI and analytics models.

  • Business Key (NBR): Immutable identifier used in reporting, cross-system integration, and audit.
  • Technical ID (ID): Auto-generated surrogate for OLTP performance and referential integrity.

When you build a Hub on the business key, all Links and Satellites must reference that same key. Mixing in surrogate IDs without conversion violates business lineage and forces repetitive lookups.

The Pitfall: Source Vault by Accident

Faced with source tables referencing ID, some teams create:

  1. A Hub on ID (surrogate),
  2. A Hub on NBR (business key),
  3. A Same-As Link between them to tie ID↔NBR.

This “Source Vault” pattern captures technical IDs as though they were business keys—contradicting the principle that your Vault’s integration key must be a shared business identifier. Moreover, every time you load any Link or Satellite that uses ID, you must look up NBR via the Same-As Link. Tedious, error-prone, and defeating the agility of your Data Vault.

Best Practice: Pre-Join Business Keys in Staging

Data Vault training explicitly allows you to reshape your staging area—denormalizing or normalizing source data to simplify Raw Vault loads. Pre-joining means: before your load process begins, enrich every source record with the true business key (NBR) rather than the surrogate ID. You then feed the Hub/Satellite/Link loaders with business keys directly—no Same-As Link gymnastics required.

How It Works

  1. Create a view or staging query that joins your Employee table (on ID) to itself or to the hierarchy table to retrieve NBR. Add the NBR field into every downstream staging record.
  2. Use that pre-joined staging view as the source for your Data Vault loaders. All Hubs, Links, and Satellites can now reference NBR consistently.
  3. Drop the accidental Hub on ID and Same-As Link—your Vault only contains the true business key Hub (Hub_Employee on NBR).
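
A sketch of the staging view from step 1, using hypothetical source table and column names in the spirit of the article's Employee example:

create or replace view stg_employee_hierarchy as
select
  h.employee_id,                 -- surrogate ID from the source
  e.nbr as employee_nbr,         -- true business key, pre-joined in
  m.nbr as manager_nbr,          -- resolve the self-referencing manager FK too
  h.load_date
from src_employee_hierarchy h
join src_employee e on e.id = h.employee_id
left join src_employee m on m.id = h.manager_id;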

This approach turns the repeated lookup problem into a one-time denormalization, improving performance and maintainability.

Tool Support: dbt & FlowBI

Modern Data Vault toolkits recognize this pattern:

  • datavault4dbt package: Offers a pre_join feature to automatically enrich staging tables with business keys.
  • FlowBI: Includes configuration options to map surrogate IDs to business keys before Vault loads.

When your staging area resides on a cloud data lake (e.g., AWS S3 + Redshift Spectrum, Azure Data Lake + Synapse), these tools can reference external tables and materialize pre-joined views seamlessly.

Handling CDC & Empty Deltas with Forward Lookup

Change Data Capture (CDC) introduces a nuance: sometimes the Employee record doesn’t change (NBR remains the same), but the hierarchy table (manager assignment) does. A pure inner-join staging view would omit the hierarchy change because no new employee row appeared.

Forward Lookup Pattern

  1. When your CDC batch contains only hierarchy changes, load your staging view with the ID column but no accompanying NBR.
  2. Instead of joining to the source Employee table (which has no new row), perform a forward lookup against the target Satellite in your Raw Vault. That Satellite already stores every historic mapping of ID→NBR.
  3. Retrieve the latest NBR value for each ID and inject it into your staging records—just as if you had joined to the source.
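
A sketch of the forward lookup in steps 2 and 3, resolving the latest known NBR per ID from the target Satellite (object names are illustrative):

select
  d.employee_id,
  s.nbr as employee_nbr           -- last known mapping from the Raw Vault
from stg_hierarchy_delta d
left join (
  select id, nbr,
         row_number() over (partition by id order by load_date desc) as rn
  from sat_employee
) s
  on s.id = d.employee_id
 and s.rn = 1;                    -- keep only the most recent row per ID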

This assumes your CDC infrastructure and initial loads correctly populated the Satellite. If CDC reliability is questionable, you may need to capture ID first and resolve to NBR later in a Business Vault layer.

Caveats: Pre-Join Within a Single Source Only

While you can (and should) pre-join within one source system, avoid chaining pre-joins across multiple systems. If you first wait for System A’s staging load to produce NBR from its ID, and then join System B’s staging to A’s data, you introduce cross-system load dependencies. That forces you to serialize loads—waiting for one system’s batch to finish before you can process another.

To maintain parallel ingestion, each source should be pre-joined only to its own business keys. If two systems share a business key, let that intersection happen downstream in your Vault (via Hub on the shared business key), not in the staging layer.

When You Can’t Pre-Join: Source Vault as Last Resort

In rare environments where you cannot reliably pre-join—legacy databases with locked-down permissions or untrustworthy CDC—you may fall back to a Source Vault. In this design:

  • Your Hubs use surrogate ID as the primary key (capturing the technical ID).
  • You defer mapping to true business keys into the Business Vault layer, after all sources land.

Source Vaults make sense only when staging denormalization is impossible. Otherwise, they sacrifice business clarity for expedience.

Step-By-Step Implementation Guide

  1. Identify Business Keys: Catalog each table’s true business key(s)—not the surrogate PKs.
  2. Build Staging Views: For each source, create a view that LEFT JOINs the “owner” table back to itself (or to its lookup tables) to pull in NBR wherever ID appears.
  3. Validate Keys: Ensure every staging record includes a non-null business key. Flag or quarantine any orphans (IDs without known business key).
  4. Configure Load Scripts: Point your Vault loaders (Hubs, Links, Satellites) at these staging views. Remove any loaders that target surrogate key Hubs or Same-As Links.
  5. Implement Forward Lookup: For CDC batches that may omit source changes, add a fallback join to the Employee Satellite in your load script to fetch the last known NBR for each ID.

Benefits of Pre-Joining Business Keys

  • Simplicity: One denormalization step replaces dozens of repetitive lookups.
  • Performance: Staging views optimize key enrichment in set-based SQL rather than row-by-row Link loads.
  • Lineage: Your Raw Vault contains only true business keys, preserving clear end-to-end lineage.
  • Maintainability: Future source schemas that reference ID get mapped automatically via the shared staging view logic.

Conclusion

Mismatched surrogate and business keys need not derail your Data Vault design. By embracing pre-joining business keys in your staging layer—along with forward lookups for CDC edge cases—you preserve a clean, business-centric Vault model without cumbersome Same-As Links. Reserve Source Vaults only for environments where staging denormalization simply cannot occur. With these best practices, your Vault remains performant, transparent, and aligned with true business identifiers.

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

10 Essential Skills Your Team Needs to Build an Analytical Data Platform


Building a modern analytical data platform is more than just choosing the right database or ETL tool. It requires a blend of business insight, data expertise, architecture design, and automation savvy. In this article, we’ll explore ten essential skills your team needs to design, develop, and maintain a robust, scalable, and high-value data platform.



1. Business Understanding

Before diving into any technical work, your team must understand the business domain and the data itself. This doesn’t mean every engineer needs to be a data analyst, but they should know:

  • Which source systems hold the data (CRM, ERP, marketing platforms, etc.)
  • Key business objects (customers, contracts, opportunities) and how they relate
  • Business processes behind the data, like a customer’s lifecycle or sales funnel

By grounding the team in real-world outcomes—such as improving customer retention or reducing churn—engineers stay focused on delivering measurable ROI.

2. Objective Setting & ROI Focus

Clear objectives guide every stage of your platform’s development. Whether your goal is to accelerate financial reporting or enable real-time marketing analytics, defining the desired outcomes:

  • Helps prioritize features and data sources
  • Aligns stakeholders around common metrics
  • Boosts motivation by tying work to tangible business value

Teams that regularly track ROI milestones can adjust scope and resources proactively, ensuring the platform grows in step with organizational goals.

3. Data Understanding & Modeling

A deep dive into your source systems reveals hundreds—even thousands—of tables. Your engineers need to know:

  • Primary and foreign keys connecting entities
  • Relationship cardinalities (one-to-one, one-to-many, many-to-many)
  • Data quality quirks and domain-specific rules

This understanding informs the modeling approach—be it third normal form, star schemas, or Data Vault—ensuring downstream analytics are consistent and reliable.

4. Data Acquisition Techniques

Extracting data from source systems can take many forms:

  • Full daily extracts via CSV or JSON files
  • API calls for near-real-time data feeds
  • Change Data Capture (CDC) for incremental updates

Knowing when to use each approach minimizes data latency, reduces load times, and optimizes storage. CDC, in particular, slashes the volume of data transferred, but requires robust handling to maintain consistency.

5. Structured Architecture: The Medallion Approach

Dumping raw data into a single database is a recipe for chaos. Instead, adopt a layered “medallion” architecture:

  • Bronze Layer (Staging/Landing): Raw data as ingested
  • Silver Layer (Cleansed, Integrated): Unified and harmonized data across systems
  • Gold Layer (Presentation): Curated tables/views for business users and BI tools

This separation of concerns simplifies debugging, improves performance, and clarifies responsibilities for each team member.

6. Data Integration & Modeling in the Silver Layer

The silver layer is where the “magic” happens:

  • Integrating disparate systems into a unified view
  • Applying your chosen modeling technique (e.g., star schema, Data Vault)
  • Ensuring referential integrity and consistent business definitions

Investing in a proven modeling framework not only scales with additional data sources but also enables automation and accelerates the onboarding of new subject areas.

7. Temporality & Historical Tracking

Beyond technical timestamps (extract and load times), your data has business timelines:

  • Contract start/end dates
  • Customer sign-up and churn events
  • Promotion or campaign effective periods

Implementing snapshot tables, slowly changing dimensions, or time-aware modeling ensures accurate trend analysis, historical comparisons, and auditability.
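
For example, an as-of join against a type 2 slowly changing dimension keeps facts aligned with the attribute values that were valid at the time; the table and column names here are illustrative:

select
  f.contract_id,
  f.contract_amount,
  d.customer_segment             -- the segment valid on the contract date
from fct_contracts f
join dim_customer_history d
  on d.customer_key = f.customer_key
 and f.contract_date >= d.valid_from
 and f.contract_date <  coalesce(d.valid_to, date '9999-12-31');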

8. Code Generation & Automation Tools

Hand-coding every pipeline is time-consuming and error-prone. Leverage tools that:

  • Automatically generate ETL/ELT code based on templates
  • Orchestrate complex workflows and dependencies
  • Enforce consistency through standard patterns and conventions

Automation not only speeds up development but also improves data quality by reducing manual interventions.

9. Agile Development & Traceability

Adopting an agile mindset means delivering small, working increments quickly. Apply traceability by:

  • Defining clear targets (e.g., monthly revenue report)
  • Mapping those targets back to specific source tables
  • Focusing on data that directly supports your objectives

This approach prevents “scope creep” and ensures that every pipeline built serves an immediate analytical need.

10. DevOps & Cost Management

Once pipelines are automated, you need:

  • Orchestration frameworks (e.g., Airflow, Dagster) to schedule and monitor jobs
  • CI/CD for data code, including version control and automated testing
  • Cost monitoring tools to track cloud resource usage and optimize performance

Effective DevOps practices guarantee reliability, while cost-awareness keeps your platform sustainable in the cloud era.

Conclusion

Building an analytical data platform is a multifaceted endeavor. By equipping your team with these ten skills—spanning business understanding, data modeling, architecture design, automation, and DevOps—you’ll lay the foundation for a platform that delivers consistent insights, scales gracefully, and drives real business value.

When to Use Reference Tables in Data Vault?

In modern Data Vault 2.0 implementations, teams often face a recurring question: “We have dozens of small, static lookup tables—should we model them as full Hubs and Satellites, or can we use simpler reference tables?” If you’re dealing with Excel sheets containing tens or hundreds of rows of relatively stable data (like Profit Centers, Status Codes, or Region mappings), this article will help you decide when a lightweight reference table suffices—and when you need the auditability of a Hub/Satellite pattern.



Understanding Business Data vs. Reference Data

First, it helps to distinguish two broad categories of data:

  • Business Objects: Entities that your processes create and update constantly—Customers, Orders, Products, etc. You generate new keys and change descriptive attributes frequently.
  • Reference Data: Code lists and lookup tables that describe or classify business objects—Country codes, Profit Center codes, Contract types. These change infrequently and usually in small batches.

Although some tables can straddle the line (e.g., Profit Centers may be “business objects” for accounting teams), it’s often safe to treat truly stable code lists as reference data for modeling purposes.

Simple Reference Tables: Pros and Cons

A simple reference table in your Data Vault is nothing more than a flat table with:

  • Primary Key: Your reference code (e.g., profit_center_code).
  • Attributes: The 2–5 descriptive columns you need (e.g., profit_center_name, region).
  • No History: Only the current state is stored; updates overwrite existing rows.

Advantages: Easy to implement, minimal objects, straightforward joins at query time.
Disadvantages: No built-in historical tracking—updates will retroactively change past reports, and you cannot reconstruct previous descriptions.

When Simple Reference Tables Are Appropriate

Consider a flat reference table when:

  • Your business requirement only needs the latest values.
  • Updates are extremely rare (quarterly or less) and don’t require audit trails.
  • Performance of lookups is not mission-critical (small table sizes).
  • You have no regulatory or internal need to reproduce past descriptions.

If any of these criteria fail—especially auditability—then a simple reference table can become a liability.

Introducing Reference Hubs and Reference Satellites

To combine simplicity with history, use the Reference Hub & Reference Satellite pattern. This mirrors the standard Hub/Satellite design, but optimized for code lists:

  • Reference Hub:
    • business_key: the code (e.g., PROFCTR_001)
    • record_source: data origin
    • load_date: date the code list was loaded
  • Reference Satellite:
    • business_key (FK to Hub)
    • load_date (also part of PK)
    • Descriptive attributes (e.g., name, region, valid_from)

This approach captures every change to your reference data without overwriting, and still keeps your model lightweight.
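
A minimal DDL sketch of the pattern, using the Profit Center example (types and names are illustrative):

create table ref_hub_profit_center (
  business_key   varchar   not null,   -- e.g., PROFCTR_001
  record_source  varchar   not null,
  load_date      timestamp not null,
  primary key (business_key)
);

create table ref_sat_profit_center (
  business_key       varchar   not null references ref_hub_profit_center,
  load_date          timestamp not null,   -- part of the PK: one row per change
  profit_center_name varchar,
  region             varchar,
  primary key (business_key, load_date)
);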

How It Works in Practice

  1. Load the Hub: Insert every code once (or refresh if new codes appear).
  2. Load the Satellite: For each code, insert a new row whenever any descriptive attribute changes, tagging it with the load_date.
  3. Querying: In your dimension or Information Delivery layer, join from your business object Satellite (or Link) directly into the Reference Satellite on code, filtering to the row with the latest load_date ≤ transaction date.

Because reference tables are small, these joins remain performant even when you compare on dates.
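
The lookup in step 3 can be sketched as a correlated subquery that picks the latest Satellite row at or before each transaction date (table names are illustrative):

select
  t.transaction_id,
  t.profit_center_code,
  r.profit_center_name            -- description valid at transaction time
from fct_transactions t
join ref_sat_profit_center r
  on r.business_key = t.profit_center_code
 and r.load_date = (
      select max(load_date)
      from ref_sat_profit_center
      where business_key = t.profit_center_code
        and load_date <= t.transaction_date
    );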

Aligning Reference Data with Business Vault Snapshots

For organizations using a Business Vault layer with snapshot dates, you may need to “time-align” reference data. Two patterns are common:

  • On-the-fly alignment: In your reporting view, use the transaction’s snapshot_date and join to the Reference Satellite where load_date ≤ snapshot_date, picking the latest record.
  • PIT/Bridge tables: Precompute “Point-In-Time” (PIT) tables that store the reference code’s surrogate key aligned to each business object snapshot for faster querying.

Choose the pattern that balances your performance SLAs with data freshness requirements.

Auditability and Regulatory Compliance

If you operate in regulated industries (banking, telecom, government), audit trails are mandatory. The Reference Hub/Satellite pattern ensures:

  • Complete lineage and history of every code change.
  • Reproducibility of past reports with original reference descriptions.
  • Ability to support retrospective analyses without reloading or reconstructing data.

Even if your initial business users only ask for current values, future sprints or stakeholders may require historical context—so building auditability upfront can save costly refactoring.

Performance Considerations

Reference tables typically contain at most hundreds of rows. However, you should still consider:

  • Indexing: Ensure load_date and business_key are indexed for fast lookups.
  • Partitioning: Generally unnecessary for small tables but useful if your Satellite grows into thousands of deltas over years.
  • Join Strategy: In most SQL engines, joining a large transaction Satellite to a small Reference Satellite on code + latest date filter is efficient. If not, consider a PIT table.

Governance and Knowledge Transfer

Whatever pattern you choose, document and govern your reference data:

  • Maintain an authoritative data dictionary describing each code list, source, update frequency, and steward.
  • Set up automated tests (e.g., CI/CD validations) to detect unexpected code changes.
  • Implement alerts for large volumes of reference updates that may indicate data quality issues.

Decision Checklist

Use this quick checklist when evaluating a table for reference modeling:

  1. Is the table truly static or slow-changing? (Quarterly or less)
  2. Are there audit or historical requirements? (Regulatory or future use cases)
  3. Is the table small enough (< 1,000 rows) to avoid performance concerns?
  4. Do you need to reconstruct past reports with original descriptions?
  5. Would a simple change in the future (e.g., a retroactive update) break historical reports if you used a flat table?

If you answered “yes” to questions 2 or 5, the Reference Hub/Satellite pattern is the safer choice. Otherwise, a simple reference table may be sufficient.

Conclusion

Static lookup tables in a Data Vault 2.0 implementation can be modeled either as simple reference tables or with a Reference Hub & Satellite pattern. While flat tables are easier to build, they lack historical tracking and auditability. By adopting the Reference Hub/Satellite approach, you gain full change history, reproducible reporting, and alignment with regulatory demands—while retaining a lightweight design.

Use the decision checklist above to guide your modeling choices, and ensure your Data Vault remains both agile and compliant as your organization’s needs evolve.

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
