Skip to main content
search
0

Google BigQuery and dbt Cloud Quickstart: Full Setup Guide

Smiling man in a white shirt points to the left at a blue-to-teal gradient background with the text 'BUILD IT' and colorful logo icons to the left.

Get Started with Google BigQuery and dbt Cloud

If you are looking to build a modern, scalable data transformation workflow, combining dbt Cloud with Google BigQuery is one of the most powerful stacks available today. In this guide — based on a hands-on walkthrough video — we cover everything from setting up your Google Cloud project to running your first automated deployment job. Whether you are new to analytics engineering or already familiar with SQL-based transformations, this tutorial will get you productive quickly.



What Is dbt and Why Does It Work So Well with Google BigQuery?

Before diving into the setup steps, it helps to understand what each tool actually does and why they complement each other so naturally.

dbt (data build tool) is an open-source transformation framework that brings software engineering best practices into data teams. With dbt, your SQL code is version-controlled, tested, documented, and executed through scheduled jobs — the same discipline that software engineers apply to application code. You write modular SQL models, dbt compiles them, and the actual query execution happens entirely on your data platform.

Google BigQuery is a fully managed, serverless cloud data warehouse built for large-scale analytics. It handles the storage and compute, scales automatically, and integrates tightly with the broader Google Cloud ecosystem. Critically for this tutorial, BigQuery hosts publicly available datasets you can query immediately — no manual data loading required to get started.

Together, dbt and BigQuery form a clean separation of concerns: dbt manages your transformation logic, documentation, and pipeline orchestration, while BigQuery handles the heavy lifting of storing and querying your data at scale. This makes the combination especially attractive for analytics engineering teams that want speed, reliability, and maintainability.

Step 1: Set Up Your Google Cloud Project and BigQuery Dataset

Everything starts in the Google Cloud Console. If you do not already have a Google Cloud account, you can start for free — Google offers around $300 in credits with no credit card required upfront, which is more than enough for this quickstart.

Once you are in the console, navigate to the Cloud Resource Manager and create a new project. For this demo, a name like bigquery-dbt-quickstart works well. After the project is created, head to the BigQuery console from the main navigation menu.

In BigQuery, you will immediately notice that the terminology differs slightly from traditional databases. Instead of database → schema → table, BigQuery uses project → dataset → table. The concepts map directly, but knowing this terminology prevents confusion when you are configuring dbt later.

Create a new dataset inside your project — call it something like dbt_demo. During creation, you will be asked to set a data location (choose a region that makes sense for your use case, or a multi-region option like US or EU), an optional expiration policy, and an encryption setting. The defaults are fine for a quickstart.

The public tutorial data you will be using lives in a project called dbt-tutorial, under a dataset named jaffle_shop. It contains customer and order tables — exactly what you need to build a meaningful first model.

Step 2: Create a Service Account and Generate a JSON Key

For dbt Cloud to run queries in BigQuery on your behalf, it needs authenticated access. The most straightforward approach for a quickstart is a Google Cloud service account with a JSON key file.

In the Google Cloud console, use the Credentials Wizard under APIs & Services to create a new service account. Name it something like dbt-user. Assign it two roles:

  • BigQuery Data Editor — allows the service account to create and modify tables
  • BigQuery User — allows it to run jobs against the project

Once the service account is created, open it and navigate to the Keys tab. Create a new JSON key. This downloads a file to your local machine — keep it secure and do not commit it to version control. You will upload it to dbt Cloud in the next step.

Note: In a production environment, you would typically use OAuth or Workload Identity Federation rather than a JSON key file. For this quickstart, the service account key is the simplest and most transparent option.

Step 3: Connect dbt Cloud to BigQuery

With your service account key ready, log in to dbt Cloud and create a new project. When prompted to set up a connection, choose BigQuery and upload the JSON key file you just downloaded. dbt Cloud will parse the file and automatically populate your project and credential details.

Set a target name for the connection — something like dbt-demo — and save. For the repository, this demo uses dbt’s managed Git repository, which is the quickest option. In a real production project, you would connect dbt Cloud to your own GitHub, GitLab, or Azure DevOps repository and use pull requests and CI/CD pipelines to manage code changes properly.

Step 4: Initialize Your dbt Project

Open the dbt Cloud Studio (the browser-based IDE). You will see a button in the top-left to initialize your dbt project. Click it — dbt will automatically generate the standard project scaffold, including the dbt_project.yml file and a default folder structure.

The dbt_project.yml file is the configuration file that tells dbt this folder is a dbt project. It is where you set the project name, define global defaults, and configure materialization strategies per folder. Every SQL model you place inside the models/ folder from this point forward can be compiled, tested, documented, version-controlled, and executed against BigQuery automatically.

You will also notice that the initializer creates some example models. Delete the examples/ subfolder inside models/ to start with a clean structure — but be careful not to delete the models/ folder itself.

Once your project structure looks clean, commit the initial files with a message like project initialization and you are ready to start building.

Step 5: Define Your Sources

In dbt, sources are the starting point of every project. They tell dbt where your raw input data lives — which project, dataset, and table — without hardcoding those references directly into each model file. This makes your project far more maintainable: if a source location changes, you update it in one place and every model that uses it automatically picks up the change.

Create a new file in your models/ folder and name it sources.yml. The structure always starts with version: 2 (required for the current dbt YAML schema), followed by a sources: key. Under that, define a source with a name (for example, jaffle_shop), specify the BigQuery project as the database (dbt-tutorial), the dataset (jaffle_shop), and list the individual tables you want to reference — in this case, customers and orders.

After saving the file, the dbt Cloud lineage panel will update to show your newly recognized sources. They will appear as the entry points in your project’s data lineage graph.

Step 6: Build Staging Models

A core best practice in dbt is to never reference raw source tables directly in your business logic models. Instead, you create a thin staging layer — one model per source table — that lightly cleans and standardizes the data. Typical staging transformations include renaming columns to consistent naming conventions, casting data types, and filtering out clearly invalid records.

dbt Cloud makes this step even easier: the lineage view shows a link next to each detected source table that lets you auto-generate a staging model scaffold. Click it, and dbt generates a boilerplate SQL file using Common Table Expressions (CTEs) and the source() macro.

The source() macro is one of the first things that makes dbt feel different from plain SQL. Instead of writing SELECT * FROM dbt-tutorial.jaffle_shop.customers, you write SELECT * FROM {{ source('jaffle_shop', 'customers') }}. dbt compiles this at runtime into the correct fully qualified BigQuery reference, keeping your code clean and location-independent.

In the staging model for customers, rename the id column to customer_id for clarity. Save the file and use the Preview button to verify that dbt can query the data from BigQuery — this confirms your connection and credentials are working correctly.

When you are ready to persist the model, click Build. dbt will create the model in BigQuery under your personal development dataset — a sandboxed schema tied to your dbt Cloud user account. This means you can build and iterate on models freely without touching anything in your production environment.

If you switch back to the BigQuery console at this point, you will find a new dataset named after your dbt username, and inside it, a view representing your staging model. Repeat the same process for the orders source table.

Step 7: Build a Customer Dimension Model

With your staging models in place, you are ready to build a more meaningful transformation: a customer dimension that joins customer data with order history.

Create a new folder inside models/ called marts/ (a common convention for final output models), and inside it create a file called dim_customers.sql.

In this model, instead of the source() macro, you will use the ref() function — the other core dbt macro. Where source() points to raw input tables, ref() points to other dbt models you have already defined. Writing {{ ref('stg_jaffle_shop__customers') }} does two things: it generates the correct BigQuery reference at compile time, and it tells dbt that this model depends on the staging model — building the dependency graph that dbt uses to determine build order, enable lineage tracking, and support documentation.

Build out the model with a few CTEs: one that selects from your customers staging model, one that aggregates order data per customer (first order date, most recent order date, and total order count), and a final CTE that joins the two together. The lineage panel will update to show the full data flow from raw sources through staging into the dimension model — a powerful visualization, especially as your project grows.

Build the model, verify it in BigQuery, then commit your changes and merge to main.

Step 8: Create a Deployment Environment and Schedule a Job

Development work happens in personal dev environments. Production data transformations require a deployment job — a configured, scheduled run that executes your models against your production BigQuery dataset.

In dbt Cloud, navigate to Orchestration → Environments and create a production environment. You will need at minimum one development environment and one production environment. Attach a connection profile to the production environment that points to the dbt_demo dataset you created earlier — this is where production models will be materialized.

Then go to Orchestration → Jobs → Deploy Job and create a new job. Name it something like Production Daily, select your production environment, and configure the run command as dbt build (which compiles, runs, and tests all models in one step). Enable a schedule — for example, daily at 2:00 AM using either the UI scheduler or a cron expression for more precise control.

You can also trigger the job manually to verify it runs successfully. The job run view shows the status of each model — pass, fail, or skip — making it easy to spot and debug issues in production.

What to Learn Next: Data Vault, dbt, and Scalable Analytics Engineering

What you have built in this walkthrough — connecting dbt Cloud to Google BigQuery, defining sources, creating staging and dimension models, and scheduling a deployment job — represents the foundational workflow you will use in every real analytics engineering project.

But this is just the beginning. dbt supports a wide range of advanced features: custom tests, documentation generation, macros and Jinja templating, incremental models for large datasets, snapshots for slowly changing dimensions, and much more.

If you want to go deeper and learn how to combine dbt with structured modeling methodologies, Data Vault 2.0 is the industry standard for building scalable, auditable, and future-proof enterprise data warehouses. At Scalefree, we offer comprehensive Data Vault 2.1 training and certification programs that cover everything from foundational concepts to advanced implementation patterns — including how to integrate dbt into a Data Vault workflow on platforms like BigQuery.

Whether you are just starting your analytics engineering journey or looking to formalize your team’s approach to data modeling, our Data Vault 2.1 certification is the most direct path to building production-grade data platforms with confidence.

Check out our other platform-specific tutorials — including guides for Snowflake — and explore our webinars and training programs to take your data engineering skills to the next level.

Watch the Video

How Data Vault Supports AI and ML Readiness in the Modern Data Platform

Hologram of Padlock on sunset panoramic cityscape of Bangkok, Southeast Asia. The concept of cyber security intelligence. Multi exposure.

How Data Vault Supports AI and ML Readiness

Most organisations today are not failing at AI because they chose the wrong model. They are failing because they built on the wrong foundation. The model is rarely the bottleneck — the data underneath it is.

At Scalefree, working with clients across Europe, we see this consistently: AI workflows will never succeed at scale without proper data support. Not eventually. Always.

This article explains why, and what a mature, AI-ready data architecture actually looks like — with Data Vault 2 at its core.



Where Most Companies Are Right Now

The journey most organisations follow with AI looks roughly the same. It starts with discovery — the first time someone opens a chatbot, enters a prompt, and gets a result that genuinely surprises them. That moment creates momentum.

What follows is an extended period of experimentation. Prototypes are built. Some use cases work. Others fail — not because AI is incapable, but because the setup was wrong, or the use case was not worth the complexity it introduced. This phase is characterised by learning, and by a growing realisation that surfaces quickly: the same problems data engineers have been solving for 20 or 30 years have not gone away. They have simply reappeared under a new name.

Structured processes are needed. Governance is needed. Data integration is needed. The foundation matters — and for many organisations, that foundation is not ready.

The companies that move beyond experimentation into genuine AI maturity are not the ones that found a better model. They are the ones who built a better platform first.

Reasons That Stop Companies From Scaling with AI

Two patterns emerge consistently when AI projects reach the limits of their initial setup.

The first starts innocently. A workflow tool is connected to a data source and a language model. It works. Results are impressive. Then a second data source is added, then a third. A quality control step is introduced. A loop is needed. A second agent handles edge cases. What started as a clean prototype becomes a tangled, fragile system that is expensive to maintain and nearly impossible to debug. When errors appear — and they will — fixing them means untangling months of accumulated complexity. Multiply this across ten, twenty, or forty processes in an organisation, and the maintenance burden alone consumes any efficiency the AI was supposed to create.

The second pattern emerges from urgency. Business users want to move quickly, and internal IT is often a bottleneck. The response is shadow AI: tools adopted outside governance controls, company data uploaded to external platforms without authorisation, processes built that bypass audit trails, GDPR compliance, and data ownership rules. It produces results fast. It also creates legal exposure, data leakage risk, and a complete loss of organisational visibility into what AI is actually doing with company data. For many organisations today, if asked honestly what AI processes are running and what data they are using, the honest answer is: we do not know.

Both patterns are understandable in how they start. Both become critical problems at scale.

Why AI Workflows Fail Without a Data Foundation

The root cause in both cases is the same: attempting to solve data problems inside an AI workflow rather than before it.

When an AI agent needs to access 20 different types of information — personal contact data, past purchase history, product catalogue, past email correspondence, website behaviour, and company context — and that data lives in disconnected source systems with no integration layer, every new data point added to the workflow increases complexity. Quality issues compound. Costs rise with every additional token consumed. And because AI systems produce different results every time they process inconsistent input, errors are unpredictable and difficult to reproduce.

Clean, integrated, well-described data does not make AI smarter in a general sense. It makes AI consistently useful — which is what actually matters when deploying at enterprise scale.

The Enabling Data Platform: What It Looks Like

The architecture Scalefree recommends for AI-ready data platforms follows a clear logical structure, regardless of which specific tools an organisation uses.

Source systems feed into a Persistent Staging Area, where raw data is collected and preserved as-is. This is not a transformation layer — it is a historical record of everything that arrived, in the form it arrived in.

From there, data moves into an integration layer. This is where a Data Vault 2 modeling standard sits. The role of this layer is to integrate data from different source systems, resolve business key conflicts, clean data, and build a single, auditable, historically complete view of the business. It can grow and evolve as new sources are added, without restructuring what already exists. It is also built piece by piece, use case by use case, which means an organisation does not need to build the entire platform before deriving value from it.

Above that, the platform builds Feature Marts. Feature Marts are the direct interface between the data platform and AI agents. They are optimised for AI consumption — sometimes flat and wide, sometimes in a dimensional modeling style, with rich semantic descriptions that help an AI agent understand not just what the data is, but what it means. A Feature Mart might contain all prospect activity data in a single unified view, ready for an agent to consume without having to navigate joins, resolve conflicts, or interpret raw table structures.

AI agents plug into Feature Marts. They do not go directly to the predecessor layers, and they certainly do not go directly to source systems. The Feature Mart is the clean, governed, role-restricted interface that makes agents reliable.

How This Architecture Directly Solves the AI Scaling Problem

Security and access control. When an AI agent connects to a Feature Mart rather than a raw database, access can be scoped precisely. The agent sees only the data it needs for its specific function. If the agent is compromised, the blast radius is limited. This is the same principle applied to any employee or system — give it exactly the access it needs, nothing more.

Data integration. An organisation’s data engineers have already done the work of cleaning, integrating, and resolving data quality issues across source systems. AI engineers do not need to rebuild this. They need to collaborate with data engineers to shape Feature Marts from data assets that already exist. This is a fundamental shift in how AI teams and data teams should work together — and it accelerates the time from idea to deployed AI use cases.

Full audit trail. With a proper data platform, every piece of data that flows into an AI decision can be traced back to its source, with a timestamp. The output can also be stored back in the platform. This means that when a compliance question arises — which data informed this decision, on which date, processed by which agent — the answer is available.

GDPR and compliance. If data deletion rules are already implemented in the platform, they extend automatically to AI agents. Data that has been deleted from the platform under GDPR rules will not be served to an AI agent via the Feature Mart. Compliance is inherited, not rebuilt for each use case.

Cost control. Providing an agent with a clean, pre-integrated Feature Mart means it processes the right data once, rather than consuming tokens navigating raw, inconsistent, or duplicated data sources. Token costs are a real and growing concern for organisations running AI at scale. A well-structured data foundation is also a cost optimisation strategy.

Semantic layers. AI agents make mistakes when they do not understand the data they are working with. A data catalog and semantic layer that provides meaningful descriptions of every data asset — what a field means, how a metric is calculated, what business context surrounds a particular entity — reduces AI hallucinations significantly. This is especially important as organisations move toward conversational interfaces that allow business users to query data in natural language.

Data Vault 2 as the Foundation for AI Readiness

Data Vault 2 is not simply a modeling technique. It is a complete methodology covering architecture, modeling, and implementation standards. Its particular strengths make it well-suited as the integration layer in an AI-ready platform.

The insert-only, historically complete nature of Data Vault means that every version of every piece of data is preserved. An AI agent working with a Feature Mart derived from a Data Vault has access to the full history of a business entity — not just its current state. This matters significantly for ML models that rely on historical patterns, and for audit requirements that demand a complete record of what was known at any point in time.

Data Vault’s business-key-centred approach means that data from multiple source systems can be integrated without losing the original context of each source. An AI agent drawing on customer data that has been properly integrated through a Data Vault model is working with a single, coherent view of that customer across every system in the organisation — rather than multiple conflicting records from disconnected databases.

Data Vault 2 also extends the original methodology to address real-time and semi-structured data patterns — JSON structures, streaming sources, event-driven loading — which are precisely the data types that AI-driven workflows increasingly depend on.

Starting the Journey: Three Things to Do in Parallel

Organisations do not need a complete data platform before they begin building AI use cases. What they need is a plan to build both in parallel, with the right teams working together.

The first priority is extending the existing data platform to serve AI applications — identifying which data assets already exist, which Feature Marts can be built from them quickly, and which source systems need to be connected next.

The second priority is identifying the right processes to automate. The most valuable targets are not tasks unique to one person, but processes that many people across the organisation perform repeatedly — the same steps, executed at scale, across departments. These are the processes where AI creates compounding returns.

The third priority is building cross-functional teams. AI engineers, data engineers, and business users need to work together from the start. Business users understand the processes. Data engineers have already solved the data integration and quality problems. AI engineers know which models fit which constraints and how to optimise for cost and performance. No single group has all three — and organisations that try to run AI projects without all three perspectives will hit the limits of that approach quickly.

Want to Go Deeper?

If your organisation is evaluating its data platform readiness for AI, or if you are already building AI workflows and hitting the limitations described in this article, Scalefree offers a free Data Vault Handbook — 60 pages covering the fundamentals of Data Vault and where it fits in a modern data architecture. Available to order to your door, free of charge within Europe.

For teams ready to take the next step, the Data Vault 2.1 Training & Certification equips data engineers and architects with the methodology, modeling skills, and CDVP2.1 credential to build and govern Data Vault implementations at enterprise scale.

To discuss how Scalefree can support your data platform or AI readiness journey, get in touch directly.

Data Vault and Data Mesh: Not Versus, But Together

Best Practices for Data Mesh Implementation

Data Vault and Data Mesh

Few topics generate more confusion in enterprise data architecture than the relationship between Data Vault, Data Mesh, and Data Fabric. Online discussions often frame these as competing approaches — as if an organisation must choose one and abandon the others. This framing is wrong, and understanding why matters for anyone building a serious data platform in 2025 and beyond.

These three concepts operate at different levels. They address different problems. And when combined correctly, they complement each other in ways that none of them can achieve alone.



Data Vault, Data Mesh, and Data Fabric Are Not Competing

The confusion starts because all three terms appear in similar conversations — data platform architecture, enterprise data strategy, scalability. But they are not alternatives to each other.

Data Fabric is a technical approach. It defines the architecture of a data platform — how data moves from source systems through integration layers to delivery, how metadata drives automation, how access is governed, and how the platform scales. It is the engineering blueprint.

Data Mesh is an organisational approach. It defines who is responsible for data, how teams are structured, how data products are owned and maintained, and how to avoid the bottlenecks that emerge when a single central IT team is responsible for everything. It is the operating model.

Data Vault is the methodology that sits between the two. It provides the modeling technique, the reference architecture, the implementation standards, and the agile delivery framework that make both a high-quality Data Fabric and a functional Data Mesh possible. It is the glue.

None of these replaces the other. An organisation can have a Data Fabric architecture without Data Vault — but it will lack the standardisation and automation that make the platform scalable. It can adopt Data Mesh principles without Data Vault — but the integration layer will be fragile and inconsistent. Data Vault without a clear architectural vision and organisational operating model delivers solid modeling but leaves the surrounding platform undefined. For a deeper look at how these three approaches fit together architecturally, Scalefree’s guide on Data Vault, Data Mesh, and Data Fabric covers the full modern architecture picture.

Why Data Vault Is the Foundation Data Mesh Needs

Data Mesh’s central argument is that centralised IT teams become bottlenecks as data platform requirements grow. The solution is to distribute ownership — giving domain teams (sales, finance, operations, logistics) responsibility for their own data products rather than routing every requirement through a central team.

This is a sound organisational idea. But it creates a serious technical problem: if every domain team builds its own data pipelines from scratch, organisations end up with redundant data, conflicting definitions of the same business objects, and a proliferation of unmaintained pipelines. The very inefficiency Data Mesh was designed to solve reappears in a different form.

Data Vault solves this problem at the architecture level. The key insight is that not all layers of a data platform benefit equally from decentralisation.

The Raw Data Vault — the layer that absorbs raw data from source systems and integrates it using business keys — should remain centralised. This layer contains no business logic. It simply records what arrived, when, and from where. Because it is standardised and highly automatable, a small central team can maintain it with minimal overhead. And because it is centralised, every domain team draws from the same single source of facts — the same customer records, the same product data, the same account structures — rather than each team pulling its own version from disconnected pipelines.

The Business Vault and Information Marts, by contrast, are exactly where domain knowledge matters. Business rules, calculated metrics, KPI definitions, and data product shaping all require the kind of deep domain understanding that lives in business teams, not in central IT. This is where decentralisation makes sense — and where Data Mesh principles directly apply.

The result is a practical middle ground: centralise the Raw Vault where standardisation creates efficiency, decentralise the Business Vault and Information Marts where domain knowledge creates value. This is not a theoretical compromise — it is the architecture that enterprise Data Vault implementations demonstrate works at scale.

The Problem With Going “Full Data Mesh”

Fully decentralised Data Mesh — where every domain team manages its own data end to end, from source ingestion to delivery — sounds attractive in theory. In practice, it replicates the problems of the pre-data-warehouse era, where every department ran its own shadow IT, built its own pipelines, and maintained its own version of business objects that nobody else could join reliably.

When domain teams each ingest their own version of a source system, the same Salesforce data gets pulled ten times by ten different teams with ten slightly different transformation approaches. The same customer appears as ten different records with ten different definitions of “active.” Joining across domains becomes a project in itself. The governance that Data Mesh promises — through federated standards and data contracts — is extremely difficult to enforce when no shared foundation exists.

Full centralisation has its own problems, of course. A single IT team responsible for all data products across a large organisation will always struggle with prioritisation, domain knowledge gaps, and delivery speed. The bottleneck that Data Mesh identifies is real.

The architecture that resolves this tension uses Data Vault’s layered structure as the boundary between centralised and decentralised work. Central team: source ingestion, Raw Vault, automation infrastructure, platform governance. Domain teams: Business Vault logic, Information Marts, data product definition, and ownership.

How This Architecture Works in Practice

Organisations that implement this combined approach typically follow an evolution rather than a big-bang restructuring. Teams begin working together on a centralised platform, building the Raw Vault and establishing the automation patterns and tooling. As the platform matures and team members develop deep platform knowledge, those people move into domain teams — bringing their technical expertise with them, closer to the business knowledge that domain work requires.

The central team shrinks to a small core — often just two or three people — responsible for maintaining the Raw Vault, the automation infrastructure, and the platform governance layer. Domain teams handle everything from the Business Vault outward, working with full autonomy on their data products while drawing on the shared, integrated foundation beneath them.

This approach has a specific advantage when organisations grow through acquisition. When a company absorbs another — bringing new source systems, new customer records, new business objects — the Raw Vault absorbs the new data without restructuring what already exists. New Hubs, new Satellites, and new integration logic are added incrementally. Existing data products continue to function. The integration project that would take a traditional data warehouse months or years can be completed in weeks.

The same scalability applies to organic growth. New business units, new products, new markets — each can be onboarded as a new domain team, drawing from existing Raw Vault entities where overlap exists, adding new entities where it does not. The platform grows with the organisation rather than requiring periodic rebuilds.

What Makes This Combination Work

Three characteristics of Data Vault make it particularly well suited as the foundation for a Data Fabric and Data Mesh architecture.

Standardisation enables automation. Data Vault’s Hub-Link-Satellite structure is highly consistent. Once the patterns are established and the metadata is defined, the loading of Raw Vault entities can be generated automatically rather than hand-coded. This is precisely what Data Fabric requires — and precisely what makes the central layer maintainable by a small team even as the number of source systems grows. For a detailed look at how datavault4dbt implements this automation approach, Scalefree’s tooling is specifically designed around this principle.

Historical completeness supports data products. Data Mesh’s concept of the data product — a trusted, documented, governed dataset that domain teams can consume and build upon — requires a reliable foundation. A data product built on a Raw Vault entity inherits complete historical data, a full audit trail, and provable lineage back to source. These are the properties that make data products trustworthy enough to use in downstream analytics, AI applications, and regulatory reporting.

The layered architecture maps naturally to organisational boundaries. Raw Vault and Business Vault are not just technical distinctions — they correspond to a meaningful organisational divide between technical data engineering work and business knowledge work. The architecture makes the organisational model explicit rather than leaving it implicit, which reduces friction when defining team responsibilities and data product ownership.

Data Catalogs and Governance as Connective Tissue

A combined Data Vault, Data Mesh, and Data Fabric architecture only delivers its full value when metadata is managed seriously. Domain teams need to be able to discover what data products already exist, understand their lineage, know how fresh the data is, and assess whether an existing product meets their needs before building a new one.

Without a well-maintained data catalog, teams rebuild work that already exists, queries return answers that conflict with other answers, and the governance that Data Mesh requires to function collapses into informal agreements and institutional knowledge held by a few individuals.

With a proper catalog — one where every data product is documented, every entity has clear ownership, every metric definition is visible, and lineage traces from source to delivery — the platform becomes genuinely self-service. Non-technical users can find and use data products without IT support. AI use cases that require querying data in natural language become feasible. And the platform can scale to serve a large organisation without proportional growth in support overhead.

Data Vault contributes directly to catalog quality. The Raw Vault’s record source and load date on every entity provide automatic lineage. The standardised naming conventions make entities discoverable. The separation of Raw Vault from Business Vault makes the application of business rules explicit and auditable rather than buried in opaque transformation logic.

Starting the Journey

For organisations considering this architectural direction, the starting point is almost always the same: begin with the data platform foundation before attempting to distribute ownership. Teams need to understand the platform — how the Raw Vault works, how automation is configured, how the Business Vault extends the raw layer — before they can work effectively within domain structures.

The Data Vault 2.1 Training & Certification equips data engineers and architects with the complete methodology — from Raw Vault design through Business Vault patterns to Information Mart delivery — so they can build and govern this kind of platform with confidence. For teams evaluating their current architecture and planning the move toward a more scalable and governed platform, Scalefree’s Data Platform Review provides an expert assessment and a clear recommended path forward.

The question is not whether to choose Data Vault, Data Mesh, or Data Fabric. The question is how to combine them in the sequence and proportion that fits your organisation’s current maturity and growth trajectory. The answer, in almost every case, starts with the same foundation: a clean, standardised, automated Raw Vault that every domain team can trust.

For further reading on how Scalefree approaches enterprise data platform architecture, the free Data Vault Handbook covers the core methodology, and the Data Vault consulting practice works with organisations at every stage of the implementation journey.

Will AI Replace Your Data Vault Engineer? We Put Conversational Analytics to the Test

AI Fact Table Comparison

Scalefree tested whether AI can replace a Data Vault Engineer. The accuracy was perfect. The effort and performance gap told a very different story.

Every data team is asking the same question right now. If AI can write SQL, generate documentation, and query complex structures on its own, what exactly is the Data Engineer still doing?

Can an AI agent query a Raw Data Vault on its own? Does a business still need experienced engineers to model, document, and maintain a vault if the AI can just figure it out? We ran the experiment ourselves. The results were not what we expected.

Mastering Conversational Analytics: A Practical Guide to Setup, Testing, and Optimization

Learn how to unlock the ability to “chat” with your company’s data in plain English and get instant, accurate answers using your unique metrics. This practical webinar will demonstrate a strategy that prevents AI hallucinations and implements reliable AI data assistants without requiring a massive, expensive complexity overhaul. Sign up for our upcoming webinar on June 16th, 2026!

Register for free

Sound Familiar?

Your LinkedIn feed is full of it. “AI can write SQL.” “Just ask your data a question.” “No engineer needed.” And honestly, some of it is true. AI agents are getting remarkably good at querying data structures that would have required a specialist just two years ago.

So the question is fair. If an AI can navigate Raw Data Vault entities, join Hubs to Links to Satellites, and return a correct answer, what is the Data Vault Engineer actually still doing?

At Scalefree, we decided to stop debating it and start measuring it. Same data, same AI agent, two architectures. A lean 9-column Fact Table on one side. A full 12-table Raw Data Vault on the other. Twenty questions fired at both.

The accuracy result? Equal. The full picture? A lot more interesting.

The Setup Behind the Scores

Both agents were built using Google’s Gemini Data Analytics SDK, a ready-to-use Python toolkit that connects directly to BigQuery and handles the NL2SQL pipeline out of the box. Before either agent could answer a single question though, both needed a detailed set of system instructions. Table descriptions, field definitions, glossary terms, query guidance. And behind every one of those lines is someone who knows the data well enough to describe it accurately. That person does not go away with AI. They become more important.

Here is what that looked like in practice.

AI Fact Table Comparison

The Fact Table instructions fit in one sitting. The Raw Vault required documenting every join path, every satellite filter, and every entity relationship before the agent could reason correctly. That is 5 times more documentation for the exact same end result.

Putting Both to the Test

The test was designed to build up gradually. The first five questions kept it simple: total booking counts, filtering by office location. Then came date and time logic: specific days, monthly ranges, daily breakdowns. The middle tier pushed into duration analysis: average booking lengths, the longest slot, exact minute matches. After that, day-of-week patterns: which weekday is busiest, how Mondays compare. The final five combined everything at once, multi-dimensional queries that needed location, time, and resource type all in a single answer. Here is an excerpt from the final three tiers:

AI Excerpt of the 20-question benchmark test suite

Excerpt of the 20-question benchmark test suite

One deliberate design choice: no personal data. Names and email addresses live in a restricted part of the vault that the agent cannot access. The Fact Table was built to match that boundary from the start. Fair test, clean data governance.

The Results

Honestly? Nobody expected a clean sweep on accuracy. And between us, as Data Vault engineers, we were hoping it would not.

AI Accuracy Comparison Table

Both architectures answered every single question correctly. The AI agent handled a 12-table vault with Hubs, Links, and Satellites just as confidently as a single flat table. That is genuinely impressive, and honestly a little humbling. It also means the modeling and documentation were done right. You cannot score 20 out of 20 on a poorly described structure.

But then look at the last two columns. The Raw Vault took 33 minutes in total to do what the Fact Table did in 6. That is 1.65 minutes per question on average, compared to 0.3 minutes for the Fact Table. Same destination. Five times longer to get there.

What This Means for Your Business

Let’s translate the numbers into business reality.

33 minutes total query time versus 6. That is 1.65 minutes per question on average, compared to 0.3 minutes for the Fact Table. For a business user who just wants a quick answer, that difference is felt immediately. And before any of those queries even ran, the Raw Vault needed around 400 lines of system instructions written by someone who understands the data deeply enough to describe it accurately.

None of this makes the Raw Data Vault the wrong choice. For enterprise data management, it is still the gold standard. But pointing an AI agent directly at it, without a proper semantic layer and without experienced engineers maintaining it, is a fast path to slow answers and frustrated users.

Build the vault. Then build a Fact Table on top of it as the AI-facing layer. That combination gives you the best of both worlds. And it gives your Data Vault Engineer a role that AI cannot fill. Someone has to know the data well enough to describe it. Someone has to model it well enough that the AI can reason with it. That someone is not going away anytime soon.

Key Takeaways

  • Do not judge your AI setup by accuracy alone. Look at query time and setup effort too.
  • A well-modeled Fact Table gives you fast, reliable conversational analytics with minimal overhead.
  • A Raw Data Vault can match that accuracy, but needs 5 times more documentation and runs 5 times slower.
  • Good documentation requires someone who understands the data. AI cannot write that for you, at least not yet.
  • The best architecture for AI analytics is not either/or. Use the vault for data integrity, the Fact Table for the AI layer.
  • Your Data Vault Engineer is not a cost to cut. They are the reason any of this works.

What Comes Next

The full story gets told at our upcoming webinar: Mastering Conversational Analytics: A Practical Guide to Setup, Testing, and Optimization. Live queries. Real failure examples. A practical framework for choosing the right architecture. All of it, with the actual data behind it.

Register here for free

In the meantime, tell us where you are at. Are you working with a Raw Vault, a Fact Table, something else entirely? Drop a comment. We read them all.

How to Use Zapier Copilot to Integrate BigCommerce and Salesforce

Presenter with glasses and a lapel mic explains Copilot; on-screen text reads 'Prompt. Build. Done.' with the Copilot logo in the corner.

How Zapier’s Copilot Is Changing the Way Businesses Automate Workflows

Workflow automation has always promised to save time — but setting up integrations between business systems has traditionally required technical knowledge, careful configuration, and a fair amount of trial and error. That is changing rapidly. Zapier’s Copilot feature, currently in beta, introduces a fundamentally new way to build automated workflows: describe what you want in plain language, and let the AI do the heavy lifting. In this post, we take a close look at how the feature works, walk through a real-world use case involving BigCommerce and Salesforce, and explain why this matters for data-driven businesses looking to move faster without adding technical overhead.



What Is Zapier Copilot?

Zapier Copilot is an AI-assisted workflow builder built directly into the Zapier interface. Instead of manually selecting triggers, actions, and mapping fields one by one, users can simply type a description of the workflow they want to create. The AI interprets the prompt, selects the appropriate apps and actions, maps the relevant data fields, and builds the “Zap” automatically.

The feature currently offers two modes:

  • Auto Mode — The AI builds and tests each step automatically, without asking for confirmation at each stage. This is the fastest way to get a working Zap, and is ideal when working in a sandbox or staging environment where test records being created are not a concern.
  • Ask Mode — The AI pauses before executing each step and asks for confirmation. This mode is recommended when connecting to a production environment, where automatically created test records could cause issues in live data.

This flexibility makes AI Copilot useful for both technical users who want speed and non-technical users who prefer control and transparency throughout the build process.

A Real-World Use Case: BigCommerce Orders into Salesforce

To understand how powerful this feature really is, let us walk through a concrete integration scenario that many e-commerce and sales teams face: synchronizing web shop orders with a CRM.

Imagine a company running its online store on BigCommerce and managing customer relationships and fulfillment in Salesforce. Every time an order is placed on the web shop, the team needs a corresponding record to be created inside Salesforce — specifically in a custom object called Web Shop Order. On top of that, each order contains multiple line items, and those individual products need to be tracked as separate records in a second custom object: Web Shop Order Entry.

Traditionally, setting this up would involve:

  • Manually selecting BigCommerce as the trigger app and configuring the “New Order” event
  • Adding a Salesforce action to create the parent record and mapping each field individually
  • Adding a second loop or action to handle the order line items
  • Testing each step, debugging field mapping errors, and iterating

With AI Copilot, the entire setup begins with a single prompt.

Prompt Engineering for Workflow Automation

The quality of the output from AI Copilot is directly related to the clarity of the input prompt. For this use case, a well-structured prompt might look like this:

“Every time a new order is placed in our BigCommerce web shop, create a new Web Shop Order record in our Salesforce sandbox and also create a Web Shop Order Entry record for each item in the order.”

This single instruction communicates the trigger (new BigCommerce order), the primary action (create a Salesforce record), the specific object (Web Shop Order), and the secondary action (create child records for each line item). The AI picks up on all of these elements and builds the workflow accordingly.

Because the integration should be tested safely, it is best practice to connect to a Salesforce sandbox rather than the production org during the build phase. This prevents test records from polluting live data, and Zapier’s Copilot makes it easy to select the staging environment during the connection step.

What the AI Builds Automatically

Once the prompt is submitted and the connections are authorized, AI Copilot gets to work. Here is what it handles without any manual input:

  • Trigger configuration — It sets up the BigCommerce “New Order” trigger and links it to the connected account.
  • Salesforce record creation — It identifies the correct custom object and adds an action to create a new Web Shop Order record whenever the trigger fires.
  • Automatic field mapping — It maps the relevant order data from BigCommerce to the corresponding fields in Salesforce, including fields required by the object’s configuration.
  • Line item handling — It adds a second action to create Web Shop Order Entry records for each individual product within the order.
  • Testing steps — In Auto Mode, it runs a test of each action and asks for confirmation before proceeding to the next step, ensuring the connection is working before the full Zap is activated.

The result is a finished, tested workflow — in a fraction of the time it would take to configure manually.

Validating the Integration in Salesforce

After the Zap is built and the test is run, the proof is in the data. Switching over to the Salesforce sandbox confirms the result: a new Web Shop Order record has been created, all mapped fields are populated correctly, and the associated order entry records are visible as child records. The integration is live and working.

This kind of immediate validation is crucial in data integration projects. It confirms not just that the connection exists, but that the data is flowing correctly, the right objects are being created, and the business logic is functioning as intended.

Why This Matters for BI and Data-Driven Organizations

For businesses that rely on accurate, real-time data across systems, the ability to quickly build and test integrations has significant implications. A few key takeaways:

  • Reduced dependency on technical resources — Business analysts and operations teams can build integrations themselves, without needing to involve a developer for every new workflow.
  • Faster iteration — AI Copilot dramatically shortens the time between identifying a data gap and solving it. What previously took hours of configuration can now be done in minutes.
  • Lower risk during testing — The sandbox-first approach and Ask Mode give organizations the confidence to test integrations thoroughly before pushing to production.
  • Scalability — Once the base workflow is confirmed, additional fields, conditions, or actions can be layered on top, extending the integration without starting from scratch.

As AI continues to mature within automation platforms, the barrier between a business requirement and a functioning technical solution continues to shrink. Zapier Copilot is an early but compelling example of what this future looks like in practice.

Getting Started

AI Copilot is available directly within the Zapier interface and is currently in beta. To use it, navigate to the Zap builder, look for the AI Copilot option, and start with a clear description of the workflow you want to build. For integrations that involve production systems, always begin with Ask Mode or connect to a sandbox environment first to review each step before it executes.

For organizations dealing with complex multi-system data flows, this feature is worth exploring — and the BigCommerce-to-Salesforce example above is just the beginning of what is possible.

Watch the Video

Architectural Limitations of PostgreSQL for Enterprise Data Vault Workloads

Architectural Limitations of PostgreSQL for Enterprise Data Vault Workloads

PostgreSQL stands as the industry standard for open-source transactional (OLTP) database engines. Its reliability, strict ACID compliance, and robust extension ecosystem make it a primary choice for operational workloads. It is the foundation that other open-source alternatives build upon.

PostgreSQL derivatives and extensions modify this core engine to handle heavy analytical (OLAP) and Data Vault workloads while preserving the existing PostgreSQL expertise.

Given this ubiquity, engineering organizations often aim to use vanilla PostgreSQL as their initial data warehousing platform. Vanilla” PostgreSQL refers to the core, single-node version of the database system, which serves as the blueprint for its derivatives and extensions. It is published as an open source project at https://www.postgresql.org/. This approach is commonly paired with Data Vault to achieve agile, historized data modeling.

While theoretically sound, this architectural combination regularly encounters a severe performance threshold as data volumes scale into the mid-to-high gigabyte and terabyte range.

I have created multiple applications that relied on PostgreSQL for operational workloads during my career as a software engineer (before my Data Vault career). With this brief history in mind, it is no surprise that Scalefree relied on PostgreSQL to some extent: we built our first internal data warehouse on PostgreSQL, even knowing its limitations. It was good to start with, when the internal data volume at Scalefree was low, and query complexity was limited. We chose vanilla PostgreSQL because of its derivatives; once we experience the database’s limitations, we would migrate to one of its commercial derivatives.



Strategic Context: Digital Sovereignty, Open Source, and Transatlantic Tech

Before diving into the technical mechanics, it is essential to address why open-source infrastructure matters so deeply to the future of enterprise architecture.

At Scalefree, our commitment to open-source software is foundational. We actively maintain and contribute to the community, which you can explore directly via the Scalefree GitHub repository. A prime example of this dedication is our open-source project, datavault4dbt, which brings robust Data Vault generation capabilities to standard PostgreSQL as well as its derivatives and cloud warehouse counterparts like Amazon Redshift.

Beyond pure engineering efficiency, a broader geopolitical driver is at play: digital sovereignty. As a Fulbright Scholar deeply invested in transatlantic affairs, I closely observe the shifting dynamics between the United States and Europe. True strategic alignment requires balance. The US does not benefit from a technologically dependent Europe; rather, the long-term stability of the transatlantic alliance relies on the US having a strong, digitally independent, and technologically sovereign ally in Europe. This calls for a pragmatic approach to digital sovereignty: use it when available and feasible, build capabilities, and create alternatives without going into panic mode on political trends that change every four or eight years.

The Open Source Sovereignty Paradox

A common counterargument often arises here: If many of these major open-source projects are driven, maintained, and funded by US developers and tech giants, how do they actually provide a sovereign alternative to US commercial technology?

The answer lies in the fundamental nature of open-source licensing and governance, which fundamentally alters the power dynamic:

  1. Elimination of Vendor Lock-In and Extraterritorial Jurisdiction: Proprietary cloud solutions are bound by commercial licenses, corporate terms of service, and the host nation’s domestic laws (such as the US Cloud Act). If a geopolitical or regulatory shift occurs, access can be restricted or altered. Open-source code, once published under licenses such as Apache 2.0 or MIT, is in the public domain. It cannot be revoked, repatriated, or shut off by a foreign corporate or state actor. Those licenses are our preferred choice for our open source projects at Scalefree.
  2. The Right to Fork and Host Locally: Open source grants European enterprises the ultimate sovereign right: the ability to fork the code, host it completely on local infrastructure, and maintain it independently. Even if the primary contributor base remains in Silicon Valley, European engineers have full visibility into the source code, allowing them to audit for security, eliminate telemetry, and adapt it to localized regulatory compliance (like GDPR) without foreign oversight.
  3. Decoupling Innovation from Capital Concentration: Open source democratizes access to state-of-the-art software architecture. It allows a European ecosystem to build sovereign, high-performance platforms without being forced to route massive capital into proprietary foreign hyper-scaler ecosystems.

Building enterprise data architecture on open, vendor-neutral infrastructure is a critical pillar of that sovereignty. However, to achieve true independence, our open-source tools must be architected to handle enterprise-scale workloads.

The following analysis details the structural reasons why vanilla PostgreSQL is fundamentally ill-suited for large-scale analytical (OLAP) processing, with a specific focus on how Data Vault methodology exacerbates these limitations and on open-source alternatives to remedy them.

1. Storage Paradigm Conflict: Row-Oriented vs. Columnar

The primary bottleneck stems from PostgreSQL’s native storage architecture. As a traditional relational database, it utilizes a row-oriented storage engine (heap tables), where complete records are written sequentially to disk blocks.

OLTP Optimization: This architecture is optimal for point-write and point-read operations, such as fetching a single entity profile via a unique identifier.

OLAP Inefficiency: Analytical queries typically perform aggregations over hundreds of millions of rows while restricting the projection to a minimal subset of attributes (e.g., computing a sum over a single numeric column).

Because PostgreSQL reads data row-by-row, it must scan every byte of every unrequested column within the target disk blocks into memory. This introduces immense, non-value-add disk I/O bottlenecks that degrade query performance at scale.

2. Structural Degradation Under Data Vault

Data Vault provides exceptional flexibility and auditability by decomposing business domains into discrete structural components:

  • Hubs: Unique business keys.
  • Links: Relationships and associations between keys.
  • Satellites: Contextual attributes, descriptive states, and historical tracking.

While highly effective for parallelized ingestion and decoupling business logic, this normalization strategy creates severe friction for a row-oriented relational engine.

The Multi-Way Join Complexity

To reconstruct a coherent business entity for downstream consuming layers (such as Business Intelligence tools), data engineers must reverse this decomposition. A single analytical query often requires a 10-to-20-way join across multiple large Hubs, Links, and versioned Satellites.

As dataset sizes expand, this structural complexity impacts the engine in two ways:

  1. Optimizer Limitations: The PostgreSQL query planner struggles to generate precise cardinality estimations across deeply nested join trees, frequently reverting to inefficient join strategies (e.g., nested loops instead of parallel hash joins).
  2. Memory Subsystem Pressure: Executing these multi-way joins requires significant memory allocation (work_mem). When a query’s requirements exceed physical memory allocations, PostgreSQL spills intermediate operations to disk, decelerating execution speeds by orders of magnitude.

The Lack of Inner Join Elimination (Join Reduction)

A particularly acute limitation of the PostgreSQL query optimizer in Data Vault architectures is its inability to perform Inner Join Elimination (also known as Join Reduction).

In complex Data Vault environments, users or Business Intelligence tools frequently query comprehensive, multi-table views or abstract layers that automatically stitch Hubs, Links, and Satellites together. If an end-user runs a report that only requests attributes from a single Satellite and its parent Hub, the remaining 10 tables in that view are technically redundant to the final output.

Modern OLAP query optimizers recognize declarative primary/foreign key constraints and automatically eliminate these unnecessary tables from the execution plan. The PostgreSQL query optimizer cannot eliminate redundant Inner Joins. Even if no columns are selected from a joined table, and even if a valid foreign key guarantees a 1:1 match, Postgres will stubbornly execute every single inner join defined in the query or view. This results in massive, redundant table scans and CPU cycles spent processing joins that have zero impact on the final result set.

Accumulation of Dead Tuples (Bloat)

Data Vault architecture is inherently append-only; changes in source systems trigger the insertion of a new row within the corresponding Satellite to preserve history. This high-frequency append behavior, coupled with PostgreSQL’s Multi-Version Concurrency Control (MVCC), accelerates the creation of dead tuples. Under continuous analytical workloads, the native AUTOVACUUM process frequently falls behind, resulting in table and index bloat that degrades scan performance.

3. The Window Function Bottleneck (And the Fallacy of Recent Releases)

Beyond complex joins, Data Vault workloads rely extensively on Window Functions (LEAD, LAG, RANK, ROW_NUMBER) to determine active states, calculate durations between historical intervals, or isolate the latest record within a Satellite stream.

In vanilla PostgreSQL, window functions are computationally expensive when executed against vast datasets. The engine must sort the partition keys in memory or spill them to temporary disk files to establish the window bounds.

The Limits of Engine Optimizations in Recent Versions

Proponents of the ecosystem often point to performance enhancements introduced in recent iterations—specifically PostgreSQL 17 (which introduced advanced streaming I/O optimizations) and PostgreSQL 18 (which implemented an asynchronous I/O (AIO) framework for sequential and bitmap heap scans).

While these enhancements represent significant milestones for core engine efficiency, they do not resolve the analytical window function bottleneck:

  • I/O vs. Execution Model: These upgrades optimize the database’s ability to read blocks from storage more rapidly. However, the underlying execution model for window functions remains bound to single-node, row-by-row compute processing.
  • Diminishing Returns at Scale: Increasing disk I/O throughput offers negligible relief when a LAG() or ROW_NUMBER() function across hundreds of millions of rows forces a massive, single-threaded disk-sort operation. The bottleneck is merely shifted from storage I/O to compute saturation.

4. Single-Node Architectural Constraints

Enterprise analytical data platforms utilize Massively Parallel Processing (MPP) architectures. They shard and distribute datasets across a cluster of compute nodes, allowing complex joins and window calculations to execute concurrently.

Vanilla PostgreSQL is a single-node database system. While it supports intra-query parallelism across multiple CPU cores, it remains bound by the hardware limits of a single virtual or physical machine. Scaling a vanilla PostgreSQL instance to meet growing OLAP demands requires vertical hardware scaling (scale-up), which scales linearly in cost but delivers diminishing returns in performance.

Open-Source Alternatives and PostgreSQL Derivatives

If your organization is committed to maintaining a sovereign, open-source stack while preserving existing PostgreSQL expertise, you do not need to migrate to proprietary cloud data warehouses. Several open-source derivatives and extensions modify the Postgres engine specifically to handle heavy OLAP and Data Vault workloads:

1. Apache Cloudberry (Incubating, Open-Source MPP)

For enterprise Data Vaults, Apache Cloudberry stands out as an incredibly powerful evolutionary step. It is an open-source Massively Parallel Processing (MPP) database derived from the Greenplum 7 codebase but aggressively modernized.

Why Apache Cloudberry outclasses Greenplum for Data Vaults:

  • Modern Upstream Kernel: Apache Cloudberry remedies legacy tracking issues by building on a much newer PostgreSQL 14 kernel (whereas older Greenplum installations remain anchored to legacy Postgres backends).
  • Superior Analytical Feature Set: Cloudberry supports a massive list of critical performance and optimization features, including incremental sort for window functions, run-time filters for joins, aggregation pushdowns, query pipelining, and advanced BRIN indexing.
  • True Community-Driven Open Source: Following Broadcom’s acquisition of Greenplum’s parent company, the source for Greenplum was closed. As a project under the Apache Software Foundation (ASF), Cloudberry offers vendor-neutral governance under an Apache 2.0 license, completely free from mandatory vendor lock-in.

2. Citus (Open-Source Extension)

How it helps: Citus transforms PostgreSQL into an MPP database by distributing tables and queries across a cluster of multiple nodes.

Data Vault Impact: Citus allows you to shard Hubs, Links, and Satellites by a common business key. When a multi-way join occurs, the compute overhead is distributed across the entire cluster, breaking through the single-node hardware ceiling.

3. Hydra & pg_analytics (Open-Source Columnar Extensions)

How it helps: These extensions add a native columnar storage engine directly inside the PostgreSQL kernel.

Data Vault Impact: By enabling columnar storage for non-historized links, dependent child links, and bridge tables, queries only scan the exact attributes requested by a BI tool, eliminating the I/O tax of scanning entire rows.

Conclusion

Vanilla PostgreSQL is an exceptional transactional engine, but its row-based architecture, single-node limitations, lack of inner join reduction, and row-by-row window function execution create structural barriers for large-scale Data Vault implementations.

Postgres should be protected for what it excels at: transaction management and operational metadata serving. When transitioning to heavy Data Vault modeling and analytical workloads, engineers must decouple their compute to an internal data lake or look toward specialized open-source derivatives like Apache Cloudberry that preserve both technological sovereignty and high-performance scalability.

 

Cover Image designed by Magnific

How Long Does The Data Vault Certification Take? Timeline Explained

Person standing above the clouds looking toward a bright horizon — Data Vault certification journey

How Long Does Data Vault Certification Take?

It is one of the first questions anyone asks before committing to a professional certification: how much time does this actually take? For Data Vault certification, the answer is straightforward — and the structure is designed to fit around the reality of working professionals. Here is the complete timeline, from first login to certified practitioner.



The Full Timeline at a Glance

The CDVP2.1® (Certified Data Vault 2.1 Practitioner) certification follows a defined sequence. There is pre-course preparation, a live instructor-led training block, an exam window, and post-certification access to continued learning resources. Each phase has a clear duration.

In total, from starting your preparation to sitting the exam, most candidates complete the process within 10 to 11 weeks.

Phase 1 — Pre-Course Self-Paced Videos (Approximately 15 Hours)

Before the live training begins, candidates work through a set of self-paced video modules covering the foundational concepts of Data Vault 2.1. These cover the reference architecture, the core modeling components (Hub, Link, Satellite), the agile data methodology, and the layered structure of a Data Vault platform.

The self-paced content runs approximately 15 hours in total. You work through the material at your own pace and must complete it before the start of the instructor-led live training. Most candidates spread this over two to three weeks, fitting it around their regular work schedule.

Completing the pre-course material before the live training is important. The three days of instructor-led sessions build directly on these foundations, and arriving prepared means you get significantly more value from the live discussions, Q&A, and hands-on exercises.

Phase 2 — Live Instructor-Led Training (3 Days)

The instructor-led component of the Data Vault 2.1 Training & Certification runs for three consecutive days and is conducted online or on-site by a Scalefree-certified instructor, depending on the training format you choose. Sessions cover Data Vault modeling in depth, architecture patterns, implementation approaches, automation concepts, and the methodology for managing Data Vault projects in a real enterprise environment.

The live format matters. This is not a recorded walkthrough — it is an interactive session where participants bring real questions from their own projects, and the instructor works through edge cases, design decisions, and common mistakes in real time. Attendees regularly join from across Europe and beyond, which means the discussion reflects a broad range of industries and tool stacks.

Three days is intensive. By the end of day three, candidates have covered the full scope of the CDVP2.1® examination and have had the opportunity to clarify anything that needs it before the exam window opens.

Phase 3 — Exam Window (8 Weeks, 2 Attempts Included)

After the live training concludes, the exam window opens. Candidates have eight weeks to sit the CDVP2.1® examination, and two attempts are included in the certification package.

The exam is proctored and taken online, so there is no need to travel to a testing centre. The eight-week window gives candidates flexibility to review the material, consolidate what they learned during the live sessions, and choose their own moment to sit the exam — whether that is the week after training or closer to the deadline.

Two attempts are a meaningful safety net. Most candidates who arrive prepared from the pre-course material and engage actively during the live training pass on their first attempt. The second attempt is there if you need more time to solidify a particular area before trying again.

Phase 4 — Post-Certification Platform Access (6 Months)

Passing the CDVP2.1® exam is not the end of the learning journey. Certified practitioners receive six months of access to the Data Vault Alliance platform, which includes extended reference material, community resources, and continued learning content.

This post-certification access is particularly valuable for practitioners who are actively implementing Data Vault on a project during or after the training period. Having a structured reference resource available as real implementation questions arise — rather than only during the training itself — is a practical advantage that experienced candidates consistently highlight.

Complete Timeline Summary

Phase Duration Format
Pre-course self-paced videos ~15 hours (self-paced, typically 2–3 weeks) Online, on-demand
Live instructor-led training 3 days Online, on-site
Exam window 8 weeks (2 attempts included) Online, proctored
Post-certification platform access 6 months Data Vault Alliance platform

Is There a Faster Path?

The structure above is designed to be as efficient as it is thorough. Three days of instructor-led training is a concentrated format — the same content would typically be spread over a much longer period in a self-paced programme. For data engineers and architects who are already working with data warehousing concepts, the pre-course videos are often faster than the 15-hour estimate because much of the foundational context is already familiar.

There is no shortcut through the exam itself — the CDVP2.1® is a rigorous, proctored assessment — but the preparation path is as streamlined as it can reasonably be while still producing practitioners who can apply the methodology on real projects.

Who Is the Certification For?

The CDVP2.1® certification is designed for data engineers, data architects, BI developers, and technical team leads who are building or planning to build a Data Vault-based data platform. Prerequisites include solid SQL knowledge, experience with data warehousing or BI development, and a working understanding of data modeling fundamentals. The pre-course videos are designed to bring everyone to a consistent baseline before the instructor-led training begins.

For teams considering Data Vault at an organisational level, Scalefree also offers in-house training — the same certification program delivered privately for your team, at a date and format that fits your schedule.

The free Data Vault Handbook is a good starting point if you want to understand the methodology before committing to training. You can also contact Scalefree directly to discuss in-house options or upcoming public training dates.

Next Training Dates

Public Data Vault training runs on a regular schedule throughout the year. Check the current calendar on the training page for upcoming dates, and register early — cohort sizes are intentionally kept small to maintain the quality of the instructor-led sessions.

dbt Fusion Demo: Dialect-Aware Validation, State-Aware Orchestration & Efficient Testing

dbt Fusion Demo

This is the second installment in our dbt Expert Series. In the first video, we introduced dbt Fusion, explored what it is, why it matters, and highlighted its core capabilities: dialect error validation, state-based orchestration, and efficient testing. If you have not watched that video yet, we recommend doing so before continuing here.

In this session, Scale Free principal consultant Metropolis takes those concepts into a live demo environment to show how they actually behave in practice. The result is a clear picture of how dbt Fusion can reduce costs, eliminate redundant work, and catch errors before they become expensive problems.



The demo project: Hub Speak Base

To ground the demo in something realistic, Metropolis built a project called Hub Speak Base. It includes four data sources — customer, line item, orders, and part — along with seven models: one staging model per source, two dimension models (customer and part), and a fact orders model. Unique and not-null tests are configured on both the sources and the staging models, and the mart models include unit tests defined in a star schema YAML file. This gives a solid foundation for demonstrating all three major Fusion capabilities without abstracting away the messiness of real-world pipelines.

Dialect error validation in the IDE

One of the most immediately useful features of dbt Fusion is its ability to catch SQL errors before a query ever leaves the IDE. This is what Fusion calls dialect error validation, and the demo shows it working in two distinct scenarios.

First, Metropolis demonstrates column reference checking. In the fact orders model, he intentionally references a column called order_keys instead of the correct order_key. Fusion flags this instantly — hovering over the incorrect reference surfaces an error message explaining that the column l_order_keys cannot be found. The same error appears in the problems panel below the editor, making it impossible to overlook.

Second, he tests function name validation by changing the round function to a fictitious roundy. Fusion flags this as well. When he attempts to build the model anyway, Fusion does not even send the query to Snowflake. Instead, it stops during static analysis and throws an error immediately — saving both time and warehouse compute.

It is worth noting a current limitation: as of the time of recording, not all SQL errors are caught by Fusion’s static analysis engine. For example, passing a third argument to the round function — which Snowflake does not support — is not yet flagged locally, and the query is sent to Snowflake where it fails at runtime. Since Fusion is still in preview, this behavior is expected to improve over time.

State-based orchestration: only rebuild what changed

State-based orchestration is where dbt Fusion offers some of its most compelling cost savings. The idea is simple: instead of rebuilding every model on every run, Fusion tracks the state of each model in the database and only rebuilds the ones that have changed — whether due to code updates or upstream data changes.

To enable this, you navigate to the Orchestration settings for your production environment and toggle on the Fusion cost optimization features, which includes both state orchestration and efficient testing.

The demo makes the behavior concrete. Metropolis drops all tables and views from the production schema, then triggers a job. Every model is rebuilt from scratch because Fusion detects that nothing exists yet. On the second run, with no changes made, every model is marked as reused. The logs confirm this clearly: no new changes on any upstream model.

Then things get interesting. Metropolis inserts one row into the customer source table and one row into the line item source table. The expected behavior — that only the models downstream of those sources would be refreshed — is exactly what happens. The staging customer model and the dimension customer model are rebuilt. The staging line item model and the fact orders model are rebuilt. Everything else is reused. Fusion is detecting data changes at the row level, without any update timestamps configured. The orchestration works automatically out of the box.

One particularly useful aspect of Fusion’s state orchestration is that state is shared across jobs within the same environment. The demo includes a second production job configured to refresh fact orders and its upstream dependencies. When this job runs after the first, it finds all models already built and up to date — so everything is reused. Teams running multiple jobs against the same environment avoid paying twice for the same compute.

Efficient testing: stop running the same tests twice

The third capability demonstrated is efficient testing. When using the build command in a Fusion-enabled job, dbt Fusion tracks which tests have already run and reuses their results for downstream models within the same job execution — rather than re-running identical tests multiple times.

In the demo, after switching the job command from run to build, the results show tests on the sources executing as expected. But the equivalent tests defined on the staging and mart models — tests that reference the same underlying data — show their results as reused. This avoids redundant warehouse queries and can meaningfully reduce both execution time and compute cost on larger projects.

The current limitation here is scope: as of the time of recording, test result reuse only happens within the context of a single job run. Results are not carried over to subsequent runs or shared across different jobs. This may change in future versions of Fusion.

What this means for your dbt workflows

Taken together, these three capabilities address real pain points that dbt teams encounter as their projects scale. IDE-level error validation shortens the feedback loop between writing SQL and knowing it works, without requiring a round-trip to the warehouse. State-based orchestration dramatically reduces unnecessary compute by treating rebuilds as the exception rather than the rule. And efficient testing ensures that the tests you have invested time in writing do not become a bottleneck in CI/CD by running redundantly.

dbt Fusion is still in preview, and some capabilities are still being refined. But based on this demo, the direction is clear: Fusion is designed to make the full dbt development and deployment cycle faster, cheaper, and more intelligent.

Future videos in this series will cover more advanced configuration options for state orchestration, as well as additional Fusion features as they become available. Make sure you are subscribed so you do not miss them.

Watch the Video

Data Vault Glossary: Hub, Link, Satellite, Business Vault, and More

Data Vault Glossary

The Essential Data Vault Glossary

Data Vault has its own precise vocabulary. Whether you are evaluating the methodology for the first time or preparing for Data Vault certification, understanding what each term means — and why it exists — is the foundation for everything else. This glossary covers the core concepts of Data Vault 2.0 and 2.1, defined at the conceptual level for data engineers, architects, and IT leaders building or modernising a data platform.



Business Key

A business key is the identifier that the business actually uses to recognise and track a business object — a customer number, a product code, an account number, an ISBN. It is the natural, meaningful key that appears in source systems and that business users refer to in their daily work.

In Data Vault, the business key is the fundamental organising principle of the entire model. Every Hub is built around business keys. The goal is to find keys that are unique across the enterprise and stable over time — keys that different source systems share, enabling integration between them.

Business keys sit above surrogate keys (technical IDs generated by a source system). A surrogate key is unique within one system but carries no meaning outside it. A business key has meaning across the organisation, making it suitable for integration. The hierarchy runs from global business keys (universally unique, like a Vehicle Identification Number), through organisational business keys (assigned by the enterprise, like a customer number), down to system-wide surrogate keys where no better option exists.

Hub

A Hub is one of the three fundamental entity types in Data Vault. It stores a distinct list of business keys for a single type of business object — all customer numbers, all product codes, all account numbers. The Hub identifies. It records which business keys have ever existed in the data platform, alongside when they were first seen (the load date) and where they came from (the record source).

The Hub does not describe anything about the business object — that is the Satellite’s job. It does not store relationships — that is the Link’s job. A Hub is insert-only: once a business key is recorded, it is never updated or deleted (except under legal obligation). This permanence is what makes Data Vault historically complete.

Link

A Link is the second fundamental entity type. It stores a distinct list of relationships between business keys — the fact that a customer purchased a product, that an employee was assigned a vehicle, that a booking involved a passenger and a flight. Like the Hub, the Link is insert-only and records when the relationship was first identified and from which source.

The Link does not describe the relationship — it only establishes that it existed. All descriptive information (when it started, when it ended, what conditions applied) lives in Satellites attached to the Link. Importantly, Links can connect more than two Hubs: a purchase transaction might link a customer, a product, and a store simultaneously. This is entirely normal in Data Vault design.

Satellite

A Satellite is the third fundamental entity type, and where the actual data warehousing happens. It stores descriptive data — the attributes that describe a business object or relationship over time. A customer’s name and address. A product’s description and list price. The start and end dates of an employment contract.

Every time an attribute changes in the source, a new row is inserted into the Satellite. No rows are ever updated. This insert-only behaviour is what gives Data Vault its complete historical record. Each Satellite has exactly one parent — either a Hub or a Link — and Satellites are typically split by source system, by security or privacy classification, and sometimes by rate of change.

The combination of Hub, Link, and Satellite reflects the three fundamental components present in all enterprise data: business keys, relationships, and descriptive attributes. For a deeper treatment of how these entities are modelled and loaded, Data Vault 2.1 Training & Certification covers the full methodology in detail.

Raw Vault

The Raw Vault (also called the Raw Data Vault) is the layer of the Data Vault architecture that stores unmodified source data. It consists of Hubs, Links, and Satellites that capture data exactly as it arrived — no cleansing, no business rules, no filtering, no conditional logic of any kind.

The Raw Vault is the single point of facts. Because no business interpretation has been applied, the data it holds is fully auditable: you can demonstrate precisely what any source system delivered on any given date. This auditability is one of the primary reasons Data Vault is adopted in regulated industries such as banking, insurance, and government.

Business Vault

The Business Vault is the layer above the Raw Vault where business logic is applied. It uses the same Hub-Link-Satellite structures, but its purpose is to transform and enrich the raw data — cleansing records, resolving duplicates, applying currency conversions, tagging data quality levels, and deriving calculated attributes.

The Business Vault is not a mandatory pass-through layer. Data that is already clean and ready for reporting can flow directly from the Raw Vault to an Information Mart. In practice, organisations typically maintain multiple Business Vault schemas — one per department or domain — each expressing the business rules and definitions relevant to that context. This is how Data Vault delivers multiple versions of the truth from a single set of facts: different teams can apply their own definitions without touching the shared Raw Vault underneath. Learn more about the full Data Vault 2.0 methodology and how Scalefree applies it in client projects.

Information Mart

An Information Mart is the delivery layer that presents data to end users and reporting tools. Unlike the Raw Vault and Business Vault — which use Hub-Link-Satellite structures — Information Marts use dimensional models such as star schemas, snowflake schemas, or flat wide tables, in whatever structure the consuming tool requires.

Information Marts are usually virtualised (SQL views) rather than materialised tables, making them lightweight and easy to modify. The recommended approach is many small, focused Information Marts — one per report or use case — rather than a single large mart. Several specialised mart types exist for specific purposes:

  • Error Mart — captures records rejected by a loading process due to hard rule violations. Should always be empty in a healthy system.
  • Raw Mart — presents raw data in a reportable dimensional form without applying business rules. Used during agile requirements gathering to help business users articulate what they need.
  • Quality Mart — shows only the bad or suspect records, giving data stewards visibility into data quality issues so they can be fixed at the source.
  • Source Mart — reconstructs the original structure of a source system from the Data Vault model, with the added benefit of historical versioning and built-in GDPR data removal.
  • Interface Mart — designed for machine-to-machine consumption, used when a downstream application needs to read from the platform or receive cleansed data back from it.
  • AI Feature Mart — a specialised Interface Mart designed for AI and machine learning model consumption, typically wide, flat, and enriched with semantic field descriptions.

Hash Key

A Hash key is a fixed-length value derived by applying a hashing algorithm (typically MD5 or SHA-256) to one or more business key columns. In Data Vault, Hash keys serve as the primary keys of Hubs and Links, and as the foreign key references connecting Satellites to their parents.

The key advantage of Hash keys is that they can be computed independently: any system, given the same business key input, will always produce the same Hash key. This enables parallel loading, makes the model portable across environments, and simplifies join logic. The actual business key columns remain stored alongside the Hash key in the Hub or Link. For a detailed look at how Hash keys are implemented in practice, see Scalefree’s article on Hash Keys in the Data Vault.

Load Date

The load date timestamp is a technical metadata attribute on every Hub, Link, and Satellite row. It records the moment the record was loaded into the data platform — not when the event occurred in the source system, but when the data arrived in the vault. The load date is always a full timestamp, never just a date, since data platforms often receive deliveries multiple times per day.

Combined with the record source, the load date answers two fundamental audit questions for every piece of data: when was it received, and from where?

Record Source

The record source identifies which source system a particular record came from. It is stored on every Hub, Link, and Satellite row alongside the load date. Its primary audience is the development and engineering team — when investigating a data issue, the record source points directly to the originating system and delivery batch. It is not used for business reporting or compliance auditing in the same way as the load date.

PIT Table

A PIT table (Point-in-Time table) is a helper structure that makes querying historical data across multiple Satellites significantly more efficient. Without a PIT table, reconstructing the complete state of a business object at a specific historical moment requires complex, expensive joins across Satellites with different load dates. A PIT table pre-computes the correct Satellite row timestamps for each point in time, so downstream queries can join the PIT table directly rather than re-solving the temporal logic on every run.

PIT tables are derived structures — generated from Raw Vault data and rebuildable at any time. They are not part of the core Data Vault model but are standard production companions to it.

Bridge Table

A Bridge table is a helper structure that simplifies querying across multiple Links. Where PIT tables solve the temporal complexity of Satellites, Bridge tables solve the structural complexity of navigating a chain of linked Hubs — for example, tracing from a customer through their orders, through their order lines, to the products. Bridge tables are pre-joined snapshots of relationship paths that would otherwise require multiple sequential joins. See also: Bridge Tables 101 on the Scalefree blog.

Ghost Record

A ghost record (also called a default record or zero key record) is a placeholder row inserted into a Hub or Satellite to handle situations where a foreign key reference exists in the source data but the referenced record itself does not. It prevents referential integrity violations and allows the data platform to load records completely even when source data is incomplete. Ghost records are technical placeholders, not real business data, and are distinguishable by their defined default key values.

Effectivity Satellite

An Effectivity Satellite tracks the active or inactive status of a Hub record or a Link relationship over time. It records when a business object or relationship became active in the source system and when it was deactivated or deleted. When a source system deletes a record, the Hub retains the business key permanently — the Effectivity Satellite gains a new row reflecting the deletion, preserving the complete history while making the current active state queryable.

Persistent Staging Area

The Persistent Staging Area (PSA) is the layer where raw source data is stored before it enters the Raw Vault. Unlike a transient staging area (which holds only the most recent delivery), a PSA retains every historical delivery — a complete, time-stamped archive of everything ever received from every source system. In modern Data Vault architectures, the PSA role is typically fulfilled by a data lake, organised in a folder structure partitioned by source system, table, and load date.

Unit of Work

The unit of work is a concept from the Data Vault agile methodology that defines the smallest deliverable increment of business value in a sprint. It consists of a complete data flow from source to Information Mart — staging the required source data, loading the Raw Vault entities, applying business rules in the Business Vault, and delivering the result in a mart that a business user can consume. Organising development around units of work ensures every sprint delivers something tangible to the business rather than invisible infrastructure.

Data Aging

Data aging refers to the practice of identifying and marking historical records in the Raw Vault or Business Vault that are no longer operationally relevant — records that have not been updated or referenced over a significant period. Data aging strategies help manage storage costs and query performance over time. In keeping with Data Vault’s insert-only philosophy, aged records are flagged or moved to archival storage rather than deleted, preserving the completeness of the historical record.

CDVP2.1

CDVP2.1 stands for Certified Data Vault Practitioner 2.1 — the professional certification awarded by the Data Vault Alliance upon passing the certification examination. It validates that a practitioner understands and can apply the Data Vault 2.1 methodology across architecture, modeling, and implementation.

Scalefree is an authorised Data Vault Alliance training partner. The Data Vault 2.1 Training & Certification is the official path to CDVP2.1, combining instructor-led training with exam preparation and two included exam attempts. If you are building or modernising a data platform and want to understand how Data Vault fits into a broader enterprise architecture, explore the free Data Vault Handbook or get in touch with Scalefree directly.

The Battle Of Table Formats: Iceberg vs Delta vs Hudi

datavault

Selecting the right open-source table format is about securing your infrastructure strategy. Making the right choice helps you save development costs and minimize risks. A well-chosen format lowers your Total Cost of Ownership (TCO) and ensures a future-proof, sustainable architecture. Let’s dive into three popular formats, so you can quickly deliver results without getting locked into a bad ecosystem.

Open table formats bring database-like ACID transactions to your data lake. They reduce storage costs by minimizing data duplication. Here is how Iceberg, Delta, and Hudi compare on the technical essentials.

The Battle of Table Formats: Iceberg vs Delta vs Hudi

Stop risking costly vendor lock-in and future-proof your data lakehouse today. In this deep dive, we cut through the noise to compare the big three open table formats: Apache Iceberg, Delta Lake, and Apache Hudi. We’ll analyze infrastructure fit, real-world performance, and Data Vault integration to help you drive down your TCO. Join us to find the exact format your architecture needs—before you commit to an expensive, irreversible path. Learn more in our upcoming webinar on May 19th, 2026!

Sign Up For Free

Performance Under Pressure

Performance depends directly on your compute engine and use case. Delta Lake is highly optimized for Apache Spark, providing efficient read performance for Spark-heavy workloads. Apache Hudi is specifically built for streaming-first architectures that require handling massive amounts of updates and deletes (upserts). Apache Iceberg utilizes an engine-agnostic architecture, maintaining consistent query performance across multiple different engines like Trino, Flink, and Spark.

It is important to note that choosing the query engine is more important than the table format itself. A well calibrated format-engine pair will perform similarly well.

Community Support

Community maturity directly impacts long-term risk minimization. Delta Lake is supported by a large user base, primarily driven by Databricks. Apache Iceberg currently holds the ultimate multi-vendor momentum. It receives active contributions from multiple major cloud providers and data vendors, offering broad ecosystem support. Apache Hudi’s community centers on data engineering for real-time ingestion and streaming pipelines.

Time Travel Capabilities

Time travel enables querying historical data, auditing changes, or reverting accidental deletions, serving as a critical mechanism for risk minimization. All three formats offer some type of “time travel”.

Delta uses a straightforward transaction log. It replays JSON commits and Parquet checkpoints to reconstruct a table’s exact state at a specific timestamp or version.

Iceberg uses a tree of immutable metadata snapshots. Instead of processing a heavy transaction log, a query references a past snapshot ID. This approach scales efficiently for massive tables without performance degradation.

Hudi tracks changes via a chronological action timeline. It maintains a granular history of operations, enabling strict point-in-time queries that map directly to its streaming architecture.

Interoperability

Infrastructure strategy must account for evolving workloads. The industry is currently shifting toward cross-format compatibility. Projects like Apache XTable and Delta UniForm act as interoperability layers. Data written in one format (e.g., Delta) can be read natively as Iceberg or Hudi. This reduces vendor lock-in risks and lowers pipeline reengineering costs. Additionally, Apache Paimon offers an alternative for dynamic tables with native Apache Flink integration for high-throughput streaming workloads.

Architecture Insight: Data Vault

Table formats and modeling methodologies like Data Vault 2 are complementary. While Iceberg, Delta, or Hudi provide the optimized storage layer and ACID transactions, Data Vault provides the business alignment and agility. For optimal performance on a Data Lakehouse, you can materialize your Raw Vault core entities as physical Delta or Iceberg tables to serve as high-speed indexes. Furthermore, while table “time travel” is useful for quick rollbacks, long-term enterprise historization should still rely on Data Vault’s insert-only architecture to prevent data loss during routine storage maintenance.

A note on Time Travel vs. Historization: While format-level “time travel” is useful for quick rollbacks, long-term enterprise historization should still rely on Data Vault’s insert-only architecture. Relying solely on table formats risks permanent data loss during routine storage maintenance, such as Delta’s VACUUM command.

Keypoints for your Data Strategy

  • Choose Delta for deep Spark integration.
  • Choose Iceberg for maximum tool flexibility and a massive open ecosystem.
  • Choose Hudi for heavy streaming and continuous upserts.

There is no single winner in the battle of table formats, only the right tool for your specific infrastructure strategy. By aligning your choice with your engine preference and streaming needs, you ensure high team agility and keep storage costs manageable.

Why Split Hubs Are a Data Vault Anti-Pattern

Split Hubs Are a Data Vault Anti-Pattern: Here’s Why

A practice that occasionally surfaces in Data Vault projects — though it doesn’t appear in the official methodology — is splitting Hubs by source system in the Raw Data Vault, then consolidating them into a “golden record” Hub in the Business Vault. The idea seems intuitive: keep SAP customers and Oracle customers separate at the raw layer, then unify them later. In practice, this approach undermines one of Data Vault’s most powerful features. This post explains why split Hubs are an anti-pattern and what the correct approach looks like.



Split Hubs: Why They Contradict the Purpose of a Hub

To understand why splitting Hubs by source system is a problem, start with the fundamental purpose of a Hub in Data Vault 2.0: a Hub represents a business concept. Not a SAP customer. Not an Oracle customer. A customer. Full stop.

One of the most valuable properties of the Raw Data Vault is that it serves as the integration layer for business keys. This is called passive integration: when two source systems share the same business key for the same real-world entity — a customer number that exists in both SAP and Oracle, for example — loading both into the same Hub causes integration to happen automatically at load time. The moment the same business key is hashed and loaded from both systems, it maps to the same Hub record. No additional logic required.

When you split Hubs by source system, you bypass this integration entirely. HUB_SAP_CUSTOMERS and HUB_ORACLE_CUSTOMERS are two separate entities in the model, and any integration between them has to be built explicitly later — which is exactly the kind of work the Raw Data Vault was designed to handle for you. You’ve taken a passive, automatic process and made it a manual, deferred one.

Business Key Identification: The Real Work

The split Hub pattern often appears in projects where the business key selection process hasn’t been given enough attention. Identifying the right business key is one of the most important — and underestimated — tasks in a Data Vault implementation. It’s a topic that deserves its own dedicated discussion, but the key hierarchy is worth understanding at a high level.

At the top are global business keys: identifiers that are recognized universally, like a VIN number for vehicles or an ISBN for books. These are ideal because they enable integration not just across internal systems but with external data sources as well. Below that are company-wide business keys — identifiers shared across multiple internal source systems. These are the keys that enable cross-system Hub integration. At the bottom are system-specific keys, known only to a single source system.

The temptation for data engineers under time pressure is to reach for whatever unique key is most readily available — often a surrogate key or a system-generated sequence. These keys reliably identify records within their source system, but they were never designed to integrate across systems. Using them as Hub business keys produces technically valid Hubs that miss the entire integration value of the Raw Data Vault.

Investing time upfront in identifying a company-wide or global business key — even if it requires conversations with business stakeholders and source system specialists — pays back significantly in the quality and simplicity of the resulting model. Our Data Vault 2.1 Training & Certification covers business key identification as a core modeling skill.

When Two Systems Use Different Keys for the Same Entity

What if SAP and Oracle genuinely use different, unrelated keys for the same customer? This is a common real-world scenario, and the solution is not to create separate Hubs. Both keys still go into the same customer Hub — because a Hub is a distinct list of business keys, not a distinct list of business objects. Two different keys can represent the same customer in the Hub without causing a problem.

The tool for resolving that ambiguity is the Same-as-Link (SAL). A Same-as-Link references the same Hub twice — one side for the master record, one side for the duplicate — and establishes the relationship between them. The golden record logic, the master record calculation, the determination of which key takes precedence: all of that belongs in the Business Vault, expressed as an explicit business rule through the SAL. In some cases, the source system itself provides a key mapping — a master data management system that already knows which keys refer to the same entity — and that mapping can be loaded directly into the SAL in the Raw Data Vault.

This approach keeps the Raw Data Vault clean and close to the source, while giving the Business Vault a precise, auditable place to implement the integration logic. For a deeper look at how SALs enable enterprise-wide deduplication, see our post on Data Vault in modern architecture patterns.

Handling Surrogate Key Collisions

Surrogate keys — sequence numbers used as primary keys in source systems — introduce a specific risk: the same number in SAP and Oracle might refer to two completely different customers. Customer 1042 in SAP is not the same entity as Customer 1042 in Oracle, but if both are loaded into the same Hub using just the sequence number as the business key, they hash to the same value and collapse into a single Hub record. That’s a data integrity problem.

The fix is not to create separate Hubs. The fix is to include a source system identifier in the hash key calculation. The business key fed into the hash function becomes a combination of the source system identifier and the sequence number — SAP + 1042 and Oracle + 1042 hash to different values and produce separate Hub records. One Hub, two distinct records, no collision. The source system becomes part of the key definition rather than a reason to fragment the model.

What Correct Hub Loading Looks Like

To bring this together: if SAP and Oracle share the same company-wide business key for customers, load both into a single customer Hub and add separate Satellites per source system. The integration happens automatically at load time — no golden record logic required in the Raw Data Vault.

If they use different keys, load both into the same Hub and create a Same-as-Link in the Business Vault to express the relationship between them. If surrogate keys create collision risk, include the source system identifier in the hash key computation to ensure uniqueness while still maintaining a single Hub.

In all three scenarios, the answer is one Hub per business concept. Split Hubs trade short-term convenience for long-term complexity — and they give up the passive integration capability that makes Data Vault worth using in the first place.

To go deeper on Hub design, business key identification, and the full Raw Data Vault methodology, explore our Data Vault certification program. The Data Vault Handbook is also available as a free physical copy or ebook for a solid grounding in the core concepts.

Watch the Video

Close Menu