Databricks and dbt: A Practical Approach to Data Vault Implementation


Databricks and dbt

Selecting the appropriate technology stack is a critical factor in the successful delivery of a Data Vault 2.0 architecture. Two technologies that work effectively together for large-scale data solutions are Databricks and dbt. When combined, they provide a practical way to implement Data Vault models while addressing performance, governance, and auditability requirements.

It can be argued that dbt’s role in a Databricks-based architecture is not always essential, since many of its core capabilities (such as transformation scheduling, lineage tracking, and documentation) can also be implemented using native Databricks features. Understanding the specific role each tool plays helps clarify where they complement each other and where functionality overlaps.



Databricks as the Processing and Storage Platform

Databricks’ Lakehouse architecture combines the scalability of a data lake with the reliability of a warehouse. Its Delta Lake technology offers ACID transactions, schema enforcement, and time travel, enabling precise historical querying; these capabilities are directly relevant to Data Vault's historization requirements.
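
For example, Delta Lake's time travel lets you query a table as it existed at an earlier point in time or at a specific version, which is useful when reconciling Satellite history. The table name below is illustrative:

SELECT * FROM silver.customer_sat TIMESTAMP AS OF '2024-01-01 00:00:00';

-- or by table version
SELECT * FROM silver.customer_sat VERSION AS OF 42;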

With Unity Catalog, Databricks centralizes metadata management and enforces fine-grained access control, ensuring sensitive attributes are protected without introducing unnecessary satellite splits. This alignment between governance and performance is particularly relevant in Data Vault environments.
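
As a sketch of what this can look like, Unity Catalog supports column masks defined as SQL functions, so a sensitive attribute can stay in its Satellite and be masked for unauthorized users. The function, schema, and table names below are hypothetical:

-- Hypothetical masking function: only members of a governance group see the raw value
CREATE FUNCTION governance.masks.mask_email(email STRING)
RETURNS STRING
RETURN CASE WHEN is_account_group_member('pii_readers') THEN email ELSE '*** masked ***' END;

-- Attach the mask to a Satellite column instead of splitting the Satellite
ALTER TABLE silver.customer_sat ALTER COLUMN email SET MASK governance.masks.mask_email;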

dbt as the Transformation and Orchestration Layer

dbt manages and automates SQL-based transformations in a modular and version-controlled manner. In a Data Vault context, dbt enables:

  • The creation of Hubs, Links, and Satellites through templated, reusable models. Different packages can be leveraged here, such as our datavault4dbt package, which is continuously updated to stay compliant with the most recent Data Vault standards (a brief example follows below).
  • Integrated testing to validate business keys, relationships, and data quality.
  • Automated documentation that directly reflects the structure and dependencies of the Data Vault.

This structured approach makes transformations transparent and repeatable, supporting the auditability requirements inherent to Data Vault.
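
As an illustration of this templated approach, a Hub model built with the datavault4dbt package can look roughly like the sketch below. The macro parameters (hashkey, business_keys, source_models) follow the package's documented interface, but the exact signature may vary between package versions, and the staging model and column names are assumptions:

{{ config(materialized='incremental') }}

{{ datavault4dbt.hub(
    hashkey='hk_customer_h',
    business_keys='customer_number',
    source_models='stg_customer'
) }}

Because every Hub follows the same pattern, adding another Hub is largely a matter of pointing the macro at a different staging model and business key.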

Integration in a Data Vault Workflow

When Databricks and dbt are deployed together:

  • Data ingestion occurs in Databricks, storing raw datasets as Delta tables, usually in the Bronze layer.
  • dbt transformations generate Raw Vault entities and Business Vault objects in the Silver layer.
  • Governance and security controls are enforced via Unity Catalog without altering the Data Vault model structure.

Figure: Bronze, Silver, and Gold layers in the Data Vault structure

This approach preserves Data Vault’s methodological structure while using Databricks’ distributed compute and storage capabilities.

Business Value when combining dbt and Databricks

The combined use of Databricks and dbt offers:

  • Scalable processing of large, complex datasets: Databricks handles enterprise-scale data efficiently, while dbt structures transformations into modular, reusable components.
  • Consistent governance across all layers of the Data Vault: dbt’s lineage and documentation, plus Unity Catalog’s access control, ensure compliance and transparency end to end.
  • Lower operational risk through tested, version-controlled transformations: Git-based versioning and automated tests in dbt reduce errors before execution on Databricks.
  • Improved query performance for information marts and analytics: Delta Lake optimizations and dbt’s pre-aggregated tables with business logic minimize expensive joins.

For organizations building Data Vault on Databricks, dbt strengthens structure and quality while Databricks ensures scalability and performance.

– Ricardo Rodríguez (Scalefree)

Agile Development in Data Warehousing with Data Vault 2.0

Agile Development in Data Warehousing: Initial Situation

Agile methodologies bring flexibility and adaptability to data warehousing, making them a natural fit for modern approaches like Data Vault 2.0. A common issue in data warehousing projects is that the scope is often missing, and many processes, such as controlled access, GDPR handling, auditability, documentation, and infrastructure, are not optimized. Additionally, data warehouse projects that do have a scope often begin without a real focus on business value, mostly because the use cases are not clearly communicated and the data architects do not know where to start. The consequence is that no business value can be delivered.

Data Vault 2.0 Methodology

It is often assumed that Data Vault 2.0 is only a modeling technique, but this is not correct. Data Vault 2.0 comprises the modeling technique, a reference architecture, and the methodology. The methodology introduces project management tools such as CMMI, Six Sigma, and Scrum to solve the problems described above. While CMMI and Six Sigma address general project management issues, Scrum is used mainly within the development team and provides the framework for a continuously improving development process. The use of agile development in Data Vault 2.0 projects is described in more detail below.

The Scope of a Sprint

The first step in setting up a data warehouse project in an agile way is defining the objective of the project in just one or two pages. Unlike waterfall projects, the goal is to produce working pieces of usable output, such as reports or dashboards, in continuous iterations called sprints. This means we don't need to plan the entire project in detail; instead, we can build around a general idea or goal for the final data warehouse and then focus on planning the first sprints. To address the aforementioned problems, the focus of the sprints needs to be centered on business value. For this reason, it is important to receive constant feedback from business users as part of a continuous improvement process.

Define the project

Both the scope of a sprint and the architecture follow a business-value-driven approach, built vertically rather than horizontally. This means they are not built layer by layer but feature by feature. A common approach for this is the Tracer Bullet approach: based on the business value, defined by a report, a dashboard, or an information mart, the source data is identified, modeled through all layers, and loaded.

As shown in Figure 1, the entire staging layer is not built up front; rather, a small part of each layer is built based on the data in scope, in this case the SalesReport.

Agile Development

Before new functionality can be implemented in a sprint, it needs to be defined. This task lies with the product owner, who writes and prioritizes the user stories. As already explained, the goal of a sprint is to produce working pieces of usable output called features. In addition, there are technical topics that need to be considered. Various methods support sprint planning, such as planning poker or Function Point Analysis, which are discussed in more detail in another article.

It is also useful to evaluate the sprint while it is still ongoing. If the development team does not manage to implement a feature within the sprint, this is often a good indicator that the scope is too large.

To avoid this, all work packages that are not relevant for the feature should be removed from the sprint. In practice, however, these work packages are often not removed out of fear of the business users' reaction. To address this fear, it is important to explain to the business users that the work packages are only moved temporarily into the backlog and will still be delivered, just in a later sprint.

Figure 1: Data Vault 2.0 Architecture

Due to the flexible and scalable Data Vault model, these layers can be extended with the next feature with little to no re-engineering. This is possible because Data Vault consists of a Raw Data Vault and a Business Vault, covering both the logical architecture and the data modeling perspective. The Raw Data Vault is modeled in a data-driven way by integrating data on business keys. Only hard business rules, such as data type conversions or hash key calculations, are applied there. All other, soft business rules are applied only in the Business Vault.

Here, we turn data into information. For this reason, the Raw Data Vault requires less refactoring and can be extended virtually without limit.

Agile Development Review

Another important success factor for agile projects is proper review and improvement. Before the next sprint starts, the following meetings must be held by the team:

  • The sprint review meeting: This meeting reviews the delivered features. Usually the development team, the product owner, the Scrum Master, and the end users participate.
  • The retrospective meeting: This meeting usually takes place directly after the review meeting and focuses on identifying activities that need to be improved.
  • Backlog refinement: This meeting prioritizes the user stories and makes sure the team understands what to do.
  • Sprint planning: This meeting plans which user stories fit into the next sprint, based on effort estimates.

It is important that these meetings are held so that the sources of errors can be found. In this way, the outcome of a project can be improved and the development processes optimized iteratively.

Conclusion

Data Vault 2.0 is not only a scalable and flexible modeling technique but a complete methodology for achieving the enterprise vision in data warehousing and information delivery by following an agile approach and focusing on business value. By using agile methods in data warehousing, projects can concentrate on business value and deliver useful products to the customer.

The Business Value of Data Vault – and Why It Matters

The Business Value of Data Vault

Data Vault is not just another data model. It’s a pragmatic architecture and methodology built for change, auditability, automation and fast delivery — and those traits translate directly into measurable business advantages compared with traditional approaches or doing nothing at all.



Why we need a different approach to data

Most organisations don’t start with the perfect, governed data platform. They begin with spreadsheets, a few scripts, maybe a single operational system. Then the business grows: new SaaS apps (Salesforce, shop platforms), partner APIs, IoT feeds, regulatory requirements and ad-hoc reporting requests. Before long you have multiple sources, inconsistent definitions, and changing business rules.

Traditional models — Kimball’s star schemas or Inmon’s normalized enterprise warehouse — work well when sources, rules and requirements are stable. But the reality today is constant change. That’s exactly the gap Data Vault was designed to fill: a model and architecture focused on capturing raw facts reliably, separating business logic, and enabling incremental, auditable growth.

High-level difference: Data Vault vs. traditional models (or none)

Put simply:

  • Traditional models (Kimball/Inmon): great for reporting, intuitive star schemas for business users, but rigid and costly to change when sources or rules evolve.
  • No model / ad-hoc reports: fastest at day zero but leads to duplicated effort, inconsistent numbers, and brittle scripts that break as systems change.
  • Data Vault: engineered for change. Capture everything in a consistent, standard way, keep full lineage, and build business logic and reporting layers on top. This structure enables automation, auditability and rapid delivery of real business reports.

Concrete business advantages of implementing Data Vault

1. Faster time-to-value (Tracer-bullet delivery)

Data Vault enables an iterative “tracer bullet” approach: pick a high-value report, identify the raw source data, ingest and model only what’s needed to deliver that report end-to-end. Business users get a working dashboard in weeks (not months), giving immediate value, generating trust, and allowing the team to expand incrementally.

2. Built-for-change — lower cost of future change

Because Data Vault separates raw data (hubs, links, satellites) from business rules (Business Vault / information marts), adding a new source, new attribute, or updated business rule rarely requires tearing down and rebuilding existing models. That translates into lower rework, lower maintenance costs, and much faster onboarding of new systems.

3. Automation reduces delivery time and human error

Data Vault entities follow standardized patterns. Hubs look alike; satellites follow the same tracking patterns. That repeatability makes the ingestion and loading processes highly automatable with modern tools (for example, dbt builders, Wherescape-style automation, Coalesce). Automation frees developers to focus on business logic instead of tedious ETL plumbing — more predictable pipelines, fewer bugs, faster delivery.

4. Auditable, traceable data for compliance and trust

Every record in a Data Vault carries load dates, record source identifiers, and historical versions. That full lineage is invaluable for audits, GDPR/DSR processes, finance reconciliations and provenance requirements for AI. When regulators or internal auditors ask “where did this number come from?” you can show the full trail back to source.

5. Future-proof architecture for analytics and AI

Data Vault’s decoupling of raw capture from business logic means you can adopt new storage or compute technologies (data lakes, cloud object stores, NoSQL or streaming platforms) without reworking the core model. It’s an architecture that scales with both data volume and analytics sophistication: data science teams can access the raw, auditable records they need without breaking downstream reporting.

6. Reduced risk and predictable governance

Standardized patterns, auditable history and clear separation of concerns improve governance. Data owners can define rules in a separate layer, compliance teams can inspect lineage, and operations can automate quality checks. That lowers operational risk and makes governance predictable rather than ad-hoc.

Specific business problems Data Vault can solve

Below are concrete problems organisations experience — and how Data Vault addresses them.

  • M&As and rapid source expansion: After an acquisition you must onboard dozens of new systems. Data Vault lets you ingest raw records quickly and map business rules later, so analytics can start immediately without delaying integration for perfect master data alignment.
  • Conflicting definitions across departments: Different teams report different revenue numbers. With Data Vault you capture every source event and build reconciled information marts, so one canonical report can be produced while source-level values remain auditable.
  • Regulatory or audit requests: Need to prove how a figure was derived six months ago? Data Vault’s lineage (load timestamps, record source) shows exactly which source values contributed to any derived metric.
  • GDPR / Data Subject Requests: Because raw values and their sources are stored with provenance, it’s easier to locate and isolate personal data, show retention windows, or delete/segment records if needed.
  • AI/ML model drift and explainability: Models need defensible inputs. Data Vault keeps the raw inputs and transformation history separate, so feature engineers and auditors can trace which raw values produced a model input.
  • Slow BI delivery and constant rework: BI projects where every change requires a model rewrite burn budget. Data Vault’s incremental approach reduces rework and keeps BI teams delivering incremental, reliable reports.
  • Operational reporting vs historical analytics conflict: Operational needs often demand current-state views; analytics wants full history. Data Vault stores full history by design, while downstream information marts can present both current and historical perspectives appropriately.

How the business benefit translates to measurable outcomes

Organisations that adopt Data Vault commonly see measurable improvements such as:

  • Shorter lead times for report delivery (weeks vs months for new reports).
  • Lower total cost of ownership because changes require less rework and are more automatable.
  • Fewer data incidents and faster root-cause analysis because lineage is built-in.
  • Stronger compliance posture and faster audit responses.
  • Better support for analytics and AI initiatives — because data scientists get consistent, traceable raw data.

These translate to business outcomes: faster decisions, less risk, better regulatory positioning, and a higher ROI on analytics investments.

Practical adoption path — a pragmatic recipe

You don’t have to flip the switch for everything at once. A typical, low-risk path is:

  1. Choose one high-value report (the tracer bullet). Identify required sources and ingest the raw records into the Raw Vault.
  2. Build the Business Vault where you apply the business rules for that specific report (transformations live here, not in the raw zone).
  3. Deliver an Information Mart tuned for reporting (star schema if that’s what BI needs) that offers the business an immediate, usable report.
  4. Iterate and scale — add more reports and sources, reuse existing Hubs/Links/Satellites, automate loading patterns and apply governance over time.

This approach gives quick wins, builds trust, and progressively modernises your data landscape without huge upfront modelling effort.

When Data Vault might be overkill

Data Vault is powerful, but it’s not always necessary. If you’re a very small organisation with a single system, little change, and a handful of reports, a simple star schema or a few curated data marts could be more pragmatic. Evaluate:

  • Number of sources and expected change rate
  • Regulatory/audit requirements
  • Scale of historical data needs
  • Long-term analytics and AI ambitions

If those requirements are modest today but expected to grow, Data Vault often makes sense as a future-proofing step you can introduce incrementally.

Final thoughts — why business leaders should care

At the executive level, Data Vault should be evaluated not as a modeling fad but as an investment in enterprise agility, compliance and scalable analytics. The technical patterns (hubs, links, satellites) map directly to business outcomes: rapid delivery of trusted reports, reduced change costs, auditable provenance, and a platform ready for advanced analytics and AI.

Compared to doing nothing (ad-hoc scripts) or building a rigid, monolithic warehouse, Data Vault gives you a repeatable way to capture everything, govern it, and build the business-facing outputs that actually create ROI.

If you’re considering a modern data platform, start with a tracer-bullet use case, prove the approach, automate the repeatable parts, and keep the focus on business outcomes rather than perfect modelling up front.

Watch the Video

Meet the Speaker


Lorenz Kindling
Senior Consultant

Lorenz works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies across various industries for Scalefree International. Prior to Scalefree, he worked as a consultant in the field of data analytics, which gave him a comprehensive overview of data warehousing projects and the common issues that arise.

Differences between Data Vault 2.0 and Data Vault 2.1

Data Vault 2.0 vs Data Vault 2.1

As organizations continue to grapple with rapidly evolving data landscapes, Data Vault remains a leading methodology for building scalable, auditable, and flexible data warehouses. With the release of Data Vault 2.1, practitioners and architects often ask: “What’s changed since 2.0?” In this article, we’ll dive into the differences across three core areas—design principles, ETL patterns, and modeling best practices—and show you how 2.1 enhances your ability to tackle modern data challenges like data lakehouses, data mesh, and nested JSON feeds.



1. Design Principles: Staying True but Embracing Modern Architectures

Core Continuity

At its heart, Data Vault 2.1 retains all the foundational tenets of 2.0: separation of concerns (Hubs, Links, Satellites), immutable history, and decoupling of raw data capture from business transformations. If you already have a robust 2.0 implementation, there’s no need for a forklift upgrade—2.1 is evolutionary, not revolutionary.

Lakehouses, Mesh, and Fabric

Where Data Vault 2.1 shines is in explicitly addressing emerging architectures. You’ll find guidance on integrating Vaults within data lakehouses (e.g., Delta Lake, Apache Iceberg), as well as how Vault concepts align with data mesh domains and data fabric overlays. Instead of an “Enterprise Data Warehouse” monolith, 2.1 helps you embed Vault patterns into cloud-native, distributed environments.

Logical vs. Physical Modeling

With the proliferation of diverse storage engines—relational, columnar, NoSQL document stores, and graph databases—2.1 distinguishes your logical Vault model (Hubs, Links, Satellites) from its physical implementation. You now have clear guidelines on:

  • Keeping the logical model technology-agnostic
  • Adapting physical denormalization or document embedding strategies per platform capabilities
  • Optimizing storage formats (e.g., Parquet, Delta, or JSONB) while preserving auditability

This separation equips data engineers to leverage the strengths of their chosen database without sacrificing Vault integrity.

2. ETL Patterns: From Batch to Streaming and JSON

Expanded CDC Strategies

Data Vault 2.1 deepens its coverage of Change Data Capture (CDC) patterns. You’ll find refined techniques for:

  • Transactional order guarantees: Ensuring raw Vault loads adhere to source system timestamps to preserve lineage.
  • Handling late-arriving or out-of-order events: Techniques to backfill or correct Satellites without breaking immutability.
  • Parallel loading: Avoiding cross-system dependencies by pre-joining keys within each source’s staging area.

Informal “Pre-Join” Denormalization

2.1 codifies the practice of pre-joining business keys in staging or external views—a pattern previously covered only in practitioner forums. This denormalization step enriches payload tables with true business keys upfront, eliminating repetitive lookups during Link loads and simplifying ETL script maintenance.

JSON and Nested Structures

Perhaps the most visible ETL addition is 2.1’s JSON processing module. With more sources emitting nested, semi-structured payloads, new patterns include:

  • Flatten-first loading: Initial extraction of atomic fields into raw Satellites before storing full payloads (sketched after this list).
  • Schema evolution handling: Capturing structural changes (added arrays or nested objects) as metadata in Vault artifacts.
  • Selective shredding: Automating transformation of common sub-documents into separate Hubs/Links/Satellites.
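
As a rough sketch of the flatten-first idea, a staging view can extract the atomic fields needed for Hub and Satellite loads while keeping the full payload for later shredding. The example uses Spark SQL's get_json_object function with hypothetical table and field names:

CREATE OR REPLACE VIEW stg_orders_json AS
SELECT
  get_json_object(payload, '$.order.order_number')    AS order_number,    -- business key
  get_json_object(payload, '$.order.customer_number') AS customer_number,
  get_json_object(payload, '$.order.status')          AS order_status,
  payload                                             AS full_payload,    -- kept for selective shredding later
  load_date,
  record_source
FROM raw_orders_feed;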

3. Modeling Best Practices: Updated Patterns for a Distributed World

Managed Self-Service BI

Data Vault 2.1 recognizes the shift toward self-service analytics within federated teams. Best practices now recommend:

  • Role-based access controls at the raw & business Vault layers, ensuring data stewards can grant fine-grained permissions.
  • Row- and column-level security patterns that can be implemented natively in cloud warehouses (Snowflake masking policies, SQL Server RLS, etc.).
  • Embedding governance metadata in Vault tables, enabling automated lineage and impact analysis for downstream consumers.

Expanded Satellite Strategies

While 2.0 introduced Point-in-Time (PIT) and Bridge tables for performance, 2.1 adds:

  • Snapshot Satellites: Prebuilt structures for frequently used combinations of Hubs and Satellites, ideal for dimensional views.
  • Behavioural Satellites: Grouping event-driven attributes (e.g., clickstreams) separately from master-data Satellites.

Cross-Domain Linkage

Data Vault 2.1 extends guidance on managing relationships across micro-warehouse domains—a nod to data mesh. It clarifies when to use:

  • Cross-domain Links: For relationships spanning autonomous teams with separate Hubs.
  • Reference Hubs: Capturing shared code lists (e.g., currency, country) that multiple domains consume.

4. Educational & Organizational Enhancements

Rich Video & Quiz Content

Training for 2.1 now includes extensive pre-recorded modules by Dan Linstedt focusing on conceptual foundations, freeing up live classroom time for interactive labs and advanced case studies. Over 40 quizzes interspersed throughout the curriculum reinforce learning and feed directly into the certification exams.

Certification & Community

Becoming a Data Vault 2.1 certified practitioner involves:

  • 5 days of combined video and onsite training (versus one day of video + three days live in 2.0).
  • An updated exam covering new ETL patterns, JSON handling, and modern architecture integration.
  • Access to an expanded Slack community and biweekly “Vault Clinics.”

Choosing When to Adopt 2.1

Given the backwards-compatible design, migration from 2.0 to 2.1 can be phased:

  1. Retain existing Hub/Link/Satellite structures in the Raw Vault.
  2. Gradually introduce new ETL patterns (JSON shredding, snapshot Satellites) in parallel.
  3. Implement enhanced governance and self-service controls in the Business Vault.
  4. Leverage certification resources to upskill architects and engineers on updated best practices.

Conclusion

Data Vault 2.1 advances the methodology by weaving in lessons from cloud-native architectures, self-service analytics, and semi-structured data sources—without discarding the proven foundation of 2.0. Whether you’re standardizing a data mesh deployment or optimizing your JSON pipelines, 2.1 provides the patterns and guardrails needed to build a modern, auditable, and flexible data platform.

Watch the Video

Meet the Speaker


Marc Winkelmann
Managing Consultant

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016, he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020, he became a Data Vault 2.0 instructor for Scalefree.

Unlock the Intelligence Layer: LLMs in Data Warehousing and the Future of Your Data


“Stop writing complex SQL, start talking to your data?”

This provocative question highlights a growing shift in how we interact with data. For years, getting answers from a Data Warehouse meant writing SQL queries or relying on pre-built dashboards.

For many organizations, their data platforms remain underutilized because accessing insights still requires writing code or navigating complex dashboards. It’s time to go beyond static reports and unlock a true intelligence layer on top of your data warehouse. Recent advances in Large Language Models (LLMs) and Natural Language Processing (NLP) are making data warehouses smarter, faster, and easier to use for everyone. In this article, we’ll explore how LLMs can transform the way you interact with your data – from using plain English queries instead of SQL, to AI-driven discovery of hidden insights, to enriching your data pipelines – and why this shift represents the future of data analytics.


From Complex SQL to Conversational Queries

Business users often depend on data engineers or analysts to fetch answers, creating bottlenecks in decision-making. Even data professionals themselves spend considerable time writing and optimizing SQL, rather than interpreting results. What if anyone could simply ask the data warehouse a question in plain language and get the answer? This is the promise of LLMs as an “intelligence layer”, a layer that bridges complex datasets and human comprehension. Advanced LLMs can understand a user’s question or request and generate the appropriate SQL queries on the fly.

This technology (often called Text-to-SQL or Natural-Language-to-SQL or NL2SQL) has rapidly evolved and major technology players have already taken note. For example, Databricks introduced a Natural Language Query feature (LakehouseIQ) to let users ask questions of their Lakehouse, and Snowflake is also exploring LLM-driven query capabilities.

Imagine asking your data warehouse in plain English: “What were our top-selling products last quarter by region?”. This text input is passed to an LLM, often enriched with company-specific context via RAG (retrieval-augmented generation), and the system then translates it into a correct, optimized SQL query that retrieves the answer.
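
To make this concrete, a question like the one above might be translated by the LLM into something along the following lines. The table and column names (sales, products, regions) are purely illustrative, and the exact date arithmetic depends on the SQL dialect:

-- "What were our top-selling products last quarter by region?"
SELECT
  r.region_name,
  p.product_name,
  SUM(s.quantity) AS units_sold
FROM sales s
JOIN products p ON p.product_id = s.product_id
JOIN regions  r ON r.region_id  = s.region_id
WHERE s.sale_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
  AND s.sale_date <  DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY r.region_name, p.product_name
ORDER BY units_sold DESC;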


Of course, translating natural language to SQL at an enterprise scale isn’t trivial. Complex schemas, ambiguous user input, and security considerations mean the LLM has to be both smart and careful. Uber has built such an AI system that works on an enterprise scale level.

Uber’s QueryGPT is an NL2SQL system that uses a multi-step, RAG-based pipeline combining LLMs with retrieval and agent modules. It fetches context via similarity search over a vector database of example queries and schema information for SQL generation. To manage Uber’s vast data ecosystem, QueryGPT employs specialized agents:

  • An Intent Agent classifies requests by business domain.
  • A Table Agent suggests tables for the query.
  • A Column Prune Agent trims irrelevant columns to reduce prompt length.

The LLM then produces the SQL query and an explanation.

This layered design allows QueryGPT to handle large schemas and reliably generate complex multi-table queries. It’s a hybrid architecture where multiple transformer calls specialize in sub-tasks, enabling scalable, accurate NL2SQL as a production service, saving thousands of Uber employees significant time by mid-2024.

AI-Augmented Data Discovery and Insights

Beyond simply fetching results for user queries, LLMs can augment data discovery by revealing insights that users might not have explicitly asked for. Traditional dashboards show you what is happening, but a smart LLM-based system can tell you why it’s happening and highlight patterns you might not notice. This is often called augmented analytics – using AI to automatically find important correlations, trends, outliers, and drivers in your data.

LLMs excel at interpreting data outputs and providing additional context. For example, rather than just displaying a chart or a table, an LLM can generate a written summary pointing out key trends or anomalies. They can explain which metrics are up or down and suggest potential reasons (for instance, detecting that “conversion rates dipped in July, possibly due to seasonality or inventory issues”), enabling quicker and more informed decision-making.

Another area where LLMs can significantly reduce manual effort is in the creation and maintenance of data catalogs. Documenting data models, table structures, and especially individual column descriptions is often time-consuming and easily skipped due to missing resources, despite being crucial for an effective use and accessibility of the data. LLMs can automate large parts of this process by generating descriptions based on data profiling, SQL logic, naming conventions, and metadata.

dbt Cloud recently released its dbt Copilot AI agent, which supports developers in various ways, for example by letting the AI analyze SQL code and schema metadata to automatically generate model and column descriptions.

LLMs in Your Data Pipeline: Enrichment and Efficiency

LLMs don’t just enhance how users interact with the Data Warehouse; they can also improve the data itself and the efficiency of data engineering processes. In modern ELT (Extract-Load-Transform) pipelines, a lot of time is spent cleaning, enriching, and preparing data for analysis. Here, LLMs offer new tools to automate and augment these steps.

One promising use case is the semantic enrichment of data. Large Language Models have absorbed a vast amount of world knowledge and language patterns, and they can use that to fill gaps or add context to your raw data. For example, imagine you have a dataset of customer feedback where each entry is a text comment. An LLM could automatically classify the sentiment of each comment (positive/negative), extract key themes, or even generate a summary of common issues. In this way, unstructured data becomes structured insights without manual effort. The image below illustrates how an LLM is integrated into a data pipeline: text inputs from a CustomerFeedback table are passed to an OpenAI API endpoint, where the model returns structured sentiment labels that are then stored back in the database.

Figure: Sentiment analysis with a Large Language Model in the data pipeline
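
A minimal sketch of such an enrichment step, assuming a hypothetical classify_sentiment() user-defined function has been registered that wraps the LLM API call:

-- classify_sentiment() is a hypothetical UDF wrapping an LLM API call
CREATE OR REPLACE TABLE customer_feedback_enriched AS
SELECT
  feedback_id,
  feedback_text,
  classify_sentiment(feedback_text) AS sentiment_label  -- e.g. 'positive' / 'negative' / 'neutral'
FROM customer_feedback;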

In a practical case study, LLMs were used to enrich an academic dataset by inferring missing attributes (like guessing a person’s gender from their name with high accuracy), which outperformed dedicated API services. This showcases how LLMs can bring external knowledge and reasoning to enhance your data.

Another area is metadata enrichment. Enterprise data is often filled with cryptic column names and jargon that hinder usability. LLMs can intelligently expand abbreviations and annotate fields with business-friendly descriptions. For instance, an LLM-driven catalog might take a column labeled “CUST_ID” and annotate it as “Customer Identifier, unique ID for each customer record”.

LLMs can also assist in the coding and transformation process itself. Data engineers can leverage LLMs to generate boilerplate code or SQL for transformations, document pipeline logic in plain English, or even detect anomalies and data quality issues through pattern analysis. By automating tedious parts of data preparation and providing AI-generated suggestions, LLMs free up engineers to focus on higher-level architecture and problem-solving.

Conclusion

While the promise of an LLM-powered intelligence layer is exciting, it’s important to approach it with a clear strategy. Successful implementation requires considering a few key challenges and best practices. Data quality and governance are more crucial than ever. If your underlying data is inaccurate or poorly structured, the AI’s answers will be unreliable. As the saying goes, “garbage in, garbage out.”

Ensuring clean, well-organized data (and maintaining a robust data governance program) will help the LLM produce meaningful and correct insights. Additionally, organizations may need to fine-tune or configure their LLMs to understand industry-specific terminology or business context. This reduces the chance of the AI misinterpreting what a user asks or generating an incorrect query.

Privacy and security are another important consideration. If your data includes sensitive information, you must ensure that any AI tool accessing it complies with your security requirements. This might involve using self-hosted models or secure APIs, and setting up proper access controls.

The dream of a self-service analytics experience, where users simply “talk to the data and get answers”, is quickly becoming a reality. This evolution may redefine roles (enabling analysts and engineers alike to focus on higher-value tasks) and open up analytics to a wider audience than ever before. It’s an exciting time to be a data professional, but also one that demands staying informed and ready to adapt.

– Ole Bause (Scalefree)

Get Started with dbt: A Quick 15-Minute Guide Using Snowflake

Get Started with dbt Using Snowflake

Introduction to dbt

dbt (data build tool) revolutionizes the way teams build and maintain analytics workflows by bringing software‐engineering best practices to SQL‐based data transformations. Instead of ad‐hoc scripts, dbt encourages version control, modular models, testing, documentation, and lineage graphs. In this guide, you’ll learn how to go from zero to your first dbt models—running entirely in the cloud on Snowflake—in under 15 minutes.



Why Choose dbt Cloud + Snowflake?

  • Fully managed: No local install or complex orchestration; dbt Cloud handles hosting.
  • Quick setup: Snowflake’s partner connector spins up a dbt trial, pre-configured with your credentials.
  • Best practices out of the box: Built-in IDE, job scheduling, Git integration, and documentation.
  • Scalable performance: Leverages Snowflake’s compute power for fast model builds.

Prerequisites

  1. A Snowflake trial account (free, 1-minute setup).
  2. A modern browser (Chrome, Firefox) or VS Code for remote development.
  3. Basic familiarity with SQL.

1. Launching dbt Cloud from Snowflake

Once logged into your Snowflake trial, navigate to the Data Products → Partners → Connect pane. Scroll to find the dbt entry and click Connect → Launch. This will automatically:

  • Provision a dbt Cloud trial account
  • Create a new Snowflake database and warehouse
  • Inject Snowflake credentials into your dbt Cloud connection

You’ll land in the dbt Cloud dashboard, ready to start your first project.

2. Exploring the dbt Cloud UI

In dbt Cloud’s left navigation bar you’ll find:

  • Develop: Interactive IDE for coding models, sources, tests, documentation.
  • Deploy: Job definitions, environments, and run history.
  • Documentation: Auto‐generated docs site with lineage graphs.
  • Settings: Account, project, and Git integration.

Click Develop → IDE (hosted in Chrome or connect your VS Code). Let’s initialize our dbt project.

3. Initializing Your dbt Project

  1. In the IDE, open the Version Control pane and click Initialize dbt Project.
  2. Accept the defaults; dbt creates a dbt_project.yml and folder structure (models/, macros/, etc.).
  3. Commit the auto‐generated files in a new Git branch: “initialized dbt project”.

Your Git pane now shows uncommitted files; click Commit & Push to save the project baseline.

4. Defining Your Source Data

dbt doesn’t load data from external systems—you must point it to existing tables. Snowflake’s sample database (SNOWFLAKE_SAMPLE_DATA) contains TPC-H tables you can use.

Create a new YAML file under models/ named sources.yml with:

version: 2

sources:
  - name: tpch
    database: SNOWFLAKE_SAMPLE_DATA
    schema: TPCH_SF1
    tables:
      - name: CUSTOMER
      - name: ORDERS

Save to see the lineage graph update with two new source nodes.

5. Building a Staging Model

Staging models standardize raw tables and prepare them for downstream transformations. In models/, delete the example/ folder and instead:

  1. Create a folder called models/staging/tpch.
  2. In that folder, create stg_tpch_customer.sql:
{{ config(materialized='view') }}

select
  C_CUSTKEY   as customer_key,
  C_NAME      as customer_name,
  C_ACCTBAL   as account_balance,
  C_COMMENT   as comment
from {{ source('tpch', 'CUSTOMER') }}

Notes:

  • The source() macro resolves to the fully qualified table.
  • materialized='view' tells dbt to build a view by default.

Save and click Run → dbt run to build just this model. In seconds you’ll see a view in your Snowflake UI under the dev schema.
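
If you prefer the command line, dbt's node selection builds only the model you name, for example:

dbt run --select stg_tpch_customer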

6. Creating a Production-Ready Dimension

Dimensions (Gold layer) contain curated, business-ready tables. Let’s filter for customers with positive balances.

  1. Create models/marts/customer_dim.sql:
{{ config(materialized='table') }}

select
  customer_key,
  customer_name,
  account_balance
from {{ ref('stg_tpch_customer') }}
where account_balance > 0

Here, ref() links to another model, ensuring correct build order and clear lineage.

Click Compile to preview generated SQL, then Run → dbt run to create the table in Snowflake.

7. Testing and Documentation

dbt encourages tests to enforce data quality:

  • Add to models/staging/tpch/schema.yml:
version: 2

models:
  - name: stg_tpch_customer
    tests:
      - not_null:
          column_name: customer_key
      - unique:
          column_name: customer_key

Run dbt test to validate your models. Any failures will be reported in the UI.

Generate documentation with dbt docs generate, then preview via dbt docs serve. Explore your project’s lineage graph and column descriptions.

8. Version Control & Deployment

dbt Cloud integrates Git for collaboration. After feature development:

  1. Commit your branch and open a pull request (GitHub, GitLab, Bitbucket).
  2. Merge into main.

In Deploy → Environments, create a production environment. Under Jobs, define a job that runs:

  • dbt seed (if you have local CSV seeds)
  • dbt run
  • dbt test

Schedule the job (e.g., hourly, daily) or trigger it on Git commits. Monitor run history and logs directly in dbt Cloud.

9. Best Practices & Next Steps

  • Modularize models: Break complex logic into smaller models.
  • Document extensively: Use YAML descriptions for sources, models, and columns.
  • Implement CI/CD: Integrate dbt Cloud jobs with your team’s CI pipeline.
  • Leverage analyses: Create analyses/ for ad-hoc queries and charts.
  • Scale with packages: Reuse community packages (e.g., dbt_utils).

For deeper dives, explore the official dbt documentation and the dbt Hub for community packages and best practices.

Watch the Video

Conclusion

In just a few steps, you’ve:

  • Provisioned dbt Cloud via Snowflake
  • Initialized a dbt project with Git
  • Defined raw sources and built staging models
  • Created a production dimension with ref()
  • Tested data quality and generated documentation
  • Set up a CI/CD job for automated deployment

Now you have a repeatable, maintainable analytics pipeline. Keep building new models, add tests, and document as you go—your future self (and team!) will thank you.

Pre-Joining Data Vault Business Keys During Load


Data Vault architects often encounter a common challenge when their source systems mix surrogate technical IDs with true business keys. In one real-world scenario, an Employee table used a technical ID as its primary key, while the legitimate business identifier was NBR. Downstream tables—even a self-referencing manager hierarchy—used that ID as a foreign key. When building Hubs on both ID and NBR and linking them with a Same-As Link, the result resembles a Source Vault design—and it quickly becomes unwieldy to repeat ID→NBR lookups for every referencing table.

In this article, we’ll explore the best practice for handling mismatched keys: pre-joining business keys in your staging (or view) layer before loading the Raw Data Vault. You’ll learn why this denormalization is fully compliant with Data Vault principles, how to implement it (even with CDC feeds), and when you might need alternate approaches.



Why Separate Business Keys from Technical IDs?

Surrogate keys (ID) offer stable integer references, simple indexing, and isolation from business rule changes. But they aren’t meaningful outside the operational schema. True business keys (NBR)—like employee numbers, order numbers, or product SKUs—carry real-world meaning and ensure consistency across downstream BI and analytics models.

  • Business Key (NBR): Immutable identifier used in reporting, cross-system integration, and audit.
  • Technical ID (ID): Auto-generated surrogate for OLTP performance and referential integrity.

When you build a Hub on the business key, all Links and Satellites must reference that same key. Mixing in surrogate IDs without conversion violates business lineage and forces repetitive lookups.

The Pitfall: Source Vault by Accident

Faced with source tables referencing ID, some teams create:

  1. A Hub on ID (surrogate),
  2. A Hub on NBR (business key),
  3. A Same-As Link between them to tie ID↔NBR.

This “Source Vault” pattern captures technical IDs as though they were business keys—contradicting the principle that your Vault’s integration key must be a shared business identifier. Moreover, every time you load any Link or Satellite that uses ID, you must look up NBR via the Same-As Link. Tedious, error-prone, and defeating the agility of your Data Vault.

Best Practice: Pre-Join Business Keys in Staging

Data Vault training explicitly allows you to reshape your staging area—denormalizing or normalizing source data to simplify Raw Vault loads. Pre-joining means: before your load process begins, enrich every source record with the true business key (NBR) rather than the surrogate ID. You then feed the Hub/Satellite/Link loaders with business keys directly—no Same-As Link gymnastics required.

How It Works

  1. Create a view or staging query that joins your Employee table (on ID) to itself or to the hierarchy table to retrieve NBR. Add the NBR field into every downstream staging record.
  2. Use that pre-joined staging view as the source for your Data Vault loaders. All Hubs, Links, and Satellites can now reference NBR consistently.
  3. Drop the accidental Hub on ID and Same-As Link—your Vault only contains the true business key Hub (Hub_Employee on NBR).

This approach turns the repeated lookup problem into a one-time denormalization, improving performance and maintainability.
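
A minimal sketch of such a staging view, assuming an Employee table keyed by ID with the business key NBR and a hierarchy table that references employees and managers by their technical IDs (all names are illustrative):

CREATE OR REPLACE VIEW stg_employee_hierarchy AS
SELECT
  emp.NBR        AS employee_nbr,   -- true business key of the employee
  mgr.NBR        AS manager_nbr,    -- business key resolved from the manager's technical ID
  hier.valid_from,
  hier.load_date,
  hier.record_source
FROM employee_hierarchy hier
JOIN employee emp ON emp.ID = hier.employee_id
JOIN employee mgr ON mgr.ID = hier.manager_id;

All downstream Hub, Link, and Satellite loaders can then read employee_nbr and manager_nbr directly from this view.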

Tool Support: dbt & FlowBI

Modern Data Vault toolkits recognize this pattern:

  • datavault4dbt package: Offers a pre_join feature to automatically enrich staging tables with business keys.
  • FlowBI: Includes configuration options to map surrogate IDs to business keys before Vault loads.

When your staging area resides on a cloud data lake (e.g., AWS S3 + Redshift Spectrum, Azure Data Lake + Synapse), these tools can reference external tables and materialize pre-joined views seamlessly.

Handling CDC & Empty Deltas with Forward Lookup

Change Data Capture (CDC) introduces a nuance: sometimes the Employee record doesn’t change (NBR remains the same), but the hierarchy table (manager assignment) does. A pure inner-join staging view would omit the hierarchy change because no new employee row appeared.

Forward Lookup Pattern

  1. When your CDC batch contains only hierarchy changes, load your staging view with the ID column but no accompanying NBR.
  2. Instead of joining to the source Employee table (which has no new row), perform a forward lookup against the target Satellite in your Raw Vault. That Satellite already stores every historic mapping of ID→NBR.
  3. Retrieve the latest NBR value for each ID and inject it into your staging records—just as if you had joined to the source.

This assumes your CDC infrastructure and initial loads correctly populated the Satellite. If CDC reliability is questionable, you may need to capture ID first and resolve to NBR later in a Business Vault layer.
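
A sketch of the forward lookup, again with illustrative names: the staging query first tries the (possibly empty) employee delta and falls back to the latest ID-to-NBR mapping already stored in the employee Satellite:

SELECT
  hier.employee_id,
  COALESCE(emp.NBR, sat.NBR) AS employee_nbr,  -- source delta first, then the Raw Vault Satellite
  hier.manager_id,
  hier.load_date
FROM stg_employee_hierarchy_delta hier
LEFT JOIN stg_employee_delta emp
       ON emp.ID = hier.employee_id
LEFT JOIN (
  SELECT ID, NBR,
         ROW_NUMBER() OVER (PARTITION BY ID ORDER BY load_date DESC) AS rn
  FROM raw_vault.sat_employee
) sat ON sat.ID = hier.employee_id AND sat.rn = 1;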

Caveats: Pre-Join Within a Single Source Only

While you can (and should) pre-join within one source system, avoid chaining pre-joins across multiple systems. If you first wait for System A’s staging load to produce NBR from its ID, and then join System B’s staging to A’s data, you introduce cross-system load dependencies. That forces you to serialize loads—waiting for one system’s batch to finish before you can process another.

To maintain parallel ingestion, each source should be pre-joined only to its own business keys. If two systems share a business key, let that intersection happen downstream in your Vault (via Hub on the shared business key), not in the staging layer.

When You Can’t Pre-Join: Source Vault as Last Resort

In rare environments where you cannot reliably pre-join—legacy databases with locked-down permissions or untrustworthy CDC—you may fall back to a Source Vault. In this design:

  • Your Hubs use surrogate ID as the primary key (capturing the technical ID).
  • You defer mapping to true business keys into the Business Vault layer, after all sources land.

Source Vaults make sense only when staging denormalization is impossible. Otherwise, they sacrifice business clarity for expedience.

Step-By-Step Implementation Guide

  1. Identify Business Keys: Catalog each table’s true business key(s)—not the surrogate PKs.
  2. Build Staging Views: For each source, create a view that LEFT JOINs the “owner” table back to itself (or to its lookup tables) to pull in NBR wherever ID appears.
  3. Validate Keys: Ensure every staging record includes a non-null business key. Flag or quarantine any orphans (IDs without known business key).
  4. Configure Load Scripts: Point your Vault loaders (Hubs, Links, Satellites) at these staging views. Remove any loaders that target surrogate key Hubs or Same-As Links.
  5. Implement Forward Lookup: For CDC batches that may omit source changes, add a fallback join to the Employee Satellite in your load script to fetch the last known NBR for each ID.

Benefits of Pre-Joining Business Keys

  • Simplicity: One denormalization step replaces dozens of repetitive lookups.
  • Performance: Staging views optimize key enrichment in set-based SQL rather than row-by-row Link loads.
  • Lineage: Your Raw Vault contains only true business keys, preserving clear end-to-end lineage.
  • Maintainability: Future source schemas that reference ID get mapped automatically via the shared staging view logic.

Conclusion

Mismatched surrogate and business keys need not derail your Data Vault design. By embracing pre-joining business keys in your staging layer—along with forward lookups for CDC edge cases—you preserve a clean, business-centric Vault model without cumbersome Same-As Links. Reserve Source Vaults only for environments where staging denormalization simply cannot occur. With these best practices, your Vault remains performant, transparent, and aligned with true business identifiers.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

10 Essential Skills Your Team Needs to Build an Analytical Data Platform


Build an Analytical Data Platform

Building a modern analytical data platform is more than just choosing the right database or ETL tool. It requires a blend of business insight, data expertise, architecture design, and automation savvy. In this article, we’ll explore ten essential skills your team needs to design, develop, and maintain a robust, scalable, and high-value data platform.



1. Business Understanding

Before diving into any technical work, your team must understand the business domain and the data itself. This doesn’t mean every engineer needs to be a data analyst, but they should know:

  • Which source systems hold the data (CRM, ERP, marketing platforms, etc.)
  • Key business objects (customers, contracts, opportunities) and how they relate
  • Business processes behind the data, like a customer’s lifecycle or sales funnel

By grounding the team in real-world outcomes—such as improving customer retention or reducing churn—engineers stay focused on delivering measurable ROI.

2. Objective Setting & ROI Focus

Clear objectives guide every stage of your platform’s development. Whether your goal is to accelerate financial reporting or enable real-time marketing analytics, defining the desired outcomes:

  • Helps prioritize features and data sources
  • Aligns stakeholders around common metrics
  • Boosts motivation by tying work to tangible business value

Teams that regularly track ROI milestones can adjust scope and resources proactively, ensuring the platform grows in step with organizational goals.

3. Data Understanding & Modeling

A deep dive into your source systems reveals hundreds—even thousands—of tables. Your engineers need to know:

  • Primary and foreign keys connecting entities
  • Relationship cardinalities (one-to-one, one-to-many, many-to-many)
  • Data quality quirks and domain-specific rules

This understanding informs the modeling approach—be it third normal form, star schemas, or Data Vault—ensuring downstream analytics are consistent and reliable.

4. Data Acquisition Techniques

Extracting data from source systems can take many forms:

  • Full daily extracts via CSV or JSON files
  • API calls for near-real-time data feeds
  • Change Data Capture (CDC) for incremental updates

Knowing when to use each approach minimizes data latency, reduces load times, and optimizes storage. CDC, in particular, slashes the volume of data transferred, but requires robust handling to maintain consistency.
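
As an illustration, CDC deltas are often applied to a Bronze table with a merge statement. The sketch below uses Delta Lake's MERGE syntax with hypothetical table names and a simple operation flag:

MERGE INTO bronze.customer AS t
USING staging.customer_cdc AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'U' THEN UPDATE SET *
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN NOT MATCHED AND s.op = 'I' THEN INSERT *;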

5. Structured Architecture: The Medallion Approach

Dumping raw data into a single database is a recipe for chaos. Instead, adopt a layered “medallion” architecture:

  • Bronze Layer (Staging/Landing): Raw data as ingested
  • Silver Layer (Cleansed, Integrated): Unified and harmonized data across systems
  • Gold Layer (Presentation): Curated tables/views for business users and BI tools

Figure: Medallion architecture in an analytical data platform

This separation of concerns simplifies debugging, improves performance, and clarifies responsibilities for each team member.

6. Data Integration & Modeling in the Silver Layer

The silver layer is where the “magic” happens:

  • Integrating disparate systems into a unified view
  • Applying your chosen modeling technique (e.g., star schema, Data Vault)
  • Ensuring referential integrity and consistent business definitions

Investing in a proven modeling framework not only scales with additional data sources but also enables automation and accelerates the onboarding of new subject areas.

7. Temporality & Historical Tracking

Beyond technical timestamps (extract load times), your data has business timelines:

  • Contract start/end dates
  • Customer sign-up and churn events
  • Promotion or campaign effective periods

Implementing snapshot tables, slowly changing dimensions, or time-aware modeling ensures accurate trend analysis, historical comparisons, and auditability.
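
One way to capture such business timelines is a dbt snapshot, which tracks changes as a type 2 slowly changing dimension. The source and column names below are assumptions:

{% snapshot contract_history %}

{{ config(
    target_schema='snapshots',
    unique_key='contract_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ source('crm', 'contract') }}

{% endsnapshot %}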

8. Code Generation & Automation Tools

Hand-coding every pipeline is time-consuming and error-prone. Leverage tools that:

  • Automatically generate ETL/ELT code based on templates
  • Orchestrate complex workflows and dependencies
  • Enforce consistency through standard patterns and conventions

Automation not only speeds up development but also improves data quality by reducing manual interventions.

9. Agile Development & Traceability

Adopting an agile mindset means delivering small, working increments quickly. Apply traceability by:

  • Defining clear targets (e.g., monthly revenue report)
  • Mapping those targets back to specific source tables
  • Focusing on data that directly supports your objectives

This approach prevents “scope creep” and ensures that every pipeline built serves an immediate analytical need.

10. DevOps & Cost Management

Once pipelines are automated, you need:

  • Orchestration frameworks (e.g., Airflow, Dagster) to schedule and monitor jobs
  • CI/CD for data code, including version control and automated testing
  • Cost monitoring tools to track cloud resource usage and optimize performance

Effective DevOps practices guarantee reliability, while cost-awareness keeps your platform sustainable in the cloud era.


Watch the Video

Conclusion

Building an analytical data platform is a multifaceted endeavor. By equipping your team with these ten skills—spanning business understanding, data modeling, architecture design, automation, and DevOps—you’ll lay the foundation for a platform that delivers consistent insights, scales gracefully, and drives real business value.
