Databricks and dbt: A Practical Approach to Data Vault Implementation


Databricks and dbt

Selecting the appropriate technology stack is a critical factor in the successful delivery of a Data Vault 2.0 architecture. Two technologies that work effectively together for large-scale data solutions are Databricks and dbt. When combined, they provide a practical way to implement Data Vault models while addressing performance, governance, and auditability requirements.

It can be argued that dbt’s role in a Databricks-based architecture is not always essential, since many of its core capabilities (such as transformation scheduling, lineage tracking, and documentation) can also be implemented using native Databricks features. Understanding the specific role each tool plays helps clarify where they complement each other and where functionality overlaps.



Databricks as the Processing and Storage Platform

Databricks’ Lakehouse architecture combines the scalability of a data lake with the reliability of a warehouse. Its Delta Lake technology offers ACID transactions, schema enforcement, and time travel, enabling precise historical querying; these capabilities are directly relevant to Data Vault's historization requirements.
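
For example, Delta Lake's time travel lets you query a table as it existed at an earlier point in time or at a specific version, which is useful when reconciling Satellite history. The table name below is illustrative:

SELECT * FROM silver.customer_sat TIMESTAMP AS OF '2024-01-01 00:00:00';

-- or by table version
SELECT * FROM silver.customer_sat VERSION AS OF 42;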

With Unity Catalog, Databricks centralizes metadata management and enforces fine-grained access control, ensuring sensitive attributes are protected without introducing unnecessary satellite splits. This alignment between governance and performance is particularly relevant in Data Vault environments.
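
As a sketch of what this can look like, Unity Catalog supports column masks defined as SQL functions, so a sensitive attribute can stay in its Satellite and be masked for unauthorized users. The function, schema, and table names below are hypothetical:

-- Hypothetical masking function: only members of a governance group see the raw value
CREATE FUNCTION governance.masks.mask_email(email STRING)
RETURNS STRING
RETURN CASE WHEN is_account_group_member('pii_readers') THEN email ELSE '*** masked ***' END;

-- Attach the mask to a Satellite column instead of splitting the Satellite
ALTER TABLE silver.customer_sat ALTER COLUMN email SET MASK governance.masks.mask_email;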

dbt as the Transformation and Orchestration Layer

dbt manages and automates SQL-based transformations in a modular and version-controlled manner. In a Data Vault context, dbt enables:

  • The creation of Hubs, Links, and Satellites through templated, reusable models. Different packages can be leveraged here, such as our datavault4dbt package, which is continuously updated to stay compliant with the most recent Data Vault standards (a brief example follows below).
  • Integrated testing to validate business keys, relationships, and data quality.
  • Automated documentation that directly reflects the structure and dependencies of the Data Vault.

This structured approach makes transformations transparent and repeatable, supporting the auditability requirements inherent to Data Vault.
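
As an illustration of this templated approach, a Hub model built with the datavault4dbt package can look roughly like the sketch below. The macro parameters (hashkey, business_keys, source_models) follow the package's documented interface, but the exact signature may vary between package versions, and the staging model and column names are assumptions:

{{ config(materialized='incremental') }}

{{ datavault4dbt.hub(
    hashkey='hk_customer_h',
    business_keys='customer_number',
    source_models='stg_customer'
) }}

Because every Hub follows the same pattern, adding another Hub is largely a matter of pointing the macro at a different staging model and business key.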

Integration in a Data Vault Workflow

When Databricks and dbt are deployed together:

  • Data ingestion occurs in Databricks, storing raw datasets as Delta tables, usually in the Bronze layer.
  • dbt transformations generate Raw Vault entities and Business Vault objects in the Silver layer.
  • Governance and security controls are enforced via Unity Catalog without altering the Data Vault model structure.

Figure: Bronze, Silver, and Gold layers in the Data Vault structure

This approach preserves Data Vault’s methodological structure while using Databricks’ distributed compute and storage capabilities.

Business Value when combining dbt and Databricks

The combined use of Databricks and dbt offers:

  • Scalable processing of large, complex datasets: Databricks handles enterprise-scale data efficiently, while dbt structures transformations into modular, reusable components.
  • Consistent governance across all layers of the Data Vault: dbt’s lineage and documentation, plus Unity Catalog’s access control, ensure compliance and transparency end to end.
  • Lower operational risk through tested, version-controlled transformations: Git-based versioning and automated tests in dbt reduce errors before execution on Databricks.
  • Improved query performance for information marts and analytics: Delta Lake optimizations and dbt’s pre-aggregated tables with business logic minimize expensive joins.

For organizations building Data Vault on Databricks, dbt strengthens structure and quality while Databricks ensures scalability and performance.

– Ricardo Rodríguez (Scalefree)

Agile Development in Data Warehousing with Data Vault 2.0

Agile Development in Data Warehousing: Initial Situation

Agile methodologies bring flexibility and adaptability to data warehousing, making them a natural fit for modern approaches like Data Vault 2.0. A common issue in data warehousing projects is that the scope is often missing, and many processes, such as controlled access, GDPR handling, auditability, documentation, and infrastructure, are not optimized. Additionally, data warehouse projects that do have a scope often begin without a real focus on business value, mostly because the use cases are not clearly communicated and the data architects do not know where to start. The consequence is that no business value can be delivered.

Data Vault 2.0 Methodology

It is often assumed that Data Vault 2.0 is only a modeling technique, but this is not correct. Data Vault 2.0 comprises the modeling technique, a reference architecture, and the methodology. The methodology introduces project management tools such as CMMI, Six Sigma, and Scrum to solve the problems described above. While CMMI and Six Sigma address general project management issues, Scrum is used mainly within the development team and provides the framework for a continuously improving development process. The use of agile development in Data Vault 2.0 projects is described in more detail below.

The Scope of a Sprint

The first step in setting up a data warehouse project in an agile way is defining the objective of the project in just one or two pages. Unlike waterfall projects, the goal is to produce working pieces of usable output, such as reports or dashboards, in continuous iterations called sprints. This means we don't need to plan the entire project in detail; instead, we can build around a general idea or goal for the final data warehouse and then focus on planning the first sprints. To address the aforementioned problems, the focus of the sprints needs to be centered on business value. For this reason, it is important to receive constant feedback from business users as part of a continuous improvement process.

Define the project

Both the scope of a sprint and the architecture follow a business-value-driven approach, built vertically rather than horizontally. This means they are not built layer by layer but feature by feature. A common approach for this is the Tracer Bullet approach: based on the business value, defined by a report, a dashboard, or an information mart, the source data is identified, modeled through all layers, and loaded.

As shown in Figure 1, the entire staging layer is not built up front; rather, a small part of each layer is built based on the data in scope, in this case the SalesReport.

Agile Development

Before new functionality can be implemented in a sprint, it needs to be defined. This task lies with the product owner, who writes and prioritizes the user stories. As already explained, the goal of a sprint is to produce working pieces of usable output called features. In addition, there are technical topics that need to be considered. Various methods support sprint planning, such as planning poker or Function Point Analysis, which are discussed in more detail in another article.

It is also useful to evaluate the sprint while it is still ongoing. If the development team does not manage to implement a feature within the sprint, this is often a good indicator that the scope is too large.

To avoid this, all work packages that are not relevant for the feature should be removed from the sprint. In practice, however, these work packages are often not removed out of fear of the business users' reaction. To address this fear, it is important to explain to the business users that the work packages are only moved temporarily into the backlog and will still be delivered, just in a later sprint.

Figure 1: Data Vault 2.0 Architecture

Due to the flexible and scalable Data Vault model, these layers can be extended with the next feature with little to no re-engineering. This is possible because Data Vault consists of a Raw Data Vault and a Business Vault, covering both the logical architecture and the data modeling perspective. The Raw Data Vault is modeled in a data-driven way by integrating data on business keys. Only hard business rules, such as data type conversions or hash key calculations, are applied there. All other, soft business rules are applied only in the Business Vault.

Here, we turn data into information. For this reason, the Raw Data Vault requires less refactoring and can be extended virtually without limit.

Agile Development Review

Another important success factor for agile projects is proper review and improvement. Before the next sprint starts, the following meetings must be held by the team:

  • The sprint review meeting: This meeting reviews the delivered features. Usually the development team, the product owner, the Scrum Master, and the end users participate.
  • The retrospective meeting: This meeting usually takes place directly after the review meeting and focuses on identifying activities that need to be improved.
  • Backlog refinement: This meeting prioritizes the user stories and makes sure the team understands what to do.
  • Sprint planning: This meeting plans which user stories fit into the next sprint, based on effort estimates.

It is important that these meetings are held so that the sources of errors can be found. In this way, the outcome of a project can be improved and the development processes optimized iteratively.

Conclusion

Data Vault 2.0 is not only a scalable and flexible modeling technique but a complete methodology for achieving the enterprise vision in data warehousing and information delivery by following an agile approach and focusing on business value. By using agile methods in data warehousing, projects can concentrate on business value and deliver useful products to the customer.

The Business Value of Data Vault – and Why It Matters

The Business Value of Data Vault

Data Vault is not just another data model. It’s a pragmatic architecture and methodology built for change, auditability, automation and fast delivery — and those traits translate directly into measurable business advantages compared with traditional approaches or doing nothing at all.



Why we need a different approach to data

Most organisations don’t start with the perfect, governed data platform. They begin with spreadsheets, a few scripts, maybe a single operational system. Then the business grows: new SaaS apps (Salesforce, shop platforms), partner APIs, IoT feeds, regulatory requirements and ad-hoc reporting requests. Before long you have multiple sources, inconsistent definitions, and changing business rules.

Traditional models — Kimball’s star schemas or Inmon’s normalized enterprise warehouse — work well when sources, rules and requirements are stable. But the reality today is constant change. That’s exactly the gap Data Vault was designed to fill: a model and architecture focused on capturing raw facts reliably, separating business logic, and enabling incremental, auditable growth.

High-level difference: Data Vault vs. traditional models (or none)

Put simply:

  • Traditional models (Kimball/Inmon): great for reporting, intuitive star schemas for business users, but rigid and costly to change when sources or rules evolve.
  • No model / ad-hoc reports: fastest at day zero but leads to duplicated effort, inconsistent numbers, and brittle scripts that break as systems change.
  • Data Vault: engineered for change. Capture everything in a consistent, standard way, keep full lineage, and build business logic and reporting layers on top. This structure enables automation, auditability and rapid delivery of real business reports.

Concrete business advantages of implementing Data Vault

1. Faster time-to-value (Tracer-bullet delivery)

Data Vault enables an iterative “tracer bullet” approach: pick a high-value report, identify the raw source data, ingest and model only what’s needed to deliver that report end-to-end. Business users get a working dashboard in weeks (not months), giving immediate value, generating trust, and allowing the team to expand incrementally.

2. Built-for-change — lower cost of future change

Because Data Vault separates raw data (hubs, links, satellites) from business rules (Business Vault / information marts), adding a new source, new attribute, or updated business rule rarely requires tearing down and rebuilding existing models. That translates into lower rework, lower maintenance costs, and much faster onboarding of new systems.

3. Automation reduces delivery time and human error

Data Vault entities follow standardized patterns. Hubs look alike; satellites follow the same tracking patterns. That repeatability makes the ingestion and loading processes highly automatable with modern tools (for example, dbt builders, Wherescape-style automation, Coalesce). Automation frees developers to focus on business logic instead of tedious ETL plumbing — more predictable pipelines, fewer bugs, faster delivery.

4. Auditable, traceable data for compliance and trust

Every record in a Data Vault carries load dates, record source identifiers, and historical versions. That full lineage is invaluable for audits, GDPR/DSR processes, finance reconciliations and provenance requirements for AI. When regulators or internal auditors ask “where did this number come from?” you can show the full trail back to source.

5. Future-proof architecture for analytics and AI

Data Vault’s decoupling of raw capture from business logic means you can adopt new storage or compute technologies (data lakes, cloud object stores, NoSQL or streaming platforms) without reworking the core model. It’s an architecture that scales with both data volume and analytics sophistication: data science teams can access the raw, auditable records they need without breaking downstream reporting.

6. Reduced risk and predictable governance

Standardized patterns, auditable history and clear separation of concerns improve governance. Data owners can define rules in a separate layer, compliance teams can inspect lineage, and operations can automate quality checks. That lowers operational risk and makes governance predictable rather than ad-hoc.

Specific business problems Data Vault can solve

Below are concrete problems organisations experience — and how Data Vault addresses them.

  • M&As and rapid source expansion: After an acquisition you must onboard dozens of new systems. Data Vault lets you ingest raw records quickly and map business rules later, so analytics can start immediately without delaying integration for perfect master data alignment.
  • Conflicting definitions across departments: Different teams report different revenue numbers. With Data Vault you capture every source event and build reconciled information marts, so one canonical report can be produced while source-level values remain auditable.
  • Regulatory or audit requests: Need to prove how a figure was derived six months ago? Data Vault’s lineage (load timestamps, record source) shows exactly which source values contributed to any derived metric.
  • GDPR / Data Subject Requests: Because raw values and their sources are stored with provenance, it’s easier to locate and isolate personal data, show retention windows, or delete/segment records if needed.
  • AI/ML model drift and explainability: Models need defensible inputs. Data Vault keeps the raw inputs and transformation history separate, so feature engineers and auditors can trace which raw values produced a model input.
  • Slow BI delivery and constant rework: BI projects where every change requires a model rewrite burn budget. Data Vault’s incremental approach reduces rework and keeps BI teams delivering incremental, reliable reports.
  • Operational reporting vs historical analytics conflict: Operational needs often demand current-state views; analytics wants full history. Data Vault stores full history by design, while downstream information marts can present both current and historical perspectives appropriately.

How the business benefit translates to measurable outcomes

Organisations that adopt Data Vault commonly see measurable improvements such as:

  • Shorter lead times for report delivery (weeks vs months for new reports).
  • Lower total cost of ownership because changes require less rework and are more automatable.
  • Fewer data incidents and faster root-cause analysis because lineage is built-in.
  • Stronger compliance posture and faster audit responses.
  • Better support for analytics and AI initiatives — because data scientists get consistent, traceable raw data.

These translate to business outcomes: faster decisions, less risk, better regulatory positioning, and a higher ROI on analytics investments.

Practical adoption path — a pragmatic recipe

You don’t have to flip the switch for everything at once. A typical, low-risk path is:

  1. Choose one high-value report (the tracer bullet). Identify required sources and ingest the raw records into the Raw Vault.
  2. Build the Business Vault where you apply the business rules for that specific report (transformations live here, not in the raw zone).
  3. Deliver an Information Mart tuned for reporting (star schema if that’s what BI needs) that offers the business an immediate, usable report.
  4. Iterate and scale — add more reports and sources, reuse existing Hubs/Links/Satellites, automate loading patterns and apply governance over time.

This approach gives quick wins, builds trust, and progressively modernises your data landscape without huge upfront modelling effort.

When Data Vault might be overkill

Data Vault is powerful, but it’s not always necessary. If you’re a very small organisation with a single system, little change, and a handful of reports, a simple star schema or a few curated data marts could be more pragmatic. Evaluate:

  • Number of sources and expected change rate
  • Regulatory/audit requirements
  • Scale of historical data needs
  • Long-term analytics and AI ambitions

If those requirements are modest today but expected to grow, Data Vault often makes sense as a future-proofing step you can introduce incrementally.

Final thoughts — why business leaders should care

At the executive level, Data Vault should be evaluated not as a modeling fad but as an investment in enterprise agility, compliance and scalable analytics. The technical patterns (hubs, links, satellites) map directly to business outcomes: rapid delivery of trusted reports, reduced change costs, auditable provenance, and a platform ready for advanced analytics and AI.

Compared to doing nothing (ad-hoc scripts) or building a rigid, monolithic warehouse, Data Vault gives you a repeatable way to capture everything, govern it, and build the business-facing outputs that actually create ROI.

If you’re considering a modern data platform, start with a tracer-bullet use case, prove the approach, automate the repeatable parts, and keep the focus on business outcomes rather than perfect modelling up front.

Watch the Video

Meet the Speaker


Lorenz Kindling
Senior Consultant

Lorenz works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies across various industries for Scalefree International. Prior to Scalefree, he worked as a consultant in the field of data analytics, which gave him a comprehensive overview of data warehousing projects and the common issues that arise.

Differences between Data Vault 2.0 and Data Vault 2.1

Data Vault 2.0 vs Data Vault 2.1

As organizations continue to grapple with rapidly evolving data landscapes, Data Vault remains a leading methodology for building scalable, auditable, and flexible data warehouses. With the release of Data Vault 2.1, practitioners and architects often ask: “What’s changed since 2.0?” In this article, we’ll dive into the differences across three core areas—design principles, ETL patterns, and modeling best practices—and show you how 2.1 enhances your ability to tackle modern data challenges like data lakehouses, data mesh, and nested JSON feeds.



1. Design Principles: Staying True but Embracing Modern Architectures

Core Continuity

At its heart, Data Vault 2.1 retains all the foundational tenets of 2.0: separation of concerns (Hubs, Links, Satellites), immutable history, and decoupling of raw data capture from business transformations. If you already have a robust 2.0 implementation, there’s no need for a forklift upgrade—2.1 is evolutionary, not revolutionary.

Lakehouses, Mesh, and Fabric

Where Data Vault 2.1 shines is in explicitly addressing emerging architectures. You’ll find guidance on integrating Vaults within data lakehouses (e.g., Delta Lake, Apache Iceberg), as well as how Vault concepts align with data mesh domains and data fabric overlays. Instead of an “Enterprise Data Warehouse” monolith, 2.1 helps you embed Vault patterns into cloud-native, distributed environments.

Logical vs. Physical Modeling

With the proliferation of diverse storage engines—relational, columnar, NoSQL document stores, and graph databases—2.1 distinguishes your logical Vault model (Hubs, Links, Satellites) from its physical implementation. You now have clear guidelines on:

  • Keeping the logical model technology-agnostic
  • Adapting physical denormalization or document embedding strategies per platform capabilities
  • Optimizing storage formats (e.g., Parquet, Delta, or JSONB) while preserving auditability

This separation equips data engineers to leverage the strengths of their chosen database without sacrificing Vault integrity.

2. ETL Patterns: From Batch to Streaming and JSON

Expanded CDC Strategies

Data Vault 2.1 deepens its coverage of Change Data Capture (CDC) patterns. You’ll find refined techniques for:

  • Transactional order guarantees: Ensuring raw Vault loads adhere to source system timestamps to preserve lineage.
  • Handling late-arriving or out-of-order events: Techniques to backfill or correct Satellites without breaking immutability.
  • Parallel loading: Avoiding cross-system dependencies by pre-joining keys within each source’s staging area.

Informal “Pre-Join” Denormalization

2.1 codifies the practice of pre-joining business keys in staging or external views—a pattern previously covered only in practitioner forums. This denormalization step enriches payload tables with true business keys upfront, eliminating repetitive lookups during Link loads and simplifying ETL script maintenance.

JSON and Nested Structures

Perhaps the most visible ETL addition is 2.1’s JSON processing module. With more sources emitting nested, semi-structured payloads, new patterns include:

  • Flatten-first loading: Initial extraction of atomic fields into raw Satellites before storing full payloads (sketched after this list).
  • Schema evolution handling: Capturing structural changes (added arrays or nested objects) as metadata in Vault artifacts.
  • Selective shredding: Automating transformation of common sub-documents into separate Hubs/Links/Satellites.
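
As a rough sketch of the flatten-first idea, a staging view can extract the atomic fields needed for Hub and Satellite loads while keeping the full payload for later shredding. The example uses Spark SQL's get_json_object function with hypothetical table and field names:

CREATE OR REPLACE VIEW stg_orders_json AS
SELECT
  get_json_object(payload, '$.order.order_number')    AS order_number,    -- business key
  get_json_object(payload, '$.order.customer_number') AS customer_number,
  get_json_object(payload, '$.order.status')          AS order_status,
  payload                                             AS full_payload,    -- kept for selective shredding later
  load_date,
  record_source
FROM raw_orders_feed;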

3. Modeling Best Practices: Updated Patterns for a Distributed World

Managed Self-Service BI

Data Vault 2.1 recognizes the shift toward self-service analytics within federated teams. Best practices now recommend:

  • Role-based access controls at the raw & business Vault layers, ensuring data stewards can grant fine-grained permissions.
  • Row- and column-level security patterns that can be implemented natively in cloud warehouses (Snowflake masking policies, SQL Server RLS, etc.).
  • Embedding governance metadata in Vault tables, enabling automated lineage and impact analysis for downstream consumers.

Expanded Satellite Strategies

While 2.0 introduced Point-in-Time (PIT) and Bridge tables for performance, 2.1 adds:

  • Snapshot Satellites: Prebuilt structures for frequently used combinations of Hubs and Satellites, ideal for dimensional views.
  • Behavioural Satellites: Grouping event-driven attributes (e.g., clickstreams) separately from master-data Satellites.

Cross-Domain Linkage

Data Vault 2.1 extends guidance on managing relationships across micro-warehouse domains—a nod to data mesh. It clarifies when to use:

  • Cross-domain Links: For relationships spanning autonomous teams with separate Hubs.
  • Reference Hubs: Capturing shared code lists (e.g., currency, country) that multiple domains consume.

4. Educational & Organizational Enhancements

Rich Video & Quiz Content

Training for 2.1 now includes extensive pre-recorded modules by Dan Linstedt focusing on conceptual foundations, freeing up live classroom time for interactive labs and advanced case studies. Over 40 quizzes interspersed throughout the curriculum reinforce learning and feed directly into the certification exams.

Certification & Community

Becoming a Data Vault 2.1 certified practitioner involves:

  • 5 days of combined video and onsite training (versus one day of video + three days live in 2.0).
  • An updated exam covering new ETL patterns, JSON handling, and modern architecture integration.
  • Access to an expanded Slack community and biweekly “Vault Clinics.”

Choosing When to Adopt 2.1

Given the backwards-compatible design, migration from 2.0 to 2.1 can be phased:

  1. Retain existing Hub/Link/Satellite structures in the Raw Vault.
  2. Gradually introduce new ETL patterns (JSON shredding, snapshot Satellites) in parallel.
  3. Implement enhanced governance and self-service controls in the Business Vault.
  4. Leverage certification resources to upskill architects and engineers on updated best practices.

Conclusion

Data Vault 2.1 advances the methodology by weaving in lessons from cloud-native architectures, self-service analytics, and semi-structured data sources—without discarding the proven foundation of 2.0. Whether you’re standardizing a data mesh deployment or optimizing your JSON pipelines, 2.1 provides the patterns and guardrails needed to build a modern, auditable, and flexible data platform.

Watch the Video

Meet the Speaker


Marc Winkelmann
Managing Consultant

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016, he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020, he became a Data Vault 2.0 instructor for Scalefree.

Unlock the Intelligence Layer: LLMs in Data Warehousing and the Future of Your Data


“Stop writing complex SQL, start talking to your data?”

This provocative question highlights a growing shift in how we interact with data. For years, getting answers from a Data Warehouse meant writing SQL queries or relying on pre-built dashboards.

For many organizations, their data platforms remain underutilized because accessing insights still requires writing code or navigating complex dashboards. It’s time to go beyond static reports and unlock a true intelligence layer on top of your data warehouse. Recent advances in Large Language Models (LLMs) and Natural Language Processing (NLP) are making data warehouses smarter, faster, and easier to use for everyone. In this article, we’ll explore how LLMs can transform the way you interact with your data – from using plain English queries instead of SQL, to AI-driven discovery of hidden insights, to enriching your data pipelines – and why this shift represents the future of data analytics.


From Complex SQL to Conversational Queries

Business users often depend on data engineers or analysts to fetch answers, creating bottlenecks in decision-making. Even data professionals themselves spend considerable time writing and optimizing SQL, rather than interpreting results. What if anyone could simply ask the data warehouse a question in plain language and get the answer? This is the promise of LLMs as an “intelligence layer”, a layer that bridges complex datasets and human comprehension. Advanced LLMs can understand a user’s question or request and generate the appropriate SQL queries on the fly.

This technology (often called Text-to-SQL or Natural-Language-to-SQL or NL2SQL) has rapidly evolved and major technology players have already taken note. For example, Databricks introduced a Natural Language Query feature (LakehouseIQ) to let users ask questions of their Lakehouse, and Snowflake is also exploring LLM-driven query capabilities.

Imagine asking your data warehouse in plain English: “What were our top-selling products last quarter by region?”. This text input is passed to an LLM, often enriched with company-specific context via RAG (retrieval-augmented generation), and the system then translates it into a correct, optimized SQL query that retrieves the answer.
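
To make this concrete, a question like the one above might be translated by the LLM into something along the following lines. The table and column names (sales, products, regions) are purely illustrative, and the exact date arithmetic depends on the SQL dialect:

-- "What were our top-selling products last quarter by region?"
SELECT
  r.region_name,
  p.product_name,
  SUM(s.quantity) AS units_sold
FROM sales s
JOIN products p ON p.product_id = s.product_id
JOIN regions  r ON r.region_id  = s.region_id
WHERE s.sale_date >= DATEADD(quarter, -1, DATE_TRUNC('quarter', CURRENT_DATE))
  AND s.sale_date <  DATE_TRUNC('quarter', CURRENT_DATE)
GROUP BY r.region_name, p.product_name
ORDER BY units_sold DESC;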


Of course, translating natural language to SQL at an enterprise scale isn’t trivial. Complex schemas, ambiguous user input, and security considerations mean the LLM has to be both smart and careful. Uber has built such an AI system that works on an enterprise scale level.

Uber’s QueryGPT is an NL2SQL system that uses a multi-step, RAG-based pipeline combining LLMs with retrieval and agent modules. It fetches context via similarity search over a vector database of example queries and schema information for SQL generation. To manage Uber’s vast data ecosystem, QueryGPT employs specialized agents:

  • An Intent Agent classifies requests by business domain.
  • A Table Agent suggests tables for the query.
  • A Column Prune Agent trims irrelevant columns to reduce prompt length.

The LLM then produces the SQL query and an explanation.

This layered design allows QueryGPT to handle large schemas and reliably generate complex multi-table queries. It’s a hybrid architecture where multiple transformer calls specialize in sub-tasks, enabling scalable, accurate NL2SQL as a production service, saving thousands of Uber employees significant time by mid-2024.

AI-Augmented Data Discovery and Insights

Beyond simply fetching results for user queries, LLMs can augment data discovery by revealing insights that users might not have explicitly asked for. Traditional dashboards show you what is happening, but a smart LLM-based system can tell you why it’s happening and highlight patterns you might not notice. This is often called augmented analytics – using AI to automatically find important correlations, trends, outliers, and drivers in your data.

LLMs excel at interpreting data outputs and providing additional context. For example, rather than just displaying a chart or a table, an LLM can generate a written summary pointing out key trends or anomalies. They can explain which metrics are up or down and suggest potential reasons (for instance, detecting that “conversion rates dipped in July, possibly due to seasonality or inventory issues”), enabling quicker and more informed decision-making.

Another area where LLMs can significantly reduce manual effort is in the creation and maintenance of data catalogs. Documenting data models, table structures, and especially individual column descriptions is often time-consuming and easily skipped due to missing resources, despite being crucial for an effective use and accessibility of the data. LLMs can automate large parts of this process by generating descriptions based on data profiling, SQL logic, naming conventions, and metadata.

dbt Cloud recently released its dbt Copilot AI agent, which supports developers in various ways, for example by letting the AI analyze SQL code and schema metadata to automatically generate model and column descriptions.

LLMs in Your Data Pipeline: Enrichment and Efficiency

LLMs don’t just enhance how users interact with the Data Warehouse; they can also improve the data itself and the efficiency of data engineering processes. In modern ELT (Extract-Load-Transform) pipelines, a lot of time is spent cleaning, enriching, and preparing data for analysis. Here, LLMs offer new tools to automate and augment these steps.

One promising use case is the semantic enrichment of data. Large Language Models have absorbed a vast amount of world knowledge and language patterns, and they can use that to fill gaps or add context to your raw data. For example, imagine you have a dataset of customer feedback where each entry is a text comment. An LLM could automatically classify the sentiment of each comment (positive/negative), extract key themes, or even generate a summary of common issues. In this way, unstructured data becomes structured insights without manual effort. The image below illustrates how an LLM is integrated into a data pipeline: text inputs from a CustomerFeedback table are passed to an OpenAI API endpoint, where the model returns structured sentiment labels that are then stored back in the database.

Figure: Sentiment analysis with a Large Language Model in the data pipeline
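
A minimal sketch of such an enrichment step, assuming a hypothetical classify_sentiment() user-defined function has been registered that wraps the LLM API call:

-- classify_sentiment() is a hypothetical UDF wrapping an LLM API call
CREATE OR REPLACE TABLE customer_feedback_enriched AS
SELECT
  feedback_id,
  feedback_text,
  classify_sentiment(feedback_text) AS sentiment_label  -- e.g. 'positive' / 'negative' / 'neutral'
FROM customer_feedback;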

In a practical case study, LLMs were used to enrich an academic dataset by inferring missing attributes (like guessing a person’s gender from their name with high accuracy), which outperformed dedicated API services. This showcases how LLMs can bring external knowledge and reasoning to enhance your data.

Another area is metadata enrichment. Enterprise data is often filled with cryptic column names and jargon that hinder usability. LLMs can intelligently expand abbreviations and annotate fields with business-friendly descriptions. For instance, an LLM-driven catalog might take a column labeled “CUST_ID” and annotate it as “Customer Identifier, unique ID for each customer record”.

LLMs can also assist in the coding and transformation process itself. Data engineers can leverage LLMs to generate boilerplate code or SQL for transformations, document pipeline logic in plain English, or even detect anomalies and data quality issues through pattern analysis. By automating tedious parts of data preparation and providing AI-generated suggestions, LLMs free up engineers to focus on higher-level architecture and problem-solving.

Conclusion

While the promise of an LLM-powered intelligence layer is exciting, it’s important to approach it with a clear strategy. Successful implementation requires considering a few key challenges and best practices. Data quality and governance are more crucial than ever. If your underlying data is inaccurate or poorly structured, the AI’s answers will be unreliable. As the saying goes, “garbage in, garbage out.”

Ensuring clean, well-organized data (and maintaining a robust data governance program) will help the LLM produce meaningful and correct insights. Additionally, organizations may need to fine-tune or configure their LLMs to understand industry-specific terminology or business context. This reduces the chance of the AI misinterpreting what a user asks or generating an incorrect query.

Privacy and security are another important consideration. If your data includes sensitive information, you must ensure that any AI tool accessing it complies with your security requirements. This might involve using self-hosted models or secure APIs, and setting up proper access controls.

The dream of a self-service analytics experience, where users simply “talk to the data and get answers”, is quickly becoming a reality. This evolution may redefine roles (enabling analysts and engineers alike to focus on higher-value tasks) and open up analytics to a wider audience than ever before. It’s an exciting time to be a data professional, but also one that demands staying informed and ready to adapt.

– Ole Bause (Scalefree)

Get Started with dbt: A Quick 15-Minute Guide Using Snowflake

Get Started with dbt Using Snowflake

Introduction to dbt

dbt (data build tool) revolutionizes the way teams build and maintain analytics workflows by bringing software‐engineering best practices to SQL‐based data transformations. Instead of ad‐hoc scripts, dbt encourages version control, modular models, testing, documentation, and lineage graphs. In this guide, you’ll learn how to go from zero to your first dbt models—running entirely in the cloud on Snowflake—in under 15 minutes.



Why Choose dbt Cloud + Snowflake?

  • Fully managed: No local install or complex orchestration; dbt Cloud handles hosting.
  • Quick setup: Snowflake’s partner connector spins up a dbt trial, pre-configured with your credentials.
  • Best practices out of the box: Built-in IDE, job scheduling, Git integration, and documentation.
  • Scalable performance: Leverages Snowflake’s compute power for fast model builds.

Prerequisites

  1. A Snowflake trial account (free, 1-minute setup).
  2. A modern browser (Chrome, Firefox) or VS Code for remote development.
  3. Basic familiarity with SQL.

1. Launching dbt Cloud from Snowflake

Once logged into your Snowflake trial, navigate to the Data Products → Partners → Connect pane. Scroll to find the dbt entry and click Connect → Launch. This will automatically:

  • Provision a dbt Cloud trial account
  • Create a new Snowflake database and warehouse
  • Inject Snowflake credentials into your dbt Cloud connection

You’ll land in the dbt Cloud dashboard, ready to start your first project.

2. Exploring the dbt Cloud UI

In dbt Cloud’s left navigation bar you’ll find:

  • Develop: Interactive IDE for coding models, sources, tests, documentation.
  • Deploy: Job definitions, environments, and run history.
  • Documentation: Auto‐generated docs site with lineage graphs.
  • Settings: Account, project, and Git integration.

Click Develop → IDE (hosted in Chrome or connect your VS Code). Let’s initialize our dbt project.

3. Initializing Your dbt Project

  1. In the IDE, open the Version Control pane and click Initialize dbt Project.
  2. Accept the defaults; dbt creates a dbt_project.yml and folder structure (models/, macros/, etc.).
  3. Commit the auto‐generated files in a new Git branch: “initialized dbt project”.

Your Git pane now shows uncommitted files; click Commit & Push to save the project baseline.

4. Defining Your Source Data

dbt doesn’t load data from external systems—you must point it to existing tables. Snowflake’s sample database (SNOWFLAKE_SAMPLE_DATA) contains TPC-H tables you can use.

Create a new YAML file under models/ named sources.yml with:

version: 2

sources:
  - name: tpch
    database: SNOWFLAKE_SAMPLE_DATA
    schema: TPCH_SF1
    tables:
      - name: CUSTOMER
      - name: ORDERS

Save to see the lineage graph update with two new source nodes.

5. Building a Staging Model

Staging models standardize raw tables and prepare them for downstream transformations. In models/, delete the example/ folder and instead:

  1. Create a folder called models/staging/tpch.
  2. In that folder, create stg_tpch_customer.sql:
{{ config(materialized='view') }}

select
  C_CUSTKEY   as customer_key,
  C_NAME      as customer_name,
  C_ACCTBAL   as account_balance,
  C_COMMENT   as comment
from {{ source('tpch', 'CUSTOMER') }}

Notes:

  • The source() macro resolves to the fully qualified table.
  • materialized='view' tells dbt to build a view by default.

Save and click Run → dbt run to build just this model. In seconds you’ll see a view in your Snowflake UI under the dev schema.
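
If you prefer the command line, dbt's node selection builds only the model you name, for example:

dbt run --select stg_tpch_customer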

6. Creating a Production-Ready Dimension

Dimensions (Gold layer) contain curated, business-ready tables. Let’s filter for customers with positive balances.

  1. Create models/marts/customer_dim.sql:
{{ config(materialized='table') }}

select
  customer_key,
  customer_name,
  account_balance
from {{ ref('stg_tpch_customer') }}
where account_balance > 0

Here, ref() links to another model, ensuring correct build order and clear lineage.

Click Compile to preview generated SQL, then Run → dbt run to create the table in Snowflake.

7. Testing and Documentation

dbt encourages tests to enforce data quality:

  • Add to models/staging/tpch/schema.yml:
version: 2

models:
  - name: stg_tpch_customer
    tests:
      - not_null:
          column_name: customer_key
      - unique:
          column_name: customer_key

Run dbt test to validate your models. Any failures will be reported in the UI.

Generate documentation with dbt docs generate, then preview via dbt docs serve. Explore your project’s lineage graph and column descriptions.

8. Version Control & Deployment

dbt Cloud integrates Git for collaboration. After feature development:

  1. Commit your branch and open a pull request (GitHub, GitLab, Bitbucket).
  2. Merge into main.

In Deploy → Environments, create a production environment. Under Jobs, define a job that runs:

  • dbt seed (if you have local CSV seeds)
  • dbt run
  • dbt test

Schedule the job (e.g., hourly, daily) or trigger it on Git commits. Monitor run history and logs directly in dbt Cloud.

9. Best Practices & Next Steps

  • Modularize models: Break complex logic into smaller models.
  • Document extensively: Use YAML descriptions for sources, models, and columns.
  • Implement CI/CD: Integrate dbt Cloud jobs with your team’s CI pipeline.
  • Leverage analyses: Create analyses/ for ad-hoc queries and charts.
  • Scale with packages: Reuse community packages (e.g., dbt_utils).

For deeper dives, explore the official dbt documentation and the dbt Hub for community packages and best practices.

Watch the Video

Conclusion

In just a few steps, you’ve:

  • Provisioned dbt Cloud via Snowflake
  • Initialized a dbt project with Git
  • Defined raw sources and built staging models
  • Created a production dimension with ref()
  • Tested data quality and generated documentation
  • Set up a CI/CD job for automated deployment

Now you have a repeatable, maintainable analytics pipeline. Keep building new models, add tests, and document as you go—your future self (and team!) will thank you.

Pre-Joining Data Vault Business Keys During Load


Data Vault architects often encounter a common challenge when their source systems mix surrogate technical IDs with true business keys. In one real-world scenario, an Employee table used a technical ID as its primary key, while the legitimate business identifier was NBR. Downstream tables—even a self-referencing manager hierarchy—used that ID as a foreign key. When building Hubs on both ID and NBR and linking them with a Same-As Link, the result resembles a Source Vault design—and it quickly becomes unwieldy to repeat ID→NBR lookups for every referencing table.

In this article, we’ll explore the best practice for handling mismatched keys: pre-joining business keys in your staging (or view) layer before loading the Raw Data Vault. You’ll learn why this denormalization is fully compliant with Data Vault principles, how to implement it (even with CDC feeds), and when you might need alternate approaches.



Why Separate Business Keys from Technical IDs?

Surrogate keys (ID) offer stable integer references, simple indexing, and isolation from business rule changes. But they aren’t meaningful outside the operational schema. True business keys (NBR)—like employee numbers, order numbers, or product SKUs—carry real-world meaning and ensure consistency across downstream BI and analytics models.

  • Business Key (NBR): Immutable identifier used in reporting, cross-system integration, and audit.
  • Technical ID (ID): Auto-generated surrogate for OLTP performance and referential integrity.

When you build a Hub on the business key, all Links and Satellites must reference that same key. Mixing in surrogate IDs without conversion violates business lineage and forces repetitive lookups.

The Pitfall: Source Vault by Accident

Faced with source tables referencing ID, some teams create:

  1. A Hub on ID (surrogate),
  2. A Hub on NBR (business key),
  3. A Same-As Link between them to tie ID↔NBR.

This “Source Vault” pattern captures technical IDs as though they were business keys—contradicting the principle that your Vault’s integration key must be a shared business identifier. Moreover, every time you load any Link or Satellite that uses ID, you must look up NBR via the Same-As Link. Tedious, error-prone, and defeating the agility of your Data Vault.

Best Practice: Pre-Join Business Keys in Staging

Data Vault training explicitly allows you to reshape your staging area—denormalizing or normalizing source data to simplify Raw Vault loads. Pre-joining means: before your load process begins, enrich every source record with the true business key (NBR) rather than the surrogate ID. You then feed the Hub/Satellite/Link loaders with business keys directly—no Same-As Link gymnastics required.

How It Works

  1. Create a view or staging query that joins your Employee table (on ID) to itself or to the hierarchy table to retrieve NBR. Add the NBR field into every downstream staging record.
  2. Use that pre-joined staging view as the source for your Data Vault loaders. All Hubs, Links, and Satellites can now reference NBR consistently.
  3. Drop the accidental Hub on ID and Same-As Link—your Vault only contains the true business key Hub (Hub_Employee on NBR).

This approach turns the repeated lookup problem into a one-time denormalization, improving performance and maintainability.
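
A minimal sketch of such a staging view, assuming an Employee table keyed by ID with the business key NBR and a hierarchy table that references employees and managers by their technical IDs (all names are illustrative):

CREATE OR REPLACE VIEW stg_employee_hierarchy AS
SELECT
  emp.NBR        AS employee_nbr,   -- true business key of the employee
  mgr.NBR        AS manager_nbr,    -- business key resolved from the manager's technical ID
  hier.valid_from,
  hier.load_date,
  hier.record_source
FROM employee_hierarchy hier
JOIN employee emp ON emp.ID = hier.employee_id
JOIN employee mgr ON mgr.ID = hier.manager_id;

All downstream Hub, Link, and Satellite loaders can then read employee_nbr and manager_nbr directly from this view.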

Tool Support: dbt & FlowBI

Modern Data Vault toolkits recognize this pattern:

  • datavault4dbt package: Offers a pre_join feature to automatically enrich staging tables with business keys.
  • FlowBI: Includes configuration options to map surrogate IDs to business keys before Vault loads.

When your staging area resides on a cloud data lake (e.g., AWS S3 + Redshift Spectrum, Azure Data Lake + Synapse), these tools can reference external tables and materialize pre-joined views seamlessly.

Handling CDC & Empty Deltas with Forward Lookup

Change Data Capture (CDC) introduces a nuance: sometimes the Employee record doesn’t change (NBR remains the same), but the hierarchy table (manager assignment) does. A pure inner-join staging view would omit the hierarchy change because no new employee row appeared.

Forward Lookup Pattern

  1. When your CDC batch contains only hierarchy changes, load your staging view with the ID column but no accompanying NBR.
  2. Instead of joining to the source Employee table (which has no new row), perform a forward lookup against the target Satellite in your Raw Vault. That Satellite already stores every historic mapping of ID→NBR.
  3. Retrieve the latest NBR value for each ID and inject it into your staging records—just as if you had joined to the source.

This assumes your CDC infrastructure and initial loads correctly populated the Satellite. If CDC reliability is questionable, you may need to capture ID first and resolve to NBR later in a Business Vault layer.
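
A sketch of the forward lookup, again with illustrative names: the staging query first tries the (possibly empty) employee delta and falls back to the latest ID-to-NBR mapping already stored in the employee Satellite:

SELECT
  hier.employee_id,
  COALESCE(emp.NBR, sat.NBR) AS employee_nbr,  -- source delta first, then the Raw Vault Satellite
  hier.manager_id,
  hier.load_date
FROM stg_employee_hierarchy_delta hier
LEFT JOIN stg_employee_delta emp
       ON emp.ID = hier.employee_id
LEFT JOIN (
  SELECT ID, NBR,
         ROW_NUMBER() OVER (PARTITION BY ID ORDER BY load_date DESC) AS rn
  FROM raw_vault.sat_employee
) sat ON sat.ID = hier.employee_id AND sat.rn = 1;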

Caveats: Pre-Join Within a Single Source Only

While you can (and should) pre-join within one source system, avoid chaining pre-joins across multiple systems. If you first wait for System A’s staging load to produce NBR from its ID, and then join System B’s staging to A’s data, you introduce cross-system load dependencies. That forces you to serialize loads—waiting for one system’s batch to finish before you can process another.

To maintain parallel ingestion, each source should be pre-joined only to its own business keys. If two systems share a business key, let that intersection happen downstream in your Vault (via Hub on the shared business key), not in the staging layer.

When You Can’t Pre-Join: Source Vault as Last Resort

In rare environments where you cannot reliably pre-join—legacy databases with locked-down permissions or untrustworthy CDC—you may fall back to a Source Vault. In this design:

  • Your Hubs use surrogate ID as the primary key (capturing the technical ID).
  • You defer mapping to true business keys into the Business Vault layer, after all sources land.

Source Vaults make sense only when staging denormalization is impossible. Otherwise, they sacrifice business clarity for expedience.

Step-By-Step Implementation Guide

  1. Identify Business Keys: Catalog each table’s true business key(s)—not the surrogate PKs.
  2. Build Staging Views: For each source, create a view that LEFT JOINs the “owner” table back to itself (or to its lookup tables) to pull in NBR wherever ID appears.
  3. Validate Keys: Ensure every staging record includes a non-null business key. Flag or quarantine any orphans (IDs without known business key).
  4. Configure Load Scripts: Point your Vault loaders (Hubs, Links, Satellites) at these staging views. Remove any loaders that target surrogate key Hubs or Same-As Links.
  5. Implement Forward Lookup: For CDC batches that may omit source changes, add a fallback join to the Employee Satellite in your load script to fetch the last known NBR for each ID.

Benefits of Pre-Joining Business Keys

  • Simplicity: One denormalization step replaces dozens of repetitive lookups.
  • Performance: Staging views optimize key enrichment in set-based SQL rather than row-by-row Link loads.
  • Lineage: Your Raw Vault contains only true business keys, preserving clear end-to-end lineage.
  • Maintainability: Future source schemas that reference ID get mapped automatically via the shared staging view logic.

Conclusion

Mismatched surrogate and business keys need not derail your Data Vault design. By embracing pre-joining business keys in your staging layer—along with forward lookups for CDC edge cases—you preserve a clean, business-centric Vault model without cumbersome Same-As Links. Reserve Source Vaults only for environments where staging denormalization simply cannot occur. With these best practices, your Vault remains performant, transparent, and aligned with true business identifiers.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

10 Essential Skills Your Team Needs to Build an Analytical Data Platform


Build an Analytical Data Platform

Building a modern analytical data platform is more than just choosing the right database or ETL tool. It requires a blend of business insight, data expertise, architecture design, and automation savvy. In this article, we’ll explore ten essential skills your team needs to design, develop, and maintain a robust, scalable, and high-value data platform.



1. Business Understanding

Before diving into any technical work, your team must understand the business domain and the data itself. This doesn’t mean every engineer needs to be a data analyst, but they should know:

  • Which source systems hold the data (CRM, ERP, marketing platforms, etc.)
  • Key business objects (customers, contracts, opportunities) and how they relate
  • Business processes behind the data, like a customer’s lifecycle or sales funnel

By grounding the team in real-world outcomes—such as improving customer retention or reducing churn—engineers stay focused on delivering measurable ROI.

2. Objective Setting & ROI Focus

Clear objectives guide every stage of your platform’s development. Whether your goal is to accelerate financial reporting or enable real-time marketing analytics, defining the desired outcomes:

  • Helps prioritize features and data sources
  • Aligns stakeholders around common metrics
  • Boosts motivation by tying work to tangible business value

Teams that regularly track ROI milestones can adjust scope and resources proactively, ensuring the platform grows in step with organizational goals.

3. Data Understanding & Modeling

A deep dive into your source systems reveals hundreds—even thousands—of tables. Your engineers need to know:

  • Primary and foreign keys connecting entities
  • Relationship cardinalities (one-to-one, one-to-many, many-to-many)
  • Data quality quirks and domain-specific rules

This understanding informs the modeling approach—be it third normal form, star schemas, or Data Vault—ensuring downstream analytics are consistent and reliable.

4. Data Acquisition Techniques

Extracting data from source systems can take many forms:

  • Full daily extracts via CSV or JSON files
  • API calls for near-real-time data feeds
  • Change Data Capture (CDC) for incremental updates

Knowing when to use each approach minimizes data latency, reduces load times, and optimizes storage. CDC, in particular, slashes the volume of data transferred, but requires robust handling to maintain consistency.
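
As an illustration, CDC deltas are often applied to a Bronze table with a merge statement. The sketch below uses Delta Lake's MERGE syntax with hypothetical table names and a simple operation flag:

MERGE INTO bronze.customer AS t
USING staging.customer_cdc AS s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'U' THEN UPDATE SET *
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN NOT MATCHED AND s.op = 'I' THEN INSERT *;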

5. Structured Architecture: The Medallion Approach

Dumping raw data into a single database is a recipe for chaos. Instead, adopt a layered “medallion” architecture:

  • Bronze Layer (Staging/Landing): Raw data as ingested
  • Silver Layer (Cleansed, Integrated): Unified and harmonized data across systems
  • Gold Layer (Presentation): Curated tables/views for business users and BI tools

Figure: Medallion architecture in an analytical data platform

This separation of concerns simplifies debugging, improves performance, and clarifies responsibilities for each team member.

6. Data Integration & Modeling in the Silver Layer

The silver layer is where the “magic” happens:

  • Integrating disparate systems into a unified view
  • Applying your chosen modeling technique (e.g., star schema, Data Vault)
  • Ensuring referential integrity and consistent business definitions

Investing in a proven modeling framework not only scales with additional data sources but also enables automation and accelerates the onboarding of new subject areas.

7. Temporality & Historical Tracking

Beyond technical timestamps (extract load times), your data has business timelines:

  • Contract start/end dates
  • Customer sign-up and churn events
  • Promotion or campaign effective periods

Implementing snapshot tables, slowly changing dimensions, or time-aware modeling ensures accurate trend analysis, historical comparisons, and auditability.
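
One way to capture such business timelines is a dbt snapshot, which tracks changes as a type 2 slowly changing dimension. The source and column names below are assumptions:

{% snapshot contract_history %}

{{ config(
    target_schema='snapshots',
    unique_key='contract_id',
    strategy='timestamp',
    updated_at='updated_at'
) }}

select * from {{ source('crm', 'contract') }}

{% endsnapshot %}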

8. Code Generation & Automation Tools

Hand-coding every pipeline is time-consuming and error-prone. Leverage tools that:

  • Automatically generate ETL/ELT code based on templates
  • Orchestrate complex workflows and dependencies
  • Enforce consistency through standard patterns and conventions

Automation not only speeds up development but also improves data quality by reducing manual interventions.

9. Agile Development & Traceability

Adopting an agile mindset means delivering small, working increments quickly. Apply traceability by:

  • Defining clear targets (e.g., monthly revenue report)
  • Mapping those targets back to specific source tables
  • Focusing on data that directly supports your objectives

This approach prevents “scope creep” and ensures that every pipeline built serves an immediate analytical need.

10. DevOps & Cost Management

Once pipelines are automated, you need:

  • Orchestration frameworks (e.g., Airflow, Dagster) to schedule and monitor jobs
  • CI/CD for data code, including version control and automated testing
  • Cost monitoring tools to track cloud resource usage and optimize performance

Effective DevOps practices guarantee reliability, while cost-awareness keeps your platform sustainable in the cloud era.


Watch the Video

Conclusion

Building an analytical data platform is a multifaceted endeavor. By equipping your team with these ten skills—spanning business understanding, data modeling, architecture design, automation, and DevOps—you’ll lay the foundation for a platform that delivers consistent insights, scales gracefully, and drives real business value.
