
Data Vault Hashing on Databricks with XXHASH64

Hashing on Databricks

Hashing is a core element of Data Vault modeling. Hash keys are used to uniquely identify Hubs, Links, and Satellites, and they need to be consistent, reproducible, and efficient. A common debate is whether to use widely adopted hashing algorithms like MD5 or SHA-1, or to opt for faster and smaller hash functions such as xxhash64, which Databricks natively supports.

The question is simple: What if we stored hash keys as 64-bit integers (int64) using xxhash64 instead of 128-bit MD5 values? On the surface, this looks attractive — faster generation, better join performance, and reduced storage. But as we’ll explore in this article, the trade-offs around collisions, scalability, and platform independence make this a risky choice in Data Vault 2.0 architectures.



Why Smaller Hash Keys Look Tempting

There are clear benefits to using smaller hash values such as int64:

  • Speed of generation: Algorithms like xxhash64 or CRC-64 are significantly faster than MD5 or SHA-1. OpenSSL and Linux benchmarks consistently show xxhash64 outperforming older cryptographic functions.
  • Join performance: Joining 64-bit integers is naturally faster than joining 128-bit binary or 32-character string columns.
  • Storage efficiency: An int64 hash key requires only 8 bytes, compared to 16 bytes for binary MD5 or 32 bytes for an MD5 stored as a string.
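To make the comparison concrete, here is how all three variants can be computed in Databricks SQL (a minimal sketch; the source table and business-key column are illustrative):

    -- xxhash64 vs. MD5 hash keys on Databricks; source.customers is illustrative.
    SELECT
      xxhash64(upper(trim(customer_id)))   AS hk_customer_int64,   -- BIGINT, 8 bytes
      md5(upper(trim(customer_id)))        AS hk_customer_hex,     -- 32-character hex STRING
      unhex(md5(upper(trim(customer_id)))) AS hk_customer_binary   -- 16-byte BINARY
    FROM source.customers;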

From a performance perspective, the appeal is undeniable. But in Data Vault, performance is not the only concern. The fundamental question is: How safe are these smaller hashes when used as surrogate keys at scale?

The Risk of Collisions

A hash collision occurs when two different inputs produce the same hash value. In Data Vault, this means two different business keys could be treated as the same Hub, Link, or Satellite record — corrupting your data integrity.

With 64-bit hashes, the number of possible unique values is 2^64. While that sounds huge, probability tells a different story when you start loading millions or billions of rows. The so-called birthday paradox makes collisions far more likely than intuition suggests.

For example:

  • With just 10,000 records, a 32-bit hash already has a 1 in 100 chance of collision.
  • A 64-bit hash greatly reduces the risk, but at large scales (hundreds of millions of rows), the probability becomes uncomfortably high.
  • A 128-bit hash (MD5) pushes collision risk into the realm of trillions of rows before it becomes statistically relevant.
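To see where these numbers come from, apply the standard birthday-bound approximation for the probability of at least one collision among n random keys in a keyspace of size N:

    p \approx 1 - e^{-n(n-1)/(2N)} \approx \frac{n^2}{2N} \qquad \text{for small } p

For a 32-bit hash (N = 2^32) and n = 10,000 records, p ≈ 10^8 / 2^33 ≈ 0.012, roughly the 1-in-100 chance quoted above. For a 64-bit hash (N = 2^64) and n = 10^9 rows, p ≈ 10^18 / 2^65 ≈ 0.027, already a few percent and far too much for a surrogate key.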

That’s why MD5 and SHA-1 — despite being slower — are still standard in Data Vault: they provide mathematically safe keyspace sizes for large enterprise datasets. (Their known cryptographic weaknesses concern deliberately engineered collisions, not the random collisions that matter for surrogate keys.)

Performance vs. Integrity

The argument for int64 hashing often emphasizes query performance:

  • Joins on numeric columns are faster than joins on strings or binary.
  • xxhash64 is faster to compute than MD5.

These points are true. But in practice, the cost of a collision far outweighs the performance gains. A single collision can undermine the entire lineage of your data warehouse. Once data integrity is compromised, every downstream analytic and report is suspect.

As a result, most Data Vault practitioners will gladly accept the slightly higher CPU cost of MD5 or SHA-1 in exchange for peace of mind.

Platform Portability Matters

Another issue with xxhash64 is platform availability. While Databricks supports it, you also need to consider:

  • Can you compute xxhash64 in Snowflake, BigQuery, SQL Server, or Oracle?
  • Can you reproduce xxhash64 consistently in Python, Java, or ETL tools?
  • Will the algorithm be supported 5–10 years from now?

One of the strengths of MD5 and SHA-1 is their ubiquity. They are implemented in nearly every database, programming language, and ETL platform. This cross-platform reproducibility is essential when building a Data Vault that may span multiple systems. By contrast, xxhash64 locks you into Databricks (or requires custom implementations elsewhere).

Why 128-Bit is the Safe Minimum

Let’s compare hash sizes:

  • CRC-32: Very fast, but collisions become likely within tens of thousands of records. Unusable for Data Vault.
  • xxhash64 / CRC-64: Better, but collisions become likely as datasets grow into the hundreds of millions.
  • MD5 (128-bit): Standard choice. Safe up to trillions of rows.
  • SHA-1 (160-bit): Provides even more headroom for extremely large datasets.

For most enterprises, MD5 hits the sweet spot: fast enough, widely supported, and statistically collision-free at realistic data volumes.

Alternative Approaches

If performance is a serious concern, consider these approaches instead of shrinking hash size:

  • Binary storage: Store MD5 as a 16-byte binary instead of a 32-character string. This cuts hash-key storage in half and improves join performance (see the sketch after this list).
  • Partitioning strategies: Optimize joins by partitioning your Data Vault tables, reducing the need for full-table joins.
  • Hardware acceleration: Modern CPUs provide hardware support for SHA-1 and SHA-256 (e.g., the Intel SHA extensions), and MD5 remains fast in software, often faster than you might expect.
  • Consider SHA-256 only if required: While SHA-2 offers stronger guarantees, it’s rarely necessary in Data Vault and adds performance overhead.
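To illustrate the first point, a minimal Databricks SQL sketch of binary hash-key storage (schema, table, and column names are illustrative):

    -- Hub with a 16-byte BINARY MD5 key instead of a 32-character hex STRING.
    CREATE TABLE raw_vault.hub_customer (
      hk_customer   BINARY    NOT NULL,  -- unhex(md5(...)): half the bytes of hex text
      customer_id   STRING    NOT NULL,  -- business key
      load_datetime TIMESTAMP NOT NULL,
      record_source STRING    NOT NULL
    );

    INSERT INTO raw_vault.hub_customer
    SELECT DISTINCT
      unhex(md5(upper(trim(customer_id)))),  -- convert the hex digest to 16-byte binary
      customer_id,
      current_timestamp(),
      'crm_system'
    FROM staging.customers;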

Summary: Should You Use xxhash64 in Data Vault?

While xxhash64 looks attractive in Databricks because of its speed and smaller footprint, it’s not a good fit for Data Vault 2.0:

  • Collision risk is too high for large-scale data warehouses.
  • Portability is limited — you risk vendor lock-in.
  • Long-term maintainability suffers if your algorithm isn’t standardized across platforms.

In Data Vault, data integrity always comes first. That’s why MD5 (128-bit) or SHA-1 (160-bit) remain the recommended standards for hash keys. They provide the balance of performance, portability, and collision safety needed for enterprise-scale solutions.

Conclusion

If you’re working in Databricks, it might be tempting to adopt xxhash64 for hash keys. But resist that temptation. The risks of collisions and platform lock-in far outweigh the benefits. Stick with MD5 or SHA-1 for your Data Vault hash keys, store them efficiently as binary values, and optimize performance through storage and join strategies.

Remember: a faster broken key is still a broken key. In Data Vault, correctness and consistency are always the highest priority.


Joining SCD2 Tables Using Data Vault

Joining SCD2 Tables

When working with Data Vault 2.0, one of the most common challenges is how to handle Slowly Changing Dimension type 2 (SCD2) tables when loading Link Satellites. Imagine a scenario where you need to join three SCD2 tables, each with valid_from and valid_to dates, and bring them together into a single Satellite hanging from a Link. At first, this might sound straightforward, but the details matter — and depending on your approach, you could make your warehouse harder to maintain, less performant, or less flexible.

In this article, we’ll walk through the best practices for handling this situation, based on Data Vault principles. We’ll discuss why you shouldn’t rush into joining multiple SCD2 tables into one Satellite, how to handle business timelines, the role of PIT and T-PIT tables, and strategies for keeping your design scalable and future-proof.



Understanding SCD2 in the Context of Data Vault

Slowly Changing Dimensions type 2 are a way to track historical changes in dimensional data. Each record typically has valid_from and valid_to dates that describe its period of effect. In a Data Vault model, however, the focus isn’t on interpreting those business dates upfront. Instead, the Raw Data Vault stage is all about capturing what the source gives you, as-is.

That means when we first bring data into the Raw Vault, the valid_from and valid_to fields should be stored as descriptive attributes — not as part of the Satellite’s primary key. Trying to interpret and align them too early will only create unnecessary complexity.

Why Not Load One Satellite From Three SCD2 Tables?

At first glance, it may seem attractive to combine all three SCD2 tables directly into one Satellite. But this goes against Data Vault best practices:

  • Each Satellite should usually source from a single table or source system.
  • Denormalizing multiple sources into one Satellite complicates your Raw Vault and makes it harder to maintain.
  • You risk having to undo the denormalization later when new requirements come in.

The recommended approach is to create at least three Satellites — one per SCD2 table. You may even need more Satellites if you have to split them based on privacy, rate of change, or security rules. This simplifies your Raw Vault and sets you up for flexibility later.

Handling Validity Dates in Satellites

Once your Satellites are created, the question is how to handle valid_from and valid_to. Here are two key approaches:

1. Multi-Active Satellites

If your SCD2 records represent multiple simultaneously valid states (for example, different price lists where both current and future prices are valid), you can use a multi-active Satellite. In this design:

  • The primary key is composed of hash_key + load_date + subsequence.
  • valid_from and valid_to are stored as descriptive attributes, not as key parts.
  • A staging-generated subsequence ensures uniqueness within a load.

2. Standard (CDC) Satellites with Subsequence

If only one record is active at a time (the classic SCD2 case), then you don’t need a multi-active design. Instead:

  • Stick with the standard Satellite primary key: hash_key + load_date.
  • Handle multiple intraday changes by adding micro- or nanosecond subsequences to load_date.
  • This ensures only one active row per parent key at any given time.

Choosing between multi-active and CDC-style Satellites depends entirely on your data. Do you need multiple simultaneously valid records? Or does one replace the other in sequence? Your answer determines the right design.
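To make the two designs concrete, here is a minimal sketch of both key layouts in Databricks SQL (all table and column names are illustrative):

    -- Multi-active Satellite: the subsequence is part of the logical primary key.
    CREATE TABLE raw_vault.sat_price_list (
      hk_product    BINARY    NOT NULL,
      load_datetime TIMESTAMP NOT NULL,
      subsequence   INT       NOT NULL,   -- generated in staging, e.g. via ROW_NUMBER()
      valid_from    DATE,                 -- descriptive attribute, not a key part
      valid_to      DATE,
      price         DECIMAL(12, 2),
      record_source STRING
      -- logical PK: (hk_product, load_datetime, subsequence)
    );

    -- Standard (CDC-style) Satellite: hash key + load date only; intraday changes
    -- are absorbed by micro-/nanosecond precision on load_datetime.
    CREATE TABLE raw_vault.sat_product_price (
      hk_product    BINARY    NOT NULL,
      load_datetime TIMESTAMP NOT NULL,
      valid_from    DATE,
      valid_to      DATE,
      price         DECIMAL(12, 2),
      record_source STRING
      -- logical PK: (hk_product, load_datetime)
    );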

Aligning Data with PIT Tables

Once the three Satellites are loaded, you’ll need to bring them together for reporting. This is where PIT (Point-in-Time) tables come in. PIT tables align deltas across Satellites to a common snapshot date, making it possible to present a unified view of related changes.

There are two main flavors:

  • Standard PIT – aligns data based on load_date (technical historization).
  • Temporal PIT (T-PIT) – aligns data based on business timelines like valid_from and valid_to.

A T-PIT lets you activate rows based on both the technical and business timelines. However, it comes with maintenance challenges: whenever valid_from or valid_to dates change, you may need to rebuild PIT partitions — which can be expensive if large date ranges are altered.
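To make the standard flavor concrete, here is a minimal sketch of a PIT load for three Satellites on one Hub (Databricks SQL; all names are illustrative, and the snapshot dates are assumed to come from a control table): for each snapshot, the PIT records the latest Satellite load date visible at that point in time.

    INSERT INTO business_vault.pit_product
    SELECT
      h.hk_product,
      c.snapshot_datetime,
      MAX(s1.load_datetime) AS sat_source1_ldts,
      MAX(s2.load_datetime) AS sat_source2_ldts,
      MAX(s3.load_datetime) AS sat_source3_ldts
    FROM raw_vault.hub_product h
    CROSS JOIN control.snapshot_dates c
    LEFT JOIN raw_vault.sat_source1 s1
      ON s1.hk_product = h.hk_product AND s1.load_datetime <= c.snapshot_datetime
    LEFT JOIN raw_vault.sat_source2 s2
      ON s2.hk_product = h.hk_product AND s2.load_datetime <= c.snapshot_datetime
    LEFT JOIN raw_vault.sat_source3 s3
      ON s3.hk_product = h.hk_product AND s3.load_datetime <= c.snapshot_datetime
    GROUP BY h.hk_product, c.snapshot_datetime;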

Where Should You Apply Business Timelines?

There are three strategies for applying business timelines like valid_from and valid_to:

  1. Downstream in reporting or dashboards – simplest to maintain, and business users can define how timelines should be applied.
  2. In dimension views – apply filters and conditions directly in the SQL layer that feeds reports.
  3. Upstream in PIT/T-PIT – most performant but requires heavier maintenance whenever business timelines change.

A practical approach is to start downstream and only move timeline application upstream if performance issues demand it.

Building a Business Satellite

Once your PIT table aligns the Satellites, you may still need a Business Satellite (or Computed Satellite). This Satellite:

  • Uses the PIT table as its foundation.
  • Combines attributes from the three original Satellites using COALESCE or other business rules.
  • Optionally applies T-PIT logic if strict business timeline alignment is required.

By separating Raw Satellites (which store raw source data) from Business Satellites (which apply interpretation and business logic), you preserve Data Vault’s flexibility while still meeting analytical requirements.
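A minimal sketch of such a Business Satellite, built as a view on top of the PIT from the previous example (names illustrative; COALESCE takes the first non-null value across the three source Satellites):

    CREATE OR REPLACE VIEW business_vault.bsat_product AS
    SELECT
      p.hk_product,
      p.snapshot_datetime,
      COALESCE(s1.price, s2.price, s3.price)                AS price,
      COALESCE(s1.valid_from, s2.valid_from, s3.valid_from) AS valid_from,
      COALESCE(s1.valid_to, s2.valid_to, s3.valid_to)       AS valid_to
    FROM business_vault.pit_product p
    LEFT JOIN raw_vault.sat_source1 s1
      ON s1.hk_product = p.hk_product AND s1.load_datetime = p.sat_source1_ldts
    LEFT JOIN raw_vault.sat_source2 s2
      ON s2.hk_product = p.hk_product AND s2.load_datetime = p.sat_source2_ldts
    LEFT JOIN raw_vault.sat_source3 s3
      ON s3.hk_product = p.hk_product AND s3.load_datetime = p.sat_source3_ldts;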

Best Practices Recap

  • Create one Satellite per SCD2 source table — don’t denormalize too early.
  • Treat valid_from and valid_to as descriptive attributes in the Raw Vault.
  • Use multi-active Satellites only when multiple records are simultaneously valid; otherwise, stick with standard CDC Satellites.
  • Align Satellites with PIT tables; consider T-PITs only if business timelines must be applied upstream.
  • Whenever possible, push business timeline interpretation downstream to dimensions or reports for easier maintenance.
  • Use Business Satellites when you need to merge multiple sources into a single unified view.

Conclusion

Joining multiple SCD2 tables into a single Satellite is rarely the right first step in a Data Vault 2.0 implementation. Instead, build your Raw Vault Satellites closely aligned with their sources, use PIT tables to align changes, and apply business timelines carefully — starting as far downstream as possible. This approach keeps your architecture maintainable, flexible, and scalable while still supporting complex historical analysis.

By following these principles, you’ll not only simplify your data model but also give your business users the power to interpret validity ranges in ways that make sense for them — without locking your warehouse into rigid rules that are hard to maintain.


Behind the Branches – Navigating Git Workflows in Modern DevOps


Branching Strategies

Branching strategies are one of those topics that rarely get much attention until they suddenly become a problem. Whether it’s drowning in merge conflicts, the headache of implementing and synchronizing hotfixes across multiple branches, or a feature freeze caused by insufficient quality assurance, your repository and branching structure can have a major impact on day-to-day development.

But what branching strategies actually exist, and what are their pros and cons? Which approach allows you to deploy changes most quickly? And how can you maintain high software quality despite frequent releases?

In this article, we’ll provide a structured overview of common branching strategies and typical challenges developers face when using them.

Navigating Git Workflows in Modern DevOps

This webinar offers a clear overview of common approaches and how they impact CI/CD, code quality, and maintainability. Beyond theory, we’ll dive into practical challenges and real-world issues teams face every day. Register now for our free webinar on September 16th, 2025!


Why Are Branching Strategies Relevant?

Branching models reflect the organization, release culture, and technical maturity of a project. There is no single “correct” strategy that fits every project. Choosing the right one depends heavily on the project’s context. Some of the most important questions to consider when selecting a branching strategy include:

  • Does the team work in fixed sprint or release cycles, or is code deployed continuously?
  • How many developers are working simultaneously on the same codebase?
  • What is the quality of your CI/CD pipeline? Does every change need a manual review, even if the pipeline passes, or can it be deployed automatically?

Depending on the answers to these questions, a simple or more complex branching strategy may be appropriate.

Comparison of Common Strategies

Git Flow

The Git Flow strategy was originally developed for traditional software projects with planned release cycles. Its long-lived main branches are “main” (or “master”) and “develop”.

In addition, it introduces several short-lived branches:

Feature branches

New features are developed in separate feature branches, which are merged into the develop branch once completed.

Hotfix branches

If a critical bug occurs in the production environment (i.e., on the main branch), a hotfix branch is created from main to address the issue. Once the fix is implemented and pushed to the hotfix branch, it is merged into both main and develop to ensure the bug is resolved in both branches.

Release branches

When a release is approaching, a release branch is created from develop, containing all features added since the last release. This branch is then used for final QA testing, bug fixing, and versioning. Once the release is approved, the release branch is merged into both main and develop.


The main advantage of Git Flow is its clear structure. Even in larger teams with many developers and therefore multiple concurrent feature branches, it’s easy to track which version is in what state. The strategy supports parallel development very well due to its structured branching model.

However, the downside is the organizational and technical overhead. The large number of branches and merges can lead to conflicts and divergence over time, especially with long-lived release and hotfix branches. A particular challenge is keeping branches in sync: hotfixes created from main need to be merged back into both main and develop, and changes made in release branches, which originate from develop, must eventually be merged into both main and develop, as shown in the diagram. These synchronization steps add effort and increase the risk of conflicts or inconsistencies, especially when multiple streams of work are active in parallel.

Additionally, the path a feature must take, from a feature branch to develop, to a release branch, and finally to main, can slow down the deployment process.

While a solid CI/CD pipeline can help automate and streamline parts of this workflow, Git Flow does not rely on automation to function. This makes it especially suitable for teams with more manual QA processes or limited automation infrastructure.

Figure: Git Flow branching model

GitHub Flow

Compared to Git Flow, the GitHub Flow strategy is significantly leaner. It uses only a single long-lived branch, usually main, and temporary feature branches that are merged via pull requests.

Once all changes on a feature branch are complete and have passed review and various tests, the branch is merged directly into main.

The key advantage of GitHub Flow is its simplicity. There are no separate release or develop branches, and even hotfixes can be handled in short-lived branches. Teams can respond to changes quickly and deploy frequently. This agility is especially effective when supported by a robust CI/CD pipeline. If properly implemented, testing, building, and deployment processes are automated, further improving GitHub Flow’s fast time to market.

Because of its low complexity and minimal coordination overhead, GitHub Flow is also particularly well-suited for smaller teams that value speed and iteration over rigid release planning.

If you’re interested in how such pipelines are structured in practice, our CI/CD pipeline blog article offers a look at a practical GitHub-based setup using GitHub Actions and dbt. It’s a useful companion piece for understanding the automation layer that supports fast and reliable delivery.

However, this strategy also comes with limitations: it doesn’t support managing multiple parallel versions or complex release planning.

Additionally, it relies heavily on the quality of the CI/CD pipeline.

Figure: Trunk-based branching model

Trunk-Based Development

Trunk-Based Development is quite similar to the GitHub Flow strategy, but there are a few key differences.

While it also relies on a single long-lived branch (the trunk, typically main), commits are either made directly to main or via very short-lived feature branches. These feature branches often exist for only a few hours, and it’s common for changes to be merged into main multiple times a day. The goal is to integrate changes as early as possible to avoid conflicts before they even arise.

Because there are no fixed release cycles in Trunk-Based Development, it’s essential to ensure that incomplete features don’t go live prematurely. Feature flags play a central role here, allowing unfinished functionality to be hidden in the production environment until it’s ready.

As with GitHub Flow, a strong CI/CD pipeline is essential. It acts as the main safeguard for quality assurance and enables rapid deployment to the main branch.

Trunk-Based Development is especially effective for teams that are comfortable with rapid iteration and a high level of automation. While it can be used by smaller teams, it truly shines in larger organizations where multiple teams work in parallel and frequent integration is critical to maintaining momentum and consistency.

The benefits of Trunk-Based Development include extremely fast deployments and minimal risk of merge conflicts due to the short-lived nature of branches and continuous integration.

However, similar to GitHub Flow, this strategy heavily depends on the reliability of the CI/CD pipeline. If your team operates in a highly automated DevOps environment, this approach works smoothly. But if that’s not the case, software quality can suffer significantly. The risk is especially high here, as all changes are deployed directly to the main branch.

Conclusion

All three strategies come with their own strengths and weaknesses.

Git Flow is well-suited for larger projects with fixed release cycles, manual QA, and structured approval processes. It offers stability and clear workflows, but also brings significant technical and organizational overhead, making it a heavyweight option that can slow down development and release cycles due to its complexity and synchronization requirements.

GitHub Flow, by contrast, emphasizes speed and simplicity. It’s an excellent fit for smaller teams working on web or SaaS projects that deploy continuously, thanks to its low complexity and quick turnaround. But it relies on a good CI/CD pipeline. If tests are insufficient, faulty code might get deployed automatically.

Many of these risks can be mitigated with proper pipeline design and DevOps experience within the team, ensuring that automation is not just fast but also reliable.

Trunk-Based Development enables the highest release frequency, but only delivers consistent quality if the necessary technical maturity is in place. This makes it ideal for highly automated environments where teams ship many changes every day.

There are always ways to mitigate or minimize the downsides of any branching strategy. Techniques like blue/green or canary deployments, for example, can help reduce the impact of faulty changes and make rollbacks easier.

Stay tuned: we regularly share practical insights and solutions on topics like CI/CD, DevOps patterns, and deployment strategies.

Defining Snapshot Dates in Data Vault


When working with Data Vault, one of the most common questions practitioners face is: how do we define snapshot dates? While load dates are tied to when data arrives from the source, snapshot dates serve a different and equally important purpose — they allow us to deliver stable, consistent, and predictable datasets to end users. In this article, we’ll break down the concept of snapshot dates, explain how they differ from load dates, and walk through practical examples of how they are implemented in a Data Vault architecture. By the end, you’ll understand how to design and manage snapshot dates effectively, and how they fit into the broader picture of multi-temporal data management.


Why Snapshot Dates Matter

Imagine you are a business user who arrives at the office at 9 AM, expecting your reports to be ready. You want those reports to be stable — not changing throughout the day unless a new cycle of data refresh is scheduled. Snapshot dates exist to decouple data delivery from data ingestion.

  • Load Date: Indicates when data from the source system arrived in your warehouse. It is system-driven and depends on the source’s delivery schedule.
  • Snapshot Date: Defines when the data is frozen for reporting and analysis. It is business-driven and follows a regular schedule (daily, hourly, every 5 minutes, etc.).

This separation allows data teams to provide users with predictable datasets, regardless of how irregularly or frequently source systems deliver new data.

The Three Timelines in Data Vault

To fully grasp snapshot dates, it’s useful to understand that Data Vault designs operate across three different timelines:

  1. Data Warehouse Timeline: Driven by load datetime stamps, representing when data batches (or real-time messages) arrive from the source system.
  2. Snapshot Timeline: Driven by the business delivery cycle. This is when data is made available to users in stable form.
  3. Business Timeline: Driven by business events (contract start dates, end dates, valid-from/valid-to fields, modified timestamps, etc.) and stored in Satellites or other model components.

For this article, we’ll focus on the second timeline: the snapshot timeline.

Defining Snapshot Dates with Examples

Daily Snapshot Example

Suppose a company wants its reports refreshed once per day, every morning at 7 AM UTC. This means that:

  • Regardless of when the source delivers data, the snapshot timestamp is always set to 7 AM.
  • Users querying the data warehouse at 9 AM will see a stable version of the data that won’t change until the next snapshot is generated.
  • In the control table, a row is inserted daily with the snapshot timestamp (e.g., 2025-09-16 07:00:00 UTC).

Hourly Snapshot Example

For real-time dashboards, an hourly or even 5-minute snapshot might be necessary. Let’s say hourly snapshots are generated:

  • A new snapshot timestamp is inserted every hour (e.g., 2025-09-16 01:00:00, 2025-09-16 02:00:00, etc.).
  • Older hourly snapshots may be discarded after a week to save storage, while daily snapshots are retained for a year.
  • End users can query either the most recent hourly snapshot or the daily snapshot depending on their needs.

Mixed Use Case

Some users might want hourly updates, while others only need a daily snapshot. In such cases:

  • The control table holds all snapshots (hourly + daily).
  • Boolean flags are used to mark whether a snapshot is “hourly,” “daily,” or “real-time.”
  • Users can filter based on these flags when running queries.
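A minimal sketch of a snapshot control table supporting this mixed use case (Databricks SQL; names illustrative):

    CREATE TABLE control.snapshot_dates (
      snapshot_datetime TIMESTAMP NOT NULL,
      is_hourly         BOOLEAN   NOT NULL,
      is_daily          BOOLEAN   NOT NULL,
      is_latest         BOOLEAN   NOT NULL   -- maintained by the load process
    );

    -- Inserted by the scheduler each cycle, e.g. the daily 07:00 UTC snapshot:
    INSERT INTO control.snapshot_dates
    VALUES (TIMESTAMP '2025-09-16 07:00:00', true, true, true);

    -- A daily-report user filters on the flag:
    SELECT snapshot_datetime FROM control.snapshot_dates WHERE is_daily;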

How Control Tables Help Manage Snapshot Dates

In practice, snapshot dates are managed using control tables. There are typically two types of control tables in Data Vault projects:

  1. Load Control Table: Tracks load datetime stamps for each source, indicating whether the batch has been processed into staging, raw Data Vault, business vault, or marts.
  2. Snapshot Control Table: Stores snapshot datetime stamps generated by the warehouse. These define the stable reporting layers that users can query.

When building PIT (Point-in-Time) tables and bridge tables, the snapshot control table plays a critical role. It ensures that snapshots align with user expectations, and it provides metadata for filtering (e.g., “latest snapshot,” “daily snapshot,” etc.).

Implementing Snapshot Dates in PIT Tables

PIT tables act like indexes into your Data Vault, enabling efficient query performance. The loading process of PIT tables typically involves:

  1. Checking the snapshot control table to see which snapshots should exist.
  2. Loading the required snapshots into the PIT table if they’re missing.
  3. Ensuring that the PIT table and snapshot control table remain in sync after each refresh cycle.

When a user queries a PIT table, they can join it to the snapshot control table based on the snapshot datetime stamp. From there, they can filter by flags (e.g., latest, daily, hourly) to get the version of data they need.
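Two sketches of what this looks like in SQL (names illustrative): first detecting snapshots that should exist but are missing from the PIT, then querying the PIT through the control table’s flags.

    -- Steps 1-2: snapshots present in the control table but not yet in the PIT.
    SELECT c.snapshot_datetime
    FROM control.snapshot_dates c
    WHERE NOT EXISTS (
      SELECT 1
      FROM business_vault.pit_product p
      WHERE p.snapshot_datetime = c.snapshot_datetime
    );

    -- User query: read the PIT at the latest daily snapshot via the control table.
    SELECT p.*
    FROM business_vault.pit_product p
    JOIN control.snapshot_dates c
      ON c.snapshot_datetime = p.snapshot_datetime
    WHERE c.is_daily AND c.is_latest;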

Best Practices for Defining Snapshot Dates

  • Work with business users: Define snapshot frequencies based on real business needs (daily, hourly, real-time).
  • Be consistent: Establish a standard snapshot time (e.g., 7 AM UTC daily) to simplify reporting.
  • Use control tables: Automate the insertion of snapshot timestamps and track them for PIT/bridge table loading.
  • Retain wisely: Keep high-frequency snapshots (hourly/5-minute) for a short time, but retain daily snapshots for longer historical analysis.
  • Decouple load and snapshot timelines: Remember that load datetime depends on the source, but snapshot datetime depends on user requirements.

Snapshot Dates vs. Business Dates

It’s worth highlighting again that snapshot dates are not the same as business dates like “contract start” or “valid-to.” Business dates come directly from source systems and are stored as part of the business timeline in Satellites or Links. Snapshot dates, on the other hand, are warehouse-generated and serve as reference points for reporting and querying.

Conclusion

Snapshot dates are a cornerstone of Data Vault’s multi-temporal design. By providing a regular, predictable timestamp for reporting and queries, they ensure stability and trust in data delivery — even as source systems deliver data at unpredictable times. With the help of control tables, PIT/bridge tables, and well-defined retention policies, snapshot dates give both IT teams and business users the clarity and consistency they need. Whether your business requires daily snapshots for stable reports or high-frequency snapshots for real-time dashboards, the principles remain the same: define, control, and communicate your snapshot strategy clearly.


Databricks and dbt: A Practical Approach to Data Vault Implementation


Databricks and dbt

Selecting the appropriate technology stack is a critical factor in the successful delivery of a Data Vault 2.0 architecture. Two technologies that work effectively together in large-scale data solutions are Databricks and dbt. When combined, they provide a practical way to implement Data Vault models while addressing performance, governance, and auditability requirements.

It can be argued that dbt’s role in a Databricks-based architecture is not always essential, since many of its core capabilities (such as transformation scheduling, lineage tracking, and documentation) can also be implemented using native Databricks features. Understanding the specific role each tool plays helps clarify where they complement each other and where functionality overlaps.



Databricks as the Processing and Storage Platform

Databricks’ Lakehouse architecture combines the scalability of a data lake with the reliability of a warehouse. Its Delta Lake technology offers ACID transactions, schema enforcement, and time travel, enabling precise historical querying. These capabilities align well with Data Vault’s historization requirements.

With Unity Catalog, Databricks centralizes metadata management and enforces fine-grained access control, ensuring sensitive attributes are protected without introducing unnecessary satellite splits. This alignment between governance and performance is particularly relevant in Data Vault environments.

dbt as the Transformation and Orchestration Layer

dbt manages and automates SQL-based transformations in a modular and version-controlled manner. In a Data Vault context, dbt enables:

  • The creation of Hubs, Links, and Satellites through templated, reusable models (see the sketch below). Different packages can be leveraged here, such as our datavault4dbt package, which is continuously updated to stay fully compliant with the most recent Data Vault standards.
  • Integrated testing to validate business keys, relationships, and data quality.
  • Automated documentation that directly reflects the structure and dependencies of the Data Vault.
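As a minimal sketch, a templated Hub model in dbt can be as short as this. The macro and parameter names follow datavault4dbt’s documented pattern, but treat the exact signature as an assumption and verify it against the package documentation:

    -- models/raw_vault/hub_customer.sql (model and column names are illustrative)
    {{ config(materialized='incremental') }}

    {{ datavault4dbt.hub(
        hashkey='hk_customer_h',          -- assumed parameter names; check the docs
        business_keys='customer_id',
        source_models='stage_customer'
    ) }}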

This structured approach makes transformations transparent and repeatable, supporting the auditability requirements inherent to Data Vault.

Integration in a Data Vault Workflow

When Databricks and dbt are deployed together:

  • Data ingestion occurs in Databricks, storing raw datasets as Delta tables, usually in the Bronze layer.
  • dbt transformations generate Raw Vault entities and Business Vault objects in the Silver layer.
  • Governance and security controls are enforced via Unity Catalog without altering the Data Vault model structure.

Figure: Bronze, Silver, and Gold layers in the Data Vault structure

This approach preserves Data Vault’s methodological structure while using Databricks’ distributed compute and storage capabilities.

Business Value when combining dbt and Databricks

The combined use of Databricks and dbt offers:

  • Scalable processing of large, complex datasets: Databricks handles enterprise-scale data efficiently, while dbt structures transformations into modular, reusable components.
  • Consistent governance across all layers of the Data Vault: dbt’s lineage and documentation, plus Unity Catalog’s access control, ensure compliance and transparency end to end.
  • Lower operational risk through tested, version-controlled transformations: Git-based versioning and automated tests in dbt reduce errors before execution on Databricks.
  • Improved query performance for information marts and analytics: Delta Lake optimizations and dbt’s pre-aggregated tables with business logic minimize expensive joins.

For organizations building Data Vault on Databricks, dbt strengthens structure and quality while Databricks ensures scalability and performance.

– Ricardo Rodríguez (Scalefree)

Agile Development in Data Warehousing with Data Vault 2.0

Agile Development in Data Warehousing: Initial Situation

Agile methodologies bring flexibility and adaptability to data warehousing, making them a natural fit for modern approaches like Data Vault 2.0. A common issue in data warehousing projects is that a clear scope is missing, and many supporting processes, such as controlled access, GDPR handling, auditability, documentation, and infrastructure, are not optimized. Additionally, projects that do have a scope often begin without a real focus on business value, usually because the use cases are not clearly communicated and the data architects do not know where to start. The consequence is that no business value is delivered.

Data Vault 2.0 Methodology

It is often assumed that Data Vault 2.0 is only a modeling technique, but this is not correct. Data Vault 2.0 includes the modeling technique, a reference architecture and the methodology. The methodology introduces project management tools such as CMMI, Six Sigma and Scrum to solve the problems described. While CMMI and Six Sigma deal with general project management issues, Scrum is mostly used specifically in the development team and provides the framework for a continuously improving development process.  The use of agile development in Data Vault 2.0 projects will be described in more detail below.

The Scope of a Sprint

The first step in setting up a data warehouse project in an agile way is to define the objective of the project in just one or two pages. Unlike waterfall projects, the goal is to produce working, usable outputs (reports or dashboards, for example) in continuous iterations called sprints. This means the entire project does not need to be planned in detail; instead, the team can build around a general idea or goal for the final data warehouse and then focus on planning the first sprints. To address the problems described above, each sprint needs to be centered on business value. For this reason, it is important to receive constant feedback from business users as part of a continuous improvement process.

Define the project

Both the scope of a sprint and the architecture follow a business-value-driven approach: the warehouse is built vertically, feature by feature, rather than horizontally, layer by layer. A common technique for this is the Tracer Bullet approach. Starting from a unit of business value, defined by a report, a dashboard, or an information mart, the required source data is identified, modeled through all layers, and loaded.

As shown in Figure 1, the entire staging area layer is not initially built but rather a small part of the respective layer is built based on data in the scope, in this case the SalesReport.

Agile Development

Before new functionality can be implemented in a sprint, it needs to be defined. This task lies with the product owner, who writes and prioritizes the user stories. As already explained, the goal of a sprint is to produce working pieces of usable output, called features. In addition, there are technical topics that need to be considered. Various methods can support sprint planning, such as planning poker or Function Point Analysis, which are discussed in more detail in another article.

It is also useful to evaluate the sprint while it is still ongoing. If the development team does not manage to implement a feature within the sprint, this is often a sign that the scope is too large.

To avoid this, all work packages that are not relevant to the feature should be removed. In practice, however, these packages are often not removed completely, out of concern for how business users will react.

To address this concern, it is important to explain to business users that the work packages are moved to the backlog only temporarily and will be delivered in a later sprint.

Figure 1: Data Vault 2.0 Architecture

Thanks to the flexible and scalable Data Vault model, these layers can then be extended with the next feature with little to no re-engineering. This is possible because Data Vault separates a Raw Data Vault from a Business Vault, covering both the logical architecture and the data modeling perspective. The Raw Data Vault is modeled in a data-driven way, integrating data by business keys; only hard business rules, such as data type conversions or hash key calculations, are applied there. All soft business rules are applied in the Business Vault, where data is turned into information. As a result, the Raw Data Vault requires little refactoring and can be extended almost without limit.

Agile Development Review

Another important success factor for agile projects is proper review and improvement. Even before the next sprint starts, the following meetings must be held by the team:

  • The sprint review meeting: This meeting is about reviewing the delivered features. Usually the development team, the product owner, the Scrum Master and the end-users participate in this meeting.
  • Retrospective meeting: This meeting usually takes place directly after the review meeting and focuses on identifying activities that need to be improved.
  • Backlog refinement: prioritizing the user stories and making sure the team understands what to do.
  • Sprint planning: deciding which user stories fit into the next sprint, based on effort estimates.

It is important that these meetings are held so that the sources of errors can be found. In this way, the outcome of a project can be improved and the development process optimized iteratively.

Conclusion

Data Vault 2.0 is not only a scalable and flexible modeling technique, but a complete methodology for accomplishing an enterprise vision in data warehousing and information delivery by following an agile approach and focusing on business value. By using agile methods in data warehousing, projects can concentrate on business value and deliver useful products to the customer.

Automation Options for Data Vault

How Automation Tools Are Changing the Game

Quick takeaway: Automation for Data Vault spans multiple waves — from template-driven code generation that builds the vault structure to modern generative-AI tools that help discover the model itself and AI copilots that accelerate business-rule development. Combined, these approaches dramatically cut time-to-value, reduce errors, and let teams focus on the parts that actually create business impact.



Why automation matters for Data Vault

Data Vault was designed for change: it separates raw capture from business logic, records full history and provenance, and uses standardized patterns (hubs, links, satellites). Those same patterns make the model ideal for automation. Manual Data Vault development works, but it’s slow and error-prone — especially when you must onboard many sources, handle evolving business keys or prove lineage for audits and AI projects. Automation reduces repetitive work, enforces consistency, and lets your engineers and architects spend time on modelling decisions and business rules, not boilerplate SQL.

The evolution of automation — three waves

Automation for Data Vault didn’t appear overnight. Think about it in three waves:

  1. Manual coding era: Everything by hand — raw ingestion, keys, history tracking, and the transformations. Effective, but slow and brittle.
  2. Template-driven automation (first wave): Tools that generate physical vault objects and standard loading code from a defined model. They speed up delivery and cut repetitive errors.
  3. AI-driven automation (second wave) + AI copilots: Tools that assist or even automate the model discovery itself, and AI copilots that generate business logic or transformation code — moving humans from creators to reviewers.

Understanding these waves helps you choose the right mix — existing template tools remain valuable, while AI tools are rapidly becoming practical for model discovery and logic generation.

First-wave automation: structured, reliable, repeatable

The first-wave tools are the ones most teams have used for years. Their primary job is to take a model and generate the physical implementation and ETL/ELT pipelines. Key benefits:

  • Speed: Generating hub, link and satellite structures with standard loading patterns significantly reduces delivery time.
  • Consistency: Every table and load pattern follows the same, tested template — fewer bugs and easier maintenance.
  • Orchestration and operations: Many tools build pipelines, manage hash keys, and include scheduling and error handling.

These tools are excellent when you already have a trusted logical model and want to automate the “how” of implementation. They do not solve the “what” (the model discovery) — that still requires human analysis.

Second-wave automation: AI-assisted model discovery

The real shift happens when automation starts to help with — or take over — the model discovery process itself. Instead of hand-crafting hubs and links, generative AI platforms can scan source systems and metadata to propose an initial Raw Data Vault logical model. What does that look like in practice?

  • Source scanning: The tool ingests table/field metadata, sample values and constraints.
  • Entity discovery: It suggests candidate hubs (business entities) by grouping columns and identifying recurring patterns and unique keys.
  • Key recommendation: It proposes business-key candidates and highlights primary/unique candidates derived from the source.
  • Relationship discovery: It suggests link structures where keys appear together or where foreign-key relationships are inferred.
  • Satellite design hints: The AI may split attributes into satellites based on volatility, sensitivity (PII), or update patterns.

This capability moves the needle: modelers become reviewers and validators instead of building every piece from scratch. It accelerates onboarding of new sources and shortens the path to a working Raw Vault.

AI copilots for the Business Vault and transformations

While model discovery is one hard problem, translating business requirements into transformation logic is another. This is where AI copilots shine. Integrated into developer environments, they can:

  • Generate SQL transforms from plain-English requirements (e.g., “calculate monthly churn rate by customer segment”; see the sketch after this list).
  • Create complex joins, window functions and aggregations that implement business rules.
  • Suggest test cases, edge-case handling and simple data quality checks.
  • Accelerate the creation of information-marts (star schemas) by scaffolding the necessary queries and documentation.
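As an illustration of the first point, here is the kind of scaffold a copilot might produce for “calculate monthly churn rate by customer segment”. All table and column names are hypothetical, and the output still needs the human review discussed below:

    -- Churn: customers active in month m with no activity in month m+1.
    WITH monthly_active AS (
      SELECT
        customer_segment,
        trunc(activity_date, 'MM') AS activity_month,  -- first day of the month
        customer_id
      FROM marts.fact_customer_activity
      GROUP BY customer_segment, trunc(activity_date, 'MM'), customer_id
    )
    SELECT
      prev.customer_segment,
      prev.activity_month,
      COUNT(CASE WHEN cur.customer_id IS NULL THEN 1 END) / COUNT(*) AS churn_rate
    FROM monthly_active prev
    LEFT JOIN monthly_active cur
      ON  cur.customer_id      = prev.customer_id
      AND cur.customer_segment = prev.customer_segment
      AND cur.activity_month   = add_months(prev.activity_month, 1)
    GROUP BY prev.customer_segment, prev.activity_month;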

Important caveat: copilots are accelerators, not autopilots. Generated code still needs human review for correctness, performance and governance. But they massively reduce the repetitive cognitive load and let experienced engineers focus on validation and optimisation.

Putting the technologies together: a practical workflow

Here’s a pragmatic, step-by-step workflow that mixes first-wave tools and AI capabilities into a usable process:

  1. Connect an AI discovery tool to your sources. Let it propose hubs, links and satellites.
  2. Review and refine the AI-suggested model with domain experts — confirm business keys and entity definitions.
  3. Export the logical model into a template-driven automation tool (e.g., dbt-based or WhereScape-style generators) and generate physical tables, load patterns, and orchestration pipelines.
  4. Use AI copilots to implement Business Vault logic and information-marts — write high-level requirements and have the copilot scaffold the SQL/Python transform code.
  5. Run tests and checks: automated unit tests, data quality checks and lineage validation.
  6. Deploy and monitor: schedule pipelines, monitor failures and feed back findings into the model or automation templates.

This end-to-end process reduces the time spent in data plumbing and increases time spent on business validation and value delivery.

Governance, auditability and human-in-the-loop

Automation is powerful, but it must sit inside proper governance. Because Data Vault is often used for regulatory and audit-sensitive environments, keep these guardrails in place:

  • Human review points: AI should suggest, not decide. Model approvals and business-key selection must be explicit sign-offs by domain owners.
  • Lineage and provenance: Ensure automation tools emit metadata and lineage so every generated artifact is traceable back to sources and the AI suggestions that influenced it.
  • Testing and validation: Automatically generate tests for any AI-generated transformation and fail deployments until tests pass.
  • CI/CD and version control: Keep generated models and transformations in version control so you can audit changes over time.

When these controls exist, you get the speed of automation without sacrificing compliance or trust.

When automation is the right move — and when to hold back

Automation fits particularly well when:

  • Your landscape includes many sources and you expect change.
  • You need fast onboarding (M&A or rapid product expansion).
  • Traceability and auditability are core requirements.
  • You want to reduce repetitive developer work and scale the team’s output.

Consider holding off or using a hybrid approach when:

  • Your environment is tiny and unlikely to change — heavy automation may be overkill.
  • Source data semantics are ambiguous and require deep domain expertise that AI cannot infer reliably.
  • You lack governance and testing practices to safely validate generated models and code.

Risks, limitations and best practices

Generative AI is not perfect: it can hallucinate or misinterpret faint signals in metadata. Best practices to mitigate risk include:

  • Always pair AI output with domain validation. Treat AI suggestions as draft artefacts, not final products.
  • Enforce tests: Automated unit tests and data quality checks should gate deployment.
  • Keep humans in the loop: Use model reviewers, not model builders — domain experts must accept or correct AI outputs.
  • Capture metadata: Store which AI model/version produced which suggestion for future audits.

Final thoughts — how to get started

If you’re curious about Data Vault automation, start small: pick one source or one high-value report and run it through an AI-assisted discovery + template automation pipeline. Measure the outcomes: time saved, errors avoided, and the number of iterations required to reach stakeholder approval. Use these metrics to build a business case and expand automation incrementally.

Automation won’t replace thoughtful modelling and governance, but used correctly it turns weeks of repetitive engineering into hours and lets teams focus on the decisions that move the business forward.


Rising Complexity in BI Solutions

Introduction to BI Solutions

Business intelligence (BI) and AI-driven analytics are no longer niche support functions — they are strategic products that touch product, ops, finance, compliance and customer experience. As BI expands from traditional reporting into real-time analytics, predictive modeling and self-service, the shape of data teams and the way they work are changing fast. This article summarizes the main drivers of that change, the practical impacts on teams and projects, and concrete responses you can apply now to reduce risk and keep delivering value.



Why complexity is rising: five key challenges

Modern BI projects are entering new territory. Below are five core challenges that repeatedly appear across industries and organizations.

1. Broader scope

BI today must do more than historical reporting. Stakeholders expect real-time dashboards, anomaly detection, predictive forecasts and self-service capabilities — often from the same platform. That breadth increases integration points, testing surface and the number of decisions that must be made early in the project.

2. Broader skillset

Delivering modern analytics requires a richer set of roles: data engineers who build pipelines, data modelers who craft semantic layers, data scientists who build predictive models, UX designers who make outputs usable, and governance specialists who protect privacy and ensure compliance. It’s rare for one person to cover all of these competently.

3. Increased coordination

More roles mean more handoffs. Each handoff is a potential point of misunderstanding — different assumptions, different definitions, different delivery cadences. Without deliberate coordination, projects fragment into disconnected workstreams.

4. Technical revolution

BI and cloud platforms evolve rapidly. New services, improved runtimes and updated best practices arrive often. Teams must continuously upskill and decide which innovations to adopt, and when. Certification cycles and vendor roadmaps move fast — staying current costs time and creates churn.

5. Balancing agility and governance

Stakeholders want rapid delivery and iterative improvement. At the same time, many industries require strict data handling, privacy controls and auditability. Finding an operating model that supports quick experiments while preserving accuracy and regulatory compliance is a central tension for modern BI teams.

Typical impacts on organizations

Those drivers produce predictable impacts on teams and delivery models. If unaddressed, they create bottlenecks and risk.

  • Role specialization: Teams move toward niche expertise rather than single-person full-stack delivery. That boosts depth but can reduce flexibility.
  • Stronger collaboration needs: Alignment across roles becomes essential to avoid silos and inconsistent decisions.
  • Higher dependency chains: A delay in one role (e.g., data engineering) can block downstream teams (reporting, model validation).
  • Greater governance needs: Shared definitions, standards and processes become mandatory to ensure trust, auditability and repeatability.

Practical responses: four core actions

Complexity is manageable when teams adopt clear practices focused on responsibility, agility, shared knowledge and training. Below are four practical responses that reduce friction and increase predictability.

1. Define clear responsibilities

Clarify who owns each stage of the data lifecycle: extraction, transformation, modeling, publication and maintenance. Use simple role definitions and RACI (Responsible, Accountable, Consulted, Informed) charts for every project. When people know who to ask and who will act, coordination overhead drops and turnaround time improves.

2. Use the best agile approach for your context

Agile isn’t one-size-fits-all. For a fast-moving SaaS product team, continuous delivery and short sprints might be ideal. For a bank with heavy regulation, a scaled framework with gated releases and stronger QA may be necessary. Choose the agile flavor (Scrum, Kanban, SAFe or a hybrid) that balances speed with the required controls — and make those rules explicit to stakeholders.

3. Implement shared documentation and data cataloging

Documentation isn’t optional — it is the connective tissue of modern BI. Practical, searchable documentation and a data catalog with lineage, owners and semantic definitions reduce onboarding time and prevent duplicated work. Track data lineage so teams can answer “where did this value come from?” quickly, and attach clear owners to key datasets and metrics.

4. Invest in cross-training

Cross-training creates T-shaped team members: specialists with enough adjacent knowledge to collaborate effectively. Data engineers who understand reporting constraints, and BI analysts who understand pipeline limitations, can resolve many issues without escalating. Cross-training also builds empathy — teams that understand each other’s constraints make better trade-offs.

Operational checklist you can use today

Use this short checklist to reduce immediate friction on a new or existing BI project.

  1. Run a one-hour roles workshop: Map responsibilities and publish a RACI for the first three deliverables.
  2. Choose an agile cadence: Decide sprint length, release gates and who signs off on production models or dashboards.
  3. Set up a minimal data catalog: Start with your top 10 datasets and add owners, a short description and lineage.
  4. Schedule cross-training sessions: One hour per week where a team member shares how they work and what they need from others.
  5. Document privacy and compliance rules: Keep them accessible and tie them to datasets and pipelines.

Common pitfalls and how to avoid them

Even with good intentions, teams stumble. Here are three pitfalls to watch for and short fixes.

Pitfall: Documentation as a chore

Fix: Make documentation part of the workflow. Use templates, require a one-line summary when a dataset changes, and keep a lightweight catalog rather than one massive, stale repository.

Pitfall: Over-specialization that creates handoff bottlenecks

Fix: Rotate or pair people for critical tasks. Pair a report developer with the data engineer for the first run of a new dashboard so knowledge spreads and the dependency weakens.

Pitfall: Chasing every new tool

Fix: Adopt a “value before novelty” rule. Evaluate new technologies against clear criteria: maintainability, onboarding cost, security and measurable improvement to outcomes.

Leadership and culture: the invisible infrastructure

Technical practices are important, but culture and leadership set the pace. Leaders must invest time in alignment, create incentives for collaboration and reward knowledge sharing. Prioritize outcomes (business impact) over tool novelty, and create safe spaces for cross-role feedback so teams can continuously improve.

Case example (illustrative)

Imagine a retail company expanding its BI program to support personalized promotions. The team must deliver real-time stock levels, predictive demand models and marketer self-service dashboards. If data engineering, modeling and UX are siloed, the marketer receives dashboards with stale inventory and models that don’t incorporate seasonal signals. If the company instead defines clear dataset ownership, runs weekly cross-functional reviews, and keeps a living data catalog, the same project becomes manageable: engineers expose real-time feeds, modelers publish validated artifacts with clear assumptions, and UX designers deliver interfaces the marketers can use without ambiguity.

Key takeaways

  • BI is broader now — expect to support streaming, prediction and self-service in addition to reporting.
  • Specialization is necessary but must be counterbalanced by collaboration practices and shared documentation.
  • Pick an agile approach that matches your risk tolerance and regulatory environment.
  • Make documentation and data cataloging practical and integrated into your workflows.
  • Cross-training is a small investment with outsized returns for speed and resilience.


The Business Value of Data Vault – and Why It Matters

The Business Value of Data Vault

Data Vault is not just another data model. It’s a pragmatic architecture and methodology built for change, auditability, automation and fast delivery — and those traits translate directly into measurable business advantages compared with traditional approaches or doing nothing at all.



Why we need a different approach to data

Most organisations don’t start with the perfect, governed data platform. They begin with spreadsheets, a few scripts, maybe a single operational system. Then the business grows: new SaaS apps (Salesforce, shop platforms), partner APIs, IoT feeds, regulatory requirements and ad-hoc reporting requests. Before long you have multiple sources, inconsistent definitions, and changing business rules.

Traditional models — Kimball’s star schemas or Inmon’s normalized enterprise warehouse — work well when sources, rules and requirements are stable. But the reality today is constant change. That’s exactly the gap Data Vault was designed to fill: a model and architecture focused on capturing raw facts reliably, separating business logic, and enabling incremental, auditable growth.

High-level difference: Data Vault vs. traditional models (or none)

Put simply:

  • Traditional models (Kimball/Inmon): great for reporting, intuitive star schemas for business users, but rigid and costly to change when sources or rules evolve.
  • No model / ad-hoc reports: fastest at day zero but leads to duplicated effort, inconsistent numbers, and brittle scripts that break as systems change.
  • Data Vault: engineered for change. Capture everything in a consistent, standard way, keep full lineage, and build business logic and reporting layers on top. This structure enables automation, auditability and rapid delivery of real business reports.

Concrete business advantages of implementing Data Vault

1. Faster time-to-value (Tracer-bullet delivery)

Data Vault enables an iterative “tracer bullet” approach: pick a high-value report, identify the raw source data, ingest and model only what’s needed to deliver that report end-to-end. Business users get a working dashboard in weeks (not months), giving immediate value, generating trust, and allowing the team to expand incrementally.

2. Built-for-change — lower cost of future change

Because Data Vault separates raw data (hubs, links, satellites) from business rules (Business Vault / information marts), adding a new source, new attribute, or updated business rule rarely requires tearing down and rebuilding existing models. That translates into lower rework, lower maintenance costs, and much faster onboarding of new systems.

3. Automation reduces delivery time and human error

Data Vault entities follow standardized patterns. Hubs look alike; satellites follow the same tracking patterns. That repeatability makes the ingestion and loading processes highly automatable with modern tools (for example, dbt packages, WhereScape-style automation, or Coalesce). Automation frees developers to focus on business logic instead of tedious ETL plumbing — more predictable pipelines, fewer bugs, faster delivery.
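
As a minimal sketch of what such a standardized pattern looks like (table and column names such as raw_vault.hub_customer, hk_customer and staging.stg_customer are illustrative assumptions, not a fixed standard), a hub load is the same few lines of SQL regardless of the entity:

-- Standard hub load: insert only business keys not yet present in the hub
insert into raw_vault.hub_customer (hk_customer, customer_id, load_date, record_source)
select distinct
  stg.hk_customer,
  stg.customer_id,
  stg.load_date,
  stg.record_source
from staging.stg_customer stg
left join raw_vault.hub_customer hub
  on hub.hk_customer = stg.hk_customer
where hub.hk_customer is null;

Because every hub follows this shape, a code generator only needs the entity and key names to emit the statement, which is what the automation tools above exploit.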

4. Auditable, traceable data for compliance and trust

Every record in a Data Vault carries load dates, record source identifiers, and historical versions. That full lineage is invaluable for audits, GDPR/DSR processes, finance reconciliations and provenance requirements for AI. When regulators or internal auditors ask “where did this number come from?” you can show the full trail back to source.
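
For illustration, a single query against a satellite is often enough to reconstruct that trail. The satellite and column names below are assumptions for the sketch:

-- Show every historical version of one customer, with source and load time
select
  hk_customer,
  load_date,
  record_source,
  customer_name,
  customer_address
from raw_vault.sat_customer_details
where hk_customer = '<hash key under investigation>'
order by load_date;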

5. Future-proof architecture for analytics and AI

Data Vault’s decoupling of raw capture from business logic means you can adopt new storage or compute technologies (data lakes, cloud object stores, NoSQL or streaming platforms) without reworking the core model. It’s an architecture that scales with both data volume and analytics sophistication: data science teams can access the raw, auditable records they need without breaking downstream reporting.

6. Reduced risk and predictable governance

Standardized patterns, auditable history and clear separation of concerns improve governance. Data owners can define rules in a separate layer, compliance teams can inspect lineage, and operations can automate quality checks. That lowers operational risk and makes governance predictable rather than ad-hoc.

Specific business problems Data Vault can solve

Below are concrete problems organisations experience — and how Data Vault addresses them.

  • M&As and rapid source expansion: After an acquisition you must onboard dozens of new systems. Data Vault lets you ingest raw records quickly and map business rules later, so analytics can start immediately without delaying integration for perfect master data alignment.
  • Conflicting definitions across departments: Different teams report different revenue numbers. With Data Vault you capture every source event and build reconciled information marts, so one canonical report can be produced while source-level values remain auditable.
  • Regulatory or audit requests: Need to prove how a figure was derived six months ago? Data Vault’s lineage (load timestamps, record source) shows exactly which source values contributed to any derived metric.
  • GDPR / Data Subject Requests: Because raw values and their sources are stored with provenance, it’s easier to locate and isolate personal data, show retention windows, or delete/segment records if needed.
  • AI/ML model drift and explainability: Models need defensible inputs. Data Vault keeps the raw inputs and transformation history separate, so feature engineers and auditors can trace which raw values produced a model input.
  • Slow BI delivery and constant rework: BI projects where every change requires a model rewrite burn budget. Data Vault’s incremental approach reduces rework and keeps BI teams delivering incremental, reliable reports.
  • Operational reporting vs historical analytics conflict: Operational needs often demand current-state views; analytics wants full history. Data Vault stores full history by design, while downstream information marts can present both current and historical perspectives appropriately.

How the business benefit translates to measurable outcomes

Organisations that adopt Data Vault commonly see measurable improvements such as:

  • Shorter lead times for report delivery (weeks vs months for new reports).
  • Lower total cost of ownership because changes require less rework and are more automatable.
  • Fewer data incidents and faster root-cause analysis because lineage is built-in.
  • Stronger compliance posture and faster audit responses.
  • Better support for analytics and AI initiatives — because data scientists get consistent, traceable raw data.

These translate to business outcomes: faster decisions, less risk, better regulatory positioning, and a higher ROI on analytics investments.

Practical adoption path — a pragmatic recipe

You don’t have to flip the switch for everything at once. A typical, low-risk path is:

  1. Choose one high-value report (the tracer bullet). Identify required sources and ingest the raw records into the Raw Vault.
  2. Build the Business Vault where you apply the business rules for that specific report (transformations live here, not in the raw zone).
  3. Deliver an Information Mart tuned for reporting (star schema if that’s what BI needs) that offers the business an immediate, usable report.
  4. Iterate and scale — add more reports and sources, reuse existing Hubs/Links/Satellites, automate loading patterns and apply governance over time.

This approach gives quick wins, builds trust, and progressively modernises your data landscape without huge upfront modelling effort.
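
To make step 3 concrete, here is a sketch of a minimal Information Mart object: a current-state dimension view assembled from a hub and its satellite. The names are illustrative, and the QUALIFY clause assumes a platform that supports it (Databricks and Snowflake do):

-- Information Mart: current-state customer dimension on top of the Raw Vault
create or replace view mart.dim_customer as
select
  hub.hk_customer as customer_key,
  hub.customer_id,
  sat.customer_name,
  sat.customer_address
from raw_vault.hub_customer hub
join raw_vault.sat_customer_details sat
  on sat.hk_customer = hub.hk_customer
qualify row_number() over (
  partition by sat.hk_customer
  order by sat.load_date desc
) = 1; -- keep only the latest satellite row per customer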

When Data Vault might be overkill

Data Vault is powerful, but it’s not always necessary. If you’re a very small organisation with a single system, little change, and a handful of reports, a simple star schema or a few curated data marts could be more pragmatic. Evaluate:

  • Number of sources and expected change rate
  • Regulatory/audit requirements
  • Scale of historical data needs
  • Long-term analytics and AI ambitions

If those requirements are modest today but expected to grow, Data Vault often makes sense as a future-proofing step you can introduce incrementally.

Final thoughts — why business leaders should care

At the executive level, Data Vault should be evaluated not as a modeling fad but as an investment in enterprise agility, compliance and scalable analytics. The technical patterns (hubs, links, satellites) map directly to business outcomes: rapid delivery of trusted reports, reduced change costs, auditable provenance, and a platform ready for advanced analytics and AI.

Compared to doing nothing (ad-hoc scripts) or building a rigid, monolithic warehouse, Data Vault gives you a repeatable way to capture everything, govern it, and build the business-facing outputs that actually create ROI.

If you’re considering a modern data platform, start with a tracer-bullet use case, prove the approach, automate the repeatable parts, and keep the focus on business outcomes rather than perfect modelling up front.


Meet the Speaker


Lorenz Kindling
Senior Consultant

Lorenz works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies across various industries for Scalefree International. Before joining Scalefree, he worked as a consultant in the field of data analytics, which gave him a comprehensive overview of data warehousing projects and the issues that commonly arise.

5 Ways of Testing Your Data Pipelines with dbt

Testing Data Pipelines

In today’s data-driven world, the reliability of your analytics depends on the integrity of your data pipelines. Even the most sophisticated transformations can be undermined by bad source data, schema changes, or simple human error. That’s where dbt (data build tool) shines: it provides a framework not only for transforming and modeling your data, but also for validating it at every step. In this article, we’ll explore five essential testing strategies you can implement with dbt to catch issues early, enforce data contracts, and build confidence in your analytics.



1. Custom SQL Tests

What they are: Custom SQL tests allow you to write bespoke SQL queries against your models to enforce complex business rules or edge-case validations. They live in your project’s tests/ directory as so-called singular tests and execute as part of your dbt test suite.

Why you need them: Out-of-the-box tests cover many common cases, but sometimes you have unique conditions—like “no customer may have more than one active subscription”—that require a tailored query.

-- Example: Ensure no customer has more than one active subscription
-- (no trailing semicolon: dbt wraps this query when running the test)
SELECT
  customer_id,
  COUNT(*) AS active_sub_count
FROM {{ ref('subscriptions') }}
WHERE status = 'active'
GROUP BY customer_id
HAVING COUNT(*) > 1

If this query returns any rows, the test will fail, alerting you to a data integrity issue before downstream models consume bad data.

2. Personalized Macros

What they are: Macros in dbt are reusable snippets of SQL or Jinja logic. By packaging your testing patterns as custom generic tests, which are defined with a special {% test %} block (a specialized macro), you can enforce consistent checks without repeating verbose SQL.

Why you need them: Many projects share recurring validation needs—like ensuring date fields are never in the future or that monetary amounts are non-negative. A macro lets you encapsulate that logic once and apply it across multiple models.

-- Generic test definition: tests/generic/expect_dates_in_past.sql
{% test expect_dates_in_past(model, column_name) %}

select *
from {{ model }}
where {{ column_name }} > current_date

{% endtest %}

-- Usage in schema.yml
models:
  - name: orders
    columns:
      - name: order_date
        tests:
          - expect_dates_in_past
With just a few lines in your YAML, you’ve enabled a robust, maintainable check across your entire project.

3. Built-In dbt Data Tests

What they are: dbt ships with four standard data tests: unique, not_null, accepted_values, and relationships. These cover the most frequent requirements for uniqueness, presence, domain constraints, and referential integrity.

Why you need them: They require zero custom SQL. You simply declare your expectations in your model’s .yml file, and dbt will generate and execute the underlying queries.

models:
  - name: users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null

      - name: country_code
        tests:
          - accepted_values:
              values: ['US', 'CA', 'MX']

Within minutes, you’ve added core validation checks to your critical tables, ensuring nulls or duplicates never slip through.
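
The fourth built-in test, relationships, checks referential integrity. A minimal sketch, assuming an orders model whose user_id must exist in the users model:

models:
  - name: orders
    columns:
      - name: user_id
        tests:
          - relationships:
              to: ref('users')
              field: user_id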

4. Unit Tests

What they are: Unit tests isolate a model’s logic by feeding it controlled input data and verifying that its output matches expected results. In dbt, you can implement unit tests using seed files or temporary models.

Why you need them: When your transformations involve complex calculations—like computing a customer’s lifetime value—you want airtight proof that the logic works under all conditions, including edge cases.

-- seed file: seeds/lv_input.csv
customer_id,order_amount
1,100
1,200
2,50
2,75

-- singular test: tests/lv_test.sql
with input_data as (
  select * from {{ ref('lv_input') }}
),

calculated as (
  select
    customer_id,
    sum(order_amount) as lifetime_value
  from input_data
  group by customer_id
)

-- return rows that do NOT match the expected results;
-- the test passes only when this query returns nothing
select *
from calculated
where not (
  (customer_id = 1 and lifetime_value = 300)
  or (customer_id = 2 and lifetime_value = 125)
)

If the calculated results don’t match the hardcoded expectations, the test fails, pinpointing exactly where your logic diverged.
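
Note that dbt 1.8 and later also support unit tests natively, declared in YAML with given inputs and expected rows. A minimal sketch, assuming a hypothetical customer_lifetime_value model that contains the aggregation above:

unit_tests:
  - name: test_lifetime_value_aggregation
    model: customer_lifetime_value  # hypothetical model under test
    given:
      - input: ref('orders')
        rows:
          - {customer_id: 1, order_amount: 100}
          - {customer_id: 1, order_amount: 200}
    expect:
      rows:
        - {customer_id: 1, lifetime_value: 300}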

5. dbt Model Contracts

What they are: Model contracts let you define strict schema expectations—column types, required fields, accepted value sets—directly in your schema.yml files. They act as formal agreements between your data producers and consumers.

Why you need them: Without explicit contracts, a subtle change in upstream data can silently break your analytics. Contracts ensure that any structural or type deviations immediately surface as test failures.

models:
  - name: orders
    config:
      contract:
        enforced: true  # without this flag, dbt does not enforce the contract
    columns:
      - name: order_id
        data_type: integer
        constraints:
          - type: not_null

      - name: order_amount
        data_type: decimal(18,2)  # precision specified so the contract is unambiguous
        tests:
          - not_null

With contracts in place, if someone accidentally returns order_amount as text or drops a required field, dbt stops execution and flags the violation, protecting downstream dashboards and reports.

Putting It All Together: A Comprehensive Testing Strategy

Testing in dbt is more than an afterthought—it’s a core pillar of a robust analytics engineering workflow. By combining custom SQL tests, personalized macros, built-in data tests, unit tests, and model contracts, you establish multiple layers of defense against data quality issues. Here’s a sample workflow:

  1. YAML Configuration: Start by declaring built-in tests and contracts in your schema.yml for each model.
  2. Custom Tests: Add bespoke SQL tests in tests/ for any project-specific rules.
  3. Macros: Create a macros/ folder with reusable test macros for common patterns.
  4. Unit Tests: Define seed files and test models under tests/unit/ to validate critical transformations.
  5. CI/CD Integration: Hook dbt test into your CI pipeline so that every pull request runs the full suite, ensuring no faulty code or data reaches production.

This layered approach not only catches errors early but also documents your data expectations for new team members and stakeholders.
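
As one possible CI setup, here is a sketch of a GitHub Actions workflow for step 5. The file name, the adapter choice (dbt-databricks) and the CI profiles directory are assumptions, and warehouse credentials are expected to come from repository secrets:

# .github/workflows/dbt_ci.yml (hypothetical file name)
name: dbt test suite
on: pull_request

jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-databricks  # adapter choice is an assumption
      - run: dbt deps
      - run: dbt build  # runs and tests models, seeds and snapshots in one pass
        env:
          DBT_PROFILES_DIR: ./ci  # assumes a CI-specific profiles.yml in ./ci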

Conclusion

Ensuring the accuracy, reliability, and trustworthiness of your data requires more than ad-hoc checks—it demands a structured, repeatable testing framework. dbt provides the tools you need, from built-in quick-start tests to fully customized SQL validations and contracts. By implementing these five testing methods, you’ll be well on your way to bulletproofing your data pipelines and empowering stakeholders with confidence in their analytics.

Ready to level up your data quality? Start by adding one new test to your next dbt model, and watch your data reliability soar.

