
Modeling Project Tasks and Actions in Data Vault


In the world of Data Vault, creating effective data models to capture project tasks and actions is a crucial part of building scalable, efficient, and auditable data solutions. One of the challenges that data engineers face when building such models is ensuring that the history of project tasks and actions is captured correctly, while also accounting for the complexity of slowly changing data and continuous changes like those seen in Change Data Capture (CDC) environments.



Understanding the Data Model

Let’s start by understanding the data model in question. We have three key tables:

  • Projects: This table contains details about various projects.
  • Tasks: This table contains tasks associated with each project. It has a foreign key linking back to the Projects table.
  • Actions: Each task can have one or more actions associated with it, and the Actions table has a foreign key linking back to the Tasks table.

The objective is to create a Project Dimension and a Task Dimension. The Task Dimension should include an additional attribute that represents the latest action associated with each task. To solve this, there are multiple approaches you could take, but two common ones stand out:

  • Multi-Active Satellite (MA-SAT) based on sets: The simplest approach would be to use a MA-SAT model based on sets, selecting the latest action for each task within those sets. However, this can be complex in a CDC environment.
  • Multi-Active Satellite (MA-SAT) based on an MA-Attribute: Instead of using sets, this model would use an action ID as an MA-Attribute, helping to simplify the approach while avoiding the issues created by capturing sets.

Challenges with the CDC Environment

When your source system uses Change Data Capture (CDC), the problem of handling sets becomes more complicated. In CDC, you’re dealing with a stream of changes where new records are added, existing records are updated, and old records may be deleted. This constant flow of data presents two key challenges:

  • Capturing sets is not straightforward: In a typical MA-SAT approach, a set captures all records related to a specific task or project, but with CDC data, you’re more likely to receive individual changes, rather than a full snapshot of all records at once.
  • Volume management: In a CDC environment, where actions are continually added, there is a risk that using sets will generate a massive volume of records. For example, if a new action is added to a task every day, this could result in an overwhelming number of records over time.

Given these challenges, using an MA-Attribute approach where the action ID serves as the key for the task’s latest action provides a more scalable solution. Instead of creating an ever-expanding set, you can focus on just the latest action for each task, keeping the volume under control.

Solution: Using a Non-Historized Link

The idea behind using a Non-Historized Link in this case is to capture actions as they come in from the source system, without the need for sets. In this setup, a Non-Historized Link captures the relationship between tasks and actions and includes the action ID. When new actions come in, they are inserted into the link table, and updates or deletions are handled as changes to the data. This provides a streamlined and efficient way to track the most recent actions associated with each task.
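
To make this more concrete, here is a minimal SQL sketch of such a Non-Historized Link; the table and column names (nhl_task_action, hk_task, action_id) are assumptions for illustration only, and the exact hashing and payload columns depend on your implementation.

-- Non-historized link between Task and Action (illustrative names only)
CREATE TABLE nhl_task_action (
    hk_task_action   CHAR(32)     NOT NULL,  -- hash of the task business key and the action id
    hk_task          CHAR(32)     NOT NULL,  -- reference to the Task hub
    action_id        VARCHAR(50)  NOT NULL,  -- action identifier from the source system
    load_date        TIMESTAMP    NOT NULL,
    record_source    VARCHAR(100) NOT NULL,
    action_type      VARCHAR(100),           -- descriptive payload arriving with the event
    PRIMARY KEY (hk_task_action)
);
-- New CDC events are simply inserted as they arrive; nothing is end-dated in this structure.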

Handling Updates and Deletions

In CDC environments, when updates and deletions occur, it’s important to track these events accurately. The solution is to use technical counter transactions. When an update occurs in the source system, it’s treated as a deletion of the old version and an insert of the new version. This allows the system to track changes over time and ensures the history is accurately captured.

For deletions, we simply negate the value of the original record, ensuring the deleted record is not included in any aggregations. This negation technique is useful for tracking the current state of each task, action, or project.
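
As a hedged illustration, a counter transaction for a CDC delete could be produced as follows; the table nhl_task_action_measures, the amount measure, and the key value are assumptions made only for this sketch.

-- Insert a negating row for a deleted record so that aggregations net out to zero
INSERT INTO nhl_task_action_measures (hk_task_action, load_date, record_source, amount, is_counter)
SELECT hk_task_action,
       CURRENT_TIMESTAMP,
       'CDC.delete',
       -1 * amount,        -- negate the original measure value
       TRUE
FROM   nhl_task_action_measures
WHERE  hk_task_action = 'A1B2C3'   -- key of the deleted record (illustrative)
AND    is_counter = FALSE;         -- assumes one active row per key in this simplified sketch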

Building the Task Dimension

The task dimension is where we’ll include the latest action for each task. To do this, we can use a technique involving a Point-in-Time (PIT) table, which captures the latest version of data (in this case, the latest action for each task) by linking the task hub to the appropriate satellite data.

Here’s how this works:

  • Create the PIT table for tasks: This table will store the latest version of each task, including references to the task satellite and the action satellite.
  • Use window functions: To retrieve the latest action for each task, you can apply window functions in your SQL queries to order by the business date (or load date) and select the latest action based on these values.

Once the PIT table is created, it can be used to build the Task Dimension, which will then reflect the latest action for each task.
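
For the window-function step, a simple sketch (with assumed table and column names) could look like this:

-- Pick the latest action per task, e.g. while loading the PIT or the Task Dimension
SELECT hk_task,
       action_id,
       load_date
FROM (
    SELECT l.hk_task,
           l.action_id,
           l.load_date,
           ROW_NUMBER() OVER (
               PARTITION BY l.hk_task
               ORDER BY l.load_date DESC   -- or a business date, if one is available
           ) AS rn
    FROM nhl_task_action AS l
) AS latest_action
WHERE rn = 1;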

Optimizing the Approach

One of the key considerations when implementing this model is performance. Using window functions and PIT tables is an effective way to manage the volume of records, but it’s important to ensure your database is optimized for these types of queries. In high-volume environments, applying business logic during the PIT table loading process can reduce the load time and make data delivery faster.

Conclusion

Modeling project tasks and actions in Data Vault requires careful consideration of the data flow, especially in environments that rely on CDC. By using Non-Historized Links, Multi-Active Satellites based on MA-Attributes, and PIT tables, you can create a scalable, efficient model that captures the latest actions associated with each task, without generating excessive volumes of data.

By applying these techniques, you can streamline the data flow, optimize performance, and ensure the data is always accurate and up-to-date, making it easier for your data users to access the information they need.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

DBT Next Chapter with SDF – From SQL Strings to Semantic Insights

Introduction

dbt Labs has recently acquired SDF, a company known for its cutting-edge SQL comprehension technology. This acquisition brings several key benefits, including:

  • Faster dbt project compilation
  • Improved developer experience
  • High-fidelity data lineage tracking

But what does this mean for dbt users? Let’s explore.



Background & Context

dbt simplifies the creation of data models using SQL SELECT statements, making data engineering accessible to anyone familiar with SQL. Historically, however, dbt has treated SQL as mere strings, limiting its ability to understand the deeper meaning behind queries.

The Acquisition Explained

In January 2025, dbt Labs announced the acquisition of SDF. According to dbt Labs’ founder and CEO Tristan Handy:

“We are acquiring SDF to bring SQL comprehension into dbt and usher in a new era of ‘what’s possible’ for analytics: supercharging developer productivity and heightening data quality, all while optimizing data platform costs.”

What is SDF Technology?

Unlike dbt’s traditional approach of treating SQL as text, SDF understands the deeper structure of SQL, recognizing objects, types, syntax, and semantics. It emulates the SQL compilers of various data platforms (e.g., Snowflake, Redshift, BigQuery), allowing developers to:

  • Validate SQL queries before execution
  • Catch breaking changes early
  • Ensure platform compatibility
  • Perform real-time impact analysis
  • Reduce computational costs

What Does “SQL Comprehension” Mean?

SQL comprehension enables dbt to:

  • Identify query components
  • Generate structured artifacts
  • Validate SQL syntax and semantics
  • Predict query outcomes, including column creation and datatype assignments
  • Execute queries efficiently

Impact on dbt

1. Validate: Will My SQL Really Work?

With SDF, dbt can ensure that SQL queries are correct before execution, reducing errors and debugging time.

2. Analyze: Precise Column-Level Lineage

SQL comprehension improves data lineage tracking, leading to:

  • Better debugging workflows
  • Optimized CI builds (only rebuilding models affected by changes)
  • Enhanced metadata propagation (e.g., PII tagging, test descriptions)

3. Optimize: Right Query, Right Place

With a better understanding of SQL, dbt can:

  • Optimize queries for specific engines
  • Identify performance bottlenecks before execution
  • Support query pruning (scanning only relevant data subsets)
  • Optimize DAG execution across multiple platforms

Future Implications

The integration of SDF into dbt is in progress. While it won’t be part of the Apache 2.0 code base, dbt Labs plans to make key SDF capabilities available to all users—both in dbt Core and dbt Cloud.

With SDF, dbt users can look forward to:

  • Faster performance
  • Optimized data platform costs
  • Improved data lineage tracking

All without needing to change existing dbt projects.

Conclusion

The acquisition of SDF marks a significant milestone for dbt, bringing true SQL comprehension into the platform. Developers will benefit from enhanced query validation, real-time impact analysis, and optimized execution strategies. As dbt evolves, its users can expect more powerful, efficient, and cost-effective data transformations.

Watch the Video

Unit of Work in Data Vault

What Is the Unit of Work?

The Unit of Work is a fundamental concept in Data Vault modeling. It refers to the idea of capturing all the related business keys from a source system together in a single Link structure. This ensures that the relationships represented in the raw data are maintained intact in your Raw Data Vault.

Think of it as keeping all pieces of a puzzle together. When you break the pieces apart without a clear strategy to reassemble them, you risk losing the original picture—or worse, creating a picture that never existed in the first place.



The Key Question: What Happens If You Split Links?

The question posed in our Data Vault Friday session was:

“It would be good to refresh the topic of the Unit of Work. What is the information I’d be losing if I split wider Links into smaller pairs (DrivingKey -> Key1/Key2 etc)? I can think of a technical setup where I’d have multiple Links and Effectivity Satellites and I should be able to get the same results to any point in time… What information exactly gets lost if I split Links?”

The Risk of Splitting Links: A Practical Example

Let’s start with a simple example to illustrate the point:

Imagine you have a relationship between three business entities—Customer, Store, and Product. Customers go into Stores and purchase Products. If you model this relationship as a single Link that includes all three Hubs (Customer, Store, Product), you have a clear and accurate picture of that transaction.

Now, suppose you decide to split this Link into two smaller Links:

  • Customer -> Store
  • Customer -> Product

At first glance, this may seem fine. However, when you attempt to reconstruct the original dataset by joining these two Links, you may create new combinations that never existed. For example:

  • Customer A visited Store K and purchased Product X (valid)
  • Customer A visited Store L (also valid), but when you join this with Product X, you create a false relationship—Customer A purchased Product X at Store L (which never happened in reality).

This phenomenon, where new, artificial rows are generated during joins, leads to data inconsistencies. You’ve effectively introduced relationships that didn’t exist in your source system. That’s the major risk when splitting Links.

The Jedi Test: Verifying Your Model

To avoid these issues, Michael recommends applying what he calls the Jedi Test (yes, Star Wars fans, rejoice!).

The Jedi Test is a simple but powerful validation: Try to reconstruct the original source dataset from your target Data Vault model. If you can recreate the original rows exactly—no more, no less—you’re golden. If you end up with extra rows or missing rows, you’ve got a problem.

This test helps you ensure that your Links are modeled correctly and that no data is lost or inaccurately recreated when splitting Links.
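
A minimal SQL sketch of the Jedi Test, assuming a staged source table stg_purchases and illustrative hub and link names, could look like this:

-- Rebuild the source rows from the Raw Data Vault and compare them with the staged source
WITH reconstructed AS (
    SELECT hc.customer_id, hs.store_id, hp.product_id
    FROM   link_customer_store_product l
    JOIN   hub_customer hc ON hc.hk_customer = l.hk_customer
    JOIN   hub_store    hs ON hs.hk_store    = l.hk_store
    JOIN   hub_product  hp ON hp.hk_product  = l.hk_product
)
SELECT customer_id, store_id, product_id FROM stg_purchases
EXCEPT
SELECT customer_id, store_id, product_id FROM reconstructed;
-- Rows returned here are missing from the model; run the EXCEPT in the other
-- direction as well to find artificial rows that never existed in the source.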

Driving Keys and Unit of Work

One scenario where splitting Links causes definitive problems is when you’re dealing with Driving Keys. Driving Keys are a subset of business keys that uniquely identify a relationship.

Michael explains that splitting components of a Driving Key almost always results in a Unit of Work problem. If you break apart a Driving Key, you risk losing the integrity of your data relationships. That’s why it’s best to keep all Hub references from the same source together in a Unit of Work Link within the Raw Data Vault.

He emphasizes that although he’s not a fan of Driving Keys (calling it a “love-hate relationship”), there are scenarios where they are necessary, and when they are, you must be cautious not to split them.

The Safe Bet: Keep Links Together in the Raw Data Vault

The best practice? Keep all Hub references from the same source table together in one Link—the Unit of Work Link.

This approach guarantees that the full relationship is captured as it appeared in the source data. You can always split these relationships later in the Business Vault, where transformations and additional rules can be applied. But the Raw Data Vault should act as your trusted, immutable source of truth.

By maintaining the Unit of Work in your Raw Data Vault, you protect yourself from potential data inconsistencies and ensure you can always rebuild accurate, reliable data structures downstream.

What If You Have to Split the Link?

Sometimes, technical limitations force you to split a Link. For instance:

  • Your database system limits the number of Hub references you can include in a composite index.
  • You have descriptive data at different granularities, making a wide Link impractical.

In these cases, Michael suggests a few strategies:

  1. Keep the Unit of Work Link as your base Link.
  2. Create additional Links to capture specific granularity needs (Customer -> Product, Customer -> Store, etc.).
  3. If you split Links, ensure you also capture any missing relationships—for example, if you have Customer -> Store and Customer -> Product, you might also need Store -> Product to maintain the complete relationship picture.

These extra Links allow you to recreate the original dataset accurately if needed, but they add complexity. More Links mean more entities, more joins, and potentially slower query performance. That’s why Michael emphasizes the importance of an efficient model and recommends keeping Links together whenever possible.

Efficiency Considerations

From a performance perspective, fewer Links with more Hub references often result in faster queries. You reduce the number of joins and simplify your model. Conversely, more Links increase complexity and the potential for errors.

Therefore, the trade-off is clear: Simplicity and efficiency with a Unit of Work Link versus flexibility with additional Links—but at the cost of complexity and higher risk.

Final Thoughts and Recommendations

To summarize:

  • Always aim to keep Hub references from the same source together in a Unit of Work Link in your Raw Data Vault.
  • Perform the Jedi Test to ensure you aren’t losing or falsely creating data relationships.
  • If you must split Links, document and manage them carefully, ensuring you maintain all necessary relationships (even those not obvious at first glance).
  • Save complex splits and transformations for the Business Vault, where they can be more easily corrected without reloading the entire dataset.

At the end of the day, a well-structured Unit of Work ensures your Data Vault remains a solid, reliable foundation for analytics, reporting, and data science initiatives.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Data Vault on Wide Tables: Best Practices and Considerations

Understanding Data Vault on Wide Tables

Storing data as wide tables in a data lake while applying a logical Data Vault layer using views presents unique challenges. The goal is to virtualize a raw Data Vault model on top of a data lake while ensuring optimal performance.



Key Considerations for Performance Optimization

  • Data Remains the Same: Regardless of whether you use Iceberg, Snowflake, or another technology, the fundamental data characteristics remain unchanged. Issues like dirty data and transformations still apply.
  • Descriptive vs. Transactional Data: Wide tables typically contain a mix of master and transactional data. Most attributes tend to be descriptive, especially in master data.
  • Granularity Matters: Properly defining granularity helps structure the Data Vault model efficiently, especially for hubs, links, and satellites.

Virtualizing Data Vault on a Data Lake

When implementing a virtual Data Vault, consider:

  • Hubs and Links: These should be materialized instead of virtualized. Business keys from wide tables should be extracted and stored in separate Iceberg tables for efficiency.
  • Satellites: Virtualizing satellites using views is recommended (see the sketch after this list), but pay attention to GDPR and personal data separation.
  • Indexing Performance: Hubs, links, PIT (Point-in-Time) tables, and bridges serve as indexes in a data lake environment.
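
The sketch below illustrates this split, assuming a hypothetical wide lake table lake.wide_customer and illustrative column names; hash logic and naming conventions will differ in a real project.

-- Materialized hub: acts like an index over the wide table
CREATE TABLE rdv.hub_customer AS
SELECT MD5(UPPER(TRIM(customer_no))) AS hk_customer,
       customer_no                   AS customer_bk,
       MIN(load_date)                AS load_date,
       'lake.wide_customer'          AS record_source
FROM   lake.wide_customer
GROUP BY customer_no;

-- Virtualized satellite: a view over the wide table, no extra storage
CREATE VIEW rdv.sat_customer AS
SELECT MD5(UPPER(TRIM(customer_no))) AS hk_customer,
       load_date,
       name,
       address,
       segment        -- descriptive payload; keep PII attributes in a separate view if required
FROM   lake.wide_customer;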

GDPR and Data Privacy Challenges

One of the biggest concerns in a wide table approach is data privacy. Since wide tables often include personal data, you must consider logical deletion, encryption, or physical separation techniques to comply with regulations.

Enhancing Performance with Materialized Structures

To achieve good query performance, consider materializing certain structures:

  • Materialized Hubs and Links: These structures act as indexes and improve data retrieval efficiency.
  • PIT and Bridge Tables: These further optimize queries by structuring data in a way that minimizes computational load.
  • Denormalized Information Marts: End-users should query fully materialized information marts, ensuring high-speed access.

Does Datavault4dbt Support This Approach?

Discussions around Datavault4dbt suggest it may support this approach in the future. If you’re working on a project with this implementation, consider reaching out to Scalefree to explore collaboration opportunities.

Conclusion

Applying Data Vault on wide tables within a data lake architecture requires careful planning. The key takeaways include:

  • Virtualizing satellites while materializing hubs and links for performance.
  • Addressing GDPR concerns by separating or encrypting personal data.
  • Using PIT and bridge tables to enhance indexing and query speed.
  • Building fully denormalized information marts for end-user access.

By following these best practices, organizations can ensure efficient and scalable Data Vault implementations on wide tables.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Real-Time Data Warehousing and Business Intelligence with Data Vault 2.0 and AWS Kinesis


Data is the fuel of the digital economy. However, its true value is realized only when it is processed quickly, reliably, and structured for analysis and reporting. Real-time data streaming enables companies to make data-driven decisions instantly. Data Vault 2.0 combined with AWS Kinesis provides a future-proof solution for efficiently processing and storing large volumes of data in modern data warehousing and BI environments.

Realtime on AWS with Data Vault 2.0

Join our webinar on March 18th, 2025, 11 am CET, and learn how to build a scalable, real-time data architecture on AWS. We’ll cover AWS infrastructure for real-time data, applying Data Vault 2.0 in real-time scenarios, and showcase a live demo with a real-world use case.

Watch Webinar Recording

Why Real-Time Data Streaming for Data Warehousing and BI?

In today’s fast-paced business environment, timely access to accurate data is essential for making informed decisions. Traditional batch processing methods can no longer keep up with the need for real-time insights, often resulting in outdated reports and slow reaction times. Real-time data streaming solves this problem by enabling continuous data integration, allowing companies to analyze and act on fresh data as it arrives. This shift not only improves operational efficiency but also enhances overall business intelligence strategies by ensuring that the most up-to-date information is always available.

Data Vault 2.0 as the Foundation for Real-Time Data Warehousing

As organizations deal with increasing volumes of data from multiple sources, they need a flexible and scalable approach to data modeling. Data Vault 2.0 provides the ideal foundation for real-time data warehousing by offering a structured yet adaptable methodology. Unlike traditional data models, which can be rigid and difficult to modify, Data Vault 2.0 adapts quickly to new requirements. By leveraging Data Vault 2.0, companies can build a resilient and future-proof data warehouse capable of handling real-time data streams with ease.

AWS Kinesis: Real-Time Data for Your Data Warehouse

Processing real-time data at scale requires a robust infrastructure, and AWS Kinesis is built precisely for this purpose. It enables businesses to collect, process, and analyze real-time data streams, ensuring that data warehouses remain continuously updated. By eliminating data latency, companies can generate insights in real time, leading to faster decision-making and improved operational performance. Furthermore, AWS Kinesis seamlessly integrates with widely used BI systems such as AWS Redshift and Snowflake, making it an essential component for modern data architectures. Its dynamic scaling capabilities provide cost efficiency by adjusting resource consumption based on actual demand. Additionally, Kinesis includes advanced security features, ensuring that sensitive data remains protected while adhering to industry regulations.

Conclusion: Future-Proof BI and Data Warehousing with Real-Time Streaming

Companies that embrace real-time data processing benefit from faster BI analysis, lower costs, and greater scalability. Data Vault 2.0 combined with AWS Kinesis offers a powerful, future-proof solution for modern data warehousing architectures. By enabling seamless integration of real-time data, businesses can react instantly to market changes, optimize their operations, and stay ahead of the competition.

Investing in real-time data streaming is not just about speed; it’s about building a resilient and adaptive data infrastructure that grows with your business. Organizations that leverage these technologies today will gain a significant competitive edge, ensuring long-term success in an increasingly data-driven world. Leverage real-time streaming for BI and maximize the value of your data!

Column Propagation in Coalesce: Handling IT Table Changes

How Coalesce Manages Column Propagation

Change is inevitable in data management. Whether you like it or not, IT table structures evolve due to various reasons. When these changes occur, they can impact data pipelines, potentially leading to inefficiencies or even failures. Fortunately, coalesce.io offers a robust solution to manage column propagation seamlessly.



Why Do Table Structures Change?

Changes in database table structures can occur for several reasons:

  • Change in the source system: New data sources, modifications in existing systems, or upgrades can introduce changes.
  • Change in the data ingestion process: Adjustments in ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes may require modifications in tables.
  • Development mistakes: Incorrect data modeling, schema design flaws, or unintended changes can also trigger table updates.

Types of Table Structure Changes

Table modifications generally fall into three categories:

  • New Attributes: Additional columns are introduced to capture new data.
  • Removed Attributes: Deprecated or unnecessary columns are eliminated.
  • Changed Datatypes: Column data types are modified for compatibility or optimization.

Impact of Changes in IT Tables

The consequences of these changes can vary widely:

  • Best Case: Unused or unconsumed data, leading to inefficiencies but not immediate failure.
  • Worst Case: Complete pipeline failure, causing data loss or system downtime.

Column Propagation in Coalesce

Coalesce simplifies the process of managing table changes through an efficient column propagation mechanism.

How Column Propagation Works

Column propagation in coalesce.io follows a structured approach:

  1. Select Column: Identify the column that has been added, removed, or modified.
  2. Propagate Addition or Deletion: Ensure that the column change is applied throughout the pipeline.
  3. Mark Downstream Objects: Identify downstream objects that are affected and should be updated accordingly.
  4. Create Commit: Finalize the changes with a commit to reflect them across the system.

Benefits of Using Coalesce for Column Propagation

By leveraging Coalesce’s column propagation features, data engineers can:

  • Automate schema changes: Reduce manual intervention and minimize errors.
  • Ensure data consistency: Prevent mismatches between schema and data models.
  • Improve efficiency: Accelerate change implementation without disrupting workflows.
  • Enhance visibility: Gain better control over how changes impact the entire data pipeline.

Final Thoughts

Managing table changes is a critical aspect of data engineering. With coalesce.io, data teams can seamlessly handle column propagation, ensuring minimal disruptions and optimal performance. Whether you’re dealing with new attributes, removed attributes, or datatype modifications, coalesce.io streamlines the process and enhances data reliability.

Watch the Video

Meet the Speaker


Tim Kirschke
Senior Consultant

Tim has a Bachelor’s degree in Applied Mathematics and has been working as a BI consultant for Scalefree since the beginning of 2021. He’s an expert in the design and implementation of BI solutions, with a focus on the Data Vault 2.0 methodology. His main areas of expertise are dbt, Coalesce, and BigQuery.

When to Invest in Data Warehousing

Investing in Data Warehousing

Data warehousing is a crucial step for businesses looking to manage, integrate, and analyze large volumes of data efficiently. But at what point should a company consider investing in a data warehouse? Many assume that only large enterprises need data warehousing, but the reality is more nuanced.



Is Company Size the Right Factor?

Many businesses believe that data warehousing is only necessary when they reach a certain size. However, size alone is not the determining factor. Instead, the key driver for investing in data warehousing is the complexity of data management and the business challenges that arise from it.

Understanding Business Needs and Data Complexity

Some businesses need data solutions early, even during the startup phase. For example, data-driven companies—such as tech startups that rely on analytics—often invest in data warehousing early to gain insights and improve decision-making.

On the other hand, traditional businesses, like small manufacturing firms, may not require advanced data solutions immediately. If a business operates without much digital data, spreadsheets or simple databases may suffice for years.

When Data Integration Becomes a Challenge

As businesses grow, they tend to use multiple systems for different operations. For example, customer data may reside in a CRM, sales data in an ERP, and marketing data in yet another platform. When companies need a unified view of their operations, data warehousing becomes a necessity.

Operational reports from systems like SAP or Salesforce can work for a while, but eventually, businesses realize they need to consolidate data from various sources to make better strategic decisions.

Regulatory and Compliance Requirements

In some industries, regulatory requirements mandate specific reporting and auditing capabilities. A cleaning service provider, for instance, may need to generate reports on the number of windows cleaned to comply with contract terms. Failure to provide such reports can result in financial penalties.

In such cases, investing in data warehousing can be justified by calculating the potential financial losses due to non-compliance. Businesses in regulated industries, such as finance and healthcare, often prioritize data warehousing earlier for compliance reasons.

External Data and Decision-Making

Some businesses receive vast amounts of external data from suppliers, partners, or customers. If decisions rely on analyzing large datasets, a data warehouse can streamline operations, improve accuracy, and reduce manual effort.

For example, a company relying on Excel sheets for pricing decisions may struggle to track and compare data efficiently. A data warehouse helps aggregate, cleanse, and analyze data, enabling better decision-making.

Cost vs. Benefit Analysis

While budget constraints may delay investment in data warehousing, businesses should assess the cost of inefficiencies. If manual data handling, poor integration, or compliance risks cost the company significant time and money, the investment in a data warehouse is often justified.

Final Thoughts

Investing in data warehousing is not solely about company size but rather about business needs and data complexity. Companies facing data-driven challenges, compliance requirements, or inefficiencies in data management should consider investing in a data warehouse—regardless of their size.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Implementing a Business Glossary: A Step-by-Step Guide

What is a Business Glossary?

A Business Glossary is a structured collection of business terms with clear definitions, ensuring consistency and accuracy across an organization. It serves as a single source of truth for terminology used in different teams and departments.

Why a Business Glossary is Essential

  • Standardized Terminology: Ensures that everyone uses the same definitions, reducing ambiguity.
  • Improved Communication: Minimizes misunderstandings between teams.
  • Enhanced Data Quality: Ensures consistency across reports and databases.
  • Supports Compliance: Helps meet regulatory requirements such as GDPR, ESG, and BCBS 239.


Key Benefits of a Business Glossary

  • Standardized terminology across teams
  • Faster and more accurate reporting
  • Easier regulatory compliance
  • Trustworthy, high-quality data

Challenges Without a Business Glossary

  • Data inconsistency across departments
  • Compliance risks (GDPR, ESG, BCBS 239)
  • Errors in reporting and decision-making
  • Wasted time fixing data discrepancies

Key Components of a Business Glossary

  • Term Name: The business term (e.g., “Customer”).
  • Definition: A clear, non-technical explanation.
  • Synonyms & Acronyms: Alternative names used across departments.
  • Owner: The responsible person for maintaining the term.
  • Business Rules: Conditions or constraints applied to the term.
  • Data Source: The official location of the data.

How to Implement a Business Glossary

Step 1: Identify Key Business Terms

Start by finding the most commonly used yet misunderstood terms in your organization. These are the terms that frequently cause confusion or inconsistencies.

Step 2: Define the Terms

Get cross-team agreement on definitions, document all synonyms, and resolve any conflicts in terminology.

Step 3: Store & Publish the Glossary

Make the glossary accessible to everyone in the organization. Common platforms include Excel, SharePoint, or specialized tools like Collibra.

Step 4: Assign Ownership & Governance

Assign a data owner or steward to ensure ongoing updates and accountability.

Step 5: Monitor & Improve the Glossary

Conduct quarterly reviews, track data usage trends, and integrate the glossary into reports and workflows.

Step 6: Adapt to Industry Standards

Stay updated with new regulations and industry best practices to ensure your glossary remains relevant.

Key Takeaways

  • A Business Glossary improves data clarity, accuracy, and trust.
  • Assign owners and governance roles to maintain the glossary.
  • Start small with 15-20 key terms before scaling.
  • Monitor usage and resolve conflicts regularly.
  • Integrating a Business Glossary into your data governance framework enhances long-term efficiency.
  • Start with simple tools like Excel or SharePoint, then upgrade as needed.

Conclusion

Implementing a Business Glossary is a crucial step toward achieving data consistency, improving communication, and ensuring compliance. By following a structured approach, organizations can establish a reliable glossary that grows with their business needs.

Watch the Video

Meet the Speaker


Lorenz Kindling
Senior Consultant

Lorenz is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies in various industries for Scalefree International. Prior to Scalefree, he also worked as a consultant in the field of data analytics. This allowed him to gain a comprehensive overview of data warehousing projects and common issues that arise.

PII Business Keys: Best Practices for Artificial Hubs and Satellites


In modern data architecture, handling Personally Identifiable Information (PII) is a crucial aspect of maintaining data privacy and integrity. One common challenge in Data Vault modeling is determining how to properly load artificial hubs when using PII fields as business keys. In this article, we explore different approaches and best practices to handle this scenario effectively.



Understanding the Loading Process for Artificial Hubs

When an artificial hub is created using a PII-based business key, a key question arises: should we load one UUID per person, or should we generate multiple UUIDs for each version or change? Additionally, how do we manage this in the ETL process?

Solution 1: Using the Technical ID as the Business Key

One approach is to integrate data using the technical ID (e.g., employee ID from a CRM system) instead of the PII-based business key. This method ensures that the actual PII data remains in the satellite, making deletions easier while maintaining integrity in data integration.

Solution 2: Storing Both Technical and Business Keys in the Hub

Another option is to load both the technical ID and the PII-based business key into the same hub. A same-as-link can be used to create a mapping between the two, allowing flexibility in identifying records while ensuring that the satellites reference the technical ID for consistency.

Solution 3: Separating Technical and Business Keys into Two Hubs

For greater flexibility, technical IDs and business keys can be stored in separate hubs with a linking mechanism. While this approach introduces additional complexity, it provides a structured way to manage mappings between different keys while keeping PII data separate.

Managing UUIDs in the ETL Process

If the source system does not provide a technical ID, an artificial UUID must be generated. This requires maintaining a lookup table in the staging layer to map each business key to a UUID. This mapping must be handled efficiently in the ETL process to ensure consistency across data loads.
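
A minimal sketch of such a lookup table and its incremental load is shown below; the object names are assumptions, and the UUID generation function differs per platform.

-- Staging lookup that assigns one stable UUID per PII business key
CREATE TABLE stg.bk_uuid_lookup (
    business_key  VARCHAR(200) NOT NULL PRIMARY KEY,  -- the PII-based key from the source
    person_uuid   VARCHAR(36)  NOT NULL,
    created_at    TIMESTAMP    NOT NULL
);

-- On each load, only keys not seen before receive a new UUID; existing keys keep theirs
INSERT INTO stg.bk_uuid_lookup (business_key, person_uuid, created_at)
SELECT s.business_key,
       UUID_STRING(),           -- platform-specific UUID function (illustrative)
       CURRENT_TIMESTAMP
FROM  (SELECT DISTINCT business_key FROM stg.source_persons) s
LEFT JOIN stg.bk_uuid_lookup l ON l.business_key = s.business_key
WHERE l.business_key IS NULL;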

Handling PII Deletions

When a delete request is received, it is essential to remove PII data while preserving relationships in the data model. Using the technical ID ensures that descriptive information remains, while direct identifiers are removed. Additionally, solutions like Delta Lake or Iceberg tables can help manage deletions effectively in a data lake environment.
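
As a hedged example, a deletion request can then be reduced to removing rows from the PII satellite only, located via the technical ID; all object names below are illustrative.

-- Remove direct identifiers for one person; hubs, links, and non-PII satellites stay intact
DELETE FROM rdv.sat_employee_pii
WHERE hk_employee IN (
    SELECT hk_employee
    FROM   rdv.hub_employee
    WHERE  employee_id = '4711'    -- technical ID from the delete request (illustrative)
);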

Conclusion

Choosing the right approach for handling PII-based business keys depends on the specific use case and integration requirements. Using a technical ID simplifies integration but may not always be feasible. The same-as-link approach provides a balanced solution, while separate hubs offer greater flexibility at the cost of added complexity. Ultimately, a well-structured ETL process is key to ensuring data integrity and compliance with privacy regulations.

Watch the Video

Meet the Speaker


Marc Winkelmann
Managing Consultant

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016, he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020, he became a Data Vault 2.0 Instructor for Scalefree.

Data Streaming in Snowflake


In this newsletter, we’ll provide an overview of the possibilities for streaming data using Snowflake.

The approaches discussed here focus on how Snowflake’s features and capabilities enable ingesting and processing data streams at speed. The goal is to explore how businesses can gain an advantage by using real time data to make time-critical decisions, improve operational efficiency, and enhance customer experiences. Additionally, we’ll examine how Snowflake’s architecture provides a reliable foundation, its suitability for streaming workloads, and the challenges it addresses.

Data Streaming in Snowflake

This webinar on February 11th, 2025, 11 am CET, shows you how to speed up data-driven decisions using Snowflake’s real-time data capabilities. Learn to unify batch and streaming data pipelines to handle large volumes of data with Snowpipe Streaming, Hybrid Tables, and Dynamic Tables. Discover how to get low-latency insights and make faster decisions without affecting your existing processes. We’ll explore real-world examples of how others use real-time data for dashboards, customer experiences, and more, plus best practices for ensuring data quality and performance. Ideal for data engineers and architects, this session will show you how Snowflake can revolutionize your analytics and help you thrive in today’s fast-paced digital world.

Watch Webinar Recording

What to Expect

You will gain an overview of the various built-in tools and techniques Snowflake provides to handle real time data streaming, including data ingestion, processing, and querying capabilities. We’ll cover Snowflake’s architectural components, such as tables, views, and processing features, and explore how they are used to manage transactional workloads.

The main focus of this blog is to provide insights on how Snowflake enables real time data processing, the opportunities it presents, and the trade-offs to consider when implementing streaming solutions.

Real Time Data – an Overview

Real time data enables businesses to process information immediately after it is generated. Unlike traditional batch processing, which works with fixed intervals, real time data flows continuously, allowing for dynamic and immediate actions. This type of data processing has become increasingly important for organizations seeking to stay competitive in a fast-moving market.

Key Characteristics of Real Time Data

Real time data has distinct characteristics that differentiate it from other types of data flows:

  • High Velocity: Data is processed at high speeds, often within milliseconds or seconds.
  • Low Latency: Systems are designed to minimize delays, ensuring timely access to insights.
  • Dynamic Streams: Continuous and often unpredictable data streams require flexible and scalable processing.

Each of these characteristics brings unique advantages and challenges to the table. For this reason, organizations need specialized tools and platforms to handle real time data effectively.

Why Real Time Data is Important

The value of real time data lies in its ability to provide immediate insights, which enable organizations to act without delay. It can significantly enhance several business areas, such as:

  • Enhanced Decision-Making: Real time insights enable proactive responses, such as dynamically adjusting pricing or inventory based on live demand.
  • Customer Personalization: By analyzing behavior in real time, companies can deliver tailored experiences that meet customer expectations in the moment.
  • Operational Efficiency: Constant monitoring helps organizations to detect and address issues early, reducing downtime and optimizing performance.
  • Risk and Fraud Detection: Rapid identification of anomalies or threats reduces exposure to fraud and operational risks.

The benefits of real time data are clear; however, its implementation comes with specific challenges, as detailed below.

Challenges with Real Time Data

While the advantages of real time data are significant, they are accompanied by a unique set of challenges:

  • Handling Velocity and Volume: Processing large streams of data at high speed requires scalable systems.
  • Ensuring Consistency: Maintaining data accuracy across distributed systems can be complex, especially when dealing with multiple sources.
  • Integration Complexity: Real time data must often be combined with batch systems or legacy analytics, which introduces technical complexities.

The continuous flow and immediacy of real time data make it indispensable for businesses aiming to stay ahead in their industries. With the right tools and architecture, organizations can unlock its potential to deliver significant value and competitive advantage.

Real Time Data in Snowflake

Snowflake offers a modern approach to handling real time data by combining its cloud-native architecture with features designed for continuous ingestion, processing, and querying. With built-in tools like Snowpipe, streams, and tasks, Snowflake enables organizations to integrate real time data. In this section, we are going to explore how Snowflake manages real time data and how it can meet the demands of streaming workloads.

Snowflake’s Architecture for Real Time Data

Snowflake’s architecture combines shared-disk and shared-nothing concepts. It uses a central, cloud-based data repository (the “shared-disk” portion) while compute resources (virtual warehouses) scale independently and process data in parallel (the “shared-nothing” aspect).

  • Batch: Traditionally, data engineers batch large amounts of data into Snowflake using the COPY command or scheduled loads. The compute layer scales up for heavy loads and then scales back down for cost efficiency.
  • Real Time: Continuous ingestion focuses on small, frequent data arrivals. The architecture still benefits from the separation of storage and compute, allowing Snowflake to handle real time feeds without blocking batch workloads or analytics queries.

Micro-partitions

Snowflake automatically divides tables into micro-partitions, compact storage units that group data based on natural ordering or load patterns.

  • Batch: Micro-partitions shine when running large analytical queries, as Snowflake can prune out unnecessary partitions quickly, boosting performance for wide-ranging queries.
  • Real Time: Frequent data loads generate more micro-partitions. The fine-grained organization remains beneficial, but you need to plan for slightly more overhead in terms of partition management, particularly if your real time data streams produce extremely high volumes.

Dynamic and Hybrid Tables

Speaking of tables, the next two concepts are very helpful when processing real time data in Snowflake.

Dynamic Tables allow you to define continuous transformations within Snowflake. Think of them as similar to materialized views, but more flexible and with the ability to handle complex transformations and dependencies. You specify a SQL query for how the table should be built, and Snowflake automatically processes incremental changes from source tables behind the scenes. A short sketch follows the list below.

  • Incremental Data Processing: Instead of running ad-hoc or scheduled jobs, Dynamic Tables automatically update when new data arrives. You can set a target lag by specifying a refresh duration or defining a downstream dependency. Within the downstream dependency, the last dynamic table determines when the data needs to be refreshed.
  • Continuous Pipelines: They reduce the need for manual orchestration with Snowflake Tasks or external job schedulers.
  • Complex Transformations: Unlike materialized views (which are generally suited for simpler aggregates), Dynamic Tables can handle joins, window functions, and other advanced SQL operations.
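
A short, hedged sketch of a Dynamic Table definition; the source table, warehouse name, and target lag are assumptions:

-- Continuously maintained aggregate; Snowflake refreshes it incrementally to meet the target lag
CREATE OR REPLACE DYNAMIC TABLE order_totals
  TARGET_LAG = '5 minutes'
  WAREHOUSE  = transform_wh
AS
SELECT customer_id,
       DATE_TRUNC('day', order_ts) AS order_day,
       SUM(amount)                 AS daily_amount
FROM   raw_orders
GROUP BY customer_id, DATE_TRUNC('day', order_ts);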

 

Hybrid Tables are a newer concept that merges Snowflake’s columnar micro-partition storage with row-oriented features, making it easier to handle fast, high-volume inserts or updates. They aim to support both analytic (OLAP) and transactional (OLTP-like) workloads in a single Snowflake environment. A sketch follows the list below.

  • Faster Row-Level Operations: Traditional Snowflake tables can handle inserts and updates at scale, but they’re optimized primarily for analytical reads. Hybrid Tables aim to make row-level operations more efficient.
  • Mixed Workload Support: Combine real time event ingestion (often associated with row-level databases) and analytical querying (where column-based storage excels).
  • Reduced Latency: By better handling small transactions and frequent data changes, Hybrid Tables can help lower the time it takes to get new data ready for querying.
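
A minimal sketch of a Hybrid Table definition with illustrative names; note that Hybrid Tables require a primary key:

-- Row-optimized table for fast, frequent inserts and updates
CREATE HYBRID TABLE events_current (
    event_id   NUMBER        NOT NULL PRIMARY KEY,
    device_id  VARCHAR(50)   NOT NULL,
    event_ts   TIMESTAMP_NTZ,
    payload    VARCHAR(2000)
);
-- Frequent single-row writes land here, while large-scale history and analytics
-- can remain in standard micro-partitioned tables.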

Snowpipe and Snowpipe Streaming

Snowpipe is Snowflake’s continuous data ingestion service that loads data from external or internal stages. Typically, you point Snowpipe at a cloud storage location (e.g., Amazon S3) where new files land, and it automatically imports them.

  • Classic Batch: The COPY command is often scheduled to run at fixed intervals (e.g., hourly or daily), which can introduce latency.
  • Snowpipe: Instead of waiting for a scheduled batch, Snowpipe automatically pulls in smaller file increments soon after they arrive in the stage. This reduces the time between data generation and availability for queries.

However, Snowpipe is optimized for continuous loads of small files – think micro-batches, rather than large single-file loads. If your data volume spikes significantly, you might face higher costs or need to switch to different ingestion strategies to maintain throughput.
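
For illustration, a Snowpipe definition can look like the following sketch; the stage, file format, and target table are assumptions:

-- Continuous ingestion from a cloud stage; new files trigger the load via storage notifications
CREATE OR REPLACE PIPE raw.orders_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO raw.orders_landing          -- assumed to have a single VARIANT column for the raw JSON
FROM @raw.orders_stage
FILE_FORMAT = (TYPE = 'JSON');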

Snowpipe Streaming is an API-based approach that writes data directly to Snowflake tables, bypassing intermediate storage altogether. This can reduce latency and loading times even further than standard Snowpipe.

Why is it “More Real Time”? Data arrives instantly (or very close to it) in Snowflake, enabling near real time reporting. You don’t need to wait for batch files to land in cloud storage, and it’s often more cost-effective at scale, especially for high-frequency, small data events.

Additionally, the Snowflake Connector for Apache Kafka supports Snowpipe Streaming, allowing a near-seamless way to flow messages from Kafka topics straight into Snowflake tables. This integration is a significant step toward bridging the gap between streaming data platforms and Snowflake’s cloud data warehouse.

Snowflake Streams

Lastly, a Snowflake “Stream” tracks changes made to a table (inserts, updates, and deletes). This is particularly useful for Change Data Capture (CDC) workflows, enabling efficient processing of real time data.
You can design pipelines that react to these CDC streams, triggering transformations or downstream processes. This is ideal for scenarios where real time data updates must flow into transactional or operational systems.
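
A hedged sketch of this pattern combines a Stream with a Task that consumes it; object names, columns, and the schedule are assumptions:

-- Track changes on the source table
CREATE OR REPLACE STREAM orders_changes ON TABLE raw.orders;

-- Consume the stream whenever it contains new change records
CREATE OR REPLACE TASK propagate_order_changes
  WAREHOUSE = transform_wh
  SCHEDULE  = '1 minute'
WHEN SYSTEM$STREAM_HAS_DATA('orders_changes')
AS
INSERT INTO rdv.stg_order_changes
SELECT order_id,
       customer_id,
       amount,
       METADATA$ACTION,          -- INSERT or DELETE
       METADATA$ISUPDATE         -- TRUE when the change is part of an update
FROM   orders_changes;
-- Tasks are created suspended; run ALTER TASK propagate_order_changes RESUME to activate it.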

Limitations and Possibilities of Real-Time Data in Snowflake

After breaking down some key factors, we will examine several limitations and possibilities when working with Snowflake and real time data.

While Snowflake aims for near real time processing, it isn’t designed as an ultra-low-latency, sub-second event-processing engine. Rapid, continuous ingestion can drive up costs if you’re not careful with warehouse sizing and auto-suspend settings. Each warehouse or streaming component could incur charges for frequent queries and data loads.

Monitoring multiple data streams, scaling them and managing their throughput can be complex. You’ll need robust DevOps and DataOps practices to ensure data integrity and consistent performance.

On the other hand, you can enable real time dashboards and analytics. With Snowpipe Streaming and the Kafka Connector, you can feed data directly into Snowflake and power dashboards that update within seconds or minutes. CDC streams enable event-driven pipelines where changes in your transactional databases trigger immediate actions in Snowflake, such as downstream transformations or alerts. Snowflake’s separation of storage and compute makes it easier to handle spiky workloads. Real time pipelines can scale up independently of other batch or analytical tasks, ensuring smooth parallel processing.

Many organizations mix real time flows with traditional batch loads. Snowflake’s architecture can accommodate both simultaneously without conflict.

Conclusion

By bringing together Hybrid Tables for row-based writes, Snowpipe Streaming for near real time ingestion, Snowflake Streams for continuous change detection, and Dynamic Tables for automated transformations, organizations can introduce real time data flows on top of their existing batch approach. This collaboration of Snowflake features opens up opportunities to capture high-velocity data and process it almost immediately, enabling decision-makers to access live metrics, personalize customer interactions as they happen, optimize operational processes, and detect anomalies or fraud with minimal delay. When combined with standard batch ingestion, this architecture preserves historical data and large-scale analytics while adding a complementary layer for low-latency, event-driven insights.

However, there are challenges and limitations to consider. More frequent writes can drive up compute costs, especially if data arrives at unpredictable volumes and velocities. Ensuring consistent data models becomes trickier if schema changes occur rapidly, as real time pipelines need to keep up with adjustments in source systems. Hybrid Tables, while excellent for rapid inserts, may not always yield the same query performance as columnar tables when it comes to large-scale analytical workloads; likewise, the cost overhead of always-on streaming pipelines can be significant if not carefully sized and monitored. Complexity also rises with more moving parts, and a robust DataOps or DevOps framework is critical for controlling ingestion, transformation, and monitoring across multiple modes of data processing.

Still, integrating these real time capabilities into an existing batch-loading architecture can be done gradually by setting up a dedicated real time pipeline that feeds into Snowflake alongside the traditional bulk loads. The new pipeline could ingest event data using Snowpipe Streaming and store it initially in Hybrid Tables, where Streams and Dynamic Tables can process changes in near real time. Historical data loads and heavy analytical queries would continue using the existing batch approach with columnar, micro-partitioned tables. Over time, teams can merge both approaches at the consumption layer—whether in BI dashboards, ML pipelines, or operational analytics—to combine the immediacy of real time data with the depth of historical context, all within the same Snowflake ecosystem.

Microbatch Incremental Models: A New Approach to Large Time-Series Data

What is a Microbatch?

Microbatch is an innovative incremental strategy designed for large time-series datasets. Introduced in dbt Core version 1.9 (currently in beta), it complements existing incremental strategies by offering a structured and efficient way to process data in batches.



Key features of Microbatch include:

  • Utilizes a time column to define batch ranges.
  • Supports reprocessing failed batches.
  • Auto-detects parallel batch execution.
  • Eliminates complex conditional logic for backfilling.

However, it’s not suitable for datasets lacking a reliable time column or requiring fine-grained control over processing logic.

How Microbatches Work

Microbatching works by splitting model processing into multiple queries (batches) based on:

  • event_time: The time column defining batch ranges.
  • batch_size: The time period for each batch (hour, day [default], month, year).

Each batch functions as an independent, atomic unit, meaning:

  • Batches can be processed, retried, or replaced individually.
  • Parallel execution enables separate, idempotent batch processing.

Batch replacement strategies vary by database adapter:

  • Postgres: Uses merge.
  • BigQuery, Spark: Uses insert_overwrite.
  • Databricks: Uses replace_where.
  • Redshift, Snowflake: Uses delete + insert.

Microbatch Model Configurations

When setting up a Microbatch model, the following configurations are required:

  • event_time: Specifies the time column in UTC.
  • batch_size: Defines batch granularity (hour, day, month, year).
  • begin: Sets the start point for initial or full-refresh builds.

Optional configurations include (a model sketch follows this list):

  • lookback: Processes prior batches for late-arriving records.
  • concurrent_batches: Controls parallel execution (auto-detected by default).
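
Putting these options together, a microbatch model could be configured like the sketch below; the model name, columns, and values are illustrative, and the upstream model is assumed to declare its own event_time so dbt can filter it per batch.

-- models/fct_page_views.sql (illustrative)
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_at',      -- UTC time column defining the batch ranges
        batch_size='day',
        begin='2024-01-01',         -- start point for initial and full-refresh builds
        lookback=3                  -- reprocess three prior batches for late-arriving records
    )
}}

select
    page_id,
    user_id,
    event_at
from {{ ref('stg_page_views') }}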

Running Batches in Parallel

Parallel execution is automatically detected based on batch conditions and adapter support. However, users can override this behavior using the concurrent_batches setting.

Parallel execution is possible when:

  • The batch is neither the first nor last in the sequence.
  • The database adapter supports parallel execution.
  • The model logic does not depend on execution order.

How to Backload Data

Backloading allows reprocessing historical data within a specific time range using the following command:

dbt run --event-time-start "2025-02-01" --event-time-end "2025-02-03"

This ensures that only batches within the defined range are processed independently.

Microbatch vs. Other Incremental Strategies

Microbatch differs from traditional incremental strategies by:

  • Using independent queries for time-based batches.
  • Eliminating the need for is_incremental() and complex SQL logic.
  • Automatically selecting the most efficient operation (insert, update, replace) for each platform.

Conclusion

Microbatch is a powerful new approach to incremental data processing in dbt Core. By breaking down large datasets into manageable, parallelizable chunks, it simplifies data modeling while improving efficiency and scalability. However, it is essential to consider whether Microbatch suits your data pipeline’s requirements before implementing it.

Watch the Video

Effectivity Satellites on Links

Watch the Video

Understanding Effectivity Satellites in Data Vault

Effectivity satellites play a crucial role in Data Vault modeling by tracking changes and deletions in source systems. In this article, we’ll explore when to use an effectivity satellite, how it differs from a regular satellite, and the best practices for implementing it.

What is an Effectivity Satellite?

An effectivity satellite is essentially a standard satellite used to capture business time attributes such as valid-from and valid-to dates, contract start and end dates, and deletion timestamps. The key distinction is that it tracks soft deletions from source systems, ensuring that historical data integrity is maintained in the Data Vault.

When Should You Use an Effectivity Satellite?

Effectivity satellites should be used when you need to track historical changes in relationships and entities, especially deletions. Common use cases include:

  • Tracking contract start and end dates
  • Monitoring employee and corporate car assignments
  • Managing customer records and their deletion status

Difference Between Regular and Effectivity Satellites

Regular satellites store descriptive attributes like names and addresses, whereas effectivity satellites focus on time-based attributes and deletion markers. While regular satellites track changes in data, effectivity satellites specifically manage record deletions and validity periods.

Choosing Between Link Satellites and Effectivity Satellites

Link satellites capture changes in relationships between entities, whereas effectivity satellites track the validity of those relationships. You should choose an effectivity satellite when:

  • Deletions need to be recorded without physically removing data
  • You need to track when a relationship was created and ended
  • Historical relationship integrity must be preserved

Example: Employee and Corporate Car Assignment

Consider an employee assigned to a corporate car. When the assignment changes, a new link entry is created between the employee and the new car. The effectivity satellite records the deletion timestamp for the old relationship and maintains the history of assignments.

Handling Deletions in Effectivity Satellites

One of the main challenges in effectivity satellites is detecting and handling deletions. Different data loading methods impact how deletions are recorded:

  • Full Loads: Compare current and previous loads to identify missing records (see the sketch after this list).
  • Change Data Capture (CDC): Uses system flags to detect deletions.
  • Delta Loads: Requires additional logic to identify removed records.
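
For the full-load case, a hedged sketch of such a comparison might look like this; all object names are assumptions:

-- Keys present in the previous load but missing from the current one are marked as deleted
INSERT INTO rdv.sat_employee_car_eff (hk_employee_car, load_date, is_deleted)
SELECT prev.hk_employee_car,
       CURRENT_TIMESTAMP,
       TRUE
FROM   rdv.link_employee_car prev
WHERE  NOT EXISTS (
           SELECT 1
           FROM   stg.employee_car_current cur
           WHERE  cur.hk_employee_car = prev.hk_employee_car
       );
-- A real load would also check that the key is not already flagged as deleted.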

Effectivity Satellites in Business Vault

In a Business Vault, effectivity satellites can be used to implement business rules for tracking deletions. For instance, a customer may be considered deleted only if removed from all source systems, which requires a business-driven deletion logic.

Conclusion

Effectivity satellites are essential in Data Vault modeling for tracking deletions and historical changes. Understanding their role and choosing the right satellite type ensures accurate data lineage and integrity.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
