Load Date vs Snapshot Date in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke delves into a pertinent question posed by our audience about the use of a Snapshot Date.

“What is the purpose of the Snapshot Date and how does it relate to the load date?”

In this enlightening video, Michael explores the significance of both the load date timestamp and the snapshot date timestamp within a well-designed Data Vault 2.0 architecture. Acknowledging their crucial roles as timelines, he provides a clear and insightful explanation of how these dates function and interrelate in the context of Data Vault 2.0.

Understanding the nuances of these temporal elements is key to optimizing data management within the Data Vault framework, and Michael’s explanation offers valuable insights for both newcomers and experienced practitioners in the field.

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years, he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

What’s New in Data Vault?

Data Vault 2.0

Attendees of our Data Vault 2.0 trainings often ask us what the difference is between the book “Building a Scalable Data Warehouse with Data Vault 2.0” and our current consulting practice. To give a definitive answer, at least for now, we have written this article.

What’s new in Data Vault?

This webinar discusses recent updates in Data Vault 2.0 practices since the publication of the book “Building a Scalable Data Warehouse with Data Vault 2.0.” The key changes include the virtualization of the Load End Date using window functions to enhance efficiency, the adoption of hybrid architectures that integrate data lakes for staging purposes, and the shift of business timelines into the dimensional model for greater flexibility. Additionally, there is an emphasis on using Snapshot Date Timestamps in Business Vault satellites to simplify business rule development. The article clarifies that these updates are enhancements within the existing Data Vault 2.0 framework, referred to as “Data Vault 2.0.1,” rather than a new version like Data Vault 3.0.

Watch webinar recording

Data Vault 3.0?

First of all, there is no Data Vault 3.0. We like to refer to the version of Data Vault that we teach and apply in our consulting practice as “Data Vault 2.0.1”. There are, therefore, only slight differences from the version we used in the book. Some of these changes look big, but they don’t really modify the underlying concept, and that is what matters for version changes. So, based on the concept, we only see minor changes or enhancements. They might have a larger impact on the actual implementation, but that is only due to technology changes.

How to Fit Data Vault 2.0 Between Two Covers

On top of the minor changes to the concept, there is another factor at play: when writing the book, we had to scope it to fit between the covers, as the publisher had a page limit (which we actually exceeded a bit). Therefore, we did not apply all Data Vault 2.0 concepts in the book: for example, real-time concepts are not covered, and we did not use a data lake or cloud computing in our examples. Instead, we only briefly covered these concepts for completeness and focused on on-premise technology, which was more commonly used in actual projects back then. In 2012, cloud computing was already available and widely used, but it was easier to reach more readers by using on-premise technologies.

With that in mind, what has changed since the book came out? 

Removal of Load End Date

The most obvious change is the removal of the Load End Date. In our actual projects, we don’t completely remove it from the Data Vault 2.0 model; we virtualize it by removing the Load End Date from the underlying satellite table and calculating it virtually in a view on top of the satellite table using a window function (typically LEAD, but LAG is also possible). This way, we can get rid of the update procedure that maintains the Load End Date (the so-called end-dating process) while preserving the query patterns downstream. Keep in mind that it might also be more efficient to use the window function when loading PIT tables or satellites; in that case, the query layer is actually only for power users who query the Data Vault model directly.
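As a minimal sketch of this virtualization, assuming a satellite table sat_customer with the hash key hk_customer and the load date timestamp ldts (all names here are hypothetical), the view could look like this:

```sql
-- Minimal sketch: virtualizing the Load End Date with a window function.
-- Table and column names (sat_customer, hk_customer, ldts) are assumptions.
CREATE VIEW sat_customer_end_dated AS
SELECT
    hk_customer,
    ldts,
    hashdiff,
    customer_name,
    COALESCE(
        LEAD(ldts) OVER (
            PARTITION BY hk_customer   -- one timeline per parent hash key
            ORDER BY ldts              -- ordered by load date
        ),
        TIMESTAMP '9999-12-31 00:00:00'  -- "end of all times" for the current delta
    ) AS load_end_date
FROM sat_customer;
```

Downstream consumers keep querying load_end_date as before, while the physical satellite remains insert-only.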

That’s actually not an update: even in 2012 (and before), we used these approaches, but they didn’t work well on SQL Server 2014, which was used in the book, because the analytical window function required for this approach was way too slow. However, performance improved with SQL Server 2016. To get rid of the Load End Date in scenarios where the window function is too slow or simply doesn’t exist, a new solution has emerged: the use of a snapshot partition in the PIT table. The “end of all times” is used in the PIT table to refer to the latest delta in each satellite for the delta check. Once the PIT table is available, it can also be used to produce SCD Type 1 dimensions (without history), and therefore the need for the Load End Date (or a fast replacement as described above) only exists during loading.
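A hedged sketch of the snapshot-partition idea: the PIT table carries one logical partition whose snapshot date is the end of all times, and each row in it points to the latest satellite delta per hub key (all table and column names below are assumptions):

```sql
-- Hypothetical sketch: refreshing the "end of all times" partition of a PIT
-- table so that it points to the latest delta per hub key in the satellite.
DELETE FROM pit_customer
WHERE sdts = TIMESTAMP '9999-12-31 00:00:00';

INSERT INTO pit_customer (hk_customer, sdts, sat_customer_ldts)
SELECT
    hk_customer,
    TIMESTAMP '9999-12-31 00:00:00',  -- snapshot partition for "end of all times"
    MAX(ldts)                         -- latest delta in the satellite
FROM sat_customer
GROUP BY hk_customer;
```

Loading procedures can then join this partition for the delta check instead of scanning the satellite for the most recent record.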

Hybrid Architecture

The next obvious change is the use of a hybrid architecture in which a data lake is used for staging purposes. In the book, we focused on the on-premise Microsoft SQL Server stack, which didn’t include a distributed file system. We had already recommended that clients use a data lake for staging purposes, in an architecture we called the “hybrid architecture.” Back then, only a few followed the advice, but today most clients use the hybrid architecture for their data warehouse. We now consider a relational staging area an anti-pattern and no longer recommend it to clients (with some exceptions).

Multi Temporal Data Vault 2.0

The book describes a temporal PIT (TPIT) for building multi-temporal solutions. While the pattern is still valid and relatively flexible compared to other solutions, today we typically move the business timelines into the dimensional model for the highest flexibility; TPITs are less flexible but offer higher performance. This is discussed in more detail in the Multi-Temporal Data Vault 2.0 class.

Snapshot Date

Satellites in the Business Vault might use a Snapshot Date Timestamp instead of a Load Date Timestamp. The idea is that, as a general practice, a Business Vault entity should always reuse an existing granularity, which makes it much easier to develop Business Vault solutions. With the snapshot date in the primary key of the satellite, business rules can operate directly on the outgoing information granularity. This is discussed in more detail in the Data Vault 2.0 Information Delivery class.
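As an illustration, a Business Vault satellite keyed by the snapshot date could be defined as follows (a sketch only; the entity and its attributes are hypothetical):

```sql
-- Hypothetical sketch: a Business Vault satellite whose primary key uses a
-- snapshot date timestamp (sdts) instead of a load date timestamp (ldts),
-- so it shares the granularity of the snapshots it is computed for.
CREATE TABLE bv_sat_customer_score (
    hk_customer  CHAR(32)  NOT NULL,  -- hash key of the parent hub
    sdts         TIMESTAMP NOT NULL,  -- snapshot date timestamp in the key
    credit_score INTEGER,             -- derived by a soft business rule
    PRIMARY KEY (hk_customer, sdts)
);
```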

Conclusion

And finally, we also made some mistakes. We are willing (not happy) to admit them on our errata page for the book. The page is served directly from our internal Wiki, and we promise to keep it updated (however, we haven’t received additional reports lately).

Modeling Invoices in Data Vault

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO, Michael Olschimke, engages with a relevant and practical question from our audience about Data Vault modeling.

“What are the best practices for modeling Data Vault table structure to store invoice data?”

In this concise yet informative video, Michael shares valuable insights into the best practices for designing Data Vault table structures specifically tailored for storing invoice data. Recognizing the importance of effectively modeling this type of data, Michael addresses key considerations and potential challenges, and recommends optimal approaches to ensure a robust and scalable solution.

For those seeking guidance on structuring Data Vault tables for invoice data, this video serves as a quick and insightful resource.

Real-Time Loading of CDC Packages in Data Vault – PART 2

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO Michael Olschimke delves into a pertinent question posed by a member of our audience.

“I would be interested in some ideas about how to load data from Apache Kafka. In our case, we receive CDC data from DB servers over Apache Kafka.

One specific concern raised is about maintaining the correct sequence of data in Raw Vault when dealing with different partitions in a Kafka topic. This becomes particularly crucial in scenarios involving Change Data Capture (CDC) from database servers.”

In this enlightening video, Michael provides insightful ideas and strategies for effectively loading data from Apache Kafka while ensuring the integrity of the sequence in the Raw Vault. He tackles the nuances of handling different partitions within a Kafka topic, offering practical guidance to address challenges associated with maintaining data order.

For those navigating the intricacies of data loading from Apache Kafka, this video provides valuable insights and solutions.

Logical Industry Models in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a thought-provoking question raised by a member of our audience.

“In our data warehouse architecture, we have integrated a Data Vault 2.0 approach with a logical industry model: once the raw data has been loaded into the Raw Data Vault, the data undergoes a transformation into the Business Vault. This follows the logical design, maintaining the Data Vault style but derived from the industry model. Subsequently, it is further transformed into the business access layer, forming an information mart.

However, despite the initial intent of adopting a best-of-breed approach, there’s a realization that somewhere along the way, agility was compromised.”

In this enlightening video, Michael delves into the challenges faced in a data warehouse architecture that combines a Data Vault 2.0 approach with a logical industry model. Specifically, he addresses the placement of logical vendor models within this framework, exploring ways to maintain agility in the process.

Real-Time Loading of CDC Packages in Data Vault – PART 1

Watch the Video

In the latest installment of our Data Vault Friday series, our CEO Michael Olschimke addresses a pertinent question posed by an audience member.

“I would be interested in some ideas about how to load data from Apache Kafka. In our case, we receive CDC data from DB servers over Apache Kafka.

Should the data be converted from AVRO/JSON format to database format in Staging / Raw Vault? Or should it be loaded directly in an unchanged format? What is the Best Practice here?”

In this insightful video, Michael provides practical guidance on loading data from Apache Kafka, specifically when dealing with CDC (Change Data Capture) information from database servers. He explores the options of converting data from AVRO/JSON format to a database format within Staging/Raw Vault versus loading it directly in its original unchanged format.

For those navigating the complexities of data loading from Apache Kafka, this video offers valuable insights and best practices to inform decision-making in data architecture.

Data Vault 2.0 with DBT – Part 1

Data Vault 2.0 with dbt

This article focuses on the benefits of using Data Vault 2.0 with dbt and the importance of choosing the correct implementation tools. Data is an important asset in the decision-making process. As we have previously discussed in another post, Data Vault 2.0 is the right choice when the goal of an enterprise data warehouse is to have fully historized and integrated data. It is also better suited to instances in which data from many source systems needs to be combined. You can find the previous blog post here.

While Data Vault 2.0 focuses on the “what”, there are many options for the “how”: technically translating a Data Vault model into physical tables and views in the enterprise data warehouse, as well as orchestrating and running the loading procedures. This is where Data Vault 2.0 with dbt creates an effective solution.

About dbt

dbt (data build tool) transforms your data directly in your data warehouse; dbt is the “T” in ELT. It therefore assumes that the data is already loaded into a database that it can query. In contrast, ETL extracts data and transforms it before loading it into the target. With ELT, the data is not transformed as it moves into the data warehouse. (By “transformation” we mean transformation by soft business rules, which would change the meaning of the data. Of course, we still have to make sure that the data fits into the target table (data type adjustments, etc.); these are the so-called “hard rules”.)

dbt is particularly useful with cloud DWH solutions such as Snowflake, Azure Synapse Analytics, BigQuery, and Redshift, and it performs transformations and modeling directly on the database in order to take advantage of the performance of these highly scalable platforms.

How dbt works

Models and SQL statements can be easily created, tested, and managed in dbt itself. A powerful combination of the templating language Jinja2 and the all-time classic SQL allows users to build models. The simple interface enables data analysts without engineering know-how to initiate appropriate transformations, making data team workflows more efficient and cost-effective. Behind the tool sits an open-source community that constantly and passionately develops it further. dbt is available both as a free, reduced core version and as a comprehensive and flexible cloud version.
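To illustrate the combination, here is a minimal dbt model mixing Jinja2 control flow with SQL (the stg_orders model and the status values are assumptions for this sketch):

```sql
-- models/orders_pivoted.sql: a minimal sketch of Jinja2 inside a dbt model.
SELECT
    order_date,
    {% for status in ['placed', 'shipped', 'returned'] %}
    SUM(CASE WHEN status = '{{ status }}' THEN 1 ELSE 0 END)
        AS {{ status }}_count{% if not loop.last %},{% endif %}
    {% endfor %}
FROM {{ ref('stg_orders') }}
GROUP BY order_date
```

The {{ ref() }} function resolves the dependency on another model, which is how dbt derives the execution order of the whole pipeline.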

How does Data Vault 2.0 with dbt work?

Part of the Data Vault 2.0 methodology is the modeling approach, which focuses on how to design the core data warehouse as a scalable solution. The core Data Vault entities are hubs, links, and satellites. Using Data Vault 2.0 with dbt provides the ability to generate Data Vault models and also allows you to write your data transformations using SQL and code-reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way, as the hub-loading sketch below illustrates.
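For example, a hub-loading model might look like the following sketch (an assumption, not an official template; the staging model stg_customer and its columns are hypothetical):

```sql
-- models/raw_vault/hub_customer.sql: a sketch of loading a hub incrementally.
{{ config(materialized='incremental', unique_key='hk_customer') }}

SELECT
    hk_customer,        -- hash key computed in the staging layer
    customer_id,        -- business key
    MIN(ldts) AS ldts,  -- first time the business key was seen
    MIN(rsrc) AS rsrc   -- deterministic pick of the record source
FROM {{ ref('stg_customer') }}
{% if is_incremental() %}
  -- only insert business keys that are not yet in the hub
  WHERE hk_customer NOT IN (SELECT hk_customer FROM {{ this }})
{% endif %}
GROUP BY hk_customer, customer_id
```

The incremental materialization keeps the hub insert-only, matching the Data Vault 2.0 loading pattern.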

Conclusion

dbt does not reinvent the wheel, but when it comes to building a new EDW (especially in the cloud), it provides a very helpful basic framework with many important functions for continuous integration and deployment already defined. dbt brings the standards of software development into the world of data transformation. This allows developers to concentrate on the core tasks of data modeling and business logic. Especially for smaller projects (but not only for them), this tool offers a lightweight and extremely affordable alternative to other data warehouse automation solutions.

– by Ole Bause (Scalefree)

NULL Business Keys in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO, Michael Olschimke, takes a moment to delve into a thought-provoking question raised by our audience.

“Should the Business Key be a NOT NULL column at the source?”

This succinct yet critical query is the focus of this brief but insightful video. Michael engages with the nuances of the business key, exploring whether it should be a mandatory, not-null column at the source. As he unpacks the considerations, the audience gains valuable insights into the implications and potential advantages of enforcing the not-null constraint on the business key.

For those seeking clarity on best practices surrounding business key management, this video provides concise guidance.

Managed Self-Service Industrialization in Data Vault

Watch the Video

As part of our continuing Data Vault Friday series, our CEO, Michael Olschimke, engages with a pertinent question posed by our audience.

“How does the industrialization work in Managed Self-Service BI?”

In this succinct yet informative video, Michael delves into the intricacies of the industrialization process within the realm of Managed Self-Service Business Intelligence (BI). The audience is treated to a valuable discussion on the methodologies and practices involved in streamlining and scaling BI processes for efficient and consistent outcomes.

Michael sheds light on the significance of industrialization in the context of Self-Service BI scenarios, providing insights that are relevant for both beginners and seasoned professionals in the field.

Sharding in Data Vault 2.0

Watch the Video

In our continuing Data Vault Friday series, our CEO, Michael Olschimke, engages with an intriguing question posed by our audience.

“How does sharding work in Data Vault 2.0?”

In this illuminating video, Michael takes us on a journey to explore the intricacies of sharding within the context of Data Vault 2.0. Delving into the technical aspects, he provides insights into the process of laying out data on a Massively Parallel Processing (MPP) cluster. Interestingly, Michael shares his expertise from the comfort of his personal MPP cluster located in his home basement, adding a unique and practical dimension to the discussion.

For those seeking a deeper understanding of sharding techniques and their implementation in Data Vault 2.0, this video serves as a valuable resource.

Utilizing Potentials of Data Vault 2.0 – Overcoming Bad Practices – Part 1

Watch the Webinar

What are common mistakes when applying Data Vault 2.0 in enterprise data warehouse projects? Do you have questions regarding Data Vault modeling? Does the realization of GDPR cause you great difficulties, or is your project stuck because it delivers no business value?

This webinar describes common anti-patterns of Data Vault 2.0, their consequences, and the solutions to eliminate them from your current or future projects.

Tune in to learn how to avoid Data Vault 2.0 bad practices and apply simple solutions.

Watch Webinar Recording

Webinar Agenda

1. How to use Data Vault 2.0 for modeling business information
2. How to avoid the pitfalls of being unable to deliver business value
3. How to mask Business Keys from Hubs for privacy
