Skip to main content
search
0

Data Vault 2.0 with DBT – Part 2

Data Vault 2.0 with dbt

Data Vault 2.0 with dbt

In the first part of this blog series we introduced you to dbt. Now let’s take a look at how you can implement Data Vault 2.0 with dbt and what advantages this offers. If you don’t know the first part yet, you can read it here.

 

Dbt Models

dbt provides the ability to create models and generate and execute SQL dynamically out of these models. This allows you to write your data transformations in models using SQL and reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way. However, the most important part for the Data Vault use case is the ability to define and use those macros.

But first we should clarify how models work in dbt.

This tool handles the compilation and execution of models written using SQL and the Jinja macro language. Each model consists of exactly one SQL SELECT statement. Jinja code is translated into SQL during compilation.

Data Vault 2.0 with dbt

This figure shows a simple model in dbt. A big advantage of Jinja is the possibility to generate sql programmatically, for example with loops and conditions. Also, through the ref() functions, dbt understands the relationships between the models and builds a dependency graph. This ensures that the models are built in the correct order and that the data lineage is documented. A lineage graph may look like this:

Data Vault 2.0 with dbt

The materialization of models can be defined at different configuration levels. This allows fast prototyping with views and switching to materialized tables if necessary for performance reasons.

Data Vault 2.0 and Macros

But how can we implement Data Vault 2.0 with dbt? The most important part for using Data Vault 2.0 is the ability to define and use macros. Macros can be called in models and then generate into that macro additional SQL or even the entire SQL code.

For example, you could create a macro to generate a hub that gets the source/staging model as an input parameter, as well as the specification of the columns for the business key(s), load date and record source. The sql code for the hub is then generated dynamically from this. The advantage of this is that a change to the macro would directly affect each individual hub, which greatly improves maintainability.

At this point, you also gain huge benefits from the active open-source community around dbt. There are many open-source packages with which it can be extended.

There are also already some packages that are perfect for using Data Vault 2.0 with dbt. 

For example, our own open-source package DataVault4dbt, developed and actively maintained here at Scalefree, provides a comprehensive set of dbt macros to translate a Data Vault model “on paper” into actual tables and views—such as Hubs, Links, Satellites, and more. The package is actively used in real-world projects and helps enforce best practices for a modern, audit-friendly Data Vault 2.0 implementation.

To explore all features and macro parameters in detail, check out the documentation.

The only thing you need in your model, for example for a hub, is just one macro call:

{%-

hub(src_pk, src_nk, src_ldts, src_source, source_model) 

-%}

With the parameters of the macro call you define the source table where the columns can be found (source_model) and the column names for the hash-key (src_pk), business key(s) (src_nk), load date (src_ldts) and the record source (src_source) column. When the model and the macro(s) defined in the model are executed, the SQL gets compiled and processed on the database system.

The metadata needed can for example be defined in variables with jinja directly in the model:

What you can also see is that this tool provides different options for the materialization. The incremental materialization for example, will load an entity as a table on an incremental basis.

When the model is executed, dbt generates the whole sql out of the macro and decides how the records are loaded. If the hub table does not exist, yet it is created and all records are loaded, if it already exists, the table is loaded incrementally. 

For people who tried or managed to implement a Data Vault with “vanilla” SQL, you will realise that this is a real game-changer. The team can now focus entirely on the Data Vault design itself. Once the metadata is identified, dbt, along with your macros, can take care of the entire logic.

Openly available packages can add basic Data Vault 2.0 principles to dbt and therefore allow users to quick-dive into Data Vault implementation. Dbt’s general openness allows you to adjust all macros, for example, to your company or project specific Data Vault flavor to meet your technical and business requirements. 

It is important to note that many of the currently available dbt packages for implementing Data Vault 2.0 deviate from the official standards in some details. Our own open-source package DataVault4dbt, developed and actively maintained here at Scalefree, addresses this gap by supporting all core Data Vault 2.0 entities and the latest best practices. The package is already used in real-world projects to help ensure a modern and audit-friendly implementation.

Conclusion

Integrating Data Vault 2.0 with dbt streamlines the data warehousing process by leveraging dbt’s capabilities to define models and macros, enabling efficient and dynamic SQL generation. This approach allows teams to focus on Data Vault design, with dbt handling the underlying logic.

Our open-source package DataVault4dbt brings these advantages to real-world projects, offering a reliable and standards-aligned way to build Hubs, Links, Satellites, and more.

Modelling Currencies in Non-Historized Links in Data Vault (PART 2)

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a thought-provoking question posed by a member of our audience.

“Is it a good idea to create a HUB for an ‘opposite side Account’? Or maybe we should go even further and try to merge the ‘opposite side Account’ with HUB_ACCOUNT? If yes, what about different IBAN formats in different countries? Do we really want to have accounts from all over the world in our HUB_ACCOUNT?”

In this informative video, Michael delves into the intricacies of data modeling and hub creation, specifically addressing the concept of incorporating an “opposite side Account” into the HUB structure. He explores the potential benefits and challenges associated with merging these entities and offers insights into managing the diversity of IBAN formats across different countries.

For those navigating the complexities of global data integration and considering the structure of their HUB_ACCOUNT, this video provides valuable guidance and practical considerations.

Implementing GDPR in Data Warehousing

Solutions

Implementing GDPR

In the realm of data warehousing, whether it be Data Vault 2.0 or traditional approaches like Kimball and Inmon, data is stored and processed across multiple layers. The intricacies of privacy, particularly the application of security measures and the concept of the “right to be forgotten,” permeate every layer housing personal data.

For privacy implementation, the primary objective is the removal of Personally Identifiable Information (PII) data from each layer. This meticulous process aims to extract PII data, leaving non-PII data intact. In the ideal scenario, this ensures a reduction in consumer data proportionate to the removed PII data.

The General Data Protection Regulation (GDPR) casts a significant influence on data warehouse projects, introducing stringent requirements for data processing and storage. This impact spans across security considerations, determining who has access to what data, and privacy mandates, addressing the right to be forgotten.

ACCESS THE SOLUTION

Modelling Currencies in Non-Historized Links in Data Vault (PART 1)

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a thought-provoking question posed by a member of our audience.

“Is it a good idea to create a HUB for an ‘opposite side Account’? Or maybe we should go even further and try to merge the ‘opposite side Account’ with HUB_ACCOUNT? If yes, what about different IBAN formats in different countries? Do we really want to have accounts from all over the world in our HUB_ACCOUNT?”

In this informative video, Michael tackles the complexities surrounding the creation of a HUB for the “opposite side Account” and explores the possibility of merging it with HUB_ACCOUNT. Delving into the practical considerations of accommodating different IBAN formats across countries, he provides nuanced answers and considerations to guide the decision-making process.

For those grappling with the challenges of data modeling in the context of global account structures, this video offers valuable insights and practical solutions.

Load Date vs Snapshot Date in Data Vault

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke delves into a pertinent question posed by our audience about the use of a Snapshot Date.

“What is the purpose of the Snapshot Date and how does it relate to the load date?”

In this enlightening video, Michael explores the significance of both the load date (TS) and the snapshot date (TS) within a well-designed Data Vault 2.0 architecture. Acknowledging their crucial roles as timelines, he provides a clear and insightful explanation of how these dates function and interrelate in the context of Data Vault 2.0.

Understanding the nuances of these temporal elements is key to optimizing data management within the Data Vault framework, and Michael’s explanation offers valuable insights for both newcomers and experienced practitioners in the field.

What’s New in Data Vault?

Data Vault 2.0

Attendees of our Data Vault 2.0 trainings often ask us what the difference between the book “Building a Scalable Data Warehouse with Data Vault 2.0” and our current consulting practice is. To give a definitive answer, at least for now, we have written this article.

What’s new in Data Vault?

This webinar  discusses recent updates in Data Vault 2.0 practices since the publication of the book “Building a Scalable Data Warehouse with Data Vault 2.0.” The key changes include the virtualization of the Load End Date using window functions to enhance efficiency, the adoption of hybrid architectures that integrate data lakes for staging purposes, and the shift of business timelines into the dimensional model for greater flexibility. Additionally, there’s an emphasis on using Snapshot Date Timestamps in Business Vault satellites to simplify business rule development. The article clarifies that these updates are considered enhancements within the existing Data Vault 2.0 framework, referred to as “Data Vault 2.0.1,” rather than constituting a new version like Data Vault 3.0.

Watch webinar recording

Data Vault 3.0?

First of all, there is no Data Vault 3.0. We like to refer to the version of Data Vault that we teach and apply in our consulting practice, as “Data Vault 2.0.1”. Therefore, there are only slight differences to the version we used in the book. Some of these changes look big, but they don’t really modify the underlying concept and that is what matters regarding version changes. So, based on the concept, we only see minor changes or enhancements. However, they might have some larger impact on the actual implementation, but that is only due to technology changes.

How to Fit Data Vault 2.0 Between Two Covers

On top of the minor changes to the concept, there is also another factor at play here: when writing the book, we had to scope the book to make it fit between the covers as the publisher had a page limit (which we actually exceeded a bit). Therefore, we did not apply all Data Vault 2.0 concepts in the book: for example, real-time concepts are not covered by the book, and we did not use a data lake or cloud computing in our examples. Instead we only briefly covered these concepts for completeness but focused on on-premise technology which was more used in actual projects back then. In 2012, cloud computing was already available and widely used, but it was easier to tailor the book to more readers by the use of on-premise technologies.

With that in mind, what has changed since the book came out? 

Removal of Load End Date

The most obvious change is the removal of the Load End Date. Well, in our actual projects, we don’t completely remove it from the Data Vault 2.0 model, we just virtualize it by removing the Load End Date from the underlying satellite table and virtually calculating it in a view on top of the satellite table using a window function (typically LEAD, but LAG is also possible). This way we can get rid of the update procedure to maintain the Load End Date (the so-called end-dating process) while preserving the query patterns downstream. Keep in mind that it might also be more efficient to use the window function when loading PIT tables or satellites, and therefore the query layer is actually only for power users when they query the Data Vault model directly. 

That’s actually not an update: even in 2012 (and before) we used these approaches but they didn’t work on SQL Server 2014, which was used in the book as the analytical window function required for this approach is ways too slow. However, it improved in 2016. To get rid of the Load End Date in such scenarios where the window function is too slow or just don’t exist, a new solution has emerged: the use of a snapshot partition in the PIT table. The end of all times is used in the PIT table to refer to the latest delta in each satellite for the delta check. Once the PIT table is available, it can also be used to produce SCD Type 1 dimensions (without history) and therefore the need for the Load End Date (or a fast replacement as described above) only exists in loading.

Hybrid Architecture

The next obvious change is the use of a hybrid architecture where a data lake is used for staging purposes. In the book, we focused on the on-premise Microsoft SQL Server stack that didn’t include a distributed file system. We already recommended to clients to use a data lake for staging purposes, in an architecture we called the “hybrid architecture.” Back then, only a few followed the advice, but today most clients use the hybrid architecture for their data warehouse. We actually consider a relational staging area an anti-pattern and do not recommend it anymore to clients (with some exceptions). 

Multi Temporal Data Vault 2.0

The book describes a temporal PIT (TPIT) for building multi-temporal solutions. While the pattern is still valid and relatively flexible compared to other solutions, today we typically move the business timelines into the dimensional model for highest flexibility. This is discussed in more detail in the Multi-Temporal Data Vault 2.0 class. TPITs are less flexible but have higher performance.

Snapshot Date

Satellites in the Business Vault might use a Snapshot Date Timestamp instead of a Load Date Timestamp. The idea is that, as a general practice, a Business Vault entity should always reuse an existing granularity. It just becomes much easier to develop Business Vault solutions. With the Snapshot date in the primary key of the satellite, this becomes much easier for business rules on the outgoing information granularitiy. This is discussed in more detail in the Data Vault 2.0 Information Delivery class

Conclusion

And finally, we also made some mistakes. We are willing (not happy) to admit it in our errata page for the book – this link is directly from our internal Wiki and we promise to keep it updated (however, we didn’t receive additional reports lately).

Modeling Invoices in Data Vault

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO, Michael Olschimke, engages with a relevant and practical question from our audience about Data Vault modeling.

“What are the best practices for modeling Data Vault table structure to store invoice data?”

In this concise yet informative video, Michael shares valuable insights into the best practices for designing Data Vault table structures specifically tailored for storing invoice data. Recognizing the importance of effectively modeling this type of data, Michael addresses key considerations, and potential challenges, and recommends optimal approaches to ensure a robust and scalable solution.

For those seeking guidance on structuring Data Vault tables for invoice data, this video serves as a quick and insightful resource.

Real-Time Loading of CDC Packages in Data Vault – PART 2

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO Michael Olschimke delves into a pertinent question posed by a member of our audience.

“I would be interested in some ideas about how to load data from Apache Kafka. In our case, we receive CDC data from DB servers over Apache Kafka.

One specific concern raised is about maintaining the correct sequence of data in Raw Vault when dealing with different partitions in a Kafka topic. This becomes particularly crucial in scenarios involving Change Data Capture (CDC) from database servers.”

In this enlightening video, Michael provides insightful ideas and strategies for effectively loading data from Apache Kafka while ensuring the integrity of the sequence in the Raw Vault. He tackles the nuances of handling different partitions within a Kafka topic, offering practical guidance to address challenges associated with maintaining data order.

For those navigating the intricacies of data loading from Apache Kafka, this video provides valuable insights and solutions.

Data Warehouse Automation – Build or Buy?

Watch the Webinar

In this webinar, we take a sneak peek into one of the hot topics of modern data warehousing, namely Data Warehouse Automation. We would break down the basics of DW automation & how it has brought about a cultural shift in the realm of a modern Data Warehouse & its architecture.

In this regard, we will also touch upon the often-asked question “Build or buy” along with sharing our experience working with several customers who have benefited immensely from automation and the key lessons we have learned as part of our overall DW automation journey.

This webinar is for anyone who loves Data!

Watch Webinar Recording

Webinar Agenda

1. Understanding Data Warehouse Automation
2. Drivers for Decision ING
3. Automation in Data Vault
4. Anti-Patterns in DV Automation
5. Best Practices

Logical Industry Models in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a thought-provoking question raised by a member of our audience.

“In our data warehouse architecture, we have integrated a Data Vault 2.0 approach with a logical industry model: once the raw data has been loaded into the Raw Data Vault, the data undergoes a transformation into the Business Vault. This follows the logical design, maintaining the Data Vault style but derived from the industry model. Subsequently, it is further transformed into the business access layer, forming an information mart.

However, despite the initial intent of adopting a best-of-breed approach, there’s a realization that somewhere along the way, agility was compromised.”

In this enlightening video, Michael delves into the challenges faced in a data warehouse architecture that combines a Data Vault 2.0 approach with a logical industry model. Specifically, he addresses the placement of logical vendor models within this framework, exploring ways to maintain agility in the process.

Real-Time Loading of CDC Packages in Data Vault – PART 1

Watch the Video

In the latest installment of our Data Vault Friday series, our CEO Michael Olschimke addresses a pertinent question posed by an audience member.

“I would be interested in some ideas about how to load data from Apache Kafka. In our case, we receive CDC data from DB servers over Apache Kafka.

Should the data be converted from AVRO/JSON format to database format in Staging / Raw Vault? Or should it be loaded directly in an unchanged format? What is the Best Practice here?”

In this insightful video, Michael provides practical guidance on loading data from Apache Kafka, specifically when dealing with CDC (Change Data Capture) information from database servers. He explores the options of converting data from AVRO/JSON format to a database format within Staging/Raw Vault versus loading it directly in its original unchanged format.

For those navigating the complexities of data loading from Apache Kafka, this video offers valuable insights and best practices to inform decision-making in data architecture.

Data Vault 2.0 with DBT – Part 1

Data Vault 2.0 with dbt

This article focuses on the benefits of using Data Vault 2.0 with dbt, and the importance of choosing the correct implementation tools. Data is an important asset in the decision-making process. As we have previously discussed in another post, Data Vault 2.0 is the right choice when the goal of an enterprise data warehouse is to have fully historized and integrated data. Additionally, it is also better suited to instances in which data from many source systems needs to be combined. You can find the previous blog post here.

While Data Vault 2.0 focuses on the “what”, there are many options for the “how” of technically translating a Data Vault model into physical tables and views in the enterprise data warehouse, as well as for orchestrating and loading/processing the procedures. And this is where Data Vault 2.0 with dbt creates an effective solution.

About dbt

This data build tool transforms your data directly into your data warehouse. For reference, dbt is the “T” in ELT. Therefore, dbt assumes that data is already loaded into a database from which the current database can query. In contrast, ETL extracts data then transforms it and before loading it into the target. With ELT, the data is not transformed as it moves into the data warehouse. (with transform we mean transformation by soft business rules which would change the meaning of the data. Of course, we have to make sure that the data fits into the target table (data type adjustments, etc.). Here we are talking about “hard rules”.)
Dbt is particularly compatible and useful in cloud DWH solutions such as Snowflake, Azure Synapse Analytics, BigQuery, and Redshift and performs transformations and modeling directly on the database in order to take advantage of the performance of these highly scalable platforms.

How dbt works

Models and SQL statements can be easily created, tested, and managed in dbt itself. A powerful combination of the scripting language, Jinja2, and the all-time classic SQL allows users to build models. The simple interface enables data analysts without engineering know-how to initiate appropriate transformations. Data team workflows become more efficient and cost-effective because of this. Behind this tool sits an open-source community that is constantly and passionately developing the tool. As such, dbt is available both as a free, reduced core version and as a comprehensive and flexible cloud version.

How does Data Vault 2.0 with dbt work?

Part of the Data Vault 2.0 methodology is the model that focuses on how to design the core data warehouse with a scalable solution. The core Data Vault entities are hubs, links ,and satellites. That said, using Data Vault 2.0 with dbt provides the ability to generate Data Vault models and also allows you to write your data transformations using SQL and code-reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way.

Data Vault 2.0 with dbt

Conclusion

Dbt does not reinvent the wheel, but when it comes to building a new EDW – especially if it is in the clouds – it provides a very helpful basic framework with many important functions for Continuous Integration and Deployment already defined. dbt brings the standards of software development into the world of data transformation. This allows developers to concentrate on the core tasks of data modeling and business logic. Especially, not only for smaller projects, this tool offers a lightweight and extremely affordable alternative to other data warehouse automation solutions.

Close Menu