Building a scalable Data Platform?

Building a scalable Data Platform? In Intermediate

Coalesce and Data Vault 2.0 – A Perfect Match?

Watch the Webinar

This webinar introduces the data warehousing automation tool coalesce.io and how it can be used to create a Data Vault 2.0-powered data warehouse solution. You will see live demonstrations of the tool and the data vault entities.

Learn why Data Vault 2.0 is the perfect choice for date warehouse automation tools like Coalesce and how Coalesce can kickstart your Data Vault 2.0 solution!

Watch Webinar Recording

Webinar Agenda

1. Introduction to Coalesce
2. Demo Session
3. Introduction to Data Vault 2.0
4. Why Coalesce and Data Vault?
5. Demo Session

Meet the Speakers

Ole Bause

Ole Bause is working in the Business Intelligence area at Scalefree since the beginning fo 2021. He has very good knowledge in the areas of Data Vault 2.0 implementation, data engineering, Python development, and data warehouse automation with dbt.

Tim Kirschke

Tim has a Bachelor’s degree in Applied Mathematics and has been working as a BI consultant for Scalefree since the beginning of 2021. He’s an expert in the design and implementation of BI solutions, with focus on the Data Vault 2.0 methodology. His main areas of expertise are dbt, Coalesce, and BigQuery.

Building a scalable Data Platform? In Data Tools, Intermediate

Kick-Start Your Data Vault 2.0 Implementation with Datavault4DBT

Datavault4dbt

Scalefree has released datavault4dbt. An open source package, that provides best-practice loading templates for Data Vault 2.0 entities, embedded into the open source data warehouse automation tool dbt.

Datavault4dbt currently supports Snowflake, BigQuery and Exasol and comes with a lot of great features:

A Data Vault 2.0 implementation congruent to the original Data Vault 2.0 definition by Dan Linstedt
Ready for both Persistent Staging Areas and Transient Staging Areas, due to the allowance of multiple deltas in all macros, without loosing any intermediate changes
Creating a centralized, snapshot-based Business interface by using a centralized snapshot table supporting logarithmic logic
Optimizing incremental loads by implementing a high-water-mark that also works for entities that are loaded from multiple sources

Kickstart your Data Vault 2.0 Implementation – with datavault4dbt

This webinar delves datavault4dbt, an open-source package by Scalefree that simplifies Data Vault 2.0 implementation in dbt. It provides best-practice templates for hubs, links, and satellites, ensures compliance with Data Vault standards, and supports flexible staging with optimized incremental loads, you won’t want to miss this webinar.

Watch webinar recording

In this article:

Building a Data Vault 2.0 Solution - made easy
From the Stage over the Spine into the PITs
Start now and boost your Data Vault experience!

Building a Data Vault 2.0 Solution – made easy

The overall goal of releasing Data Vault 2.0 templates for dbt is to combine our years of experience in creating and loading Data Vault 2.0 solutions into publicly available loading patterns and best practices for everyone to use. Out of this ambition, datavault4dbt, an open source package for dbt was created and will be maintained by the Scalefree expert team.

The most valuable characteristic of datavault4dbt is that it carnates the original Data Vault 2.0 definition by Dan Linstedt. It represents a fully auditable solution for your Data Vault 2.0 powered Data Warehouse. With a straight-forward, standardized approach, it enables the team to conduct agile development cycles.

By allowing multiple increments per batch while loading each Data Vault entity type, datavault4dbt supports both Persistent and Transient Staging Areas without losing any intermediate changes. These incremental loads are even optimized by implementing a dynamic high-water-mark that even works when loading an entity from multiple sources.

Additionally, datavault4dbt encourages strict naming conventions and standards by implementing a variety of global variables that span across all Data Vault layers and supported Databases. The process of end-dating data is completely virtualized to ensure a modern insert-only approach that avoids updating data.

With all these features, datavault4dbt is the perfect solution for your modern Big Data Enterprise Data Warehouse.

From the Stage over the Spine into the PITs

To achieve all this, we worked hard on creating a solid and universal staging area. All hashkeys and hashdiffs are calculated here and users are given the option to add derived columns, generate prejoins with other stages and add ghost records to their data. All of this highly automated based on parameterized user input.

Based on staging areas, the Data Vault 2.0 spine can be created. Hubs, Links and Non-Historized Links can be loaded from multiple sources including mapping options to ensure business harmonization.

This spine is then enriched by Standard Satellites, Non-Historized Satellites, Multi-Active Satellites and/or Record-Tracking Satellites. All of those that require it come with a version 0 for tables and a version 1 for end-dated views.

Based on the Raw Data Vault, PITs can be created automatically, and their loading is backed by an automated, highly-configurable but optional logarithmic snapshot logic. This logic is included in the Control Snapshot Table, which also comes in two consecutive versions. To wrap the logarithmic snapshot logic up, a post-hook for cleaning up all PITs is included and comes in handy.

Start now and boost your Data Vault experience!

The lines above made you think “Nah, that’s all too good to be true!”? Convince yourself, or give us your highly appreciated feedback by visiting datavault4dbt on Github!

Of course, our future ambitions for datavault4dbt are high and next on our list are a lot of important topics, like:

Provide a detailed working example of datavault4dbt
Extend and migrate the existing documentation of the package
Support more and more databases
Add more advanced and specific Data Vault 2.0 entities
Develop automated Data Vault related tests
Review and implement user feedback and suggestions

Stay tuned for more datavault4dbt content on all our marketing channels!

– Tim Kirschke (Scalefree)

Building a scalable Data Platform? In Data Vault Friday

Record Source and the Business Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a question raised by a member of our audience.

“We load a Same-As-Link (SAL) from multiple sources to deduplicate records across these source systems. How do you define the record source? The data is not clearly coming from only one source.”

In this insightful video, Michael delves into the intricacies of handling Same-As-Links (SAL) when loading data from multiple sources for the purpose of deduplication. The specific challenge raised about defining the record source when data is sourced from multiple systems is a common issue in many projects.

Michael shares best practices, providing clarity on the intended use of the record source and offering guidance on how to avoid common pitfalls associated with its misuse. The discussion sheds light on the importance of correctly defining the record source to ensure accurate data lineage and enhance the effectiveness of deduplication efforts.

For those navigating the complexities of deduplication in a multi-source environment, this video offers valuable insights and practical recommendations.

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Building a scalable Data Platform? In Data Vault Friday

Data Quality in the Enterprise Data Warehouse (EDW) in Data Vault

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke engages with a set of pertinent questions posed by our audience.

“How to deal with dirty data managed in BV? What are the best practices for correct data management? How can business rules, versions, and fixes be managed on correction properly?”

In this enlightening video, Michael addresses the challenges associated with handling dirty data within the Business Vault (BV) and explores best practices for effective data management. He dives into the complexities of managing business rules, versions, and corrections, offering insights into the proper approaches for ensuring data accuracy and consistency.

The video emphasizes a strict Extract, Load, Transform (ELT) approach, advocating for the application of data cleansing rules after loading the Raw Data Vault. Michael explains the rationale behind this methodology and highlights the advantages of maintaining a robust data flow.

For teams grappling with data quality issues and seeking optimal data management strategies, this video provides valuable guidance and practical considerations.

Meet the Speaker

Michael Olschimke

Building a scalable Data Platform? In Data Vault

Data Vault 2.0: Best of Breed from Data Warehousing and Data Lakes

There are two competing approaches to data analytics available in the industry, and most professionals at least tend to one or the other as the preferred tool of choice: data warehousing vs data lake. This article sheds light on the differences between both approaches and how Data Vault 2.0 provides a best of breed solution that integrates the advantages of both approaches into a unified concept.

About Data Warehousing

Data Warehousing is the traditional enterprise solution for providing reliable information to decision makers at every level of the organization. Data warehouse solutions (but also data lakes) are based on a data model, which is traditionally defined on the basis of either the information requirement in a bottom-up approach or in a top-down approach based on an integrated enterprise information model.

In any case, the traditional data warehouse is based on a concept called “schema-on-write” where the data model is established when loading the data into the data warehouse. This often leads to non-agile data processing, as this data model often requires modifications in order to cope with changes in the business.

About Data Lakes

Data lakes, on the other hand, are based on the “schema-on-read” concept. Instead of modeling the enterprise or fitting the incoming dataset into a target information model, the data is first and foremost stored on the data lake as delivered without any modelling applied.

While traditional data warehousing often leads to overmodeling and non-agile data analytics, the data lake approach often leads to the direct opposite: to unmanaged data and inconsistent information results.

The Best of Breed

Both approaches are on the extreme ends of the data analytics space and used throughout the years with mixed results. With the emergence of the Data Vault 2.0 concept, a third option is available to industry professionals to build data analytics platforms.

Data Vault 2.0 is a best of breed between traditional data warehousing and data lakes: for example, there is a data model to manage the data and business logic as in traditional data warehousing, but it follows a schema-on-write approach as in data lakes.

The Data Vault 2.0 architecture comprises multiple layers:

The first layer is the staging area: it is used to extract the data from the source systems. The next layer is the Raw Data Vault. This layer is still functionally oriented as the staging layer, but integrates and versionizes the data. To achieve this, the incoming source data model is broken down into smaller components: business keys (stored in hubs), relationships between business keys (stored in links) and descriptive data (captured by satellites).

The Business Vault is the next layer, but only sparsely modelled: only where business logic is required to deliver useful information, a Business Vault entity is put in place. The Business Vault bridges the gap between the target information model (as in the next layer) and the actual raw data. Often, the raw data doesn’t meet the expectations of the business regarding data quality, completeness, or content and thus needs to be adjusted. Business logic is used to fill the gap.

The final layer is the information mart layer where the information model is produced to deliver the final information in the desired format, e.g., a dimensional star schema. This model is used by the business user either directly in ad-hoc queries or using business intelligence tools such as dashboarding or reporting software.

The first layers until the Raw Data Vault are still functionally oriented because the model is still derived either directly from the source system (as in the staging area) or by breaking down the incoming data model into smaller, normalized components, as in the Raw Data Vault. The target schema is only applied at the latest layer, the information mart layer. This is when the desired information model is applied. Because the information mart is often virtualized using SQL views, the target schema is actually applied during query time. Queries against the view layers are merged with the SQL statements inside the view layer and run against the materialized tables in the Raw Data Vault, the actual data. Therefore, the schema-on-read concept is used in Data Vault 2.0.

Data Vault 2.0 also preserves the agility: the concept has demonstrated in many projects that it is easy to extend over time when either the source system structures change, the business rules change or the information models need to be adjusted. In addition, it is easy to add new data sources, additional business logic and additional information artifacts to the data warehouse.

On top of that, the Data Vault 2.0 model is typically integrated with a data lake: the above diagram shows the use of a data lake for staging purposes, which is the recommended “hybrid architecture” for new projects at Scalefree. But the data lake can also be used to capture semi-structured or unstructured data for the enterprise data warehouse or to deliver unstructured information marts.

With all that in mind, the Data Vault 2.0 concept has been established itself as a best of breed approach between the traditional data warehouse and the data lake. Organizations of all sizes use is to build data analytics platforms to deliver useful information to their decision makers.

-by Michael Olschimke (Scalefree)

Building a scalable Data Platform? In Data Vault Friday

Managed Self-service Monitoring in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a pertinent question posed by our audience.

“Why is monitoring important in Managed Self-Service BI?”

In this insightful video, Michael delves into one of his favorite topics – managed self-service BI. He expounds on the significance and role of monitoring within the context of managed self-service BI scenarios. Drawing from his expertise, Michael provides a comprehensive exploration of the uses, values, and critical importance of monitoring tools in ensuring the effectiveness and efficiency of managed self-service BI implementations.

For those keen on optimizing their self-service BI initiatives and understanding the practical applications of monitoring, this video offers valuable insights and considerations.

Meet the Speaker

Michael Olschimke

Building a scalable Data Platform? In Data Vault Friday

Modelling Currencies in Non-Historized Links in Data Vault (PART 3)

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke delves into a compelling question presented by a member of our audience.

“Is it a good idea to create a HUB for an ‘opposite side Account’? Or perhaps, should we take it a step further and explore the possibility of merging the ‘opposite side Account’ with HUB_ACCOUNT? If so, how do we handle the diversity of IBAN formats across different countries? Is it practical to have accounts from all over the world within our HUB_ACCOUNT?”

In this informative video, Michael addresses the intricacies of data modeling, specifically concerning the creation of a HUB for the ‘opposite side Account.’ He discusses the potential advantages and challenges associated with merging this entity with HUB_ACCOUNT. Furthermore, Michael offers insights into navigating the complexities of different IBAN formats in a global context, weighing the considerations of including accounts from diverse geographical regions within a unified HUB.

For those grappling with decisions around global data integration and HUB design, Michael’s responses in this video offer valuable guidance and considerations.

Meet the Speaker

Michael Olschimke

Building a scalable Data Platform? In Data Vault Friday

Ghost Records in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses an interesting question raised by a member of our audience.

“We wonder if a ghost record in a satellite should have NULL values in the descriptive fields or not. What is the advantage of non-NULL values?”

In this insightful video, Michael explores the nuances of handling ghost records in satellites, specifically focusing on whether descriptive fields should contain NULL values or default values. He provides a comprehensive analysis of the pros and cons associated with each solution, shedding light on the advantages and considerations that come with choosing either non-NULL values or default values for descriptive data in ghost records.

For those grappling with the challenges of managing ghost records within the Data Vault framework, this video offers valuable insights and considerations to inform decision-making.

Meet the Speaker

Michael Olschimke

Building a scalable Data Platform? In Data Vault Friday

LINK Record Source in Data Vault

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke takes on a thoughtful question posed by a member of our audience.

“LINK table has only one ‘Record Source’ element to capture the source system value. In case the LINK generates the key relationship between two hubs which are sourced from different systems (I have not encountered any such but I can foresee this situation), What would be the RECORD SOURCE Value?”

In this enlightening video, Michael delves into the fundamental modeling principles of link entities within the Data Vault framework. Specifically, he addresses the scenario where a LINK table establishes key relationships between two hubs originating from distinct systems. Michael provides insights into determining the appropriate “Record Source” value in such situations, offering guidance on maintaining clarity and accuracy in data lineage.

For those navigating the nuances of link table modeling and considering potential cross-system relationships, this video provides valuable insights and best practices.

Meet the Speaker

Michael Olschimke

Building a scalable Data Platform? In Intermediate

How to Successfully Manage Data Vault 2.0 Using dbt

Watch the Webinar

More and more companies are making the big step towards a modern data architecture. As they deal with the components, they can’t actually stumble upon dbt and Data Vault. Among a lot of features “automation” is one, that they have totally in common. So why not combine these worlds?

Hear firsthand from dbt Labs and Scalefree how to successfully manage Data Vault 2.0 using dbt!

Highlights

– Learn how dbt will move your data platform to the next level and how the dbt Cloud product will simplify your code development and project management.

– Is dbt x Data Vault the right fit for your company? We will talk about the considerations you should make when starting your project.

Watch Webinar Recording

Webinar Agenda

1. Dbt Snapshots
2. Dbt Incremental Models
3. Incremental Satellites in Data Vault
4. Conclusion

Meet the Speaker

Marvin Geerken

Marvin is working as a BI consultant for Scalefree since his Master’s degree in BI & Analytics in the beginning of 2021. He’s an expert in the design and implementation of BI solutions, with focus on the Data Vault 2.0 methodology. In his client projects, he also has been heavily focussing on the application of dbt and Snowflake.

Building a scalable Data Platform? In Data Tools

Data Vault 2.0 with DBT – Part 2

Data Vault 2.0 with dbt

In the first part of this blog series we introduced you to dbt. Now let’s take a look at how you can implement Data Vault 2.0 with dbt and what advantages this offers. If you don’t know the first part yet, you can read it here.

In this article:

Dbt Models
Data Vault 2.0 and Macros
Conclusion

Dbt Models

dbt provides the ability to create models and generate and execute SQL dynamically out of these models. This allows you to write your data transformations in models using SQL and reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way. However, the most important part for the Data Vault use case is the ability to define and use those macros.

But first we should clarify how models work in dbt.

This tool handles the compilation and execution of models written using SQL and the Jinja macro language. Each model consists of exactly one SQL SELECT statement. Jinja code is translated into SQL during compilation.

This figure shows a simple model in dbt. A big advantage of Jinja is the possibility to generate sql programmatically, for example with loops and conditions. Also, through the ref() functions, dbt understands the relationships between the models and builds a dependency graph. This ensures that the models are built in the correct order and that the data lineage is documented. A lineage graph may look like this:

The materialization of models can be defined at different configuration levels. This allows fast prototyping with views and switching to materialized tables if necessary for performance reasons.

Data Vault 2.0 and Macros

But how can we implement Data Vault 2.0 with dbt? The most important part for using Data Vault 2.0 is the ability to define and use macros. Macros can be called in models and then generate into that macro additional SQL or even the entire SQL code.

For example, you could create a macro to generate a hub that gets the source/staging model as an input parameter, as well as the specification of the columns for the business key(s), load date and record source. The sql code for the hub is then generated dynamically from this. The advantage of this is that a change to the macro would directly affect each individual hub, which greatly improves maintainability.

At this point, you also gain huge benefits from the active open-source community around dbt. There are many open-source packages with which it can be extended.

There are also already some packages that are perfect for using Data Vault 2.0 with dbt.

For example, our own open-source package DataVault4dbt, developed and actively maintained here at Scalefree, provides a comprehensive set of dbt macros to translate a Data Vault model “on paper” into actual tables and views—such as Hubs, Links, Satellites, and more. The package is actively used in real-world projects and helps enforce best practices for a modern, audit-friendly Data Vault 2.0 implementation.

To explore all features and macro parameters in detail, check out the documentation.

The only thing you need in your model, for example for a hub, is just one macro call:

{%-

hub(src_pk, src_nk, src_ldts, src_source, source_model) 

-%}

With the parameters of the macro call you define the source table where the columns can be found (source_model) and the column names for the hash-key (src_pk), business key(s) (src_nk), load date (src_ldts) and the record source (src_source) column. When the model and the macro(s) defined in the model are executed, the SQL gets compiled and processed on the database system.

The metadata needed can for example be defined in variables with jinja directly in the model:

What you can also see is that this tool provides different options for the materialization. The incremental materialization for example, will load an entity as a table on an incremental basis.

When the model is executed, dbt generates the whole sql out of the macro and decides how the records are loaded. If the hub table does not exist, yet it is created and all records are loaded, if it already exists, the table is loaded incrementally.

For people who tried or managed to implement a Data Vault with “vanilla” SQL, you will realise that this is a real game-changer. The team can now focus entirely on the Data Vault design itself. Once the metadata is identified, dbt, along with your macros, can take care of the entire logic.

Openly available packages can add basic Data Vault 2.0 principles to dbt and therefore allow users to quick-dive into Data Vault implementation. Dbt’s general openness allows you to adjust all macros, for example, to your company or project specific Data Vault flavor to meet your technical and business requirements.

It is important to note that many of the currently available dbt packages for implementing Data Vault 2.0 deviate from the official standards in some details. Our own open-source package DataVault4dbt, developed and actively maintained here at Scalefree, addresses this gap by supporting all core Data Vault 2.0 entities and the latest best practices. The package is already used in real-world projects to help ensure a modern and audit-friendly implementation.

Conclusion

Integrating Data Vault 2.0 with dbt streamlines the data warehousing process by leveraging dbt’s capabilities to define models and macros, enabling efficient and dynamic SQL generation. This approach allows teams to focus on Data Vault design, with dbt handling the underlying logic.

Our open-source package DataVault4dbt brings these advantages to real-world projects, offering a reliable and standards-aligned way to build Hubs, Links, Satellites, and more.

-by Ole Bause (Scalefree)

Building a scalable Data Platform? In Data Vault Friday

Modelling Currencies in Non-Historized Links in Data Vault (PART 2)

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a thought-provoking question posed by a member of our audience.

“Is it a good idea to create a HUB for an ‘opposite side Account’? Or maybe we should go even further and try to merge the ‘opposite side Account’ with HUB_ACCOUNT? If yes, what about different IBAN formats in different countries? Do we really want to have accounts from all over the world in our HUB_ACCOUNT?”

In this informative video, Michael delves into the intricacies of data modeling and hub creation, specifically addressing the concept of incorporating an “opposite side Account” into the HUB structure. He explores the potential benefits and challenges associated with merging these entities and offers insights into managing the diversity of IBAN formats across different countries.

For those navigating the complexities of global data integration and considering the structure of their HUB_ACCOUNT, this video provides valuable guidance and practical considerations.

Meet the Speaker

Michael Olschimke

Watch the Webinar

Webinar Agenda

Meet the Speakers

Datavault4dbt

Kickstart your Data Vault 2.0 Implementation – with datavault4dbt

Building a Data Vault 2.0 Solution – made easy

From the Stage over the Spine into the PITs

Start now and boost your Data Vault experience!

Watch the Video

Meet the Speaker

Watch the Video

Meet the Speaker

About Data Warehousing

About Data Lakes

The Best of Breed

Watch the Video

Meet the Speaker

Watch the Video

Meet the Speaker

Watch the Video

Meet the Speaker

Watch the Video

Meet the Speaker

Watch the Webinar

Webinar Agenda

Meet the Speaker

Data Vault 2.0 with dbt

Dbt Models

Data Vault 2.0 and Macros

Conclusion

Watch the Video

Meet the Speaker

Build Better Data Platforms

SOLUTIONS

TRAINING

EVENTS

KNOWLEDGE HUB

CAREERS

COMPANY