
Get started with real-time processing in Data Vault 2.0 – on Microsoft Azure


Introduction

In this newsletter, you’re going to get an overview of what real-time processing is and what possibilities it can open up for your Data Vault 2.0 implementation.

There was also a webinar on May 9, 2023 at 2pm CEST with a more in-depth look into the possibilities of real-time data. Take a look at the recording here.

What to expect

You will learn that real-time processing gives you the ability to create value from data more quickly, keeps the most up-to-date data in your reporting tools, and enables more accurate, data-driven decisions.
With that, your company will be able to adapt to changes in the market faster, because developments become visible as soon as the most recent data arrives.

Additionally, you can save costs by moving away from batch loading, because the peak of computing power normally required for it is reduced and spread more evenly throughout the day. That is especially the case in cloud environments, where pre-provisioned capacity can be replaced by elastic resources that deliver exactly the computing power needed at any given time.

The traditional way – batch-loading

Batch loading is a traditional method used to load data into a data warehouse system in large batches, mostly overnight. The data from data sources is delivered up to a certain time in the night to be transformed and loaded into the core data warehouse layer.

This method leads to a peak of data processing overnight, and organizations have to adjust their infrastructure needs to be able to deal with the expected maximum peak of required computing power.

The new way – real-time data

Real-time data is processed and made available immediately as it is generated, instead of being loaded in batches overnight. When using real-time approaches, the loading window is extended to 24 hours. So the overnight peak and its disadvantages are gone.
When using real-time data, it’s always modeled as a non-historized link or as a satellite.

Possible use cases for real-time data are vital monitoring in the healthcare industry, inventory tracking, user behavior on social media or production line monitoring.

Different types of real-time data

There are different types of real-time data based on how frequently the data is loaded and the degree of urgency or immediacy of the data.

Near real-time data refers to data that is loaded in mini-batches at least every fifteen minutes, with the data stored in a cache until it is loaded into the data analytics platform.
Actual real-time data, also called message streaming, involves loading every single message directly into the data analytics platform without any cache.
This type of real-time data is useful when it is important to have data available as soon as it is generated for dashboards or further analytics.

The acceptable processing delay for real-time data is typically defined by the consequences of missing a deadline. Additionally, there are three types of real-time systems: hard real-time, soft real-time, and firm real-time.

Implementing real-time data

So, how do you implement real-time data processing into your data warehouse solution? There are many architectures for that, but we will focus on the Lambda and Data Vault 2.0 architecture.

The Lambda architecture separates data processing into a speed layer and a batch layer. The speed layer processes real-time messages with a focus on speed and throughput, while the batch layer provides accuracy and completeness by processing high volumes of data in regular batches. The serving layer integrates data from both layers for presentation purposes.

At first glance, the Data Vault 2.0 architecture seems similar to the Lambda architecture, but it treats some aspects differently. From a Data Vault 2.0 perspective, the Lambda architecture has issues: each data flow is implemented in only a single layer, and there is no defined layer for capturing raw, unmodified data for auditing purposes.

The Data Vault 2.0 architecture adds a real-time part called “message streaming” to the existing batch-driven architecture, with multiple layers implemented for capturing and processing real-time data, integrating it with the batch-driven flow at multiple points. Messages are pushed downstream from the publisher to the subscriber, loaded into the Raw Data Vault and forked off into the data lake. But the main process is the push inside the message streaming area. The architecture is able to integrate data from batch feeds or to stream the real-time data directly into the dashboard.

Using Microsoft Azure for real-time data

Microsoft Azure is a cloud computing platform and set of services offered by Microsoft. It provides a variety of services, including virtual machines, databases, analytics, storage, and networking. These services can be used to create web and mobile applications, run large-scale data processing tasks, store and manage data, host websites and much more.

The following describes a typical real-time architecture used by Scalefree consultants, which follows the conceptual Data Vault 2.0 architecture.

Data sources deliver data either in batches or in real time. Batch data is loaded into the Azure Data Lake, while real-time messages are accepted by an Event Hub. The Raw Data Vault loader separates business keys, relationships and descriptive data using Stream Analytics and forwards the message to the Business Vault processor. The Business Vault processor applies transformations and other business rules to produce the target message structure for consumption by the (dashboarding) application. The results can be loaded into physical tables in the Business Vault on Synapse or be delivered in real time without further materialization in the database. The target message is generated and sent to the real-time information mart layer, implemented by a streaming dataset, which is consumed by Power BI. The cache of the dashboard service expires quickly, but the Synapse database keeps all data available for other uses, including strategic, long-term reporting.
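To make the Stream Analytics step more tangible, here is a minimal sketch of what such a query could look like. The input, outputs and column names are illustrative assumptions, not the actual Scalefree implementation; hash keys would typically be calculated via a user-defined function or further downstream.

```sql
-- Hypothetical Azure Stream Analytics job: route an incoming sales event
-- from Event Hub into separate hub and satellite outputs.
SELECT
    CustomerId         AS CustomerBk,
    System.Timestamp() AS LoadDate,
    'EventHub.Sales'   AS RecordSource
INTO [out-hub-customer]
FROM [eh-sales-input];

SELECT
    CustomerId         AS CustomerBk,
    System.Timestamp() AS LoadDate,
    'EventHub.Sales'   AS RecordSource,
    CustomerName,
    CustomerSegment
INTO [out-sat-customer]
FROM [eh-sales-input];
```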

Conclusion and Outlook

In conclusion, real-time data processing offers numerous benefits over traditional batch loading methods, including the ability to create value out of data quicker, have the most up-to-date information in reporting tools, and make more accurate decisions. By adapting to changes in the market quicker, companies can stay ahead of the competition. Moving away from batch loading can also save costs by reducing the peak of computing power required.

As mentioned before, the architecture described above has been implemented by Scalefree consultants to make use of real-time data.

Read more on our recently released Microsoft Blog Article.

How is your current experience with real-time data processing?
Are you thinking about kick-starting your Data Vault by also using real-time data?
Or are you already using it and thinking about improving it further?

Let us know your thoughts in the comment section!

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.


Quick guide of a Data Vault 2.0 Implementation


It can be a bit overwhelming for beginners to start using Data Vault 2.0, as well as learning how and where to implement it. It’s important to note that Data Vault 2.0 is often assumed to be only a modeling technique, but it encompasses much more than that: it is a whole BI solution composed of agile methodology, architecture, implementation, and modeling.

So why start using Data Vault?

  • Data Vault 2.0 allows you to build automated loading processes/patterns and generate models very easily
  • Platform independence
  • Auditability 
  • Scalability
  • Supports ELT instead of ETL processes

Now that we have answered the why, you may be wondering what steps are needed to implement Data Vault 2.0 in your project.

It depends on a lot of factors like your business case, the architecture you want to have in place, how your sources are loaded, the sprint timeline of your project, etc.

Feature by Feature Architecture

One thing is for sure: the architecture should be built vertically, not horizontally. This means not layer by layer but feature by feature. 

A common approach here is the Tracer Bullet approach. Based on business value, which is defined by a report, a dashboard or an information mart, the source data needs to be identified, modeled, and loaded through all layers of the architecture. 

For example, let’s say the business request was to build a dashboard to analyze the company’s sales:

1. Extract

First, we need to extract the data from the source systems and land it, unchanged, in a staging area. In this example, we put it in a transient staging area, but you could choose a persistent one in a data lake as well.

2. Transform

Next, apply some hard rules if necessary using a transformation tool. Be careful here: you do not want to make business calculations at this stage. There are a lot of different warehouse automation tools that you can choose from: dbt, Coalesce, WhereScape, etc. A sketch of what a hard rule can look like follows below.
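As an illustration only, a hard rule is typically nothing more than a cast, a rename or a trim. A minimal sketch in generic SQL, assuming a raw_sales landing table and illustrative column names:

```sql
-- Hypothetical staging view applying only hard rules (type casts, renaming,
-- trimming); no business calculations happen here.
CREATE VIEW stg_sales AS
SELECT
    CAST(order_id AS VARCHAR(20))       AS order_id,
    CAST(customer_id AS VARCHAR(20))    AS customer_id,
    CAST(order_total AS DECIMAL(18, 2)) AS order_total,
    TRIM(currency_code)                 AS currency_code,
    CAST(order_ts AS TIMESTAMP)         AS order_ts,
    CURRENT_TIMESTAMP                   AS load_date,
    'ERP.SALES'                         AS record_source
FROM raw_sales;
```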

3. Load

Load your Raw Stage into the Raw Vault.

4. Model Business requirements


Model the Data Vault entities needed to fulfill the business requirement. If we have sales transactions and customer data, for example, we will model a non-historized link, also known as a transactional link, and a Customer hub, along with any additional satellites holding the descriptive customer data that we want to see in the sales dashboard in the end. A sketch of these entities follows below.
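The following DDL is a minimal sketch of what these entities might look like; table names, columns, data types and the MD5-style CHAR(32) hash keys are assumptions for illustration.

```sql
-- Hypothetical Customer hub: one row per business key.
CREATE TABLE hub_customer (
    hk_customer    CHAR(32)    NOT NULL,  -- hash key over the business key
    customer_bk    VARCHAR(50) NOT NULL,  -- business key
    load_date      TIMESTAMP   NOT NULL,
    record_source  VARCHAR(50) NOT NULL,
    PRIMARY KEY (hk_customer)
);

-- Hypothetical non-historized (transactional) link for sales transactions.
CREATE TABLE lnk_sales_transaction (
    hk_sales_transaction  CHAR(32)      NOT NULL,
    hk_customer           CHAR(32)      NOT NULL,
    order_id              VARCHAR(20)   NOT NULL,  -- dependent child key
    order_total           DECIMAL(18,2),           -- transaction measure
    load_date             TIMESTAMP     NOT NULL,
    record_source         VARCHAR(50)   NOT NULL,
    PRIMARY KEY (hk_sales_transaction)
);

-- Hypothetical satellite with the descriptive customer data.
CREATE TABLE sat_customer (
    hk_customer    CHAR(32)     NOT NULL,
    load_date      TIMESTAMP    NOT NULL,
    hash_diff      CHAR(32)     NOT NULL,  -- used for delta detection
    customer_name  VARCHAR(100),
    customer_city  VARCHAR(100),
    record_source  VARCHAR(50)  NOT NULL,
    PRIMARY KEY (hk_customer, load_date)
);
```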

5. Apply Business Logic

Next, some calculations and aggregations need to be performed, so we build business logic on top of the raw entities and load the results into the Business Vault.
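A minimal sketch of such business logic, reusing the illustrative entities from the previous step, could be a simple aggregation view in the Business Vault:

```sql
-- Hypothetical Business Vault view aggregating raw sales per customer;
-- it could also be materialized as a computed satellite if needed.
CREATE VIEW bv_customer_sales AS
SELECT
    l.hk_customer,
    COUNT(*)           AS number_of_orders,
    SUM(l.order_total) AS total_revenue
FROM lnk_sales_transaction AS l
GROUP BY l.hk_customer;
```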

6. Build an Information Mart

We could use the data stored in the Raw and Business Vault directly in charts and dashboards, but we want to structure the data so it can be easily read and queried by business users. Therefore, we build an information mart with a star schema consisting of a fact table and dimensions.
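A minimal sketch of such a mart, virtualized as views on top of the illustrative entities above (names and the latest-record logic are assumptions):

```sql
-- Hypothetical customer dimension: hub plus the most recent satellite record.
CREATE VIEW dim_customer AS
SELECT
    h.hk_customer AS customer_key,
    h.customer_bk AS customer_number,
    s.customer_name,
    s.customer_city
FROM hub_customer AS h
LEFT JOIN sat_customer AS s
       ON s.hk_customer = h.hk_customer
      AND s.load_date = (SELECT MAX(s2.load_date)
                         FROM sat_customer AS s2
                         WHERE s2.hk_customer = h.hk_customer);

-- Hypothetical sales fact derived from the transactional link.
CREATE VIEW fact_sales AS
SELECT
    l.hk_customer AS customer_key,
    l.order_id,
    l.order_total,
    l.load_date
FROM lnk_sales_transaction AS l;
```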

7. Visualize Data

To build the sales dashboard in a BI visualization tool like Power BI or Tableau, we now fetch directly from the star schema in the information mart, which has all the information we need, using a connection to the data warehouse database.

 

Data Vault 2.0 offers an agile, scalable, and flexible approach to data warehousing automation. As demonstrated in the example, we only modeled the Data Vault tables that were necessary for accomplishing the given task of building a sales dashboard. This way you can scale on demand and don’t have to figure out and map out the whole enterprise in one go.

The answer to how to implement Data Vault 2.0 can be translated into a simple phrase: Focus on business value!

If you would like to see an explanation of this step by step implementation with some demonstration of actual data using dbt as the chosen transformation tool, check out the webinar recording.


Data Vault 2.0: Best of Breed from Data Warehousing and Data Lakes


There are two competing approaches to data analytics available in the industry, and most professionals tend toward one or the other as their preferred tool of choice: data warehousing versus the data lake. This article sheds light on the differences between both approaches and how Data Vault 2.0 provides a best-of-breed solution that integrates the advantages of both into a unified concept.

About Data Warehousing

Data Warehousing is the traditional enterprise solution for providing reliable information to decision makers at every level of the organization. Data warehouse solutions (but also data lakes) are based on a data model, which is traditionally defined either bottom-up, based on the information requirements, or top-down, based on an integrated enterprise information model.

In any case, the traditional data warehouse is based on a concept called “schema-on-write” where the data model is established when loading the data into the data warehouse. This often leads to non-agile data processing, as this data model often requires modifications in order to cope with changes in the business.

About Data Lakes

Data lakes, on the other hand, are based on the “schema-on-read” concept. Instead of modeling the enterprise or fitting the incoming dataset into a target information model, the data is first and foremost stored on the data lake as delivered without any modelling applied. 

While traditional data warehousing often leads to overmodeling and non-agile data analytics, the data lake approach often leads to the direct opposite: to unmanaged data and inconsistent information results.

The Best of Breed

Both approaches sit at the extreme ends of the data analytics space and have been used throughout the years with mixed results. With the emergence of the Data Vault 2.0 concept, a third option is available to industry professionals to build data analytics platforms.

Data Vault 2.0 is a best of breed between traditional data warehousing and data lakes: for example, there is a data model to manage the data and business logic as in traditional data warehousing, but it follows a schema-on-read approach as in data lakes.

The Data Vault 2.0 architecture comprises multiple layers:

Figure: The layers of the Data Vault 2.0 architecture

The first layer is the staging area: it is used to extract the data from the source systems. The next layer is the Raw Data Vault. This layer is still functionally oriented, like the staging layer, but integrates and versions the data. To achieve this, the incoming source data model is broken down into smaller components: business keys (stored in hubs), relationships between business keys (stored in links) and descriptive data (captured by satellites).

 

The Business Vault is the next layer, but it is only sparsely modelled: a Business Vault entity is put in place only where business logic is required to deliver useful information. The Business Vault bridges the gap between the target information model (as in the next layer) and the actual raw data. Often, the raw data doesn’t meet the expectations of the business regarding data quality, completeness, or content and thus needs to be adjusted. Business logic is used to fill that gap.

The final layer is the information mart layer where the information model is produced to deliver the final information in the desired format, e.g., a dimensional star schema. This model is used by the business user either directly in ad-hoc queries or using business intelligence tools such as dashboarding or reporting software.

The layers up to and including the Raw Data Vault are still functionally oriented because the model is still derived either directly from the source system (as in the staging area) or by breaking down the incoming data model into smaller, normalized components (as in the Raw Data Vault). The target schema is only applied in the last layer, the information mart layer. This is where the desired information model is applied. Because the information mart is often virtualized using SQL views, the target schema is actually applied at query time: queries against the view layer are merged with the SQL statements inside the views and run against the materialized tables in the Raw Data Vault, the actual data. Therefore, the schema-on-read concept is used in Data Vault 2.0.
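As an illustration of such a virtualized mart, a dimension could be nothing more than a view over a hub and the latest record of its satellite; the following sketch assumes illustrative hub_product and sat_product tables:

```sql
-- Hypothetical virtualized dimension: the star schema exists only as a view,
-- so the target schema is applied at query time (schema-on-read).
CREATE VIEW dim_product AS
WITH latest_sat AS (
    SELECT
        s.*,
        ROW_NUMBER() OVER (PARTITION BY s.hk_product
                           ORDER BY s.load_date DESC) AS rn
    FROM sat_product AS s
)
SELECT
    h.hk_product AS product_key,
    h.product_bk AS product_number,
    ls.product_name,
    ls.product_category
FROM hub_product AS h
LEFT JOIN latest_sat AS ls
       ON ls.hk_product = h.hk_product
      AND ls.rn = 1;
```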

Data Vault 2.0 also preserves the agility: the concept has demonstrated in many projects that it is easy to extend over time when either the source system structures change, the business rules change or the information models need to be adjusted. In addition, it is easy to add new data sources, additional business logic and additional information artifacts to the data warehouse. 

On top of that, the Data Vault 2.0 model is typically integrated with a data lake: the above diagram shows the use of a data lake for staging purposes, which is the recommended “hybrid architecture” for new projects at Scalefree. But the data lake can also be used to capture semi-structured or unstructured data for the enterprise data warehouse or to deliver unstructured information marts.

With all that in mind, the Data Vault 2.0 concept has established itself as a best-of-breed approach between the traditional data warehouse and the data lake. Organizations of all sizes use it to build data analytics platforms that deliver useful information to their decision makers.

 

-by Michael Olschimke (Scalefree)


Data Vault 2.0 with dbt – Part 2


In the first part of this blog series, we introduced you to dbt. Now let’s take a look at how you can implement Data Vault 2.0 with dbt and what advantages this offers. If you don’t know the first part yet, you can read it here.

dbt provides the ability to create models and generate and execute SQL dynamically out of these models. This allows you to write your data transformations in models using SQL and reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way. However, the most important part for the Data Vault use case is the ability to define and use those macros.

But first we should clarify how models work in dbt.

Dbt handles the compilation and execution of models written using SQL and the Jinja macro language. Each model consists of exactly one SQL SELECT statement. Jinja code is translated into SQL during compilation.

A dbt model is essentially a single SELECT statement, optionally enriched with Jinja, as sketched below. A big advantage of Jinja is the possibility to generate SQL programmatically, for example with loops and conditions. Also, through the ref() function, dbt understands the relationships between the models and builds a dependency graph. This ensures that the models are built in the correct order and that the data lineage is documented.
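A minimal sketch of such a model (the model and column names are illustrative assumptions):

```sql
-- Hypothetical dbt model (models/stg_customer.sql): one SELECT statement that
-- references an upstream model via ref(), so dbt can build the lineage graph.
{{ config(materialized='view') }}

select
    customer_id,
    upper(country_code) as country_code,  -- simple hard rule
    load_date,
    record_source
from {{ ref('raw_customer') }}
```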

dbt can render this dependency graph as a lineage diagram in its generated documentation.

The materialization of models can be defined at different configuration levels. This allows fast prototyping with views and switching to materialized tables if necessary for performance reasons.

Data Vault 2.0 and macros

But how can Data Vault 2.0 be implemented in dbt? The most important part for using Data Vault 2.0 is the ability to define and use macros. Macros can be called in models and then generate additional SQL, or even the entire SQL code of the model.

For example, you could create a macro that generates a hub and receives the source/staging model as an input parameter, along with the columns for the business key(s), load date and record source. The SQL code for the hub is then generated dynamically from this metadata. The advantage is that a change to the macro directly affects every single hub, which greatly improves maintainability.
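To illustrate the idea, here is a strongly simplified, hypothetical hub macro. It is only a sketch: real packages handle composite business keys, multi-source loading and many edge cases.

```sql
-- Hypothetical macro (macros/hub.sql) generating the SELECT for a hub model.
{% macro hub(src_pk, src_nk, src_ldts, src_source, source_model) %}

select distinct
    {{ src_pk }},
    {{ src_nk }},
    {{ src_ldts }},
    {{ src_source }}
from {{ ref(source_model) }}

{% if is_incremental() %}
-- only insert business keys that are not yet in the hub
where {{ src_pk }} not in (select {{ src_pk }} from {{ this }})
{% endif %}

{% endmacro %}
```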

At this point, you also gain huge benefits from the active open-source community around dbt. There are many open-source packages with which dbt can be extended.

There are also already some packages that are tailored to Data Vault 2.0.

We at Scalefree are currently developing an open-source dbt package that provides macros to translate a Data Vault model “on paper” into actual tables and views such as hubs, links, satellites and more. Subscribe to our newsletter to get notified as soon as it’s available.

The only thing you need in your model, for example for a hub, is just one macro call:

{{ hub(src_pk, src_nk, src_ldts, src_source, source_model) }}

With the parameters of the macro call you define the source table where the columns can be found (source_model) and the column names for the hash-key (src_pk), business key(s) (src_nk), load date (src_ldts) and the record source (src_source) column. When the model and the macro(s) defined in the model are executed, the SQL gets compiled and processed on the database system.

The metadata needed can, for example, be defined in variables with Jinja directly in the model, as sketched below.
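A minimal sketch of such a hub model, assuming a staging model named stg_customer and the hypothetical hub macro from above:

```sql
-- Hypothetical hub model (models/raw_vault/hub_customer.sql).
{{ config(materialized='incremental') }}

{%- set source_model = 'stg_customer'   -%}
{%- set src_pk       = 'hk_customer'    -%}
{%- set src_nk       = 'customer_bk'    -%}
{%- set src_ldts     = 'load_date'      -%}
{%- set src_source   = 'record_source'  -%}

{{ hub(src_pk, src_nk, src_ldts, src_source, source_model) }}
```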

As you can also see, dbt provides different options for materialization. The incremental materialization, for example, loads an entity as a table on an incremental basis.

When the model is executed, dbt generates the whole SQL from the macro and decides how the records are loaded: if the hub table does not exist yet, it is created and all records are loaded; if it already exists, the table is loaded incrementally.

Anyone who has tried (or managed) to implement a Data Vault with “vanilla” SQL will realize that this is a real game-changer. The team can now focus entirely on the Data Vault design itself. Once the metadata is identified, dbt along with your macros can take care of the entire loading logic.

Openly available packages can add basic Data Vault 2.0 principles to dbt and therefore allow users to dive quickly into a Data Vault implementation. dbt’s general openness allows you to adjust all macros, for example to your company- or project-specific Data Vault flavor, to meet your technical and business requirements.

However, it is important to mention that existing dbt packages for Data Vault 2.0 do not completely fulfill the Data Vault 2.0 standards and deviate from them in some details. Currently, we at Scalefree are working on an open-source dbt package that provides all important Data Vault 2.0 entities as well as the latest standards and best practices; it will be released soon. But these details are worth another blog post!

 

-by Ole Bause (Scalefree)

 


What’s new in Data Vault?



Attendees of our Data Vault 2.0 trainings often ask us what the difference between the book “Building a Scalable Data Warehouse with Data Vault 2.0” and our current consulting practice is. To give a definitive answer, at least for now, we have written this article.

Data Vault 3.0?

First of all, there is no Data Vault 3.0. We like to refer to the version of Data Vault that we teach and apply in our consulting practice as “Data Vault 2.0.1”. Accordingly, there are only slight differences from the version we used in the book. Some of these changes look big, but they don’t really modify the underlying concept, and that is what matters regarding version changes. So, based on the concept, we only see minor changes or enhancements. However, they might have a larger impact on the actual implementation, but that is only due to technology changes.

 

How to Fit Data Vault 2.0 Between Two Covers

On top of the minor changes to the concept, there is also another factor at play here: when writing the book, we had to scope the book to make it fit between the covers as the publisher had a page limit (which we actually exceeded a bit). Therefore, we did not apply all Data Vault 2.0 concepts in the book: for example, real-time concepts are not covered by the book, and we did not use a data lake or cloud computing in our examples. Instead we only briefly covered these concepts for completeness but focused on on-premise technology which was more used in actual projects back then. In 2012, cloud computing was already available and widely used, but it was easier to tailor the book to more readers by the use of on-premise technologies.

 

With that in mind, what has changed since the book came out? 

 

Removal of Load End Date

 

The most obvious change is the removal of the Load End Date. Well, in our actual projects, we don’t completely remove it from the Data Vault 2.0 model, we just virtualize it by removing the Load End Date from the underlying satellite table and virtually calculating it in a view on top of the satellite table using a window function (typically LEAD, but LAG is also possible). This way we can get rid of the update procedure to maintain the Load End Date (the so-called end-dating process) while preserving the query patterns downstream. Keep in mind that it might also be more efficient to use the window function when loading PIT tables or satellites, and therefore the query layer is actually only for power users when they query the Data Vault model directly. 
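A minimal sketch of such a view, assuming a satellite named sat_customer keyed by hash key and load date (all names are illustrative):

```sql
-- Hypothetical view deriving the Load End Date virtually, replacing the
-- physical end-dating (update) process on the satellite table.
CREATE VIEW sat_customer_enddated AS
SELECT
    hk_customer,
    load_date,
    LEAD(load_date, 1, '9999-12-31')  -- "end of all times" for the current record
        OVER (PARTITION BY hk_customer ORDER BY load_date) AS load_end_date,
    hash_diff,
    customer_name,
    customer_city,
    record_source
FROM sat_customer;
```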

 

That’s actually not an update: even in 2012 (and before) we used these approaches, but they didn’t work well on SQL Server 2014, which was used in the book, because the analytical window function required for this approach was way too slow. However, this improved with SQL Server 2016. To get rid of the Load End Date in scenarios where the window function is too slow or simply doesn’t exist, another solution has emerged: the use of a snapshot partition in the PIT table. The end of all times is used in the PIT table to refer to the latest delta in each satellite for the delta check. Once the PIT table is available, it can also be used to produce SCD Type 1 dimensions (without history), and therefore the need for the Load End Date (or a fast replacement as described above) only exists during loading.

 

Hybrid Architecture

 

The next obvious change is the use of a hybrid architecture where a data lake is used for staging purposes. In the book, we focused on the on-premise Microsoft SQL Server stack that didn’t include a distributed file system. We already recommended to clients to use a data lake for staging purposes, in an architecture we called the “hybrid architecture.” Back then, only a few followed the advice, but today most clients use the hybrid architecture for their data warehouse. We actually consider a relational staging area an anti-pattern and do not recommend it anymore to clients (with some exceptions). 

 

Multi Temporal Data Vault 2.0

 

The book describes a temporal PIT (TPIT) for building multi-temporal solutions. While the pattern is still valid and relatively flexible compared to other solutions, today we typically move the business timelines into the dimensional model for highest flexibility. This is discussed in more detail in the Multi-Temporal Data Vault 2.0 class. TPITs are less flexible but have higher performance.

 

Snapshot Date

Satellites in the Business Vault might use a snapshot date timestamp instead of a load date timestamp. The idea is that, as a general practice, a Business Vault entity should always reuse an existing granularity; this makes it much easier to develop Business Vault solutions. With the snapshot date in the primary key of the satellite, business rules can deliver results on the outgoing information granularity much more easily. This is discussed in more detail in the Data Vault 2.0 Information Delivery class.

 

A Confession

 

And finally, we also made some mistakes. We are willing (not happy) to admit it in our errata page for the book – this link is directly from our internal Wiki and we promise to keep it updated (however, we didn’t receive additional reports lately).


Data Vault 2.0 with dbt – Part 1


Data is an important asset in the decision making process. As we have previously discussed in another post, Data Vault 2.0 is the right choice when the goal of an enterprise data warehouse is to have fully historized and integrated data. Additionally, it is also better suited to instances in which data from many source systems needs to be combined. You can find the previous blog post here.

While Data Vault 2.0 focuses on the “what”, there are many options for the “how”: technically translating a Data Vault model into physical tables and views in the enterprise data warehouse, as well as orchestrating and executing the loading procedures. And this is where dbt comes in.

dbt

ABOUT DBT

The data build tool (dbt) transforms your data directly in your data warehouse; dbt is the “T” in ELT. It therefore assumes that the data is already loaded into a database that it can query. In contrast, ETL extracts data, transforms it and then loads it into the target. With ELT, the data is not transformed as it moves into the data warehouse. (By “transform” we mean transformations by soft business rules, which would change the meaning of the data. Of course, we still have to make sure that the data fits into the target table, e.g. data type adjustments; these are the “hard rules”.)
dbt is particularly useful in cloud DWH solutions such as Snowflake, Azure Synapse Analytics, BigQuery and Redshift, and performs transformations and modeling directly in the database in order to take advantage of the performance of these highly scalable platforms.

HOW DBT WORKS

Models and SQL statements can be easily created, tested and managed in dbt itself. A powerful combination of the scripting language, Jinja2, and the all-time classic SQL allows users to build models. The simple interface enables data analysts without engineering know-how to initiate appropriate transformations. Data team workflows become more efficient and cost-effective because of this. Behind dbt sits an open source community that is constantly and passionately developing the tool. As such, dbt is available both as a free, reduced core version and as a comprehensive and flexible cloud version.

What is the place of dbt in Data Vault 2.0?

Part of the Data Vault 2.0 methodology is the model, which focuses on how to design a scalable core data warehouse. The core Data Vault entities are hubs, links and satellites. dbt provides the ability to generate Data Vault models and also allows you to write your data transformations using SQL and code-reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way.

Application Areas of dbt in an Enterprise BI Solution

dbt does not reinvent the wheel, but when it comes to building a new EDW, especially in the cloud, dbt provides a very helpful basic framework with many important functions for continuous integration and deployment already defined. dbt brings the standards of software development into the world of data transformation. This allows developers to concentrate on the core tasks of data modeling and business logic. Especially, but not only, for smaller projects, dbt offers a lightweight and extremely affordable alternative to other data warehouse automation solutions.

– by Ole Bause (Scalefree)


Multi-Temporality in Data Vault 2.0 Part 2 – A Practical Example


Multi-Temporal Example from the Travel Industry

In one of our previous blog posts, we gave an insight into the basics of multi-temporality in Data Vault 2.0. This article will give you a short practical example of the topic.

When you book your holidays, you might change something after the initial booking: the duration, the flight departure location, the hotel, the start and/or end date of your journey or might just book some extras. From an operational perspective, these changes are not a problem to deal with, and the expectations are not high.

The data set for your journey will be updated in the source systems’ database, and only the most recent data is shown in the graphical user interface. The employee might see some logs in the history part, if they exist.

That’s usually enough to keep the operational part working. In the business intelligence world, the way to deal with data is a bit different. There, we want to see everything. Each time a record in the source system is updated, we catch all the changes and store them in a separate database.

This means that we introduce a new timeline which provides us new opportunities but also comes with new challenges. Let’s have a look into some example data sets of a source system over time:

Day 1: Initial booking of a holiday trip

Day 2: Correction of the start and end date and a small price adjustment


When you have a look into the source system today, you will see that booking B4711 is cancelled, or maybe it is completely gone, and B4712 is the active one (ID = 2, created on day 3). From an analytical perspective you might have some more questions you want to have answered.
For example:

  • What was the estimated revenue for July, looking from February?
  • What was the estimated revenue for July, looking from March?
  • Which dates do I need to use to answer the two questions above?
  • In which month do we have the most bookings for the next summer season?
  • Can we reproduce a report we created some months ago?
  • What if some business dates change? What affects my reports in the past?
  • Where can we prepare the data to answer all these questions? In the data warehouse? In the front-end BI tool? Or somewhere else?

In our webinar “Multi-Temporality in Data Vault 2.0” we will provide you answers to all these questions by using Data Vault entities.

If you are interested in seeing some more use cases related to multi-temporality, we offer a two-day class dedicated to this topic: the “Multi-Temporal Data Vault 2.0 Class”.

by Marc Finger (Scalefree)


Why should I use Data Vault (2.0)?


As a database developer, administrator or user, you probably have already heard of Data Vault. But with an existing database solution in mind, you might come to the question of why you should use Data Vault in your enterprise data warehouse. This blog article discusses some features where Data Vault can shine and why you should use it.

Reason 1: Supporting multiple source systems and frequently changing relationships.

Data Vault 2.0 provides the most benefits when your data comes from many source systems or involves constantly changing relationships.

The reason why it works well for these systems is its ability to make adding attributes simple. If there is a change to one source system, that change does not need to show up within all source systems. Similarly, you can limit the number of places changes are made, as attributes are stored separately from structural data in satellites.

Additionally, it is easier to account for new and changing relationships by closing off one link and creating another. You don’t have to change the historical data to account for a new relationship or update an existing schema. You only need to account for the changes going forward. This brings enormous flexibility and scalability into your enterprise data warehouse.

Reason 2: Easily track and audit your data.

In a modern data environment, data passes through various stops. To still ensure continuous data quality, it must always be clear where the data has come from.

Figure 1: Data Lineage

Given the need to easily track and audit data, Data Vault makes this straightforward. It inherently enables auditing, as load timestamps and record sources are required for every row. It also tracks all historical changes in satellites, which include the load timestamp as part of the primary key.
When an attribute is updated, a new record is created.
All of this enables you to easily provide auditability for both regulatory and data governance purposes. And because you store all of your history, you can access data from any point in time.

Reason 3: Load data from multiple source systems more quickly.

Data Vault enables quicker data loading simply because many tables can be loaded at the same time in parallel. The model decreases dependencies between tables during the load process and simplifies ingestion by leveraging inserts only, which load quicker than upserts or merges. This also leads to less complexity. A sketch of such an insert-only load is shown below.
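As an illustration, a minimal insert-only loading pattern for a hub might look like the following sketch (table and column names are assumptions):

```sql
-- Hypothetical insert-only hub load: only business keys that are not yet
-- present in the hub are inserted; no updates or merges are needed.
INSERT INTO hub_customer (hk_customer, customer_bk, load_date, record_source)
SELECT DISTINCT
    stg.hk_customer,
    stg.customer_bk,
    stg.load_date,
    stg.record_source
FROM stg_customer AS stg
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_customer AS hub
    WHERE hub.hk_customer = stg.hk_customer
);
```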

Conclusion:

Data Vault 2.0 is designed to integrate data from several source systems and may be more than you need in some situations.

In summary, if you have a small to medium-sized analytics requirement, with a small team of architects, designers, and engineers delivering a solution with data sourced from only a few systems, then Data Vault may be inappropriate for your needs.

If, however, you have a large project with many source systems leading to an enormous data integration challenge, then Data Vault can potentially add massive value to the project.


Running modern ETL-Processes with Framework-Based Tools – Part 2


In the last blog post, we introduced Singer, the open-source framework, as a powerful tool for ETL processes. This time, we’d like to discuss how you can implement the framework in your own projects.

How to start working with Singer

Starting a test run is rather simple. First, you need to create a Python environment; step-by-step instructions for this are available online.

As soon as you’ve done that, it’s time to create your first virtual environment inside python.
Please note before beginning that it’s a best practice to create and use an individual virtual environment for every tap and target. This avoids any conflicts between module requirements for the different modules. 

The next step is to install the tap and target you’ve chosen into their corresponding virtual environments. This installation can be performed very easily using a pip install command. This example command installs tap-salesforce to load data from your Salesforce account:
Read More

Running modern ETL-Processes with Framework-Based Tools – Part 1


A big part of every enterprise data warehouse is its ETL or ELT processes.
In both abbreviations, the letters stand for the same words; only the order in which each step is performed changes.
To brush up on those processes: “E” stands for extraction, “T” for transformation and “L” for loading.

That said, rather than dive into the benefits of each, we would like to present a powerful open-source framework to execute these processes.

Why use a framework?

Rather than developing individual solutions per source system, using a standardized framework provides a wide variety of benefits. The main one we have already mentioned: standardization.
Another benefit is that using the same concept for extracting data from different source systems makes your solution more auditable and reliable.
And when taking the differences between frameworks into consideration, other potential upsides become available as well. Read More

Implementing Data Vault 2.0 Zero Keys

In a previous blog post, we discussed how to implement ghost records within a Data Vault 2.0 solution. This time around, we’d like to talk about “the other” concept, namely zero keys, which oftentimes are referenced interchangeably with ghost records.

 

As discussed in the previous part of this series, a ghost record is a dummy record in satellite entities containing default values. Simply put, zero keys are the entries in each hub and link entity that act as counterparts to the satellite’s ghost record, containing its hash key. In this manner, the term “zero key” is often used to describe the ghost record’s hash key, which might show up in other Data Vault entities such as Point-in-Time (PIT) tables or links. Accompanying the zero hash key is, similar to a ghost record, a default value for the business key, or, in the case of a composite business key, a default value for each of its components.
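As a minimal sketch, a zero-key entry in a hub could look like this (the hash key value and defaults are illustrative assumptions):

```sql
-- Hypothetical zero-key record in a hub, acting as the counterpart to the
-- satellite's ghost record.
INSERT INTO hub_customer (hk_customer, customer_bk, load_date, record_source)
VALUES (
    '00000000000000000000000000000000',  -- zero key: all-zero hash key
    '(unknown)',                         -- default value for the business key
    '1900-01-01',                        -- earliest possible load date
    'SYSTEM'                             -- technical record source
);
```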

Read More

Using Multi-Active Satellites the Correct Way (2/2)

In our first post about multi-active satellites, we briefly explained different implementations that can be used to solve multi-activity. Now, we’re going to go into more detail regarding the advantages and disadvantages of these approaches having delta checks on or off.

 

Short summary of Multi-Active Satellites

Multi-active satellites allow you to store multiple active records per business key in Data Vault 2.0. To illustrate the need for this solution, let’s look at the common occurrence of a source system that doesn’t provide the needed metadata, such as when working with XML files.
One solution is to create a multi-active satellite by adding a subsequence number per business key. This accounts for any instance in which no multi-active attribute is delivered by the source itself. Regarding phone numbers, this attribute could be a tag for a business, home or mobile phone number. Another possibility is to create an extra hub for the multi-active attribute. Though, since it doesn’t represent a real business object, the first solution can be more effective.

Delta Check OFF

There are two ways to insert new records into a multi-active satellite: with delta checks active or inactive. With delta checks turned off, all records of a business key from your source delivery are inserted into the satellite.
The advantage is that loads are faster and have a consistent load date timestamp per parent hash key, independent of the multi-active attribute.
Later on, this simplifies querying the multi-active data (see figure 1). As a critical drawback, the ingested amount of data can increase strongly if full data loads are received.
In this case, you should partition your data by the load date timestamp.

Read More

Handling Validation of Relationships in Data Vault 2.0

In Data Vault 2.0, we differentiate data by keys, relationships and description.
That said, an often underestimated point is the handling of relationships in Data Vault 2.0.
In the following we explain what to consider and how to deal with it:

There are different ways to handle the validation of relationships from source systems depending on how the data is delivered, (full-extract or CDC), and the way a delete is delivered by the source system, such as a soft delete or hard delete.

First, let us explain the different kinds of deletes in source systems:

  1. Hard delete – A record is hard deleted in the source system and no longer appears in the system.
  2. Soft delete – The deleted record still exists in the source system’s database and is flagged as deleted.


Secondly, let’s explore  how we find the data in the staging area:

  1. Full-extract – This can be the current status of the source system or a delta/incremental extract.
  2. CDC (Change Data Capture) – Only new, updated or deleted records are delivered, loading the data in an incremental/delta way.

 

To keep the following explanation as simple as possible, our assumption is that we want to mark relationships as deleted as soon as we get the delete information, even if there is no audit trail from the source system (data aging is another topic).

Read More

Salesforce meets Data Vault


It’s a Match!

Data integration with Salesforce can be tricky and is in dire need of a business intelligence system that can handle that complexity.

Data Vault is capable of decoupling all the necessary business-driven changes, extensions and customizations to the platform while maintaining the ability to become the cornerstone of an integrated architecture. The decoupling is a part of our Data Vault Boot Camp and is summarized in Figure 1.

Scalefree can provide knowledge and implementation assistance in both Data Vault and Salesforce, making it the optimal partner for your Salesforce integration project.

Read More

Data Quality in the Data Vault Architecture


The Basis for solid decision-making

In making business decisions, whether daily or long-term, the quality of data is a critical factor in the decision-making process.

Thus, immediate access to the data and certainty about its quality can enhance business performance immensely. But the sad truth is that we see bad data in operational systems due to human-caused errors such as typos, ignored standards and duplicates, as well as a lack of input validation in operational systems, for example mandatory fields not being declared and references to other entities (primary/foreign key constraints) not being defined.

Read More

Write-backs in the Enterprise Data Warehouse Architecture


Many people already know the three-layer architecture of data warehouses used in Data Vault 2.0. The first layer is the staging area, which holds the raw data from the source systems. The second is the enterprise data warehouse layer, which in this case contains a Data Vault 2.0 model, and the third layer contains the information marts, which deliver the information in various structures (star schemas, snowflake schemas, etc.).

Figure 1. Data Vault 2.0 Architecture

This architecture provides possibilities and benefits for writing data back. Two options are writing back into the enterprise data warehouse and writing back into the source systems. This issue covers the write-back into the enterprise data warehouse, while an upcoming article will cover the write-back into the source systems.

Read More

Data Vault Use Cases Beyond Classical Reporting: Part 3

As we have shown in previous issues, Data Vault 2.0 enables individuals to implement reporting beyond the traditional methods.
In the first part, we demonstrated how to perform data cleansing in Data Vault 2.0.
And the second use case showed how to implement business process automation using Interface Marts.

The scalability and flexibility of Data Vault 2.0 enable a whole variety of use cases, e.g. optimizing and automating operational processes, predicting the future, pushing data back to operational systems as new input, or triggering events outside the data warehouse, to name a few. Read More

Satellite modeling for any structural changes in the source system


Modeling a Satellite in the instance of any structural changes within the source system

Over time, most source systems change.
The question is how to absorb these changes into a data warehouse based on Data Vault, especially when considering the satellites.
It is necessary to find a balance between reengineering effort and performance when the source table structure changes. To help those facing structural changes in source systems, this article presents our recommendations, based upon our knowledge base, for various types of changes in a source.

This article describes features embodied in the Data Vault 2.0 model: the foundation of hub, link, and satellite entities can adjust to changes in the source data easily, thus reducing the cost of reengineering the enterprise data warehouse. 

Read More