
Utilizing Potentials of Data Vault 2.0 – Overcoming Bad Practices – Part 2

Watch the Webinar

What are common mistakes when applying Data Vault 2.0 in enterprise data warehouse projects? Do you have questions about modeling in Data Vault? Does implementing GDPR requirements cause you great difficulties, or is your project stuck because it delivers no business value?

This webinar describes common Data Vault anti-patterns, their consequences, and how to eliminate them from your current or future projects.

Tune in to learn how to avoid these bad practices and apply simple solutions.

Watch Webinar Recording

Webinar Agenda

1. How to use Data Vault for modeling business information
2. How to avoid the pitfalls of being unable to deliver business value
3. How to mask Business Keys from Hubs for privacy

Meet the Speaker

Picture of Lorenz Kindling

Lorenz Kindling

Lorenz works in Business Intelligence and Enterprise Data Warehousing (EDW), with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies across various industries for Scalefree International. Before joining Scalefree, he worked as a consultant in the field of data analytics, which gave him a comprehensive overview of data warehousing projects and the common issues that arise in them.

Get Started with Real-Time Processing in Data Vault 2.0 on Microsoft Azure

Data Vault 2.0 on Microsoft Azure

In this newsletter, you’ll get an overview of what real-time processing is and the possibilities it can open up for your Data Vault 2.0 implementation.

Real-Time Processing with Data Vault 2.0 on Azure

In this webinar, we’ll discuss the new requirements data warehouses face and explore real-time processing. We’ll cover various real-time processing architectures for an initial overview. The second part focuses on real-time data architecture with Data Vault 2.0 and includes a brief overview of Microsoft Azure. You’ll also see a real-time processing implementation of Data Vault 2.0 in Azure. This webinar is for anyone new to real-time data with Data Vault 2.0 who is interested in an overview and an implementation in Azure.

Watch Webinar Part 1
Watch Webinar Part 2

What to expect

You will learn that real-time processing lets you create value from data more quickly, keeps the most up-to-date data in your reporting tools, and enables more accurate data-driven decisions. As a result, your company can adapt to market changes faster because it sees developments right away in the most recent data.

Additionally, you can save costs by moving away from batch loading, because the peak of computing power normally required for it is reduced and the load is distributed more evenly throughout the day. This is especially true in cloud environments, where sizing for the overnight peak can be replaced by provisioning exactly the computing power that is needed at any given time.

The traditional way – batch-loading

Batch loading is a traditional method used to load data into a data warehouse system in large batches, mostly overnight. The data from data sources is delivered up to a certain time in the night to be transformed and loaded into the core data warehouse layer.

This method leads to a peak of data processing overnight, and organizations have to size their infrastructure for the expected maximum peak of required computing power.

The new way – real-time data

Real-time data is processed and made available immediately as it is generated, instead of being loaded in batches overnight. With real-time approaches, the loading window is extended to the full 24 hours, so the overnight peak and its disadvantages are gone.
In Data Vault 2.0, real-time data is always modeled as a non-historized link or as a satellite.
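As a rough sketch of what that looks like in practice (all table and column names here are illustrative assumptions, not prescribed structures), a non-historized link for streamed production-line readings could be defined as shown below; because each event never changes after it occurred, there is no hashdiff and no load end date:

    -- Non-historized (transactional) link receiving streamed events;
    -- every message is inserted exactly once and never updated.
    CREATE TABLE raw_vault.lnk_machine_reading (
        hk_machine_reading  CHAR(32)      NOT NULL,  -- hash over machine key + reading id
        hk_machine          CHAR(32)      NOT NULL,  -- reference to the machine hub
        reading_id          VARCHAR(50)   NOT NULL,
        reading_value       DECIMAL(18,4),
        reading_timestamp   TIMESTAMP,
        load_date           TIMESTAMP     NOT NULL,
        record_source       VARCHAR(100)  NOT NULL
    );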

Possible use cases for real-time data include vital-sign monitoring in healthcare, inventory tracking, user behavior on social media, and production line monitoring.

Different types of real-time data

There are different types of real-time data based on how frequently the data is loaded and the degree of urgency or immediacy of the data.

Near real-time data refers to data that is loaded in mini-batches at least every fifteen minutes, with the data stored in a cache until it is loaded into the data analytics platform.
Actual real-time data, also called message streaming, involves loading every single message directly into the data analytics platform without any cache.
This type of real-time data is useful when it is important to have data available as soon as it is generated for dashboards or further analytics.

The acceptable processing delay for real-time data is typically defined by the consequences of missing a deadline. Additionally, there are three types of real-time systems: hard real-time, soft real-time, and firm real-time.

Real-time processing types

Implementing real-time processing

So, how do you implement real-time data processing into your data warehouse solution? There are many architectures for that, but we will focus on the Lambda and Data Vault 2.0 architecture.

Generic real-time processing architecture

The lambda architecture separates data processing into a speed layer and a batch layer. The speed layer processes real-time messages with a focus on speed and throughput, while the batch layer provides accuracy and completeness by processing high volumes of data in regular batches. The serving layer integrates data from both layers for presentation purposes.

At first glance, the Data Vault 2.0 architecture seems similar to the lambda architecture, but it treats some aspects differently. From a Data Vault 2.0 perspective, the lambda architecture has issues such as implementing only a single layer in each data flow and lacking a defined layer for capturing raw, unmodified data for auditing purposes.

The Data Vault 2.0 architecture adds a real-time part called “message streaming” to the existing batch-driven architecture, with multiple layers implemented for capturing and processing real-time data, integrating it with the batch-driven flow at multiple points. Messages are pushed downstream from the publisher to the subscriber, loaded into the Raw Data Vault and forked off into the data lake. But the main process is the push inside the message streaming area. The architecture is able to integrate data from batch feeds or to stream the real-time data directly into the dashboard.

Using Microsoft Azure for real-time processing

Microsoft Azure is a cloud computing platform and set of services offered by Microsoft. It provides a variety of services, including virtual machines, databases, analytics, storage, and networking. These services can be used to create web and mobile applications, run large-scale data processing tasks, store and manage data, host websites and much more.

Microsoft Azure for real-time processing

The illustration describes a typical real-time architecture used by Scalefree consultants, which follows the conceptual Data Vault 2.0 architecture.

Data sources deliver data either in batches or real-time. This is loaded into the Azure Data Lake or accepted by the Event Hub beforehand. The Raw Data Vault Loader separates business keys, relationships and descriptive data using Stream Analytics and forwards the message to the Business Vault processor. The Business Vault processor applies transformation and other business rules to produce the target message structure for consumption by the (dashboarding) application. The results can be loaded into physical tables in the Business Vault on Synapse or be delivered in real-time without further materialization in the database. The target message is generated and sent to the real-time information mart layer implemented by a streaming dataset, which is consumed by PowerBI. The cache of the dashboard service will expire quickly, but the Synapse database has all data available for other uses, including strategic, long-term reporting.
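To make the streaming part more tangible, the following is a simplified sketch of a Stream Analytics query that fans an incoming Event Hub message out to both sinks; the input alias salesstream and the output aliases datalake and rdv_loader are assumptions for this example, and hash key calculation is assumed to happen downstream in the Raw Data Vault Loader:

    -- Fork the raw, unmodified message off into the data lake for auditing.
    SELECT *
    INTO [datalake]
    FROM [salesstream]

    -- Forward business keys, relationships and descriptive payload
    -- to the Raw Data Vault loader.
    SELECT
        CustomerNumber,
        OrderNumber,
        OrderAmount,
        System.Timestamp() AS LoadDate,
        'salesstream' AS RecordSource
    INTO [rdv_loader]
    FROM [salesstream]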

Conclusion

In conclusion, real-time data processing offers numerous benefits over traditional batch loading methods, including the ability to create value out of data quicker, have the most up-to-date information in reporting tools, and make more accurate decisions. By adapting to changes in the market quicker, companies can stay ahead of the competition. Moving away from batch loading can also save costs by reducing the peak of computing power required.

As mentioned before, the last illustration shows an architecture that the Scalefree Consultants implemented to make use of real-time data.

Read more on our recently released Microsoft Blog Article.

How is your current experience with real-time data processing?
Are you thinking about kick-starting your Data Vault by also using real-time data?
Or are you already using it and thinking about improving it further?

Let us know your thoughts in the comment section!

Data Vault on Databricks

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a pertinent question raised by the audience, unraveling the discourse around the compatibility of Data Vault 2.0 (DV2.0) with Databricks.

“There has been hype going on on LinkedIn about whether or not DV2.0 is suited to exist on Databricks. Many people disagree that it is. The most significant comments are ‘lots of joins,’ ‘performance getting data out,’ and ‘not suited for modern automation.’ The latter ties to tools creating generated code per object VS. parameterized pipelines.”

In this illuminating video, Michael delves into the discussions surrounding the suitability of Data Vault 2.0 in the Databricks environment. He provides insights into the concerns raised, such as the perceived challenges related to join operations, data retrieval performance, and alignment with modern automation practices.

Michael offers a balanced perspective, exploring the nuances of utilizing DV2.0 on Databricks and addressing the key considerations raised in the LinkedIn discussions.

Meet the Speaker

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Why Salesforce, Anyway?

Watch the Webinar

This webinar addresses the question “Why Salesforce, anyway?”. We will look at the three main advantages of Salesforce’s CRM system: integrability, extensibility, and customizability. Learn how Salesforce can help you transform your company and optimize your digital processes.

You will understand how Salesforce can seamlessly integrate and automate your workflows to save you time and resources. We will show you how quickly and easily Salesforce can be extended and customized to meet the specific needs of your company.

Register now and learn why Salesforce is the best choice for your company’s digital transformation.

Watch Webinar Recording

Webinar Agenda

1. CRM systems in the digital transformation → Shared Customer Insight (after Jeanne Ross)
2. Why Salesforce is such a good fit (Salesforce or MS Dynamics in the leading group)
3. Reason 1: Integrability
4. Reason 2: Extensibility
5. Reason 3: Customizability

Meet the Speaker

Picture of Markus Lewandowski

Markus Lewandowski

Markus Lewandowski has more than 6 years of Salesforce experience and is a certified Salesforce consultant at Scalefree. He helps customers across Europe implement, improve, and integrate Salesforce environments into their tech stacks.

Multi-temporal Source Data (SAP HRMS) in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke explores a valuable question from the audience, shedding light on the intricacies of modeling an SAP HRMS source with SCD type 2 data and dealing with time-dependent information in Data Vault 2.0.

“Could you please guide us on how to model an SAP HRMS Source that holds the data in SCD type 2 in the source itself with an effectivity start date and end date for each change? What will be the best way to deal with time-dependent data in Data Vault 2.0?”

In this enlightening video, Michael provides practical guidance on modeling strategies for incorporating SAP HRMS source data with Slowly Changing Dimension (SCD) type 2 attributes directly in the source. He addresses the complexities of handling time-dependent data within the Data Vault 2.0 framework, offering insights into the best practices for managing effectivity start and end dates for each change.

Michael shares valuable considerations and recommendations, providing a clear roadmap for efficiently handling time-dependent data scenarios in Data Vault 2.0 projects.

Meet the Speaker

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Quick Guide of a Data Vault 2.0 Implementation

Data Vault 2.0 Architecture

Data Vault 2.0 Implementation

Data Vault 2.0 is often assumed to be only a modeling technique, but it encompasses much more than that: it is a complete BI solution composed of an agile methodology, architecture, implementation, and modeling.

So why start using Data Vault?

  • Data Vault 2.0 allows you to build automated loading processes/patterns and generate models very easily
  • Platform independence
  • Auditability 
  • Scalability
  • Supports ELT instead of ETL processes

Now that we answered the why, you may be wondering what steps are needed to implement Data Vault 2.0 in your project.

It depends on a lot of factors like your business case, the architecture you want to have in place, how your sources are loaded, the sprint timeline of your project, etc.

Walk-through of a Data Vault 2.0 Implementation

It can be a bit overwhelming for beginners to start using Data Vault 2.0 and to figure out how and where to implement it. This webinar provides a very basic guide showing the steps needed to build a Data Vault 2.0 implementation from scratch, based on a business requirement. It is done with a demonstrated example, starting with gathering some sample requirements and ending with the finished, delivered product.

Watch Webinar Part 1
Watch Webinar Part 2

Data Vault 2.0 feature by feature architecture

One thing is for sure: the architecture should be built vertically, not horizontally. This means not layer by layer but feature by feature. 

A common approach here is the Tracer Bullet approach. Based on business value, which is defined by a report, a dashboard, or an information mart, the source data needs to be identified, modeled, and loaded through all layers of the architecture. 

For example, let’s say the business request was to build a dashboard to analyze the company’s sales:

1. Extract

First, we need to extract the data from the source systems and land it somewhere as-is. In this example we put it in a Transient Staging Area, but you could also choose a persistent one in a Data Lake.
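A minimal sketch of such a staging load, with illustrative table and column names, could look like this; no business logic is applied, only load metadata is added:

    -- Transient staging table: source data landed as-is plus load metadata.
    CREATE TABLE stage.sales_transactions (
        order_number     VARCHAR(50),
        customer_number  VARCHAR(50),
        order_amount     DECIMAL(18,2),
        order_date       DATE,
        load_date        TIMESTAMP,
        record_source    VARCHAR(100)
    );

    INSERT INTO stage.sales_transactions
    SELECT
        order_number,
        customer_number,
        order_amount,
        order_date,
        CURRENT_TIMESTAMP  AS load_date,   -- when the record arrived in the EDW
        'crm.sales'        AS record_source
    FROM source_crm.sales;                 -- extracted 1:1 from the source system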

2. Transform

Next, apply hard rules if necessary using a transformation tool. Be careful here: you do not want to perform business calculations at this stage. There are plenty of data warehouse automation tools to choose from: dbt, Coalesce, WhereScape, etc.
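As an illustration of what a hard-rule step could look like (names are assumptions, and MD5 simply stands in for whatever hash function your platform provides), the prepared staging view below only trims, casts, and calculates hash keys, without any business calculations:

    -- Hard rules only: trimming, casting and hash key calculation.
    CREATE VIEW stage.sales_transactions_prepared AS
    SELECT
        MD5(UPPER(TRIM(customer_number)))                 AS hk_customer,
        MD5(UPPER(TRIM(customer_number)) || '||' ||
            UPPER(TRIM(order_number)))                    AS hk_sales_transaction,
        UPPER(TRIM(customer_number))                      AS customer_number,
        UPPER(TRIM(order_number))                         AS order_number,
        CAST(order_amount AS DECIMAL(18,2))               AS order_amount,
        load_date,
        record_source
    FROM stage.sales_transactions;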

Data Vault 2.0 Architecture

3. Load

Load your Raw Stage into the Raw Vault.
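A typical insert-only loading pattern for the Customer Hub (assuming the entities modeled in the next step) could be sketched like this; only business keys that are not yet present are added:

    -- Insert-only hub load: add unknown business keys, never update or delete.
    INSERT INTO raw_vault.hub_customer (hk_customer, customer_number, load_date, record_source)
    SELECT DISTINCT
        stg.hk_customer,
        stg.customer_number,
        stg.load_date,
        stg.record_source
    FROM stage.sales_transactions_prepared AS stg
    WHERE NOT EXISTS (
        SELECT 1
        FROM raw_vault.hub_customer AS hub
        WHERE hub.hk_customer = stg.hk_customer
    );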

4. Model Business Requirements

Model the Data Vault entities needed to fulfill the business requirement. If we have sales transactions and customer data, for example, we will model a Non-historized Link (also known as a Transactional Link) and a Customer Hub, along with any Satellites holding the descriptive customer data that we want to see in the Sales Dashboard.
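A possible shape for those entities, again with illustrative names only, is sketched below; the sales transactions themselves would go into a non-historized link alongside these tables:

    -- Customer Hub: one row per distinct business key.
    CREATE TABLE raw_vault.hub_customer (
        hk_customer      CHAR(32)      NOT NULL,  -- hash of the business key
        customer_number  VARCHAR(50)   NOT NULL,  -- business key
        load_date        TIMESTAMP     NOT NULL,
        record_source    VARCHAR(100)  NOT NULL
    );

    -- Customer Satellite: descriptive data, historized by load_date.
    CREATE TABLE raw_vault.sat_customer_details (
        hk_customer       CHAR(32)      NOT NULL,
        load_date         TIMESTAMP     NOT NULL,
        hashdiff          CHAR(32)      NOT NULL,  -- detects changes in the payload
        customer_name     VARCHAR(200),
        customer_segment  VARCHAR(50),
        record_source     VARCHAR(100)  NOT NULL
    );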

5. Apply Business Logic

Next, we need to perform some calculations and aggregations, so we build business logic on top of the raw entities and load the results into the Business Vault.
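For example, a simple Business Vault entity holding the sales KPIs needed by the dashboard could be sketched as a view (or materialized as a table) on top of the Raw Vault; all names are assumptions:

    -- Business logic on top of raw entities: aggregate sales per customer.
    CREATE VIEW business_vault.customer_sales_kpi AS
    SELECT
        hub.hk_customer,
        SUM(lnk.order_amount)  AS lifetime_revenue,
        COUNT(*)               AS order_count,
        MAX(lnk.order_date)    AS last_order_date
    FROM raw_vault.hub_customer            AS hub
    JOIN raw_vault.lnk_sales_transaction   AS lnk
      ON lnk.hk_customer = hub.hk_customer
    GROUP BY hub.hk_customer;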

6. Build an Information Mart

We could use the data stored in the Raw and Business Vault directly in charts and dashboards, but we want to structure it so business users can read and query it easily. Therefore, we build an information mart with a star schema model consisting of a fact table and dimensions.
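A minimal star schema on top of the vault could then be exposed as views (illustrative names; latest-record selection and PIT tables are omitted for brevity):

    -- Dimension: customer attributes from hub and satellite.
    CREATE VIEW info_mart.dim_customer AS
    SELECT
        hub.hk_customer   AS customer_key,
        hub.customer_number,
        sat.customer_name,
        sat.customer_segment
    FROM raw_vault.hub_customer              AS hub
    LEFT JOIN raw_vault.sat_customer_details AS sat
           ON sat.hk_customer = hub.hk_customer;

    -- Fact: one row per sales transaction from the non-historized link.
    CREATE VIEW info_mart.fact_sales AS
    SELECT
        lnk.hk_customer   AS customer_key,
        lnk.order_number,
        lnk.order_date,
        lnk.order_amount
    FROM raw_vault.lnk_sales_transaction AS lnk;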

7. Visualize Data

To build the Sales Dashboard in a BI visualization tool like PowerBI or Tableau, we now fetch directly from the star schema in the information mart, which has all the information we need, using a connection to the data warehouse database.

Data Vault 2.0 offers an agile, scalable, and flexible approach to Data Warehousing Automation. As demonstrated in the example, we only modeled the Data Vault tables that were necessary for accomplishing the handed task of building a Sales dashboard. This way you can scale up your business by demand, so you don’t have to figure out and map out the whole enterprise in one go. 

The answer to how to implement Data Vault 2.0 can be translated into a simple phrase: Focus on business value!

If you would like to see an explanation of this step-by-step implementation with some demonstration of actual data using dbt as the chosen transformation tool, check out the webinar recording.

Conclusion

Implementing Data Vault 2.0 involves a structured approach that begins with extracting data from source systems into a staging area, followed by minimal necessary transformations, and loading into the Raw Vault. Subsequently, business requirements guide the modeling of Data Vault entities, application of business logic, construction of information marts, and data visualization. This feature-by-feature methodology ensures scalability and flexibility, allowing organizations to focus on delivering business value incrementally. By aligning development efforts with specific business needs, enterprises can efficiently build and expand their data warehousing solutions.

EDW Environments in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a crucial question from the audience that highlights a common challenge in data projects.

“I’m currently working on a project where the ‘environments’ (Dev, Prod, Test) are not well administrated. This topic is not mentioned at all in the DV2.0 methodology. Could you please elaborate on the roles of these environments and how to correctly use and manage them? As context, the problem faced at the moment by the company is that they’re not being able to test correctly and then implement. Also, the environments don’t necessarily count with the same information.”

In this insightful video, Michael provides a comprehensive discussion on the roles and importance of environments (Development, Production, Test) in the context of Data Vault 2.0 methodology. He addresses the challenges faced by the company, emphasizing the critical role that well-administered environments play in testing, implementing, and ensuring data consistency across different stages.

Michael shares practical insights into the correct utilization and management of environments, offering guidance on establishing a robust environment strategy within the Data Vault framework.

Meet the Speaker

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Automating Business Logic

Watch the Webinar

In this webinar, you’ll learn that Data Vault automation is not restricted to loading data, but can also be applied to the presentation layer.

There’s always some repeatable business logic – think of calculations such as currency conversion, Lifetime Value (LTV), or Net Present Value (NPV) – to feed different reports, even if all of them contain different information.

We’ll explain how you can create custom business templates and add additional layers in the information marts, to apply calculations repeatedly and even interdependently, thereby extending the scope of Data Vault automation from integration to presentation.
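To give a feel for the kind of repeatable logic such a template would produce, here is a hand-written sketch of a currency conversion layer (table, column, and schema names are assumptions for illustration; an automation tool like VaultSpeed would generate the equivalent code from a template):

    -- Repeatable business rule: convert order amounts to EUR using a rate table.
    CREATE VIEW business_vault.sales_amount_eur AS
    SELECT
        lnk.hk_sales_transaction,
        lnk.order_amount,
        lnk.currency_code,
        lnk.order_amount * fx.rate_to_eur  AS order_amount_eur,
        lnk.load_date
    FROM raw_vault.lnk_sales_transaction AS lnk
    JOIN reference.fx_rates              AS fx
      ON  fx.currency_code = lnk.currency_code
      AND lnk.order_date BETWEEN fx.valid_from AND fx.valid_to;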

This webinar focuses on practical solutions.

Watch Webinar Recording

Webinar Agenda

1. How to get data out of a Data Vault.
2. What’s a PIT, what’s a bridge?
3. What’s meant by virtualization?
4. How to identify low-hanging fruits, i.e. the repeatable business logic in your solution.
5. How to automate those business rules using VaultSpeed.

Meet the Speakers

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!


Jonas De Keuster

Jonas is VP Product Marketing at VaultSpeed. Before joining the company, he gained close to 10 years of experience as a DWH consultant in industries such as banking, insurance, healthcare, and HR services. This background helps him understand current customer needs and engage in conversations with members of the data industry.

Loading Historical Data in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our BI Consultant Julian Brunner delves into a question from the audience that addresses a common challenge.

“One of our sources delivers all the historical data in one batch. So all the records have the same load date. How can I load the data into the EDW properly?”

In this insightful video, Julian shares practical solutions and strategies for loading historical data into an Enterprise Data Warehouse (EDW) when faced with the unique scenario of receiving all records with the same load date. The question prompts a discussion on best practices to ensure proper handling and integration of historical data within the Data Vault framework.

Julian provides valuable insights into the considerations and steps involved in effectively managing historical data loads, offering guidance on maintaining data integrity and accuracy within the EDW.

Data Vault 2.0 Source System Disaster Recovery

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a challenging question from our audience, aiming to find an elegant solution to a complex scenario.

“I’m trying to find an elegant way of addressing the following problem.

You have a DV2.0 Insert Only BI deployment fed by multiple OLTP systems. One of these OLTP systems will be subject to a disaster and associated recovery process. This will be done with a loss of 3h worth of data from the OLTP in question. During the 3 hours, multiple loads into the DV were completed.

I’m trying to avoid an effectivity satellite for each hub.”

In this insightful video, Michael explores strategies for handling data from multiple source systems with disaster considerations in a Data Vault 2.0 Insert Only BI deployment. The question prompts a discussion on avoiding the use of an effectivity satellite for each hub, offering alternative approaches to address the challenges posed by data loss during disaster recovery.

Michael shares practical insights and considerations for designing resilient solutions within the Data Vault framework while optimizing the balance between complexity and efficiency.

Meet the Speaker

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Data Vault 2.0 Project Tracking

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke addresses a pertinent question from our audience regarding the application of Scrum in Data Vault 2.0.

“We are struggling with the application of Scrum in Data Vault 2.0: the Kanban board is overloaded with technical user stories. However, in theory, the user stories should be oriented towards the business and user needs.”

In this insightful video, Michael delves into the challenges faced when integrating Scrum methodologies into Data Vault 2.0 projects, particularly the issue of an overloaded Kanban board with technical user stories. The question prompts a discussion on the alignment of user stories with business and user needs, emphasizing the importance of maintaining a business-centric focus.

Michael shares practical insights and recommendations for optimizing the use of Kanban boards in Data Vault 2.0 projects, ensuring a balance between technical requirements and business-oriented user stories.

Meet the Speaker

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
