
Aggregate Measures Using Bridge Tables in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our trainer, Marc Winkelmann, dives into a question raised by our audience regarding aggregate measures in Data Vault:

“How and where to aggregate measures from a transactional system on a business date, considering late-arriving messages?”

In this informative video, Marc addresses the intricacies of aggregating measure values within bridge tables. The audience is treated to a comprehensive discussion on the techniques and considerations involved in efficiently handling aggregate measures in the context of bridge tables.

Marc provides valuable insights and practical tips, offering clarity on how to navigate the challenges associated with aggregating measures while utilizing bridge tables in your data architecture.

Meet the Speaker

Marc Winkelmann

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016, he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020, he became a Data Vault 2.0 Instructor for Scalefree.

Lambda Architecture vs. Data Vault 2.0 Architecture

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a question that delves into the comparison between the Data Vault 2.0 architecture and the Lambda architecture for real-time systems.

“We are currently comparing the Data Vault 2.0 architecture with the Lambda architecture for real-time systems. Can you elaborate on the similarities and differences?”

In this enlightening video, Michael provides a comprehensive exploration of the distinctions and commonalities between the Lambda architecture and the Data Vault 2.0 architecture. The audience gains valuable insights into the considerations, strengths, and potential use cases of each approach, aiding in informed decision-making for real-time system implementations.

Meet the Speaker

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Overloaded Links in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a thought-provoking question raised by our audience regarding data warehouse management.

“In our data warehouse currently, there are 3 hubs. Now, business users are seeking new information from another table, one that contains the results of tests conducted on the business objects represented by the 3 hubs. This new table features foreign keys corresponding to the 3 hubs. Interestingly, per row, only 1 foreign key is filled, followed by the associated test results.

To address this scenario, we considered attaching a satellite to each of the hubs and populating it with data only if the relevant foreign key is set. Another option we explored was modeling it as a link between the three hubs. However, given that there is no other table depending on it, we are inclined to lean towards option 1, attaching satellites to the hubs.”

This nuanced discussion on handling data relationships and the potential risks associated with overloading links is further explored in detail in the accompanying short video.

Meet the Speaker

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

User Spaces in Managed Self-Service BI in Data Vault

Watch the Video

In our engaging Data Vault Friday series, our CEO, Michael Olschimke, delves into an intriguing question posed by our audience.

“What is the purpose of a user space? Are there any variants?”

In this insightful video, we explore the intricacies of the user space concept within the realm of managed Self-Service Business Intelligence (BI). Michael Olschimke provides valuable insights and clarification, shedding light on the significance of user spaces and addressing potential variations that exist.

Whether you’re new to the concept or seeking a deeper understanding, this short yet informative video serves as a valuable resource in comprehending the role and nuances of user spaces in the context of BI.

Meet the Speaker

Profile picture of Michael Olschimke

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Multi-Temporality in Data Vault 2.0 – Part 2: A Practical Example

An abstract illustration features a large clock in the center with connected circuitry-like lines. Icons of a calendar and a smaller clock are in the background. Two large arrows pointing outward appear on both sides, and the backdrop has splashes of blue and gray.

Multi-Temporality

This article explores managing multi-temporality in Data Vault 2.0 using a travel industry scenario. It highlights how operational and business intelligence systems handle data differently and demonstrates how Data Vault 2.0 supports comprehensive historical tracking and multi-temporal analysis.

Multi-temporality in Data Vault 2.0

You may have heard about bi-temporal data already. But usually, there are more than just two timelines in your data, which complicates your work. You will typically find multiple timestamps and dates from different perspectives in your data sets, which open up multiple ways of looking at your data from a time perspective. You should also be able to handle this beast of a time machine. Did you know that Data Vault 2.0 is able to handle multi-temporal data? How does this influence your work, and how can you take advantage of it? Join this webinar to learn how Data Vault 2.0 can help you master multi-temporality.

Watch webinar recording

Multi-Temporal Example from the Travel Industry

In one of our previous blog posts, we gave an insight into the basics of multi-temporality in Data Vault 2.0. This article gives you a short practical example of the topic.

When you book your holidays, you might change something after the initial booking: the duration, the flight departure location, the hotel, the start and/or end date of your journey, or you might just book some extras. From an operational perspective, these changes are not a problem to deal with, and the expectations are not high. The data set for your journey is updated in the source system's database, and only the most recent data is shown in the graphical user interface. The employee might see some logs in the history section, if they exist. That is usually enough to keep the operational side working.

In the business intelligence world, the way we deal with data is a bit different. There, we want to see everything. Each time a record in the source system is updated, we capture all the changes and store them in a separate database. This means that we introduce a new timeline, which provides us with new opportunities but also comes with new challenges. Let's have a look at some example data sets of a source system over time:

Day 1: Initial booking of a holiday trip

Day 2: Correction of the start and end date and a small price adjustment

Day 3: Correction of the start and end date and a small price adjustment

When you have a look into the source system today, you will see that booking B4711 is cancelled, or maybe completely gone, and B4712 is the active one (ID = 2, created on day 3). From an analytical perspective, you might have more questions you want answered.
For example:

  • What was the estimated revenue for July, looking from February?
  • What was the estimated revenue for July, looking from March?
  • Which dates do I need to use to answer the two questions above?
  • In which month do we have the most bookings for the next summer season?
  • Can we reproduce a report we created some months ago?
  • What if some business dates change? What affects my reports in the past?
  • Where can we prepare the data to answer all these questions? In the data warehouse? In the front-end BI tool? Or somewhere else?
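To make the first two questions more tangible, here is a minimal sketch in Python/pandas. The bookings, dates, and prices are purely hypothetical and just one possible reading of the example above; the point is that filtering on the load date answers "what did we know at that time?", while the business date decides which month the revenue belongs to.

```python
import pandas as pd

# Hypothetical satellite content for the example above: every change arrives as a
# new row with its own load date (the data warehouse timeline) next to the
# business dates delivered by the source system.
satellite = pd.DataFrame([
    # load_date     booking   journey_start  price    status
    ("2024-02-01", "B4711", "2024-07-10", 1500.0, "active"),     # day 1: initial booking
    ("2024-03-01", "B4711", "2024-07-12", 1550.0, "active"),     # day 2: date and price correction
    ("2024-04-01", "B4711", "2024-07-12", 1550.0, "cancelled"),  # day 3: old booking cancelled ...
    ("2024-04-01", "B4712", "2024-07-15", 1600.0, "active"),     # ... and rebooked under a new id
], columns=["load_date", "booking", "journey_start", "price", "status"])

def estimated_july_revenue(as_of: str) -> float:
    """Estimated revenue for journeys starting in July, as known on the given load date."""
    known = satellite[satellite["load_date"] <= as_of]              # only rows loaded by then
    latest = (known.sort_values("load_date")
                   .groupby("booking", as_index=False)
                   .last())                                         # most recent version per booking
    active = latest[latest["status"] == "active"]
    july = active[active["journey_start"].str.startswith("2024-07")]
    return float(july["price"].sum())

print(estimated_july_revenue("2024-02-15"))  # view from February -> 1500.0
print(estimated_july_revenue("2024-03-15"))  # view from March    -> 1550.0
```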

In our webinar “Multi-Temporality in Data Vault 2.0”, we provide answers to all of these questions using Data Vault entities.

If you are interested in seeing more use cases related to multi-temporality, we offer a two-day class dedicated to this topic: the “Multi-Temporal Data Vault 2.0 Class”.

Conclusion

Incorporating multi-temporality into Data Vault 2.0, as demonstrated through the travel industry example, allows organizations to effectively capture and analyze data changes over multiple timelines. This approach not only preserves historical accuracy but also enables comprehensive insights into business processes. By leveraging Data Vault 2.0’s capabilities, businesses can address complex temporal questions, such as estimating revenue forecasts from different points in time and reproducing past reports with precision. This methodology ensures a robust framework for managing evolving data landscapes and supports informed decision-making.

by Marc Winkelmann (Scalefree)

What is Data Vault 2.0? – Agile Big Data Management

Watch the Webinar

In this webinar, we give a general overview of the concepts of Data Vault 2.0. Learn what the common problems in data warehouse projects are and how a Data Vault solution can solve them.

Explore the architecture and the different layers needed for a successful implementation of an enterprise data warehouse.

Get an insight into the basic entity types of Data Vault 2.0 and the benefits of this modeling approach.

In addition, we will talk about the methodology in an agile project environment.

Data Vault 2.0 brings many new features to help anyone involved in business intelligence enter a new era of data warehousing.

Register now and build confidence in data across your organization.

Watch Webinar Recording

Webinar Agenda

1. What is your experience with Data Vault 2.0?
2. Which phase of your Data Warehouse project are you currently in?
3. What is your next step?

Why Should I Use Data Vault 2.0?


Why Data Vault 2.0?

This blog article discusses some of the features where Data Vault 2.0 can shine and why you should use it. As a database developer, administrator, or user, you have probably already heard of Data Vault. But with an existing database solution in mind, you might ask why you should use Data Vault in your enterprise data warehouse.

Reason 1: Supporting multiple source systems and frequently changing relationships

Data Vault 2.0 provides the most benefits when your data comes from many source systems or involves constantly changing relationships.

It works well for such systems because it makes adding attributes simple. If there is a change in one source system, that change does not need to ripple through all the others. Similarly, you can limit the number of places where changes are made, because attributes are stored separately from structural data in satellites.

Additionally, it is easier to account for new and changing relationships by closing off one link and creating another. You don’t have to change the historical data to account for a new relationship or update an existing schema. You only need to account for the changes going forward. This brings enormous flexibility and scalability into your enterprise data warehouse.

Reason 2: Easily track and audit your data with Data Vault 2.0

In a modern data environment, data passes through many different stops. To still ensure continuous data quality, it must always be clear where the data has come from.

Data Lineage with Data Vault 2.0

Figure 1: Data Lineage

Data Vault makes it easy to track and audit your data because auditing is built in: a load timestamp and a record source are required for every row. All historical changes are tracked in satellites, with the load timestamp forming part of the primary key.
When an attribute is updated, a new record is inserted instead of overwriting the old one.
This built-in auditing makes it easy to satisfy both regulatory and data governance requirements. And because all of the history is stored, you can access the data as it was at any point in time.
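As a rough sketch of this insert-only pattern (not the official Data Vault loading template, and with hypothetical column names): a satellite load compares the incoming rows against the latest known version per hash key and appends only rows that actually changed, each carrying a load timestamp and a record source.

```python
import hashlib
from datetime import datetime, timezone

def hashdiff(attributes: dict) -> str:
    """Hash over the descriptive attributes, used to detect changes."""
    payload = "|".join(str(attributes[key]) for key in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def load_satellite(satellite: list, staged_rows: list, record_source: str) -> None:
    """Insert-only delta load: append a new version only when the attributes changed."""
    load_date = datetime.now(timezone.utc)
    # latest existing hashdiff per hub hash key, determined by the load date order
    latest = {}
    for row in sorted(satellite, key=lambda r: r["load_date"]):
        latest[row["hub_hash_key"]] = row["hashdiff"]
    for staged in staged_rows:
        new_hashdiff = hashdiff(staged["attributes"])
        if latest.get(staged["hub_hash_key"]) != new_hashdiff:      # delta check
            satellite.append({
                "hub_hash_key": staged["hub_hash_key"],
                "load_date": load_date,            # part of the satellite's primary key
                "record_source": record_source,    # auditability: where the row came from
                "hashdiff": new_hashdiff,
                **staged["attributes"],
            })
```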

Reason 3: Load data from multiple source systems more quickly with Data Vault 2.0

Data Vault enables quicker data loading simply because a number of tables can be loaded at the same time, in parallel. The model reduces dependencies between tables during the load process and simplifies ingestion by using insert-only loads, which are faster than upserts or merges. This also reduces complexity.

Conclusion

Data Vault 2.0 is designed to integrate data from many source systems, and in some situations it may be more than you actually need.

In summary, if you have a small to medium-sized analytics requirement, with a small team of architects, designers, and engineers delivering a solution with data sourced from only a few systems, then Data Vault may be inappropriate for your needs.

If, however, you have a large project with many source systems leading to an enormous data integration challenge, then Data Vault can potentially add massive value to the project.

Multi-Temporality in Data Vault 2.0 – Part 1

An abstract illustration features a large clock in the center with connected circuitry-like lines. Icons of a calendar and a smaller clock are in the background. Two large arrows pointing outward appear on both sides, and the backdrop has splashes of blue and gray.

Multi-Temporality in Data Vault 2.0

The following article gives an overview of the theoretical understanding of Multi-Temporality in a data warehouse.

Multi-temporality in Data Vault 2.0

You may have heard about bi-temporal data already. But usually, there are more than just two timelines in your data, which complicates your work. You will typically find multiple timestamps and dates from different perspectives in your data sets, which open up multiple ways of looking at your data from a time perspective. You should also be able to handle this beast of a time machine. Did you know that Data Vault 2.0 is able to handle multi-temporal data? How does this influence your work, and how can you take advantage of it? Join this webinar to learn how Data Vault 2.0 can help you master multi-temporality.

Watch webinar recording

What is “Multi-Temporality” in a Data Warehouse?

Before we start talking about multi-temporality, let's first define the term bi-temporality, since a common misconception is that Data Vault 2.0 is only bi-temporal (it is not):

“Bitemporal Modeling is a specific case of Temporal database information modeling technique designed to handle historical data along two different timelines. This makes it possible to rewind the information to “as it actually was” in combination with “as it was recorded” at some point in time.” (According to Wikipedia)

Bi-temporality just addresses two timelines, which are commonly referred to as “System Time” (the technical timeline) and “Valid Time” (the business timeline). Data Vault Satellites, Point-in-Time tables (PIT), and Bridge tables are able to address multiple active timelines in the same record. Let’s categorise just some of them:

  • Source Driven times
    • Created time
    • Updated time
    • Deleted time
  • System times
    • CDC time
    • Message event time
  • Business times
    • Any times that represent when something happened or will happen in the “real world”, such as a purchase or sale timestamp.
  • Time spans
    • Can be technical or business-driven
    • Contract start and end dates/times
    • Technical valid from and valid to dates/timestamps
  • Enterprise Data Warehouse (EDW)
    • Load date timestamp (set in the first layer of the EDW during the insert)
    • Timestamp when a record is written in the table

All of these dates and timestamps could be found in just one record in a Satellite table. This allows us to look at the data from multiple time perspectives. Therefore, the Data Vault Model accounts for multi-temporality and not just bi-temporality.
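As a small, purely hypothetical illustration, a single satellite record might carry several of these timestamps side by side, each serving a different time perspective:

```python
from datetime import datetime

# Hypothetical satellite record: one row, many timelines.
satellite_record = {
    "hub_hash_key":  "8f2e...",                       # key of the business object
    "load_date":     datetime(2024, 3, 1, 6, 0),      # EDW timeline, set on insert
    "record_source": "crm.bookings",
    # source-driven times
    "created_at":    datetime(2024, 2, 27, 14, 32),
    "updated_at":    datetime(2024, 2, 29, 9, 15),
    # system times
    "cdc_timestamp": datetime(2024, 2, 29, 9, 15, 4),
    # business times
    "booking_date":  datetime(2024, 2, 27),
    "journey_start": datetime(2024, 7, 12),
    "journey_end":   datetime(2024, 7, 22),
}
```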

The Load Date Timestamp with Multi-Temporality

One requirement for realizing multi-temporality on the data is that the Load Date Timestamp is used for the delta check when loading data into Satellites. Only the Load Date Timestamp provides us with a consistent, gapless, and non-overlapping timeline that is under our control. This allows us to have an unrestricted view of the multiple timelines in Satellites.

No other timestamp qualifies. First, other timestamps would restrict the number of possible perspectives on the data to a single instance. Additionally, they can have gaps and overlaps, can be NULL, and are not controlled by the Enterprise Data Warehouse team.

In short: We will never get rid of the Load Date Timestamp which is set during the insert in the first layer of the Enterprise Data Warehouse architecture and pushed through all the layers as far as possible (think of aggregates in the Business Vault over multiple Load Date Timestamps).

3 Different Perspectives on Data

The core Data Vault is differentiated into the Raw Data Vault (RDV) and the Business Vault (BV). The reason is to separate soft business rules from hard business rules, as soft business rules can change the content of the data. The consequence is that the number of possible perspectives on the raw data is reduced when soft business rules are applied early in the loading architecture. The same principle applies to timelines: timeline-driven business perspectives on raw data happen, at the earliest, in the Business Vault.

There are basically three different perspectives related to timelines in the data warehouse: A data warehouse perspective, a business perspective, and an information delivery perspective. 

The data warehouse perspective relies on the Load Date Timestamp to achieve a consistent, incremental integration of the data into the Raw Data Vault and Business Vault.

The business perspective relates to all dates and timestamps delivered by the source system. Technical fields, such as the created, updated, or deleted date/timestamp from the source system, are treated in the same way. Everything that is part of the payload is handled as descriptive data during the Raw Data Vault loading.
Different queries can then create all possible views of the raw data; for example, aggregates based on the most recent record per Business Key and grouped by a sales date.

The information delivery perspective relies on a snapshot to “freeze” all the data as it was active at a point in time. That said, the interpretation of what “active” means can be different.

To address this, multiple perspectives can be created. That’s also the reason why we talk about the single version of the facts in the Raw Data Vault and multiple versions of the truth in the Business Vault (different perspectives on raw data = different truths from different standpoints).

This could, for example, be an hourly, daily, weekly, monthly, or yearly snapshot or timespan. The Data Vault entities that are used here are the PIT and Bridge tables. The current delta of master data like customer data in a Satellite can be “frozen” based on a daily snapshot in a PIT table. Also, transactional data kept in a Non-Historized Link can be attached to an hourly snapshot in a Bridge table.
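A minimal sketch of that freezing logic, with hypothetical structures and values: for every snapshot date, the PIT records which satellite version, identified by its load date, was the active one per business key.

```python
import pandas as pd

# Hypothetical satellite history: one row per change, keyed by hash key and load date.
satellite = pd.DataFrame([
    ("hk_cust_1", "2024-03-01", "Berlin"),
    ("hk_cust_1", "2024-03-04", "Hamburg"),
    ("hk_cust_2", "2024-03-02", "Munich"),
], columns=["hub_hash_key", "load_date", "city"])

def build_pit(satellite: pd.DataFrame, snapshot_dates: list) -> pd.DataFrame:
    """Daily PIT: per snapshot date and business key, keep the latest load date <= snapshot."""
    frames = []
    for snapshot in snapshot_dates:
        known = satellite[satellite["load_date"] <= snapshot]
        latest = known.groupby("hub_hash_key", as_index=False)["load_date"].max()
        latest["snapshot_date"] = snapshot
        frames.append(latest)
    return pd.concat(frames, ignore_index=True)

pit = build_pit(satellite, ["2024-03-02", "2024-03-05"])
print(pit)  # pairs of (hub_hash_key, load_date) frozen per snapshot date
```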

How that exactly looks will be shown in the next part of the multi-temporal newsletter series. To enhance your understanding of these data perspectives, you can also explore our Multi-Temporal Data Vault class. 

Conclusion

Incorporating multi-temporality into Data Vault 2.0 enables organizations to manage and analyze data across various timelines, providing a comprehensive view of historical changes from multiple perspectives. This approach enhances the ability to track and understand data evolution, leading to more informed decision-making and improved data governance. By effectively handling multiple temporal aspects, Data Vault 2.0 ensures a robust and flexible framework for capturing the complexities of time-variant data.

Meltano Open Source Production Grade Data Integration – Part 2


Meltano Open Source Production Grade Data Integration

In this article, we delve into the practical application of Meltano, an open-source platform designed for building, running, and orchestrating ELT pipelines. Building upon our previous discussion of Meltano’s architecture, this installment guides you through the process of creating a data integration pipeline, from initializing a project to configuring extractors and loaders, and utilizing the Meltano UI for streamlined management.

Meltano in action

In our last overview, we talked about Meltano and its architecture. Now, we would like to illustrate the ease with which you can use Meltano to create a data integration pipeline.
Before we start, please ensure that you have already installed Meltano on your machine. If you haven't yet, you can follow Meltano's official installation guide.

First, we will initialize a Meltano project.
Initialize a new project in a directory of your choice by running “meltano init”. This will create a new directory containing, among other things, your “meltano.yml” project file.
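For example (the project name is just a placeholder):

```bash
meltano init my-meltano-project
cd my-meltano-project
```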


The Meltano project folder

The project folder is the single source of truth for all your data needs. It is a simple directory that can be put under version control.

The main part of your project is the meltano.yml project file. This file defines your plugins, pipelines and configuration.

The project and the meltano.yml file can be managed using the Meltano CLI and can readily be containerized for Docker/Kubernetes deployment. Note, however, that no plugins or pipeline schedules have been defined yet. We will do this in the next step.

Adding an extractor and loader

Now, let’s initialize the pipeline’s components. The first plugin you’ll want to add is an extractor which will be responsible for pulling data out of your data source.

To find out whether an extractor for your data source is supported out of the box, you can check the Extractors list on MeltanoHub or run “meltano discover”.

We will use the tap for GitLab in this example, as we don't need to create API credentials.
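Adding the extractor looks roughly like this:

```bash
meltano add extractor tap-gitlab
```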


Meltano manages setup, configuration and handles invocation.


We will configure the extractor to pull data from the repository meltano/meltano. Additionally, we will tell it to extract only data from 1 January 2021 onwards and to include only the “Tags” stream.
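The corresponding CLI configuration could look roughly like this (the setting names follow the tap-gitlab plugin on MeltanoHub; treat the exact values as an example):

```bash
meltano config tap-gitlab set projects meltano/meltano
meltano config tap-gitlab set start_date 2021-01-01T00:00:00Z
meltano select tap-gitlab tags "*"
```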


Data selection is much easier than with plain Singer! The extractor is now set up. Next, we will add a loader to store the data in a CSV file and define our destination path:
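Adding and configuring the loader could look roughly like this (the destination_path setting name depends on the target-csv variant you install):

```bash
meltano add loader target-csv
meltano config target-csv set destination_path output
mkdir output   # the target directory is not created automatically
```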


Remember that the destination directory needs to be created beforehand, as it will not be created automatically. This is what our meltano.yml now looks like:
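Since the original screenshot is not reproduced here, a trimmed-down sketch of how the relevant part of meltano.yml might look (plugin variants and exact keys depend on your Meltano version):

```yaml
version: 1
project_id: <your-project-id>
plugins:
  extractors:
  - name: tap-gitlab
    config:
      projects: meltano/meltano
      start_date: '2021-01-01T00:00:00Z'
    select:
    - tags.*
  loaders:
  - name: target-csv
    config:
      destination_path: output
```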

Instead of using the CLI, we can make changes directly in the YAML.

It is also possible to configure the extractors and loaders in the Meltano UI:

Start the Meltano UI web server using “meltano ui”. Unless configured otherwise, the UI will now be available at http://localhost:5000.


Run a pipeline

Now it’s time to run a pipeline. To run a one-time pipeline, we can just use the meltano elt command:
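For example:

```bash
meltano elt tap-gitlab target-csv
```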


Result

And we are done! It took us just ten commands to create a data integration pipeline.


Conclusion

It's clear why Meltano is a great choice for building your data platform: it is powerful yet simple to maintain, and its open-source model makes it flexible, budget-friendly, and reliable.

by Ole Bause (Scalefree)

Singer Open Source Production Grade Data Integration – Part 1

In our past blog post, we introduced Singer, an open source framework for ELT processes. This framework can be wrapped with another open source tool that adds further useful features to Singer, including installation, environment setup, monitoring, scheduling, and orchestration. At Scalefree, we moved all of our ELT pipelines into this framework on AWS and are pleased with the results.

Please note that there are a large number of platforms for managing data integration, but there is a lack of robust, easy-to-use, free open source solutions. The Meltano project aims to fill that gap. Meltano is a full-package data integration platform that challenges the most established players in the data space. It is built on top of the best open source tools for data integration and infuses them with DataOps best practices.

Meltano is the easiest way to build, run, and orchestrate ELT pipelines made up of Singer taps, targets, and dbt models. It is open source, self-hosted, version controlled, and containerized.

Meltano's open source model lets you easily adapt it to your own needs and reduces cost.
