Mastering Metadata: Data Catalogs in Data Warehousing with DataHub

Mastering Metadata in Data Warehousing

In today’s data-driven world, it is essential to manage and organize large amounts of data efficiently. Businesses across all industries have to contend with more data than ever before. Introducing and developing an enterprise data warehouse naturally plays a central role, but it does not solve a major challenge: how do you effectively organize and manage the data, and especially the metadata, in an Enterprise Data Warehouse? This is where the concept of data catalogs comes into play and where tools like DataHub become essential.

A data catalog serves as a comprehensive inventory of data assets in an organization, providing context, annotations, and metadata to facilitate the understanding and discovery of data. It’s like a map to your data, helping users navigate the complex data landscape to find the exact data they need.

A data catalog can help users to understand where to find specific data in the data warehouse that fits their needs and to investigate where it came from, as well as how it might be connected to other data. This can greatly simplify tasks like data analysis and reporting, making the data warehouse more accessible and usable for everyone in the organization.

Mastering Metadata: Data Catalogs in Data Warehousing with DataHub

Don’t miss our webinar about data catalogs! The session explores in detail the critical role of data catalogs in data warehousing, with an exclusive focus on the powerful tool DataHub. You’ll gain practical insights into enhancing data discovery, metadata management, data lineage, and data governance. Watch the recording and turn your data management strategy into a competitive advantage.

Watch webinar recording

Understanding Data Catalogs

What is a Data Catalog?

In general, a data catalog is a metadata inventory: organized and structured metadata about all data assets in an organization. It is a central place where all this metadata can be stored, combined, and categorized, which makes it much easier to discover and understand the corresponding data, for example in a data warehouse. A data catalog also provides search functionality to find specific data among the indexed datasets. It serves as a single source of truth for your metadata, enabling users to trust the data they’re using for their analyses or business decisions.

Role of a Data Catalog in Data Warehousing

In the context of data warehousing, a data catalog brings many benefits. It provides a way to explore and search all data stored in the data warehouse. Technical users as well as business users can discover relevant data, understand its context, and ensure it is up to date, reliable, and accurate. The following figure shows where data catalogs fit into data warehousing with Data Vault 2.0. A data catalog should cover the entire enterprise BI solution, including, for example, a data lake, if available, and the information delivery layer.

Data Vault 2.0 General Architecture

Now that we understand what a data catalog is, let’s explore how a tool like DataHub can assist organizations with these tasks.

Introduction to DataHub

In the world of data catalogs, DataHub stands out as an increasingly popular choice for many businesses. DataHub is an open-source metadata platform developed at LinkedIn to address its growing need for a more dynamic and scalable data management tool; it was created in part because the existing tools could not keep up with LinkedIn’s expanding needs.

As LinkedIn grew, so did its data volume, variety, and velocity. Recognizing the need for a more efficient way to manage its data, LinkedIn built and introduced DataHub in 2020. Open-sourcing DataHub allowed other organizations to benefit from this advanced tool, and it has since been adopted by many businesses looking for a modern, scalable data catalog solution.

DataHub supports both push-based and pull-based metadata ingestion and offers a wide range of integrations, for example Airflow, BigQuery, Databricks, dbt, Hive, Kafka, Looker, MSSQL, MongoDB, Oracle, S3, Power BI, Snowflake, Spark, and many more (the full list is in the DataHub documentation). This gives DataHub the ability to combine and display metadata about the same data from multiple sources, for example a dbt model definition, and whether its tests ran successfully, right next to the database schema and statistics for all columns.

DataHub statistics
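The idea of combining metadata about the same data from multiple sources can be illustrated with a minimal, self-contained Python sketch. This is not DataHub’s actual SDK or API, and the dataset names and values are invented for demonstration: metadata from a pull-based source (the warehouse) and a push-based source (dbt) are merged under a shared dataset identifier, so both views sit side by side.

```python
# Illustrative sketch only (plain dictionaries, not the DataHub API):
# combine metadata about the same dataset from two different sources,
# keyed by a dataset identifier (URN).

def merge_metadata(catalog: dict, urn: str, source: str, aspect: dict) -> dict:
    """Attach a metadata aspect from one source to a dataset entry."""
    entry = catalog.setdefault(urn, {})
    entry.setdefault(source, {}).update(aspect)
    return catalog

catalog = {}
urn = "urn:example:dataset:sales.orders"  # invented identifier

# Pull-based ingestion: schema and statistics read from the warehouse.
merge_metadata(catalog, urn, "warehouse",
               {"columns": ["order_id", "amount"], "row_count": 120_000})

# Push-based ingestion: model definition and test results emitted by dbt.
merge_metadata(catalog, urn, "dbt",
               {"model": "stg_orders", "tests_passed": True})

# Both sources are now visible side by side for the same dataset.
print(sorted(catalog[urn]))  # ['dbt', 'warehouse']
```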

Key Features and Capabilities of DataHub

DataHub, as a metadata platform, goes beyond traditional data catalogs. Its key features and capabilities include:

1. Scalability: DataHub is designed to handle metadata from thousands of datasets, which makes it a great choice for large organizations.

2. Flexible and Extensible Data Model: The data model behind the tool is designed to be customizable and expandable, allowing organizations to adapt it to their specific business requirements.

3. Powerful Search and Discovery: Leveraging Elasticsearch, DataHub offers robust search functionality that enables users to discover datasets quickly based on various attributes, such as the data’s origin, schema, and usage.

DataHub capabilities

4. Rich Metadata: Unlike traditional data catalogs, DataHub captures and presents a wide variety of metadata, including data lineage, operational metadata, and business metadata. This gives users a comprehensive understanding of their data.

5. Data Lineage and Relationships: DataHub automatically captures and visualizes data lineage, showing how data flows through various systems. It also displays relationships between datasets, allowing users to understand how different data assets interact with each other.

Lineage and relationships in DataHub
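Conceptually, lineage is a directed graph of datasets. The following Python sketch (the dataset names and edges are hypothetical, not DataHub’s internal representation) walks such a graph breadth-first to collect every transitive upstream source of a report:

```python
from collections import deque

# Hypothetical lineage edges: downstream dataset -> its direct upstreams.
LINEAGE = {
    "report_revenue": ["mart_sales"],
    "mart_sales": ["stg_orders", "stg_customers"],
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
}

def upstreams(dataset: str) -> set:
    """All transitive upstream sources of a dataset (breadth-first walk)."""
    seen, queue = set(), deque([dataset])
    while queue:
        for parent in LINEAGE.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(upstreams("report_revenue")))
# ['mart_sales', 'raw_customers', 'raw_orders', 'stg_customers', 'stg_orders']
```

The same traversal run in the other direction answers impact analysis questions: which reports break if a raw source changes.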

Conclusion

Using a data catalog comes with several benefits:

  1. Enhanced Data Discovery: With the search and categorization capabilities of a data catalog, users can quickly find the exact data they need without having to comb through large datasets.
  2. Improved Data Understanding: The metadata in a data catalog provides users with necessary context about the data, making it easier to interpret and use correctly.
  3. Better Compliance and Governance: A data catalog supports data governance initiatives by ensuring data is consistent, accurate, and compliant with relevant regulations.
  4. Increased Trust in Data: By providing transparency into data lineage, a data catalog helps build trust in the data by allowing users to see its history and verify its accuracy and reliability.
  5. Time and Resource Efficiency: By making it easier to locate and understand data, a data catalog saves time and resources, speeding up data-driven activities and reducing the burden on data management teams.

In conclusion, DataHub provides a flexible, feature-rich, and all-encompassing option for data catalogs in a data warehousing environment. By providing powerful features for data discovery, metadata management, data lineage, and data governance, it enables businesses to extract maximum value from their data.

If you’re interested in learning more about data catalogs, watch the webinar recording for free.

– Ole Bause (Scalefree)


Multi-Temporality in Data Vault 2.0 – Part 2: A Practical Example


Multi-Temporality

This article explores managing multi-temporality in Data Vault 2.0 using a travel industry scenario. It highlights how operational and business intelligence systems handle data differently and demonstrates how Data Vault 2.0 supports comprehensive historical tracking and multi-temporal analysis.

Multi-temporality in Data Vault 2.0

You may have heard of bi-temporal data already. But usually there are more than just two timelines in your data, which complicates your work. Data sets typically contain multiple timestamps and dates from different perspectives, creating multiple ways to look at your data over time. You should also be able to handle this beast of a time machine. Did you know that Data Vault 2.0 can handle multi-temporal data? How does this influence your work, and how can you take advantage of it? Watch this webinar to learn how Data Vault 2.0 can help you master multi-temporality.

Watch webinar recording

Multi-Temporal Example from the Travel Industry

In one of our previous blog posts, we gave an insight into the basics of multi-temporality in Data Vault 2.0. This article provides a short practical example.

When you book your holidays, you might change something after the initial booking: the duration, the flight departure location, the hotel, the start and/or end date of your journey, or you might just book some extras. From an operational perspective, these changes are easy to deal with, and the expectations are not high. The data set for your journey is updated in the source system’s database, and only the most recent data is shown in the graphical user interface. The employee might see some logs in the history section, if they exist. That is usually enough to keep the operational side working.

In the business intelligence world, the way we deal with data is a bit different: there, we want to see everything. Each time a record in the source system is updated, we capture the change and store it in a separate database. This introduces a new timeline, which provides new opportunities but also comes with new challenges. Let’s look at some example data sets of a source system over time:

Day 1: Initial booking of a holiday trip

Day 2: Correction of the start and end date and a small price adjustment

Day 3: Cancellation of booking B4711 and creation of the new booking B4712

When you look into the source system today, you will see that booking B4711 is cancelled, or maybe completely gone, and B4712 is the active one (ID = 2, created on day 3). From an analytical perspective, you might have some more questions you want answered.
For example:

  • What was the estimated revenue for July, looking from February?
  • What was the estimated revenue for July, looking from March?
  • Which dates do I need to use to answer the two questions above?
  • In which month do we have the most bookings for the next summer season?
  • Can we reproduce a report we created some months ago?
  • What if some business dates change? What affects my reports in the past?
  • Where can we prepare the data to answer all these questions? In the data warehouse? In the front-end BI tool? Or somewhere else?
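The first two questions above are classic “as-of” queries: they combine the technical timeline (when the warehouse learned about a change) with a business timeline (when the trip takes place). The following Python sketch, with invented bookings, dates, and prices, shows the principle: filter the history by load date first, then aggregate by the business date.

```python
from datetime import date

# Hypothetical append-only booking history: every source change becomes a
# new row carrying its own load date (technical timeline). "travel_start"
# is the business timeline used to assign revenue to a month.
rows = [
    {"booking": "B1", "load": date(2024, 2, 1),
     "travel_start": date(2024, 7, 10), "price": 1200, "active": True},
    {"booking": "B1", "load": date(2024, 3, 1),
     "travel_start": date(2024, 8, 2), "price": 1250, "active": True},
]

def revenue_for_month(rows, month, as_of):
    """Revenue for a travel month, as it was known on the `as_of` date."""
    latest = {}
    for r in sorted(rows, key=lambda r: r["load"]):
        if r["load"] <= as_of:           # only versions loaded by then
            latest[r["booking"]] = r     # last loaded version wins
    return sum(r["price"] for r in latest.values()
               if r["active"] and r["travel_start"].month == month)

print(revenue_for_month(rows, 7, date(2024, 2, 15)))  # 1200 (July, seen from February)
print(revenue_for_month(rows, 7, date(2024, 3, 15)))  # 0    (the trip moved to August)
```

The same data thus yields different, equally correct answers depending on which point on the technical timeline you look from.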

In our webinar “Multi-Temporality in Data Vault 2.0” we provide answers to all these questions using Data Vault entities.

If you are interested in more use cases related to multi-temporality, we offer a two-day class dedicated to this topic: the “Multi-Temporal Data Vault 2.0 Class”.

Conclusion

Incorporating multi-temporality into Data Vault 2.0, as demonstrated through the travel industry example, allows organizations to effectively capture and analyze data changes over multiple timelines. This approach not only preserves historical accuracy but also enables comprehensive insights into business processes. By leveraging Data Vault 2.0’s capabilities, businesses can address complex temporal questions, such as estimating revenue forecasts from different points in time and reproducing past reports with precision. This methodology ensures a robust framework for managing evolving data landscapes and supports informed decision-making.


Multi-Temporality in Data Vault 2.0 – Part 1


Multi-Temporality in Data Vault 2.0

The following article gives an overview of the theoretical understanding of Multi-Temporality in a data warehouse.

Multi-temporality in Data Vault 2.0

You may have heard of bi-temporal data already. But usually there are more than just two timelines in your data, which complicates your work. Data sets typically contain multiple timestamps and dates from different perspectives, creating multiple ways to look at your data over time. You should also be able to handle this beast of a time machine. Did you know that Data Vault 2.0 can handle multi-temporal data? How does this influence your work, and how can you take advantage of it? Watch this webinar to learn how Data Vault 2.0 can help you master multi-temporality.

Watch webinar recording

What is “Multi-Temporality” in a Data Warehouse?

Before we start talking about multi-temporality, let’s define the term bi-temporality first, since a common misconception is that Data Vault 2.0 is only bi-temporal (it is not):

“Bitemporal Modeling is a specific case of Temporal database information modeling technique designed to handle historical data along two different timelines. This makes it possible to rewind the information to “as it actually was” in combination with “as it was recorded” at some point in time.” (According to Wikipedia)

Bi-temporality addresses just two timelines, commonly referred to as “System Time” (the technical timeline) and “Valid Time” (the business timeline). Data Vault Satellites, Point-in-Time (PIT) tables, and Bridge tables can address multiple active timelines in the same record. Let’s categorise just some of them:

  • Source Driven times
    • Created time
    • Updated time
    • Deleted time
  • System times
    • CDC time
    • Message event time
  • Business times
    • any times that represent when something happened or will happen in the “real world”, like a purchase or sale timestamp
  • Time spans
    • can be technical, can be business-driven
    • Contract start and end dates/times
    • Technical valid from and valid to dates/timestamps
  • Enterprise Data Warehouse (EDW)
    • Load date timestamp (set in the first layer of the EDW during the insert)
    • Timestamp when a record is written in the table

All of these dates and timestamps could be found in just one record in a Satellite table. This allows us to look at the data from multiple time perspectives. Therefore, the Data Vault Model accounts for multi-temporality and not just bi-temporality.
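As a small illustration, here is what such a multi-timestamp Satellite record could look like; the field names are assumptions for demonstration, not a fixed standard:

```python
from datetime import datetime

# Illustrative Satellite row (field names are assumptions): a single
# record can carry timestamps from several independent timelines.
satellite_row = {
    "customer_hk": "a1b2c3",                           # hash key, not a timeline
    "load_dts": datetime(2024, 5, 2, 0, 15),           # EDW timeline
    "cdc_dts": datetime(2024, 5, 1, 23, 58),           # system timeline (CDC)
    "source_updated": datetime(2024, 5, 1, 23, 57),    # source-driven timeline
    "contract_start": datetime(2024, 6, 1),            # business time span
    "contract_end": datetime(2025, 5, 31),
}

# Each timestamp supports a different perspective: e.g. reporting by
# business validity versus auditing by when the warehouse learned the fact.
timelines = [k for k in satellite_row
             if k.endswith(("_dts", "_updated", "_start", "_end"))]
print(len(timelines))  # 5 distinct time attributes in one record
```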

The Load Date Timestamp with Multi-Temporality

One requirement for realizing multi-temporality is that the Load Date Timestamp is used for the delta check when loading data into Satellites. Only the Load Date Timestamp provides a consistent, gapless, and non-overlapping timeline that is under our control. This allows an unrestricted view of the multiple timelines in Satellites.

All other timestamps are not suitable. First, they would restrict the number of possible perspectives on the data to a single one. Additionally, they can have gaps and overlaps, can be NULL, and are not controlled by the Enterprise Data Warehouse team.

In short: we never get rid of the Load Date Timestamp. It is set during the insert into the first layer of the Enterprise Data Warehouse architecture and pushed through all layers as far as possible (think of aggregates in the Business Vault over multiple Load Date Timestamps).
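A minimal Python sketch of such a Satellite delta check (column and function names are assumptions): a new record is inserted only when the descriptive payload differs from the most recent existing record for the same hash key, where “most recent” is determined by the Load Date Timestamp:

```python
import hashlib
from datetime import datetime

def hashdiff(payload: dict) -> str:
    """Hash of the descriptive payload, used to detect changes."""
    canonical = ";".join(f"{k}={payload[k]}" for k in sorted(payload))
    return hashlib.md5(canonical.encode()).hexdigest()

def load_satellite(satellite: list, hk: str, payload: dict, load_dts: datetime):
    """Insert a new version only if the payload changed (delta check)."""
    current = max((r for r in satellite if r["hk"] == hk),
                  key=lambda r: r["load_dts"], default=None)
    if current is None or current["hashdiff"] != hashdiff(payload):
        satellite.append({"hk": hk, "load_dts": load_dts,
                          "hashdiff": hashdiff(payload), **payload})

sat = []
load_satellite(sat, "hk1", {"city": "Hanover"}, datetime(2024, 1, 1))
load_satellite(sat, "hk1", {"city": "Hanover"}, datetime(2024, 1, 2))  # no change: skipped
load_satellite(sat, "hk1", {"city": "Berlin"},  datetime(2024, 1, 3))  # change: inserted
print(len(sat))  # 2
```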

3 Different Perspectives on Data

The core Data Vault is differentiated into the Raw Data Vault (RDV) and the Business Vault (BV). The reason is to separate soft business rules from hard business rules, as soft business rules can change the content of the data. Applying soft business rules early in the loading architecture reduces the number of possible perspectives on the raw data. The same rule applies to timelines: timeline-driven business perspectives on raw data happen, at the earliest, in the Business Vault.

There are basically three different perspectives related to timelines in the data warehouse: A data warehouse perspective, a business perspective, and an information delivery perspective. 

The data warehouse perspective relies on the Load Date Timestamp to achieve a consistent, incremental integration of the data into the Raw Data Vault and Business Vault.

The business perspective relates to all dates and timestamps delivered by the source system. Technical fields, such as the created, updated, or deleted date/timestamp from the source system, are treated the same way: everything that is part of the payload is handled as descriptive data during the Raw Data Vault loading.
Different queries can then create all possible views of the raw data, for example aggregates based on the most recent record per Business Key, grouped by a sales date.

The information delivery perspective relies on a snapshot to “freeze” all the data as it was active at a point in time. That said, the interpretation of what “active” means can be different.

To address this, multiple perspectives can be created. That’s also the reason why we talk about the single version of the facts in the Raw Data Vault and multiple versions of the truth in the Business Vault (different perspectives on raw data = different truths from different standpoints).

This could, for example, be an hourly, daily, weekly, monthly, or yearly snapshot or timespan. The Data Vault entities that are used here are the PIT and Bridge tables. The current delta of master data like customer data in a Satellite can be “frozen” based on a daily snapshot in a PIT table. Also, transactional data kept in a Non-Historized Link can be attached to an hourly snapshot in a Bridge table.
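The snapshot mechanism can be sketched in a few lines of Python (table layout and column names are illustrative): for each snapshot date, the PIT table freezes a pointer to the Satellite version that was active at that moment:

```python
from datetime import date

# Illustrative Satellite history: two versions of the same Business Key.
satellite = [
    {"hk": "cust1", "load_dts": date(2024, 1, 1), "status": "new"},
    {"hk": "cust1", "load_dts": date(2024, 1, 5), "status": "active"},
]

def build_pit(satellite, snapshot_dates):
    """For each snapshot date, record the latest Satellite version per key."""
    pit = []
    for snap in snapshot_dates:
        for hk in {r["hk"] for r in satellite}:
            versions = [r for r in satellite
                        if r["hk"] == hk and r["load_dts"] <= snap]
            if versions:
                active = max(versions, key=lambda r: r["load_dts"])
                pit.append({"snapshot": snap, "hk": hk,
                            "sat_load_dts": active["load_dts"]})
    return pit

pit = build_pit(satellite, [date(2024, 1, 3), date(2024, 1, 7)])
print([row["sat_load_dts"].isoformat() for row in pit])
# ['2024-01-01', '2024-01-05']
```

Each PIT row stores only keys and Load Date Timestamps, so the frozen perspective can be joined back to the Satellites without duplicating descriptive data.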

What that looks like exactly will be shown in the next part of this multi-temporal series. To enhance your understanding of these data perspectives, you can also explore our Multi-Temporal Data Vault class.

Conclusion

Incorporating multi-temporality into Data Vault 2.0 enables organizations to manage and analyze data across various timelines, providing a comprehensive view of historical changes from multiple perspectives. This approach enhances the ability to track and understand data evolution, leading to more informed decision-making and improved data governance. By effectively handling multiple temporal aspects, Data Vault 2.0 ensures a robust and flexible framework for capturing the complexities of time-variant data.

Multi-Temporality in Data Vault 2.0 – Part 1

An abstract illustration features a large clock in the center with connected circuitry-like lines. Icons of a calendar and a smaller clock are in the background. Two large arrows pointing outward appear on both sides, and the backdrop has splashes of blue and gray.

Multi-Temporality in Data Vault 2.0

The following article gives an overview of the theoretical understanding of Multi-Temporality in a data warehouse.

Multi-temporality in Data Vault 2.0

You may have heard about bi-temporal data already. But usually, there are more than just two timelines in your data, which complicates your work. Data sets typically contain multiple timestamps and dates from different perspectives, opening up many ways to look at your data along the time axis. You should also be able to handle this beast of a time machine. Did you know that Data Vault 2.0 can handle multi-temporal data? How does this influence your work, and how can you take advantage of it? Join this webinar to learn how Data Vault 2.0 can help you master multi-temporality.

Watch webinar recording

What is “Multi-Temporality” in a Data Warehouse?

Before we start talking about multi-temporality, let's define the term bi-temporality first, since a common misconception is that Data Vault 2.0 is only bi-temporal (which is wrong):

“Bitemporal Modeling is a specific case of Temporal database information modeling technique designed to handle historical data along two different timelines. This makes it possible to rewind the information to “as it actually was” in combination with “as it was recorded” at some point in time.” (According to Wikipedia)

Bi-temporality addresses just two timelines, commonly referred to as "System Time" (the technical timeline) and "Valid Time" (the business timeline). Data Vault Satellites, Point-in-Time (PIT) tables, and Bridge tables can address multiple active timelines in the same record. Let's categorize just some of them:

  • Source-driven times
    • Created time
    • Updated time
    • Deleted time
  • System times
    • CDC time
    • Message event time
  • Business times
    • Any time that represents when something happened or will happen in the "real world," like a purchase or sale timestamp
  • Time spans
    • Can be technical or business-driven
    • Contract start and end dates/times
    • Technical valid-from and valid-to dates/timestamps
  • Enterprise Data Warehouse (EDW) times
    • Load date timestamp (set in the first layer of the EDW during the insert)
    • Timestamp when a record is written to the table

All of these dates and timestamps could be found in just one record in a Satellite table. This allows us to look at the data from multiple time perspectives. Therefore, the Data Vault Model accounts for multi-temporality and not just bi-temporality.
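To make the idea of many timelines in a single record concrete, here is a minimal Python sketch of such a Satellite record. The field and class names (`CustomerSatelliteRecord`, `load_date`, `cdc_time`, and so on) are illustrative assumptions, not prescribed Data Vault naming:

```python
from dataclasses import dataclass
from datetime import datetime, date
from typing import Optional

@dataclass
class CustomerSatelliteRecord:
    """One Satellite record carrying several independent timelines."""
    customer_hash_key: str
    load_date: datetime              # EDW timeline: set on insert into the first layer
    source_created: datetime         # source-driven timeline
    source_updated: Optional[datetime]
    cdc_time: datetime               # system timeline from change data capture
    contract_start: date             # business time span
    contract_end: date
    customer_name: str               # descriptive payload

# A single record viewed from several time perspectives at once
rec = CustomerSatelliteRecord(
    customer_hash_key="a1b2c3",
    load_date=datetime(2024, 5, 2, 0, 15),
    source_created=datetime(2024, 4, 30, 9, 0),
    source_updated=None,
    cdc_time=datetime(2024, 5, 1, 23, 58),
    contract_start=date(2024, 5, 1),
    contract_end=date(2025, 4, 30),
    customer_name="Jane Doe",
)
```

Each attribute here belongs to a different category from the list above, yet all of them live in the same row, which is exactly what makes the model multi-temporal rather than bi-temporal.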

The Load Date Timestamp with Multi-Temporality

One requirement to realize multi-temporality on the data is that the Load Date Timestamp is used for loading data into Satellites when doing the delta check. Only the Load Date Timestamp can provide us with a consistent, gapless, and non-overlapping time which is under our control. This allows us to have an unrestricted view of the multi-timelines in Satellites.

All other timestamps are not qualified. First, they would restrict the number of possible perspectives on the data to a single instance. Additionally, they can have gaps and overlaps, can be NULL, and are not controlled by the Enterprise Data Warehouse teams.

In short: We will never get rid of the Load Date Timestamp, which is set during the insert into the first layer of the Enterprise Data Warehouse architecture and pushed through all the layers as far as possible (think of aggregates in the Business Vault over multiple Load Date Timestamps).
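As a sketch of why the Load Date Timestamp drives the delta check, here is a simplified Python loader. The row layout (`hk`, `load_dts`, `hash_diff`) and the in-memory "table" are assumptions for illustration only; a real implementation would run as SQL against the Satellite:

```python
import hashlib
from datetime import datetime

def hash_diff(payload: dict) -> str:
    """Hash of the descriptive payload, used for the delta check."""
    joined = "|".join(str(item) for item in sorted(payload.items()))
    return hashlib.md5(joined.encode()).hexdigest()

def load_satellite(satellite: list, staged: dict, load_dts: datetime) -> int:
    """Insert only changed rows, stamping each insert with the
    controlled, gapless Load Date Timestamp."""
    # most recent hash diff per hash key, ordered by Load Date Timestamp
    latest = {}
    for row in sorted(satellite, key=lambda r: r["load_dts"]):
        latest[row["hk"]] = row["hash_diff"]

    inserted = 0
    for hk, payload in staged.items():
        hd = hash_diff(payload)
        if latest.get(hk) != hd:  # delta check against the current record per key
            satellite.append({"hk": hk, "load_dts": load_dts, "hash_diff": hd, **payload})
            inserted += 1
    return inserted
```

Because only the Load Date Timestamp orders the records, an unchanged staged row produces no new insert, while any payload change creates a new delta record without touching the source-driven or business timelines in the payload.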

3 Different Perspectives on Data

The core Data Vault is differentiated into the Raw Data Vault (RDV) and the Business Vault (BV). The reason is to separate soft business rules from hard business rules, as soft business rules can change the content of the data. The result is that the number of possible perspectives on the raw data is reduced when soft business rules are applied early in the loading architecture. The same principle applies to timelines: timeline-driven business perspectives on raw data happen, at the earliest, in the Business Vault.

There are basically three different perspectives related to timelines in the data warehouse: A data warehouse perspective, a business perspective, and an information delivery perspective. 

The data warehouse perspective relates to the Load Date Timestamp to have a consistent incremental integration of the data into the Raw Data Vault and Business Vault.  

The business perspective relates to all dates and timestamps delivered by the source system. Technical fields from the source, such as the created, updated, or deleted date/timestamp, are treated the same way: everything that is part of the payload is handled as descriptive data during the Raw Data Vault loading.
Now, different queries can create all possible views of the raw data; for example, aggregates based on the most recent record per Business Key, grouped by a sales date.

The information delivery perspective relies on a snapshot to “freeze” all the data as it was active at a point in time. That said, the interpretation of what “active” means can be different.

To address this, multiple perspectives can be created. That’s also the reason why we talk about the single version of the facts in the Raw Data Vault and multiple versions of the truth in the Business Vault (different perspectives on raw data = different truths from different standpoints).

This could, for example, be an hourly, daily, weekly, monthly, or yearly snapshot or timespan. The Data Vault entities used here are the PIT and Bridge tables. The current delta of master data, such as customer data in a Satellite, can be "frozen" based on a daily snapshot in a PIT table. Likewise, transactional data kept in a Non-Historized Link can be attached to an hourly snapshot in a Bridge table.
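The "freezing" a PIT table performs can be sketched as follows. For each hash key, the record active at the snapshot time is the one with the latest Load Date Timestamp at or before that snapshot. The field names (`hk`, `load_dts`) are illustrative assumptions:

```python
from datetime import datetime

def pit_snapshot(satellite_rows: list, snapshot_dts: datetime) -> dict:
    """For each hash key, pick the record that was active at the snapshot time:
    the one with the latest Load Date Timestamp at or before snapshot_dts."""
    pit = {}
    for r in satellite_rows:
        if r["load_dts"] <= snapshot_dts:
            cur = pit.get(r["hk"])
            if cur is None or r["load_dts"] > cur["load_dts"]:
                pit[r["hk"]] = r
    # a PIT table stores only the key plus the Load Date Timestamp,
    # acting as a pointer back to the full Satellite record
    return {hk: r["load_dts"] for hk, r in pit.items()}
```

Running this once per day produces the daily PIT snapshots described above; a Bridge table applies the same idea across the keys of a Link.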

How that exactly looks will be shown in the next part of the multi-temporal newsletter series. To enhance your understanding of these data perspectives, you can also explore our Multi-Temporal Data Vault class. 

Conclusion

Incorporating multi-temporality into Data Vault 2.0 enables organizations to manage and analyze data across various timelines, providing a comprehensive view of historical changes from multiple perspectives. This approach enhances the ability to track and understand data evolution, leading to more informed decision-making and improved data governance. By effectively handling multiple temporal aspects, Data Vault 2.0 ensures a robust and flexible framework for capturing the complexities of time-variant data.

Data-Driven Decision Making – For Power Users and Data Scientists

Watch the Webinar

In this webinar, Michael Olschimke, the CEO of Scalefree, presents how to gain additional advantage from the enterprise data warehouse by making the solution available to power users and data scientists.

No one likes to wait! If you have to wait too long for data or information, you will find another way to get what you need. This ends up in multiple inconsistent data warehouse "solutions" across your company.

To avoid this, an agile approach with managed Self-Service BI is essential to keep a good relationship with Power Users (e.g. Data Scientists) and to build a governed enterprise BI solution for better decision-making. In this webinar, Michael Olschimke will talk about approaches, best practices, and experiences.

Watch Webinar Recording

Webinar Agenda

1. About Data-Driven Organizations
2. Creating a Data-Driven Strategy
3. Making Data Available throughout the Enterprise

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
