Michael Olschimke

Same-as-links Business Rules in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a thought-provoking question from our audience regarding Same-as-Links.

“Sometimes mapping logic for Same-As-Links (SAL) requires complex ‘fuzzy’ business logic. When does the logic become too complex for the Raw Data Vault, and instead, the joining of similar tables from different sources becomes a Business Vault concern? It’s important to not have convoluted transformations in the Raw Data Vault, so where do we ‘draw the line’ on transformations being too convoluted/complex for a Raw Data Vault entity?”

In this enlightening video, Michael addresses the delicate balance between the Raw Data Vault, where no business logic is applied, and the Business Vault, where most business logic is implemented. He provides insights into recognizing when the mapping logic for Same-As-Links (SAL) becomes too intricate for the Raw Data Vault, prompting the shift to the Business Vault for handling complex transformations.

The discussion offers practical considerations and a clear perspective on drawing the line to maintain the efficiency and clarity of transformations within the Raw Data Vault.

Michael Olschimke In Data Vault Friday

Pivotizing Fact Measures in Data Vault

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke engages with a pertinent question from our audience.

“There are 6 measure values (float/decimal values) in the fact entity. In each row, typically 3 of them are NULL. Would it make sense to unpivot the data and encode this in a dimension for measure type? We also have measure values which are based on integers. Does it make sense to separate them into their own fact entity?”

In this insightful video, Michael delves into the considerations surrounding the structure of fact entities when dealing with multiple-measure values. The specific scenario of having null values for some measures prompts a discussion on whether it is beneficial to unpivot the data and encode it in a dimension for measure type. Additionally, Michael explores the case of measuring values based on integers and evaluates whether separating them into their own fact entity is a sound approach.

The video offers practical guidance and best practices for optimizing the design of fact entities in Data Vault models, ensuring efficiency and clarity in data representation.

Michael Olschimke In Data Vault

About Information Marts in Data Vault 2.0 – Part 2

In the Data Vault 2.0 architecture, information marts are used to deliver information to the end-users.
Conceptually, an information mart follows the same definition as a data mart in legacy data warehousing. However, in legacy data warehousing, a data mart is used to deliver useful information, not raw data. This is why the data mart has been renamed in Data Vault 2.0 to better reflect the use case.

“Classical” information marts

But the definition of information marts has more facets. In the book “Building a Scalable Data Warehouse with Data Vault 2.0” we present three types of marts:

Information marts: used to deliver information to business users, typically via dashboards and reports.
Metrics Mart: used in conjunction to a Metrics Vault, which captures EDW log data in a Data Vault model. The Metrics Mart is derived from the Metrics Vault to present the metrics in order to analyze performance bottlenecks or in resource consumption of power users and data scientists in managed self-service BI solutions.
Error Mart: stores those records that typically fail a hard rule when loading the data into the enterprise data warehouse.

Additional information marts

In addition to these “classical” information marts, we use additional ones in our consulting practice:

Interface Mart: this is more or less an information mart, however, the information is not delivered to a human being, e.g. via a dashboard or report. Instead, it is delivered to a subsequent application, or as a write-back, to the source system (for example when using the enterprise data warehouse for data cleansing).
Quality Mart: the quality mart is again an information mart, but instead of cleansing bad data, it is used to report bad data. Essentially, it turns the business logic used to cleanse bad data upside down: only bad data, in addition to ugly data at times, is delivered to the end-user, the data steward. This is often done in conjunction with data cleansing frontends where the data steward can either correct source data or comment and tag the exceptions.
Source Mart: again an information mart, but this time not using one of the popular schemas, such as star schemas, snowflake schemas or fully denormalized schemas. Instead, the information mart uses the data model of the source application, similar to an operational data store (ODS) schema. However, the Source Mart is not a copy of the data, it is a virtualized model on top of the Data Vault model, reflecting the original structures. It’s great for ad-hoc reporting and provides great value for many data scientists as well as power users.

This concludes our list of information marts. We have used them successfully in projects for our clients to better communicate the actual application of the information marts in their organization.

Let us know in the comments if you think this is helpful for you, too!

Michael Olschimke In Data Vault Friday

Redundancy in Dimensional Models in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a thought-provoking question from our audience.

“A company might have many assets. The asset dimension contains many descriptive fields describing the company, leading to redundancy. When does it make sense to separate the attributes into their own dimension? If a dimension attribute is often used in filtering, does it make sense to separate this into its own dimension?”

In this enlightening video, Michael delves into the considerations and decision-making process surrounding the design of dimensional models derived from a Data Vault model. Specifically, he explores the scenario where the asset dimension contains numerous descriptive fields, potentially leading to redundancy. Michael provides insights into when it makes sense to separate these attributes into their own dimension and discusses the factors influencing this decision.

Furthermore, the discussion extends to instances where a dimension attribute is frequently used in filtering and whether it warrants a separate dimension. Michael’s explanation offers practical guidance and considerations for optimizing the target models while managing redundancies effectively.

For those involved in data modeling and dimensional design within the Data Vault framework, this video provides valuable insights and strategic considerations.

Michael Olschimke In Data Vault Friday

Hash Keys vs Sequence Keys vs Business Keys in Data Vault

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke engages with a pertinent question posed by our audience.

“How do the loading cycles benefit from joining on a hash key or a Business Key, as opposed to a surrogate value?”

In this insightful video, Michael delves into the critical aspect of designing the Enterprise Data Warehouse (EDW) by examining three choices for identifying records in the Data Vault model. The specific focus is on the advantages and implications of joining on a hash key or a Business Key, contrasting these approaches with the use of a surrogate value.

Michael’s comprehensive exploration provides clarity on the impact these choices have on loading cycles within the EDW. By understanding the nuances of each option, viewers gain valuable insights into optimizing loading processes and achieving efficient data integration.

For those involved in the design and management of Data Vault models, this video offers practical considerations and strategic insights.

Michael Olschimke In Data Vault Friday

Record Source and the Business Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a question raised by a member of our audience.

“We load a Same-As-Link (SAL) from multiple sources to deduplicate records across these source systems. How do you define the record source? The data is not clearly coming from only one source.”

In this insightful video, Michael delves into the intricacies of handling Same-As-Links (SAL) when loading data from multiple sources for the purpose of deduplication. The specific challenge raised about defining the record source when data is sourced from multiple systems is a common issue in many projects.

Michael shares best practices, providing clarity on the intended use of the record source and offering guidance on how to avoid common pitfalls associated with its misuse. The discussion sheds light on the importance of correctly defining the record source to ensure accurate data lineage and enhance the effectiveness of deduplication efforts.

For those navigating the complexities of deduplication in a multi-source environment, this video offers valuable insights and practical recommendations.

Michael Olschimke In Data Vault Friday

Data Quality in the Enterprise Data Warehouse (EDW) in Data Vault

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke engages with a set of pertinent questions posed by our audience.

“How to deal with dirty data managed in BV? What are the best practices for correct data management? How can business rules, versions, and fixes be managed on correction properly?”

In this enlightening video, Michael addresses the challenges associated with handling dirty data within the Business Vault (BV) and explores best practices for effective data management. He dives into the complexities of managing business rules, versions, and corrections, offering insights into the proper approaches for ensuring data accuracy and consistency.

The video emphasizes a strict Extract, Load, Transform (ELT) approach, advocating for the application of data cleansing rules after loading the Raw Data Vault. Michael explains the rationale behind this methodology and highlights the advantages of maintaining a robust data flow.

For teams grappling with data quality issues and seeking optimal data management strategies, this video provides valuable guidance and practical considerations.

Michael Olschimke In Data Vault

Data Vault 2.0: Best of Breed from Data Warehousing and Data Lakes

There are two competing approaches to data analytics available in the industry, and most professionals at least tend to one or the other as the preferred tool of choice: data warehousing vs data lake. This article sheds light on the differences between both approaches and how Data Vault 2.0 provides a best of breed solution that integrates the advantages of both approaches into a unified concept.

About Data Warehousing

Data Warehousing is the traditional enterprise solution for providing reliable information to decision makers at every level of the organization. Data warehouse solutions (but also data lakes) are based on a data model, which is traditionally defined on the basis of either the information requirement in a bottom-up approach or in a top-down approach based on an integrated enterprise information model.

In any case, the traditional data warehouse is based on a concept called “schema-on-write” where the data model is established when loading the data into the data warehouse. This often leads to non-agile data processing, as this data model often requires modifications in order to cope with changes in the business.

About Data Lakes

Data lakes, on the other hand, are based on the “schema-on-read” concept. Instead of modeling the enterprise or fitting the incoming dataset into a target information model, the data is first and foremost stored on the data lake as delivered without any modelling applied.

While traditional data warehousing often leads to overmodeling and non-agile data analytics, the data lake approach often leads to the direct opposite: to unmanaged data and inconsistent information results.

The Best of Breed

Both approaches are on the extreme ends of the data analytics space and used throughout the years with mixed results. With the emergence of the Data Vault 2.0 concept, a third option is available to industry professionals to build data analytics platforms.

Data Vault 2.0 is a best of breed between traditional data warehousing and data lakes: for example, there is a data model to manage the data and business logic as in traditional data warehousing, but it follows a schema-on-write approach as in data lakes.

The Data Vault 2.0 architecture comprises multiple layers:

The first layer is the staging area: it is used to extract the data from the source systems. The next layer is the Raw Data Vault. This layer is still functionally oriented as the staging layer, but integrates and versionizes the data. To achieve this, the incoming source data model is broken down into smaller components: business keys (stored in hubs), relationships between business keys (stored in links) and descriptive data (captured by satellites).

The Business Vault is the next layer, but only sparsely modelled: only where business logic is required to deliver useful information, a Business Vault entity is put in place. The Business Vault bridges the gap between the target information model (as in the next layer) and the actual raw data. Often, the raw data doesn’t meet the expectations of the business regarding data quality, completeness, or content and thus needs to be adjusted. Business logic is used to fill the gap.

The final layer is the information mart layer where the information model is produced to deliver the final information in the desired format, e.g., a dimensional star schema. This model is used by the business user either directly in ad-hoc queries or using business intelligence tools such as dashboarding or reporting software.

The first layers until the Raw Data Vault are still functionally oriented because the model is still derived either directly from the source system (as in the staging area) or by breaking down the incoming data model into smaller, normalized components, as in the Raw Data Vault. The target schema is only applied at the latest layer, the information mart layer. This is when the desired information model is applied. Because the information mart is often virtualized using SQL views, the target schema is actually applied during query time. Queries against the view layers are merged with the SQL statements inside the view layer and run against the materialized tables in the Raw Data Vault, the actual data. Therefore, the schema-on-read concept is used in Data Vault 2.0.

Data Vault 2.0 also preserves the agility: the concept has demonstrated in many projects that it is easy to extend over time when either the source system structures change, the business rules change or the information models need to be adjusted. In addition, it is easy to add new data sources, additional business logic and additional information artifacts to the data warehouse.

On top of that, the Data Vault 2.0 model is typically integrated with a data lake: the above diagram shows the use of a data lake for staging purposes, which is the recommended “hybrid architecture” for new projects at Scalefree. But the data lake can also be used to capture semi-structured or unstructured data for the enterprise data warehouse or to deliver unstructured information marts.

With all that in mind, the Data Vault 2.0 concept has been established itself as a best of breed approach between the traditional data warehouse and the data lake. Organizations of all sizes use is to build data analytics platforms to deliver useful information to their decision makers.

Michael Olschimke In Data Vault Friday

Managed Self-service Monitoring in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a pertinent question posed by our audience.

“Why is monitoring important in Managed Self-Service BI?”

In this insightful video, Michael delves into one of his favorite topics – managed self-service BI. He expounds on the significance and role of monitoring within the context of managed self-service BI scenarios. Drawing from his expertise, Michael provides a comprehensive exploration of the uses, values, and critical importance of monitoring tools in ensuring the effectiveness and efficiency of managed self-service BI implementations.

For those keen on optimizing their self-service BI initiatives and understanding the practical applications of monitoring, this video offers valuable insights and considerations.

Michael Olschimke In Data Vault Friday

Modelling Currencies in Non-Historized Links in Data Vault (PART 3)

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke delves into a compelling question presented by a member of our audience.

“Is it a good idea to create a HUB for an ‘opposite side Account’? Or perhaps, should we take it a step further and explore the possibility of merging the ‘opposite side Account’ with HUB_ACCOUNT? If so, how do we handle the diversity of IBAN formats across different countries? Is it practical to have accounts from all over the world within our HUB_ACCOUNT?”

In this informative video, Michael addresses the intricacies of data modeling, specifically concerning the creation of a HUB for the ‘opposite side Account.’ He discusses the potential advantages and challenges associated with merging this entity with HUB_ACCOUNT. Furthermore, Michael offers insights into navigating the complexities of different IBAN formats in a global context, weighing the considerations of including accounts from diverse geographical regions within a unified HUB.

For those grappling with decisions around global data integration and HUB design, Michael’s responses in this video offer valuable guidance and considerations.

Michael Olschimke In Data Vault Friday

Ghost Records in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses an interesting question raised by a member of our audience.

“We wonder if a ghost record in a satellite should have NULL values in the descriptive fields or not. What is the advantage of non-NULL values?”

In this insightful video, Michael explores the nuances of handling ghost records in satellites, specifically focusing on whether descriptive fields should contain NULL values or default values. He provides a comprehensive analysis of the pros and cons associated with each solution, shedding light on the advantages and considerations that come with choosing either non-NULL values or default values for descriptive data in ghost records.

For those grappling with the challenges of managing ghost records within the Data Vault framework, this video offers valuable insights and considerations to inform decision-making.

Michael Olschimke In Data Vault Friday

LINK Record Source in Data Vault

Watch the Video

In our continuous Data Vault Friday series, our CEO Michael Olschimke takes on a thoughtful question posed by a member of our audience.

“LINK table has only one ‘Record Source’ element to capture the source system value. In case the LINK generates the key relationship between two hubs which are sourced from different systems (I have not encountered any such but I can foresee this situation), What would be the RECORD SOURCE Value?”

In this enlightening video, Michael delves into the fundamental modeling principles of link entities within the Data Vault framework. Specifically, he addresses the scenario where a LINK table establishes key relationships between two hubs originating from distinct systems. Michael provides insights into determining the appropriate “Record Source” value in such situations, offering guidance on maintaining clarity and accuracy in data lineage.

For those navigating the nuances of link table modeling and considering potential cross-system relationships, this video provides valuable insights and best practices.

Same-as-links Business Rules in Data Vault

Watch the Video

Pivotizing Fact Measures in Data Vault

Watch the Video

About Information Marts in Data Vault 2.0 – Part 2

“Classical” information marts

Additional information marts

Redundancy in Dimensional Models in Data Vault

Watch the Video

Hash Keys vs Sequence Keys vs Business Keys in Data Vault

Watch the Video

Record Source and the Business Vault

Watch the Video

Data Quality in the Enterprise Data Warehouse (EDW) in Data Vault

Watch the Video

Data Vault 2.0: Best of Breed from Data Warehousing and Data Lakes

About Data Warehousing

About Data Lakes

The Best of Breed

Managed Self-service Monitoring in Data Vault

Watch the Video

Modelling Currencies in Non-Historized Links in Data Vault (PART 3)

Watch the Video

Ghost Records in Data Vault

Watch the Video

LINK Record Source in Data Vault

Watch the Video

Build Better Data Platforms

SOLUTIONS

TRAINING

EVENTS

KNOWLEDGE HUB

CAREERS

COMPANY

Make Better Salesforce Decisions

Build Better Data Platforms