Marc Winkelmann

How to Scale in a Disciplined Agile Manner?

Looking beyond Scrum and learn how to increase the value in Data Vault 2.0 projects

Earlier this year we talked about Managed Self-Service BI to explain how business users can take a benefit from this approach in Data Vault 2.0. Now we want to show you how to get there from a project management perspective, even in large companies where the standard Scrum approach often not works with the accorded deployment/release regulations and other approaches like the Disciplined Agile framework are the better fit.

Agile transformation is hard because cultural change is hard. It’s not one problem that needs to be solved, but a series of hundreds of decisions affecting lots of people over a long period of time that affects relationships, processes, and even the state of mind of those working within the change.

There are two fundamental visions about what it means to scale agile: Tailoring agile strategies to address the scaling challenges – such as geographic distribution, regulatory compliance, and large team size – faced by development teams and adopting agility across your organization. Both visions are important, but if you can’t successfully perform the former then there is little hope that you’ll be successful at the latter.

Marc Winkelmann In Data Vault

Still Struggling with GDPR?

GDPR

The new General Data Protection Regulation (GDPR) is a law by the European Union (EU) and became effective on May 25, 2018. This new regulation is designed to put a high level of protection to personal data of European citizens, what means that companies around the world have to establish transparency and ownership to the individuals’ data and need to get a clear declaration of consent from them to save and process their personal data. Though laws from countries outside the EU (especially the USA) tend to favor business over consumer, GDPR affects all companies over the world who have personal data from EU-citizens in their database.

In this article:

What is new in GDPR?
Data Vault 2.0 as a solution
Conclusion

What is new in GDPR?

To be careful with personal data is nothing new, especially not in the EU. The key change of collecting and processing personal data is that the data is now completely under control of the owner, who can force the companies to delete or anonymize their data or to request copies of all owners personal data stored in the system. Personal data or Privately Identifiable Information (PII) means data, an individual can be identified with, e.g. name, phone number or email address. Continue Reading

Marc Winkelmann In Data Warehouse

Data Warehouse and Data Lake: Do We Still Need a Data Warehouse?

“Big Data”, “Data Lake”, “Data Swamp”, “Hybrid Architecture”, “NoSQL”, “Hadoop” … terms you are confronted with very often these days when you are dealing with data. Furthermore, the question comes up if you really need a data warehouse nowadays when you deal with a high variety and volume of data. We want to talk about what a data lake is, if we need a data warehouse when using NoSQL platforms like Hadoop, and how it is combined with Data Vault.

WHAT IS A DATA LAKE?

There is a proper definition from Tamara Dull (SAS): “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.” ¹ Continue Reading

Marc Winkelmann In Data Vault

How to Combine Managed Self-Service BI with Data Vault 2.0?

Managed Self-Service BI and Data Vault 2.0

Combining Managed Self-Service BI with Data Vault 2.0

This article explores how combining Managed Self-Service BI with Data Vault 2.0 enables organizations to balance data governance and agility, ensuring both control and flexibility in their analytics processes.Last month we talked about a hybrid architecture in Data Vault 2.0, where we explain how to combine structured and unstructured data with a hybrid architecture. To follow up on this topic, we now want to explain how your business users (especially power users) can take a benefit from it with the managed Self-Service Business Intelligence (mSSBI) approach in Data Vault 2.0.

In this article:

About Self-Service BI
About the Managed Self-Service BI Approach
Implement MSSI in the Data Vault 2.0 Architecture
Conclusion

About Self-Service BI

Self-service BI allows end-users to completely circumvent IT due to this unresponsiveness of IT. In this approach, business users are left on their own with the whole process of sourcing the data from operational systems, integration and consolidation of the raw data. There are many problems with this self-service approach without the involvement of IT:

Direct access to source systems
Unintegrated raw data
Low data quality
Unconsolidated raw data
Non-standardized business rules

In many cases, end-users – even if they are power users with the knowledge to SQL, MDX, and other techniques, don’t have the right tools available to solve the tasks. Instead, much work is done manually and error-prone. But from our experience, it is not possible to completely prevent such power users from obtaining data from source systems, preparing it, and eventually reporting the data to upper management. What organizations need is a compromise between IT agility and data management that allows power users to obtain the data they need quickly, in a usable quality. To overcome these problems, the Data Vault 2.0 standard allows experienced or advanced business users to perform their own data analysis tasks on the raw data of the data warehouse.

About the Managed Self-Service BI Approach

In fact, a Data Vault 2.0 powered IT welcomes business users to take the data that is available in the enterprise data warehouse (either in the Raw Data Vault or in the Business Vault) to create local information marts using specialized tools. These tools retrieve the data from the enterprise data warehouse, apply a set of user-defined business rules and present the output to the end-user. IT might also create structures where organizational-wide business rules are applied to provide a consolidated view on parts of the model or pre-calculate KPIs to ensure consistency among such calculations. Because both types of data (raw data and business rule applied data) is already integrated, the business user can also join consolidated data with raw data from specific source systems. This approach is called Managed Self-Service BI, where IT evolves to a service organization that provides those power users with the data they want, in the timeframe they need. The data is integrated by its business key and can be consolidated as well as quality checked.

Implement MSSI in the Data Vault 2.0 Architecture

The Data Vault 2.0 architecture provides self-service capabilities for power users in the organization:

Figure1 : mSSBI Architecture

In this case, power users who build their own, custom solutions can write back data and information into the Enterprise Data Warehouse by leveraging a dedicated user space for this purpose. The write-back can then later be re-used in the solution for information delivery. Furthermore, the business users can manage their own master data by using an MDM application. It enables authorized business users to change the parameter values and therefore influence the results of the business rules.

The difference between “managed” Self-Service BI to the standard Self-Service BI approach from the general industry is that Data Vault 2.0 provides a managed environment where data and information are provided in a controlled and secure manner. Power users can only query the data they are allowed to see from a data security perspective.
Another advantage is that this approach enables organizations in security and banking industries to provide a fully auditable and traceable environment that meets the highest security requirements.

Managed Self-Service BI does require a write-back possibility in the enterprise data warehouse architecture, otherwise, it’s just plain old BI solution. Without write-back, there are no differentiators. Beyond that, write-back is necessary in order for enriching or enhancing the quality of the data being put forward. Data scientists, for example, do this all the time: when using Hadoop they create a new target file as a result of their processing output. This is a direct write-back in the Hadoop space. We have required write-back in Self Service BI for years, otherwise, master data, and hierarchy management don’t work properly.

In this modification of the architecture from the previous section, the relational staging area is replaced by a HDFS based staging area which captures all unstructured and structured data. While capturing structured data on the HDFS appears as overhead at first glance, this strategy actually reduces the burden of the source system by making sure that the source data is always being extracted, regardless of any structural changes. The data is then extracted using Apache Drill, Hive External or similar technologies. It is also possible to store the Raw Data Vault and the Business Vault (the structured data in the Data Vault model) on Hive Internal.

Conclusion

Combining Managed Self-Service BI with Data Vault 2.0 empowers organizations to strike a balance between governance and agility in their data ecosystems. By leveraging Data Vault’s structured, auditable architecture alongside self-service BI’s flexibility, businesses can ensure data accuracy, security, and scalability while enabling users to gain faster insights. This approach enables collaboration between IT and business users, driving more informed decision-making and accelerating data-driven innovation.

Marc Winkelmann In Data Architecture

Hybrid Architecture in Data Vault 2.0

Hybrid Architecture in Data Vault 2.0

Business users expect their data warehouse systems to load and prepare more and more data, regarding the variety, volume, and velocity of data. Also, the workload that is put on typical data warehouse environments is increasing more and more, especially if the initial version of the warehouse has become a success with its first users. Therefore, scalability has multiple dimensions. Last month we talked about Satellites, which play an important role in scalability. Now we explain how to combine structured and unstructured data with a hybrid architecture.

In this article:

Logical Data Vault 2.0 Architecture
Hybrid Architecture
Conclusion

Logical Data Vault 2.0 Architecture

The Data Vault 2.0 architecture is based on three layers: the staging area which collects the raw data from the source systems, the enterprise data warehouse layer, modeled as a Data Vault 2.0 model, and the information delivery layer with information marts as star schemas and other structures. The architecture supports both batch loading of source systems and real-time loading from the enterprise service bus (ESB) or any other service-oriented architecture (SOA).

The following diagram shows the most basic logical Data Vault 2.0 architecture:

Figure 1: Logical Data Vault 2.0 Architecture

In this case, structured data from source systems is first loaded into the staging area to reduce the operational / performance burden from the operational source systems. It is then loaded unmodified into the Raw Data Vault which represents the Enterprise Data Warehouse layer. After the data has been loaded into this Data Vault model (with hubs, links, and satellites), business rules are applied in the Business Vault on top of the data in the Raw Data Vault. Once the business logic is applied, both, the Raw Data Vault and the Business Vault are joined and restructured into the business model for information delivery in the information marts. The business user is using dashboard applications (or reporting applications) to access the information in the information marts.

The architecture allows implementation of the business rules in the Business Vault using a mix of various technologies, such as SQL-based virtualization (typically using SQL views), and external tools, such as business rule management systems (BRMS).

However, it is also possible to integrate unstructured NoSQL database systems using a hybrid architecture. Due to the platform independence of Data Vault 2.0, NoSQL can be used for every data warehouse layer, including the stage area, the enterprise data warehouse layer, and information delivery. Therefore, the NoSQL database could be used as a staging area and load data into the relational Data Vault layer. However, it could also be integrated both ways with the Data Vault layer via a hashed business key. In this case, it would become a hybrid architecture solution and information marts would consume data from both environments.

Hybrid Architecture

The standard Data Vault 2.0 architecture in Figure 1 focuses on structured data. Because more and more enterprise data is semi-structured or unstructured, the recommended best practice for a new enterprise data warehouse is to use a hybrid architecture based on a Hadoop cluster, as shown in the next figure:

Figure 2: Hybrid Data Vault 2.0 Architecture

In this hybrid architecture modification, the relational staging area is replaced by a HDFS based staging area that captures all unstructured and structured data. While capturing structured data on the HDFS appears as overhead at first glance, this strategy actually reduces the burden of the source system by making sure that the source data is always being extracted, regardless of any structural changes. The data is then extracted using Apache Drill, Hive External, or similar technologies.

It is also possible to store the Raw Data Vault and the Business Vault (the structured data in the Data Vault model) on Hive Internal.

Conclusion

Integrating a hybrid architecture within Data Vault 2.0 enables organizations to effectively manage both structured and unstructured data by leveraging platforms like Hadoop. This approach enhances scalability and flexibility, allowing for efficient data processing and storage. By replacing traditional relational staging areas with HDFS-based systems, businesses can reduce the burden on source systems and ensure seamless data extraction

Marc Winkelmann In Data Modeling

Visual Data Vault by Example: Satellites Modeling in the Health Care Industry

Data Vault 2.0 is a concept for data warehousing, invented by Dan Linstedt. It brings many new features that help anyone who is concerned with Business Intelligence entering a new age of data warehousing. Data Vault 2.0 is a Big Data concept that integrates relational data warehousing with unstructured data warehousing in real-time. It is an extensible data model where new data sources are easy to add. When our founders wrote the book, they required a visual approach to model the concepts of Data Vault in the book. For this purpose, they developed the graphical modeling language, which focuses on the logical aspects of Data Vault. The Microsoft Visio stencils and a detailed white paper are available on www.visualdatavault.com as a free download.

This year we already wrote about the modeling of hubs and links in Data Vault 2.0. Now, we want to introduce you the third standard entity, the Satellite.

SATELLITES IN VISUAL DATA VAULT

Satellites add descriptive data to hubs and links. Descriptive data is stored in attributes that are added to the satellite. The individual attributes are added to the satellite one at a time. A satellite might be attached to any hub or link. However, it is only possible to attach the satellite to one parent. Continue Reading

Marc Winkelmann In Data Modeling

Visual Data Vault by Example: Links Modeling in the Banking Industry

Visual Data Vault

The following article describes how to represent links using a Visual Data Vault example. With the advent of Data Vault 2.0, which adds architecture and process definitions to the Data Vault 1.0 standard, Dan Linstedt standardized the Data Vault symbols used in modeling. Based on these standardized symbols, the Visual Data Vault (VDV) modeling language was developed, which can be used by EDW architects to build Data Vault models.

When our founders wrote the book, they required a visual approach to model the concepts of Data Vault in the book. For this purpose, they developed the graphical modeling language, which focuses on the logical aspects of Data Vault. The Microsoft Visio stencils and a detailed white paper are available on www.visualdatavault.com as a free download.

In this article:

Links in Visual Data Vault
Conclusion

Links in Visual Data Vault

We previously published another newsletter how hubs are modeled in the accounting industry. In this Newsletter we explain the function of standard links and how the modeling in the banking industry works.

Links connect individual hubs in a Data Vault model and represent either transactions or relationships between business objects. Business objects are connected in business. No business object is entirely separate from other business objects. Instead, they are connected to each other through the operational business processes that use business objects in the execution of their tasks. The image below shows a link that connects two hubs (a standard link has to have at least two connections) as the following diagram shows:

Figure 1: A standard link connects two hubs

The link in in the image above references two hubs: Account and Customer. The connector (the arrow) should be read as “(the hub) Customer is used by (the link) Account to Customer.” The second reference is a little different because the name of the connection between the Account hub and the link is overwritten by the meaning of a credit or a debt Account. This is necessary in cases where the model requires more meaning or when multiple connections are required to the same hub. The hash keys of each hub, which identify each business object unique by one calculated attribute, are replicated into the link entity by using the same attribute name.

A link represents many-to-many relationships and therefore they provide flexibility because changes to the business rules don’t require re-engineering and the granularity is expressed by the number of referenced hubs and is thus well documented.

Figure 2: Link table with its attributes

The link contains all hash keys of the related hubs (logical foreign keys), a load date when the relationship arrives the data warehouse for the first time, the record source where the data comes from, and the link hash key (logical primary key) which is calculated from the business keys of the hubs (not from the hash keys – never hash a hash!) and follows an insert-only loading pattern.

Links greatly improve the flexibility of the Data Vault model, because it is easy to add links or modify the relationship type of existing links. It takes less time to respond to changes in the business. To add new functionality, you only need to add new hubs and connect them via links to existing hubs. Usually a standard satellite is attached to the link, which contains the descriptive data of the relationship between the hubs.

Another common kind of link is the Non-Historized Link (also known as Transactional Link) which contains transactions only and does not need a Satellite, what means that the loading pattern is a complete insert-only approach. Read more about the value of NH-Links in an earlier Newsletter this year.

Conclusion

In conclusion, the Visual Data Vault modeling language offers a standardized and effective approach for representing complex relationships and transactions within the banking industry. By connecting business objects through links, it ensures a clear and logical structure that mirrors real-world processes. This methodology not only enhances the clarity of data models but also facilitates efficient data management and retrieval, making it an invaluable tool for enterprise data warehouse architects.

Marc Winkelmann In Data Vault

Achieve Data Lineage in Data Vault 2.0

One common requirement in data warehouse projects is to provide data lineage from end-to-end. However, custom solutions (for example custom Meta Marts for self-developed Data Vault generators) or tools from different vendors often break such end-to-end data lineage.

Unlike business or technical metadata, which is provided by the business or source applications, process execution metadata is generated by the data warehouse team and provides insights into the ETL processing for maintenance. The data is used by the data warehouse team or by end-users to better understand the data warehouse performance and results presented in the information marts. One type of process execution metadata is the control flow metadata which executes one or more data flows among other tasks. Logging the process execution provides a valuable tool for maintaining or debugging the ETL processes of the data warehouse because it provided information about the data lineage of all elements of the data warehouse. Continue Reading

Marc Winkelmann In Data Modeling

Data Vault Modeling: Visual Example in the Accounting Industry

Visual Example of a Data Vault Modeling

With the addition of architecture and process definitions in Data Vault 2.0, Dan Linstedt has standardized Data Vault modeling.

Based on these standardized symbols, the Visual Data Vault modeling language was developed, which can be used by Enterprise Data Warehouse architects to build Data Vault models.

The authors of the book “Building a Scalable Data Warehouse”, who are the founders of Scalefree, required a visual approach to model the concepts of Data Vault in the book.

For this purpose, they developed the graphical Data Vault modeling language, which focuses on the logical aspects of Data Vault.

The Microsoft Visio stencils and a detailed white paper are available on www.visualdatavault.com as a free download.

Hubs in Visual Data Vault

Business keys play an important role in every business, because they are referenced by business transactions and relationships between business objects.

Whenever a business identifies and tracks business objects, business keys are used throughout business processes.

This is one of the reasons why Data Vault is based on the business keys. In Data Vault modeling, business keys are stored in hub entities.

The challenge is to identify the business keys that represent a business object uniquely. That can be just one business key, but also a composite key or a smart key.

The first image shows a hub with only one business key attribute:

Here, the attribute Invoice Number is sufficient to identify the invoice. No other attribute is required (such as the invoice year).

In other cases, it is not as easy, as the following diagram shows:

In this Data Vault modeling case, the accountant is identified by a Country Code attribute (such as the ISO2 code) and an Employee Number attribute.

One attribute alone would not be sufficient to identify the accountant: the employee number by itself might be overlapping across all countries and have only a local meaning (employee number 10006 might be used in multiple countries and identify a different accountant in each country).

Therefore, the local key is extended by the country code to uniquely identify the accountant. Be aware, the country code has to be in the source data to make this a valid model in Data Vault (in the end, we do model source data, in the Raw Data Vault, not the desired model of the business).

Another example extends this concept into a so-called smart-key:

Here, the IBAN number, which is used to identify banking accounts internationally, consists of 4 physical elements in the number:

Country code
Checking number
Account number
Bank Identifier code

In order to model a smart key (a key that comprises multiple parts or keys), add a smart key to the hub and then add business keys to identify the sections of the smart key.

As you can see from above figure, the logical symbol of a smart key is similar to that of a business key. However, the icons are slightly different and the shape indicates a stack.

In this Data Vault modeling example, each business key is modeled as an individual attribute in the hub entity. The combination identifies a business object in the business. The checking number is actually not modelled, because it contains no business value (except the ability to serve as a technical checksum).

However, you’re not wrong with adding it to the model, too.

Marc Winkelmann In Data Vault

Hash Keys in Data Vault

Hash Keys in DV2.0

Data Vault 2.0 introduces hash keys to enhance the traditional Data Vault model, bringing several advantages to data warehousing. Hash keys do not only speed up the loading process; they also ensure that the enterprise data warehouse can span across multiple environments: on-premise databases, Hadoop clusters, and cloud storage.

Let’s discuss the performance gain first: to increase the loading procedures, dependencies in the loading process have to be minimized or even eliminated. Back in Data Vault 1.0 sequence numbers were used to identify a business entity and that had to include dependencies during the loading process as a consequence.

These dependencies have slowed down the load process which is especially an issue in real-time-feeds. Hubs had to be loaded first before the load process of the satellites and links could start. The intention is to break these dependencies by using the hash keys instead of sequence numbers as the primary key.

In this article:

Business Keys vs Hash Keys
Massively Parallel Processing (MPP)
Hash Keys to Join Apache™ Hadoop® Data Sets
Hash Difference
What Hash Function to Use
Collision
Summary

Business Keys vs Hash Keys

In advance, business keys may be a sequence number created by a single source system, e.g. the customer number.

But, business keys can also be a composite key to uniquely identify a business entity, e.g. a flight in the aviation industry is identified by the flight number and the date because the flight number will be reused every day.

In general: a business key is the natural key used by the business to identify a business object.

While using the business keys in Data Vault might be an option, it is actually a slow one, using a lot of storage (even more than hash keys).

Especially in links and their dependent satellites, many composite business keys are required to identify the relationship or transaction/event in a link – and to describe it in the satellite.

This would require a lot of storage and slow down the loading process because not all database engines have the capability to execute efficient joins on variable-length business keys.

On the other hand, we would have too many columns in the link because every business key must be a part of the link.

The issue at this point is that we also have different data types with different lengths in the links. This issue is exaggerated because it is also required to replicate the business keys into their satellites.

To guarantee a consistent join performance, the solution is to combine the business keys into a single column value by using hash functions to calculate a unique representation of a business object.

Massively Parallel Processing (MPP)

Due to the independence during the load process of hubs, links, and satellites, it is possible to do that all in parallel.

The idea is to use the fact that a hash key is derived from a business key or combination of business keys without the need for a lookup in the parent table.

Therefore, instead of looking up the sequence of a business key in a hub before describing the business key in the satellite, we can just calculate the hash key of the business key.

The (correct) implementation of the hash function ensures that the same semantic business key leads to exactly the same hash key, regardless of the target entity loaded.

Hash Keys to Join Apache™ Hadoop^® Data Sets

Without hashing the load to Hadoop^® or NoSQL requires a lookup on the hub or link sequence in the relational system before it can insert or attach it’s data.

Hashing instead of sequencing means that we can load in complete 100% parallel operations to all hubs, all links, all satellites, and enrich all Hadoop^® or NoSQL based documents in parallel at the same time.

It also then allows to join across multiple heterogeneous platforms – from Teradata Database to ApacheTM Hadoop^® for example.

Hash Difference

The hash difference column applies to the satellites. The approach is the same as with the business keys, only that here all the descriptive data is hashed.

That reduces the effort during an upload process because just one column has to be looked up. The satellite upload process first examines if the hash key is already in the satellite, and secondly, if there are differences between the hash difference values.

Hash Keys and Hash Difference in Data Vault 2.0

The image above shows the process if the source system sends you a full data delivery.

The point where to hash the values should be on the way into the staging area because at this point there is time to check for “hash collisions”, and handle hashing issues before continuing with the data in the Data Vault.

What Hash Function to Use

There are many hash functions to choose from: MD5, MD6, SHA-1, SHA-2, and some more.

We recommend using the MD5 algorithm with a length of 128 bits in most cases, because it is most ubiquitously available across most platforms, and has a decently low percentage chance of hash collision with an acceptable storage requirement.

An additional advantage hashing brings is that the hashes have the same length and are from the same data type, from which a performance improvement arises.

Furthermore, hash values are generated, so they never get lost and can be recreated.

Collision

Hashing has a very small risk: the collision. That means, two different data will get the same hash value.

But, the risk is very small, for example: In a database with more than one trillion hash values, the probability that you will get a collision is like the odds of a meteor landing on your data center.

Summary

Hash Keys are not mandatory in the Data Vault, but they are highly recommended. The advantages of

massively Parallel Processing (MPP),
data load performance,
consistency and
auditability

are indispensable and can not be reached with Sequences or Business Keys.

Hash keys are no silver bullet…but they provide (much) more advantages than they introduce disadvantages (storage).

Marc Winkelmann In Data Modeling

The Value of Non-Historized Links

Non-Historized Links in Data Vault 2.0

Non-historized links in Data Vault 2.0 simplify data modeling by focusing on current relationships instead of historical changes. This improves efficiency, reduces complexity, and enhances query performance for more agile analytics.

In this article:

Introduction to Non-Historized Links
Key Characteristics of Non-Historized Links
Conclusion

Introduction to Non-Historized Links

Non-historized links, also known as Transaction Links, are one of the nuts and bolts of the Data Vault 2.0 framework, but how do they work within a model? When Dan Linstedt, co-founder of Scalefree, invented the Data Vault, he had several goals in mind. One of the goals was to load data as fast as possible from the source into a data warehouse model, process it into information, and present it to the business analyst in any desired target structure.

For Data Warehouse automation and simplicity, the Data Vault 2.0 model exists only of three basic entity types:

Hubs: a distinct list of business keys
Links: a distinct list of relationships between business keys
Satellites: descriptive data, that describe the parent (business key or relationship) from a specific context, versioned over time.

Now, as we always teach (and sometimes preach): you can model all enterprise data using these three entity types alone. However, a model using only these entity types would have multiple disadvantages. Many complex joins, storage consumption, ingestion performance, and missed opportunities for virtualization.

The solution? Adding a little more nuts and bolts to the core entity types of the Data Vault to cope with these issues. One of the nuts and bolts is the use of non-historized links, also known as Transaction Links:

In this example, Sales can be modeled with non-historized links that capture sales transactions of a customer, related to a store. The goal of the non-historized link is to ensure high performance on the way into the data warehouse and on the way out. Don’t forget, that the ultimate goal of data warehousing is to build a data warehouse not just model it. And building a data warehouse involves much more than just the model: it requires people, processes, and technology.

Key Characteristics of Non-Historized Links

So, how do non-historized links meet these goals? Think about your business analysts. What are their goals? In the end, to be honest, they don’t care about a Data Vault model. Instead, they would like to see dimensional models, such as star schema’s and snowflake models or flat-and-wide models for data mining. Or, once in a while, they want to see the ugly-looking tables from the mainframe, sometimes linked, sometimes not, and not a lot of people fully understand the relationships anymore…but for backward compatibility, that’s just great.

Having defined the target, the next question comes into mind: what is the target granularity? For example, in a dimensional model, the target granularity of fact tables often reflects the transactions to be analyzed (think about call records in the telecommunications industry or banking transactions).

Interestingly, this desired target granularity can often be found directly in the source systems. Because a telecommunications provider has an operational system in place that records each phone call. Or a banking application that records every account transaction. These records are typically loaded to the data warehouse without aggregation (at least in Data Vault where we are interested in the finest granularity for auditing and delivery purposes).

And here comes the problem with the standard Data Vault entity types. While they are very simple and patternized, they have one problem. It’s the fact that the standard link “stores a distinct list of relationships”, as stated above. That means the link is only interested in relationships from the source that are unknown to the target link. If a customer walks into a store multiple times and purchases the same product, the relationship between the customer (number), store (number), and product (number) is already known and no additional link entry is added.

As a result, the granularity of the incoming data is changed when loading the target link. If the transaction’s underlying relationship is already known, the transaction would be omitted (and instead captured by a satellite).

The next problem is that the link granularity now differs from the target granularity because the business analyst wanted one record per transaction and not per distinct business key relationship. Another grain shift is required that typically involves joining the satellite of the link to the link itself to recover the original grain.

As we explained in our book “Building a Scalable Data Warehouse with Data Vault 2.0” a grain shift is relatively costly in terms of performance. This is because the operation requires costly GROUP BY statements or LEFT and RIGHT JOINS.

And for what? The end result of both operations often results in the original granularity from the source system. Two expensive grain shifts for nothing sounds like a bad deal.

And it is.

That is where non-historized links come into play: the link is a simple variation of the standard link with the goal of capturing the source transactions and events at the original granularity.

In a standard link, the granularity is defined by the number of hub references in the link. This would not be sufficient to capture multiple transactions that involve the same business keys (e.g., the same customer, store, and product) in the same target link. To resolve the issue, an additional element of the alternate key on the non-historized links is required. Typically a transaction or event ID, such as the call ID or a basket ID in the retail industry (which also can be modeled as its own Hub). In combination with the hub references, the combined business key should be unique per transaction. This way, we can now capture one record in the target link per source record from the operational system.

Figure: Non-Historized Link with Sales ID as an additional key

One deeper scenario would be, if the same product (in that case the highest granularity) emerges twice in a sale, because of different discounts for example. In that case, the line item number would be an additional key (Dependent Child Key) to make every data set unique.

From a loading perspective, the ingestion process of the incoming data is much improved, because no grain shift is required anymore. All data is loaded directly from the source into the target. As long as the business analyst wants to report exactly on this granularity, it is possible to restructure the non-historized link table into a fact table by using a virtual SQL view on top.

One major point is that non-historized links do not allow edits to the descriptive data, meaning there is never any history issued on the facts (e.g. sensor or machine-generated data). Because the descriptive data cannot change, there is no need to put it in a Satellite, rather descriptive data is kept directly in the non-historized links.

For performance reasons, avoid grain shifts on the way out. If a different target granularity is required, and it is not possible to load this granularity from the source system or from the Data Vault model, consider a bridge table. The purpose of this entity type is to materialize the grain shift while keeping the advantage of customizing the target according to the individual requirements set for the dimensional target.

Figure: Virtual Bridge Table as Fact Table with GROUP BY Store Hash Key

Conclusion

Non-historized links offer a streamlined approach to Data Vault modeling, optimizing performance and simplifying data structures. By focusing on current relationships rather than historical changes, they enhance efficiency, reduce storage needs, and improve query speed. Incorporating them into your data architecture can lead to a more agile and scalable analytics environment.

Looking beyond Scrum and learn how to increase the value in Data Vault 2.0 projects

GDPR

What is new in GDPR?

WHAT IS A DATA LAKE?

Combining Managed Self-Service BI with Data Vault 2.0

About Self-Service BI

About the Managed Self-Service BI Approach

Implement MSSI in the Data Vault 2.0 Architecture

Conclusion

Hybrid Architecture in Data Vault 2.0

Logical Data Vault 2.0 Architecture

Hybrid Architecture

Conclusion

SATELLITES IN VISUAL DATA VAULT

Visual Data Vault

Links in Visual Data Vault

Conclusion

Visual Example of a Data Vault Modeling

Hubs in Visual Data Vault

Hash Keys in DV2.0

Business Keys vs Hash Keys

Massively Parallel Processing (MPP)

Hash Keys to Join Apache™ Hadoop® Data Sets

Hash Difference

What Hash Function to Use

Collision

Summary

Non-Historized Links in Data Vault 2.0

Introduction to Non-Historized Links

Key Characteristics of Non-Historized Links

Conclusion

Build Better Data Platforms

SOLUTIONS

TRAINING

EVENTS

KNOWLEDGE HUB

CAREERS

COMPANY

Hash Keys to Join Apache™ Hadoop^® Data Sets