Skip to main content
search
0
All Posts By

Michael Olschimke

Michael Olschimke is the Co-Founder and CEO of Scalefree and a "Data Vault 2.0 Pioneer" with over 20 years of IT experience. A Fulbright scholar and co-author of Building a Scalable Data Warehouse with Data Vault 2.0, Michael is a global authority on AI, Big Data, and scalable Lakehouse design across sectors like banking, automotive, and state security.

Modeling Links with Null Business Keys in Data Vault

Null Business Keys

Data Vault is a methodology used for modeling data in large-scale data warehouse environments. It’s designed to handle rapidly changing data and easily scalable. One of the challenges faced by data modelers is handling null business keys in Data Vault. In this article, we’ll explore how to model links with null business keys, using an example scenario involving the Entity_Roles table.



Understanding the Challenge

The question we’re dealing with here revolves around how to model the Entity_Roles table when there are null values in the linked data fields. In particular, the table might have records where certain fields like Project or Task are null, creating a challenge when attempting to create links between different entities. Let’s break down the situation and find the best way to handle these null values and ensure smooth data modeling.

Context: The Entity_Roles Table

The Entity_Roles table holds information about the relationship between entities and their roles in different tasks and projects. However, there may be situations where a given record in the table has missing (null) values for certain business keys, such as Project or Task.

What Are Business Keys and Links?

Before diving into the solution, let’s first clarify what business keys and links are in Data Vault. A business key is a unique identifier for a specific business object, such as a project or task. Links are used in Data Vault to represent relationships between business keys. The Entity_Roles table, for example, might act as a link between an entity, its role, and the task or project it’s associated with.

Approaching the Problem: Handling Null Business Keys

The key challenge here is dealing with the null values in the Entity_Roles table. Null values in business keys can create issues when trying to establish relationships between entities. A typical solution in Data Vault is to replace these null values with zero keys. A zero key represents an unknown business key, allowing the link to still function properly despite missing data.

Option 1: Using Zero Keys

One approach to handling null values is to use zero keys for any missing business keys. For instance, if a Task or Project is null, we can assign a zero key to represent the unknown value. The advantage of using zero keys is that it helps maintain query efficiency by ensuring consistent joins, and it allows us to deal with missing data without breaking the link.

Option 2: Splitting the Link into Two Views

Another option could be to pre-stage and create two views to split the Entity_Roles table: one where Task is not null, and another where Project is not null. However, this approach is not recommended because splitting the data may lead to unnecessary complexity and potential inconsistencies in the data. In general, it’s a good practice in Data Vault to avoid splitting data at this level.

Option 3: Handling Multi-Activity Data

If the Entity_Roles table contains multiple rows for the same relationship (e.g., an entity linked to both a task and a project), you may need to account for multi-activity data. This can be done by using a Multi-Active Satellite, which allows you to capture multiple descriptions of the same link. This ensures that you can store all the variations of a relationship, such as different roles or project-task assignments, in the same link.

Designing the Solution

When it comes to designing the solution, there are a few key design decisions to make:

1. Create a Single Link

Rather than splitting the Entity_Roles table, it’s generally better to keep it as a single link. This approach allows you to maintain all relationships in one place and makes the system simpler to manage.

2. Using a Multi-Active Satellite

If there are multiple active roles or tasks associated with a single entity, then using a Multi-Active Satellite is a good option. This satellite would store all the variations of the role-task-project relationships, allowing you to track all relevant data points.

3. Handling Null Values with Zero Keys

As mentioned earlier, replacing null values with zero keys is a good practice in Data Vault. This helps maintain referential integrity, ensures that your links remain intact, and avoids the need for complex filtering or splitting of data.

4. Use of CDC Satellites for Intraday Changes

If your data involves intraday changes (i.e., updates that occur within a single day), you may want to use Change Data Capture (CDC) Satellites. These satellites track changes at the record level, allowing you to preserve the history of the relationships between entities, roles, tasks, and projects.

Additional Considerations

There are a few other considerations when modeling links with null business keys:

1. Should I Create a Weak Hub for ID?

No, it’s generally not recommended to create a weak hub for the ID. Instead, treat the ID as a descriptive attribute and store it in the satellite attached to the link.

2. Where Should I Put Descriptive Data?

Descriptive data should go in the satellite. This can include things like roles, entity types, or other attributes that provide more context to the link between entities.

3. Should I Put the Role in the Link?

If the role is an important business key, you could consider putting it in its own hub. However, if the role is just a descriptive attribute, it’s best to store it in the satellite attached to the link. If the role becomes a business key later, you can refactor the design and create a hub for it.

Final Thoughts

Modeling links with null business keys is a common challenge in Data Vault. The key is to handle null values appropriately, whether that’s by using zero keys, leveraging multi-active satellites, or managing intraday changes with CDC satellites. By maintaining a clear and consistent approach to handling these null values and relationships, you can ensure that your Data Vault design is scalable, efficient, and capable of handling complex data scenarios.

Watch the Video

Modeling Project Tasks and Actions in Data Vault

Modeling Project Tasks and Actions

In the world of Data Vault, creating effective data models to capture project tasks and actions is a crucial part of building scalable, efficient, and auditable data solutions. One of the challenges that data engineers face when building such models is ensuring that the history of project tasks and actions is captured correctly, while also accounting for the complexity of slowly changing data and continuous changes like those seen in Change Data Capture (CDC) environments.



Understanding the Data Model

Let’s start by understanding the data model in question. We have three key tables:

  • Projects: This table contains details about various projects.
  • Tasks: This table contains tasks associated with each project. It has a foreign key linking back to the Projects table.
  • Actions: Each task can have one or more actions associated with it, and the Actions table has a foreign key linking back to the Tasks table.

The objective is to create a Project Dimension and a Task Dimension. The Task Dimension should include an additional attribute that represents the latest action associated with each task. To solve this, there are multiple approaches you could take, but two common ones stand out:

  • Multi-Active Satellite (MA-SAT) based on sets: The simplest approach would be to use a MA-SAT model based on sets, selecting the latest action for each task within those sets. However, this can be complex in a CDC environment.
  • Multi-Active Satellite (MA-SAT) based on an MA-Attribute: Instead of using sets, this model would use an action ID as an MA-Attribute, helping to simplify the approach while avoiding the issues created by capturing sets.

Challenges with the CDC Environment

When your source system uses Change Data Capture (CDC), the problem of handling sets becomes more complicated. In CDC, you’re dealing with a stream of changes where new records are added, existing records are updated, and old records may be deleted. This constant flow of data presents two key challenges:

  • Capturing sets is not straightforward: In a typical MA-SAT approach, a set captures all records related to a specific task or project, but with CDC data, you’re more likely to receive individual changes, rather than a full snapshot of all records at once.
  • Volume management: In a CDC environment, where actions are continually added, there is a risk that using sets will generate a massive volume of records. For example, if a new action is added to a task every day, this could result in an overwhelming number of records over time.

Given these challenges, using an MA-Attribute approach where the action ID serves as the key for the task’s latest action provides a more scalable solution. Instead of creating an ever-expanding set, you can focus on just the latest action for each task, keeping the volume under control.

Solution: Using a Non-Historized Link

The idea behind using a Non-Historized Link in this case is to capture actions as they come in from the source system, without the need for sets. In this setup, a Non-Historized Link captures the relationship between tasks and actions and includes the action ID. When new actions come in, they are inserted into the link table, and updates or deletions are handled as changes to the data. This provides a streamlined and efficient way to track the most recent actions associated with each task.

Handling Updates and Deletions

In CDC environments, when updates and deletions occur, it’s important to track these events accurately. The solution to this is technical counter transactions. When an update occurs in the source system, it’s treated as a deletion of the old version and an insert of the new version. This allows the system to track changes over time and ensures the history is accurately captured.

For deletions, we simply negate the value of the original record, ensuring the deleted record is not included in any aggregations. This negation technique is useful for tracking the current state of each task, action, or project.

Building the Task Dimension

The task dimension is where we’ll include the latest action for each task. To do this, we can use a technique involving a Point-in-Time (PIT) table, which captures the latest version of data (in this case, the latest action for each task) by linking the task hub to the appropriate satellite data.

Here’s how this works:

  • Create the PIT table for tasks: This table will store the latest version of each task, including references to the task satellite and the action satellite.
  • Use window functions: To retrieve the latest action for each task, you can apply window functions in your SQL queries to order by the business date (or load date) and select the latest action based on these values.

Once the PIT table is created, it can be used to build the Task Dimension, which will then reflect the latest action for each task.

Optimizing the Approach

One of the key considerations when implementing this model is performance. Using window functions and PIT tables is an effective way to manage the volume of records, but it’s important to ensure your database is optimized for these types of queries. In high-volume environments, applying business logic during the PIT table loading process can reduce the load time and make data delivery faster.

Conclusion

Modeling project tasks and actions in Data Vault requires careful consideration of the data flow, especially in environments that rely on CDC. By using Non-Historized Links, Multi-Active Satellites based on MA-Attributes, and PIT tables, you can create a scalable, efficient model that captures the latest actions associated with each task, without generating excessive volumes of data.

By applying these techniques, you can streamline the data flow, optimize performance, and ensure the data is always accurate and up-to-date, making it easier for your data users to access the information they need.

Watch the Video

Unit of Work in Data Vault

What Is the Unit of Work?

The Unit of Work is a fundamental concept in Data Vault modeling. It refers to the idea of capturing all the related business keys from a source system together in a single Link structure. This ensures that the relationships represented in the raw data are maintained intact in your Raw Data Vault.

Think of it as keeping all pieces of a puzzle together. When you break the pieces apart without a clear strategy to reassemble them, you risk losing the original picture—or worse, creating a picture that never existed in the first place.



The Key Question: What Happens If You Split Links?

The question posed in our Data Vault Friday session was:

“It would be good to refresh the topic of the Unit of Work. What is the information I’d be losing if I split wider Links into smaller pairs (DrivingKey -> Key1/Key2 etc)? I can think of a technical setup where I’d have multiple Links and Effectivity Satellites and I should be able to get the same results to any point in time… What information exactly gets lost if I split Links?”

The Risk of Splitting Links: A Practical Example

Let’s start with a simple example to illustrate the point:

Imagine you have a relationship between three business entities—Customer, Store, and Product. Customers go into Stores and purchase Products. If you model this relationship as a single Link that includes all three Hubs (Customer, Store, Product), you have a clear and accurate picture of that transaction.

Now, suppose you decide to split this Link into two smaller Links:

  • Customer -> Store
  • Customer -> Product

At first glance, this may seem fine. However, when you attempt to reconstruct the original dataset by joining these two Links, you may create new combinations that never existed. For example:

  • Customer A visited Store K and purchased Product X (valid)
  • Customer A visited Store L (also valid), but when you join this with Product X, you create a false relationship—Customer A purchased Product X at Store L (which never happened in reality).

This phenomenon, where new, artificial rows are generated during joins, leads to data inconsistencies. You’ve effectively introduced relationships that didn’t exist in your source system. That’s the major risk when splitting Links.

The Jedi Test: Verifying Your Model

To avoid these issues, Michael recommends applying what he calls the Jedi Test (yes, Star Wars fans, rejoice!).

The Jedi Test is a simple but powerful validation: Try to reconstruct the original source dataset from your target Data Vault model. If you can recreate the original rows exactly—no more, no less—you’re golden. If you end up with extra rows or missing rows, you’ve got a problem.

This test helps you ensure that your Links are modeled correctly and that no data is lost or inaccurately recreated when splitting Links.

Driving Keys and Unit of Work

One scenario where splitting Links causes definitive problems is when you’re dealing with Driving Keys. Driving Keys are a subset of business keys that uniquely identify a relationship.

Michael explains that splitting components of a Driving Key almost always results in a Unit of Work problem. If you break apart a Driving Key, you risk losing the integrity of your data relationships. That’s why it’s best to keep all Hub references from the same source together in a Unit of Work Link within the Raw Data Vault.

He emphasizes that although he’s not a fan of Driving Keys (calling it a “hate-love relationship”), there are scenarios where they are necessary, and when they are, you must be cautious not to split them.

The Safe Bet: Keep Links Together in the Raw Data Vault

The best practice? Keep all Hub references from the same source table together in one Link—the Unit of Work Link.

This approach guarantees that the full relationship is captured as it appeared in the source data. You can always split these relationships later in the Business Vault, where transformations and additional rules can be applied. But the Raw Data Vault should act as your trusted, immutable source of truth.

By maintaining the Unit of Work in your Raw Data Vault, you protect yourself from potential data inconsistencies and ensure you can always rebuild accurate, reliable data structures downstream.

What If You Have to Split the Link?

Sometimes, technical limitations force you to split a Link. For instance:

  • Your database system limits the number of Hub references you can include in a composite index.
  • You have descriptive data at different granularities, making a wide Link impractical.

In these cases, Michael suggests a few strategies:

  1. Keep the Unit of Work Link as your base Link.
  2. Create additional Links to capture specific granularity needs (Customer -> Product, Customer -> Store, etc.).
  3. If you split Links, ensure you also capture any missing relationships—for example, if you have Customer -> Store and Customer -> Product, you might also need Store -> Product to maintain the complete relationship picture.

These extra Links allow you to recreate the original dataset accurately if needed, but they add complexity. More Links mean more entities, more joins, and potentially slower query performance. That’s why Michael emphasizes the importance of an efficient model and recommends keeping Links together whenever possible.

Efficiency Considerations

From a performance perspective, fewer Links with more Hub references often result in faster queries. You reduce the number of joins and simplify your model. Conversely, more Links increase complexity and the potential for errors.

Therefore, the trade-off is clear: Simplicity and efficiency with a Unit of Work Link versus flexibility with additional Links—but at the cost of complexity and higher risk.

Final Thoughts and Recommendations

To summarize:

  • Always aim to keep Hub references from the same source together in a Unit of Work Link in your Raw Data Vault.
  • Perform the Jedi Test to ensure you aren’t losing or falsely creating data relationships.
  • If you must split Links, document and manage them carefully, ensuring you maintain all necessary relationships (even those not obvious at first glance).
  • Save complex splits and transformations for the Business Vault, where they can be more easily corrected without reloading the entire dataset.

At the end of the day, a well-structured Unit of Work ensures your Data Vault remains a solid, reliable foundation for analytics, reporting, and data science initiatives.

Watch the Video

Data Vault on Wide Tables: Best Practices and Considerations

Understanding Data Vault on Wide Tables

Storing data as wide tables in a data lake while applying a logical Data Vault layer using views presents unique challenges. The goal is to virtualize a raw Data Vault model on top of a data lake while ensuring optimal performance.



Key Considerations for Performance Optimization

  • Data Remains the Same: Regardless of whether you use Iceberg, Snowflake, or another technology, the fundamental data characteristics remain unchanged. Issues like dirty data and transformations still apply.
  • Descriptive vs. Transactional Data: Wide tables typically contain a mix of master and transactional data. Most attributes tend to be descriptive, especially in master data.
  • Granularity Matters: Properly defining granularity helps structure the Data Vault model efficiently, especially for hubs, links, and satellites.

Virtualizing Data Vault on a Data Lake

When implementing a virtual Data Vault, consider:

  • Hubs and Links: These should be materialized instead of virtualized. Business keys from wide tables should be extracted and stored in separate iceberg tables for efficiency.
  • Satellites: Virtualizing satellites using views is recommended, but pay attention to GDPR and personal data separation.
  • Indexing Performance: Hubs, links, PIT (Point-in-Time) tables, and bridges serve as indexes in a data lake environment.

GDPR and Data Privacy Challenges

One of the biggest concerns in a wide table approach is data privacy. Since wide tables often include personal data, you must consider logical deletion, encryption, or physical separation techniques to comply with regulations.

Enhancing Performance with Materialized Structures

To achieve good query performance, consider materializing certain structures:

  • Materialized Hubs and Links: These structures act as indexes and improve data retrieval efficiency.
  • PIT and Bridge Tables: These further optimize queries by structuring data in a way that minimizes computational load.
  • Denormalized Information Marts: End-users should query fully materialized information marts, ensuring high-speed access.

Does Datavault4dbt Support This Approach?

Discussions around Datavault4dbt suggest it may support this approach in the future. If you’re working on a project with this implementation, consider reaching out to Scalefree to explore collaboration opportunities.

Conclusion

Applying Data Vault on wide tables within a data lake architecture requires careful planning. The key takeaways include:

  • Virtualizing satellites while materializing hubs and links for performance.
  • Addressing GDPR concerns by separating or encrypting personal data.
  • Using PIT and bridge tables to enhance indexing and query speed.
  • Building fully denormalized information marts for end-user access.

By following these best practices, organizations can ensure efficient and scalable Data Vault implementations on wide tables.

Watch the Video

When to Invest in Data Warehousing

Investing in Data Warehousing

Data warehousing is a crucial step for businesses looking to manage, integrate, and analyze large volumes of data efficiently. But at what point should a company consider investing in a data warehouse? Many assume that only large enterprises need data warehousing, but the reality is more nuanced.



Is Company Size the Right Factor?

Many businesses believe that data warehousing is only necessary when they reach a certain size. However, size alone is not the determining factor. Instead, the key driver for investing in data warehousing is the complexity of data management and the business challenges that arise from it.

Understanding Business Needs and Data Complexity

Some businesses need data solutions early, even during the startup phase. For example, data-driven companies—such as tech startups that rely on analytics—often invest in data warehousing early to gain insights and improve decision-making.

On the other hand, traditional businesses, like small manufacturing firms, may not require advanced data solutions immediately. If a business operates without much digital data, spreadsheets or simple databases may suffice for years.

When Data Integration Becomes a Challenge

As businesses grow, they tend to use multiple systems for different operations. For example, customer data may reside in a CRM, sales data in an ERP, and marketing data in yet another platform. When companies need a unified view of their operations, data warehousing becomes a necessity.

Operational reports from systems like SAP or Salesforce can work for a while, but eventually, businesses realize they need to consolidate data from various sources to make better strategic decisions.

Regulatory and Compliance Requirements

In some industries, regulatory requirements mandate specific reporting and auditing capabilities. A cleaning service provider, for instance, may need to generate reports on the number of windows cleaned to comply with contract terms. Failure to provide such reports can result in financial penalties.

In such cases, investing in data warehousing can be justified by calculating the potential financial losses due to non-compliance. Businesses in regulated industries, such as finance and healthcare, often prioritize data warehousing earlier for compliance reasons.

External Data and Decision-Making

Some businesses receive vast amounts of external data from suppliers, partners, or customers. If decisions rely on analyzing large datasets, a data warehouse can streamline operations, improve accuracy, and reduce manual effort.

For example, a company relying on Excel sheets for pricing decisions may struggle to track and compare data efficiently. A data warehouse helps aggregate, cleanse, and analyze data, enabling better decision-making.

Cost vs. Benefit Analysis

While budget constraints may delay investment in data warehousing, businesses should assess the cost of inefficiencies. If manual data handling, poor integration, or compliance risks cost the company significant time and money, the investment in a data warehouse is often justified.

Final Thoughts

Investing in data warehousing is not solely about company size but rather about business needs and data complexity. Companies facing data-driven challenges, compliance requirements, or inefficiencies in data management should consider investing in a data warehouse—regardless of their size.

Watch the Video

Effectivity Satellites on Links

Watch the Video

Understanding Effectivity Satellites in Data Vault

Effectivity satellites play a crucial role in Data Vault modeling by tracking changes and deletions in source systems. In this article, we’ll explore when to use an effectivity satellite, how it differs from a regular satellite, and the best practices for implementing it.

What is an Effectivity Satellite?

An effectivity satellite is essentially a standard satellite used to capture business time attributes such as valid-from and valid-to dates, contract start and end dates, and deletion timestamps. The key distinction is that it tracks soft deletions from source systems, ensuring that historical data integrity is maintained in the Data Vault.

When Should You Use an Effectivity Satellite?

Effectivity satellites should be used when you need to track historical changes in relationships and entities, especially deletions. Common use cases include:

  • Tracking contract start and end dates
  • Monitoring employee and corporate car assignments
  • Managing customer records and their deletion status

Difference Between Regular and Effectivity Satellites

Regular satellites store descriptive attributes like names and addresses, whereas effectivity satellites focus on time-based attributes and deletion markers. While regular satellites track changes in data, effectivity satellites specifically manage record deletions and validity periods.

Choosing Between Link Satellites and Effectivity Satellites

Link satellites capture changes in relationships between entities, whereas effectivity satellites track the validity of those relationships. You should choose an effectivity satellite when:

  • Deletions need to be recorded without physically removing data
  • You need to track when a relationship was created and ended
  • Historical relationship integrity must be preserved

Example: Employee and Corporate Car Assignment

Consider an employee assigned to a corporate car. When the assignment changes, a new link entry is created between the employee and the new car. The effectivity satellite records the deletion timestamp for the old relationship and maintains the history of assignments.

Handling Deletions in Effectivity Satellites

One of the main challenges in effectivity satellites is detecting and handling deletions. Different data loading methods impact how deletions are recorded:

  • Full Loads: Compare current and previous loads to identify missing records.
  • Change Data Capture (CDC): Uses system flags to detect deletions.
  • Delta Loads: Requires additional logic to identify removed records.

Effectivity Satellites in Business Vault

In a Business Vault, effectivity satellites can be used to implement business rules for tracking deletions. For instance, a customer may be considered deleted only if removed from all source systems, which requires a business-driven deletion logic.

Conclusion

Effectivity satellites are essential in Data Vault modeling for tracking deletions and historical changes. Understanding their role and choosing the right satellite type ensures accurate data lineage and integrity.

Interview with Julien Redmond, Creator of IRiS

Interview with Julien Redmond

Welcome to another edition of Data Vault Friday! I’m Michael Olschimke, CEO of Scalefree, and every Friday at 11 o’clock, we dive into discussions about Data Vault, data mining, cloud computing, and any data-driven applications. Today, we have a special guest—Julien Redmond from Ignition Data in Australia, who’s been working with us as a partner. Julien has developed the IRiS Data Vault automation tool, and he’s here to share insights about this innovative solution.



The Global IRiS Tour

Julien has been traveling the globe, promoting IRiS and ensuring that everyone knows about this groundbreaking tool. IRiS focuses on simplifying the data engineering aspects of Data Vault, rather than the modeling tasks, making it accessible and easy to use. Julien’s goal was to create a process so straightforward that anyone could learn it in less than a day. This simplicity allows teams to make Data Vault tasks repeatable and manageable, even for junior members.

What Sets IRiS Apart?

With so many Data Vault automation offerings available, IRiS stands out by addressing common pain points. The tool aims to minimize the steep learning curves often associated with other automation tools and facilitates seamless knowledge transfer between experienced and new users. It’s designed to integrate softly with existing data management platforms—whether that’s Microsoft Data Factory, AWS Glue, or other established tools—without disrupting current systems.

Seamless Integration

IRiS requires a minimal amount of metadata, which can be easily extracted from any modeling tool. This means there’s no new modeling interface to learn—just feed the metadata into IRiS, and it generates the necessary stored procedures and data definition scripts. This integration approach ensures that companies can leverage their existing platforms while adding powerful Data Vault automation capabilities.

Empowering Data-Driven Organizations

IRiS supports a range of target platforms like Databricks, Snowflake, and Microsoft tools, aligning with the growing trend of moving towards Lakehouse architectures. Organizations can incrementally move data into the Lakehouse based on specific use cases, promoting value-driven design and delivery. Julien emphasized that IRiS is lightweight, inexpensive, and comes as a single container—making it easy to deploy and use without significant overhead.

Learning and Community Support

One of the standout features of IRiS is its supportive learning environment. It includes an online training program with six hours of videos, a comprehensive playbook blending Data Vault methodology with practical user guidance, and access to a knowledge hub with tips and tricks. New users can get up to speed quickly, reinforcing their learning with a supportive community ready to help when needed.

Future of IRiS

Julien’s global tour reflects the excitement and confidence behind IRiS. As he visits partners worldwide—from Finland to the US—he’s spreading the word about how IRiS can transform Data Vault engineering, especially for organizations invested in cloud platforms. The response so far has been overwhelmingly positive, with teams appreciating how IRiS fits into their existing infrastructures while simplifying their workflows.

That wraps up this special session of Data Vault Friday! Thanks for joining us, and a big thanks to Julien for sharing his journey with IRiS. We’ll return to our usual Q&A format next time, so be sure to bring your questions. Until then, have a fantastic weekend!

Watch the Video

Modelling Salesforce History Tables in Data Vault

Modelling Salesforce History Tables

Salesforce tracks changes to configured attributes by storing them in history tables. This data, which includes record ID, field name, old and new values, and timestamps, presents a unique challenge for Data Vault modeling. In this article, we’ll explore an optimal way to model this data using Data Vault principles.



Understanding Salesforce History Tables

Salesforce allows tracking of specific attribute changes within objects like Contacts. These changes are stored in history tables such as ContactHistory. Each entry logs:

  • Record ID (e.g., Contact ID)
  • Field Name
  • Old Value
  • New Value
  • Timestamp

Challenges in Modeling Salesforce History Data

When designing a Data Vault model for this history data, there are key challenges to consider:

  • Handling multiple changes for the same record within a short time frame
  • Maintaining referential integrity
  • Efficiently querying and pivoting data for reporting

Approach: Multi-Active Satellite

A common initial approach is to model the history table as a multi-active satellite attached to a Contact Hub, with the field name as the dependent key. However, this approach has pitfalls:

  • Duplicates can arise if multiple changes occur for the same field in the same batch
  • Timestamp-based keys are unreliable due to possible duplicate timestamps

To counter this, a unique sequence number should be assigned in the staging area and used as a dependent key.

Optimized Approach: Non-Historized Link

Instead of a multi-active satellite, a non-historized link can be used to model Salesforce history data more efficiently. Here’s how it works:

  • Create a non-historized link connecting the Contact and User hubs.
  • Store change-related attributes (field name, old value, new value, timestamp) directly within this link.
  • Use the timestamp as an event-based attribute rather than part of the primary key.

This approach avoids the need for complex joins and simplifies querying.

Efficient Data Retrieval: Pivoting

Since history tables are structured in a key-value format, queries often require pivoting. By using database pivot functions, we can restructure the data into a more usable format for reporting without excessive joins.

Alternative Consideration: JSON Storage

Another approach is to store change data as a JSON object in a standard satellite. This method offers flexibility, particularly when dealing with a large number of attributes. However, it complicates querying and should be used only when necessary.

Conclusion

For most cases, a non-historized link is the optimal way to model Salesforce history tables in Data Vault. It simplifies data storage, reduces the need for extensive joins, and enhances query performance. Multi-active satellites are an alternative but require careful handling of duplicate timestamps and field changes.

Watch the Video

The Power of Data Vault – Business Use Cases

Business Use Cases in Data Vault

In the world of data management and integration, businesses face many challenges. Data Vault is a methodology designed to address these challenges, offering a flexible and scalable solution for integrating and managing data across an enterprise. But when is it the right time to use Data Vault? Are there specific business scenarios where Data Vault’s power truly shines? This article explores the core benefits of Data Vault, its use cases, and how it can solve complex data integration problems.



Understanding the Pain Points

Before diving into when and where Data Vault is most beneficial, it’s important to understand the underlying pain points businesses face in their data management processes. According to Michael Olschimke, CEO of Scalefree, understanding the business pain points is crucial. If there is no significant problem, there may be no need for a new solution like Data Vault. The key is identifying situations where current methods fall short in handling data integration, privacy regulations, and evolving business rules.

The most common pain point is the challenge of data integration. Modern businesses typically operate with data spread across multiple sources, from internal systems to external data feeds. Integrating these data sources into a single, unified view is one of the biggest challenges. Whether you’re trying to generate reports, create dashboards, or analyze data for business insights, you need a consistent and reliable method to integrate data from diverse systems. This is where Data Vault excels.

The Core Strength of Data Vault: Data Integration

Data Vault is designed specifically for situations where integration is a priority. If a business needs to bring together multiple disparate data sources into a single framework for reporting, Data Vault offers a robust solution. Its flexibility allows businesses to combine data from different systems, apply various business logic, and present the data in a meaningful way.

In contrast to other methods, Data Vault shines when the data integration needs are complex. Simply dumping data into a data lake may seem like an easy solution, but it leaves businesses with the challenge of how to integrate these disparate datasets into a cohesive model. Without a clear method for integration, data lakes become isolated silos of information, and producing integrated reports becomes a significant challenge.

Addressing Regulatory Compliance and Privacy

In today’s data-driven world, businesses must also address regulatory requirements such as GDPR. One of the strengths of Data Vault is its built-in support for privacy and security regulations. When managing sensitive data, businesses need to ensure compliance with privacy regulations, including the ability to delete or anonymize personal data when necessary.

While other methods can also address regulatory concerns, Data Vault provides out-of-the-box patterns and solutions that are easy to implement and scale. For example, Data Vault allows businesses to securely store data, apply business rules, and remove or anonymize personal attributes without disrupting the overall data structure. This capability is crucial in today’s regulatory environment, where compliance is not just a best practice but a legal requirement.

Handling Changing Business Rules Over Time

Another key use case for Data Vault arises when businesses face changing business rules. Over time, companies evolve, and with this evolution comes changes in how data is processed and interpreted. For example, a business might need to apply different versions of a business rule to historical and current data, depending on when the rule was in effect.

Data Vault provides a solution to this challenge by separating the data transformation processes and storing them in the “business vault.” This separation allows businesses to apply different versions of business rules to different datasets. For instance, you might apply one rule to data from the previous year and a different rule to the current year’s data. This flexibility allows companies to adapt to new business requirements without overhauling their data architecture every time the rules change.

Scalability and Flexibility

As businesses grow and their data needs become more complex, the scalability of their data management solutions becomes critical. Data Vault is highly scalable because it allows companies to add new data sources, apply new business rules, and adjust their data models as needed without requiring a complete redesign of their data infrastructure.

One of the most powerful features of Data Vault is its ability to “creatively destruct” incoming data. This means that data from different source systems can be broken down into fundamental components—such as business keys, relationships, and descriptive data. These components can then be recombined in any format or structure that suits the business’s reporting or analytical needs, whether that’s a star schema, flat tables, or any other target structure. This flexibility ensures that businesses can meet various use cases and reporting requirements using the same data platform.

Data Vault and Business Intelligence

In the realm of business intelligence (BI), Data Vault stands out as an effective method for managing large, complex datasets. It offers businesses the ability to handle multiple use cases, such as generating reports, analyzing trends, and forecasting future performance. Because it integrates data from multiple sources, it provides a single, reliable source of truth for reporting and analysis.

Unlike traditional BI systems, which often require multiple data platforms or complex ETL (extract, transform, load) processes, Data Vault allows businesses to use a single platform for all their BI needs. Whether you’re running operational reports, building data marts, or creating advanced analytics models, Data Vault’s flexibility ensures that businesses can handle various BI scenarios without the need for separate systems or tools.

Addressing Complex Data Models

While Data Vault is highly flexible, it can also become more complex as businesses face increasingly complex data models. The complexity arises when businesses deal with dirty data, unclear business key definitions, or overlapping data from different source systems. In these situations, Data Vault allows companies to address these challenges by adding new components to their data models, such as hubs, links, and satellites.

For instance, if a business has two different source systems with different business key definitions, Data Vault can create a new hub to store these keys and establish relationships between them. Similarly, when data quality is an issue, Data Vault allows businesses to add computed satellites to clean the data before it’s used for reporting or analysis. While these additional components can increase the complexity of the data model, they are essential for solving the challenges presented by messy, inconsistent, or incomplete data.

When Should You Use Data Vault?

Ultimately, the decision to use Data Vault depends on your business’s data requirements. If your data integration needs are relatively simple, or if you don’t have stringent privacy or regulatory requirements, other solutions might suffice. However, for businesses dealing with complex datasets, evolving business rules, and compliance challenges, Data Vault provides a comprehensive, scalable solution that addresses all these needs.

When evaluating whether Data Vault is the right choice for your organization, it’s essential to assess your current and future data needs. If you require robust data integration, the ability to apply different business rules over time, and compliance with privacy regulations, Data Vault is a powerful tool that can handle these challenges. Its flexibility and scalability ensure that it can grow with your business as your data needs evolve.

Conclusion

Data Vault is a powerful methodology for businesses that need to integrate complex data from multiple sources, apply evolving business rules, and comply with privacy regulations. While it may not be necessary for every business, for those facing challenges in these areas, Data Vault offers a robust, flexible, and scalable solution. By breaking down and restructuring data in a way that supports various reporting and analytical needs, Data Vault ensures businesses can keep up with the ever-changing demands of today’s data-driven world.

Watch the Video

How to Tackle GDPR with Data Vault

Understanding GDPR in the Context of Data Vault

GDPR compliance is a critical concern for organizations handling personal data. Data Vault, a well-structured data modeling approach, offers a robust solution for meeting GDPR requirements, particularly in two key areas: data security and data privacy.



Data Security in Data Vault

Data security involves protecting existing data from unauthorized access. Data Vault supports this through two levels of security:

  • Row-Level Security: This ensures that users can only access records relevant to them. It can be implemented via database row-level security features or view layers.
  • Column-Level Security: Attributes are separated based on security classification. Each classification is stored in a separate Satellite, with access granted accordingly.

By controlling access at both row and column levels, organizations can ensure compliance with GDPR’s data access requirements.

Data Privacy and Deletion in Data Vault

Data privacy focuses on removing personal data when required. Data Vault’s design allows for the physical deletion of personal data without affecting the integrity of the entire dataset. This is achieved through:

  • Satellite Splitting: Personal and non-personal data are stored in separate Satellites. When a deletion request is made, only the personal data Satellite needs to be altered.
  • Data Retention Policies: Different personal data attributes may have varying retention periods. Separate Satellites are created for attributes that must be deleted at different times.
  • Point-in-Time (PIT) Table Updates: When personal data is deleted, PIT tables are rebuilt to reflect the absence of that data.

This approach ensures that deleted data is no longer accessible or retrievable, aligning with GDPR’s right to be forgotten.

Access Control Lists (ACL) in Data Vault

Managing user access to data is another essential aspect of GDPR compliance. Data Vault facilitates this through an ACL system modeled using Hubs and Links:

  • A User Hub stores information about individual users.
  • A User Group Hub categorizes users into groups with shared permissions.
  • A Customer Hub and Bank Account Hub manage customer and account details.
  • A Link connects users, user groups, and customers.
  • An Effectivity Satellite records the time periods during which users have access to specific data.

By applying this structure, access control can be managed dynamically, ensuring that only authorized users can view or modify data.

Security vs. Privacy: A Crucial Distinction

When discussing GDPR, it’s essential to distinguish between security and privacy:

  • Security: The data remains in the system, but access is restricted based on security policies.
  • Privacy: The data is physically removed from the system when no longer needed.

Organizations should ensure that security officers and privacy officers handle these concerns separately to avoid misconceptions, such as assuming filtered data is deleted when it is still present in the database.

Conclusion

Data Vault provides a comprehensive approach to managing GDPR requirements through built-in security and privacy mechanisms. By structuring data appropriately and implementing proper access control and deletion strategies, organizations can achieve GDPR compliance efficiently.

Watch the Video

How to Explain Data Vault to Business Users?

How to Explaining Data Vault

When introducing Data Vault to business users, it’s important to communicate its value in a way that resonates with them. Instead of focusing on the technical details, it’s best to highlight the benefits and business impact.



Understanding the Audience

Business users, including executives and commercial leaders, generally don’t need to understand the intricacies of Data Vault. They are more interested in the outcomes, such as data accessibility, security, and adaptability.

Explaining Data Vault with a Simple Analogy

Imagine you are building a house. As a homeowner, you don’t need to know every construction detail; you just want to ensure that it’s solid, safe, and has all the necessary features. Similarly, business leaders don’t need to understand the technical framework of Data Vault—they just want a reliable data management system that supports their decision-making.

Focusing on Business Value

Instead of using the term “Data Vault,” it’s often more effective to discuss the advantages of a managed data platform:

  • Data Security & Privacy: Ensures compliance with regulations like GDPR while securing sensitive information.
  • Data Integration: Consolidates structured, semi-structured, and unstructured data from multiple sources.
  • Auditability & Transparency: Provides full data lineage, ensuring that every data point can be traced back to its source.
  • Agile Data Delivery: Enables incremental delivery of insights, so business teams don’t have to wait months or years to see results.
  • Adaptability to Change: Easily adjusts to changes in source systems, business logic, and reporting needs.
  • Handling Multiple Business Timelines: Supports complex business requirements, including postdating and backdating of records.
  • Scalability: Handles large datasets and high-speed data processing.

Delivering Business Outcomes

Business leaders care about measurable outcomes. With a Data Vault-based platform, they can expect:

  • Improved decision-making with accurate, timely data.
  • Faster adaptation to market changes and customer demands.
  • Cost efficiency through a structured yet flexible data architecture.
  • Enhanced reporting and analytics with reliable data sources.

Making the Pitch to Business Users

When discussing Data Vault with executives, avoid technical jargon and focus on business goals. Instead of saying, “We use Data Vault 2.0 for data modeling,” say, “We have a data platform that ensures secure, auditable, and easily accessible insights to drive your business forward.”

By emphasizing real-world benefits, you can effectively communicate the value of Data Vault without overwhelming non-technical stakeholders with complexity.

Conclusion

Communicating the benefits of Data Vault to business users requires a shift from technical explanations to business value discussions. By framing the conversation around security, agility, and data-driven decision-making, you can successfully gain buy-in from stakeholders and demonstrate the impact of a well-managed data platform.

Watch the Video

Using PIT and Bridge Tables in Business Vault Entities

Watch the Video

PIT and Bridge Tables

In this blog post, we will answer a commonly asked question regarding PIT and Bridge Tables:

In the Data Vault architecture, is it okay to use/reuse created PIT and Bridge tables in the code of the Business Vault business rules?

The short answer is yes, but let’s dive into the details to understand the rationale and how PIT (Point-In-Time) tables and Bridge tables work in the context of Business Vault entities.



Understanding PIT and Bridge Tables

Before explaining their usage, let’s quickly clarify what PIT and Bridge tables are in the Data Vault architecture:

  • PIT Tables: These provide a snapshot of data for a specific point in time. They help combine deltas and descriptive data to enable calculations or business logic that requires a specific snapshot.
  • Bridge Tables: These are primarily used to resolve many-to-many relationships and improve query performance when dealing with large datasets.

Applying Business Logic in Business Vault

In the Data Vault, data flows from the Raw Data Vault (RDV) to the Business Vault (BV) and finally to the Information Mart (IM). The key difference lies in the granularity of data:

  • Load Date: In the Raw Data Vault, data batches are identified by a load date, which represents when the data was ingested.
  • Snapshot Date: In the Information Mart, data is often presented as snapshots, where each snapshot represents the data at a specific point in time.

Now, the Business Vault sits between the Raw Data Vault and Information Marts. When applying business rules in the BV, there are two major types of granularities to consider:

1. Granularity Based on Incoming Deltas

In this case, business logic is applied to all incoming deltas identified by the load date. For example, cleansing phone numbers is a typical use case where every delta (update) must be processed, even if only the latest version is needed in the end.

The resulting data is stored in a computed Satellite in the Business Vault. The primary key remains the hash key of the parent entity and the load date.

2. Granularity Based on Snapshot Date

Some business logic requires calculations for specific points in time. For example, calculating the lifetime value of a customer:

  • The lifetime value increases when a customer makes a purchase.
  • The lifetime value decreases incrementally if no purchases are made over time.

In this scenario, even when no new delta is coming in, the value must still be recalculated daily. This granularity aligns with the snapshot date, which is already defined in the PIT table. By leveraging the PIT table, you can calculate and store the lifetime value in a computed Satellite with a primary key of the parent hash key and snapshot date.

Reusing PIT Tables

When switching from load date (deltas) to snapshot date (snapshots), PIT tables play a crucial role:

  • PIT tables help join descriptive data from Satellites to provide a snapshot-based view of the data.
  • They allow business rules to be applied to outgoing information granularity (snapshot date).

For example, if you want to calculate a specific measure, such as a customer’s lifetime value, the PIT table provides the granularity needed to compute the values for every day, hour, or minute, depending on your requirements.

Reusing Bridge Tables

Bridge tables can also be reused in Business Vault entities but with one key consideration:

Avoid loading one Bridge Table from another Bridge Table.

Why? Cascading Bridge Tables can lead to sequential dependencies, which hinder parallelization. Parallel processing is essential for performance, especially in high-volume environments. To work around this limitation, use Computed Aggregate Links.

What Is a Computed Aggregate Link?

A Computed Aggregate Link is essentially a Link with pre-computed aggregations. This concept is described in the Data Vault methodology and allows you to reuse aggregations efficiently without chaining Bridge Tables together.

For example, if you want to calculate a new measure based on facts stored in a Bridge Table:

  • Use the Bridge Table as the FROM source for a computed Satellite.
  • Attach the new measure to the Bridge Table as part of the Business Vault entity.

This approach avoids cascading dependencies while allowing you to extend facts or perform complex calculations.

Best Practices Recap for PIT and Bridge Tables

Here are the key takeaways for using PIT and Bridge tables in Business Vault entities:

  • Yes, you can reuse PIT tables: They are commonly used to provide snapshot granularity for computed Satellites.
  • Yes, you can reuse Bridge tables: Use them carefully to avoid cascading dependencies.
  • Use Computed Aggregate Links: When you need to extend a Bridge Table, this is the recommended approach to maintain efficiency and parallelization.
  • Granularity switch: Be mindful of the transition from load date (delta-driven) to snapshot date (snapshot-driven) when applying business logic.

Summary

In summary, PIT and Bridge tables are powerful tools in the Data Vault architecture, especially within the Business Vault. They enable complex business logic, such as snapshot-based calculations, while maintaining efficiency and performance. By adhering to best practices like avoiding cascading Bridge Table loads, you can ensure your implementation remains scalable and robust.

Close Menu