Michael Olschimke – Scalefree

Loading SAP CDC Data into GDPR Compliant Data Vault

Watch the Video

Loading SAP CDC Data into a GDPR-Compliant Data Vault

When managing change data capture (CDC) data from SAP in a Raw Data Vault, special considerations are needed for both CDC information and GDPR-relevant personal data. This post will cover how to model CDC data in a Data Vault, including the unique handling of created, updated, and deleted records. We’ll also discuss best practices for splitting data into separate satellites to manage GDPR-compliant attributes, including empty columns and privacy concerns.

This content is based on a discussion led by Michael Olschimke, CEO of Scalefree, during a Data Vault Q&A session.

In this article:

CDC Data Modeling in the Data Vault
- 1. Load CDC Data in Satellites with Adjusted Patterns
Handling GDPR-Relevant Personal Data
- 2. Splitting Satellites Based on Privacy Classifications
Managing Empty Columns in the Data Vault
Example Satellite Structure
Best Practices for Satellite Splitting in the Data Vault
Conclusion

CDC Data Modeling in the Data Vault

The primary challenge with CDC data is that it only includes changes from SAP, not the full dataset each time. CDC data typically includes metadata on whether a record was created, updated, or deleted in SAP. Here’s a look at how to approach modeling this data in a Data Vault:

1. Load CDC Data in Satellites with Adjusted Patterns

In Data Vault, the data model remains unchanged, regardless of how the data is delivered (batch, CDC, or real-time). However, the loading pattern for CDC data into satellites needs some adjustments:

Delta Check Adjustment: Normally, the Data Vault delta check identifies changes before loading data into a satellite. With CDC data, the changes are already captured, so this step can be bypassed. Instead, all changes from CDC data are loaded into the appropriate satellites directly.
Change Impact Across Satellites: When there’s a change in one attribute, it triggers an update in all relevant satellites. While this approach can create non-delta records, the impact is typically minimal, and the redundant data can be compressed for storage efficiency.

Handling GDPR-Relevant Personal Data

CDC data often includes both regular attributes and GDPR-sensitive personal data. In the Data Vault, personal data attributes should be separated based on privacy and security classes to ensure compliance and manage access. Here’s the recommended approach:

2. Splitting Satellites Based on Privacy Classifications

For GDPR compliance, split CDC data into multiple satellites:

Personal Data Satellite: A dedicated satellite for GDPR-relevant attributes (such as names or IDs). This separation allows for tighter security and privacy control.
Non-Personal Data Satellite: General attributes with no privacy concerns go into a separate satellite to reduce the risk of exposure.
Additional Splits: Further splits may be required based on rate of change, security levels, or business context, depending on the specific needs of your organization.

Maintaining separate satellites for different classes of data ensures that personal information is handled with stricter privacy controls, helping your data architecture comply with GDPR requirements.

Managing Empty Columns in the Data Vault

It’s common for source tables to contain columns that are always empty. When working with CDC data in a Data Vault:

Include Empty Columns for Auditing: To retain full traceability and audibility, include empty columns in the satellite. This preserves the exact structure of the source data without altering it.
Consider Separate “Unused Data” Satellite: If there are many empty columns, these can be grouped into a dedicated satellite, making the primary satellites leaner for users.

This approach allows for flexibility if the data in these columns becomes relevant in the future. Auditors will appreciate the comprehensive structure, and the Data Vault will retain all source data in its original form.

Example Satellite Structure

With GDPR compliance and CDC loading adjustments in mind, here’s an example structure for splitting SAP CDC data into satellites:

Satellite: CDC_Personal_Data
- Attributes: GDPR-relevant data (e.g., personal names, social security numbers)
- Metadata: Load date, source, change type (create, update, delete)
- Purpose: Privacy-controlled access

Satellite: CDC_NonPersonal_Data
- Attributes: Non-personal data columns
- Metadata: Load date, source, change type
- Purpose: General access

Satellite: CDC_Unused_Columns
- Attributes: Columns always empty in the source table
- Metadata: Load date, source
- Purpose: Compliance and future-proofing

Best Practices for Satellite Splitting in the Data Vault

When splitting data into satellites, follow these best practices:

Split by Privacy and Security: Ensure that personal and non-personal data are stored separately, particularly when handling GDPR-relevant information.
Split by Source System: Keep different source systems in separate satellites for clarity and maintainability.
Consider Business Needs: If certain data attributes are only relevant to specific business cases, split them accordingly to reduce satellite complexity.

These principles provide a clean, secure, and compliant Data Vault structure that enables efficient data retrieval, flexibility, and regulatory adherence.

Conclusion

Modeling SAP CDC data in a GDPR-compliant Data Vault involves adjustments to loading patterns, especially when dealing with CDC deltas and GDPR-sensitive data. By separating data based on privacy classes and including empty columns where necessary, you can ensure compliance and maintain a flexible data model. The approach outlined here simplifies the handling of CDC data, while providing robust auditing and privacy control.

Michael Olschimke In Data Vault Friday

Modelling Demographic Data

Watch the Video

Modeling Demographic Data for the Raw Data Vault: A Practical Guide

When working with demographic data for population analysis, one often encounters datasets that include variables such as region, year, age, and sex. For organizations implementing a Data Vault methodology, the task then becomes modeling this data in the Raw Data Vault in a way that enables efficient use in the Business Vault for proportion calculations and other business logic.

In this blog post, we’ll explore how to approach modeling demographic data for the Raw Data Vault. We’ll look at considerations for reference tables, hubs, links, and satellites, and discuss why a simplified approach can be beneficial while still capturing change history when needed. Let’s walk through an example based on a question that was raised during a Data Vault Q&A session.

In this article:

Defining the Dataset and Objective
The Simplified Modeling Approach
- 1. Flatten the Data Structure First
- 2. Introducing Reference Hubs and Satellites for Change Tracking
Modeling Strategy and Design
Why Avoiding Multiple Hubs and Links Makes Sense
Advantages of a Single Reference Hub with a Satellite
Using the Model in Business Logic
Potential Challenges and Solutions
Conclusion

Defining the Dataset and Objective

Let’s start by outlining the dataset and the main objective. Suppose we have a demographic dataset sourced externally (e.g., from a national bureau) that includes the following columns:

Region
Year
Age
Sex
Population count per region, year, age, and sex

The goal is to load this dataset into the Raw Data Vault and make it available for calculations in the Business Vault, such as determining population proportions across different dimensions.

The Simplified Modeling Approach

To understand the modeling approach, let’s consider the basic elements of Data Vault architecture:

Hubs: Tables that hold unique business keys and serve as identifiers.
Links: Tables that define relationships between hubs.
Satellites: Tables that hold descriptive data and track changes over time.

Since this dataset contains demographic attributes with no true business keys, modeling choices hinge on balancing simplification and change capture. Here’s a step-by-step breakdown of the approach:

1. Flatten the Data Structure First

The simplest model for this dataset would be to create a flat, wide table that includes columns for region, year, age, sex, and population value. This structure would load the data directly without further separation into hubs, links, or satellites.

Pros: Easy to query and manage, especially if there’s no need to track changes over time.
Cons: Lacks support for tracking updates or changes in population values.

This approach works well if the data is static and updated infrequently, but it limits flexibility for versioning or incremental updates.

2. Introducing Reference Hubs and Satellites for Change Tracking

To address the limitations of a flat table, we can take advantage of reference hubs and satellites. Reference hubs allow us to treat region, year, age, and sex as reference codes. In a reference hub, each unique combination of these codes is treated as a single, composite key.

The structure then includes:

A Reference Hub with region, year, age, and sex as identifying attributes.
A Reference Satellite with population counts as the descriptive attribute, which links back to the Reference Hub.

By introducing these reference structures, we ensure that changes in population values can be captured over time. If a new population record comes in with a different population value for a given region, year, age, and sex, it is added to the Reference Satellite as a new row. This provides a change history without needing to modify the original row.

Modeling Strategy and Design

With a focus on simplification, here’s how the structure would look in the Raw Data Vault:

Reference Hub: Combines the codes for region, year, age, and sex. This setup eliminates the need for separate hubs for each attribute and keeps the model straightforward. The composite key formed by region, year, age, and sex uniquely identifies each record, while the hub serves as a central reference.
Reference Satellite: Attached to the Reference Hub, the Reference Satellite holds the population value and includes a load date for tracking when data was loaded. This enables us to capture historical changes efficiently. If an update occurs for the population value, a new row is added to the Satellite with an updated load date, creating a versioned history.

Example Structure:

Reference Hub: Demographic_Hub
- Primary Key: Region, Year, Age, Sex
- Additional Columns: Load Date (ldts), Source (src)

Reference Satellite: Population_Satellite
- Foreign Key: Region, Year, Age, Sex (composite key from Hub)
- Population Value
- Load Date (ldts)
- Source (src)

In this setup, all demographic attributes are contained within a single hub and linked to the population value in the satellite. This design maintains a simple and effective data model that supports historical data changes without additional complexity.

Why Avoiding Multiple Hubs and Links Makes Sense

In theory, one could create separate hubs for each demographic attribute—region, year, age, and sex—and link them together. However, this approach introduces unnecessary complexity without adding value in this context. Here’s why:

Multiple Joins: Multiple hubs and a link table require additional joins, increasing complexity when querying the data.
Increased Entity Count: Additional hubs and links inflate the number of entities, making the data model harder to understand and maintain.
Performance Concerns: Each join adds processing cost, which can slow down queries, especially with larger datasets.

By consolidating all demographic attributes into a single hub, we reduce the number of entities and simplify the model, making it easier to use and maintain while still meeting the business needs.

Advantages of a Single Reference Hub with a Satellite

This approach is particularly advantageous because it balances simplicity with flexibility. Here’s how:

Change Capture: With the reference satellite, we can track historical population data changes over time. Each new row represents an update, identified by load date, making it easy to see when population data was updated.
Granularity and Scalability: The single hub provides a consistent granularity for the data, ensuring that queries are straightforward and changes are easy to manage.
Performance Efficiency: Joining based on region, year, age, and sex is computationally efficient. These attributes are typically small (e.g., integer or small character values), reducing the burden on processing.

Using the Model in Business Logic

Once the demographic data is loaded into the Raw Data Vault, it’s ready for use in the Business Vault. Here’s how it could be used for business logic:

Joining Data: In the Business Vault, data analysts can join other datasets with the demographic reference satellite on the region, year, age, and sex attributes to incorporate population data.
Proportion Calculations: With population counts available, proportion calculations (e.g., the proportion of a certain demographic group in a region) become straightforward.
Temporal Analysis: The load date in the satellite provides historical tracking, allowing analysts to view demographic changes over time and analyze trends.

Potential Challenges and Solutions

One common concern is that joining on multiple columns (region, year, age, sex) could impact performance. However, with optimized indexing and the small size of these columns, this concern is minimized. For larger datasets, partitioning on region or year might further optimize query performance.

Conclusion

In summary, a simplified model that leverages a single reference hub with a satellite provides a highly effective way to model demographic data in a Raw Data Vault. This approach allows for straightforward use in the Business Vault and ensures flexibility for tracking historical changes, all while keeping the model manageable and efficient.

This example illustrates how, by focusing on simplification and change capture, organizations can build an effective demographic data model in the Raw Data Vault that meets both current and future needs for population-based analysis.

Michael Olschimke In Data Vault Friday

Data Vault ROI

Watch the Video

Calculating ROI for Data Warehouse and Data Vault Investments

Investing in a data warehouse or Data Vault is a strategic decision with the potential to transform how organizations leverage their data. However, quantifying the return on investment (ROI) for such projects can be challenging, as it often involves intangible benefits alongside measurable cost savings. This article delves into the key considerations for evaluating ROI in data warehousing and Data Vault initiatives.

In this article:

Understanding the Dual Nature of Benefits
Evaluating ROI: A Two-Pronged Approach
- 1. Quantifying Tangible Benefits
- 2. Assessing Intangible Benefits
Conclusion

Understanding the Dual Nature of Benefits

The benefits of data warehouse and Data Vault investments can be broadly categorized into two types:

Tangible Benefits: These are easily measurable outcomes, such as increased revenue, reduced costs, and improved efficiency. For example, a data warehouse project might lead to a significant increase in sales due to enhanced customer personalization or a reduction in storage costs due to optimized data management.
Intangible Benefits: These are less quantifiable but equally valuable outcomes, such as improved decision-making, enhanced customer satisfaction, and faster time to market. While these benefits are harder to measure directly, they play a crucial role in driving long-term value for the organization.

Evaluating ROI: A Two-Pronged Approach

To effectively assess ROI, consider both the tangible and intangible aspects of your data warehouse or Data Vault investment.

1. Quantifying Tangible Benefits

Cost Savings: Identify areas where your data warehouse is reducing costs. This could include savings in data storage, data integration, or operational expenses due to automation.
Revenue Generation: Explore how your data warehouse is contributing to revenue growth. This might involve improved customer service, new product development, or optimized pricing strategies.
Efficiency Gains: Measure how your data warehouse has streamlined processes and reduced manual data handling, leading to time and resource savings.

2. Assessing Intangible Benefits

Cost of Inaction: Flip the question and ask yourself, “What are the costs of not having a data warehouse?” This can help quantify the value of risk mitigation, improved decision-making, and faster response to market changes.
Pain Point Valuation: Assign a monetary value to the pain points your data warehouse addresses. If your investment solves a customer satisfaction issue or a delivery time problem, estimate the financial impact of these improvements.
Benchmarking: Compare your performance against industry benchmarks to gauge how your data warehouse is helping you achieve a competitive advantage.

Conclusion

Evaluating ROI for data warehouse and Data Vault projects requires a holistic approach that considers both tangible and intangible benefits. By carefully analyzing cost savings, revenue generation, efficiency gains, and the value of addressing pain points, you can build a comprehensive picture of the return on your investment. Remember that the true value of a data warehouse extends far beyond monetary gains, encompassing strategic advantages that drive long-term business success.

Michael Olschimke In Data Vault Friday

Multiple Business Keys in One Source Column

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO, Michael Olschimke, engages with a relevant and practical question from our audience about multiple business keys in one source column.

Data Vault Business Keys

In the realm of Customer Relationship Management (CRM) and Master Data Management (MDM), integrating data from diverse systems is a common challenge. This is particularly true when dealing with external identifiers, such as social security numbers, VAT IDs, or passport numbers, which can also serve as business keys in other systems. This article explores how to effectively model such scenarios within a Data Vault framework to streamline data integration and data analysis.

In this article:

Understanding the Challenge
Modeling Strategies
Data-Driven Modeling
Hub It Out Pattern
Considerations
Conclusion

Understanding the Challenge

When a CRM system doubles as an MDM platform, it often houses an “External Identifiers” entity. This entity stores relationships between customers and various external systems, encompassing both external identifiers (like social security numbers) and internal IDs (such as customer IDs in core systems).

The complexity arises when some of these identifiers also function as business keys in other systems, each with its own Data Vault hub. The goal is to combine customer-related data from different domains while maintaining the integrity of these relationships.

Modeling Strategies

Multi-Active Satellite: One approach involves modeling the “External Identifiers” entity as a multi-active satellite attached to the customer hub. This approach accommodates multiple keys or external identifiers linked to a single customer. By including the ID type (e.g., VAT, SSN) in the satellite’s descriptive data, you can distinguish between different identifier types within the group.
Joining Data: If a separate hub exists for a specific business key (e.g., VAT), you can directly join the data from the MDM system’s satellite to the corresponding hub. This approach facilitates data integration and enables easier queries.
Business Vault Links: To optimize join performance, especially with complex business keys or multi-column identifiers, you can create exploration or business links in the business vault. These links implement conditional logic, establishing connections between the customer hub and the relevant business key hubs based on the ID type.

Data-Driven Modeling

A data-driven modeling approach is essential in these scenarios. Start by capturing raw data from the source systems without applying business logic. In the raw Data Vault, treat business keys as descriptive fields within the satellite. Subsequently, in the business vault, you can implement the necessary business logic through links and relationships to integrate data from different domains effectively.

Hub It Out Pattern

The Hub It Out pattern, often used in refactoring, can also be applied here. If a new data set with descriptive data for a specific business key (e.g., corporate car VIN numbers) becomes available, you can extract those values from the existing satellite and create a new hub. Then, establish links between the customer hub and the new hub based on the existing relationships.

Considerations

Hash Keys: Consider using hash keys for improved join performance, especially when dealing with complex or variable-length business keys.
Data Virtualization: Where possible, virtualize data downstream from the raw Data Vault satellite to simplify the deletion of personal data.

Conclusion

In conclusion, integrating CRM and MDM data involving external identifiers that double as business keys requires a thoughtful modeling approach within the Data Vault framework. By leveraging multi-active satellites, joining data, creating business vault links, and adhering to a data-driven modeling philosophy, you can efficiently combine customer-related data from disparate domains.

Remember that the goal is to create a flexible and adaptable data model that caters to the evolving needs of your business. By employing these strategies and considering the specific requirements of your environment, you can unlock the full potential of your CRM and MDM systems, enabling seamless data integration and enhanced analytical capabilities.

Michael Olschimke In Data Vault Friday

Virtualized Load End Date and the SQL Optimizer

Watch the Video

Virtualized Load End Date in Data Vault

In the world of data warehousing, optimizing query performance is crucial, especially in complex data models like Data Vault 2.0. One common challenge is dealing with virtualized load end dates in reference tables, which can hinder join elimination and impact query execution times. In this article, we delve into this issue and explore potential solutions to enhance query performance.

In this article:

Understanding the Problem
Solution Approaches
Additional Tips
Conclusion

Understanding the Problem

The scenario involves joining dictionaries to satellite tables based on code attributes, where reference tables lack a point-in-time (PIT) table. This necessitates joining reference satellite entries, using a virtualized load end date, which prevents join elimination.

The problem lies in the complex join condition arising from the virtualized load end date calculation. This complex condition prevents the SQL optimizer from utilizing indexes effectively, leading to performance bottlenecks.

Solution Approaches

1. Snapshot Satellites

One solution is to implement snapshot satellites in the business vault. These satellites use the snapshot date as the timeline in their primary key, aligning with the granularity of outgoing information. This approach is efficient for handling different granularities in incoming and outgoing data and simplifies business logic implementation.

2. Materializing Reference Tables

Another option is to materialize the reference tables. While this can help with query performance, it introduces challenges when dealing with personal data, as deleting such data becomes more complex.

3. Consolidated Reference Table

Consider using a consolidated reference table to capture reference data for different domains with similar structures. This simplifies the data model and potentially keeps the table in memory, reducing disk access.

4. Extending the PIT Table

You could extend the PIT table of the relevant hub or dimension to include reference data. However, this increases redundancy and may not be suitable if numerous reference tables or attributes are involved.

5. Materializing Snapshot Satellite

Materializing the snapshot satellite is another alternative, especially if reference tables do not contain personal data. This simplifies data deletion when necessary.

6. Virtualization

If possible, consider virtualizing downstream data from the raw data vault satellite. This eliminates the need to delete personal data in materialized views, simplifying data management.

Additional Tips

Limit reference satellites: In scenarios where only the latest snapshot is needed, limiting reference satellites to type 1 dimensions can simplify maintenance, although it restricts the availability of historical data.
Partitioning: Partitioning reference and PIT tables by relevant codes or dates can improve query performance and storage management.

Conclusion

Optimizing query performance in Data Vault 2.0 requires careful consideration of various factors, especially when dealing with virtualized load end dates in reference tables. The solutions discussed in this article offer different approaches to tackle this challenge, each with its pros and cons. Choosing the most suitable approach depends on the specific requirements of your data warehouse environment.

By implementing these solutions and following the additional tips, you can enhance query performance, improve data management, and ensure the efficient delivery of information in your Data Vault 2.0 environment.

Michael Olschimke In Data Vault Friday

Microsoft SQL Server Join Elimination and Data Vault PIT Tables

Watch the Video

In this week’s Data Vault Friday, Michael Olschimke, our CEO, addresses a thought-provoking question about join elimination in SQL Server, specifically concerning PIT (Point-in-Time) tables in Data Vault. Let’s explore this topic in depth.

In this article:

Understanding PIT Tables and Their Role in Data Vault
- The Query at Hand
The Join Elimination Dilemma
Tailoring Information Marts for Performance
Conclusion

Understanding PIT Tables and Their Role in Data Vault

In Data Vault 2.0, PIT tables are crucial for efficient data retrieval. They are designed to reference the most recent changes (deltas) in satellite tables linked to a hub or link. This setup is instrumental in creating dimension tables that reflect a snapshot of your data at a specific point in time.

The Query at Hand

The question we received revolves around the challenge of optimizing queries using PIT tables. The scenario is familiar: you’ve built dimensions from PIT tables, which join to satellite entries valid at the snapshot time. However, when users query only a few attributes, unnecessary joins to multiple satellites can significantly slow down performance. The goal is to ensure join elimination works effectively, allowing the SQL optimizer to bypass unnecessary joins and thus speed up query execution.

The Join Elimination Dilemma

Join elimination is a technique where the SQL optimizer skips joins that aren’t needed for the final result. This is particularly useful in data warehousing scenarios where large dimension tables are involved. In SQL Server, achieving efficient join elimination, especially with PIT tables, involves several considerations:

Inner Joins vs. Left Joins

Inner joins are commonly used but can be problematic because SQL Server requires foreign keys between tables for optimal join elimination.
Left joins, on the other hand, can sometimes make join elimination easier for the optimizer, as they are less restrictive.

Foreign Keys

For SQL Server to perform join elimination efficiently, there must be a foreign key relationship between the PIT table and the satellite tables.
This foreign key doesn’t have to be active but must exist to inform the optimizer about the relationships.

Column Considerations

Ideally, the foreign key should be based on a single column. In Data Vault, where primary keys often consist of composite keys (e.g., hash key and load date), this can complicate matters.

Practical Steps for Efficient Join Elimination

Here are some practical steps to ensure join elimination works effectively in SQL Server:

Use Left Joins

By using left joins instead of inner joins, you can simplify the optimizer’s task. In many cases, left joins with equi join conditions perform as efficiently as inner joins.

Implement Foreign Keys

Ensure that foreign keys exist between your PIT tables and satellite tables. Even if these keys are inactive, they provide the necessary metadata for the optimizer.

Simplify Foreign Key Structures

Where possible, simplify the foreign key structures to single columns. This can help the optimizer in performing join elimination more efficiently.

Testing and Verification

Given the complexity of join elimination, thorough testing is essential. Ensure that your SQL Server setup supports the required features and verify the execution plans to confirm that unnecessary joins are indeed being eliminated. Tools like SQL Server Profiler and Execution Plan Viewer can be invaluable for this purpose.

Tailoring Information Marts for Performance

Another critical aspect is the design of your information marts. Traditional data warehousing often involved creating large, comprehensive information marts. However, in the Data Vault paradigm, it’s more efficient to create smaller, purpose-specific marts. Here’s why:

Performance Optimization

Smaller marts tailored to specific reports or queries minimize the amount of data processed, thus speeding up query execution.

Virtualization

By virtualizing information marts in SQL Server, you can dynamically assemble the necessary data without physically storing it, reducing storage requirements and increasing flexibility.

Conformed Dimensions

When dimensions are based on the same PIT tables, they are inherently conformed, allowing for seamless joins across different information marts.

Conclusion

Join elimination is a powerful technique for optimizing queries in SQL Server, especially when working with Data Vault 2.0 and PIT tables. By using left joins, implementing foreign keys, and simplifying key structures, you can enhance the performance of your data retrieval processes. Additionally, designing smaller, purpose-specific information marts tailored to specific use cases can further boost efficiency.

If you have more questions or need further clarification, feel free to use the form at https://sfr.ee/DVFriday to submit your queries. You can also join our monthly webinars on WhereScape and dbt.

Michael Olschimke In Data Vault Friday

Deriving Dimensions from Reference Data

Watch the Video

Reference Data

Welcome to another episode of Data Vault Friday! I’m Michael Olschimke, CEO of Scalefree. Today’s question comes from our online form, and it’s about deriving dimensions from reference data in the raw Data Vault. Specifically, the questioner has several lookup reference tables that they add as Hub or reference tables while creating the raw Data Vault. For example, they have a region table that includes a region code, description, language code, and valid from/to dates.

In this article:

Understanding Degenerate Dimensions
Building the Model
Creating a Degenerate Dimension
Handling Time-Based Data
Performance Considerations
Conformed Dimensions
Implementation Steps
Conclusion

Understanding Degenerate Dimensions

The query mentions a “degenerated dimension.” To clarify, a degenerate dimension is a dimension attribute, such as the region code, included in a fact table without any additional descriptions. This attribute exists within the fact table itself and doesn’t have a separate dimension table.

Building the Model

To illustrate this, let’s start with a basic structure. Imagine you have a non-historized link containing transaction data, and it references Hubs such as Hub Customer and Hub Product. Additionally, you have reference tables, like a region reference table with a region code and associated descriptions. Here’s a simplified model:

Non-historized Link: Contains transaction data.
Hub Customer and Hub Product: Reference customer and product data.
Reference Hub for Region: Contains the region code.
Reference Satellite for Region: Contains the descriptions, language codes, and valid dates.

This setup allows for capturing changes in reference data, making the model auditable and maintaining historical accuracy.

Creating a Degenerate Dimension

To create a degenerate dimension from the reference data, follow these steps:

Include the Code in the Fact Table: Add the region code directly into your transaction data (the non-historized link).
Determine the Required Attributes: Decide if you only need the region code or additional attributes like the region name.
Create a Fact View: If you only need the region code, simply create a fact view that includes this code.
Prejoin Additional Attributes: If you need additional attributes, prejoin the reference Hub and Satellite to get the region name or other details based on the timeline of your facts.

Handling Time-Based Data

When dealing with time-based data, it’s essential to identify the correct version of your reference data. If you want the latest description of the region (a Type 1 dimension), you can join the latest entry. For a Type 2 dimension (tracking changes over time), join based on the fact timestamp to match the correct version of the region name.

Performance Considerations

Reference tables typically contain a relatively small amount of data, which allows most joins to be efficient. However, if performance becomes an issue, you can consider creating a Point-in-Time (PIT) table in the Business Vault. This table can precompute the current description for each region on a daily basis, making joins faster and more efficient.

Conformed Dimensions

If you prefer to use a conformed dimension, convert your reference table into a dimension table. Use the primary key of the reference table (e.g., region code) as the dimension identifier. This approach involves joining the reference Hub and Satellite to create a dimension view that can be used in your fact tables.

Implementation Steps

Turn Reference Table into Dimension: Join the reference Hub and Satellite to create a dimension view.
Use Reference Code as Dimension Key: The region code becomes the dimension key.
Create Fact View: Include the dimension key in your fact view and join the necessary attributes from the dimension view.
Configure in Dashboard: Set up relationships between your facts and dimensions in your dashboard application for seamless data visualization.

Conclusion

In summary, deriving dimensions from reference data in a Data Vault involves understanding your needs for degenerate or conformed dimensions, handling time-based data appropriately, and ensuring efficient joins. By following these steps, you can create a robust and scalable data model that meets your analytical needs.

Thank you for joining us for this Data Vault Friday session. If you have more questions, submit them at sfr.ee/dvfriday. For additional learning, check out our webinars at Scalefree.to/webinars. If you need answers before next Friday, visit the Data Vault Innovators community we set up with Ignition Data.

Until next time, keep those data questions coming, and we’ll see you next Friday!

Michael Olschimke In Data Vault Friday

Still Struggling with GDPR?

Watch the Video

Navigating GDPR Compliance in Data Warehousing

In today’s digital age, GDPR compliance is a crucial aspect for any organization dealing with personal data. With the rise of data warehousing and advanced modeling solutions like Data Vault 2.0 (DV 2.0), questions often arise about how to handle Personally Identifiable Information (PII) within these frameworks. This article addresses some common concerns and provides practical recommendations for ensuring GDPR compliance in data warehouses.

In this article:

Understanding the Challenge
Question #1: Managing Activity History with PII Business Keys
- The Problem
- The Solution
Question #2: Using Hashed Business Keys in Link Tables
- The Problem
- The Solution
Question #3: Originating and Managing Artificial Keys
Ensuring Compliance and Security
Conclusion

Understanding the Challenge

GDPR mandates that personal data must be handled with the utmost care, ensuring individuals’ privacy and security. In the context of data warehousing, this often translates to managing business keys that might contain PII. Let’s dive into some specific questions raised around this topic:

How should activity history be managed when the main hub contains a PII business key?
Is it best practice to use hashed business keys in link tables to improve load performance?
Should artificial keys originate from each business domain, and how should they be managed if not?

Question #1: Managing Activity History with PII Business Keys

The Problem

In a typical data warehouse model, customer records might include PII, such as social security numbers or tax IDs. According to GDPR, it’s crucial that activity history is not traceable back to the individual once they exercise their right to be forgotten.

The Solution

One effective approach is to split descriptive attributes into different satellites—one for personal data and another for non-personal data. This way, when a deletion request is made, only the personal satellite needs to be purged. The non-personal satellite can retain anonymized data, maintaining the integrity of the dataset while ensuring compliance.

Question #2: Using Hashed Business Keys in Link Tables

The Problem

Hashing business keys is often recommended in DV 2.0 to improve load performance. However, directly using business keys in link tables can pose a challenge, especially when those keys contain PII.

The Solution

In DV 2.0, it’s a standard practice to use hashed values of business key components rather than the business keys themselves. This approach ensures better performance and security. Here’s how it works:

Hash the Business Key: Use a cryptographic hash function (e.g., SHA-256) to convert the business key into a hashed key.
Use Hashed Keys in Link Tables: The hashed key then serves as the foreign key in link tables, ensuring that PII is not directly exposed.

Question #3: Originating and Managing Artificial Keys

The Problem

There’s a debate on whether artificial keys should be generated within each business domain or within the data warehouse itself. This raises concerns about consistency and management, especially if the artificial key must be derived from PII.

The Solution

Artificial keys should ideally be generated within the data warehouse to maintain consistency and control. Here’s the process:

Generate a UUID: Use a universally unique identifier (UUID) for the artificial key. This ensures randomness and reduces the risk of duplication.
Link Artificial Keys to Business Keys: Establish a relationship between the artificial key and the business key within the data warehouse, ensuring that the artificial key is never exposed in operational systems.

Handling Scenarios Without Artificial Keys

If generating artificial keys within the data warehouse is not feasible, the data warehouse should still generate these keys upon ingestion. This method ensures that all keys are managed consistently and securely.

Ensuring Compliance and Security

Satellite Splitting

By splitting satellites into personal and non-personal data, organizations can easily manage deletion requests without compromising data integrity.

Cryptographic Hashing

Utilizing cryptographic hashing for business keys in link tables enhances both security and performance, crucial for maintaining GDPR compliance.

Artificial Keys Management

Generating artificial keys within the data warehouse ensures consistency and security, reducing the risk of PII exposure.

Regular Audits and Legal Consultation

Regular audits and consultations with legal experts ensure ongoing compliance with GDPR and other regulations. Implementing these practices helps organizations stay ahead of potential compliance issues.

Conclusion

Handling PII in data warehouses requires careful planning and robust solutions. By implementing satellite splitting, cryptographic hashing, and consistent artificial key management, organizations can ensure GDPR compliance while maintaining data integrity and performance. Regular audits and legal advice further bolster these practices, ensuring that data handling processes remain secure and compliant with evolving regulations.

Michael Olschimke In Data Vault Friday

Loading Technical Counter-Transactions

Watch the Video

Managing Data Vault Performance with Incremental Changes and Deletions

In the world of data warehousing, the Data Vault methodology has emerged as a robust and scalable solution for managing vast amounts of data. However, one common concern among practitioners is how to efficiently handle incremental changes and deletions, particularly when dealing with structures containing billions of rows. This article aims to elucidate the process, focusing on the questions around loading structures, performance considerations, and practical strategies for maintaining efficiency.

In this article:

Understanding the Basics: Tracking Changes and Deletions
Loading Structures: The Practical Approach
- Using CDC (Change Data Capture)
- Full Load vs. Incremental Load
Performance Considerations
Practical Example: Incremental Loading without CDC
Conclusion

Understanding the Basics: Tracking Changes and Deletions

The core principle of Data Vault involves capturing all changes and deletions incrementally. This ensures that the data warehouse remains an accurate historical record of the enterprise’s data. Here’s a simplified illustration of how this can be achieved:

Initial Load: When a new transaction is recorded, it is inserted into the Data Vault as a new record. For instance, if customer A purchases product C at store B on day one, this transaction is recorded with a value of €7.
Handling Updates: If the value of the transaction changes from €7 to €5 on day two, instead of updating the existing record, two new records are created: one to nullify the original transaction (-€7) and another to represent the new transaction (€5).
Dealing with Deletions: If a transaction is deleted, it is handled similarly by inserting a record that nullifies the original transaction.

This method ensures that the Data Vault remains immutable, as records are never directly altered once inserted. Instead, changes are tracked by adding new records, which simplifies loading processes and maintains data integrity.

Loading Structures: The Practical Approach

Loading structures in Data Vault can be challenging, especially when dealing with large datasets. Here are some practical strategies:

Using CDC (Change Data Capture)

If the source system supports CDC, this is the most straightforward method:

Insert New Records: Directly insert new records into the target system.
Handle Updates and Deletes: For updates and deletes, insert the corresponding counter transactions.

CDC provides a clear and efficient way to track changes and deletions, significantly simplifying the loading process.

Full Load vs. Incremental Load

In scenarios where full loads are used (though rare for very large datasets), the process involves:

Identifying New Records: Select records from the staging area that do not exist in the target and insert them with a counter of one.
Identifying Deletions: Select records from the target that do not exist in the staging area and insert counter transactions to nullify them.

While full loads can be intensive, they can be managed effectively by optimizing the identification of new and deleted records.

Performance Considerations

Handling billions of rows requires careful planning to avoid performance bottlenecks. Here are some strategies to mitigate performance issues:

Parallel Processing

By running multiple processes in parallel, you can significantly speed up the loading process. For example, separate processes can handle inserts and counter transactions concurrently.

Hash Keys and Indexes

Using hash keys and indexes efficiently can reduce the time needed to check for existing records. Ensure that your hash keys include all relevant business keys and transaction IDs to maintain uniqueness.

High-Water Marks and System Indicators

Some systems, like Oracle, offer features like SCN (System Change Number) or row versions that can help identify modified records. Using these indicators can reduce the amount of data processed by focusing only on recently changed records.

Practical Example: Incremental Loading without CDC

In cases where CDC is not available, you can still achieve efficient incremental loading:

Incremental Updates from Source: If the source system provides daily increments (inserted and updated records), use this data to update the target.
Handling Deletions: For deleted records, you might need an additional table or mechanism to track deletions. If such a table is available, use it to insert counter transactions.
Full Load Approach: If only full loads are available, implement a two-step process to identify and handle new, updated, and deleted records.

Conclusion

Managing incremental changes and deletions in Data Vault structures, especially for large datasets, requires a combination of strategies tailored to the specific capabilities of your source systems. Whether using CDC, full loads, or incremental updates, the goal remains the same: to maintain an accurate and efficient data warehouse. By understanding the principles and applying practical solutions, you can handle the complexities of Data Vault performance effectively.

Remember, the key to success lies in thorough planning, efficient use of system capabilities, and continuous optimization of your data loading processes. By following these guidelines, you can ensure that your Data Vault implementation scales efficiently, even as your data volumes grow.

Michael Olschimke In Data Vault Friday

Data Vault Hashing or Not?

Watch the Video

Exploring Data Vault 2.0: Managing Hashing Costs in Smaller Environments

In the evolving landscape of data management, Data Vault 2.0 stands out as a robust methodology designed for scalability, flexibility, and consistency across diverse technological environments. A crucial component of Data Vault 2.0 is the use of hashing for business keys (BKs) and hash diffs. Hashing ensures data integrity and efficiency, especially in distributed systems. However, the performance costs associated with hashing can sometimes become a significant concern. This blog post delves into the nuances of hashing in Data Vault 2.0, the trade-offs involved, and when it might be feasible to deviate from the standard approach.

In this article:

The Role of Hashing in Data Vault 2.0
Challenges of Hashing
Evaluating Hashing Alternatives
The Case Against Sequences
Hash Keys vs. Business Keys
- Hash Keys
- Business Keys
Performance Optimization Strategies for Hashing
Future Trends and Recommendations
Conclusion

The Role of Hashing in Data Vault 2.0

Data Vault 2.0 leverages hashing to create unique, consistent identifiers for business keys and to detect changes in data efficiently. This method is technologically agnostic, meaning it can be implemented across various databases and data platforms, whether on-premises or in the cloud. The primary advantages of hashing include:

Consistency Across Systems: Hashing ensures that business keys are consistent and unique across different systems and regions.
Improved Query Performance: Pre-calculating hash diffs can make query execution faster and more efficient, transferring the computational load from query time to data loading time.
Simplified Data Integration: Hash keys provide a straightforward way to manage and integrate data from multiple sources, reducing the complexity of data joins.

Challenges of Hashing

Despite its benefits, hashing can introduce performance challenges, particularly in the following scenarios:

Wide Tables: Calculating hash diffs for tables with a large number of columns can be computationally intensive.
Complex Hash Functions: Ensuring that hash functions generate unique strings can be complex and resource-heavy.
Hardware Limitations: On-premises environments with limited hardware capabilities might struggle with the additional computational load required for hashing.

Evaluating Hashing Alternatives

When faced with performance concerns, particularly in smaller, local solutions, it’s essential to consider whether deviating from the standard hashing approach would be beneficial. There are three primary options to consider:

Hash Keys: The default and recommended option for most environments, especially those involving distributed systems or diverse technologies.
Sequences: A legacy approach from Data Vault 1.0 that uses sequential numbers as identifiers.
Business Keys: Using the original business keys directly as identifiers.

The Case Against Sequences

Sequences, although a viable option, are generally not recommended in modern Data Vault implementations due to several drawbacks:

Lookup Overhead: Sequences require lookups during data loading, which can slow down the process significantly.
Orchestration Complexity: Managing sequences adds complexity to the loading process, particularly in real-time scenarios.
Distributed System Challenges: Sequences do not perform well in distributed environments where parts of the solution might reside in different locations (e.g., cloud and on-premises).

Hash Keys vs. Business Keys

When deciding between hash keys and business keys, the choice largely depends on the specific technology stack and the environment. Here are some considerations:

Hash Keys

Pros: Provide a consistent, fixed-length identifier that simplifies joins and queries across various systems. They are particularly beneficial in mixed environments.
Cons: Slightly higher computational cost during data loading compared to sequences. However, the consistent performance across queries often outweighs this drawback.

Business Keys

Pros: Directly using business keys can simplify the architecture in environments where the data platform supports efficient handling of these keys.
Cons: Can lead to complex and less efficient joins, especially in mixed or distributed environments.

Performance Optimization Strategies for Hashing

For environments where hashing performance is a concern, several optimization strategies can be employed:

Leverage Hardware Acceleration: On-premises environments can benefit from hardware acceleration, such as PCIe express cards with crypto chips, to offload hash computation from the CPU.
Utilize Optimized Libraries: Many platforms use highly optimized libraries (e.g., OpenSSL) for hash computations, which can significantly improve performance.
Incremental Loads: Ensure that performance evaluations consider multiple load cycles to capture the benefits of hash diffs during delta checks, not just initial loads.

Future Trends and Recommendations

Looking forward, the evolution of data platforms and technologies might shift the balance towards using business keys more frequently. As Massively Parallel Processing (MPP) databases become more prevalent, their native support for efficient key management could make business keys a more attractive option. However, until such technologies are ubiquitous, the default recommendation remains to use hash keys for their broad compatibility and consistent performance.

Conclusion

Data Vault 2.0’s approach to hashing business keys and hash diffs provides significant advantages in terms of consistency, scalability, and performance. While the performance costs of hashing can be a concern, particularly in smaller environments with limited hardware, careful consideration of the available options and optimization strategies can mitigate these issues. Ultimately, the decision should be guided by the specific technological context and future-proofing considerations.

For most scenarios, hash keys remain the recommended approach due to their versatility and robustness in mixed and distributed environments. However, as technology evolves, the use of business keys might become more feasible, highlighting the importance of staying informed about the latest trends and advancements in data management.

Michael Olschimke In Data Vault Friday

Multi Active Satellites on Links

Watch the Video

In our ongoing series, our CEO Michael Olschimke addresses a complex question from the audience regarding the use of Multi Active Satellites (MAS) on Links within a Data Vault 2.0 model. This topic touches on advanced aspects of data modeling, particularly in the context of handling multiple active records.

The question posed was, “Can the Multi Active Satellites be used on LINKs too (considering that on Link we have the option of using the child dependent key)? Please ignore the fact that the link doesn’t have a Hash column on all HUB keys.” Michael’s response delves into the practical application of MAS on Links, an area that can greatly enhance the flexibility and scalability of data models. He explains that while traditionally Multi Active Satellites are used with Hubs to track multiple active records, their application on Links is feasible and beneficial. By leveraging the child dependent key, it is possible to maintain multiple active relationships between entities, which is particularly useful in scenarios where relationships are dynamic and subject to frequent changes.

Drawing on his 15 years of experience in Information Technology, with a focus on Business Intelligence over the past eight years, Michael offers a nuanced perspective on this topic. He highlights that while the absence of a hash column on all HUB keys might pose a challenge, it can be mitigated through careful design and implementation strategies. By ensuring that each Link is adequately documented and structured, organizations can effectively use MAS to capture the complexity of real-world relationships without sacrificing data integrity or performance.

In conclusion, Michael emphasizes the importance of flexibility and adaptability in data modeling. Implementing Multi-Active Satellites on Links can provide significant advantages in managing complex data relationships, allowing for more granular and accurate data analysis. This approach aligns with best practices in Data Vault 2.0 and supports the goal of creating robust, scalable, and responsive data architectures. Michael encourages practitioners to challenge conventional boundaries and explore innovative solutions to meet their unique data management needs.

Michael Olschimke In Data Vault Friday

Modelling Exchange Rates

Watch the Video

In our ongoing series, our CEO Michael Olschimke addresses a question from the audience about modelling daily exchange rates within the Data Vault framework for a non-banking industry. The query highlights a common challenge faced by many organizations: integrating and managing exchange rate data effectively.

The question posed was, “How would you model daily exchange rates in Data Vault 2.0 for a non-banking industry? We are already using a reference table for the list of currencies (I guess we would have currency as a hub in the banking industry, but that is not our case). Now we also need daily exchange rates for currency conversions in the datamart layer. I would start with a Link for exchange rates, but do we need to create a hub for currencies? How about existing references to currency in the existing model (currently in SAT, because we have currency as a reference table)?”

Michael’s response delves into the intricacies of data modelling in such scenarios. He suggests that even though your industry is non-banking, establishing a structured and scalable way to manage exchange rate data is crucial. Using a Link for exchange rates is a good starting point, but creating a hub for currencies could provide additional benefits. This hub would act as a central repository for all currency-related information, ensuring consistency and ease of access across different layers of the data architecture. Additionally, integrating existing references to currencies within the model can streamline operations and enhance the accuracy of financial data analytics.

In conclusion, Michael emphasizes the importance of a well-thought-out data architecture. By creating a dedicated hub for currencies and effectively linking exchange rate data, organizations can ensure more accurate and efficient currency conversions in their datamart layer. This approach not only aligns with best practices in data vault modeling but also supports the broader goal of maintaining data integrity and usability across the enterprise.