
Unify Your Understanding of Data with the dbt Semantic Layer


The bigger a corporation gets, the more data becomes available, and more and more users want to work with that data. In a traditional data warehouse (DWH) environment, the DWH typically provides a consumption layer consisting of various information marts, which are then loaded into multiple business intelligence (BI) tools. There, business users transform and aggregate the data to calculate KPIs and, finally, make business decisions.

This, as it turns out, is easier said than done. To derive KPIs from the data, business users need a common understanding of the data provided by the DWH. The information needed to build that understanding is typically hard to find and not accessible in a single place.

Ultimately, this might lead to multiple departments having different understandings of the same data and deriving different interpretations of the same KPI. From there it is a short step to the worst-case scenario: trust in your data fades. This is where a unified semantic layer can help!

From Raw Data to Semantic Layer – with TurboVault and dbt Cloud

Data Vault is vital for businesses due to its adaptability and scalability in managing dynamic data environments. Its hub-and-spoke architecture ensures traceability and agility, enabling quick adaptation to changing requirements and diverse data sources.

Join our webinar and learn how to use dbt Cloud with TurboVault and a data modeling tool to implement Data Vault in your organization. We will also take a look at the dbt Semantic Layer.

Watch Webinar Recording

Components of the dbt Semantic Layer

The dbt Semantic Layer helps simplify the definition and usage of critical business metrics in your dbt project. Metric definitions are centralized to allow consistent self-service usage for all data teams. 

By shifting metric definitions from the BI layer into the modeling layer, business users from different units can be confident that they use the same metric definition, no matter which tool they use. If a metric definition changes over time, the change is applied everywhere the metric is used, so consistency is enforced.

To create a unified semantic layer inside your dbt project, the following steps are necessary:

  1. Draft a semantic model
    • To implement a semantic model, a model needs to be drafted first. This should happen via a collaboration between the technical and business teams, to identify core business concepts and how they relate to each other.
  2. Create dbt models that match your semantic model
    • Each object of your semantic model should be turned into a dbt model 1:1. While creating them, put extra work into aligning column names with naming standards and correctly developing the loading logics.
  3. Create new .yml files in the metrics folder
    • Everything related to the dbt Semantic Layer needs to be located in a new folder called “metrics”. There, .yml files are used to define and configure your semantic models. We recommend creating one .yml file per semantic model; a combined sketch of such a file follows this list.
  4. Define entities
    • Despite the name, entities in the semantic model roughly describe columns of semantic models. Entities can be of four different types: Primary, Unique, Foreign, or Natural. Every model needs a primary entity, and an entity can be a single column or a SQL expression transforming a column.
  5. Define dimensions
    • A dimension in the dbt Semantic Layer can be seen as different ways to look at your model, i.e., group the data by a specific attribute. Every dimension needs to be tied to a primary entity, which is used for the grouping. A good example is a date column which enables you to group your data by day, month, or year.
  6. Define measures
    • Measures represent aggregations applied to specific columns in your data model. Measures can be used in other measures to calculate more complex ones, and can be defined with various parameters that help create executable SQL code for calculation.
  7. Define metrics
    • Metrics represent the language of the business users. They can be of various types, which represent different kinds of calculations. Some examples include Conversion metrics, Cumulative metrics, Derived metrics, and Ratio metrics. They are always based on measures and represent the last building block of the semantic layer.
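
To make these steps concrete, here is a minimal, hypothetical sketch of such a .yml file. It assumes a dbt model named fct_orders with an order_id primary key, a customer_id foreign key, an order_date column, and an amount column; all names are illustrative and not taken from the original post.

```yaml
semantic_models:
  - name: orders
    description: Semantic model on top of the (hypothetical) fct_orders model.
    model: ref('fct_orders')
    defaults:
      agg_time_dimension: order_date
    entities:
      - name: order_id          # primary entity: one row per order
        type: primary
      - name: customer_id       # foreign entity pointing at the customer model
        type: foreign
    dimensions:
      - name: order_date        # time dimension for grouping by day/month/year
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total       # aggregation applied to the amount column
        agg: sum
        expr: amount

metrics:
  - name: revenue               # the business-facing metric, built on a measure
    label: Revenue
    type: simple
    type_params:
      measure: order_total
```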

 

This is how the semantic layer is reflected in your dbt lineage: semantic models and metrics appear as additional nodes downstream of the dbt models they are built on.

Consuming the dbt Semantic Layer

Once your dbt project has a semantic layer defined, it can be opened to data consumers. The dbt Semantic Layer allows various BI tools to directly connect to your dbt Cloud project and integrate metrics, measures, and filters directly into the tool of choice.

The following tools are already natively supported: 

  • Tableau
  • Google Sheets
  • Microsoft Excel
  • Hex
  • Klipfolio PowerMetrics
  • Lightdash
  • Mode
  • Push.ai
  • Steep

Other tools can be integrated with custom integrations, as long as they support generic JDBC connections and are compatible with Arrow Flight SQL.
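
To illustrate what such a connection looks like in practice, the dbt Semantic Layer's JDBC interface accepts a templated query syntax that the Semantic Layer service compiles into warehouse SQL. A minimal sketch, reusing the hypothetical revenue metric from the example above:

```sql
-- Ask the Semantic Layer for monthly revenue; the service resolves the
-- metric definition and generates the underlying warehouse SQL.
select *
from {{
    semantic_layer.query(
        metrics=['revenue'],
        group_by=[Dimension('metric_time').grain('month')]
    )
}}
```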

Conclusion

The dbt Semantic Layer can help regain trust in your data warehouse. By moving calculations back from the business users into the data model, a common definition of business KPIs is created. 

Although some additional setup is required, implementing a semantic layer can greatly improve the value generated by your data. Integrating it into the BI tools of your business users simplifies the way your data is consumed even further.

If you want to know more about the dbt Semantic Layer and learn how it fits into a Data Vault 2.0 powered Data Warehouse, make sure to join our next webinar on August 13th!

– Tim Kirschke (Scalefree)

Data Vault ROI

Watch the Video

Calculating ROI for Data Warehouse and Data Vault Investments

Investing in a data warehouse or Data Vault is a strategic decision with the potential to transform how organizations leverage their data. However, quantifying the return on investment (ROI) for such projects can be challenging, as it often involves intangible benefits alongside measurable cost savings. This article delves into the key considerations for evaluating ROI in data warehousing and Data Vault initiatives.



Understanding the Dual Nature of Benefits

The benefits of data warehouse and Data Vault investments can be broadly categorized into two types:

  1. Tangible Benefits: These are easily measurable outcomes, such as increased revenue, reduced costs, and improved efficiency. For example, a data warehouse project might lead to a significant increase in sales due to enhanced customer personalization or a reduction in storage costs due to optimized data management.
  2. Intangible Benefits: These are less quantifiable but equally valuable outcomes, such as improved decision-making, enhanced customer satisfaction, and faster time to market. While these benefits are harder to measure directly, they play a crucial role in driving long-term value for the organization.

Evaluating ROI: A Two-Pronged Approach

To effectively assess ROI, consider both the tangible and intangible aspects of your data warehouse or Data Vault investment.


1. Quantifying Tangible Benefits

  • Cost Savings: Identify areas where your data warehouse is reducing costs. This could include savings in data storage, data integration, or operational expenses due to automation.
  • Revenue Generation: Explore how your data warehouse is contributing to revenue growth. This might involve improved customer service, new product development, or optimized pricing strategies.
  • Efficiency Gains: Measure how your data warehouse has streamlined processes and reduced manual data handling, leading to time and resource savings.

2. Assessing Intangible Benefits

  • Cost of Inaction: Flip the question and ask yourself, “What are the costs of not having a data warehouse?” This can help quantify the value of risk mitigation, improved decision-making, and faster response to market changes.
  • Pain Point Valuation: Assign a monetary value to the pain points your data warehouse addresses. If your investment solves a customer satisfaction issue or a delivery time problem, estimate the financial impact of these improvements.
  • Benchmarking: Compare your performance against industry benchmarks to gauge how your data warehouse is helping you achieve a competitive advantage.

Conclusion

Evaluating ROI for data warehouse and Data Vault projects requires a holistic approach that considers both tangible and intangible benefits. By carefully analyzing cost savings, revenue generation, efficiency gains, and the value of addressing pain points, you can build a comprehensive picture of the return on your investment. Remember that the true value of a data warehouse extends far beyond monetary gains, encompassing strategic advantages that drive long-term business success.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Multiple Business Keys in One Source Column

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO, Michael Olschimke, engages with a relevant and practical question from our audience about multiple business keys in one source column.

Data Vault Business Keys

In the realm of Customer Relationship Management (CRM) and Master Data Management (MDM), integrating data from diverse systems is a common challenge. This is particularly true when dealing with external identifiers, such as social security numbers, VAT IDs, or passport numbers, which can also serve as business keys in other systems. This article explores how to effectively model such scenarios within a Data Vault framework to streamline data integration and data analysis.



Understanding the Challenge

When a CRM system doubles as an MDM platform, it often houses an “External Identifiers” entity. This entity stores relationships between customers and various external systems, encompassing both external identifiers (like social security numbers) and internal IDs (such as customer IDs in core systems).

The complexity arises when some of these identifiers also function as business keys in other systems, each with its own Data Vault hub. The goal is to combine customer-related data from different domains while maintaining the integrity of these relationships.


Modeling Strategies

  1. Multi-Active Satellite: One approach involves modeling the “External Identifiers” entity as a multi-active satellite attached to the customer hub. This approach accommodates multiple keys or external identifiers linked to a single customer. By including the ID type (e.g., VAT, SSN) in the satellite’s descriptive data, you can distinguish between different identifier types within the group.
  2. Joining Data: If a separate hub exists for a specific business key (e.g., VAT), you can directly join the data from the MDM system’s satellite to the corresponding hub. This approach facilitates data integration and enables easier queries.
  3. Business Vault Links: To optimize join performance, especially with complex business keys or multi-column identifiers, you can create exploration or business links in the business vault. These links implement conditional logic, establishing connections between the customer hub and the relevant business key hubs based on the ID type (a sketch follows this list).
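
As a rough illustration of the third strategy, such a link can be virtualized as a view in the business vault. This is a sketch under assumed names (Snowflake-style SQL; a multi-active satellite sat_customer_external_ids carrying an identifier type column, and a separate VAT hub), not the post's own implementation:

```sql
-- Business/exploration link from the customer hub to the VAT hub,
-- derived from the identifier type stored in the MDM satellite.
create view business_vault.lnk_customer_vat as
select
    sha2(concat_ws('||', s.hk_customer, s.identifier_value), 256) as hk_customer_vat,
    s.hk_customer,                      -- hash key of the customer hub
    h.hk_vat,                           -- hash key of the VAT hub
    s.load_date
from raw_vault.sat_customer_external_ids s
join raw_vault.hub_vat h
  on h.vat_bk = s.identifier_value      -- external identifier doubles as BK
where s.identifier_type = 'VAT';        -- conditional logic per ID type
```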

Data-Driven Modeling

A data-driven modeling approach is essential in these scenarios. Start by capturing raw data from the source systems without applying business logic. In the raw Data Vault, treat business keys as descriptive fields within the satellite. Subsequently, in the business vault, you can implement the necessary business logic through links and relationships to integrate data from different domains effectively.


Hub It Out Pattern

The Hub It Out pattern, often used in refactoring, can also be applied here. If a new data set with descriptive data for a specific business key (e.g., corporate car VIN numbers) becomes available, you can extract those values from the existing satellite and create a new hub. Then, establish links between the customer hub and the new hub based on the existing relationships.


Considerations

  • Hash Keys: Consider using hash keys for improved join performance, especially when dealing with complex or variable-length business keys.
  • Data Virtualization: Where possible, virtualize data downstream from the raw Data Vault satellite to simplify the deletion of personal data.

Conclusion

In conclusion, integrating CRM and MDM data involving external identifiers that double as business keys requires a thoughtful modeling approach within the Data Vault framework. By leveraging multi-active satellites, joining data, creating business vault links, and adhering to a data-driven modeling philosophy, you can efficiently combine customer-related data from disparate domains.

Remember that the goal is to create a flexible and adaptable data model that caters to the evolving needs of your business. By employing these strategies and considering the specific requirements of your environment, you can unlock the full potential of your CRM and MDM systems, enabling seamless data integration and enhanced analytical capabilities.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Virtualized Load End Date and the SQL Optimizer

Watch the Video

Virtualized Load End Date in Data Vault

In the world of data warehousing, optimizing query performance is crucial, especially in complex data models like Data Vault 2.0. One common challenge is dealing with virtualized load end dates in reference tables, which can hinder join elimination and impact query execution times. In this article, we delve into this issue and explore potential solutions to enhance query performance.



Understanding the Problem

The scenario involves joining dictionaries to satellite tables based on code attributes, where reference tables lack a point-in-time (PIT) table. This necessitates joining reference satellite entries, using a virtualized load end date, which prevents join elimination.

The problem lies in the complex join condition arising from the virtualized load end date calculation. This complex condition prevents the SQL optimizer from utilizing indexes effectively, leading to performance bottlenecks.
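
To see why, consider a minimal sketch of how such a load end date is typically virtualized (illustrative names, Snowflake-style SQL):

```sql
-- The load end date is derived with a window function instead of being stored.
create view ref_sat_region_timeline as
select
    region_code,
    region_description,
    load_date,
    lead(load_date, 1, '9999-12-31'::timestamp)
        over (partition by region_code order by load_date) as load_end_date
from ref_sat_region;

-- Downstream queries must then join on a range condition over the derived
-- column, which the optimizer can neither match to an index nor eliminate:
--   on  r.region_code   = s.region_code
--   and r.load_date     <= s.snapshot_date
--   and r.load_end_date >  s.snapshot_date
```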


Solution Approaches


1. Snapshot Satellites

One solution is to implement snapshot satellites in the business vault. These satellites use the snapshot date as the timeline in their primary key, aligning with the granularity of outgoing information. This approach is efficient for handling different granularities in incoming and outgoing data and simplifies business logic implementation.


2. Materializing Reference Tables

Another option is to materialize the reference tables. While this can help with query performance, it introduces challenges when dealing with personal data, as deleting such data becomes more complex.


3. Consolidated Reference Table

Consider using a consolidated reference table to capture reference data for different domains with similar structures. This simplifies the data model and potentially keeps the table in memory, reducing disk access.


4. Extending the PIT Table

You could extend the PIT table of the relevant hub or dimension to include reference data. However, this increases redundancy and may not be suitable if numerous reference tables or attributes are involved.


5. Materializing Snapshot Satellite

Materializing the snapshot satellite is another alternative, especially if reference tables do not contain personal data. This simplifies data deletion when necessary.


6. Virtualization

If possible, consider virtualizing downstream data from the raw data vault satellite. This eliminates the need to delete personal data in materialized views, simplifying data management.


Additional Tips

  • Limit reference satellites: In scenarios where only the latest snapshot is needed, limiting reference satellites to type 1 dimensions can simplify maintenance, although it restricts the availability of historical data.
  • Partitioning: Partitioning reference and PIT tables by relevant codes or dates can improve query performance and storage management.

Conclusion

Optimizing query performance in Data Vault 2.0 requires careful consideration of various factors, especially when dealing with virtualized load end dates in reference tables. The solutions discussed in this article offer different approaches to tackle this challenge, each with its pros and cons. Choosing the most suitable approach depends on the specific requirements of your data warehouse environment.

By implementing these solutions and following the additional tips, you can enhance query performance, improve data management, and ensure the efficient delivery of information in your Data Vault 2.0 environment.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Leveraging Snowflake and Data Vault 2.0 for Enhanced Scalability, Performance, and Cost Efficiency

Snowflake and Data Vault 2.0

In today’s fast-paced, data-driven business environment, organizations are continuously seeking innovative solutions to manage and optimize their vast amounts of data. The combination of Snowflake, a state-of-the-art cloud data platform, and Data Vault 2.0, a methodology designed for long-term historical storage of data, presents a game-changing approach to data management. Together, these technologies offer unparalleled scalability, performance, and cost efficiency, making them indispensable tools for any data-centric organization.

Leveraging Snowflake and Data Vault 2.0 for Enhanced Scalability, Performance, and Cost Efficiency

In this webinar, we’ll explore how to leverage Snowflake’s powerful cloud data platform in combination with Data Vault 2.0 to meet modern data warehouse requirements. We’ll cover strategies for achieving enhanced scalability and performance while maintaining cost efficiency.

The first part of the webinar provides an overview of Snowflake’s unique features and their benefits for Data Vault 2.0. The second part delves into key strategies for optimizing development processes, reducing total cost of ownership, and ensuring sustainable data management.

Watch Webinar Recording

Enhanced Scalability and Performance

Snowflake’s architecture, which separates compute and storage, allows for horizontal scaling and ensures high performance even with large data volumes. This scalability is crucial for handling the increasing data demands of modern businesses. Data Vault 2.0’s modular approach complements Snowflake’s architecture, enabling efficient data integration and historical tracking without compromising performance. The synergy between Snowflake’s robust infrastructure and Data Vault’s methodology ensures that businesses can scale their operations seamlessly while maintaining data integrity and accessibility.

Snowflake also allows for dynamic resizing of compute resources (scaling up or down) to handle varying workloads efficiently. Additionally, Snowflake supports horizontal scaling by adding clusters to a multi-cluster warehouse to handle concurrent queries, ensuring optimal performance during peak times. This ability to resize dynamically and scale out ensures that businesses can meet performance demands without incurring unnecessary costs.

Cost Efficiency and Total Cost of Ownership

One of the standout features of Snowflake is its pay-per-second pricing model, which helps businesses minimize costs by only paying for the compute resources they use. This, coupled with Data Vault’s efficient data storage practices, reduces redundancy and optimizes storage costs. Implementing these technologies together significantly lowers the total cost of ownership (TCO), making it a financially viable solution for organizations of all sizes. The cost efficiency is further enhanced by Snowflake’s zero-copy cloning and efficient data storage solutions, which reduce overall data management expenses.

Zero-Copy Cloning

Zero-copy cloning is a feature in Snowflake that allows users to create a clone of a database, schema, or table without actually copying the data. Instead, Snowflake uses metadata pointers to reference the original data. This means that creating a clone is nearly instantaneous and does not consume additional storage; a short example follows the list below.

  • Efficiency: Because no actual data is copied, the process is very fast and storage efficient.
  • Cost-Effective: Since clones share the same underlying data, storage costs are minimized. Only changes made to the cloned data incur additional storage.
  • Flexibility: Clones can be used for various purposes such as testing, development, or analytics without affecting the original dataset.
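
In practice, cloning is a single DDL statement. A short sketch with illustrative object names:

```sql
-- Create a full development copy of the production database; only metadata
-- pointers are created, so no data is physically copied.
create database dev_dwh clone prod_dwh;

-- Clones also work at schema or table level, e.g. for testing a load.
create table staging.orders_test clone staging.orders;
```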

Efficient Data Storage Solutions

Snowflake offers several features that contribute to efficient data storage, some of which include:

  • Automatic Data Compression: Snowflake automatically compresses data as it is loaded into the system. This reduces storage costs and improves query performance.
  • Columnar Storage Format: Data in Snowflake is stored in a columnar format, which is optimized for analytical queries. This format allows for efficient data retrieval and storage, especially for large datasets.
  • Time Travel: Snowflake’s Time Travel feature allows users to query historical versions of data within a configurable retention period. This is achieved through data versioning: changes to data are tracked over time, and previous versions can be queried as needed (see the example after this list).
  • Data Pruning: Snowflake uses metadata to filter out unnecessary data at query time, which reduces the amount of data scanned and speeds up query performance. This is especially useful for large datasets where only a subset of the data is required for analysis.

  • Storage Optimization Services: Snowflake continuously manages the storage infrastructure, ensuring that data is stored efficiently. This includes automatic clustering and re-clustering of data to optimize query performance.
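
As a brief illustration of the Time Travel feature from the list above (Snowflake syntax; the table name and query ID are illustrative):

```sql
-- Query the satellite as it looked one hour ago ...
select * from sat_customer at(offset => -3600);

-- ... or as it looked just before a specific statement was executed.
select * from sat_customer before(statement => '01a2b3c4-0000-1234-0000-56789abcdef0');
```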

Development Agility and Rapid Results

Integrating Snowflake and Data Vault 2.0 accelerates development processes. Tools like dbt (data build tool) automate Data Vault model creation, reducing the time and effort required for data engineering tasks. This modular approach allows for rapid iteration and adaptation to changing business needs, enhancing overall agility. Businesses can quickly respond to new data insights and business requirements, providing a significant competitive advantage. The automation capabilities provided by tools like dbt not only save development costs but also ensure that data models can be updated and maintained with minimal manual intervention.

Cost Monitoring and Dashboards

Effective cost monitoring is essential to managing and optimizing expenses in data management. Building dashboards that monitor costs at the data product or even query level provides granular visibility into spending. These dashboards can highlight the most expensive queries and help identify inefficiencies. By implementing such dashboards, businesses can set up alerts and notifications for cost overruns, enabling proactive management of resources. Snowflake’s robust monitoring and logging features facilitate the creation of these detailed cost dashboards, ensuring that every aspect of data consumption is tracked and optimized.
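
As a starting point for such a dashboard, here is a sketch of a query against Snowflake's ACCOUNT_USAGE views (requires the corresponding privileges; the time window and limit are illustrative):

```sql
-- Feed for a cost dashboard: the ten longest-running queries of the week.
select
    query_id,
    user_name,
    warehouse_name,
    total_elapsed_time / 1000 as elapsed_seconds,
    query_text
from snowflake.account_usage.query_history
where start_time >= dateadd('day', -7, current_timestamp())
order by total_elapsed_time desc
limit 10;
```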

Free Webinar Recording

This recording will delve into the benefits of these technologies and provide strategies for optimizing your data management infrastructure. Join us for this insightful session and learn how to maximize the value of your data infrastructure investments. Watch here for free

Conclusion

Integrating Snowflake with Data Vault 2.0 offers a powerful framework for modern data management, providing enhanced scalability, performance, and cost efficiency. By leveraging these technologies, businesses can achieve significant savings, streamline development processes, and ensure their data infrastructure is future-proof and sustainable. For more details and to register for the webinar, visit our webinar page.

 

– Lorenz Kindling, Senior Consultant (Scalefree)

Microsoft SQL Server Join Elimination and Data Vault PIT Tables

Watch the Video

In this week’s Data Vault Friday, Michael Olschimke, our CEO, addresses a thought-provoking question about join elimination in SQL Server, specifically concerning PIT (Point-in-Time) tables in Data Vault. Let’s explore this topic in depth.



Understanding PIT Tables and Their Role in Data Vault

In Data Vault 2.0, PIT tables are crucial for efficient data retrieval. They are designed to reference the most recent changes (deltas) in satellite tables linked to a hub or link. This setup is instrumental in creating dimension tables that reflect a snapshot of your data at a specific point in time.


The Query at Hand

The question we received revolves around the challenge of optimizing queries using PIT tables. The scenario is familiar: you’ve built dimensions from PIT tables, which join to satellite entries valid at the snapshot time. However, when users query only a few attributes, unnecessary joins to multiple satellites can significantly slow down performance. The goal is to ensure join elimination works effectively, allowing the SQL optimizer to bypass unnecessary joins and thus speed up query execution.


The Join Elimination Dilemma

Join elimination is a technique where the SQL optimizer skips joins that aren’t needed for the final result. This is particularly useful in data warehousing scenarios where large dimension tables are involved. In SQL Server, achieving efficient join elimination, especially with PIT tables, involves several considerations:


Inner Joins vs. Left Joins

Inner joins are commonly used but can be problematic, because SQL Server requires foreign keys between tables for optimal join elimination. Left joins, on the other hand, can make join elimination easier for the optimizer, as they are less restrictive.


Foreign Keys

For SQL Server to perform join elimination efficiently, there must be a foreign key relationship between the PIT table and the satellite tables. This foreign key doesn’t have to be active, but it must exist to inform the optimizer about the relationships.


Column Considerations

Ideally, the foreign key should be based on a single column. In Data Vault, where primary keys often consist of composite keys (e.g., hash key and load date), this can complicate matters.


Practical Steps for Efficient Join Elimination

Here are some practical steps to ensure join elimination works effectively in SQL Server:


Use Left Joins

By using left joins instead of inner joins, you can simplify the optimizer’s task. In many cases, left joins with equi join conditions perform as efficiently as inner joins.
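
A sketch of what this looks like for a PIT-based dimension (T-SQL; all names are illustrative). If a query against this view selects no address attributes, and the satellite keys are guaranteed unique, the optimizer can drop the address join entirely:

```sql
-- Dimension view assembled with left joins from the PIT table.
CREATE VIEW dim_customer AS
SELECT
    p.hk_customer,
    p.snapshot_date,
    n.customer_name,        -- from the name satellite
    a.city                  -- from the address satellite
FROM pit_customer AS p
LEFT JOIN sat_customer_name AS n
       ON n.hk_customer = p.hk_customer
      AND n.load_date   = p.sat_customer_name_ldts
LEFT JOIN sat_customer_address AS a
       ON a.hk_customer = p.hk_customer
      AND a.load_date   = p.sat_customer_address_ldts;
```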


Implement Foreign Keys

Ensure that foreign keys exist between your PIT tables and satellite tables. Even if these keys are inactive, they provide the necessary metadata for the optimizer.


Simplify Foreign Key Structures

Where possible, simplify the foreign key structures to single columns. This can help the optimizer in performing join elimination more efficiently.


Testing and Verification

Given the complexity of join elimination, thorough testing is essential. Ensure that your SQL Server setup supports the required features and verify the execution plans to confirm that unnecessary joins are indeed being eliminated. Tools like SQL Server Profiler and Execution Plan Viewer can be invaluable for this purpose.


Tailoring Information Marts for Performance

Another critical aspect is the design of your information marts. Traditional data warehousing often involved creating large, comprehensive information marts. However, in the Data Vault paradigm, it’s more efficient to create smaller, purpose-specific marts. Here’s why:


Performance Optimization

Smaller marts tailored to specific reports or queries minimize the amount of data processed, thus speeding up query execution.


Virtualization

By virtualizing information marts in SQL Server, you can dynamically assemble the necessary data without physically storing it, reducing storage requirements and increasing flexibility.


Conformed Dimensions

When dimensions are based on the same PIT tables, they are inherently conformed, allowing for seamless joins across different information marts.


Conclusion

Join elimination is a powerful technique for optimizing queries in SQL Server, especially when working with Data Vault 2.0 and PIT tables. By using left joins, implementing foreign keys, and simplifying key structures, you can enhance the performance of your data retrieval processes. Additionally, designing smaller, purpose-specific information marts tailored to specific use cases can further boost efficiency.

If you have more questions or need further clarification, feel free to use the form at https://sfr.ee/DVFriday to submit your queries. You can also join our monthly webinars on WhereScape and dbt.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Deriving Dimensions from Reference Data

Watch the Video

Reference Data

Welcome to another episode of Data Vault Friday! I’m Michael Olschimke, CEO of Scalefree. Today’s question comes from our online form, and it’s about deriving dimensions from reference data in the raw Data Vault. Specifically, the questioner has several lookup reference tables that they add as Hub or reference tables while creating the raw Data Vault. For example, they have a region table that includes a region code, description, language code, and valid from/to dates.



Understanding Degenerate Dimensions

The query mentions a “degenerated dimension.” To clarify, a degenerate dimension is a dimension attribute, such as the region code, included in a fact table without any additional descriptions. This attribute exists within the fact table itself and doesn’t have a separate dimension table.


Building the Model

To illustrate this, let’s start with a basic structure. Imagine you have a non-historized link containing transaction data, and it references Hubs such as Hub Customer and Hub Product. Additionally, you have reference tables, like a region reference table with a region code and associated descriptions. Here’s a simplified model:

  1. Non-historized Link: Contains transaction data.
  2. Hub Customer and Hub Product: Reference customer and product data.
  3. Reference Hub for Region: Contains the region code.
  4. Reference Satellite for Region: Contains the descriptions, language codes, and valid dates.

This setup allows for capturing changes in reference data, making the model auditable and maintaining historical accuracy.


Creating a Degenerate Dimension

To create a degenerate dimension from the reference data, follow these steps:

  1. Include the Code in the Fact Table: Add the region code directly into your transaction data (the non-historized link).
  2. Determine the Required Attributes: Decide if you only need the region code or additional attributes like the region name.
  3. Create a Fact View: If you only need the region code, simply create a fact view that includes this code.
  4. Prejoin Additional Attributes: If you need additional attributes, prejoin the reference Hub and Satellite to get the region name or other details based on the timeline of your facts.

Handling Time-Based Data

When dealing with time-based data, it’s essential to identify the correct version of your reference data. If you want the latest description of the region (a Type 1 dimension), you can join the latest entry. For a Type 2 dimension (tracking changes over time), join based on the fact timestamp to match the correct version of the region name.
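
A sketch of the Type 2 case, matching each fact to the reference version valid at the fact's timestamp (illustrative names, using the valid from/to dates mentioned above):

```sql
-- Resolve the region description that was valid when the transaction happened.
select
    f.transaction_id,
    f.region_code,
    r.region_description
from fact_transactions f
left join ref_region r
  on  r.region_code    = f.region_code
  and f.transaction_ts >= r.valid_from
  and f.transaction_ts <  r.valid_to;
```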


Performance Considerations

Reference tables typically contain a relatively small amount of data, which allows most joins to be efficient. However, if performance becomes an issue, you can consider creating a Point-in-Time (PIT) table in the Business Vault. This table can precompute the current description for each region on a daily basis, making joins faster and more efficient.


Conformed Dimensions

If you prefer to use a conformed dimension, convert your reference table into a dimension table. Use the primary key of the reference table (e.g., region code) as the dimension identifier. This approach involves joining the reference Hub and Satellite to create a dimension view that can be used in your fact tables.


Implementation Steps

  1. Turn Reference Table into Dimension: Join the reference Hub and Satellite to create a dimension view (sketched after this list).
  2. Use Reference Code as Dimension Key: The region code becomes the dimension key.
  3. Create Fact View: Include the dimension key in your fact view and join the necessary attributes from the dimension view.
  4. Configure in Dashboard: Set up relationships between your facts and dimensions in your dashboard application for seamless data visualization.
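
A sketch of steps 1 and 2, virtualizing a Type 1 region dimension on top of the reference Hub and Satellite (Snowflake-style SQL; names are illustrative):

```sql
-- Latest description per region code; the reference code is the dimension key.
create view dim_region as
select
    h.region_code,              -- dimension key
    s.region_description,
    s.language_code
from ref_hub_region h
join ref_sat_region s
  on s.region_code = h.region_code
qualify row_number() over (
    partition by s.region_code
    order by s.load_date desc   -- keep only the most recent satellite entry
) = 1;
```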

Conclusion

In summary, deriving dimensions from reference data in a Data Vault involves understanding your needs for degenerate or conformed dimensions, handling time-based data appropriately, and ensuring efficient joins. By following these steps, you can create a robust and scalable data model that meets your analytical needs.

Thank you for joining us for this Data Vault Friday session. If you have more questions, submit them at sfr.ee/dvfriday. For additional learning, check out our webinars at Scalefree.to/webinars. If you need answers before next Friday, visit the Data Vault Innovators community we set up with Ignition Data.

Until next time, keep those data questions coming, and we’ll see you next Friday!

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

AI in Data Warehousing: Fundamental Principles and Applications

AI in Data Warehousing

AI in Data Warehousing

The ability to efficiently store, manage, and analyze data has become critical for any organization. Data warehousing is an integral part of this process, providing structured, centralized storage for large volumes of data, and is an important element of business intelligence.
However, the increasing volume and complexity of data makes it ever more challenging to manage. At the same time, AI is already shaping processes in a wide variety of areas, including data warehousing.

This article explores the basic principles and applications of artificial intelligence and how AI can be integrated into an Enterprise Data Warehouse: to enhance the design and operation of data warehouses, and to enable more advanced data science applications such as machine learning and predictive analytics.

AI in Data Warehousing: Fundamental Principles and Applications

This webinar delves into the essential principles of Artificial Intelligence (AI) and its transformative role in data warehousing. We explore how AI and Machine Learning (ML) not only enhance the construction and operation of data warehouses but also leverage the vast amounts of data stored within them for developing powerful applications, e.g. for predictive analytics.

You will gain an understanding of the foundational AI technologies driving these advancements and how they intersect with data science and machine learning to optimize data storage, improve decision-making, and unlock new business insights. This webinar provides valuable perspectives on integrating these technologies into your data warehousing practices.

Watch webinar recording

Foundational Principles of AI

Let’s start by categorizing the term artificial intelligence, as well as other terms such as machine learning, deep learning, or generative AI, in order to clarify the context and what exactly is hidden behind these terms. 

Artificial Intelligence is now more of a generic term and represents a broad field. In general, the term AI encompasses techniques that allow computers to emulate human behavior, enabling them to learn, understand language, make decisions, recognize patterns, and solve complex problems in a way that is similar to human intelligence.


Machine Learning (ML) as a Subset of AI

Machine Learning is a subset of the field of artificial intelligence and, even if the term is less prominent, one of its most important areas. Machine Learning uses advanced techniques and algorithms to recognize patterns in large amounts of data, allowing machines to learn and adapt autonomously to make inferences or predictions based on the data. Unlike traditional programming, where tasks are executed based on clear and defined instructions, machine learning relies on statistical analysis of input data to produce output values that fall within an expected range. Essentially, machine learning enables computers to learn from and interpret data without being explicitly programmed for each individual case.

Deep Learning (DL) – A Deeper Dive into ML

Deep Learning is a more specialized sub-area of machine learning and AI that uses deep neural networks to process data in depth and recognize complex patterns.

Deep learning uses many layers to extract high-level features from raw data as input, in a sense simulating the way a human brain works.

In this respect, deep learning is particularly strong in processing very large amounts of data, using it to learn complex patterns in order to solve a wide variety of tasks. Well-known examples of this are tasks such as image or voice recognition, which every smartphone is capable of.

Generative AI – The Creative Aspect of Deep Learning

Generative AI is a term that has recently become more widely known, especially due to ChatGPT. Generative AI is a sub-area of deep learning and includes deep learning models that generate new content, such as text, images, code, or even videos, based on the data they have been trained on. These models are able to generate novel outputs that do not directly replicate the training data, without explicit instructions. This opens up completely new possibilities, such as generating long texts, complex images, or even music.

The Role of AI in Building Data Warehouses

AI is changing everything, including data warehousing. As such, it can help to enhance the efficiency and effectiveness of your data warehouse from the ground up. From the design and structure to the ongoing data management processes, there are many opportunities where AI can help.

AI has the power to address a data warehouse’s biggest challenges: performance, governance, and usability. This is data intelligence and will revolutionize the way you query, manage, govern, and visualize your data.

Enhancing Design and Structure

The architecture of a data warehouse is critical as it needs to support efficient data querying and scalability while maintaining performance. By analyzing usage and query patterns, AI algorithms can suggest the most effective data models and indexing strategies. This not only speeds up the retrieval of information but also ensures more agile data handling when scaling or integrating new data sources.

Automating Data Integration, Cleaning, and Transformation

By letting AI take over low-level tasks, data engineers can focus on higher-level tasks such as designing data models, training machine learning algorithms, and creating data visualizations. For example, The Coca-Cola Company uses AI-powered ETL tools to automate data integration tasks across its global supply chain to optimize sourcing and procurement processes. 

But AI can also support developers directly in their work, enabling them to work faster and more efficiently. Developers can debug problems in their code faster with AI-based code generation and analysis tools, such as GitHub Copilot, which completes, refactors, and debugs code in real time directly in the IDE.

Performance tuning of data warehouse workloads can also be automated and improved with predictive optimizations, which can yield considerable cost savings.

AI can also scale and automate governance using automated tagging, documentation, and natural language search across all data and assets in an organization.

This extends all the way to business users interacting with the data in natural language, asking questions, and building dashboards. The possibilities are endless, and we are only at the very beginning of these developments.

Using a Data Warehouse for AI Applications

The nature of a data warehouse, a structured and centralized repository that aggregates data from multiple sources within an organization, makes it a perfect foundation for building and training AI applications. AI models require large amounts of well-structured, high-quality data with as much variability as possible; a data warehouse provides exactly that, making it an ideal basis for an organization to train its own AI models.

For example, a data warehouse in a retail context might store years of customer purchase history, demographics, and product information: data that can be used to train AI models to predict future buying trends or recommend products effectively.

Enhancing AI Capabilities with Data Vault

Data Vault 2.0 in particular creates the right conditions for training your own AI models, thanks to the structured, reliable data framework it provides.

Here are some examples of how this integration can benefit AI-driven analytics:

  • Improved Data Quality: The organized and auditable structure of Data Vault 2.0 ensures higher data quality and consistency, which is important for training accurate AI models. Clean, well-structured data models reduce the time spent on data preparation and increase the reliability of AI predictions.
  • Enhanced Historical Analysis: The comprehensive historical data captured by Data Vault 2.0 allows AI models to perform more accurate trend analysis and forecasting. This capability is especially valuable in sectors like finance and retail, where understanding long-term trends can significantly impact strategic decisions.
  • Data Reliability and Lineage: Data Vault 2.0’s robust framework ensures data reliability through its unique architecture that captures data from various sources while maintaining its lineage. This means every piece of data in the system can be traced back to its origin, providing transparency and trust in the data used for AI models.

Conclusion

Integrating artificial intelligence (AI) into data warehousing significantly enhances the efficiency and effectiveness of data storage, management, and analysis. By applying AI techniques, organizations can automate data integration, cleaning, and transformation processes, leading to more accurate and timely insights. Furthermore, AI-driven optimization of data warehouse design and structure ensures improved performance and scalability. Embracing AI in data warehousing not only streamlines operations but also empowers businesses to leverage advanced analytics, such as machine learning and predictive analytics, thereby unlocking deeper insights and fostering informed decision-making.

– Ole Bause (Scalefree)

Still Struggling with GDPR?

Watch the Video

Navigating GDPR Compliance in Data Warehousing

In today’s digital age, GDPR compliance is a crucial aspect for any organization dealing with personal data. With the rise of data warehousing and advanced modeling solutions like Data Vault 2.0 (DV 2.0), questions often arise about how to handle Personally Identifiable Information (PII) within these frameworks. This article addresses some common concerns and provides practical recommendations for ensuring GDPR compliance in data warehouses.



Understanding the Challenge

GDPR mandates that personal data must be handled with the utmost care, ensuring individuals’ privacy and security. In the context of data warehousing, this often translates to managing business keys that might contain PII. Let’s dive into some specific questions raised around this topic:

  1. How should activity history be managed when the main hub contains a PII business key?
  2. Is it best practice to use hashed business keys in link tables to improve load performance?
  3. Should artificial keys originate from each business domain, and how should they be managed if not?

Question #1: Managing Activity History with PII Business Keys

The Problem

In a typical data warehouse model, customer records might include PII, such as social security numbers or tax IDs. According to GDPR, it’s crucial that activity history is not traceable back to the individual once they exercise their right to be forgotten.

The Solution

One effective approach is to split descriptive attributes into different satellites—one for personal data and another for non-personal data. This way, when a deletion request is made, only the personal satellite needs to be purged. The non-personal satellite can retain anonymized data, maintaining the integrity of the dataset while ensuring compliance.
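
A structural sketch of such a split (column names are illustrative); a right-to-be-forgotten request then only needs to delete rows from the first table:

```sql
-- PII attributes isolated in their own satellite ...
create table sat_customer_pii (
    hk_customer  char(64)     not null,   -- hash key of the customer hub
    load_date    timestamp    not null,
    full_name    varchar(200),
    ssn          varchar(20),
    primary key (hk_customer, load_date)
);

-- ... while non-personal descriptive data lives in a separate satellite.
create table sat_customer_profile (
    hk_customer       char(64)    not null,
    load_date         timestamp   not null,
    customer_segment  varchar(50),
    signup_channel    varchar(50),
    primary key (hk_customer, load_date)
);
```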


Question #2: Using Hashed Business Keys in Link Tables

The Problem

Hashing business keys is often recommended in DV 2.0 to improve load performance. However, directly using business keys in link tables can pose a challenge, especially when those keys contain PII.

The Solution

In DV 2.0, it’s a standard practice to use hashed values of business key components rather than the business keys themselves. This approach ensures better performance and security. Here’s how it works:

  1. Hash the Business Key: Use a cryptographic hash function (e.g., SHA-256) to convert the business key into a hashed key.
  2. Use Hashed Keys in Link Tables: The hashed key then serves as the foreign key in link tables, ensuring that PII is not directly exposed (see the sketch below).
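
A minimal sketch of this pattern at staging time (Snowflake-style SQL; the normalization rules and names are illustrative and vary by implementation):

```sql
-- Normalize, then hash the business key with SHA-256 so that downstream
-- hubs and links never store the raw, potentially personal key.
select
    sha2(upper(trim(customer_bk)), 256) as hk_customer,
    load_date,
    record_source
from stg_customer;
```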

Question #3: Originating and Managing Artificial Keys

The Problem

There’s a debate on whether artificial keys should be generated within each business domain or within the data warehouse itself. This raises concerns about consistency and management, especially if the artificial key must be derived from PII.

The Solution

Artificial keys should ideally be generated within the data warehouse to maintain consistency and control. Here’s the process:

  1. Generate a UUID: Use a universally unique identifier (UUID) for the artificial key. This ensures randomness and reduces the risk of duplication.
  2. Link Artificial Keys to Business Keys: Establish a relationship between the artificial key and the business key within the data warehouse, ensuring that the artificial key is never exposed in operational systems (see the sketch below).
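
A sketch of such key issuance during ingestion (Snowflake-style SQL; uuid_string() and all object names are assumptions for illustration):

```sql
-- Issue a warehouse-generated artificial key for every business key that
-- does not have one yet; the mapping table itself is access-restricted.
insert into map_customer_key (artificial_key, customer_bk, load_date)
select
    uuid_string(),          -- random UUID as the artificial key
    s.customer_bk,
    current_timestamp()
from stg_customer s
left join map_customer_key m
  on m.customer_bk = s.customer_bk
where m.customer_bk is null;
```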

Handling Scenarios Without Artificial Keys

If generating artificial keys within each business domain is not feasible, the data warehouse should generate these keys upon ingestion. This method ensures that all keys are managed consistently and securely.


Ensuring Compliance and Security

Satellite Splitting

By splitting satellites into personal and non-personal data, organizations can easily manage deletion requests without compromising data integrity.

Cryptographic Hashing

Utilizing cryptographic hashing for business keys in link tables enhances both security and performance, crucial for maintaining GDPR compliance.

Artificial Keys Management

Generating artificial keys within the data warehouse ensures consistency and security, reducing the risk of PII exposure.

Regular Audits and Legal Consultation

Regular audits and consultations with legal experts ensure ongoing compliance with GDPR and other regulations. Implementing these practices helps organizations stay ahead of potential compliance issues.


Conclusion

Handling PII in data warehouses requires careful planning and robust solutions. By implementing satellite splitting, cryptographic hashing, and consistent artificial key management, organizations can ensure GDPR compliance while maintaining data integrity and performance. Regular audits and legal advice further bolster these practices, ensuring that data handling processes remain secure and compliant with evolving regulations.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Master Data Governance: The EU Data Act and What It Means for Your Company

Webinar Overview

On September 12, 2025, the new EU Data Act comes into force. It is intended to ensure a fair and innovative data economy. But what does this mean for your company with regard to data governance?

In this webinar, we unravel the finer points of the EU Data Act. What does this regulation cover, and what does it mean for your company and its data governance? Learn more about the legal and technical obligations arising from this legislation, and gain valuable insights into implementing compliant data platforms.

In the first part, Dr. Benno Barnitzke, attorney and specialist in IT, data protection, and digitalization, introduces and explains the legal framework of this regulation.

In the second part, Trung Ta, a data platform professional, shares tips and tricks for implementing this regulation.

WATCH THE RECORDING

What to Expect

  • Explanations of the recently enacted EU Data Act by attorney Dr. Benno Barnitzke
  • Technical data governance recommendations and best practices for implementing the Data Act
  • Practical insights into the legal and technical obligations involved in implementing data platforms

Participation Details

Date: June 5th 2024
Time: 14:00 – 15:00 CEST
Add to your calendar

Agenda

  1. Overview of the EU Data Act
  2. Obligations of the Data Holder (Data Owner)
  3. Enforcement
  4. Recommendations for legal compliance
  5. Technical requirements and challenges
  6. Technical recommendations for data platforms

Target Audience

Data management and compliance professionals, data product owners, and technical legal experts.

Free PDF – The EU Data Act – Checklist for Legally Compliant Implementation

MASTER THE EU DATA ACT TODAY!

Prepare your company for the Data Act with our comprehensive checklist of the 10 most important steps. Learn how to protect data, review contracts, and ensure interoperability.

REQUEST THE FREE CHECKLIST

Speakers


Trung Ta
Senior BI Consultant

Trung Ta has been working as a Senior BI Consultant since 2019. As a certified Data Vault 2.0 practitioner at Scalefree, his areas of expertise include data warehousing in hybrid/cloud environments as well as Data Vault 2.0 modeling and implementation, particularly with data warehouse automation tools. Trung works with industry leaders in the insurance and financial sectors, advising them on the implementation of Data Vault 2.0 solutions.


Dr. Benno Barnitzke LL.M.
Attorney & Partner at GÖHMANN Rechtsanwälte

Dr. Benno Barnitzke LL.M. has been an attorney since 2013 and a partner at GÖHMANN Rechtsanwälte since 2020. He specializes in IT and data protection law and advises SMEs, large corporations, and the public sector, including the State of Lower Saxony on its state-wide rollout of Microsoft Teams. At Leibniz University Hannover, Dr. Barnitzke teaches students specializing in IT law. As the author of numerous publications on data protection and IT law topics, he is also active in academia.

Loading Technical Counter-Transactions

Watch the Video

Managing Data Vault Performance with Incremental Changes and Deletions

In the world of data warehousing, the Data Vault methodology has emerged as a robust and scalable solution for managing vast amounts of data. However, one common concern among practitioners is how to efficiently handle incremental changes and deletions, particularly when dealing with structures containing billions of rows. This article aims to elucidate the process, focusing on the questions around loading structures, performance considerations, and practical strategies for maintaining efficiency.



Understanding the Basics: Tracking Changes and Deletions

The core principle of Data Vault involves capturing all changes and deletions incrementally. This ensures that the data warehouse remains an accurate historical record of the enterprise’s data. Here’s a simplified illustration of how this can be achieved:

  1. Initial Load: When a new transaction is recorded, it is inserted into the Data Vault as a new record. For instance, if customer A purchases product C at store B on day one, this transaction is recorded with a value of €7.
  2. Handling Updates: If the value of the transaction changes from €7 to €5 on day two, instead of updating the existing record, two new records are created: one to nullify the original transaction (-€7) and another to represent the new transaction (€5).
  3. Dealing with Deletions: If a transaction is deleted, it is handled similarly by inserting a record that nullifies the original transaction.

This method ensures that the Data Vault remains immutable, as records are never directly altered once inserted. Instead, changes are tracked by adding new records, which simplifies loading processes and maintains data integrity.


Loading Structures: The Practical Approach

Loading structures in Data Vault can be challenging, especially when dealing with large datasets. Here are some practical strategies:

Using CDC (Change Data Capture)

If the source system supports CDC, this is the most straightforward method:

  • Insert New Records: Directly insert new records into the target system.
  • Handle Updates and Deletes: For updates and deletes, insert the corresponding counter transactions.

CDC provides a clear and efficient way to track changes and deletions, significantly simplifying the loading process.

Full Load vs. Incremental Load

In scenarios where full loads are used (though rare for very large datasets), the process involves:

  • Identifying New Records: Select records from the staging area that do not exist in the target and insert them with a counter of one.
  • Identifying Deletions: Select records from the target that do not exist in the staging area and insert counter transactions to nullify them.

While full loads can be intensive, they can be managed effectively by optimizing the identification of new and deleted records.
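
As an illustration, here is a sketch of the deletion step in the full-load approach (illustrative names; a counter column carries +1/-1). A production load would typically aggregate each key to its current net value first:

```sql
-- Records whose key exists in the target but no longer in staging are
-- logically deleted by inserting offsetting counter-transactions.
insert into nhl_sat_transactions
    (transaction_id, amount, counter, load_date)
select
    t.transaction_id,
    -t.amount,              -- negate the value to nullify the original rows
    -1,                     -- marks the row as a counter-booking
    current_timestamp()
from nhl_sat_transactions t
left join stg_transactions s
  on s.transaction_id = t.transaction_id
where s.transaction_id is null;
```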


Performance Considerations

Handling billions of rows requires careful planning to avoid performance bottlenecks. Here are some strategies to mitigate performance issues:

Parallel Processing

By running multiple processes in parallel, you can significantly speed up the loading process. For example, separate processes can handle inserts and counter transactions concurrently.

Hash Keys and Indexes

Using hash keys and indexes efficiently can reduce the time needed to check for existing records. Ensure that your hash keys include all relevant business keys and transaction IDs to maintain uniqueness.

High-Water Marks and System Indicators

Some systems, such as Oracle with its SCN (System Change Number), offer change indicators or row versions that can help identify modified records. Using these indicators reduces the amount of data processed by focusing only on recently changed records.
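
A generic sketch of this idea, assuming the source exposes a monotonically increasing change indicator (here a hypothetical row_version column) and the load process persists the highest value it has processed:

    -- Read only rows changed since the last successful load
    SELECT *
    FROM source_sales
    WHERE row_version > (
        SELECT last_row_version        -- the stored high-water mark
        FROM etl_watermarks
        WHERE table_name = 'source_sales'
    );

    -- After a successful load, advance the high-water mark
    UPDATE etl_watermarks
    SET last_row_version = (SELECT MAX(row_version) FROM source_sales)
    WHERE table_name = 'source_sales';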


Practical Example: Incremental Loading without CDC

In cases where CDC is not available, you can still achieve efficient incremental loading:

  1. Incremental Updates from Source: If the source system provides daily increments (inserted and updated records), use this data to update the target.
  2. Handling Deletions: For deleted records, you might need an additional table or mechanism to track deletions. If such a table is available, use it to insert counter transactions (see the sketch after this list).
  3. Full Load Approach: If only full loads are available, implement a two-step process to identify and handle new, updated, and deleted records.
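
For step 2, assuming a hypothetical deletion log stg_sales_deletes that lists the hash keys reported as deleted by the source, the counter transactions could be derived directly from it:

    -- Nullify the currently open version of every deleted transaction
    INSERT INTO ts_sales (hk_sale, load_date, counter, amount)
    SELECT t.hk_sale, CURRENT_DATE, -SUM(t.counter), t.amount
    FROM ts_sales t
    JOIN stg_sales_deletes d ON d.hk_sale = t.hk_sale
    GROUP BY t.hk_sale, t.amount
    HAVING SUM(t.counter) > 0;   -- only versions not yet nullified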

Conclusion

Managing incremental changes and deletions in Data Vault structures, especially for large datasets, requires a combination of strategies tailored to the specific capabilities of your source systems. Whether using CDC, full loads, or incremental updates, the goal remains the same: to maintain an accurate and efficient data warehouse. By understanding the principles and applying practical solutions, you can handle the complexities of Data Vault performance effectively.

Remember, the key to success lies in thorough planning, efficient use of system capabilities, and continuous optimization of your data loading processes. By following these guidelines, you can ensure that your Data Vault implementation scales efficiently, even as your data volumes grow.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Data Vault Hashing or Not?

Watch the Video

Exploring Data Vault 2.0: Managing Hashing Costs in Smaller Environments

In the evolving landscape of data management, Data Vault 2.0 stands out as a robust methodology designed for scalability, flexibility, and consistency across diverse technological environments. A crucial component of Data Vault 2.0 is the use of hashing for business keys (BKs) and hash diffs. Hashing ensures data integrity and efficiency, especially in distributed systems. However, the performance costs associated with hashing can sometimes become a significant concern. This blog post delves into the nuances of hashing in Data Vault 2.0, the trade-offs involved, and when it might be feasible to deviate from the standard approach.



The Role of Hashing in Data Vault 2.0

Data Vault 2.0 leverages hashing to create unique, consistent identifiers for business keys and to detect changes in data efficiently. This method is technologically agnostic, meaning it can be implemented across various databases and data platforms, whether on-premises or in the cloud. The primary advantages of hashing include the following (a sketch of both calculations follows the list):

  1. Consistency Across Systems: Hashing ensures that business keys are consistent and unique across different systems and regions.
  2. Improved Query Performance: Pre-calculating hash diffs can make query execution faster and more efficient, transferring the computational load from query time to data loading time.
  3. Simplified Data Integration: Hash keys provide a straightforward way to manage and integrate data from multiple sources, reducing the complexity of data joins.
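
As a rough sketch of both concepts, with assumed column names and MD5 as a stand-in for whatever hash function your standards prescribe (real implementations also standardize case, trimming, delimiters, and NULL handling):

    SELECT
        -- hash key: consistent identifier for the business object
        MD5(UPPER(TRIM(customer_bk))) AS hk_customer,
        -- hash diff: fingerprint over all descriptive attributes,
        -- used to detect changes without comparing each column
        MD5(COALESCE(TRIM(first_name), '') || '||' ||
            COALESCE(TRIM(last_name),  '') || '||' ||
            COALESCE(TRIM(email),      '')) AS hd_customer,
        s.*
    FROM stg_customers s;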

Challenges of Hashing

Despite its benefits, hashing can introduce performance challenges, particularly in the following scenarios:

  1. Wide Tables: Calculating hash diffs for tables with a large number of columns can be computationally intensive.
  2. Complex Hash Functions: Choosing and configuring hash functions whose outputs are collision-resistant enough to be treated as unique can be complex and resource-heavy.
  3. Hardware Limitations: On-premises environments with limited hardware capabilities might struggle with the additional computational load required for hashing.

Evaluating Hashing Alternatives

When faced with performance concerns, particularly in smaller, local solutions, it’s essential to consider whether deviating from the standard hashing approach would be beneficial. There are three primary options to consider:

  1. Hash Keys: The default and recommended option for most environments, especially those involving distributed systems or diverse technologies.
  2. Sequences: A legacy approach from Data Vault 1.0 that uses sequential numbers as identifiers.
  3. Business Keys: Using the original business keys directly as identifiers.

The Case Against Sequences

Sequences, although a viable option, are generally not recommended in modern Data Vault implementations due to several drawbacks:

  • Lookup Overhead: Sequences require lookups during data loading, which can slow down the process significantly.
  • Orchestration Complexity: Managing sequences adds complexity to the loading process, particularly in real-time scenarios.
  • Distributed System Challenges: Sequences do not perform well in distributed environments where parts of the solution might reside in different locations (e.g., cloud and on-premises).

Hash Keys vs. Business Keys

When deciding between hash keys and business keys, the choice largely depends on the specific technology stack and the environment. Here are some considerations:

Hash Keys

  • Pros: Provide a consistent, fixed-length identifier that simplifies joins and queries across various systems. They are particularly beneficial in mixed environments.
  • Cons: Slightly higher computational cost during data loading compared to sequences. However, the consistent performance across queries often outweighs this drawback.

Business Keys

  • Pros: Directly using business keys can simplify the architecture in environments where the data platform supports efficient handling of these keys.
  • Cons: Can lead to complex and less efficient joins, especially in mixed or distributed environments.

Performance Optimization Strategies for Hashing

For environments where hashing performance is a concern, several optimization strategies can be employed:

  1. Leverage Hardware Acceleration: On-premises environments can benefit from hardware acceleration, such as PCIe cards with crypto chips, to offload hash computation from the CPU.
  2. Utilize Optimized Libraries: Many platforms use highly optimized libraries (e.g., OpenSSL) for hash computations, which can significantly improve performance.
  3. Incremental Loads: Ensure that performance evaluations consider multiple load cycles to capture the benefits of hash diffs during delta checks, not just initial loads (a sketch of such a delta check follows this list).
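
To make the third point concrete, here is a hedged sketch of the delta check that hash diffs enable during satellite loads; the table and column names are assumptions:

    -- Insert only rows whose fingerprint differs from the latest
    -- satellite entry; one hash comparison replaces a
    -- column-by-column change check
    INSERT INTO sat_customer (hk_customer, load_date, hd_customer,
                              first_name, last_name, email)
    SELECT s.hk_customer, CURRENT_TIMESTAMP, s.hd_customer,
           s.first_name, s.last_name, s.email
    FROM stg_customers s
    LEFT JOIN (
        SELECT hk_customer, hd_customer,
               ROW_NUMBER() OVER (PARTITION BY hk_customer
                                  ORDER BY load_date DESC) AS rn
        FROM sat_customer
    ) cur
      ON cur.hk_customer = s.hk_customer AND cur.rn = 1
    WHERE cur.hk_customer IS NULL             -- brand-new key
       OR cur.hd_customer <> s.hd_customer;   -- changed attributes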

Future Trends and Recommendations

Looking forward, the evolution of data platforms and technologies might shift the balance towards using business keys more frequently. As Massively Parallel Processing (MPP) databases become more prevalent, their native support for efficient key management could make business keys a more attractive option. However, until such technologies are ubiquitous, the default recommendation remains to use hash keys for their broad compatibility and consistent performance.


Conclusion

Data Vault 2.0’s approach to hashing business keys and hash diffs provides significant advantages in terms of consistency, scalability, and performance. While the performance costs of hashing can be a concern, particularly in smaller environments with limited hardware, careful consideration of the available options and optimization strategies can mitigate these issues. Ultimately, the decision should be guided by the specific technological context and future-proofing considerations.

For most scenarios, hash keys remain the recommended approach due to their versatility and robustness in mixed and distributed environments. However, as technology evolves, the use of business keys might become more feasible, highlighting the importance of staying informed about the latest trends and advancements in data management.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
