Skip to main content
search
0

Loading CDC Data into Multi-Active Satellites and Status Tracking Satellites

Watch the Video

Mastering CDC Data in Data Vault 2.0

Change Data Capture (CDC) is a powerful mechanism for tracking changes in source systems. However, when the primary key in your source system differs from the business key used in your Data Vault hub, you may encounter challenges in loading data into multi-active satellites. This article explores various strategies for handling CDC data in such scenarios, offering practical solutions to ensure accurate and efficient data loading.



Understanding the Challenge

In many source systems, the primary key is a technical identifier unknown to the business. Instead, the business key represents the meaningful identifier for a business object. In a typical scenario, the relationship between the primary key and the business key is one-to-one. However, in some cases, multiple records can be active for the same business key on the same date, resulting in multi-activity.

This situation arises when the primary key is unique at a given point in time, but the business key is not. For instance, you might have multiple customer IDs in your source system (primary keys) referring to the same customer (business key).


Solution 1: Verify Multi-Activity

Before diving into complex solutions, it’s crucial to verify whether the data is genuinely multi-active. In some cases, the appearance of multi-activity might be due to records being deleted and recreated with the same business key, resulting in different primary keys.

To check this, analyze the CDC data and other technical columns in the source system to determine the order of events. If a sequence of delete and create operations is detected, you may not be dealing with true multi-activity.


Solution 2: Create a Multi-Active Satellite with Delta Checking

If the data is genuinely multi-active, the most straightforward approach is to create a multi-active satellite. Perform delta checks on the combination of the business key and the multi-active attribute (e.g., customer ID). This ensures that only changes within specific multi-active groups are loaded into the satellite.

However, this approach necessitates a specialized point-in-time (PIT) table, as the CDC data provides changes at the finest granularity (row level). You’ll need to consider both the load date timestamp and the multi-active attribute when querying the satellite to retrieve the most recent delta.


Solution 3: Remodel with Satellites on Links

Another option is to remodel your Data Vault structure by placing the satellite on the link. In this approach, the multi-active attribute becomes a dependent child key in the link, and a standard satellite is created on this link. This simplifies the handling of multi-activity within the link itself.

However, it’s important to note that the satellite in this case describes the relationship between the customer and other components, rather than directly describing the business object hub. Evaluate whether this modeling change aligns with your downstream querying requirements.


Solution 4: Use the Primary Key as a Technical Hub

As a last resort, you can use the primary key from the source system as a technical hub. This involves creating a hub for the primary key values (e.g., customer IDs) and linking it to the real customer hub using a same-as link. While not the preferred method, this can be a workaround in situations where other solutions are not feasible.


Additional Considerations

  • CDC Data vs. Full Extracts: When dealing with full data extracts, even if only a part of the multi-active component changes, it’s best practice to insert the full block of data with the newest load date timestamp. This simplifies downstream processes and eliminates the need for a specialized PIT table.
  • Non-History Links: If the CDC data represents transactional events and is analyzed as such, consider loading it into non-history links instead of satellites. This approach aligns with the transactional nature of the data and facilitates aggregations and trend analysis.

Conclusion

Handling CDC data in Data Vault 2.0 when dealing with multi-active satellites requires a careful assessment of your specific use case and data characteristics. The solutions presented in this article offer various approaches to tackle this challenge, each with its own advantages and trade-offs. By understanding these strategies and selecting the most appropriate one, you can ensure accurate and efficient data loading in your Data Vault environment.

Unify Your Understanding of Data with the dbt Semantic Layer

dbt_semantic_layer_lineage

dbt Semantic Layer

The bigger a corporation gets, the more data is available, and more and more users want to use this data. In a traditional data warehouse (DWH) environment, the DWH typically provides a consumption layer consisting of various information marts, which are then loaded into multiple business intelligence (BI) tools. In there, business users transform and aggregate the data to calculate KPIs and finally make business decisions. 

This, as it turns out, is easier said than done. To derive KPIs out of the data, business users need to have a common understanding of the data provided by the DWH. The information to understand the data is typically hard to find and not accessible in a single place. 

Ultimately, this might lead to multiple departments having different understandings of the same data and deriving different interpretations of the same KPI. Now, it’s very likely that the worst case scenario happens, the trust in your data fades out. This is where a unified semantic layer can help!

From Raw Data To Semantic Layer – With Turbovault And Dbt Cloud

Data Vault is vital for businesses due to its adaptability and scalability in managing dynamic data environments. Its hub-and-spoke architecture ensures traceability and agility, enabling quick adaptation to changing requirements and diverse data sources.

Join our webinar and learn about how to use dbt Cloud with Turbovault and a data modeling tool to implement Data Vault in your organization. Additionally, we will have a look at the dbt Semantic Layer.

Watch Webinar Recording

Components of the dbt Semantic Layer

The dbt Semantic Layer helps simplify the definition and usage of critical business metrics in your dbt project. Metric definitions are centralized to allow consistent self-service usage for all data teams. 

By shifting metric definitions from the BI layer into the modeling layer, business users from different units can be confident that they use the same metric definition, no matter which tool they use. In case a metric definition changes over time, the changes will be applied everywhere it’s used and therefore consistency is enforced. 

To create a unified semantic layer inside your dbt project, the following steps are necessary:

  1. Draft a semantic model
    • To implement a semantic model, a model needs to be drafted first. This should happen via a collaboration between the technical and business teams, to identify core business concepts and how they relate to each other.
  2. Create dbt models that match your semantic model
    • Each object of your semantic model should be turned into a dbt model 1:1. While creating them, put extra work into aligning column names with naming standards and correctly developing the loading logics.
  3. Create new .yml files in the metrics folder
    • Everything related to the dbt Semantic Layer needs to be located in a new folder called “metrics”. In there, .yml files are used to define and configure your semantic models. We recommend creating one .yml file per semantic model.
  4. Define entities
    • In contrast to the name, entities in the semantic model roughly describe columns of semantic models. Entities can be of four different types: Primary, Unique, Foreign, or Natural. Every model needs to have a primary entity, and one entity can be just one column or a SQL expression transforming a column.
  5. Define dimensions
    • A dimension in the dbt Semantic Layer can be seen as different ways to look at your model, i.e., group the data by a specific attribute. Every dimension needs to be tied to a primary entity, which is used for the grouping. A good example is a date column which enables you to group your data by day, month, or year.
  6. Define measures
    • Measures represent aggregations applied to specific columns in your data model. Measures can be used in other measures to calculate more complex ones, and can be defined with various parameters that help create executable SQL code for calculation.
  7. Define metrics
    • Metrics represent the language of the business users. They can be of various types, which represent different kinds of calculations. Some examples include Conversion metrics, Cumulative metrics, Derived metrics, and Ratio metrics. They are always based on measures and represent the last building block of the semantic layer.

 

This is how the semantic layer is reflected in your dbt lineage:

Consuming the dbt Semantic Layer

Once your dbt project has a semantic layer defined, it can be opened to data consumers. The dbt Semantic Layer allows various BI tools to directly connect to your dbt Cloud project and integrate metrics, measures, and filters directly into the tool of choice.

The following tools are already natively supported: 

  • Tableau
  • Google Sheets
  • Microsoft Excel
  • Hex
  • Klipfolio PowerMetrics
  • Lightdash
  • Mode
  • Push.ai
  • Steep

Other tools can be integrated with custom integrations, as long as they support generic JDBC connections and are compatible with Arrow Flight SQL.

Conclusion

The dbt Semantic Layer can help regain trust in your data warehouse. By moving calculations back from the business users into the data model, a common definition of business KPIs is created. 

Although there is some additional setup required, implementing a semantic layer can highly improve the value generated by your data. Integrating it into the BI tools of your business users even simplifies the way your data is consumed. 

If you want to know more about the dbt Semantic Layer and learn how it fits into a Data Vault 2.0 powered Data Warehouse, make sure to join our next webinar on August 13th!

The Benefits of Data Warehouse and Data Vault

Watch the Video

Demystifying Data Warehouse and Data Vault

In today’s data-driven business landscape, the terms “data warehouse” and “Data Vault” are frequently tossed around. But what exactly are they, and why should businesses invest in them? This article aims to demystify these concepts, addressing common questions from a business perspective. We’ll delve into the reasons behind implementing a data warehouse or Data Vault, how to explain their value to non-technical stakeholders, and when companies typically start investing in these solutions.



Why Do We Need Data Warehouses and Data Vaults?

Before diving into the benefits of data warehouses and Data Vaults, let’s explore the challenges businesses face without them. Many traditional organizations grapple with:

  • Limited Data Access: Data is often siloed, accessible only to specific departments, hindering cross-functional collaboration and insights.
  • Lack of Structure: Ad hoc queries and a lack of standardized data processes lead to inefficiencies and unreliable results.
  • Expensive Trial and Error: Decision-making based on incomplete or inaccurate data can be costly and time-consuming.
  • Unreliable Data: Inconsistent data sources and ad hoc reporting can lead to errors and misguided decisions.

Data warehouses and Data Vaults address these challenges by providing a centralized, structured, and reliable repository for data. They enable:

  • Data Integration: Combining data from various sources into a single source of truth supporting a comprehensive data strategy
  • Enhanced Decision-Making: Empowering data-driven decision-making with accurate and timely insights.
  • Historical Analysis: Enabling trend analysis and forecasting based on historical data.
  • Improved Data Quality: Implementing data quality management processes to ensure accuracy and consistency.
  • Scalability and Flexibility: Adapting to evolving business needs and data volumes.
  • Auditability and Compliance: Maintaining data lineage and ensuring compliance with regulations like GDPR.

Explaining Data Vault to Non-Technical Stakeholders

When communicating the value of a Data Vault to commercial executives or non-technical stakeholders, it’s crucial to emphasize that it’s more than just a data model. Data Vault 2.0 is a comprehensive system of business intelligence, encompassing methodology, architecture, and modeling.

Highlight the key benefits Data Vault offers:

  • Agility: Agile development methodologies enable quick responses to changing business requirements.
  • Scalability and Flexibility: The architecture allows for seamless growth and adaptation.
  • Consistency and Auditability: Data Vault ensures data accuracy, traceability, and compliance.

Use relatable examples to illustrate how Data Vault addresses specific business challenges. For instance, you could explain how it streamlines data integration from multiple sources, ensuring a single version of the truth for customer information.


When Do Companies Start Investing in Data Warehousing?

There’s no one-size-fits-all answer to this question. The ideal time to invest in data warehousing depends on several factors, including:

  • Data Volume: The amount of data your company generates and the complexity of your data landscape.
  • Business Needs: The extent to which your business relies on data for decision-making and operations.
  • Strategic Goals: The importance of data-driven insights in achieving your company’s strategic objectives.

While larger enterprises with vast data volumes often invest in data warehouses early on, even smaller companies can benefit from them. Starting early, even with a smaller data warehouse, can be advantageous as it allows for gradual expansion and integration of external data sources as the business grows.


Conclusion

Data warehouses and Data Vaults are essential tools for businesses aiming to harness the power of their data. They address common data challenges, enable better decision-making, and offer a range of benefits that extend beyond mere reporting.

By understanding the key reasons for implementing these solutions and effectively communicating their value to stakeholders, you can build a strong case for investment and ensure that your organization reaps the rewards of a data-driven future.

Why Do We Need Hubs in Data Vault?

Watch the Video

Why Hubs in Data Vault are Essential

Data Vault modeling is a powerful methodology for building robust and scalable data warehouses. One of its core components, the Hub, often raises questions among practitioners and stakeholders. Why do we need hubs? Can’t we just simplify the model by putting business keys directly into satellites? In this article, we delve into the reasons behind the existence of hubs and explore scenarios where deviating from the standard practice might be acceptable.



The Role of Hubs in Data Vault

Hubs play a pivotal role in Data Vault by storing a distinct list of business keys. These keys serve as unique identifiers for real-world entities, such as customers, products, or employees. Hubs provide several critical benefits:

    1. Data Integration: Hubs act as anchors for integrating data from disparate source systems. By consolidating different representations of the same entity into a single hub, you ensure consistency and accuracy across your data warehouse.
    2. Scalability: Hubs facilitate seamless scalability. When new data sources are introduced, you can simply add the business keys to the existing hub without the need for major model refactoring. This simplifies the onboarding of new data and reduces the risk of introducing inconsistencies.
    3. Auditability: Hubs maintain a clear lineage and audit trail for your data. The load timestamp in a hub functions as a “first seen” date, making it easy to track the evolution of your data over time.
    4. Granularity: Perhaps most importantly, hubs define the granularity of multiple downstream objects, including information marts and dimensions. This granularity is crucial for accurate reporting and analysis, making hubs indispensable for many use cases.

Why Not Put Business Keys in Satellites?

While hubs are generally considered best practice, there are rare instances where storing business keys in satellites might be justifiable. One such scenario is when a business key represents an entity that currently lacks descriptive data and is not actively queried.

For example, consider an employee dataset that includes the vehicle identification number (VIN) of the employee’s company car. If there’s no additional information about the car and no immediate need to query it, treating the VIN as a descriptive attribute within the employee satellite might be acceptable.

However, if the need to query or analyze data related to company cars arises in the future, a refactoring strategy called “Hub It Out” can be employed. This involves extracting distinct VIN numbers from the employee satellite into a new hub, creating links between the employee and car hubs, and potentially adding satellites with descriptive data about the cars.


Important Considerations about Hubs

While the above scenario demonstrates a valid exception, it’s crucial to remember that storing business keys in satellites should be the exception, not the rule. Hubs offer numerous benefits in terms of data integration, scalability, auditability, and granularity, making them essential for most Data Vault implementations.

Before deviating from the standard practice, carefully assess whether the potential benefits of storing business keys in satellites outweigh the potential drawbacks, such as increased storage costs, redundancy, and a less elegant data model.


Conclusion

In conclusion, hubs are fundamental building blocks in Data Vault modeling, providing a range of benefits that contribute to the overall integrity, scalability, and usability of your data warehouse. While there are rare cases where storing business keys in satellites might be justifiable, it’s crucial to carefully weigh the pros and cons before adopting this approach. By adhering to Data Vault best practices and understanding the specific requirements of your use case, you can ensure that your data warehouse is optimized for performance, maintainability, and long-term success.

Data Vault ROI

Watch the Video

Calculating ROI for Data Warehouse and Data Vault Investments

Investing in a data warehouse or Data Vault is a strategic decision with the potential to transform how organizations leverage their data. However, quantifying the return on investment (ROI) for such projects can be challenging, as it often involves intangible benefits alongside measurable cost savings. This article delves into the key considerations for evaluating ROI in data warehousing and Data Vault initiatives.



Understanding the Dual Nature of Benefits

The benefits of data warehouse and Data Vault investments can be broadly categorized into two types:

  1. Tangible Benefits: These are easily measurable outcomes, such as increased revenue, reduced costs, and improved efficiency. For example, a data warehouse project might lead to a significant increase in sales due to enhanced customer personalization or a reduction in storage costs due to optimized data management.
  2. Intangible Benefits: These are less quantifiable but equally valuable outcomes, such as improved decision-making, enhanced customer satisfaction, and faster time to market. While these benefits are harder to measure directly, they play a crucial role in driving long-term value for the organization.

Evaluating ROI: A Two-Pronged Approach

To effectively assess ROI, consider both the tangible and intangible aspects of your data warehouse or Data Vault investment.


1. Quantifying Tangible Benefits

  • Cost Savings: Identify areas where your data warehouse is reducing costs. This could include savings in data storage, data integration, or operational expenses due to automation.
  • Revenue Generation: Explore how your data warehouse is contributing to revenue growth. This might involve improved customer service, new product development, or optimized pricing strategies.
  • Efficiency Gains: Measure how your data warehouse has streamlined processes and reduced manual data handling, leading to time and resource savings.

2. Assessing Intangible Benefits

  • Cost of Inaction: Flip the question and ask yourself, “What are the costs of not having a data warehouse?” This can help quantify the value of risk mitigation, improved decision-making, and faster response to market changes.
  • Pain Point Valuation: Assign a monetary value to the pain points your data warehouse addresses. If your investment solves a customer satisfaction issue or a delivery time problem, estimate the financial impact of these improvements.
  • Benchmarking: Compare your performance against industry benchmarks to gauge how your data warehouse is helping you achieve a competitive advantage.

Conclusion

Evaluating ROI for data warehouse and Data Vault projects requires a holistic approach that considers both tangible and intangible benefits. By carefully analyzing cost savings, revenue generation, efficiency gains, and the value of addressing pain points, you can build a comprehensive picture of the return on your investment. Remember that the true value of a data warehouse extends far beyond monetary gains, encompassing strategic advantages that drive long-term business success.

Multiple Business Keys in One Source Column

Watch the Video

As part of our ongoing Data Vault Friday series, our CEO, Michael Olschimke, engages with a relevant and practical question from our audience about multiple business keys in one source column.

Data Vault Business Keys

In the realm of Customer Relationship Management (CRM) and Master Data Management (MDM), integrating data from diverse systems is a common challenge. This is particularly true when dealing with external identifiers, such as social security numbers, VAT IDs, or passport numbers, which can also serve as business keys in other systems. This article explores how to effectively model such scenarios within a Data Vault framework to streamline data integration and data analysis.



Understanding the Challenge

When a CRM system doubles as an MDM platform, it often houses an “External Identifiers” entity. This entity stores relationships between customers and various external systems, encompassing both external identifiers (like social security numbers) and internal IDs (such as customer IDs in core systems).

The complexity arises when some of these identifiers also function as business keys in other systems, each with its own Data Vault hub. The goal is to combine customer-related data from different domains while maintaining the integrity of these relationships.


Modeling Strategies

  1. Multi-Active Satellite: One approach involves modeling the “External Identifiers” entity as a multi-active satellite attached to the customer hub. This approach accommodates multiple keys or external identifiers linked to a single customer. By including the ID type (e.g., VAT, SSN) in the satellite’s descriptive data, you can distinguish between different identifier types within the group.
  2. Joining Data: If a separate hub exists for a specific business key (e.g., VAT), you can directly join the data from the MDM system’s satellite to the corresponding hub. This approach facilitates data integration and enables easier queries.
  3. Business Vault Links: To optimize join performance, especially with complex business keys or multi-column identifiers, you can create exploration or business links in the business vault. These links implement conditional logic, establishing connections between the customer hub and the relevant business key hubs based on the ID type.

Data-Driven Modeling

A data-driven modeling approach is essential in these scenarios. Start by capturing raw data from the source systems without applying business logic. In the raw Data Vault, treat business keys as descriptive fields within the satellite. Subsequently, in the business vault, you can implement the necessary business logic through links and relationships to integrate data from different domains effectively.


Hub It Out Pattern

The Hub It Out pattern, often used in refactoring, can also be applied here. If a new data set with descriptive data for a specific business key (e.g., corporate car VIN numbers) becomes available, you can extract those values from the existing satellite and create a new hub. Then, establish links between the customer hub and the new hub based on the existing relationships.


Considerations

  • Hash Keys: Consider using hash keys for improved join performance, especially when dealing with complex or variable-length business keys.
  • Data Virtualization: Where possible, virtualize data downstream from the raw Data Vault satellite to simplify the deletion of personal data.

Conclusion

In conclusion, integrating CRM and MDM data involving external identifiers that double as business keys requires a thoughtful modeling approach within the Data Vault framework. By leveraging multi-active satellites, joining data, creating business vault links, and adhering to a data-driven modeling philosophy, you can efficiently combine customer-related data from disparate domains.

Remember that the goal is to create a flexible and adaptable data model that caters to the evolving needs of your business. By employing these strategies and considering the specific requirements of your environment, you can unlock the full potential of your CRM and MDM systems, enabling seamless data integration and enhanced analytical capabilities.

Virtualized Load End Date and the SQL Optimizer

Watch the Video

Virtualized Load End Date in Data Vault

In the world of data warehousing, optimizing query performance is crucial, especially in complex data models like Data Vault 2.0. One common challenge is dealing with virtualized load end dates in reference tables, which can hinder join elimination and impact query execution times. In this article, we delve into this issue and explore potential solutions to enhance query performance.



Understanding the Problem

The scenario involves joining dictionaries to satellite tables based on code attributes, where reference tables lack a point-in-time (PIT) table. This necessitates joining reference satellite entries, using a virtualized load end date, which prevents join elimination.

The problem lies in the complex join condition arising from the virtualized load end date calculation. This complex condition prevents the SQL optimizer from utilizing indexes effectively, leading to performance bottlenecks.


Solution Approaches


1. Snapshot Satellites

One solution is to implement snapshot satellites in the business vault. These satellites use the snapshot date as the timeline in their primary key, aligning with the granularity of outgoing information. This approach is efficient for handling different granularities in incoming and outgoing data and simplifies business logic implementation.


2. Materializing Reference Tables

Another option is to materialize the reference tables. While this can help with query performance, it introduces challenges when dealing with personal data, as deleting such data becomes more complex.


3. Consolidated Reference Table

Consider using a consolidated reference table to capture reference data for different domains with similar structures. This simplifies the data model and potentially keeps the table in memory, reducing disk access.


4. Extending the PIT Table

You could extend the PIT table of the relevant hub or dimension to include reference data. However, this increases redundancy and may not be suitable if numerous reference tables or attributes are involved.


5. Materializing Snapshot Satellite

Materializing the snapshot satellite is another alternative, especially if reference tables do not contain personal data. This simplifies data deletion when necessary.


6. Virtualization

If possible, consider virtualizing downstream data from the raw data vault satellite. This eliminates the need to delete personal data in materialized views, simplifying data management.


Additional Tips

  • Limit reference satellites: In scenarios where only the latest snapshot is needed, limiting reference satellites to type 1 dimensions can simplify maintenance, although it restricts the availability of historical data.
  • Partitioning: Partitioning reference and PIT tables by relevant codes or dates can improve query performance and storage management.

Conclusion

Optimizing query performance in Data Vault 2.0 requires careful consideration of various factors, especially when dealing with virtualized load end dates in reference tables. The solutions discussed in this article offer different approaches to tackle this challenge, each with its pros and cons. Choosing the most suitable approach depends on the specific requirements of your data warehouse environment.

By implementing these solutions and following the additional tips, you can enhance query performance, improve data management, and ensure the efficient delivery of information in your Data Vault 2.0 environment.

Leveraging Snowflake and Data Vault 2.0 for Enhanced Scalability, Performance, and Cost Efficiency

Snowflake and Data Vault 2.0

In today’s fast-paced, data-driven business environment, organizations are continuously seeking innovative solutions to manage and optimize their vast amounts of data. The combination of Snowflake, a state-of-the-art cloud data platform, and Data Vault 2.0, a methodology designed for long-term historical storage of data, presents a game-changing approach to data management. Together, these technologies offer unparalleled scalability, performance, and cost efficiency, making them indispensable tools for any data-centric organization.

Leveraging Snowflake and Data Vault 2.0 for Enhanced Scalability, Performance, and Cost Efficiency

In this webinar, we’ll explore how to leverage Snowflake’s powerful cloud data platform in combination with Data Vault 2.0 to meet modern data warehouse requirements. We’ll cover strategies for achieving enhanced scalability and performance while maintaining cost efficiency.

The first part of the webinar provides an overview of Snowflake’s unique features and their benefits for Data Vault 2.0. The second part delves into key strategies for optimizing development processes, reducing total cost of ownership, and ensuring sustainable data management.

Watch Webinar Recording

Enhanced Scalability and Performance

Snowflake’s architecture, which separates compute and storage, allows for horizontal scaling and ensures high performance even with large data volumes. This scalability is crucial for handling the increasing data demands of modern businesses. Data Vault 2.0’s modular approach complements Snowflake’s architecture, enabling efficient data integration and historical tracking without compromising performance. The synergy between Snowflake’s robust infrastructure and Data Vault’s methodology ensures that businesses can scale their operations seamlessly while maintaining data integrity and accessibility.

Snowflake also allows for dynamic resizing of compute resources to handle varying workloads efficiently. Additionally, Snowflake supports vertical scaling by adding more clusters to handle concurrent queries, ensuring optimal performance during peak times. This ability to dynamically resize and scale vertically ensures that businesses can meet performance demands without incurring unnecessary costs.

Cost Efficiency and Total Cost of Ownership

One of the standout features of Snowflake is its pay-per-second pricing model, which helps businesses minimize costs by only paying for the compute resources they use. This, coupled with Data Vault’s efficient data storage practices, reduces redundancy and optimizes storage costs. Implementing these technologies together significantly lowers the total cost of ownership (TCO), making it a financially viable solution for organizations of all sizes. The cost efficiency is further enhanced by Snowflake’s zero-copy cloning and efficient data storage solutions, which reduce overall data management expenses.

Zero-Copy Cloning

Zero-copy cloning is a feature in Snowflake that allows users to create a clone of a database, schema, or table without actually copying the data. Instead, Snowflake uses metadata pointers to reference the original data. This means that creating a clone is nearly instantaneous and does not consume additional storage.

  • Efficiency: Because no actual data is copied, the process is very fast and storage efficient.
  • Cost-Effective: Since clones share the same underlying data, storage costs are minimized. Only changes made to the cloned data incur additional storage.
  • Flexibility: Clones can be used for various purposes such as testing, development, or analytics without affecting the original dataset.

Efficient Data Storage Solutions

Snowflake offers several features that contribute to efficient data storage, some of which include:

  • Automatic Data Compression: Snowflake automatically compresses data as it is loaded into the system. This reduces storage costs and improves query performance.
  • Columnar Storage Format: Data in Snowflake is stored in a columnar format, which is optimized for analytical queries. This format allows for efficient data retrieval and storage, especially for large datasets.
  • Time Travel: Snowflake’s time travel feature allows users to access historical data without additional storage costs. This is achieved through data versioning, where changes to data are tracked over time, and previous versions can be queried as needed.
  • Data Pruning: Snowflake uses metadata to filter out unnecessary data at query time, which reduces the amount of data scanned and speeds up query performance. This is especially useful for large datasets where only a subset of the data is required for analysis.

Storage Optimization Services: Snowflake continuously manages the storage infrastructure, ensuring that data is stored efficiently. This includes automatic clustering and re-clustering of data to optimize query performance.

Development Agility and Rapid Results

Integrating Snowflake and Data Vault 2.0 accelerates development processes. Tools like dbt (data build tool) automate Data Vault model creation, reducing the time and effort required for data engineering tasks. This modular approach allows for rapid iteration and adaptation to changing business needs, enhancing overall agility. Businesses can quickly respond to new data insights and business requirements, providing a significant competitive advantage. The automation capabilities provided by tools like dbt not only save development costs but also ensure that data models can be updated and maintained with minimal manual intervention.

Cost Monitoring and Dashboards

Effective cost monitoring is essential to managing and optimizing expenses in data management. Building dashboards that monitor costs at the data product or even query level provides granular visibility into spending. These dashboards can highlight the most expensive queries and help identify inefficiencies. By implementing such dashboards, businesses can set up alerts and notifications for cost overruns, enabling proactive management of resources. Snowflake’s robust monitoring and logging features facilitate the creation of these detailed cost dashboards, ensuring that every aspect of data consumption is tracked and optimized.

Free Webinar Recording

This recording will delve into the benefits of these technologies and provide strategies for optimizing your data management infrastructure. Join us for this insightful session and learn how to maximize the value of your data infrastructure investments. Watch here for free

Conclusion

Integrating Snowflake with Data Vault 2.0 offers a powerful framework for modern data management, providing enhanced scalability, performance, and cost efficiency. By leveraging these technologies, businesses can achieve significant savings, streamline development processes, and ensure their data infrastructure is future-proof and sustainable. For more details and to register for the webinar, visit our webinar page.

Microsoft SQL Server Join Elimination and Data Vault PIT Tables

Watch the Video

In this week’s Data Vault Friday, Michael Olschimke, our CEO, addresses a thought-provoking question about join elimination in SQL Server, specifically concerning PIT (Point-in-Time) tables in Data Vault. Let’s explore this topic in depth.



Understanding PIT Tables and Their Role in Data Vault

In Data Vault 2.0, PIT tables are crucial for efficient data retrieval. They are designed to reference the most recent changes (deltas) in satellite tables linked to a hub or link. This setup is instrumental in creating dimension tables that reflect a snapshot of your data at a specific point in time.


The Query at Hand

The question we received revolves around the challenge of optimizing queries using PIT tables. The scenario is familiar: you’ve built dimensions from PIT tables, which join to satellite entries valid at the snapshot time. However, when users query only a few attributes, unnecessary joins to multiple satellites can significantly slow down performance. The goal is to ensure join elimination works effectively, allowing the SQL optimizer to bypass unnecessary joins and thus speed up query execution.


The Join Elimination Dilemma

Join elimination is a technique where the SQL optimizer skips joins that aren’t needed for the final result. This is particularly useful in data warehousing scenarios where large dimension tables are involved. In SQL Server, achieving efficient join elimination, especially with PIT tables, involves several considerations:


Inner Joins vs. Left Joins

Inner joins are commonly used but can be problematic because SQL Server requires foreign keys between tables for optimal join elimination.
Left joins, on the other hand, can sometimes make join elimination easier for the optimizer, as they are less restrictive.


Foreign Keys

For SQL Server to perform join elimination efficiently, there must be a foreign key relationship between the PIT table and the satellite tables.
This foreign key doesn’t have to be active but must exist to inform the optimizer about the relationships.


Column Considerations

Ideally, the foreign key should be based on a single column. In Data Vault, where primary keys often consist of composite keys (e.g., hash key and load date), this can complicate matters.


Practical Steps for Efficient Join Elimination

Here are some practical steps to ensure join elimination works effectively in SQL Server:


Use Left Joins

By using left joins instead of inner joins, you can simplify the optimizer’s task. In many cases, left joins with equi join conditions perform as efficiently as inner joins.


Implement Foreign Keys

Ensure that foreign keys exist between your PIT tables and satellite tables. Even if these keys are inactive, they provide the necessary metadata for the optimizer.


Simplify Foreign Key Structures

Where possible, simplify the foreign key structures to single columns. This can help the optimizer in performing join elimination more efficiently.


Testing and Verification

Given the complexity of join elimination, thorough testing is essential. Ensure that your SQL Server setup supports the required features and verify the execution plans to confirm that unnecessary joins are indeed being eliminated. Tools like SQL Server Profiler and Execution Plan Viewer can be invaluable for this purpose.


Tailoring Information Marts for Performance

Another critical aspect is the design of your information marts. Traditional data warehousing often involved creating large, comprehensive information marts. However, in the Data Vault paradigm, it’s more efficient to create smaller, purpose-specific marts. Here’s why:


Performance Optimization

Smaller marts tailored to specific reports or queries minimize the amount of data processed, thus speeding up query execution.


Virtualization

By virtualizing information marts in SQL Server, you can dynamically assemble the necessary data without physically storing it, reducing storage requirements and increasing flexibility.


Conformed Dimensions

When dimensions are based on the same PIT tables, they are inherently conformed, allowing for seamless joins across different information marts.


Conclusion

Join elimination is a powerful technique for optimizing queries in SQL Server, especially when working with Data Vault 2.0 and PIT tables. By using left joins, implementing foreign keys, and simplifying key structures, you can enhance the performance of your data retrieval processes. Additionally, designing smaller, purpose-specific information marts tailored to specific use cases can further boost efficiency.

If you have more questions or need further clarification, feel free to use the form at https://sfr.ee/DVFriday to submit your queries. You can also join our monthly webinars on WhereScape and dbt.

Benefits of the Shift Left Approach in Agile Software Development

Colleagues working on an Agile Software Development

The Shift Left Methodology

In this article, the Shift Left Approach is introduced as one of the core elements of DevOps methods and a crucial part of Agile Software Development.

The problem with traditional methods is that testing occurs very late in the software development process, leading to significant effort and high costs for fixing errors. The methods of the Shift Left Approach ensure early testing, minimizing both the costs and effort required for error correction while maintaining the quality of the product.

Other than maintenance of the quality of the product, this article will help you understand how the shift-left approach enhances teamwork and collaboration with the team. Similarly, this article focuses on the core aspects of the approach along with its guidelines.



Traditional Methods vs. Shift Left Approach

In traditional software development methods, it is often a problem that tests are placed very late. This behavior can be seen in typical models such as the waterfall model or the V-model. The outcome arises from the sequential nature of the model, with each phase succeeding the previous one. The testing phase is relatively far towards the end of the model, allowing the entire product to be tested in a phase shortly before release.

However, this often leads to problems being recognized only very late, resulting in high effort and associated costs for correcting the issues. The Shift Left Approach offers various guidelines for conducting tests at an early stage, enabling the identification and resolution of problems quickly and cost-effectively.

As depicted in the diagram, testing is shifted to the left, meaning to an earlier stage in development.

Testing is shifted to the left

The fundamental idea is to continuously gather feedback through automated tests as much as possible. This ensures, on the one hand, that the product meets the desired quality standards and, on the other hand, helps to minimize the effort and costs associated with testing.


Shift Left Approach Guidelines

There are no hard and fast rules about the implementation of a Shift Left Approach. However, based on our experience, we have formulated some guidelines suggested by the approach:

  • Collaboration: Different teams, such as developers and Quality Assurance, should work closely and early together, to better address various product requirements and benefit from each other’s expertise.
  • Continuous feedback: Gather feedback from testing and QA throughout the development process, rather than waiting until the end.
  • Automated testing: Automating tests facilitates efficient testing, thereby reducing the time and costs associated with testing.
  • Iterative development: Use an iterative development approach, where small changes are made and tested frequently.
  • Error Prevention: The goal is not only to detect errors but also to take measures to prevent errors early proactively. This can be achieved through training, coding guidelines, and best practices.

Example for Integration of Shift Left Guidelines

Typically, in many companies, different teams are assigned to different tasks in the development pipeline. The development team writes the code for the product, the Quality Assurance team conducts tests and ensures that the product meets the desired quality and the DevOps team attempts to automate and optimize processes for both teams. The Shift Left Approach promotes closer collaboration among these teams.

Testers should be involved early on. When testers are involved in the product planning phase, they can better understand the product’s requirements and design tests accordingly. Additionally, their experience allows them to identify potential issues early on and inform the developers.

Furthermore, testers should have some coding skills. While no one expects them to be experts outside their core competence, if testers can perform minor bug fixes themselves, the development team can focus on more significant issues.

Additionally, developers should consider testability to facilitate quick and straightforward testing for testers, such as using unique IDs for each element and performing some smaller tests themselves. Typical tests that can be done by developers themselves are automated security tests.


Automation of Security Tests

An important type of test are Security Tests. For this purpose, some tools enable automated testing for security risks. Various types of tests can be conducted to examine the product for different errors and vulnerabilities. This could include, for example, a Static Application Security Test, or SAST for short, where source code is checked for vulnerabilities before compilation. Many tools also guide how to address the found issues with minimal effort. These tests can be automated in a way that developers receive independent feedback on their code and can easily address problems.

For instance, the version control tool Github utilizes automation, where with each push, the code is checked for data that poses a security risk. This could include credentials for logging in or authentication keys. If GitHub detects such risks, the developer is automatically alerted to prevent critical information from being exposed to the public.


Conclusion

Overall, this article has demonstrated the core elements of the Shift Left approach and the guidelines it provides. It has shown how late testing can be problematic in traditional software development models and the advantages of incorporating the methods of the Shift Left approach into one’s projects. The results include not only a reduced time investment for a high-quality product but also improved collaboration among individual teams.

However, there are some challenges in restructuring the development pipeline, but the benefits it brings can pay off in the medium term. Scalefrees consultants in the DevOps and Cloud domain are experts in integrating and automating practices such as the Shift Left Approach in their projects and are pleased to assist you.

If you wish to learn more about technical testing, the following article presents various methods to implement tests in a Data Vault-powered EDW.

Watch the Video

Deriving Dimensions from Reference Data

Watch the Video

Reference Data

Welcome to another episode of Data Vault Friday! I’m Michael Olschimke, CEO of Scalefree. Today’s question comes from our online form, and it’s about deriving dimensions from reference data in the raw Data Vault. Specifically, the questioner has several lookup reference tables that they add as Hub or reference tables while creating the raw Data Vault. For example, they have a region table that includes a region code, description, language code, and valid from/to dates.



Understanding Degenerate Dimensions

The query mentions a “degenerated dimension.” To clarify, a degenerate dimension is a dimension attribute, such as the region code, included in a fact table without any additional descriptions. This attribute exists within the fact table itself and doesn’t have a separate dimension table.


Building the Model

To illustrate this, let’s start with a basic structure. Imagine you have a non-historized link containing transaction data, and it references Hubs such as Hub Customer and Hub Product. Additionally, you have reference tables, like a region reference table with a region code and associated descriptions. Here’s a simplified model:

  1. Non-historized Link: Contains transaction data.
  2. Hub Customer and Hub Product: Reference customer and product data.
  3. Reference Hub for Region: Contains the region code.
  4. Reference Satellite for Region: Contains the descriptions, language codes, and valid dates.

This setup allows for capturing changes in reference data, making the model auditable and maintaining historical accuracy.


Creating a Degenerate Dimension

To create a degenerate dimension from the reference data, follow these steps:

  1. Include the Code in the Fact Table: Add the region code directly into your transaction data (the non-historized link).
  2. Determine the Required Attributes: Decide if you only need the region code or additional attributes like the region name.
  3. Create a Fact View: If you only need the region code, simply create a fact view that includes this code.
  4. Prejoin Additional Attributes: If you need additional attributes, prejoin the reference Hub and Satellite to get the region name or other details based on the timeline of your facts.

Handling Time-Based Data

When dealing with time-based data, it’s essential to identify the correct version of your reference data. If you want the latest description of the region (a Type 1 dimension), you can join the latest entry. For a Type 2 dimension (tracking changes over time), join based on the fact timestamp to match the correct version of the region name.


Performance Considerations

Reference tables typically contain a relatively small amount of data, which allows most joins to be efficient. However, if performance becomes an issue, you can consider creating a Point-in-Time (PIT) table in the Business Vault. This table can precompute the current description for each region on a daily basis, making joins faster and more efficient.


Conformed Dimensions

If you prefer to use a conformed dimension, convert your reference table into a dimension table. Use the primary key of the reference table (e.g., region code) as the dimension identifier. This approach involves joining the reference Hub and Satellite to create a dimension view that can be used in your fact tables.


Implementation Steps

  1. Turn Reference Table into Dimension: Join the reference Hub and Satellite to create a dimension view.
  2. Use Reference Code as Dimension Key: The region code becomes the dimension key.
  3. Create Fact View: Include the dimension key in your fact view and join the necessary attributes from the dimension view.
  4. Configure in Dashboard: Set up relationships between your facts and dimensions in your dashboard application for seamless data visualization.

Conclusion

In summary, deriving dimensions from reference data in a Data Vault involves understanding your needs for degenerate or conformed dimensions, handling time-based data appropriately, and ensuring efficient joins. By following these steps, you can create a robust and scalable data model that meets your analytical needs.

Thank you for joining us for this Data Vault Friday session. If you have more questions, submit them at sfr.ee/dvfriday. For additional learning, check out our webinars at Scalefree.to/webinars. If you need answers before next Friday, visit the Data Vault Innovators community we set up with Ignition Data.

Until next time, keep those data questions coming, and we’ll see you next Friday!

AI in Data Warehousing: Fundamental Principles and Applications

AI in Data Warehousing

AI in Data Warehousing

The ability to efficiently store, manage, and analyze data has become a critical aspect for any organization. The field of Data Warehousing is an integral part of the process by providing structured, centralized storage for large volumes of data and is an important element of business intelligence.
However, the increasing volume and complexity of data is becoming increasingly challenging to manage. That said, AI is already shaping many processes in a wide variety of areas, including data warehousing.

This article explores the basic principles and applications of artificial intelligence and how AI can be integrated into an Enterprise Data Warehouse, by using it to enhance the design and operation of data warehouses, as well as enabling the development of more challenging data science applications like machine learning or predictive analytics.

AI in Data Warehousing: Fundamental Principles and Applications

This webinar delves into the essential principles of Artificial Intelligence (AI) and its transformative role in data warehousing. We explore how AI and Machine Learning (ML) not only enhance the construction and operation of data warehouses but also leverage the vast amounts of data stored within them for developing powerful applications, e.g. for predictive analytics.

You will gain an understanding of the foundational AI technologies driving these advancements and how they intersect with data science and machine learning to optimize data storage, improve decision-making, and unlock new business insights. This webinar provides valuable perspectives on integrating these technologies into your data warehousing practices.

Watch webinar recording

Foundational Principles of AI

Let’s start by categorizing the term artificial intelligence, as well as other terms such as machine learning, deep learning, or generative AI, in order to clarify the context and what exactly is hidden behind these terms. 

Artificial Intelligence is now more of a generic term and represents a broad field. In general, the term AI encompasses techniques that allow computers to emulate human behavior, enabling them to learn, understand language, make decisions, recognize patterns, and solve complex problems in a way that is similar to human intelligence.

AI Graphic

Machine Learning (ML) as a Subset of AI

Machine Learning is a subset of the field of artificial intelligence and, even if less present as a term, one of the most important. Machine Learning uses advanced techniques and algorithms to recognize patterns in large amounts of data, allowing machines to learn and adapt autonomously to make inferences or predictions based on the data. Unlike traditional programming where tasks are executed based on clear and defined instructions, machine learning relies on statistical analysis based on input data to output data values that fall within an expected range. Essentially, machine learning enables computers to learn and interpret data without being explicitly programmed for each individual case.

Deep Learning (DL) – A Deeper Dive into ML

Deep Learning is a more specialized sub-area of machine learning & AI and uses deep neural networks to perform in-depth data processing tasks to recognize complex patterns.

Deep learning uses many layers to extract high-level features from raw data as input, in a sense simulating the way a human brain works.

In this respect, deep learning is particularly strong in processing very large amounts of data, using it to learn complex patterns in order to solve a wide variety of tasks. Well-known examples of this are tasks such as image or voice recognition, which every smartphone is capable of.

Generative AI – The Creative Aspect of Deep Learning

Generative AI is a term that has recently become more widely known, especially due to ChatGPT. Generative AI is a sub-area of deep learning and includes deep learning models that generate new content, such as text, images, code, or even videos, based on data on which they have been trained on. They are able to generate novel outputs without explicit instructions that do not directly replicate the training data. This opens up completely new possibilities such as generating large texts, complex images, or even music.

The Role of AI in Building Data Warehouses

AI is changing everything, including data warehousing. As such, it can help to enhance the efficiency and effectiveness of your data warehouse from the ground up. From the design and structure to the ongoing data management processes, there are many opportunities where AI can help.

AI has the power to address a data warehouse’s biggest challenges: performance, governance, and usability. This is data intelligence and will revolutionize the way you query, manage, govern, and visualize your data.

Enhancing Design and Structure

The architecture of a data warehouse is critical as it needs to support efficient data querying and scalability while maintaining performance. By analyzing usage and query patterns, AI algorithms can suggest the most effective data models and indexing strategies. This not only speeds up the retrieval of information but also ensures more agile data handling when scaling or integrating new data sources.

Automating Data Integration, Cleaning, and Transformation

By letting AI take over low-level tasks, data engineers can focus on higher-level tasks such as designing data models, training machine learning algorithms, and creating data visualizations. For example, The Coca-Cola Company uses AI-powered ETL tools to automate data integration tasks across its global supply chain to optimize sourcing and procurement processes. 

But AI can also support the actual developers in their work, enabling them to work faster and more efficiently. Developers can use AI to debug problems in their code faster by using AI-based code generation and analysis, such as Github Copilot, which completes, refactors, and debugs code in real time directly in the IDE. 

The automation of performance tuning of data warehouse workloads can also be improved with predictive optimizations, which can save a considerable amount of costs.

AI can also scale and automate governance using automated tagging, documentation, and natural language search across all data and assets in an organization.

Right up to the possibility that business users could use natural language to interact with the data, ask questions, and build dashboards, the possibilities are endless and we are only at the very beginning of these developments.

Using a Data Warehouse for AI Applications

The nature of a data warehouse, a structured and centralized repository that aggregates data from multiple sources within an organization, makes it a perfect foundation for building and training AI applications. AI models require large amounts of data with as much variability as possible that are well structured and have a high data quality, which a data warehouse can perfectly provide for an organization to train its own AI model. 

For example, a data warehouse in a retail context might store years of customer purchase history, demographics, and product information. Data that can be used to train AI models to predict future buying trends or recommend products effectively.

Enhancing AI Capabilities with Data Vault

Data Vault 2.0 in particular sets the exact right conditions for training your own AI models by leveraging the structured, reliable data framework that Data Vault 2.0 provides. 

Here’s are some examples how this integration can benefit AI-driven analytics:

  • Improved Data Quality: The organized and auditable structure of Data Vault 2.0 ensures higher data quality and consistency, which is important for training accurate AI models. Clean, well-structured data models reduce the time spent on data preparation and increase the reliability of AI predictions.
  • Enhanced Historical Analysis: The comprehensive historical data captured by Data Vault 2.0 allows AI models to perform more accurate trend analysis and forecasting. This capability is especially valuable in sectors like finance and retail, where understanding long-term trends can significantly impact strategic decisions.
  • Data Reliability and Lineage: Data Vault 2.0’s robust framework ensures data reliability through its unique architecture that captures data from various sources while maintaining its lineage. This means every piece of data in the system can be traced back to its origin, providing transparency and trust in the data used for AI models.

Conclusion

Integrating artificial intelligence (AI) into data warehousing significantly enhances the efficiency and effectiveness of data storage, management, and analysis. By applying AI techniques, organizations can automate data integration, cleaning, and transformation processes, leading to more accurate and timely insights. Furthermore, AI-driven optimization of data warehouse design and structure ensures improved performance and scalability. Embracing AI in data warehousing not only streamlines operations but also empowers businesses to leverage advanced analytics, such as machine learning and predictive analytics, thereby unlocking deeper insights and fostering informed decision-making.

Close Menu