Skip to main content
search
0

CDC, Status Tracking Satellite, and Delta Lake

Watch the Video

Understanding CDC and Status Tracking Satellites in Data Vault

The integration of Change Data Capture (CDC) data into a multi-active satellite and status tracking satellite is a nuanced topic. In a previous session, the focus was primarily on multi-active satellites, leaving the status tracking satellite underexplored. This article will dive deeper into their utility, especially in the context of CDC data.

A status tracking satellite in Data Vault serves a specific purpose: it tracks the appearance, updates, and disappearance of business objects in the source system. However, if CDC data is available, this tracking becomes inherently simpler because CDC already provides explicit information about creates (C), updates (U), and deletes (D). Thus, creating a separate status tracking satellite may not be necessary.

In contrast, when dealing with full extracts (non-CDC data), a status tracking satellite can be invaluable. It enables you to derive creates, updates, and deletes by comparing consecutive extracts, identifying the first appearance (create), differences between records (update), and removal of records (delete). This can be achieved by maintaining a delta check mechanism and creating a robust satellite to store these events.



Handling Multi-Active Data in Status Tracking Satellites

Multi-active data arises when the same business key appears multiple times in the source system, distinguished by another attribute (e.g., customer ID). In these cases, status tracking satellites must accommodate the additional attributes, ensuring that individual records are not incorrectly marked as deleted when only one instance of the multi-active data changes.

For example, consider a scenario where a customer appears twice in the source system with different technical IDs but the same business key. A delete operation on one ID should not remove the customer from the source entirely. To address this, a status tracking satellite should maintain a composite key combining the business key and the multi-active attribute.

This approach ensures that changes are tracked at the appropriate granularity, preserving the integrity of multi-active records. Additionally, adding the CDC information (CUD columns) directly to the main satellite can simplify tracking without requiring a separate status tracking satellite.

Data Vault and Delta Lake: Complementary Approaches

The second question posed is whether Data Vault adds value when Delta Lake is already in use. To address this, it’s essential to understand the distinctions between the two. Delta Lake is a technology, whereas Data Vault is a methodology. While Delta Lake provides a robust framework for handling data in its native form (e.g., JSON, XML) and managing deltas, it does not prescribe how to model or process data for business purposes.

Data Vault, on the other hand, excels in its structured, agile methodology for modeling data. It provides a clear architecture, including hubs, links, and satellites, which organize data effectively for analytics and reporting. This is where Data Vault complements Delta Lake by applying a methodical approach to the data stored in the lake.

In practice, Delta Lake can serve as the persistent staging area (landing zone) in a Data Vault architecture. The metadata and delta tracking capabilities of Delta Lake enhance the efficiency of loading and processing data, while Data Vault ensures that the data is modeled and structured to meet business requirements. This synergy allows organizations to leverage the strengths of both technologies, creating a powerful data ecosystem.

Combining CDC Data with Data Vault and Delta Lake

By integrating CDC data, Delta Lake, and Data Vault, organizations can achieve an optimized data architecture. CDC data feeds directly into Delta Lake’s storage layers (bronze, silver, gold), which in turn populate the Data Vault’s hubs, links, and satellites. This integration streamlines data ingestion, transformation, and querying while maintaining flexibility and scalability.

For instance, CDC data can directly populate status tracking satellites or be included in a main satellite for simplicity. Meanwhile, Delta Lake’s metadata features support efficient querying and analysis, enabling the Data Vault layer to focus on applying business logic and producing meaningful insights.

By combining these tools and methodologies, data teams can build robust, agile data platforms that support modern analytics and decision-making needs.

Key Factors for Data Vault Automation

Key Factors for Data Vault Automation

We are excited to announce an upcoming webinar, “Key Factors for Data Vault Automation,” where you’ll gain valuable insights into leveraging automation to optimize your data warehousing processes. This session will feature expert speakers who will explore how Datavault Builder can streamline data modeling and significantly enhance your Data Vault implementation.

Automation has become essential in data warehousing, enabling organizations to reduce manual effort, minimize errors, and boost efficiency. Our speakers will share best practices and real-world use cases that demonstrate the transformative power of automation in Data Vault projects. You’ll learn actionable strategies to ensure a smoother, faster, and more reliable data modeling process.

Whether you are new to Data Vault or seeking ways to fine-tune your existing setup, this webinar will provide practical knowledge and tools to help you succeed. Don’t miss this opportunity to discover how to make the most of automation and take your data warehousing efforts to the next level.

Register now to secure your spot and stay ahead in the ever-evolving world of data warehousing!

Webinar Details

  • Date: November 20th 2024
  • Time: 15:00 – 16:00 CET
Watch Webinar Recording

The Need for a Data Warehouse

The Need for a Data Warehouse

In today’s rapidly evolving digital landscape, businesses generate and collect vast amounts of data. However, data alone isn’t enough to ensure success—it’s about how we manage, analyze, and utilize this data. This brings us to a fundamental question: why do we need a data warehouse or Data Vault in our business model?



From Gut Feeling to Data-Driven Decisions

Many organizations, especially mid-sized firms, often rely on gut feelings for decision-making. While experience-based intuition has its place, it also carries a significant risk of error, especially as businesses grow and their operations become more complex. A data warehouse is a game-changer in transforming such organizations into data-driven entities where decisions are made based on facts and analytics rather than instinct alone.

As businesses scale, leaders lose the ability to maintain a complete overview of every operational detail. This is where a systematic approach to organizing, analyzing, and processing data becomes essential. A data warehouse centralizes and standardizes data from multiple sources, making it easier to extract insights and support rational, informed decisions across all levels of the organization.

Enhanced Business Process Automation

With a centralized repository like a data warehouse, businesses can unlock opportunities for automation. Automated processes can access, analyze, and utilize data seamlessly, leading to improved efficiency and accuracy. Whether it’s optimizing workflows or refining customer interactions, having reliable, accessible data is crucial for these systems to function effectively.

Democratizing Data Access

A significant aspect of a data-driven organization is making relevant data accessible to employees across roles. Every employee, from frontline workers to C-suite executives, is expected to make decisions. For these decisions to be effective, they need to be grounded in data.

However, this doesn’t mean unrestricted access. Data warehouses must incorporate robust security measures, such as role-based access, to ensure that employees can access only the data necessary for their responsibilities. This combination of widespread accessibility and stringent security supports a culture of informed decision-making while safeguarding sensitive information.

Do You Need a Data Vault?

When it comes to managing enterprise data, many organizations face additional challenges: integrating multiple source systems, ensuring data security and privacy, and handling real-time and batch processing simultaneously. A Data Vault model offers a comprehensive solution to these challenges by supporting integration, auditability, and adaptability.

While some businesses may start with simpler models, their requirements will inevitably evolve. Laws and industry standards may impose new data privacy mandates, or management may seek to leverage more advanced analytics capabilities. A well-designed Data Vault can accommodate these future needs without requiring a complete overhaul of the existing system.

The Future-Ready Advantage

One of the standout features of a Data Vault is its flexibility. It allows businesses to scale and adapt their data management strategies as they grow. Whether it’s adding new data sources, meeting stricter compliance requirements, or enabling more sophisticated analytics, the Data Vault model supports incremental changes without disrupting existing operations.

This adaptability makes it an invaluable asset for enterprises looking to future-proof their data strategies. While simpler solutions might suffice for today’s needs, they may not hold up against tomorrow’s demands. A Data Vault ensures that businesses are prepared for the inevitable increase in complexity and volume of their data requirements.

Conclusion

Investing in a data warehouse or Data Vault isn’t just about technology—it’s about fostering a culture of informed, data-driven decision-making. From streamlining processes to democratizing data access, these systems provide the foundation for businesses to thrive in an increasingly competitive and data-centric world. Whether you’re just starting your data journey or looking to enhance your existing capabilities, now is the time to prioritize a robust, scalable data solution.

As your business grows, so will your data requirements. A data warehouse or Data Vault not only meets these needs but positions your organization to capitalize on the full potential of its data—today and in the future.

Watch the Video

Expanding Agile Practices and Embracing Data Governance

Expanding Agile Practices and Embracing Data Governance for Modern Organizations

Why change a running system? In a rapidly evolving digital landscape, embracing new methodologies and exploring broader perspectives becomes essential. Agile practices are more than just Scrum; they encompass a wide array of approaches aimed at optimizing organizational workflows. This journey led us to become certified trainers in Disciplined Agile and integrate this fresh knowledge into our projects. However, staying agile also means continuously seeking new methodologies and frameworks, like Data Mesh, that align with modern data needs.



Adapting Agile Principles for Data Governance and Data Mesh

Our journey has taken us deeper into the realms of data architecture, governance, and the intersection with GDPR and organizational needs. Integrating data governance within an agile framework ensures a structured yet adaptable approach that allows innovation while maintaining control and data security. This shift promotes better domain ownership, federated governance, and viewing data as a product.

Key Components for Effective Data Mesh Implementation

  • Standardized DevOps: Unified processes and seamless integration of tools to facilitate automation and consistency.
  • Data Catalogue: A centralized source for metadata, data lineage, and ownership information to enhance transparency and usability.
  • Federated Governance: Collaborative frameworks where domain leaders establish platform rules and sharing protocols.
  • Governed Platform: A managed platform that supports efficient data sharing and collaboration across teams.
  • Automation: Streamlined data provisioning, especially in the Data Lake and Data Vault, to avoid delivery bottlenecks.
  • Release Management: Organized release notes to communicate new data products and functionalities effectively.
  • Standard Guides: Comprehensive guidelines to ensure consistent data handling throughout the organization.

Why Data Governance Matters

Data governance should be a central focus for modern organizations. Research shows that only 11% of companies have a robust data governance structure, yet those that do experience significant benefits:

  • Improved Efficiency: Effective governance can reduce data search time by up to 50% (IBM).
  • Enhanced Decision-Making: Strong governance leads to 40% faster decision-making due to better data access (Databricks).
  • Long-term Value: By 2027, 60% of companies may not realize their AI project potential due to inadequate governance (Gartner).

Core Elements of Data Governance

To establish an effective governance framework, focus on:

  • Ownership: Clear roles for data stewardship and accountability for data lifecycle management.
  • Accessibility: Authorized, user-friendly data access for stakeholders.
  • Security: Robust data protection policies including encryption and access control.
  • Quality: Continuous monitoring and improvement of data accuracy, completeness, and consistency.
  • Transparency: Comprehensive documentation and metadata management to foster data literacy.

By integrating these agile and data governance principles, organizations can unlock true potential, fostering both innovation and compliance.

Watch the Video

Federate Multiple Sets of Non-Historized Data

Federate Multiple Sets of Non-Historized Data

Welcome to another edition of Data Vault Friday! In this session, we’ll address an interesting question: how to integrate a second system into an existing setup with a non-historized link. This specific case involves transactions that can be updated for the last two months. We’ll explore the necessity of satellites, the role of non-historized links, and the concept of federating data in the business vault or information mart (BV/IM).

Let’s break it down, discuss the challenges, and provide actionable solutions.



The Challenge

The question revolves around a scenario where two systems provide transactional data. The goal is to integrate these datasets into a single fact entity in the information mart, enabling aggregation across both systems. A specific challenge arises when transactions from one system need to be updated within the last two months, necessitating a satellite to track those changes. So, how do we effectively design and implement this integration?

Understanding Non-Historized Links

Non-historized links play a central role in integrating datasets across source systems. These links typically store only insert operations, ensuring that facts—such as transactions—remain immutable. However, the reality is often more complex, especially when updates or changes occur.

For transactions that change, the non-historized link’s primary purpose is to maintain a consistent structure. Attributes frequently used for aggregations, such as amounts, are stored directly in the link structure. Meanwhile, descriptive attributes that casual users don’t aggregate (e.g., line descriptions) are placed in satellites to keep the link slim and optimized for storage and performance.

Federation Across Systems

One of the first decisions when integrating two systems is determining whether their data can share the same link structure. The key considerations are:

  • Semantic Consistency: Do both systems provide the same type of transactional data?
  • Granularity: Are the transactions at the same level of detail?
  • Key Uniqueness: Is there any overlap in identifiers that might require adding source-specific elements like a tenant ID?

If the answers indicate alignment, both systems can load into the same non-historized link. However, each source system would have its own satellite, ensuring clear separation of descriptive attributes.

Handling Updates in Transactions

Updates to transactional data pose a significant challenge, particularly when attributes such as amounts or hub references change. The typical strategy involves **technical counter transactions**, which ensure insert-only behavior while maintaining an accurate representation of changes. Here’s how it works:

  1. For a new transaction, insert the original record into the link.
  2. If an update occurs (e.g., an amount changes from €5 to €7), insert two additional records:
    • A “counter” record negating the original amount (e.g., -€5).
    • A new record reflecting the updated amount (e.g., +€7).
  3. Aggregations will reflect the latest transaction value while retaining a complete history of changes.

This approach keeps the data model consistent and avoids direct updates to the non-historized link.

Federation in the Business Vault and Information Mart

While the raw Data Vault integrates data from multiple systems, the Business Vault (BV) and Information Mart (IM) are where federated datasets shine. By standardizing and transforming data, you can provide end-users with a unified view of all transactions.

The information mart combines the transactional data from both systems into a single fact entity. This entity supports aggregations across all transactions, ensuring that analytical queries deliver accurate and actionable insights.

Best Practices and Recommendations

To ensure a robust implementation, consider these best practices:

  • Design for Scalability: Keep the link structure slim to optimize performance, especially with large datasets.
  • Clear Satellite Separation: Create one satellite per source system to maintain clarity and prevent mixing attributes from different sources.
  • Use Technical Counter Transactions: Handle updates efficiently while preserving insert-only behavior.
  • Test Aggregations: Validate the unified fact entity in the information mart to ensure accurate reporting across systems.
  • Monitor Data Quality: Regularly check for inconsistencies, especially when integrating systems with different update patterns.

Watch the Video

Snapshots in dbt

Watch the Video

Snapshots in dbt: A Quick Overview

dbt snapshots allow you to “look back” at historical data by capturing changes in your database tables. This is achieved by implementing type-2 Slowly Changing Dimensions (SCDs), which track how a row has changed over time. For example, you can keep track of an order’s status as it moves from ‘pending’ to ‘shipped’ to ‘delivered’.



How Snapshots Work

When you run dbt snapshot, it creates a new table with the same columns as your source table, plus additional metadata columns like dbt_valid_from and dbt_valid_to. On subsequent runs, dbt updates the dbt_valid_to column for changed records and adds new records with dbt_valid_to set to null.

dbt offers different strategies for detecting changes, including comparing timestamps, specific column values, or a unique key. You can also configure dbt to track deletions.

Snapshots Best Practices

  • Store snapshots in a separate schema to keep your historical data organized.
  • Snapshot raw data using the source function and select all columns.
  • Avoid joins in your snapshot queries; instead, create separate snapshots and join them downstream.
  • If you need to perform transformations, do so in ephemeral models before snapshotting.
  • Schedule snapshots to run frequently to capture changes regularly.

By following these best practices, you can ensure that your dbt snapshots are accurate, efficient, and easy to maintain.

Non-Historized Links and Their Satellites

Understanding Non-Historized Links and Satellites

In the world of data vault modeling, we often encounter situations where we need to deal with non-historized links. A non-historized link is a structure used to capture relationships between entities (hubs) in a data vault. Unlike historized links, these structures are not designed to track changes over time, making them particularly useful in certain high-performance scenarios.

However, when integrating with systems that can update transactions for recent periods, additional structures like non-historized satellites become essential. These satellites allow us to track updates to attributes associated with the link while maintaining the integrity and performance of the data model.

Design Considerations for Non-Historized Links

The design of non-historized links requires careful planning, particularly when dealing with wide datasets or platforms with specific storage constraints. A wide link, for instance, might contain attributes like transaction ID, account segments, accounting document, accounting period, and timestamps. These attributes uniquely identify a transaction and establish relationships with other entities in the data vault.

Here are key considerations when designing non-historized links:

  • Include Essential Identifiers: Attributes required to uniquely identify a transaction and establish relationships with hubs should always be part of the link. This typically includes the transaction ID and any necessary alternate keys.
  • Balance Row Width: Including too many attributes in the link can lead to performance issues on row-based storage systems, as wide rows reduce the number of records stored per page.
  • Use Satellites for Additional Attributes: Attributes that are not essential for casual queries or dashboard aggregation should be moved to a satellite structure to keep the link lean and efficient.

Role of Non-Historized Satellites

Non-historized satellites play a critical role in complementing non-historized links. They store additional descriptive attributes that are not frequently queried or aggregated by typical users. This approach minimizes the join operations required during querying, improving performance.

For instance, while attributes like debit and credit amounts might be included in the link due to their importance in aggregations, other less frequently used attributes, such as transaction type or invoice descriptions, can reside in the satellite. This division ensures a balance between storage efficiency and query performance.

Query Performance Optimization

A significant design challenge lies in ensuring optimal query performance, particularly on column-based storage systems. Joining a link with its satellite can significantly impact performance. To mitigate this, consider the following:

  • Prejoin Frequently Queried Attributes: Move attributes that are commonly used in aggregations or group-by operations into the link structure to eliminate the need for joins.
  • Tailor Satellites for Specific Use Cases: Design satellites to serve specialized needs, such as data science or detailed analysis, where users can afford the higher cost of join operations.
  • Adopt Hybrid Storage Strategies: Leverage hybrid storage solutions, such as Snowflake’s mini partitions, to combine row-based and column-based advantages and support wider rows for frequently queried data.

Practical Example

Let’s consider a scenario with a financial transactions table containing over a billion rows. The table includes attributes such as transaction ID, account segments, accounting document, accounting period, debit amount, credit amount, and timestamps.

Here’s how we can structure the non-historized link and satellite:

  • Non-Historized Link: Include transaction ID, account segments, accounting document, accounting period, debit amount, and credit amount. These attributes are essential for aggregations and dimension references.
  • Non-Historized Satellite: Store additional attributes like transaction type and invoice descriptions, which are not frequently queried by casual users.

This structure ensures that typical queries for dashboards and reports are efficient, while still supporting detailed analysis for specialized users.

Conclusion

Designing non-historized links and their satellites requires a deep understanding of data usage patterns and storage platform constraints. By carefully selecting which attributes to include in the link and which to offload to a satellite, we can strike a balance between query performance and storage efficiency. For most scenarios, prioritizing attributes that support frequent aggregations and groupings in the link structure is key, while satellites serve as a repository for less critical details.

As data modeling continues to evolve, adapting these principles to modern storage technologies and hybrid platforms will be crucial for building robust and performant data vault systems.

Watch the Video

AI Act Insight: Ensuring Responsible AI for Your Business

AI Act Business Intelligence Architecture graphic

AI Act

The Artificial Intelligence (AI) Act has been in force since August 1, 2024 and will gradually come into effect over time. As a new legal milestone, the AI Act brings with it requirements for the use of artificial intelligence in companies in order to promote the responsible development and use of artificial intelligence in the EU. But instead of seeing only hurdles in the risk-based approach of the AI-Act, it also opens up a wide range of opportunities for a future-oriented AI strategy for companies.

In our newsletter, we take a first glance at the new legal framework and its significance for your company. We also show how companies can use legal compliance strategically to gain a competitive edge and promote innovative business models.

AI Act Insight: Ensuring Responsible AI for Your Business

The EU’s AI Act is here! Learn how this groundbreaking regulation impacts your business. We’ll break down the risk-based approach to AI systems, focusing on high-risk applications and compliance requirements. Discover practical steps to ensure transparency and leverage tools like AI-Marts for effective AI governance.

Watch Webinar Recording

What is the AI Act and Why Should You Care?

The AI Act aims to make the use of artificial intelligence within the EU safer and more trustworthy by creating clear rules for the development and use of AI systems. The focus here is on the protection of fundamental rights, health, and safety. The legal framework is based on a risk-based approach: AI applications are divided into four different categories, from minimal to unacceptable risks, depending on the potential threat to society. Specifically, the AI Act provides for the following categories:

  1. Unacceptable Risk: AI systems that pose a threat to human rights or safety, such as those used for social scoring or manipulative practices, are prohibited.
  2. High Risk: These systems are heavily regulated and include applications in critical areas such as biometric identification, healthcare, transportation, education, and employment. Businesses using high-risk AI must meet strict compliance standards.
  3. Limited Risk: These systems face fewer restrictions but must still adhere to transparency requirements. For example, chatbots need to inform users that they are interacting with AI.
  4. Minimal or No Risk: The least regulated category includes AI applications such as spam filters or AI-driven video games.

The AI Act is of great importance for companies, as the new requirements not only entail compliance obligations, but also open up opportunities to gain competitive advantages.

By adapting to the legal requirements at an early stage, companies can strengthen trust among customers and partners, minimize risks, and promote innovation responsibly. A sound understanding of regulation enables them to make strategic use of the legal framework and position themselves better in international comparison.

Hence, for businesses operating in or with the EU, compliance with the AI Act will be a decisive factor. Failure to comply could result in significant penalties—up to 7% of the global annual turnover or €35 million, whichever is higher.

As the legislation moves gradually forward, it is recommended that companies, as a first step, review their AI tools and analyze how these systems are classified and regulated under the new framework to implement the necessary obligations.

Comply With the AI Act Today!

Preparing for the AI Act requires a proactive and comprehensive approach. Ensure compliance, mitigate risks, and foster trust in your AI applications.

Get My Free Checklist

From a Data Warehousing perspective: How can an AI-Mart help?

As businesses prepare to comply with the European Union’s AI Act, ensuring that their data and AI systems meet the new regulations is critical. Central to this is the concept of data governance and traceability, especially for AI models classified as high-risk. A modern data warehouse (DWH), particularly one powered by Data Vault 2.0, when combined with a specialized AI-Mart, can provide the technical foundation needed for compliance by managing the data lifecycle, ensuring transparent operations, and logging AI model activities. Data Vault 2.0 offers several advantages for this purpose, including its ability to support agile development, enabling rapid changes in business requirements; ensuring scalability, allowing businesses to handle increasing data volumes seamlessly; and providing strong historical tracking through its architecture, which facilitates easier auditing and compliance verification.

In the context of AI, the AI-Mart is a specialized data mart within a DWH, focused solely on managing AI training data. Its purpose is to provide a structured and compliant environment for storing and curating datasets that will be used to train, validate, and test AI models. Unlike a traditional data mart, the AI-Mart is designed with features tailored for AI, such as enhanced metadata, tracking, and model training documentation.

Key Features of an AI-Mart

  1. Data Curation for AI Training: The AI-Mart stores data specifically curated for training AI models, ensuring that all datasets are clean, unbiased, and high-quality. Built-in data governance rules ensure that only validated data enters the mart. This ensures compliance with the AI Act’s requirements for high-risk AI systems, where data must be trustworthy, accurate, and free of bias.
  2. Metadata and Documentation: The AI-Mart stores metadata about each dataset, including its source, transformations applied, and its use in specific AI models. This metadata is essential for traceability, ensuring that every data point used in an AI model can be traced back to its origin and all changes can be documented.
  3. Data Versioning and Lineage: In AI applications, ensuring that models use up-to-date and reliable data is critical. The AI-Mart supports data versioning, allowing teams to maintain multiple versions of datasets and trace changes over time. Data lineage tracking ensures that the lifecycle of the data—from ingestion to usage in AI models—can be fully traced, providing a comprehensive audit trail required for compliance with the AI Act.

This is why a robust data governance framework is crucial for ensuring compliance with the AI Act. By integrating a data warehouse (DWH) with an AI-Mart, businesses can implement stringent governance measures that ensure the quality and reliability of AI training data. For example, automated validation pipelines within the DWH verify that only data meeting predefined quality standards is used, minimizing errors, biases, and missing information. This is particularly important for high-risk AI applications, such as those in biometric identification or healthcare, where poor data quality could lead to harmful or inaccurate outcomes.

To comply with the AI Act, businesses must ensure traceability in their AI systems by tracking and documenting key stages of the AI process, from data preparation to model usage. Integrating AI model logs into a data warehouse (DWH) plays a crucial role in this, providing a centralized system to monitor and store critical information about how AI models operate and interact with data.

AI Act Business Intelligence Architecture graphic

Logging AI Decisions and Outputs: Each time an AI model processes data, logs should be automatically generated and stored in the DWH. These logs capture essential details, including input data, feature transformations, model parameters, decision thresholds, and output probabilities. By loading these logs into the DWH, businesses create a detailed audit trail of AI activity, ensuring that key aspects of the model’s operations are documented.

Log Aggregation and Storage: Logs from AI models, whether during training or production, can be continuously fed into the DWH as part of the AI-Mart infrastructure. These logs may include:

  • Model training logs: Documenting how the model was trained, the datasets used, and the parameters adjusted during training.
  • Model inference logs: Recording the input data, features generated, and each prediction made by the model.
  • Performance metrics: Storing evaluations like accuracy, precision, and recall, which help track the model’s performance over time and detect any model drift.

By storing these logs in the DWH, businesses can establish detailed records of AI model operations for regulatory purposes.

Querying and Auditing Logs: The DWH’s querying tools allow compliance teams to generate reports that show how models operate, what data was used, and how the AI model has evolved. This simplifies the process of responding to regulatory audits and demonstrates adherence to the AI Act’s requirements.

By combining a DWH with an AI-Mart for AI training data and loading AI model logs into the same infrastructure, businesses can build a comprehensive framework for compliance with the AI Act. This approach supports data governance, ensuring high-quality data for AI models, and ensures traceability, allowing businesses to track and audit every aspect of their AI systems. This not only meets regulatory requirements but also fosters trust and accountability in the use of AI technology.

Upcoming Resources and Events

For more information, contact our team at [email protected].

Final Remarks from the Authors

The AI Act should not only be seen as a regulatory challenge but also as an opportunity for businesses to differentiate themselves by adopting trustworthy and ethical AI practices. As AI continues to evolve, businesses that prioritize compliance, transparency, and human oversight will be better positioned to thrive in the coming years. By taking proactive steps now to ensure compliance, businesses can turn AI regulation into a strategic advantage, building trust with customers, partners, and regulators alike.

Identifying Non-Historized Links

Watch the Video

Identifying Non-Historized Links in Data Vault

In data management, particularly in the Data Vault 2.0 methodology, understanding how to handle non-historized links can be essential for maintaining accurate, traceable records. In today’s Data Vault Friday, we’ll cover the concept of non-historized links, explore the unit of work in data storage, and discuss which elements should be included in link hashes.



Non-Historized Links Background: Understanding the GL Table

Imagine we’re working with a General Ledger (GL) table for storing financial transactions, which includes over a billion rows. Each record in this table represents a transaction and includes attributes such as:

  • TransactionID: A unique identifier for each transaction.
  • AccountSegment: A business key identifying the account involved.
  • AccountingDocument: Another business key associated with the document related to the transaction.
  • AccountingPeriod: A reference key indicating the period of the transaction.
  • DebitAmount and CreditAmount: The monetary amounts for each transaction.
  • TransactionTimestamp: The exact date and time of the transaction.

While the TransactionID alone uniquely identifies each transaction, a combination of AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp also provides a unique identifier for each entry. This setup raises questions about how to best organize, hash, and manage these records within a Data Vault model, specifically in terms of defining a Unit of Work and deciding which values should be included in a link hash.

Defining the Unit of Work in Data Vault

The Unit of Work is a concept used in Data Vault to determine which business keys need to be grouped together in a single record. In this context, we’re focusing on how these business keys—AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp—should be structured and managed within a non-historized link.

According to Michael Olschimke, the CEO of Scalefree, the Unit of Work is primarily relevant when dealing with standard relationship-based links. In our example, there’s no need to split the link between these business keys because the link can capture the original granularity of each transaction from the source data. This means that by keeping all four attributes within the same link, we avoid disrupting the Unit of Work and ensure a coherent data structure.

Testing the Structure of the Unit of Work

One way to verify if we’re correctly applying the Unit of Work is to check if the data model enables us to reconstruct the original data source without loss of records or attributes. If we find that we can reconstruct the data accurately using the non-historized link and potentially a Satellite table later, then we’re likely adhering to the Unit of Work correctly.

Since the attributes in our case uniquely identify each transaction, we can assume the Unit of Work is preserved, allowing for a stable and consistent structure in the data model.

Which Values Should Be Hashed in the Link Hash?

The next question concerns which values we should include in the link hash. Generally, the hash key in a link table is derived from the business keys of referenced hubs. In this case, the primary question is whether to hash just the AccountSegment, AccountingDocument, and AccountingPeriod, or if we should also include the TransactionID.

Olschimke suggests a practical approach here: while we could technically get by with hashing either the TransactionID alone or the combination of the other three elements, it’s often better to include all relevant attributes in the hash. This includes the TransactionID along with AccountSegment, AccountingDocument, and AccountingPeriod.

Here’s why:

  1. Consistency with Automation Tools: Many Data Vault automation tools automatically include all hub references and their business keys in the hash computation. Following this approach aligns with standard automation practices, making it easier to work with automation tools later.
  2. Minimizing Potential Issues: Including all attributes in the hash computation reduces the risk of data loss or accidental data duplication. It ensures that our hashes accurately represent each unique record.
  3. Negligible Performance Impact: While adding an extra attribute to the hash computation may slightly increase the string length, the performance impact is minimal. In large-scale data processing, this small change typically does not result in significant slowdowns.

In Practice: Setting Up Non-Historized Links

In Data Vault, non-historized links are a powerful tool for managing large datasets without versioning each record. By including all four attributes in the hash computation, we establish a stable link structure that reflects the source data’s original granularity. This approach allows us to confidently manage and retrieve accurate records without fear of unintentional data loss.

For practitioners, setting up these non-historized links can be straightforward when using automation tools that handle most of the heavy lifting. It’s generally recommended to follow the automation tools’ conventions, especially for larger datasets, as this minimizes discrepancies and facilitates smoother data processing in the long run.

How to Submit Your Questions

If you’re working with Data Vault and have a question similar to this one, Data Vault Friday sessions provide an opportunity to get expert insights. To submit a question, you can use the form, which allows you to attach diagrams, whiteboard images, or other supporting materials for clarity. Alternatively, for faster responses, you can post questions on the Data Innovators Exchange, where consultants actively respond and discuss Data Vault topics.

For those interested in more in-depth training, Scalefree also offers webinars, including sessions on WhereScape and dbt, which run monthly. These provide valuable resources for anyone looking to deepen their Data Vault expertise.

Fazit

Identifying non-historized links and deciding on a consistent hashing approach can significantly impact data integrity and model reliability. By adhering to Data Vault principles and utilizing automation tools effectively, you can manage vast amounts of data without sacrificing accuracy or consistency.

With the right approach, you’ll be able to create a resilient, scalable data model that aligns with your organization’s needs and remains flexible for future changes. If you’re diving deeper into Data Vault, stay tuned for more Data Vault Friday insights and don’t hesitate to participate in the community for ongoing support and expertise.

Identifying Non-Historized Links

Watch the Video

Identifying Non-Historized Links in Data Vault

In data management, particularly in the Data Vault 2.0 methodology, understanding how to handle non-historized links can be essential for maintaining accurate, traceable records. In today’s Data Vault Friday, we’ll cover the concept of non-historized links, explore the unit of work in data storage, and discuss which elements should be included in link hashes.



Non-Historized Links Background: Understanding the GL Table

Imagine we’re working with a General Ledger (GL) table for storing financial transactions, which includes over a billion rows. Each record in this table represents a transaction and includes attributes such as:

  • TransactionID: A unique identifier for each transaction.
  • AccountSegment: A business key identifying the account involved.
  • AccountingDocument: Another business key associated with the document related to the transaction.
  • AccountingPeriod: A reference key indicating the period of the transaction.
  • DebitAmount and CreditAmount: The monetary amounts for each transaction.
  • TransactionTimestamp: The exact date and time of the transaction.

While the TransactionID alone uniquely identifies each transaction, a combination of AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp also provides a unique identifier for each entry. This setup raises questions about how to best organize, hash, and manage these records within a Data Vault model, specifically in terms of defining a Unit of Work and deciding which values should be included in a link hash.

Defining the Unit of Work in Data Vault

The Unit of Work is a concept used in Data Vault to determine which business keys need to be grouped together in a single record. In this context, we’re focusing on how these business keys—AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp—should be structured and managed within a non-historized link.

According to Michael Olschimke, the CEO of Scalefree, the Unit of Work is primarily relevant when dealing with standard relationship-based links. In our example, there’s no need to split the link between these business keys because the link can capture the original granularity of each transaction from the source data. This means that by keeping all four attributes within the same link, we avoid disrupting the Unit of Work and ensure a coherent data structure.

Testing the Structure of the Unit of Work

One way to verify if we’re correctly applying the Unit of Work is to check if the data model enables us to reconstruct the original data source without loss of records or attributes. If we find that we can reconstruct the data accurately using the non-historized link and potentially a Satellite table later, then we’re likely adhering to the Unit of Work correctly.

Since the attributes in our case uniquely identify each transaction, we can assume the Unit of Work is preserved, allowing for a stable and consistent structure in the data model.

Which Values Should Be Hashed in the Link Hash?

The next question concerns which values we should include in the link hash. Generally, the hash key in a link table is derived from the business keys of referenced hubs. In this case, the primary question is whether to hash just the AccountSegment, AccountingDocument, and AccountingPeriod, or if we should also include the TransactionID.

Olschimke suggests a practical approach here: while we could technically get by with hashing either the TransactionID alone or the combination of the other three elements, it’s often better to include all relevant attributes in the hash. This includes the TransactionID along with AccountSegment, AccountingDocument, and AccountingPeriod.

Here’s why:

  1. Consistency with Automation Tools: Many Data Vault automation tools automatically include all hub references and their business keys in the hash computation. Following this approach aligns with standard automation practices, making it easier to work with automation tools later.
  2. Minimizing Potential Issues: Including all attributes in the hash computation reduces the risk of data loss or accidental data duplication. It ensures that our hashes accurately represent each unique record.
  3. Negligible Performance Impact: While adding an extra attribute to the hash computation may slightly increase the string length, the performance impact is minimal. In large-scale data processing, this small change typically does not result in significant slowdowns.

In Practice: Setting Up Non-Historized Links

In Data Vault, non-historized links are a powerful tool for managing large datasets without versioning each record. By including all four attributes in the hash computation, we establish a stable link structure that reflects the source data’s original granularity. This approach allows us to confidently manage and retrieve accurate records without fear of unintentional data loss.

For practitioners, setting up these non-historized links can be straightforward when using automation tools that handle most of the heavy lifting. It’s generally recommended to follow the automation tools’ conventions, especially for larger datasets, as this minimizes discrepancies and facilitates smoother data processing in the long run.

How to Submit Your Questions

If you’re working with Data Vault and have a question similar to this one, Data Vault Friday sessions provide an opportunity to get expert insights. To submit a question, you can use the form, which allows you to attach diagrams, whiteboard images, or other supporting materials for clarity. Alternatively, for faster responses, you can post questions on the Data Innovators Exchange, where consultants actively respond and discuss Data Vault topics.

For those interested in more in-depth training, Scalefree also offers webinars, including sessions on WhereScape and dbt, which run monthly. These provide valuable resources for anyone looking to deepen their Data Vault expertise.

Conclusion

Identifying non-historized links and deciding on a consistent hashing approach can significantly impact data integrity and model reliability. By adhering to Data Vault principles and utilizing automation tools effectively, you can manage vast amounts of data without sacrificing accuracy or consistency.

With the right approach, you’ll be able to create a resilient, scalable data model that aligns with your organization’s needs and remains flexible for future changes. If you’re diving deeper into Data Vault, stay tuned for more Data Vault Friday insights and don’t hesitate to participate in the community for ongoing support and expertise.

Best Practices for Managing Costs in Data Warehousing

Watch the Video

Best Practices for Managing Costs in Data Warehousing

In the world of data warehousing, managing and optimizing costs is essential, particularly as more organizations move their operations to the cloud. The shift to cloud data platforms like Snowflake, Databricks, and Azure has opened new opportunities for scaling data operations. However, it has also introduced new challenges in terms of cost management. Let’s explore key strategies and best practices for keeping data warehousing costs under control while maintaining high-quality, reliable data operations.

Why Cost Monitoring Matters in Data Warehousing

Cost monitoring often becomes a focus only after a project has started and costs have begun to accumulate. However, implementing cost control early on can yield substantial savings. Statistics from AWS and Gartner underscore the importance of cloud cost management: organizations can reduce monthly cloud costs by 10-20% with monitoring tools, and companies with cloud optimization strategies may see savings of up to 30%.

These savings underscore the significance of early planning, as businesses that establish cost management measures from the outset are far better positioned to maintain budget predictability. Let’s dive into actionable strategies to keep data warehousing costs in check.

Establish Clear Ownership and Responsibility

One of the first steps in cost management is defining who is responsible for each aspect of the data warehouse. Often, data projects start with a business use case but lack a clear person to oversee costs. Every data product, data source, and even individual data warehouse instance should have an assigned data owner. By giving someone ownership, you create accountability, and there’s always a go-to person to consult when costs spike.

Set Expiration Dates for Projects and Reports

Data reports and dashboards created for specific projects can linger long after they’re needed, consuming unnecessary resources. To avoid this, establish “end dates” for each project. For instance, a report created for a finance analysis in 2024 may not be relevant after that year ends. By checking with departments to verify report usage periodically, you can ensure that outdated reports are retired, freeing up resources and reducing costs.

Use Tags to Track Resources and Cost Allocation

Modern data platforms like Snowflake and Databricks allow you to tag resources for easier tracking. Tagging can be done at various levels—by department, project, or cost center. This makes it easier to allocate costs to specific business functions and track where expenses are going. However, be strategic with tags. A thoughtful, organized approach to tagging can streamline reporting and give you a clearer picture of how resources are used across the organization.

Define Purpose and Value of Reports and Data Products

When creating new reports or data products, always define their purpose. Determine how long they’ll be useful and assess their business value. Having a clear understanding of why each data product exists ensures resources are only allocated to valuable outputs, preventing unnecessary data processing and storage costs.

Implement Cost Monitoring and Alerts

Cost monitoring should be integrated directly into your data warehouse operations. Start by defining the key performance indicators (KPIs) for cost monitoring, such as monthly costs per tag or project. Build a dashboard that visualizes these metrics, making it easier to track costs at a glance. Additionally, set up budget alerts to notify you of any significant changes or cost surges.

For instance, Snowflake and other cloud platforms allow you to set automated alerts for query runtimes, storage limits, and overall costs. This is particularly useful for identifying high-cost queries or storage use that might need optimization.

Regularly Review Cost Allocation and Query Performance

Set up regular monthly reviews with your DevOps team to evaluate your data warehouse’s costs and budgets. During these reviews, discuss areas where you can optimize queries or resource use. Identifying expensive or long-running queries is critical. Optimizing them can have a noticeable impact on your overall budget.

Best Practices for Cost Efficiency in Data Warehousing

1. Involve Stakeholders in Cost Management

Stakeholders should be aware of the cost implications of the reports they request. Make them part of the conversation, helping them understand which reports are more costly and the associated budget impacts. This can make it easier to justify the costs and encourage stakeholders to make more cost-effective choices.

2. Set Up Budget Alerts

Budget alerts are essential for staying within allocated funds. Use them to monitor query and storage costs, and receive notifications if any thresholds are breached. This can prevent unexpected spikes in expenses.

3. Create a Cost Dashboard

Establish a dashboard that visualizes real-time costs and usage statistics. This is especially straightforward in platforms like Snowflake, where dashboards can display resource costs in an easily digestible format. Regularly viewing this dashboard can help your team make timely adjustments to reduce expenses.

4. Monitor and Optimize Queries

Query monitoring is essential. Keep an eye on long-running or high-cost queries, as they often account for a significant portion of the total expenses. Optimizing these queries can substantially reduce costs.

5. Apply Data Vault Techniques for Efficiency

Data Vault methodology brings several benefits for cost efficiency. Its standardization and automation in development reduce manual effort, lowering overall project costs. The agile approach of the Data Vault, with its “Tracer Bullet” development, ensures that you deliver business value early, which helps justify costs to stakeholders.

Additionally, Data Vault supports GDPR compliance and auditability, reducing the risk of costly legal issues. Its approach to parallel loading, materialization, and the use of PIT and Bridge tables enables efficient data processing, minimizing runtime and storage needs.

6. Follow the Pareto Principle in Cost Optimization

In cost monitoring, the Pareto Principle often applies. Focus on the top 20% of queries or tables that account for 80% of costs. By targeting optimizations to these high-cost items, you can achieve significant cost savings.

Conclusion

Effective cost management in data warehousing requires early planning, stakeholder involvement, and regular monitoring. By establishing clear ownership, tagging resources, setting budget alerts, and leveraging Data Vault principles, you can maintain cost-effective data operations that continue to deliver business value. Implement these practices to ensure your data warehousing operations remain scalable, efficient, and aligned with your organization’s budgetary goals.

If you’d like to learn more about optimizing data warehousing costs, check out our other posts or join us for next week’s Data Vault Friday session!

Snapshot Based Bridge Table on Link and Effectivity Satellite

Watch the Video

Snapshot Based Bridge Table on Link and Effectivity Satellite

Welcome to another edition of Data Vault Friday! Today, we’re diving into the concept of a snapshot-based bridge table, particularly focusing on its application in scenarios with links and effectivity satellites. This approach helps us handle complex relationships between business objects over time, managing changes in relationships, and retaining a complete historical view. In this article, we will explore solutions to three main questions:

  • How to handle a missing relationship on day 2 compared to day 1
  • How to manage changed relationships from day 1 to day 2 for the same key
  • How to include a business object (A3) on day 2 when it lacks a relationship to any object in B


Understanding the Data Vault Modeling Example

In our example, we have two sources, each representing a different business object (A and B). Business object A is represented as a static dataset, while business object B shows dynamic relationships to A over three days. Each day brings changes in the relationships, with some entries disappearing or shifting. Our task is to capture these changes effectively in a snapshot bridge table.

Examining the Relationship Changes

Let’s walk through the changes observed over three days:

  • Day 1: B1 is related to A1, B2 to A2, and B3 to A3.
  • Day 2: The relationships change. B1 is now related to A2, B2 to A1, and B3 disappears.
  • Day 3: No further changes occur compared to Day 2.

Our objective is to document these relationships using an Effectivity Satellite and a Bridge Table, enabling us to query the state of relationships as they existed on each day.

Creating the Link and Effectivity Satellite

The first step in capturing these relationships is to create a Link table that holds a distinct list of relationships between business objects. Links should not have additional metadata, such as validity dates, as this can complicate data retrieval and reduce performance. In our example, the Link table captures each unique combination of A and B keys but does not record their start or end dates.

Next, we create an Effectivity Satellite. This table extends the Link by recording each relationship’s start and end timestamps, as well as an “is_active” flag to indicate the current status of each relationship. Using this table, we can track when a relationship starts, changes, or ends. Let’s examine how this works:

  • Day 1 entries: All relationships (B1-A1, B2-A2, B3-A3) are marked as active.
  • Day 2 entries: New relationships (B1-A2 and B2-A1) are added and marked active, while previous relationships (B1-A1 and B2-A2) are marked inactive. B3-A3 is removed entirely.

Building the Snapshot-Based Bridge Table

With the Effectivity Satellite in place, we can now create a Bridge Table that snapshots the active relationships for each day. This table provides a point-in-time view of the relationships as they existed on a particular day. Let’s look at how the Bridge Table is created:

Day 1 Snapshot

The Day 1 snapshot reflects the initial relationships, pulling all active records from the Effectivity Satellite. At this stage, B1-A1, B2-A2, and B3-A3 are all active.

Day 2 Snapshot

In the Day 2 snapshot, only the relationships B1-A2 and B2-A1 remain active, while B3-A3 is removed. By applying a filter to include only active entries, the snapshot accurately represents the relationships on Day 2.

Day 3 Snapshot

Day 3’s snapshot is identical to Day 2, as no additional changes were made. The active relationships B1-A2 and B2-A1 remain unchanged.

This process ensures we have a clear audit trail of relationship changes and deletions. Each day’s snapshot represents the state of relationships at that point in time, without introducing redundant data.

Handling Missing Data and Reinstating Relationships

One critical feature of the Effectivity Satellite is the ability to manage reinstated relationships. For instance, if B1-A1’s relationship is reestablished on Day 4, we add new rows in the Effectivity Satellite, marking the previous active combination (B1-A2) as inactive and reactivating B1-A1. This dynamic structure makes it easy to adjust for relationships that appear, disappear, and reappear over time.

Incorporating A3 in the Bridge Table on Day 2

A common challenge is how to incorporate business objects like A3 on Day 2, despite having no relationship in B on that day. In Data Vault, this is often addressed in downstream queries or report joins rather than in the Bridge Table itself.

By starting with object A and performing a left join to the Bridge Table, you will include all records from A (even if they don’t appear in B). This ensures that unlinked objects are included in the results, with their B relationships shown as null, or as a placeholder if desired.

Preventing Cartesian Products and Other Best Practices

When using a Bridge Table, it’s essential to avoid unexpected Cartesian products, which can inflate data during aggregation. Always check the cardinality of relationships between objects (e.g., one-to-one or many-to-many) to ensure joins occur only on necessary keys. The driving key, which anchors the relationships in the Link Table, should be the primary focus, especially in cases of many-to-many relationships.

This method ensures accuracy and performance when aggregating data and avoids inflated results in reporting.

Conclusion

Snapshot-based bridge tables are powerful tools in Data Vault modeling for handling changing relationships and tracking historical snapshots. By carefully structuring Links, Effectivity Satellites, and Bridge Tables, we create a robust, auditable trail of data changes over time. As we’ve discussed, this approach allows us to accommodate missing data, reinstate relationships, and prevent data inflation, ensuring reliable, performant data models.

If you’d like further assistance with templates for bridge tables or Effectivity Satellites, feel free to reach out to me. I hope this discussion has provided clarity on using bridge tables in Data Vault. Join us again for more Data Vault insights, and feel free to submit your questions!

Close Menu