Creating Data Vault Stages

Data Vault Stages

The Data Vault methodology provides a robust framework for managing and organizing enterprise data. One of the foundational components of a Data Vault is the stage. In this guide, we’ll explore what Data Vault stages are, their importance, and how to create them effectively.



Understanding Node Types in Data Vault

Before diving into stages, let’s review the key node types in a Data Vault:

  • Stages: Temporary storage areas where raw data is preprocessed.
  • Hubs: Central entities containing unique business keys.
  • Links: Relationships between hubs.
  • Satellites: Contextual and descriptive data for hubs and links.
  • PITs (Point-in-Time Tables): Auxiliary tables that optimize query performance for snapshot-based queries.
  • Snapshot Tables: Historical states of data.
  • Non-Historized Links & Satellites: Used when historical tracking isn’t required.
  • Multi-Active Satellites: Support multiple active records for the same key.
  • Record Tracking Satellites: Track changes and versions of records.

Features of Data Vault Patterns

The Data Vault methodology leverages years of practical experience to deliver several key features:

  • Patterns Based on Expertise: Proven methods for efficient loading and processing.
  • Multi-Batch Processing: Handle multiple data batches simultaneously.
  • Automatic PIT Cleanup: Uses logarithmic snapshot logic for optimal performance.
  • Virtual Load End-Date: Allows insert-only processes by using calculated end dates.
  • Automated Ghost Records: Simplifies handling of missing or incomplete data.

Why Are Stages Important in Data Vault?

Stages play a critical role in the Data Vault architecture by enabling efficient data preparation and ensuring data integrity. Key functions performed in the stage include:

  • Hash Keys & Hash Diffs: Provide unique identifiers for data integration and deduplication.
  • Load Date & Record Source: Track the origin and timing of data entries.
  • Prejoins: Combine data efficiently before it enters the vault.
  • Hard Rules: Implement strict validation and transformation logic.
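
For orientation, a stage can be implemented as a simple view on top of the landing area that derives exactly these columns. The following Snowflake-flavored sketch is only an illustration with assumed object names (stg_customer, hk_customer, hd_customer) and hashing choices; it is not the code Datavault4Coalesce generates:

-- Illustrative stage view; object names and hashing choices are assumptions
CREATE OR REPLACE VIEW stg_customer AS
SELECT
    MD5(UPPER(TRIM(customer_id)))                        AS hk_customer,    -- hash key over the business key
    MD5(CONCAT_WS('||',
        COALESCE(TRIM(customer_name), ''),
        COALESCE(TRIM(phone_number), '')))               AS hd_customer,    -- hash diff over descriptive attributes
    CURRENT_TIMESTAMP()                                   AS load_date,      -- load date
    'CRM.CUSTOMER'                                        AS record_source,  -- record source
    customer_id,
    customer_name,
    REGEXP_REPLACE(phone_number, '[^0-9+]', '')           AS phone_number    -- hard rule: normalize the phone number
FROM landing.customer;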

How to Create a Data Vault Stage

Creating a stage in a Data Vault involves leveraging the right tools and techniques. For this, we recommend using Datavault4Coalesce, a powerful platform designed for Data Vault implementation. This tool simplifies the process by automating key tasks and ensuring best practices are followed.

Conclusion

Stages are a foundational component of the Data Vault methodology, enabling seamless data preparation and integration. By understanding their role and leveraging the right tools, you can ensure the success of your Data Vault implementation.


Meet the Speaker


Tim Kirschke
Senior Consultant

Tim has a Bachelor’s degree in Applied Mathematics and has been working as a BI consultant for Scalefree since the beginning of 2021. He’s an expert in the design and implementation of BI solutions, with focus on the Data Vault 2.0 methodology. His main areas of expertise are dbt, Coalesce, and BigQuery.

Using PIT and Bridge Tables in Business Vault Entities


PIT and Bridge Tables

In this blog post, we will answer a commonly asked question regarding PIT and Bridge Tables:

In the Data Vault architecture, is it okay to use/reuse created PIT and Bridge tables in the code of the Business Vault business rules?

The short answer is yes, but let’s dive into the details to understand the rationale and how PIT (Point-In-Time) tables and Bridge tables work in the context of Business Vault entities.



Understanding PIT and Bridge Tables

Before explaining their usage, let’s quickly clarify what PIT and Bridge tables are in the Data Vault architecture:

  • PIT Tables: These provide a snapshot of data for a specific point in time. They help combine deltas and descriptive data to enable calculations or business logic that requires a specific snapshot.
  • Bridge Tables: These are primarily used to resolve many-to-many relationships and improve query performance when dealing with large datasets.

Applying Business Logic in Business Vault

In the Data Vault, data flows from the Raw Data Vault (RDV) to the Business Vault (BV) and finally to the Information Mart (IM). The key difference lies in the granularity of data:

  • Load Date: In the Raw Data Vault, data batches are identified by a load date, which represents when the data was ingested.
  • Snapshot Date: In the Information Mart, data is often presented as snapshots, where each snapshot represents the data at a specific point in time.

Now, the Business Vault sits between the Raw Data Vault and Information Marts. When applying business rules in the BV, there are two major types of granularities to consider:

1. Granularity Based on Incoming Deltas

In this case, business logic is applied to all incoming deltas identified by the load date. For example, cleansing phone numbers is a typical use case where every delta (update) must be processed, even if only the latest version is needed in the end.

The resulting data is stored in a computed Satellite in the Business Vault. The primary key remains the hash key of the parent entity and the load date.

2. Granularity Based on Snapshot Date

Some business logic requires calculations for specific points in time. For example, calculating the lifetime value of a customer:

  • The lifetime value increases when a customer makes a purchase.
  • The lifetime value decreases incrementally if no purchases are made over time.

In this scenario, even when no new delta is coming in, the value must still be recalculated daily. This granularity aligns with the snapshot date, which is already defined in the PIT table. By leveraging the PIT table, you can calculate and store the lifetime value in a computed Satellite with a primary key of the parent hash key and snapshot date.
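
As a minimal sketch of this pattern, assume a PIT table pit_customer with one row per customer hash key and snapshot date, and a purchases Satellite sat_customer_purchase (all names are illustrative). The computed Satellite is then keyed by hash key and snapshot date; the decay logic from the example is omitted, and the comment marks where the actual business rule would go:

-- Hypothetical snapshot-grained computed Satellite; the join is deliberately simplified
CREATE OR REPLACE TABLE bv_sat_customer_lifetime_value AS
SELECT
    pit.hk_customer,
    pit.snapshot_date,
    COALESCE(SUM(p.purchase_amount), 0)   AS lifetime_value   -- place the actual business rule here
FROM pit_customer AS pit
LEFT JOIN sat_customer_purchase AS p
       ON  p.hk_customer = pit.hk_customer
       AND p.load_date  <= pit.snapshot_date                  -- only purchases known as of the snapshot
GROUP BY
    pit.hk_customer,
    pit.snapshot_date;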

Reusing PIT Tables

When switching from load date (deltas) to snapshot date (snapshots), PIT tables play a crucial role:

  • PIT tables help join descriptive data from Satellites to provide a snapshot-based view of the data.
  • They allow business rules to be applied at the outgoing information granularity (the snapshot date).

For example, if you want to calculate a specific measure, such as a customer’s lifetime value, the PIT table provides the granularity needed to compute the values for every day, hour, or minute, depending on your requirements.

Reusing Bridge Tables

Bridge tables can also be reused in Business Vault entities but with one key consideration:

Avoid loading one Bridge Table from another Bridge Table.

Why? Cascading Bridge Tables can lead to sequential dependencies, which hinder parallelization. Parallel processing is essential for performance, especially in high-volume environments. To work around this limitation, use Computed Aggregate Links.

What Is a Computed Aggregate Link?

A Computed Aggregate Link is essentially a Link with pre-computed aggregations. This concept is described in the Data Vault methodology and allows you to reuse aggregations efficiently without chaining Bridge Tables together.

For example, if you want to calculate a new measure based on facts stored in a Bridge Table:

  • Use the Bridge Table as the FROM source for a computed Satellite.
  • Attach the new measure to the Bridge Table as part of the Business Vault entity.

This approach avoids cascading dependencies while allowing you to extend facts or perform complex calculations.
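
A sketch of this idea, assuming a Bridge table bridge_customer_order that already carries an order_amount fact per snapshot date (all object names are illustrative):

-- Hypothetical computed Satellite that extends a Bridge table instead of loading a second Bridge
CREATE OR REPLACE TABLE bv_sat_customer_order_measures AS
SELECT
    b.hk_customer,
    b.snapshot_date,
    SUM(b.order_amount)   AS total_order_amount,
    COUNT(*)              AS order_count
FROM bridge_customer_order AS b      -- the Bridge is the FROM source
GROUP BY
    b.hk_customer,
    b.snapshot_date;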

Best Practices Recap for PIT and Bridge Tables

Here are the key takeaways for using PIT and Bridge tables in Business Vault entities:

  • Yes, you can reuse PIT tables: They are commonly used to provide snapshot granularity for computed Satellites.
  • Yes, you can reuse Bridge tables: Use them carefully to avoid cascading dependencies.
  • Use Computed Aggregate Links: When you need to extend a Bridge Table, this is the recommended approach to maintain efficiency and parallelization.
  • Granularity switch: Be mindful of the transition from load date (delta-driven) to snapshot date (snapshot-driven) when applying business logic.

Summary

In summary, PIT and Bridge tables are powerful tools in the Data Vault architecture, especially within the Business Vault. They enable complex business logic, such as snapshot-based calculations, while maintaining efficiency and performance. By adhering to best practices like avoiding cascading Bridge Table loads, you can ensure your implementation remains scalable and robust.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Building Responsible AI Systems Under the EU AI Act

EU AI Act Responsible Systems

The EU Artificial Intelligence (AI) Act represents a significant step forward in regulating AI technologies across the European Union. Its purpose is to establish a unified legal framework, ensuring human rights protection, safety, and the ethical use of AI, while fostering innovation and accountability. With its phased implementation starting in 2024, the Act brings major changes to how AI systems are designed, deployed, and monitored.



Overview of the EU AI Act

The EU AI Act aims to:

  • Establish a unified legal framework for AI across the EU.
  • Protect human rights and ensure safety.
  • Prohibit harmful and unethical uses of AI.
  • Promote transparency and accountability in AI systems.
  • Foster innovation and technological growth.

Timeline for Implementation

The Act includes specific deadlines for compliance:

  • August 2024: Prohibited AI practices must stop immediately.
  • August 2025: Transparency rules for general-purpose AI, including content labeling, take effect.
  • August 2026: High-risk AI regulations, such as those in healthcare, become enforceable with strict data quality standards.

Why This Matters

AI adoption is growing rapidly, with 42% of organizations utilizing AI in 2023—a 7% increase from 2022. The EU AI Act not only imposes penalties of up to 7% of global turnover for non-compliance but also reflects a societal responsibility to use AI ethically, addressing inequalities and safeguarding future generations.

The Risk-Based Approach

The EU AI Act categorizes AI systems into four risk levels:

  • Unacceptable Risk: Prohibited under Article 5.
  • High Risk: Strict regulation and obligations under Articles 6-51.
  • Limited Risk: Providers regulated under Articles 52a-52e.
  • Minimal Risk: Subject to transparency obligations under Article 52.

Key Principles of Responsible AI

Building responsible AI systems involves adhering to several key principles:

  • Explainability: AI models should be transparent and easy to understand.
  • Bias & Fairness: Detect and mitigate biases to ensure equitable outcomes.
  • Accountability: Define responsibilities for AI outcomes clearly.
  • Data Suitability: Use appropriate, high-quality data in compliance with regulations.
  • Monitoring: Continuously track AI performance to ensure reliability.
  • Transparency: Disclose system functionalities clearly and provide user mechanisms for feedback.
  • Auditability: Maintain detailed logs of algorithms, datasets, and configurations.

Steps to Build Responsible AI Systems

Organizations can prepare for compliance and ethical AI usage through the following steps:

  • Implement scalable AI services.
  • Develop predictive reporting mechanisms.
  • Establish robust governance frameworks.
  • Leverage tools and platforms for AI development.
  • Ensure data suitability and compliance.

AI Marts: Enabling AI Act Compliance

Traditional machine learning workflows without centralized data management can lead to feature inconsistencies, operational complexity, and compliance issues. AI Marts address these challenges by providing:

  • Centralized feature management.
  • Integration of feature engineering into workflows and pipelines.
  • Metadata and version control.
  • Scalable feature serving across targets.
  • Comprehensive logs for governance and auditing.

Benefits: AI Marts enhance data governance and security, serving as a critical step towards compliance with the EU AI Act.

Conclusion

As AI adoption grows, compliance with the EU AI Act is essential for organizations aiming to use AI responsibly. By implementing risk-based strategies, embracing transparency, and leveraging tools like AI Marts, companies can align with regulatory requirements while fostering trust and innovation.


Meet the Speaker


Lorenz Kindling
Senior Consultant

Lorenz is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies in various industries for Scalefree International. Prior to Scalefree, he also worked as a consultant in the field of data analytics. This allowed him to gain a comprehensive overview of data warehousing projects and common issues that arise.

Delta Lake vs Data Vault

How does Data Vault add value when we have the Delta Lake?

In the world of modern data management, businesses often find themselves navigating a maze of tools, architectures, and methodologies to meet their ever-evolving data needs. Among the popular approaches are Delta Lake and Data Vault. While both have their strengths, it’s important to understand how they complement each other and why Data Vault can be a game-changer even when you’re leveraging Delta Lake.



Understanding Delta Lake

Delta Lake is an open-source storage layer that brings reliability to data lakes. Built on top of Parquet files, it provides ACID transactions, schema enforcement, and the ability to handle incremental data changes. It’s a robust foundation for modern data warehouses and data lakes, especially when using tools like Databricks.
However, Delta Lake primarily focuses on managing data storage and changes. It doesn’t inherently bridge the gap between raw source data and the business-ready reports and dashboards that users demand.

Enter Data Vault: Bridging the Gap

Data Vault is a modeling approach designed to address the disconnect between raw data and user needs. While Delta Lake handles data storage efficiently, Data Vault focuses on the why and how of transforming that data into actionable insights. Here’s where Data Vault excels:

  • Data Modeling: Data Vault organizes data into Hubs, Links, and Satellites, ensuring a flexible and scalable structure. Hubs capture business keys, Links handle relationships, and Satellites store descriptive data.
  • Data Integration: It helps integrate disparate data sources into a unified model that reflects the business context.
  • Change Tracking: While Delta Lake tracks changes at the file or record level, Data Vault optimizes this by capturing deltas more efficiently, especially when splitting data into specialized Satellites.
  • Target-Oriented Design: Data Vault focuses on producing business-ready data models like star schemas, flat tables, or dashboards, rather than being a consumption model itself.

Performance Challenges and Solutions

A frequent criticism of Data Vault on Delta Lake revolves around query performance, particularly due to the columnar storage of Parquet files. Joins can be slow, but this is more a characteristic of the storage mechanism than the modeling technique. Here are some strategies to address this:

  • Denormalization: Flattening data into wide tables eliminates the need for joins, resulting in faster query performance.
  • Materialized Views: Creating materialized Parquet views for end-user consumption ensures high performance without impacting upstream processes.
  • Optimized Storage: Use technologies like Iceberg or Delta tables for Hubs and Links, and consider presenting Satellites as views to minimize storage overhead.
  • Incremental Load: Design systems to handle insert-only incremental loads, reducing the complexity of updates and deletes.
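
Combining the first two strategies, for example, Spark SQL on Delta Lake can materialize a denormalized, join-free consumption table. The schema and object names below are assumptions, meant only to illustrate the idea:

-- Hypothetical wide, join-free consumption table (Spark SQL on Delta Lake; names are assumptions)
CREATE OR REPLACE TABLE gold.customer_flat
USING DELTA
AS
SELECT
    h.hk_customer,
    h.customer_id,
    s.customer_name,
    s.phone_number,
    s.load_date
FROM silver.hub_customer      AS h
LEFT JOIN silver.sat_customer AS s
       ON s.hk_customer = h.hk_customer;   -- the join is paid once at build time, not per query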

Why Business Users Love Data Vault (Even If They Don’t Know It)

The ultimate goal of any data architecture is to serve business users. Reports, dashboards, and analytics are the end-products they care about. Data Vault excels here by enabling the creation of robust information models that align with user requirements:

  • Flexibility: Business rules can be implemented on top of the Data Vault model to derive the desired target model.
  • Scalability: Large data flows can be broken down into manageable pieces, making the system easier to maintain.
  • Agility: Changes in business requirements can be accommodated without overhauling the entire model.

Delta Lake and Data Vault: Better Together

Rather than viewing Delta Lake and Data Vault as competing approaches, think of them as complementary. Delta Lake provides the foundation for reliable data storage and change tracking, while Data Vault transforms this raw data into meaningful, business-ready formats.
For example, Delta Lake can serve as the staging or landing zone, where raw data is ingested and stored. Data Vault then takes over to model this data into Hubs, Links, and Satellites, preparing it for business consumption. The combination ensures both robust data management and the flexibility to meet diverse analytical needs.

Final Thoughts

Data Vault is a powerful methodology for bridging the gap between raw data and actionable insights. Even in environments that leverage Delta Lake, Data Vault adds value by providing a scalable, user-focused approach to data modeling. By combining the strengths of these two technologies, organizations can achieve both reliability and agility in their data architectures.
As with any tool or methodology, the key is to tailor the implementation to your specific needs, ensuring that both performance and usability are optimized. Whether you’re dealing with Databricks, Parquet, or other tools, Data Vault provides the flexibility and structure to deliver what matters most: business value.


Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Custom Node Types in Coalesce

Custom Node Types in Coalesce: Unlocking Flexibility and Reusability

Nodes are the foundational building blocks in coalesce.io, serving as database objects like tables or views. Each node belongs to a specific type, equipped with a predefined user interface, a create template, and a run template. While coalesce.io provides four standard node types, custom node types allow users to adapt and extend these capabilities for unique requirements.



What Are Custom Node Types?

Custom node types enable users to define reusable database object patterns. By specifying a user interface (UI), Data Definition Language (DDL), and Data Manipulation Language (DML), users can create tailored solutions for patterns such as stages, dimensions, facts, hubs, and links. Parameters and macros make these custom types even more adaptable and reusable.

Why Create Custom Node Types?

Custom node types address two key needs:

  • Custom Needs: Standard node types may not cover specific use cases.
  • Reusability: Custom node types eliminate the redundancy of repeatedly creating similar nodes, saving time and effort.

Key Components of Custom Nodes

Node Definition and UI Configuration

The node definition specifies the UI elements, such as materialization selectors, toggles, dropdowns, and text boxes. These components define how users interact with and configure the custom node.

Create Template

The create template includes SQL logic for generating tables or views. It supports column transformations, comments, clustering keys, and all Snowflake DDL features.

Run Template

The run template defines DML operations, such as inserting data, applying incremental or merge strategies, and performing transformations. These operations are executed exclusively for table-based nodes and utilize all Snowflake DML features.
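
As orientation only, the rendered output of a create template and a run template for a simple stage node could look like the following plain Snowflake SQL. The Coalesce template syntax itself (with its UI parameters and macros) is omitted here, and all object names are assumptions:

-- What a create template might render to (DDL)
CREATE TABLE IF NOT EXISTS stage.stg_customer (
    hk_customer   VARCHAR(32),
    customer_id   VARCHAR(50),
    customer_name VARCHAR(200),
    load_date     TIMESTAMP_NTZ,
    record_source VARCHAR(100)
);

-- What a run template might render to (DML)
INSERT INTO stage.stg_customer
SELECT
    MD5(UPPER(TRIM(customer_id))),
    customer_id,
    customer_name,
    CURRENT_TIMESTAMP(),
    'CRM.CUSTOMER'
FROM landing.customer;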

Get Started with Custom Node Types

Custom node types in coalesce.io empower teams to design reusable, scalable solutions tailored to specific needs. By leveraging their flexibility, you can streamline development, reduce repetitive tasks, and maximize efficiency in your data workflows.


Meet the Speakers


Tim Kirschke
Senior Consultant

Tim has a Bachelor’s degree in Applied Mathematics and has been working as a BI consultant for Scalefree since the beginning of 2021. He’s an expert in the design and implementation of BI solutions, with focus on the Data Vault 2.0 methodology. His main areas of expertise are dbt, Coalesce, and BigQuery.


Deniz Polat
Consultant

Deniz has been working in Business Intelligence and Enterprise Data Warehousing (EDW), supporting Scalefree International since the beginning of 2022. He has a Bachelor’s degree in Business Information Systems. He is a Certified Data Vault 2.0 Practitioner, Scrum Master, and Product Owner, and has experience in Data Vault modeling, data warehouse automation, and data warehouse transformation with dbt and Coalesce.

CDC, Status Tracking Satellite, and Delta Lake


Understanding CDC and Status Tracking Satellites in Data Vault

The integration of Change Data Capture (CDC) data into a multi-active satellite and status tracking satellite is a nuanced topic. In a previous session, the focus was primarily on multi-active satellites, leaving the status tracking satellite underexplored. This article will dive deeper into their utility, especially in the context of CDC data.

A status tracking satellite in Data Vault serves a specific purpose: it tracks the appearance, updates, and disappearance of business objects in the source system. However, if CDC data is available, this tracking becomes inherently simpler because CDC already provides explicit information about creates (C), updates (U), and deletes (D). Thus, creating a separate status tracking satellite may not be necessary.

In contrast, when dealing with full extracts (non-CDC data), a status tracking satellite can be invaluable. It enables you to derive creates, updates, and deletes by comparing consecutive extracts, identifying the first appearance (create), differences between records (update), and removal of records (delete). This can be achieved by maintaining a delta check mechanism and creating a robust satellite to store these events.
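
A minimal sketch of the delete-detection part of such a delta check, comparing the keys already known in the Hub against today's full extract (all object names are assumptions, and a real load would also exclude keys already marked as deleted):

-- Hypothetical status-tracking load for full extracts: keys missing from today's extract are marked deleted
INSERT INTO sts_customer (hk_customer, load_date, cdc_operation, record_source)
SELECT
    h.hk_customer,
    CURRENT_TIMESTAMP(),
    'D',
    'CRM.FULL_EXTRACT'
FROM hub_customer AS h
LEFT JOIN stg_customer_full_extract AS s
       ON s.hk_customer = h.hk_customer
WHERE s.hk_customer IS NULL;          -- seen before, absent today => delete event
-- Creates ('C') and updates ('U') would be derived analogously by comparing against the previous extract.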



Handling Multi-Active Data in Status Tracking Satellites

Multi-active data arises when the same business key appears multiple times in the source system, distinguished by another attribute (e.g., a technical ID or contact type). In these cases, status tracking satellites must accommodate the additional attributes, ensuring that individual records are not incorrectly marked as deleted when only one instance of the multi-active data changes.

For example, consider a scenario where a customer appears twice in the source system with different technical IDs but the same business key. A delete operation on one ID should not remove the customer from the source entirely. To address this, a status tracking satellite should maintain a composite key combining the business key and the multi-active attribute.

This approach ensures that changes are tracked at the appropriate granularity, preserving the integrity of multi-active records. Additionally, adding the CDC information (CUD columns) directly to the main satellite can simplify tracking without requiring a separate status tracking satellite.
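
A sketch of such a structure in Snowflake-flavored DDL, assuming the multi-active attribute is a contact type (all names and data types are illustrative):

-- Hypothetical status-tracking Satellite keyed by hash key + multi-active attribute + load date
CREATE TABLE IF NOT EXISTS sts_customer_contact (
    hk_customer    CHAR(32)      NOT NULL,   -- hash of the business key
    contact_type   VARCHAR(20)   NOT NULL,   -- multi-active attribute, e.g. 'PRIVATE' or 'BUSINESS'
    load_date      TIMESTAMP_NTZ NOT NULL,
    cdc_operation  CHAR(1)       NOT NULL,   -- 'C', 'U' or 'D'
    record_source  VARCHAR(100)  NOT NULL,
    PRIMARY KEY (hk_customer, contact_type, load_date)
);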

Data Vault and Delta Lake: Complementary Approaches

The second question posed is whether Data Vault adds value when Delta Lake is already in use. To address this, it’s essential to understand the distinctions between the two. Delta Lake is a technology, whereas Data Vault is a methodology. While Delta Lake provides a robust framework for handling data in its native form (e.g., JSON, XML) and managing deltas, it does not prescribe how to model or process data for business purposes.

Data Vault, on the other hand, excels in its structured, agile methodology for modeling data. It provides a clear architecture, including hubs, links, and satellites, which organize data effectively for analytics and reporting. This is where Data Vault complements Delta Lake by applying a methodical approach to the data stored in the lake.

In practice, Delta Lake can serve as the persistent staging area (landing zone) in a Data Vault architecture. The metadata and delta tracking capabilities of Delta Lake enhance the efficiency of loading and processing data, while Data Vault ensures that the data is modeled and structured to meet business requirements. This synergy allows organizations to leverage the strengths of both technologies, creating a powerful data ecosystem.

Combining CDC Data with Data Vault and Delta Lake

By integrating CDC data, Delta Lake, and Data Vault, organizations can achieve an optimized data architecture. CDC data feeds directly into Delta Lake’s storage layers (bronze, silver, gold), which in turn populate the Data Vault’s hubs, links, and satellites. This integration streamlines data ingestion, transformation, and querying while maintaining flexibility and scalability.

For instance, CDC data can directly populate status tracking satellites or be included in a main satellite for simplicity. Meanwhile, Delta Lake’s metadata features support efficient querying and analysis, enabling the Data Vault layer to focus on applying business logic and producing meaningful insights.

By combining these tools and methodologies, data teams can build robust, agile data platforms that support modern analytics and decision-making needs.

Meet the Speaker


Marc Winkelmann
Managing Consultant

Marc is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016, he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020, he became a Data Vault 2.0 Instructor for Scalefree.

The Need for a Data Warehouse

The Need for a Data Warehouse

In today’s rapidly evolving digital landscape, businesses generate and collect vast amounts of data. However, data alone isn’t enough to ensure success—it’s about how we manage, analyze, and utilize this data. This brings us to a fundamental question: why do we need a data warehouse or Data Vault in our business model?



From Gut Feeling to Data-Driven Decisions

Many organizations, especially mid-sized firms, often rely on gut feelings for decision-making. While experience-based intuition has its place, it also carries a significant risk of error, especially as businesses grow and their operations become more complex. A data warehouse is a game-changer in transforming such organizations into data-driven entities where decisions are made based on facts and analytics rather than instinct alone.

As businesses scale, leaders lose the ability to maintain a complete overview of every operational detail. This is where a systematic approach to organizing, analyzing, and processing data becomes essential. A data warehouse centralizes and standardizes data from multiple sources, making it easier to extract insights and support rational, informed decisions across all levels of the organization.

Enhanced Business Process Automation

With a centralized repository like a data warehouse, businesses can unlock opportunities for automation. Automated processes can access, analyze, and utilize data seamlessly, leading to improved efficiency and accuracy. Whether it’s optimizing workflows or refining customer interactions, having reliable, accessible data is crucial for these systems to function effectively.

Democratizing Data Access

A significant aspect of a data-driven organization is making relevant data accessible to employees across roles. Every employee, from frontline workers to C-suite executives, is expected to make decisions. For these decisions to be effective, they need to be grounded in data.

However, this doesn’t mean unrestricted access. Data warehouses must incorporate robust security measures, such as role-based access, to ensure that employees can access only the data necessary for their responsibilities. This combination of widespread accessibility and stringent security supports a culture of informed decision-making while safeguarding sensitive information.

Do You Need a Data Vault?

When it comes to managing enterprise data, many organizations face additional challenges: integrating multiple source systems, ensuring data security and privacy, and handling real-time and batch processing simultaneously. A Data Vault model offers a comprehensive solution to these challenges by supporting integration, auditability, and adaptability.

While some businesses may start with simpler models, their requirements will inevitably evolve. Laws and industry standards may impose new data privacy mandates, or management may seek to leverage more advanced analytics capabilities. A well-designed Data Vault can accommodate these future needs without requiring a complete overhaul of the existing system.

The Future-Ready Advantage

One of the standout features of a Data Vault is its flexibility. It allows businesses to scale and adapt their data management strategies as they grow. Whether it’s adding new data sources, meeting stricter compliance requirements, or enabling more sophisticated analytics, the Data Vault model supports incremental changes without disrupting existing operations.

This adaptability makes it an invaluable asset for enterprises looking to future-proof their data strategies. While simpler solutions might suffice for today’s needs, they may not hold up against tomorrow’s demands. A Data Vault ensures that businesses are prepared for the inevitable increase in complexity and volume of their data requirements.

Conclusion

Investing in a data warehouse or Data Vault isn’t just about technology—it’s about fostering a culture of informed, data-driven decision-making. From streamlining processes to democratizing data access, these systems provide the foundation for businesses to thrive in an increasingly competitive and data-centric world. Whether you’re just starting your data journey or looking to enhance your existing capabilities, now is the time to prioritize a robust, scalable data solution.

As your business grows, so will your data requirements. A data warehouse or Data Vault not only meets these needs but positions your organization to capitalize on the full potential of its data—today and in the future.


Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Federate Multiple Sets of Non-Historized Data

Federate Multiple Sets of Non-Historized Data

Welcome to another edition of Data Vault Friday! In this session, we’ll address an interesting question: how to integrate a second system into an existing setup with a non-historized link. This specific case involves transactions that can be updated for the last two months. We’ll explore the necessity of satellites, the role of non-historized links, and the concept of federating data in the business vault or information mart (BV/IM).

Let’s break it down, discuss the challenges, and provide actionable solutions.



The Challenge

The question revolves around a scenario where two systems provide transactional data. The goal is to integrate these datasets into a single fact entity in the information mart, enabling aggregation across both systems. A specific challenge arises when transactions from one system need to be updated within the last two months, necessitating a satellite to track those changes. So, how do we effectively design and implement this integration?

Understanding Non-Historized Links

Non-historized links play a central role in integrating datasets across source systems. These links typically store only insert operations, ensuring that facts—such as transactions—remain immutable. However, the reality is often more complex, especially when updates or changes occur.

For transactions that change, the non-historized link’s primary purpose is to maintain a consistent structure. Attributes frequently used for aggregations, such as amounts, are stored directly in the link structure. Meanwhile, descriptive attributes that casual users don’t aggregate (e.g., line descriptions) are placed in satellites to keep the link slim and optimized for storage and performance.

Federation Across Systems

One of the first decisions when integrating two systems is determining whether their data can share the same link structure. The key considerations are:

  • Semantic Consistency: Do both systems provide the same type of transactional data?
  • Granularity: Are the transactions at the same level of detail?
  • Key Uniqueness: Is there any overlap in identifiers that might require adding source-specific elements like a tenant ID?

If the answers indicate alignment, both systems can load into the same non-historized link. However, each source system would have its own satellite, ensuring clear separation of descriptive attributes.
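
For illustration, both systems could then load the same link, with the hash key built over the business key plus a source-specific tenant ID to avoid collisions; each source keeps its own Satellite. All names below are assumptions:

-- Hypothetical load from system A into the shared non-historized link
INSERT INTO nhl_transaction (hk_transaction, tenant_id, transaction_id, amount, load_date, record_source)
SELECT
    MD5(CONCAT_WS('||', 'SYS_A', transaction_id)),   -- tenant ID is part of the key to prevent overlap
    'SYS_A',
    transaction_id,
    amount,
    CURRENT_TIMESTAMP(),
    'SYSTEM_A.TRANSACTIONS'
FROM stg_system_a_transactions;
-- System B loads the same link with 'SYS_B'; its descriptive attributes go to a separate Satellite.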

Handling Updates in Transactions

Updates to transactional data pose a significant challenge, particularly when attributes such as amounts or hub references change. The typical strategy involves technical counter transactions, which ensure insert-only behavior while maintaining an accurate representation of changes. Here’s how it works:

  1. For a new transaction, insert the original record into the link.
  2. If an update occurs (e.g., an amount changes from €5 to €7), insert two additional records:
    • A “counter” record negating the original amount (e.g., -€5).
    • A new record reflecting the updated amount (e.g., +€7).
  3. Aggregations will reflect the latest transaction value while retaining a complete history of changes.

This approach keeps the data model consistent and avoids direct updates to the non-historized link.
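
A minimal sketch of the three inserts described above, reusing the same hypothetical nhl_transaction link from earlier (hash-key handling is simplified and all values are illustrative):

-- 1. Original booking (insert only)
INSERT INTO nhl_transaction (hk_transaction, tenant_id, transaction_id, amount, load_date, record_source)
VALUES (MD5(CONCAT_WS('||', 'SYS_A', 'TX-1001')), 'SYS_A', 'TX-1001',  5.00, CURRENT_TIMESTAMP(), 'SYSTEM_A.TRANSACTIONS');

-- 2. + 3. Correction from 5 to 7 (EUR in the example above): two further inserts, never an UPDATE
INSERT INTO nhl_transaction (hk_transaction, tenant_id, transaction_id, amount, load_date, record_source)
VALUES (MD5(CONCAT_WS('||', 'SYS_A', 'TX-1001')), 'SYS_A', 'TX-1001', -5.00, CURRENT_TIMESTAMP(), 'SYSTEM_A.TRANSACTIONS'),  -- counter record
       (MD5(CONCAT_WS('||', 'SYS_A', 'TX-1001')), 'SYS_A', 'TX-1001',  7.00, CURRENT_TIMESTAMP(), 'SYSTEM_A.TRANSACTIONS');  -- updated amount

-- SUM(amount) per transaction_id now returns 7, while the full change history remains queryable.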

Federation in the Business Vault and Information Mart

While the raw Data Vault integrates data from multiple systems, the Business Vault (BV) and Information Mart (IM) are where federated datasets shine. By standardizing and transforming data, you can provide end-users with a unified view of all transactions.

The information mart combines the transactional data from both systems into a single fact entity. This entity supports aggregations across all transactions, ensuring that analytical queries deliver accurate and actionable insights.

Best Practices and Recommendations

To ensure a robust implementation, consider these best practices:

  • Design for Scalability: Keep the link structure slim to optimize performance, especially with large datasets.
  • Clear Satellite Separation: Create one satellite per source system to maintain clarity and prevent mixing attributes from different sources.
  • Use Technical Counter Transactions: Handle updates efficiently while preserving insert-only behavior.
  • Test Aggregations: Validate the unified fact entity in the information mart to ensure accurate reporting across systems.
  • Monitor Data Quality: Regularly check for inconsistencies, especially when integrating systems with different update patterns.


Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Non-Historized Links and Their Satellites

Understanding Non-Historized Links and Satellites

In the world of data vault modeling, we often encounter situations where we need to deal with non-historized links. A non-historized link is a structure used to capture relationships between entities (hubs) in a data vault. Unlike historized links, these structures are not designed to track changes over time, making them particularly useful in certain high-performance scenarios.

However, when integrating with systems that can update transactions for recent periods, additional structures like non-historized satellites become essential. These satellites allow us to track updates to attributes associated with the link while maintaining the integrity and performance of the data model.

Design Considerations for Non-Historized Links

The design of non-historized links requires careful planning, particularly when dealing with wide datasets or platforms with specific storage constraints. A wide link, for instance, might contain attributes like transaction ID, account segments, accounting document, accounting period, and timestamps. These attributes uniquely identify a transaction and establish relationships with other entities in the data vault.

Here are key considerations when designing non-historized links:

  • Include Essential Identifiers: Attributes required to uniquely identify a transaction and establish relationships with hubs should always be part of the link. This typically includes the transaction ID and any necessary alternate keys.
  • Balance Row Width: Including too many attributes in the link can lead to performance issues on row-based storage systems, as wide rows reduce the number of records stored per page.
  • Use Satellites for Additional Attributes: Attributes that are not essential for casual queries or dashboard aggregation should be moved to a satellite structure to keep the link lean and efficient.

Role of Non-Historized Satellites

Non-historized satellites play a critical role in complementing non-historized links. They store additional descriptive attributes that are not frequently queried or aggregated by typical users. This approach minimizes the join operations required during querying, improving performance.

For instance, while attributes like debit and credit amounts might be included in the link due to their importance in aggregations, other less frequently used attributes, such as transaction type or invoice descriptions, can reside in the satellite. This division ensures a balance between storage efficiency and query performance.

Query Performance Optimization

A significant design challenge lies in ensuring optimal query performance, particularly on column-based storage systems. Joining a link with its satellite can significantly impact performance. To mitigate this, consider the following:

  • Prejoin Frequently Queried Attributes: Move attributes that are commonly used in aggregations or group-by operations into the link structure to eliminate the need for joins.
  • Tailor Satellites for Specific Use Cases: Design satellites to serve specialized needs, such as data science or detailed analysis, where users can afford the higher cost of join operations.
  • Adopt Hybrid Storage Strategies: Leverage hybrid storage solutions, such as Snowflake’s mini partitions, to combine row-based and column-based advantages and support wider rows for frequently queried data.

Practical Example

Let’s consider a scenario with a financial transactions table containing over a billion rows. The table includes attributes such as transaction ID, account segments, accounting document, accounting period, debit amount, credit amount, and timestamps.

Here’s how we can structure the non-historized link and satellite:

  • Non-Historized Link: Include transaction ID, account segments, accounting document, accounting period, debit amount, and credit amount. These attributes are essential for aggregations and dimension references.
  • Non-Historized Satellite: Store additional attributes like transaction type and invoice descriptions, which are not frequently queried by casual users.

This structure ensures that typical queries for dashboards and reports are efficient, while still supporting detailed analysis for specialized users.
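
In Snowflake-flavored DDL, the split described above might look like this (all names and data types are illustrative):

-- Slim non-historized link: identifiers plus the measures users aggregate on dashboards
CREATE TABLE IF NOT EXISTS nhl_financial_transaction (
    hk_transaction      CHAR(32)      NOT NULL,
    hk_account          CHAR(32)      NOT NULL,   -- hub reference for the account segment
    transaction_id      VARCHAR(50)   NOT NULL,
    accounting_document VARCHAR(50),
    accounting_period   VARCHAR(10),
    debit_amount        NUMBER(18,2),             -- kept in the link: frequently aggregated
    credit_amount       NUMBER(18,2),
    transaction_ts      TIMESTAMP_NTZ,
    load_date           TIMESTAMP_NTZ NOT NULL,
    record_source       VARCHAR(100)  NOT NULL
);

-- Non-historized satellite: descriptive attributes that casual users rarely query
CREATE TABLE IF NOT EXISTS nhs_financial_transaction (
    hk_transaction      CHAR(32)      NOT NULL,   -- same key as the link
    transaction_type    VARCHAR(50),
    invoice_description VARCHAR(4000),
    load_date           TIMESTAMP_NTZ NOT NULL,
    record_source       VARCHAR(100)  NOT NULL
);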

Conclusion

Designing non-historized links and their satellites requires a deep understanding of data usage patterns and storage platform constraints. By carefully selecting which attributes to include in the link and which to offload to a satellite, we can strike a balance between query performance and storage efficiency. For most scenarios, prioritizing attributes that support frequent aggregations and groupings in the link structure is key, while satellites serve as a repository for less critical details.

As data modeling continues to evolve, adapting these principles to modern storage technologies and hybrid platforms will be crucial for building robust and performant data vault systems.


Meet the Speaker


Michael Olschimke

Michael has more than 20 years of experience in Information Technology. During the last eight years, he has specialised in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

AI Act Insight: Ensuring Responsible AI for Your Business


AI Act

The Artificial Intelligence (AI) Act has been in force since August 1, 2024, and its provisions take effect in stages. As a new legal milestone, the AI Act introduces requirements for the use of artificial intelligence in companies in order to promote the responsible development and use of AI in the EU. But rather than presenting only hurdles, the Act’s risk-based approach also opens up a wide range of opportunities for companies to build a future-oriented AI strategy.

In our newsletter, we take a first look at the new legal framework and its significance for your company. We also show how companies can use legal compliance strategically to gain a competitive edge and promote innovative business models.


The EU’s AI Act is here! Learn how this groundbreaking regulation impacts your business. We’ll break down the risk-based approach to AI systems, focusing on high-risk applications and compliance requirements. Discover practical steps to ensure transparency and leverage tools like AI-Marts for effective AI governance.


What is the AI Act and Why Should You Care?

The AI Act aims to make the use of artificial intelligence within the EU safer and more trustworthy by creating clear rules for the development and use of AI systems. The focus here is on the protection of fundamental rights, health, and safety. The legal framework is based on a risk-based approach: AI applications are divided into four different categories, from minimal to unacceptable risks, depending on the potential threat to society. Specifically, the AI Act provides for the following categories:

  1. Unacceptable Risk: AI systems that pose a threat to human rights or safety, such as those used for social scoring or manipulative practices, are prohibited.
  2. High Risk: These systems are heavily regulated and include applications in critical areas such as biometric identification, healthcare, transportation, education, and employment. Businesses using high-risk AI must meet strict compliance standards.
  3. Limited Risk: These systems face fewer restrictions but must still adhere to transparency requirements. For example, chatbots need to inform users that they are interacting with AI.
  4. Minimal or No Risk: The least regulated category includes AI applications such as spam filters or AI-driven video games.

The AI Act is of great importance for companies, as the new requirements not only entail compliance obligations, but also open up opportunities to gain competitive advantages.

By adapting to the legal requirements at an early stage, companies can strengthen trust among customers and partners, minimize risks, and promote innovation responsibly. A sound understanding of regulation enables them to make strategic use of the legal framework and position themselves better in international comparison.

Hence, for businesses operating in or with the EU, compliance with the AI Act will be a decisive factor. Failure to comply could result in significant penalties—up to 7% of the global annual turnover or €35 million, whichever is higher.

As the legislation phases in, it is recommended that companies, as a first step, review their AI tools and analyze how these systems are classified and regulated under the new framework, so that they can implement the necessary obligations.


From a Data Warehousing perspective: How can an AI-Mart help?

As businesses prepare to comply with the European Union’s AI Act, ensuring that their data and AI systems meet the new regulations is critical. Central to this is the concept of data governance and traceability, especially for AI models classified as high-risk. A modern data warehouse (DWH), particularly one powered by Data Vault 2.0, when combined with a specialized AI-Mart, can provide the technical foundation needed for compliance by managing the data lifecycle, ensuring transparent operations, and logging AI model activities. Data Vault 2.0 offers several advantages for this purpose: it supports agile development, enabling rapid responses to changing business requirements; it scales seamlessly as data volumes grow; and its architecture provides strong historical tracking, which simplifies auditing and compliance verification.

In the context of AI, the AI-Mart is a specialized data mart within a DWH, focused solely on managing AI training data. Its purpose is to provide a structured and compliant environment for storing and curating datasets that will be used to train, validate, and test AI models. Unlike a traditional data mart, the AI-Mart is designed with features tailored for AI, such as enhanced metadata, tracking, and model training documentation.

Key Features of an AI-Mart

  1. Data Curation for AI Training: The AI-Mart stores data specifically curated for training AI models, ensuring that all datasets are clean, unbiased, and high-quality. Built-in data governance rules ensure that only validated data enters the mart. This ensures compliance with the AI Act’s requirements for high-risk AI systems, where data must be trustworthy, accurate, and free of bias.
  2. Metadata and Documentation: The AI-Mart stores metadata about each dataset, including its source, transformations applied, and its use in specific AI models. This metadata is essential for traceability, ensuring that every data point used in an AI model can be traced back to its origin and all changes can be documented.
  3. Data Versioning and Lineage: In AI applications, ensuring that models use up-to-date and reliable data is critical. The AI-Mart supports data versioning, allowing teams to maintain multiple versions of datasets and trace changes over time. Data lineage tracking ensures that the lifecycle of the data—from ingestion to usage in AI models—can be fully traced, providing a comprehensive audit trail required for compliance with the AI Act.

This is why a robust data governance framework is crucial for ensuring compliance with the AI Act. By integrating a data warehouse (DWH) with an AI-Mart, businesses can implement stringent governance measures that ensure the quality and reliability of AI training data. For example, automated validation pipelines within the DWH verify that only data meeting predefined quality standards is used, minimizing errors, biases, and missing information. This is particularly important for high-risk AI applications, such as those in biometric identification or healthcare, where poor data quality could lead to harmful or inaccurate outcomes.

To comply with the AI Act, businesses must ensure traceability in their AI systems by tracking and documenting key stages of the AI process, from data preparation to model usage. Integrating AI model logs into a data warehouse (DWH) plays a crucial role in this, providing a centralized system to monitor and store critical information about how AI models operate and interact with data.


Logging AI Decisions and Outputs: Each time an AI model processes data, logs should be automatically generated and stored in the DWH. These logs capture essential details, including input data, feature transformations, model parameters, decision thresholds, and output probabilities. By loading these logs into the DWH, businesses create a detailed audit trail of AI activity, ensuring that key aspects of the model’s operations are documented.

Log Aggregation and Storage: Logs from AI models, whether during training or production, can be continuously fed into the DWH as part of the AI-Mart infrastructure. These logs may include:

  • Model training logs: Documenting how the model was trained, the datasets used, and the parameters adjusted during training.
  • Model inference logs: Recording the input data, features generated, and each prediction made by the model.
  • Performance metrics: Storing evaluations like accuracy, precision, and recall, which help track the model’s performance over time and detect any model drift.

By storing these logs in the DWH, businesses can establish detailed records of AI model operations for regulatory purposes.
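
As one possible shape for such logs, a Snowflake-flavored inference-log table inside the AI-Mart might look like the following; the schema is an assumption for illustration, not a prescribed standard:

-- Hypothetical inference-log table; one row per prediction made by a model
CREATE TABLE IF NOT EXISTS ai_mart.model_inference_log (
    log_id              BIGINT IDENTITY,
    model_name          VARCHAR(100)  NOT NULL,
    model_version       VARCHAR(20)   NOT NULL,
    inference_ts        TIMESTAMP_NTZ NOT NULL,
    input_payload       VARIANT,                 -- input data and generated features as JSON
    decision_threshold  FLOAT,
    output_probability  FLOAT,
    predicted_label     VARCHAR(50),
    record_source       VARCHAR(100)  NOT NULL   -- which pipeline or service wrote the entry
);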

Querying and Auditing Logs: The DWH’s querying tools allow compliance teams to generate reports that show how models operate, what data was used, and how the AI model has evolved. This simplifies the process of responding to regulatory audits and demonstrates adherence to the AI Act’s requirements.

By combining a DWH with an AI-Mart for AI training data and loading AI model logs into the same infrastructure, businesses can build a comprehensive framework for compliance with the AI Act. This approach supports data governance, ensuring high-quality data for AI models, and ensures traceability, allowing businesses to track and audit every aspect of their AI systems. This not only meets regulatory requirements but also fosters trust and accountability in the use of AI technology.

Upcoming Resources and Events

For more information, contact our team at [email protected].

Final Remarks from the Authors

The AI Act should not only be seen as a regulatory challenge but also as an opportunity for businesses to differentiate themselves by adopting trustworthy and ethical AI practices. As AI continues to evolve, businesses that prioritize compliance, transparency, and human oversight will be better positioned to thrive in the coming years. By taking proactive steps now to ensure compliance, businesses can turn AI regulation into a strategic advantage, building trust with customers, partners, and regulators alike.

 

– Ulf Mattern (Scalefree) & Dr. Céline Helmschrot (GÖHMANN Rechtsanwälte)
