
Soft-Deleting Records in Data Vault: A Real-World Approach to Status Tracking Satellites

Soft-Deleting Records in Data Vault

When working with Data Vault in real-world enterprise data warehousing projects, managing soft-deleted records is more than just a theoretical exercise—it’s a necessity. While many books and training examples offer simplified scenarios, real-world implementations must take into account complex requirements such as status tracking from multiple source systems. In this article, we dive into how to virtualize dimensions in a Data Vault model with proper handling of status tracking satellites, ensuring that soft deletions are effectively managed and the integrity of historical records is preserved.



The Problem with Simplified Examples

Most Data Vault tutorials focus on core concepts: Hubs, Links, and standard Satellites. These foundational examples are useful for learning but fall short when we need to address data that has been logically removed or soft-deleted. In real-world systems, this is a common occurrence, and ignoring it risks producing inaccurate analytics and flawed dimension models.

The key challenge is to determine whether a business entity—like a customer, product, or concept—is still “active” in the eyes of the business. This gets even trickier when the data comes from multiple source systems, each with its own deletion logic. That’s where Status Tracking Satellites (also known as Effectivity Satellites) come in.

Scenario Overview: Multi-Source Concept with Soft Deletes

Consider a scenario where a Concept (e.g., “Customer”) is fed by two different source systems. Here’s a breakdown of the Data Vault objects involved:

  • Concept_PIT – A Point-In-Time (PIT) table indexing data across all satellites.
  • Concept_SAT_S_source1 – A standard satellite with descriptive data from Source 1.
  • Concept_SATST_source1 – A status tracking satellite for Source 1.
  • Concept_SAT_S_source2 – A standard satellite from Source 2 with more descriptive attributes.
  • Concept_SATST_source2 – A status tracking satellite for Source 2.

In this setup, each source system tracks its own deletions, independently of the other. That means a Concept could be deleted in one source but still be active in another. Properly modeling and querying this requires careful integration of all status indicators.

What Do Status Tracking Satellites Actually Do?

Contrary to some misunderstandings, status tracking satellites are not used to track the deletion of descriptive attributes. That’s the role of standard satellites. Instead, they track whether the entire business key has been logically deleted in the source system. For example, if a customer row is completely removed from a source table, the status tracking satellite records that deletion event.

This distinction is important: you might null out a field in the source system without deleting the record. The standard satellite captures that. But if the whole customer row is deleted, only the status tracking satellite will catch it.

How to Use the PIT Table

The PIT table is key to virtualizing dimensions. It acts as a bridge, linking together the various satellites—both descriptive and status tracking—by capturing the effective row per business key and timestamp. Your virtual dimension view selects from the PIT table and joins all relevant satellites using hash keys and load dates.

In this case, you treat the status tracking satellites just like standard satellites. You join them on hash keys and load dates using PIT indexes. This lets you bring in any flags, such as IsActive or deletion timestamps, as part of your view logic.

Business Rules for Determining Active Status

One of the biggest questions in designing this architecture is: How do we determine if a business key is still active?

There are a few approaches:

  • Single-source dominance: If the customer is deleted in the primary (golden) source, the business key is considered deleted.
  • All-source consensus: The business key is only considered deleted if it’s removed from all source systems.

Each approach has its pros and cons. The decision should be based on your business rules and requirements. You can use simple boolean flags like IsActive or more advanced logic combining multiple status indicators from different sources.
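
As a rough sketch, reusing the satellite names from the scenario above and assuming each status tracking satellite exposes an IsActive flag, the two rules translate into different expressions in the dimension view:

SELECT
  pit.BusinessKey,
  -- Single-source dominance: the golden source (Source 1) alone decides
  CASE WHEN st1.IsActive = 1 THEN 1 ELSE 0 END AS IsActive_Dominance,
  -- All-source consensus: the key only counts as deleted once every source has deleted it
  CASE WHEN st1.IsActive = 1 OR st2.IsActive = 1 THEN 1 ELSE 0 END AS IsActive_Consensus
FROM Concept_PIT pit
LEFT JOIN Concept_SATST_source1 st1
  ON pit.HashKey = st1.HashKey AND pit.LoadDateST1 = st1.LoadDate
LEFT JOIN Concept_SATST_source2 st2
  ON pit.HashKey = st2.HashKey AND pit.LoadDateST2 = st2.LoadDate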

Should You Remove Deleted Entities from Dimensions?

This is a hot topic. Some organizations want to remove soft-deleted entities from their dimension tables entirely. While that may sound clean, it can create problems downstream—especially in fact tables where foreign key references still exist for those deleted entities.

The recommended approach? Flag them instead of deleting them. This preserves history and maintains referential integrity. It also helps analysts understand that, yes, the product or customer was deleted—but it still contributed to revenue or other KPIs in the past.

Virtualizing the Dimension View

Your final dimension view should:

  • Select from the PIT table for the appropriate concept.
  • Join to all descriptive satellites using the hash key and PIT load dates.
  • Join to all status tracking satellites similarly, treating them like descriptive sources.
  • Derive an IsActive or IsDeleted flag using business logic.

Here’s a simplified example of what that SQL might look like:

SELECT 
  pit.BusinessKey,
  sat1.Description1,
  sat2.Description2,
  -- Example rule: the key only counts as active while both sources still report it as active
  CASE 
    WHEN st1.IsActive = 1 AND st2.IsActive = 1 THEN 1
    ELSE 0
  END AS IsActive
FROM Concept_PIT pit
-- Descriptive satellites, joined on the hash key and the PIT's per-satellite load dates
LEFT JOIN Concept_SAT_S_source1 sat1 
  ON pit.HashKey = sat1.HashKey AND pit.LoadDate1 = sat1.LoadDate
LEFT JOIN Concept_SAT_S_source2 sat2 
  ON pit.HashKey = sat2.HashKey AND pit.LoadDate2 = sat2.LoadDate
-- Status tracking satellites, joined exactly the same way
LEFT JOIN Concept_SATST_source1 st1 
  ON pit.HashKey = st1.HashKey AND pit.LoadDateST1 = st1.LoadDate
LEFT JOIN Concept_SATST_source2 st2 
  ON pit.HashKey = st2.HashKey AND pit.LoadDateST2 = st2.LoadDate

This query structure ensures that all available data is consolidated and evaluated according to business-specific logic to determine the final dimension state.

Final Thoughts

Handling soft deletes in a Data Vault using status tracking satellites is a robust and scalable solution for real-world enterprise systems. The key is to treat these satellites as regular descriptive tables, include them in your PIT tables, and let business logic drive how deletions are interpreted in the dimension views.

Instead of deleting records from your dimensions—which can break fact table relationships—simply flag them. This gives you the full power of historical traceability while still providing clear information about current entity status. As always, your implementation should follow the business rules defined by your organization or your client.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

How to Model Address Data in Data Vault

Understanding the Nature of Address Data

In many systems, address data doesn’t come in a uniform format. Some systems may embed it directly in a contact or customer table—think “billing address” and “shipping address” fields in Salesforce—while others provide a separate table for addresses and even a relationship table showing links between addresses and business entities.

Let’s look at how to tackle both of these situations using Data Vault best practices.



Case 1: Addresses as Attributes Inside Another Table

If your source delivers addresses as part of another table (e.g., contact data with billing and shipping fields), the Raw Data Vault should model the data exactly as it comes. For example:

  • Create a Contact Hub with a business key for contacts.
  • Attach a Satellite containing billing and shipping address fields like city, street, and ZIP code.

Even if multiple contacts share the same address, duplication is acceptable in the Raw Vault—it’s a reflection of how the source system delivers the data. Optimization or deduplication can happen in the Business Vault or information marts.

Case 2: Addresses in a Separate Table

When your source system contains a dedicated address table, you have two main modeling options:

Option A: Treat Address as a Hub

  • Create an Address Hub using a business key. If no natural key exists, use a surrogate/technical key.
  • Attach a Satellite to store descriptive fields (e.g., street, city, ZIP).
  • Use a Link to relate addresses to other Hubs like Contact, Customer, or Lead.

This pattern is especially useful in industries like insurance, where addresses are treated as critical business objects (e.g., accident location).
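
To make this concrete, here is a minimal DDL sketch of Option A with hypothetical table and column names (hash diffs and other standard metadata columns omitted for brevity):

CREATE TABLE Hub_Address (
  AddressHashKey  CHAR(32)      NOT NULL,  -- hash of the business (or surrogate) key
  AddressBK       VARCHAR(100)  NOT NULL,  -- natural key, or a technical key if none exists
  LoadDate        TIMESTAMP     NOT NULL,
  RecordSource    VARCHAR(50)   NOT NULL,
  PRIMARY KEY (AddressHashKey)
);

CREATE TABLE Sat_Address (
  AddressHashKey  CHAR(32)      NOT NULL,
  LoadDate        TIMESTAMP     NOT NULL,
  Street          VARCHAR(200),
  City            VARCHAR(100),
  ZipCode         VARCHAR(20),
  RecordSource    VARCHAR(50)   NOT NULL,
  PRIMARY KEY (AddressHashKey, LoadDate)
);

CREATE TABLE Link_Contact_Address (
  LinkHashKey     CHAR(32)      NOT NULL,  -- hash over both referenced business keys
  ContactHashKey  CHAR(32)      NOT NULL,
  AddressHashKey  CHAR(32)      NOT NULL,
  LoadDate        TIMESTAMP     NOT NULL,
  RecordSource    VARCHAR(50)   NOT NULL,
  PRIMARY KEY (LinkHashKey)
);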

Option B: Treat Address as Reference Data

  • Store addresses in a flat reference table.
  • Use an ID (like address_id = 55) as a code in a descriptive Satellite on related Hubs.

This is simpler but comes with limitations—it doesn’t track historical changes. For example, if a street name changes, the system won’t retain that history.

Tracking Changes in Reference Data

To address the historical limitation, consider modeling reference data with:

  • A Reference Hub (no hashed keys, just the code, load date, and record source).
  • A Reference Satellite to track changes over time using load date as part of the primary key.

This design allows you to maintain a history of changes in descriptive reference data without violating Data Vault principles.
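
A minimal sketch of this pattern, again with hypothetical names:

CREATE TABLE Ref_Address (
  AddressCode   VARCHAR(20)  NOT NULL,  -- the code itself, no hashed key
  LoadDate      TIMESTAMP    NOT NULL,
  RecordSource  VARCHAR(50)  NOT NULL,
  PRIMARY KEY (AddressCode)
);

CREATE TABLE Ref_Sat_Address (
  AddressCode   VARCHAR(20)  NOT NULL,
  LoadDate      TIMESTAMP    NOT NULL,  -- part of the key, so every change is kept
  Street        VARCHAR(200),
  City          VARCHAR(100),
  ZipCode       VARCHAR(20),
  RecordSource  VARCHAR(50)  NOT NULL,
  PRIMARY KEY (AddressCode, LoadDate)
);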

Handling Many-to-Many Relationships Between Addresses and Entities

Things get more interesting when addresses are shared across multiple business objects (e.g., one address used by both a Lead and a Customer). If your source includes a bridge table (many-to-many), follow this approach:

  • Create an Address Hub.
  • Create individual Hubs for each business object (Lead, Customer, Contact).
  • Establish a Link using the bridge table to represent relationships.
  • Track relationship history using Effectivity Satellites.

If the bridge table uses a generalized object like “Customer” to reference multiple types (Contact, Account, Lead), you’ll need:

  • A generic Customer Hub.
  • Raw business keys or technical IDs (UUIDs) as its keys.
  • Links between the Customer Hub and the Address Hub.
  • Satellites to track effectivity (i.e., from when to when an address is associated).

Resolving Ambiguity with Conditional Relationships

Sometimes, the source system generalizes business objects (e.g., Microsoft CRM’s Customer entity could be an Account, Lead, or Contact). In these cases, build a generic Customer Hub first. Then apply conditional logic to determine whether a Customer is actually a Lead, Contact, or Account, and apply that logic only in the Business Vault, never in the Raw Vault.

This conditional logic would take the form of queries that check whether a UUID from the generic Customer Hub exists in the Lead Hub. If so, you can establish a Business Vault Link between the generic Customer and the Lead.
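
A hedged sketch of such a derived Link, assuming hypothetical hub tables Hub_Customer and Hub_Lead that share the source system’s UUID as their business key:

CREATE VIEW BV_Link_Customer_Lead AS
SELECT
  c.CustomerHashKey,
  l.LeadHashKey,
  c.LoadDate,
  'BUSINESS_VAULT' AS RecordSource
FROM Hub_Customer c
JOIN Hub_Lead l
  ON l.LeadBK = c.CustomerBK;  -- the generic Customer's UUID also exists as a Lead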

Guiding Principles for Modeling Address Data

  • Stay Data-Driven: Model what you see, not what you think should be there.
  • Don’t Add Conditional Logic to the Raw Vault. It belongs in the Business Vault.
  • Use Hubs for real business objects like Address when they are shared across systems or have standalone value.
  • Use Reference Data when addresses are just descriptive codes without relationships.
  • Track History with Effectivity or Reference Satellites if needed.

Ultimately, the choice between treating address data as a business object or reference data depends on your use case. If you’re dealing with complex, shared addresses with historical importance, model them as Hubs. If not, use reference tables or Satellites. But always be consistent and avoid conditional logic in the Raw Vault.

Conclusion

Modeling address data in a Data Vault architecture isn’t one-size-fits-all. Whether it’s Salesforce, SAP, or Microsoft CRM, the goal is to be faithful to your data, follow the architecture’s guiding principles, and maintain flexibility. By doing so, you ensure scalability, auditability, and long-term maintainability of your data warehouse solution.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Know Your Data: Making Data Ownership Work for You

Introduction: The Critical Role of Data Ownership

In today’s rapidly evolving business landscape, managing data effectively is paramount. With increasing regulatory pressures, digital transformation, and a growing reliance on data-driven decision making, clear and defined data ownership becomes a strategic imperative. Without it, organizations risk ambiguity, poor data quality, and non-compliance. This article explores why data ownership is essential for accountability, consistency, and the overall trustworthiness of your data, while providing a clear roadmap to implement effective data stewardship.

The concept is simple: without clearly identifying who is responsible for your data, you invite confusion, inefficiency, and even regulatory penalties. Conversely, establishing clear ownership transforms data from a potential liability into a powerful asset. Whether you are looking to meet the stringent requirements of regulations like GDPR and the EU AI Act, or simply wishing to improve internal communication and decision-making processes, ownership is the key.



Why Data Ownership is Fundamental Today

Let’s delve into the essentials of why data ownership matters. At its core, data ownership is about establishing accountability within an organization. When each segment of data has a designated owner, every piece of information is managed with a specific focus on maintaining quality, compliance, and consistency. This clarity helps in:

  • Ensuring Compliance: With defined responsibilities, it’s easier to meet regulatory requirements such as GDPR, detailed ESG reporting, and the complexities of the EU AI Act. Regulatory bodies demand clear traceability of data – knowing who is accountable for it can prevent fines and reputational risk.
  • Enabling Data Quality: When someone is responsible for a data domain, they are motivated to maintain its accuracy, timeliness, and overall quality. This creates a trustworthy data environment which is critical for advanced analytics and informed decision-making.
  • Aligning Communication: Clear ownership minimizes internal conflicts and misunderstandings between departments. It reduces debates about data definitions and usage, leading to more harmonious and efficient operations.
  • Driving Better Decisions: Ultimately, when data is reliable and well-governed, it forms the foundation for strategic planning, innovative analytics, and effective AI implementations.

In essence, effective data ownership isn’t just a technical or operational necessity—it’s a strategic tool that can drive significant business value.

When Data Lacks Ownership: The High Stakes of Unclear Accountability

The oft-quoted phrase “data is the new oil” highlights the immense value of data, yet without clear ownership, its potential can quickly be undermined. Without accountability, several risks emerge:

  • Fuzzy Accountability: When it is unclear who is responsible for data, errors and delays multiply. Issues such as inaccurate reports or unresolved data discrepancies can lead to operational inefficiencies and financial losses.
  • Poor Quality Data: Without an owner’s vigilant oversight, data quality suffers. Decisions and strategies built on shaky foundations can lead to misguided initiatives and lost opportunities.
  • Regulatory Risks: The absence of a clear data ownership structure can turn regulatory compliance into a nightmare. With GDPR, the EU AI Act, and strict ESG standards, non-compliance is not just costly—it can also damage the trust stakeholders have in the business.

Clear data ownership transforms these risks into opportunities. By appointing dedicated owners, organizations can turn data into a reliable, high-quality asset that fuels better decisions, drives innovation, and facilitates compliance.

Understanding Data Ownership Roles: A Team Effort

Data ownership is not about placing the burden on a single person—it’s a collaborative effort that requires distinct roles. Using an analogy of managing a valuable property can help illustrate this clearly:

Data Owner: The Property Owner

Imagine the data owner as the property owner—usually a business leader. They hold the ultimate accountability for a specific data domain, such as customer data or financial records. Their responsibilities include setting policies, defining quality expectations, and deciding who has access to critical data. They focus on leveraging data for strategic advantages.

Data Steward: The Property Manager

The data steward, akin to a property manager, is a subject matter expert responsible for the day-to-day management of the data. They maintain key definitions (metadata), continuously monitor data quality, and promptly address issues. Their role ensures that the data remains fit for purpose—clean, accurate, and understandable.

Data Custodian: The Maintenance Crew

Finally, the data custodian is like the security and maintenance team responsible for the physical upkeep of a property. In data management, this is typically the IT role that oversees the technical infrastructure. They manage storage, implement robust security controls, control backups, and facilitate access—keeping the data safe and technically accessible.

The key takeaway is that these roles must operate in close collaboration. While each function is distinct, together they create a comprehensive framework that supports secure, reliable, and high-quality data management.

Common Pitfalls in Establishing Data Ownership

Even the most well-intentioned organizations can stumble in implementing data ownership. Understanding common pitfalls is crucial to designing a more practical and effective approach.

  • Lack of Clarity: Often, data ownership exists only on paper. When roles are not operationalized in day-to-day activities, everyone ends up assuming that someone else is responsible for data quality and governance.
  • “Not My Job” Syndrome: Diffusion of responsibility can lead to a culture where critical data falls through the cracks because every team member assumes someone else owns it.
  • Missing Authority: Assigning someone as a data owner without providing the real power, time, or resources to enforce decisions hinders effective data governance.
  • Defaulting to IT: A common error is to assume that IT should automatically be the data owner. However, the true understanding of data often lies within the business side where its meaning and implications are most evident.
  • Overcomplicating the Process: Trying to implement perfect data ownership across every aspect of an organization at once can lead to analysis paralysis. It’s essential to start small and build progressively.
  • Misplaced Faith in Tools: Technology alone, such as data catalogues or governance platforms, cannot solve ownership problems. Without defining the people and processes involved, these tools will only add layers of complexity.

Recognizing and avoiding these pitfalls paves the way for a more pragmatic and sustainable approach to data ownership.

A Pragmatic 5-Step Approach to Effective Data Ownership

Instead of being overwhelmed by the complexities, organizations can follow a pragmatic step-by-step approach to implement data ownership effectively.

  1. Start Small & Focused: Identify one or two critical data domains where the issues are most significant. Whether it’s customer contact information or key financial data, focusing on a few areas initially can deliver rapid improvements.
  2. Appoint and Empower REAL Owners: Assign business leaders as the owners, ensuring they have both the authority and mandate to enforce decisions. It is vital to support them with the necessary resources to act decisively.
  3. Create an Ownership Charter: Draft a simple yet comprehensive charter that documents the roles—Data Owner, Steward, and Custodian—their core responsibilities, and the key processes. This document should define data elements clearly and establish an escalation process.
  4. Track and Communicate: Implement basic metrics to measure data quality, such as completeness, accuracy, and timeliness. Dashboards and regular reports can provide transparency and keep everyone aligned.
  5. Build a Shared Understanding: Develop a common data language across the organization. Use a business glossary and data lineage maps to ensure that every stakeholder is on the same page. Formalize handoffs between teams with clear data delivery agreements.

By following these steps, organizations can establish a culture of accountability and quality, turning data ownership into a powerful driver of business success.

What ‘Good’ Data Ownership Looks Like

When data ownership is effectively established, organizations experience significant benefits, including:

  • Reduced Risk & Faster Issue Resolution: With a designated owner, issues are identified and resolved promptly, reducing the risk of prolonged disruptions and costly errors.
  • Smoother Compliance: Audits and regulatory inspections become less stressful and more straightforward, as clear audit trails and accountability measures are in place.
  • Enhanced Decision-Making: Trusted data leads to smarter, data-driven decisions. It enables reliable analytics, robust business intelligence (BI), and even more effective artificial intelligence (AI) strategies.
  • Increased Operational Efficiency: Teams spend less time searching for data or fixing errors. Clear ownership reduces friction, ultimately speeding up decision-making processes.
  • A Culture of Responsibility: When data is viewed as a shared asset, collaboration increases and data is treated with the care it deserves. This shifts the organizational mindset towards continuous improvement and value creation.

In summary, good data ownership turns what could be a cumbersome obligation into a strategic asset that bolsters every facet of an organization—from compliance and risk management to innovation and operational agility.

Conclusion: Empower Your Organization with Clear Data Ownership

Data ownership is more than an administrative necessity; it is a strategic asset that underpins compliance, quality, and overall business success. By clearly defining who is responsible for data, organizations can ensure that information is managed with precision, accountability, and a strategic focus on value creation.

Remember, the journey starts by identifying key data domains where the pain points are most pronounced. Once you appoint responsible owners and empower them with real authority and clear charter documents, you create an environment where data is nurtured, trusted, and effectively leveraged. This approach not only minimizes risks and regulatory challenges but also sets the stage for innovation and smarter decision-making.

As you move forward, ask yourself: What is the first critical data domain in your organization where clear ownership could unlock real value? The answer to this question may well be the catalyst for transforming your data from a potential liability into your most valued asset.

Embrace the principles of effective data ownership today, and watch as your organization evolves into a more agile, confident, and data-driven powerhouse.

Watch the Video

Meet the Speaker


Lorenz Kindling
Senior Consultant

Lorenz is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies in various industries for Scalefree International. Prior to Scalefree, he also worked as a consultant in the field of data analytics. This allowed him to gain a comprehensive overview of data warehousing projects and common issues that arise.

Applying Soft-Deletes in Dimensions

Soft-Deletes in Dimensions

Data Vault practitioners often encounter the challenge of handling soft deletes in dimensional modeling. While source systems may hard delete records, a data warehouse must maintain historical integrity. This is where effectivity satellites and PIT (Point-in-Time) tables become crucial. Let’s explore how to apply soft deletes effectively in Data Vault dimensions.



Why Soft Deletes?

In a data warehouse, deleting a record outright can lead to inconsistencies, particularly when historical data or fact tables reference that entity. Instead of deleting, we mark records as inactive while preserving history.

Effectivity Satellites for Status Tracking

An effectivity satellite helps track changes in a record’s lifecycle, including soft deletes. This satellite contains:

  • A deletion flag (e.g., isDeleted or isActive).
  • A deletion timestamp (deletedDate).
  • Load date timestamps to track changes.

This setup allows toggling between active and inactive states as records are deleted and restored in the source system.
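
A minimal sketch of such an effectivity satellite, with hypothetical names, could look like this:

CREATE TABLE Customer_SAT_Effectivity (
  CustomerHashKey  CHAR(32)     NOT NULL,
  LoadDate         TIMESTAMP    NOT NULL,  -- one row per detected status change
  IsDeleted        SMALLINT     NOT NULL,  -- 1 = key no longer present in the source, 0 = active
  DeletedDate      TIMESTAMP,              -- when the deletion was detected; NULL while active
  RecordSource     VARCHAR(50)  NOT NULL,
  PRIMARY KEY (CustomerHashKey, LoadDate)
);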

One Satellite per Source System?

Each source system may handle deletions differently. If multiple sources feed the same entity, separate effectivity satellites per source ensure accurate tracking.

Adding Effectivity Satellites to PIT Tables

The PIT table aligns different satellite timelines, allowing efficient query performance. Including effectivity satellites in PIT tables ensures:

  • Accurate status tracking across snapshots.
  • Seamless integration with dimensional views.
  • Faster lookups by pre-aligning timelines.

Virtualized Dimensions Without PIT Tables

Without a PIT table, virtualized dimensions must align effectivity data dynamically. This requires:

  • Joining effectivity satellites on-the-fly.
  • Using window functions like LEAD or LAG for tracking state changes.
  • Referencing a date dimension for time-based joins.

While possible, this approach is less efficient than using PIT tables.
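
As a hedged sketch (hypothetical object names), the current-state variant of such a view can get away with ROW_NUMBER to pick the latest descriptive and effectivity rows per key; a full history view would instead derive validity ranges with LEAD/LAG and join them against a date dimension:

WITH latest_sat AS (
  SELECT HashKey, Description1,
         ROW_NUMBER() OVER (PARTITION BY HashKey ORDER BY LoadDate DESC) AS rn
  FROM Customer_SAT
),
latest_status AS (
  SELECT HashKey, IsDeleted,
         ROW_NUMBER() OVER (PARTITION BY HashKey ORDER BY LoadDate DESC) AS rn
  FROM Customer_SAT_Effectivity
)
SELECT
  h.CustomerBK,
  s.Description1,
  CASE WHEN COALESCE(st.IsDeleted, 0) = 0 THEN 1 ELSE 0 END AS IsActive
FROM Hub_Customer h
LEFT JOIN latest_sat    s  ON s.HashKey  = h.CustomerHashKey AND s.rn  = 1
LEFT JOIN latest_status st ON st.HashKey = h.CustomerHashKey AND st.rn = 1;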

Final Thoughts

Soft deletes in Data Vault dimensions are best managed through effectivity satellites and PIT tables. By storing deletion timestamps and flags, we retain history while allowing records to be marked inactive instead of being removed.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Modeling Links with Null Business Keys in Data Vault

Null Business Keys

Data Vault is a methodology used for modeling data in large-scale data warehouse environments. It’s designed to handle rapidly changing data and to scale easily. One of the challenges faced by data modelers is handling null business keys in Data Vault. In this article, we’ll explore how to model links with null business keys, using an example scenario involving the Entity_Roles table.



Understanding the Challenge

The question we’re dealing with here revolves around how to model the Entity_Roles table when there are null values in the linked data fields. In particular, the table might have records where certain fields like Project or Task are null, creating a challenge when attempting to create links between different entities. Let’s break down the situation and find the best way to handle these null values and ensure smooth data modeling.

Context: The Entity_Roles Table

The Entity_Roles table holds information about the relationship between entities and their roles in different tasks and projects. However, there may be situations where a given record in the table has missing (null) values for certain business keys, such as Project or Task.

What Are Business Keys and Links?

Before diving into the solution, let’s first clarify what business keys and links are in Data Vault. A business key is a unique identifier for a specific business object, such as a project or task. Links are used in Data Vault to represent relationships between business keys. The Entity_Roles table, for example, might act as a link between an entity, its role, and the task or project it’s associated with.

Approaching the Problem: Handling Null Business Keys

The key challenge here is dealing with the null values in the Entity_Roles table. Null values in business keys can create issues when trying to establish relationships between entities. A typical solution in Data Vault is to replace these null values with zero keys. A zero key represents an unknown business key, allowing the link to still function properly despite missing data.

Option 1: Using Zero Keys

One approach to handling null values is to use zero keys for any missing business keys. For instance, if a Task or Project is null, we can assign a zero key to represent the unknown value. The advantage of using zero keys is that it helps maintain query efficiency by ensuring consistent joins, and it allows us to deal with missing data without breaking the link.
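
A sketch of this substitution during staging or link loading, with hypothetical column names (the hash function and the zero-key literal depend on your platform and hashing standard):

SELECT
  MD5(UPPER(TRIM(EntityBK)))                   AS EntityHashKey,
  CASE WHEN TaskBK IS NULL
       THEN '00000000000000000000000000000000' -- zero key: task unknown for this record
       ELSE MD5(UPPER(TRIM(TaskBK)))
  END                                          AS TaskHashKey,
  CASE WHEN ProjectBK IS NULL
       THEN '00000000000000000000000000000000' -- zero key: project unknown for this record
       ELSE MD5(UPPER(TRIM(ProjectBK)))
  END                                          AS ProjectHashKey,
  LoadDate,
  RecordSource
FROM stg_entity_roles;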

Option 2: Splitting the Link into Two Views

Another option could be to pre-stage and create two views to split the Entity_Roles table: one where Task is not null, and another where Project is not null. However, this approach is not recommended because splitting the data may lead to unnecessary complexity and potential inconsistencies in the data. In general, it’s a good practice in Data Vault to avoid splitting data at this level.

Option 3: Handling Multi-Activity Data

If the Entity_Roles table contains multiple rows for the same relationship (e.g., an entity linked to both a task and a project), you may need to account for multi-activity data. This can be done by using a Multi-Active Satellite, which allows you to capture multiple descriptions of the same link. This ensures that you can store all the variations of a relationship, such as different roles or project-task assignments, in the same link.

Designing the Solution

When it comes to designing the solution, there are a few key design decisions to make:

1. Create a Single Link

Rather than splitting the Entity_Roles table, it’s generally better to keep it as a single link. This approach allows you to maintain all relationships in one place and makes the system simpler to manage.

2. Using a Multi-Active Satellite

If there are multiple active roles or tasks associated with a single entity, then using a Multi-Active Satellite is a good option. This satellite would store all the variations of the role-task-project relationships, allowing you to track all relevant data points.

3. Handling Null Values with Zero Keys

As mentioned earlier, replacing null values with zero keys is a good practice in Data Vault. This helps maintain referential integrity, ensures that your links remain intact, and avoids the need for complex filtering or splitting of data.

4. Use of CDC Satellites for Intraday Changes

If your data involves intraday changes (i.e., updates that occur within a single day), you may want to use Change Data Capture (CDC) Satellites. These satellites track changes at the record level, allowing you to preserve the history of the relationships between entities, roles, tasks, and projects.

Additional Considerations

There are a few other considerations when modeling links with null business keys:

1. Should I Create a Weak Hub for ID?

No, it’s generally not recommended to create a weak hub for the ID. Instead, treat the ID as a descriptive attribute and store it in the satellite attached to the link.

2. Where Should I Put Descriptive Data?

Descriptive data should go in the satellite. This can include things like roles, entity types, or other attributes that provide more context to the link between entities.

3. Should I Put the Role in the Link?

If the role is an important business key, you could consider putting it in its own hub. However, if the role is just a descriptive attribute, it’s best to store it in the satellite attached to the link. If the role becomes a business key later, you can refactor the design and create a hub for it.

Final Thoughts

Modeling links with null business keys is a common challenge in Data Vault. The key is to handle null values appropriately, whether that’s by using zero keys, leveraging multi-active satellites, or managing intraday changes with CDC satellites. By maintaining a clear and consistent approach to handling these null values and relationships, you can ensure that your Data Vault design is scalable, efficient, and capable of handling complex data scenarios.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Modeling Project Tasks and Actions in Data Vault

Modeling Project Tasks and Actions

In the world of Data Vault, creating effective data models to capture project tasks and actions is a crucial part of building scalable, efficient, and auditable data solutions. One of the challenges that data engineers face when building such models is ensuring that the history of project tasks and actions is captured correctly, while also accounting for the complexity of slowly changing data and continuous changes like those seen in Change Data Capture (CDC) environments.



Understanding the Data Model

Let’s start by understanding the data model in question. We have three key tables:

  • Projects: This table contains details about various projects.
  • Tasks: This table contains tasks associated with each project. It has a foreign key linking back to the Projects table.
  • Actions: Each task can have one or more actions associated with it, and the Actions table has a foreign key linking back to the Tasks table.

The objective is to create a Project Dimension and a Task Dimension. The Task Dimension should include an additional attribute that represents the latest action associated with each task. To solve this, there are multiple approaches you could take, but two common ones stand out:

  • Multi-Active Satellite (MA-SAT) based on sets: The simplest approach would be to use an MA-SAT model based on sets, selecting the latest action for each task within those sets. However, this can be complex in a CDC environment.
  • Multi-Active Satellite (MA-SAT) based on an MA-Attribute: Instead of using sets, this model would use an action ID as an MA-Attribute, helping to simplify the approach while avoiding the issues created by capturing sets.

Challenges with the CDC Environment

When your source system uses Change Data Capture (CDC), the problem of handling sets becomes more complicated. In CDC, you’re dealing with a stream of changes where new records are added, existing records are updated, and old records may be deleted. This constant flow of data presents two key challenges:

  • Capturing sets is not straightforward: In a typical MA-SAT approach, a set captures all records related to a specific task or project, but with CDC data, you’re more likely to receive individual changes, rather than a full snapshot of all records at once.
  • Volume management: In a CDC environment, where actions are continually added, there is a risk that using sets will generate a massive volume of records. For example, if a new action is added to a task every day, this could result in an overwhelming number of records over time.

Given these challenges, using an MA-Attribute approach where the action ID serves as the key for the task’s latest action provides a more scalable solution. Instead of creating an ever-expanding set, you can focus on just the latest action for each task, keeping the volume under control.

Solution: Using a Non-Historized Link

The idea behind using a Non-Historized Link in this case is to capture actions as they come in from the source system, without the need for sets. In this setup, a Non-Historized Link captures the relationship between tasks and actions and includes the action ID. When new actions come in, they are inserted into the link table, and updates or deletions are handled as changes to the data. This provides a streamlined and efficient way to track the most recent actions associated with each task.

Handling Updates and Deletions

In CDC environments, when updates and deletions occur, it’s important to track these events accurately. The solution to this is technical counter transactions. When an update occurs in the source system, it’s treated as a deletion of the old version and an insert of the new version. This allows the system to track changes over time and ensures the history is accurately captured.

For deletions, we simply negate the value of the original record, ensuring the deleted record is not included in any aggregations. This negation technique is useful for tracking the current state of each task, action, or project.
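
As a hedged sketch, assume a hypothetical satellite on the non-historized link that stores a measure (Amount) plus a CounterFlag of +1 for regular rows and -1 for counter transactions; the current state then falls out of a simple aggregation:

SELECT
  TaskHashKey,
  ActionId,
  SUM(Amount) AS CurrentAmount       -- counter rows cancel out old or deleted versions
FROM Link_Task_Action_SAT
GROUP BY TaskHashKey, ActionId
HAVING SUM(CounterFlag) > 0;         -- fully negated (deleted) actions drop out entirely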

Building the Task Dimension

The task dimension is where we’ll include the latest action for each task. To do this, we can use a technique involving a Point-in-Time (PIT) table, which captures the latest version of data (in this case, the latest action for each task) by linking the task hub to the appropriate satellite data.

Here’s how this works:

  • Create the PIT table for tasks: This table will store the latest version of each task, including references to the task satellite and the action satellite.
  • Use window functions: To retrieve the latest action for each task, you can apply window functions in your SQL queries to order by the business date (or load date) and select the latest action based on these values.

Once the PIT table is created, it can be used to build the Task Dimension, which will then reflect the latest action for each task.
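
A sketch of the window-function step, with hypothetical object names, that could feed either the PIT load or the dimension view directly:

SELECT TaskHashKey, ActionId, ActionLoadDate
FROM (
  SELECT
    l.TaskHashKey,
    l.ActionId,
    l.LoadDate AS ActionLoadDate,
    ROW_NUMBER() OVER (PARTITION BY l.TaskHashKey
                       ORDER BY l.LoadDate DESC) AS rn   -- or order by a business date
  FROM Link_Task_Action l
) latest
WHERE rn = 1;   -- keep only the latest action per task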

Optimizing the Approach

One of the key considerations when implementing this model is performance. Using window functions and PIT tables is an effective way to manage the volume of records, but it’s important to ensure your database is optimized for these types of queries. In high-volume environments, applying business logic during the PIT table loading process can reduce the load time and make data delivery faster.

Conclusion

Modeling project tasks and actions in Data Vault requires careful consideration of the data flow, especially in environments that rely on CDC. By using Non-Historized Links, Multi-Active Satellites based on MA-Attributes, and PIT tables, you can create a scalable, efficient model that captures the latest actions associated with each task, without generating excessive volumes of data.

By applying these techniques, you can streamline the data flow, optimize performance, and ensure the data is always accurate and up-to-date, making it easier for your data users to access the information they need.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Unit of Work in Data Vault

What Is the Unit of Work?

The Unit of Work is a fundamental concept in Data Vault modeling. It refers to the idea of capturing all the related business keys from a source system together in a single Link structure. This ensures that the relationships represented in the raw data are maintained intact in your Raw Data Vault.

Think of it as keeping all pieces of a puzzle together. When you break the pieces apart without a clear strategy to reassemble them, you risk losing the original picture—or worse, creating a picture that never existed in the first place.



The Key Question: What Happens If You Split Links?

The question posed in our Data Vault Friday session was:

“It would be good to refresh the topic of the Unit of Work. What is the information I’d be losing if I split wider Links into smaller pairs (DrivingKey -> Key1/Key2 etc)? I can think of a technical setup where I’d have multiple Links and Effectivity Satellites and I should be able to get the same results to any point in time… What information exactly gets lost if I split Links?”

The Risk of Splitting Links: A Practical Example

Let’s start with a simple example to illustrate the point:

Imagine you have a relationship between three business entities—Customer, Store, and Product. Customers go into Stores and purchase Products. If you model this relationship as a single Link that includes all three Hubs (Customer, Store, Product), you have a clear and accurate picture of that transaction.

Now, suppose you decide to split this Link into two smaller Links:

  • Customer -> Store
  • Customer -> Product

At first glance, this may seem fine. However, when you attempt to reconstruct the original dataset by joining these two Links, you may create new combinations that never existed. For example:

  • Customer A visited Store K and purchased Product X (valid)
  • Customer A visited Store L (also valid), but when you join this with Product X, you create a false relationship—Customer A purchased Product X at Store L (which never happened in reality).

This phenomenon, where new, artificial rows are generated during joins, leads to data inconsistencies. You’ve effectively introduced relationships that didn’t exist in your source system. That’s the major risk when splitting Links.
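
A small sketch makes the effect visible: rejoining the two split Links on the shared Customer key produces every store/product combination for that customer, not just the ones that actually occurred together (table names are hypothetical).

SELECT
  cs.CustomerHashKey,
  cs.StoreHashKey,
  cp.ProductHashKey
FROM Link_Customer_Store   cs
JOIN Link_Customer_Product cp
  ON cp.CustomerHashKey = cs.CustomerHashKey;
-- If Customer A is linked to 2 stores and 1 product, this returns 2 rows,
-- even if the product was only ever purchased in one of the stores.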

The Jedi Test: Verifying Your Model

To avoid these issues, Michael recommends applying what he calls the Jedi Test (yes, Star Wars fans, rejoice!).

The Jedi Test is a simple but powerful validation: Try to reconstruct the original source dataset from your target Data Vault model. If you can recreate the original rows exactly—no more, no less—you’re golden. If you end up with extra rows or missing rows, you’ve got a problem.

This test helps you ensure that your Links are modeled correctly and that no data is lost or inaccurately recreated when splitting Links.

Driving Keys and Unit of Work

One scenario where splitting Links causes definitive problems is when you’re dealing with Driving Keys. Driving Keys are a subset of business keys that uniquely identify a relationship.

Michael explains that splitting components of a Driving Key almost always results in a Unit of Work problem. If you break apart a Driving Key, you risk losing the integrity of your data relationships. That’s why it’s best to keep all Hub references from the same source together in a Unit of Work Link within the Raw Data Vault.

He emphasizes that although he’s not a fan of Driving Keys (calling it a “hate-love relationship”), there are scenarios where they are necessary, and when they are, you must be cautious not to split them.

The Safe Bet: Keep Links Together in the Raw Data Vault

The best practice? Keep all Hub references from the same source table together in one Link—the Unit of Work Link.

This approach guarantees that the full relationship is captured as it appeared in the source data. You can always split these relationships later in the Business Vault, where transformations and additional rules can be applied. But the Raw Data Vault should act as your trusted, immutable source of truth.

By maintaining the Unit of Work in your Raw Data Vault, you protect yourself from potential data inconsistencies and ensure you can always rebuild accurate, reliable data structures downstream.

What If You Have to Split the Link?

Sometimes, technical limitations force you to split a Link. For instance:

  • Your database system limits the number of Hub references you can include in a composite index.
  • You have descriptive data at different granularities, making a wide Link impractical.

In these cases, Michael suggests a few strategies:

  1. Keep the Unit of Work Link as your base Link.
  2. Create additional Links to capture specific granularity needs (Customer -> Product, Customer -> Store, etc.).
  3. If you split Links, ensure you also capture any missing relationships—for example, if you have Customer -> Store and Customer -> Product, you might also need Store -> Product to maintain the complete relationship picture.

These extra Links allow you to recreate the original dataset accurately if needed, but they add complexity. More Links mean more entities, more joins, and potentially slower query performance. That’s why Michael emphasizes the importance of an efficient model and recommends keeping Links together whenever possible.

Efficiency Considerations

From a performance perspective, fewer Links with more Hub references often result in faster queries. You reduce the number of joins and simplify your model. Conversely, more Links increase complexity and the potential for errors.

Therefore, the trade-off is clear: Simplicity and efficiency with a Unit of Work Link versus flexibility with additional Links—but at the cost of complexity and higher risk.

Final Thoughts and Recommendations

To summarize:

  • Always aim to keep Hub references from the same source together in a Unit of Work Link in your Raw Data Vault.
  • Perform the Jedi Test to ensure you aren’t losing or falsely creating data relationships.
  • If you must split Links, document and manage them carefully, ensuring you maintain all necessary relationships (even those not obvious at first glance).
  • Save complex splits and transformations for the Business Vault, where they can be more easily corrected without reloading the entire dataset.

At the end of the day, a well-structured Unit of Work ensures your Data Vault remains a solid, reliable foundation for analytics, reporting, and data science initiatives.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Data Vault on Wide Tables: Best Practices and Considerations

Understanding Data Vault on Wide Tables

Storing data as wide tables in a data lake while applying a logical Data Vault layer using views presents unique challenges. The goal is to virtualize a raw Data Vault model on top of a data lake while ensuring optimal performance.



Key Considerations for Performance Optimization

  • Data Remains the Same: Regardless of whether you use Iceberg, Snowflake, or another technology, the fundamental data characteristics remain unchanged. Issues like dirty data and transformations still apply.
  • Descriptive vs. Transactional Data: Wide tables typically contain a mix of master and transactional data. Most attributes tend to be descriptive, especially in master data.
  • Granularity Matters: Properly defining granularity helps structure the Data Vault model efficiently, especially for hubs, links, and satellites.

Virtualizing Data Vault on a Data Lake

When implementing a virtual Data Vault, consider:

  • Hubs and Links: These should be materialized rather than virtualized. Business keys from wide tables should be extracted and stored in separate Iceberg tables for efficiency.
  • Satellites: Virtualizing satellites as views is recommended (a sketch follows this list), but pay attention to GDPR and the separation of personal data.
  • Indexing Performance: Hubs, links, PIT (Point-in-Time) tables, and bridges serve as indexes in a data lake environment.
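
As a hedged sketch of the satellite part, assuming a hypothetical wide table lake.customer_wide and the same hashing rule used for the materialized hub:

CREATE VIEW Sat_Customer_v AS
SELECT
  MD5(UPPER(TRIM(customer_id))) AS CustomerHashKey,  -- must match the hub's hash rule
  load_date                     AS LoadDate,
  record_source                 AS RecordSource,
  customer_name,
  customer_segment,
  country
FROM lake.customer_wide;

Columns containing personal data would typically go into a separate (or encrypted) satellite view, which keeps the GDPR concerns discussed below manageable.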

GDPR and Data Privacy Challenges

One of the biggest concerns in a wide table approach is data privacy. Since wide tables often include personal data, you must consider logical deletion, encryption, or physical separation techniques to comply with regulations.

Enhancing Performance with Materialized Structures

To achieve good query performance, consider materializing certain structures:

  • Materialized Hubs and Links: These structures act as indexes and improve data retrieval efficiency.
  • PIT and Bridge Tables: These further optimize queries by structuring data in a way that minimizes computational load.
  • Denormalized Information Marts: End-users should query fully materialized information marts, ensuring high-speed access.

Does Datavault4dbt Support This Approach?

Discussions around Datavault4dbt suggest it may support this approach in the future. If you’re working on a project with this implementation, consider reaching out to Scalefree to explore collaboration opportunities.

Conclusion

Applying Data Vault on wide tables within a data lake architecture requires careful planning. The key takeaways include:

  • Virtualizing satellites while materializing hubs and links for performance.
  • Addressing GDPR concerns by separating or encrypting personal data.
  • Using PIT and bridge tables to enhance indexing and query speed.
  • Building fully denormalized information marts for end-user access.

By following these best practices, organizations can ensure efficient and scalable Data Vault implementations on wide tables.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Column Propagation in Coalesce: Handling IT Table Changes

How Coalesce Manages Column Propagation

Change is inevitable in data management. Whether you like it or not, IT table structures evolve for various reasons. When these changes occur, they can impact data pipelines, potentially leading to inefficiencies or even failures. Fortunately, coalesce.io offers a robust solution to manage column propagation seamlessly.



Why Do Table Structures Change?

Changes in database table structures can occur for several reasons:

  • Change in the source system: New data sources, modifications in existing systems, or upgrades can introduce changes.
  • Change in the data ingestion process: Adjustments in ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) processes may require modifications in tables.
  • Development mistakes: Incorrect data modeling, schema design flaws, or unintended changes can also trigger table updates.

Types of Table Structure Changes

Table modifications generally fall into three categories:

  • New Attributes: Additional columns are introduced to capture new data.
  • Removed Attributes: Deprecated or unnecessary columns are eliminated.
  • Changed Datatypes: Column data types are modified for compatibility or optimization.

Impact of Changes in IT Tables

The consequences of these changes can vary widely:

  • Best Case: New data goes unused and unabsorbed, leading to inefficiencies but no immediate failure.
  • Worst Case: Complete pipeline failure, causing data loss or system downtime.

Column Propagation in Coalesce

Coalesce simplifies the process of managing table changes through an efficient column propagation mechanism.

How Column Propagation Works

Column propagation in coalesce.io follows a structured approach:

  1. Select Column: Identify the column that has been added, removed, or modified.
  2. Propagate Addition or Deletion: Ensure that the column change is applied throughout the pipeline.
  3. Mark Downstream Objects: Identify downstream objects that are affected and should be updated accordingly.
  4. Create Commit: Finalize the changes with a commit to reflect them across the system.

Benefits of Using Coalesce for Column Propagation

By leveraging Coalesce’s column propagation features, data engineers can:

  • Automate schema changes: Reduce manual intervention and minimize errors.
  • Ensure data consistency: Prevent mismatches between schema and data models.
  • Improve efficiency: Accelerate change implementation without disrupting workflows.
  • Enhance visibility: Gain better control over how changes impact the entire data pipeline.

Final Thoughts

Managing table changes is a critical aspect of data engineering. With coalesce.io, data teams can seamlessly handle column propagation, ensuring minimal disruptions and optimal performance. Whether you’re dealing with new attributes, removed attributes, or datatype modifications, coalesce.io streamlines the process and enhances data reliability.

Watch the Video

Meet the Speaker


Tim Kirschke
Senior Consultant

Tim has a Bachelor’s degree in Applied Mathematics and has been working as a BI consultant for Scalefree since the beginning of 2021. He’s an expert in the design and implementation of BI solutions, with focus on the Data Vault 2.0 methodology. His main areas of expertise are dbt, Coalesce, and BigQuery.

When to Invest in Data Warehousing

Investing in Data Warehousing

Data warehousing is a crucial step for businesses looking to manage, integrate, and analyze large volumes of data efficiently. But at what point should a company consider investing in a data warehouse? Many assume that only large enterprises need data warehousing, but the reality is more nuanced.



Is Company Size the Right Factor?

Many businesses believe that data warehousing is only necessary when they reach a certain size. However, size alone is not the determining factor. Instead, the key driver for investing in data warehousing is the complexity of data management and the business challenges that arise from it.

Understanding Business Needs and Data Complexity

Some businesses need data solutions early, even during the startup phase. For example, data-driven companies—such as tech startups that rely on analytics—often invest in data warehousing early to gain insights and improve decision-making.

On the other hand, traditional businesses, like small manufacturing firms, may not require advanced data solutions immediately. If a business operates without much digital data, spreadsheets or simple databases may suffice for years.

When Data Integration Becomes a Challenge

As businesses grow, they tend to use multiple systems for different operations. For example, customer data may reside in a CRM, sales data in an ERP, and marketing data in yet another platform. When companies need a unified view of their operations, data warehousing becomes a necessity.

Operational reports from systems like SAP or Salesforce can work for a while, but eventually, businesses realize they need to consolidate data from various sources to make better strategic decisions.

Regulatory and Compliance Requirements

In some industries, regulatory requirements mandate specific reporting and auditing capabilities. A cleaning service provider, for instance, may need to generate reports on the number of windows cleaned to comply with contract terms. Failure to provide such reports can result in financial penalties.

In such cases, investing in data warehousing can be justified by calculating the potential financial losses due to non-compliance. Businesses in regulated industries, such as finance and healthcare, often prioritize data warehousing earlier for compliance reasons.

External Data and Decision-Making

Some businesses receive vast amounts of external data from suppliers, partners, or customers. If decisions rely on analyzing large datasets, a data warehouse can streamline operations, improve accuracy, and reduce manual effort.

For example, a company relying on Excel sheets for pricing decisions may struggle to track and compare data efficiently. A data warehouse helps aggregate, cleanse, and analyze data, enabling better decision-making.

Cost vs. Benefit Analysis

While budget constraints may delay investment in data warehousing, businesses should assess the cost of inefficiencies. If manual data handling, poor integration, or compliance risks cost the company significant time and money, the investment in a data warehouse is often justified.

Final Thoughts

Investing in data warehousing is not solely about company size but rather about business needs and data complexity. Companies facing data-driven challenges, compliance requirements, or inefficiencies in data management should consider investing in a data warehouse—regardless of their size.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
