
Identifying Non-Historized Links

Watch the Video

Identifying Non-Historized Links in Data Vault

In data management, particularly in the Data Vault 2.0 methodology, understanding how to handle non-historized links can be essential for maintaining accurate, traceable records. In today’s Data Vault Friday, we’ll cover the concept of non-historized links, explore the unit of work in data storage, and discuss which elements should be included in link hashes.



Non-Historized Links Background: Understanding the GL Table

Imagine we’re working with a General Ledger (GL) table that stores financial transactions and contains over a billion rows. Each record in this table represents a transaction and includes attributes such as:

  • TransactionID: A unique identifier for each transaction.
  • AccountSegment: A business key identifying the account involved.
  • AccountingDocument: Another business key associated with the document related to the transaction.
  • AccountingPeriod: A reference key indicating the period of the transaction.
  • DebitAmount and CreditAmount: The monetary amounts for each transaction.
  • TransactionTimestamp: The exact date and time of the transaction.

While the TransactionID alone uniquely identifies each transaction, a combination of AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp also provides a unique identifier for each entry. This setup raises questions about how to best organize, hash, and manage these records within a Data Vault model, specifically in terms of defining a Unit of Work and deciding which values should be included in a link hash.

Defining the Unit of Work in Data Vault

The Unit of Work is a concept used in Data Vault to determine which business keys need to be grouped together in a single record. In this context, we’re focusing on how these business keys—AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp—should be structured and managed within a non-historized link.

According to Michael Olschimke, the CEO of Scalefree, the Unit of Work is primarily relevant when dealing with standard relationship-based links. In our example, there’s no need to split the link between these business keys because the link can capture the original granularity of each transaction from the source data. This means that by keeping all four attributes within the same link, we avoid disrupting the Unit of Work and ensure a coherent data structure.

Testing the Structure of the Unit of Work

One way to verify if we’re correctly applying the Unit of Work is to check if the data model enables us to reconstruct the original data source without loss of records or attributes. If we find that we can reconstruct the data accurately using the non-historized link and potentially a Satellite table later, then we’re likely adhering to the Unit of Work correctly.

Since the attributes in our case uniquely identify each transaction, we can assume the Unit of Work is preserved, allowing for a stable and consistent structure in the data model.

Which Values Should Be Hashed in the Link Hash?

The next question concerns which values we should include in the link hash. Generally, the hash key in a link table is derived from the business keys of referenced hubs. In this case, the primary question is whether to hash just the AccountSegment, AccountingDocument, and AccountingPeriod, or if we should also include the TransactionID.

Olschimke suggests a practical approach here: while we could technically get by with hashing either the TransactionID alone or the combination of the other three elements, it’s often better to include all relevant attributes in the hash. This includes the TransactionID along with AccountSegment, AccountingDocument, and AccountingPeriod.

Here’s why:

  1. Consistency with Automation Tools: Many Data Vault automation tools automatically include all hub references and their business keys in the hash computation. Following this approach aligns with standard automation practices, making it easier to work with automation tools later.
  2. Minimizing Potential Issues: Including all attributes in the hash computation reduces the risk of data loss or accidental data duplication. It ensures that our hashes accurately represent each unique record.
  3. Negligible Performance Impact: While adding an extra attribute to the hash computation may slightly increase the string length, the performance impact is minimal. In large-scale data processing, this small change typically does not result in significant slowdowns.
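To make this concrete, here is a minimal Python sketch of a link hash key that includes all four attributes. MD5 hashing, upper-case normalization, and a semicolon delimiter are common Data Vault conventions assumed here, and the sample values are invented:

import hashlib

def link_hash_key(*business_keys, delimiter=";"):
    """Normalize all referenced business keys, concatenate, and hash."""
    normalized = [str(key).strip().upper() for key in business_keys]
    return hashlib.md5(delimiter.join(normalized).encode("utf-8")).hexdigest()

# All four attributes go into the hash, TransactionID included:
hk_gl_transaction = link_hash_key(
    "TX-000123",  # TransactionID
    "SEG-42",     # AccountSegment
    "DOC-77",     # AccountingDocument
    "2024-03",    # AccountingPeriod
)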

In Practice: Setting Up Non-Historized Links

In Data Vault, non-historized links are a powerful tool for managing large datasets without versioning each record. By including all four attributes in the hash computation, we establish a stable link structure that reflects the source data’s original granularity. This approach allows us to confidently manage and retrieve accurate records without fear of unintentional data loss.

For practitioners, setting up these non-historized links can be straightforward when using automation tools that handle most of the heavy lifting. It’s generally recommended to follow the automation tools’ conventions, especially for larger datasets, as this minimizes discrepancies and facilitates smoother data processing in the long run.

How to Submit Your Questions

If you’re working with Data Vault and have a question similar to this one, Data Vault Friday sessions provide an opportunity to get expert insights. To submit a question, you can use the form, which allows you to attach diagrams, whiteboard images, or other supporting materials for clarity. Alternatively, for faster responses, you can post questions on the Data Innovators Exchange, where consultants actively respond and discuss Data Vault topics.

For those interested in more in-depth training, Scalefree also offers webinars, including sessions on WhereScape and dbt, which run monthly. These provide valuable resources for anyone looking to deepen their Data Vault expertise.

Conclusion

Identifying non-historized links and deciding on a consistent hashing approach can significantly impact data integrity and model reliability. By adhering to Data Vault principles and utilizing automation tools effectively, you can manage vast amounts of data without sacrificing accuracy or consistency.

With the right approach, you’ll be able to create a resilient, scalable data model that aligns with your organization’s needs and remains flexible for future changes. If you’re diving deeper into Data Vault, stay tuned for more Data Vault Friday insights and don’t hesitate to participate in the community for ongoing support and expertise.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Snapshot Based Bridge Table on Link and Effectivity Satellite

Watch the Video

Snapshot Based Bridge Table on Link and Effectivity Satellite

Welcome to another edition of Data Vault Friday! Today, we’re diving into the concept of a snapshot-based bridge table, particularly focusing on its application in scenarios with links and effectivity satellites. This approach helps us handle complex relationships between business objects over time, manage changing relationships, and retain a complete historical view. In this article, we will explore solutions to three main questions:

  • How to handle a missing relationship on day 2 compared to day 1
  • How to manage changed relationships from day 1 to day 2 for the same key
  • How to include a business object (A3) on day 2 when it lacks a relationship to any object in B


Understanding the Data Vault Modeling Example

In our example, we have two sources, each representing a different business object (A and B). Business object A is represented as a static dataset, while business object B shows dynamic relationships to A over three days. Each day brings changes in the relationships, with some entries disappearing or shifting. Our task is to capture these changes effectively in a snapshot bridge table.

Examining the Relationship Changes

Let’s walk through the changes observed over three days:

  • Day 1: B1 is related to A1, B2 to A2, and B3 to A3.
  • Day 2: The relationships change. B1 is now related to A2, B2 to A1, and B3 disappears.
  • Day 3: No further changes occur compared to Day 2.

Our objective is to document these relationships using an Effectivity Satellite and a Bridge Table, enabling us to query the state of relationships as they existed on each day.

Creating the Link and Effectivity Satellite

The first step in capturing these relationships is to create a Link table that holds a distinct list of relationships between business objects. Links should not have additional metadata, such as validity dates, as this can complicate data retrieval and reduce performance. In our example, the Link table captures each unique combination of A and B keys but does not record their start or end dates.

Next, we create an Effectivity Satellite. This table extends the Link by recording each relationship’s start and end timestamps, as well as an “is_active” flag to indicate the current status of each relationship. Using this table, we can track when a relationship starts, changes, or ends. Let’s examine how this works:

  • Day 1 entries: All relationships (B1-A1, B2-A2, B3-A3) are marked as active.
  • Day 2 entries: New relationships (B1-A2 and B2-A1) are added and marked active, while previous relationships (B1-A1 and B2-A2) are marked inactive. B3-A3 is removed entirely.
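To illustrate, here is a hypothetical Python representation of the Effectivity Satellite after the Day 2 load. The column names, the ISO dates standing in for Day 1 and Day 2, and the plain link identifiers are all invented for readability; a real satellite would carry hash keys and load timestamps:

effectivity_satellite = [
    {"link": "B1-A1", "load_date": "2024-01-01", "end_date": "2024-01-02", "is_active": False},
    {"link": "B2-A2", "load_date": "2024-01-01", "end_date": "2024-01-02", "is_active": False},
    {"link": "B3-A3", "load_date": "2024-01-01", "end_date": "2024-01-02", "is_active": False},
    {"link": "B1-A2", "load_date": "2024-01-02", "end_date": None, "is_active": True},
    {"link": "B2-A1", "load_date": "2024-01-02", "end_date": None, "is_active": True},
]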

Building the Snapshot-Based Bridge Table

With the Effectivity Satellite in place, we can now create a Bridge Table that snapshots the active relationships for each day. This table provides a point-in-time view of the relationships as they existed on a particular day. Let’s look at how the Bridge Table is created:

Day 1 Snapshot

The Day 1 snapshot reflects the initial relationships, pulling all active records from the Effectivity Satellite. At this stage, B1-A1, B2-A2, and B3-A3 are all active.

Day 2 Snapshot

In the Day 2 snapshot, only the relationships B1-A2 and B2-A1 remain active, while B3-A3 is removed. By applying a filter to include only active entries, the snapshot accurately represents the relationships on Day 2.

Day 3 Snapshot

Day 3’s snapshot is identical to Day 2, as no additional changes were made. The active relationships B1-A2 and B2-A1 remain unchanged.

This process ensures we have a clear audit trail of relationship changes and deletions. Each day’s snapshot represents the state of relationships at that point in time, without introducing redundant data.
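A small sketch of that snapshot logic, reusing the effectivity_satellite rows from the earlier sketch; it assumes, consistent with those rows, that a relationship counts as active on a date when it was loaded on or before that date and not yet end-dated:

def bridge_snapshot(effectivity_satellite, snapshot_date):
    """Return relationships active as of snapshot_date (ISO strings compare safely)."""
    return [
        row for row in effectivity_satellite
        if row["load_date"] <= snapshot_date
        and (row["end_date"] is None or row["end_date"] > snapshot_date)
    ]

day1 = bridge_snapshot(effectivity_satellite, "2024-01-01")  # B1-A1, B2-A2, B3-A3
day2 = bridge_snapshot(effectivity_satellite, "2024-01-02")  # B1-A2, B2-A1
day3 = bridge_snapshot(effectivity_satellite, "2024-01-03")  # identical to Day 2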

Handling Missing Data and Reinstating Relationships

One critical feature of the Effectivity Satellite is the ability to manage reinstated relationships. For instance, if B1-A1’s relationship is reestablished on Day 4, we add new rows in the Effectivity Satellite, marking the previous active combination (B1-A2) as inactive and reactivating B1-A1. This dynamic structure makes it easy to adjust for relationships that appear, disappear, and reappear over time.

Incorporating A3 in the Bridge Table on Day 2

A common challenge is how to incorporate business objects like A3 on Day 2, despite having no relationship in B on that day. In Data Vault, this is often addressed in downstream queries or report joins rather than in the Bridge Table itself.

By starting with object A and performing a left join to the Bridge Table, you will include all records from A (even if they don’t appear in B). This ensures that unlinked objects are included in the results, with their B relationships shown as null, or as a placeholder if desired.
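A minimal Python sketch of that join behavior, with the Day 2 bridge reduced to a plain lookup table (all keys invented):

# Day 2 snapshot as a lookup from A-keys to B-keys
bridge_day2 = {"A2": "B1", "A1": "B2"}

# Left join from A: every A record appears, A3 with None as a placeholder
all_a = ["A1", "A2", "A3"]
report_rows = [(a, bridge_day2.get(a)) for a in all_a]
# [("A1", "B2"), ("A2", "B1"), ("A3", None)]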

Preventing Cartesian Products and Other Best Practices

When using a Bridge Table, it’s essential to avoid unexpected Cartesian products, which can inflate data during aggregation. Always check the cardinality of relationships between objects (e.g., one-to-one or many-to-many) to ensure joins occur only on necessary keys. The driving key, which anchors the relationships in the Link Table, should be the primary focus, especially in cases of many-to-many relationships.

This method ensures accuracy and performance when aggregating data and avoids inflated results in reporting.

Conclusion

Snapshot-based bridge tables are powerful tools in Data Vault modeling for handling changing relationships and tracking historical snapshots. By carefully structuring Links, Effectivity Satellites, and Bridge Tables, we create a robust, auditable trail of data changes over time. As we’ve discussed, this approach allows us to accommodate missing data, reinstate relationships, and prevent data inflation, ensuring reliable, performant data models.

If you’d like further assistance with templates for bridge tables or Effectivity Satellites, feel free to reach out to me. I hope this discussion has provided clarity on using bridge tables in Data Vault. Join us again for more Data Vault insights, and feel free to submit your questions!

Meet the Speaker


Marc Winkelmann

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.

Learning the Data Vault Patterns

Watch the Video

Learning the Data Vault Patterns with Azure Synapse

When it comes to managing complex, evolving data landscapes, implementing a Data Vault architecture is a popular choice. The Data Vault approach is especially helpful in handling large volumes of data while maintaining flexibility, scalability, and historical tracking. Recently, a question was raised regarding the use of Azure Synapse Analytics with Data Vault for the initial Raw Data Vault development without a finalized automation tool. Specifically, the question was: Is it feasible to start development with Synapse Notebooks using PySpark and Delta Lake?

This article will explore the answer to that question and dive into key considerations, best practices, and strategies to make the most of a manual setup while preparing for eventual automation in your Data Vault implementation.



Is It Feasible to Start Data Vault with PySpark and Delta Lake?

The short answer is yes—starting with Synapse Notebooks, PySpark, and Delta Lake is indeed a viable option. In fact, this manual approach can be an excellent way to familiarize your team with Data Vault concepts, patterns, and methods before committing to an automation tool. By manually building the Raw Data Vault, your team can gain hands-on experience with essential processes, which will make the transition to automation smoother and more effective later on.

Historically, many Data Vault practitioners began with manual scripting due to limited automation tools. Over time, this “manual-first” method became a useful way to learn the intricate patterns of Data Vault. Today, automation tools for Data Vault are abundant, and using them is generally more efficient, but there’s still a place for manual methods, especially in the early learning stages of a project. Let’s look closer at why this approach works and what to consider as you work manually.

Benefits of Starting Manually with Synapse, PySpark, and Delta Lake

Using PySpark and Delta Lake in Azure Synapse Notebooks gives your team flexibility to:

  • Learn Core Data Vault Patterns: Building the Raw Data Vault manually helps the team understand Data Vault concepts, such as Hubs, Links, and Satellites. This is crucial knowledge that will benefit the project long-term.
  • Experiment with Modeling: Working without automation allows you to refine your approach and test different design patterns. This is especially helpful in creating a foundation that’s tailored to your organization’s specific needs and datasets.
  • Understand Data Transformation and Ingestion: By manually scripting transformations and ingesting data, your team will better understand the processes that an automation tool would handle. This will help in configuring automation later and troubleshooting any issues that arise.
  • Validate Requirements and Patterns: Since no tool has been chosen yet, working manually allows you to get a head start on modeling and confirm your business and technical requirements early in the project.
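As a starting point, the following PySpark sketch loads a hub from a Delta staging table inside a Synapse notebook. It relies on the notebook-provided spark session; the paths, column names, and SHA-256 hashing convention are illustrative choices, not a prescribed pattern:

from pyspark.sql import functions as F

# `spark` is provided by the Synapse notebook runtime
stage = spark.read.format("delta").load("/lake/stage/customer")

candidates = (
    stage.select(
        F.sha2(F.upper(F.trim(F.col("customer_id"))), 256).alias("hk_customer"),
        F.col("customer_id").alias("customer_bk"),
        F.current_timestamp().alias("load_date"),
        F.lit("CRM").alias("record_source"),
    )
    .dropDuplicates(["hk_customer"])
)

hub = spark.read.format("delta").load("/lake/raw_vault/hub_customer")

# Insert-only: append business keys the hub has not seen yet
(candidates
    .join(hub.select("hk_customer"), "hk_customer", "left_anti")
    .write.format("delta").mode("append").save("/lake/raw_vault/hub_customer"))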

The Drawbacks of a Manual Approach

While starting manually has its advantages, it’s important to be aware of the limitations. The primary drawbacks of a manual approach are:

  • Time and Effort: Developing the Raw Data Vault by hand is time-consuming. Each process, from creating Hubs to tracking Satellites, requires careful attention to ensure the design aligns with Data Vault standards.
  • Limited Scalability: A manual setup is challenging to scale, especially as the data volume grows. While PySpark and Delta Lake are powerful tools, they’re not a substitute for the scalability offered by automation tools.
  • Risk of Technical Debt: Scripts developed manually might not be as maintainable or reusable as the templates generated by automation tools. Technical debt can accumulate if the team spends too much time maintaining manual scripts, and transitioning to automation could later require extensive rework.

Steps to Optimize Your Manual Development Process

If you decide to move forward with this manual approach, here are some strategies to make the process more efficient and to set a foundation for a future transition to automation:

  1. Document Your Patterns Thoroughly: Take detailed notes on the specific design patterns, scripts, and models you develop manually. These can serve as templates when you move to automation, making the transition much easier.
  2. Define Clear Modeling Standards: Establish consistent modeling practices for Hubs, Links, and Satellites. This will reduce ambiguity and provide a structured foundation for automation tools to build upon later.
  3. Refine and Iterate: Since you’re building manually, use this time to refine your models. Adjust and improve them based on the unique data flows and needs of your organization.
  4. Focus on Core Entities: Prioritize building out core Hubs and Links in your Raw Data Vault, focusing on the entities that are most crucial to your organization. This will create a solid foundation that can be extended as you move to automation.

Preparing for Automation: What to Keep in Mind

Even as you start manually, keep automation in mind as a goal. Today, there are numerous Data Vault automation tools available, each with its own strengths and weaknesses. As you prepare for this transition, here are some key considerations:

1. Research Automation Tools

Explore the different automation tools available on the market. Each tool has its own approach, interface, and capabilities, so it’s essential to choose one that aligns well with your organization’s technical needs, budget, and data infrastructure. Some tools focus on business-user accessibility, while others offer more technical configurations. Common tools for Data Vault automation in Azure include solutions specifically designed for Synapse or those that support PySpark and Delta Lake.

2. Select Tools with Scalability

When selecting a tool, consider how it handles scalability, as this is vital for supporting the growing volume of data in a Data Vault. Some automation tools handle scalability better than others, depending on how they manage Hubs, Links, and Satellites. In Azure Synapse, it’s also important to assess compatibility with Delta Lake and PySpark, as well as overall integration with Azure’s data ecosystem.

3. Account for Tool Limitations

Even the best automation tools have limitations. Be prepared to adjust your approach based on the chosen tool’s capabilities. For example, some tools may limit specific patterns, such as complex Satellites or multi-active relationships. By understanding these limitations ahead of time, you can avoid rework and ensure that your initial manual development aligns well with the selected tool.

4. Focus on Configurability and Customization

Ensure that your chosen tool allows for some level of customization. This is important because the patterns you develop manually may need to be fine-tuned or adjusted within the tool. Look for tools that offer configurable templates, adaptable interfaces, and support for customizations to fit your organization’s specific needs.

Moving Forward: Transitioning to Automation

As your team becomes familiar with Data Vault patterns through manual development, the next step is to select and implement an automation tool. While the manual work provides a deep understanding of Data Vault patterns, an automation tool will streamline repetitive processes, ensure consistency, and save significant time and effort as your data volume grows.

One recommended approach is to use the lessons learned during manual development to create tailored templates and workflows within your chosen tool. By doing so, you can optimize the automation tool’s capabilities based on the patterns you’ve already tested and refined. This makes for a smoother, more effective transition from manual development to automated workflows.

Final Thoughts

Starting Data Vault development manually with Synapse Notebooks, PySpark, and Delta Lake is a feasible and often beneficial approach, especially when automation tools haven’t been finalized. While this method is time-consuming and requires effort, it offers valuable insights and allows your team to learn and optimize Data Vault patterns before committing to an automation tool.

Remember, the goal is to use this manual phase to build a strong foundation, explore modeling choices, and establish best practices. When the time comes to select an automation tool, your team will be well-prepared to leverage it to its fullest potential, ensuring a scalable and efficient Data Vault implementation within Azure Synapse Analytics.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Handling JSON Data in Data Vault Satellites on Snowflake

Watch the Video

Handling JSON Data in Data Vault Satellites on Snowflake

In this blog post, we’ll discuss the challenges and best practices for handling JSON data in Data Vault architectures, specifically when using Snowflake. This question from our audience raises a common scenario in modern data environments where semi-structured data is prevalent. Let’s dive into how storing JSON in Data Vault Satellites works, when to extract fields, and the performance considerations for Snowflake.

Why Store JSON in the Satellite?

With the increasing use of APIs and complex data integrations, many businesses are receiving data as JSON. JSON is a flexible, semi-structured format that allows for varying structures and nested elements, making it ideal for some applications. However, this flexibility also introduces complexities when using a structured data model like the Data Vault.

The challenge here is that when you receive a JSON payload, it may contain hundreds of fields, many of which may not be immediately needed. Extracting every JSON field to store as individual columns in the Raw Data Vault can be inefficient and unnecessary. Therefore, one solution is to store the JSON “as-is” in the Satellite and extract only what’s required later in the Business Vault.

Choosing When to Extract Fields from JSON

As a general rule, it’s beneficial to extract attributes that are needed for reporting, security, or compliance early in the process. Relational attributes stored directly as columns in the Raw Data Vault are easier to query, and they’re stored more efficiently in structured databases.

For other fields in the JSON payload that are less critical or not immediately required, you can keep the data in its JSON format. This approach minimizes upfront work and allows for flexible data exploration later without overloading the Raw Data Vault with unnecessary fields.

For example, consider an e-commerce environment where transaction data is received in JSON format. The Raw Data Vault could extract only essential fields like transaction ID, customer ID, and total amount while leaving other detailed information about the items purchased in JSON format.

JSON and Snowflake: Optimized for Performance

One of Snowflake’s strengths is its ability to store and query JSON data natively. Snowflake’s support for semi-structured data formats like JSON and its efficient handling of this data type make it possible to perform queries directly on JSON attributes without needing to first extract them into columns.

This capability allows you to store JSON data in the Raw Data Vault and query it directly, even in production, without significant performance penalties. This approach works well for data that may only be needed occasionally or where performance is not a major concern. Snowflake’s built-in functions for working with JSON make it easy to extract specific attributes on demand and use them in downstream reporting.
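For illustration, such a direct query might look as follows; Snowflake’s colon-path syntax reads attributes straight out of a VARIANT column. Every table and attribute name here is invented, and the statement is kept as a Python string so it could be issued through the Snowflake Python connector:

DIRECT_JSON_QUERY = """
SELECT hk_transaction,
       payload:customer_id::string        AS customer_id,
       payload:total_amount::number(18,2) AS total_amount
FROM   raw_vault.sat_transaction_json
WHERE  payload:status::string = 'SETTLED'
"""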

Handling Business Logic in the Business Vault

In Data Vault, business rules are often applied within the Business Vault layer, which is designed for derived and calculated data. When working with JSON in this context, it’s common to keep the original JSON structure in the Satellite and then perform transformations as needed in the Business Vault.

In Snowflake, you can take advantage of virtual views to simplify your data pipeline. Rather than physically extracting and storing each JSON attribute in the Business Vault, you can create virtual views that access JSON fields as needed. This approach offers the benefit of flexibility and reduces storage requirements since you avoid duplicating data.
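A sketch of such a virtual view, again held in a Python string for execution through a Snowflake session; all object names are illustrative:

CREATE_PROFILE_VIEW = """
CREATE OR REPLACE VIEW business_vault.customer_profile AS
SELECT hk_customer,
       load_date,
       payload:email::string        AS email,
       payload:loyalty.tier::string AS loyalty_tier
FROM   raw_vault.sat_customer_json
"""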

Structuring Semi-Structured Data Over Time

Storing JSON in the Satellite provides flexibility, but as your business needs evolve, you may find that some attributes become more relevant for analysis. At this point, you may consider structuring your JSON data incrementally, extracting only the fields that are frequently queried or required for compliance.

This gradual structuring process aligns with the Data Vault’s design principles. Over time, your semi-structured data will naturally become more organized as business requirements clarify which fields are necessary for analytics or reporting. You can continue to leave less critical fields in JSON format, maintaining flexibility while optimizing performance where it matters.

Privacy and Security Considerations with JSON in Satellites

One important consideration when storing JSON data in the Satellite is ensuring compliance with data privacy regulations such as GDPR. It’s essential to classify personal or sensitive data in the JSON payload and handle it appropriately.

In cases where sensitive data is involved, you may choose to split your Satellites by data classification, isolating high-sensitivity fields into their own Satellite tables with stricter access controls. Alternatively, you can extract and separate specific fields that require special handling and leave the rest in JSON format.

Working with JSON Arrays and Nested Objects

JSON data often includes nested structures, such as arrays, that add complexity to handling data in a relational model. For instance, customer records might contain an array of phone numbers or addresses. In these cases, Snowflake offers tools for managing nested data without requiring extensive restructuring.

One option is to store the entire JSON array as-is within a JSON attribute in the Satellite. For example, a JSON array containing multiple phone numbers could be stored directly in the Satellite, eliminating the need for a separate table to manage multi-active data. This approach provides a simpler alternative to the traditional multi-active Satellite design.

Snowflake allows you to work with nested JSON data using its FLATTEN function, which simplifies querying nested structures without requiring complex joins or data transformations. This method can be particularly useful when dealing with highly nested data, especially in real-time applications.
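As a sketch, the following statement unnests a hypothetical phone_numbers array with LATERAL FLATTEN; every object name is invented:

FLATTEN_PHONES = """
SELECT s.hk_customer,
       f.value:number::string AS phone_number,
       f.value:type::string   AS phone_type
FROM   raw_vault.sat_customer_json s,
       LATERAL FLATTEN(input => s.payload:phone_numbers) f
"""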

Virtualizing JSON Access in Snowflake

Snowflake’s JSON handling capabilities allow you to query JSON fields dynamically, making it possible to virtualize the access to JSON attributes in your Satellite tables. Virtual views can be used to transform JSON data for reporting without needing to physically extract every field into columns.

This approach offers significant flexibility. By keeping the JSON data in its original form, you avoid unnecessary data transformations and can leverage Snowflake’s powerful JSON functions to create reports on demand. Virtualization is an effective way to simplify your data pipeline, particularly when dealing with complex JSON structures.

Conclusion

Storing JSON in the Data Vault Satellite layer on Snowflake is a sound approach, especially when dealing with semi-structured or complex data. By extracting only the fields that are critical and leaving the rest in JSON format, you balance flexibility with efficiency. Snowflake’s support for JSON makes it possible to work with this data in real-time, leveraging virtual views to keep your pipeline agile.

Consider implementing a mix of JSON storage and relational attributes, structuring your data incrementally as your business needs evolve. With the right approach, Snowflake’s JSON capabilities allow you to handle diverse data types efficiently, supporting both compliance requirements and analytics needs.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Multi-Active Satellites & Dependent Child Keys

Watch the Video

Understanding Multi-Active Satellites and Dependent Child Keys in Data Vault

Multi-Active satellites and dependent child keys provide solutions for storing data with complex granularities in Data Vault models. Data Vault is known for its highly structured enterprise data warehousing approach, built on Hubs, Links, and Satellites to capture data lineage, maintain historical accuracy, and ensure scalability. However, specific data scenarios, such as handling different data granularities, often lead to questions on multi-active satellites and dependent child keys. This article breaks down these concepts and clarifies their differences and use cases in a Data Vault environment.

 

What is a Multi-Active Satellite?

A multi-active satellite is designed to manage multiple records for a single business object that are active simultaneously. This scenario arises when a business object, like a customer, can have several active data entries at the same time. For example, a customer could have multiple addresses (home and work), both of which are valid at the same time.

In a typical satellite structure, a business key (e.g., customer ID) combined with a load date timestamp defines the primary key. However, in cases of multiple active records, this primary key is insufficient because it won’t uniquely identify each active instance. Instead, an additional attribute, such as an address type (home or work), is added to the primary key to differentiate each record. This approach allows the satellite to track multiple entries for the same business key without duplicating data and helps capture finer details in the data warehouse.

Example of a Multi-Active Satellite

Let’s say our source system has a customer with ID C123 who has two active addresses: one for home and one for work. In a standard satellite, we might have one record per business key. But in a multi-active satellite, we store both addresses simultaneously by using an additional identifier (e.g., “address type”) in the primary key:

  • Customer ID: C123
  • Load Date: Timestamp of data load
  • Additional Identifier: Address type (e.g., home, work)

This approach allows multiple entries for a single business object (in this case, customer C123) while maintaining unique records in the satellite table.
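Sketched as Python rows, the multi-active satellite might look like this; the business key stands in for the hash key for readability, and all names and values are invented:

# Primary key: (customer_key, load_date, address_type)
sat_customer_address = [
    {"customer_key": "C123", "load_date": "2024-05-01", "address_type": "home",
     "street": "1 Main Street", "city": "Hanover"},
    {"customer_key": "C123", "load_date": "2024-05-01", "address_type": "work",
     "street": "9 Office Park", "city": "Hanover"},
]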

What is a Dependent Child Key?

A dependent child key is used to manage relationships between multiple business objects at a finer granularity level than a standard Data Vault link would allow. Dependent child keys are typically applied in links where we need to track multiple occurrences of a relationship between business objects, such as an order and its line items.

Consider an order containing multiple line items, where each item references a product. Here, the dependent child key (like line item number) uniquely identifies each relationship instance, as it provides additional detail beyond just the order and product identifiers. This allows multiple rows in the link for the same business objects while maintaining unique records.

Example of a Dependent Child Key

Imagine we have an order O123 for a customer C123, which includes two line items for the same product but with different prices or quantities:

  • Order ID: O123
  • Customer ID: C123
  • Product ID: P123
  • Dependent Child Key: Line item number (e.g., 1, 2)

In this case, we create unique rows for each line item, where the line item number differentiates each record. This approach ensures that each entry is stored and tracked individually.
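A matching sketch of the link rows; descriptive values such as price or quantity would live in a satellite attached to this link, so only keys appear here (all values invented):

# Uniqueness comes from the dependent child key "line_item"
link_order_line = [
    {"order_key": "O123", "customer_key": "C123", "product_key": "P123", "line_item": 1},
    {"order_key": "O123", "customer_key": "C123", "product_key": "P123", "line_item": 2},
]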

Key Differences Between Multi-Active Satellites and Dependent Child Keys

Although multi-active satellites and dependent child keys both enable handling of finer data granularity, they serve different purposes and are used in distinct contexts:

  1. Multi-Active Satellites
    Applied within a single business object to handle multiple active records at the same time. The additional identifier helps capture simultaneous entries for the same object in a satellite.
  2. Dependent Child Keys
    Used in links between multiple business objects, where the additional key captures the finer detail of each relationship instance, such as line items in an order.

When to Use Each Approach: Multi-Active Satellites & Dependent Child

The choice between using a multi-active satellite or a dependent child key depends on the data granularity and relationships in your data model:

  • Use Multi-Active Satellites when handling multiple active records for a single business object, where each entry is related only to the primary business key (e.g., customer with multiple addresses).
  • Use Dependent Child Keys when tracking detailed relationships between different business objects that require additional identifiers to maintain uniqueness (e.g., order and line items).

Summary

Multi-active satellites and dependent child keys are helpful for storing data with complex granularities in Data Vault models. While multi-active satellites allow multiple simultaneous records for a single business object, dependent child keys enable unique identification of complex relationships in links. Both approaches maintain Data Vault’s principles of scalability and data integrity by preserving unique records and enabling detailed tracking of business data.

In short:

  • Multi-Active Satellite: For multiple records active simultaneously within a single business object.
  • Dependent Child Key: For relationships across multiple business objects that need finer detail, typically in links.

Meet the Speaker


Marc Winkelmann

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.

Pre-Commit: The First Line of Defense in Shift-Left Development


Pre-Commit as a Foundation for Code Quality

In software development, ensuring code quality and minimizing defects early is crucial. The shift-left approach emphasizes detecting and resolving issues as soon as possible, rather than later in development or after deployment. Pre-commit supports this approach by managing multi-language pre-commit hooks, scripts that run automatically before finalizing a commit. These hooks act as a first line of defense, enforcing coding standards, preventing bugs, and enhancing security.

This article examines how the combination of pre-commit and the shift-left approach can enhance development workflows, resulting in more reliable and maintainable software. We will explore the advantages of using pre-commit hooks, their function within the shift-left framework, and offer practical advice for effective implementation. By adopting these practices, teams can improve code quality, accelerate development cycles, and deliver robust software solutions more efficiently.



Understanding Pre-Commit


General Hooks in Versioning Systems

In version control systems like Git, hooks are scripts that are triggered by various events within the repository. These scripts can automate tasks, enforce policies, or improve workflows. Hooks are generally categorized into two types: client-side and server-side. Client-side hooks run on the local machine and can be triggered by operations such as committing or merging changes. Server-side hooks run on the server and can be triggered by events like receiving pushed commits.

(Diagrams: a local repository where the pre-commit hooks accept a clean commit, and one where a faulty commit is rejected.)


The Role of Pre-Commit Hooks

Among the various types of client-side hooks, pre-commit hooks are particularly significant. These hooks are executed before a commit is finalized. They check for potential issues in the code, such as syntax errors, code style violations, or even security vulnerabilities. By running these checks before the code is committed to the repository, pre-commit hooks ensure that only high-quality code makes it into the version control system.


Pre-Commit Use Cases

Pre-Commit is a versatile tool that can be adapted to various scenarios within the development workflow, offering significant flexibility and control to development teams. One key use case is hosting pre-commit code locally, allowing developers to customize and manage hooks according to their project’s specific requirements. This local hosting ensures that all team members adhere to the same standards without relying on external repositories.

Another powerful feature of pre-commit is the ability for users to create custom hooks tailored to their unique needs. These hooks can enforce specific coding standards, run specialized security checks, or automate repetitive tasks, providing a high degree of flexibility. Teams can write these hooks in any language supported by their environment, allowing pre-commit to seamlessly integrate into their workflow.

Furthermore, pre-commit can be integrated into Continuous Integration (CI) pipelines to enhance automation and enforce quality checks across all stages of development. By incorporating pre-commit hooks into the CI process, teams can ensure that code meets quality standards before being merged into the main codebase. This integration helps maintain high standards of code quality and reliability throughout the development lifecycle.


Multi-Language Support in Pre-Commit

One of the key strengths of pre-commit is its support for a wide range of programming languages. This makes it an ideal tool for projects that involve multiple languages or frameworks. For example, in DataOps, it is common to use Terraform for infrastructure provisioning and dbt for data transformation. Pre-commit’s multi-language support allows it to be seamlessly integrated into such diverse development environments, providing consistent quality checks across different languages and tools. This cross-language capability ensures that best practices are enforced and issues are detected regardless of the technologies used in the project.


Example for the Use of Pre-Commit

Pre-Commit operates through a configuration file called pre-commit-config.yaml. This YAML file serves as the central configuration that pre-commit references every time a commit is made. It defines which hooks should be executed and where to fetch them from before a commit is finalized in your Git repository. An example of such a file is shown below, followed by an explanation of its key elements:

repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.90.0 
    hooks:
      - id: terraform_validate
      - id: terraform_fmt
        args:
          - --args=-recursive
  • repo: Specifies the URL of the repository containing the hooks. In this example, https://github.com/antonbabenko/pre-commit-terraform is the repository from which hooks will be fetched.
  • rev: Indicates the specific revision or version of the repository (v1.90.0 in this case) to use. This ensures consistency in hook behavior across different executions.
  • hooks: Defines a list of hooks to be run from the specified repository. Each hook is identified by its unique id.
  • id: Represents the identifier for each hook. In this example, terraform_validate and terraform_fmt are the hook IDs, each corresponding to a specific check or task the hook performs.
  • args: Specifies additional arguments or options to be passed to the hook command. In the case of terraform_fmt, the args field includes --args=-recursive, which instructs the hook to format Terraform files recursively within the project directory.

This configuration allows you to tailor pre-commit to your project’s needs by specifying which hooks to execute, where to retrieve them from, and how to ensure consistency through defined versions or revisions. Each hook ID corresponds to a particular quality check or task that pre-commit will enforce before allowing a commit to proceed, ensuring higher code quality and adherence to project standards.
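Once the configuration file is committed to the repository, running pre-commit install registers the hooks with Git, and pre-commit run --all-files executes them against the whole codebase rather than only the staged changes; both commands are part of the standard pre-commit command-line interface.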


Recommended Pre-Commit Repositories for Effective Integration

To kickstart your use of pre-commit effectively, consider leveraging the following repositories tailored for specific programming languages and tools:

  • Terraform Repository: antonbabenko/pre-commit-terraform
    Offers a comprehensive set of hooks for Terraform projects, including validators and formatters to ensure consistent and high-quality code.
  • Dbt (Data Build Tool) Repository: dbt-checkpoint/dbt-checkpoint
    Provides hooks specifically designed for dbt projects, enabling validation and enforcing best practices for SQL-based data transformations.
  • SQL Repository: sqlfluff/sqlfluff
    Offers hooks for SQL linting to ensure syntax correctness, adherence to coding standards, and optimization of query performance.

These repositories offer essential hooks that integrate seamlessly with pre-commit, enabling automated checks to maintain code quality and consistency across Terraform, dbt, and SQL projects. By incorporating these hooks into your workflow, you can streamline development processes and ensure robust software deployments. However, these are just suggestions; you are free to use any other hooks that benefit your use case.


Conclusion

Pre-Commit exemplifies the shift-left approach by automating quality checks early in the development cycle, before code is committed. This proactive strategy catches issues like syntax errors, formatting inconsistencies, and security vulnerabilities early, reducing downstream defects.

By using a centralized pre-commit-config.yaml file, Pre-Commit ensures consistent coding standards across teams. It allows developers to focus on coding rather than manual reviews, speeding up software delivery.

Supporting various languages and tools, Pre-Commit can be customized for specific project needs, enhancing code reliability through automated checks. Integrating Pre-Commit improves code quality, efficiency, and promotes continuous improvement, leading to faster time-to-market, lower development costs, and more reliable software.

About the Author


Moritz Gunkel

Moritz is an aspiring Consultant in the DevOps department for Scalefree, specializing in cloud engineering, automation, and Infrastructure as Code, with a particular knack for Terraform. While juggling his responsibilities, including pursuing a Bachelor’s degree in Computer Science as a working student, Moritz’s methodical approach has significantly impacted internal operations, earning him recognition and setting the stage for an exciting transition into a full-time consulting role. With a passion for innovation and a commitment to excellence, Moritz is set to continue making a lasting impression in the dynamic world of DevOps.

Modelling Address Data

Watch the Video

Modeling Address Data: Key Insights and Recommendations

Address data is one of the fundamental components in various business databases, especially where detailed customer information is essential. This complexity can make the modeling process challenging, particularly when aiming for a single “Hub” that consolidates all address information. In a recent discussion on Data Vault Friday, Michael Olschimke of Scalefree explored the best ways to model address data effectively, considering key business and regulatory factors. This article will provide a summary of those insights, offering recommendations for creating a robust, scalable, and efficient address data model.

The Context: Address Data in a Single Hub

In this scenario, the challenge presented was how to model various address types within a single Hub. The primary objective was to avoid redundant address data and simplify the handling of NULL values. In line with ISO 20022 standards and European Union regulations, the data model included attributes such as STREET_NAME, BUILDING_NUMBER, BUILDING_NAME, ROOM, FLOOR, POSTAL_CODE, TOWN_NAME, COUNTRY_CODE, ADDRESS_LINE_1, and ADDRESS_LINE_2. Each of these elements is part of a composite business key used to uniquely identify each address.

A practical but complex solution proposed by the team was to replace NULL values with a placeholder (e.g., “-2”) to streamline the loading process and minimize handling issues. However, Olschimke proposed several alternative approaches to ensure the model is both sustainable and scalable.

Challenges with Replacing NULL Values

Replacing NULL values in business keys can simplify loading but has significant drawbacks downstream, particularly when managing dimensional data and maintaining business logic clarity. For instance, when NULL values are replaced with a placeholder like “-2,” this value could appear in downstream reports, causing confusion. Olschimke suggested a more nuanced approach that utilizes “fixed hash values,” such as all zeros or all F’s, to represent empty or erroneous values.

By using fixed hash values, it becomes easier to identify default or error states directly within the data structure. This approach avoids unnecessary complexity when filtering data downstream and improves the clarity and manageability of data processing operations.
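A small Python sketch of that convention; the 32-character length matches an MD5 hash and is an assumption here, as is the normalization:

import hashlib

ZERO_KEY = "0" * 32   # fixed value representing a missing business key
ERROR_KEY = "f" * 32  # fixed value representing an erroneous business key

def address_hash_key(business_key):
    """Hash a business key, falling back to the fixed zero key when it is NULL."""
    if business_key is None:
        return ZERO_KEY
    return hashlib.md5(str(business_key).strip().upper().encode("utf-8")).hexdigest()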

Avoid Overloaded Hubs and Null Values in Composite Keys

One of the main points of consideration was the risk of “overloading” Hubs, which occurs when multiple business objects with different semantic meanings are stored within the same Hub. This is particularly common when different types of addresses are stored under a single business key, where each address type may not require all fields (e.g., ROOM or FLOOR).

According to Olschimke, overloaded Hubs introduce complexity due to differing data granularities and missing values across address types. For example, multiple buildings, floors, or rooms could exist under a single address, resulting in multiple granularities within a single Hub. This makes it challenging to maintain clear, meaningful data relationships. Instead, he advised defining clear granularity levels and possibly separating address types or using more flexible data structures.

Alternative Modeling Solutions: JSON-Based and Reference Tables

In cases where multiple address types require flexibility, Olschimke suggested using JSON-based data structures. JSON provides flexibility in defining address attributes dynamically, storing only the keys available for a particular address. This approach reduces the risk of overloading and accommodates varying address structures without creating a complex, rigid schema.

JSON-based Hubs allow for hashing the address data as a single JSON object, ordered by key names to prevent duplicates. However, this approach requires a consistent, standardized order of attributes when hashing to ensure duplicate-free keys. For instance, JSON formatting could streamline the Hub and enable more adaptive data loading while simplifying downstream data extraction.
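A minimal Python sketch of that hashing rule, assuming MD5 and canonicalization through sorted JSON keys; the attribute names echo the ISO 20022 fields mentioned earlier:

import hashlib
import json

def json_address_hash(address):
    """Hash an address as one JSON object, ordered by key name to avoid duplicates."""
    canonical = json.dumps(address, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.upper().encode("utf-8")).hexdigest()

# The same attributes in a different order produce the same hash key:
a = {"STREET_NAME": "Main St", "POSTAL_CODE": "30159", "TOWN_NAME": "Hanover"}
b = {"TOWN_NAME": "Hanover", "STREET_NAME": "Main St", "POSTAL_CODE": "30159"}
assert json_address_hash(a) == json_address_hash(b)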

Additionally, using reference tables is another approach for frequently accessed address data, enabling deduplication without over-complicating the Hub. Reference tables act as dedicated sources of address data, indexed by a unique address ID, which reduces redundancy across other Hubs.

Considering Address Data as Descriptive in Satellites

Instead of adding addresses as business keys in the Hub, it can be more effective to store them as descriptive attributes within a Satellite structure. This avoids overloading the Hub with attributes that may not always be needed for identifying the business key itself. By storing address data in Satellites linked to the primary business entity (e.g., customers, stores), you can achieve a balance between deduplication and schema simplicity.

Olschimke recommended this approach particularly when the main goal is to eliminate redundancies across address data. This approach aligns with a best practice in Data Vault modeling: Satellite tables should contain descriptive data that change over time, while Hubs contain only essential business identifiers.

Applying Business Rules in Data Vault Modeling

Address data often requires additional business rules, especially when handling complex keys or duplicates. Olschimke pointed out that handling NULL values with a placeholder complicates creating downstream dimensions. Instead, a two-step approach was advised: (1) defining the business keys within the Hub with fixed placeholders (e.g., all zeros or all F’s) for default and error handling and (2) standardizing the Satellite structure to handle varying address formats dynamically.

Ultimately, each business has unique requirements, and the choice between single Hubs, JSON structures, and Reference tables will depend on how critical the address data is to the core business operations. By focusing on avoiding overloading and ensuring scalability, businesses can set up a Data Vault model that minimizes long-term maintenance while maximizing data clarity and accessibility.

Conclusion

Modeling address data in a Data Vault context can be intricate, especially when attempting to create a unified Hub that supports various address types. The key considerations discussed by Olschimke emphasize flexibility, simplicity, and adherence to business rules without overloading Hubs. JSON-based keys, reference tables, and Satellite structures offer alternative approaches to managing address data, allowing you to avoid pitfalls associated with NULL placeholders and composite keys.

For businesses tackling complex address data requirements, experimenting with these alternatives may yield significant benefits, particularly in managing data deduplication, compliance, and future scalability.

Interested in learning more? Check out Scalefree’s webinars and consider joining the Data Innovators Exchange community for discussions on data modeling, cloud computing, and Data Vault 2.0 best practices.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Loading SAP CDC Data into GDPR Compliant Data Vault

Watch the Video

Loading SAP CDC Data into a GDPR-Compliant Data Vault

When managing change data capture (CDC) data from SAP in a Raw Data Vault, special considerations are needed for both CDC information and GDPR-relevant personal data. This post will cover how to model CDC data in a Data Vault, including the unique handling of created, updated, and deleted records. We’ll also discuss best practices for splitting data into separate satellites to manage GDPR-compliant attributes, including empty columns and privacy concerns.

This content is based on a discussion led by Michael Olschimke, CEO of Scalefree, during a Data Vault Q&A session.

CDC Data Modeling in the Data Vault

The primary challenge with CDC data is that it only includes changes from SAP, not the full dataset each time. CDC data typically includes metadata on whether a record was created, updated, or deleted in SAP. Here’s a look at how to approach modeling this data in a Data Vault:

1. Load CDC Data in Satellites with Adjusted Patterns

In Data Vault, the data model remains unchanged, regardless of how the data is delivered (batch, CDC, or real-time). However, the loading pattern for CDC data into satellites needs some adjustments:

  • Delta Check Adjustment: Normally, the Data Vault delta check identifies changes before loading data into a satellite. With CDC data, the changes are already captured, so this step can be bypassed. Instead, all changes from CDC data are loaded into the appropriate satellites directly.
  • Change Impact Across Satellites: When there’s a change in one attribute, it triggers an update in all relevant satellites. While this approach can create non-delta records, the impact is typically minimal, and the redundant data can be compressed for storage efficiency.
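A simplified Python sketch of the adjusted pattern from the first bullet; the field names are invented, and a real implementation would append to a satellite table rather than a list:

# CDC rows already carry the change, so the usual hash-diff delta check is skipped
def load_cdc_rows(cdc_batch, satellite_rows):
    for row in cdc_batch:
        satellite_rows.append({
            "customer_key": row["customer_key"],
            "load_date": row["extracted_at"],
            "change_type": row["change_type"],  # create, update, or delete
            **row["attributes"],
        })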

Handling GDPR-Relevant Personal Data

CDC data often includes both regular attributes and GDPR-sensitive personal data. In the Data Vault, personal data attributes should be separated based on privacy and security classes to ensure compliance and manage access. Here’s the recommended approach:

2. Splitting Satellites Based on Privacy Classifications

For GDPR compliance, split CDC data into multiple satellites:

  • Personal Data Satellite: A dedicated satellite for GDPR-relevant attributes (such as names or IDs). This separation allows for tighter security and privacy control.
  • Non-Personal Data Satellite: General attributes with no privacy concerns go into a separate satellite to reduce the risk of exposure.
  • Additional Splits: Further splits may be required based on rate of change, security levels, or business context, depending on the specific needs of your organization.

Maintaining separate satellites for different classes of data ensures that personal information is handled with stricter privacy controls, helping your data architecture comply with GDPR requirements.

Managing Empty Columns in the Data Vault

It’s common for source tables to contain columns that are always empty. When working with CDC data in a Data Vault:

  • Include Empty Columns for Auditing: To retain full traceability and auditability, include empty columns in the satellite. This preserves the exact structure of the source data without altering it.
  • Consider Separate “Unused Data” Satellite: If there are many empty columns, these can be grouped into a dedicated satellite, making the primary satellites leaner for users.

This approach allows for flexibility if the data in these columns becomes relevant in the future. Auditors will appreciate the comprehensive structure, and the Data Vault will retain all source data in its original form.

Example Satellite Structure

With GDPR compliance and CDC loading adjustments in mind, here’s an example structure for splitting SAP CDC data into satellites:

Satellite: CDC_Personal_Data
- Attributes: GDPR-relevant data (e.g., personal names, social security numbers)
- Metadata: Load date, source, change type (create, update, delete)
- Purpose: Privacy-controlled access

Satellite: CDC_NonPersonal_Data
- Attributes: Non-personal data columns
- Metadata: Load date, source, change type
- Purpose: General access

Satellite: CDC_Unused_Columns
- Attributes: Columns always empty in the source table
- Metadata: Load date, source
- Purpose: Compliance and future-proofing
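
Expressed as table definitions, the split might look like the following sketch. The parent hash key hk_business_partner, the attribute columns, and all data types are assumptions for illustration:

-- Illustrative DDL for the satellite split; all names and types are assumptions.
CREATE TABLE sat_cdc_personal_data (
    hk_business_partner CHAR(32)     NOT NULL,  -- parent hub hash key
    ldts                TIMESTAMP    NOT NULL,  -- load date timestamp
    rec_src             VARCHAR(50)  NOT NULL,  -- record source
    change_type         CHAR(1),                -- CDC metadata: C / U / D
    full_name           VARCHAR(200),           -- GDPR-relevant attribute
    social_security_no  VARCHAR(20),            -- GDPR-relevant attribute
    PRIMARY KEY (hk_business_partner, ldts)
);

CREATE TABLE sat_cdc_nonpersonal_data (
    hk_business_partner CHAR(32)     NOT NULL,
    ldts                TIMESTAMP    NOT NULL,
    rec_src             VARCHAR(50)  NOT NULL,
    change_type         CHAR(1),
    account_group       VARCHAR(10),            -- non-sensitive attributes
    industry_code       VARCHAR(10),
    PRIMARY KEY (hk_business_partner, ldts)
);

CREATE TABLE sat_cdc_unused_columns (
    hk_business_partner CHAR(32)     NOT NULL,
    ldts                TIMESTAMP    NOT NULL,
    rec_src             VARCHAR(50)  NOT NULL,
    unused_col_1        VARCHAR(100),           -- always empty in the source today
    unused_col_2        VARCHAR(100),
    PRIMARY KEY (hk_business_partner, ldts)
);

Separating the tables this way also means access rights can be granted per satellite, so the privacy-controlled table can be locked down independently of the general-access one.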

Best Practices for Satellite Splitting in the Data Vault

When splitting data into satellites, follow these best practices:

  • Split by Privacy and Security: Ensure that personal and non-personal data are stored separately, particularly when handling GDPR-relevant information.
  • Split by Source System: Keep different source systems in separate satellites for clarity and maintainability.
  • Consider Business Needs: If certain data attributes are only relevant to specific business cases, split them accordingly to reduce satellite complexity.

These principles provide a clean, secure, and compliant Data Vault structure that enables efficient data retrieval, flexibility, and regulatory adherence.

Conclusion

Modeling SAP CDC data in a GDPR-compliant Data Vault involves adjustments to loading patterns, especially when dealing with CDC deltas and GDPR-sensitive data. By separating data based on privacy classes and including empty columns where necessary, you can ensure compliance and maintain a flexible data model. The approach outlined here simplifies the handling of CDC data, while providing robust auditing and privacy control.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Modeling Demographic Data

Watch the Video

Modeling Demographic Data for the Raw Data Vault: A Practical Guide

When working with demographic data for population analysis, one often encounters datasets that include variables such as region, year, age, and sex. For organizations implementing a Data Vault methodology, the task then becomes modeling this data in the Raw Data Vault in a way that enables efficient use in the Business Vault for proportion calculations and other business logic.

In this blog post, we’ll explore how to approach modeling demographic data for the Raw Data Vault. We’ll look at considerations for reference tables, hubs, links, and satellites, and discuss why a simplified approach can be beneficial while still capturing change history when needed. Let’s walk through an example based on a question that was raised during a Data Vault Q&A session.

Defining the Dataset and Objective

Let’s start by outlining the dataset and the main objective. Suppose we have a demographic dataset sourced externally (e.g., from a national bureau) that includes the following columns:

  • Region
  • Year
  • Age
  • Sex
  • Population count per region, year, age, and sex

The goal is to load this dataset into the Raw Data Vault and make it available for calculations in the Business Vault, such as determining population proportions across different dimensions.

The Simplified Modeling Approach

To understand the modeling approach, let’s consider the basic elements of Data Vault architecture:

  • Hubs: Tables that hold unique business keys and serve as identifiers.
  • Links: Tables that define relationships between hubs.
  • Satellites: Tables that hold descriptive data and track changes over time.

Since this dataset contains demographic attributes with no true business keys, modeling choices hinge on balancing simplification and change capture. Here’s a step-by-step breakdown of the approach:

1. Flatten the Data Structure First

The simplest model for this dataset would be to create a flat, wide table that includes columns for region, year, age, sex, and population value. This structure would load the data directly without further separation into hubs, links, or satellites.

  • Pros: Easy to query and manage, especially if there’s no need to track changes over time.
  • Cons: Lacks support for tracking updates or changes in population values.

This approach works well if the data is static and updated infrequently, but it limits flexibility for versioning or incremental updates.
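
For illustration, the flat variant might look like the following sketch; the table name ref_demographics_flat and the column types are assumptions:

-- Flat, wide load of the demographic dataset; names and types are illustrative.
CREATE TABLE ref_demographics_flat (
    region      VARCHAR(50)  NOT NULL,
    year        INTEGER      NOT NULL,
    age         INTEGER      NOT NULL,
    sex         CHAR(1)      NOT NULL,
    population  BIGINT       NOT NULL,
    ldts        TIMESTAMP    NOT NULL,  -- load date timestamp
    rec_src     VARCHAR(50)  NOT NULL,  -- record source
    PRIMARY KEY (region, year, age, sex)
);

Note that with the primary key on the full attribute combination, a revised population count can only overwrite the existing row, which is exactly the change-tracking limitation described above.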

2. Introducing Reference Hubs and Satellites for Change Tracking

To address the limitations of a flat table, we can take advantage of reference hubs and satellites. Reference hubs allow us to treat region, year, age, and sex as reference codes. In a reference hub, each unique combination of these codes is treated as a single, composite key.

The structure then includes:

  • A Reference Hub with region, year, age, and sex as identifying attributes.
  • A Reference Satellite with population counts as the descriptive attribute, which links back to the Reference Hub.

By introducing these reference structures, we ensure that changes in population values can be captured over time. If a new population record comes in with a different population value for a given region, year, age, and sex, it is added to the Reference Satellite as a new row. This provides a change history without needing to modify the original row.

Modeling Strategy and Design

With a focus on simplification, here’s how the structure would look in the Raw Data Vault:

  • Reference Hub: Combines the codes for region, year, age, and sex. This setup eliminates the need for separate hubs for each attribute and keeps the model straightforward. The composite key formed by region, year, age, and sex uniquely identifies each record, while the hub serves as a central reference.
  • Reference Satellite: Attached to the Reference Hub, the Reference Satellite holds the population value and includes a load date for tracking when data was loaded. This enables us to capture historical changes efficiently. If an update occurs for the population value, a new row is added to the Satellite with an updated load date, creating a versioned history.

Example Structure:

Reference Hub: Demographic_Hub
- Primary Key: Region, Year, Age, Sex
- Additional Columns: Load Date (ldts), Source (src)

Reference Satellite: Population_Satellite
- Foreign Key: Region, Year, Age, Sex (composite key from Hub)
- Population Value
- Load Date (ldts)
- Source (src)

In this setup, all demographic attributes are contained within a single hub and linked to the population value in the satellite. This design maintains a simple and effective data model that supports historical data changes without additional complexity.
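
A minimal DDL sketch of this setup follows; the names and types are assumptions, mirroring the structure above:

-- Reference hub keyed on the demographic codes; names are illustrative.
CREATE TABLE demographic_hub (
    region  VARCHAR(50)  NOT NULL,
    year    INTEGER      NOT NULL,
    age     INTEGER      NOT NULL,
    sex     CHAR(1)      NOT NULL,
    ldts    TIMESTAMP    NOT NULL,  -- load date timestamp
    src     VARCHAR(50)  NOT NULL,  -- record source
    PRIMARY KEY (region, year, age, sex)
);

-- Reference satellite; each change to a population value adds a new ldts row.
CREATE TABLE population_satellite (
    region      VARCHAR(50)  NOT NULL,
    year        INTEGER      NOT NULL,
    age         INTEGER      NOT NULL,
    sex         CHAR(1)      NOT NULL,
    ldts        TIMESTAMP    NOT NULL,
    src         VARCHAR(50)  NOT NULL,
    population  BIGINT       NOT NULL,  -- descriptive attribute
    PRIMARY KEY (region, year, age, sex, ldts),
    FOREIGN KEY (region, year, age, sex)
        REFERENCES demographic_hub (region, year, age, sex)
);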

Why Avoiding Multiple Hubs and Links Makes Sense

In theory, one could create separate hubs for each demographic attribute—region, year, age, and sex—and link them together. However, this approach introduces unnecessary complexity without adding value in this context. Here’s why:

  • Multiple Joins: Multiple hubs and a link table require additional joins, increasing complexity when querying the data.
  • Increased Entity Count: Additional hubs and links inflate the number of entities, making the data model harder to understand and maintain.
  • Performance Concerns: Each join adds processing cost, which can slow down queries, especially with larger datasets.

By consolidating all demographic attributes into a single hub, we reduce the number of entities and simplify the model, making it easier to use and maintain while still meeting the business needs.

Advantages of a Single Reference Hub with a Satellite

This approach is particularly advantageous because it balances simplicity with flexibility. Here’s how:

  • Change Capture: With the reference satellite, we can track historical population data changes over time. Each new row represents an update, identified by load date, making it easy to see when population data was updated.
  • Granularity and Scalability: The single hub provides a consistent granularity for the data, ensuring that queries are straightforward and changes are easy to manage.
  • Performance Efficiency: Joining based on region, year, age, and sex is computationally efficient. These attributes are typically small (e.g., integer or small character values), reducing the burden on processing.

Using the Model in Business Logic

Once the demographic data is loaded into the Raw Data Vault, it’s ready for use in the Business Vault. Here’s how it could be used for business logic:

  1. Joining Data: In the Business Vault, data analysts can join other datasets with the demographic reference satellite on the region, year, age, and sex attributes to incorporate population data.
  2. Proportion Calculations: With population counts available, proportion calculations (e.g., the proportion of a certain demographic group in a region) become straightforward (see the query sketch after this list).
  3. Temporal Analysis: The load date in the satellite provides historical tracking, allowing analysts to view demographic changes over time and analyze trends.
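
Here is the query sketch referenced in item 2, assuming the population_satellite structure from the DDL sketch above; it picks the most recent satellite row per key and computes each sex's share of a region's population per year:

-- Latest population per key, then the proportion of each sex
-- within a region and year; names follow the DDL sketch above.
WITH latest AS (
    SELECT region, year, age, sex, population,
           ROW_NUMBER() OVER (
               PARTITION BY region, year, age, sex
               ORDER BY ldts DESC
           ) AS rn
    FROM population_satellite
)
SELECT
    region,
    year,
    sex,
    SUM(population) AS group_population,
    SUM(population) * 1.0
        / SUM(SUM(population)) OVER (PARTITION BY region, year) AS proportion
FROM latest
WHERE rn = 1
GROUP BY region, year, sex;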

Potential Challenges and Solutions

One common concern is that joining on multiple columns (region, year, age, sex) could impact performance. In practice, with appropriate indexing and the small size of these columns, the overhead is usually negligible. For larger datasets, partitioning on region or year may further improve query performance.

Conclusion

In summary, a simplified model that leverages a single reference hub with a satellite provides a highly effective way to model demographic data in a Raw Data Vault. This approach allows for straightforward use in the Business Vault and ensures flexibility for tracking historical changes, all while keeping the model manageable and efficient.

This example illustrates how, by focusing on simplification and change capture, organizations can build an effective demographic data model in the Raw Data Vault that meets both current and future needs for population-based analysis.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Loading CDC Data into Multi-Active Satellites and Status Tracking Satellites

Watch the Video

Mastering CDC Data in Data Vault 2.0

Change Data Capture (CDC) is a powerful mechanism for tracking changes in source systems. However, when the primary key in your source system differs from the business key used in your Data Vault hub, you may encounter challenges in loading data into multi-active satellites. This article explores various strategies for handling CDC data in such scenarios, offering practical solutions to ensure accurate and efficient data loading.



Understanding the Challenge

In many source systems, the primary key is a technical identifier unknown to the business. Instead, the business key represents the meaningful identifier for a business object. In a typical scenario, the relationship between the primary key and the business key is one-to-one. However, in some cases, multiple records can be active for the same business key on the same date, resulting in multi-activity.

This situation arises when the primary key is unique at a given point in time, but the business key is not. For instance, you might have multiple customer IDs in your source system (primary keys) referring to the same customer (business key).


Solution 1: Verify Multi-Activity

Before diving into complex solutions, it’s crucial to verify whether the data is genuinely multi-active. In some cases, the appearance of multi-activity might be due to records being deleted and recreated with the same business key, resulting in different primary keys.

To check this, analyze the CDC data and other technical columns in the source system to determine the order of events. If a sequence of delete and create operations is detected, you may not be dealing with true multi-activity.


Solution 2: Create a Multi-Active Satellite with Delta Checking

If the data is genuinely multi-active, the most straightforward approach is to create a multi-active satellite. Perform delta checks on the combination of the business key and the multi-active attribute (e.g., customer ID). This ensures that only changes within specific multi-active groups are loaded into the satellite.

However, this approach necessitates a specialized point-in-time (PIT) table, as the CDC data provides changes at the finest granularity (row level). You’ll need to consider both the load date timestamp and the multi-active attribute when querying the satellite to retrieve the most recent delta.
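
To make that querying requirement concrete, the following sketch retrieves the most recent delta per business key and multi-active attribute. The object names (ma_sat_customer, hk_customer, customer_id, attribute_1) are assumptions, and delete handling is omitted for brevity:

-- Most recent delta per (business key, multi-active attribute) pair;
-- all object names are illustrative.
SELECT s.hk_customer,
       s.customer_id,        -- multi-active attribute
       s.ldts,
       s.attribute_1
FROM ma_sat_customer AS s
JOIN (
    -- delta-check granularity: business key plus multi-active attribute
    SELECT hk_customer, customer_id, MAX(ldts) AS max_ldts
    FROM ma_sat_customer
    GROUP BY hk_customer, customer_id
) AS latest
  ON  s.hk_customer = latest.hk_customer
  AND s.customer_id = latest.customer_id
  AND s.ldts        = latest.max_ldts;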


Solution 3: Remodel with Satellites on Links

Another option is to remodel your Data Vault structure by placing the satellite on the link. In this approach, the multi-active attribute becomes a dependent child key in the link, and a standard satellite is created on this link. This simplifies the handling of multi-activity within the link itself.

However, it’s important to note that the satellite in this case describes the relationship between the customer and other components, rather than directly describing the business object hub. Evaluate whether this modeling change aligns with your downstream querying requirements.
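
A minimal sketch of this remodeling, using hypothetical names (lnk_customer_account, hk_customer_account, and so on); the dependent child key would also feed into the link hash key calculation:

-- Link with the multi-active attribute as a dependent child key.
CREATE TABLE lnk_customer_account (
    hk_customer_account CHAR(32)     NOT NULL,  -- link hash key (includes customer_id)
    hk_customer         CHAR(32)     NOT NULL,  -- customer hub reference
    hk_account          CHAR(32)     NOT NULL,  -- second hub reference
    customer_id         VARCHAR(20)  NOT NULL,  -- dependent child key (source PK)
    ldts                TIMESTAMP    NOT NULL,
    rec_src             VARCHAR(50)  NOT NULL,
    PRIMARY KEY (hk_customer_account)
);

-- A standard satellite then hangs off the link instead of the hub.
CREATE TABLE sat_customer_account (
    hk_customer_account CHAR(32)     NOT NULL,
    ldts                TIMESTAMP    NOT NULL,
    rec_src             VARCHAR(50)  NOT NULL,
    attribute_1         VARCHAR(100),
    PRIMARY KEY (hk_customer_account, ldts)
);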


Solution 4: Use the Primary Key as a Technical Hub

As a last resort, you can use the primary key from the source system as a technical hub. This involves creating a hub for the primary key values (e.g., customer IDs) and linking it to the real customer hub using a same-as link. While not the preferred method, this can be a workaround in situations where other solutions are not feasible.
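
A sketch of this workaround under the same hypothetical naming: a technical hub keyed on the source primary key, connected to the business customer hub through a same-as link:

-- Technical hub on the source system's primary key.
CREATE TABLE hub_customer_id (
    hk_customer_id CHAR(32)     NOT NULL,  -- hash of the technical key
    customer_id    VARCHAR(20)  NOT NULL,  -- source system primary key
    ldts           TIMESTAMP    NOT NULL,
    rec_src        VARCHAR(50)  NOT NULL,
    PRIMARY KEY (hk_customer_id)
);

-- Same-as link mapping the technical key to the business customer hub.
CREATE TABLE sal_customer (
    hk_sal_customer CHAR(32)    NOT NULL,  -- link hash key
    hk_customer     CHAR(32)    NOT NULL,  -- business customer hub
    hk_customer_id  CHAR(32)    NOT NULL,  -- technical customer-ID hub
    ldts            TIMESTAMP   NOT NULL,
    rec_src         VARCHAR(50) NOT NULL,
    PRIMARY KEY (hk_sal_customer)
);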


Additional Considerations

  • CDC Data vs. Full Extracts: When dealing with full data extracts, even if only a part of the multi-active component changes, it’s best practice to insert the full block of data with the newest load date timestamp. This simplifies downstream processes and eliminates the need for a specialized PIT table.
  • Non-Historized Links: If the CDC data represents transactional events and is analyzed as such, consider loading it into non-historized links instead of satellites. This approach aligns with the transactional nature of the data and facilitates aggregations and trend analysis (a DDL sketch follows this list).
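
A minimal sketch of such a non-historized link follows, with hypothetical names and an insert-only design, so no satellite delta handling is required:

-- Non-historized (transactional) link for CDC events; insert-only.
CREATE TABLE nhl_customer_transaction (
    hk_customer_transaction CHAR(32)      NOT NULL,  -- link hash key
    hk_customer             CHAR(32)      NOT NULL,  -- customer hub reference
    transaction_id          VARCHAR(36)   NOT NULL,  -- event identifier
    event_type              CHAR(1)       NOT NULL,  -- CDC change type: C / U / D
    amount                  DECIMAL(18,2),           -- transactional measure
    event_ts                TIMESTAMP     NOT NULL,  -- when the event occurred
    ldts                    TIMESTAMP     NOT NULL,  -- load date timestamp
    rec_src                 VARCHAR(50)   NOT NULL,  -- record source
    PRIMARY KEY (hk_customer_transaction)
);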

Conclusion

Handling CDC data in Data Vault 2.0 when dealing with multi-active satellites requires a careful assessment of your specific use case and data characteristics. The solutions presented in this article offer various approaches to tackle this challenge, each with its own advantages and trade-offs. By understanding these strategies and selecting the most appropriate one, you can ensure accurate and efficient data loading in your Data Vault environment.

Meet the Speaker


Marc Winkelmann

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.
