
Identifying Non-Historized Links

Watch the Video

Identifying Non-Historized Links in Data Vault

In data management, particularly in the Data Vault 2.0 methodology, understanding how to handle non-historized links can be essential for maintaining accurate, traceable records. In today’s Data Vault Friday, we’ll cover the concept of non-historized links, explore the unit of work in data storage, and discuss which elements should be included in link hashes.



Non-Historized Links Background: Understanding the GL Table

Imagine we’re working with a General Ledger (GL) table that stores financial transactions and contains over a billion rows. Each record in this table represents a transaction and includes attributes such as:

  • TransactionID: A unique identifier for each transaction.
  • AccountSegment: A business key identifying the account involved.
  • AccountingDocument: Another business key associated with the document related to the transaction.
  • AccountingPeriod: A reference key indicating the period of the transaction.
  • DebitAmount and CreditAmount: The monetary amounts for each transaction.
  • TransactionTimestamp: The exact date and time of the transaction.

While the TransactionID alone uniquely identifies each transaction, a combination of AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp also provides a unique identifier for each entry. This setup raises questions about how to best organize, hash, and manage these records within a Data Vault model, specifically in terms of defining a Unit of Work and deciding which values should be included in a link hash.

Defining the Unit of Work in Data Vault

The Unit of Work is a concept used in Data Vault to determine which business keys need to be grouped together in a single record. In this context, we’re focusing on how these business keys—AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp—should be structured and managed within a non-historized link.

According to Michael Olschimke, the CEO of Scalefree, the Unit of Work is primarily relevant when dealing with standard relationship-based links. In our example, there’s no need to split the link between these business keys because the link can capture the original granularity of each transaction from the source data. This means that by keeping all four attributes within the same link, we avoid disrupting the Unit of Work and ensure a coherent data structure.

Testing the Structure of the Unit of Work

One way to verify if we’re correctly applying the Unit of Work is to check if the data model enables us to reconstruct the original data source without loss of records or attributes. If we find that we can reconstruct the data accurately using the non-historized link and potentially a Satellite table later, then we’re likely adhering to the Unit of Work correctly.

Since the attributes in our case uniquely identify each transaction, we can assume the Unit of Work is preserved, allowing for a stable and consistent structure in the data model.

Which Values Should Be Hashed in the Link Hash?

The next question concerns which values we should include in the link hash. Generally, the hash key in a link table is derived from the business keys of referenced hubs. In this case, the primary question is whether to hash just the AccountSegment, AccountingDocument, and AccountingPeriod, or if we should also include the TransactionID.

Olschimke suggests a practical approach here: while we could technically get by with hashing either the TransactionID alone or the combination of the other three elements, it’s often better to include all relevant attributes in the hash. This includes the TransactionID along with AccountSegment, AccountingDocument, and AccountingPeriod.

Here’s why:

  1. Consistency with Automation Tools: Many Data Vault automation tools automatically include all hub references and their business keys in the hash computation. Following this approach aligns with standard automation practices, making it easier to work with automation tools later.
  2. Minimizing Potential Issues: Including all attributes in the hash computation reduces the risk of data loss or accidental data duplication. It ensures that our hashes accurately represent each unique record.
  3. Negligible Performance Impact: While adding an extra attribute to the hash computation may slightly increase the string length, the performance impact is minimal. In large-scale data processing, this small change typically does not result in significant slowdowns.
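To make this concrete, here is a minimal Python sketch of a link hash key that includes all four attributes. MD5 hashing, upper-case normalization, and a semicolon delimiter are common Data Vault conventions assumed here, and the sample values are invented:

import hashlib

def link_hash_key(*business_keys, delimiter=";"):
    """Normalize all referenced business keys, concatenate, and hash."""
    normalized = [str(key).strip().upper() for key in business_keys]
    return hashlib.md5(delimiter.join(normalized).encode("utf-8")).hexdigest()

# All four attributes go into the hash, TransactionID included:
hk_gl_transaction = link_hash_key(
    "TX-000123",  # TransactionID
    "SEG-42",     # AccountSegment
    "DOC-77",     # AccountingDocument
    "2024-03",    # AccountingPeriod
)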

In Practice: Setting Up Non-Historized Links

In Data Vault, non-historized links are a powerful tool for managing large datasets without versioning each record. By including all four attributes in the hash computation, we establish a stable link structure that reflects the source data’s original granularity. This approach allows us to confidently manage and retrieve accurate records without fear of unintentional data loss.

For practitioners, setting up these non-historized links can be straightforward when using automation tools that handle most of the heavy lifting. It’s generally recommended to follow the automation tools’ conventions, especially for larger datasets, as this minimizes discrepancies and facilitates smoother data processing in the long run.

How to Submit Your Questions

If you’re working with Data Vault and have a question similar to this one, Data Vault Friday sessions provide an opportunity to get expert insights. To submit a question, you can use the form, which allows you to attach diagrams, whiteboard images, or other supporting materials for clarity. Alternatively, for faster responses, you can post questions on the Data Innovators Exchange, where consultants actively respond and discuss Data Vault topics.

For those interested in more in-depth training, Scalefree also offers webinars, including sessions on WhereScape and dbt, which run monthly. These provide valuable resources for anyone looking to deepen their Data Vault expertise.

Conclusion

Identifying non-historized links and deciding on a consistent hashing approach can significantly impact data integrity and model reliability. By adhering to Data Vault principles and utilizing automation tools effectively, you can manage vast amounts of data without sacrificing accuracy or consistency.

With the right approach, you’ll be able to create a resilient, scalable data model that aligns with your organization’s needs and remains flexible for future changes. If you’re diving deeper into Data Vault, stay tuned for more Data Vault Friday insights and don’t hesitate to participate in the community for ongoing support and expertise.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Snapshot Based Bridge Table on Link and Effectivity Satellite

Watch the Video

Snapshot Based Bridge Table on Link and Effectivity Satellite

Welcome to another edition of Data Vault Friday! Today, we’re diving into the concept of a snapshot-based bridge table, particularly focusing on its application in scenarios with links and effectivity satellites. This approach helps us handle complex relationships between business objects over time, manage changing relationships, and retain a complete historical view. In this article, we will explore solutions to three main questions:

  • How to handle a missing relationship on day 2 compared to day 1
  • How to manage changed relationships from day 1 to day 2 for the same key
  • How to include a business object (A3) on day 2 when it lacks a relationship to any object in B


Understanding the Data Vault Modeling Example

In our example, we have two sources, each representing a different business object (A and B). Business object A is represented as a static dataset, while business object B shows dynamic relationships to A over three days. Each day brings changes in the relationships, with some entries disappearing or shifting. Our task is to capture these changes effectively in a snapshot bridge table.

Examining the Relationship Changes

Let’s walk through the changes observed over three days:

  • Day 1: B1 is related to A1, B2 to A2, and B3 to A3.
  • Day 2: The relationships change. B1 is now related to A2, B2 to A1, and B3 disappears.
  • Day 3: No further changes occur compared to Day 2.

Our objective is to document these relationships using an Effectivity Satellite and a Bridge Table, enabling us to query the state of relationships as they existed on each day.

Creating the Link and Effectivity Satellite

The first step in capturing these relationships is to create a Link table that holds a distinct list of relationships between business objects. Links should not have additional metadata, such as validity dates, as this can complicate data retrieval and reduce performance. In our example, the Link table captures each unique combination of A and B keys but does not record their start or end dates.

Next, we create an Effectivity Satellite. This table extends the Link by recording each relationship’s start and end timestamps, as well as an “is_active” flag to indicate the current status of each relationship. Using this table, we can track when a relationship starts, changes, or ends. Let’s examine how this works:

  • Day 1 entries: All relationships (B1-A1, B2-A2, B3-A3) are marked as active.
  • Day 2 entries: New relationships (B1-A2 and B2-A1) are added and marked active, while previous relationships (B1-A1 and B2-A2) are marked inactive. B3-A3 is removed entirely.
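To illustrate, here is a hypothetical Python representation of the Effectivity Satellite after the Day 2 load. The column names, the ISO dates standing in for Day 1 and Day 2, and the plain link identifiers are all invented for readability; a real satellite would carry hash keys and load timestamps:

effectivity_satellite = [
    {"link": "B1-A1", "load_date": "2024-01-01", "end_date": "2024-01-02", "is_active": False},
    {"link": "B2-A2", "load_date": "2024-01-01", "end_date": "2024-01-02", "is_active": False},
    {"link": "B3-A3", "load_date": "2024-01-01", "end_date": "2024-01-02", "is_active": False},
    {"link": "B1-A2", "load_date": "2024-01-02", "end_date": None, "is_active": True},
    {"link": "B2-A1", "load_date": "2024-01-02", "end_date": None, "is_active": True},
]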

Building the Snapshot-Based Bridge Table

With the Effectivity Satellite in place, we can now create a Bridge Table that snapshots the active relationships for each day. This table provides a point-in-time view of the relationships as they existed on a particular day. Let’s look at how the Bridge Table is created:

Day 1 Snapshot

The Day 1 snapshot reflects the initial relationships, pulling all active records from the Effectivity Satellite. At this stage, B1-A1, B2-A2, and B3-A3 are all active.

Day 2 Snapshot

In the Day 2 snapshot, only the relationships B1-A2 and B2-A1 remain active, while B3-A3 is removed. By applying a filter to include only active entries, the snapshot accurately represents the relationships on Day 2.

Day 3 Snapshot

Day 3’s snapshot is identical to Day 2, as no additional changes were made. The active relationships B1-A2 and B2-A1 remain unchanged.

This process ensures we have a clear audit trail of relationship changes and deletions. Each day’s snapshot represents the state of relationships at that point in time, without introducing redundant data.
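A small sketch of that snapshot logic, reusing the effectivity_satellite rows from the earlier sketch; it assumes, consistent with those rows, that a relationship counts as active on a date when it was loaded on or before that date and not yet end-dated:

def bridge_snapshot(effectivity_satellite, snapshot_date):
    """Return relationships active as of snapshot_date (ISO strings compare safely)."""
    return [
        row for row in effectivity_satellite
        if row["load_date"] <= snapshot_date
        and (row["end_date"] is None or row["end_date"] > snapshot_date)
    ]

day1 = bridge_snapshot(effectivity_satellite, "2024-01-01")  # B1-A1, B2-A2, B3-A3
day2 = bridge_snapshot(effectivity_satellite, "2024-01-02")  # B1-A2, B2-A1
day3 = bridge_snapshot(effectivity_satellite, "2024-01-03")  # identical to Day 2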

Handling Missing Data and Reinstating Relationships

One critical feature of the Effectivity Satellite is the ability to manage reinstated relationships. For instance, if B1-A1’s relationship is reestablished on Day 4, we add new rows in the Effectivity Satellite, marking the previous active combination (B1-A2) as inactive and reactivating B1-A1. This dynamic structure makes it easy to adjust for relationships that appear, disappear, and reappear over time.

Incorporating A3 in the Bridge Table on Day 2

A common challenge is how to incorporate business objects like A3 on Day 2, despite having no relationship in B on that day. In Data Vault, this is often addressed in downstream queries or report joins rather than in the Bridge Table itself.

By starting with object A and performing a left join to the Bridge Table, you will include all records from A (even if they don’t appear in B). This ensures that unlinked objects are included in the results, with their B relationships shown as null, or as a placeholder if desired.
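A minimal Python sketch of that join behavior, with the Day 2 bridge reduced to a plain lookup table (all keys invented):

# Day 2 snapshot as a lookup from A-keys to B-keys
bridge_day2 = {"A2": "B1", "A1": "B2"}

# Left join from A: every A record appears, A3 with None as a placeholder
all_a = ["A1", "A2", "A3"]
report_rows = [(a, bridge_day2.get(a)) for a in all_a]
# [("A1", "B2"), ("A2", "B1"), ("A3", None)]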

Preventing Cartesian Products and Other Best Practices

When using a Bridge Table, it’s essential to avoid unexpected Cartesian products, which can inflate data during aggregation. Always check the cardinality of relationships between objects (e.g., one-to-one or many-to-many) to ensure joins occur only on necessary keys. The driving key, which anchors the relationships in the Link Table, should be the primary focus, especially in cases of many-to-many relationships.

This method ensures accuracy and performance when aggregating data and avoids inflated results in reporting.

Conclusion

Snapshot-based bridge tables are powerful tools in Data Vault modeling for handling changing relationships and tracking historical snapshots. By carefully structuring Links, Effectivity Satellites, and Bridge Tables, we create a robust, auditable trail of data changes over time. As we’ve discussed, this approach allows us to accommodate missing data, reinstate relationships, and prevent data inflation, ensuring reliable, performant data models.

If you’d like further assistance with templates for bridge tables or Effectivity Satellites, feel free to reach out to me. I hope this discussion has provided clarity on using bridge tables in Data Vault. Join us again for more Data Vault insights, and feel free to submit your questions!

Meet the Speaker


Marc Winkelmann

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.

Learning the Data Vault Patterns

Watch the Video

Learning the Data Vault Patterns with Azure Synapse

When it comes to managing complex, evolving data landscapes, implementing a Data Vault architecture is a popular choice. The Data Vault approach is especially helpful in handling large volumes of data while maintaining flexibility, scalability, and historical tracking. Recently, a question was raised regarding the use of Azure Synapse Analytics with Data Vault for the initial Raw Data Vault development without a finalized automation tool. Specifically, the question was: Is it feasible to start development with Synapse Notebooks using PySpark and Delta Lake?

This article will explore the answer to that question and dive into key considerations, best practices, and strategies to make the most of a manual setup while preparing for eventual automation in your Data Vault implementation.



Is It Feasible to Start Data Vault with PySpark and Delta Lake?

The short answer is yes—starting with Synapse Notebooks, PySpark, and Delta Lake is indeed a viable option. In fact, this manual approach can be an excellent way to familiarize your team with Data Vault concepts, patterns, and methods before committing to an automation tool. By manually building the Raw Data Vault, your team can gain hands-on experience with essential processes, which will make the transition to automation smoother and more effective later on.

Historically, many Data Vault practitioners began with manual scripting due to limited automation tools. Over time, this “manual-first” method became a useful way to learn the intricate patterns of Data Vault. Today, automation tools for Data Vault are abundant, and using them is generally more efficient, but there’s still a place for manual methods, especially in the early learning stages of a project. Let’s look closer at why this approach works and what to consider as you work manually.

Benefits of Starting Manually with Synapse, PySpark, and Delta Lake

Using PySpark and Delta Lake in Azure Synapse Notebooks gives your team flexibility to:

  • Learn Core Data Vault Patterns: Building the Raw Data Vault manually helps the team understand Data Vault concepts, such as Hubs, Links, and Satellites. This is crucial knowledge that will benefit the project long-term.
  • Experiment with Modeling: Working without automation allows you to refine your approach and test different design patterns. This is especially helpful in creating a foundation that’s tailored to your organization’s specific needs and datasets.
  • Understand Data Transformation and Ingestion: By manually scripting transformations and ingesting data, your team will better understand the processes that an automation tool would handle. This will help in configuring automation later and troubleshooting any issues that arise.
  • Validate Requirements and Patterns: Since no tool has been chosen yet, working manually allows you to get a head start on modeling and confirm your business and technical requirements early in the project.
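As a starting point, the following PySpark sketch loads a hub from a Delta staging table inside a Synapse notebook. It relies on the notebook-provided spark session; the paths, column names, and SHA-256 hashing convention are illustrative choices, not a prescribed pattern:

from pyspark.sql import functions as F

# `spark` is provided by the Synapse notebook runtime
stage = spark.read.format("delta").load("/lake/stage/customer")

candidates = (
    stage.select(
        F.sha2(F.upper(F.trim(F.col("customer_id"))), 256).alias("hk_customer"),
        F.col("customer_id").alias("customer_bk"),
        F.current_timestamp().alias("load_date"),
        F.lit("CRM").alias("record_source"),
    )
    .dropDuplicates(["hk_customer"])
)

hub = spark.read.format("delta").load("/lake/raw_vault/hub_customer")

# Insert-only: append business keys the hub has not seen yet
(candidates
    .join(hub.select("hk_customer"), "hk_customer", "left_anti")
    .write.format("delta").mode("append").save("/lake/raw_vault/hub_customer"))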

The Drawbacks of a Manual Approach

While starting manually has its advantages, it’s important to be aware of the limitations. The primary drawbacks of a manual approach are:

  • Time and Effort: Developing the Raw Data Vault by hand is time-consuming. Each process, from creating Hubs to tracking Satellites, requires careful attention to ensure the design aligns with Data Vault standards.
  • Limited Scalability: A manual setup is challenging to scale, especially as the data volume grows. While PySpark and Delta Lake are powerful tools, they’re not a substitute for the scalability offered by automation tools.
  • Risk of Technical Debt: Scripts developed manually might not be as maintainable or reusable as the templates generated by automation tools. Technical debt can accumulate if the team spends too much time maintaining manual scripts, and transitioning to automation could later require extensive rework.

Steps to Optimize Your Manual Development Process

If you decide to move forward with this manual approach, here are some strategies to make the process more efficient and to set a foundation for a future transition to automation:

  1. Document Your Patterns Thoroughly: Take detailed notes on the specific design patterns, scripts, and models you develop manually. These can serve as templates when you move to automation, making the transition much easier.
  2. Define Clear Modeling Standards: Establish consistent modeling practices for Hubs, Links, and Satellites. This will reduce ambiguity and provide a structured foundation for automation tools to build upon later.
  3. Refine and Iterate: Since you’re building manually, use this time to refine your models. Adjust and improve them based on the unique data flows and needs of your organization.
  4. Focus on Core Entities: Prioritize building out core Hubs and Links in your Raw Data Vault, focusing on the entities that are most crucial to your organization. This will create a solid foundation that can be extended as you move to automation.

Preparing for Automation: What to Keep in Mind

Even as you start manually, keep automation in mind as a goal. Today, there are numerous Data Vault automation tools available, each with its own strengths and weaknesses. As you prepare for this transition, here are some key considerations:

1. Research Automation Tools

Explore the different automation tools available on the market. Each tool has its own approach, interface, and capabilities, so it’s essential to choose one that aligns well with your organization’s technical needs, budget, and data infrastructure. Some tools focus on business-user accessibility, while others offer more technical configurations. Common tools for Data Vault automation in Azure include solutions specifically designed for Synapse or those that support PySpark and Delta Lake.

2. Select Tools with Scalability

When selecting a tool, consider how it handles scalability, as this is vital for supporting the growing volume of data in a Data Vault. Some automation tools handle scalability better than others, depending on how they manage Hubs, Links, and Satellites. In Azure Synapse, it’s also important to assess compatibility with Delta Lake and PySpark, as well as overall integration with Azure’s data ecosystem.

3. Account for Tool Limitations

Even the best automation tools have limitations. Be prepared to adjust your approach based on the chosen tool’s capabilities. For example, some tools may limit specific patterns, such as complex Satellites or multi-active relationships. By understanding these limitations ahead of time, you can avoid rework and ensure that your initial manual development aligns well with the selected tool.

4. Focus on Configurability and Customization

Ensure that your chosen tool allows for some level of customization. This is important because the patterns you develop manually may need to be fine-tuned or adjusted within the tool. Look for tools that offer configurable templates, adaptable interfaces, and support for customizations to fit your organization’s specific needs.

Moving Forward: Transitioning to Automation

As your team becomes familiar with Data Vault patterns through manual development, the next step is to select and implement an automation tool. While the manual work provides a deep understanding of Data Vault patterns, an automation tool will streamline repetitive processes, ensure consistency, and save significant time and effort as your data volume grows.

One recommended approach is to use the lessons learned during manual development to create tailored templates and workflows within your chosen tool. By doing so, you can optimize the automation tool’s capabilities based on the patterns you’ve already tested and refined. This makes for a smoother, more effective transition from manual development to automated workflows.

Final Thoughts

Starting Data Vault development manually with Synapse Notebooks, PySpark, and Delta Lake is a feasible and often beneficial approach, especially when automation tools haven’t been finalized. While this method is time-consuming and requires effort, it offers valuable insights and allows your team to learn and optimize Data Vault patterns before committing to an automation tool.

Remember, the goal is to use this manual phase to build a strong foundation, explore modeling choices, and establish best practices. When the time comes to select an automation tool, your team will be well-prepared to leverage it to its fullest potential, ensuring a scalable and efficient Data Vault implementation within Azure Synapse Analytics.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Handling JSON Data in Data Vault Satellites on Snowflake

Watch the Video

Handling JSON Data in Data Vault Satellites on Snowflake

In this blog post, we’ll discuss the challenges and best practices for handling JSON data in Data Vault architectures, specifically when using Snowflake. This question from our audience raises a common scenario in modern data environments where semi-structured data is prevalent. Let’s dive into how storing JSON in Data Vault Satellites works, when to extract fields, and the performance considerations for Snowflake.

Why Store JSON in the Satellite?

With the increasing use of APIs and complex data integrations, many businesses are receiving data as JSON. JSON is a flexible, semi-structured format that allows for varying structures and nested elements, making it ideal for some applications. However, this flexibility also introduces complexities when using a structured data model like the Data Vault.

The challenge here is that when you receive a JSON payload, it may contain hundreds of fields, many of which may not be immediately needed. Extracting every JSON field to store as individual columns in the Raw Data Vault can be inefficient and unnecessary. Therefore, one solution is to store the JSON “as-is” in the Satellite and extract only what’s required later in the Business Vault.

Choosing When to Extract Fields from JSON

As a general rule, it’s beneficial to extract attributes that are needed for reporting, security, or compliance early in the process. Relational attributes stored directly as columns in the Raw Data Vault are easier to query, and they’re stored more efficiently in structured databases.

For other fields in the JSON payload that are less critical or not immediately required, you can keep the data in its JSON format. This approach minimizes upfront work and allows for flexible data exploration later without overloading the Raw Data Vault with unnecessary fields.

For example, consider an e-commerce environment where transaction data is received in JSON format. The Raw Data Vault could extract only essential fields like transaction ID, customer ID, and total amount while leaving other detailed information about the items purchased in JSON format.

JSON and Snowflake: Optimized for Performance

One of Snowflake’s strengths is its ability to store and query JSON data natively. Snowflake’s support for semi-structured data formats like JSON and its efficient handling of this data type make it possible to perform queries directly on JSON attributes without needing to first extract them into columns.

This capability allows you to store JSON data in the Raw Data Vault and query it directly, even in production, without significant performance penalties. This approach works well for data that may only be needed occasionally or where performance is not a major concern. Snowflake’s built-in functions for working with JSON make it easy to extract specific attributes on demand and use them in downstream reporting.
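For illustration, such a direct query might look as follows; Snowflake’s colon-path syntax reads attributes straight out of a VARIANT column. Every table and attribute name here is invented, and the statement is kept as a Python string so it could be issued through the Snowflake Python connector:

DIRECT_JSON_QUERY = """
SELECT hk_transaction,
       payload:customer_id::string        AS customer_id,
       payload:total_amount::number(18,2) AS total_amount
FROM   raw_vault.sat_transaction_json
WHERE  payload:status::string = 'SETTLED'
"""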

Handling Business Logic in the Business Vault

In Data Vault, business rules are often applied within the Business Vault layer, which is designed for derived and calculated data. When working with JSON in this context, it’s common to keep the original JSON structure in the Satellite and then perform transformations as needed in the Business Vault.

In Snowflake, you can take advantage of virtual views to simplify your data pipeline. Rather than physically extracting and storing each JSON attribute in the Business Vault, you can create virtual views that access JSON fields as needed. This approach offers the benefit of flexibility and reduces storage requirements since you avoid duplicating data.
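A sketch of such a virtual view, again held in a Python string for execution through a Snowflake session; all object names are illustrative:

CREATE_PROFILE_VIEW = """
CREATE OR REPLACE VIEW business_vault.customer_profile AS
SELECT hk_customer,
       load_date,
       payload:email::string        AS email,
       payload:loyalty.tier::string AS loyalty_tier
FROM   raw_vault.sat_customer_json
"""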

Structuring Semi-Structured Data Over Time

Storing JSON in the Satellite provides flexibility, but as your business needs evolve, you may find that some attributes become more relevant for analysis. At this point, you may consider structuring your JSON data incrementally, extracting only the fields that are frequently queried or required for compliance.

This gradual structuring process aligns with the Data Vault’s design principles. Over time, your semi-structured data will naturally become more organized as business requirements clarify which fields are necessary for analytics or reporting. You can continue to leave less critical fields in JSON format, maintaining flexibility while optimizing performance where it matters.

Privacy and Security Considerations with JSON in Satellites

One important consideration when storing JSON data in the Satellite is ensuring compliance with data privacy regulations such as GDPR. It’s essential to classify personal or sensitive data in the JSON payload and handle it appropriately.

In cases where sensitive data is involved, you may choose to split your Satellites by data classification, isolating high-sensitivity fields into their own Satellite tables with stricter access controls. Alternatively, you can extract and separate specific fields that require special handling and leave the rest in JSON format.

Working with JSON Arrays and Nested Objects

JSON data often includes nested structures, such as arrays, that add complexity to handling data in a relational model. For instance, customer records might contain an array of phone numbers or addresses. In these cases, Snowflake offers tools for managing nested data without requiring extensive restructuring.

One option is to store the entire JSON array as-is within a JSON attribute in the Satellite. For example, a JSON array containing multiple phone numbers could be stored directly in the Satellite, eliminating the need for a separate table to manage multi-active data. This approach provides a simpler alternative to the traditional multi-active Satellite design.

Snowflake allows you to work with nested JSON data using its FLATTEN function, which simplifies querying nested structures without requiring complex joins or data transformations. This method can be particularly useful when dealing with highly nested data, especially in real-time applications.
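As a sketch, the following statement unnests a hypothetical phone_numbers array with LATERAL FLATTEN; every object name is invented:

FLATTEN_PHONES = """
SELECT s.hk_customer,
       f.value:number::string AS phone_number,
       f.value:type::string   AS phone_type
FROM   raw_vault.sat_customer_json s,
       LATERAL FLATTEN(input => s.payload:phone_numbers) f
"""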

Virtualizing JSON Access in Snowflake

Snowflake’s JSON handling capabilities allow you to query JSON fields dynamically, making it possible to virtualize the access to JSON attributes in your Satellite tables. Virtual views can be used to transform JSON data for reporting without needing to physically extract every field into columns.

This approach offers significant flexibility. By keeping the JSON data in its original form, you avoid unnecessary data transformations and can leverage Snowflake’s powerful JSON functions to create reports on demand. Virtualization is an effective way to simplify your data pipeline, particularly when dealing with complex JSON structures.

Conclusion

Storing JSON in the Data Vault Satellite layer on Snowflake is a sound approach, especially when dealing with semi-structured or complex data. By extracting only the fields that are critical and leaving the rest in JSON format, you balance flexibility with efficiency. Snowflake’s support for JSON makes it possible to work with this data in real-time, leveraging virtual views to keep your pipeline agile.

Consider implementing a mix of JSON storage and relational attributes, structuring your data incrementally as your business needs evolve. With the right approach, Snowflake’s JSON capabilities allow you to handle diverse data types efficiently, supporting both compliance requirements and analytics needs.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Multi-Active Satellites & Dependent Child Keys

Watch the Video

Understanding Multi-Active Satellites and Dependent Child Keys in Data Vault

Multi-Active satellites and dependent child keys provide solutions for storing data with complex granularities in Data Vault models. Data Vault is known for its highly structured enterprise data warehousing approach, built on Hubs, Links, and Satellites to capture data lineage, maintain historical accuracy, and ensure scalability. However, specific data scenarios, such as handling different data granularities, often lead to questions on multi-active satellites and dependent child keys. This article breaks down these concepts and clarifies their differences and use cases in a Data Vault environment.

 

What is a Multi-Active Satellite?

A multi-active satellite is designed to manage multiple records for a single business object that are active simultaneously. This scenario arises when a business object, like a customer, can have several active data entries at the same time. For example, a customer could have multiple addresses (home and work), both of which are valid at the same time.

In a typical satellite structure, a business key (e.g., customer ID) combined with a load date timestamp defines the primary key. However, in cases of multiple active records, this primary key is insufficient because it won’t uniquely identify each active instance. Instead, an additional attribute, such as an address type (home or work), is added to the primary key to differentiate each record. This approach allows the satellite to track multiple entries for the same business key without duplicating data and helps capture finer details in the data warehouse.

Example of a Multi-Active Satellite

Let’s say our source system has a customer with ID C123 who has two active addresses: one for home and one for work. In a standard satellite, we might have one record per business key. But in a multi-active satellite, we store both addresses simultaneously by using an additional identifier (e.g., “address type”) in the primary key:

  • Customer ID: C123
  • Load Date: Timestamp of data load
  • Additional Identifier: Address type (e.g., home, work)

This approach allows multiple entries for a single business object (in this case, customer C123) while maintaining unique records in the satellite table.
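Sketched as Python rows, the multi-active satellite might look like this; the business key stands in for the hash key for readability, and all names and values are invented:

# Primary key: (customer_key, load_date, address_type)
sat_customer_address = [
    {"customer_key": "C123", "load_date": "2024-05-01", "address_type": "home",
     "street": "1 Main Street", "city": "Hanover"},
    {"customer_key": "C123", "load_date": "2024-05-01", "address_type": "work",
     "street": "9 Office Park", "city": "Hanover"},
]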

What is a Dependent Child Key?

A dependent child key is used to manage relationships between multiple business objects at a finer granularity level than a standard Data Vault link would allow. Dependent child keys are typically applied in links where we need to track multiple occurrences of a relationship between business objects, such as an order and its line items.

Consider an order containing multiple line items, where each item references a product. Here, the dependent child key (like line item number) uniquely identifies each relationship instance, as it provides additional detail beyond just the order and product identifiers. This allows multiple rows in the link for the same business objects while maintaining unique records.

Example of a Dependent Child Key

Imagine we have an order O123 for a customer C123, which includes two line items for the same product but with different prices or quantities:

  • Order ID: O123
  • Customer ID: C123
  • Product ID: P123
  • Dependent Child Key: Line item number (e.g., 1, 2)

In this case, we create unique rows for each line item, where the line item number differentiates each record. This approach ensures that each entry is stored and tracked individually.
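A matching sketch of the link rows; descriptive values such as price or quantity would live in a satellite attached to this link, so only keys appear here (all values invented):

# Uniqueness comes from the dependent child key "line_item"
link_order_line = [
    {"order_key": "O123", "customer_key": "C123", "product_key": "P123", "line_item": 1},
    {"order_key": "O123", "customer_key": "C123", "product_key": "P123", "line_item": 2},
]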

Key Differences Between Multi-Active Satellites and Dependent Child Keys

Although multi-active satellites and dependent child keys both enable handling of finer data granularity, they serve different purposes and are used in distinct contexts:

  1. Multi-Active Satellites
    Applied within a single business object to handle multiple active records at the same time. The additional identifier helps capture simultaneous entries for the same object in a satellite.
  2. Dependent Child Keys
    Used in links between multiple business objects, where the additional key captures the finer detail of each relationship instance, such as line items in an order.

When to Use Each Approach: Multi-Active Satellites & Dependent Child

The choice between using a multi-active satellite or a dependent child key depends on the data granularity and relationships in your data model:

  • Use Multi-Active Satellites when handling multiple active records for a single business object, where each entry is related only to the primary business key (e.g., customer with multiple addresses).
  • Use Dependent Child Keys when tracking detailed relationships between different business objects that require additional identifiers to maintain uniqueness (e.g., order and line items).

Summary

Multi-active satellites and dependent child keys are helpful for storing data with complex granularities in Data Vault models. While multi-active satellites allow multiple simultaneous records for a single business object, dependent child keys enable unique identification of complex relationships in links. Both approaches maintain Data Vault’s principles of scalability and data integrity by preserving unique records and enabling detailed tracking of business data.

In short:

  • Multi-Active Satellite: For multiple records active simultaneously within a single business object.
  • Dependent Child Key: For relationships across multiple business objects that need finer detail, typically in links.

Meet the Speaker


Marc Winkelmann

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.

Pre-Commit: The First Line of Defense in Shift-Left Development


Pre-Commit as a Foundation for Code Quality

In software development, ensuring code quality and minimizing defects early is crucial. The shift-left approach emphasizes detecting and resolving issues as soon as possible, rather than later in development or after deployment. Pre-commit supports this approach by managing multi-language pre-commit hooks, scripts that run automatically before finalizing a commit. These hooks act as a first line of defense, enforcing coding standards, preventing bugs, and enhancing security.

This article examines how the combination of pre-commit and the shift-left approach can enhance development workflows, resulting in more reliable and maintainable software. We will explore the advantages of using pre-commit hooks, their function within the shift-left framework, and offer practical advice for effective implementation. By adopting these practices, teams can improve code quality, accelerate development cycles, and deliver robust software solutions more efficiently.



Understanding Pre-Commit


General Hooks in Versioning Systems

In version control systems like Git, hooks are scripts that are triggered by various events within the repository. These scripts can automate tasks, enforce policies, or improve workflows. Hooks are generally categorized into two types: client-side and server-side. Client-side hooks run on the local machine and can be triggered by operations such as committing or merging changes. Server-side hooks run on the server and can be triggered by events like receiving pushed commits.

(Diagrams: a local repository where the pre-commit hooks accept a clean commit, and one where a faulty commit is rejected.)


The Role of Pre-Commit Hooks

Among the various types of client-side hooks, pre-commit hooks are particularly significant. These hooks are executed before a commit is finalized. They check for potential issues in the code, such as syntax errors, code style violations, or even security vulnerabilities. By running these checks before the code is committed to the repository, pre-commit hooks ensure that only high-quality code makes it into the version control system.


Pre-Commit Use Cases

Pre-Commit is a versatile tool that can be adapted to various scenarios within the development workflow, offering significant flexibility and control to development teams. One key use case is hosting pre-commit code locally, allowing developers to customize and manage hooks according to their project’s specific requirements. This local hosting ensures that all team members adhere to the same standards without relying on external repositories.

Another powerful feature of pre-commit is the ability for users to create custom hooks tailored to their unique needs. These hooks can enforce specific coding standards, run specialized security checks, or automate repetitive tasks, providing a high degree of flexibility. Teams can write these hooks in any language supported by their environment, allowing pre-commit to seamlessly integrate into their workflow.

Furthermore, pre-commit can be integrated into Continuous Integration (CI) pipelines to enhance automation and enforce quality checks across all stages of development. By incorporating pre-commit hooks into the CI process, teams can ensure that code meets quality standards before being merged into the main codebase. This integration helps maintain high standards of code quality and reliability throughout the development lifecycle.


Multi-Language Support in Pre-Commit

One of the key strengths of pre-commit is its support for a wide range of programming languages. This makes it an ideal tool for projects that involve multiple languages or frameworks. For example, in DataOps, it is common to use Terraform for infrastructure provisioning and dbt for data transformation. Pre-commit’s multi-language support allows it to be seamlessly integrated into such diverse development environments, providing consistent quality checks across different languages and tools. This cross-language capability ensures that best practices are enforced and issues are detected regardless of the technologies used in the project.


Example for the Use of Pre-Commit

Pre-Commit operates through a configuration file called pre-commit-config.yaml. This YAML file serves as the central configuration that pre-commit references every time a commit is made. It defines which hooks should be executed and where to fetch them from before a commit is finalized in your Git repository. An example of such a file is shown below, followed by an explanation of its key elements:

repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.90.0 
    hooks:
      - id: terraform_validate
      - id: terraform_fmt
        args:
          - --args=-recursive
  • repo: Specifies the URL of the repository containing the hooks. In this example, https://github.com/antonbabenko/pre-commit-terraform is the repository from which hooks will be fetched.
  • rev: Indicates the specific revision or version of the repository (v1.90.0 in this case) to use. This ensures consistency in hook behavior across different executions.
  • hooks: Defines a list of hooks to be run from the specified repository. Each hook is identified by its unique id.
  • id: Represents the identifier for each hook. In this example, terraform_validate and terraform_fmt are the hook IDs, each corresponding to a specific check or task the hook performs.
  • args: Specifies additional arguments or options to be passed to the hook command. In the case of terraform_fmt, the args field includes --args=-recursive, which instructs the hook to format Terraform files recursively within the project directory.

This configuration allows you to tailor pre-commit to your project’s needs by specifying which hooks to execute, where to retrieve them from, and how to ensure consistency through defined versions or revisions. Each hook ID corresponds to a particular quality check or task that pre-commit will enforce before allowing a commit to proceed, ensuring higher code quality and adherence to project standards.
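Once the configuration file is committed to the repository, running pre-commit install registers the hooks with Git, and pre-commit run --all-files executes them against the whole codebase rather than only the staged changes; both commands are part of the standard pre-commit command-line interface.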


Recommended Pre-Commit Repositories for Effective Integration

To kickstart your use of pre-commit effectively, consider leveraging the following repositories tailored for specific programming languages and tools:

  • Terraform Repository: antonbabenko/pre-commit-terraform
    Offers a comprehensive set of hooks for Terraform projects, including validators and formatters to ensure consistent and high-quality code.
  • Dbt (Data Build Tool) Repository: dbt-checkpoint/dbt-checkpoint
    Provides hooks specifically designed for dbt projects, enabling validation and enforcing best practices for SQL-based data transformations.
  • SQL Repository: sqlfluff/sqlfluff
    Offers hooks for SQL linting to ensure syntax correctness, adherence to coding standards, and optimization of query performance.

These repositories offer essential hooks that integrate seamlessly with pre-commit, enabling automated checks to maintain code quality and consistency across Terraform, dbt, and SQL projects. By incorporating these hooks into your workflow, you can streamline development processes and ensure robust software deployments. However, these are just suggestions; you are free to use any other hooks that benefit your use case.


Conclusion

Pre-Commit exemplifies the shift-left approach by automating quality checks early in the development cycle, before code is committed. This proactive strategy catches issues like syntax errors, formatting inconsistencies, and security vulnerabilities early, reducing downstream defects.

By using a centralized pre-commit-config.yaml file, Pre-Commit ensures consistent coding standards across teams. It allows developers to focus on coding rather than manual reviews, speeding up software delivery.

Supporting various languages and tools, Pre-Commit can be customized for specific project needs, enhancing code reliability through automated checks. Integrating Pre-Commit improves code quality, efficiency, and promotes continuous improvement, leading to faster time-to-market, lower development costs, and more reliable software.

About the Author


Moritz Gunkel

Moritz is an aspiring Consultant in the DevOps department for Scalefree, specializing in cloud engineering, automation, and Infrastructure as Code, with a particular knack for Terraform. While juggling his responsibilities, including pursuing a Bachelor’s degree in Computer Science as a working student, Moritz’s methodical approach has significantly impacted internal operations, earning him recognition and setting the stage for an exciting transition into a full-time consulting role. With a passion for innovation and a commitment to excellence, Moritz is set to continue making a lasting impression in the dynamic world of DevOps.

Modelling Address Data

Watch the Video

Modeling Address Data: Key Insights and Recommendations

Address data is one of the fundamental components in various business databases, especially where detailed customer information is essential. This complexity can make the modeling process challenging, particularly when aiming for a single “Hub” that consolidates all address information. In a recent discussion on Data Vault Friday, Michael Olschimke of Scalefree explored the best ways to model address data effectively, considering key business and regulatory factors. This article will provide a summary of those insights, offering recommendations for creating a robust, scalable, and efficient address data model.

The Context: Address Data in a Single Hub

In this scenario, the challenge presented was how to model various address types within a single Hub. The primary objective was to avoid redundant address data and simplify the handling of NULL values. In line with ISO 20022 standards and European Union regulations, the data model included attributes such as STREET_NAME, BUILDING_NUMBER, BUILDING_NAME, ROOM, FLOOR, POSTAL_CODE, TOWN_NAME, COUNTRY_CODE, ADDRESS_LINE_1, and ADDRESS_LINE_2. Each of these elements is part of a composite business key used to uniquely identify each address.

A practical but complex solution proposed by the team was to replace NULL values with a placeholder (e.g., “-2”) to streamline the loading process and minimize handling issues. However, Olschimke proposed several alternative approaches to ensure the model is both sustainable and scalable.

Challenges with Replacing NULL Values

Replacing NULL values in business keys can simplify loading but has significant drawbacks downstream, particularly when managing dimensional data and maintaining business logic clarity. For instance, when NULL values are replaced with a placeholder like “-2,” this value could appear in downstream reports, causing confusion. Olschimke suggested a more nuanced approach that utilizes “fixed hash values,” such as all zeros or all F’s, to represent empty or erroneous values.

By using fixed hash values, it becomes easier to identify default or error states directly within the data structure. This approach avoids unnecessary complexity when filtering data downstream and improves the clarity and manageability of data processing operations.
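A small Python sketch of that convention; the 32-character length matches an MD5 hash and is an assumption here, as is the normalization:

import hashlib

ZERO_KEY = "0" * 32   # fixed value representing a missing business key
ERROR_KEY = "f" * 32  # fixed value representing an erroneous business key

def address_hash_key(business_key):
    """Hash a business key, falling back to the fixed zero key when it is NULL."""
    if business_key is None:
        return ZERO_KEY
    return hashlib.md5(str(business_key).strip().upper().encode("utf-8")).hexdigest()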

Avoid Overloaded Hubs and Null Values in Composite Keys

One of the main points of consideration was the risk of “overloading” Hubs, which occurs when multiple business objects with different semantic meanings are stored within the same Hub. This is particularly common when different types of addresses are stored under a single business key, where each address type may not require all fields (e.g., ROOM or FLOOR).

According to Olschimke, overloaded Hubs introduce complexity due to differing data granularities and missing values across address types. For example, multiple buildings, floors, or rooms could exist under a single address, resulting in multiple granularities within a single Hub. This makes it challenging to maintain clear, meaningful data relationships. Instead, he advised defining clear granularity levels and possibly separating address types or using more flexible data structures.

Alternative Modeling Solutions: JSON-Based and Reference Tables

In cases where multiple address types require flexibility, Olschimke suggested using JSON-based data structures. JSON provides flexibility in defining address attributes dynamically, storing only the keys available for a particular address. This approach reduces the risk of overloading and accommodates varying address structures without creating a complex, rigid schema.

JSON-based Hubs allow for hashing the address data as a single JSON object, ordered by key names to prevent duplicates. However, this approach requires a consistent, standardized order of attributes when hashing to ensure duplicate-free keys. For instance, JSON formatting could streamline the Hub and enable more adaptive data loading while simplifying downstream data extraction.
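A minimal Python sketch of that hashing rule, assuming MD5 and canonicalization through sorted JSON keys; the attribute names echo the ISO 20022 fields mentioned earlier:

import hashlib
import json

def json_address_hash(address):
    """Hash an address as one JSON object, ordered by key name to avoid duplicates."""
    canonical = json.dumps(address, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.upper().encode("utf-8")).hexdigest()

# The same attributes in a different order produce the same hash key:
a = {"STREET_NAME": "Main St", "POSTAL_CODE": "30159", "TOWN_NAME": "Hanover"}
b = {"TOWN_NAME": "Hanover", "STREET_NAME": "Main St", "POSTAL_CODE": "30159"}
assert json_address_hash(a) == json_address_hash(b)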

Additionally, using reference tables is another approach for frequently accessed address data, enabling deduplication without over-complicating the Hub. Reference tables act as dedicated sources of address data, indexed by a unique address ID, which reduces redundancy across other Hubs.

Considering Address Data as Descriptive in Satellites

Instead of adding addresses as business keys in the Hub, it can be more effective to store them as descriptive attributes within a Satellite structure. This avoids overloading the Hub with attributes that may not always be needed for identifying the business key itself. By storing address data in Satellites linked to the primary business entity (e.g., customers, stores), you can achieve a balance between deduplication and schema simplicity.

Olschimke recommended this approach particularly when the main goal is to eliminate redundancies across address data. This approach aligns with a best practice in Data Vault modeling: Satellite tables should contain descriptive data that change over time, while Hubs contain only essential business identifiers.

Applying Business Rules in Data Vault Modeling

Address data often requires additional business rules, especially when handling complex keys or duplicates. Olschimke pointed out that handling NULL values with a placeholder complicates creating downstream dimensions. Instead, a two-step approach was advised: (1) defining the business keys within the Hub with fixed placeholders (e.g., all zeros or all F’s) for default and error handling and (2) standardizing the Satellite structure to handle varying address formats dynamically.

Ultimately, each business has unique requirements, and the choice between single Hubs, JSON structures, and Reference tables will depend on how critical the address data is to the core business operations. By focusing on avoiding overloading and ensuring scalability, businesses can set up a Data Vault model that minimizes long-term maintenance while maximizing data clarity and accessibility.

Conclusion

Modeling address data in a Data Vault context can be intricate, especially when attempting to create a unified Hub that supports various address types. The key considerations discussed by Olschimke emphasize flexibility, simplicity, and adherence to business rules without overloading Hubs. JSON-based keys, reference tables, and Satellite structures offer alternative approaches to managing address data, allowing you to avoid pitfalls associated with NULL placeholders and composite keys.

For businesses tackling complex address data requirements, experimenting with these alternatives may yield significant benefits, particularly in managing data deduplication, compliance, and future scalability.

Interested in learning more? Check out Scalefree’s webinars and consider joining the Data Innovators Exchange community for discussions on data modeling, cloud computing, and Data Vault 2.0 best practices.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Loading SAP CDC Data into GDPR Compliant Data Vault

Watch the Video

Loading SAP CDC Data into a GDPR-Compliant Data Vault

When managing change data capture (CDC) data from SAP in a Raw Data Vault, special considerations are needed for both CDC information and GDPR-relevant personal data. This post will cover how to model CDC data in a Data Vault, including the unique handling of created, updated, and deleted records. We’ll also discuss best practices for splitting data into separate satellites to manage GDPR-compliant attributes, including empty columns and privacy concerns.

This content is based on a discussion led by Michael Olschimke, CEO of Scalefree, during a Data Vault Q&A session.

CDC Data Modeling in the Data Vault

The primary challenge with CDC data is that it only includes changes from SAP, not the full dataset each time. CDC data typically includes metadata on whether a record was created, updated, or deleted in SAP. Here’s a look at how to approach modeling this data in a Data Vault:

1. Load CDC Data in Satellites with Adjusted Patterns

In Data Vault, the data model remains unchanged, regardless of how the data is delivered (batch, CDC, or real-time). However, the loading pattern for CDC data into satellites needs some adjustments:

  • Delta Check Adjustment: Normally, the Data Vault delta check identifies changes before loading data into a satellite. With CDC data, the changes are already captured, so this step can be bypassed. Instead, all changes from CDC data are loaded into the appropriate satellites directly.
  • Change Impact Across Satellites: When there’s a change in one attribute, it triggers an update in all relevant satellites. While this approach can create non-delta records, the impact is typically minimal, and the redundant data can be compressed for storage efficiency.
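A simplified Python sketch of the adjusted pattern from the first bullet; the field names are invented, and a real implementation would append to a satellite table rather than a list:

# CDC rows already carry the change, so the usual hash-diff delta check is skipped
def load_cdc_rows(cdc_batch, satellite_rows):
    for row in cdc_batch:
        satellite_rows.append({
            "customer_key": row["customer_key"],
            "load_date": row["extracted_at"],
            "change_type": row["change_type"],  # create, update, or delete
            **row["attributes"],
        })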

Handling GDPR-Relevant Personal Data

CDC data often includes both regular attributes and GDPR-sensitive personal data. In the Data Vault, personal data attributes should be separated based on privacy and security classes to ensure compliance and manage access. Here’s the recommended approach:

2. Splitting Satellites Based on Privacy Classifications

For GDPR compliance, split CDC data into multiple satellites:

  • Personal Data Satellite: A dedicated satellite for GDPR-relevant attributes (such as names or IDs). This separation allows for tighter security and privacy control.
  • Non-Personal Data Satellite: General attributes with no privacy concerns go into a separate satellite to reduce the risk of exposure.
  • Additional Splits: Further splits may be required based on rate of change, security levels, or business context, depending on the specific needs of your organization.

Maintaining separate satellites for different classes of data ensures that personal information is handled with stricter privacy controls, helping your data architecture comply with GDPR requirements.

Managing Empty Columns in the Data Vault

It’s common for source tables to contain columns that are always empty. When working with CDC data in a Data Vault:

  • Include Empty Columns for Auditing: To retain full traceability and auditability, include empty columns in the satellite. This preserves the exact structure of the source data without altering it.
  • Consider Separate “Unused Data” Satellite: If there are many empty columns, these can be grouped into a dedicated satellite, making the primary satellites leaner for users.

This approach allows for flexibility if the data in these columns becomes relevant in the future. Auditors will appreciate the comprehensive structure, and the Data Vault will retain all source data in its original form.

Example Satellite Structure

With GDPR compliance and CDC loading adjustments in mind, here’s an example structure for splitting SAP CDC data into satellites:

Satellite: CDC_Personal_Data
- Attributes: GDPR-relevant data (e.g., personal names, social security numbers)
- Metadata: Load date, source, change type (create, update, delete)
- Purpose: Privacy-controlled access

Satellite: CDC_NonPersonal_Data
- Attributes: Non-personal data columns
- Metadata: Load date, source, change type
- Purpose: General access

Satellite: CDC_Unused_Columns
- Attributes: Columns always empty in the source table
- Metadata: Load date, source
- Purpose: Compliance and future-proofing
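
Expressed as table definitions, the split might look like the following sketch. The parent hash key hk_business_partner, the attribute columns, and all data types are assumptions for illustration:

-- Illustrative DDL for the satellite split; all names and types are assumptions.
CREATE TABLE sat_cdc_personal_data (
    hk_business_partner CHAR(32)     NOT NULL,  -- parent hub hash key
    ldts                TIMESTAMP    NOT NULL,  -- load date timestamp
    rec_src             VARCHAR(50)  NOT NULL,  -- record source
    change_type         CHAR(1),                -- CDC metadata: C / U / D
    full_name           VARCHAR(200),           -- GDPR-relevant attribute
    social_security_no  VARCHAR(20),            -- GDPR-relevant attribute
    PRIMARY KEY (hk_business_partner, ldts)
);

CREATE TABLE sat_cdc_nonpersonal_data (
    hk_business_partner CHAR(32)     NOT NULL,
    ldts                TIMESTAMP    NOT NULL,
    rec_src             VARCHAR(50)  NOT NULL,
    change_type         CHAR(1),
    account_group       VARCHAR(10),            -- non-sensitive attributes
    industry_code       VARCHAR(10),
    PRIMARY KEY (hk_business_partner, ldts)
);

CREATE TABLE sat_cdc_unused_columns (
    hk_business_partner CHAR(32)     NOT NULL,
    ldts                TIMESTAMP    NOT NULL,
    rec_src             VARCHAR(50)  NOT NULL,
    unused_col_1        VARCHAR(100),           -- always empty in the source today
    unused_col_2        VARCHAR(100),
    PRIMARY KEY (hk_business_partner, ldts)
);

Separating the tables this way also means access rights can be granted per satellite, so the privacy-controlled table can be locked down independently of the general-access one.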

Best Practices for Satellite Splitting in the Data Vault

When splitting data into satellites, follow these best practices:

  • Split by Privacy and Security: Ensure that personal and non-personal data are stored separately, particularly when handling GDPR-relevant information.
  • Split by Source System: Keep different source systems in separate satellites for clarity and maintainability.
  • Consider Business Needs: If certain data attributes are only relevant to specific business cases, split them accordingly to reduce satellite complexity.

These principles provide a clean, secure, and compliant Data Vault structure that enables efficient data retrieval, flexibility, and regulatory adherence.

Conclusion

Modeling SAP CDC data in a GDPR-compliant Data Vault involves adjustments to loading patterns, especially when dealing with CDC deltas and GDPR-sensitive data. By separating data based on privacy classes and including empty columns where necessary, you can ensure compliance and maintain a flexible data model. The approach outlined here simplifies the handling of CDC data, while providing robust auditing and privacy control.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Modeling Demographic Data

Watch the Video

Modeling Demographic Data for the Raw Data Vault: A Practical Guide

When working with demographic data for population analysis, one often encounters datasets that include variables such as region, year, age, and sex. For organizations implementing a Data Vault methodology, the task then becomes modeling this data in the Raw Data Vault in a way that enables efficient use in the Business Vault for proportion calculations and other business logic.

In this blog post, we’ll explore how to approach modeling demographic data for the Raw Data Vault. We’ll look at considerations for reference tables, hubs, links, and satellites, and discuss why a simplified approach can be beneficial while still capturing change history when needed. Let’s walk through an example based on a question that was raised during a Data Vault Q&A session.

Defining the Dataset and Objective

Let’s start by outlining the dataset and the main objective. Suppose we have a demographic dataset sourced externally (e.g., from a national bureau) that includes the following columns:

  • Region
  • Year
  • Age
  • Sex
  • Population count per region, year, age, and sex

The goal is to load this dataset into the Raw Data Vault and make it available for calculations in the Business Vault, such as determining population proportions across different dimensions.

The Simplified Modeling Approach

To understand the modeling approach, let’s consider the basic elements of Data Vault architecture:

  • Hubs: Tables that hold unique business keys and serve as identifiers.
  • Links: Tables that define relationships between hubs.
  • Satellites: Tables that hold descriptive data and track changes over time.

Since this dataset contains demographic attributes with no true business keys, modeling choices hinge on balancing simplification and change capture. Here’s a step-by-step breakdown of the approach:

1. Flatten the Data Structure First

The simplest model for this dataset would be to create a flat, wide table that includes columns for region, year, age, sex, and population value. This structure would load the data directly without further separation into hubs, links, or satellites.

  • Pros: Easy to query and manage, especially if there’s no need to track changes over time.
  • Cons: Lacks support for tracking updates or changes in population values.

This approach works well if the data is static and updated infrequently, but it limits flexibility for versioning or incremental updates.
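
For illustration, the flat variant might look like the following sketch; the table name ref_demographics_flat and the column types are assumptions:

-- Flat, wide load of the demographic dataset; names and types are illustrative.
CREATE TABLE ref_demographics_flat (
    region      VARCHAR(50)  NOT NULL,
    year        INTEGER      NOT NULL,
    age         INTEGER      NOT NULL,
    sex         CHAR(1)      NOT NULL,
    population  BIGINT       NOT NULL,
    ldts        TIMESTAMP    NOT NULL,  -- load date timestamp
    rec_src     VARCHAR(50)  NOT NULL,  -- record source
    PRIMARY KEY (region, year, age, sex)
);

Note that with the primary key on the full attribute combination, a revised population count can only overwrite the existing row, which is exactly the change-tracking limitation described above.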

2. Introducing Reference Hubs and Satellites for Change Tracking

To address the limitations of a flat table, we can take advantage of reference hubs and satellites. Reference hubs allow us to treat region, year, age, and sex as reference codes. In a reference hub, each unique combination of these codes is treated as a single, composite key.

The structure then includes:

  • A Reference Hub with region, year, age, and sex as identifying attributes.
  • A Reference Satellite with population counts as the descriptive attribute, which links back to the Reference Hub.

By introducing these reference structures, we ensure that changes in population values can be captured over time. If a new population record comes in with a different population value for a given region, year, age, and sex, it is added to the Reference Satellite as a new row. This provides a change history without needing to modify the original row.

Modeling Strategy and Design

With a focus on simplification, here’s how the structure would look in the Raw Data Vault:

  • Reference Hub: Combines the codes for region, year, age, and sex. This setup eliminates the need for separate hubs for each attribute and keeps the model straightforward. The composite key formed by region, year, age, and sex uniquely identifies each record, while the hub serves as a central reference.
  • Reference Satellite: Attached to the Reference Hub, the Reference Satellite holds the population value and includes a load date for tracking when data was loaded. This enables us to capture historical changes efficiently. If an update occurs for the population value, a new row is added to the Satellite with an updated load date, creating a versioned history.

Example Structure:

Reference Hub: Demographic_Hub
- Primary Key: Region, Year, Age, Sex
- Additional Columns: Load Date (ldts), Source (src)

Reference Satellite: Population_Satellite
- Foreign Key: Region, Year, Age, Sex (composite key from Hub)
- Population Value
- Load Date (ldts)
- Source (src)

In this setup, all demographic attributes are contained within a single hub and linked to the population value in the satellite. This design maintains a simple and effective data model that supports historical data changes without additional complexity.
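
A minimal DDL sketch of this setup follows; the names and types are assumptions, mirroring the structure above:

-- Reference hub keyed on the demographic codes; names are illustrative.
CREATE TABLE demographic_hub (
    region  VARCHAR(50)  NOT NULL,
    year    INTEGER      NOT NULL,
    age     INTEGER      NOT NULL,
    sex     CHAR(1)      NOT NULL,
    ldts    TIMESTAMP    NOT NULL,  -- load date timestamp
    src     VARCHAR(50)  NOT NULL,  -- record source
    PRIMARY KEY (region, year, age, sex)
);

-- Reference satellite; each change to a population value adds a new ldts row.
CREATE TABLE population_satellite (
    region      VARCHAR(50)  NOT NULL,
    year        INTEGER      NOT NULL,
    age         INTEGER      NOT NULL,
    sex         CHAR(1)      NOT NULL,
    ldts        TIMESTAMP    NOT NULL,
    src         VARCHAR(50)  NOT NULL,
    population  BIGINT       NOT NULL,  -- descriptive attribute
    PRIMARY KEY (region, year, age, sex, ldts),
    FOREIGN KEY (region, year, age, sex)
        REFERENCES demographic_hub (region, year, age, sex)
);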

Why Avoiding Multiple Hubs and Links Makes Sense

In theory, one could create separate hubs for each demographic attribute—region, year, age, and sex—and link them together. However, this approach introduces unnecessary complexity without adding value in this context. Here’s why:

  • Multiple Joins: Multiple hubs and a link table require additional joins, increasing complexity when querying the data.
  • Increased Entity Count: Additional hubs and links inflate the number of entities, making the data model harder to understand and maintain.
  • Performance Concerns: Each join adds processing cost, which can slow down queries, especially with larger datasets.

By consolidating all demographic attributes into a single hub, we reduce the number of entities and simplify the model, making it easier to use and maintain while still meeting the business needs.

Advantages of a Single Reference Hub with a Satellite

This approach is particularly advantageous because it balances simplicity with flexibility. Here’s how:

  • Change Capture: With the reference satellite, we can track historical population data changes over time. Each new row represents an update, identified by load date, making it easy to see when population data was updated.
  • Granularity and Scalability: The single hub provides a consistent granularity for the data, ensuring that queries are straightforward and changes are easy to manage.
  • Performance Efficiency: Joining based on region, year, age, and sex is computationally efficient. These attributes are typically small (e.g., integer or small character values), reducing the burden on processing.

Using the Model in Business Logic

Once the demographic data is loaded into the Raw Data Vault, it’s ready for use in the Business Vault. Here’s how it could be used for business logic:

  1. Joining Data: In the Business Vault, data analysts can join other datasets with the demographic reference satellite on the region, year, age, and sex attributes to incorporate population data.
  2. Proportion Calculations: With population counts available, proportion calculations (e.g., the proportion of a certain demographic group in a region) become straightforward (see the query sketch after this list).
  3. Temporal Analysis: The load date in the satellite provides historical tracking, allowing analysts to view demographic changes over time and analyze trends.
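
Here is the query sketch referenced in item 2, assuming the population_satellite structure from the DDL sketch above; it picks the most recent satellite row per key and computes each sex's share of a region's population per year:

-- Latest population per key, then the proportion of each sex
-- within a region and year; names follow the DDL sketch above.
WITH latest AS (
    SELECT region, year, age, sex, population,
           ROW_NUMBER() OVER (
               PARTITION BY region, year, age, sex
               ORDER BY ldts DESC
           ) AS rn
    FROM population_satellite
)
SELECT
    region,
    year,
    sex,
    SUM(population) AS group_population,
    SUM(population) * 1.0
        / SUM(SUM(population)) OVER (PARTITION BY region, year) AS proportion
FROM latest
WHERE rn = 1
GROUP BY region, year, sex;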

Potential Challenges and Solutions

One common concern is that joining on multiple columns (region, year, age, sex) could impact performance. In practice, with appropriate indexing and the small size of these columns, the overhead is usually negligible. For larger datasets, partitioning on region or year may further improve query performance.

Conclusion

In summary, a simplified model that leverages a single reference hub with a satellite provides a highly effective way to model demographic data in a Raw Data Vault. This approach allows for straightforward use in the Business Vault and ensures flexibility for tracking historical changes, all while keeping the model manageable and efficient.

This example illustrates how, by focusing on simplification and change capture, organizations can build an effective demographic data model in the Raw Data Vault that meets both current and future needs for population-based analysis.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Loading CDC Data into Multi-Active Satellites and Status Tracking Satellites

Watch the Video

Mastering CDC Data in Data Vault 2.0

Change Data Capture (CDC) is a powerful mechanism for tracking changes in source systems. However, when the primary key in your source system differs from the business key used in your Data Vault hub, you may encounter challenges in loading data into multi-active satellites. This article explores various strategies for handling CDC data in such scenarios, offering practical solutions to ensure accurate and efficient data loading.



Understanding the Challenge

In many source systems, the primary key is a technical identifier unknown to the business. Instead, the business key represents the meaningful identifier for a business object. In a typical scenario, the relationship between the primary key and the business key is one-to-one. However, in some cases, multiple records can be active for the same business key on the same date, resulting in multi-activity.

This situation arises when the primary key is unique at a given point in time, but the business key is not. For instance, you might have multiple customer IDs in your source system (primary keys) referring to the same customer (business key).


Solution 1: Verify Multi-Activity

Before diving into complex solutions, it’s crucial to verify whether the data is genuinely multi-active. In some cases, the appearance of multi-activity might be due to records being deleted and recreated with the same business key, resulting in different primary keys.

To check this, analyze the CDC data and other technical columns in the source system to determine the order of events. If a sequence of delete and create operations is detected, you may not be dealing with true multi-activity.


Solution 2: Create a Multi-Active Satellite with Delta Checking

If the data is genuinely multi-active, the most straightforward approach is to create a multi-active satellite. Perform delta checks on the combination of the business key and the multi-active attribute (e.g., customer ID). This ensures that only changes within specific multi-active groups are loaded into the satellite.

However, this approach necessitates a specialized point-in-time (PIT) table, as the CDC data provides changes at the finest granularity (row level). You’ll need to consider both the load date timestamp and the multi-active attribute when querying the satellite to retrieve the most recent delta.
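
To make that querying requirement concrete, the following sketch retrieves the most recent delta per business key and multi-active attribute. The object names (ma_sat_customer, hk_customer, customer_id, attribute_1) are assumptions, and delete handling is omitted for brevity:

-- Most recent delta per (business key, multi-active attribute) pair;
-- all object names are illustrative.
SELECT s.hk_customer,
       s.customer_id,        -- multi-active attribute
       s.ldts,
       s.attribute_1
FROM ma_sat_customer AS s
JOIN (
    -- delta-check granularity: business key plus multi-active attribute
    SELECT hk_customer, customer_id, MAX(ldts) AS max_ldts
    FROM ma_sat_customer
    GROUP BY hk_customer, customer_id
) AS latest
  ON  s.hk_customer = latest.hk_customer
  AND s.customer_id = latest.customer_id
  AND s.ldts        = latest.max_ldts;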


Solution 3: Remodel with Satellites on Links

Another option is to remodel your Data Vault structure by placing the satellite on the link. In this approach, the multi-active attribute becomes a dependent child key in the link, and a standard satellite is created on this link. This simplifies the handling of multi-activity within the link itself.

However, it’s important to note that the satellite in this case describes the relationship between the customer and other components, rather than directly describing the business object hub. Evaluate whether this modeling change aligns with your downstream querying requirements.
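
A minimal sketch of this remodeling, using hypothetical names (lnk_customer_account, hk_customer_account, and so on); the dependent child key would also feed into the link hash key calculation:

-- Link with the multi-active attribute as a dependent child key.
CREATE TABLE lnk_customer_account (
    hk_customer_account CHAR(32)     NOT NULL,  -- link hash key (includes customer_id)
    hk_customer         CHAR(32)     NOT NULL,  -- customer hub reference
    hk_account          CHAR(32)     NOT NULL,  -- second hub reference
    customer_id         VARCHAR(20)  NOT NULL,  -- dependent child key (source PK)
    ldts                TIMESTAMP    NOT NULL,
    rec_src             VARCHAR(50)  NOT NULL,
    PRIMARY KEY (hk_customer_account)
);

-- A standard satellite then hangs off the link instead of the hub.
CREATE TABLE sat_customer_account (
    hk_customer_account CHAR(32)     NOT NULL,
    ldts                TIMESTAMP    NOT NULL,
    rec_src             VARCHAR(50)  NOT NULL,
    attribute_1         VARCHAR(100),
    PRIMARY KEY (hk_customer_account, ldts)
);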


Solution 4: Use the Primary Key as a Technical Hub

As a last resort, you can use the primary key from the source system as a technical hub. This involves creating a hub for the primary key values (e.g., customer IDs) and linking it to the real customer hub using a same-as link. While not the preferred method, this can be a workaround in situations where other solutions are not feasible.
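
A sketch of this workaround under the same hypothetical naming: a technical hub keyed on the source primary key, connected to the business customer hub through a same-as link:

-- Technical hub on the source system's primary key.
CREATE TABLE hub_customer_id (
    hk_customer_id CHAR(32)     NOT NULL,  -- hash of the technical key
    customer_id    VARCHAR(20)  NOT NULL,  -- source system primary key
    ldts           TIMESTAMP    NOT NULL,
    rec_src        VARCHAR(50)  NOT NULL,
    PRIMARY KEY (hk_customer_id)
);

-- Same-as link mapping the technical key to the business customer hub.
CREATE TABLE sal_customer (
    hk_sal_customer CHAR(32)    NOT NULL,  -- link hash key
    hk_customer     CHAR(32)    NOT NULL,  -- business customer hub
    hk_customer_id  CHAR(32)    NOT NULL,  -- technical customer-ID hub
    ldts            TIMESTAMP   NOT NULL,
    rec_src         VARCHAR(50) NOT NULL,
    PRIMARY KEY (hk_sal_customer)
);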


Additional Considerations

  • CDC Data vs. Full Extracts: When dealing with full data extracts, even if only a part of the multi-active component changes, it’s best practice to insert the full block of data with the newest load date timestamp. This simplifies downstream processes and eliminates the need for a specialized PIT table.
  • Non-Historized Links: If the CDC data represents transactional events and is analyzed as such, consider loading it into non-historized links instead of satellites. This approach aligns with the transactional nature of the data and facilitates aggregations and trend analysis (a DDL sketch follows this list).
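
A minimal sketch of such a non-historized link follows, with hypothetical names and an insert-only design, so no satellite delta handling is required:

-- Non-historized (transactional) link for CDC events; insert-only.
CREATE TABLE nhl_customer_transaction (
    hk_customer_transaction CHAR(32)      NOT NULL,  -- link hash key
    hk_customer             CHAR(32)      NOT NULL,  -- customer hub reference
    transaction_id          VARCHAR(36)   NOT NULL,  -- event identifier
    event_type              CHAR(1)       NOT NULL,  -- CDC change type: C / U / D
    amount                  DECIMAL(18,2),           -- transactional measure
    event_ts                TIMESTAMP     NOT NULL,  -- when the event occurred
    ldts                    TIMESTAMP     NOT NULL,  -- load date timestamp
    rec_src                 VARCHAR(50)   NOT NULL,  -- record source
    PRIMARY KEY (hk_customer_transaction)
);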

Conclusion

Handling CDC data in Data Vault 2.0 when dealing with multi-active satellites requires a careful assessment of your specific use case and data characteristics. The solutions presented in this article offer various approaches to tackle this challenge, each with its own advantages and trade-offs. By understanding these strategies and selecting the most appropriate one, you can ensure accurate and efficient data loading in your Data Vault environment.

Meet the Speaker


Marc Winkelmann

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he has been consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.
