Skip to main content
search
0
All Posts By

Michael Olschimke

Michael Olschimke is the Co-Founder and CEO of Scalefree and a "Data Vault 2.0 Pioneer" with over 20 years of IT experience. A Fulbright scholar and co-author of Building a Scalable Data Warehouse with Data Vault 2.0, Michael is a global authority on AI, Big Data, and scalable Lakehouse design across sectors like banking, automotive, and state security.

Flow.BI: Generating the Raw Data Vault Using AI and dbt

Flow.Bi architecture Data Warehouse

Raw Data Vault

This question might be as old as the Data Vault: Can it generate the Raw Data Vault using artificial intelligence (AI)? Until recently, the prevailing expectation in the industry was that an AI, if ever existing, might only be able to assist the data modeler, for example, by identifying and suggesting business keys or modeling parts of the model.

The question arises out of need: in the past, data volume and shape have risen exponentially. And there is no sign that it should flatten out in the future. But who should analyze all the data required for today’s data platforms? We already have a shortage of qualified data engineers. And this situation will only become worse in the future because university students don’t rise exponentially.

Generating the Raw Data Vault Using Flow.BI and dbt

We present the integration between datavault4dbt and Flow.BI which is used to generate the Raw Data Vault. Flow.BI is an artificial intelligence capable of defining all the hubs, links, and satellites for enterprise data. This includes determining the business keys and special entity types such as effectivity satellites, hierarchical links, multi-active satellites, non-historized links, and reference tables. The work-sharing is simple: the advanced AI of Flow.BI defines the Raw Data Vault with all required entities, and datavault4dbt generates the code, including CREATE TABLE and INSERT INTO statements for both the model and the Raw Data Vault loading procedures.

Watch Webinar Recording

Exceeding Expectations with Generative AI

With the release of Flow.BI, the expectation that only an assisting AI is possible has been exceeded. Flow.BI is a generative AI that fully “defines” the Raw Data Vault, including:

  • Hubs and their business keys
  • Links, including hierarchical links
  • Satellites, including the satellite splits for privacy, security and rate-of-change
  • Non-historized links and their satellites
  • Effectivity satellites and multi-active satellites
  • Reference hubs and their satellites

Model Structure and Load Definitions

In addition to the model structure, the advanced AI of Flow.BI also defines the load definitions, that is, the definition from where a hub’s business key or satellite’s attribute is loaded. This is later used to generate the INSERT INTO statements.

Because it is all done by the AI, it works at scale and can quickly generate Raw Data Vault models with thousands of entities.

Simplified Modeling with Flow.BI

Flow.BI drastically simplifies the modeling approach: all the user has to do is attach data sources to Flow.BI for analysis and profiling, hit the red button and Flow.BI defines at least a valid model. To achieve this, the solution identifies the concepts in the entities of the data sources first and then the business keys for those concepts. Next, it identifies the relationships and processes the descriptive data attributes into satellites. For the last finishing, the integrated natural language generator adds the entities’ documentation, attributes, and load definitions.

Users can improve the target model by adjusting the identified concepts and the rules for privacy, security, and satellite splits. 

A Data-Driven Approach

The model defined by Flow.BI follows a data-driven approach and, therefore, aligns with the teaching of Scalefree’s Data Vault training

In a data-driven approach, the Data Vault model is “modeling the raw data as the business uses it.” That means the focus is clearly on the raw data, but business keys (“as the business is using it”) integrate data across multiple data sources. 

The idea behind this is that business keys are often shared keys that exist in multiple source systems and can, therefore, be used for integration purposes. 

Raw Data Vault vs. Business Logic

Business logic, such as WHERE conditions or conditional logic, has no place in a data-driven Raw Data Vault. The Business Vault aims to extend the Raw Data Vault by business logic. 

Therefore, the defined model doesn’t contain any business logic, which must be added later in the Business Vault. However, the Raw Data Vault model is an integrated enterprise data model that spans all attached data sources.

Metadata for Advanced Data Warehousing

The metadata produced by Flow.BI can be used not only for generating the Raw Data Vault but also for the staging area, either on a relational database or a data lake. 

Flow.BI doesn’t generate the Raw Data Vault alone but defines it. That means it only indicates which hubs, links, and satellites should exist to capture the data from the source systems. But it doesn’t generate the CREATE TABLE and INSERT INTO statements for the physical model. Instead, it relies on tools such as dbt via the datavault4dbt package to generate the code. Flow.BI hands over the metadata of the defined model, and datavault4dbt generates the actual code. 

With this in mind, Flow.BI is a teammate who analyzes and profiles the data sources, knows how to model the Raw Data Vault, and ingests the metadata into dbt’s SQL models.

User Control Over AI Modeling

Flow.BI is imitating the human data modeler. 

But does this mean that the user has lost control over the AI? No. There are many options to influence Flow.BI’s AI to produce a “better” target model. But first, what “better” means should be defined: fewer entities in the target model? Faster queries? Faster loading? Depending on the goals, Flow.BI can be influenced, for example, by the concept classification and how source data is presented to Flow.BI.

Integration with Data Warehouse Automation

Once the metadata for the Raw Data Vault has been defined, it is handed over to the data warehouse automation (DWA) solution. There are many solutions available, but a popular option is dbt. Scalefree has developed the open-source dbt package datavault4dbt, which enjoys growing popularity in the industry. 

To generate SQL models for the dbt package, the integration between Flow.BI and datavault4dbt leverages TurboVault, another open-source package by Scalefree. TurboVault is a graphical user interface that sets up the metadata for a Raw Data Vault to be generated with datavault4dbt.

Once the metadata for the Raw Data Vault has been defined, it is handed over to the data warehouse automation (DWA) solution. There are many solutions available, but a popular option is dbt. Scalefree has developed the open-source dbt package datavault4dbt, which enjoys growing popularity in the industry. 

To generate SQL models for the dbt package, the integration between Flow.BI and datavault4dbt leverages TurboVault, another open-source package by Scalefree. TurboVault is a graphical user interface that sets up the metadata for a Raw Data Vault to be generated with datavault4dbt.

The best option was to ingest Flow.BI’s metadata into TurboVault to leverage its capabilities. Once Flow.BI’s metadata has been loaded into TurboVault, TurboVault generates the SQL models for datavault4dbt, which, in turn, generates the Data Vault entities and the loading procedures.

To facilitate the integration between Flow.BI and TurboVault, Scalefree has now released a dedicated Flow.BI connector that retrieves and transforms the logical data models from Flow.BI and enables the automated generation of your Data Vault.

Value Proposition of Flow.BI

Flow.BI offers many values: first, the price. Defining the model using AI is much more cost-effective than manually setting up the metadata for datavault4dbt. Another problem is that Data Vault experts are a scarce resource and not widely available, especially when quality is an essential factor, which always should be the case when dealing with enterprise data.

Another value of Flow.BI is the agility: instead of defining the Raw Data Vault in months and years, Flow.BI’s advanced AI calculates the Raw Data Vault model within minutes and hours.

This also reduces the project risk: what if the Data Vault experts produce a low-quality (or even invalid) model after years of working and millions of Euros spent? Having another attempt at modeling is often unrealistic.

But with Flow.BI, results are close to immediate; if they are unsatisfactory, the manual alternative is still available.

Therefore, the best option to get started is to contact us for a proof of concept or workshop on Flow.BI.

Delta Lake vs Data Vault

How does Data Vault add value when we have the Delta Lake?

In the world of modern data management, businesses often find themselves navigating a maze of tools, architectures, and methodologies to meet their ever-evolving data needs. Among the popular approaches are Delta Lake and Data Vault. While both have their strengths, it’s important to understand how they complement each other and why Data Vault can be a game-changer even when you’re leveraging Delta Lake.



Understanding Delta Lake

Delta Lake is an open-source storage layer that brings reliability to data lakes. Built on top of Parquet files, it provides ACID transactions, schema enforcement, and the ability to handle incremental data changes. It’s a robust foundation for modern data warehouses and data lakes, especially when using tools like Databricks.
However, Delta Lake primarily focuses on managing data storage and changes. It doesn’t inherently bridge the gap between raw source data and the business-ready reports and dashboards that users demand.

Enter Data Vault: Bridging the Gap

Data Vault is a modeling approach designed to address the disconnect between raw data and user needs. While Delta Lake handles data storage efficiently, Data Vault focuses on the *why* and *how* of transforming that data into actionable insights. Here’s where Data Vault excels:

  • Data Modeling: Data Vault organizes data into Hubs, Links, and Satellites, ensuring a flexible and scalable structure. Hubs capture business keys, Links handle relationships, and Satellites store descriptive data.
  • Data Integration: It helps integrate disparate data sources into a unified model that reflects the business context.
  • Change Tracking: While Delta Lake tracks changes at the file or record level, Data Vault optimizes this by capturing deltas more efficiently, especially when splitting data into specialized Satellites.
  • Target-Oriented Design: Data Vault focuses on producing business-ready data models like star schemas, flat tables, or dashboards, rather than being a consumption model itself.

Performance Challenges and Solutions

A frequent criticism of Data Vault on Delta Lake revolves around query performance, particularly due to the columnar storage of Parquet files. Joins can be slow, but this is more a characteristic of the storage mechanism than the modeling technique. Here are some strategies to address this:

  • Denormalization: Flattening data into wide tables eliminates the need for joins, resulting in faster query performance.
  • Materialized Views: Creating materialized Parquet views for end-user consumption ensures high performance without impacting upstream processes.
  • Optimized Storage: Use technologies like Iceberg or Delta tables for Hubs and Links, and consider presenting Satellites as views to minimize storage overhead.
  • Incremental Load: Design systems to handle insert-only incremental loads, reducing the complexity of updates and deletes.

Why Business Users Love Data Vault (Even If They Don’t Know It)

The ultimate goal of any data architecture is to serve business users. Reports, dashboards, and analytics are the end-products they care about. Data Vault excels here by enabling the creation of robust information models that align with user requirements:

  • Flexibility: Business rules can be implemented on top of the Data Vault model to derive the desired target model.
  • Scalability: Large data flows can be broken down into manageable pieces, making the system easier to maintain.
  • Agility: Changes in business requirements can be accommodated without overhauling the entire model.

Delta Lake and Data Vault: Better Together

Rather than viewing Delta Lake and Data Vault as competing approaches, think of them as complementary. Delta Lake provides the foundation for reliable data storage and change tracking, while Data Vault transforms this raw data into meaningful, business-ready formats.
For example, Delta Lake can serve as the staging or landing zone, where raw data is ingested and stored. Data Vault then takes over to model this data into Hubs, Links, and Satellites, preparing it for business consumption. The combination ensures both robust data management and the flexibility to meet diverse analytical needs.

Final Thoughts

Data Vault is a powerful methodology for bridging the gap between raw data and actionable insights. Even in environments that leverage Delta Lake, Data Vault adds value by providing a scalable, user-focused approach to data modeling. By combining the strengths of these two technologies, organizations can achieve both reliability and agility in their data architectures.
As with any tool or methodology, the key is to tailor the implementation to your specific needs, ensuring that both performance and usability are optimized. Whether you’re dealing with Databricks, Parquet, or other tools, Data Vault provides the flexibility and structure to deliver what matters most: business value.

Watch the Video

The Need for a Data Warehouse

The Need for a Data Warehouse

In today’s rapidly evolving digital landscape, businesses generate and collect vast amounts of data. However, data alone isn’t enough to ensure success—it’s about how we manage, analyze, and utilize this data. This brings us to a fundamental question: why do we need a data warehouse or Data Vault in our business model?



From Gut Feeling to Data-Driven Decisions

Many organizations, especially mid-sized firms, often rely on gut feelings for decision-making. While experience-based intuition has its place, it also carries a significant risk of error, especially as businesses grow and their operations become more complex. A data warehouse is a game-changer in transforming such organizations into data-driven entities where decisions are made based on facts and analytics rather than instinct alone.

As businesses scale, leaders lose the ability to maintain a complete overview of every operational detail. This is where a systematic approach to organizing, analyzing, and processing data becomes essential. A data warehouse centralizes and standardizes data from multiple sources, making it easier to extract insights and support rational, informed decisions across all levels of the organization.

Enhanced Business Process Automation

With a centralized repository like a data warehouse, businesses can unlock opportunities for automation. Automated processes can access, analyze, and utilize data seamlessly, leading to improved efficiency and accuracy. Whether it’s optimizing workflows or refining customer interactions, having reliable, accessible data is crucial for these systems to function effectively.

Democratizing Data Access

A significant aspect of a data-driven organization is making relevant data accessible to employees across roles. Every employee, from frontline workers to C-suite executives, is expected to make decisions. For these decisions to be effective, they need to be grounded in data.

However, this doesn’t mean unrestricted access. Data warehouses must incorporate robust security measures, such as role-based access, to ensure that employees can access only the data necessary for their responsibilities. This combination of widespread accessibility and stringent security supports a culture of informed decision-making while safeguarding sensitive information.

Do You Need a Data Vault?

When it comes to managing enterprise data, many organizations face additional challenges: integrating multiple source systems, ensuring data security and privacy, and handling real-time and batch processing simultaneously. A Data Vault model offers a comprehensive solution to these challenges by supporting integration, auditability, and adaptability.

While some businesses may start with simpler models, their requirements will inevitably evolve. Laws and industry standards may impose new data privacy mandates, or management may seek to leverage more advanced analytics capabilities. A well-designed Data Vault can accommodate these future needs without requiring a complete overhaul of the existing system.

The Future-Ready Advantage

One of the standout features of a Data Vault is its flexibility. It allows businesses to scale and adapt their data management strategies as they grow. Whether it’s adding new data sources, meeting stricter compliance requirements, or enabling more sophisticated analytics, the Data Vault model supports incremental changes without disrupting existing operations.

This adaptability makes it an invaluable asset for enterprises looking to future-proof their data strategies. While simpler solutions might suffice for today’s needs, they may not hold up against tomorrow’s demands. A Data Vault ensures that businesses are prepared for the inevitable increase in complexity and volume of their data requirements.

Conclusion

Investing in a data warehouse or Data Vault isn’t just about technology—it’s about fostering a culture of informed, data-driven decision-making. From streamlining processes to democratizing data access, these systems provide the foundation for businesses to thrive in an increasingly competitive and data-centric world. Whether you’re just starting your data journey or looking to enhance your existing capabilities, now is the time to prioritize a robust, scalable data solution.

As your business grows, so will your data requirements. A data warehouse or Data Vault not only meets these needs but positions your organization to capitalize on the full potential of its data—today and in the future.

Watch the Video

Federate Multiple Sets of Non-Historized Data

Federate Multiple Sets of Non-Historized Data

Welcome to another edition of Data Vault Friday! In this session, we’ll address an interesting question: how to integrate a second system into an existing setup with a non-historized link. This specific case involves transactions that can be updated for the last two months. We’ll explore the necessity of satellites, the role of non-historized links, and the concept of federating data in the business vault or information mart (BV/IM).

Let’s break it down, discuss the challenges, and provide actionable solutions.



The Challenge

The question revolves around a scenario where two systems provide transactional data. The goal is to integrate these datasets into a single fact entity in the information mart, enabling aggregation across both systems. A specific challenge arises when transactions from one system need to be updated within the last two months, necessitating a satellite to track those changes. So, how do we effectively design and implement this integration?

Understanding Non-Historized Links

Non-historized links play a central role in integrating datasets across source systems. These links typically store only insert operations, ensuring that facts—such as transactions—remain immutable. However, the reality is often more complex, especially when updates or changes occur.

For transactions that change, the non-historized link’s primary purpose is to maintain a consistent structure. Attributes frequently used for aggregations, such as amounts, are stored directly in the link structure. Meanwhile, descriptive attributes that casual users don’t aggregate (e.g., line descriptions) are placed in satellites to keep the link slim and optimized for storage and performance.

Federation Across Systems

One of the first decisions when integrating two systems is determining whether their data can share the same link structure. The key considerations are:

  • Semantic Consistency: Do both systems provide the same type of transactional data?
  • Granularity: Are the transactions at the same level of detail?
  • Key Uniqueness: Is there any overlap in identifiers that might require adding source-specific elements like a tenant ID?

If the answers indicate alignment, both systems can load into the same non-historized link. However, each source system would have its own satellite, ensuring clear separation of descriptive attributes.

Handling Updates in Transactions

Updates to transactional data pose a significant challenge, particularly when attributes such as amounts or hub references change. The typical strategy involves **technical counter transactions**, which ensure insert-only behavior while maintaining an accurate representation of changes. Here’s how it works:

  1. For a new transaction, insert the original record into the link.
  2. If an update occurs (e.g., an amount changes from €5 to €7), insert two additional records:
    • A “counter” record negating the original amount (e.g., -€5).
    • A new record reflecting the updated amount (e.g., +€7).
  3. Aggregations will reflect the latest transaction value while retaining a complete history of changes.

This approach keeps the data model consistent and avoids direct updates to the non-historized link.

Federation in the Business Vault and Information Mart

While the raw Data Vault integrates data from multiple systems, the Business Vault (BV) and Information Mart (IM) are where federated datasets shine. By standardizing and transforming data, you can provide end-users with a unified view of all transactions.

The information mart combines the transactional data from both systems into a single fact entity. This entity supports aggregations across all transactions, ensuring that analytical queries deliver accurate and actionable insights.

Best Practices and Recommendations

To ensure a robust implementation, consider these best practices:

  • Design for Scalability: Keep the link structure slim to optimize performance, especially with large datasets.
  • Clear Satellite Separation: Create one satellite per source system to maintain clarity and prevent mixing attributes from different sources.
  • Use Technical Counter Transactions: Handle updates efficiently while preserving insert-only behavior.
  • Test Aggregations: Validate the unified fact entity in the information mart to ensure accurate reporting across systems.
  • Monitor Data Quality: Regularly check for inconsistencies, especially when integrating systems with different update patterns.

Watch the Video

Non-Historized Links and Their Satellites

Understanding Non-Historized Links and Satellites

In the world of data vault modeling, we often encounter situations where we need to deal with non-historized links. A non-historized link is a structure used to capture relationships between entities (hubs) in a data vault. Unlike historized links, these structures are not designed to track changes over time, making them particularly useful in certain high-performance scenarios.

However, when integrating with systems that can update transactions for recent periods, additional structures like non-historized satellites become essential. These satellites allow us to track updates to attributes associated with the link while maintaining the integrity and performance of the data model.

Design Considerations for Non-Historized Links

The design of non-historized links requires careful planning, particularly when dealing with wide datasets or platforms with specific storage constraints. A wide link, for instance, might contain attributes like transaction ID, account segments, accounting document, accounting period, and timestamps. These attributes uniquely identify a transaction and establish relationships with other entities in the data vault.

Here are key considerations when designing non-historized links:

  • Include Essential Identifiers: Attributes required to uniquely identify a transaction and establish relationships with hubs should always be part of the link. This typically includes the transaction ID and any necessary alternate keys.
  • Balance Row Width: Including too many attributes in the link can lead to performance issues on row-based storage systems, as wide rows reduce the number of records stored per page.
  • Use Satellites for Additional Attributes: Attributes that are not essential for casual queries or dashboard aggregation should be moved to a satellite structure to keep the link lean and efficient.

Role of Non-Historized Satellites

Non-historized satellites play a critical role in complementing non-historized links. They store additional descriptive attributes that are not frequently queried or aggregated by typical users. This approach minimizes the join operations required during querying, improving performance.

For instance, while attributes like debit and credit amounts might be included in the link due to their importance in aggregations, other less frequently used attributes, such as transaction type or invoice descriptions, can reside in the satellite. This division ensures a balance between storage efficiency and query performance.

Query Performance Optimization

A significant design challenge lies in ensuring optimal query performance, particularly on column-based storage systems. Joining a link with its satellite can significantly impact performance. To mitigate this, consider the following:

  • Prejoin Frequently Queried Attributes: Move attributes that are commonly used in aggregations or group-by operations into the link structure to eliminate the need for joins.
  • Tailor Satellites for Specific Use Cases: Design satellites to serve specialized needs, such as data science or detailed analysis, where users can afford the higher cost of join operations.
  • Adopt Hybrid Storage Strategies: Leverage hybrid storage solutions, such as Snowflake’s mini partitions, to combine row-based and column-based advantages and support wider rows for frequently queried data.

Practical Example

Let’s consider a scenario with a financial transactions table containing over a billion rows. The table includes attributes such as transaction ID, account segments, accounting document, accounting period, debit amount, credit amount, and timestamps.

Here’s how we can structure the non-historized link and satellite:

  • Non-Historized Link: Include transaction ID, account segments, accounting document, accounting period, debit amount, and credit amount. These attributes are essential for aggregations and dimension references.
  • Non-Historized Satellite: Store additional attributes like transaction type and invoice descriptions, which are not frequently queried by casual users.

This structure ensures that typical queries for dashboards and reports are efficient, while still supporting detailed analysis for specialized users.

Conclusion

Designing non-historized links and their satellites requires a deep understanding of data usage patterns and storage platform constraints. By carefully selecting which attributes to include in the link and which to offload to a satellite, we can strike a balance between query performance and storage efficiency. For most scenarios, prioritizing attributes that support frequent aggregations and groupings in the link structure is key, while satellites serve as a repository for less critical details.

As data modeling continues to evolve, adapting these principles to modern storage technologies and hybrid platforms will be crucial for building robust and performant data vault systems.

Watch the Video

Identifying Non-Historized Links

Watch the Video

Identifying Non-Historized Links in Data Vault

In data management, particularly in the Data Vault 2.0 methodology, understanding how to handle non-historized links can be essential for maintaining accurate, traceable records. In today’s Data Vault Friday, we’ll cover the concept of non-historized links, explore the unit of work in data storage, and discuss which elements should be included in link hashes.



Non-Historized Links Background: Understanding the GL Table

Imagine we’re working with a General Ledger (GL) table for storing financial transactions, which includes over a billion rows. Each record in this table represents a transaction and includes attributes such as:

  • TransactionID: A unique identifier for each transaction.
  • AccountSegment: A business key identifying the account involved.
  • AccountingDocument: Another business key associated with the document related to the transaction.
  • AccountingPeriod: A reference key indicating the period of the transaction.
  • DebitAmount and CreditAmount: The monetary amounts for each transaction.
  • TransactionTimestamp: The exact date and time of the transaction.

While the TransactionID alone uniquely identifies each transaction, a combination of AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp also provides a unique identifier for each entry. This setup raises questions about how to best organize, hash, and manage these records within a Data Vault model, specifically in terms of defining a Unit of Work and deciding which values should be included in a link hash.

Defining the Unit of Work in Data Vault

The Unit of Work is a concept used in Data Vault to determine which business keys need to be grouped together in a single record. In this context, we’re focusing on how these business keys—AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp—should be structured and managed within a non-historized link.

According to Michael Olschimke, the CEO of Scalefree, the Unit of Work is primarily relevant when dealing with standard relationship-based links. In our example, there’s no need to split the link between these business keys because the link can capture the original granularity of each transaction from the source data. This means that by keeping all four attributes within the same link, we avoid disrupting the Unit of Work and ensure a coherent data structure.

Testing the Structure of the Unit of Work

One way to verify if we’re correctly applying the Unit of Work is to check if the data model enables us to reconstruct the original data source without loss of records or attributes. If we find that we can reconstruct the data accurately using the non-historized link and potentially a Satellite table later, then we’re likely adhering to the Unit of Work correctly.

Since the attributes in our case uniquely identify each transaction, we can assume the Unit of Work is preserved, allowing for a stable and consistent structure in the data model.

Which Values Should Be Hashed in the Link Hash?

The next question concerns which values we should include in the link hash. Generally, the hash key in a link table is derived from the business keys of referenced hubs. In this case, the primary question is whether to hash just the AccountSegment, AccountingDocument, and AccountingPeriod, or if we should also include the TransactionID.

Olschimke suggests a practical approach here: while we could technically get by with hashing either the TransactionID alone or the combination of the other three elements, it’s often better to include all relevant attributes in the hash. This includes the TransactionID along with AccountSegment, AccountingDocument, and AccountingPeriod.

Here’s why:

  1. Consistency with Automation Tools: Many Data Vault automation tools automatically include all hub references and their business keys in the hash computation. Following this approach aligns with standard automation practices, making it easier to work with automation tools later.
  2. Minimizing Potential Issues: Including all attributes in the hash computation reduces the risk of data loss or accidental data duplication. It ensures that our hashes accurately represent each unique record.
  3. Negligible Performance Impact: While adding an extra attribute to the hash computation may slightly increase the string length, the performance impact is minimal. In large-scale data processing, this small change typically does not result in significant slowdowns.

In Practice: Setting Up Non-Historized Links

In Data Vault, non-historized links are a powerful tool for managing large datasets without versioning each record. By including all four attributes in the hash computation, we establish a stable link structure that reflects the source data’s original granularity. This approach allows us to confidently manage and retrieve accurate records without fear of unintentional data loss.

For practitioners, setting up these non-historized links can be straightforward when using automation tools that handle most of the heavy lifting. It’s generally recommended to follow the automation tools’ conventions, especially for larger datasets, as this minimizes discrepancies and facilitates smoother data processing in the long run.

How to Submit Your Questions

If you’re working with Data Vault and have a question similar to this one, Data Vault Friday sessions provide an opportunity to get expert insights. To submit a question, you can use the form, which allows you to attach diagrams, whiteboard images, or other supporting materials for clarity. Alternatively, for faster responses, you can post questions on the Data Innovators Exchange, where consultants actively respond and discuss Data Vault topics.

For those interested in more in-depth training, Scalefree also offers webinars, including sessions on WhereScape and dbt, which run monthly. These provide valuable resources for anyone looking to deepen their Data Vault expertise.

Conclusion

Identifying non-historized links and deciding on a consistent hashing approach can significantly impact data integrity and model reliability. By adhering to Data Vault principles and utilizing automation tools effectively, you can manage vast amounts of data without sacrificing accuracy or consistency.

With the right approach, you’ll be able to create a resilient, scalable data model that aligns with your organization’s needs and remains flexible for future changes. If you’re diving deeper into Data Vault, stay tuned for more Data Vault Friday insights and don’t hesitate to participate in the community for ongoing support and expertise.

Learning the Data Vault Patterns

Watch the Video

Learning the Data Vault Patterns with Azure Synapse

When it comes to managing complex, evolving data landscapes, implementing a Data Vault architecture is a popular choice. The Data Vault approach is especially helpful in handling large volumes of data while maintaining flexibility, scalability, and historical tracking. Recently, a question was raised regarding the use of Azure Synapse Analytics with Data Vault for the initial Raw Data Vault development without a finalized automation tool. Specifically, the question was: Is it feasible to start development with Synapse Notebooks using PySpark and Delta Lake?

This article will explore the answer to that question and dive into key considerations, best practices, and strategies to make the most of a manual setup while preparing for eventual automation in your Data Vault implementation.



Is It Feasible to Start Data Vault with PySpark and Delta Lake?

The short answer is yes—starting with Synapse Notebooks, PySpark, and Delta Lake is indeed a viable option. In fact, this manual approach can be an excellent way to familiarize your team with Data Vault concepts, patterns, and methods before committing to an automation tool. By manually building the Raw Data Vault, your team can gain hands-on experience with essential processes, which will make the transition to automation smoother and more effective later on.

Historically, many Data Vault practitioners began with manual scripting due to limited automation tools. Over time, this “manual-first” method became a useful way to learn the intricate patterns of Data Vault. Today, automation tools for Data Vault are abundant, and using them is generally more efficient, but there’s still a place for manual methods, especially in the early learning stages of a project. Let’s look closer at why this approach works and what to consider as you work manually.

Benefits of Starting Manually with Synapse, PySpark, and Delta Lake

Using PySpark and Delta Lake in Azure Synapse Notebooks gives your team flexibility to:

  • Learn Core Data Vault Patterns: Building the Raw Data Vault manually helps the team understand Data Vault concepts, such as Hubs, Links, and Satellites. This is crucial knowledge that will benefit the project long-term.
  • Experiment with Modeling: Working without automation allows you to refine your approach and test different design patterns. This is especially helpful in creating a foundation that’s tailored to your organization’s specific needs and datasets.
  • Understand Data Transformation and Ingestion: By manually scripting transformations and ingesting data, your team will better understand the processes that an automation tool would handle. This will help in configuring automation later and troubleshooting any issues that arise.
  • Validate Requirements and Patterns: Since no tool has been chosen yet, working manually allows you to get a head start on modeling and confirm your business and technical requirements early in the project.

The Drawbacks of a Manual Approach

While starting manually has its advantages, it’s important to be aware of the limitations. The primary drawbacks of a manual approach are:

  • Time and Effort: Developing the Raw Data Vault by hand is time-consuming. Each process, from creating Hubs to tracking Satellites, requires careful attention to ensure the design aligns with Data Vault standards.
  • Limited Scalability: A manual setup is challenging to scale, especially as the data volume grows. While PySpark and Delta Lake are powerful tools, they’re not a substitute for the scalability offered by automation tools.
  • Risk of Technical Debt: Scripts developed manually might not be as maintainable or reusable as the templates generated by automation tools. Technical debt can accumulate if the team spends too much time maintaining manual scripts, and transitioning to automation could later require extensive rework.

Steps to Optimize Your Manual Development Process

If you decide to move forward with this manual approach, here are some strategies to make the process more efficient and to set a foundation for a future transition to automation:

  1. Document Your Patterns Thoroughly: Take detailed notes on the specific design patterns, scripts, and models you develop manually. These can serve as templates when you move to automation, making the transition much easier.
  2. Define Clear Modeling Standards: Establish consistent modeling practices for Hubs, Links, and Satellites. This will reduce ambiguity and provide a structured foundation for automation tools to build upon later.
  3. Refine and Iterate: Since you’re building manually, use this time to refine your models. Adjust and improve them based on the unique data flows and needs of your organization.
  4. Focus on Core Entities: Prioritize building out core Hubs and Links in your Raw Data Vault, focusing on the entities that are most crucial to your organization. This will create a solid foundation that can be extended as you move to automation.

Preparing for Automation: What to Keep in Mind

Even as you start manually, keep automation in mind as a goal. Today, there are numerous Data Vault automation tools available, each with its own strengths and weaknesses. As you prepare for this transition, here are some key considerations:

1. Research Automation Tools

Explore the different automation tools available on the market. Each tool has its own approach, interface, and capabilities, so it’s essential to choose one that aligns well with your organization’s technical needs, budget, and data infrastructure. Some tools focus on business-user accessibility, while others offer more technical configurations. Common tools for Data Vault automation in Azure include solutions specifically designed for Synapse or those that support PySpark and Delta Lake.

2. Select Tools with Scalability

When selecting a tool, consider how it handles scalability, as this is vital for supporting the growing volume of data in a Data Vault. Some automation tools handle scalability better than others, depending on how they manage Hubs, Links, and Satellites. In Azure Synapse, it’s also important to assess compatibility with Delta Lake and PySpark, as well as overall integration with Azure’s data ecosystem.

3. Account for Tool Limitations

Even the best automation tools have limitations. Be prepared to adjust your approach based on the chosen tool’s capabilities. For example, some tools may limit specific patterns, such as complex Satellites or multi-active relationships. By understanding these limitations ahead of time, you can avoid rework and ensure that your initial manual development aligns well with the selected tool.

4. Focus on Configurability and Customization

Ensure that your chosen tool allows for some level of customization. This is important because the patterns you develop manually may need to be fine-tuned or adjusted within the tool. Look for tools that offer configurable templates, adaptable interfaces, and support for customizations to fit your organization’s specific needs.

Moving Forward: Transitioning to Automation

As your team becomes familiar with Data Vault patterns through manual development, the next step is to select and implement an automation tool. While the manual work provides a deep understanding of Data Vault patterns, an automation tool will streamline repetitive processes, ensure consistency, and save significant time and effort as your data volume grows.

One recommended approach is to use the lessons learned during manual development to create tailored templates and workflows within your chosen tool. By doing so, you can optimize the automation tool’s capabilities based on the patterns you’ve already tested and refined. This makes for a smoother, more effective transition from manual development to automated workflows.

Final Thoughts

Starting Data Vault development manually with Synapse Notebooks, PySpark, and Delta Lake is a feasible and often beneficial approach, especially when automation tools haven’t been finalized. While this method is time-consuming and requires effort, it offers valuable insights and allows your team to learn and optimize Data Vault patterns before committing to an automation tool.

Remember, the goal is to use this manual phase to build a strong foundation, explore modeling choices, and establish best practices. When the time comes to select an automation tool, your team will be well-prepared to leverage it to its fullest potential, ensuring a scalable and efficient Data Vault implementation within Azure Synapse Analytics.

Handling JSON Data in Data Vault Satellites on Snowflake

Watch the Video

Handling JSON Data in Data Vault Satellites on Snowflake

In this blog post, we’ll discuss the challenges and best practices for handling JSON data in Data Vault architectures, specifically when using Snowflake. This question from our audience raises a common scenario in modern data environments where semi-structured data is prevalent. Let’s dive into how storing JSON in Data Vault Satellites works, when to extract fields, and the performance considerations for Snowflake.

Why Store JSON in the Satellite?

With the increasing use of APIs and complex data integrations, many businesses are receiving data as JSON. JSON is a flexible, semi-structured format that allows for varying structures and nested elements, making it ideal for some applications. However, this flexibility also introduces complexities when using a structured data model like the Data Vault.

The challenge here is that when you receive a JSON payload, it may contain hundreds of fields, many of which may not be immediately needed. Extracting every JSON field to store as individual columns in the Raw Data Vault can be inefficient and unnecessary. Therefore, one solution is to store the JSON “as-is” in the Satellite and extract only what’s required later in the Business Vault.

Choosing When to Extract Fields from JSON

As a general rule, it’s beneficial to extract attributes that are needed for reporting, security, or compliance early in the process. Relational attributes stored directly as columns in the Raw Data Vault are easier to query, and they’re stored more efficiently in structured databases.

For other fields in the JSON payload that are less critical or not immediately required, you can keep the data in its JSON format. This approach minimizes upfront work and allows for flexible data exploration later without overloading the Raw Data Vault with unnecessary fields.

For example, consider an e-commerce environment where transaction data is received in JSON format. The Raw Data Vault could extract only essential fields like transaction ID, customer ID, and total amount while leaving other detailed information about the items purchased in JSON format.

JSON and Snowflake: Optimized for Performance

One of Snowflake’s strengths is its ability to store and query JSON data natively. Snowflake’s support for semi-structured data formats like JSON and its efficient handling of this data type make it possible to perform queries directly on JSON attributes without needing to first extract them into columns.

This capability allows you to store JSON data in the Raw Data Vault and query it directly, even in production, without significant performance penalties. This approach works well for data that may only be needed occasionally or where performance is not a major concern. Snowflake’s built-in functions for working with JSON make it easy to extract specific attributes on demand and use them in downstream reporting.

Handling Business Logic in the Business Vault

In Data Vault, business rules are often applied within the Business Vault layer, which is designed for derived and calculated data. When working with JSON in this context, it’s common to keep the original JSON structure in the Satellite and then perform transformations as needed in the Business Vault.

In Snowflake, you can take advantage of virtual views to simplify your data pipeline. Rather than physically extracting and storing each JSON attribute in the Business Vault, you can create virtual views that access JSON fields as needed. This approach offers the benefit of flexibility and reduces storage requirements since you avoid duplicating data.

Structuring Semi-Structured Data Over Time

Storing JSON in the Satellite provides flexibility, but as your business needs evolve, you may find that some attributes become more relevant for analysis. At this point, you may consider structuring your JSON data incrementally, extracting only the fields that are frequently queried or required for compliance.

This gradual structuring process aligns with the Data Vault’s design principles. Over time, your semi-structured data will naturally become more organized as business requirements clarify which fields are necessary for analytics or reporting. You can continue to leave less critical fields in JSON format, maintaining flexibility while optimizing performance where it matters.

Privacy and Security Considerations with JSON in Satellites

One important consideration when storing JSON data in the Satellite is ensuring compliance with data privacy regulations such as GDPR. It’s essential to classify personal or sensitive data in the JSON payload and handle it appropriately.

In cases where sensitive data is involved, you may choose to split your Satellites by data classification, isolating high-sensitivity fields into their own Satellite tables with stricter access controls. Alternatively, you can extract and separate specific fields that require special handling and leave the rest in JSON format.

Working with JSON Arrays and Nested Objects

JSON data often includes nested structures, such as arrays, that add complexity to handling data in a relational model. For instance, customer records might contain an array of phone numbers or addresses. In these cases, Snowflake offers tools for managing nested data without requiring extensive restructuring.

One option is to store the entire JSON array as-is within a JSON attribute in the Satellite. For example, a JSON array containing multiple phone numbers could be stored directly in the Satellite, eliminating the need for a separate table to manage multi-active data. This approach provides a simpler alternative to the traditional multi-active Satellite design.

Snowflake allows you to work with nested JSON data using its FLATTEN function, which simplifies querying nested structures without requiring complex joins or data transformations. This method can be particularly useful when dealing with highly nested data, especially in real-time applications.

Virtualizing JSON Access in Snowflake

Snowflake’s JSON handling capabilities allow you to query JSON fields dynamically, making it possible to virtualize the access to JSON attributes in your Satellite tables. Virtual views can be used to transform JSON data for reporting without needing to physically extract every field into columns.

This approach offers significant flexibility. By keeping the JSON data in its original form, you avoid unnecessary data transformations and can leverage Snowflake’s powerful JSON functions to create reports on demand. Virtualization is an effective way to simplify your data pipeline, particularly when dealing with complex JSON structures.

Conclusion

Storing JSON in the Data Vault Satellite layer on Snowflake is a sound approach, especially when dealing with semi-structured or complex data. By extracting only the fields that are critical and leaving the rest in JSON format, you balance flexibility with efficiency. Snowflake’s support for JSON makes it possible to work with this data in real-time, leveraging virtual views to keep your pipeline agile.

Consider implementing a mix of JSON storage and relational attributes, structuring your data incrementally as your business needs evolve. With the right approach, Snowflake’s JSON capabilities allow you to handle diverse data types efficiently, supporting both compliance requirements and analytics needs.

(Single) Point of Facts

Watch the Video

Single Point of Facts in Data Lakehouse Architecture

Welcome to another edition of DataVault Friday! Today, we’re diving into a frequently debated topic in data management: determining the “source of truth” in a data lakehouse architecture. Specifically, we’ll answer whether that source of truth resides in data lake files or in the tables of the Raw Vault and Business Vault. Additionally, we’ll address what to do if a bug in the ingestion framework requires a re-ingestion of data sources. Let’s explore these questions to better understand data lineage, data governance, and strategies for a reliable and flexible data ecosystem.

Understanding the “Single Point of Facts”

In traditional data warehousing, the idea of a “single version of the truth” is well-known. This concept implies that there is one version of the data that acts as the definitive source across an enterprise. For instance, a single “customer” or “product” definition applies universally within the organization.

However, in Data Vault architecture, we move from a “single version of the truth” to a “single point of facts.” The focus shifts from universal definitions to an unaltered, auditable record of events. Data Vault is designed to capture historical data accurately and reliably. It provides multiple perspectives on the data (versions of the truth) by isolating raw data from any business logic or transformations. This flexibility allows organizations to apply different business rules depending on context, while maintaining a consistent underlying dataset.

In this context, the Raw Vault is considered the foundational layer, capturing facts as they are, directly from the source systems. The Business Vault, on the other hand, introduces additional business rules, metrics, and aggregated data for reporting purposes. But in essence, the “single point of facts” remains within the Raw Vault because it represents an unaltered and auditable record.

Data Lakehouse Architecture and Points of Truth

In data lakehouse architecture, data is stored both in a data lake and within Data Vault tables. This raises the question: Which source is the ultimate truth? The data lake, with its raw files, or the Raw Vault tables?

The answer depends on the architectural requirements and the level of traceability and auditability needed. Ideally, both systems should mirror each other and serve as points of fact:

  • Data Lake: The data lake serves as a repository for raw data files, often storing snapshots or full loads of data from source systems. This makes it easier to preserve the original data as-is without alteration.
  • Raw Vault: In the Raw Vault, data is loaded into a structured schema, capturing the same original details but in a way that can be more systematically queried and analyzed. Like the data lake, the Raw Vault stores unmodified facts, but it also preserves lineage information, making it possible to reproduce deliveries and trace data transformations.

Since both layers should hold the same underlying data, they collectively represent the point of fact. Either the data lake or Raw Vault can serve as the truth source, depending on the scenario. This dual system ensures a resilient architecture, as data can be cross-validated across layers.

What If There’s a Bug in the Ingestion Framework?

One key question that arises is: What happens if there’s a bug in the ingestion framework? Bugs such as incorrect population of business keys or other erroneous transformations might require a complete re-ingestion of data sources.

When dealing with bugs in data ingestion, having both a data lake and a Raw Vault allows flexibility and safeguards. Here’s how to address these issues:

  1. Identify and Isolate the Problem: Pinpoint where the issue occurred in the ingestion process and document the scope of the bug, especially if it affects business keys or other critical aspects of data integrity.
  2. Rely on the Data Lake for Original Files: Since the data lake contains the original, unaltered data files, you can reload the affected data from here into the Raw Vault. This ensures that any corrupted or inaccurately transformed data can be replaced without loss.
  3. Reprocess the Raw Vault: With the correct data now available from the data lake, reload the Raw Vault. Ensure that new ingestion processes are thoroughly tested to avoid repeating the error.
  4. Automate Audits and Reconciliation: Implement automated reconciliation checks between the data lake and the Raw Vault. Automated scripts can flag discrepancies, giving early warning of issues before they reach production or reporting layers.

By leveraging both the data lake and Raw Vault as points of fact, the architecture remains robust and auditable. This redundancy allows for re-ingestion without significant downtime and ensures that data lineage remains traceable throughout the lifecycle.

Data Lake vs. Raw Vault: Which Is Easier for Reconstruction?

When it comes to reconstructing deliveries, the data lake often offers simplicity. Since the data lake can hold raw files with minimal transformation, data reconstruction is a straightforward matter of accessing the original files. In contrast, reconstructing from the Raw Vault requires additional effort, as data must be accurately joined across hubs, links, and satellites, while preserving the original state.

That said, both layers should be auditable, with logging mechanisms that allow for a traceable history of changes. Having a clear data lineage in place allows organizations to meet compliance and audit requirements while supporting accurate reporting.

Best Practices for Managing Points of Fact

While it’s tempting to designate a single point of fact, the dual-layered approach with a data lake and Raw Vault provides a more resilient framework. Here are some best practices for managing points of fact in a data lakehouse architecture:

  • Maintain Consistency Between Layers: Ensure that data lake files and Raw Vault tables match exactly. Automate reconciliation checks between these layers to verify data integrity.
  • Implement Auditable Ingestion Processes: Document all transformations from the data lake to the Raw Vault, with logging and error-checking mechanisms. This allows for easier tracing of issues if they arise.
  • Retain Original Data in the Data Lake: Always keep a copy of original files in the data lake. These files provide a reliable source of truth that can be referenced or reloaded into the Raw Vault if issues occur.
  • Leverage Metadata for Automation: Metadata can streamline both ingestion and reconciliation. Use metadata to define business keys, relationships, and descriptive data in the Raw Vault, while automating verification processes.

With these practices, data lakehouse architecture can be made robust, auditable, and resilient to changes or errors. By treating both the data lake and Raw Vault as points of fact, you ensure that your data ecosystem remains flexible, trustworthy, and ready to meet evolving business requirements.

Conclusion

The question of “single point of facts” in data lakehouse architecture doesn’t have a straightforward answer. Both the data lake and the Raw Vault act as points of fact, each offering unique benefits in terms of auditability and reconstruction. By utilizing both, you create a highly resilient system capable of withstanding data issues while providing a comprehensive, consistent view of your data.

In summary, while the Raw Vault may traditionally serve as the “single point of facts,” using both the data lake and the Raw Vault as truth sources creates a flexible architecture that can accommodate re-ingestion, mitigate risks, and support accurate reporting. With this dual approach, your data lakehouse architecture becomes a reliable foundation for modern data needs.

Data Vault Mixed Model

Watch the Video

Understanding Data Vault Mixed Models: Integrating Non-Data Vault Entities in the Business Vault

Data Vault architecture is a widely used methodology in data warehousing, providing a highly adaptable model for managing complex data environments. It organizes data primarily in three core components: Hubs, Links, and Satellites. These elements support business keys, relationships, and descriptive data to create a comprehensive data structure within the Raw Data Vault layer. However, as with many methodologies, real-world data often introduces elements outside the strict boundaries of this structure, sparking questions around flexibility.

This article explores the concept of a “mixed model” in Data Vault, where non-Data Vault entities coexist with Raw Data Vault components and how they might be integrated within a Business Vault structure. While the purest Data Vault models focus on auditability and lineage, allowing for seamless automation, mixed models can sometimes be practical if managed thoughtfully. So, is it permissible to mix non-Data Vault entities with the Raw Data Vault within a Business Vault? Let’s dive into this topic.

What is a Data Vault Mixed Model?

A “mixed model” in Data Vault refers to a scenario where traditional Data Vault structures (Hubs, Links, and Satellites) are used alongside other non-Data Vault tables or entities. In essence, while most data resides in the structured Raw Data Vault, there are other data components within the same database that do not conform to Data Vault architecture. This raises questions about integrating these disparate data types in the Business Vault.

The Business Vault is designed to serve as a refined, operational version of the Raw Data Vault. It enhances the raw data with business logic and transformations to create actionable insights. In scenarios where a mixed model is necessary, the goal is often to leverage existing non-Data Vault tables to derive business insights while minimizing disruption to the original data model.

Can You Integrate Non-Data Vault Entities with Raw Data in the Business Vault?

According to Data Vault principles, the ideal approach is to structure all data as Hubs, Links, and Satellites to ensure consistency, auditability, and lineage. However, a mixed model approach can sometimes be necessary. For instance, you may have a database that combines data stored in the Raw Data Vault with tables or entities that don’t follow Data Vault structures. So, is it allowed?

The short answer is yes, you can technically integrate non-Data Vault entities within the Business Vault, but it comes with caveats. Here’s a deeper look at the implications:

  1. Temporary Solutions Only
    Mixing non-Data Vault data with Raw Data Vault entities is generally seen as a temporary solution. It may help in quickly bridging data that doesn’t yet fit into the Data Vault model, allowing for rapid integration. However, over time, this approach can lead to complexity in querying and reduce the consistency that Data Vault offers.
  2. Impact on Automation and Maintainability
    Introducing non-standard tables complicates automation within the Business Vault. Data Vault design leverages automation tools like dbt, Wherescape, and Vaultspeed, among others. These tools facilitate a streamlined workflow in Data Vault implementations by allowing for automated lineage, auditing, and data transformations. When introducing non-Data Vault entities, the automation capabilities are hindered, requiring custom scripts or queries that deviate from standard Data Vault patterns.
  3. Jeopardizing Auditability and Lineage
    One of Data Vault’s strongest value propositions is its focus on data lineage and auditability. In a mixed model, these aspects may be compromised. Without adhering to the structure of Hubs, Links, and Satellites, it becomes challenging to track data history, version control, and capture all changes comprehensively. For organizations that rely on these features for regulatory or quality purposes, compromising lineage may be a serious drawback.
  4. User Mart as an Alternative
    An alternative approach is to build a “User Mart” for ad hoc or analytical queries that combine data from the Raw Data Vault with non-Data Vault tables. This User Mart allows users to query both Raw Data Vault and external entities without disrupting the core Business Vault structure. This approach is particularly useful when users have specific reporting or analytical requirements that may not require full Data Vault transformation.
  5. Pragmatic Approach: Virtual Hubs and Links
    A practical solution in Data Vault projects is to create “virtual” Hubs, Links, and Satellites for non-standard tables, which serve as placeholders within the Raw Data Vault structure. This approach allows for quick integration while maintaining some level of standardization. For example, if there’s a reference table with country names and codes, you might create a virtual Hub for the country and map descriptive details as a virtual Satellite. This doesn’t achieve full lineage but can serve as a bridge until a proper Data Vault structure can be implemented.
  6. Reference Tables and Non-Critical Data
    In scenarios where data like reference tables (e.g., country codes, zip codes) doesn’t require full lineage or version tracking, a flat and wide reference table can be used. If a reference Hub and Satellite are unnecessary, keeping the data simple with a primary key and descriptive columns is often sufficient. This approach can work well for non-essential data, where maintaining Data Vault-style rigor may not be worth the effort.

Strategies for Long-Term Success with a Mixed Model

If you decide to proceed with a mixed model, it’s crucial to plan for a future transition toward a fully Data Vault-compliant design. Here are some tips:

  1. Prioritize Refactoring Non-Data Vault Entities
    Establish a clear roadmap for converting non-Data Vault tables into Hubs, Links, and Satellites over time. This phased approach enables you to work within existing constraints while planning for a more robust and compliant Business Vault.
  2. Minimize Technical Debt
    Track instances of non-Data Vault elements within your data ecosystem and treat them as “technical debt” to be managed and resolved in the long term. This keeps you aware of areas where auditability or automation might be compromised.
  3. Use Metadata-Driven Automation
    Employ metadata-driven automation tools as much as possible to simplify future integrations and transitions. These tools enable automated data processing across the Data Vault pipeline, making it easier to add and transform new data sources into compliant Data Vault structures.
  4. Implement Strict Governance for User-Generated Data
    In cases where users introduce their own data models within the User Mart or Business Vault, set governance policies to standardize data usage and maintain some level of alignment with Data Vault patterns. These policies can mitigate risks related to data quality and ensure that non-Data Vault data remains manageable.

Practical Example of a Mixed Model in Action

Consider a financial services organization that maintains a Raw Data Vault with transaction data but also has a separate schema for customer reference tables, such as customer demographics and location details. Rather than directly integrating these tables into the Business Vault, the organization could create virtual Hubs and Links that link customer IDs and locations to transactions. This allows them to continue working within the Raw Data Vault framework while planning to reformat reference tables in alignment with Data Vault standards.

Another example might involve a large retail company where user-generated data models in the User Mart are frequently used to support marketing analysis. Here, the organization could implement a temporary mixed model that accommodates fast-paced analysis while planning for a phased migration to Data Vault structures over time.

Conclusion: Balancing Flexibility with Data Vault Integrity

While a mixed model is not ideal within Data Vault architecture, it can serve as a temporary, pragmatic solution when there’s an immediate need to integrate non-Data Vault entities. Virtual Hubs and Links, User Marts, and strict governance policies can help manage the complexity introduced by non-standard tables. However, organizations should prioritize migrating all data into the Data Vault model over time to preserve the long-term benefits of auditability, lineage, and automation that Data Vault offers.

In the end, remember that the strength of Data Vault lies in its flexibility, auditability, and scalability. Introducing non-Data Vault tables as a quick fix is feasible, but for sustainable and reliable insights, a fully Data Vault-compliant model remains the optimal choice.

Modelling Address Data

Watch the Video

Modeling Address Data: Key Insights and Recommendations

Address data is one of the fundamental components in various business databases, especially where detailed customer information is essential. This complexity can make the modeling process challenging, particularly when aiming for a single “Hub” that consolidates all address information. In a recent discussion on Data Vault Friday, Michael Olschimke of Scalefree explored the best ways to model address data effectively, considering key business and regulatory factors. This article will provide a summary of those insights, offering recommendations for creating a robust, scalable, and efficient address data model.

The Context: Address Data in a Single Hub

In this scenario, the challenge presented was how to model various address types within a single Hub. The primary objective was to avoid redundant address data and simplify handling of NULL values. As per ISO20022 standards and European Union regulations, the data model included attributes such as STREET_NAME, BUILDING_NUMBER, BUILDING_NAME, ROOM, FLOOR, POSTAL_CODE, TOWN_NAME, COUNTRY_CODE, ADDRESS_LINE_1, and ADDRESS_LINE_2. Each of these elements is part of a composite business key used to uniquely identify each address.

A practical but complex solution proposed by the team was to replace NULL values with a placeholder (e.g., “-2”) to streamline the loading process and minimize handling issues. However, Olschimke proposed several alternative approaches to ensure the model is both sustainable and scalable.

Challenges with Replacing NULL Values

Replacing NULL values in business keys can simplify loading but has significant drawbacks downstream, particularly when managing dimensional data and maintaining business logic clarity. For instance, when NULL values are replaced with a placeholder like “-2,” this value could appear in downstream reports, causing confusion. Olschimke suggested a more nuanced approach that utilizes “fixed hash values,” such as all zeros or all F’s, to represent empty or erroneous values.

By using fixed hash values, it becomes easier to identify default or error states directly within the data structure. This approach avoids unnecessary complexity when filtering data downstream and improves the clarity and manageability of data processing operations.

Avoid Overloaded Hubs and Null Values in Composite Keys

One of the main points of consideration was the risk of “overloading” Hubs, which occurs when multiple business objects with different semantic meanings are stored within the same Hub. This is particularly common when different types of addresses are stored under a single business key, where each address type may not require all fields (e.g., ROOM or FLOOR).

According to Olschimke, overloaded Hubs introduce complexity due to differing data granularities and missing values across address types. For example, multiple buildings, floors, or rooms could exist under a single address, resulting in multiple granularities within a single Hub. This makes it challenging to maintain clear, meaningful data relationships. Instead, he advised defining clear granularity levels and possibly separating address types or using more flexible data structures.

Alternative Modeling Solutions: JSON-Based and Reference Tables

In cases where multiple address types require flexibility, Olschimke suggested using JSON-based data structures. JSON provides flexibility in defining address attributes dynamically, storing only the keys available for a particular address. This approach reduces the risk of overloading and accommodates varying address structures without creating a complex, rigid schema.

JSON-based Hubs allow for hashing the address data as a single JSON object, ordered by key names to prevent duplicates. However, this approach requires a consistent, standardized order of attributes when hashing to ensure duplicate-free keys. For instance, JSON formatting could streamline the Hub and enable more adaptive data loading while simplifying downstream data extraction.

Additionally, using reference tables is another approach for frequently accessed address data, enabling deduplication without over-complicating the Hub. Reference tables act as dedicated sources of address data, indexed by a unique address ID, which reduces redundancy across other Hubs.

Considering Address Data as Descriptive in Satellites

Instead of adding addresses as business keys in the Hub, it can be more effective to store them as descriptive attributes within a Satellite structure. This avoids overloading the Hub with attributes that may not always be needed for identifying the business key itself. By storing address data in Satellites linked to the primary business entity (e.g., customers, stores), you can achieve a balance between deduplication and schema simplicity.

Olschimke recommended this approach particularly when the main goal is to eliminate redundancies across address data. This approach aligns with a best practice in Data Vault modeling: Satellite tables should contain descriptive data that change over time, while Hubs contain only essential business identifiers.

Applying Business Rules in Data Vault Modeling

Address data often requires additional business rules, especially when handling complex keys or duplicates. Olschimke pointed out that handling NULL values with a placeholder complicates creating downstream dimensions. Instead, a two-step approach was advised: (1) defining the business keys within the Hub with fixed placeholders (e.g., all zeros or all F’s) for default and error handling and (2) standardizing the Satellite structure to handle varying address formats dynamically.

Ultimately, each business has unique requirements, and the choice between single Hubs, JSON structures, and Reference tables will depend on how critical the address data is to the core business operations. By focusing on avoiding overloading and ensuring scalability, businesses can set up a Data Vault model that minimizes long-term maintenance while maximizing data clarity and accessibility.

Conclusion

Modeling address data in a Data Vault context can be intricate, especially when attempting to create a unified Hub that supports various address types. The key considerations discussed by Olschimke emphasize flexibility, simplicity, and adherence to business rules without overloading Hubs. JSON-based keys, reference tables, and Satellite structures offer alternative approaches to managing address data, allowing you to avoid pitfalls associated with NULL placeholders and composite keys.

For businesses tackling complex address data requirements, experimenting with these alternatives may yield significant benefits, particularly in managing data deduplication, compliance, and future scalability.

Interested in learning more? Check out Scalefree’s webinars and consider joining the Data Innovators Exchange community for discussions on data modeling, cloud computing, and Data Vault 2.0 best practices.

Loading SAP CDC Data into GDPR Compliant Data Vault

Watch the Video

Loading SAP CDC Data into a GDPR-Compliant Data Vault

When managing change data capture (CDC) data from SAP in a Raw Data Vault, special considerations are needed for both CDC information and GDPR-relevant personal data. This post will cover how to model CDC data in a Data Vault, including the unique handling of created, updated, and deleted records. We’ll also discuss best practices for splitting data into separate satellites to manage GDPR-compliant attributes, including empty columns and privacy concerns.

This content is based on a discussion led by Michael Olschimke, CEO of Scalefree, during a Data Vault Q&A session.

CDC Data Modeling in the Data Vault

The primary challenge with CDC data is that it only includes changes from SAP, not the full dataset each time. CDC data typically includes metadata on whether a record was created, updated, or deleted in SAP. Here’s a look at how to approach modeling this data in a Data Vault:

1. Load CDC Data in Satellites with Adjusted Patterns

In Data Vault, the data model remains unchanged, regardless of how the data is delivered (batch, CDC, or real-time). However, the loading pattern for CDC data into satellites needs some adjustments:

  • Delta Check Adjustment: Normally, the Data Vault delta check identifies changes before loading data into a satellite. With CDC data, the changes are already captured, so this step can be bypassed. Instead, all changes from CDC data are loaded into the appropriate satellites directly.
  • Change Impact Across Satellites: When there’s a change in one attribute, it triggers an update in all relevant satellites. While this approach can create non-delta records, the impact is typically minimal, and the redundant data can be compressed for storage efficiency.

Handling GDPR-Relevant Personal Data

CDC data often includes both regular attributes and GDPR-sensitive personal data. In the Data Vault, personal data attributes should be separated based on privacy and security classes to ensure compliance and manage access. Here’s the recommended approach:

2. Splitting Satellites Based on Privacy Classifications

For GDPR compliance, split CDC data into multiple satellites:

  • Personal Data Satellite: A dedicated satellite for GDPR-relevant attributes (such as names or IDs). This separation allows for tighter security and privacy control.
  • Non-Personal Data Satellite: General attributes with no privacy concerns go into a separate satellite to reduce the risk of exposure.
  • Additional Splits: Further splits may be required based on rate of change, security levels, or business context, depending on the specific needs of your organization.

Maintaining separate satellites for different classes of data ensures that personal information is handled with stricter privacy controls, helping your data architecture comply with GDPR requirements.

Managing Empty Columns in the Data Vault

It’s common for source tables to contain columns that are always empty. When working with CDC data in a Data Vault:

  • Include Empty Columns for Auditing: To retain full traceability and audibility, include empty columns in the satellite. This preserves the exact structure of the source data without altering it.
  • Consider Separate “Unused Data” Satellite: If there are many empty columns, these can be grouped into a dedicated satellite, making the primary satellites leaner for users.

This approach allows for flexibility if the data in these columns becomes relevant in the future. Auditors will appreciate the comprehensive structure, and the Data Vault will retain all source data in its original form.

Example Satellite Structure

With GDPR compliance and CDC loading adjustments in mind, here’s an example structure for splitting SAP CDC data into satellites:

Satellite: CDC_Personal_Data
- Attributes: GDPR-relevant data (e.g., personal names, social security numbers)
- Metadata: Load date, source, change type (create, update, delete)
- Purpose: Privacy-controlled access

Satellite: CDC_NonPersonal_Data
- Attributes: Non-personal data columns
- Metadata: Load date, source, change type
- Purpose: General access

Satellite: CDC_Unused_Columns
- Attributes: Columns always empty in the source table
- Metadata: Load date, source
- Purpose: Compliance and future-proofing

Best Practices for Satellite Splitting in the Data Vault

When splitting data into satellites, follow these best practices:

  • Split by Privacy and Security: Ensure that personal and non-personal data are stored separately, particularly when handling GDPR-relevant information.
  • Split by Source System: Keep different source systems in separate satellites for clarity and maintainability.
  • Consider Business Needs: If certain data attributes are only relevant to specific business cases, split them accordingly to reduce satellite complexity.

These principles provide a clean, secure, and compliant Data Vault structure that enables efficient data retrieval, flexibility, and regulatory adherence.

Conclusion

Modeling SAP CDC data in a GDPR-compliant Data Vault involves adjustments to loading patterns, especially when dealing with CDC deltas and GDPR-sensitive data. By separating data based on privacy classes and including empty columns where necessary, you can ensure compliance and maintain a flexible data model. The approach outlined here simplifies the handling of CDC data, while providing robust auditing and privacy control.

Close Menu