Skip to main content
search
0

Learning the Data Vault Patterns

Watch the Video

Learning the Data Vault Patterns with Azure Synapse

When it comes to managing complex, evolving data landscapes, implementing a Data Vault architecture is a popular choice. The Data Vault approach is especially helpful in handling large volumes of data while maintaining flexibility, scalability, and historical tracking. Recently, a question was raised regarding the use of Azure Synapse Analytics with Data Vault for the initial Raw Data Vault development without a finalized automation tool. Specifically, the question was: Is it feasible to start development with Synapse Notebooks using PySpark and Delta Lake?

This article will explore the answer to that question and dive into key considerations, best practices, and strategies to make the most of a manual setup while preparing for eventual automation in your Data Vault implementation.



Is It Feasible to Start Data Vault with PySpark and Delta Lake?

The short answer is yes—starting with Synapse Notebooks, PySpark, and Delta Lake is indeed a viable option. In fact, this manual approach can be an excellent way to familiarize your team with Data Vault concepts, patterns, and methods before committing to an automation tool. By manually building the Raw Data Vault, your team can gain hands-on experience with essential processes, which will make the transition to automation smoother and more effective later on.

Historically, many Data Vault practitioners began with manual scripting due to limited automation tools. Over time, this “manual-first” method became a useful way to learn the intricate patterns of Data Vault. Today, automation tools for Data Vault are abundant, and using them is generally more efficient, but there’s still a place for manual methods, especially in the early learning stages of a project. Let’s look closer at why this approach works and what to consider as you work manually.

Benefits of Starting Manually with Synapse, PySpark, and Delta Lake

Using PySpark and Delta Lake in Azure Synapse Notebooks gives your team flexibility to:

  • Learn Core Data Vault Patterns: Building the Raw Data Vault manually helps the team understand Data Vault concepts, such as Hubs, Links, and Satellites. This is crucial knowledge that will benefit the project long-term.
  • Experiment with Modeling: Working without automation allows you to refine your approach and test different design patterns. This is especially helpful in creating a foundation that’s tailored to your organization’s specific needs and datasets.
  • Understand Data Transformation and Ingestion: By manually scripting transformations and ingesting data, your team will better understand the processes that an automation tool would handle. This will help in configuring automation later and troubleshooting any issues that arise.
  • Validate Requirements and Patterns: Since no tool has been chosen yet, working manually allows you to get a head start on modeling and confirm your business and technical requirements early in the project.

The Drawbacks of a Manual Approach

While starting manually has its advantages, it’s important to be aware of the limitations. The primary drawbacks of a manual approach are:

  • Time and Effort: Developing the Raw Data Vault by hand is time-consuming. Each process, from creating Hubs to tracking Satellites, requires careful attention to ensure the design aligns with Data Vault standards.
  • Limited Scalability: A manual setup is challenging to scale, especially as the data volume grows. While PySpark and Delta Lake are powerful tools, they’re not a substitute for the scalability offered by automation tools.
  • Risk of Technical Debt: Scripts developed manually might not be as maintainable or reusable as the templates generated by automation tools. Technical debt can accumulate if the team spends too much time maintaining manual scripts, and transitioning to automation could later require extensive rework.

Steps to Optimize Your Manual Development Process

If you decide to move forward with this manual approach, here are some strategies to make the process more efficient and to set a foundation for a future transition to automation:

  1. Document Your Patterns Thoroughly: Take detailed notes on the specific design patterns, scripts, and models you develop manually. These can serve as templates when you move to automation, making the transition much easier.
  2. Define Clear Modeling Standards: Establish consistent modeling practices for Hubs, Links, and Satellites. This will reduce ambiguity and provide a structured foundation for automation tools to build upon later.
  3. Refine and Iterate: Since you’re building manually, use this time to refine your models. Adjust and improve them based on the unique data flows and needs of your organization.
  4. Focus on Core Entities: Prioritize building out core Hubs and Links in your Raw Data Vault, focusing on the entities that are most crucial to your organization. This will create a solid foundation that can be extended as you move to automation.

Preparing for Automation: What to Keep in Mind

Even as you start manually, keep automation in mind as a goal. Today, there are numerous Data Vault automation tools available, each with its own strengths and weaknesses. As you prepare for this transition, here are some key considerations:

1. Research Automation Tools

Explore the different automation tools available on the market. Each tool has its own approach, interface, and capabilities, so it’s essential to choose one that aligns well with your organization’s technical needs, budget, and data infrastructure. Some tools focus on business-user accessibility, while others offer more technical configurations. Common tools for Data Vault automation in Azure include solutions specifically designed for Synapse or those that support PySpark and Delta Lake.

2. Select Tools with Scalability

When selecting a tool, consider how it handles scalability, as this is vital for supporting the growing volume of data in a Data Vault. Some automation tools handle scalability better than others, depending on how they manage Hubs, Links, and Satellites. In Azure Synapse, it’s also important to assess compatibility with Delta Lake and PySpark, as well as overall integration with Azure’s data ecosystem.

3. Account for Tool Limitations

Even the best automation tools have limitations. Be prepared to adjust your approach based on the chosen tool’s capabilities. For example, some tools may limit specific patterns, such as complex Satellites or multi-active relationships. By understanding these limitations ahead of time, you can avoid rework and ensure that your initial manual development aligns well with the selected tool.

4. Focus on Configurability and Customization

Ensure that your chosen tool allows for some level of customization. This is important because the patterns you develop manually may need to be fine-tuned or adjusted within the tool. Look for tools that offer configurable templates, adaptable interfaces, and support for customizations to fit your organization’s specific needs.

Moving Forward: Transitioning to Automation

As your team becomes familiar with Data Vault patterns through manual development, the next step is to select and implement an automation tool. While the manual work provides a deep understanding of Data Vault patterns, an automation tool will streamline repetitive processes, ensure consistency, and save significant time and effort as your data volume grows.

One recommended approach is to use the lessons learned during manual development to create tailored templates and workflows within your chosen tool. By doing so, you can optimize the automation tool’s capabilities based on the patterns you’ve already tested and refined. This makes for a smoother, more effective transition from manual development to automated workflows.

Final Thoughts

Starting Data Vault development manually with Synapse Notebooks, PySpark, and Delta Lake is a feasible and often beneficial approach, especially when automation tools haven’t been finalized. While this method is time-consuming and requires effort, it offers valuable insights and allows your team to learn and optimize Data Vault patterns before committing to an automation tool.

Remember, the goal is to use this manual phase to build a strong foundation, explore modeling choices, and establish best practices. When the time comes to select an automation tool, your team will be well-prepared to leverage it to its fullest potential, ensuring a scalable and efficient Data Vault implementation within Azure Synapse Analytics.

Handling JSON Data in Data Vault Satellites on Snowflake

Watch the Video

Handling JSON Data in Data Vault Satellites on Snowflake

In this blog post, we’ll discuss the challenges and best practices for handling JSON data in Data Vault architectures, specifically when using Snowflake. This question from our audience raises a common scenario in modern data environments where semi-structured data is prevalent. Let’s dive into how storing JSON in Data Vault Satellites works, when to extract fields, and the performance considerations for Snowflake.

Why Store JSON in the Satellite?

With the increasing use of APIs and complex data integrations, many businesses are receiving data as JSON. JSON is a flexible, semi-structured format that allows for varying structures and nested elements, making it ideal for some applications. However, this flexibility also introduces complexities when using a structured data model like the Data Vault.

The challenge here is that when you receive a JSON payload, it may contain hundreds of fields, many of which may not be immediately needed. Extracting every JSON field to store as individual columns in the Raw Data Vault can be inefficient and unnecessary. Therefore, one solution is to store the JSON “as-is” in the Satellite and extract only what’s required later in the Business Vault.

Choosing When to Extract Fields from JSON

As a general rule, it’s beneficial to extract attributes that are needed for reporting, security, or compliance early in the process. Relational attributes stored directly as columns in the Raw Data Vault are easier to query, and they’re stored more efficiently in structured databases.

For other fields in the JSON payload that are less critical or not immediately required, you can keep the data in its JSON format. This approach minimizes upfront work and allows for flexible data exploration later without overloading the Raw Data Vault with unnecessary fields.

For example, consider an e-commerce environment where transaction data is received in JSON format. The Raw Data Vault could extract only essential fields like transaction ID, customer ID, and total amount while leaving other detailed information about the items purchased in JSON format.

JSON and Snowflake: Optimized for Performance

One of Snowflake’s strengths is its ability to store and query JSON data natively. Snowflake’s support for semi-structured data formats like JSON and its efficient handling of this data type make it possible to perform queries directly on JSON attributes without needing to first extract them into columns.

This capability allows you to store JSON data in the Raw Data Vault and query it directly, even in production, without significant performance penalties. This approach works well for data that may only be needed occasionally or where performance is not a major concern. Snowflake’s built-in functions for working with JSON make it easy to extract specific attributes on demand and use them in downstream reporting.

Handling Business Logic in the Business Vault

In Data Vault, business rules are often applied within the Business Vault layer, which is designed for derived and calculated data. When working with JSON in this context, it’s common to keep the original JSON structure in the Satellite and then perform transformations as needed in the Business Vault.

In Snowflake, you can take advantage of virtual views to simplify your data pipeline. Rather than physically extracting and storing each JSON attribute in the Business Vault, you can create virtual views that access JSON fields as needed. This approach offers the benefit of flexibility and reduces storage requirements since you avoid duplicating data.

Structuring Semi-Structured Data Over Time

Storing JSON in the Satellite provides flexibility, but as your business needs evolve, you may find that some attributes become more relevant for analysis. At this point, you may consider structuring your JSON data incrementally, extracting only the fields that are frequently queried or required for compliance.

This gradual structuring process aligns with the Data Vault’s design principles. Over time, your semi-structured data will naturally become more organized as business requirements clarify which fields are necessary for analytics or reporting. You can continue to leave less critical fields in JSON format, maintaining flexibility while optimizing performance where it matters.

Privacy and Security Considerations with JSON in Satellites

One important consideration when storing JSON data in the Satellite is ensuring compliance with data privacy regulations such as GDPR. It’s essential to classify personal or sensitive data in the JSON payload and handle it appropriately.

In cases where sensitive data is involved, you may choose to split your Satellites by data classification, isolating high-sensitivity fields into their own Satellite tables with stricter access controls. Alternatively, you can extract and separate specific fields that require special handling and leave the rest in JSON format.

Working with JSON Arrays and Nested Objects

JSON data often includes nested structures, such as arrays, that add complexity to handling data in a relational model. For instance, customer records might contain an array of phone numbers or addresses. In these cases, Snowflake offers tools for managing nested data without requiring extensive restructuring.

One option is to store the entire JSON array as-is within a JSON attribute in the Satellite. For example, a JSON array containing multiple phone numbers could be stored directly in the Satellite, eliminating the need for a separate table to manage multi-active data. This approach provides a simpler alternative to the traditional multi-active Satellite design.

Snowflake allows you to work with nested JSON data using its FLATTEN function, which simplifies querying nested structures without requiring complex joins or data transformations. This method can be particularly useful when dealing with highly nested data, especially in real-time applications.

Virtualizing JSON Access in Snowflake

Snowflake’s JSON handling capabilities allow you to query JSON fields dynamically, making it possible to virtualize the access to JSON attributes in your Satellite tables. Virtual views can be used to transform JSON data for reporting without needing to physically extract every field into columns.

This approach offers significant flexibility. By keeping the JSON data in its original form, you avoid unnecessary data transformations and can leverage Snowflake’s powerful JSON functions to create reports on demand. Virtualization is an effective way to simplify your data pipeline, particularly when dealing with complex JSON structures.

Conclusion

Storing JSON in the Data Vault Satellite layer on Snowflake is a sound approach, especially when dealing with semi-structured or complex data. By extracting only the fields that are critical and leaving the rest in JSON format, you balance flexibility with efficiency. Snowflake’s support for JSON makes it possible to work with this data in real-time, leveraging virtual views to keep your pipeline agile.

Consider implementing a mix of JSON storage and relational attributes, structuring your data incrementally as your business needs evolve. With the right approach, Snowflake’s JSON capabilities allow you to handle diverse data types efficiently, supporting both compliance requirements and analytics needs.

Data Migration – Ensuring Data Accuracy and Compliance During a Migration

Data Migration Great Expectations Diagram Architecture

Data Migration

Data migration is a complex process that requires careful planning and execution. Understanding the data landscape, ensuring minimal downtime, managing stakeholder expectations, and most importantly, maintaining the integrity and security of your data throughout the transition are critical. Failing to address these factors can lead to data loss, corruption, or non-compliance with regulatory standards, which can have significant business implications. In a worst-case scenario, stakeholders may notice data issues before the data team does, decreasing trust in the data and the team. Another potential problem, having to work overtime because of data issues which were not noticed before.

Features embedded within dbt (Data Build Tool) and tools like Great Expectations offer powerful solutions to help organizations manage these risks, ensuring that the data remains reliable and compliant as it moves through the migration process.

Ensuring Data Accuracy and Compliance During a Migration: Leveraging dbt and Great Expectations

This webinar will cover the essential aspects of maintaining data accuracy and compliance throughout the data migration process. We will explore how dbt (Data Build Tool) enables robust data transformation with built-in and custom tests, ensuring data integrity at each stage. Additionally, we will demonstrate how Great Expectations enhances data validation, allowing you to enforce specific rules and expectations, ensuring a smooth and secure migration with minimal risk of errors or inconsistencies.

Watch Webinar Recording

Leveraging dbt for Data Accuracy

A powerful tool for data transformation: dbt enables teams to build, test, and document data pipelines. By utilizing its features, such as tests and contracts, dbt ensures data consistency and accuracy. We’ll explore these capabilities in detail below.

dbt offers two ways to define tests, singular and generic data tests.

  • Singular data tests: Custom SQL query that is written to test a specific condition or logic in the data. It is highly tailored to a particular use case or business logic. In essence, it’s a standalone test where the developer writes custom SQL to check for specific data anomalies or inconsistencies.
  • Generic data tests: Pre-defined and reusable tests that can be applied to multiple models or columns across different datasets.

Examples of Generic Tests:

  • Unique Tests: Ensure that a field in your dataset contains unique values, which is critical for primary key fields
  • Not Null Tests: Validate that a field does not contain any null values
  • Referential Integrity Tests: Checks that foreign key relationships are maintained, ensuring consistency across related tables
  • Accepted values: Useful tests for columns which receive predictable data

Tests can be configured to: either fail (severity: error) or issue a warning (severity: warning). Conditional expressions such as error_if and warn_if can refine this behavior, e.g., triggering a warning only after a certain number of failures.

dbt Contract enforcement

  • Enforces that dbt model schemas adhere to predefined data structures
  • Defines specific columns, data types, and constraints (e.g., not null, unique)
  • Raises errors before materializing the model as a table, allowing identification of schema issues

Pro-tip for data migrations: use incremental models to update only new or modified records, which improves efficiency and avoids full table rebuilds. When enforcing a contract, the “append_new_columns” option is useful as it retains old columns, minimizing issues. The “sync_all_columns” setting is particularly handy for automatically adjusting the schema by adding new columns and removing missing ones, making it ideal for migrations with frequent renaming.

dbt-expectations vs Great Expectations

dbt-expectations integrates data quality tests into dbt, while Great Expectations provides a broader framework for managing data validation across various sources. Together, they enhance data accuracy and reliability.

dbt-expectations

The dbt-expectations package extends dbt’s testing capabilities by providing a collection of pre-built, customizable data quality tests inspired by Great Expectations. This package helps automate and standardize data quality checks across multiple models, ensuring that datasets meet specific expectations before they are used in downstream processes.

Here are some examples of data quality tests you can run using the dbt-expectations repository (we are going to cover more in the webinar):

  • Expect_column_values_to_match_regex: Verifies that all values in a column match a given regular expression pattern
  • Expect_column_median_to_be_between, expect_column_min_to_be_between, expect_column_max_to_be_between: Ensures numeric column values fall within specified ranges
  • Expect_column_pair_values_a_to_be_greater_than_b: Checks that values in one column are greater than values in another

Why Consider Integrating Great Expectations?

With dbt-expectations providing robust testing within a single dbt project, you might wonder why you’d want to integrate Great Expectations. Here’s why:

  • Cross-Database Comparisons: dbt-expectations works well within a single SQL-based data warehouse. However, if you need to compare data across different databases (like Snowflake and SQL Server), Great Expectations offers a broader solution.
  • Broader Data Validation: Great Expectations supports multiple data sources, including CSV, Parquet, JSON, APIs, and various SQL databases. It provides a flexible and user-friendly platform to define, manage, and execute data quality tests across diverse sources.
Data Migration Great Expectations Diagram Architecture

Key Features of Great Expectations:

  • Data Profiling: Before starting your migration, use GE to profile your data and set expectations based on its current state.
  • Detailed Validation Reports & Dashboards: GE offers comprehensive reports and visualizations, outputting results in formats like HTML, Slack, JSON, and Data Docs. This enhances transparency and provides deeper insights for both technical and non-technical stakeholders.
  • Customizability and Extensibility: GE allows you to define custom expectations tailored to your data pipeline and integrate with other testing libraries.
  • Version Control & Historical Validation: Track changes in data quality over time with version control, helping to identify trends and recurring issues.
  • Production Monitoring & Integration: Integrate GE with data orchestration tools like Airflow, Prefect, or Dagster to incorporate data quality checks into your broader workflows, including those not managed by dbt.

Integrating Great Expectations with dbt

Great Expectations complements dbt by offering a flexible platform for data validation beyond single-project scenarios. By integrating GE with dbt, you can achieve a more comprehensive approach to data quality, ensuring your migration process is as smooth and reliable as possible.

In the upcoming webinar, we will explore practical examples of dbt tests, dbt-expectations, and Great Expectations validations, so stay tuned!

(Single) Point of Facts

Watch the Video

Single Point of Facts in Data Lakehouse Architecture

Welcome to another edition of DataVault Friday! Today, we’re diving into a frequently debated topic in data management: determining the “source of truth” in a data lakehouse architecture. Specifically, we’ll answer whether that source of truth resides in data lake files or in the tables of the Raw Vault and Business Vault. Additionally, we’ll address what to do if a bug in the ingestion framework requires a re-ingestion of data sources. Let’s explore these questions to better understand data lineage, data governance, and strategies for a reliable and flexible data ecosystem.

Understanding the “Single Point of Facts”

In traditional data warehousing, the idea of a “single version of the truth” is well-known. This concept implies that there is one version of the data that acts as the definitive source across an enterprise. For instance, a single “customer” or “product” definition applies universally within the organization.

However, in Data Vault architecture, we move from a “single version of the truth” to a “single point of facts.” The focus shifts from universal definitions to an unaltered, auditable record of events. Data Vault is designed to capture historical data accurately and reliably. It provides multiple perspectives on the data (versions of the truth) by isolating raw data from any business logic or transformations. This flexibility allows organizations to apply different business rules depending on context, while maintaining a consistent underlying dataset.

In this context, the Raw Vault is considered the foundational layer, capturing facts as they are, directly from the source systems. The Business Vault, on the other hand, introduces additional business rules, metrics, and aggregated data for reporting purposes. But in essence, the “single point of facts” remains within the Raw Vault because it represents an unaltered and auditable record.

Data Lakehouse Architecture and Points of Truth

In data lakehouse architecture, data is stored both in a data lake and within Data Vault tables. This raises the question: Which source is the ultimate truth? The data lake, with its raw files, or the Raw Vault tables?

The answer depends on the architectural requirements and the level of traceability and auditability needed. Ideally, both systems should mirror each other and serve as points of fact:

  • Data Lake: The data lake serves as a repository for raw data files, often storing snapshots or full loads of data from source systems. This makes it easier to preserve the original data as-is without alteration.
  • Raw Vault: In the Raw Vault, data is loaded into a structured schema, capturing the same original details but in a way that can be more systematically queried and analyzed. Like the data lake, the Raw Vault stores unmodified facts, but it also preserves lineage information, making it possible to reproduce deliveries and trace data transformations.

Since both layers should hold the same underlying data, they collectively represent the point of fact. Either the data lake or Raw Vault can serve as the truth source, depending on the scenario. This dual system ensures a resilient architecture, as data can be cross-validated across layers.

What If There’s a Bug in the Ingestion Framework?

One key question that arises is: What happens if there’s a bug in the ingestion framework? Bugs such as incorrect population of business keys or other erroneous transformations might require a complete re-ingestion of data sources.

When dealing with bugs in data ingestion, having both a data lake and a Raw Vault allows flexibility and safeguards. Here’s how to address these issues:

  1. Identify and Isolate the Problem: Pinpoint where the issue occurred in the ingestion process and document the scope of the bug, especially if it affects business keys or other critical aspects of data integrity.
  2. Rely on the Data Lake for Original Files: Since the data lake contains the original, unaltered data files, you can reload the affected data from here into the Raw Vault. This ensures that any corrupted or inaccurately transformed data can be replaced without loss.
  3. Reprocess the Raw Vault: With the correct data now available from the data lake, reload the Raw Vault. Ensure that new ingestion processes are thoroughly tested to avoid repeating the error.
  4. Automate Audits and Reconciliation: Implement automated reconciliation checks between the data lake and the Raw Vault. Automated scripts can flag discrepancies, giving early warning of issues before they reach production or reporting layers.

By leveraging both the data lake and Raw Vault as points of fact, the architecture remains robust and auditable. This redundancy allows for re-ingestion without significant downtime and ensures that data lineage remains traceable throughout the lifecycle.

Data Lake vs. Raw Vault: Which Is Easier for Reconstruction?

When it comes to reconstructing deliveries, the data lake often offers simplicity. Since the data lake can hold raw files with minimal transformation, data reconstruction is a straightforward matter of accessing the original files. In contrast, reconstructing from the Raw Vault requires additional effort, as data must be accurately joined across hubs, links, and satellites, while preserving the original state.

That said, both layers should be auditable, with logging mechanisms that allow for a traceable history of changes. Having a clear data lineage in place allows organizations to meet compliance and audit requirements while supporting accurate reporting.

Best Practices for Managing Points of Fact

While it’s tempting to designate a single point of fact, the dual-layered approach with a data lake and Raw Vault provides a more resilient framework. Here are some best practices for managing points of fact in a data lakehouse architecture:

  • Maintain Consistency Between Layers: Ensure that data lake files and Raw Vault tables match exactly. Automate reconciliation checks between these layers to verify data integrity.
  • Implement Auditable Ingestion Processes: Document all transformations from the data lake to the Raw Vault, with logging and error-checking mechanisms. This allows for easier tracing of issues if they arise.
  • Retain Original Data in the Data Lake: Always keep a copy of original files in the data lake. These files provide a reliable source of truth that can be referenced or reloaded into the Raw Vault if issues occur.
  • Leverage Metadata for Automation: Metadata can streamline both ingestion and reconciliation. Use metadata to define business keys, relationships, and descriptive data in the Raw Vault, while automating verification processes.

With these practices, data lakehouse architecture can be made robust, auditable, and resilient to changes or errors. By treating both the data lake and Raw Vault as points of fact, you ensure that your data ecosystem remains flexible, trustworthy, and ready to meet evolving business requirements.

Conclusion

The question of “single point of facts” in data lakehouse architecture doesn’t have a straightforward answer. Both the data lake and the Raw Vault act as points of fact, each offering unique benefits in terms of auditability and reconstruction. By utilizing both, you create a highly resilient system capable of withstanding data issues while providing a comprehensive, consistent view of your data.

In summary, while the Raw Vault may traditionally serve as the “single point of facts,” using both the data lake and the Raw Vault as truth sources creates a flexible architecture that can accommodate re-ingestion, mitigate risks, and support accurate reporting. With this dual approach, your data lakehouse architecture becomes a reliable foundation for modern data needs.

Multi-Active Satellites & Dependent Child Keys

Watch the Video

Understanding Multi-Active Satellites and Dependent Child Keys in Data Vault

Multi-Active satellites and dependent child keys provide solutions for storing data with complex granularities in Data Vault models. Data Vault is known for its highly structured enterprise data warehousing approach, built on Hubs, Links, and Satellites to capture data lineage, maintain historical accuracy, and ensure scalability. However, specific data scenarios, such as handling different data granularities, often lead to questions on multi-active satellites and dependent child keys. This article breaks down these concepts and clarifies their differences and use cases in a Data Vault environment.

 

What is a Multi-Active Satellite?

A multi-active satellite is designed to manage multiple records for a single business object that are active simultaneously. This scenario arises when a business object, like a customer, can have several active data entries at the same time. For example, a customer could have multiple addresses (home and work), both of which are valid at the same time.

In a typical satellite structure, a business key (e.g., customer ID) combined with a load date timestamp defines the primary key. However, in cases of multiple active records, this primary key is insufficient because it won’t uniquely identify each active instance. Instead, an additional attribute, such as an address type (home or work), is added to the primary key to differentiate each record. This approach allows the satellite to track multiple entries for the same business key without duplicating data and helps capture finer details in the data warehouse.

Example of a Multi-Active Satellite

Let’s say our source system has a customer with ID C123 who has two active addresses: one for home and one for work. In a standard satellite, we might have one record per business key. But in a multi-active satellite, we store both addresses simultaneously by using an additional identifier (e.g., “address type”) in the primary key:

  • Customer ID: C123
  • Load Date: Timestamp of data load
  • Additional Identifier: Address type (e.g., home, work)

This approach allows multiple entries for a single business object (in this case, customer C123) while maintaining unique records in the satellite table.

What is a Dependent Child Key?

A dependent child key is used to manage relationships between multiple business objects at a finer granularity level than a standard Data Vault link would allow. Dependent child keys are typically applied in links where we need to track multiple occurrences of a relationship between business objects, such as an order and its line items.

Consider an order containing multiple line items, where each item references a product. Here, the dependent child key (like line item number) uniquely identifies each relationship instance, as it provides additional detail beyond just the order and product identifiers. This allows multiple rows in the link for the same business objects while maintaining unique records.

Example of a Dependent Child Key

Imagine we have an order O123 for a customer C123, which includes two line items for the same product but with different prices or quantities:

  • Order ID: O123
  • Customer ID: C123
  • Product ID: P123
  • Dependent Child Key: Line item number (e.g., 1, 2)

In this case, we create unique rows for each line item, where the line item number differentiates each record. This approach ensures that each entry is stored and tracked individually.

Key Differences Between Multi-Active Satellites and Dependent Child Keys

Although multi-active satellites and dependent child keys both enable handling of finer data granularity, they serve different purposes and are used in distinct contexts:

  1. Multi-Active Satellites
    Applied within a single business object to handle multiple active records at the same time. The additional identifier helps capture simultaneous entries for the same object in a satellite.
  2. Dependent Child Keys
    Used in links between multiple business objects, where the additional key captures the finer detail of each relationship instance, such as line items in an order.

When to Use Each Approach: Multi-Active Satellites & Dependent Child

The choice between using a multi-active satellite or a dependent child key depends on the data granularity and relationships in your data model:

  • Use Multi-Active Satellites when handling multiple active records for a single business object, where each entry is related only to the primary business key (e.g., customer with multiple addresses).
  • Use Dependent Child Keys when tracking detailed relationships between different business objects that require additional identifiers to maintain uniqueness (e.g., order and line items).

Summary

Multi-active satellites and dependent child keys are helpful for storing data with complex granularities in Data Vault models. While multi-active satellites allow multiple simultaneous records for a single business object, dependent child keys enable unique identification of complex relationships in links. Both approaches maintain Data Vault’s principles of scalability and data integrity by preserving unique records and enabling detailed tracking of business data.

In short:

  • Multi-Active Satellite: For multiple records active simultaneously within a single business object.
  • Dependent Child Key: For relationships across multiple business objects that need finer detail, typically in links.

Meet the Speaker

Data Vault Mixed Model

Watch the Video

Understanding Data Vault Mixed Models: Integrating Non-Data Vault Entities in the Business Vault

Data Vault architecture is a widely used methodology in data warehousing, providing a highly adaptable model for managing complex data environments. It organizes data primarily in three core components: Hubs, Links, and Satellites. These elements support business keys, relationships, and descriptive data to create a comprehensive data structure within the Raw Data Vault layer. However, as with many methodologies, real-world data often introduces elements outside the strict boundaries of this structure, sparking questions around flexibility.

This article explores the concept of a “mixed model” in Data Vault, where non-Data Vault entities coexist with Raw Data Vault components and how they might be integrated within a Business Vault structure. While the purest Data Vault models focus on auditability and lineage, allowing for seamless automation, mixed models can sometimes be practical if managed thoughtfully. So, is it permissible to mix non-Data Vault entities with the Raw Data Vault within a Business Vault? Let’s dive into this topic.

What is a Data Vault Mixed Model?

A “mixed model” in Data Vault refers to a scenario where traditional Data Vault structures (Hubs, Links, and Satellites) are used alongside other non-Data Vault tables or entities. In essence, while most data resides in the structured Raw Data Vault, there are other data components within the same database that do not conform to Data Vault architecture. This raises questions about integrating these disparate data types in the Business Vault.

The Business Vault is designed to serve as a refined, operational version of the Raw Data Vault. It enhances the raw data with business logic and transformations to create actionable insights. In scenarios where a mixed model is necessary, the goal is often to leverage existing non-Data Vault tables to derive business insights while minimizing disruption to the original data model.

Can You Integrate Non-Data Vault Entities with Raw Data in the Business Vault?

According to Data Vault principles, the ideal approach is to structure all data as Hubs, Links, and Satellites to ensure consistency, auditability, and lineage. However, a mixed model approach can sometimes be necessary. For instance, you may have a database that combines data stored in the Raw Data Vault with tables or entities that don’t follow Data Vault structures. So, is it allowed?

The short answer is yes, you can technically integrate non-Data Vault entities within the Business Vault, but it comes with caveats. Here’s a deeper look at the implications:

  1. Temporary Solutions Only
    Mixing non-Data Vault data with Raw Data Vault entities is generally seen as a temporary solution. It may help in quickly bridging data that doesn’t yet fit into the Data Vault model, allowing for rapid integration. However, over time, this approach can lead to complexity in querying and reduce the consistency that Data Vault offers.
  2. Impact on Automation and Maintainability
    Introducing non-standard tables complicates automation within the Business Vault. Data Vault design leverages automation tools like dbt, Wherescape, and Vaultspeed, among others. These tools facilitate a streamlined workflow in Data Vault implementations by allowing for automated lineage, auditing, and data transformations. When introducing non-Data Vault entities, the automation capabilities are hindered, requiring custom scripts or queries that deviate from standard Data Vault patterns.
  3. Jeopardizing Auditability and Lineage
    One of Data Vault’s strongest value propositions is its focus on data lineage and auditability. In a mixed model, these aspects may be compromised. Without adhering to the structure of Hubs, Links, and Satellites, it becomes challenging to track data history, version control, and capture all changes comprehensively. For organizations that rely on these features for regulatory or quality purposes, compromising lineage may be a serious drawback.
  4. User Mart as an Alternative
    An alternative approach is to build a “User Mart” for ad hoc or analytical queries that combine data from the Raw Data Vault with non-Data Vault tables. This User Mart allows users to query both Raw Data Vault and external entities without disrupting the core Business Vault structure. This approach is particularly useful when users have specific reporting or analytical requirements that may not require full Data Vault transformation.
  5. Pragmatic Approach: Virtual Hubs and Links
    A practical solution in Data Vault projects is to create “virtual” Hubs, Links, and Satellites for non-standard tables, which serve as placeholders within the Raw Data Vault structure. This approach allows for quick integration while maintaining some level of standardization. For example, if there’s a reference table with country names and codes, you might create a virtual Hub for the country and map descriptive details as a virtual Satellite. This doesn’t achieve full lineage but can serve as a bridge until a proper Data Vault structure can be implemented.
  6. Reference Tables and Non-Critical Data
    In scenarios where data like reference tables (e.g., country codes, zip codes) doesn’t require full lineage or version tracking, a flat and wide reference table can be used. If a reference Hub and Satellite are unnecessary, keeping the data simple with a primary key and descriptive columns is often sufficient. This approach can work well for non-essential data, where maintaining Data Vault-style rigor may not be worth the effort.

Strategies for Long-Term Success with a Mixed Model

If you decide to proceed with a mixed model, it’s crucial to plan for a future transition toward a fully Data Vault-compliant design. Here are some tips:

  1. Prioritize Refactoring Non-Data Vault Entities
    Establish a clear roadmap for converting non-Data Vault tables into Hubs, Links, and Satellites over time. This phased approach enables you to work within existing constraints while planning for a more robust and compliant Business Vault.
  2. Minimize Technical Debt
    Track instances of non-Data Vault elements within your data ecosystem and treat them as “technical debt” to be managed and resolved in the long term. This keeps you aware of areas where auditability or automation might be compromised.
  3. Use Metadata-Driven Automation
    Employ metadata-driven automation tools as much as possible to simplify future integrations and transitions. These tools enable automated data processing across the Data Vault pipeline, making it easier to add and transform new data sources into compliant Data Vault structures.
  4. Implement Strict Governance for User-Generated Data
    In cases where users introduce their own data models within the User Mart or Business Vault, set governance policies to standardize data usage and maintain some level of alignment with Data Vault patterns. These policies can mitigate risks related to data quality and ensure that non-Data Vault data remains manageable.

Practical Example of a Mixed Model in Action

Consider a financial services organization that maintains a Raw Data Vault with transaction data but also has a separate schema for customer reference tables, such as customer demographics and location details. Rather than directly integrating these tables into the Business Vault, the organization could create virtual Hubs and Links that link customer IDs and locations to transactions. This allows them to continue working within the Raw Data Vault framework while planning to reformat reference tables in alignment with Data Vault standards.

Another example might involve a large retail company where user-generated data models in the User Mart are frequently used to support marketing analysis. Here, the organization could implement a temporary mixed model that accommodates fast-paced analysis while planning for a phased migration to Data Vault structures over time.

Conclusion: Balancing Flexibility with Data Vault Integrity

While a mixed model is not ideal within Data Vault architecture, it can serve as a temporary, pragmatic solution when there’s an immediate need to integrate non-Data Vault entities. Virtual Hubs and Links, User Marts, and strict governance policies can help manage the complexity introduced by non-standard tables. However, organizations should prioritize migrating all data into the Data Vault model over time to preserve the long-term benefits of auditability, lineage, and automation that Data Vault offers.

In the end, remember that the strength of Data Vault lies in its flexibility, auditability, and scalability. Introducing non-Data Vault tables as a quick fix is feasible, but for sustainable and reliable insights, a fully Data Vault-compliant model remains the optimal choice.

Pre-Commit: The First Line of Defense in Shift-Left Development

Programmer Working with Virtual Interface for Software Development and Technology

Pre-Commit as a Foundation for Code Quality

In software development, ensuring code quality and minimizing defects early is crucial. The shift-left approach emphasizes detecting and resolving issues as soon as possible, rather than later in development or after deployment. Pre-commit supports this approach by managing multi-language pre-commit hooks, scripts that run automatically before finalizing a commit. These hooks act as a first line of defense, enforcing coding standards, preventing bugs, and enhancing security.

This article examines how the combination of pre-commit and the shift-left approach can enhance development workflows, resulting in more reliable and maintainable software. We will explore the advantages of using pre-commit hooks, their function within the shift-left framework, and offer practical advice for effective implementation. By adopting these practices, teams can improve code quality, accelerate development cycles, and deliver robust software solutions more efficiently.



Understanding Pre-Commit


General Hooks in Versioning Systems

In version control systems like Git, hooks are scripts that are triggered by various events within the repository. These scripts can automate tasks, enforce policies, or improve workflows. Hooks are generally categorized into two types: client-side and server-side. Client-side hooks run on the local machine and can be triggered by operations such as committing or merging changes. Server-side hooks run on the server and can be triggered by events like receiving pushed commits.

local repository with accepted repo

local repository with falty repo


The Role of Pre-Commit Hooks

Among the various types of client-side hooks, pre-commit hooks are particularly significant. These hooks are executed before a commit is finalized. They check for potential issues in the code, such as syntax errors, code style violations, or even security vulnerabilities. By running these checks before the code is committed to the repository, pre-commit hooks ensure that only high-quality code makes it into the version control system.


Pre-Commit Use Cases

Pre-Commit is a versatile tool that can be adapted to various scenarios within the development workflow, offering significant flexibility and control to development teams. One key use case is hosting pre-commit code locally, allowing developers to customize and manage hooks according to their project’s specific requirements. This local hosting ensures that all team members adhere to the same standards without relying on external repositories. Another powerful feature of pre-commit is the ability for users to create custom hooks tailored to their unique needs. These hooks can enforce specific coding standards, run specialized security checks, or automate repetitive tasks, providing a high degree of flexibility. Teams can write these hooks in any language supported by their environment, allowing pre-commit to seamlessly integrate into their workflow.

Furthermore, pre-commit can be integrated into Continuous Integration (CI) pipelines to enhance automation and enforce quality checks across all stages of development. By incorporating pre-commit hooks into the CI process, teams can ensure that code meets quality standards before being merged into the main codebase. This integration helps maintain high standards of code quality and reliability throughout the development lifecycle.


Multi-Language Support in Pre-Commit

One of the key strengths of pre-commit is its support for a wide range of programming languages. This makes it an ideal tool for projects that involve multiple languages or frameworks. For example, in DataOps, it is common to use Terraform for infrastructure provisioning and dbt for data transformation. Pre-commit’s multi-language support allows it to be seamlessly integrated into such diverse development environments, providing consistent quality checks across different languages and tools. This cross-language capability ensures that best practices are enforced and issues are detected regardless of the technologies used in the project.


Example for the Use of Pre-Commit

Pre-Commit operates through the use of a configuration file called pre-commit-config.yaml. This YAML file serves as the central configuration that pre-commit references every time a commit is made. It defines which hooks should be executed and where to fetch them from before finalizing a commit in your Git repository. As an example for such a file with an explanation to the key elements, the following image is provided:

repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.90.0 
    hooks:
      - id: terraform_validate
      - id: terraform_fmt
        args:
          - --args=-recursive
  • repo: Specifies the URL of the repository containing the hooks. In this example, https://github.com/antonbabenko/pre-commit-terraform is the repository from which hooks will be fetched.
  • rev: Indicates the specific revision or version of the repository (v1.90.0 in this case) to use. This ensures consistency in hook behavior across different executions.
  • hooks: Defines a list of hooks to be run from the specified repository. Each hook is identified by its unique id.
  • id: Represents the identifier for each hook. For instance, trailing-whitespace, end-of-file-fixer, and check-yaml are examples of hook IDs that correspond to specific tasks or checks the hooks will perform.
  • args: Specifies additional arguments or options to be passed to the hook command. In the case of terraform_fmt, the args field includes –args=-recursive. This configuration instructs the hook to format Terraform files recursively within the project directory.

This configuration allows you to tailor pre-commit to your project’s needs by specifying which hooks to execute, where to retrieve them from, and how to ensure consistency through defined versions or revisions. Each hook ID corresponds to a particular quality check or task that pre-commit will enforce before allowing a commit to proceed, ensuring higher code quality and adherence to project standards.


Recommended Pre-Commit Repositories for Effective Integration

To kickstart your use of pre-commit effectively, consider leveraging the following repositories tailored for specific programming languages and tools:

  • Terraform Repository: antonbabenko/pre-commit-terraform
    Offers a comprehensive set of hooks for Terraform projects, including validators and formatters to ensure consistent and high-quality code.
  • Dbt (Data Build Tool) Repository: dbt-checkpoint/dbt-checkpoint
    Provides hooks specifically designed for dbt projects, enabling validation and enforcing best practices for SQL-based data transformations.
  • SQL Repository: sqlfluff/sqlfluff
    Offers hooks for SQL linting to ensure syntax correctness, adherence to coding standards, and optimization of query performance.

These repositories offer essential hooks that integrate seamlessly with pre-commit, enabling automated checks to maintain code quality and consistency across Terraform, dbt, and SQL projects. By incorporating these hooks into your workflow, you can streamline development processes and ensure robust software deployments. However these are just suggestions and you are free to use any other hook if it benefits your use case.


Conclusion

Pre-Commit exemplifies the shift-left approach by automating quality checks early in the development cycle, before code is committed. This proactive strategy catches issues like syntax errors, formatting inconsistencies, and security vulnerabilities early, reducing downstream defects.

By using a centralized pre-commit-config.yaml file, Pre-Commit ensures consistent coding standards across teams. It allows developers to focus on coding rather than manual reviews, speeding up software delivery.

Supporting various languages and tools, Pre-Commit can be customized for specific project needs, enhancing code reliability through automated checks. Integrating Pre-Commit improves code quality, efficiency, and promotes continuous improvement, leading to faster time-to-market, lower development costs, and more reliable software.

Modelling Address Data

Watch the Video

Modeling Address Data: Key Insights and Recommendations

Address data is one of the fundamental components in various business databases, especially where detailed customer information is essential. This complexity can make the modeling process challenging, particularly when aiming for a single “Hub” that consolidates all address information. In a recent discussion on Data Vault Friday, Michael Olschimke of Scalefree explored the best ways to model address data effectively, considering key business and regulatory factors. This article will provide a summary of those insights, offering recommendations for creating a robust, scalable, and efficient address data model.

The Context: Address Data in a Single Hub

In this scenario, the challenge presented was how to model various address types within a single Hub. The primary objective was to avoid redundant address data and simplify handling of NULL values. As per ISO20022 standards and European Union regulations, the data model included attributes such as STREET_NAME, BUILDING_NUMBER, BUILDING_NAME, ROOM, FLOOR, POSTAL_CODE, TOWN_NAME, COUNTRY_CODE, ADDRESS_LINE_1, and ADDRESS_LINE_2. Each of these elements is part of a composite business key used to uniquely identify each address.

A practical but complex solution proposed by the team was to replace NULL values with a placeholder (e.g., “-2”) to streamline the loading process and minimize handling issues. However, Olschimke proposed several alternative approaches to ensure the model is both sustainable and scalable.

Challenges with Replacing NULL Values

Replacing NULL values in business keys can simplify loading but has significant drawbacks downstream, particularly when managing dimensional data and maintaining business logic clarity. For instance, when NULL values are replaced with a placeholder like “-2,” this value could appear in downstream reports, causing confusion. Olschimke suggested a more nuanced approach that utilizes “fixed hash values,” such as all zeros or all F’s, to represent empty or erroneous values.

By using fixed hash values, it becomes easier to identify default or error states directly within the data structure. This approach avoids unnecessary complexity when filtering data downstream and improves the clarity and manageability of data processing operations.

Avoid Overloaded Hubs and Null Values in Composite Keys

One of the main points of consideration was the risk of “overloading” Hubs, which occurs when multiple business objects with different semantic meanings are stored within the same Hub. This is particularly common when different types of addresses are stored under a single business key, where each address type may not require all fields (e.g., ROOM or FLOOR).

According to Olschimke, overloaded Hubs introduce complexity due to differing data granularities and missing values across address types. For example, multiple buildings, floors, or rooms could exist under a single address, resulting in multiple granularities within a single Hub. This makes it challenging to maintain clear, meaningful data relationships. Instead, he advised defining clear granularity levels and possibly separating address types or using more flexible data structures.

Alternative Modeling Solutions: JSON-Based and Reference Tables

In cases where multiple address types require flexibility, Olschimke suggested using JSON-based data structures. JSON provides flexibility in defining address attributes dynamically, storing only the keys available for a particular address. This approach reduces the risk of overloading and accommodates varying address structures without creating a complex, rigid schema.

JSON-based Hubs allow for hashing the address data as a single JSON object, ordered by key names to prevent duplicates. However, this approach requires a consistent, standardized order of attributes when hashing to ensure duplicate-free keys. For instance, JSON formatting could streamline the Hub and enable more adaptive data loading while simplifying downstream data extraction.

Additionally, using reference tables is another approach for frequently accessed address data, enabling deduplication without over-complicating the Hub. Reference tables act as dedicated sources of address data, indexed by a unique address ID, which reduces redundancy across other Hubs.

Considering Address Data as Descriptive in Satellites

Instead of adding addresses as business keys in the Hub, it can be more effective to store them as descriptive attributes within a Satellite structure. This avoids overloading the Hub with attributes that may not always be needed for identifying the business key itself. By storing address data in Satellites linked to the primary business entity (e.g., customers, stores), you can achieve a balance between deduplication and schema simplicity.

Olschimke recommended this approach particularly when the main goal is to eliminate redundancies across address data. This approach aligns with a best practice in Data Vault modeling: Satellite tables should contain descriptive data that change over time, while Hubs contain only essential business identifiers.

Applying Business Rules in Data Vault Modeling

Address data often requires additional business rules, especially when handling complex keys or duplicates. Olschimke pointed out that handling NULL values with a placeholder complicates creating downstream dimensions. Instead, a two-step approach was advised: (1) defining the business keys within the Hub with fixed placeholders (e.g., all zeros or all F’s) for default and error handling and (2) standardizing the Satellite structure to handle varying address formats dynamically.

Ultimately, each business has unique requirements, and the choice between single Hubs, JSON structures, and Reference tables will depend on how critical the address data is to the core business operations. By focusing on avoiding overloading and ensuring scalability, businesses can set up a Data Vault model that minimizes long-term maintenance while maximizing data clarity and accessibility.

Conclusion

Modeling address data in a Data Vault context can be intricate, especially when attempting to create a unified Hub that supports various address types. The key considerations discussed by Olschimke emphasize flexibility, simplicity, and adherence to business rules without overloading Hubs. JSON-based keys, reference tables, and Satellite structures offer alternative approaches to managing address data, allowing you to avoid pitfalls associated with NULL placeholders and composite keys.

For businesses tackling complex address data requirements, experimenting with these alternatives may yield significant benefits, particularly in managing data deduplication, compliance, and future scalability.

Interested in learning more? Check out Scalefree’s webinars and consider joining the Data Innovators Exchange community for discussions on data modeling, cloud computing, and Data Vault 2.0 best practices.

CI/CD: Practical Insights into Automating Data Vault 2.0 with dbt

CI/CD Graphic Cycle

CI/CD

CI/CD pipelines are becoming increasingly important for ensuring that software updates can be released cost-effectively while maintaining high quality. But how exactly do CI/CD pipelines work, and how can a project benefit from using one?

This newsletter aims to answer these questions through a practical example of a CI/CD pipeline. The example focuses on a CI/CD pipeline for a GitHub repository that includes a package for implementing Data Vault 2.0 in dbt across various databases. Therefore, this newsletter will also cover the basics of dbt and GitHub Actions.

From Continuous Integration To Data Vaults: A Comprehensive Workflow

This webinar will cover what CI/CD pipelines are and the advantages they offer. We will present parts of the CI/CD pipeline for the public datavault4dbt package to demonstrate how a CI/CD pipeline can be used. The webinar will introduce the key features of GitHub Actions and explain them through examples. This will show how each feature can be utilized in practice and highlight the various possibilities GitHub Actions offers. The webinar aims to explain the benefits of CI/CD pipelines and illustrate what such a pipeline can look like through a practical example.

Watch Webinar Recording

What is CI/CD?

CI stands for Continuous Integration, and CD stands for Continuous Delivery or Continuous Deployment. But what exactly do these terms mean?

Continuous Integration refers to the regular merging of code changes, where automated tests are conducted to detect potential errors early and ensure that the software remains in a functional state.

Continuous Delivery involves making the validated code available in a repository. CI tests should already be conducted in the pipeline for this purpose. It also includes further automation needed to enable rapid deployment, such as creating a production-ready build. The difference between Continuous Delivery and Continuous Deployment is that with Continuous Deployment, the successfully tested software is released directly to production, while Continuous Delivery prepares everything for release without automatically deploying it.

Continuous Deployment allows changes to be implemented quickly through many small releases rather than one large release. However, the tests must be well-configured, as there is no manual gate for transitioning to production.

CI/CD Graphic Cycle

CI/CD pipelines provide immense time savings through automation. The costs of resources needed for manual testing are also lower with CI/CD pipelines, as they can be configured to spin up resources only for testing and then shut them down afterward. Since permanent resources aren’t required, you only pay for the resources needed during the test runtime.

Introduction to dbt

The abbreviation dbt stands for “data build tool.” dbt is a tool that enables data transformation directly within a data warehouse. It uses SQL-based transformations that can be defined, tested, and documented directly in the dbt environment.

This makes dbt an excellent choice for implementing Data Vault 2.0 as dbt can be used to create and manage the hubs, links, and satellites required by Data Vault.

To facilitate this process, we at Scalefree have developed the datavault4dbt package. Datavault4dbt offers many useful features, such as predefined macros for hubs, links, satellites, the staging area, and much more.

For a deeper understanding of dbt or datavault4dbt, feel free to read one of our articles on the topic.

The Capabilities of GitHub Actions

GitHub Actions is a feature of GitHub that allows you to create and execute workflows directly within GitHub repositories. You can define various triggers for workflows, such as pull requests, commits, schedules, manual triggers, and more.

This makes GitHub Actions ideal for building CI/CD pipelines for both private and public repositories. The workflows are divided into multiple jobs, each consisting of several steps. Each job runs on a different virtual machine.

Within these steps, you can define custom tasks or utilize external or internal workflows. This offers the significant advantage of not having to develop everything from scratch in a workflow; instead, you can leverage public workflows created by others.

The seamless integration of Docker also provides numerous possibilities, such as quickly setting up different test environments, which greatly simplifies the creation of a CI/CD pipeline.

GitHub Actions is the key tool in the following example of a CI/CD pipeline.

Practical Example: CI/CD Pipeline for datavault4dbt

For the public repository of the datavault4dbt package, we have built a CI/CD pipeline to ensure that all features continue to function across all supported databases with every pull request (PR). When a PR is submitted by an external user, someone from our developer team must approve the start of the pipeline. In contrast, a PR from an internal user can be automated by adding a specific label to initiate the pipeline.

Once the pipeline is triggered, GitHub Actions automatically starts a separate virtual machine (VM) for each database. Currently, the datavault4dbt package supports AWS Redshift, Microsoft Azure Synapse, Snowflake, Google BigQuery, PostgreSQL, and Exasol, so a total of six VMs will be launched. Since GitHub Actions operates in a serverless manner, these VMs do not need to be manually set up or managed.

The VMs then connect to the required cloud systems. For instance, the VM for Google BigQuery connects to Google Cloud, while the VM for AWS Redshift connects to AWS. Subsequently, the necessary resources for each database are generated, which can be done via API calls or using tools like Terraform.

After the resources are created, additional files required for testing are generated and loaded onto the VM. In our example pipeline, these include files such as profiles.yml,  which contains information needed by dbt to connect to the databases.

Next, a Dockerfile is used on each VM to build an image that automatically installs all dependencies for the respective database. At this stage, Git is also installed on each image so that tests stored in a separate Git repository can be loaded onto the image.

Loading the tests from a repository allows for centralized management of the tests, ensuring any changes are executed for each database during the next pipeline run. Once the images are built, containers are created using these images, where tests are conducted with various parameters. After all tests are completed, the containers are shut down, and by default, the resources on the respective cloud providers are deleted.

CI/CD graphic dbt tests yml file

The test results are fully visible in GitHub Actions, with successful and failed tests clearly marked.

CI/CD graphic workflow form

If the pipeline is started manually, there is an additional option to specify whether only certain selected databases should be tested and whether the resources on the cloud systems should not be deleted after the tests. This allows developers to examine the data on the databases more closely in case of an error.

This pipeline offers numerous advantages for the development of the datavault4dbt package. It allows testing for errors on any of the supported databases with each change, without spending much time creating test resources. At the same time, it saves costs because all resources run only as long as necessary and are immediately shut down after the tests.

Managing the pipeline is also simplified through GitHub, as all variables and secrets can be stored directly in GitHub, providing a centralized location for everything. Once the pipeline is set up, it can be easily extended to include additional databases that may be supported in the future.

Ultimately, this is just one example of what a CI/CD pipeline can look like. Such pipelines are as diverse as the software for which they are designed. If we have piqued your interest and you have further questions about a possible pipeline for your company, please feel free to contact us.

Conclusion 

This newsletter explores the benefits and workings of CI/CD pipelines in agile software development, illustrated through a practical example involving a GitHub repository and a dbt package for implementing Data Vault 2.0, highlighting tools like GitHub Actions for automation and efficiency in deployment processes.

How Can DataOps Support and Improve Your Data Solution?

CI/CD (Continuous Integration / Continuous Deployment)

Watch the Video

DataOps: Revolutionizing Data Solutions

The modern business landscape is awash with data. From customer interactions to market trends, organizations are constantly collecting and analyzing information to gain insights and make informed decisions. However, managing data effectively can be a significant challenge. Traditional approaches, such as on-premise data solutions, often suffer from limitations like scalability, complexity, and high maintenance costs. Additionally, data quality concerns can lead to inaccurate analytics and insights.

To overcome these challenges, a new methodology called DataOps has emerged. DataOps is a transformative approach that revolutionizes how organizations develop, deploy, and operate their data solutions.



Understanding DataOps

At its core, DataOps is about fostering collaboration, embracing agility, and driving continuous improvement throughout the data lifecycle. It combines the principles of DevOps with data management best practices to create a streamlined and efficient data pipeline.

The term DataOps splits into two key components:

  1. Data Development: Focuses on engineering and evolving the data aspects, or modifying data.
  2. Data Operations: Deals with operating, supporting, and governing the data aspects.

The Benefits of DataOps

DataOps offers a multitude of benefits that can significantly improve the way organizations manage their data:

  1. Overcoming Scalability and Flexibility Limitations: DataOps, combined with cloud-based data platforms, enables dynamic resource provisioning on demand. This eliminates the need for regular hardware upgrades and allows organizations to pay only for what they need.
  2. Improved Collaboration: DataOps methodologies, like continuous delivery, promote collaboration between development, operations, and data teams. This leads to shorter sprint cycles and faster delivery of new features.
  3. Enhanced Data Quality and Governance: Automated testing, data validation, and continuous monitoring ensure data accuracy, consistency, and compliance. Data lineage tracking and role-based access controls further strengthen data quality and trustworthiness.

The Key Principles of DataOps

DataOps is built upon a set of key principles that guide its implementation:

  1. Collaboration: Encourages cross-functional collaboration between data engineers, data scientists, and operations teams.
  2. Automation: Automates repetitive tasks to reduce errors and improve efficiency.
  3. Continuous Improvement: Promotes a culture of continuous learning and improvement through regular feedback and iteration.
  4. Data Quality: Emphasizes the importance of data quality throughout the data lifecycle.
  5. Agility: Enables rapid response to changing business needs and market conditions.

Summary

DataOps is a powerful methodology that enables organizations to overcome the challenges of traditional data management approaches. By embracing collaboration, agility, and continuous improvement, organizations can leverage DataOps to unlock the full potential of their data and gain a competitive edge in the modern business environment.

Loading SAP CDC Data into GDPR Compliant Data Vault

Watch the Video

Loading SAP CDC Data into a GDPR-Compliant Data Vault

When managing change data capture (CDC) data from SAP in a Raw Data Vault, special considerations are needed for both CDC information and GDPR-relevant personal data. This post will cover how to model CDC data in a Data Vault, including the unique handling of created, updated, and deleted records. We’ll also discuss best practices for splitting data into separate satellites to manage GDPR-compliant attributes, including empty columns and privacy concerns.

This content is based on a discussion led by Michael Olschimke, CEO of Scalefree, during a Data Vault Q&A session.

CDC Data Modeling in the Data Vault

The primary challenge with CDC data is that it only includes changes from SAP, not the full dataset each time. CDC data typically includes metadata on whether a record was created, updated, or deleted in SAP. Here’s a look at how to approach modeling this data in a Data Vault:

1. Load CDC Data in Satellites with Adjusted Patterns

In Data Vault, the data model remains unchanged, regardless of how the data is delivered (batch, CDC, or real-time). However, the loading pattern for CDC data into satellites needs some adjustments:

  • Delta Check Adjustment: Normally, the Data Vault delta check identifies changes before loading data into a satellite. With CDC data, the changes are already captured, so this step can be bypassed. Instead, all changes from CDC data are loaded into the appropriate satellites directly.
  • Change Impact Across Satellites: When there’s a change in one attribute, it triggers an update in all relevant satellites. While this approach can create non-delta records, the impact is typically minimal, and the redundant data can be compressed for storage efficiency.

Handling GDPR-Relevant Personal Data

CDC data often includes both regular attributes and GDPR-sensitive personal data. In the Data Vault, personal data attributes should be separated based on privacy and security classes to ensure compliance and manage access. Here’s the recommended approach:

2. Splitting Satellites Based on Privacy Classifications

For GDPR compliance, split CDC data into multiple satellites:

  • Personal Data Satellite: A dedicated satellite for GDPR-relevant attributes (such as names or IDs). This separation allows for tighter security and privacy control.
  • Non-Personal Data Satellite: General attributes with no privacy concerns go into a separate satellite to reduce the risk of exposure.
  • Additional Splits: Further splits may be required based on rate of change, security levels, or business context, depending on the specific needs of your organization.

Maintaining separate satellites for different classes of data ensures that personal information is handled with stricter privacy controls, helping your data architecture comply with GDPR requirements.

Managing Empty Columns in the Data Vault

It’s common for source tables to contain columns that are always empty. When working with CDC data in a Data Vault:

  • Include Empty Columns for Auditing: To retain full traceability and audibility, include empty columns in the satellite. This preserves the exact structure of the source data without altering it.
  • Consider Separate “Unused Data” Satellite: If there are many empty columns, these can be grouped into a dedicated satellite, making the primary satellites leaner for users.

This approach allows for flexibility if the data in these columns becomes relevant in the future. Auditors will appreciate the comprehensive structure, and the Data Vault will retain all source data in its original form.

Example Satellite Structure

With GDPR compliance and CDC loading adjustments in mind, here’s an example structure for splitting SAP CDC data into satellites:

Satellite: CDC_Personal_Data
- Attributes: GDPR-relevant data (e.g., personal names, social security numbers)
- Metadata: Load date, source, change type (create, update, delete)
- Purpose: Privacy-controlled access

Satellite: CDC_NonPersonal_Data
- Attributes: Non-personal data columns
- Metadata: Load date, source, change type
- Purpose: General access

Satellite: CDC_Unused_Columns
- Attributes: Columns always empty in the source table
- Metadata: Load date, source
- Purpose: Compliance and future-proofing

Best Practices for Satellite Splitting in the Data Vault

When splitting data into satellites, follow these best practices:

  • Split by Privacy and Security: Ensure that personal and non-personal data are stored separately, particularly when handling GDPR-relevant information.
  • Split by Source System: Keep different source systems in separate satellites for clarity and maintainability.
  • Consider Business Needs: If certain data attributes are only relevant to specific business cases, split them accordingly to reduce satellite complexity.

These principles provide a clean, secure, and compliant Data Vault structure that enables efficient data retrieval, flexibility, and regulatory adherence.

Conclusion

Modeling SAP CDC data in a GDPR-compliant Data Vault involves adjustments to loading patterns, especially when dealing with CDC deltas and GDPR-sensitive data. By separating data based on privacy classes and including empty columns where necessary, you can ensure compliance and maintain a flexible data model. The approach outlined here simplifies the handling of CDC data, while providing robust auditing and privacy control.

Modelling Demographic Data

Watch the Video

Modeling Demographic Data for the Raw Data Vault: A Practical Guide

When working with demographic data for population analysis, one often encounters datasets that include variables such as region, year, age, and sex. For organizations implementing a Data Vault methodology, the task then becomes modeling this data in the Raw Data Vault in a way that enables efficient use in the Business Vault for proportion calculations and other business logic.

In this blog post, we’ll explore how to approach modeling demographic data for the Raw Data Vault. We’ll look at considerations for reference tables, hubs, links, and satellites, and discuss why a simplified approach can be beneficial while still capturing change history when needed. Let’s walk through an example based on a question that was raised during a Data Vault Q&A session.

Defining the Dataset and Objective

Let’s start by outlining the dataset and the main objective. Suppose we have a demographic dataset sourced externally (e.g., from a national bureau) that includes the following columns:

  • Region
  • Year
  • Age
  • Sex
  • Population count per region, year, age, and sex

The goal is to load this dataset into the Raw Data Vault and make it available for calculations in the Business Vault, such as determining population proportions across different dimensions.

The Simplified Modeling Approach

To understand the modeling approach, let’s consider the basic elements of Data Vault architecture:

  • Hubs: Tables that hold unique business keys and serve as identifiers.
  • Links: Tables that define relationships between hubs.
  • Satellites: Tables that hold descriptive data and track changes over time.

Since this dataset contains demographic attributes with no true business keys, modeling choices hinge on balancing simplification and change capture. Here’s a step-by-step breakdown of the approach:

1. Flatten the Data Structure First

The simplest model for this dataset would be to create a flat, wide table that includes columns for region, year, age, sex, and population value. This structure would load the data directly without further separation into hubs, links, or satellites.

  • Pros: Easy to query and manage, especially if there’s no need to track changes over time.
  • Cons: Lacks support for tracking updates or changes in population values.

This approach works well if the data is static and updated infrequently, but it limits flexibility for versioning or incremental updates.

2. Introducing Reference Hubs and Satellites for Change Tracking

To address the limitations of a flat table, we can take advantage of reference hubs and satellites. Reference hubs allow us to treat region, year, age, and sex as reference codes. In a reference hub, each unique combination of these codes is treated as a single, composite key.

The structure then includes:

  • A Reference Hub with region, year, age, and sex as identifying attributes.
  • A Reference Satellite with population counts as the descriptive attribute, which links back to the Reference Hub.

By introducing these reference structures, we ensure that changes in population values can be captured over time. If a new population record comes in with a different population value for a given region, year, age, and sex, it is added to the Reference Satellite as a new row. This provides a change history without needing to modify the original row.

Modeling Strategy and Design

With a focus on simplification, here’s how the structure would look in the Raw Data Vault:

  • Reference Hub: Combines the codes for region, year, age, and sex. This setup eliminates the need for separate hubs for each attribute and keeps the model straightforward. The composite key formed by region, year, age, and sex uniquely identifies each record, while the hub serves as a central reference.
  • Reference Satellite: Attached to the Reference Hub, the Reference Satellite holds the population value and includes a load date for tracking when data was loaded. This enables us to capture historical changes efficiently. If an update occurs for the population value, a new row is added to the Satellite with an updated load date, creating a versioned history.

Example Structure:

Reference Hub: Demographic_Hub
- Primary Key: Region, Year, Age, Sex
- Additional Columns: Load Date (ldts), Source (src)

Reference Satellite: Population_Satellite
- Foreign Key: Region, Year, Age, Sex (composite key from Hub)
- Population Value
- Load Date (ldts)
- Source (src)

In this setup, all demographic attributes are contained within a single hub and linked to the population value in the satellite. This design maintains a simple and effective data model that supports historical data changes without additional complexity.

Why Avoiding Multiple Hubs and Links Makes Sense

In theory, one could create separate hubs for each demographic attribute—region, year, age, and sex—and link them together. However, this approach introduces unnecessary complexity without adding value in this context. Here’s why:

  • Multiple Joins: Multiple hubs and a link table require additional joins, increasing complexity when querying the data.
  • Increased Entity Count: Additional hubs and links inflate the number of entities, making the data model harder to understand and maintain.
  • Performance Concerns: Each join adds processing cost, which can slow down queries, especially with larger datasets.

By consolidating all demographic attributes into a single hub, we reduce the number of entities and simplify the model, making it easier to use and maintain while still meeting the business needs.

Advantages of a Single Reference Hub with a Satellite

This approach is particularly advantageous because it balances simplicity with flexibility. Here’s how:

  • Change Capture: With the reference satellite, we can track historical population data changes over time. Each new row represents an update, identified by load date, making it easy to see when population data was updated.
  • Granularity and Scalability: The single hub provides a consistent granularity for the data, ensuring that queries are straightforward and changes are easy to manage.
  • Performance Efficiency: Joining based on region, year, age, and sex is computationally efficient. These attributes are typically small (e.g., integer or small character values), reducing the burden on processing.

Using the Model in Business Logic

Once the demographic data is loaded into the Raw Data Vault, it’s ready for use in the Business Vault. Here’s how it could be used for business logic:

  1. Joining Data: In the Business Vault, data analysts can join other datasets with the demographic reference satellite on the region, year, age, and sex attributes to incorporate population data.
  2. Proportion Calculations: With population counts available, proportion calculations (e.g., the proportion of a certain demographic group in a region) become straightforward.
  3. Temporal Analysis: The load date in the satellite provides historical tracking, allowing analysts to view demographic changes over time and analyze trends.

Potential Challenges and Solutions

One common concern is that joining on multiple columns (region, year, age, sex) could impact performance. However, with optimized indexing and the small size of these columns, this concern is minimized. For larger datasets, partitioning on region or year might further optimize query performance.

Conclusion

In summary, a simplified model that leverages a single reference hub with a satellite provides a highly effective way to model demographic data in a Raw Data Vault. This approach allows for straightforward use in the Business Vault and ensures flexibility for tracking historical changes, all while keeping the model manageable and efficient.

This example illustrates how, by focusing on simplification and change capture, organizations can build an effective demographic data model in the Raw Data Vault that meets both current and future needs for population-based analysis.

Close Menu