Sample Page – Scalefree

Building a scalable Data Platform? In Data Vault Friday

Multi-Active Satellites & Dependent Child Keys

Watch the Video

Understanding Multi-Active Satellites and Dependent Child Keys in Data Vault

Multi-Active satellites and dependent child keys provide solutions for storing data with complex granularities in Data Vault models. Data Vault is known for its highly structured enterprise data warehousing approach, built on Hubs, Links, and Satellites to capture data lineage, maintain historical accuracy, and ensure scalability. However, specific data scenarios, such as handling different data granularities, often lead to questions on multi-active satellites and dependent child keys. This article breaks down these concepts and clarifies their differences and use cases in a Data Vault environment.

In this article:

What is a Multi-Active Satellite?
- Example of a Multi-Active Satellite
What is a Dependent Child Key?
- Example of a Dependent Child Key
Key Differences Between Multi-Active Satellites and Dependent Child Keys
When to Use Each Approach: Multi-Active Satellites & Dependent Child Keys
Summary
Meet the Speaker

What is a Multi-Active Satellite?

A multi-active satellite is designed to manage multiple records for a single business object that are active simultaneously. This scenario arises when a business object, like a customer, can have several active data entries at the same time. For example, a customer could have multiple addresses (home and work), both of which are valid at the same time.

In a typical satellite structure, a business key (e.g., customer ID) combined with a load date timestamp defines the primary key. However, in cases of multiple active records, this primary key is insufficient because it won’t uniquely identify each active instance. Instead, an additional attribute, such as an address type (home or work), is added to the primary key to differentiate each record. This approach allows the satellite to track multiple entries for the same business key without duplicating data and helps capture finer details in the data warehouse.

Example of a Multi-Active Satellite

Let’s say our source system has a customer with ID C123 who has two active addresses: one for home and one for work. In a standard satellite, we would have one record per business key and load date. In a multi-active satellite, we store both addresses simultaneously by using an additional identifier (e.g., “address type”) in the primary key:

Customer ID: C123
Load Date: Timestamp of data load
Additional Identifier: Address type (e.g., home, work)

This approach allows multiple entries for a single business object (in this case, customer C123) while maintaining unique records in the satellite table.
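
The extended primary key described above can be illustrated with a minimal Python sketch. It uses an in-memory dictionary as a stand-in for the satellite table; the function name insert_row, the load timestamp, and the street values are hypothetical and only serve the illustration.

```python
from datetime import datetime

# Hypothetical in-memory stand-in for a multi-active satellite.
# Primary key: (customer_id, load_date, address_type) -> descriptive attributes.
satellite = {}

def insert_row(customer_id, load_date, address_type, attributes):
    """Insert a row and reject it if the full primary key already exists."""
    key = (customer_id, load_date, address_type)
    if key in satellite:
        raise ValueError(f"duplicate primary key: {key}")
    satellite[key] = attributes

# Illustrative load timestamp and attribute values (assumptions, not source data).
load_ts = datetime(2024, 1, 1, 8, 0, 0)
insert_row("C123", load_ts, "home", {"street": "Example Street 1"})
insert_row("C123", load_ts, "work", {"street": "Example Street 2"})

# Both addresses are active for the same customer and the same load date.
active = sorted(t for (cid, ts, t) in satellite if cid == "C123" and ts == load_ts)
```

Without the address type in the key, the second insert would collide with the first, because customer ID and load date alone are identical for both rows.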

What is a Dependent Child Key?

A dependent child key is used to manage relationships between multiple business objects at a finer granularity level than a standard Data Vault link would allow. Dependent child keys are typically applied in links where we need to track multiple occurrences of a relationship between business objects, such as an order and its line items.

Consider an order containing multiple line items, where each item references a product. Here, the dependent child key (like line item number) uniquely identifies each relationship instance, as it provides additional detail beyond just the order and product identifiers. This allows multiple rows in the link for the same business objects while maintaining unique records.

Example of a Dependent Child Key

Imagine we have an order O123 for a customer C123, which includes two line items for the same product but with different prices or quantities:

Order ID: O123
Customer ID: C123
Product ID: P123
Dependent Child Key: Line item number (e.g., 1, 2)

In this case, we create unique rows for each line item, where the line item number differentiates each record. This approach ensures that each entry is stored and tracked individually.
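
As a rough sketch of why the line item number is needed, the following Python snippet computes link keys with and without it. It assumes the dependent child key is part of the hash input; some implementations instead keep it as a separate key column next to the link hash key. The helper name link_hash_key, the delimiter, and the use of MD5 are assumptions made for this example.

```python
import hashlib

def link_hash_key(*parts, delimiter="||"):
    """Hash the delimited, upper-cased key parts into one hex digest."""
    normalized = delimiter.join(str(p).strip().upper() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Without the line item number, both line items collapse into one link row.
without_child = {link_hash_key("O123", "C123", "P123") for _ in (1, 2)}

# With the line item number in the key, each line item gets its own row.
with_child = {link_hash_key("O123", "C123", "P123", n) for n in (1, 2)}
```

The first set contains a single key for both line items, while the second contains one key per line item.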

Key Differences Between Multi-Active Satellites and Dependent Child Keys

Although multi-active satellites and dependent child keys both enable handling of finer data granularity, they serve different purposes and are used in distinct contexts:

Multi-Active Satellites
Applied within a single business object to handle multiple active records at the same time. The additional identifier helps capture simultaneous entries for the same object in a satellite.
Dependent Child Keys
Used in links between multiple business objects, where the additional key captures the finer detail of each relationship instance, such as line items in an order.

When to Use Each Approach: Multi-Active Satellites & Dependent Child Keys

The choice between using a multi-active satellite or a dependent child key depends on the data granularity and relationships in your data model:

Use Multi-Active Satellites when handling multiple active records for a single business object, where each entry is related only to the primary business key (e.g., customer with multiple addresses).
Use Dependent Child Keys when tracking detailed relationships between different business objects that require additional identifiers to maintain uniqueness (e.g., order and line items).

Summary

Multi-active satellites and dependent child keys are helpful for storing data with complex granularities in Data Vault models. While multi-active satellites allow multiple simultaneous records for a single business object, dependent child keys enable unique identification of complex relationships in links. Both approaches maintain Data Vault’s principles of scalability and data integrity by preserving unique records and enabling detailed tracking of business data.

In short:

Multi-Active Satellite: For multiple records active simultaneously within a single business object.
Dependent Child Key: For relationships across multiple business objects that need finer detail, typically in links.

Meet the Speaker

Marc Winkelmann

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016, he has been active in consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020, he became a Data Vault 2.0 Instructor for Scalefree.

Data Vault Mixed Model

Watch the Video

Understanding Data Vault Mixed Models: Integrating Non-Data Vault Entities in the Business Vault

Data Vault architecture is a widely used methodology in data warehousing, providing a highly adaptable model for managing complex data environments. It organizes data primarily in three core components: Hubs, Links, and Satellites. These elements support business keys, relationships, and descriptive data to create a comprehensive data structure within the Raw Data Vault layer. However, as with many methodologies, real-world data often introduces elements outside the strict boundaries of this structure, sparking questions around flexibility.

This article explores the concept of a “mixed model” in Data Vault, where non-Data Vault entities coexist with Raw Data Vault components and how they might be integrated within a Business Vault structure. While the purest Data Vault models focus on auditability and lineage, allowing for seamless automation, mixed models can sometimes be practical if managed thoughtfully. So, is it permissible to mix non-Data Vault entities with the Raw Data Vault within a Business Vault? Let’s dive into this topic.

In this article:

What is a Data Vault Mixed Model?
Can You Integrate Non-Data Vault Entities with Raw Data in the Business Vault?
Strategies for Long-Term Success with a Mixed Model
Practical Example of a Mixed Model in Action
Conclusion: Balancing Flexibility with Data Vault Integrity
Meet the Speaker

What is a Data Vault Mixed Model?

A “mixed model” in Data Vault refers to a scenario where traditional Data Vault structures (Hubs, Links, and Satellites) are used alongside other non-Data Vault tables or entities. In essence, while most data resides in the structured Raw Data Vault, there are other data components within the same database that do not conform to Data Vault architecture. This raises questions about integrating these disparate data types in the Business Vault.

The Business Vault is designed to serve as a refined, operational version of the Raw Data Vault. It enhances the raw data with business logic and transformations to create actionable insights. In scenarios where a mixed model is necessary, the goal is often to leverage existing non-Data Vault tables to derive business insights while minimizing disruption to the original data model.

Can You Integrate Non-Data Vault Entities with Raw Data in the Business Vault?

According to Data Vault principles, the ideal approach is to structure all data as Hubs, Links, and Satellites to ensure consistency, auditability, and lineage. However, a mixed model approach can sometimes be necessary. For instance, you may have a database that combines data stored in the Raw Data Vault with tables or entities that don’t follow Data Vault structures. So, is it allowed?

The short answer is yes, you can technically integrate non-Data Vault entities within the Business Vault, but it comes with caveats. Here’s a deeper look at the implications:

Temporary Solutions Only
Mixing non-Data Vault data with Raw Data Vault entities is generally seen as a temporary solution. It may help in quickly bridging data that doesn’t yet fit into the Data Vault model, allowing for rapid integration. However, over time, this approach can lead to complexity in querying and reduce the consistency that Data Vault offers.
Impact on Automation and Maintainability
Introducing non-standard tables complicates automation within the Business Vault. Data Vault implementations commonly rely on automation tools such as dbt, WhereScape, and VaultSpeed. These tools streamline the workflow by automating lineage, auditing, and data transformations. Non-Data Vault entities hinder this automation because they require custom scripts or queries that deviate from standard Data Vault patterns.
Jeopardizing Auditability and Lineage
One of Data Vault’s strongest value propositions is its focus on data lineage and auditability. In a mixed model, these aspects may be compromised. Without adhering to the structure of Hubs, Links, and Satellites, it becomes challenging to track data history, maintain version control, and capture all changes comprehensively. For organizations that rely on these features for regulatory or quality purposes, compromising lineage may be a serious drawback.
User Mart as an Alternative
An alternative approach is to build a “User Mart” for ad hoc or analytical queries that combine data from the Raw Data Vault with non-Data Vault tables. This User Mart allows users to query both Raw Data Vault and external entities without disrupting the core Business Vault structure. This approach is particularly useful when users have specific reporting or analytical requirements that may not require full Data Vault transformation.
Pragmatic Approach: Virtual Hubs and Links
A practical solution in Data Vault projects is to create “virtual” Hubs, Links, and Satellites for non-standard tables, which serve as placeholders within the Raw Data Vault structure. This approach allows for quick integration while maintaining some level of standardization. For example, if there’s a reference table with country names and codes, you might create a virtual Hub for the country and map descriptive details as a virtual Satellite. This doesn’t achieve full lineage but can serve as a bridge until a proper Data Vault structure can be implemented.
Reference Tables and Non-Critical Data
In scenarios where data like reference tables (e.g., country codes, zip codes) doesn’t require full lineage or version tracking, a flat and wide reference table can be used. If a reference Hub and Satellite are unnecessary, keeping the data simple with a primary key and descriptive columns is often sufficient. This approach can work well for non-essential data, where maintaining Data Vault-style rigor may not be worth the effort.
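
The virtual Hub and Satellite idea from the list above can be sketched in Python. The sketch treats a flat country reference table as a list of rows and derives Hub-shaped and Satellite-shaped rows on read, similar to what a database view would do. The column names (hk_country, country_code, country_name), the MD5 hashing, and the function names are assumptions for illustration, not a prescribed layout.

```python
import hashlib

# Hypothetical flat reference table (not a Data Vault entity).
country_reference = [
    {"country_code": "DE", "country_name": "Germany"},
    {"country_code": "FR", "country_name": "France"},
]

def hash_key(business_key):
    """Derive a hash key from a normalized business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def virtual_hub_country(rows):
    """Yield Hub-shaped rows computed on read, as a view would."""
    for row in rows:
        yield {"hk_country": hash_key(row["country_code"]),
               "country_code": row["country_code"]}

def virtual_sat_country(rows):
    """Yield Satellite-shaped rows carrying the descriptive attributes."""
    for row in rows:
        yield {"hk_country": hash_key(row["country_code"]),
               "country_name": row["country_name"]}

hub_rows = list(virtual_hub_country(country_reference))
sat_rows = list(virtual_sat_country(country_reference))
```

Because nothing is materialized, the rows carry no load date or history of their own, which reflects the caveat above that this approach does not achieve full lineage.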

Strategies for Long-Term Success with a Mixed Model

If you decide to proceed with a mixed model, it’s crucial to plan for a future transition toward a fully Data Vault-compliant design. Here are some tips:

Prioritize Refactoring Non-Data Vault Entities
Establish a clear roadmap for converting non-Data Vault tables into Hubs, Links, and Satellites over time. This phased approach enables you to work within existing constraints while planning for a more robust and compliant Business Vault.
Minimize Technical Debt
Track instances of non-Data Vault elements within your data ecosystem and treat them as “technical debt” to be managed and resolved in the long term. This keeps you aware of areas where auditability or automation might be compromised.
Use Metadata-Driven Automation
Employ metadata-driven automation tools as much as possible to simplify future integrations and transitions. These tools enable automated data processing across the Data Vault pipeline, making it easier to add and transform new data sources into compliant Data Vault structures.
Implement Strict Governance for User-Generated Data
In cases where users introduce their own data models within the User Mart or Business Vault, set governance policies to standardize data usage and maintain some level of alignment with Data Vault patterns. These policies can mitigate risks related to data quality and ensure that non-Data Vault data remains manageable.

Practical Example of a Mixed Model in Action

Consider a financial services organization that maintains a Raw Data Vault with transaction data but also has a separate schema for customer reference tables, such as customer demographics and location details. Rather than directly integrating these tables into the Business Vault, the organization could create virtual Hubs and Links that link customer IDs and locations to transactions. This allows them to continue working within the Raw Data Vault framework while planning to reformat reference tables in alignment with Data Vault standards.

Another example might involve a large retail company where user-generated data models in the User Mart are frequently used to support marketing analysis. Here, the organization could implement a temporary mixed model that accommodates fast-paced analysis while planning for a phased migration to Data Vault structures over time.

Conclusion: Balancing Flexibility with Data Vault Integrity

While a mixed model is not ideal within Data Vault architecture, it can serve as a temporary, pragmatic solution when there’s an immediate need to integrate non-Data Vault entities. Virtual Hubs and Links, User Marts, and strict governance policies can help manage the complexity introduced by non-standard tables. However, organizations should prioritize migrating all data into the Data Vault model over time to preserve the long-term benefits of auditability, lineage, and automation that Data Vault offers.

In the end, remember that the strength of Data Vault lies in its flexibility, auditability, and scalability. Introducing non-Data Vault tables as a quick fix is feasible, but for sustainable and reliable insights, a fully Data Vault-compliant model remains the optimal choice.

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Building a scalable Data Platform? In DevOps Knowledge

Pre-Commit: The First Line of Defense in Shift-Left Development

Pre-Commit as a Foundation for Code Quality

In software development, ensuring code quality and minimizing defects early is crucial. The shift-left approach emphasizes detecting and resolving issues as soon as possible, rather than later in development or after deployment. Pre-commit supports this approach by managing multi-language pre-commit hooks, scripts that run automatically before finalizing a commit. These hooks act as a first line of defense, enforcing coding standards, preventing bugs, and enhancing security.

This article examines how the combination of pre-commit and the shift-left approach can enhance development workflows, resulting in more reliable and maintainable software. We will explore the advantages of using pre-commit hooks, their function within the shift-left framework, and offer practical advice for effective implementation. By adopting these practices, teams can improve code quality, accelerate development cycles, and deliver robust software solutions more efficiently.

In this article:

Understanding Pre-Commit
Example for the Use of Pre-Commit
Recommended Pre-Commit Repositories for Effective Integration
Conclusion
About the Author

Understanding Pre-Commit

General Hooks in Versioning Systems

In version control systems like Git, hooks are scripts that are triggered by various events within the repository. These scripts can automate tasks, enforce policies, or improve workflows. Hooks are generally categorized into two types: client-side and server-side. Client-side hooks run on the local machine and can be triggered by operations such as committing or merging changes. Server-side hooks run on the server and can be triggered by events like receiving pushed commits.

The Role of Pre-Commit Hooks

Among the various types of client-side hooks, pre-commit hooks are particularly significant. These hooks are executed before a commit is finalized. They check for potential issues in the code, such as syntax errors, code style violations, or even security vulnerabilities. By running these checks before the code is committed to the repository, pre-commit hooks help ensure that only code that passes them makes it into the version control system.

Pre-Commit Use Cases

Pre-Commit is a versatile tool that can be adapted to various scenarios within the development workflow, offering significant flexibility and control to development teams. One key use case is hosting pre-commit code locally, allowing developers to customize and manage hooks according to their project’s specific requirements. This local hosting ensures that all team members adhere to the same standards without relying on external repositories. Another powerful feature of pre-commit is the ability for users to create custom hooks tailored to their unique needs. These hooks can enforce specific coding standards, run specialized security checks, or automate repetitive tasks, providing a high degree of flexibility. Teams can write these hooks in any language supported by their environment, allowing pre-commit to seamlessly integrate into their workflow.

Furthermore, pre-commit can be integrated into Continuous Integration (CI) pipelines to enhance automation and enforce quality checks across all stages of development. By incorporating pre-commit hooks into the CI process, teams can ensure that code meets quality standards before being merged into the main codebase. This integration helps maintain high standards of code quality and reliability throughout the development lifecycle.
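
To make the idea of a custom hook more concrete, here is a minimal sketch of a check written in Python. It relies on the fact that pre-commit passes the staged file names to a hook as command-line arguments and treats a non-zero exit code as a failure. The marker text and the function names find_violations and main are hypothetical choices for this example.

```python
FORBIDDEN = "DO NOT COMMIT"  # hypothetical marker chosen for this sketch

def find_violations(paths):
    """Return (path, line_number) pairs for lines containing the marker."""
    violations = []
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as handle:
            for number, line in enumerate(handle, start=1):
                if FORBIDDEN in line:
                    violations.append((path, number))
    return violations

def main(argv):
    """Report violations and return an exit code (0 = pass, 1 = fail)."""
    violations = find_violations(argv)
    for path, number in violations:
        print(f"{path}:{number}: contains '{FORBIDDEN}'")
    return 1 if violations else 0

# As a hook entry point, the script would end with:
#   import sys
#   raise SystemExit(main(sys.argv[1:]))
```

The same script can be run locally as a hook and in a CI job, so both places enforce the identical rule.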

Multi-Language Support in Pre-Commit

One of the key strengths of pre-commit is its support for a wide range of programming languages. This makes it an ideal tool for projects that involve multiple languages or frameworks. For example, in DataOps, it is common to use Terraform for infrastructure provisioning and dbt for data transformation. Pre-commit’s multi-language support allows it to be seamlessly integrated into such diverse development environments, providing consistent quality checks across different languages and tools. This cross-language capability ensures that best practices are enforced and issues are detected regardless of the technologies used in the project.

Example for the Use of Pre-Commit

Pre-Commit is configured through a file called .pre-commit-config.yaml in the root of the repository. This YAML file serves as the central configuration that pre-commit references every time a commit is made. It defines which hooks should be executed, and where to fetch them from, before a commit is finalized in your Git repository. The following example shows such a file, followed by an explanation of its key elements:

repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.90.0 
    hooks:
      - id: terraform_validate
      - id: terraform_fmt
        args:
          - --args=-recursive

repo: Specifies the URL of the repository containing the hooks. In this example, https://github.com/antonbabenko/pre-commit-terraform is the repository from which hooks will be fetched.
rev: Indicates the specific revision or version of the repository (v1.90.0 in this case) to use. This ensures consistency in hook behavior across different executions.
hooks: Defines a list of hooks to be run from the specified repository. Each hook is identified by its unique id.
id: Represents the identifier for each hook. In this example, terraform_validate and terraform_fmt are the hook IDs; each one corresponds to a specific task or check that the hook performs. Other repositories define their own IDs, such as trailing-whitespace, end-of-file-fixer, and check-yaml.
args: Specifies additional arguments or options to be passed to the hook command. In the case of terraform_fmt, the args field includes --args=-recursive. This configuration instructs the hook to format Terraform files recursively within the project directory.

This configuration allows you to tailor pre-commit to your project’s needs by specifying which hooks to execute, where to retrieve them from, and how to ensure consistency through defined versions or revisions. Each hook ID corresponds to a particular quality check or task that pre-commit will enforce before allowing a commit to proceed, ensuring higher code quality and adherence to project standards.

Recommended Pre-Commit Repositories for Effective Integration

To kickstart your use of pre-commit effectively, consider leveraging the following repositories tailored for specific programming languages and tools:

Terraform Repository: antonbabenko/pre-commit-terraform
Offers a comprehensive set of hooks for Terraform projects, including validators and formatters to ensure consistent and high-quality code.
dbt (data build tool) Repository: dbt-checkpoint/dbt-checkpoint
Provides hooks specifically designed for dbt projects, enabling validation and enforcing best practices for SQL-based data transformations.
SQL Repository: sqlfluff/sqlfluff
Offers hooks for SQL linting and formatting to catch syntax errors and enforce coding standards.

These repositories offer essential hooks that integrate seamlessly with pre-commit, enabling automated checks to maintain code quality and consistency across Terraform, dbt, and SQL projects. By incorporating these hooks into your workflow, you can streamline development processes and ensure robust software deployments. However, these are only suggestions, and you are free to use any other hooks that benefit your use case.

Conclusion

Pre-Commit exemplifies the shift-left approach by automating quality checks early in the development cycle, before code is committed. This proactive strategy catches issues like syntax errors, formatting inconsistencies, and security vulnerabilities early, reducing downstream defects.

By using a centralized .pre-commit-config.yaml file, Pre-Commit ensures consistent coding standards across teams. Because routine checks run automatically, developers and reviewers can focus on the code itself rather than on manual style and formatting checks, which speeds up software delivery.

Supporting various languages and tools, Pre-Commit can be customized for specific project needs, enhancing code reliability through automated checks. Integrating Pre-Commit improves code quality and efficiency and promotes continuous improvement, leading to faster time-to-market, lower development costs, and more reliable software.

About the Author

Moritz Gunkel

Moritz is an aspiring Consultant in the DevOps department for Scalefree, specializing in cloud engineering, automation, and Infrastructure as Code, with a particular knack for Terraform. While juggling his responsibilities, including pursuing a Bachelor’s degree in Computer Science as a working student, Moritz’s methodical approach has significantly impacted internal operations, earning him recognition and setting the stage for an exciting transition into a full-time consulting role. With a passion for innovation and a commitment to excellence, Moritz is set to continue making a lasting impression in the dynamic world of DevOps.

Modelling Address Data

Watch the Video

Modeling Address Data: Key Insights and Recommendations

Address data is a fundamental component of many business databases, especially where detailed customer information is essential. Its complexity can make the modeling process challenging, particularly when aiming for a single “Hub” that consolidates all address information. In a recent discussion on Data Vault Friday, Michael Olschimke of Scalefree explored the best ways to model address data effectively, considering key business and regulatory factors. This article summarizes those insights and offers recommendations for creating a robust, scalable, and efficient address data model.

In this article:

The Context: Address Data in a Single Hub
Challenges with Replacing NULL Values
Avoid Overloaded Hubs and Null Values in Composite Keys
Alternative Modeling Solutions: JSON-Based and Reference Tables
Considering Address Data as Descriptive in Satellites
Applying Business Rules in Data Vault Modeling
Conclusion
Meet the Speaker

The Context: Address Data in a Single Hub

In this scenario, the challenge presented was how to model various address types within a single Hub. The primary objective was to avoid redundant address data and simplify the handling of NULL values. In line with the ISO 20022 standard and European Union regulations, the data model included attributes such as STREET_NAME, BUILDING_NUMBER, BUILDING_NAME, ROOM, FLOOR, POSTAL_CODE, TOWN_NAME, COUNTRY_CODE, ADDRESS_LINE_1, and ADDRESS_LINE_2. Each of these elements is part of a composite business key used to uniquely identify each address.

A practical but complex solution proposed by the team was to replace NULL values with a placeholder (e.g., “-2”) to streamline the loading process and minimize handling issues. However, Olschimke proposed several alternative approaches to ensure the model is both sustainable and scalable.

Challenges with Replacing NULL Values

Replacing NULL values in business keys can simplify loading but has significant drawbacks downstream, particularly when managing dimensional data and maintaining business logic clarity. For instance, when NULL values are replaced with a placeholder like “-2,” this value could appear in downstream reports, causing confusion. Olschimke suggested a more nuanced approach that utilizes “fixed hash values,” such as all zeros or all F’s, to represent empty or erroneous values.

By using fixed hash values, it becomes easier to identify default or error states directly within the data structure. This approach avoids unnecessary complexity when filtering data downstream and improves the clarity and manageability of data processing operations.
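
A minimal Python sketch of this idea follows. It assumes 32-character MD5 hex keys, an all-zeros key for missing business keys, and an all-F key for invalid ones; the constant names, the function name address_hash_key, and the is_valid flag are hypothetical.

```python
import hashlib

ZERO_KEY = "0" * 32   # assumed convention: business key is missing (NULL or empty)
ERROR_KEY = "f" * 32  # assumed convention: business key is present but invalid

def address_hash_key(business_key, is_valid=True):
    """Return a fixed key for missing or invalid input, else a regular hash."""
    if business_key is None or business_key.strip() == "":
        return ZERO_KEY
    if not is_valid:
        return ERROR_KEY
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```

Downstream queries can then filter default and error rows with a simple comparison against the two fixed keys instead of searching for a placeholder such as “-2” in the business key columns.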

Avoid Overloaded Hubs and Null Values in Composite Keys

One of the main points of consideration was the risk of “overloading” Hubs, which occurs when multiple business objects with different semantic meanings are stored within the same Hub. This is particularly common when different types of addresses are stored under a single business key, where each address type may not require all fields (e.g., ROOM or FLOOR).

According to Olschimke, overloaded Hubs introduce complexity due to differing data granularities and missing values across address types. For example, multiple buildings, floors, or rooms could exist under a single address, resulting in multiple granularities within a single Hub. This makes it challenging to maintain clear, meaningful data relationships. Instead, he advised defining clear granularity levels and possibly separating address types or using more flexible data structures.

Alternative Modeling Solutions: JSON-Based and Reference Tables

In cases where multiple address types require flexibility, Olschimke suggested using JSON-based data structures. JSON provides flexibility in defining address attributes dynamically, storing only the keys available for a particular address. This approach reduces the risk of overloading and accommodates varying address structures without creating a complex, rigid schema.

JSON-based Hubs allow for hashing the address data as a single JSON object. To ensure duplicate-free keys, the attributes must be serialized in a consistent, standardized order (for example, sorted by key name) before hashing; otherwise, the same address could produce different hash keys. Used this way, JSON formatting could streamline the Hub and enable more adaptive data loading while simplifying downstream data extraction.
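
The ordering requirement can be sketched with Python's standard json module. The function names json_business_key and json_hash_key, the choice of MD5, and the sample address values are assumptions made for this illustration.

```python
import hashlib
import json

def json_business_key(address):
    """Serialize only the populated attributes, with keys in sorted order."""
    populated = {k: v for k, v in address.items() if v is not None}
    return json.dumps(populated, sort_keys=True, separators=(",", ":"))

def json_hash_key(address):
    """Hash the canonical JSON string of an address."""
    return hashlib.md5(json_business_key(address).encode("utf-8")).hexdigest()

# The same address, delivered with a different key order and an empty attribute.
a = {"STREET_NAME": "Main Street", "BUILDING_NUMBER": "1", "FLOOR": None}
b = {"BUILDING_NUMBER": "1", "STREET_NAME": "Main Street"}

same_key = json_hash_key(a) == json_hash_key(b)
```

Both dictionaries serialize to the same canonical string, so they produce the same hash key even though the attributes arrive in a different order and one record carries an empty FLOOR.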

Reference tables are another option for frequently accessed address data, enabling deduplication without overcomplicating the Hub. They act as dedicated sources of address data, indexed by a unique address ID, which reduces redundancy across other Hubs.

Considering Address Data as Descriptive in Satellites

Instead of adding addresses as business keys in the Hub, it can be more effective to store them as descriptive attributes within a Satellite structure. This avoids overloading the Hub with attributes that may not always be needed for identifying the business key itself. By storing address data in Satellites linked to the primary business entity (e.g., customers, stores), you can achieve a balance between deduplication and schema simplicity.

Olschimke recommended this approach particularly when the main goal is to eliminate redundancies across address data. This approach aligns with a best practice in Data Vault modeling: Satellite tables should contain descriptive data that change over time, while Hubs contain only essential business identifiers.

Applying Business Rules in Data Vault Modeling

Address data often requires additional business rules, especially when handling complex keys or duplicates. Olschimke pointed out that handling NULL values with a placeholder complicates creating downstream dimensions. Instead, a two-step approach was advised: (1) defining the business keys within the Hub with fixed placeholders (e.g., all zeros or all F’s) for default and error handling and (2) standardizing the Satellite structure to handle varying address formats dynamically.

Ultimately, each business has unique requirements, and the choice between single Hubs, JSON structures, and Reference tables will depend on how critical the address data is to the core business operations. By focusing on avoiding overloading and ensuring scalability, businesses can set up a Data Vault model that minimizes long-term maintenance while maximizing data clarity and accessibility.

Conclusion

Modeling address data in a Data Vault context can be intricate, especially when attempting to create a unified Hub that supports various address types. The key considerations discussed by Olschimke emphasize flexibility, simplicity, and adherence to business rules without overloading Hubs. JSON-based keys, reference tables, and Satellite structures offer alternative approaches to managing address data, allowing you to avoid pitfalls associated with NULL placeholders and composite keys.

For businesses tackling complex address data requirements, experimenting with these alternatives may yield significant benefits, particularly in managing data deduplication, compliance, and future scalability.

Interested in learning more? Check out Scalefree’s webinars and consider joining the Data Innovators Exchange community for discussions on data modeling, cloud computing, and Data Vault 2.0 best practices.

Meet the Speaker

Michael Olschimke

Damian Hinz In Data Vault, dbt blog article, Intermediate

CI/CD: Practical Insights into Automating Data Vault 2.0 with dbt

CI/CD pipelines are becoming increasingly important for ensuring that software updates can be released cost-effectively while maintaining high quality. But how exactly do CI/CD pipelines work, and how can a project benefit from using one?

This newsletter aims to answer these questions through a practical example of a CI/CD pipeline. The example focuses on a CI/CD pipeline for a GitHub repository that includes a package for implementing Data Vault 2.0 in dbt across various databases. Therefore, this newsletter will also cover the basics of dbt and GitHub Actions.

From Continuous Integration To Data Vaults: A Comprehensive Workflow

This webinar will cover what CI/CD pipelines are and the advantages they offer. We will present parts of the CI/CD pipeline for the public datavault4dbt package to demonstrate how a CI/CD pipeline can be used. The webinar will introduce the key features of GitHub Actions and explain them through examples. This will show how each feature can be utilized in practice and highlight the various possibilities GitHub Actions offers. The webinar aims to explain the benefits of CI/CD pipelines and illustrate what such a pipeline can look like through a practical example.

In this article:

What is CI/CD?
Introduction to dbt
The Capabilities of GitHub Actions
Practical Example: CI/CD Pipeline for datavault4dbt
Conclusion

What is CI/CD?

CI stands for Continuous Integration, and CD stands for Continuous Delivery or Continuous Deployment. But what exactly do these terms mean?

Continuous Integration refers to the regular merging of code changes, where automated tests are conducted to detect potential errors early and ensure that the software remains in a functional state.

Continuous Delivery involves making the validated code available in a repository. CI tests should already be conducted in the pipeline for this purpose. It also includes further automation needed to enable rapid deployment, such as creating a production-ready build. The difference between Continuous Delivery and Continuous Deployment is that with Continuous Deployment, the successfully tested software is released directly to production, while Continuous Delivery prepares everything for release without automatically deploying it.

Continuous Deployment allows changes to be implemented quickly through many small releases rather than one large release. However, the tests must be well-configured, as there is no manual gate for transitioning to production.

CI/CD pipelines provide immense time savings through automation. The costs of resources needed for manual testing are also lower with CI/CD pipelines, as they can be configured to spin up resources only for testing and then shut them down afterward. Since permanent resources aren’t required, you only pay for the resources needed during the test runtime.

Introduction to dbt

The abbreviation dbt stands for “data build tool.” dbt is a tool that enables data transformation directly within a data warehouse. It uses SQL-based transformations that can be defined, tested, and documented directly in the dbt environment.

This makes dbt an excellent choice for implementing Data Vault 2.0 as dbt can be used to create and manage the hubs, links, and satellites required by Data Vault.

To facilitate this process, we at Scalefree have developed the datavault4dbt package. Datavault4dbt offers many useful features, such as predefined macros for hubs, links, satellites, the staging area, and much more.

For a deeper understanding of dbt or datavault4dbt, feel free to read one of our articles on the topic.

The Capabilities of GitHub Actions

GitHub Actions is a feature of GitHub that allows you to create and execute workflows directly within GitHub repositories. You can define various triggers for workflows, such as pull requests, commits, schedules, manual triggers, and more.

This makes GitHub Actions ideal for building CI/CD pipelines for both private and public repositories. A workflow consists of one or more jobs, each made up of several steps. Each job runs on its own runner, typically a fresh virtual machine.

Within these steps, you can define custom tasks or reuse existing actions and workflows, whether internal or published by others. This offers the significant advantage of not having to develop everything from scratch; instead, you can build on publicly available components.

The seamless integration of Docker also provides numerous possibilities, such as quickly setting up different test environments, which greatly simplifies the creation of a CI/CD pipeline.

GitHub Actions is the key tool in the following example of a CI/CD pipeline.

Practical Example: CI/CD Pipeline for datavault4dbt

For the public repository of the datavault4dbt package, we have built a CI/CD pipeline to ensure that all features continue to function across all supported databases with every pull request (PR). When a PR is submitted by an external user, someone from our developer team must approve the start of the pipeline. For a PR from an internal user, by contrast, the pipeline can be started automatically by adding a specific label.

Once the pipeline is triggered, GitHub Actions automatically starts a separate virtual machine (VM) for each database. Currently, the datavault4dbt package supports AWS Redshift, Microsoft Azure Synapse, Snowflake, Google BigQuery, PostgreSQL, and Exasol, so a total of six VMs will be launched. Since GitHub Actions operates in a serverless manner, these VMs do not need to be manually set up or managed.

The VMs then connect to the required cloud systems. For instance, the VM for Google BigQuery connects to Google Cloud, while the VM for AWS Redshift connects to AWS. Subsequently, the necessary resources for each database are generated, which can be done via API calls or using tools like Terraform.

After the resources are created, additional files required for testing are generated and loaded onto the VM. In our example pipeline, these include files such as profiles.yml, which contains information needed by dbt to connect to the databases.

Next, a Dockerfile is used on each VM to build an image that automatically installs all dependencies for the respective database. At this stage, Git is also installed on each image so that tests stored in a separate Git repository can be loaded onto the image.

Loading the tests from a repository allows for centralized management of the tests, ensuring any changes are executed for each database during the next pipeline run. Once the images are built, containers are created using these images, where tests are conducted with various parameters. After all tests are completed, the containers are shut down, and by default, the resources on the respective cloud providers are deleted.

The test results are fully visible in GitHub Actions, with successful and failed tests clearly marked.

If the pipeline is started manually, there is an additional option to specify whether only certain selected databases should be tested and whether the resources on the cloud systems should not be deleted after the tests. This allows developers to examine the data on the databases more closely in case of an error.
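The manual options described above can be sketched as a small planning function: one test job per database, optionally narrowed to a selection, with a flag that controls whether cloud resources are torn down afterwards. This is a hypothetical illustration in Python, not the actual workflow definition of the datavault4dbt repository; the names `DatabaseJob`, `plan_jobs`, and the database identifiers are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Databases named in the article as currently supported (identifiers assumed).
SUPPORTED_DATABASES = [
    "redshift", "synapse", "snowflake", "bigquery", "postgres", "exasol",
]


@dataclass(frozen=True)
class DatabaseJob:
    database: str
    delete_resources: bool  # tear down cloud resources after the tests?


def plan_jobs(selected: Optional[Iterable[str]] = None,
              keep_resources: bool = False) -> List[DatabaseJob]:
    """Build one job per database; manual runs may narrow the selection."""
    if selected is None:
        targets = list(SUPPORTED_DATABASES)
    else:
        wanted = set(selected)
        unknown = wanted - set(SUPPORTED_DATABASES)
        if unknown:
            raise ValueError(f"unsupported databases: {sorted(unknown)}")
        targets = [db for db in SUPPORTED_DATABASES if db in wanted]
    return [DatabaseJob(db, delete_resources=not keep_resources)
            for db in targets]
```

Calling `plan_jobs()` mirrors a PR-triggered run (all six databases, resources deleted), while `plan_jobs(["bigquery"], keep_resources=True)` mirrors a manual debugging run against a single database.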

This pipeline offers numerous advantages for the development of the datavault4dbt package. It allows testing for errors on any of the supported databases with each change, without spending much time creating test resources. At the same time, it saves costs because all resources run only as long as necessary and are immediately shut down after the tests.

Managing the pipeline is also simplified through GitHub, as all variables and secrets can be stored directly in GitHub, providing a centralized location for everything. Once the pipeline is set up, it can be easily extended to include additional databases that may be supported in the future.

Ultimately, this is just one example of what a CI/CD pipeline can look like. Such pipelines are as diverse as the software for which they are designed. If we have piqued your interest and you have further questions about a possible pipeline for your company, please feel free to contact us.

Conclusion

This newsletter explored how CI/CD pipelines work and what benefits they bring to agile software development. The practical example, a GitHub repository containing a dbt package for implementing Data Vault 2.0, showed how tools like GitHub Actions can automate testing and make deployment processes more efficient.

Damian Hinz In DevOps Knowledge

How Can DataOps Support and Improve Your Data Solution?

DataOps: Revolutionizing Data Solutions

The modern business landscape is awash with data. From customer interactions to market trends, organizations are constantly collecting and analyzing information to gain insights and make informed decisions. However, managing data effectively can be a significant challenge. Traditional approaches, such as on-premise data solutions, often suffer from limitations like scalability, complexity, and high maintenance costs. Additionally, data quality concerns can lead to inaccurate analytics and insights.

To overcome these challenges, a new methodology called DataOps has emerged. DataOps is a transformative approach that revolutionizes how organizations develop, deploy, and operate their data solutions.

In this article:

Understanding DataOps
The Benefits of DataOps
The Key Principles of DataOps
Summary

Understanding DataOps

At its core, DataOps is about fostering collaboration, embracing agility, and driving continuous improvement throughout the data lifecycle. It combines the principles of DevOps with data management best practices to create a streamlined and efficient data pipeline.

The term DataOps splits into two key components:

Data Development: Focuses on engineering and evolving the data aspects, or modifying data.
Data Operations: Deals with operating, supporting, and governing the data aspects.

The Benefits of DataOps

DataOps offers a multitude of benefits that can significantly improve the way organizations manage their data:

Overcoming Scalability and Flexibility Limitations: DataOps, combined with cloud-based data platforms, enables dynamic resource provisioning on demand. This eliminates the need for regular hardware upgrades and allows organizations to pay only for what they need.
Improved Collaboration: DataOps methodologies, like continuous delivery, promote collaboration between development, operations, and data teams. This leads to shorter sprint cycles and faster delivery of new features.
Enhanced Data Quality and Governance: Automated testing, data validation, and continuous monitoring ensure data accuracy, consistency, and compliance. Data lineage tracking and role-based access controls further strengthen data quality and trustworthiness.

The Key Principles of DataOps

DataOps is built upon a set of key principles that guide its implementation:

Collaboration: Encourages cross-functional collaboration between data engineers, data scientists, and operations teams.
Automation: Automates repetitive tasks to reduce errors and improve efficiency.
Continuous Improvement: Promotes a culture of continuous learning and improvement through regular feedback and iteration.
Data Quality: Emphasizes the importance of data quality throughout the data lifecycle.
Agility: Enables rapid response to changing business needs and market conditions.

Summary

DataOps is a powerful methodology that enables organizations to overcome the challenges of traditional data management approaches. By embracing collaboration, agility, and continuous improvement, organizations can leverage DataOps to unlock the full potential of their data and gain a competitive edge in the modern business environment.

Loading SAP CDC Data into GDPR Compliant Data Vault

Loading SAP CDC Data into a GDPR-Compliant Data Vault

When managing change data capture (CDC) data from SAP in a Raw Data Vault, special considerations are needed for both CDC information and GDPR-relevant personal data. This post will cover how to model CDC data in a Data Vault, including the unique handling of created, updated, and deleted records. We’ll also discuss best practices for splitting data into separate satellites to manage GDPR-compliant attributes, including empty columns and privacy concerns.

This content is based on a discussion led by Michael Olschimke, CEO of Scalefree, during a Data Vault Q&A session.

In this article:

CDC Data Modeling in the Data Vault
- 1. Load CDC Data in Satellites with Adjusted Patterns
Handling GDPR-Relevant Personal Data
- 2. Splitting Satellites Based on Privacy Classifications
Managing Empty Columns in the Data Vault
Example Satellite Structure
Best Practices for Satellite Splitting in the Data Vault
Conclusion
Meet the Speaker

CDC Data Modeling in the Data Vault

The primary challenge with CDC data is that it only includes changes from SAP, not the full dataset each time. CDC data typically includes metadata on whether a record was created, updated, or deleted in SAP. Here’s a look at how to approach modeling this data in a Data Vault:

1. Load CDC Data in Satellites with Adjusted Patterns

In Data Vault, the data model remains unchanged, regardless of how the data is delivered (batch, CDC, or real-time). However, the loading pattern for CDC data into satellites needs some adjustments:

Delta Check Adjustment: Normally, the Data Vault delta check identifies changes before loading data into a satellite. With CDC data, the changes are already captured, so this step can be bypassed. Instead, all changes from CDC data are loaded into the appropriate satellites directly.
Change Impact Across Satellites: Because every CDC record is loaded into all satellites derived from that source, a change in one attribute produces a new row in each of them, including satellites whose own attributes did not change. This can create non-delta records, but the impact is typically minimal, and the redundant data can be compressed for storage efficiency.
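The difference between the two loading patterns can be sketched as follows. The in-memory list standing in for a satellite table, the column names (`hk`, `ldts`, `hashdiff`, `change_type`), and the use of SHA-256 are assumptions for illustration only; load dates are ISO-formatted strings so that string comparison matches chronological order.

```python
import hashlib
from typing import Dict, List, Optional


def hashdiff(payload: Dict[str, object]) -> str:
    """Hash the descriptive attributes in a stable column order."""
    joined = "|".join(f"{k}={payload[k]}" for k in sorted(payload))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def load_with_delta_check(satellite: List[dict], hk: str, ldts: str,
                          payload: Dict[str, object]) -> bool:
    """Batch pattern: insert only if the payload differs from the latest row."""
    latest: Optional[dict] = None
    for row in satellite:
        if row["hk"] == hk and (latest is None or row["ldts"] > latest["ldts"]):
            latest = row
    new_diff = hashdiff(payload)
    if latest is not None and latest["hashdiff"] == new_diff:
        return False  # nothing changed, skip the insert
    satellite.append({"hk": hk, "ldts": ldts, "hashdiff": new_diff, **payload})
    return True


def load_cdc_event(satellite: List[dict], hk: str, ldts: str,
                   change_type: str, payload: Dict[str, object]) -> None:
    """CDC pattern: the source already flagged a change, so always append."""
    satellite.append({"hk": hk, "ldts": ldts, "change_type": change_type,
                      "hashdiff": hashdiff(payload), **payload})
```

With the CDC pattern, two consecutive events carrying an identical payload both land in the satellite, which is exactly the kind of non-delta record mentioned above.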

Handling GDPR-Relevant Personal Data

CDC data often includes both regular attributes and GDPR-sensitive personal data. In the Data Vault, personal data attributes should be separated based on privacy and security classes to ensure compliance and manage access. Here’s the recommended approach:

2. Splitting Satellites Based on Privacy Classifications

For GDPR compliance, split CDC data into multiple satellites:

Personal Data Satellite: A dedicated satellite for GDPR-relevant attributes (such as names or IDs). This separation allows for tighter security and privacy control.
Non-Personal Data Satellite: General attributes with no privacy concerns go into a separate satellite to reduce the risk of exposure.
Additional Splits: Further splits may be required based on rate of change, security levels, or business context, depending on the specific needs of your organization.

Maintaining separate satellites for different classes of data ensures that personal information is handled with stricter privacy controls, helping your data architecture comply with GDPR requirements.

Managing Empty Columns in the Data Vault

It’s common for source tables to contain columns that are always empty. When working with CDC data in a Data Vault:

Include Empty Columns for Auditing: To retain full traceability and auditability, include empty columns in the satellite. This preserves the exact structure of the source data without altering it.
Consider Separate “Unused Data” Satellite: If there are many empty columns, these can be grouped into a dedicated satellite, making the primary satellites leaner for users.

This approach allows for flexibility if the data in these columns becomes relevant in the future. Auditors will appreciate the comprehensive structure, and the Data Vault will retain all source data in its original form.

Example Satellite Structure

With GDPR compliance and CDC loading adjustments in mind, here’s an example structure for splitting SAP CDC data into satellites:

Satellite: CDC_Personal_Data
- Attributes: GDPR-relevant data (e.g., personal names, social security numbers)
- Metadata: Load date, source, change type (create, update, delete)
- Purpose: Privacy-controlled access

Satellite: CDC_NonPersonal_Data
- Attributes: Non-personal data columns
- Metadata: Load date, source, change type
- Purpose: General access

Satellite: CDC_Unused_Columns
- Attributes: Columns always empty in the source table
- Metadata: Load date, source
- Purpose: Compliance and future-proofing
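
One way to picture the split is a routing step that sends each source column to the satellite matching its classification. The sketch below assumes a hypothetical column classification for a single source table; the column names and the `split_record` helper are invented for illustration, while the satellite names follow the example structure above.

```python
from typing import Dict, Optional

# Hypothetical column classification for one SAP source table.
COLUMN_CLASS = {
    "first_name": "personal",
    "last_name": "personal",
    "country": "non_personal",
    "credit_limit": "non_personal",
    "legacy_flag": "unused",  # assumed to be always empty in the source
}

SATELLITE_FOR_CLASS = {
    "personal": "CDC_Personal_Data",
    "non_personal": "CDC_NonPersonal_Data",
    "unused": "CDC_Unused_Columns",
}


def split_record(record: Dict[str, Optional[str]], ldts: str, source: str,
                 change_type: str) -> Dict[str, dict]:
    """Route each source column to the satellite for its classification."""
    rows = {name: {"ldts": ldts, "source": source}
            for name in SATELLITE_FOR_CLASS.values()}
    for column, value in record.items():
        # An unclassified column raises KeyError, forcing an explicit decision.
        satellite = SATELLITE_FOR_CLASS[COLUMN_CLASS[column]]
        rows[satellite][column] = value
    # Following the example structure, only the first two satellites
    # carry the CDC change type.
    for name in ("CDC_Personal_Data", "CDC_NonPersonal_Data"):
        rows[name]["change_type"] = change_type
    return rows
```

Failing loudly on an unclassified column is a deliberate choice in this sketch: a new source column should not silently end up in a generally accessible satellite before its privacy class is known.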

Best Practices for Satellite Splitting in the Data Vault

When splitting data into satellites, follow these best practices:

Split by Privacy and Security: Ensure that personal and non-personal data are stored separately, particularly when handling GDPR-relevant information.
Split by Source System: Keep different source systems in separate satellites for clarity and maintainability.
Consider Business Needs: If certain data attributes are only relevant to specific business cases, split them accordingly to reduce satellite complexity.

These principles provide a clean, secure, and compliant Data Vault structure that enables efficient data retrieval, flexibility, and regulatory adherence.

Conclusion

Modeling SAP CDC data in a GDPR-compliant Data Vault involves adjustments to loading patterns, especially when dealing with CDC deltas and GDPR-sensitive data. By separating data based on privacy classes and including empty columns where necessary, you can ensure compliance and maintain a flexible data model. The approach outlined here simplifies the handling of CDC data, while providing robust auditing and privacy control.

Meet the Speaker

Michael Olschimke

Modelling Demographic Data

Modeling Demographic Data for the Raw Data Vault: A Practical Guide

When working with demographic data for population analysis, one often encounters datasets that include variables such as region, year, age, and sex. For organizations implementing a Data Vault methodology, the task then becomes modeling this data in the Raw Data Vault in a way that enables efficient use in the Business Vault for proportion calculations and other business logic.

In this blog post, we’ll explore how to approach modeling demographic data for the Raw Data Vault. We’ll look at considerations for reference tables, hubs, links, and satellites, and discuss why a simplified approach can be beneficial while still capturing change history when needed. Let’s walk through an example based on a question that was raised during a Data Vault Q&A session.

In this article:

Defining the Dataset and Objective
The Simplified Modeling Approach
- 1. Flatten the Data Structure First
- 2. Introducing Reference Hubs and Satellites for Change Tracking
Modeling Strategy and Design
Why Avoiding Multiple Hubs and Links Makes Sense
Advantages of a Single Reference Hub with a Satellite
Using the Model in Business Logic
Potential Challenges and Solutions
Conclusion
Meet the Speaker

Defining the Dataset and Objective

Let’s start by outlining the dataset and the main objective. Suppose we have a demographic dataset sourced externally (e.g., from a national bureau) that includes the following columns:

Region
Year
Age
Sex
Population count per region, year, age, and sex

The goal is to load this dataset into the Raw Data Vault and make it available for calculations in the Business Vault, such as determining population proportions across different dimensions.

The Simplified Modeling Approach

To understand the modeling approach, let’s consider the basic elements of Data Vault architecture:

Hubs: Tables that hold unique business keys and serve as identifiers.
Links: Tables that define relationships between hubs.
Satellites: Tables that hold descriptive data and track changes over time.

Since this dataset contains demographic attributes with no true business keys, modeling choices hinge on balancing simplification and change capture. Here’s a step-by-step breakdown of the approach:

1. Flatten the Data Structure First

The simplest model for this dataset would be to create a flat, wide table that includes columns for region, year, age, sex, and population value. This structure would load the data directly without further separation into hubs, links, or satellites.

Pros: Easy to query and manage, especially if there’s no need to track changes over time.
Cons: Lacks support for tracking updates or changes in population values.

This approach works well if the data is static or updated only infrequently, but it limits flexibility for versioning or incremental updates.

2. Introducing Reference Hubs and Satellites for Change Tracking

To address the limitations of a flat table, we can take advantage of reference hubs and satellites. Reference hubs allow us to treat region, year, age, and sex as reference codes. In a reference hub, each unique combination of these codes is treated as a single, composite key.

The structure then includes:

A Reference Hub with region, year, age, and sex as identifying attributes.
A Reference Satellite with population counts as the descriptive attribute, which links back to the Reference Hub.

By introducing these reference structures, we ensure that changes in population values can be captured over time. If a new population record comes in with a different population value for a given region, year, age, and sex, it is added to the Reference Satellite as a new row. This provides a change history without needing to modify the original row.

Modeling Strategy and Design

With a focus on simplification, here’s how the structure would look in the Raw Data Vault:

Reference Hub: Combines the codes for region, year, age, and sex. This setup eliminates the need for separate hubs for each attribute and keeps the model straightforward. The composite key formed by region, year, age, and sex uniquely identifies each record, while the hub serves as a central reference.
Reference Satellite: Attached to the Reference Hub, the Reference Satellite holds the population value and includes a load date for tracking when data was loaded. This enables us to capture historical changes efficiently. If an update occurs for the population value, a new row is added to the Satellite with an updated load date, creating a versioned history.

Example Structure:

Reference Hub: Demographic_Hub
- Primary Key: Region, Year, Age, Sex
- Additional Columns: Load Date (ldts), Source (src)

Reference Satellite: Population_Satellite
- Foreign Key: Region, Year, Age, Sex (composite key from Hub)
- Population Value
- Load Date (ldts)
- Source (src)

In this setup, all demographic attributes are contained within a single hub and linked to the population value in the satellite. This design maintains a simple and effective data model that supports historical data changes without additional complexity.
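
The change-capture behavior of this structure can be sketched in a few lines. The dictionary and list below stand in for the Reference Hub and Reference Satellite tables; the function name and row layout are assumptions for illustration, and load dates are ISO-formatted strings.

```python
from typing import Dict, List, Tuple

RefKey = Tuple[str, int, int, str]  # (region, year, age, sex)


def load_population(hub: Dict[RefKey, dict], satellite: List[dict],
                    key: RefKey, population: int, ldts: str, src: str) -> bool:
    """Register the composite key once; add a satellite row only on change."""
    hub.setdefault(key, {"ldts": ldts, "src": src})
    history = [row for row in satellite if row["key"] == key]
    if history:
        latest = max(history, key=lambda row: row["ldts"])
        if latest["population"] == population:
            return False  # same value as the current version: no new row
    satellite.append({"key": key, "population": population,
                      "ldts": ldts, "src": src})
    return True
```

Loading the same count twice leaves the satellite unchanged, while a revised count adds a second row with a later load date, so the earlier figure stays available as history.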

Why Avoiding Multiple Hubs and Links Makes Sense

In theory, one could create separate hubs for each demographic attribute—region, year, age, and sex—and link them together. However, this approach introduces unnecessary complexity without adding value in this context. Here’s why:

Multiple Joins: Multiple hubs and a link table require additional joins, increasing complexity when querying the data.
Increased Entity Count: Additional hubs and links inflate the number of entities, making the data model harder to understand and maintain.
Performance Concerns: Each join adds processing cost, which can slow down queries, especially with larger datasets.

By consolidating all demographic attributes into a single hub, we reduce the number of entities and simplify the model, making it easier to use and maintain while still meeting the business needs.

Advantages of a Single Reference Hub with a Satellite

This approach is particularly advantageous because it balances simplicity with flexibility. Here’s how:

Change Capture: With the reference satellite, we can track historical population data changes over time. Each new row represents an update, identified by load date, making it easy to see when population data was updated.
Granularity and Scalability: The single hub provides a consistent granularity for the data, ensuring that queries are straightforward and changes are easy to manage.
Performance Efficiency: Joining based on region, year, age, and sex is computationally efficient. These attributes are typically small (e.g., integer or small character values), reducing the burden on processing.

Using the Model in Business Logic

Once the demographic data is loaded into the Raw Data Vault, it’s ready for use in the Business Vault. Here’s how it could be used for business logic:

Joining Data: In the Business Vault, data analysts can join other datasets with the demographic reference satellite on the region, year, age, and sex attributes to incorporate population data.
Proportion Calculations: With population counts available, proportion calculations (e.g., the proportion of a certain demographic group in a region) become straightforward.
Temporal Analysis: The load date in the satellite provides historical tracking, allowing analysts to view demographic changes over time and analyze trends.
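
A proportion calculation of this kind might look like the sketch below. It assumes satellite rows shaped as dictionaries with a `key` tuple of (region, year, age, sex), a `population` count, and an ISO-formatted `ldts`; these names and the helper functions are illustrative assumptions rather than part of the original discussion.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

RefKey = Tuple[str, int, int, str]  # (region, year, age, sex)


def current_population(satellite: List[dict]) -> Dict[RefKey, int]:
    """Keep the row with the newest load date for every composite key."""
    latest: Dict[RefKey, dict] = {}
    for row in satellite:
        seen = latest.get(row["key"])
        if seen is None or row["ldts"] > seen["ldts"]:
            latest[row["key"]] = row
    return {key: row["population"] for key, row in latest.items()}


def share_by_sex(satellite: List[dict], region: str,
                 year: int) -> Dict[str, float]:
    """Proportion of each sex within one region and year."""
    totals: Dict[str, int] = defaultdict(int)
    for (reg, yr, _age, sex), count in current_population(satellite).items():
        if reg == region and yr == year:
            totals[sex] += count
    overall = sum(totals.values())
    if overall == 0:
        return {}
    return {sex: count / overall for sex, count in totals.items()}
```

Resolving the latest version per key first keeps superseded counts out of the totals; the same helper could be pointed at an earlier cut-off date for temporal comparisons.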

Potential Challenges and Solutions

One common concern is that joining on multiple columns (region, year, age, sex) could impact performance. However, with optimized indexing and the small size of these columns, this concern is minimized. For larger datasets, partitioning on region or year might further optimize query performance.

Conclusion

In summary, a simplified model that leverages a single reference hub with a satellite provides a highly effective way to model demographic data in a Raw Data Vault. This approach allows for straightforward use in the Business Vault and ensures flexibility for tracking historical changes, all while keeping the model manageable and efficient.

This example illustrates how, by focusing on simplification and change capture, organizations can build an effective demographic data model in the Raw Data Vault that meets both current and future needs for population-based analysis.

Meet the Speaker

Michael Olschimke

Loading CDC Data into Multi-Active Satellites and Status Tracking Satellites

Mastering CDC Data in Data Vault 2.0

Change Data Capture (CDC) is a powerful mechanism for tracking changes in source systems. However, when the primary key in your source system differs from the business key used in your Data Vault hub, you may encounter challenges in loading data into multi-active satellites. This article explores various strategies for handling CDC data in such scenarios, offering practical solutions to ensure accurate and efficient data loading.

In this article:

Understanding the Challenge
Additional Considerations
Conclusion
Meet the Speaker

Understanding the Challenge

In many source systems, the primary key is a technical identifier unknown to the business. Instead, the business key represents the meaningful identifier for a business object. In a typical scenario, the relationship between the primary key and the business key is one-to-one. However, in some cases, multiple records can be active for the same business key on the same date, resulting in multi-activity.

This situation arises when the primary key is unique at a given point in time, but the business key is not. For instance, you might have multiple customer IDs in your source system (primary keys) referring to the same customer (business key).

Solution 1: Verify Multi-Activity

Before diving into complex solutions, it’s crucial to verify whether the data is genuinely multi-active. In some cases, the appearance of multi-activity might be due to records being deleted and recreated with the same business key, resulting in different primary keys.

To check this, analyze the CDC data and other technical columns in the source system to determine the order of events. If a sequence of delete and create operations is detected, you may not be dealing with true multi-activity.
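
The check described above can be sketched as a replay of the CDC events for a single business key. The event layout (sequence number, primary key, operation) and the operation names are assumptions; real CDC feeds will name and order these fields differently, and the sketch assumes sequence numbers are unique.

```python
from typing import Iterable, Tuple

# One CDC event: (sequence number, source primary key, operation)
Event = Tuple[int, str, str]


def is_truly_multi_active(events: Iterable[Event]) -> bool:
    """Replay the events for ONE business key in sequence order.

    Returns True if two or more primary keys are ever active at the same
    time; a delete followed by a create never overlaps and returns False.
    """
    active = set()
    for _seq, primary_key, operation in sorted(events):
        if operation == "delete":
            active.discard(primary_key)
        else:  # "create" or "update"
            active.add(primary_key)
        if len(active) > 1:
            return True
    return False
```

For example, the sequence create A, delete A, create B yields False (a delete-and-recreate), whereas create A followed by create B yields True.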

Solution 2: Create a Multi-Active Satellite with Delta Checking

If the data is genuinely multi-active, the most straightforward approach is to create a multi-active satellite. Perform delta checks on the combination of the business key and the multi-active attribute (e.g., customer ID). This ensures that only changes within specific multi-active groups are loaded into the satellite.

However, this approach necessitates a specialized point-in-time (PIT) table, as the CDC data provides changes at the finest granularity (row level). You’ll need to consider both the load date timestamp and the multi-active attribute when querying the satellite to retrieve the most recent delta.
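
To illustrate what such a query has to resolve, the sketch below returns the most recent row per multi-active attribute for one hub key as of a snapshot date. The row layout (`hk`, `ma_key`, `ldts`, `change_type`, `payload`) is an assumption for illustration, with ISO-formatted load dates, and dropping sub-keys whose newest row is a delete is one possible interpretation of the CDC change type.

```python
from typing import Dict, List


def latest_per_sub_key(satellite: List[dict], hk: str,
                       as_of: str) -> Dict[str, dict]:
    """Return the newest row per multi-active attribute as of a snapshot date.

    Each row is assumed to carry: hk, ma_key, ldts, change_type, payload.
    Sub-keys whose newest row is a delete are left out of the result.
    """
    newest: Dict[str, dict] = {}
    for row in satellite:
        if row["hk"] != hk or row["ldts"] > as_of:
            continue
        seen = newest.get(row["ma_key"])
        if seen is None or row["ldts"] > seen["ldts"]:
            newest[row["ma_key"]] = row
    return {k: r for k, r in newest.items() if r["change_type"] != "delete"}
```

Because each sub-key can have a different latest load date, a single load date per hub key is not enough, which is why the PIT structure needs to account for the multi-active attribute as well.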

Solution 3: Remodel with Satellites on Links

Another option is to remodel your Data Vault structure by placing the satellite on the link. In this approach, the multi-active attribute becomes a dependent child key in the link, and a standard satellite is created on this link. This simplifies the handling of multi-activity within the link itself.

However, it’s important to note that the satellite in this case describes the relationship between the customer and other components, rather than directly describing the business object hub. Evaluate whether this modeling change aligns with your downstream querying requirements.

Solution 4: Use the Primary Key as a Technical Hub

As a last resort, you can use the primary key from the source system as a technical hub. This involves creating a hub for the primary key values (e.g., customer IDs) and linking it to the real customer hub using a same-as link. While not the preferred method, this can be a workaround in situations where other solutions are not feasible.

Additional Considerations

CDC Data vs. Full Extracts: When dealing with full data extracts, even if only a part of the multi-active component changes, it’s best practice to insert the full block of data with the newest load date timestamp. This simplifies downstream processes and eliminates the need for a specialized PIT table.
Non-History Links: If the CDC data represents transactional events and is analyzed as such, consider loading it into non-history links instead of satellites. This approach aligns with the transactional nature of the data and facilitates aggregations and trend analysis.

Conclusion

Handling CDC data in Data Vault 2.0 when dealing with multi-active satellites requires a careful assessment of your specific use case and data characteristics. The solutions presented in this article offer various approaches to tackle this challenge, each with its own advantages and trade-offs. By understanding these strategies and selecting the most appropriate one, you can ensure accurate and efficient data loading in your Data Vault environment.

Meet the Speaker

Marc Winkelmann

Building a scalable Data Platform? In Data Tools, dbt blog article, Intermediate

Unify Your Understanding of Data with the dbt Semantic Layer

dbt Semantic Layer

The bigger a corporation gets, the more data is available, and the more users want to use this data. In a traditional data warehouse (DWH) environment, the DWH typically provides a consumption layer consisting of various information marts, which are then loaded into multiple business intelligence (BI) tools. There, business users transform and aggregate the data to calculate KPIs and finally make business decisions.

This, as it turns out, is easier said than done. To derive KPIs out of the data, business users need to have a common understanding of the data provided by the DWH. The information to understand the data is typically hard to find and not accessible in a single place.

Ultimately, this can lead to multiple departments understanding the same data differently and deriving different interpretations of the same KPI. At that point the worst-case scenario becomes very likely: trust in your data fades. This is where a unified semantic layer can help!


In this article:

Components of the dbt Semantic Layer
Consuming the dbt Semantic Layer
Conclusion

Components of the dbt Semantic Layer

The dbt Semantic Layer helps simplify the definition and usage of critical business metrics in your dbt project. Metric definitions are centralized to allow consistent self-service usage for all data teams.

By shifting metric definitions from the BI layer into the modeling layer, business users from different units can be confident that they use the same metric definition, no matter which tool they use. In case a metric definition changes over time, the changes will be applied everywhere it’s used and therefore consistency is enforced.

To create a unified semantic layer inside your dbt project, the following steps are necessary:

Draft a semantic model
- To implement a semantic model, a model needs to be drafted first. This should happen via a collaboration between the technical and business teams, to identify core business concepts and how they relate to each other.
Create dbt models that match your semantic model
- Each object of your semantic model should be turned into a dbt model 1:1. While creating them, take extra care to align column names with your naming standards and to develop the loading logic correctly.
Create new .yml files in the metrics folder
- Everything related to the dbt Semantic Layer needs to be located in a new folder called “metrics”. In there, .yml files are used to define and configure your semantic models. We recommend creating one .yml file per semantic model.
Define entities
- Despite what the name suggests, entities in the dbt Semantic Layer roughly correspond to columns of a semantic model. An entity can be one of four types: Primary, Unique, Foreign, or Natural. Every model needs a primary entity, and an entity can be either a single column or a SQL expression that transforms a column.
Define dimensions
- A dimension in the dbt Semantic Layer can be seen as different ways to look at your model, i.e., group the data by a specific attribute. Every dimension needs to be tied to a primary entity, which is used for the grouping. A good example is a date column which enables you to group your data by day, month, or year.
Define measures
- Measures represent aggregations applied to specific columns in your data model. Measures can be used in other measures to calculate more complex ones, and can be defined with various parameters that help create executable SQL code for calculation.
Define metrics
- Metrics represent the language of the business users. They can be of various types, which represent different kinds of calculations. Some examples include Conversion metrics, Cumulative metrics, Derived metrics, and Ratio metrics. They are always based on measures and represent the last building block of the semantic layer.
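To make the relationship between dimensions, measures, and metrics more concrete, the following toy Python sketch mimics the idea in memory. It is not dbt's YAML specification or any dbt API; in a real project these objects are declared in .yml files. The class names, the `orders` rows, and the `avg_order_value` metric are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    """An aggregation applied to one column of the model."""
    name: str
    column: str
    agg: str  # "sum" or "count" in this sketch


@dataclass(frozen=True)
class RatioMetric:
    """A metric built on top of two measures (numerator / denominator)."""
    name: str
    numerator: str
    denominator: str


AGGREGATIONS = {"sum": sum, "count": len}


def aggregate(rows, measure, dimension):
    """Group the rows by a dimension and apply the measure's aggregation."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure.column])
    return {key: AGGREGATIONS[measure.agg](values) for key, values in groups.items()}


def ratio_metric(rows, metric, measures, dimension):
    """Evaluate a ratio metric per dimension value from its two measures."""
    numerator = aggregate(rows, measures[metric.numerator], dimension)
    denominator = aggregate(rows, measures[metric.denominator], dimension)
    return {key: numerator[key] / denominator[key] for key in numerator}


# Hypothetical model: order_id plays the role of the primary entity,
# order_month is a dimension, revenue is the column being measured.
orders = [
    {"order_id": 1, "order_month": "2024-01", "revenue": 100.0},
    {"order_id": 2, "order_month": "2024-01", "revenue": 50.0},
    {"order_id": 3, "order_month": "2024-02", "revenue": 30.0},
]
measures = {
    "revenue_total": Measure("revenue_total", "revenue", "sum"),
    "order_count": Measure("order_count", "order_id", "count"),
}
avg_order_value = RatioMetric("avg_order_value", "revenue_total", "order_count")
result = ratio_metric(orders, avg_order_value, measures, "order_month")
```

The point of the sketch is the layering: the metric never touches raw columns directly, so changing a measure definition changes every metric that depends on it.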

[Figure: the semantic layer as reflected in the dbt lineage graph]

Consuming the dbt Semantic Layer

Once your dbt project has a semantic layer defined, it can be opened to data consumers. The dbt Semantic Layer allows various BI tools to directly connect to your dbt Cloud project and integrate metrics, measures, and filters directly into the tool of choice.

The following tools are already natively supported:

Tableau
Google Sheets
Microsoft Excel
Hex
Klipfolio PowerMetrics
Lightdash
Mode
Push.ai
Steep

Other tools can be integrated with custom integrations, as long as they support generic JDBC connections and are compatible with Arrow Flight SQL.

Conclusion

The dbt Semantic Layer can help regain trust in your data warehouse. By moving calculations back from the business users into the data model, a common definition of business KPIs is created.

Although some additional setup is required, implementing a semantic layer can greatly improve the value generated by your data. Integrating it into the BI tools of your business users also simplifies the way your data is consumed.

If you want to know more about the dbt Semantic Layer and learn how it fits into a Data Vault 2.0 powered Data Warehouse, make sure to join our next webinar on August 13th!

– Tim Kirschke (Scalefree)

Lorenz Kindling In Data Vault Friday

The Benefits of Data Warehouse and Data Vault


Demystifying Data Warehouse and Data Vault

In today’s data-driven business landscape, the terms “data warehouse” and “Data Vault” are frequently tossed around. But what exactly are they, and why should businesses invest in them? This article aims to demystify these concepts, addressing common questions from a business perspective. We’ll delve into the reasons behind implementing a data warehouse or Data Vault, how to explain their value to non-technical stakeholders, and when companies typically start investing in these solutions.

In this article:

Why Do We Need Data Warehouses and Data Vaults?
Explaining Data Vault to Non-Technical Stakeholders
When Do Companies Start Investing in Data Warehousing?
Conclusion

Why Do We Need Data Warehouses and Data Vaults?

Before diving into the benefits of data warehouses and Data Vaults, let’s explore the challenges businesses face without them. Many traditional organizations grapple with:

Limited Data Access: Data is often siloed, accessible only to specific departments, hindering cross-functional collaboration and insights.
Lack of Structure: Ad hoc queries and a lack of standardized data processes lead to inefficiencies and unreliable results.
Expensive Trial and Error: Decision-making based on incomplete or inaccurate data can be costly and time-consuming.
Unreliable Data: Inconsistent data sources and ad hoc reporting can lead to errors and misguided decisions.

Data warehouses and Data Vaults address these challenges by providing a centralized, structured, and reliable repository for data. They enable:

Data Integration: Combining data from various sources into a single source of truth, supporting a comprehensive data strategy.
Enhanced Decision-Making: Empowering data-driven decision-making with accurate and timely insights.
Historical Analysis: Enabling trend analysis and forecasting based on historical data.
Improved Data Quality: Implementing data quality management processes to ensure accuracy and consistency.
Scalability and Flexibility: Adapting to evolving business needs and data volumes.
Auditability and Compliance: Maintaining data lineage and ensuring compliance with regulations like GDPR.

Explaining Data Vault to Non-Technical Stakeholders

When communicating the value of a Data Vault to commercial executives or non-technical stakeholders, it’s crucial to emphasize that it’s more than just a data model. Data Vault 2.0 is a comprehensive system of business intelligence, encompassing methodology, architecture, and modeling.

Highlight the key benefits Data Vault offers:

Agility: Agile development methodologies enable quick responses to changing business requirements.
Scalability and Flexibility: The architecture allows for seamless growth and adaptation.
Consistency and Auditability: Data Vault ensures data accuracy, traceability, and compliance.

Use relatable examples to illustrate how Data Vault addresses specific business challenges. For instance, you could explain how it streamlines data integration from multiple sources, ensuring a single version of the truth for customer information.

When Do Companies Start Investing in Data Warehousing?

There’s no one-size-fits-all answer to this question. The ideal time to invest in data warehousing depends on several factors, including:

Data Volume: The amount of data your company generates and the complexity of your data landscape.
Business Needs: The extent to which your business relies on data for decision-making and operations.
Strategic Goals: The importance of data-driven insights in achieving your company’s strategic objectives.

While larger enterprises with vast data volumes often invest in data warehouses early on, even smaller companies can benefit from them. Starting early, even with a smaller data warehouse, can be advantageous as it allows for gradual expansion and integration of external data sources as the business grows.

Conclusion

Data warehouses and Data Vaults are essential tools for businesses aiming to harness the power of their data. They address common data challenges, enable better decision-making, and offer a range of benefits that extend beyond mere reporting.

By understanding the key reasons for implementing these solutions and effectively communicating their value to stakeholders, you can build a strong case for investment and ensure that your organization reaps the rewards of a data-driven future.

Julian Brunner In Data Vault Friday

Why Do We Need Hubs in Data Vault?


Why Hubs in Data Vault are Essential

Data Vault modeling is a powerful methodology for building robust and scalable data warehouses. One of its core components, the Hub, often raises questions among practitioners and stakeholders. Why do we need hubs? Can’t we just simplify the model by putting business keys directly into satellites? In this article, we delve into the reasons behind the existence of hubs and explore scenarios where deviating from the standard practice might be acceptable.

In this article:

The Role of Hubs in Data Vault
Why Not Put Business Keys in Satellites?
Important Considerations about Hubs
Conclusion

The Role of Hubs in Data Vault

Hubs play a pivotal role in Data Vault by storing a distinct list of business keys. These keys serve as unique identifiers for real-world entities, such as customers, products, or employees. Hubs provide several critical benefits:

1. Data Integration: Hubs act as anchors for integrating data from disparate source systems. By consolidating different representations of the same entity into a single hub, you ensure consistency and accuracy across your data warehouse.
2. Scalability: Hubs facilitate seamless scalability. When new data sources are introduced, you can simply add the business keys to the existing hub without the need for major model refactoring. This simplifies the onboarding of new data and reduces the risk of introducing inconsistencies.
3. Auditability: Hubs maintain a clear lineage and audit trail for your data. The load timestamp in a hub functions as a “first seen” date, making it easy to track the evolution of your data over time.
4. Granularity: Perhaps most importantly, hubs define the granularity of multiple downstream objects, including information marts and dimensions. This granularity is crucial for accurate reporting and analysis, making hubs indispensable for many use cases.
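The "distinct list of business keys" and the "first seen" load timestamp described above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions: the dictionary standing in for the hub table, the function names `hash_key` and `load_hub`, the trim-and-uppercase normalization, and the MD5 hash key are choices made for this example and are not prescribed by the article.

```python
import hashlib


def hash_key(business_key):
    """Derive a hash key from a normalized business key (assumed convention)."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def load_hub(hub, business_keys, record_source, load_ts):
    """Insert only business keys the hub has not seen before.

    hub: dict mapping hash key -> hub row, standing in for the hub table.
    Returns the number of new keys added.
    """
    added = 0
    for business_key in business_keys:
        hk = hash_key(business_key)
        if hk in hub:
            continue  # key already known: keep its original "first seen" load date
        hub[hk] = {
            "business_key": business_key.strip().upper(),
            "load_ts": load_ts,
            "record_source": record_source,
        }
        added += 1
    return added
```

Loading a second source system into the same hub only adds the keys that are genuinely new, which is what makes onboarding additional sources cheap.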

Why Not Put Business Keys in Satellites?

While hubs are generally considered best practice, there are rare instances where storing business keys in satellites might be justifiable. One such scenario is when a business key represents an entity that currently lacks descriptive data and is not actively queried.

For example, consider an employee dataset that includes the vehicle identification number (VIN) of the employee’s company car. If there’s no additional information about the car and no immediate need to query it, treating the VIN as a descriptive attribute within the employee satellite might be acceptable.

However, if the need to query or analyze data related to company cars arises in the future, a refactoring strategy called “Hub It Out” can be employed. This involves extracting the distinct VINs from the employee satellite into a new hub, creating links between the employee and car hubs, and potentially adding satellites with descriptive data about the cars.
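As a rough sketch of this refactoring, the Python below derives a car hub and an employee-car link from existing employee satellite rows. The row layout, the column names `employee_id` and `company_car_vin`, the helper `hash_of`, and the MD5-based hash keys are hypothetical and serve only to illustrate the idea.

```python
import hashlib


def hash_of(*parts):
    """Hash one or more normalized business keys (assumed convention)."""
    normalized = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def hub_it_out(employee_satellite, load_ts):
    """Build a car hub and an employee-car link from employee satellite rows.

    employee_satellite: list of dicts with employee_id and company_car_vin
    (the VIN may be None for employees without a company car).
    """
    car_hub = {}
    employee_car_link = {}
    for row in employee_satellite:
        vin = row.get("company_car_vin")
        if not vin:
            continue  # no car recorded for this satellite row
        car_hk = hash_of(vin)
        # setdefault keeps the hub a distinct list: repeated VINs are ignored.
        car_hub.setdefault(car_hk, {"vin": vin.strip().upper(), "load_ts": load_ts})
        link_hk = hash_of(row["employee_id"], vin)
        employee_car_link.setdefault(
            link_hk,
            {
                "employee_hk": hash_of(row["employee_id"]),
                "car_hk": car_hk,
                "load_ts": load_ts,
            },
        )
    return car_hub, employee_car_link
```

The existing employee satellite stays untouched; the new hub and link are derived from data that is already in the Raw Data Vault.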

Important Considerations about Hubs

While the above scenario demonstrates a valid exception, it’s crucial to remember that storing business keys in satellites should be the exception, not the rule. Hubs offer numerous benefits in terms of data integration, scalability, auditability, and granularity, making them essential for most Data Vault implementations.

Before deviating from the standard practice, carefully assess whether the potential benefits of storing business keys in satellites outweigh the potential drawbacks, such as increased storage costs, redundancy, and a less elegant data model.

Conclusion

In conclusion, hubs are fundamental building blocks in Data Vault modeling, providing a range of benefits that contribute to the overall integrity, scalability, and usability of your data warehouse. While there are rare cases where storing business keys in satellites might be justifiable, it’s crucial to carefully weigh the pros and cons before adopting this approach. By adhering to Data Vault best practices and understanding the specific requirements of your use case, you can ensure that your data warehouse is optimized for performance, maintainability, and long-term success.