Data Vault Glossary: Hub, Link, Satellite, Business Vault, and More

The Essential Data Vault Glossary

Data Vault has its own precise vocabulary. Whether you are evaluating the methodology for the first time or preparing for Data Vault certification, understanding what each term means — and why it exists — is the foundation for everything else. This glossary covers the core concepts of Data Vault 2.0 and 2.1, defined at the conceptual level for data engineers, architects, and IT leaders building or modernising a data platform.



Business Key

A business key is the identifier that the business actually uses to recognise and track a business object — a customer number, a product code, an account number, an ISBN. It is the natural, meaningful key that appears in source systems and that business users refer to in their daily work.

In Data Vault, the business key is the fundamental organising principle of the entire model. Every Hub is built around business keys. The goal is to find keys that are unique across the enterprise and stable over time — keys that different source systems share, enabling integration between them.

Business keys sit above surrogate keys (technical IDs generated by a source system). A surrogate key is unique within one system but carries no meaning outside it. A business key has meaning across the organisation, making it suitable for integration. The hierarchy runs from global business keys (universally unique, like a Vehicle Identification Number), through organisational business keys (assigned by the enterprise, like a customer number), down to system-wide surrogate keys where no better option exists.

Hub

A Hub is one of the three fundamental entity types in Data Vault. It stores a distinct list of business keys for a single type of business object — all customer numbers, all product codes, all account numbers. The Hub identifies. It records which business keys have ever existed in the data platform, alongside when they were first seen (the load date) and where they came from (the record source).

The Hub does not describe anything about the business object — that is the Satellite’s job. It does not store relationships — that is the Link’s job. A Hub is insert-only: once a business key is recorded, it is never updated or deleted (except under legal obligation). This permanence is what makes Data Vault historically complete.
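
To make this insert-only behaviour concrete, here is a minimal Python sketch of a Hub load. The table and column names (hub_customer, customer_number) are invented for this illustration, and a real implementation would use the set-based SQL loading patterns rather than row-by-row Python.

```python
from datetime import datetime, timezone

# Hypothetical Hub for the customer business object.
hub_customer = [
    {"customer_number": "C-1001", "load_date": "2024-01-05T08:00:00+00:00", "record_source": "crm"},
]

def load_hub(hub_rows, staged_keys, record_source):
    """Insert business keys the Hub has never seen; never update or delete existing rows."""
    known = {row["customer_number"] for row in hub_rows}
    load_date = datetime.now(timezone.utc).isoformat()
    for key in staged_keys:
        if key not in known:                     # keep the list of business keys distinct
            hub_rows.append({
                "customer_number": key,
                "load_date": load_date,          # when the key was first seen
                "record_source": record_source,  # where it came from
            })
            known.add(key)
    return hub_rows

load_hub(hub_customer, ["C-1001", "C-2042"], record_source="webshop")  # only C-2042 is added
```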

Link

A Link is the second fundamental entity type. It stores a distinct list of relationships between business keys — the fact that a customer purchased a product, that an employee was assigned a vehicle, that a booking involved a passenger and a flight. Like the Hub, the Link is insert-only and records when the relationship was first identified and from which source.

The Link does not describe the relationship — it only establishes that it existed. All descriptive information (when it started, when it ended, what conditions applied) lives in Satellites attached to the Link. Importantly, Links can connect more than two Hubs: a purchase transaction might link a customer, a product, and a store simultaneously. This is entirely normal in Data Vault design.

Satellite

A Satellite is the third fundamental entity type, and where the actual data warehousing happens. It stores descriptive data — the attributes that describe a business object or relationship over time. A customer’s name and address. A product’s description and list price. The start and end dates of an employment contract.

Every time an attribute changes in the source, a new row is inserted into the Satellite. No rows are ever updated. This insert-only behaviour is what gives Data Vault its complete historical record. Each Satellite has exactly one parent — either a Hub or a Link — and Satellites are typically split by source system, by security or privacy classification, and sometimes by rate of change.
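
As an illustration of that change-detection logic, the sketch below inserts a new Satellite row only when the descriptive attributes differ from the latest stored row, comparing a hash of the attributes (often called a hash diff). The column names and hashing convention are assumptions made for this example, not the official loading pattern.

```python
import hashlib
from datetime import datetime, timezone

def hash_diff(attributes: dict) -> str:
    """Hash the descriptive attributes so changes can be detected with one comparison."""
    payload = "||".join(str(attributes[k]).strip().upper() for k in sorted(attributes))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def load_satellite(sat_rows, hub_hash_key, new_attributes, record_source="crm"):
    """Insert-only: add a row only if the attributes differ from the latest known state."""
    history = [r for r in sat_rows if r["hub_hash_key"] == hub_hash_key]
    latest = max(history, key=lambda r: r["load_date"], default=None)
    if latest is None or latest["hash_diff"] != hash_diff(new_attributes):
        sat_rows.append({
            "hub_hash_key": hub_hash_key,
            "load_date": datetime.now(timezone.utc).isoformat(),
            "record_source": record_source,
            "hash_diff": hash_diff(new_attributes),
            **new_attributes,                    # descriptive payload, e.g. name and address
        })
    return sat_rows
```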

The combination of Hub, Link, and Satellite reflects the three fundamental components present in all enterprise data: business keys, relationships, and descriptive attributes. For a deeper treatment of how these entities are modelled and loaded, Data Vault 2.1 Training & Certification covers the full methodology in detail.

Raw Vault

The Raw Vault (also called the Raw Data Vault) is the layer of the Data Vault architecture that stores unmodified source data. It consists of Hubs, Links, and Satellites that capture data exactly as it arrived — no cleansing, no business rules, no filtering, no conditional logic of any kind.

The Raw Vault is the single point of facts. Because no business interpretation has been applied, the data it holds is fully auditable: you can demonstrate precisely what any source system delivered on any given date. This auditability is one of the primary reasons Data Vault is adopted in regulated industries such as banking, insurance, and government.

Business Vault

The Business Vault is the layer above the Raw Vault where business logic is applied. It uses the same Hub-Link-Satellite structures, but its purpose is to transform and enrich the raw data — cleansing records, resolving duplicates, applying currency conversions, tagging data quality levels, and deriving calculated attributes.

The Business Vault is not a mandatory pass-through layer. Data that is already clean and ready for reporting can flow directly from the Raw Vault to an Information Mart. In practice, organisations typically maintain multiple Business Vault schemas — one per department or domain — each expressing the business rules and definitions relevant to that context. This is how Data Vault delivers multiple versions of the truth from a single set of facts: different teams can apply their own definitions without touching the shared Raw Vault underneath. Learn more about the full Data Vault 2.0 methodology and how Scalefree applies it in client projects.

Information Mart

An Information Mart is the delivery layer that presents data to end users and reporting tools. Unlike the Raw Vault and Business Vault — which use Hub-Link-Satellite structures — Information Marts use dimensional models such as star schemas, snowflake schemas, or flat wide tables, in whatever structure the consuming tool requires.

Information Marts are usually virtualised (SQL views) rather than materialised tables, making them lightweight and easy to modify. The recommended approach is many small, focused Information Marts — one per report or use case — rather than a single large mart. Several specialised mart types exist for specific purposes:

  • Error Mart — captures records rejected by a loading process due to hard rule violations. Should always be empty in a healthy system.
  • Raw Mart — presents raw data in a reportable dimensional form without applying business rules. Used during agile requirements gathering to help business users articulate what they need.
  • Quality Mart — shows only the bad or suspect records, giving data stewards visibility into data quality issues so they can be fixed at the source.
  • Source Mart — reconstructs the original structure of a source system from the Data Vault model, with the added benefit of historical versioning and built-in GDPR data removal.
  • Interface Mart — designed for machine-to-machine consumption, used when a downstream application needs to read from the platform or receive cleansed data back from it.
  • AI Feature Mart — a specialised Interface Mart designed for AI and machine learning model consumption, typically wide, flat, and enriched with semantic field descriptions.

Hash Key

A Hash key is a fixed-length value derived by applying a hashing algorithm (typically MD5 or SHA-256) to one or more business key columns. In Data Vault, Hash keys serve as the primary keys of Hubs and Links, and as the foreign key references connecting Satellites to their parents.

The key advantage of Hash keys is that they can be computed independently: any system, given the same business key input, will always produce the same Hash key. This enables parallel loading, makes the model portable across environments, and simplifies join logic. The actual business key columns remain stored alongside the Hash key in the Hub or Link. For a detailed look at how Hash keys are implemented in practice, see Scalefree’s article on Hash Keys in the Data Vault.
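
A minimal sketch of that calculation, assuming one common convention: trim and upper-case each business key column, join the columns with a delimiter, then hash the result. The exact normalisation rules are project-specific, but the point is that anyone applying the same rules gets the same Hash key.

```python
import hashlib

def hash_key(*business_keys: str, algorithm: str = "md5") -> str:
    """Derive a deterministic hash key from one or more business key columns."""
    normalized = "||".join(key.strip().upper() for key in business_keys)
    digest = hashlib.new(algorithm)              # "md5" or "sha256"
    digest.update(normalized.encode("utf-8"))
    return digest.hexdigest()

hash_key("C-1001")                 # Hub hash key from a single business key
hash_key("C-1001", "P-7734")       # Link hash key built from two parent business keys
```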

Load Date

The load date timestamp is a technical metadata attribute on every Hub, Link, and Satellite row. It records the moment the record was loaded into the data platform — not when the event occurred in the source system, but when the data arrived in the vault. The load date is always a full timestamp, never just a date, since data platforms often receive deliveries multiple times per day.

Combined with the record source, the load date answers two fundamental audit questions for every piece of data: when was it received, and from where?

Record Source

The record source identifies which source system a particular record came from. It is stored on every Hub, Link, and Satellite row alongside the load date. Its primary audience is the development and engineering team — when investigating a data issue, the record source points directly to the originating system and delivery batch. It is not used for business reporting or compliance auditing in the same way as the load date.

PIT Table

A PIT table (Point-in-Time table) is a helper structure that makes querying historical data across multiple Satellites significantly more efficient. Without a PIT table, reconstructing the complete state of a business object at a specific historical moment requires complex, expensive joins across Satellites with different load dates. A PIT table pre-computes the correct Satellite row timestamps for each point in time, so downstream queries can join the PIT table directly rather than re-solving the temporal logic on every run.

PIT tables are derived structures — generated from Raw Vault data and rebuildable at any time. They are not part of the core Data Vault model but are standard production companions to it.
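
The lookup a PIT table pre-computes can be sketched in a few lines of Python: for a given snapshot date, find the latest Satellite load date at or before that moment. Real PIT tables materialise this per hash key and snapshot date in SQL; the example below, with invented dates, only illustrates the logic.

```python
from bisect import bisect_right

def pit_entry(satellite_load_dates, snapshot_date):
    """Return the load date of the Satellite row valid at the snapshot date,
    i.e. the latest load date less than or equal to the snapshot."""
    dates = sorted(satellite_load_dates)
    idx = bisect_right(dates, snapshot_date)
    return dates[idx - 1] if idx else None       # None would point to the ghost record

loads = ["2024-01-05", "2024-03-12", "2024-07-01"]   # deliveries for one customer's address
pit_entry(loads, "2024-06-30")   # -> "2024-03-12"
pit_entry(loads, "2023-12-31")   # -> None (before the first delivery)
```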

Bridge Table

A Bridge table is a helper structure that simplifies querying across multiple Links. Where PIT tables solve the temporal complexity of Satellites, Bridge tables solve the structural complexity of navigating a chain of linked Hubs — for example, tracing from a customer through their orders, through their order lines, to the products. Bridge tables are pre-joined snapshots of relationship paths that would otherwise require multiple sequential joins. See also: Bridge Tables 101 on the Scalefree blog.

Ghost Record

A ghost record (also called a default record or zero key record) is a placeholder row inserted into a Hub or Satellite to handle situations where a foreign key reference exists in the source data but the referenced record itself does not. It prevents referential integrity violations and allows the data platform to load records completely even when source data is incomplete. Ghost records are technical placeholders, not real business data, and are distinguishable by their defined default key values.

Effectivity Satellite

An Effectivity Satellite tracks the active or inactive status of a Hub record or a Link relationship over time. It records when a business object or relationship became active in the source system and when it was deactivated or deleted. When a source system deletes a record, the Hub retains the business key permanently — the Effectivity Satellite gains a new row reflecting the deletion, preserving the complete history while making the current active state queryable.

Persistent Staging Area

The Persistent Staging Area (PSA) is the layer where raw source data is stored before it enters the Raw Vault. Unlike a transient staging area (which holds only the most recent delivery), a PSA retains every historical delivery — a complete, time-stamped archive of everything ever received from every source system. In modern Data Vault architectures, the PSA role is typically fulfilled by a data lake, organised in a folder structure partitioned by source system, table, and load date.

Unit of Work

The unit of work is a concept from the Data Vault agile methodology that defines the smallest deliverable increment of business value in a sprint. It consists of a complete data flow from source to Information Mart — staging the required source data, loading the Raw Vault entities, applying business rules in the Business Vault, and delivering the result in a mart that a business user can consume. Organising development around units of work ensures every sprint delivers something tangible to the business rather than invisible infrastructure.

Data Aging

Data aging refers to the practice of identifying and marking historical records in the Raw Vault or Business Vault that are no longer operationally relevant — records that have not been updated or referenced over a significant period. Data aging strategies help manage storage costs and query performance over time. In keeping with Data Vault’s insert-only philosophy, aged records are flagged or moved to archival storage rather than deleted, preserving the completeness of the historical record.

CDVP2.1

CDVP2.1 stands for Certified Data Vault Practitioner 2.1 — the professional certification awarded by the Data Vault Alliance upon passing the certification examination. It validates that a practitioner understands and can apply the Data Vault 2.1 methodology across architecture, modeling, and implementation.

Scalefree is an authorised Data Vault Alliance training partner. The Data Vault 2.1 Training & Certification is the official path to CDVP2.1, combining instructor-led training with exam preparation and two included exam attempts. If you are building or modernising a data platform and want to understand how Data Vault fits into a broader enterprise architecture, explore the free Data Vault Handbook or get in touch with Scalefree directly.

What’s New in Data Vault 2.1 and Why It Matters for Modern Data Warehousing

What’s New in the Data Vault 2.1 Training

In the world of data warehousing and business intelligence, the Data Vault methodology has long been a trusted foundation for scalable and agile data architectures. With the release of Data Vault 2.1, the methodology has evolved to address new challenges in modern data environments — from handling semi-structured data to aligning with concepts like Data Mesh and Data Lakehouse.

In this article, we summarize what’s new in Data Vault 2.1 compared to 2.0, what these updates mean for practitioners, and how you can take advantage of the new training materials to become officially certified.



1. A Major Expansion in Content and Learning Resources

One of the most visible improvements in Data Vault 2.1 is the expansion of the official training content. The updated course now includes extensive video material featuring Dan Linstedt himself, who explains and demonstrates key Data Vault principles in depth.

Participants can now spend several hours watching recorded theoretical sessions and hands-on demonstrations. The new format combines the benefits of self-paced learning with the engagement of instructor-led sessions. You can also download the official SQL loading patterns for all Data Vault entities from the Data Vault Alliance (DVA) training portal.

Another highlight is access to the Data Vault Alliance community — a global network of Data Vault practitioners where members exchange best practices, discuss implementations, and share insights from real-world projects.

2. Enhanced Instructor-Led Training Experience

The well-known three-day instructor-led training remains a cornerstone of the certification path, but it has been optimized to deliver even more value. Trainers now dedicate more time to practical case studies, group discussions, and collaborative modeling workshops.

Instead of spending large portions of class time on theory, participants focus on applying concepts to real-world data challenges. Trainers provide direct feedback on Data Vault models, encourage peer review, and help attendees explore different architectural scenarios.

This redesign creates a more interactive, productive learning experience — especially valuable for consultants, data architects, and engineers who want to strengthen their practical Data Vault expertise.

3. Better Preparation for Certification

Preparing for the official Certified Data Vault 2.1 Practitioner (CDVP2.1) exam is now easier and more structured. The course includes integrated live quizzes during training sessions, allowing participants to test their understanding and interact directly with the instructor.

In addition, a practice exam has been introduced to help you assess your readiness before attempting the final certification. This makes it easier to identify knowledge gaps and feel confident on exam day.

4. Dealing with JSON and Semi-Structured Data

One of the most exciting updates in Data Vault 2.1 is the new module on handling JSON data and other semi-structured sources. As modern data platforms increasingly deal with variable data structures, the methodology now provides clear guidance for integrating such data efficiently.

The course introduces a set of rules and best practices for balancing performance, flexibility, and complexity. You’ll learn when to apply a schema-on-read approach instead of schema-on-write, how to maintain stability as source structures evolve, and how to preserve governance and traceability in semi-structured environments.
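
As a rough illustration of the schema-on-read idea (the field names here are invented), the payload can be stored exactly as it was delivered and interpreted only when a query runs, so new or missing source fields do not break the load:

```python
import json

# Schema-on-write would shred this payload into typed columns at load time;
# schema-on-read keeps it intact and navigates it at query time instead.
raw_payload = '{"customer_number": "C-1001", "contact": {"email": "a@example.com"}, "tags": ["vip"]}'

def read_attribute(payload: str, *path, default=None):
    """Navigate the semi-structured payload only when the query runs."""
    node = json.loads(payload)
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default                       # tolerate fields that do not exist (yet)
        node = node[key]
    return node

read_attribute(raw_payload, "contact", "email")   # -> "a@example.com"
read_attribute(raw_payload, "loyalty", "tier")    # -> None, handled at read time
```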

Dan Linstedt often refers to this as the “JSON Dilemma” — the challenge of maximizing flexibility without sacrificing performance or clarity. Data Vault 2.1 equips you with the methodology and patterns to solve that dilemma effectively.

5. Stronger Differentiation Between Logical and Physical Modeling

Another core enhancement in Data Vault 2.1 is the clearer separation between logical and physical modeling. While Data Vault 2.0 touched on this concept, version 2.1 makes it explicit: the logical model represents the business concept, while the physical model depends on the underlying technology and performance needs.

For example, on some platforms normalization works best, while on others (such as document-oriented databases), denormalization might be more efficient. The physical implementation should adapt to these realities — but the logical model remains consistent as the blueprint for the business layer.

This separation provides greater flexibility to evolve with technology without compromising the integrity of the business model. It also helps teams align architecture decisions with specific database or cloud platform requirements.

6. Introducing Ontologies and Taxonomies

In line with the growing emphasis on semantic data integration, Data Vault 2.1 introduces the use of ontologies and taxonomies as essential tools for business modeling. These concepts allow organizations to connect business terms, hierarchies, and relationships in a way that supports consistent data integration across departments and systems.

By embedding ontologies and taxonomies into the modeling process, organizations can improve data understanding, reduce ambiguity, and strengthen the link between data structures and business meaning.

7. Extended Business Key Collision Code Concept

The Business Key Collision Code concept has been extended in Data Vault 2.1 to better support cross-system integration. This improvement helps resolve conflicts that arise when business keys overlap or differ across systems — a common challenge in enterprise data integration.

With enhanced rules and examples, the training now guides you through best practices for identifying, classifying, and merging business keys, ensuring a consistent, high-quality data foundation.
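
One way this is commonly implemented, sketched below as an assumption rather than the official pattern, is to hash the collision code together with the business key, so that identical key values from unrelated contexts end up as separate Hub entries:

```python
import hashlib

def hub_hash_key(business_key: str, collision_code: str = "DEFAULT") -> str:
    """Include the business key collision code in the hash key calculation."""
    normalized = "||".join(part.strip().upper() for part in (collision_code, business_key))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

hub_hash_key("1001", "CRM_EU")   # customer "1001" from the European CRM
hub_hash_key("1001", "CRM_US")   # the same literal value from the US CRM, kept apart
```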

8. Merging Satellites Without PIT Tables

Data Vault 2.1 introduces new approaches for handling historical data when traditional Point-In-Time (PIT) tables or snapshot techniques are not required. In cases where you need to maintain very long data histories or join multiple satellites describing the same business object, version 2.1 outlines methods for merging satellites without relying on PIT tables.

This allows for greater flexibility in data retrieval strategies and helps optimize performance in long-term historical scenarios.

9. Alignment with Modern Industry Terminology

To stay relevant with the evolving data landscape, Data Vault 2.1 integrates current industry concepts such as Data Mesh, Data Fabric, and Data Lakehouse. These paradigms are mapped to the Data Vault framework, demonstrating how the methodology fits within modern data architectures.

This update ensures that Data Vault practitioners can easily connect the methodology to the broader trends and technologies shaping the data industry today.

10. Unlock the Full Potential with Instructor-Led Training

If you're ready to deepen your knowledge and apply these updates in practice, the instructor-led Data Vault 2.1 training offered by Scalefree is the next step. This hands-on training combines theoretical knowledge, real-world exercises, and guided discussions to help you implement Data Vault successfully in your organization.

Visit the training page to find more information, view upcoming training dates, and begin your journey toward CDVP2.1 certification.

Final Thoughts

Data Vault 2.1 represents a significant step forward for data professionals seeking a future-proof methodology. With improved training content, better integration of semi-structured data, a sharper focus on modeling concepts, and alignment with modern architectural trends, Data Vault continues to be a robust choice for building scalable, flexible, and business-aligned data warehouses.

Whether you are transitioning from Data Vault 2.0 or starting fresh, the new version provides the tools, knowledge, and community support to take your data architecture to the next level.

Watch the Video

Behind the Branches – Navigating Git Workflows in Modern DevOps

Branching Strategies

Branching strategies are one of those topics that rarely get much attention until they suddenly become a problem. Whether it’s drowning in merge conflicts, the headache of implementing and synchronizing hotfixes across multiple branches, or a feature freeze caused by insufficient quality assurance, your repository and branching structure can have a major impact on day-to-day development.

But what branching strategies actually exist, and what are their pros and cons? Which approach allows you to deploy changes most quickly? And how can you maintain high software quality despite frequent releases?

In this article, we’ll provide a structured overview of common branching strategies and typical challenges developers face when using them.

Navigating Git Workflows in Modern DevOps

This webinar offers a clear overview of common approaches and how they impact CI/CD, code quality, and maintainability. Beyond theory, we dive into practical challenges and real-world issues teams face every day. The webinar took place on September 16th, 2025, and the recording is available below.

Watch Webinar Recording

Why Are Branching Strategies Relevant?

Branching models reflect the organization, release culture, and technical maturity of a project. There is no single “correct” strategy that fits every project. Choosing the right one depends heavily on the project’s context. Some of the most important questions to consider when selecting a branching strategy include:

  • Does the team work in fixed sprint or release cycles, or is code deployed continuously?
  • How many developers are working simultaneously on the same codebase?
  • What is the quality of your CI/CD pipeline? Does every change need a manual review, even if the pipeline passes, or can it be deployed automatically?

Depending on the answers to these questions, a simple or more complex branching strategy may be appropriate.

Comparison of Common Strategies

Git Flow

The Git Flow strategy was originally developed for traditional software projects with planned release cycles. Its two long-lived branches are “main” (or “master”) and “develop”.

In addition, it introduces several short-lived branches:

Feature branches

New features are developed in separate feature branches, which are merged into the develop branch once completed.

Hotfix branches

If a critical bug occurs in the production environment (i.e., on the main branch), a hotfix branch is created from main to address the issue. Once the fix is implemented and pushed to the hotfix branch, it is merged into both main and develop to ensure the bug is resolved in both branches.

Release branches

When a release is approaching, a release branch is created from develop, containing all features added since the last release. This branch is then used for final QA testing, bug fixing, and versioning. Once the release is approved, the release branch is merged into both main and develop.


The main advantage of Git Flow is its clear structure. Even in larger teams with many developers and therefore multiple concurrent feature branches, it’s easy to track which version is in what state. The strategy supports parallel development very well due to its structured branching model.

However, the downside is the organizational and technical overhead. The large number of branches and merges can lead to conflicts and divergence over time, especially with long-lived release and hotfix branches. A particular challenge is keeping branches in sync: hotfix branches created from main have to be merged back into both main and develop, and release branches, which originate from develop, must eventually be merged into both main and develop as well, as shown in the diagram. These synchronization steps add effort and increase the risk of conflicts or inconsistencies, especially when multiple streams of work are active in parallel.
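
To illustrate those synchronization steps, here is a small Python sketch that scripts the hotfix path with plain git commands. The branch names are examples, it assumes a repository that already has main and develop branches, and in practice these merges would usually go through pull requests and CI rather than a local script.

```python
import subprocess

def git(*args):
    """Run a git command and fail loudly if it does not succeed (illustration only)."""
    subprocess.run(["git", *args], check=True)

def ship_hotfix(fix_branch="hotfix/login-crash"):
    git("checkout", "-b", fix_branch, "main")   # branch the fix off the production branch
    # ... implement and commit the fix on fix_branch ...
    git("checkout", "main")
    git("merge", "--no-ff", fix_branch)         # release the fix to production
    git("checkout", "develop")
    git("merge", "--no-ff", fix_branch)         # keep develop in sync with the fix
    git("branch", "-d", fix_branch)             # hotfix branches stay short-lived
```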

Additionally, the path a feature must take, from a feature branch to develop, to a release branch, and finally to main, can slow down the deployment process.

While a solid CI/CD pipeline can help automate and streamline parts of this workflow, Git Flow does not rely on automation to function. This makes it especially suitable for teams with more manual QA processes or limited automation infrastructure.

Diagram: Git Flow branching model

GitHub Flow

Compared to Git Flow, the GitHub Flow strategy is significantly leaner. It uses only a single long-lived branch, usually main, and temporary feature branches that are merged via pull requests.

Once all changes on a feature branch are complete and have passed review and various tests, the branch is merged directly into main.

The key advantage of GitHub Flow is its simplicity. There are no separate release or develop branches, and even hotfixes can be handled in short-lived branches. Teams can respond to changes quickly and deploy frequently. This agility is especially effective when supported by a robust CI/CD pipeline: with testing, building, and deployment automated, GitHub Flow's time to market shrinks even further.

Because of its low complexity and minimal coordination overhead, GitHub Flow is also particularly well-suited for smaller teams that value speed and iteration over rigid release planning.

If you’re interested in how such pipelines are structured in practice, our CI/CD pipeline Blog article offers a look at a practical GitHub-based setup using GitHub Actions and dbt. It’s a useful companion piece for understanding the automation layer that supports fast and reliable delivery.

However, this strategy also comes with limitations: it doesn’t support managing multiple parallel versions or complex release planning.

Additionally, it relies heavily on the quality of the CI/CD pipeline.

Diagram: trunk-based branching model

Trunk-Based Development

Trunk-Based Development is quite similar to the GitHub Flow strategy, but there are a few key differences.

While it also relies on a single long-lived branch (the trunk, typically main), commits are either made directly to main or via very short-lived feature branches. These feature branches often exist for only a few hours, and it’s common for changes to be merged into main multiple times a day. The goal is to integrate changes as early as possible to avoid conflicts before they even arise.

Because there are no fixed release cycles in Trunk-Based Development, it’s essential to ensure that incomplete features don’t go live prematurely. Feature flags play a central role here, allowing unfinished functionality to be hidden in the production environment until it’s ready.
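
A minimal sketch of such a flag, assuming a simple environment-variable toggle (real setups often use a configuration file or a dedicated feature flag service instead):

```python
import os

def feature_enabled(name: str) -> bool:
    """Read a feature toggle from the environment; off by default."""
    return os.getenv(f"FEATURE_{name.upper()}", "off").lower() in {"on", "true", "1"}

def legacy_payment_flow(cart: list) -> str:
    return f"charged {len(cart)} items via the existing flow"

def new_payment_flow(cart: list) -> str:
    return f"charged {len(cart)} items via the new flow"   # still under construction

def checkout(cart: list) -> str:
    # Unfinished functionality stays dark in production until the flag is flipped.
    if feature_enabled("new_payment_flow"):
        return new_payment_flow(cart)
    return legacy_payment_flow(cart)

checkout(["book", "mug"])   # uses the legacy flow unless FEATURE_NEW_PAYMENT_FLOW=on
```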

As with GitHub Flow, a strong CI/CD pipeline is essential. It acts as the main safeguard for quality assurance and enables rapid deployment to the main branch.

Trunk-Based Development is especially effective for teams that are comfortable with rapid iteration and a high level of automation. While it can be used by smaller teams, it truly shines in larger organizations where multiple teams work in parallel and frequent integration is critical to maintaining momentum and consistency.

The benefits of Trunk-Based Development include extremely fast deployments and minimal risk of merge conflicts due to the short-lived nature of branches and continuous integration.

However, similar to GitHub Flow, this strategy heavily depends on the reliability of the CI/CD pipeline. If your team operates in a highly automated DevOps environment, this approach works smoothly. But if that’s not the case, software quality can suffer significantly. The risk is especially high here, as all changes are deployed directly to the main branch.

Conclusion

All three strategies come with their own strengths and weaknesses.

Git Flow is well-suited for larger projects with fixed release cycles, manual QA, and structured approval processes. It offers stability and clear workflows, but also brings significant technical and organizational overhead, making it a heavyweight option that can slow down development and release cycles due to its complexity and synchronization requirements.

GitHub Flow, by contrast, emphasizes speed and simplicity. It’s an excellent fit for smaller teams working on web or SaaS projects that deploy continuously, thanks to its low complexity and quick turnaround. But it relies on a good CI/CD pipeline. If tests are insufficient, faulty code might get deployed automatically.

Many of these risks can be mitigated with proper pipeline design and DevOps experience within the team, ensuring that automation is not just fast but also reliable.

Trunk-Based Development enables the highest release frequency, but only delivers consistent quality if the necessary technical maturity is in place. This makes it ideal for highly automated environments where teams ship many changes every day.

There are always ways to mitigate or minimize the downsides of any branching strategy. Techniques like blue/green or canary deployments, for example, can help reduce the impact of faulty changes and make rollbacks easier.

Stay tuned: we regularly share practical insights and solutions on topics like CI/CD, DevOps patterns, and deployment strategies.

Behind the Branches – Navigating Git Workflows in Modern DevOps

A Man Branching with Visual Studio

Branching Strategies

Branching strategies are one of those topics that rarely get much attention until they suddenly become a problem. Whether it’s drowning in merge conflicts, the headache of implementing and synchronizing hotfixes across multiple branches, or a feature freeze caused by insufficient quality assurance, your repository and branching structure can have a major impact on day-to-day development.

But what branching strategies actually exist, and what are their pros and cons? Which approach allows you to deploy changes most quickly? And how can you maintain high software quality despite frequent releases?

In this article, we’ll provide a structured overview of common branching strategies and typical challenges developers face when using them.

Navigating Git Workflows in Modern DevOps

This webinar offers a clear overview of common approaches and how they impact CI/CD, code quality, and maintainability. Beyond theory, we’ll dive into practical challenges and real-world issues teams face every day. Register now for our free webinar on September 16th, 2025!

Watch Webinar Recording

Why Are Branching Strategies Relevant?

Branching models reflect the organization, release culture, and technical maturity of a project. There is no single “correct” strategy that fits every project. Choosing the right one depends heavily on the project’s context. Some of the most important questions to consider when selecting a branching strategy include:

  • Does the team work in fixed sprint or release cycles, or is code deployed continuously?
  • How many developers are working simultaneously on the same codebase?
  • What is the quality of your CI/CD pipeline? Does every change need a manual review, even if the pipeline passes, or can it be deployed automatically?

Depending on the answers to these questions, a simple or more complex branching strategy may be appropriate.

Comparison of Common Strategies

Git Flow

The Git Flow strategy was originally developed for traditional software projects with planned release cycles. Its long-lived main branches are “main” (or “master”) and “dev”.

In addition, it introduces several short-lived branches:

Feature branches

New features are developed in separate feature branches, which are merged into the develop branch once completed.

Hotfix branches

If a critical bug occurs in the production environment (i.e., on the main branch), a hotfix branch is created from main to address the issue. Once the fix is implemented and pushed to the hotfix branch, it is merged into both main and develop to ensure the bug is resolved in both branches.

Release branches

When a release is approaching, a release branch is created from develop, containing all features added since the last release. This branch is then used for final QA testing, bug fixing, and versioning. Once the release is approved, the release branch is merged into both main and develop.


The main advantage of Git Flow is its clear structure. Even in larger teams with many developers and therefore multiple concurrent feature branches, it’s easy to track which version is in what state. The strategy supports parallel development very well due to its structured branching model.

However, the downside is the organizational and technical overhead. The large number of branches and merges can lead to conflicts and divergence over time, especially with long-lived release and hotfix branches. A particular challenge arises when keeping branches in sync. Hotfixes created from main need to be merged back into main and dev, and changes made in release branches, which originate from dev, must eventually be merged into both main and dev, as shown in the diagram. These synchronization steps often introduce additional effort and increase the risk of conflicts or inconsistencies, especially when multiple streams of work are active in parallel.

Additionally, the path a feature must take, from a feature branch to develop, to a release branch, and finally to main, can slow down the deployment process.

While a solid CI/CD pipeline can help automate and streamline parts of this workflow, Git Flow does not rely on automation to function. This makes it especially suitable for teams with more manual QA processes or limited automation infrastructure.

[Figure: Git Flow branching diagram]

GitHub Flow

Compared to Git Flow, the GitHub Flow strategy is significantly leaner. It uses only a single long-lived branch, usually main, and temporary feature branches that are merged via pull requests.

Once all changes on a feature branch are complete and have passed review and various tests, the branch is merged directly into main.
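As a rough illustration, a single GitHub Flow cycle can look like this on the command line; the branch name and the use of the GitHub CLI are only an example:

```bash
# Start a short-lived feature branch from main
git checkout -b feature/login-form main
git push -u origin feature/login-form

# Open a pull request against main using the GitHub CLI
gh pr create --base main --title "Add login form" --body "Implements the new login form"

# After review and a green CI run, merge and clean up the branch
gh pr merge --squash --delete-branch
```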

The key advantage of GitHub Flow is its simplicity. There are no separate release or develop branches, and even hotfixes can be handled in short-lived branches. Teams can respond to changes quickly and deploy frequently. This agility is especially effective when supported by a robust CI/CD pipeline. If properly implemented, testing, building, and deployment processes are automated, further improving GitHub Flow’s fast time to market.

Because of its low complexity and minimal coordination overhead, GitHub Flow is also particularly well-suited for smaller teams that value speed and iteration over rigid release planning.

If you’re interested in how such pipelines are structured in practice, our CI/CD pipeline blog article offers a look at a practical GitHub-based setup using GitHub Actions and dbt. It’s a useful companion piece for understanding the automation layer that supports fast and reliable delivery.

However, this strategy also comes with limitations: it doesn’t support managing multiple parallel versions or complex release planning, and it relies heavily on the quality of the CI/CD pipeline.

[Figure: Trunk-based branching diagram]

Trunk-Based Development

Trunk-Based Development is quite similar to the GitHub Flow strategy, but there are a few key differences.

While it also relies on a single long-lived branch (the trunk, typically main), commits are either made directly to main or via very short-lived feature branches. These feature branches often exist for only a few hours, and it’s common for changes to be merged into main multiple times a day. The goal is to integrate changes as early as possible to avoid conflicts before they even arise.

Because there are no fixed release cycles in Trunk-Based Development, it’s essential to ensure that incomplete features don’t go live prematurely. Feature flags play a central role here, allowing unfinished functionality to be hidden in the production environment until it’s ready.
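A feature flag can be as simple as a configuration switch that the application checks at runtime. The following Python sketch uses an environment variable for this; real setups often rely on a dedicated flag service, and the flag and route names here are made up:

```python
import os

def is_enabled(flag_name: str) -> bool:
    """Illustrative helper: feature flags read from environment variables."""
    return os.getenv(f"FEATURE_{flag_name.upper()}", "false").lower() == "true"

# Unfinished functionality stays hidden in production until the flag is switched on
if is_enabled("new_checkout"):
    route = "/checkout/v2"  # new flow, still under development
else:
    route = "/checkout/v1"  # stable flow that users see today
```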

As with GitHub Flow, a strong CI/CD pipeline is essential. It acts as the main safeguard for quality assurance and enables rapid deployment to the main branch.

Trunk-Based Development is especially effective for teams that are comfortable with rapid iteration and a high level of automation. While it can be used by smaller teams, it truly shines in larger organizations where multiple teams work in parallel and frequent integration is critical to maintaining momentum and consistency.

The benefits of Trunk-Based Development include extremely fast deployments and minimal risk of merge conflicts due to the short-lived nature of branches and continuous integration.

However, similar to GitHub Flow, this strategy heavily depends on the reliability of the CI/CD pipeline. If your team operates in a highly automated DevOps environment, this approach works smoothly. But if that’s not the case, software quality can suffer significantly. The risk is especially high here, as all changes are deployed directly to the main branch.

Conclusion

All three strategies come with their own strengths and weaknesses.

Git Flow is well-suited for larger projects with fixed release cycles, manual QA, and structured approval processes. It offers stability and clear workflows, but also brings significant technical and organizational overhead, making it a heavyweight option that can slow down development and release cycles due to its complexity and synchronization requirements.

GitHub Flow, by contrast, emphasizes speed and simplicity. It’s an excellent fit for smaller teams working on web or SaaS projects that deploy continuously, thanks to its low complexity and quick turnaround. But it relies on a good CI/CD pipeline. If tests are insufficient, faulty code might get deployed automatically.

Many of these risks can be mitigated with proper pipeline design and DevOps experience within the team, ensuring that automation is not just fast but also reliable.

Trunk-Based Development enables the highest release frequency, but only delivers consistent quality if the necessary technical maturity is in place. This makes it ideal for highly automated environments where teams ship many changes every day.

There are always ways to mitigate or minimize the downsides of any branching strategy. Techniques like blue/green or canary deployments, for example, can help reduce the impact of faulty changes and make rollbacks easier.

Stay tuned: we regularly share practical insights and solutions on topics like CI/CD, DevOps patterns, and deployment strategies.

Green Bond Reporting in Record Time at Grenke AG

Sustainability and transparency have long been more than just buzzwords – nowadays, they are part of how modern companies see themselves. Green bonds are becoming increasingly important as they enable targeted investments in sustainable projects. Professional and audit-proof reporting is crucial to create trust among investors, auditors, and other stakeholders.

Our client Grenke had already relied on our expertise and implemented a data warehouse based on Data Vault 2.0. The processes were largely automated so that data sources could be integrated and processed efficiently. When a new requirement for green bond reporting arose, we were able to implement it in just one month thanks to the existing, scalable setup.

Initial Situation: An Existing, Automated Data Warehouse

  • Data Vault 2.0 as a foundation:
    Grenke was already using a robust Data Vault 2.0 architecture that enables flexible and expandable data storage thanks to its clear structures (hubs, links and satellites).
  • Automated model generation:
    By using templates and metadata-driven approaches, data vault models can be generated automatically. This reduces manual effort, increases standardization and improves data quality.
  • Quality checks and audit compliance:
    Plausibility checks, historization and metadata-supported processes already ensured high data quality and traceability – essential for audits and reporting.

These prerequisites formed the perfect springboard for quickly and reliably integrating the new Green Bond Reporting into the existing system.

New Requirement: Green Bond Reporting

With this new requirement, Grenke faced the challenge of collecting, preparing, and presenting specific ESG key figures and green bond-specific data in a comprehensible report.

The aim was to design the reporting in such a way that:

  • External reviewers and auditors can gain insight quickly and easily.
  • Investors and other stakeholders receive transparent information about the sustainable projects.
  • Regulatory requirements and internal standards are met at all times and documented in a comprehensible manner.

Thanks to the existing Data Vault 2.0 infrastructure and the high level of automation, it was possible to implement these new requirements in a short space of time.

Our Approach: Expansion Instead Of New Construction

  1. Requirements analysis
    Together with Grenke, we defined the relevant green bond key figures and reporting requirements. These included classifications according to ESG criteria, assignment of project types, as well as regional and financial attributes.
  2. Integration into the existing data warehouse
    Instead of building a new system, we added the required fields to the existing hubs, links and satellites. Thanks to the agile Data Vault 2.0 methodology, this was possible without much additional effort.
  3. Automated processes and quality checks
    Thanks to the existing ETL/ELT pipelines, we were able to load the data into the system quickly and securely. New validation rules for green bond reporting were added to ensure that all relevant data was recorded completely and correctly.
  4. Reporting & dashboards
    Based on the processed data, we developed interactive dashboards and reports that clearly present the project status, the scope of financing, and other ESG key figures. External auditors can also be given access via export functions if required.
  5. Rapid approval through external audits
    As the Data Vault 2.0 structure ensures complete historization and traceability of the data, the external audits ran smoothly. The auditors were able to fully trace all steps and data changes – a decisive advantage for sustainability reports.

Result: Green Bond Reporting In Just One Month

The combination of a scalable Data Vault 2.0 approach, a high level of automation, and an already established data infrastructure enabled us to successfully deliver the Green Bond Reporting in just one month.

This means:

  • Fast time-to-market: Grenke was able to publish the report quickly and go straight into communication and marketing.
  • Trustworthy database: Thanks to integrated quality checks and traceability, the reporting is audit-proof – a crucial prerequisite for external audits.
  • Future-proof solution: New key figures, extended ESG criteria, or regulatory requirements can be flexibly integrated without having to fundamentally rebuild the system.

What Grenke Says

“Partnering with Scalefree has been instrumental in our Data Vault 2.0 journey. Their deep expertise in Data Vault principles and practical dbt know-how have significantly supported our implementation, ensuring a smooth and structured process. Thanks to their guidance, we’ve already improved our ability to integrate and analyze business data while building a scalable and future-proof data warehouse.”

Oliwia Borecka
Chief Data & Analytics Officer at grenke digital GmbH

Conclusion: Agile And Sustainable Into The Future

The project shows how Scalefree supports customers in quickly and efficiently integrating new requirements into existing data ecosystems. The Data Vault 2.0 approach provides the ideal basis for this: scalability, flexibility, and auditability ensure that companies can meet their reporting requirements not only today, but also tomorrow.

Would you like to find out more about how you can future-proof your data warehouse or ESG reporting?
Contact us at Scalefree – together we will develop a customized solution that meets your requirements and puts you in the best possible position in terms of sustainability and transparency. We look forward to making your project a success!

Data Vault & Data Mesh in a Data Fabric: A Modern Architecture Guide

Organizations often struggle to manage their data efficiently. Data is usually spread across many separate systems, constantly growing in size and complexity, and required for an increasing number of uses. Even seasoned experts struggle with these challenges. To address this, approaches like Data Fabric, Data Vault, and Data Mesh have become important for building robust and flexible data platforms and ensuring efficient processes.

However, these new approaches also add further complexity for data platform management. This article explores how to combine these three concepts to create a strong and efficient data architecture that data architects can use as a foundational guide.

Data Vault & Data Mesh in a Data Fabric: A Modern Architecture Guide

This webinar will provide a brief overview of Data Fabric, Data Vault, and Data Mesh, and then delve into the advantages that can be realized by combining these approaches. Register for our free webinar on May 13th, 2025!

The Data Fabric: Unifying Distributed Data Ecosystems

To address the challenges of managing data scattered across diverse and distributed environments, the Data Fabric has emerged as an architectural approach. It leverages metadata-driven automation and intelligent capabilities to create a unified and consistent data management layer. This framework facilitates seamless data access and delivery, ultimately enhancing organizational agility.

Key characteristics of a Data Fabric include:

  • Unified Data Access: Providing integrated data access for diverse user needs.
  • Centralized Metadata: Utilizing an AI-augmented data catalog for data discovery and comprehension.
  • Metadata-Driven Automation: Intelligent automation powered by comprehensive metadata management, promoting efficiency and scalability through automated processes.
  • Strengthened Governance and Security: Standardizing procedures to improve governance and security.

A modern Data Fabric platform integrates a spectrum of systems and processes to streamline data management. The integration begins with the incorporation of data from diverse source systems, such as ERP, CRM, HR, and MDM. Subsequently, a Data Lakehouse is integrated, featuring a staging area for data preparation.

[Figure: Data Fabric architecture with an Enterprise Data Warehouse]

The architecture further encompasses an Enterprise Data Warehouse for core data storage, followed by the implementation of information marts, AI marts, and user marts for tailored information delivery. Finally, the platform supports various data consumption methods, including applications, dashboards, and OLAP cubes.

The Data Lakehouse also reflects the three medallion layers: raw data (Bronze), integrated data (Silver), and information delivery (Gold), whose data products are ready for consumption.

Critical to this architecture is robust metadata management and an AI-augmented data catalog, which together drive automation and facilitate data discovery.

Data Vault: Establishing a Single Source of Facts

Data Vault is a data modeling methodology designed for the construction and maintenance of enterprise data warehouses. Renowned for its flexibility, scalability, and emphasis on historical data, Data Vault aligns seamlessly with a Data Fabric’s goal of a unified and consistent data management layer and its focus on automation.

Key benefits of a Data Vault include:

  • Scalability: Adapting to growing data volumes and complexity.
  • Flexibility: Accommodating evolving business requirements.
  • Consistency: Ensuring data integrity across the enterprise.
  • Pattern-based modeling: A perfect foundation for data automation.
  • Auditability: Providing a clear and traceable data history.
  • Agility: Enabling faster responses to changing business needs.

Within a modern Data Fabric platform, a Data Vault model is implemented within the Enterprise Data Warehouse component. The Raw Data Vault integrates all source systems into business objects and their relationships. The sparsely built Business Vault on top of the Raw Data Vault adds advanced Data Vault entities, such as query assistants, to simplify the creation of the information delivery layer and improve its performance.

[Figure: Data Fabric architecture with Data Vault]

This approach delivers all advantages listed above and enables a high level of automation due to its pattern based modeling method.

Data Mesh: Decentralizing Data Ownership and Access

Data Mesh is a decentralized approach to data management that prioritizes domain ownership, data as a product, self-service data platforms, and federated governance. This approach shifts data management responsibilities to domain-specific teams, fostering greater accountability and agility.

Key principles include:

  • Domain Ownership: Decentralized management of analytical and operational data.
  • Data as a Product: Treating analytical data as a valuable and managed asset.
  • Self-Service Data Platform: Providing tools for independent data sharing and management.
  • Federated Governance: Enabling collaborative governance across domains.
  • Decentralized data domains: Each domain managing its own data products.

Implementing a Data Mesh on a Data Fabric platform requires several essential components like standardized DevOps processes and modeling guides, as well as a comprehensive data catalog.

Although fully distributing the data pipeline via a Data Mesh has a certain appeal, our experience indicates that a more effective strategy is to selectively integrate key Data Mesh principles within a Data Fabric architecture, combining decentralized ownership with the advantages of an automated, centralized core built on the Data Vault approach.

Best Practices for Data Mesh Implementation

  • Centralized Staging and Raw Vault: This promotes high-level automation.
  • Decentralized Business Vault and Beyond: This facilitates business knowledge integration and efficient use of cross-functional teams.

For optimal implementation, a centralized staging and Raw Vault approach promotes high-level automation and ensures that all data products refer to a single source of facts. In contrast, a decentralized Business Vault and beyond strategy allows for necessary business knowledge integration, clear data product ownership, and efficient scaling. This level of decentralization is crucial for a successful Data Mesh implementation leveraging cross-functional domain teams.

Recommended Architectural Synthesis

The recommended architecture integrates Data Fabric with Data Mesh and Data Vault, capitalizing on the strengths of each approach. This synthesis yields a metadata-driven, flexible, automated, transparent, efficient, and governed data environment.

Use Cases and Applications

This modern data architecture supports a broad spectrum of use cases, including:

  • Efficient & Trusted Reporting and Analytics
  • Regulation Compliance through an auditable core
  • Various AI Applications

Conclusion

The integration of Data Fabric, Data Vault, and Data Mesh enables organizations to construct a modern data architecture characterized by flexibility, scalability, and efficiency. This holistic approach enhances data management, improves data access, and accelerates the delivery of data products, ultimately driving superior business outcomes with a high level of automation, governance and transparency.

 

– Marc Winkelmann & Christof Wenzeritt (Scalefree)

CI/CD: Practical Insights into Automating Data Vault 2.0 with dbt

CI/CD pipelines are becoming increasingly important for ensuring that software updates can be released cost-effectively while maintaining high quality. But how exactly do CI/CD pipelines work, and how can a project benefit from using one?

This newsletter aims to answer these questions through a practical example of a CI/CD pipeline. The example focuses on a CI/CD pipeline for a GitHub repository that includes a package for implementing Data Vault 2.0 in dbt across various databases. Therefore, this newsletter will also cover the basics of dbt and GitHub Actions.

From Continuous Integration To Data Vaults: A Comprehensive Workflow

This webinar will cover what CI/CD pipelines are and the advantages they offer. We will present parts of the CI/CD pipeline for the public datavault4dbt package to demonstrate how a CI/CD pipeline can be used. The webinar will introduce the key features of GitHub Actions and explain them through examples. This will show how each feature can be utilized in practice and highlight the various possibilities GitHub Actions offers. The webinar aims to explain the benefits of CI/CD pipelines and illustrate what such a pipeline can look like through a practical example.

Watch Webinar Recording

What is CI/CD?

CI stands for Continuous Integration, and CD stands for Continuous Delivery or Continuous Deployment. But what exactly do these terms mean?

Continuous Integration refers to the regular merging of code changes, where automated tests are conducted to detect potential errors early and ensure that the software remains in a functional state.

Continuous Delivery involves making validated code available in a repository; the CI tests should already have been run in the pipeline at this point. It also includes the further automation needed to enable rapid deployment, such as creating a production-ready build. The difference between Continuous Delivery and Continuous Deployment is that with Continuous Deployment, the successfully tested software is released directly to production, while Continuous Delivery prepares everything for release without automatically deploying it.

Continuous Deployment allows changes to be implemented quickly through many small releases rather than one large release. However, the tests must be well-configured, as there is no manual gate for transitioning to production.

[Figure: The CI/CD cycle]

CI/CD pipelines provide immense time savings through automation. The costs of resources needed for manual testing are also lower with CI/CD pipelines, as they can be configured to spin up resources only for testing and then shut them down afterward. Since permanent resources aren’t required, you only pay for the resources needed during the test runtime.

Introduction to dbt

The abbreviation dbt stands for “data build tool.” dbt is a tool that enables data transformation directly within a data warehouse. It uses SQL-based transformations that can be defined, tested, and documented directly in the dbt environment.

This makes dbt an excellent choice for implementing Data Vault 2.0, as dbt can be used to create and manage the hubs, links, and satellites required by Data Vault.

To facilitate this process, we at Scalefree have developed the datavault4dbt package. Datavault4dbt offers many useful features, such as predefined macros for hubs, links, satellites, the staging area, and much more.
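As a rough illustration, a hub built with datavault4dbt is just a small dbt model that calls the package’s hub macro. The model, stage, and column names below are made up, and the exact parameter names should be checked against the package documentation:

```sql
-- models/raw_vault/hub_customer.sql (illustrative)
{{ config(materialized='incremental') }}

{{ datavault4dbt.hub(
    hashkey='hk_customer_h',
    business_keys=['customer_id'],
    source_models='stage_customer'
) }}
```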

For a deeper understanding of dbt or datavault4dbt, feel free to read one of our articles on the topic.

The Capabilities of GitHub Actions

GitHub Actions is a feature of GitHub that allows you to create and execute workflows directly within GitHub repositories. You can define various triggers for workflows, such as pull requests, commits, schedules, manual triggers, and more.

This makes GitHub Actions ideal for building CI/CD pipelines for both private and public repositories. Workflows are divided into multiple jobs, each consisting of several steps, and each job runs on its own virtual machine.
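A minimal workflow file illustrates this structure of triggers, jobs, and steps. The example below is a generic sketch, not the actual datavault4dbt pipeline:

```yaml
# .github/workflows/ci.yml (illustrative)
name: CI

on:
  pull_request:        # run for every pull request
  workflow_dispatch:   # allow manual runs as well

jobs:
  test:
    runs-on: ubuntu-latest        # each job gets its own fresh virtual machine
    steps:
      - uses: actions/checkout@v4          # reuse a public action instead of writing your own
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt   # assumes the project lists its dependencies here
      - run: pytest                            # custom step defined inline
```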

Within these steps, you can define custom tasks or utilize external or internal workflows. This offers the significant advantage of not having to develop everything from scratch in a workflow; instead, you can leverage public workflows created by others.

The seamless integration of Docker also provides numerous possibilities, such as quickly setting up different test environments, which greatly simplifies the creation of a CI/CD pipeline.

GitHub Actions is the key tool in the following example of a CI/CD pipeline.

Practical Example: CI/CD Pipeline for datavault4dbt

For the public repository of the datavault4dbt package, we have built a CI/CD pipeline to ensure that all features continue to function across all supported databases with every pull request (PR). When a PR is submitted by an external user, someone from our developer team must approve the start of the pipeline. In contrast, a PR from an internal user can be automated by adding a specific label to initiate the pipeline.

Once the pipeline is triggered, GitHub Actions automatically starts a separate virtual machine (VM) for each database. Currently, the datavault4dbt package supports AWS Redshift, Microsoft Azure Synapse, Snowflake, Google BigQuery, PostgreSQL, and Exasol, so a total of six VMs are launched. Since GitHub Actions provides managed, on-demand runners, these VMs do not need to be set up or maintained manually.
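One common way to fan a job out across several databases is a matrix strategy, which starts one runner per entry. The snippet below is a simplified sketch of that idea, not the package’s real workflow, and run_tests.sh is a hypothetical script:

```yaml
jobs:
  integration-test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false    # keep testing the other databases even if one fails
      matrix:
        database: [redshift, synapse, snowflake, bigquery, postgres, exasol]
    steps:
      - uses: actions/checkout@v4
      - name: Run tests against ${{ matrix.database }}
        run: ./run_tests.sh "${{ matrix.database }}"
```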

The VMs then connect to the required cloud systems. For instance, the VM for Google BigQuery connects to Google Cloud, while the VM for AWS Redshift connects to AWS. Subsequently, the necessary resources for each database are generated, which can be done via API calls or using tools like Terraform.

After the resources are created, additional files required for testing are generated and loaded onto the VM. In our example pipeline, these include files such as profiles.yml, which contains the information dbt needs to connect to the databases.
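For context, a profiles.yml entry for one of the test databases could look roughly like this; the profile name, schema, and connection details are placeholders, and the secrets are injected via environment variables rather than stored in the file:

```yaml
# profiles.yml (illustrative, PostgreSQL target)
datavault4dbt_ci:
  target: ci
  outputs:
    ci:
      type: postgres
      host: "{{ env_var('DBT_HOST') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: analytics
      schema: dv_ci
      threads: 4
```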

Next, a Dockerfile is used on each VM to build an image that automatically installs all dependencies for the respective database. At this stage, Git is also installed on each image so that tests stored in a separate Git repository can be loaded onto the image.

Loading the tests from a repository allows for centralized management of the tests, ensuring any changes are executed for each database during the next pipeline run. Once the images are built, containers are created using these images, where tests are conducted with various parameters. After all tests are completed, the containers are shut down, and by default, the resources on the respective cloud providers are deleted.

[Figure: dbt test definitions in a YAML file]

The test results are fully visible in GitHub Actions, with successful and failed tests clearly marked.

[Figure: Manual workflow run form in GitHub Actions]

If the pipeline is started manually, there is an additional option to specify whether only certain selected databases should be tested and whether the resources on the cloud systems should not be deleted after the tests. This allows developers to examine the data on the databases more closely in case of an error.
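Manual runs with options like these are typically modeled as workflow_dispatch inputs. The following is only a sketch of how such inputs could be declared, not the repository’s actual configuration:

```yaml
on:
  workflow_dispatch:
    inputs:
      databases:
        description: "Comma-separated list of databases to test (empty = all)"
        required: false
        default: ""
      keep_resources:
        description: "Keep cloud resources after the tests for debugging"
        type: boolean
        default: false
```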

This pipeline offers numerous advantages for the development of the datavault4dbt package. It allows testing for errors on any of the supported databases with each change, without spending much time creating test resources. At the same time, it saves costs because all resources run only as long as necessary and are immediately shut down after the tests.

Managing the pipeline is also simplified through GitHub, as all variables and secrets can be stored directly in GitHub, providing a centralized location for everything. Once the pipeline is set up, it can be easily extended to include additional databases that may be supported in the future.

Ultimately, this is just one example of what a CI/CD pipeline can look like. Such pipelines are as diverse as the software for which they are designed. If we have piqued your interest and you have further questions about a possible pipeline for your company, please feel free to contact us.

Conclusion 

This newsletter explored the benefits and inner workings of CI/CD pipelines in agile software development, illustrated through a practical example: a GitHub repository containing a dbt package for implementing Data Vault 2.0. Tools like GitHub Actions make this kind of automation possible and keep the deployment process efficient.
