Skip to main content
search
0
All Posts By

Hernan Revale

Hernan Revale is a Senior BI Consultant and Head of Research at Scalefree. With an MSc in Business Analytics from Imperial College London, he specializes in Data Vault 2.0, Data Mesh, and cloud migrations for major European industrial and transportation leaders. A CDVP2 and Snowflake expert, Hernan bridges the gap between academic research and enterprise data strategy.

Orchestration of Agentic Workflows

The Shift from Prompts to Autonomous Systems

For years, organizations have focused on mastering “prompt engineering”, the art of writing precise instructions to extract useful outputs from Large Language Models (LLMs). While highly effective for simple, singular tasks, the prompt-based approach has inherent limitations when faced with complex, multi-step business problems.

The next paradigm shift in enterprise AI is the move toward Agentic Workflows.

An “Agent” is more than just an LLM. It is an autonomous or semi-autonomous system that combines reasoning capability (the LLM) with access to tools, memory, and the ability to act on its environment. Instead of answering a question, an agent performs a role, acting as an analyst, a software engineer, or a project manager, handling sequential professional tasks until a goal is achieved.

Orchestration of Agentic Workflows

Master the art of building multi-step autonomous systems by integrating the LangChain ecosystem with powerful tools like Zapier. This session provides a practical roadmap for evolving from simple prompts to sophisticated, coordinated architectures that execute complex professional tasks with ease. Learn more in our upcoming webinar on April 21st, 2026!

Watch Webinar Recording

Why Agents Require Orchestration

The premise of agentic workflows is powerful, but deployment is difficult. In a complex scenario, you may need a system to:

  1. Analyze a business request.
  2. Search a database.
  3. Process results.
  4. Consult a second specialized agent (e.g., a “Coder Agent”).
  5. Revise the plan based on output and finally provide a summary.

Without proper coordination, this series of steps breaks down. The model might hallucinate a tool execution, forget crucial data from step one by step four, or enter an endless loop of unhelpful actions.

Orchestration is the framework that manages this complexity. It is the conductor of the agentic orchestra, defining how different agents, tools, and memory systems interact, ensuring reliability, traceability, and successful execution of the business objective.

Anatomy of an Agentic Stack

To build a reliable orchestrator for autonomous systems, your architecture must unite three fundamental components:

  • Intelligence Layer (The Brain): The reasoning core, usually an LLM, capable of taking input, breaking it into smaller tasks, and evaluating progress.
  • Action Layer (The Tools): A library of external integrations, such as databases, web scrapers, computational engines, and business APIs, that the agent can use to gather real-world data or execute actions.
  • Coordination Layer (The Orchestrator): The logic that manages state, standardizes how agents exchange data, handles errors, and ensures loops are terminated when goals are met.

Tools of the Trade: Navigating the Lang Ecosystem

As organizations move from proof-of-concept to production, the ecosystem of framework tools is rapidly evolving. The “Lang” suite has emerged as a particularly dominant force in defining how agents are built and orchestrated. During our workshop, we will explore several critical tools within this stack:

LangChain

While often used for simple prompt channelling, LangChain’s core contribution to agentic architecture is standardizing integration and chain creation. It provides the interface to connect the LLM to dozens of external systems. Crucially, it allows us to define custom “tools” for the agent. These are specialized, user-created functions that give the agent specific capabilities, such as querying a proprietary data warehouse or executing an internal Python script. By wrapping these functions in LangChain’s tool abstraction, the agent can autonomously decide when and how to invoke them to solve complex problems.

LangGraph

Managing complex agentic workflows required a different mental model: graphs. LangGraph extends LangChain by allowing developers to model agentic flows as stateful graphs (DAGs, or Directed Acyclic Graphs). This is crucial for systems that require robust loops, cyclical processes, and complex state management, ensuring that “Agent A” always knows what state “Agent B” left the system in.

Langfuse

Orchestrating agents is messy, and you need visibility. While not officially developed by the creators of LangChain, Langfuse is an essential open-source operational companion that integrates seamlessly with the ecosystem. It provides a robust platform for debugging, testing, and monitoring agentic systems without vendor lock-in. Langfuse allows teams to “trace” the entire multi-step process, viewing every prompt, tool call, and internal decision, making it possible to identify bottlenecks, reduce costs, and debug failures in production.

Complementary Orchestration Tools

While the Lang ecosystem excels at managing LLM logic, a true enterprise solution often requires integration with generalized orchestration and automation tools (like Zapier or n8n). These tools excel at managing event triggers, parallel processes, and standard API interactions that do not require LLM reasoning, complementing the Lang stack in a complete enterprise architecture.

Final Thoughts

Moving from single prompts to coordinated, agentic systems is a necessary evolutionary step for organizations aiming to unlock true operational efficiency with AI. Mastery of these systems requires shifting your perspective from “engineering a prompt” to “engineering a system.”

Want to see how this works in practice?

This article provides a conceptual blueprint of agentic workflows and the essential role of orchestration. To gain hands-on experience in building these systems, we invite you to join our upcoming webinar on the Orchestration of Agentic Workflows. During the session, we will demonstrate how to build multi-step autonomous systems by integrating these platforms into a single architecture, providing a practical guide for moving from simple prompts to coordinated AI systems that handle professional tasks.

Register for free

5 Ways of Testing Your Data Pipelines with dbt

Testing Data Pipelines

In today’s data-driven world, the reliability of your analytics depends on the integrity of your data pipelines. Even the most sophisticated transformations can be undermined by bad source data, schema changes, or simple human error. That’s where dbt (data build tool) shines: it provides a framework not only for transforming and modeling your data, but also for validating it at every step. In this article, we’ll explore five essential testing strategies you can implement with dbt to catch issues early, enforce data contracts, and build confidence in your analytics.



1. Custom SQL Tests

What they are: Custom SQL tests allow you to write bespoke SQL queries against your models to enforce complex business rules or edge-case validations. They live alongside your models and execute as part of your dbt test suite.

Why you need them: Out-of-the-box tests cover many common cases, but sometimes you have unique conditions—like “no customer may have more than one active subscription”—that require a tailored query.

-- Example: Ensure no customer has more than one active subscription
SELECT
  customer_id,
  COUNT(*) AS active_sub_count
FROM {{ ref('subscriptions') }}
WHERE status = 'active'
GROUP BY customer_id
HAVING COUNT(*) > 1;

If this query returns any rows, the test will fail, alerting you to a data integrity issue before downstream models consume bad data.

2. Personalized Macros

What they are: Macros in dbt are reusable snippets of SQL or Jinja logic. By building custom macros for your testing patterns, you can enforce consistent checks without repeating verbose SQL.

Why you need them: Many projects share recurring validation needs—like ensuring date fields are never in the future or that monetary amounts are non-negative. A macro lets you encapsulate that logic once and apply it across multiple models.

{% macro expect_dates_in_past(model, column) -%}
SELECT
  *
FROM {{ ref(model) }}
WHERE {{ column }} > current_date
{%- endmacro %}

-- Usage in schema.yml
tests:
  - expect_dates_in_past:
      args:
        model: 'orders'
        column: 'order_date'

With just a few lines in your YAML, you’ve enabled a robust, maintainable check across your entire project.

3. Built-In dbt Data Tests

What they are: dbt ships with four standard data tests: unique, not_null, accepted_values, and relationships. These cover the most frequent requirements for uniqueness, presence, domain constraints, and referential integrity.

Why you need them: They require zero custom SQL. You simply declare your expectations in your model’s .yml file, and dbt will generate and execute the underlying queries.

models:
  - name: users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null

      - name: country_code
        tests:
          - accepted_values:
              values: ['US', 'CA', 'MX']

Within minutes, you’ve added core validation checks to your critical tables, ensuring nulls or duplicates never slip through.

4. Unit Tests

What they are: Unit tests isolate a model’s logic by feeding it controlled input data and verifying that its output matches expected results. In dbt, you can implement unit tests using seed files or temporary models.

Why you need them: When your transformations involve complex calculations—like computing a customer’s lifetime value—you want airtight proof that the logic works under all conditions, including edge cases.

-- seed file: tests/lv_input.csv
customer_id,order_amount
1,100
1,200
2,50
2,75

-- model: tests/lv_test.sql
with input_data as (
  select * from {{ ref('lv_input') }}
),

calculated as (
  select
    customer_id,
    sum(order_amount) as lifetime_value
  from input_data
  group by customer_id
)

select *
from calculated
where
  (customer_id = 1 and lifetime_value = 300)
  or (customer_id = 2 and lifetime_value = 125);

If the calculated results don’t match the hardcoded expectations, the test fails, pinpointing exactly where your logic diverged.

5. dbt Model Contracts

What they are: Model contracts let you define strict schema expectations—column types, required fields, accepted value sets—directly in your schema.yml files. They act as formal agreements between your data producers and consumers.

Why you need them: Without explicit contracts, a subtle change in upstream data can silently break your analytics. Contracts ensure that any structural or type deviations immediately surface as test failures.

models:
  - name: orders
    columns:
      - name: order_id
        data_type: integer
        tests:
          - not_null

      - name: order_amount
        data_type: decimal
        tests:
          - not_null

With contracts in place, if someone accidentally returns order_amount as text or drops a required field, dbt stops execution and flags the violation, protecting downstream dashboards and reports.

Putting It All Together: A Comprehensive Testing Strategy

Testing in dbt is more than an afterthought—it’s a core pillar of a robust analytics engineering workflow. By combining custom SQL tests, personalized macros, built-in data tests, unit tests, and model contracts, you establish multiple layers of defense against data quality issues. Here’s a sample workflow:

  1. YAML Configuration: Start by declaring built-in tests and contracts in your schema.yml for each model.
  2. Custom Tests: Add bespoke SQL tests in tests/ for any project-specific rules.
  3. Macros: Create a macros/ folder with reusable test macros for common patterns.
  4. Unit Tests: Define seed files and test models under tests/unit/ to validate critical transformations.
  5. CI/CD Integration: Hook dbt test into your CI pipeline so that every pull request runs the full suite, ensuring no faulty code or data reaches production.

This layered approach not only catches errors early but also documents your data expectations for new team members and stakeholders.

Conclusion

Ensuring the accuracy, reliability, and trustworthiness of your data requires more than ad-hoc checks—it demands a structured, repeatable testing framework. dbt provides the tools you need, from built-in quick-start tests to fully customized SQL validations and contracts. By implementing these five testing methods, you’ll be well on your way to bulletproofing your data pipelines and empowering stakeholders with confidence in their analytics.

Ready to level up your data quality? Start by adding one new test to your next dbt model, and watch your data reliability soar.

Watch the Video

dbt Fusion Explained: The Next Step in dbt’s Evolution

dbt Fusion Engine

As data teams continue to scale and the demand for faster, more reliable analytics grows, the tools we depend on must evolve. Enter dbt Fusion, the latest high‑performance execution engine from dbt Labs that promises to take your dbt workflows to unprecedented speeds. In this post, we’ll dive deep into what dbt Fusion is, explore its key features, discuss supported platforms and migration paths, and help you decide if—and when—you should upgrade. Let’s get started!



Why a New Engine?

dbt (data build tool) has revolutionized how analytics engineers transform and test data directly within the data warehouse. Until now, both dbt Core and dbt Cloud have relied on a Python-based execution engine. While powerful, Python parsing and compilation can become a bottleneck as projects grow to thousands of models. Recognizing this, dbt Labs has developed dbt Fusion from the ground up in Rust, a language known for its speed and memory safety.

Key Benefit: Lightning‑Fast Parsing

One of dbt Fusion’s marquee improvements is its parsing speed. Traditional dbt projects—especially those with tens of thousands of models—could take minutes to parse. With Fusion’s Rust implementation, parsing times drop dramatically, often by up to 30× faster, bringing multi‑minute delays down to mere seconds (or even milliseconds). Faster parsing means quicker iterations, faster CI checks, and more responsive development workflows.

Ahead‑of‑Time Cycle Compilation

Typically, dbt compilation happens right before execution, which means syntax errors or schema mismatches only surface during run time. dbt Fusion introduces ahead‑of‑time cycle compilation, enabling the engine to analyze your SQL and model dependencies intelligently before executing any queries against your warehouse. This pre‑flight check catches errors early, saving compute costs and developer time by preventing failed runs on the warehouse.

Column‑Level Lineage & Data Type Validation

Data governance is becoming ever more critical. With dbt Fusion, you gain column‑level lineage and built‑in data type validation. This fine‑grained visibility ensures that every downstream model inherits accurate metadata. For instance, if you tag a column as “PII” or “Personal Information” at the source model, Fusion will automatically propagate that tag to any downstream models referencing the same column—streamlining compliance and auditability.

Smarter Orchestration & Cost Savings

dbt Cloud users already benefit from intelligent job scheduling, but Fusion takes orchestration to the next level. It can detect unchanged models and skip them, dramatically reducing unnecessary computation. In practice, this means your daily or hourly runs only re‑execute models that truly need it, leading to significant savings on warehousing costs.

Enhanced Developer Experience in VS Code

To complement the core engine improvements, dbt Labs has released an updated VS Code extension tailored for Fusion. Highlights include:

  • Autocomplete for model names, macros, and config blocks
  • Inline SQL preview so you see your compiled SQL before executing
  • Live feedback on syntax or type errors as you code

These enhancements further shrink the feedback loop, allowing analytics engineers to develop with confidence and speed.

Supported Platforms & Future Connectors

At launch (beta stage), dbt Fusion supports:

  • Snowflake
  • Databricks

dbt Labs has confirmed that additional connectors—such as BigQuery and Redshift—are on the roadmap. To stay up to date, subscribe to the official dbt community forums or follow the dbt Twitter account for announcement alerts.

Beta to GA: What to Expect

dbt Fusion is currently in beta, but the pace of innovation is rapid. dbt Labs aims to reach general availability soon. During the beta, you can:

  1. Experiment with your most complex projects to quantify performance gains.
  2. Report issues and help refine features via GitHub or the dbt community channels.
  3. Understand limitations—such as unsupported adapters or edge‑case macros—before rolling out to production.

Migration Paths for dbt Cloud & Core Users

If you’re on dbt Cloud, you don’t need to lift a finger: Fusion will become the default execution engine automatically once GA is reached. Your existing jobs and orchestrations will seamlessly target Fusion under the hood.

For dbt Core users, upgrading is straightforward:

  1. Install the latest dbt-fusion package alongside dbt-core.
  2. Follow the step‑by‑step migration guide on the dbt Labs documentation site.
  3. Run your test suite locally to confirm compatibility.

License & Pricing Considerations

dbt Fusion introduces a new tiered licensing model:

  • Local Development (dbt Core users): Source‑available, free, and fully functional for local builds (with some advanced features behind a paywall).
  • dbt Cloud customers: Fusion is included in paid tiers, unlocking all premium capabilities—such as enterprise connectors, deeper metadata lineage, and priority support.

Review the official pricing page to see which features align with your team’s needs.

Is dbt Fusion Right for You?

If your team regularly works on large-scale dbt projects or you’re chasing every millisecond of performance, dbt Fusion is a game‑changer. Early adopters report 10×–30× faster parsing, near‑instant validation feedback, and lower cloud compute bills thanks to smarter orchestration.

That said, if your project is small or you’re comfortable with existing runtimes, you may choose to wait until GA and additional adapters ship. Either way, Fusion is the future of dbt, and understanding its capabilities now will help you plan your analytics roadmap.

Next Steps

  • Read the dbt Fusion docs to explore detailed benchmarks and feature matrices.
  • Join the beta: enable Fusion in your dev environment and share feedback.
  • Monitor connector announcements to align Fusion with your warehouse of choice.

Watch the Video

DBT Next Chapter with SDF – From SQL Strings to Semantic Insights

Introduction

dbt Labs has recently acquired SDF, a company known for its cutting-edge SQL comprehension technology. This acquisition brings several key benefits, including:

  • Faster dbt project compilation
  • Improved developer experience
  • High-fidelity data lineage tracking

But what does this mean for dbt users? Let’s explore.



Background & Context

dbt simplifies the creation of data models using SQL SELECT statements, making data engineering accessible to anyone familiar with SQL. Historically, however, dbt has treated SQL as mere strings, limiting its ability to understand the deeper meaning behind queries.

The Acquisition Explained

In January 2025, dbt Labs announced the acquisition of SDF. According to dbt Labs’ founder and CEO Tristan Handy:

“We are acquiring SDF to bring SQL comprehension into dbt and usher in a new era of ‘what’s possible’ for analytics: supercharging developer productivity and heightening data quality, all while optimizing data platform costs.”

What is SDF Technology?

Unlike dbt’s traditional approach of treating SQL as text, SDF understands the deeper structure of SQL, recognizing objects, types, syntax, and semantics. It emulates the SQL compilers of various data platforms (e.g., Snowflake, Redshift, BigQuery), allowing developers to:

  • Validate SQL queries before execution
  • Catch breaking changes early
  • Ensure platform compatibility
  • Perform real-time impact analysis
  • Reduce computational costs

What Does “SQL Comprehension” Mean?

SQL comprehension enables dbt to:

  • Identify query components
  • Generate structured artifacts
  • Validate SQL syntax and semantics
  • Predict query outcomes, including column creation and datatype assignments
  • Execute queries efficiently

Impact on dbt

1. Validate: Will My SQL Really Work?

With SDF, dbt can ensure that SQL queries are correct before execution, reducing errors and debugging time.

2. Analyze: Precise Column-Level Lineage

SQL comprehension improves data lineage tracking, leading to:

  • Better debugging workflows
  • Optimized CI builds (only rebuilding models affected by changes)
  • Enhanced metadata propagation (e.g., PII tagging, test descriptions)

3. Optimize: Right Query, Right Place

With a better understanding of SQL, dbt can:

  • Optimize queries for specific engines
  • Identify performance bottlenecks before execution
  • Support query pruning (scanning only relevant data subsets)
  • Optimize DAG execution across multiple platforms

Future Implications

The integration of SDF into dbt is in progress. While it won’t be part of the Apache 2.0 code base, dbt Labs plans to make key SDF capabilities available to all users—both in dbt Core and dbt Cloud.

With SDF, dbt users can look forward to:

  • Faster performance
  • Optimized data platform costs
  • Improved data lineage tracking

All without needing to change existing dbt projects.

Conclusion

The acquisition of SDF marks a significant milestone for dbt, bringing true SQL comprehension into the platform. Developers will benefit from enhanced query validation, real-time impact analysis, and optimized execution strategies. As dbt evolves, its users can expect more powerful, efficient, and cost-effective data transformations.

Watch the Video

Semantic Models and Metrics

Unlocking Analytics with Semantic Models and Metrics

A semantic model is a layer of abstraction that defines business-friendly terms and metrics on top of raw or transformed data. It bridges the gap between data transformations and end-user reporting, ensuring accuracy, consistency, and clarity across analytics tools. By providing a unified way to define and calculate key metrics, semantic models empower businesses with reusability and precision in reporting.



Understanding Semantic Models

Semantic models form the foundation of the dbt Semantic Layer. Configured using YAML files, they correspond to dbt models in your DAG. Each model requires a unique YAML configuration, enabling dynamic and reliable dataset refinement. You can even create multiple semantic models from a single dbt model, provided each has a distinct name.
These models comprise three key components:

  • Entities: Define relationships between semantic models (e.g., IDs).
  • Dimensions: Columns used for slicing, grouping, and filtering data (e.g., timestamps, categories).
  • Measures: Quantitative values aggregated in analyses.

Diving into Metrics

Metrics are calculations representing essential business measures, built from entities, measures, and dimensions. They ensure centralized definitions, reusability across tools, and consistency in analysis. Metrics encapsulate both logic (e.g., aggregations, filters) and context (e.g., time granularity, dimensions).
Types of metrics include:

  • Conversion Metrics: Track events like purchases per user.
  • Cumulative Metrics: Aggregate measures over specified windows.
  • Derived Metrics: Expressions combining multiple metrics.
  • Ratio Metrics: Comparisons of numerator and denominator metrics.
  • Simple Metrics: Directly reference a single measure.

Commanding Metrics with dbt

dbt Cloud CLI provides MetricFlow commands to interact with the semantic layer. For instance, dbt sl query executes queries and validates metrics, while dbt sl list dimensions retrieves dimensions for specific metrics. These tools streamline metric management and ensure robust analytics workflows.

Semantic models and metrics are vital for bridging data transformations and actionable insights. They provide a foundation for scalable, consistent, and reusable analytics frameworks, enabling businesses to thrive in data-driven environments.

Watch the Video

Master Data Governance: Understanding the EU’s Data Act

Data Governance

The EU’s Data Act is on the horizon, promising to reshape how businesses access and utilize data according to Data Governance, particularly industrial data. This legislation aims to foster a competitive data market, drive innovation, and ensure fairness in data sharing.

Key provisions include granting users control over data from connected products, enhancing fairness in data allocation, and safeguarding against unfair contractual terms. Complementing the Data Governance Act, these regulations lay the groundwork for an EU single market for data, positioning Europe as a global leader in the data economy.

Master Data Governance: Understanding the EU’s Data Act

Join us for an insightful webinar where we unravel the intricacies of the EU’s Data Act. Delve into the legal and technical obligations imposed by this landmark legislation and gain valuable insights into implementing compliant data platforms.

Watch Webinar Recording

Understanding the EU Data Act

The Data Act, formulated by the European Union, represents a significant step forward in governing data sharing, access, and utilization within the digital economy. It aims to foster collaboration while upholding principles of fairness, transparency, and compliance with data protection regulations such as the GDPR. This regulatory framework provides guidelines for businesses, public sector entities, and research organizations, ensuring that data-sharing practices align with legal requirements and ethical standards.

Scheduled for enforcement on September 12, 2025, the Data Act mandates adherence to its provisions concerning data-sharing practices, contractual agreements, and operational procedures. Stakeholders must prepare to align their processes with the requirements outlined in the Data Act to maintain compliance and mitigate regulatory risks effectively.

The scope of the Data Act encompasses various stakeholders engaged in data-sharing activities within the European Union. This includes businesses of all sizes, public sector bodies, research organizations, and data processing service providers. Data holders bear the responsibility of ensuring compliance with the Data Act’s directives, promoting fair and non-discriminatory data-sharing practices while safeguarding privacy and intellectual property rights.

Data Governance Free PDF - European Data Act Compliance Checklist - 10 Key Steps

Master The Eu’s Data Act Today!

Ensure your business is ready for the Data Act with our comprehensive 10 Key Steps Compliance Checklist. Learn how to protect data, review contracts, and maintain interoperability.

Get My Free Checklist

Technical Challenges

Addressing technical challenges posed by the Data Act requires robust solutions that guarantee data integrity, security, and accessibility. Leveraging methodologies like Data Vault 2.0 offers a multi-faceted approach to data management, facilitating scalability, real-time capabilities, and efficient data architecture. By decoupling storage from delivery and implementing real-time data capture and processing, organizations can streamline compliance efforts and enhance data governance.

Implementing a Data Vault 2.0-based data platform enables organizations to meet the complex requirements of the Data Act effectively. By establishing a robust architecture for data ingestion, processing, and presentation, businesses can ensure compliance while driving innovation and agility in their data operations.

Data Vault architecture meets the Data Act requirements

Figure: reference architecture for real-time Data Vault solution

Data Vault 2.0 architecture follows a multi-layer approach, consisting of the Staging Layer, the Enterprise Data Warehouse Layer, and the Information Marts Layer. This integrated approach ensures a harmonious synergy between technical and business objectives. In particular, to respond to the Data act requirements, in this case we are including a “message queue” that loads the data from source systems to our enterprise data warehouse “without undue delay,” i.e., real-time, near real-time, or with the delay stipulated by your specific technology and related services.

In leveraging a real-time or near real-time Data Vault 2.0 architecture for your data platform, you’re not just meeting the requirements of the EU Data Act but also laying the foundation for streamlined data management and enhanced data analytics capabilities. By centralizing the processing and preparation of raw data from IoT devices, this architecture ensures that compliance with the Data Act is seamlessly integrated into your data infrastructure.

Additionally, in the Information Mart layer, we can create a specific Interface Mart to cater to security and accessibility requirements. Here, data can be flagged by various criteria such as device or user ID and access level. This enables users to either download the data upon request or visualize it directly through a dedicated app. The data will be as fresh as possible, as the Interface Mart can be generated as views built on top of our Raw Vault, which is updated “without undue delay.”

Moreover, by decoupling the data storage from delivery and employing real-time data capture and processing mechanisms, you’re not only facilitating adherence to regulatory standards but also enabling agile and responsive data analytics. These business rules could be implemented in your Business Vault and downstream layers. Hence, this approach ensures compliance and also sets the stage for harnessing the full potential of your data assets for driving innovation and decision-making across your organization.

Final Remarks

As the enforcement date of the Data Act approaches, it is imperative for organizations to prioritize compliance and adopt proactive strategies for data management. By embracing technologies like Data Vault 2.0 and adhering to agile development methodologies, businesses can navigate the regulatory landscape with confidence, harnessing the synergies between regulatory requirements and technological advancements to drive sustainable growth and innovation.

Check out our webinar recordings as we explore the intersection of the Data Act and Data Vault 2.0, offering insights and practical guidance for navigating the evolving data governance landscape.

Watch here for free: In English or in German

Exploring datavault4dbt: A Practical Series on the dbt Package for Data Vault 2.0 – Vol. 2: Standard Entities in the Raw Vault

Exploring datavault4dbt

In our initial post of this series, we delved into the creation of our staging layer using DataVault4dbt, an open-source package designed for Data Vault 2.0 within dbt. In this installment, we embark on the journey to construct our first standard Data Vault 2.0 model entities in the Raw Vault, including Hubs, Links, and Satellites. As in our previous post, we recommend staying up-to-date with the latest changes and adaptations in the DataVault4dbt package by referring to the project’s GitHub repository Wiki.

Before We Start

Before we get started, ensure that you have the DataVault4dbt package correctly installed in your packages.yml file and that you’ve executed dbt deps.

For this tutorial, we’ll be using the TPCH Snowflake Sample Data. Moreover, we assume you’ve already established your staging model, which includes the calculation of hashkeys and hashdiffs. Here’s a snippet from our staging model‘s configuration, which we’ll need later when creating the Raw Vault entities:

stg_orders
Staging layer in datavault4dbt

A. Standard Hub with datavault4dbt

Hubs are constructed based on a unique list of business keys, making their configuration relatively straightforward. In this example, we’ll be creating the Hub for orders:

order_h
Hub in datavault4dbt
  • hashkey: the hashkey name in the staging model
  • business_keys: name of the business key used as input for the previously mentioned hashkey
  • source_models: name of our staging model

B. Standard Link with datavault4dbt

Link models establish connections between business keys. In our case, we’ll create a connection between the previously formed Order Hub and the Customer Hub:

order_customer_1
Link in datavault4dbt
  • link_hashkey: hashkey of the Link, generated using the foreign keys from the Hubs in the staging layer
  • foreign_haskeys: a list of foreign hashkeys to be included in our link
  • source_models: name of our staging model

C. Standard Satellite Version 0 with datavault4dbt

Following Data Vault 2.0 standards, Version 0 Satellites are created as incremental tables. In our example, the Satellite will be connected to the previously generated Order Hub:

order_0s
satellite version 0 in datavault4dbt
  • parent_hashkey: name of the parent entity’s hashkey, in our case, the Order Hub
  • src_hashdiff: hashdiff already calculated on the staging model
  • src_payload: original columns used in the hashdiff calculation
  • source_model: name of our staging model

D. Standard Satellite Version 1 with datavault4dbt

Additionally, the Version 1 Satellite is a virtually generated entity created on top of our Version 0 Satellite. Beyond the materialization type, the main difference with the V0 Satellite is the introduction of a new column for calculating the load end date. The load end date will be useful for us downstream when dealing with PIT tables in the Business Vault.

order_s
satellite version 1 in datavault4dbt
  • sat_v0: name of the related Version 0 Satellite
  • hashkey: hashkey name of the parent entity, in our case, the order Hub
  • hashdiff: hashdiff already calculated on the staging model
  • ledts_alias: name of the load end date column to be generated
  • add_is_current_flag: when true, it generates a new column flagging the last loaded rows based on the load end date

Conclusion

In this journey through the creation of Raw Vault standard entities, we’ve established a strong foundation for our Data Vault 2.0 architecture. By utilizing DataVault4dbt within dbt, we’ve simplified the development of Hubs, Links, and Satellites. These fundamental building blocks are the cornerstone of a robust and scalable data warehousing solution. As we progress in this series, we’ll continue to explore advanced concepts and delve into the intricacies of Data Vault modeling, preparing us to unlock the full potential of our data.

Designing the Business Vault: Key Strategies for Effective Data Organization

DV2.0 Architecture with Business Vault

Designing the Business Vault

Data Vault 2.0 has emerged as a comprehensive framework, offering agility, scalability, and adaptability. At the heart of this framework lies the Business Vault, a critical component for effective data organization and analysis in modern enterprises.

In this article, we will check the key principles and strategies for designing a robust Business Vault within the context of Data Vault 2.0.

Designing the Business Vault: Key Strategies for Effective Data Organization

Join us for an insightful webinar on “How to design the Business Vault?” as we explore the critical role of the Business Vault within the Data Vault 2.0 framework. Discover how the Business Vault serves as a pivotal component for translating raw data into actionable insights, applying soft business rules to streamline end-user structure creation, and ensuring an efficient population of Information Marts.

Watch webinar recording

Understanding Data Vault 2.0

Data Vault 2.0 represents a paradigm shift in data architecture, distinguishing itself from traditional warehousing methods. Its flexibility and scalability make it ideal for organizations navigating the complexities of modern data ecosystems.

Data Vault 2.0 architecture follows a multi-layer approach, consisting of the Staging Layer, the Enterprise Data Warehouse Layer, and the Information Marts Layer. By dividing our data architecture into multiple layers, we can respond to both the needs of the technical teams (i.e., historization, auditability, and data integration) and the requirements of the business users (i.e., quick access to relevant, well-organized information). This integrated approach ensures a harmonious synergy between technical and business objectives.

DV2.0 Architecture with Business Vault

To achieve all these goals, Data Vault 2.0 proposes a subdivision inside the Enterprise Data Warehouse Layer: the Raw Vault and the Business Vault. The Raw Vault will receive and integrate the unaltered data from the source, while the Business Vault will translate the raw data into meaningful insights for informed decision-making.

Importance of Business Vault

The Business Vault serves as a middle ground between the Raw Vault and the Information Mart layers. It is an optional vault, sparsely generated on top of the Raw Vault and normally it is virtualized. Differently from the Raw Vault, in the Business Vault, we will be applying soft business rules, i.e., those rules that change the data.

This layer will be created to serve the business in different ways, such as the generation of query assistance entities or by precomputing calculated fields that later will be used on downstream layers. In other words, the Business Vault will host business-rule changed data and its purpose is to ease the creation of end-user structures.

Key Concepts of a Business Vault

A Business Vault will be modeled following the Data Vault 2.0 design principles. Nevertheless, it won’t necessarily follow the strict auditability requirements of the Raw Vault, as we can drop and recreate the Business Vault entities at any time. With the purpose of serving to populate the Information Mart more easily and efficiently, the entities will be created only if they are necessary for the business. This is also why the Business Vault usually keeps reusable business logic.

The types of entities we can typically find inside a Business Vault could be Point-In-Time (PIT) and Bridge Tables, for query assistance; Computed Satellites or Links, for storing computed data; and Exploration Links, for connecting Hubs that were not previously connected to the Raw Vault. Besides, any other entities that are created on top of the Raw Vault, using business logic and queried by the Information Marts layer, would belong to the Business Vault. For instance, we might need business logic to map instances of the same thing, thus creating a Business Same-as Link.

Conclusion

In data management, Data Vault 2.0 encompasses different aspects such as data modeling, methodology, and architecture. Distinguished by its versatility, this framework places a significant emphasis on agility and adaptability. In this sense, at the core of Data Vault 2.0 architecture lies a pivotal concept, the Business Vault, a key player for efficient data organization and analysis in modern enterprises.

The Business Vault, a flexible optional layer, interprets raw data into actionable insights, applying soft business rules. Its purpose is to streamline end-user structure creation by hosting processed data. Entities are created selectively, keeping reusable business logic. In essence, the Business Vault ensures the efficient population of Information Marts by focusing on business-critical data.

Interested in more? Watch the webinar recording here for free!

Exploring Datavault4DBT: A Practical Series on the DBT Package for Data Vault 2.0 – Vol. 1: The Staging Layer

Exploring DataVault4dbt

Last year Scalefree released DataVault4dbt, a Data Vault 2.0 open-source package for dbt, which includes loading templates for creating and Data Vault 2.0 modeling entities following up-to-date standards and best practices. If you want to read more about the general content of the package and its motivation, you can do so here.

We’re excited to launch a series of insightful posts and webinars, showcasing practical implementations of DataVault4dbt. This will empower you to leverage its full potential in your data warehousing endeavors. Today, we’ll spotlight its application in the staging layer.

Before we start with DataVault4dbt

We will assume some previous knowledge related to Data Vault 2.0 and dbt. Besides, for the following examples, we will be using dbt Cloud IDE connected to Snowflake. For an updated list of supported platforms, check the package’s GitHub repository.

Also, bear in mind that for optimal use of the macros, you must meet a couple of prerequisites:

  • Flat & Wide source data, accessible in your target database
  • A Load Date column signifying the time of arrival in the source data storage
  • A Record Source column detailing the origin of the source data, such as the file location within a Data Lake

In our case, we used and adapted the data from the jaffle_shop example project available on dbt.

Installing DataVault4dbt package on dbt

Installing DataVault4dbt is like installing any other package on your project. You will need to follow two simple steps:

1.Add it to your packages.yml file

DataVault4dbt installation

2. Run dbt deps

DataVault4dbt installation

Using the macro for staging our source data

According to the documentation for the staging layer of DataVault4dbt, this layer primarily focuses on hashing. It also offers functionalities like creating derived columns, conducting prejoins, and adding NULL values for missing columns. Rather than diving deep into the technical aspects of each macro component, which are comprehensively covered in the documentation, let’s dive straight into its application!

A. Basic source information

Identifying the Source Model (source_model):

  • When referencing a source, adopt the dictionary format: ‘source_name’: ‘source_table’.
  • For models within our dbt project, just use the model name: ‘source_table’.

Setting Load Date Timestamp (ldts) & Record Source (rsrc):

  • Both can reference a column from the source table or a more detailed SQL expression.
  • Additionally, for the Record Source, you can use a static string beginning with ‘!’, like ‘!my_source’.

Example

DataVault4dbt: A table with two table blocks.
  • source_model: Calls an already created table on dbt named ‘orders_example’.
  • ldts: Calls a timestamp column from our source model.
  • rsrc: Calls a column which contains a string referring to our record source name.

B. Hashing

In DataVault4dbt, the hashed_columns parameter outlines how to generate hashkeys and hashdiffs. For each hash column:

  • The key represents the hash column’s name.
  • For Hashkeys, the value is a list of business keys.
  • For Hashdiffs, the value will usually be a list of descriptive attributes.

Example

DataVault4dbt: Screen shot, table, different types of data.
  • hk_order_h: hashkey generated using two columns inputs (O_ORDERKEY and O_CUSTKEY)
  • hd_order_s: hashdiff generated using multiple descriptive attributes

C. Derived columns

Derived Columns in DataVault4dbt stage models allow users to directly apply specific transformations to data. They act as on-the-fly customizations, enabling immediate adjustments to data within the column itself. Essentially, if data isn’t in the desired format, with DataVault4dbt you can derive a new version right within the column using a specified rule.

When setting the derived_columns parameter, each derived column includes:

  • value: The transformation expression.
  • datatype: The datatype of the column.
  • src_cols_required: Source columns needed for the transformation.

Depending on how you name the derived column and the source columns, you can achieve two outcomes:

  1. If the derived column’s name matches its source column’s name, the original column’s data will be replaced by the transformed data. This effectively means you’re overwriting the original data.
  2. On the other hand, if the derived column’s name is different from its source column’s name, the transformation will result in a brand new column, preserving the original column’s data.

Example

DataVault4dbt: table, prices, items.
  • price_euro: creation of a new column with the same values as the O_TOTALPRICE column.
  • country_isocode: creation of a new column with a static string ‘GER’.

D. Prejoining

Why Prejoin?

In certain scenarios, your source data might not have the ‘Business Key’ which is often a human-readable identifier, such as an email address or username. Instead, it might have a ‘Technical Key’, which could be an internally generated identifier or code. If you need to use the human-readable Business Key in your processing but only have the Technical Key, you would use prejoining to combine your data with another table that maps Technical Keys to Business Keys.

How to Define Prejoins in DataVault4dbt?

The DataVault4dbt package provides a structured way to define these prejoins (prejoined_columns) using dictionaries.

For every column you’re adding through prejoining, you need to specify a few things:

  • src_name: This is the source of the prejoined data, as defined in a .yml file.
  • src_table: This specifies which table you’re prejoining with, as named in the .yml file.
  • bk: This is the name of the Business Key column in the prejoined table or the column values you are bringing to your table.
  • this_column_name: In your original data, this is the column that matches up with the prejoined table. This is often a Technical Key.
  • ref_column_name: In the prejoined table, this is the column that this_column_name points to. It should match up with the values in this_column_name.

Note that both ‘this_column_name’ and ‘ref_column_name’ can represent either a single column or a list of columns, serving as the basis for constructing the JOIN conditions.

Example

DataVault4dbt example: Table, data.
  • c_name: we brought the column “C_NAME” from the customer source table, joining on orders.o_custkey = customer.c_custkey.

E. Multi active config

The multi_active_config parameter is used when dealing with source data that contains multiple active records for the same Business Key. Essentially, you need to specify which columns are the multi-active keys and the primary hashkey column.

If your source data doesn’t have a natural multi-active key column, you should create one using functions like row_number in a preceding layer. Then, add the name of this newly created column to the multi-active-key parameter. It’s crucial that the combination of multi-active keys, the main hashkey, and the ldts column be unique in the final satellite output. If you don’t use this setting, the stage is considered to have only single active records.

Example

DataVault4dbt example: Table, types, block.

By setting this parameter, we’ll observe consistent hashdiffs for identical Business Keys, proving beneficial in subsequent layers. If you want to know why, you can check this post.

F. Missing columns

With DataVault4dbt, the missing_columns parameter helps handle scenarios where the source schema changes and some columns no longer exist. Using this parameter, you can create placeholder columns filled with NULL values to replace the missing ones. This ensures that hashdiff calculations and satellite payloads continue to work. Essentially, you provide a dictionary where the column names are the keys and their respective SQL datatypes are the values.

Example

DataVault4dbt example: table, price
  • discount_code: creation of a new discount_code column with NULL values.

Conclusion

Scalefree’s DataVault4dbt package introduces an easy-to-use yet powerful solution for database modeling. In our case, we went through the staging layer macro, which combines best practices with the flexibility to address diverse source data needs. From hashing to on-the-fly column modifications, this Data Vault 2.0 open-source package for dbt streamlines complex processes.

As we continue to explore its potential, we invite you to join our monthly expert session for a deeper dive. Reserve your spot here and stay tuned to the package’s GitHub repository for the latest updates and support.

Close Menu