Dmytro Polishchuk

dbt Fusion Demo: Dialect-Aware Validation, State-Aware Orchestration & Efficient Testing

dbt Fusion Demo

This is the second installment in our dbt Expert Series. In the first video, we introduced dbt Fusion, explored what it is, why it matters, and highlighted its core capabilities: dialect error validation, state-based orchestration, and efficient testing. If you have not watched that video yet, we recommend doing so before continuing here.

In this session, Scale Free principal consultant Metropolis takes those concepts into a live demo environment to show how they actually behave in practice. The result is a clear picture of how dbt Fusion can reduce costs, eliminate redundant work, and catch errors before they become expensive problems.

In this article:

The demo project: Hub Speak Base
Dialect error validation in the IDE
State-based orchestration: only rebuild what changed
Efficient testing: stop running the same tests twice
What this means for your dbt workflows
Watch the Video

The demo project: Hub Speak Base

To ground the demo in something realistic, Metropolis built a project called Hub Speak Base. It includes four data sources — customer, line item, orders, and part — along with seven models: one staging model per source, two dimension models (customer and part), and a fact orders model. Unique and not-null tests are configured on both the sources and the staging models, and the mart models include unit tests defined in a star schema YAML file. This gives a solid foundation for demonstrating all three major Fusion capabilities without abstracting away the messiness of real-world pipelines.

Dialect error validation in the IDE

One of the most immediately useful features of dbt Fusion is its ability to catch SQL errors before a query ever leaves the IDE. This is what Fusion calls dialect error validation, and the demo shows it working in two distinct scenarios.

First, Metropolis demonstrates column reference checking. In the fact orders model, he intentionally references a column called order_keys instead of the correct order_key. Fusion flags this instantly — hovering over the incorrect reference surfaces an error message explaining that the column l_order_keys cannot be found. The same error appears in the problems panel below the editor, making it impossible to overlook.

Second, he tests function name validation by changing the round function to a fictitious roundy. Fusion flags this as well. When he attempts to build the model anyway, Fusion does not even send the query to Snowflake. Instead, it stops during static analysis and throws an error immediately — saving both time and warehouse compute.

It is worth noting a current limitation: as of the time of recording, not all SQL errors are caught by Fusion’s static analysis engine. For example, passing a third argument to the round function — which Snowflake does not support — is not yet flagged locally, and the query is sent to Snowflake where it fails at runtime. Since Fusion is still in preview, this behavior is expected to improve over time.

State-based orchestration: only rebuild what changed

State-based orchestration is where dbt Fusion offers some of its most compelling cost savings. The idea is simple: instead of rebuilding every model on every run, Fusion tracks the state of each model in the database and only rebuilds the ones that have changed — whether due to code updates or upstream data changes.

To enable this, you navigate to the Orchestration settings for your production environment and toggle on the Fusion cost optimization features, which includes both state orchestration and efficient testing.

The demo makes the behavior concrete. Metropolis drops all tables and views from the production schema, then triggers a job. Every model is rebuilt from scratch because Fusion detects that nothing exists yet. On the second run, with no changes made, every model is marked as reused. The logs confirm this clearly: no new changes on any upstream model.

Then things get interesting. Metropolis inserts one row into the customer source table and one row into the line item source table. The expected behavior — that only the models downstream of those sources would be refreshed — is exactly what happens. The staging customer model and the dimension customer model are rebuilt. The staging line item model and the fact orders model are rebuilt. Everything else is reused. Fusion is detecting data changes at the row level, without any update timestamps configured. The orchestration works automatically out of the box.

One particularly useful aspect of Fusion’s state orchestration is that state is shared across jobs within the same environment. The demo includes a second production job configured to refresh fact orders and its upstream dependencies. When this job runs after the first, it finds all models already built and up to date — so everything is reused. Teams running multiple jobs against the same environment avoid paying twice for the same compute.

Efficient testing: stop running the same tests twice

The third capability demonstrated is efficient testing. When using the build command in a Fusion-enabled job, dbt Fusion tracks which tests have already run and reuses their results for downstream models within the same job execution — rather than re-running identical tests multiple times.

In the demo, after switching the job command from run to build, the results show tests on the sources executing as expected. But the equivalent tests defined on the staging and mart models — tests that reference the same underlying data — show their results as reused. This avoids redundant warehouse queries and can meaningfully reduce both execution time and compute cost on larger projects.

The current limitation here is scope: as of the time of recording, test result reuse only happens within the context of a single job run. Results are not carried over to subsequent runs or shared across different jobs. This may change in future versions of Fusion.

What this means for your dbt workflows

Taken together, these three capabilities address real pain points that dbt teams encounter as their projects scale. IDE-level error validation shortens the feedback loop between writing SQL and knowing it works, without requiring a round-trip to the warehouse. State-based orchestration dramatically reduces unnecessary compute by treating rebuilds as the exception rather than the rule. And efficient testing ensures that the tests you have invested time in writing do not become a bottleneck in CI/CD by running redundantly.

dbt Fusion is still in preview, and some capabilities are still being refined. But based on this demo, the direction is clear: Fusion is designed to make the full dbt development and deployment cycle faster, cheaper, and more intelligent.

Future videos in this series will cover more advanced configuration options for state orchestration, as well as additional Fusion features as they become available. Make sure you are subscribed so you do not miss them.

Watch the Video

Dmytro Polishchuk In dbt Talk

dbt Fusion: The Next Generation of dbt Execution

dbt Fusion

dbt is evolving rapidly, and with the introduction of dbt Fusion, data teams are entering a new era of performance, efficiency, and intelligence. Built from the ground up, dbt Fusion represents a fundamental shift in how dbt projects are executed, validated, and optimized.

In this article, we’ll explore what dbt Fusion is, why it matters, and how its core capabilities—dialect-aware validation and state-aware orchestration—are changing the way modern data platforms operate.

In this article:

What is dbt Fusion?
Why dbt Fusion Matters
Dialect-Aware SQL Validation
State-Aware Orchestration
Efficient Testing (Beta)
Performance and Cost Benefits
Compatibility with dbt Projects
Current State of dbt Fusion
Conclusion
Watch the Video
Meet the Speaker

What is dbt Fusion?

dbt Fusion is a next-generation execution engine for dbt, designed to overcome the limitations of dbt Core and unlock new capabilities for data teams. Rather than incrementally improving the existing engine, dbt Labs rebuilt the execution layer entirely.

One of the most important differences lies in its foundation: dbt Fusion is written in Rust, while dbt Core is built in Python. This change enables significantly better performance, especially for large-scale projects with complex dependency graphs.

But performance is only part of the story.

dbt Fusion introduces a native understanding of SQL across multiple dialects, allowing it to analyze queries more deeply than ever before. This enables advanced features like early error detection, improved lineage tracking, and smarter orchestration.

Importantly, dbt Fusion is designed to support the full dbt Core framework. Most existing dbt projects can run on Fusion with minimal changes, making adoption straightforward for many teams.

Note: Deprecated dbt Core functionality is not supported.

Why dbt Fusion Matters

dbt Fusion introduces two major innovations that directly impact day-to-day data work:

Dialect-aware SQL validation
State-aware orchestration

Together, these features significantly improve developer productivity, reduce execution time, and lower compute costs.

Dialect-Aware SQL Validation

Static SQL Analysis

One of the most powerful capabilities of dbt Fusion is its ability to perform static SQL analysis. Instead of simply rendering SQL and sending it to the data warehouse, Fusion builds a logical execution plan for every query during compilation.

This means that SQL correctness can be validated before any warehouse resources are used. As a result, many errors are caught early in the development process rather than during execution.

Handling Introspective Models

Not all SQL can be fully analyzed ahead of time. Some models rely on database-dependent macros, often referred to as introspective macros. Examples include:

get_column_values
star
unpivot

In these cases, dbt Fusion may defer part of the validation to the database itself, since the final structure depends on runtime information.

Why This Matters

Dialect-aware validation provides several key benefits:

Early error detection: Catch issues before execution
Improved developer experience: Faster feedback in the IDE
Precise column-level lineage: Better understanding of data flow
Foundation for advanced features: Enables orchestration and optimization

In practice, this means fewer failed runs, faster debugging, and more confidence in your transformations.

State-Aware Orchestration

The second major innovation in dbt Fusion is state-aware orchestration, which fundamentally changes how dbt jobs are executed.

Build Only What Changed

Traditionally, dbt runs rebuild models even if nothing has changed. dbt Fusion eliminates this inefficiency by detecting changes in both code and upstream data.

If no changes are detected, the model is skipped and the existing version is reused.

This results in:

Faster execution times
Reduced compute usage
Lower cloud costs

Shared Model State

dbt Fusion maintains a shared, real-time state at the model level. All jobs within the same environment can read and write to this shared state.

This allows dbt to determine whether a model has already been built and whether rebuilding it would produce a different result.

Concurrent Job Handling

In modern data platforms, multiple jobs often run at the same time. dbt Fusion is designed to handle this safely and efficiently.

It avoids unnecessary duplication by:

Preventing warehouse collisions
Reusing models across concurrent jobs
Ensuring consistency across executions

Works Out of the Box

One of the strengths of dbt Fusion is its ease of use. State-aware orchestration works automatically in most cases, without requiring additional configuration.

For advanced use cases, teams can still fine-tune behavior with more granular controls.

Efficient Testing (Beta)

dbt Fusion also introduces efficient testing, a feature currently in beta that optimizes how tests are executed.

Key improvements include:

Test result reuse: Avoid rerunning tests when results are unchanged
Query aggregation: Combine multiple tests into a single query
Reduced warehouse load: Lower compute costs

This makes testing faster and more cost-efficient, especially in large projects with extensive test coverage.

Performance and Cost Benefits

By combining Rust-based execution, advanced SQL analysis, and intelligent orchestration, dbt Fusion delivers measurable improvements:

Significantly faster runtimes
Reduced warehouse usage
Lower infrastructure costs
Improved developer productivity

For organizations managing complex data pipelines, these benefits can translate into substantial operational savings.

Compatibility with dbt Projects

dbt Fusion is designed to integrate seamlessly with existing dbt workflows.

Most projects can be migrated without major changes, as Fusion supports the core dbt framework. However, teams should be aware that deprecated features from dbt Core are not supported.

This makes it important to review and modernize older projects before transitioning.

Current State of dbt Fusion

At the time of writing, dbt Fusion is still in preview. While its capabilities are already impressive, some features may evolve as the engine matures.

Organizations considering adoption should monitor updates and test Fusion in controlled environments before full deployment.

Conclusion

dbt Fusion represents a major step forward in the evolution of dbt. By rethinking the execution engine from the ground up, it introduces powerful capabilities that go beyond incremental improvements.

With dialect-aware SQL validation, state-aware orchestration, and efficient testing, data teams can build pipelines that are not only faster, but also smarter and more cost-effective.

As the modern data stack continues to evolve, dbt Fusion is positioned to play a key role in shaping the future of analytics engineering.

Watch the Video

Meet the Speaker

Dmytro Polishchuk
Senior BI Consultant

Dmytro Polishchuk has 7 years of experience in business intelligence and works as a Senior BI Consultant for Scalefree. Dmytro is a proven Data Vault 2.0 expert and has excellent knowledge of various (cloud) architectures, data modeling, and the implementation of automation frameworks. Dmytro excels in team integration and structured project work. Dmytro has a bachelor’s degree in Finance and Financial Management.

Dmytro Polishchuk In dbt Talk

How Model Access Works in dbt Cloud: Groups, Permissions & Cross-Project References

Model Access in dbt Cloud

As organizations scale their data platforms, controlling data access and ownership becomes increasingly important. Teams need clear rules around who can use specific datasets, how models are shared across projects, and how data governance is enforced without slowing down collaboration.

dbt Cloud provides powerful model access features that help analytics engineers and data teams manage visibility, ownership, and cross-project collaboration. By defining groups and access levels such as private, protected, and public, organizations can build scalable and secure data architectures aligned with modern data mesh principles.

This article explains how model access works in dbt Cloud using a producer-consumer project setup. We will explore access levels, group configuration, cross-project references, and how model relationships appear in the Catalog and lineage graph.

In this article:

Why Model Access Matters in Modern Data Platforms
Understanding Model Access Levels in dbt
Producer and Consumer Project Setup
Configuring Groups in dbt
- Assigning Groups to Individual Models
- Assigning Groups to Folders
Testing Private Model Access
Using Protected Models
Configuring Public Models for Cross-Project Use
Star Schema Example with Mixed Access Levels
Exploring Model Ownership in the dbt Catalog
Referencing Public Models from Consumer Projects
Understanding Lineage Graphs and Dependencies
Benefits of Model Access Control in dbt
Best Practices for Managing Model Access
Conclusion
Watch the Video

Why Model Access Matters in Modern Data Platforms

In a growing data ecosystem, multiple teams often build and consume data models simultaneously. Without clear access control, this can lead to:

Unclear data ownership
Breaking changes across teams
Data quality risks
Limited governance
Uncontrolled dependencies

dbt addresses these challenges by allowing teams to:

Control who can reference specific models
Define ownership through groups
Enable safe collaboration between teams
Support data mesh architectures
Manage cross-project dependencies

With proper configuration, organizations can ensure reliable and scalable data pipelines while maintaining flexibility.

Understanding Model Access Levels in dbt

dbt provides three main model access levels that control visibility and usage across projects and teams.

Private Models

Private models are restricted to a specific group. Only models within the same group can reference them.

Key characteristics:

Limited visibility
Strong access control
Internal use within a team
Prevents external dependencies

Private models are useful for intermediate transformations or sensitive data that should not be widely accessible.

Protected Models

Protected models allow broader usage while maintaining controlled ownership.

Key characteristics:

Can be referenced outside their group
Still managed by a specific owner
Suitable for shared internal data
Balanced control and accessibility

By default, dbt models are typically configured with protected access unless specified otherwise.

Public Models

Public models are designed for cross-project collaboration and wider consumption.

Key characteristics:

Accessible across projects
Supports data sharing
Enables data mesh architecture
Clear ownership boundaries

Public models are commonly used as trusted data products that other teams depend on.

Producer and Consumer Project Setup

To demonstrate model access behavior, we use two dbt Cloud projects:

Producer Project: Provides models for consumption
Consumer Project: References public models from the producer

This setup reflects real-world scenarios where one team publishes data assets and another team consumes them.

Configuring Groups in dbt

Groups in dbt define ownership and control access to models. They help manage responsibilities and enforce governance.

A group can be configured in YAML files and assigned to specific models or entire folders.

Assigning Groups to Individual Models

For example, an analytics group can be defined and assigned to models such as:

Orders per supplier country customer
Orders per customer
Orders per supplier
Orders per country

This configuration ensures that models follow consistent ownership and access rules.

Assigning Groups to Folders

Instead of configuring each model individually, teams can assign a group to an entire folder using the project configuration file. This approach simplifies governance and ensures consistent access settings.

Testing Private Model Access

Private models can only be referenced by models within the same group. If a model outside the group attempts to reference a private model, dbt returns an error.

To resolve this, the referencing model must be added to the same group configuration.

This behavior ensures:

Clear ownership boundaries
Reduced risk of unintended dependencies
Improved data security

Private access is particularly useful for internal transformations or staging logic that should remain hidden from external teams.

Using Protected Models

Protected models offer more flexibility. They can be referenced outside their group without requiring additional configuration.

This makes them ideal for:

Shared business logic
Reusable transformations
Internal reporting models
Organization-wide metrics

Protected access balances governance with usability.

Configuring Public Models for Cross-Project Use

Public models allow other projects to reference them, enabling collaboration across teams.

For public models to be visible to other projects, dbt Cloud requires:

A successful job run
An environment defined as staging or production
Metadata resolution by the dbt Cloud service

Once these conditions are met, public models become available to consumer projects.

This mechanism ensures that only validated and production-ready models are shared.

Star Schema Example with Mixed Access Levels

A typical data modeling setup might include a star schema with dimensions and fact tables.

In this scenario:

Most models are configured as public
Specific models, such as certain dimensions, may remain protected
Protected models are used internally by other models
Only necessary data is exposed externally

This design prevents unnecessary data exposure while maintaining efficient dependencies.

Exploring Model Ownership in the dbt Catalog

The dbt Catalog provides visibility into model ownership, access levels, and dependencies.

Within the Catalog, users can:

View model owners
Filter by access level
Explore group information
See associated models
Understand data lineage

This transparency improves governance and helps teams understand data responsibilities.

Referencing Public Models from Consumer Projects

Consumer projects can reference public models from producer projects using cross-project references.

These models appear in the development environment and lineage graph as external dependencies. However, consumer teams cannot build or modify them. The producer team retains full ownership.

This separation provides:

Clear responsibility boundaries
Reduced operational risk
Reliable shared data products
Improved collaboration

Understanding Lineage Graphs and Dependencies

The lineage graph provides a visual representation of how models relate to one another across projects.

It helps teams:

Track upstream and downstream dependencies
Understand data flow
Identify external data sources
Analyze impact of changes
Improve system reliability

For cross-project references, the lineage graph clearly shows the project and environment where external models originate.

Benefits of Model Access Control in dbt

Implementing structured model access provides significant advantages:

Strong data governance
Clear ownership structure
Secure data sharing
Scalable collaboration
Support for data mesh architecture
Reduced risk of breaking changes
Improved transparency

These capabilities help organizations scale their analytics operations effectively.

Best Practices for Managing Model Access

To maximize the benefits of dbt model access, organizations should follow these practices:

Define clear ownership groups
Use private models for internal logic
Expose only necessary data through public models
Validate models before sharing
Monitor dependencies using lineage graphs
Document access rules consistently

Following these guidelines helps maintain a reliable and scalable data ecosystem.

Conclusion

Model access in dbt Cloud enables organizations to control data visibility, manage ownership, and support cross-team collaboration. By configuring groups and defining access levels such as private, protected, and public, teams can build secure and scalable data architectures.

When combined with Catalog visibility and lineage tracking, these features provide a strong foundation for data governance and modern analytics workflows.

As data platforms continue to grow in complexity, structured access control becomes essential for ensuring trust, reliability, and collaboration across the organization.

Watch the Video

Dmytro Polishchuk In dbt Talk

Model Access in dbt: Governing Analytics at Scale with Groups

Model Access in dbt

As dbt projects grow, so do the challenges around collaboration, ownership, and reuse. What starts as a small analytics codebase can quickly evolve into a complex ecosystem of models shared across teams, domains, and even projects. Without clear boundaries, it becomes difficult to understand who owns what, which models are safe to reuse, and how to scale analytics without breaking downstream consumers.

This is where model access and groups come into play. Together, they form a powerful governance mechanism in dbt that helps teams structure responsibility, control visibility, and safely enable cross-project data sharing.

In this article, we’ll break down what groups are, how model access works, and how these features support scalable analytics and data mesh architectures.

In this article:

What Are Groups in dbt?
What Is Model Access in dbt?
How Groups and Model Access Work Together
Public Models and Cross-Project References
Why Model Access Matters for Data Mesh
Conclusion
Watch the Video

What Are Groups in dbt?

A group in dbt is a named collection of nodes within a project. Groups provide a way to logically organize resources and define ownership, which becomes increasingly important as more people and teams contribute to the same dbt codebase.

Groups can include the following node types:

Models
Tests
Seeds
Snapshots
Analyses
Metrics

It’s important to note that sources and exposures are not included in groups.

There are a few key rules to understand:

Each node can belong to only one group
Every group must have a name
Every group must have an owner

The owner definition requires at least a name and an email address. Typically, the owner represents a team rather than an individual, such as an Analytics or Finance team. This explicit ownership makes responsibilities visible and helps clarify who to contact when questions or issues arise.

Beyond organization, groups play a critical role in how dbt enforces model access rules. In fact, model access is defined and evaluated in the context of groups.

What Is Model Access in dbt?

Model access is a governance feature in dbt that controls how and where a model can be referenced. It works alongside other governance capabilities such as model contracts and model versions.

By assigning an access level to each model, teams can define which models are internal implementation details and which are intended for broader reuse.

dbt provides three access levels:

Private

Private models are the most restrictive. They can only be referenced by other models within the same group.

This is ideal for intermediate or helper models that support a specific team’s logic but are not meant to be consumed outside that context.

Protected

Protected is the default access level in dbt. Protected models can be referenced by any group within the same project, or by other projects if the project is installed as a package.

This level supports collaboration within a single dbt project while still preventing accidental exposure across project boundaries.

Public

Public models are designed to be consumed outside the project they are defined in. These models can be referenced from other dbt projects using cross-project references.

Public access is especially important for organizations adopting a data mesh approach, where teams expose trusted data products for others to consume.

One important limitation to keep in mind: models materialized as ephemeral cannot be public. Since ephemeral models are not materialized in the warehouse, they cannot be safely shared across projects.

Cross-project public access requires the Enterprise tier of dbt Cloud.

How Groups and Model Access Work Together

Groups and model access complement each other. While groups define ownership and responsibility, access modifiers define visibility and usage.

Consider the following example configuration:

groups:
  - name: analytics
    owner:
      name: Analytics team
      email: [email protected]

models:
  - name: orders_per_supplier_country_customer
    config:
      access: private
      group: analytics

  - name: orders_per_customer
    config:
      access: protected
      group: analytics

  - name: orders_per_country
    config:
      access: public
      group: analytics

In this setup, all models belong to the analytics group, but each has a different access level.

The private model can only be referenced within the analytics group. The protected model can be referenced anywhere inside the same project. The public model can be referenced by other projects entirely.

In the dbt DAG and lineage views, private and public models are visually marked, making it easier to understand boundaries at a glance.

This combination allows teams to clearly communicate intent:

Which models are internal building blocks
Which models are safe for internal reuse
Which models are stable data products

Public Models and Cross-Project References

Public models enable collaboration across dbt projects. This is particularly valuable when different teams manage separate projects but still need to share data in a controlled way.

To make public models available to downstream projects, the upstream project must meet two conditions:

An environment defined as PROD or STG
At least one successful job run in that environment

These requirements ensure that dbt has generated the necessary metadata for downstream reference resolution.

In the downstream project, the upstream project is added as a dependency:

# dependencies.yml
projects:
  - name: hub_speak_dmytro_base

Once declared, public models from the upstream project can be referenced using a two-argument ref() function:

{{ ref('hub_speak_dmytro_base', 'dim_customer') }}

This pattern allows downstream teams to consume only the models explicitly marked as public, while all internal logic remains private or protected in the upstream project.

Why Model Access Matters for Data Mesh

Model access is a foundational capability for implementing data mesh with dbt.

By treating public models as stable interfaces, teams can publish data products with clear contracts and ownership. Downstream consumers rely on these models without needing to understand or depend on upstream implementation details.

At the same time, teams retain full control over what they expose. Internal experimentation and refactoring can happen safely behind private and protected boundaries.

This approach enables autonomy without chaos — a key principle of data mesh.

Conclusion

As dbt projects scale, governance becomes essential. Groups and model access provide simple but powerful tools to define ownership, enforce boundaries, and enable safe reuse of analytics models.

By thoughtfully combining groups with private, protected, and public access levels, teams can scale collaboration, support cross-project data sharing, and build reliable data products without sacrificing flexibility.

In upcoming sessions, we’ll demonstrate these concepts hands-on in dbt Cloud, showing how producer and consumer projects interact and how these relationships appear in the Catalog and lineage views.

If you’re working toward a scalable analytics or data mesh architecture, mastering model access in dbt is a crucial step.

Watch the Video

Dmytro Polishchuk In dbt Talk

How to Monitor Data Reliability with dbt Source Freshness in dbt Cloud

How to Validate Data Freshness in dbt Cloud

Ensuring that your data is fresh, reliable, and aligned with your SLAs is one of the most important responsibilities of any analytics engineering or BI team. In modern data stacks, dbt Source Freshness plays a key role in validating that upstream systems are loading data as expected. When used properly, it helps teams identify delays, pipeline failures, or missing updates before they impact models and downstream reporting.

This article walks through a full demo of how to configure, run, and monitor Source Freshness checks in dbt Cloud. It builds on the fundamentals introduced in the first video of our series, where we explained what source freshness is, why it matters, and how dbt evaluates freshness. If you haven’t watched that introduction yet, we recommend doing so first.

In this second part, we go hands-on: reviewing source configurations, running freshness checks using both fields and custom SQL queries, applying optional filters, triggering warnings and failures, and inspecting the results in the dbt Cloud UI and Catalog. By the end, you’ll have a clear understanding of how to integrate freshness checks into your workflows and jobs to maintain a highly trustworthy data foundation.

In this article:

Understanding the Source Freshness Configuration
Running Freshness Checks in dbt Cloud
- Triggering a Warning
- Triggering a Failure
Including Freshness in dbt Cloud Jobs
- Option 1: "Run Source Freshness" Checkbox
- Option 2: Adding Freshness as a Job Step
Monitoring Freshness in the dbt Cloud Catalog
Key Takeaways
Watch the Video

Understanding the Source Freshness Configuration

The demo starts inside a dbt Cloud project with a YAML file containing our source definitions. For this walkthrough, we are working with a source called dbt_talk_demo_sources, which includes two tables:

customer_source
employee_source

Inside the configuration block, we define the core freshness thresholds:

warning_after: 30 minutes
error_after: 60 minutes

These settings tell dbt when to flag a source as slightly stale (warning) or critically outdated (error). They are typically aligned with SLAs and expectations for how often upstream data should be updated.

Using `loaded_at_field`

For the customer_source table, we use a basic configuration: the table includes an updated_at timestamp column, which dbt uses directly to calculate the freshness. However, the timestamps in this demo are recorded in CET (Europe/Berlin), which means dbt converts them to UTC before evaluating freshness. This highlights a common real-world consideration: time zones must always be handled consistently in freshness checks.

Using `loaded_at_query`

For employee_source, we use a different approach. This table does not store a timestamp column. Instead, the load timestamps are stored in a metadata table. To handle this, we configure a loaded_at_query—a SQL query that retrieves the latest load time externally. This method is often used when:

Timestamps come from an ETL metadata or logging table
Data loads use high-watermark patterns
You want more control over how freshness timestamps are calculated

In the demo, the query simply selects the MAX(updated_at) value from the metadata table. While simple, it demonstrates how flexible dbt is when working with custom data loading patterns.

Using Optional Filters

dbt also supports an optional filter configuration, which lets you skip certain rows when evaluating freshness. For example, if a table contains soft-deleted records or historical rows that should not count toward freshness checks, you can filter them out. In our demo, the filter excludes rows where deleted = TRUE, ensuring only active records contribute to the freshness calculation.

This becomes particularly useful when old records appear fresher than the latest valid ones, which could skew your results or hide actual issues.

Running Freshness Checks in dbt Cloud

With the configuration in place, we run our first freshness check via the CLI:

dbt source freshness

Before running the check, we insert new rows into the source tables so that the latest data delay is around 20 minutes. Since this is below both the warning and error thresholds, the run reports everything as green.

This confirms that both types of configurations—loaded_at_field and loaded_at_query—are working as expected.

Triggering a Warning

Next, we enable the earlier-mentioned filter configuration on customer_source. After filtering out the deleted rows, the next valid record has a delay of about 50 minutes. When we run:

dbt source freshness -s source:dbt_talk_demo_sources.customer_source

dbt reports a warning state, because the threshold of 30 minutes is exceeded. This demonstrates how filtering can impact the evaluation in meaningful ways.

Triggering a Failure

To understand how a failed freshness check behaves, we insert data with delays exceeding the 60-minute error threshold. Running the same command again produces an error state. The dbt output also shows the exact SQL query it executed to determine freshness—useful when troubleshooting unexpected results.

Including Freshness in dbt Cloud Jobs

dbt Cloud provides two ways to incorporate freshness checks into scheduled jobs:

Option 1: “Run Source Freshness” Checkbox

With this option enabled, dbt automatically runs dbt source freshness as the first step of the job. However, failures do not stop the rest of the job from executing. This mode is ideal when you want visibility but don’t want freshness violations to block model builds.

Option 2: Adding Freshness as a Job Step

Alternatively, you can include freshness checks as an explicit job step. In this case, if freshness fails, subsequent steps are skipped and the job fails. This is the preferred option when:

Data reliability is critical
Your models depend on up-to-date sources
You want strong enforcement of data SLAs

The demo shows examples of both approaches, so you can choose which one best fits your project needs.

Monitoring Freshness in the dbt Cloud Catalog

dbt Cloud makes it easy to monitor freshness results long after the run completes. In the Catalog, you can drill down into each source and see the most recent freshness status, including warnings and errors. This gives data teams better visibility into upstream issues without needing to dive into logs.

For example, in our demo environment, the Catalog displays a warning icon for dbt_talk_demo_sources. Opening the source reveals the individual freshness statuses for each table. This is especially helpful in larger projects where tracking freshness manually would be impractical.

Key Takeaways

This demo highlights the full power and flexibility of dbt Source Freshness in real-world analytics environments. Here are the main lessons:

Freshness thresholds provide an essential guardrail for data reliability.
loaded_at_field is simple when the timestamp is in the table.
loaded_at_query enables more advanced scenarios using external metadata.
Filters help refine which rows count toward freshness.
dbt distinguishes between OK, warning, and error states in a clear, actionable way.
Freshness checks can run as part of your dbt Cloud jobs with configurable strictness levels.
The dbt Cloud Catalog provides ongoing visibility into the freshness of all sources.

By combining these tools, you can ensure your source data stays timely, trustworthy, and perfectly aligned with your organization’s SLAs. This ultimately improves downstream analytics quality, enhances user confidence, and reduces the risk of building insights on outdated data.

Watch the Video

Dmytro Polishchuk In dbt Talk

dbt Source Freshness: Ensuring Reliable and Timely Data in Your Pipeline

dbt Source Freshness

Data teams rely on timely, accurate, and complete data to support dashboards, KPIs, reporting, and data-driven decision making. But even the most advanced data models and transformation logic cannot fix one critical issue: stale or outdated upstream data. This is where dbt Source Freshness becomes one of the most valuable quality checks in your analytics engineering toolkit.

In this article, we take a close look at what dbt Source Freshness is, why it matters, how it works under the hood, and how you can configure and run freshness checks both locally and in dbt Cloud. If your organization depends on reliable data pipelines—or if you’ve ever discovered too late that a report was built on old data—this guide will help you avoid those costly surprises.

In this article:

What Is dbt Source Freshness?
Why Source Freshness Matters
How dbt Source Freshness Works
- loaded_at_field
- Thresholds: warn_after and error_after
What Happens During a Freshness Check?
How to Run Freshness Checks in dbt
Running Freshness Checks in dbt Cloud
Why Freshness Checks Should Be a Standard Practice
Watch the Video

What Is dbt Source Freshness?

Source freshness in dbt is a built-in mechanism that measures how up-to-date data is in your defined source tables. While data transformations can apply logic, aggregations, and business rules, they inherently depend on data arriving on time. If source data is delayed, incomplete, or entirely outdated, every model downstream will reflect that delay.

dbt Source Freshness provides a simple, reliable indicator of whether the data you are working with is fresh enough to support your operational and analytical processes. It helps you answer one crucial question:

“Is the data I’m transforming actually the latest data available?”

When enabling freshness checks, dbt evaluates the most recent timestamp from a specified column in your source table and determines whether that timestamp violates your defined freshness thresholds. These thresholds act as data SLAs for your pipeline.

Why Source Freshness Matters

The importance of monitoring source data freshness cannot be overstated. When upstream data is stale, the consequences cascade throughout your entire analytics ecosystem. Dashboards may show outdated KPIs. Operational teams may make decisions based on incomplete numbers. Forecasts and reports may misrepresent the true state of the business.

One scenario that many data teams have encountered illustrates the problem perfectly: a business report runs on what everyone assumes is the latest data. After a few weeks, the team discovers that the upstream system had stopped updating its tables entirely. What appeared to be fresh data was actually months old. As a result, the report generated incorrect metrics for an extended period.

With source freshness monitoring in place, delays like these can be caught immediately. dbt highlights them clearly, allowing teams to:

Detect upstream system failures.
Identify delays in ingestion or replication pipelines.
Enforce data delivery SLAs with source system owners.
Stop inaccurate transformations from running on stale data.

Freshness checks turn what could be a hidden issue into a transparent, actionable signal.

How dbt Source Freshness Works

Source freshness configuration lives directly inside the YAML file where your source is defined. This design decision is intentional—freshness belongs to the source, not to downstream models. Each source or table can have its own customized freshness rules.

A typical source block with freshness configuration looks like this:

sources:
  - name: my_source
    tables:
      - name: orders
        freshness:
          warn_after: {hours: 24}
          error_after: {hours: 48}
        loaded_at_field: updated_at

Let’s break down the key components.

loaded_at_field

This is the timestamp column dbt uses to determine when the most recent record arrived. dbt queries this field, finds the newest timestamp, and calculates its age relative to the current time.

Important: dbt always evaluates freshness in UTC time. If your source system stores local timestamps (e.g., CET, EST), the value in loaded_at_field must be converted to UTC.

Thresholds: warn_after and error_after

Freshness thresholds define what “fresh enough” means. dbt compares the age of the newest record with these time limits and returns one of three statuses:

pass – the data is within the acceptable freshness window.
warn – the data is late but not critically late.
error – the data is beyond the maximum acceptable age.

These thresholds effectively act as SLAs, helping teams formalize expectations about data arrival. For example:

Warn after 24 hours.
Error after 48 hours.

If the source table hasn’t received new records in over 48 hours, dbt marks the freshness check as an error, signaling that the table is unreliable until updated.

What Happens During a Freshness Check?

When you run a freshness check, dbt performs a straightforward but effective procedure:

dbt queries the loaded_at_field and finds the most recent timestamp.
It calculates the time difference between that timestamp and the current UTC time.
It compares the age of the data to your defined thresholds.
It returns a pass, warn, or error result.

This process is intentionally lightweight and fast. It avoids unnecessary complexity while giving teams a dependable, high-value signal about upstream data timeliness.

How to Run Freshness Checks in dbt

Running a freshness check in dbt is simple. The main command is:

dbt source freshness

This command evaluates freshness for all sources that have freshness configurations defined. You can also target a specific source or table:

dbt source freshness --select source:my_source
dbt source freshness --select source:my_source.orders

When executed, dbt displays the freshness status for each table along with metadata such as:

The latest timestamp found.
The calculated age of the data.
The threshold values used.

Running Freshness Checks in dbt Cloud

dbt Cloud makes managing freshness checks even easier. You can create a dedicated job that runs only freshness checks, or you can add freshness as a step in a larger job. This enables automatic monitoring without requiring manual execution.

Once the job completes, results appear directly in the dbt Cloud UI. For each table, you can see:

The age of the most recent record.
Whether the table passed, warned, or errored.
When the freshness check was last executed.

You can also inspect the detailed logs to understand exactly how dbt evaluated each source.

Why Freshness Checks Should Be a Standard Practice

In modern analytics engineering, data reliability is just as important as transformation logic. Freshness checks are a lightweight yet powerful way to ensure that your source systems are delivering data on time.

Without freshness checks, data issues may go unnoticed until they have already impacted dashboards, stakeholder decisions, or downstream processes. With freshness monitoring enabled, you gain visibility into problems early, allowing your team to respond quickly and prevent incorrect reporting.

As data ecosystems grow more complex—with multiple ingestion pipelines, third-party APIs, and event-based systems—freshness checks provide a simple, standardized way to maintain trust in your data.

Watch the Video

Dmytro Polishchuk In dbt Talk

How to Publish Power BI Reports with the dbt Semantic Layer

Publishing Power BI Reports with the dbt Semantic Layer

Welcome back to our two-part guide on connecting Power BI to the dbt Semantic Layer. In Part 1, we demonstrated how to connect Power BI Desktop to dbt Cloud and build a simple dashboard using semantic metrics. That was just the beginning.

In this article, we continue the journey by publishing that report to Power BI Service and configuring the On-premises Data Gateway to ensure the connection to the dbt Semantic Layer remains alive. This is a crucial step for moving from a personal development workflow to a shared, production-ready BI environment.

In this article:

Why Power BI Service Needs a Gateway
Step 1: Publish the Report to Power BI Service
Step 2: Install the On-Premises Data Gateway
Step 3: Configure Custom Connector Support
Step 4: Configure the Gateway in Power BI Service
Step 5: Validate the Connection
Summary of the Workflow
Best Practices
Conclusion
Additional Resources
Watch the Video

Why Power BI Service Needs a Gateway

When you publish a Power BI Desktop report connected to the dbt Semantic Layer, the connection details don’t automatically carry over to Power BI Service. Instead, you’ll likely see an error message such as “The model cannot be loaded”. This happens because:

Power BI Service cannot directly use the desktop connection credentials.
A secure, always-online connection is required to keep queries alive.

The solution is to use the On-premises Data Gateway. This component securely bridges Power BI Service with your data sources—whether on-premises or cloud-based—so your dashboards can refresh seamlessly.

Step 1: Publish the Report to Power BI Service

We’ll start where we left off in Part 1, with a working report in Power BI Desktop named DEMO – dbt Semantic Layer.

Open the report in Power BI Desktop.
Click the Publish button.
Choose a workspace (for this demo, we’ll use “My workspace”).
Open Power BI Service and locate the published report.

At this stage, you’ll notice that the report cannot load data. This is expected and is exactly why we need to configure the gateway.

Step 2: Install the On-Premises Data Gateway

To enable the connection, download and install the On-premises Data Gateway in Standard mode. The dbt Semantic Layer connector is not supported in Personal mode at this time.

Download link: Get the Power BI Gateway (Microsoft).

During installation:

Choose Standard mode.
Follow the guided steps and complete the setup.
Launch the gateway after installation.
Sign in with your Power BI Service account to register the gateway.

At this point, your gateway should show as online and ready to use.

Step 3: Configure Custom Connector Support

The dbt Semantic Layer connection is enabled through a custom Power BI connector. This requires a manual file copy:

Locate the connector file on your machine. By default, it is stored in:
C:\Users\<YourUsername>\Documents\Power BI Desktop\Custom Connectors\dbtSemanticLayer.pqx
Copy this file into the gateway’s custom connector directory:
C:\Windows\ServiceProfiles\PBIEgwService\Documents\Power BI Desktop\Custom Connectors
Restart the gateway if needed. The connector should now be detected under the Connectors section of the gateway.

This step is critical—without copying the connector, the gateway will not recognize the dbt Semantic Layer.

Step 4: Configure the Gateway in Power BI Service

With the gateway online and the custom connector installed, it’s time to configure the connection inside Power BI Service:

In Power BI Service, navigate to Settings → Manage Connections and Gateways.
Select your gateway from the list of available gateways.
Go to the gateway Settings and allow the use of Custom Connectors. Without this option, the connection will fail.
Next, go to Settings → Power BI Settings → Semantic Models.
Locate your published report (DEMO – dbt Semantic Layer).
Under Gateway and Cloud Connections, add your connector to the gateway and provide the required host, environment ID, and service token values.

Once these steps are complete, your Power BI Service report should start loading live data from the dbt Semantic Layer.

Step 5: Validate the Connection

Open your report in Power BI Service. This time, instead of the “model cannot be loaded” error, you should see the data loading and the dashboard visuals appearing. This confirms that the gateway is successfully maintaining the connection to the dbt Semantic Layer.

Summary of the Workflow

Let’s recap what we’ve achieved across both parts of this series:

Part 1: Connected Power BI Desktop to the dbt Semantic Layer and built a working dashboard using semantic metrics.
Part 2: Published the report to Power BI Service, configured the On-premises Data Gateway in Standard mode, enabled custom connectors, and restored connectivity to the dbt Semantic Layer.

The end result? A Power BI dashboard that queries dbt metrics live, end-to-end, even in the Service environment.

Best Practices

While the demo used a local laptop, in production environments the On-premises Data Gateway should be installed on a reliable, always-on machine, such as:

An on-premises server.
A cloud-hosted virtual machine (e.g., Azure VM, Amazon EC2).

This ensures the connection remains stable, and report refreshes happen as scheduled without interruption.

Conclusion

Connecting Power BI Service to the dbt Semantic Layer is a powerful way to extend governed, consistent metrics to your entire organization. By configuring the On-premises Data Gateway and enabling custom connectors, you ensure that your published reports continue to deliver real-time insights based on dbt’s semantic definitions.

If this guide was helpful, make sure to check out Part 1 if you haven’t already, and stay tuned for more deep dives into the dbt platform and BI integrations.

Additional Resources

Watch the Video

Dmytro Polishchuk In dbt Talk

How to Connect to the dbt Semantic Layer Through Power BI

Connecting Power BI to the dbt Semantic Layer

As organizations increasingly rely on data-driven decision making, the ability to connect business intelligence tools directly to semantic layers becomes essential. One of the most common requests we’ve heard from our community is: “How can I connect Power BI to the dbt Semantic Layer to expose metrics?”

In this guide, we’ll walk through the entire process of setting up and connecting Power BI with the dbt Semantic Layer. By the end, you’ll be able to query dbt metrics directly in Power BI and build interactive dashboards that stay in sync with your semantic models.

In this article:

Why Connect Power BI with the dbt Semantic Layer?
Pre-Requisites
Step 1: Review the dbt Semantic Layer Setup
Step 2: Collect Required Credentials in dbt Cloud
Step 3: Install the dbt Semantic Layer Power BI Connector
Step 4: Connect Power BI to the dbt Semantic Layer
Step 5: Build a Simple Dashboard
Step 6: What’s Next?
Conclusion
Additional Resources
Watch the Video

Why Connect Power BI with the dbt Semantic Layer?

The dbt Semantic Layer allows teams to define business metrics and dimensions in one central place. Instead of duplicating logic across BI tools, analysts and business users can rely on consistent definitions for KPIs such as revenue, churn, or order count. When Power BI is connected to this layer, dashboards automatically reflect the same trusted metrics already defined in dbt.

This integration helps:

Maintain consistency in metric definitions across the organization.
Reduce manual work for analysts when creating Power BI reports.
Ensure real-time access to governed data models.

Pre-Requisites

Before starting, make sure you have the following:

A working dbt Cloud project with a configured Semantic Layer.
Permission to create or access a Service Token in dbt Cloud.
Installed version of Power BI Desktop.
Internet access to download the dbt Semantic Layer Power BI connector.

Step 1: Review the dbt Semantic Layer Setup

If you’re not familiar with how the dbt Semantic Layer is configured, check out Hernan Revale’s detailed session on setting up metrics and dimensions in dbt Cloud:
Watch the dbt Semantic Layer session here.

Step 2: Collect Required Credentials in dbt Cloud

Navigate to your dbt Cloud Dashboard and head to:

Settings → Semantic Layer or Settings → Edit Semantic Layer

Here, confirm or configure the following:

Credentials for the deployment environment where your semantic models run.
A Service Token linked to the Semantic Layer. If you don’t have permission, ask your dbt admin to create one.
Your Environment ID and Host, which will be used in Power BI during connection setup.

Important: Store your Service Token securely. You’ll need it to authenticate Power BI.

Step 3: Install the dbt Semantic Layer Power BI Connector

Download the Power BI connector for dbt Semantic Layer from the official documentation:
Download the connector here.

Run the installer and follow the on-screen steps. After installation, verify it by checking the list of available drivers in the ODBC Data Sources. The dbt Semantic Layer connector should now appear in the list.

Step 4: Connect Power BI to the dbt Semantic Layer

Now that everything is set up, it’s time to establish the connection:

Open Power BI Desktop and start a blank report.
Search for dbt Semantic Layer in the available connectors.
Accept the beta notice (as the connector is still under development).
Provide the required details:
- Host
- Environment ID
- Service Token
Choose DirectQuery (Import is not yet supported).
Click Load to access your metrics.

Step 5: Build a Simple Dashboard

Once the semantic model is loaded, you’ll see your dbt metrics in the Power BI fields pane. You can now build visualizations just like you would with any other dataset. For example:

Create a stacked column chart with “Orders Total” on the Y-axis and “Customer Region” on the X-axis.
Add slicers for Region, Market, or Segment to filter the data dynamically.
Include a card visualization to highlight key metrics such as total revenue.

At this point, your Power BI dashboard is fully connected to the dbt Semantic Layer. Metrics are updated live and reflect the definitions you’ve configured in dbt Cloud.

Step 6: What’s Next?

In this tutorial, we focused on connecting Power BI Desktop to the dbt Semantic Layer. In the next part of this series, we’ll publish the report to Power BI Service and explain how to retain dbt connectivity in a collaborative environment.

Stay tuned for the next video and article, and don’t forget to subscribe to our channel for updates.

Conclusion

Connecting Power BI to the dbt Semantic Layer is a powerful way to bring consistent, governed metrics directly into your BI environment. With a few configuration steps, you can ensure that every report and dashboard your team creates in Power BI leverages the same trusted metric definitions managed in dbt.

This setup not only accelerates dashboard creation but also strengthens data governance across your organization. As the connector continues to evolve, we can expect even smoother integrations and more functionality in the near future.

Additional Resources

Official dbt Documentation
dbt Semantic Layer Setup Session (YouTube)
Upcoming video: Publishing Power BI Reports with dbt Semantic Layer Access

Watch the Video

Dmytro Polishchuk In dbt Talk

DBT Snapshots as Sources for Data Vault Powered EDW

DBT Snapshots Question

This topic was inspired by a viewer’s question: “How can you build a Data Vault data warehouse downstream from dbt snapshots?”

In this article:

What is a Snapshot in dbt?
Demo: Data Vault Integration with dbt Snapshots
- Source Data: Snowflake Sample Data
- Target: Raw Data Vault Model
Conclusion
Watch the Video

What is a Snapshot in dbt?

A snapshot in dbt is a tool that captures and preserves changes in data for tables that may be updated over time. This enables historical analysis and ensures that changes in data are properly recorded.

Key Features of dbt Snapshots

Definition: Snapshots capture historical data by preserving changes over time.
Purpose: They allow analysts to look back at previous versions of data, supporting historical analysis, auditing, and compliance.
Mechanism: Snapshots are built on Slowly Changing Dimensions (SCD) Type 2.
Value: Essential for auditing, regulatory compliance, and data analysis.

To create snapshots, use the command:

dbt snapshot

They are also included in the broader dbt build command. Notably, snapshots cannot be rebuilt in a direct way to prevent accidental loss of historical data.

Metadata Fields

dbt adds four key metadata fields to snapshot tables:

dbt_scd_id: Unique identifier for change tracking.
dbt_updated_at: Timestamp of the latest update.
dbt_valid_from: Start timestamp of the record.
dbt_valid_to: End timestamp (when applicable).

Starting with dbt version 1.9, these field names are customizable.

Snapshot Strategies in dbt

Timestamp Strategy: Tracks changes using an updated_at column. Recommended due to its performance.
Check Strategy: Compares column values for changes. Useful when there is no reliable timestamp column but can be less efficient.

Handling Hard Deletes

By default, dbt does not track hard deletes. However, enabling invalidate_hard_deletes: true ensures that deleted records are marked with an updated dbt_valid_to timestamp. This is crucial for effectivity satellites in Data Vault modeling.

Demo: Data Vault Integration with dbt Snapshots

Source Data: Snowflake Sample Data

For the demo, we use Snowflake’s TPCH sample dataset:

Database: snowflake_sample_data
Schema: tpch_sf1
Selected Tables: customer, orders, part, lineitem

Each table has a corresponding dbt snapshot:

snap_customer
snap_orders
snap_part
snap_lineitem

The snapshot configurations:

Strategy: Check (due to lack of reliable timestamp columns)
Hard Deletes: Enabled (invalidate_hard_deletes: true)

Target: Raw Data Vault Model

The Data Vault model consists of:

Hubs: customer_h, order_h, part_h
Links: orders_l (customer-order relationship), orders_parts_l (order line items)
Satellites: customer_s, order_s, part_s, orders_parts_s
Effectivity Satellites: Track historical changes with dbt_valid_from and dbt_valid_to

During the demo, we focus on the customer entity, tracking how changes in the customer table are reflected in its corresponding Data Vault satellites.

Conclusion

Using dbt snapshots as a persistent staging area (PSA) for a Data Vault-powered EDW enables accurate historical tracking. The combination of snapshot strategies and effectivity satellites ensures that data lineage and changes are well-documented and auditable. By leveraging Snowflake’s sample datasets, we can efficiently test and refine Data Vault implementations.

Watch the Video

Dmytro Polishchuk In dbt Talk

Microbatch Incremental Models: A New Approach to Large Time-Series Data

What is a Microbatch?

Microbatch is an innovative incremental strategy designed for large time-series datasets. Introduced in dbt Core version 1.9 (currently in beta), it complements existing incremental strategies by offering a structured and efficient way to process data in batches.

In this article:

How Microbatches Work
Microbatch Model Configurations
Running Batches in Parallel
How to Backload Data
Microbatch vs. Other Incremental Strategies
Conclusion
Watch the Video

Key features of Microbatch include:

Utilizes a time column to define batch ranges.
Supports reprocessing failed batches.
Auto-detects parallel batch y
Eliminates complex conditional logic for backfilling.

However, it’s not suitable for datasets lacking a reliable time column or requiring fine-grained control over processing logic.

How Microbatches Work

Microbatching works by splitting model processing into multiple queries (batches) based on:

event_time: The time column defining batch ranges.
batch_size: The time period for each batch (hour, day [default], month, year).

Each batch functions as an independent, atomic unit, meaning:

Batches can be processed, retried, or replaced individually.
Parallel execution enables separate, idempotent batch processing.

Batch replacement strategies vary by database adapter:

Postgres: Uses merge.
BigQuery, Spark: Uses insert_overwrite.
Databricks: Uses replace_where.
Redshift, Snowflake: Uses delete + insert.

Microbatch Model Configurations

When setting up a Microbatch model, the following configurations are required:

event_time: Specifies the time column in UTC.
batch_size: Defines batch granularity (hour, day, month, year).
begin: Sets the start point for initial or full-refresh builds.

Optional configurations include:

lookback: Processes prior batches for late-arriving records.
concurrent_batches: Controls parallel execution (auto-detected by default).

Running Batches in Parallel

Parallel execution is automatically detected based on batch conditions and adapter support. However, users can override this behavior using the concurrent_batches setting.

Parallel execution is possible when:

The batch is neither the first nor last in the sequence.
The database adapter supports parallel execution.
The model logic does not depend on execution order.

How to Backload Data

Backloading allows reprocessing historical data within a specific time range using the following command:

dbt run --event-time-start "2025-02-01" --event-time-end "2025-02-03"

This ensures that only batches within the defined range are processed independently.

Microbatch vs. Other Incremental Strategies

Microbatch differs from traditional incremental strategies by:

Using independent queries for time-based batches.
Eliminating the need for is_incremental() and complex SQL logic.
Automatically selecting the most efficient operation (insert, update, replace) for each platform.

Conclusion

Microbatch is a powerful new approach to incremental data processing in dbt Core. By breaking down large datasets into manageable, parallelizable chunks, it simplifies data modeling while improving efficiency and scalability. However, it is essential to consider whether Microbatch suits your data pipeline’s requirements before implementing it.

Watch the Video

Dmytro Polishchuk In dbt Talk

Snapshots in dbt

Watch the Video

Snapshots in dbt: A Quick Overview

dbt snapshots allow you to “look back” at historical data by capturing changes in your database tables. This is achieved by implementing type-2 Slowly Changing Dimensions (SCDs), which track how a row has changed over time. For example, you can keep track of an order’s status as it moves from ‘pending’ to ‘shipped’ to ‘delivered’.

In this article:

How Snapshots Work
Snapshots Best Practices

How Snapshots Work

When you run dbt snapshot, it creates a new table with the same columns as your source table, plus additional metadata columns like dbt_valid_from and dbt_valid_to. On subsequent runs, dbt updates the dbt_valid_to column for changed records and adds new records with dbt_valid_to set to null.

dbt offers different strategies for detecting changes, including comparing timestamps, specific column values, or a unique key. You can also configure dbt to track deletions.

Snapshots Best Practices

Store snapshots in a separate schema to keep your historical data organized.
Snapshot raw data using the source function and select all columns.
Avoid joins in your snapshot queries; instead, create separate snapshots and join them downstream.
If you need to perform transformations, do so in ephemeral models before snapshotting.
Schedule snapshots to run frequently to capture changes regularly.

By following these best practices, you can ensure that your dbt snapshots are accurate, efficient, and easy to maintain.

Dmytro Polishchuk In Beginner, Data Tools

The Potentials of Microsoft Fabric for Business Intelligence Solutions

Microsoft Fabric for Business Intelligence

Microsoft Fabric is an all-in-one cloud-based analytics platform that provides a unified environment for data professionals and business users to collaborate on data solutions. It is a powerful analytics platform that helps businesses automate workflows, improve productivity, and gain insight from their data.

In today’s data-driven world, businesses are increasingly turning to business intelligence (BI) solutions to gain insight from their data. BI solutions can help businesses improve their decision-making, optimize their operations, and gain a competitive edge.

The Potentials of Microsoft Fabric for Business Intelligence Solutions

Microsoft Fabric is a new and innovative data analytics platform that has the potential to revolutionize the way businesses make decisions. With Fabric, organizations can ingest, store, process, and analyze all of their data in one place, using a unified set of tools and services. This makes it possible to get insights from data faster and easier than ever before. We will explore the potential of Microsoft Fabric for business intelligence solutions. We will discuss how Fabric can be used to improve data quality, streamline analytics workflows, and deliver insights to users. This webinar is interesting for everyone wanting to learn more about how to use Microsoft Fabric to get more value from data. And even more interesting for Data engineers and analysts, Business intelligence professionals, and IT decision-makers. Register now for this free webinar and learn how Microsoft Fabric can help you take your business intelligence solutions to the next level!

Watch webinar recording

In this article:

What to expect
What is Microsoft Fabric?
Workloads of Microsoft Fabric
Data Lake
So What Are Microsoft Fabrics Potentials?
Conclusion

What to expect

This newsletter will take you on a journey to discover the transformative power of Microsoft Fabric for business intelligence solutions. You will explore the various workloads and experiences that Microsoft Fabric offers, gaining a comprehensive understanding of its capabilities. You will also uncover the benefits that Microsoft Fabric brings to data-driven decision-making, enabling you to make informed choices that propel your business forward. Finally, you will delve into the potential that Microsoft Fabric holds for enhancing business intelligence solutions, empowering you to unlock new levels of insight and decision-making capabilities.

To dive even deeper, watch the webinar recording about this topic for free. Click here to register.

What is Microsoft Fabric?

Microsoft Fabric is an all-in-one cloud-based analytics platform that provides a unified environment for data professionals and business users to collaborate on data solutions. Fabric offers a suite of integrated services that enable you to collect, store, process, and analyze data in a single platform, which is built on a foundation of Software-as-a-Service (SaaS).

Microsoft Fabric provides tools for people with all levels of data expertise and connects with the tools businesses use to make decisions.
Fabric itself is an umbrella for the following Microsoft cloud-based services that constitute Microsoft Analytics Portfolio:

Azure Data Factory
Azure Event Hubs
Azure Data Explorer
Azure Artificial Intelligence
Azure Databricks
Azure Synapse Spark Pools
Azure Synapse Analytics
and Microsoft Power BI

The above items have been re-tooled and taken to the next level inside the Microsoft Fabric.

Workloads of Microsoft Fabric

Fabric includes the following workloads or experiences:

Data integration
Data Engineering
Data Warehousing
Data Science
Real-time analytics
Business intelligence
Insight to action

The foundation of these experiences in Fabric is the data lake, which is known as OneLake. The below picture illustrates the concept of Microsoft Fabric.

Let’s review every component of Fabric in a bit more detail.

Data Integration

Microsoft Fabric’s data integration workload, called Data Factory, brings data movement capabilities to both dataflows and data pipelines.

Dataflows offer flexible and user-friendly ways to transform data with over 300 transformations. It is built on the familiar Power Query experience, which is available in several Microsoft products and services, such as Excel, Power BI, Power Platform, and others.
Data pipelines let you create flexible data workflows to meet your organizational goals. You can use the built-in data orchestration features to refresh your dataflows, process large datasets, and define complex control flow pipelines.

Data Engineering

Data Engineering Experience or Synapse Data Engineering offers a top-tier Spark platform, fostering rich authoring experiences. This empowers data engineers to conduct extensive data transformations and facilitates widespread access to data via the lakehouse.

Microsoft Fabric provides various data engineering capabilities to ensure that your data is easily accessible, well-organized, and of high quality. From the data engineering homepage, you have the following options:

Lakehouse
Create and manage Lakehouse. Lakehouse is a logical location in OneLake where you store and manage structured and unstructured data using various tools and frameworks. You can even mount an external storage account into your Lakehouse with the Shortcut feature.
You can use the SQL Endpoint to query Lakehouse tables, but only read-only queries are supported.
Notebook
Write and run code in popular programming languages, like Python, R, and Scala. Leverage notebooks for data ingestion, transformation, analysis, and other data processing tasks.
Environment
Within an environment, you have the flexibility to choose from a variety of Spark runtimes, configure your computational resources, and incorporate libraries – either from public repositories or by uploading locally-built custom libraries. Attaching these environments to your notebooks and Spark job definitions is a seamless process.
Spark Job Definition
Define, schedule, and manage Spark jobs to process big data in your lakehouse, apply transformation logic to the data, and more.
Data Pipeline
Design and orchestrate pipelines to copy data into your lakehouse, schedule Spark jobs and notebooks to process high-volume data, and automate data workflows via integration with Data Factory.

Data Warehousing

Data warehouse or Synapse Data Warehouse is a lake-centric repository for storing and analyzing structured data built on a distributed processing engine. Data warehousing workload benefits from the rich capabilities of the SQL engine over an open Delta Lake format, which are parquet files published as Delta Lake Logs and stored in OneLake. Delta Lake Logs enable ACID transactions. A warehouse can contain only structured data. Here you can not only read data with SQL but also execute inserts and updates.

Data Science

The Data Science Experience, or Synapse Data Science, allows for the creation, deployment, and operationalization of machine learning models. Data scientists are empowered to enrich organizational data with predictions and allow business analysts to integrate those predictions into their BI reports. Data Science in Fabric provides you with the following options:

ML model
Leverage machine learning models to forecast results and identify irregularities within datasets.
Experiment
Engage in the experimentation phase by generating, executing, and monitoring the evolution of various models for validating hypotheses.
Notebook
Utilize the Notebook feature to delve into data exploration and construct machine learning solutions through Apache Spark applications.
Environment
This option has the same purpose as in Data Engineering experience.

Real-Time Analytics

Real-Time Analytics in Fabric or Synapse Real-Time Analytics is a fully managed big data analytics platform optimized for streaming and time-series data. It utilizes a Kusto Query Language (KQL) — an engine with exceptional performance for searching structured, semi-structured, and unstructured data. Real-Time Analytics is fully integrated with the entire suite of Fabric products for data loading, data transformation, and advanced visualization scenarios. There are three components of the Real-Time Analytics in Fabric:

KQL Database — the place to store streaming data. It uses OneLake as the underlying storage system.
KQL Queryset — run queries on your data to produce shareable tables and visuals. Save, manage, export, and share KQL queries. KQL Queryset is an analog of SSMS for a SQL Database.
Eventstream — capture, transform, and route real-time event stream to various destinations in the desired format with no-code experience. It is a hub of streaming data, where multiple sources (including Event Hub) and various destinations (including KQL database) can be set.

Business Intelligence

From the BI side, Fabric has Power BI, one of the world’s leading Business Intelligence platforms. It represents a set of services, tools, and connectors that turn your data into interactive visual reports and dashboards. In Fabric, there are some new features and enhancements to work with Fabric objects and experiences. Among these are:

Direct Lake mode connection that is based on loading parquet-formatted files directly from a data lake without the need to import the data into a data set.
Integration with Synapse Real-Time Analytics to produce real- or near-real-time reports.
Semantic link that allows loading data from Power BI data sets into Data Science experience and others.

Insight to Action

Insight to action in Fabric offers Data Activator, which is an experience for automatically triggering actions when certain conditions are met in the incoming data. It works with Eventstreams for real-time data and Power BI for batch data. There are three possible actions that can be configured after the conditions are detected:

Email — get notified by email.
Teams message — send a notification to an individual or a channel in Teams.
Custom action — perform a custom action to call Power Automate workflow.

Data Lake

OneLake is the central component of all Fabric services. It is a built-in, unified data lake that stores all organizational data and is used by all Fabric experiences. OneLake uses ADLS Gen2 as its underlying storage.

To simplify management across the organization, OneLake is organized hierarchically. Each tenant has only one OneLake instance, which provides a single namespace that extends over users, regions, and even clouds. For easy handling, data in OneLake is divided into manageable containers.

Similar to Microsoft OneDrive, any developer or business unit in the tenant can create their own workspaces in OneLake. They can ingest data into their own lakehouses, and start processing, analyzing, and collaborating on the data. All Fabric experiences are bound to and operate on top of OneLake.

So What Are Microsoft Fabrics Potentials?

Microsoft Fabric has the potential to revolutionize the way that businesses build and deploy business intelligence solutions. Here are some of the key potentials:

Unified data platform: Microsoft Fabric provides a unified data platform for storing and analyzing all types of data, including structured, semi-structured, and unstructured data. This means that businesses can use a single set of tools and services to manage all aspects of their data pipeline, from data ingestion and preparation to data analysis and visualization. This can help to reduce complexity and improve efficiency.
End-to-end analytics: Microsoft Fabric provides a complete set of capabilities for building and deploying end-to-end analytics solutions. This includes data engineering, data science, data warehousing, and business intelligence. This simplifies the development and deployment of analytics solutions and reduces the need for specialized expertise.
AI-powered insights: Microsoft Fabric embeds AI and machine learning capabilities throughout the platform. This can be used to automate tasks, generate insights, and improve the accuracy of predictions. This can help businesses to make better decisions faster.
A unified platform accommodating diverse expertise: Data professionals with varying backgrounds can leverage their skill sets to collaborate on a singular data solution. This inclusive environment allows individuals from different disciplines and proficiency levels to contribute their expertise to a common data project.
Reduced data silos: It eliminates the need to move data between various systems, making it easier to get a complete view of your business.
Reduced costs: Microsoft Fabric can help businesses reduce the costs associated with building and deploying business intelligence solutions. This is because it eliminates the need to purchase and maintain multiple systems and it provides a number of features that can automate tasks and improve efficiency.

In essence, Microsoft Fabric holds the promise of simplifying and rendering the development and deployment of robust business intelligence solutions more accessible and cost-effective for businesses, regardless of their scale or size.

Conclusion

Microsoft Fabric represents a paradigm shift in the realm of business intelligence solutions. With its unified data platform, end-to-end analytics capabilities, and AI-driven insights, Fabric streamlines data management from collection to visualization. The platform’s ability to cater to diverse skill sets and backgrounds, reducing data silos, and automating tasks illustrates a remarkable potential to revolutionize the industry.

Through its components, such as Data Factory, Data Engineering, Data Warehousing, Data Science, Real-Time Analytics, Power BI, Data Activator, and the robust foundation of OneLake, Microsoft Fabric promises a future where businesses can efficiently harness the power of data to make informed decisions and drive innovation. Ultimately, it has the potential to democratize powerful business intelligence solutions, making them more accessible and cost-effective for enterprises of all scales.

dbt Fusion Demo

The demo project: Hub Speak Base

Dialect error validation in the IDE

State-based orchestration: only rebuild what changed

Efficient testing: stop running the same tests twice

What this means for your dbt workflows

Watch the Video

dbt Fusion

What is dbt Fusion?

Why dbt Fusion Matters

Dialect-Aware SQL Validation

Static SQL Analysis

Handling Introspective Models

Why This Matters

State-Aware Orchestration

Build Only What Changed

Shared Model State

Concurrent Job Handling

Works Out of the Box

Efficient Testing (Beta)

Performance and Cost Benefits

Compatibility with dbt Projects

Current State of dbt Fusion

Conclusion

Watch the Video

Meet the Speaker

Model Access in dbt Cloud

Why Model Access Matters in Modern Data Platforms

Understanding Model Access Levels in dbt

Private Models

Protected Models

Public Models

Producer and Consumer Project Setup

Configuring Groups in dbt

Assigning Groups to Individual Models

Assigning Groups to Folders

Testing Private Model Access

Using Protected Models

Configuring Public Models for Cross-Project Use

Star Schema Example with Mixed Access Levels

Exploring Model Ownership in the dbt Catalog

Referencing Public Models from Consumer Projects

Understanding Lineage Graphs and Dependencies

Benefits of Model Access Control in dbt

Best Practices for Managing Model Access

Conclusion

Watch the Video

Model Access in dbt

What Are Groups in dbt?

What Is Model Access in dbt?

Private

Protected

Public

How Groups and Model Access Work Together

Public Models and Cross-Project References

Why Model Access Matters for Data Mesh

Conclusion

Watch the Video

How to Validate Data Freshness in dbt Cloud

Understanding the Source Freshness Configuration

Using loaded_at_field

Using loaded_at_query

Using Optional Filters

Running Freshness Checks in dbt Cloud

Triggering a Warning

Triggering a Failure

Including Freshness in dbt Cloud Jobs

Option 1: “Run Source Freshness” Checkbox

Option 2: Adding Freshness as a Job Step

Monitoring Freshness in the dbt Cloud Catalog

Key Takeaways

Watch the Video

dbt Source Freshness

What Is dbt Source Freshness?

Why Source Freshness Matters

How dbt Source Freshness Works

loaded_at_field

Thresholds: warn_after and error_after

What Happens During a Freshness Check?

Using `loaded_at_field`

Using `loaded_at_query`