All Posts By

Joshua Stendel

Joshua Stendel is a Consultant at Scalefree specializing in Data Vault 2.0 implementations and modern data warehousing. With a background in Business Information Systems, he is a Certified Data Vault 2.0 Practitioner and Databricks Certified Data Engineer Professional. Joshua specializes in architecting scalable data solutions utilizing Databricks and dbt to successfully guide and enable enterprise teams.

How to Set Up dbt with Databricks in 15 Minutes


Getting Started With dbt on Databricks

dbt (data build tool) has become one of the most popular ways to transform data directly inside a modern cloud platform. Paired with Databricks, it gives data teams a clean, version-controlled workflow for turning raw, already-ingested data into trusted, well-modeled tables — without ever leaving the lakehouse. If you have never connected the two before, the encouraging part is that going from an empty environment to your first working models takes only a few minutes.

The path is straightforward: understand what dbt is actually responsible for, set up a Databricks environment, launch dbt through Partner Connect, initialize a project, define your sources, and build your first staging and dimension models — right through to committing your work with version control. The walkthrough below covers each step in order.



What dbt Does (and What It Doesn’t)

Before touching any configuration, it helps to be clear about one core concept. dbt is built strictly for the transformation step of your data pipeline. In the language of ETL or ELT, dbt owns the “T.” It does not extract data from source systems, and it is not the tool that loads raw data into your platform. Instead, it expects that some data already lives inside your warehouse or lakehouse, and its job is to reshape, clean, and model that data into something analytics-ready.

On Databricks, that means your raw data should already be sitting in the lakehouse before dbt enters the picture. From there, dbt takes over the modeling work: building staging layers, applying business logic, and producing the dimensional or analytical models your reporting depends on. Keeping this boundary in mind avoids a common early misconception — dbt is not an ingestion tool, it is a transformation framework.

Setting Up Your Databricks Environment

To follow along, you need access to a Databricks workspace. If you do not already have one, a free trial is enough to complete every step described here. Search for the Databricks free trial, create your account, and you will have a workspace ready within a minute or two.

Once your workspace is available, you have somewhere for dbt to connect to and somewhere for your transformed models to land. That is the only prerequisite on the Databricks side before bringing dbt into the workflow.

Launching dbt Through Databricks Partner Connect

The simplest way to stand up dbt against Databricks is through Partner Connect, the marketplace section of the workspace where Databricks lists integration partners with their connection settings pre-configured. Open Partner Connect, search for dbt, and you will find dbt Cloud listed among the available integrations. Selecting it opens the connection flow.

If you do not already have a dbt account, you can enter the same email you used for your Databricks trial. dbt then provisions a trial account for you with the connection credentials and settings already populated, so you do not have to wire up the connection by hand. After confirming, you land on the dbt Cloud dashboard, where the Databricks Partner Connect trial is shown and ready to use.

From the dashboard, open the cloud-based development environment. It takes a moment to load, and then you are dropped into an empty project, the blank canvas for your first models.

Initializing Your First dbt Project

An empty project needs structure before you can build anything. Initializing a dbt project scaffolds that structure for you: with a single action, dbt generates the full folder layout, configuration files, and a set of example SQL models, all visible in the file explorer.

Opening the models/example folder reveals a handful of starter models you can inspect right away. If you simply want to confirm everything works end to end, you can run a build immediately. The build compiles and runs the example models in a few seconds and also executes the default tests that ship with them, typically a not-null test and a uniqueness test. Because one example model deliberately produces a row with a null value, the not-null test is expected to fail. That “failure” is intentional: it confirms that dbt’s testing layer is working, not that something is broken.
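To make the intentional failure concrete, the sketch below shows roughly what the two default tests check. dbt compiles each test into a query that returns the offending rows, and a test passes only when that query returns nothing. The model name my_first_dbt_model and the column id are assumptions based on the standard starter project, and the SQL dbt actually generates is formatted differently.

```sql
-- Run these in the schema where dbt built the example model.

-- not_null: returns every row whose id is missing.
-- The starter model deliberately contains a null id, so this query
-- returns at least one row and the test fails.
select id
from my_first_dbt_model
where id is null;

-- unique: returns every id that occurs more than once.
-- An empty result means the test passes.
select
    id,
    count(*) as occurrences
from my_first_dbt_model
where id is not null
group by id
having count(*) > 1;
```

Reading a test as "a query that should return zero rows" also makes it easier to debug a failure: run the query by hand and inspect what comes back.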

Back in your Databricks workspace, refreshing the Partner Connect schema shows the freshly created model sitting alongside the connection schema. The round trip — build in dbt, see the result in Databricks — is the loop you will repeat for every model you create.

Defining Your Data Sources

Before building your own models, dbt needs to know where your raw data comes from. This is the role of a sources file. After deleting the example folder to start clean, create a sources.yml file that points dbt at data you have already ingested into the lakehouse.

The Databricks sample data is perfect for a first run. Using the Bakehouse sample dataset, you can focus on two tables — the sales customers table and the sales transactions table — to keep things simple. A minimal sources definition looks like this:

sources:
  - name: bakehouse
    database: samples
    schema: bakehouse
    tables:
      - name: sales_customers
      - name: sales_transactions

Here the source name is your own label, the database and schema point to where the data physically lives in Databricks (database corresponds to the catalog, here samples), and the tables list names the specific objects you want dbt to be aware of. Saving the file (using the save action or Ctrl+S) registers these sources for the rest of your project.
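As a quick check that the source definition points where you expect, you could preview one of the source tables from a scratch SQL file in the dbt editor. This is a minimal sketch: the row limit is arbitrary, and the compiled name shown in the comment is approximate, since the exact quoting depends on the adapter.

```sql
-- source('<source name>', '<table name>') looks up the matching entry in
-- sources.yml and compiles to the fully qualified object name, roughly:
--   samples.bakehouse.sales_customers
select *
from {{ source('bakehouse', 'sales_customers') }}
limit 10
```

If the preview fails, the usual suspects are a typo in the source or table name, or a database and schema pair that does not match where the data lives.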

Generating Your First Staging Models

With sources defined, dbt can do a lot of the boilerplate for you. The development environment detects the tables in your sources file and offers to generate a staging model for each one. Choosing to generate a model for the sales customers table produces a simple staging model in seconds.

The generated model is a pair of clean Common Table Expressions (CTEs) built on dbt’s source() function. This is Jinja syntax that dbt resolves to the correct Databricks object when it compiles the model, and here it references the Bakehouse source you defined:

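with source as (
    -- Pull the raw rows from the source declared in sources.yml
    select * from {{ source('bakehouse', 'sales_customers') }}
),

renamed as (
    select
        -- List the remaining source columns alongside these two.
        -- The last column must not be followed by a comma.
        first_name as name,
        country
    from source
)

select * from renamed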

The generator includes a renaming layer but leaves the actual renaming to you, because that final shaping is where your modeling decisions live. For example, renaming the source’s first_name column to name is a typical first edit. Once you are happy with the model, building it takes a couple of seconds. A successful run reports its timing (often under two seconds) and confirms the model now exists in your database.

Refreshing the schema in Databricks reveals the new staging view (for example, stg_bakehouse__sales_customers) with every column present and your rename applied. Repeating the same generate-and-build flow for the sales transactions table gives you a second staging view. dbt also keeps the project tidy by placing the generated files in subfolders: a staging folder with a Bakehouse subfolder inside it, holding each SQL model and its matching configuration file.
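For reference, the second staging model follows the same two-CTE shape. The sketch below is an assumption of what the file might look like before any edits. It passes every column through unchanged, because the columns of the transactions table are not listed in this walkthrough. The file name and path are inferred from the naming of the customers view and may differ in your project.

```sql
-- models/staging/bakehouse/stg_bakehouse__sales_transactions.sql (assumed path)
with source as (
    select * from {{ source('bakehouse', 'sales_transactions') }}
),

renamed as (
    -- Pass all columns through for now. Replace the star with an explicit
    -- column list once you decide which columns to rename.
    select *
    from source
)

select * from renamed
```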

Building a Dimension Model With ref()

Staging models give you clean, renamed source data. The next layer is where you build the analytical models your business actually consumes — for example, a customer dimension. To keep the project organized, create a new folder for your information marts, then add a SQL file such as customer_dimension.sql, since every dbt model is simply a SQL file.

The key technique at this layer is the ref() function. Instead of hard-coding a table name, you reference the staging model by its name, and dbt resolves the dependency for you. A straightforward dimension that selects all columns and filters to a single market — customers in the USA, say — looks like this:

select *
from {{ ref('stg_bakehouse__sales_customers') }}
where country = 'USA'

Building this model produces the customized dimension in Databricks, complete with all its columns and the filter applied. Because you used ref() rather than a literal table name, dbt now understands that this dimension depends on the staging model, which in turn depends on the source — a chain it tracks automatically.
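The same pattern extends to any other model in this layer. As a hedged illustration, a second model could count customers per country from the same staging view. The file name customers_per_country.sql, its folder, and the customer_count alias are hypothetical. Only the country column and the staging model name come from the walkthrough above.

```sql
-- models/marts/customers_per_country.sql (hypothetical file and folder)
select
    country,
    count(*) as customer_count
from {{ ref('stg_bakehouse__sales_customers') }}
group by country
```

Because this model also goes through ref(), dbt would build it after the staging view and show it as a second branch off the same staging node.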

Tracking Data Flow With the Lineage Graph

One of the most useful features of dbt is its lineage graph. As soon as your models reference one another through ref() and source(), dbt can draw the full dependency graph. Opening the lineage view shows the source table — the Bakehouse customer data — flowing into your staging model, and the staging model flowing into your customer dimension model.

This makes it easy to see exactly where any piece of data comes from and how it is transformed along the way. As projects grow, that visibility becomes invaluable for debugging, impact analysis, and onboarding new team members who need to understand the model landscape quickly.

Version Control and Committing Your Work

Everything you build in dbt is backed by version control. As you create and edit files, the history panel records which files were created or modified, exactly as you would expect from any Git workflow. When you are ready to save your progress, you write a commit message, something like “created first models”, and commit your changes to the main branch.

From there, the usual Git capabilities are available: creating additional branches, switching between them, and opening pull requests. By default, dbt provides managed repositories so you can get started without any setup, but you are free to connect your own provider — GitHub, Azure DevOps, GitLab, or whatever your team already uses. This means your modeling work fits naturally into existing engineering practices rather than living in isolation.

Where dbt and Data Vault Meet

The workflow above scales far beyond a couple of sample tables. The same building blocks — sources, staging models, ref()-based dependencies, testing, and lineage — are exactly what you need to implement a robust, automated Data Vault on a modern lakehouse. dbt’s modularity and dependency management make it a natural fit for the repeatable, pattern-driven loading that Data Vault is built around.

If you want to go deeper into designing and automating these patterns the right way, structured learning makes a real difference. Scalefree’s Data Vault 2.1 Training & Certification covers the methodology end to end, from modeling fundamentals to production-grade automation on platforms like Databricks.

Start Building With dbt on Databricks

Getting dbt running on Databricks is genuinely a matter of minutes: connect through Partner Connect, initialize a project, point dbt at your sources, and let it generate the staging models you can then refine into dimensions and beyond. With version control and lineage built in from the very first commit, you are not just producing tables — you are establishing a maintainable, transparent transformation layer you can grow with confidence. Spin up a workspace, follow the steps above, and your first models will be live before you know it.
