Skip to main content
search
0

Data Vault on dbt Snapshots

Data Vault on dbt Snapshots

In recent years, dbt has become one of the most popular tools in modern data stacks. At the same time, Data Vault continues to be a proven methodology for building scalable, auditable, and historically complete data warehouses.

It is therefore no surprise that questions arise at the intersection of both worlds. One question we recently received perfectly captures this:

“Can you build a Data Vault view downstream off of dbt snapshots?
I feel dbt snapshots are safer because they capture data ‘as is’, and a Data Vault might be designed wrong.”

This is a great question—and one that touches architecture, performance, data modeling, and risk management at the same time. In this article, we’ll unpack the topic step by step and give a clear, practical answer.



First Things First: What Are dbt Snapshots?

Before we compare dbt snapshots with Data Vault concepts, let’s align on what dbt snapshots actually are.

According to dbt’s own documentation, snapshots are used to implement Type 2 Slowly Changing Dimensions (SCD Type 2) on mutable source tables.

If you’re familiar with dimensional modeling, this should sound very familiar. SCD Type 2 means:

  • Whenever a record changes, a new row is inserted.
  • The old version of the record is kept for historical analysis.
  • Validity timestamps define from when to when a record version was valid.

In a typical example, a source table might only store the current state of an order:

  • January 1st: Order status = pending
  • January 2nd: Order status = shipped

The source system overwrites the status, so you only ever see the latest value. But in analytics and data warehousing, we usually want to know how the data looked at a specific point in time.

That’s where dbt snapshots come in. They store multiple versions of the same business key and enrich the data with technical columns such as:

  • dbt_valid_from
  • dbt_valid_to

Whenever dbt detects a change, it:

  • Inserts a new row for the new version.
  • Updates the dbt_valid_to of the previous version.

From a functional perspective, this is classic SCD Type 2 behavior.

dbt Snapshots vs. Data Vault Satellites

Now let’s compare dbt snapshots with Data Vault modeling. This is where things get interesting.

In Data Vault, Satellites are responsible for storing descriptive attributes and tracking changes over time. In other words:

  • Satellites are also SCD Type 2 structures.
  • They store full history.
  • They insert a new row for every detected change.

So at first glance, dbt snapshots and Data Vault satellites look almost identical. And conceptually, they are very close.

However, there is one important difference.

Insert-Only vs. Update-Based Modeling

Modern Data Vault implementations follow a strict insert-only approach. That means:

  • No updates to existing records.
  • No physical valid_to column updates.
  • History is reconstructed using window functions or Point-in-Time (PIT) tables.

dbt snapshots, on the other hand, do update the previous record to set the dbt_valid_to timestamp.

From a pure modeling perspective, both approaches are valid. But from a platform and performance perspective, insert-only has some strong advantages—
especially in cloud data warehouses like Snowflake, BigQuery, or Redshift.

Why Insert-Only Matters in the Cloud

Cloud-native data warehouses are optimized for append-heavy workloads.

For example:

  • Snowflake uses micro-partitions that are immutable.
  • Updates often result in copy-on-write operations.
  • Insert-only workloads scale better and are cheaper.

This is one of the reasons why Data Vault adopted insert-only patterns years ago. It’s not just about modeling philosophy—it’s about performance and scalability.

That doesn’t mean dbt snapshots are “wrong”. It just means they were designed with a slightly different use case in mind.

Where dbt Snapshots Shine

From a practical standpoint, dbt snapshots are extremely useful in specific scenarios.

One very common use case is a persistent staging area.

Imagine you receive:

  • Full data extracts every day from a source system.
  • No CDC (Change Data Capture).
  • Large tables where storing daily full loads would be wasteful.

In this case, dbt snapshots allow you to:

  • Store only the changes between loads.
  • Keep historical versions.
  • Reduce storage and processing overhead.

From this perspective, dbt snapshots act like a slim persistent staging layer. They capture the source data “as is” and preserve history.

If you already receive proper CDC data from upstream systems, then dbt snapshots are often unnecessary. The change tracking has already been done for you.

Back to the Core Question

So let’s return to the original question:

Can you build a Data Vault view downstream of dbt snapshots?

Technically and conceptually, the answer is:

Yes, you can.

If your dbt snapshots contain all source changes, you have everything you need to:

  • Identify business keys.
  • Track attribute changes.
  • Build hubs, links, and satellites.

In theory, you could build a fully virtualized Data Vault layer on top of snapshots:

  • Virtual hubs
  • Virtual links
  • Virtual satellites

From a data completeness perspective, nothing is missing.

The Real Challenge: Performance and Cost

Unfortunately, theory and reality often diverge.

While a fully virtualized Data Vault sounds elegant, it usually doesn’t work well in practice—at least not today.

Why?

  • Large historical datasets require heavy joins and window functions.
  • Virtualization pushes computation to query time.
  • Cloud compute costs increase rapidly.

In most real-world environments, fully virtualizing the Data Vault on top of snapshots leads to:

  • Slow queries
  • High compute bills
  • Poor user experience

That’s why most architectures still materialize the Data Vault at some point.

Does a “Wrong” Data Vault Design Mean Data Loss?

Another concern in the question is the fear of designing the Data Vault “wrong”.

This fear is understandable—but largely unfounded.

One of the core promises of Data Vault is:

You do not lose data due to modeling decisions.

Even if:

  • You split satellites too much.
  • You group attributes differently than you would today.
  • You later realize a better modeling pattern.

You can always:

  • Refactor satellites.
  • Split or merge them.
  • Reload data from existing Data Vault tables.

This is possible because Data Vault stores raw, historized data—not business logic.

So while a persistent staging area can be helpful, it is not a safety net you absolutely must have. A properly loaded Data Vault already is that safety net.

A Pragmatic Recommendation

So what does a pragmatic architecture look like today?

  • Use dbt snapshots if you need a persistent staging layer and don’t have CDC.
  • Materialize the Raw Data Vault for performance and scalability.
  • Virtualize downstream layers (Business Vault, Information Marts) where possible.

This approach balances:

  • Data safety
  • Performance
  • Cost efficiency

As data volumes grow and histories span years or decades, full virtualization simply becomes inefficient. Materialization at the Raw Vault level is still the sweet spot in most projects.

Final Thoughts

dbt snapshots are a powerful feature and fit nicely into modern data stacks. They can absolutely support Data Vault architectures—especially as a persistent staging layer.

However, they don’t eliminate the need for a materialized Data Vault. Nor do they replace the robustness and flexibility that Data Vault modeling provides.

Used together, dbt and Data Vault can form a strong, future-proof foundation for enterprise analytics—when each tool is applied where it makes the most sense.

Watch the Video

Simplify Data Pipelines with Custom Macros in Coalesce.io

Custom Macros in Coalesce.io

When working in data transformation platforms like Coalesce.io, efficiency and maintainability are everything. As your data pipelines grow, so does the complexity of your transformation logic. This is where custom macros become a powerful ally — they let you write reusable, dynamic SQL code that can be applied consistently across your entire data model.

In this article, we’ll explore how to use macros in Coalesce.io with the help of the Jinja templating language. You’ll see how a small example — converting currency values from cents to dollars — can scale into a reusable pattern that saves hours of development and maintenance time.



What Are Macros in Coalesce.io?

Macros are reusable pieces of logic that can be dynamically inserted into your SQL transformations. They’re written in Jinja, a templating language originally popularized in web development frameworks such as Django and Flask. Coalesce.io leverages Jinja to make SQL more dynamic and parameterized, letting you write logic once and reuse it anywhere.

In simple terms, a macro is like a small function that can generate SQL code on the fly. You define a macro once — for example, to perform a mathematical conversion, a date calculation, or a string manipulation — and then call it across multiple nodes in your Coalesce.io project.

Why Use Macros?

As data models evolve, repetitive transformations become a maintenance challenge. Imagine you have multiple columns across different tables that need to be converted from cents to dollars. You could copy and paste the same transformation logic everywhere, but what happens when the business decides to change the precision or switch the data type?

You’d have to update every single instance manually — a process that’s time-consuming and error-prone. With a macro, you simply change the logic once, and it updates everywhere it’s used. This drastically reduces maintenance efforts and improves consistency.

Setting Up the Example: Converting Cents to Dollars

Let’s walk through a simple example. Suppose you have a customer table with an account_balance column that stores values in cents. The business requires all monetary values to be in dollars for reporting and analytics. This means we need to divide the column by 100 and adjust the data type accordingly.

In Coalesce.io, we can perform this transformation directly within a stage node. You might start with something like:

{{ SC }} / 100::NUMERIC(18, 2)

The {{ SC }} syntax tells Coalesce.io to dynamically insert the source column name. After running the transformation, you’ll see that account_balance is now expressed in dollars instead of cents. So far, so good — but what if multiple columns or nodes need this conversion?

Copying and pasting this logic everywhere quickly becomes inefficient. This is the perfect case for turning the transformation into a macro.

Creating a Custom Macro in Coalesce.io

To define a new macro, navigate to your project’s Build Settings and locate the Macros section. There, you can create workspace-level macros that are available across your entire Coalesce.io environment.

Here’s a simple Jinja macro to convert cents to dollars:

{% macro cents_to_dollars(column, scale=2) %}
    ({{ column }} / 100)::NUMERIC(18, {{ scale }})
{% endmacro %}

This macro does a few important things:

  • Accepts parameters — The column parameter specifies which column to transform, and scale defines the number of decimal places (defaulting to 2).
  • Performs the conversion — It divides the value by 100 and casts it to a numeric type.
  • Is reusable — You can now call this macro anywhere without rewriting the logic.

Applying the Macro in a Transformation

Once your macro is defined, you can use it directly within your stage or transform node. For example, if you want to create a new column that stores the converted dollar value, simply write:

{{ cents_to_dollars('customer.account_balance') }}

This will execute the macro, substitute the SQL expression, and generate the proper transformation logic dynamically. You can even adjust the precision if needed:

{{ cents_to_dollars('customer.account_balance', scale=4) }}

With just one line of code, you’ve achieved a flexible, reusable transformation that can be applied across multiple nodes and columns.

Benefits of Using Macros in Coalesce.io

Macros may seem like a small feature, but their impact on data engineering workflows is significant. Here are a few key advantages:

1. Reduce Maintenance Overhead

When transformation logic changes, you only need to update it in one place. This ensures consistency across your pipelines and minimizes human error.

2. Promote Reusability

Macros make your codebase more modular. Teams can share standardized transformation logic, reducing duplication and ensuring best practices are applied everywhere.

3. Improve Readability

Instead of cluttering transformations with complex SQL expressions, you can reference clean, descriptive macro calls. This improves readability and helps onboard new team members faster.

4. Enhance Collaboration

Macros can be shared across teams or workspaces, enabling collaborative development in larger analytics engineering environments.

5. Simplify Complex Logic

As transformations get more sophisticated, you can encapsulate multi-step logic inside a macro. For example, handling data type conversions, conditional mappings, or even custom business rules — all within one reusable function.

Debugging and Testing Macros

Because macros generate SQL dynamically, debugging can sometimes feel tricky. Coalesce.io helps by allowing you to preview the compiled SQL code. After running a transformation that uses a macro, open the results to view the underlying SQL — you’ll see exactly how your macro was expanded and executed.

This visibility is crucial for validating logic and ensuring your macros behave as expected.

When to Use Macros

A good rule of thumb is simple:

If you find yourself repeating the same transformation in more than one place, consider turning it into a macro.

Macros are particularly useful for:

  • Currency or unit conversions
  • Data cleansing logic (e.g., trimming whitespace, normalizing case)
  • Timestamp or date formatting
  • Standard calculations (e.g., margin, growth rate, ratios)
  • Conditional logic applied across multiple datasets

By standardizing these repetitive transformations, your data environment becomes cleaner, easier to manage, and more scalable.

Best Practices for Writing Coalesce.io Macros

  • Keep macros simple — Each macro should serve a single clear purpose. Break complex logic into smaller, composable macros when possible.
  • Use default parameters wisely — Providing defaults (like scale=2) makes macros flexible and user-friendly.
  • Document your macros — Add comments explaining what each macro does and the parameters it expects. Future maintainers will thank you.
  • Test before scaling — Validate each macro on a small dataset before applying it widely.
  • Version control your macros — Store them in a shared Git repository or Coalesce.io workspace for collaboration and traceability.

Real-World Impact

In production environments, macros can save hours of repetitive work each week. Imagine applying the same data type conversion or formatting rule across dozens of tables — a single macro call replaces all those redundant SQL snippets.

Macros also make large-scale refactoring much safer. If a business requirement changes (for example, moving from numeric to money data types), one macro edit automatically updates every transformation that depends on it.

Final Thoughts

Custom macros in Coalesce.io aren’t just a convenience — they’re a foundation for scalable, maintainable data engineering. By abstracting common logic into reusable templates, you streamline your transformations, reduce technical debt, and empower your team to focus on building insights rather than maintaining code.

As a best practice, always look for opportunities to generalize repetitive logic. When you spot patterns across transformations, that’s your cue to create a macro. Start small, experiment, and watch your data pipelines become cleaner, faster, and easier to manage.

Next Steps

Explore Coalesce’s documentation on macros and Jinja templating to deepen your understanding. You can also download our free Data World Handbook — a comprehensive guide to modern data architecture, data vault modeling, and transformation best practices.

And if you’ve built interesting macros of your own, share them in the comments below. We’d love to see how you’re simplifying your Coalesce.io workflows.

Watch the Video

Row- & Column-Level Security in the Reporting Layer

Row-Level Security & Column-Level Security

In modern BI and Big Data architectures, security is no longer something you “add later”. If you build a data warehouse, a Data Vault, or even a smaller reporting solution without a clear security concept, you will almost certainly run into problems down the road.

One of the most common and most important questions we get in BI projects is: How do you actually implement row-level and column-level security in the reporting layer?

In this article, we’ll walk through the reasoning behind row- and column-level security, explain why hard-coded rules don’t scale, and show a proven, practical approach using access control lists (ACLs) directly in the data warehouse reporting layer.



Why Row- and Column-Level Security Matters

Let’s start with the basics. Why do we even need row-level and column-level security in a data warehouse or reporting layer?

The answer is simple: not all users should see all data.

Here are two very common examples from real-world projects:

  • Row-level security: A sales representative in Germany should only see customers from Germany (or the DACH region) and not customers from France, Spain, or other regions.
  • Column-level (attribute-level) security: Sensitive fields like revenue, margin, salary, or bonus information should only be visible to specific roles, such as finance or management.

These requirements exist in almost every company, regardless of size or industry. Yet, many teams still struggle to implement them in a clean, scalable way.

The Problem with Hard-Coded Security Rules

A common first approach is to implement security rules directly in reporting tools like Power BI, Tableau, or Looker. While this might work for a small number of reports, it quickly becomes a nightmare as your BI landscape grows.

Here’s why hard-coded security does not scale:

  • High maintenance effort: Every report or dashboard needs to be updated whenever security rules change.
  • Inconsistent logic: Different reports may implement slightly different rules, leading to confusion and errors.
  • Frequent changes: Users change departments, teams get reorganized, and access rules evolve over time.
  • Risk of mistakes: Forgetting to apply a rule in one report can expose sensitive data.

In short: implementing row- and column-level security repeatedly in every reporting tool is inefficient and risky.

The Core Idea: Access Control Lists (ACLs)

A scalable and proven approach is to use Access Control Lists (ACLs). This is a well-known concept in IT security and works extremely well in data warehousing and BI environments.

The idea is straightforward:

  • Maintain centralized tables (or files) that define who is allowed to see what.
  • Map users or user groups to business attributes, such as regions, countries, or access rights.
  • Apply these rules once in the reporting layer of the data warehouse.

Instead of implementing security in every report, you implement it in the data warehouse views that your reporting tools consume.

Users vs. User Groups: Always Think in Groups

One very important design decision: always work with user groups, not individual users.

Managing security on a per-user basis creates a lot of overhead and quickly becomes unmanageable. Groups, on the other hand, scale well and align nicely with how companies organize access rights.

A typical setup might look like this:

  • corp\\bi-read-DACH
  • corp\\bi-read-EMEA
  • corp\\bi-read-FINANCE

These groups are usually managed in Active Directory, Azure AD, or a similar identity provider. Your data warehouse then simply needs to know which group a user belongs to.

Implementing Row-Level Security with ACLs

Row-level security controls which rows a user is allowed to see. The ACL table for this typically maps user groups to business attributes.

A simplified example of a row-level ACL table could look like this:

USER_GROUP          | REGION_CODe --------------------|-------------
bi-read-DACH        | DACh bi-read-EMEA        | EMEa

This table says:

  • Users in the DACH group can see data for the DACH region.
  • Users in the EMEA group can see data for the EMEA region.

Where does this table live? Ideally:

  • In a master data system, if your organization has one.
  • In a reference data schema in the data warehouse.
  • For smaller setups, even an Excel file that is ingested regularly can work.

Applying Row-Level Security in Views

Once the ACL exists, applying it in the reporting layer is straightforward. In your Information Mart or reporting views, you simply filter based on the current user’s group.

Most modern databases allow you to access session context information, such as:

  • The current user
  • The current role
  • The current group

Conceptually, the SQL logic looks like this:

SELECT *
FROM customer c WHERE c.region_code IN (
    SELECT region_code     FROM row_level_acl     WHERE user_group = CURRENT_USER_GROUP()
)

The exact syntax depends on your database, but the concept is universal. The result: users only ever see rows they are allowed to see, no matter which reporting tool they use.

Implementing Column-Level (Attribute-Level) Security

Column-level security works slightly differently. Instead of filtering rows, you control whether a column is visible or not.

Typical use cases include:

  • Revenue
  • Margin
  • Salary
  • Bonus

Again, the foundation is an ACL table. A simplified example:

USER_GROUP          | COLUMN_NAME | CAN_REAd --------------------|-------------|---------
bi-read-DACH        | revenue     | false bi-read-EMEA        | revenue     | true

In this example:

  • The DACH sales team cannot see the revenue column.
  • The EMEA finance team can see the revenue column.

Applying Column-Level Security in Views

In the reporting view, you typically implement column-level security using a CASE WHEN statement:

CASe     WHEN EXISTS (
        SELECT 1
        FROM column_level_acl         WHERE user_group = CURRENT_USER_GROUP()
          AND column_name = 'revenue'
          AND can_read = true     )
    THEN revenue     ELSE NULl END AS revenue

If the user is allowed to see the column, they get the value. If not, they get NULL. From the reporting tool’s perspective, the column exists but contains no sensitive data.

Who Should Manage the Security Rules?

One important organizational point: the data warehouse team should not manually manage ACLs.

Security rules change frequently, and they are usually driven by business or governance decisions. Ideally:

  • Reporting or data governance teams own the rules.
  • Business users can maintain ACLs via a master data system or controlled interface.
  • The data warehouse simply consumes these rules.

This separation of responsibilities reduces operational overhead and avoids constant change requests to the IT or data engineering team.

Automation Is Key

In modern data stacks, manual SQL coding should be the exception, not the rule. Security logic is no different.

If you write row- and column-level security logic manually for every single view, you will:

  • Forget to apply it in some places.
  • Introduce inconsistencies.
  • Create unnecessary technical debt.

The better approach is to standardize and automate.

For example:

  • Use dbt macros to apply security logic consistently.
  • Enable or disable security with a simple configuration flag.
  • Automatically apply security to all views in a specific schema.

In one project, we implemented a dbt security macro that could be activated with a single line of code. Depending on the configuration, the macro automatically injected the row- and column-level ACL logic into the view.

This ensures:

  • Consistency across the entire reporting layer.
  • Minimal manual effort.
  • Much lower risk of security gaps.

Where Should Security Be Applied?

Best practice is to apply row- and column-level security in the final reporting layer
of your data warehouse:

  • Information Marts
  • Presentation Layer
  • Semantic Layer

This keeps your raw and integration layers clean and flexible while ensuring that everything exposed to BI tools is properly secured.

Key Takeaways

  • Row- and column-level security is a foundational requirement in BI projects.
  • Hard-coded security in reports does not scale.
  • Access Control Lists provide a clean, centralized solution.
  • Always work with user groups, not individual users.
  • Apply security in the reporting layer of the data warehouse.
  • Automate everything using modern data tooling.

If you get these basics right early in your data warehouse or Data Vault project, you will save yourself a lot of pain, rework, and risk later on.

Watch the Video

How to Refactor Your Raw Data Vault: Proven Strategies for Scalable Multi-Tenant Data Warehouses

Refactoring the Raw Data Vault

Refactoring is a natural part of any evolving data architecture. Whether you’re adding a new entity, integrating additional source systems, or moving toward a multi-tenant Data Warehouse, change is inevitable. In this article, we’ll explore what it means to refactor your Raw Data Vault (RDV), why it’s essential, and how to do it safely using strategies like Hub-it-out.

The discussion is based on a real-world scenario where a company’s recruitment data model needed to evolve. Initially, the model tracked requisitions but didn’t fully capture recruiter details. With new requirements, the team needed to refactor their RDV to introduce a new Hub Recruiter entity — without breaking historical data or existing queries.



What Refactoring Really Means

Before diving into the technical details, let’s clarify what refactoring is — and what it isn’t.

Refactoring means changing the internal structure of your data model without altering its external behavior. Think of it as improving or extending the foundation — adding new components, reorganizing relationships, or optimizing loading logic — while keeping the overall functionality intact.

Redesign, on the other hand, means making user-visible changes. This might involve restructuring dashboards, altering reports, or redefining business logic in ways that affect your end users.

Since the Raw Data Vault is an internal structure (not typically user-facing), most adjustments here fall under refactoring. These changes are iterative and low-risk if your base model is valid.

Deploy First, Refactor Later

Many data teams fall into the trap of seeking the “perfect” model before deployment. But as experts at Scalefree emphasize — perfection is the enemy of progress. Instead of spending months debating the right business key or model structure, deploy a valid model quickly, gather feedback, and improve it through incremental refactoring.

“Don’t aim for the best model on day one. Aim for a valid model that you can continuously improve.”

This agile mindset ensures your team delivers value early and often. Over time, with each sprint, you evolve closer to the “best” model for your organization — even though, in truth, a perfect model doesn’t exist.

The Recruiter Example: When Business Requirements Change

Let’s revisit our use case: your company is in the recruiting business. Each job requisition is led by a single recruiter. Your original model included a RecruiterID in the requisition table but didn’t have a dedicated Hub Recruiter. Now, the business wants to integrate additional recruiter data from new source systems.

So how should you adapt your Data Vault model?

  • Do you close the old Link and create a new one that includes the Recruiter Hub?
  • Do you split existing Links into smaller, recruiter-focused relationships?
  • How do you handle historical backfills and multi-tenant variations?

The answer: you refactor using the Hub-it-out strategy.

Understanding the “Hub-it-Out” Strategy

“Hub-it-out” is a practical approach to introducing new Hubs and Links into your existing Raw Data Vault without reloading everything from the source systems.

Here’s how it works:

  1. Identify the business key in your existing Satellite. For example, you already have RecruiterID stored as an attribute, even if it wasn’t modeled as a Hub initially.
  2. Create a new Hub (Hub Recruiter) by selecting all distinct recruiter IDs from your existing Satellite data.
  3. Generate hash keys for each business key value and assign load dates and record sources from the Satellite’s first occurrence.
  4. Create a Link between the existing parent Hub (e.g., Requisition) and the new Hub (Recruiter) based on relationships already present in the Satellite.

With this approach, you can build new structures directly from your existing Data Vault — no need to reload historical data from source systems, which may not even be available anymore. If you have a data lake or Persistent Staging Area (PSA), you can also load from there as an alternative.

Why a Valid Model Is Everything

One key prerequisite for refactoring success is having a valid Raw Data Vault model. A “valid” model means that:

  • Data is captured consistently and completely.
  • No business keys or relationships have been lost during earlier transformations.
  • Hubs, Links, and Satellites follow proper Data Vault design rules.

If your model is valid, you can refactor it safely — adding Hubs, Links, or Satellites — without touching your original data sources. This makes evolution faster, cheaper, and much less disruptive.

Multi-Tenant Data Vault Considerations

When dealing with multiple clients or tenants, you should introduce a TenantID attribute in every Hub, Link, and Satellite. This ensures that data from one tenant never overwrites another’s records, since the hash keys will differ.

Typically, the first few tenants may require adjustments as you generalize your model. But after integrating the second or third tenant, the structure stabilizes. Each new tenant may add Satellites or minor extensions — but the overall architecture remains consistent.

Avoid Overloading Links

Each source system should define its own Unit of Work — a consistent relationship between business keys. If one system defines a Link between three business keys (e.g., Requisition, Candidate, Recruiter) and another defines it between two, treat them as separate Links. Avoid “overloading” Links by mixing different granularities or structures. That’s a common source of data inconsistency and confusion.

Transitioning Without Breaking Queries

Refactoring can be disruptive for power users who query the Raw Data Vault directly. Their queries may break when Hubs or Links are renamed, split, or replaced. To manage this transition smoothly:

  • Load both old and new entities temporarily. Keep the old Link active for a while, even if it’s limited to a single tenant.
  • Mark deprecated objects clearly. Add a “deprecated” flag or comment in your metadata catalog.
  • Communicate proactively. Notify users via email or release notes, giving them 90–180 days to adjust queries.

Despite communication efforts, some users will inevitably miss the deadline — but maintaining transparency and clear documentation helps minimize friction.

Using Virtual Views for Legacy Support

Instead of maintaining redundant tables, you can recreate deprecated entities as virtual views on top of your new Data Vault structures. This approach saves storage while still supporting legacy queries.

Yes, there’s a small performance trade-off, but since these views are temporary and clearly marked as deprecated, it’s an effective bridge strategy. Inform users that performance may decrease slightly and that these views will be removed after a defined period (e.g., one year).

When and How to Clean Up

Every new entity adds maintenance overhead — documentation, metadata management, refactoring complexity. Once your transition period ends, remove deprecated tables and views to keep your model clean and manageable.

Set a clear timeline with your business stakeholders, archive necessary backups, and drop the obsolete entities once the window closes. This keeps your RDV lean and reduces technical debt.

Key Takeaways

  • Deploy early — perfection isn’t required for value.
  • Refactor continuously through small, validated improvements.
  • Use the Hub-it-out strategy to extend your model safely.
  • Always maintain a valid Data Vault foundation to enable future flexibility.
  • Manage user expectations through clear communication and deprecation policies.

Final Thoughts

Refactoring your Raw Data Vault isn’t about redoing work — it’s about evolving intelligently. With the right strategies and mindset, you can adapt to new business requirements, integrate additional tenants, and maintain consistency without painful re-engineering.

Whether you’re adding a Recruiter Hub, optimizing Link structures, or modernizing your multi-tenant DWH, remember: the goal isn’t perfection. It’s progress — sprint by sprint, improvement by improvement.

Watch the Video

Data Governance in Agile Teams: Balancing Speed and Compliance

Agile Teams and Data Governance

In today’s fast-paced world of analytics and data-driven decision-making, organizations face a growing challenge: how to stay agile while maintaining strong data governance. For many teams, governance is seen as a roadblock — something that slows delivery, adds layers of bureaucracy, and drains motivation. But when implemented correctly, data governance doesn’t have to be a pitstop. It can actually be the engine that keeps your Formula One data team running at top speed — safely, reliably, and compliantly.



When Governance Feels Like a Pitstop

Many data professionals can relate to the frustration: you’re in the middle of a sprint, the team is shipping fast, and suddenly you have to stop everything for governance discussions. Documentation, approvals, compliance checks — they all take time. It can feel like racing a Formula One car and being forced to pull over every 100 meters.

This tension between agility and governance is common. Data engineers want to deliver quickly, while governance teams need to ensure trust, traceability, and compliance. When these two groups work in isolation, frustration grows on both sides. The result? Slower delivery, lower morale, and data that stakeholders don’t fully trust.

Formula One agile team working with data governance

Why Governance Still Matters — A Lot

Despite the frustration, data governance remains essential. With increasing regulations like GDPR and growing concerns over data privacy, security, and lineage, organizations can’t afford to ignore governance. Without it, data quickly loses reliability and can even expose the company to legal and reputational risks.

Governance provides the foundation for trustworthy data. It defines who owns the data, how it’s used, and how quality is maintained. The challenge is not whether governance should exist — it’s how it should be implemented in a way that supports agility rather than stifles it.

Breaking the “Slow vs. Fast” Mindset

One of the biggest misconceptions is that teams have to choose between being fast and being compliant. In reality, good governance can actually increase speed — if done the right way. Instead of launching massive governance projects that take months before showing value, organizations should start small.

Start with one use case. Define what data needs to be governed, what rules are necessary, and which processes can be automated. By building governance iteratively, teams can maintain momentum while gradually increasing compliance coverage. This approach mirrors agile methodology itself: small increments, continuous improvement, and fast feedback loops.

From Data Lake to Data Swamp

When governance is ignored, data platforms can quickly degrade. Data lakes, for example, often become “data swamps” — unstructured, inconsistent, and untrustworthy. Without clear ownership and metadata management, it becomes impossible to understand what’s inside, how it was sourced, or if it’s even accurate.

To prevent this, governance teams and data engineers must work together early in the project lifecycle. Metadata, lineage, and data quality checks should not be afterthoughts. By integrating these elements from the start, teams can ensure that the lake remains organized and that all data remains discoverable and auditable.

Collaboration Over Confrontation

Too often, governance and delivery teams operate like opposing forces — “the ones who slow us down” versus “the ones who don’t care about compliance.” This mindset kills productivity. The truth is, both sides share the same goal: reliable, high-quality data that supports business success.

To make governance work in agile environments, it must be treated as a team sport. Data engineers, analysts, and governance professionals should collaborate from day one, not after development is complete. Early involvement prevents costly rework and reduces the perception that governance is an obstacle.

Start Small, Then Scale

Big-bang governance projects often fail. Buying an enterprise tool and trying to document everything at once is a recipe for analysis paralysis. Instead, start with a single use case or dataset. Identify what metadata, access rules, and lineage details are truly necessary. Use that as a pilot to refine your process and showcase quick wins.

Once the first success is achieved, expand governance incrementally. This approach ensures that governance evolves naturally with the organization’s needs, rather than becoming an oversized initiative that never delivers value.

Practical Tips for Agile Data Governance

  • Integrate governance early: Bring governance experts into sprint planning and design discussions, not after development is complete.
  • Automate wherever possible: Modern tools offer built-in data lineage, metadata tracking, and policy enforcement — leverage them.
  • Adopt data vault architecture: Separate raw data (raw vault) from business logic (business vault) to ensure traceability and compliance.
  • Iterate and adapt: Governance rules should evolve just like software requirements. Continuously refine based on feedback.
  • Show value quickly: Demonstrate how governance improves quality, consistency, and trust — not just compliance.

Tools and Automation: Governance Without Overhead

Today’s data platforms — especially in Azure and other cloud ecosystems — offer native tools that make governance easier. Many ETL and metadata management platforms now include features such as:

  • Automated data lineage tracking
  • Built-in documentation and metadata management
  • Testing and validation frameworks
  • Policy enforcement and access control

Before investing in an expensive governance suite, review what’s already available in your existing stack. Often, these native features are more than enough to get started and can help you build the foundation for a more mature governance model later on.

Data Vault: A Governance Enabler

The Data Vault methodology is particularly effective for combining agility with governance. By separating raw and business layers, it provides full traceability of every transformation while supporting iterative development. Each change can be tracked and audited, ensuring compliance without slowing delivery.

This structure also supports GDPR and other data privacy requirements by isolating personally identifiable information and simplifying data lineage tracking. When implemented correctly, the Data Vault becomes a backbone for both agility and compliance.

Conclusion: Governance as a Team Sport

Data governance doesn’t have to be a roadblock for agile teams. When done right, it ensures trust, transparency, and collaboration across all stakeholders. The key is to stop viewing governance as something external to the data process. It’s an integral part of creating reliable, sustainable, and compliant data ecosystems.

Think of your data team as a Formula One crew. The engineers build speed. The governance team ensures safety and reliability. Only by working together can the car reach its full potential — fast, secure, and built to last.

Start small, collaborate early, and leverage automation. Over time, governance will shift from a burden to a strategic advantage — one that drives your organization forward with confidence.

Watch the Video

How to Connect to the dbt Semantic Layer Through Power BI

Connecting Power BI to the dbt Semantic Layer

As organizations increasingly rely on data-driven decision making, the ability to connect business intelligence tools directly to semantic layers becomes essential. One of the most common requests we’ve heard from our community is: “How can I connect Power BI to the dbt Semantic Layer to expose metrics?”

In this guide, we’ll walk through the entire process of setting up and connecting Power BI with the dbt Semantic Layer. By the end, you’ll be able to query dbt metrics directly in Power BI and build interactive dashboards that stay in sync with your semantic models.



Why Connect Power BI with the dbt Semantic Layer?

The dbt Semantic Layer allows teams to define business metrics and dimensions in one central place. Instead of duplicating logic across BI tools, analysts and business users can rely on consistent definitions for KPIs such as revenue, churn, or order count. When Power BI is connected to this layer, dashboards automatically reflect the same trusted metrics already defined in dbt.

This integration helps:

  • Maintain consistency in metric definitions across the organization.
  • Reduce manual work for analysts when creating Power BI reports.
  • Ensure real-time access to governed data models.

Pre-Requisites

Before starting, make sure you have the following:

  • A working dbt Cloud project with a configured Semantic Layer.
  • Permission to create or access a Service Token in dbt Cloud.
  • Installed version of Power BI Desktop.
  • Internet access to download the dbt Semantic Layer Power BI connector.

Step 1: Review the dbt Semantic Layer Setup

If you’re not familiar with how the dbt Semantic Layer is configured, check out Hernan Revale’s detailed session on setting up metrics and dimensions in dbt Cloud:
Watch the dbt Semantic Layer session here.

Step 2: Collect Required Credentials in dbt Cloud

Navigate to your dbt Cloud Dashboard and head to:

  • Settings → Semantic Layer or Settings → Edit Semantic Layer

Here, confirm or configure the following:

  1. Credentials for the deployment environment where your semantic models run.
  2. A Service Token linked to the Semantic Layer. If you don’t have permission, ask your dbt admin to create one.
  3. Your Environment ID and Host, which will be used in Power BI during connection setup.

Important: Store your Service Token securely. You’ll need it to authenticate Power BI.

Step 3: Install the dbt Semantic Layer Power BI Connector

Download the Power BI connector for dbt Semantic Layer from the official documentation:
Download the connector here.

Run the installer and follow the on-screen steps. After installation, verify it by checking the list of available drivers in the ODBC Data Sources. The dbt Semantic Layer connector should now appear in the list.

Step 4: Connect Power BI to the dbt Semantic Layer

Now that everything is set up, it’s time to establish the connection:

  1. Open Power BI Desktop and start a blank report.
  2. Search for dbt Semantic Layer in the available connectors.
  3. Accept the beta notice (as the connector is still under development).
  4. Provide the required details:
    • Host
    • Environment ID
    • Service Token
  5. Choose DirectQuery (Import is not yet supported).
  6. Click Load to access your metrics.

Step 5: Build a Simple Dashboard

Once the semantic model is loaded, you’ll see your dbt metrics in the Power BI fields pane. You can now build visualizations just like you would with any other dataset. For example:

  • Create a stacked column chart with “Orders Total” on the Y-axis and “Customer Region” on the X-axis.
  • Add slicers for Region, Market, or Segment to filter the data dynamically.
  • Include a card visualization to highlight key metrics such as total revenue.

At this point, your Power BI dashboard is fully connected to the dbt Semantic Layer. Metrics are updated live and reflect the definitions you’ve configured in dbt Cloud.

Step 6: What’s Next?

In this tutorial, we focused on connecting Power BI Desktop to the dbt Semantic Layer. In the next part of this series, we’ll publish the report to Power BI Service and explain how to retain dbt connectivity in a collaborative environment.

Stay tuned for the next video and article, and don’t forget to subscribe to our channel for updates.

Conclusion

Connecting Power BI to the dbt Semantic Layer is a powerful way to bring consistent, governed metrics directly into your BI environment. With a few configuration steps, you can ensure that every report and dashboard your team creates in Power BI leverages the same trusted metric definitions managed in dbt.

This setup not only accelerates dashboard creation but also strengthens data governance across your organization. As the connector continues to evolve, we can expect even smoother integrations and more functionality in the near future.

Additional Resources

Watch the Video

Improving Salesforce Data Quality: Practical Solutions for Business Users

Fix Your Salesforce Data

Improving Salesforce Data Quality

Data is at the heart of every modern business. Organizations invest heavily in CRM platforms like Salesforce to manage customer information, support decision-making, and automate key processes. But even the most powerful CRM is only as good as the data it holds. Poor data quality leads to errors, delays, missed opportunities, and ultimately, lost revenue.

In this article, we explore the most common Salesforce data quality challenges, why they matter, and how business users—not just technical teams—can play a key role in keeping data accurate, consistent, and reliable. We’ll also share a step-by-step approach using Salesforce reports and dashboards to empower business teams in their daily operations.



Why Salesforce Data Quality Matters

Salesforce enables organizations to capture, store, and analyze customer information at scale. However, when data is incomplete, duplicated, or inconsistent, the value of Salesforce declines dramatically. Poor data quality often results in:

  • Incomplete reporting: Missing data fields prevent business teams from generating accurate reports and dashboards. This makes data-driven decision-making difficult or impossible.
  • Process errors: Incorrect values or misused fields can trigger workflow failures or lead to flawed outputs, causing business disruptions.
  • Delays in operations: Missing information, such as a shipping address, can halt critical business processes and create costly delays.
  • Automation failures: Flows, triggers, and integrations depend on complete and validated data. Poor-quality data leads to automation breakdowns and system errors.

The bottom line: without quality data, Salesforce cannot deliver on its promise of smarter sales, marketing, and customer service.

Typical Salesforce Data Quality Challenges

Across organizations, several recurring issues appear when it comes to Salesforce data quality:

  • Duplicated records: Multiple entries for the same account or contact create confusion, reporting inconsistencies, and wasted effort.
  • Missing key fields: Fields like industry, VAT number, or shipping address may be left blank, leading to gaps in reporting or process blockages.
  • Misused fields: Fields designed for one purpose may be repurposed by different teams, resulting in inconsistent data and unreliable reports.
  • Outdated information: Customer details can change frequently. Without regular updates, Salesforce quickly fills with stale data.

These issues are not unique to your company. They affect organizations of all sizes and industries. The key is to recognize that data quality is a continuous responsibility—not a one-time cleanup exercise.

Why Business Users Should Be Involved

Traditionally, data quality has been seen as an IT or admin responsibility. But in practice, many issues arise in day-to-day operations where business users interact with Salesforce directly. For example:

  • A sales rep forgets to mark an account as active.
  • A customer service agent skips entering a shipping address.
  • A marketing user enters inconsistent industry categories.

These small mistakes compound over time. By empowering business users to identify and correct data quality problems early, organizations can dramatically reduce long-term issues and keep processes running smoothly. The secret is to provide them with the right tools—without overwhelming them with technical details.

Using Salesforce Reports to Identify Data Gaps

Salesforce reports are one of the most effective tools for supporting business users in maintaining data quality. Reports can highlight records that fail to meet business requirements, enabling users to quickly spot and correct issues. Let’s walk through two practical examples.

Example 1: Accounts Missing the “Active” Field

Imagine that your business requires all accounts to have the “Active” field correctly set. However, during migrations or bulk uploads, many accounts are left blank. This creates reporting gaps when sales managers try to analyze active accounts.

By creating a simple report filtered to show accounts where “Active” is not set, you can generate a list of problem records. A designated business user can then review this report, update the missing values, and ensure reporting accuracy going forward.

Example 2: Missing Shipping Addresses on Closed-Won Opportunities

Another critical scenario involves shipping addresses. Suppose you have accounts with closed-won opportunities but no shipping address. This creates immediate risks for order fulfillment.

By building a report with a cross-filter (accounts with won opportunities AND missing shipping address), you can provide a focused list of problematic records. Assign this report to the operations or logistics team, and they can update shipping addresses before orders are delayed.

Creating Dashboards for Ongoing Monitoring

Reports are useful, but dashboards make monitoring even easier. You can combine multiple data quality reports into a single dashboard, categorized by department or data type. Examples include:

  • Sales Data Health: Accounts missing “Active” status, opportunities missing key fields.
  • Marketing Data Health: Leads missing industry or source information.
  • Service Data Health: Cases missing priority or escalation status.

Dashboards provide a real-time overview of data quality, helping managers track progress and ensuring accountability. Each team can take ownership of their specific data health metrics.

Best Practices for Business-Led Data Quality Management

To make this approach effective, keep the following best practices in mind:

  • Keep it simple: Reports and dashboards should be easy to read. Focus on the most critical data quality issues.
  • Assign responsibility: Make sure each report has an owner who is accountable for keeping it clear of records.
  • Explain the “why”: Always include descriptions that explain why a field matters. Business users are more likely to correct data when they understand its impact.
  • Automate where possible: Use validation rules, required fields, or automation to prevent errors before they enter the system.
  • Review regularly: Schedule regular reviews of dashboards to ensure data quality remains a priority.

Conclusion

Salesforce is a powerful platform, but it relies on accurate and complete data to function effectively. Data quality challenges—whether missing fields, duplicates, or outdated information—can significantly hinder decision-making and operational efficiency. The good news is that these challenges are solvable.

By empowering business users with simple reports and dashboards, you can shift data quality management from a reactive IT task to a proactive, business-led practice. This not only improves Salesforce performance but also fosters a culture of accountability across your organization.

Start small: identify a handful of critical fields, build focused reports, and create a simple dashboard. Over time, you’ll see measurable improvements in data health, process reliability, and business outcomes.

Remember: data quality is not a one-time project. It’s an ongoing effort—and when business users are equipped to take ownership, everyone benefits.

Watch the Video

Multi-Tenant Data Vault

Multi-Tenant Environment

Designing and maintaining a Data Vault in a multi-tenant environment presents unique challenges. When a data warehouse must handle not just internal data, but also data from dozens of external clients with slightly different processes and systems, the complexity increases dramatically.

A recent question we received highlighted this exact situation:

“I’m struggling with link management and the evolution process in a multi-tenant warehouse, especially putting all data together in the Information Mart. Our Data Warehouse contains internal data as well as shared data from our clients, for which we perform job requisition processes using their internal systems. We plan to onboard 50–60 clients in the next 2–3 years. Right now, we’re still in the MVP phase, supporting just a few clients. How should I manage links with so many different systems, such a large number of source tables, and processes that are similar but not identical? The goal is to have one common Information Mart design for all clients to enable standardized reporting.”

This is a classic question in modern data architecture. Let’s explore how to approach Raw Vault, Business Vault, and Information Mart design in a multi-tenant context.



Multi-Tenancy in the Raw Data Vault

A cornerstone principle in Data Vault modeling is that each Satellite is sourced from a single source system. However, in a multi-tenant setup, this guideline needs some adaptation. Many tenants use the same source systems (e.g., Salesforce, SAP) with similar core structures. In such cases, you can load multiple tenants into the same Satellite as long as you introduce a Tenant ID as part of the key.

Why Add a Tenant ID?

  • Ensures uniqueness of business keys across tenants (e.g., Customer 42 in Tenant A ≠ Customer 42 in Tenant B).
  • Partitions data naturally, so Satellites contain subsets per tenant without overwriting each other’s records.
  • Provides a straightforward way to filter or secure records by tenant.

By combining the local business key with the Tenant ID, you create a unique enterprise-wide business key. This guarantees data integrity while simplifying downstream querying and reporting.

Where to Add the Tenant ID

In multi-tenant designs, the Tenant ID should ideally appear:

  • Hubs: As part of the business key or alternate key, ensuring uniqueness across tenants.
  • Links: As part of the Hub references, ensuring uniqueness in combined relationships.
  • Satellites: As a payload field for convenience, even if the hash key already includes the Tenant ID.

With this approach, every record in the Raw Vault can always be traced back to a specific tenant, which simplifies not only modeling but also governance and security.

Defining the Tenant ID

A natural question arises: what exactly is a “tenant”? The answer depends on your business context:

  • It could be a client organization you serve.
  • It could be a business unit, country, or factory in large enterprises.
  • It might also be defined by data ownership—who is responsible for the dataset.

In some cases, you may also need a reserved Tenant ID for global or shared data that is not owned by any specific tenant. This ensures consistency and supports role-based access control.

Staging and Tenant Assignment

The Tenant ID is typically introduced already in the staging layer. How it’s assigned depends on the source system:

  • Tenant-dedicated systems: Assign a constant Tenant ID for all data from that system.
  • Multi-tenant systems (e.g., SAP, Salesforce): Extract and map the Tenant ID from existing fields (e.g., business unit, org ID).
  • Global systems: Use a reserved Tenant ID (e.g., “GLOBAL”) when ownership is shared or unclear.

This is a hard rule (constant assignment), not a conditional transformation, which ensures repeatability and traceability.

Business Vault in Multi-Tenant Contexts

Once Tenant IDs are embedded in the Raw Vault, the Business Vault becomes much easier to design. Business rules can be applied consistently across tenants, while preserving tenant-specific contexts.

  • Same-as Links: Crucial for resolving duplicate entities across tenants (e.g., the same customer appears in different client systems).
  • Custom Satellites: Standardize where possible, but add additional Satellites for tenant-specific customizations.
  • Wide PIT Tables: Be prepared for them—multiple tenants and diverse source systems naturally lead to broader structures.

At this stage, the goal is harmonization without oversimplification. A balance must be struck between common modeling and tenant-specific flexibility.

Designing the Information Mart

The Information Mart is where tenants—or the enterprise as a whole—derive insights. The challenge is to provide both:

  • Enterprise-wide views: Merging data from all tenants for global reporting.
  • Tenant-specific views: Allowing clients or business units to see only their data.

Common Mart Design

A single common dimensional model for all tenants reduces development overhead and supports standardized reporting. By including the Tenant ID in dimensions and facts, you can apply row-level security to restrict access per tenant.

When Separate Marts Are Needed

In some cases, specific tenants may require custom Information Marts. This is typically justified when:

  • Unique KPIs or processes cannot be expressed in the common model.
  • Legal or contractual reasons require strict separation of data.

However, these should remain exceptions. A well-designed common mart, filtered by Tenant ID, is usually sufficient for most tenants.

Role of Same-as Links in Reporting

To unify data across systems and tenants, Same-as Links are critical. These resolve entity duplicates across different tenants and systems (e.g., a product appearing under different codes in SAP and Salesforce).

Same-as Links can be sourced from:

  • Raw data: Mapping tables provided by business or source systems.
  • Calculated logic: Fuzzy matching, soundex, or other deduplication algorithms.

This harmonization enables the creation of enterprise-wide dimensions that span multiple tenants.

Security and Governance in Multi-Tenant Data Vaults

By embedding Tenant IDs throughout the model, row-level security becomes straightforward. Each record can be tied to a tenant, and access can be granted or denied accordingly. This simplifies compliance with data privacy regulations and contractual obligations.

Governance practices should also establish clear rules for:

  • Defining and maintaining Tenant IDs.
  • Managing ownership of global vs. tenant-specific data.
  • Regular audits of access controls and Same-as Links.

Best Practices for Multi-Tenant Data Vaults

  1. Add Tenant IDs early: Introduce them in staging to ensure consistency across the pipeline.
  2. Unify where possible: Standardize Satellites for common structures, customize only when necessary.
  3. Reserve global IDs: Create special identifiers for shared or unclear ownership data.
  4. Secure with Tenant IDs: Use row-level security tied directly to the Tenant ID field.
  5. Leverage Same-as Links: Resolve duplicates to support enterprise-wide reporting.
  6. Design one common mart: Rely on row-level filtering instead of duplicating models per tenant.
  7. Scale incrementally: Start with MVP, refine the model as you onboard new tenants.

Conclusion

Multi-tenant Data Vault design requires careful thought about uniqueness, ownership, and harmonization. By embedding Tenant IDs consistently across Hubs, Links, and Satellites, you not only preserve data integrity but also simplify governance and security. The Business Vault and Information Mart can then be designed to support both tenant-specific and enterprise-wide perspectives.

As organizations grow and onboard more clients or business units, this approach ensures scalability without overwhelming complexity. With clear governance, Same-as Links, and standardized mart designs, you can build a robust multi-tenant data warehouse that serves diverse needs while staying maintainable and secure.

Watch the Video

How to Get Your Data Platform Ready for Agentic AI

AI Agent Anatomy

Not long ago, simple large language models were the pinnacle of AI. Today, they can feel almost rudimentary, as the domain of artificial intelligence is rapidly evolves. Lately, we are seeing a push trying to move beyond one-off prompts and towards AI agents. 

It only makes sense that businesses are eager to incorporate AI agents into their workflows, and one domain particularly primed for such transformation is the data team. AI agents can automate repetitive tasks, streamline operations, and enhance data analysis and allow data professionals to focus more on the business side.

Future-Proofing your Data Platform and Unlocking its value as an AI Asset

Many companies investing in enterprise AI find success is limited by the quality of their data platforms. A key issue is “architectural debt,” which hinders the performance and scalability of AI initiatives. This session will provide guidance on how to identify and address these architectural challenges, helping organizations transform their data platforms into reliable assets that support AI agent workflows. Register for our free webinar, October 21st, 2025!

Watch Webinar Recording

AI Agents: A Brief Introduction

AI agents are autonomous software systems that perceive their environment, reason over data, and take actions to achieve specified goals. They leverage large language models, tool‑use frameworks, and API integrations to connect with external services from CRM platforms and cloud storage to data platforms and real‑time event streams. Unlike static models, agents can maintain memory across sessions, chain multiple model calls, and adapt their workflows based on real‑time feedback from connected systems.

The Anatomy of AI Agents

AI Agent Anatomy

Figure 1: A conversation agent built in low-code automation tool n8n.

An AI agent is typically centered around a large language model that serves as its core reasoning engine, interpreting user inputs, generating plans, and orchestrating decision-making through chain-of-thought or self-prompting techniques​. Surrounding this core is a memory structure that can span across immediate working memory, episodic logs, and semantic knowledge stores that persistently captures and condenses interaction histories​. To provide durable, structured storage and enable symbolic multi-hop reasoning, agents integrate databases (e.g., SQL, graph, or vector stores) as their internal memory substrate, issuing queries to organize, link, and evolve knowledge beyond the context window of the LLM​. Finally, AI agents orchestrate a suite of external tools ranging from RESTful APIs and code execution environments to web scrapers and domain-specific plugins to act upon the world, extend their cognitive reach, and execute actions in both digital and physical domains​.

A key limitation of relying on custom APIs as connectors in an AI agent framework is scalability: as you add more agents, tools, and integrations, maintaining a separate API connection for every tool and action soon becomes unmanageable. That’s where MCPs come in.

Model Context Protocols (MCP)

Figure 2: A diagram showcasing Model Context Protocols (MCP)

Developed by Anthropic and open-sourced in November 2024, the Model Context Protocol (MCP) functions as a standardized integration layer that enables the reasoning engine to interface with external resources​. It accomplishes this by defining a uniform client–server protocol whereby MCP clients (the AI agents) discover available services via a registry, authenticate, and invoke capabilities such as database queries, function calls, or file retrieval through RESTful endpoints​. By decoupling the LLM from tool-specific protocols, MCP fosters a modular ecosystem in which new services can be plugged in dynamically, making AI agent development much more scalable.

Build a Solid Data Foundation for Agentic AI

Enterprises that aim to integrate AI agents into their data workloads must first build a solid data foundation. According to a cybersecurity report, 72% of professionals state that IT and security data are siloed within their organizations, creating corporate misalignment and increased security risks. Likewise, an industry study found that in three out of four companies, data silos hinder internal collaboration, and more than 40% report a growing number of such silos. 

When data remains in isolated, non-integrated environments, AI agents cannot establish a holistic overview of the data landscape of an enterprise, hence severely limiting its abilities in making meaningful impact.

AI Agent Enterprise Data Platform

Figure 3: An Enterprise Data Platform diagram, with an EDW

To overcome this, it is best to unify data sources into an enterprise data warehouse (EDW). The EDW must provide both current and historical data in a single data platform. By functioning as a true EDW, the data platform provides a single source of facts for all agents and analytics engines. This means that the AI agents across the enterprise are empowered to create what is needed with the increased availability of data. At Scalefree, we believe that a robust and well-designed data model is foundational to building a scalable and resilient EDW, that supports both operational efficiency and long-term analytical agility.

Ensure Data Quality and Metadata Management

Data quality is already a key issue in data warehousing. Poor data quality can lead to inaccurate insights, flawed decision-making, and ultimately compromise business success. The effectiveness of AI agents is also directly influenced by the quality of the data they consume. Issues such as duplicate records, missing values, and inconsistent schemas can result in erroneous behavior or reduced performance. These issues can be addressed through systematic data cleaning processes and the implementation of data quality tests across ingestion and transformation pipelines. Ongoing monitoring should be in place to detect anomalies and trigger remediation actions where necessary. 

Metadata management also plays a role in agent effectiveness. Shared taxonomies and ontologies provide agents with a consistent framework for understanding data definitions across domains. Without standardized metadata, agents may cause errors in reasoning or communication due to misinterpreted values. Establishing a well-maintained data catalog and promoting organization-wide metadata standards supports both data discoverability and semantic consistency, which are essential in multi-agent environments.

Prepare for Real-Time Processing and Efficient Retrieval

AI agents do not strictly require real-time data, but having access to it can significantly enhance their performance and decision-making capabilities. Real-time data allows AI agents to be informed in quickly changing conditions and provide more accurate and relevant responses. To support this, data platforms can be set up to process streaming data or near-real-time updates.

Additionally, indexing strategies must accommodate both structured and unstructured data. When needed, structured data can continue to rely on traditional indexing methods such as inverted indexes. For unstructured content, embedding-based vector search provides agents with the means to identify semantically similar data points.

Large data objects should also be broken into manageable segments through chunking. This practice enables agents to retrieve and reason over smaller, contextually meaningful portions of data, which improves both performance and interpretability. Determining appropriate chunk sizes may require tuning to balance context with precision.

Implement Orchestration and Observability for AI Workflows

The introduction of AI agents into business processes necessitates a layer of orchestration that governs how agents collaborate, pass information, and handle dependencies. A multi-agent orchestration system should trigger the right agents for a given task, coordinate their outputs, and manage error handling or fallback logic. Orchestrators also need to support asynchronous communication where agents operate independently but contribute to a shared goal.

Monitoring and testing these workflows is essential. Agents can fail, drift from intended behavior, or interact in unintended ways. Logging, alerting, and automated feedback loops can be integrated into orchestration frameworks to surface and correct such deviations. Performance metrics such as response time, accuracy, and success rates should be tracked to ensure continued alignment with business objectives.

AI Agents as Identity-Bearing Entities

AI agents should be treated as identity-bearing entities within the enterprise architecture. This means granting them access only to the data and systems necessary for their assigned roles. To that end, just as any other employee, AI agents should abide by the principle of least privilege. Role-Based Access Control ensures that each agent’s data permissions are explicitly defined and enforceable. For example, an AI agent responsible for financial forecasting should not have access to sensitive HR data.

Integrating AI agents into existing identity and access management (IAM) systems can help enforce compliance and support auditability. Just as human users have roles and access policies, agents should be provisioned, monitored, and offboarded in a controlled and traceable manner.

Embrace a Data Mesh for Scalable Multi-Agent Workflows

Organizations expecting to deploy multiple AI agents concurrently should consider transitioning from a centralized end-to-end model to a data mesh. A data mesh distributes data ownership across domain teams and treats data as a product, aligning well with the modular nature of AI agents. This architecture allows agents to scale horizontally across business functions while maintaining domain-specific ownership of data pipelines and logic. Each agent can operate on a defined domain without depending on a centralized data engineering team, reducing bottlenecks and increasing agility. In environments with high agent interaction, domain-driven decentralization ensures that systems remain responsive and maintainable as usage grows.

Design for Modularity and Scalability

To scale the use of AI agents across business processes, data pipelines should be decomposed into independently deployable and maintainable components. This approach allows new agents or features to be added without having to duplicate or fork existing systems. Event-driven architectures, in which agents react to messages or state changes, support this level of decoupling and flexibility.

Agent-to-agent communication should be standardized using standardized protocols and contracts to allow agent-to-agent interaction predictably. By designing systems with modular interfaces and reusable components, AI agent ecosystems can grow in an agile, iterative fashion.

How to Remove Duplicate Records in Salesforce with Standard Tools

No More Duplicates

Deduplication with Salesforce Standard Tools

Duplicate data is one of the most common and damaging problems in any CRM system. Whether it’s from manual entry, marketing campaigns, or automated integrations, duplicates create chaos across sales, marketing, and reporting. The good news is that Salesforce provides powerful standard tools to identify and prevent duplicates without needing third-party applications.

In this article, we’ll explore why duplicate data is such a problem, the consequences it has on your business, and how you can use Matching Rules and Duplicate Rules in Salesforce to take control of your data quality.



Why Duplicate Data Happens in Salesforce

CRM systems are only as good as the data inside them. Unfortunately, data can enter Salesforce through many channels, making duplicates almost inevitable if you don’t have safeguards in place.

  • Manual input by sales or marketing team members
  • Web forms capturing leads from campaigns
  • API integrations with other systems
  • Automations such as Flows or imports

When these channels are not synchronized or when human error occurs, duplicate records slip into the system. Once they’re in, they can have ripple effects across every part of your organization.

The Consequences of Duplicate Data

The saying “garbage in, garbage out” applies directly to CRM systems. If your Salesforce environment is filled with duplicate data, the results can be disastrous.

  • Wasted Marketing Spend: Sending the same campaign multiple times to the same contact drives up costs and reduces ROI.
  • Lost Sales Opportunities: Sales reps waste time figuring out which record is the “real” one, slowing down the pipeline.
  • Poor Customer Experience: Customers receive duplicate or confusing communications, lowering trust and satisfaction.
  • Untrustworthy Reports: Business leaders make decisions based on flawed dashboards and KPIs, leading to bad strategy.

Put simply, duplicate data undermines every aspect of CRM performance. But with Salesforce’s standard tools, you can fix it.

Salesforce’s Standard Deduplication Tools

Salesforce provides two native features that help with deduplication:

  1. Matching Rules: Define the criteria that determines when two records should be considered the same.
  2. Duplicate Rules: Decide what happens when a match is found — block the action, allow with a warning, or report it.

Let’s go step by step through how these work in practice.

Step 1: Understanding Matching Rules

A Matching Rule is the logic that Salesforce uses to evaluate whether two records are duplicates. For example, Salesforce provides a standard Lead Matching Rule that checks for:

  • Exact matches on email address
  • Similar matches on first and last names

In many cases, the standard rules are enough. However, you can create custom matching rules to account for your organization’s unique data entry patterns. For example, you may want to consider phone numbers, company names, or other fields when evaluating duplicates.

Step 2: Creating Duplicate Rules

Once you’ve defined how Salesforce recognizes duplicates, you need to decide what to do about them. That’s where Duplicate Rules come in.

When setting up a Duplicate Rule, you’ll need to decide:

  • Which object the rule applies to (e.g., Leads, Contacts, Accounts).
  • What happens when a duplicate is detected:
    • Block: Prevents the duplicate record from being saved.
    • Allow but Alert: Lets the record be saved but notifies the user that a duplicate exists.
  • The alert message that users will see when duplicates are found.
  • The matching rule to use (e.g., Standard Lead Matching Rule).

For example, if you create a Duplicate Rule for the Lead object, you can block users from creating a new Lead when the email address already exists in Salesforce. This ensures you never have two records for the same prospect.

Step 3: Activating and Testing

After creating a Duplicate Rule, don’t forget to activate it. Once it’s active, Salesforce will enforce it every time someone tries to create or update a record.

A quick test is to try creating a record that you know already exists. Salesforce should either block the action or display your custom alert, depending on your configuration.

Practical Example

Let’s say you already have a Lead record for John Miller at GlobalTech with the email [email protected]. A sales rep accidentally tries to create a new record for Jon Miller (without the “h”) at the same company, using the same email address. Without rules, Salesforce would allow both records, creating confusion and duplicate communications.

But with Matching and Duplicate Rules in place, Salesforce will flag the record as a duplicate and prevent it from being saved. The sales rep sees an alert message explaining why, and the system stays clean.

Best Practices for Salesforce Deduplication

  • Start simple: Use Salesforce’s standard rules before creating complex custom ones.
  • Block when possible: Preventing duplicates at the source is more effective than cleaning them later.
  • Alert strategically: In some cases, like large imports, allowing but warning might be more practical.
  • Review periodically: Duplicate patterns can change as your business evolves. Review and adjust rules every few months.
  • Combine with data cleanup: If your system already has duplicates, consider a one-time cleanup before enforcing rules.

Beyond Standard Tools

While Salesforce’s standard tools cover most use cases, large enterprises or organizations with very complex data structures may benefit from advanced deduplication solutions, such as third-party apps. These tools offer fuzzy matching, cross-object detection, and automated merging capabilities. However, starting with Salesforce’s built-in features is the most cost-effective and straightforward way to protect your CRM data quality.

Conclusion

Duplicate data can cripple the effectiveness of your Salesforce CRM by wasting resources, confusing teams, and eroding customer trust. Thankfully, Salesforce provides out-of-the-box Matching Rules and Duplicate Rules to help you detect, prevent, and manage duplicates effectively.

By setting up these rules, you can ensure your CRM stays clean, your reports stay accurate, and your teams can focus on what matters most — engaging customers and closing deals.

Watch the Video

Installing and Managing Packages in Coalesce.io

Coalesce.io Package Management

In this article, we will guide you through managing packages within your Coalesce.io environment. We’ll cover everything from what packages are and why they are essential to the step-by-step process of installing, upgrading, and uninstalling them. By the end, you’ll have a clear understanding of how to leverage Coalesce’s marketplace to expand the capabilities of your data platform and streamline your development workflow.

Your data platform is a powerful tool, and while it comes with a robust set of built-in features, its true power lies in its expandability. This is where the Coalesce.io marketplace comes into play, offering a vast array of packages that can introduce new features and functionalities to your environment. Think of it as a toolkit that you can customize and grow to meet your specific needs, whether you’re implementing a Data Vault, integrating testing frameworks, or leveraging specific Snowflake functions.



Exploring the Coalesce.io Marketplace

Before we jump into the installation process, let’s take a quick look at the marketplace itself. When you open the marketplace, you’ll find different categories of packages designed to serve various purposes. These include:

  • Feature Packages: These can add new functionalities, such as leveraging Snowflake’s dynamic tables or integrating powerful tests for data quality.
  • Base Node Types: These packages introduce new node types that can be used to build your data warehouse, such as the Data Vault for Coalesce.io package, which provides specific nodes for hub, link, and satellite entities.
  • Advanced Deploy Packages: These help in managing and deploying your data pipelines more efficiently.

Each package listing provides key information, including its latest version, supported platforms (e.g., Snowflake, Databricks), release date, and a unique package ID. This ID is crucial for the installation process, as it tells Coalesce.io exactly which package you want to install. The description also offers valuable insights into the package’s features and how to use it, along with links to more detailed resources.

Step-by-Step Guide to Installing a New Package

The process of installing a new package is straightforward and can be done directly from your Coalesce.io environment settings. Here’s how you do it:

  1. Copy the Package ID: First, head to the marketplace, find the package you want to install, and copy its unique package ID. This is your key to the installation.
  2. Navigate to Settings: In your Coalesce.io environment, go to your project settings, and then to ‘packages’. You’ll see an overview of all the packages currently installed in your environment.
  3. Browse and Install: Click on the ‘Browse’ button. Here, you can paste the package ID you copied earlier. Coalesce.io will then fetch all available versions of that package.
  4. Select Version and Alias: Choose the version you want to install. It’s highly recommended to give your new package an alias. An alias is a custom name that helps you easily identify the package, especially if you have multiple versions or a large number of packages installed. For example, naming it Data Vault for Coalesce.io - v2.01 provides a clear distinction from an older version.
  5. Complete Installation: Click ‘Install’. The process might take a few moments. Once complete, Coalesce.io will confirm that the package is installed and provide links to view its new macros and node types.

The use of aliases is a best practice that helps you maintain a clear overview of which package and which version you are using, preventing confusion as your project grows.

Upgrading and Managing Package Versions

Upgrading a package is just as simple as installing a new one. The process is particularly important when a package you are already using receives an update with new features or bug fixes. Here’s the recommended best practice for a smooth upgrade:

  1. Install the New Version: Follow the installation steps outlined above to install the latest version of the package.
  2. Transfer Existing Entities: Go through your existing Coalesce.io entities (nodes) that are using the old package. You will see a clear indication of which package and version is being used. Switch the node type to the new, updated version. This process ensures that your existing workflows benefit from the new features and stability of the latest release.
  3. Review and Deactivate Old Node Types: In the package settings, you can also manage the visibility of node types. If you want to prevent accidentally using an older version, you can simply turn off the node types from the old package. This cleans up your workspace and ensures you are always building with the latest tools.
  4. Uninstall the Old Package: Once all of your entities have been successfully migrated to the new version, you can safely uninstall the old package. Coalesce.io will alert you if any nodes are still using the old version, preventing you from accidentally breaking your project. This is a critical step to keep your environment clean and efficient.

This systematic approach ensures a seamless transition and keeps your project on the cutting edge of Coalesce’s capabilities without any disruption.

Discovering New Macros and Capabilities

Beyond new node types, packages often come with a set of powerful macros. These are reusable snippets of code that can significantly speed up your development process. In your Coalesce.io settings, you can navigate to the ‘macros’ section to see all available macros, including those from your installed packages. This allows you to explore what the package can do under the hood and even integrate some of its functionalities directly into your own custom nodes.

For example, if a package includes macros for data quality checks, you can use these in your own custom SQL queries to ensure data integrity at various stages of your pipeline. This level of extensibility is what makes Coalesce.io such a versatile platform for modern data engineering.

Final Thoughts on Coalesce.io Package Management

In this article, we’ve walked through the entire lifecycle of a package in Coalesce.io. We’ve shown you how to navigate the marketplace, install a new package, and follow a best-practice process for upgrading your project. We also touched upon the importance of managing node types and exploring the powerful macros that come with packages.

The ability to extend and customize your data platform is a key advantage of Coalesce.io. By actively managing your packages, you can ensure that your environment is always up-to-date, efficient, and equipped with the tools you need to tackle any data challenge. Remember, a well-managed environment is the foundation for a successful and scalable data platform.

Watch the Video

Cost Factors in Implementing and Maintaining Data Vault 2.0

Cost Factors in Data Vault 2.0

Implementing a modern data platform is never a one-size-fits-all endeavor. Every company has unique requirements, legacy systems, and business needs. When it comes to Data Vault 2.0 (or more precisely, Data Vault 2.1), understanding the main cost factors early on can help organizations budget realistically and avoid painful surprises later. In this article, we will explore the typical phases of a Data Vault project, break down the major cost drivers, and share best practices for cost optimization and governance.



How a Data Vault 2.1 Project Looks Like

While no two projects are exactly alike, a Data Vault journey often follows a recognizable structure:

  • Training & Onboarding: Equip your team with the right skills through workshops and tool hands-on sessions.
  • Requirements Analysis: Define the first use case and design an architecture that matches requirements.
  • Architecture & Setup: Prepare the platform, establish automation, and agree on standards and conventions.
  • First Tracer Bullet Sprint: Deliver an end-to-end flow for one use case, ensuring the first business value is realized.
  • Next Sprints & Cost Optimization: Add data sources incrementally, monitor resource usage, and optimize for efficiency.

The key difference compared to traditional data warehouse projects? Instead of building layer by layer and waiting months for business value, Data Vault emphasizes sprints with early, visible results. This agile approach not only accelerates delivery but also makes cost management more transparent.

The Major Cost Factors

What drives costs in a Data Vault implementation? Broadly, there are three categories:

1. People

The largest expense in most data projects is people. Costs include developers, data modelers, business analysts, and ongoing maintainers. Skilled professionals are needed not only for implementation but also for optimization and support. Investing in training early can reduce errors and long-term inefficiencies, making this a cost that pays back quickly.

2. Architecture

Whether you deploy on-premises or in the cloud, the technical backbone of your Data Vault incurs costs. Expect expenses for:

  • Compute: Running queries, data transformations, and analytical workloads.
  • Storage: Staging areas, raw vault, business vault, and marts require structured storage planning.
  • ETL / ELT: Orchestration pipelines and integration layers that keep the system running smoothly.

3. Tooling

Tools for automation, governance, and project management also add to the bill. However, Data Vault’s standards lend themselves well to automation, reducing manual effort and long-term costs. Tools like dbt Core or Coalesce provide strong value, often at lower costs compared to legacy ETL suites.

Cost Optimization Strategies

Once the platform is running, cost optimization should not be an afterthought. Instead, it should be a guiding principle from the very beginning.

Define Responsibilities

Every instance, warehouse, or resource that incurs costs needs a clear owner. Without ownership, cloud resources often remain active long past their usefulness, silently increasing bills.

Set End Dates

Many dashboards and data pipelines are built for temporary projects. Without end dates, they keep consuming compute and storage. Assign a sunset date for every resource and re-evaluate its necessity over time.

Use Tags for Transparency

Cloud platforms allow tagging by project, department, or cost center. This makes it easier to allocate expenses and understand who is using what. Clear tagging also improves accountability and enables granular reporting.

Define Purpose

Every instance, pipeline, or report should have a clear business purpose. If you cannot state who benefits from it and why, it is a strong candidate for decommissioning.

9 Best Practices for Cost Monitoring

Effective cost management requires discipline. These nine practices provide a structured approach:

  1. Involve Stakeholders: Ensure business and technical stakeholders understand cost implications.
  2. Set Up Budget Alerts: Get notified when costs exceed defined thresholds.
  3. Use Tags for Resources: Track usage by cost center, project, or department.
  4. Create Cost Dashboards: Tools like Snowsight provide real-time insights.
  5. Enable Usage Tracking: Know who uses which resources, and why.
  6. Review Allocations: Regularly audit and rebalance resource usage.
  7. Monitor Queries: Optimize inefficient SQL to cut unnecessary costs.
  8. Optimize Warehouses: Use auto-suspend/resume and right-size compute.
  9. Optimize Storage: Leverage zero copy cloning and transient tables to save space.

The Pareto Principle in Cost Saving

Not all cost optimizations are equal. According to the 80/20 rule, 20% of resources often account for 80% of costs. Identifying and addressing these high-impact areas—such as a handful of long-running queries—can unlock significant savings with minimal effort.

How Data Vault 2.0 Helps Reduce Costs

Beyond traditional cost-cutting measures, Data Vault 2.0 itself provides structural advantages that reduce expenses:

  • Automation: Standardized entities make it possible to automate much of the raw vault, lowering developer workload.
  • Agile Development: The tracer bullet approach allows incremental delivery of business value, avoiding expensive rework.
  • Auditing & Compliance: Built-in historization and auditability support GDPR compliance, preventing costly legal issues.

Conclusion

Estimating the exact cost of a Data Vault 2.0 implementation is impossible—each project has unique factors. However, by recognizing the primary cost drivers (people, architecture, tooling), adopting disciplined cost management practices, and leveraging the automation and agility inherent in Data Vault 2.0, organizations can keep their projects efficient and cost-effective.

Cost optimization is not a one-time activity. It’s an ongoing process of review, accountability, and continuous improvement. With the right governance and monitoring in place, Data Vault 2.0 is not only a robust data architecture—it’s a cost-conscious one too.

Watch the Video

Close Menu