Simplify Data Pipelines with Custom Macros in Coalesce.io

Custom Macros in Coalesce.io

When working in data transformation platforms like Coalesce.io, efficiency and maintainability are everything. As your data pipelines grow, so does the complexity of your transformation logic. This is where custom macros become a powerful ally — they let you write reusable, dynamic SQL code that can be applied consistently across your entire data model.

In this article, we’ll explore how to use macros in Coalesce.io with the help of the Jinja templating language. You’ll see how a small example — converting currency values from cents to dollars — can scale into a reusable pattern that saves hours of development and maintenance time.



What Are Macros in Coalesce.io?

Macros are reusable pieces of logic that can be dynamically inserted into your SQL transformations. They’re written in Jinja, a templating language popularized in the Python web ecosystem: it is modeled on Django’s template syntax and is the default engine in Flask. Coalesce.io leverages Jinja to make SQL more dynamic and parameterized, letting you write logic once and reuse it anywhere.

In simple terms, a macro is like a small function that can generate SQL code on the fly. You define a macro once — for example, to perform a mathematical conversion, a date calculation, or a string manipulation — and then call it across multiple nodes in your Coalesce.io project.

Why Use Macros?

As data models evolve, repetitive transformations become a maintenance challenge. Imagine you have multiple columns across different tables that need to be converted from cents to dollars. You could copy and paste the same transformation logic everywhere, but what happens when the business decides to change the precision or switch the data type?

You’d have to update every single instance manually — a process that’s time-consuming and error-prone. With a macro, you simply change the logic once, and it updates everywhere it’s used. This drastically reduces maintenance efforts and improves consistency.

Setting Up the Example: Converting Cents to Dollars

Let’s walk through a simple example. Suppose you have a customer table with an account_balance column that stores values in cents. The business requires all monetary values to be in dollars for reporting and analytics. This means we need to divide the column by 100 and adjust the data type accordingly.

In Coalesce.io, we can perform this transformation directly within a stage node. You might start with something like:

({{ SC }} / 100)::NUMERIC(18, 2)

The {{ SC }} syntax tells Coalesce.io to dynamically insert the source column name. After running the transformation, you’ll see that account_balance is now expressed in dollars instead of cents. So far, so good — but what if multiple columns or nodes need this conversion?

Copying and pasting this logic everywhere quickly becomes inefficient. This is the perfect case for turning the transformation into a macro.

Creating a Custom Macro in Coalesce.io

To define a new macro, navigate to your project’s Build Settings and locate the Macros section. There, you can create workspace-level macros that are available across your entire Coalesce.io environment.

Here’s a simple Jinja macro to convert cents to dollars:

{% macro cents_to_dollars(column, scale=2) %}
    ({{ column }} / 100)::NUMERIC(18, {{ scale }})
{% endmacro %}

This macro does a few important things:

  • Accepts parameters — The column parameter specifies which column to transform, and scale defines the number of decimal places (defaulting to 2).
  • Performs the conversion — It divides the value by 100 and casts it to a numeric type.
  • Is reusable — You can now call this macro anywhere without rewriting the logic.

Applying the Macro in a Transformation

Once your macro is defined, you can use it directly within your stage or transform node. For example, if you want to create a new column that stores the converted dollar value, simply write:

{{ cents_to_dollars('customer.account_balance') }}

This will execute the macro, substitute the SQL expression, and generate the proper transformation logic dynamically. You can even adjust the precision if needed:

{{ cents_to_dollars('customer.account_balance', scale=4) }}

With just one line of code, you’ve achieved a flexible, reusable transformation that can be applied across multiple nodes and columns.

Benefits of Using Macros in Coalesce.io

Macros may seem like a small feature, but their impact on data engineering workflows is significant. Here are a few key advantages:

1. Reduce Maintenance Overhead

When transformation logic changes, you only need to update it in one place. This ensures consistency across your pipelines and minimizes human error.

2. Promote Reusability

Macros make your codebase more modular. Teams can share standardized transformation logic, reducing duplication and ensuring best practices are applied everywhere.

3. Improve Readability

Instead of cluttering transformations with complex SQL expressions, you can reference clean, descriptive macro calls. This improves readability and helps onboard new team members faster.

4. Enhance Collaboration

Macros can be shared across teams or workspaces, enabling collaborative development in larger analytics engineering environments.

5. Simplify Complex Logic

As transformations get more sophisticated, you can encapsulate multi-step logic inside a macro. For example, handling data type conversions, conditional mappings, or even custom business rules — all within one reusable function.

Debugging and Testing Macros

Because macros generate SQL dynamically, debugging can sometimes feel tricky. Coalesce.io helps by allowing you to preview the compiled SQL code. After running a transformation that uses a macro, open the results to view the underlying SQL — you’ll see exactly how your macro was expanded and executed.
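
For the cents_to_dollars macro above, the compiled output of {{ cents_to_dollars('customer.account_balance') }} is simply the expanded expression:

(customer.account_balance / 100)::NUMERIC(18, 2)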

This visibility is crucial for validating logic and ensuring your macros behave as expected.

When to Use Macros

A good rule of thumb is simple:

If you find yourself repeating the same transformation in more than one place, consider turning it into a macro.

Macros are particularly useful for:

  • Currency or unit conversions
  • Data cleansing logic (e.g., trimming whitespace, normalizing case)
  • Timestamp or date formatting
  • Standard calculations (e.g., margin, growth rate, ratios)
  • Conditional logic applied across multiple datasets

By standardizing these repetitive transformations, your data environment becomes cleaner, easier to manage, and more scalable.
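
To illustrate the data-cleansing case, a whitespace-trimming and case-normalizing macro might look like this (clean_string is a hypothetical name following the same pattern as cents_to_dollars):

{% macro clean_string(column) %}
    TRIM(UPPER({{ column }}))
{% endmacro %}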

Best Practices for Writing Coalesce.io Macros

  • Keep macros simple — Each macro should serve a single clear purpose. Break complex logic into smaller, composable macros when possible.
  • Use default parameters wisely — Providing defaults (like scale=2) makes macros flexible and user-friendly.
  • Document your macros — Add comments explaining what each macro does and the parameters it expects. Future maintainers will thank you.
  • Test before scaling — Validate each macro on a small dataset before applying it widely.
  • Version control your macros — Store them in a shared Git repository or Coalesce.io workspace for collaboration and traceability.

Real-World Impact

In production environments, macros can save hours of repetitive work each week. Imagine applying the same data type conversion or formatting rule across dozens of tables — a single macro call replaces all those redundant SQL snippets.

Macros also make large-scale refactoring much safer. If a business requirement changes (for example, moving from numeric to money data types), one macro edit automatically updates every transformation that depends on it.

Final Thoughts

Custom macros in Coalesce.io aren’t just a convenience — they’re a foundation for scalable, maintainable data engineering. By abstracting common logic into reusable templates, you streamline your transformations, reduce technical debt, and empower your team to focus on building insights rather than maintaining code.

As a best practice, always look for opportunities to generalize repetitive logic. When you spot patterns across transformations, that’s your cue to create a macro. Start small, experiment, and watch your data pipelines become cleaner, faster, and easier to manage.

Next Steps

Explore Coalesce’s documentation on macros and Jinja templating to deepen your understanding. You can also download our free Data World Handbook — a comprehensive guide to modern data architecture, data vault modeling, and transformation best practices.

And if you’ve built interesting macros of your own, share them in the comments below. We’d love to see how you’re simplifying your Coalesce.io workflows.

Meet the Speaker

Tim Kirschke
Senior Consultant

Tim has a Bachelor’s degree in Applied Mathematics and has been working as a BI consultant for Scalefree since the beginning of 2021. He’s an expert in the design and implementation of BI solutions, with focus on the Data Vault 2.0 methodology. His main areas of expertise are dbt, Coalesce, and BigQuery.

How to Refactor Your Raw Data Vault: Proven Strategies for Scalable Multi-Tenant Data Warehouses

Refactoring the Raw Data Vault

Refactoring is a natural part of any evolving data architecture. Whether you’re adding a new entity, integrating additional source systems, or moving toward a multi-tenant Data Warehouse, change is inevitable. In this article, we’ll explore what it means to refactor your Raw Data Vault (RDV), why it’s essential, and how to do it safely using strategies like Hub-it-out.

The discussion is based on a real-world scenario where a company’s recruitment data model needed to evolve. Initially, the model tracked requisitions but didn’t fully capture recruiter details. With new requirements, the team needed to refactor their RDV to introduce a new Hub Recruiter entity — without breaking historical data or existing queries.



What Refactoring Really Means

Before diving into the technical details, let’s clarify what refactoring is — and what it isn’t.

Refactoring means changing the internal structure of your data model without altering its external behavior. Think of it as improving or extending the foundation — adding new components, reorganizing relationships, or optimizing loading logic — while keeping the overall functionality intact.

Redesign, on the other hand, means making user-visible changes. This might involve restructuring dashboards, altering reports, or redefining business logic in ways that affect your end users.

Since the Raw Data Vault is an internal structure (not typically user-facing), most adjustments here fall under refactoring. These changes are iterative and low-risk if your base model is valid.

Deploy First, Refactor Later

Many data teams fall into the trap of seeking the “perfect” model before deployment. But as experts at Scalefree emphasize — perfection is the enemy of progress. Instead of spending months debating the right business key or model structure, deploy a valid model quickly, gather feedback, and improve it through incremental refactoring.

“Don’t aim for the best model on day one. Aim for a valid model that you can continuously improve.”

This agile mindset ensures your team delivers value early and often. Over time, with each sprint, you evolve closer to the “best” model for your organization — even though, in truth, a perfect model doesn’t exist.

The Recruiter Example: When Business Requirements Change

Let’s revisit our use case: your company is in the recruiting business. Each job requisition is led by a single recruiter. Your original model included a RecruiterID in the requisition table but didn’t have a dedicated Hub Recruiter. Now, the business wants to integrate additional recruiter data from new source systems.

So how should you adapt your Data Vault model?

  • Do you close the old Link and create a new one that includes the Recruiter Hub?
  • Do you split existing Links into smaller, recruiter-focused relationships?
  • How do you handle historical backfills and multi-tenant variations?

The answer: you refactor using the Hub-it-out strategy.

Understanding the “Hub-it-Out” Strategy

“Hub-it-out” is a practical approach to introducing new Hubs and Links into your existing Raw Data Vault without reloading everything from the source systems.

Here’s how it works:

  1. Identify the business key in your existing Satellite. For example, you already have RecruiterID stored as an attribute, even if it wasn’t modeled as a Hub initially.
  2. Create a new Hub (Hub Recruiter) by selecting all distinct recruiter IDs from your existing Satellite data.
  3. Generate hash keys for each business key value and assign load dates and record sources from the Satellite’s first occurrence.
  4. Create a Link between the existing parent Hub (e.g., Requisition) and the new Hub (Recruiter) based on relationships already present in the Satellite.

With this approach, you can build new structures directly from your existing Data Vault — no need to reload historical data from source systems, which may not even be available anymore. If you have a data lake or Persistent Staging Area (PSA), you can also load from there as an alternative.
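
As a minimal SQL sketch of steps 1–4, assuming illustrative object and column names such as sat_requisition_details, recruiter_id, ldts (load date), and rsrc (record source), the new Hub could be derived like this:

-- Steps 1-3: build Hub Recruiter from keys already captured in an existing Satellite
INSERT INTO hub_recruiter (hk_recruiter_h, recruiter_id, ldts, rsrc)
SELECT
    MD5(UPPER(TRIM(recruiter_id))) AS hk_recruiter_h,  -- hash key derived from the business key
    recruiter_id,
    MIN(ldts) AS ldts,                                 -- load date of the first occurrence
    MIN(rsrc) AS rsrc                                  -- record source (simplified)
FROM sat_requisition_details
WHERE recruiter_id IS NOT NULL
GROUP BY recruiter_id;

The new Link between Requisition and Recruiter (step 4) follows the same pattern: group by both keys, hash their combination, and reuse the Satellite’s load dates and record sources.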

Why a Valid Model Is Everything

One key prerequisite for refactoring success is having a valid Raw Data Vault model. A “valid” model means that:

  • Data is captured consistently and completely.
  • No business keys or relationships have been lost during earlier transformations.
  • Hubs, Links, and Satellites follow proper Data Vault design rules.

If your model is valid, you can refactor it safely — adding Hubs, Links, or Satellites — without touching your original data sources. This makes evolution faster, cheaper, and much less disruptive.

Multi-Tenant Data Vault Considerations

When dealing with multiple clients or tenants, you should introduce a TenantID attribute in every Hub, Link, and Satellite. This ensures that data from one tenant never overwrites another’s records, since the hash keys will differ.

Typically, the first few tenants may require adjustments as you generalize your model. But after integrating the second or third tenant, the structure stabilizes. Each new tenant may add Satellites or minor extensions — but the overall architecture remains consistent.

Avoid Overloading Links

Each source system should define its own Unit of Work — a consistent relationship between business keys. If one system defines a Link between three business keys (e.g., Requisition, Candidate, Recruiter) and another defines it between two, treat them as separate Links. Avoid “overloading” Links by mixing different granularities or structures. That’s a common source of data inconsistency and confusion.

Transitioning Without Breaking Queries

Refactoring can be disruptive for power users who query the Raw Data Vault directly. Their queries may break when Hubs or Links are renamed, split, or replaced. To manage this transition smoothly:

  • Load both old and new entities temporarily. Keep the old Link active for a while, even if it’s limited to a single tenant.
  • Mark deprecated objects clearly. Add a “deprecated” flag or comment in your metadata catalog.
  • Communicate proactively. Notify users via email or release notes, giving them 90–180 days to adjust queries.

Despite communication efforts, some users will inevitably miss the deadline — but maintaining transparency and clear documentation helps minimize friction.

Using Virtual Views for Legacy Support

Instead of maintaining redundant tables, you can recreate deprecated entities as virtual views on top of your new Data Vault structures. This approach saves storage while still supporting legacy queries.

Yes, there’s a small performance trade-off, but since these views are temporary and clearly marked as deprecated, it’s an effective bridge strategy. Inform users that performance may decrease slightly and that these views will be removed after a defined period (e.g., one year).
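
As a rough sketch, the deprecated three-key Link from the recruiter example could be recreated as a view over the new, narrower structures (all names are illustrative and the reconstruction is simplified):

-- Deprecated Link exposed as a virtual view on top of the refactored model
CREATE VIEW link_requisition_candidate_recruiter_deprecated AS
SELECT
    l.hk_requisition_candidate_l,  -- simplified: a full rebuild would recompute the hash from all three keys
    l.hk_requisition_h,
    l.hk_candidate_h,
    r.hk_recruiter_h,
    l.ldts,
    l.rsrc
FROM link_requisition_candidate l
JOIN link_requisition_recruiter r
  ON r.hk_requisition_h = l.hk_requisition_h;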

When and How to Clean Up

Every new entity adds maintenance overhead — documentation, metadata management, refactoring complexity. Once your transition period ends, remove deprecated tables and views to keep your model clean and manageable.

Set a clear timeline with your business stakeholders, archive necessary backups, and drop the obsolete entities once the window closes. This keeps your RDV lean and reduces technical debt.

Key Takeaways

  • Deploy early — perfection isn’t required for value.
  • Refactor continuously through small, validated improvements.
  • Use the Hub-it-out strategy to extend your model safely.
  • Always maintain a valid Data Vault foundation to enable future flexibility.
  • Manage user expectations through clear communication and deprecation policies.

Final Thoughts

Refactoring your Raw Data Vault isn’t about redoing work — it’s about evolving intelligently. With the right strategies and mindset, you can adapt to new business requirements, integrate additional tenants, and maintain consistency without painful re-engineering.

Whether you’re adding a Recruiter Hub, optimizing Link structures, or modernizing your multi-tenant DWH, remember: the goal isn’t perfection. It’s progress — sprint by sprint, improvement by improvement.

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Multi-Tenant Data Vault

Multi-Tenant Environment

Designing and maintaining a Data Vault in a multi-tenant environment presents unique challenges. When a data warehouse must handle not just internal data, but also data from dozens of external clients with slightly different processes and systems, the complexity increases dramatically.

A recent question we received highlighted this exact situation:

“I’m struggling with link management and the evolution process in a multi-tenant warehouse, especially putting all data together in the Information Mart. Our Data Warehouse contains internal data as well as shared data from our clients, for which we perform job requisition processes using their internal systems. We plan to onboard 50–60 clients in the next 2–3 years. Right now, we’re still in the MVP phase, supporting just a few clients. How should I manage links with so many different systems, such a large number of source tables, and processes that are similar but not identical? The goal is to have one common Information Mart design for all clients to enable standardized reporting.”

This is a classic question in modern data architecture. Let’s explore how to approach Raw Vault, Business Vault, and Information Mart design in a multi-tenant context.



Multi-Tenancy in the Raw Data Vault

A cornerstone principle in Data Vault modeling is that each Satellite is sourced from a single source system. However, in a multi-tenant setup, this guideline needs some adaptation. Many tenants use the same source systems (e.g., Salesforce, SAP) with similar core structures. In such cases, you can load multiple tenants into the same Satellite as long as you introduce a Tenant ID as part of the key.

Why Add a Tenant ID?

  • Ensures uniqueness of business keys across tenants (e.g., Customer 42 in Tenant A ≠ Customer 42 in Tenant B).
  • Partitions data naturally, so Satellites contain subsets per tenant without overwriting each other’s records.
  • Provides a straightforward way to filter or secure records by tenant.

By combining the local business key with the Tenant ID, you create a unique enterprise-wide business key. This guarantees data integrity while simplifying downstream querying and reporting.
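
In practice, this means the Tenant ID goes into the hash key calculation alongside the local business key. A minimal sketch with illustrative column names:

MD5(UPPER(TRIM(tenant_id)) || '||' || UPPER(TRIM(customer_id))) AS hk_customer_h

With this key definition, Customer 42 from Tenant A and Customer 42 from Tenant B produce different hash keys and can never overwrite each other.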

Where to Add the Tenant ID

In multi-tenant designs, the Tenant ID should ideally appear:

  • Hubs: As part of the business key or alternate key, ensuring uniqueness across tenants.
  • Links: As part of the Hub references, ensuring uniqueness in combined relationships.
  • Satellites: As a payload field for convenience, even if the hash key already includes the Tenant ID.

With this approach, every record in the Raw Vault can always be traced back to a specific tenant, which simplifies not only modeling but also governance and security.

Defining the Tenant ID

A natural question arises: what exactly is a “tenant”? The answer depends on your business context:

  • It could be a client organization you serve.
  • It could be a business unit, country, or factory in large enterprises.
  • It might also be defined by data ownership—who is responsible for the dataset.

In some cases, you may also need a reserved Tenant ID for global or shared data that is not owned by any specific tenant. This ensures consistency and supports role-based access control.

Staging and Tenant Assignment

The Tenant ID is typically introduced already in the staging layer. How it’s assigned depends on the source system:

  • Tenant-dedicated systems: Assign a constant Tenant ID for all data from that system.
  • Multi-tenant systems (e.g., SAP, Salesforce): Extract and map the Tenant ID from existing fields (e.g., business unit, org ID).
  • Global systems: Use a reserved Tenant ID (e.g., “GLOBAL”) when ownership is shared or unclear.

Assigning the Tenant ID is a hard rule: a direct assignment or mapping applied in staging, not a conditional business transformation. This keeps the assignment repeatable and traceable.
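
A simplified staging query illustrates the first two cases (object names are illustrative):

-- Tenant-dedicated source system: constant Tenant ID
SELECT 'TENANT_A' AS tenant_id, src.* FROM stg_tenant_a.requisitions AS src;

-- Multi-tenant source system: Tenant ID mapped from an existing field
SELECT src.org_unit AS tenant_id, src.* FROM stg_shared_sap.requisitions AS src;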

Business Vault in Multi-Tenant Contexts

Once Tenant IDs are embedded in the Raw Vault, the Business Vault becomes much easier to design. Business rules can be applied consistently across tenants, while preserving tenant-specific contexts.

  • Same-as Links: Crucial for resolving duplicate entities across tenants (e.g., the same customer appears in different client systems).
  • Custom Satellites: Standardize where possible, but add additional Satellites for tenant-specific customizations.
  • Wide PIT Tables: Be prepared for them—multiple tenants and diverse source systems naturally lead to broader structures.

At this stage, the goal is harmonization without oversimplification. A balance must be struck between common modeling and tenant-specific flexibility.

Designing the Information Mart

The Information Mart is where tenants—or the enterprise as a whole—derive insights. The challenge is to provide both:

  • Enterprise-wide views: Merging data from all tenants for global reporting.
  • Tenant-specific views: Allowing clients or business units to see only their data.

Common Mart Design

A single common dimensional model for all tenants reduces development overhead and supports standardized reporting. By including the Tenant ID in dimensions and facts, you can apply row-level security to restrict access per tenant.
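
One common implementation is a secured view (or a native row access policy) that joins the dimension or fact to a role-to-tenant mapping. A simplified sketch with illustrative names:

CREATE VIEW mart.dim_customer_secured AS
SELECT d.*
FROM mart.dim_customer d
JOIN security.role_tenant_map m
  ON m.tenant_id = d.tenant_id
WHERE m.role_name = CURRENT_ROLE();  -- platform-specific; Snowflake syntax shown here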

When Separate Marts Are Needed

In some cases, specific tenants may require custom Information Marts. This is typically justified when:

  • Unique KPIs or processes cannot be expressed in the common model.
  • Legal or contractual reasons require strict separation of data.

However, these should remain exceptions. A well-designed common mart, filtered by Tenant ID, is usually sufficient for most tenants.

Role of Same-as Links in Reporting

To unify data across systems and tenants, Same-as Links are critical. These resolve entity duplicates across different tenants and systems (e.g., a product appearing under different codes in SAP and Salesforce).

Same-as Links can be sourced from:

  • Raw data: Mapping tables provided by business or source systems.
  • Calculated logic: Fuzzy matching, soundex, or other deduplication algorithms.

This harmonization enables the creation of enterprise-wide dimensions that span multiple tenants.
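
When the mapping comes from raw data, loading the Same-as Link is straightforward. A simplified sketch with illustrative names, where sal_customer maps each duplicate customer key to its master key:

INSERT INTO sal_customer (hk_sal_customer_l, hk_customer_master_h, hk_customer_duplicate_h, ldts, rsrc)
SELECT
    MD5(UPPER(TRIM(m.master_customer_id)) || '||' || UPPER(TRIM(m.duplicate_customer_id))),
    MD5(UPPER(TRIM(m.master_customer_id))),
    MD5(UPPER(TRIM(m.duplicate_customer_id))),
    m.ldts,
    m.rsrc
FROM stg_customer_mapping m;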

Security and Governance in Multi-Tenant Data Vaults

By embedding Tenant IDs throughout the model, row-level security becomes straightforward. Each record can be tied to a tenant, and access can be granted or denied accordingly. This simplifies compliance with data privacy regulations and contractual obligations.

Governance practices should also establish clear rules for:

  • Defining and maintaining Tenant IDs.
  • Managing ownership of global vs. tenant-specific data.
  • Regular audits of access controls and Same-as Links.

Best Practices for Multi-Tenant Data Vaults

  1. Add Tenant IDs early: Introduce them in staging to ensure consistency across the pipeline.
  2. Unify where possible: Standardize Satellites for common structures, customize only when necessary.
  3. Reserve global IDs: Create special identifiers for shared or unclear ownership data.
  4. Secure with Tenant IDs: Use row-level security tied directly to the Tenant ID field.
  5. Leverage Same-as Links: Resolve duplicates to support enterprise-wide reporting.
  6. Design one common mart: Rely on row-level filtering instead of duplicating models per tenant.
  7. Scale incrementally: Start with MVP, refine the model as you onboard new tenants.

Conclusion

Multi-tenant Data Vault design requires careful thought about uniqueness, ownership, and harmonization. By embedding Tenant IDs consistently across Hubs, Links, and Satellites, you not only preserve data integrity but also simplify governance and security. The Business Vault and Information Mart can then be designed to support both tenant-specific and enterprise-wide perspectives.

As organizations grow and onboard more clients or business units, this approach ensures scalability without overwhelming complexity. With clear governance, Same-as Links, and standardized mart designs, you can build a robust multi-tenant data warehouse that serves diverse needs while staying maintainable and secure.

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

How to Remove Duplicate Records in Salesforce with Standard Tools

No More Duplicates

Deduplication with Salesforce Standard Tools

Duplicate data is one of the most common and damaging problems in any CRM system. Whether it’s from manual entry, marketing campaigns, or automated integrations, duplicates create chaos across sales, marketing, and reporting. The good news is that Salesforce provides powerful standard tools to identify and prevent duplicates without needing third-party applications.

In this article, we’ll explore why duplicate data is such a problem, the consequences it has on your business, and how you can use Matching Rules and Duplicate Rules in Salesforce to take control of your data quality.



Why Duplicate Data Happens in Salesforce

CRM systems are only as good as the data inside them. Unfortunately, data can enter Salesforce through many channels, making duplicates almost inevitable if you don’t have safeguards in place.

  • Manual input by sales or marketing team members
  • Web forms capturing leads from campaigns
  • API integrations with other systems
  • Automations such as Flows or imports

When these channels are not synchronized or when human error occurs, duplicate records slip into the system. Once they’re in, they can have ripple effects across every part of your organization.

The Consequences of Duplicate Data

The saying “garbage in, garbage out” applies directly to CRM systems. If your Salesforce environment is filled with duplicate data, the results can be disastrous.

  • Wasted Marketing Spend: Sending the same campaign multiple times to the same contact drives up costs and reduces ROI.
  • Lost Sales Opportunities: Sales reps waste time figuring out which record is the “real” one, slowing down the pipeline.
  • Poor Customer Experience: Customers receive duplicate or confusing communications, lowering trust and satisfaction.
  • Untrustworthy Reports: Business leaders make decisions based on flawed dashboards and KPIs, leading to bad strategy.

Put simply, duplicate data undermines every aspect of CRM performance. But with Salesforce’s standard tools, you can fix it.

Salesforce’s Standard Deduplication Tools

Salesforce provides two native features that help with deduplication:

  1. Matching Rules: Define the criteria that determine when two records should be considered the same.
  2. Duplicate Rules: Decide what happens when a match is found — block the action, allow with a warning, or report it.

Let’s go step by step through how these work in practice.

Step 1: Understanding Matching Rules

A Matching Rule is the logic that Salesforce uses to evaluate whether two records are duplicates. For example, Salesforce provides a standard Lead Matching Rule that checks for:

  • Exact matches on email address
  • Similar matches on first and last names

In many cases, the standard rules are enough. However, you can create custom matching rules to account for your organization’s unique data entry patterns. For example, you may want to consider phone numbers, company names, or other fields when evaluating duplicates.

Step 2: Creating Duplicate Rules

Once you’ve defined how Salesforce recognizes duplicates, you need to decide what to do about them. That’s where Duplicate Rules come in.

When setting up a Duplicate Rule, you’ll need to decide:

  • Which object the rule applies to (e.g., Leads, Contacts, Accounts).
  • What happens when a duplicate is detected:
    • Block: Prevents the duplicate record from being saved.
    • Allow but Alert: Lets the record be saved but notifies the user that a duplicate exists.
  • The alert message that users will see when duplicates are found.
  • The matching rule to use (e.g., Standard Lead Matching Rule).

For example, if you create a Duplicate Rule for the Lead object, you can block users from creating a new Lead when the email address already exists in Salesforce. This ensures you never have two records for the same prospect.

Step 3: Activating and Testing

After creating a Duplicate Rule, don’t forget to activate it. Once it’s active, Salesforce will enforce it every time someone tries to create or update a record.

A quick test is to try creating a record that you know already exists. Salesforce should either block the action or display your custom alert, depending on your configuration.

Practical Example

Let’s say you already have a Lead record for John Miller at GlobalTech with the email [email protected]. A sales rep accidentally tries to create a new record for Jon Miller (without the “h”) at the same company, using the same email address. Without rules, Salesforce would allow both records, creating confusion and duplicate communications.

But with Matching and Duplicate Rules in place, Salesforce will flag the record as a duplicate and prevent it from being saved. The sales rep sees an alert message explaining why, and the system stays clean.

Best Practices for Salesforce Deduplication

  • Start simple: Use Salesforce’s standard rules before creating complex custom ones.
  • Block when possible: Preventing duplicates at the source is more effective than cleaning them later.
  • Alert strategically: In some cases, like large imports, allowing but warning might be more practical.
  • Review periodically: Duplicate patterns can change as your business evolves. Review and adjust rules every few months.
  • Combine with data cleanup: If your system already has duplicates, consider a one-time cleanup before enforcing rules.

Beyond Standard Tools

While Salesforce’s standard tools cover most use cases, large enterprises or organizations with very complex data structures may benefit from advanced deduplication solutions, such as third-party apps. These tools offer fuzzy matching, cross-object detection, and automated merging capabilities. However, starting with Salesforce’s built-in features is the most cost-effective and straightforward way to protect your CRM data quality.

Conclusion

Duplicate data can cripple the effectiveness of your Salesforce CRM by wasting resources, confusing teams, and eroding customer trust. Thankfully, Salesforce provides out-of-the-box Matching Rules and Duplicate Rules to help you detect, prevent, and manage duplicates effectively.

By setting up these rules, you can ensure your CRM stays clean, your reports stay accurate, and your teams can focus on what matters most — engaging customers and closing deals.

Meet the Speaker

Tim Bauer

Tim supports the Scalefree Salesforce team in the administration, configuration, and further development of Salesforce solutions with a special focus on accounting systems. He brings with him in-depth knowledge of business process automation and model-based system design. Mr. Bauer holds a bachelor’s degree in business informatics with a focus on CRM and a master’s degree in digital transformation with a focus on data science.

Cost Factors in Implementing and Maintaining Data Vault 2.0

Cost Factors in Data Vault 2.0

Implementing a modern data platform is never a one-size-fits-all endeavor. Every company has unique requirements, legacy systems, and business needs. When it comes to Data Vault 2.0 (or more precisely, Data Vault 2.1), understanding the main cost factors early on can help organizations budget realistically and avoid painful surprises later. In this article, we will explore the typical phases of a Data Vault project, break down the major cost drivers, and share best practices for cost optimization and governance.



What a Data Vault 2.1 Project Looks Like

While no two projects are exactly alike, a Data Vault journey often follows a recognizable structure:

  • Training & Onboarding: Equip your team with the right skills through workshops and hands-on tool sessions.
  • Requirements Analysis: Define the first use case and design an architecture that matches requirements.
  • Architecture & Setup: Prepare the platform, establish automation, and agree on standards and conventions.
  • First Tracer Bullet Sprint: Deliver an end-to-end flow for one use case, ensuring the first business value is realized.
  • Next Sprints & Cost Optimization: Add data sources incrementally, monitor resource usage, and optimize for efficiency.

The key difference compared to traditional data warehouse projects? Instead of building layer by layer and waiting months for business value, Data Vault emphasizes sprints with early, visible results. This agile approach not only accelerates delivery but also makes cost management more transparent.

The Major Cost Factors

What drives costs in a Data Vault implementation? Broadly, there are three categories:

1. People

The largest expense in most data projects is people. Costs include developers, data modelers, business analysts, and ongoing maintainers. Skilled professionals are needed not only for implementation but also for optimization and support. Investing in training early can reduce errors and long-term inefficiencies, making this a cost that pays back quickly.

2. Architecture

Whether you deploy on-premises or in the cloud, the technical backbone of your Data Vault incurs costs. Expect expenses for:

  • Compute: Running queries, data transformations, and analytical workloads.
  • Storage: Staging areas, raw vault, business vault, and marts require structured storage planning.
  • ETL / ELT: Orchestration pipelines and integration layers that keep the system running smoothly.

3. Tooling

Tools for automation, governance, and project management also add to the bill. However, Data Vault’s standards lend themselves well to automation, reducing manual effort and long-term costs. Tools like dbt Core or Coalesce provide strong value, often at lower costs compared to legacy ETL suites.

Cost Optimization Strategies

Once the platform is running, cost optimization should not be an afterthought. Instead, it should be a guiding principle from the very beginning.

Define Responsibilities

Every instance, warehouse, or resource that incurs costs needs a clear owner. Without ownership, cloud resources often remain active long past their usefulness, silently increasing bills.

Set End Dates

Many dashboards and data pipelines are built for temporary projects. Without end dates, they keep consuming compute and storage. Assign a sunset date for every resource and re-evaluate its necessity over time.

Use Tags for Transparency

Cloud platforms allow tagging by project, department, or cost center. This makes it easier to allocate expenses and understand who is using what. Clear tagging also improves accountability and enables granular reporting.
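
On Snowflake, for example, object tagging takes only a couple of statements (tag and warehouse names are illustrative):

CREATE TAG IF NOT EXISTS cost_center;
ALTER WAREHOUSE reporting_wh SET TAG cost_center = 'finance_analytics';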

Define Purpose

Every instance, pipeline, or report should have a clear business purpose. If you cannot state who benefits from it and why, it is a strong candidate for decommissioning.

9 Best Practices for Cost Monitoring

Effective cost management requires discipline. These nine practices provide a structured approach:

  1. Involve Stakeholders: Ensure business and technical stakeholders understand cost implications.
  2. Set Up Budget Alerts: Get notified when costs exceed defined thresholds.
  3. Use Tags for Resources: Track usage by cost center, project, or department.
  4. Create Cost Dashboards: Tools like Snowsight provide real-time insights.
  5. Enable Usage Tracking: Know who uses which resources, and why.
  6. Review Allocations: Regularly audit and rebalance resource usage.
  7. Monitor Queries: Optimize inefficient SQL to cut unnecessary costs.
  8. Optimize Warehouses: Use auto-suspend/resume and right-size compute.
  9. Optimize Storage: Leverage zero copy cloning and transient tables to save space.

The Pareto Principle in Cost Saving

Not all cost optimizations are equal. According to the 80/20 rule, 20% of resources often account for 80% of costs. Identifying and addressing these high-impact areas—such as a handful of long-running queries—can unlock significant savings with minimal effort.
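
On Snowflake, for instance, the ACCOUNT_USAGE views make these high-impact queries easy to find. A simple sketch:

-- Top 20 query texts by total elapsed time over the last 30 days
SELECT
    query_text,
    warehouse_name,
    COUNT(*) AS executions,
    SUM(total_elapsed_time) / 1000 AS total_seconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY query_text, warehouse_name
ORDER BY total_seconds DESC
LIMIT 20;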

How Data Vault 2.0 Helps Reduce Costs

Beyond traditional cost-cutting measures, Data Vault 2.0 itself provides structural advantages that reduce expenses:

  • Automation: Standardized entities make it possible to automate much of the raw vault, lowering developer workload.
  • Agile Development: The tracer bullet approach allows incremental delivery of business value, avoiding expensive rework.
  • Auditing & Compliance: Built-in historization and auditability support GDPR compliance, preventing costly legal issues.

Conclusion

Estimating the exact cost of a Data Vault 2.0 implementation is impossible—each project has unique factors. However, by recognizing the primary cost drivers (people, architecture, tooling), adopting disciplined cost management practices, and leveraging the automation and agility inherent in Data Vault 2.0, organizations can keep their projects efficient and cost-effective.

Cost optimization is not a one-time activity. It’s an ongoing process of review, accountability, and continuous improvement. With the right governance and monitoring in place, Data Vault 2.0 is not only a robust data architecture—it’s a cost-conscious one too.

Meet the Speaker

Lorenz Kindling
Senior Consultant

Lorenz is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies in various industries for Scalefree International. Prior to Scalefree, he also worked as a consultant in the field of data analytics. This allowed him to gain a comprehensive overview of data warehousing projects and common issues that arise.

Complex Computed Satellites in Data Vault

Complex Computed Satellites

When people first learn about computed satellites in Data Vault, they often encounter very simple examples: concatenating first and last names into a full name, or applying a basic calculation within a satellite. While these examples are valid, they don’t capture the full breadth of what computed satellites can do. In reality, computed satellites are a powerful mechanism for integrating, transforming, and enriching data across your vault — enabling business-driven insights while maintaining the Data Vault principles of auditability and traceability.

This article will walk through the broader concept of computed satellites, discuss how they are designed, and provide practical implementation patterns for handling more complex use cases.



What is a Computed Satellite?

At its core, a satellite in Data Vault is a structure that describes a business object (a hub or link) by holding descriptive attributes over time. A computed satellite differs from a raw satellite because its data does not come directly from the source system but is derived through business logic.

Examples include:

  • Concatenating FirstName and LastName into FullName.
  • Deriving an age from a birthdate.
  • Producing calculated scores, risk categories, or classifications.
  • Integrating attributes from multiple satellites across different hubs via links.
  • Creating artificial relationships, such as product recommendations based on purchase history.

Importantly, a computed satellite isn’t just about the calculation itself — it’s about what the result describes and where it logically belongs in your model.

Step 1: Defining the Parent Entity

Before you build a computed satellite, you must answer a critical question: What does the result describe?

Every satellite attaches to either a hub (a business key) or a link (a relationship between keys). If your calculation produces attributes describing a customer, then the computed satellite belongs on the Customer Hub. If it describes a relationship between customers and products, it belongs on the respective link.

For example:

  • A Full Name attribute describes a Customer Hub.
  • A product recommendation score describes a Customer–Product Link.
  • A risk category for an account describes an Account Hub.

This step ensures that your computed satellite stays aligned with the business meaning of your Data Vault model.

Step 2: Designing the Structure

Once you know the parent, the next step is to decide the structure of your results. Computed satellites can contain:

  • Simple attributes (e.g., strings, numbers, dates).
  • Multiple descriptive fields derived from logic.
  • Semi-structured data, such as JSON or XML.

For example, you might calculate a JSON object capturing a customer’s segmentation profile, or an XML document describing a product configuration.

The important point: the satellite reflects the structure of your results, not the mechanics of how you implemented them.

Step 3: Implementing the Business Logic

After modeling comes implementation. Computed satellites can be populated in several ways:

SQL Views

The most common approach is to implement a computed satellite as a SQL view. Here, the SQL query both expresses the logic (e.g., joins, transformations, calculations) and defines the result structure. If SQL is sufficient for your business rules, this is often the simplest and most maintainable approach.
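
A minimal sketch of such a view, using the full-name example and illustrative object names:

-- Computed satellite implemented as a view on top of a raw satellite
CREATE VIEW business_vault.sat_customer_name_computed AS
SELECT
    s.hk_customer_h,
    s.ldts,
    s.rsrc,
    TRIM(s.first_name) || ' ' || TRIM(s.last_name) AS full_name
FROM raw_vault.sat_customer_details s;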

External Scripts (Python, R, etc.)

For more advanced transformations, machine learning, or statistical processing, you may use external code. A Python script, for example, could pick up data from raw satellites, apply complex algorithms, and write results back into a computed satellite.

The golden rule: the implementation must remain under your control. Even if a data scientist creates an initial model using tools like Azure ML or RapidMiner, once it becomes part of your Business Vault, the deployment and maintenance are governed centrally. This ensures auditability and consistency.

Materialized Tables

Sometimes, business logic requires intermediate storage. In this case, you may materialize computed satellites as physical tables populated via INSERT statements or stored procedures. This is useful for performance optimization or managing dependency chains in cascading business rules.

Complex Use Cases for Computed Satellites

1. Filtering or Subsetting Business Keys

Imagine a Partner Hub with a single satellite. Business users may want to see only clients, employees, or vendors. Computed satellites can create filtered subsets that bring the model closer to business expectations. While not always the cleanest design, this is a practical option in some industries, such as insurance.

2. Artificial Links

A link doesn’t always need to come directly from a source system. You can create artificial links based on computed relationships. For example, by analyzing purchase history, you might generate product recommendations — effectively creating a Customer–Product Recommendation Link.

3. Cascading Business Rules

A powerful pattern is to break complex logic into smaller, reusable steps:

  1. Create a simple computed satellite that performs data cleansing or a basic calculation.
  2. Use that result in a second computed satellite to apply additional rules.
  3. Join results with other business vault entities to build richer attributes.

This cascading approach makes rules easier to maintain, document, and reuse — and avoids giant, unmanageable SQL queries filled with dozens of CTEs.
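
The cascade itself is simply one view (or table) reading from another. A simplified sketch with illustrative names, picking up the account risk category mentioned earlier:

-- Step 1: cleansing rule
CREATE VIEW business_vault.sat_account_cleansed AS
SELECT
    hk_account_h,
    ldts,
    COALESCE(credit_score, 0) AS credit_score_cleansed
FROM raw_vault.sat_account_details;

-- Step 2: classification rule built on top of step 1
CREATE VIEW business_vault.sat_account_risk AS
SELECT
    hk_account_h,
    ldts,
    CASE
        WHEN credit_score_cleansed < 500 THEN 'HIGH'
        WHEN credit_score_cleansed < 700 THEN 'MEDIUM'
        ELSE 'LOW'
    END AS risk_category
FROM business_vault.sat_account_cleansed;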

Best Practices

  • Start with the business meaning: Always clarify what the result describes before modeling.
  • Keep business logic in the Business Vault, not in downstream marts.
  • Favor cascading rules over monolithic transformations — it improves maintainability and reusability.
  • Control the code: All scripts, views, and procedures must be owned by the data warehouse team, not end-users.
  • Support multiple technologies: SQL for straightforward logic, external scripts for advanced logic, and materialized tables where necessary.

Dependencies and Execution

When you cascade rules or materialize results, you introduce dependencies. One entity must load before another. To manage this, many teams implement dependency tables that track loading order. This enables recursive or automated job scheduling, ensuring consistency across the Business Vault.

Virtualized approaches (SQL views) are often easier, since query optimizers can resolve dependencies dynamically. Materialized approaches, however, provide better performance and control at scale.

Why Computed Satellites Matter

Computed satellites are more than “extra calculated fields.” They enable organizations to:

  • Bridge the gap between raw data and business expectations.
  • Implement business rules in a controlled, auditable environment.
  • Support advanced analytics and machine learning workflows inside the Data Vault framework.
  • Enable modular, reusable logic that scales across domains and use cases.

By treating computed satellites as first-class citizens in your Business Vault, you ensure that business logic is not scattered in marts, reports, or ad hoc scripts — but is instead centralized, governed, and reusable.

Conclusion

Computed satellites in Data Vault can be as simple as a concatenated name, or as complex as multi-step cascading business rules that derive artificial relationships. The key is to start by identifying what your result describes, attach the satellite to the correct parent, design the structure of your attributes, and then implement the logic in a controlled, maintainable way.

Whether implemented via SQL, Python scripts, or materialized processes, computed satellites should remain under the stewardship of your data warehouse team. By following best practices, you’ll unlock the full potential of the Business Vault — keeping it business-aligned, auditable, and ready for advanced analytics.

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

The Power of Data Contracts: From Data Chaos to Cohesion

The Power of Data Contracts

Have you ever had that feeling, the one where you wake up on a Monday morning and a familiar sense of dread washes over you? You get to your desk and hope against hope that no data pipeline has failed overnight, no dashboard has broken, and no server has crashed. For anyone working with data, this scenario is all too common. The modern data landscape is a sprawling, interconnected web where a small change in one area can trigger a cascade of failures downstream. A simple column rename, a change in data type, or an unexpected null value can bring a whole system to a grinding halt.

You spend your morning firefighting—analyzing the issue, pinpointing the source of the error, and scrambling to get everything back online. By the time you look at the clock, it’s lunchtime, and you’ve spent your entire morning just fixing a bug.

This chaos is exactly what a data contract is designed to solve. It’s a way to bring order to the madness, to create a foundation of trust and reliability. A data contract not only speeds up the bug-fixing process but also makes development and changes much easier, fostering a sense of accountability within your data teams.



What Exactly is a Data Contract?

Think of a data contract as a formal, machine-readable agreement between data producers and data consumers. It’s a pact that defines the expectations and promises between different teams in your organization. Imagine a sales dashboard team (the consumer) relying on data generated by the data engineering team (the producer). The data contract defines exactly what the data engineering team will deliver, creating a clear and reliable relationship.

Data Contract flow

While a data contract can be as detailed as needed, there are three core elements that should always be included.

1. Schema

The schema is the blueprint of your data. It defines exactly what your data will look like. This includes column names, data types, and the structure of the data. A data contract should define this schema and any potential schema changes, no matter how small. A minor change, like renaming a column, can easily break a downstream pipeline if it’s not communicated and managed properly. The schema element of the contract ensures that everyone is on the same page about the data’s structure.

2. Data Quality

Data quality is a crucial, yet often underestimated, aspect of data management. Your data contract should define data quality expectations that both producers and consumers can agree on. For example, a data warehouse team might require that a customer_id column in a source system table never be empty or null. A reporting team, on the other hand, might require that the quantity of an order never be zero. These are simple examples, but defining these expectations upfront prevents many common data problems.

3. Service Level Agreement (SLA)

An SLA is a promise that one party makes to another. In the context of a data contract, it can cover a variety of things. How quickly should a problem be fixed? How fresh does the data need to be (daily, weekly, real-time)? You can also use SLAs to manage changes. For instance, an SLA could stipulate that if the engineering team wants to rename a column, they must notify consumers one week in advance. This gives the dashboarding team time to implement the change in their reports before the new version goes live, ensuring a smooth transition without breaking anything.

Implementing Data Contracts in Practice

A data contract shouldn’t be a static PDF document that nobody uses. For it to be truly effective, it must be machine-readable and integrated into your daily workflow. Here’s how you can make that happen:

Automation is Key

Your data contract should be tested automatically against your data to ensure it’s being followed. You should also have automation in place for managing changes. For example, if a data producer updates the contract with a schema change, an automated process could send a notification to the data consumers. This automation makes people accountable for their data products. It ensures that any changes, even if they have a valid reason, are communicated clearly and don’t cause unexpected issues.

CI/CD Pipelines

You can integrate data contract checks into your Continuous Integration and Continuous Delivery (CI/CD) pipelines. Before a new deployment goes live, the pipeline can check if the changes adhere to the data contract. If they don’t, the deployment can be blocked. This prevents contract-breaking changes from ever reaching production.

Fostering Communication

While automation handles much of the communication, the ultimate goal is to foster a culture of collaboration. A data contract shouldn’t be a tool for finger-pointing (“They made the problem!”). Instead, it should be a framework that encourages teamwork, where everyone is working together to build reliable, trusted data products.

The Benefits of Data Contracts

Implementing data contracts might sound like a lot of work, especially the automation part, but the benefits are substantial:

  • More Time for Development: Automated testing and CI/CD pipelines significantly reduce the time spent on bug-fixing and troubleshooting. Your teams can focus on development and innovation instead of firefighting.
  • Data Reliability: With clear definitions and automated checks, your data becomes much more reliable. People can trust the data they are using, and they can easily check the contract to understand its quality and refresh schedule.
  • Autonomy: Data contracts enable autonomy. Teams can make changes and improvements without fear of breaking something downstream. They know that if a change is needed, the automated process will notify the right people, and everything can be managed safely and securely.

This newfound autonomy allows for a more dynamic and responsive data ecosystem. Teams are no longer afraid to innovate because they have a clear, safe process for doing so.

Getting Started with Data Contracts

If you’re ready to start, don’t try to tackle everything at once. Begin with a single use case—a small, easy-to-manage dataset. The goal is to test the process, not to solve every problem overnight.

  1. Start with Collaboration: Explain the benefits to your teams and get them working together. Don’t frame data contracts as a top-down mandate. Instead, show them how this will make their lives easier and their work more effective.
  2. Automate Everything: This is a critical step. Bring in DevOps expertise to help you build out automated testing and CI/CD pipelines. Look at the testing you already have in place and see how you can build on it.
  3. Remember the Culture and the Tech: Data contracts are both a cultural shift and a technical one. A PDF document alone won’t solve your problems. You need the technical implementation—the automation, the testing—to make the cultural shift truly stick.

Data contracts are a powerful tool for transforming your data landscape from a state of chaos to one of cohesion and trust. They empower your teams, increase data reliability, and free up valuable time for innovation.

Meet the Speaker

Lorenz Kindling
Senior Consultant

Lorenz is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies in various industries for Scalefree International. Prior to Scalefree, he also worked as a consultant in the field of data analytics. This allowed him to gain a comprehensive overview of data warehousing projects and common issues that arise.

Data Vault Hashing on Databricks with XXHASH64

Hashing on Databricks

Hashing is a core element of Data Vault modeling. Hash keys are used to uniquely identify Hubs, Links, and Satellites, and they need to be consistent, reproducible, and efficient. A common debate is whether to use widely adopted hashing algorithms like MD5 or SHA-1, or to opt for faster and smaller hash functions such as xxhash64, which Databricks natively supports.

The question is simple: What if we stored hash keys as 64-bit integers (int64) using xxhash64 instead of 128-bit MD5 values? On the surface, this looks attractive — faster generation, better join performance, and reduced storage. But as we’ll explore in this article, the trade-offs around collisions, scalability, and platform independence make this a risky choice in Data Vault 2.0 architectures.



Why Smaller Hash Keys Look Tempting

There are clear benefits to using smaller hash values such as int64:

  • Speed of generation: Algorithms like xxhash64 or CRC-64 are significantly faster than MD5 or SHA-1; published benchmarks consistently show xxhash64 outperforming the older cryptographic functions.
  • Join performance: Joining 64-bit integers is naturally faster than joining 128-bit binary or 32-character string columns.
  • Storage efficiency: An int64 hash key requires only 8 bytes, compared to 16 bytes for binary MD5 or 32 bytes for an MD5 stored as a string.

From a performance perspective, the appeal is undeniable. But in Data Vault, performance is not the only concern. The fundamental question is: How safe are these smaller hashes when used as surrogate keys at scale?
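
To make the comparison concrete, here is a minimal Spark SQL sketch (runnable on Databricks) that hashes the same business key with the candidate functions; the literal key is purely illustrative:

-- Spark SQL / Databricks: one business key, hashed with different functions
SELECT
  xxhash64('CUST-1001')    AS hk_xxhash64,   -- BIGINT, 8 bytes
  md5('CUST-1001')         AS hk_md5_hex,    -- 32-character hex string
  unhex(md5('CUST-1001'))  AS hk_md5_binary, -- 16-byte BINARY
  sha1('CUST-1001')        AS hk_sha1_hex;   -- 40-character hex string

The size difference per key is exactly what the bullets above describe: 8 bytes for xxhash64 versus 16 bytes for MD5 stored as binary (or 32 bytes as a hex string).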

The Risk of Collisions

A hash collision occurs when two different inputs produce the same hash value. In Data Vault, this means two different business keys could be treated as the same Hub, Link, or Satellite record — corrupting your data integrity.

With 64-bit hashes, the number of possible unique values is 2^64. While that sounds huge, probability tells a different story when you start loading millions or billions of rows. The so-called birthday paradox makes collisions far more likely than intuition suggests.

For example:

  • With just 10,000 records, a 32-bit hash already has a 1 in 100 chance of collision.
  • A 64-bit hash greatly reduces the risk, but at large scales (hundreds of millions of rows), the probability becomes uncomfortably high.
  • A 128-bit hash (MD5) pushes collision risk into the realm of trillions of rows before it becomes statistically relevant.
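
As a rough sanity check, these figures follow from the standard birthday-paradox approximation for n rows and a k-bit hash key (a hedged back-of-the-envelope estimate, not a measured value):

\[
p_{\text{collision}} \approx 1 - e^{-\,n^2 / (2 \cdot 2^{k})}
\]

For n = 10^9 rows and k = 64, the exponent is roughly 10^18 / (2 · 2^64) ≈ 0.027, i.e. about a 2–3% chance of at least one collision somewhere in the table; for k = 128 the same billion rows yield an exponent of about 1.5 · 10^-21, which is effectively zero.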

That’s why MD5 and SHA-1 — despite being slower — are still standard in Data Vault: they provide mathematically safe keyspace sizes for large enterprise datasets.

Performance vs. Integrity

The argument for int64 hashing often emphasizes query performance:

  • Joins on numeric columns are faster than joins on strings or binary.
  • xxhash64 is faster to compute than MD5.

These points are true. But in practice, the cost of a collision far outweighs the performance gains. A single collision can undermine the entire lineage of your data warehouse. Once data integrity is compromised, every downstream analytic and report is suspect.

As a result, most Data Vault practitioners will gladly accept the slightly higher CPU cost of MD5 or SHA-1 in exchange for peace of mind.

Platform Portability Matters

Another issue with xxhash64 is platform availability. While Databricks supports it, you also need to consider:

  • Can you compute xxhash64 in Snowflake, BigQuery, SQL Server, or Oracle?
  • Can you reproduce xxhash64 consistently in Python, Java, or ETL tools?
  • Will the algorithm be supported 5–10 years from now?

One of the strengths of MD5 and SHA-1 is their ubiquity. They are implemented in nearly every database, programming language, and ETL platform. This cross-platform reproducibility is essential when building a Data Vault that may span multiple systems. By contrast, xxhash64 locks you into Databricks (or requires custom implementations elsewhere).
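
As a hedged sketch of what a portable key expression can look like (column names, delimiter, and null handling are illustrative and should follow your own hashing standards), the same MD5-based pattern runs unchanged on Databricks and Snowflake, and with minor adjustments on most other engines:

-- Portable MD5-based hash key: concat_ws, coalesce, trim, upper and md5
-- exist in both Spark SQL (Databricks) and Snowflake
SELECT
  upper(md5(concat_ws('||',
      coalesce(trim(customer_id), ''),
      coalesce(trim(source_system), '')))) AS hk_customer
FROM source_customers;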

Why 128-Bit is the Safe Minimum

Let’s compare hash sizes:

  • CRC-32: Very fast, but collisions appear after ~10,000 records. Unusable for Data Vault.
  • xxhash64 / CRC-64: Better, but collisions become likely as datasets grow into the hundreds of millions.
  • MD5 (128-bit): Standard choice. Safe up to trillions of rows.
  • SHA-1 (160-bit): Provides even more headroom for extremely large datasets.

For most enterprises, MD5 hits the sweet spot: fast enough, widely supported, and statistically collision-free at realistic data volumes.

Alternative Approaches

If performance is a serious concern, consider these approaches instead of shrinking hash size:

  • Binary storage: Store MD5 as a 16-byte binary instead of a 32-character string. This cuts storage in half and improves join performance (see the sketch after this list).
  • Partitioning strategies: Optimize joins by partitioning your Data Vault tables, reducing the need for full-table joins.
  • Hardware acceleration: Modern CPUs provide dedicated instruction-set extensions for SHA-1 (and SHA-256) and highly optimized MD5 implementations, making these functions faster than you might expect.
  • Consider SHA-256 only if required: While SHA-2 offers stronger guarantees, it’s rarely necessary in Data Vault and adds performance overhead.
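
A minimal sketch of the binary-storage option, assuming a Databricks/Spark SQL staging query (table and column names are illustrative):

-- Store the MD5 digest as 16-byte BINARY instead of a 32-character hex string
SELECT
  unhex(md5(concat_ws('||', coalesce(trim(customer_id), '')))) AS hk_customer,   -- BINARY(16)
  current_timestamp()                                           AS load_date,
  'crm_system'                                                  AS record_source
FROM stg_customer;

-- Snowflake offers the same result directly via MD5_BINARY(...)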

Summary: Should You Use xxhash64 in Data Vault?

While xxhash64 looks attractive in Databricks because of its speed and smaller footprint, it’s not a good fit for Data Vault 2.0:

  • Collision risk is too high for large-scale data warehouses.
  • Portability is limited — you risk vendor lock-in.
  • Long-term maintainability suffers if your algorithm isn’t standardized across platforms.

In Data Vault, data integrity always comes first. That’s why MD5 (128-bit) or SHA-1 (160-bit) remain the recommended standards for hash keys. They provide the balance of performance, portability, and collision safety needed for enterprise-scale solutions.

Conclusion

If you’re working in Databricks, it might be tempting to adopt xxhash64 for hash keys. But resist that temptation. The risks of collisions and platform lock-in far outweigh the benefits. Stick with MD5 or SHA-1 for your Data Vault hash keys, store them efficiently as binary values, and optimize performance through storage and join strategies.

Remember: a faster broken key is still a broken key. In Data Vault, correctness and consistency are always the highest priority.

Watch the Video

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Joining SCD2 Tables Using Data Vault

Joining SCD2 Tables

When working with Data Vault 2.0, one of the most common challenges is how to handle Slowly Changing Dimension type 2 (SCD2) tables when loading Link Satellites. Imagine a scenario where you need to join three SCD2 tables, each with valid_from and valid_to dates, and bring them together into a single Satellite hanging from a Link. At first, this might sound straightforward, but the details matter — and depending on your approach, you could make your warehouse harder to maintain, less performant, or less flexible.

In this article, we’ll walk through the best practices for handling this situation, based on Data Vault principles. We’ll discuss why you shouldn’t rush into joining multiple SCD2 tables into one Satellite, how to handle business timelines, the role of PIT and T-PIT tables, and strategies for keeping your design scalable and future-proof.



Understanding SCD2 in the Context of Data Vault

Slowly Changing Dimensions type 2 are a way to track historical changes in dimensional data. Each record typically has valid_from and valid_to dates that describe its period of effect. In a Data Vault model, however, the focus isn’t on interpreting those business dates upfront. Instead, the Raw Data Vault stage is all about capturing what the source gives you, as-is.

That means when we first bring data into the Raw Vault, the valid_from and valid_to fields should be stored as descriptive attributes — not as part of the Satellite’s primary key. Trying to interpret and align them too early will only create unnecessary complexity.

Why Not Load One Satellite From Three SCD2 Tables?

At first glance, it may seem attractive to combine all three SCD2 tables directly into one Satellite. But this goes against Data Vault best practices:

  • Each Satellite should usually source from a single table or source system.
  • Denormalizing multiple sources into one Satellite complicates your Raw Vault and makes it harder to maintain.
  • You risk having to undo the denormalization later when new requirements come in.

The recommended approach is to create at least three Satellites — one per SCD2 table. You may even need more Satellites if you have to split them based on privacy, rate of change, or security rules. This simplifies your Raw Vault and sets you up for flexibility later.

Handling Validity Dates in Satellites

Once your Satellites are created, the question is how to handle valid_from and valid_to. Here are two key approaches:

1. Multi-Active Satellites

If your SCD2 records represent multiple simultaneously valid states (for example, different price lists where both current and future prices are valid), you can use a multi-active Satellite. In this design:

  • The primary key is composed of hash_key + load_date + subsequence.
  • valid_from and valid_to are stored as descriptive attributes, not as key parts.
  • A staging-generated subsequence ensures uniqueness within a load.

2. Standard (CDC) Satellites with Subsequence

If only one record is active at a time (the classic SCD2 case), then you don’t need a multi-active design. Instead:

  • Stick with the standard Satellite primary key: hash_key + load_date.
  • Handle multiple intraday changes by adding micro- or nanosecond subsequences to load_date.
  • This ensures only one active row per parent key at any given time.

Choosing between multi-active and CDC-style Satellites depends entirely on your data. Do you need multiple simultaneously valid records? Or does one replace the other in sequence? Your answer determines the right design.
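
To make the two key structures tangible, here is a hedged DDL sketch of a multi-active Satellite (Snowflake-style types; all table and column names are illustrative and should follow your own standards):

-- Multi-active Satellite: the subsequence is part of the primary key,
-- valid_from / valid_to stay purely descriptive attributes
CREATE TABLE sat_price_list_ma (
    hk_price_list   BINARY(16)     NOT NULL,  -- hash key of the parent Hub/Link
    load_date       TIMESTAMP_NTZ  NOT NULL,  -- technical load datetime
    sub_seq         INTEGER        NOT NULL,  -- staging-generated subsequence
    hash_diff       BINARY(16)     NOT NULL,
    valid_from      DATE,                     -- descriptive attribute
    valid_to        DATE,                     -- descriptive attribute
    price           NUMBER(18,2),
    record_source   VARCHAR,
    PRIMARY KEY (hk_price_list, load_date, sub_seq)
);

-- A standard (CDC-style) Satellite keeps the key at hash_key + load_date and
-- absorbs intraday changes through sub-second precision in load_date.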

Aligning Data with PIT Tables

Once the three Satellites are loaded, you’ll need to bring them together for reporting. This is where PIT (Point-in-Time) tables come in. PIT tables align deltas across Satellites to a common snapshot date, making it possible to present a unified view of related changes.

There are two main flavors:

  • Standard PIT – aligns data based on load_date (technical historization).
  • Temporal PIT (T-PIT) – aligns data based on business timelines like valid_from and valid_to.

A T-PIT lets you activate rows based on both the technical and business timelines. However, it comes with maintenance challenges: whenever valid_from or valid_to dates change, you may need to rebuild PIT partitions — which can be expensive if large date ranges are altered.
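
A typical query pattern on top of such a PIT table is sketched below (Satellite, PIT, and column names are hypothetical):

-- Join the PIT to its three Satellites via the pre-computed load dates
SELECT
    p.hk_contract,
    p.snapshot_date,
    s1.valid_from AS valid_from_src1,
    s2.valid_from AS valid_from_src2,
    s3.valid_from AS valid_from_src3
FROM pit_contract p
LEFT JOIN sat_contract_src1 s1
       ON s1.hk_contract = p.hk_contract AND s1.load_date = p.src1_load_date
LEFT JOIN sat_contract_src2 s2
       ON s2.hk_contract = p.hk_contract AND s2.load_date = p.src2_load_date
LEFT JOIN sat_contract_src3 s3
       ON s3.hk_contract = p.hk_contract AND s3.load_date = p.src3_load_date;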

Where Should You Apply Business Timelines?

There are three strategies for applying business timelines like valid_from and valid_to:

  1. Downstream in reporting or dashboards – simplest to maintain, and business users can define how timelines should be applied.
  2. In dimension views – apply filters and conditions directly in the SQL layer that feeds reports.
  3. Upstream in PIT/T-PIT – most performant but requires heavier maintenance whenever business timelines change.

A practical approach is to start downstream and only move timeline application upstream if performance issues demand it.

Building a Business Satellite

Once your PIT table aligns the Satellites, you may still need a Business Satellite (or Computed Satellite). This Satellite:

  • Uses the PIT table as its foundation.
  • Combines attributes from the three original Satellites using COALESCE or other business rules.
  • Optionally applies T-PIT logic if strict business timeline alignment is required.

By separating Raw Satellites (which store raw source data) from Business Satellites (which apply interpretation and business logic), you preserve Data Vault’s flexibility while still meeting analytical requirements.
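
A hedged sketch of such a Business Satellite, built as a view on top of the PIT join shown earlier (the COALESCE order encodes the business rule of which source wins; all names are illustrative):

-- Business (Computed) Satellite: merge the three Raw Satellites into one view
CREATE VIEW bsat_contract AS
SELECT
    p.hk_contract,
    p.snapshot_date,
    COALESCE(s1.valid_from, s2.valid_from, s3.valid_from) AS valid_from,
    COALESCE(s1.valid_to,   s2.valid_to,   s3.valid_to)   AS valid_to,
    COALESCE(s1.status,     s2.status,     s3.status)     AS contract_status
FROM pit_contract p
LEFT JOIN sat_contract_src1 s1
       ON s1.hk_contract = p.hk_contract AND s1.load_date = p.src1_load_date
LEFT JOIN sat_contract_src2 s2
       ON s2.hk_contract = p.hk_contract AND s2.load_date = p.src2_load_date
LEFT JOIN sat_contract_src3 s3
       ON s3.hk_contract = p.hk_contract AND s3.load_date = p.src3_load_date;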

Best Practices Recap

  • Create one Satellite per SCD2 source table — don’t denormalize too early.
  • Treat valid_from and valid_to as descriptive attributes in the Raw Vault.
  • Use multi-active Satellites only when multiple records are simultaneously valid; otherwise, stick with standard CDC Satellites.
  • Align Satellites with PIT tables; consider T-PITs only if business timelines must be applied upstream.
  • Whenever possible, push business timeline interpretation downstream to dimensions or reports for easier maintenance.
  • Use Business Satellites when you need to merge multiple sources into a single unified view.

Conclusion

Joining multiple SCD2 tables into a single Satellite is rarely the right first step in a Data Vault 2.0 implementation. Instead, build your Raw Vault Satellites closely aligned with their sources, use PIT tables to align changes, and apply business timelines carefully — starting as far downstream as possible. This approach keeps your architecture maintainable, flexible, and scalable while still supporting complex historical analysis.

By following these principles, you’ll not only simplify your data model but also give your business users the power to interpret validity ranges in ways that make sense for them — without locking your warehouse into rigid rules that are hard to maintain.

Watch the Video

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Defining Snapshot Dates in Data Vault

Defining Snapshot Dates in Data Vault

When working with Data Vault, one of the most common questions practitioners face is: how do we define snapshot dates? While load dates are tied to when data arrives from the source, snapshot dates serve a different and equally important purpose — they allow us to deliver stable, consistent, and predictable datasets to end users. In this article, we’ll break down the concept of snapshot dates, explain how they differ from load dates, and walk through practical examples of how they are implemented in a Data Vault architecture. By the end, you’ll understand how to design and manage snapshot dates effectively, and how they fit into the broader picture of multi-temporal data management.


Why Snapshot Dates Matter

Imagine you are a business user who arrives at the office at 9 AM, expecting your reports to be ready. You want those reports to be stable — not changing throughout the day unless a new cycle of data refresh is scheduled. Snapshot dates exist to decouple data delivery from data ingestion.

  • Load Date: Indicates when data from the source system arrived in your warehouse. It is system-driven and depends on the source’s delivery schedule.
  • Snapshot Date: Defines when the data is frozen for reporting and analysis. It is business-driven and follows a regular schedule (daily, hourly, every 5 minutes, etc.).

This separation allows data teams to provide users with predictable datasets, regardless of how irregularly or frequently source systems deliver new data.

The Three Timelines in Data Vault

To fully grasp snapshot dates, it’s useful to understand that Data Vault designs operate across three different timelines:

  1. Data Warehouse Timeline: Driven by load datetime stamps, representing when data batches (or real-time messages) arrive from the source system.
  2. Snapshot Timeline: Driven by the business delivery cycle. This is when data is made available to users in stable form.
  3. Business Timeline: Driven by business events (contract start dates, end dates, valid-from/valid-to fields, modified timestamps, etc.) and stored in Satellites or other model components.

For this article, we’ll focus on the second timeline: the snapshot timeline.

Defining Snapshot Dates with Examples

Daily Snapshot Example

Suppose a company wants its reports refreshed once per day, every morning at 7 AM UTC. This means that:

  • Regardless of when the source delivers data, the snapshot timestamp is always set to 7 AM.
  • Users querying the data warehouse at 9 AM will see a stable version of the data that won’t change until the next snapshot is generated.
  • In the control table, a row is inserted daily with the snapshot timestamp (e.g., 2025-09-16 07:00:00 UTC).

Hourly Snapshot Example

For real-time dashboards, an hourly or even 5-minute snapshot might be necessary. Let’s say hourly snapshots are generated:

  • A new snapshot timestamp is inserted every hour (e.g., 2025-09-16 01:00:00, 2025-09-16 02:00:00, etc.).
  • Older hourly snapshots may be discarded after a week to save storage, while daily snapshots are retained for a year.
  • End users can query either the most recent hourly snapshot or the daily snapshot depending on their needs.

Mixed Use Case

Some users might want hourly updates, while others only need a daily snapshot. In such cases:

  • The control table holds all snapshots (hourly + daily).
  • Boolean flags are used to mark whether a snapshot is “hourly,” “daily,” or “real-time.”
  • Users can filter based on these flags when running queries.

How Control Tables Help Manage Snapshot Dates

In practice, snapshot dates are managed using control tables. There are typically two types of control tables in Data Vault projects:

  1. Load Control Table: Tracks load datetime stamps for each source, indicating whether the batch has been processed into staging, raw Data Vault, business vault, or marts.
  2. Snapshot Control Table: Stores snapshot datetime stamps generated by the warehouse. These define the stable reporting layers that users can query.

When building PIT (Point-in-Time) tables and bridge tables, the snapshot control table plays a critical role. It ensures that snapshots align with user expectations, and it provides metadata for filtering (e.g., “latest snapshot,” “daily snapshot,” etc.).
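
A hedged sketch of such a snapshot control table, including the daily 7 AM UTC entry and a simple retention clean-up (Snowflake-style syntax; names, types, and the orchestration that runs these statements are illustrative):

-- Snapshot control table: one row per snapshot the warehouse exposes
CREATE TABLE control_snapshot (
    snapshot_datetime  TIMESTAMP NOT NULL,
    is_daily           BOOLEAN   DEFAULT FALSE,
    is_hourly          BOOLEAN   DEFAULT FALSE,
    is_latest          BOOLEAN   DEFAULT FALSE,
    PRIMARY KEY (snapshot_datetime)
);

-- Inserted once per day by the orchestration tool at 07:00 UTC
-- (the same job flips is_latest to FALSE on the previous row)
INSERT INTO control_snapshot (snapshot_datetime, is_daily, is_latest)
VALUES ('2025-09-16 07:00:00', TRUE, TRUE);

-- Retention: drop hourly-only snapshots older than 7 days
DELETE FROM control_snapshot
WHERE is_hourly = TRUE
  AND is_daily  = FALSE
  AND snapshot_datetime < dateadd('day', -7, current_timestamp());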

Implementing Snapshot Dates in PIT Tables

PIT tables act like indexes into your Data Vault, enabling efficient query performance. The loading process of PIT tables typically involves:

  1. Checking the snapshot control table to see which snapshots should exist.
  2. Loading the required snapshots into the PIT table if they’re missing.
  3. Ensuring that the PIT table and snapshot control table remain in sync after each refresh cycle.

When a user queries a PIT table, they can join it to the snapshot control table based on the snapshot datetime stamp. From there, they can filter by flags (e.g., latest, daily, hourly) to get the version of data they need.
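
In a hedged sketch (PIT and flag names match the illustrative control table above), such a query can look like this:

-- Read the most recent daily snapshot through the PIT
SELECT p.*
FROM pit_customer AS p
JOIN control_snapshot AS c
  ON c.snapshot_datetime = p.snapshot_datetime
WHERE c.is_daily  = TRUE
  AND c.is_latest = TRUE;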

Best Practices for Defining Snapshot Dates

  • Work with business users: Define snapshot frequencies based on real business needs (daily, hourly, real-time).
  • Be consistent: Establish a standard snapshot time (e.g., 7 AM UTC daily) to simplify reporting.
  • Use control tables: Automate the insertion of snapshot timestamps and track them for PIT/bridge table loading.
  • Retain wisely: Keep high-frequency snapshots (hourly/5-minute) for a short time, but retain daily snapshots for longer historical analysis.
  • Decouple load and snapshot timelines: Remember that load datetime depends on the source, but snapshot datetime depends on user requirements.

Snapshot Dates vs. Business Dates

It’s worth highlighting again that snapshot dates are not the same as business dates like “contract start” or “valid-to.” Business dates come directly from source systems and are stored as part of the business timeline in Satellites or Links. Snapshot dates, on the other hand, are warehouse-generated and serve as reference points for reporting and querying.

Conclusion

Snapshot dates are a cornerstone of Data Vault’s multi-temporal design. By providing a regular, predictable timestamp for reporting and queries, they ensure stability and trust in data delivery — even as source systems deliver data at unpredictable times. With the help of control tables, PIT/bridge tables, and well-defined retention policies, snapshot dates give both IT teams and business users the clarity and consistency they need. Whether your business requires daily snapshots for stable reports or high-frequency snapshots for real-time dashboards, the principles remain the same: define, control, and communicate your snapshot strategy clearly.

Watch the Video

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
