Skip to main content
search
0

Integration from WooCommerce to Salesforce (with AWS)

Salesforce WooCommerce Implementation

WooCommerce: A Robust E-commerce Solution

WooCommerce, a robust e-commerce solution, is built upon the open-source WordPress platform. This versatile platform empowers businesses to not only sell products but also offer training sessions and organize events seamlessly. Data generated from these transactions is stored in a structured database. While some user-generated information is accessible through the platform’s backend, a comprehensive view often requires extensive manual searching.



The Advantages of Salesforce Integration

The integration of Salesforce into this equation introduces a plethora of advantages. Salesforce is renowned for its capabilities in managing customer relationships, tracking buyer information over time, facilitating the distribution of marketing materials, and streamlining the invoicing process. By consolidating these functions within a unified system, businesses can significantly enhance their operational efficiency.

Our Integration Journey

Faced with the challenge of seamlessly integrating WooCommerce and Salesforce, our dedicated team embarked on a meticulous journey. This endeavor commenced with an in-depth assessment of the business requirements and a thorough evaluation of available plugins. Subsequently, our decision was to develop a custom integration solution, tailored to our specific needs.

Leveraging WooCommerce Webhooks

The core of our integration strategy hinges upon WooCommerce’s invaluable feature known as Webhooks. In this context, Webhooks enable the automatic triggering of events, such as the creation of an order, which subsequently initiates a POST request to a designated service. This service, in our case, takes the form of an AWS Lambda function script. Through this implementation, we achieved the ability to transmit order data seamlessly from WooCommerce to Salesforce.

Salesforce WooCommerce Implementation

Business Benefits of the Integration

This integration offers a myriad of advantages to our business users. Salesforce data becomes accessible within their purview, enabling them to monitor, update, and transform information into vital components such as contacts, accounts, and opportunities. Furthermore, this integration seamlessly connects to an automated billing system, enhancing the financial operations of the business.

Get in Touch With Us

If you seek further insights into the intricacies of our WooCommerce to Salesforce integration or wish to explore the possibilities of implementing a similar service for your business, we welcome you to initiate a conversation with us. Our team is readily available to address your inquiries and discuss how this integration can be tailored to meet your unique needs. Please do not hesitate to contact us through the provided channels below, and we will be delighted to assist you in optimizing your business processes.

Complex Computed Satellites in Data Vault

Complex Computed Satellites

When people first learn about computed satellites in Data Vault, they often encounter very simple examples: concatenating first and last names into a full name, or applying a basic calculation within a satellite. While these examples are valid, they don’t capture the full breadth of what computed satellites can do. In reality, computed satellites are a powerful mechanism for integrating, transforming, and enriching data across your vault — enabling business-driven insights while maintaining the Data Vault principles of auditability and traceability.

This article will walk through the broader concept of computed satellites, discuss how they are designed, and provide practical implementation patterns for handling more complex use cases.



What is a Computed Satellite?

At its core, a satellite in Data Vault is a structure that describes a business object (a hub or link) by holding descriptive attributes over time. A computed satellite differs from a raw satellite because its data does not come directly from the source system but is derived through business logic.

Examples include:

  • Concatenating FirstName and LastName into FullName.
  • Deriving an age from a birthdate.
  • Producing calculated scores, risk categories, or classifications.
  • Integrating attributes from multiple satellites across different hubs via links.
  • Creating artificial relationships, such as product recommendations based on purchase history.

Importantly, a computed satellite isn’t just about the calculation itself — it’s about what the result describes and where it logically belongs in your model.

Step 1: Defining the Parent Entity

Before you build a computed satellite, you must answer a critical question: What does the result describe?

Every satellite attaches to either a hub (a business key) or a link (a relationship between keys). If your calculation produces attributes describing a customer, then the computed satellite belongs on the Customer Hub. If it describes a relationship between customers and products, it belongs on the respective link.

For example:

  • A Full Name attribute describes a Customer Hub.
  • A product recommendation score describes a Customer–Product Link.
  • A risk category for an account describes an Account Hub.

This step ensures that your computed satellite stays aligned with the business meaning of your Data Vault model.

Step 2: Designing the Structure

Once you know the parent, the next step is to decide the structure of your results. Computed satellites can contain:

  • Simple attributes (e.g., strings, numbers, dates).
  • Multiple descriptive fields derived from logic.
  • Semi-structured data, such as JSON or XML.

For example, you might calculate a JSON object capturing a customer’s segmentation profile, or an XML document describing a product configuration.

The important point: the satellite reflects the structure of your results, not the mechanics of how you implemented them.

Step 3: Implementing the Business Logic

After modeling comes implementation. Computed satellites can be populated in several ways:

SQL Views

The most common approach is to implement a computed satellite as a SQL view. Here, the SQL query both expresses the logic (e.g., joins, transformations, calculations) and defines the result structure. If SQL is sufficient for your business rules, this is often the simplest and most maintainable approach.

External Scripts (Python, R, etc.)

For more advanced transformations, machine learning, or statistical processing, you may use external code. A Python script, for example, could pick up data from raw satellites, apply complex algorithms, and write results back into a computed satellite.

The golden rule: the implementation must remain under your control. Even if a data scientist creates an initial model using tools like Azure ML or RapidMiner, once it becomes part of your Business Vault, the deployment and maintenance are governed centrally. This ensures auditability and consistency.

Materialized Tables

Sometimes, business logic requires intermediate storage. In this case, you may materialize computed satellites as physical tables populated via INSERT statements or stored procedures. This is useful for performance optimization or managing dependency chains in cascading business rules.

Complex Use Cases for Computed Satellites

1. Filtering or Subsetting Business Keys

Imagine a Partner Hub with a single satellite. Business users may want to see only clients, employees, or vendors. Computed satellites can create filtered subsets that bring the model closer to business expectations. While not always the cleanest design, this is a practical option in some industries, such as insurance.

2. Artificial Links

A link doesn’t always need to come directly from a source system. You can create artificial links based on computed relationships. For example, by analyzing purchase history, you might generate product recommendations — effectively creating a Customer–Product Recommendation Link.

3. Cascading Business Rules

A powerful pattern is to break complex logic into smaller, reusable steps:

  1. Create a simple computed satellite that performs data cleansing or a basic calculation.
  2. Use that result in a second computed satellite to apply additional rules.
  3. Join results with other business vault entities to build richer attributes.

This cascading approach makes rules easier to maintain, document, and reuse — and avoids giant, unmanageable SQL queries filled with dozens of CTEs.

Best Practices

  • Start with the business meaning: Always clarify what the result describes before modeling.
  • Keep business logic in the Business Vault, not in downstream marts.
  • Favor cascading rules over monolithic transformations — it improves maintainability and reusability.
  • Control the code: All scripts, views, and procedures must be owned by the data warehouse team, not end-users.
  • Support multiple technologies: SQL for straightforward logic, external scripts for advanced logic, and materialized tables where necessary.

Dependencies and Execution

When you cascade rules or materialize results, you introduce dependencies. One entity must load before another. To manage this, many teams implement dependency tables that track loading order. This enables recursive or automated job scheduling, ensuring consistency across the Business Vault.

Virtualized approaches (SQL views) are often easier, since query optimizers can resolve dependencies dynamically. Materialized approaches, however, provide better performance and control at scale.

Why Computed Satellites Matter

Computed satellites are more than “extra calculated fields.” They enable organizations to:

  • Bridge the gap between raw data and business expectations.
  • Implement business rules in a controlled, auditable environment.
  • Support advanced analytics and machine learning workflows inside the Data Vault framework.
  • Enable modular, reusable logic that scales across domains and use cases.

By treating computed satellites as first-class citizens in your Business Vault, you ensure that business logic is not scattered in marts, reports, or ad hoc scripts — but is instead centralized, governed, and reusable.

Conclusion

Computed satellites in Data Vault can be as simple as a concatenated name, or as complex as multi-step cascading business rules that derive artificial relationships. The key is to start by identifying what your result describes, attach the satellite to the correct parent, design the structure of your attributes, and then implement the logic in a controlled, maintainable way.

Whether implemented via SQL, Python scripts, or materialized processes, computed satellites should remain under the stewardship of your data warehouse team. By following best practices, you’ll unlock the full potential of the Business Vault — keeping it business-aligned, auditable, and ready for advanced analytics.

Watch the Video

The Power of Data Contracts: From Data Chaos to Cohesion

The Power of Data Contracts

Have you ever had that feeling, the one where you wake up on a Monday morning and a familiar sense of dread washes over you? You get to your desk and hope against hope that no data pipeline has failed overnight, no dashboard has broken, and no server has crashed. For anyone working with data, this scenario is all too common. The modern data landscape is a sprawling, interconnected web where a small change in one area can trigger a cascade of failures downstream. A simple column rename, a change in data type, or an unexpected null value can bring a whole system to a grinding halt.

You spend your morning firefighting—analyzing the issue, pinpointing the source of the error, and scrambling to get everything back online. By the time you look at the clock, it’s lunchtime, and you’ve spent your entire morning just fixing a bug.

This chaos is exactly what a data contract is designed to solve. It’s a way to bring order to the madness, to create a foundation of trust and reliability. A data contract not only speeds up the bug-fixing process but also makes development and changes much easier, fostering a sense of accountability within your data teams.



What Exactly is a Data Contract?

Think of a data contract as a formal, machine-readable agreement between data producers and data consumers. It’s a pact that defines the expectations and promises between different teams in your organization. Imagine a sales dashboard team (the consumer) relying on data generated by the data engineering team (the producer). The data contract defines exactly what the data engineering team will deliver, creating a clear and reliable relationship.

Data Contract flow

While a data contract can be as detailed as needed, there are three core elements that should always be included.

1. Schema

The schema is the blueprint of your data. It defines exactly what your data will look like. This includes column names, data types, and the structure of the data. A data contract should define this schema and any potential schema changes, no matter how small. A minor change, like renaming a column, can easily break a downstream pipeline if it’s not communicated and managed properly. The schema element of the contract ensures that everyone is on the same page about the data’s structure.

2. Data Quality

Data quality is a crucial, yet often underestimated, aspect of data management. Your data contract should define data quality expectations that both producers and consumers can agree on. For example, a data warehouse team might require that a customer_id column in a source system table never be empty or null. A reporting team, on the other hand, might require that the quantity of an order never be zero. These are simple examples, but defining these expectations upfront prevents many common data problems.

3. Service Level Agreement (SLA)

An SLA is a promise that one party makes to another. In the context of a data contract, it can cover a variety of things. How quickly should a problem be fixed? How fresh does the data need to be (daily, weekly, real-time)? You can also use SLAs to manage changes. For instance, an SLA could stipulate that if the engineering team wants to rename a column, they must notify consumers one week in advance. This gives the dashboarding team time to implement the change in their reports before the new version goes live, ensuring a smooth transition without breaking anything.

Implementing Data Contracts in Practice

A data contract shouldn’t be a static PDF document that nobody uses. For it to be truly effective, it must be machine-readable and integrated into your daily workflow. Here’s how you can make that happen:

Automation is Key

Your data contract should be tested automatically against your data to ensure it’s being followed. You should also have automation in place for managing changes. For example, if a data producer updates the contract with a schema change, an automated process could send a notification to the data consumers. This automation makes people accountable for their data products. It ensures that any changes, even if they have a valid reason, are communicated clearly and don’t cause unexpected issues.

CI/CD Pipelines

You can integrate data contract checks into your Continuous Integration and Continuous Delivery (CI/CD) pipelines. Before a new deployment goes live, the pipeline can check if the changes adhere to the data contract. If they don’t, the deployment can be blocked. This prevents contract-breaking changes from ever reaching production.

Fostering Communication

While automation handles much of the communication, the ultimate goal is to foster a culture of collaboration. A data contract shouldn’t be a tool for finger-pointing (“They made the problem!”). Instead, it should be a framework that encourages teamwork, where everyone is working together to build reliable, trusted data products.

The Benefits of Data Contracts

Implementing data contracts might sound like a lot of work, especially the automation part, but the benefits are substantial:

  • Increased Developer Time: Automated testing and CI/CD pipelines significantly reduce the time spent on bug-fixing and troubleshooting. Your teams can focus on development and innovation instead of firefighting.
  • Data Reliability: With clear definitions and automated checks, your data becomes much more reliable. People can trust the data they are using, and they can easily check the contract to understand its quality and refresh schedule.
  • Autonomy: Data contracts enable autonomy. Teams can make changes and improvements without fear of breaking something downstream. They know that if a change is needed, the automated process will notify the right people, and everything can be managed safely and securely.

This newfound autonomy allows for a more dynamic and responsive data ecosystem. Teams are no longer afraid to innovate because they have a clear, safe process for doing so.

Getting Started with Data Contracts

If you’re ready to start, don’t try to tackle everything at once. Begin with a single use case—a small, easy-to-manage dataset. The goal is to test the process, not to solve every problem overnight.

  1. Start with Collaboration: Explain the benefits to your teams and get them working together. Don’t frame data contracts as a top-down mandate. Instead, show them how this will make their lives easier and their work more effective.
  2. Automate Everything: This is a critical step. Bring in DevOps expertise to help you build out automated testing and CI/CD pipelines. Look at the testing you already have in place and see how you can build on it.
  3. Remember the Culture and the Tech: Data contracts are both a cultural shift and a technical one. A PDF document alone won’t solve your problems. You need the technical implementation—the automation, the testing—to make the cultural shift truly stick.

Data contracts are a powerful tool for transforming your data landscape from a state of chaos to one of cohesion and trust. They empower your teams, increase data reliability, and free up valuable time for innovation.

Watch the Video

Data Vault Hashing on Databricks with XXHASH64

Hashing on Databricks

Hashing is a core element of Data Vault modeling. Hash keys are used to uniquely identify Hubs, Links, and Satellites, and they need to be consistent, reproducible, and efficient. A common debate is whether to use widely adopted hashing algorithms like MD5 or SHA-1, or to opt for faster and smaller hash functions such as xxhash64, which Databricks natively supports.

The question is simple: What if we stored hash keys as 64-bit integers (int64) using xxhash64 instead of 128-bit MD5 values? On the surface, this looks attractive — faster generation, better join performance, and reduced storage. But as we’ll explore in this article, the trade-offs around collisions, scalability, and platform independence make this a risky choice in Data Vault 2.0 architectures.



Why Smaller Hash Keys Look Tempting

There are clear benefits to using smaller hash values such as int64:

  • Speed of generation: Algorithms like xxhash64 or CRC-64 are significantly faster than MD5 or SHA-1. OpenSSL and Linux benchmarks consistently show xxhash64 outperforming older cryptographic functions.
  • Join performance: Joining 64-bit integers is naturally faster than joining 128-bit binary or 32-character string columns.
  • Storage efficiency: An int64 hash key requires only 8 bytes, compared to 16 bytes for binary MD5 or 32 bytes for an MD5 stored as a string.

From a performance perspective, the appeal is undeniable. But in Data Vault, performance is not the only concern. The fundamental question is: How safe are these smaller hashes when used as surrogate keys at scale?

The Risk of Collisions

A hash collision occurs when two different inputs produce the same hash value. In Data Vault, this means two different business keys could be treated as the same Hub, Link, or Satellite record — corrupting your data integrity.

With 64-bit hashes, the number of possible unique values is 2^64. While that sounds huge, probability tells a different story when you start loading millions or billions of rows. The so-called birthday paradox makes collisions far more likely than intuition suggests.

For example:

  • With just 10,000 records, a 32-bit hash already has a 1 in 100 chance of collision.
  • A 64-bit hash greatly reduces the risk, but at large scales (hundreds of millions of rows), the probability becomes uncomfortably high.
  • A 128-bit hash (MD5) pushes collision risk into the realm of trillions of rows before it becomes statistically relevant.

That’s why MD5 and SHA-1 — despite being slower — are still standard in Data Vault: they provide mathematically safe keyspace sizes for large enterprise datasets.

Performance vs. Integrity

The argument for int64 hashing often emphasizes query performance:

  • Joins on numeric columns are faster than joins on strings or binary.
  • xxhash64 is faster to compute than MD5.

These points are true. But in practice, the cost of a collision far outweighs the performance gains. A single collision can undermine the entire lineage of your data warehouse. Once data integrity is compromised, every downstream analytic and report is suspect.

As a result, most Data Vault practitioners will gladly accept the slightly higher CPU cost of MD5 or SHA-1 in exchange for peace of mind.

Platform Portability Matters

Another issue with xxhash64 is platform availability. While Databricks supports it, you also need to consider:

  • Can you compute xxhash64 in Snowflake, BigQuery, SQL Server, or Oracle?
  • Can you reproduce xxhash64 consistently in Python, Java, or ETL tools?
  • Will the algorithm be supported 5–10 years from now?

One of the strengths of MD5 and SHA-1 is their ubiquity. They are implemented in nearly every database, programming language, and ETL platform. This cross-platform reproducibility is essential when building a Data Vault that may span multiple systems. By contrast, xxhash64 locks you into Databricks (or requires custom implementations elsewhere).

Why 128-Bit is the Safe Minimum

Let’s compare hash sizes:

  • CRC-32: Very fast, but collisions appear after ~10,000 records. Unusable for Data Vault.
  • xxhash64 / CRC-64: Better, but collisions become likely as datasets grow into the hundreds of millions.
  • MD5 (128-bit): Standard choice. Safe up to trillions of rows.
  • SHA-1 (160-bit): Provides even more headroom for extremely large datasets.

For most enterprises, MD5 hits the sweet spot: fast enough, widely supported, and statistically collision-free at realistic data volumes.

Alternative Approaches

If performance is a serious concern, consider these approaches instead of shrinking hash size:

  • Binary storage: Store MD5 as a 16-byte binary instead of a 32-character string. This cuts storage in half and improves join performance.
  • Partitioning strategies: Optimize joins by partitioning your Data Vault tables, reducing the need for full-table joins.
  • Hardware acceleration: Modern CPUs have optimized instructions for MD5 and SHA-1, making them faster than you might expect.
  • Consider SHA-256 only if required: While SHA-2 offers stronger guarantees, it’s rarely necessary in Data Vault and adds performance overhead.

Summary: Should You Use xxhash64 in Data Vault?

While xxhash64 looks attractive in Databricks because of its speed and smaller footprint, it’s not a good fit for Data Vault 2.0:

  • Collision risk is too high for large-scale data warehouses.
  • Portability is limited — you risk vendor lock-in.
  • Long-term maintainability suffers if your algorithm isn’t standardized across platforms.

In Data Vault, data integrity always comes first. That’s why MD5 (128-bit) or SHA-1 (160-bit) remain the recommended standards for hash keys. They provide the balance of performance, portability, and collision safety needed for enterprise-scale solutions.

Conclusion

If you’re working in Databricks, it might be tempting to adopt xxhash64 for hash keys. But resist that temptation. The risks of collisions and platform lock-in far outweigh the benefits. Stick with MD5 or SHA-1 for your Data Vault hash keys, store them efficiently as binary values, and optimize performance through storage and join strategies.

Remember: a faster broken key is still a broken key. In Data Vault, correctness and consistency are always the highest priority.

Watch the Video

Joining SCD2 Tables Using Data Vault

Joining SCD2 Tables

When working with Data Vault 2.0, one of the most common challenges is how to handle Slowly Changing Dimension type 2 (SCD2) tables when loading Link Satellites. Imagine a scenario where you need to join three SCD2 tables, each with valid_from and valid_to dates, and bring them together into a single Satellite hanging from a Link. At first, this might sound straightforward, but the details matter — and depending on your approach, you could make your warehouse harder to maintain, less performant, or less flexible.

In this article, we’ll walk through the best practices for handling this situation, based on Data Vault principles. We’ll discuss why you shouldn’t rush into joining multiple SCD2 tables into one Satellite, how to handle business timelines, the role of PIT and T-PIT tables, and strategies for keeping your design scalable and future-proof.



Understanding SCD2 in the Context of Data Vault

Slowly Changing Dimensions type 2 are a way to track historical changes in dimensional data. Each record typically has valid_from and valid_to dates that describe its period of effect. In a Data Vault model, however, the focus isn’t on interpreting those business dates upfront. Instead, the Raw Data Vault stage is all about capturing what the source gives you, as-is.

That means when we first bring data into the Raw Vault, the valid_from and valid_to fields should be stored as descriptive attributes — not as part of the Satellite’s primary key. Trying to interpret and align them too early will only create unnecessary complexity.

Why Not Load One Satellite From Three SCD2 Tables?

At first glance, it may seem attractive to combine all three SCD2 tables directly into one Satellite. But this goes against Data Vault best practices:

  • Each Satellite should usually source from a single table or source system.
  • Denormalizing multiple sources into one Satellite complicates your Raw Vault and makes it harder to maintain.
  • You risk having to undo the denormalization later when new requirements come in.

The recommended approach is to create at least three Satellites — one per SCD2 table. You may even need more Satellites if you have to split them based on privacy, rate of change, or security rules. This simplifies your Raw Vault and sets you up for flexibility later.

Handling Validity Dates in Satellites

Once your Satellites are created, the question is how to handle valid_from and valid_to. Here are two key approaches:

1. Multi-Active Satellites

If your SCD2 records represent multiple simultaneously valid states (for example, different price lists where both current and future prices are valid), you can use a multi-active Satellite. In this design:

  • The primary key is composed of hash_key + load_date + subsequence.
  • valid_from and valid_to are stored as descriptive attributes, not as key parts.
  • A staging-generated subsequence ensures uniqueness within a load.

2. Standard (CDC) Satellites with Subsequence

If only one record is active at a time (the classic SCD2 case), then you don’t need a multi-active design. Instead:

  • Stick with the standard Satellite primary key: hash_key + load_date.
  • Handle multiple intraday changes by adding micro- or nanosecond subsequences to load_date.
  • This ensures only one active row per parent key at any given time.

Choosing between multi-active and CDC-style Satellites depends entirely on your data. Do you need multiple simultaneously valid records? Or does one replace the other in sequence? Your answer determines the right design.

Aligning Data with PIT Tables

Once the three Satellites are loaded, you’ll need to bring them together for reporting. This is where PIT (Point-in-Time) tables come in. PIT tables align deltas across Satellites to a common snapshot date, making it possible to present a unified view of related changes.

There are two main flavors:

  • Standard PIT – aligns data based on load_date (technical historization).
  • Temporal PIT (T-PIT) – aligns data based on business timelines like valid_from and valid_to.

A T-PIT lets you activate rows based on both the technical and business timelines. However, it comes with maintenance challenges: whenever valid_from or valid_to dates change, you may need to rebuild PIT partitions — which can be expensive if large date ranges are altered.

Where Should You Apply Business Timelines?

There are three strategies for applying business timelines like valid_from and valid_to:

  1. Downstream in reporting or dashboards – simplest to maintain, and business users can define how timelines should be applied.
  2. In dimension views – apply filters and conditions directly in the SQL layer that feeds reports.
  3. Upstream in PIT/T-PIT – most performant but requires heavier maintenance whenever business timelines change.

A practical approach is to start downstream and only move timeline application upstream if performance issues demand it.

Building a Business Satellite

Once your PIT table aligns the Satellites, you may still need a Business Satellite (or Computed Satellite). This Satellite:

  • Uses the PIT table as its foundation.
  • Combines attributes from the three original Satellites using COALESCE or other business rules.
  • Optionally applies T-PIT logic if strict business timeline alignment is required.

By separating Raw Satellites (which store raw source data) from Business Satellites (which apply interpretation and business logic), you preserve Data Vault’s flexibility while still meeting analytical requirements.

Best Practices Recap

  • Create one Satellite per SCD2 source table — don’t denormalize too early.
  • Treat valid_from and valid_to as descriptive attributes in the Raw Vault.
  • Use multi-active Satellites only when multiple records are simultaneously valid; otherwise, stick with standard CDC Satellites.
  • Align Satellites with PIT tables; consider T-PITs only if business timelines must be applied upstream.
  • Whenever possible, push business timeline interpretation downstream to dimensions or reports for easier maintenance.
  • Use Business Satellites when you need to merge multiple sources into a single unified view.

Conclusion

Joining multiple SCD2 tables into a single Satellite is rarely the right first step in a Data Vault 2.0 implementation. Instead, build your Raw Vault Satellites closely aligned with their sources, use PIT tables to align changes, and apply business timelines carefully — starting as far downstream as possible. This approach keeps your architecture maintainable, flexible, and scalable while still supporting complex historical analysis.

By following these principles, you’ll not only simplify your data model but also give your business users the power to interpret validity ranges in ways that make sense for them — without locking your warehouse into rigid rules that are hard to maintain.

Watch the Video

Monitoring of a Snowflake Powered EDW

Snowflake Built-In Tools

Introduction to Snowflake

Snowflake established itself as one of the most widely used cloud data platforms, providing a scalable and flexible architecture for building, maintaining, and operating Enterprise Data Warehouses. While it aims to eliminate data silos and simplify the data structure, with billions of queries a day (overall), its elastic computing handles the workload for good performance and a satisfied customer.

However, as organizations rely increasingly on Snowflake, monitoring their solution might be crucial for performance optimization, error and root cause analysis, and cost control. Although Snowflake handles these tasks very well on its own, experience shows some need for human interference when it comes to more complex queries, to increase performance, lowering the runtime without upscaling or scaling out.

This blog article aims to provide general information about and technical insights into Snowflake. Focusing on techniques as well as built-in tools to analyze bottlenecks, we aim to potentially increase performance and therefore decrease unnecessary costs.



Snowflake Built-In Tools

Query History

Starting with Snowflake’s built-in tools, we will dive into its roots and combinations of several metrics.

Monitoring query performance includes several depths of insight, from an overview of the general query history, down to the performance of each individual step within a query.

Snowflake Built-In Tools

Let’s assume the top level as an overview of our queries. Including metrics such as:

  • Start Time,
  • Total Time,
  • Status and
  • Query Type.

For the first level, we can gain insights when the queries start, how much time they were consuming, whether they succeeded or failed, or even are still running. Lastly, we can see the type of query, such as ‘CREATE TABLE’, ‘INSERT’, or ‘SELECT’. At this level, a brief overview could be created about the traffic within the Warehouse.

When hearing about this, you may be familiar with a built-in tool called ‘Query History’, which tracks this kind of information automatically without the need for user interference. It is easily available within the sections ‘Monitoring’ and ‘Query History’.

Query Profile

Further, it is easily possible to extract more information about the executed queries. Every single query within the ‘Query History’ has a unique identifier and attached metrics. Within the ‘Query History’ section, when choosing a query by simply clicking on it, we gain insights into the single query. These include two main sections.

‘Query Details’ delivers information about the query in general. Including the above-mentioned ‘Start Time’, ‘End Time’, ‘Warehouse Size’, and ‘Duration’, which is divided into ‘Compilation’, ‘Execution’, and further time-consuming steps.

‘Query Profile’ describes the execution plan of the query as a graphical representation. Each step within the query, such as ‘SELECT’, ‘JOIN’, and ‘WHERE CONDITION’ are represented as nodes within the graph. These nodes are operations, triggered by the query. Each fulfilling a task like filtering data, selecting data rows, but also scanning results, and reading data from cache. Snowflake interprets the written SQL and creates a performant execution plan by itself.

Despite that, each node consumes more or less resources, depending on the SQL code itself, the underlying dataset, and the Warehouse configurations. Using the ‘Query Profile’, we can identify the most time-consuming steps in our query, such as false ‘JOINS’, and work around those nodes to identify more performant solutions. Therefore, the ‘Query Profile’ provides detailed information about each node, like processing time, cache scanning, and partition pruning. For further information, check out the Snowflake Documentation.

Custom Monitoring Solution

But when Snowflake already provides a solution, why do we need separate monitoring?

Although already delivered, your requirements may vary from the existing solution. With the built-in tools mentioned above, the possibility to check on each query is given. Additionally, Snowflake may not be the only tool used for data processing. ELT tools like dbt or Coalesce help data engineers improve processes. To include these further metrics from outside Snowflake, a custom monitoring solution, based on Snowflake’s delivered metrics, is needed.

With this in mind, the following section focuses on our own scalable monitoring solution.

Account Usage and Information Schema

With the idea that the data of the graphical output exists as a table or view, containing all data necessary, it turns out that Snowflake provides us the schemas ‘ACCOUNT_USAGE’ and ‘INFORMATION_SCHEMA’, fulfilling this purpose. They provide similar opportunities for this challenge, although there are some differences.

‘ACCOUNT_USAGE’ provides insights into Snowflake metadata of several objects. This includes ‘QUERY_HISTORY’, which contains the information described in the sections above. The query history is available for all queries within the last 365 days. This enables the loading of metrics entities in the Data Warehouse, making the information readily accessible and persistent for daily, weekly, or monthly processes. The downside is that access to the ‘ACCOUNT_USAGE’ schema is often limited.

The ‘INFORMATION_SCHEMA’ contains a set of system-defined views and tables, providing metadata and information about created objects. Using predefined functions on the ‘INFORMATION_SCHEMA’, metadata can be retrieved, which, on the other hand, is limited to some factors. As this option is more accessible for developers, the focus remains on this option.

Account Usage and Information Schema

Extract Metrics Data

At the beginning, it is crucial to point out the limitations and possibilities of this approach. For the scope of this blog, two functions on the ‘INFORMATION_SCHEMA’ are needed.

‘QUERY_HISTORY_BY_WAREHOUSE()’ is equal to ‘QUERY_HISTORY’.

‘GET_QUERY_OPERATOR_STATS()’ is the equivalent of ‘QUERY_PROFILE’.

‘QUERY_HISTORY_BY_WAREHOUSE()’ takes four different arguments.

  • ‘WAREHOUSE_NAME’ – The name of the Warehouse executing the queries
  • ‘END_TIME_RANGE_START’ – Time range within the last 7 days, in which the query started running
  • ‘END_TIME_RANGE_END’ – Time range within the last 7 days, in which the query completed running
  • ‘RESULT_LIMIT’ – The number of the maximum returned rows. The default lies by ‘100’, the range by ‘1’ to ‘10.000’

The output is the ‘QUERY_HISTORY’ as a structured table with some extra information. Each query is displayed in one row. Therefore, the metrics are aggregated into one specific query.

The most critical limitation seen here is the ‘RESULT_LIMIT’ and ‘END_TIME_RANGE_*’, as it forces us to retrieve the data before the 7-day retention ends and before 10.000 other queries have been executed. Depending on the size of the Warehouse and the scope of the monitoring range, the extract and load process must be customized.

‘GET_QUERY_OPERATOR_STATS()’ takes one argument.

  • ‘QUERY_ID’ – The unique ID of the executed query.

The output is a structured table of the ‘QUERY_PROFILE’ of one specific query. Each step in the execution plan is displayed as one row, so the output size depends on the complexity of the query itself. Each node is detailed with available metadata, breaking down information from ‘GET_QUERY_OPERATOR_STATS()’ into individual steps. This allows for a deeper analysis of performance metrics, helping to identify any bottlenecks.

As the function is on query-level, it is not done by joining the table outputs of those two functions together to get a satisfactory result. Further steps are needed.

Load Metrics Data

Load Metrics Data

Before filtering, transforming, or joining the data, it is suggested to load the data as it comes into persistent tables.

Snowflake Query History

The first step should be to create the table when calling the function, already combining these two components to avoid missing capture data, columns, or false data types. To keep continuous loading, use your preferred ELT tool to append the process at the end, or set up a scheduled Snowflake Task, considering the limitations mentioned above.

Snowflake Query Profile

This process is more complex. To load the data, it is not sufficient to simply call the function once for the query history. To ensure data integrity and a clear 1:n solution (1 row in the query history, n>0 rows in the query profile), it is needed to iterate through the list of all query IDs inside the query history and run the function for each one of them. Each of the resulting datasets needs to be inserted into the corresponding table.

Information Delivery

Now, as the data is extracted and loaded, the last remaining steps are transforming the data and information delivery. As this strongly depends on the use case, we focus on linking the two metrics tables and creating a standard dashboard inside Snowflake to show the results.

Our two tables ‘snowflake_query_history’ and ‘snowflake_query_profile’ are in a soft relation, where the “QUERY_ID” serves as a unique identifier to link each query from the query history to its query profile. Therefore, joining both tables on the “QUERY_ID” is indispensable. The first decision to be made is whether to keep one row for each query and aggregate the object constructs or to split it up into its individual steps, like within the query profile. For simplicity, we focus on the second option to avoid complex SQL statements for the dashboard. As the SQL statement only includes the ‘SELECT’ and ‘JOIN’ operators, you may choose a view as materialization, instead of a table.

With the possibility of creating dashboards out of Snowflake worksheets, it is relatively easy to integrate simple SQL statements into the dashboard, represented as charts. For some examples, we will take a look at the resulting data.

Query Performance

This example shows the top 20 queries by ‘TOTAL_ELAPSED_TIME’, the overall time the query needs to compile, run, waiting time, and other technical steps. As noticed by the chosen x-axis label, Snowflake’s metrics can be combined using other tools, such as dbt. It is set to compare this time against the produced rows to gain some insights into whether the queries can handle much data or not. As can be seen, the query consuming the most time, and therefore credits, does not produce the most rows. At this point, we may be interested in gaining some insights into the query itself.

Query Profile

Showing each Operator with its corresponding relative runtime, only a handful exceed 5% of the total execution time. This may indicate that within the SQL statement, only a few steps need to be optimized to create an overall more performant query.

Query Dashboard

With this in mind, the single query can also be analyzed in the dashboard. For example, you may take into account input rows and output rows, bytes spilled, partitions scanned, and partitions total, or the join condition. Furthermore, you may not only gain insights into your query performance, but also into your table management, such as clustering and partitioning. Additionally, you are also capable of integrating more metrics coming from external sources and linking them directly to Snowflake, making it a powerful monitoring solution.

Conclusion

Snowflake offers a powerful and highly adaptable cloud data platform that meets the demands of modern enterprise data warehousing. However, as data volume and complexity grow, so does the need for proactive monitoring and optimization to ensure continued performance and cost efficiency. By leveraging Snowflake’s built-in tools and implementing strategic performance-enhancing techniques, organizations can address bottlenecks effectively and optimize query performance.

In this article, we addressed the opportunities given through Snowflake’s built-in tools and how to effectively use them. We learned that although these tools are great, to gain a fast and high-level overview, as well as detailed insights at the same time, it is relatively easy to create your own dashboard with the loaded data. Therefore, you are able to analyze and evaluate it, as well as make optimum use of resources to enable high-performance operation without increasing cost by scaling up or scaling out.

Behind the Branches – Navigating Git Workflows in Modern DevOps

A Man Branching with Visual Studio

Branching Strategies

Branching strategies are one of those topics that rarely get much attention until they suddenly become a problem. Whether it’s drowning in merge conflicts, the headache of implementing and synchronizing hotfixes across multiple branches, or a feature freeze caused by insufficient quality assurance, your repository and branching structure can have a major impact on day-to-day development.

But what branching strategies actually exist, and what are their pros and cons? Which approach allows you to deploy changes most quickly? And how can you maintain high software quality despite frequent releases?

In this article, we’ll provide a structured overview of common branching strategies and typical challenges developers face when using them.

Navigating Git Workflows in Modern DevOps

This webinar offers a clear overview of common approaches and how they impact CI/CD, code quality, and maintainability. Beyond theory, we’ll dive into practical challenges and real-world issues teams face every day. Register now for our free webinar on September 16th, 2025!

Watch Webinar Recording

Why Are Branching Strategies Relevant?

Branching models reflect the organization, release culture, and technical maturity of a project. There is no single “correct” strategy that fits every project. Choosing the right one depends heavily on the project’s context. Some of the most important questions to consider when selecting a branching strategy include:

  • Does the team work in fixed sprint or release cycles, or is code deployed continuously?
  • How many developers are working simultaneously on the same codebase?
  • What is the quality of your CI/CD pipeline? Does every change need a manual review, even if the pipeline passes, or can it be deployed automatically?

Depending on the answers to these questions, a simple or more complex branching strategy may be appropriate.

Comparison of Common Strategies

Git Flow

The Git Flow strategy was originally developed for traditional software projects with planned release cycles. Its long-lived main branches are “main” (or “master”) and “dev”.

In addition, it introduces several short-lived branches:

Feature branches

New features are developed in separate feature branches, which are merged into the develop branch once completed.

Hotfix branches

If a critical bug occurs in the production environment (i.e., on the main branch), a hotfix branch is created from main to address the issue. Once the fix is implemented and pushed to the hotfix branch, it is merged into both main and develop to ensure the bug is resolved in both branches.

Release branches

When a release is approaching, a release branch is created from develop, containing all features added since the last release. This branch is then used for final QA testing, bug fixing, and versioning. Once the release is approved, the release branch is merged into both main and develop.


The main advantage of Git Flow is its clear structure. Even in larger teams with many developers and therefore multiple concurrent feature branches, it’s easy to track which version is in what state. The strategy supports parallel development very well due to its structured branching model.

However, the downside is the organizational and technical overhead. The large number of branches and merges can lead to conflicts and divergence over time, especially with long-lived release and hotfix branches. A particular challenge arises when keeping branches in sync. Hotfixes created from main need to be merged back into main and dev, and changes made in release branches, which originate from dev, must eventually be merged into both main and dev, as shown in the diagram. These synchronization steps often introduce additional effort and increase the risk of conflicts or inconsistencies, especially when multiple streams of work are active in parallel.

Additionally, the path a feature must take, from a feature branch to develop, to a release branch, and finally to main, can slow down the deployment process.

While a solid CI/CD pipeline can help automate and streamline parts of this workflow, Git Flow does not rely on automation to function. This makes it especially suitable for teams with more manual QA processes or limited automation infrastructure.

Gitflow branching

GitHub Flow

Compared to Git Flow, the GitHub Flow strategy is significantly leaner. It uses only a single long-lived branch, usually main, and temporary feature branches that are merged via pull requests.

Once all changes on a feature branch are complete and have passed review and various tests, the branch is merged directly into main.

The key advantage of GitHub Flow is its simplicity. There are no separate release or develop branches, and even hotfixes can be handled in short-lived branches. Teams can respond to changes quickly and deploy frequently. This agility is especially effective when supported by a robust CI/CD pipeline. If properly implemented, testing, building, and deployment processes are automated, further improving GitHub Flow’s fast time to market.

Because of its low complexity and minimal coordination overhead, GitHub Flow is also particularly well-suited for smaller teams that value speed and iteration over rigid release planning.

If you’re interested in how such pipelines are structured in practice, our CI/CD pipeline Blog article offers a look at a practical GitHub-based setup using GitHub Actions and dbt. It’s a useful companion piece for understanding the automation layer that supports fast and reliable delivery.

However, this strategy also comes with limitations: it doesn’t support managing multiple parallel versions or complex release planning.

Additionally, it relies heavily on the quality of the CI/CD pipeline.

Trunk based branching

Trunk-Based Development

Trunk-Based Development is quite similar to the GitHub Flow strategy, but there are a few key differences.

While it also relies on a single long-lived branch (the trunk, typically main), commits are either made directly to main or via very short-lived feature branches. These feature branches often exist for only a few hours, and it’s common for changes to be merged into main multiple times a day. The goal is to integrate changes as early as possible to avoid conflicts before they even arise.

Because there are no fixed release cycles in Trunk-Based Development, it’s essential to ensure that incomplete features don’t go live prematurely. Feature flags play a central role here, allowing unfinished functionality to be hidden in the production environment until it’s ready.

As with GitHub Flow, a strong CI/CD pipeline is essential. It acts as the main safeguard for quality assurance and enables rapid deployment to the main branch.

Trunk-Based Development is especially effective for teams that are comfortable with rapid iteration and a high level of automation. While it can be used by smaller teams, it truly shines in larger organizations where multiple teams work in parallel and frequent integration is critical to maintaining momentum and consistency.

The benefits of Trunk-Based Development include extremely fast deployments and minimal risk of merge conflicts due to the short-lived nature of branches and continuous integration.

However, similar to GitHub Flow, this strategy heavily depends on the reliability of the CI/CD pipeline. If your team operates in a highly automated DevOps environment, this approach works smoothly. But if that’s not the case, software quality can suffer significantly. The risk is especially high here, as all changes are deployed directly to the main branch.

Conclusion

All three strategies come with their own strengths and weaknesses.

Git Flow is well-suited for larger projects with fixed release cycles, manual QA, and structured approval processes. It offers stability and clear workflows, but also brings significant technical and organizational overhead, making it a heavyweight option that can slow down development and release cycles due to its complexity and synchronization requirements.

GitHub Flow, by contrast, emphasizes speed and simplicity. It’s an excellent fit for smaller teams working on web or SaaS projects that deploy continuously, thanks to its low complexity and quick turnaround. But it relies on a good CI/CD pipeline. If tests are insufficient, faulty code might get deployed automatically.

Many of these risks can be mitigated with proper pipeline design and DevOps experience within the team, ensuring that automation is not just fast but also reliable.

Trunk-Based Development enables the highest release frequency, but only delivers consistent quality if the necessary technical maturity is in place. This makes it ideal for highly automated environments where teams ship many changes every day.

There are always ways to mitigate or minimize the downsides of any branching strategy. Techniques like blue/green or canary deployments, for example, can help reduce the impact of faulty changes and make rollbacks easier.

Stay tuned, we regularly share practical insights and solutions on topics like CI/CD, DevOps patterns, and deployment strategies.

Defining Snapshot Dates in Data Vault

Defining Snapshot Dates in Data Vault

When working with Data Vault, one of the most common questions practitioners face is: how do we define snapshot dates? While load dates are tied to when data arrives from the source, snapshot dates serve a different and equally important purpose — they allow us to deliver stable, consistent, and predictable datasets to end users. In this article, we’ll break down the concept of snapshot dates, explain how they differ from load dates, and walk through practical examples of how they are implemented in a Data Vault architecture. By the end, you’ll understand how to design and manage snapshot dates effectively, and how they fit into the broader picture of multi-temporal data management.


Why Snapshot Dates Matter

Imagine you are a business user who arrives at the office at 9 AM, expecting your reports to be ready. You want those reports to be stable — not changing throughout the day unless a new cycle of data refresh is scheduled. Snapshot dates exist to decouple data delivery from data ingestion.

  • Load Date: Indicates when data from the source system arrived in your warehouse. It is system-driven and depends on the source’s delivery schedule.
  • Snapshot Date: Defines when the data is frozen for reporting and analysis. It is business-driven and follows a regular schedule (daily, hourly, every 5 minutes, etc.).

This separation allows data teams to provide users with predictable datasets, regardless of how irregularly or frequently source systems deliver new data.

The Three Timelines in Data Vault

To fully grasp snapshot dates, it’s useful to understand that Data Vault designs operate across three different timelines:

  1. Data Warehouse Timeline: Driven by load datetime stamps, representing when data batches (or real-time messages) arrive from the source system.
  2. Snapshot Timeline: Driven by the business delivery cycle. This is when data is made available to users in stable form.
  3. Business Timeline: Driven by business events (contract start dates, end dates, valid-from/valid-to fields, modified timestamps, etc.) and stored in Satellites or other model components.

For this article, we’ll focus on the second timeline: the snapshot timeline.

Defining Snapshot Dates with Examples

Daily Snapshot Example

Suppose a company wants its reports refreshed once per day, every morning at 7 AM UTC. This means that:

  • Regardless of when the source delivers data, the snapshot timestamp is always set to 7 AM.
  • Users querying the data warehouse at 9 AM will see a stable version of the data that won’t change until the next snapshot is generated.
  • In the control table, a row is inserted daily with the snapshot timestamp (e.g., 2025-09-16 07:00:00 UTC).

Hourly Snapshot Example

For real-time dashboards, an hourly or even 5-minute snapshot might be necessary. Let’s say hourly snapshots are generated:

  • A new snapshot timestamp is inserted every hour (e.g., 2025-09-16 01:00:00, 2025-09-16 02:00:00, etc.).
  • Older hourly snapshots may be discarded after a week to save storage, while daily snapshots are retained for a year.
  • End users can query either the most recent hourly snapshot or the daily snapshot depending on their needs.

Mixed Use Case

Some users might want hourly updates, while others only need a daily snapshot. In such cases:

  • The control table holds all snapshots (hourly + daily).
  • Boolean flags are used to mark whether a snapshot is “hourly,” “daily,” or “real-time.”
  • Users can filter based on these flags when running queries.

How Control Tables Help Manage Snapshot Dates

In practice, snapshot dates are managed using control tables. There are typically two types of control tables in Data Vault projects:

  1. Load Control Table: Tracks load datetime stamps for each source, indicating whether the batch has been processed into staging, raw Data Vault, business vault, or marts.
  2. Snapshot Control Table: Stores snapshot datetime stamps generated by the warehouse. These define the stable reporting layers that users can query.

When building PIT (Point-in-Time) tables and bridge tables, the snapshot control table plays a critical role. It ensures that snapshots align with user expectations, and it provides metadata for filtering (e.g., “latest snapshot,” “daily snapshot,” etc.).

Implementing Snapshot Dates in PIT Tables

PIT tables act like indexes into your Data Vault, enabling efficient query performance. The loading process of PIT tables typically involves:

  1. Checking the snapshot control table to see which snapshots should exist.
  2. Loading the required snapshots into the PIT table if they’re missing.
  3. Ensuring that the PIT table and snapshot control table remain in sync after each refresh cycle.

When a user queries a PIT table, they can join it to the snapshot control table based on the snapshot datetime stamp. From there, they can filter by flags (e.g., latest, daily, hourly) to get the version of data they need.

Best Practices for Defining Snapshot Dates

  • Work with business users: Define snapshot frequencies based on real business needs (daily, hourly, real-time).
  • Be consistent: Establish a standard snapshot time (e.g., 7 AM UTC daily) to simplify reporting.
  • Use control tables: Automate the insertion of snapshot timestamps and track them for PIT/bridge table loading.
  • Retain wisely: Keep high-frequency snapshots (hourly/5-minute) for a short time, but retain daily snapshots for longer historical analysis.
  • Decouple load and snapshot timelines: Remember that load datetime depends on the source, but snapshot datetime depends on user requirements.

Snapshot Dates vs. Business Dates

It’s worth highlighting again that snapshot dates are not the same as business dates like “contract start” or “valid-to.” Business dates come directly from source systems and are stored as part of the business timeline in Satellites or Links. Snapshot dates, on the other hand, are warehouse-generated and serve as reference points for reporting and querying.

Conclusion

Snapshot dates are a cornerstone of Data Vault’s multi-temporal design. By providing a regular, predictable timestamp for reporting and queries, they ensure stability and trust in data delivery — even as source systems deliver data at unpredictable times. With the help of control tables, PIT/bridge tables, and well-defined retention policies, snapshot dates give both IT teams and business users the clarity and consistency they need. Whether your business requires daily snapshots for stable reports or high-frequency snapshots for real-time dashboards, the principles remain the same: define, control, and communicate your snapshot strategy clearly.

Watch the Video

Creating Data Vault Links with Coalesce.io

Data Vault Links

Data Vault modeling separates data into hubs, links, and satellites to support scalable, auditable, and historized data warehouses. Links represent relationships between business entities (hubs) — for example, which supplier delivers which part. Coalesce.io makes creating Data Vault objects fast by providing a GUI-driven workflow for building stages, generating hash keys, and creating links that can be run incrementally against a target like Snowflake.

This article walks through the exact process demonstrated in the video transcript: exploring source tables, creating a Data Vault stage, generating hash columns, building a single-source link, then extending it to a multi-source link. The goal is to give you a clear checklist and practical tips so you can reproduce the steps in your Coalesce.io environment.



Why create a Data Vault link?

In Data Vault, links are the canonical way to model relationships between business entities. A link contains:

  • a link hash key — generated from the business keys of all participants
  • the hub hash keys of every participating hub
  • load metadata such as load timestamp and record source

Creating links gives you an integrated relational layer independent of the transactional sources. Links allow you to track relationship history and stitch together hubs for downstream analytics and satellites.

Overview of the source data used

In the example from the video, you start with two resource tables in Coalesce:

  • parts — defines parts with attributes like brand, name, type, size, and contains a reference to a supplier.
  • suppliers — contains supplier metadata and a supplier key.

The parts table contains a column that references the supplier key, so semantically there is a relationship: part → supplier. This is the relationship we model as a Data Vault link.

Step 1 — Create a Data Vault stage in Coalesce

Every Data Vault object creation in Coalesce.io should start from a Data Vault stage. The stage is where you prepare the source rowset and add the derived columns that your downstream hub/link/satellite will need (hash keys, record source, load date, etc.).

  1. Right-click the source entity (in the video: parts) and select Node → Data Vault for Coalesce → Stage.
  2. Coalesce.io creates a new object (e.g., DV_stage_parts) and forwards the source columns into the stage.
  3. Add any extra columns you need for the link: a hash key for the part hub, a hash key for the supplier hub, and the combined link hash key.

Best practice: keep the naming consistent. If the column will become the hub hash key for part, name it something like HK_part_H. That makes it obvious where the column flows later.

Step 2 — Generate hub hash keys and the link hash key

Hash keys are central to Data Vault 2.0 implementations. They provide stable, compact identifiers for business keys and are typically generated using a deterministic hash function (often MD5 or SHA). In Coalesce.io you can generate these with the GUI.

  1. Right-click the business key column for the part and choose Generate hash column — rename it to HK_part_H.
  2. Repeat for the supplier business key and call it HK_supplier_H.
  3. Select both business key columns (or both generated hash columns) and choose Generate hash column again to produce the link hash. Name it something descriptive, like HK_part_supplier_L (L for link).

Important: the link hash must be calculated from the business keys (or their hub hashes) of all relationship participants and in a deterministic order. That ensures the same relationship always produces the same link hash across sources and loads.

Step 3 — Select only the columns you need for the link

Coalesce.io lets you choose which stage columns flow into the next object. For the link you typically need:

  • the link hash key (first column)
  • the hub hash keys for all participants (in order)
  • load metadata: LoadDate or LoadId, and RecordSource

Select these five (or more if you want custom metadata) and then create the Data Vault link node by right-clicking → Node → Data Vault for Coalesce → Link.

Step 4 — Configure the link object

When the new link object appears, follow these checks:

  • Confirm the link hash column is the first column and matches your naming standard (e.g., HK_part_supplier_L).
  • Verify the hub hash keys follow — HK_part_H then HK_supplier_H.
  • Make sure load metadata (load timestamp/id and record source) are present.
  • Open the Data Vault options panel in Coalesce.io and explicitly set which column is the link hash. This tells the Coalesce.io macro how to populate the link.

Click Create and then Run. Coalesce.io will generate and execute an INSERT statement (an incremental load) against your target (Snowflake in the example).

Tip: check the generated SQL before running. Coalesce.io usually issues an incremental INSERT that only adds new link rows, so the second run often returns “0 rows inserted” when there are no new relationships — that’s expected and desirable.

Step 5 — Inspect results and generated code

After running, Coalesce.io shows the result set and the generated SQL. In the video, the first execution inserted the bulk of rows (e.g., hundreds of thousands), and subsequent runs inserted zero because no new relationships existed. This shows the incremental behavior is functioning correctly.


-- example pseudo-SQL generated by Coalesce
INSERT INTO dv_link_part_supplier (...)
SELECT ...
FROM staging.dv_stage_parts
WHERE NOT EXISTS (
SELECT 1 FROM dv_link_part_supplier l
WHERE l.hk_link = staging.hk_part_supplier_l
);

Always validate the counts and confirm that the link hash values are unique per relationship. If you see duplicates or mismatches, re-check the hash generation order and the columns used.

Creating a multi-source link

A common Data Vault case is that the same relationship may appear in multiple source systems (e.g., ERP, procurement feed, third-party data). Data Vault links are designed to aggregate relationship evidence across sources. Coalesce.io supports multi-source links by allowing you to add additional source mappings to the link object.

The workflow from the video:

  1. Create or update a stage for the second (or additional) source that produces the same five columns (link hash, hub hashes, load metadata).
  2. On the link object in Coalesce, open the multi-source mappings and add a new source mapping (e.g., part_sub_source).
  3. Map the source stage columns to the link columns — Coalesce.io often auto-maps if names match.
  4. Run the link. Coalesce.io will process both sources and insert any newly observed relationships.

Note: If you run into an error where Coalesce.io cannot find the link hash for the new source, make sure the stage includes the generated join or column you intended to hash. In the video, the speaker realized they needed to explicitly build the join in the stage (generate the joined dataset) before the link could reference its hash column.

Common pitfalls & best practices

  • Hash ordering: Always use a consistent ordering for participants when generating the link hash (e.g., alphabetical by object name or a documented canonical order).
  • Naming conventions: Adopt a clear naming convention: HK_<object>_H for hub hashes and HK_<a>_<b>_L for link hashes. Consistency helps Coalesce.io auto-map and keeps downstream ETL predictable.
  • Stage first: Always prepare your stage before creating a link. The stage is where joins, derived fields, and hash generation happen.
  • Record source & load metadata: Always include record source and a load timestamp or load ID in the link so you can trace where and when a relationship was observed.
  • Incremental testing: Run the insert once to ingest the historical baseline, then run it again to verify zero rows are inserted when no changes exist.
  • Multi-source growth: Plan for adding multiple sources over time — links should be able to accept additional source mappings without rework.

When to add satellites

Links can also have satellites if you want to capture attributes specific to the relationship (for example, contract terms, effective dates, or relationship status). If you need to historize relationship attributes, create a satellite for the link and drive it with a hash diff or change detection strategy.

The video skipped satellite configuration because its focus was link creation, but be mindful that links + satellites are the full pattern for historized relationship data in Data Vault.

Conclusion

Creating Data Vault links with Coalesce.io is a straightforward, GUI-assisted process once you follow a repeatable pattern:

  1. Create a Data Vault stage for your source rows.
  2. Generate hub hash keys for all participating entities.
  3. Generate a deterministic link hash from the participants’ business keys (or hub hashes).
  4. Select the required columns and create the link object.
  5. Configure multi-source mappings as needed and run incremental loads.

Following these steps ensures your links are consistent, auditable, and ready for enterprise-grade analytics. If you haven’t implemented hubs yet, consider building hubs first (or in parallel) so your link hub keys map cleanly into the rest of the Data Vault model.

Happy modeling — and if you’re using Snowflake as your target, double-check the generated SQL from Coalesce.io and watch the run results to validate incremental behavior.

Watch the Video

Databricks and dbt: A Practical Approach to Data Vault Implementation

Bronze Silver and Gold layers in the Data Vault Structure

Databricks and dbt

Selecting the appropriate technology stack is a critical factor in the successful delivery of a Data Vault 2 architecture. Two technologies that work effectively together at a large scale data solutions are Databricks and dbt. When combined, they provide a practical way to implement Data Vault models while addressing performance, governance, and auditability requirements.

It can be argued that dbt’s role in a Databricks-based architecture is not always essential, since many of its core capabilities (such as transformation scheduling, lineage tracking, and documentation) can also be implemented using native Databricks features. Understanding the specific role each tool plays helps clarify where they complement each other and where functionality overlaps.



Databricks as the Processing and Storage Platform

Databricks’ Lakehouse architecture combines the scalability of a data lake with the reliability of a warehouse. Its Delta Lake technology offers ACID transactions, schema enforcement, and time travel, enabling precise historical querying, which are relevant aspects when it comes to Data Vault’s historization requirements.

With Unity Catalog, Databricks centralizes metadata management and enforces fine-grained access control, ensuring sensitive attributes are protected without introducing unnecessary satellite splits. This alignment between governance and performance is particularly relevant in Data Vault environments.

dbt as the Transformation and Orchestration Layer

dbt manages and automates SQL-based transformations in a modular and version-controlled manner. In a Data Vault context, dbt enables:

  • The creation of Hubs, Links, and Satellites through templated, reusable models. Here, different packages can be leveraged, like our datavault4dbt package, which is constantly updated to be fully compliant with the most recent Data Vault standards.
  • Integrated testing to validate business keys, relationships, and data quality.
  • Automated documentation that directly reflects the structure and dependencies of the Data Vault.

This structured approach makes transformations transparent and repeatable, supporting the auditability requirements inherent to Data Vault.

Integration in a Data Vault Workflow

When Databricks and dbt are deployed together:

  • Data ingestion occurs in Databricks, storing raw datasets as Delta tables, usually in the Bronze layer.
  • dbt transformations generate Raw Vault entities and Business Vault objects in the Silver layer.
  • Governance and security controls are enforced via Unity Catalog without altering the Data Vault model structure.
Bronze Silver and Gold layers in the Data Vault Structure

This approach preserves Data Vault’s methodological structure while using Databricks’ distributed compute and storage capabilities.

Business Value when combining dbt and Databricks

The combined use of Databricks and dbt offers:

  • Scalable processing of large, complex datasets: Databricks handles enterprise-scale data efficiently, while dbt structures transformations into modular, reusable components.
  • Consistent governance across all layers of the Data Vault: dbt’s lineage and documentation, plus Unity Catalog’s access control, ensure compliance and transparency end to end.
  • Lower operational risk through tested, version-controlled transformations: Git-based versioning and automated tests in dbt reduce errors before execution on Databricks.
  • Improved query performance for information marts and analytics: Delta Lake optimizations and dbt’s pre-aggregated tables with business logic minimize expensive joins.

For organizations building Data Vault on Databricks, dbt strengthens structure and quality while Databricks ensures scalability and performance.

Agile Development in Data Warehousing with Data Vault 2.0

Agile Development in Data Warehousing: Initial Situation

Agile methodologies bring flexibility and adaptability to data warehousing, making them a natural fit for modern approaches like Data Vault 2.0. A common issue in data warehousing projects is that a scope is often missing and many of the processes such as controlled access, GDPR handling, auditability, documentation and infrastructure are not optimized. Additionally, data warehouse projects that have a scope often begin without a real focus on business value. This is mostly due to the fact that the use cases are not clearly communicated and the data architects do not know where to start. The consequence of this means  no business value can be delivered.

Data Vault 2.0 Methodology

It is often assumed that Data Vault 2.0 is only a modeling technique, but this is not correct. Data Vault 2.0 includes the modeling technique, a reference architecture and the methodology. The methodology introduces project management tools such as CMMI, Six Sigma and Scrum to solve the problems described. While CMMI and Six Sigma deal with general project management issues, Scrum is mostly used specifically in the development team and provides the framework for a continuously improving development process.  The use of agile development in Data Vault 2.0 projects will be described in more detail below.

The Scope of a Sprint

The first step in setting up a data warehouse project in an agile way is defining the objective of the project with just one or two pages. Unlike waterfall projects, the goal is to produce working pieces of usable outputs, could be reports or dashboards, in continuous iterations, otherwise called sprints. This means that we don’t need to plan the entire project in detail but instead can build around a general idea or goal for the final data warehouse before then focusing on planning the first sprints. In order to address the aforementioned problems, the focus of the sprints needs to be centered around business value. For this reason, it is important to receive constant feedback from the business users for a continuous improvement process.

Define the project

Both the scope of a sprint and the architecture follow a business value driven approach built vertically and not horizontally. This means they are not built layer by layer but instead feature by feature. A common approach for this is the Tracer Bullet approach. Based on business value, which is defined by a report, a dashboard or an information mart, the source data will be identified and modeled through all layers and loaded. 

As shown in Figure 1, the entire staging area layer is not initially built but rather a small part of the respective layer is built based on data in the scope, in this case the SalesReport.

Agile Development

Before new functionality can be implemented in a sprint, it needs to be defined.
This task lies with the product owner as they are the ones to write and prioritize user stories.
As already explained, the goal of a sprint is to produce working pieces of usable outputs called features.
In addition, there are tech topics that need to be considered. There are various methods to support Sprint Planning, such as planning poker or Function Point Analysis, which are discussed in more detail in another article.

Another good indicator is to evaluate the sprint itself while the sprint is still ongoing. If the development team does not manage to implement a feature in a sprint, this can often be seen as a good indicator that the scope is too large. 

To avoid this, all work packages that are not relevant for the feature should be removed. Though, what is often the case these work packages are not completely removed out of fear from the business user. 

To address this fear it is important to educate the business user that they will be delivered but only in a later sprint and temporarily moved into the backlog.

Agile - Data Warehousing Sprint
Figure1 : Data Vault 2.0 Architecture

Due to the flexible and scalable Data Vault model, these layers can be extended with the next feature with little to no re-engineering. This is possible due to the fact Data Vault consists of a Raw Data Vault and a Business Vault model which means it contains the logical architecture as well as the data modeling perspective. The Raw Data Vault is modeled in a data-driven way by integrating the data by business keys. Only hard business rules like data type conversions or hash key calculations are applied. All other soft business rules are only applied in the Business Vault. 

Here, we turn data into information. For this reason, the Raw Data Vault requires less refactoring and can be extended limitlessly.

Agile Development Review

Another important success factor for agile projects is proper review and improvement. Even before the next sprint starts, two meetings must be held by the team:

  • The sprint review meeting: This meeting is about reviewing the delivered features. Usually the development team, the product owner, the Scrum Master and the end-users participate in this meeting.
  • Retrospective meeting: This meeting usually takes place directly after the review meeting and focuses on identifying activities that need to be improved.
  • Backlog refinement for prioritizing the user stories and to make sure that the team understands what to do
  • Sprint planning to plan which user stories fit into the next sprint based on estimating the effort.

It is important that these meetings are held so that toe source errors can be found. In this way, the outcome of a project can be improved and the development processes optimized in an iterative way.

Conclusion

Data Vault 2.0 is not only a scalable and flexible modeling technique, but a complete methodology to accomplish enterprise vision in Data Warehousing and Information Delivery by following an agile approach and focusing on business value. By using agile methods in data warehousing, the focus in projects can be on the business value and delivering useful products to the customer.

Automation Options for Data Vault

How Automation Tools Are Changing the Game

Quick takeaway: Automation for Data Vault spans multiple waves — from template-driven code generation that builds the vault structure to modern generative-AI tools that help discover the model itself and AI copilots that accelerate business-rule development. Combined, these approaches dramatically cut time-to-value, reduce errors, and let teams focus on the parts that actually create business impact.



Why automation matters for Data Vault

Data Vault was designed for change: it separates raw capture from business logic, records full history and provenance, and uses standardized patterns (hubs, links, satellites). Those same patterns make the model ideal for automation. Manual Data Vault development works, but it’s slow and error-prone — especially when you must onboard many sources, handle evolving business keys or prove lineage for audits and AI projects. Automation reduces repetitive work, enforces consistency, and lets your engineers and architects spend time on modelling decisions and business rules, not boilerplate SQL.

The evolution of automation — three waves

Automation for Data Vault didn’t appear overnight. Think about it in three waves:

  1. Manual coding era: Everything by hand — raw ingestion, keys, history tracking, and the transformations. Effective, but slow and brittle.
  2. Template-driven automation (first wave): Tools that generate physical vault objects and standard loading code from a defined model. They speed up delivery and cut repetitive errors.
  3. AI-driven automation (second wave) + AI copilots: Tools that assist or even automate the model discovery itself, and AI copilots that generate business logic or transformation code — moving humans from creators to reviewers.

Understanding these waves helps you choose the right mix — existing template tools remain valuable, while AI tools are rapidly becoming practical for model discovery and logic generation.

First-wave automation: structured, reliable, repeatable

The first-wave tools are the ones most teams have used for years. Their primary job is to take a model and generate the physical implementation and ETL/ELT pipelines. Key benefits:

  • Speed: Generating hub, link and satellite structures with standard loading patterns significantly reduces delivery time.
  • Consistency: Every table and load pattern follows the same, tested template — fewer bugs and easier maintenance.
  • Orchestration and operations: Many tools build pipelines, manage hash keys, and include scheduling and error handling.

These tools are excellent when you already have a trusted logical model and want to automate the “how” of implementation. They do not solve the “what” (the model discovery) — that still requires human analysis.

Second-wave automation: AI-assisted model discovery

The real shift happens when automation starts to help with — or take over — the model discovery process itself. Instead of hand-crafting hubs and links, generative AI platforms can scan source systems and metadata to propose an initial Raw Data Vault logical model. What does that look like in practice?

  • Source scanning: The tool ingests table/field metadata, sample values and constraints.
  • Entity discovery: It suggests candidate hubs (business entities) by grouping columns and identifying recurring patterns and unique keys.
  • Key recommendation: It proposes business-key candidates and highlights primary/unique candidates derived from the source.
  • Relationship discovery: It suggests link structures where keys appear together or where foreign-key relationships are inferred.
  • Satellite design hints: The AI may split attributes into satellites based on volatility, sensitivity (PII), or update patterns.

This capability moves the needle: modelers become reviewers and validators instead of building every piece from scratch. It accelerates onboarding of new sources and shortens the path to a working Raw Vault.

AI copilots for the Business Vault and transformations

While model discovery is one hard problem, translating business requirements into transformation logic is another. This is where AI copilots shine. Integrated into developer environments, they can:

  • Generate SQL transforms from plain-English requirements (e.g., “calculate monthly churn rate by customer segment”).
  • Create complex joins, window functions and aggregations that implement business rules.
  • Suggest test cases, edge-case handling and simple data quality checks.
  • Accelerate the creation of information-marts (star schemas) by scaffolding the necessary queries and documentation.

Important caveat: copilots are accelerators, not autopilots. Generated code still needs human review for correctness, performance and governance. But they massively reduce the repetitive cognitive load and let experienced engineers focus on validation and optimisation.

Putting the technologies together: a practical workflow

Here’s a pragmatic, step-by-step workflow that mixes first-wave tools and AI capabilities into a usable process:

  1. Connect an AI discovery tool to your sources. Let it propose hubs, links and satellites.
  2. Review and refine the AI-suggested model with domain experts — confirm business keys and entity definitions.
  3. Export logical model into a template-driven automation tool (ELT/DBT/Wherescape). Generate physical tables, load patterns and orchestration pipelines.
  4. Use AI copilots to implement Business Vault logic and information-marts — write high-level requirements and have the copilot scaffold the SQL/Python transform code.
  5. Run tests and checks: automated unit tests, data quality checks and lineage validation.
  6. Deploy and monitor: schedule pipelines, monitor failures and feed back findings into the model or automation templates.

This end-to-end process reduces the time spent in data plumbing and increases time spent on business validation and value delivery.

Governance, auditability and human-in-the-loop

Automation is powerful, but it must sit inside proper governance. Because Data Vault is often used for regulatory and audit-sensitive environments, keep these guardrails in place:

  • Human review points: AI should suggest, not decide. Model approvals and business-key selection must be explicit sign-offs by domain owners.
  • Lineage and provenance: Ensure automation tools emit metadata and lineage so every generated artifact is traceable back to sources and the AI suggestions that influenced it.
  • Testing and validation: Automatically generate tests for any AI-generated transformation and fail deployments until tests pass.
  • CI/CD and version control: Keep generated models and transformations in version control so you can audit changes over time.

When these controls exist, you get the speed of automation without sacrificing compliance or trust.

When automation is the right move — and when to hold back

Automation fits particularly well when:

  • Your landscape includes many sources and you expect change.
  • You need fast onboarding (M&A or rapid product expansion).
  • Traceability and auditability are core requirements.
  • You want to reduce repetitive developer work and scale the team’s output.

Consider holding off or using a hybrid approach when:

  • Your environment is tiny and unlikely to change — heavy automation may be overkill.
  • Source data semantics are ambiguous and require deep domain expertise that AI cannot infer reliably.
  • You lack governance and testing practices to safely validate generated models and code.

Risks, limitations and best practices

Generative AI is not perfect: it can hallucinate or misinterpret faint signals in metadata. Best practices to mitigate risk include:

  • Always pair AI output with domain validation. Treat AI suggestions as draft artefacts, not final products.
  • Enforce tests: Automated unit tests and data quality checks should gate deployment.
  • Keep humans in the loop: Use model reviewers, not model builders — domain experts must accept or correct AI outputs.
  • Capture metadata: Store which AI model/version produced which suggestion for future audits.

Final thoughts — how to get started

If you’re curious about Data Vault automation, start small: pick one source or one high-value report and run it through an AI-assisted discovery + template automation pipeline. Measure case outcomes: time saved, fewer errors, and the number of iterations required to reach stakeholder approval. Use these metrics to build a business case and expand automation incrementally.

Automation won’t replace thoughtful modelling and governance, but used correctly it turns weeks of repetitive engineering into hours and lets teams focus on the decisions that move the business forward.

Watch the Video

Close Menu