Skip to main content
search
0

How to Explain Data Vault to Business Users?

How to Explaining Data Vault

When introducing Data Vault to business users, it’s important to communicate its value in a way that resonates with them. Instead of focusing on the technical details, it’s best to highlight the benefits and business impact.



Understanding the Audience

Business users, including executives and commercial leaders, generally don’t need to understand the intricacies of Data Vault. They are more interested in the outcomes, such as data accessibility, security, and adaptability.

Explaining Data Vault with a Simple Analogy

Imagine you are building a house. As a homeowner, you don’t need to know every construction detail; you just want to ensure that it’s solid, safe, and has all the necessary features. Similarly, business leaders don’t need to understand the technical framework of Data Vault—they just want a reliable data management system that supports their decision-making.

Focusing on Business Value

Instead of using the term “Data Vault,” it’s often more effective to discuss the advantages of a managed data platform:

  • Data Security & Privacy: Ensures compliance with regulations like GDPR while securing sensitive information.
  • Data Integration: Consolidates structured, semi-structured, and unstructured data from multiple sources.
  • Auditability & Transparency: Provides full data lineage, ensuring that every data point can be traced back to its source.
  • Agile Data Delivery: Enables incremental delivery of insights, so business teams don’t have to wait months or years to see results.
  • Adaptability to Change: Easily adjusts to changes in source systems, business logic, and reporting needs.
  • Handling Multiple Business Timelines: Supports complex business requirements, including postdating and backdating of records.
  • Scalability: Handles large datasets and high-speed data processing.

Delivering Business Outcomes

Business leaders care about measurable outcomes. With a Data Vault-based platform, they can expect:

  • Improved decision-making with accurate, timely data.
  • Faster adaptation to market changes and customer demands.
  • Cost efficiency through a structured yet flexible data architecture.
  • Enhanced reporting and analytics with reliable data sources.

Making the Pitch to Business Users

When discussing Data Vault with executives, avoid technical jargon and focus on business goals. Instead of saying, “We use Data Vault 2.0 for data modeling,” say, “We have a data platform that ensures secure, auditable, and easily accessible insights to drive your business forward.”

By emphasizing real-world benefits, you can effectively communicate the value of Data Vault without overwhelming non-technical stakeholders with complexity.

Conclusion

Communicating the benefits of Data Vault to business users requires a shift from technical explanations to business value discussions. By framing the conversation around security, agility, and data-driven decision-making, you can successfully gain buy-in from stakeholders and demonstrate the impact of a well-managed data platform.

Watch the Video

Use Data Vault 2.0 to Tackle GDPR

Why use Data Vault 2.0 to Tackle GDPR?

Today, we explore how Data Vault 2.0 can be a powerful tool for addressing the challenges posed by the General Data Protection Regulation (GDPR). GDPR requires organizations to protect the personal data of European citizens and grants individuals the “right to be forgotten”. This article outlines how Data Vault 2.0 can simplify compliance with GDPR while maintaining the integrity of your data warehouse.



Understanding GDPR and its Challenges

GDPR, implemented in 2018 by the European Union, sets strict rules for handling personal data. One key aspect is the right to be forgotten, allowing individuals to request the deletion of their personal information from an organization’s systems. For data warehousing and analytics, this can be particularly challenging as organizations often need to retain some data for analytical purposes while complying with GDPR’s deletion requirements.

The Data Vault 2.0 Approach to GDPR

Data Vault 2.0 provides a structured way to tackle GDPR compliance through its unique data modeling techniques. At its core, Data Vault separates data into three main components: Hubs, Links, and Satellites. Satellites are used to store descriptive attributes of business keys, and with GDPR, we can utilize a method called Satellite Splits to manage personal and non-personal data effectively.

Satellite Splits

Satellite splits involve creating separate Satellites for personal and non-personal data. For example:

  • Personal Satellite: Contains personal information such as names, addresses, and email addresses. This data must be deleted if a customer exercises their right to be forgotten.
  • Non-Personal Satellite: Stores non-identifiable data such as regions or generated technical data, which can be retained for analytics even after personal data is removed.

When a deletion request is received, you can simply delete the records from the Personal Satellite while retaining the non-personal data for analytical use. This ensures compliance with GDPR while preserving valuable business insights.

Addressing Privacy-Relevant Business Keys

One of the challenges with GDPR is managing business keys that are tied to personal data, such as social security numbers. If such keys are used in Hubs, deleting personal data becomes complicated. Here’s how Data Vault 2.0 handles this:

Using Artificial Hubs

To avoid using personal attributes as business keys, Data Vault 2.0 introduces artificial Hubs. These Hubs assign unique, non-identifiable numbers to replace personal identifiers. For example:

  • An artificial Hub might contain a generated number for each customer’s car insurance data.
  • A Link connects the artificial Hub to the personal data stored in a Satellite.

When a customer requests deletion, you delete the connection between the personal identifier and the artificial number in the Link. The artificial Hub remains intact, allowing you to retain non-personal data for analytics without risking re-identification.

Best Practices for Implementing GDPR with Data Vault 2.0

  • Avoid Personal Identifiers as Business Keys: Always opt for non-personal or artificial identifiers wherever possible to simplify the model.
  • Use Randomized Identifiers: Generate UUIDs or random sequence numbers to prevent reverse-engineering personal data.
  • Collaborate with Legal Teams: Work closely with legal experts to define which data can be retained and which must be deleted under GDPR.

By adhering to these practices, organizations can create a robust Data Vault model that simplifies GDPR compliance while maintaining data integrity and analytics capabilities.

Conclusion

Data Vault 2.0 offers a flexible and efficient approach to tackling GDPR challenges. By leveraging Satellite splits and artificial Hubs, organizations can balance regulatory compliance with business needs. While managing GDPR compliance may seem complex at first, the structured approach of Data Vault 2.0 ensures that your data remains both secure and useful.

For further learning, join the Data Vault Innovators Community or participate in Data Vault Fridays hosted by Scalefree. These resources provide valuable insights and opportunities to explore topics like GDPR, data warehousing, and more.

Watch the Video

Creating Data Vault Stages

Data Vault Stages

The Data Vault methodology provides a robust framework for managing and organizing enterprise data. One of the foundational components of a Data Vault is the stage. In this guide, we’ll explore what Data Vault stages are, their importance, and how to create them effectively.



Understanding Node Types in Data Vault

Before diving into stages, let’s review the key node types in a Data Vault:

  • Stages: Temporary storage areas where raw data is preprocessed.
  • Hubs: Central entities containing unique business keys.
  • Links: Relationships between hubs.
  • Satellites: Contextual and descriptive data for hubs and links.
  • PITs (Point-in-Time Tables): Optimized query performance tools.
  • Snapshot Tables: Historical states of data.
  • Non-Historized Links & Satellites: Used when historical tracking isn’t required.
  • Multi-Active Satellites: Support multiple active records for the same key.
  • Record Tracking Satellites: Track changes and versions of records.

Features of Data Vault Patterns

The Data Vault methodology leverages years of practical experience to deliver several key features:

  • Patterns Based on Expertise: Proven methods for efficient loading and processing.
  • Multi-Batch Processing: Handle multiple data batches simultaneously.
  • Automatic PIT Cleanup: Uses logarithmic snapshot logic for optimal performance.
  • Virtual Load End-Date: Allows insert-only processes by using calculated end dates.
  • Automated Ghost Records: Simplifies handling of missing or incomplete data.

Why Are Stages Important in Data Vault?

Stages play a critical role in the Data Vault architecture by enabling efficient data preparation and ensuring data integrity. Key benefits include:

  • Hash Keys & Hash Diffs: Ensures unique identifiers for data integration and deduplication.
  • Load Date & Record Source: Tracks the origin and timing of data entries.
  • Prejoins: Combines data efficiently before entering the vault.
  • Hard Rules: Implements strict validation and transformation logic.

How to Create a Data Vault Stage

Creating a stage in a Data Vault involves leveraging the right tools and techniques. For this, we recommend using Datavault4Coalesce, a powerful platform designed for Data Vault implementation. This tool simplifies the process by automating key tasks and ensuring best practices are followed.

Conclusion

Stages are a foundational component of the Data Vault methodology, enabling seamless data preparation and integration. By understanding their role and leveraging the right tools, you can ensure the success of your Data Vault implementation.

Watch the Video

Using PIT and Bridge Tables in Business Vault Entities

Watch the Video

PIT and Bridge Tables

In this blog post, we will answer a commonly asked question regarding PIT and Bridge Tables:

In the Data Vault architecture, is it okay to use/reuse created PIT and Bridge tables in the code of the Business Vault business rules?

The short answer is yes, but let’s dive into the details to understand the rationale and how PIT (Point-In-Time) tables and Bridge tables work in the context of Business Vault entities.



Understanding PIT and Bridge Tables

Before explaining their usage, let’s quickly clarify what PIT and Bridge tables are in the Data Vault architecture:

  • PIT Tables: These provide a snapshot of data for a specific point in time. They help combine deltas and descriptive data to enable calculations or business logic that requires a specific snapshot.
  • Bridge Tables: These are primarily used to resolve many-to-many relationships and improve query performance when dealing with large datasets.

Applying Business Logic in Business Vault

In the Data Vault, data flows from the Raw Data Vault (RDV) to the Business Vault (BV) and finally to the Information Mart (IM). The key difference lies in the granularity of data:

  • Load Date: In the Raw Data Vault, data batches are identified by a load date, which represents when the data was ingested.
  • Snapshot Date: In the Information Mart, data is often presented as snapshots, where each snapshot represents the data at a specific point in time.

Now, the Business Vault sits between the Raw Data Vault and Information Marts. When applying business rules in the BV, there are two major types of granularities to consider:

1. Granularity Based on Incoming Deltas

In this case, business logic is applied to all incoming deltas identified by the load date. For example, cleansing phone numbers is a typical use case where every delta (update) must be processed, even if only the latest version is needed in the end.

The resulting data is stored in a computed Satellite in the Business Vault. The primary key remains the hash key of the parent entity and the load date.

2. Granularity Based on Snapshot Date

Some business logic requires calculations for specific points in time. For example, calculating the lifetime value of a customer:

  • The lifetime value increases when a customer makes a purchase.
  • The lifetime value decreases incrementally if no purchases are made over time.

In this scenario, even when no new delta is coming in, the value must still be recalculated daily. This granularity aligns with the snapshot date, which is already defined in the PIT table. By leveraging the PIT table, you can calculate and store the lifetime value in a computed Satellite with a primary key of the parent hash key and snapshot date.

Reusing PIT Tables

When switching from load date (deltas) to snapshot date (snapshots), PIT tables play a crucial role:

  • PIT tables help join descriptive data from Satellites to provide a snapshot-based view of the data.
  • They allow business rules to be applied to outgoing information granularity (snapshot date).

For example, if you want to calculate a specific measure, such as a customer’s lifetime value, the PIT table provides the granularity needed to compute the values for every day, hour, or minute, depending on your requirements.

Reusing Bridge Tables

Bridge tables can also be reused in Business Vault entities but with one key consideration:

Avoid loading one Bridge Table from another Bridge Table.

Why? Cascading Bridge Tables can lead to sequential dependencies, which hinder parallelization. Parallel processing is essential for performance, especially in high-volume environments. To work around this limitation, use Computed Aggregate Links.

What Is a Computed Aggregate Link?

A Computed Aggregate Link is essentially a Link with pre-computed aggregations. This concept is described in the Data Vault methodology and allows you to reuse aggregations efficiently without chaining Bridge Tables together.

For example, if you want to calculate a new measure based on facts stored in a Bridge Table:

  • Use the Bridge Table as the FROM source for a computed Satellite.
  • Attach the new measure to the Bridge Table as part of the Business Vault entity.

This approach avoids cascading dependencies while allowing you to extend facts or perform complex calculations.

Best Practices Recap for PIT and Bridge Tables

Here are the key takeaways for using PIT and Bridge tables in Business Vault entities:

  • Yes, you can reuse PIT tables: They are commonly used to provide snapshot granularity for computed Satellites.
  • Yes, you can reuse Bridge tables: Use them carefully to avoid cascading dependencies.
  • Use Computed Aggregate Links: When you need to extend a Bridge Table, this is the recommended approach to maintain efficiency and parallelization.
  • Granularity switch: Be mindful of the transition from load date (delta-driven) to snapshot date (snapshot-driven) when applying business logic.

Summary

In summary, PIT and Bridge tables are powerful tools in the Data Vault architecture, especially within the Business Vault. They enable complex business logic, such as snapshot-based calculations, while maintaining efficiency and performance. By adhering to best practices like avoiding cascading Bridge Table loads, you can ensure your implementation remains scalable and robust.

Building Responsible AI Systems Under the EU AI Act

EU AI Act Responsible Systems

The EU Artificial Intelligence (AI) Act represents a significant step forward in regulating AI technologies across the European Union. Its purpose is to establish a unified legal framework, ensuring human rights protection, safety, and the ethical use of AI, while fostering innovation and accountability. With its phased implementation starting in 2024, the Act brings major changes to how AI systems are designed, deployed, and monitored.



Overview of the EU AI Act

The EU AI Act aims to:

  • Establish a unified legal framework for AI across the EU.
  • Protect human rights and ensure safety.
  • Prohibit harmful and unethical uses of AI.
  • Promote transparency and accountability in AI systems.
  • Foster innovation and technological growth.

Timeline for Implementation

The Act includes specific deadlines for compliance:

  • August 2024: Prohibited AI practices must stop immediately.
  • August 2025: Transparency rules for general-purpose AI, including content labeling, take effect.
  • August 2026: High-risk AI regulations, such as those in healthcare, become enforceable with strict data quality standards.

Why This Matters

AI adoption is growing rapidly, with 42% of organizations utilizing AI in 2023—a 7% increase from 2022. The EU AI Act not only imposes penalties of up to 7% of global turnover for non-compliance but also reflects a societal responsibility to use AI ethically, addressing inequalities and safeguarding future generations.

The Risk-Based Approach

The EU AI Act categorizes AI systems into four risk levels:

  • Unacceptable Risk: Prohibited under Article 5.
  • High Risk: Strict regulation and obligations under Articles 6-51.
  • Limited Risk: Providers regulated under Articles 52a-52e.
  • Minimal Risk: Subject to transparency obligations under Article 52.

Key Principles of Responsible AI

Building responsible AI systems involves adhering to several key principles:

  • Explainability: AI models should be transparent and easy to understand.
  • Bias & Fairness: Detect and mitigate biases to ensure equitable outcomes.
  • Accountability: Define responsibilities for AI outcomes clearly.
  • Data Suitability: Use appropriate, high-quality data in compliance with regulations.
  • Monitoring: Continuously track AI performance to ensure reliability.
  • Transparency: Disclose system functionalities clearly and provide user mechanisms for feedback.
  • Auditability: Maintain detailed logs of algorithms, datasets, and configurations.

Steps to Build Responsible AI Systems

Organizations can prepare for compliance and ethical AI usage through the following steps:

  • Implement scalable AI services.
  • Develop predictive reporting mechanisms.
  • Establish robust governance frameworks.
  • Leverage tools and platforms for AI development.
  • Ensure data suitability and compliance.

AI Marts: Enabling AI Act Compliance

Traditional machine learning workflows without centralized data management can lead to feature inconsistencies, operational complexity, and compliance issues. AI Marts address these challenges by providing:

  • Centralized feature management.
  • Integration of feature engineering into workflows and pipelines.
  • Metadata and version control.
  • Scalable feature serving across targets.
  • Comprehensive logs for governance and auditing.

Benefits: AI Marts enhance data governance and security, serving as a critical step towards compliance with the EU AI Act.

Conclusion

As AI adoption grows, compliance with the EU AI Act is essential for organizations aiming to use AI responsibly. By implementing risk-based strategies, embracing transparency, and leveraging tools like AI Marts, companies can align with regulatory requirements while fostering trust and innovation.

Watch the Video

Flow.BI: Generating the Raw Data Vault Using AI and dbt

Flow.Bi architecture Data Warehouse

Raw Data Vault

This question might be as old as the Data Vault: Can it generate the Raw Data Vault using artificial intelligence (AI)? Until recently, the prevailing expectation in the industry was that an AI, if ever existing, might only be able to assist the data modeler, for example, by identifying and suggesting business keys or modeling parts of the model.

The question arises out of need: in the past, data volume and shape have risen exponentially. And there is no sign that it should flatten out in the future. But who should analyze all the data required for today’s data platforms? We already have a shortage of qualified data engineers. And this situation will only become worse in the future because university students don’t rise exponentially.

Generating the Raw Data Vault Using Flow.BI and dbt

We present the integration between datavault4dbt and Flow.BI which is used to generate the Raw Data Vault. Flow.BI is an artificial intelligence capable of defining all the hubs, links, and satellites for enterprise data. This includes determining the business keys and special entity types such as effectivity satellites, hierarchical links, multi-active satellites, non-historized links, and reference tables. The work-sharing is simple: the advanced AI of Flow.BI defines the Raw Data Vault with all required entities, and datavault4dbt generates the code, including CREATE TABLE and INSERT INTO statements for both the model and the Raw Data Vault loading procedures.

Watch Webinar Recording

Exceeding Expectations with Generative AI

With the release of Flow.BI, the expectation that only an assisting AI is possible has been exceeded. Flow.BI is a generative AI that fully “defines” the Raw Data Vault, including:

  • Hubs and their business keys
  • Links, including hierarchical links
  • Satellites, including the satellite splits for privacy, security and rate-of-change
  • Non-historized links and their satellites
  • Effectivity satellites and multi-active satellites
  • Reference hubs and their satellites

Model Structure and Load Definitions

In addition to the model structure, the advanced AI of Flow.BI also defines the load definitions, that is, the definition from where a hub’s business key or satellite’s attribute is loaded. This is later used to generate the INSERT INTO statements.

Because it is all done by the AI, it works at scale and can quickly generate Raw Data Vault models with thousands of entities.

Simplified Modeling with Flow.BI

Flow.BI drastically simplifies the modeling approach: all the user has to do is attach data sources to Flow.BI for analysis and profiling, hit the red button and Flow.BI defines at least a valid model. To achieve this, the solution identifies the concepts in the entities of the data sources first and then the business keys for those concepts. Next, it identifies the relationships and processes the descriptive data attributes into satellites. For the last finishing, the integrated natural language generator adds the entities’ documentation, attributes, and load definitions.

Users can improve the target model by adjusting the identified concepts and the rules for privacy, security, and satellite splits. 

A Data-Driven Approach

The model defined by Flow.BI follows a data-driven approach and, therefore, aligns with the teaching of Scalefree’s Data Vault training

In a data-driven approach, the Data Vault model is “modeling the raw data as the business uses it.” That means the focus is clearly on the raw data, but business keys (“as the business is using it”) integrate data across multiple data sources. 

The idea behind this is that business keys are often shared keys that exist in multiple source systems and can, therefore, be used for integration purposes. 

Raw Data Vault vs. Business Logic

Business logic, such as WHERE conditions or conditional logic, has no place in a data-driven Raw Data Vault. The Business Vault aims to extend the Raw Data Vault by business logic. 

Therefore, the defined model doesn’t contain any business logic, which must be added later in the Business Vault. However, the Raw Data Vault model is an integrated enterprise data model that spans all attached data sources.

Metadata for Advanced Data Warehousing

The metadata produced by Flow.BI can be used not only for generating the Raw Data Vault but also for the staging area, either on a relational database or a data lake. 

Flow.BI doesn’t generate the Raw Data Vault alone but defines it. That means it only indicates which hubs, links, and satellites should exist to capture the data from the source systems. But it doesn’t generate the CREATE TABLE and INSERT INTO statements for the physical model. Instead, it relies on tools such as dbt via the datavault4dbt package to generate the code. Flow.BI hands over the metadata of the defined model, and datavault4dbt generates the actual code. 

With this in mind, Flow.BI is a teammate who analyzes and profiles the data sources, knows how to model the Raw Data Vault, and ingests the metadata into dbt’s SQL models.

User Control Over AI Modeling

Flow.BI is imitating the human data modeler. 

But does this mean that the user has lost control over the AI? No. There are many options to influence Flow.BI’s AI to produce a “better” target model. But first, what “better” means should be defined: fewer entities in the target model? Faster queries? Faster loading? Depending on the goals, Flow.BI can be influenced, for example, by the concept classification and how source data is presented to Flow.BI.

Integration with Data Warehouse Automation

Once the metadata for the Raw Data Vault has been defined, it is handed over to the data warehouse automation (DWA) solution. There are many solutions available, but a popular option is dbt. Scalefree has developed the open-source dbt package datavault4dbt, which enjoys growing popularity in the industry. 

To generate SQL models for the dbt package, the integration between Flow.BI and datavault4dbt leverages TurboVault, another open-source package by Scalefree. TurboVault is a graphical user interface that sets up the metadata for a Raw Data Vault to be generated with datavault4dbt.

Once the metadata for the Raw Data Vault has been defined, it is handed over to the data warehouse automation (DWA) solution. There are many solutions available, but a popular option is dbt. Scalefree has developed the open-source dbt package datavault4dbt, which enjoys growing popularity in the industry. 

To generate SQL models for the dbt package, the integration between Flow.BI and datavault4dbt leverages TurboVault, another open-source package by Scalefree. TurboVault is a graphical user interface that sets up the metadata for a Raw Data Vault to be generated with datavault4dbt.

The best option was to ingest Flow.BI’s metadata into TurboVault to leverage its capabilities. Once Flow.BI’s metadata has been loaded into TurboVault, TurboVault generates the SQL models for datavault4dbt, which, in turn, generates the Data Vault entities and the loading procedures.

To facilitate the integration between Flow.BI and TurboVault, Scalefree has now released a dedicated Flow.BI connector that retrieves and transforms the logical data models from Flow.BI and enables the automated generation of your Data Vault.

Value Proposition of Flow.BI

Flow.BI offers many values: first, the price. Defining the model using AI is much more cost-effective than manually setting up the metadata for datavault4dbt. Another problem is that Data Vault experts are a scarce resource and not widely available, especially when quality is an essential factor, which always should be the case when dealing with enterprise data.

Another value of Flow.BI is the agility: instead of defining the Raw Data Vault in months and years, Flow.BI’s advanced AI calculates the Raw Data Vault model within minutes and hours.

This also reduces the project risk: what if the Data Vault experts produce a low-quality (or even invalid) model after years of working and millions of Euros spent? Having another attempt at modeling is often unrealistic.

But with Flow.BI, results are close to immediate; if they are unsatisfactory, the manual alternative is still available.

Therefore, the best option to get started is to contact us for a proof of concept or workshop on Flow.BI.

Maintaining the Hash Diff

Watch the Video

The Problem

Adopting a thoughtful approach to Hash Diff calculation can minimize manual maintenance, ensure data consistency, and optimize storage. Our question comes from a project where the source system occasionally delivers new columns for existing tables. When these columns are added to a satellite, the hash difference (hashdiff) calculation changes. As a result, new deltas are generated for all business keys during the next load—even if the actual data hasn’t changed. The manual recalculation of hashdiffs for historical records is time-consuming and prone to errors. Can this be avoided?



Understanding Hash Diff Changes

The hashdiff is a critical component in a Data Vault model, used to detect changes in descriptive attributes. Adding a new column changes the hashdiff logic, potentially creating unnecessary deltas, which consume additional storage and complicate data integrity checks. Let’s break this down:

  • When a new column is introduced, historical records often have NULL values for that column.
  • The updated hashdiff logic incorporates the new column, even if its value doesn’t contribute to meaningful changes.
  • This can result in false positives—new records that aren’t genuinely different.

Potential Solutions

There are several strategies to handle this scenario, each with varying levels of manual effort and maintenance:

1. Recalculating the Hash Diff Manually

One approach is to manually recalculate the hashdiff for all existing records. While effective, this method requires significant effort and is not scalable for large datasets. Additionally, updating historical records can disrupt the auditability of your Data Vault.

2. Minimizing Updates with Targeted Recalculation

A more focused strategy is to update only the current records in the satellite (those with an open-ended load date). These records are actively used for comparisons and would benefit most from updated hashdiffs. While this reduces the number of updates, it still involves manual intervention.

3. Ensuring Hash Diff Consistency Automatically

The most efficient solution is to design the hashdiff calculation to remain consistent, even when structural changes occur:

  • **Add Columns Only at the End:** Ensure new columns are appended to the end of the table structure.
  • **Ignore Trailing Nulls:** Use a function like RTRIM to remove trailing delimiters caused by NULL values. This keeps the hashdiff consistent when new columns are empty for historical records.

This approach eliminates the need for manual updates, provided that all structural changes adhere to these guidelines.

Practical Example

Consider a satellite linked to a company hub, containing records for a company’s name and address. Initially, the hashdiff calculation includes only the company name and address. When a new column, postal code, is added:

  • Historical records will have NULL values for postal code.
  • Using the RTRIM function ensures that the new column does not affect the hashdiff for these records.

This prevents unnecessary deltas, saving storage space and reducing maintenance overhead.

Handling Hash Diff Duplicates

Another question we received involved handling hard duplicates—records that are identical in every aspect, including hashdiff values. The recommended approach is to:

  • Move such duplicates into an Error Mart for auditability.
  • Fix pipeline issues if duplicates are caused by ingestion errors.
  • For soft duplicates (e.g., intraday changes), manipulate the load timestamp by adding microseconds based on sequence IDs to ensure unique records.

Conclusion

By adopting a thoughtful approach to hashdiff calculation, you can minimize manual maintenance, ensure data consistency, and optimize storage in your Data Vault model. Whether you choose to recalculate selectively or implement hashdiff logic that handles changes automatically, the goal is the same: maintain the integrity of your data warehouse while reducing unnecessary effort.

Semantic Models and Metrics

Unlocking Analytics with Semantic Models and Metrics

A semantic model is a layer of abstraction that defines business-friendly terms and metrics on top of raw or transformed data. It bridges the gap between data transformations and end-user reporting, ensuring accuracy, consistency, and clarity across analytics tools. By providing a unified way to define and calculate key metrics, semantic models empower businesses with reusability and precision in reporting.



Understanding Semantic Models

Semantic models form the foundation of the dbt Semantic Layer. Configured using YAML files, they correspond to dbt models in your DAG. Each model requires a unique YAML configuration, enabling dynamic and reliable dataset refinement. You can even create multiple semantic models from a single dbt model, provided each has a distinct name.
These models comprise three key components:

  • Entities: Define relationships between semantic models (e.g., IDs).
  • Dimensions: Columns used for slicing, grouping, and filtering data (e.g., timestamps, categories).
  • Measures: Quantitative values aggregated in analyses.

Diving into Metrics

Metrics are calculations representing essential business measures, built from entities, measures, and dimensions. They ensure centralized definitions, reusability across tools, and consistency in analysis. Metrics encapsulate both logic (e.g., aggregations, filters) and context (e.g., time granularity, dimensions).
Types of metrics include:

  • Conversion Metrics: Track events like purchases per user.
  • Cumulative Metrics: Aggregate measures over specified windows.
  • Derived Metrics: Expressions combining multiple metrics.
  • Ratio Metrics: Comparisons of numerator and denominator metrics.
  • Simple Metrics: Directly reference a single measure.

Commanding Metrics with dbt

dbt Cloud CLI provides MetricFlow commands to interact with the semantic layer. For instance, dbt sl query executes queries and validates metrics, while dbt sl list dimensions retrieves dimensions for specific metrics. These tools streamline metric management and ensure robust analytics workflows.

Semantic models and metrics are vital for bridging data transformations and actionable insights. They provide a foundation for scalable, consistent, and reusable analytics frameworks, enabling businesses to thrive in data-driven environments.

Watch the Video

Delta Lake vs Data Vault

How does Data Vault add value when we have the Delta Lake?

In the world of modern data management, businesses often find themselves navigating a maze of tools, architectures, and methodologies to meet their ever-evolving data needs. Among the popular approaches are Delta Lake and Data Vault. While both have their strengths, it’s important to understand how they complement each other and why Data Vault can be a game-changer even when you’re leveraging Delta Lake.



Understanding Delta Lake

Delta Lake is an open-source storage layer that brings reliability to data lakes. Built on top of Parquet files, it provides ACID transactions, schema enforcement, and the ability to handle incremental data changes. It’s a robust foundation for modern data warehouses and data lakes, especially when using tools like Databricks.
However, Delta Lake primarily focuses on managing data storage and changes. It doesn’t inherently bridge the gap between raw source data and the business-ready reports and dashboards that users demand.

Enter Data Vault: Bridging the Gap

Data Vault is a modeling approach designed to address the disconnect between raw data and user needs. While Delta Lake handles data storage efficiently, Data Vault focuses on the *why* and *how* of transforming that data into actionable insights. Here’s where Data Vault excels:

  • Data Modeling: Data Vault organizes data into Hubs, Links, and Satellites, ensuring a flexible and scalable structure. Hubs capture business keys, Links handle relationships, and Satellites store descriptive data.
  • Data Integration: It helps integrate disparate data sources into a unified model that reflects the business context.
  • Change Tracking: While Delta Lake tracks changes at the file or record level, Data Vault optimizes this by capturing deltas more efficiently, especially when splitting data into specialized Satellites.
  • Target-Oriented Design: Data Vault focuses on producing business-ready data models like star schemas, flat tables, or dashboards, rather than being a consumption model itself.

Performance Challenges and Solutions

A frequent criticism of Data Vault on Delta Lake revolves around query performance, particularly due to the columnar storage of Parquet files. Joins can be slow, but this is more a characteristic of the storage mechanism than the modeling technique. Here are some strategies to address this:

  • Denormalization: Flattening data into wide tables eliminates the need for joins, resulting in faster query performance.
  • Materialized Views: Creating materialized Parquet views for end-user consumption ensures high performance without impacting upstream processes.
  • Optimized Storage: Use technologies like Iceberg or Delta tables for Hubs and Links, and consider presenting Satellites as views to minimize storage overhead.
  • Incremental Load: Design systems to handle insert-only incremental loads, reducing the complexity of updates and deletes.

Why Business Users Love Data Vault (Even If They Don’t Know It)

The ultimate goal of any data architecture is to serve business users. Reports, dashboards, and analytics are the end-products they care about. Data Vault excels here by enabling the creation of robust information models that align with user requirements:

  • Flexibility: Business rules can be implemented on top of the Data Vault model to derive the desired target model.
  • Scalability: Large data flows can be broken down into manageable pieces, making the system easier to maintain.
  • Agility: Changes in business requirements can be accommodated without overhauling the entire model.

Delta Lake and Data Vault: Better Together

Rather than viewing Delta Lake and Data Vault as competing approaches, think of them as complementary. Delta Lake provides the foundation for reliable data storage and change tracking, while Data Vault transforms this raw data into meaningful, business-ready formats.
For example, Delta Lake can serve as the staging or landing zone, where raw data is ingested and stored. Data Vault then takes over to model this data into Hubs, Links, and Satellites, preparing it for business consumption. The combination ensures both robust data management and the flexibility to meet diverse analytical needs.

Final Thoughts

Data Vault is a powerful methodology for bridging the gap between raw data and actionable insights. Even in environments that leverage Delta Lake, Data Vault adds value by providing a scalable, user-focused approach to data modeling. By combining the strengths of these two technologies, organizations can achieve both reliability and agility in their data architectures.
As with any tool or methodology, the key is to tailor the implementation to your specific needs, ensuring that both performance and usability are optimized. Whether you’re dealing with Databricks, Parquet, or other tools, Data Vault provides the flexibility and structure to deliver what matters most: business value.

Watch the Video

Working With Semi-structured Data

Mastering Semi-Structured Data: Key Approaches and Best Practices

Semi-structured data, such as JSON, is increasingly common in modern data ecosystems. But how should you store and handle it? Should you store the data as-is or flatten its structure? Both approaches have unique advantages and limitations, and understanding these can help you make informed decisions based on your use cases.



Key Considerations

  • Expected Data Structure: Is the schema likely to change? Are nested objects (hierarchies) present?
  • Velocity & Size: How large and fast-moving is your data?
  • Database Capabilities: Does your system support efficient queries and manage large datasets?
  • Use Cases: What operations will you perform on the data?

Approach 1: Store Data As-Is

This method involves storing the data in its original format. It’s ideal for flexibility but has limitations:

  • Pros: Quick to ingest, accommodates changing schemas, suitable for unknown operations.
  • Cons: Struggles with large files and nested queries.

Approach 2: Flatten Nested Structures

Flattening the structure simplifies data querying and scalability. However, it also has trade-offs:

  • Pros: Easy querying, no file size constraints, better for fixed schemas.
  • Cons: Complexity in handling hierarchies, loss of schema flexibility.

Data Vault Modeling: A Flexible Solution

Data Vault modeling supports both approaches:

  • Storing As-Is: Store files as non-historized links or satellites, keeping the original file in a single column. Virtual structures can be built on top.
  • Flattening Before Loading: Create standard Data Vault entities while storing the original files in a Data Lake for reference.

Choosing the right strategy depends on your operational needs and database capabilities. By considering these factors, you can efficiently work with semi-structured data while optimizing performance and flexibility.

Watch the Video

Salesforce Account Engagement and Domain Management

Domain Management in Salesforce Pardot

Domain Management Within Salesforce Account Engagement

In this video guide, we explore the essential components of domain management within Salesforce Account Engagement, highlighting their critical roles and why they are vital for your email marketing strategy.

Additionally, the guide underscores the significance of mastering three key aspects: (1) email sending domains, (2) tracker domains, and (3) tracking code. By understanding and implementing these elements, you can significantly enhance your email deliverability, security, and overall campaign performance.

This guide is designed for Salesforce marketers, administrators, and key users who are working or planning to work with Salesforce Account Engagement to optimize their domain management practices.

Domain Management in Account Engagement

Domain management in Salesforce Account Engagement involves the configuration and maintenance of domains used for email marketing and tracking purposes.

This includes authenticating email sending domains to ensure emails are properly delivered and not flagged as spam, setting up custom tracker domains to maintain brand consistency and improve deliverability, and managing tracking codes to accurately monitor and analyze user interactions.

Effective domain management enhances email security, optimizes deliverability rates, and ensures a professional and trustworthy user experience.

Key Takeaways

Discover the compelling reasons behind adopting four essential best practices for domain management in Salesforce Account Engagement. These practices, detailed in the video, play a critical role in enhancing email deliverability, security, and brand consistency.

By understanding and implementing these practices, users can optimize their email marketing strategy for sustained success:

    1. Align Return Path with Mail-from Address
    2. Enable HTTPS for Tracker Domains
    3. Monitor Domain Reputation
    4. Use Custom Tracking Domains for Brand Consistency

Target Audience

Designed for Salesforce marketers, administrators, and users, this guide encourages the effective management of domains within Salesforce Account Engagement.

By highlighting the importance of key practices, the video aims to provide a deeper understanding of why mastering domain management is crucial for achieving optimal email deliverability, security, and campaign performance within the Salesforce environment.

Watch the Video

Custom Node Types in Coalesce

Custom Node Types in Coalesce: Unlocking Flexibility and Reusability

Nodes are the foundational building blocks in coalesce.io, serving as database objects like tables or views. Each node belongs to a specific type, equipped with a predefined user interface, a create template, and a run template. While coalesce.io provides four standard node types, custom node types allow users to adapt and extend these capabilities for unique requirements.



What Are Custom Node Types?

Custom node types enable users to define reusable database object patterns. By specifying a user interface (UI), Data Definition Language (DDL), and Data Manipulation Language (DML), users can create tailored solutions for patterns such as stages, dimensions, facts, hubs, and links. Parameters and macros make these custom types even more adaptable and reusable.

Why Create Custom Node Types?

Custom node types address two key needs:

  • Custom Needs: Standard node types may not cover specific use cases.
  • Reusability: Custom node types eliminate the redundancy of repeatedly creating similar nodes, saving time and effort.

Key Components of Custom Nodes

Node Definition and UI Configuration

The node definition specifies the UI elements, such as materialization selectors, toggles, dropdowns, and text boxes. These components define how users interact with and configure the custom node.

Create Template

The create template includes SQL logic for generating tables or views. It supports column transformations, comments, clustering keys, and all Snowflake DDL features.

Run Template

The run template defines DML operations, such as inserting data, applying incremental or merge strategies, and performing transformations. These operations are executed exclusively for table-based nodes and utilize all Snowflake DML features.

Get Started with Custom Node Types

Custom node types in coalesce.io empower teams to design reusable, scalable solutions tailored to specific needs. By leveraging their flexibility, you can streamline development, reduce repetitive tasks, and maximize efficiency in your data workflows.

Watch the Video

Close Menu