Skip to main content
search
0

Data Streaming in Snowflake

Data Streaming in Snowflake

In this newsletter, we’ll provide an overview of the possibilities for streaming data using Snowflake.

The approaches discussed here focus on how Snowflake’s features and capabilities enable ingestion and processing data streams with a certain speed. The goal is to explore how businesses can create an advantage by using real time data to make time critical decisions, improve operational efficiency, and enhance customer experiences. Additionally, we’ll examine Snowflake’s architecture to enable a reliable solutions base, its suitability for streaming workloads, and the challenges it addresses.

Data Streaming in Snowflake

This webinar on February 11th, 2025, 11 am CET, shows you how to speed up data-driven decisions using Snowflake’s real-time data capabilities. Learn to unify batch and streaming data pipelines to handle large volumes of data with Snowpipe Streaming, Hybrid Tables, and Dynamic Tables. Discover how to get low-latency insights and make faster decisions without affecting your existing processes. We’ll explore real-world examples of how others use real-time data for dashboards, customer experiences, and more, plus best practices for ensuring data quality and performance. Ideal for data engineers and architects, this session will show you how Snowflake can revolutionize your analytics and help you thrive in today’s fast-paced digital world.

Watch Webinar Recording

What to Expect

You will gain an overview of the various built-in tools and techniques Snowflake provides to handle real time data streaming, including data ingestion, processing, and querying capabilities. We’ll cover Snowflake’s architectural components, such as tables, views, and processing capabilities, and explore how they are used to managing transactional workloads.

The main focus of this blog is to provide insights on how Snowflake enables real time data processing, the opportunities it presents, and the trade-offs to consider when implementing streaming solutions.

Real Time Data – an Overview

Real time data enables businesses to process information immediately after it is generated. Unlike traditional batch processing, which works with fixed intervals, real time data flows continuously, allowing for dynamic and immediate actions. This type of data processing has become increasingly important for organizations seeking to stay competitive in a fast-moving market.

Key Characteristics of Real Time Data

Real time data has distinct characteristics that differentiate it from other types of data flows:

  • High Velocity: Data is processed at high speeds, often within milliseconds or seconds.
  • Low Latency: Systems are designed to minimize delays, ensuring timely access to insights.
  • Dynamic Streams: Continuous and often unpredictable data streams require flexible and scalable processing.

Each of these characteristics brings unique advantages and challenges to the table. For this reason, organizations need specialized tools and platforms to handle real time data effectively.

Why Real Time Data is Important

The value of real time data lies in its ability to provide immediate insights, which enable organizations to act without delay. It can significantly enhance several business areas, such as:

  • Enhanced Decision-Making: Real time insights enable proactive responses, such as dynamically adjusting pricing or inventory based on live demand.
  • Customer Personalization: By analyzing behavior in real time, companies can deliver tailored experiences that meet customer expectations in the moment.
  • Operational Efficiency: Constant monitoring helps organizations to detect and address issues early, reducing downtime and optimizing performance.
  • Risk and Fraud Detection: Rapid identification of anomalies or threats reduces exposure to fraud and operational risks.

The benefits of real time data are clear; however, its implementation comes with specific challenges, as detailed below.

Challenges with Real Time Data

While the advantages of real time data are significant, they are accompanied by a unique set of challenges:

  • Handling Velocity and Volume: Processing large streams of data at high speed requires scalable systems.
  • Ensuring Consistency: Maintaining data accuracy across distributed systems can be complex, especially when dealing with multiple sources.
  • Integration Complexity: Real time data must often be combined with batch systems or legacy analytics, which introduces technical complexities.

The continuous flow and immediacy of real time data make it indispensable for businesses aiming to stay ahead in their industries. With the right tools and architecture, organizations can unlock its potential to deliver significant value and competitive advantage.

Real Time Data in Snowflake

Snowflake offers a modern approach to handling real time data by combining its cloud-native architecture with features designed for continuous ingestion, processing, and querying. With built-in tools like Snowpipe, streams, and tasks, Snowflake enables organizations to integrate real time data. In this section, we are going to explore how Snowflake manages real time data and its possibilities to meet the demands of streaming workloads.

Snowflake’s Architecture for Real Time Data

Snowflake’s architecture combines shared-disk and shared-nothing concepts. It uses a central, cloud-based data repository (the “shared-disk” portion) while compute resources (virtual warehouses) scale independently and process data in parallel (the “shared-nothing” aspect).

  • Batch: Traditionally, data engineers batch large amounts of data into Snowflake using the COPY command or scheduled loads. The compute layer scales up for heavy loads and then scales back down for cost efficiency.
  • Real Time: Continuous ingestion focuses on small, frequent data arrivals. The architecture still benefits from the separation of storage and compute, allowing Snowflake to handle real time feeds without blocking batch workloads or analytics queries.

Micro-partitions

Snowflake automatically divides tables into micro-partitions, compact storage units that group data based on natural ordering or load patterns.

  • Batch: Micro-partitions shine when running large analytical queries, as Snowflake can prune out unnecessary partitions quickly, boosting performance for wide-ranging queries.
  • Real Time: Frequent data loads generate more micro-partitions. The fine-grained organization remains beneficial, but you need to plan for slightly more overhead in terms of partition management, particularly if your real time data streams produce extremely high volumes.

Dynamic and Hybrid Tables

Speaking of tables, the next two entities are very helpful when considering processing real time data in Snowflake.

Dynamic Tables allow you to define continuous transformations within Snowflake. Think of them as similar to materialized views, but more flexible and with the ability to handle complex transformations and dependencies. You specify a SQL query for how the table should be built, and Snowflake automatically processes incremental changes from source tables behind the scenes.

  • Incremental Data Processing: Instead of running ad-hoc or scheduled jobs, Dynamic Tables automatically update when new data arrives. You can set a target lag by specifying a refresh duration or defining a downstream dependency. Within the downstream dependency, the last dynamic table determines when the data needs to be refreshed.
  • Continuous Pipelines: They reduce the need for manual orchestration with Snowflake Tasks or external job schedulers.
  • Complex Transformations: Unlike materialized views (which are generally suited for simpler aggregates), Dynamic Tables can handle joins, window functions, and other advanced SQL operations.

 

Hybrid Tables are a newer concept that merges Snowflake’s columnar micro-partition storage with row-oriented features, making it easier to handle fast, high-volume inserts or updates. They aim to support both analytic (OLAP) and transactional (OLTP-like) workloads in a single Snowflake environment.

  • Faster Row-Level Operations: Traditional Snowflake tables can handle inserts and updates at scale, but they’re optimized primarily for analytical reads. Hybrid Tables aim to make row-level operations more efficient.
  • Mixed Workload Support: Combine real time event ingestion (often associated with row-level databases) and analytical querying (where column-based storage excels).
  • Reduced Latency: By better handling small transactions and frequent data changes, Hybrid Tables can help lower the time it takes to get new data ready for querying.

Snowpipe and Snowpipe Streaming

Snowpipe is Snowflake’s continuous data ingestion service that loads data from external or internal stages. Typically, you point Snowpipe at a cloud storage location (e.g., Amazon S3) where new files land, and it automatically imports them.

  • Classic Batch: The COPY command is often scheduled to run at fixed intervals (e.g., hourly or daily), which can introduce latency.
  • Snowpipe: Instead of waiting for a scheduled batch, Snowpipe automatically pulls in smaller file increments soon after they arrive in the stage. This reduces the time between data generation and availability for queries.

However, Snowpipe is optimized for continuous loads of small files – think micro-batches, rather than large single-file loads. If your data volume spikes significantly, you might face higher costs or need to switch to different ingestion strategies to maintain throughput.

Snowpipe Streaming is an API-based approach that writes data directly to Snowflake tables, bypassing intermediate storage altogether. This can reduce latency and loading times even further than standard Snowpipe.

Why is it “More Real Time”? Data arrives instantly (or very close to it) in Snowflake, enabling near real time reporting. You don’t need to wait for batch files to land in cloud storage, and it’s often more cost-effective at scale, especially for high-frequency, small data events.

Additionally, the Snowflake Connector for Apache Kafka supports Snowpipe Streaming, allowing a near-seamless way to flow messages from Kafka topics straight into Snowflake tables. This integration is a significant step toward bridging the gap between streaming data platforms and Snowflake’s cloud data warehouse.

Snowflake Streams

Lastly, a Snowflake “Stream” tracks changes made to a table (inserts, updates, and deletes). This is particularly useful for Change Data Capture (CDC) workflows, enabling efficient processing of real time data.
You can design pipelines that react to these CDC streams, triggering transformations or downstream processes. This is ideal for scenarios where real time data updates must flow into transactional or operational systems.

Limitations and Possibilities of Real-Time Data in Snowflake

After breaking down some key factors, we will examine several limitations and possibilities when working with Snowflake and real time data.

While Snowflake aims for near real time processing, it isn’t designed as an ultra-low-latency, sub-second event-processing engine. Rapid, continuous ingestion can drive up costs if you’re not careful with warehouse sizing and auto-suspend settings. Each warehouse or streaming component could incur charges for frequent queries and data loads.

Monitoring multiple data streams, scaling them and managing their throughput can be complex. You’ll need robust DevOps and DataOps practices to ensure data integrity and consistent performance.

On the other hand, you can enable real time dashboards and analytics. With Snowpipe Streaming and the Kafka Connector, you can feed data directly into Snowflake and power dashboards that update within seconds or minutes. CDC streams enable event-driven pipelines where changes in your transactional databases trigger immediate actions in Snowflake, such as downstream transformations or alerts. Snowflake’s separation of storage and compute makes it easier to handle spiky workloads. Real time pipelines can scale up independently of other batch or analytical tasks, ensuring smooth parallel processing.

Many organizations mix real time flows with traditional batch loads. Snowflake’s architecture can accommodate both simultaneously without conflict.

Conclusion

By bringing together Hybrid Tables for row-based writes, Snowpipe Streaming for near real time ingestion, Snowflake Streams for continuous change detection, and Dynamic Tables for automated transformations, organizations can introduce real time data flows on top of their existing batch approach. This collaboration of Snowflake features opens up opportunities to capture high-velocity data and process it almost immediately, enabling decision-makers to access live metrics, personalize customer interactions as they happen, optimize operational processes, and detect anomalies or fraud with minimal delay. When combined with standard batch ingestion, this architecture preserves historical data and large-scale analytics while adding a complementary layer for low-latency, event-driven insights.

However, there are challenges and limitations to consider. More frequent writes can drive up compute costs, especially if data arrives at unpredictable volumes and velocities. Ensuring consistent data models becomes trickier if schema changes occur rapidly, as real time pipelines need to keep up with adjustments in source systems. Hybrid Tables, while excellent for rapid inserts, may not always yield the same query performance as columnar tables when it comes to large-scale analytical workloads; likewise, the cost overhead of always-on streaming pipelines can be significant if not carefully sized and monitored. Complexity also rises with more moving parts, and a robust DataOps or DevOps framework is critical for controlling ingestion, transformation, and monitoring across multiple modes of data processing.

Still, integrating these real time capabilities into an existing batch-loading architecture can be done gradually by setting up a dedicated real time pipeline that feeds into Snowflake alongside the traditional bulk loads. The new pipeline could ingest event data using Snowpipe Streaming and store it initially in Hybrid Tables, where Streams and Dynamic Tables can process changes in near real time. Historical data loads and heavy analytical queries would continue using the existing batch approach with columnar, micro-partitioned tables. Over time, teams can merge both approaches at the consumption layer—whether in BI dashboards, ML pipelines, or operational analytics—to combine the immediacy of real time data with the depth of historical context, all within the same Snowflake ecosystem.

Microbatch Incremental Models: A New Approach to Large Time-Series Data

What is a Microbatch?

Microbatch is an innovative incremental strategy designed for large time-series datasets. Introduced in dbt Core version 1.9 (currently in beta), it complements existing incremental strategies by offering a structured and efficient way to process data in batches.



Key features of Microbatch include:

  • Utilizes a time column to define batch ranges.
  • Supports reprocessing failed batches.
  • Auto-detects parallel batch y
  • Eliminates complex conditional logic for backfilling.

However, it’s not suitable for datasets lacking a reliable time column or requiring fine-grained control over processing logic.

How Microbatches Work

Microbatching works by splitting model processing into multiple queries (batches) based on:

  • event_time: The time column defining batch ranges.
  • batch_size: The time period for each batch (hour, day [default], month, year).

Each batch functions as an independent, atomic unit, meaning:

  • Batches can be processed, retried, or replaced individually.
  • Parallel execution enables separate, idempotent batch processing.

Batch replacement strategies vary by database adapter:

  • Postgres: Uses merge.
  • BigQuery, Spark: Uses insert_overwrite.
  • Databricks: Uses replace_where.
  • Redshift, Snowflake: Uses delete + insert.

Microbatch Model Configurations

When setting up a Microbatch model, the following configurations are required:

  • event_time: Specifies the time column in UTC.
  • batch_size: Defines batch granularity (hour, day, month, year).
  • begin: Sets the start point for initial or full-refresh builds.

Optional configurations include:

  • lookback: Processes prior batches for late-arriving records.
  • concurrent_batches: Controls parallel execution (auto-detected by default).

Running Batches in Parallel

Parallel execution is automatically detected based on batch conditions and adapter support. However, users can override this behavior using the concurrent_batches setting.

Parallel execution is possible when:

  • The batch is neither the first nor last in the sequence.
  • The database adapter supports parallel execution.
  • The model logic does not depend on execution order.

How to Backload Data

Backloading allows reprocessing historical data within a specific time range using the following command:

dbt run --event-time-start "2025-02-01" --event-time-end "2025-02-03"

This ensures that only batches within the defined range are processed independently.

Microbatch vs. Other Incremental Strategies

Microbatch differs from traditional incremental strategies by:

  • Using independent queries for time-based batches.
  • Eliminating the need for is_incremental() and complex SQL logic.
  • Automatically selecting the most efficient operation (insert, update, replace) for each platform.

Conclusion

Microbatch is a powerful new approach to incremental data processing in dbt Core. By breaking down large datasets into manageable, parallelizable chunks, it simplifies data modeling while improving efficiency and scalability. However, it is essential to consider whether Microbatch suits your data pipeline’s requirements before implementing it.

Watch the Video

Effectivity Satellites on Links

Watch the Video

Understanding Effectivity Satellites in Data Vault

Effectivity satellites play a crucial role in Data Vault modeling by tracking changes and deletions in source systems. In this article, we’ll explore when to use an effectivity satellite, how it differs from a regular satellite, and the best practices for implementing it.

What is an Effectivity Satellite?

An effectivity satellite is essentially a standard satellite used to capture business time attributes such as valid-from and valid-to dates, contract start and end dates, and deletion timestamps. The key distinction is that it tracks soft deletions from source systems, ensuring that historical data integrity is maintained in the Data Vault.

When Should You Use an Effectivity Satellite?

Effectivity satellites should be used when you need to track historical changes in relationships and entities, especially deletions. Common use cases include:

  • Tracking contract start and end dates
  • Monitoring employee and corporate car assignments
  • Managing customer records and their deletion status

Difference Between Regular and Effectivity Satellites

Regular satellites store descriptive attributes like names and addresses, whereas effectivity satellites focus on time-based attributes and deletion markers. While regular satellites track changes in data, effectivity satellites specifically manage record deletions and validity periods.

Choosing Between Link Satellites and Effectivity Satellites

Link satellites capture changes in relationships between entities, whereas effectivity satellites track the validity of those relationships. You should choose an effectivity satellite when:

  • Deletions need to be recorded without physically removing data
  • You need to track when a relationship was created and ended
  • Historical relationship integrity must be preserved

Example: Employee and Corporate Car Assignment

Consider an employee assigned to a corporate car. When the assignment changes, a new link entry is created between the employee and the new car. The effectivity satellite records the deletion timestamp for the old relationship and maintains the history of assignments.

Handling Deletions in Effectivity Satellites

One of the main challenges in effectivity satellites is detecting and handling deletions. Different data loading methods impact how deletions are recorded:

  • Full Loads: Compare current and previous loads to identify missing records.
  • Change Data Capture (CDC): Uses system flags to detect deletions.
  • Delta Loads: Requires additional logic to identify removed records.

Effectivity Satellites in Business Vault

In a Business Vault, effectivity satellites can be used to implement business rules for tracking deletions. For instance, a customer may be considered deleted only if removed from all source systems, which requires a business-driven deletion logic.

Conclusion

Effectivity satellites are essential in Data Vault modeling for tracking deletions and historical changes. Understanding their role and choosing the right satellite type ensures accurate data lineage and integrity.

Multi-Active Satellites vs. Dependent Child Links in Data Vault Modeling

Multi-Active Satellites vs. Dependent Child Links

In the realm of Data Vault modeling, practitioners often encounter scenarios where multiple descriptive entries are valid simultaneously for a single business entity. Two primary modeling techniques address this complexity: Multi-Active Satellites (MAS) and Dependent Child Links. Understanding the distinctions between these approaches is crucial for designing efficient and accurate data warehouses.



Understanding Multi-Active Satellites

A Multi-Active Satellite is designed to store multiple instances of descriptive information related to a parent key, all valid at the same point in time. The parent can be either a Hub or a Link. This structure is particularly useful when an entity can have several concurrent attributes.

For example, consider an insurance policy that offers various coverage details, each with different validity periods. Here, the policy (Hub) is associated with multiple coverages, each represented as a row in the Multi-Active Satellite, capturing the distinct validity periods and coverage amounts.

Defining Dependent Child Links

A Dependent Child Link is a Link entity that includes one or more additional key attributes. Together with the combination of business keys connected by the Link, these attributes uniquely identify incoming data records. This structure is also known as a degenerated link, peg-legged link, non-historized link, or transactional link.

For instance, in an invoicing system, an invoice (Hub) may have multiple line items. Each line item can be uniquely identified by combining the invoice identifier with a line item number, forming a Dependent Child Link.

Modeling Example: Order Line Items

When modeling order line items, there are two valid approaches:

  1. Create a non-historized, Dependent Child Link with a Non-historized Satellite containing the invoice footer details.
  2. Establish a non-historized, Dependent Child Link that includes both the key combination and the invoice footer details.

The choice between these methods depends on the specific requirements of the data model and the nature of the data being captured.

Modeling Example: Insurance Policies

Consider an insurance policy with different effective time windows—a scenario discussed in a previous session. In this case, attributes such as ValidFrom, ValidTo, and Amount are descriptive data attributes related to the business relationship between the Policy and Coverage.

The recommendation is to keep these attributes together in a Multi-Active Satellite on a standard Link between Policy and Coverage. This approach ensures that all relevant information is stored cohesively, allowing for efficient querying and analysis.

Choosing Between Multi-Active Satellites and Dependent Child Links

The decision to use a Multi-Active Satellite or a Dependent Child Link hinges on the specific business scenario:

  • Multi-Active Satellites are ideal when multiple descriptive attributes of an entity are valid simultaneously, and these attributes need to be tracked over time. This structure allows for capturing the history of changes effectively.
  • Dependent Child Links are suitable when there is a need to uniquely identify records through a combination of keys, especially in transactional contexts where multiple related records exist, such as invoice line items.

It’s essential to assess the nature of the data and the business requirements to determine the most appropriate modeling technique.

Conclusion

Both Multi-Active Satellites and Dependent Child Links offer valuable structures in Data Vault modeling, each catering to different scenarios involving multiple concurrent records. By understanding their definitions, applications, and differences, data modelers can make informed decisions to design robust and efficient data warehouses.

Watch the Video

Interview with Julien Redmond, Creator of IRiS

Interview with Julien Redmond

Welcome to another edition of Data Vault Friday! I’m Michael Olschimke, CEO of Scalefree, and every Friday at 11 o’clock, we dive into discussions about Data Vault, data mining, cloud computing, and any data-driven applications. Today, we have a special guest—Julien Redmond from Ignition Data in Australia, who’s been working with us as a partner. Julien has developed the IRiS Data Vault automation tool, and he’s here to share insights about this innovative solution.



The Global IRiS Tour

Julien has been traveling the globe, promoting IRiS and ensuring that everyone knows about this groundbreaking tool. IRiS focuses on simplifying the data engineering aspects of Data Vault, rather than the modeling tasks, making it accessible and easy to use. Julien’s goal was to create a process so straightforward that anyone could learn it in less than a day. This simplicity allows teams to make Data Vault tasks repeatable and manageable, even for junior members.

What Sets IRiS Apart?

With so many Data Vault automation offerings available, IRiS stands out by addressing common pain points. The tool aims to minimize the steep learning curves often associated with other automation tools and facilitates seamless knowledge transfer between experienced and new users. It’s designed to integrate softly with existing data management platforms—whether that’s Microsoft Data Factory, AWS Glue, or other established tools—without disrupting current systems.

Seamless Integration

IRiS requires a minimal amount of metadata, which can be easily extracted from any modeling tool. This means there’s no new modeling interface to learn—just feed the metadata into IRiS, and it generates the necessary stored procedures and data definition scripts. This integration approach ensures that companies can leverage their existing platforms while adding powerful Data Vault automation capabilities.

Empowering Data-Driven Organizations

IRiS supports a range of target platforms like Databricks, Snowflake, and Microsoft tools, aligning with the growing trend of moving towards Lakehouse architectures. Organizations can incrementally move data into the Lakehouse based on specific use cases, promoting value-driven design and delivery. Julien emphasized that IRiS is lightweight, inexpensive, and comes as a single container—making it easy to deploy and use without significant overhead.

Learning and Community Support

One of the standout features of IRiS is its supportive learning environment. It includes an online training program with six hours of videos, a comprehensive playbook blending Data Vault methodology with practical user guidance, and access to a knowledge hub with tips and tricks. New users can get up to speed quickly, reinforcing their learning with a supportive community ready to help when needed.

Future of IRiS

Julien’s global tour reflects the excitement and confidence behind IRiS. As he visits partners worldwide—from Finland to the US—he’s spreading the word about how IRiS can transform Data Vault engineering, especially for organizations invested in cloud platforms. The response so far has been overwhelmingly positive, with teams appreciating how IRiS fits into their existing infrastructures while simplifying their workflows.

That wraps up this special session of Data Vault Friday! Thanks for joining us, and a big thanks to Julien for sharing his journey with IRiS. We’ll return to our usual Q&A format next time, so be sure to bring your questions. Until then, have a fantastic weekend!

Watch the Video

Microsoft Fabric as an Enterprise Data Platform

Intelligent Data Front-ends

Introduction to Microsoft Fabric and dbt Cloud

In today’s digital world, organizations need a unified, scalable, and collaborative data platform to power analytics, AI-driven insights, and business intelligence. Enter Microsoft Fabric—a comprehensive, role-based, SaaS-delivered data platform that brings together key Azure services under a single “one lake” foundation with built-in AI capabilities.

In this article, we’ll explore how Microsoft Fabric can serve as your enterprise data platform, how it integrates with data modeling tools like dbt Cloud, and a proven “medallion” reference architecture that takes you from raw data ingestion to business-ready information marts. We’ll also discuss future extensions, practical limitations, and best practices to guide your journey.

Microsoft Fabric as an Enterprise Data Platform

This webinar covers leveraging Microsoft Fabric to implement a modern, end-to-end data platform. You will learn, how the different Fabric services can be combined, to implement a medaillon architecture, supported by Data Vault 2.0 and dbt Cloud. A live demo will show lakehouses, warehouses, and Hub, Links, and Satellites in a real world scenario!

Watch webinar recording

Quick Primer: The Data Vault Methodology

Before diving into Fabric, it’s helpful to understand the Data Vault approach—an architecture pattern that brings agility, auditability, and scalability to your data warehouse. It comprises three core components:

  • Business Keys: Unique identifiers of business objects (e.g., customer number in a CRM).
  • Descriptive Data: Attributes that describe business keys (e.g., customer name, birthdate), which evolve over time.
  • Relationships: Linkages between business keys (e.g., customer–order relationships in a CRM).

By separating these elements into hubs, satellites, and links, Data Vault provides a repeatable, auditable framework for loading and tracing data lineage, perfectly suited for modern cloud platforms.

hubs, links, and satellites

Microsoft Fabric: Core Front-Ends and Services

At its heart, Microsoft Fabric brings together seven role-based “front-end” experiences, but three of them are key to enterprise data engineering and warehousing:

Data Factory

  • Data Flows: Low-code transformations (joins, aggregations, cleansing) via a Power Query-like interface.
  • Data Pipelines: Petabyte-scale ETL/ELT workflows with full control-flow constructs (if/else, loops).

Use case: Ingest raw data from relational, semi-structured, or unstructured sources into your landing zone lakehouse.

Data Engineering

  • Lakehouses: Unified storage for structured/unstructured data in Delta-Parquet format, with SQL endpoints for analytics.
  • Notebooks: Interactive Python, R, or Scala environments for data prep, analysis, and data science exploration.
  • Spark Job Definitions: Batch and streaming ETL jobs on Spark clusters.
  • Data Pipelines: Orchestrated sequences of collection, processing, and transformation steps.

Use case: Land raw data and expose it to data scientists or further transformation processes.

Data Warehouse

  • Warehouses: Relational-style databases with Delta-Parquet storage, instant elastic scale, and full transactional support.
  • Support for cross-warehouse queries and seamless read access to lakehouses.

Use case: Implement Data Vault’s Raw Vault, Business Vault, and Information Marts for BI consumption.

Intelligent Data Front-ends

Workspaces

All Fabric resources live inside workspaces, which group lakehouses, warehouses, notebooks, pipelines, and more. Workspaces enable:z

  • Role-based access control and collaboration
  • Integration with Git for versioning and CI/CD
  • Cross-workspace data access via shortcuts

Integrating dbt Cloud with Microsoft Fabric

dbt Cloud is an industry-leading transformation framework that brings software engineering best practices to your data models: modular SQL, testing, documentation, and CI/CD. In Fabric, dbt Cloud:

  • Connects to a Fabric workspace as a data warehouse endpoint
  • Generates SQL models (SELECT statements), reading from lakehouses or warehouses
  • Executes those models natively on Fabric warehouses

Key benefit: dbt manages your Data Vault layers (hubs, links, satellites, and information marts) with clear lineage, testing, and version control—while Fabric handles execution, storage, and compute elasticity.

Reference Architecture: The Medallion Approach on Fabric

The modern “medallion” architecture separates data into three refinement layers—Bronze (raw), Silver (conformed/business), and Gold (BI-ready). Here’s how it maps onto Fabric:

Bronze (Landing Zone Lakehouse)

Data Factory pipelines copy raw relational, JSON, and unstructured files into a lakehouse. This fully persisted, immutable history remains read-only for most users.

Silver (Raw & Business Vault Warehouses)

  • Raw Vault Warehouse: dbt models generate staging views/tables with hash keys, load dates, and audit metadata.
  • Business Vault Warehouse: dbt builds hubs, links, and satellites based on business keys and relationships.

Gold (Information Mart Warehouse)

Information marts—star or snowflake schemas—are created via dbt models as optimized, query-ready tables for BI tools (Power BI, Tableau, etc.).

Dbt Cloud and Microsoft Fabric - Medaillon

Live Demo Highlights

During our webinar demonstration, we walked through:

  • Setting up a Fabric workspace and viewing lakehouse tables via SQL and the Windows Explorer integration
  • Using a Data Factory pipeline to ingest sample Snowflake data into a landing zone lakehouse
  • Authoring dbt models in dbt Cloud to create staging (hashing, load dates), hub tables, link tables, and satellites
  • Executing dbt runs that generate and run SQL in Fabric warehouses, and previewing results directly in the Fabric UI
  • Accessing all data files and Delta-Parquet tables seamlessly in Windows Explorer for multi-cloud portability

Outlook: Next-Gen Enhancements

Beyond the core implementation, here are exciting ways to evolve your Fabric-dbt platform:

  • Workspace Segmentation & Data Mesh: Create dedicated workspaces for medallion layers or business domains, and stitch multiple dbt projects together with dbt Mesh for a true data mesh design.
  • Real-Time Data Integration: Leverage Fabric’s built-in streaming capabilities to blend real-time feeds into your warehouses alongside batch data.
  • Enhanced Governance & Semantic Layers: Define and enforce semantic models both in dbt and in Fabric (via semantic models) to ensure consistent metrics across all BI tools.
  • Data Science Collaboration: Grant read-only access to bronze lakehouses and empower data scientists to use Fabric notebooks (Python, R, Scala) for ad-hoc analysis and advanced ML experiments.
  • Simplified Migration: Existing dbt projects on on-prem or other cloud warehouses can be repointed to Fabric with minimal code changes—especially when using community macros for Data Vault deployments.

Considerations & Limitations

While Fabric is powerful, be mindful of:

  • Write Support: Lakehouses currently support only SQL writes—transformations must target Fabric warehouses.
  • Shortcut Management: Cross-workspace shortcuts must be manually maintained; frequent schema changes can add overhead.
  • Multiple Overlapping Tools: Data Factory, Data Engineering pipelines, notebooks, and dbt all offer ETL—establish clear standards to avoid confusion.
  • Product Maturity: As a relatively new platform, UI changes and minor bugs may appear; plan for iterative improvements.
  • Capacity Transparency: Compute and storage share capacity; monitor and size your Fabric capacity carefully to meet SLAs.

Conclusion

Microsoft Fabric, coupled with dbt Cloud, delivers an end-to-end Enterprise Data Platform that unifies data ingestion, storage, transformation, and consumption. By applying proven patterns like the medallion architecture and Data Vault methodology, you can build a scalable, collaborative, and governed environment, empowering both data engineers and business users to unlock insights faster.

Ready to take your data platform to the next level? Reach out for a tailored workshop, architecture advisory, or hands-on implementation support.

Modelling Salesforce History Tables in Data Vault

Modelling Salesforce History Tables

Salesforce tracks changes to configured attributes by storing them in history tables. This data, which includes record ID, field name, old and new values, and timestamps, presents a unique challenge for Data Vault modeling. In this article, we’ll explore an optimal way to model this data using Data Vault principles.



Understanding Salesforce History Tables

Salesforce allows tracking of specific attribute changes within objects like Contacts. These changes are stored in history tables such as ContactHistory. Each entry logs:

  • Record ID (e.g., Contact ID)
  • Field Name
  • Old Value
  • New Value
  • Timestamp

Challenges in Modeling Salesforce History Data

When designing a Data Vault model for this history data, there are key challenges to consider:

  • Handling multiple changes for the same record within a short time frame
  • Maintaining referential integrity
  • Efficiently querying and pivoting data for reporting

Approach: Multi-Active Satellite

A common initial approach is to model the history table as a multi-active satellite attached to a Contact Hub, with the field name as the dependent key. However, this approach has pitfalls:

  • Duplicates can arise if multiple changes occur for the same field in the same batch
  • Timestamp-based keys are unreliable due to possible duplicate timestamps

To counter this, a unique sequence number should be assigned in the staging area and used as a dependent key.

Optimized Approach: Non-Historized Link

Instead of a multi-active satellite, a non-historized link can be used to model Salesforce history data more efficiently. Here’s how it works:

  • Create a non-historized link connecting the Contact and User hubs.
  • Store change-related attributes (field name, old value, new value, timestamp) directly within this link.
  • Use the timestamp as an event-based attribute rather than part of the primary key.

This approach avoids the need for complex joins and simplifies querying.

Efficient Data Retrieval: Pivoting

Since history tables are structured in a key-value format, queries often require pivoting. By using database pivot functions, we can restructure the data into a more usable format for reporting without excessive joins.

Alternative Consideration: JSON Storage

Another approach is to store change data as a JSON object in a standard satellite. This method offers flexibility, particularly when dealing with a large number of attributes. However, it complicates querying and should be used only when necessary.

Conclusion

For most cases, a non-historized link is the optimal way to model Salesforce history tables in Data Vault. It simplifies data storage, reduces the need for extensive joins, and enhances query performance. Multi-active satellites are an alternative but require careful handling of duplicate timestamps and field changes.

Watch the Video

Learning from DORA: Data Governance Lessons for All Institutions

Learning from DORA: Data Governance Lessons for All Institutions

Want to improve your organization’s ability to withstand digital disruptions? This webinar unpacks the key lessons from the Digital Operational Resilience Act (DORA), providing practical takeaways you can implement immediately. Discover how DORA’s advanced approaches to data management, risk mitigation, and operational resilience can be adapted to enhance your organization’s security posture, regardless of your sector.

Register now to secure your spot and enhance your digital resilience!

Webinar Details

  • Date: January 21st, 2025
  • Time: 11:00 – 12:00 CET
Watch Webinar Recording

Speakers

Implement Data Tests to Enhance Data Quality

Data Quality Testing

In today’s data-driven world, poor data quality can lead to costly mistakes. From misguided strategic decisions to operational inefficiencies and poor customer experiences, the impact of bad data is far-reaching. Issues such as duplicates, data integrity failures, missing values, and inconsistent formats can create significant business challenges.



Why Early Detection Matters

Fixing data quality issues at the source or integration level is cost-effective and minimizes business disruption. In contrast, correcting errors at the business level is expensive and can severely impact operations. Implementing data tests early ensures smooth processes and reliable reporting.

Benefits of Data Testing

  • Trust in Data – Enables confident decision-making and reliable analytics.
  • Process Efficiency – Automates quality checks and reduces manual work.
  • Business Protection – Safeguards reputation and enhances customer satisfaction.
  • Risk Reduction – Provides early warnings and ensures compliance.

Key Data Tests in Coalesce

To maintain high data quality, businesses should test for:

  • Custom business rules
  • Referential integrity
  • Value ranges
  • Uniqueness
  • Data types
  • Missing or null values

By implementing rigorous data tests, organizations can enhance data quality, minimize risks, and drive better business outcomes.

Watch the Video

The Power of Data Vault – Business Use Cases

Business Use Cases in Data Vault

In the world of data management and integration, businesses face many challenges. Data Vault is a methodology designed to address these challenges, offering a flexible and scalable solution for integrating and managing data across an enterprise. But when is it the right time to use Data Vault? Are there specific business scenarios where Data Vault’s power truly shines? This article explores the core benefits of Data Vault, its use cases, and how it can solve complex data integration problems.



Understanding the Pain Points

Before diving into when and where Data Vault is most beneficial, it’s important to understand the underlying pain points businesses face in their data management processes. According to Michael Olschimke, CEO of Scalefree, understanding the business pain points is crucial. If there is no significant problem, there may be no need for a new solution like Data Vault. The key is identifying situations where current methods fall short in handling data integration, privacy regulations, and evolving business rules.

The most common pain point is the challenge of data integration. Modern businesses typically operate with data spread across multiple sources, from internal systems to external data feeds. Integrating these data sources into a single, unified view is one of the biggest challenges. Whether you’re trying to generate reports, create dashboards, or analyze data for business insights, you need a consistent and reliable method to integrate data from diverse systems. This is where Data Vault excels.

The Core Strength of Data Vault: Data Integration

Data Vault is designed specifically for situations where integration is a priority. If a business needs to bring together multiple disparate data sources into a single framework for reporting, Data Vault offers a robust solution. Its flexibility allows businesses to combine data from different systems, apply various business logic, and present the data in a meaningful way.

In contrast to other methods, Data Vault shines when the data integration needs are complex. Simply dumping data into a data lake may seem like an easy solution, but it leaves businesses with the challenge of how to integrate these disparate datasets into a cohesive model. Without a clear method for integration, data lakes become isolated silos of information, and producing integrated reports becomes a significant challenge.

Addressing Regulatory Compliance and Privacy

In today’s data-driven world, businesses must also address regulatory requirements such as GDPR. One of the strengths of Data Vault is its built-in support for privacy and security regulations. When managing sensitive data, businesses need to ensure compliance with privacy regulations, including the ability to delete or anonymize personal data when necessary.

While other methods can also address regulatory concerns, Data Vault provides out-of-the-box patterns and solutions that are easy to implement and scale. For example, Data Vault allows businesses to securely store data, apply business rules, and remove or anonymize personal attributes without disrupting the overall data structure. This capability is crucial in today’s regulatory environment, where compliance is not just a best practice but a legal requirement.

Handling Changing Business Rules Over Time

Another key use case for Data Vault arises when businesses face changing business rules. Over time, companies evolve, and with this evolution comes changes in how data is processed and interpreted. For example, a business might need to apply different versions of a business rule to historical and current data, depending on when the rule was in effect.

Data Vault provides a solution to this challenge by separating the data transformation processes and storing them in the “business vault.” This separation allows businesses to apply different versions of business rules to different datasets. For instance, you might apply one rule to data from the previous year and a different rule to the current year’s data. This flexibility allows companies to adapt to new business requirements without overhauling their data architecture every time the rules change.

Scalability and Flexibility

As businesses grow and their data needs become more complex, the scalability of their data management solutions becomes critical. Data Vault is highly scalable because it allows companies to add new data sources, apply new business rules, and adjust their data models as needed without requiring a complete redesign of their data infrastructure.

One of the most powerful features of Data Vault is its ability to “creatively destruct” incoming data. This means that data from different source systems can be broken down into fundamental components—such as business keys, relationships, and descriptive data. These components can then be recombined in any format or structure that suits the business’s reporting or analytical needs, whether that’s a star schema, flat tables, or any other target structure. This flexibility ensures that businesses can meet various use cases and reporting requirements using the same data platform.

Data Vault and Business Intelligence

In the realm of business intelligence (BI), Data Vault stands out as an effective method for managing large, complex datasets. It offers businesses the ability to handle multiple use cases, such as generating reports, analyzing trends, and forecasting future performance. Because it integrates data from multiple sources, it provides a single, reliable source of truth for reporting and analysis.

Unlike traditional BI systems, which often require multiple data platforms or complex ETL (extract, transform, load) processes, Data Vault allows businesses to use a single platform for all their BI needs. Whether you’re running operational reports, building data marts, or creating advanced analytics models, Data Vault’s flexibility ensures that businesses can handle various BI scenarios without the need for separate systems or tools.

Addressing Complex Data Models

While Data Vault is highly flexible, it can also become more complex as businesses face increasingly complex data models. The complexity arises when businesses deal with dirty data, unclear business key definitions, or overlapping data from different source systems. In these situations, Data Vault allows companies to address these challenges by adding new components to their data models, such as hubs, links, and satellites.

For instance, if a business has two different source systems with different business key definitions, Data Vault can create a new hub to store these keys and establish relationships between them. Similarly, when data quality is an issue, Data Vault allows businesses to add computed satellites to clean the data before it’s used for reporting or analysis. While these additional components can increase the complexity of the data model, they are essential for solving the challenges presented by messy, inconsistent, or incomplete data.

When Should You Use Data Vault?

Ultimately, the decision to use Data Vault depends on your business’s data requirements. If your data integration needs are relatively simple, or if you don’t have stringent privacy or regulatory requirements, other solutions might suffice. However, for businesses dealing with complex datasets, evolving business rules, and compliance challenges, Data Vault provides a comprehensive, scalable solution that addresses all these needs.

When evaluating whether Data Vault is the right choice for your organization, it’s essential to assess your current and future data needs. If you require robust data integration, the ability to apply different business rules over time, and compliance with privacy regulations, Data Vault is a powerful tool that can handle these challenges. Its flexibility and scalability ensure that it can grow with your business as your data needs evolve.

Conclusion

Data Vault is a powerful methodology for businesses that need to integrate complex data from multiple sources, apply evolving business rules, and comply with privacy regulations. While it may not be necessary for every business, for those facing challenges in these areas, Data Vault offers a robust, flexible, and scalable solution. By breaking down and restructuring data in a way that supports various reporting and analytical needs, Data Vault ensures businesses can keep up with the ever-changing demands of today’s data-driven world.

Watch the Video

Bridging Domain Ownership and Data Products in Data Mesh Using Data Vault 2.0

Data Mesh and Data Vault 2.0

The Data Mesh paradigm is revolutionizing how organizations manage and utilize their data. By decentralizing data ownership and treating data as a product, businesses can create a self-sufficient ecosystem that empowers teams and promotes collaboration. Here’s how Data Mesh principles align with Data Vault 2.0 to enhance data management and governance.



Key Principles of Data Mesh

  • Domain Ownership: Data is managed at the domain level, with domains defined by business needs, such as product categories or customer segments. Analytical and operational data become the responsibility of domain teams.
  • Data as a Product: Domains own analytical data, with a focus on usability and quality. Data contracts ensure consistency and reliability for consumers.
  • Federated Governance: Standards and governance frameworks enable interoperability and ensure that the entire data ecosystem remains cohesive.
  • Self-Service Data Platform: DevOps and platform teams support a self-service environment where data can be easily shared and accessed through managed self-service BI tools.

What Defines a Domain?

A domain in Data Mesh is characterized by:

  • Autonomous Operations: Independence in managing and delivering data products.
  • Cross-Functional Teams: Teams that bring together diverse skills to manage data effectively.
  • Governance Accountability: Responsibility for adhering to governance and quality standards.

Understanding Domain Ownership

Domain ownership emphasizes:

  • Quality and Usability Focus: Delivering reliable, easy-to-use data products.
  • Decentralized Control: Allowing domain teams to manage their data independently.
  • Responsibility for Data Products: Ensuring end-to-end ownership of data assets.

What is a Data Product?

Data products embody the following principles:

  • Treating data as a product with well-defined consumers.
  • Enabling self-service usability through intuitive tools.
  • Ensuring end-to-end ownership from creation to delivery.

Integrating Data Mesh with Data Vault 2.0

Data Vault 2.0 serves as a foundation for implementing Data Mesh principles. Its focus on scalable data warehousing complements Data Mesh by supporting decentralized ownership and ensuring high-quality data products. This integration allows organizations to create a robust, scalable, and governed data ecosystem.

By combining the decentralized, domain-driven approach of Data Mesh with the structured methodology of Data Vault 2.0, businesses can unlock the full potential of their data assets.

Watch the Video

How to Tackle GDPR with Data Vault

Understanding GDPR in the Context of Data Vault

GDPR compliance is a critical concern for organizations handling personal data. Data Vault, a well-structured data modeling approach, offers a robust solution for meeting GDPR requirements, particularly in two key areas: data security and data privacy.



Data Security in Data Vault

Data security involves protecting existing data from unauthorized access. Data Vault supports this through two levels of security:

  • Row-Level Security: This ensures that users can only access records relevant to them. It can be implemented via database row-level security features or view layers.
  • Column-Level Security: Attributes are separated based on security classification. Each classification is stored in a separate Satellite, with access granted accordingly.

By controlling access at both row and column levels, organizations can ensure compliance with GDPR’s data access requirements.

Data Privacy and Deletion in Data Vault

Data privacy focuses on removing personal data when required. Data Vault’s design allows for the physical deletion of personal data without affecting the integrity of the entire dataset. This is achieved through:

  • Satellite Splitting: Personal and non-personal data are stored in separate Satellites. When a deletion request is made, only the personal data Satellite needs to be altered.
  • Data Retention Policies: Different personal data attributes may have varying retention periods. Separate Satellites are created for attributes that must be deleted at different times.
  • Point-in-Time (PIT) Table Updates: When personal data is deleted, PIT tables are rebuilt to reflect the absence of that data.

This approach ensures that deleted data is no longer accessible or retrievable, aligning with GDPR’s right to be forgotten.

Access Control Lists (ACL) in Data Vault

Managing user access to data is another essential aspect of GDPR compliance. Data Vault facilitates this through an ACL system modeled using Hubs and Links:

  • A User Hub stores information about individual users.
  • A User Group Hub categorizes users into groups with shared permissions.
  • A Customer Hub and Bank Account Hub manage customer and account details.
  • A Link connects users, user groups, and customers.
  • An Effectivity Satellite records the time periods during which users have access to specific data.

By applying this structure, access control can be managed dynamically, ensuring that only authorized users can view or modify data.

Security vs. Privacy: A Crucial Distinction

When discussing GDPR, it’s essential to distinguish between security and privacy:

  • Security: The data remains in the system, but access is restricted based on security policies.
  • Privacy: The data is physically removed from the system when no longer needed.

Organizations should ensure that security officers and privacy officers handle these concerns separately to avoid misconceptions, such as assuming filtered data is deleted when it is still present in the database.

Conclusion

Data Vault provides a comprehensive approach to managing GDPR requirements through built-in security and privacy mechanisms. By structuring data appropriately and implementing proper access control and deletion strategies, organizations can achieve GDPR compliance efficiently.

Watch the Video

Close Menu