When to Use Reference Tables in Data Vault?

Reference Tables in Data Vault

In modern Data Vault 2.0 implementations, teams often face a recurring question: “We have dozens of small, static lookup tables—should we model them as full Hubs and Satellites, or can we use simpler reference tables?” If you’re dealing with Excel sheets containing tens or hundreds of rows of relatively stable data (like Profit Centers, Status Codes, or Region mappings), this article will help you decide when a lightweight reference table suffices—and when you need the auditability of a Hub/Satellite pattern.



Understanding Business Data vs. Reference Data

First, it helps to distinguish two broad categories of data:

  • Business Objects: Entities that your processes create and update constantly—Customers, Orders, Products, etc. You generate new keys and change descriptive attributes frequently.
  • Reference Data: Code lists and lookup tables that describe or classify business objects—Country codes, Profit Center codes, Contract types. These change infrequently and usually in small batches.

Although some tables can straddle the line (e.g., Profit Centers may be “business objects” for accounting teams), it’s often safe to treat truly stable code lists as reference data for modeling purposes.

Simple Reference Tables: Pros and Cons

A simple reference table in your Data Vault is nothing more than a flat table with:

  • Primary Key: Your reference code (e.g., profit_center_code).
  • Attributes: The 2–5 descriptive columns you need (e.g., profit_center_name, region).
  • No History: Only the current state is stored; updates overwrite existing rows.

Advantages: Easy to implement, minimal objects, straightforward joins at query time.
Disadvantages: No built-in historical tracking—updates will retroactively change past reports, and you cannot reconstruct previous descriptions.
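
To make this concrete, here is a minimal sketch of such a flat reference table; the table and column names are illustrative assumptions, not taken from any specific implementation:

CREATE TABLE ref_profit_center (
    profit_center_code   VARCHAR(20)   NOT NULL PRIMARY KEY,  -- the reference code
    profit_center_name   VARCHAR(100)  NOT NULL,              -- descriptive attribute
    region               VARCHAR(50),                         -- descriptive attribute
    record_source        VARCHAR(50),                         -- optional metadata
    load_date            DATETIME                             -- optional metadata
);

-- Updates simply overwrite the current state; no history is preserved.
UPDATE ref_profit_center
SET    profit_center_name = 'Consumer Goods EMEA'
WHERE  profit_center_code = 'PROFCTR_001';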

When Simple Reference Tables Are Appropriate

Consider a flat reference table when:

  • Your business requirement only needs the latest values.
  • Updates are extremely rare (quarterly or less) and don’t require audit trails.
  • Performance of lookups is not mission-critical (small table sizes).
  • You have no regulatory or internal need to reproduce past descriptions.

If any of these criteria fail—especially auditability—then a simple reference table can become a liability.

Introducing Reference Hubs and Reference Satellites

To combine simplicity with history, use the Reference Hub & Reference Satellite pattern. It mirrors the standard Hub/Satellite design but is optimized for code lists:

  • Reference Hub:
    • business_key: the code (e.g., PROFCTR_001)
    • record_source: data origin
    • load_date: date the code list was loaded
  • Reference Satellite:
    • business_key (FK to Hub)
    • load_date (also part of PK)
    • Descriptive attributes (e.g., name, region, valid_from)

This approach captures every change to your reference data without overwriting, and still keeps your model lightweight.
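
As a rough sketch of this pattern (the table and column names below are illustrative assumptions, not a prescribed standard), the two structures could be defined as follows:

CREATE TABLE ref_hub_profit_center (
    profit_center_code   VARCHAR(20)   NOT NULL PRIMARY KEY,  -- business key (the code)
    record_source        VARCHAR(50)   NOT NULL,              -- data origin
    load_date            DATETIME      NOT NULL               -- date the code was first loaded
);

CREATE TABLE ref_sat_profit_center (
    profit_center_code   VARCHAR(20)   NOT NULL,              -- FK to the Reference Hub
    load_date            DATETIME      NOT NULL,              -- part of the primary key
    profit_center_name   VARCHAR(100),
    region               VARCHAR(50),
    record_source        VARCHAR(50),
    CONSTRAINT pk_ref_sat_profit_center
        PRIMARY KEY (profit_center_code, load_date),
    CONSTRAINT fk_ref_sat_profit_center
        FOREIGN KEY (profit_center_code)
        REFERENCES ref_hub_profit_center (profit_center_code)
);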

How It Works in Practice

  1. Load the Hub: Insert every code once (or refresh if new codes appear).
  2. Load the Satellite: For each code, insert a new row whenever any descriptive attribute changes, tagging it with the load_date.
  3. Querying: In your dimension or Information Delivery layer, join from your business object Satellite (or Link) directly into the Reference Satellite on code, filtering to the row with the latest load_date ≤ transaction date.

Because reference tables are small, these joins remain performant even when you compare on dates.
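
For illustration only (the table and column names are assumptions), such a point-in-time lookup could be written with a correlated subquery:

-- Sketch: resolve the reference description that was valid at the transaction date.
SELECT
    t.transaction_id,
    t.transaction_date,
    r.profit_center_name,
    r.region
FROM sat_transaction t
JOIN ref_sat_profit_center r
  ON  r.profit_center_code = t.profit_center_code
  AND r.load_date = (
        SELECT MAX(r2.load_date)
        FROM   ref_sat_profit_center r2
        WHERE  r2.profit_center_code = t.profit_center_code
          AND  r2.load_date <= t.transaction_date
      );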

Aligning Reference Data with Business Vault Snapshots

For organizations using a Business Vault layer with snapshot dates, you may need to “time-align” reference data. Two patterns are common:

  • On-the-fly alignment: In your reporting view, use the transaction’s snapshot_date and join to the Reference Satellite where load_date ≤ snapshot_date, picking the latest record.
  • PIT/Bridge tables: Precompute “Point-In-Time” (PIT) tables that store the reference code’s surrogate key aligned to each business object snapshot for faster querying.

Choose the pattern that balances your performance SLAs with data freshness requirements.

Auditability and Regulatory Compliance

If you operate in regulated industries (banking, telecom, government), audit trails are mandatory. The Reference Hub/Satellite pattern ensures:

  • Complete lineage and history of every code change.
  • Reproducibility of past reports with original reference descriptions.
  • Ability to support retrospective analyses without reloading or reconstructing data.

Even if your initial business users only ask for current values, future sprints or stakeholders may require historical context—so building auditability upfront can save costly refactoring.

Performance Considerations

Reference tables typically contain at most hundreds of rows. However, you should still consider:

  • Indexing: Ensure load_date and business_key are indexed for fast lookups.
  • Partitioning: Generally unnecessary for small tables but useful if your Satellite grows into thousands of deltas over years.
  • Join Strategy: In most SQL engines, joining a large transaction Satellite to a small Reference Satellite on code + latest date filter is efficient. If not, consider a PIT table.

Governance and Knowledge Transfer

Whatever pattern you choose, document and govern your reference data:

  • Maintain an authoritative data dictionary describing each code list, source, update frequency, and steward.
  • Set up automated tests (e.g., CI/CD validations) to detect unexpected code changes.
  • Implement alerts for large volumes of reference updates that may indicate data quality issues.

Decision Checklist

Use this quick checklist when evaluating a table for reference modeling:

  1. Is the table truly static or slow-changing? (Quarterly or less)
  2. Are there audit or historical requirements? (Regulatory or future use cases)
  3. Is the table small enough (< 1,000 rows) to avoid performance concerns?
  4. Do you need to reconstruct past reports with original descriptions?
  5. Would a simple change in the future (e.g., a retroactive update) break historical reports if you used a flat table?

If you answered “yes” to questions 2 or 5, the Reference Hub/Satellite pattern is the safer choice. Otherwise, a simple reference table may be sufficient.

Conclusion

Static lookup tables in a Data Vault 2.0 implementation can be modeled either as simple reference tables or with a Reference Hub & Satellite pattern. While flat tables are easier to build, they lack historical tracking and auditability. By adopting the Reference Hub/Satellite approach, you gain full change history, reproducible reporting, and alignment with regulatory demands—while retaining a lightweight design.

Use the decision checklist above to guide your modeling choices, and ensure your Data Vault remains both agile and compliant as your organization’s needs evolve.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Outsourcing the Data Warehouse: Finding the Right Balance

Outsourcing the Data Warehouse

In today’s data-driven world, building and maintaining a robust data warehouse is a critical strategic initiative. As organizations grow and their data complexity increases, leaders often ask: Is there a way to hand some of the data warehouse work and infrastructure over to a third party? In other words, can you “outsource the data warehouse” without jeopardizing control, security, or long-term knowledge retention?



Why Treat Your Data Warehouse as Strategic

Most organizations—especially those generating significant revenue or handling complex data flows—view their data platform as a core strategic asset. It underpins reporting, analytics, and decision-making, and drives competitive advantage. Handing over this critical function entirely to an external provider can feel like giving away the keys to your business.

  • Control & Governance: Keeping the data warehouse in-house ensures direct oversight of data quality, security, and compliance.
  • Knowledge Retention: Your internal team develops deep institutional understanding of data models, business logic, and reporting requirements.
  • Strategic Flexibility: Internal ownership allows you to pivot rapidly as business needs evolve, without waiting on external roadmaps or SLAs.

The Role of External Expertise

That said, nearly every successful data warehouse project benefits from external consulting, especially in the early stages. Consultants bring:

  • Proven Frameworks: Templates, best practices, and reference architectures refined across multiple clients.
  • Jump-Start Momentum: Hands-on help with infrastructure setup, project governance, and team organization.
  • Skill Gap Coverage: Experienced data architects, engineers, and DevOps specialists to fill temporary talent shortages.

Once your internal team is up to speed, you can scale back external support—keeping consultants focused on specific areas where they provide the most value.

Outsourcing Infrastructure: The Cloud Advantage

For many organizations, the first form of “outsourcing” is the data center itself. On-premises servers have given way to cloud platforms like AWS, Google Cloud, and Azure. By migrating your data warehouse infrastructure to the cloud, you:

  • Reduce Capital Expense: No more large upfront hardware purchases or data center maintenance costs.
  • Gain Elastic Scalability: Spin up additional compute and storage on demand to handle peaks in data processing.
  • Enhance Security & Reliability: Leverage the cloud provider’s certifications, redundancy, and disaster recovery capabilities.

This shift effectively delegates infrastructure work—provisioning, patching, and physical security—to a trusted third party, while you retain control of your data and processes.

Balancing Internal and External DevOps

DevOps for your data platform—CI/CD pipelines, automated testing, and deployment orchestration—can also be partially outsourced. However, it’s important to strike the right balance:

  • Internal DevOps Leads: Core pipeline design, approval processes, and environment governance should remain with your in-house team.
  • External Specialists: Consultants can set up complex workflows, integrate tools, and train your staff on best practices.
  • Limit External Proportion: Aim to keep no more than 50% of DevOps roles external to mitigate knowledge leakage and dependency risks.

AI-Driven Offloading: The Next Frontier

Emerging AI tools are beginning to automate aspects of data warehousing:

  • Schema Generation: AI can suggest optimized table structures based on source data profiles.
  • ETL/ELT Code Snippets: Auto-generation of transformation scripts for common data patterns.
  • Monitoring & Alerting: Machine learning models to detect anomalies and performance bottlenecks.

While these solutions can accelerate development, they work best under the guidance of your experienced internal team, who define requirements, review outputs, and ensure alignment with business objectives.

Setting Clear Boundaries

To manage the consultant-internal balance effectively, define:

  1. Scope of Work: Specify deliverables, timelines, and handover expectations for every engagement.
  2. Knowledge Transfer Plan: Require documentation, training sessions, and code reviews to embed expertise internally.
  3. Governance Model: Establish who makes architectural decisions and how change requests are processed.
  4. Resource Thresholds: Limit external headcount to a reasonable percentage (e.g., under 50%) to avoid over-reliance.

Key Benefits and Risks

Benefits:

  • Speed: Hit the ground running with proven accelerators and expert guidance.
  • Flexibility: Scale resources up or down based on budget cycles and project phases.
  • Cost Efficiency: Avoid long-term commitments; external consultants can be released when budgets tighten.

Risks:

  • Knowledge Drain: Too many external experts can lead to critical know-how walking out the door.
  • Strategic Misalignment: Consultants may optimize for short-term wins rather than your long-term roadmap.
  • Dependency: Over-outsourcing creates vendor lock-in and potential service disruptions if relationships end.

Best Practices for Effective Outsourcing

1. Start with a Pilot: Engage consultants on a small, well-defined project to validate fit and process.

2. Embed Consultants: Position them alongside your team for on-the-job training, rather than in isolation.

3. Prioritize Documentation: Ensure every architecture decision, data model, and pipeline is clearly recorded.

4. Rotate Responsibilities: Alternate tasks between internal staff and consultants to spread knowledge.

5. Review Regularly: Conduct quarterly check-ins to reassess external involvement, goals, and budget alignment.

Conclusion

Outsourcing your entire data warehouse is rarely advisable—this remains a core strategic asset that demands internal stewardship. However, judicious use of external consultants, cloud infrastructure, and AI-driven tools can accelerate your journey, fill critical skill gaps, and optimize costs. By setting clear boundaries, emphasizing knowledge transfer, and maintaining a healthy mix of internal and external talent, you can reap the benefits of outsourcing without surrendering control.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Modeling Reference Data in Data Vault 2.0 with WhereScape

Understanding the Role of Reference Data

In the world of modern data warehousing, reference data plays a crucial role in maintaining consistency and adding business context. But how should reference or master data be modeled effectively, especially within the framework of Data Vault 2.0 using WhereScape?

This article breaks down best practices and modeling techniques to help data engineers and architects manage reference data in a scalable, maintainable way.



What is Reference Data?

Reference data consists of values that are used to categorize or describe other data within business systems. It typically includes code-to-description mappings that offer meaning to otherwise abstract identifiers.

Unlike business keys, which identify business entities or objects (like customers or products), reference data keys do not directly point to business objects. They simply support them with contextual information.

Examples of Reference Data:

  • ISO codes for countries
  • Official country names
  • Continent or region classifications
  • Currency types or codes

One critical aspect of reference data is that it can change over time. Country names may change, new currencies may be introduced, and existing classifications may be updated. This means that how we model and store this data must account for such changes.

Data Vault 2.0 and Reference Data Modeling

Data Vault 2.0 introduces a methodology designed for agility, auditability, and scalability in enterprise data warehousing. When dealing with reference data in this architecture, the recommended standard involves two main components:

  • Reference Table (Reference Hub)
  • Reference Satellite

Each serves a distinct purpose and helps manage both static and changing attributes efficiently.

The Reference Table (Reference Hub)

The reference table acts similarly to a hub in traditional Data Vault modeling, but with important distinctions:

  • Contains reference codes or keys (e.g., ISO country code)
  • Does not use a hash key – unlike typical hubs
  • May include additional static attributes that do not change over time

This component provides a centralized location for managing consistent lookup values across the enterprise. While it’s technically referred to as a “hub,” it’s specialized for reference data and behaves slightly differently in structure and intent.

The Reference Satellite

Reference satellites extend the reference table to store attributes that may evolve over time. This aligns well with the Data Vault 2.0 philosophy of tracking change history and ensuring auditability.

Characteristics of a Reference Satellite:

  • Includes reference codes or keys to link back to the reference table
  • Stores descriptive attributes that may change over time (e.g., country name updates, new regional classifications)

This design allows data teams to accommodate both historical tracking and the dynamic nature of reference data.

Why Model Reference Data This Way?

There are several strategic and operational advantages to modeling reference data using this structure in Data Vault 2.0:

  1. Separation of concerns: Static and changing data are stored in different layers (table vs. satellite), improving data integrity.
  2. Scalability: Future changes in reference attributes or descriptions are easier to manage and don’t affect historical records.
  3. Auditability: Data Vault’s natural historization supports full lineage and change tracking, which is ideal for regulated industries.
  4. Adaptability: Requirements for historization may evolve over time. Modeling reference data into satellites regardless of current needs ensures readiness for future changes.

Best Practices for Implementation

When implementing this in WhereScape, which automates the Data Vault modeling process, follow these best practices:

1. Always Use a Reference Satellite

Even if you don’t need to historize now, model your reference data in a satellite. Future-proofing your model saves costly rework later.

2. Use Reference Hubs When Multiple Sources Exist

If your organization consumes reference data from multiple systems (e.g., two systems providing different descriptions for the same country code), a reference hub helps consolidate and align these variations around the same key.

3. Avoid Hash Keys in Reference Hubs

Because reference tables don’t represent business objects, there’s no need for a surrogate hash key. Stick with the natural reference code (e.g., “US” for United States) as your unique identifier.

4. Design Satellites for Change

Structure your reference satellites to easily accommodate attributes that may change. Make use of effective date fields to track the history of these changes.
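
As an illustrative sketch only (the names are assumptions, and WhereScape would typically generate such structures for you), a country reference table and its change-tracking satellite might look like this:

-- Reference table keyed by the natural code; no hash key needed.
CREATE TABLE ref_country (
    country_iso_code   CHAR(2)       NOT NULL PRIMARY KEY,   -- e.g. 'US', 'DE'
    record_source      VARCHAR(50),
    load_date          DATETIME
);

-- Reference satellite tracking descriptions that may change over time.
CREATE TABLE ref_sat_country (
    country_iso_code   CHAR(2)       NOT NULL,
    effective_from     DATETIME      NOT NULL,               -- when this description became valid
    country_name       VARCHAR(100),
    region             VARCHAR(50),
    record_source      VARCHAR(50),
    PRIMARY KEY (country_iso_code, effective_from),
    FOREIGN KEY (country_iso_code) REFERENCES ref_country (country_iso_code)
);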

Common Pitfalls to Avoid

  • Modeling reference data as business hubs – this confuses context with core business entities
  • Skipping the satellite – even when attributes are static today, change is inevitable
  • Using hash keys unnecessarily – keep your design clean and minimal in reference structures
  • Ignoring multiple source issues – consolidate differing descriptions with a reference hub

Conclusion

Modeling reference data correctly is a small but critical part of building a reliable, scalable, and auditable Data Vault. By following the recommended structure—a reference table paired with a reference satellite—you create a flexible and future-proof design.

WhereScape users benefit from automation, but understanding these modeling principles ensures you’re applying the tool in a way that aligns with industry best practices and prepares your warehouse for long-term success.

Whether you’re handling ISO country codes or global currency classifications, treat your reference data with the same care you would your core business entities—because it gives those entities their context.

Watch the Video

Meet the Speaker


Trung Ta
Senior Consultant

Trung has been a Senior BI Consultant since 2019. As a Certified Data Vault 2.0 Practitioner at Scalefree, his areas of expertise include data warehousing in cloud environments as well as Data Vault 2.0 modeling and implementation, especially (but not limited to) with WhereScape 3D/RED. He has been working with industry leaders in the insurance and finance sectors, advising them on building their own Data Vault 2.0 solutions.

Chatbot Implementation Using Retrieval-Augmented Generation

Chatbot RAG Retrieval Phase

In today’s AI-driven world, businesses are looking for smarter, cost-effective chatbot solutions that enhance customer interactions and streamline internal operations. While traditional chatbot models often struggle with outdated or generic responses, modern advancements have opened the door to more dynamic and intelligent systems.

This article is for business leaders, developers, and AI enthusiasts looking to implement smarter chatbot solutions. It explores Retrieval-Augmented Generation (RAG), a game-changing approach that enhances chatbot performance by retrieving relevant information in real time. By the end, you’ll not only understand why RAG is so powerful but also how to implement it effectively to build scalable, cost-efficient, and context-aware chatbots using Google Cloud Platform Services.

The Challenge of Deploying Chatbots

Chatbots are widely used in businesses to automate customer support and streamline internal operations. While large language models (LLMs) have improved chatbot capabilities by generating human-like responses, they come with significant challenges. LLMs require costly fine-tuning, extensive resource usage, and ongoing maintenance, making them impractical for many companies. Additionally, fine-tuned models quickly become outdated, requiring frequent retraining to stay relevant.

An alternative approach is Retrieval-Augmented Generation, which dynamically retrieves relevant information in real time, rather than relying solely on pre-trained data. This allows chatbots to stay up-to-date, reduce costs, and improve accuracy, making RAG a powerful solution for businesses looking for intelligent and scalable AI-driven chatbots without the high expenses of fine-tuning or pre-training.

RAG Architecture

RAG is revolutionizing the way chatbots interact and provide information. But how does this powerful architecture actually work? The magic lies in its two distinct phases: The Preparation Phase, where the knowledge base is built, and the Retrieval Phase, where user queries are processed and relevant information is retrieved to generate a response. Each phase plays a crucial role in creating an intelligent and responsive system. Let’s explore how RAG works in detail, starting with the preparation phase.

Preparation Phase

RAG Chatbot Architecture General Preparation Phase

Data Selection: The preparation begins by selecting the appropriate data for the chatbot’s knowledge base. This typically involves choosing organized sources like databases, which contain the structured information necessary for the chatbot to function effectively.

Data Preprocessing: To prepare the raw data for optimal use in a RAG system, preprocessing is essential. This stage serves three primary purposes: enabling efficient retrieval, ensuring compatibility with the generative AI model, and optimizing for effective embedding generation. Efficient retrieval hinges on assigning unique identifiers, such as IDs, to each data element. This allows the system to quickly locate and access specific pieces of information. To enhance both the generative AI model’s understanding and the quality of embeddings, techniques such as flattening tables, tokenization, text cleaning, stemming, lemmatization, stop word removal, and data type conversion can be employed. These techniques refine the data’s structure and content, making it easier for the generative AI model to process and understand, while also optimizing the data for capturing semantic meaning and relationships in the embeddings.

Embedding Generation: With the data now preprocessed and refined, it’s ready to be transformed into a format suitable for efficient retrieval. This is where embedding generation comes in. This step involves converting the preprocessed data into numerical vectors known as embeddings using a specifically trained embedding model. The embeddings capture the semantic meaning of the data, encoding relationships between words, concepts, and ideas, which enables efficient retrieval of relevant information. The specific embedding model used will depend on the nature of the data and the requirements of the RAG system.

What is an Embedding?

In natural language processing, an embedding is a numerical representation of a word, phrase, or document that captures its meaning. These numerical vectors, which can range from a few hundred to many thousands of dimensions, are designed so that words or documents with similar meanings have embeddings that are close together in the vector space.

For example, in an embedding space, the word “cat” would be located much closer to the word “lion” (another feline) than to the word “car” (an unrelated object). Conversely, even if two words were nearly identical in spelling but had vastly different meanings, their embeddings would be far apart. This spatial arrangement reflects the semantic relationships between the words.

This allows AI systems to understand relationships between words and concepts, enabling them to perform tasks such as identifying similar documents. Embeddings are a fundamental building block for many AI applications, including RAG.

Saving the Embeddings in a Vector Database: The final step in the preparation phase is to store the generated embeddings in a specialized database designed for handling vector data. This is crucial because traditional databases aren’t optimized for storing or searching high-dimensional vectors. Saving the embeddings in a vector database ensures that the RAG system can quickly pinpoint the most relevant information in the knowledge base when responding to user queries.


Retrieval Phase

With the knowledge base now prepared and ready for efficient retrieval, let’s explore how the RAG system interacts with users and generates responses in real-time.

Chatbot RAG Retrieval Phase
  1. User Interaction: The process begins with a user interacting with the chatbot through a user interface. This interaction could take place in a chat window or on any other platform that allows users to input queries.
  2. Embedding Generation: The user query is then converted into an embedding using the same embedding model that was used to process the knowledge base during the preparation phase. Consistency is vital because different embedding models might produce different numerical representations, hindering the similarity search process.
  3. Similarity Search and Retrieval: The query’s embedding is used to perform a similarity search within the same vector database that contains the knowledge base embeddings. This search identifies the most semantically similar embeddings, returning their corresponding IDs.
  4. Retrieve relevant Information: The IDs retrieved from the similarity search are then used to query the preprocessed data from the knowledge base. This lookup retrieves the corresponding documents in their preprocessed text form, providing the generative AI model with the necessary context for generating a response.
  5. Generative AI: The retrieved information, along with the original user query, is passed to a generative AI model. This model uses its knowledge to generate a relevant response to the user’s query. The generative process allows RAG systems to provide answers that are more than just basic information retrieval.
  6. Delivering the Response: Finally, the generated response is delivered back to the user through the user interface. The response could be in the form of text displayed in a chat window or a spoken response from a voice assistant.

Implementing a RAG-Based Chatbot: A Practical Example

Now that we’ve explored the underlying architecture of RAG, let’s see how you can implement a RAG-based chatbot using services from the Google Cloud Platform. This example focuses on building an internal knowledge chatbot that can answer employee questions based on internal documents from a knowledge platform.

Data Preparation

The foundation of any effective RAG system is well-prepared data. This initial phase focuses on transforming your raw knowledge into a searchable format, ensuring documents are ready for efficient retrieval.

Chatbot RAG Data Preparation in Google
  • Data Selection and Preprocessing: First, you’ll need to gather the relevant internal documents, preprocess them and save the data in a data platform like Google Cloud Storage.
  • Embedding Generation: Once the preprocessed data is securely stored, Google Cloud Vertex AI is utilized to transform this data. Vertex AI extracts the textual content and converts it into high-dimensional vector embeddings, creating numerical representations that capture the semantic meaning of your documents. You can choose from various pre-trained embedding models or fine-tune your own model based on your specific needs.
  • Vector Database: With the embeddings successfully generated, Google Cloud Vertex AI Vector Search then serves as the vector database. It efficiently stores these high-dimensional embeddings and is designed to enable rapid and accurate similarity searches during the retrieval phase.

Retrieval Architecture

This section outlines the Retrieval-Augmented Generation architecture of our Google Chat chatbot. It describes the flow of a user’s query through various Google Cloud Platform services, from initial submission to the delivery of a generated response, creating a robust and intelligent conversational experience.

RAG Pipeline
  1. User Interaction: The user begins by asking a question within Google Chat. Google Chat serves as the user interface, directing the query to the Conversational Agents platform.
  2. Request Orchestration: Conversational Agents receives the user’s input. It then triggers a webhook call, sending the user’s query to our RAG orchestrator service, which is hosted on Google Cloud Run.
  3. User Query Embedding: From this point, the Google Cloud Run service takes over as the central orchestrator of the RAG pipeline. It starts by processing the user’s query, generating a high-dimensional vector embedding of it using the same embedding model that was utilized during the data preparation.
  4. Information Discovery: The generated user query embedding is then transmitted from Cloud Run to Vertex AI Vector Search. Vertex AI performs an efficient similarity search, identifying and returning the ID(s) of semantically similar documents from the knowledge base based on the provided embedding.
  5. Relevant Information Retrieval: With the identified document ID(s) obtained from Vertex AI Vector Search (these IDs precisely match our preprocessed documents that are most relevant to the user’s query), the Cloud Run service accesses Google Cloud Storage. From Cloud Storage, the full text content of these matched knowledge base pages is fetched. These documents provide the essential context required for accurate response generation.
  6. Generative AI Response: The Cloud Run service then combines the original user query with the retrieved document content into a carefully constructed prompt. This comprehensive prompt is sent to the Gemini API, leveraging Google’s large language model to generate a natural language answer.
  7. Response Delivery: The generated answer from the Gemini API is received by the Cloud Run service. The Cloud Run service formats this response and sends it back to Conversational Agents.
  8. Final User Delivery: Conversational Agents receives the generated answer and seamlessly relays it back to the user within Google Chat, completing the interaction.

This example demonstrates how you can leverage GCP services to implement a RAG-based chatbot for internal knowledge sharing. By combining data preparation, embedding generation, similarity search, and generative AI, you can create a powerful tool that empowers employees with quick and easy access to relevant information.

– Tim Voßmerbäumer (Scalefree)

Data Vault Link Temporality

Link Temporality: Handling Source Data Errors with Effectivity Satellites

In modern data warehousing, ensuring accurate historical records is paramount. The Data Vault methodology excels at capturing raw, unfiltered data changes over time. But what happens when your source system makes errors—linking an entity to the wrong counterpart—and then corrects them? Without the right approach, your Link tables can become confusing, making it hard to identify the true “current” relationship. This article explores an elegant solution: using an Effectivity Satellite to manage link temporality and error correction in your Data Vault.



The Problem: One-to-One Relationship with Source Data Fluctuations

Imagine two hubs in your Data Vault: Hub A and Hub B. A business rule dictates that each A-entity can be linked to exactly one B-entity at a time. Your Link table models these connections. The typical workflow is:

  1. Day 1: Source links A1 → B1 → you load this into your Link.
  2. Day 2: Source mistakenly links A1 → B2 → you load the new link.
  3. Day 3: Source corrects back to A1 → B1 → how do you capture this as the current, up-to-date link?

Since the Link table only records distinct relationships and ignores duplicates, reloading A1 → B1 on Day 3 won’t insert a new row or update any timestamp. You lose clarity on which relationship is active today.

Why Not Tweak the Link Table Directly?

You might be tempted to add LOAD_END_DATE or an “active” flag directly to the Link table to mark when a relationship becomes obsolete. However, this violates Data Vault best practices. The Link should remain a pure, append-only record of every relationship ever observed, without status flags or end dates. Instead, you delegate temporality to a dedicated satellite.

Introducing the Effectivity Satellite

An Effectivity Satellite sits alongside your Link and records the lifespan of each relationship. Its core columns include:

  • Link Hash Key: foreign key back to your Link record
  • LOAD_DATE_TIMESTAMP: when the link was detected as active (on first load or reactivation)
  • DELETE_DATE_TIMESTAMP: when the link was deactivated (or a far-future “end of time” for active rows)

This design cleanly separates the static relationship definition (Link) from its dynamic, time-dependent status (Effectivity Satellite).
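
As a minimal sketch (the table names are illustrative; the column names follow the description above), the Link and its Effectivity Satellite could be defined like this:

CREATE TABLE l_a_b (
    link_hash_key    CHAR(32)   NOT NULL PRIMARY KEY,   -- hash of the A and B business keys
    hub_a_hash_key   CHAR(32)   NOT NULL,
    hub_b_hash_key   CHAR(32)   NOT NULL,
    load_date        DATETIME   NOT NULL,
    record_source    VARCHAR(50)
);

CREATE TABLE eff_sat_l_a_b (
    link_hash_key           CHAR(32)   NOT NULL,        -- FK back to the Link
    load_date_timestamp     DATETIME   NOT NULL,        -- when this status row was recorded
    delete_date_timestamp   DATETIME   NOT NULL,        -- '8888-12-31' while the link is active
    record_source           VARCHAR(50),
    PRIMARY KEY (link_hash_key, load_date_timestamp),
    FOREIGN KEY (link_hash_key) REFERENCES l_a_b (link_hash_key)
);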

Step-by-Step: Tracking Link Changes

Day 1: Initial Relationship

Link: A1–B1  
Effectivity Satellite:  
LOAD_DATE = D1  
DELETE_DATE = 8888-12-31

We load A1→B1 and mark it active by setting its DELETE_DATE to the end of all time.

Day 2: Erroneous Change

Link: add A1–B2  
Effectivity Satellite updates:  
– For A1–B1: DELETE_DATE = D2 (deactivated)  
– For A1–B2: LOAD_DATE = D2, DELETE_DATE = 8888-12-31

The old relationship is soft-deleted, and the new one is inserted and marked active.

Day 3: Correction Back to Original

Effectivity Satellite updates:  
– For A1–B2: DELETE_DATE = D3  
– For A1–B1: new row (reactivation): LOAD_DATE = D3, DELETE_DATE = 8888-12-31

Instead of touching the Link, we simply record two new deltas: ending B2 and re-activating B1. Querying the satellite for the active row (where DELETE_DATE = 8888-12-31) reveals the current link.

Loading Patterns: Full Loads vs. CDC vs. Incrementals

Your data delivery method influences how you detect deletions:

  • Full Loads: Compare all active links in the satellite against staging; any missing link implies a deletion. Insert a delta to end-date it.
  • Change Data Capture (CDC): Leverage the source’s delete events and timestamps as your DELETE_DATE_TIMESTAMP.
  • Incremental without Deletes: Combine staging deltas (inserts/updates) with a lightweight full load of just business keys. Missing keys signal deletions.

In all cases, the satellite becomes the single source of truth for link effectivity.
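
As an illustrative sketch of the full-load case (using the assumed table names from the sketch above plus a hypothetical staging table), links that are missing from the current delivery can be end-dated like this:

-- Deactivate links that no longer appear in the staged full delivery.
UPDATE eff_sat_l_a_b
SET    delete_date_timestamp = CURRENT_TIMESTAMP          -- or the delivery's load date
WHERE  delete_date_timestamp = '8888-12-31'               -- currently active rows only
  AND  link_hash_key NOT IN (
         SELECT s.link_hash_key
         FROM   stg_a_b_links s                           -- staged links of this delivery
       );

Note that this follows the update-in-place reading of the example above; insert-only implementations would instead record a new delta row that carries the deletion date.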

Handling Unreliable Source Deliveries with “Last Seen”

Sometimes, your source export may inadvertently drop rows (e.g., locked mainframe records). To avoid false deletions, maintain a Last Seen Date on your effectivity satellite. If a link hasn’t been seen for a configurable “grace period” (e.g., three weeks), a business rule in your Business Vault marks it deleted. This approach balances accuracy against source system quirks.

Querying Current Relationships

To retrieve only active links at any point:

SELECT L.*, S.LOAD_DATE_TIMESTAMP  
FROM Link L  
JOIN Effectivity_Sat S  
ON L.Link_HashKey = S.Link_HashKey  
WHERE S.DELETE_DATE_TIMESTAMP = '8888-12-31';

This simple filter returns the true, live relationships, abstracting away all historical noise and source-system corrections.

Benefits of the Satellite Approach

  • Auditability: Full history of when links were activated and deactivated.
  • Purity: Link tables remain simple, append-only, and free of flags/end dates.
  • Flexibility: Supports full loads, CDC, and incremental patterns seamlessly.
  • Business Rules: “Last seen” logic can live in Business Vault without polluting raw Data Vault layers.

Conclusion

Handling source data errors—especially when relationships ping-pong between states—requires a robust temporal strategy. By delegating link effectivity to a dedicated satellite, you maintain a clean Link table, capture every change, and easily identify the current relationship. Whether you’re dealing with full loads, CDC, or flaky source exports, this pattern scales, remains auditable, and adheres to Data Vault best practices. Implement effectivity satellites in your Data Vault to master link temporality and build a more resilient, transparent data platform.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Data Vault Links With Just One Hub Reference

Single-Hub Links

In Data Vault modeling, links play a central role in representing relationships between business keys stored in hubs. By design, most links connect two or more hubs, capturing many-to-many relationships or associations. But what happens when an event or transaction involves only a single business key? Can you still use a link structure—and if so, which type? In this article, we’ll explore the concept of non-historized links with a single hub reference, compare alternatives, and outline best practices for real-time event modeling.



Overview of Data Vault Components

Before diving into one-hub links, let’s briefly review the core building blocks of a Data Vault model:

  • Hubs: Store unique, immutable business keys (e.g., customer IDs, order numbers).
  • Links: Represent relationships or associations between two or more hubs.
  • Satellites: Hold descriptive attributes and contextual history for hubs and links.

Together, these building blocks ensure agility, auditability, and scalability: hubs guarantee uniqueness, links model relationships, and satellites track changes over time.

Traditional Links and Their Purpose

Most Data Vault implementations utilize links to tie together business keys from multiple hubs. Common scenarios include:

  • Customer–Order relationships (customer purchases multiple orders).
  • Order–Product line items (each order can contain multiple products).
  • Employee–Department assignments.

These historized links capture the evolution of relationships over time, recording load dates and allowing queries that include past associations. In contrast, non-historized links focus on events at a single point in time.

Defining Non-Historized Links

A non-historized link (sometimes called an “event link” or “transaction link”) stores relationships for a single event or message without maintaining full historical context. Instead of recording every change, it captures a snapshot of an event at its arrival:

  • Load timestamp identifies when the event occurred or was ingested.
  • Hub references list one or more business keys involved in the event.
  • Non-historized Satellites may attach descriptive details, but typically without tracking attribute history.

This design is ideal for real-time message processing, streaming data, or systems where full history is not required for each event.

When Only One Business Key Is Involved

While many events involve multiple business keys—such as an order linking to both customer and product—some transactions or messages involve just one key. Examples include:

  • A single-customer ping or heartbeat event in an IoT system.
  • A retail message capturing stock-level change for one product.
  • An alert triggered by a lone account reaching a threshold.

In these cases, you might wonder if a link structure still makes sense when there’s only one hub reference. The answer is yes: you can implement a non-historized link that references a single hub key to represent that event.

Advantages of Single-Hub Links

Opting for a non-historized link with one hub reference brings several benefits:

  • Consistency: Sticks to the Data Vault pattern of links for events, avoiding mixed designs.
  • Scalability: Scales out to handle high volumes of incoming messages without heavy historical tracking.
  • Clarity: Clearly separates transactional/event data from descriptive satellites and core business keys.
  • Query Simplicity: Enables straightforward point-in-time queries of events linked to the relevant hub.

Alternative: Multi-Active Satellites

Another design might involve a multi-active satellite on the hub itself, capturing different event types or message variants keyed by a load timestamp or event type. However:

  • Multi-active satellites are designed to capture multiple concurrent “active” roles or statuses rather than transient events.
  • The lack of a dedicated link table can blur semantic distinctions between relationships and descriptive attributes.
  • Query performance and partitioning strategies may suffer when trying to manage high-frequency event data in a satellite.

Therefore, for discrete, passing-through events, a non-historized link generally outperforms a multi-active satellite approach.

Designing Your Single-Hub Non-Historized Link

When modeling a non-historized link that references only one hub, follow these guidelines:

  1. Link Table Structure: Include a surrogate primary key, load timestamp, and the single hub’s surrogate key.
  2. Foreign Key Constraint: Enforce referential integrity back to the hub, ensuring the business key exists.
  3. Descriptive Satellites: If extra attributes are needed (e.g., event payload details), create a non-historized satellite keyed to the link.
  4. Partitioning Strategy: Partition by load date for efficient querying and archiving of stale event data.
  5. Retention Policy: Define sliding windows or archival processes for old events if storage growth is a concern.

Here’s an example DDL snippet for reference:


CREATE TABLE l_event_single_hub (
    l_event_id        BIGINT       IDENTITY PRIMARY KEY,
    hub_key_id        BIGINT       NOT NULL,
    load_date         DATETIME     NOT NULL,
    -- optional metadata columns
    source_system     VARCHAR(50),
    record_hash       CHAR(32),
    CONSTRAINT fk_l_event_hub
        FOREIGN KEY (hub_key_id)
        REFERENCES h_hub_entity (hub_key_id)
);
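
A usage sketch (again with illustrative names; the hub's business key column is an assumption) for pulling one entity's recent events through the single-hub link could look like this:

SELECT h.hub_business_key,
       l.load_date,
       l.source_system
FROM   h_hub_entity h
JOIN   l_event_single_hub l
  ON   l.hub_key_id = h.hub_key_id
WHERE  h.hub_business_key = 'DEVICE_4711'                  -- illustrative key
  AND  l.load_date >= DATEADD(day, -7, CURRENT_TIMESTAMP); -- last seven days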

Use Case Scenarios

Organizations across industries leverage single-hub links for:

  • Banking: Recording individual account balance snapshot events.
  • Retail: Capturing stock level messages for each product unit.
  • IoT: Ingesting single-device telemetry pings.
  • Telecommunications: Logging individual phone number status changes (e.g., activated/deactivated).

In each scenario, the event is tied to one core business key, and history is either ephemeral or summarized elsewhere.

Best Practices and Considerations

When implementing single-hub non-historized links, consider the following:

  • Event Granularity: Define clear semantics—what constitutes one event, and how often will it be ingested?
  • Surrogate Keys: Always use surrogate keys for hubs and links to maintain consistency.
  • Hashing Strategy: Compute a record hash if you need idempotency or change detection on message payloads.
  • Load Performance: Optimize bulk or streaming loads with batching and minimal indexes on the link table.
  • Retention and Archival: Archive stale events into cheaper storage or summarize them into aggregate tables.

By following these practices, you’ll ensure a robust, maintainable design that adheres to Data Vault principles.

Conclusion

While it might seem counter-intuitive to create a link with only one hub reference, non-historized links with a single business key are both feasible and, in many real-time event scenarios, preferable to alternative designs. They preserve the semantic clarity of link tables, ensure data integrity, and scale efficiently for high-volume event streams. When events involve only one business key, reach for a one-hub non-historized link rather than shoehorning events into satellites or hybrid structures.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Dealing with Corrupted Loads in Data Vault

Corrupted Loads in Data Vault

One of the foundational assumptions in Data Vault modeling is that business keys must be unique. This rule underpins how we model Hubs, Links, and Satellites. But what happens when your data doesn’t play by the rules? Specifically, what should you do when your data delivery contains multiple rows with the same business key—a situation that violates the core principles of your Raw Data Vault model?

In this article, we’ll explore practical strategies for managing corrupted data in Data Vault pipelines, focusing on maintaining auditability, consistency, and data integrity—even when upstream data delivery is flawed. We’ll also look at what to do when your business key assumptions no longer hold true.



Understanding the Problem: Duplicate Business Keys

Let’s start with the assumption that your Raw Data Vault is modeled around unique business keys. You’ve built Hubs, split Satellites, and established Links based on the expectation that a business key like customer_id uniquely identifies a customer.

Now, you receive a new delivery from your source system. Unexpectedly, it contains multiple rows with the same business key. This isn’t just a data quality issue—it fundamentally breaks your model. The typical loading process can no longer proceed cleanly, and worse, you risk contaminating your data warehouse with incorrect records.

Why You Can’t Ignore Corrupted Loads

It’s tempting to just skip the bad file or fix it manually. But in a proper Data Vault setup—particularly one that adheres to full auditability and compliance standards—this isn’t acceptable. You must be able to fully reconstruct each data delivery, even if it’s flawed. Every decision—whether to reject or load—must be trackable and justifiable.

Step 1: Capture Everything in a Data Lake

Today, many modern architectures use a data lake or Persistent Staging Area (PSA) as the first layer of data capture. This becomes your insurance policy. All incoming data—valid or corrupted—is ingested and stored here as-is, giving you a perfect record of what was delivered and when.

This approach also ensures your Raw Data Vault can skip flawed deliveries without data loss. By storing the original files in the data lake, you preserve the full delivery for later inspection, validation, or correction without halting the loading process entirely.

Step 2: Define Automated Data Quality Checks

Before data is loaded into the Raw Data Vault, it must pass validation. You can implement quality checks like:

  • Is the business key unique across the delivery?
  • Are column data types and lengths as expected?
  • Are required fields populated?

If any of these checks fail, the entire file should be rejected—not just individual records. Why? Because partial loads introduce ambiguity and audit challenges. Instead, flag the file as failed and notify the data provider to investigate and resend a corrected version.
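
For example, the uniqueness check could be as simple as the following query; a non-empty result means the check failed and the whole file should be rejected (the staging table and key column are illustrative assumptions):

SELECT customer_id,
       COUNT(*) AS row_count
FROM   stg_customer_delivery
GROUP  BY customer_id
HAVING COUNT(*) > 1;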

Step 3: Track Rejections and Version Control Your Checks

You must keep detailed logs of every load attempt. This includes:

  • Which file was loaded or rejected
  • Which checks were applied
  • Which check failed and why
  • The version of the validation rule used

This ensures complete traceability. You can prove not just what was accepted, but also what was rejected and for what reason. This is crucial for regulatory compliance, audits, and operational transparency.

What If There’s No Data Lake?

In some cases, you may not have a data lake. You might be working with a transient relational staging area before the Raw Data Vault. Even then, you should still store failed deliveries. A separate location or table can be used to store the raw files that failed validation. Again, auditability is key—just because data isn’t valid doesn’t mean it can disappear.

When the Business Key Assumption Breaks

Sometimes, you dig deeper and realize that your assumption about the business key was flawed. Maybe you thought customer_id was unique, but the source system allows multiple entries per ID for different contexts. Now what?

This is where things get more complex. You need to refactor your model. Specifically, you must modify the Hub and possibly extend the business key by combining it with another column (e.g., customer_id + region) to enforce uniqueness.

Why You Must Refactor, Not Hack

Some might be tempted to patch the issue using a record source tracking Satellite or other technical workaround. But this introduces long-term maintenance and performance issues. Worse, it hides the real business reality behind a technical trick.

Instead, treat the business key as the central anchor of your model. If it changes, it impacts:

  • The Hub structure
  • All related Satellites
  • Any Links pointing to the Hub

Yes, it’s a big change. But it’s limited to a specific portion of your model and keeps your architecture clean and reliable.

What About Descriptive Data Errors?

If the corrupted data only affects descriptive attributes and not the business key, the fix is simpler. You can ingest a correction load directly into the Satellites with a backdated load date—just after the original bad load. Then, rebuild your PIT (Point-In-Time) tables. This resolves the issue for downstream consumption without any need to refactor Hubs or Links.
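
As a sketch of such a correction load (table names, columns, and the timestamp are illustrative assumptions), the corrected attributes are inserted with a load date just after the faulty one:

-- Insert corrected descriptive rows, backdated to just after the bad load.
INSERT INTO sat_customer (hub_customer_hash_key, load_date, record_source,
                          first_name, last_name, hash_diff)
SELECT c.hub_customer_hash_key,
       '2024-03-02 00:00:01',            -- illustrative: one second after the bad load date
       'CORRECTION',
       c.first_name,
       c.last_name,
       c.hash_diff
FROM   stg_customer_correction c;

Afterwards, the affected PIT tables are rebuilt so downstream queries pick up the corrected rows.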

Final Thoughts: Build Resilience Into Your Pipeline

Corrupted data is not an exception—it’s an eventuality. Whether it’s duplicate business keys, incorrect formats, or structural changes in the source system, your data warehouse must be prepared. The best defenses are:

  • A reliable data lake or staging layer to capture raw deliveries
  • Automated validation and full-file rejection logic
  • Detailed auditing and version control on checks
  • Clear communication with source system owners
  • Willingness to refactor models when business reality shifts

Following these principles ensures your Data Vault model remains robust, scalable, and trustworthy—even in the face of corrupted loads.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Leveraging the Coalesce API: A Practical Guide for Data Engineers

About the Coalesce API

In today’s fast-paced data-driven world, automation, integration, and scalability are crucial for modern data engineering. The Coalesce API empowers developers, analysts, and engineers to streamline their workflows, integrate with external tools, and build robust data pipelines. Whether you’re migrating data, monitoring runs, or embedding data tasks into your existing scheduling systems, the Coalesce API offers the flexibility and power you need.

This guide will walk you through the Coalesce API’s key features, show you how to get started, and explore real-world use cases that can elevate your data operations.



API Features

The Coalesce API is structured into two primary segments: the Coalesce API itself and the Run API. Each provides a specific set of endpoints designed to help you interact programmatically with the coalesce.io platform.

Coalesce API Endpoints

  • Get / List Environments: Fetch available environments where your coalesce.io projects live.
  • List / Get / Create / Set Nodes: Manage your data transformation nodes—essential building blocks of any pipeline.
  • List / Get Runs: Retrieve historical or current run information for traceability and auditing.
  • List Run Results: Analyze outputs and diagnostics of your executed runs.

Run API Endpoints

  • Start / Stop / Retry Run: Full control over triggering, halting, or retrying your pipeline executions.
  • Check Live Run Status: Monitor real-time status of ongoing processes.

These features provide a comprehensive toolkit for orchestrating and managing your Coalesce-powered data architecture.

Using the API

One of the strengths of the Coalesce API is its accessibility across a wide range of tools and platforms. Here’s how you can explore and interact with the API in your development environment:

  • API Explorer: Use the built-in API Explorer for hands-on experimentation and learning.
  • Postman Collection: Easily import the Coalesce API into Postman to structure and test API calls efficiently.
  • Insomnia: Another popular REST client for interacting with coalesce.io endpoints with ease.
  • Command Line: cURL and other CLI tools allow direct HTTP requests for automation and scripting.
  • Azure Data Factory: Seamlessly integrate coalesce.io into your Azure-based ETL pipelines.
  • Any API-compatible platform: Virtually any system that can make HTTP requests can work with coalesce.io.

Whether you’re a seasoned developer or just getting started with APIs, Coalesce’s compatibility makes it a flexible choice for various setups.

Real-World Use Cases

Now that you know what the API offers, let’s look at some practical scenarios where it can deliver significant value:

  • Migration Projects: Automate and validate data migration workflows by triggering and monitoring coalesce.io runs through the API.
  • Monitoring of Runs: Build dashboards or alerting systems using live run status and result endpoints.
  • External Scheduler Integration: Integrate with orchestration tools like Apache Airflow, Prefect, or Dagster to manage your coalesce.io executions.
  • Tool Synchronization: Keep multiple tools in sync by triggering workflows or pushing outputs via API commands.
  • And More: The flexible design means you can build custom solutions tailored to your organization’s specific needs.

The API is your gateway to turning coalesce.io into a true component of your larger data ecosystem.

How to Get Started

Getting up and running with the Coalesce API is straightforward. Here are the initial steps you need to take:

Base URL

The API’s base URL depends on your coalesce.io instance region. A common URL looks like this:
https://app.coalescesoftware.io

Bearer Token

For authentication, you’ll need a Bearer Token. You can create this securely within the coalesce.io platform under the Deploy section.

Environment ID (Optional)

Some endpoints require an Environment ID. If you're unsure which ID to use, you can list all environments via the API (a sketch follows below); whether you need it at all depends on the endpoints you call.

Workspace ID

This is critical for API calls involving workspace-specific data. You can find your Workspace ID in the coalesce.io interface under Build Settings.

Once you have these items, you’re ready to begin sending requests and building out your automation workflows.
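As a first smoke test, the following Python sketch uses the base URL and Bearer Token to list the available environments (useful when you still need an Environment ID). The endpoint path and response structure are assumptions; verify them in the API Explorer for your instance.

```python
import os
import requests

BASE_URL = "https://app.coalescesoftware.io"      # region-specific base URL
TOKEN = os.environ["COALESCE_API_TOKEN"]          # Bearer token created under Deploy

# List all environments -- handy when you still need to look up an Environment ID.
# The endpoint path and the shape of the JSON response are assumptions; check the API Explorer.
response = requests.get(
    f"{BASE_URL}/api/v1/environments",
    headers={"Authorization": f"Bearer {TOKEN}", "Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

for environment in response.json().get("data", []):
    print(environment.get("id"), environment.get("name"))
```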

Conclusion

The Coalesce API opens a world of possibilities for enhancing your data workflows. With comprehensive functionality, real-time interaction, and seamless integration options, it’s an essential tool for any team looking to operationalize their data stack efficiently.

Start small—experiment with API Explorer or Postman—and gradually integrate Coalesce API calls into your ETL processes, monitoring tools, and data orchestration pipelines. The flexibility and control you gain will be well worth the investment.

Watch the Video

Meet the Speaker


Tim Kirschke
Senior Consultant

Tim has a Bachelor’s degree in Applied Mathematics and has been working as a BI consultant for Scalefree since the beginning of 2021. He’s an expert in the design and implementation of BI solutions, with a focus on the Data Vault 2.0 methodology. His main areas of expertise are dbt, Coalesce, and BigQuery.

Defining the Error Mart in Data Vault

Defining the Error Mart

When working with data platforms that follow the Data Vault methodology, one often hears about components like the Raw Vault, Business Vault, and Information Marts. But among these well-known layers is a lesser-discussed yet critical structure: the Error Mart.

In this blog post, we take a comprehensive look at what an Error Mart is, what its main objectives are, and the best practices for designing one. This insight is based on an informative session led by Michael Olschimke, CEO of Scalefree, during a recent Data Vault Friday.



What is an Error Mart?

In traditional data warehousing approaches like Kimball, an Error Mart is used to store metrics about errors—for example, how many ETL jobs failed or which tables didn’t load successfully. These are primarily KPIs used for monitoring and are typically stored in what’s known as a Metrics Mart.

However, in the context of Data Vault 2.0, the Error Mart has a different, more tactical role: it acts as a catch-all for rejected records that fail to load during any of the staging or integration processes.

This could be due to a mismatch in expected data types, missing columns, or unexpected structural changes in the source data. These issues most frequently arise during:

  • Initial data ingestion from files, APIs, or real-time feeds
  • Loading data into the staging area or raw Data Vault
  • Applying hard rules based on schema assumptions

The Main Goal of an Error Mart

The primary goal of the Error Mart is to ensure that all incoming data—the good, the bad, and the ugly—is captured and traceable, even if it can’t immediately be loaded into the intended layer (such as the Raw Vault).

It’s a technical safety net that provides:

  • A secure location for rejected records
  • The ability to analyze and correct issues manually
  • A reprocessing workflow that ensures full data capture

The Error Mart is not meant for business logic errors (e.g., someone underage purchasing a product); rather, it handles technical discrepancies that prevent data from moving through the pipeline.

How Is It Structured?

Traditionally, one might think of creating multiple error tables to match each data model. However, Michael Olschimke recommends a single flexible structure—a table that stores rejected records as JSON strings. This allows you to capture various unexpected formats without predefined schemas.

Each record should be accompanied by key metadata:

  • Load date – Timestamp of ingestion
  • Record source – Source system or interface
  • Process identifier – The job or transformation that failed

This setup ensures that every error is auditable, traceable, and eventually resolvable.
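To make this concrete, here is a minimal sketch of such a table, expressed as PySpark/Delta DDL under the assumption of a Spark-based platform. The table name, the error_message column, and the reprocessed_flag status column are illustrative additions, not a fixed standard.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One flexible error table for all rejected records; names and types are illustrative.
spark.sql("""
    CREATE TABLE IF NOT EXISTS error_mart.rejected_records (
        load_date        TIMESTAMP,  -- timestamp of ingestion
        record_source    STRING,     -- source system or interface
        process_id       STRING,     -- job or transformation that failed
        error_message    STRING,     -- technical reason for the rejection (illustrative extra)
        raw_record       STRING,     -- the rejected record, stored as a JSON string
        reprocessed_flag BOOLEAN     -- set to false on insert; used later by the reprocessing workflow
    )
    USING DELTA
""")
```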

Best Practices for Designing an Error Mart

Here are some key considerations when building your Error Mart:

1. Flexibility in Structure

Since rejected data often doesn’t conform to expected schemas, use a structure that can handle variability. A single table using JSON or Parquet formats offers great flexibility, especially when stored in a data lake.

2. Avoid Over-Engineering

There’s no need to create one table per error type. One well-documented and meta-tagged table is usually sufficient.

3. Logging and Auditing

Implement a companion log table or file to track which records have been reprocessed. Instead of deleting processed error records, use a status flag or separate tracking log to preserve data lineage and maintain transparency.

4. Trigger Monitoring and Alerts

Your system should monitor the Error Mart for unprocessed records. Set up alerts via email, log monitoring tools like CloudWatch or Graylog, or build dashboards that notify the data team when action is required.

5. Make It the Data Team’s Responsibility

A critical mindset shift: processing records in the Error Mart is not a business responsibility—it’s yours as the data engineering team. Do not offload this to end users.

6. Reprocessing Workflow

Once the technical root cause is identified (e.g., an overly strict field length), update the hard rules, reload the rejected data from the Error Mart into the target layer, and mark it as processed in your log.
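Continuing the illustrative table from above, a reprocessing job could look like the following PySpark sketch. The target table, the expected schema, and all names are hypothetical, and a production job would restrict the final status update to the records it actually reloaded.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Pick up rejected records that have not been reprocessed yet.
errors = spark.table("error_mart.rejected_records").filter(F.col("reprocessed_flag") == False)

# 2. Parse the JSON payload with the (now corrected) expected schema.
target_schema = "profit_center_code STRING, profit_center_name STRING, region STRING"
parsed = (
    errors
    .select(
        F.col("load_date"),
        F.col("record_source"),
        F.from_json(F.col("raw_record"), target_schema).alias("payload"),
    )
    .select("load_date", "record_source", "payload.*")
)

# 3. Reload the corrected records into the target layer (here: a hypothetical staging table).
parsed.write.mode("append").saveAsTable("staging.profit_center")

# 4. Mark the records as processed instead of deleting them, preserving lineage.
#    A production job would restrict this to the records it actually reloaded.
spark.sql("""
    UPDATE error_mart.rejected_records
    SET reprocessed_flag = true
    WHERE reprocessed_flag = false
""")
```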

7. Error Mart in Every Layer

While most errors occur in the initial stages (staging, Raw Vault), you should prepare to capture errors at every layer—Business Vault and Information Mart included.

8. Binary Data Considerations

If your incoming data includes blob fields, you can mime-encode them and store them alongside the error JSON or separately in the data lake.
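For example, a small Python sketch like the one below Base64-encodes a binary payload so it can travel inside the rejected-record JSON; the field names are hypothetical.

```python
import base64
import json

def to_error_json(record: dict, blob: bytes) -> str:
    """Embed a binary field as a Base64 string inside the rejected-record JSON."""
    record["document_blob_b64"] = base64.b64encode(blob).decode("ascii")
    return json.dumps(record)

# Hypothetical rejected record with a binary attachment.
rejected = {"profit_center_code": "PROFCTR_001", "error": "unexpected column"}
print(to_error_json(rejected, b"\x89PNG...binary content..."))
```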

Why the Error Mart Matters in Data Vault Architecture

Data Vault is built on the premise of complete and auditable data capture. To meet this principle, you must have a strategy for handling unexpected or failed data loads. The Error Mart acts as that strategy.

It’s not just a dumping ground for bad records—it’s a crucial feedback mechanism that helps you refine your ingestion and transformation rules, ensuring every piece of data, no matter how ugly, makes it into the platform.

Without an Error Mart, you risk data loss, broken lineage, and ultimately, lower trust in your data platform.

Conclusion

In summary, the Error Mart is an essential part of a resilient Data Vault architecture. It gives your data team the tools to identify, correct, and reprocess problematic data while maintaining auditability and trustworthiness.

If you’re implementing a Data Vault, don’t treat the Error Mart as an afterthought. Design it with flexibility, transparency, and process integration in mind. And remember: it’s your job to make sure no record gets left behind.

Watch the Video

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Data Vault on Databricks: Does It Make Sense?

Data Vault and Medallion Architecture

In this article, we explore the practical considerations of implementing Data Vault on Databricks by analyzing the Databricks ecosystem and its alignment with Data Vault’s core principles. We will cover the fundamentals of Databricks’ architecture, its compatibility with Data Vault’s layered approach, and how some of Databricks’ features can be leveraged to simplify, optimize, or even replace certain traditional aspects of a Data Vault implementation.

This article aims to provide a strategic perspective on how Databricks can support Data Vault principles such as historization, scalability, auditability, and modular design. We’ll discuss opportunities, such as using Delta Lake for time travel and schema evolution, and challenges, like the performance trade-offs introduced by Data Vault’s high number of joins.

Bridging EDW and Lakehouse: Implementing Data Vault on Databricks

Join us in this webinar as we explore the process of implementing Data Vault on Databricks. We will go over different integration strategies and potential challenges, as well as technical aspects like data modeling, performance considerations, and data governance. Register for our free webinar, June 17th, 2025!

Watch Webinar Recording

Understanding Data Vault 2.0

Data Vault is traditionally defined as a methodology encompassing implementation practices, an architectural framework, and a data modeling approach for building a business intelligence system. However, this article focuses on the architectural and modeling aspects of Data Vault, as these are the most relevant topics for implementing Data Vault on Databricks.

The main advantages of adopting Data Vault’s architecture and modeling are:

  • Preservation of Historical Integrity and Auditability.
    • Insert-only historization
    • Reconstruction of data source deliveries
    • Simplified Governance and Compliance
  • Flexible and Scalable Architecture and Data Model
    • Modular Data Model (Hub & Spoke)
    • Scalable
    • Decoupling of Hard and Soft Business rules
    • Tool Agnosticism

The Databricks Ecosystem

Databricks is a leading platform for data analytics, offering a unified environment for data processing, machine learning, and collaborative data science. Its lakehouse architecture, built on Apache Spark and Delta Lake, combines the flexibility of data lakes with the structure and performance of data warehouses. This approach allows organizations to store all types of data while enabling efficient SQL-based analytics and AI/ML workloads.

For Data Vault implementation, Databricks can be a practical choice. Delta Lake’s ACID compliance and transaction logs ensure data integrity and enable Time Travel for historical analysis. As we will see next, features like Delta Live Tables and Unity Catalog optimize data ingestion, transformation, and governance, making Databricks a compelling platform for implementing Data Vault.

Databricks and Data Vault: Do they work together?

To assess the combination of Databricks and Data Vault, we need to analyze their common ground: architecture and data modeling. Both are designed to handle large data volumes and heavy processing workloads, and a successful integration relies on understanding how they can complement each other.

Architectural Compatibility

Databricks, built on Apache Spark and Delta Lake, follows the Medallion Architecture: a best-practice, three-layered approach (Bronze, Silver, Gold) for progressively structuring and refining data within a lakehouse environment. This aligns well with Data Vault’s multi-layered architecture (Staging, Raw Data Vault, Business Vault, Information Marts).

Image 1: Databricks’ Medallion architecture

Now looking at Data Vault’s architecture, we see that to some extent it is quite similar to what Databricks proposes: a multi-layer solution composed of a Staging layer, a Raw Data Vault and a Business Vault, followed by the domain-specific information marts. In the image below, we can see an example of a Data Vault architecture.

Image 2: Data Vault Architecture

Integrating Data Vault with the Medallion Architecture allows for a synergistic approach, as we can see in image 3.

Image 3: Data Vault and Medallion Architecture

The Bronze layer serves the same purpose as Data Vault’s Staging Area: raw data is ingested from the different sources and stored in a single place. From there, the Silver layer stores the Raw Data Vault, where source tables are split into hubs, links, and satellites. Here we can already make use of Databricks features such as schema enforcement for integrity, and Delta Live Tables with Spark SQL to keep loading processes consistent and automate quality checks (a minimal sketch follows below). The Business Vault, which derives additional business-relevant data structures, sits between the Silver and Gold layers, assisting with the information delivery process.
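To make that Silver-layer step concrete, here is a minimal Delta Live Tables sketch that derives a customer hub from a Bronze table, including a simple expectation as an automated quality check. The table and column names are assumptions for the example, and a production Data Vault loader would add more logic (for example, deduplication across loads and standardized hashing conventions).

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw Vault hub for customers, derived from the Bronze layer.")
@dlt.expect_or_drop("valid_business_key", "customer_bk IS NOT NULL")
def hub_customer():
    # Assumes a 'bronze_customer' dataset defined in the same Delta Live Tables pipeline.
    bronze = dlt.read("bronze_customer")
    return (
        bronze
        .select(
            F.sha2(F.col("customer_id").cast("string"), 256).alias("hk_customer"),  # hash key
            F.col("customer_id").alias("customer_bk"),                              # business key
            F.col("load_date"),
            F.col("record_source"),
        )
        .dropDuplicates(["hk_customer"])  # dedupe within this load; full hub loads need more logic
    )
```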

In the Business Vault, Databricks features such as Z-Ordering and data skipping can optimize performance by organizing data more efficiently. Additionally, Spark SQL can be used for the aggregations and transformations behind PIT and Bridge tables. Finally, in the Gold layer, we can create our Information Marts with Flat & Wide structures that improve performance when querying information out of the Vault.

Privacy and Security

Databricks’ data governance features included in Unity Catalog can optimize Data Vault implementations by simplifying security and privacy controls. Unity Catalog’s fine-grained access control and data masking capabilities can eliminate the need for satellite splits traditionally used to manage sensitive data. Additionally, the lakehouse architecture enables direct data querying, which facilitates compliance with GDPR and data privacy regulations, particularly for responding to data subject access requests (DSAR) and right-to-be-forgotten requests. These data governance features help simplify the Data Vault model and reduce the final number of tables in the Vault.
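As a sketch of what this can look like, the following example defines a Unity Catalog column mask so that only a privileged group sees clear-text values, removing one common reason for splitting sensitive attributes into a separate satellite. The function, table, column, and group names are hypothetical, and column masks require Unity Catalog.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Masking function: only members of a privileged group see the clear-text value.
spark.sql("""
    CREATE OR REPLACE FUNCTION governance.masks.mask_email(email STRING)
    RETURN CASE
        WHEN is_account_group_member('pii_readers') THEN email
        ELSE '***MASKED***'
    END
""")

# Attach the mask to a sensitive column instead of splitting it off into a separate satellite.
spark.sql("""
    ALTER TABLE raw_vault.sat_customer_details
    ALTER COLUMN email SET MASK governance.masks.mask_email
""")
```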

Historization

While both Data Vault and Databricks offer mechanisms for data historization, relying solely on Delta Lake’s Time Travel for historization in a Data Vault implementation on Databricks might not be the best choice. In Databricks, the VACUUM command can permanently delete older data files, potentially removing historical data needed for auditing, lineage analysis, and regulatory compliance. Hence, alternative historization methods should be considered, such as maintaining traditional historization through Data Vault’s insert-only modeling approach, or leveraging Databricks’ Change Data Feed to capture a stream of changes made to Delta Lake tables. This ensures a complete and auditable history, even if older data versions are removed by the VACUUM command.
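A minimal sketch of the Change Data Feed approach, assuming a hypothetical satellite table, looks like this; note that the feed only captures changes made after it has been enabled.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable the Change Data Feed so every insert/update/delete is captured as a row-level event.
spark.sql("""
    ALTER TABLE raw_vault.sat_customer_details
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read the captured changes from a given table version onwards.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("raw_vault.sat_customer_details")
)
changes.select("_change_type", "_commit_version", "_commit_timestamp").show()
```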

Performance Considerations

When implementing Data Vault on Databricks, performance optimization requires architectural considerations that account for the characteristics of both systems. The modular design of Data Vault can create numerous tables with complex join patterns, which can be challenging in Databricks’ Spark environment, since Delta Lake’s columnar Parquet files can struggle with extensive joins. To address this challenge, practitioners should minimize satellite splits (leveraging Databricks’ native security and privacy features instead), implement virtualization in the Business Vault through views, and utilize Point-in-Time and Bridge tables to precompute historical snapshots that reduce join complexity and help reach the target granularity.

For optimal performance, information marts should adopt Flat & Wide structures that prioritize query speed over storage efficiency (an acceptable trade-off given today’s relatively low storage costs). Additional performance gains can be achieved by strategically applying Delta Lake features like Z-Ordering and data skipping to enhance the information delivery process (see the sketch below). The decision between views and fully materialized information marts is also worth considering: while views reduce redundancy and simplify management, materialized marts with denormalized tables provide substantial performance benefits for complex reporting scenarios that would otherwise require resource-intensive joins across multiple Data Vault structures. A balanced approach combining views and materialized marts should be based on query complexity, data volume, and update frequency, ensuring that reporting and analytics workloads remain performant. This way, a Data Vault implementation on Databricks can maintain both the modeling flexibility of Data Vault and the performance capabilities of the Databricks platform.
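As an illustration of one of these levers, the following sketch applies Z-Ordering to a hypothetical large satellite so that joins and point-in-time lookups on the hash key and load date benefit from data skipping; OPTIMIZE with ZORDER BY requires Delta Lake on Databricks.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Co-locate rows with the same hash key and nearby load dates so that data skipping
# can prune files during the join-heavy queries behind PIT and Bridge tables.
spark.sql("""
    OPTIMIZE raw_vault.sat_customer_details
    ZORDER BY (hk_customer, load_date)
""")
```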

Data Vault on Databricks: The Best of both Worlds

Implementing Data Vault on Databricks represents a practical and effective combination that merges Data Vault’s tool-agnostic architecture with Databricks’ technical capabilities. To optimize this integration, organizations should make thoughtful adjustments that create synergies between the modeling methodology and platform, including leveraging Unity Catalog for security and privacy satellite management, combining architectural designs while maintaining historization and data lineage, and virtualizing queries in the downstream layers with Flat & Wide structures with PIT and Bridge tables as underlying elements to enhance performance. This balanced approach allows organizations to improve governance and simplify data management, while preserving the core strengths of both systems.

 

– Ricardo Rodríguez (Scalefree)
