Skip to main content
search
0

Green Bond Reporting in Record Time at Grenke AG

Green Bond Reporting

Sustainability and transparency have long been more than just buzzwords – nowadays, they are part of how modern companies see themselves. Green bonds are becoming increasingly important as they enable targeted investments in sustainable projects. Professional and audit-proof reporting is crucial to create trust among investors, auditors, and other stakeholders.

Our client Grenke, had already relied on our expertise and implemented a data warehouse based on Data Vault 2.0. The processes were largely automated so that data sources could be integrated and processed efficiently. When a new requirement for green bond reporting arose, we were able to implement it in just one month, thanks to the already existing scalable setup.

Initial Situation: An Existing, Automated Data Warehouse

  • Data Vault 2.0 as a foundation:
    Grenke was already using a robust Data Vault 2.0 architecture that enables flexible and expandable data storage thanks to its clear structures (hubs, links and satellites).
  • Automated model generation:
    By using templates and metadata-driven approaches, data vault models can be generated automatically. This reduces manual effort, increases standardization and improves data quality.
  • Quality checks and audit compliance:
    Plausibility checks, historization and metadata-supported processes already ensured high data quality and traceability – essential for audits and reporting.

These prerequisites formed the perfect springboard for quickly and reliably integrating the new Green Bond Reporting into the existing system.

New Requirement: Green Bond Reporting

With this new requirement, Grenke was faced with the challenge of collecting and preparing and presenting specific ESG key figures and green bond-specific data in a comprehensible report.

The aim was to design the reporting in such a way that:

External reviewers and auditors can gain insight quickly and easily.

Investors and other stakeholders receive transparent information about the sustainable projects.

Regulatory requirements and internal standards are met at all times and documented in a comprehensible manner.

Thanks to the existing Data Vault 2.0 infrastructure and the high level of automation, it was possible to implement these new requirements in a short space of time.

Our Approach: Expansion Instead Of New Construction

  1. Requirements analysis
    Together with Grenke, we defined the relevant green bond key figures and reporting requirements. These included classifications according to ESG criteria, assignment of project types, as well as regional and financial attributes.
  2. Integration into the existing data warehouse
    Instead of building a new system, we added the required fields to the existing hubs, links and satellites. Thanks to the agile Data Vault 2.0 methodology, this was possible without much additional effort.
  3. Automated processes and quality checks
    Thanks to the existing ETL/ELT routes, we were able to quickly and securely load the data into the system. New validation rules for green bond reporting were added to ensure that all relevant data was recorded completely and correctly.
  4. Reporting & dashboards
    Based on the processed data, we have developed interactive dashboards and reports that clearly present the project status, the scope of financing, and other ESG key figures. External auditors can also be given access via export functions if required.
  5. Rapid approval through external audits
    As the Data Vault 2.0 structure ensures complete historization and traceability of the data, the external audits ran smoothly. The auditors were able to fully trace all steps and data changes – a decisive advantage for sustainability reports.

Result: Green Bond Reporting In Just One Month

The combination of a scalable Data Vault 2.0 approach, a high level of automation, and an already established data infrastructure enabled us to successfully deliver the Green Bond Reporting in just one month.

This means:

  • Fast time-to-market: Grenke was able to publish the report quickly and go straight into communication and marketing.
  • Trustworthy database: Thanks to integrated quality checks and traceability, the reporting is audit-proof – a crucial prerequisite for external audits.
  • Future-proof solution: New key figures, extended ESG criteria, or regulatory requirements can be flexibly integrated without having to fundamentally rebuild the system.

What Grenke Says

“Partnering with Scalefree has been instrumental in our Data Vault 2.0 journey. Their deep expertise in Data Vault principles and practical dbt know-how have significantly supported our implementation, ensuring a smooth and structured process. Thanks to their guidance, we’ve already improved our ability to integrate and analyze business data while building a scalable and future-proof data warehouse.”

Oliwia Borecka
Chief Data & Analytics Officer at grenke digital GmbH

Conclusion: Agile And Sustainable Into The Future

The project shows how Scalefree supports customers in quickly and efficiently integrating new requirements into existing data ecosystems. The Data Vault 2.0 approach provides the ideal basis for this: scalability, flexibility, and revision security ensure that companies can meet their reporting requirements not only today, but also tomorrow.

Would you like to find out more about how you can future-proof your data warehouse or ESG reporting?
Contact us at Scalefree – together we will develop a customized solution that meets your requirements and puts you in the best possible position in terms of sustainability and transparency. We look forward to making your project a success!

The Important Role of Data Catalogs in Modern Data Warehousing: A Practical Look at Datahub

Data Catalog Architecture

Data Catalogs in Modern Data Warehousing

Modern data architectures are becoming increasingly complex, posing significant challenges for organizations. Data is often scattered across multiple systems, making it difficult to locate, utilize, and preserve its quality. Analysts spend a substantial amount of time searching for relevant information, while businesses struggle to manage data efficiently and comply with regulatory requirements. A structured approach to metadata management is essential to enhance transparency, maintain data quality, and enable efficient data usage.

This article is aimed at data professionals, business users, and IT teams looking to establish or improve structured metadata management. It explores the challenges of fragmented metadata, the role of data catalogs as a central solution, and how they enhance data organization and usability. Additionally, it provides practical insights into structuring a data catalog within a Data Vault 2.0 architecture to support a scalable and comprehensive data strategy.

The Challenge of Fragmented Metadata

In the era of data-driven decision-making, organizations face a fundamental challenge: fragmented and inconsistent metadata. Across different departments, heterogeneous systems, divergent naming conventions, and varying documentation standards lead to a disjointed data landscape. Analysts and data professionals often spend excessive time locating, verifying, and interpreting data—a process that not only reduces efficiency but also increases the risk of errors and misinterpretations.

At the core of this issue lies the absence of a centralized metadata repository. Without a unified source of truth, critical aspects such as data lineage, ownership, and quality remain unclear. This lack of transparency not only complicates regulatory compliance but also erodes trust in data and impedes strategic initiatives. The result is the formation of data silos that prevent organizations from fully leveraging their data assets. As highlighted, this fragmentation leads to wasted time and effort on finding and accessing data, turns data platforms into data swamps, and hinders the development of a common business vocabulary.

What is a Data Catalog?

A data catalog is a centralized, structured repository that organizes and manages metadata across an organization, facilitating efficient data discovery, governance, and collaboration. Much like a well-curated library, a data catalog enables users to locate, understand, and utilize data efficiently by capturing essential metadata and making it easily searchable and accessible. Beyond merely indexing data assets, a data catalog provides critical context, tracks relationships between datasets, and integrates user-driven insights to enhance data usability and governance.

Features of a Data Catalog

A well-implemented data catalog consists of several foundational features that support metadata management, data governance, and user collaboration:

  • Metadata Management: Captures and organizes essential information about datasets, such as source, format, relationships, and usage patterns, ensuring accessibility and usability.
  • Data Discovery: Enables users to locate and access data quickly through intuitive search functions that leverage metadata attributes and contextual tags.
  • Data Lineage: Tracks the lifecycle of data, mapping its journey from origin to consumption. This allows organizations to trace transformations, ensuring data integrity and error resolution.
  • Data Governance: Establishes policies for data availability, usability, integrity, and security. This can include access control, regulatory compliance, and stewardship responsibilities.
  • Business Glossary: Defines business terminology and ensures consistency across the organization, improving communication and reducing misinterpretation of data fields.
  • Data Dictionary: Provides technical documentation on data structures, including schema definitions, data types, and field constraints.
  • Data Profiling: Analyzes datasets to generate statistics on data quality, completeness, distribution, and anomalies. This helps organizations understand their data better and ensures its reliability for analytical and operational use.

Benefits of a Data Catalog

A data catalog enhances data discovery by providing structured metadata, enabling users to efficiently find and access the right datasets. By offering clear definitions, contextual information, and lineage tracking, it improves data comprehension, ensuring that users understand data origins, transformations, and relationships.

Discovering and understanding data sources and their use is a core function of a data catalog. Business users can quickly locate relevant datasets, evaluate their fit, and utilize them effectively. Additionally, users can contribute to the catalog by tagging, documenting, and annotating data sources, fostering collective knowledge and enhancing data usability.

It fosters trust by documenting data quality, provenance, and usage, ensuring that users rely on accurate, well-maintained datasets. Furthermore, by reducing redundancy and streamlining workflows, a data catalog increases operational efficiency, saving time and resources while maximizing the value of an organization’s data assets.

How a Data Catalog Works

A data catalog functions as a dynamic metadata management system, automating the extraction, organization, and indexing of metadata. It seamlessly connects to various data sources, including data platforms, data lakes, ETL pipelines, and BI tools, synchronizing metadata through two primary approaches: pull-based and push-based ingestion.

In the pull-based approach, the data catalog actively queries source systems using APIs, JDBC connectors, or scheduled scans to retrieve metadata. This method allows the catalog to regularly update its metadata repository by fetching information from the source systems. On the other hand, the push-based approach relies on source systems to send real-time metadata updates to the catalog through event-driven mechanisms, webhooks, or message queues. This ensures immediate synchronization of metadata, keeping the catalog up-to-date with any changes in the data landscape.

Once metadata is ingested, the data catalog standardizes and enriches it, ensuring consistency and adding valuable context. This process includes extracting schema details, identifying data ownership, applying classifications, and establishing data lineage. The standardized and enriched metadata is then stored in an indexed repository, enabling fast search, filtering, and lineage tracking. Furthermore, automated tagging, schema change detection, and governance policies are implemented to ensure compliance with security and regulatory requirements.

Integrating a Data Catalog With Data Vault 2.0

The diagram illustrates how a Data Catalog integrates into a Data Vault 2.0 Architecture. Using various connectors, it extracts metadata from source systems, data platforms, ETL tools, and BI tools through push- or pull-based approaches. The red arrows represent the flow of metadata from these systems into the Data Catalog, ensuring continuous capture and synchronization of metadata. This process guarantees end-to-end visibility, governance, and accessibility, making data assets more reliable and valuable for business users and analysts.

Data Catalog Architecture

Organizations can choose between open-source and commercial solutions. Commercial platforms like Atlan, Alation, and Collibra offer fully managed enterprise solutions with automation and vendor support, while open-source tools such as DataHub, Amundsen, and OpenMetadata provide more customizability and cost efficiency. Many open-source solutions now also offer cloud-based enterprise versions for easier deployment.

Self-hosted deployments are typically hosted on Docker or Kubernetes, giving organizations greater control and security, making them ideal for compliance-heavy environments. Meanwhile, cloud-based solutions offer automatic scaling and lower maintenance overhead, simplifying operations.

A well-integrated Data Catalog should connect seamlessly with data sources, ingest metadata efficiently, and provide structured management and governance. In a Data Vault 2.0 environment, it enhances traceability and transparency, enabling better data-driven decisions and reducing inefficiencies caused by fragmented metadata.

Introduction to DataHub

In this section, we will present an example of a data catalog tool and some of its key features.
DataHub is a leading open-source data catalog developed by LinkedIn to address the challenges of data discovery, governance, and observability in complex data ecosystems. Designed to handle the increasing volume, variety, and velocity of data, DataHub has been adopted by numerous organizations seeking to improve their metadata management practices. It supports both push- and pull-based metadata ingestion, seamlessly integrating with a wide range of data tools and technologies, including dbt, Snowflake, BigQuery, and Airflow.

Key Features of DataHub

DataHub goes beyond the traditional features and benefits of a standard data catalog by offering advanced features that address modern data management challenges. Here are some practical insights on how DataHub solves some of the features of a data catalog:

  • Data Discovery: Leveraging advanced search capabilities and detailed metadata, DataHub enables users to quickly locate and understand data assets, facilitating efficient data-driven decision-making across the organization, with a powerful search and filtering mechanism that allows users to refine queries by platform, domain, data type, owner, and specific tags, significantly enhancing the search process.
Internal Data Catalog Screenshot 1

Screenshot of Internal Data Catalog

  • Data Lineage: It provides a comprehensive, visual trace of data flow—from origin through various transformations to final consumption—ensuring transparency, simplifying troubleshooting, and supporting compliance efforts. This lineage can extend down to the column level, offering a granular view of how data is transformed and utilized throughout the system.
Internal Data Catalog Screenshot 2
Internal Data Catalog Screenshot 3

Screenshot of Internal Data Catalog

  • Data Profiling: DataHub continuously monitors and analyzes data quality, automatically generating profiling metrics that reveal data distributions, identify anomalies, and help maintain high data quality standards. It provides key statistics such as row and column counts, query frequency, top users, and last update timestamps, along with detailed attribute profiling, including value ranges, central tendencies, null and distinct values. The table below shows some examples of these profiling metrics.
Internal Data Catalog Screenshot 4

Screenshot of Internal Data Catalog

Watch the Video

Conclusion

In an era where data volume and complexity continue to surge, a robust data catalog is no longer optional, it is essential. By centralizing and enriching metadata, catalogs like DataHub turn fragmented datasets into a coherent, trustworthy foundation for analytics, governance, and collaboration. Implemented alongside architectures such as Data Vault, they deliver the transparency, lineage, and quality controls that modern organizations need to unlock real value from their data, quickly, confidently, and at scale.

When to Use Reference Tables in Data Vault?

Reference Tables in Data Vault

In modern Data Vault 2.0 implementations, teams often face a recurring question: “We have dozens of small, static lookup tables—should we model them as full Hubs and Satellites, or can we use simpler reference tables?” If you’re dealing with Excel sheets containing tens or hundreds of rows of relatively stable data (like Profit Centers, Status Codes, or Region mappings), this article will help you decide when a lightweight reference table suffices—and when you need the auditability of a Hub/Satellite pattern.



Understanding Business Data vs. Reference Data

First, it helps to distinguish two broad categories of data:

  • Business Objects: Entities that your processes create and update constantly—Customers, Orders, Products, etc. You generate new keys and change descriptive attributes frequently.
  • Reference Data: Code lists and lookup tables that describe or classify business objects—Country codes, Profit Center codes, Contract types. These change infrequently and usually in small batches.

Although some tables can straddle the line (e.g., Profit Centers may be “business objects” for accounting teams), it’s often safe to treat truly stable code lists as reference data for modeling purposes.

Simple Reference Tables: Pros and Cons

A simple reference table in your Data Vault is nothing more than a flat table with:

  • Primary Key: Your reference code (e.g., profit_center_code).
  • Attributes: The 2–5 descriptive columns you need (e.g., profit_center_name, region).
  • No History: Only the current state is stored; updates overwrite existing rows.

Advantages: Easy to implement, minimal objects, straightforward joins at query time.
Disadvantages: No built-in historical tracking—updates will retroactively change past reports, and you cannot reconstruct previous descriptions.

When Simple Reference Tables Are Appropriate

Consider a flat reference table when:

  • Your business requirement only needs the latest values.
  • Updates are extremely rare (quarterly or less) and don’t require audit trails.
  • Performance of lookups is not mission-critical (small table sizes).
  • You have no regulatory or internal need to reproduce past descriptions.

If any of these criteria fail—especially auditability—then a simple reference table can become a liability.

Introducing Reference Hubs and Reference Satellites

To combine simplicity with history, use the Reference Hub & Reference Satellite pattern. This mirrors the standard Hub/Satellite design, but optimized for code lists:

  • Reference Hub:
    • business_key: the code (e.g., PROFCTR_001)
    • record_source: data origin
    • load_date: date the code list was loaded
  • Reference Satellite:
    • business_key (FK to Hub)
    • load_date (also part of PK)
    • Descriptive attributes (e.g., name, region, valid_from)

This approach captures every change to your reference data without overwriting, and still keeps your model lightweight.

How It Works in Practice

  1. Load the Hub: Insert every code once (or refresh if new codes appear).
  2. Load the Satellite: For each code, insert a new row whenever any descriptive attribute changes, tagging it with the load_date.
  3. Querying: In your dimension or Information Delivery layer, join from your business object Satellite (or Link) directly into the Reference Satellite on code, filtering to the row with the latest load_date ≤ transaction date.

Because reference tables are small, these joins remain performant even when you compare on dates.

Aligning Reference Data with Business Vault Snapshots

For organizations using a Business Vault layer with snapshot dates, you may need to “time-align” reference data. Two patterns are common:

  • On-the-fly alignment: In your reporting view, use the transaction’s snapshot_date and join to the Reference Satellite where load_date ≤ snapshot_date, picking the latest record.
  • PIT/Bridge tables: Precompute “Point-In-Time” (PIT) tables that store the reference code’s surrogate key aligned to each business object snapshot for faster querying.

Choose the pattern that balances your performance SLAs with data freshness requirements.

Auditability and Regulatory Compliance

If you operate in regulated industries (banking, telecom, government), audit trails are mandatory. The Reference Hub/Satellite pattern ensures:

  • Complete lineage and history of every code change.
  • Reproducibility of past reports with original reference descriptions.
  • Ability to support retrospective analyses without reloading or reconstructing data.

Even if your initial business users only ask for current values, future sprints or stakeholders may require historical context—so building auditability upfront can save costly refactoring.

Performance Considerations

Reference tables typically contain at most hundreds of rows. However, you should still consider:

  • Indexing: Ensure load_date and business_key are indexed for fast lookups.
  • Partitioning: Generally unnecessary for small tables but useful if your Satellite grows into thousands of deltas over years.
  • Join Strategy: In most SQL engines, joining a large transaction Satellite to a small Reference Satellite on code + latest date filter is efficient. If not, consider a PIT table.

Governance and Knowledge Transfer

Whatever pattern you choose, document and govern your reference data:

  • Maintain an authoritative data dictionary describing each code list, source, update frequency, and steward.
  • Set up automated tests (e.g., CI/CD validations) to detect unexpected code changes.
  • Implement alerts for large volumes of reference updates that may indicate data quality issues.

Decision Checklist

Use this quick checklist when evaluating a table for reference modeling:

  1. Is the table truly static or slow-changing? (Quarterly or less)
  2. Are there audit or historical requirements? (Regulatory or future use cases)
  3. Is the table small enough (< 1,000 rows) to avoid performance concerns?
  4. Do you need to reconstruct past reports with original descriptions?
  5. Would a simple change in the future (e.g., retro-active update) break historical reports if you used a flat table?

If you answered “yes” to questions 2 or 5, the Reference Hub/Satellite pattern is the safer choice. Otherwise, a simple reference table may be sufficient.

Conclusion

Static lookup tables in a Data Vault 2.0 implementation can be modeled either as simple reference tables or with a Reference Hub & Satellite pattern. While flat tables are easier to build, they lack historical tracking and auditability. By adopting the Reference Hub/Satellite approach, you gain full change history, reproducible reporting, and alignment with regulatory demands—while retaining a lightweight design.

Use the decision checklist above to guide your modeling choices, and ensure your Data Vault remains both agile and compliant as your organization’s needs evolve.

Watch the Video

Outsourcing the Data Warehouse: Finding the Right Balance

Outsourcing the Data Warehouse

In today’s data-driven world, building and maintaining a robust data warehouse is a critical strategic initiative. As organizations grow and their data complexity increases, leaders often ask: Is there a way to third-party some of the data warehouse work and infrastructure? In other words, can you “outsource the data warehouse” without jeopardizing control, security, or long-term knowledge retention?



Why Treat Your Data Warehouse as Strategic

Most organizations—especially those generating significant revenue or handling complex data flows—view their data platform as a core strategic asset. It underpins reporting, analytics, and decision-making, and drives competitive advantage. Handing over this critical function entirely to an external provider can feel like giving away the keys to your business.

  • Control & Governance: Keeping the data warehouse in-house ensures direct oversight of data quality, security, and compliance.
  • Knowledge Retention: Your internal team develops deep institutional understanding of data models, business logic, and reporting requirements.
  • Strategic Flexibility: Internal ownership allows you to pivot rapidly as business needs evolve, without waiting on external roadmaps or SLAs.

The Role of External Expertise

That said, nearly every successful data warehouse project benefits from external consulting, especially in the early stages. Consultants bring:

  • Proven Frameworks: Templates, best practices, and reference architectures refined across multiple clients.
  • Jump-Start Momentum: Hands-on help with infrastructure setup, project governance, and team organization.
  • Skill Gap Coverage: Experienced data architects, engineers, and DevOps specialists to fill temporary talent shortages.

Once your internal team is up to speed, you can scale back external support—keeping consultants focused on specific areas where they provide the most value.

Outsourcing Infrastructure: The Cloud Advantage

For many organizations, the first form of “outsourcing” is the data center itself. On-premises servers have given way to cloud platforms like AWS, Google Cloud, and Azure. By migrating your data warehouse infrastructure to the cloud, you:

  • Reduce Capital Expense: No more large upfront hardware purchases or data center maintenance costs.
  • Gain Elastic Scalability: Spin up additional compute and storage on demand to handle peaks in data processing.
  • Enhance Security & Reliability: Leverage the cloud provider’s certifications, redundancy, and disaster recovery capabilities.

This shift effectively delegates infrastructure work—provisioning, patching, and physical security—to a trusted third party, while you retain control of your data and processes.

Balancing Internal and External DevOps

DevOps for your data platform—CI/CD pipelines, automated testing, and deployment orchestration—can also be partially outsourced. However, it’s important to strike the right balance:

  • Internal DevOps Leads: Core pipeline design, approval processes, and environment governance should remain with your in-house team.
  • External Specialists: Consultants can set up complex workflows, integrate tools, and train your staff on best practices.
  • Limit External Proportion: Aim to keep no more than 50% of DevOps roles external to mitigate knowledge leakage and dependency risks.

AI-Driven Offloading: The Next Frontier

Emerging AI tools are beginning to automate aspects of data warehousing:

  • Schema Generation: AI can suggest optimized table structures based on source data profiles.
  • ETL/ELT Code Snippets: Auto-generation of transformation scripts for common data patterns.
  • Monitoring & Alerting: Machine learning models to detect anomalies and performance bottlenecks.

While these solutions can accelerate development, they work best under the guidance of your experienced internal team, who define requirements, review outputs, and ensure alignment with business objectives.

Setting Clear Boundaries

To manage the consultant-internal balance effectively, define:

  1. Scope of Work: Specify deliverables, timelines, and handover expectations for every engagement.
  2. Knowledge Transfer Plan: Require documentation, training sessions, and code reviews to embed expertise internally.
  3. Governance Model: Establish who makes architectural decisions and how change requests are processed.
  4. Resource Thresholds: Limit external headcount to a reasonable percentage (e.g., under 50%) to avoid over-reliance.

Key Benefits and Risks

Benefits:

  • Speed: Hit the ground running with proven accelerators and expert guidance.
  • Flexibility: Scale resources up or down based on budget cycles and project phases.
  • Cost Efficiency: Avoid long-term commitments; external consultants can be released when budgets tighten.

Risks:

  • Knowledge Drain: Too many external experts can lead to critical know-how walking out the door.
  • Strategic Misalignment: Consultants may optimize for short-term wins rather than your long-term roadmap.
  • Dependency: Over-outsourcing creates vendor lock-in and potential service disruptions if relationships end.

Best Practices for Effective Outsourcing

1. Start with a Pilot: Engage consultants on a small, well-defined project to validate fit and process.

2. Embed Consultants: Position them alongside your team for on-the-job training, rather than in isolation.

3. Prioritize Documentation: Ensure every architecture decision, data model, and pipeline is clearly recorded.

4. Rotate Responsibilities: Alternate tasks between internal staff and consultants to spread knowledge.

5. Review Regularly: Conduct quarterly check-ins to reassess external involvement, goals, and budget alignment.

Conclusion

Outsourcing your entire data warehouse is rarely advisable—this remains a core strategic asset that demands internal stewardship. However, judicious use of external consultants, cloud infrastructure, and AI-driven tools can accelerate your journey, fill critical skill gaps, and optimize costs. By setting clear boundaries, emphasizing knowledge transfer, and maintaining a healthy mix of internal and external talent, you can reap the benefits of outsourcing without surrendering control.

Watch the Video

Modeling Reference Data in Data Vault 2.0 with WhereScape

Understanding the Role of Reference Data

In the world of modern data warehousing, reference data plays a crucial role in maintaining consistency and adding business context. But how should reference or master data be modeled effectively, especially within the framework of Data Vault 2.0 using WhereScape?

This article breaks down best practices and modeling techniques to help data engineers and architects manage reference data in a scalable, maintainable way.



What is Reference Data?

Reference data consists of values that are used to categorize or describe other data within business systems. It typically includes code-to-description mappings that offer meaning to otherwise abstract identifiers.

Unlike business keys, which identify business entities or objects (like customers or products), reference data keys do not directly point to business objects. They simply support them with contextual information.

Examples of Reference Data:

  • ISO codes for countries
  • Official country names
  • Continent or region classifications
  • Currency types or codes

One critical aspect of reference data is that it can change over time. Country names may change, new currencies may be introduced, and existing classifications may be updated. This means that how we model and store this data must account for such changes.

Data Vault 2.0 and Reference Data Modeling

Data Vault 2.0 introduces a methodology designed for agility, auditability, and scalability in enterprise data warehousing. When dealing with reference data in this architecture, the recommended standard involves two main components:

  • Reference Table (Reference Hub)
  • Reference Satellite

Each serves a distinct purpose and helps manage both static and changing attributes efficiently.

The Reference Table (Reference Hub)

The reference table acts similarly to a hub in traditional Data Vault modeling, but with important distinctions:

  • Contains reference codes or keys (e.g., ISO country code)
  • Does not use a hash key – unlike typical hubs
  • May include additional static attributes that do not change over time

This component provides a centralized location for managing consistent lookup values across the enterprise. While it’s technically referred to as a “hub,” it’s specialized for reference data and behaves slightly differently in structure and intent.

The Reference Satellite

Reference satellites extend the reference table to store attributes that may evolve over time. This aligns well with the Data Vault 2.0 philosophy of tracking change history and ensuring auditability.

Characteristics of a Reference Satellite:

  • Includes reference codes or keys to link back to the reference table
  • Stores descriptive attributes that may change over time (e.g., country name updates, new regional classifications)

This design allows data teams to accommodate both historical tracking and the dynamic nature of reference data.

Why Model Reference Data This Way?

There are several strategic and operational advantages to modeling reference data using this structure in Data Vault 2.0:

  1. Separation of concerns: Static and changing data are stored in different layers (table vs. satellite), improving data integrity.
  2. Scalability: Future changes in reference attributes or descriptions are easier to manage and don’t affect historical records.
  3. Auditability: Data Vault’s natural historization supports full lineage and change tracking, which is ideal for regulated industries.
  4. Adaptability: Requirements for historization may evolve over time. Modeling reference data into satellites regardless of current needs ensures readiness for future changes.

Best Practices for Implementation

When implementing this in WhereScape, which automates the Data Vault modeling process, follow these best practices:

1. Always Use a Reference Satellite

Even if you don’t need to historize now, model your reference data in a satellite. Future-proofing your model saves costly rework later.

2. Use Reference Hubs When Multiple Sources Exist

If your organization consumes reference data from multiple systems (e.g., two systems providing different descriptions for the same country code), a reference hub helps consolidate and align these variations around the same key.

3. Avoid Hash Keys in Reference Hubs

Because reference tables don’t represent business objects, there’s no need for a surrogate hash key. Stick with the natural reference code (e.g., “US” for United States) as your unique identifier.

4. Design Satellites for Change

Structure your reference satellites to easily accommodate attributes that may change. Make use of effective date fields to track the history of these changes.

Common Pitfalls to Avoid

  • Modeling reference data as business hubs – this confuses context with core business entities
  • Skipping the satellite – even when attributes are static today, change is inevitable
  • Using hash keys unnecessarily – keep your design clean and minimal in reference structures
  • Ignoring multiple source issues – consolidate differing descriptions with a reference hub

Conclusion

Modeling reference data correctly is a small but critical part of building a reliable, scalable, and auditable Data Vault. By following the recommended structure—a reference table paired with a reference satellite—you create a flexible and future-proof design.

WhereScape users benefit from automation, but understanding these modeling principles ensures you’re applying the tool in a way that aligns with industry best practices and prepares your warehouse for long-term success.

Whether you’re handling ISO country codes or global currency classifications, treat your reference data with the same care you would your core business entities—because it gives those entities their context.

Watch the Video

Chatbot Implementation Using Retrieval-Augmented Generation

Chatbot RAG Retrieval Phase

In today’s AI-driven world, businesses are looking for smarter, cost-effective chatbot solutions that enhance customer interactions and streamline internal operations. While traditional chatbot models often struggle with outdated or generic responses, modern advancements have opened the door to more dynamic and intelligent systems.

This article is for business leaders, developers, and AI enthusiasts looking to implement smarter chatbot solutions. It explores Retrieval-Augmented Generation (RAG), a game-changing approach that enhances chatbot performance by retrieving relevant information in real time. By the end, you’ll not only understand why RAG is so powerful but also how to implement it effectively to build scalable, cost-efficient, and context-aware chatbots using Google Cloud Platform Services.

Implementing Production-Ready RAG Chatbots: Enhancing Information Retrieval with AI

Join our webinar for a practical guide to building and deploying powerful, RAG-driven chatbot solutions. We’ll show you how to leverage essential Google Cloud services for effective implementation, enabling your chatbot to deliver more accurate and relevant responses. Register for our free webinar, July 15th, 2025!

Watch Webinar Recording

The Challenge of Deploying Chatbots

Chatbots are widely used in businesses to automate customer support and streamline internal operations. While large language models (LLMs) have improved chatbot capabilities by generating human-like responses, they come with significant challenges. LLMs require costly fine-tuning, extensive resource usage, and ongoing maintenance, making them impractical for many companies. Additionally, fine-tuned models quickly become outdated, requiring frequent retraining to stay relevant.

An alternative approach is Retrieval-Augmented Generation, which dynamically retrieves relevant information in real time, rather than relying solely on pre-trained data. This allows chatbots to stay up-to-date, reduce costs, and improve accuracy, making RAG a powerful solution for businesses looking for intelligent and scalable AI-driven chatbots without the high expenses of fine-tuning or pre-training.

RAG Architecture

RAG is revolutionizing the way chatbots interact and provide information. But how does this powerful architecture actually work? The magic lies in its two distinct phases: The Preparation Phase, where the knowledge base is built, and the Retrieval Phase, where user queries are processed and relevant information is retrieved to generate a response. Each phase plays a crucial role in creating an intelligent and responsive system. Let’s explore how RAG works in detail, starting with the preparation phase.

Preparation Phase

RAG Chatbot Architecture General Preparation Phase

Data Selection: The preparation begins by selecting the appropriate data for the chatbot’s knowledge base. This typically involves choosing organized sources like databases, which contain the structured information necessary for the chatbot to function effectively.

Data Preprocessing: To prepare the raw data for optimal use in a RAG system, preprocessing is essential. This stage serves three primary purposes: enabling efficient retrieval, ensuring compatibility with the generative AI model, and optimizing for effective embedding generation. Efficient retrieval hinges on assigning unique identifiers, such as IDs, to each data element. This allows the system to quickly locate and access specific pieces of information. To enhance both the generative AI model’s understanding and the quality of embeddings, techniques such as flattening tables, tokenization, text cleaning, stemming, lemmatization, stop word removal, and data type conversion can be employed. These techniques refine the data’s structure and content, making it easier for the generative AI model to process and understand, while also optimizing the data for capturing semantic meaning and relationships in the embeddings.

Embedding Generation: With the data now preprocessed and refined, it’s ready to be transformed into a format suitable for efficient retrieval. This is where embedding generation comes in. This step involves converting the preprocessed data into numerical vectors known as embeddings using a specifically trained embedding model. The embeddings capture the semantic meaning of the data, encoding relationships between words, concepts, and ideas, which enables efficient retrieval of relevant information. The specific embedding model used will depend on the nature of the data and the requirements of the RAG system.

What is an Embedding?

In natural language processing, an embedding is a numerical representation of a word, phrase, or document that captures its meaning. These numerical vectors, which can range from a few hundred to many thousand dimensions, are designed so that words or documents with similar meanings have embeddings that are close together in the vector space.

For example, in an embedding space, the word “cat” would be located much closer to the word “lion” (another feline) than to the word “car” (an unrelated object). Meaning that even if two words were nearly identical in spelling but had vastly different meanings, their embeddings would be distant. This spatial arrangement reflects the semantic relationship between the words.

This allows AI systems to understand relationships between words and concepts, enabling them to perform tasks such as identifying similar documents. Embeddings are a fundamental building block for many AI applications, including RAG.

Saving the Embeddings in a Vector Database: The final step in the preparation phase is to store the generated embeddings in a specialized database designed for handling vector data. This is crucial because traditional databases aren’t optimized for storing or searching high-dimensional vectors. Saving the embeddings in a vector database ensures that the RAG system can quickly pinpoint the most relevant information in the knowledge base when responding to user queries.

Now looking at Data Vault’s architecture, we see that to some extent it is quite similar to what Databricks proposes: a multi-layer solution composed of a Staging layer, a Raw Data Vault and a Business Vault, followed by the domain-specific information marts. In the image below, we can see an example of a Data Vault architecture.

Retrieval Phase

With the knowledge base now prepared and ready for efficient retrieval, let’s explore how the RAG system interacts with users and generates responses in real-time.

Chatbot RAG Retrieval Phase
  1. User Interaction: The process begins with a user interacting with the chatbot through a user interface. This interaction could be a chat window, or any other platform that allows users to input queries.
  2. Embedding Generation: The user query is then converted into an embedding using the same embedding model that was used to process the knowledge base during the preparation phase. Consistency is vital because different embedding models might produce different numerical representations, hindering the similarity search process.
  3. Similarity Search and Retrieval: The query’s embedding is used to perform a similarity search within the same vector database that contains the knowledge base embeddings. This search identifies the most semantically similar embeddings, returning their corresponding IDs.
  4. Retrieve relevant Information: The IDs retrieved from the similarity search are then used to query the preprocessed data from the knowledge base. This lookup retrieves the corresponding documents in their preprocessed text form, providing the generative AI model with the necessary context for generating a response.
  5. Generative AI: The retrieved information, along with the original user query, is passed to a generative AI model. This model uses its knowledge to generate a relevant response to the user’s query. The generative process allows RAG systems to provide answers that are more than just basic information retrieval.
  6. Delivering the Response: Finally, the generated response is delivered back to the user through the user interface. The response could be in the form of text displayed in a chat window or a spoken response from a voice assistant.

Implementing a RAG-Based Chatbot: A Practical Example

Now that we’ve explored the underlying architecture of RAG, let’s see how you can implement a RAG-based chatbot using services from the Google Cloud Platform. This example focuses on building an internal knowledge chatbot that can answer employee questions based on internal documents from a knowledge platform.

Data Preparation

The foundation of any effective RAG system is well-prepared data. This initial phase focuses on transforming your raw knowledge into a searchable format, ensuring documents are ready for efficient retrieval.

Chatbot RAG Data Preparation in Google
  • Data Selection and Preprocessing: First, you’ll need to gather the relevant internal documents, preprocess them and save the data in a data platform like Google Cloud Storage.
  • Embedding Generation: Once the preprocessed data is securely stored, Google Cloud Vertex AI is utilized to transform this data. Vertex AI extracts the textual content and converts it into high-dimensional vector embeddings, creating numerical representations that capture the semantic meaning of your documents. You can choose from various pre-trained embedding models or fine-tune your own model based on your specific needs.
  • Vector Database: With the embeddings successfully generated, Google Cloud Vertex AI Vector Search then serves as the vector database. It efficiently stores these high-dimensional embeddings and is designed to enable rapid and accurate similarity searches during the retrieval phase.

Retrieval Architecture

This section outlines the Retrieval-Augmented Generation architecture of our Google Chat chatbot. It describes the flow of a user’s query through various Google Cloud Platform services, from initial submission to the delivery of a generated response, creating a robust and intelligent conversational experience.

RAG Pipeline
  1. User Interaction: The user begins by asking a question within Google Chat. Google Chat serves as the user interface, directing the query to the Conversational Agents platform.
  2. Request Orchestration: Conversational Agents receives the user’s input. It then triggers a webhook call, sending the user’s query to our RAG orchestrator service, which is hosted on Google Cloud Run.
  3. User Query Embedding: From this point, the Google Cloud Run service takes over as the central orchestrator of the RAG pipeline. It starts by processing the user’s query, generating a high-dimensional vector embedding of it using the same embedding model that was utilized during the data preparation.
  4. Information Discovery: The generated user query embedding is then transmitted from Cloud Run to Vertex AI Vector Search. Vertex AI performs an efficient similarity search, identifying and returning the ID(s) of semantically similar documents from the knowledge base based on the provided embedding.
  5. Relevant Information Retrieval: With the identified document ID(s) obtained from Vertex AI Vector Search (these IDs precisely match our preprocessed documents that are most relevant to the user’s query), the Cloud Run service accesses Google Cloud Storage. From Cloud Storage, the full text content of these matched knowledge base pages is fetched. These documents provide the essential context required for accurate response generation.
  6. Generative AI Response: The Cloud Run service then combines the original user query with the retrieved document content into a carefully constructed prompt. This comprehensive prompt is sent to the Gemini API, leveraging Google’s large language model to generate a natural language answer.
  7. Response Delivery: The generated answer from the Gemini API is received by the Cloud Run service. The Cloud Run service formats this response and sends it back to Conversational Agents.
  8. Final User Delivery: Conversational Agents receives the generated answer and seamlessly relays it back to the user within Google Chat, completing the interaction.

This example demonstrates how you can leverage GCP services to implement a RAG-based chatbot for internal knowledge sharing. By combining data preparation, embedding generation, similarity search, and generative AI, you can create a powerful tool that empowers employees with quick and easy access to relevant information.

Data Vault Link Temporality

Link Temporality: Handling Source Data Errors with Effectivity Satellites

In modern data warehousing, ensuring accurate historical records is paramount. The Data Vault methodology excels at capturing raw, unfiltered data changes over time. But what happens when your source system makes errors—linking an entity to the wrong counterpart—and then corrects them? Without the right approach, your Link tables can become confusing, making it hard to identify the true “current” relationship. This article explores an elegant solution: using an Effectivity Satellite to manage link temporality and error correction in your Data Vault.



The Problem: One-to-One Relationship with Source Data Fluctuations

Imagine two hubs in your Data Vault: Hub A and Hub B. A business rule dictates that each A-entity can be linked to exactly one B-entity at a time. Your Link table models these connections. The typical workflow is:

  1. Day 1: Source links A1 → B1 → you load this into your Link.
  2. Day 2: Source mistakenly links A1 → B2 → you load the new link.
  3. Day 3: Source corrects back to A1 → B1 → how do you capture this as the current, up-to-date link?

Since the Link table only records distinct relationships and ignores duplicates, reloading A1 → B1 on Day 3 won’t insert a new row or update any timestamp. You lose clarity on which relationship is active today.

Why Not Tweak the Link Table Directly?

You might be tempted to add LOAD_END_DATE or an “active” flag directly to the Link table to mark when a relationship becomes obsolete. However, this violates Data Vault best practices. The Link should remain a pure, append-only record of every relationship ever observed, without status flags or end dates. Instead, you delegate temporality to a dedicated satellite.

Introducing the Effectivity Satellite

An Effectivity Satellite sits alongside your Link and records the lifespan of each relationship. Its core columns include:

  • Link Hash Key: foreign key back to your Link record
  • LOAD_DATE_TIMESTAMP: when you first detected or ended this link
  • DELETE_DATE_TIMESTAMP: when the link was deactivated (or a far-future “end of time” for active rows)

This design cleanly separates the static relationship definition (Link) from its dynamic, time-dependent status (Effectivity Satellite).

Step-by-Step: Tracking Link Changes

Day 1: Initial Relationship

Link: A1–B1  
Effectivity Satellite:  
LOAD_DATE = D1  
DELETE_DATE = 8888-12-31

We load A1→B1 and mark it active by setting its DELETE_DATE to the end of all time.

Day 2: Erroneous Change

Link: add A1–B2  
Effectivity Satellite updates:  
– For A1–B1: DELETE_DATE = D2 (deactivated)  
– For A1–B2: LOAD_DATE = D2, DELETE_DATE = 8888-12-31

The old relationship is soft-deleted, and the new one is inserted and marked active.

Day 3: Correction Back to Original

Effectivity Satellite updates:  
– For A1–B2: DELETE_DATE = D3  
– For A1–B1: new row (reactivation): LOAD_DATE = D3, DELETE_DATE = 8888-12-31

Instead of touching the Link, we simply record two new deltas: ending B2 and re-activating B1. Querying the satellite for the active row (where DELETE_DATE = 8888-12-31) reveals the current link.

Loading Patterns: Full Loads vs. CDC vs. Incrementals

Your data delivery method influences how you detect deletions:

  • Full Loads: Compare all active links in the satellite against staging; any missing link implies a deletion. Insert a delta to end-date it.
  • Change Data Capture (CDC): Leverage the source’s delete events and timestamps as your DELETE_DATE_TIMESTAMP.
  • Incremental without Deletes: Combine staging deltas (inserts/updates) with a lightweight full load of just business keys. Missing keys signal deletions.

In all cases, the satellite becomes the single source of truth for link effectivity.

Handling Unreliable Source Deliveries with “Last Seen”

Sometimes, your source export may inadvertently drop rows (e.g., locked mainframe records). To avoid false deletions, maintain a Last Seen Date on your effectivity satellite. If a link hasn’t been seen for a configurable “grace period” (e.g., three weeks), a business rule in your Business Vault marks it deleted. This approach balances accuracy against source system quirks.

Querying Current Relationships

To retrieve only active links at any point:

SELECT L.*, S.LOAD_DATE_TIMESTAMP  
FROM Link L  
JOIN Effectivity_Sat S  
ON L.Link_HashKey = S.Link_HashKey  
WHERE S.DELETE_DATE_TIMESTAMP = '8888-12-31';

This simple filter returns the true, live relationships, abstracting away all historical noise and source-system corrections.

Benefits of the Satellite Approach

  • Auditability: Full history of when links were activated and deactivated.
  • Purity: Link tables remain simple, append-only, and free of flags/end dates.
  • Flexibility: Supports full loads, CDC, and incremental patterns seamlessly.
  • Business Rules: “Last seen” logic can live in Business Vault without polluting raw Data Vault layers.

Conclusion

Handling source data errors—especially when relationships ping-pong between states—requires a robust temporal strategy. By delegating link effectivity to a dedicated satellite, you maintain a clean Link table, capture every change, and easily identify the current relationship. Whether you’re dealing with full loads, CDC, or flaky source exports, this pattern scales, remains auditable, and adheres to Data Vault best practices. Implement effectivity satellites in your Data Vault to master link temporality and build a more resilient, transparent data platform.

Watch the Video

Version Control and Deployments: A Comprehensive Guide with coalesce.io

Version Control and Deployments

In today’s fast-paced software development landscape, robust version control and seamless deployment pipelines are not just nice-to-haves—they are essential components of any successful project. From ensuring traceability of every change to enabling cross-functional teams to collaborate without stepping on each other’s toes, version control systems and automated deployments form the backbone of modern DevOps practices. In this article, we’ll explore the core concepts behind version control and deployments, and dive into how coalesce.io—a powerful, Git-native platform—elevates these processes through integrated features and automation.



Why Version Control and Deployments Matter

Whether you’re a solo developer or part of a large organization, the challenges of managing code changes, coordinating releases, and maintaining accountability can quickly become overwhelming. Implementing robust version control and deployment strategies delivers four key benefits:

  • Traceability: Every change is tracked, with a clear audit trail of who did what, when, and why.
  • Collaboration: Multiple contributors can work in parallel on different features or bug fixes without conflict.
  • Accountability: With detailed commit histories and pull request reviews, it’s easy to see ownership and rationale for changes.
  • Automation: Automated testing and deployments reduce manual errors and accelerate release cycles.

Coalesce brings all of these advantages together by embedding Git-based version control and deployment automation directly into its platform, letting teams focus on building reliable data logic rather than wrestling with infrastructure.

General Version Control Concepts with Git

Git has firmly established itself as the industry standard for source code versioning. Its distributed nature allows every developer to have a full copy of the repository history, enabling powerful branching and merging workflows.

  • Branching Model: Use feature branches for development work, separate testing branches for QA, and protected branches (e.g., main or production) for stable releases.
  • Pull Requests: Facilitate code reviews by proposing changes via pull requests (PRs), where teammates can comment, request modifications, and approve merges.
  • Commit Discipline: Write clear, atomic commits that describe what changed and why, improving the readability of the project’s history.
  • Merge Strategies: Choose between fast-forward merges, merge commits, or rebases based on team preferences and release requirements.

These practices enable controlled, transparent workflows that scale with your team’s size and complexity.

Version Control with Git in coalesce.io

Coalesce takes Git integration a step further by making it a first-class citizen of the platform’s UI. Here’s how it works:

  • Native Git-Based Structure: Projects in coalesce.io are structured as Git repositories under the hood, with every node, template, and configuration file stored as code.
  • UI-Driven Branch Management: Create, switch, and merge branches directly within the coalesce.io interface—no command line needed.
  • Automatic Commits: Any structural change you make to data nodes, business logic, or metadata generates a Git commit automatically, ensuring you never lose track of adjustments.
  • External Platform Integration: Connect to GitHub, GitLab, Azure DevOps, or Bitbucket repositories. Coalesce recognizes remote branches, pull requests, and webhooks, enabling full CI/CD pipelines with your preferred tools.

By embedding these capabilities, coalesce.io minimizes context switching and simplifies the learning curve for teams already familiar with Git workflows.

General Deployment Concepts

Deployment is the process of moving code or data logic from development environments into production, ensuring that your latest changes are available to end users or downstream systems. Key deployment concepts include:

  • Environments: Maintain separate environments—such as development, staging, and production—to safely test changes before release.
  • CI/CD Pipelines: Continuous Integration (CI) automates building and testing code upon every commit, while Continuous Deployment (CD) automates the release to target environments.
  • Rollback Strategies: Implement mechanisms to revert to previous stable versions in case of regressions or critical failures.
  • Configuration Management: Ensure environment-specific settings (e.g., database connections, API keys) are managed securely and consistently.

Automating these steps reduces human error, accelerates time-to-market, and provides greater confidence in each release.

Deployment Automation with coalesce.io

Coalesce simplifies deployments by exposing its functionality through a command-line interface (CLI) and a RESTful API. Here are the highlights:

  • Coalesce CLI: Run commands such as coalesce deploy to push the latest node definitions, templates, and configurations to a target environment in one step.
  • API-Driven Pipelines: Integrate coalesce.io into existing CI/CD tools (e.g., Jenkins, GitHub Actions, Azure Pipelines) by calling the Coalesce API to trigger builds and deployments programmatically.
  • Automated Compilation: Before deployment, coalesce.io compiles your logic—validating node dependencies and configurations—to catch errors early in the pipeline.
  • Execution Hooks: Optionally run pre- and post-deployment scripts (e.g., smoke tests, data quality checks) to enforce standards and provide feedback to development teams.

This tight integration between version control and deployments ensures that your Git history is always in sync with what’s running in production.

Best Practices for Version Control and Deployments

To maximize the benefits of these systems, consider the following recommendations:

  • Enforce Branch Protection: Require pull request reviews and passing automated tests before merging into critical branches.
  • Implement Semantic Versioning: Tag releases with meaningful version numbers (e.g., v1.2.0) to track feature sets and bugfixes.
  • Use Feature Toggles: Deploy incomplete features in a disabled state, then enable them via configuration when they’re ready.
  • Monitor and Alert: Integrate observability tools to track deployment health and automatically roll back on critical failures.
  • Document Your Workflow: Maintain clear documentation of branching strategies, deployment steps, and rollback procedures for on‑boarding and audits.

Conclusion

Version control and deployments are foundational to reliable, scalable, and secure software delivery. By leveraging Git’s powerful branching and merge capabilities alongside automated CI/CD pipelines, teams can move faster while maintaining high quality standards. Coalesce advances these practices by integrating version control directly into its platform and providing CLI/API-driven deployment tools that mesh seamlessly with existing workflows. Whether you’re just starting to adopt DevOps principles or looking to streamline your current processes, coalesce.io offers a unified solution for traceability, collaboration, accountability, and automation.

Watch the Video

dbt Fusion Explained: The Next Step in dbt’s Evolution

dbt Fusion Engine

As data teams continue to scale and the demand for faster, more reliable analytics grows, the tools we depend on must evolve. Enter dbt Fusion, the latest high‑performance execution engine from dbt Labs that promises to take your dbt workflows to unprecedented speeds. In this post, we’ll dive deep into what dbt Fusion is, explore its key features, discuss supported platforms and migration paths, and help you decide if—and when—you should upgrade. Let’s get started!



Why a New Engine?

dbt (data build tool) has revolutionized how analytics engineers transform and test data directly within the data warehouse. Until now, both dbt Core and dbt Cloud have relied on a Python-based execution engine. While powerful, Python parsing and compilation can become a bottleneck as projects grow to thousands of models. Recognizing this, dbt Labs has developed dbt Fusion from the ground up in Rust, a language known for its speed and memory safety.

Key Benefit: Lightning‑Fast Parsing

One of dbt Fusion’s marquee improvements is its parsing speed. Traditional dbt projects—especially those with tens of thousands of models—could take minutes to parse. With Fusion’s Rust implementation, parsing times drop dramatically, often by up to 30× faster, bringing multi‑minute delays down to mere seconds (or even milliseconds). Faster parsing means quicker iterations, faster CI checks, and more responsive development workflows.

Ahead‑of‑Time Cycle Compilation

Typically, dbt compilation happens right before execution, which means syntax errors or schema mismatches only surface during run time. dbt Fusion introduces ahead‑of‑time cycle compilation, enabling the engine to analyze your SQL and model dependencies intelligently before executing any queries against your warehouse. This pre‑flight check catches errors early, saving compute costs and developer time by preventing failed runs on the warehouse.

Column‑Level Lineage & Data Type Validation

Data governance is becoming ever more critical. With dbt Fusion, you gain column‑level lineage and built‑in data type validation. This fine‑grained visibility ensures that every downstream model inherits accurate metadata. For instance, if you tag a column as “PII” or “Personal Information” at the source model, Fusion will automatically propagate that tag to any downstream models referencing the same column—streamlining compliance and auditability.

Smarter Orchestration & Cost Savings

dbt Cloud users already benefit from intelligent job scheduling, but Fusion takes orchestration to the next level. It can detect unchanged models and skip them, dramatically reducing unnecessary computation. In practice, this means your daily or hourly runs only re‑execute models that truly need it, leading to significant savings on warehousing costs.

Enhanced Developer Experience in VS Code

To complement the core engine improvements, dbt Labs has released an updated VS Code extension tailored for Fusion. Highlights include:

  • Autocomplete for model names, macros, and config blocks
  • Inline SQL preview so you see your compiled SQL before executing
  • Live feedback on syntax or type errors as you code

These enhancements further shrink the feedback loop, allowing analytics engineers to develop with confidence and speed.

Supported Platforms & Future Connectors

At launch (beta stage), dbt Fusion supports:

  • Snowflake
  • Databricks

dbt Labs has confirmed that additional connectors—such as BigQuery and Redshift—are on the roadmap. To stay up to date, subscribe to the official dbt community forums or follow the dbt Twitter account for announcement alerts.

Beta to GA: What to Expect

dbt Fusion is currently in beta, but the pace of innovation is rapid. dbt Labs aims to reach general availability soon. During the beta, you can:

  1. Experiment with your most complex projects to quantify performance gains.
  2. Report issues and help refine features via GitHub or the dbt community channels.
  3. Understand limitations—such as unsupported adapters or edge‑case macros—before rolling out to production.

Migration Paths for dbt Cloud & Core Users

If you’re on dbt Cloud, you don’t need to lift a finger: Fusion will become the default execution engine automatically once GA is reached. Your existing jobs and orchestrations will seamlessly target Fusion under the hood.

For dbt Core users, upgrading is straightforward:

  1. Install the latest dbt-fusion package alongside dbt-core.
  2. Follow the step‑by‑step migration guide on the dbt Labs documentation site.
  3. Run your test suite locally to confirm compatibility.

License & Pricing Considerations

dbt Fusion introduces a new tiered licensing model:

  • Local Development (dbt Core users): Source‑available, free, and fully functional for local builds (with some advanced features behind a paywall).
  • dbt Cloud customers: Fusion is included in paid tiers, unlocking all premium capabilities—such as enterprise connectors, deeper metadata lineage, and priority support.

Review the official pricing page to see which features align with your team’s needs.

Is dbt Fusion Right for You?

If your team regularly works on large-scale dbt projects or you’re chasing every millisecond of performance, dbt Fusion is a game‑changer. Early adopters report 10×–30× faster parsing, near‑instant validation feedback, and lower cloud compute bills thanks to smarter orchestration.

That said, if your project is small or you’re comfortable with existing runtimes, you may choose to wait until GA and additional adapters ship. Either way, Fusion is the future of dbt, and understanding its capabilities now will help you plan your analytics roadmap.

Next Steps

  • Read the dbt Fusion docs to explore detailed benchmarks and feature matrices.
  • Join the beta: enable Fusion in your dev environment and share feedback.
  • Monitor connector announcements to align Fusion with your warehouse of choice.

Watch the Video

From Warehouses to Platforms: Why Should We Change Our Wording?

From Data Warehouses to Data Platforms

The world of data architecture is evolving — fast. What started as traditional data warehouses has now become a dynamic ecosystem of technologies, roles, and use cases. At Scalefree, we no longer talk exclusively about data warehouses — we intentionally use the term data platforms. Why? Because it’s not just the technology that has changed, but also the people working with data and how they use it to generate value.



From Data Warehouses to Data Ecosystems

Traditional data warehouses were built for structured data with predefined schemas — relational, static, and stable. They were and still are the backbone for reporting and classic business intelligence in most cases.

The advent of data lakes offered a revolutionary capacity to house and manipulate unstructured data. However, the absence of clear structure and robust governance often resulted in environments colloquially known as “data swamps.”

Hybrid architectures and, later, data lakehouses emerged as a logical evolution, blending the strengths of warehouses and lakes. Their key benefit: enabling different data consumers to work on a unified foundation.

The New Reality: Platforms Instead of Silos
Today, multiple roles interact with data — and each has unique needs:

Data Engineers work across all architectural layers: from raw data ingestion to business rules and curated marts.

Business Analysts need structured, refined data for reports and dashboards.

Data Scientists explore raw, granular data for predictive models — often working directly with data lakes or raw vaults.

The traditional concept of a data warehouse no longer covers this variety of use cases. It’s simply not enough.

Why We at Scalefree Speak of Data Platforms

To us, Data Platform is not just a buzzword — it’s a strategic shift that reflects today’s real-world demands. A data platform needs to fulfill multiple criteria.
For example:

Neutrality
It’s not tied to specific technologies. Whether Snowflake, Databricks, or Coalesce — the concept stays relevant.

Flexibility
It supports any data architecture: from classic warehouses to lakes and lakehouses — and whatever comes next.

Role Inclusivity
All roles — engineers, analysts, scientists — can work on the same platform, using the same data, without structural or technical barriers.

Future-Readiness
New technologies can be adopted without redefining the concept of the platform itself.

AI Enablement
A modern data platform provides the foundation for AI and machine learning by making all relevant data — structured and unstructured — accessible, governable, and ready for advanced modeling.

Conclusion: Thinking in Platforms that serves EVERYONE

The world of data is no longer binary. It’s not just “reporting” vs. “analytics,” “structured” vs. “unstructured,” or “IT” vs. “business.”

By using the term Data Platform, we acknowledge this reality and offer a unifying concept that bridges technology, people, and innovation.

At Scalefree, we actively help shape this new world — using modern architectures, Data Vault 2.0, automation tools like dbt, Coalesce, and cloud-native platforms.

Watch the Video

Data Vault Links With Just One Hub Reference

Single-Hub Links

In Data Vault modeling, links play a central role in representing relationships between business keys stored in hubs. By design, most links connect two or more hubs, capturing many-to-many relationships or associations. But what happens when an event or transaction involves only a single business key? Can you still use a link structure—and if so, which type? In this article, we’ll explore the concept of non-historized links with a single hub reference, compare alternatives, and outline best practices for real-time event modeling.



Overview of Data Vault Components

Before diving into one-hub links, let’s briefly review the core building blocks of a Data Vault model:

  • Hubs: Store unique, immutable business keys (e.g., customer IDs, order numbers).
  • Links: Represent relationships or associations between two or more hubs.
  • Satellites: Hold descriptive attributes and contextual history for hubs and links.

This three-tiered architecture ensures agility, auditability, and scalability. Hubs guarantee uniqueness; links model relationships; satellites track changes over time.

Traditional Links and Their Purpose

Most Data Vault implementations utilize links to tie together business keys from multiple hubs. Common scenarios include:

  • Customer–Order relationships (customer purchases multiple orders).
  • Order–Product line items (each order can contain multiple products).
  • Employee–Department assignments.

These historized links capture the evolution of relationships over time, recording load dates and allowing queries that include past associations. In contrast, non-historized links focus on events at a single point in time.

Defining Non-Historized Links

A non-historized link (sometimes called an “event link” or “transaction link”) stores relationships for a single event or message without maintaining full historical context. Instead of recording every change, it captures a snapshot of an event at its arrival:

  • Load timestamp identifies when the event occurred or was ingested.
  • Hub references list one or more business keys involved in the event.
  • Non-historized Satellites may attach descriptive details, but typically without tracking attribute history.

This design is ideal for real-time message processing, streaming data, or systems where full history is not required for each event.

When Only One Business Key Is Involved

While many events involve multiple business keys—such as an order linking to both customer and product—some transactions or messages involve just one key. Examples include:

  • A single-customer ping or heartbeat event in an IoT system.
  • A retail message capturing stock-level change for one product.
  • An alert triggered by a lone account reaching a threshold.

In these cases, you might wonder if a link structure still makes sense when there’s only one hub reference. The answer is yes: you can implement a non-historized link that references a single hub key to represent that event.

Advantages of Single-Hub Links

Opting for a non-historized link with one hub reference brings several benefits:

  • Consistency: Sticks to the Data Vault pattern of links for events, avoiding mixed designs.
  • Scalability: Scales out to handle high volumes of incoming messages without heavy historical tracking.
  • Clarity: Clearly separates transactional/event data from descriptive satellites and core business keys.
  • Query Simplicity: Enables straightforward point-in-time queries of events linked to the relevant hub.

Alternative: Multi-Active Satellites

Another design might involve a multi-active satellite on the hub itself, capturing different event types or message variants keyed by a load timestamp or event type. However:

  • Multi-active satellites are designed to capture multiple concurrent “active” roles or statuses rather than transient events.
  • The lack of a dedicated link table can blur semantic distinctions between relationships and descriptive attributes.
  • Query performance and partitioning strategies may suffer when trying to manage high-frequency event data in a satellite.

Therefore, for discrete, passing-through events, a non-historized link generally outperforms a multi-active satellite approach.

Designing Your Single-Hub Non-Historized Link

When modeling a non-historized link that references only one hub, follow these guidelines:

  1. Link Table Structure: Include a surrogate primary key, load timestamp, and the single hub’s surrogate key.
  2. Foreign Key Constraint: Enforce referential integrity back to the hub, ensuring the business key exists.
  3. Descriptive Satellites: If extra attributes are needed (e.g., event payload details), create a non-historized satellite keyed to the link.
  4. Partitioning Strategy: Partition by load date for efficient querying and archiving of stale event data.
  5. Retention Policy: Define sliding windows or archival processes for old events if storage growth is a concern.

Here’s an example DDL snippet for reference:


CREATE TABLE l_event_single_hub (
l_event_id        BIGINT      IDENTITY PRIMARY KEY,
hub_key_id        BIGINT      NOT NULL,
load_date         DATETIME     NOT NULL,
-- optional metadata columns
source_system     VARCHAR(50),
record_hash       CHAR(32),
CONSTRAINT fk_l_event_hub
FOREIGN KEY (hub_key_id)
REFERENCES h_hub_entity(hub_key_id)
);

Use Case Scenarios

Organizations across industries leverage single-hub links for:

  • Banking: Recording individual account balance snapshot events.
  • Retail: Capturing stock level messages for each product unit.
  • IoT: Ingesting single-device telemetry pings.
  • Telecommunications: Logging individual phone number status changes (e.g., activated/deactivated).

In each scenario, the event is tied to one core business key, and history is either ephemeral or summarized elsewhere.

Best Practices and Considerations

When implementing single-hub non-historized links, consider the following:

  • Event Granularity: Define clear semantics—what constitutes one event, and how often will it be ingested?
  • Surrogate Keys: Always use surrogate keys for hubs and links to maintain consistency.
  • Hashing Strategy: Compute a record hash if you need idempotency or change detection on message payloads.
  • Load Performance: Optimize bulk or streaming loads with batching and minimal indexes on the link table.
  • Retention and Archival: Archive stale events into cheaper storage or summarize them into aggregate tables.

By following these practices, you’ll ensure a robust, maintainable design that adheres to Data Vault principles.

Conclusion

While it might seem counter-intuitive to create a link with only one hub reference, non-historized links with a single business key are both feasible and, in many real-time event scenarios, preferable to alternative designs. They preserve the semantic clarity of link tables, ensure data integrity, and scale efficiently for high-volume event streams. When events involve only one business key, reach for a one-hub non-historized link rather than shoehorning events into satellites or hybrid structures.

Watch the Video

Dealing with Corrupted Loads in Data Vault

Corrupted Loads in Data Vault

One of the foundational assumptions in Data Vault modeling is that business keys must be unique. This rule underpins how we model Hubs, Links, and Satellites. But what happens when your data doesn’t play by the rules? Specifically, what should you do when your data delivery contains multiple rows with the same business key—a situation that violates the core principles of your Raw Data Vault model?

In this article, we’ll explore practical strategies for managing corrupted data in Data Vault pipelines, focusing on maintaining auditability, consistency, and data integrity—even when upstream data delivery is flawed. We’ll also look at what to do when your business key assumptions no longer hold true.



Understanding the Problem: Duplicate Business Keys

Let’s start with the assumption that your Raw Data Vault is modeled around unique business keys. You’ve built Hubs, split Satellites, and established Links based on the expectation that a business key like customer_id uniquely identifies a customer.

Now, you receive a new delivery from your source system. Unexpectedly, it contains multiple rows with the same business key. This isn’t just a data quality issue—it fundamentally breaks your model. The typical loading process can no longer proceed cleanly, and worse, you risk contaminating your data warehouse with incorrect records.

Why You Can’t Ignore Corrupted Loads

It’s tempting to just skip the bad file or fix it manually. But in a proper Data Vault setup—particularly one that adheres to full auditability and compliance standards—this isn’t acceptable. You must be able to fully reconstruct each data delivery, even if it’s flawed. Every decision—whether to reject or load—must be trackable and justifiable.

Step 1: Capture Everything in a Data Lake

Today, many modern architectures use a data lake or Persistent Staging Area (PSA) as the first layer of data capture. This becomes your insurance policy. All incoming data—valid or corrupted—is ingested and stored here as-is, giving you a perfect record of what was delivered and when.

This approach also ensures your Raw Data Vault can skip flawed deliveries without data loss. By storing the original files in the data lake, you preserve the full delivery for later inspection, validation, or correction without halting the loading process entirely.

Step 2: Define Automated Data Quality Checks

Before data is loaded into the Raw Data Vault, it must pass validation. You can implement quality checks like:

  • Is the business key unique across the delivery?
  • Are column data types and lengths as expected?
  • Are required fields populated?

If any of these checks fail, the entire file should be rejected—not just individual records. Why? Because partial loads introduce ambiguity and audit challenges. Instead, flag the file as failed and notify the data provider to investigate and resend a corrected version.

Step 3: Track Rejections and Version Control Your Checks

You must keep detailed logs of every load attempt. This includes:

  • Which file was loaded or rejected
  • Which checks were applied
  • Which check failed and why
  • The version of the validation rule used

This ensures complete traceability. You can prove not just what was accepted, but also what was rejected and for what reason. This is crucial for regulatory compliance, audits, and operational transparency.

What If There’s No Data Lake?

In some cases, you may not have a data lake. You might be working with a transient relational staging area before the Raw Data Vault. Even then, you should still store failed deliveries. A separate location or table can be used to store the raw files that failed validation. Again, auditability is key—just because data isn’t valid doesn’t mean it can disappear.

When the Business Key Assumption Breaks

Sometimes, you dig deeper and realize that your assumption about the business key was flawed. Maybe you thought customer_id was unique, but the source system allows multiple entries per ID for different contexts. Now what?

This is where things get more complex. You need to refactor your model. Specifically, you must modify the Hub and possibly extend the business key by combining it with another column (e.g., customer_id + region) to enforce uniqueness.

Why You Must Refactor, Not Hack

Some might be tempted to patch the issue using a record source tracking Satellite or other technical workaround. But this introduces long-term maintenance and performance issues. Worse, it hides the real business reality behind a technical trick.

Instead, treat the business key as the central anchor of your model. If it changes, it impacts:

  • The Hub structure
  • All related Satellites
  • Any Links pointing to the Hub

Yes, it’s a big change. But it’s limited to a specific portion of your model and keeps your architecture clean and reliable.

What About Descriptive Data Errors?

If the corrupted data only affects descriptive attributes and not the business key, the fix is simpler. You can ingest a correction load directly into the Satellites with a backdated load date—just after the original bad load. Then, rebuild your PIT (Point-In-Time) tables. This resolves the issue for downstream consumption without any need to refactor Hubs or Links.

Final Thoughts: Build Resilience Into Your Pipeline

Corrupted data is not an exception—it’s an eventuality. Whether it’s duplicate business keys, incorrect formats, or structural changes in the source system, your data warehouse must be prepared. The best defenses are:

  • A reliable data lake or staging layer to capture raw deliveries
  • Automated validation and full-file rejection logic
  • Detailed auditing and version control on checks
  • Clear communication with source system owners
  • Willingness to refactor models when business reality shifts

Following these principles ensures your Data Vault model remains robust, scalable, and trustworthy—even in the face of corrupted loads.

Watch the Video

Close Menu