Sample Page – Scalefree

Building a scalable Data Platform? In Inside Modern Data Teams

Expanding Agile Practices and Embracing Data Governance

Expanding Agile Practices and Embracing Data Governance for Modern Organizations

Why change a running system? In a rapidly evolving digital landscape, embracing new methodologies and exploring broader perspectives becomes essential. Agile practices are more than just Scrum; they encompass a wide array of approaches aimed at optimizing organizational workflows. This journey led us to become certified trainers in Disciplined Agile and integrate this fresh knowledge into our projects. However, staying agile also means continuously seeking new methodologies and frameworks, like Data Mesh, that align with modern data needs.

In this article:

Adapting Agile Principles for Data Governance and Data Mesh
Key Components for Effective Data Mesh Implementation
Why Data Governance Matters
Core Elements of Data Governance
Watch the Video
Meet the Speakers

Adapting Agile Principles for Data Governance and Data Mesh

Our journey has taken us deeper into the realms of data architecture, governance, and the intersection with GDPR and organizational needs. Integrating data governance within an agile framework ensures a structured yet adaptable approach that allows innovation while maintaining control and data security. This shift promotes better domain ownership, federated governance, and viewing data as a product.

Key Components for Effective Data Mesh Implementation

Standardized DevOps: Unified processes and seamless integration of tools to facilitate automation and consistency.
Data Catalogue: A centralized source for metadata, data lineage, and ownership information to enhance transparency and usability.
Federated Governance: Collaborative frameworks where domain leaders establish platform rules and sharing protocols.
Governed Platform: A managed platform that supports efficient data sharing and collaboration across teams.
Automation: Streamlined data provisioning, especially in the Data Lake and Data Vault, to avoid delivery bottlenecks.
Release Management: Organized release notes to communicate new data products and functionalities effectively.
Standard Guides: Comprehensive guidelines to ensure consistent data handling throughout the organization.

Why Data Governance Matters

Data governance should be a central focus for modern organizations. Research shows that only 11% of companies have a robust data governance structure, yet those that do experience significant benefits:

Improved Efficiency: Effective governance can reduce data search time by up to 50% (IBM).
Enhanced Decision-Making: Strong governance leads to 40% faster decision-making due to better data access (Databricks).
Long-term Value: By 2027, 60% of companies may not realize their AI project potential due to inadequate governance (Gartner).

Core Elements of Data Governance

To establish an effective governance framework, focus on:

Ownership: Clear roles for data stewardship and accountability for data lifecycle management.
Accessibility: Authorized, user-friendly data access for stakeholders.
Security: Robust data protection policies including encryption and access control.
Quality: Continuous monitoring and improvement of data accuracy, completeness, and consistency.
Transparency: Comprehensive documentation and metadata management to foster data literacy.

By integrating these agile and data governance principles, organizations can unlock true potential, fostering both innovation and compliance.

Watch the Video

Meet the Speakers

Lennart Busche
Consultant

Lennart is working in Business Intelligence and Enterprise Data Warehousing (EDW), supporting Scalefree International since the beginning of 2023 as a BI Consultant. Prior to Scalefree, he had over eight years of experience in the financial IT sector with focus on project management, IT-Service management and client management. This helped him get a broad knowledge of business requirements, the needs of customers dealing with IT and communication with different customer groups.

Lorenz Kindling
Senior Consultant

Lorenz is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies in various industries for Scalefree International. Prior to Scalefree, he also worked as a consultant in the field of data analytics. This allowed him to gain a comprehensive overview of data warehousing projects and common issues that arise.

Building a scalable Data Platform? In Data Vault Friday

Federate Multiple Sets of Non-Historized Data

Welcome to another edition of Data Vault Friday! In this session, we’ll address an interesting question: how to integrate a second system into an existing setup with a non-historized link. This specific case involves transactions that can be updated for the last two months. We’ll explore the necessity of satellites, the role of non-historized links, and the concept of federating data in the business vault or information mart (BV/IM).

Let’s break it down, discuss the challenges, and provide actionable solutions.

In this article:

The Challenge
Understanding Non-Historized Links
Federation Across Systems
Handling Updates in Transactions
Federation in the Business Vault and Information Mart
Best Practices and Recommendations
Watch the Video
Meet the Speaker

The Challenge

The question revolves around a scenario where two systems provide transactional data. The goal is to integrate these datasets into a single fact entity in the information mart, enabling aggregation across both systems. A specific challenge arises when transactions from one system need to be updated within the last two months, necessitating a satellite to track those changes. So, how do we effectively design and implement this integration?

Understanding Non-Historized Links

Non-historized links play a central role in integrating datasets across source systems. These links typically store only insert operations, ensuring that facts—such as transactions—remain immutable. However, the reality is often more complex, especially when updates or changes occur.

For transactions that change, the non-historized link’s primary purpose is to maintain a consistent structure. Attributes frequently used for aggregations, such as amounts, are stored directly in the link structure. Meanwhile, descriptive attributes that casual users don’t aggregate (e.g., line descriptions) are placed in satellites to keep the link slim and optimized for storage and performance.

Federation Across Systems

One of the first decisions when integrating two systems is determining whether their data can share the same link structure. The key considerations are:

Semantic Consistency: Do both systems provide the same type of transactional data?
Granularity: Are the transactions at the same level of detail?
Key Uniqueness: Is there any overlap in identifiers that might require adding source-specific elements like a tenant ID?

If the answers indicate alignment, both systems can load into the same non-historized link. However, each source system would have its own satellite, ensuring clear separation of descriptive attributes.

Handling Updates in Transactions

Updates to transactional data pose a significant challenge, particularly when attributes such as amounts or hub references change. The typical strategy involves **technical counter transactions**, which ensure insert-only behavior while maintaining an accurate representation of changes. Here’s how it works:

For a new transaction, insert the original record into the link.
If an update occurs (e.g., an amount changes from €5 to €7), insert two additional records:
- A “counter” record negating the original amount (e.g., -€5).
- A new record reflecting the updated amount (e.g., +€7).
Aggregations will reflect the latest transaction value while retaining a complete history of changes.

This approach keeps the data model consistent and avoids direct updates to the non-historized link.

Federation in the Business Vault and Information Mart

While the raw Data Vault integrates data from multiple systems, the Business Vault (BV) and Information Mart (IM) are where federated datasets shine. By standardizing and transforming data, you can provide end-users with a unified view of all transactions.

The information mart combines the transactional data from both systems into a single fact entity. This entity supports aggregations across all transactions, ensuring that analytical queries deliver accurate and actionable insights.

Best Practices and Recommendations

To ensure a robust implementation, consider these best practices:

Design for Scalability: Keep the link structure slim to optimize performance, especially with large datasets.
Clear Satellite Separation: Create one satellite per source system to maintain clarity and prevent mixing attributes from different sources.
Use Technical Counter Transactions: Handle updates efficiently while preserving insert-only behavior.
Test Aggregations: Validate the unified fact entity in the information mart to ensure accurate reporting across systems.
Monitor Data Quality: Regularly check for inconsistencies, especially when integrating systems with different update patterns.

Watch the Video

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Dmytro Polishchuk In dbt Talk

Snapshots in dbt

Watch the Video

Snapshots in dbt: A Quick Overview

dbt snapshots allow you to “look back” at historical data by capturing changes in your database tables. This is achieved by implementing type-2 Slowly Changing Dimensions (SCDs), which track how a row has changed over time. For example, you can keep track of an order’s status as it moves from ‘pending’ to ‘shipped’ to ‘delivered’.

In this article:

How Snapshots Work
Snapshots Best Practices

How Snapshots Work

When you run dbt snapshot, it creates a new table with the same columns as your source table, plus additional metadata columns like dbt_valid_from and dbt_valid_to. On subsequent runs, dbt updates the dbt_valid_to column for changed records and adds new records with dbt_valid_to set to null.

dbt offers different strategies for detecting changes, including comparing timestamps, specific column values, or a unique key. You can also configure dbt to track deletions.

Snapshots Best Practices

Store snapshots in a separate schema to keep your historical data organized.
Snapshot raw data using the source function and select all columns.
Avoid joins in your snapshot queries; instead, create separate snapshots and join them downstream.
If you need to perform transformations, do so in ephemeral models before snapshotting.
Schedule snapshots to run frequently to capture changes regularly.

By following these best practices, you can ensure that your dbt snapshots are accurate, efficient, and easy to maintain.

Building a scalable Data Platform? In Data Vault Friday

Non-Historized Links and Their Satellites

Understanding Non-Historized Links and Satellites

In the world of data vault modeling, we often encounter situations where we need to deal with non-historized links. A non-historized link is a structure used to capture relationships between entities (hubs) in a data vault. Unlike historized links, these structures are not designed to track changes over time, making them particularly useful in certain high-performance scenarios.

However, when integrating with systems that can update transactions for recent periods, additional structures like non-historized satellites become essential. These satellites allow us to track updates to attributes associated with the link while maintaining the integrity and performance of the data model.

In this article:

Design Considerations for Non-Historized Links
Role of Non-Historized Satellites
Query Performance Optimization
Practical Example
Conclusion
Watch the Video
Meet the Speaker

Design Considerations for Non-Historized Links

The design of non-historized links requires careful planning, particularly when dealing with wide datasets or platforms with specific storage constraints. A wide link, for instance, might contain attributes like transaction ID, account segments, accounting document, accounting period, and timestamps. These attributes uniquely identify a transaction and establish relationships with other entities in the data vault.

Here are key considerations when designing non-historized links:

Include Essential Identifiers: Attributes required to uniquely identify a transaction and establish relationships with hubs should always be part of the link. This typically includes the transaction ID and any necessary alternate keys.
Balance Row Width: Including too many attributes in the link can lead to performance issues on row-based storage systems, as wide rows reduce the number of records stored per page.
Use Satellites for Additional Attributes: Attributes that are not essential for casual queries or dashboard aggregation should be moved to a satellite structure to keep the link lean and efficient.

Role of Non-Historized Satellites

Non-historized satellites play a critical role in complementing non-historized links. They store additional descriptive attributes that are not frequently queried or aggregated by typical users. This approach minimizes the join operations required during querying, improving performance.

For instance, while attributes like debit and credit amounts might be included in the link due to their importance in aggregations, other less frequently used attributes, such as transaction type or invoice descriptions, can reside in the satellite. This division ensures a balance between storage efficiency and query performance.

Query Performance Optimization

A significant design challenge lies in ensuring optimal query performance, particularly on column-based storage systems. Joining a link with its satellite can significantly impact performance. To mitigate this, consider the following:

Prejoin Frequently Queried Attributes: Move attributes that are commonly used in aggregations or group-by operations into the link structure to eliminate the need for joins.
Tailor Satellites for Specific Use Cases: Design satellites to serve specialized needs, such as data science or detailed analysis, where users can afford the higher cost of join operations.
Adopt Hybrid Storage Strategies: Leverage hybrid storage solutions, such as Snowflake’s mini partitions, to combine row-based and column-based advantages and support wider rows for frequently queried data.

Practical Example

Let’s consider a scenario with a financial transactions table containing over a billion rows. The table includes attributes such as transaction ID, account segments, accounting document, accounting period, debit amount, credit amount, and timestamps.

Here’s how we can structure the non-historized link and satellite:

Non-Historized Link: Include transaction ID, account segments, accounting document, accounting period, debit amount, and credit amount. These attributes are essential for aggregations and dimension references.
Non-Historized Satellite: Store additional attributes like transaction type and invoice descriptions, which are not frequently queried by casual users.

This structure ensures that typical queries for dashboards and reports are efficient, while still supporting detailed analysis for specialized users.

Conclusion

Designing non-historized links and their satellites requires a deep understanding of data usage patterns and storage platform constraints. By carefully selecting which attributes to include in the link and which to offload to a satellite, we can strike a balance between query performance and storage efficiency. For most scenarios, prioritizing attributes that support frequent aggregations and groupings in the link structure is key, while satellites serve as a repository for less critical details.

As data modeling continues to evolve, adapting these principles to modern storage technologies and hybrid platforms will be crucial for building robust and performant data vault systems.

Watch the Video

Meet the Speaker

Michael Olschimke

Michael has more than 20 years of experience in Information Technology. During the last eight years, he has specialised in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Building a scalable Data Platform? In Artificial Intelligence, Data Modeling, Intermediate

AI Act Insight: Ensuring Responsible AI for Your Business

AI Act

The Artificial Intelligence (AI) Act has been in force since August 1, 2024 and will gradually come into effect over time. As a new legal milestone, the AI Act brings with it requirements for the use of artificial intelligence in companies in order to promote the responsible development and use of artificial intelligence in the EU. But instead of seeing only hurdles in the risk-based approach of the AI-Act, it also opens up a wide range of opportunities for a future-oriented AI strategy for companies.

In our newsletter, we take a first glance at the new legal framework and its significance for your company. We also show how companies can use legal compliance strategically to gain a competitive edge and promote innovative business models.

AI Act Insight: Ensuring Responsible AI for Your Business

The EU’s AI Act is here! Learn how this groundbreaking regulation impacts your business. We’ll break down the risk-based approach to AI systems, focusing on high-risk applications and compliance requirements. Discover practical steps to ensure transparency and leverage tools like AI-Marts for effective AI governance.

Watch Webinar Recording

In this article:

What is the AI Act and Why Should You Care?
- Comply With the AI Act Today!
From a Data Warehousing perspective: How can an AI-Mart help?
- Key Features of an AI-Mart
Upcoming Resources and Events
Final Remarks from the Authors

What is the AI Act and Why Should You Care?

The AI Act aims to make the use of artificial intelligence within the EU safer and more trustworthy by creating clear rules for the development and use of AI systems. The focus here is on the protection of fundamental rights, health, and safety. The legal framework is based on a risk-based approach: AI applications are divided into four different categories, from minimal to unacceptable risks, depending on the potential threat to society. Specifically, the AI Act provides for the following categories:

Unacceptable Risk: AI systems that pose a threat to human rights or safety, such as those used for social scoring or manipulative practices, are prohibited.
High Risk: These systems are heavily regulated and include applications in critical areas such as biometric identification, healthcare, transportation, education, and employment. Businesses using high-risk AI must meet strict compliance standards.
Limited Risk: These systems face fewer restrictions but must still adhere to transparency requirements. For example, chatbots need to inform users that they are interacting with AI.
Minimal or No Risk: The least regulated category includes AI applications such as spam filters or AI-driven video games.

The AI Act is of great importance for companies, as the new requirements not only entail compliance obligations, but also open up opportunities to gain competitive advantages.

By adapting to the legal requirements at an early stage, companies can strengthen trust among customers and partners, minimize risks, and promote innovation responsibly. A sound understanding of regulation enables them to make strategic use of the legal framework and position themselves better in international comparison.

Hence, for businesses operating in or with the EU, compliance with the AI Act will be a decisive factor. Failure to comply could result in significant penalties—up to 7% of the global annual turnover or €35 million, whichever is higher.

As the legislation moves gradually forward, it is recommended that companies, as a first step, review their AI tools and analyze how these systems are classified and regulated under the new framework to implement the necessary obligations.

Comply With the AI Act Today!

Preparing for the AI Act requires a proactive and comprehensive approach. Ensure compliance, mitigate risks, and foster trust in your AI applications.

Get My Free Checklist

From a Data Warehousing perspective: How can an AI-Mart help?

As businesses prepare to comply with the European Union’s AI Act, ensuring that their data and AI systems meet the new regulations is critical. Central to this is the concept of data governance and traceability, especially for AI models classified as high-risk. A modern data warehouse (DWH), particularly one powered by Data Vault 2.0, when combined with a specialized AI-Mart, can provide the technical foundation needed for compliance by managing the data lifecycle, ensuring transparent operations, and logging AI model activities. Data Vault 2.0 offers several advantages for this purpose, including its ability to support agile development, enabling rapid changes in business requirements; ensuring scalability, allowing businesses to handle increasing data volumes seamlessly; and providing strong historical tracking through its architecture, which facilitates easier auditing and compliance verification.

In the context of AI, the AI-Mart is a specialized data mart within a DWH, focused solely on managing AI training data. Its purpose is to provide a structured and compliant environment for storing and curating datasets that will be used to train, validate, and test AI models. Unlike a traditional data mart, the AI-Mart is designed with features tailored for AI, such as enhanced metadata, tracking, and model training documentation.

Key Features of an AI-Mart

Data Curation for AI Training: The AI-Mart stores data specifically curated for training AI models, ensuring that all datasets are clean, unbiased, and high-quality. Built-in data governance rules ensure that only validated data enters the mart. This ensures compliance with the AI Act’s requirements for high-risk AI systems, where data must be trustworthy, accurate, and free of bias.
Metadata and Documentation: The AI-Mart stores metadata about each dataset, including its source, transformations applied, and its use in specific AI models. This metadata is essential for traceability, ensuring that every data point used in an AI model can be traced back to its origin and all changes can be documented.
Data Versioning and Lineage: In AI applications, ensuring that models use up-to-date and reliable data is critical. The AI-Mart supports data versioning, allowing teams to maintain multiple versions of datasets and trace changes over time. Data lineage tracking ensures that the lifecycle of the data—from ingestion to usage in AI models—can be fully traced, providing a comprehensive audit trail required for compliance with the AI Act.

This is why a robust data governance framework is crucial for ensuring compliance with the AI Act. By integrating a data warehouse (DWH) with an AI-Mart, businesses can implement stringent governance measures that ensure the quality and reliability of AI training data. For example, automated validation pipelines within the DWH verify that only data meeting predefined quality standards is used, minimizing errors, biases, and missing information. This is particularly important for high-risk AI applications, such as those in biometric identification or healthcare, where poor data quality could lead to harmful or inaccurate outcomes.

To comply with the AI Act, businesses must ensure traceability in their AI systems by tracking and documenting key stages of the AI process, from data preparation to model usage. Integrating AI model logs into a data warehouse (DWH) plays a crucial role in this, providing a centralized system to monitor and store critical information about how AI models operate and interact with data.

Logging AI Decisions and Outputs: Each time an AI model processes data, logs should be automatically generated and stored in the DWH. These logs capture essential details, including input data, feature transformations, model parameters, decision thresholds, and output probabilities. By loading these logs into the DWH, businesses create a detailed audit trail of AI activity, ensuring that key aspects of the model’s operations are documented.

Log Aggregation and Storage: Logs from AI models, whether during training or production, can be continuously fed into the DWH as part of the AI-Mart infrastructure. These logs may include:

Model training logs: Documenting how the model was trained, the datasets used, and the parameters adjusted during training.
Model inference logs: Recording the input data, features generated, and each prediction made by the model.
Performance metrics: Storing evaluations like accuracy, precision, and recall, which help track the model’s performance over time and detect any model drift.

By storing these logs in the DWH, businesses can establish detailed records of AI model operations for regulatory purposes.

Querying and Auditing Logs: The DWH’s querying tools allow compliance teams to generate reports that show how models operate, what data was used, and how the AI model has evolved. This simplifies the process of responding to regulatory audits and demonstrates adherence to the AI Act’s requirements.

By combining a DWH with an AI-Mart for AI training data and loading AI model logs into the same infrastructure, businesses can build a comprehensive framework for compliance with the AI Act. This approach supports data governance, ensuring high-quality data for AI models, and ensures traceability, allowing businesses to track and audit every aspect of their AI systems. This not only meets regulatory requirements but also fosters trust and accountability in the use of AI technology.

Upcoming Resources and Events

English Webinar: Join our upcoming session on “AI Act Insights: Ensuring Responsible AI for your Business” November 26th, 2024, 2pm CET
German Webinar: Join our upcoming session on “AI Act im Fokus: Regulierung und Chancen für Unternehmen” November 27th, 2024, 2pm CET
English Whitepaper: Preparing for the AI Act: A Practical Checklist
German Whitepaper: Vorbereitung auf den AI Act: Eine Checkliste

For more information, contact our team at [email protected].

Final Remarks from the Authors

The AI Act should not only be seen as a regulatory challenge but also as an opportunity for businesses to differentiate themselves by adopting trustworthy and ethical AI practices. As AI continues to evolve, businesses that prioritize compliance, transparency, and human oversight will be better positioned to thrive in the coming years. By taking proactive steps now to ensure compliance, businesses can turn AI regulation into a strategic advantage, building trust with customers, partners, and regulators alike.

– Ulf Mattern (Scalefree) & Dr. Céline Helmschrot (GÖHMANN Rechtsanwälte)

Building a scalable Data Platform? In Data Vault Friday

Identifying Non-Historized Links

Watch the Video

Identifying Non-Historized Links in Data Vault

In data management, particularly in the Data Vault 2.0 methodology, understanding how to handle non-historized links can be essential for maintaining accurate, traceable records. In today’s Data Vault Friday, we’ll cover the concept of non-historized links, explore the unit of work in data storage, and discuss which elements should be included in link hashes.

In this article:

Non-Historized Links Background: Understanding the GL Table
Defining the Unit of Work in Data Vault
Testing the Structure of the Unit of Work
Which Values Should Be Hashed in the Link Hash?
In Practice: Setting Up Non-Historized Links
How to Submit Your Questions
Conclusion
Meet the Speaker

Non-Historized Links Background: Understanding the GL Table

Imagine we’re working with a General Ledger (GL) table for storing financial transactions, which includes over a billion rows. Each record in this table represents a transaction and includes attributes such as:

TransactionID: A unique identifier for each transaction.
AccountSegment: A business key identifying the account involved.
AccountingDocument: Another business key associated with the document related to the transaction.
AccountingPeriod: A reference key indicating the period of the transaction.
DebitAmount and CreditAmount: The monetary amounts for each transaction.
TransactionTimestamp: The exact date and time of the transaction.

While the TransactionID alone uniquely identifies each transaction, a combination of AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp also provides a unique identifier for each entry. This setup raises questions about how to best organize, hash, and manage these records within a Data Vault model, specifically in terms of defining a Unit of Work and deciding which values should be included in a link hash.

Defining the Unit of Work in Data Vault

The Unit of Work is a concept used in Data Vault to determine which business keys need to be grouped together in a single record. In this context, we’re focusing on how these business keys—AccountSegment, AccountingDocument, AccountingPeriod, and TransactionTimestamp—should be structured and managed within a non-historized link.

According to Michael Olschimke, the CEO of Scalefree, the Unit of Work is primarily relevant when dealing with standard relationship-based links. In our example, there’s no need to split the link between these business keys because the link can capture the original granularity of each transaction from the source data. This means that by keeping all four attributes within the same link, we avoid disrupting the Unit of Work and ensure a coherent data structure.

Testing the Structure of the Unit of Work

One way to verify if we’re correctly applying the Unit of Work is to check if the data model enables us to reconstruct the original data source without loss of records or attributes. If we find that we can reconstruct the data accurately using the non-historized link and potentially a Satellite table later, then we’re likely adhering to the Unit of Work correctly.

Since the attributes in our case uniquely identify each transaction, we can assume the Unit of Work is preserved, allowing for a stable and consistent structure in the data model.

Which Values Should Be Hashed in the Link Hash?

The next question concerns which values we should include in the link hash. Generally, the hash key in a link table is derived from the business keys of referenced hubs. In this case, the primary question is whether to hash just the AccountSegment, AccountingDocument, and AccountingPeriod, or if we should also include the TransactionID.

Olschimke suggests a practical approach here: while we could technically get by with hashing either the TransactionID alone or the combination of the other three elements, it’s often better to include all relevant attributes in the hash. This includes the TransactionID along with AccountSegment, AccountingDocument, and AccountingPeriod.

Here’s why:

Consistency with Automation Tools: Many Data Vault automation tools automatically include all hub references and their business keys in the hash computation. Following this approach aligns with standard automation practices, making it easier to work with automation tools later.
Minimizing Potential Issues: Including all attributes in the hash computation reduces the risk of data loss or accidental data duplication. It ensures that our hashes accurately represent each unique record.
Negligible Performance Impact: While adding an extra attribute to the hash computation may slightly increase the string length, the performance impact is minimal. In large-scale data processing, this small change typically does not result in significant slowdowns.

In Practice: Setting Up Non-Historized Links

In Data Vault, non-historized links are a powerful tool for managing large datasets without versioning each record. By including all four attributes in the hash computation, we establish a stable link structure that reflects the source data’s original granularity. This approach allows us to confidently manage and retrieve accurate records without fear of unintentional data loss.

For practitioners, setting up these non-historized links can be straightforward when using automation tools that handle most of the heavy lifting. It’s generally recommended to follow the automation tools’ conventions, especially for larger datasets, as this minimizes discrepancies and facilitates smoother data processing in the long run.

How to Submit Your Questions

If you’re working with Data Vault and have a question similar to this one, Data Vault Friday sessions provide an opportunity to get expert insights. To submit a question, you can use the form, which allows you to attach diagrams, whiteboard images, or other supporting materials for clarity. Alternatively, for faster responses, you can post questions on the Data Innovators Exchange, where consultants actively respond and discuss Data Vault topics.

For those interested in more in-depth training, Scalefree also offers webinars, including sessions on WhereScape and dbt, which run monthly. These provide valuable resources for anyone looking to deepen their Data Vault expertise.

Conclusion

Identifying non-historized links and deciding on a consistent hashing approach can significantly impact data integrity and model reliability. By adhering to Data Vault principles and utilizing automation tools effectively, you can manage vast amounts of data without sacrificing accuracy or consistency.

With the right approach, you’ll be able to create a resilient, scalable data model that aligns with your organization’s needs and remains flexible for future changes. If you’re diving deeper into Data Vault, stay tuned for more Data Vault Friday insights and don’t hesitate to participate in the community for ongoing support and expertise.

Meet the Speaker

Michael Olschimke

Lorenz Kindling In Data Vault Friday

Best Practices for Managing Costs in Data Warehousing

Watch the Video

Best Practices for Managing Costs in Data Warehousing

In the world of data warehousing, managing and optimizing costs is essential, particularly as more organizations move their operations to the cloud. The shift to cloud data platforms like Snowflake, Databricks, and Azure has opened new opportunities for scaling data operations. However, it has also introduced new challenges in terms of cost management. Let’s explore key strategies and best practices for keeping data warehousing costs under control while maintaining high-quality, reliable data operations.

In this article:

Why Cost Monitoring Matters in Data Warehousing
Establish Clear Ownership and Responsibility
Set Expiration Dates for Projects and Reports
Use Tags to Track Resources and Cost Allocation
Define Purpose and Value of Reports and Data Products
Implement Cost Monitoring and Alerts
Regularly Review Cost Allocation and Query Performance
Best Practices for Cost Efficiency in Data Warehousing
Conclusion

Why Cost Monitoring Matters in Data Warehousing

Cost monitoring often becomes a focus only after a project has started and costs have begun to accumulate. However, implementing cost control early on can yield substantial savings. Statistics from AWS and Gartner underscore the importance of cloud cost management: organizations can reduce monthly cloud costs by 10-20% with monitoring tools, and companies with cloud optimization strategies may see savings of up to 30%.

These savings underscore the significance of early planning, as businesses that establish cost management measures from the outset are far better positioned to maintain budget predictability. Let’s dive into actionable strategies to keep data warehousing costs in check.

Establish Clear Ownership and Responsibility

One of the first steps in cost management is defining who is responsible for each aspect of the data warehouse. Often, data projects start with a business use case but lack a clear person to oversee costs. Every data product, data source, and even individual data warehouse instance should have an assigned data owner. By giving someone ownership, you create accountability, and there’s always a go-to person to consult when costs spike.

Set Expiration Dates for Projects and Reports

Data reports and dashboards created for specific projects can linger long after they’re needed, consuming unnecessary resources. To avoid this, establish “end dates” for each project. For instance, a report created for a finance analysis in 2024 may not be relevant after that year ends. By checking with departments to verify report usage periodically, you can ensure that outdated reports are retired, freeing up resources and reducing costs.

Use Tags to Track Resources and Cost Allocation

Modern data platforms like Snowflake and Databricks allow you to tag resources for easier tracking. Tagging can be done at various levels—by department, project, or cost center. This makes it easier to allocate costs to specific business functions and track where expenses are going. However, be strategic with tags. A thoughtful, organized approach to tagging can streamline reporting and give you a clearer picture of how resources are used across the organization.

Define Purpose and Value of Reports and Data Products

When creating new reports or data products, always define their purpose. Determine how long they’ll be useful and assess their business value. Having a clear understanding of why each data product exists ensures resources are only allocated to valuable outputs, preventing unnecessary data processing and storage costs.

Implement Cost Monitoring and Alerts

Cost monitoring should be integrated directly into your data warehouse operations. Start by defining the key performance indicators (KPIs) for cost monitoring, such as monthly costs per tag or project. Build a dashboard that visualizes these metrics, making it easier to track costs at a glance. Additionally, set up budget alerts to notify you of any significant changes or cost surges.

For instance, Snowflake and other cloud platforms allow you to set automated alerts for query runtimes, storage limits, and overall costs. This is particularly useful for identifying high-cost queries or storage use that might need optimization.

Regularly Review Cost Allocation and Query Performance

Set up regular monthly reviews with your DevOps team to evaluate your data warehouse’s costs and budgets. During these reviews, discuss areas where you can optimize queries or resource use. Identifying expensive or long-running queries is critical. Optimizing them can have a noticeable impact on your overall budget.

Best Practices for Cost Efficiency in Data Warehousing

1. Involve Stakeholders in Cost Management

Stakeholders should be aware of the cost implications of the reports they request. Make them part of the conversation, helping them understand which reports are more costly and the associated budget impacts. This can make it easier to justify the costs and encourage stakeholders to make more cost-effective choices.

2. Set Up Budget Alerts

Budget alerts are essential for staying within allocated funds. Use them to monitor query and storage costs, and receive notifications if any thresholds are breached. This can prevent unexpected spikes in expenses.

3. Create a Cost Dashboard

Establish a dashboard that visualizes real-time costs and usage statistics. This is especially straightforward in platforms like Snowflake, where dashboards can display resource costs in an easily digestible format. Regularly viewing this dashboard can help your team make timely adjustments to reduce expenses.

4. Monitor and Optimize Queries

Query monitoring is essential. Keep an eye on long-running or high-cost queries, as they often account for a significant portion of the total expenses. Optimizing these queries can substantially reduce costs.

5. Apply Data Vault Techniques for Efficiency

Data Vault methodology brings several benefits for cost efficiency. Its standardization and automation in development reduce manual effort, lowering overall project costs. The agile approach of the Data Vault, with its “Tracer Bullet” development, ensures that you deliver business value early, which helps justify costs to stakeholders.

Additionally, Data Vault supports GDPR compliance and auditability, reducing the risk of costly legal issues. Its approach to parallel loading, materialization, and the use of PIT and Bridge tables enables efficient data processing, minimizing runtime and storage needs.

6. Follow the Pareto Principle in Cost Optimization

In cost monitoring, the Pareto Principle often applies. Focus on the top 20% of queries or tables that account for 80% of costs. By targeting optimizations to these high-cost items, you can achieve significant cost savings.

Conclusion

Effective cost management in data warehousing requires early planning, stakeholder involvement, and regular monitoring. By establishing clear ownership, tagging resources, setting budget alerts, and leveraging Data Vault principles, you can maintain cost-effective data operations that continue to deliver business value. Implement these practices to ensure your data warehousing operations remain scalable, efficient, and aligned with your organization’s budgetary goals.

If you’d like to learn more about optimizing data warehousing costs, check out our other posts or join us for next week’s Data Vault Friday session!

Building a scalable Data Platform? In Data Vault Friday

Snapshot Based Bridge Table on Link and Effectivity Satellite

Watch the Video

Snapshot Based Bridge Table on Link and Effectivity Satellite

Welcome to another edition of Data Vault Friday! Today, we’re diving into the concept of a snapshot-based bridge table, particularly focusing on its application in scenarios with links and effectivity satellites. This approach helps us handle complex relationships between business objects over time, managing changes in relationships, and retaining a complete historical view. In this article, we will explore solutions to three main questions:

How to handle a missing relationship on day 2 compared to day 1
How to manage changed relationships from day 1 to day 2 for the same key
How to include a business object (A3) on day 2 when it lacks a relationship to any object in B

In this article:

Understanding the Data Vault Modeling Example
- Examining the Relationship Changes
Creating the Link and Effectivity Satellite
Building the Snapshot-Based Bridge Table
Handling Missing Data and Reinstating Relationships
Incorporating A3 in the Bridge Table on Day 2
Preventing Cartesian Products and Other Best Practices
Conclusion
Meet the Speaker

Understanding the Data Vault Modeling Example

In our example, we have two sources, each representing a different business object (A and B). Business object A is represented as a static dataset, while business object B shows dynamic relationships to A over three days. Each day brings changes in the relationships, with some entries disappearing or shifting. Our task is to capture these changes effectively in a snapshot bridge table.

Examining the Relationship Changes

Let’s walk through the changes observed over three days:

Day 1: B1 is related to A1, B2 to A2, and B3 to A3.
Day 2: The relationships change. B1 is now related to A2, B2 to A1, and B3 disappears.
Day 3: No further changes occur compared to Day 2.

Our objective is to document these relationships using an Effectivity Satellite and a Bridge Table, enabling us to query the state of relationships as they existed on each day.

Creating the Link and Effectivity Satellite

The first step in capturing these relationships is to create a Link table that holds a distinct list of relationships between business objects. Links should not have additional metadata, such as validity dates, as this can complicate data retrieval and reduce performance. In our example, the Link table captures each unique combination of A and B keys but does not record their start or end dates.

Next, we create an Effectivity Satellite. This table extends the Link by recording each relationship’s start and end timestamps, as well as an “is_active” flag to indicate the current status of each relationship. Using this table, we can track when a relationship starts, changes, or ends. Let’s examine how this works:

Day 1 entries: All relationships (B1-A1, B2-A2, B3-A3) are marked as active.
Day 2 entries: New relationships (B1-A2 and B2-A1) are added and marked active, while previous relationships (B1-A1 and B2-A2) are marked inactive. B3-A3 is removed entirely.

Building the Snapshot-Based Bridge Table

With the Effectivity Satellite in place, we can now create a Bridge Table that snapshots the active relationships for each day. This table provides a point-in-time view of the relationships as they existed on a particular day. Let’s look at how the Bridge Table is created:

Day 1 Snapshot

The Day 1 snapshot reflects the initial relationships, pulling all active records from the Effectivity Satellite. At this stage, B1-A1, B2-A2, and B3-A3 are all active.

Day 2 Snapshot

In the Day 2 snapshot, only the relationships B1-A2 and B2-A1 remain active, while B3-A3 is removed. By applying a filter to include only active entries, the snapshot accurately represents the relationships on Day 2.

Day 3 Snapshot

Day 3’s snapshot is identical to Day 2, as no additional changes were made. The active relationships B1-A2 and B2-A1 remain unchanged.

This process ensures we have a clear audit trail of relationship changes and deletions. Each day’s snapshot represents the state of relationships at that point in time, without introducing redundant data.

Handling Missing Data and Reinstating Relationships

One critical feature of the Effectivity Satellite is the ability to manage reinstated relationships. For instance, if B1-A1’s relationship is reestablished on Day 4, we add new rows in the Effectivity Satellite, marking the previous active combination (B1-A2) as inactive and reactivating B1-A1. This dynamic structure makes it easy to adjust for relationships that appear, disappear, and reappear over time.

Incorporating A3 in the Bridge Table on Day 2

A common challenge is how to incorporate business objects like A3 on Day 2, despite having no relationship in B on that day. In Data Vault, this is often addressed in downstream queries or report joins rather than in the Bridge Table itself.

By starting with object A and performing a left join to the Bridge Table, you will include all records from A (even if they don’t appear in B). This ensures that unlinked objects are included in the results, with their B relationships shown as null, or as a placeholder if desired.

Preventing Cartesian Products and Other Best Practices

When using a Bridge Table, it’s essential to avoid unexpected Cartesian products, which can inflate data during aggregation. Always check the cardinality of relationships between objects (e.g., one-to-one or many-to-many) to ensure joins occur only on necessary keys. The driving key, which anchors the relationships in the Link Table, should be the primary focus, especially in cases of many-to-many relationships.

This method ensures accuracy and performance when aggregating data and avoids inflated results in reporting.

Conclusion

Snapshot-based bridge tables are powerful tools in Data Vault modeling for handling changing relationships and tracking historical snapshots. By carefully structuring Links, Effectivity Satellites, and Bridge Tables, we create a robust, auditable trail of data changes over time. As we’ve discussed, this approach allows us to accommodate missing data, reinstate relationships, and prevent data inflation, ensuring reliable, performant data models.

If you’d like further assistance with templates for bridge tables or Effectivity Satellites, feel free to reach out to me. I hope this discussion has provided clarity on using bridge tables in Data Vault. Join us again for more Data Vault insights, and feel free to submit your questions!

Meet the Speaker

Marc Winkelmann

Marc is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016 he is active in consulting and implementation of Data Vault 2.0 solutions with industry leaders in manufacturing, energy supply and facility management sector. In 2020 he became a Data Vault 2.0 Instructor for Scalefree.

Building a scalable Data Platform? In Data Vault Friday

Learning the Data Vault Patterns

Watch the Video

Learning the Data Vault Patterns with Azure Synapse

When it comes to managing complex, evolving data landscapes, implementing a Data Vault architecture is a popular choice. The Data Vault approach is especially helpful in handling large volumes of data while maintaining flexibility, scalability, and historical tracking. Recently, a question was raised regarding the use of Azure Synapse Analytics with Data Vault for the initial Raw Data Vault development without a finalized automation tool. Specifically, the question was: Is it feasible to start development with Synapse Notebooks using PySpark and Delta Lake?

This article will explore the answer to that question and dive into key considerations, best practices, and strategies to make the most of a manual setup while preparing for eventual automation in your Data Vault implementation.

In this article:

Is It Feasible to Start Data Vault with PySpark and Delta Lake?
Benefits of Starting Manually with Synapse, PySpark, and Delta Lake
The Drawbacks of a Manual Approach
Steps to Optimize Your Manual Development Process
Preparing for Automation: What to Keep in Mind
Moving Forward: Transitioning to Automation
Final Thoughts
Meet the Speaker

Is It Feasible to Start Data Vault with PySpark and Delta Lake?

The short answer is yes—starting with Synapse Notebooks, PySpark, and Delta Lake is indeed a viable option. In fact, this manual approach can be an excellent way to familiarize your team with Data Vault concepts, patterns, and methods before committing to an automation tool. By manually building the Raw Data Vault, your team can gain hands-on experience with essential processes, which will make the transition to automation smoother and more effective later on.

Historically, many Data Vault practitioners began with manual scripting due to limited automation tools. Over time, this “manual-first” method became a useful way to learn the intricate patterns of Data Vault. Today, automation tools for Data Vault are abundant, and using them is generally more efficient, but there’s still a place for manual methods, especially in the early learning stages of a project. Let’s look closer at why this approach works and what to consider as you work manually.

Benefits of Starting Manually with Synapse, PySpark, and Delta Lake

Using PySpark and Delta Lake in Azure Synapse Notebooks gives your team flexibility to:

Learn Core Data Vault Patterns: Building the Raw Data Vault manually helps the team understand Data Vault concepts, such as Hubs, Links, and Satellites. This is crucial knowledge that will benefit the project long-term.
Experiment with Modeling: Working without automation allows you to refine your approach and test different design patterns. This is especially helpful in creating a foundation that’s tailored to your organization’s specific needs and datasets.
Understand Data Transformation and Ingestion: By manually scripting transformations and ingesting data, your team will better understand the processes that an automation tool would handle. This will help in configuring automation later and troubleshooting any issues that arise.
Validate Requirements and Patterns: Since no tool has been chosen yet, working manually allows you to get a head start on modeling and confirm your business and technical requirements early in the project.

The Drawbacks of a Manual Approach

While starting manually has its advantages, it’s important to be aware of the limitations. The primary drawbacks of a manual approach are:

Time and Effort: Developing the Raw Data Vault by hand is time-consuming. Each process, from creating Hubs to tracking Satellites, requires careful attention to ensure the design aligns with Data Vault standards.
Limited Scalability: A manual setup is challenging to scale, especially as the data volume grows. While PySpark and Delta Lake are powerful tools, they’re not a substitute for the scalability offered by automation tools.
Risk of Technical Debt: Scripts developed manually might not be as maintainable or reusable as the templates generated by automation tools. Technical debt can accumulate if the team spends too much time maintaining manual scripts, and transitioning to automation could later require extensive rework.

Steps to Optimize Your Manual Development Process

If you decide to move forward with this manual approach, here are some strategies to make the process more efficient and to set a foundation for a future transition to automation:

Document Your Patterns Thoroughly: Take detailed notes on the specific design patterns, scripts, and models you develop manually. These can serve as templates when you move to automation, making the transition much easier.
Define Clear Modeling Standards: Establish consistent modeling practices for Hubs, Links, and Satellites. This will reduce ambiguity and provide a structured foundation for automation tools to build upon later.
Refine and Iterate: Since you’re building manually, use this time to refine your models. Adjust and improve them based on the unique data flows and needs of your organization.
Focus on Core Entities: Prioritize building out core Hubs and Links in your Raw Data Vault, focusing on the entities that are most crucial to your organization. This will create a solid foundation that can be extended as you move to automation.

Preparing for Automation: What to Keep in Mind

Even as you start manually, keep automation in mind as a goal. Today, there are numerous Data Vault automation tools available, each with its own strengths and weaknesses. As you prepare for this transition, here are some key considerations:

1. Research Automation Tools

Explore the different automation tools available on the market. Each tool has its own approach, interface, and capabilities, so it’s essential to choose one that aligns well with your organization’s technical needs, budget, and data infrastructure. Some tools focus on business-user accessibility, while others offer more technical configurations. Common tools for Data Vault automation in Azure include solutions specifically designed for Synapse or those that support PySpark and Delta Lake.

2. Select Tools with Scalability

When selecting a tool, consider how it handles scalability, as this is vital for supporting the growing volume of data in a Data Vault. Some automation tools handle scalability better than others, depending on how they manage Hubs, Links, and Satellites. In Azure Synapse, it’s also important to assess compatibility with Delta Lake and PySpark, as well as overall integration with Azure’s data ecosystem.

3. Account for Tool Limitations

Even the best automation tools have limitations. Be prepared to adjust your approach based on the chosen tool’s capabilities. For example, some tools may limit specific patterns, such as complex Satellites or multi-active relationships. By understanding these limitations ahead of time, you can avoid rework and ensure that your initial manual development aligns well with the selected tool.

4. Focus on Configurability and Customization

Ensure that your chosen tool allows for some level of customization. This is important because the patterns you develop manually may need to be fine-tuned or adjusted within the tool. Look for tools that offer configurable templates, adaptable interfaces, and support for customizations to fit your organization’s specific needs.

Moving Forward: Transitioning to Automation

As your team becomes familiar with Data Vault patterns through manual development, the next step is to select and implement an automation tool. While the manual work provides a deep understanding of Data Vault patterns, an automation tool will streamline repetitive processes, ensure consistency, and save significant time and effort as your data volume grows.

One recommended approach is to use the lessons learned during manual development to create tailored templates and workflows within your chosen tool. By doing so, you can optimize the automation tool’s capabilities based on the patterns you’ve already tested and refined. This makes for a smoother, more effective transition from manual development to automated workflows.

Final Thoughts

Starting Data Vault development manually with Synapse Notebooks, PySpark, and Delta Lake is a feasible and often beneficial approach, especially when automation tools haven’t been finalized. While this method is time-consuming and requires effort, it offers valuable insights and allows your team to learn and optimize Data Vault patterns before committing to an automation tool.

Remember, the goal is to use this manual phase to build a strong foundation, explore modeling choices, and establish best practices. When the time comes to select an automation tool, your team will be well-prepared to leverage it to its fullest potential, ensuring a scalable and efficient Data Vault implementation within Azure Synapse Analytics.

Meet the Speaker

Michael Olschimke

Building a scalable Data Platform? In Data Vault Friday

Handling JSON Data in Data Vault Satellites on Snowflake

Watch the Video

Handling JSON Data in Data Vault Satellites on Snowflake

In this blog post, we’ll discuss the challenges and best practices for handling JSON data in Data Vault architectures, specifically when using Snowflake. This question from our audience raises a common scenario in modern data environments where semi-structured data is prevalent. Let’s dive into how storing JSON in Data Vault Satellites works, when to extract fields, and the performance considerations for Snowflake.

In this article:

Why Store JSON in the Satellite?
Choosing When to Extract Fields from JSON
JSON and Snowflake: Optimized for Performance
Handling Business Logic in the Business Vault
Structuring Semi-Structured Data Over Time
Privacy and Security Considerations with JSON in Satellites
Working with JSON Arrays and Nested Objects
Virtualizing JSON Access in Snowflake
Conclusion
Meet the Speaker

Why Store JSON in the Satellite?

With the increasing use of APIs and complex data integrations, many businesses are receiving data as JSON. JSON is a flexible, semi-structured format that allows for varying structures and nested elements, making it ideal for some applications. However, this flexibility also introduces complexities when using a structured data model like the Data Vault.

The challenge here is that when you receive a JSON payload, it may contain hundreds of fields, many of which may not be immediately needed. Extracting every JSON field to store as individual columns in the Raw Data Vault can be inefficient and unnecessary. Therefore, one solution is to store the JSON “as-is” in the Satellite and extract only what’s required later in the Business Vault.

Choosing When to Extract Fields from JSON

As a general rule, it’s beneficial to extract attributes that are needed for reporting, security, or compliance early in the process. Relational attributes stored directly as columns in the Raw Data Vault are easier to query, and they’re stored more efficiently in structured databases.

For other fields in the JSON payload that are less critical or not immediately required, you can keep the data in its JSON format. This approach minimizes upfront work and allows for flexible data exploration later without overloading the Raw Data Vault with unnecessary fields.

For example, consider an e-commerce environment where transaction data is received in JSON format. The Raw Data Vault could extract only essential fields like transaction ID, customer ID, and total amount while leaving other detailed information about the items purchased in JSON format.

JSON and Snowflake: Optimized for Performance

One of Snowflake’s strengths is its ability to store and query JSON data natively. Snowflake’s support for semi-structured data formats like JSON and its efficient handling of this data type make it possible to perform queries directly on JSON attributes without needing to first extract them into columns.

This capability allows you to store JSON data in the Raw Data Vault and query it directly, even in production, without significant performance penalties. This approach works well for data that may only be needed occasionally or where performance is not a major concern. Snowflake’s built-in functions for working with JSON make it easy to extract specific attributes on demand and use them in downstream reporting.

Handling Business Logic in the Business Vault

In Data Vault, business rules are often applied within the Business Vault layer, which is designed for derived and calculated data. When working with JSON in this context, it’s common to keep the original JSON structure in the Satellite and then perform transformations as needed in the Business Vault.

In Snowflake, you can take advantage of virtual views to simplify your data pipeline. Rather than physically extracting and storing each JSON attribute in the Business Vault, you can create virtual views that access JSON fields as needed. This approach offers the benefit of flexibility and reduces storage requirements since you avoid duplicating data.

Structuring Semi-Structured Data Over Time

Storing JSON in the Satellite provides flexibility, but as your business needs evolve, you may find that some attributes become more relevant for analysis. At this point, you may consider structuring your JSON data incrementally, extracting only the fields that are frequently queried or required for compliance.

This gradual structuring process aligns with the Data Vault’s design principles. Over time, your semi-structured data will naturally become more organized as business requirements clarify which fields are necessary for analytics or reporting. You can continue to leave less critical fields in JSON format, maintaining flexibility while optimizing performance where it matters.

Privacy and Security Considerations with JSON in Satellites

One important consideration when storing JSON data in the Satellite is ensuring compliance with data privacy regulations such as GDPR. It’s essential to classify personal or sensitive data in the JSON payload and handle it appropriately.

In cases where sensitive data is involved, you may choose to split your Satellites by data classification, isolating high-sensitivity fields into their own Satellite tables with stricter access controls. Alternatively, you can extract and separate specific fields that require special handling and leave the rest in JSON format.

Working with JSON Arrays and Nested Objects

JSON data often includes nested structures, such as arrays, that add complexity to handling data in a relational model. For instance, customer records might contain an array of phone numbers or addresses. In these cases, Snowflake offers tools for managing nested data without requiring extensive restructuring.

One option is to store the entire JSON array as-is within a JSON attribute in the Satellite. For example, a JSON array containing multiple phone numbers could be stored directly in the Satellite, eliminating the need for a separate table to manage multi-active data. This approach provides a simpler alternative to the traditional multi-active Satellite design.

Snowflake allows you to work with nested JSON data using its FLATTEN function, which simplifies querying nested structures without requiring complex joins or data transformations. This method can be particularly useful when dealing with highly nested data, especially in real-time applications.

Virtualizing JSON Access in Snowflake

Snowflake’s JSON handling capabilities allow you to query JSON fields dynamically, making it possible to virtualize the access to JSON attributes in your Satellite tables. Virtual views can be used to transform JSON data for reporting without needing to physically extract every field into columns.

This approach offers significant flexibility. By keeping the JSON data in its original form, you avoid unnecessary data transformations and can leverage Snowflake’s powerful JSON functions to create reports on demand. Virtualization is an effective way to simplify your data pipeline, particularly when dealing with complex JSON structures.

Conclusion

Storing JSON in the Data Vault Satellite layer on Snowflake is a sound approach, especially when dealing with semi-structured or complex data. By extracting only the fields that are critical and leaving the rest in JSON format, you balance flexibility with efficiency. Snowflake’s support for JSON makes it possible to work with this data in real-time, leveraging virtual views to keep your pipeline agile.

Consider implementing a mix of JSON storage and relational attributes, structuring your data incrementally as your business needs evolve. With the right approach, Snowflake’s JSON capabilities allow you to handle diverse data types efficiently, supporting both compliance requirements and analytics needs.

Meet the Speaker

Michael Olschimke

Isabel Mateus In Data Tools, dbt blog article, Intermediate

Data Migration – Ensuring Data Accuracy and Compliance During a Migration

Data Migration

Data migration is a complex process that requires careful planning and execution. Understanding the data landscape, ensuring minimal downtime, managing stakeholder expectations, and most importantly, maintaining the integrity and security of your data throughout the transition are critical. Failing to address these factors can lead to data loss, corruption, or non-compliance with regulatory standards, which can have significant business implications. In a worst-case scenario, stakeholders may notice data issues before the data team does, decreasing trust in the data and the team. Another potential problem, having to work overtime because of data issues which were not noticed before.

Features embedded within dbt (Data Build Tool) and tools like Great Expectations offer powerful solutions to help organizations manage these risks, ensuring that the data remains reliable and compliant as it moves through the migration process.

Ensuring Data Accuracy and Compliance During a Migration: Leveraging dbt and Great Expectations

This webinar will cover the essential aspects of maintaining data accuracy and compliance throughout the data migration process. We will explore how dbt (Data Build Tool) enables robust data transformation with built-in and custom tests, ensuring data integrity at each stage. Additionally, we will demonstrate how Great Expectations enhances data validation, allowing you to enforce specific rules and expectations, ensuring a smooth and secure migration with minimal risk of errors or inconsistencies.

Watch Webinar Recording

In this article:

Leveraging dbt for Data Accuracy
dbt Contract enforcement
dbt-expectations vs Great Expectations
- dbt-expectations
Why Consider Integrating Great Expectations?
Key Features of Great Expectations:
Integrating Great Expectations with dbt

Leveraging dbt for Data Accuracy

A powerful tool for data transformation: dbt enables teams to build, test, and document data pipelines. By utilizing its features, such as tests and contracts, dbt ensures data consistency and accuracy. We’ll explore these capabilities in detail below.

dbt offers two ways to define tests, singular and generic data tests.

Singular data tests: Custom SQL query that is written to test a specific condition or logic in the data. It is highly tailored to a particular use case or business logic. In essence, it’s a standalone test where the developer writes custom SQL to check for specific data anomalies or inconsistencies.
Generic data tests: Pre-defined and reusable tests that can be applied to multiple models or columns across different datasets.

Examples of Generic Tests:

Unique Tests: Ensure that a field in your dataset contains unique values, which is critical for primary key fields
Not Null Tests: Validate that a field does not contain any null values
Referential Integrity Tests: Checks that foreign key relationships are maintained, ensuring consistency across related tables
Accepted values: Useful tests for columns which receive predictable data

Tests can be configured to: either fail (severity: error) or issue a warning (severity: warning). Conditional expressions such as error_if and warn_if can refine this behavior, e.g., triggering a warning only after a certain number of failures.

dbt Contract enforcement

Enforces that dbt model schemas adhere to predefined data structures
Defines specific columns, data types, and constraints (e.g., not null, unique)
Raises errors before materializing the model as a table, allowing identification of schema issues

Pro-tip for data migrations: use incremental models to update only new or modified records, which improves efficiency and avoids full table rebuilds. When enforcing a contract, the “append_new_columns” option is useful as it retains old columns, minimizing issues. The “sync_all_columns” setting is particularly handy for automatically adjusting the schema by adding new columns and removing missing ones, making it ideal for migrations with frequent renaming.

dbt-expectations vs Great Expectations

dbt-expectations integrates data quality tests into dbt, while Great Expectations provides a broader framework for managing data validation across various sources. Together, they enhance data accuracy and reliability.

dbt-expectations

The dbt-expectations package extends dbt’s testing capabilities by providing a collection of pre-built, customizable data quality tests inspired by Great Expectations. This package helps automate and standardize data quality checks across multiple models, ensuring that datasets meet specific expectations before they are used in downstream processes.

Here are some examples of data quality tests you can run using the dbt-expectations repository (we are going to cover more in the webinar):

Expect_column_values_to_match_regex: Verifies that all values in a column match a given regular expression pattern
Expect_column_median_to_be_between, expect_column_min_to_be_between, expect_column_max_to_be_between: Ensures numeric column values fall within specified ranges
Expect_column_pair_values_a_to_be_greater_than_b: Checks that values in one column are greater than values in another

Why Consider Integrating Great Expectations?

With dbt-expectations providing robust testing within a single dbt project, you might wonder why you’d want to integrate Great Expectations. Here’s why:

Cross-Database Comparisons: dbt-expectations works well within a single SQL-based data warehouse. However, if you need to compare data across different databases (like Snowflake and SQL Server), Great Expectations offers a broader solution.
Broader Data Validation: Great Expectations supports multiple data sources, including CSV, Parquet, JSON, APIs, and various SQL databases. It provides a flexible and user-friendly platform to define, manage, and execute data quality tests across diverse sources.

Key Features of Great Expectations:

Data Profiling: Before starting your migration, use GE to profile your data and set expectations based on its current state.
Detailed Validation Reports & Dashboards: GE offers comprehensive reports and visualizations, outputting results in formats like HTML, Slack, JSON, and Data Docs. This enhances transparency and provides deeper insights for both technical and non-technical stakeholders.
Customizability and Extensibility: GE allows you to define custom expectations tailored to your data pipeline and integrate with other testing libraries.
Version Control & Historical Validation: Track changes in data quality over time with version control, helping to identify trends and recurring issues.
Production Monitoring & Integration: Integrate GE with data orchestration tools like Airflow, Prefect, or Dagster to incorporate data quality checks into your broader workflows, including those not managed by dbt.

Integrating Great Expectations with dbt

Great Expectations complements dbt by offering a flexible platform for data validation beyond single-project scenarios. By integrating GE with dbt, you can achieve a more comprehensive approach to data quality, ensuring your migration process is as smooth and reliable as possible.

In the upcoming webinar, we will explore practical examples of dbt tests, dbt-expectations, and Great Expectations validations, so stay tuned!

Building a scalable Data Platform? In Data Vault Friday

(Single) Point of Facts

Watch the Video

Single Point of Facts in Data Lakehouse Architecture

Welcome to another edition of DataVault Friday! Today, we’re diving into a frequently debated topic in data management: determining the “source of truth” in a data lakehouse architecture. Specifically, we’ll answer whether that source of truth resides in data lake files or in the tables of the Raw Vault and Business Vault. Additionally, we’ll address what to do if a bug in the ingestion framework requires a re-ingestion of data sources. Let’s explore these questions to better understand data lineage, data governance, and strategies for a reliable and flexible data ecosystem.

In this article:

Understanding the "Single Point of Facts"
Data Lakehouse Architecture and Points of Truth
What If There’s a Bug in the Ingestion Framework?
Data Lake vs. Raw Vault: Which Is Easier for Reconstruction?
Best Practices for Managing Points of Fact
Conclusion
Meet the Speaker

Understanding the “Single Point of Facts”

In traditional data warehousing, the idea of a “single version of the truth” is well-known. This concept implies that there is one version of the data that acts as the definitive source across an enterprise. For instance, a single “customer” or “product” definition applies universally within the organization.

However, in Data Vault architecture, we move from a “single version of the truth” to a “single point of facts.” The focus shifts from universal definitions to an unaltered, auditable record of events. Data Vault is designed to capture historical data accurately and reliably. It provides multiple perspectives on the data (versions of the truth) by isolating raw data from any business logic or transformations. This flexibility allows organizations to apply different business rules depending on context, while maintaining a consistent underlying dataset.

In this context, the Raw Vault is considered the foundational layer, capturing facts as they are, directly from the source systems. The Business Vault, on the other hand, introduces additional business rules, metrics, and aggregated data for reporting purposes. But in essence, the “single point of facts” remains within the Raw Vault because it represents an unaltered and auditable record.

Data Lakehouse Architecture and Points of Truth

In data lakehouse architecture, data is stored both in a data lake and within Data Vault tables. This raises the question: Which source is the ultimate truth? The data lake, with its raw files, or the Raw Vault tables?

The answer depends on the architectural requirements and the level of traceability and auditability needed. Ideally, both systems should mirror each other and serve as points of fact:

Data Lake: The data lake serves as a repository for raw data files, often storing snapshots or full loads of data from source systems. This makes it easier to preserve the original data as-is without alteration.
Raw Vault: In the Raw Vault, data is loaded into a structured schema, capturing the same original details but in a way that can be more systematically queried and analyzed. Like the data lake, the Raw Vault stores unmodified facts, but it also preserves lineage information, making it possible to reproduce deliveries and trace data transformations.

Since both layers should hold the same underlying data, they collectively represent the point of fact. Either the data lake or Raw Vault can serve as the truth source, depending on the scenario. This dual system ensures a resilient architecture, as data can be cross-validated across layers.

What If There’s a Bug in the Ingestion Framework?

One key question that arises is: What happens if there’s a bug in the ingestion framework? Bugs such as incorrect population of business keys or other erroneous transformations might require a complete re-ingestion of data sources.

When dealing with bugs in data ingestion, having both a data lake and a Raw Vault allows flexibility and safeguards. Here’s how to address these issues:

Identify and Isolate the Problem: Pinpoint where the issue occurred in the ingestion process and document the scope of the bug, especially if it affects business keys or other critical aspects of data integrity.
Rely on the Data Lake for Original Files: Since the data lake contains the original, unaltered data files, you can reload the affected data from here into the Raw Vault. This ensures that any corrupted or inaccurately transformed data can be replaced without loss.
Reprocess the Raw Vault: With the correct data now available from the data lake, reload the Raw Vault. Ensure that new ingestion processes are thoroughly tested to avoid repeating the error.
Automate Audits and Reconciliation: Implement automated reconciliation checks between the data lake and the Raw Vault. Automated scripts can flag discrepancies, giving early warning of issues before they reach production or reporting layers.

By leveraging both the data lake and Raw Vault as points of fact, the architecture remains robust and auditable. This redundancy allows for re-ingestion without significant downtime and ensures that data lineage remains traceable throughout the lifecycle.

Data Lake vs. Raw Vault: Which Is Easier for Reconstruction?

When it comes to reconstructing deliveries, the data lake often offers simplicity. Since the data lake can hold raw files with minimal transformation, data reconstruction is a straightforward matter of accessing the original files. In contrast, reconstructing from the Raw Vault requires additional effort, as data must be accurately joined across hubs, links, and satellites, while preserving the original state.

That said, both layers should be auditable, with logging mechanisms that allow for a traceable history of changes. Having a clear data lineage in place allows organizations to meet compliance and audit requirements while supporting accurate reporting.

Best Practices for Managing Points of Fact

While it’s tempting to designate a single point of fact, the dual-layered approach with a data lake and Raw Vault provides a more resilient framework. Here are some best practices for managing points of fact in a data lakehouse architecture:

Maintain Consistency Between Layers: Ensure that data lake files and Raw Vault tables match exactly. Automate reconciliation checks between these layers to verify data integrity.
Implement Auditable Ingestion Processes: Document all transformations from the data lake to the Raw Vault, with logging and error-checking mechanisms. This allows for easier tracing of issues if they arise.
Retain Original Data in the Data Lake: Always keep a copy of original files in the data lake. These files provide a reliable source of truth that can be referenced or reloaded into the Raw Vault if issues occur.
Leverage Metadata for Automation: Metadata can streamline both ingestion and reconciliation. Use metadata to define business keys, relationships, and descriptive data in the Raw Vault, while automating verification processes.

With these practices, data lakehouse architecture can be made robust, auditable, and resilient to changes or errors. By treating both the data lake and Raw Vault as points of fact, you ensure that your data ecosystem remains flexible, trustworthy, and ready to meet evolving business requirements.

Conclusion

The question of “single point of facts” in data lakehouse architecture doesn’t have a straightforward answer. Both the data lake and the Raw Vault act as points of fact, each offering unique benefits in terms of auditability and reconstruction. By utilizing both, you create a highly resilient system capable of withstanding data issues while providing a comprehensive, consistent view of your data.

In summary, while the Raw Vault may traditionally serve as the “single point of facts,” using both the data lake and the Raw Vault as truth sources creates a flexible architecture that can accommodate re-ingestion, mitigate risks, and support accurate reporting. With this dual approach, your data lakehouse architecture becomes a reliable foundation for modern data needs.

Meet the Speaker

Michael Olschimke