
Still Struggling with GDPR?

Watch the Video

Navigating GDPR Compliance in Data Warehousing

In today’s digital age, GDPR compliance is a crucial aspect for any organization dealing with personal data. With the rise of data warehousing and advanced modeling solutions like Data Vault 2.0 (DV 2.0), questions often arise about how to handle Personally Identifiable Information (PII) within these frameworks. This article addresses some common concerns and provides practical recommendations for ensuring GDPR compliance in data warehouses.



Understanding the Challenge

GDPR mandates that personal data must be handled with the utmost care, ensuring individuals’ privacy and security. In the context of data warehousing, this often translates to managing business keys that might contain PII. Let’s dive into some specific questions raised around this topic:

  1. How should activity history be managed when the main hub contains a PII business key?
  2. Is it best practice to use hashed business keys in link tables to improve load performance?
  3. Should artificial keys originate from each business domain, and how should they be managed if not?

Question #1: Managing Activity History with PII Business Keys

The Problem

In a typical data warehouse model, customer records might include PII, such as social security numbers or tax IDs. According to GDPR, it’s crucial that activity history is not traceable back to the individual once they exercise their right to be forgotten.

The Solution

One effective approach is to split descriptive attributes into different satellites—one for personal data and another for non-personal data. This way, when a deletion request is made, only the personal satellite needs to be purged. The non-personal satellite can retain anonymized data, maintaining the integrity of the dataset while ensuring compliance.
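
The satellite split can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: it uses an in-memory SQLite database, and the table and column names (`hub_customer`, `sat_customer_personal`, `sat_customer_nonpersonal`, `customer_hk`) are hypothetical names chosen for the example.

```python
import sqlite3


def build_demo_vault() -> sqlite3.Connection:
    """Create an in-memory vault with one hub and two split satellites.

    All table and column names are hypothetical, for illustration only.
    """
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY);
        -- Personal attributes live in their own satellite.
        CREATE TABLE sat_customer_personal (
            customer_hk TEXT, load_date TEXT, full_name TEXT, email TEXT);
        -- Non-personal attributes are kept separately.
        CREATE TABLE sat_customer_nonpersonal (
            customer_hk TEXT, load_date TEXT, segment TEXT, country TEXT);
        """
    )
    con.execute("INSERT INTO hub_customer VALUES ('hk1')")
    con.execute(
        "INSERT INTO sat_customer_personal "
        "VALUES ('hk1', '2024-01-01', 'Jane Doe', 'jane@example.com')"
    )
    con.execute(
        "INSERT INTO sat_customer_nonpersonal "
        "VALUES ('hk1', '2024-01-01', 'retail', 'DE')"
    )
    con.commit()
    return con


def forget_customer(con: sqlite3.Connection, customer_hk: str) -> int:
    """Purge only the personal satellite rows for one customer.

    Returns the number of deleted rows; the non-personal satellite is untouched.
    """
    cur = con.execute(
        "DELETE FROM sat_customer_personal WHERE customer_hk = ?", (customer_hk,)
    )
    con.commit()
    return cur.rowcount
```

Calling `forget_customer(con, "hk1")` removes the name and email while the segment and country rows remain available for analysis. The sketch assumes that the hub key itself is not personal data; the later sections on hashed and artificial keys discuss that case.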


Question #2: Using Hashed Business Keys in Link Tables

The Problem

Hashing business keys is often recommended in DV 2.0 to improve load performance. However, directly using business keys in link tables can pose a challenge, especially when those keys contain PII.

The Solution

In DV 2.0, it’s a standard practice to use hashed values of business key components rather than the business keys themselves. This approach ensures better performance and security. Here’s how it works:

  1. Hash the Business Key: Use a cryptographic hash function (e.g., SHA-256) to convert the business key into a hashed key.
  2. Use Hashed Keys in Link Tables: The hashed key then serves as the foreign key in link tables, ensuring that PII is not directly exposed.
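
The two steps above can be sketched as follows. The normalization rules (trimming, upper-casing) and the `||` delimiter are assumptions made for this example, as are the sample keys and the `hash_key` helper name; only the use of SHA-256 comes from the text.

```python
import hashlib


def hash_key(*business_key_parts: str, delimiter: str = "||") -> str:
    """Derive a hash key from one or more business key components.

    Trimming, upper-casing, and the delimiter are illustrative assumptions;
    whatever rules are chosen must be applied identically in every load.
    """
    normalized = [part.strip().upper() for part in business_key_parts]
    return hashlib.sha256(delimiter.join(normalized).encode("utf-8")).hexdigest()


# Hub hash keys for two hypothetical business keys.
customer_hk = hash_key("123-45-6789")
store_hk = hash_key("STORE-B")

# The link row carries only hash keys, never the raw business keys.
link_row = {
    "link_hk": hash_key("123-45-6789", "STORE-B"),
    "customer_hk": customer_hk,
    "store_hk": store_hk,
}
```

One caveat worth keeping in mind: anyone who knows the original key and the hashing rules can recompute the same hash, so a hashed PII value is generally treated as pseudonymized rather than anonymized data.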

Question #3: Originating and Managing Artificial Keys

The Problem

There’s a debate on whether artificial keys should be generated within each business domain or within the data warehouse itself. This raises concerns about consistency and management, especially if the artificial key must be derived from PII.

The Solution

Artificial keys should ideally be generated within the data warehouse to maintain consistency and control. Here’s the process:

  1. Generate a UUID: Use a universally unique identifier (UUID) for the artificial key. This ensures randomness and reduces the risk of duplication.
  2. Link Artificial Keys to Business Keys: Establish a relationship between the artificial key and the business key within the data warehouse, ensuring that the artificial key is never exposed in operational systems.
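
A minimal sketch of this process, assuming an in-memory mapping for brevity. The `ArtificialKeyRegistry` class and its `forget` method are hypothetical; in a real warehouse the mapping would be a protected table, and how deletion requests are handled is a design decision this example only hints at.

```python
import uuid


class ArtificialKeyRegistry:
    """Maps PII business keys to random artificial keys (illustrative only)."""

    def __init__(self) -> None:
        self._by_business_key: dict[str, str] = {}

    def key_for(self, business_key: str) -> str:
        """Return the existing artificial key, or generate a new random UUID."""
        if business_key not in self._by_business_key:
            self._by_business_key[business_key] = str(uuid.uuid4())
        return self._by_business_key[business_key]

    def forget(self, business_key: str) -> None:
        """Drop the mapping so the artificial key no longer leads to the PII.

        This is an assumption about how a deletion request might be served.
        """
        self._by_business_key.pop(business_key, None)
```

Because a version 4 UUID is random rather than derived from the business key, it cannot be recomputed from the PII once the mapping row is gone.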

Handling Scenarios Without Artificial Keys

If the business domains do not supply artificial keys, the data warehouse should generate them upon ingestion. This method ensures that all keys are managed consistently and securely.


Ensuring Compliance and Security

Satellite Splitting

By splitting satellites into personal and non-personal data, organizations can easily manage deletion requests without compromising data integrity.

Cryptographic Hashing

Utilizing cryptographic hashing for business keys in link tables enhances both security and performance, crucial for maintaining GDPR compliance.

Artificial Keys Management

Generating artificial keys within the data warehouse ensures consistency and security, reducing the risk of PII exposure.

Regular Audits and Legal Consultation

Regular audits and consultations with legal experts ensure ongoing compliance with GDPR and other regulations. Implementing these practices helps organizations stay ahead of potential compliance issues.


Conclusion

Handling PII in data warehouses requires careful planning and robust solutions. By implementing satellite splitting, cryptographic hashing, and consistent artificial key management, organizations can ensure GDPR compliance while maintaining data integrity and performance. Regular audits and legal advice further bolster these practices, ensuring that data handling processes remain secure and compliant with evolving regulations.

Master Data Governance: The EU Data Act and What It Means for Your Company

Webinar Overview

The new EU Data Act takes effect on September 12, 2025. It is intended to ensure a fair and innovative data economy. But what does this mean for your company's data governance?

In this webinar, we unravel the intricacies of the EU Data Act: what the regulation covers and what it means for your company and its data governance. Learn more about the legal and technical obligations arising from this legislation, and gain valuable insights into implementing compliant data platforms.

In the first part, Dr. Benno Barnitzke, a lawyer specializing in IT, data protection, and digitalization, introduces and explains the legal framework of the regulation.

In the second part, Trung Ta, a data platform professional, shares tips and tricks for implementing the regulation.

WATCH THE RECORDING

What to Expect

  • An explanation of the EU Data Act, which recently entered into force, by lawyer Dr. Benno Barnitzke
  • Technical data governance recommendations and best practices for implementing the Data Act
  • Practical insights into the legal and technical obligations involved in implementing data platforms

Attendance Details

Date: June 5, 2024
Time: 14:00 – 15:00 CEST

Agenda

  1. Overview of the EU Data Act
  2. Obligations of the data holder (data owner)
  3. Enforcement
  4. Recommendations for legal compliance
  5. Technical requirements and challenges
  6. Technical recommendations for data platforms

Target Audience

Data management and compliance professionals, data product owners, and technical legal experts.

Free PDF - The EU Data Act - Checklist for Legally Compliant Implementation

MASTER THE EU DATA ACT TODAY!

Prepare your company for the Data Act with our comprehensive checklist of the 10 most important steps. Learn how to protect data, review contracts, and ensure interoperability.

REQUEST YOUR FREE CHECKLIST

Loading Technical Counter-Transactions

Watch the Video

Managing Data Vault Performance with Incremental Changes and Deletions

In the world of data warehousing, the Data Vault methodology has emerged as a robust and scalable solution for managing vast amounts of data. However, one common concern among practitioners is how to efficiently handle incremental changes and deletions, particularly when dealing with structures containing billions of rows. This article aims to elucidate the process, focusing on the questions around loading structures, performance considerations, and practical strategies for maintaining efficiency.



Understanding the Basics: Tracking Changes and Deletions

The core principle of Data Vault involves capturing all changes and deletions incrementally. This ensures that the data warehouse remains an accurate historical record of the enterprise’s data. Here’s a simplified illustration of how this can be achieved:

  1. Initial Load: When a new transaction is recorded, it is inserted into the Data Vault as a new record. For instance, if customer A purchases product C at store B on day one, this transaction is recorded with a value of €7.
  2. Handling Updates: If the value of the transaction changes from €7 to €5 on day two, instead of updating the existing record, two new records are created: one to nullify the original transaction (-€7) and another to represent the new transaction (€5).
  3. Dealing with Deletions: If a transaction is deleted, it is handled similarly by inserting a record that nullifies the original transaction.

This method ensures that the Data Vault remains immutable, as records are never directly altered once inserted. Instead, changes are tracked by adding new records, which simplifies loading processes and maintains data integrity.
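
The €7 to €5 example can be sketched as an append-only ledger. The `Txn` record and the helper names are hypothetical and chosen for this illustration; amounts are whole euros to keep the arithmetic exact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    """One immutable transaction row (field names are illustrative)."""

    customer: str
    store: str
    product: str
    amount: int  # whole euros
    load_day: int


def apply_update(ledger: list, old: Txn, new_amount: int, day: int) -> None:
    """Record an update as a counter-transaction plus the new value."""
    ledger.append(Txn(old.customer, old.store, old.product, -old.amount, day))
    ledger.append(Txn(old.customer, old.store, old.product, new_amount, day))


def apply_delete(ledger: list, old: Txn, day: int) -> None:
    """Record a deletion as a single counter-transaction."""
    ledger.append(Txn(old.customer, old.store, old.product, -old.amount, day))


def balance(ledger: list) -> int:
    """Sum all rows; counter-transactions cancel the rows they nullify."""
    return sum(t.amount for t in ledger)
```

Starting from a single €7 row, `apply_update` with a new amount of 5 leaves three rows whose sum is €5, and no existing row is ever modified.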


Loading Structures: The Practical Approach

Loading structures in Data Vault can be challenging, especially when dealing with large datasets. Here are some practical strategies:

Using CDC (Change Data Capture)

If the source system supports CDC, this is the most straightforward method:

  • Insert New Records: Directly insert new records into the target system.
  • Handle Updates and Deletes: For updates and deletes, insert the corresponding counter transactions.

CDC provides a clear and efficient way to track changes and deletions, significantly simplifying the loading process.

Full Load vs. Incremental Load

In scenarios where full loads are used (though rare for very large datasets), the process involves:

  • Identifying New Records: Select records from the staging area that do not exist in the target and insert them with a counter of one.
  • Identifying Deletions: Select records from the target that do not exist in the staging area and insert counter transactions to nullify them.

While full loads can be intensive, they can be managed effectively by optimizing the identification of new and deleted records.
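
As a rough sketch, the two comparisons are set differences between the staged keys and the keys currently active in the target. The function name, the key format, and the use of `-1` as the counter value for nullifying rows are assumptions for this example; the text only specifies a counter of one for new records.

```python
def diff_full_load(staging: set[str], target_active: set[str]):
    """Compare a full staging load with the active keys in the target.

    Returns (inserts, counters): new keys with counter +1, and keys missing
    from staging with counter -1 (the -1 convention is an assumption).
    """
    new_keys = staging - target_active        # in staging, not yet in target
    deleted_keys = target_active - staging    # in target, gone from staging
    inserts = [(key, +1) for key in sorted(new_keys)]
    counters = [(key, -1) for key in sorted(deleted_keys)]
    return inserts, counters
```

In a database, the same logic is typically expressed as two anti-joins. Note that `target_active` must contain only keys that have not already been nullified, otherwise the same deletion would be countered on every load.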


Performance Considerations

Handling billions of rows requires careful planning to avoid performance bottlenecks. Here are some strategies to mitigate performance issues:

Parallel Processing

By running multiple processes in parallel, you can significantly speed up the loading process. For example, separate processes can handle inserts and counter transactions concurrently.

Hash Keys and Indexes

Using hash keys and indexes efficiently can reduce the time needed to check for existing records. Ensure that your hash keys include all relevant business keys and transaction IDs to maintain uniqueness.

High-Water Marks and System Indicators

Some systems, like Oracle, offer features like SCN (System Change Number) or row versions that can help identify modified records. Using these indicators can reduce the amount of data processed by focusing only on recently changed records.
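
A high-water mark can be sketched as a simple filter. The `change_no` field stands in for whatever change indicator the source exposes (an SCN, a row version, or a timestamp); the field and function names are hypothetical.

```python
def incremental_batch(rows: list[dict], last_watermark: int):
    """Select rows changed since the last load and advance the watermark.

    `change_no` is a stand-in for a system change indicator such as an SCN.
    """
    changed = [r for r in rows if r["change_no"] > last_watermark]
    # Keep the old watermark when nothing changed.
    new_watermark = max((r["change_no"] for r in changed), default=last_watermark)
    return changed, new_watermark
```

The returned watermark would be persisted after a successful load and passed in on the next run. A filter like this only sees rows that still exist, so physical deletes in the source need a separate mechanism.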


Practical Example: Incremental Loading without CDC

In cases where CDC is not available, you can still achieve efficient incremental loading:

  1. Incremental Updates from Source: If the source system provides daily increments (inserted and updated records), use this data to update the target.
  2. Handling Deletions: For deleted records, you might need an additional table or mechanism to track deletions. If such a table is available, use it to insert counter transactions.
  3. Full Load Approach: If only full loads are available, implement a two-step process to identify and handle new, updated, and deleted records.

Conclusion

Managing incremental changes and deletions in Data Vault structures, especially for large datasets, requires a combination of strategies tailored to the specific capabilities of your source systems. Whether using CDC, full loads, or incremental updates, the goal remains the same: to maintain an accurate and efficient data warehouse. By understanding the principles and applying practical solutions, you can handle the complexities of Data Vault performance effectively.

Remember, the key to success lies in thorough planning, efficient use of system capabilities, and continuous optimization of your data loading processes. By following these guidelines, you can ensure that your Data Vault implementation scales efficiently, even as your data volumes grow.

Master Data Governance: Understanding the EU’s Data Act

Data Governance

The EU’s Data Act is on the horizon, promising to reshape how businesses access and use data, particularly industrial data, and how they approach data governance. This legislation aims to foster a competitive data market, drive innovation, and ensure fairness in data sharing.

Key provisions include granting users control over data from connected products, enhancing fairness in data allocation, and safeguarding against unfair contractual terms. Complementing the Data Governance Act, these regulations lay the groundwork for an EU single market for data, positioning Europe as a global leader in the data economy.


Join us for an insightful webinar where we unravel the intricacies of the EU’s Data Act. Delve into the legal and technical obligations imposed by this landmark legislation and gain valuable insights into implementing compliant data platforms.

Watch Webinar Recording

Understanding the EU Data Act

The Data Act, formulated by the European Union, represents a significant step forward in governing data sharing, access, and utilization within the digital economy. It aims to foster collaboration while upholding principles of fairness, transparency, and compliance with data protection regulations such as the GDPR. This regulatory framework provides guidelines for businesses, public sector entities, and research organizations, ensuring that data-sharing practices align with legal requirements and ethical standards.

Scheduled for enforcement on September 12, 2025, the Data Act mandates adherence to its provisions concerning data-sharing practices, contractual agreements, and operational procedures. Stakeholders must prepare to align their processes with the requirements outlined in the Data Act to maintain compliance and mitigate regulatory risks effectively.

The scope of the Data Act encompasses various stakeholders engaged in data-sharing activities within the European Union. This includes businesses of all sizes, public sector bodies, research organizations, and data processing service providers. Data holders bear the responsibility of ensuring compliance with the Data Act’s directives, promoting fair and non-discriminatory data-sharing practices while safeguarding privacy and intellectual property rights.

Free PDF - European Data Act Compliance Checklist - 10 Key Steps

Master the EU’s Data Act Today!

Ensure your business is ready for the Data Act with our comprehensive 10 Key Steps Compliance Checklist. Learn how to protect data, review contracts, and maintain interoperability.

Get My Free Checklist

Technical Challenges

Addressing technical challenges posed by the Data Act requires robust solutions that guarantee data integrity, security, and accessibility. Leveraging methodologies like Data Vault 2.0 offers a multi-faceted approach to data management, facilitating scalability, real-time capabilities, and efficient data architecture. By decoupling storage from delivery and implementing real-time data capture and processing, organizations can streamline compliance efforts and enhance data governance.

Implementing a Data Vault 2.0-based data platform enables organizations to meet the complex requirements of the Data Act effectively. By establishing a robust architecture for data ingestion, processing, and presentation, businesses can ensure compliance while driving innovation and agility in their data operations.

Data Vault architecture meets the Data Act requirements

Figure: reference architecture for real-time Data Vault solution

Data Vault 2.0 architecture follows a multi-layer approach, consisting of the Staging Layer, the Enterprise Data Warehouse Layer, and the Information Marts Layer. This integrated approach ensures a harmonious synergy between technical and business objectives. To respond to the Data Act requirements, this reference architecture adds a “message queue” that loads data from the source systems into the enterprise data warehouse “without undue delay,” i.e., in real time, in near real time, or with the delay stipulated by your specific technology and related services.

In leveraging a real-time or near real-time Data Vault 2.0 architecture for your data platform, you’re not just meeting the requirements of the EU Data Act but also laying the foundation for streamlined data management and enhanced data analytics capabilities. By centralizing the processing and preparation of raw data from IoT devices, this architecture ensures that compliance with the Data Act is seamlessly integrated into your data infrastructure.

Additionally, in the Information Mart layer, we can create a specific Interface Mart to cater to security and accessibility requirements. Here, data can be flagged by various criteria such as device or user ID and access level. This enables users to either download the data upon request or visualize it directly through a dedicated app. The data will be as fresh as possible, as the Interface Mart can be generated as views built on top of our Raw Vault, which is updated “without undue delay.”
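
The idea of an Interface Mart built as a view can be sketched with an in-memory SQLite database. The satellite `sat_device_reading`, the view `interface_device_data`, and their columns are hypothetical names for this illustration, and access control is reduced to a filter on the user ID.

```python
import sqlite3


def build_interface_mart() -> sqlite3.Connection:
    """Create a hypothetical raw satellite and an Interface Mart view on top."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE sat_device_reading (
            device_id TEXT, user_id TEXT, load_ts TEXT, reading REAL);
        -- The view stores no data, so it always reflects the raw table.
        CREATE VIEW interface_device_data AS
            SELECT user_id, device_id, load_ts, reading
            FROM sat_device_reading;
        """
    )
    return con


def readings_for_user(con: sqlite3.Connection, user_id: str) -> list:
    """Return only the rows flagged with the requesting user's ID."""
    return con.execute(
        "SELECT device_id, load_ts, reading FROM interface_device_data "
        "WHERE user_id = ? ORDER BY load_ts",
        (user_id,),
    ).fetchall()
```

Because the view is evaluated at query time, a row inserted into the raw table is visible to the next request without a separate mart load.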

Moreover, by decoupling data storage from delivery and employing real-time data capture and processing mechanisms, you not only facilitate adherence to regulatory standards but also enable agile and responsive data analytics. The business rules behind those analytics can be implemented in your Business Vault and downstream layers. This approach therefore ensures compliance and also sets the stage for harnessing the full potential of your data assets to drive innovation and decision-making across your organization.

Final Remarks

As the enforcement date of the Data Act approaches, it is imperative for organizations to prioritize compliance and adopt proactive strategies for data management. By embracing technologies like Data Vault 2.0 and adhering to agile development methodologies, businesses can navigate the regulatory landscape with confidence, harnessing the synergies between regulatory requirements and technological advancements to drive sustainable growth and innovation.

Check out our webinar recordings as we explore the intersection of the Data Act and Data Vault 2.0, offering insights and practical guidance for navigating the evolving data governance landscape.

Watch here for free: In English or in German

Data Vault Hashing or Not?

Watch the Video

Exploring Data Vault 2.0: Managing Hashing Costs in Smaller Environments

In the evolving landscape of data management, Data Vault 2.0 stands out as a robust methodology designed for scalability, flexibility, and consistency across diverse technological environments. A crucial component of Data Vault 2.0 is the use of hashing for business keys (BKs) and hash diffs. Hashing ensures data integrity and efficiency, especially in distributed systems. However, the performance costs associated with hashing can sometimes become a significant concern. This blog post delves into the nuances of hashing in Data Vault 2.0, the trade-offs involved, and when it might be feasible to deviate from the standard approach.



The Role of Hashing in Data Vault 2.0

Data Vault 2.0 leverages hashing to create unique, consistent identifiers for business keys and to detect changes in data efficiently. This method is technology-agnostic, meaning it can be implemented across various databases and data platforms, whether on-premises or in the cloud. The primary advantages of hashing include:

  1. Consistency Across Systems: Hashing ensures that business keys are consistent and unique across different systems and regions.
  2. Improved Query Performance: Pre-calculating hash diffs can make query execution faster and more efficient, transferring the computational load from query time to data loading time.
  3. Simplified Data Integration: Hash keys provide a straightforward way to manage and integrate data from multiple sources, reducing the complexity of data joins.
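
To make the hash diff idea concrete, here is a minimal sketch of a delta check. The column handling (NULL as empty string, trimming, a `||` delimiter) and the choice of SHA-256 are assumptions for this example; the helper names are hypothetical.

```python
import hashlib


def hash_diff(record: dict, columns: list[str]) -> str:
    """Hash the descriptive columns of a record in a fixed column order.

    NULL handling, trimming, and the delimiter are illustrative assumptions.
    """
    values = [
        "" if record.get(c) is None else str(record[c]).strip() for c in columns
    ]
    return hashlib.sha256("||".join(values).encode("utf-8")).hexdigest()


def has_changed(incoming: dict, latest_stored_hash_diff: str, columns: list[str]) -> bool:
    """Compare one stored value instead of every descriptive column."""
    return hash_diff(incoming, columns) != latest_stored_hash_diff
```

The hash diff is computed once at load time and stored with the satellite row, so later delta checks compare a single fixed-length value regardless of how wide the table is.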

Challenges of Hashing

Despite its benefits, hashing can introduce performance challenges, particularly in the following scenarios:

  1. Wide Tables: Calculating hash diffs for tables with a large number of columns can be computationally intensive.
  2. Complex Hash Functions: Ensuring that hash functions generate unique strings can be complex and resource-heavy.
  3. Hardware Limitations: On-premises environments with limited hardware capabilities might struggle with the additional computational load required for hashing.

Evaluating Hashing Alternatives

When faced with performance concerns, particularly in smaller, local solutions, it’s essential to consider whether deviating from the standard hashing approach would be beneficial. There are three primary options to consider:

  1. Hash Keys: The default and recommended option for most environments, especially those involving distributed systems or diverse technologies.
  2. Sequences: A legacy approach from Data Vault 1.0 that uses sequential numbers as identifiers.
  3. Business Keys: Using the original business keys directly as identifiers.

The Case Against Sequences

Sequences, although a viable option, are generally not recommended in modern Data Vault implementations due to several drawbacks:

  • Lookup Overhead: Sequences require lookups during data loading, which can slow down the process significantly.
  • Orchestration Complexity: Managing sequences adds complexity to the loading process, particularly in real-time scenarios.
  • Distributed System Challenges: Sequences do not perform well in distributed environments where parts of the solution might reside in different locations (e.g., cloud and on-premises).

Hash Keys vs. Business Keys

When deciding between hash keys and business keys, the choice largely depends on the specific technology stack and the environment. Here are some considerations:

Hash Keys

  • Pros: Provide a consistent, fixed-length identifier that simplifies joins and queries across various systems. They are particularly beneficial in mixed environments.
  • Cons: Slightly higher computational cost during data loading compared to sequences. However, the consistent performance across queries often outweighs this drawback.

Business Keys

  • Pros: Directly using business keys can simplify the architecture in environments where the data platform supports efficient handling of these keys.
  • Cons: Can lead to complex and less efficient joins, especially in mixed or distributed environments.

Performance Optimization Strategies for Hashing

For environments where hashing performance is a concern, several optimization strategies can be employed:

  1. Leverage Hardware Acceleration: On-premises environments can benefit from hardware acceleration, such as PCI Express cards with crypto chips, to offload hash computation from the CPU.
  2. Utilize Optimized Libraries: Many platforms use highly optimized libraries (e.g., OpenSSL) for hash computations, which can significantly improve performance.
  3. Incremental Loads: Ensure that performance evaluations consider multiple load cycles to capture the benefits of hash diffs during delta checks, not just initial loads.

Future Trends and Recommendations

Looking forward, the evolution of data platforms and technologies might shift the balance towards using business keys more frequently. As Massively Parallel Processing (MPP) databases become more prevalent, their native support for efficient key management could make business keys a more attractive option. However, until such technologies are ubiquitous, the default recommendation remains to use hash keys for their broad compatibility and consistent performance.


Conclusion

Data Vault 2.0’s approach to hashing business keys and hash diffs provides significant advantages in terms of consistency, scalability, and performance. While the performance costs of hashing can be a concern, particularly in smaller environments with limited hardware, careful consideration of the available options and optimization strategies can mitigate these issues. Ultimately, the decision should be guided by the specific technological context and future-proofing considerations.

For most scenarios, hash keys remain the recommended approach due to their versatility and robustness in mixed and distributed environments. However, as technology evolves, the use of business keys might become more feasible, highlighting the importance of staying informed about the latest trends and advancements in data management.

Multi Active Satellites on Links

Watch the Video

In our ongoing series, our CEO Michael Olschimke addresses a complex question from the audience regarding the use of Multi Active Satellites (MAS) on Links within a Data Vault 2.0 model. This topic touches on advanced aspects of data modeling, particularly in the context of handling multiple active records.

The question posed was, “Can the Multi Active Satellites be used on LINKs too (considering that on Link we have the option of using the child dependent key)? Please ignore the fact that the link doesn’t have a Hash column on all HUB keys.” Michael’s response delves into the practical application of MAS on Links, an area that can greatly enhance the flexibility and scalability of data models. He explains that while traditionally Multi Active Satellites are used with Hubs to track multiple active records, their application on Links is feasible and beneficial. By leveraging the child dependent key, it is possible to maintain multiple active relationships between entities, which is particularly useful in scenarios where relationships are dynamic and subject to frequent changes.

Drawing on his 15 years of experience in Information Technology, with a focus on Business Intelligence over the past eight years, Michael offers a nuanced perspective on this topic. He highlights that while the absence of a hash column on all HUB keys might pose a challenge, it can be mitigated through careful design and implementation strategies. By ensuring that each Link is adequately documented and structured, organizations can effectively use MAS to capture the complexity of real-world relationships without sacrificing data integrity or performance.

In conclusion, Michael emphasizes the importance of flexibility and adaptability in data modeling. Implementing Multi-Active Satellites on Links can provide significant advantages in managing complex data relationships, allowing for more granular and accurate data analysis. This approach aligns with best practices in Data Vault 2.0 and supports the goal of creating robust, scalable, and responsive data architectures. Michael encourages practitioners to challenge conventional boundaries and explore innovative solutions to meet their unique data management needs.

Unlock Success: Dive into the Salesforce Summer Release ’24!

Watch the Webinar

Gear up for success as we dive into the highly anticipated Salesforce Summer Release ’24 in our exclusive webinar, “Unlock Success: Dive into the Salesforce Summer Release ’24!” Gain a competitive edge by getting ahead of the curve with a sneak peek into the upcoming updates that are set to revolutionize your business. Join us for an insightful exploration of the latest features and enhancements before they’re even released, and ensure you stay steps ahead of the competition.

In this dynamic webinar, we’ll provide you with an insider’s look into what the Salesforce Summer Release ’24 has in store. From game-changing functionalities to transformative enhancements, you’ll discover how these updates can propel your business forward and drive greater efficiency, productivity, and success. Whether you’re a seasoned Salesforce user or new to the platform, this webinar offers invaluable insights to help you maximize the potential of Salesforce and stay at the forefront of innovation.

Don’t miss this exclusive opportunity to gain early access to the groundbreaking features of the Salesforce Summer Release ’24. Register now to secure your spot and embark on a journey towards unlocking success with Salesforce. Stay ahead of the competition and position your business for growth and prosperity in the ever-evolving digital landscape.

Watch Webinar Recording

How to Track Soft Deletes in an Insert Only Data Vault 2.0 Architecture

Watch the Video

In our ongoing series, our BI Consultant Lorenz Kindling addresses a question from the audience about managing soft deletes in an insert-only data environment. This topic is particularly relevant for those in the field of data warehousing, where maintaining historical data integrity and accuracy is paramount.

The question posed was, “How to track soft deletes with insert only?” Lorenz’s response explores the complexities and best practices for implementing soft deletes within an insert-only framework. He explains that soft deletes involve marking records as inactive rather than physically removing them from the database. This approach is crucial for maintaining a comprehensive historical record and ensuring that data integrity is not compromised. Lorenz suggests using a specific status indicator or a flag within the data model to denote records that are logically deleted. This allows for efficient querying and reporting without the risk of losing historical data.

Lorenz, who has been advising renowned companies since 2021 at Scalefree International, draws on his extensive experience in Business Intelligence and Enterprise Data Warehousing to provide practical insights. He emphasizes that by carefully planning and implementing a robust soft delete mechanism, organizations can achieve a balance between data retention and performance. Lorenz’s approach ensures that data warehouses remain both scalable and efficient, even as they grow and evolve over time.

In conclusion, Lorenz highlights the importance of adopting best practices in data warehouse automation and Data Vault modeling to manage soft deletes effectively. By using insert-only methods with proper indicators for soft deletes, organizations can maintain the integrity and usability of their data warehouses, thereby supporting long-term business intelligence and analytics goals. This strategy not only addresses common data warehousing challenges but also aligns with modern data management principles.
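
A minimal sketch of the flag-based approach, assuming each history row carries a key, a load timestamp, and an `is_deleted` indicator. These field names and both helper functions are hypothetical and not taken from the video.

```python
def soft_delete(history: list[dict], key: str, load_ts: int) -> None:
    """Mark a record as deleted by appending a new row; nothing is removed."""
    history.append({"key": key, "load_ts": load_ts, "is_deleted": True})


def current_state(history: list[dict]) -> dict:
    """Return the latest row per key, skipping keys whose latest row is deleted."""
    latest = {}
    for row in sorted(history, key=lambda r: r["load_ts"]):
        latest[row["key"]] = row  # later rows overwrite earlier ones
    return {k: r for k, r in latest.items() if not r["is_deleted"]}
```

The full history stays queryable, and a record that reappears in the source can be reactivated by appending another row with `is_deleted` set to `False`.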

Scale Up your Data Vault Project – with dbt Mesh

dbt Mesh - data mesh solution

dbt Mesh

Learn how dbt Mesh enhances Data Vault projects within dbt Cloud by facilitating a more efficient data mesh architecture. The larger a data warehouse project grows, the more people begin to rely on and work with the data provided. This work could be consuming the data, applying business rules, modeling facts and dimensions, or other typical tasks in a data environment. In a large organization, all these users might be scattered across different divisions, and the data they are working with might belong to different business domains. At some point, the entire organization faces the challenge of data sharing and governance guidelines, which might prohibit users of the sales department from accessing data from the finance department. A data mesh offers a solution that helps organizations deal with these challenges. If you want to learn more about the data mesh, see our recent blog article about Data Vault and data mesh.

We also have a webinar on exactly this specific subject. Don’t miss it and watch the recording for free!

Data Mesh Support in dbt Cloud

Many organizations struggle with introducing a Data Mesh approach into the Data Vault landscape. In this webinar, we will dive into dbt Mesh, and how to leverage it in a Data Vault project.

Watch Webinar Recording

What is dbt Mesh?

dbt Mesh is a recently added feature that makes dbt Cloud work more efficiently with a data mesh approach. The already familiar {{ ref() }} function is no longer limited to models within one dbt project; instead, it can refer to models of other dbt projects.

Why would I want to refer to other dbt projects?

Imagine a big organization that uses dbt Cloud for its Data Vault implementation. The project might have 400 sources defined and 2000 models implemented, and be actively used by 30 developers. Out of these 30 developers, there might be 5 people specifically working on the Business Data Vault and Information Mart layer for finance-related objects. Another 5 developers are working on the same layers but for sales-related objects.

At some point, you might want to avoid finance people messing around with the sales-related dbt models, so you decide to implement a data mesh architecture. This would allow the organization to define policies regarding data sharing, data ownership, and other governance measures.

With dbt Mesh, both the Sales and the Finance team would get their own dbt project. Since both should be based on the same Raw Data Vault, an additional foundational dbt project is created exclusively for staging and Raw Data Vault objects. Both domain-specific dbt projects, sales and finance, can now refer to Raw Vault objects inside the foundational dbt project without physically replicating the data.

How can I leverage dbt Mesh in a Data Vault powered Data Mesh?

Define Data Contracts

dbt models, or groups of models, can now be configured to have data contracts. Inside the already familiar .yml files, models can now be set to be publicly available (within an organization), data owners can be enforced, and table schemas can be locked.
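
As a rough illustration of these three settings, a .yml entry might look like the sketch below. The file path, group name, owner, model name, and columns are hypothetical placeholders, and the data types would need to match your warehouse.

```yaml
# models/raw_vault/_raw_vault.yml (hypothetical path and names)
groups:
  - name: raw_vault_team            # assumed group name
    owner:
      name: Raw Vault Team          # assumed owner
      email: raw-vault@example.com  # placeholder address

models:
  - name: customer_h                # hypothetical hub model
    group: raw_vault_team           # assigns the model to its owning group
    access: public                  # other dbt projects may reference it
    config:
      contract:
        enforced: true              # build fails if the output deviates from the columns below
    columns:
      - name: hk_customer_h
        data_type: varchar
        constraints:
          - type: not_null
      - name: customer_bk
        data_type: varchar
      - name: ldts
        data_type: timestamp
      - name: rsrc
        data_type: varchar
```

With an enforced contract, dbt compares the column names and data types returned by the model with this declaration before building it, which is what locks the schema for downstream consumers.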

Create a Foundational dbt project

In a Data Mesh architecture, the most common way to implement Data Vault 2.0 is to have a commonly shared Raw Vault as a foundation, while Business Vault and Information Marts are divided by business domains. In dbt Mesh, this is reflected in a foundational dbt project that includes all staging and Raw Data Vault objects. Only the Raw Data Vault objects would be configured to be accessible by other dbt projects, since the staging models should not be used outside of Raw Data Vault models.
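
One possible way to express this split is through dbt's model access levels. The following is a minimal sketch with hypothetical model names, not a prescribed layout.

```yaml
# Foundational project: models/_access.yml (hypothetical path and names)
models:
  - name: stg_customer     # staging model
    access: protected      # dbt's default: referenceable only inside this project
  - name: customer_h       # Raw Data Vault hub
    access: public         # may be referenced from other dbt projects
  - name: customer_n_s     # Raw Data Vault satellite
    access: public
```

Because `protected` is the default, the staging entry could be left out entirely; it is spelled out here only to make the contrast visible.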

Add domain-level dbt projects

Based on the foundational Raw Vault dbt project, each domain team can now work in their own dbt project. They access the Raw Data Vault via the (extended) {{ ref() }} function and don’t have to worry about maintaining these Raw Vault objects. Additionally, they can define which of their artifacts might be useful for other domains and share these via their own data contracts.
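
To sketch how a domain project could declare this dependency: dbt reads upstream projects from a `dependencies.yml` file in the project root. The project name below is a hypothetical placeholder.

```yaml
# dependencies.yml in the domain project (for example, finance)
projects:
  - name: raw_vault_foundation   # hypothetical name of the foundational dbt project
```

A model in the domain project could then select from `{{ ref('raw_vault_foundation', 'customer_h') }}`, where the two-argument form of `ref` names the upstream project first and the model second (`customer_h` is again a hypothetical model). Only models marked as public in the upstream project can be referenced this way.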

Distribute Responsibilities

Typically, a power user does not create Hubs, Links, and Satellites, and it is not their responsibility to ensure a reliable Raw Data Vault to build transformations on. Therefore, it is important to define responsibilities within each dbt project. In particular, objects that are shared outside of one project should always have data contracts and defined owners. This ensures that users of these shared objects can rely on them.

Conclusion

All in all, dbt Mesh offers a fantastic way to properly implement a true data mesh approach. It is especially relevant when different business domains of one organization work together in dbt to create trustworthy deliverables. In most scenarios, it makes sense to start using dbt Mesh early, even if your project is not very large yet. Having clear responsibilities and data contracts always helps maintain trust and transparency for your data!

Modelling Exchange Rates

Watch the Video

In our ongoing series, our CEO Michael Olschimke addresses a question from the audience about modelling daily exchange rates within the Data Vault framework for a non-banking industry. The query highlights a common challenge faced by many organizations: integrating and managing exchange rate data effectively.

The question posed was, “How would you model daily exchange rates in Data Vault 2.0 for a non-banking industry? We are already using a reference table for the list of currencies (I guess we would have currency as a hub in the banking industry, but that is not our case). Now we also need daily exchange rates for currency conversions in the datamart layer. I would start with a Link for exchange rates, but do we need to create a hub for currencies? How about existing references to currency in the existing model (currently in SAT, because we have currency as a reference table)?”

Michael’s response delves into the intricacies of data modelling in such scenarios. He suggests that even though your industry is non-banking, establishing a structured and scalable way to manage exchange rate data is crucial. Using a Link for exchange rates is a good starting point, but creating a hub for currencies could provide additional benefits. This hub would act as a central repository for all currency-related information, ensuring consistency and ease of access across different layers of the data architecture. Additionally, integrating existing references to currencies within the model can streamline operations and enhance the accuracy of financial data analytics.

In conclusion, Michael emphasizes the importance of a well-thought-out data architecture. By creating a dedicated hub for currencies and effectively linking exchange rate data, organizations can ensure more accurate and efficient currency conversions in their datamart layer. This approach not only aligns with best practices in data vault modeling but also supports the broader goal of maintaining data integrity and usability across the enterprise.

How to Implement Data Quality Techniques

Watch the Video

In our latest video, BI Consultant Julian Brunner tackles a pressing query: “Where to implement data quality techniques? Is it possible to clean dirty data at the entry point into the raw data vault?” Data quality is foundational for informed decision-making, and Julian’s expertise shines as he navigates this critical terrain.

Julian highlights the importance of a holistic approach to data quality management, emphasizing the need for robust frameworks spanning the entire data lifecycle. Whether it’s validation rules, data profiling, or cleansing algorithms, proactive measures at every stage can fortify data integrity.

Automating a Scalable Data Warehouse with Data Vault Builder

Watch the Webinar

Unlock the power of automation in your data warehouse with Data Vault Builder in our upcoming webinar. Dive into the intricacies of Data Vault 2.0 and discover why it’s tailor-made for automation, promising efficiency and scalability like never before. Whether you’re a seasoned data professional or just embarking on your data warehousing journey, this webinar offers invaluable insights into streamlining your processes and accelerating implementation.

During this joint webinar, you’ll delve into the core principles of Data Vault 2.0 and witness firsthand how Data Vault Builder revolutionizes the implementation process. Through a live demonstration, gain practical knowledge and actionable tips to optimize your data warehouse architecture. From overcoming common challenges to kickstarting your project with confidence, this session equips you with the tools and techniques needed to succeed in the world of data warehousing.

Don’t miss this opportunity to elevate your data warehousing game and leverage the full potential of automation with Data Vault Builder. Join us and discover how to transform your data infrastructure into a dynamic, scalable powerhouse. Whether you’re a data architect, analyst, or IT professional, this webinar promises to be a game-changer for your organization’s data strategy. Register now to secure your spot!

Watch Webinar Recording