

Loading Technical Counter-Transactions

Watch the Video

Managing Data Vault Performance with Incremental Changes and Deletions

In the world of data warehousing, the Data Vault methodology has emerged as a robust and scalable solution for managing vast amounts of data. However, one common concern among practitioners is how to efficiently handle incremental changes and deletions, particularly when dealing with structures containing billions of rows. This article aims to elucidate the process, focusing on the questions around loading structures, performance considerations, and practical strategies for maintaining efficiency.



Understanding the Basics: Tracking Changes and Deletions

The core principle of Data Vault involves capturing all changes and deletions incrementally. This ensures that the data warehouse remains an accurate historical record of the enterprise’s data. Here’s a simplified illustration of how this can be achieved:

  1. Initial Load: When a new transaction is recorded, it is inserted into the Data Vault as a new record. For instance, if customer A purchases product C at store B on day one, this transaction is recorded with a value of €7.
  2. Handling Updates: If the value of the transaction changes from €7 to €5 on day two, instead of updating the existing record, two new records are created: one to nullify the original transaction (-€7) and another to represent the new transaction (€5).
  3. Dealing with Deletions: If a transaction is deleted, it is handled similarly by inserting a record that nullifies the original transaction.

This method ensures that the Data Vault remains immutable, as records are never directly altered once inserted. Instead, changes are tracked by adding new records, which simplifies loading processes and maintains data integrity.
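
To make the counter-transaction idea concrete, here is a minimal, purely illustrative SQL sketch; the table and column names are assumptions, not a prescribed Data Vault structure:

```sql
-- Day 1: the initial transaction is inserted as-is (illustrative names only).
INSERT INTO dv_transaction_sat (transaction_hk, load_date, customer_id, product_id, store_id, amount)
VALUES ('HK_TX_1001', DATE '2024-01-01', 'A', 'C', 'B', 7.00);

-- Day 2: the amount changes from 7 to 5. Instead of an UPDATE,
-- two new rows are appended: a counter transaction and the new value.
INSERT INTO dv_transaction_sat (transaction_hk, load_date, customer_id, product_id, store_id, amount)
VALUES
  ('HK_TX_1001', DATE '2024-01-02', 'A', 'C', 'B', -7.00),  -- nullifies the original transaction
  ('HK_TX_1001', DATE '2024-01-02', 'A', 'C', 'B',  5.00);  -- represents the new value

-- A deletion on a later day would only add the nullifying row (-5.00).
-- Summing the amount per transaction always yields the current value.
```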


Loading Structures: The Practical Approach

Loading structures in Data Vault can be challenging, especially when dealing with large datasets. Here are some practical strategies:

Using CDC (Change Data Capture)

If the source system supports CDC, this is the most straightforward method:

  • Insert New Records: Directly insert new records into the target system.
  • Handle Updates and Deletes: For updates and deletes, insert the corresponding counter transactions.

CDC provides a clear and efficient way to track changes and deletions, significantly simplifying the loading process.

Full Load vs. Incremental Load

In scenarios where full loads are used (though rare for very large datasets), the process involves:

  • Identifying New Records: Select records from the staging area that do not exist in the target and insert them with a counter of one.
  • Identifying Deletions: Select records from the target that do not exist in the staging area and insert counter transactions to nullify them.

While full loads can be intensive, they can be managed effectively by optimizing the identification of new and deleted records.
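
A hedged sketch of such a full-load comparison, using anti-joins and assumed names (stg_transactions for today's full extract, fact_transactions_dv for the target holding counter transactions):

```sql
-- 1. New records: present in staging but not yet in the target -> insert them.
INSERT INTO fact_transactions_dv (transaction_id, load_date, amount)
SELECT s.transaction_id, CURRENT_DATE, s.amount
FROM stg_transactions s
WHERE NOT EXISTS (
    SELECT 1 FROM fact_transactions_dv t
    WHERE t.transaction_id = s.transaction_id
);

-- 2. Deleted records: present in the target with a non-zero net amount,
--    but missing from staging -> insert a counter transaction that nullifies them.
INSERT INTO fact_transactions_dv (transaction_id, load_date, amount)
SELECT t.transaction_id, CURRENT_DATE, -SUM(t.amount)
FROM fact_transactions_dv t
WHERE NOT EXISTS (
    SELECT 1 FROM stg_transactions s
    WHERE s.transaction_id = t.transaction_id
)
GROUP BY t.transaction_id
HAVING SUM(t.amount) <> 0;
```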


Performance Considerations

Handling billions of rows requires careful planning to avoid performance bottlenecks. Here are some strategies to mitigate performance issues:

Parallel Processing

By running multiple processes in parallel, you can significantly speed up the loading process. For example, separate processes can handle inserts and counter transactions concurrently.

Hash Keys and Indexes

Using hash keys and indexes efficiently can reduce the time needed to check for existing records. Ensure that your hash keys include all relevant business keys and transaction IDs to maintain uniqueness.
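
For illustration, a staging query could derive such a hash key over all business keys plus the transaction ID; the hash function, delimiter, and column names below are assumptions and would follow your platform's standards:

```sql
-- Illustrative only: hash key over all business keys plus the transaction ID,
-- with trimming/upper-casing and a delimiter for consistent results.
SELECT
    MD5(
        COALESCE(TRIM(UPPER(customer_id)), '') || '||' ||
        COALESCE(TRIM(UPPER(product_id)),  '') || '||' ||
        COALESCE(TRIM(UPPER(store_id)),    '') || '||' ||
        COALESCE(CAST(transaction_id AS VARCHAR), '')
    ) AS transaction_hk,
    customer_id,
    product_id,
    store_id,
    transaction_id
FROM stg_transactions;
```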

High-Water Marks and System Indicators

Some systems, like Oracle, offer features like SCN (System Change Number) or row versions that can help identify modified records. Using these indicators can reduce the amount of data processed by focusing only on recently changed records.
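
A minimal sketch of a high-water-mark filter, assuming a change timestamp on the staging table and a small control table for watermarks (both are illustrative assumptions):

```sql
-- Read only rows changed since the last successful load.
SELECT s.*
FROM stg_transactions s
WHERE s.source_change_ts > (
    SELECT last_loaded_ts
    FROM etl_watermarks
    WHERE table_name = 'stg_transactions'
);

-- After a successful load, advance the watermark (ideally to the maximum
-- change timestamp actually processed, rather than the wall-clock time).
UPDATE etl_watermarks
SET last_loaded_ts = CURRENT_TIMESTAMP
WHERE table_name = 'stg_transactions';
```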


Practical Example: Incremental Loading without CDC

In cases where CDC is not available, you can still achieve efficient incremental loading:

  1. Incremental Updates from Source: If the source system provides daily increments (inserted and updated records), use this data to update the target.
  2. Handling Deletions: For deleted records, you might need an additional table or mechanism to track deletions. If such a table is available, use it to insert counter transactions (see the sketch after this list).
  3. Full Load Approach: If only full loads are available, implement a two-step process to identify and handle new, updated, and deleted records.
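
As a sketch of step 2, assuming the source exposes a deletion log (src_deleted_transactions and its columns are illustrative names), the counter transactions can be derived directly from it:

```sql
-- Nullify every transaction reported as deleted by the source's deletion log.
INSERT INTO fact_transactions_dv (transaction_id, load_date, amount)
SELECT d.transaction_id, d.deleted_at, -SUM(t.amount)
FROM src_deleted_transactions d
JOIN fact_transactions_dv t
  ON t.transaction_id = d.transaction_id
GROUP BY d.transaction_id, d.deleted_at
HAVING SUM(t.amount) <> 0;  -- skip transactions that are already nullified
```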

Conclusion

Managing incremental changes and deletions in Data Vault structures, especially for large datasets, requires a combination of strategies tailored to the specific capabilities of your source systems. Whether using CDC, full loads, or incremental updates, the goal remains the same: to maintain an accurate and efficient data warehouse. By understanding the principles and applying practical solutions, you can handle the complexities of Data Vault performance effectively.

Remember, the key to success lies in thorough planning, efficient use of system capabilities, and continuous optimization of your data loading processes. By following these guidelines, you can ensure that your Data Vault implementation scales efficiently, even as your data volumes grow.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Data Vault Hashing or Not?

Watch the Video

Exploring Data Vault 2.0: Managing Hashing Costs in Smaller Environments

In the evolving landscape of data management, Data Vault 2.0 stands out as a robust methodology designed for scalability, flexibility, and consistency across diverse technological environments. A crucial component of Data Vault 2.0 is the use of hashing for business keys (BKs) and hash diffs. Hashing ensures data integrity and efficiency, especially in distributed systems. However, the performance costs associated with hashing can sometimes become a significant concern. This blog post delves into the nuances of hashing in Data Vault 2.0, the trade-offs involved, and when it might be feasible to deviate from the standard approach.



The Role of Hashing in Data Vault 2.0

Data Vault 2.0 leverages hashing to create unique, consistent identifiers for business keys and to detect changes in data efficiently. This method is technologically agnostic, meaning it can be implemented across various databases and data platforms, whether on-premises or in the cloud. The primary advantages of hashing include:

  1. Consistency Across Systems: Hashing ensures that business keys are consistent and unique across different systems and regions.
  2. Improved Query Performance: Pre-calculating hash diffs can make query execution faster and more efficient, transferring the computational load from query time to data loading time.
  3. Simplified Data Integration: Hash keys provide a straightforward way to manage and integrate data from multiple sources, reducing the complexity of data joins.

Challenges of Hashing

Despite its benefits, hashing can introduce performance challenges, particularly in the following scenarios:

  1. Wide Tables: Calculating hash diffs for tables with a large number of columns can be computationally intensive.
  2. Complex Hash Functions: Ensuring that hash functions generate unique strings can be complex and resource-heavy.
  3. Hardware Limitations: On-premises environments with limited hardware capabilities might struggle with the additional computational load required for hashing.

Evaluating Hashing Alternatives

When faced with performance concerns, particularly in smaller, local solutions, it’s essential to consider whether deviating from the standard hashing approach would be beneficial. There are three primary options to consider:

  1. Hash Keys: The default and recommended option for most environments, especially those involving distributed systems or diverse technologies.
  2. Sequences: A legacy approach from Data Vault 1.0 that uses sequential numbers as identifiers.
  3. Business Keys: Using the original business keys directly as identifiers.

The Case Against Sequences

Sequences, although a viable option, are generally not recommended in modern Data Vault implementations due to several drawbacks:

  • Lookup Overhead: Sequences require lookups during data loading, which can slow down the process significantly.
  • Orchestration Complexity: Managing sequences adds complexity to the loading process, particularly in real-time scenarios.
  • Distributed System Challenges: Sequences do not perform well in distributed environments where parts of the solution might reside in different locations (e.g., cloud and on-premises).

Hash Keys vs. Business Keys

When deciding between hash keys and business keys, the choice largely depends on the specific technology stack and the environment. Here are some considerations:

Hash Keys

  • Pros: Provide a consistent, fixed-length identifier that simplifies joins and queries across various systems. They are particularly beneficial in mixed environments.
  • Cons: Slightly higher computational cost during data loading compared to sequences. However, the consistent performance across queries often outweighs this drawback.

Business Keys

  • Pros: Directly using business keys can simplify the architecture in environments where the data platform supports efficient handling of these keys.
  • Cons: Can lead to complex and less efficient joins, especially in mixed or distributed environments.

Performance Optimization Strategies for Hashing

For environments where hashing performance is a concern, several optimization strategies can be employed:

  1. Leverage Hardware Acceleration: On-premises environments can benefit from hardware acceleration, such as PCI Express (PCIe) cards with crypto chips, to offload hash computation from the CPU.
  2. Utilize Optimized Libraries: Many platforms use highly optimized libraries (e.g., OpenSSL) for hash computations, which can significantly improve performance.
  3. Incremental Loads: Ensure that performance evaluations consider multiple load cycles to capture the benefits of hash diffs during delta checks, not just initial loads.
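
For illustration, a satellite delta check that relies on pre-calculated hash diffs could look like this; all table and column names are assumptions:

```sql
-- Insert only staged rows whose hash diff differs from the latest
-- hash diff stored in the satellite for the same hash key.
INSERT INTO sat_customer (customer_hk, load_date, hash_diff, name, city, phone)
SELECT s.customer_hk, s.load_date, s.hash_diff, s.name, s.city, s.phone
FROM stg_customer s
LEFT JOIN (
    SELECT customer_hk, hash_diff,
           ROW_NUMBER() OVER (PARTITION BY customer_hk ORDER BY load_date DESC) AS rn
    FROM sat_customer
) cur
  ON cur.customer_hk = s.customer_hk AND cur.rn = 1
WHERE cur.customer_hk IS NULL        -- brand-new business key
   OR cur.hash_diff <> s.hash_diff;  -- descriptive attributes changed
```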

Future Trends and Recommendations

Looking forward, the evolution of data platforms and technologies might shift the balance towards using business keys more frequently. As Massively Parallel Processing (MPP) databases become more prevalent, their native support for efficient key management could make business keys a more attractive option. However, until such technologies are ubiquitous, the default recommendation remains to use hash keys for their broad compatibility and consistent performance.


Conclusion

Data Vault 2.0’s approach to hashing business keys and hash diffs provides significant advantages in terms of consistency, scalability, and performance. While the performance costs of hashing can be a concern, particularly in smaller environments with limited hardware, careful consideration of the available options and optimization strategies can mitigate these issues. Ultimately, the decision should be guided by the specific technological context and future-proofing considerations.

For most scenarios, hash keys remain the recommended approach due to their versatility and robustness in mixed and distributed environments. However, as technology evolves, the use of business keys might become more feasible, highlighting the importance of staying informed about the latest trends and advancements in data management.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Multi Active Satellites on Links

Watch the Video

In our ongoing series, our CEO Michael Olschimke addresses a complex question from the audience regarding the use of Multi Active Satellites (MAS) on Links within a Data Vault 2.0 model. This topic touches on advanced aspects of data modeling, particularly in the context of handling multiple active records.

The question posed was, “Can the Multi Active Satellites be used on LINKs too (considering that on Link we have the option of using the child dependent key)? Please ignore the fact that the link doesn’t have a Hash column on all HUB keys.”

Michael’s response delves into the practical application of MAS on Links, an area that can greatly enhance the flexibility and scalability of data models. He explains that while Multi Active Satellites are traditionally used with Hubs to track multiple active records, their application on Links is both feasible and beneficial. By leveraging the dependent child key, it is possible to maintain multiple active relationships between entities, which is particularly useful in scenarios where relationships are dynamic and subject to frequent change.

Drawing on his 15 years of experience in Information Technology, with a focus on Business Intelligence over the past eight years, Michael offers a nuanced perspective on this topic. He highlights that while the absence of a hash column on all HUB keys might pose a challenge, it can be mitigated through careful design and implementation strategies. By ensuring that each Link is adequately documented and structured, organizations can effectively use MAS to capture the complexity of real-world relationships without sacrificing data integrity or performance.
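
As a purely illustrative sketch (the table and column names are assumptions, not a prescribed model), a multi-active satellite hanging off a link could look like this, with a dependent child key completing the satellite's primary key:

```sql
CREATE TABLE sat_customer_product_prices (
    link_customer_product_hk  CHAR(32)     NOT NULL,  -- hash key of the parent link
    load_date                 TIMESTAMP    NOT NULL,
    price_list_code           VARCHAR(10)  NOT NULL,  -- dependent child key: one active row per price list per load
    price                     DECIMAL(12,2),
    hash_diff                 CHAR(32),
    record_source             VARCHAR(50),
    PRIMARY KEY (link_customer_product_hk, load_date, price_list_code)
);
```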

In conclusion, Michael emphasizes the importance of flexibility and adaptability in data modeling. Implementing Multi-Active Satellites on Links can provide significant advantages in managing complex data relationships, allowing for more granular and accurate data analysis. This approach aligns with best practices in Data Vault 2.0 and supports the goal of creating robust, scalable, and responsive data architectures. Michael encourages practitioners to challenge conventional boundaries and explore innovative solutions to meet their unique data management needs.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Scale Up your Data Vault Project – with dbt Mesh


Learn how dbt Mesh enhances Data Vault projects within dbt Cloud by facilitating a more efficient data mesh architecture. The larger a data warehouse project grows, the more people begin to rely on and work with the data provided. This work could be consuming the data, applying business rules, modeling facts and dimensions, or other typical tasks in a data environment. In a large organization, all these users might be scattered across different divisions, and the data they are working with might belong to different business domains. At some point, the entire organization faces the challenge of data sharing and governance guidelines, which might prohibit users of the sales department from accessing data from the finance department. A data mesh offers a solution that helps organizations deal with these challenges. If you want to learn more about the data mesh, check out our recent blog article about Data Vault and data mesh here!

We also have a webinar on exactly this subject. Don’t miss it – watch the recording for free!

Data Mesh Support in dbt Cloud

Many organizations struggle with introducing a Data Mesh approach into the Data Vault landscape. In this webinar, we will dive into dbt Mesh, and how to leverage it in a Data Vault project.

Watch Webinar Recording

What is dbt Mesh?

dbt Mesh is a recently added feature set that makes dbt Cloud work more efficiently with a data mesh approach. The already familiar {{ ref() }} function is no longer limited to models within one dbt project; instead, it can refer to models in other dbt projects.
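
As a minimal sketch, assuming a foundational project named raw_vault and models named hub_customer and sat_customer_details (all illustrative), a model in a downstream domain project could reference them via the two-argument form of ref():

```sql
-- Finance-side Business Vault view built on Raw Vault objects
-- that live in a separate, foundational dbt project ("raw_vault").
select
    hub.customer_hk,
    sat.customer_name,
    sat.load_date
from {{ ref('raw_vault', 'hub_customer') }} as hub
join {{ ref('raw_vault', 'sat_customer_details') }} as sat
  on sat.customer_hk = hub.customer_hk
```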

Why would I want to refer to other dbt projects?

Imagine a big organization that uses dbt Cloud for their Data Vault implementation. The project might have 400 sources defined, 2000 models implemented, and is used actively by 30 developers. Out of these 30 developers, there might be 5 people specifically working on the Business Data Vault and Information Mart layer for finance-related objects. Another 5 developers are working on the same layers but for sales-related objects.

At some point, you might want to prevent finance people from modifying the sales-related dbt models, so a data mesh architecture needs to be implemented. This allows the organization to define policies regarding data sharing, data ownership, and other governance measures.

With dbt Mesh, both the Sales and the Finance team get their own dbt project. Since both should be based on the same Raw Data Vault, an additional foundational dbt project is created exclusively for staging and Raw Data Vault objects. Both domain-specific dbt projects, sales and finance, can then refer to Raw Vault objects inside the foundational dbt project without physically replicating the data.


How can I leverage dbt Mesh in a Data Vault powered Data Mesh?

Define Data Contracts

dbt models, or groups of models, can now be configured with data contracts. Inside the already familiar .yml files, models can be marked as publicly available (within the organization), data owners can be enforced, and table schemas can be locked.

Create a Foundational dbt project

In a data mesh architecture, the most common way to implement Data Vault 2.0 is to share a common Raw Vault as the foundation, while Business Vault and Information Mart objects are divided by business domain. In dbt Mesh, this is reflected in a foundational dbt project that includes all staging and Raw Data Vault objects. Only the Raw Data Vault objects would be configured to be accessible by other dbt projects, since the staging models should not be used outside of Raw Data Vault models.

Add domain-level dbt projects

Based on the foundational Raw Vault dbt project, each domain team can now work in its own dbt project. They access the Raw Data Vault via the (extended) {{ ref() }} function and don’t have to worry about maintaining these Raw Vault objects. Additionally, they can define which of their artifacts might be useful for other domains; these can be shared via their own data contracts.

Distribute Responsibilities

Typically, a power user does not create Hubs, Links, and Satellites, and it is not their responsibility to ensure a reliable Raw Data Vault to build transformations on. Therefore, it is important to define responsibilities within each dbt project. Objects that are shared outside of a project should, in particular, always have data contracts and defined owners. This ensures that users of these shared objects can rely on them.

Conclusion

All in all, dbt Mesh offers a great way to implement a true data mesh approach. It is especially relevant when different business domains of one organization work together in dbt to create trustworthy deliverables. In most scenarios, it makes sense to start using dbt Mesh even if your project is not that big yet. Clear responsibilities and data contracts always help maintain trust and transparency for your data!

– Tim Kirschke (Scalefree)

Modelling Exchange Rates

Watch the Video

In our ongoing series, our CEO Michael Olschimke addresses a question from the audience about modelling daily exchange rates within the Data Vault framework for a non-banking industry. The query highlights a common challenge faced by many organizations: integrating and managing exchange rate data effectively.

The question posed was, “How would you model daily exchange rates in Data Vault 2.0 for a non-banking industry? We are already using a reference table for the list of currencies (I guess we would have currency as a hub in the banking industry, but that is not our case). Now we also need daily exchange rates for currency conversions in the datamart layer. I would start with a Link for exchange rates, but do we need to create a hub for currencies? How about existing references to currency in the existing model (currently in SAT, because we have currency as a reference table)?”

Michael’s response delves into the intricacies of data modelling in such scenarios. He suggests that even though your industry is non-banking, establishing a structured and scalable way to manage exchange rate data is crucial. Using a Link for exchange rates is a good starting point, but creating a hub for currencies could provide additional benefits. This hub would act as a central repository for all currency-related information, ensuring consistency and ease of access across different layers of the data architecture. Additionally, integrating existing references to currencies within the model can streamline operations and enhance the accuracy of financial data analytics.
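
As an illustrative sketch only, and not a definitive answer for every model, such a design could consist of a currency hub, an exchange rate link, and a satellite carrying the daily rate keyed by a dependent child key; all names below are assumptions:

```sql
CREATE TABLE hub_currency (
    currency_hk   CHAR(32)    NOT NULL PRIMARY KEY,
    currency_code CHAR(3)     NOT NULL,   -- business key, e.g. 'EUR'
    load_date     TIMESTAMP   NOT NULL,
    record_source VARCHAR(50)
);

CREATE TABLE link_exchange_rate (
    exchange_rate_hk  CHAR(32)  NOT NULL PRIMARY KEY,
    from_currency_hk  CHAR(32)  NOT NULL,  -- references hub_currency
    to_currency_hk    CHAR(32)  NOT NULL,  -- references hub_currency
    load_date         TIMESTAMP NOT NULL,
    record_source     VARCHAR(50)
);

CREATE TABLE sat_exchange_rate (
    exchange_rate_hk  CHAR(32)      NOT NULL,
    load_date         TIMESTAMP     NOT NULL,
    rate_date         DATE          NOT NULL,   -- dependent child key: the day the rate applies to
    rate              DECIMAL(18,8) NOT NULL,
    hash_diff         CHAR(32),
    record_source     VARCHAR(50),
    PRIMARY KEY (exchange_rate_hk, load_date, rate_date)
);
```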

In conclusion, Michael emphasizes the importance of a well-thought-out data architecture. By creating a dedicated hub for currencies and effectively linking exchange rate data, organizations can ensure more accurate and efficient currency conversions in their datamart layer. This approach not only aligns with best practices in data vault modeling but also supports the broader goal of maintaining data integrity and usability across the enterprise.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Data Vault PITs in PowerBI – Joining Type 2 Dimensions

Watch the Video

In our ongoing series, our CEO Michael Olschimke discusses a question from the audience:

“We would like to either build a semantic model ourselves or let end users build it themselves as star schemas.

In Information Marts we expose facts and dimensions.

Dimensions are based on PITs and expose all contextual attributes valid at a given snapshot date.

Now the problem is that facts and dimensions need to be joined not only by the main keys of the dimension, but also by the snapshot date. However, Power BI allows joins on only one attribute.

What is a recommended way to tackle this?

I thought of introducing sequence numbers in PITs and exposing them in virtualized fact views, additionally exposing a separate snapshot dimension that synchronizes the snapshots of all the dimensions (otherwise we end up with a cartesian join). However, this defeats partitioning in the PITs (joining over the sequence number and not hashKey + SnapshotDate blocks partition pruning).”

Michael delves into an in-depth discussion on leveraging Data Vault’s Point-in-Time (PIT) tables within PowerBI, exploring how this integration enhances analytical capabilities and supports dynamic reporting in the realm of Big Data.
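
One workaround often discussed for this situation, shown here only as a sketch and not necessarily the approach Michael recommends, is to persist a single surrogate join column in the PIT that combines the hub hash key and the snapshot date, and to expose the same column in the virtualized fact view so that Power BI can join on one attribute. All object names, including the load-date pointer columns, are assumptions:

```sql
SELECT
    -- single-column join key for Power BI: hash key + snapshot date combined
    MD5(CAST(pit.customer_hk AS VARCHAR) || '||' || CAST(pit.snapshot_date AS VARCHAR)) AS dim_customer_key,
    pit.customer_hk,
    pit.snapshot_date,
    sat1.customer_name,
    sat2.customer_segment
FROM pit_customer pit
LEFT JOIN sat_customer_details sat1
  ON sat1.customer_hk = pit.customer_hk AND sat1.load_date = pit.sat_details_ldts
LEFT JOIN sat_customer_scoring sat2
  ON sat2.customer_hk = pit.customer_hk AND sat2.load_date = pit.sat_scoring_ldts;
```

As the question itself notes, joining on such a derived key rather than on hashKey + SnapshotDate can interfere with partition pruning, which is exactly the trade-off discussed in the video.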

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Multiple PIT Tables for Different Business Scenarios

Watch the Video

In our ongoing series, CEO Michael Olschimke addresses an audience question:

“We have a need for PIT tables to be used in different business scenarios and would like to use different SQL statements to load a number of PIT tables, one for each scenario.

What is your take on it?”

PIT tables help track the historical state of records over time for analysis, and customized SQL statements cater to the specific data needs of each business context.
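
As a hedged sketch of what this could look like (all table and column names are assumptions), two PIT tables over the same hub could be loaded by separate SQL statements that differ only in their snapshot-frequency filter:

```sql
-- Scenario 1: daily reporting PIT
INSERT INTO pit_customer_daily (customer_hk, snapshot_date, sat_details_ldts)
SELECT h.customer_hk, c.snapshot_date, MAX(s.load_date)
FROM snapshot_control c
CROSS JOIN hub_customer h
LEFT JOIN sat_customer_details s
  ON s.customer_hk = h.customer_hk AND s.load_date <= c.snapshot_date
WHERE c.is_daily = 1
GROUP BY h.customer_hk, c.snapshot_date;

-- Scenario 2: month-end finance PIT over the same hub, different frequency filter
INSERT INTO pit_customer_month_end (customer_hk, snapshot_date, sat_details_ldts)
SELECT h.customer_hk, c.snapshot_date, MAX(s.load_date)
FROM snapshot_control c
CROSS JOIN hub_customer h
LEFT JOIN sat_customer_details s
  ON s.customer_hk = h.customer_hk AND s.load_date <= c.snapshot_date
WHERE c.is_end_of_month = 1
GROUP BY h.customer_hk, c.snapshot_date;
-- (In a real load, NULLs from the LEFT JOIN would typically point to a ghost record / zero key instead.)
```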

Michael covers best practices, performance optimization, and data integrity considerations when managing numerous PIT tables, offering insights that can enhance your organization's data strategy.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Defining Multiple Snapshots per Day via Control Table

Watch the Video

In our ongoing series, our CEO Michael Olschimke discusses a question from the audience:

“Would a micro or mini batch refresh frequency in PIT tables of the data warehouse with subsequent aligned reporting yield multiple timestamps for a certain date in the snapshot control table?”

Michael goes on to explore the concept of multiple snapshots and how they can provide valuable insights into the evolution of data over time. By capturing snapshots at different points in time, organizations can gain a deeper understanding of trends, patterns, and anomalies within their data. This nuanced approach to data management can lead to more informed decision-making and improved overall performance.
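
As a minimal sketch (names are illustrative), a snapshot control table driven by mini-batch refreshes would simply contain several timestamps for the same calendar date:

```sql
CREATE TABLE snapshot_control (
    snapshot_ts   TIMESTAMP NOT NULL PRIMARY KEY,
    snapshot_date DATE      NOT NULL,
    is_hourly     SMALLINT  DEFAULT 0,
    is_daily      SMALLINT  DEFAULT 0
);

INSERT INTO snapshot_control (snapshot_ts, snapshot_date, is_hourly, is_daily) VALUES
  (TIMESTAMP '2024-03-01 06:00:00', DATE '2024-03-01', 1, 0),
  (TIMESTAMP '2024-03-01 12:00:00', DATE '2024-03-01', 1, 0),
  (TIMESTAMP '2024-03-01 23:59:59', DATE '2024-03-01', 1, 1);  -- end-of-day snapshot doubles as the daily one
```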

Join us as we unravel the complexities of multiple snapshots and their impact on the data warehouse landscape.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Filtering Snapshot Date Frequencies in PIT Tables

Watch the Video

In our ongoing series, our CEO Michael Olschimke discusses a question from the audience:

“Regarding the information delivery perspective where the business user likes to have stable reports, is it correct to say that each unique frequency of a required business snapshot date (where the snapshot date is a timestamp), will have its own filter column in the snapshot control table? E.g. weekly, monthly, but also more specific ones to suit a particular business process, like in education the beginning of the nth quartile week to ensure all grades achieved are registered?”

He discusses the importance of having unique filter columns in the snapshot control table for each distinct frequency of a required business snapshot date.

By customizing snapshot date frequencies to suit different business processes, organizations can enhance the accuracy and relevance of their reporting mechanisms.
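
A hedged sketch of what such frequency-specific filter columns could look like (the column names, including the education-specific one, are assumptions):

```sql
-- One flag column per reporting frequency in the snapshot control table.
ALTER TABLE snapshot_control ADD COLUMN is_weekly         SMALLINT DEFAULT 0;
ALTER TABLE snapshot_control ADD COLUMN is_end_of_month   SMALLINT DEFAULT 0;
ALTER TABLE snapshot_control ADD COLUMN is_grading_cutoff SMALLINT DEFAULT 0;  -- e.g. start of the n-th quarter week

-- A monthly report then simply filters on its own flag:
SELECT snapshot_ts
FROM snapshot_control
WHERE is_end_of_month = 1;
```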

Michael’s insights highlight the strategic importance of managing snapshot dates effectively to optimize information delivery and drive better decision-making.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Optimizing CI/CD – A Guide to Automated IaC Pipelines with Terraform


Introduction

In this article, we will talk about how you can improve your CI/CD (Continuous Integration / Continuous Deployment) development by implementing IaC (Infrastructure as Code) and a well-structured, automated pipeline that guards against error-prone deployments. For IaC, we will specifically use Terraform, since it is the most commonly used cloud-agnostic tool.

CI/CD (Continuous Integration / Continuous Deployment)

When working with the cloud, it is important to always have a good overview of how your infrastructure develops. In bigger companies it can grow rather large and should be kept organized. This is where IaC comes in handy: using software like Terraform lets you isolate parts of your infrastructure into their own representative projects, allowing for much better maintenance.

Not only is IaC helpful for improving development in your company, it is also essential to secure your deployments with a pipeline that monitors upcoming changes to the project. Nobody wants to deploy changes and accidentally break everything. Error checks are thus mandatory to ensure a safe working environment. In the upcoming parts of this article, we will show you, step by step, some options for implementing a good and safe IaC pipeline for your needs.


How to get started – Terraform

When it comes to developing infrastructure on cloud providers like AWS, Google Cloud, Azure, and so on, it is always good to have strong governance over everything that is currently deployed on these services.

To have such governance, it is important to divide the existing or new infrastructures into different projects. Infrastructure as Code (IaC) involves the management and provisioning of infrastructure using code, as opposed to relying on manual procedures. Introducing Terraform:


Terraform is one of the most widely used IaC tools and allows for exactly this. Defining your infrastructure in code files lets you fully control every part of your project and isolates it from everything else on the provider. Deploying new resources, adjusting them, or even deleting them can all be done by changing the code and applying the changes via Terraform itself, which supports continuous integration (CI) nicely.

Even resources that were created before the use of IaC can easily be imported into a Terraform configuration and managed by it retroactively. This makes Terraform an ideal starting point for agile development in your company.



Using repositories for automation – Git

Now that all of your infrastructure is managed in Terraform and separated into distinct projects, we can move to the next step. Since a DevOps team usually consists of more than one person, it is essential to make your newly created files accessible to the entire team. Therefore, you should move them to a version control system like Git, which comes with many features that are also very helpful for setting up proper pipeline automation.

When using an online code repository like GitHub, you can create so-called “Actions” to instantly deploy changes when a new commit is pushed. This ensures a single source of truth for your infrastructure. Even mistakes can easily be fixed by checking previous changes in the files or simply by reverting to a previous working commit.

Moreover, every deployment now happens automatically, which is exactly what we want. This form of automation adds great value to the continuous deployment (CD) part of our pipeline.


Optimizing your automation – Security

We are now at the point where your infrastructure is managed with Terraform and deployed automatically from your repository, for example through a GitHub Actions workflow. But one big piece is still missing to bring the pipeline to its full potential: security.

Currently, whenever a change is pushed to your repository, it is instantly deployed (as long as there are no errors in the Terraform files). Somebody could accidentally remove a resource and commit that change unintentionally, possibly causing serious problems. This is why we want to secure the process of deploying changes.

Luckily, there are many options for securing your pipeline. In the following sections, we will quickly run through some solutions that secure your deployment pipeline and its environment.


Manual Approval via Repository

We already have an automated deployment workflow in the online code repository (e.g. GitHub) that deploys all newly committed changes. Many of these systems also support manual approval. This feature puts a gate between push and deployment and first asks other team members to review the upcoming changes.

The deployment only continues once enough people have approved the changes. One example is GitHub's deployment protection rules: you can define a specific number of required reviewers, and the deployment is only executed once that many reviewers have approved it.

Alternatively, you could create your own workflow and set the number of required approvals from your team inside the workflow's config file.


Software for Best Practices

Working infrastructure can still have security flaws that the approvers might have overlooked. For example, a server might be reachable from the internet even though it is supposed to be private and only accessible by your other infrastructure.

These kinds of problems are also a big threat to the security of your pipeline. Luckily, people who are aware of this issue have developed dedicated software to solve it.

For development outside of Terraform, there is the tool “pre-commit”, which checks for best practices via so-called hooks. These hooks run automated checks and tasks before changes are committed. If any of them fail, the commit is aborted until the underlying problem is fixed.

When developing with Terraform specifically, you can use tools like “tfsec”, which also checks your code against best practices. It marks critical security issues in your infrastructure as errors and does not let you deploy changes until they are fixed.


Conclusion

With everything set up, you should now have an optimized and secure CI/CD pipeline for your DevOps team to work with.

After reading this article, you should be able to see the great benefits of adding IaC, automation, versioning, and security measures to your CI/CD pipeline.

Even though setting up all these parts can be a little time-consuming at the start, it will give you a much more agile approach to development in your business.

By using this method, you minimize deployment risks, secure your deployment workflow, keep delivering results quickly, and thus maintain good team agility.

About the Author


Moritz Gunkel

Moritz is an aspiring Consultant in the DevOps department at Scalefree, specializing in cloud engineering, automation, and Infrastructure as Code, with a particular knack for Terraform. While juggling his responsibilities, including pursuing a Bachelor’s degree in Computer Science as a working student, Moritz has significantly impacted internal operations with his methodical approach, earning recognition and setting the stage for an exciting transition into a full-time consulting role. With a passion for innovation and a commitment to excellence, Moritz is set to continue making a lasting impression in the dynamic world of DevOps.

Soft-Deleting Records in Data Vault 2.0

Watch the Video

In our ongoing series, our CEO Michael Olschimke delves into a question raised by a member of the audience:

“Hi, my question is about the effectiveness of satellite tables. I notice there are no updates in the data vault. I am struggling to comprehend how we can close the End_Date field in the satellite table without actually updating it.”

In response to this query, Michael examines the concept of soft-deleting records and its implications on data management and integrity. Through insightful discussions and practical examples, he sheds light on the importance of implementing strategies such as soft deletion in maintaining data consistency and accuracy within satellite tables.
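
As a minimal sketch of the insert-only pattern (all names are assumptions), the end date can be derived at query time instead of being updated in place, and a soft delete becomes just another inserted row:

```sql
-- Derive the end date from the next row's load date (insert-only satellite).
SELECT
    customer_hk,
    load_date                                            AS start_date,
    COALESCE(
        LEAD(load_date) OVER (PARTITION BY customer_hk ORDER BY load_date),
        TIMESTAMP '9999-12-31 23:59:59'
    )                                                    AS end_date,
    customer_name,
    customer_city
FROM sat_customer_details;

-- A soft delete is likewise just another insert, e.g. into a status/record-tracking
-- satellite, rather than an update of existing rows.
INSERT INTO sat_customer_record_tracking (customer_hk, load_date, is_deleted, record_source)
VALUES ('HK_CUST_42', CURRENT_TIMESTAMP, 1, 'crm_system');
```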

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!
