Skip to main content
search
0

Accelerate Your Data Vault with Snowflake

Watch the Webinar

Data Vault and Snowflake in combination are constituting flexible and scalable Enterprise Data Warehouse solutions.

Attendees will get insights about building and loading a GDPR-compliant Data Lake (AWS) and Data Vault model. The loading and querying processes have a great scalability within Snowflake.

The webinar includes a live demo from Snowflake showing Data Ingestion, Variant Data Types, and Data Sharing opportunities.

Watch Webinar Recording

Webinar Agenda

1. Intro
2. Accelerate your Data Vault with Snowflake (Scalefree)
3. Snowflake Demo (Snowflake)

Write Backs in the Enterprise Data Warehouse Architecture

Managed Self Service BI and write backs

The Data Vault 2.0 Layers

This issue covers write backs into the enterprise data warehouse and how the Data Vault 2.0 architecture can facilitate it. Many people already know the three layer architecture of data warehouses which is used in Data Vault 2.0. The first layer represents the staging area which holds the raw data from the source systems. The enterprise data warehouse layer, which in this case contains a Data Vault 2.0 model and the third layer with the Information Marts, which deliver the information in various structures (Star Schemas, Snowflake Schemas etc.).

DV2.0 Architecture and write backs

Figure 1. Data Vault 2.0 Architecture

This architecture provides possibilities and benefits for data write backs. Two possibilities are writing back data into the enterprise data warehouse and into the source systems. This issue covers the write back into the enterprise data warehouse, while an upcoming article will cover the write back into the source systems.

Continue Reading

Batch Loading Strategies for Data Vault 2.0

Solutions

Loading Strategies

In the realm of general data warehousing, various loading strategies come into play. One prevalent challenge often encountered is the absence of deleted records within a delta. In typical data warehousing scenarios, it becomes crucial to recognize and track deletions from the source system, often referred to as soft deletes.

The distinction lies in the need to not only capture new or modified data (delta) but also to account for records that have been deleted at the source. Soft deletes involve marking records as deleted rather than physically removing them, allowing for a more nuanced and traceable approach to data management.

ACCESS THE SOLUTION

Data Vault Use Cases Beyond Classical Reporting – Part 3

Data Vault use cases for reporting

New Possibilities with Data Vault 2.0

Data Vault 2.0 empowers organizations to go beyond traditional reporting by unlocking new avenues for scalability, automation, and data-driven decision-making. From cleansing data and automating business processes to enabling advanced data science techniques like machine learning and predictive analytics, Data Vault 2.0 provides a flexible framework for modern data challenges. In this article, we explore how Data Vault 2.0 integrates data science to optimize operational processes, predict outcomes, and enhance enterprise data warehouses, ensuring a competitive edge in today’s data-driven landscape.

Going beyond standard reporting

Reporting and dashboarding have become the standards in business when it comes to identifying KPIs and other measurements. As such, Enterprise Data Warehouses have emerged to support the reporting process. Though, due to the large quantity and variety of data, a demand has developed for a method of utilizing this existing data in a manner in which it can add additional business value towards a company’s needs. Data Vault 2.0 offers a wide range of methods to provide decision support beyond standard reporting as well as critical information regarding the future. To see for yourself, join us as we present different approaches and solutions as to fully leverage the potential of your data.

Watch webinar recording

Going beyond standard reporting

As we have shown in previous issues, Data Vault 2.0 enables individuals to implement reporting beyond the traditional methods.
In the first part, we demonstrated how to perform data cleansing in Data Vault 2.0.
And the second use case showed how to implement business process automation using Interface Marts.

The scalability and flexibility of Data Vault 2.0 offers a whole variety of use cases that can be realized, e.g. to optimize as well as automate operational processes, predict the future, push data back to operational systems as a new input or trigger events outside the data warehouse, to name a few. Continue Reading

Satellite Modeling for Any Structural Changes in the Source System

Modeling a Satellite in the instance of any structural changes within the source system

Over time, a source system can change. The question is how to absorb these changes into a Data Vault 2.0 data warehouse, especially when considering the satellites?

It is necessary to find a balance between the reengineering effort and performance when the source table structure changes. To better help those who find structural changes in the source system, this article will present our recommendations, based on our knowledge base,  for various types of changes in a source.

This article describes features embodied in the Data Vault 2.0 model: the foundation of a hub, link, and satellite entities can adjust to changes in the source data easily, thus reducing the cost of reengineering the enterprise data warehouse

New columns in the source system: when any new columns or attributes are added to the source

There are two options for absorbing new attributes from the source into the data warehouse. First, the existing satellite could be modified.
This is a pragmatic approach but requires the modification of existing code.
On the other hand, it is also possible to create a new satellite for the new attribute, or attributes, without modifying the existing satellites. This has the advantage of a zero code impact but requires more joins in an Information Delivery part of the Data Vault.

The first option does not require this join as the new attribute is added to the existing satellite. The best approach is to compare the advantages and disadvantages of both options in the specific situation as it applies to your situation. Automation tools for example usually can handle the alter table statement automatically without manual coding effort but require changes be made in the database.

Removing columns in the source system from source column deletion 

One option is to close the “old” Satellite, i.e. not load it further, as the ETL code is turned off, and create a new satellite which should be loaded. The same approach is used when the underlying data structures from the source are modified in a larger perspective.
Old satellites are turned off, new satellites with the new structure are then loaded.
Another option would be more meaningful if there are only minor changes needed such as the removal of one column. Then “simulating” this column with a NULL value or a value which adds meaning and makes more sense would be more helpful for auditing purposes.

If a new Satellite is created, the end result will be two new columns in the related PIT table (Hash Key + LDTS). 

Closing a satellite and creating a new one is also applicable if there are major changes in the source system, for example a new release version of the source system where columns are deleted, renamed and created. In the instance of small changes, especially when columns disappear, we recommend altering the satellite.

Creating a Virtual Dimension table from a PIT table having multiple satellites

When a new satellite for the new attribute, or attributes, is created by not modifying the existing satellite, a new virtual dimension is required to fetch information from PIT tables using both the satellites accordingly based on the required timestamp.
There are two approaches on how the information can be drawn using both the satellites:

  • The first approach uses a computed Satellite, in which you combine all satellites with the most recent record per Hash Key and the same structure. Though, this might be a complicated query as it depends upon the amount of data and the number of Satellites to join.
  • The second approach is to use a PIT table for all satellites and when querying the data out, for a dimension table for example, you take the record from the leading one, for example using an IIF statement or COALESCE function.

Conclusion

While every situation does require an approach that takes into account the individual nature of the task, the above solutions have proven themselves to be vital when we implement them within our own projects.

We offer these as a way of allowing others to benefit from what our testing, application, and implementation have taught us.

Splitting a Satellite Entity Based on the Source Data

Satellite split by source system

Splitting a Satellite Entity

Satellite splitting criteria plays a vital role in a satellite’s structure. Being such, it is not recommended that the entirety of descriptive data related to a business key should be stored in a single satellite structure. Instead, raw data should preferably be split by certain criteria.

 

Criteria for splitting a Satellite

In general, we have defined the following types of satellite splits:

  1. Splitting by source system
  2. Splitting by rate of change

Additionally, we have defined two more types of splits as mentioned below:

  1. Splitting by level of security and by the level of privacy
  2. Business-driven split

A satellite split by source system is strongly recommended to prevent two issues when loading the data into the enterprise data warehouse: first, if two different source systems with different relational structures should be loaded into the same satellite entity, a transformation of the structure might be required. However, structural transformation requires business logic sooner or later and that should be deferred to the information delivery stage to support fully-auditable environments as well as the application of multiple business perspectives. Continue Reading

Delete and Change Handling Approaches in Data Vault 2.0 Without a Trail

Data Vault 2.0 - Insert logic

Delete and Change Handling Approaches in Data Vault 2.0

In this article, we will show you how to use counter records for change or delete practices in Data Vault 2.0. In January of this year, we published a piece detailing an approach to handle deletes and business key changes of relationships in Data Vault without having an audit trail in place.
This approach is an alternative to the Driving Key structure, which is part of the Data Vault standards and a valid solution.
However, at times it may be difficult to find the business keys in a relationship which will never change and therefore be used as the anchor keys, Link Driving Key, when querying. The presented method inserts counter records for changed or deleted records, specifically for transactional data, and is a straightforward as well as pragmatic approach. However, the article caused a lot of questions, confusion and disagreements.
That being said, it is the intention of this blogpost to dive deeper into the technical implementation in which we could approve by employing it.



Technical Implementation in Data Vault 2.0

The following table shows a slightly modified target structure of the link from the previous blog post when using counter records in Data Vault 2.0.

In this case, we are focusing on transactions that have been changed by the source system without delivering any audit data about the changes as well as no counter bookings by the source itself.

It is important to note that the link stores the sales positions by referencing the customer and the product. Thus, the Link Hash Key, as well as the Load Date, are the primary keys as we are not able to gather a consistent singular record ID in this case. Being so, the Link Hash Key is calculated by the Customer Business Key, the Product Business Key, the sales price, and the transaction timestamp. 

Data Vault 2.0 link with counter record

Figure 1: Link with counter records

To load the link, the following steps are required:

Firstly, insert and check as to whether a counter booking is necessary at all as the former step loads new data from the staging area into the link. Please note that the loading logic in this step is similar to that in the standard link loading process, with some differences:

Data Vault 2.0 - Insert logic

Figure 2: Insert Logic

In Data Vault 2.0, the counter record should identify records, the most recent records by Link Hash Key, that exist in the link but don’t exist in the staging area due to deletion or changes made to the record. Thus, query results will be “countered” with a value for “counter” set to -1, which indicates that these records are not able to be found at this stage. Note that in this query we selected the existing record from the link table in the raw vault, however, further note that the record’s time of arrival should be the LDTS of the actual staging batch. Therefore, within the shown statement, the LDTS is a variable with the load date of the staging batch:

Data Vault 2.0 - Counter logic

Figure 3: Counter Logic

In instances in which it changes back to the original record, the same procedure applies: The current missing value will be countered by the new one inserted again with a new LDTS. 

Conclusion

Thus, we can conclude that this Data Vault approach works well for tables which are a hot-spot for measured values only as well as when changes are possible, although the data represents “transactions” and is to be used when CDC is not available.

Instead of a “get the most recent record per Hash Key (Driving Key)” it is possible to run calculations as well as aggregations directly on one table which results in a better performance in the end stage.

If there are still questions left, please feel free to leave a comment. We are looking forward to an exchange and your thoughts on the topic.

Overcoming Data Warehousing Challenges with Data Vault 2.0

Watch the Webinar

There are many Data Warehousing challenges that Organizations and Data Warehousing Teams are struggling with. In this short one-hour webinar, we will talk about some major challenges and how you can overcome them with Data Vault 2.0.

Watch Webinar Recording

Webinar Agenda

1. Achieving and maintaining agility
2. Maintaining or migrating legacy DWH systems
3. Security and privacy (GDPR)
4. Big data
5. Adapting to changing requirements
6. Auditability and data lineage
7. Self-service BI

Data Vault Use Cases Beyond Classical Reporting – Part 2

The Role of the Enterprise Data Warehouse

The Enterprise Data Warehouse (EDW) is no longer limited to reporting and dashboarding; it is a powerful tool for driving business process automation and operational efficiency. Building on earlier discussions in our Data Vault Use Cases series, this article explores how Data Vault 2.0 facilitates seamless integration with automation frameworks and external interfaces to enhance knowledge management, streamline operations, and reduce manual workload. From leveraging interface marts to trigger automated processes to enriching documentation systems with dynamic updates, we uncover the transformative potential of the EDW in modern business environments.

Going beyond standard reporting

Reporting and dashboarding have become the standards in business when it comes to identifying KPIs and other measurements. As such, Enterprise Data Warehouses have emerged to support the reporting process. Though, due to the large quantity and variety of data, a demand has developed for a method of utilizing this existing data in a manner in which it can add additional business value towards a company’s needs. Data Vault 2.0 offers a wide range of methods to provide decision support beyond standard reporting as well as critical information regarding the future. To see for yourself, join us as we present different approaches and solutions as to fully leverage the potential of your data.

Watch webinar recording

More Than Reporting and Dashboarding

As we first introduced within the first part of the Data Vault Use Cases article series, the Enterprise Data Warehouse (EDW) can do more than just simple reporting and dashboarding. 

We previously explored how the EDW can help to improve data quality by implementing data cleansing rules. 

This can be applied by write-back operations that affect the source systems directly. Though this was only one example of how to add more value to the EDW.
The scalability and flexibility of Data Vault 2.0 offers a whole variety of use cases that can be realized, e.g. to optimize and automate operational processes, predict the future, push data back to operational systems as a new input or trigger events outside the data warehouse, to name a few.

Continue Reading

Capturing Semi-Structured Descriptive Data

The previous articles within this series have presented hub and link entities to capture business keys as well as the relationships between business keys. To illustrate, the hub document collection in MongoDB is a distinct list of business keys used to identify customers. 

As to capture the descriptive data, which in this case is the describing factor of the business keys, satellite entities are used in Data Vault. As both business keys and relationships between business keys can be described by user data, satellites may be attached to hub as well as link entities as such:

Continue Reading

Identifying Additional Relationships Between Documents

The last article within our series recently covered the Data Vault hub entity which is used to capture distinct list of business keys in an enterprise data warehouse as most integration will actually occur on these hub entities themselves. However, there are scenarios in which the integration of data solely on these hub entities is not sufficient enough for the necessary end goal in mind. 

Consider this situation in which a sample data set, involving an insurance company, concerning customers signing car and home insurance policies as well as filing claims, each respectively. Though before moving forward with the example, it is important to note that there are relationships between the involved business keys, that of the customer number, the policy identifiers, and the claims.

These relationships are captured by Data Vault link entities and just like hubs, they contain a distinct list of records, as such, they contain no duplicates in terms of stored data. Thus, both will form the skeleton of Data Vault and later be described by descriptive user data stored in satellites.

Continue Reading

Close Menu