
Satellite Modeling for Any Structural Changes in the Source System


Over time, a source system can change. The question is how to absorb these changes into a Data Vault 2.0 data warehouse, especially with regard to the satellites.

When the source table structure changes, it is necessary to find a balance between reengineering effort and performance. To help those facing structural changes in the source system, this article presents our recommendations, based on our knowledge base, for various types of changes in a source.

This article describes a feature embodied in the Data Vault 2.0 model: the foundation of hub, link, and satellite entities can adjust to changes in the source data easily, thus reducing the cost of reengineering the enterprise data warehouse.

New columns in the source system: new columns or attributes are added to the source

There are two options for absorbing new attributes from the source into the data warehouse. First, the existing satellite can be modified. This is a pragmatic approach but requires changing existing code.
Alternatively, a new satellite can be created for the new attribute, or attributes, without modifying the existing satellites. This has the advantage of zero code impact but requires more joins in the Information Delivery layer of the Data Vault.

The first option does not require this additional join because the new attribute is added to the existing satellite. The best approach is to weigh the advantages and disadvantages of both options as they apply to your specific situation. Automation tools, for example, can usually handle the ALTER TABLE statement automatically without manual coding effort, but they still require changes to be made in the database.
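As a minimal sketch of the first option, assuming a hypothetical satellite dv.sat_customer and a new source column phone_number (the exact ALTER TABLE syntax varies slightly by database platform):

    -- Add the new attribute to the existing satellite; existing rows keep NULL.
    ALTER TABLE dv.sat_customer
        ADD COLUMN phone_number VARCHAR(50);

    -- Note: the hash diff calculation in the satellite loading logic must be
    -- extended to include the new column, otherwise changes to it go unnoticed.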

Removed columns in the source system: columns are deleted from the source

One option is to close the “old” satellite, i.e. stop loading it by turning off its ETL code, and create a new satellite that is loaded from then on. The same approach is used when the underlying data structures from the source change on a larger scale: old satellites are turned off, and new satellites with the new structure are loaded instead.
Another option is more appropriate when only a minor change is needed, such as the removal of a single column. In that case, “simulating” the removed column with a NULL value, or another value that carries meaning, is more helpful for auditing purposes.
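As a hedged sketch of this second option, assume a satellite dv.sat_customer whose source no longer delivers the column fax_number; the loading query can keep the target column and substitute a constant (all names are hypothetical):

    -- The source stopped delivering fax_number; the satellite keeps the column
    -- and the staging query simulates it with NULL (or another meaningful default).
    SELECT
        hk_customer,
        ldts,
        record_source,
        customer_name,
        CAST(NULL AS VARCHAR(50)) AS fax_number
    FROM stage.customer;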

If a new Satellite is created, the end result will be two new columns in the related PIT table (Hash Key + LDTS). 

Closing a satellite and creating a new one is also applicable if there are major changes in the source system, for example a new release of the source system in which columns are deleted, renamed, and created. For small changes, especially when single columns disappear, we recommend altering the existing satellite.

Creating a virtual dimension table from a PIT table with multiple satellites

When a new satellite is created for the new attribute, or attributes, instead of modifying the existing satellite, a new virtual dimension is required that fetches the information from both satellites via the PIT table for the required timestamp.
There are two approaches to drawing the information from both satellites:

  • The first approach uses a computed satellite, in which all satellites are combined into one structure holding the most recent record per hash key. Depending on the amount of data and the number of satellites to join, this can become a complicated query.
  • The second approach is to use a PIT table across all satellites; when querying the data out, for a dimension table for example, you take the value from the leading satellite, for instance with an IIF statement or the COALESCE function, as sketched below.
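To illustrate the second approach, the following hedged sketch assumes a PIT table dv.pit_customer that stores the matching load date per satellite, plus two hypothetical satellites dv.sat_customer_a (leading) and dv.sat_customer_b:

    -- Virtual dimension combining two satellites via the PIT table;
    -- COALESCE takes the value from the leading satellite first.
    CREATE VIEW dim_customer AS
    SELECT
        pit.hk_customer,
        pit.snapshot_date,
        COALESCE(sa.customer_name, sb.customer_name) AS customer_name,
        COALESCE(sa.country, sb.country)             AS country
    FROM dv.pit_customer AS pit
    LEFT JOIN dv.sat_customer_a AS sa
        ON  sa.hk_customer = pit.hk_customer
        AND sa.ldts        = pit.sat_customer_a_ldts
    LEFT JOIN dv.sat_customer_b AS sb
        ON  sb.hk_customer = pit.hk_customer
        AND sb.ldts        = pit.sat_customer_b_ldts;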

Conclusion

While every situation requires an approach that takes its individual nature into account, the solutions above have proven themselves in our own projects.

We offer them so that others can benefit from what our testing, application, and implementation have taught us.

Splitting a Satellite Entity Based on the Source Data


Splitting a Satellite Entity

Satellite splitting criteria play a vital role in a satellite’s structure. It is not recommended to store the entirety of descriptive data related to a business key in a single satellite structure. Instead, raw data should preferably be split by certain criteria.

 

Criteria for splitting a Satellite

In general, we have defined the following types of satellite splits:

  1. Splitting by source system
  2. Splitting by rate of change

Additionally, we have defined two more types of splits:

  1. Splitting by level of security and level of privacy
  2. Business-driven split

A satellite split by source system is strongly recommended to prevent two issues when loading the data into the enterprise data warehouse. First, if two different source systems with different relational structures were loaded into the same satellite entity, a transformation of the structure might be required. However, structural transformation requires business logic sooner or later, and that should be deferred to the information delivery stage to support fully auditable environments as well as the application of multiple business perspectives.
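As a rough illustration of such a split, assume a customer hub that is described by both a CRM and an ERP system; each source keeps its own satellite with its own structure (all table and column names are hypothetical):

    -- One satellite per source system, each preserving its source structure.
    CREATE TABLE dv.sat_customer_crm (
        hk_customer    CHAR(32)      NOT NULL,   -- hash key of the customer hub
        ldts           TIMESTAMP     NOT NULL,   -- load date
        record_source  VARCHAR(100)  NOT NULL,
        hash_diff      CHAR(32)      NOT NULL,
        customer_name  VARCHAR(200),
        segment        VARCHAR(50),
        PRIMARY KEY (hk_customer, ldts)
    );

    CREATE TABLE dv.sat_customer_erp (
        hk_customer    CHAR(32)      NOT NULL,
        ldts           TIMESTAMP     NOT NULL,
        record_source  VARCHAR(100)  NOT NULL,
        hash_diff      CHAR(32)      NOT NULL,
        customer_name  VARCHAR(200),
        payment_terms  VARCHAR(50),
        PRIMARY KEY (hk_customer, ldts)
    );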

Delete and Change Handling Approaches in Data Vault 2.0 Without a Trail


In this article, we will show you how to use counter records for change or delete practices in Data Vault 2.0. In January of this year, we published a piece detailing an approach to handle deletes and business key changes of relationships in Data Vault without having an audit trail in place.
This approach is an alternative to the Driving Key structure, which is part of the Data Vault standards and a valid solution.
However, at times it may be difficult to find business keys in a relationship that will never change and can therefore be used as the anchor keys, the Link Driving Key, when querying. The presented method inserts counter records for changed or deleted records, specifically for transactional data, and is a straightforward as well as pragmatic approach. However, the article caused a lot of questions, confusion, and disagreement.
That is why this blog post dives deeper into the technical implementation of the approach.



Technical Implementation in Data Vault 2.0

The following table shows a slightly modified target structure of the link from the previous blog post when using counter records in Data Vault 2.0.

In this case, we are focusing on transactions that are changed by the source system without any audit data about the changes and without counter bookings from the source itself.

It is important to note that the link stores the sales positions by referencing the customer and the product. The Link Hash Key together with the Load Date forms the primary key, as no consistent single record ID is available in this case. Consequently, the Link Hash Key is calculated from the Customer Business Key, the Product Business Key, the sales price, and the transaction timestamp.
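A minimal sketch of this hash key calculation, assuming MD5 hashing with '||' as delimiter (hash function names and casting differ slightly per database platform, and all column names are illustrative):

    -- Link hash key over the business keys plus the attributes that make
    -- a transaction unique, since no single record id is delivered.
    SELECT
        MD5(
            UPPER(TRIM(customer_number)) || '||' ||
            UPPER(TRIM(product_number))  || '||' ||
            CAST(sales_price AS VARCHAR(20)) || '||' ||
            CAST(transaction_ts AS VARCHAR(30))
        ) AS hk_sales_transaction
    FROM stage.sales_transactions;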


Figure 1: Link with counter records

To load the link, the following steps are required:

First, insert the new data from the staging area into the link, and then check whether a counter booking is necessary at all. Please note that the loading logic in this step is similar to the standard link loading process, with some differences:


Figure 2: Insert Logic
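Since the insert logic is only shown as an image, here is a hedged sketch of this first step under the assumptions used above (hypothetical table and column names; the existence check already accounts for the counter column so that previously countered records can be inserted again):

    -- Step 1: load new transactions from staging into the link with counter = 1.
    INSERT INTO dv.link_sales_transaction
        (hk_sales_transaction, hk_customer, hk_product,
         ldts, record_source, sales_price, transaction_ts, counter)
    SELECT
        stg.hk_sales_transaction,
        stg.hk_customer,
        stg.hk_product,
        stg.ldts,
        stg.record_source,
        stg.sales_price,
        stg.transaction_ts,
        1 AS counter
    FROM stage.sales_transactions AS stg
    WHERE NOT EXISTS (
        -- skip records that are currently active in the link (counter sum > 0)
        SELECT 1
        FROM dv.link_sales_transaction AS lnk
        WHERE lnk.hk_sales_transaction = stg.hk_sales_transaction
        HAVING SUM(lnk.counter) > 0
    );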

In Data Vault 2.0, the counter step should identify the most recent records per Link Hash Key that exist in the link but no longer exist in the staging area because they were deleted or changed. These records are “countered” by inserting them again with the “counter” column set to -1, indicating that they can no longer be found in the stage. Note that this query selects the existing record from the link table in the Raw Vault, but the record’s time of arrival should be the LDTS of the current staging batch. Therefore, in the shown statement, the LDTS is a variable holding the load date of the staging batch:


Figure 3: Counter Logic
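Again only as a hedged sketch, not the original statement: the counter step inserts a record with counter = -1 for the most recent active link entries that are missing from the current staging batch, using the load date of the staging batch as LDTS (here the bind variable :stage_ldts; the record source label is a hypothetical choice):

    -- Step 2: counter records (-1) for link entries no longer present in staging.
    INSERT INTO dv.link_sales_transaction
        (hk_sales_transaction, hk_customer, hk_product,
         ldts, record_source, sales_price, transaction_ts, counter)
    SELECT
        cur.hk_sales_transaction,
        cur.hk_customer,
        cur.hk_product,
        :stage_ldts      AS ldts,            -- load date of the current staging batch
        'SYSTEM.COUNTER' AS record_source,   -- hypothetical system record source
        cur.sales_price,
        cur.transaction_ts,
        -1               AS counter
    FROM (
        -- most recent record per link hash key
        SELECT l.*,
               ROW_NUMBER() OVER (PARTITION BY l.hk_sales_transaction
                                  ORDER BY l.ldts DESC) AS rn
        FROM dv.link_sales_transaction AS l
    ) AS cur
    WHERE cur.rn = 1
      AND cur.counter = 1                    -- not countered yet
      AND NOT EXISTS (
          SELECT 1
          FROM stage.sales_transactions AS stg
          WHERE stg.hk_sales_transaction = cur.hk_sales_transaction
      );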

If a record changes back to its original values, the same procedure applies: the now missing record is countered, and the original record is inserted again with a new LDTS.

Conclusion

In conclusion, this Data Vault approach works well for tables that contain measured values only and in which changes are possible, even though the data represents “transactions”; it is meant to be used when CDC is not available.

Instead of a “get the most recent record per hash key (driving key)” pattern, it is possible to run calculations and aggregations directly on one table, which results in better performance in the end.
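For example, a revenue figure per customer can be aggregated directly on the link, because countered records cancel out (a minimal sketch using the hypothetical names from above):

    -- Counter records (-1) cancel out changed or deleted transactions.
    SELECT
        hk_customer,
        SUM(sales_price * counter) AS total_revenue
    FROM dv.link_sales_transaction
    GROUP BY hk_customer;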

If there are still questions left, please feel free to leave a comment. We are looking forward to an exchange and your thoughts on the topic.

Overcoming Data Warehousing Challenges with Data Vault 2.0


There are many data warehousing challenges that organizations and data warehousing teams are struggling with. In this short one-hour webinar, we talk about some major challenges and how you can overcome them with Data Vault 2.0.

Watch Webinar Recording

Webinar Agenda

1. Achieving and maintaining agility
2. Maintaining or migrating legacy DWH systems
3. Security and privacy (GDPR)
4. Big data
5. Adapting to changing requirements
6. Auditability and data lineage
7. Self-service BI

Meet the Speaker


Suhita Dutta

Suhita Dutta is a Senior Consultant in DWH & Business Intelligence at Scalefree. With 13 years of experience, she has had the opportunity to support organizations across various industries in their data journey, especially in the financial services industry. She is a certified DV2.0 practitioner with a primary focus on Agile Data Warehousing, DW Automation & EDW Architecture.

Capturing Semi-Structured Descriptive Data

The previous articles within this series have presented hub and link entities to capture business keys as well as the relationships between business keys. To illustrate, the hub document collection in MongoDB is a distinct list of business keys used to identify customers. 

To capture the descriptive data about these business keys, satellite entities are used in Data Vault. Since both business keys and the relationships between business keys can be described by user data, satellites may be attached to hub as well as link entities:


Identifying Additional Relationships Between Documents

The last article in our series covered the Data Vault hub entity, which is used to capture a distinct list of business keys in an enterprise data warehouse, as most integration actually occurs on these hub entities. However, there are scenarios in which integrating data solely on hub entities is not sufficient for the end goal in mind.

Consider a sample data set from an insurance company in which customers sign car and home insurance policies and file claims. Before moving forward with the example, it is important to note that there are relationships between the involved business keys: the customer number, the policy identifiers, and the claims.

These relationships are captured by Data Vault link entities. Just like hubs, they contain a distinct list of records and therefore hold no duplicates. Together, hubs and links form the skeleton of the Data Vault, which is later described by user data stored in satellites.


Integrating Documents from Heterogeneous Sources

In this part of our ongoing blog series, we would like to introduce a sample data set based on insurance data. This data set will be used to explain the concepts and patterns expanded upon further in the post. Please consider the following situation: an insurance company uses two different operational systems, let’s say a home insurance policy system and a car insurance policy system.

Both systems should be technically integrated, which means if a new customer signs up for a home insurance policy, the customer’s data should be synchronized into the car insurance policy system as well and kept in sync at all times. Thus, when the customer relocates, the new address is updated within both systems.

In reality, however, it often doesn’t go quite as one would expect: both systems are usually not well integrated or simply not integrated at all. Adding to the complexity, in some worst-case scenarios data is manually copied from one system to the other, and updates are not applied to all datasets in a consistent fashion but only to some, leading to inconsistent, contradicting source datasets. The same situation often applies to data sets after mergers and acquisitions within an organization.


Document Processing in MongoDB

Continuing our ongoing series, this piece describes the basics of querying and modifying data in MongoDB, with a focus on what is needed for Data Vault load and query patterns.

In contrast to the tables used by relational databases, MongoDB uses a JSON-based document data model. Documents are a more natural way to represent data: a single structure with related data embedded as sub-documents and arrays collapses what is otherwise separated into parent-child tables linked by foreign keys in a relational database. You can model data in any way that your application demands – from rich, hierarchical documents through to flat, table-like structures, simple key-value pairs, text, geospatial data, and the nodes as well as edges used in graph processing.


An Enterprise Document Warehouse Architecture

A common requirement for enterprise data warehousing is to provide an analytical model for information delivery, for example in a dashboard or reporting solution. One challenge in this scenario is that the required target model, often a dimensional star or snowflake schema or just a denormalized flat-and-wide entity, doesn’t match the source data structure. Instead the end-user of the analytical data will directly or indirectly define the target structure according to the information requirements.

Another challenge is the data itself, regardless of its structure.
In many, if not most, cases the source data doesn’t meet the information requirements of the user regarding its content and needs cleansing and transformation before it can be presented to the user.

Instead of just loading the data into a MongoDB collection and wrangling it until it fits the needs of the end user, the Data Vault 2.0 architecture proposes an approach that allows both the data and the business rules used for cleansing and transformation to be reused by many users. To achieve this, it is made up of a multi-layered architecture that contains the following layers:


Processing Enterprise Data with Documents in MongoDB

Today’s enterprise organizations receive and process data from a variety of sources, including silos generated by web and mobile applications, social media, artificial intelligence solutions, and IoT sensors. Efficiently processing this data at high volume in an enterprise setting is still a challenge for many organizations.

Typical challenges include the integration of mainframe data with real-time IoT messages and hierarchical documents.
Another issue is that enterprise data is not clean and might carry contradicting characteristics as well as interpretations. This poses a challenge for many processes, for example when integrating customers from multiple source systems.

Data cleansing could be considered a solution to this problem. However, what if different data cleansing rules should be applied to the incoming data set, for example because the basic assumption of “a single version of the truth” doesn’t hold in most enterprises? While one department might have a clear understanding of how the incoming data should be cleansed, another department, or an external party, might have a different understanding.

