With multi-active satellites, you’re able to store multiple active records for one business key. Depending on how the data arrives from your source, there are different ways to implement multi-activity in Data Vault 2.0. In this post, we’ll explain your options for modeling.
What is a Multi-Active Satellite?
A multi-active satellite is similar to a standard satellite and its structure. As said before, it stores multiple active records per key at a point in time. This exact structure depends on the use case though.
See the exemple Data Vault model in figure 1.
Figure 1: Data Vault model
When to use a Multi-Active Satellite?
For this reason, multi-active satellites are suitable for multiple active records. So, we use it when the source system delivers them. Furthermore, it is also appropriate for use when the delivery mechanism doesn’t provide enough metadata, for example by using XML.
Let’s say you get your customer’s data delivered in semi-structured XML-files.Within it are several phone numbers in an array. But the metadata doesn’t provide more information. So the number, type, and order of the records are unknown. When flattening the XML-file, you will receive as many records as the number of tuples in the array (figure 2). If you create a standard satellite out of these, only one row per customer business key can be active at the same time. Otherwise, you would run into a primary key violation having the same hash key and ldts.
Figure 2 – customer.xml
Different Options of Implementation
To avoid these problems, and also be able to query the active phone numbers at any point in time, there are several options. These depend on whether a “multi-active key” is available though, in addition to performance issues and the usability of it when you query the data. In the following section, we’ll briefly describe them. That said, in the next post, we’re going to go more into detail.
Multi-Active Satellite Rows
In case you don’t have a multi-active key, you need to create your own one by adding a sub-sequence number per business key in the staging area. In addition to the hash key and ldts, you need to add this to the primary key (figure 3). Here, though, you have to pay close attention if the data changes in the next load. For example, if the order of the phone numbers change, you cannot assign a specific phone number to a sub-sequence.
Figure 3 – Multi-active satellite with sub-sequence
If it is possible to define type codes and add them to the source feed, the best approach is to add these to the primary key of the satellite. Here, you rarely need to change types as well. In our example, we would define types like ‘home’, ‘business’, ‘cell’ and ‘fax’ as shown in figure 4. But remember, this must delivered by the source system. If not, you have to use the sub-sequences.
Figure 4 – Type code solution
Another option to avoid changing types requires pivoting in which you create different columns per type. Here, you receive the same granularity as the hub. But, while also considering the need of reengineering in case of changing types of the attributes.
The last way to solve issues regarding multi-active records is to create an extra, weaker hub for the multi-active attribute, in this case the phone number, and link them to the main hub. This requires that this attribute qualifies as a business key. Here, the number of hubs may increase heavily depending on how many multi-active data sets per business key you have. Furthermore, the new hub might not present as a real business object. So, while this is not the first way to go, it is a feasible solution in some cases.
Add the Multi-Active Attribute to the Composite Key
It’s not recommended to simply add the multi-active key to the business key. This will change the original grain and the semantic meaning. Which makes the results in that the original intention of the hub disappear.
Keep these rules for a Raw Data Vault hub in mind:
- Don’t super- or subtype in a hub
- Don’t mix different semantic meanings in a hub
- Don’t mix granularities in a hub
In many data vault projects, the BI team encounters such scenarios, especially when working with semi-structured data. That said, the best practice in case of multi-active records is to work with type codes which are added to the source feed. Though when they aren’t available, you need to use the sub-sequence. Though, as a last solution you can create an extra hub.
In the next post, we’re going to talk more about the advantages and disadvantages of the options and how exactly to implement them. Topics will be delta-checks, ldts, PITs and we’ll talk about a specific hash function to calculate multi-row hashdiffs.
– by Marvin Geerken (Scalefree)
Get Updates and Support
Please send inquiries and feature requests to [email protected].
To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.