
Granularities of Business Vault Entities

The Business Vault is the layer in the Data Vault 2.0 architecture where business logic is implemented to transform, cleanse and modify the data.

The book “Building a Scalable Data Warehouse with Data Vault 2.0” by Scalefree’s founders Dan Linstedt and Michael Olschimke, as well as the Data Vault 2.0 Boot Camp, shows how to implement such business logic using various Business Vault entities, such as computed satellites.

However, it is worth noting that this is only half the story. The book shows computed satellites (and other entities) with a load date in the primary key of the computed satellite. Such satellites are great for capturing the results of business logic applied to the incoming deltas. However, there are two different types of granularities for business logic in the Business Vault.
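
As a hedged illustration of the first kind of entity, here is a minimal DDL sketch of a computed satellite whose primary key includes the load date; all table and column names are hypothetical:

```sql
-- A computed satellite with the load date in its primary key (illustrative names).
-- Each run of the business logic over the incoming delta appends new rows,
-- so the computed results are historized per load.
CREATE TABLE biz_sat_customer_revenue (
    hub_customer_hk CHAR(32)     NOT NULL,  -- hash key of the parent hub
    load_date       TIMESTAMP    NOT NULL,  -- part of the PK: one result set per load
    record_source   VARCHAR(100) NOT NULL,  -- here: the name of the business rule
    total_revenue   DECIMAL(18,2),          -- result of the applied business logic
    PRIMARY KEY (hub_customer_hk, load_date)
);
```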

Data Warehousing and Why We Need It

Data Warehousing

A data warehouse is a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management’s decisions
Inmon, W. H. (2005). Building the Data Warehouse. Indianapolis, Ind.: Wiley.

Data Warehousing provides the infrastructure needed to run Business Intelligence effectively. Its purpose is to integrate data from different data sources and to provide a historicised database. Through a DWH, consistent and reliable reporting can be ensured. A standardized view of the data prevents interpretation errors, improves data quality, and leads to better decision-making. Furthermore, the historization of data offers additional analysis possibilities and leads to (complete) auditability.

Data Warehousing – Why we need it

This webinar delves into how a data warehouse integrates data from various sources to support business intelligence by providing a centralized, historical database. This integration ensures consistent and reliable reporting, improves data quality, and facilitates better decision-making. Additionally, maintaining historical data enables comprehensive analysis and complete auditability.


Why do we need Data Warehousing?

“Why do we need data warehousing for reporting, we have Excel sheets?!”

Yes, Excel is a great tool… to use and lose control over your data as well as your reports.

You can report directly from a data source, but you are massively limited in functionality and governance. Furthermore, you can only generate reports from one source system and don’t have a delta-driven history of your data. By creating reports directly from the source system and storing them on a local PC, you lose track of which user pulled the data, and at what time, to build the report. Thus, the reports are no longer reliable. To prevent this, data warehousing comes into play.

Let’s imagine our goal is to build a sales revenue dashboard based on a timeline, a customer group, your products, and regions. Without a DWH, you have to collect all data manually from all necessary source systems. This data is most likely a mix of structured, unstructured, and semi-structured data. The challenge then becomes how to prepare and visualise the data, in addition to creating an easily repeatable method of doing so. This is very time-consuming and can be very costly.
By the time all data is collected and prepared, it may already be out of date, forcing you to start again.

With a DWH, all data is collected at one single point. The data is aligned to the business (integrated and subject-oriented) with standardised definitions, e.g. of KPIs, so that every report interprets the data equally. Access to the DWH is read-only (non-volatile); once loaded, the data cannot be changed (auditable). This leads to a complete historization of the data (time-variant). With all data available, the needs of the users can be satisfied (structured data, integrated by business terms). For business users, there is also the option of using Self-Service BI.

What about a Data Lake?

When the “data lake” was introduced a couple of years ago, there was an assumption that it would replace the data warehouse.
A data lake is a great environment when used as a first landing zone for your data in your IT infrastructure, but it does not “integrate” the data as a data warehouse does.
A data lake can be used to process the data further downstream into your data warehouse and an information delivery area. Structure becomes very important at this point so that your data lake doesn’t turn into a data dump and you are always able to query the data you need in an easy way.

To this end, you must create an architectural design based on how you process the data from your data lake into your data warehouse. This could also happen in a completely virtualized way, depending on the amount of data and the performance necessary to process the data towards the end users.

A data lake is also a good place for data scientists to gain access to the data as soon as possible, even if it is in its native format. For end users who work with structured data for reporting, dashboarding, and analysis purposes, a structured, integrated, well-performing, and easy-to-access data warehouse is necessary to fulfil their requirements. They expect the data in a prepared information mart, like a star schema or a flat and wide table.

Conclusion: If you want to use a data lake, think about how you need the data and how you will process it on the way out, so that you can create a suitable structure for it. If you don’t need your data integrated, subject-oriented, and time-variant, then you may be fine with a data lake only. But if you need all these great properties, you definitely need a data warehouse.

How does Data Warehousing work?

It starts with the ETL process (extract, transform, and load), in which the data is extracted from the source system into your technical environment (the DWH infrastructure), called the “Staging Area”. After extracting all data from the source system, you integrate your data into a subject-oriented structure. The result is an Enterprise Data Warehouse (EDW) that provides data and information the way the end user needs it.

There are several modeling techniques available to build a data warehouse. Modeling in 3NF (third normal form), promoted by Bill Inmon, is also known as the top-down approach. Alternatively, Dimensional Modeling by Kimball is more aligned to the business processes (bottom-up approach). Data Vault 2.0, invented by Dan Linstedt, is a hybrid between 3NF and dimensional modeling. At Scalefree, we specialize in Data Vault 2.0 modeling.

Data Warehousing Reference Architecture

Conclusion

When utilizing only this single aspect of an EDW, users miss opportunities to take advantage of their data by limiting the EDW to such basic use cases. A variety of use cases can be realized with the data warehouse, e.g. optimizing and automating operational processes, predicting the future, pushing data back to operational systems as new input, or triggering events outside the data warehouse, to name but a few of the new opportunities available.

Data Warehousing Use cases

Data Vault 2.0 Use Cases

Data warehousing is ideal for centrally storing all internal and external data sources. The standardization of structured, unstructured, and semi-structured data enables faster and more reliable reporting. Historization allows for additional reports, and past reports can be reconstructed at any time. With the flexibility of Data Vault 2.0, organizations can apply new capabilities that go beyond standard reporting and dashboarding.

If you want to learn more about Data Vault 2.0 Use Cases and the latest technologies from the market, we offer a broad range of free knowledge on our blog/newsletter and webinars. Feel free to sign up for regular updates.

– by Tobias Triphan (Scalefree)

Handling Validation of Relationships in Data Vault 2.0


Validation of Relationships in Data Vault 2.0

There are different ways of handling the validation of relationships from source systems, depending on how the data is delivered (full extract or CDC) and how a delete is delivered by the source system (soft delete or hard delete). In Data Vault 2.0, we differentiate data by keys, relationships, and descriptions.

That said, an often underestimated point is the handling and validation of relationships in Data Vault 2.0.
In the following blog article, we explain what to consider and how to deal with it.

 

Deletes in Data Vault 2.0 

First, let us explain the different kinds of deletes in source systems:

  1. Hard delete – A record is hard deleted in the source system and no longer appears in the system.
  2. Soft delete – The deleted record still exists in the source system’s database and is flagged as deleted.

Secondly, let’s explore how we find the data in the staging area:

  1. Full extract – This can be the complete current status of the source system or a delta/incremental extract.
  2. CDC (Change Data Capture) – Only new, updated, or deleted records, loading the data in an incremental/delta way.

To keep the following explanation as simple as possible, our assumption is that we want to mark relationships as deleted as soon as we receive the delete information, even if there is no audit trail from the source system (data aging is another topic).

Delete Detection and Validation of Relationships in Data Vault 2.0

Delete detection for business keys, or Hubs, is straightforward. Soft deletes are handled directly as descriptive attributes in the Satellite, regardless of whether the data arrives via full extract or CDC. For hard deletes in the source system, we have to distinguish between full extract and CDC.
Here we introduce the Effectivity Satellite. In the case of:

  1. Full extract – Perform a lookup against the staging area to check whether the business key still exists. If not, add a record with the delete information (i.e. a flag and a date) into the Effectivity Satellite (see the sketch after this list).
  2. CDC – We receive the delete information directly, which becomes a new entry in the Effectivity Satellite.
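
A minimal SQL sketch of the full-extract case, assuming a hub hub_customer, a staging table stg_customer with precomputed hash keys, and an effectivity satellite sat_customer_eff (all names are illustrative):

```sql
-- Full-extract delete detection for a Hub (illustrative names).
-- Insert a "deleted" record into the Effectivity Satellite for every business
-- key that exists in the Hub but is missing from the current staging load.
INSERT INTO sat_customer_eff (hub_customer_hk, load_date, record_source, is_deleted)
SELECT h.hub_customer_hk,
       CURRENT_TIMESTAMP,
       'DELETE_DETECTION',
       TRUE
FROM hub_customer h
WHERE NOT EXISTS (SELECT 1
                  FROM stg_customer s
                  WHERE s.hub_customer_hk = h.hub_customer_hk)
  AND NOT EXISTS (SELECT 1                -- skip keys already flagged as deleted
                  FROM sat_customer_eff e -- (re-activation handling omitted for brevity)
                  WHERE e.hub_customer_hk = h.hub_customer_hk
                    AND e.is_deleted = TRUE);
```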

Delete detection for relationships needs a bit more attention and is often forgotten. With a full extract, we can follow the same approach as for business keys: simply check whether or not the Link hash key exists in the current staging load and insert a new entry into the Effectivity Satellite accordingly.

But nowadays, CDC is becoming more common. As CDC delivers only deltas, the challenge is to identify relationships that no longer exist. The example below shows a relationship between the business objects customer and company. This is a 1:n relationship:


Image 1: Tables Customer and Company

The Link table in Data Vault looks like this:


Table 1: Customer Link

For better readability and simplification, we present the business keys instead of hash keys and don’t show system fields like the load date timestamp and record source.

So far so good, but what happens when the customer starts working for another company? This results in a new record in the Link. The CDC mechanism will deliver the data as an update of the customer table.


Image 2: Source tables and Link after company change

Where do we get the information that Customer 4711 no longer works for Company 1234, and where is that information stored? We need to soft-delete the old link entry in the data warehouse to make the data consistent again. At the moment, it looks as if the customer works for both companies, since both links are currently active.

There are two possible ways:

  1. You get the “from” and the “to” in your audit trail and identify a difference in the company_id.
    If that is the case, create two new entries in the Effectivity Satellite: one marks the old relationship (“from”) as deleted, the other marks the new one (“to”) as not deleted. It is necessary to insert new relationships as “not deleted” so that you can activate and deactivate hash keys back and forth.
    Think about what happens when customer 4711 works for company 1234 again.
  2. In case you don’t have the “from” and “to”, you either have to load the CDC data into a persistent staging area, where you keep the full history of the data delivered by CDC, or into a source replica, where you create a mirror of the source system by feeding it with the CDC data: perform hard updates when an “update” comes from the CDC and hard deletes when a “delete” comes from the CDC.
    When using the source replica, you can follow the same approach as stated before for full loads: join into the replica and figure out whether the hash key still exists or not.
    The biggest disadvantage here is that you have to scan more data, which means more I/O. When using a persistent staging area, you can detect a change in a relationship by using the window function lead(), partitioning by the technical ID (Customer_ID in this case) and ordering by the load date timestamp.
    As soon as the Link hash key differs, the relationship has changed and the old one no longer exists (see the sketch after this list).
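
A hedged SQL sketch of the persistent-staging variant, with illustrative table and column names and assuming the link hash key has been precomputed in the persistent staging area: lead() compares each delivered link hash key with the next delivery per customer, and a difference yields one “deleted” entry for the old relationship and one “not deleted” entry for the new one.

```sql
-- Detect changed relationships in a persistent staging area (illustrative names).
WITH changes AS (
    SELECT customer_id,
           link_customer_company_hk,
           LEAD(link_customer_company_hk) OVER (
               PARTITION BY customer_id
               ORDER BY load_date) AS next_link_hk,
           LEAD(load_date) OVER (
               PARTITION BY customer_id
               ORDER BY load_date) AS next_load_date
    FROM psa_customer
)
-- The superseded relationship is marked as deleted ...
SELECT link_customer_company_hk AS link_hk,
       next_load_date           AS load_date,
       TRUE                     AS is_deleted
FROM changes
WHERE next_link_hk IS NOT NULL
  AND next_link_hk <> link_customer_company_hk
UNION ALL
-- ... and the new relationship is inserted as "not deleted", so hash keys
-- can be activated and deactivated back and forth.
SELECT next_link_hk,
       next_load_date,
       FALSE
FROM changes
WHERE next_link_hk IS NOT NULL
  AND next_link_hk <> link_customer_company_hk;
```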

The result is the following Effectivity Satellite (logical):


Table 2: Effectivity Satellite on the Link

Conclusion

We covered two major points in this article. The first is that in Data Vault 2.0, we extract relationship information from the source tables, and thus we have to pay more attention to validating those relationships.
The second is that the way you receive the data (delta by CDC or full extract) offers different options for loading it. When you are dealing with a huge amount of data, CDC is definitely the way to go. In addition, with the CDC mechanism you will get all updates from the source, and it becomes easier to load data in (near) real time.

– by Marc Finger (Scalefree)

Salesforce Meets Data Vault


It’s a Match!

Data integration with Salesforce can be tricky and needs a business intelligence system to handle the complexity. Data Vault is capable of decoupling all the necessary business-driven changes, extensions, and customizations to the platform while maintaining the ability to become the cornerstone of an integrated architecture. The decoupling is a part of our Data Vault Boot Camp and is summarized in Figure 1. Scalefree can provide knowledge and implementation assistance in both Data Vault and Salesforce, making it the optimal partner for your Salesforce integration project.


Figure 1. Data Vault Decoupling

Agile Integration of Salesforce

Salesforce is optimized for transactions, not for analytics; in fact, this is one reason we want to integrate it. More likely than not, your Salesforce system is not “just a CRM” anymore. Over the past decade, Salesforce has evolved into a general-purpose business application platform and offers many levels of functionality if one chooses to utilize them.

Salesforce can deliver a Sandbox for your own developers and third parties to develop any application they want. In fact, our customers often go on to create a variety of application add-ons and customizations that are made within Salesforce. This means that your integration will become more complex over time as more elements are added to the “one” source system. For this reason, it fits very well with Data Vault.

We can extend the vault, as technical integration and business needs are decoupled by the very idea of Data Vault. This is where all the standards you used to build your vault come into play and save the day. Now you can leverage the benefits of having an extensible and agile data warehouse.

In our recent webinar Salesforce and Data Vault we discussed some of the change drivers around Salesforce, which are summarized in Figure 2. We also talked about how those change drivers can be defused by using Data Vault.


Figure 2. Salesforce Change Drivers

In addition to the Data Vault methodology, project roles like the domain expert can help with the communication between the source system operations staff and the data warehousing team. For more information on these roles, study Disciplined Agile Delivery (DAD) by Scott Ambler, which is now also part of the Project Management Institute (PMI).

Of course, we have just scratched the surface here, as there are many topics we have not talked about yet. For example, some Salesforce-related challenges can be solved either with an expensive workaround in the data warehouse or with some simple adjustments in Salesforce.

Also to be touched upon at a later time: how to deal with Salesforce limits such as API call quotas or operational reporting limitations.

What topics are you interested in? 

What challenges are you facing right now?

Conclusion

We have seen that the integration of Salesforce can be handled with Data Vault, as both systems fit together quite well. Data Vault adds the agility your data warehouse requires with those changing and complex source systems, providing the highest possible business value to your organization while saving you re-engineering costs in the long run. In this way, you can create your own sustainable Salesforce data pipeline.

– by Markus Lewandowski (Scalefree)

Data Quality in the Data Vault Architecture


Data Quality as the Basis for solid decision-making

When making business decisions, data quality is a critical facet to factor into the decision-making process. Immediate access to the data and certainty about its quality can enhance business performance immensely. But the sad truth is that we see bad data in operational systems due to human-caused errors, such as typos, ignored standards, and duplicates, as well as a lack of input validators in operational systems, such as mandatory fields not being declared and references to other entities (primary-foreign-key constraints) not being defined.


Salesforce and Data Vault 2.0

Watch the Webinar

Salesforce is a sophisticated source system. Its successful integration into a Data Warehouse depends on a general understanding of the system and its typical capabilities and caveats.

In an agile environment like a Data Vault 2.0 project, this task is typically performed by a source system domain expert. But what happens when this role is not sufficiently available in your organization? Get a brief overview of the Salesforce platform from the perspective of an Enterprise Data Warehouse specialist.


Webinar Agenda

1. Intro: Salesforce
2. Agile Team Roles
3. Data Vault Solutions

Meet the Speaker


Markus Lewandowski

Markus Lewandowski is a Consultant at Scalefree with over six years of experience in Salesforce development and administration. As a Certified Data Vault 2.0 Practitioner (CDVP2™), he takes on a hybrid role as a Salesforce specialist and DV 2.0 practitioner. His main competences are Salesforce process automation, application integration, and data management.

Difference Between Data Vault, Inmon and Kimball Approach


Data Vault, Inmon and Kimball

Data Vault 2.0 stands on a robust foundation of four pillars, each shaping its distinct architecture. The Methodology pillar guides the project lifecycle, ensuring standardization. Architecture defines the blueprint, prioritizing scalability. Modeling introduces agile techniques, enhancing adaptability. Implementation brings the design to life, addressing practical considerations.

The Inmon approach to building a data warehouse begins with the corporate data model. This model identifies the key subject areas, and most importantly, the key entities the business operates with. From this model, a detailed logical model is created for each major entity.

The Kimball approach to building the data warehouse starts with identifying the key business processes and the key business questions that the data warehouse needs to answer. The key sources (operational systems) of data for the data warehouse are analyzed and documented.


Advantages for Virtualization in the Data Vault


Virtualization in the Data Vault

In legacy or traditional data warehousing, a common strategy involves materializing data marts, also known as information marts, to enhance performance. However, this approach comes with a notable disadvantage: an increase in storage requirements within traditional data warehousing systems.

Materializing data marts can offer performance benefits, but the trade-off is a higher demand for storage space. This approach has traditionally been employed to optimize query response times and facilitate efficient data access.
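
As a hedged counter-example, a virtualized information mart avoids that storage cost by defining the mart as a view over the Data Vault entities; all names below are hypothetical:

```sql
-- A virtual (non-materialized) dimension built as a view over Raw Vault
-- entities (illustrative names). No extra storage is consumed; the trade-off
-- is that the join and filtering work happens at query time.
CREATE VIEW dim_customer AS
SELECT h.customer_id,
       s.first_name,
       s.last_name,
       s.load_date
FROM hub_customer h
JOIN sat_customer s
  ON s.hub_customer_hk = h.hub_customer_hk
WHERE s.load_date = (SELECT MAX(s2.load_date)   -- current state only
                     FROM sat_customer s2
                     WHERE s2.hub_customer_hk = s.hub_customer_hk);
```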


Accelerate Your Data Vault with Snowflake

Watch the Webinar

In combination, Data Vault and Snowflake constitute flexible and scalable Enterprise Data Warehouse solutions.

Attendees will get insights into building and loading a GDPR-compliant Data Lake (AWS) and Data Vault model. The loading and querying processes scale very well within Snowflake.

The webinar includes a live demo from Snowflake showing data ingestion, variant data types, and data sharing opportunities.


Webinar Agenda

1. Intro
2. Accelerate your Data Vault with Snowflake (Scalefree)
3. Snowflake Demo (Snowflake)

Meet the Speaker


Sven Börjesson

Sven is a Partner at Scalefree with a Master of Business Administration degree from the University of Kiel in Germany. He has extensive competencies in agile coaching and is a Scrum Master, as well as an expert in Data Warehousing and Data Vault 2.0, Business Intelligence solutions, and Enterprise Data Architectures. With his diverse skill set, Sven is a valuable asset for improving agile processes and data management solutions.

Write Backs in the Enterprise Data Warehouse Architecture


The Data Vault 2.0 Layers

This issue covers write backs into the enterprise data warehouse and how the Data Vault 2.0 architecture can facilitate them. Many people already know the three-layer architecture of data warehouses used in Data Vault 2.0. The first layer is the staging area, which holds the raw data from the source systems. The second is the enterprise data warehouse layer, which in this case contains a Data Vault 2.0 model. The third layer holds the Information Marts, which deliver the information in various structures (star schemas, snowflake schemas, etc.).


Figure 1. Data Vault 2.0 Architecture

This architecture provides possibilities and benefits for data write backs. Two possibilities are writing data back into the enterprise data warehouse and into the source systems. This issue covers the write back into the enterprise data warehouse, while an upcoming article will cover the write back into the source systems.


Batch Loading Strategies for Data Vault 2.0


Loading Strategies

In the realm of general data warehousing, various loading strategies come into play. One prevalent challenge is the absence of deleted records within a delta. In typical data warehousing scenarios, it becomes crucial to recognize and track deletions from the source system, often referred to as soft deletes.

The distinction lies in the need to not only capture new or modified data (delta) but also to account for records that have been deleted at the source. Soft deletes involve marking records as deleted rather than physically removing them, allowing for a more nuanced and traceable approach to data management.
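
A minimal sketch of such a soft delete for a full batch load, using illustrative table and column names: any record present in the target but absent from the current batch is flagged rather than removed.

```sql
-- Soft-delete detection for a full batch load (illustrative names).
-- Records missing from the current batch are flagged as deleted, not
-- physically removed, which keeps the deletion traceable.
UPDATE target_customer t
SET    deleted_flag = TRUE,
       deleted_date = CURRENT_TIMESTAMP
WHERE  t.deleted_flag = FALSE
  AND NOT EXISTS (SELECT 1
                  FROM stg_customer_batch s
                  WHERE s.customer_id = t.customer_id);
```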


Data Vault Use Cases Beyond Classical Reporting – Part 3


New Possibilities with Data Vault 2.0

Data Vault 2.0 empowers organizations to go beyond traditional reporting by unlocking new avenues for scalability, automation, and data-driven decision-making. From cleansing data and automating business processes to enabling advanced data science techniques like machine learning and predictive analytics, Data Vault 2.0 provides a flexible framework for modern data challenges. In this article, we explore how Data Vault 2.0 integrates data science to optimize operational processes, predict outcomes, and enhance enterprise data warehouses, ensuring a competitive edge in today’s data-driven landscape.

Going beyond standard reporting

Reporting and dashboarding have become the standards in business when it comes to identifying KPIs and other measurements. As such, Enterprise Data Warehouses have emerged to support the reporting process. However, due to the large quantity and variety of data, demand has developed for methods of utilizing this existing data in a manner that adds additional business value to a company. Data Vault 2.0 offers a wide range of methods to provide decision support beyond standard reporting, as well as critical information regarding the future. To see for yourself, join us as we present different approaches and solutions to fully leverage the potential of your data.


Going beyond standard reporting

As we have shown in previous issues, Data Vault 2.0 enables individuals to implement reporting beyond the traditional methods.
In the first part, we demonstrated how to perform data cleansing in Data Vault 2.0.
And the second use case showed how to implement business process automation using Interface Marts.

The scalability and flexibility of Data Vault 2.0 offer a whole variety of use cases that can be realized, e.g. optimizing and automating operational processes, predicting the future, pushing data back to operational systems as new input, or triggering events outside the data warehouse, to name a few.
