We previously explored how the EDW can help to improve data quality by implementing data cleansing rules.
This can be applied by write-back operations that affect the source systems directly. Though this was only one example of how to add more value to the EDW.
The scalability and flexibility of Data Vault 2.0 offers a whole variety of use cases that can be realized, e.g. to optimize and automate operational processes, predict the future, push data back to operational systems as a new input or trigger events outside the data warehouse, to name a few.
As to capture the descriptive data, which in this case is the describing factor of the business keys, satellite entities are used in Data Vault. As both business keys and relationships between business keys can be described by user data, satellites may be attached to hub as well as link entities as such:
Consider this situation in which a sample data set, involving an insurance company, concerning customers signing car and home insurance policies as well as filing claims, each respectively. Though before moving forward with the example, it is important to note that there are relationships between the involved business keys, that of the customer number, the policy identifiers, and the claims.
These relationships are captured by Data Vault link entities and just like hubs, they contain a distinct list of records, as such, they contain no duplicates in terms of stored data. Thus, both will form the skeleton of Data Vault and later be described by descriptive user data stored in satellites.
Both systems should be technically integrated, which means if a new customer signs up for a home insurance policy, the customer’s data should be synchronized into the car insurance policy system as well and kept in sync at all times. Thus, when the customer relocates, the new address is updated within both systems.
Though in reality, it often doesn’t go quite as one would expect, as, first of all, both systems are usually not well integrated or simply not integrated at all. Adding to the complexity, in some worst-case scenarios, data is manually copied from one system to the next and updates are not applied to all datasets in a consistent fashion but only to some, leading to inconsistent, contradicting source datasets. The same situation applies often to data sets after mergers and acquisitions are made within an organization.
In contrast to the tables used by relational databases, MongoDB uses a JSON-based document data model. Thus, documents are a more natural way to represent data as a single structure with related data embedded as sub-documents and arrays collapses what is otherwise separated into parent-child tables linked by foreign keys in a relational database. You can model data in any way that your application demands – from rich, hierarchical documents through to flat, table-like structures, simple key-value pairs, text, geospatial data, and the nodes as well as edges used in graph processing.
Another challenge is the data itself, regardless of its structure.
In many, if not most, cases, the source data doesn’t meet the information requirements of the user regarding its content. In many cases, the data needs cleansing and transformation before it can be presented to the user.
Instead of just loading the data into a MongoDB collection and wrangling it until it fits the needs of the end user, the Data Vault 2.0 architecture proposes an approach that allows data as well as business rules, which are used for data cleansing in addition to transformation, to be re-used by many users. To achieve this, it is made up of a multi-layered architecture that contains the following layers:
Typical challenges include issues such as the integration of mainframe data with real-time IoT messages and hierarchical documents.
One of such issues being that enterprise data is not clean and might have contradicting characteristics as well as interpretations. This poses a challenge for many processes such as when integrating customers from multiple source systems.
Though, data cleansing could be considered as a solution to this problem. However, what if different data cleansing rules should be applied to the incoming data set? For example, because the basic assumption for “a single version of the truth” doesn’t exist in most enterprises. While one department might have a clear understanding of how the incoming data should be cleansed, another department, or an external party, might have another understanding.
To all those that have been a part of the Scalefree journey up until this point,
We’d first and foremost like to thank you for all the contributions you have made in helping us build Scalefree into the firm it is today. All of your contributions and business have allowed us to create a success story beyond what was first imagined and for that we offer our gratitude.
That said, a recent development here at Scalefree has presented the company with the opportunity to offer unprecedented, on-site access to the man that helped make all of this possible, the inventor of Data Vault 2.0, Dan Linstedt.
To put it simply, an Enterprise Data Warehouse (EDW) collects data from your company’s internal as well as external data sources, to be used for simple reporting and dashboarding purposes. Often, some analytical transformations are applied to that data as to create the reports and dashboards in a way that is both more useful and valuable. That said, there exist additional valuable use cases which are often missed by organizations when building a data warehouse. The truth being, EDWs can access untapped potential beyond simply reporting statistics of the past. To enable these opportunities, Data Vault brings a high grade of flexibility and scalability to make this possible in an agile manner.
Data Vault Use Cases
A whole variety of use cases can be realized by using the data warehouse, e.g. to optimize and automate operational processes, predict the future, push data back to operational systems as a new input or to trigger events outside the data warehouse, to simply explore but a few new opportunities available.
An initial decision of critical importance within Data Vault development relates to the definition of naming conventions for database objects. As part of the development standardization, these conventions are mandatory as to maintain a well-structured and consistent Data Vault model. It is important to note that proper naming conventions boost usability of the data warehouse, not only for solution developers but also for power users within data exploration.
Throughout this article, we will present the most vital considerations within our standard book, the process of defining naming conventions.
Naming convention documentation
Within Data Vault there are special entities which leverage the query performance on the way out of the Data Vault. These entities are placed between the Data Vault and the Information Delivery Layer and are necessary for instances in which many joins and aggregations on the Raw Data Vault are executed what cause performance issues. This often happens when designing the virtualized fact tables in the information and data marts. Thus, to produce the required granularity in the fact tables without increasing the query time, Bridge tables come into play. Bridge tables belong to the Business Vault and have the purpose of improving performance, similar in manner to the PIT table which was discussed in a prior newsletter.
As a means to achieve its goals, the bridge table materializes the grain shift that is often required within the information delivery process. Though, before we dig deeper into the specifics of using a bridge table for performance tuning, it is important to first define granularities within a data warehouse.
Grain Definitions in Data Warehousing
Back in 2017 we introduced the link structure with an example of a Data Vault model in the banking industry. We showed how the model looks like when a link represents either a relationship or a transaction between two business objects. A link can also connect more than two hubs. Furthermore, there is a special case when a part of the hub references stored in a link can change without describing a different relation. This has a great impact on the link satellites. What is the alternative to the Driving Key implementation in Data Vault 2.0?
The Driving Key
Skilled modeling is important to harness the full potential of Data Vault 2.0. To get the most out of the system due to scalability and performance, it also has to be built on an architecture which is completely insert only. On the way into the Data Vault, all update operations can be eliminated and loading processes simplified.
THE COMMON IMPLEMENTATION
A problem that occurs when querying the data out of the Raw Data Vault happens when there are multiple satellites on a hub or a link:
As the events that unfolded during the various financial crises at the start of the century left governments the world over seeking to impose stricter regulations for the financial sector. Banks within the sector were faced with a new task as they sought to continue operating with expansion in mind while still falling well within defined standards. Read More
Focus on trends: Data Lake and no-sql, dwh architecture, self-service bi, modeling and gdpr
All these topics were already covered in Data Vault 2.0 and most of them moved into a higher focus within the last months. Coming with the trends in the private sector, NoSQL databases are now playing an important role for storing data fast from different source systems. This brings new opportunities to analyze the data, but also new challenges, i.e. how to query fast from those “semi”- and “unstructured” data, e.g. including Massive Parallel Processing (MPP). Furthermore, there is an abundance of tools to store, transport, transform and analyze the data, what often results in time and cost-intensive researching. The knowledge about “Schema on Write” and “Schema on Read” (and their differences) became very important to build a Data “Warehouse”. A Schema has been and is still mandatory for Business Analysts when they have to tie the data to business objects for analytical reasons. Storing your data in NoSQL platforms only (let’s call it a “Data Lake”) is a good approach to capture all your company’s data, but it became much more difficult for Business User to get the data out from those platforms. A good and recommended approach is to have both, a Data Lake AND a Data Warehouse combined in a Hybrid Architecture.
Looking beyond Scrum and learn how to increase the value in Data Vault 2.0 projects
Agile transformation is hard because cultural change is hard. It’s not one problem that needs to be solved, but a series of hundreds of decisions affecting lots of people over a long period of time that affects relationships, processes, and even the state of mind of those working within the change.
There are two fundamental visions about what it means to scale agile: Tailoring agile strategies to address the scaling challenges – such as geographic distribution, regulatory compliance, and large team size – faced by development teams and adopting agility across your organization. Both visions are important, but if you can’t successfully perform the former then there is little hope that you’ll be successful at the latter.