In this newsletter, we’ll discuss an overview of the possible ways to monitor your solution within an EDW built using Data Vault 2.0.
The monitoring approaches below focus on giving an overview of different possibilities to grant insights into the performance of your data warehouse solution. The goal is to gain useful information about the technical data of your EDW and use them to track errors, find bottlenecks inside loading and transformation processes, and overall boost the performance of your data warehouse.
In a webinar session, we dived deeper into this topic of technical testing and monitoring methods for a Data Vault-powered EDW. It covered all EDW layers and provided performance insights, with a focus on modeling metrics vaults and marts and an emphasis on source data.
Watch the recording here for free!
What to expect
You will receive an overview of different metrics and types of metadata to monitor your solution. Going one step further, the modeling approach for capturing and later analyzing the data is covered for the different layers of your EDW, starting with the classical metrics vault and ending with the metrics marts. Additionally, different areas of monitoring are covered, focusing on potential challenges inside your solution. The main focus of this newsletter is how to provide metadata and thereby gain useful insights.
Four instances are covered, each holding useful technical data: the metrics vault and the metrics mart built on top of it, the meta mart, and lastly the error mart. Each of these has its own use, although the modeling and data sources may resemble one another. The similarities and differences between the modeling approaches of these mostly technical tables and their data sources are explained in the sections below.
The metrics vault is used to capture defined metrics, which originate from ETL/ELT data flows. It is not a mandatory component of Data Vault 2.0; however, when implemented, your solution gains several technical capabilities, such as:
- Error inspection
- Root cause analysis
- Performance metrics
In terms of data flows, it focuses on identifying errors, as well as granting insights into the origins of these failures, so that action can be taken to prevent these errors from recurring.
The metrics themselves might originate from different sources, such as the data platform, the ETL/ELT engine, or the job orchestration tool you are using. For this reason, we recommend modeling these data sources using the same patterns and standards you use to produce the Raw Data Vault; this provides a scalable solution to capture and process your metrics.
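As a minimal sketch of this idea, the snippet below models a hypothetical ETL job run as a hub plus a satellite in the metrics vault, using the standard Data Vault hash-key pattern. All names (tables, columns, the `LOAD_CUSTOMER` job, the record source) are illustrative assumptions, not fixed standards.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Standard Data Vault hash key: normalize business keys, concatenate, hash."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Hub entry for a hypothetical ETL job ("LOAD_CUSTOMER") in the metrics vault
hub_job = {
    "job_hash_key": hash_key("LOAD_CUSTOMER"),
    "job_name": "LOAD_CUSTOMER",
    "load_date": datetime.now(timezone.utc),
    "record_source": "ETL.ORCHESTRATOR",  # hypothetical record source
}

# Satellite row holding the metrics of one job execution
sat_job_metrics = {
    "job_hash_key": hub_job["job_hash_key"],
    "load_date": hub_job["load_date"],
    "rows_read": 120_000,
    "rows_written": 119_850,
    "rows_rejected": 150,
    "duration_seconds": 42.7,
}
```

Because the same hub/satellite pattern is used as in the Raw Data Vault, new metric sources can be added without changing the loading logic.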
Inside the metrics vault, the data is modeled and optimized for storage, scalability, and flexibility. To analyze the gathered data, the model needs to fit the end users' requirements. For this purpose, the metrics mart is part of the information delivery layer. Since the metrics vault is built exclusively to provide data for the metrics mart, this approach follows the standards of the Data Vault 2.0 architecture. The metrics mart is a special-purpose information mart sourced mainly from the metrics vault. Marts can be materialized as tables or virtualized as views; the virtualization approach may be preferred, as deployment speed benefits from this decision.
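To make the virtualization option concrete, the helper below renders a metrics mart entity as a plain SQL view over metrics vault tables. The schema, table, and column names are hypothetical and would follow your own naming standards.

```python
# A virtualized metrics mart entity can simply be a view over metrics vault
# tables. This helper renders such a view; all names are illustrative.
def build_metrics_mart_view(view_name: str, hub: str, satellite: str) -> str:
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT h.job_name,\n"
        f"       s.load_date,\n"
        f"       s.rows_read,\n"
        f"       s.rows_rejected,\n"
        f"       s.duration_seconds\n"
        f"FROM {hub} h\n"
        f"JOIN {satellite} s ON s.job_hash_key = h.job_hash_key;"
    )

ddl = build_metrics_mart_view(
    "metrics_mart.job_performance",   # hypothetical mart view
    "metrics_vault.hub_job",          # hypothetical hub
    "metrics_vault.sat_job_metrics",  # hypothetical satellite
)
```

Since the view carries no data of its own, changing the mart's shape is a redeploy of one statement, which is where the deployment-speed benefit comes from.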
As mentioned above, the main data source for the metrics mart is the metrics vault, although it is not limited to it. In some cases, the business may decide to couple business objects with metrics in order to track and improve the performance of specific business objects.
The data model of the metrics mart is not fixed and varies with the needs of the end user. A dimensional model could be the best solution, but it doesn't have to be. In some cases, the data originates from the tools themselves; there, the original structure used by these tools may support analyzing the data more efficiently.
Metadata is stored directly in the meta mart. These marts don't source from the Raw Data Vault, so there is no "meta vault" from which to virtualize them; the meta mart is therefore materialized as tables. The model is similar to other information marts: the business requirements determine the exact model of the meta mart, as its main function is to provide data in a way the business can make the most use of. In some cases, it might be a dimensional model; in others, it could be modeled in third normal form.
The error mart is another information mart. Therefore, it also relies on the business as to how exactly the model is implemented, although in practice most error marts are modeled dimensionally. The goal of the error mart is to catch errors, the so-called "ugly data": data rejected by your transformation tools, for example rows rejected by your soft rules. As soft rules change over time, so do your error marts. These errors don't source from the Raw Data Vault, which is why there is no error vault. The data for the error marts can come from a variety of sources, although most of it originates from the ETL/ELT engine.
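The snippet below sketches how rejected rows could be routed to the error mart. The soft rule (a simplistic e-mail plausibility check), the column names, and the record source are all hypothetical assumptions for illustration.

```python
import re
from datetime import datetime, timezone

# Hypothetical soft rule: customer rows must carry a syntactically plausible
# e-mail address. Violating rows go to the error mart instead of being dropped.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def apply_soft_rule(rows):
    good, errors = [], []
    for row in rows:
        if EMAIL_RE.match(row.get("email", "")):
            good.append(row)
        else:
            errors.append({
                **row,
                "error_reason": "invalid e-mail",       # which rule was violated
                "rejected_at": datetime.now(timezone.utc),
                "record_source": "ETL.SOFT_RULES",      # hypothetical source
            })
    return good, errors

rows = [{"customer_id": 1, "email": "a@example.com"},
        {"customer_id": 2, "email": "not-an-email"}]
good, errors = apply_soft_rule(rows)
```

Keeping the full rejected row plus a reason and timestamp makes later root cause analysis possible without rerunning the load.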
The sections above described how to model the specified information marts and what their main purposes are; this section focuses on the details of "what to load". For this, we take a closer look at the data itself, its origin, and its use cases.
Business metadata is defined as metadata with meaning for the business. While it is not always clearly defined who is responsible for this data, as that depends on the business itself, the data warehouse team is responsible for managing it. The meta mart covers this type of metadata.
Some metadata you should consider tracking includes the following:
- Business column names
As the business may decide to use abbreviations in prefixes or suffixes, these need to be tracked to remain coherent for the business.
- Business definitions
As these definitions are the foundation for analyzing data in later processes. It should be a business description of the different attributes and tables. It is also an indicator for the necessity of the attribute or table. If the business can’t provide information about the column/attribute or table, it should be considered as to whether it is necessary within an information mart or if it needs to be redefined within the business.
- Ontologies and taxonomies
As it provides information about the business object behind the source data, such as the classification of the business object, the relationship to other objects or hierarchies.
- Physical table and column names
As they provide information about the relation between the business object names and the names of the physical tables and attributes. This information is used to associate the data and the business objects.
- Record source
As it describes the different record sources in business terms. Therefore, its goal is not to provide a reference to technical instances such as databases. Furthermore, it aims to describe them in a manner that the business meaning behind those sources is provided.
- Table specification
As they describe the purpose of the source table and data as well as the column names and keys.
- Exception-handling rules
As they provide a list of potential technical issues, such as potential errors or data quality issues and how the ETL / ELT process handles these errors.
- Source system business definitions
As they describe the business meaning of source attributes.
- Business rules
As they generate new data based on business terms. Developers need some understanding of these rules and transformations in order to comprehend the requirements. As business metadata, the business rules are described in a way that the business can understand them. Consider distinguishing between hard rules and soft rules.
While business metadata serves the business the most, the data warehouse team is most likely to benefit from technical metadata, which is also stored in the meta mart. Most technical metadata originates from technical components, as the following examples show:
- Source systems
As it provides a technical description of the source systems, such as database names or flat file locations.
- Data models
These are not typical metadata, although they are a helpful asset. They provide information about relationships between the used tables and could be presented in a graphical form.
- Data definitions
As they describe the data in a technical way. This includes information about the table name, column name, data type, constraints, or default values.
- Business rules
Business rules are also a part of technical metadata. Here, specifically, the technical definitions of these rules are considered, as they need to be implemented in the ETL / ELT tools later on. Hard rules and soft rules should be separated, just as in the business metadata.
- Volumetrics
As they describe the evolution of data loads, table sizes, and growth patterns in order to estimate the workload of the data warehouse.
- Ontologies and taxonomies
Technical metadata also should provide information about ontologies and taxonomies, like abbreviations of terms and attributes, relationships, business key designations, parents, and hierarchies as well as re-defines.
- Data quality
This metadata provides information about data quality metrics, which may include standardization of source data.
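Much of this technical metadata can be harvested from the platform's own catalog. The fragment below sketches a generic `information_schema` query (ANSI SQL, supported by most platforms) that could feed data definitions into the meta mart; the schema names filtered on are hypothetical.

```python
# Query to harvest data definitions (table, column, type, nullability,
# defaults) from the database catalog for loading into the meta mart.
# The schema names in the filter are illustrative assumptions.
DATA_DEFINITIONS_QUERY = """
SELECT table_schema,
       table_name,
       column_name,
       data_type,
       is_nullable,
       column_default
FROM information_schema.columns
WHERE table_schema IN ('raw_vault', 'business_vault')
ORDER BY table_schema, table_name, ordinal_position;
"""
```

Scheduling such a query after each deployment keeps the meta mart's data definitions in sync with the physical model automatically.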
Process Execution Metadata
Process execution metadata is not provided by the business or by source applications, as business metadata and technical metadata are; rather, it is generated by the data warehouse team to provide insights into the ETL/ELT processes. The main goal is to provide information for a better understanding of the data warehouse's performance and thus potentially improve it. Most of the data comes from the ETL/ELT tool you are working with, delivering the following metadata:
- Control flow metadata
As a control flow executes at least one data flow, it provides data about processes and data lineage of all elements of the data warehouse.
- Data flow metadata
As they provide data about the data flows themselves. This includes the data volumes and rejected rows of a data flow / data transformation.
- Package metadata
The package metadata contains a summary of the information about running a package, which usually executes a control flow.
- Process metadata
Most packages are executed by a scheduling application. The process metadata provides data about the process that has started the package.
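The four kinds of process execution metadata above can be captured by wrapping each load step with a small instrumentation layer. The sketch below records data flow metadata (row counts and timing) around an arbitrary load function; the flow name and the toy load logic are illustrative assumptions.

```python
import time

def run_with_metrics(flow_name, load_fn, rows):
    """Execute a load function and capture data flow metadata around it."""
    start = time.perf_counter()
    written, rejected = load_fn(rows)
    return {
        "flow_name": flow_name,
        "rows_in": len(rows),
        "rows_written": written,
        "rows_rejected": rejected,
        "duration_seconds": time.perf_counter() - start,
    }

def dummy_load(rows):
    # Toy load step: reject rows that are missing an 'id' attribute.
    ok = [r for r in rows if "id" in r]
    return len(ok), len(rows) - len(ok)

metrics = run_with_metrics("LOAD_CUSTOMER", dummy_load, [{"id": 1}, {}])
```

The resulting dictionary is exactly the kind of record that would land in a metrics vault satellite, keyed by the flow or package that produced it.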
The process execution metadata should be stored in the metrics vault and kept separated from the business metadata and technical metadata.
In this newsletter, we provided an overview of monitoring options for your Data Vault-powered EDW.
We covered different types of information marts, including their modeling approach, their data sources, and usage for businesses and their data warehouse team.
We learned about different types of metadata, their sources, and the purposes they serve. We divide between business metadata, which describes your data in terms the business can make the most use of; technical metadata, which holds technical information about the processed data; and process execution metadata, which provides information about your data processes and ETL/ELT pipelines.
Finally, this newsletter provides information about how and what metadata you can process to monitor your EDW according to the standards of Data Vault 2.0.
We will go even deeper in our webinar. Make sure to watch the recording for free! We are looking forward to seeing you there.
– Jan Paul Langner (Scalefree)
Get Updates and Support
Please send inquiries and feature requests to [email protected].
To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil has been developed that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.