A data warehouse is a subject oriented, nonvolatile, integrated, time variant collection of data to support management’s decisions
- Inmon, W. H. (2005). Building the Data Warehouse. Indianapolis, Ind.: Wiley.
It provides the technical infrastructure needed to run Business Intelligence effectively. Its purpose is to integrate data from different data sources and to provide a historicised database. Through a DWH, consistent and reliable reporting can be ensured. A standardised view of the data can prevent interpretation errors, improved data quality and leads to better decision-making. Furthermore, the historization of data offers additional analysis possibilities and leads to (complete) auditability.
“Why do we need a data warehouse for reporting, we have excel sheets?!”
Yes, excel is a great tool… to use and lose control over your data as well as your reports.
You can report directly from a data source but you are massively limited in functionality and governance. Furthermore, you can only generate reports from one source system and don’t have a delta-driven history of your data. By creating reports directly from the source system and storing them on a local pc you lose track of which user pulled the data, as well as at what time, to build the report. Thus, the reports are no longer reliable. To prevent this, a data warehouse comes into play.
Why do we need it?
Let’s imagine our goal is to build a sales revenue dashboard based on a timeline, a customer group, your products and regions. Without a DWH you have to collect all data manually from all necessary source systems. This data is most likely a mix of structured, unstructured and semi structured data. The challenge then becomes how to prepare and visualise the data in addition to creating an easily repeatable method of doing so. This is very time-consuming and can be very costly.
By the time all data is collected and prepared, the data may already be out of date causing the need to start again.
With a DWH, all data is collected at one single point. The Data is aligned to the business (integrated & subject-oriented) with standardised definitions e.g. of KPI’s so that every report interprets the data equally. The access to the DWH is read only (non-volatile), once loaded you can’t change the data (auditable). This leads to a complete historization of the data (time variant). With all data available, the needs of the users can be satisfied (structured data, integrated by business terms). For business users, there is also the option of using Self-Service BI.
What about a Data Lake?
As the “data lake” was introduced a couple of years ago, there was the assumption that it would replace a data warehouse.
A data lake is a great environment when used as a first landing zone for your data in your IT infrastructure but it does not “integrate” the data as a data warehouse does.
A data lake can be used to process the data further downstream into your data warehouse and an information delivery area. Structure becomes very important at this point so that your data lake doesn’t turn into a data dump and you are always able to query the data you need in an easy way.
To this end, you must create an architectural design dependent upon how you process the data from your data lake into your data warehouse. This could also happen in a completely virtualized way, depending on the amount of data as well as respectively the performance necessary to process the data towards the point of the end-users.
A data lake is also a good place for data scientists to gain access to the data as soon as possible, even if it is the native format. For end users who are working with structural data for reporting, dashboarding and analysis purposes, a structured, integrated, well-performing and easy-to-access data warehouse is necessary to fulfil their requirements. They expect the data in a prepared information mart, like a star schema or a flat and wide table.
Conclusion: If you want to use a data lake, think about how you need and process the data on the way out so that you can create a suitable structure for it. If you don’t need your data integrated, subject-oriented and time-variant, then you may be fine with a data lake only. But if you need all these great properties, you definitely need a data warehouse.
How does a Data Warehouse work?
It starts with the ETL process (extract, transform, and load) in which the data is extracted from the source system into your technical environment / (DWH infrastructure) called the “Staging Area”. After extracting all data from the source system, you integrate your data into a subject oriented structure. The result is an Enterprise Data Warehouse (EDW) which provides data and information about how the end user needs it.
There are several modeling techniques available to build a data warehouse. 3NF (third normal form) was invented by Bill Inmon and is also known as the top-down approach. Alternatively, Dimensional Modeling by Kimball is more aligned to the business processes (bottom-up approach). Data Vault is a hybrid between 3NF and dimensional modeling invented by Dan Linstedt. At Scalefree, we specialize in Data Vault.
Data Warehouse Reference Architecture
There is more!
When only utilizing this single aspect of an EDW, users are missing opportunities to take advantage of their data by limiting the EDW to such basic use cases. A variety of use cases can be realized by using the data warehouse, e.g. to optimize and automate operational processes, predict the future, push data back to operational systems as a new input or to trigger events outside the data warehouse, to simply explore but a few new opportunities available. Read more
Data Vault Use Cases
A data warehouse is ideal for centrally storing all internal and external data sources. The standardization of structured, unstructured and semi-structured data enables faster and more reliable reporting. Historization allows additional reports and past reports can be reconstructed at any time. With the flexibility of Data Vault 2.0, organizations can apply new capabilities, which go beyond just standard reporting and dashboarding.
If you want to learn more about Data Vault Use Cases and the latest technologies from the market, we offer a broad range of free knowledge on our blog/newsletter and webinars. Feel free to sign up for regular updates
– by Tobias Triphan (Scalefree)
Get Updates and Support
Please send inquiries and feature requests to [email protected].
To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.
New Data Vault insights every month