Skip to main content

“Big Data”, “Data Lake”, “Data Swamp”, “Hybrid Architecture”, “NoSQL”, “Hadoop” … terms you are confronted with very often these days when you are dealing with data. Furthermore, the question comes up if you really need a data warehouse nowadays when you deal with a high variety and volume of data. We want to talk about what a data lake is, if we need a data warehouse when using NoSQL platforms like Hadoop, and how it is combined with Data Vault.

WHAT IS A DATA LAKE?

There is a proper definition from Tamara Dull (SAS): “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.” 1

The last sentence, that the data structure and requirements are not defined until the data is needed, says that the structure is mapped when the data is queried from the data repository (data lake), known as “schema on read”. The difference to traditional RDBMS (schema on write) is, that you don’t pre-define the structure of the data (files) itself when loading the data into the NoSQL database. This does not say, that structure is not necessary – on the contrary, the structure is very important and required to utilize a data lake. This is not defined directly in the data lake (NoSQL database), but when the location of the files is attached to a schema, executed when the data is read. If no structure is defined and you are using a NoSQL database like Hadoop – which is just a file storage – as a “landing zone”, it will become a data dump. Then you will come into the situation, that your data scientists or power users have to do a high effort when going to the dump and try to find things of value in it.

For example, with tools like Hive you can declare your query schema, and this by using MPP (Massive Parallel Processing) on a HDFS.

DO WE STILL NEED A DATA WAREHOUSE?

When you have a data lake with all your data, you might ask yourself the question if you need a data warehouse … or do you already have a data warehouse when having a data lake?

First, we have to compare the terms data lake and data warehouse. A data warehouse stores its data subject-oriented, time variant and integrated by business key. A data lake is not subject oriented when ingested, not integrated at all and cannot handle CDC (Change Data Capture) or deltas, because you can not update the content of files (you just add or replace an existing file). From the technology and capability perspective, there are also differences/changes  by using a traditional RDBMS (for data warehouses) and New SQL / NoSQL platforms (for data lakes).

[table id=1 /]

Figure 1: Traditional RDBMS vs. New SQL / NoSQL 2

This short comparison makes clear, that a data lake is not a data warehouse. But, NoSQL and RDBMS can live happily together and should be designed to co-exist. When using a data lake only, some lacks are coming up:

[table id=2 /]

Figure 2: Lacks when using a data lake only

Back to the question, if we need a data warehouse. The short answer is, it depends: If you don’t have an existing data warehouse and all you do is data science work or investigating your data, you probably don’t want or need a data warehouse. But, if you need structure, analysis, multi-integrated systems (integrated by consistent business keys), tie your data back to the business processes by business key, then you definitely need a data warehouse. In summary: when you want to extract value out of the data lake (the data in the data lake), a data warehouse is a mature concept.

When a data warehouse already exists, a best practice is to augment your existing data warehouse (RDBMS) by integrate it in the data lake or probably use a hybrid architecture by replacing the relational staging area by a HDFS based staging area which captures all unstructured and structured data.

1 https://www.kdnuggets.com/2015/09/data-lake-vs-data-warehouse-key-differences.html
adopted from https://www.smartdatacollective.com/big-data-cheat-sheet-what-executives-want-know/

 

How to Get Updates and Support

Please send inquiries and feature requests to [email protected]

Managed Self-Service BI is part of the Data Vault 2.0 Boot Camps. For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.

Leave a Reply

X
X