Skip to main content
Category

Scalefree Newsletter

Data Vault 2.0 with dbt – Part 2

By Scalefree Newsletter No Comments

In the first part of this blog series we introduced you to dbt. Now let’s take a look at how you can implement Data Vault 2.0 with dbt and what advantages this offers. If you don’t know the first part yet, you can read it here

dbt provides the ability to create models and generate and execute SQL dynamically out of these models. This allows you to write your data transformations in models using SQL and reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way. However, the most important part for the Data Vault use case is the ability to define and use those macros.

But first we should clarify how models work in dbt.

Dbt handles the compilation and execution of models written using SQL and the Jinja macro language. Each model consists of exactly one SQL SELECT statement. Jinja code is translated into SQL during compilation.

This figure shows a simple model in dbt. A big advantage of Jinja is the possibility to generate sql programmatically, for example with loops and conditions. Also, through the ref() functions, dbt understands the relationships between the models and builds a dependency graph. This ensures that the models are built in the correct order and that the data lineage is documented.

A lineage graph may look like this:

The materialization of models can be defined at different configuration levels. This allows fast prototyping with views and switching to materialized tables if necessary for performance reasons.

Data Vault 2.0 and macros

But how can Data Vault 2.0 be implemented in dbt? The most important part for using Data Vault 2.0 is the ability to define and use macros. Macros can be called in models and then generate into that macro additional SQL or even the entire SQL code.

For example, you could create a macro to generate a hub that gets the source/staging model as an input parameter, as well as the specification of the columns for the business key(s), load date and record source. The sql code for the hub is then generated dynamically from this. The advantage of this is that a change to the macro would directly affect each individual hub, which greatly improves maintainability.

At this point, you also gain huge benefits from the active open-source community around dbt. There are many open-source packages with which dbt can be extended.

There are also already some packages that are perfect for using data vault 2.0. 

We at Scalefree are currently developing an open-source dbt package that provides macros to translate a Data Vault model “on paper”, to actual tables and views like Hubs, Links, Satellites and more. Subscribe to our Newsletter to get notified as soon as it’s available.

The only thing you need in your model, for example for a hub, is just one macro call:

{%-

hub(src_pk, src_nk, src_ldts, src_source, source_model) 

-%}

With the parameters of the macro call you define the source table where the columns can be found (source_model) and the column names for the hash-key (src_pk), business key(s) (src_nk), load date (src_ldts) and the record source (src_source) column. When the model and the macro(s) defined in the model are executed, the SQL gets compiled and processed on the database system.

The metadata needed can for example be defined in variables with jinja directly in the model:

What you also can see is that dbt provides different options for the materialization. The incremental materialization for example will load an entity as a table on an incremental basis.

When the model is executed, dbt generates the whole sql out of the macro and decides how the records are loaded. If the hub table does not exist, yet it is created and all records are loaded, if it already exists the table is loaded incrementally. 

For people who tried or managed to implement a Data Vault with “vanilla” SQL, you will realise that this is a real game-changer. The team can now focus entirely on the Data Vault design itself. Once the metadata is identified, dbt along with your macros can take care of the entire logic.

Openly available packages can add basic Data Vault 2.0 principles to dbt and therefore allow users to quick-dive into Data Vault implementation. Dbt’s general openness allows you to adjust all macros for example to your company or project specific Data Vault flavor to meet your technical and business requirements. 

However it is important to mention, that existing dbt packages for Data Vault 2.0 does not completely fulfill Data Vault 2.0 standards and deviates from them in details. Currently, we at Scalefree are working on an open-source dbt package that provides all important Data Vault 2.0 entities, as well as the latest standards and best practices and will be released soon. But these details are worth another blog post!

 

-by Ole Bause (Scalefree)

 

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.

Scalefree

What’s new in Data Vault?

By Scalefree Newsletter No Comments

Whats new in Data Vault?

Attendees of our Data Vault 2.0 trainings often ask us what the difference between the book “Building a Scalable Data Warehouse with Data Vault 2.0” and our current consulting practice is. To give a definitive answer, at least for now, we have written this article.

Data Vault 3.0?

First of all, there is no Data Vault 3.0. We like to refer to the version of Data Vault that we teach and apply in our consulting practice, as “Data Vault 2.0.1”. Therefore, there are only slight differences to the version we used in the book. Some of these changes look big, but they don’t really modify the underlying concept and that is what matters regarding version changes. So, based on the concept, we only see minor changes or enhancements. However, they might have some larger impact on the actual implementation, but that is only due to technology changes.

 

How to Fit Data Vault 2.0 Between Two Covers

On top of the minor changes to the concept, there is also another factor at play here: when writing the book, we had to scope the book to make it fit between the covers as the publisher had a page limit (which we actually exceeded a bit). Therefore, we did not apply all Data Vault 2.0 concepts in the book: for example, real-time concepts are not covered by the book, and we did not use a data lake or cloud computing in our examples. Instead we only briefly covered these concepts for completeness but focused on on-premise technology which was more used in actual projects back then. In 2012, cloud computing was already available and widely used, but it was easier to tailor the book to more readers by the use of on-premise technologies.

 

With that in mind, what has changed since the book came out? 

 

Removal of Load End Date

 

The most obvious change is the removal of the Load End Date. Well, in our actual projects, we don’t completely remove it from the Data Vault 2.0 model, we just virtualize it by removing the Load End Date from the underlying satellite table and virtually calculating it in a view on top of the satellite table using a window function (typically LEAD, but LAG is also possible). This way we can get rid of the update procedure to maintain the Load End Date (the so-called end-dating process) while preserving the query patterns downstream. Keep in mind that it might also be more efficient to use the window function when loading PIT tables or satellites, and therefore the query layer is actually only for power users when they query the Data Vault model directly. 

 

That’s actually not an update: even in 2012 (and before) we used these approaches but they didn’t work on SQL Server 2014, which was used in the book as the analytical window function required for this approach is ways too slow. However, it improved in 2016. To get rid of the Load End Date in such scenarios where the window function is too slow or just don’t exist, a new solution has emerged: the use of a snapshot partition in the PIT table. The end of all times is used in the PIT table to refer to the latest delta in each satellite for the delta check. Once the PIT table is available, it can also be used to produce SCD Type 1 dimensions (without history) and therefore the need for the Load End Date (or a fast replacement as described above) only exists in loading.

 

Hybrid Architecture

 

The next obvious change is the use of a hybrid architecture where a data lake is used for staging purposes. In the book, we focused on the on-premise Microsoft SQL Server stack that didn’t include a distributed file system. We already recommended to clients to use a data lake for staging purposes, in an architecture we called the “hybrid architecture.” Back then, only a few followed the advice, but today most clients use the hybrid architecture for their data warehouse. We actually consider a relational staging area an anti-pattern and do not recommend it anymore to clients (with some exceptions). 

 

Multi Temporal Data Vault 2.0

 

The book describes a temporal PIT (TPIT) for building multi-temporal solutions. While the pattern is still valid and relatively flexible compared to other solutions, today we typically move the business timelines into the dimensional model for highest flexibility. This is discussed in more detail in the Multi-Temporal Data Vault 2.0 class. TPITs are less flexible but have higher performance.

 

Snapshot Date

Satellites in the Business Vault might use a Snapshot Date Timestamp instead of a Load Date Timestamp. The idea is that, as a general practice, a Business Vault entity should always reuse an existing granularity. It just becomes much easier to develop Business Vault solutions. With the Snapshot date in the primary key of the satellite, this becomes much easier for business rules on the outgoing information granularitiy. This is discussed in more detail in the Data Vault 2.0 Information Delivery class

 

A Confession

 

And finally, we also made some mistakes. We are willing (not happy) to admit it in our errata page for the book – this link is directly from our internal Wiki and we promise to keep it updated (however, we didn’t receive additional reports lately).

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.

Scalefree

Data Vault 2.0 with dbt – Part 1

By Scalefree Newsletter No Comments

Data is an important asset in the decision making process. As we have previously discussed in another post, Data Vault 2.0 is the right choice when the goal of an enterprise data warehouse is to have fully historized and integrated data. Additionally, it is also better suited to instances in which data from many source systems needs to be combined. You can find the previous blog post here.

While Data Vault 2.0 focuses on the “what”, there are many options for the “how” of technically translating a Data Vault model into physical tables and views in the enterprise data warehouse, as well as for orchestrating and loading/processing the procedures. And this is where dbt comes in.

ABOUT DBT

The data build tool (dbt) transforms your data directly in your data warehouse. For reference, dbt is the “T” in ELT. Therefore, dbt assumes that data is already loaded into a database from which the current database can query from. In contrast, ETL extracts data then transforms it and before loading it into the target. With ELT, the data is not transformed as it moves into the data warehouse. (with transform we mean transformation by soft business rules which would change the meaning of the data. Of course we have to make sure that the data fits into the target table (data type adjustments etc.). Here we are talking about “hard rules”.)
Dbt is particularly compatible and useful in cloud DWH solutions such as Snowflake, Azure Synapse Analytics, BigQuery and Redshift and performs transformations and modeling directly on the database in order to take advantage of the performance of these highly scalable platforms.

HOW DBT WORKS

Models and SQL statements can be easily created, tested and managed in dbt itself. A powerful combination of the scripting language, Jinja2, and the all-time classic SQL allows users to build models. The simple interface enables data analysts without engineering know-how to initiate appropriate transformations. Data team workflows become more efficient and cost-effective because of this. Behind dbt sits an open source community that is constantly and passionately developing the tool. As such, dbt is available both as a free, reduced core version and as a comprehensive and flexible cloud version.

What is the place of dbt in Data Vault 2.0?

Part of the Data Vault 2.0 methodology is the model which focuses on how to design the core data warehouse with a scalable solution. The core Data Vault entities are hubs, links and satellites. That said, dbt provides the ability to generate Data Vault models and also allows you to write your data transformations using SQL and code-reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way.

Application Areas of dbt in an Enterprise BI Solution

dbt does not reinvent the world, but when it comes to building a new EDW – especially if it is in the clouds – dbt provides a very helpful basic framework with many important functions for Continuous Integration and Deployment already defined. dbt brings the standards of software development into the world of data transformation. This allows developers to concentrate on the core tasks of data modeling and business logic. Especially, but not only for smaller projects, dbt offers a lightweight and extremely affordable alternative to other data warehouse automation solutions.

– by Ole Bause (Scalefree)

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.

Scalefree

Agile development in data warehousing with Data Vault 2.0

By Scalefree Newsletter No Comments

Initial Situation:

A common issue in data warehousing projects is a scope is often missing and many of the processes such as controlled access, GDPR handling, auditability, documentation and infrastructure are not optimized. Additionally, data warehouse projects that have a scope often begin without a real focus on business value. This is mostly due to the fact that the use cases are not clearly communicated and the data architects do not know where to start. The consequence of this means  no business value can be delivered.

Data Vault 2.0 Methodology

It is often assumed that Data Vault 2.0 is only a modeling technique, but this is not correct. Data Vault 2.0 includes the modeling technique, a reference architecture and the methodology. The methodology introduces project management tools such as CMMI, Six Sigma and Scrum to solve the problems described. While CMMI and Six Sigma deal with general project management issues, Scrum is mostly used specifically in the development team and provides the framework for a continuously improving development process.  The use of agile development in Data Vault 2.0 projects will be described in more detail below.

The Scope of a Sprint

The first step in setting up a data warehouse project in an agile way is defining the objective of the project with just one or two pages. Unlike waterfall projects, the goal is to produce working pieces of usable outputs, could be reports or dashboards, in continuous iterations, otherwise called sprints. This means that we don’t need to plan the entire project in detail but instead can build around a general idea or goal for the final data warehouse before then focusing on planning the first sprints. In order to address the aforementioned problems, the focus of the sprints needs to be centered around business value. For this reason, it is important to receive constant feedback from the business users for a continuous improvement process.

Define the project

Both the scope of a sprint and the architecture follow a business value driven approach built vertically and not horizontally. This means they are not built layer by layer but instead feature by feature. A common approach for this is the Tracer Bullet approach. Based on business value, which is defined by a report, a dashboard or an information mart, the source data will be identified and modeled through all layers and loaded. 

As shown in Figure 1, the entire staging area layer is not initially built but rather a small part of the respective layer is built based on data in the scope, in this case the SalesReport.

Agile Development

Before new functionality can be implemented in a sprint, it needs to be defined.
This task lies with the product owner as they are the ones to write and prioritize user stories.
As already explained, the goal of a sprint is to produce working pieces of usable outputs called features.
In addition, there are tech topics that need to be considered. There are various methods to support Sprint Planning, such as planning poker or Function Point Analysis, which are discussed in more detail in another article.

Another good indicator is to evaluate the sprint itself while the sprint is still ongoing. If the development team does not manage to implement a feature in a sprint, this can often be seen as a good indicator that the scope is too large. 

To avoid this, all work packages that are not relevant for the feature should be removed. Though, what is often the case these work packages are not completely removed out of fear from the business user. 

To address this fear it is important to educate the business user that they will be delivered but only in a later sprint and temporarily moved into the backlog.

Due to the flexible and scalable Data Vault model, these layers can be extended with the next feature with little to no re-engineering. This is possible due to the fact Data Vault consists of a Raw Data Vault and a Business Vault model which means it contains the logical architecture as well as the data modeling perspective. The Raw Data Vault is modeled in a data-driven way by integrating the data by business keys. Only hard business rules like data type conversions or hash key calculations are applied. All other soft business rules are only applied in the Business Vault. 

Here, we turn data into information. For this reason, the Raw Data Vault requires less refactoring and can be extended limitlessly.

Review

Another important success factor for agile projects is proper review and improvement. Even before the next sprint starts, two meetings must be held by the team:

  • The sprint review meeting: This meeting is about reviewing the delivered features. Usually the development team, the product owner, the Scrum Master and the end-users participate in this meeting.
  • Retrospective meeting: This meeting usually takes place directly after the review meeting and focuses on identifying activities that need to be improved.
  • Backlog refinement for prioritizing the user stories and to make sure that the team understands what to do
  • Sprint planning to plan which user stories fit into the next sprint based on estimating the effort.

It is important that these meetings are held so that toe source errors can be found. In this way, the outcome of a project can be improved and the development processes optimized in an iterative way.

Conclusion

Data Vault 2.0 is not only a scalable and flexible modeling technique, but a complete methodology to accomplish enterprise vision in Data Warehousing and Information Delivery by following an agile approach and focusing on business value. By using agile methods in data warehousing, the focus in projects can be on the business value and delivering useful products to the customer.

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.

Scalefree

Multi-Temporality in Data Vault 2.0 Part 2 – A Practical Example

By Scalefree Newsletter No Comments

Multi-Temporal Example from the Travel Industry

In one of our previous blog posts, we gave an insight into the basics of multi-temporality in Data Vault 2.0. This article will give you a short practical example of the topic.

When you book your holidays, you might change something after the initial booking: the duration, the flight departure location, the hotel, the start and/or end date of your journey or might just book some extras. From an operational perspective, these changes are not a problem to deal with, and the expectations are not high.

The data set for your journey will be updated in the source systems’ database, and only the most recent data is shown in the graphical user interface. The employee might see some logs in the history part, if they exist.

That’s usually enough to keep the operational part working. In the business intelligence world, the way to deal with data is a bit different. There, we want to see everything. Each time a record in the source system is updated, we catch all the changes and store them in a separate database.

This means that we introduce a new timeline which provides us new opportunities but also comes with new challenges. Let’s have a look into some example data sets of a source system over time:

Day 1: Initial booking of a holiday trip

Day 2: Correction of the start and end date and a small price adjustment

Day 2: Correction of the start and end date and a small price adjustment

When you have a look into the source system today, you will see that booking B4711 is cancelled, or maybe it is completely gone, and B4712 is the active one (ID = 2, created on day 3). From an analytical perspective you might have some more questions you want to have answered.
For example:

  • What was the estimated revenue for July, looking from February?
  • What was the estimated revenue for July, looking from March?
  • Which dates do I need to use to answer the two questions above?
  • In which month do we have the most bookings for the next summer season?
  • Can we reproduce a report we created some months ago?
  • What if some business dates change? What affects my reports in the past?
  • Where can we prepare the data to answer all these questions? In the data warehouse? In the front-end BI tool? Or somewhere else?

In our webinar “Multi-Temporality in Data Vault 2.0” we will provide you answers to all these questions by using Data Vault entities.

If you are interested to see some more use cases related to multi-temporality: we offer a 2-day class especially for this topic: “Multi-Temporal Data Vault 2.0 Class”

by Marc Finger (Scalefree)

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.

Scalefree

Why should I use Data Vault (2.0)?

By Scalefree Newsletter No Comments

As a database developer, administrator or user, you probably have already heard of Data Vault. But with an existing database solution in mind, you might come to the question of why you should use Data Vault in your enterprise data warehouse. This blog article discusses some features where Data Vault can shine and why you should use it.

Reason 1: Supporting multiple source systems and frequently changing relationships.

Data Vault 2.0 provides the most benefits when your data comes from many source systems or has constantly changing relationships. 

Data Vault 2.0 works well for multi-source systems or those that have constantly changing relationships.

The reason why it works well for these systems is its ability to make adding attributes simple. If there is a change to one source system, that change does not need to show up within all source systems. Similarly, you can limit the number of places changes are made, as attributes are stored separately from structural data in satellites.

Additionally, it is easier to account for new and changing relationships by closing off one link and creating another. You don’t have to change the historical data to account for a new relationship or update an existing schema. You only need to account for the changes going forward. This brings enormous flexibility and scalability into your enterprise data warehouse.

Reason 2: Easily track and audit your data.

In a modern data environment, the data runs through various different stops. To still provide continuous data quality, it must always be clear where data has come from. 

Figure 1: Data Lineage

Given the need to be able to easily track and audit data, Data Vault has made it easier to do so. It does so by inherently enabling auditing, as load times and record sources are required for every row. It also tracks all historical changes as satellites, including the load time as part of the primary key.
When an attribute is updated, a new record is created.
All of this auditing enables you to easily provide auditability for both regulatory and data governance purposes. And because you store all of your history, you can access data from any point in time.

Reason 3: Load data from multiple source systems more quickly.

Data Vault enables quicker data loading simply because a number of tables can be loaded at the same time in parallel. The model decreases dependencies between tables during the load process and simplifies the ingestion process by leveraging inserts only, which load quicker than upserts or merges. This also leads to less complexity.

Conclusion:

Data Vault 2.0 is designed to integrate data from several source systems and may over perform in some situations.

In summary, if you have a small to medium-sized analytics requirement, with a small team of architects, designers, and engineers delivering a solution with data sourced from a few systems, then Data Vault may be inappropriate for your need.

If, however, you have a large project with many source systems leading to an enormous data integration challenge, then Data Vault can potentially add massive value to the project.

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.

Scalefree

Multi-Temporality in Data Vault 2.0 – Part 1

By Scalefree Newsletter No Comments

Part 1 – Theoretical understanding of multiple timelines in a data warehouse.

What is “Multi-Temporality” in a Data Warehouse?

Before we start talking about multi-temporality, let’s define the term bi-temporality first, given, a common misconception is that Data Vault 2.0 is only bi-temporal (which is wrong):

“Bitemporal Modeling is a specific case of Temporal database information modeling technique designed to handle historical data along two different timelines. This makes it possible to rewind the information to “as it actually was” in combination with “as it was recorded” at some point in time.” (According to: https://en.wikipedia.org/wiki/Bitemporal_Modeling)

Bi-temporality just addresses two timelines, which are commonly referred to as “System Time” (the technical timeline) and “Valid Time” (the business timeline). Data Vault Satellites, Point-in-Time tables (PIT) and Bridge tables are able to address multiple active timelines in the same record. Let’s categorise just some of them: Read More

Open Source Production Grade Data Integration – Part 2

By Scalefree Newsletter No Comments

Meltano in action

In our last overview, we talked about Meltano and its architecture. Now, we would like to illustrate the ease in which you can use Meltano to create a data integration pipeline.
Before we start, please ensure that you have already installed Meltano on your machine. If you haven’t yet, you can follow Meltano’s official installation guide.

First we will initialize a Meltano project.
Initialize a new project in a directory of your choice by using  “meltano init”.

Read More

Open Source Production Grade Data Integration – Part 1

By Scalefree Newsletter No Comments
In our past blog post, we introduced an open source framework for ELT processes called Singer. This framework can be wrapped up using another open source tool which adds more interesting features to Singer including installation, setup of environments, monitoring, scheduling and orchestration. At Scalefree, we moved all of our ELT pipelines into this framework on AWS and are pleased with the results.

 

Please note, there are a large number of platforms for managing data integration but there is a lack of robust and easy-to-use, free open source solutions. The Meltano project aims to provide a solution to that situation. Meltano is a full-package data integration platform that challenges the most established players in the data space. Meltano is built on top of the best open source tools for data integration and infuses them with DataOps best practices.

Meltano is the easiest way to build, run and orchestrate ELT pipelines made-up of Singer taps, targets and dbt models. It is open source, self-hosted and version controlled as well as containerized.

Meltano’s open source model lets you easily adapt it to your own needs and reduces cost. Read More

Running modern ETL-Processes with Framework-Based Tools – Part 2

By Scalefree Newsletter 2 Comments

In the last blog post, we introduced Singer, the open-source framework, as a powerful tool for ETL processes. This time, we’d like to discuss how you can implement the framework in your own projects.

How to start working with Singer

Starting a test run is rather simple. First, you need to create a python environment,  for which step-by-step instructions to do so are available online. 

As soon as you’ve done that, it’s time to create your first virtual environment inside python.
Please note before beginning that it’s a best practice to create and use an individual virtual environment for every tap and target. This avoids any conflicts between module requirements for the different modules. 

The next step is to install the tap and target you’ve chosen into their corresponding virtual environment. This installation can be performed very easily using a pip install command. This example command installs the tap-salesforce to the load data from your Salesforce account:
Read More

Running modern ETL-Processes with Framework-Based Tools – Part 1

By Scalefree Newsletter No Comments

A big part of every Enterprise Datawarehouse are ETL- or ELT-processes.
In both abbreviations, the letters stand for the same words, only the order in which each process is done changes.
To brush-up on those processes, “E” stands for extraction, “T” for transformation and “L” is for loading.

That said, rather than dive into the benefits of each,  we would like to present a powerful open-source framework to execute the processes instead.

Why use a framework?

Rather than developing individual solutions per source system, using standardized frameworks provides a wide variety of benefits. The main of which we have already mentioned, standardization.
Another benefit, using the same concept for extracting data from different source systems allows your system to become more auditable and reliable.
And when taking into consideration the varied benefits between frameworks, other potential upsides become available as well. Read More

Implementing Data Vault 2.0 Zero Keys

By Scalefree Newsletter 2 Comments
In a previous blog post, we discussed how to implement ghost records within a Data Vault 2.0 solution. This time around, we’d like to talk about “the other” concept, namely zero keys, which oftentimes are referenced interchangeably with ghost records.

 

As discussed in the previous part of this series, a ghost record is a dummy record in satellite entities containing default values. Simply put, zero keys are the entry in each hub and link entity that is a counterpart to the satellite’s ghost record containing its hash key. In this manner, the term “zero key” is oftentimes used to describe the ghost record’s hash key, which might show up in other Data Vault entities such as in Point-in-Time (PIT) tables or links. Accompanying the zero hash key is, similar to a ghost record, a default value for the business key . Or, in the case of a composite business key, multiple default values for each of its components.

Read More

Using Multi-Active Satellites the Correct Way (2/2)

By Scalefree Newsletter No Comments
In our first post about multi-active satellites, we briefly explained different implementations that can be used to solve multi-activity. Now, we’re going to go into more detail regarding the advantages and disadvantages of these approaches having delta checks on or off.

 

Short summary of Multi-Active Satellites

Multi-active satellites allow you to implement multi-active records per business key in Data Vault 2.0. To illustrate the need for the solution, let’s look at the common occurrence of a source system that doesn’t provide the needed metadata such as when working with XML-files.
One solution to the above is to create a multi-active satellite by adding a subsequence number per business key. This accounts for any instance in which there is no multi-active attribute delivered by the source itself. Regarding phone numbers, this information could be a tag for a business, home or mobile phone number. Another possibility is to create an extra hub for the multi-active attribute. Though, since it doesn’t present a real business object, the first solution can be more effective.

Delta Check OFF

There are two ways to insert new records into a multi-active satellite – having delta checks active or inactive. With delta checks turned off, all records of a business key are inserted into the satellite from your source delivery.
The advantage to that is that loads are faster and have a consistent load date timestamp to the parent hash key, independent of the multi-active attribute.
Later on, it simplifies the query based on the multi-active data (see figure 1). As a critical drawback, the ingested amount of data can increase strongly if full date loads are received.
In this case, you should partition your data by the load date timestamp. 

Read More

Using Multi-Active Satellites the Correct Way (1/2)

By Scalefree Newsletter No Comments
With multi-active satellites, you’re able to store multiple active records for one business key. Depending on how the data arrives from your source, there are different ways to implement multi-activity in Data Vault 2.0. In this post, we’ll explain your options for modeling. 

 

What is a Multi-Active Satellite?

A multi-active satellite is similar to a standard satellite and its structure. As said before, it stores multiple active records per key at a point in time. This exact structure depends on the use case though.
See the exemple Data Vault model in figure 1.

Read More

Effort estimation in Data Vault 2.0 projects

By Scalefree Newsletter No Comments

There are many options available when choosing a method to estimate the necessary effort within agile IT projects.
In Data Vault 2.0 projects, we recommend estimating the effort by applying a Function Point Analysis (FPA). In this article, you will learn why FPA is a good choice and why you should consider using this method in your own Data Vault 2.0 projects.

GOOD OLD PLANNING POKER

Probably the best known method for estimating work in agile projects is Planning Poker. Within the process, so-called story points, based upon the Fibonacci sequence (0, 0.5, 1, 2, 3, 5, 8, 13, 20, 40 and 100), are used to estimate the effort of a given task. 

To begin the process, the entire development team sits together as each member simultaneously assigns story points to each user story that they feel are appropriate. If the story points match, the final estimate is made. Alternatively, if a consensus cannot be reached the effort is discussed until a decision is made.  Read More

Implementing Data Vault 2.0 ghost records

By Scalefree Newsletter 4 Comments

Implementing Data Vault 2.0 ghost records

During the development of Data Vault, from the first iteration to its latest Data Vault 2.0, we’ve mentioned the two terms “ghost records” and “zero keys” in our literature as well as in our Data Vault 2.0 Boot Camps. And since then, we’ve noticed these concepts oftentimes being referenced to interchangeably. 

In this blog entry, we’ll discuss the implementation of ghost records in Data Vault 2.0. Please note, that this article is part one of a multi-part blog series clarifying Ghost records vs. Zero Keys. Read More

About Non-Functional Requirements

By Scalefree Newsletter 2 Comments
In our trainings and consulting practice, we often pitch the idea of “focusing on the business value” to the audience. Business value in enterprise data warehousing is defined as “something of value to the business” (believe it or not, we believe it should be said sometimes).

Typically, the reason by business setups a budget for the enterprise data warehouse is that they want some reports or dashboards with actionable information.

On the other hand, the most important counter-argument is that the enterprise data warehouse is more than just reports and dashboards. There is actually a lot of more technical components (non-functional requirements) to be built, including but not limited to: Read More

Organization of Information Requirements

By Scalefree Newsletter 5 Comments

Just a Recommendation…how we Organize our Information Requirements

 

Information is required by business users throughout the industry. However, as part of our consulting engagements, we also encounter a lack of proper description as to what the business user actually needs. 

So, we want to use this article to present the way we structure our information requirements internally at Scalefree as well as the way we do so for many of our customers.

What about User Stories?

We all know user stories from Scrum and many business intelligence projects.
Their structure is typically something that looks like:

As a <type of user>, I want <some goal> so that <some reason>.

The following example represents a typical user story we would receive in a project:

As a <marketing user>, I want <to have an overview report with the number of leads from a marketing channel> so that <I can adjust the marketing budget accordingly>.

Now, what should we do with this user story?
Many details are missing, and yes, we all know about product backlog refinement. The problem is that the user story is just not sufficient enough within business intelligence efforts and some structure might be of help.

Information Requirements

Developers in enterprise data warehousing and business intelligence need much more detail than just the user story. On the other hand, the user story is a good starting point for the information requirement. So, it can be treated as a typical introduction. The overall structure looks like this:
Read More

X
X