Skip to main content
search
0

Meltano Open Source Production Grade Data Integration – Part 2

Meltano result

Meltano Open Source Production Grade Data Integration

In this article, we delve into the practical application of Meltano, an open-source platform designed for building, running, and orchestrating ELT pipelines. Building upon our previous discussion of Meltano’s architecture, this installment guides you through the process of creating a data integration pipeline, from initializing a project to configuring extractors and loaders, and utilizing the Meltano UI for streamlined management.

Meltano in action

In our last overview, we talked about Meltano and its architecture. Now, we would like to illustrate the ease in which you can use Meltano to create a data integration pipeline.
Before we start, please ensure that you have already installed Meltano on your machine. If you haven’t yet, you can follow Meltano’s official installation guide.

First we will initialize a Meltano project.
Initialize a new project in a directory of your choice by using  “meltano init”. This will create a new directory with, among other things, your  “meltano.yml” project file.

Meltano

The Meltano project folder

The project folder is the single source of truth for all your data needs. It is a simple directory and version controllable. 

The main part of your project is the meltano.yml project file. This file defines your plugins, pipelines and configuration.

The project and meltano.yml files are manageable using the Meltano CLI and are instantly containerized for Docker/Kubernetes deployment. Though, please note that there are not any defined plugins or pipeline schedules created yet. We will do this in the next step.

Adding an extractor and loader

Now, let’s initialize the pipeline’s components. The first plugin you’ll want to add is an extractor which will be responsible for pulling data out of your data source.

To find out if an extractor for your data source is supported out of the box, you can check the Extractors list on MeltanoHub or run  “meltano discover”.

We will use the Tap for Gitlab in this example as we don’t need to create API credentials.

Meltano

Meltano manages setup, configuration and handles invocation.

Meltano

We will configure the extractor to pull the data from the repository meltano/meltano. Additionally, we will define it to only extract data as of 1 January 2021 and to include only data under the “Tags” stream.

Meltano

Data selection is way easier than using just Singer! The extractor is now set up. Now, we will add a loader to store the data into a CSV file and define our destination path:

Meltano

Remember that the directory needs to be previously created, as it will not be created automatically. This is what our meltano.yml looks like:

Instead of using the CLI, we can make changes directly in the YAML.

This way, it’s also possible to configure the extractors and loaders in the Meltano UI:

Start the Meltano UI web server using  meltano ui. Unless configured otherwise, the UI will now be available at http://localhost:5000.

Meltano Extractors
Meltano Configuration

Run a pipeline

Now it’s time to run a pipeline. To run a one-time pipeline, we can just use the meltano elt command:

Meltano
Pipelines

Result

And we are done! It took us just ten commands to create a data integration pipeline.

Meltano result

Conclusion

It’s clear why Meltano is a great choice for building your data platform: it’s powerful but simple to maintain and its open-source model makes it flexible, budget-friendly and reliable.

Speed Up Your Vault with VaultSpeed – Success Through Automation – Part 2

Watch the Webinar

In this Webinar, we take a closer look at Data Warehouse Automation and how easily it can be implemented. First, we will break down the basics of Data Warehouse Automation.

Then we will show how Data Vault 2.0 can contribute to successful Automation on the basis of a sample COVID-19 data set showing vaccine and infection numbers.

Lastly, Vaultspeed will give a demonstration of their tool that implements our suggested model using the data set thus showing the viewers how “simple and easy” VaultSpeed as a Data Warehouse Automation tool is.

This webinar is for everyone who wants to learn about Data Warehouse Automation and a sneak peek into VaultSpeed.

Watch Webinar Recording

Webinar Agenda

1. Understanding Data Warehouse Automation
2. Automation in Data Vault 2.0
3. Usecase
4. Vaultspeed Demo

Speed Up Your Vault with VaultSpeed – Success Through Automation – Part 1

Watch the Webinar

In this Webinar, we take a closer look at Data Warehouse Automation and how easily it can be implemented. First, we will break down the basics of Data Warehouse Automation.

Then we will show how Data Vault 2.0 can contribute to successful Automation on the basis of a sample COVID-19 data set showing vaccine and infection numbers.

Lastly, Vaultspeed will give a demonstration of their tool that implements our suggested model using the data set thus showing the viewers how “simple and easy” VaultSpeed as a Data Warehouse Automation tool is.

This webinar is for everyone who wants to learn about Data Warehouse Automation and a sneak peek into VaultSpeed.

Watch Webinar Recording

Webinar Agenda

1. Understanding Data Warehouse Automation
2. Automation in Data Vault 2.0
3. Usecase
4. Vaultspeed Demo

Singer Open Source Production Grade Data Integration – Part 1

In our past blog post, we introduced an open source framework for ELT processes called Singer. This framework can be wrapped up using another open source tool which adds more interesting features to Singer including installation, setup of environments, monitoring, scheduling and orchestration. At Scalefree, we moved all of our ELT pipelines into this framework on AWS and are pleased with the results.

Please note, there are a large number of platforms for managing data integration but there is a lack of robust and easy-to-use, free open source solutions. The Meltano project aims to provide a solution to that situation. Meltano is a full-package data integration platform that challenges the most established players in the data space. Meltano is built on top of the best open source tools for data integration and infuses them with DataOps best practices.

Meltano is the easiest way to build, run and orchestrate ELT pipelines made-up of Singer taps, targets and dbt models. It is open source, self-hosted and version controlled as well as containerized.

Meltano’s open source model lets you easily adapt it to your own needs and reduces cost. Continue Reading

Running Modern ETL-Processes with Framework-Based Tools – Part 2

Managed Self Service BI image

In the last blog post, we introduced Singer, the open-source framework, as a powerful tool for ETL processes. This time, we’d like to discuss how you can implement the framework in your own projects.

How to start working with Singer

Starting a test run is rather simple. First, you need to create a python environment,  for which step-by-step instructions to do so are available online. 

As soon as you’ve done that, it’s time to create your first virtual environment inside python.
Please note before beginning that it’s a best practice to create and use an individual virtual environment for every tap and target. This avoids any conflicts between module requirements for the different modules. 

The next step is to install the tap and target you’ve chosen into their corresponding virtual environment. This installation can be performed very easily using a pip install command. This example command installs the tap-salesforce to the load data from your Salesforce account:
Continue Reading

Running Modern ETL-Processes with Framework-Based Tools – Part 1

Data Vault 2.0 Information Delivery Class

A big part of every Enterprise Datawarehouse are ETL- or ELT-processes.
In both abbreviations, the letters stand for the same words, only the order in which each process is done changes.
To brush-up on those processes, “E” stands for extraction, “T” for transformation and “L” is for loading.

That said, rather than dive into the benefits of each,  we would like to present a powerful open-source framework to execute the processes instead.

Why use a framework?

Rather than developing individual solutions per source system, using standardized frameworks provides a wide variety of benefits. The main of which we have already mentioned, standardization.
Another benefit, using the same concept for extracting data from different source systems allows your system to become more auditable and reliable.
And when taking into consideration the varied benefits between frameworks, other potential upsides become available as well. Continue Reading

Was Machen BI-Berater:innen? Skills & Tasks

Watch the Webinar

Was macht ein Berater eigentlich? Es gibt viele Vorurteile und Mythen, die sich um den klassischen BI Berater ranken, aber stimmen sie auch?

In diesem Webinar lernt ihr wie der Job eines BI Beraters in der Praxis aussieht. Wir zeigen einen kompletten Werdegang vom Werkstudenten bis zum Senior Berater.

Watch Webinar Recording

Webinar Agenda

1. Kurzvorstellung Scalefree
2. Was machen Berater:innen?
3. Interview Werkstudent
4. Mein Weg als Berater
5. Interview Senior Beraterin

Implementing Data Vault 2.0 Zero Keys

Implementing Data Vault 2.0 Zero Keys

Learn about Zero Keys, “the other” concept that is oftentimes referenced interchangeably with ghost records, which we discussed in a previous blog post.

Why implement Zero Keys?

As discussed in the previous part of this series, a ghost record is a dummy record in satellite entities containing default values. Simply put, zero keys are the entry in each hub and link entity that is a counterpart to the satellite’s ghost record containing its hash key. In this manner, the term “zero key” is oftentimes used to describe the ghost record’s hash key, which might show up in other Data Vault entities such as in Point-in-Time (PIT) tables or links. Accompanying the zero hash key is, similar to a ghost record, a default value for the business key . Or, in the case of a composite business key, multiple default values for each of its components.

Zero Key with a composite business key

With the hub and link entry for the zero key in place, each and every entry in its related satellite will then have a parent hash key, avoiding so-called hash key orphans.

What does a Zero Key look like?

In Data Vault 2.0, it is only required to insert a single ghost record to each satellite entity. However, it is possible to have multiple zero keys in place. At Scalefree internally and in many  of our projects, we distinguish two types of missing objects through different hub zero keys.
Please note the hash algorithm in use is MD5:

  • 00000000000000000000000000000000 (32 times the digit ‘0’) for general “unknown” cases where a business key is missing.
  • ffffffffffffffffffffffffffffffff (32 times the letter ‘f’): a dedicated zero key for “erroneous” cases of missing business keys that show.
Multiple zero keys in a Hub entity

A good example that calls for the “error” zero key is in an erroneous or broken mandatory object relationship in the source. In that case, the zero key ffffffffffffffffffffffffffffffff will be found in the link entity, indicating an unexpectedly absent hub reference. Bear in mind, should you choose to implement the error zero key, it is not required to insert a ghost record with the error zero key as a parent hash key in satellite entities.

As for the zero key in link entities, it is only necessary to have one entry containing the zero hash key as both link hash key and hub reference.

Zero key in a Link entity

It is also important to point out that all examples we provide in this blog series involve the hash algorithm MD5, which outputs 32-hexadecimal-digit sequences. For Data Vault 2.0 projects that adopt other hash algorithms, such as SHA256, simply adjust the length of the zero keys we proposed (“0000…” / “ffff…”) to the desired hash output length.

Conclusion

We hope that this blog post helped to clarify the implementation of zero keys in a Data Vault 2.0 solution and the differences between the concepts of ghost records and zero keys. Feel free to share your experience with implementing these concepts in the comments below!

Using Multi-Active Satellites the Correct Way – Part 2

Data Vault 2.0 Training FAQ - Customized Class

Multi-Active Satellites in Data Vault 2.0

In our first post about multi-active satellites, we briefly explained different implementations that can be used to solve multi-activity. Now, we’re going to go into more detail regarding the advantages and disadvantages of these approaches having delta checks on or off.



Short Summary of Multi-Active Satellites

Multi-active satellites allow you to implement multi-active records per business key in Data Vault 2.0. To illustrate the need for the solution, let’s look at the common occurrence of a source system that doesn’t provide the needed metadata such as when working with XML files.

One solution to the above is to create a multi-active satellite by adding a subsequence number per business key. This accounts for any instance in which there is no multi-active attribute delivered by the source itself. Regarding phone numbers, this information could be a tag for a business, home, or mobile phone number. Another possibility is to create an extra hub for the multi-active attribute. However, since it doesn’t present a real business object, the first solution can be more effective.

Delta Check OFF

There are two ways to insert new records into a multi-active satellite – having delta checks active or inactive. With delta checks turned off, all records of a business key are inserted into the satellite from your source delivery.

The advantage of that is that loads are faster and have a consistent load date timestamp to the parent hash key, independent of the multi-active attribute.

Later on, it simplifies the query based on the multi-active data (see Figure 1). As a critical drawback, the ingested amount of data can increase strongly if full data loads are received.

In this case, you should partition your data by the load date timestamp. 

Multi-active satellites

Figure 1: Joining records using one PIT-table when having one LDTS for all active records per key.

Delta Check ON

To reduce the amount of data simply use delta checks. Doing so, you match the incoming data with the latest LDTS per hash key.
Note that a useful function to leverage the delta check is LISTAGG(). This function transforms values from a group of rows into a list of values. As a result, you are able to create a hash difference of multi-active attributes per business key (see figure 2).

Multi-active satellites

Figure 2 – Multi-row hash difference using the LISTAGG()-function

If new data arrives, it can be compared by the multi-row hash difference. Using it, a high proportion of data can be ignored. When a delta is detected, all records are inserted per hash key even if the content of the data for only one subsequence is changed. The result is that you get a consistent load date timestamp per hash key. Additionally, a delta is going to be noticed when the order of records changes as well. This approach is a trade-off to reduce the amount of records while still having a consistent load date timestamp per hash key so that only one PIT table is necessary.

Using Type Codes

Another method to address multi-activity is to use types of the attributes, granted they exist, as explained in the previous post.
If you have type codes in place, you’re able to compare row by row and load only this specific data. But consequently, you get different LDTS per multi-active attribute. Using PIT-tables in the business vault, you’ll therefore need an extra multi-active PIT-table. This scenario is shown in figure 3.

Multi-active satellites

Figure 3 – The need for an MA-PIT having different LDTS of active records per key

If you can pivot the multi-active column into multiple columns, one column per characteristic, you won’t need to consider any of the above advantages or disadvantages. In this case, you turn the satellite into a standard satellite. Related to our example with the phone numbers, you would create a couple of new columns per phone type.
This solution is applicable if you can be certain that the characteristics of the phone types will not change in the future. Otherwise, you will have to reengineer your process to catch all the data you have delivered!

Conclusion

In our second post regarding  multi-active satellites, we explored the backgrounds behind the different methods. You can use this information to implement your own multi-active solution depending on your present data.
Though, note that it’s often recommended limiting the loading data. For this purpose, the LISTAGG()-function becomes useful. Using a solution with potentially different LDST, remember to consider them when querying the data.

Using Multi-Active Satellites the Correct Way – Part 1

Multi-Active Satellites in Data Vault 2.0

With multi-active satellites, you’re able to store multiple active records for one business key. Depending on how the data arrives from your source, there are different ways to implement multi-activity in Data Vault 2.0. In this post, we’ll explain your options for modeling. 

What are Multi-Active Satellites?

Multi-active satellites are similar to standard satellites and their structure. As said before, they store multiple active records per key at a point in time. This exact structure depends on the use case though.
See the example Data Vault model in Figure 1.

Continue Reading

Effort Estimation in Data Vault 2.0 Projects

Effort Estimation in Data Vault 2.0 Projects

In Data Vault 2.0 projects, we recommend estimating the effort by applying a Function Point Analysis (FPA), but there are many options available when choosing a method to estimate the necessary effort within agile IT projects. In this article, you will learn why FPA is a good choice and why you should consider using this method in your own Data Vault 2.0 projects.

Good Old Planning Poker for Effort Estimation

Probably the best known method for estimating work in agile projects is Planning Poker. Within the process, so-called story points, based upon the Fibonacci sequence (0, 0.5, 1, 2, 3, 5, 8, 13, 20, 40 and 100), are used to estimate the effort of a given task. To begin the process, the entire development team sits together as each member simultaneously assigns story points to each user story that they feel are appropriate. If the story points match, the final estimate is made. Alternatively, if a consensus cannot be reached the effort is discussed until a decision is made. 

However, it’s important to note that this technique involves a lot of effort as a result of either too many tasks or too many team members.  So the question is: Does Planning Poker work for Data Vault projects? The short answer is “it makes little sense”. Since in Data Vault 2.0 the functional requirements are broken down into small items, there are numerous tasks that all have to be discussed and evaluated. In addition, the subtasks in Data Vault 2.0 are standardized artifacts, such as Hub, Link, and Satellite. In principle, these artifacts always represent the same effort.

Why does Function Point Analysis suit it so well?

This is where FPA comes into play and the reason why it is widely used in agile software projects. The idea is that software consists of the following characteristics, which represent the Function Point Types:

  • External inputs (EI) → data entering the system
  • External outputs (EO), external inquiries EQ → data leaving the system one way or another
  • Internal logical files (ILF) → data manufactured and stored within the system
  • External interface files (EIF) → data maintained outside the system but necessary to perform the task

With FPA, you break the functionality into smaller items for better analysis. As mentioned, this is already done in Data Vault 2.0 projects due to the standardized artifacts. This is the reason why FPA is very suitable for Data Vault projects.

How to apply FPA in Data Vault 2.0

Hence, to make good use of FPA in the Data Vault 2.0 methodology, the functional characteristics of software, external inputs, external outputs, external inquiries, internal logical files, and external interface files, must be adapted to reflect Data Vault projects. The following functional characteristics of data warehouses built with Data Vault are defined as:

  • Stage load (EI)
  • Hub load (ILF)
  • Link load (ILF)
  • Satellite load (ILF)
  • Dimension load (ILF)
  • Fact load (ILF)
  • Report build (EO)

Please note that there are also other functional components that can be defined, like Business Vault entities, Point-in-time tables, and so on. Once you have defined these components, you should create a table that maps them to function points. Function points are used to quantify the amount of business functionality an element provides to a user. In general, it is recommended to add a complexity factor first:

Complexity Factor Person Hours per Function Point
Easy 0.1
Moderate 0.2
Difficult 0.7

Then use the complexity factors and the assigned function points per component to calculate the estimated hours needed to add the respective functionality. Here is a short example of how the mapping table could look like for you:

Component Complexity Factor Estimated Function Points Estimated Total Hours
Hub Load Easy 2 0.2
Dimension Load Difficult 3 2.1
Report Build Difficult 5 3.5

The goal of estimation is to standardize the development of operational information systems by making the effort more predictable. This is due to the fact that when you use a systematic approach to estimate the needed effort to add components, it is possible to compare the estimated values with the actual values once the functionality is delivered. When both values are compared, your team can learn from these previous estimates and improve their future estimates by adjusting the function points per component. Also, keep in mind that your developers will gain experience over time or might lose experience due to replacement.

Conclusion

I hope this first glimpse into FPA helps you understand the basic value it can provide your team in Data Vault 2.0 projects. You can have a more in-depth look at how to apply FPA in Data Vault 2.0 projects by reading the book “Building a Scalable Data Warehouse with Data Vault 2.0” by Michael Olschimke and Dan Lindstedt.

Implementing Data Vault 2.0 Ghost Records

Ghost records

Implementing Data Vault 2.0 ghost records

During the development of Data Vault, from the first iteration to its latest Data Vault 2.0, we’ve mentioned the two terms “ghost records” and “zero keys” in our literature as well as in our Data Vault 2.0 Boot Camps. And since then, we’ve noticed these concepts oftentimes being referenced interchangeably. 

In this blog entry, we’ll discuss the implementation of ghost records in Data Vault 2.0. Please note, that this article is part one of a multi-part blog series clarifying Ghost Records vs. Zero Keys.

 

Why implement ghost records?

The concept of ghost records is usually brought up together with the implementation of point-in-time (PIT) tables. PIT tables are used as query assistant objects as part of the Business Vault, in which snapshots of data are created for certain time intervals specified by the data consumers. It’s important to note that these intervals can be daily, weekly, even real-time, etc. Each entry in a PIT table materializes joins from a Data Vault spine object (either a Hub or a Link) to its surrounding Satellite structures to reduce joins while querying against the Data Vault and thus boosting query performance.

In some instances, however, upon joining e.g. a Hub to one of its Satellites, there can be no corresponding Satellite delta for certain snapshots. The reason behind this could be that the business key was not available or unknown by the data source at that given time. 

Ghost records

Reference to a ghost record in a PIT table

To combat this issue, ghost records are added to Satellite entities to virtually fill up gaps in the beginning of the timeline, so that equal joins are made possible in ad-hoc queries against the Raw Vault. Equal joins (a.k.a. equi-joins) are joins that only use equality comparators and are arguably the most efficient/fastest SQL-join type.

What does a ghost record look like?

A ghost record can be understood as a dummy record that contains default values. In the previous iteration of Data Vault (DV1), the solution was to create a ghost record per key per satellite structure. This would still do the job of filling up gaps at the beginning of the timeline. However, this solution didn’t scale well on higher volumes of data. Imagine a hub that contains 10 million business keys and there are three satellites attached to it. Every satellite then contains 10 million ghost records, resulting in 30 million records across all three satellites. In addition, every time a business key is added to the hub, a corresponding ghost record needs to be added to each satellite. The sheer amount of ghost records in this case would defeat the whole purpose of trying to achieve equi-joins, to enable faster queries. 

Thus, since the introduction of DV2.0, it is only required to insert one single ghost record per Satellite structure.

Ghost records

Example: Ghost record with attributes of different data types

The ghost record typically contains a constant hash key 00000000000000000000000000000000 (32 times the character “0”). This hash key is also known as a Zero key – more on Zero keys coming up in the next part of this blog series. Its load timestamp is usually set to the earliest possible timestamp within the DBMS, indicating the “beginning point of time”. The record source “SYSTEM” simply means the record is artificially generated. 

Then, follows a list of default NULL values for every descriptive attribute within the Satellite structure. For each data type, we define a default value for the ghost record. For example, attributes with numeric data types can be filled with (numeric) zero, string attributes can be filled with either “(unknown)” or “?” depending on the length definition of the attribute.

It is recommended that the ghost record is filled with default values, as opposed to filling it with NULL/empty values, since these default values can be used and displayed further downstream. A good example of this can be seen in dimensions, an “(unknown)” string is arguably way more descriptive than a mere NULL value.

How to insert ghost records

There are a couple of ways to insert ghost records into Satellite structures.

The first variation is to insert the ghost records upon object creation as a one-time operation and then forget about it. Simple as that!

Another way is to insert the ghost record during the process of loading Satellites. The loading procedure should start with inserting a ghost record into the target object, if it does not yet exist. Then, the procedure can proceed with loading the Satellite with incoming data as normal. This variation might be viewed as rather excessive. However, it ensures that the ghost record is always available and that it gets inserted back into each Satellite – in case for whatever reason, objects are truncated or the ghost record itself is accidentally deleted, for example during development.

Both variations can be fully automated within your project’s Data Vault automation tool of choice.

Conclusion

We hope that this blog post helps to clarify the implementation of ghost records in a Data Vault 2.0 solution. Coming up next, we’d like to discuss with you “the-other-technical-term” Zero keys and the difference between Ghost records and them – which has been rather confusing to many fellow Data Vault practitioners.

Close Menu