
Scalefree Newsletter

Agile Kanban Made Easy for Salesforce: Introducing the MOWI App

By Salesforce, Scalefree Newsletter

This article discusses how agile project management can be implemented with a Kanban board, a visual representation of tasks and their progress. MOWI is a lightweight Salesforce-based application that fully integrates a Kanban board with all other components of the Business Application Platform using three related objects: Mission, Objective, and Work Item. Together, these objects help teams increase efficiency and productivity by providing a visual representation of their work, making it easier to track progress and manage tasks.

Agile project management is a well-known but often poorly executed approach to software development that emphasizes adaptability and collaboration. One popular method for implementing Agile is the Kanban board.

A Kanban board is a visual representation of User Stories/Tasks, often displayed on a physical whiteboard or a digital tool. Each item on the board represents a task, and its progress is tracked by moving it from one column to another. Typically, columns represent stages of work such as “Active”, “Blocked”, “Testing”, and “Done”. This improves visibility for the team and the stakeholders alike and can help with better inter-team communication.

MOWI is a lightweight application that uses the Kanban view of Salesforce List Views to fully integrate the Kanban board with all other components within the Business Application Platform. The name is an acronym formed from the initial letters of its three objects: Mission, Objective, Work Item. These objects enable users to create and manage projects and tasks in a Kanban system.

The objectives are the heart of the MOWI application. Each objective belongs to exactly one mission. Objectives can, however, have further so-called Follow-Up Objectives, so a one-to-many relationship among objectives is possible as well.

The objects are related to one another. A mission can be understood as a container of objectives: a mission can have several objectives, but an objective can have only one parent mission at a time.

The same applies to Work Items: they can be understood as sub-goals of an objective. Like missions, however, they are optional.
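The relationships described above can be sketched as a small data model. The following Python sketch is purely illustrative (the real objects are Salesforce custom objects, and all names here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class WorkItem:
    name: str
    status: str = "Open"  # phases: Open, In Progress, Closed

@dataclass
class Objective:
    name: str
    work_items: list = field(default_factory=list)  # optional sub-goals (Work Items)
    follow_ups: list = field(default_factory=list)  # Follow-Up Objectives (one-to-many)

@dataclass
class Mission:
    name: str
    objectives: list = field(default_factory=list)  # a mission contains many objectives

# Build a small example hierarchy
mission = Mission("Release 1.0")
objective = Objective("Build Kanban view")
objective.work_items.append(WorkItem("Design columns"))
objective.follow_ups.append(Objective("Polish UI"))
mission.objectives.append(objective)
```

Note how each objective hangs off exactly one mission, while Work Items and Follow-Up Objectives are optional one-to-many children.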

Image 1: Schema-Builder (Mission/Objective/Work Item)

Using a Kanban board view in Salesforce can help teams increase efficiency and productivity by providing a visual representation of their work and making it easier to track progress and manage tasks. The three objects within the custom application provide a good template for this.


Missions are a container for multiple objectives. If several objectives need to be grouped into a “bigger picture”, you can add a mission, but missions are optional. The screenshot shows how a mission is structured and which fields it contains. A mission has one or more objectives, which can be understood as “sub-objectives”.

Image 2: Screenshot of a Mission Record


Objectives are records with multiple fields. You can also call the objectives Tasks, User Stories, or Cards. The fields can be filled out with information to describe the objective. In the following screenshot you can see an example of an objective.

Image 3: Screenshot of an Objective Record

Work Item

The objectives within a mission may also need to be divided into individual sub-goals. Work Items are these sub-tasks of the related objective, where you can track the progress towards reaching the objective’s goal. The screenshot below shows an example Work Item with the status set to “Open”. A work item passes through the three following phases:

  • Open
  • In Progress
  • Closed

Image 4: Screenshot of a Work Item Record

The Kanban View

MOWI: so far, so good. But what does the view of the objectives actually look like? The following screenshot shows an example of a Kanban board. On the board, there are different objectives in different phases. Users can drag and drop the objectives into the different phases depending on their progress.


Image 5: Screenshot of the Kanban view

Final Remarks

Interested in the MOWI solution? Then visit our app on the AppExchange and download it today for free!
Agile Kanban Board in Salesforce – MOWI

We hope that this article has provided valuable information to you. If you need further assistance or have any questions, don’t hesitate to reach out to us at [email protected].

If you want to stay up-to-date on the latest developments and news, be sure to subscribe to our newsletter. We send out regular updates with important information and insights for Salesforce.

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at


Exploring DataVault4dbt: A Practical Series on the dbt Package for Data Vault 2.0 – Vol. 1: The Staging Layer

By Scalefree Newsletter

Join the Webinar:

Last year Scalefree released DataVault4dbt, a Data Vault 2.0 open-source package for dbt, which includes loading templates for creating and modeling Data Vault 2.0 entities following up-to-date standards and best practices. If you want to read more about the general content of the package and its motivation, you can do so here.

We’re excited to launch a series of insightful posts and webinars, showcasing practical implementations of DataVault4dbt. This will empower you to leverage its full potential in your data warehousing endeavors. Today, we’ll spotlight its application in the staging layer.

Before we start…

We will assume some previous knowledge of Data Vault 2.0 and dbt. For the following examples, we will be using the dbt Cloud IDE connected to Snowflake. For an up-to-date list of supported platforms, check the package’s GitHub repository.

Also, bear in mind that for optimal use of the macros, you must meet a couple of prerequisites:

  • Flat & Wide source data, accessible in your target database
  • A Load Date column signifying the time of arrival in the source data storage
  • A Record Source column detailing the origin of the source data, such as the file location within a Data Lake

In our case, we used and adapted the data from the jaffle_shop example project available on dbt.

Installing the DataVault4dbt package

Installing DataVault4dbt is like installing any other package in your project. You will need to follow two simple steps:

1. Add it to your packages.yml file

2. Run dbt deps

Image: Screenshot of the installation steps
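For step 1, the packages.yml entry looks roughly like this (the package is published on the dbt Hub under ScalefreeCOM/datavault4dbt; the version range below is illustrative, so pin it to the release you actually want):

```yaml
packages:
  - package: ScalefreeCOM/datavault4dbt
    version: [">=1.0.0", "<2.0.0"]
```

After adding this, running dbt deps downloads the package into your project.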

Using the macro for staging our source data

According to the documentation for the staging layer, this layer primarily focuses on hashing. It also offers functionalities like creating derived columns, conducting prejoins, and adding NULL values for missing columns. Rather than going deep into the technical aspects of each macro component, which are comprehensively covered in the documentation, let’s dive straight into its application!

A. Basic source information

Identifying the Source Model (source_model):

  • When referencing a source, adopt the dictionary format: ‘source_name’: ‘source_table’.
  • For models within our dbt project, just use the model name: ‘source_table’.

Setting Load Date Timestamp (ldts) & Record Source (rsrc):

  • Both can reference a column from the source table or a more detailed SQL expression.
  • Additionally, for the Record Source, you can use a static string beginning with ‘!’, like ‘!my_source’.


Image: Screenshot of the basic source parameters
  • source_model: Calls an already created table on dbt named ‘orders_example’.
  • ldts: Calls a timestamp column from our source model.
  • rsrc: Calls a column which contains a string referring to our record source name.
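Reconstructing the configuration from the bullets above, a stage model along these lines could be written. The yaml_metadata pattern follows the package documentation; the ldts and rsrc column names below are assumptions for illustration:

```sql
{{ config(materialized='view') }}

{%- set yaml_metadata -%}
source_model: 'orders_example'
ldts: 'LOAD_TS'          # assumed timestamp column in the source
rsrc: 'RECORD_SOURCE'    # assumed column holding the record source string
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.stage(source_model=metadata_dict.get('source_model'),
                       ldts=metadata_dict.get('ldts'),
                       rsrc=metadata_dict.get('rsrc')) }}
```

The further parameters discussed below (hashing, derived columns, prejoins, and so on) are added to the same yaml_metadata block and passed to the macro in the same way.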

B. Hashing

The hashed_columns parameter outlines how to generate hashkeys and hashdiffs. For each hash column:

  • The key represents the hash column’s name.
  • For Hashkeys, the value is a list of business keys.
  • For Hashdiffs, the value will usually be a list of descriptive attributes.


Image: Screenshot of the hashed_columns configuration
  • hk_order_h: hashkey generated using two column inputs (O_ORDERKEY and O_CUSTKEY)
  • hd_order_s: hashdiff generated using multiple descriptive attributes
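Expressed as the macro’s hashed_columns parameter (in YAML form, following the package documentation), the screenshot’s configuration would look roughly like this; the hashdiff’s descriptive columns are assumed for illustration:

```yaml
hashed_columns:
  hk_order_h:
    - O_ORDERKEY
    - O_CUSTKEY
  hd_order_s:
    is_hashdiff: true
    columns:            # assumed descriptive attributes
      - O_ORDERSTATUS
      - O_TOTALPRICE
      - O_ORDERDATE
```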

C. Derived columns

Derived Columns in stage models allow users to directly apply specific transformations to data. They act as on-the-fly customizations, enabling immediate adjustments to data within the column itself. Essentially, if data isn’t in the desired format, you can derive a new version right within the column using a specified rule.

When setting the derived_columns parameter, each derived column includes:

  • value: The transformation expression.
  • datatype: The datatype of the column.
  • src_cols_required: Source columns needed for the transformation.

Depending on how you name the derived column and the source columns, you can achieve two outcomes:

  1. If the derived column’s name matches its source column’s name, the original column’s data will be replaced by the transformed data. This effectively means you’re overwriting the original data.
  2. On the other hand, if the derived column’s name is different from its source column’s name, the transformation will result in a brand new column, preserving the original column’s data.


Image: Screenshot of the derived_columns configuration
  • price_euro: creation of a new column with the same values as the O_TOTALPRICE column.
  • country_isocode: creation of a new column with a static string ‘GER’.
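In YAML form, the two derived columns from the screenshot could be defined roughly as follows (datatypes are assumed Snowflake types for illustration):

```yaml
derived_columns:
  price_euro:
    value: 'O_TOTALPRICE'
    datatype: 'NUMBER(38,2)'     # assumed datatype
    src_cols_required: 'O_TOTALPRICE'
  country_isocode:
    value: "'GER'"               # static string; note the nested quotes
    datatype: 'VARCHAR'
```

Since neither name matches an existing source column, both result in brand new columns rather than overwriting existing data.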

D. Prejoining

Why Prejoin?

In certain scenarios, your source data might not have the ‘Business Key’ which is often a human-readable identifier, such as an email address or username. Instead, it might have a ‘Technical Key’, which could be an internally generated identifier or code. If you need to use the human-readable Business Key in your processing but only have the Technical Key, you would use prejoining to combine your data with another table that maps Technical Keys to Business Keys.

How to Define Prejoins?

The package provides a structured way to define these prejoins (prejoined_columns) using dictionaries.

For every column you’re adding through prejoining, you need to specify a few things:

  • src_name: This is the source of the prejoined data, as defined in a .yml file.
  • src_table: This specifies which table you’re prejoining with, as named in the .yml file.
  • bk: This is the name of the Business Key column in the prejoined table or the column values you are bringing to your table.
  • this_column_name: In your original data, this is the column that matches up with the prejoined table. This is often a Technical Key.
  • ref_column_name: In the prejoined table, this is the column that this_column_name points to. It should match up with the values in this_column_name.

Note that both ‘this_column_name’ and ‘ref_column_name’ can represent either a single column or a list of columns, serving as the basis for constructing the JOIN conditions.


Image: Screenshot of the prejoined_columns configuration
  • c_name: we brought the column “C_NAME” from the customer source table, joining on orders.o_custkey = customer.c_custkey.
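The prejoin from the bullet above could be defined roughly like this in YAML (the source name is an assumption based on the jaffle_shop example used earlier):

```yaml
prejoined_columns:
  c_name:
    src_name: 'jaffle_shop'      # source name as defined in a .yml file (assumed)
    src_table: 'customer'
    bk: 'C_NAME'
    this_column_name: 'O_CUSTKEY'
    ref_column_name: 'C_CUSTKEY'
```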

E. Multi-active config

The multi_active_config parameter is used when dealing with source data that contains multiple active records for the same Business Key. Essentially, you need to specify which columns are the multi-active keys and the primary hashkey column.

If your source data doesn’t have a natural multi-active key column, you should create one using functions like row_number in a preceding layer. Then, add the name of this newly created column to the multi-active-key parameter. It’s crucial that the combination of multi-active keys, main hashkey, and the ldts column be unique in the final satellite output. If you don’t use this setting, the stage is considered to have only single active records.
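In YAML form, this could look like the following (parameter names follow the package documentation; the multi-active key column is hypothetical):

```yaml
multi_active_config:
  multi_active_key: 'O_LINENUMBER'     # hypothetical multi-active key column
  main_hashkey_column: 'hk_order_h'
```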


Image: Screenshot of the multi_active_config parameter

By setting this parameter, we’ll observe consistent hashdiffs for identical Business Keys, proving beneficial in subsequent layers. If you want to know why, you can check this post.

F. Missing columns

The missing_columns parameter helps handle scenarios where the source schema changes and some columns no longer exist. Using this parameter, you can create placeholder columns filled with NULL values to replace the missing ones. This ensures that hashdiff calculations and satellite payloads continue to work. Essentially, you provide a dictionary where the column names are the keys and their respective SQL datatypes are the values.


Image: Screenshot of the missing_columns configuration
  • discount_code: creation of a new discount_code column with NULL values.
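Matching the bullet above, the parameter is simply a dictionary of column names and SQL datatypes:

```yaml
missing_columns:
  discount_code: 'VARCHAR'
```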

Final Remarks

Scalefree’s DataVault4dbt package introduces an easy-to-use yet powerful solution for database modeling. In our case, we went through the staging layer macro, which combines best practices with the flexibility to address diverse source data needs. From hashing to on-the-fly column modifications, this Data Vault 2.0 open-source package for dbt streamlines complex processes.

As we continue to explore its potential, we invite you to join our upcoming webinar for a deeper dive. Reserve your spot here and stay tuned to the package’s GitHub repository for the latest updates and support.

– By Hernan Revale (Scalefree)



Choosing the Right Tech Stack for an Open-Source Powered EDW

By Scalefree Newsletter

Join the Webinar:


Choosing the right technology stack is a critical decision when building an open source powered Enterprise Data Warehouse (EDW). The technology stack consists of various components, including databases, automation tools, DevOps, Infrastructure, and visualizations, which work together to enable efficient data management, processing, and analysis.

In this blog article, we will dive deeper into the topic of selecting the right tech stack for an open source powered EDW. We will explore different aspects to consider, such as evaluating vendors, leveraging open source products, and understanding the key components of a robust tech stack. By the end of this article, you will have a better understanding of the factors to consider when selecting the right tech stack for your EDW.

Register here for free.

Evaluating Vendors and Leveraging Open-Source Products:

When embarking on the journey of building an open-source powered EDW, it is crucial to evaluate vendors and leverage open source products effectively. By choosing reputable vendors and open source solutions, you can ensure reliability, community support, and continuous development. Evaluating vendors involves assessing their expertise, reputation, and compatibility with your project requirements. Additionally, leveraging open source products provides flexibility, cost-effectiveness, and access to a vast community of contributors and developers.

Understanding the Key Components of a Robust Tech Stack:

A robust tech stack for an open source powered EDW comprises various components that work together to enable efficient data management and analysis. Here are some key components to consider:


Databases:

Choosing the appropriate database technology is vital for efficient data storage and retrieval. Options like MongoDB, PostgreSQL, MySQL, or other databases that align with your project requirements should be considered.

Automation Tools:

Automation tools play a crucial role in the development process of an EDW. These tools greatly accelerate the development process, particularly in a Data Vault project. One example of an open source automation tool is dbt (data build tool), which can be combined with Scalefree’s own package, DataVault4dbt. These tools help streamline the development process and make the development team more efficient.

DevOps and Infrastructure:

Having a stable scheduler or a similar tool to load the data regularly from the sources into the Data Warehouse is important. Options such as Airflow can be considered for this purpose. Additionally, having a DevOps tool for project management is essential. These tools help structure the work and make the development team more efficient, especially when using agile methodologies like Scrum.


Visualization:

Effective data visualization is crucial for analyzing and understanding the data in an EDW. There are various open source visualization tools available, such as Grafana, Superset, or Metabase, which provide powerful capabilities for creating insightful visualizations and dashboards.

Why Data Vault 2.0 is a Powerful Choice in Combination with an Open Source Tech Stack:

Combining Data Vault 2.0 with an open source tech stack offers a powerful solution for building an efficient, scalable EDW. The agile concepts used in Data Vault make it easier to gradually build an open source tech stack over time, starting with basic needs and expanding as necessary.
It should be noted that checking the readiness of an open source automation tool for Data Vault and having Data Vault templates in place is crucial. These components enhance efficiency, streamline development, and ensure smooth integration in an open source powered EDW environment.

Benefits of an Open Source Powered EDW:

Building an open source powered EDW offers several advantages. Firstly, open source solutions often provide a vast community of developers, ensuring continuous support, updates, and improvements. Secondly, open source products can be customized and tailored to meet specific project requirements. This flexibility allows you to adapt the tech stack to your organization’s needs and scale as your data processing requirements grow. Lastly, open source solutions typically offer cost-effectiveness by eliminating or reducing licensing fees, making them an attractive option for organizations of all sizes.

Considerations for Scalability and Performance:

Scalability and performance are crucial factors to consider when selecting the right tech stack for an open source powered EDW. As your data processing needs grow, it’s important to choose a tech stack that can scale horizontally or vertically to handle increasing workloads. Technologies like Kubernetes can be considered for container orchestration and load balancing to ensure efficient utilization of resources and smooth scalability. Additionally, performance optimization techniques, such as caching mechanisms, data indexing, and query optimization, should be considered to ensure fast and efficient data retrieval and processing.

Security and Data Privacy:

When dealing with enterprise data, security and data privacy are of utmost importance. Ensure that the chosen tech stack incorporates robust security measures and follows best practices for data encryption, access control, and secure communication protocols. Regular security audits and updates are essential to address any vulnerabilities and ensure compliance with data privacy regulations.


Picking the right tech stack for an open source powered EDW is a crucial step in building an efficient and scalable BI-System. By evaluating vendors, leveraging open source products, and understanding the key components of a robust tech stack, you can ensure a solid foundation for your EDW. Databases, automation tools, DevOps and Infrastructure, and visualization choices play vital roles in creating an effective and customizable solution. Embracing open source solutions provides flexibility, community support, and cost-effectiveness, making it an ideal choice for organizations seeking efficient data processing and analysis capabilities. Considerations for scalability, performance, security, and data privacy are important to ensure the success of your EDW implementation.

In conclusion, the selection of a tech stack for an open source powered EDW requires careful consideration of various factors. It is essential to evaluate vendors, leverage open source products, and understand the key components that contribute to a robust tech stack. By making informed choices and aligning the tech stack with your project objectives, you can build a scalable and efficient EDW that empowers your organization to process and analyze data effectively.

If you are interested in learning more, don’t miss the opportunity to participate in the upcoming webinar on this topic.
Register here for free.



Mastering Metadata: Data Catalogs in Data Warehousing with DataHub

By Scalefree Newsletter

Join the Webinar:


In today’s data-driven world, it is essential to be able to manage and organize large amounts of data in an efficient way. Businesses across all industries are forced to contend with more data than ever before. The introduction and development of an enterprise data warehouse in a company naturally plays a central role but does not solve a major challenge: How to effectively organize and manage the data, especially metadata, in an Enterprise Data Warehouse? This is where the concept of data catalogs comes into play.

A data catalog serves as a comprehensive inventory of data assets in an organization, providing context, annotations, and metadata to facilitate understanding and discovery of data. It’s like a map to your data, helping users navigate the complex data landscape to find the exact data they need.

A data catalog can help users to understand where to find specific data in the data warehouse that fits their need and to investigate where it came from as well as how it might be connected to other data. This can greatly simplify tasks like data analysis and reporting, making the data warehouse more accessible and usable for everyone in the organization.

There’s also going to be a webinar on July 25 at 2 pm with a more detailed look at data catalogs in data warehousing with DataHub.

Register here for free.

Understanding Data Catalogs

What is a Data Catalog?

In general, a data catalog is like a metadata inventory: organized and structured metadata about all data assets in an organization. It is a central place where all this metadata can be stored, combined, and categorized, which makes it much easier to discover and understand the corresponding data, for example in a data warehouse. A data catalog also has search functionality to find specific data among the indexed datasets. It serves as a single source of truth for your metadata, enabling users to trust the data they’re using for their analyses or business decisions.

Role of a Data Catalog in Data Warehousing

In the context of data warehousing, a data catalog brings a lot of benefits. It provides a way to explore and search all data stored in the data warehouse. Technical users, as well as Business Users, can discover relevant data, understand its context, and ensure it is up-to-date, reliable, and accurate. The following figure shows where data catalogs in data warehousing with Data Vault 2.0 take place. A Data Catalog should cover the entire Enterprise BI Solution. This also applies, for example, to a data lake, if available, and to the information delivery layer.

Now that we’ve understood what a data catalog is, let’s delve into how each component plays a part in a data catalog and explore how a tool like DataHub can assist organizations in these tasks.

Introduction to DataHub

In the world of data catalogs, DataHub stands out as an increasingly popular choice for many businesses. DataHub is an open-source metadata platform originally developed by LinkedIn to address its growing need for a more dynamic and scalable data management tool; the existing tools could not keep up with LinkedIn’s expanding needs.
As LinkedIn grew, so did its data volume, variety, and velocity. Recognizing the need for a more efficient way to manage its data, LinkedIn built and introduced DataHub in 2020. Open-sourcing DataHub allowed other organizations to benefit from this advanced tool, and it has since been adopted by many businesses looking for a modern, scalable data catalog solution.

DataHub supports both push-based and pull-based metadata ingestion, with a wide range of integrations such as Airflow, BigQuery, Databricks, dbt, Hive, Kafka, Looker, MSSQL, MongoDB, Oracle, S3, PowerBI, Snowflake, Spark, and many more. You can find a full list here. This gives DataHub the ability to combine and show metadata about the same data from multiple sources, for example a dbt model definition and whether its tests ran successfully, right next to the database schema and statistics for all columns.
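Pull-based ingestion is typically configured with a small recipe file; a minimal sketch might look like this (source type, credentials, and server address are placeholders, not a working configuration):

```yaml
# Hypothetical DataHub ingestion recipe: pull metadata from Snowflake
# and push it to a DataHub instance via the REST sink.
source:
  type: snowflake
  config:
    account_id: my_account          # placeholder connection details
    username: my_user
    password: ${SNOWFLAKE_PASSWORD}
    warehouse: COMPUTE_WH
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

Such a recipe is then executed with the DataHub CLI, e.g. `datahub ingest -c recipe.yml`.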

Key Features and Capabilities of DataHub

DataHub, as a metadata platform, goes beyond traditional data catalogs. It offers the following key features and capabilities:

1. Scalability: DataHub is designed to handle metadata from thousands of datasets, which makes it a great choice for large organizations.

2. Flexible and Extensible Data Model: The technical data model behind the tool is designed to be customizable and expandable, allowing organizations to adapt it to their specific business requirements.

3. Powerful Search and Discovery: Leveraging Elasticsearch, DataHub offers robust search functionality that enables users to discover datasets quickly based on various attributes, such as the data’s origin, schema, and usage.

4. Rich Metadata: Unlike traditional data catalogs, DataHub captures and presents a wide variety of metadata, including data lineage, operational metadata, and business metadata. This gives users a comprehensive understanding of their data.

5. Data Lineage and Relationships: DataHub automatically captures and visualizes data lineage, showing how data flows through various systems. It also displays relationships between datasets, allowing users to understand how different data assets interact with each other.


Benefits of Using a Data Catalog

Using a data catalog comes with several benefits:

  1. Enhanced Data Discovery: With the search and categorization capabilities of a data catalog, users can quickly find the exact data they need without having to comb through large datasets.
  2. Improved Data Understanding: The metadata in a data catalog provides users with necessary context about the data, making it easier to interpret and use correctly.
  3. Better Compliance and Governance: A data catalog supports data governance initiatives by ensuring data is consistent, accurate, and compliant with relevant regulations.
  4. Increased Trust in Data: By providing transparency into data lineage, a data catalog helps build trust in the data by allowing users to see its history and verify its accuracy and reliability.
  5. Time and Resource Efficiency: By making it easier to locate and understand data, a data catalog can save the company resources, thus speeding up data-driven activities and reducing the burden on data management teams.

In conclusion, DataHub provides a flexible, feature-rich, and all-encompassing option for data catalogs in a data warehousing environment. By providing powerful features for data discovery, metadata management, data lineage, and data governance, it enables businesses to extract maximum value from their data.

If you’re interested in learning more about data catalogs, don’t miss the opportunity to participate in the upcoming webinar about this topic.
Register here for free.



Data Vault 2.0 with Hadoop and Hive/Spark

By Scalefree Newsletter


In this newsletter, you’ll receive an overview of what Hadoop and Hive are and why they can be used as an alternative to traditional databases.

There’s also going to be a webinar on June 13 at 2 pm with a more detailed look at how the components interact with each other and what a sample Data Vault 2.0 architecture looks like in that ecosystem. The webinar will also cover why Spark is used with Hadoop and Hive.

Take a look at the recording.


Hadoop is used to process and analyze large volumes of data efficiently by distributing the workload across a cluster of commodity hardware, enabling parallel processing and providing fault tolerance through its distributed file system and resource management framework.

HDFS – Hadoop Distributed File System

HDFS is a distributed file system that provides reliable and scalable storage for big data. It breaks large files into blocks and distributes them across a cluster of commodity hardware. HDFS ensures data reliability and availability through data replication.

Yet Another Resource Negotiator – YARN

YARN provides a flexible, scalable resource management framework for Hadoop, enabling a variety of applications and workloads to coexist and efficiently utilize the cluster’s resources. It abstracts the underlying infrastructure and allows for the dynamic allocation of resources based on application requirements.

MapReduce – MR

MapReduce is a programming model and processing framework for distributed data processing in Hadoop. It allows for parallel processing of large datasets by dividing the workload into map and reduce tasks. Map tasks process data in parallel, and their output is combined and reduced to produce the final result.
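The model itself is easy to sketch outside Hadoop. The following Python toy example mimics the map, shuffle, and reduce phases for a word count; in Hadoop, the map and reduce tasks would run in parallel across the cluster:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values for each key into the final result."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big clusters process big data"]
result = reduce_phase(shuffle_phase(map_phase(docs)))
# result == {'big': 3, 'data': 2, 'clusters': 1, 'process': 1}
```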

Hadoop Common

Hadoop Common provides libraries, utilities, and infrastructure support for the other components of Hadoop. It includes common utilities, authentication mechanisms, and interfaces that are used by various Hadoop modules.

What is the benefit?

Scalability
Hadoop enables the storage and processing of massive amounts of data by scaling horizontally across a cluster of commodity hardware. It can handle petabytes of data without sacrificing performance.

Distributed Computing
Hadoop distributes data and processing tasks across multiple nodes in a cluster, allowing for parallel processing and faster data analysis. This distributed computing model enables efficient utilization of resources and enables high-performance data processing.

Fault Tolerance
Hadoop provides fault tolerance by replicating data across multiple nodes in the cluster. If a node fails, data can still be accessed from other replicas, ensuring data reliability and availability.

Cost-Effectiveness
Hadoop is designed to run on inexpensive commodity hardware, making it a cost-effective solution for storing and processing large volumes of data. It eliminates the need for expensive specialized hardware.

Flexibility and Extensibility
Hadoop’s modular architecture allows for integration with various tools and frameworks within the Hadoop ecosystem, providing flexibility and extensibility. It supports a wide range of data processing tasks, including batch processing, real-time processing, machine learning, and more.

Data Locality
Hadoop’s distributed file system, HDFS, aims to bring the computation closer to the data. By processing data where it is stored, Hadoop minimizes data movement across the network, reducing latency and improving overall performance.

Ecosystem and Community
Hadoop has a rich ecosystem with a wide range of tools, libraries, and frameworks that extend its functionality for different use cases. It also has a large and active community of users, developers, and contributors, providing support, resources, and continuous improvement.

These benefits make Hadoop a powerful, popular solution for handling big data, enabling organizations to efficiently store, process, and gain insights from vast amounts of structured and unstructured data. The whole ecosystem can also run on-premise, which can make it a good alternative if ‘cloud’ is not an option.


Hive is a data warehouse infrastructure built on top of Hadoop that provides a high-level SQL-like query language called HiveQL for querying and analyzing large datasets.

What are the components?

Data Storage
Hive leverages Hadoop Distributed File System (HDFS) as its underlying storage system. It stores data in HDFS in a distributed and fault-tolerant manner, allowing for scalable, reliable data storage.

Schema Definition
Hive allows users to define a schema for their data using its Data Definition Language (DDL). This allows users to define tables, partitions, columns, data types, and other metadata associated with the data.

Query Optimization
Hive optimizes queries by performing query planning and optimization techniques. It aims to generate efficient query execution plans to minimize data movement, optimize resource utilization, and improve query performance.

Hive Metastore
Hive maintains a metadata repository called the Hive Metastore. It stores information about the tables, partitions, schemas, and other metadata associated with the data stored in HDFS. The metastore allows for efficient metadata management and retrieval during query processing.

Extensibility
Hive offers extensibility through User-Defined Functions (UDFs), User-Defined Aggregations (UDAs), and User-Defined Table Functions (UDTFs). These allow users to define custom logic and operations in programming languages like Java, Python, or other supported languages.

Integration with other tools
Hive integrates with various other tools and frameworks in the Hadoop ecosystem. For example, it can work alongside Apache Spark, Apache Pig, Apache HBase, and other components to provide a complete data processing and analytics solution.

Partitioning and Bucketing
Hive supports data partitioning and bucketing, allowing users to organize and store data in a structured manner. Partitioning involves dividing data into logical partitions based on specific criteria, while bucketing involves dividing data into equally sized buckets based on hash values.
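The two concepts can be illustrated in Python: the directory layout mirrors Hive's `col=value` partition paths, and the hash function here is MD5 as a stand-in for Hive's internal bucketing hash (both are illustrative, not Hive's actual implementation):

```python
import hashlib

def partition_path(table, partition_cols, row):
    # Hive-style partitioning: each distinct value combination maps to a directory
    parts = "/".join(f"{col}={row[col]}" for col in partition_cols)
    return f"/warehouse/{table}/{parts}"

def bucket_id(value, num_buckets):
    # Hive-style bucketing: rows are assigned to a fixed number of buckets by hash
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return int(digest, 16) % num_buckets

row = {"order_id": 42, "country": "DE", "year": 2023}
print(partition_path("orders", ["country", "year"], row))
# → /warehouse/orders/country=DE/year=2023
```

Partition pruning lets queries skip whole directories, while bucketing gives evenly sized files that help with sampling and joins.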

SerDe Support
Hive uses a serialization/deserialization framework called SerDe (Serializer/Deserializer) to read and write data in different formats, such as CSV, JSON, Avro, and more. Users can specify the appropriate SerDe for their data format to ensure proper data processing.
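A minimal sketch of the SerDe idea in Python: two interchangeable serializer/deserializer classes for JSON and CSV lines (illustrative only, not Hive's actual SerDe interface):

```python
import csv
import io
import json

class JsonSerDe:
    # One JSON document per line <-> one row dict
    def deserialize(self, line):
        return json.loads(line)
    def serialize(self, row):
        return json.dumps(row)

class CsvSerDe:
    # One delimited line <-> one row dict, driven by a fixed column list
    def __init__(self, fieldnames):
        self.fieldnames = fieldnames
    def deserialize(self, line):
        reader = csv.DictReader(io.StringIO(line), fieldnames=self.fieldnames)
        return next(reader)
    def serialize(self, row):
        out = io.StringIO()
        csv.DictWriter(out, fieldnames=self.fieldnames).writerow(row)
        return out.getvalue().strip()

row = CsvSerDe(["id", "name"]).deserialize("1,Alice")
print(row["name"])  # → Alice
```

The query engine only sees row dicts; which SerDe turns bytes into rows is a per-table property, which is exactly what makes format changes cheap.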

Overall, Hive simplifies data querying and analysis on Hadoop by providing a familiar SQL-like interface. It abstracts the complexity of writing low-level MapReduce or Tez jobs and provides a declarative and user-friendly approach to interact with large-scale data stored in Hadoop.


Hadoop is a robust and feature-rich environment that can be challenging to manage. However, its numerous advantages make it a compelling choice, depending on the user’s needs and the available in-house expertise. If you’re interested in learning more about it, don’t miss the opportunity to participate in the upcoming webinar.

Register here for free

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at


Get started with real-time processing in Data Vault 2.0 – on Microsoft Azure

By Scalefree Newsletter


In this newsletter, you’re going to get an overview of what real-time processing is and what possibilities it can provide your Data Vault 2.0 implementation.

There’s also going to be a webinar on May 9, 2023 at 2pm CEST with a more in-depth look into the possibilities of real-time data. Take a look into the recording here.

What to expect

You will learn that real-time processing gives you the ability to create value out of data more quickly, have the most up-to-date data in your reporting tools, and make more accurate data-driven decisions.
With that, your company will be able to adapt to changes in the market faster by seeing developments right away in the most recent data.

Additionally, you can save costs by moving away from batch loading, because the peak of computing power normally required for it is reduced and spread more evenly throughout the day. That is especially the case in cloud environments, where compute can be provisioned on demand and scaled to match the actual load.

The traditional way – batch-loading

Batch loading is a traditional method used to load data into a data warehouse system in large batches, mostly overnight. The data from data sources is delivered up to a certain time in the night to be transformed and loaded into the core data warehouse layer.

This method leads to a peak of data processing overnight, and organizations have to adjust their infrastructure needs to be able to deal with the expected maximum peak of required computing power.

The new way – real-time data

Real-time data is processed and made available immediately as it is generated, instead of being loaded in batches overnight. With real-time approaches, the loading window is spread across the full 24 hours, so the overnight peak and its disadvantages disappear.
In Data Vault, real-time data is always modeled as a non-historized link or as a satellite.

Possible use cases for real-time data are vital monitoring in the healthcare industry, inventory tracking, user behavior on social media or production line monitoring.

Different types of real-time data

There are different types of real-time data based on how frequently the data is loaded and the degree of urgency or immediacy of the data.

Near real-time data refers to data that is loaded in mini-batches at least every fifteen minutes, with the data stored in a cache until it is loaded into the data analytics platform.
Actual real-time data, also called message streaming, involves loading every single message directly into the data analytics platform without any cache.
This type of real-time data is useful when it is important to have data available as soon as it is generated for dashboards or further analytics.
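The mini-batch caching described for near real-time data can be sketched in Python: a buffer that flushes either when it reaches a size limit or when the batch grows older than the loading interval (names and thresholds are illustrative):

```python
import time

class MiniBatchLoader:
    # Near real-time: buffer incoming messages and flush them as a mini-batch
    def __init__(self, max_size, max_age_seconds, sink):
        self.buffer = []
        self.max_size = max_size
        self.max_age = max_age_seconds
        self.sink = sink                      # loads a batch into the platform
        self.last_flush = time.monotonic()

    def accept(self, message):
        self.buffer.append(message)
        age = time.monotonic() - self.last_flush
        if len(self.buffer) >= self.max_size or age >= self.max_age:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
        self.buffer = []
        self.last_flush = time.monotonic()

loaded = []
# 900 seconds = the fifteen-minute interval mentioned above
loader = MiniBatchLoader(max_size=3, max_age_seconds=900, sink=loaded.append)
for msg in ["m1", "m2", "m3", "m4"]:
    loader.accept(msg)
print(loaded)  # → [['m1', 'm2', 'm3']]
```

Actual real-time (message streaming) would be the degenerate case `max_size=1`: every message is pushed through without any cache.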

The acceptable processing delay for real-time data is typically defined by the consequences of missing a deadline. Additionally, there are three types of real-time systems: hard real-time, soft real-time, and firm real-time.

Implementing real-time data

So, how do you implement real-time data processing into your data warehouse solution? There are many architectures for that, but we will focus on the Lambda and Data Vault 2.0 architecture.

The lambda architecture separates data processing into a speed layer and a batch layer. The speed layer processes real-time messages with a focus on speed and throughput, while the batch layer provides accuracy and completeness by processing high volumes of data in regular batches. The serving layer integrates data from both layers for presentation purposes.
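The three lambda layers can be sketched in Python: both layers compute the same aggregate, and the serving layer merges the accurate batch view with the low-latency speed view (a simplified sketch; real implementations must handle the overlap between the layers):

```python
def batch_view(batch_records):
    # Batch layer: accurate, complete totals computed from the master dataset
    totals = {}
    for rec in batch_records:
        totals[rec["key"]] = totals.get(rec["key"], 0) + rec["value"]
    return totals

def speed_view(stream_records):
    # Speed layer: low-latency totals for data not yet covered by a batch run
    totals = {}
    for rec in stream_records:
        totals[rec["key"]] = totals.get(rec["key"], 0) + rec["value"]
    return totals

def serving_layer(batch, speed):
    # Serving layer: merge both views for presentation
    merged = dict(batch)
    for key, value in speed.items():
        merged[key] = merged.get(key, 0) + value
    return merged

batch = batch_view([{"key": "sales", "value": 100}])
speed = speed_view([{"key": "sales", "value": 5}])
print(serving_layer(batch, speed))  # → {'sales': 105}
```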

At first, the Data Vault 2.0 architecture seems to be similar to the lambda architecture, but it treats some aspects differently. The lambda architecture has issues from a Data Vault 2.0 perspective, such as implementing a single layer in each data flow and lacking a defined layer for capturing raw, unmodified data for auditing purposes.

The Data Vault 2.0 architecture adds a real-time part called “message streaming” to the existing batch-driven architecture, with multiple layers implemented for capturing and processing real-time data, integrating it with the batch-driven flow at multiple points. Messages are pushed downstream from the publisher to the subscriber, loaded into the Raw Data Vault and forked off into the data lake. But the main process is the push inside the message streaming area. The architecture is able to integrate data from batch feeds or to stream the real-time data directly into the dashboard.

Using Microsoft Azure for real-time data

Microsoft Azure is a cloud computing platform and set of services offered by Microsoft. It provides a variety of services, including virtual machines, databases, analytics, storage, and networking. These services can be used to create web and mobile applications, run large-scale data processing tasks, store and manage data, host websites and much more.

The illustration describes a typical real-time architecture used by Scalefree consultants, which follows the conceptual Data Vault 2.0 architecture.

Data sources deliver data either in batches or real-time. This is loaded into the Azure Data Lake or accepted by the Event Hub beforehand. The Raw Data Vault Loader separates business keys, relationships and descriptive data using Stream Analytics and forwards the message to the Business Vault processor. The Business Vault processor applies transformation and other business rules to produce the target message structure for consumption by the (dashboarding) application. The results can be loaded into physical tables in the Business Vault on Synapse or be delivered in real-time without further materialization in the database. The target message is generated and sent to the real-time information mart layer implemented by a streaming dataset, which is consumed by PowerBI. The cache of the dashboard service will expire quickly, but the Synapse database has all data available for other uses, including strategic, long-term reporting.

Conclusion and Outlook

In conclusion, real-time data processing offers numerous benefits over traditional batch loading methods, including the ability to create value out of data quicker, have the most up-to-date information in reporting tools, and make more accurate decisions. By adapting to changes in the market quicker, companies can stay ahead of the competition. Moving away from batch loading can also save costs by reducing the peak of computing power required.

As mentioned before, the last illustration shows an architecture that the Scalefree Consultants implemented to make use of real-time data.

Read more on our recently released Microsoft Blog Article.

How is your current experience with real-time data processing?
Are you thinking about kick-starting your Data Vault by also using real-time data?
Or are you already using it and thinking about improving it further?

Let us know your thoughts in the comment section!



Quick guide of a Data Vault 2.0 Implementation

By Scalefree Newsletter

It can be a bit overwhelming for beginners to start using Data Vault 2.0, as well as learning how and where to implement it. Data Vault 2.0 is often assumed to be only a modeling technique, but it encompasses much more than that: it is a whole BI solution composed of agile methodology, architecture, implementation, and modeling.

So why start using Data Vault?

  • Data Vault 2.0 allows you to build automated loading processes/patterns and generate models very easily
  • Platform independence
  • Auditability 
  • Scalability
  • Supports ELT instead of ETL processes

Now that we have answered the why, you may be wondering what steps are needed to implement Data Vault 2.0 in your project.

It depends on a lot of factors like your business case, the architecture you want to have in place, how your sources are loaded, the sprint timeline of your project, etc.

Feature by Feature Architecture

One thing is for sure: the architecture should be built vertically, not horizontally. This means not layer by layer but feature by feature. 

A common approach here is the Tracer Bullet approach. Based on business value, which is defined by a report, a dashboard or an information mart, the source data needs to be identified, modeled, and loaded through all layers of the architecture. 

For example, let’s say the business request was to build a dashboard to analyze the company’s sales:

1. Extract

First thing, we need to extract the data from the source systems and load the data as it is somewhere. In this example, we put it in a Transient Staging Area but you could choose a persistent one in a Data Lake as well.

2. Transform

Next, apply hard rules if necessary, using a transformation tool. Be careful here: you do not want to apply business calculations at this stage. There are a lot of different warehouse automation tools to choose from: dbt, Coalesce, WhereScape, etc.

3. Load

Load your Raw Stage into the Raw Vault.

4. Model Business requirements

Model the Data Vault entities needed for the business requirement to be fulfilled. If we have some Sales transactions and customers data, for example, we will model a Non-historized Link, also known as Transactional Link, and a Customer Hub, along with any additional Satellites for holding the Customer descriptive data that we want to see in the Sales Dashboard in the end.
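The hash-key calculation behind these entities can be sketched in Python. MD5 over concatenated, normalized business keys is a common Data Vault choice; the delimiter and normalization rules below are illustrative:

```python
import hashlib

def hash_key(*business_keys):
    # Hashkey: MD5 over the delimited, normalized business key(s)
    composite = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(composite.encode()).hexdigest()

customer = {"customer_id": "C-1001", "name": "Alice"}
sale = {"sale_id": "S-9", "customer_id": "C-1001", "amount": 49.99}

# Customer Hub: hashkey from the single business key
customer_hk = hash_key(customer["customer_id"])
# Non-historized (Transactional) Link: hashkey from all referenced keys
sale_link_hk = hash_key(sale["sale_id"], sale["customer_id"])

# Normalization makes differently formatted source values hash identically
print(customer_hk == hash_key("c-1001 "))  # → True
```

The Satellite rows would carry `customer_hk` as their parent key plus the descriptive attributes, so all entities join on the same deterministic value.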

5. Apply Business Logic

Next, we need some calculations and aggregations to be performed, so we will build business logic on top of the raw entities and load the results into the Business Vault.

6. Build an Information Mart

Now, we could feed the data stored in the Raw and Business Vault directly into charts and dashboards. But we want to structure the data so it can be easily read and fetched by business users, so we will build an information mart with a star schema model consisting of a fact table and dimensions.
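A minimal sketch in Python of deriving such a star schema: a customer dimension with surrogate keys and a fact table referencing it (table and column structures are illustrative):

```python
def build_star_schema(sales, customers):
    # Dimension: one row per customer, keyed by a surrogate key
    dim_customer = {}
    for i, cust in enumerate(customers, start=1):
        dim_customer[cust["customer_id"]] = {"customer_sk": i, **cust}
    # Fact: one row per sale, referencing the dimension by surrogate key
    fact_sales = [
        {"customer_sk": dim_customer[s["customer_id"]]["customer_sk"],
         "amount": s["amount"]}
        for s in sales
    ]
    return list(dim_customer.values()), fact_sales

dims, facts = build_star_schema(
    sales=[{"customer_id": "C-1", "amount": 10.0}],
    customers=[{"customer_id": "C-1", "name": "Alice"}],
)
print(facts[0]["customer_sk"])  # → 1
```

In practice the mart is often virtualized as views on top of the Business Vault rather than materialized like this.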

7. Visualize Data

To build the Sales Dashboard in a BI visualization tool like PowerBI or Tableau, we now fetch directly from the star schema in the information mart, which has all the information we need, using a connection to the data warehouse database.


Data Vault 2.0 offers an agile, scalable, and flexible approach to Data Warehousing Automation. As demonstrated in the example, we only modeled the Data Vault tables necessary for the task at hand: building a Sales dashboard. This way you can scale up your business by demand, so you don't have to figure out and map out the whole enterprise in one go.

The answer to how to implement Data Vault 2.0 can be translated into a simple phrase: Focus on business value!

If you would like to see an explanation of this step by step implementation with some demonstration of actual data using dbt as the chosen transformation tool, check out the webinar recording.



Bring your Data Vault automation to the next level – with Datavault4Coalesce

By Scalefree Newsletter


In last September’s webinar, we explored a new competitor in the highly contested market of data warehouse automation tools: Coalesce!


And everyone who watched that webinar might remember that at the end, we announced an even closer relationship between Coalesce and Scalefree and a commitment to bring Scalefree’s best practices into Coalesce!

For those who didn’t watch the webinar, you can find it here.

Recap: What is Coalesce?

Coalesce is a Data Transformation solution made for Snowflake. When working with Coalesce, you build directed acyclic graphs (DAGs) which contain nodes. A node represents any physical database object, like tables or views, or even stages and external tables.

Coalesce itself is built around metadata that stores table and column level information which describes the structure inside your data warehouse. This metadata-focused design enables a couple of features which strongly drive towards scalability and manageability. 

All the metadata allows a team to track past, current and desired states of the data warehouse by deeply integrating it and all the workflows that it brings. Additionally, users can define standardized patterns and templates on column and table level.

How can Data Vault jump in here?

These templates open up the gate to implementing Data Vault 2.0 related patterns and best practices. Especially on table level, it quickly comes to mind that you could build a template for a Hub or a Link.

On column level, this could be a repeated transformation which is then managed in only one so-called macro, which makes it very easy to implement changes with low to zero impact. You could think of hashkey calculation or virtual load-end-dating here.

And that is exactly what we at Scalefree have done since last year's webinar. Lead developers from Coalesce sat together with Data Vault experts and developers from Scalefree with one goal: create something amazing that helps users automate their Data Vault 2.0 implementation!

How fast can I build a Data Vault? Yes!

This cooperation created Datavault4Coalesce, an open source extension package for Coalesce, which will be available on March 16th! Let’s have a sneak peek at a selection of what users can do with Datavault4Coalesce.

The first release of Datavault4Coalesce will feature a basic set of Data Vault 2.0 entities:

  • Stages
  • Hubs
  • Links
  • Satellites
  • PITs
  • Snapshot Control Tables

While providing DDL and DML templates for the entity types mentioned above, Datavault4Coalesce makes use of Coalesce’s ability to define the UI for each node type. For stages, this means that users can decide if they want Datavault4Coalesce to automatically generate ghost records or not, as shown in the screenshot below:

This Data Vault related interface can be found across all node types and allows users to conveniently adjust Datavault4Coalesce to fit their requirements!

How can I start?

First of all, a bummer: Datavault4Coalesce will only be available starting from the 16th of March. But there is no reason to wait that long to dive into Coalesce itself! Since Coalesce is now part of Snowflake Partner Connect, it's never been easier to get your hands on a fresh Coalesce environment!

Just sign up for a free Snowflake trial here and initialize your Coalesce experience within seconds by accessing the Partner Connect portal! Then, you just have to load any desired data into it and you can start building your data pipelines with Coalesce. And when the 16th of March finally arrives, you just have to add Datavault4Coalesce to your Coalesce environment – and now you can start to build Data Vault faster than ever!

Also, don’t miss out our next webinar which will be held at the day of launch, the 16th of March, where we will show you Datavault4Coalesce in action. Sign up here!



Speed up Your Data Vault 2.0 Implementation – with TurboVault4dbt

By Scalefree Newsletter


Scalefree has released TurboVault4dbt, an open source package that provides automated model generation using DataVault4dbt-compatible templates based on your sources’ metadata.

TurboVault4dbt currently supports metadata input from Excel, GoogleSheets, BigQuery and Snowflake and helps your business with:

  • Speeding up the development process, reducing development costs and producing faster results
  • Encouraging users to analyze and understand their source data

‘Isn’t every model kind of the same?’

Datavault4dbt is the result of years of experience in creating and loading Data Vault 2.0 solutions forged into a fully auditable solution for your Data Vault 2.0 powered Data Warehouse using dbt.

But every developer who has worked with the package or has created dbt-models for the Raw Vault must have come across one nuisance:
Creating a new dbt-model for a table basically means taking the already existing template and providing it with specific metadata for that table. Doing this over and over again can be quite a chore, which is why we created TurboVault4dbt to automate and speed up this process.


How many times has everyone pressed CTRL+C then CTRL+V and corrected a few lines of code when creating new dbt-models for the raw vault?

Instead of trying to figure out what the names of your tables and business keys are or what hashing order you want your Hashkey to be generated in, TurboVault4dbt will do all of that for you. All TurboVault4dbt needs is a metadata input where you capture the structure of your data warehouse.

TurboVault4dbt currently requires a structure of five metadata tables:

  • Hub Entities: This table stores metadata information about your Hubs,
    e.g. (Hub Name, Business Keys, Column Sort Order for Hashing, etc.)
  • Link Entities: This table stores metadata information about your Links,
    e.g. (Link Name, Referenced Hubs, Pre-Join Columns, etc.)
  • Hub Satellites: This table stores metadata information about your Hub Satellites,
    e.g. (Satellite Name, Referenced Hub, Column Definition, etc.)
  • Link Satellites: This table stores metadata information about your Link Satellites,
    e.g. (Satellite Name, Referenced Link, Column Definition, etc.)
  • Source Data: This table stores metadata information about your Sources,
    e.g. (Source System, Source Object, Source Schema, etc.)

By capturing the metadata in the five tables above, TurboVault4dbt can extract the necessary information and generate every model based on a selected source, while also encouraging you, as a user, to analyze and understand your data.
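The idea of metadata-driven model generation can be sketched in Python: a hypothetical template filled from one "Hub Entities" row. The template below is illustrative and is not the actual DataVault4dbt macro signature:

```python
# Hypothetical metadata row, mirroring one entry of the "Hub Entities" table
hub_metadata = {
    "hub_name": "customer_hub",
    "business_keys": ["customer_id", "source_system"],
    "source_model": "stage_customer",
}

# Illustrative dbt-style model template; doubled braces render literal {{ }}
TEMPLATE = """{{{{ hub(
    hashkey='hk_{hub_name}',
    business_keys={business_keys},
    source_models='{source_model}'
) }}}}"""

def generate_hub_model(meta):
    # Fill the template with the metadata of one hub
    return TEMPLATE.format(**meta)

print(generate_hub_model(hub_metadata))
```

Generating every Raw Vault model then reduces to looping over the metadata tables instead of copy-pasting and hand-editing model files.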

Lean back, relax and let the machine take over!

Create and fill your metadata tables, connect them to TurboVault4dbt and enjoy your free time for another cup of coffee. Give it a try, or give us your feedback by visiting TurboVault4dbt on GitHub!

Stay updated on TurboVault4dbt through our marketing channels as great features lie ahead!




By Scalefree Newsletter


In the Data Vault 2.0 architecture, information marts are used to deliver information to the end-users.
Conceptually, an information mart follows the same definition as a data mart in legacy data warehousing. However, its purpose is to deliver useful information, not raw data, which is why the data mart has been renamed in Data Vault 2.0 to better reflect the use case.

“Classical” information marts

But the definition of information marts has more facets. In the book “Building a Scalable Data Warehouse with Data Vault 2.0” we present three types of marts:

  • Information marts: used to deliver information to business users, typically via dashboards and reports.
  • Metrics Mart: used in conjunction with a Metrics Vault, which captures EDW log data in a Data Vault model. The Metrics Mart is derived from the Metrics Vault to present the metrics, in order to analyze performance bottlenecks or the resource consumption of power users and data scientists in managed self-service BI solutions.
  • Error Mart: stores those records that typically fail a hard rule when loading the data into the enterprise data warehouse.

Additional information marts

In addition to these “classical” information marts, we use additional ones in our consulting practice:

  • Interface Mart: this is more or less an information mart, however, the information is not delivered to a human being, e.g. via a dashboard or report. Instead, it is delivered to a subsequent application, or as a write-back, to the source system (for example when using the enterprise data warehouse for data cleansing).
  • Quality Mart: the quality mart is again an information mart, but instead of cleansing bad data, it is used to report bad data. Essentially, it turns the business logic used to cleanse bad data upside down: only bad data, in addition to ugly data at times, is delivered to the end-user, the data steward. This is often done in conjunction with data cleansing frontends where the data steward can either correct source data or comment and tag the exceptions.
  • Source Mart: again an information mart, but this time not using one of the popular schemas, such as star schemas, snowflake schemas or fully denormalized schemas. Instead, the information mart uses the data model of the source application, similar to an operational data store (ODS) schema. However, the Source Mart is not a copy of the data, it is a virtualized model on top of the Data Vault model, reflecting the original structures. It’s great for ad-hoc reporting and provides great value for many data scientists as well as power users.

This concludes our list of information marts. We have used them successfully in projects for our clients to better communicate the actual application of the information marts in their organization.

Let us know in the comments if you think this is helpful for you, too!



Kick-start your Data Vault 2.0 Implementation – with datavault4dbt

By Scalefree Newsletter


Scalefree has released datavault4dbt, an open source package that provides best-practice loading templates for Data Vault 2.0 entities, embedded into the open source data warehouse automation tool dbt.

Datavault4dbt currently supports Snowflake, BigQuery and Exasol and comes with a lot of great features:

  • A Data Vault 2.0 implementation congruent to the original Data Vault 2.0 definition by Dan Linstedt
  • Ready for both Persistent Staging Areas and Transient Staging Areas, since all macros accept multiple deltas without losing any intermediate changes
  • Creating a centralized, snapshot-based Business interface by using a centralized snapshot table supporting logarithmic logic
  • Optimizing incremental loads by implementing a high-water-mark that also works for entities that are loaded from multiple sources

Building a Data Vault 2.0 Solution – made easy

The overall goal of releasing Data Vault 2.0 templates for dbt is to combine our years of experience in creating and loading Data Vault 2.0 solutions into publicly available loading patterns and best practices for everyone to use. Out of this ambition, datavault4dbt, an open source package for dbt was created and will be maintained by the Scalefree expert team. 

The most valuable characteristic of datavault4dbt is that it adheres to the original Data Vault 2.0 definition by Dan Linstedt. It represents a fully auditable solution for your Data Vault 2.0 powered Data Warehouse. With a straightforward, standardized approach, it enables the team to conduct agile development cycles.

By allowing multiple increments per batch while loading each Data Vault entity type, datavault4dbt supports both Persistent and Transient Staging Areas without losing any intermediate changes. These incremental loads are further optimized by a dynamic high-water mark that also works when loading an entity from multiple sources.
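The high-water-mark idea can be sketched in Python: per source, only records with a load date above the latest one already loaded from that source are inserted (field names are illustrative):

```python
def incremental_load(target, source_batches):
    # High-water mark per source: only records newer than the last loaded
    # load date (ldts) of that source are appended
    for source, records in source_batches.items():
        hwm = max((r["ldts"] for r in target if r["source"] == source),
                  default="")
        for rec in sorted(records, key=lambda r: r["ldts"]):
            if rec["ldts"] > hwm:
                target.append({"source": source, **rec})
    return target

# One record already loaded; the re-delivered duplicate is skipped
target = [{"source": "crm", "ldts": "2023-01-01", "id": 1}]
target = incremental_load(target, {
    "crm": [{"ldts": "2023-01-01", "id": 1},
            {"ldts": "2023-01-02", "id": 2}],
})
print(len(target))  # → 2
```

Tracking the mark per source is what keeps one slow-delivering system from blocking or duplicating the loads of the others.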

Additionally, datavault4dbt encourages strict naming conventions and standards by implementing a variety of global variables that span across all Data Vault layers and supported Databases. The process of end-dating data is completely virtualized to ensure a modern insert-only approach that avoids updating data.
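Virtual end-dating can be sketched in Python: an insert-only satellite where the load-end date is derived at read time from the load date of the next record per hashkey (column names are illustrative):

```python
def end_dated_view(satellite_rows):
    # Insert-only: no UPDATE ever touches the satellite; the load-end date
    # is computed at query time as the next record's load date per hashkey
    by_key = {}
    for row in sorted(satellite_rows, key=lambda r: (r["hk"], r["ldts"])):
        by_key.setdefault(row["hk"], []).append(row)
    result = []
    for records in by_key.values():
        for current, nxt in zip(records, records[1:] + [None]):
            end = nxt["ldts"] if nxt else "9999-12-31"   # open-ended record
            result.append({**current, "load_end_date": end})
    return result

rows = end_dated_view([
    {"hk": "a", "ldts": "2023-01-01", "value": "old"},
    {"hk": "a", "ldts": "2023-02-01", "value": "new"},
])
print(rows[0]["load_end_date"])  # → 2023-02-01
```

In the database this would be a window function (e.g. `LEAD` over the load date) inside the version 1 view, leaving the underlying table strictly append-only.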

With all these features, datavault4dbt is the perfect solution for your modern Big Data Enterprise Data Warehouse.

From the Stage over the Spine into the PITs

To achieve all this, we worked hard on creating a solid and universal staging area. All hashkeys and hashdiffs are calculated here, and users are given the option to add derived columns, generate prejoins with other stages and add ghost records to their data. All of this is highly automated, based on parameterized user input.

Based on staging areas, the Data Vault 2.0 spine can be created. Hubs, Links and Non-Historized Links can be loaded from multiple sources including mapping options to ensure business harmonization. 

This spine is then enriched by Standard Satellites, Non-Historized Satellites, Multi-Active Satellites and/or Record-Tracking Satellites. All of those that require it come with a version 0 for tables and a version 1 for end-dated views. 

Based on the Raw Data Vault, PITs can be created automatically, and their loading is backed by an automated, highly-configurable but optional logarithmic snapshot logic. This logic is included in the Control Snapshot Table, which also comes in two consecutive versions. To wrap the logarithmic snapshot logic up, a post-hook for cleaning up all PITs is included and comes in handy.
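Building a PIT from snapshot dates can be sketched in Python: for each snapshot date, pick per satellite the latest load date that is not after it, falling back to a ghost-record date (a simplified sketch; the logarithmic selection of which snapshots to keep is omitted):

```python
def build_pit(snapshot_dates, satellite_loads):
    # For each snapshot date, resolve per satellite the latest load date
    # not after the snapshot; the ghost date covers "no record yet"
    GHOST = "0001-01-01"
    pit = []
    for sdts in snapshot_dates:
        entry = {"sdts": sdts}
        for sat, load_dates in satellite_loads.items():
            valid = [d for d in load_dates if d <= sdts]
            entry[sat] = max(valid) if valid else GHOST
        pit.append(entry)
    return pit

pit = build_pit(
    snapshot_dates=["2023-01-01", "2023-02-01"],
    satellite_loads={"customer_sat": ["2022-12-15", "2023-01-20"]},
)
print(pit[1]["customer_sat"])  # → 2023-01-20
```

Queries against the mart can then equi-join the PIT to each satellite instead of re-running the "latest record as of" logic per query.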


Start now and boost your Data Vault experience!

The lines above made you think “Nah, that’s all too good to be true!”? Convince yourself, or give us your highly appreciated feedback by visiting datavault4dbt on Github!

Of course, our future ambitions for datavault4dbt are high and next on our list are a lot of important topics, like:

  • Provide a detailed working example of datavault4dbt
  • Extend and migrate the existing documentation of the package
  • Support more and more databases
  • Add more advanced and specific Data Vault 2.0 entities
  • Develop automated Data Vault related tests
  • Review and implement user feedback and suggestions

Stay tuned for more datavault4dbt content on all our marketing channels!



Data Vault 2.0: Best of Breed from Data Warehousing and Data Lakes

By Scalefree Newsletter

There are two competing approaches to data analytics available in the industry, and most professionals tend toward one or the other as their preferred choice: the data warehouse or the data lake. This article sheds light on the differences between both approaches and how Data Vault 2.0 provides a best-of-breed solution that integrates the advantages of both into a unified concept.

About Data Warehousing

Data Warehousing is the traditional enterprise solution for providing reliable information to decision makers at every level of the organization. Data warehouse solutions (but also data lakes) are based on a data model, which is traditionally defined on the basis of either the information requirement in a bottom-up approach or in a top-down approach based on an integrated enterprise information model.

In any case, the traditional data warehouse is based on a concept called “schema-on-write” where the data model is established when loading the data into the data warehouse. This often leads to non-agile data processing, as this data model often requires modifications in order to cope with changes in the business.

About Data Lakes

Data lakes, on the other hand, are based on the “schema-on-read” concept. Instead of modeling the enterprise or fitting the incoming dataset into a target information model, the data is first and foremost stored on the data lake as delivered without any modelling applied. 

While traditional data warehousing often leads to overmodeling and non-agile data analytics, the data lake approach often leads to the direct opposite: to unmanaged data and inconsistent information results.

The Best of Breed

Both approaches sit at opposite ends of the data analytics spectrum and have been used over the years with mixed results. With the emergence of the Data Vault 2.0 concept, a third option is available to industry professionals for building data analytics platforms.

Data Vault 2.0 is a best of breed between traditional data warehousing and data lakes: for example, there is a data model to manage the data and business logic, as in traditional data warehousing, but it follows a schema-on-read approach, as in data lakes.

The Data Vault 2.0 architecture comprises multiple layers:

Image of the structure from Data Vault

The first layer is the staging area: it is used to extract the data from the source systems. The next layer is the Raw Data Vault. Like the staging layer, this layer is still functionally oriented, but it integrates and versionizes the data. To achieve this, the incoming source data model is broken down into smaller components: business keys (stored in hubs), relationships between business keys (stored in links) and descriptive data (captured by satellites).
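To make that decomposition concrete, here is a minimal sketch of the three entity types in generic SQL. All names and data types are illustrative, not a prescribed standard:

```sql
-- Hub: one row per distinct business key
CREATE TABLE hub_customer (
    hk_customer   CHAR(32)     NOT NULL PRIMARY KEY,  -- hash key
    customer_no   VARCHAR(20)  NOT NULL,              -- business key
    ldts          TIMESTAMP    NOT NULL,              -- load date timestamp
    rsrc          VARCHAR(100) NOT NULL               -- record source
);

-- Link: one row per distinct relationship between business keys
CREATE TABLE link_customer_order (
    hk_customer_order CHAR(32)     NOT NULL PRIMARY KEY,
    hk_customer       CHAR(32)     NOT NULL REFERENCES hub_customer,
    hk_order          CHAR(32)     NOT NULL,
    ldts              TIMESTAMP    NOT NULL,
    rsrc              VARCHAR(100) NOT NULL
);

-- Satellite: descriptive attributes, historized by load date
CREATE TABLE sat_customer (
    hk_customer   CHAR(32)     NOT NULL,
    ldts          TIMESTAMP    NOT NULL,
    rsrc          VARCHAR(100) NOT NULL,
    hashdiff      CHAR(32)     NOT NULL,  -- hash over attributes, for change detection
    name          VARCHAR(100),
    city          VARCHAR(100),
    PRIMARY KEY (hk_customer, ldts)
);
```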


The Business Vault is the next layer, but it is only sparsely modelled: a Business Vault entity is put in place only where business logic is required to deliver useful information. The Business Vault bridges the gap between the target information model (as in the next layer) and the actual raw data. Often, the raw data doesn’t meet the expectations of the business regarding data quality, completeness, or content and thus needs to be adjusted. Business logic is used to fill the gap.

The final layer is the information mart layer where the information model is produced to deliver the final information in the desired format, e.g., a dimensional star schema. This model is used by the business user either directly in ad-hoc queries or using business intelligence tools such as dashboarding or reporting software.

The layers up to and including the Raw Data Vault are still functionally oriented because the model is still derived either directly from the source system (as in the staging area) or by breaking down the incoming data model into smaller, normalized components (as in the Raw Data Vault). The target schema is only applied in the last layer, the information mart layer. This is where the desired information model is applied. Because the information mart is often virtualized using SQL views, the target schema is actually applied at query time. Queries against the view layer are merged with the SQL statements inside the views and run against the materialized tables in the Raw Data Vault, the actual data. Therefore, the schema-on-read concept is used in Data Vault 2.0.
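A virtualized dimension in the information mart can then be as simple as a view over Raw Data Vault tables. The following is a generic SQL sketch with illustrative names, not a fixed pattern:

```sql
-- Schema-on-read: the dimensional shape is applied at query time by a view.
CREATE VIEW dim_customer AS
SELECT
    h.hk_customer AS customer_key,
    h.customer_no,
    s.name,
    s.city
FROM hub_customer h
JOIN sat_customer s
  ON s.hk_customer = h.hk_customer
-- pick the most recent satellite row per business key
WHERE s.ldts = (SELECT MAX(s2.ldts)
                FROM sat_customer s2
                WHERE s2.hk_customer = s.hk_customer);
```

A query against `dim_customer` is merged with this view definition and executed against the materialized hub and satellite tables.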

Data Vault 2.0 also preserves the agility: the concept has demonstrated in many projects that it is easy to extend over time when either the source system structures change, the business rules change or the information models need to be adjusted. In addition, it is easy to add new data sources, additional business logic and additional information artifacts to the data warehouse. 

On top of that, the Data Vault 2.0 model is typically integrated with a data lake: the above diagram shows the use of a data lake for staging purposes, which is the recommended “hybrid architecture” for new projects at Scalefree. But the data lake can also be used to capture semi-structured or unstructured data for the enterprise data warehouse or to deliver unstructured information marts.

With all that in mind, the Data Vault 2.0 concept has established itself as a best-of-breed approach between the traditional data warehouse and the data lake. Organizations of all sizes use it to build data analytics platforms that deliver useful information to their decision makers.


-by Michael Olschimke (Scalefree)



Data Vault 2.0 with dbt – Part 2

By Scalefree Newsletter No Comments

In the first part of this blog series we introduced you to dbt. Now let’s take a look at how you can implement Data Vault 2.0 with dbt and what advantages this offers. If you don’t know the first part yet, you can read it here.

dbt provides the ability to create models and generate and execute SQL dynamically out of these models. This allows you to write your data transformations in models using SQL and reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way. However, the most important part for the Data Vault use case is the ability to define and use those macros.

But first we should clarify how models work in dbt.

dbt handles the compilation and execution of models written using SQL and the Jinja macro language. Each model consists of exactly one SQL SELECT statement. Jinja code is translated into SQL during compilation.

This figure shows a simple model in dbt. A big advantage of Jinja is the ability to generate SQL programmatically, for example with loops and conditions. Also, through the ref() function, dbt understands the relationships between models and builds a dependency graph. This ensures that the models are built in the correct order and that the data lineage is documented.
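A comparable minimal dbt model, combining SQL, a Jinja loop and ref(), might look like this. The model and column names are illustrative only:

```sql
-- models/order_payments.sql (illustrative)
{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}

SELECT
    order_id,
    {% for method in payment_methods %}
    SUM(CASE WHEN payment_method = '{{ method }}'
             THEN amount ELSE 0 END) AS {{ method }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
FROM {{ ref('stg_payments') }}  -- dbt resolves this and records the dependency
GROUP BY order_id
```

The loop is expanded into one SUM expression per payment method at compile time, and the ref() call is what places this model after `stg_payments` in the dependency graph.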

A lineage graph may look like this:

The materialization of models can be defined at different configuration levels. This allows fast prototyping with views and switching to materialized tables if necessary for performance reasons.

Data Vault 2.0 and macros

But how can Data Vault 2.0 be implemented in dbt? The most important capability for implementing Data Vault 2.0 is the ability to define and use macros. Macros can be called in models and then generate additional SQL, or even the entire SQL statement of the model, at compile time.

For example, you could create a macro to generate a hub that receives the source/staging model as an input parameter, as well as the columns for the business key(s), load date and record source. The SQL code for the hub is then generated dynamically from this metadata. The advantage is that a change to the macro directly affects every individual hub, which greatly improves maintainability.
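A heavily simplified version of such a hub macro could look like the following. This is a sketch of the idea, not the actual code of any existing package; real implementations additionally deduplicate per business key and keep the earliest load date:

```sql
-- macros/hub.sql (simplified sketch)
{% macro hub(src_pk, src_nk, src_ldts, src_source, source_model) %}

SELECT DISTINCT
    {{ src_pk }},
    {{ src_nk }},
    {{ src_ldts }},
    {{ src_source }}
FROM {{ ref(source_model) }}
{% if is_incremental() %}
-- only insert business keys that are not in the hub yet
WHERE {{ src_pk }} NOT IN (SELECT {{ src_pk }} FROM {{ this }})
{% endif %}

{% endmacro %}
```

`is_incremental()` and `{{ this }}` are standard dbt constructs: the filter is only added when the hub table already exists and is being loaded incrementally.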

At this point, you also gain huge benefits from the active open-source community around dbt. There are many open-source packages with which dbt can be extended.

There are also already some packages well suited to implementing Data Vault 2.0.

We at Scalefree are currently developing an open-source dbt package that provides macros to translate a Data Vault model “on paper” into actual tables and views such as Hubs, Links, Satellites and more. Subscribe to our newsletter to get notified as soon as it’s available.

The only thing you need in your model, for example for a hub, is just one macro call:


hub(src_pk, src_nk, src_ldts, src_source, source_model) 


With the parameters of the macro call you define the source table where the columns can be found (source_model) and the column names for the hash-key (src_pk), business key(s) (src_nk), load date (src_ldts) and the record source (src_source) column. When the model and the macro(s) defined in the model are executed, the SQL gets compiled and processed on the database system.

The metadata needed can for example be defined in variables with jinja directly in the model:
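A model following that pattern might look like this. The metadata values and model names are illustrative, not the interface of any specific package:

```sql
-- models/raw_vault/hub_customer.sql (illustrative)
{{ config(materialized='incremental') }}

{%- set source_model = 'stg_customer' -%}
{%- set src_pk       = 'hk_customer'  -%}
{%- set src_nk       = 'customer_no'  -%}
{%- set src_ldts     = 'ldts'         -%}
{%- set src_source   = 'rsrc'         -%}

{{ hub(src_pk, src_nk, src_ldts, src_source, source_model) }}
```

All Data Vault metadata for the hub lives in a handful of Jinja variables; the macro call at the bottom generates the entire SELECT statement.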

You can also see that dbt provides different options for materialization. The incremental materialization, for example, loads an entity as a table on an incremental basis.

When the model is executed, dbt generates the whole SQL out of the macro and decides how the records are loaded. If the hub table does not exist yet, it is created and all records are loaded; if it already exists, the table is loaded incrementally.

Anyone who has tried, or managed, to implement a Data Vault with “vanilla” SQL will realize that this is a real game-changer. The team can now focus entirely on the Data Vault design itself. Once the metadata is identified, dbt along with your macros can take care of the entire logic.

Openly available packages can add basic Data Vault 2.0 principles to dbt and therefore allow users to dive quickly into a Data Vault implementation. dbt’s general openness allows you to adjust all macros, for example to your company- or project-specific Data Vault flavor, to meet your technical and business requirements.

However, it is important to mention that existing dbt packages for Data Vault 2.0 do not completely fulfill the Data Vault 2.0 standards and deviate from them in detail. Currently, we at Scalefree are working on an open-source dbt package that provides all important Data Vault 2.0 entities, following the latest standards and best practices, and it will be released soon. But these details are worth another blog post!


-by Ole Bause (Scalefree)




What’s new in Data Vault?

By Scalefree Newsletter 2 Comments


Attendees of our Data Vault 2.0 trainings often ask us what the difference is between the book “Building a Scalable Data Warehouse with Data Vault 2.0” and our current consulting practice. To give a definitive answer, at least for now, we have written this article.

Data Vault 3.0?

First of all, there is no Data Vault 3.0. We like to refer to the version of Data Vault that we teach and apply in our consulting practice as “Data Vault 2.0.1”. Accordingly, there are only slight differences from the version we used in the book. Some of these changes look big, but they don’t really modify the underlying concept, and that is what matters for version changes. So, regarding the concept, we only see minor changes or enhancements. However, they might have a larger impact on the actual implementation, but that is only due to technology changes.


How to Fit Data Vault 2.0 Between Two Covers

On top of the minor changes to the concept, there is another factor at play: when writing the book, we had to scope it to fit between the covers, as the publisher had a page limit (which we actually exceeded a bit). Therefore, we did not apply all Data Vault 2.0 concepts in the book: for example, real-time concepts are not covered, and we did not use a data lake or cloud computing in our examples. Instead, we only briefly covered these concepts for completeness and focused on on-premise technology, which was more common in actual projects back then. In 2012, cloud computing was already available and widely used, but it was easier to reach more readers by using on-premise technologies.


With that in mind, what has changed since the book came out? 


Removal of Load End Date


The most obvious change is the removal of the Load End Date. Well, in our actual projects, we don’t completely remove it from the Data Vault 2.0 model, we just virtualize it by removing the Load End Date from the underlying satellite table and virtually calculating it in a view on top of the satellite table using a window function (typically LEAD, but LAG is also possible). This way we can get rid of the update procedure to maintain the Load End Date (the so-called end-dating process) while preserving the query patterns downstream. Keep in mind that it might also be more efficient to use the window function when loading PIT tables or satellites, and therefore the query layer is actually only for power users when they query the Data Vault model directly. 
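The virtualized Load End Date described above can be sketched as follows, in generic SQL over an illustrative satellite table:

```sql
-- View on top of the satellite: the Load End Date is computed at query
-- time instead of being maintained by an end-dating update process.
CREATE VIEW sat_customer_v AS
SELECT
    hk_customer,
    ldts,
    -- the next load date of the same key becomes this record's end date;
    -- the current record gets the "end of all times"
    COALESCE(
        LEAD(ldts) OVER (PARTITION BY hk_customer ORDER BY ldts),
        TIMESTAMP '9999-12-31 23:59:59'
    ) AS ledts,
    hashdiff,
    name,
    city
FROM sat_customer;
```

Downstream query patterns that relied on the Load End Date column can keep using `ledts` from the view, while the satellite table itself stays insert-only.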


That’s actually not new: even in 2012 (and before) we used these approaches, but they didn’t work on SQL Server 2014, which was used in the book, because the analytical window function required for this approach was way too slow. However, this improved with SQL Server 2016. To get rid of the Load End Date in scenarios where the window function is too slow or simply doesn’t exist, another solution has emerged: the use of a snapshot partition in the PIT table. The end of all times is used in the PIT table to refer to the latest delta in each satellite for the delta check. Once the PIT table is available, it can also be used to produce SCD Type 1 dimensions (without history), and therefore the need for the Load End Date (or a fast replacement as described above) only exists during loading.


Hybrid Architecture


The next obvious change is the use of a hybrid architecture where a data lake is used for staging purposes. In the book, we focused on the on-premise Microsoft SQL Server stack that didn’t include a distributed file system. We already recommended to clients to use a data lake for staging purposes, in an architecture we called the “hybrid architecture.” Back then, only a few followed the advice, but today most clients use the hybrid architecture for their data warehouse. We actually consider a relational staging area an anti-pattern and do not recommend it anymore to clients (with some exceptions). 


Multi Temporal Data Vault 2.0


The book describes a temporal PIT (TPIT) for building multi-temporal solutions. While the pattern is still valid and relatively flexible compared to other solutions, today we typically move the business timelines into the dimensional model for highest flexibility. This is discussed in more detail in the Multi-Temporal Data Vault 2.0 class. TPITs are less flexible but have higher performance.


Snapshot Date

Satellites in the Business Vault might use a Snapshot Date Timestamp instead of a Load Date Timestamp. The idea is that, as a general practice, a Business Vault entity should always reuse an existing granularity; this makes it much easier to develop Business Vault solutions. With the snapshot date in the primary key of the satellite, applying business rules on the outgoing information granularity becomes much simpler. This is discussed in more detail in the Data Vault 2.0 Information Delivery class.


A Confession


And finally, we also made some mistakes. We are willing (if not happy) to admit them on our errata page for the book. The link comes directly from our internal wiki, and we promise to keep it updated (however, we haven’t received additional reports lately).



Data Vault 2.0 with dbt – Part 1

By Scalefree Newsletter 2 Comments

Data is an important asset in the decision making process. As we have previously discussed in another post, Data Vault 2.0 is the right choice when the goal of an enterprise data warehouse is to have fully historized and integrated data. Additionally, it is also better suited to instances in which data from many source systems needs to be combined. You can find the previous blog post here.

While Data Vault 2.0 focuses on the “what”, there are many options for the “how” of technically translating a Data Vault model into physical tables and views in the enterprise data warehouse, as well as for orchestrating and executing the loading procedures. And this is where dbt comes in.



The data build tool (dbt) transforms your data directly in your data warehouse; dbt is the “T” in ELT. It therefore assumes that the data is already loaded into a database that the current database can query. In contrast, ETL extracts data, transforms it, and then loads it into the target. With ELT, the data is not transformed on its way into the data warehouse. (By “transform” we mean transformation by soft business rules, which would change the meaning of the data. Of course, we still have to make sure that the data fits into the target table (data type adjustments etc.); those are “hard rules”.)
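The distinction between hard and soft rules can be made concrete with a small staging query. The table and column names are illustrative only:

```sql
-- Staging: hard rules only, so the meaning of the data is unchanged.
SELECT
    CAST(customer_no AS VARCHAR(20))   AS customer_no,
    CAST(order_total AS DECIMAL(18,2)) AS order_total,  -- data type alignment
    CAST(order_ts    AS TIMESTAMP)     AS order_ts,
    'crm_system'                       AS rsrc,          -- record source
    CURRENT_TIMESTAMP                  AS ldts           -- load date
FROM raw_orders;
-- A soft rule (e.g. net revenue = order_total minus discounts) would
-- change the meaning of the data and belongs downstream instead.
```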
dbt is particularly useful with cloud DWH solutions such as Snowflake, Azure Synapse Analytics, BigQuery and Redshift, and it performs transformations and modeling directly on the database in order to take advantage of the performance of these highly scalable platforms.


Models and SQL statements can be easily created, tested and managed in dbt itself. A powerful combination of the scripting language Jinja2 and the all-time classic SQL allows users to build models. The simple interface enables data analysts without engineering know-how to initiate appropriate transformations, making data team workflows more efficient and cost-effective. Behind dbt sits an open-source community that is constantly and passionately developing the tool. As such, dbt is available both as a free, reduced core version and as a comprehensive and flexible cloud version.

What is the place of dbt in Data Vault 2.0?

Part of the Data Vault 2.0 methodology is the model which focuses on how to design the core data warehouse with a scalable solution. The core Data Vault entities are hubs, links and satellites. That said, dbt provides the ability to generate Data Vault models and also allows you to write your data transformations using SQL and code-reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way.

Application Areas of dbt in an Enterprise BI Solution

dbt does not reinvent the world, but when it comes to building a new EDW, especially in the cloud, dbt provides a very helpful basic framework with many important functions for continuous integration and deployment already defined. dbt brings the standards of software development into the world of data transformation. This allows developers to concentrate on the core tasks of data modeling and business logic. Especially, but not only, for smaller projects, dbt offers a lightweight and extremely affordable alternative to other data warehouse automation solutions.

– by Ole Bause (Scalefree)



Agile development in data warehousing with Data Vault 2.0

By Scalefree Newsletter No Comments

Initial Situation:

A common issue in data warehousing projects is that a clear scope is often missing, and many processes, such as controlled access, GDPR handling, auditability, documentation and infrastructure, are not optimized. Additionally, data warehouse projects that do have a scope often begin without a real focus on business value. This is mostly because the use cases are not clearly communicated and the data architects do not know where to start. As a consequence, no business value can be delivered.

Data Vault 2.0 Methodology

It is often assumed that Data Vault 2.0 is only a modeling technique, but this is not correct. Data Vault 2.0 includes the modeling technique, a reference architecture and the methodology. The methodology introduces project management tools such as CMMI, Six Sigma and Scrum to solve the problems described. While CMMI and Six Sigma deal with general project management issues, Scrum is mostly used specifically in the development team and provides the framework for a continuously improving development process.  The use of agile development in Data Vault 2.0 projects will be described in more detail below.

The Scope of a Sprint

The first step in setting up a data warehouse project in an agile way is defining the objective of the project with just one or two pages. Unlike waterfall projects, the goal is to produce working pieces of usable output, such as reports or dashboards, in continuous iterations, otherwise called sprints. This means that we don’t need to plan the entire project in detail but instead can build around a general idea or goal for the final data warehouse before focusing on planning the first sprints. In order to address the aforementioned problems, the focus of the sprints needs to be centered around business value. For this reason, it is important to receive constant feedback from the business users for a continuous improvement process.

Define the project

Both the scope of a sprint and the architecture follow a business value driven approach built vertically and not horizontally. This means they are not built layer by layer but instead feature by feature. A common approach for this is the Tracer Bullet approach. Based on business value, which is defined by a report, a dashboard or an information mart, the source data will be identified and modeled through all layers and loaded. 

As shown in Figure 1, the entire staging area layer is not initially built but rather a small part of the respective layer is built based on data in the scope, in this case the SalesReport.

Agile Development

Before new functionality can be implemented in a sprint, it needs to be defined.
This task lies with the product owner as they are the ones to write and prioritize user stories.
As already explained, the goal of a sprint is to produce working pieces of usable outputs called features.
In addition, there are tech topics that need to be considered. There are various methods to support Sprint Planning, such as planning poker or Function Point Analysis, which are discussed in more detail in another article.

It is also worthwhile to evaluate the sprint while it is still ongoing. If the development team does not manage to implement a feature within a sprint, this is often a good indicator that the scope is too large.

To avoid this, all work packages that are not relevant for the feature should be removed from the sprint. In practice, however, these work packages are often not removed, out of fear of disappointing the business user.

To address this fear, it is important to explain to business users that the work packages will still be delivered, just in a later sprint, and are only temporarily moved into the backlog.

Due to the flexible and scalable Data Vault model, these layers can be extended with the next feature with little to no re-engineering. This is possible because Data Vault comprises a Raw Data Vault and a Business Vault, covering the logical architecture as well as the data modeling perspective. The Raw Data Vault is modeled in a data-driven way by integrating the data by business keys. Only hard business rules, such as data type conversions or hash key calculations, are applied. All other, soft business rules are applied only in the Business Vault.

Here, we turn data into information. For this reason, the Raw Data Vault requires less refactoring and can be extended limitlessly.


Another important success factor for agile projects is proper review and improvement. Before the next sprint starts, the following meetings must be held by the team:

  • The sprint review meeting: This meeting is about reviewing the delivered features. Usually the development team, the product owner, the Scrum Master and the end users participate in this meeting.
  • The retrospective meeting: This meeting usually takes place directly after the review meeting and focuses on identifying activities that need to be improved.
  • Backlog refinement: Prioritizing the user stories and making sure that the team understands what to do.
  • Sprint planning: Planning which user stories fit into the next sprint, based on estimating the effort.

It is important that these meetings are held so that the sources of errors can be found. In this way, the outcome of a project can be improved and the development processes optimized in an iterative way.


Data Vault 2.0 is not only a scalable and flexible modeling technique, but a complete methodology to accomplish enterprise vision in Data Warehousing and Information Delivery by following an agile approach and focusing on business value. By using agile methods in data warehousing, the focus in projects can be on the business value and delivering useful products to the customer.



Multi-Temporality in Data Vault 2.0 Part 2 – A Practical Example

By Scalefree Newsletter No Comments

Multi-Temporal Example from the Travel Industry

In one of our previous blog posts, we gave an insight into the basics of multi-temporality in Data Vault 2.0. This article will give you a short practical example of the topic.

When you book your holidays, you might change something after the initial booking: the duration, the flight departure location, the hotel, the start and/or end date of your journey or might just book some extras. From an operational perspective, these changes are not a problem to deal with, and the expectations are not high.

The data set for your journey will be updated in the source systems’ database, and only the most recent data is shown in the graphical user interface. The employee might see some logs in the history part, if they exist.

That’s usually enough to keep the operational part working. In the business intelligence world, the way to deal with data is a bit different. There, we want to see everything. Each time a record in the source system is updated, we catch all the changes and store them in a separate database.

This means that we introduce a new timeline which provides us new opportunities but also comes with new challenges. Let’s have a look into some example data sets of a source system over time:

Day 1: Initial booking of a holiday trip

Day 2: Correction of the start and end date and a small price adjustment


When you look into the source system today, you will see that booking B4711 is cancelled, or maybe completely gone, and B4712 is the active one (ID = 2, created on day 3). From an analytical perspective, you might have further questions you want answered.
For example:

  • What was the estimated revenue for July, looking from February?
  • What was the estimated revenue for July, looking from March?
  • Which dates do I need to use to answer the two questions above?
  • In which month do we have the most bookings for the next summer season?
  • Can we reproduce a report we created some months ago?
  • What if some business dates change? What affects my reports in the past?
  • Where can we prepare the data to answer all these questions? In the data warehouse? In the front-end BI tool? Or somewhere else?

In our webinar “Multi-Temporality in Data Vault 2.0”, we provide answers to all of these questions using Data Vault entities.

If you are interested in seeing some more use cases related to multi-temporality, we offer a 2-day class dedicated to this topic: the “Multi-Temporal Data Vault 2.0 Class”.

by Marc Finger (Scalefree)



Why should I use Data Vault (2.0)?

By Scalefree Newsletter No Comments

As a database developer, administrator or user, you probably have already heard of Data Vault. But with an existing database solution in mind, you might come to the question of why you should use Data Vault in your enterprise data warehouse. This blog article discusses some features where Data Vault can shine and why you should use it.

Reason 1: Supporting multiple source systems and frequently changing relationships.

Data Vault 2.0 provides the most benefits when your data comes from many source systems or involves constantly changing relationships.

The reason it works well in these scenarios is that it makes adding attributes simple. A change in one source system does not need to be propagated across all source systems. Similarly, you can limit the number of places where changes are made, as attributes are stored in satellites, separately from the structural data.

Additionally, it is easier to account for new and changing relationships by closing off one link and creating another. You don’t have to change the historical data to account for a new relationship or update an existing schema. You only need to account for the changes going forward. This brings enormous flexibility and scalability into your enterprise data warehouse.
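For example, when a new source starts delivering a relationship between customers and stores, only a new link table is added; nothing existing is altered. The names below are illustrative:

```sql
-- New relationship: add a new link table, existing tables stay untouched.
CREATE TABLE link_customer_store (
    hk_customer_store CHAR(32)     NOT NULL PRIMARY KEY,
    hk_customer       CHAR(32)     NOT NULL,
    hk_store          CHAR(32)     NOT NULL,
    ldts              TIMESTAMP    NOT NULL,
    rsrc              VARCHAR(100) NOT NULL
);
-- No ALTER of existing hubs, links, or satellites is required,
-- and no historical data has to be rewritten.
```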

Reason 2: Easily track and audit your data.

In a modern data environment, data passes through various stops. To still ensure continuous data quality, it must always be clear where the data has come from.

Figure 1: Data Lineage

Given the need to easily track and audit data, Data Vault makes doing so straightforward. Auditing is inherently enabled, as load dates and record sources are required for every row, and all historical changes are tracked in satellites, which include the load date as part of the primary key.
When an attribute is updated, a new record is created.
All of this enables you to easily provide auditability for both regulatory and data governance purposes. And because you store all of your history, you can access the data as of any point in time.
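In practice, this is an insert-only pattern: a new satellite row is written only when the attribute payload has actually changed, typically detected via a hashdiff comparison. The following generic SQL sketch uses illustrative table and column names:

```sql
-- Insert-only satellite load: existing history is never updated.
INSERT INTO sat_customer (hk_customer, ldts, rsrc, hashdiff, name, city)
SELECT s.hk_customer, s.ldts, s.rsrc, s.hashdiff, s.name, s.city
FROM stg_customer s
LEFT JOIN (
    -- latest existing satellite row per business key
    SELECT hk_customer, hashdiff,
           ROW_NUMBER() OVER (PARTITION BY hk_customer
                              ORDER BY ldts DESC) AS rn
    FROM sat_customer
) cur
  ON cur.hk_customer = s.hk_customer AND cur.rn = 1
WHERE cur.hk_customer IS NULL        -- brand-new business key
   OR cur.hashdiff <> s.hashdiff;    -- attributes changed since last load
```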

Reason 3: Load data from multiple source systems more quickly.

Data Vault enables quicker data loading simply because a number of tables can be loaded at the same time in parallel. The model decreases dependencies between tables during the load process and simplifies the ingestion process by leveraging inserts only, which load quicker than upserts or merges. This also leads to less complexity.


Data Vault 2.0 is designed to integrate data from several source systems and may be more than is needed in some situations.

In summary, if you have a small to medium-sized analytics requirement, with a small team of architects, designers, and engineers delivering a solution with data sourced from only a few systems, then Data Vault may be inappropriate for your needs.

If, however, you have a large project with many source systems leading to an enormous data integration challenge, then Data Vault can potentially add massive value to the project.
