
Soft Deletes in Data Vault 2.0

Watch the Video

In the latest edition of our Data Vault Friday series, our knowledgeable BI Consultant, Lorenz Kindling, delves into a question posed by an audience member.

“Can you use soft deletes for GDPR or Security in Data Vault?”

Lorenz provides valuable insights into the application of soft deletes within the Data Vault framework, specifically addressing their potential role in achieving GDPR compliance and enhancing data security measures.

Agile Kanban Made Easy for Salesforce: Introducing the MOWI App

MOWI – The Agile Kanban Board in Salesforce

Agile methods like Kanban help teams manage projects with transparency, flexibility, and efficiency. However, many organizations struggle with integrating Kanban into their existing systems. This is where MOWI comes in.

MOWI is a lightweight Salesforce application that brings Kanban boards directly into the Business Application Platform. Built around the three objects Mission, Objective, and Work Item, MOWI makes project and task management in Salesforce simple, structured, and intuitive.

Why MOWI?

  • Fully integrated into Salesforce – no additional tools required.
  • Improved transparency – all team members can see progress and responsibilities at a glance.
  • Flexible collaboration – goals, tasks, and subtasks can be managed in a Kanban board with simple drag and drop.

How MOWI Works

Mission – the bigger picture

A Mission groups together multiple Objectives and serves as the overarching project goal. This makes it easy to organize even complex initiatives.

Example: A “Product Launch” mission that includes several objectives such as marketing, development, and sales preparation.

Objective – the core element

Objectives are the main tasks or user stories, comparable to cards on a Kanban board. Each Objective contains detailed information and can also be linked to Follow-up Objectives for more complex scenarios.

Example: “Create marketing campaign” as an objective, linked to multiple work items such as landing page design or content creation.

Work Item – the detailed steps

Work Items are the subtasks of an objective. They provide a more granular view of progress and move through phases such as Open, In Progress, and Closed.

Example: “Design landing page” as a work item within the “Create marketing campaign” objective.

The Kanban View in Salesforce

In the Kanban View, all objectives are displayed in columns such as Active, Blocked, Testing, or Done. Users can drag and drop objectives across columns as work progresses.

The result: clear visibility, better communication within the team, and improved productivity.

Conclusion

With MOWI, you bring agile project management directly into Salesforce – lightweight, integrated, and effective. Download now for free on AppExchange.

Data Vault Modeling Styles

Watch the Video

As part of our engaging Data Vault Friday series, our CEO, Michael Olschimke, addresses a pertinent question raised by an audience member.

“What’s your view on other Data Vault philosophies? Some of my colleagues received training in such modeling styles, but the philosophy contains some substantial differences related to the CDVP2 certification.”

In this insightful video, Michael shares his perspectives on various Data Vault philosophies beyond the CDVP2 (Certified Data Vault Practitioner Level 2) certification. Drawing from his extensive experience and expertise, he navigates through the nuances of different modeling styles within the Data Vault framework.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Point in Time vs. Record Source Tracking in Data Vault

Watch the Video

In the latest installment of our informative Data Vault Friday series, our CEO, Michael Olschimke, takes on a thought-provoking question posed by a member of our engaged audience.

“How are record source tracking satellites used in a Data Vault, and if it is not used in a project, then how can PIT tables come into play in tracking the customers or business keys? And if we are taking the PITs for tracking the keys, does this mean we should take the daily snapshots of the data?”

Michael intricately unpacks the nuances of utilizing record source tracking satellites within the context of Data Vault methodology. He provides valuable insights into the role of PIT (Point-in-Time) tables in effectively tracking customers or business keys, shedding light on their significance in scenarios where record source tracking satellites might not be employed.

Delving deeper into the intricacies, Michael elucidates whether the use of PIT tables necessitates the capture of daily snapshots of the data, offering a comprehensive perspective for data professionals seeking clarity on these vital aspects of data modeling.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Exploring Datavault4DBT: A Practical Series on the DBT Package for Data Vault 2.0 – Vol. 1: The Staging Layer

Exploring DataVault4dbt

Last year Scalefree released DataVault4dbt, a Data Vault 2.0 open-source package for dbt, which includes templates for creating and loading Data Vault 2.0 entities following up-to-date standards and best practices. If you want to read more about the general content of the package and its motivation, you can do so here.

We’re excited to launch a series of insightful posts and webinars, showcasing practical implementations of DataVault4dbt. This will empower you to leverage its full potential in your data warehousing endeavors. Today, we’ll spotlight its application in the staging layer.

Before we start with DataVault4dbt

We will assume some previous knowledge of Data Vault 2.0 and dbt. In addition, for the following examples, we will be using the dbt Cloud IDE connected to Snowflake. For an up-to-date list of supported platforms, check the package’s GitHub repository.

Also, bear in mind that for optimal use of the macros, you must meet a couple of prerequisites:

  • Flat & Wide source data, accessible in your target database
  • A Load Date column signifying the time of arrival in the source data storage
  • A Record Source column detailing the origin of the source data, such as the file location within a Data Lake

In our case, we used and adapted the data from the jaffle_shop example project available on dbt.

Installing DataVault4dbt package on dbt

Installing DataVault4dbt is like installing any other package on your project. You will need to follow two simple steps:

1. Add it to your packages.yml file (see the sketch below)


2. Run dbt deps

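For reference, the packages.yml entry looks roughly like the sketch below. The package slug and version range shown here are assumptions based on the dbt Hub listing, so check the package’s GitHub repository for the current release before copying it.

```yaml
# packages.yml -- the version range is illustrative; use the latest release listed on the dbt Hub
packages:
  - package: ScalefreeCOM/datavault4dbt
    version: [">=1.0.0", "<2.0.0"]
```

Once the entry is saved, running dbt deps downloads the package into your project’s dbt_packages folder.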

Using the macro for staging our source data

According to the documentation for the staging layer of DataVault4dbt, this layer primarily focuses on hashing. It also offers functionalities like creating derived columns, conducting prejoins, and adding NULL values for missing columns. Rather than diving deep into the technical aspects of each macro component, which are comprehensively covered in the documentation, let’s dive straight into its application!

A. Basic source information

Identifying the Source Model (source_model):

  • When referencing a source, adopt the dictionary format: ‘source_name’: ‘source_table’.
  • For models within our dbt project, just use the model name: ‘source_table’.

Setting Load Date Timestamp (ldts) & Record Source (rsrc):

  • Both can reference a column from the source table or a more detailed SQL expression.
  • Additionally, for the Record Source, you can use a static string beginning with ‘!’, like ‘!my_source’.

Example

  • source_model: References a model already created in dbt, named ‘orders_example’.
  • ldts: Calls a timestamp column from our source model.
  • rsrc: Calls a column which contains a string referring to our record source name.
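Put together, a minimal stage model combining these three parameters could look like the sketch below. The YAML-inside-Jinja pattern and the datavault4dbt.stage macro call follow the package documentation; the model file name and the ldts/rsrc column names are assumptions for illustration, so adapt them to your source.

```sql
-- models/staging/stage_orders.sql (illustrative sketch; verify parameter names against the docs)
{%- set yaml_metadata -%}
source_model: 'orders_example'     # an existing dbt model
ldts: 'edwLoadDate'                # timestamp column marking arrival in the source storage (assumed name)
rsrc: 'edwRecordSource'            # column holding the record source string (assumed name)
{%- endset -%}

{%- set metadata_dict = fromyaml(yaml_metadata) -%}

{{ datavault4dbt.stage(source_model=metadata_dict.get('source_model'),
                       ldts=metadata_dict.get('ldts'),
                       rsrc=metadata_dict.get('rsrc')) }}
```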

B. Hashing

In DataVault4dbt, the hashed_columns parameter outlines how to generate hashkeys and hashdiffs. For each hash column:

  • The key represents the hash column’s name.
  • For Hashkeys, the value is a list of business keys.
  • For Hashdiffs, the value will usually be a list of descriptive attributes.

Example

  • hk_order_h: hashkey generated using two column inputs (O_ORDERKEY and O_CUSTKEY)
  • hd_order_s: hashdiff generated using multiple descriptive attributes
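Expressed as macro metadata, the hashed_columns parameter for this example could be declared roughly as follows. The is_hashdiff/columns notation for hashdiffs follows the package documentation, and the descriptive attributes listed under hd_order_s are an illustrative selection.

```yaml
hashed_columns:
  hk_order_h:            # hashkey: list of business keys
    - O_ORDERKEY
    - O_CUSTKEY
  hd_order_s:            # hashdiff: list of descriptive attributes (illustrative selection)
    is_hashdiff: true
    columns:
      - O_ORDERSTATUS
      - O_TOTALPRICE
      - O_ORDERDATE
```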

C. Derived columns

Derived Columns in DataVault4dbt stage models allow users to directly apply specific transformations to data. They act as on-the-fly customizations, enabling immediate adjustments to data within the column itself. Essentially, if data isn’t in the desired format, with DataVault4dbt you can derive a new version right within the column using a specified rule.

When setting the derived_columns parameter, each derived column includes:

  • value: The transformation expression.
  • datatype: The datatype of the column.
  • src_cols_required: Source columns needed for the transformation.

Depending on how you name the derived column and the source columns, you can achieve two outcomes:

  1. If the derived column’s name matches its source column’s name, the original column’s data will be replaced by the transformed data. This effectively means you’re overwriting the original data.
  2. On the other hand, if the derived column’s name is different from its source column’s name, the transformation will result in a brand new column, preserving the original column’s data.

Example

  • price_euro: creation of a new column with the same values as the O_TOTALPRICE column.
  • country_isocode: creation of a new column with a static string ‘GER’.
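A sketch of the corresponding derived_columns entry is shown below; the datatypes are Snowflake-flavored assumptions.

```yaml
derived_columns:
  price_euro:                          # new column with the same values as O_TOTALPRICE
    value: 'O_TOTALPRICE'
    datatype: 'NUMBER(38,2)'
    src_cols_required: 'O_TOTALPRICE'
  country_isocode:                     # new column filled with a static string
    value: "'GER'"
    datatype: 'VARCHAR(3)'
```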

D. Prejoining

Why Prejoin?

In certain scenarios, your source data might not have the ‘Business Key’ which is often a human-readable identifier, such as an email address or username. Instead, it might have a ‘Technical Key’, which could be an internally generated identifier or code. If you need to use the human-readable Business Key in your processing but only have the Technical Key, you would use prejoining to combine your data with another table that maps Technical Keys to Business Keys.

How to Define Prejoins in DataVault4dbt?

The DataVault4dbt package provides a structured way to define these prejoins (prejoined_columns) using dictionaries.

For every column you’re adding through prejoining, you need to specify a few things:

  • src_name: This is the source of the prejoined data, as defined in a .yml file.
  • src_table: This specifies which table you’re prejoining with, as named in the .yml file.
  • bk: This is the name of the Business Key column in the prejoined table, i.e., the column whose values you are bringing into your table.
  • this_column_name: In your original data, this is the column that matches up with the prejoined table. This is often a Technical Key.
  • ref_column_name: In the prejoined table, this is the column that this_column_name points to. It should match up with the values in this_column_name.

Note that both ‘this_column_name’ and ‘ref_column_name’ can represent either a single column or a list of columns, serving as the basis for constructing the JOIN conditions.

Example

  • c_name: we brought the column “C_NAME” from the customer source table, joining on orders.o_custkey = customer.c_custkey.
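Written out as a prejoined_columns entry, this example could look like the sketch below; the source name ‘jaffle_shop’ is an assumption about how the customer table is registered in the sources .yml file.

```yaml
prejoined_columns:
  c_name:
    src_name: 'jaffle_shop'         # source name as defined in the .yml file (assumed)
    src_table: 'customer'           # table we are prejoining with
    bk: 'C_NAME'                    # column whose values are brought into our table
    this_column_name: 'O_CUSTKEY'   # technical key in the orders source
    ref_column_name: 'C_CUSTKEY'    # matching key in the customer table
```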

E. Multi active config

The multi_active_config parameter is used when dealing with source data that contains multiple active records for the same Business Key. Essentially, you need to specify which columns are the multi-active keys and the primary hashkey column.

If your source data doesn’t have a natural multi-active key column, you should create one using functions like row_number in a preceding layer. Then, add the name of this newly created column to the multi-active-key parameter. It’s crucial that the combination of multi-active keys, the main hashkey, and the ldts column be unique in the final satellite output. If you don’t use this setting, the stage is considered to have only single active records.

Example

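As a sketch, and assuming an order line number serves as the multi-active key, the multi_active_config parameter could be set as follows; the exact parameter names should be verified against the package documentation.

```yaml
multi_active_config:
  multi_active_key: 'O_LINENUMBER'    # column(s) distinguishing multiple active records per business key (assumed)
  main_hashkey_column: 'hk_order_h'   # the primary hashkey defined under hashed_columns
```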

By setting this parameter, we’ll observe consistent hashdiffs for identical Business Keys, proving beneficial in subsequent layers. If you want to know why, you can check this post.

F. Missing columns

With DataVault4dbt, the missing_columns parameter helps handle scenarios where the source schema changes and some columns no longer exist. Using this parameter, you can create placeholder columns filled with NULL values to replace the missing ones. This ensures that hashdiff calculations and satellite payloads continue to work. Essentially, you provide a dictionary where the column names are the keys and their respective SQL datatypes are the values.

Example

  • discount_code: creation of a new discount_code column with NULL values.
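The missing_columns parameter is simply a mapping of column name to SQL datatype, for example (the datatype here is an assumption):

```yaml
missing_columns:
  discount_code: 'VARCHAR(10)'   # placeholder column filled with NULL values
```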

Conclusion

Scalefree’s DataVault4dbt package introduces an easy-to-use yet powerful solution for database modeling. In our case, we went through the staging layer macro, which combines best practices with the flexibility to address diverse source data needs. From hashing to on-the-fly column modifications, this Data Vault 2.0 open-source package for dbt streamlines complex processes.

As we continue to explore its potential, we invite you to join our monthly expert session for a deeper dive. Reserve your spot here and stay tuned to the package’s GitHub repository for the latest updates and support.

Timezone to Be Used for Timestamps in Data Vault

Watch the Video

As part of our engaging Data Vault Friday series, our distinguished CEO, Michael Olschimke, delves into a pertinent question posed by an inquisitive member of our audience.

“In DV2.0 it is advised to use the UTC zone. How do we store incoming timestamps from sources that are in other time zones (e.g. GMT)? E.g. in Azure SQL Server.”

In addressing this query, Michael provides valuable insights into the best practices for handling timestamps, especially when dealing with diverse time zones within the Data Vault 2.0 methodology. Emphasizing the recommendation to utilize the UTC zone, he navigates through the considerations and strategies for storing incoming timestamps that originate from sources operating in different time zones, such as GMT.

This illuminating discussion serves as a testament to our commitment to fostering knowledge and expertise in the realm of data architecture, making our Data Vault Friday series a valuable resource for data professionals.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Referencing Reference Tables in Data Vault

Watch the Video

In the ongoing journey of our Data Vault Friday series, our esteemed CEO, Michael Olschimke, delves into a thought-provoking question raised by a keen member of our audience.

“Is it possible to have an m:n link between two reference tables (country to currency)?”

In addressing this query, Michael navigates through the intricacies of data modeling, shedding light on the feasibility and implications of establishing an m:n (many-to-many) link between two reference tables, specifically in the context of countries and currencies.

By exploring the nuances of this scenario, Michael provides valuable insights into the challenges and considerations associated with creating such relationships in the Data Vault framework. This engagement exemplifies the essence of our Data Vault Friday series, where practical queries are met with informative discussions to enhance the understanding of data professionals.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Hierarchical Link by Using an Example in Data Vault

Watch the Video

With this week’s episode of Data Vault Friday, our CEO, Michael Olschimke, turns his attention to an insightful question about the use of a Hierarchical Link:

“Can you please explain in detail the hierarchical link using an example (different from the Bill of Material one, please)?”

In response to this discerning inquiry, Michael embarks on a comprehensive exploration of the concept of hierarchical links within the Data Vault framework. Drawing upon his extensive expertise, he elucidates the intricacies of modeling hierarchical links by presenting a distinctive example, distinct from the conventional Bill of Material scenario.

Through this elucidation, Michael aims to demystify the complexities surrounding hierarchical links, providing the audience with a practical and nuanced understanding of their application in diverse contexts. His commitment to delivering insightful explanations reflects the ethos of our Data Vault Friday series, which strives to empower data professionals with valuable knowledge.

Meet the Speaker


Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Record Source for Links in Data Vault

Watch the Video

As part of our continuous Data Vault Friday series, our adept BI Consultant, Lorenz Kindling, delves into a thought-provoking question posed by a keen member of our audience.

“A Link refers to multiple Hubs, but we only have one Record_Source in the Link. Or the Link is loaded from more than one Source System. What do we use as a Record Source?”

In response to this intriguing query, Lorenz delves into the critical aspect of determining record sources for links within the Data Vault methodology. He shares insights into the best practices and considerations when dealing with scenarios where a Link is associated with multiple Hubs or loaded from various Source Systems.

Lorenz’s comprehensive analysis provides clarity on the nuanced decisions involved in selecting an appropriate Record Source, ensuring that the Data Vault model maintains accuracy and coherence. This discussion underscores Lorenz’s commitment to offering practical guidance to data professionals navigating the intricacies of link modeling.

Agile Development with Data Vault 2.0

Watch the Video

In our continuous Data Vault Friday series, our seasoned BI Consultant, Lorenz Kindling, takes the spotlight to address a pertinent query posed by an engaged member of our audience.

“I have a problem with the business value not delivering. Is there a perfect solution?”

Lorenz, drawing from his wealth of experience and expertise, delves into the nuances of overcoming challenges related to the delivery of business value in the context of agile development. He shares insights and practical solutions to ensure that the delivery process aligns seamlessly with the intended business outcomes.

Lorenz’s thoughtful analysis provides valuable guidance for individuals navigating the complexities of agile development within the framework of Data Vault methodologies. This engaging discussion underscores his commitment to empowering data professionals with actionable insights and best practices.

Choosing the Right Tech Stack for an Open-Source Powered EDW

Open-Source Powered EDW

Choosing the right technology stack is a critical decision when building an open source powered Enterprise Data Warehouse (EDW). The technology stack consists of various components, including databases, automation tools, DevOps, Infrastructure, and visualizations, which work together to enable efficient data management, processing, and analysis.

In this blog article, we will dive deeper into the topic of selecting the right tech stack for an open source powered EDW. We will explore different aspects to consider, such as evaluating vendors, leveraging open source products, and understanding the key components of a robust tech stack. By the end of this article, you will have a better understanding of the factors to consider when selecting the right tech stack for your EDW.

Choosing the right Tech Stack for an Open Source powered EDW

Join our webinar as our expert dives into the process of selecting the tech stack for your open-source Enterprise Data Warehouse (EDW) project. Learn more about essential considerations such as evaluating vendors, leveraging open-source products, and understanding key components like databases, automation tools, DevOps, infrastructure, and visualization. Furthermore, discover the power of combining Data Vault 2.0 with an open-source tech stack and learn how it can empower your EDW project.

Watch Webinar Part 1 | Watch Webinar Part 2

Evaluating Vendors and Leveraging Open-Source Products:

When embarking on the journey of building an open-source powered EDW, it is crucial to evaluate vendors and leverage open source products effectively. By choosing reputable vendors and open source solutions, you can ensure reliability, community support, and continuous development. Evaluating vendors involves assessing their expertise, reputation, and compatibility with your project requirements. Additionally, leveraging open source products provides flexibility, cost-effectiveness, and access to a vast community of contributors and developers.

Understanding the Key Components of a Robust Tech Stack:

A robust tech stack for an open source powered EDW comprises various components that work together to enable efficient data management and analysis. Here are some key components to consider:


Databases:

Choosing the appropriate database technology is vital for efficient data storage and retrieval. Options like MongoDB, PostgreSQL, MySQL, or other databases that align with your project requirements should be considered.

Automation Tools:

Automation tools play a crucial role in the development process of an EDW. These tools greatly accelerate the development process, particularly in a Data Vault project. One example of an open source automation tool is dbt (data build tool), which can be combined with Scalefree’s self-developed package DataVault4dbt. These tools help streamline the development process and make the development team more efficient.

DevOps and Infrastructure:

Having a stable scheduler or a similar tool to load the data regularly from the sources into the Data Warehouse is important. Options such as Airflow can be considered for this purpose. Additionally, having a DevOps tool for project management is essential. These tools help structure the work and make the development team more efficient, especially when using agile methodologies like Scrum.

Visualization:

Effective data visualization is crucial for analyzing and understanding the data in an EDW. There are various open source visualization tools available, such as Grafana, Superset, or Metabase, which provide powerful capabilities for creating insightful visualizations and dashboards.

Why Data Vault 2.0 is a Powerful Choice in Combination with an Open Source Tech Stack:

Combining Data Vault 2.0 with an open source tech stack offers a powerful solution for building an efficient, scalable EDW. The agile concepts used in Data Vault make it easier to gradually build an open source tech stack over time, starting with basic needs and expanding as necessary.
It should be noted that checking the readiness of an open source automation tool for Data Vault and having Data Vault templates in place is crucial. These components enhance efficiency, streamline development, and ensure smooth integration in an open source powered EDW environment.

Benefits of an Open Source Powered EDW:

Building an open source powered EDW offers several advantages. Firstly, open source solutions often provide a vast community of developers, ensuring continuous support, updates, and improvements. Secondly, open source products can be customized and tailored to meet specific project requirements. This flexibility allows you to adapt the tech stack to your organization’s needs and scale as your data processing requirements grow. Lastly, open source solutions typically offer cost-effectiveness by eliminating or reducing licensing fees, making them an attractive option for organizations of all sizes.

Considerations for Scalability and Performance:

Scalability and performance are crucial factors to consider when selecting the right tech stack for an open source powered EDW. As your data processing needs grow, it’s important to choose a tech stack that can scale horizontally or vertically to handle increasing workloads. Technologies like Kubernetes can be considered for container orchestration and load balancing to ensure efficient utilization of resources and smooth scalability. Additionally, performance optimization techniques, such as caching mechanisms, data indexing, and query optimization, should be considered to ensure fast and efficient data retrieval and processing.

Security and Data Privacy:

When dealing with enterprise data, security and data privacy are of utmost importance. Ensure that the chosen tech stack incorporates robust security measures and follows best practices for data encryption, access control, and secure communication protocols. Regular security audits and updates are essential to address any vulnerabilities and ensure compliance with data privacy regulations.

Summary

Picking the right tech stack for an open source powered EDW is a crucial step in building an efficient and scalable BI-System. By evaluating vendors, leveraging open source products, and understanding the key components of a robust tech stack, you can ensure a solid foundation for your EDW. Databases, automation tools, DevOps and Infrastructure, and visualization choices play vital roles in creating an effective and customizable solution. Embracing open source solutions provides flexibility, community support, and cost-effectiveness, making it an ideal choice for organizations seeking efficient data processing and analysis capabilities. Considerations for scalability, performance, security, and data privacy are important to ensure the success of your EDW implementation.

In conclusion, the selection of a tech stack for an open source powered EDW requires careful consideration of various factors. It is essential to evaluate vendors, leverage open source products, and understand the key components that contribute to a robust tech stack. By making informed choices and aligning the tech stack with your project objectives, you can build a scalable and efficient EDW that empowers your organization to process and analyze data effectively.

If you are interested in learning more about the topic, watch the recording here for free.

 

– Lorenz Kindling (Scalefree)

Information Delivery with Multi-Activity in Data Vault

Watch the Video

In our recurring Data Vault Friday series, our adept BI Consultant, Julian Brunner, takes center stage to address a thought-provoking inquiry from a keen member of our audience.

“I want to create a dimension with three standard satellites and one multi-active satellite, that are all attached to the same hub. The multi-active satellite has a multi-active attribute included. Is there a best practice for DV modeling?”

Julian, leveraging his extensive experience and expertise, delves into the intricate realm of Data Vault modeling, with a specific focus on creating dimensions enriched with both standard and multi-active satellites. He elucidates best practices, providing valuable insights on how to architect an efficient and effective Data Vault model. The emphasis is on optimizing information delivery while maintaining the integrity and flexibility of the underlying data structure.

This engaging discussion serves as a valuable resource for data professionals seeking guidance on advanced Data Vault modeling techniques, showcasing Julian’s commitment to sharing practical insights and industry best practices.
