Building a scalable Data Platform?

Building a scalable Data Platform? In Data Vault Friday

Agile Development with Data Vault 2.0

Watch the Video

In our continuous Data Vault Friday series, our seasoned BI Consultant, Lorenz Kindling, takes the spotlight to address a pertinent query posed by an engaged member of our audience.

“I have a problem with the business value not delivering. Is there a perfect solution?”

Lorenz, drawing from his wealth of experience and expertise, delves into the nuances of overcoming challenges related to the delivery of business value in the context of agile development. He shares insights and practical solutions to ensure that the delivery process aligns seamlessly with the intended business outcomes.

Lorenz’s thoughtful analysis provides valuable guidance for individuals navigating the complexities of agile development within the framework of Data Vault methodologies. This engaging discussion underscores his commitment to empowering data professionals with actionable insights and best practices.

Meet the Speaker

Lorenz Kindling

Lorenz is working in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on data warehouse automation and Data Vault modeling. Since 2021, he has been advising renowned companies in various industries for Scalefree International. Prior to Scalefree, he also worked as a consultant in the field of data analytics. This allowed him to gain a comprehensive overview of data warehousing projects and common issues that arise.

Building a scalable Data Platform? In Data Warehouse, Intermediate

Choosing the Right Tech Stack for an Open-Source Powered EDW

Open-Source Powered EDW

Choosing the right technology stack is a critical decision when building an open source powered Enterprise Data Warehouse (EDW). The technology stack consists of various components, including databases, automation tools, DevOps, Infrastructure, and visualizations, which work together to enable efficient data management, processing, and analysis.

In this blog article, we will dive deeper into the topic of selecting the right tech stack for an open source powered EDW. We will explore different aspects to consider, such as evaluating vendors, leveraging open source products, and understanding the key components of a robust tech stack. By the end of this article, you will have a better understanding of the factors to consider when selecting the right tech stack for your EDW.

Choosing the right Tech Stack for an Open Source powered EDW

Join our webinar as our expert dives into the process of selecting the tech stack for your open-source Enterprise Data Warehouse (EDW) project. Learn more about essential considerations such as evaluating vendors, leveraging open-source products, and understanding key components like databases, automation tools, DevOps, infrastructure, and visualization. Furthermore, discover the power of combining Data Vault 2.0 with an open-source tech stack and learn how it can empower your EDW project.

Watch Webinar Part 1 Watch Webinar Part 2

In this article:

Evaluating Vendors and Leveraging Open-Source Products:
Understanding the Key Components of a Robust Tech Stack:
Why Data Vault 2.0 is a Powerful Choice in Combination with an Open Source Tech Stack:
Benefits of an Open Source Powered EDW:
Considerations for Scalability and Performance:
Security and Data Privacy:
Summary

Evaluating Vendors and Leveraging Open-Source Products:

When embarking on the journey of building an open-source powered EDW, it is crucial to evaluate vendors and leverage open source products effectively. By choosing reputable vendors and open source solutions, you can ensure reliability, community support, and continuous development. Evaluating vendors involves assessing their expertise, reputation, and compatibility with your project requirements. Additionally, leveraging open source products provides flexibility, cost-effectiveness, and access to a vast community of contributors and developers.

Understanding the Key Components of a Robust Tech Stack:

A robust tech stack for an open source powered EDW comprises various components that work together to enable efficient data management and analysis. Here are some key components to consider:

Databases:

Choosing the appropriate database technology is vital for efficient data storage and retrieval. Consider options such as MongoDB, PostgreSQL, MySQL, or other databases that align with your project requirements.

Automation Tools:

Automation tools play a crucial role in the development process of an EDW. These tools greatly accelerate the development process, particularly in a Data Vault project. One example of an open source automation tool is dbt (data build tool), which can be combined with Scalefree’s self-developed package DataVault4dbt. These tools help streamline the development process and make the development team more efficient.
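To illustrate why template-based automation speeds up Data Vault development, here is a minimal sketch in plain Python. It is not dbt or DataVault4dbt code; the template, the function name render_hub_load, and all table and column names are assumptions made up for this example. The idea is only that one loading pattern is written once and then rendered for every hub from a few metadata values.

```python
from string import Template

# One generic hub-loading pattern, written once (illustrative only).
HUB_TEMPLATE = Template(
    "INSERT INTO $hub ($hash_key, $business_key, load_date, record_source)\n"
    "SELECT DISTINCT s.$hash_key, s.$business_key, s.load_date, s.record_source\n"
    "FROM $stage AS s\n"
    "WHERE NOT EXISTS (SELECT 1 FROM $hub AS h WHERE h.$hash_key = s.$hash_key);"
)


def render_hub_load(hub: str, hash_key: str, business_key: str, stage: str) -> str:
    """Fill the generic hub pattern with the metadata of one concrete hub."""
    return HUB_TEMPLATE.substitute(
        hub=hub, hash_key=hash_key, business_key=business_key, stage=stage
    )


# Hypothetical metadata for a customer hub.
print(render_hub_load("hub_customer", "hk_customer", "customer_id", "stg_crm_customer"))
```

Adding another hub then means adding another set of metadata values rather than writing another loading statement by hand.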

DevOps and Infrastructure:

Having a stable scheduler or a similar tool to load the data regularly from the sources into the Data Warehouse is important. Options such as Airflow can be considered for this purpose. Additionally, having a DevOps tool for project management is essential. These tools help structure the work and make the development team more efficient, especially when using agile methodologies like Scrum.
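At its core, a scheduler runs load steps in an order that respects their dependencies. The sketch below shows that idea with Python's standard library only; it is not Airflow code, and the step names are hypothetical. A tool such as Airflow adds scheduling, retries, and monitoring on top of this kind of dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical load steps, each mapped to the steps it depends on.
DEPENDENCIES = {
    "stage_crm": set(),
    "stage_erp": set(),
    "raw_vault": {"stage_crm", "stage_erp"},
    "business_vault": {"raw_vault"},
    "information_mart": {"business_vault"},
}


def load_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order in which every dependency runs first."""
    return list(TopologicalSorter(dependencies).static_order())


for step in load_order(DEPENDENCIES):
    print("run", step)
```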

Visualization:

Effective data visualization is crucial for analyzing and understanding the data in an EDW. There are various open source visualization tools available, such as Grafana, Superset, or Metabase, which provide powerful capabilities for creating insightful visualizations and dashboards.

Why Data Vault 2.0 is a Powerful Choice in Combination with an Open Source Tech Stack:

Combining Data Vault 2.0 with an open source tech stack offers a powerful solution for building an efficient, scalable EDW. The agile concepts used in Data Vault make it easier to gradually build an open source tech stack over time, starting with basic needs and expanding as necessary.
It should be noted that checking the readiness of an open source automation tool for Data Vault and having Data Vault templates in place is crucial. These components enhance efficiency, streamline development, and ensure smooth integration in an open source powered EDW environment.

Benefits of an Open Source Powered EDW:

Building an open source powered EDW offers several advantages. Firstly, open source solutions often provide a vast community of developers, ensuring continuous support, updates, and improvements. Secondly, open source products can be customized and tailored to meet specific project requirements. This flexibility allows you to adapt the tech stack to your organization’s needs and scale as your data processing requirements grow. Lastly, open source solutions typically offer cost-effectiveness by eliminating or reducing licensing fees, making them an attractive option for organizations of all sizes.

Considerations for Scalability and Performance:

Scalability and performance are crucial factors to consider when selecting the right tech stack for an open source powered EDW. As your data processing needs grow, it’s important to choose a tech stack that can scale horizontally or vertically to handle increasing workloads. Technologies like Kubernetes can be considered for container orchestration and load balancing to ensure efficient utilization of resources and smooth scalability. Additionally, performance optimization techniques, such as caching mechanisms, data indexing, and query optimization, should be considered to ensure fast and efficient data retrieval and processing.

Security and Data Privacy:

When dealing with enterprise data, security and data privacy are of utmost importance. Ensure that the chosen tech stack incorporates robust security measures and follows best practices for data encryption, access control, and secure communication protocols. Regular security audits and updates are essential to address any vulnerabilities and ensure compliance with data privacy regulations.

Summary

Picking the right tech stack for an open source powered EDW is a crucial step in building an efficient and scalable BI system. By evaluating vendors, leveraging open source products, and understanding the key components of a robust tech stack, you can ensure a solid foundation for your EDW. Databases, automation tools, DevOps and infrastructure, and visualization choices play vital roles in creating an effective and customizable solution. Embracing open source solutions provides flexibility, community support, and cost-effectiveness, making it an ideal choice for organizations seeking efficient data processing and analysis capabilities. Considerations for scalability, performance, security, and data privacy are important to ensure the success of your EDW implementation.

In conclusion, the selection of a tech stack for an open source powered EDW requires careful consideration of various factors. It is essential to evaluate vendors, leverage open source products, and understand the key components that contribute to a robust tech stack. By making informed choices and aligning the tech stack with your project objectives, you can build a scalable and efficient EDW that empowers your organization to process and analyze data effectively.

If you are interested in learning more about the topic, watch the recording here for free.

– Lorenz Kindling (Scalefree)

Capturing Temporal Data on Changing Relationships in Data Vault

Watch the Video

In the latest installment of our enlightening Data Vault Friday series, our CEO, Michael Olschimke, delves into a thought-provoking query posed by a member of our engaged audience.

“We have received the relationship between investor and company with a PostingMonth for the last couple of months. Also, the ownership percentage for the relationship could change over time (see attached Excel for mock data :)). So our question is: should we take the Period as a part of the Investor_Company_Link? If yes, how can we track the relationship changes with Effectivity Satellite? Or do you think Multi-active link satellite is a better choice here?”

Michael meticulously explores the intricacies of modeling investor-company relationships, particularly when faced with dynamic factors such as changing ownership percentages over distinct time periods. He offers valuable insights into the considerations between incorporating Period as part of the Investor_Company_Link and the nuanced application of Effectivity Satellite or Multi-active link satellite to accurately capture and manage the evolving nature of these relationships.

This insightful discussion proves instrumental for data professionals navigating the complexities of representing dynamic relationships within the Data Vault framework.

Meet the Speaker

Michael Olschimke

Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!

Extending Existing Data Vault Model by GDPR-Identified Data

Watch the Video

In our ongoing Data Vault Friday series, our esteemed CEO, Michael Olschimke, tackles a compelling question raised by an engaged member of our audience.

“Let’s assume that DWH is fed from many source systems and one of them (some minor one, called ‘XYZ’) exports customer data identified by PERSONAL_ID (no other identifier available). We already have HUB_CUSTOMER based on some other customer identifier, and the PERSONAL_ID attribute is stored in SAT_CUSTOMER_PD. But there is one important thing regarding customer data, there are cases where multiple rows in HUB_CUSTOMER have the same PERSONAL_ID in mentioned satellite (which means, that some of the customers have been registered multiple times in our core systems).”

In this illuminating episode, Michael delves into the intricate scenario of integrating customer data from diverse sources, emphasizing the challenges posed by the absence of a unique identifier and the existence of duplicate entries. He articulates a strategic approach to address this nuanced issue within the Data Vault framework, providing practical insights and recommendations for achieving a coherent and accurate representation of customer information.

This discussion proves invaluable for data professionals navigating the complexities of consolidating diverse customer data sets with varying identifier structures.

Meet the Speaker

Michael Olschimke

Extending Satellites in Data Vault

Watch the Video

In our continuous Data Vault Friday series, our skilled trainer, Marc Finger, delves into a question posed by an audience member.

“Changes in the source system (new column/s): New row in an existing Satellite or new Satellite?”

Marc provides valuable insights into handling changes in the source system, specifically when encountering the addition of new columns. The question revolves around whether it’s more appropriate to introduce a new row in an existing Satellite or create an entirely new Satellite to accommodate these changes.

Through a clear and concise discussion, Marc elucidates the considerations and factors that influence the decision-making process. He explores the implications of both options, emphasizing the importance of aligning the chosen approach with the specific requirements and goals of the Data Vault model.

The trainer guides the audience through the thought process involved in making this decision, providing practical tips and best practices. By the end of the episode, viewers gain a deeper understanding of how to navigate the challenges associated with changes in the source system within the context of Data Vault methodology.

Meet the Speaker

Marc Finger

Marc works in Business Intelligence and Enterprise Data Warehousing (EDW) with a focus on Data Vault 2.0 implementation and coaching. Since 2016, he has been consulting on and implementing Data Vault 2.0 solutions with industry leaders in the manufacturing, energy supply, and facility management sectors. In 2020, he became a Data Vault 2.0 Instructor for Scalefree.

Building a scalable Data Platform? In Data Warehouse, Expert

Mastering Metadata: Data Catalogs in Data Warehousing with DataHub

Mastering Metadata in Data Warehousing

In today’s data-driven world, it is essential to manage and organize large amounts of data efficiently. Businesses across all industries have to contend with more data than ever before. Introducing and developing an enterprise data warehouse naturally plays a central role, but it does not solve one major challenge: how to effectively organize and manage the data, especially the metadata, in an Enterprise Data Warehouse. This is where the concept of data catalogs comes into play and where tools like DataHub become essential.

A data catalog serves as a comprehensive inventory of data assets in an organization, providing context, annotations, and metadata to facilitate the understanding and discovery of data. It’s like a map to your data, helping users navigate the complex data landscape to find the exact data they need.

A data catalog can help users to understand where to find specific data in the data warehouse that fits their needs and to investigate where it came from, as well as how it might be connected to other data. This can greatly simplify tasks like data analysis and reporting, making the data warehouse more accessible and usable for everyone in the organization.

Mastering Metadata: Data Catalogs in Data Warehousing with DataHub

Don’t miss our upcoming webinar about data catalogs! This session will explore in detail the critical role of data catalogs in data warehousing, with an exclusive focus on the powerful tool DataHub. You’ll gain practical insights on enhancing data discovery, metadata management, data lineage, and data governance. Sign up today and transform your data management strategies into a competitive advantage.

Watch webinar recording

In this article:

Understanding Data Catalogs
- What is a Data Catalog?
- Role of a Data Catalog in Data Warehousing
Introduction to DataHub
- Key Features and Capabilities of DataHub
Conclusion

Understanding Data Catalogs

What is a Data Catalog?

In general, a data catalog is a metadata inventory: organized, structured metadata about all data assets in an organization. It is a central place where this metadata can be stored, combined, and categorized, which makes it much easier to discover and understand the corresponding data, for example in a data warehouse. A data catalog also offers search functionality for finding specific data among the indexed datasets. It serves as a single source of truth for your metadata, enabling users to trust the data they use for their analyses or business decisions.
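As a rough illustration of the inventory-plus-search idea, the following Python sketch keeps a few catalog entries in memory and searches their names, descriptions, and tags. The CatalogEntry class, the search function, and the sample entries are assumptions for this example; a real catalog stores far richer metadata and uses a proper search index.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one data asset (a deliberately tiny model)."""
    name: str
    description: str
    tags: set[str] = field(default_factory=set)


# Hypothetical entries for a small Data Vault based warehouse.
CATALOG = [
    CatalogEntry("hub_customer", "Business keys of all customers", {"raw vault"}),
    CatalogEntry("sat_customer_pd", "Personal data attributes", {"raw vault", "gdpr"}),
    CatalogEntry("dim_region", "Region hierarchy for reporting", {"information mart"}),
]


def search(catalog: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Return entries whose name, description, or tags contain the term."""
    needle = term.lower()
    return [
        entry
        for entry in catalog
        if needle in entry.name.lower()
        or needle in entry.description.lower()
        or any(needle in tag.lower() for tag in entry.tags)
    ]


print([entry.name for entry in search(CATALOG, "gdpr")])
```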

Role of a Data Catalog in Data Warehousing

In the context of data warehousing, a data catalog brings many benefits. It provides a way to explore and search all data stored in the data warehouse. Technical and business users alike can discover relevant data, understand its context, and ensure it is up-to-date, reliable, and accurate. The following figure shows where a data catalog fits into data warehousing with Data Vault 2.0. A data catalog should cover the entire Enterprise BI Solution. This also applies, for example, to a data lake, if available, and to the information delivery layer.

Now that we’ve understood what a data catalog is, let’s delve into how each component plays a part in a data catalog and explore how a tool like DataHub can assist organizations in these tasks.

Introduction to DataHub

In the world of data catalogs, DataHub stands out as an increasingly popular choice for many businesses. DataHub is an open-source project originally developed by LinkedIn to address its growing need for a more dynamic and scalable data management tool. It was created in part because the existing tools could not keep up with LinkedIn’s expanding needs.

As LinkedIn grew, so did its data volume, variety, and velocity. Recognizing the need for a more efficient way to manage its data, LinkedIn built DataHub and open-sourced it in 2020. Open-sourcing DataHub allowed other organizations to benefit from this advanced tool, and it has since been adopted by many businesses looking for a modern, scalable data catalog solution.

DataHub supports both push-based and pull-based metadata ingestion and offers a wide range of integrations, for example Airflow, BigQuery, Databricks, dbt, Hive, Kafka, Looker, MSSQL, MongoDB, Oracle, S3, PowerBI, Snowflake, Spark, and many more. You can find a full list here. This gives DataHub the ability to combine and show metadata about the same data from multiple sources, for example a dbt model definition and whether its tests ran successfully, right next to the database schema and statistics for all columns.

Key Features and Capabilities of DataHub

DataHub, as a metadata platform, goes beyond traditional data catalogs. Its key features and capabilities include:

1. Scalability: DataHub is designed to handle metadata from thousands of datasets, which makes it a great choice for large organizations.

2. Flexible and Extensible Data Model: The technical data model behind the tool is designed to be extensible, so organizations can adapt it to their specific business requirements.

3. Powerful Search and Discovery: Leveraging Elasticsearch, DataHub offers robust search functionality that enables users to discover datasets quickly based on various attributes, such as the data’s origin, schema, and usage.

4. Rich Metadata: Unlike traditional data catalogs, DataHub captures and presents a wide variety of metadata, including data lineage, operational metadata, and business metadata. This gives users a comprehensive understanding of their data.

5. Data Lineage and Relationships: DataHub automatically captures and visualizes data lineage, showing how data flows through various systems. It also displays relationships between datasets, allowing users to understand how different data assets interact with each other.
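The lineage idea from the last point can be sketched as a walk over a dependency graph. The Python example below is a simplified illustration, not DataHub's API or data model; the UPSTREAM mapping and the dataset names are hypothetical.

```python
# Hypothetical lineage: each dataset maps to the datasets it is built from.
UPSTREAM = {
    "dim_customer": ["sat_customer", "hub_customer"],
    "sat_customer": ["stg_crm_customer"],
    "hub_customer": ["stg_crm_customer"],
    "stg_crm_customer": ["crm.customer"],
    "crm.customer": [],
}


def upstream_of(dataset: str, lineage: dict[str, list[str]]) -> set[str]:
    """Return every direct and indirect source of the given dataset."""
    seen: set[str] = set()
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        stack.extend(lineage.get(current, []))
    return seen


print(sorted(upstream_of("dim_customer", UPSTREAM)))
```

Walking the graph in the other direction (from a source to everything built on it) answers the impact question: which downstream assets are affected if a source changes.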

Conclusion

Using a data catalog comes with several benefits:

Enhanced Data Discovery: With the search and categorization capabilities of a data catalog, users can quickly find the exact data they need without having to comb through large datasets.
Improved Data Understanding: The metadata in a data catalog provides users with necessary context about the data, making it easier to interpret and use correctly.
Better Compliance and Governance: A data catalog supports data governance initiatives by ensuring data is consistent, accurate, and compliant with relevant regulations.
Increased Trust in Data: By providing transparency into data lineage, a data catalog helps build trust in the data by allowing users to see its history and verify its accuracy and reliability.
Time and Resource Efficiency: By making it easier to locate and understand data, a data catalog can save the company resources, thus speeding up data-driven activities and reducing the burden on data management teams.

In conclusion, DataHub provides a flexible, feature-rich, and all-encompassing option for data catalogs in a data warehousing environment. By providing powerful features for data discovery, metadata management, data lineage, and data governance, it enables businesses to extract maximum value from their data.

If you’re interested in learning more about data catalogs, watch the recording here for free.

– Ole Bause (Scalefree)

Metadata Translation in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke discusses a question from the audience.

“Our EDW should use English entity names for hubs, links, and satellites. However, our sources are in a variety of languages (English, and German mostly). Where is the best option to translate everything into English?”

Michael provides insightful guidance on tackling the challenge of maintaining consistency in entity names across a multilingual landscape. He explores different strategies for translating entity names, weighing the pros and cons of various approaches. Whether to perform the translation at the source level, during the ETL (Extract, Transform, Load) process, or within the EDW itself, Michael offers considerations to help make an informed decision based on the specific needs and characteristics of the project.

The CEO emphasizes the importance of aligning with business objectives and ensuring that the chosen translation strategy aligns with the overall goals of the data warehousing initiative. This episode provides valuable insights and best practices for handling multilingual challenges in Data Vault projects, contributing to the success of your data integration and management endeavors.

Meet the Speaker

Michael Olschimke

Hiding Dimension Members in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a query from the audience, exploring the dynamics of managing data visibility in the DIMENSION information mart.

“How can a record be hidden in the DIMENSION information mart if it is no longer in use? Our Data Warehouse (DWH) features a hierarchy of region, division, and zone, which may undergo splitting or merging multiple times. The challenge is that the deleted event is not signaled from the source side, and only a full refresh captures new hierarchy information. Users desire a consistently current status reflected in both FACT and DIM tables.

1. To handle this, the current relation can be flagged and counted. This approach involves managing the relationship with a counter, allowing for effective tracking and visibility.

2. Additionally, the last relation needs to remain visible in the FACT table, ensuring that historical relationships are retained for reference.”

In this engaging video, Michael elaborates on these strategies, providing insights into maintaining data integrity and visibility within complex hierarchies, while accommodating changes and updates efficiently.

Meet the Speaker

Michael Olschimke

Sampling (DB Subsetting) Production Data in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a pertinent question from the audience, shedding light on best practices for structuring EDW environments.

“In one of the previous webinars (‘EDW Environments’), you mentioned about best practices for creating your EDW environments. Let’s consider a configuration where we have 4 environments, DEV + TST and PRE_PROD + PROD. Moreover, assume that the PROD environment is very heavy in the meaning of data volumes and we simply cannot handle such amounts of data on PRE PROD and TST (data on TST env. will be anonymized). Do you have any advice on how to create lightweight environments from PROD?”

In this insightful video, Michael delves into the complexities of managing EDW environments with varying data volumes. He offers practical advice on creating lightweight versions of the production environment for development, testing, and pre-production stages. The discussion encompasses strategies for data anonymization on the testing environment and optimizing resources to ensure efficiency across different stages of the EDW lifecycle.
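One generic way to build a lightweight environment is to pick a deterministic subset of business keys and keep only the dependent rows that reference them, so the subset stays referentially consistent. The Python sketch below illustrates that idea only; it is not necessarily the approach recommended in the video, and the function names, the customer_id column, and the sample rows are assumptions for this example.

```python
import hashlib


def in_sample(business_key: str, percent: int) -> bool:
    """Deterministically decide whether a business key belongs to the subset."""
    digest = hashlib.md5(business_key.encode("utf-8")).hexdigest()
    # The same key always yields the same bucket, on every run and environment.
    return int(digest, 16) % 100 < percent


def subset(hub_rows: list[dict], satellite_rows: list[dict], percent: int):
    """Keep roughly `percent` of hub rows plus the satellite rows that belong to them."""
    kept_keys = {
        row["customer_id"] for row in hub_rows if in_sample(row["customer_id"], percent)
    }
    hubs = [row for row in hub_rows if row["customer_id"] in kept_keys]
    sats = [row for row in satellite_rows if row["customer_id"] in kept_keys]
    return hubs, sats


# Hypothetical production rows.
HUB = [{"customer_id": f"C{i:04d}"} for i in range(1000)]
SAT = [{"customer_id": f"C{i:04d}", "name": f"Customer {i}"} for i in range(1000)]

small_hub, small_sat = subset(HUB, SAT, percent=10)
print(len(small_hub), len(small_sat))
```

Because the decision depends only on the key, repeated refreshes select the same customers, which keeps test results comparable over time. Anonymization would be a separate step applied to the selected rows.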

Meet the Speaker

Michael Olschimke

Reference Tables With Effectivity Satellites in Data Vault

Watch the Video

In our continuous exploration of Data Vault concepts in the Data Vault Friday series, our CEO Michael Olschimke delves into an intriguing question posed by the audience.

“Do you use Effectivity Satellites also for Reference Data in Reference Satellites?”

This concise yet crucial inquiry prompts Michael to unravel the considerations and best practices associated with leveraging Effectivity Satellites in the context of Reference Data within Reference Satellites.

In this insightful video, Michael shares his expertise, discussing the potential applications and benefits of employing Effectivity Satellites for managing reference data. He sheds light on how this approach can enhance the flexibility and temporal aspects of Reference Satellites, contributing to a more robust and adaptable Data Vault architecture.

Meet the Speaker

Michael Olschimke

Building a scalable Data Platform? In Intermediate

Utilizing Potentials of Data Vault 2.0 – Overcoming Bad Practices – Part 2

Watch the Webinar

What are common mistakes when applying Data Vault 2.0 in enterprise data warehouse projects? Do you have questions about modeling in Data Vault? Is implementing GDPR requirements causing you great difficulty, or is your project stuck because it delivers no business value?

This webinar describes common Data Vault anti-patterns, their consequences, and solutions for eliminating them from your current and future projects.

Tune in and learn more to avoid bad practices and apply simple solutions.

Watch Webinar Recording

Webinar Agenda

1. How to use Data Vault for modeling business information
2. How to avoid the pitfalls of being unable to deliver business value
3. How to mask Business Keys from Hubs for privacy

Meet the Speaker

Lorenz Kindling
