Skip to main content
search
0

Sampling (DB Subsetting) Production Data in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke engages with a pertinent question from the audience, shedding light on best practices for structuring EDW environments.

“In one of the previous webinars (‘EDW Environments’), you mentioned about best practices for creating your EDW environments. Let’s consider a configuration where we have 4 environments, DEV + TST and PRE_PROD + PROD. Moreover, assume that the PROD environment is very heavy in the meaning of data volumes and we simply cannot handle such amounts of data on PRE PROD and TST (data on TST env. will be anonymized). Do you have any advice on how to create lightweight environments from PROD?”

In this insightful video, Michael delves into the complexities of managing EDW environments with varying data volumes. He offers practical advice on creating lightweight versions of the production environment for development, testing, and pre-production stages. The discussion encompasses strategies for data anonymization on the testing environment and optimizing resources to ensure efficiency across different stages of the EDW lifecycle.

Reference Tables With Effectivity Satellites in Data Vault

Watch the Video

In our continuous exploration of Data Vault concepts in the Data Vault Friday series, our CEO Michael Olschimke delves into an intriguing question posed by the audience.

“Do you use Effectivity Satellites also for Reference Data in Reference Satellites?”

This concise yet crucial inquiry prompts Michael to unravel the considerations and best practices associated with leveraging Effectivity Satellites in the context of Reference Data within Reference Satellites.

In this insightful video, Michael shares his expertise, discussing the potential applications and benefits of employing Effectivity Satellites for managing reference data. He sheds light on how this approach can enhance the flexibility and temporal aspects of Reference Satellites, contributing to a more robust and adaptable Data Vault architecture.

Stärken der Datenanalyse Innerhalb von Salesforce

Watch the Webinar

In diesem Webinar erfahren Sie, wie Sie Salesforce-Reports erstellen und nutzen können, um aussagekräftige Einblicke in Ihre KPI zu gewinnen.

Wir zeigen Ihnen bewährte Best Practices für die Erstellung von benutzerdefinierten Berichten und Dashboards sowie praktische Tipps für die effektive Nutzung von Salesforce-Reporting-Tools.

Salesforce unterscheidet sich unter allen anderen CRM-Systemen, da das Reporting auf allen Daten innerhalb des Systems möglich ist. Sie lernen, wie Sie Reports filtern, gruppieren und Diagramme erstellen können.

Dieses Webinar richtet sich sowohl an Anfänger als auch an fortgeschrittene Salesforce-Nutzer, die ihre Reporting-Fähigkeiten verbessern möchten.

Watch Webinar Recording

Webinar Agenda

1. Reports In Salesforce
2. Vorteile
3. Warum Nutzen
4. Demos
5. Fazit

Data Vault 2.0 with Hadoop and Hive/Spark

Hadoop and Hive/Spark in Data Vault 2.0

In this article, you’ll receive an overview of what Hadoop and Hive is and why they can be used as an alternative to traditional databases.

Data Vault 2.0 with Hadoop and Hive/Spark

This webinar delves into the ins and outs of Hadoop and Hive, including what they are and how they communicate. The second part of the presentation focuses on a Data Vault 2.0 example architecture using batch loading, providing participants with insights into how a sample can look like to provide value in real-world scenarios. Whether you are a seasoned data professional or just starting out, this webinar is an invaluable resource for anyone seeking to learn more about Hadoop. So if you are looking to expand your knowledge of these technologies and explore their potential in the world of data analytics, this webinar is not to be missed.

Watch webinar recording

Hadoop

Hadoop is used to process and analyze large volumes of data efficiently by distributing the workload across a cluster of commodity hardware, enabling parallel processing and providing fault tolerance through its distributed file system and resource management framework.

HDFS – Hadoop Distributed File System

HDFS is a distributed file system that provides reliable and scalable storage for big data. It breaks large files into blocks and distributes them across a cluster of commodity hardware. HDFS ensures data reliability and availability through data replication.

Yet Another Resource Negotiator – YARN

YARN provides a flexible, scalable resource management framework for Hadoop, enabling a variety of applications and workloads to coexist and efficiently utilize the cluster’s resources. It abstracts the underlying infrastructure and allows for the dynamic allocation of resources based on application requirements.

MapReduce – MR

MapReduce is a programming model and processing framework for distributed data processing in Hadoop. It allows for parallel processing of large datasets by dividing the workload into maps, reducing tasks. Map tasks process data in parallel and the output is combined and reduced to produce the final result.

Hadoop Common

Hadoop Common provides libraries, utilities, and infrastructure support for the other components of Hadoop. It includes common utilities, authentication mechanisms, and interfaces that are used by various Hadoop modules.

What is the benefit?

Scalability
Hadoop enables the storage and processing of massive amounts of data by scaling horizontally across a cluster of commodity hardware. It can handle petabytes of data without sacrificing performance.

Distributed Computing
Hadoop distributes data and processing tasks across multiple nodes in a cluster, allowing for parallel processing and faster data analysis. This distributed computing model enables efficient utilization of resources and enables high-performance data processing.

Fault Tolerance
Hadoop provides fault tolerance by replicating data across multiple nodes in the cluster. If a node fails, data can still be accessed from other replicas, ensuring data reliability and availability.

Cost-Effectiveness
Hadoop is designed to run on inexpensive commodity hardware, making it a cost-effective solution for storing and processing large volumes of data. It eliminates the need for expensive specialized hardware.

Flexibility and Extensibility
Hadoop’s modular architecture allows for integration with various tools and frameworks within the Hadoop ecosystem, providing flexibility and extensibility. It supports a wide range of data processing tasks, including batch processing, real-time processing, machine learning, and more.

Data Locality
Hadoop’s distributed file system, HDFS, aims to bring the computation closer to the data. By processing data where it is stored, Hadoop minimizes data movement across the network, reducing latency and improving overall performance.

Ecosystem and Community
Hadoop has a rich ecosystem with a wide range of tools, libraries, and frameworks that extend its functionality for different use cases. It also has a large and active community of users, developers, and contributors, providing support, resources, and continuous improvement.

These benefits make Hadoop a powerful, popular solution for handling big data, enabling organizations to efficiently store, process, and gain insights from vast amounts of structured and unstructured data. The whole ecosystem can also run on-premise, which can make it a good alternative if ‘cloud’ is not an option.

HIVE

Hive is a data warehouse infrastructure built on top of Hadoop that provides a high-level SQL-like query language called HiveQL for querying and analyzing large datasets.

What are the components?

Data Storage
Hive leverages Hadoop Distributed File System (HDFS) as its underlying storage system. It stores data in HDFS in a distributed and fault-tolerant manner, allowing for scalable, reliable data storage.

Schema Definition
Hive allows users to define a schema for their data using a language called Hive Data Definition Language, like DDL. This allows users to define tables, partitions, columns, data types, and other metadata associated with the data.

Query Optimization
Hive optimizes queries by performing query planning and optimization techniques. It aims to generate efficient query execution plans to minimize data movement, optimize resource utilization, and improve query performance.

Hive Metastore
Hive maintains a metadata repository called the Hive Metastore. It stores information about the tables, partitions, schemas, and other metadata associated with the data stored in HDFS. The metastore allows for efficient metadata management and retrieval during query processing.

Extensibility
Hive offers extensibility through User-Defined Functions (UDFs), User-Defined Aggregations (UDAs), and User-Defined Table Functions (UDTFs). These allow users to define custom logic and operations in programming languages like Java, Python, or other supported languages.

Integration with other tools
Hive integrates with various other tools and frameworks in the Hadoop ecosystem. For example, it can work alongside Apache Spark, Apache Pig, Apache HBase, and other components to provide a complete data processing and analytics solution.

Partitioning and Bucketing
Hive supports data partitioning and bucketing, allowing users to organize and store data in a structured manner. Partitioning involves dividing data into logical partitions based on specific criteria, while bucketing involves dividing data into equally sized buckets based on hash values.

SerDe
Hive uses a serialization/deserialization framework called SerDe (Serializer/Deserializer) to read and write data in different formats, such as CSV, JSON, Avro, and more. Users can specify the appropriate SerDe for their data format to ensure proper data processing.

Overall, Hive simplifies data querying and analysis on Hadoop by providing a familiar SQL-like interface. It abstracts the complexity of writing low-level MapReduce or Tez jobs and provides a declarative and user-friendly approach to interact with large-scale data stored in Hadoop.

Conclusion

Hadoop is a robust and feature-rich environment that can be challenging to manage. However, its numerous advantages make it a compelling choice, depending on the user’s needs and the available in-house expertise. If you’re interested in learning more about it, watch the following recording.

Utilizing Potentials of Data Vault 2.0 – Overcoming Bad Practices – Part 2

Watch the Webinar

What are common mistakes when applying Data Vault 2.0 in enterprise data warehouse projects? Do you have questions regarding modeling in Data Vault and the realization of GDPR causes you great difficulties or is your project stuck because you are delivering no business value?

This webinar describes common Anti-patterns of Data Vault, their consequences, and the solution to eliminate them from your current or in your future projects.

Tune in and learn more to avoid bad practices and apply simple solutions.

Watch Webinar Recording

Webinar Agenda

1. How to use Data Vault for modeling business information
2. How to avoid the pitfalls of being unable to deliver business value
3. How to mask Business Keys from Hubs for privacy

Get Started with Real-Time Processing in Data Vault 2.0 on Microsoft Azure

Data Vault 2.0 on Microsoft Azure

In this newsletter, you’re going to get an overview of what real-time processing is and what possibilities it can provide your Data Vault 2.0 implementation.

Real-Time Processing with Data Vault 2.0 on Azure

In this webinar, we’ll discuss the new data warehouse requirements for data and explore
Real-Time processing. We’ll cover various Real-Time processing architectures for an initial overview. The second part focuses on Real-Time data architecture with Data Vault 2.0 and includes a brief overview of Microsoft Azure. You’ll also see a Real-Time processing implementation of Data Vault 2.0 in Azure. This webinar is for anyone new to Real-Time Data with Data Vault 2.0 and interested in an overview and implementation in Azure.

Watch Webinar Part 1Watch Webinar Part 2

What to expect

You will learn that real-time processing gives you the ability to create value out of data quicker, have the most up-to-date data in your reporting tools and allow more accurate decisions regarding data.
With that, your company will be able to adapt to changes in the market quicker by seeing developments right away with the most recent data.

Additionally, you can save costs by moving away from batch loading because the peak of computing power normally required for that gets reduced and is more evenly distributed throughout the day. That is especially the case when using cloud environments, because then it’s possible to replace promised environments and contribute the needed computing power perfectly.

The traditional way – batch-loading

Batch loading is a traditional method used to load data into a data warehouse system in large batches, mostly overnight. The data from data sources is delivered up to a certain time in the night to be transformed and loaded into the core data warehouse layer.

This method leads to a peak of data processing overnight, and organizations have to adjust their infrastructure needs to be able to deal with the expected maximum peak of required computing power.

The new way – real-time data

Real-time data is processed and made available immediately as it is generated, instead of being loaded in batches overnight. When using real-time approaches, the loading window is extended to 24 hours. So the overnight peak and its disadvantages are gone.
When using real-time data, it’s always modeled as a non-historized link or as a satellite.

Possible use cases for real-time data are vital monitoring in the healthcare industry, inventory tracking, user behavior on social media or production line monitoring.

Different types of real-time data

There are different types of real-time data based on how frequently the data is loaded and the degree of urgency or immediacy of the data.

Near real-time data refers to data that is loaded in mini-batches at least every fifteen minutes, with the data stored in a cache until it is loaded into the data analytics platform.
Actual real-time data, also called message streaming, involves loading every single message directly into the data analytics platform without any cache.
This type of real-time data is useful when it is important to have data available as soon as it is generated for dashboards or further analytics.

The acceptable processing delay for real-time data is typically defined by the consequences of missing a deadline. Additionally, there are three types of real-time systems: hard real-time, soft real-time, and firm real-time.

Real-time processing types

Implementing real-time processing

So, how do you implement real-time data processing into your data warehouse solution? There are many architectures for that, but we will focus on the Lambda and Data Vault 2.0 architecture.

Generic real-time processing architecture

The lambda architecture separates data processing into a speed layer and a batch layer. The speed layer processes real-time messages with a focus on speed and throughput, while the batch layer provides accuracy and completeness by processing high volumes of data in regular batches. The serving layer integrates data from both layers for presentation purposes.

At first, the Data Vault 2.0 architecture seems to be similar to the lambda architecture, but it treats some aspects differently. The lambda architecture has issues from a Data Vault 2.0 perspective, such as implementing a single layer in each data flow and lacking a defined layer for capturing raw, unmodified data for auditing purposes.

The Data Vault 2.0 architecture adds a real-time part called “message streaming” to the existing batch-driven architecture, with multiple layers implemented for capturing and processing real-time data, integrating it with the batch-driven flow at multiple points. Messages are pushed downstream from the publisher to the subscriber, loaded into the Raw Data Vault and forked off into the data lake. But the main process is the push inside the message streaming area. The architecture is able to integrate data from batch feeds or to stream the real-time data directly into the dashboard.

Using Microsoft Azure for real-time processing

Microsoft Azure is a cloud computing platform and set of services offered by Microsoft. It provides a variety of services, including virtual machines, databases, analytics, storage, and networking. These services can be used to create web and mobile applications, run large-scale data processing tasks, store and manage data, host websites and much more.

Microsoft Azure for real-time processing

The illustration describes a typical real-time architecture used by Scalefree consultants, which follows the conceptual Data Vault 2.0 architecture.

Data sources deliver data either in batches or real-time. This is loaded into the Azure Data Lake or accepted by the Event Hub beforehand. The Raw Data Vault Loader separates business keys, relationships and descriptive data using Stream Analytics and forwards the message to the Business Vault processor. The Business Vault processor applies transformation and other business rules to produce the target message structure for consumption by the (dashboarding) application. The results can be loaded into physical tables in the Business Vault on Synapse or be delivered in real-time without further materialization in the database. The target message is generated and sent to the real-time information mart layer implemented by a streaming dataset, which is consumed by PowerBI. The cache of the dashboard service will expire quickly, but the Synapse database has all data available for other uses, including strategic, long-term reporting.

Conclusion

In conclusion, real-time data processing offers numerous benefits over traditional batch loading methods, including the ability to create value out of data quicker, have the most up-to-date information in reporting tools, and make more accurate decisions. By adapting to changes in the market quicker, companies can stay ahead of the competition. Moving away from batch loading can also save costs by reducing the peak of computing power required.

As mentioned before, the last illustration shows an architecture that the Scalefree Consultants implemented to make use of real-time data.

Read more on our recently released Microsoft Blog Article.

How is your current experience with real-time data processing?
Are you thinking about kick-starting your Data Vault by also using real-time data?
Or are you already using it and thinking about improving it further?

Let us know your thoughts in the comment section!

Data Vault on Databricks

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a pertinent question raised by the audience, unraveling the discourse around the compatibility of Data Vault 2.0 (DV2.0) with Databricks.

“There has been hype going on on LinkedIn about whether or not DV2.0 is suited to exist on Databricks. Many people disagree that it is. The most significant comments are ‘lots of joins,’ ‘performance getting data out,’ and ‘not suited for modern automation.’ The latter ties to tools creating generated code per object VS. parameterized pipelines.”

In this illuminating video, Michael delves into the discussions surrounding the suitability of Data Vault 2.0 in the Databricks environment. He provides insights into the concerns raised, such as the perceived challenges related to joint operations, data retrieval performance, and the alignment with modern automation practices.

Michael offers a balanced perspective, exploring the nuances of utilizing DV2.0 on Databricks and addressing the key considerations raised in the LinkedIn discussions.

Warum Eigentlich Salesforce?

Watch the Webinar

In diesem Webinar geht es um die Frage “Warum eigentlich Salesforce?”. Wir werden uns die 3 Hauptvorteile des CRM-Systems von Salesforce ansehen: Integrierbarkeit, Erweiterbarkeit und Anpassbarkeit. Erfahren Sie, wie Salesforce Ihnen dabei helfen kann, Ihr Unternehmen zu transformieren und Ihre digitalen Prozesse zu optimieren.

Sie werden verstehen, wie Salesforce Ihre Arbeitsabläufe nahtlos integrieren und automatisieren kann, um Ihnen Zeit und Ressourcen zu sparen. Wir zeigen Ihnen, wie einfach und schnell es ist, Salesforce zu erweitern und anzupassen, um den spezifischen Bedürfnissen Ihres Unternehmens gerecht zu werden.

Melden Sie sich jetzt an und erfahren Sie, warum Salesforce die beste Wahl für die digitale Transformation Ihres Unternehmens ist.

Watch Webinar Recording

Webinar Agenda

1. CRM Systeme in der digitalen Transformation → Shared Customer Insight (nach Jeanne Ross)
2. Warum Salesforce so gut passt. (Salesforce order MS Dynamics in Spitzengruppe)
3. Grund 1 Integrierbarkeit
4. Grund 2 Erweiterbarkeit
5. Grund 3 Anpassbarkeit

Multi-temporal Source Data (Sap Hrms) in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke explores a valuable question from the audience, shedding light on the intricacies of modeling an SAP HRMS source with SCD type 2 data and dealing with time-dependent information in Data Vault 2.0.

“Could you please guide us on how to model an SAP HRMS Source that holds the data in SCD type 2 in the source itself with an effectivity start date and end date for each change? What will be the best way to deal with time-dependent data in Data Vault 2.0?”

In this enlightening video, Michael provides practical guidance on modeling strategies for incorporating SAP HRMS source data with Slowly Changing Dimension (SCD) type 2 attributes directly in the source. He addresses the complexities of handling time-dependent data within the Data Vault 2.0 framework, offering insights into the best practices for managing effectivity start and end dates for each change.

Michael shares valuable considerations and recommendations, providing a clear roadmap for efficiently handling time-dependent data scenarios in Data Vault 2.0 projects.

Quick Guide of a Data Vault 2.0 Implementation

Data Vault 2.0 Architecture

Data Vault 2.0 Implementation

Data Vault 2.0 is often assumed to be only a modeling technique, but it encompasses much more than that. Not only that, but it is a whole BI solution composed of agile methodology, architecture, implementation, and modeling.

So why start using Data Vault?

  • Data Vault 2.0 allows you to build automated loading processes/patterns and generate models very easily
  • Platform independence
  • Auditability 
  • Scalability
  • Supports ELT instead of ETL processes

Now that we answered the why, you may be wondering what steps are needed to implement Data Vault 2.0 in your project.

It depends on a lot of factors like your business case, the architecture you want to have in place, how your sources are loaded, the sprint timeline of your project, etc.

Walk-through of a Data Vault 2.0 Implementation

It can be a bit overwhelming for beginners to start using Data Vault 2.0 and how and where to implement it. In this webinar, a very basic guide will be provided showing the steps needed for making a Data Vault 2.0 implementation based on a business requirement from scratch. This will be done with a demonstrated example, and it starts from the gathering of some sample requirements to the finished delivered product.

Watch Webinar Part 1Watch Webinar Part 2

Data Vault 2.0 feature by feature architecture

One thing is for sure: the architecture should be built vertically, not horizontally. This means not layer by layer but feature by feature. 

A common approach here is the Tracer Bullet approach. Based on business value, which is defined by a report, a dashboard, or an information mart, the source data needs to be identified, modeled, and loaded through all layers of the architecture. 

For example, let’s say the business request was to build a dashboard to analyze the company’s sales:

1. Extract

First thing, we need to extract the data from the source systems and load the data as it is somewhere. In this example, we put it in a Transient Staging Area but you could choose a persistent one in a Data Lake as well.

2. Transform

Next, you should apply some hard rules if necessary, be careful with this as you do not want to make business calculations here, using a transformation tool. There are a lot of different data warehouse automation tools that you can choose from: dbt, Coalesce, WhereScape, etc.

Data Vault 2.0 Architecture

3. Load

Load your Raw Stage into the Raw Vault.

4. Model Business requirements

Model the Data Vault entities needed for the business requirement to be fulfilled. If we have some Sales transactions and customers data, for example, we will model a Non-historized Link, also known as Transactional Link, and a Customer Hub, along with any additional Satellites for holding the Customer descriptive data that we want to see in the Sales Dashboard in the end.

5. Apply Business Logic

Next, we need some calculations and aggregations to be performed, so we will build some business logic on top of the raw entities, loading it into the business vault.

6. Build an Information Mart

Now, we could directly use the data stored in the Raw and Business Vault into charts/dashboards, but we want to structure the data, so it can be easily read and fetched by business users, so we will build an information mart with a star schema model with a fact table and dimensions.

7. Visualize Data

To build the Sales Dashboard in a BI visualization tool like PowerBI or Tableau, we now fetch directly from the star schema in the information mart, which has all the information we need, using a connection to my data warehouse in our database.

Data Vault 2.0 offers an agile, scalable, and flexible approach to Data Warehousing Automation. As demonstrated in the example, we only modeled the Data Vault tables that were necessary for accomplishing the handed task of building a Sales dashboard. This way you can scale up your business by demand, so you don’t have to figure out and map out the whole enterprise in one go. 

The answer to how to implement Data Vault 2.0 can be translated into a simple phrase: Focus on business value!

If you would like to see an explanation of this step-by-step implementation with some demonstration of actual data using dbt as the chosen transformation tool, check out the webinar recording.

Conclusion

Implementing Data Vault 2.0 involves a structured approach that begins with extracting data from source systems into a staging area, followed by minimal necessary transformations, and loading into the Raw Vault. Subsequently, business requirements guide the modeling of Data Vault entities, application of business logic, construction of information marts, and data visualization. This feature-by-feature methodology ensures scalability and flexibility, allowing organizations to focus on delivering business value incrementally. By aligning development efforts with specific business needs, enterprises can efficiently build and expand their data warehousing solutions.

EDW Environments in Data Vault

Watch the Video

In our ongoing Data Vault Friday series, our CEO Michael Olschimke addresses a crucial question from the audience that highlights a common challenge in data projects.

“I’m currently working on a project where the ‘environments’ (Dev, Prod, Test) are not well administrated. This topic is not mentioned at all in the DV2.0 methodology. Could you please elaborate on the roles of these environments and how to correctly use and manage them? As context, the problem faced at the moment by the company is that they’re not being able to test correctly and then implement. Also, the environments don’t necessarily count with the same information.”

In this insightful video, Michael provides a comprehensive discussion on the roles and importance of environments (Development, Production, Test) in the context of Data Vault 2.0 methodology. He addresses the challenges faced by the company, emphasizing the critical role that well-administered environments play in testing, implementing, and ensuring data consistency across different stages.

Michael shares practical insights into the correct utilization and management of environments, offering guidance on establishing a robust environment strategy within the Data Vault framework.

Automating Business Logic

Watch the Webinar

In this webinar, you’ll learn that Data Vault automation is not restricted to loading data, but can also be applied to the presentation layer.

There’s always some repeatable business logic – think of calculations such as currency conversion, Lifetime Value (LTV), or Net Present Value (NPV) – to feed different reports, even if all of them contain different information.

We’ll explain how you can create custom business templates and add additional layers in the information marts, to apply calculations repeatedly and even interdependently, thereby extending the scope of Data Vault automation from integration to presentation.

This webinar focuses on practical solutions.

Watch Webinar Recording

Webinar Agenda

1. How to get data out of a Data Vault.
2. What’s a PIT, what’s a bridge?
3. What’s meant by virtualization?
4. How to identify low-hanging fruits, i.e. the repeatable business logic in your solution.
5. How to automate those business rules using VaultSpeed.

Close Menu