Marc Winkelmann

Marc Winkelmann In Beginner, Data Warehouse

Real-Time Data Warehousing and Business Intelligence with Data Vault 2.0 and AWS Kinesis

Data is the fuel of the digital economy. However, its true value is realized only when it is processed quickly, reliably, and structured for analysis and reporting. Real-time data streaming enables companies to make data-driven decisions instantly. Data Vault 2.0 combined with AWS Kinesis provides a future-proof solution for efficiently processing and storing large volumes of data in modern data warehousing and BI environments.

Realtime on AWS with Data Vault 2.0

Join our webinar on March 18th, 2025, 11 am CET, and learn how to build a scalable, real-time data architecture on AWS. We’ll cover AWS infrastructure for real-time data, applying Data Vault 2.0 in real-time scenarios, and showcase a live demo with a real-world use case.

Watch Webinar Recording

In this article:

Why Real-Time Data Streaming for Data Warehousing and BI?
Data Vault 2.0 as the Foundation for Real-Time Data Warehousing
AWS Kinesis: Real-Time Data for Your Data Warehouse
Conclusion: Future-Proof BI and Data Warehousing with Real-Time Streaming

Why Real-Time Data Streaming for Data Warehousing and BI?

In today’s fast-paced business environment, timely access to accurate data is essential for making informed decisions. Traditional batch processing methods can no longer keep up with the need for real-time insights, often resulting in outdated reports and slow reaction times. Real-time data streaming solves this problem by enabling continuous data integration, allowing companies to analyze and act on fresh data as it arrives. This shift not only improves operational efficiency but also enhances overall business intelligence strategies by ensuring that the most up-to-date information is always available.

Data Vault 2.0 as the Foundation for Real-Time Data Warehousing

As organizations deal with increasing volumes of data from multiple sources, they need a flexible and scalable approach to data modeling. Data Vault 2.0 provides a strong foundation for real-time data warehousing by offering a structured yet adaptable methodology. Unlike traditional data models, which can be rigid and difficult to modify, Data Vault 2.0 absorbs new sources and requirements by adding new hubs, links, and satellites rather than restructuring existing tables. By leveraging Data Vault 2.0, companies can build a resilient and future-proof data warehouse capable of handling real-time data streams.

AWS Kinesis: Real-Time Data for Your Data Warehouse

Processing real-time data at scale requires a robust infrastructure, and AWS Kinesis is built precisely for this purpose. It enables businesses to collect, process, and analyze real-time data streams, so that data warehouses remain continuously updated. By reducing the latency between an event and its availability for analysis, companies can generate insights in near real time, leading to faster decision-making and improved operational performance. Furthermore, AWS Kinesis integrates with widely used data warehouse platforms such as Amazon Redshift and Snowflake, making it a valuable component for modern data architectures. Its scaling capabilities support cost efficiency by adjusting resource consumption to actual demand. Additionally, Kinesis includes security features such as encryption and access control, helping to protect sensitive data and support compliance with industry regulations.
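To make the "collect" step more concrete, here is a minimal Python sketch of a producer that writes one event to a Kinesis data stream. It assumes a client object that exposes `put_record` with the keyword arguments `StreamName`, `Data`, and `PartitionKey`, as the boto3 Kinesis client does. The stream name, the event fields, and the helper names `build_record` and `send_event` are illustrative assumptions, not part of the original article.

```python
import json
from typing import Any, Protocol


class KinesisLikeClient(Protocol):
    """Any object offering put_record with the keyword arguments used below."""

    def put_record(self, *, StreamName: str, Data: bytes, PartitionKey: str) -> dict: ...


def build_record(event: dict[str, Any], key_field: str) -> dict[str, Any]:
    """Serialize one event into the Data and PartitionKey arguments."""
    return {
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        # Records sharing a partition key are routed to the same shard.
        "PartitionKey": str(event[key_field]),
    }


def send_event(
    client: KinesisLikeClient, stream_name: str, event: dict[str, Any], key_field: str
) -> dict:
    """Write a single event to the stream and return the service response."""
    return client.put_record(StreamName=stream_name, **build_record(event, key_field))
```

With boto3 installed and AWS credentials configured, the client could be created with `boto3.client("kinesis")` and passed to `send_event` together with the name of an existing stream. Passing the client in as a parameter keeps the sketch testable without an AWS account.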

Conclusion: Future-Proof BI and Data Warehousing with Real-Time Streaming

Companies that embrace real-time data processing benefit from faster BI analysis, lower costs, and greater scalability. Data Vault 2.0 combined with AWS Kinesis offers a powerful, future-proof solution for modern data warehousing architectures. By enabling seamless integration of real-time data, businesses can react instantly to market changes, optimize their operations, and stay ahead of the competition.

Investing in real-time data streaming is not just about speed. It is about building a resilient and adaptive data infrastructure that grows with your business. Organizations that leverage these technologies today can gain a significant competitive edge in an increasingly data-driven world. Leverage real-time streaming for BI and maximize the value of your data!

Marc Winkelmann In Salesforce Services

Salesforce Account Engagement and Domain Management

Domain Management Within Salesforce Account Engagement

In this video guide, we explore the essential components of domain management within Salesforce Account Engagement, highlighting their critical roles and why they are vital for your email marketing strategy.

Additionally, the guide underscores the significance of mastering three key aspects: (1) email sending domains, (2) tracker domains, and (3) tracking code. By understanding and implementing these elements, you can significantly enhance your email deliverability, security, and overall campaign performance.

This guide is designed for Salesforce marketers, administrators, and key users who are working or planning to work with Salesforce Account Engagement to optimize their domain management practices.

Domain Management in Account Engagement

Domain management in Salesforce Account Engagement involves the configuration and maintenance of domains used for email marketing and tracking purposes.

This includes authenticating email sending domains to ensure emails are properly delivered and not flagged as spam, setting up custom tracker domains to maintain brand consistency and improve deliverability, and managing tracking codes to accurately monitor and analyze user interactions.

Effective domain management enhances email security, optimizes deliverability rates, and ensures a professional and trustworthy user experience.

Key Takeaways

Discover the compelling reasons behind adopting four essential best practices for domain management in Salesforce Account Engagement. These practices, detailed in the video, play a critical role in enhancing email deliverability, security, and brand consistency.

By understanding and implementing these practices, users can optimize their email marketing strategy for sustained success:

1. Align Return Path with Mail-from Address
2. Enable HTTPS for Tracker Domains
3. Monitor Domain Reputation
4. Use Custom Tracking Domains for Brand Consistency
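
As an illustration of the first practice, the following Python sketch checks whether the domain of a Return-Path address lines up with the domain of the From address. It is a deliberately simplified assumption: it treats two domains as aligned when they are equal or one is a subdomain of the other, whereas real mail receivers determine the organizational domain with a public suffix list. The function names and example addresses are hypothetical.

```python
from email.utils import parseaddr


def domain_of(address: str) -> str:
    """Return the lower-cased domain part of an address or header value."""
    _, addr = parseaddr(address)
    return addr.rpartition("@")[2].lower()


def is_aligned(from_header: str, return_path: str) -> bool:
    """True if both domains are equal or one is a subdomain of the other."""
    from_domain = domain_of(from_header)
    bounce_domain = domain_of(return_path)
    if not from_domain or not bounce_domain:
        return False
    return (
        from_domain == bounce_domain
        or bounce_domain.endswith("." + from_domain)
        or from_domain.endswith("." + bounce_domain)
    )
```

For example, `is_aligned("Acme <news@example.com>", "bounce@mail.example.com")` evaluates to `True`, while a Return-Path on an unrelated domain would not be considered aligned.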

Target Audience

Designed for Salesforce marketers, administrators, and users, this guide encourages the effective management of domains within Salesforce Account Engagement.

By highlighting the importance of key practices, the video aims to provide a deeper understanding of why mastering domain management is crucial for achieving optimal email deliverability, security, and campaign performance within the Salesforce environment.

Watch the Video

Marc Winkelmann In Data Tools, Intermediate

Data Vault 2.0 with Hadoop and Hive/Spark

Hadoop and Hive/Spark in Data Vault 2.0

In this article, you’ll receive an overview of what Hadoop and Hive are and why they can be used as an alternative to traditional databases.

Data Vault 2.0 with Hadoop and Hive/Spark

This webinar delves into the ins and outs of Hadoop and Hive, including what they are and how they communicate. The second part of the presentation focuses on a Data Vault 2.0 example architecture using batch loading, giving participants insight into what a sample can look like and how it provides value in real-world scenarios. Whether you are a seasoned data professional or just starting out, this webinar is a useful resource for anyone seeking to learn more about Hadoop. So if you are looking to expand your knowledge of these technologies and explore their potential in the world of data analytics, this webinar is not to be missed.

Watch webinar recording

In this article:

Hadoop
HIVE
- What are the components?
Conclusion

Hadoop

Hadoop is used to process and analyze large volumes of data efficiently by distributing the workload across a cluster of commodity hardware, enabling parallel processing and providing fault tolerance through its distributed file system and resource management framework.

HDFS – Hadoop Distributed File System

HDFS is a distributed file system that provides reliable and scalable storage for big data. It breaks large files into blocks and distributes them across a cluster of commodity hardware. HDFS ensures data reliability and availability through data replication.
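
The effect of block splitting and replication on storage can be estimated with a little arithmetic. The sketch below assumes a block size of 128 MB and a replication factor of 3, which are common defaults for the `dfs.blocksize` and `dfs.replication` settings. Both values are configurable, and the function name is only illustrative.

```python
import math

BLOCK_SIZE_MB = 128      # assumed block size (dfs.blocksize)
REPLICATION_FACTOR = 3   # assumed replication factor (dfs.replication)


def hdfs_footprint(file_size_mb: float) -> tuple[int, float]:
    """Return (number of blocks, raw storage in MB) for a single file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # A partially filled last block only occupies the space it actually holds,
    # so raw usage is the file size multiplied by the number of replicas.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_mb
```

Under these assumptions, a 1,000 MB file is split into 8 blocks and occupies 3,000 MB of raw cluster storage.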

Yet Another Resource Negotiator – YARN

YARN provides a flexible, scalable resource management framework for Hadoop, enabling a variety of applications and workloads to coexist and efficiently utilize the cluster’s resources. It abstracts the underlying infrastructure and allows for the dynamic allocation of resources based on application requirements.

MapReduce – MR

MapReduce is a programming model and processing framework for distributed data processing in Hadoop. It allows for parallel processing of large datasets by dividing the workload into map tasks and reduce tasks. Map tasks process data in parallel, and their output is grouped by key and then reduced to produce the final result.
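
The classic word count illustrates the model. The following Python sketch runs the map, shuffle, and reduce phases in a single process, purely to show the data flow. It does not use Hadoop, and all function names are illustrative.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster, each map task would run on a different block of the input, and the shuffle would move the grouped pairs across the network to the reduce tasks.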

Hadoop Common

Hadoop Common provides libraries, utilities, and infrastructure support for the other components of Hadoop. It includes common utilities, authentication mechanisms, and interfaces that are used by various Hadoop modules.

What is the benefit?

Scalability
Hadoop enables the storage and processing of massive amounts of data by scaling horizontally across a cluster of commodity hardware. It can handle petabytes of data without sacrificing performance.

Distributed Computing
Hadoop distributes data and processing tasks across multiple nodes in a cluster, allowing for parallel processing and faster data analysis. This distributed computing model makes efficient use of resources and enables high-performance data processing.

Fault Tolerance
Hadoop provides fault tolerance by replicating data across multiple nodes in the cluster. If a node fails, data can still be accessed from other replicas, ensuring data reliability and availability.

Cost-Effectiveness
Hadoop is designed to run on inexpensive commodity hardware, making it a cost-effective solution for storing and processing large volumes of data. It eliminates the need for expensive specialized hardware.

Flexibility and Extensibility
Hadoop’s modular architecture allows for integration with various tools and frameworks within the Hadoop ecosystem, providing flexibility and extensibility. It supports a wide range of data processing tasks, including batch processing, real-time processing, machine learning, and more.

Data Locality
Hadoop’s distributed file system, HDFS, aims to bring the computation closer to the data. By processing data where it is stored, Hadoop minimizes data movement across the network, reducing latency and improving overall performance.

Ecosystem and Community
Hadoop has a rich ecosystem with a wide range of tools, libraries, and frameworks that extend its functionality for different use cases. It also has a large and active community of users, developers, and contributors, providing support, resources, and continuous improvement.

These benefits make Hadoop a powerful, popular solution for handling big data, enabling organizations to efficiently store, process, and gain insights from vast amounts of structured and unstructured data. The whole ecosystem can also run on-premises, which can make it a good alternative if the cloud is not an option.

HIVE

Hive is a data warehouse infrastructure built on top of Hadoop that provides a high-level SQL-like query language called HiveQL for querying and analyzing large datasets.

What are the components?

Data Storage
Hive leverages Hadoop Distributed File System (HDFS) as its underlying storage system. It stores data in HDFS in a distributed and fault-tolerant manner, allowing for scalable, reliable data storage.

Schema Definition
Hive allows users to define a schema for their data using the Hive Data Definition Language (DDL). This allows users to define tables, partitions, columns, data types, and other metadata associated with the data.

Query Optimization
Hive optimizes queries by performing query planning and optimization techniques. It aims to generate efficient query execution plans to minimize data movement, optimize resource utilization, and improve query performance.

Hive Metastore
Hive maintains a metadata repository called the Hive Metastore. It stores information about the tables, partitions, schemas, and other metadata associated with the data stored in HDFS. The metastore allows for efficient metadata management and retrieval during query processing.

Extensibility
Hive offers extensibility through User-Defined Functions (UDFs), User-Defined Aggregate Functions (UDAFs), and User-Defined Table-Generating Functions (UDTFs). These allow users to define custom logic and operations in programming languages like Java, Python, or other supported languages.

Integration with other tools
Hive integrates with various other tools and frameworks in the Hadoop ecosystem. For example, it can work alongside Apache Spark, Apache Pig, Apache HBase, and other components to provide a complete data processing and analytics solution.

Partitioning and Bucketing
Hive supports data partitioning and bucketing, allowing users to organize and store data in a structured manner. Partitioning involves dividing data into logical partitions based on specific criteria, while bucketing involves dividing data into equally sized buckets based on hash values.

SerDe
Hive uses a serialization/deserialization framework called SerDe (Serializer/Deserializer) to read and write data in different formats, such as CSV, JSON, Avro, and more. Users can specify the appropriate SerDe for their data format to ensure proper data processing.
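
The partitioning and bucketing described above can be sketched in a few lines of Python. Hive stores each partition in its own directory named `column=value`, and it assigns a row to a bucket by hashing the bucketing column modulo the number of buckets. The sketch uses CRC32 as a stand-in hash, which is an assumption for illustration. Hive uses its own hash function, so actual bucket numbers will differ. Table and column names are hypothetical.

```python
import zlib


def partition_path(table: str, partition_column: str, value: str) -> str:
    """Build the relative directory of one partition (column=value)."""
    return f"{table}/{partition_column}={value}"


def bucket_for(key: str, num_buckets: int) -> int:
    """Assign a row to a bucket: hash of the bucketing column modulo bucket count."""
    # CRC32 is only a stand-in for Hive's internal hash function.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


row = {"customer_id": "C-1001", "order_date": "2024-01-31"}
path = partition_path("sales", "order_date", row["order_date"])
bucket = bucket_for(row["customer_id"], num_buckets=4)
```

Here `path` is `sales/order_date=2024-01-31`. A query that filters on `order_date` only needs to read the matching directory, and rows with the same `customer_id` always land in the same bucket.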

Overall, Hive simplifies data querying and analysis on Hadoop by providing a familiar SQL-like interface. It abstracts the complexity of writing low-level MapReduce or Tez jobs and provides a declarative and user-friendly approach to interact with large-scale data stored in Hadoop.

Conclusion

Hadoop is a robust and feature-rich environment that can be challenging to manage. However, its numerous advantages make it a compelling choice, depending on the user’s needs and the available in-house expertise. If you’re interested in learning more about it, watch the following recording.

Marc Winkelmann In Beginner, Data Vault

Quick Guide of a Data Vault 2.0 Implementation

Data Vault 2.0 Implementation

Data Vault 2.0 is often assumed to be only a modeling technique, but it encompasses much more than that. It is a complete BI solution composed of an agile methodology, architecture, implementation, and modeling.

So why start using Data Vault?

Data Vault 2.0 allows you to build automated loading processes/patterns and generate models very easily
Platform independence
Auditability
Scalability
Supports ELT instead of ETL processes

Now that we answered the why, you may be wondering what steps are needed to implement Data Vault 2.0 in your project.

It depends on a lot of factors like your business case, the architecture you want to have in place, how your sources are loaded, the sprint timeline of your project, etc.

Walk-through of a Data Vault 2.0 Implementation

Getting started with Data Vault 2.0 can be a bit overwhelming for beginners, who often wonder how and where to implement it. This webinar provides a very basic guide to the steps needed to build a Data Vault 2.0 implementation from scratch, based on a business requirement. It uses a demonstrated example that runs from gathering sample requirements to the finished, delivered product.

Watch Webinar Part 1 Watch Webinar Part 2

In this article:

Data Vault 2.0 feature by feature architecture
Conclusion

Data Vault 2.0 feature by feature architecture

One thing is for sure: the architecture should be built vertically, not horizontally. This means not layer by layer but feature by feature.

A common approach here is the Tracer Bullet approach. Based on business value, which is defined by a report, a dashboard, or an information mart, the source data needs to be identified, modeled, and loaded through all layers of the architecture.

For example, let’s say the business request was to build a dashboard to analyze the company’s sales:

1. Extract

First, we need to extract the data from the source systems and load it, unchanged, into a staging area. In this example, we put it in a Transient Staging Area, but you could choose a persistent one in a Data Lake as well.

2. Transform

Next, apply hard rules where necessary, using a transformation tool. Be careful here: hard rules should only adjust things like data types and formats, so you do not want to perform business calculations at this stage. There are many data warehouse automation tools to choose from, such as dbt, Coalesce, and WhereScape.
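
As a rough illustration of what a hard rule looks like, the following Python sketch trims a key, parses a date, and casts an amount to a decimal type without changing the meaning of any value. The column names and the source formats (a `DD.MM.YYYY` date and a decimal comma) are assumptions chosen for the example. In a real project this logic would typically live in the transformation tool rather than in Python.

```python
from datetime import date, datetime
from decimal import Decimal


def apply_hard_rules(raw: dict[str, str]) -> dict[str, object]:
    """Align data types and formats only; no business calculations."""
    return {
        "customer_id": raw["customer_id"].strip(),
        "order_date": datetime.strptime(raw["order_date"], "%d.%m.%Y").date(),
        "amount": Decimal(raw["amount"].replace(",", ".")),
    }


staged_row = {"customer_id": " C-1001 ", "order_date": "31.01.2024", "amount": "1299,50"}
clean_row = apply_hard_rules(staged_row)
```

A rule such as "amount including tax" or "customer lifetime value" would be a business calculation and belongs in the Business Vault, not here.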

3. Load

Load your Raw Stage into the Raw Vault.

4. Model Business requirements

Model the Data Vault entities needed to fulfill the business requirement. If we have sales transactions and customer data, for example, we will model a Non-historized Link (also known as a Transactional Link) and a Customer Hub, along with any additional Satellites that hold the descriptive customer data we want to see in the Sales Dashboard in the end.
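
To give a feel for these entities, the sketch below derives hash keys from business keys and builds one Customer Hub row and one Non-historized Link row in Python. The normalization (trim and upper-case), the `||` delimiter, the use of MD5, and all column names are assumptions for illustration. Load date and record source columns are omitted for brevity.

```python
import hashlib


def hash_key(*business_keys: str, delimiter: str = "||") -> str:
    """Hash one or more normalized business keys into a fixed-length key."""
    normalized = delimiter.join(key.strip().upper() for key in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest().upper()


# Hub: one row per distinct business key.
customer_hub_row = {
    "hk_customer": hash_key("c-1001"),
    "customer_id": "c-1001",
}

# Non-historized link: references the hub and carries the transaction itself.
sales_link_row = {
    "hk_sales": hash_key("c-1001", "INV-2024-0007"),
    "hk_customer": hash_key("c-1001"),
    "invoice_number": "INV-2024-0007",
    "amount": "1299.50",
}
```

Because the hash is computed from the normalized business key alone, the link row can reference the hub without looking up a sequence number, which is one reason hubs, links, and satellites can be loaded in parallel.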

5. Apply Business Logic

Next, we need some calculations and aggregations to be performed, so we will build some business logic on top of the raw entities, loading it into the business vault.
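
A small example of such business logic is a revenue total per customer, computed on top of the raw transactions. The Python sketch below is only meant to show the kind of calculation involved. The column names and sample values are hypothetical, and in practice this would usually be a SQL model in the transformation tool.

```python
from collections import defaultdict
from decimal import Decimal


def revenue_per_customer(transactions: list[dict]) -> dict[str, Decimal]:
    """Aggregate transaction amounts per customer hash key."""
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for row in transactions:
        totals[row["hk_customer"]] += row["amount"]
    return dict(totals)


raw_transactions = [
    {"hk_customer": "A1", "amount": Decimal("100.00")},
    {"hk_customer": "A1", "amount": Decimal("50.50")},
    {"hk_customer": "B2", "amount": Decimal("20.00")},
]
business_vault_rows = revenue_per_customer(raw_transactions)
```

The raw transactions stay untouched. The aggregated result is stored separately, so the calculation can be changed later without reloading the Raw Vault.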

6. Build an Information Mart

We could now use the data stored in the Raw and Business Vault directly in charts and dashboards. However, we want to structure the data so that business users can easily read and fetch it, so we build an information mart as a star schema with a fact table and dimensions.
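
The shape of such a star schema can be sketched with plain Python structures: a customer dimension keyed by a surrogate key and a sales fact table that references it. All names and values are made up for illustration. A real information mart would consist of tables or views in the database.

```python
# Dimension: descriptive attributes, one row per customer.
dim_customer = {
    1: {"customer_id": "C-1001", "name": "Acme GmbH", "country": "DE"},
    2: {"customer_id": "C-1002", "name": "Globex Ltd", "country": "UK"},
}

# Fact: one row per sale, pointing at the dimension through customer_key.
fact_sales = [
    {"customer_key": 1, "order_date": "2024-01-31", "amount": 1200.0},
    {"customer_key": 1, "order_date": "2024-02-15", "amount": 300.0},
    {"customer_key": 2, "order_date": "2024-02-20", "amount": 500.0},
]


def sales_by(attribute: str, facts: list[dict], dimension: dict[int, dict]) -> dict[str, float]:
    """Join facts to the dimension and total the amount per attribute value."""
    totals: dict[str, float] = {}
    for fact in facts:
        group = dimension[fact["customer_key"]][attribute]
        totals[group] = totals.get(group, 0.0) + fact["amount"]
    return totals


sales_per_country = sales_by("country", fact_sales, dim_customer)
```

Here `sales_per_country` is `{"DE": 1500.0, "UK": 500.0}`, which is the kind of query a dashboard tool issues against the fact and dimension tables.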

7. Visualize Data

To build the Sales Dashboard in a BI visualization tool like Power BI or Tableau, we connect the tool to the data warehouse and fetch the data directly from the star schema in the information mart, which holds all the information we need.

Data Vault 2.0 offers an agile, scalable, and flexible approach to Data Warehousing Automation. As demonstrated in the example, we only modeled the Data Vault tables that were necessary for the task at hand, building a Sales dashboard. This way you can scale your data warehouse on demand, so you don’t have to figure out and map out the whole enterprise in one go.

The answer to how to implement Data Vault 2.0 can be translated into a simple phrase: Focus on business value!

If you would like to see an explanation of this step-by-step implementation with some demonstration of actual data using dbt as the chosen transformation tool, check out the webinar recording.

Conclusion

Implementing Data Vault 2.0 involves a structured approach that begins with extracting data from source systems into a staging area, followed by minimal necessary transformations, and loading into the Raw Vault. Subsequently, business requirements guide the modeling of Data Vault entities, application of business logic, construction of information marts, and data visualization. This feature-by-feature methodology ensures scalability and flexibility, allowing organizations to focus on delivering business value incrementally. By aligning development efforts with specific business needs, enterprises can efficiently build and expand their data warehousing solutions.

Marc Winkelmann In Data Tools, Intermediate

Speed Up Your Data Vault 2.0 Implementation with TurboVault4dbt

TurboVault4dbt

Scalefree released TurboVault4dbt, an open-source package to automate model generation using DataVault4dbt-compatible templates based on your sources’ metadata.

TurboVault4dbt currently supports metadata input from Excel, Google Sheets, BigQuery, and Snowflake and helps your business by:

Speeding up the development process, reducing development costs, and producing faster results
Encouraging users to analyze and understand their source data

Speed up Your Data Vault 2.0 Implementation – with TurboVault4dbt

This webinar delves into TurboVault4dbt, an open-source tool by Scalefree that speeds up Data Vault 2.0 implementation. It automates dbt model creation using your source metadata, saving time and costs while encouraging better data analysis.

TurboVault4dbt works with metadata inputs like Excel, Google Sheets, BigQuery, and Snowflake, generating models for hubs, links, and satellites automatically. Just set up your metadata tables, connect the tool, and watch it do the heavy lifting!

Watch webinar recording

In this article:

‘Isn’t every model kind of the same?’
From CTRL+C AND CTRL+V to a simple mouse-click
Conclusion: Lean back, relax and let TurboVault4dbt take over!

‘Isn’t every model kind of the same?’

Datavault4dbt is the result of years of experience in creating and loading Data Vault 2.0 solutions forged into a fully auditable solution for your Data Vault 2.0 powered Data Warehouse using dbt.

But every developer who has worked with the package or has created dbt models for the Raw Vault must have come across one nuisance:

Creating a new dbt model for a table means taking the already existing template and providing it with specific metadata for that table. Doing this over and over again can be quite a chore. This is why we created TurboVault4dbt to automate and speed up this process.

From CTRL+C AND CTRL+V to a simple mouse-click

How many times have you pressed CTRL+C, then CTRL+V, and corrected a few lines of code when creating new dbt models for the Raw Vault?

Instead of trying to figure out what the names of your tables and business keys are or what hashing order you want your Hashkey to be generated in, TurboVault4dbt will do all of that for you. All TurboVault4dbt needs is a metadata input where you capture the structure of your data warehouse.

TurboVault4dbt currently requires a structure of five metadata tables:

Hub Entities: This table stores metadata information about your Hubs,
e.g. (Hub Name, Business Keys, Column Sort Order for Hashing, etc.)
Link Entities: This table stores metadata information about your Links,
e.g. (Link Name, Referenced Hubs, Pre-Join Columns, etc.)
Hub Satellites: This table stores metadata information about your Hub Satellites,
e.g. (Satellite Name, Referenced Hub, Column Definition, etc.)
Link Satellites: This table stores metadata information about your Link Satellites,
e.g. (Satellite Name, Referenced Link, Column Definition, etc.)
Source Data: This table stores metadata information about your Sources,
e.g. (Source System, Source Object, Source Schema, etc.)

By capturing the metadata in the five tables above, TurboVault4dbt can extract the necessary information and generate every model that is based on a selected source. Filling in these tables also encourages you, as a user, to analyze and understand your data.
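
The general idea of metadata-driven model generation can be sketched in Python as follows. This is not TurboVault4dbt's actual code or output: the template, the metadata column names, and the generated file names are hypothetical placeholders, and the real DataVault4dbt templates look different. The sketch only shows how rows from a Hub Entities table could be grouped, ordered for hashing, and turned into one model file per hub.

```python
from collections import defaultdict
from string import Template

# Hypothetical placeholder template, not a real DataVault4dbt template.
HUB_TEMPLATE = Template(
    "-- hub model: $hub_name\n"
    "-- source object: $source_object\n"
    "-- business keys (hashing order): $business_keys\n"
)


def render_hub_models(hub_metadata: list[dict]) -> dict[str, str]:
    """Group metadata rows by hub and render one model text per hub."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in hub_metadata:
        grouped[row["hub_name"]].append(row)

    models = {}
    for hub_name, rows in grouped.items():
        # The sort order column decides the order of keys in the hash.
        ordered = sorted(rows, key=lambda row: row["sort_order"])
        models[f"{hub_name}.sql"] = HUB_TEMPLATE.substitute(
            hub_name=hub_name,
            source_object=ordered[0]["source_object"],
            business_keys=", ".join(row["business_key"] for row in ordered),
        )
    return models


hub_entities = [
    {"hub_name": "order_h", "source_object": "erp_orders", "business_key": "order_no", "sort_order": 2},
    {"hub_name": "order_h", "source_object": "erp_orders", "business_key": "company_code", "sort_order": 1},
    {"hub_name": "customer_h", "source_object": "crm_customers", "business_key": "customer_id", "sort_order": 1},
]
generated = render_hub_models(hub_entities)
```

Adding a new hub then means adding rows to the metadata table instead of copying and editing an existing model by hand.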

Conclusion: Lean back, relax and let TurboVault4dbt take over!

Create and fill your metadata tables, connect them to TurboVault4dbt, and enjoy your free time for another cup of coffee. Give it a try, or give us your feedback by visiting TurboVault4dbt on GitHub!

Stay updated on TurboVault4dbt through our marketing channels as great features lie ahead!
