
Data Vault 2.0 with Hadoop and Hive/Spark

By Scalefree Newsletter


Introduction

In this newsletter, you’ll receive an overview of what Hadoop and Hive are and why they can be used as an alternative to traditional databases.

There’s also going to be a webinar on June 13 at 2 pm with a more detailed look at how the components interact with each other and what a sample Data Vault 2.0 architecture looks like in that ecosystem. The webinar will also cover why Spark is used with Hadoop and Hive.

Register here for free.

Hadoop

Hadoop is used to process and analyze large volumes of data efficiently by distributing the workload across a cluster of commodity hardware, enabling parallel processing and providing fault tolerance through its distributed file system and resource management framework.

HDFS – Hadoop Distributed File System

HDFS is a distributed file system that provides reliable and scalable storage for big data. It breaks large files into blocks and distributes them across a cluster of commodity hardware. HDFS ensures data reliability and availability through data replication.

Yet Another Resource Negotiator – YARN

YARN provides a flexible, scalable resource management framework for Hadoop, enabling a variety of applications and workloads to coexist and efficiently utilize the cluster’s resources. It abstracts the underlying infrastructure and allows for the dynamic allocation of resources based on application requirements.

MapReduce – MR

MapReduce is a programming model and processing framework for distributed data processing in Hadoop. It allows for parallel processing of large datasets by dividing the workload into map and reduce tasks. Map tasks process data in parallel, and their output is combined and reduced to produce the final result.
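
While MapReduce jobs are usually written in Java, the same idea can be sketched with the classic word-count example in HiveQL (Hive is introduced below), which the engine compiles into map and reduce tasks. Table and column names here are made up for illustration.

    -- Hypothetical table 'docs' with one line of text per row in column 'line'.
    -- The split/explode step corresponds to the map phase,
    -- the GROUP BY aggregation to the shuffle and reduce phases.
    SELECT word, COUNT(*) AS cnt
    FROM docs
    LATERAL VIEW explode(split(line, ' ')) t AS word
    GROUP BY word;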

Hadoop Common

Hadoop Common provides libraries, utilities, and infrastructure support for the other components of Hadoop. It includes common utilities, authentication mechanisms, and interfaces that are used by various Hadoop modules.

What is the benefit?

Scalability
Hadoop enables the storage and processing of massive amounts of data by scaling horizontally across a cluster of commodity hardware. It can handle petabytes of data without sacrificing performance.

Distributed Computing
Hadoop distributes data and processing tasks across multiple nodes in a cluster, allowing for parallel processing and faster data analysis. This distributed computing model allows efficient utilization of resources and enables high-performance data processing.

Fault Tolerance
Hadoop provides fault tolerance by replicating data across multiple nodes in the cluster. If a node fails, data can still be accessed from other replicas, ensuring data reliability and availability.

Cost-Effectiveness
Hadoop is designed to run on inexpensive commodity hardware, making it a cost-effective solution for storing and processing large volumes of data. It eliminates the need for expensive specialized hardware.

Flexibility and Extensibility
Hadoop’s modular architecture allows for integration with various tools and frameworks within the Hadoop ecosystem, providing flexibility and extensibility. It supports a wide range of data processing tasks, including batch processing, real-time processing, machine learning, and more.

Data Locality
Hadoop’s distributed file system, HDFS, aims to bring the computation closer to the data. By processing data where it is stored, Hadoop minimizes data movement across the network, reducing latency and improving overall performance.

Ecosystem and Community
Hadoop has a rich ecosystem with a wide range of tools, libraries, and frameworks that extend its functionality for different use cases. It also has a large and active community of users, developers, and contributors, providing support, resources, and continuous improvement.

These benefits make Hadoop a powerful, popular solution for handling big data, enabling organizations to efficiently store, process, and gain insights from vast amounts of structured and unstructured data. The whole ecosystem can also run on-premise, which can make it a good alternative if ‘cloud’ is not an option.

Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides a high-level SQL-like query language called HiveQL for querying and analyzing large datasets.
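
For illustration, a HiveQL query reads just like SQL; the tables and columns below are made up, but the syntax is standard HiveQL.

    -- Hypothetical example: revenue per customer, joining two tables
    -- stored in HDFS and queried with plain SQL-like syntax.
    SELECT c.customer_name, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.customer_name;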

What are the components?

Data Storage
Hive leverages Hadoop Distributed File System (HDFS) as its underlying storage system. It stores data in HDFS in a distributed and fault-tolerant manner, allowing for scalable, reliable data storage.

Schema Definition
Hive allows users to define a schema for their data using its Data Definition Language (DDL). This allows users to define tables, partitions, columns, data types, and other metadata associated with the data.
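
As a sketch, a table over files already sitting in HDFS could be declared like this (location and columns are made up):

    -- Hypothetical external table definition over Parquet files in HDFS.
    CREATE EXTERNAL TABLE IF NOT EXISTS sales (
        sale_id     BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(18,2),
        sale_date   DATE
    )
    STORED AS PARQUET
    LOCATION '/data/raw/sales';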

Query Optimization
Hive optimizes queries by performing query planning and optimization techniques. It aims to generate efficient query execution plans to minimize data movement, optimize resource utilization, and improve query performance.
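
Users can inspect the plan Hive generates for a query with EXPLAIN, for example against the hypothetical sales table above:

    -- Shows the stages (e.g. map/reduce or Tez vertices) Hive plans
    -- to execute for the query, without actually running it.
    EXPLAIN
    SELECT sale_date, COUNT(*) FROM sales GROUP BY sale_date;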

Hive Metastore
Hive maintains a metadata repository called the Hive Metastore. It stores information about the tables, partitions, schemas, and other metadata associated with the data stored in HDFS. The metastore allows for efficient metadata management and retrieval during query processing.
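
The metadata held in the metastore can be inspected directly from HiveQL, for instance:

    -- Both statements are answered from the Hive Metastore
    -- rather than by scanning any data files.
    SHOW TABLES;
    DESCRIBE FORMATTED sales;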

Extensibility
Hive offers extensibility through User-Defined Functions (UDFs), User-Defined Aggregate Functions (UDAFs), and User-Defined Table Functions (UDTFs). These allow users to define custom logic and operations in programming languages like Java, Python, or other supported languages.
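
A custom function written in Java, for example, can be registered and then used like a built-in function; the JAR path and class name below are hypothetical:

    -- Register a hypothetical Java UDF and call it from HiveQL.
    ADD JAR hdfs:///libs/my-udfs.jar;
    CREATE TEMPORARY FUNCTION clean_text AS 'com.example.hive.CleanTextUDF';
    SELECT clean_text(comment_text) FROM feedback;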

Integration with other tools
Hive integrates with various other tools and frameworks in the Hadoop ecosystem. For example, it can work alongside Apache Spark, Apache Pig, Apache HBase, and other components to provide a complete data processing and analytics solution.

Partitioning and Bucketing
Hive supports data partitioning and bucketing, allowing users to organize and store data in a structured manner. Partitioning involves dividing data into logical partitions based on specific criteria, while bucketing involves dividing data into equally sized buckets based on hash values.
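
Both techniques are declared in the table DDL; a hedged sketch with made-up names:

    -- Partition by date (one directory per day) and bucket by customer_id
    -- into 32 files per partition, based on a hash of the key.
    CREATE TABLE sales_part (
        sale_id     BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(18,2)
    )
    PARTITIONED BY (sale_date DATE)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC;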

SerDe
Hive uses a serialization/deserialization framework called SerDe (Serializer/Deserializer) to read and write data in different formats, such as CSV, JSON, Avro, and more. Users can specify the appropriate SerDe for their data format to ensure proper data processing.
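
The SerDe is specified per table. For example, a sketch for line-delimited JSON data using the JsonSerDe shipped with Hive (table layout and location are made up):

    -- Each line in the files under the given location is parsed as one JSON object.
    CREATE EXTERNAL TABLE clickstream (
        user_id   STRING,
        url       STRING,
        event_ts  TIMESTAMP
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION '/data/raw/clickstream';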

Overall, Hive simplifies data querying and analysis on Hadoop by providing a familiar SQL-like interface. It abstracts the complexity of writing low-level MapReduce or Tez jobs and provides a declarative and user-friendly approach to interact with large-scale data stored in Hadoop.

Conclusion

Hadoop is a robust and feature-rich environment that can be challenging to manage. However, its numerous advantages make it a compelling choice, depending on the user’s needs and the available in-house expertise. If you’re interested in learning more about it, don’t miss the opportunity to participate in the upcoming webinar.

Register here for free

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil that can be used to draw Data Vault models is available at www.visualdatavault.com.


Bring your Data Vault automation to the next level – with Datavault4Coalesce

By Scalefree Newsletter


In last September’s webinar, we explored a new competitor in the highly contested market of data warehouse automation tools: Coalesce!

And everyone who watched that webinar might remember that at the end, we announced an even closer relationship between Coalesce and Scalefree and a commitment to bring Scalefree’s best practices into Coalesce!

For those who didn’t watch the webinar, you can find it here.

Recap: What is Coalesce?

Coalesce is a data transformation solution made for Snowflake. When working with Coalesce, you build directed acyclic graphs (DAGs) that contain nodes. A node represents a physical database object, such as a table or view, or even a stage or external table.

Coalesce itself is built around metadata that stores table- and column-level information describing the structure inside your data warehouse. This metadata-focused design enables a number of features that strongly improve scalability and manageability.

This metadata allows a team to track past, current, and desired states of the data warehouse, since it is deeply integrated into all of Coalesce’s workflows. Additionally, users can define standardized patterns and templates on column and table level.

How can Data Vault jump in here?

These templates open the door to implementing Data Vault 2.0 patterns and best practices. Especially on table level, it might quickly come to mind that you could build a template for a Hub or a Link.

On column level, this could be a repeated transformation which is then managed in only one so-called macro, making it very easy to implement changes with little to no impact. You could think of hashkey calculation or virtual load-end-dating here.
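
As an illustration, a hashkey is typically just a standardized expression over the business keys. A sketch in Snowflake SQL (column names made up) of the kind of logic such a column-level template would encapsulate:

    -- Sketch of a Data Vault hashkey: normalize the business keys,
    -- concatenate them with a delimiter and hash the result.
    MD5(
        UPPER(TRIM(COALESCE(CAST(customer_id AS VARCHAR), ''))) || '||' ||
        UPPER(TRIM(COALESCE(customer_source, '')))
    ) AS hk_customer_h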

And that is exactly what we at Scalefree have been doing since the webinar last year. Lead developers from Coalesce sat together with Data Vault experts and developers from Scalefree with one goal: create something amazing that helps users automate their Data Vault 2.0 implementation!

How fast can I build a Data Vault? Yes!

This cooperation created Datavault4Coalesce, an open source extension package for Coalesce, which will be available on March 16th! Let’s have a sneak peek at a selection of what users can do with Datavault4Coalesce.

The first release of Datavault4Coalesce will feature a basic set of Data Vault 2.0 entities:

  • Stages
  • Hubs
  • Links
  • Satellites
  • PITs
  • Snapshot Control Tables

While providing DDL and DML templates for the entity types mentioned above, Datavault4Coalesce makes use of Coalesce’s ability to define the UI for each node type. For stages, this means that users can decide if they want Datavault4Coalesce to automatically generate ghost records or not, as shown in the screenshot below:

This Data Vault related interface can be found across all node types and allows users to conveniently adjust Datavault4Coalesce to fit their requirements!

How can I start?

First of all, a small bummer: Datavault4Coalesce will only be available starting from the 16th of March. But there is no reason to wait that long to dive into Coalesce itself! Since Coalesce is now part of Snowflake Partner Connect, it’s never been easier to get your hands on a fresh Coalesce environment!

Just sign up for a free Snowflake trial here and initialize your Coalesce experience within seconds by accessing the Partner Connect portal! Then, you just have to load any desired data into it and you can start building your data pipelines with Coalesce. And when the 16th of March finally arrives, you just have to add Datavault4Coalesce to your Coalesce environment – and you can start building your Data Vault faster than ever!

Also, don’t miss our next webinar, which will be held on the day of launch, the 16th of March, where we will show you Datavault4Coalesce in action. Sign up here!

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil that can be used to draw Data Vault models is available at www.visualdatavault.com.


Speed up Your Data Vault 2.0 Implementation – with TurboVault4dbt

By Scalefree Newsletter


Scalefree has released TurboVault4dbt, an open source package that provides automated model generation using DataVault4dbt-compatible templates based on your sources’ metadata.

TurboVault4dbt currently supports metadata input from Excel, Google Sheets, BigQuery and Snowflake and helps your business with:

  • Speeding up the development process, reducing development costs and producing faster results
  • Encouraging users to analyze and understand their source data

‘Isn’t every model kind of the same?’

Datavault4dbt is the result of years of experience in creating and loading Data Vault 2.0 solutions, forged into a fully auditable solution for your Data Vault 2.0 powered Data Warehouse using dbt.

But every developer who has worked with the package or has created dbt-models for the Raw Vault must have come across one nuisance:
Creating a new dbt-model for a table basically means taking the already existing template and providing it with specific metadata for that table. Doing this over and over again can be quite a chore, which is why we created TurboVault4dbt to automate and speed up this process.

From Ctrl+C and Ctrl+V to a simple mouse click

How many times has everyone pressed CTRL+C then CTRL+V and corrected a few lines of code when creating new dbt-models for the raw vault?

Instead of trying to figure out what the names of your tables and business keys are or what hashing order you want your Hashkey to be generated in, TurboVault4dbt will do all of that for you. All TurboVault4dbt needs is a metadata input where you capture the structure of your data warehouse.

TurboVault4dbt currently requires a structure of five metadata tables:

  • Hub Entities: This table stores metadata information about your Hubs,
    e.g. (Hub Name, Business Keys, Column Sort Order for Hashing, etc.)
  • Link Entities: This table stores metadata information about your Links,
    e.g. (Link Name, Referenced Hubs, Pre-Join Columns, etc.)
  • Hub Satellites: This table stores metadata information about your Hub Satellites,
    e.g. (Satellite Name, Referenced Hub, Column Definition, etc.)
  • Link Satellites: This table stores metadata information about your Link Satellites,
    e.g. (Satellite Name, Referenced Link, Column Definition, etc.)
  • Source Data: This table stores metadata information about your Sources,
    e.g. (Source System, Source Object, Source Schema, etc.)

By capturing the metadata in those five tables above, TurboVault4dbt can extract the necessary information and generate every model that is based on a selected source, while also encouraging you, as a user, to analyze and understand your data.
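
As a hedged sketch of what such a metadata table could look like (the column names are illustrative, not the package’s required layout), the Hub Entities table might be defined as:

    -- Illustrative layout for a 'hub_entities' metadata table.
    CREATE TABLE hub_entities (
        hub_name        VARCHAR,  -- e.g. 'customer_h'
        business_key    VARCHAR,  -- e.g. 'customer_id'
        hash_sort_order INTEGER,  -- column sort order used for hashkey calculation
        source_table    VARCHAR,  -- source object the business key is loaded from
        source_schema   VARCHAR   -- schema of that source object
    );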

Lean back, relax and let the machine take over!

Create and fill your metadata tables, connect them to TurboVault4dbt and enjoy your free time for another cup of coffee. Give it a try, or give us your feedback by visiting TurboVault4dbt on GitHub!

Stay updated on TurboVault4dbt through our marketing channels as great features lie ahead!

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil that can be used to draw Data Vault models is available at www.visualdatavault.com.


About Information Marts in Data Vault 2.0

By Scalefree Newsletter


In the Data Vault 2.0 architecture, information marts are used to deliver information to the end-users.
Conceptually, an information mart follows the same definition as a data mart in legacy data warehousing. However, even in legacy data warehousing, a data mart is used to deliver useful information, not raw data. This is why the data mart has been renamed to information mart in Data Vault 2.0 to better reflect the use case.

“Classical” information marts

But the definition of information marts has more facets. In the book “Building a Scalable Data Warehouse with Data Vault 2.0” we present three types of marts:

  • Information marts: used to deliver information to business users, typically via dashboards and reports.
  • Metrics Mart: used in conjunction with a Metrics Vault, which captures EDW log data in a Data Vault model. The Metrics Mart is derived from the Metrics Vault to present the metrics in order to analyze performance bottlenecks or the resource consumption of power users and data scientists in managed self-service BI solutions.
  • Error Mart: stores those records that typically fail a hard rule when loading the data into the enterprise data warehouse.

Additional information marts

In addition to these “classical” information marts, we use additional ones in our consulting practice:

  • Interface Mart: this is more or less an information mart; however, the information is not delivered to a human being, e.g. via a dashboard or report. Instead, it is delivered to a subsequent application, or written back to the source system (for example when using the enterprise data warehouse for data cleansing).
  • Quality Mart: the Quality Mart is again an information mart, but instead of cleansing bad data, it is used to report bad data. Essentially, it turns the business logic used to cleanse bad data upside down: only bad data, and at times ugly data, is delivered to the end user, the data steward. This is often done in conjunction with data cleansing frontends where the data steward can either correct source data or comment on and tag the exceptions.
  • Source Mart: again an information mart, but this time not using one of the popular schemas, such as star schemas, snowflake schemas or fully denormalized schemas. Instead, the information mart uses the data model of the source application, similar to an operational data store (ODS) schema. However, the Source Mart is not a copy of the data; it is a virtualized model on top of the Data Vault model, reflecting the original structures (see the sketch after this list). It’s great for ad-hoc reporting and provides great value for many data scientists as well as power users.
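
A minimal sketch of such a virtualized Source Mart entity, assuming a hypothetical customer Hub and Satellite in the Raw Vault, could be a simple view:

    -- Hypothetical view reconstructing the source 'customer' table
    -- from a Hub and its current Satellite records.
    CREATE VIEW source_mart.customer AS
    SELECT
        h.customer_id,
        s.customer_name,
        s.customer_address
    FROM raw_vault.customer_h h
    JOIN raw_vault.customer_s s
      ON s.hk_customer_h = h.hk_customer_h
    WHERE s.is_current = TRUE;  -- assuming a current-flag or end-dated satellite view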

This concludes our list of information marts. We have used them successfully in projects for our clients to better communicate the actual application of the information marts in their organization.

Let us know in the comments if you think this is helpful for you, too!

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil that can be used to draw Data Vault models is available at www.visualdatavault.com.


Kick-start your Data Vault 2.0 Implementation – with datavault4dbt

By Scalefree Newsletter


Scalefree has released datavault4dbt, an open source package that provides best-practice loading templates for Data Vault 2.0 entities, embedded into the open source data warehouse automation tool dbt.

Datavault4dbt currently supports Snowflake, BigQuery and Exasol and comes with a lot of great features:

  • A Data Vault 2.0 implementation congruent to the original Data Vault 2.0 definition by Dan Linstedt
  • Ready for both Persistent Staging Areas and Transient Staging Areas, since all macros allow multiple deltas per load without losing any intermediate changes
  • Creating a snapshot-based business interface by using a centralized snapshot table supporting logarithmic logic
  • Optimizing incremental loads by implementing a high-water-mark that also works for entities that are loaded from multiple sources

Building a Data Vault 2.0 Solution – made easy

The overall goal of releasing Data Vault 2.0 templates for dbt is to combine our years of experience in creating and loading Data Vault 2.0 solutions into publicly available loading patterns and best practices for everyone to use. Out of this ambition, datavault4dbt, an open source package for dbt, was created and will be maintained by the Scalefree expert team.

The most valuable characteristic of datavault4dbt is that it faithfully implements the original Data Vault 2.0 definition by Dan Linstedt. It represents a fully auditable solution for your Data Vault 2.0 powered Data Warehouse. With a straightforward, standardized approach, it enables the team to conduct agile development cycles.

By allowing multiple increments per batch while loading each Data Vault entity type, datavault4dbt supports both Persistent and Transient Staging Areas without losing any intermediate changes. These incremental loads are even optimized by implementing a dynamic high-water-mark that even works when loading an entity from multiple sources.
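
Conceptually, a high-water-mark filter in a dbt incremental model boils down to selecting only records newer than what has already been loaded. A simplified sketch of that underlying pattern (not the package’s actual macro code):

    -- Simplified high-water-mark pattern inside a dbt incremental model:
    -- only process source rows with a load date newer than the latest one
    -- already present in the target table.
    {% if is_incremental() %}
    WHERE src.ldts > (SELECT MAX(ldts) FROM {{ this }})
    {% endif %}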

Additionally, datavault4dbt encourages strict naming conventions and standards by implementing a variety of global variables that span across all Data Vault layers and supported Databases. The process of end-dating data is completely virtualized to ensure a modern insert-only approach that avoids updating data.
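
Virtualized end-dating typically means deriving the load end date in a view with a window function instead of updating rows. A minimal sketch over a hypothetical customer satellite:

    -- Sketch of virtual load-end-dating: the end date of a record is the
    -- load date of the next record for the same hashkey (NULL if none exists yet).
    SELECT
        hk_customer_h,
        ldts,
        LEAD(ldts) OVER (PARTITION BY hk_customer_h ORDER BY ldts) AS load_end_date
    FROM customer_s;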

With all these features, datavault4dbt is the perfect solution for your modern Big Data Enterprise Data Warehouse.

From the Stage through the Spine into the PITs

To achieve all this, we worked hard on creating a solid and universal staging area. All hashkeys and hashdiffs are calculated here, and users are given the option to add derived columns, generate prejoins with other stages and add ghost records to their data. All of this is highly automated, based on parameterized user input.

Based on staging areas, the Data Vault 2.0 spine can be created. Hubs, Links and Non-Historized Links can be loaded from multiple sources including mapping options to ensure business harmonization. 
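
To give an idea of what a Hub model then looks like in dbt, here is a hedged sketch; the macro call and parameter names are illustrative, and the exact signature should be taken from the datavault4dbt documentation.

    -- Hypothetical dbt model file for a customer Hub (names illustrative).
    {{ datavault4dbt.hub(hashkey='hk_customer_h',
                         business_keys=['customer_id'],
                         source_models=['stage_customer']) }}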

This spine is then enriched by Standard Satellites, Non-Historized Satellites, Multi-Active Satellites and/or Record-Tracking Satellites. All of those that require it come with a version 0 for tables and a version 1 for end-dated views. 

Based on the Raw Data Vault, PITs can be created automatically, and their loading is backed by an automated, highly-configurable but optional logarithmic snapshot logic. This logic is included in the Control Snapshot Table, which also comes in two consecutive versions. To wrap the logarithmic snapshot logic up, a post-hook for cleaning up all PITs is included and comes in handy.


Start now and boost your Data Vault experience!

Did the lines above make you think “Nah, that’s all too good to be true!”? Convince yourself, or give us your highly appreciated feedback, by visiting datavault4dbt on GitHub!

Of course, our future ambitions for datavault4dbt are high and next on our list are a lot of important topics, like:

  • Provide a detailed working example of datavault4dbt
  • Extend and migrate the existing documentation of the package
  • Support more and more databases
  • Add more advanced and specific Data Vault 2.0 entities
  • Develop automated Data Vault related tests
  • Review and implement user feedback and suggestions

Stay tuned for more datavault4dbt content on all our marketing channels!

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil that can be used to draw Data Vault models is available at www.visualdatavault.com.
