
Agile Kanban Made Easy for Salesforce: Introducing the MOWI App


This article discusses how agile project management can be implemented with a Kanban board, a visual representation of tasks and their progress. MOWI is a lightweight Salesforce-based application that fully integrates the Kanban board with all other components within the Business Application Platform using the Mission, Objective, and Work Item objects. These related objects can help teams increase efficiency and productivity by providing a visual representation of their work, making it easier to track progress and manage tasks.

Agile project management is a well-known but often poorly executed approach to software development that emphasizes adaptability and collaboration. One popular method for implementing Agile is the use of a Kanban board.

A Kanban board is a visual representation of User Stories/Tasks, often displayed on a physical whiteboard or a digital tool. Each item on the board represents a task, and its progress is tracked by moving it from one column to another. Typically, columns represent stages of work such as “Active”, “Blocked”, “Testing”, and “Done”. This improves visibility for the team and the stakeholders alike and can help with better inter-team communication.

MOWI is a lightweight application that uses the capabilities of Salesforce List Views in Kanban view to fully integrate the Kanban board with all other components within the Business Application Platform. The name is an acronym formed from the initial letters of its objects: Mission, Objective, Work Item. These objects are intended to enable users to insert and manage projects and tasks in a Kanban system.

The objects are related to each other. A mission can be understood as a container of objectives: a mission can have several objectives, but an objective can have only one parent mission at a time.

The objectives are the heart of the MOWI application. As mentioned, each objective can belong to only one mission. However, objectives can have further so-called Follow-Up Objectives, so a one-to-many relationship between objectives is also possible.

The same applies to the Work Items: they can be understood as sub-goals of the objectives. However, just like missions, they are not strictly necessary to create.

Image 1: Schema-Builder (Mission/Objective/Work Item)

Using a Kanban board view in Salesforce can help teams increase efficiency and productivity by providing a visual representation of their work and making it easier to track progress and manage tasks. The three objects within the custom application provide a good template for this.

Mission

Missions are a container for multiple objectives. If several objectives need to be grouped into a “bigger picture”, you can add a mission, but missions are optional. The screenshot shows how a mission is structured and which fields it contains. A mission has one or more objectives, which can be understood as “sub-objectives”.

Image 2: Screenshot of a Mission Record

Objective

Objectives are records with multiple fields. You can also call the objectives Tasks, User Stories, or Cards. The fields can be filled out with information to describe the objective. In the following screenshot you can see an example of an objective.

Image 3: Screenshot of an Objective Record

Work Item

The objectives within a mission may also need to be divided into individual sub-goals. Work Items are these sub-tasks of the related objective, and they let you track the progress towards reaching the goal of your objective. The screenshot below shows an example Work Item with the status set to “Open”. A work item passes through the following three phases:

  • Open
  • In Progress
  • Closed

Image 4: Screenshot of a Work Item Record

The Kanban View

MOWI, so far so good. But what does the view with the objectives actually look like? The following screenshot shows an example of a Kanban board. On the board, there are different objectives in different phases. Users can drag and drop the objectives between the phases depending on their progress.


Image 5: Screenshot of the Kanban view

Final Remarks

Interested in the MOWI solution? Then visit our app on the AppExchange and download it today for free!
Agile Kanban Board in Salesforce – MOWI

We hope that this article has provided valuable information to you. If you need further assistance or have any questions, don’t hesitate to reach out to us at [email protected].

If you want to stay up-to-date on the latest developments and news, be sure to subscribe to our newsletter. We send out regular updates with important information and insights for Salesforce.


Exploring DataVault4dbt: A Practical Series on the dbt Package for Data Vault 2.0 – Vol. 1: The Staging Layer


Last year Scalefree released DataVault4dbt, a Data Vault 2.0 open-source package for dbt, which includes loading templates for creating and modeling Data Vault 2.0 entities following up-to-date standards and best practices. If you want to read more about the general content of the package and its motivation, you can do so here.

We’re excited to launch a series of insightful posts and webinars, showcasing practical implementations of DataVault4dbt. This will empower you to leverage its full potential in your data warehousing endeavors. Today, we’ll spotlight its application in the staging layer.

Before we start…

We will assume some previous knowledge of Data Vault 2.0 and dbt. In addition, for the following examples, we will be using the dbt Cloud IDE connected to Snowflake. For an updated list of supported platforms, check the package’s GitHub repository.

Also, bear in mind that for optimal use of the macros, you must meet a few prerequisites:

  • Flat & Wide source data, accessible in your target database
  • A Load Date column signifying the time of arrival in the source data storage
  • A Record Source column detailing the origin of the source data, such as the file location within a Data Lake

In our case, we used and adapted the data from the jaffle_shop example project available on dbt.

Installing the datavault4dbt package in dbt

Installing DataVault4dbt is like installing any other package on your project. You will need to follow two simple steps:

1. Add it to your packages.yml file

2. Run dbt deps

Image: Installation screenshot
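For reference, a minimal packages.yml entry could look like the sketch below. The package name reflects how the package is listed on dbt Hub at the time of writing, and the version range is only a placeholder; check the latest release before running dbt deps:

    # packages.yml -- version range is a placeholder, check the latest release
    packages:
      - package: ScalefreeCOM/datavault4dbt
        version: [">=1.0.0", "<2.0.0"]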

Using the macro for staging our source data

According to the documentation for the staging layer, this layer primarily focuses on hashing. It also offers functionalities like creating derived columns, conducting prejoins, and adding NULL values for missing columns. Rather than diving deep into the technical aspects of each macro component, which are comprehensively covered in the documentation, let’s dive straight into its application!

A. Basic source information

Identifying the Source Model (source_model):

  • When referencing a source, adopt the dictionary format: ‘source_name’: ‘source_table’.
  • For models within our dbt project, just use the model name: ‘source_table’.

Setting Load Date Timestamp (ldts) & Record Source (rsrc):

  • Both can reference a column from the source table or a more detailed SQL expression.
  • Additionally, for the Record Source, you can use a static string beginning with ‘!’, like ‘!my_source’.

Example

Image: Example configuration of source_model, ldts, and rsrc
  • source_model: Calls an already created table on dbt named ‘orders_example’.
  • ldts: Calls a timestamp column from our source model.
  • rsrc: Calls a column which contains a string referring to our record source name.
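To make this concrete, here is a minimal sketch of such a stage model in dbt. The macro name and parameter names follow the package documentation; the model and column names (orders_example, ORDER_LOADDATE, ORDER_RECORDSOURCE) are only illustrative placeholders for the adapted jaffle_shop data:

    -- models/staging/stage_orders.sql (illustrative sketch)
    {{ config(materialized='view') }}

    {%- set yaml_metadata -%}
    source_model: 'orders_example'        # existing dbt model used as source
    ldts: 'ORDER_LOADDATE'                # load date timestamp column (assumed name)
    rsrc: 'ORDER_RECORDSOURCE'            # record source column, or a static string like '!my_source'
    {%- endset -%}

    {%- set metadata_dict = fromyaml(yaml_metadata) -%}

    {{ datavault4dbt.stage(source_model=metadata_dict['source_model'],
                           ldts=metadata_dict['ldts'],
                           rsrc=metadata_dict['rsrc']) }}

The parameters described in the following sections (hashed_columns, derived_columns, and so on) are added to the same yaml_metadata block and passed on to the macro as additional arguments.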

B. Hashing

The hashed_columns parameter outlines how to generate hashkeys and hashdiffs. For each hash column:

  • The key represents the hash column’s name.
  • For Hashkeys, the value is a list of business keys.
  • For Hashdiffs, the value will usually be a list of descriptive attributes.

Example

Image: Example hashed_columns configuration
  • hk_order_h: hashkey generated using two column inputs (O_ORDERKEY and O_CUSTKEY)
  • hd_order_s: hashdiff generated using multiple descriptive attributes
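Added to the yaml_metadata block from the sketch above, a hashed_columns definition could look roughly like this. The descriptive attribute names are illustrative, and the hashdiff notation (an is_hashdiff flag plus a column list) follows the package documentation, so verify it against the docs for your version:

    hashed_columns:
      hk_order_h:                 # hashkey built from the business keys
        - O_ORDERKEY
        - O_CUSTKEY
      hd_order_s:                 # hashdiff over descriptive attributes (illustrative columns)
        is_hashdiff: true
        columns:
          - O_ORDERSTATUS
          - O_TOTALPRICE
          - O_ORDERDATE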

C. Derived columns

Derived Columns in stage models allow users to directly apply specific transformations to data. They act as on-the-fly customizations, enabling immediate adjustments to data within the column itself. Essentially, if data isn’t in the desired format, you can derive a new version right within the column using a specified rule.

When setting the derived_columns parameter, each derived column includes:

  • value: The transformation expression.
  • datatype: The datatype of the column.
  • src_cols_required: Source columns needed for the transformation.

Depending on how you name the derived column and the source columns, you can achieve two outcomes:

  1. If the derived column’s name matches its source column’s name, the original column’s data will be replaced by the transformed data. This effectively means you’re overwriting the original data.
  2. On the other hand, if the derived column’s name is different from its source column’s name, the transformation will result in a brand new column, preserving the original column’s data.

Example

Image: Example derived_columns configuration
  • price_euro: creation of a new column with the same values as the O_TOTALPRICE column.
  • country_isocode: creation of a new column with a static string ‘GER’.
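Expressed as a derived_columns block, the two examples above could look like the following sketch; the datatypes are illustrative and depend on your target platform:

    derived_columns:
      price_euro:                  # new column carrying the same values as O_TOTALPRICE
        value: 'O_TOTALPRICE'
        datatype: 'NUMBER(38,2)'
        src_cols_required: 'O_TOTALPRICE'
      country_isocode:             # new column holding the static string 'GER'
        value: "'GER'"
        datatype: 'STRING'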

D. Prejoining

Why Prejoin?

In certain scenarios, your source data might not have the ‘Business Key’ which is often a human-readable identifier, such as an email address or username. Instead, it might have a ‘Technical Key’, which could be an internally generated identifier or code. If you need to use the human-readable Business Key in your processing but only have the Technical Key, you would use prejoining to combine your data with another table that maps Technical Keys to Business Keys.

How to Define Prejoins?

The package provides a structured way to define these prejoins (prejoined_columns) using dictionaries.

For every column you’re adding through prejoining, you need to specify a few things:

  • src_name: This is the source of the prejoined data, as defined in a .yml file.
  • src_table: This specifies which table you’re prejoining with, as named in the .yml file.
  • bk: This is the name of the Business Key column in the prejoined table or the column values you are bringing to your table.
  • this_column_name: In your original data, this is the column that matches up with the prejoined table. This is often a Technical Key.
  • ref_column_name: In the prejoined table, this is the column that this_column_name points to. It should match up with the values in this_column_name.

Note that both ‘this_column_name’ and ‘ref_column_name’ can represent either a single column or a list of columns, serving as the basis for constructing the JOIN conditions.

Example

Image: Example prejoined_columns configuration
  • c_name: we brought the column “C_NAME” from the customer source table, joining on orders.o_custkey = customer.c_custkey.
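A matching prejoined_columns block could look roughly as follows; the source name and table name are illustrative placeholders for whatever is defined in your sources .yml file:

    prejoined_columns:
      c_name:                          # bring C_NAME over from the customer table
        src_name: 'jaffle_shop'        # source name from the .yml file (assumed)
        src_table: 'customer_example'  # prejoined table from the .yml file (assumed)
        bk: 'C_NAME'
        this_column_name: 'O_CUSTKEY'
        ref_column_name: 'C_CUSTKEY'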

E. Multi active config

The multi_active_config parameter is used when dealing with source data that contains multiple active records for the same Business Key. Essentially, you need to specify which columns are the multi-active keys and the primary hashkey column.

If your source data doesn’t have a natural multi-active key column, you should create one using functions like row_number in a preceding layer. Then, add the name of this newly created column to the multi-active-key parameter. It’s crucial that the combination of multi-active keys, main hashkey, and the ldts column be unique in the final satellite output. If you don’t use this setting, the stage is considered to have only single active records.

Example

Image: Example multi_active_config configuration
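As a sketch, and assuming the parameter keys multi_active_key and main_hashkey_column as used in the package documentation, the configuration could look like this; O_LINENUMBER stands in for whatever column identifies the multiple active records in your source:

    multi_active_config:
      multi_active_key: 'O_LINENUMBER'   # column identifying multiple active records per key (illustrative)
      main_hashkey_column: 'hk_order_h'  # primary hashkey of the stage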

By setting this parameter, we’ll observe consistent hashdiffs for identical Business Keys, proving beneficial in subsequent layers. If you want to know why, you can check this post.

F. Missing columns

The missing_columns parameter helps handle scenarios where the source schema changes and some columns no longer exist. Using this parameter, you can create placeholder columns filled with NULL values to replace the missing ones. This ensures that hashdiff calculations and satellite payloads continue to work. Essentially, you provide a dictionary where the column names are the keys and their respective SQL datatypes are the values.

Example

Image: Example missing_columns configuration
  • discount_code: creation of a new discount_code column with NULL values.
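The corresponding missing_columns block is simply a mapping of column name to SQL datatype:

    missing_columns:
      discount_code: 'STRING'   # placeholder column filled with NULL values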

Final Remarks

Scalefree’s DataVault4dbt package introduces an easy-to-use yet powerful solution for database modeling. In our case, we went through the staging layer macro, which combines best practices with the flexibility to address diverse source data needs. From hashing to on-the-fly column modifications, this Data Vault 2.0 open-source package for dbt streamlines complex processes.

As we continue to explore its potential, we invite you to join our upcoming webinar for a deeper dive. Reserve your spot here and stay tuned to the package’s Github repository for the latest updates and support.

– By Hernan Revale (Scalefree)


Choosing the Right Tech Stack for an Open-Source Powered EDW


Introduction

Choosing the right technology stack is a critical decision when building an open source powered Enterprise Data Warehouse (EDW). The technology stack consists of various components, including databases, automation tools, DevOps, Infrastructure, and visualizations, which work together to enable efficient data management, processing, and analysis.

In this blog article, we will dive deeper into the topic of selecting the right tech stack for an open source powered EDW. We will explore different aspects to consider, such as evaluating vendors, leveraging open source products, and understanding the key components of a robust tech stack. By the end of this article, you will have a better understanding of the factors to consider when selecting the right tech stack for your EDW.

Register here for free.

Evaluating Vendors and Leveraging Open-Source Products:

When embarking on the journey of building an open-source powered EDW, it is crucial to evaluate vendors and leverage open source products effectively. By choosing reputable vendors and open source solutions, you can ensure reliability, community support, and continuous development. Evaluating vendors involves assessing their expertise, reputation, and compatibility with your project requirements. Additionally, leveraging open source products provides flexibility, cost-effectiveness, and access to a vast community of contributors and developers.

Understanding the Key Components of a Robust Tech Stack:

A robust tech stack for an open source powered EDW comprises various components that work together to enable efficient data management and analysis. Here are some key components to consider:

Databases:

Choosing the appropriate database technology is vital for efficient data storage and retrieval. Options like MongoDB, PostgreSQL, MySQL, or other databases that align with your project requirements should be considered

Automation Tools:

Automation tools play a crucial role in the development process of an EDW. These tools greatly accelerate the development process, particularly in a Data Vault project. One example of an open source automation tool is dbt (data build tool), which can be combined with Scalefree’s self-developed package DataVault4dbt. These tools help streamline the development process and make the development team more efficient.

DevOps and Infrastructure:

Having a stable scheduler or a similar tool to load the data regularly from the sources into the Data Warehouse is important. Options such as Airflow can be considered for this purpose. Additionally, having a DevOps tool for project management is essential. These tools help structure the work and make the development team more efficient, especially when using agile methodologies like Scrum.

Visualization:

Effective data visualization is crucial for analyzing and understanding the data in an EDW. There are various open source visualization tools available, such as Grafana, Superset, or Metabase, which provide powerful capabilities for creating insightful visualizations and dashboards.

Why Data Vault 2.0 is a Powerful Choice in Combination with an Open Source Tech Stack:

Combining Data Vault 2.0 with an open source tech stack offers a powerful solution for building an efficient, scalable EDW. The agile concepts used in Data Vault make it easier to gradually build an open source tech stack over time, starting with basic needs and expanding as necessary.
It should be noted that checking the readiness of an open source automation tool for Data Vault and having Data Vault templates in place is crucial. These components enhance efficiency, streamline development, and ensure smooth integration in an open source powered EDW environment.

Benefits of an Open Source Powered EDW:

Building an open source powered EDW offers several advantages. Firstly, open source solutions often provide a vast community of developers, ensuring continuous support, updates, and improvements. Secondly, open source products can be customized and tailored to meet specific project requirements. This flexibility allows you to adapt the tech stack to your organization’s needs and scale as your data processing requirements grow. Lastly, open source solutions typically offer cost-effectiveness by eliminating or reducing licensing fees, making them an attractive option for organizations of all sizes.

Considerations for Scalability and Performance:

Scalability and performance are crucial factors to consider when selecting the right tech stack for an open source powered EDW. As your data processing needs grow, it’s important to choose a tech stack that can scale horizontally or vertically to handle increasing workloads. Technologies like Kubernetes can be considered for container orchestration and load balancing to ensure efficient utilization of resources and smooth scalability. Additionally, performance optimization techniques, such as caching mechanisms, data indexing, and query optimization, should be considered to ensure fast and efficient data retrieval and processing.

Security and Data Privacy:

When dealing with enterprise data, security and data privacy are of utmost importance. Ensure that the chosen tech stack incorporates robust security measures and follows best practices for data encryption, access control, and secure communication protocols. Regular security audits and updates are essential to address any vulnerabilities and ensure compliance with data privacy regulations.

Summary

Picking the right tech stack for an open source powered EDW is a crucial step in building an efficient and scalable BI-System. By evaluating vendors, leveraging open source products, and understanding the key components of a robust tech stack, you can ensure a solid foundation for your EDW. Databases, automation tools, DevOps and Infrastructure, and visualization choices play vital roles in creating an effective and customizable solution. Embracing open source solutions provides flexibility, community support, and cost-effectiveness, making it an ideal choice for organizations seeking efficient data processing and analysis capabilities. Considerations for scalability, performance, security, and data privacy are important to ensure the success of your EDW implementation.

In conclusion, the selection of a tech stack for an open source powered EDW requires careful consideration of various factors. It is essential to evaluate vendors, leverage open source products, and understand the key components that contribute to a robust tech stack. By making informed choices and aligning the tech stack with your project objectives, you can build a scalable and efficient EDW that empowers your organization to process and analyze data effectively.

If you are interested in learning more, don’t miss the opportunity to participate in the upcoming webinar on this topic.
Register here for free.


Mastering Metadata: Data Catalogs in Data Warehousing with DataHub


Introduction

In today’s data-driven world, it is essential to be able to manage and organize large amounts of data in an efficient way. Businesses across all industries are forced to contend with more data than ever before. The introduction and development of an enterprise data warehouse in a company naturally plays a central role but does not solve a major challenge: How to effectively organize and manage the data, especially metadata, in an Enterprise Data Warehouse? This is where the concept of data catalogs comes into play.

A data catalog serves as a comprehensive inventory of data assets in an organization, providing context, annotations, and metadata to facilitate understanding and discovery of data. It’s like a map to your data, helping users navigate the complex data landscape to find the exact data they need.

A data catalog can help users to understand where to find specific data in the data warehouse that fits their need and to investigate where it came from as well as how it might be connected to other data. This can greatly simplify tasks like data analysis and reporting, making the data warehouse more accessible and usable for everyone in the organization.

There’s also going to be a webinar on July 25 at 2 pm with a more detailed look on Data Catalogs in Data Warehousing with DataHub.

Register here for free.

Understanding Data Catalogs

What is a Data Catalog?

In general, a data catalog is like a metadata inventory, which consists of organized and structured metadata regarding all data assets in an organization. It is a central place where all this metadata can be stored, combined, and categorized, which makes it a lot easier to discover and understand the corresponding data, for example in a data warehouse. A data catalog also has search functionalities to find specific data from the available indexed datasets. It serves like a single source of truth of your metadata, enabling users to trust the data they’re using for their analyses or business decisions.

Role of a Data Catalog in Data Warehousing

In the context of data warehousing, a data catalog brings a lot of benefits. It provides a way to explore and search all data stored in the data warehouse. Technical users, as well as business users, can discover relevant data, understand its context, and ensure it is up-to-date, reliable, and accurate. The following figure shows where data catalogs fit into data warehousing with Data Vault 2.0. A data catalog should cover the entire enterprise BI solution. This also applies, for example, to a data lake, if available, and to the information delivery layer.

Now that we’ve understood what a data catalog is, let’s delve into how each component plays a part in a data catalog and explore how a tool like DataHub can assist organizations in these tasks.

Introduction to DataHub

In the world of data catalogs, DataHub stands out as an increasingly popular choice for many businesses. DataHub is an open-source metadata platform developed by LinkedIn to address its growing need for a more dynamic and scalable data management tool; it was created in part because the existing tools could not keep up with LinkedIn’s expanding needs.
As LinkedIn grew, so did its data volume, variety, and velocity. Recognizing the need for a more efficient way to manage its data, LinkedIn built and introduced DataHub in 2020. Open-sourcing DataHub allowed other organizations to benefit from this advanced tool, and it has since been adopted by many businesses looking for a modern, scalable data catalog solution.

DataHub supports both push-based and pull-based metadata ingestion and includes a wide range of integrations, for example Airflow, BigQuery, Databricks, dbt, Hive, Kafka, Looker, MSSQL, MongoDB, Oracle, S3, PowerBI, Snowflake, Spark, and many more. You can find a full list here. This gives DataHub the ability to combine and show metadata about the same data from multiple sources, for example a dbt model definition and whether its tests ran successfully, right next to the database schema and statistics for all columns.
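To give a feel for pull-based ingestion, the sketch below shows what an ingestion recipe could look like for a Snowflake source and the REST sink. The field names are assumptions based on the connector documentation and should be verified against the DataHub docs for your version; the recipe is then executed with the CLI, for example via datahub ingest -c recipe.yml:

    # recipe.yml -- illustrative sketch, verify connector options in the DataHub docs
    source:
      type: snowflake
      config:
        account_id: my_account            # assumed field names for the Snowflake connector
        warehouse: COMPUTE_WH
        username: datahub_user
        password: "${SNOWFLAKE_PASSWORD}"
        role: datahub_role
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080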

Key Features and Capabilities of DataHub

DataHub, as a metadata platform, goes beyond traditional data catalogs and offers a number of important features and capabilities:

1. Scalability: DataHub is designed to handle metadata from thousands of datasets, which makes it a great choice for large organizations.

2. Flexible and Extensible Data Model: The technical data model behind the tool is designed to be customizable and expandable, allowing organizations to adapt it to their specific business requirements.

3. Powerful Search and Discovery: Leveraging Elasticsearch, DataHub offers robust search functionality that enables users to discover datasets quickly based on various attributes, such as the data’s origin, schema, and usage.

4. Rich Metadata: Unlike traditional data catalogs, DataHub captures and presents a wide variety of metadata, including data lineage, operational metadata, and business metadata. This gives users a comprehensive understanding of their data.

5. Data Lineage and Relationships: DataHub automatically captures and visualizes data lineage, showing how data flows through various systems. It also displays relationships between datasets, allowing users to understand how different data assets interact with each other.

Summary

Benefits of Using a Data Catalog

Using a data catalog comes with several benefits:

  1. Enhanced Data Discovery: With the search and categorization capabilities of a data catalog, users can quickly find the exact data they need without having to comb through large datasets.
  2. Improved Data Understanding: The metadata in a data catalog provides users with necessary context about the data, making it easier to interpret and use correctly.
  3. Better Compliance and Governance: A data catalog supports data governance initiatives by ensuring data is consistent, accurate, and compliant with relevant regulations.
  4. Increased Trust in Data: By providing transparency into data lineage, a data catalog helps build trust in the data by allowing users to see its history and verify its accuracy and reliability.
  5. Time and Resource Efficiency: By making it easier to locate and understand data, a data catalog can save the company resources, thus speeding up data-driven activities and reducing the burden on data management teams.

In conclusion, DataHub provides a flexible, feature-rich, and all-encompassing option for data catalogs in a data warehousing environment. By providing powerful features for data discovery, metadata management, data lineage, and data governance, it enables businesses to extract maximum value from their data.

If you’re interested in learning more about data catalogs, don’t miss the opportunity to participate in the upcoming webinar about this topic.
Register here for free.


Data Vault 2.0 with Hadoop and Hive/Spark


Introduction

In this newsletter, you’ll receive an overview of what Hadoop and Hive are and why they can be used as an alternative to traditional databases.

There’s also going to be a webinar on June 13 at 2 pm with a more detailed look on how the components interact with each other and what a sample Data Vault 2.0 architecture looks like in that ecosystem. The webinar will also cover why Spark is used with Hadoop and Hive.

Take a look into the recording.

Hadoop

Hadoop is used to process and analyze large volumes of data efficiently by distributing the workload across a cluster of commodity hardware, enabling parallel processing and providing fault tolerance through its distributed file system and resource management framework.

HDFS – Hadoop Distributed File System

HDFS is a distributed file system that provides reliable and scalable storage for big data. It breaks large files into blocks and distributes them across a cluster of commodity hardware. HDFS ensures data reliability and availability through data replication.

Yet Another Resource Negotiator – YARN

YARN provides a flexible, scalable resource management framework for Hadoop, enabling a variety of applications and workloads to coexist and efficiently utilize the cluster’s resources. It abstracts the underlying infrastructure and allows for the dynamic allocation of resources based on application requirements.

MapReduce – MR

MapReduce is a programming model and processing framework for distributed data processing in Hadoop. It allows for parallel processing of large datasets by dividing the workload into map and reduce tasks. Map tasks process data in parallel, and their output is combined and reduced to produce the final result.

Hadoop Common

Hadoop Common provides libraries, utilities, and infrastructure support for the other components of Hadoop. It includes common utilities, authentication mechanisms, and interfaces that are used by various Hadoop modules.

What is the benefit?

Scalability
Hadoop enables the storage and processing of massive amounts of data by scaling horizontally across a cluster of commodity hardware. It can handle petabytes of data without sacrificing performance.

Distributed Computing
Hadoop distributes data and processing tasks across multiple nodes in a cluster, allowing for parallel processing and faster data analysis. This distributed computing model enables efficient utilization of resources and enables high-performance data processing.

Fault Tolerance
Hadoop provides fault tolerance by replicating data across multiple nodes in the cluster. If a node fails, data can still be accessed from other replicas, ensuring data reliability and availability.

Cost-Effectiveness
Hadoop is designed to run on inexpensive commodity hardware, making it a cost-effective solution for storing and processing large volumes of data. It eliminates the need for expensive specialized hardware.

Flexibility and Extensibility
Hadoop’s modular architecture allows for integration with various tools and frameworks within the Hadoop ecosystem, providing flexibility and extensibility. It supports a wide range of data processing tasks, including batch processing, real-time processing, machine learning, and more.

Data Locality
Hadoop’s distributed file system, HDFS, aims to bring the computation closer to the data. By processing data where it is stored, Hadoop minimizes data movement across the network, reducing latency and improving overall performance.

Ecosystem and Community
Hadoop has a rich ecosystem with a wide range of tools, libraries, and frameworks that extend its functionality for different use cases. It also has a large and active community of users, developers, and contributors, providing support, resources, and continuous improvement.

These benefits make Hadoop a powerful, popular solution for handling big data, enabling organizations to efficiently store, process, and gain insights from vast amounts of structured and unstructured data. The whole ecosystem can also run on-premise, which can make it a good alternative if ‘cloud’ is not an option.

Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides a high-level SQL-like query language called HiveQL for querying and analyzing large datasets.

What are the components?

Data Storage
Hive leverages Hadoop Distributed File System (HDFS) as its underlying storage system. It stores data in HDFS in a distributed and fault-tolerant manner, allowing for scalable, reliable data storage.

Schema Definition
Hive allows users to define a schema for their data using its data definition language (Hive DDL), which is similar to SQL DDL. This allows users to define tables, partitions, columns, data types, and other metadata associated with the data.
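As a brief illustration, a HiveQL DDL statement could look like the following; the table and column names are made up for this sketch:

    -- Illustrative HiveQL DDL; table and column names are made up
    CREATE TABLE IF NOT EXISTS sales_orders (
        order_id      BIGINT,
        customer_id   BIGINT,
        order_status  STRING,
        total_price   DECIMAL(18,2),
        order_date    DATE
    )
    STORED AS ORC;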

Query Optimization
Hive optimizes queries by performing query planning and optimization techniques. It aims to generate efficient query execution plans to minimize data movement, optimize resource utilization, and improve query performance.

Hive Metastore
Hive maintains a metadata repository called the Hive Metastore. It stores information about the tables, partitions, schemas, and other metadata associated with the data stored in HDFS. The metastore allows for efficient metadata management and retrieval during query processing.

Extensibility
Hive offers extensibility through User-Defined Functions (UDFs), User-Defined Aggregations (UDAs), and User-Defined Table Functions (UDTFs). These allow users to define custom logic and operations in programming languages like Java, Python, or other supported languages.

Integration with other tools
Hive integrates with various other tools and frameworks in the Hadoop ecosystem. For example, it can work alongside Apache Spark, Apache Pig, Apache HBase, and other components to provide a complete data processing and analytics solution.

Partitioning and Bucketing
Hive supports data partitioning and bucketing, allowing users to organize and store data in a structured manner. Partitioning involves dividing data into logical partitions based on specific criteria, while bucketing involves dividing data into equally sized buckets based on hash values.
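Both techniques are declared in the table DDL. A partitioned and bucketed variant of the illustrative table from above could look like this:

    -- Illustrative HiveQL: partitioned by date, bucketed on the customer key
    CREATE TABLE IF NOT EXISTS sales_orders_part (
        order_id      BIGINT,
        customer_id   BIGINT,
        order_status  STRING,
        total_price   DECIMAL(18,2)
    )
    PARTITIONED BY (order_date DATE)
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC;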

SerDe
Hive uses a serialization/deserialization framework called SerDe (Serializer/Deserializer) to read and write data in different formats, such as CSV, JSON, Avro, and more. Users can specify the appropriate SerDe for their data format to ensure proper data processing.

Overall, Hive simplifies data querying and analysis on Hadoop by providing a familiar SQL-like interface. It abstracts the complexity of writing low-level MapReduce or Tez jobs and provides a declarative and user-friendly approach to interact with large-scale data stored in Hadoop.

Conclusion

Hadoop is a robust and feature-rich environment that can be challenging to manage. However, its numerous advantages make it a compelling choice, depending on the user’s needs and the available in-house expertise. If you’re interested in learning more about it, don’t miss the opportunity to participate in the upcoming webinar.

Register here for free


Bring your Data Vault automation to the next level – with Datavault4Coalesce


In last September’s webinar, we explored a new competitor in the highly contested market of data warehouse automation tools: Coalesce!


And everyone who watched that webinar might remember that at the end, we announced an even closer relationship between Coalesce and Scalefree and a commitment to bring Scalefree’s best practices into Coalesce!

For those who didn’t watch the webinar, you can find it here.

Recap: What is Coalesce?

Coalesce is a Data Transformation solution made for Snowflake. When working with Coalesce, you build directed acyclic graphs (DAGs) which contain nodes. A node represents any physical database object, like tables or views, or even stages and external tables.

Coalesce itself is built around metadata that stores table and column level information which describes the structure inside your data warehouse. This metadata-focused design enables a couple of features which strongly drive towards scalability and manageability. 

All this metadata allows a team to track past, current, and desired states of the data warehouse by deeply integrating it into all the workflows that it brings. Additionally, users can define standardized patterns and templates on column and table level.

How can Data Vault jump in here?

These templates open the door to implementing Data Vault 2.0 related patterns and best practices. Especially on table level, it quickly comes to mind that you could build a template for a Hub or a Link.

On column level, this could be a repeated transformation that is then managed in only one so-called macro, making it very easy to implement changes with low to zero impact. You could think of hashkey calculation or virtual load-end-dating here.

And that is exactly what we at Scalefree have been doing since last year’s webinar. Lead developers from Coalesce sat together with Data Vault experts and developers from Scalefree with one goal: create something amazing that helps users automate their Data Vault 2.0 implementation!

How fast can I build a Data Vault? Yes!

This cooperation created Datavault4Coalesce, an open source extension package for Coalesce, which will be available on March 16th! Let’s have a sneak peek at a selection of what users can do with Datavault4Coalesce.

The first release of Datavault4Coalesce will feature a basic set of Data Vault 2.0 entities:

  • Stages
  • Hubs
  • Links
  • Satellites
  • PITs
  • Snapshot Control Tables

While providing DDL and DML templates for the entity types mentioned above, Datavault4Coalesce makes use of Coalesce’s ability to define the UI for each node type. For stages, this means that users can decide if they want Datavault4Coalesce to automatically generate ghost records or not, as shown in the screenshot below:

This Data Vault related interface can be found across all node types and allows users to conveniently adjust Datavault4Coalesce to fit their requirements!

How can I start?

First of all, a bummer: Datavault4Coalesce will only be available starting from the 16th of March. But there is no reason to wait that long to dive into Coalesce itself! Since Coalesce is now part of Snowflake Partner Connect, it’s never been easier to get your hands on a fresh Coalesce environment!

Just sign up for a free Snowflake trial here and initialize your Coalesce experience within seconds by accessing the Partner Connect portal! Then, you just have to load any desired data into it and you can start building your data pipelines with Coalesce. And when the 16th of March finally arrives, you just have to add Datavault4Coalesce to your Coalesce environment – and now you can start to build Data Vault faster than ever!

Also, don’t miss out our next webinar which will be held at the day of launch, the 16th of March, where we will show you Datavault4Coalesce in action. Sign up here!


Speed up Your Data Vault 2.0 Implementation – with TurboVault4dbt


Scalefree has released TurboVault4dbt, an open source package that provides automated model generation using DataVault4dbt-compatible templates based on your sources’ metadata.

TurboVault4dbt currently supports metadata input from Excel, GoogleSheets, BigQuery and Snowflake and helps your business with:

  • Speeding up the development process, reducing development costs and producing faster results
  • Encouraging users to analyze and understand their source data

‘Isn’t every model kind of the same?’

Datavault4dbt is the result of years of experience in creating and loading Data Vault 2.0 solutions forged into a fully auditable solution for your Data Vault 2.0 powered Data Warehouse using dbt.

But every developer who has worked with the package or has created dbt models for the Raw Vault will have come across one nuisance:
Creating a new dbt model for a table basically means taking the already existing template and providing it with the specific metadata for that table. Doing this over and over again can be quite a chore, which is why we created TurboVault4dbt to automate and speed up this process.

FROM CTRL+C AND CTRL+V TO A SIMPLE MOUSE-CLICK

How many times has everyone pressed CTRL+C then CTRL+V and corrected a few lines of code when creating new dbt-models for the raw vault?

Instead of trying to figure out what the names of your tables and business keys are or what hashing order you want your Hashkey to be generated in, TurboVault4dbt will do all of that for you. All TurboVault4dbt needs is a metadata input where you capture the structure of your data warehouse.

TurboVault4dbt currently requires a structure of five metadata tables:

  • Hub Entities: This table stores metadata information about your Hubs,
    e.g. (Hub Name, Business Keys, Column Sort Order for Hashing, etc.)
  • Link Entities: This table stores metadata information about your Links,
    e.g. (Link Name, Referenced Hubs, Pre-Join Columns, etc.)
  • Hub Satellites: This table stores metadata information about your Hub Satellites,
    e.g. (Satellite Name, Referenced Hub, Column Definition, etc.)
  • Link Satellites: This table stores metadata information about your Link Satellites,
    e.g. (Satellite Name, Referenced Link, Column Definition, etc.)
  • Source Data: This table stores metadata information about your Sources,
    e.g. (Source System, Source Object, Source Schema, etc.)

By capturing the metadata in the five tables above, TurboVault4dbt can extract the necessary information and generate every model that is based on a selected source, while also encouraging you, as a user, to analyze and understand your data.

Lean back, relax and let the machine take over!

Create and fill your metadata tables, connect them to TurboVault4dbt and enjoy your free time for another cup of coffee. Give it a try, or give us your feedback by visiting TurboVault4dbt on GitHub!

Stay updated on TurboVault4dbt through our marketing channels as great features lie ahead!


About Information Marts in Data Vault 2.0


In the Data Vault 2.0 architecture, information marts are used to deliver information to the end-users.
Conceptually, an information mart follows the same definition as a data mart in legacy data warehousing. However, in legacy data warehousing, a data mart is used to deliver useful information, not raw data. This is why the data mart has been renamed in Data Vault 2.0 to better reflect the use case.

“Classical” information marts

But the definition of information marts has more facets. In the book “Building a Scalable Data Warehouse with Data Vault 2.0” we present three types of marts:

  • Information marts: used to deliver information to business users, typically via dashboards and reports.
  • Metrics Mart: used in conjunction with a Metrics Vault, which captures EDW log data in a Data Vault model. The Metrics Mart is derived from the Metrics Vault to present these metrics in order to analyze performance bottlenecks or the resource consumption of power users and data scientists in managed self-service BI solutions.
  • Error Mart: stores those records that typically fail a hard rule when loading the data into the enterprise data warehouse.

Additional information marts

In addition to these “classical” information marts, we use additional ones in our consulting practice:

  • Interface Mart: this is more or less an information mart, however, the information is not delivered to a human being, e.g. via a dashboard or report. Instead, it is delivered to a subsequent application, or as a write-back, to the source system (for example when using the enterprise data warehouse for data cleansing).
  • Quality Mart: the quality mart is again an information mart, but instead of cleansing bad data, it is used to report bad data. Essentially, it turns the business logic used to cleanse bad data upside down: only bad data, in addition to ugly data at times, is delivered to the end-user, the data steward. This is often done in conjunction with data cleansing frontends where the data steward can either correct source data or comment and tag the exceptions.
  • Source Mart: again an information mart, but this time not using one of the popular schemas, such as star schemas, snowflake schemas or fully denormalized schemas. Instead, the information mart uses the data model of the source application, similar to an operational data store (ODS) schema. However, the Source Mart is not a copy of the data, it is a virtualized model on top of the Data Vault model, reflecting the original structures. It’s great for ad-hoc reporting and provides great value for many data scientists as well as power users.

This concludes our list of information marts. We have used them successfully in projects for our clients to better communicate the actual application of the information marts in their organization.

Let us know in the comments if you think this is helpful for you, too!


Kick-start your Data Vault 2.0 Implementation – with datavault4dbt


Scalefree has released datavault4dbt, an open source package that provides best-practice loading templates for Data Vault 2.0 entities, embedded into the open source data warehouse automation tool dbt.

Datavault4dbt currently supports Snowflake, BigQuery and Exasol and comes with a lot of great features:

  • A Data Vault 2.0 implementation congruent to the original Data Vault 2.0 definition by Dan Linstedt
  • Ready for both Persistent Staging Areas and Transient Staging Areas, since all macros allow multiple deltas, without losing any intermediate changes
  • Creating a centralized, snapshot-based business interface by using a central snapshot table supporting logarithmic logic
  • Optimizing incremental loads by implementing a high-water-mark that also works for entities that are loaded from multiple sources

Building a Data Vault 2.0 Solution – made easy

The overall goal of releasing Data Vault 2.0 templates for dbt is to combine our years of experience in creating and loading Data Vault 2.0 solutions into publicly available loading patterns and best practices for everyone to use. Out of this ambition, datavault4dbt, an open source package for dbt was created and will be maintained by the Scalefree expert team. 

The most valuable characteristic of datavault4dbt is that it embodies the original Data Vault 2.0 definition by Dan Linstedt. It represents a fully auditable solution for your Data Vault 2.0 powered Data Warehouse. With a straightforward, standardized approach, it enables the team to conduct agile development cycles.

By allowing multiple increments per batch while loading each Data Vault entity type, datavault4dbt supports both Persistent and Transient Staging Areas without losing any intermediate changes. These incremental loads are even optimized by implementing a dynamic high-water-mark that even works when loading an entity from multiple sources.

Additionally, datavault4dbt encourages strict naming conventions and standards by implementing a variety of global variables that span across all Data Vault layers and supported Databases. The process of end-dating data is completely virtualized to ensure a modern insert-only approach that avoids updating data.

With all these features, datavault4dbt is the perfect solution for your modern Big Data Enterprise Data Warehouse.

From the Stage over the Spine into the PITs

To achieve all this, we worked hard on creating a solid and universal staging area. All hashkeys and hashdiffs are calculated here, and users are given the option to add derived columns, generate prejoins with other stages, and add ghost records to their data. All of this is highly automated based on parameterized user input.

Based on staging areas, the Data Vault 2.0 spine can be created. Hubs, Links and Non-Historized Links can be loaded from multiple sources including mapping options to ensure business harmonization. 
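As a rough sketch of what loading a Hub from such a stage could look like (the hashkey and business key names reuse the staging example from our earlier post on the staging layer; the exact macro arguments and materialization settings are documented in the package):

    -- models/raw_vault/order_h.sql (illustrative sketch)
    {{ config(materialized='incremental') }}

    {{ datavault4dbt.hub(hashkey='hk_order_h',
                         business_keys=['O_ORDERKEY', 'O_CUSTKEY'],
                         source_models='stage_orders') }}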

This spine is then enriched by Standard Satellites, Non-Historized Satellites, Multi-Active Satellites and/or Record-Tracking Satellites. All of those that require it come with a version 0 for tables and a version 1 for end-dated views. 

Based on the Raw Data Vault, PITs can be created automatically, and their loading is backed by an automated, highly-configurable but optional logarithmic snapshot logic. This logic is included in the Control Snapshot Table, which also comes in two consecutive versions. To wrap the logarithmic snapshot logic up, a post-hook for cleaning up all PITs is included and comes in handy.


Start now and boost your Data Vault experience!

The lines above made you think “Nah, that’s all too good to be true!”? Convince yourself, or give us your highly appreciated feedback by visiting datavault4dbt on Github!

Of course, our future ambitions for datavault4dbt are high and next on our list are a lot of important topics, like:

  • Provide a detailed working example of datavault4dbt
  • Extend and migrate the existing documentation of the package
  • Support more and more databases
  • Add more advanced and specific Data Vault 2.0 entities
  • Develop automated Data Vault related tests
  • Review and implement user feedback and suggestions

Stay tuned for more datavault4dbt content on all our marketing channels!

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil that can be used to draw Data Vault models is available at www.visualdatavault.com.
