Sample Page – Scalefree

Building a scalable Data Platform? In Data Vault

Managed Self-Service BI: Success in Spite of Stringent Regulation

The latest story to find itself added to the growing number of successful implementations of Scalefree’s services, and Data Vault 2.0 as a whole, centers around a sector known for its strict regulatory bodies in addition to high volume of data that demands the utmost in terms of privacy and security.

As the events that unfolded during the various financial crises at the start of the century left governments the world over seeking to impose stricter regulations for the financial sector. Banks within the sector were faced with a new task as they sought to continue operating with expansion in mind while still falling well within defined standards. Continue Reading

Building a scalable Data Platform? In Data Vault

The Latest Innovations of Data Vault 2.0

Data Vault 2.0 Training FAQ - Customized Class

Focus on trends: Data Lake and no-sql, dwh architecture, self-service bi, modeling and gdpr

In the past, we wrote about topics we were confronted with when we consult our clients or just recognized widely occurring discussions in the web.

All these topics were already covered in Data Vault 2.0 and most of them moved into a higher focus within the last months. Coming with the trends in the private sector, NoSQL databases are now playing an important role for storing data fast from different source systems. This brings new opportunities to analyze the data, but also new challenges, i.e. how to query fast from those “semi”- and “unstructured” data, e.g. including Massive Parallel Processing (MPP). Furthermore, there is an abundance of tools to store, transport, transform and analyze the data, what often results in time and cost-intensive researching. The knowledge about “Schema on Write” and “Schema on Read” (and their differences) became very important to build a Data “Warehouse”. A Schema has been and is still mandatory for Business Analysts when they have to tie the data to business objects for analytical reasons. Storing your data in NoSQL platforms only (let’s call it a “Data Lake”) is a good approach to capture all your company’s data, but it became much more difficult for Business User to get the data out from those platforms. A good and recommended approach is to have both, a Data Lake AND a Data Warehouse combined in a Hybrid Architecture.

Building a scalable Data Platform? In Agile Data

How to Scale in a Disciplined Agile Manner?

Looking beyond Scrum and learn how to increase the value in Data Vault 2.0 projects

Earlier this year we talked about Managed Self-Service BI to explain how business users can take a benefit from this approach in Data Vault 2.0. Now we want to show you how to get there from a project management perspective, even in large companies where the standard Scrum approach often not works with the accorded deployment/release regulations and other approaches like the Disciplined Agile framework are the better fit.

Agile transformation is hard because cultural change is hard. It’s not one problem that needs to be solved, but a series of hundreds of decisions affecting lots of people over a long period of time that affects relationships, processes, and even the state of mind of those working within the change.

There are two fundamental visions about what it means to scale agile: Tailoring agile strategies to address the scaling challenges – such as geographic distribution, regulatory compliance, and large team size – faced by development teams and adopting agility across your organization. Both visions are important, but if you can’t successfully perform the former then there is little hope that you’ll be successful at the latter.

Building a scalable Data Platform? In Data Vault

Still Struggling with GDPR?

GDPR

The new General Data Protection Regulation (GDPR) is a law by the European Union (EU) and became effective on May 25, 2018. This new regulation is designed to put a high level of protection to personal data of European citizens, what means that companies around the world have to establish transparency and ownership to the individuals’ data and need to get a clear declaration of consent from them to save and process their personal data. Though laws from countries outside the EU (especially the USA) tend to favor business over consumer, GDPR affects all companies over the world who have personal data from EU-citizens in their database.

In this article:

What is new in GDPR?
Data Vault 2.0 as a solution
Conclusion

What is new in GDPR?

To be careful with personal data is nothing new, especially not in the EU. The key change of collecting and processing personal data is that the data is now completely under control of the owner, who can force the companies to delete or anonymize their data or to request copies of all owners personal data stored in the system. Personal data or Privately Identifiable Information (PII) means data, an individual can be identified with, e.g. name, phone number or email address. Continue Reading

Building a scalable Data Platform? In Data Warehouse

Data Warehouse and Data Lake: Do We Still Need a Data Warehouse?

“Big Data”, “Data Lake”, “Data Swamp”, “Hybrid Architecture”, “NoSQL”, “Hadoop” … terms you are confronted with very often these days when you are dealing with data. Furthermore, the question comes up if you really need a data warehouse nowadays when you deal with a high variety and volume of data. We want to talk about what a data lake is, if we need a data warehouse when using NoSQL platforms like Hadoop, and how it is combined with Data Vault.

WHAT IS A DATA LAKE?

There is a proper definition from Tamara Dull (SAS): “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.” ¹ Continue Reading

Building a scalable Data Platform? In Data Vault

How to Combine Managed Self-Service BI with Data Vault 2.0?

Managed Self-Service BI and Data Vault 2.0

Combining Managed Self-Service BI with Data Vault 2.0

This article explores how combining Managed Self-Service BI with Data Vault 2.0 enables organizations to balance data governance and agility, ensuring both control and flexibility in their analytics processes.Last month we talked about a hybrid architecture in Data Vault 2.0, where we explain how to combine structured and unstructured data with a hybrid architecture. To follow up on this topic, we now want to explain how your business users (especially power users) can take a benefit from it with the managed Self-Service Business Intelligence (mSSBI) approach in Data Vault 2.0.

In this article:

About Self-Service BI
About the Managed Self-Service BI Approach
Implement MSSI in the Data Vault 2.0 Architecture
Conclusion

About Self-Service BI

Self-service BI allows end-users to completely circumvent IT due to this unresponsiveness of IT. In this approach, business users are left on their own with the whole process of sourcing the data from operational systems, integration and consolidation of the raw data. There are many problems with this self-service approach without the involvement of IT:

Direct access to source systems
Unintegrated raw data
Low data quality
Unconsolidated raw data
Non-standardized business rules

In many cases, end-users – even if they are power users with the knowledge to SQL, MDX, and other techniques, don’t have the right tools available to solve the tasks. Instead, much work is done manually and error-prone. But from our experience, it is not possible to completely prevent such power users from obtaining data from source systems, preparing it, and eventually reporting the data to upper management. What organizations need is a compromise between IT agility and data management that allows power users to obtain the data they need quickly, in a usable quality. To overcome these problems, the Data Vault 2.0 standard allows experienced or advanced business users to perform their own data analysis tasks on the raw data of the data warehouse.

About the Managed Self-Service BI Approach

In fact, a Data Vault 2.0 powered IT welcomes business users to take the data that is available in the enterprise data warehouse (either in the Raw Data Vault or in the Business Vault) to create local information marts using specialized tools. These tools retrieve the data from the enterprise data warehouse, apply a set of user-defined business rules and present the output to the end-user. IT might also create structures where organizational-wide business rules are applied to provide a consolidated view on parts of the model or pre-calculate KPIs to ensure consistency among such calculations. Because both types of data (raw data and business rule applied data) is already integrated, the business user can also join consolidated data with raw data from specific source systems. This approach is called Managed Self-Service BI, where IT evolves to a service organization that provides those power users with the data they want, in the timeframe they need. The data is integrated by its business key and can be consolidated as well as quality checked.

Implement MSSI in the Data Vault 2.0 Architecture

The Data Vault 2.0 architecture provides self-service capabilities for power users in the organization:

Figure1 : mSSBI Architecture

In this case, power users who build their own, custom solutions can write back data and information into the Enterprise Data Warehouse by leveraging a dedicated user space for this purpose. The write-back can then later be re-used in the solution for information delivery. Furthermore, the business users can manage their own master data by using an MDM application. It enables authorized business users to change the parameter values and therefore influence the results of the business rules.

The difference between “managed” Self-Service BI to the standard Self-Service BI approach from the general industry is that Data Vault 2.0 provides a managed environment where data and information are provided in a controlled and secure manner. Power users can only query the data they are allowed to see from a data security perspective.
Another advantage is that this approach enables organizations in security and banking industries to provide a fully auditable and traceable environment that meets the highest security requirements.

Managed Self-Service BI does require a write-back possibility in the enterprise data warehouse architecture, otherwise, it’s just plain old BI solution. Without write-back, there are no differentiators. Beyond that, write-back is necessary in order for enriching or enhancing the quality of the data being put forward. Data scientists, for example, do this all the time: when using Hadoop they create a new target file as a result of their processing output. This is a direct write-back in the Hadoop space. We have required write-back in Self Service BI for years, otherwise, master data, and hierarchy management don’t work properly.

In this modification of the architecture from the previous section, the relational staging area is replaced by a HDFS based staging area which captures all unstructured and structured data. While capturing structured data on the HDFS appears as overhead at first glance, this strategy actually reduces the burden of the source system by making sure that the source data is always being extracted, regardless of any structural changes. The data is then extracted using Apache Drill, Hive External or similar technologies. It is also possible to store the Raw Data Vault and the Business Vault (the structured data in the Data Vault model) on Hive Internal.

Conclusion

Combining Managed Self-Service BI with Data Vault 2.0 empowers organizations to strike a balance between governance and agility in their data ecosystems. By leveraging Data Vault’s structured, auditable architecture alongside self-service BI’s flexibility, businesses can ensure data accuracy, security, and scalability while enabling users to gain faster insights. This approach enables collaboration between IT and business users, driving more informed decision-making and accelerating data-driven innovation.

Building a scalable Data Platform? In Data Architecture

Hybrid Architecture in Data Vault 2.0

Hybrid Architecture in Data Vault 2.0

Business users expect their data warehouse systems to load and prepare more and more data, regarding the variety, volume, and velocity of data. Also, the workload that is put on typical data warehouse environments is increasing more and more, especially if the initial version of the warehouse has become a success with its first users. Therefore, scalability has multiple dimensions. Last month we talked about Satellites, which play an important role in scalability. Now we explain how to combine structured and unstructured data with a hybrid architecture.

In this article:

Logical Data Vault 2.0 Architecture
Hybrid Architecture
Conclusion

Logical Data Vault 2.0 Architecture

The Data Vault 2.0 architecture is based on three layers: the staging area which collects the raw data from the source systems, the enterprise data warehouse layer, modeled as a Data Vault 2.0 model, and the information delivery layer with information marts as star schemas and other structures. The architecture supports both batch loading of source systems and real-time loading from the enterprise service bus (ESB) or any other service-oriented architecture (SOA).

The following diagram shows the most basic logical Data Vault 2.0 architecture:

Figure 1: Logical Data Vault 2.0 Architecture

In this case, structured data from source systems is first loaded into the staging area to reduce the operational / performance burden from the operational source systems. It is then loaded unmodified into the Raw Data Vault which represents the Enterprise Data Warehouse layer. After the data has been loaded into this Data Vault model (with hubs, links, and satellites), business rules are applied in the Business Vault on top of the data in the Raw Data Vault. Once the business logic is applied, both, the Raw Data Vault and the Business Vault are joined and restructured into the business model for information delivery in the information marts. The business user is using dashboard applications (or reporting applications) to access the information in the information marts.

The architecture allows implementation of the business rules in the Business Vault using a mix of various technologies, such as SQL-based virtualization (typically using SQL views), and external tools, such as business rule management systems (BRMS).

However, it is also possible to integrate unstructured NoSQL database systems using a hybrid architecture. Due to the platform independence of Data Vault 2.0, NoSQL can be used for every data warehouse layer, including the stage area, the enterprise data warehouse layer, and information delivery. Therefore, the NoSQL database could be used as a staging area and load data into the relational Data Vault layer. However, it could also be integrated both ways with the Data Vault layer via a hashed business key. In this case, it would become a hybrid architecture solution and information marts would consume data from both environments.

Hybrid Architecture

The standard Data Vault 2.0 architecture in Figure 1 focuses on structured data. Because more and more enterprise data is semi-structured or unstructured, the recommended best practice for a new enterprise data warehouse is to use a hybrid architecture based on a Hadoop cluster, as shown in the next figure:

Figure 2: Hybrid Data Vault 2.0 Architecture

In this hybrid architecture modification, the relational staging area is replaced by a HDFS based staging area that captures all unstructured and structured data. While capturing structured data on the HDFS appears as overhead at first glance, this strategy actually reduces the burden of the source system by making sure that the source data is always being extracted, regardless of any structural changes. The data is then extracted using Apache Drill, Hive External, or similar technologies.

It is also possible to store the Raw Data Vault and the Business Vault (the structured data in the Data Vault model) on Hive Internal.

Conclusion

Integrating a hybrid architecture within Data Vault 2.0 enables organizations to effectively manage both structured and unstructured data by leveraging platforms like Hadoop. This approach enhances scalability and flexibility, allowing for efficient data processing and storage. By replacing traditional relational staging areas with HDFS-based systems, businesses can reduce the burden on source systems and ensure seamless data extraction

Building a scalable Data Platform? In Data Modeling

Visual Data Vault by Example: Satellites Modeling in the Health Care Industry

Data Vault 2.0 is a concept for data warehousing, invented by Dan Linstedt. It brings many new features that help anyone who is concerned with Business Intelligence entering a new age of data warehousing. Data Vault 2.0 is a Big Data concept that integrates relational data warehousing with unstructured data warehousing in real-time. It is an extensible data model where new data sources are easy to add. When our founders wrote the book, they required a visual approach to model the concepts of Data Vault in the book. For this purpose, they developed the graphical modeling language, which focuses on the logical aspects of Data Vault. The Microsoft Visio stencils and a detailed white paper are available on www.visualdatavault.com as a free download.

This year we already wrote about the modeling of hubs and links in Data Vault 2.0. Now, we want to introduce you the third standard entity, the Satellite.

SATELLITES IN VISUAL DATA VAULT

Satellites add descriptive data to hubs and links. Descriptive data is stored in attributes that are added to the satellite. The individual attributes are added to the satellite one at a time. A satellite might be attached to any hub or link. However, it is only possible to attach the satellite to one parent. Continue Reading

Building a scalable Data Platform? In Solutions

Detect Deletes – Standard Process without Last Seen Date, Using Descriptive Attribute

Descriptive Attribute

Hubs and Links provide a distinct list of all business keys ever recognized by the data warehouse. They are not end-dated and don’t provide any information about the current status of the record (e.g. if the business key is still valid / assigned to a business object).

It is required to define how to store the deleted flag (for example as a descriptive field next to other descriptive fields in a standard satellite) and how to identify the deletes by identifying the sub-set of business keys from the target hub (or relationships from the link) that don’t exist anymore in the staging area source table.

Without this descriptive flag, which indicates the current status of the record, the data in the data warehouse should be considered as inconsistent because hubs and links don’t implement effectivity.

Detecting deletes is an important process to ensure the consistency of the data warehouse. This solution describes how to detect deletes without requiring a Last Seen Date. The state of the record is expressed using a descriptive field in the staging area.

ACCESS THE SOLUTION

Building a scalable Data Platform? In Data Modeling

Visual Data Vault by Example: Links Modeling in the Banking Industry

Visual Data Vault

The following article describes how to represent links using a Visual Data Vault example. With the advent of Data Vault 2.0, which adds architecture and process definitions to the Data Vault 1.0 standard, Dan Linstedt standardized the Data Vault symbols used in modeling. Based on these standardized symbols, the Visual Data Vault (VDV) modeling language was developed, which can be used by EDW architects to build Data Vault models.

When our founders wrote the book, they required a visual approach to model the concepts of Data Vault in the book. For this purpose, they developed the graphical modeling language, which focuses on the logical aspects of Data Vault. The Microsoft Visio stencils and a detailed white paper are available on www.visualdatavault.com as a free download.

In this article:

Links in Visual Data Vault
Conclusion

Links in Visual Data Vault

We previously published another newsletter how hubs are modeled in the accounting industry. In this Newsletter we explain the function of standard links and how the modeling in the banking industry works.

Links connect individual hubs in a Data Vault model and represent either transactions or relationships between business objects. Business objects are connected in business. No business object is entirely separate from other business objects. Instead, they are connected to each other through the operational business processes that use business objects in the execution of their tasks. The image below shows a link that connects two hubs (a standard link has to have at least two connections) as the following diagram shows:

Figure 1: A standard link connects two hubs

The link in in the image above references two hubs: Account and Customer. The connector (the arrow) should be read as “(the hub) Customer is used by (the link) Account to Customer.” The second reference is a little different because the name of the connection between the Account hub and the link is overwritten by the meaning of a credit or a debt Account. This is necessary in cases where the model requires more meaning or when multiple connections are required to the same hub. The hash keys of each hub, which identify each business object unique by one calculated attribute, are replicated into the link entity by using the same attribute name.

A link represents many-to-many relationships and therefore they provide flexibility because changes to the business rules don’t require re-engineering and the granularity is expressed by the number of referenced hubs and is thus well documented.

Figure 2: Link table with its attributes

The link contains all hash keys of the related hubs (logical foreign keys), a load date when the relationship arrives the data warehouse for the first time, the record source where the data comes from, and the link hash key (logical primary key) which is calculated from the business keys of the hubs (not from the hash keys – never hash a hash!) and follows an insert-only loading pattern.

Links greatly improve the flexibility of the Data Vault model, because it is easy to add links or modify the relationship type of existing links. It takes less time to respond to changes in the business. To add new functionality, you only need to add new hubs and connect them via links to existing hubs. Usually a standard satellite is attached to the link, which contains the descriptive data of the relationship between the hubs.

Another common kind of link is the Non-Historized Link (also known as Transactional Link) which contains transactions only and does not need a Satellite, what means that the loading pattern is a complete insert-only approach. Read more about the value of NH-Links in an earlier Newsletter this year.

Conclusion

In conclusion, the Visual Data Vault modeling language offers a standardized and effective approach for representing complex relationships and transactions within the banking industry. By connecting business objects through links, it ensures a clear and logical structure that mirrors real-world processes. This methodology not only enhances the clarity of data models but also facilitates efficient data management and retrieval, making it an invaluable tool for enterprise data warehouse architects.

Building a scalable Data Platform? In Data Vault

Achieve Data Lineage in Data Vault 2.0

One common requirement in data warehouse projects is to provide data lineage from end-to-end. However, custom solutions (for example custom Meta Marts for self-developed Data Vault generators) or tools from different vendors often break such end-to-end data lineage.

Unlike business or technical metadata, which is provided by the business or source applications, process execution metadata is generated by the data warehouse team and provides insights into the ETL processing for maintenance. The data is used by the data warehouse team or by end-users to better understand the data warehouse performance and results presented in the information marts. One type of process execution metadata is the control flow metadata which executes one or more data flows among other tasks. Logging the process execution provides a valuable tool for maintaining or debugging the ETL processes of the data warehouse because it provided information about the data lineage of all elements of the data warehouse. Continue Reading

Building a scalable Data Platform? In Data Modeling

Data Vault Modeling: Visual Example in the Accounting Industry

Visual Example of a Data Vault Modeling

With the addition of architecture and process definitions in Data Vault 2.0, Dan Linstedt has standardized Data Vault modeling.

Based on these standardized symbols, the Visual Data Vault modeling language was developed, which can be used by Enterprise Data Warehouse architects to build Data Vault models.

The authors of the book “Building a Scalable Data Warehouse”, who are the founders of Scalefree, required a visual approach to model the concepts of Data Vault in the book.

For this purpose, they developed the graphical Data Vault modeling language, which focuses on the logical aspects of Data Vault.

The Microsoft Visio stencils and a detailed white paper are available on www.visualdatavault.com as a free download.

Hubs in Visual Data Vault

Business keys play an important role in every business, because they are referenced by business transactions and relationships between business objects.

Whenever a business identifies and tracks business objects, business keys are used throughout business processes.

This is one of the reasons why Data Vault is based on the business keys. In Data Vault modeling, business keys are stored in hub entities.

The challenge is to identify the business keys that represent a business object uniquely. That can be just one business key, but also a composite key or a smart key.

The first image shows a hub with only one business key attribute:

Here, the attribute Invoice Number is sufficient to identify the invoice. No other attribute is required (such as the invoice year).

In other cases, it is not as easy, as the following diagram shows:

In this Data Vault modeling case, the accountant is identified by a Country Code attribute (such as the ISO2 code) and an Employee Number attribute.

One attribute alone would not be sufficient to identify the accountant: the employee number by itself might be overlapping across all countries and have only a local meaning (employee number 10006 might be used in multiple countries and identify a different accountant in each country).

Therefore, the local key is extended by the country code to uniquely identify the accountant. Be aware, the country code has to be in the source data to make this a valid model in Data Vault (in the end, we do model source data, in the Raw Data Vault, not the desired model of the business).

Another example extends this concept into a so-called smart-key:

Here, the IBAN number, which is used to identify banking accounts internationally, consists of 4 physical elements in the number:

Country code
Checking number
Account number
Bank Identifier code

In order to model a smart key (a key that comprises multiple parts or keys), add a smart key to the hub and then add business keys to identify the sections of the smart key.

As you can see from above figure, the logical symbol of a smart key is similar to that of a business key. However, the icons are slightly different and the shape indicates a stack.

In this Data Vault modeling example, each business key is modeled as an individual attribute in the hub entity. The combination identifies a business object in the business. The checking number is actually not modelled, because it contains no business value (except the ability to serve as a technical checksum).

However, you’re not wrong with adding it to the model, too.

Managed Self-Service BI: Success in Spite of Stringent Regulation

Focus on trends: Data Lake and no-sql, dwh architecture, self-service bi, modeling and gdpr

How to Scale in a Disciplined Agile Manner?

Looking beyond Scrum and learn how to increase the value in Data Vault 2.0 projects

GDPR

What is new in GDPR?

Data Warehouse and Data Lake: Do We Still Need a Data Warehouse?

WHAT IS A DATA LAKE?

How to Combine Managed Self-Service BI with Data Vault 2.0?

Combining Managed Self-Service BI with Data Vault 2.0

About Self-Service BI

About the Managed Self-Service BI Approach

Implement MSSI in the Data Vault 2.0 Architecture

Conclusion

Hybrid Architecture in Data Vault 2.0

Hybrid Architecture in Data Vault 2.0

Logical Data Vault 2.0 Architecture

Hybrid Architecture

Conclusion

Visual Data Vault by Example: Satellites Modeling in the Health Care Industry

SATELLITES IN VISUAL DATA VAULT

Descriptive Attribute

Visual Data Vault by Example: Links Modeling in the Banking Industry

Visual Data Vault

Links in Visual Data Vault

Conclusion

Achieve Data Lineage in Data Vault 2.0

Data Vault Modeling: Visual Example in the Accounting Industry

Visual Example of a Data Vault Modeling

Hubs in Visual Data Vault

Build Better Data Platforms

SOLUTIONS

TRAINING

EVENTS

KNOWLEDGE HUB

CAREERS

COMPANY

Make Better Salesforce Decisions

Build Better Data Platforms