Skip to main content
search
0

Integrating Documents from Heterogeneous Sources

Within this part of our ongoing blog series, we would like to introduce a sample data set based upon insurance data. This data set will be used to explain the concepts and patterns expanded upon further in the post. That said, please consider the following situation: an insurance company utilizes two different operational systems, let’s say, a home insurance policy system and a car insurance policy system.

Both systems should be technically integrated, which means if a new customer signs up for a home insurance policy, the customer’s data should be synchronized into the car insurance policy system as well and kept in sync at all times. Thus, when the customer relocates, the new address is updated within both systems.

Though in reality, it often doesn’t go quite as one would expect, as, first of all, both systems are usually not well integrated or simply not integrated at all. Adding to the complexity, in some worst-case scenarios, data is manually copied from one system to the next and updates are not applied to all datasets in a consistent fashion but only to some, leading to inconsistent, contradicting source datasets. The same situation applies often to data sets after mergers and acquisitions are made within an organization.

Continue Reading

Document Processing in MongoDB

In continuing our ongoing series, this piece within the blog series will describe the basics of querying and modifying data in MongoDB with a focus on the basics needed for the Data Vault load as well as query patterns. 

In contrast to the tables used by relational databases, MongoDB uses a JSON-based document data model. Thus, documents are a more natural way to represent data as a single structure with related data embedded as sub-documents and arrays collapses what is otherwise separated into parent-child tables linked by foreign keys in a relational database. You can model data in any way that your application demands – from rich, hierarchical documents through to flat, table-like structures, simple key-value pairs, text, geospatial data, and the nodes as well as edges used in graph processing.

Continue Reading

An Enterprise Document Warehouse Architecture

A common requirement for enterprise data warehousing is to provide an analytical model for information delivery, for example in a dashboard or reporting solution. One challenge in this scenario is that the required target model, often a dimensional star or snowflake schema or just a denormalized flat-and-wide entity, doesn’t match the source data structure. Instead the end-user of the analytical data will directly or indirectly define the target structure according to the information requirements.

Another challenge is the data itself, regardless of its structure.
In many, if not most, cases, the source data doesn’t meet the information requirements of the user regarding its content. In many cases, the data needs cleansing and transformation before it can be presented to the user.

Instead of just loading the data into a MongoDB collection and wrangling it until it fits the needs of the end user, the Data Vault 2.0 architecture proposes an approach that allows data as well as business rules, which are used for data cleansing in addition to transformation, to be re-used by many users. To achieve this, it is made up of a multi-layered architecture that contains the following layers:

Continue Reading

Processing Enterprise Data with Documents in MongoDB

Today’s enterprise organizations receive and process data from a variety of sources, including silos generated by web as well as mobile applications, social media, artificial intelligence solutions in addition to IoT sensors. That said, the efficient processing of this data at high volume in an enterprise setting is still a challenge for many organizations. 

Typical challenges include issues such as the integration of mainframe data with real-time IoT messages and hierarchical documents.
One of such issues being that enterprise data is not clean and might have contradicting characteristics as well as interpretations. This poses a challenge for many processes such as when integrating customers from multiple source systems.

Though, data cleansing could be considered as a solution to this problem. However, what if different data cleansing rules should be applied to the incoming data set? For example, because the basic assumption for “a single version of the truth” doesn’t exist in most enterprises. While one department might have a clear understanding of how the incoming data should be cleansed, another department, or an external party, might have another understanding. 

Continue Reading

Data Lake Efficiency: Structural Solutions

Data Lake architecture

Data Lake Structure – Solution

The organization of data within a data lake can significantly impact downstream accessibility. While offloading data into the data lake is a straightforward process, the real challenge arises in efficiently retrieving this data. The efficiency of data retrieval becomes crucial for tasks such as the incremental or initial Enterprise Data Warehouse (EDW) load and for data science practitioners conducting independent queries. In practice, the ease of accessing data downstream depends on how well the data is organized within the data lake. A well-organized structure facilitates smoother retrieval processes, empowering both EDW loads and the independent querying needs of data scientists.

ACCESS THE SOLUTION
Within a hybrid data warehouse architecture, as promoted in the Data Vault 2.0 Boot Camp training, a data lake is used as a replacement for a relational staging area. Thus, to take full advantage of this architecture, the data lake is best organized in a way that allows efficient access within a persistent staging area pattern and better data virtualization.

Continue Reading

Data Vault 2.0’s Inventor Offers Unprecedented On-Site Access

To all those that have been a part of the Scalefree journey up until this point,

We’d first and foremost like to thank you for all the contributions you have made in helping us build Scalefree into the firm it is today. All of your contributions and business have allowed us to create a success story beyond what was first imagined and for that we offer our gratitude.

That said, a recent development here at Scalefree has presented the company with the opportunity to offer unprecedented, on-site access to the man that helped make all of this possible, the inventor of Data Vault 2.0, Dan Linstedt.

Though before diving into the unique opportunity that presents you, a little about how we got here.

Continue Reading

Data Vault Use Cases Beyond Classical Reporting – Part 1

Data Cleansing with Data Vault 2.0

Enterprise Data Warehouses with Data Vault 2.0

While Enterprise Data Warehouses (EDWs) are traditionally used for reporting and dashboarding, with Data Vault, their true potential lies far beyond these basic applications. Data Vault 2.0 introduces unparalleled flexibility and scalability, enabling organizations to unlock new use cases such as data cleansing, operational process automation, and predictive analytics. This article explores how Data Vault 2.0 empowers businesses to apply centralized data cleansing rules, enhance data quality at its source, and embrace Total Quality Management (TQM). By leveraging the full capabilities of their EDWs, organizations can move beyond simple analytics to create sustainable, agile, and impactful data strategies.

Going beyond standard reporting with Data Vault 2.0

Reporting and dashboarding have become the standards in business when it comes to identifying KPIs and other measurements. As such, Enterprise Data Warehouses have emerged to support the reporting process. Though, due to the large quantity and variety of data, a demand has developed for a method of utilizing this existing data in a manner in which it can add additional business value towards a company’s needs. Data Vault 2.0 offers a wide range of methods to provide decision support beyond standard reporting as well as critical information regarding the future. To see for yourself, join us as we present different approaches and solutions as to fully leverage the potential of your data.

Watch webinar recording

Flexibility and Scalability

To put it simply, an Enterprise Data Warehouse (EDW) collects data from your company’s internal as well as external data sources, to be used for simple reporting and dashboarding purposes. Often, some analytical transformations are applied to that data as to create the reports and dashboards in a way that is both more useful and valuable. That said, there exist additional valuable use cases which are often missed by organizations when building a data warehouse. The truth being, EDWs can access untapped potential beyond simply reporting statistics of the past. To enable these opportunities, Data Vault brings a high grade of flexibility and scalability to make this possible in an agile manner.

Data Vault Use Cases

To begin, the data warehouse is often used to collect data as well as preprocess the information for reporting and dashboarding purposes only. When only utilizing this single aspect of an EDW, users are missing opportunities to take advantage of their data by limiting the EDW to such basic use cases.

A whole variety of use cases can be realized by using the data warehouse, e.g. to optimize and automate operational processes, predict the future, push data back to operational systems as a new input or to trigger events outside the data warehouse, to simply explore but a few new opportunities available.

Continue Reading

Requirements and Templates for Hashing

Solutions

REQUIREMENTS FOR HASHING

Traditional data warehouses often use sequence numbers to identify records in other tables.

By using sequences, this method comes with some drawbacks. One of the biggest drawbacks is performance. Since the sequence numbers are generated by a generator, this step presents a bottleneck. In addition sequence numbers are generated in the data warehouse instead of loading them before.

This solution provides a template and the requirements round about hashing.

ACCESS THE SOLUTION

What to Consider for Naming Conventions in Data Warehousing – Part 1

Naming conventions - Letter cases

Naming Conventions in Data Warehousing

An initial decision of critical importance within Data Vault development relates to the definition of naming conventions for database objects. As part of the development standardization, these conventions are mandatory to maintain a well-structured and consistent Data Vault model. It is important to note that proper naming conventions boost the usability of the data warehouse, not only for solution developers but also for power users within data exploration.

Throughout this article, we will present the most vital considerations within our standard book, the process of defining naming conventions.

Naming Convention Documentation

It is one aspect to simply define naming conventions utilized within the development of your data warehouse, but it is completely another to establish consistency to create defined naming conventions that are to become standards. That said, it is a good practice to document a guideline for naming Data Warehouse objects. To that end, the next sections will discuss several considerations to take into account when defining the naming conventions for a data warehouse solution.

Naming Conventions: Letter case

There are several options when it comes to considering letter cases for names: all uppercase, all lowercase, Camel Case, and Pascal Case. Though the variances may be slight, each option does have its own advantages and disadvantages regarding legibility as well as type-ability, to briefly touch upon the differences.

Ultimately, the decision regarding letter case falls to the decision of which database management system is used as some, such as PostgreSQL, support case-sensitive object name that requires the use of quoted names. Therefore, users often prefer the lower-case by default in PostgreSQL as having lower-cased object names can reduce the amount of code that is to be generated while simultaneously improving the usability for ad-hoc queries by power users. Nevertheless, it is imperative to maintain one consistent letter case for both entity and column names.

Naming conventions - Letter cases
Figure 1: Naming Convention - Letter Case

Naming Conventions: Usage of underscores “_”, hyphens “-”

To improve readability, word separators like underscores “_” or, depending on use cases, hyphens/dashes “-” are desirable. However, it is important to remember that in many systems, hyphens are interpreted as minus signs. That said, hyphens are commonly used in XML or JSON data format though they can be easily replaced by underscores, should the latter be used as separators.

Naming Conventions: Abbreviation, acronyms

Some systems enforce character limits on object names, e.g. Oracle 12.1 and below only allows for a maximum object name length of 30 bytes. Therefore, abbreviations and acronyms may be taken into consideration during the object naming process, even though they can often lead to misinterpretation. To combat this, it is suggested to compile a document containing a list of the abbreviations being used with a detailed description of their meanings. However, to limit any possible confusion avoid excessive use of abbreviations and acronyms.

In logical models, it is advisable, that object names are as self-explanatory as possible, i.e. most words should be fully spelled out, except common abbreviations for longer words such as “dept” for “department” or “org” for “organization”. However, abbreviations and acronyms are typically used in physical models, to keep object names short.

Naming Conventions: Singular vs. plural object names

It is a common practice to utilize nouns or noun phrases, in their singular form, as object names. This is done to avoid the necessity of dealing with irregular pluralization in English, e.g. man/men, person/people, which would unnecessarily add a whole new level of complexity within the data model.

Naming Conventions: Prefix vs. Suffix

Whether objects are named using prefixes or suffixes, is not of much importance within the development. That said, internally at Scalefree, we prefer table names with suffixes such as “customer_h”, and “transaction_l” instead of utilizing prefixes. The advantage of this method is that, given most database tools sort tables alphabetically, all tables, that are related to a business object, will be grouped. For example, all contact hubs, satellites, and links that have names beginning with “contact_…” will, therefore, be found together in the browser. This supports data exploration by power users and developers.

Naming conventions - prefix
Figure 2: Naming Convention - Prefix

Nevertheless, prefixes can be meaningful within some use cases, e.g. using prefix in layer schema names helps keeping them neatly listed together in the database browser.

Naming conventions - Schema
Figure 3: Naming Convention - Schemas

Conclusion

Naming conventions are partially a matter of personal preference and organizational guidelines. Regardless, the more systematically consistent the naming conventions become when defined, the more benefits will ultimately be seen within the development and implementation of your Data Vault solution. To illustrate this point, we encourage Data Vault development teams to write and implement a simple SQL function that inspects the entire database while checking for naming convention inconsistencies. This further ensures that standards are being followed.

Interested in how we standardize naming conventions here at Scalefree? In an upcoming article, we will share several concrete suggestions for naming conventions, most of which both our customers and our team regularly utilize internally.

Now let’s open this up for discussion in the comment box below: How do you implement naming conventions within your Data Vault development? What conventions do you follow?

Data Security Concepts in Data Vault 2.0

Solutions

Data Security Concepts

The focal point of our discussion revolves around critical aspects such as security controls, access controls, and the definition of identities. The primary objective of this solution is to safeguard data assets effectively. The approach taken is typically centered on securing data at both the row/document level and the attribute level.

In terms of security controls, the emphasis is on implementing measures that ensure the confidentiality, integrity, and availability of data. Access controls play a pivotal role in governing who can interact with specific data assets, limiting access to authorized individuals or roles. Defining identities involves establishing clear parameters for users and entities accessing the data, contributing to a robust security framework.

In summary, this Data Vault solution prioritizes a comprehensive approach to data security, addressing concerns at different levels to fortify the protection of valuable data assets.

ACCESS THE SOLUTION

Exemplary Naming Conventions in Data Vault 2.0

Solutions

Naming Conventions

Data Vault modeling is a powerful approach that introduces a multitude of entities to the database. To enhance usability and facilitate effective development, it is highly advisable to implement clear and consistent naming conventions. These conventions play a vital role in grouping entities by concept and conveying crucial information to developers, including the data source, rate of change, privacy levels, security considerations, and more.

Introducing a well-thought-out naming convention not only simplifies the development process but also contributes to a more organized and comprehensible database structure. It acts as a guide for developers, offering insights into the nature and characteristics of each entity.

Conversely, the absence of a naming convention poses challenges in identifying related tables within Data Vault models. This lack of structure can lead to confusion, making it harder for developers to discern the relationships and purpose of different entities.

In conclusion, the implementation of naming conventions is fundamental for the success of Data Vault modeling solutions. It promotes clarity, efficiency, and a systematic approach to database development.

ACCESS THE SOLUTION

Bridge Tables 101: Why They Are Useful

Bridge Tables Screenshot of Connections

What Are Bridge Tables?

Bridge tables are special entities within the Data Vault that leverage query performance on the way out of the Data Vault. These entities are placed between the Data Vault and the Information Delivery Layer and are necessary for instances in which many joins and aggregations on the Raw Data Vault are executed, causing performance issues.

This often happens when designing the virtualized fact tables in the information and data marts. Thus, to produce the required granularity in the fact tables without increasing the query time, bridge tables come into play. Bridge tables belong to the Business Vault and have the purpose of improving performance, similar in manner to the PIT table which was discussed in a prior newsletter.

As a means to achieve its goals, bridge tables materialize the grain shift that is often required within the information delivery process. Though, before we dig deeper into the specifics of using bridge tables for performance tuning, it is important to first define granularities within a data warehouse.

Grain Definitions in Data Warehousing

The grain within a dimensional model is the level of detail available of each table. Thus, the grain of a fact table is defined by the number of related dimensions. Basically, there are three different types of granularities for fact entities within a dimensional model. Continue Reading

Close Menu