
Microservices Architecture and Data Vault: Managing Satellites at Scale

Microservices architectures create a specific modeling challenge for Data Vault practitioners. When services are ephemeral — spinning up and down as Docker or Kubernetes containers — each with its own message structure, the standard advice to split Satellites by source system quickly leads to hundreds or thousands of Satellites. At that scale, the real question isn’t about metadata management overhead. It’s about how to consume all that data without joining 500 tables every time you need an answer. This post walks through a practical approach to handling high-volume, highly varied source structures in a Data Vault model.



Microservices Architecture: Why Satellite Splits Become a Problem

The conventional Satellite splitting rules — by rate of change, source system, security, and privacy — exist for good reasons. But in a microservices context, applying them strictly leads to an explosion of Satellites. A new Docker image with a new message structure technically deserves its own Satellite. Automate that process and you accumulate hundreds or thousands of Satellites quickly, most of which may never be queried by anyone.

The issue isn’t that databases can’t handle 500 tables — they can. The issue is the consumption side: joining 500 Satellites to produce a target model is expensive, complex to maintain, and in many cases unnecessary. The real challenge is finding a modeling approach that captures the variety of incoming structures without creating an unmanageable query layer downstream.

Rate of Change Splits: Still Relevant, but Less So for Now

The rate of change split was designed to reduce storage consumption by separating high-frequency attributes from stable ones. Every delta insert copies all columns in the Satellite, so a single change on one attribute in a wide Satellite wastes a lot of storage on unchanged data.

On most modern analytical platforms, compression makes this split largely unnecessary: insert-only tables full of redundant data compress extremely well, and virtually every modern analytical database supports this out of the box. With compression enabled, the storage cost of skipping the rate of change split is manageable.

That said, this is worth watching. In pay-per-query environments like Athena querying row-based Avro files, or systems that charge based on uncompressed data scanned, the rate of change split becomes economically relevant again. BigQuery’s columnar storage sidesteps this because you only pay for the columns you query — but other managed infrastructure doesn’t work that way. The rate of change split isn’t obsolete; it’s just less pressing for now, and likely to become more relevant as managed, consumption-based pricing models become more common.

The Data Vault Handbook:
Core Concepts and Modern Applications

Build Your Path to a Scalable and Resilient Data Platform

The Data Vault Handbook is an accessible introduction to Data Vault. Designed for data practitioners, this guide provides a clear and cohesive overview of Data Vault principles.

Read it for Free

The Flip-Flop Effect: Why Source System Splits Still Matter

The source system split is a different matter. Loading data from two different source systems into the same Satellite creates a well-known problem: the flip-flop effect.

Consider a customer whose address is known to both an ERP system (California) and a CRM system (Hannover, Germany). The two systems have different knowledge and potentially different structures for representing the same data. If both load into the same Satellite, the Satellite ends up recording two deltas per day — not because the customer moved, but because two systems loaded sequentially with different values. The data flips between California and Hannover with every load cycle, consuming storage and making it impossible to determine the actual address without applying business logic. Worse, the order of loading determines what the Satellite shows at any given moment — a purely technical artifact with no business meaning.
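The effect is easy to reproduce. The sketch below is an illustrative simulation, not production loading code: it models a Satellite as a list of rows and uses a hashdiff-style delta check, the standard Data Vault technique for inserting a row only when the descriptive attributes changed. The table and key names are invented for the example.

```python
import hashlib

def hashdiff(payload: dict) -> str:
    """Hash of the descriptive attributes, used for delta detection."""
    return hashlib.md5(repr(sorted(payload.items())).encode()).hexdigest()

def load_satellite(satellite: list, hub_key: str, payload: dict, load_ts: str) -> None:
    """Insert a new row only if the payload differs from the latest row for this key."""
    latest = next((r for r in reversed(satellite) if r["hub_key"] == hub_key), None)
    if latest is None or latest["hashdiff"] != hashdiff(payload):
        satellite.append({"hub_key": hub_key, "load_ts": load_ts,
                          "hashdiff": hashdiff(payload), **payload})

sat = []
# Two systems load sequentially, each day, each with its own view of the address.
for day in ("day1", "day2"):
    load_satellite(sat, "CUST-42", {"city": "California"}, f"{day}-erp")  # ERP view
    load_satellite(sat, "CUST-42", {"city": "Hannover"}, f"{day}-crm")    # CRM view

# Four rows instead of two: the address "flips" on every load cycle.
print([r["city"] for r in sat])  # → ['California', 'Hannover', 'California', 'Hannover']
```

Neither system changed its data, yet the Satellite records a new delta on every load because each system's value differs from the other's most recent row.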

The fix is straightforward: one Satellite per source. This keeps each system’s view of the data independent and equally available, so business logic in the Business Vault can reconcile them deliberately rather than having the Raw Data Vault collapse them accidentally.

The Gray Area: Millions of Sources, One Practical Solution

The flip-flop rule works cleanly when you have a manageable number of distinct source systems. It breaks down at the extreme end — IoT deployments with millions of sensors, or microservices architectures with hundreds of ephemeral containers — where creating one Satellite per source is operationally impractical.

The solution here depends on two conditions being met. First, you need a key in the parent entity that partitions the data by source — a sensor ID, a Docker image ID, a tenant ID, something that creates independent delta streams within the same Satellite. With this in place, deltas from source A can’t replace or invalidate deltas from source B, which eliminates the flip-flop effect without requiring separate Satellites. Second, the structure of the incoming data must be consistent enough to fit in a shared target — which in practice usually means JSON.

When messages from different microservices or sensors all arrive as JSON — even with different internal structures — you can load them all into a single Satellite or Non-Historized Link with a JSON or JSONB payload column. The structure differences are captured inside the JSON document. You add the partitioning key to the parent, and you’re done. Instead of 500 Satellites with 500 different schemas, you have one entity with a JSON payload and a key that tells you which source produced each record.
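A minimal sketch of this pattern, again modeling the Satellite as a list of rows: the delta check is scoped to the (hub key, source key) pair, so one source's deltas can never invalidate another's, and the varied structures live inside a serialized JSON payload column. The column and key names here are assumptions for illustration.

```python
import hashlib
import json

def hashdiff(payload: dict) -> str:
    """Canonical hash of the JSON payload for delta detection."""
    return hashlib.md5(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def load(satellite: list, hub_key: str, source_key: str, payload: dict) -> bool:
    """Delta-check within the (hub_key, source_key) partition — sources stay
    independent, so the flip-flop effect cannot occur in the shared table."""
    latest = next((r for r in reversed(satellite)
                   if r["hub_key"] == hub_key and r["source_key"] == source_key), None)
    if latest is None or latest["hashdiff"] != hashdiff(payload):
        satellite.append({"hub_key": hub_key, "source_key": source_key,
                          "hashdiff": hashdiff(payload),
                          "payload": json.dumps(payload)})
        return True
    return False

sat = []
# Different message structures from different services land in ONE entity.
load(sat, "CUST-42", "svc-billing", {"invoice_total": 99.5, "currency": "EUR"})
load(sat, "CUST-42", "svc-shipping", {"address": {"city": "Hannover"}})
load(sat, "CUST-42", "svc-billing", {"invoice_total": 99.5, "currency": "EUR"})  # unchanged → no row
print(len(sat))  # → 2
```

The unchanged billing message produces no new row, while the structurally different shipping message coexists in the same table without conflict.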

For real-time message streams from microservices, a Non-Historized Link with a JSON payload is often the right structure. Real-time messages are events — they don’t update, they accumulate. The flip-flop concern largely disappears because you’re capturing messages as they arrive, not loading full snapshots that might overwrite each other. A Non-Historized Link captures the event, the relevant Hub references, and the message payload in a structure that’s fast to load and straightforward to query.
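The event-capture structure is even simpler, because a Non-Historized Link needs no delta check at all: every arriving message is a new row. A sketch, with hypothetical hash-key and column names:

```python
import itertools
import json

event_counter = itertools.count(1)

def capture_event(link: list, hub_refs: dict, message: dict, load_ts: str) -> None:
    """Non-Historized Link: insert-only — every message is kept as a new event,
    with no comparison against previous rows."""
    link.append({"event_id": next(event_counter), "load_ts": load_ts,
                 **hub_refs, "payload": json.dumps(message)})

nhl = []
capture_event(nhl, {"customer_hk": "CUST-42", "order_hk": "ORD-7"},
              {"type": "order_placed", "total": 19.99}, "2024-01-01T10:00Z")
capture_event(nhl, {"customer_hk": "CUST-42", "order_hk": "ORD-7"},
              {"type": "order_placed", "total": 19.99}, "2024-01-01T10:05Z")  # same payload, still kept
print(len(nhl))  # → 2
```

An identical payload arriving twice is two events, not a duplicate — which is exactly the semantics a message stream calls for.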

This same pattern was applied at Scalefree for an investment banking client with 500 different source systems delivering asset data in different CSV formats. Rather than creating 500 entities, a single Non-Historized Link and Satellite captured everything — different CSV structures serialized as JSON strings, distinguished by a load source identifier. Two entities replaced 500, and the consumption layer handled the structural variety through filtering and extraction rather than joins.


Consuming Semi-Structured Data Without Joining 500 Tables

Loading everything into a JSON payload doesn’t eliminate the structural variety — it defers it to query time. When you need data from a specific message type, you need to identify records with the right structure among all the records in the same target entity.

The approach here is filtering rather than joining. Instead of joining 500 Satellites, you query one entity and filter for records that contain specific JSON keys or values that uniquely identify the message type you care about. Email messages, for example, always have a subject, body, sender, and recipient — keys that distinguish them from other message types. A specific transaction type might always carry an ID starting with a known prefix. These structural signatures let you extract subsets of the JSON stream efficiently.
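In SQL this would be a JSON key-existence predicate; the Python sketch below shows the same idea against in-memory rows, using the email signature from the example above. The sample payloads are invented.

```python
import json

satellite = [
    {"payload": json.dumps({"subject": "Hi", "body": "text", "sender": "a@x", "recipient": "b@x"})},
    {"payload": json.dumps({"sensor_id": "S-1", "temperature": 21.5})},
    {"payload": json.dumps({"txn_id": "TX-1001", "amount": 50})},
]

# Structural signature: a record is an email iff it carries all four email keys.
EMAIL_KEYS = {"subject", "body", "sender", "recipient"}

def is_email(row: dict) -> bool:
    return EMAIL_KEYS <= json.loads(row["payload"]).keys()

emails = [json.loads(r["payload"]) for r in satellite if is_email(r)]
print(len(emails))  # → 1
```

The filter touches one table and never joins; records that lack the signature keys are simply skipped.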

Once filtered, you extract the attributes you need from each subset and UNION the results if you need to combine multiple message types. A UNION of 500 filtered queries on one table is significantly faster than a JOIN of 500 separate tables, and it scales much better as the number of source types grows.
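Sketched as a reusable filter-and-project step, with each message type contributing its own filtered extraction to a combined result (the signature sets and column mappings are illustrative assumptions):

```python
import json

satellite = [
    {"payload": json.dumps({"subject": "Report", "body": "text", "sender": "a@x", "recipient": "b@x"})},
    {"payload": json.dumps({"txn_id": "TX-1001", "amount": 50.0})},
    {"payload": json.dumps({"txn_id": "TX-1002", "amount": 75.0})},
]

def extract(rows, signature_keys, mapping):
    """Filter rows by a structural signature, then project JSON keys onto target columns."""
    for row in rows:
        doc = json.loads(row["payload"])
        if signature_keys <= doc.keys():
            yield {target: doc[source] for target, source in mapping.items()}

# Two filtered extractions over ONE table, then a UNION — not a multi-table join.
result = (list(extract(satellite, {"txn_id", "amount"}, {"id": "txn_id", "detail": "amount"}))
          + list(extract(satellite, {"subject", "sender"}, {"id": "sender", "detail": "subject"})))
print(len(result))  # → 3
```

Each additional message type adds one more filtered branch to the union rather than one more table to a join, which is why this scales as source variety grows.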

Choosing the Right Approach for Your Context

The right answer depends on where you sit on the spectrum between a small number of structurally distinct source systems and a very large number of structurally similar ones. For a handful of systems with genuinely different schemas and different business semantics — CRM, ERP, financial systems — separate Satellites per source is the right call. The flip-flop effect and structural differences make consolidation risky and introduce business logic where it doesn’t belong.

For microservices, IoT devices, or any scenario where you have many sources with similar structures and a partitioning key available, consolidating into a small number of JSON-payload entities is usually the better trade-off. It simplifies loading, reduces metadata overhead, and keeps the consumption layer manageable — at the cost of pushing structural interpretation into filtering and extraction logic downstream.

To go deeper on Satellite design, source system splits, and Data Vault modeling patterns for modern architectures, explore our Data Vault certification program. The free Data Vault handbook is also available as a physical copy or ebook for a solid grounding in the core methodology.
