Refactoring the Raw Data Vault
Refactoring is a natural part of any evolving data architecture. Whether you’re adding a new entity, integrating additional source systems, or moving toward a multi-tenant Data Warehouse, change is inevitable. In this article, we’ll explore what it means to refactor your Raw Data Vault (RDV), why it’s essential, and how to do it safely using strategies like Hub-it-out.
The discussion is based on a real-world scenario where a company’s recruitment data model needed to evolve. Initially, the model tracked requisitions but didn’t fully capture recruiter details. With new requirements, the team needed to refactor their RDV to introduce a new Hub Recruiter entity — without breaking historical data or existing queries.
In this article:
- What Refactoring Really Means
- Deploy First, Refactor Later
- The Recruiter Example: When Business Requirements Change
- Understanding the “Hub-it-Out” Strategy
- Why a Valid Model Is Everything
- Multi-Tenant Data Vault Considerations
- Avoid Overloading Links
- Transitioning Without Breaking Queries
- Using Virtual Views for Legacy Support
- When and How to Clean Up
- Key Takeaways
- Final Thoughts
- Watch the Video
- Meet the Speaker
What Refactoring Really Means
Before diving into the technical details, let’s clarify what refactoring is — and what it isn’t.
Refactoring means changing the internal structure of your data model without altering its external behavior. Think of it as improving or extending the foundation — adding new components, reorganizing relationships, or optimizing loading logic — while keeping the overall functionality intact.
Redesign, on the other hand, means making user-visible changes. This might involve restructuring dashboards, altering reports, or redefining business logic in ways that affect your end users.
Since the Raw Data Vault is an internal structure (not typically user-facing), most adjustments here fall under refactoring. These changes are iterative and low-risk if your base model is valid.
Deploy First, Refactor Later
Many data teams fall into the trap of seeking the “perfect” model before deployment. But as experts at Scalefree emphasize — perfection is the enemy of progress. Instead of spending months debating the right business key or model structure, deploy a valid model quickly, gather feedback, and improve it through incremental refactoring.
“Don’t aim for the best model on day one. Aim for a valid model that you can continuously improve.”
This agile mindset ensures your team delivers value early and often. Over time, with each sprint, you evolve closer to the “best” model for your organization — even though, in truth, a perfect model doesn’t exist.
The Recruiter Example: When Business Requirements Change
Let’s revisit our use case: your company is in the recruiting business. Each job requisition is led by a single recruiter. Your original model included a RecruiterID in the requisition table but didn’t have a dedicated Hub Recruiter. Now, the business wants to integrate additional recruiter data from new source systems.
So how should you adapt your Data Vault model?
- Do you close the old Link and create a new one that includes the Recruiter Hub?
- Do you split existing Links into smaller, recruiter-focused relationships?
- How do you handle historical backfills and multi-tenant variations?
The answer: you refactor using the Hub-it-out strategy.
Understanding the “Hub-it-Out” Strategy
“Hub-it-out” is a practical approach to introducing new Hubs and Links into your existing Raw Data Vault without reloading everything from the source systems.
Here’s how it works:
- Identify the business key in your existing Satellite. For example, you already have RecruiterID stored as an attribute, even if it wasn’t modeled as a Hub initially.
- Create a new Hub (Hub Recruiter) by selecting all distinct recruiter IDs from your existing Satellite data.
- Generate hash keys for each business key value and assign load dates and record sources from the Satellite’s first occurrence.
- Create a Link between the existing parent Hub (e.g., Requisition) and the new Hub (Recruiter) based on relationships already present in the Satellite.
With this approach, you can build new structures directly from your existing Data Vault — no need to reload historical data from source systems, which may not even be available anymore. If you have a data lake or Persistent Staging Area (PSA), you can also load from there as an alternative.
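The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production loader: the table names, columns, and sample rows (sat_rows, RecruiterID values like R-100) are hypothetical, and MD5 stands in for whatever hash function your Data Vault standard prescribes.

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    # Data Vault-style hash key: hash over delimited, normalized key parts.
    payload = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Rows as they might already exist in the Requisition Satellite.
sat_rows = [
    {"requisition_id": "REQ-1", "recruiter_id": "R-100",
     "load_date": "2023-01-05", "record_source": "HR_SYS"},
    {"requisition_id": "REQ-1", "recruiter_id": "R-100",
     "load_date": "2023-02-01", "record_source": "HR_SYS"},
    {"requisition_id": "REQ-2", "recruiter_id": "R-200",
     "load_date": "2023-01-10", "record_source": "HR_SYS"},
]

# Hub Recruiter: one row per distinct business key, stamped with the
# load date and record source of its first occurrence in the Satellite.
hub_recruiter = {}
for row in sorted(sat_rows, key=lambda r: r["load_date"]):
    hub_recruiter.setdefault(row["recruiter_id"], {
        "recruiter_hk": hash_key(row["recruiter_id"]),
        "recruiter_id": row["recruiter_id"],
        "load_date": row["load_date"],
        "record_source": row["record_source"],
    })

# Link Requisition-Recruiter, built from relationships the
# Satellite already contains -- no source reload needed.
link_req_recruiter = {}
for row in sorted(sat_rows, key=lambda r: r["load_date"]):
    hk = hash_key(row["requisition_id"], row["recruiter_id"])
    link_req_recruiter.setdefault(hk, {
        "link_hk": hk,
        "requisition_hk": hash_key(row["requisition_id"]),
        "recruiter_hk": hash_key(row["recruiter_id"]),
        "load_date": row["load_date"],
        "record_source": row["record_source"],
    })

print(len(hub_recruiter), len(link_req_recruiter))  # 2 recruiters, 2 link rows
```

In a real warehouse these steps would be a handful of INSERT … SELECT DISTINCT statements against the Satellite; the point is that everything needed — business keys, first-occurrence load dates, and record sources — is already sitting in your Raw Data Vault.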
Why a Valid Model Is Everything
One key prerequisite for refactoring success is having a valid Raw Data Vault model. A “valid” model means that:
- Data is captured consistently and completely.
- No business keys or relationships have been lost during earlier transformations.
- Hubs, Links, and Satellites follow proper Data Vault design rules.
If your model is valid, you can refactor it safely — adding Hubs, Links, or Satellites — without touching your original data sources. This makes evolution faster, cheaper, and much less disruptive.
Multi-Tenant Data Vault Considerations
When dealing with multiple clients or tenants, you should introduce a TenantID attribute in every Hub, Link, and Satellite. This ensures that data from one tenant never overwrites another’s records, since the hash keys will differ.
Typically, the first few tenants may require adjustments as you generalize your model. But after integrating the second or third tenant, the structure stabilizes. Each new tenant may add Satellites or minor extensions — but the overall architecture remains consistent.
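One common way to make tenant isolation automatic is to let the TenantID participate in hash key calculation. A minimal sketch, assuming a hypothetical hash_key helper and MD5 as the hash function:

```python
import hashlib

def hash_key(tenant_id: str, *business_keys: str) -> str:
    # The tenant id is part of the hashed payload, so identical business
    # keys from different tenants yield different hash keys and can
    # never overwrite each other's Hub, Link, or Satellite rows.
    parts = (tenant_id, *business_keys)
    payload = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

hk_a = hash_key("TENANT_A", "R-100")
hk_b = hash_key("TENANT_B", "R-100")
assert hk_a != hk_b  # same RecruiterID, different tenants: no collision
```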
Avoid Overloading Links
Each source system should define its own Unit of Work — a consistent relationship between business keys. If one system defines a Link between three business keys (e.g., Requisition, Candidate, Recruiter) and another defines it between two, treat them as separate Links. Avoid “overloading” Links by mixing different granularities or structures. That’s a common source of data inconsistency and confusion.
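To see why mixed grains don’t belong in one table, consider a toy sketch (the key values and link_hash helper are illustrative):

```python
import hashlib

def link_hash(*hub_keys: str) -> str:
    # Link hash key over the participating business keys.
    return hashlib.md5("||".join(hub_keys).encode("utf-8")).hexdigest()

# Source A's unit of work: three business keys per relationship.
link_a = {"link_hk": link_hash("REQ-1", "CAND-1", "REC-1"),
          "grain": ("requisition", "candidate", "recruiter")}

# Source B's unit of work: only two keys -> model it as a separate Link.
link_b = {"link_hk": link_hash("REQ-1", "REC-1"),
          "grain": ("requisition", "recruiter")}

# Forcing both into one table would leave candidate_hk empty for
# source B and make the two hash keys describe different things.
assert link_a["link_hk"] != link_b["link_hk"]
```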
Transitioning Without Breaking Queries
Refactoring can be disruptive for power users who query the Raw Data Vault directly. Their queries may break when Hubs or Links are renamed, split, or replaced. To manage this transition smoothly:
- Load both old and new entities temporarily. Keep the old Link active for a while, even if it’s limited to a single tenant.
- Mark deprecated objects clearly. Add a “deprecated” flag or comment in your metadata catalog.
- Communicate proactively. Notify users via email or release notes, giving them 90–180 days to adjust queries.
Despite communication efforts, some users will inevitably miss the deadline — but maintaining transparency and clear documentation helps minimize friction.
Using Virtual Views for Legacy Support
Instead of maintaining redundant tables, you can recreate deprecated entities as virtual views on top of your new Data Vault structures. This approach saves storage while still supporting legacy queries.
Yes, there’s a small performance trade-off, but since these views are temporary and clearly marked as deprecated, it’s an effective bridge strategy. Inform users that performance may decrease slightly and that these views will be removed after a defined period (e.g., one year).
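As a sketch of the bridge pattern, suppose a wide three-way Link was split into two narrower Links during refactoring. The deprecated entity can live on as a view. All table, column, and view names below are hypothetical, and SQLite stands in for your actual warehouse engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# New, refactored structures (illustrative names and columns).
conn.execute("CREATE TABLE link_requisition_recruiter "
             "(requisition_hk TEXT, recruiter_hk TEXT, load_date TEXT)")
conn.execute("CREATE TABLE link_requisition_candidate "
             "(requisition_hk TEXT, candidate_hk TEXT, load_date TEXT)")
conn.execute("INSERT INTO link_requisition_recruiter VALUES "
             "('req1', 'rec1', '2023-01-05')")
conn.execute("INSERT INTO link_requisition_candidate VALUES "
             "('req1', 'cand1', '2023-01-05')")

# The deprecated wide Link lives on as a view over the new Links,
# so legacy queries keep working during the transition window.
conn.execute("""
    CREATE VIEW link_req_cand_rec_deprecated AS
    SELECT r.requisition_hk, c.candidate_hk, r.recruiter_hk,
           MAX(r.load_date, c.load_date) AS load_date
    FROM link_requisition_recruiter r
    JOIN link_requisition_candidate c USING (requisition_hk)
""")

rows = conn.execute(
    "SELECT * FROM link_req_cand_rec_deprecated").fetchall()
print(rows)  # one row reassembling the old wide Link
```

The view costs a join at query time instead of extra storage, which is exactly the trade-off described above; naming it with a "_deprecated" suffix makes the sunset date hard to miss.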
When and How to Clean Up
Every new entity adds maintenance overhead — documentation, metadata management, refactoring complexity. Once your transition period ends, remove deprecated tables and views to keep your model clean and manageable.
Set a clear timeline with your business stakeholders, archive necessary backups, and drop the obsolete entities once the window closes. This keeps your RDV lean and reduces technical debt.
Key Takeaways
- Deploy early — perfection isn’t required for value.
- Refactor continuously through small, validated improvements.
- Use the Hub-it-out strategy to extend your model safely.
- Always maintain a valid Data Vault foundation to enable future flexibility.
- Manage user expectations through clear communication and deprecation policies.
Final Thoughts
Refactoring your Raw Data Vault isn’t about redoing work — it’s about evolving intelligently. With the right strategies and mindset, you can adapt to new business requirements, integrate additional tenants, and maintain consistency without painful re-engineering.
Whether you’re adding a Recruiter Hub, optimizing Link structures, or modernizing your multi-tenant DWH, remember: the goal isn’t perfection. It’s progress — sprint by sprint, improvement by improvement.
Watch the Video
Meet the Speaker

Michael Olschimke
Michael has more than 15 years of experience in Information Technology. During the last eight years he has specialized in Business Intelligence topics such as OLAP, Dimensional Modelling, and Data Mining. Challenge him with your questions!