Skip to main content
search
0
All Posts By

Ole Bause

Ole Bause is a BI Consultant at Scalefree with a Master’s in Data Science and extensive experience in cloud native DWH (AWS, GCP, Azure). A Certified Data Vault 2.0 Practitioner (CDVP2) and AWS Solutions Architect, he specializes in building scalable pipelines with dbt and Snowflake.

Datensouveränität: Die souveräne Datenplattform als Weg zu unabhängigen Daten und sicherer KI

Datensouveränität wird oft als rein politisches Buzzword oder als bloße Compliance-Aufgabe, z. B. im Rahmen der DSGVO und des AI Acts, abgetan. Doch in der Realität ist sie eine harte wirtschaftliche Notwendigkeit. In einer Ära, in der Daten nicht mehr nur in Dashboards visualisiert werden, sondern als Grundlage für automatisierte Geschäftsprozesse und Künstliche Intelligenz dienen, wird die eigene Infrastruktur zum strategischen Flaschenhals.

Wer in dieser Phase die Kontrolle vollständig an externe Technologieunternehmen abtritt, verliert nicht nur Unabhängigkeit, sondern auch Innovationskraft. Sind Daten in geschlossenen Systemen gefangen, bestimmt letztlich der Anbieter, was angebunden werden darf oder welche KI-Modelle genutzt werden können. Der Weg zu echter Datensouveränität beginnt mit der Erkenntnis, dass die bequemen „All-in-One“-Versprechen vieler Cloud-Anbieter einen hohen, oft versteckten Preis haben.

Wie der Kontrollverlust in der Praxis aussieht

Um zu verstehen, wie sich die Datenhoheit zurückgewinnen lässt, muss man zunächst betrachten, wie Unternehmen sie überhaupt verlieren. Dieser Kontrollverlust geschieht selten über Nacht. Vielmehr ist es ein schleichender Prozess, der tief in der Architektur traditioneller und moderner Cloud-Datenplattformen verwurzelt ist.

Fällt die Entscheidung auf eine proprietäre Datenplattform, werden die Rohdaten vollständig an das System des Anbieters übergeben.

Proprietäre Formate

Um die versprochene Performance zu liefern, wandeln geschlossene Plattformen die eingespeisten Daten in herstellereigene, proprietäre Speicherformate um. Ab diesem Moment können diese Daten nur noch von der Compute-Engine (der Rechenleistung) genau dieses einen Anbieters gelesen und verarbeitet werden.

Fehlende Interoperabilität

Soll nun eine neue, innovative Lösung, wie bspw. eine spezialisierte Analyse-Engine oder Reporting Software eines Drittanbieters angebunden oder eine bestimmte (open-source) KI genutzt werden, stehen Unternehmen oft vor einer Wand. Externe Tools können die proprietären Formate nicht nativ lesen oder es wird gar nicht erst eine benötigte Schnittstelle bereitgestellt.

Kostenfalle (“Egress Fees”)

Um die Daten für andere Anwendungen nutzbar zu machen, oder im schlimmsten Fall den Anbieter komplett zu wechseln, müssen sie aufwändig exportiert werden. Hier schlagen die sogenannten „Egress Fees“ (Kosten für den Datenabfluss) massiv zu Buche. Große Cloud-Provider machen den Ingest (das Einspeisen der Daten) oft sehr günstig, bestrafen den Export aber mit hohen Gebühren.

Verlust der Preissetzungsmacht

Sind historische Unternehmensdaten erst einmal in einem geschlossenen System verankert und die Wechselkosten künstlich in die Höhe getrieben, sind Unternehmen künftigen Preissteigerungen und Lizenzänderungen des Anbieters ausgeliefert.

Kurzum: Das Unternehmen trägt zwar weiterhin die volle rechtliche und geschäftliche Verantwortung für seine Daten, hat aber den direkten, physischen Zugriff darauf verloren. Es mietet lediglich den Zugang zum eigenen Wissen.

Stellen Sie sich an diesem Punkt einmal ganz ehrlich die Frage:

Wissen Sie genau, in welchem Format und auf welcher Infrastruktur Ihre Kern-Daten in diesem Moment liegen?
Und noch viel wichtiger: Wie kommen Sie an Ihre Daten, wenn der Zugang über das Portal Ihres Anbieters morgen früh plötzlich nicht mehr funktioniert oder die Preise über Nacht unerwartet diktiert werden?

Das Data Lakehouse und offene Standards als Ausweg

Der technologische Ausweg aus dieser Abhängigkeit führt über eine grundlegende architektonische Neuausrichtung. Die Antwort auf proprietäre Datensilos lautet heute: Data Lakehouse. Dieser Architekturansatz vereint die Flexibilität eines Data Lakes mit der Struktur und Zuverlässigkeit eines klassischen Data Warehouses, jedoch unter einer entscheidenden Prämisse: der konsequenten Trennung von Speicher (Storage) und Rechenleistung (Compute).

Diese Trennung ermöglicht es Unternehmen, ihre Architektur nach dem “Best-of-Breed-Prinzip” aufzubauen:

Eigene Infrastruktur

Anstatt Daten in die Systeme externer Dienstleister zu laden und dort zu “verriegeln”, verbleiben sie im unternehmenseigenen Cloud-Speicher (beispielsweise Amazon S3, Azure Data Lake oder Google Cloud Storage). Das Unternehmen besitzt faktisch und rechtlich den einzigen Schlüssel zu den eigenen Daten.

Offene Datenformate als Fundament

Ein wichtiger Hebel der Datensouveränität ist das Speicherformat. In einem modernen Data Lakehouse werden Daten ausschließlich in quelloffenen Standards wie Apache Iceberg, Hudi oder Delta Lake abgelegt. Diese Formate gehören keinem einzelnen Software-Hersteller und unterliegen keiner proprietären Lizenzierung.

Interoperabilität (“Bring Your Own Engine”)

Da die Unternehmensdaten nun strukturiert und in einem offenen Format im eigenen Speicher liegen, lassen sie sich von unterschiedlichsten Verarbeitungs-Engines (wie Databricks, Trino, Spark etc.) lesen. Der entscheidende Vorteil: Die Daten müssen dafür weder kopiert noch verschoben werden.

Das Resultat dieser Architektur ist echte digitale Souveränität. Wenn ein Software-Anbieter die Preise drastisch erhöht oder technologisch zurückfällt, lässt sich die Compute-Engine austauschen oder parallel durch andere Tools ergänzen. Die wertvolle Datenbasis bleibt davon völlig unberührt.

Keine sichere KI ohne souveräne Datenplattform

Diese architektonische Unabhängigkeit ist nicht nur eine Frage der Kostenkontrolle, sondern eine wichtige Grundvoraussetzung für den produktiven und sicheren Einsatz von Künstlicher Intelligenz. Aktuell herrscht in nahezu jedem Industriesektor der Druck, KI-gestützte Automatisierungen einzuführen. Gleichzeitig wächst die berechtigte Sorge, sensible Geschäftsgeheimnisse an US-amerikanische „Black-Box“-Sprachmodelle abfließen zu lassen oder durch fehlerhafte KI-Antworten (Halluzinationen) geschäftskritische Fehlentscheidungen zu treffen.

Eine unaufgeräumte Datenbasis und geschlossene SaaS-Systeme bremsen KI-Initiativen hier systematisch aus. Ein souveräner KI-Ansatz erfordert andere Vorgehensweisen.

Abfrage statt Einbettung

Viele frühe KI-Versuche scheitern daran, dass Unternehmensdaten direkt in Sprachmodelle eingebettet werden. Dies birgt nicht nur massive Datenschutzrisiken, sondern führt unweigerlich zu gefährlichen Halluzinationen. Ein Large Language Model (LLM) ist primär ein Sprachwerkzeug, keine relationale Datenbank.

Agentic AI auf Open-Source-Basis

Die Lösung liegt im Einsatz sogenannter „Agentic AI“ in Kombination mit quelloffenen Sprachmodellen (Open-Source-LLMs), die lokal und sicher in der eigenen (Cloud-)Umgebung betrieben werden. Die Daten verlassen die unternehmenseigene Infrastruktur zu keinem Zeitpunkt. Noch wichtiger: Die KI wird so konfiguriert, dass sie die Daten nicht auswendig lernt, sondern als intelligenter Agent agiert. Sie nutzt ihr semantisches Kontextverständnis, um bei Bedarf gezielt direkte Abfragen (beispielsweise über SQL) an die offenen Datenformate des Lakehouses zu stellen.

„Talk-to-your-data“ in der Praxis

Durch die direkte Anbindung an die zentrale Datenplattform liefert das System harte, verifizierbare Fakten statt stochastisch berechneter Wahrscheinlichkeiten. Dieser Ansatz ermöglicht völlig neue Geschäftsprozesse: Fachbereiche ohne tiefe Programmier- oder SQL-Kenntnisse können künftig im direkten Dialog mit ihren Daten interagieren. Komplexe Analysen und Reportings lassen sich per natürlicher Sprache automatisieren und verlässlich abfragen.

Damit dieser reibungslose Dialog zwischen Business-User, KI-Agent und Datenplattform jedoch nicht im Chaos endet, muss die KI exakt verstehen, wie die Daten strukturiert sind und welche semantische Bedeutung sie haben. Technologie allein reicht hierfür nicht aus, womit wir beim oft unterschätzten Kernstück der Datensouveränität angelangt sind.

Data Governance: Vom Regelwerk zum strategischen Enabler

Auch bei Datenplattformen bewahrheitet sich immer wieder eine Erkenntnis, die auch in vielen anderen Bereichen eine gewisse Allgemeingültigkeit erreicht hat: Technologie allein ist kein Garant für Erfolg. Ein modernes Data Lakehouse und fortschrittliche Agentic AI laufen ins Leere, wenn die zugrunde liegende Datenqualität mangelhaft ist oder die semantische Bedeutung der Daten unklar bleibt. An diesem Punkt wandelt sich Data Governance von einem oft ungeliebten Kontrollinstrument zu einem echten strategischen Enabler.

Wenn ein KI-Agent eine Benutzereingabe in eine präzise Datenbankabfrage übersetzen soll, benötigt er mehr als nur Zugriff auf Tabellen. Er benötigt Kontext. Ohne ein gepflegtes Business Glossary, klare Metadaten und definierte Verantwortlichkeiten (Data Ownership) ist das Risiko hoch, dass die KI zwar syntaktisch korrekte, aber fachlich falsche Ergebnisse liefert. „Garbage in, garbage out“ gilt im Zeitalter der Künstlichen Intelligenz mehr denn je.

Eine saubere Governance-Struktur löst dieses Problem an der Wurzel:

Zentrale Wahrheit, dezentrale Nutzung

Durch klare Qualitätsregeln und definierte Datenprodukte entsteht ein Fundament des Vertrauens. Fachbereiche können sich darauf verlassen, dass die bereitgestellten Informationen korrekt, aktuell und rechtssicher sind.

Echte Demokratisierung

Erst dieses Vertrauen ermöglicht Self-Service-Analytics. Wenn die Leitplanken der Governance feststehen, können Daten im gesamten Unternehmen demokratisiert und sicher zur Verfügung gestellt werden, ohne dass die IT-Abteilung jeden einzelnen Report manuell freigeben muss. Auch KI-Ergebnisse können so ohne Kopfschmerzen bezüglich Halluzinationen oder rechtliche Bedenken angenommen und weiterverwendet werden.

Compliance als Standard

Mit Blick auf strenge europäische Regulierungen wie die DSGVO oder den AI Act stellt eine integrierte Governance sicher, dass Zugriffsrechte, Anonymisierung und Nachvollziehbarkeit (Data Lineage) von Beginn an in der Architektur verankert sind.

Wer die Verantwortung für seine Daten auf diese Weise intern übernimmt, schafft die zwingende Voraussetzung für Skalierbarkeit.

Wie gelingt die Migration?

Die Vorteile offener Standards und einer souveränen Architektur sind einleuchtend. Dennoch scheuen viele IT-Verantwortliche den Schritt aus dem Vendor-Lock-in, weil sie ein riskantes, jahrelanges IT-Großprojekt befürchten. Doch die Befreiung aus geschlossenen Systemen erfordert keinen riskanten „Big Bang“.

Erfolgreiche Migrationsprojekte in der Praxis beweisen, dass der Wechsel zu einer offenen souveräneren-Architektur agil und inkrementell erfolgen kann:

Use-Case-getriebene Migration

Anstatt das gesamte historische Data Warehouse auf einmal abzulösen, wird die neue, offene Plattform parallel aufgebaut. Die Migration erfolgt anhand priorisierter, geschäftskritischer Anwendungsfälle.

Schneller Return on Investment (ROI)

Indem zunächst diejenigen Datenbereiche migriert werden, die den höchsten sofortigen Mehrwert bieten, zum Beispiel zur Umsetzung neuer Use-Cases, welche zuvor unmöglich schienen, refinanziert sich der Umbau oft schon während der Projektlaufzeit.

Risikominimierung

Dieser schrittweise Ansatz stellt sicher, dass das Tagesgeschäft (Reporting und laufende Analysen) völlig ungestört weiterläuft, während im Hintergrund das zukunftssichere Fundament iterativ wächst.

Der Übergang zu offener Software und herstellerunabhängigen Datenformaten ist somit kein IT-Selbstzweck, sondern eine planbare, risikoarme Investition in die unternehmerische Handlungsfähigkeit.

Souveränität aktiv gestalten

Wahrlich souverän ist nur das Unternehmen, das die Architektur, die Qualität und den Verbleib seiner Daten vollständig kontrolliert und sich dieser Verantwortung bewusst ist. Wenn Sie sich aus der Abhängigkeit lösen, teure Lizenzmodelle hinter sich lassen und eine rechtssichere Basis für Künstliche Intelligenz schaffen wollen, führt der Weg unweigerlich über offene Standards.

Übernehmen Sie wieder die volle Verantwortung für Ihre Daten. Verwandeln Sie Ihre IT-Infrastruktur von einem reinen Kostenfaktor in den entscheidenden Wettbewerbsvorteil Ihrer Branche.

Als Experten für Big Data und die Entwicklung moderner Datenplattformen unterstützt Scalefree europäische Unternehmen dabei, diesen Weg erfolgreich zu gehen. Wir planen und realisieren End-to-End Daten- und KI-Lösungen jeder Skalierung, von der strategischen Architekturberatung bis zur Implementierung, sowie Agentic AI.

Sind Ihre Daten bereit für die Zukunft?

Lassen Sie uns in einem unverbindlichen Gespräch Ihre aktuelle Architektur beleuchten. Erfahren Sie, wie ein maßgeschneidertes Data Lakehouse auf Basis offener Standards Ihre Datensouveränität dauerhaft sichern kann.

Kostenloses Erstgespräch vereinbaren

Predictive Analytics on the Modern Data Platform

Predictive Analytics on the Modern Data Warehouse

From BI to AI: Operationalizing Predictive Analytics where your Data already lives

Traditional Business Intelligence and reporting are incredibly good at telling what happened yesterday. How much revenue was generated last quarter? How many users logged in this week? But while understanding the past is important, today’s businesses need to know what will happen tomorrow.

This is where Predictive Analytics comes in. At its core, predictive analytics simply uses historical data to forecast future outcomes. Instead of asking how many customers canceled last month, predictive analytics asks:

“Which specific customers are most likely to cancel next week?”

Many organizations understand this value and eagerly hire data scientists to build these models. Yet, time and time again, these predictive initiatives fail to make it out of the PoC phase and into daily business operations because of how teams and data architectures are fundamentally structured.

Predictive Analytics on the Modern Data Platform

Learn how to bridge the gap between your data platform and actionable AI by building predictive models directly where your business data lives. This webinar covers practical strategies for transforming warehouse data into features, deploying models, and automating the flow of insights back into your daily operational workflows. Learn more in our upcoming webinar on March 17th, 2026!

Watch Webinar Recording

The Problem: The “Two Silos” of Data

In many companies, Data Engineering and Data Science exist in two entirely different worlds.

Data Engineers and Data Warehouse Developers spend their days building the Modern Data Platform. They carefully extract, clean, conform, and govern data from dozens of sources to create a “Single Source of Truth.” When a business analyst looks at a revenue dashboard, they know they can trust the numbers because the data platform enforces strict business logic.

Data Scientists, on the other hand, often work in isolated environments like standalone Jupyter notebooks. Because they need massive amounts of data to train their machine learning models, they often bypass the data platform entirely, pulling raw, unstructured data directly from a data lake.

This disconnect creates some challenges:

  • Duplicated Effort: Data Scientists waste up to 80% of their time cleaning and prepping raw data, work the Data Engineering team has already done in the platform
  • Inconsistent Metrics: Because models are built on raw data, a model’s definition of “Active Customer” might completely contradict the official definition used by the in the data platform
  • The “Wall of Production”: A model might look perfectly accurate on a data scientist’s laptop, but because it relies on disconnected, ungoverned data pipelines, integrating its predictions back into the daily workflows of sales or support teams becomes an IT nightmare
  • Exclusivity: Data analysts are often limited to classic descriptive analytics which slows time-to-insight. The optimal solution is to democratize data science, empowering analysts to implement predictive use cases directly.

The Solution: Bring the Machine Learning to the Data

The fix to this problem requires a fundamental shift in how we think about machine learning architecture. Instead of moving data out of governed systems to feed external ML models, we need to bring the ML workflows more closer to the data.

By positioning the Modern Data Platform as the foundation for predictive analytics, you ensure that every prediction is built on the same trusted, cleansed, and governed business data used for your daily reporting. The Data Platform becomes the Feature Store, a centralized hub where data is prepared once and used everywhere, whether for a BI dashboard or training a predictive model.

Predictive Analytics on the Modern Data Warehouse

When the data platform serves as the single source of truth for both analysts and algorithms, magic happens. Data science teams stop wrestling with raw data pipelines, data engineering teams maintain governance, and the business gets predictions they can actually trust and operationalize.

Two Architecture Approaches

So, how do we actually bring the machine learning to the data? There isn’t a one-size-fits-all answer. Depending on the team’s skillset and the complexity of the models, there are typically one of two foundational patterns adopted:

Pattern 1: In-Warehouse Machine Learning (Democratizing ML)

Modern cloud data platforms have evolved beyond just storing and querying data, many now have machine learning engines built directly into them.

  • How it works: Using standard SQL, Data Analysts and Analytics Engineers can train, evaluate, and deploy models entirely inside the data platform (for example BigQuery ML, Snowflake Cortex or Databricks)
  • The Benefit: This radically democratizes predictive analytics. You don’t need to know Python or manage complex infrastructure to build a model. If you know SQL, you can generate predictions using the exact same tables you use for your BI dashboards
  • The Trade-off: While perfect for standard tasks like regression or classification, you are limited to the specific algorithms supported by the data platform

Pattern 2: The Data Platform as a Feature Store (The Hybrid Approach)

For organizations with dedicated Data Science teams building highly complex or custom models, the data platform takes on a different role: the Feature Store.

  • How it works: Data Scientists continue to work in their preferred external ML platforms (like Vertex AI, Databricks, or other). However, instead of pulling messy data from a data lake, they connect directly to the data platform to pull curated, business-approved data (“features”) for training
  • The Benefit: Data Scientists retain maximum flexibility to use advanced Python libraries and deep learning frameworks, while ensuring the models are trained on governed, accurate data
  • The Trade-off: It requires a bit more orchestration to manage the pipeline between the data platform and the ML platform, and ensuring predictions are accurately written back to the data platform
Architecture Feature Store

Example: Predicting Customer Churn

To understand the benefits of these approaches, let’s look at a classic business challenge: Customer Churn Prevention.

Imagine a SaaS company trying to figure out which customers are likely to cancel their subscriptions. In a siloed environment, predicting this is can be a messy, manual science project. But on a modern data platform, it becomes an automated operational workflow:

  1. The Foundation (Data): Because of the Data Engineering team’s work, the data platform already contains all necessary historical information of the customer. CRM data (company size), financial records (billing history), product logs (login frequency), and Zendesk tickets (recent complaints) are all cleaned, joined, and sitting in governed tables including a full history of changes
  2. The Prediction (Modeling): An analyst uses In-Warehouse ML (Pattern 1) to run a classification model against this historical data. The model identifies the hidden patterns of a churning customer and generates a “Churn Risk Score” between 0 and 100 for every active user
  3. The Operationalization (Action): This is the crucial step. The predictions aren’t left in a notebook. The risk scores are written directly back into a new table in the data platform. Through reverse-ETL, these scores can be automatically synced to the CRM as well as dashboards and reports can easily be built on top of the results.

Conclusion

Predictive analytics shouldn’t be an isolated science experiment. It should be a living, breathing part of your operational reality. By treating your modern data platform as the foundation for your machine learning workflows, you eliminate data silos, empower your analysts, and ensure your predictions are built on the trusted business data that matters most.

It is time to operationalize predictive insights where your business data already lives.

Want to see how this works in practice?

Join our upcoming webinar: Predictive Analytics on the Modern Data Platform. We will explore how to build and run predictive analytics directly on top of your data platform using trusted, governed business data as the foundation. You’ll learn practical patterns for turning warehouse models into features, training and deploying predictions, and integrating results back into reporting and operational workflows. Join me on March 17th.

Register for free

Unlock the Intelligence Layer: LLMs in Data Warehousing and the Future of Your Data

Natural Language AI Model

“Stop writing complex SQL, start talking to your data?”

This provocative question highlights a growing shift in how we interact with data. For years, getting answers from a Data Warehouse meant writing SQL queries or relying on pre-built dashboards.

For many organizations, their data platforms remain underutilized because accessing insights still requires writing code or navigating complex dashboards. It’s time to go beyond static reports and unlock a true intelligence layer on top of your data warehouse. Recent advances in Large Language Models (LLMs) and Natural Language Processing (NLP) are making data warehouses smarter, faster, and easier to use for everyone. In this article, we’ll explore how LLMs can transform the way you interact with your data – from using plain English queries instead of SQL, to AI-driven discovery of hidden insights, to enriching your data pipelines – and why this shift represents the future of data analytics.

Unlock the Intelligence Layer: LLMs in Data Warehousing and the Future of your Data

Unlock your data warehouse’s full potential! This webinar reveals how Large Language Models and Natural Language Processing are transforming data interaction, empowering everyone to effortlessly translate plain language into SQL, enable AI-driven data discovery, and deliver actionable insights to every stakeholder. Register for our free webinar, August 12th, 2025!

Watch Webinar Recording

From Complex SQL to Conversational Queries

Business users often depend on data engineers or analysts to fetch answers, creating bottlenecks in decision-making. Even data professionals themselves spend considerable time writing and optimizing SQL, rather than interpreting results. What if anyone could simply ask the data warehouse a question in plain language and get the answer? This is the promise of LLMs as an “intelligence layer”, a layer that bridges complex datasets and human comprehension. Advanced LLMs can understand a user’s question or request and generate the appropriate SQL queries on the fly.

This technology (often called Text-to-SQL or Natural-Language-to-SQL or NL2SQL) has rapidly evolved and major technology players have already taken note. For example, Databricks introduced a Natural Language Query feature (LakehouseIQ) to let users ask questions of their Lakehouse, and Snowflake is also exploring LLM-driven query capabilities.

Imagine asking your data warehouse in plain English: “What were our top-selling products last quarter by region?”. This text input is passed into a LLM, often enriched by company-specific data via RAG and then the system translates that into a correct, optimized SQL query that retrieves the answer.

Natural Language AI Model

Of course, translating natural language to SQL at an enterprise scale isn’t trivial. Complex schemas, ambiguous user input, and security considerations mean the LLM has to be both smart and careful. Uber has built such an AI system that works on an enterprise scale level.

Uber’s QueryGPT is an NL2SQL system that uses a multi-step, RAG-based pipeline combining LLMs with retrieval and agent modules. It fetches context via similarity search over a vector database of example queries and schema information for SQL generation. To manage Uber’s vast data ecosystem, QueryGPT employs specialized agents:

  • an Intent Agent classifies requests by business domain
  • a Table Agent suggests tables for the query
  • a Column Prune Agent trims irrelevant columns to reduce prompt length. The LLM then produces the SQL query and an explanation.

This layered design allows QueryGPT to handle large schemas and reliably generate complex multi-table queries. It’s a hybrid architecture where multiple transformer calls specialize in sub-tasks, enabling scalable, accurate NL2SQL as a production service, saving thousands of Uber employees significant time by mid-2024.

AI-Augmented Data Discovery and Insights

Beyond simply fetching results for user queries, LLMs can augment data discovery by revealing insights that users might not have explicitly asked for. Traditional dashboards show you what is happening, but a smart LLM-based system can tell you why it’s happening and highlight patterns you might not notice. This is often called augmented analytics – using AI to automatically find important correlations, trends, outliers, and drivers in your data.

LLMs excel at interpreting data outputs and providing additional context. For example, rather than just displaying a chart or a table, an LLM can generate a written summary pointing out key trends or anomalies. They can explain which metrics are up or down and suggest potential reasons (for instance, detecting that “conversion rates dipped in July, possibly due to seasonality or inventory issues”), enabling quicker and more informed decision-making.

Another area where LLMs can significantly reduce manual effort is in the creation and maintenance of data catalogs. Documenting data models, table structures, and especially individual column descriptions is often time-consuming and easily skipped due to missing resources, despite being crucial for an effective use and accessibility of the data. LLMs can automate large parts of this process by generating descriptions based on data profiling, SQL logic, naming conventions, and metadata.

dbt Cloud has recently released their dbt Copilot AI Agent that supports the developer in various ways, for example by letting the AI analyzing the SQL code and schema metadata to automatically generate model and column descriptions.

LLMs in Your Data Pipeline: Enrichment and Efficiency

LLMs don’t just enhance how users interact with the Data Warehouse; they can also improve the data itself and the efficiency of data engineering processes. In modern ELT (Extract-Load-Transform) pipelines, a lot of time is spent cleaning, enriching, and preparing data for analysis. Here, LLMs offer new tools to automate and augment these steps.

One promising use case is the semantic enrichment of data. Large Language Models have absorbed a vast amount of world knowledge and language patterns, and they can use that to fill gaps or add context to your raw data. For example, imagine you have a dataset of customer feedback where each entry is a text comment. An LLM could automatically classify the sentiment of each comment (positive/negative), extract key themes, or even generate a summary of common issues. In this way, unstructured data becomes structured insights without manual effort. The image below illustrates how an LLM is integrated into a data pipeline: text inputs from a CustomerFeedback table are passed to an OpenAI API endpoint, where the model returns structured sentiment labels that are then stored back in the database.

Large Language Models Sentiment Analysis

In a practical case study, LLMs were used to enrich an academic dataset by inferring missing attributes (like guessing a person’s gender from their name with high accuracy), which outperformed dedicated API services. This showcases how LLMs can bring external knowledge and reasoning to enhance your data.

Another area is metadata enrichment and semantic enrichment of unstructured data. Enterprise data is often filled with cryptic column names and jargon that prevents usability. LLMs can intelligently expand abbreviations and annotate fields with business-friendly descriptions. For instance, an LLM-driven catalog might take a column labeled “CUST_ID” and annotate it as “Customer Identifier, unique ID for each customer record”.

LLMs can also assist in the coding and transformation process itself. Data engineers can leverage LLMs to generate boilerplate code or SQL for transformations, document pipeline logic in plain English, or even detect anomalies and data quality issues through pattern analysis. By automating tedious parts of data preparation and providing AI-generated suggestions, LLMs free up engineers to focus on higher-level architecture and problem-solving.

Conclusion

While the promise of an LLM-powered intelligence layer is exciting, it’s important to approach it with a clear strategy. Successful implementation requires considering a few key challenges and best practices. Data quality and governance are more crucial than ever. If your underlying data is inaccurate or poorly structured, the AI’s answers will be unreliable. As the saying goes, “garbage in, garbage out.”

Ensuring clean, well-organized data (and maintaining a robust data governance program) will help the LLM produce meaningful and correct insights. Additionally, organizations may need to fine-tune or configure their LLMs to understand industry-specific terminology or business context. This reduces the chance of the AI misinterpreting what a user asks or generating an incorrect query.

Privacy and security are another important consideration. If your data includes sensitive information, you must ensure that any AI tool accessing it complies with your security requirements. This might involve using self-hosted models or secure APIs, and setting up proper access controls.

The dream of a self-service analytics experience: “just talk to the data and get answers” is quickly becoming a reality. This evolution may redefine roles (enabling analysts and engineers alike to focus on higher-value tasks) and open up analytics to a wider audience than ever before. It’s an exciting time to be a data professional, but also one that demands staying informed and ready to adapt.

AI in Data Warehousing: Fundamental Principles and Applications

AI in Data Warehousing

AI in Data Warehousing

The ability to efficiently store, manage, and analyze data has become a critical aspect for any organization. The field of Data Warehousing is an integral part of the process by providing structured, centralized storage for large volumes of data and is an important element of business intelligence.
However, the increasing volume and complexity of data is becoming increasingly challenging to manage. That said, AI is already shaping many processes in a wide variety of areas, including data warehousing.

This article explores the basic principles and applications of artificial intelligence and how AI can be integrated into an Enterprise Data Warehouse, by using it to enhance the design and operation of data warehouses, as well as enabling the development of more challenging data science applications like machine learning or predictive analytics.

AI in Data Warehousing: Fundamental Principles and Applications

This webinar delves into the essential principles of Artificial Intelligence (AI) and its transformative role in data warehousing. We explore how AI and Machine Learning (ML) not only enhance the construction and operation of data warehouses but also leverage the vast amounts of data stored within them for developing powerful applications, e.g. for predictive analytics.

You will gain an understanding of the foundational AI technologies driving these advancements and how they intersect with data science and machine learning to optimize data storage, improve decision-making, and unlock new business insights. This webinar provides valuable perspectives on integrating these technologies into your data warehousing practices.

Watch webinar recording

Foundational Principles of AI

Let’s start by categorizing the term artificial intelligence, as well as other terms such as machine learning, deep learning, or generative AI, in order to clarify the context and what exactly is hidden behind these terms. 

Artificial Intelligence is now more of a generic term and represents a broad field. In general, the term AI encompasses techniques that allow computers to emulate human behavior, enabling them to learn, understand language, make decisions, recognize patterns, and solve complex problems in a way that is similar to human intelligence.

AI Graphic

Machine Learning (ML) as a Subset of AI

Machine Learning is a subset of the field of artificial intelligence and, even if less present as a term, one of the most important. Machine Learning uses advanced techniques and algorithms to recognize patterns in large amounts of data, allowing machines to learn and adapt autonomously to make inferences or predictions based on the data. Unlike traditional programming where tasks are executed based on clear and defined instructions, machine learning relies on statistical analysis based on input data to output data values that fall within an expected range. Essentially, machine learning enables computers to learn and interpret data without being explicitly programmed for each individual case.

Deep Learning (DL) – A Deeper Dive into ML

Deep Learning is a more specialized sub-area of machine learning & AI and uses deep neural networks to perform in-depth data processing tasks to recognize complex patterns.

Deep learning uses many layers to extract high-level features from raw data as input, in a sense simulating the way a human brain works.

In this respect, deep learning is particularly strong in processing very large amounts of data, using it to learn complex patterns in order to solve a wide variety of tasks. Well-known examples of this are tasks such as image or voice recognition, which every smartphone is capable of.

Generative AI – The Creative Aspect of Deep Learning

Generative AI is a term that has recently become more widely known, especially due to ChatGPT. Generative AI is a sub-area of deep learning and includes deep learning models that generate new content, such as text, images, code, or even videos, based on data on which they have been trained on. They are able to generate novel outputs without explicit instructions that do not directly replicate the training data. This opens up completely new possibilities such as generating large texts, complex images, or even music.

The Role of AI in Building Data Warehouses

AI is changing everything, including data warehousing. As such, it can help to enhance the efficiency and effectiveness of your data warehouse from the ground up. From the design and structure to the ongoing data management processes, there are many opportunities where AI can help.

AI has the power to address a data warehouse’s biggest challenges: performance, governance, and usability. This is data intelligence and will revolutionize the way you query, manage, govern, and visualize your data.

Enhancing Design and Structure

The architecture of a data warehouse is critical as it needs to support efficient data querying and scalability while maintaining performance. By analyzing usage and query patterns, AI algorithms can suggest the most effective data models and indexing strategies. This not only speeds up the retrieval of information but also ensures more agile data handling when scaling or integrating new data sources.

Automating Data Integration, Cleaning, and Transformation

By letting AI take over low-level tasks, data engineers can focus on higher-level tasks such as designing data models, training machine learning algorithms, and creating data visualizations. For example, The Coca-Cola Company uses AI-powered ETL tools to automate data integration tasks across its global supply chain to optimize sourcing and procurement processes. 

But AI can also support the actual developers in their work, enabling them to work faster and more efficiently. Developers can use AI to debug problems in their code faster by using AI-based code generation and analysis, such as Github Copilot, which completes, refactors, and debugs code in real time directly in the IDE. 

The automation of performance tuning of data warehouse workloads can also be improved with predictive optimizations, which can save a considerable amount of costs.

AI can also scale and automate governance using automated tagging, documentation, and natural language search across all data and assets in an organization.

Right up to the possibility that business users could use natural language to interact with the data, ask questions, and build dashboards, the possibilities are endless and we are only at the very beginning of these developments.

Using a Data Warehouse for AI Applications

The nature of a data warehouse, a structured and centralized repository that aggregates data from multiple sources within an organization, makes it a perfect foundation for building and training AI applications. AI models require large amounts of data with as much variability as possible that are well structured and have a high data quality, which a data warehouse can perfectly provide for an organization to train its own AI model. 

For example, a data warehouse in a retail context might store years of customer purchase history, demographics, and product information. Data that can be used to train AI models to predict future buying trends or recommend products effectively.

Enhancing AI Capabilities with Data Vault

Data Vault 2.0 in particular sets the exact right conditions for training your own AI models by leveraging the structured, reliable data framework that Data Vault 2.0 provides. 

Here’s are some examples how this integration can benefit AI-driven analytics:

  • Improved Data Quality: The organized and auditable structure of Data Vault 2.0 ensures higher data quality and consistency, which is important for training accurate AI models. Clean, well-structured data models reduce the time spent on data preparation and increase the reliability of AI predictions.
  • Enhanced Historical Analysis: The comprehensive historical data captured by Data Vault 2.0 allows AI models to perform more accurate trend analysis and forecasting. This capability is especially valuable in sectors like finance and retail, where understanding long-term trends can significantly impact strategic decisions.
  • Data Reliability and Lineage: Data Vault 2.0’s robust framework ensures data reliability through its unique architecture that captures data from various sources while maintaining its lineage. This means every piece of data in the system can be traced back to its origin, providing transparency and trust in the data used for AI models.

Conclusion

Integrating artificial intelligence (AI) into data warehousing significantly enhances the efficiency and effectiveness of data storage, management, and analysis. By applying AI techniques, organizations can automate data integration, cleaning, and transformation processes, leading to more accurate and timely insights. Furthermore, AI-driven optimization of data warehouse design and structure ensures improved performance and scalability. Embracing AI in data warehousing not only streamlines operations but also empowers businesses to leverage advanced analytics, such as machine learning and predictive analytics, thereby unlocking deeper insights and fostering informed decision-making.

Mastering Metadata: Data Catalogs in Data Warehousing with Datahub

Mastering Metadata in Data Warehousing

In today’s data-driven world, it is essential to be able to manage and organize large amounts of data in an efficient way. Businesses across all industries are forced to contend with more data than ever before. The introduction and development of an enterprise data warehouse in a company naturally plays a central role, but does not solve a major challenge: How to effectively organize and manage the data, especially metadata, in an Enterprise Data Warehouse? This is where the concept of data catalogs comes into play and where tools like Datahub become essential. 

A data catalog serves as a comprehensive inventory of data assets in an organization, providing context, annotations, and metadata to facilitate the understanding and discovery of data. It’s like a map to your data, helping users navigate the complex data landscape to find the exact data they need.

A data catalog can help users to understand where to find specific data in the data warehouse that fits their needs and to investigate where it came from, as well as how it might be connected to other data. This can greatly simplify tasks like data analysis and reporting, making the data warehouse more accessible and usable for everyone in the organization.

Mastering Metadata: Data Catalogs in Data Warehousing with DataHub

Don’t miss our upcoming webinar about data catalogs! This session will explore in detail the critical role of data catalogs in data warehousing, with an exclusive focus on the powerful tool DataHub. You’ll gain practical insights on enhancing data discovery, metadata management, data lineage, and data governance. Sign up today and transform your data management strategies into a competitive advantage.

Watch webinar recording

Understanding Data Catalogs

What is a Data Catalog?

In general, a data catalog is like a metadata inventory, which consists of organized and structured metadata regarding all data assets in an organization. It is a central place where all this metadata can be stored, combined, and categorized, which makes it a lot easier to discover and understand the corresponding data, for example in a data warehouse. A data catalog also has search functionalities to find specific data from the available indexed datasets. It serves like a single source of truth of your metadata, enabling users to trust the data they’re using for their analyses or business decisions.

Role of a Data Catalog in Data Warehousing

In the context of data warehousing, a data catalog brings a lot of benefits. It provides a way to explore and search all data stored in the data warehouse. Technical users, as well as Business Users, can discover relevant data, understand its context, and ensure it is up-to-date, reliable, and accurate. The following figure shows where data catalogs in data warehousing with Data Vault 2.0 take place. A Data Catalog should cover the entire Enterprise BI Solution. This also applies, for example, to a data lake, if available, and to the information delivery layer.

Data Vault 2.0 General Architecture

Now that we’ve understood what a data catalog is, let’s delve into how each component plays a part in a data catalog and explore how a tool like DataHub can assist organizations in these tasks.

Introduction to DataHub

In the world of data catalogs, DataHub stands out as an increasingly popular choice for many businesses. DataHub is a growing open-source software developed by LinkedIn to address its growing need for a more dynamic and scalable data management tool. It was created in part due to the fact that the existing tools were not sufficient with LinkedIns expanding needs.

As LinkedIn grew, so did its data volume, variety, and velocity. Recognizing the need for a more efficient way to manage its data, LinkedIn built and introduced DataHub in 2020. Open-sourcing DataHub allowed other organizations to benefit from this advanced tool, and it has since been adopted by many businesses looking for a modern, scalable data catalog solution.

DataHub supports both push-based and pull-based metadata ingestion, including a wide range of integrations for example Airflow, BigQuery, Databricks, dbt, Hive, Kafka, Looker, MSSQL, MongoDB, Oracle, S3, PowerBI, Snowflake, Spark, and much more. You can find a full list here. This gives datahub the ability to combine and show metadata of the same data from multiple sources, for example, a dbt model definition, and if the tests were running successfully, right next to the database schema and stats for all columns.

Datahub statistics

Key Features and Capabilities of DataHub

DataHub, as a metadata platform, goes beyond traditional data catalogs. DataHub offers all important features and capabilities:

1. Scalability: DataHub is designed to handle metadata from thousands of datasets, which makes it a great choice for large organizations.

2. Flexible and Extensible Data Model: The technical data model behind this tool is designed to be customizable and expandable to allow organizations to customize it to their specific business requirements

3. Powerful Search and Discovery: Leveraging Elasticsearch, DataHub offers robust search functionality that enables users to discover datasets quickly based on various attributes, such as the data’s origin, schema, and usage.

Datahub capabilities

4. Rich Metadata: Unlike traditional data catalogs, DataHub captures and presents a wide variety of metadata, including data lineage, operational metadata, and business metadata. This gives users a comprehensive understanding of their data.

5. Data Lineage and Relationships: DataHub automatically captures and visualizes data lineage, showing how data flows through various systems. It also displays relationships between datasets, allowing users to understand how different data assets interact with each other.

Lineage and Relationships in datahub

Conclusion

Using a data catalog comes with several benefits:

  1. Enhanced Data Discovery: With the search and categorization capabilities of a data catalog, users can quickly find the exact data they need without having to comb through large datasets.
  2. Improved Data Understanding: The metadata in a data catalog provides users with necessary context about the data, making it easier to interpret and use correctly.
  3. Better Compliance and Governance: A data catalog supports data governance initiatives by ensuring data is consistent, accurate, and compliant with relevant regulations.
  4. Increased Trust in Data: By providing transparency into data lineage, a data catalog helps build trust in the data by allowing users to see its history and verify its accuracy and reliability.
  5. Time and Resource Efficiency: By making it easier to locate and understand data, a data catalog can save the company resources, thus speeding up data-driven activities and reducing the burden on data management teams.

In conclusion, DataHub provides a flexible, feature-rich, and all-encompassing option for data catalogs in a data warehousing environment. By providing powerful features for data discovery, metadata management, data lineage, and data governance, it enables businesses to extract maximum value from their data.

If you’re interested in learning more about data catalogs, watch the recording here for free.

Data Vault 2.0 with DBT – Part 2

Data Vault 2.0 with dbt

Data Vault 2.0 with dbt

In the first part of this blog series we introduced you to dbt. Now let’s take a look at how you can implement Data Vault 2.0 with dbt and what advantages this offers. If you don’t know the first part yet, you can read it here.

 

Dbt Models

dbt provides the ability to create models and generate and execute SQL dynamically out of these models. This allows you to write your data transformations in models using SQL and reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way. However, the most important part for the Data Vault use case is the ability to define and use those macros.

But first we should clarify how models work in dbt.

This tool handles the compilation and execution of models written using SQL and the Jinja macro language. Each model consists of exactly one SQL SELECT statement. Jinja code is translated into SQL during compilation.

Data Vault 2.0 with dbt

This figure shows a simple model in dbt. A big advantage of Jinja is the possibility to generate sql programmatically, for example with loops and conditions. Also, through the ref() functions, dbt understands the relationships between the models and builds a dependency graph. This ensures that the models are built in the correct order and that the data lineage is documented. A lineage graph may look like this:

Data Vault 2.0 with dbt

The materialization of models can be defined at different configuration levels. This allows fast prototyping with views and switching to materialized tables if necessary for performance reasons.

Data Vault 2.0 and Macros

But how can we implement Data Vault 2.0 with dbt? The most important part for using Data Vault 2.0 is the ability to define and use macros. Macros can be called in models and then generate into that macro additional SQL or even the entire SQL code.

For example, you could create a macro to generate a hub that gets the source/staging model as an input parameter, as well as the specification of the columns for the business key(s), load date and record source. The sql code for the hub is then generated dynamically from this. The advantage of this is that a change to the macro would directly affect each individual hub, which greatly improves maintainability.

At this point, you also gain huge benefits from the active open-source community around dbt. There are many open-source packages with which it can be extended.

There are also already some packages that are perfect for using Data Vault 2.0 with dbt. 

For example, our own open-source package DataVault4dbt, developed and actively maintained here at Scalefree, provides a comprehensive set of dbt macros to translate a Data Vault model “on paper” into actual tables and views—such as Hubs, Links, Satellites, and more. The package is actively used in real-world projects and helps enforce best practices for a modern, audit-friendly Data Vault 2.0 implementation.

To explore all features and macro parameters in detail, check out the documentation.

The only thing you need in your model, for example for a hub, is just one macro call:

{%-

hub(src_pk, src_nk, src_ldts, src_source, source_model) 

-%}

With the parameters of the macro call you define the source table where the columns can be found (source_model) and the column names for the hash-key (src_pk), business key(s) (src_nk), load date (src_ldts) and the record source (src_source) column. When the model and the macro(s) defined in the model are executed, the SQL gets compiled and processed on the database system.

The metadata needed can for example be defined in variables with jinja directly in the model:

What you can also see is that this tool provides different options for the materialization. The incremental materialization for example, will load an entity as a table on an incremental basis.

When the model is executed, dbt generates the whole sql out of the macro and decides how the records are loaded. If the hub table does not exist, yet it is created and all records are loaded, if it already exists, the table is loaded incrementally. 

For people who tried or managed to implement a Data Vault with “vanilla” SQL, you will realise that this is a real game-changer. The team can now focus entirely on the Data Vault design itself. Once the metadata is identified, dbt, along with your macros, can take care of the entire logic.

Openly available packages can add basic Data Vault 2.0 principles to dbt and therefore allow users to quick-dive into Data Vault implementation. Dbt’s general openness allows you to adjust all macros, for example, to your company or project specific Data Vault flavor to meet your technical and business requirements. 

It is important to note that many of the currently available dbt packages for implementing Data Vault 2.0 deviate from the official standards in some details. Our own open-source package DataVault4dbt, developed and actively maintained here at Scalefree, addresses this gap by supporting all core Data Vault 2.0 entities and the latest best practices. The package is already used in real-world projects to help ensure a modern and audit-friendly implementation.

Conclusion

Integrating Data Vault 2.0 with dbt streamlines the data warehousing process by leveraging dbt’s capabilities to define models and macros, enabling efficient and dynamic SQL generation. This approach allows teams to focus on Data Vault design, with dbt handling the underlying logic.

Our open-source package DataVault4dbt brings these advantages to real-world projects, offering a reliable and standards-aligned way to build Hubs, Links, Satellites, and more.

Data Vault 2.0 with DBT – Part 1

Data Vault 2.0 with dbt

This article focuses on the benefits of using Data Vault 2.0 with dbt, and the importance of choosing the correct implementation tools. Data is an important asset in the decision-making process. As we have previously discussed in another post, Data Vault 2.0 is the right choice when the goal of an enterprise data warehouse is to have fully historized and integrated data. Additionally, it is also better suited to instances in which data from many source systems needs to be combined. You can find the previous blog post here.

While Data Vault 2.0 focuses on the “what”, there are many options for the “how” of technically translating a Data Vault model into physical tables and views in the enterprise data warehouse, as well as for orchestrating and loading/processing the procedures. And this is where Data Vault 2.0 with dbt creates an effective solution.

About dbt

This data build tool transforms your data directly into your data warehouse. For reference, dbt is the “T” in ELT. Therefore, dbt assumes that data is already loaded into a database from which the current database can query. In contrast, ETL extracts data then transforms it and before loading it into the target. With ELT, the data is not transformed as it moves into the data warehouse. (with transform we mean transformation by soft business rules which would change the meaning of the data. Of course, we have to make sure that the data fits into the target table (data type adjustments, etc.). Here we are talking about “hard rules”.)
Dbt is particularly compatible and useful in cloud DWH solutions such as Snowflake, Azure Synapse Analytics, BigQuery, and Redshift and performs transformations and modeling directly on the database in order to take advantage of the performance of these highly scalable platforms.

How dbt works

Models and SQL statements can be easily created, tested, and managed in dbt itself. A powerful combination of the scripting language, Jinja2, and the all-time classic SQL allows users to build models. The simple interface enables data analysts without engineering know-how to initiate appropriate transformations. Data team workflows become more efficient and cost-effective because of this. Behind this tool sits an open-source community that is constantly and passionately developing the tool. As such, dbt is available both as a free, reduced core version and as a comprehensive and flexible cloud version.

How does Data Vault 2.0 with dbt work?

Part of the Data Vault 2.0 methodology is the model that focuses on how to design the core data warehouse with a scalable solution. The core Data Vault entities are hubs, links ,and satellites. That said, using Data Vault 2.0 with dbt provides the ability to generate Data Vault models and also allows you to write your data transformations using SQL and code-reusable macros powered by Jinja2 to run your data pipelines in a clean and efficient way.

Data Vault 2.0 with dbt

Conclusion

Dbt does not reinvent the wheel, but when it comes to building a new EDW – especially if it is in the clouds – it provides a very helpful basic framework with many important functions for Continuous Integration and Deployment already defined. dbt brings the standards of software development into the world of data transformation. This allows developers to concentrate on the core tasks of data modeling and business logic. Especially, not only for smaller projects, this tool offers a lightweight and extremely affordable alternative to other data warehouse automation solutions.

Meltano Open Source Production Grade Data Integration – Part 2

Meltano result

Meltano Open Source Production Grade Data Integration

In this article, we delve into the practical application of Meltano, an open-source platform designed for building, running, and orchestrating ELT pipelines. Building upon our previous discussion of Meltano’s architecture, this installment guides you through the process of creating a data integration pipeline, from initializing a project to configuring extractors and loaders, and utilizing the Meltano UI for streamlined management.

Meltano in action

In our last overview, we talked about Meltano and its architecture. Now, we would like to illustrate the ease in which you can use Meltano to create a data integration pipeline.
Before we start, please ensure that you have already installed Meltano on your machine. If you haven’t yet, you can follow Meltano’s official installation guide.

First we will initialize a Meltano project.
Initialize a new project in a directory of your choice by using  “meltano init”. This will create a new directory with, among other things, your  “meltano.yml” project file.

Meltano

The Meltano project folder

The project folder is the single source of truth for all your data needs. It is a simple directory and version controllable. 

The main part of your project is the meltano.yml project file. This file defines your plugins, pipelines and configuration.

The project and meltano.yml files are manageable using the Meltano CLI and are instantly containerized for Docker/Kubernetes deployment. Though, please note that there are not any defined plugins or pipeline schedules created yet. We will do this in the next step.

Adding an extractor and loader

Now, let’s initialize the pipeline’s components. The first plugin you’ll want to add is an extractor which will be responsible for pulling data out of your data source.

To find out if an extractor for your data source is supported out of the box, you can check the Extractors list on MeltanoHub or run  “meltano discover”.

We will use the Tap for Gitlab in this example as we don’t need to create API credentials.

Meltano

Meltano manages setup, configuration and handles invocation.

Meltano

We will configure the extractor to pull the data from the repository meltano/meltano. Additionally, we will define it to only extract data as of 1 January 2021 and to include only data under the “Tags” stream.

Meltano

Data selection is way easier than using just Singer! The extractor is now set up. Now, we will add a loader to store the data into a CSV file and define our destination path:

Meltano

Remember that the directory needs to be previously created, as it will not be created automatically. This is what our meltano.yml looks like:

Instead of using the CLI, we can make changes directly in the YAML.

This way, it’s also possible to configure the extractors and loaders in the Meltano UI:

Start the Meltano UI web server using  meltano ui. Unless configured otherwise, the UI will now be available at http://localhost:5000.

Meltano Extractors
Meltano Configuration

Run a pipeline

Now it’s time to run a pipeline. To run a one-time pipeline, we can just use the meltano elt command:

Meltano
Pipelines

Result

And we are done! It took us just ten commands to create a data integration pipeline.

Meltano result

Conclusion

It’s clear why Meltano is a great choice for building your data platform: it’s powerful but simple to maintain and its open-source model makes it flexible, budget-friendly and reliable.

Singer Open Source Production Grade Data Integration – Part 1

In our past blog post, we introduced an open source framework for ELT processes called Singer. This framework can be wrapped up using another open source tool which adds more interesting features to Singer including installation, setup of environments, monitoring, scheduling and orchestration. At Scalefree, we moved all of our ELT pipelines into this framework on AWS and are pleased with the results.

Please note, there are a large number of platforms for managing data integration but there is a lack of robust and easy-to-use, free open source solutions. The Meltano project aims to provide a solution to that situation. Meltano is a full-package data integration platform that challenges the most established players in the data space. Meltano is built on top of the best open source tools for data integration and infuses them with DataOps best practices.

Meltano is the easiest way to build, run and orchestrate ELT pipelines made-up of Singer taps, targets and dbt models. It is open source, self-hosted and version controlled as well as containerized.

Meltano’s open source model lets you easily adapt it to your own needs and reduces cost. Continue Reading

Close Menu