Most enterprises store their massive volumes of transactional and analytics data at rest in data warehouses or data lakes. Sales, marketing, and customer success teams require access to these data sets. Reverse ETL is a buzzword for collecting data from these existing data stores and making it quickly and easily available to business teams.
This blog post explores why software vendors (try to) introduce new solutions for Reverse ETL, when Reverse ETL is needed, and how it fits into the enterprise architecture. Event streaming with tools like Apache Kafka, which process data in motion, is a crucial piece of Reverse ETL for real-time use cases.
(Originally posted on Kai Waehner’s blog: “When to use Reverse ETL — and how this is related to Event Streaming and Apache Kafka”…)
What are ETL and Reverse ETL?
Let’s begin with the terms. What do ETL and Reverse ETL mean?
Extract-Transform-Load (ETL) is a common term for data integration. Vendors like Informatica or Talend provide visual coding to implement robust ETL pipelines. The cloud brought new SaaS players and the term Integration Platform as a Service (iPaaS) into the ETL market, with vendors such as Boomi, SnapLogic, or MuleSoft Anypoint.
Most ETL tools operate in batch processes for big data workloads, or they use SOAP/REST web services and APIs for real-time communication that does not scale well. ETL pipelines consume data from various data sources, transform or aggregate it, and store the processed data at rest in data sinks such as databases, data warehouses, or data lakes.
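The extract, transform, and load steps can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample records in RAW_ORDERS, the function names, and the use of an in-memory SQLite database as a stand-in for a real data warehouse. It is not how any particular ETL product works internally.

```python
import sqlite3

# Hypothetical order records standing in for data pulled from a source system.
RAW_ORDERS = [
    {"customer": "alice", "amount_cents": 1999},
    {"customer": "bob", "amount_cents": 500},
    {"customer": "alice", "amount_cents": 3001},
]


def extract():
    """Extract: read records from the (here: in-memory) source."""
    return list(RAW_ORDERS)


def transform(records):
    """Transform: aggregate revenue per customer BEFORE anything is stored."""
    totals = {}
    for record in records:
        customer = record["customer"]
        totals[customer] = totals.get(customer, 0) + record["amount_cents"]
    return totals


def load(totals, conn):
    """Load: write only the processed result into the data sink."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS revenue_by_customer "
        "(customer TEXT PRIMARY KEY, total_cents INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO revenue_by_customer VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()


def run_etl(conn):
    """Run the three steps in order and return the rows stored in the sink."""
    load(transform(extract()), conn)
    return conn.execute(
        "SELECT customer, total_cents FROM revenue_by_customer ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl(sqlite3.connect(":memory:")))
```

In this sketch, the sink only ever sees the aggregated totals; the raw order records never reach it. That is the defining property of the ETL ordering.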
Extract-Load-Transform (ELT) is a very similar approach. In ELT, however, the transformations and aggregations happen after the data has been ingested into the data store.
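A minimal Python sketch can make the ELT ordering concrete. The sample rows, the table names raw_orders and revenue_by_customer, and the in-memory SQLite database standing in for a data warehouse are all assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical raw rows (customer, amount in cents) extracted from a source system.
RAW_ORDERS = [("alice", 1999), ("bob", 500), ("alice", 3001)]


def load_raw(conn, rows):
    """Load: ingest the untransformed rows into the data store first."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    conn.commit()


def transform_in_store(conn):
    """Transform: aggregate inside the data store, using its own SQL engine."""
    conn.execute("DROP TABLE IF EXISTS revenue_by_customer")
    conn.execute(
        "CREATE TABLE revenue_by_customer AS "
        "SELECT customer, SUM(amount_cents) AS total_cents "
        "FROM raw_orders GROUP BY customer"
    )
    conn.commit()


def run_elt(conn):
    """Load the raw data, then transform it in place; return the aggregated rows."""
    load_raw(conn, RAW_ORDERS)
    transform_in_store(conn)
    return conn.execute(
        "SELECT customer, total_cents FROM revenue_by_customer ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    print(run_elt(sqlite3.connect(":memory:")))
```

Here the raw rows remain available in the data store after the aggregation has run, so further transformations can be added later without extracting the data from the source again.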