Tuesday, November 05, 2019

ELT is passé; STL (Stream-Transform-Load) is the new name for ETL

ETL: Extract Transform Load
ELT: Extract Load Transform
STL: Stream Transform Load

Everybody likes to talk about how technology can make a difference in real-time, context-based application experiences, but is anyone actually doing anything about it? Those who are tend to be the market leaders in their domain.

In the past, most banking and retail companies ran batch jobs at midnight to move the day's transaction data to a data warehouse (also called end-of-day processing). One of the standard conditions for batch processing was that no transactions could take place in the source data system while the batch was running.

These days most banks and retail companies run 24x7 and cannot afford even a minute of downtime on their systems; i.e., a customer could log on to the mobile application or website and make a transaction at midnight. They are even more concerned about setting up a disaster recovery (DR) site far away from the main production site, so that if a catastrophe hits their data center, the business does not stop. After the September 11, 2001 attacks, one of the first things most enterprises did was create a DR site with real-time data replication. Now that they must let customers transact 24x7 with no downtime in their data systems, they want to repurpose their ETL systems as STL systems, so that they get real-time ETL functionality on top of the new big data technologies, which can scale to real-time workloads.

Here is a high-level comparison of traditional ETL, CDC, and STL:

Traditional ETL
  • Batch mode of extraction, high latency, low throughput over a large window
  • High load on sources, usage only during certain times of day
  • On-disk transformations
CDC (Change Data Capture)
  • Real-time Replication, high throughput in a large window
  • Low load on sources, fully utilized systems
  • High load on target to load high volumes
Stream-Transform-Load (a.k.a. Streaming ETL)
  • Best parts of ETL and CDC (low latency, low load on sources, overall higher throughput)
  • In-memory transformations
  • Reduced load to target systems
  • Reduced garbage-in
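
To make the Stream-Transform-Load idea above more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a reference implementation: the event fields (account_id, amount, currency), the default currency, and the function names are all hypothetical, and a plain list stands in for both the change stream and the target warehouse. Each record is transformed in memory as it arrives, and invalid records are dropped before loading, which is one way to read "reduced garbage-in" and "reduced load to target systems".

```python
from typing import Iterable, Iterator, Optional


def stream_source(events: Iterable[dict]) -> Iterator[dict]:
    """Hypothetical stand-in for a change stream; yields one record at a time."""
    for event in events:
        yield event


def transform(event: dict) -> Optional[dict]:
    """In-memory transformation of a single record.

    Returns None for records that should never reach the target.
    The field names and validation rules are assumptions for illustration.
    """
    amount = event.get("amount")
    if event.get("account_id") is None or not isinstance(amount, (int, float)):
        return None
    return {
        "account_id": str(event["account_id"]).strip(),
        "amount_cents": round(amount * 100),
        # Defaulting to "USD" is an assumption made only for this sketch.
        "currency": str(event.get("currency", "USD")).upper(),
    }


def load(record: dict, target: list) -> None:
    """Stand-in for a write to the target system (here, an in-memory list)."""
    target.append(record)


def run_pipeline(events: Iterable[dict], target: list) -> int:
    """Stream, transform, and load record by record; returns the number loaded."""
    loaded = 0
    for event in stream_source(events):
        record = transform(event)
        if record is None:
            continue  # filtered out before it can add load to the target
        load(record, target)
        loaded += 1
    return loaded


sample_events = [
    {"account_id": " A-1 ", "amount": 12.5, "currency": "usd"},
    {"account_id": None, "amount": 3.0},      # missing account: dropped
    {"account_id": "B-2", "amount": "oops"},  # non-numeric amount: dropped
    {"account_id": "C-3", "amount": 7},
]
warehouse: list = []
loaded_count = run_pipeline(sample_events, warehouse)
```

In a real deployment the source would be a message broker or a CDC feed and the target a warehouse or data lake; the point of the sketch is only that no record waits for a nightly batch window and no intermediate on-disk staging is needed.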


Some of the most common use cases of data stream processing include

  • Industrial Automation
  • Log Analytics
  • Building Real-time data lakes
  • IoT (Wearables & devices) Analytics
  • Smart homes and cities

Industry-specific use-cases for Stream Processing:

  • Financial Services
    • Fraud Detection
    • Real-time analysis of commodity prices
    • Real-time analysis of currency exchange data
    • Risk Management
  • Retail
    • Markdown optimization 
    • Dynamic pricing, forecasting, and trends
    • Real-time Personalized Offers
    • Shopping cart defections
    • Better store and shelf management
  • Transportation
    • Tracking Containers, Delivery Vehicles, and Other Assets
    • Vehicle Management 
    • Passenger Alerts
    • Logistics and Route Optimization
  • Telecom
    • Wi-Fi Offloading
    • Video Analytics
    • Network Management
    • Security Operations
    • Geolocation Marketing
    • Mobile Data Processing
  • Health Care
    • Medical Device Monitoring
    • In-home Patient Monitoring
    • Medical Fraud Detection
    • Safer Cities
  • Utilities, Oil and Gas
    • Outage Intelligence
    • Workforce Management
    • Real-time Drilling Analysis
    • Telemetry on critical assets
  • Manufacturing
    • Smart Inventory
    • Quality Control
    • Building Management
    • Logistics and Route Optimization