Build and Run ETL Pipelines in Databricks
Introduction to ETL and Databricks
Ali Feizollah
Computer Science, Ph.D.
@ali_feizollah
What Is ETL?
Extract, Transform, Load
Challenges with Traditional ETL Pipelines
Limited scalability for growing data volumes
Complexity in managing multiple tools and
workflows
High maintenance and operational costs
Inadequate support for real-time processing
How Databricks Addresses These Challenges
Unified analytics platform
Built on Apache Spark
Delta Lake integration
Collaborative notebooks
The Power of Delta Lake
Reliability
Scalability
Unified storage
Real-time & batch processing
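As a minimal sketch of what these features look like in code (the table path below is hypothetical, and `spark` is the session Databricks provides in every notebook), Delta Lake behaves as a regular Spark data source:

# Write a DataFrame as a Delta table, then read it back
df = spark.createDataFrame([(1, "active"), (2, "inactive")], ["id", "status"])

df.write.format("delta").mode("overwrite").save("/mnt/data/events_delta/")

df_delta = spark.read.format("delta").load("/mnt/data/events_delta/")

The same Delta table can then serve batch queries and streaming reads alike, which is what "unified storage" refers to.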
Real-world Example: Shell’s Data Pipeline with Databricks
Adopted Databricks’ unified analytics platform to replace siloed legacy ETL systems
Leveraged Apache Spark’s fast processing and Delta Lake’s ACID transactions
Enabled both batch and streaming processing to handle historical data and real-time sensor feeds
Lakehouse Architecture Explained
Data Lakes vs. Data Warehouses
Data Lakes:
Store raw, unstructured, or semi-structured data
Highly scalable and cost-effective
Ideal for storing vast volumes of diverse data

Data Warehouses:
Store structured, curated data
Optimized for fast, complex queries and analytics
Often come with higher costs and less flexibility for raw data
Introducing the Lakehouse Architecture
Merges the benefits of data lakes and data warehouses into one unified platform
Supports both structured and unstructured data without the need for complex ETL processes
Uses Delta Lake to add reliability, ACID transactions, and schema enforcement to the data lake (sketched below)
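To make the schema-enforcement point concrete, here is a minimal sketch; the table path and column names are hypothetical. Appending data whose schema does not match the Delta table’s fails rather than silently corrupting the table:

# Hypothetical illustration of Delta Lake schema enforcement
spark.createDataFrame([(1, "active")], ["id", "status"]) \
    .write.format("delta").save("/mnt/data/users_delta/")

# Appending a mismatched schema raises an AnalysisException
# instead of silently writing bad data
spark.createDataFrame([("oops",)], ["unexpected_col"]) \
    .write.format("delta").mode("append").save("/mnt/data/users_delta/")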
Delta Lake: The Engine of the Lakehouse
ACID transactions
Schema enforcement
Time travel
Unified batch & streaming
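Time travel, for instance, lets you query a table as it existed at an earlier version or timestamp. A brief sketch, assuming a hypothetical Delta table path:

# Current state of the table
df_now = spark.read.format("delta").load("/mnt/data/events_delta/")

# Time travel by version number...
df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/mnt/data/events_delta/")

# ...or by timestamp
df_then = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/mnt/data/events_delta/")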
Batch vs. Streaming ETL
Use Cases for Batch ETL
Historical data analysis and reporting
Data warehouse updates and scheduled aggregations
Latency of several minutes to hours is acceptable
Use Cases for Streaming ETL
Real-time monitoring and alerting
IoT data ingestion and processing
Fraud detection, live dashboards, and continuous customer analytics
Code Snippets in Databricks
Batch ETL code vs. streaming ETL code
Batch_ETL.py

# Batch ETL Example
from pyspark.sql.functions import count

# Read historical CSV files in one batch (`spark` is predefined in Databricks)
df_batch = spark.read.format("csv") \
    .option("header", "true") \
    .load("/mnt/data/historical_data/")

# Keep active records and count them per category
df_transformed = df_batch.filter("status = 'active'") \
    .groupBy("category") \
    .agg(count("*").alias("total"))

Streaming_ETL.py

# Streaming ETL Example
from pyspark.sql.functions import count

# Incrementally ingest newly arriving JSON files with Auto Loader
df_stream = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load("/mnt/data/streaming_data/")

# Same transformation, applied continuously as data arrives
df_transformed = df_stream.filter("status = 'active'") \
    .groupBy("category") \
    .agg(count("*").alias("total"))
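Note that the streaming snippet only defines the transformation; nothing runs until a sink is attached with writeStream. A sketch of that final step, with hypothetical output and checkpoint paths (complete output mode is used because the query is an aggregation):

# Start the stream: continuously write the aggregated counts to a Delta table
query = df_transformed.writeStream \
    .format("delta") \
    .outputMode("complete") \
    .option("checkpointLocation", "/mnt/checkpoints/etl_demo/") \
    .start("/mnt/data/streaming_output/")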