Build and Run ETL Pipelines in Databricks
Introduction to ETL and Databricks
Ali Feizollah
Computer Science, Ph.D.
@ali_feizollah
What Is ETL?
Extract, Transform, Load
Challenges with Traditional ETL Pipelines
Limited scalability for growing data volumes
Complexity in managing multiple tools and
workflows
High maintenance and operational costs
Inadequate support for real-time processing
How Databricks Addresses These Challenges
Unified analytics platform
Built on Apache Spark
Delta Lake integration
Collaborative notebooks
The Power of Delta Lake
Reliability
Scalability
Unified storage
Real-time & batch processing
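As a minimal sketch of what these features look like in code (the table path below is hypothetical, and `spark` is the session Databricks provides in every notebook), Delta Lake behaves as a regular Spark data source:

# Write a DataFrame as a Delta table, then read it back
df = spark.createDataFrame([(1, "active"), (2, "inactive")], ["id", "status"])

df.write.format("delta").mode("overwrite").save("/mnt/data/events_delta/")

df_delta = spark.read.format("delta").load("/mnt/data/events_delta/")

The same Delta table can then serve batch queries and streaming reads alike, which is what "unified storage" refers to.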
Real-world Example: Shell’s Data Pipeline with Databricks
Adopted Databricks’ unified analytics platform to replace siloed legacy ETL systems
Leveraged Apache Spark’s fast processing and Delta Lake’s ACID transactions
Enabled both batch and streaming processing to handle historical data and real-time sensor feeds
Lakehouse Architecture Explained
Data Lakes vs. Data Warehouses
Data Lakes:
Store raw, unstructured, or semi-structured data
Highly scalable and cost-effective
Ideal for storing vast volumes of diverse data

Data Warehouses:
Store structured, curated data
Optimized for fast, complex queries and analytics
Often come with higher costs and less flexibility for raw data
Introducing the Lakehouse Architecture
Merges the benefits of data lakes and data warehouses into one unified platform
Supports both structured and unstructured data without the need for complex ETL processes
Uses Delta Lake to add reliability, ACID transactions, and schema enforcement to the data lake (sketched below)
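To make the schema-enforcement point concrete, here is a minimal sketch; the table path and column names are hypothetical. Appending data whose schema does not match the Delta table’s fails rather than silently corrupting the table:

# Hypothetical illustration of Delta Lake schema enforcement
spark.createDataFrame([(1, "active")], ["id", "status"]) \
    .write.format("delta").save("/mnt/data/users_delta/")

# Appending a mismatched schema raises an AnalysisException
# instead of silently writing bad data
spark.createDataFrame([("oops",)], ["unexpected_col"]) \
    .write.format("delta").mode("append").save("/mnt/data/users_delta/")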
Delta Lake: The Engine of the Lakehouse
ACID transactions
Schema enforcement
Time travel
Unified batch & streaming
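Time travel, for instance, lets you query a table as it existed at an earlier version or timestamp. A brief sketch, assuming a hypothetical Delta table path:

# Current state of the table
df_now = spark.read.format("delta").load("/mnt/data/events_delta/")

# Time travel by version number...
df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/mnt/data/events_delta/")

# ...or by timestamp
df_then = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/mnt/data/events_delta/")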
Batch vs. Streaming ETL
Use Cases for Batch ETL
Historical data analysis and reporting
Data warehouse updates and scheduled aggregations
Latency of several minutes to hours is acceptable
Use Cases for Streaming ETL
Real-time monitoring and alerting
IoT data ingestion and processing
Fraud detection, live dashboards, and continuous customer analytics
Code Snippets in Databricks
Batch ETL code vs. streaming ETL code
Batch_ETL.py

# Batch ETL Example
from pyspark.sql.functions import count

# Read historical CSV files in one batch (`spark` is predefined in Databricks)
df_batch = spark.read.format("csv") \
    .option("header", "true") \
    .load("/mnt/data/historical_data/")

# Keep active records and count them per category
df_transformed = df_batch.filter("status = 'active'") \
    .groupBy("category") \
    .agg(count("*").alias("total"))

Streaming_ETL.py

# Streaming ETL Example
from pyspark.sql.functions import count

# Incrementally ingest newly arriving JSON files with Auto Loader
df_stream = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load("/mnt/data/streaming_data/")

# Same transformation, applied continuously as data arrives
df_transformed = df_stream.filter("status = 'active'") \
    .groupBy("category") \
    .agg(count("*").alias("total"))
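Note that the streaming snippet only defines the transformation; nothing runs until a sink is attached with writeStream. A sketch of that final step, with hypothetical output and checkpoint paths (complete output mode is used because the query is an aggregation):

# Start the stream: continuously write the aggregated counts to a Delta table
query = df_transformed.writeStream \
    .format("delta") \
    .outputMode("complete") \
    .option("checkpointLocation", "/mnt/checkpoints/etl_demo/") \
    .start("/mnt/data/streaming_output/")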