Caching
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
What is caching?
Caching in Spark:
Stores DataFrames in memory or on disk
Improves speed on later transformations / actions
Reduces resource usage
Disadvantages of caching
Very large data sets may not fit in memory
Local disk-based caching may not improve performance
Cached objects may not be available
Caching tips
When developing Spark tasks:
Cache only if you need it
Try caching DataFrames at various points and determine if your performance improves (see the timing sketch below)
Cache to memory or fast SSD / NVMe storage
Cache to slow local disk only if needed
Use intermediate files!
Stop caching objects when finished
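A rough way to check whether a cache actually helps is to time the same action before and after caching. This is only a sketch; it assumes voter_df is an already-loaded DataFrame like the one on the next slide.

import time

# Time an action on the uncached DataFrame
start = time.time()
voter_df.count()
print(f'Uncached count: {time.time() - start:.3f} s')

# Cache, run the action once to materialize the cache, then time it again
voter_df.cache()
voter_df.count()

start = time.time()
voter_df.count()
print(f'Cached count: {time.time() - start:.3f} s')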
Implementing caching
Call .cache() on the DataFrame before the action
from pyspark.sql.functions import monotonically_increasing_id

# Cache the DataFrame, then run an action to materialize the cache
voter_df = spark.read.csv('voter_data.txt.gz')
voter_df.cache().count()
# Later transformations and actions reuse the cached data
voter_df = voter_df.withColumn('ID', monotonically_increasing_id())
voter_df = voter_df.cache()
voter_df.show()
More cache operations
Check .is_cached to determine cache status
print(voter_df.is_cached)
True
Call .unpersist() when finished with the DataFrame
voter_df.unpersist()
Let's practice!
Cleaning Data with PySpark
Improve import performance
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Spark clusters
Spark clusters are made up of two types of processes:
Driver process
Worker processes
Import performance
Important parameters:
Number of objects (files, network locations, etc.)
More small objects are better than fewer large ones
Can import via wildcard:
airport_df = spark.read.csv('airports-*.txt.gz')
General size of objects
Spark performs better if objects are of similar size
Schemas
A well-defined schema will drastically improve import performance (see the sketch below)
Avoids reading the data multiple times
Provides validation on import
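A minimal sketch of supplying a schema on import; the column names and file name are hypothetical, and spark is an existing SparkSession as in the other examples.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema once so Spark does not need to scan the data to infer types
people_schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False),
    StructField('city', StringType(), nullable=False)
])

# Pass the schema on import instead of relying on inference
people_df = spark.read.csv('rawdata.csv', schema=people_schema)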
How to split objects
Use OS utilities / scripts (split, cut, awk)
split -l 10000 -d largefile chunk-
Use custom scripts
Write out to Parquet
# Read the single large CSV once, write it out as Parquet, then work from the Parquet copy
df_csv = spark.read.csv('singlelargefile.csv')
df_csv.write.parquet('data.parquet')
df = spark.read.parquet('data.parquet')
Let's practice!
Cleaning Data with PySpark
Cluster sizing tips
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Configuration options
Spark contains many configuration settings
These can be modified to match needs
Reading configuration settings:
spark.conf.get(<configuration name>)
Writing configuration settings (example below):
spark.conf.set(<configuration name>, <value>)
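For example, a minimal sketch of reading and then adjusting the number of shuffle partitions; the value 500 is purely illustrative.

# Read the current number of partitions used when shuffling data
print(spark.conf.get('spark.sql.shuffle.partitions'))

# Write a new value to match the size of the workload
spark.conf.set('spark.sql.shuffle.partitions', 500)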
Cluster Types
Spark deployment options (chosen via the master URL, as sketched below):
Single node
Standalone
Managed:
YARN
Mesos
Kubernetes
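A minimal sketch of how the deployment type is selected when creating a session; the application name, host name, and port in the commented lines are placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('cleaning_data')
         .master('local[*]')   # single node, using all local cores
         # .master('spark://spark-master:7077')   # standalone cluster
         # .master('yarn')                        # managed via YARN
         .getOrCreate())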
Driver
Task assignment
Result consolidation
Shared data access
Tips:
Driver node should have double the memory of the workers (see the check below)
Fast local storage is helpful
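A quick, hedged way to compare the driver and executor memory settings from a running session; the fallback strings are only placeholders for settings that were never set explicitly.

# Compare the driver's memory allocation against the executors'
conf = spark.sparkContext.getConf()
print(conf.get('spark.driver.memory', 'not explicitly set'))
print(conf.get('spark.executor.memory', 'not explicitly set'))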
Worker
Runs actual tasks
Ideally has all code, data, and resources for a given task
Recommendations:
More worker nodes are often better than fewer, larger workers (see the sketch below)
Test to find the balance
Fast local storage is extremely useful
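A minimal sketch of requesting worker resources when building a session; the values are illustrative, and spark.executor.instances applies to managed clusters such as YARN or Kubernetes.

from pyspark.sql import SparkSession

# Prefer several moderate workers over a few very large ones (values are illustrative)
spark = (SparkSession.builder
         .config('spark.executor.instances', '8')   # number of workers
         .config('spark.executor.cores', '2')       # cores per worker
         .config('spark.executor.memory', '4g')     # memory per worker
         .getOrCreate())

# Check how much parallelism the cluster actually provides
print(spark.sparkContext.defaultParallelism)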
Let's practice!
Cleaning Data with PySpark
Performance improvements
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Explaining the Spark execution plan
voter_df = df.select(df['VOTER NAME']).distinct()
voter_df.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[VOTER NAME#15], functions=[])
+- Exchange hashpartitioning(VOTER NAME#15, 200)
   +- *(1) HashAggregate(keys=[VOTER NAME#15], functions=[])
      +- *(1) FileScan csv [VOTER NAME#15] Batched: false, Format: CSV,
            Location: InMemoryFileIndex[file:/DallasCouncilVotes.csv.gz],
            PartitionFilters: [], PushedFilters: [],
            ReadSchema: struct<VOTER NAME:string>
What is shuffling?
Shuffling refers to moving data between workers to complete a task
Hides complexity from the user
Can be slow to complete
Lowers overall throughput
Is often necessary, but try to minimize it
How to limit shuffling?
Limit use of .repartition(num_partitions)
Use .coalesce(num_partitions) instead (see the sketch below)
Use care when calling .join()
Use .broadcast() where appropriate
You may not need to limit shuffling at all
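As an illustration, a minimal sketch contrasting the two partitioning calls; the partition counts are arbitrary and voter_df is the DataFrame from the earlier examples.

# .repartition() triggers a full shuffle to reach the requested partition count
voter_df = voter_df.repartition(100)

# .coalesce() only merges existing partitions, so it avoids a full shuffle
voter_df = voter_df.coalesce(8)

# Check how many partitions a DataFrame currently has
print(voter_df.rdd.getNumPartitions())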
Broadcasting
Broadcasting:
Provides a copy of an object to each worker
Prevents undue / excess communication between nodes
Can drastically speed up .join() operations
Use the broadcast(<DataFrame>) function
from pyspark.sql.functions import broadcast
combined_df = df_1.join(broadcast(df_2))
Let's practice!
Cleaning Data with PySpark