Spark Optimization
The document discusses common performance problems in Spark, including skew, spill, shuffle, and storage. Skew occurs when there is imbalance in the size of data partitions. Spill happens when data must be written to disk due to lack of memory. Shuffle moves data between executors due to wide transformations. The root causes of problems can be difficult to identify as one problem may cause another. Methods for detecting and mitigating each problem are provided.

Common Performance Problems

Performance issues in Spark are mostly related to the following topics:

• Skew: an imbalance in the size of data partitions.

• Spill: the writing of temp files to disk due to lack of memory.

• Shuffle: moving data between executors due to a wide transformation.

• Storage: the way data is stored on disk actually matters.

• Serialization: the distribution of code across the cluster (UDFs are evil).

Finding the root of a problem can be quite hard, since one problem can cause another: skew can induce spill, storage issues can induce excess shuffle, a wrong way of addressing shuffle can increase skew… And sometimes, several of these causes are present at the same time!

Skew

In Spark, data is typically read in partitions that are evenly distributed across the cluster executors. As we apply transformations to the data, it's possible that some partitions end up with many more records than others. This imbalance in the size of data partitions is what we call skew.

A small amount of skew is not a problem and can be ignored. However, large skew can result in partitions so big that they can't fit in the RAM of the workers. This results in spill and, sometimes, OOM errors that are hard to diagnose.

Detecting skew

We have to dive into the Spark UI, look at the join job, and pay attention to the following:

• Check the Event Timeline: we consider it unhealthy when we see very unbalanced tasks, meaning that some tasks are computing much more or less data than others; in other words, the partitions are unbalanced. Tasks should all have more or less the same duration.

• Check the Summary Metrics: pay attention to the Shuffle Read values (min / 25th p. / median / 75th p. / max). If the 75th percentile (meaning most of the tasks) shows a high number while the others are very low, it means that most of the tasks are shuffling a lot of data, because the data they need lives in other partitions. If the Shuffle Read values are similar to each other, all the tasks are shuffling about the same amount of data, and therefore the partitions are decently balanced.

• Check Aggregated Metrics by Executor: if the Spill value (memory/disk) is very high, it is caused by the shuffle induced by the skew. Partitions that are too big may require storing data in temp files on disk because of lack of memory.

• Inspect the data: once you suspect you are facing skew, you can remove any doubt by inspecting the data. Perform a count of records per "group by" or "join key" category and check whether some categories have many more records than others, as sketched below. If there is a big imbalance, the biggest category (partition) will take much longer, causing lots of shuffling and spill.
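As a reference, a minimal PySpark sketch of such an inspection; df and join_key are placeholder names for your own dataframe and key column:

```python
from pyspark.sql import functions as F

# Count records per join/group-by key and inspect the heaviest categories.
# "df" and "join_key" are placeholders.
(df.groupBy("join_key")
   .count()
   .orderBy(F.desc("count"))
   .show(20, truncate=False))
```

If a handful of keys dominate the counts, those keys are the source of the skew.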

What can we do to mitigate skew?

In case of OOM issues, you may feel tempted to increase the RAM of your cluster's workers. This might make your job run, but it won't fix the root of the problem; it only postpones it. If skew is detected, the first thing to solve is the uneven distribution of records across partitions. For this purpose we can:

• If on Spark 3, enable AQE (Adaptive Query Execution).

• If on Databricks, use the specific skew hint (Skew Join Optimization).

• Otherwise, apply Key Salting: salt the skewed column with a random number, creating a better distribution across partitions at the cost of extra processing.

Keep in mind that applying these solutions can even increase the job duration; but, more important than duration, mitigating skew removes potential OOM errors.
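For reference, a minimal sketch of the first two options. The configuration keys are standard Spark 3 settings; the commented hint is Databricks-specific, and join_key is a placeholder column name:

```python
# Spark 3.x: let AQE detect and split skewed partitions at shuffle time.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Databricks runtime only (Skew Join Optimization hint; assumed hint syntax):
# skewed = df.hint("skew", "join_key")
```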

How to implement Key Salting for skewed joins?

The general idea of Key Salting is to reduce the size of the skewed partitions by increasing the number of distinct join keys. To do so, we create a new column whose values are based on the join key plus a random number within a range. This needs to be applied to all the dataframes that are joined by the affected key. After this, we can perform the join using the new key. As a result, we end up with more, smaller partitions and tasks.

The random number range must be chosen after experimentation. If we choose a number that is too large, we can end up with too many small partitions; if it's too small, we can keep having the skew problem.

Implementing Key Salting can require a lot of effort, so consider investing the energy to salt only the skewed keys.
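A minimal PySpark sketch of salting a skewed join, assuming placeholder dataframes large_df and small_df joined on a join_key column, and an illustrative salt range of 16:

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # range chosen after experimentation

# Salt the skewed (large) side: append a random suffix to the join key.
salt_col = (F.rand() * SALT_BUCKETS).cast("int").cast("string")
large_salted = large_df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("join_key").cast("string"), salt_col),
)

# Replicate the other side once per possible salt value so every salted key can match.
small_salted = (
    small_df
    .withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))
    .withColumn(
        "salted_key",
        F.concat_ws("_", F.col("join_key").cast("string"), F.col("salt").cast("string")),
    )
)

# Join on the salted key: more, smaller partitions and tasks.
joined = large_salted.join(small_salted, on="salted_key", how="inner")
```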

Spill

In Spark, spill is defined as the act of moving data from memory to disk and vice-versa during a job. It is a defensive action Spark takes to free up worker memory and avoid OOM errors when a partition is too large to fit into memory. This way the Spark job can come to an end, at the high cost in compute time of the read-write overhead of spilling data.

There are several ways we can get into this problem:

• Setting spark.sql.files.maxPartitionBytes too high (the default is 128MB), thus ingesting large partitions that might not fit in memory.

• The explode() operation, even of a small array. Each partition will end up with as many rows as items in the array, so the resulting partition size might not fit in memory.

• The join() or crossJoin() of two tables.

• Aggregating results by a skewed key. As we saw before, unbalanced datasets lead to some partitions being bigger than others, and in some cases these bigger partitions might not fit in memory.

Detecting Spill

In the Spark UI this is represented by:

• Spill (Memory): size in memory of the spilled partitions.

• Spill (Disk): size on disk of the spilled partitions (always smaller than the memory size because of compression).

These values only appear in the details page for a single stage (summary metrics, aggregated metrics by executor, task table) or in the SQL query details. This makes spill hard to recognize, because you have to search for it manually. Be aware that if there is no spill, you won't find these values at all.

An alternative to manually searching for spill is implementing a SpillListener to automatically track when a stage spills. Unfortunately, this is only available in Scala.

What can we do to mitigate spill?

Once we find that our stages are spilling, we have a few options to mitigate it:

• Check if the spill is caused by data skew; in that case, mitigate that problem first.

• If possible, increase the memory of the cluster's workers. This way, larger partitions will fit in memory and Spark won't need to write as much to disk.

• Decrease the size of each partition by increasing the number of partitions. We can do this by tuning spark.sql.shuffle.partitions and spark.sql.files.maxPartitionBytes, or by explicitly repartitioning, as sketched below.

Mitigating spill is not always worth it, so first check whether it makes sense.
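A minimal sketch of the partition-tuning options above; the concrete values and names are illustrative assumptions, not recommendations:

```python
# Ingest smaller input partitions at read time (default 128MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

# Create more (and therefore smaller) shuffle partitions (default 200).
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Or explicitly repartition a DataFrame that is known to spill.
df = df.repartition(400, "join_key")  # "df" and "join_key" are placeholders
```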

Shuffle

Shuffle is a natural operation of Spark. It’s just a side effect of wide


transformations like joining, grouping, or sorting. In these cases, the
data needs to be shuffled in order to group records with the same
keys together under the same partitions for later on being able to
execute the aggregations by those keys. When a wide transformation
happens, partitions are written to disk, so the executors of the next
stage can read the data and continue the job. Because the data needs
to be moved across workers, this behaviour can result in lots of
network IO.

As I have mentioned before, shuffling it’s inevitable. Just put the


focus only on the most expensive operations. At same time, be aware
that targeting other problems like skew, spill, or tiny files problems
is often more effective.

What can we do to mitigate the impact of shuffles?


The biggest pain point of shuffles is the amount of data that needs to be moved across the cluster. To reduce this problem, we can apply the following strategies:

• Use fewer, larger workers to reduce network traffic. Having the same number of executors (CPUs) on fewer workers reduces the amount of data that needs to be transferred over the network to other workers.

• Reduce the amount of data being shuffled. Sometimes these wide transformations are performed over data that is not needed for the final result, increasing the cost unnecessarily. Therefore, filter out the columns and rows that are not required before executing a wide transformation.

• Denormalize datasets. If a query that causes expensive shuffling is executed very often by data users, its output can be persisted in the data lake and queried directly.

• Broadcast the smaller table. When one of the tables involved in a join is much smaller than the others, we can broadcast it to all the executors, so it is fully present on each of them and Spark can join it to the other table's partitions without shuffling it. This is called a BroadcastHashJoin and can be applied by using broadcast(df), as in the sketch after this list. The default table size threshold is 10MB; we can increase it by tuning spark.sql.autoBroadcastJoinThreshold, but not too much, since it puts the driver under pressure and can result in OOM errors. Moreover, this approach increases the IO between the driver and the executors, doesn't work well when there are many empty partitions, and requires enough memory on the driver and the executors.

• Bucketed datasets. For joins, data can be pre-shuffled and stored in buckets, optionally sorted per bucket. This is worth it when working at terabyte scale, when the tables are joined quite often, and when no filters are applied. It requires all tables involved to be bucketed by the join key with the same number of buckets (normally one per core). However, the cost of producing and maintaining this approach is very high and must be justifiable.
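A minimal PySpark sketch of the broadcast and bucketing strategies above, with large_df, small_df, key, the threshold value, and the table name as placeholder assumptions:

```python
from pyspark.sql.functions import broadcast

# Force a BroadcastHashJoin: ship the small table to every executor
# so the large table is joined without being shuffled.
joined = large_df.join(broadcast(small_df), on="key", how="left")

# Alternatively, raise the automatic broadcast threshold (default 10MB), with care.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# Bucketed table: pre-shuffle by the join key into a fixed number of buckets.
# Both sides of the join must use the same key and number of buckets.
(large_df.write
    .bucketBy(64, "key")
    .sortBy("key")
    .mode("overwrite")
    .saveAsTable("large_bucketed"))
```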

Storage

The way initial data is ingested in the data lake can lead to problems
which are normally related to: tiny files, directory scanning, or
schemas.

Tiny Files

We consider tiny files those that are considerably smaller than the default block size of the underlying file system (128MB).

Having a dataset partitioned into too many tiny files implies a longer total time for opening and closing files, which leads to very bad performance. Tiny files usually come from the way data is ingested, or are produced as the result of a Spark job.

We can use the Spark UI to see how many files are read, under the SQL tab, by checking the read operation.

This problem can be mitigated by:

• Compacting the existing small files into larger files, equivalent to the block size or to the efficient partition size used.

• Making the ingesting tools write bigger files.

When tiny files are produced as the result of a Spark job, it means Spark is partitioning the data far more than required for its size, and this is reflected when writing. We can mitigate this by:

• Changing the default partition number, by tuning spark.sql.shuffle.partitions (200 by default).

• Explicitly repartitioning the data before writing: applying the repartition() or coalesce() functions to decrease the number of partitions, or, on Spark 3.0+ with AQE enabled, setting spark.sql.adaptive.coalescePartitions.enabled to true. A sketch follows below.
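A minimal sketch of reducing the number of output files before writing; the partition count and output path are illustrative assumptions:

```python
# Collapse the output into fewer, larger files before writing.
(df.coalesce(32)                      # or df.repartition(32) for an even rebalance
   .write
   .mode("overwrite")
   .parquet("s3://bucket/dataset/"))  # placeholder output path

# Spark 3.0+: let AQE coalesce small shuffle partitions automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```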

Directory Scanning

Having too many directories for a dataset (because of data partitioning) leads to performance issues at scanning time. Over-partitioned datasets with little data also end up causing the tiny files problem.

We can detect this by paying attention to the scanning time under the SQL tab's read operation.

We can mitigate this problem by:

• Partitioning the stored data in a smarter way.

• Registering datasets as tables. When doing so, metadata such as where to find the files belonging to that dataset is stored in the Hive Metastore, so it's no longer necessary to scan the directories. However, the first time we register the table, it will take some time to retrieve the metadata by scanning the directories.
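A minimal sketch of registering a dataset as a table, with the table name and partition column as placeholder assumptions:

```python
# Write the dataset as a table so file locations are kept in the metastore
# and subsequent reads no longer need to scan directories.
(df.write
   .partitionBy("event_date")          # placeholder partition column
   .mode("overwrite")
   .saveAsTable("analytics.events"))   # placeholder database.table name

# Later reads resolve the files through the metastore.
events = spark.table("analytics.events")
```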

Schemas

Inferring schemas requires a full read of a file to determine the data type of each column, which involves time for opening and scanning the files. Reading parquet files requires only a one-time read of the schema, because the schema is included in the files themselves. On the other hand, supporting schema evolution is potentially expensive: if you have hundreds or thousands of part-files, each schema has to be read in and then merged collectively, which can be really expensive. Schema merging can be enabled via spark.sql.parquet.mergeSchema.

We can mitigate the schema problem by:

• Providing the schema every time, as sketched below.

• Registering datasets as tables. This way the schema will also be stored in the Hive Metastore.

• Using the Delta format, which merges schemas automatically to support schema evolution.
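A minimal sketch of providing the schema explicitly at read time; the columns and path are placeholder assumptions:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType

# Declaring the schema up front skips the full scan that schema inference would need.
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("event", StringType(), True),
    StructField("ts", TimestampType(), True),
])

df = spark.read.schema(schema).json("s3://bucket/raw/events/")  # placeholder path
```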

Serialization

Serialization issues happen when we need to apply a non-native API transformation, known as a UDF. This implies serializing the data into a JVM object so it can be modified outside Spark. The impact is much worse in Python, since JVM objects are not native to Python: the data needs to be converted first, the code executed on top of it, and the result serialized back into a JVM object, causing a big overhead. This extra step is not needed in Scala, since it is a JVM-native language.

In any case, UDFs are a barrier for the Catalyst Optimizer, since it cannot connect the code before and after applying the UDF: it is impossible for it to know what the UDF is doing and how to optimize the overall job execution.

We can mitigate serialization issues by:

• Avoiding UDFs, Pandas UDFs, or Typed Transformations whenever possible, and using native Spark higher-order functions instead.

• If there is no other option:

– Python: use a Pandas UDF over a "standard" Python UDF. A Pandas UDF uses PyArrow to serialize batches of records (treated as a Pandas Series or DataFrame) and then applies the UDF to each batch in Python. A regular Python UDF, on the other hand, serializes every single record individually and executes the UDF on it.

– Scala: use Typed Transformations over "standard" Scala code.

If your data transformations require the application of many UDFs, consider Scala as the programming language. A comparison sketch in PySpark follows below.
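A minimal PySpark sketch contrasting a native expression with a Pandas UDF; the dataframe, column names, and the conversion factor are placeholder assumptions:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

# Preferred: a native expression stays inside the Catalyst Optimizer.
df_native = df.withColumn("amount_usd", F.col("amount") * F.lit(1.1))

# If a UDF is unavoidable in Python, a Pandas UDF serializes whole batches
# through PyArrow instead of one record at a time.
@pandas_udf("double")
def to_usd(amount: pd.Series) -> pd.Series:
    return amount * 1.1

df_pandas = df.withColumn("amount_usd", to_usd(F.col("amount")))
```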

In Apache Spark, if the data does not fit into memory, Spark can simply persist that data to disk.

The persist method in Apache Spark provides 6 storage levels to persist the data, which are as follows:

1. MEMORY_ONLY
2. MEMORY_AND_DISK
3. MEMORY_ONLY_SER
4. MEMORY_AND_DISK_SER
5. DISK_ONLY
6. OFF_HEAP

Which Storage Level to Choose?

• If the RDD fits comfortably in memory, use MEMORY_ONLY.

• If not, try MEMORY_ONLY_SER.

• Don't persist to disk unless the computation is really expensive.
