Caching
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
What is caching?
Caching in Spark:
Stores DataFrames in memory or on disk
Improves speed on later transformations / actions
Reduces resource usage
Disadvantages of caching
Very large data sets may not fit in memory
Local disk-based caching may not improve performance
Cached objects may not be available
Caching tips
When developing Spark tasks:
Cache only if you need it
Try caching DataFrames at various points and determine if your performance improves (see the timing sketch below)
Cache to memory or fast SSD / NVMe storage
Cache to slow local disk only if needed
Use intermediate files!
Stop caching objects when finished
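A rough way to check whether a cache actually helps is to time the same action before and after caching. This is only a sketch; it assumes voter_df is an already-loaded DataFrame like the one on the next slide.

import time

# Time an action on the uncached DataFrame
start = time.time()
voter_df.count()
print(f'Uncached count: {time.time() - start:.3f} s')

# Cache, run the action once to materialize the cache, then time it again
voter_df.cache()
voter_df.count()

start = time.time()
voter_df.count()
print(f'Cached count: {time.time() - start:.3f} s')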
Implementing caching
Call .cache() on the DataFrame before the action
from pyspark.sql.functions import monotonically_increasing_id

# Cache the DataFrame, then run an action to materialize the cache
voter_df = spark.read.csv('voter_data.txt.gz')
voter_df.cache().count()
# Later transformations and actions reuse the cached data
voter_df = voter_df.withColumn('ID', monotonically_increasing_id())
voter_df = voter_df.cache()
voter_df.show()
More cache operations
Check .is_cached to determine cache status
print(voter_df.is_cached)
True
Call .unpersist() when finished with the DataFrame
voter_df.unpersist()
Let's practice!
Cleaning Data with PySpark
Improve import performance
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Spark clusters
Spark clusters are made up of two types of processes:
Driver process
Worker processes
Import performance
Important parameters:
Number of objects (files, network locations, etc.)
More small objects are better than fewer large ones
Can import via wildcard:
airport_df = spark.read.csv('airports-*.txt.gz')
General size of objects
Spark performs better if objects are of similar size
Schemas
A well-defined schema will drastically improve import performance (see the sketch below)
Avoids reading the data multiple times
Provides validation on import
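A minimal sketch of supplying a schema on import; the column names and file name are hypothetical, and spark is an existing SparkSession as in the other examples.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema once so Spark does not need to scan the data to infer types
people_schema = StructType([
    StructField('name', StringType(), nullable=False),
    StructField('age', IntegerType(), nullable=False),
    StructField('city', StringType(), nullable=False)
])

# Pass the schema on import instead of relying on inference
people_df = spark.read.csv('rawdata.csv', schema=people_schema)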
How to split objects
Use OS utilities / scripts (split, cut, awk)
split -l 10000 -d largefile chunk-
Use custom scripts
Write out to Parquet
# Read the single large CSV once, write it out as Parquet, then work from the Parquet copy
df_csv = spark.read.csv('singlelargefile.csv')
df_csv.write.parquet('data.parquet')
df = spark.read.parquet('data.parquet')
Let's practice!
Cleaning Data with PySpark
Cluster sizing tips
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Configuration options
Spark contains many configuration settings
These can be modified to match needs
Reading configuration settings:
spark.conf.get(<configuration name>)
Writing configuration settings (example below):
spark.conf.set(<configuration name>, <value>)
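For example, a minimal sketch of reading and then adjusting the number of shuffle partitions; the value 500 is purely illustrative.

# Read the current number of partitions used when shuffling data
print(spark.conf.get('spark.sql.shuffle.partitions'))

# Write a new value to match the size of the workload
spark.conf.set('spark.sql.shuffle.partitions', 500)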
Cluster Types
Spark deployment options (chosen via the master URL, as sketched below):
Single node
Standalone
Managed:
YARN
Mesos
Kubernetes
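A minimal sketch of how the deployment type is selected when creating a session; the application name, host name, and port in the commented lines are placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('cleaning_data')
         .master('local[*]')   # single node, using all local cores
         # .master('spark://spark-master:7077')   # standalone cluster
         # .master('yarn')                        # managed via YARN
         .getOrCreate())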
Driver
Task assignment
Result consolidation
Shared data access
Tips:
Driver node should have double the memory of the workers (see the check below)
Fast local storage is helpful
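A quick, hedged way to compare the driver and executor memory settings from a running session; the fallback strings are only placeholders for settings that were never set explicitly.

# Compare the driver's memory allocation against the executors'
conf = spark.sparkContext.getConf()
print(conf.get('spark.driver.memory', 'not explicitly set'))
print(conf.get('spark.executor.memory', 'not explicitly set'))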
Worker
Runs actual tasks
Ideally has all code, data, and resources for a given task
Recommendations:
More worker nodes are often better than fewer, larger workers (see the sketch below)
Test to find the balance
Fast local storage is extremely useful
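A minimal sketch of requesting worker resources when building a session; the values are illustrative, and spark.executor.instances applies to managed clusters such as YARN or Kubernetes.

from pyspark.sql import SparkSession

# Prefer several moderate workers over a few very large ones (values are illustrative)
spark = (SparkSession.builder
         .config('spark.executor.instances', '8')   # number of workers
         .config('spark.executor.cores', '2')       # cores per worker
         .config('spark.executor.memory', '4g')     # memory per worker
         .getOrCreate())

# Check how much parallelism the cluster actually provides
print(spark.sparkContext.defaultParallelism)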
Let's practice!
Cleaning Data with PySpark
Performance improvements
Cleaning Data with PySpark
Mike Metzger
Data Engineering Consultant
Explaining the Spark execution plan
voter_df = df.select(df['VOTER NAME']).distinct()
voter_df.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[VOTER NAME#15], functions=[])
+- Exchange hashpartitioning(VOTER NAME#15, 200)
   +- *(1) HashAggregate(keys=[VOTER NAME#15], functions=[])
      +- *(1) FileScan csv [VOTER NAME#15] Batched: false, Format: CSV,
            Location: InMemoryFileIndex[file:/DallasCouncilVotes.csv.gz],
            PartitionFilters: [], PushedFilters: [],
            ReadSchema: struct<VOTER NAME:string>
What is shuffling?
Shuffling refers to moving data between workers to complete a task
Hides complexity from the user
Can be slow to complete
Lowers overall throughput
Is often necessary, but try to minimize it
How to limit shuffling?
Limit use of .repartition(num_partitions)
Use .coalesce(num_partitions) instead (see the sketch below)
Use care when calling .join()
Use .broadcast() where appropriate
You may not need to limit shuffling at all
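As an illustration, a minimal sketch contrasting the two partitioning calls; the partition counts are arbitrary and voter_df is the DataFrame from the earlier examples.

# .repartition() triggers a full shuffle to reach the requested partition count
voter_df = voter_df.repartition(100)

# .coalesce() only merges existing partitions, so it avoids a full shuffle
voter_df = voter_df.coalesce(8)

# Check how many partitions a DataFrame currently has
print(voter_df.rdd.getNumPartitions())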
Broadcasting
Broadcasting:
Provides a copy of an object to each worker
Prevents undue / excess communication between nodes
Can drastically speed up .join() operations
Use the broadcast(<DataFrame>) function
from pyspark.sql.functions import broadcast
combined_df = df_1.join(broadcast(df_2))
Let's practice!
Cleaning Data with PySpark