Intro to data cleaning with Apache Spark
Mike Metzger
Data Engineering Consultant
What is Data Cleaning?
Data Cleaning: Preparing raw data for use in data processing pipelines.
Possible tasks in data cleaning (sketched in code below):
Reformatting or replacing text
Performing calculations
Removing garbage or incomplete data
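A minimal sketch of these tasks in PySpark (the file and column names here are hypothetical):
import pyspark.sql.functions as F

df = spark.read.csv('rawdata.csv', header=True)

# Reformatting or replacing text: collapse stray whitespace in a column
df = df.withColumn('name', F.regexp_replace('name', r'\s+', ' '))

# Performing calculations: derive a new column
df = df.withColumn('age_months', df.age.cast('int') * 12)

# Removing garbage or incomplete data: filter out rows without a name
df = df.filter(df.name.isNotNull())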
Why perform data cleaning with Spark?
Problems with typical data systems:
Performance
Organizing data flow
Advantages of Spark:
Scalable
Powerful framework for data handling
Data cleaning example
Raw data:

name         age (years)   city
Smith, John  37            Dallas
Wilson, A.   59            Chicago
null         215

Cleaned data:

last name   first name   age (months)   state
Smith       John         444            TX
Wilson      A.           708            IL
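A sketch of the transformations involved (the file name is hypothetical, and the city-to-state lookup is omitted):
import pyspark.sql.functions as F

raw_df = spark.read.csv('rawdata.csv', header=True)

# Split "Smith, John" into last name and first name columns
name_parts = F.split(raw_df['name'], ', ')
clean_df = raw_df.withColumn('last name', name_parts.getItem(0)) \
    .withColumn('first name', name_parts.getItem(1))

# Convert age in years to age in months
clean_df = clean_df.withColumn('age (months)', clean_df['age (years)'].cast('int') * 12)

# Remove the incomplete row with no name
clean_df = clean_df.dropna(subset=['name'])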
Spark Schemas
Define the format of a DataFrame
May contain various data types:
Strings, dates, integers, arrays
Can filter garbage data during import
Improves read performance
Example Spark Schema
Import the schema types:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
peopleSchema = StructType([
# Define the name field
StructField('name', StringType(), True),
# Add the age field
StructField('age', IntegerType(), True),
# Add the city field
StructField('city', StringType(), True)
])
Read a CSV file containing data:
people_df = spark.read.format('csv').load('rawdata.csv', schema=peopleSchema)
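Tying back to "can filter garbage data during import" (a sketch using the same file and schema): the CSV reader's mode option can drop rows that fail to match the schema.
# DROPMALFORMED discards rows that do not conform to peopleSchema
people_df = spark.read.format('csv') \
    .option('mode', 'DROPMALFORMED') \
    .load('rawdata.csv', schema=peopleSchema)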
Let's practice!
Immutability and Lazy Processing
Variable review
Python variables:
Mutable
Flexible
A potential source of concurrency issues
Likely to add complexity
Immutability
Immutable variables are:
A component of functional programming
Defined once
Unable to be directly modified
Re-created if reassigned
Able to be shared efficiently
Immutability Example
Define a new DataFrame:
voter_df = spark.read.csv('voterdata.csv')
Making changes:
voter_df = voter_df.withColumn('fullyear',
voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
Lazy Processing
Isn't this slow?
Transformations (like withColumn() and drop()) are lazy: they only update the plan
Actions (like count()) trigger Spark to execute the planned work
Lazy evaluation allows efficient planning
voter_df = voter_df.withColumn('fullyear',
voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
voter_df.count()
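To see what lazy processing defers, explain() prints the plan Spark has built so far without executing it (a quick sketch using the voter_df from above):
# Shows the planned work; no data is read or processed yet
voter_df.explain()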
Let's practice!
Understanding Parquet
Difficulties with CSV files
No defined schema
Nested data requires special handling
Limited support for encoding formats
Spark and CSV files
Slow to parse
Files cannot be filtered (no "predicate pushdown")
Any intermediate use requires redefining the schema
The Parquet Format
A columnar data format
Supported in Spark and other data processing frameworks
Supports predicate pushdown
Automatically stores schema information
Working with Parquet
Reading Parquet files
df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')
Writing Parquet files
df.write.format('parquet').save('filename.parquet')
df.write.parquet('filename.parquet')
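A quick round trip (a sketch reusing people_df from earlier; 'people.parquet' is a hypothetical path) shows that the schema travels with the file:
people_df.write.parquet('people.parquet')

# The stored schema is recovered automatically on read
spark.read.parquet('people.parquet').printSchema()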
Parquet and SQL
Parquet files can serve as the backing store for Spark SQL operations:
flight_df = spark.read.parquet('flights.parquet')
flight_df.createOrReplaceTempView('flights')
short_flights_df = spark.sql('SELECT * FROM flights WHERE flightduration < 100')
Let's practice!