Intro to data cleaning with Apache Spark
Mike Metzger
Data Engineering Consultant
What is Data Cleaning?
Data Cleaning: Preparing raw data for use in data processing pipelines.
Possible tasks in data cleaning (sketched in code below):
Reformatting or replacing text
Performing calculations
Removing garbage or incomplete data
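A minimal sketch of these tasks in PySpark (the file and column names here are hypothetical):
import pyspark.sql.functions as F

df = spark.read.csv('rawdata.csv', header=True)

# Reformatting or replacing text: collapse stray whitespace in a column
df = df.withColumn('name', F.regexp_replace('name', r'\s+', ' '))

# Performing calculations: derive a new column
df = df.withColumn('age_months', df.age.cast('int') * 12)

# Removing garbage or incomplete data: filter out rows without a name
df = df.filter(df.name.isNotNull())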
Why perform data cleaning with Spark?
Problems with typical data systems:
Performance
Organizing data flow
Advantages of Spark:
Scalable
Powerful framework for data handling
Data cleaning example
Raw data:

name         age (years)   city
Smith, John  37            Dallas
Wilson, A.   59            Chicago
null         215

Cleaned data:

last name   first name   age (months)   state
Smith       John         444            TX
Wilson      A.           708            IL
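A sketch of the transformations involved (the file name is hypothetical, and the city-to-state lookup is omitted):
import pyspark.sql.functions as F

raw_df = spark.read.csv('rawdata.csv', header=True)

# Split "Smith, John" into last name and first name columns
name_parts = F.split(raw_df['name'], ', ')
clean_df = raw_df.withColumn('last name', name_parts.getItem(0)) \
    .withColumn('first name', name_parts.getItem(1))

# Convert age in years to age in months
clean_df = clean_df.withColumn('age (months)', clean_df['age (years)'].cast('int') * 12)

# Remove the incomplete row with no name
clean_df = clean_df.dropna(subset=['name'])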
Spark Schemas
Define the format of a DataFrame
May contain various data types:
Strings, dates, integers, arrays
Can filter garbage data during import
Improves read performance
Example Spark Schema
Import the schema types:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
peopleSchema = StructType([
# Define the name field
StructField('name', StringType(), True),
# Add the age field
StructField('age', IntegerType(), True),
# Add the city field
StructField('city', StringType(), True)
])
Read a CSV file containing data:
people_df = spark.read.format('csv').load('rawdata.csv', schema=peopleSchema)
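Tying back to "can filter garbage data during import" (a sketch using the same file and schema): the CSV reader's mode option can drop rows that fail to match the schema.
# DROPMALFORMED discards rows that do not conform to peopleSchema
people_df = spark.read.format('csv') \
    .option('mode', 'DROPMALFORMED') \
    .load('rawdata.csv', schema=peopleSchema)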
Let's practice!
Immutability and Lazy Processing
Variable review
Python variables:
Mutable
Flexible
A potential source of concurrency issues
Likely to add complexity
Immutability
Immutable variables are:
A component of functional programming
Defined once
Unable to be directly modified
Re-created if reassigned
Able to be shared efficiently
Immutability Example
Define a new DataFrame:
voter_df = spark.read.csv('voterdata.csv')
Making changes:
voter_df = voter_df.withColumn('fullyear',
voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
Lazy Processing
Isn't this slow?
Transformations (like withColumn() and drop()) are lazy: they only update the plan
Actions (like count()) trigger Spark to execute the planned work
Lazy evaluation allows efficient planning
voter_df = voter_df.withColumn('fullyear',
voter_df.year + 2000)
voter_df = voter_df.drop(voter_df.year)
voter_df.count()
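To see what lazy processing defers, explain() prints the plan Spark has built so far without executing it (a quick sketch using the voter_df from above):
# Shows the planned work; no data is read or processed yet
voter_df.explain()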
Let's practice!
Understanding Parquet
Difficulties with CSV files
No defined schema
Nested data requires special handling
Limited support for encoding formats
Spark and CSV files
Slow to parse
Files cannot be filtered (no "predicate pushdown")
Any intermediate use requires redefining the schema
The Parquet Format
A columnar data format
Supported in Spark and other data processing frameworks
Supports predicate pushdown
Automatically stores schema information
Working with Parquet
Reading Parquet files
df = spark.read.format('parquet').load('filename.parquet')
df = spark.read.parquet('filename.parquet')
Writing Parquet files
df.write.format('parquet').save('filename.parquet')
df.write.parquet('filename.parquet')
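A quick round trip (a sketch reusing people_df from earlier; 'people.parquet' is a hypothetical path) shows that the schema travels with the file:
people_df.write.parquet('people.parquet')

# The stored schema is recovered automatically on read
spark.read.parquet('people.parquet').printSchema()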
Parquet and SQL
Parquet files can serve as the backing store for Spark SQL operations:
flight_df = spark.read.parquet('flights.parquet')
flight_df.createOrReplaceTempView('flights')
short_flights_df = spark.sql('SELECT * FROM flights WHERE flightduration < 100')
Let's practice!