🔥 PySpark Cheat Sheet: Most Common & Important Functions
📊 1. Data Loading & Inspection
spark.read.csv("file.csv", header=True, inferSchema=True)
Function | Use
df.show() | Display rows
df.printSchema() | Show schema
df.describe() | Basic stats
df.columns | List column names
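The snippets in this sheet assume an existing SparkSession named spark; a minimal setup sketch (local mode, arbitrary app name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cheatsheet").getOrCreate()
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)            # first 5 rows
df.describe().show()  # describe() returns a DataFrame, so .show() it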
🧹 2. Filtering & Conditional Logic
df.filter(df.age > 30)
df.where((df.age > 25) & (df.gender == 'M'))
df.filter(df.name.isNotNull())
Function | Use
filter() / where() | Row filtering
isin() | Membership check, e.g. df.city.isin("NY", "LA")
when().otherwise() | If-else logic (sketch below)
isNull() / isNotNull() | Null checks
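A minimal when()/otherwise() sketch, assuming the DataFrame has an age column:

from pyspark.sql.functions import when, col

# Label each row; rows where age is null fall through to otherwise().
df = df.withColumn("age_group", when(col("age") >= 18, "adult").otherwise("minor"))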
📌 3. Column Operations
df.withColumn("age_plus1", df.age + 1)
df.drop("old_col")
df.selectExpr("name as customer_name")
Function | Use
select() | Pick columns
withColumn() | Add or modify a column
drop() | Drop a column
alias() | Rename a column inside select()
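A combined sketch of the column operations above (column names are illustrative):

from pyspark.sql.functions import col

df2 = (df.select(col("name").alias("customer_name"), "age")
         .withColumn("age_plus1", col("age") + 1)
         .drop("age"))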
📈 4. Aggregations
df.groupBy("city").agg(avg("income"), sum("sales"))
df.groupBy("product").count()
Function | Use
groupBy().agg() | Aggregation
count(), sum() | Common aggregates
orderBy("age", ascending=False) | Sort rows
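A fuller aggregation sketch; sum from pyspark.sql.functions shadows Python's builtin, so importing it under another name is common (income and sales columns are assumed):

from pyspark.sql.functions import avg, sum as sum_

(df.groupBy("city")
   .agg(avg("income").alias("avg_income"), sum_("sales").alias("total_sales"))
   .orderBy("total_sales", ascending=False)
   .show())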
🔗 5. Joins
df1.join(df2, "id", "inner")
df1.join(df2, ["id", "dept"], "left")
df1.join(df2, "id", "left_anti")
Type | Purpose
inner | Matching rows only
left, right, outer | All rows from one or both sides, nulls where unmatched
left_anti | Rows in left with no match in right
left_semi | Rows in left with a match in right (left's columns only)
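A hedged sketch of the anti/semi variants, with hypothetical customers and orders DataFrames sharing an id column:

customers.join(orders, "id", "left_anti").show()  # customers with no orders
customers.join(orders, "id", "left_semi").show()  # customers with at least one order; only customers' columns kept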
🧽 6. Null Handling
df.dropna()
df.fillna({"age": 0})
df.na.replace("NA", None)
Function | Use
dropna() | Drop rows with nulls
fillna() | Replace nulls with defaults
replace() | Replace values (via df.na)
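dropna() also takes how= and subset=; a sketch assuming age and city columns:

df.dropna(how="all")                      # drop rows where every column is null
df.dropna(subset=["age"])                 # consider only the age column
df.fillna({"age": 0, "city": "unknown"})  # per-column defaults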
📅 7. Date & String Functions
from pyspark.sql.functions import *
df.select(current_date(), year("dob"))
df.select(trim(col("name")), upper("city"))
Function | Use
current_date() | Today's date
year(), month() | Extract date parts
lower(), upper() | Case change
trim(), substring() | String cleaning
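A date-parsing sketch, assuming dob is stored as a yyyy-MM-dd string:

from pyspark.sql.functions import to_date, datediff, current_date, col

df = df.withColumn("dob", to_date(col("dob"), "yyyy-MM-dd"))
df = df.withColumn("age_days", datediff(current_date(), col("dob")))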
📦 8. Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
w = Window.partitionBy("dept").orderBy("salary")
df.withColumn("rnk", row_number().over(w))
Function | Use
row_number(), rank() | Ranking rows
lag(), lead() | Previous / next value (sketch below)
Window.partitionBy().orderBy() | Window spec
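Extending the window above with lag(), lead(), and rank() (dept and salary columns assumed):

from pyspark.sql.window import Window
from pyspark.sql.functions import lag, lead, rank

w = Window.partitionBy("dept").orderBy("salary")
df = (df.withColumn("prev_salary", lag("salary", 1).over(w))
        .withColumn("next_salary", lead("salary", 1).over(w))
        .withColumn("salary_rank", rank().over(w)))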
💾 9. Write / Save Data
df.write.mode("overwrite").parquet("/tmp/data")
df.write.csv("output.csv", header=True)
Function | Use
parquet(), csv(), json() | Output formats
mode("overwrite") | Overwrite existing output
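The writer also supports partitioned output; a sketch with a hypothetical year column:

df.write.mode("overwrite").partitionBy("year").parquet("/tmp/data_by_year")
df2 = spark.read.parquet("/tmp/data_by_year")  # reads all partitions back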
🔁 10. Others
df.distinct()
df.dropDuplicates(["id"])
df.limit(10)
df.cache()
df.collect()
Function | Use
distinct() | Unique rows
dropDuplicates() | Dedupe on a column subset
limit() | Row limit
cache() | Store in memory (lazy; sketch below)
collect() | Bring to driver (⚠️ small data only)
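cache() is lazy; an action must run before the data is actually held in memory. A small sketch:

df.cache()
df.count()                     # action materializes the cache
rows = df.limit(10).collect()  # a list of Row objects; keep it small
df.unpersist()                 # release the cached data when done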