
🔥 PySpark Cheat Sheet: Most Common & Important Functions

📊 1. Data Loading & Inspection

df = spark.read.csv("file.csv", header=True, inferSchema=True)

| Function | Use |
|---|---|
| df.show() | Display rows |
| df.printSchema() | Show schema |
| df.describe() | Basic stats |
| df.columns | List column names |
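
A minimal end-to-end sketch, assuming a local Spark install (the file.csv path is the same hypothetical file as above):

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession -- the entry point for the DataFrame API
spark = SparkSession.builder.appName("cheatsheet").getOrCreate()

df = spark.read.csv("file.csv", header=True, inferSchema=True)

df.printSchema()      # column names and inferred types
df.show(5)            # first 5 rows
df.describe().show()  # count/mean/stddev/min/max for numeric columns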

🧹 2. Filtering & Conditional Logic

df.filter(df.age > 30)
df.where((df.age > 25) & (df.gender == 'M'))
df.filter(df.name.isNotNull())

| Function | Use |
|---|---|
| filter() / where() | Row filtering |
| isin() | Membership check, e.g. df.city.isin("NY", "LA") |
| when().otherwise() | If-else logic (see sketch below) |
| isNull() / isNotNull() | Null checks |
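
A small sketch of when().otherwise() (the age column and the "senior"/"adult" labels are assumptions):

from pyspark.sql.functions import when, col

# Label each row based on a condition; otherwise() supplies the default value
df = df.withColumn("age_group", when(col("age") >= 60, "senior").otherwise("adult"))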

📌 3. Column Operations

df.withColumn("age_plus1", df.age + 1)
df.drop("old_col")
df.selectExpr("name as customer_name")

| Function | Use |
|---|---|
| select() | Pick columns |
| withColumn() | Add/modify column |
| drop() | Drop column |
| alias() | Rename a column inside select() |
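
A short sketch combining these (the name and revenue columns are assumptions):

from pyspark.sql.functions import col

df2 = (
    df.withColumn("net", col("revenue") * 0.9)           # add a derived column
      .select(col("name").alias("customer_name"), "net")  # pick and rename columns
)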

📈 4. Aggregations

from pyspark.sql.functions import avg, sum

df.groupBy("city").agg(avg("income"), sum("sales"))
df.groupBy("product").count()

| Function | Use |
|---|---|
| groupBy().agg() | Aggregation |
| count(), sum() | Common aggregates |
| orderBy("age", ascending=False) | Sort rows |
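
A sketch putting aggregation and sorting together (the city, income, and sales columns are assumptions):

from pyspark.sql.functions import avg, sum, count

summary = (
    df.groupBy("city")
      .agg(
          avg("income").alias("avg_income"),   # alias() names the aggregate columns
          sum("sales").alias("total_sales"),
          count("*").alias("n_rows"),
      )
      .orderBy("total_sales", ascending=False)
)
summary.show()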

🔗 5. Joins

df1.join(df2, "id", "inner")
df1.join(df2, ["id", "dept"], "left")
df1.join(df2, "id", "left_anti")

| Type | Purpose |
|---|---|
| inner | Only matching rows |
| left, right, outer | Keep all rows from the left / right / both sides; unmatched columns are null |
| left_anti | Left rows with no match in the right |
| left_semi | Left rows that have a match (right columns are not returned) |
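
A tiny sketch contrasting left and left_anti (the example DataFrames are assumptions):

customers = spark.createDataFrame([(1, "Ana"), (2, "Bo")], ["id", "name"])
orders = spark.createDataFrame([(1, 99.0)], ["id", "amount"])

# left: every customer; order columns are null where no order exists
customers.join(orders, "id", "left").show()

# left_anti: only customers with no matching order
customers.join(orders, "id", "left_anti").show()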

🧽 6. Null Handling

df.dropna()
df.fillna({"age": 0})
df.na.replace("NA", None)

| Function | Use |
|---|---|
| dropna() | Drop nulls |
| fillna() | Replace nulls |
| replace() | Replace values |
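
A sketch chaining the three (the email and age columns are assumptions; subset= limits dropna to specific columns):

clean = (
    df.dropna(subset=["email"])     # drop rows where email is null
      .fillna({"age": 0})           # default missing ages to 0
      .na.replace("NA", None)       # turn the literal string "NA" into a real null
)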

📅 7. Date & String Functions

from pyspark.sql.functions import *

df.select(current_date(), year("dob"))
df.select(trim(col("name")), upper("city"))

| Function | Use |
|---|---|
| current_date() | Today's date |
| year(), month() | Extract date parts |
| lower(), upper() | Case change |
| trim(), substring() | String cleaning |
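
A sketch of a typical cleanup select (the dob and city columns are assumptions; to_date handles dates stored as strings):

from pyspark.sql.functions import col, to_date, year, month, trim, upper

df2 = df.select(
    year(to_date(col("dob"))).alias("birth_year"),    # extract parts of a date
    month(to_date(col("dob"))).alias("birth_month"),
    upper(trim(col("city"))).alias("city_clean"),     # normalize a string column
)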

📦 8. Window Functions

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

w = Window.partitionBy("dept").orderBy("salary")
df.withColumn("rnk", row_number().over(w))

| Function | Use |
|---|---|
| row_number(), rank() | Ranking rows |
| lag(), lead() | Previous / next value |
| Window.partitionBy().orderBy() | Window specification |
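
A sketch of lag() over the same kind of window (the dept and salary columns are assumptions):

from pyspark.sql.window import Window
from pyspark.sql.functions import lag, col

w = Window.partitionBy("dept").orderBy("salary")

# Previous salary within each department (null for the first row of a partition)
df2 = df.withColumn("prev_salary", lag("salary", 1).over(w))
df2 = df2.withColumn("raise", col("salary") - col("prev_salary"))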

💾 9. Write / Save Data

df.write.mode("overwrite").parquet("/tmp/data")
df.write.csv("output.csv", header=True)

| Option | Purpose |
|---|---|
| parquet, csv, json | Output file formats |
| mode("overwrite") | Overwrite existing output |
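
A sketch adding partitioned output (the /tmp/sales path and the year column are assumptions):

# Overwrite any existing output and split files by year for faster filtered reads
(
    df.write
      .mode("overwrite")
      .partitionBy("year")
      .parquet("/tmp/sales")
)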

🔁 10. Others

df.distinct()
df.dropDuplicates(["id"])
df.limit(10)
df.cache()
df.collect()

| Function | Use |
|---|---|
| distinct() | Unique rows |
| dropDuplicates() | Deduplicate on a subset of columns |
| limit() | Row limit |
| cache() | Store in memory |
| collect() | Bring to driver (⚠️ small data only) |
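
A short sketch of how these combine (cache() is lazy; an action such as count() is what actually materializes the cache):

deduped = df.dropDuplicates(["id"]).cache()
deduped.count()                      # action: runs the job and fills the cache

rows = deduped.limit(10).collect()   # list of Row objects on the driver
for r in rows:
    print(r["id"])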
