🔥 PySpark Cheat Sheet: Most Common & Important Functions
📊 1. Data Loading & Inspection
spark.read.csv("file.csv", header=True, inferSchema=True)
Function | Use
df.show() | Display rows
df.printSchema() | Show schema
df.describe() | Basic stats
df.columns | List column names
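The snippets in this sheet assume an existing SparkSession named spark; a minimal setup sketch (local mode, arbitrary app name):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cheatsheet").getOrCreate()
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)            # first 5 rows
df.describe().show()  # describe() returns a DataFrame, so .show() it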
🧹 2. Filtering & Conditional Logic
df.filter(df.age > 30)
df.where((df.age > 25) & (df.gender == 'M'))
df.filter(df.name.isNotNull())
Function | Use
filter() / where() | Row filtering
isin() | Membership check, e.g. df.city.isin("NY", "LA")
when().otherwise() | If-else logic (sketch below)
isNull() / isNotNull() | Null checks
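A minimal when()/otherwise() sketch, assuming the DataFrame has an age column:

from pyspark.sql.functions import when, col

# Label each row; rows where age is null fall through to otherwise().
df = df.withColumn("age_group", when(col("age") >= 18, "adult").otherwise("minor"))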
📌 3. Column Operations
df.withColumn("age_plus1", df.age + 1)
df.drop("old_col")
df.selectExpr("name as customer_name")
Function | Use
select() | Pick columns
withColumn() | Add or modify a column
drop() | Drop a column
alias() | Rename a column inside select()
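A combined sketch of the column operations above (column names are illustrative):

from pyspark.sql.functions import col

df2 = (df.select(col("name").alias("customer_name"), "age")
         .withColumn("age_plus1", col("age") + 1)
         .drop("age"))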
📈 4. Aggregations
df.groupBy("city").agg(avg("income"), sum("sales"))
df.groupBy("product").count()
Function | Use
groupBy().agg() | Aggregation
count(), sum() | Common aggregates
orderBy("age", ascending=False) | Sort rows
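A fuller aggregation sketch; sum from pyspark.sql.functions shadows Python's builtin, so importing it under another name is common (income and sales columns are assumed):

from pyspark.sql.functions import avg, sum as sum_

(df.groupBy("city")
   .agg(avg("income").alias("avg_income"), sum_("sales").alias("total_sales"))
   .orderBy("total_sales", ascending=False)
   .show())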
🔗 5. Joins
df1.join(df2, "id", "inner")
df1.join(df2, ["id", "dept"], "left")
df1.join(df2, "id", "left_anti")
Type | Purpose
inner | Matching rows only
left, right, outer | All rows from one or both sides, nulls where unmatched
left_anti | Rows in left with no match in right
left_semi | Rows in left with a match in right (left's columns only)
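A hedged sketch of the anti/semi variants, with hypothetical customers and orders DataFrames sharing an id column:

customers.join(orders, "id", "left_anti").show()  # customers with no orders
customers.join(orders, "id", "left_semi").show()  # customers with at least one order; only customers' columns kept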
🧽 6. Null Handling
df.dropna()
df.fillna({"age": 0})
df.na.replace("NA", None)
Function | Use
dropna() | Drop rows with nulls
fillna() | Replace nulls with defaults
replace() | Replace values (via df.na)
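dropna() also takes how= and subset=; a sketch assuming age and city columns:

df.dropna(how="all")                      # drop rows where every column is null
df.dropna(subset=["age"])                 # consider only the age column
df.fillna({"age": 0, "city": "unknown"})  # per-column defaults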
📅 7. Date & String Functions
from pyspark.sql.functions import *
df.select(current_date(), year("dob"))
df.select(trim(col("name")), upper("city"))
Function | Use
current_date() | Today's date
year(), month() | Extract date parts
lower(), upper() | Case change
trim(), substring() | String cleaning
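A date-parsing sketch, assuming dob is stored as a yyyy-MM-dd string:

from pyspark.sql.functions import to_date, datediff, current_date, col

df = df.withColumn("dob", to_date(col("dob"), "yyyy-MM-dd"))
df = df.withColumn("age_days", datediff(current_date(), col("dob")))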
📦 8. Window Functions
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
w = Window.partitionBy("dept").orderBy("salary")
df.withColumn("rnk", row_number().over(w))
Function | Use
row_number(), rank() | Ranking rows
lag(), lead() | Previous / next value (sketch below)
Window.partitionBy().orderBy() | Window spec
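Extending the window above with lag(), lead(), and rank() (dept and salary columns assumed):

from pyspark.sql.window import Window
from pyspark.sql.functions import lag, lead, rank

w = Window.partitionBy("dept").orderBy("salary")
df = (df.withColumn("prev_salary", lag("salary", 1).over(w))
        .withColumn("next_salary", lead("salary", 1).over(w))
        .withColumn("salary_rank", rank().over(w)))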
💾 9. Write / Save Data
df.write.mode("overwrite").parquet("/tmp/data")
df.write.csv("output.csv", header=True)
Function | Use
parquet(), csv(), json() | Output formats
mode("overwrite") | Overwrite existing output
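The writer also supports partitioned output; a sketch with a hypothetical year column:

df.write.mode("overwrite").partitionBy("year").parquet("/tmp/data_by_year")
df2 = spark.read.parquet("/tmp/data_by_year")  # reads all partitions back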
🔁 10. Others
df.distinct()
df.dropDuplicates(["id"])
df.limit(10)
df.cache()
df.collect()
Function | Use
distinct() | Unique rows
dropDuplicates() | Dedupe on a column subset
limit() | Row limit
cache() | Store in memory (lazy; sketch below)
collect() | Bring to driver (⚠️ small data only)
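cache() is lazy; an action must run before the data is actually held in memory. A small sketch:

df.cache()
df.count()                     # action materializes the cache
rows = df.limit(10).collect()  # a list of Row objects; keep it small
df.unpersist()                 # release the cached data when done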