PYSPARK CHEAT SHEET FOR DATA ENGINEERS
Master these essential PySpark commands to build efficient and scalable data pipelines!
Abhishek Agrawal
Data Engineer
SparkSession (Starting Point)
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()
DataFrame Operations
Command           Description                        Example
df.show()         Display the DataFrame              df.show(5)
df.printSchema()  Print the schema of the DataFrame  df.printSchema()
df.select()       Select specific columns            df.select("name", "age").show()
df.filter()       Filter rows based on conditions    df.filter(df.age > 18).show()
df.withColumn()   Add or modify a column             df.withColumn("discount", df.price * 0.1)
df.drop()         Drop a column                      df.drop("column_name")
df.distinct()     Get distinct rows                  df.distinct().show()
df.sort()         Sort DataFrame by columns          df.sort(df["price"].desc()).show()
df.groupBy()      Group rows and apply aggregations  df.groupBy("region").sum("sales").show()
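A minimal end-to-end sketch chaining several of these together, using the spark session created above; the sample rows and column names (name, age, price, region) are invented for illustration:

# Hypothetical sample data (column names are illustrative)
df = spark.createDataFrame(
    [("Alice", 34, 120.0, "EU"), ("Bob", 17, 80.0, "US"), ("Cara", 29, 95.0, "US")],
    ["name", "age", "price", "region"],
)

# Filter, derive a column, then aggregate per region
df.filter(df.age > 18) \
    .withColumn("discount", df.price * 0.1) \
    .groupBy("region") \
    .sum("price") \
    .show()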
DataFrame Joins
Join Type  Description      Example
inner      Inner join       df1.join(df2, "id", "inner")
left       Left join        df1.join(df2, "id", "left")
right      Right join       df1.join(df2, "id", "right")
full       Full outer join  df1.join(df2, "id", "full")
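A quick sketch of how the join types differ; df1, df2, and their columns are made up for illustration. Passing the column name "id" (rather than a join expression) keeps a single id column in the result:

df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df2 = spark.createDataFrame([(1, 100.0), (3, 75.0)], ["id", "amount"])

df1.join(df2, "id", "inner").show()  # id 1 only
df1.join(df2, "id", "left").show()   # ids 1 and 2 (amount is null for 2)
df1.join(df2, "id", "full").show()   # ids 1, 2, and 3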
Data Transformation Commands
Command              Description                            Example
df.union()           Combine two DataFrames                 df1.union(df2).show()
df.repartition()     Repartition the DataFrame              df.repartition(4)
df.cache()           Cache the DataFrame in memory          df.cache()
df.persist()         Persist the DataFrame to memory/disk   df.persist()
df.dropDuplicates()  Drop duplicate rows                    df.dropDuplicates(["id"]).show()
SQL Queries in PySpark
Command                    Description                            Example
createOrReplaceTempView()  Create a temporary SQL view            df.createOrReplaceTempView("table_name")
spark.sql()                Run SQL queries on the temporary view  spark.sql("SELECT * FROM table_name").show()
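A minimal sketch; the view name "people" is an assumption, and the query reuses the hypothetical df defined in the DataFrame examples above:

# Register the DataFrame as a temp view, then query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 18").show()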
PySpark Window Functions
Function        Description                               Example
row_number()    Assigns a unique number to each row       df.withColumn("row_num", row_number().over(windowSpec))
rank()          Ranks rows based on specified criteria    df.withColumn("rank", rank().over(windowSpec))
lead() / lag()  Access next/previous row in a partition   df.withColumn("next_val", lead("sales").over(windowSpec))
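The examples above assume a windowSpec is already defined; here is one way to build it (the partition and ordering columns come from the hypothetical df used earlier):

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, lead

# One row group per region, ordered by price descending
windowSpec = Window.partitionBy("region").orderBy(df["price"].desc())

df.withColumn("row_num", row_number().over(windowSpec)) \
    .withColumn("rank", rank().over(windowSpec)) \
    .withColumn("next_val", lead("price").over(windowSpec)) \
    .show()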
RDD Operations (for Advanced Use Cases)
Command        Description                           Example
rdd.map()      Apply a function to each element      rdd.map(lambda x: x * 2).collect()
rdd.filter()   Filter elements based on a condition  rdd.filter(lambda x: x > 10).collect()
rdd.reduce()   Apply a function to reduce elements   rdd.reduce(lambda x, y: x + y)
rdd.collect()  Return all elements as a list         rdd.collect()
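RDDs hang off the SparkContext behind the SparkSession; a small sketch with made-up numbers:

# parallelize() turns a local Python list into an RDD
rdd = spark.sparkContext.parallelize([5, 12, 8, 20])

print(rdd.map(lambda x: x * 2).collect())      # [10, 24, 16, 40]
print(rdd.filter(lambda x: x > 10).collect())  # [12, 20]
print(rdd.reduce(lambda x, y: x + y))          # 45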
Quick Tips for Optimizing PySpark Jobs
✅ Use repartition() to balance partition sizes on large datasets.
✅ Use broadcast() for small lookup tables.
✅ Cache DataFrames you reuse frequently.
✅ Avoid wide transformations like groupBy() unless necessary.
✅ Use Delta Lake for ACID transactions and time travel.
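For instance, the broadcast() hint from the tips looks like this in practice (a sketch; pretend df2 is a small lookup table joined to the larger df1 from the join example):

from pyspark.sql.functions import broadcast

# Broadcasting the small side ships it to every executor,
# so the join avoids shuffling the larger side
df1.join(broadcast(df2), "id").show()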
Follow for more content like this
Abhishek Agrawal
Azure Data Engineer