
PYSPARK CHEAT SHEET FOR DATA ENGINEERS

Master these essential PySpark commands to build efficient and scalable data pipelines!

Abhishek Agrawal
Data Engineer
SparkSession (Starting Point)

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()
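
The examples in the rest of this sheet assume a DataFrame named df (and, later, an RDD named rdd) already exists. As a minimal sketch, one way to create df is to read a CSV file; the file name and schema options below are hypothetical:

# Hypothetical input file; any CSV with columns like name, age, price, region, sales works
df = spark.read.csv("sales.csv", header=True, inferSchema=True)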

DataFrame Operations

Command | Description | Example
df.show() | Display the DataFrame | df.show(5)
df.printSchema() | Print the schema of the DataFrame | df.printSchema()
df.select() | Select specific columns | df.select("name", "age").show()
df.filter() | Filter rows based on conditions | df.filter(df.age > 18).show()
df.withColumn() | Add or modify a column | df.withColumn("discount", df.price * 0.1)
df.drop() | Drop a column | df.drop("column_name")
df.distinct() | Get distinct rows | df.distinct().show()
df.sort() | Sort DataFrame by columns | df.sort(df["price"].desc()).show()
df.groupBy() | Group rows and apply aggregations | df.groupBy("region").sum("sales").show()
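
As a quick sketch, these operations chain naturally; the column names below (price, region, sales) follow the table examples and are assumed to exist in df:

# Filter, derive a new column, then aggregate (column names assumed from the table examples)
(df.filter(df.price > 100)
   .withColumn("discount", df.price * 0.1)
   .groupBy("region")
   .sum("sales")
   .show())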



DataFrame Joins

Join Type | Description | Example
inner | Inner join | df1.join(df2, "id", "inner")
left | Left join | df1.join(df2, "id", "left")
right | Right join | df1.join(df2, "id", "right")
full | Full outer join | df1.join(df2, "id", "full")
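
A minimal sketch of an inner join between two small DataFrames built in-line; the data and column names are made up for illustration:

# Two hypothetical DataFrames sharing an "id" column
customers = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
orders = spark.createDataFrame([(1, 250.0), (1, 80.0)], ["id", "amount"])

# Inner join keeps only ids that appear in both DataFrames
customers.join(orders, "id", "inner").show()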

Data Transformation Commands

Command | Description | Example
df.union() | Combine two DataFrames | df1.union(df2).show()
df.repartition() | Repartition the DataFrame | df.repartition(4)
df.cache() | Cache the DataFrame in memory | df.cache()
df.persist() | Persist the DataFrame to memory/disk | df.persist()
df.dropDuplicates() | Drop duplicate rows | df.dropDuplicates(["id"]).show()

SQL Queries in PySpark

Command | Description | Example
createOrReplaceTempView() | Create a temporary SQL view | df.createOrReplaceTempView("table_name")
spark.sql() | Run SQL queries on the temporary view | spark.sql("SELECT * FROM table_name").show()



PySpark Window Functions

Function | Description | Example
row_number() | Assigns a unique number to each row | df.withColumn("row_num", row_number().over(windowSpec))
rank() | Ranks rows based on specified criteria | df.withColumn("rank", rank().over(windowSpec))
lead() / lag() | Access next/previous row in a partition | df.withColumn("next_val", lead("sales").over(windowSpec))
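
The examples above reference a window specification called windowSpec without defining it. A minimal sketch of how it could be built; the partition and ordering columns are assumed, not prescribed:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Number rows within each region, ordered by sales (hypothetical columns)
windowSpec = Window.partitionBy("region").orderBy("sales")
df.withColumn("row_num", row_number().over(windowSpec)).show()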



RDD Operations (for Advanced Use Cases)

Command | Description | Example
rdd.map() | Apply a function to each element | rdd.map(lambda x: x * 2).collect()
rdd.filter() | Filter elements based on a condition | rdd.filter(lambda x: x > 10).collect()
rdd.reduce() | Apply a function to reduce elements | rdd.reduce(lambda x, y: x + y)
rdd.collect() | Return all elements as a list | rdd.collect()
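
A quick sketch of creating an RDD from a local list and chaining these operations; the numbers are arbitrary:

# Build an RDD from a local Python list (arbitrary values)
rdd = spark.sparkContext.parallelize([5, 12, 20, 3])

# Keep values greater than 10, double them, then sum the result
total = rdd.filter(lambda x: x > 10).map(lambda x: x * 2).reduce(lambda x, y: x + y)
print(total)  # (12 + 20) * 2 = 64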



Quick Tips for Optimizing PySpark Jobs

✅ Use repartition() for large datasets.
✅ Use broadcast() for small lookup tables (see the sketch after this list).
✅ Cache DataFrames you reuse frequently.
✅ Avoid wide transformations like groupBy() unless necessary.
✅ Use Delta Lake for ACID transactions and time travel.
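
As a sketch of the broadcast tip, the broadcast() hint marks a small DataFrame so the join can avoid shuffling the large side; the DataFrame names here are hypothetical:

from pyspark.sql.functions import broadcast

# Broadcast the small lookup table so the large fact table is not shuffled
result = large_df.join(broadcast(small_lookup_df), "id", "left")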



Follow for more content like this

Abhishek Agrawal
Azure Data Engineer
