SQL vs PySpark
The Ultimate Colourful Cheat Sheet
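Setup
The PySpark snippets in this sheet assume a running SparkSession and a DataFrame named df; a minimal setup sketch (table and column names are illustrative):
from pyspark.sql import SparkSession

# Minimal, illustrative setup assumed by the PySpark snippets below.
spark = SparkSession.builder.appName("cheatsheet").getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 50), (2, "b", 150), (3, "a", 200)],
    ["id", "col", "val"],
)

# Register a temp view so the SQL snippets can be run via spark.sql().
df.createOrReplaceTempView("tbl")
spark.sql("SELECT * FROM tbl WHERE val > 100").show()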
Basic Data Operations
Select Columns
SQL:
SELECT col1, col2 FROM table;
PySpark:
df.select("col1", "col2")
Filter Rows
SQL:
SELECT * FROM table WHERE col > 100;
PySpark:
df.filter(df.col > 100)
Limit Rows
SQL:
SELECT * FROM table LIMIT 10;
PySpark:
df.limit(10)
Distinct Values
SQL:
SELECT DISTINCT col FROM table;
PySpark:
df.select("col").distinct()
Aggregations
Count Rows
SQL:
SELECT COUNT(*) FROM table;
PySpark:
df.count()
Group By & Aggregate
SQL:
SELECT col, COUNT(*) FROM table GROUP BY col;
PySpark:
df.groupBy("col").count()
Multiple Aggregations
SQL:
SELECT col, AVG(val), MAX(val) FROM table GROUP BY col;
PySpark:
from pyspark.sql import functions as F
df.groupBy("col").agg(F.avg("val"), F.max("val"))
Joins
Inner Join
SQL:
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id;
PySpark:
df1.join(df2, df1.id == df2.id, "inner")
Left Join
SQL:
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id;
PySpark:
df1.join(df2, df1.id == df2.id, "left")
Cross Join
SQL:
SELECT * FROM t1 CROSS JOIN t2;
PySpark:
df1.crossJoin(df2)
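When both DataFrames share the join key's name, passing the name as a string keeps a single key column in the result; a sketch with two hypothetical DataFrames:
# Hypothetical inputs; only the shared "id" column matters here.
df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "a"])
df2 = spark.createDataFrame([(1, "p"), (3, "q")], ["id", "b"])

# Joining on the column name avoids a duplicate "id" column.
df1.join(df2, "id", "inner").show()

# Joining on a column expression keeps both id columns;
# drop the right-hand one afterwards if it is not needed.
df1.join(df2, df1.id == df2.id, "left").drop(df2.id).show()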
Window Functions
Row Number
SQL:
SELECT *, ROW_NUMBER() OVER(PARTITION BY col ORDER BY date) AS rn FROM table;
PySpark:
from pyspark.sql.window import Window
windowSpec = Window.partitionBy("col").orderBy("date")
df.withColumn("rn", F.row_number().over(windowSpec))
Rank
SQL:
SELECT *, RANK() OVER(PARTITION BY col ORDER BY val DESC) AS rank FROM table;
PySpark:
df.withColumn("rank", F.rank().over(windowSpec))
Data Manipulation
Add Column
SQL:
ALTER TABLE table ADD col2 INT; or SELECT *, col1 + 1 AS col2 FROM table;
PySpark:
df.withColumn("col2", df.col1 + 1)
Rename Column
SQL:
SELECT col1 AS new_name FROM table;
PySpark:
df.withColumnRenamed("col1", "new_name")
Drop Column
SQL:
ALTER TABLE table DROP COLUMN col1; (syntax varies by dialect)
PySpark:
df.drop("col1")
Data Types & Casting
Cast Column
SQL:
SELECT CAST(col AS INT) FROM table;
PySpark:
df.withColumn("col", df.col.cast("int"))
Check Schema
SQL:
DESCRIBE table;
PySpark:
df.printSchema()
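Besides printSchema(), the dtypes attribute exposes the same information as a list of (column, type) pairs; a short sketch:
# Cast, then inspect the schema two ways.
casted = df.withColumn("val", df.val.cast("int"))

casted.printSchema()   # tree-style output, similar to DESCRIBE
print(casted.dtypes)   # e.g. [('id', 'bigint'), ('col', 'string'), ('val', 'int')]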