SQL vs PySpark

The Ultimate Colourful Cheat Sheet
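
All PySpark snippets below assume an active SparkSession and a DataFrame named df. A minimal setup sketch (the file path and read options are placeholders):
from pyspark.sql import SparkSession

# Create or reuse a SparkSession (the entry point for DataFrame operations)
spark = SparkSession.builder.appName("sql-vs-pyspark").getOrCreate()

# Load a DataFrame; "data.csv" is a hypothetical path
df = spark.read.csv("data.csv", header=True, inferSchema=True)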


Basic Data Operations
Select Columns
SQL:
SELECT col1, col2 FROM table;
PySpark:
df.select("col1", "col2")

Filter Rows
SQL:
SELECT * FROM table WHERE col > 100;
PySpark:
df.filter(df.col > 100)
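
The same filter can be written with the column function, which reads better in chained expressions; a sketch with a hypothetical second column:
from pyspark.sql import functions as F

df.filter(F.col("col") > 100)                            # same as df.col > 100
df.filter((F.col("col") > 100) & (F.col("col2") < 50))   # combine predicates with & / |, each in parentheses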

Limit Rows
SQL:
SELECT * FROM table LIMIT 10;
PySpark:
df.limit(10)

Distinct Values
SQL:
SELECT DISTINCT col FROM table;
PySpark:
df.select("col").distinct()
Aggregations
Count Rows
SQL:
SELECT COUNT(*) FROM table;
PySpark:
df.count()

Group By & Aggregate
SQL:
SELECT col, COUNT(*) FROM table GROUP BY col;
PySpark:
df.groupBy("col").count()

Multiple Aggregations
SQL:
SELECT col, AVG(val), MAX(val) FROM table GROUP BY col;
PySpark:
from pyspark.sql import functions as F
df.groupBy("col").agg(F.avg("val"), F.max("val"))
Joins
Inner Join
SQL:
SELECT * FROM t1 INNER JOIN t2 ON t1.id = t2.id;
PySpark:
df1.join(df2, df1.id == df2.id, "inner")

Left Join
SQL:
SELECT * FROM t1 LEFT JOIN t2 ON t1.id = t2.id;
PySpark:
df1.join(df2, df1.id == df2.id, "left")

Cross Join
SQL:
SELECT * FROM t1 CROSS JOIN t2;
PySpark:
df1.crossJoin(df2)
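
When both DataFrames name the key identically, passing the column name (or a list of names) avoids a duplicated id column in the result; a sketch:
df1.join(df2, "id", "inner")    # single "id" column in the output
df1.join(df2, ["id"], "left")   # list form, same effect
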
Window Functions
Row Number
SQL:
SELECT *, ROW_NUMBER() OVER(PARTITION BY col ORDER BY date) AS rn FROM table;
PySpark:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
windowSpec = Window.partitionBy("col").orderBy("date")
df.withColumn("rn", F.row_number().over(windowSpec))

Rank
SQL:
SELECT *, RANK() OVER(PARTITION BY col ORDER BY val DESC) AS rank FROM table;
PySpark:
rankSpec = Window.partitionBy("col").orderBy(F.desc("val"))
df.withColumn("rank", F.rank().over(rankSpec))
Data Manipulation
Add Column
SQL:
ALTER TABLE table ADD col2 INT;
-- or, as a query:
SELECT *, col1 + 1 AS col2 FROM table;
PySpark:
df.withColumn("col2", df.col1 + 1)

Rename Column
SQL:
SELECT col1 AS new_name FROM table;
PySpark:
df.withColumnRenamed("col1", "new_name")

Drop Column
SQL:
ALTER TABLE table DROP COLUMN col1; (exact syntax varies by dialect)
PySpark:
df.drop("col1")
Data Types & Casting
Cast Column
SQL:
SELECT CAST(col AS INT) FROM table;
PySpark:
df.withColumn("col", df.col.cast("int"))

Check Schema
SQL:
DESCRIBE table;
PySpark:
df.printSchema()
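
Putting several entries together: a sketch of a filter-aggregate-sort pipeline over the placeholder columns used throughout:
result = (
    df.filter(F.col("val") > 100)            # WHERE val > 100
      .groupBy("col")                        # GROUP BY col
      .agg(F.avg("val").alias("avg_val"))    # AVG(val) AS avg_val
      .orderBy(F.desc("avg_val"))            # ORDER BY avg_val DESC
)
result.show()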
