PySpark – cheatsheet with comparison to SQL
I probably don't need to convince you to learn PySpark. There are multiple reasons to do so: first of all, PySpark is one of the most widely used big data processing frameworks, providing support for large-scale data processing with Apache Spark, which currently dominates the Big Data world. How to learn it? The simplest way is by analogy to something we already know. Today I would like to share a different kind of article – in the past I developed a comprehensive cheatsheet to help myself master PySpark. On my journey to becoming proficient in Spark, I initially leveraged my familiarity with SQL to understand data frames in PySpark, and this approach proved to be an effective way to gain a solid understanding of the PySpark library.
To start, let's define what PySpark is. PySpark is the Python API for Apache Spark, an open-source, distributed computing system used for big data processing and analysis. It allows developers to process large amounts of data in a parallel, fast, and efficient manner using Python. Please keep in mind that in some cases SQL cannot be translated directly: Spark has a concept of a table, but in reality a table is just a metastore pointer to files in storage.
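All PySpark snippets below assume that a SparkSession object named spark already exists (on platforms such as Databricks it is created for you). If you run the examples locally, a minimal sketch to create one could look like this (the application name is just an example):
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; the app name below is only an example
spark = SparkSession.builder.appName("pyspark-sql-cheatsheet").getOrCreate()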
To be able to follow the PySpark examples below, you can execute the following queries on your cluster:
CREATE TABLE employees
(
    employeeId INT,
    employeeName STRING,
    employeeSurname STRING,
    employeeTitle STRING,
    age INT,
    city STRING,
    birthdate DATE,
    salary DECIMAL
)

CREATE OR REPLACE TABLE sales
(
    EmployeeId INT,
    Quantity DECIMAL,
    UnitPrice DECIMAL,
    City STRING,
    Date DATE
)

INSERT INTO employees (employeeId, employeeName, employeeSurname, employeeTitle, age, city, birthdate, salary)
VALUES
(1, "John", "Doe", "Manager", 35, "New York", "1986-05-10", 75000),
(2, "Jane", "Doe", "Developer", 32, "London", "1989-02-15", 65000),
(3, "Jim", "Smith", "Architect", 40, "Paris", "1980-08-20", 85000),
(4, "Sarah", "Johnson", "Designer", 29, "Berlin", "1992-12-01", 55000),
(5, "Michael", "Brown", "Product Manager", 38, "Tokyo", "1984-06-06", 75000),
(6, "Emily", "Davis", "Data Analyst", 31, "Sydney", "1990-09-12", 65000),
(7, "David", "Wilson", "Salesperson", 33, "Toronto", "1988-07-01", 55000),
(8, "William", "Johnson", "Support Engineer", 36, "Beijing", "1985-04-01", 65000),
(9, "Brian", "Anderson", "Marketing Manager", 37, "Shanghai", "1983-05-15", 75000),
(10, "James", "Lee", "Operations Manager", 39, "Seoul", "1981-03-01", 85000),
(11, "Emily", "Parker", "HR Manager", 30, "Dubai", "1991-12-25", 75000),
(12, "Jacob", "Williams", "Accountant", 34, "New Delhi", "1987-06-01", 65000);

INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (1, 10, 20, 'New York', "2023-03-01");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (1, 5, 10, 'London', "2023-03-01");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (3, 15, 15, 'Paris', "2023-03-02");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (4, 20, 30, 'Berlin', "2023-03-02");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (4, 25, 10, 'Tokyo', "2023-03-03");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (4, 30, 20, 'Sydney', "2023-03-03");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (5, 35, 25, 'Beijing', "2023-03-03");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (6, 40, 30, 'Shanghai', "2023-03-04");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (7, 45, 35, 'New Delhi', "2023-03-05");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (7, 50, 40, 'New Delhi', "2023-03-05");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (9, 55, 45, 'Seoul', "2023-03-05");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (10, 60, 50, 'Rio de Janeiro', "2023-03-05");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (10, 65, 55, 'Paris', "2023-03-06");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (NULL, 70, 60, 'Mexico City', "2023-03-06");
INSERT INTO sales (EmployeeId, Quantity, UnitPrice, City, Date) VALUES (NULL, 75, 65, 'Mumbai', "2023-03-08");
Feel free to adjust it to your needs.
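If you prefer to stay in a notebook, the same setup can also be done from PySpark by passing each statement (one at a time) to spark.sql(); this is just a sketch of the idea:
# Each SQL statement above can be executed separately via spark.sql()
spark.sql("""
CREATE TABLE IF NOT EXISTS employees
(
    employeeId INT, employeeName STRING, employeeSurname STRING, employeeTitle STRING,
    age INT, city STRING, birthdate DATE, salary DECIMAL
)
""")
spark.sql("SELECT COUNT(*) AS cnt FROM employees").show()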
SELECT all columns from table
The statement is used to retrieve all columns and rows of data from the table.
SQL:
SELECT * FROM employees
Pyspark:
df = spark.table("employees")
df.show()
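Keep in mind that show() prints only the first 20 rows by default and truncates long values; both behaviors can be adjusted with its optional parameters:
df = spark.table("employees")
df.show(n=50, truncate=False)  # print up to 50 rows without truncating column values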
Select specific columns from table
This query retrieves the values of the columns “employeeName,” “employeeSurname,”
and “employeeTitle” from the “employees” table.
SQL:
SELECT employeeName, employeeSurname, employeeTitle FROM employees
Pyspark:
df = spark.table("employees")
df = df.select("employeeName", "employeeSurname", "employeeTitle")
df.show()
Order result
This query retrieves the “employeeName”, “employeeSurname”, and “employeeTitle” columns from the “employees” table and sorts the results by the “employeeSurname” column in ascending order.
SQL:
SELECT employeeName, employeeSurname, employeeTitle FROM employees ORDER BY employeeSurname
Pyspark:
df = spark.table("employees")
df = df.select("employeeName", "employeeSurname", "employeeTitle")
df = df.orderBy("employeeSurname")
df.show()
Order result descending
This query retrieves the values of the “employeeName”, “employeeSurname”, and “employeeTitle” columns from the “employees” table. The results are ordered in descending order based on the values in the “employeeSurname” column.
SQL:
SELECT employeeName, employeeSurname, employeeTitle FROM employees ORDER BY employeeSurname DESC
Pyspark:
df = spark.table("employees")
df = df.select("employeeName", "employeeSurname", "employeeTitle")
df = df.orderBy(df["employeeSurname"].desc())
df.show()
Select TOP N rows
The query retrieves the “employeeName”, “employeeSurname”, and “employeeTitle” columns from the “employees” table. It limits the number of returned rows to 10, with the first 10 rows determined by the sorting order specified in the “ORDER BY” clause, which sorts the results by the “employeeSurname” column. In other words, the query returns the first 10 employees from the table sorted by their surname (note that TOP is T-SQL syntax; in Spark SQL you would use LIMIT 10 at the end of the query instead).
SQL:
SELECT TOP 10 employeeName, employeeSurname, employeeTitle FROM employees ORDER BY employeeSurname
Pyspark:
df = spark.table("employees")
df = df.select("employeeName", "employeeSurname", "employeeTitle")
df = df.orderBy("employeeSurname")
df = df.limit(10)
df.show()
Adding aliases
This query retrieves data from the “employees” table and selects the columns
“employeeName”, “employeeSurname”, and “employeeTitle”. The “AS” clause is used to
assign new names, or aliases, to each of the selected columns.
SQL:
SELECT employeeName AS Name, employeeSurname AS Surname, employeeTitle AS Title FROM employees
Pyspark:
df = spark.table("employees")
df = df.withColumnRenamed("employeeName", "Name")
df = df.withColumnRenamed("employeeSurname", "Surname")
df = df.withColumnRenamed("employeeTitle", "Title")
df = df.select("Name", "Surname", "Title")  # keep only the renamed columns, as in the SQL query
df.show()
Filtering – greater than
This query retrieves data from the “employees” table and selects all columns by using
the wildcard selector “*”. The “WHERE” clause is used to filter the results based on a
condition. In this case, the condition is “Age > 35”, which means that only rows where
the value in the “Age” column is greater than 35 will be returned.
SQL:
SELECT * FROM employees
WHERE Age > 35
Pyspark:
df = spark.table("employees")
df = df.filter(df["Age"] > 35)
df.show()
Filtering – logical AND
The “WHERE” clause is used to filter the results based on multiple conditions. In this case, the conditions are “Age > 35” and “City = 'Seoul'”. These conditions are combined using the logical operator “AND”, meaning that only rows where both conditions are true will be returned.
SQL:
SELECT * FROM employees
WHERE Age > 35 AND City = 'Seoul'
Pyspark:
df = spark.table("employees")
df = df.filter((df["Age"] > 35) & (df["City"] == "Seoul"))
df.show()
Filtering – logical OR
The “WHERE” clause is used to filter the results based on multiple conditions. In this case, the conditions are “Age > 35” and “City = 'Seoul'”. These conditions are combined using the logical operator “OR”, meaning that rows where either one or both conditions are true will be returned.
SQL:
SELECT * FROM employees
WHERE Age > 35 OR City = 'Seoul'
Pyspark:
df = spark.table("employees")
df = df.filter((df["Age"] > 35) | (df["City"] == "Seoul"))
df.show()
Filtering – wildcard
The “WHERE” clause is used to filter the results based on a pattern match. In this case, the condition is “City LIKE 'S%'”, which means that only rows where the value in the “City” column starts with the letter “S” will be returned. The “%” symbol is used as a wildcard character, representing zero or more characters.
SQL:
SELECT * FROM employees
WHERE City LIKE 'S%'
Pyspark:
df = spark.table("employees")
df = df.filter(df["City"].like("S%"))
df.show()
Filtering – BETWEEN
The “WHERE” clause is used to filter the results based on a range of values. In this
case, the range is defined using the “BETWEEN” operator, with the values ‘19900101’
and ‘20000101’ as the lower and upper bounds, respectively. Only rows where the
value in the “BirthDate” column is within this range will be returned.
SQL:
SELECT * FROM employees
WHERE BirthDate BETWEEN '19900101' AND '20000101'
Pyspark:
df = spark.table("employees")
df = df.filter(df["BirthDate"].between("1990-01-01", "2000-01-01"))
df.show()
Filtering – not equal
The “WHERE” clause is used to filter the results based on a condition. In this case, the condition is “City <> 'Toronto'”, which means that only rows where the value in the “City” column is not equal to 'Toronto' will be returned.
SQL:
SELECT * FROM employees
WHERE City <> 'Toronto'
Pyspark:
df = spark.read.table("employees")
df.filter(df["City"] != 'Toronto').select("*").show()
Filtering – IN
The “WHERE” clause is used to filter the results based on a condition. In this case, the condition is “City IN ('Tokyo', 'London', 'New York')”, which means that only rows where the value in the “City” column is equal to one of the cities listed in the parentheses will be returned. The “IN” operator is used to specify a set of values to check for equality.
SQL:
SELECT * FROM employees
WHERE City IN ('Tokyo', 'London', 'New York')
Pyspark:
df = spark.read.table("employees")
df = df.filter(df["City"].isin(['Tokyo', 'London', 'New York']))
df.show()
Filtering – NOT IN
The “WHERE” clause is used to filter the results based on a condition. In this case, the condition is “City NOT IN ('Tokyo', 'London', 'New York')”, which means that only rows where the value in the “City” column is not equal to any of the cities listed in the parentheses will be returned. The “NOT IN” operator is used to specify a set of values to check for inequality.
SQL:
SELECT * FROM employees
WHERE City NOT IN ('Tokyo', 'London', 'New York')
Pyspark:
df = spark.read.table("employees")
df = df.filter(~df["City"].isin(['Tokyo', 'London', 'New York']))
df.show()
Aggregate – COUNT
This query retrieves data from the “employees” table and selects the “City” column
and a count of the number of rows for each city.
SQL:
SELECT City, COUNT(*) FROM employees
GROUP BY City
Pyspark:
from pyspark.sql.functions import count
df = spark.read.table("employees")
df = df.groupBy("City").agg(count("*").alias("Count")).select("City", "Count")
df.show()
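When the row count is the only aggregate you need, the count() shortcut on the grouped data gives the same result (the generated column is simply named “count”):
df = spark.read.table("employees")
df.groupBy("City").count().show()  # equivalent shortcut; produces a "count" column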
Aggregate – MIN,MAX,AVG,SUM
This query retrieves data from the “employees” table and selects the “City” column
and various aggregate functions applied to the “Salary” column for each city.
SQL:
SELECT City, AVG(Salary), MIN(Salary), MAX(Salary), SUM(Salary) FROM employees
GROUP BY City
Pyspark:
from pyspark.sql.functions import avg, min, max, sum
df = spark.read.table("employees")
df = df.groupBy("City").agg(
    avg("Salary").alias("avg_salary"),
    min("Salary").alias("min_salary"),
    max("Salary").alias("max_salary"),
    sum("Salary").alias("total_salary")
)
df.show()
Aggregate – HAVING
This query retrieves data from the “employees” table and selects the “City” column
and the average salary of employees in each city. The “GROUP BY” clause is used to
group the rows in the “employees” table by the “City” column. The “AVG(Salary)”
function is used to calculate the average salary of employees in each city. The
“HAVING” clause is used to further filter the groups based on the aggregate result. In
this case, the condition is “Salary>70000”, which means that only groups where the
average salary of employees is greater than 70000 will be returned.
SQL:
SELECT City, AVG(Salary) AS Salary FROM employees
GROUP BY City
HAVING Salary > 70000
Pyspark:
from pyspark.sql.functions import avg, col

df = spark.read.table("employees")
df = (df.groupBy("City")
        .agg(avg(col("Salary").cast("double")).alias("Salary"))
        .filter(col("Salary") > 70000)
        .select("City", "Salary"))
df.show()
SELECT DISTINCT
The “SELECT DISTINCT” statement is used to return only unique values from the “City”
column. This means that if there are multiple employees with the same city, the query
will only return one instance of that city.
SQL:
SELECT DISTINCT City FROM employees
Pyspark:
df = spark.read.table("employees")
df = df.select("City").distinct()
df.show()
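An alternative, especially when deduplicating on a subset of columns, is dropDuplicates():
df = spark.read.table("employees")
df.select("City").dropDuplicates(["City"]).show()  # same result as distinct() here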
Calculated column
This query performs a simple calculation to add a new column to a dataframe. The calculation multiplies the values of two existing columns (“Quantity” and “UnitPrice”) to create a new column named “SalesAmount”.
SQL
SELECT Quantity * UnitPrice AS SalesAmount FROM Sales
Pyspark:
df = spark.read.table("sales")
df = df.withColumn("SalesAmount", df["Quantity"] * df["UnitPrice"])
df.show()
Note that withColumn() replaces the column if a column with the given name already exists, so the same call can also be used to overwrite an existing column.
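For example, to overwrite an existing column in place (a small sketch; the 10% increase is just an illustration):
df = spark.read.table("sales")
df = df.withColumn("UnitPrice", df["UnitPrice"] * 1.1)  # replaces the existing UnitPrice column
df.show()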
SUBSTRING, LEFT, RIGHT
This query retrieves data from the “employees” table and selects three columns, each using a different string function: “LEFT”, “RIGHT”, and “SUBSTRING”. The “LEFT” function returns the first 2 characters of the city name, the “RIGHT” function returns the last 2 characters, and the “SUBSTRING” function returns 2 characters of the city name starting from the third character.
SQL:
SELECT LEFT(City, 2), RIGHT(City, 2), SUBSTRING(City, 3, 2) FROM Employees
Pyspark:
from pyspark.sql.functions import substring

df = spark.read.table("employees")
df = df.select(
    df["City"],
    substring(df["City"], 1, 2).alias("left"),
    substring(df["City"], -2, 2).alias("right"),
    substring(df["City"], 3, 2).alias("substring")
)
df.show()
Concatenation
The “CONCAT” function concatenates (joins) two or more strings into a single string. In this case, the first argument is the string “Employee: ”, followed by the “employeeName” and “employeeSurname” columns separated by a space.
SQL:
SELECT CONCAT('Employee: ', employeeName, ' ', employeeSurname) FROM Employees
Pyspark:
from pyspark.sql.functions import lit, concat
df = spark.read.table("employees")
df = df.select(concat(lit("Employee: "), df["employeeName"], lit(" "), df["employeeSurname"]).alias("full_name"))
df.show()
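A related helper worth knowing is concat_ws(), which joins columns using a separator given as its first argument, so the space does not have to be added as a literal:
from pyspark.sql.functions import concat_ws, lit
df = spark.read.table("employees")
df = df.select(concat_ws(" ", lit("Employee:"), df["employeeName"], df["employeeSurname"]).alias("full_name"))
df.show()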
Filtering – NOT IN and subquery
This query retrieves data from the “employees” table and selects all rows where the
city is not present in the “City” column of the “Sales” table.
SQL
SELECT * FROM Employees
WHERE City NOT IN
(SELECT City FROM Sales)
Pyspark:
df1 = spark.read.table("employees")
df2 = spark.read.table("sales")
df = df1.join(df2, "City", "left_anti")
df.show()
Subquery with filtering
This query retrieves data from the “employees” table, creates a subquery that groups
the employees by city and counts the number of employees in each city, and then
filters the result to only show cities with fewer than 2 employees.
SQL
SELECT City FROM
(
    SELECT City, COUNT(*) AS Cnt FROM employees
    GROUP BY City
) AS t
WHERE Cnt < 2
Pyspark:
from pyspark.sql.functions import count
df = spark.read.table("employees")
df = df.select("City").join(df.groupBy("City").agg(count("*").alias("cnt")), "City").where("cnt < 2")
df.show()
JOIN – Inner join
This query performs a join between the “employees” and “Sales” tables, which
combines data from both tables based on the “employeeId” column.
SQL
SELECT e.City, s.Quantity AS Qty FROM employees AS e
JOIN Sales AS s
ON s.employeeId = e.employeeId
Pyspark:
df1 = spark.read.table("employees")
df2 = spark.read.table("sales")
df = df1.join(df2, df1["employeeId"] == df2["employeeId"]).select(df1["City"], df2["Quantity"].alias("Qty"))
df.show()
JOIN – Left join
This query performs a left join between the “employees” and “Sales” tables, which
combines data from both tables based on the “employeeId” column. The left join
returns all rows from the “employees” table and only matching rows from the “Sales”
table. If there is no matching data in the “Sales” table for a particular employee, NULL
values will be displayed for the “Quantity” column.
SQL
SELECT e.City, s.Quantity AS Qty FROM employees AS e
LEFT JOIN Sales AS s
ON s.employeeId = e.employeeId
Pyspark:
df1 = spark.read.table("employees")
df2 = spark.read.table("sales")
df = df1.join(df2, df1["employeeId"] == df2["employeeId"], "left").select(df1["City"], df2["Quantity"].alias("Qty"))
df.show()
JOIN – right join
This query performs a right join between the “employees” and “Sales” tables, which
combines data from both tables based on the “employeeId” column. The right join
returns all rows from the “Sales” table and only matching rows from the “employees”
table. If there is no matching data in the “employees” table for a particular sale, NULL
values will be displayed for the “City” column.
SQL
SELECT e.City, s.Quantity AS Qty FROM employees AS e
RIGHT JOIN Sales AS s
ON s.employeeId = e.employeeId
Pyspark:
df1 = spark.read.table("employees")
df2 = spark.read.table("sales")
df = df1.join(df2, df1["employeeId"] == df2["employeeId"], "right").select(df1["City"], df2["Quantity"].alias("Qty"))
df.show()
JOIN – full join
The query performs a full join between the “employees” table and the “Sales” table,
using the “employeeId” column as the join condition. The result will return all rows from
both tables, including matching and non-matching rows. For non-matching rows, the
values from the other table will be filled with NULL.
SQL
SELECT e.City, s.Quantity AS Qty FROM employees AS e
FULL JOIN Sales AS s
ON s.employeeId = e.employeeId
Pyspark:
df1 = spark.read.table("employees")
df2 = spark.read.table("sales")
df = df1.join(df2, df1["employeeId"] == df2["employeeId"], "full").select(df1["City"], df2["Quantity"].alias("Qty"))
df.show()
JOIN – Cross join
This query performs a cross join between the two tables and returns all columns from both. The cross join results in all possible combinations of rows from both tables.
SQL
SELECT * FROM employees AS e
CROSS JOIN Sales AS s
Pyspark:
df1 = spark.table("employees")
df2 = spark.table("sales")
df = df1.crossJoin(df2)
df.show()
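Depending on the Spark version and configuration, implicit cross joins may be blocked; if you get an error, the (version-dependent) setting below can be enabled before running the join:
# Older Spark versions require cross joins to be enabled explicitly
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df = df1.crossJoin(df2)
df.show()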
Working on sets – UNION
This query retrieves the employeeName and employeeSurname columns from the employees table, where the value of the City column is either ‘Toronto’ or ‘New Delhi’. The UNION operation is used to combine the three datasets into a single one. The result will contain only unique rows, with duplicates removed.
SQL
SELECT employeeName, employeeSurname FROM employees
WHERE City = 'Toronto'
UNION
SELECT employeeName, employeeSurname FROM employees
WHERE City = 'New Delhi'
UNION
SELECT employeeName, employeeSurname FROM employees
WHERE City = 'New Delhi'
Pyspark:
from pyspark.sql.functions import col

df = spark.table("employees")
df1 = df.select(col("employeeName"), col("employeeSurname")).filter(col("City") == "Toronto")
df2 = df.select(col("employeeName"), col("employeeSurname")).filter(col("City") == "New Delhi")
df3 = df.select(col("employeeName"), col("employeeSurname")).filter(col("City") == "New Delhi")
df = df1.union(df2).union(df3).distinct()
df.show()
Working on sets – UNION ALL
This query retrieves the employeeName and employeeSurname columns from the employees table, where the value of the City column is either ‘Toronto’ or ‘New Delhi’.
The UNION ALL keyword is used to combine the results of the three separate datasets
into a single dataset. Unlike UNION, UNION ALL does not remove duplicates, so the
result will contain all rows from all three datasets.
SQL
SELECT employeeName, employeeSurname FROM employees
WHERE City = 'Toronto'
UNION ALL
SELECT employeeName, employeeSurname FROM employees
WHERE City = 'New Delhi'
UNION ALL
SELECT employeeName, employeeSurname FROM employees
WHERE City = 'New Delhi'
Pyspark:
from pyspark.sql.functions import col

df = spark.table("employees")
df1 = df.select(col("employeeName"), col("employeeSurname")).filter(col("City") == "Toronto")
df2 = df.select(col("employeeName"), col("employeeSurname")).filter(col("City") == "New Delhi")
df3 = df.select(col("employeeName"), col("employeeSurname")).filter(col("City") == "New Delhi")
df = df1.union(df2).union(df3)
df.show()
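Note that union() matches columns by position, not by name. When the DataFrames might have the same columns in a different order, unionByName() is the safer choice:
# Resolves columns by name instead of by position
df = df1.unionByName(df2).unionByName(df3)
df.show()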
Working on sets – INTERSECT
This query retrieves the employeeName and employeeSurname columns from the employees table where the value of the City column is either ‘Toronto’ or ‘New Delhi’. The INTERSECT keyword is used to return only the rows that are common to both datasets. In other words, the result will contain only the employees whose City is ‘Toronto’, since only those rows appear in both datasets.
SQL
SELECT employeeName, employeeSurname FROM employees
WHERE City IN ('New Delhi', 'Toronto')
INTERSECT
SELECT employeeName, employeeSurname FROM employees
WHERE City = 'Toronto'
Pyspark:
from pyspark.sql.functions import col

df = spark.table("employees")
df1 = df.select(col("employeeName"), col("employeeSurname")).filter(df["City"].isin(["Toronto", "New Delhi"]))
df2 = df.select(col("employeeName"), col("employeeSurname")).filter(df["City"].isin(["Toronto"]))
df = df1.intersect(df2)
df.show()
Working on sets – EXCEPT
This query retrieves the employeeName and employeeSurname columns from the employees table, where the value of the City column is either ‘Toronto’ or ‘New Delhi’. The EXCEPT operation returns the rows that are in the first dataset but not in the second. In this case, the first dataset retrieves the rows where the City column is either ‘Toronto’ or ‘New Delhi’, while the second SELECT statement retrieves the rows where the City column has the value ‘Toronto’.
SQL
SELECT employeeName, employeeSurname FROM employees
WHERE City IN ('Toronto', 'New Delhi')
EXCEPT
SELECT employeeName, employeeSurname FROM employees
WHERE City = 'Toronto'
Pyspark:
from pyspark.sql.functions import col

df = spark.table("employees")
df1 = df.select(col("employeeName"), col("employeeSurname")).filter(df["City"].isin(["Toronto", "New Delhi"]))
df2 = df.select(col("employeeName"), col("employeeSurname")).filter(df["City"].isin(["Toronto"]))
df = df1.exceptAll(df2)
df.show()
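One subtle difference: exceptAll() keeps duplicate rows (like SQL EXCEPT ALL), while plain EXCEPT returns distinct rows. To mirror EXCEPT exactly, subtract() can be used instead:
df = df1.subtract(df2)  # distinct semantics, like SQL EXCEPT
df.show()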
Window functions – ROW_NUMBER
The ROW_NUMBER() function generates a unique row number for each row in the
result set.
SQL
SELECT ROW_NUMBER() OVER(ORDER BY Salary DESC) AS RN, * FROM Employees
Pyspark:
from pyspark.sql.functions import col, row_number
from pyspark.sql import Window

df = spark.table("employees")

windowSpec = Window.orderBy(col("Salary").desc())
df = df.withColumn("RN", row_number().over(windowSpec))
df.show()
Window functions – PARTITION BY
This is a SQL query that assigns a unique row number to each row in the Employees table, based on the Salary column in descending order, but only within each partition defined by the employeeSurname column.
SQL:
SELECT ROW_NUMBER() OVER(PARTITION BY employeeSurname ORDER BY Salary DESC) AS RN, * FROM Employees
Pyspark:
from pyspark.sql.functions import col, row_number
from pyspark.sql import Window

df = spark.table("employees")

windowSpec = Window.partitionBy(col("employeeSurname")).orderBy(col("Salary").desc())
df = df.withColumn("RN", row_number().over(windowSpec))
df.show()
Window functions – Aggregate + OVER
This query calculates the total salary for each unique surname in the Employees table and returns the surname, salary, and total salary for each employee. It uses the SUM function with the OVER clause to calculate the total salary for each unique surname. The PARTITION BY clause in the OVER clause specifies that the total salary should be calculated separately for each unique surname, so the SUM function only adds up the salaries of employees with the same surname.
SQL
SELECT employeeSurname, Salary, SUM(Salary) OVER(PARTITION BY employeeSurname) AS TotalSalaryBySurname FROM Employees
Pyspark:
from pyspark.sql.functions import col, sum
from pyspark.sql import Window

df = spark.table("employees")

windowSpec = Window.partitionBy(col("employeeSurname"))
df = df.withColumn("TotalSalaryBySurname", sum(df["Salary"]).over(windowSpec))
df = df.select(["employeeSurname", "Salary", "TotalSalaryBySurname"])
df.show()
Window functions – LAG & LEAD
The query uses the LAG and LEAD functions with the OVER clause to calculate the
previous and next day’s Quantity values for each day, based on the order of the Date
column.
SQL
SELECT Quantity, LAG(Quantity) OVER(ORDER BY Date) AS PreviousDayQuantity, LEAD(Quantity) OVER(ORDER BY Date) AS NextDayQuantity
FROM
(
    SELECT Date, SUM(Quantity) AS Quantity
    FROM Sales
    GROUP BY Date
) AS t
Pyspark:
from pyspark.sql.functions import lag, lead, sum
from pyspark.sql import Window

df = spark.table("sales")
df = df.groupBy("Date").agg(sum(df["Quantity"]).alias("Quantity"))
df = df.select(
    df["Date"],
    df["Quantity"],
    lag("Quantity").over(Window.orderBy("Date")).alias("PreviousDayQuantity"),
    lead("Quantity").over(Window.orderBy("Date")).alias("NextDayQuantity")
)
df.show()
Subqueries – Common Table Expression
This is a query that creates a Common Table Expression (CTE) named cte which is
used to retrieve the top 5 employees with the highest salaries from the Employees
table, sorted in descending order by salary.
SQL
WITH cte
AS
(SELECT TOP 5 * FROM Employees ORDER BY Salary DESC)
SELECT * FROM cte
Pyspark:
from pyspark.sql.functions import desc
df = spark.table("employees")
cte = df.select("*").sort(desc("Salary")).limit(5)
cte.show()
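Since a DataFrame assigned to a variable already plays the role of a CTE, there is no dedicated API for it; alternatively the whole statement can be pushed down to spark.sql() (note that Spark SQL uses LIMIT rather than TOP):
cte = spark.sql("""
    WITH cte AS (
        SELECT * FROM employees ORDER BY Salary DESC LIMIT 5
    )
    SELECT * FROM cte
""")
cte.show()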
Converting – CAST
This query retrieves the employeeId, birthdate, and a new column birthdateText from the Employees table. The birthdateText column is created using the CAST function, which converts the birthdate column from its original data type to a string data type with a maximum length of 50 characters. The resulting column contains the string representation of the birthdate values.
SQL:
SELECT employeeId, birthdate, CAST(birthdate AS VARCHAR(50)) AS birthdateText
FROM Employees
Pyspark
# cast() is a method on Column objects, so no extra import is needed
df = spark.table("employees")
df = df.withColumn("birthdateText", df["birthdate"].cast("string"))
df = df.select(["employeeId", "birthdate", "birthdateText"])
df.show()
In conclusion, I hope that the information shared in this article regarding PySpark and its comparison to SQL will be beneficial to your learning journey. I plan to provide further insights on related topics, including PySpark, Pandas, and Spark, in the future, so stay tuned for updates.