Most Asked Spark Interview Questions
12 LPA - 20 LPA
Question 1:
1. Student Grade Classification
Problem:
You have a DataFrame of students with the following columns: student_id, name, score, and
subject.
Create a new column grade based on the score:
o 'A' if score >= 90
o 'B' if 80 <= score < 90
o 'C' if 70 <= score < 80
o 'D' if 60 <= score < 70
o 'F' if score < 60
Data Set
student_id  name   score  subject
1           Alice  92     Math
2           Bob    85     Math
3           Carol  77     Science
4           Dave   65     Science
5           Eve    50     Math
6           Frank  82     Science
Scala Spark
Spark - SQL
PySpark
Output -
Question 2:
You have a DataFrame employees with columns: employee_id, name, age, and salary.
Create a new column age_group based on age:
o 'Young' if age < 30
o 'Mid' if 30 <= age <= 50
o 'Senior' if age > 50
Create a new column salary_range based on salary:
o 'High' if salary > 100000
o 'Medium' if 50000 <= salary <= 100000
o 'Low' if salary < 50000
Filter employees whose name starts with 'J'.
Filter employees whose name ends with 'e'.
Data Set -
data = [
(1, "John", 28, 60000),
(2, "Jane", 32, 75000),
(3, "Mike", 45, 120000),
(4, "Alice", 55, 90000),
(5, "Steve", 62, 110000),
(6, "Claire", 40, 40000)
]
Scala Spark -
Spark - SQL
PySpark -
Output -
Question 3:
You have a DataFrame purchase_history with columns: purchase_id, customer_id,
purchase_amount, and purchase_date.
Create a new column purchase_category based on purchase_amount:
o 'Large' if purchase_amount > 2000
o 'Medium' if 1000 <= purchase_amount <= 2000
o 'Small' if purchase_amount < 1000
Filter purchases that occurred in January 2024.
Data Set -
[(1,1,2500,"2024-01-05"),
(2,2,1500,"2024-01-15"),
(3,3,500,"2024-02-20"),
(4,4,2200,"2024-03-01"),
(5,5,900,"2024-01-25"),
(6,6,3000,"2024-03-12")]
Scala Spark
Spark - SQL
PySpark
Output -
Question 4:
Data Set -
val employees = List(
(1, "John", "2020-01-01", "active"),
(2, "Jane", "2020-06-01", "inactive"),
(3, "Mike", "2020-03-01", "active"),
(4, "Alice", "2020-09-01", "inactive"),
(5, "Steve", "2020-02-01", "active")
)
Scala Spark -
PySpark -
Output -
Question 5:
Data Set -
data = [
(1,"Order-001","2022-01-01",100.0),
(2,"Order-002","2022-06-01",200.0),
(3,"Order-003","2022-03-01",50.0),
(4,"Order-004","2022-09-01",160.0),
(5,"Order-005","2022-02-01",250.0)
]
Scala Spark -
PySpark -
Output -
Created By:
Harshavardhana I
Data Engineer