1. Create a DataFrame for an Indian Employee Database
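from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the app name is arbitrary
spark = SparkSession.builder.appName("EmployeeDatabase").getOrCreate()

# Employee records: (EmpID, Name, Department, Salary in ₹)
data = [
    (1, "Amit", "IT", 60000),
    (2, "Priya", "HR", 55000),
    (3, "Rahul", "Finance", 75000),
    (4, "Sneha", "IT", 80000),
    (5, "Karan", "HR", 65000),
]
columns = ["EmpID", "Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)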
Task: Display the schema and first 3 rows.
2. Filter Employees Earning More than 70,000
Task: Write a query to filter employees earning more than ₹70,000.
3. Calculate Average Salary per Department
Task: Use `groupBy` to get the average salary for each department.
4. Find Employees whose Name Starts with 'A'
Task: Filter employees whose names start with the letter 'A'.
5. Count the Number of Employees per Department
Task: Use `groupBy` and `count()` to find the number of employees in each
department.
6. Add a New Column for Tax Deduction (10% of Salary)
Task: Add a new column `Tax` equal to 10% of `Salary`.
7. Sort Employees by Salary in Descending Order
Task: Display employees sorted in descending order of salary.
8. Get the Second Highest Salary
Task: Find the second highest salary without using `LIMIT` or `OFFSET`.
9. Get Employees Who are in the HR or IT Department
Task: Filter records where the department is either "HR" or "IT".
10. Find the Total Salary Paid by the Company
Task: Calculate the sum of all salaries.
11. Read a CSV File of Cricket Players
Sample CSV (`players.csv`):
Player,Country,Runs,Wickets
Virat Kohli,India,12000,4
Rohit Sharma,India,11000,8
Jasprit Bumrah,India,1200,200
Steve Smith,Australia,9500,20
Task: Read this CSV file into a DataFrame and display its contents.
12. Find the Player with Maximum Runs
Task: Find the player who has scored the maximum runs.
13. Find the Average Runs Scored by Indian Players
Task: Filter players from "India" and calculate the average runs scored.
14. Get Players Who Have Taken More than 50 Wickets
Task: Filter players who have taken more than 50 wickets.
15. Read a JSON File Containing Indian Cities Population
Sample JSON (`cities.json`):
[
{"City": "Mumbai", "State": "Maharashtra", "Population": 20000000},
{"City": "Delhi", "State": "Delhi", "Population": 18000000},
{"City": "Bangalore", "State": "Karnataka", "Population": 12000000},
{"City": "Hyderabad", "State": "Telangana", "Population": 10000000}
]
Task: Read this JSON file into a DataFrame and display its contents.
16. Find Cities with a Population Greater than 15 Million
Task: Filter cities with a population greater than 15 million.
17. Calculate Total Population per State
Task: Group by `State` and sum the `Population`.
18. Find the State with the Highest Total Population
Task: Identify which state has the highest total population.
19. Convert a DataFrame to Pandas
Task: Convert the `df` DataFrame into a Pandas DataFrame.