PYSPARK LEARNING HUB : DAY - 1
Step - 1 : Problem Statement
Actors and Directors Who Cooperated At Least Three Times
Write a pyspark program for a report that provides the pairs
(actor_id, director_id) where the actor has cooperated with
the director at least 3 times.
Difficulty Level : EASY
DataFrame:
schema = StructType([
StructField("ActorId",IntegerType(),True),
StructField("DirectorId",IntegerType(),True),
StructField("timestamp",IntegerType(),True)
])
data = [
(1, 1, 0),
(1, 1, 1),
(1, 1, 2),
(1, 2, 3),
(1, 2, 4),
(2, 1, 5),
(2, 1, 6)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
ACTOR_ID DIRECTOR_ID TIMESTAMP
1 1 0
1 1 1
1 1 2
1 2 3
1 2 4
2 1 5
2 1 6
OUTPUT
ACTOR_ID DIRECTOR_ID
1 1
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
schema = StructType([
StructField("ActorId",IntegerType(),True),
StructField("DirectorId",IntegerType(),True),
StructField("timestamp",IntegerType(),True)
])
data = [
(1, 1, 0),
(1, 1, 1),
(1, 1, 2),
(1, 2, 3),
(1, 2, 4),
(2, 1, 5),
(2, 1, 6)
]
df=spark.createDataFrame(data,schema)
df.show()
df_group=df.groupBy('ActorId','DirectorId').count()
df_group.show()
+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 2| 2|
| 1| 1| 3|
| 2| 1| 2|
+-------+----------+-----+
df_group.filter(df_group['count'] >= 3).show()
+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 1| 3|
+-------+----------+-----+
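The same result can also be produced with Spark SQL through a temp view (a minimal sketch, assuming the same df as above):
df.createOrReplaceTempView("actor_director")
spark.sql("""
SELECT ActorId, DirectorId
FROM actor_director
GROUP BY ActorId, DirectorId
HAVING COUNT(*) >= 3
""").show()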
PYSPARK LEARNING HUB : DAY - 2
Step - 1 : Problem Statement
Ads Performance
Write a pyspark code to find the CTR of each ad. Round CTR to 2
decimal places. Order the result table by CTR in descending order
and by ad_id in ascending order in case of a tie.
CTR = Clicked / (Clicked + Viewed)
Difficulty Level : EASY
DataFrame:
# Define the schema for the Ads table
schema=StructType([
StructField('AD_ID',IntegerType(),True)
,StructField('USER_ID',IntegerType(),True)
,StructField('ACTION',StringType(),True)
])
# Define the data for the Ads table
data = [
(1, 1, 'Clicked'),
(2, 2, 'Clicked'),
(3, 3, 'Viewed'),
(5, 5, 'Ignored'),
(1, 7, 'Ignored'),
(2, 7, 'Viewed'),
(3, 5, 'Clicked'),
(1, 4, 'Viewed'),
(2, 11, 'Viewed'),
(1, 2, 'Clicked')
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
AD_ID USER_ID ACTION
1 1 Clicked
2 2 Clicked
3 3 Viewed
5 5 Ignored
1 7 Ignored
2 7 Viewed
3 5 Clicked
1 4 Viewed
2 11 Viewed
1 2 Clicked
OUTPUT
AD_ID CTR
1 0.67
3 0.5
2 0.33
5 0
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.window import Window
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the Ads table
schema=StructType([
StructField('AD_ID',IntegerType(),True)
,StructField('USER_ID',IntegerType(),True)
,StructField('ACTION',StringType(),True)
])
# Define the data for the Ads table
data = [
(1, 1, 'Clicked'),
(2, 2, 'Clicked'),
(3, 3, 'Viewed'),
(5, 5, 'Ignored'),
(1, 7, 'Ignored'),
(2, 7, 'Viewed'),
(3, 5, 'Clicked'),
(1, 4, 'Viewed'),
(2, 11, 'Viewed'),
(1, 2, 'Clicked')
]
# Create a PySpark DataFrame
df=spark.createDataFrame(data,schema)
df.show()
# Count clicks and views per ad (the DataFrame created above is df)
ctr_df = (
    df.groupBy("AD_ID")
    .agg(
        F.sum(F.when(df["ACTION"] == "Clicked", 1).otherwise(0)).alias("click_count"),
        F.sum(F.when(df["ACTION"] == "Viewed", 1).otherwise(0)).alias("view_count")
    )
    # CTR = Clicked / (Clicked + Viewed); default to 0 when an ad has neither clicks nor views
    .withColumn("ctr", F.coalesce(
        F.round(F.col("click_count") / (F.col("click_count") + F.col("view_count")), 2),
        F.lit(0)))
)
# Order the result table by CTR in descending order and by ad_id in ascending order
result_df = ctr_df.orderBy(F.col("ctr").desc(), F.col("AD_ID").asc())
# Show the result DataFrame
result_df.select('AD_ID', 'ctr').show()
PYSPARK LEARNING HUB : DAY - 3
Step - 1 : Problem Statement
Combine Two DF
Write a Pyspark program to report the first name, last name, city, and state of each person in the
Person dataframe. If the address of a personId is not present in the Address dataframe,
report null instead.
Difficulty Level : EASY
DataFrame:
# Define schema for the 'persons' table
persons_schema = StructType([
StructField("personId", IntegerType(), True),
StructField("lastName", StringType(), True),
StructField("firstName", StringType(), True)
])
# Define schema for the 'addresses' table
addresses_schema = StructType([
StructField("addressId", IntegerType(), True),
StructField("personId", IntegerType(), True),
StructField("city", StringType(), True),
StructField("state", StringType(), True)
])
# Define data for the 'persons' table
persons_data = [
(1, 'Wang', 'Allen'),
(2, 'Alice', 'Bob')
]
# Define data for the 'addresses' table
addresses_data = [
(1, 2, 'New York City', 'New York'),
(2, 3, 'Leetcode', 'California')
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
INPUT-1 persons
PERSONID LASTNAME FIRSTNAME
1 Wang Allen
2 Alice Bob
INPUT-2 addresses
ADDRESSID PERSONID CITY STATE
1 2 New York City New York
2 3 Leetcode California
OUTPUT
FIRSTNAME LASTNAME CITY STATE
Bob Alice New York City New York
Allen Wang null null
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.window import Window
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define schema for the 'persons' table
persons_schema = StructType([
StructField("personId", IntegerType(), True),
StructField("lastName", StringType(), True),
StructField("firstName", StringType(), True)
])
# Define schema for the 'addresses' table
addresses_schema = StructType([
StructField("addressId", IntegerType(), True),
StructField("personId", IntegerType(), True),
StructField("city", StringType(), True),
StructField("state", StringType(), True)
])
# Define data for the 'persons' table
persons_data = [
(1, 'Wang', 'Allen'),
(2, 'Alice', 'Bob')
]
# Define data for the 'addresses' table
addresses_data = [
(1, 2, 'New York City', 'New York'),
(2, 3, 'Leetcode', 'California')
]
# Create a PySpark DataFrame
person_df=spark.createDataFrame(persons_data,persons_schema)
address_df=spark.createDataFrame(addresses_data,addresses_schema)
person_df.show()
address_df.show()
# Show the result DataFrame
person_df.join(address_df, person_df.personId == address_df.personId, 'left') \
.select('firstName', 'lastName', 'city', 'state') \
.show()
PYSPARK LEARNING HUB : DAY - 4
Step - 1 : Problem Statement
04_Employees Earning More Than Their Managers
Write a Pyspark program to find Employees Earning More Than Their
Managers
Difficulty Level : EASY
DataFrame:
# Define the schema for the "employees"
employees_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("managerId", IntegerType(), True)
])
# Define data for the "employees"
employees_data = [
(1, 'Joe', 70000, 3),
(2, 'Henry', 80000, 4),
(3, 'Sam', 60000, None),
(4, 'Max', 90000, None)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
ID NAME SALARY MANAGERID
1 Joe 70,000 3
2 Henry 80,000 4
3 Sam 60,000
4 Max 90,000
OUTPUT
NAME
Joe
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when, col
from pyspark.sql import functions as F
from pyspark.sql.window import Window
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the "employees"
employees_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("managerId", IntegerType(), True)
])
# Define data for the "employees"
employees_data = [
(1, 'Joe', 70000, 3),
(2, 'Henry', 80000, 4),
(3, 'Sam', 60000, None),
(4, 'Max', 90000, None)
]
# Create a PySpark DataFrame
emp_df = spark.createDataFrame(employees_data, employees_schema)
emp_df.show()
emp_df1 = emp_df.alias("e1")
emp_df2 = emp_df.alias("e2")
# Join each employee (e2) to their manager (e1)
self_joined_df = emp_df1.join(emp_df2, col("e1.id") == col("e2.managerId"), "inner") \
.select(col("e2.name"), col("e2.salary"), col("e1.salary").alias("msal"))
self_joined_df.filter(self_joined_df.salary > self_joined_df.msal).select("name").show()
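Equivalently, the salary comparison can be folded into the join condition so no separate filter step is needed (a minimal sketch using the same aliases):
emp_df1.join(
    emp_df2,
    (col("e1.id") == col("e2.managerId")) & (col("e2.salary") > col("e1.salary")),
    "inner"
).select(col("e2.name")).show()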
PYSPARK LEARNING HUB : DAY - 5
Step - 1 : Problem Statement
Duplicate Emails
Write a Pyspark program to report all the duplicate emails.
Note that it's guaranteed that the email field is not NULL.
Difficulty Level : EASY
DataFrame:
# Define the schema for the "emails" table
emails_schema = StructType([
StructField("id", IntegerType(), True),
StructField("email", StringType(), True)
])
# Define data for the "emails" table
emails_data = [
(1, '[email protected]'),
(2, '[email protected]'),
(3, '[email protected]')
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
ID EMAIL
1 [email protected]
2 [email protected]
3 [email protected]
OUTPUT
EMAIL
[email protected]
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when
from pyspark.sql import functions as F
from pyspark.sql.window import Window
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the "emails" table
emails_schema = StructType([
StructField("id", IntegerType(), True),
StructField("email", StringType(), True)
])
# Define data for the "emails" table
emails_data = [
(1, '[email protected]'),
(2, '[email protected]'),
(3, '[email protected]')
]
# Create a PySpark DataFrame
df=spark.createDataFrame(emails_data,emails_schema)
df.show()
df_group=df.groupby("email").count()
df_group.filter(df_group["count"] > 1).select("email").show()
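The same check can also be written in Spark SQL (a minimal sketch using a temp view):
df.createOrReplaceTempView("person")
spark.sql("SELECT email FROM person GROUP BY email HAVING COUNT(*) > 1").show()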
PYSPARK LEARNING HUB : DAY - 6
Step - 1 : Problem Statement
06_Customers Who Never Order
Write a Pyspark program to find all customers who never
order anything.
Difficulty Level : EASY
DataFrame:
# Define the schema for the "Customers"
customers_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True)
])
# Define data for the "Customers"
customers_data = [
(1, 'Joe'),
(2, 'Henry'),
(3, 'Sam'),
(4, 'Max')
]
# Define the schema for the "Orders"
orders_schema = StructType([
StructField("id", IntegerType(), True),
StructField("customerId", IntegerType(), True)
])
# Define data for the "Orders"
orders_data = [
(1, 3),
(2, 1)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
INPUT -1 customers
ID NAME
1 Joe
2 Henry
3 Sam
4 Max
INPUT - 2 orders
ID CUSTOMERID
1 3
2 1
OUTPUT
NAME
Max
Henry
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
customers_schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True)
])
# Define data for the "Customers"
customers_data = [
(1, 'Joe'),
(2, 'Henry'),
(3, 'Sam'),
(4, 'Max')
]
# Define the schema for the "Orders"
orders_schema = StructType([
StructField("id", IntegerType(), True),
StructField("customerId", IntegerType(), True)
])
# Define data for the "Orders"
orders_data = [
(1, 3),
(2, 1)
]
# Create a PySpark DataFrame
cus_df = spark.createDataFrame(customers_data, customers_schema)
ord_df = spark.createDataFrame(orders_data, orders_schema)
cus_df.show()
ord_df.show()
cus_df.join(ord_df,cus_df.id == ord_df.customerId,"left_anti")\
.select("name").show()
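For comparison, the same anti-join can be written as a left join followed by a null check (a sketch; the left_anti form above is more direct):
cus_df.join(ord_df, cus_df.id == ord_df.customerId, "left") \
.filter(ord_df.customerId.isNull()) \
.select("name").show()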
PYSPARK LEARNING HUB : DAY - 7
Step - 1 : Problem Statement
07_Rising Temperature
Write a solution to find the ids of all dates with a higher
temperature than the previous date (yesterday).
Return the result table in any order.
Difficulty Level : EASY
DataFrame:
# Define the schema for the "Weather" table
weather_schema = StructType([
StructField("id", IntegerType(), True),
StructField("recordDate", StringType(), True),
StructField("temperature", IntegerType(), True)
])
# Define data for the "Weather" table
weather_data = [
(1, '2015-01-01', 10),
(2, '2015-01-02', 25),
(3, '2015-01-03', 20),
(4, '2015-01-04', 30)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
ID RECORDDATE TEMPERATURE
1 2015-01-01 10
2 2015-01-02 25
3 2015-01-03 20
4 2015-01-04 30
OUTPUT
ID
2
4
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import lag
from pyspark.sql.window import Window
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the "Weather" table
weather_schema = StructType([
StructField("id", IntegerType(), True),
StructField("recordDate", StringType(), True),
StructField("temperature", IntegerType(), True)
])
# Define data for the "Weather" table
weather_data = [
(1, '2015-01-01', 10),
(2, '2015-01-02', 25),
(3, '2015-01-03', 20),
(4, '2015-01-04', 30)
]
# Create a PySpark DataFrame
temp_df=spark.createDataFrame(weather_data,weather_schema)
temp_df.show()
lag_df = temp_df.withColumn("prev_day", lag(temp_df.temperature).over(Window.orderBy(temp_df.recordDate)))
lag_df.show()
lag_df.filter(lag_df["temperature"] > lag_df["prev_day"]).select("id").show()
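Note that lag() over recordDate compares each row only against the previous row, which in this data happens to be the previous day. If the dates could have gaps, a sketch that also checks the rows are exactly one day apart (datediff and to_date are this sketch's additions):
from pyspark.sql.functions import datediff, to_date
w = Window.orderBy(to_date(temp_df.recordDate))
gap_df = temp_df.withColumn("prev_temp", lag("temperature").over(w)) \
.withColumn("prev_date", lag("recordDate").over(w))
gap_df.filter((datediff(to_date("recordDate"), to_date("prev_date")) == 1) &
(gap_df["temperature"] > gap_df["prev_temp"])).select("id").show()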
PYSPARK LEARNING HUB : DAY - 8
Step - 1 : Problem Statement
08_Game Play Analysis I
Write a solution to find the first login date for each player.
Return the result table in any order.
Difficulty Level : EASY
DataFrame:
# Define the schema for the "Activity"
activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])
# Define data for the "Activity"
activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
PLAYER_ID DEVICE_ID EVENT_DATE GAMES_PLAYED
1 2 2016-03-01 5
1 2 2016-05-02 6
2 3 2017-06-25 1
3 1 2016-03-02 0
3 4 2018-07-03 5
OUTPUT
PLAYER_ID FIRST_LOGIN
1 2016-03-01
2 2017-06-25
3 2016-03-02
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import rank
from pyspark.sql.window import Window
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the "Activity"
activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])
# Define data for the "Activity"
activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]
# Create a PySpark DataFrame
activity_df=spark.createDataFrame(activity_data,activity_schema)
activity_df.show()
rank_df = activity_df.withColumn("RK", rank().over(Window.partitionBy(activity_df['player_id']).orderBy(activity_df['event_date'])))
rank_df.show()
rank_df.filter(rank_df["RK"] == 1).select("player_id", rank_df["event_date"].alias("First_Login")).show()
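Since only the earliest event_date is needed per player, a plain groupBy with min() also works and avoids the window entirely (a minimal alternative sketch):
from pyspark.sql import functions as F
activity_df.groupBy("player_id") \
.agg(F.min("event_date").alias("First_Login")) \
.show()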
PYSPARK LEARNING HUB : DAY - 9
Step - 1 : Problem Statement
09_Game Play Analysis II
Write a pyspark code that reports the device that is first
logged in for each player.
Return the result table in any order.
Difficulty Level : EASY
DataFrame:
# Define the schema for the "Activity"
activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])
# Define data for the "Activity"
activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
PLAYER_ID DEVICE_ID EVENT_DATE GAMES_PLAYED
1 2 2016-03-01 5
1 2 2016-05-02 6
2 3 2017-06-25 1
3 1 2016-03-02 0
3 4 2018-07-03 5
OUTPUT
PLAYER_ID DEVICE_ID
1 2
2 3
3 1
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import rank
from pyspark.sql.window import Window
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the "Activity"
activity_schema = StructType([
StructField("player_id", IntegerType(), True),
StructField("device_id", IntegerType(), True),
StructField("event_date", StringType(), True),
StructField("games_played", IntegerType(), True)
])
# Define data for the "Activity"
activity_data = [
(1, 2, '2016-03-01', 5),
(1, 2, '2016-05-02', 6),
(2, 3, '2017-06-25', 1),
(3, 1, '2016-03-02', 0),
(3, 4, '2018-07-03', 5)
]
# Create a PySpark DataFrame
df=spark.createDataFrame(activity_data,activity_schema)
df.show()
rank_df = df.withColumn("rk", rank().over(Window.partitionBy(df["player_id"]).orderBy(df["event_date"])))
rank_df.show()
rank_df.filter(rank_df["rk"] == 1).select("player_id", "device_id").show()
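Because the device must come from the same earliest row, a window-free alternative is to take the minimum of an (event_date, device_id) struct, which Spark orders field by field (a minimal sketch):
from pyspark.sql import functions as F
df.groupBy("player_id") \
.agg(F.min(F.struct("event_date", "device_id")).alias("first")) \
.select("player_id", "first.device_id").show()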
PYSPARK LEARNING HUB : DAY - 10
Step - 1 : Problem Statement
10_Employee Bonus
Write a solution to report the name and bonus amount of
each employee with a bonus less than 1000.
Return the result table in any order
Difficulty Level : EASY
DataFrame:
# Define the schema for the "Employee"
employee_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("name", StringType(), True),
StructField("supervisor", IntegerType(), True),
StructField("salary", IntegerType(), True)
])
# Define data for the "Employee"
employee_data = [
(3, 'Brad', None, 4000),
(1, 'John', 3, 1000),
(2, 'Dan', 3, 2000),
(4, 'Thomas', 3, 4000)
]
# Define the schema for the "Bonus"
bonus_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("bonus", IntegerType(), True)
])
# Define data for the "Bonus"
bonus_data = [
(2, 500),
(4, 2000)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
INPUT-1 EMPLOYEE
EMPID NAME SUPERVISOR SALARY
3 Brad 4,000
1 John 3 1,000
2 Dan 3 2,000
4 Thomas 3 4,000
INPUT-2 BONUS
EMPID BONUS
2 500
4 2,000
OUTPUT
NAME BONUS
Brad
John
Dan 500
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import col
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the "Employee"
employee_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("name", StringType(), True),
StructField("supervisor", IntegerType(), True),
StructField("salary", IntegerType(), True)
])
# Define data for the "Employee"
employee_data = [
(3, 'Brad', None, 4000),
(1, 'John', 3, 1000),
(2, 'Dan', 3, 2000),
(4, 'Thomas', 3, 4000)
]
# Define the schema for the "Bonus"
bonus_schema = StructType([
StructField("empId", IntegerType(), True),
StructField("bonus", IntegerType(), True)
])
# Define data for the "Bonus"
bonus_data = [
(2, 500),
(4, 2000)
]
# Create a PySpark DataFrame
emp_df = spark.createDataFrame(employee_data, employee_schema)
bonus_df = spark.createDataFrame(bonus_data, bonus_schema)
emp_df.show()
bonus_df.show()
join_df = emp_df.join(bonus_df, emp_df.empId == bonus_df.empId, "left")
join_df.show()
# Keep employees whose bonus is under 1000 or who have no bonus row at all
join_df.filter((join_df.bonus < 1000) | col("bonus").isNull()).select("name", "bonus").show()
PYSPARK LEARNING HUB : DAY - 11
Step - 1 : Problem Statement
11_Find Customer Referee
Find the names of the customers that are not referred by the
customer with id = 2.
Return the result table in any order.
Difficulty Level : EASY
DataFrame:
# Define the schema for the Customer table
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("referee_id", IntegerType(), True)
])
# Create an RDD with the data
data = [
(1, 'Will', None),
(2, 'Jane', None),
(3, 'Alex', 2),
(4, 'Bill', None),
(5, 'Zack', 1),
(6, 'Mark', 2)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
ID NAME REFEREE_ID
1 Will
2 Jane
3 Alex 2
4 Bill
5 Zack 1
6 Mark 2
OUTPUT
NAME
Will
Jane
Bill
Zack
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import col
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the Customer table
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("referee_id", IntegerType(), True)
])
# Create an RDD with the data
data = [
(1, 'Will', None),
(2, 'Jane', None),
(3, 'Alex', 2),
(4, 'Bill', None),
(5, 'Zack', 1),
(6, 'Mark', 2)
]
# Create a PySpark DataFrame
customer_df = spark.createDataFrame(data ,schema )
# Filter customers not referred by customer with id = 2
result_df = customer_df.filter((col("referee_id").isNull()) |
(col("referee_id") != 2))
# Select only the 'name' column and show the result
result_df = result_df.select("name")
result_df.show()
+-----+
| name|
+-----+
| Will|
| Jane|
| Bill|
| Zack|
+-----+
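The isNull() branch matters here: comparisons with NULL are never true in Spark SQL, so filtering on referee_id != 2 alone would also drop Will, Jane, and Bill. A quick check:
# Without the null check, rows with a NULL referee_id are filtered out too
customer_df.filter(col("referee_id") != 2).select("name").show()  # only Zack remains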
PYSPARK LEARNING HUB : DAY - 12
Step - 1 : Problem Statement
12_Cities With Completed Trades
Write a pyspark code to retrieve the top three cities that
have the highest number of completed trade orders listed in
descending order. Output the city name and the
corresponding number of completed trade orders.
Difficulty Level : EASY
DataFrame:
# Define the schema for the trades
trades_schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("price", FloatType(), True),
StructField("quantity", IntegerType(), True),
StructField("status", StringType(), True),
StructField("timestamp", StringType(), True)
])
# Define the schema for the users
users_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("city", StringType(), True),
StructField("email", StringType(), True),
StructField("signup_date", StringType(), True)
])
# Create an RDD with the data for trades
trades_data = [
(100101, 111, 9.80, 10, 'Cancelled', '2022-08-17 12:00:00'),
(100102, 111, 10.00, 10, 'Completed', '2022-08-17 12:00:00'),
(100259, 148, 5.10, 35, 'Completed', '2022-08-25 12:00:00'),
(100264, 148, 4.80, 40, 'Completed', '2022-08-26 12:00:00'),
(100305, 300, 10.00, 15, 'Completed', '2022-09-05 12:00:00'),
(100400, 178, 9.90, 15, 'Completed', '2022-09-09 12:00:00'),
(100565, 265, 25.60, 5, 'Completed', '2022-12-19 12:00:00')
]
# Create an RDD with the data for users
users_data = [
(111, 'San Francisco', '[email protected]', '2021-08-03 12:00:00'),
(148, 'Boston', '[email protected]', '2021-08-20 12:00:00'),
(178, 'San Francisco', '[email protected]', '2022-01-05 12:00:00'),
(265, 'Denver', '[email protected]', '2022-02-26 12:00:00'),
(300, 'San Francisco', '[email protected]', '2022-06-30 12:00:00')
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
INPUT-1 trade
ORDER_ID USER_ID PRICE QUANTITY STATUS TIMESTAMP
100101 111 9.8 10 Cancelled 2022-08-17 12:00:00
100102 111 10 10 Completed 2022-08-17 12:00:00
100259 148 5.1 35 Completed 2022-08-25 12:00:00
100264 148 4.8 40 Completed 2022-08-26 12:00:00
100305 300 10 15 Completed 2022-09-05 12:00:00
100400 178 9.9 15 Completed 2022-09-09 12:00:00
100565 265 25.6 5 Completed 2022-12-19 12:00:00
INPUT - 2 user
USER_ID CITY EMAIL SIGNUP_DATE
111 San Francisco [email protected] 2021-08-03 12:00:00
148 Boston [email protected] 2021-08-20 12:00:00
178 San Francisco [email protected] 2022-01-05 12:00:00
265 Denver [email protected] 2022-02-26 12:00:00
300 San Francisco [email protected] 2022-06-30 12:00:00
OUTPUT
CITY COUNT
San Francisco 3
Boston 2
Denver 1
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,FloatType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the trades
trades_schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("price", FloatType(), True),
StructField("quantity", IntegerType(), True),
StructField("status", StringType(), True),
StructField("timestamp", StringType(), True)
])
# Define the schema for the users
users_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("city", StringType(), True),
StructField("email", StringType(), True),
StructField("signup_date", StringType(), True)
])
# Create an RDD with the data for trades
trades_data = [
(100101, 111, 9.80, 10, 'Cancelled', '2022-08-17 12:00:00'),
(100102, 111, 10.00, 10, 'Completed', '2022-08-17 12:00:00'),
(100259, 148, 5.10, 35, 'Completed', '2022-08-25 12:00:00'),
(100264, 148, 4.80, 40, 'Completed', '2022-08-26 12:00:00'),
(100305, 300, 10.00, 15, 'Completed', '2022-09-05 12:00:00'),
(100400, 178, 9.90, 15, 'Completed', '2022-09-09 12:00:00'),
(100565, 265, 25.60, 5, 'Completed', '2022-12-19 12:00:00')
]
# Create an RDD with the data for users
users_data = [
(111, 'San Francisco', '[email protected]', '2021-08-03 12:00:00'),
(148, 'Boston', '[email protected]', '2021-08-20 12:00:00'),
(178, 'San Francisco', '[email protected]', '2022-01-05 12:00:00'),
(265, 'Denver', '[email protected]', '2022-02-26 12:00:00'),
(300, 'San Francisco', '[email protected]', '2022-06-30 12:00:00')
]
Trade_df=spark.createDataFrame(trades_data,trades_schema)
User_df=spark.createDataFrame(users_data,users_schema)
Trade_df.show()
User_df.show()
join_df = Trade_df.join(User_df, Trade_df['user_id'] == User_df['user_id'], "inner")
join_df.show()
# Top 3 cities by number of completed trades, in descending order
join_df.filter(join_df['status'] == 'Completed') \
.groupby('city').count() \
.orderBy('count', ascending=False) \
.limit(3).show()
PYSPARK LEARNING HUB : DAY - 13
Step - 1 : Problem Statement
13_Page With No Likes
Write a pyspark code to return the IDs of the Facebook pages
that have zero likes. The output should be sorted in
ascending order based on the page IDs.
Difficulty Level : EASY
DataFrame:
# Define the schema for the pages
pages_schema = StructType([
StructField("page_id", IntegerType(), True),
StructField("page_name", StringType(), True)
])
# Define the schema for the page_likes table
page_likes_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("page_id", IntegerType(), True),
StructField("liked_date", StringType(), True)
])
# Create an RDD with the data for pages
pages_data = [
(20001, 'SQL Solutions'),
(20045, 'Brain Exercises'),
(20701, 'Tips for Data Analysts')
]
# Create an RDD with the data for page_likes table
page_likes_data = [
(111, 20001, '2022-04-08 00:00:00'),
(121, 20045, '2022-03-12 00:00:00'),
(156, 20001, '2022-07-25 00:00:00')
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
INPUT - 1 PAGES
PAGE_ID PAGE_NAME
20001 SQL Solutions
20045 Brain Exercises
20701 Tips for Data Analysts
INPUT - 2 PAGE_LIKES
USER_ID PAGE_ID LIKED_DATE
111 20001 2022-04-08 0:00:00
121 20045 2022-03-12 0:00:00
156 20001 2022-07-25 0:00:00
OUTPUT
PAGE_ID
20701
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define the schema for the pages
pages_schema = StructType([
StructField("page_id", IntegerType(), True),
StructField("page_name", StringType(), True)
])
# Define the schema for the page_likes table
page_likes_schema = StructType([
StructField("user_id", IntegerType(), True),
StructField("page_id", IntegerType(), True),
StructField("liked_date", StringType(), True)
])
# Create an RDD with the data for pages
pages_data = [
(20001, 'SQL Solutions'),
(20045, 'Brain Exercises'),
(20701, 'Tips for Data Analysts')
]
# Create an RDD with the data for page_likes table
page_likes_data = [
(111, 20001, '2022-04-08 00:00:00'),
(121, 20045, '2022-03-12 00:00:00'),
(156, 20001, '2022-07-25 00:00:00')
]
page_df = spark.createDataFrame(pages_data, pages_schema)
page_like_df = spark.createDataFrame(page_likes_data, page_likes_schema)
page_df.show()
page_like_df.show()
# Perform a left anti join to get pages with zero likes
zero_likes_pages = page_df.join(page_like_df, 'page_id', 'left_anti')
# Select and sort the result
result = zero_likes_pages.select("page_id").orderBy("page_id")
# Show the result
result.show()
+-------+
|page_id|
+-------+
|  20701|
+-------+
PYSPARK LEARNING HUB : DAY - 14
Step - 1 : Problem Statement
14_Purchasing Activity by Product Type
We have been given a purchasing activity DF and we need
to find the cumulative purchases of each product over
time.
Difficulty Level : EASY
DataFrame:
# Define schema for the DataFrame
schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("product_type", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("order_date", StringType(), True),
])
# Define data
data = [
(213824, 'printer', 20, "2022-06-27"),
(212312, 'hair dryer', 5, "2022-06-28"),
(132842, 'printer', 18, "2022-06-28"),
(284730, 'standing lamp', 8, "2022-07-05")
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
ORDER_ID PRODUCT_TYPE QUANTITY ORDER_DATE
213824 printer 20 2022-06-27
212312 hair dryer 5 2022-06-28
132842 printer 18 2022-06-28
284730 standing lamp 8 2022-07-05
OUTPUT
ORDER_DATE PRODUCT_TYPE CUM_PURCHASED
2022-06-27 printer 20
2022-06-28 hair dryer 5
2022-06-28 printer 38
2022-07-05 standing lamp 8
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql import functions as F
from pyspark.sql.window import Window
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Define schema for the DataFrame
schema = StructType([
StructField("order_id", IntegerType(), True),
StructField("product_type", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("order_date", StringType(), True),
])
# Define data
data = [
(213824, 'printer', 20, "2022-06-27"),
(212312, 'hair dryer', 5, "2022-06-28"),
(132842, 'printer', 18, "2022-06-28"),
(284730, 'standing lamp', 8, "2022-07-05")
]
order_df = spark.createDataFrame(data, schema)
order_df.show()
# Define a Window specification: per product, ordered by order_date, from the first row up to the current row
window_spec = Window.partitionBy("product_type").orderBy("order_date").rowsBetween(Window.unboundedPreceding, 0)
# Add a new column 'cumulative_purchases' representing the cumulative sum
result_df = order_df.withColumn("cumulative_purchases", F.sum("quantity").over(window_spec))
result_df.show()
PYSPARK LEARNING HUB : DAY - 15
Step - 1 : Problem Statement
15_Teams Power Users
Write a pyspark code to identify the top 2 Power Users
who sent the highest number of messages on Microsoft
Teams in August 2022. Display the IDs of these 2 users
along with the total number of messages they sent.
Output the results in descending order based on the
count of the messages.
Difficulty Level : EASY
DataFrame:
schema = StructType([
StructField("message_id", IntegerType(), True),
StructField("sender_id", IntegerType(), True),
StructField("receiver_id", IntegerType(), True),
StructField("content", StringType(), True),
StructField("sent_date", StringType(), True),
])
# Define the data
data = [
(901, 3601, 4500, 'You up?', '2022-08-03 00:00:00'),
(902, 4500, 3601, 'Only if you\'re buying', '2022-08-03 00:00:00'),
(743, 3601, 8752, 'Let\'s take this offline', '2022-06-14 00:00:00'),
(922, 3601, 4500, 'Get on the call', '2022-08-10 00:00:00'),
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
MESSAGE_ID SENDER_ID RECEIVER_ID CONTENT SENT_DATE
901 3601 4500 You up? 2022-08-03 0:00:00
902 4500 3601 Only if you're buying 2022-08-03 0:00:00
743 3601 8752 Let's take this offline 2022-06-14 0:00:00
922 3601 4500 Get on the call 2022-08-10 0:00:00
OUTPUT
SENDER_ID COUNT(*)
3601 2
4500 1
Step - 3 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
schema = StructType([
StructField("message_id", IntegerType(), True),
StructField("sender_id", IntegerType(), True),
StructField("receiver_id", IntegerType(), True),
StructField("content", StringType(), True),
StructField("sent_date", StringType(), True),
])
# Define the data
data = [
(901, 3601, 4500, 'You up?', '2022-08-03 00:00:00'),
(902, 4500, 3601, 'Only if you\'re buying', '2022-08-03 00:00:00'),
(743, 3601, 8752, 'Let\'s take this offline', '2022-06-14 00:00:00'),
(922, 3601, 4500, 'Get on the call', '2022-08-10 00:00:00'),
]
teams_df = spark.createDataFrame(data,schema)
teams_df.show()
filter_df=teams_df.filter(teams_df['sent_date'].like("2022-08%"))
filter_df.show()
result_df = filter_df.groupby(filter_df['sender_id']).count()
result_df = result_df.orderBy(result_df['count'].desc()).limit(2)
result_df.show()
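The like("2022-08%") filter works because sent_date is stored as an ISO-formatted string. A sketch that filters on real date parts instead (to_timestamp, year, and month are this sketch's additions):
from pyspark.sql.functions import to_timestamp, year, month, col
ts_df = teams_df.withColumn("sent_ts", to_timestamp("sent_date"))
ts_df.filter((year("sent_ts") == 2022) & (month("sent_ts") == 8)) \
.groupBy("sender_id").count() \
.orderBy(col("count").desc()).limit(2).show()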
PYSPARK LEARNING HUB : DAY - 16
Step - 1 : Problem Statement
16_Select in pyspark
Write a pyspark code to perform the below functions
● Write a pyspark code to get all employee details.
● Write a pyspark code to get only the "FirstName" column from emp_df.
● Write a pyspark code to get FirstName in upper case as "First Name".
● Write a pyspark code to get FirstName in lower case.
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType
from pyspark.sql.functions import col, upper
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
#1. Write a pyspark code to get all employee detail
emp_df.show()
# 2. Write a query to get only "FirstName" column from emp_df
# Method 1
emp_df.select("First_Name").show()
# Method 2
emp_df.select(col("First_Name")).show()
# Method 3
emp_df.createOrReplaceTempView("emp_table")
spark.sql("select First_Name from emp_table").show()
# 3. Write a Pyspark code to get FirstName in upper case as "First Name"
emp_df.select(upper("First_Name").alias("First Name")).show()
#4. Write a pyspark code to get FirstName in lower case
from pyspark.sql.functions import lower
emp_df.select(lower("First_Name")).show()
PYSPARK LEARNING HUB : DAY - 17
Step - 1 : Problem Statement
17_Select in pyspark
Write a pyspark code to perform the below functions
● Combine FirstName and LastName and display it as "Name"
(also include a white space between first name & last name)
● Select employee detail whose name is "Vikas"
● Get all employee detail from EmployeeDetail table whose
"FirstName" starts with letter 'a'.
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
# 1. Combine FirstName and LastName and display it as "Name" (with a white space between first name & last name)
from pyspark.sql.functions import concat_ws
emp_df.select(concat_ws(" ", "First_Name", "Last_Name").alias("Name")).show()
# 2. Select employee detail whose name is "Vikas"
# Method 1
from pyspark.sql.functions import col
emp_df.filter(col("First_Name") == 'Vikas').show(truncate=False)
# Method 2
emp_df.filter(emp_df.First_Name == 'Vikas').show(truncate=False)
# Method 3
emp_df.filter(emp_df['First_Name'] == 'Vikas').show(truncate=False)
# Method 4
emp_df.where(emp_df['First_Name'] == 'Vikas').show(truncate=False)
● Get all employee detail from EmployeeDetail table whose
"FirstName" starts with letter 'a'.
# Method 1
from pyspark.sql.functions import lower
emp_df.filter(lower(emp_df['First_Name']).like("a%")).show()
# Method 2
emp_df.filter((emp_df['First_Name'].like("a%")) | (emp_df['First_Name'].like("A%"))).show()
PYSPARK LEARNING HUB : DAY - 18
Step - 1 : Problem Statement
18_Select in pyspark
Write a pyspark code to perform the below functions
● Get all employee details from EmployeeDetail table whose
"FirstName" contains 'k'
● Get all employee details from EmployeeDetail table whose
"FirstName" ends with 'h'
● Get all employee details from EmployeeDetail table whose
"FirstName" starts with any single character between 'a-p'
● Get all employee details from EmployeeDetail table whose
"FirstName" does NOT start with any single character between 'a-p'
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df = spark.createDataFrame(data, schema)
# 1. Get all employee details from EmployeeDetail table whose "FirstName" contains 'k'
from pyspark.sql.functions import col
emp_df.filter(emp_df["First_Name"].like("%k%")).show()
# 2. Get all employee details from EmployeeDetail table whose "FirstName" ends with 'h'
emp_df.filter(emp_df["First_Name"].like("%h")).show()
# 3. Get all employee details whose "FirstName" starts with any single character between 'a-p'
emp_df.filter(emp_df["First_Name"].rlike("^[a-pA-P]")).show()
# 4. Get all employee details whose "FirstName" does NOT start with any single character between 'a-p'
emp_df.filter(~(emp_df["First_Name"].rlike("^[a-pA-P]"))).show()
PYSPARK LEARNING HUB : DAY - 19
Step - 1 : Problem Statement
19_Select in pyspark
Write a pyspark code to perform the below functions
● Get all employee detail from emp_df whose "Gender" ends
with 'le' and contains 4 letters. The underscore (_) wildcard
character represents any single character.
● Get all employee detail from EmployeeDetail table whose
"FirstName" starts with 'A' and contains 5 letters.
● Get all unique "Department" from EmployeeDetail table.
● Get the highest "Salary" from EmployeeDetail table.
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
# 1. Get all employee detail whose "Gender" ends with 'le' and contains 4 letters
#    (the underscore (_) wildcard represents any single character)
# 2. Get all employee detail whose "FirstName" starts with 'A' and contains 5 letters
# 3. Get all unique "Department" from EmployeeDetail table
# 4. Get the highest "Salary" from EmployeeDetail table
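The original solution screenshots did not survive extraction here; a minimal sketch of the four queries (the like() patterns and the spark_max alias are this sketch's own choices):
# 1. '_' matches exactly one character, so "__le" means exactly 4 characters ending in 'le'
emp_df.filter(emp_df["Gender"].like("__le")).show()
# 2. "A____" means 'A' followed by exactly 4 more characters
emp_df.filter(emp_df["First_Name"].like("A____")).show()
# 3. Unique departments
emp_df.select("Department").distinct().show()
# 4. Highest salary
from pyspark.sql.functions import max as spark_max
emp_df.select(spark_max("Salary").alias("Highest_Salary")).show()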
PYSPARK LEARNING HUB : DAY - 20
Step - 1 : Problem Statement
20_Date in pyspark
Write a pyspark code to perform the below functions
● Get the lowest "Salary" from EmployeeDetail table.
● Show "JoiningDate" in "dd mmm yyyy" format, ex- "15 Feb
2013"
● Show "JoiningDate" in "yyyy/mm/dd" format, ex- "2013/02/15"
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of employee rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
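One possible solution (a minimal sketch, assuming emp_df as created above; the 'dd MMM yyyy' and 'yyyy/MM/dd' strings follow Spark's datetime pattern syntax):

from pyspark.sql import functions as F

# Lowest salary
emp_df.agg(F.min('Salary').alias('Lowest_Salary')).show()

# Joining_Date as "dd MMM yyyy", e.g. "15 Feb 2013"
emp_df.select('First_Name',
    F.date_format(F.to_timestamp('Joining_Date'), 'dd MMM yyyy').alias('Joining_Date')).show()

# Joining_Date as "yyyy/MM/dd", e.g. "2013/02/15"
emp_df.select('First_Name',
    F.date_format(F.to_timestamp('Joining_Date'), 'yyyy/MM/dd').alias('Joining_Date')).show()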
PYSPARK LEARNING HUB : DAY - 21
Step - 1 : Problem Statement
21_Date in pyspark
Write a pyspark code to perform the functions below:
● Get only the Year part of "Joining_Date".
● Get only the Month part of "Joining_Date".
● Get only the Date part of "Joining_Date".
● Get the current system date using the DataFrame API.
● Get the current UTC date and time using the DataFrame API.
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve
the problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of employee rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
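One possible solution (a minimal sketch, assuming emp_df as created above; the timezone passed to to_utc_timestamp is an assumption, adjust it to your session timezone):

from pyspark.sql import functions as F

jd = F.to_timestamp('Joining_Date')

# Year, Month and Day parts of Joining_Date
emp_df.select('First_Name',
    F.year(jd).alias('Joining_Year'),
    F.month(jd).alias('Joining_Month'),
    F.dayofmonth(jd).alias('Joining_Day')).show()

# Current system date
emp_df.select(F.current_date().alias('today')).show(1)

# Current UTC date and time (assumes the session timezone is 'Asia/Kolkata')
emp_df.select(
    F.to_utc_timestamp(F.current_timestamp(), 'Asia/Kolkata').alias('utc_now')
).show(1, truncate=False)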
PYSPARK LEARNING HUB : DAY - 22
Step - 1 : Problem Statement
22_Date in pyspark
Write a pyspark code to perform the functions below:
● Get the first name, current date, joining date, and the difference between current date and joining date in months.
● Get the first name, current date, joining date, and the difference between current date and joining date in days.
● Get all employee details from EmployeeDetail table whose joining year is 2013.
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve the
problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of employee rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
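One possible solution (a minimal sketch, assuming emp_df as created above):

from pyspark.sql import functions as F

jd = F.to_date(F.to_timestamp('Joining_Date'))

# Difference between current date and joining date in months
emp_df.select('First_Name', F.current_date().alias('Current_Date'), jd.alias('Joining_Date'),
    F.round(F.months_between(F.current_date(), jd), 1).alias('Diff_Months')).show()

# Difference between current date and joining date in days
emp_df.select('First_Name', F.current_date().alias('Current_Date'), jd.alias('Joining_Date'),
    F.datediff(F.current_date(), jd).alias('Diff_Days')).show()

# Employees whose joining year is 2013
emp_df.filter(F.year(jd) == 2013).show()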
PYSPARK LEARNING HUB : DAY - 23
Step - 1 : Problem Statement
23_Date in pyspark
Write a pyspark code to perform the functions below:
● Get all employee details from EmployeeDetail table whose joining month is Jan (1).
● Get all employee details from EmployeeDetail table whose joining date is between "2013-01-01" and "2013-12-01".
● Get how many employees exist in the "EmployeeDetail" table.
● Select all employee details with First name "Vikas", "Ashish", and "Nikhil".
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve the
problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of employee rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
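One possible solution (a minimal sketch, assuming emp_df as created above):

from pyspark.sql import functions as F

jd = F.to_timestamp('Joining_Date')

# Employees whose joining month is January
emp_df.filter(F.month(jd) == 1).show()

# Employees whose joining date is between 2013-01-01 and 2013-12-01
emp_df.filter(F.to_date(jd).between('2013-01-01', '2013-12-01')).show()

# How many employees exist
print(emp_df.count())

# Employees with First_Name in Vikas, Ashish, Nikhil
emp_df.filter(F.col('First_Name').isin('Vikas', 'Ashish', 'Nikhil')).show()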
PYSPARK LEARNING HUB : DAY - 24
Step - 1 : Problem Statement
24_Trim and case in pyspark
Write a pyspark code to perform the functions below:
● Select all employee details with First name not in "Vikas", "Ashish", and "Nikhil".
● Select first name from the "EmployeeDetail" df after removing white spaces from the right side.
● Select first name from the "EmployeeDetail" df after removing white spaces from the left side.
● Display first name and Gender as M/F (if Male then M, if Female then F).
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve the
problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of employee rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
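One possible solution (a minimal sketch, assuming emp_df as created above):

from pyspark.sql import functions as F

# First_Name not in Vikas, Ashish, Nikhil
emp_df.filter(~F.col('First_Name').isin('Vikas', 'Ashish', 'Nikhil')).show()

# Remove white spaces from the right / left of First_Name
emp_df.select(F.rtrim('First_Name').alias('First_Name')).show()
emp_df.select(F.ltrim('First_Name').alias('First_Name')).show()

# Gender as M/F
emp_df.select('First_Name',
    F.when(F.col('Gender') == 'Male', 'M')
     .when(F.col('Gender') == 'Female', 'F')
     .otherwise(F.col('Gender')).alias('Gender')).show()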
PYSPARK LEARNING HUB : DAY - 25
Step - 1 : Problem Statement
25_operator in pyspark
Write a pyspark code to perform the functions below:
● Select first name from "EmployeeDetail" table prefixed with "Hello ".
● Get employee details from "EmployeeDetail" table whose Salary is greater than 600000.
● Get employee details from "EmployeeDetail" table whose Salary is less than 700000.
● Get employee details from "EmployeeDetail" table whose Salary is between 500000 and 600000.
● Select the second highest salary from "EmployeeDetail" table.
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve the
problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of employee rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
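One possible solution (a minimal sketch, assuming emp_df as created above):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# First_Name prefixed with "Hello "
emp_df.select(F.concat(F.lit('Hello '), F.col('First_Name')).alias('Greeting')).show()

# Salary comparisons
emp_df.filter(F.col('Salary') > 600000).show()
emp_df.filter(F.col('Salary') < 700000).show()
emp_df.filter(F.col('Salary').between(500000, 600000)).show()

# Second highest salary: dense_rank over salary descending
# (an unpartitioned window is fine for a tiny df, but it pulls all rows to one partition)
w = Window.orderBy(F.col('Salary').desc())
emp_df.withColumn('rnk', F.dense_rank().over(w)) \
      .filter(F.col('rnk') == 2).select('Salary').show()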
PYSPARK LEARNING HUB : DAY - 26
Step - 1 : Problem Statement
26_groupby in pyspark
Write a pyspark code to perform the functions below:
● Write the query to get the department and the department wise total (sum) salary from "EmployeeDetail" table.
● Write the query to get the department and the department wise total (sum) salary, displayed in ascending order of salary.
● Write the query to get the department and the department wise total (sum) salary, displayed in descending order of salary.
● Write the query to get the department, the employee count per department, and the total (sum) salary with respect to department from "EmployeeDetail" table.
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve the
problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of employee rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
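One possible solution (a minimal sketch, assuming emp_df as created above):

from pyspark.sql import functions as F

# Department-wise total salary
dept_salary = emp_df.groupBy('Department').agg(F.sum('Salary').alias('Total_Salary'))
dept_salary.show()

# Ascending / descending by total salary
dept_salary.orderBy('Total_Salary').show()
dept_salary.orderBy(F.col('Total_Salary').desc()).show()

# Department, employee count and total salary per department
emp_df.groupBy('Department').agg(
    F.count('*').alias('Emp_Count'),
    F.sum('Salary').alias('Total_Salary')).show()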
PYSPARK LEARNING HUB : DAY - 27
Step - 1 : Problem Statement
27_groupby in pyspark
Write a pyspark code to perform the functions below:
● 46. Get department wise average salary from "EmployeeDetail" table, ordered by salary ascending.
● 47. Get department wise maximum salary from "EmployeeDetail" table, ordered by salary ascending.
● 48. Get department wise minimum salary from "EmployeeDetail" table, ordered by salary ascending.
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
Step - 2 : Writing the pyspark code to solve the
problem
# Creating Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
#creating spark session
spark = SparkSession. \
builder. \
config('spark.shuffle.useOldFetchProtocol', 'true'). \
config('spark.ui.port','0'). \
config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
enableHiveSupport(). \
master('yarn'). \
getOrCreate()
# Create a list of employee rows
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
emp_df=spark.createDataFrame(data,schema)
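One possible solution (a minimal sketch, assuming emp_df as created above):

from pyspark.sql import functions as F

# 46. Department-wise average salary, ordered ascending
emp_df.groupBy('Department').agg(F.avg('Salary').alias('Avg_Salary')).orderBy('Avg_Salary').show()

# 47. Department-wise maximum salary, ordered ascending
emp_df.groupBy('Department').agg(F.max('Salary').alias('Max_Salary')).orderBy('Max_Salary').show()

# 48. Department-wise minimum salary, ordered ascending
emp_df.groupBy('Department').agg(F.min('Salary').alias('Min_Salary')).orderBy('Min_Salary').show()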
PYSPARK LEARNING HUB : DAY - 28
Step - 1 : Problem Statement
28_Join_in_pyspark
Write a pyspark code to perform the functions below:
● Write down the query to fetch the project names assigned to more than one employee.
● Get employee name and project name, ordered by first name, from "EmployeeDetail" and "ProjectDetail" for those employees who already have a project assigned.
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])
# Create the data as a list of tuples
pro_data = [
(1, 1, "Task Track"),
(2, 1, "CLP"),
(3, 1, "Survey Management"),
(4, 2, "HR Management"),
(5, 3, "Task Track"),
(6, 3, "GRS"),
(7, 3, "DDS"),
(8, 4, "HR Management"),
(9, 6, "GL Management")
]
Step - 2 : Writing the pyspark code to solve the
problem
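One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schemas from Step 1; emp_df and pro_df are names chosen here):

from pyspark.sql import functions as F

emp_df = spark.createDataFrame(data, schema)
pro_df = spark.createDataFrame(pro_data, pro_schema)

# Project names assigned to more than one employee
pro_df.groupBy('Project_Name').count() \
      .filter(F.col('count') > 1).select('Project_Name').show()

# Employee name and project name (an inner join keeps only employees
# that already have a project), ordered by first name
emp_df.join(pro_df, emp_df.EmployeeID == pro_df.Employee_DetailID) \
      .select('First_Name', 'Last_Name', 'Project_Name') \
      .orderBy('First_Name').show()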
PYSPARK LEARNING HUB : DAY - 29
Step - 1 : Problem Statement
29_Join_in_pyspark
Write a pyspark code to perform the functions below:
● 52. Get employee name and project name, ordered by first name, from "EmployeeDetail" and "ProjectDetail" for all employees, even those with no project assigned.
● 53. Get employee name and project name, ordered by first name, from "EmployeeDetail" and "ProjectDetail" for all employees; if no project is assigned, display "-No Project Assigned".
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])
# Create the data as a list of tuples
pro_data = [
(1, 1, "Task Track"),
(2, 1, "CLP"),
(3, 1, "Survey Management"),
(4, 2, "HR Management"),
(5, 3, "Task Track"),
(6, 3, "GRS"),
(7, 3, "DDS"),
(8, 4, "HR Management"),
(9, 6, "GL Management")
]
Step - 2 : Writing the pyspark code to solve the
problem
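One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schemas from Step 1; emp_df and pro_df are names chosen here):

from pyspark.sql import functions as F

emp_df = spark.createDataFrame(data, schema)
pro_df = spark.createDataFrame(pro_data, pro_schema)

# 52. A left join keeps every employee, with a null Project_Name when nothing is assigned
joined = emp_df.join(pro_df, emp_df.EmployeeID == pro_df.Employee_DetailID, 'left')
joined.select('First_Name', 'Last_Name', 'Project_Name').orderBy('First_Name').show()

# 53. Replace the nulls with "-No Project Assigned"
joined.select('First_Name', 'Last_Name',
    F.coalesce('Project_Name', F.lit('-No Project Assigned')).alias('Project_Name')) \
    .orderBy('First_Name').show()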
PYSPARK LEARNING HUB : DAY - 30
Step - 1 : Problem Statement
30_Join_in_pyspark
Write a pyspark code to perform the functions below:
● 56. Write a pyspark code to find the employee names who have no project assigned, and display "-No Project Assigned" (tables: [EmployeeDetail], [ProjectDetail]).
● 57. Write a pyspark code to find the project names which are not assigned to any employee (tables: [EmployeeDetail], [ProjectDetail]).
Difficulty Level : EASY
DataFrame:
data = [
[1, "Vikas", "Ahlawat", 600000.0, "2013-02-15 11:16:28.290", "IT", "Male"],
[2, "nikita", "Jain", 530000.0, "2014-01-09 17:31:07.793", "HR", "Female"],
[3, "Ashish", "Kumar", 1000000.0, "2014-01-09 10:05:07.793", "IT", "Male"],
[4, "Nikhil", "Sharma", 480000.0, "2014-01-09 09:00:07.793", "HR", "Male"],
[5, "anish", "kadian", 500000.0, "2014-01-09 09:31:07.793", "Payroll", "Male"],
]
# Create a schema for the DataFrame
schema = StructType([
StructField("EmployeeID", IntegerType(), True),
StructField("First_Name", StringType(), True),
StructField("Last_Name", StringType(), True),
StructField("Salary", DoubleType(), True),
StructField("Joining_Date", StringType(), True),
StructField("Department", StringType(), True),
StructField("Gender", StringType(), True)
])
pro_schema = StructType([
StructField("Project_DetailID", IntegerType(), True),
StructField("Employee_DetailID", IntegerType(), True),
StructField("Project_Name", StringType(), True)
])
# Create the data as a list of tuples
pro_data = [
(1, 1, "Task Track"),
(2, 1, "CLP"),
(3, 1, "Survey Management"),
(4, 2, "HR Management"),
(5, 3, "Task Track"),
(6, 3, "GRS"),
(7, 3, "DDS"),
(8, 4, "HR Management"),
(9, 6, "GL Management")
]
Step - 2 : Writing the pyspark code to solve the
problem
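One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schemas from Step 1; emp_df and pro_df are names chosen here):

from pyspark.sql import functions as F

emp_df = spark.createDataFrame(data, schema)
pro_df = spark.createDataFrame(pro_data, pro_schema)

# 56. Employees with no project (left_anti keeps only rows with no match on the right)
emp_df.join(pro_df, emp_df.EmployeeID == pro_df.Employee_DetailID, 'left_anti') \
      .select('First_Name', 'Last_Name',
              F.lit('-No Project Assigned').alias('Project_Name')).show()

# 57. Projects not assigned to any existing employee
pro_df.join(emp_df, pro_df.Employee_DetailID == emp_df.EmployeeID, 'left_anti') \
      .select('Project_Name').show()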
PYSPARK LEARNING HUB : DAY - 31
Step - 1 : Problem Statement
31_Histogram of Tweets
Write a pyspark code to obtain a histogram of tweets posted per user in 2022. Output the tweet count per user as the bucket and the number of Twitter users who fall into that bucket.
In other words, group the users by the number of tweets they posted in 2022 and count the number of users in each group.
Difficulty Level : EASY
DataFrame:
schema = StructType([
StructField("tweet_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("msg", StringType(), True),
StructField("tweet_date", StringType(), True)
])
# Define the data
data = [
(214252, 111, 'Am considering taking Tesla private at $420. Funding secured.', '2021-12-30 00:00:00'),
(739252, 111, 'Despite the constant negative press covfefe', '2022-01-01 00:00:00'),
(846402, 111, 'Following @NickSinghTech on Twitter changed my life!', '2022-02-14 00:00:00'),
(241425, 254, 'If the salary is so competitive why won’t you tell me what it is?', '2022-03-01 00:00:00'),
(231574, 148, 'I no longer have a manager. I can\'t be managed', '2022-03-23 00:00:00')
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
TWEET_ID  USER_ID  MSG                                                                 TWEET_DATE
214252    111      Am considering taking Tesla private at $420. Funding secured.      2021-12-30 00:00:00
739252    111      Despite the constant negative press covfefe                        2022-01-01 00:00:00
846402    111      Following @NickSinghTech on Twitter changed my life!               2022-02-14 00:00:00
241425    254      If the salary is so competitive why won’t you tell me what it is?  2022-03-01 00:00:00
231574    148      I no longer have a manager. I can't be managed                     2022-03-23 00:00:00
OUTPUT
BUCKET  USER_NUM
1       2
2       1
Step - 3 : Writing the pyspark code to solve
the problem
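One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schema from Step 1):

from pyspark.sql import functions as F

df = spark.createDataFrame(data, schema)

# Tweets posted per user in 2022
per_user = df.filter(F.year(F.to_timestamp('tweet_date')) == 2022) \
             .groupBy('user_id').agg(F.count('*').alias('bucket'))

# Number of users falling into each tweet-count bucket
per_user.groupBy('bucket').agg(F.count('*').alias('user_num')).orderBy('bucket').show()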
PYSPARK LEARNING HUB : DAY - 32
Step - 1 : Problem Statement
32_pyspark_transformation
Write a pyspark code to transform the DataFrame to
display each student's marks in Math and English as
separate columns.
Difficulty Level : EASY
DataFrame:
data=[
('Rudra','math',79),
('Rudra','eng',60),
('Shivu','math', 68),
('Shivu','eng', 59),
('Anu','math', 65),
('Anu','eng',80)
]
schema = StructType([
StructField("Name", StringType(), True),
StructField("Sub", StringType(), True),
StructField("Marks", IntegerType(), True)
])
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
NAME   SUB   MARKS
Rudra  math  79
Rudra  eng   60
Shivu  math  68
Shivu  eng   59
Anu    math  65
Anu    eng   80
OUTPUT
NAME   MATH  ENG
Rudra  79    60
Shivu  68    59
Anu    65    80
Step - 3 : Writing the pyspark code to solve
the problem
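One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schema from Step 1):

df = spark.createDataFrame(data, schema)

# Pivot the Sub column into separate math/eng columns; sum() is a safe
# aggregate here because each (Name, Sub) pair appears exactly once
df.groupBy('Name').pivot('Sub', ['math', 'eng']).sum('Marks').show()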
PYSPARK LEARNING HUB : DAY - 33
Step - 1 : Problem Statement
33_Hobbies Data Transformation
Transform a dataset with individuals' names and
associated hobbies into a new format using PySpark.
Convert the comma-separated hobbies into separate
rows, creating a DataFrame with individual rows for
each person and their respective hobbies.
Difficulty Level : EASY
DataFrame:
# Sample input data
data = [("Alice", "badminton,tennis"),
("Bob", "tennis,cricket"),
("Julie", "cricket,carroms")]
# Create a DataFrame
df = spark.createDataFrame(data, ["name", "hobbies"])
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
NAME   HOBBIES
Alice  badminton,tennis
Bob    tennis,cricket
Julie  cricket,carroms
OUTPUT
NAME   HOBBY
Alice  badminton
Alice  tennis
Bob    tennis
Bob    cricket
Julie  cricket
Julie  carroms
Step - 3 : Writing the pyspark code to solve
the problem
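One possible solution (a minimal sketch, using the df created in Step 1):

from pyspark.sql import functions as F

# Split the comma-separated hobbies into an array, then explode it into one row per hobby
df.select('name', F.explode(F.split('hobbies', ',')).alias('hobby')).show()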
PYSPARK LEARNING HUB : DAY - 34
Step - 1 : Problem Statement
34_ Histogram of Tweets
Write a pyspark code to obtain a histogram of tweets posted per user in 2022. Output the tweet count per user as the bucket and the number of Twitter users who fall into that bucket. In other words, group the users by the number of tweets they posted in 2022 and count the number of users in each group.
Difficulty Level : EASY
DataFrame:
# Define the schema for the tweets DataFrame
schema = StructType([
StructField("tweet_id", IntegerType(), True),
StructField("user_id", IntegerType(), True),
StructField("msg", StringType(), True),
StructField("tweet_date", StringType(), True)
])
# Create the tweets DataFrame
data = [
(214252, 111, 'Am considering taking Tesla private at $420. Funding secured.', '2021-12-30 00:00:00'),
(739252, 111, 'Despite the constant negative press covfefe', '2022-01-01 00:00:00'),
(846402, 111, 'Following @NickSinghTech on Twitter changed my life!', '2022-02-14 00:00:00'),
(241425, 254, 'If the salary is so competitive why won’t you tell me what it is?', '2022-03-01 00:00:00'),
(231574, 148, 'I no longer have a manager. I can\'t be managed', '2022-03-23 00:00:00')
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
TWEET_ID  USER_ID  MSG                                                                 TWEET_DATE
214252    111      Am considering taking Tesla private at $420. Funding secured.      2021-12-30 00:00:00
739252    111      Despite the constant negative press covfefe                        2022-01-01 00:00:00
846402    111      Following @NickSinghTech on Twitter changed my life!               2022-02-14 00:00:00
241425    254      If the salary is so competitive why won’t you tell me what it is?  2022-03-01 00:00:00
231574    148      I no longer have a manager. I can't be managed                     2022-03-23 00:00:00
OUTPUT
BUCKET  USER_NUM
1       2
2       1
Step - 3 : Writing the pyspark code to solve
the problem
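A Spark SQL variant this time (a minimal sketch, assuming a SparkSession named spark and the data/schema from Step 1; the temp view name tweets is an arbitrary choice):

df = spark.createDataFrame(data, schema)
df.createOrReplaceTempView('tweets')

spark.sql("""
    SELECT tweet_count AS bucket, COUNT(*) AS user_num
    FROM (
        SELECT user_id, COUNT(*) AS tweet_count
        FROM tweets
        WHERE year(to_timestamp(tweet_date)) = 2022
        GROUP BY user_id
    ) AS t
    GROUP BY tweet_count
    ORDER BY bucket
""").show()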
PYSPARK LEARNING HUB : DAY - 35
Step - 1 : Problem Statement
35_Classes More Than 5 Students
Write a pyspark code to find all the classes that have at least five students. Return the result table in any order.
Difficulty Level : EASY
DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("StudentID", StringType(), True),
StructField("ClassName", StringType(), True)
])
# Data to be inserted into the DataFrame
data = [
('A', 'Math'),
('B', 'English'),
('C', 'Math'),
('D', 'Biology'),
('E', 'Math'),
('F', 'Computer'),
('G', 'Math'),
('H', 'Math'),
('I', 'Math')
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
STUDENTID  CLASSNAME
A          Math
B          English
C          Math
D          Biology
E          Math
F          Computer
G          Math
H          Math
I          Math
OUTPUT
CLASSNAME
Math
Step - 3 : Writing the pyspark code to solve
the problem
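One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schema from Step 1):

from pyspark.sql import functions as F

df = spark.createDataFrame(data, schema)

# Classes with at least five students
df.groupBy('ClassName').count().filter(F.col('count') >= 5).select('ClassName').show()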
PYSPARK LEARNING HUB : DAY - 36
Step - 1 : Problem Statement
36_Rank Scores Problem
Write a pyspark code to rank scores. If there is a tie between two scores, both should have the same ranking. Note that after a tie, the next ranking number should be the next consecutive integer value. In other words, there should be no "holes" between ranks.
Difficulty Level : MED
DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("Id", IntegerType(), True),
StructField("Score", FloatType(), True)
])
# Data to be inserted into the DataFrame
data = [
(1, 3.50),
(2, 3.65),
(3, 4.00),
(4, 3.85),
(5, 4.00),
(6, 3.65)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
ID  SCORE
1   3.50
2   3.65
3   4.00
4   3.85
5   4.00
6   3.65
OUTPUT
SCORE  RANK
4.00   1
4.00   1
3.85   2
3.65   3
3.65   3
3.50   4
Step - 3 : Writing the pyspark code to solve
the problem
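One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schema from Step 1):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(data, schema)

# dense_rank leaves no "holes" after ties, which is exactly the required behaviour
w = Window.orderBy(F.col('Score').desc())
df.withColumn('Rank', F.dense_rank().over(w)).select('Score', 'Rank').show()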
PYSPARK LEARNING HUB : DAY - 37
Step - 1 : Problem Statement
37_Triangle Judgement Problem
A pupil, Tim, gets homework to identify whether three line segments could possibly form a triangle. However, this assignment is very heavy because there are hundreds of records to calculate. Could you help Tim by writing a pyspark code to judge whether these three sides can form a triangle, assuming the df triangle holds the lengths of the three sides x, y and z?
Difficulty Level : EASY
DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("x", IntegerType(), True),
StructField("y", IntegerType(), True),
StructField("z", IntegerType(), True)
])
# Data to be inserted into the DataFrame
data = [
(13, 15, 30),
(10, 20, 15)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
X   Y   Z
13  15  30
10  20  15
OUTPUT
X   Y   Z   TRIANGLE
13  15  30  No
10  20  15  Yes
Step - 3 : Writing the pyspark code to solve
the problem
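One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schema from Step 1):

from pyspark.sql import functions as F

df = spark.createDataFrame(data, schema)

# Triangle inequality: every pair of sides must sum to more than the third side
df.withColumn('triangle',
    F.when((F.col('x') + F.col('y') > F.col('z')) &
           (F.col('x') + F.col('z') > F.col('y')) &
           (F.col('y') + F.col('z') > F.col('x')), 'Yes')
     .otherwise('No')).show()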
PYSPARK LEARNING HUB : DAY - 38
Step - 1 : Problem Statement
38_Biggest Single Number Problem
The df contains many numbers in column num, including duplicated ones. Can you write a pyspark code to find the biggest number that appears only once?
Difficulty Level : EASY
DataFrame:
# Define the schema for the DataFrame
schema = StructType([StructField("num", IntegerType(), True)])
# Your data
data = [(8,), (8,), (3,), (3,), (1,), (4,), (5,), (6,)]
# Create a PySpark DataFrame
df = spark.createDataFrame(data, schema=schema)
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
NUM
8
8
3
3
1
4
5
6
OUTPUT
NUM
6
Step - 3 : Writing the pyspark code to solve
the problem
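One possible solution (a minimal sketch, using the df created in Step 1):

from pyspark.sql import functions as F

# Keep the numbers that appear exactly once, then take the largest of them
df.groupBy('num').count().filter(F.col('count') == 1) \
  .agg(F.max('num').alias('num')).show()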
PYSPARK LEARNING HUB : DAY - 39
Step - 1 : Problem Statement
39_Not Boring Movies Problem
X city opened a new cinema and many people would like to go to it. The cinema also gives out a poster indicating the movies' ratings and descriptions. Please write a pyspark code to output movies with an odd-numbered ID and a description that is not 'boring'. Order the result by rating.
Difficulty Level : EASY
DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("id", IntegerType(), True),
StructField("movie", StringType(), True),
StructField("description", StringType(), True),
StructField("rating", FloatType(), True)
])
# Your data
data = [
(1, "War", "great 3D", 8.9),
(2, "Science", "fiction", 8.5),
(3, "Irish", "boring", 6.2),
(4, "Ice song", "Fantasy", 8.6),
(5, "House card", "Interesting", 9.1)
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
ID  MOVIE       DESCRIPTION  RATING
1   War         great 3D     8.9
2   Science     fiction      8.5
3   Irish       boring       6.2
4   Ice song    Fantasy      8.6
5   House card  Interesting  9.1
OUTPUT
ID  MOVIE       DESCRIPTION  RATING
5   House card  Interesting  9.1
1   War         great 3D     8.9
Step - 3 : Writing the pyspark code to solve
the problem
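One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schema from Step 1; descending rating order is assumed, matching the expected output above):

from pyspark.sql import functions as F

df = spark.createDataFrame(data, schema)

# Odd-numbered id and a description that is not 'boring', ordered by rating descending
df.filter((F.col('id') % 2 == 1) & (F.col('description') != 'boring')) \
  .orderBy(F.col('rating').desc()).show()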
PYSPARK LEARNING HUB : DAY - 40
Step - 1 : Problem Statement
40_Swap Gender Problem
Given a df salary, such as the one below, that has m=male and f=female values, swap all f and m values (i.e., change all f values to m and vice versa).
Difficulty Level : EASY
DataFrame:
# Define the schema for the DataFrame
schema = StructType([
StructField("id", IntegerType(), True),
StructField("name", StringType(), True),
StructField("sex", StringType(), True),
StructField("salary", IntegerType(), True),
])
# Define the data
data = [
(1, "A", "m", 2500),
(2, "B", "f", 1500),
(3, "C", "m", 5500),
(4, "D", "f", 500),
]
Step - 2 : Identifying The Input Data And Expected
Output
INPUT
ID  NAME  SEX  SALARY
1   A     m    2500
2   B     f    1500
3   C     m    5500
4   D     f    500
OUTPUT
ID  NAME  SEX  SALARY
1   A     f    2500
2   B     m    1500
3   C     f    5500
4   D     m    500
Step - 3 : Writing the pyspark code to solve
the problem
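One possible solution (a minimal sketch, assuming a SparkSession named spark and the data/schema from Step 1):

from pyspark.sql import functions as F

df = spark.createDataFrame(data, schema)

# Swap 'm' and 'f' in the sex column, leaving any other value unchanged
df.withColumn('sex',
    F.when(F.col('sex') == 'm', 'f')
     .when(F.col('sex') == 'f', 'm')
     .otherwise(F.col('sex'))).show()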