The document outlines a PySpark program to identify pairs of actors and directors who have collaborated at least three times. It includes the problem statement, input data structure, expected output, and the PySpark code to achieve the solution. The final output shows that actor 1 has worked with director 1 three times.
PYSPARK LEARNING HUB

WWW.LINKEDIN.COM/IN/AKASHMAHINDRAKAR

Step - 1 : Problem Statement

Actors and Directors Who Cooperated At Least Three Times

Write a PySpark program for a report that provides the pairs
(actor_id, director_id) where the actor has cooperated with
the director at least 3 times.

Difficulty Level : EASY


DataFrame:
schema = StructType([
    StructField("ActorId", IntegerType(), True),
    StructField("DirectorId", IntegerType(), True),
    StructField("timestamp", IntegerType(), True)
])

data = [
(1, 1, 0),
(1, 1, 1),
(1, 1, 2),
(1, 2, 3),
(1, 2, 4),
(2, 1, 5),
(2, 1, 6)
]
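
Before writing any Spark code, the expected result can be cross-checked with a quick plain-Python sketch of the same grouping logic, using collections.Counter. This is only a sanity check on the sample data above, not part of the PySpark solution:

```python
from collections import Counter

data = [
    (1, 1, 0), (1, 1, 1), (1, 1, 2),
    (1, 2, 3), (1, 2, 4),
    (2, 1, 5), (2, 1, 6),
]

# Count how often each (actor, director) pair appears, ignoring the timestamp
pair_counts = Counter((actor, director) for actor, director, _ in data)

# Keep only pairs that cooperated at least 3 times
result = [pair for pair, n in pair_counts.items() if n >= 3]
print(result)  # [(1, 1)]
```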


Step - 2 : Identifying The Input Data And Expected Output
INPUT

+--------+-----------+---------+
|ACTOR_ID|DIRECTOR_ID|TIMESTAMP|
+--------+-----------+---------+
|       1|          1|        0|
|       1|          1|        1|
|       1|          1|        2|
|       1|          2|        3|
|       1|          2|        4|
|       2|          1|        5|
|       2|          1|        6|
+--------+-----------+---------+

OUTPUT

+--------+-----------+
|ACTOR_ID|DIRECTOR_ID|
+--------+-----------+
|       1|          1|
+--------+-----------+


Step - 3 : Writing The PySpark Code To Solve The Problem
# Creating the Spark session

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession. \
    builder. \
    config('spark.shuffle.useOldFetchProtocol', 'true'). \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", "/user/itv008042/warehouse"). \
    enableHiveSupport(). \
    master('yarn'). \
    getOrCreate()
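
Note that the builder options above (the YARN master, the warehouse directory under /user/itv008042, the old shuffle fetch protocol) are specific to the cluster this was written on. To follow along locally, a minimal session along these lines should be enough (a sketch, assuming only a local PySpark installation; no Hive or YARN needed for this example):

```python
from pyspark.sql import SparkSession

# Minimal local session for trying the example on a laptop
spark = SparkSession.builder \
    .appName("actors-directors") \
    .master("local[*]") \
    .getOrCreate()
```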

schema = StructType([
    StructField("ActorId", IntegerType(), True),
    StructField("DirectorId", IntegerType(), True),
    StructField("timestamp", IntegerType(), True)
])

data = [
(1, 1, 0),
(1, 1, 1),
(1, 1, 2),
(1, 2, 3),
(1, 2, 4),
(2, 1, 5),
(2, 1, 6)
]


df = spark.createDataFrame(data, schema)
df.show()

# Count collaborations per (ActorId, DirectorId) pair
df_group = df.groupBy('ActorId', 'DirectorId').count()
df_group.show()

+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 2| 2|
| 1| 1| 3|
| 2| 1| 2|
+-------+----------+-----+

# Keep only pairs with at least 3 collaborations
df_group.filter(df_group['count'] >= 3).show()

+-------+----------+-----+
|ActorId|DirectorId|count|
+-------+----------+-----+
| 1| 1| 3|
+-------+----------+-----+
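
The same result can also be expressed in SQL, e.g. by registering the DataFrame as a temporary view with df.createOrReplaceTempView('actor_director') and running the query through spark.sql. The sketch below demonstrates the query against an in-memory sqlite3 database as a lightweight stand-in engine; the table name and column layout are the illustrative ones from this example:

```python
import sqlite3

query = """
SELECT ActorId, DirectorId
FROM actor_director
GROUP BY ActorId, DirectorId
HAVING COUNT(*) >= 3
"""

# Verify the query logic against an in-memory SQLite database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE actor_director (ActorId INT, DirectorId INT, ts INT)")
con.executemany(
    "INSERT INTO actor_director VALUES (?, ?, ?)",
    [(1, 1, 0), (1, 1, 1), (1, 1, 2), (1, 2, 3), (1, 2, 4), (2, 1, 5), (2, 1, 6)],
)
print(con.execute(query).fetchall())  # [(1, 1)]
```

The GROUP BY ... HAVING form is the standard SQL counterpart of groupBy().count().filter() and runs unchanged under Spark SQL.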

