SP 5

1. You have the following code:

df.select("name", expr("salary") * 0.20).show()

It produces the output:

+-----+--------------+
| name|(salary * 0.2)|
+-----+--------------+
| Ravi|         640.0|
|Abdul|         960.0|
| John|        1300.0|
| Rosy|        1640.0|
+-----+--------------+

Choose the correct expression for giving an alias to the last column:

• A. df.select("name", col("salary as increment") * 0.20)

• B. df.select("name", expr("salary * 0.20 as increment"))

• C. df.select("name", col("salary") * 0.20 as increment)

• D. None of the above

2. Which of the following code blocks will add two new columns salary_increment and new_salary
to an existing DataFrame?

• A. df.withColumn("salary_increment", expr("salary * 0.15")).withColumn("new_salary", expr("salary + salary_increment"))

• B. df.selectExpr("*", "salary * 0.15 as salary_increment", "salary + salary_increment as new_salary")

• C. df.withColumn("salary_increment", col("salary * 0.15")).withColumn("new_salary", col("salary + salary_increment"))

• D. All of the above
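
A minimal PySpark sketch of the column-chaining pattern this question tests (assumes a df with a numeric salary column; df2 is an illustrative name): each withColumn() returns a new DataFrame, so the second call can reference the column added by the first.

from pyspark.sql.functions import expr

# The second withColumn sees salary_increment because it runs on the result of the first.
df2 = (df.withColumn("salary_increment", expr("salary * 0.15"))
         .withColumn("new_salary", expr("salary + salary_increment")))
df2.show()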

3. You have the following DataFrame:


+-----+---+------+---------+----------+
| name|age|salary|increment|new_salary|
+-----+---+------+---------+----------+
| Ravi| 28|  3200|    480.0|    3680.0|
|Abdul| 23|  4800|    720.0|    5520.0|
| John| 32|  6500|    975.0|    7475.0|
| Rosy| 48|  8200|   1230.0|    9430.0|
+-----+---+------+---------+----------+

You want to remove the salary column. Choose the response that correctly fills in the numbered
blanks:

df.__1__(__2__)

• A. 1. drop 2. "salary"

• B. 1. del 2. salary

• C. 1. remove 2. "salary"

• D. 1. delete 2. "salary"

4. Select all correct statements about the withColumnRenamed() method:

• A. The correct method name is withColumnRename()

• B. Returns a new DataFrame with a column renamed

• C. Throws an error if the schema doesn't contain the existing name

• D. Does not throw an error if the schema doesn't contain the existing name

5. You have a DataFrame with two date-type columns: start and end. Which expression correctly
selects the difference between these two dates?

• A. myDF.select(date_diff("end", "start"))

• B. myDF.select(datediff("start", "end"))

• C. myDF.select(datediff("end", "start"))

• D. myDF.select("end" - "start")
6. You have a start_time field in your DataFrame (string type) with values like:

17-05-2021 00:02:17.592

This represents a timestamp in DD-MM-YYYY HH:MI:SS.SSS format. How can you convert this field to
a timestamp type?

• A. myDF.withColumn("start_time", to_timestamp("start_time", "DD-MM-YYYY HH:MM:ss:SSS"))

• B. myDF.withColumn("start_time", to_timestamp("start_time", "dd-MM-yyyy HH:mm:ss.SSS"))

• C. myDF.withColumn("start_time", to_timestamp("start_time"))

• D. myDF.withColumn("start_time", to_timestamp("start_time", "dd-MM-yyyy HH:mm:ss:SSS"))

7. You have a DataFrame with the schema:

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- salary: string (nullable = true)

You want to select all columns, converting age to integer and salary to double. Choose the correct
option:

• A. df.select("name", "cast(age, integer)", "cast(salary, double)")

• B. df.select("name", expr("cast(age, integer)"), expr("cast(salary, double)"))

• C. df.select("name", expr("INT(age)"), expr("DOUBLE(salary)"))

• D. df.select("name", "cast(age as integer)", "cast(salary as double)")

8. Select all expressions equivalent to:

df.where("salary > 5000 and age > 30")


• A. df.filter((salary > 5000) & (age > 30))

• B. df.filter((df.salary > 5000) & (df.age > 30))

• C. df.filter("salary > 5000").filter("age > 30")

• D. df.filter(col("salary") > 5000 & col("age") > 30)
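
A short sketch of why the parentheses matter when combining Column conditions (assumes a df with numeric salary and age columns): in Python, & binds more tightly than >, so each comparison must be wrapped before it is combined.

from pyspark.sql.functions import col

# Both forms apply the same AND condition.
df.filter((col("salary") > 5000) & (col("age") > 30)).show()
df.filter("salary > 5000").filter("age > 30").show()
# Without parentheses, col("salary") > 5000 & col("age") > 30 is parsed as
# col("salary") > (5000 & col("age")) > 30 and raises an error.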

9. You have the following DataFrame:

+-----+---+------+
| name|age|salary|
+-----+---+------+
|Abdul| 36|  4800|
|Abdul| 36|  4800|
|Abdul| 42|  4800|
+-----+---+------+

You want to create a new DataFrame with unique records:

+-----+---+------+
| name|age|salary|
+-----+---+------+
|Abdul| 36|  4800|
|Abdul| 42|  4800|
+-----+---+------+

Choose the incorrect option:

• A. df.distinct()

• B. df.select("*").distinct()

• C. df.registerTempTable("dfTable"); spark.sql("select distinct * from dfTable")

• D. df.selectExpr("distinct(name, age, salary)")

10. You are given the following DataFrame:


data_list = [("David", "Account", "United States", "6500"), ("Ravi", "Account", "India", "5500"), ("John", "Software", "India", "6500"), ("Rosy", "Software", "India", "8200"), ("Abdul", "Support", "Brazil", "4800")]
df = spark.createDataFrame(data_list).toDF("name", "department", "country", "salary")

Choose the correct code block to produce:

+----------+-------------+-----------+-----------+
|department|      country|NumEmployee|TotalSalary|
+----------+-------------+-----------+-----------+
|   Account|        India|          1|     5500.0|
|   Support|       Brazil|          1|     4800.0|
|   Account|United States|          1|     6500.0|
|  Software|        India|          2|    14700.0|
+----------+-------------+-----------+-----------+

• A. df.groupBy("department", "country").agg(expr("count(*)"), expr("sum(salary)")).show()

• B. df.groupBy("department", "country").agg(expr("count(*) as NumEmployee"), expr("sum(salary) as TotalSalary")).show()

• C. df.groupBy("department", "country").agg("count(*)", "sum(salary)").show()

• D. df.groupBy("department", "country").select(expr("count(*)"), expr("sum(salary)")).show()

11. You are given the following DataFrame:

+-------+----+----------+--------+
|BatchID|Year|CourseName|Students|
+-------+----+----------+--------+
|     X1|2021|     Scala|     270|
|     Y5|2021|     Scala|     230|
|     N3|2020|     Scala|     150|
|     C5|2020|     Scala|     100|
|     D7|2020|    Python|     300|
|     D3|2021|    Python|     400|
|     H2|2021|    Python|     500|
+-------+----+----------+--------+

Choose the code block to create a Pivot DataFrame:

+----+------+-----+
|Year|Python|Scala|
+----+------+-----+
|2020| 300.0|250.0|
|2021| 900.0|500.0|
+----+------+-----+

• A. df.groupBy("Year").agg(expr("pivot(CourseName)"), expr("sum(Students)"))

• B. df.groupBy("Year").pivot("CourseName").agg(expr("sum(Students)"))

• C. df.groupBy("CourseName").pivot("Year").agg(expr("sum(Students)"))

• D. df.groupBy("Year").pivot("Students").agg(expr("sum(CourseName)"))

12. You are given two DataFrames:

df1:

+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
|     X1|2021|     Scala|
|     Y5|2021|     Scala|
+-------+----+----------+

df2:

+-------+--------+
|BatchID|Students|
+-------+--------+
|     X1|     270|
|     N3|     150|
+-------+--------+

You join them with:

df1.join(df2, df1.BatchID == df2.BatchID, "right_outer").show()

Choose the correct output:

• A. +-------+----+----------+ |BatchID|Year|CourseName| +-------+----+----------+ | X1 |2021| Scala | +-------+----+----------+

• B. +-------+----+----------+ |BatchID|Year|CourseName| +-------+----+----------+ | Y5 |2021| Scala | +-------+----+----------+

• C. +-------+----+----------+-------+--------+ |BatchID|Year|CourseName|BatchID|Students| +-------+----+----------+-------+--------+ | null |null| null | N3 | 150| | X1 |2021| Scala | X1 | 270| +-------+----+----------+-------+--------+

• D. +-------+----+----------+-------+--------+ |BatchID|Year|CourseName|BatchID|Students| +-------+----+----------+-------+--------+ | X1 |2021| Scala | X1 | 270| | Y5 |2021| Scala | null | null| +-------+----+----------+-------+--------+

13. You are joining two DataFrames:

joinType = "inner"
joinExpr = df1.BatchID == df2.BatchID
df1.join(df2, joinType, joinExpr).show()

What is wrong with the above code?

• A. There is no problem with the code block

• B. You cannot define joinExpr outside the join() method

• C. The joinType and joinExpr are at the wrong place; swap their positions

• D. There is no join type as inner; it must be inner_join

14. You have a DataFrame (df1) with 146 unique countries in the Country column. You want to
repartition it on Country into 10 partitions only. Choose the correct code block:

• A. df2 = df1.repartition(10)

• B. df2 = df1.repartition(10, "Country")

• C. This feature is not supported in Spark DataFrame API

• D. The requirement is incorrect; you can partition to 146 partitions because you have 146
countries

15. What is the output of the following code block?

df = spark.read.parquet("data/summary.parquet")
df2 = df.repartition(20)
print(df2.rdd.getNumPartitions())
df3 = df2.coalesce(100)
print(df3.rdd.getNumPartitions())

• A. 20 100

• B. 100 100

• C. 100 20

• D. 20 20
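
A short sketch of the behaviour this question turns on (df2 is the 20-partition DataFrame from the code above): coalesce() can only reduce the number of partitions, so asking for more is a no-op, while repartition() can increase or decrease the count at the cost of a shuffle.

print(df2.coalesce(100).rdd.getNumPartitions())    # still 20: coalesce never adds partitions
print(df2.coalesce(5).rdd.getNumPartitions())      # 5: reducing works without a full shuffle
print(df2.repartition(100).rdd.getNumPartitions()) # 100: repartition shuffles to any count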

16. You are given the following DataFrame:

+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
|     X1|2020|     Scala|
|     X2|2020|    Python|
|     X3|null|      Java|
|     X4|2021|     Scala|
|     X5|null|    Python|
|     X6|2021|     Spark|
+-------+----+----------+

Choose the incorrect option to replace all nulls in the Year column with 2021:

• A. df.withColumn("Year", expr("coalesce(Year, '2021')"))

• B. df.withColumn("Year", coalesce(col("Year"), lit("2021")))

• C. df.withColumn("Year", expr("ifnull(Year, '2021')"))

• D. df.withColumn("Year", ifnull(col("Year"), "2021"))

17. You are given the following DataFrame:

+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
|     X1|2020|     Scala|
|     X2|2020|    Python|
|     X3|null|      Java|
|     X4|2021|     Scala|
|     X5|null|      null|
|     X6|2021|     Spark|
+-------+----+----------+

Choose the correct code block to:

1. Replace all nulls in Year with 2021

2. Replace all nulls in CourseName with Python

• A. df.na.fill("2021", "Python")
• B. df.na.fill({"Year": "2021", "CourseName": "Python"})

• C. df.na.fill({"CourseName": "Python", "Year": "2021"})

• D. df.na.fill("2021")
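
A small sketch of the na.fill() variants (assumes the DataFrame df shown above): a dict supplies per-column replacements, while a single value plus the optional subset argument fills only the listed columns.

# Per-column replacement values; key order does not matter.
df.na.fill({"Year": "2021", "CourseName": "Python"}).show()

# A single value restricted to one column via subset (fills only Year).
df.na.fill("2021", subset=["Year"]).show()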

18. Which code block merges two DataFrames df1 and df2?

• A. df1.append(df2)

• B. df1.merge(df2)

• C. df1.union(df2)

• D. df1.add(df2)

19. Which statements are used to bring data to the Spark driver?

• A. df.first()

• B. df.take(10)

• C. df.collect()

• D. df.limit(10)

20. Spark allows schema-on-read using the infer schema option. Choose the incorrect statement
about schema-on-read:

• A. Schema-on-read usually works fine for ad hoc analysis

• B. Infer schema can be slow with plain-text file formats like CSV or JSON

• C. Infer schema can lead to precision issues like a long type incorrectly set as an integer

• D. It is a good idea to use schema-on-read for production ETL

21. The following code defines a schema:

mySchema = StructType([StructField("ID", IntegerType()), StructField("Name", StringType()), StructField("Salary", DoubleType())])

Choose the correct equivalent code block:

• A. mySchema = spark.createSchema(StructField("ID", IntegerType()), StructField("Name", StringType()), StructField("Salary", DoubleType()))

• B. mySchema = "ID INT, Name STRING, Salary DOUBLE"

22. What is the default compression format for saving a DataFrame as a Parquet file?

• A. uncompressed

• B. none
• C. lz4

• D. snappy

23. Choose the code block to write a DataFrame in compressed JSON file format:

• A. df.write.mode("overwrite").format("json").option("compression",
"gzip").save("data/myTable")

• B. df.write.mode("overwrite").option("compression", "gzip").save("data/myTable")

• C. df.write.mode("overwrite").option("codec", "gzip").save("data/myTable")

• D. df.write.mode("overwrite").codec("gzip").save("data/myTable")

24. Choose the correct expression to create a Spark database named my_spark_db:

• A. spark.sql("CREATE DATABASE my_spark_db")

• B. spark.createDatabase("my_spark_db")

• C. spark.catalog.createDatabase("my_spark_db")

• D. None of the above

25. You are given two code blocks:

1. spark.sql("CREATE TABLE flights_tbl (date STRING, delay INT, distance INT, origin STRING,
destination STRING)")

2. spark.sql("CREATE TABLE flights_tbl(date STRING, delay INT, distance INT, origin STRING, destination STRING) USING csv OPTIONS (PATH '/tmp/flights/flights_tbl.csv')")

Choose all correct statements:

• A. The first code block creates a Spark managed table

• B. The second code block creates a Spark unmanaged table

• C. Both statements are the same

• D. Both statements are the same except the second specifies the data file location

26. You created a temporary view:

df1.createOrReplaceTempView("my_view")

Choose the correct expression to drop this view:

• A. spark.sql("DROP VIEW IF EXISTS global_temp.my_view")

• B. spark.sql("DROP VIEW IF EXISTS my_view")

• C. spark.catalog.dropGlobalTempView("my_view")
• D. spark.catalog.dropTempView("my_view")

27. What is the default storage level for a Spark DataFrame when cached?

• A. MEMORY_AND_DISK in Spark 3.0

• B. MEMORY_AND_DISK_DESER in Spark 3.1.1

• C. MEMORY_ONLY

• D. DISK_ONLY

28. Which statement correctly defines the MEMORY_AND_DISK storage level?

• A. Data is stored directly as objects in memory and a copy is serialized and stored on disk

• B. Data is stored directly as objects in memory, but if there’s insufficient memory, the rest is
serialized and stored on disk

• C. Data is stored on the disk and brought into memory when required

• D. None of the above

29. The following DataFrame expression has an error:

df1.withColumn("Flight_Delays", expr("""CASE WHEN delay > 360 THEN 'Very Long Delays' WHEN
delay >= 120 AND delay <= 360 THEN 'Long Delays' WHEN delay >= 60 AND delay < 120 THEN 'Short
Delays' WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays' WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'"""))

Choose the statement that identifies the error:

• A. The CASE statement requires an END, which is missing

• B. You cannot use CASE WHEN construct in DataFrame expression

• C. It is invalid to use """ for a string

• D. There is no error in the code block

30. You want to implement the following CASE expression using the when() DataFrame function:

df1.withColumn("Flight_Delays", expr("""CASE WHEN delay > 360 THEN 'Very Long Delays' WHEN
delay >= 120 AND delay <= 360 THEN 'Long Delays' WHEN delay >= 60 AND delay < 120 THEN 'Short
Delays' WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays' WHEN delay = 0 THEN 'No Delays'
ELSE 'Early' END"""))
Choose the correct expression:

• A. df1.withColumn("Flight_Delays", when(col("delay") > 360, lit("Very Long Delays")).when((col("delay") >= 120) & (col("delay") <= 360), lit("Long Delays")).when((col("delay") >= 60) & (col("delay") <= 120), lit("Short Delays")).when((col("delay") > 0) & (col("delay") < 60), lit("Tolerable Delays")).when(col("delay") == 0, lit("No Delays")).else(lit("Early")))

• B. df1.withColumn("Flight_Delays", when(col("delay") > 360, lit("Very Long Delays")).when((col("delay") >= 120) & (col("delay") <= 360), lit("Long Delays")).when((col("delay") >= 60) & (col("delay") <= 120), lit("Short Delays")).when((col("delay") > 0) & (col("delay") < 60), lit("Tolerable Delays")).when(col("delay") == 0, lit("No Delays")).otherwise(lit("Early")))

• C. df1.withColumn("Flight_Delays", when(col("delay") > 360, "Very Long Delays").when((col("delay") >= 120) & (col("delay") <= 360), "Long Delays").when((col("delay") >= 60) & (col("delay") <= 120), "Short Delays").when((col("delay") > 0) & (col("delay") < 60), "Tolerable Delays").when(col("delay") == 0, "No Delays").otherwise("Early"))

• D. There is no error in the code block

31. There is a global temp view named my_global_view. Which command should you choose to
query it?

• A. spark.read.table("my_global_view")

• B. spark.read.view("my_global_view")

• C. spark.read.table("global_temp.my_global_view")

• D. spark.read.view("global_temp.my_global_view")

32. Which code block reads from a tab-separated TSV file?

• A. df = spark.read.format("tsv").option("inferSchema", "true").option("header",
"true").load("data/my_data_file.tsv")

• B. df = spark.read.format("csv").option("inferSchema", "true").option("header",
"true").option("sep", "tab").load("data/my_data_file.tsv")

• C. df = spark.read.format("csv").option("inferSchema", "true").option("header",
"true").option("sep", "\t").load("data/my_data_file.tsv")

• D. df = spark.read.format("csv").option("inferSchema", "true").option("header",
"true").option("delimeter", "\t").load("data/my_data_file.tsv")

33. You are given a CSV file with content:

id,fname,lname,dob
101,prashant,pandey,25-05-1975
102,abdul,hamid,28-12-1986
103,M David,turner,23-08-1979

You want to load it with the schema:

root
 |-- id: integer (nullable = true)
 |-- fname: string (nullable = true)
 |-- lname: string (nullable = true)
 |-- dob: date (nullable = true)

Choose the correct code block:

• A. schema = StructType([StructField("id", IntegerType()), StructField("fname", StringType()), StructField("lname", StringType()), StructField("dob", DateType())]); df = spark.read.format("csv").option("header", "true").schema(schema).option("dateFormat", "dd-MM-yyyy").load("data/my_data_file.csv")

• B. schema = StructType([StructField("id", IntegerType()), StructField("fname", StringType()), StructField("lname", StringType()), StructField("dob", DateType())]); df = spark.read.format("csv").option("header", "true").schema(schema).load("data/my_data_file.csv")

• C. schema = StructType([StructField("id", IntegerType()), StructField("fname", StringType()), StructField("lname", StringType()), StructField("dob", DateType())]); df = spark.read.format("csv").option("header", "true").schema(schema).option("dateFormat", "yyyy-MM-dd").load("data/my_data_file.csv")

• D. schema = StructType([StructField("id", IntegerType()), StructField("fname", StringType()), StructField("lname", StringType()), StructField("dob", DateType())]); df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").option("dateFormat", "dd-MM-yyyy").load("data/my_data_file.csv")

34. Can you write Spark UDFs in Scala or Java and run them from a PySpark application?

• A. TRUE

• B. FALSE

35. You have the following DataFrame and UDF:

df = spark.range(5).toDF("num")

def power3(value):
    return value ** 3

power3_udf = udf(power3)
df.selectExpr("power3_udf(num)").show()

There is an error. Choose the corrected code:

• A. The UDF is not registered as an SQL function: spark.udf.register("power3_udf", power3)

• B. The function is incorrectly defined: def power3(value): return value * 3

• C. The second last line is incorrect: power3_udf = udf(power3(_: Double):Double)

• D. There is no error in the given code
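
A minimal sketch of the two UDF registration paths (assumes a SparkSession named spark; power3_col is an illustrative name): udf() wraps a function for the Column API, while spark.udf.register() is what makes a name usable inside SQL strings such as selectExpr().

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def power3(value):
    return value ** 3

power3_col = udf(power3, LongType())                   # usable with the Column API only
spark.udf.register("power3_udf", power3, LongType())   # usable from selectExpr()/spark.sql()

df = spark.range(5).toDF("num")
df.select(power3_col("num")).show()
df.selectExpr("power3_udf(num)").show()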

36. You have a DataFrame with the schema:

root
 |-- ID: long (nullable = true)
 |-- PersonalDetails: struct (nullable = false)
 |    |-- FName: string (nullable = true)
 |    |-- LName: string (nullable = true)
 |    |-- DOB: string (nullable = true)
 |-- Department: string (nullable = true)

Choose the correct code block to select:

+---+-----+------+----------+
| ID|FName| LName|Department|
+---+-----+------+----------+
|101| John|   Doe|  Software|
|102|David|Turner|   Support|
|103|Abdul| Hamid|   Account|
+---+-----+------+----------+

• A. df1.select("ID", "FName", "LName", "Department").show()

• B. df1.select("ID", col("PersonalDetails").getField("FName").alias("FName"),
col("PersonalDetails").getField("LName").alias("LName"), "Department").show()

• C. df1.select("ID", col("FName"), col("LName"), "Department").show()

37. You are given a CSV file (sample.txt) with content:

ID,TEXT
101,WHITE HANGING HEART T-LIGHT HOLDER
102,WHITE LANTERN
103,RED WOOLLY HOTTIE WHITE HEART

Choose the correct output of:

df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/sample.txt")
df1 = df.select("ID", split(col("TEXT"), " ").alias("VALUES"))
df1.selectExpr("ID", "VALUES[0] as V1", "VALUES[1] as V2", "VALUES[2] as V3").show()

• A. +---+-----+--------+------+ | ID| V1| V2| V3| +---+-----+--------+------+ |101|WHITE| HANGING| HEART| |102|WHITE| LANTERN| null| |103| RED| WOOLLY|HOTTIE| +---+-----+--------+------+

• B. +---+-----+--------+------------+ | ID| V1| V2| V3| +---+-----+--------+------------+ |101|WHITE| HANGING| HEART| |102|WHITE| LANTERN| null| |103| RED| WOOLLY|HOTTIE WHITE| +---+-----+--------+------------+

38. What is the output of the following code block?

mylist = [1002, 3001, 4002, 2003, 2002, 3004, 1003, 4006]
df = spark.createDataFrame(mylist, IntegerType()).toDF("value")
df.withColumn("key", col("value") % 1000).groupBy("key").agg(expr("count(key) as count"), expr("sum(key) as sum")).orderBy(col("key").desc()).limit(1).select("count", "sum").show()

• A. +-----+---+ |count|sum| +-----+---+ | 1| 6| +-----+---+

• B. +-----+---+ |count|sum| +-----+---+ | 2| 6| +-----+---+

• C. +-----+---+ |count|sum| +-----+---+ | 3| 6| +-----+---+

• D. +-----+---+ |count|sum| +-----+---+ | 1| 1| +-----+---+
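
A plain-Python check of the arithmetic behind this question (no Spark needed): value % 1000 sends 4006 to key 6, and no other value lands in that group, so the highest key has a count of 1 and a sum of 6.

mylist = [1002, 3001, 4002, 2003, 2002, 3004, 1003, 4006]
keys = [v % 1000 for v in mylist]          # [2, 1, 2, 3, 2, 4, 3, 6]
top_key = max(keys)                        # 6 -> the row kept by orderBy(desc).limit(1)
group = [k for k in keys if k == top_key]  # [6]
print(len(group), sum(group))              # 1 6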

39. Which configuration controls the maximum partition size when reading files?

• A. spark.files.maxPartitionSize

• B. spark.files.maxPartitionBytes

• C. spark.maxPartitionSize

• D. spark.maxPartitionBytes

40. Which configuration controls the number of available cores for executors?

• A. spark.driver.cores

• B. spark.executor.cores

• C. spark.cores.max

• D. spark.task.cpus

41. The Spark Core engine:

• A. Is fault-tolerant

• B. Executes a DAG of Spark application

• C. Stores and manages resources

• D. Manages data and its storage

42. Select the correct statements about the Cluster Manager:

• A. A Cluster Manager is responsible for running your Spark Applications

• B. The Cluster Manager is responsible for maintaining a cluster of machines that will run your
Spark Application

• C. A Cluster Manager may have its own master and worker nodes

• D. Cluster Manager provides the Storage Services to Apache Spark

43. Select the correct statement about Spark Drivers and Executors:

• A. The executors communicate with the cluster manager and are responsible for executing
tasks on the workers

• B. Spark Driver communicates directly with the executors

• C. We cannot have more than one executor per worker node


• D. A Spark executor runs on the worker node in the cluster

44. Select the correct statement about Spark deployment modes:

• A. Kubernetes does not support cluster mode

• B. Local mode runs the Spark driver and executor in the same JVM on a single computer

• C. Client mode runs the driver and executor on the client machine

• D. Cluster mode runs the driver with the YARN Application Master

45. Select the correct statements about the Spark Context:

• A. Is not available in Spark 2.x

• B. Within the SparkSession represents the connection to the Spark cluster

• C. Is your driver application

• D. You communicate with some of Spark’s lower-level APIs, such as RDDs

46. Select the incorrect statement about the Spark Application:

• A. Spark Application runs as a series of Spark Jobs

• B. Each Spark Job is internally represented as a DAG of stages

• C. Spark Application runs all the Spark Jobs in parallel

• D. You can submit a Spark application using the spark-submit tool

47. Choose all incorrect statements:

• A. If there are 1,000 little partitions, we will have 1,000 tasks that can be executed in parallel

• B. Partitioning your data into a greater number of partitions means that more can be
executed in parallel

• C. An executor with 12 cores can have 12 or more tasks working on 12 or more partitions in
parallel

• D. One executor must run only one task at a time

48. Which of the following are correct for slots?

• A. Slots are the same thing as tasks

• B. Each executor can have multiple slots depending upon the executor cores

• C. Each slot in the executor can be assigned a task

• D. Each worker can have multiple slots where executors are allocated

49. Which configuration sets the scheduling mode between jobs submitted to the same
SparkContext?

• A. spark.job.scheduler.mode

• B. spark.scheduler.mode
• C. spark.optimizer.scheduler.mode

• D. spark.scheduler.job.mode

50. Which configurations are related to enabling dynamic adjustment of resources based on the
workload?

• A. spark.dynamicAllocation.shuffleTracking.enabled

• B. spark.sql.dynamicAllocation.enabled

• C. spark.dynamicAllocation.enabled

• D. spark.shuffle.service.enabled
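
A hedged configuration sketch for dynamic allocation (property names as documented in the Spark configuration reference; the external shuffle service, or shuffle tracking, lets executors be released safely):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-allocation-sketch")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")  # alternative: spark.dynamicAllocation.shuffleTracking.enabled
         .getOrCreate())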

51. Columns or table names are resolved by consulting an internal Catalog at which stage of Spark
query optimization?

• A. Analysis

• B. Logical Optimization

• C. Physical Planning

• D. Code Generation

52. Which method can be used to rename a column in a Spark DataFrame?

• A. withColumnRenamed(existingName: String, newName: String)

• B. withColumnRename(existingName: String, newName: String)

• C. withColumn(newName: String, existingName: String)

• D. There is no method for renaming a column

53. You have a DataFrame with a string-type column today in DD-MM-YYYY format. You want to
add a column week_later with a value one week later than today. Select the correct code block:

• A. myDF.withColumn("week_later", date_add("today", 7))

• B. myDF.withColumn("week_ago", date_add(to_date("today", "dd-MM-yyyy"), 7))

• C. myDF.withColumn("week_ago", date_add(to_date("today", "DD-MM-YYYY"), 7))

• D. All of the above

54. You have a DataFrame with a string field day in MM-DD-YYYY format. What is the problem
with:

myDF.filter("day > '2021-05-07'")

• A. There is no problem with the code


• B. The day field is in MM-DD-YYYY format but the filter condition expects it in YYYY-MM-DD
format

• C. This problem can be solved by changing the filter to: myDF.filter("day > '05-07-2021'")

• D. The day field is a string field but the filter condition expects it to be a date field; convert
day to date type first
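
A minimal sketch of the fix suggested here (assumes myDF has the string column day in MM-DD-YYYY form; fixed is an illustrative name): convert the column to a real date with to_date() before comparing.

from pyspark.sql.functions import to_date, col

# A proper date comparison instead of a string comparison in the wrong format.
fixed = myDF.withColumn("day", to_date(col("day"), "MM-dd-yyyy"))
fixed.filter(col("day") > "2021-05-07").show()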

55. Select the most appropriate expression for casting a salary column to a double value:

• A. expr("cast(salary as double)")

• B. col("salary").cast("double")

• C. expr("DOUBLE(salary)")

• D. All of the above

56. Select all expressions equivalent to:

df.where("salary > 5000 or age > 30")

• A. df.filter("salary > 5000 or age > 30")

• B. df.filter("salary > 5000").filter("age > 30")

• C. df.filter((col("salary") > 5000) | (col("age") > 30))

• D. df.filter(("salary" > 5000) | ("age" > 30))

57. You have a DataFrame:

+-----+---+------+
| name|age|salary|
+-----+---+------+
| Ravi| 28|  6500|
| John| 32|  6500|
| Rosy| 48|  8200|
|Abdul| 36|  4800|
+-----+---+------+

You want to sort it in descending order of salary, and if salary is equal, by age in ascending order:
+-----+---+------+
| name|age|salary|
+-----+---+------+
| Rosy| 48|  8200|
| Ravi| 28|  6500|
| John| 32|  6500|
|Abdul| 36|  4800|
+-----+---+------+

Choose the correct code to fill in:

df._1_(_2_, _3_)

• A. 1. sort 2. expr("salary desc") 3. "age"

• B. 1. sort 2. col("salary").desc() 3. "age"

• C. 1. sort 2. expr("desc(salary)") 3. "age"

• D. 1. sort 2. expr("salary").desc() 3. "age"

58. You are given the following DataFrame:

data_list = [("David", "Account", "United States", "6500"), ("Ravi", "Account", "India", "5500"), ("John", "Software", "India", "6500"), ("Rosy", "Software", "India", "8200"), ("Abdul", "Support", "Brazil", "4800")]
df = spark.createDataFrame(data_list).toDF("name", "department", "country", "salary")

Choose the best option to produce:

+-------------+----------+-----+
|      country|department|count|
+-------------+----------+-----+
|        India|   Account|    1|
|United States|   Account|    1|
|        India|  Software|    2|
|       Brazil|   Support|    1|
+-------------+----------+-----+

• A. df.groupBy("country", "department").agg(expr("count(*)")).show()

• B. df.groupBy(expr("country, department")).count().show()

• C. df.groupBy("country", "department").count().show()

• D. df.groupBy("department", "country").count().show()

59. You are given the following DataFrame:

+-------+----+----------+--------+
|BatchID|Year|CourseName|Students|
+-------+----+----------+--------+
|     X1|2021|     Scala|     270|
|     Y5|2021|     Scala|     230|
|     N3|2020|     Scala|     150|
|     C5|2020|     Scala|     100|
|     D7|2020|    Python|     300|
|     D3|2021|    Python|     400|
|     H2|2021|    Python|     500|
+-------+----+----------+--------+

Choose the code block to create a summary DataFrame for TotalStudents over Year and
CourseName:

+----+----------+-------------+
|Year|CourseName|TotalStudents|
+----+----------+-------------+
|null|      null|       1950.0|
|2020|      null|        550.0|
|2020|    Python|        300.0|
|2020|     Scala|        250.0|
|2021|      null|       1400.0|
|2021|    Python|        900.0|
|2021|     Scala|        500.0|
+----+----------+-------------+

• A. df.groupBy("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName")

• B. df.rollup("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName")

• C. df.rollup("Year", "CourseName").agg(expr("sum(Students)").alias("TotalStudents"))

• D. df.pivot("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName")

60. You have the following code block for joining two DataFrames:

joinType = "inner"
joinExpr = df1.BatchID == df2.BatchID
df1.join(df2, joinExpr, joinType).select("BatchID", "Year").show()

This throws an error: Reference 'BatchID' is ambiguous. Choose the corrected code block:

• A. joinType = "inner"; joinExpr = df1.BatchID == df2.BatchID; df1.join(df2, joinExpr, joinType).select("df1.BatchID", "df1.Year").show()

• B. joinType = "inner"; joinExpr = df1.BatchID == df2.BatchID; df1.join(df2, joinExpr, joinType).select(df1.BatchID, df1.Year).show()

• C. joinType = "inner"; joinExpr = "BatchID"; df1.join(df2, joinExpr, joinType).select("BatchID", "Year").show()

• D. joinType = "inner"; joinExpr = df1.BatchID == df2.BatchID; df1.join(df2, joinExpr, joinType).drop(df2.BatchID).select("BatchID", "Year").show()

61. You are given two DataFrames:

df1:

+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
|     X1|2021|     Scala|
|     Y5|2021|     Scala|
+-------+----+----------+

df2:

+-------+--------+
|BatchID|Students|
+-------+--------+
|     X1|     270|
|     N3|     150|
+-------+--------+

You want to select rows from df1 that do not exist in df2:

+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
|     Y5|2021|     Scala|
+-------+----+----------+

Choose the correct join type:

df1.join(df2, df1.BatchID == df2.BatchID, joinType).show()

• A. joinType = "left_semi"

• B. joinType = "right_semi"

• C. joinType = "left_outer"

• D. joinType = "left_anti"

62. What API can you use to get the number of partitions of a DataFrame df?

• A. df.getNumPartitions()

• B. df.rdd.getNumPartitions()

• C. df.getPartitionCount()

• D. df.rdd.getPartitionCount()

63. What is the use of the coalesce(expr*) function in Spark SQL?

• A. Shrinking the number of DataFrame partitions

• B. Returns the first non-null argument if it exists; otherwise, null

• C. Merge the column values into one column

• D. None of the above

64. You are given the following DataFrame:

+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
|     X1|2020|     Scala|
|     X2|2020|    Python|
|     X3|null|      Java|
|     X4|2021|     Scala|
|     X5|null|    Python|
|     X6|2021|     Spark|
+-------+----+----------+

Choose the correct statements:

• A. We can use df.na.drop(subset=("Year", "CourseName")) to delete rows if Year or CourseName is null

• B. We can use df.na.drop() to delete all rows having any null column

• C. We can use df.na.drop("all") to delete rows if all columns are null

• D. All of the above are correct

65. You are given the following DataFrame:

+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
|     X1|2020|     Scala|
|     X2|2020|    Python|
|     X3|null|      Java|
|     X4|2021|     Scala|
|     X5|null|      null|
|     X6|2021|      null|
+-------+----+----------+

You use:

df.na.drop(thresh=1)

Choose the correct statement:

• A. This expression will delete X3, X5, and X6 rows because the threshold=1 says "Delete the
row if at least one column is null"
• B. This expression will not delete any row because the threshold=1 says "Keep the row if at
least one column is not null"

• C. This expression will delete X3 and X6 rows because the threshold=1 says "Delete the row if
only one column is null"

• D. This expression will delete X3 row because the threshold=1 says "Delete the row if more
than one column is null"
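
A short sketch of the thresh semantics (assumes the DataFrame df shown above): thresh=N keeps a row only if it has at least N non-null values, so thresh=1 keeps every row here because BatchID is never null.

df.na.drop(thresh=1).show()  # keeps all rows: every row has at least one non-null value
df.na.drop(thresh=3).show()  # drops X3, X5, and X6, which have fewer than 3 non-null values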

Keysheet (Correct Answers)

1. B - df.select("name", expr("salary * 0.20 as increment")) (Correct way to alias using expr() for
column expressions.)

2. A, B - Both withColumn() chaining and selectExpr() are valid; option C is incorrect due to
invalid col() syntax.

3. A - 1. drop 2. "salary" (Correct method to remove a column.)

4. B, D - Returns a new DataFrame with a column renamed; does not throw an error if the
schema doesn’t contain the existing name.

5. C - myDF.select(datediff("end", "start")) (Correct function and order for date difference in days.)

6. B - myDF.withColumn("start_time", to_timestamp("start_time", "dd-MM-yyyy HH:mm:ss.SSS")) (Correct format for timestamp conversion.)

7. C - df.select("name", expr("INT(age)"), expr("DOUBLE(salary)")) (Correct Spark SQL functions for casting.)

8. B, C - df.filter((df.salary > 5000) & (df.age > 30)) and df.filter("salary > 5000").filter("age > 30")
(Implement AND condition correctly.)

9. D - df.selectExpr("distinct(name, age, salary)") (Incorrect; no such SQL function in Spark.)

10. B - df.groupBy("department", "country").agg(expr("count(*) as NumEmployee"), expr("sum(salary) as TotalSalary")).show() (Correct for aliasing aggregated columns.)

11. B - df.groupBy("Year").pivot("CourseName").agg(expr("sum(Students)")) (Correct for pivoting on CourseName.)

12. C - Right outer join includes all rows from df2, with nulls for non-matching df1 rows.

13. C - The joinType and joinExpr are in the wrong order; correct syntax is join(rightDF, joinExpr,
joinType).

14. B - df2 = df1.repartition(10, "Country") (Correct for repartitioning by column with specified
partitions.)

15. D - 20 20 (coalesce(100) cannot increase partitions, so it remains 20.)

16. D - df.withColumn("Year", ifnull(col("Year"), "2021")) (Incorrect; ifnull is a SQL function, not a DataFrame function.)
17. B, C - df.na.fill({"Year": "2021", "CourseName": "Python"}) (Correct dictionary syntax for
multiple columns; order doesn’t matter.)

18. C - df1.union(df2) (Correct method to merge DataFrames.)

19. A, B, C - first(), take(10), collect() (All are actions that bring data to the driver; limit() is a
transformation.)

20. D - It is a good idea to use schema-on-read for production ETL (Incorrect; manual schema
definition is preferred for production.)

21. B - mySchema = "ID INT, Name STRING, Salary DOUBLE" (Correct equivalent schema string.)

22. D - snappy (Default compression for Parquet files.)

23. A - df.write.mode("overwrite").format("json").option("compression",
"gzip").save("data/myTable") (Correct for JSON compression.)

24. A - spark.sql("CREATE DATABASE my_spark_db") (Correct SQL command for creating a database.)

25. A, B, D - First block creates a managed table; second creates an unmanaged table with a
specified path.

26. B, D - spark.sql("DROP VIEW IF EXISTS my_view") and spark.catalog.dropTempView("my_view") (Correct for dropping a temp view.)

27. B - MEMORY_AND_DISK_DESER in Spark 3.1.1 (Correct default storage level for newer Spark
versions.)

28. B - Data is stored directly as objects in memory, but if insufficient memory, the rest is
serialized and stored on disk.

29. A - The CASE statement requires an END, which is missing.

30. B, C - Both when() with otherwise() and without lit() are correct; else is not a valid method.

31. C - spark.read.table("global_temp.my_global_view") (Correct for accessing global temp views.)

32. C - df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").option("sep", "\t").load("data/my_data_file.tsv") (Correct for tab-separated files.)

33. A - schema with dateFormat "dd-MM-yyyy" (Correct for parsing dates correctly.)

34. A - TRUE (Scala/Java UDFs can be used in PySpark for performance.)

35. A - spark.udf.register("power3_udf", power3) (Correct for registering UDF as SQL function.)

36. B - df1.select("ID", col("PersonalDetails").getField("FName").alias("FName"), col("PersonalDetails").getField("LName").alias("LName"), "Department").show() (Correct for accessing struct fields.)

37. A - split() creates an array with individual words; V3 for row 103 is "HOTTIE".

38. A - key=6 (from 4006) has count=1 and sum=6; 1002 and 2002 map to key 2, so 4006 is the only value in the key=6 group.

39. B - spark.files.maxPartitionBytes (Correct configuration for partition size.)


40. B - spark.executor.cores (Controls cores per executor.)

41. A, B - Is fault-tolerant; executes a DAG of Spark application.

42. A, B, C - Cluster Manager runs applications, maintains the cluster, and has master/worker
nodes.

43. B, D - Driver communicates with executors; executors run on worker nodes.

44. B - Local mode runs driver and executor in the same JVM.

45. B, D - SparkContext is within SparkSession and used for RDDs.

46. C - Spark Application runs jobs sequentially, not in parallel.

47. D - One executor can run multiple tasks concurrently.

48. B, C - Executors have multiple slots; each slot can be assigned a task.

49. B - spark.scheduler.mode (Correct for scheduling mode.)

50. C, D - spark.dynamicAllocation.enabled and spark.shuffle.service.enabled (Related to dynamic allocation.)

51. A - Analysis (Catalog resolution occurs here.)

52. A - withColumnRenamed(existingName: String, newName: String) (Correct method.)

53. B - myDF.withColumn("week_ago", date_add(to_date("today", "dd-MM-yyyy"), 7)) (Correct format and function.)

54. D - The day field is a string; convert to date using to_date() before filtering.

55. D - All are valid methods for casting to double.

56. A, C - df.filter("salary > 5000 or age > 30") and df.filter((col("salary") > 5000) | (col("age") >
30)) (Correct for OR condition.)

57. B, D - col("salary").desc() and expr("salary").desc() are equivalent and correct.

58. C - df.groupBy("country", "department").count().show() (Simplest and correct for counting.)

59. B - df.rollup("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName") (Correct for rollup with totals.)

60. B, D - Use df1.BatchID or drop df2.BatchID to resolve ambiguity.

61. D - joinType = "left_anti" (Correct for NOT EXISTS condition.)

62. B - df.rdd.getNumPartitions() (Correct API for partition count.)

63. B - Returns the first non-null argument if it exists; otherwise, null (Spark SQL coalesce
function.)

64. B, C - df.na.drop() and df.na.drop("all") are correct; subset syntax is incorrect.

65. B - thresh=1 means keep rows with at least one non-null column, so no rows are deleted.
