1. You select each employee's name and salary * 0.2 and get the following output:
+-----+--------------+
| name|(salary * 0.2)|
+-----+--------------+
| Ravi| 640.0|
|Abdul| 960.0|
| John| 1300.0|
| Rosy| 1640.0|
+-----+--------------+
Choose the correct expression for giving an alias to the last column:
2. Which of the following code blocks will add two new columns salary_increment and new_salary
to an existing DataFrame?
The resulting DataFrame should have the following columns:
+-----+---+------+---------+----------+
| name|age|salary|increment|new_salary|
+-----+---+------+---------+----------+
+-----+---+------+---------+----------+
3. You want to remove the salary column. Choose the response that correctly fills in the numbered blanks:
df.__1__(__2__)
• A. 1. drop 2. "salary"
• B. 1. del 2. salary
• C. 1. remove 2. "salary"
• D. 1. delete 2. "salary"
4. Which statements are correct about the withColumnRenamed() transformation?
• D. Does not throw an error if the schema doesn't contain the existing name
5. You have a DataFrame with two date-type columns: start and end. Which expression correctly
selects the difference between these two dates?
• A. myDF.select(date_diff("end", "start"))
• B. myDF.select(datediff("start", "end"))
• C. myDF.select(datediff("end", "start"))
• D. myDF.select("end" - "start")
6. You have a start_time field in your DataFrame (string type) with values like:
17-05-2021 00:02:17.592
This represents a timestamp in DD-MM-YYYY HH:MI:SS.SSS format. How can you convert this field to
a timestamp type?
• C. myDF.withColumn("start_time", to_timestamp("start_time"))
7. You have a DataFrame with the following schema:
root
You want to select all columns, converting age to integer and salary to double. Choose the correct
option:
8. You have the following employee DataFrame:
+-----+---+------+
| name|age|salary|
+-----+---+------+
+-----+---+------+
Choose the code blocks that select only the employees with salary greater than 5000 and age greater than 30.
9. You want to select the distinct rows of the following DataFrame:
+-----+---+------+
| name|age|salary|
+-----+---+------+
+-----+---+------+
• A. df.distinct()
• B. df.select("*").distinct()
10. You create a DataFrame from the following list:
data_list = [("David", "Account", "United States", "6500"), ("Ravi", "Account", "India", "5500"),
("John", "Software", "India", "6500"), ("Rosy", "Software", "India", "8200"), ("Abdul", "Support",
"Brazil", "4800")]
Choose the code block that produces the following summary:
+----------+-------------+-----------+-----------+
|department| country|NumEmployee|TotalSalary|
+----------+-------------+-----------+-----------+
+----------+-------------+-----------+-----------+
11. You have the following DataFrame:
+-------+----+----------+--------+
|BatchID|Year|CourseName|Students|
+-------+----+----------+--------+
+-------+----+----------+--------+
Choose the code block that pivots it into the following shape:
+----+------+-----+
|Year|Python|Scala|
+----+------+-----+
|2020| 300.0|250.0|
|2021| 900.0|500.0|
+----+------+-----+
• A. df.groupBy("Year").agg(expr("pivot(CourseName)"), expr("sum(Students)"))
• B. df.groupBy("Year").pivot("CourseName").agg(expr("sum(Students)"))
• C. df.groupBy("CourseName").pivot("Year").agg(expr("sum(Students)"))
• D. df.groupBy("Year").pivot("Students").agg(expr("sum(CourseName)"))
df1:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2021| Scala |
| Y5 |2021| Scala |
+-------+----+----------+
df2:
+-------+--------+
|BatchID|Students|
+-------+--------+
| X1 | 270|
| N3 | 150|
+-------+--------+
13. You have the following code block for joining df1 and df2:
joinType = "inner"
• C. The joinType and joinExpr are at the wrong place; swap their positions
14. You have a DataFrame (df1) with 146 unique countries in the Country column. You want to
repartition it on Country into 10 partitions only. Choose the correct code block:
• A. df2 = df1.repartition(10)
• D. The requirement is incorrect; you can partition to 146 partitions because you have 146
countries
15. What is the output of the following code block?
df = spark.read.parquet("data/summary.parquet")
df2 = df.repartition(20)
print(df2.rdd.getNumPartitions())
df3 = df2.coalesce(100)
print(df3.rdd.getNumPartitions())
• A. 20 100
• B. 100 100
• C. 100 20
• D. 20 20
16. You have the following DataFrame:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2020| Scala |
| X2 |2020| Python |
| X3 |null| Java |
| X4 |2021| Scala |
| X5 |null| Python |
| X6 |2021| Spark |
+-------+----+----------+
Choose the incorrect option to replace all nulls in the Year column with 2021:
17. You have the following DataFrame and want to replace its null values:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2020| Scala |
| X2 |2020| Python |
| X3 |null| Java |
| X4 |2021| Scala |
| X5 |null| null |
| X6 |2021| Spark |
+-------+----+----------+
• A. df.na.fill("2021", "Python")
• B. df.na.fill({"Year": "2021", "CourseName": "Python"})
• D. df.na.fill("2021")
18. Which code block merges two DataFrames df1 and df2?
• A. df1.append(df2)
• B. df1.merge(df2)
• C. df1.union(df2)
• D. df1.add(df2)
19. Which statements are used to bring data to the Spark driver?
• A. df.first()
• B. df.take(10)
• C. df.collect()
• D. df.limit(10)
20. Spark allows schema-on-read using the infer schema option. Choose the incorrect statement
about schema-on-read:
• B. Infer schema can be slow with plain-text file formats like CSV or JSON
• C. Infer schema can lead to precision issues like a long type incorrectly set as an integer
22. What is the default compression format for saving a DataFrame as a Parquet file?
• A. uncompressed
• B. none
• C. lz4
• D. snappy
23. Choose the code block to write a DataFrame in compressed JSON file format:
• A. df.write.mode("overwrite").format("json").option("compression",
"gzip").save("data/myTable")
• B. df.write.mode("overwrite").option("compression", "gzip").save("data/myTable")
• C. df.write.mode("overwrite").option("codec", "gzip").save("data/myTable")
• D. df.write.mode("overwrite").codec("gzip").save("data/myTable")
24. Choose the correct expression to create a Spark database named my_spark_db:
• B. spark.createDatabase("my_spark_db")
• C. spark.catalog.createDatabase("my_spark_db")
1. spark.sql("CREATE TABLE flights_tbl (date STRING, delay INT, distance INT, origin STRING,
destination STRING)")
2. spark.sql("CREATE TABLE flights_tbl(date STRING, delay INT, distance INT, origin STRING,
destination STRING) USING csv OPTIONS (PATH '/tmp/flights/flights_tbl.csv')") Choose all
correct statements:
• D. Both statements are the same except the second specifies the data file location
26. You create a temporary view using:
df1.createOrReplaceTempView("my_view")
Choose the command to drop this view:
• C. spark.catalog.dropGlobalTempView("my_view")
• D. spark.catalog.dropTempView("my_view")
27. What is the default storage level for a Spark DataFrame when cached?
• C. MEMORY_ONLY
• D. DISK_ONLY
28. What does the MEMORY_AND_DISK storage level mean?
• A. Data is stored directly as objects in memory and a copy is serialized and stored on disk
• B. Data is stored directly as objects in memory, but if there’s insufficient memory, the rest is
serialized and stored on disk
• C. Data is stored on the disk and brought into memory when required
29. Consider the following code block:
df1.withColumn("Flight_Delays", expr("""CASE WHEN delay > 360 THEN 'Very Long Delays' WHEN
delay >= 120 AND delay <= 360 THEN 'Long Delays' WHEN delay >= 60 AND delay < 120 THEN 'Short
Delays' WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays' WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'"""))
30. You want to implement the following CASE expression using the when() DataFrame function:
text
CollapseWrap
Copy
df1.withColumn("Flight_Delays", expr("""CASE WHEN delay > 360 THEN 'Very Long Delays' WHEN
delay >= 120 AND delay <= 360 THEN 'Long Delays' WHEN delay >= 60 AND delay < 120 THEN 'Short
Delays' WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays' WHEN delay = 0 THEN 'No Delays'
ELSE 'Early' END"""))
Choose the correct expression:
31. There is a global temp view named my_global_view. Which command should you choose to
query it?
• A. spark.read.table("my_global_view")
• B. spark.read.view("my_global_view")
• C. spark.read.table("global_temp.my_global_view")
• D. spark.read.view("global_temp.my_global_view")
• A. df = spark.read.format("tsv").option("inferSchema", "true").option("header",
"true").load("data/my_data_file.tsv")
• B. df = spark.read.format("csv").option("inferSchema", "true").option("header",
"true").option("sep", "tab").load("data/my_data_file.tsv")
• C. df = spark.read.format("csv").option("inferSchema", "true").option("header",
"true").option("sep", "\t").load("data/my_data_file.tsv")
• D. df = spark.read.format("csv").option("inferSchema", "true").option("header",
"true").option("delimeter", "\t").load("data/my_data_file.tsv")
33. You have the following CSV file:
id,fname,lname,dob
101,prashant,pandey,25-05-1975
102,abdul,hamid,28-12-1986
103,M David,turner,23-08-1979
You want dob to be parsed as a date type, producing the following schema:
root
34. Can you write Spark UDFs in Scala or Java and run them from a PySpark application?
• A. TRUE
• B. FALSE
35. Consider the following code block:
df = spark.range(5).toDF("num")
def power3(value):
return value ** 3
power3_udf = udf(power3)
df.selectExpr("power3_udf(num)").show()
36. You have a DataFrame with an ID column, a Department column, and a PersonalDetails struct column containing FName and LName fields. Choose the code block that produces the following output:
+---+-----+------+----------+
| ID|FName|LName|Department|
+---+-----+------+----------+
|102|David|Turner| Support |
|103|Abdul| Hamid| Account |
+---+-----+------+----------+
• B. df1.select("ID", col("PersonalDetails").getField("FName").alias("FName"),
col("PersonalDetails").getField("LName").alias("LName"), "Department").show()
37. You have the following CSV data:
ID,TEXT
102,WHITE LANTERN
38. What is the output of the following code block?
df = spark.createDataFrame(mylist, IntegerType()).toDF("value")
df.withColumn("key", col("value") % 1000).groupBy("key").agg(expr("count(key) as count"),
expr("sum(key) as sum")).orderBy(col("key").desc()).limit(1).select("count", "sum").show()
39. Which configuration controls the maximum partition size when reading files?
• A. spark.files.maxPartitionSize
• B. spark.files.maxPartitionBytes
• C. spark.maxPartitionSize
• D. spark.maxPartitionBytes
40. Which configuration controls the number of available cores for executors?
• A. spark.driver.cores
• B. spark.executor.cores
• C. spark.cores.max
• D. spark.task.cpus
42. Select the correct statements about the Spark Cluster Manager:
• A. Is fault-tolerant
• B. The Cluster Manager is responsible for maintaining a cluster of machines that will run your
Spark Application
• C. A Cluster Manager may have its own master and worker nodes
43. Select the correct statement about Spark Drivers and Executors:
• A. The executors communicate with the cluster manager and are responsible for executing
tasks on the workers
• B. Local mode runs the Spark driver and executor in the same JVM on a single computer
• C. Client mode runs the driver and executor on the client machine
• D. Cluster mode runs the driver with the YARN Application Master
• A. If there are 1,000 little partitions, we will have 1,000 tasks that can be executed in parallel
• B. Partitioning your data into a greater number of partitions means that more can be
executed in parallel
• C. An executor with 12 cores can have 12 or more tasks working on 12 or more partitions in
parallel
• B. Each executor can have multiple slots depending upon the executor cores
• D. Each worker can have multiple slots where executors are allocated
49. Which configuration sets the scheduling mode between jobs submitted to the same
SparkContext?
• A. spark.job.scheduler.mode
• B. spark.scheduler.mode
• C. spark.optimizer.scheduler.mode
• D. spark.scheduler.job.mode
50. Which configurations are related to enabling dynamic adjustment of resources based on the
workload?
• A. spark.dynamicAllocation.shuffleTracking.enabled
• B. spark.sql.dynamicAllocation.enabled
• C. spark.dynamicAllocation.enabled
• D. spark.shuffle.service.enabled
51. Columns or table names are resolved by consulting an internal Catalog at which stage of Spark
query optimization?
• A. Analysis
• B. Logical Optimization
• C. Physical Planning
• D. Code Generation
53. You have a DataFrame with a string-type column today in DD-MM-YYYY format. You want to
add a column week_later with a value one week later than today. Select the correct code block:
54. You have a DataFrame with a string field day in MM-DD-YYYY format. What is the problem
with:
• C. This problem can be solved by changing the filter to: myDF.filter("day > '05-07-2021'")
• D. The day field is a string field but the filter condition expects it to be a date field; convert
day to date type first
55. Select the most appropriate expression for casting a salary column to a double value:
• A. expr("cast(salary as double)")
• B. col("salary").cast("double")
• C. expr("DOUBLE(salary)")
57. You have the following DataFrame:
+-----+---+------+
| name|age|salary|
+-----+---+------+
+-----+---+------+
You want to sort it in descending order of salary, and if salary is equal, by age in ascending order:
Expected output:
+-----+---+------+
| name|age|salary|
+-----+---+------+
+-----+---+------+
Choose the option that correctly fills in the numbered blanks:
df._1_(_2_, _3_)
58. You create a DataFrame from the following list:
data_list = [("David", "Account", "United States", "6500"), ("Ravi", "Account", "India", "5500"),
("John", "Software", "India", "6500"), ("Rosy", "Software", "India", "8200"), ("Abdul", "Support",
"Brazil", "4800")]
Choose the code block(s) that produce the following output:
+-------------+----------+-----+
| country|department|count|
+-------------+----------+-----+
| India | Account | 1|
| India | Software | 2|
| Brazil | Support | 1|
+-------------+----------+-----+
• A. df.groupBy("country", "department").agg(expr("count(*)")).show()
• B. df.groupBy(expr("country, department")).count().show()
• C. df.groupBy("country", "department").count().show()
• D. df.groupBy("department", "country").count().show()
59. You have the following DataFrame:
+-------+----+----------+--------+
|BatchID|Year|CourseName|Students|
+-------+----+----------+--------+
+-------+----+----------+--------+
Choose the code block to create a summary DataFrame for TotalStudents over Year and
CourseName:
Expected output:
+----+----------+-------------+
|Year|CourseName|TotalStudents|
+----+----------+-------------+
+----+----------+-------------+
• A. df.groupBy("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName")
• B. df.rollup("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName")
• C. df.rollup("Year", "CourseName").agg(expr("sum(Students)").alias("TotalStudents"))
• D. df.pivot("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName")
60. You have the following code block for joining two DataFrames:
joinType = "inner"
This throws an error: Reference 'BatchID' is ambiguous. Choose the corrected code block:
61. You have the following two DataFrames:
df1:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2021| Scala |
| Y5 |2021| Scala |
+-------+----+----------+
df2:
+-------+--------+
|BatchID|Students|
+-------+--------+
| X1 | 270|
| N3 | 150|
+-------+--------+
You want to select rows from df1 that do not exist in df2:
Expected output:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| Y5 |2021| Scala |
+-------+----+----------+
Which joinType produces this result?
• A. joinType = "left_semi"
• B. joinType = "right_semi"
• C. joinType = "left_outer"
• D. joinType = "left_anti"
62. What API can you use to get the number of partitions of a DataFrame df?
• A. df.getNumPartitions()
• B. df.rdd.getNumPartitions()
• C. df.getPartitionCount()
• D. df.rdd.getPartitionCount()
63. You have the following DataFrame:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2020| Scala |
| X2 |2020| Python |
| X3 |null| Java |
| X4 |2021| Scala |
| X5 |null| Python |
| X6 |2021| Spark |
+-------+----+----------+
• B. We can use df.na.drop() to delete all rows having any null column
65. You have the following DataFrame:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2020| Scala |
| X2 |2020| Python |
| X3 |null| Java |
| X4 |2021| Scala |
| X5 |null| null |
| X6 |2021| null |
+-------+----+----------+
You use:
df.na.drop(thresh=1)
• A. This expression will delete X3, X5, and X6 rows because the threshold=1 says "Delete the
row if at least one column is null"
• B. This expression will not delete any row because the threshold=1 says "Keep the row if at
least one column is not null"
• C. This expression will delete X3 and X6 rows because the threshold=1 says "Delete the row if
only one column is null"
• D. This expression will delete X3 row because the threshold=1 says "Delete the row if more
than one column is null"
1. B - df.select("name", expr("salary * 0.20 as increment")) (Correct way to alias using expr() for
column expressions.)
2. A, B - Both withColumn() chaining and selectExpr() are valid; option C is incorrect due to
invalid col() syntax.
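For illustration, a minimal sketch of both approaches, assuming a DataFrame with name, age, and salary columns (the sample row and session setup are only for demonstration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Ravi", 28, 3200.0)], ["name", "age", "salary"])
# withColumn() chaining: add increment first, then use it to compute new_salary
df_a = (df.withColumn("increment", expr("salary * 0.20"))
          .withColumn("new_salary", expr("salary + increment")))
# selectExpr(): keep all existing columns and add the two new ones in one call
df_b = df.selectExpr("*", "salary * 0.20 as increment",
                     "salary + salary * 0.20 as new_salary")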
4. B, D - Returns a new DataFrame with a column renamed; does not throw an error if the
schema doesn’t contain the existing name.
8. B, C - df.filter((df.salary > 5000) & (df.age > 30)) and df.filter("salary > 5000").filter("age > 30")
(Implement AND condition correctly.)
12. C - A right outer join keeps every row from df2 and fills the df1 columns with nulls where there is no match.
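A minimal sketch of the right outer join, using the df1/df2 layout from the question:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("X1", 2021, "Scala"), ("Y5", 2021, "Scala")],
                            ["BatchID", "Year", "CourseName"])
df2 = spark.createDataFrame([("X1", 270), ("N3", 150)], ["BatchID", "Students"])
# Every row of df2 (X1 and N3) is kept; df1's columns are null for N3
joined = df1.join(df2, df1.BatchID == df2.BatchID, "right_outer")
joined.show()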
13. C - The joinType and joinExpr are in the wrong order; correct syntax is join(rightDF, joinExpr,
joinType).
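A minimal sketch of the expected argument order, with illustrative one-row DataFrames:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("X1", 2021)], ["BatchID", "Year"])
df2 = spark.createDataFrame([("X1", 270)], ["BatchID", "Students"])
joinExpr = df1.BatchID == df2.BatchID
joinType = "inner"
# DataFrame.join(other, on, how): the join expression comes before the join type
joined = df1.join(df2, joinExpr, joinType)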
14. B - df2 = df1.repartition(10, "Country") (Correct for repartitioning by column with specified
partitions.)
19. A, B, C - first(), take(10), collect() (All are actions that bring data to the driver; limit() is a
transformation.)
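A minimal sketch contrasting the four calls (sample data is arbitrary):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(100)
first_row = df.first()    # single Row on the driver
ten_rows = df.take(10)    # list of 10 Rows on the driver
all_rows = df.collect()   # every row on the driver
small_df = df.limit(10)   # transformation: still a DataFrame, nothing moves until an action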
20. D - It is a good idea to use schema-on-read for production ETL (Incorrect; manual schema
definition is preferred for production.)
21. B - mySchema = "ID INT, Name STRING, Salary DOUBLE" (Correct equivalent schema string.)
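For illustration, a minimal sketch of reading with a DDL-style schema string; the CSV path is hypothetical:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# DDL-style schema string equivalent to a three-field StructType
mySchema = "ID INT, Name STRING, Salary DOUBLE"
df = (spark.read.format("csv")
      .option("header", "true")
      .schema(mySchema)               # explicit schema instead of inferSchema
      .load("data/employees.csv"))    # hypothetical path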
23. A - df.write.mode("overwrite").format("json").option("compression",
"gzip").save("data/myTable") (Correct for JSON compression.)
25. A, B, D - First block creates a managed table; second creates an unmanaged table with a
specified path.
27. B - MEMORY_AND_DISK_DESER in Spark 3.1.1 (Correct default storage level for newer Spark
versions.)
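A quick way to check the default on your own installation (the printed level varies by Spark version):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(10)
df.cache()               # marks the DataFrame with the default storage level
print(df.storageLevel)   # prints the effective level for this Spark version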
28. B - Data is stored directly as objects in memory, but if insufficient memory, the rest is
serialized and stored on disk.
30. B, C - Both when() chains ending in otherwise() are correct (with or without lit()); else is not a valid Column method.
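For illustration, a minimal sketch of the equivalent when()/otherwise() chain, reusing the thresholds from the question (the sample delay values are made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(400,), (200,), (90,), (30,), (0,), (-5,)], ["delay"])
df1 = df1.withColumn(
    "Flight_Delays",
    when(col("delay") > 360, "Very Long Delays")
    .when((col("delay") >= 120) & (col("delay") <= 360), "Long Delays")
    .when((col("delay") >= 60) & (col("delay") < 120), "Short Delays")
    .when((col("delay") > 0) & (col("delay") < 60), "Tolerable Delays")
    .when(col("delay") == 0, "No Delays")
    .otherwise("Early"))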
33. A - schema with dateFormat "dd-MM-yyyy" (Correct for parsing dates correctly.)
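For illustration, a minimal sketch of loading the file so dob is parsed with that format; the path is hypothetical:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = (spark.read.format("csv")
      .option("header", "true")
      .option("dateFormat", "dd-MM-yyyy")   # matches values such as 25-05-1975
      .schema("id INT, fname STRING, lname STRING, dob DATE")
      .load("data/people.csv"))             # hypothetical path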
37. A - split() creates an array with individual words; V3 for row 103 is "HOTTIE".
38. C - key=6 (from 4006) has count=3 (1002, 2002, 4006) and sum=6.
42. A, B, C - Cluster Manager runs applications, maintains the cluster, and has master/worker
nodes.
44. B - Local mode runs driver and executor in the same JVM.
48. B, C - Executors have multiple slots; each slot can be assigned a task.
54. D - The day field is a string; convert to date using to_date() before filtering.
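A minimal sketch of the fix, assuming the day column uses the MM-DD-YYYY format from the question (sample values are made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col
spark = SparkSession.builder.getOrCreate()
myDF = spark.createDataFrame([("05-09-2021",), ("04-01-2021",)], ["day"])
# Convert the MM-DD-YYYY string into a real date, then compare as dates
myDF = myDF.withColumn("day", to_date(col("day"), "MM-dd-yyyy"))
result = myDF.filter(col("day") > "2021-05-07")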
56. A, C - df.filter("salary > 5000 or age > 30") and df.filter((col("salary") > 5000) | (col("age") >
30)) (Correct for OR condition.)
59. B - df.rollup("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName") (Correct for rollup with totals.)
63. B - Returns the first non-null argument if it exists; otherwise, null (Spark SQL coalesce
function.)
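For illustration, a minimal sketch of coalesce() used to backfill nulls (sample rows are taken loosely from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("X3", None, "Java"), ("X4", 2021, "Scala")],
                           "BatchID STRING, Year INT, CourseName STRING")
# coalesce() returns its first non-null argument: nulls in Year fall back to 2021
df = df.withColumn("Year", coalesce(col("Year"), lit(2021)))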
65. B - thresh=1 means keep rows with at least one non-null column, so no rows are deleted.
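A minimal sketch of how thresh counts non-null values, using rows shaped like the question's DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("X3", None, "Java"), ("X5", None, None), ("X6", 2021, None)],
    "BatchID STRING, Year INT, CourseName STRING")
# thresh=1: keep rows with at least one non-null value, so nothing is dropped here
df.na.drop(thresh=1).show()
# thresh=3: keep only rows where all three columns are non-null, so all three rows are dropped
df.na.drop(thresh=3).show()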