1. You select each employee's name and salary * 0.2 and get the following output:
+-----+--------------+
| name|(salary * 0.2)|
+-----+--------------+
| Ravi| 640.0|
|Abdul| 960.0|
| John| 1300.0|
| Rosy| 1640.0|
+-----+--------------+
Choose the correct expression for giving an alias to the last column:
2. Which of the following code blocks will add two new columns salary_increment and new_salary
to an existing DataFrame?
The resulting DataFrame should have the following columns:
+-----+---+------+---------+----------+
| name|age|salary|increment|new_salary|
+-----+---+------+---------+----------+
+-----+---+------+---------+----------+
3. You want to remove the salary column. Choose the response that correctly fills in the numbered blanks:
df.__1__(__2__)
• A. 1. drop 2. "salary"
• B. 1. del 2. salary
• C. 1. remove 2. "salary"
• D. 1. delete 2. "salary"
4. Which statements are correct about the withColumnRenamed() transformation?
• D. Does not throw an error if the schema doesn't contain the existing name
5. You have a DataFrame with two date-type columns: start and end. Which expression correctly
selects the difference between these two dates?
• A. myDF.select(date_diff("end", "start"))
• B. myDF.select(datediff("start", "end"))
• C. myDF.select(datediff("end", "start"))
• D. myDF.select("end" - "start")
6. You have a start_time field in your DataFrame (string type) with values like:
17-05-2021 00:02:17.592
This represents a timestamp in DD-MM-YYYY HH:MI:SS.SSS format. How can you convert this field to
a timestamp type?
• C. myDF.withColumn("start_time", to_timestamp("start_time"))
7. You have a DataFrame with the following schema:
root
You want to select all columns, converting age to integer and salary to double. Choose the correct
option:
8. You have the following employee DataFrame:
+-----+---+------+
| name|age|salary|
+-----+---+------+
+-----+---+------+
Choose the code blocks that select only the employees with salary greater than 5000 and age greater than 30.
9. You want to select the distinct rows of the following DataFrame:
+-----+---+------+
| name|age|salary|
+-----+---+------+
+-----+---+------+
• A. df.distinct()
• B. df.select("*").distinct()
10. You create a DataFrame from the following list:
data_list = [("David", "Account", "United States", "6500"), ("Ravi", "Account", "India", "5500"),
("John", "Software", "India", "6500"), ("Rosy", "Software", "India", "8200"), ("Abdul", "Support",
"Brazil", "4800")]
Choose the code block that produces the following summary:
+----------+-------------+-----------+-----------+
|department| country|NumEmployee|TotalSalary|
+----------+-------------+-----------+-----------+
+----------+-------------+-----------+-----------+
11. You have the following DataFrame:
+-------+----+----------+--------+
|BatchID|Year|CourseName|Students|
+-------+----+----------+--------+
+-------+----+----------+--------+
Choose the code block that pivots it into the following shape:
+----+------+-----+
|Year|Python|Scala|
+----+------+-----+
|2020| 300.0|250.0|
|2021| 900.0|500.0|
+----+------+-----+
• A. df.groupBy("Year").agg(expr("pivot(CourseName)"), expr("sum(Students)"))
• B. df.groupBy("Year").pivot("CourseName").agg(expr("sum(Students)"))
• C. df.groupBy("CourseName").pivot("Year").agg(expr("sum(Students)"))
• D. df.groupBy("Year").pivot("Students").agg(expr("sum(CourseName)"))
df1:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2021| Scala |
| Y5 |2021| Scala |
+-------+----+----------+
df2:
+-------+--------+
|BatchID|Students|
+-------+--------+
| X1 | 270|
| N3 | 150|
+-------+--------+
13. You have the following code block for joining df1 and df2:
joinType = "inner"
• C. The joinType and joinExpr are at the wrong place; swap their positions
14. You have a DataFrame (df1) with 146 unique countries in the Country column. You want to
repartition it on Country into 10 partitions only. Choose the correct code block:
• A. df2 = df1.repartition(10)
• D. The requirement is incorrect; you can partition to 146 partitions because you have 146
countries
15. What is the output of the following code block?
df = spark.read.parquet("data/summary.parquet")
df2 = df.repartition(20)
print(df2.rdd.getNumPartitions())
df3 = df2.coalesce(100)
print(df3.rdd.getNumPartitions())
• A. 20 100
• B. 100 100
• C. 100 20
• D. 20 20
16. You have the following DataFrame:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2020| Scala |
| X2 |2020| Python |
| X3 |null| Java |
| X4 |2021| Scala |
| X5 |null| Python |
| X6 |2021| Spark |
+-------+----+----------+
Choose the incorrect option to replace all nulls in the Year column with 2021:
17. You have the following DataFrame and want to replace its null values:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2020| Scala |
| X2 |2020| Python |
| X3 |null| Java |
| X4 |2021| Scala |
| X5 |null| null |
| X6 |2021| Spark |
+-------+----+----------+
• A. df.na.fill("2021", "Python")
• B. df.na.fill({"Year": "2021", "CourseName": "Python"})
• D. df.na.fill("2021")
18. Which code block merges two DataFrames df1 and df2?
• A. df1.append(df2)
• B. df1.merge(df2)
• C. df1.union(df2)
• D. df1.add(df2)
19. Which statements are used to bring data to the Spark driver?
• A. df.first()
• B. df.take(10)
• C. df.collect()
• D. df.limit(10)
20. Spark allows schema-on-read using the infer schema option. Choose the incorrect statement
about schema-on-read:
• B. Infer schema can be slow with plain-text file formats like CSV or JSON
• C. Infer schema can lead to precision issues like a long type incorrectly set as an integer
22. What is the default compression format for saving a DataFrame as a Parquet file?
• A. uncompressed
• B. none
• C. lz4
• D. snappy
23. Choose the code block to write a DataFrame in compressed JSON file format:
• A. df.write.mode("overwrite").format("json").option("compression",
"gzip").save("data/myTable")
• B. df.write.mode("overwrite").option("compression", "gzip").save("data/myTable")
• C. df.write.mode("overwrite").option("codec", "gzip").save("data/myTable")
• D. df.write.mode("overwrite").codec("gzip").save("data/myTable")
24. Choose the correct expression to create a Spark database named my_spark_db:
• B. spark.createDatabase("my_spark_db")
• C. spark.catalog.createDatabase("my_spark_db")
1. spark.sql("CREATE TABLE flights_tbl (date STRING, delay INT, distance INT, origin STRING,
destination STRING)")
2. spark.sql("CREATE TABLE flights_tbl(date STRING, delay INT, distance INT, origin STRING,
destination STRING) USING csv OPTIONS (PATH '/tmp/flights/flights_tbl.csv')") Choose all
correct statements:
• D. Both statements are the same except the second specifies the data file location
26. You create a temporary view using:
df1.createOrReplaceTempView("my_view")
Choose the command to drop this view:
• C. spark.catalog.dropGlobalTempView("my_view")
• D. spark.catalog.dropTempView("my_view")
27. What is the default storage level for a Spark DataFrame when cached?
• C. MEMORY_ONLY
• D. DISK_ONLY
28. What does the MEMORY_AND_DISK storage level mean?
• A. Data is stored directly as objects in memory and a copy is serialized and stored on disk
• B. Data is stored directly as objects in memory, but if there’s insufficient memory, the rest is
serialized and stored on disk
• C. Data is stored on the disk and brought into memory when required
29. Consider the following code block:
df1.withColumn("Flight_Delays", expr("""CASE WHEN delay > 360 THEN 'Very Long Delays' WHEN
delay >= 120 AND delay <= 360 THEN 'Long Delays' WHEN delay >= 60 AND delay < 120 THEN 'Short
Delays' WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays' WHEN delay = 0 THEN 'No Delays'
ELSE 'Early'"""))
30. You want to implement the following CASE expression using the when() DataFrame function:
text
CollapseWrap
Copy
df1.withColumn("Flight_Delays", expr("""CASE WHEN delay > 360 THEN 'Very Long Delays' WHEN
delay >= 120 AND delay <= 360 THEN 'Long Delays' WHEN delay >= 60 AND delay < 120 THEN 'Short
Delays' WHEN delay > 0 and delay < 60 THEN 'Tolerable Delays' WHEN delay = 0 THEN 'No Delays'
ELSE 'Early' END"""))
Choose the correct expression:
31. There is a global temp view named my_global_view. Which command should you choose to
query it?
• A. spark.read.table("my_global_view")
• B. spark.read.view("my_global_view")
• C. spark.read.table("global_temp.my_global_view")
• D. spark.read.view("global_temp.my_global_view")
• A. df = spark.read.format("tsv").option("inferSchema", "true").option("header",
"true").load("data/my_data_file.tsv")
• B. df = spark.read.format("csv").option("inferSchema", "true").option("header",
"true").option("sep", "tab").load("data/my_data_file.tsv")
• C. df = spark.read.format("csv").option("inferSchema", "true").option("header",
"true").option("sep", "\t").load("data/my_data_file.tsv")
• D. df = spark.read.format("csv").option("inferSchema", "true").option("header",
"true").option("delimeter", "\t").load("data/my_data_file.tsv")
33. You have the following CSV file:
id,fname,lname,dob
101,prashant,pandey,25-05-1975
102,abdul,hamid,28-12-1986
103,M David,turner,23-08-1979
You want dob to be parsed as a date type, producing the following schema:
root
34. Can you write Spark UDFs in Scala or Java and run them from a PySpark application?
• A. TRUE
• B. FALSE
35. Consider the following code block:
df = spark.range(5).toDF("num")
def power3(value):
return value ** 3
power3_udf = udf(power3)
df.selectExpr("power3_udf(num)").show()
36. You have a DataFrame with an ID column, a Department column, and a PersonalDetails struct column containing FName and LName fields. Choose the code block that produces the following output:
+---+-----+------+----------+
| ID|FName|LName|Department|
+---+-----+------+----------+
|102|David|Turner| Support |
|103|Abdul| Hamid| Account |
+---+-----+------+----------+
• B. df1.select("ID", col("PersonalDetails").getField("FName").alias("FName"),
col("PersonalDetails").getField("LName").alias("LName"), "Department").show()
37. You have the following CSV data:
ID,TEXT
102,WHITE LANTERN
38. What is the output of the following code block?
df = spark.createDataFrame(mylist, IntegerType()).toDF("value")
df.withColumn("key", col("value") % 1000).groupBy("key").agg(expr("count(key) as count"),
expr("sum(key) as sum")).orderBy(col("key").desc()).limit(1).select("count", "sum").show()
39. Which configuration controls the maximum partition size when reading files?
• A. spark.files.maxPartitionSize
• B. spark.files.maxPartitionBytes
• C. spark.maxPartitionSize
• D. spark.maxPartitionBytes
40. Which configuration controls the number of available cores for executors?
• A. spark.driver.cores
• B. spark.executor.cores
• C. spark.cores.max
• D. spark.task.cpus
42. Select the correct statements about the Spark Cluster Manager:
• A. Is fault-tolerant
• B. The Cluster Manager is responsible for maintaining a cluster of machines that will run your
Spark Application
• C. A Cluster Manager may have its own master and worker nodes
43. Select the correct statement about Spark Drivers and Executors:
• A. The executors communicate with the cluster manager and are responsible for executing
tasks on the workers
• B. Local mode runs the Spark driver and executor in the same JVM on a single computer
• C. Client mode runs the driver and executor on the client machine
• D. Cluster mode runs the driver with the YARN Application Master
• A. If there are 1,000 little partitions, we will have 1,000 tasks that can be executed in parallel
• B. Partitioning your data into a greater number of partitions means that more can be
executed in parallel
• C. An executor with 12 cores can have 12 or more tasks working on 12 or more partitions in
parallel
• B. Each executor can have multiple slots depending upon the executor cores
• D. Each worker can have multiple slots where executors are allocated
49. Which configuration sets the scheduling mode between jobs submitted to the same
SparkContext?
• A. spark.job.scheduler.mode
• B. spark.scheduler.mode
• C. spark.optimizer.scheduler.mode
• D. spark.scheduler.job.mode
50. Which configurations are related to enabling dynamic adjustment of resources based on the
workload?
• A. spark.dynamicAllocation.shuffleTracking.enabled
• B. spark.sql.dynamicAllocation.enabled
• C. spark.dynamicAllocation.enabled
• D. spark.shuffle.service.enabled
51. Columns or table names are resolved by consulting an internal Catalog at which stage of Spark
query optimization?
• A. Analysis
• B. Logical Optimization
• C. Physical Planning
• D. Code Generation
53. You have a DataFrame with a string-type column today in DD-MM-YYYY format. You want to
add a column week_later with a value one week later than today. Select the correct code block:
54. You have a DataFrame with a string field day in MM-DD-YYYY format. What is the problem
with:
• C. This problem can be solved by changing the filter to: myDF.filter("day > '05-07-2021'")
• D. The day field is a string field but the filter condition expects it to be a date field; convert
day to date type first
55. Select the most appropriate expression for casting a salary column to a double value:
• A. expr("cast(salary as double)")
• B. col("salary").cast("double")
• C. expr("DOUBLE(salary)")
57. You have the following DataFrame:
+-----+---+------+
| name|age|salary|
+-----+---+------+
+-----+---+------+
You want to sort it in descending order of salary, and if salary is equal, by age in ascending order:
Expected output:
+-----+---+------+
| name|age|salary|
+-----+---+------+
+-----+---+------+
Choose the option that correctly fills in the numbered blanks:
df._1_(_2_, _3_)
58. You create a DataFrame from the following list:
data_list = [("David", "Account", "United States", "6500"), ("Ravi", "Account", "India", "5500"),
("John", "Software", "India", "6500"), ("Rosy", "Software", "India", "8200"), ("Abdul", "Support",
"Brazil", "4800")]
Choose the code block(s) that produce the following output:
+-------------+----------+-----+
| country|department|count|
+-------------+----------+-----+
| India | Account | 1|
| India | Software | 2|
| Brazil | Support | 1|
+-------------+----------+-----+
• A. df.groupBy("country", "department").agg(expr("count(*)")).show()
• B. df.groupBy(expr("country, department")).count().show()
• C. df.groupBy("country", "department").count().show()
• D. df.groupBy("department", "country").count().show()
59. You have the following DataFrame:
+-------+----+----------+--------+
|BatchID|Year|CourseName|Students|
+-------+----+----------+--------+
+-------+----+----------+--------+
Choose the code block to create a summary DataFrame for TotalStudents over Year and
CourseName:
Expected output:
+----+----------+-------------+
|Year|CourseName|TotalStudents|
+----+----------+-------------+
+----+----------+-------------+
• A. df.groupBy("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName")
• B. df.rollup("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName")
• C. df.rollup("Year", "CourseName").agg(expr("sum(Students)").alias("TotalStudents"))
• D. df.pivot("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName")
60. You have the following code block for joining two DataFrames:
joinType = "inner"
This throws an error: Reference 'BatchID' is ambiguous. Choose the corrected code block:
61. You have the following two DataFrames:
df1:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2021| Scala |
| Y5 |2021| Scala |
+-------+----+----------+
df2:
+-------+--------+
|BatchID|Students|
+-------+--------+
| X1 | 270|
| N3 | 150|
+-------+--------+
You want to select rows from df1 that do not exist in df2:
Expected output:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| Y5 |2021| Scala |
+-------+----+----------+
Which joinType produces this result?
• A. joinType = "left_semi"
• B. joinType = "right_semi"
• C. joinType = "left_outer"
• D. joinType = "left_anti"
62. What API can you use to get the number of partitions of a DataFrame df?
• A. df.getNumPartitions()
• B. df.rdd.getNumPartitions()
• C. df.getPartitionCount()
• D. df.rdd.getPartitionCount()
63. You have the following DataFrame:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2020| Scala |
| X2 |2020| Python |
| X3 |null| Java |
| X4 |2021| Scala |
| X5 |null| Python |
| X6 |2021| Spark |
+-------+----+----------+
• B. We can use df.na.drop() to delete all rows having any null column
65. You have the following DataFrame:
+-------+----+----------+
|BatchID|Year|CourseName|
+-------+----+----------+
| X1 |2020| Scala |
| X2 |2020| Python |
| X3 |null| Java |
| X4 |2021| Scala |
| X5 |null| null |
| X6 |2021| null |
+-------+----+----------+
You use:
df.na.drop(thresh=1)
• A. This expression will delete X3, X5, and X6 rows because the threshold=1 says "Delete the
row if at least one column is null"
• B. This expression will not delete any row because the threshold=1 says "Keep the row if at
least one column is not null"
• C. This expression will delete X3 and X6 rows because the threshold=1 says "Delete the row if
only one column is null"
• D. This expression will delete X3 row because the threshold=1 says "Delete the row if more
than one column is null"
1. B - df.select("name", expr("salary * 0.20 as increment")) (Correct way to alias using expr() for
column expressions.)
2. A, B - Both withColumn() chaining and selectExpr() are valid; option C is incorrect due to
invalid col() syntax.
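For illustration, a minimal sketch of both approaches, assuming a DataFrame with name, age, and salary columns (the sample row and session setup are only for demonstration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Ravi", 28, 3200.0)], ["name", "age", "salary"])
# withColumn() chaining: add increment first, then use it to compute new_salary
df_a = (df.withColumn("increment", expr("salary * 0.20"))
          .withColumn("new_salary", expr("salary + increment")))
# selectExpr(): keep all existing columns and add the two new ones in one call
df_b = df.selectExpr("*", "salary * 0.20 as increment",
                     "salary + salary * 0.20 as new_salary")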
4. B, D - Returns a new DataFrame with a column renamed; does not throw an error if the
schema doesn’t contain the existing name.
8. B, C - df.filter((df.salary > 5000) & (df.age > 30)) and df.filter("salary > 5000").filter("age > 30")
(Implement AND condition correctly.)
12. C - A right outer join keeps every row from df2 and fills the df1 columns with nulls where there is no match.
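A minimal sketch of the right outer join, using the df1/df2 layout from the question:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("X1", 2021, "Scala"), ("Y5", 2021, "Scala")],
                            ["BatchID", "Year", "CourseName"])
df2 = spark.createDataFrame([("X1", 270), ("N3", 150)], ["BatchID", "Students"])
# Every row of df2 (X1 and N3) is kept; df1's columns are null for N3
joined = df1.join(df2, df1.BatchID == df2.BatchID, "right_outer")
joined.show()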
13. C - The joinType and joinExpr are in the wrong order; correct syntax is join(rightDF, joinExpr,
joinType).
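A minimal sketch of the expected argument order, with illustrative one-row DataFrames:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("X1", 2021)], ["BatchID", "Year"])
df2 = spark.createDataFrame([("X1", 270)], ["BatchID", "Students"])
joinExpr = df1.BatchID == df2.BatchID
joinType = "inner"
# DataFrame.join(other, on, how): the join expression comes before the join type
joined = df1.join(df2, joinExpr, joinType)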
14. B - df2 = df1.repartition(10, "Country") (Correct for repartitioning by column with specified
partitions.)
19. A, B, C - first(), take(10), collect() (All are actions that bring data to the driver; limit() is a
transformation.)
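A minimal sketch contrasting the four calls (sample data is arbitrary):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(100)
first_row = df.first()    # single Row on the driver
ten_rows = df.take(10)    # list of 10 Rows on the driver
all_rows = df.collect()   # every row on the driver
small_df = df.limit(10)   # transformation: still a DataFrame, nothing moves until an action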
20. D - It is a good idea to use schema-on-read for production ETL (Incorrect; manual schema
definition is preferred for production.)
21. B - mySchema = "ID INT, Name STRING, Salary DOUBLE" (Correct equivalent schema string.)
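For illustration, a minimal sketch of reading with a DDL-style schema string; the CSV path is hypothetical:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# DDL-style schema string equivalent to a three-field StructType
mySchema = "ID INT, Name STRING, Salary DOUBLE"
df = (spark.read.format("csv")
      .option("header", "true")
      .schema(mySchema)               # explicit schema instead of inferSchema
      .load("data/employees.csv"))    # hypothetical path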
23. A - df.write.mode("overwrite").format("json").option("compression",
"gzip").save("data/myTable") (Correct for JSON compression.)
25. A, B, D - First block creates a managed table; second creates an unmanaged table with a
specified path.
27. B - MEMORY_AND_DISK_DESER in Spark 3.1.1 (Correct default storage level for newer Spark
versions.)
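A quick way to check the default on your own installation (the printed level varies by Spark version):
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.range(10)
df.cache()               # marks the DataFrame with the default storage level
print(df.storageLevel)   # prints the effective level for this Spark version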
28. B - Data is stored directly as objects in memory, but if insufficient memory, the rest is
serialized and stored on disk.
30. B, C - Both when() chains ending in otherwise() are correct (with or without lit()); else is not a valid Column method.
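For illustration, a minimal sketch of the equivalent when()/otherwise() chain, reusing the thresholds from the question (the sample delay values are made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col
spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(400,), (200,), (90,), (30,), (0,), (-5,)], ["delay"])
df1 = df1.withColumn(
    "Flight_Delays",
    when(col("delay") > 360, "Very Long Delays")
    .when((col("delay") >= 120) & (col("delay") <= 360), "Long Delays")
    .when((col("delay") >= 60) & (col("delay") < 120), "Short Delays")
    .when((col("delay") > 0) & (col("delay") < 60), "Tolerable Delays")
    .when(col("delay") == 0, "No Delays")
    .otherwise("Early"))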
33. A - schema with dateFormat "dd-MM-yyyy" (Correct for parsing dates correctly.)
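For illustration, a minimal sketch of loading the file so dob is parsed with that format; the path is hypothetical:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = (spark.read.format("csv")
      .option("header", "true")
      .option("dateFormat", "dd-MM-yyyy")   # matches values such as 25-05-1975
      .schema("id INT, fname STRING, lname STRING, dob DATE")
      .load("data/people.csv"))             # hypothetical path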
37. A - split() creates an array with individual words; V3 for row 103 is "HOTTIE".
38. C - key=6 (from 4006) has count=3 (1002, 2002, 4006) and sum=6.
42. A, B, C - Cluster Manager runs applications, maintains the cluster, and has master/worker
nodes.
44. B - Local mode runs driver and executor in the same JVM.
48. B, C - Executors have multiple slots; each slot can be assigned a task.
54. D - The day field is a string; convert to date using to_date() before filtering.
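A minimal sketch of the fix, assuming the day column uses the MM-DD-YYYY format from the question (sample values are made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col
spark = SparkSession.builder.getOrCreate()
myDF = spark.createDataFrame([("05-09-2021",), ("04-01-2021",)], ["day"])
# Convert the MM-DD-YYYY string into a real date, then compare as dates
myDF = myDF.withColumn("day", to_date(col("day"), "MM-dd-yyyy"))
result = myDF.filter(col("day") > "2021-05-07")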
56. A, C - df.filter("salary > 5000 or age > 30") and df.filter((col("salary") > 5000) | (col("age") >
30)) (Correct for OR condition.)
59. B - df.rollup("Year",
"CourseName").agg(expr("sum(Students)").alias("TotalStudents")).orderBy("Year",
"CourseName") (Correct for rollup with totals.)
63. B - Returns the first non-null argument if it exists; otherwise, null (Spark SQL coalesce
function.)
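For illustration, a minimal sketch of coalesce() used to backfill nulls (sample rows are taken loosely from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, lit
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("X3", None, "Java"), ("X4", 2021, "Scala")],
                           "BatchID STRING, Year INT, CourseName STRING")
# coalesce() returns its first non-null argument: nulls in Year fall back to 2021
df = df.withColumn("Year", coalesce(col("Year"), lit(2021)))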
65. B - thresh=1 means keep rows with at least one non-null column, so no rows are deleted.
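A minimal sketch of how thresh counts non-null values, using rows shaped like the question's DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("X3", None, "Java"), ("X5", None, None), ("X6", 2021, None)],
    "BatchID STRING, Year INT, CourseName STRING")
# thresh=1: keep rows with at least one non-null value, so nothing is dropped here
df.na.drop(thresh=1).show()
# thresh=3: keep only rows where all three columns are non-null, so all three rows are dropped
df.na.drop(thresh=3).show()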