1. Accumulators are incremented and can be read from Spark Workers. State True or False.
• True
• False
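For reference on question 1, a minimal sketch (assuming the spark-shell, where sc is predefined; the accumulator name and data are illustrative): tasks on the workers can only add to an accumulator, while its value is read back on the driver.

```scala
// Workers increment the accumulator inside tasks; only the driver reads its value.
val acc = sc.longAccumulator("multiplesOfTen")   // illustrative name
sc.parallelize(1 to 100).foreach { x =>
  if (x % 10 == 0) acc.add(1)                    // runs on the executors
}
println(acc.value)                               // read on the driver: 10
```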
2. Given the pair RDD country that contains tuples of the form (Country, count), which of the
following is used to get the country with the lowest number of refugees in Scala?
• val low = country.sortByKey().first
• val low = country.sortByKey(false).first
• val low = country.map(x=>(x._2,x._1)).sortByKey().first
• val low = country.map(x=>(x._2,x._1)).sortByKey(false).first
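A sketch of the swap-and-sort pattern behind question 2, assuming country is an RDD[(String, Int)] of (Country, count) pairs (the sample data is illustrative): putting the count in the key position lets sortByKey order by count, and first then returns the smallest.

```scala
// Swap (country, count) to (count, country), sort ascending by count, take the first pair.
val country = sc.parallelize(Seq(("Afghanistan", 34), ("Albania", 12), ("Algeria", 2)))
val low = country.map(x => (x._2, x._1)).sortByKey().first
println(low)   // (2, Algeria)
```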
3. Which DataFrame method is used to remove a column from the resultant DataFrame?
• drop()
• filter()
• remove()
• All of the above
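A minimal sketch of drop() from question 3 (assuming a SparkSession spark, as in the spark-shell; the DataFrame is illustrative):

```scala
// drop() returns a new DataFrame without the named column.
import spark.implicits._
val df = Seq(("SF", 2013, 34), ("LA", 2013, 12)).toDF("city", "year", "count")
val trimmed = df.drop("year")
trimmed.show()   // only "city" and "count" remain
```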
4. What are some of the things you can monitor in the Spark Web UI?
• Which stages are running slowly
• Whether your application has the resources as expected
• Whether the datasets are fitting into memory
• All of the above
5. How do you enable the dynamic allocation property?
• spark.dynamicAllocation.enabled=true
• spark.dynamicAllocation.enabled=false
• spark.dynamicAllocation.enabled=yes
• spark.dynamicAllocation.enabled=no
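A sketch of the configuration from question 5 (values are illustrative; on most cluster managers dynamic allocation also needs the external shuffle service enabled):

```scala
import org.apache.spark.SparkConf

// Enable dynamic allocation so idle executors can be released back to the cluster.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
```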
6. Spark broadcast variables and "setting variables in your driver program" in PySpark are the
same. State True or False.
• True
• False
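A sketch contrasting the two in question 6 (spark-shell assumed; the lookup data is illustrative): a plain driver-side variable is serialized with every task closure, whereas a broadcast variable is shipped once and cached on each executor.

```scala
// Broadcast a small lookup table once per executor instead of once per task.
val lookup = Map("US" -> "United States", "IN" -> "India")   // illustrative data
val bvar = sc.broadcast(lookup)
val codes = sc.parallelize(Seq("US", "IN", "US"))
val names = codes.map(c => bvar.value.getOrElse(c, "unknown")).collect()
bvar.unpersist()   // removes the cached copies from executor memory (see question 31)
```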
7. The number of stages in a job is usually equal to the number of RDDs in the DAG.
However, the scheduler can truncate the lineage when:
• There is no movement of data from the parent RDD
• There is a shuffle
• The RDD is cached or persisted
• The RDD was materialized due to an earlier shuffle
8. Which storage options for an RDD does the MEMORY_AND_DISK_SER storage level specify?
• In memory (off-heap), on disk, serialized
• In memory, on disk, serialized
• In memory, on disk, serialized and replicated
• In memory, on disk, non-serialized
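A sketch of persisting with this storage level (question 8); the RDD is illustrative. MEMORY_AND_DISK_SER keeps partitions serialized in memory and spills to disk what does not fit, without replication.

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)           // illustrative data
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)    // serialized in memory, spill to disk
rdd.count()                                      // first action materializes the cache
```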
9. Dataset was introduced in which Spark release?
• Spark 1.6
• Spark 1.4.0
• Spark 2.1.0
• Spark 1.1
10. Which of the following (in Scala) will give the top 10 resolutions to the console, assuming
that sfpdDF is the DataFrame registered as the table sfpd?
• sqlContext.sql("SELECT resolution, count(incidentnum) AS inccount FROM
sfpd GROUP BY resolution ORDER BY inccount DESC LIMIT 10")
• sfpdDF.select("resolution").count.sort($"count".desc).show(10)
• sfpdDF.groupBy("resolution").count.sort($"count".desc).show(10)
• none of the above
11. Which RDD function returns the min, max, count, mean, and standard deviation for all
elements in an RDD?
• min()
• variance()
• stats()
• stdev()
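A sketch of stats() from question 11 (spark-shell assumed; the numbers are illustrative): it returns a StatCounter with count, mean, standard deviation, min, and max in a single pass.

```scala
val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val s = nums.stats()   // StatCounter over the RDD of doubles
println(s"count=${s.count} mean=${s.mean} stdev=${s.stdev} min=${s.min} max=${s.max}")
```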
12. The “foreach” and “map” operations operate on each element of an RDD. What are
the differences between these two operations?
• “foreach” is an action, “map” is a transformation
• “foreach” operates on an RDD and returns data to the driver, “map” operates on an RDD and returns an RDD
• “foreach” is a transformation, “map” is an action
• “foreach” operates on an RDD and returns an RDD, “map” operates on an RDD and returns data to the driver
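A sketch of the distinction in question 12 (illustrative data): map is a lazy transformation that yields a new RDD, while foreach is an action that runs a side-effecting function on the executors and returns nothing to the driver.

```scala
val rdd = sc.parallelize(1 to 5)
val doubled = rdd.map(_ * 2)        // transformation: returns RDD[Int], evaluated lazily
doubled.foreach(x => println(x))    // action: side effect runs on the executors
```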
13. Given the sfpd RDD, to create a pair RDD consisting of tuples of the form
(Category, 1), in Scala use:
• val pairs = sfpd.parallelize()
• val pairs = sfpd.map(x=>(x(Category),1))
• val pairs = sfpd.map(x=>x.parallelize())
• None of the above
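A sketch of the pattern behind question 13, assuming sfpd is an RDD of field arrays and Category is the integer index of the category column (both illustrative):

```scala
val Category = 1                                                   // illustrative column index
val sfpd = sc.parallelize(Seq(Array("123", "LARCENY"), Array("456", "ASSAULT")))
val pairs = sfpd.map(x => (x(Category), 1))                        // pair RDD of (category, 1)
```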
14. What is the difference between the take(1) and first() actions?
• take(1) returns a list with one element from an RDD, first() returns one element not in a
list
• first() returns a list with one element from an RDD, take(1) returns one element not in a
list
• first() returns an array with one element from an RDD, take(1) returns one element
not in array
• take(1) returns an array with one element from an RDD, first() returns one element
not in array
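A sketch of question 14 (illustrative data): take(1) returns an array holding one element, whereas first() returns the element itself.

```scala
val rdd = sc.parallelize(Seq(10, 20, 30))
val asArray: Array[Int] = rdd.take(1)   // Array(10)
val asValue: Int        = rdd.first()   // 10
```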
15. Given the sfpd RDD, to create a pair RDD consisting of tuples of the form
(Category, 1), in Scala use:
• val pairs = sfpd.parallelize()
• val pairs = sfpd.map(x=>(x(Category),1))
• val pairs = sfpd.map(x=>x.parallelize())
• None of the above
16. Which partitioner class is used to partition keys according to the sort order with respect
to the given type?
• RangePartitioner
• HeadPartitioner
• CompositePartitioner
• ListPartitioner
17. The primary Machine Learning API for Spark is now the _____ based API.
• DataFrame
• Dataset
• RDD
• All of the above options
18. Which of the following is not a feature of Spark?
• Supports in-memory computation
• Fault tolerance
• It is cost efficient
• Compatible with other file storage systems
19. Which of the following is true of running a Spark application on Hadoop YARN?
• In Hadoop YARN mode, the RDDs and variables are always in the same memory space
• Running in Hadoop YARN has the advantage of having multiple users running the Spark interactive shell
• There are two deploy modes that can be used to launch Spark applications on YARN: client mode and cluster mode
• Irrespective of the mode, the driver is launched in the client process that submitted the job
20. The keys transformation returns an RDD with the ordered keys from a key-value pair
RDD. True or False?
• True
• False
21. repartition(5) is the same as coalesce(5, shuffle = true). State True or False.
• True
• False
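A sketch of question 21 (illustrative data): repartition(n) is implemented as coalesce(n, shuffle = true), so both incur a full shuffle.

```scala
val rdd = sc.parallelize(1 to 100, 10)
val a = rdd.repartition(5)
val b = rdd.coalesce(5, shuffle = true)
println(s"${a.getNumPartitions} ${b.getNumPartitions}")   // 5 5
```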
22. Spark SQL translates commands into code, which is processed by:
• Driver Nodes
• Executor Nodes
• Cluster Manager
• None of the above
23. Which partition sizes hinder Spark performance?
• Only small
• Only large
• Both
• None
24. What is dynamic allocation?
• Dynamic allocation is a property whereby executors can be released back to the
cluster resource pool if they are idle for a specified period of time
• Dynamic allocation is a property whereby drivers can be released back to the
cluster resource pool if they are idle for a specified period of time
• Both
• None
25. What parameters are required for a windowed operation such as
reduceByKeyAndWindow?
• Window Length
• Sliding interval
• Window length and sliding interval
• None
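A Spark Streaming sketch of question 25 (the socket source, host, and port are illustrative): reduceByKeyAndWindow takes both a window length and a sliding interval, each a multiple of the batch interval.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))                 // 10-second batches
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map(w => (w, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
counts.print()
// ssc.start(); ssc.awaitTermination()
```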
26. groupByKey is less efficient than reduceByKey on large datasets because
• groupByKey will group all values with the same key on one machine
• with groupByKey, if a single key has more key-value pairs than can fit in
memory, an out of memory exception occurs
• reduceByKey combines locally first and then reduces after the shuffle
• All of the above
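A sketch of question 26 (illustrative pairs): reduceByKey combines values on the map side before shuffling, while groupByKey ships every pair across the network and holds all of a key's values in memory on one machine.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val reduced = pairs.reduceByKey(_ + _)                // map-side combine, then shuffle
val grouped = pairs.groupByKey().mapValues(_.sum)     // shuffles every value for each key
```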
27. What are some of the things you can monitor in the Spark Web UI?
• Which stages are running slowly
• Whether your application has the resources as expected
• Whether the datasets are fitting into memory
• All of the above
28. Caching can use disk if memory is not available. State True or False
• True
• False
29. ________________ leverages Spark Core's fast scheduling capability to perform streaming
analytics.
• MLlib
• Spark Streaming
• GraphX
• RDDs
30. Combining a set of filtered edges and/or filtered vertices from a graph creates what
structure?
• Graph
• Subgraph
• Triplet
• Struct
31. Which of the commands below is used to remove a broadcast variable, "bvar", from
memory?
• bvar.remove()
• bvar.unpersist()
• bvar=sc.broadcast(None)
• bvar.drop()
32. We can create a DataFrame using:
• Tables in Hive
• Structured data files
• External databases
• all of the above
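A sketch of the three sources from question 32 (SparkSession spark assumed; paths, table, and connection details are illustrative):

```scala
val fromFile = spark.read.json("people.json")                      // structured data file
val fromHive = spark.sql("SELECT * FROM some_hive_table")          // Hive table
val fromJdbc = spark.read.format("jdbc")                           // external database
  .option("url", "jdbc:postgresql://host/db")
  .option("dbtable", "schema.table")
  .load()
```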
33. A DataFrame can be created from an existing RDD. You would create the DataFrame from
the existing RDD by inferring the schema using case classes in which case?
• If your dataset has more than 22 fields
• If all your users are going to need the dataset parsed in the same way
• If you have two sets of users who will need the text database parsed differently
• None of the above
34. Which function is used to call a program written in a shell script or Perl from PySpark?
• call()
• pipe()
• import()
• sub()
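A sketch of pipe() from question 34, shown in Scala since the RDD API is the same (the script path is illustrative): each partition's elements are streamed through the external command, and its stdout lines come back as an RDD.

```scala
val rdd = sc.parallelize(Seq("alpha", "beta"))
val piped = rdd.pipe("/path/to/script.sh")   // illustrative external script
piped.collect().foreach(println)
```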
35. A DStream internally is:
• A continuous stream of RDDs
• A continuous stream of DataFrames
• A continuous stream of Datasets
• None of the above
36. An existing RDD, unhcrRDD, contains refugee data from the UNHCR. It contains the
following fields: country of residence, country of origin, year, number of refugees, e.g.
Array(Array(Afghanistan, Pakistan, 2013, 34), Array(Albania, Algeria, 2013, 0),
Array(Albania, China, 2013, 12), …). To get the count of refugees …, use:
• val country = unhcrRDD.map(x=>(x(0),x(3))).reduceByKey((a,b)=>a+b)
• val country = unhcrRDD.map(x=>(x(0),1)).reduceByKey((a,b)=>a+b)
• val country = unhcrRDD.map(x=>x.parallelize())
• None of the above
37. Which DStream output operation is used to write output to the console?
• print()
• dump()
• pprint()
• writeToConsole()
38. What is the default partitioner class used by Spark?
• RangePartitioner
• HashPartitioner
• CompositePartitioner
• ListPartitioner
39. PySpark is a cluster computing framework which runs on a cluster of commodity hardware
and performs data unification. State True or False.
• True
• False
40. Some ways of improving the performance of your Spark application include:
• Use Kryo serialization
• Tune the degree of parallelism
• Avoid shuffling large amounts of data
• All of the above
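A sketch of enabling Kryo serialization from question 40 (the registered class is illustrative; registration is optional but helps performance):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Array[String]]))   // illustrative class registration
```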
41. Apache Spark has APIs in:
• Java
• Scala
• Python
• All of the above