1. Accumulators are incremented and can be read from Spark Workers. State True or False.
• True
• False
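For reference on question 1, a minimal sketch (assuming the spark-shell, where sc is predefined; the accumulator name and data are illustrative): tasks on the workers can only add to an accumulator, while its value is read back on the driver.

```scala
// Workers increment the accumulator inside tasks; only the driver reads its value.
val acc = sc.longAccumulator("multiplesOfTen")   // illustrative name
sc.parallelize(1 to 100).foreach { x =>
  if (x % 10 == 0) acc.add(1)                    // runs on the executors
}
println(acc.value)                               // read on the driver: 10
```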
2. Given the pair RDD country that contains tuples of the form (Country, count), which of the
following is used to get the country with the lowest number of refugees in Scala?
• val low = country.sortByKey().first
• val low = country.sortByKey(false).first
• val low = country.map(x=>(x._2,x._1)).sortByKey().first
• val low = country.map(x=>(x._2,x._1)).sortByKey(false).first
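A sketch of the swap-and-sort pattern behind question 2, assuming country is an RDD[(String, Int)] of (Country, count) pairs (the sample data is illustrative): putting the count in the key position lets sortByKey order by count, and first then returns the smallest.

```scala
// Swap (country, count) to (count, country), sort ascending by count, take the first pair.
val country = sc.parallelize(Seq(("Afghanistan", 34), ("Albania", 12), ("Algeria", 2)))
val low = country.map(x => (x._2, x._1)).sortByKey().first
println(low)   // (2, Algeria)
```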
3. Which DataFrame method is used to remove a column from the resultant DataFrame?
• drop()
• filter()
• remove()
• All of the above
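A minimal sketch of drop() from question 3 (assuming a SparkSession spark, as in the spark-shell; the DataFrame is illustrative):

```scala
// drop() returns a new DataFrame without the named column.
import spark.implicits._
val df = Seq(("SF", 2013, 34), ("LA", 2013, 12)).toDF("city", "year", "count")
val trimmed = df.drop("year")
trimmed.show()   // only "city" and "count" remain
```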
4. What are some of the things you can monitor in the Spark Web UI?
• Which stages are running slowly
• Whether your application has the resources as expected
• Whether the datasets are fitting into memory
• All of the above
5. How do you enable the dynamic allocation property?
• spark.dynamicAllocation.enabled=true
• spark.dynamicAllocation.enabled=false
• spark.dynamicAllocation.enabled=yes
• spark.dynamicAllocation.enabled=no
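A sketch of the configuration from question 5 (values are illustrative; on most cluster managers dynamic allocation also needs the external shuffle service enabled):

```scala
import org.apache.spark.SparkConf

// Enable dynamic allocation so idle executors can be released back to the cluster.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
```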
6. Spark broadcast variables and "setting variables in your driver program" in PySpark are the
same. State True or False.
• True
• False
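A sketch contrasting the two in question 6 (spark-shell assumed; the lookup data is illustrative): a plain driver-side variable is serialized with every task closure, whereas a broadcast variable is shipped once and cached on each executor.

```scala
// Broadcast a small lookup table once per executor instead of once per task.
val lookup = Map("US" -> "United States", "IN" -> "India")   // illustrative data
val bvar = sc.broadcast(lookup)
val codes = sc.parallelize(Seq("US", "IN", "US"))
val names = codes.map(c => bvar.value.getOrElse(c, "unknown")).collect()
bvar.unpersist()   // removes the cached copies from executor memory (see question 31)
```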
7. The number of stages in a job is usually equal to the number of RDDs in the DAG.
However, the scheduler can truncate the lineage when:
• There is no movement of data from the parent RDD
• There is a shuffle
• The RDD is cached or persisted
• The RDD was materialized due to an earlier shuffle
8. Which storage options for an RDD does the MEMORY_AND_DISK_SER storage level specify?
• In memory (off-heap), on disk, serialized
• In memory, on disk, serialized
• In memory, on disk, serialized and replicated
• In memory, on disk, non-serialized
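A sketch of persisting with this storage level (question 8); the RDD is illustrative. MEMORY_AND_DISK_SER keeps partitions serialized in memory and spills to disk what does not fit, without replication.

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)           // illustrative data
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)    // serialized in memory, spill to disk
rdd.count()                                      // first action materializes the cache
```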
9. Dataset was introduced in which Spark release?
• Spark 1.6
• Spark 1.4.0
• Spark 2.1.0
• Spark 1.1
10. Which of the following (in Scala) will give the top 10 resolutions to the console, assuming
that sfpdDF is the DataFrame registered as the table sfpd?
• sqlContext.sql("SELECT resolution, count(incidentnum) AS inccount FROM
sfpd GROUP BY resolution ORDER BY inccount DESC LIMIT 10")
• sfpdDF.select("resolution").count.sort($"count".desc).show(10)
• sfpdDF.groupBy("resolution").count.sort($"count".desc).show(10)
• none of the above
11. Which RDD function returns the min, max, count, mean, and standard deviation for all
elements in an RDD?
• min()
• variance()
• stats()
• stdev()
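A sketch of stats() from question 11 (spark-shell assumed; the numbers are illustrative): it returns a StatCounter with count, mean, standard deviation, min, and max in a single pass.

```scala
val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val s = nums.stats()   // StatCounter over the RDD of doubles
println(s"count=${s.count} mean=${s.mean} stdev=${s.stdev} min=${s.min} max=${s.max}")
```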
12. The “foreach” and “map” operations operate on each element of an RDD. What are
the differences between these two operations?
• “foreach” is an action, “map” is a transformation
• “foreach” operates on an RDD and returns data to the driver, “map” operates on an RDD and returns an RDD
• “foreach” is a transformation, “map” is an action
• “foreach” operates on an RDD and returns an RDD, “map” operates on an RDD and returns data to the driver
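A sketch of the distinction in question 12 (illustrative data): map is a lazy transformation that yields a new RDD, while foreach is an action that runs a side-effecting function on the executors and returns nothing to the driver.

```scala
val rdd = sc.parallelize(1 to 5)
val doubled = rdd.map(_ * 2)        // transformation: returns RDD[Int], evaluated lazily
doubled.foreach(x => println(x))    // action: side effect runs on the executors
```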
13. Given the sfpd RDD, to create a pair RDD consisting of tuples of the form
(Category, 1), in Scala use:
• val pairs = sfpd.parallelize()
• val pairs = sfpd.map(x=>(x(Category),1))
• val pairs = sfpd.map(x=>x.parallelize())
• None of the above
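A sketch of the pattern behind question 13, assuming sfpd is an RDD of field arrays and Category is the integer index of the category column (both illustrative):

```scala
val Category = 1                                                   // illustrative column index
val sfpd = sc.parallelize(Seq(Array("123", "LARCENY"), Array("456", "ASSAULT")))
val pairs = sfpd.map(x => (x(Category), 1))                        // pair RDD of (category, 1)
```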
14. What is the difference between the take(1) and first() actions?
• take(1) returns a list with one element from an RDD, first() returns one element not in a
list
• first() returns a list with one element from an RDD, take(1) returns one element not in a
list
• first() returns an array with one element from an RDD, take(1) returns one element
not in array
• take(1) returns an array with one element from an RDD, first() returns one element
not in array
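A sketch of question 14 (illustrative data): take(1) returns an array holding one element, whereas first() returns the element itself.

```scala
val rdd = sc.parallelize(Seq(10, 20, 30))
val asArray: Array[Int] = rdd.take(1)   // Array(10)
val asValue: Int        = rdd.first()   // 10
```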
15. Given the sfpd RDD, to create a pair RDD consisting of tuples of the form
(Category, 1), in Scala use:
• val pairs = sfpd.parallelize()
• val pairs = sfpd.map(x=>(x(Category),1))
• val pairs = sfpd.map(x=>x.parallelize())
• None of the above
16. Which partitioner class is used to partition keys according to the sort order with respect
to the given type?
• RangePartitioner
• HeadPartitioner
• CompositePartitioner
• ListPartitioner
17. The primary Machine Learning API for Spark is now the _____ based API.
• DataFrame
• Dataset
• RDD
• All of the above options
18. Which of the following is not a feature of Spark?
• Supports in-memory computation
• Fault tolerance
• It is cost efficient
• Compatible with other file storage systems
19. Which of the following is true of running a Spark application on Hadoop YARN?
• In Hadoop YARN mode, the RDDs and variables are always in the same memory space
• Running in Hadoop YARN has the advantage of having multiple users running the Spark interactive shell
• There are two deploy modes that can be used to launch Spark applications on YARN: client mode and cluster mode
• Irrespective of the mode, the driver is launched in the client process that submitted the job
20. The keys transformation returns an RDD with the ordered keys from a key-value pair
RDD. True or False?
• True
• False
21. repartition(5) is the same as coalesce(5, shuffle = true). State True or False.
• True
• False
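A sketch of question 21 (illustrative data): repartition(n) is implemented as coalesce(n, shuffle = true), so both incur a full shuffle.

```scala
val rdd = sc.parallelize(1 to 100, 10)
val a = rdd.repartition(5)
val b = rdd.coalesce(5, shuffle = true)
println(s"${a.getNumPartitions} ${b.getNumPartitions}")   // 5 5
```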
22. Spark SQL translates commands into code, which is processed by:
• Driver Nodes
• Executor Nodes
• Cluster Manager
• None of the above
23. Which partition sizes hinder Spark performance?
• Only small
• Only large
• Both
• None
24. What is dynamic allocation?
• Dynamic allocation is a property whereby executors can be released back to the
cluster resource pool if they are idle for a specified period of time
• Dynamic allocation is a property whereby drivers can be released back to the
cluster resource pool if they are idle for a specified period of time
• Both
• None
25. What parameters are required for a windowed operation such as
reduceByKeyAndWindow?
• Window Length
• Sliding interval
• Window length and sliding interval
• None
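A Spark Streaming sketch of question 25 (the socket source, host, and port are illustrative): reduceByKeyAndWindow takes both a window length and a sliding interval, each a multiple of the batch interval.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))                 // 10-second batches
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map(w => (w, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
counts.print()
// ssc.start(); ssc.awaitTermination()
```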
26. groupByKey is less efficient than reduceByKey on large datasets because
• groupByKey will group all values with the same key on one machine
• with groupByKey, if a single key has more key-value pairs than can fit in
memory, an out of memory exception occurs
• reduceByKey combines locally first and then reduces after the shuffle
• All of the above
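A sketch of question 26 (illustrative pairs): reduceByKey combines values on the map side before shuffling, while groupByKey ships every pair across the network and holds all of a key's values in memory on one machine.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val reduced = pairs.reduceByKey(_ + _)                // map-side combine, then shuffle
val grouped = pairs.groupByKey().mapValues(_.sum)     // shuffles every value for each key
```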
27. What are some of the things you can monitor in the Spark Web UI?
• Which stages are running slowly
• Whether your application has the resources as expected
• Whether the datasets are fitting into memory
• All of the above
28. Caching can use disk if memory is not available. State True or False
• True
• False
29. ________________ leverages Spark Core's fast scheduling capability to perform streaming
analytics.
• MLlib
• Spark Streaming
• GraphX
• RDDs
30. Combining a set of filtered edges and/or filtered vertices from a graph creates what
structure?
• Graph
• Subgraph
• Triplet
• Struct
31. Which of the commands below is used to remove a broadcast variable, "bvar", from
memory?
• bvar.remove()
• bvar.unpersist()
• bvar=sc.broadcast(None)
• bvar.drop()
32. We can create a DataFrame using:
• Tables in Hive
• Structured data files
• External databases
• all of the above
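A sketch of the three sources from question 32 (SparkSession spark assumed; paths, table, and connection details are illustrative):

```scala
val fromFile = spark.read.json("people.json")                      // structured data file
val fromHive = spark.sql("SELECT * FROM some_hive_table")          // Hive table
val fromJdbc = spark.read.format("jdbc")                           // external database
  .option("url", "jdbc:postgresql://host/db")
  .option("dbtable", "schema.table")
  .load()
```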
33. A DataFrame can be created from an existing RDD. You would create the DataFrame from
the existing RDD by inferring the schema using case classes in which case?
• If your dataset has more than 22 fields
• If all your users are going to need the dataset parsed in the same way
• If you have two sets of users who will need the text database parsed differently
• None of the above
34. Which function is used to call a program written in a shell script or Perl from PySpark?
• call()
• pipe()
• import()
• sub()
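A sketch of pipe() from question 34, shown in Scala since the RDD API is the same (the script path is illustrative): each partition's elements are streamed through the external command, and its stdout lines come back as an RDD.

```scala
val rdd = sc.parallelize(Seq("alpha", "beta"))
val piped = rdd.pipe("/path/to/script.sh")   // illustrative external script
piped.collect().foreach(println)
```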
35. A DStream internally is:
• A continuous stream of RDDs
• A continuous stream of DataFrames
• A continuous stream of Datasets
• None of the above
36. An existing RDD, unhcrRDD, contains refugee data from the UNHCR. It contains the
following fields: country of residence, country of origin, year, number of refugees, e.g.
Array(Array(Afghanistan, Pakistan, 2013, 34), Array(Albania, Algeria, 2013, 0),
Array(Albania, China, 2013, 12), …). To get the count of refugees …, use:
• val country = unhcrRDD.map(x=>(x(0),x(3))).reduceByKey((a,b)=>a+b)
• val country = unhcrRDD.map(x=>(x(0),1)).reduceByKey((a,b)=>a+b)
• val country = unhcrRDD.map(x=>x.parallelize())
• None of the above
37. Which DStream output operation is used to write output to the console?
• print()
• dump()
• pprint()
• writeToConsole()
38. What is the default partitioner class used by Spark?
• RangePartitioner
• HashPartitioner
• CompositePartitioner
• ListPartitioner
39. PySpark is a cluster computing framework which runs on a cluster of commodity hardware
and performs data unification. State True or False.
• True
• False
40. Some ways of improving the performance of your Spark application include:
• Use Kryo serialization
• Tune the degree of parallelism
• Avoid shuffling large amounts of data
• All of the above
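A sketch of enabling Kryo serialization from question 40 (the registered class is illustrative; registration is optional but helps performance):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Array[String]]))   // illustrative class registration
```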
41. Apache Spark has APIs in:
• Java
• Scala
• Python
• All of the above