Spark Interview Questions

1. What is Apache Spark?

Spark is a fast, easy-to-use, and flexible data processing framework.

It is an open-source analytics engine.

It provides an advanced execution engine that supports acyclic data flow and in-memory computing.

In-memory caching and optimized query execution allow fast analytics on data of any size.

2. Explain the key features of Spark.

Apache Spark integrates with Hadoop.

It has an interactive language shell for Scala (the language in which Spark is written).

Spark is built around RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes in a cluster.

Apache Spark supports multiple analytic tools for interactive query analysis, real-time analysis, and graph processing.

Apache Spark supports real-time stream processing.

Spark achieves a very high data processing speed by reducing read and write operations to disk.

Spark is considered a more cost-efficient solution compared to Hadoop.

3. What is MapReduce?

MapReduce is a software framework and programming model used for processing huge datasets.

It is split into two phases, Map and Reduce.

Map handles data splitting and data mapping.

Reduce handles shuffling and the reduction of data.
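
The same two-phase idea can be illustrated with a word count written against the Spark RDD API; the input path below is a placeholder for illustration.

// Word count: the classic MapReduce example, expressed with Spark RDDs
val counts = sc.textFile("input.txt")        // "input.txt" is a placeholder path
  .flatMap(line => line.split(" "))          // Map phase: split each line into words
  .map(word => (word, 1))                    // Map phase: emit (word, 1) pairs
  .reduceByKey(_ + _)                        // Reduce phase: sum the counts per word
counts.collect().foreach(println)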

4. Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Creating broadcast variables is only useful when tasks across multiple stages need the same data.

// Created on the driver via SparkContext.broadcast(v)
val broadcastVar = sc.broadcast(Array(1, 2, 3))
// Read the cached value with the value method
broadcastVar.value
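
As a sketch of how the broadcast value is typically consumed, the snippet below references broadcastVar.value inside a transformation instead of capturing the local array; the indices RDD is an assumption for illustration.

// Each task reads the array already cached on its executor
val indices = sc.parallelize(Seq(0, 1, 2))
val looked = indices.map(i => broadcastVar.value(i))
looked.collect()   // Array(1, 2, 3)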

5. Accumulators

Accumulators are variables that are only “added” to through an associative and
commutative operation and can therefore be efficiently supported in parallel.

They can be used to implement counters (as in MapReduce) or sums.

Only the driver program can read the accumulator’s value, using its value method.

// Created via SparkContext.longAccumulator() or SparkContext.doubleAccumulator()
val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
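
Following from the snippet above, the driver reads the accumulated total with the value method; the expected result assumes the four-element input shown.

accum.value   // 10 on the driver (1 + 2 + 3 + 4); tasks themselves cannot read it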

6. Spark Optimization

Apache Spark optimization helps with in-memory data computations.

The bottleneck for these computations can be CPU, memory, or any other resource in the cluster.

 Serialization
 API Selection
 Advanced Variable
 Cache and Persist
 ByKey Operation
 File Format selection
 Garbage Collection Tuning
 Level of Parallelism

Serialization
 Serialization plays an important role in the performance of any distributed application. By default, Spark uses the Java serializer.

 Spark can also use another serializer called the ‘Kryo’ serializer for better performance.

 The Kryo serializer uses a compact binary format and offers processing up to 10x faster than the Java serializer.

 To set the serializer property:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
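
A minimal sketch of a complete Kryo configuration; the application name and the MyClass case class are placeholders for illustration.

import org.apache.spark.SparkConf

// Placeholder class to register with Kryo
case class MyClass(id: Int, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")   // placeholder application name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyClass]))   // optional: registration keeps the serialized output compact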

API Selection

 Spark offers three API abstractions to work with – RDD, DataFrame, and Dataset.

 RDD is used for low-level operations with less optimization.

 DataFrame is the best choice in most cases due to its Catalyst optimizer and low garbage collection (GC) overhead.

 Dataset is highly type-safe and uses encoders. It uses Tungsten for serialization in binary format.
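
A brief sketch of the three abstractions over the same records; the Person case class and the SparkSession value spark are assumptions for illustration.

case class Person(name: String, age: Int)

// RDD: low-level, no Catalyst optimization
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

// DataFrame: untyped rows, planned by the Catalyst optimizer
val df = spark.createDataFrame(rdd)

// Dataset: typed view of the same data, backed by encoders and Tungsten
import spark.implicits._
val ds = df.as[Person]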

Advanced Variable
 Broadcasting plays an important role while tuning Spark jobs.

 A broadcast variable makes a small dataset available locally on every node.

 When one dataset is much smaller than the other, a broadcast join is highly recommended.

 To use a broadcast join: df1.join(broadcast(df2))
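
A minimal sketch of a broadcast join, assuming df1 is a large DataFrame and df2 a small one sharing an "id" column; the names are illustrative.

import org.apache.spark.sql.functions.broadcast

// Ship the small DataFrame to every executor instead of shuffling both sides
val joined = df1.join(broadcast(df2), Seq("id"))
joined.show()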

Cache and Persist


rdd.cache() will always store the data in memory.

With rdd.persist(), some of the data can be stored in memory and some on disk, depending on the chosen storage level.
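
A short sketch of the difference; the file paths and RDD names are placeholders, and cache() is equivalent to persist(StorageLevel.MEMORY_ONLY).

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("logs.txt")             // placeholder path
logs.cache()                                   // memory only

val events = sc.textFile("events.txt")         // placeholder path
events.persist(StorageLevel.MEMORY_AND_DISK)   // spills partitions to disk when memory is full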

ByKey Operation
 Shuffles are heavy operations which consume a lot of memory.
 While coding in Spark, the user should always try to avoid shuffle operations.
 High shuffling may give rise to an OutOfMemory error; to avoid such an error, the user can increase the level of parallelism.
 The user should prefer reduceByKey over groupByKey, because groupByKey creates a lot of shuffling which hampers performance (see the sketch after this list).
 Partition the data correctly.
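
A brief sketch contrasting the two operations; the pairs RDD is an assumption for illustration.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey combines values on the map side before the shuffle, so less data crosses the network
val reduced = pairs.reduceByKey(_ + _)

// groupByKey shuffles every value to the reducer before aggregating, which is far more expensive
val grouped = pairs.groupByKey().mapValues(_.sum)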

File Format selection


 Spark supports many formats, such as CSV, JSON, XML, Parquet, ORC, Avro, etc.
 Spark jobs can often be optimized by choosing the Parquet format with Snappy compression, which gives high performance for analytical queries.
 Parquet is a columnar format natively supported by Spark, and it carries its metadata in the file footer.
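
A minimal sketch of writing and reading Parquet with Snappy compression; the DataFrame df and the output path are assumptions for illustration.

// Write the DataFrame as Snappy-compressed Parquet (Snappy is also the default codec in recent Spark versions)
df.write
  .option("compression", "snappy")
  .parquet("/tmp/output_parquet")

// Read it back; the schema is recovered from the Parquet footer metadata
val parquetDF = spark.read.parquet("/tmp/output_parquet")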

Garbage Collection Tuning


 JVM garbage collection can be a problem when you have a large collection of unused objects.
 The first step in GC tuning is to collect statistics by adding the -verbose:gc JVM option when submitting Spark jobs.
 In an ideal situation, we try to keep GC overhead below 10% of heap memory.
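
A sketch of how GC statistics are typically enabled, by passing the standard JVM logging flags through the executor Java options; the exact flag set is an assumption and could equally be supplied via spark-submit --conf.

import org.apache.spark.SparkConf

// Enable GC logging on executors (Java 8-style flags)
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")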

Level of Parallelism
 Parallelism plays a very important role while tuning Spark jobs.
 Every partition (roughly, one task) requires a single core for processing.
 There are two ways to adjust the parallelism (see the sketch below):
 Repartition: gives an equal number of partitions, with heavy shuffling.
 Coalesce: generally reduces the number of partitions, with less shuffling.
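
A short sketch of the two operations; the input RDD and partition counts are illustrative.

val data = sc.parallelize(1 to 100, 8)    // start with 8 partitions

// repartition performs a full shuffle and can increase or decrease the partition count
val repartitioned = data.repartition(16)

// coalesce merges existing partitions and avoids a full shuffle when reducing the count
val coalesced = data.coalesce(4)

println(repartitioned.getNumPartitions)   // 16
println(coalesced.getNumPartitions)       // 4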
