Spark Interview Questions

1. What is Apache Spark?

Spark is a fast, easy-to-use, and flexible data processing framework.

It is an open-source analytics engine.

It provides an advanced execution engine that supports acyclic data flow and in-memory computing.

In-memory caching and optimized query execution allow fast analytics on data of any size.

2. Explain the key features of Spark.

Apache Spark integrates with Hadoop.

It has an interactive language shell for Scala (the language in which Spark is written).

Spark is built around RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes in a cluster.

Apache Spark supports multiple analytic tools for interactive query analysis, real-time analysis, and graph processing.

Apache Spark supports real-time stream processing.

Spark achieves a very high data processing speed by reducing read and write operations to disk.

Spark is considered a more cost-efficient solution compared to Hadoop.

3. What is MapReduce?

MapReduce is a software framework and programming model used for processing huge datasets.

It is split into two phases, Map and Reduce.

Map handles data splitting and data mapping.

Reduce handles shuffling and the reduction of data.
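
The same two-phase idea can be illustrated with a word count written against the Spark RDD API; the input path below is a placeholder for illustration.

// Word count: the classic MapReduce example, expressed with Spark RDDs
val counts = sc.textFile("input.txt")        // "input.txt" is a placeholder path
  .flatMap(line => line.split(" "))          // Map phase: split each line into words
  .map(word => (word, 1))                    // Map phase: emit (word, 1) pairs
  .reduceByKey(_ + _)                        // Reduce phase: sum the counts per word
counts.collect().foreach(println)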

4. Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.

Creating broadcast variables is only useful when tasks across multiple stages need the same data.

// Created on the driver via SparkContext.broadcast(v)
val broadcastVar = sc.broadcast(Array(1, 2, 3))
// Read the cached value with the value method
broadcastVar.value
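
As a sketch of how the broadcast value is typically consumed, the snippet below references broadcastVar.value inside a transformation instead of capturing the local array; the indices RDD is an assumption for illustration.

// Each task reads the array already cached on its executor
val indices = sc.parallelize(Seq(0, 1, 2))
val looked = indices.map(i => broadcastVar.value(i))
looked.collect()   // Array(1, 2, 3)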

5. Accumulators

Accumulators are variables that are only “added” to through an associative and
commutative operation and can therefore be efficiently supported in parallel.

They can be used to implement counters (as in MapReduce) or sums.

Only the driver program can read the accumulator’s value, using its value method.

// Created via SparkContext.longAccumulator() or SparkContext.doubleAccumulator()
val accum = sc.longAccumulator("My Accumulator")
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
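
Following from the snippet above, the driver reads the accumulated total with the value method; the expected result assumes the four-element input shown.

accum.value   // 10 on the driver (1 + 2 + 3 + 4); tasks themselves cannot read it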

6. Spark Optimization

Apache Spark optimization helps with in-memory data computations.

The bottleneck for these computations can be CPU, memory, or any other resource in the cluster.

 Serialization
 API Selection
 Advanced Variable
 Cache and Persist
 ByKey Operation
 File Format selection
 Garbage Collection Tuning
 Level of Parallelism

Serialization
 Serialization plays an important role in the performance of any distributed application. By default, Spark uses the Java serializer.

 Spark can also use another serializer called the ‘Kryo’ serializer for better performance.

 The Kryo serializer uses a compact binary format and offers processing up to 10x faster than the Java serializer.

 To set the serializer property:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
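
A minimal sketch of a complete Kryo configuration; the application name and the MyClass case class are placeholders for illustration.

import org.apache.spark.SparkConf

// Placeholder class to register with Kryo
case class MyClass(id: Int, name: String)

val conf = new SparkConf()
  .setAppName("kryo-example")   // placeholder application name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyClass]))   // optional: registration keeps the serialized output compact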

API Selection

 Spark offers three API abstractions to work with – RDD, DataFrame, and Dataset.

 RDD is used for low-level operations with less optimization.

 DataFrame is the best choice in most cases due to its Catalyst optimizer and low garbage collection (GC) overhead.

 Dataset is highly type-safe and uses encoders. It uses Tungsten for serialization in binary format.
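
A brief sketch of the three abstractions over the same records; the Person case class and the SparkSession value spark are assumptions for illustration.

case class Person(name: String, age: Int)

// RDD: low-level, no Catalyst optimization
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

// DataFrame: untyped rows, planned by the Catalyst optimizer
val df = spark.createDataFrame(rdd)

// Dataset: typed view of the same data, backed by encoders and Tungsten
import spark.implicits._
val ds = df.as[Person]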

Advanced Variable
 Broadcasting plays an important role while tuning Spark jobs.

 A broadcast variable makes a small dataset available locally on every node.

 When one dataset is much smaller than the other, a broadcast join is highly recommended.

 To use a broadcast join: df1.join(broadcast(df2))
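
A minimal sketch of a broadcast join, assuming df1 is a large DataFrame and df2 a small one sharing an "id" column; the names are illustrative.

import org.apache.spark.sql.functions.broadcast

// Ship the small DataFrame to every executor instead of shuffling both sides
val joined = df1.join(broadcast(df2), Seq("id"))
joined.show()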

Cache and Persist


rdd.cache() will always store the data in memory.

With rdd.persist(), some of the data can be stored in memory and some on disk, depending on the chosen storage level.
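
A short sketch of the difference; the file paths and RDD names are placeholders, and cache() is equivalent to persist(StorageLevel.MEMORY_ONLY).

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("logs.txt")             // placeholder path
logs.cache()                                   // memory only

val events = sc.textFile("events.txt")         // placeholder path
events.persist(StorageLevel.MEMORY_AND_DISK)   // spills partitions to disk when memory is full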

ByKey Operation
 Shuffles are heavy operations which consume a lot of memory.
 While coding in Spark, the user should always try to avoid shuffle operations.
 High shuffling may give rise to an OutOfMemory error; to avoid such an error, the user can increase the level of parallelism.
 The user should prefer reduceByKey over groupByKey, because groupByKey creates a lot of shuffling which hampers performance (see the sketch after this list).
 Partition the data correctly.
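
A brief sketch contrasting the two operations; the pairs RDD is an assumption for illustration.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey combines values on the map side before the shuffle, so less data crosses the network
val reduced = pairs.reduceByKey(_ + _)

// groupByKey shuffles every value to the reducer before aggregating, which is far more expensive
val grouped = pairs.groupByKey().mapValues(_.sum)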

File Format selection


 Spark supports many formats, such as CSV, JSON, XML, Parquet, ORC, Avro, etc.
 Spark jobs can often be optimized by choosing the Parquet format with Snappy compression, which gives high performance for analytical queries.
 Parquet is a columnar format natively supported by Spark, and it carries its metadata in the file footer.
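
A minimal sketch of writing and reading Parquet with Snappy compression; the DataFrame df and the output path are assumptions for illustration.

// Write the DataFrame as Snappy-compressed Parquet (Snappy is also the default codec in recent Spark versions)
df.write
  .option("compression", "snappy")
  .parquet("/tmp/output_parquet")

// Read it back; the schema is recovered from the Parquet footer metadata
val parquetDF = spark.read.parquet("/tmp/output_parquet")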

Garbage Collection Tuning


 JVM garbage collection can be a problem when you have a large collection of unused objects.
 The first step in GC tuning is to collect statistics by adding the -verbose:gc JVM option when submitting Spark jobs.
 In an ideal situation, we try to keep GC overhead below 10% of heap memory.
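
A sketch of how GC statistics are typically enabled, by passing the standard JVM logging flags through the executor Java options; the exact flag set is an assumption and could equally be supplied via spark-submit --conf.

import org.apache.spark.SparkConf

// Enable GC logging on executors (Java 8-style flags)
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")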

Level of Parallelism
 Parallelism plays a very important role while tuning Spark jobs.
 Every partition (roughly, one task) requires a single core for processing.
 There are two ways to adjust the parallelism (see the sketch below):
 Repartition: gives an equal number of partitions, with heavy shuffling.
 Coalesce: generally reduces the number of partitions, with less shuffling.
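
A short sketch of the two operations; the input RDD and partition counts are illustrative.

val data = sc.parallelize(1 to 100, 8)    // start with 8 partitions

// repartition performs a full shuffle and can increase or decrease the partition count
val repartitioned = data.repartition(16)

// coalesce merges existing partitions and avoids a full shuffle when reducing the count
val coalesced = data.coalesce(4)

println(repartitioned.getNumPartitions)   // 16
println(coalesced.getNumPartitions)       // 4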
