Advanced Hadoop Tuning and Optimizations
Presented By:
Sanjay Sharma
Download the Whitepaper: Deriving Intelligence from Large Data
Using Hadoop and Applying Analytics at http://bit.ly/cNCCGj
Hadoop: The Good/Bad/Ugly
Hadoop is GOOD: that is why we are all here
Hadoop is not BAD: else we would not be here
Hadoop is sometimes UGLY. Why?
Out-of-the-box configuration is not friendly
Difficult to debug
Performance tuning/optimization is a black art
Configuration parameters
Compression
mapred.compress.map.output: Map Output Compression
Default: False
Pros: Faster disk writes, lower disk space usage, less time spent on data transfer (from mappers to reducers).
Cons: Overhead of compression at the mappers and decompression at the reducers.
Suggestions: For large clusters and large jobs, this property should be set to true. The compression codec can also be set through the property mapred.map.output.compression.codec (default: org.apache.hadoop.io.compress.DefaultCodec).
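To get a feel for the pros and cons above, the sketch below uses java.util.zip.Deflater as a stand-in: DefaultCodec is zlib-based, but this is not the Hadoop codec API, and the word-count style records are hypothetical sample data. The byte savings come at the price of CPU time spent in deflate() on the mapper side (and inflate on the reducer side).

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical demo class: zlib via java.util.zip as a stand-in for a zlib-based map-output codec.
final class MapOutputCompressionDemo {

    /** Number of bytes the input compresses to with zlib at the default level. */
    static int compressedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer); // count output bytes; buffer contents are discarded
        }
        deflater.end(); // release native zlib resources
        return total;
    }

    /** Hypothetical word-count style map output: "word\t1" records, highly repetitive. */
    static byte[] sampleMapOutput(int records) {
        String[] words = {"hadoop", "tuning", "mapper", "reducer"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append(words[i % words.length]).append("\t1\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(100_000);
        System.out.printf("raw=%d bytes, compressed=%d bytes%n", raw.length, compressedSize(raw));
    }
}
```

Real map output is usually less repetitive than this sample, so it is worth measuring the ratio on your own data before relying on it.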
Speculative Execution
mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution:
Enable/disable speculative execution of map/reduce tasks
Default: True
Pros: Reduces job time if task progress is slow due to memory unavailability or hardware degradation.
Cons: Increases job time if task progress is slow due to complex and large calculations. On a busy cluster, speculative execution can reduce overall throughput, since redundant tasks are executed in an attempt to bring down the execution time of a single job.
Suggestions: In large jobs where the average task completion time is significant (> 1 hr) due to complex and large calculations, and high throughput is required, speculative execution should be set to false.
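The rule of thumb above can be written down as a small decision helper. This is a hypothetical sketch: the 60-minute threshold is the slide's figure, and java.util.Properties stands in for the job configuration; only the two property names are real Hadoop keys.

```java
import java.util.Properties;

// Hypothetical helper encoding the slide's rule of thumb; not part of Hadoop.
final class SpeculativeExecutionAdvisor {
    static final double LONG_TASK_MINUTES = 60.0; // "> 1 hr", from the slide

    /**
     * Keep speculation on unless tasks are long because of heavy computation (not slow nodes)
     * and cluster throughput matters more than a single job's latency.
     */
    static boolean keepSpeculation(double avgTaskMinutes, boolean computeBound, boolean throughputCritical) {
        boolean longComputeBoundTasks = avgTaskMinutes > LONG_TASK_MINUTES && computeBound;
        return !(longComputeBoundTasks && throughputCritical);
    }

    /** Both Hadoop keys, written into a java.util.Properties stand-in for the job configuration. */
    static Properties toProperties(boolean enabled) {
        Properties conf = new Properties();
        conf.setProperty("mapred.map.tasks.speculative.execution", Boolean.toString(enabled));
        conf.setProperty("mapred.reduce.tasks.speculative.execution", Boolean.toString(enabled));
        return conf;
    }

    public static void main(String[] args) {
        // Hypothetical job: 90-minute compute-heavy tasks on a busy, throughput-oriented cluster.
        toProperties(keepSpeculation(90, true, true)).list(System.out);
    }
}
```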
Number of Maps/Reducers
mapred.tasktracker.map.tasks.maximum / mapred.tasktracker.reduce.tasks.maximum:
Maximum number of map/reduce tasks a TaskTracker runs simultaneously
Default: 2
Suggestions:
Recommended range: (cores_per_node)/2 to 2x(cores_per_node), especially for large clusters.
This value should be set according to the hardware specification of the cluster nodes and the resource requirements of the (map/reduce) tasks.
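A sketch of the slot range above for a given node. The memory cap is an added assumption, not a Hadoop formula: it treats the range as covering all task slots on the node and limits it to the number of child heaps that fit in the RAM left after the daemons. The node figures in main are hypothetical.

```java
// Hypothetical helper: the slide's cores-based range plus an assumed memory cap.
final class SlotRange {
    /** Lower end of the suggested range: cores/2, but never below one slot. */
    static int minSlots(int coresPerNode) {
        return Math.max(1, coresPerNode / 2);
    }

    /** Upper end of the suggested range: 2 x cores. */
    static int maxSlots(int coresPerNode) {
        return 2 * coresPerNode;
    }

    /**
     * Slots to configure: the cores-based upper bound, capped by how many child heaps fit
     * in the RAM left after reservedMb (e.g. for the TaskTracker and DataNode daemons).
     * A result below minSlots means the node is memory-bound rather than CPU-bound.
     */
    static int totalSlots(int coresPerNode, int nodeRamMb, int reservedMb, int childHeapMb) {
        int memoryBound = (nodeRamMb - reservedMb) / childHeapMb;
        return Math.max(0, Math.min(maxSlots(coresPerNode), memoryBound));
    }

    public static void main(String[] args) {
        // Hypothetical node: 8 cores, 16 GB RAM, 2 GB reserved for daemons, 1 GB heap per task.
        System.out.println(minSlots(8) + ".." + maxSlots(8)
                + " slots; memory allows " + totalSlots(8, 16384, 2048, 1024));
    }
}
```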
File block size
dfs.block.size: File system block size
Default: 67108864 bytes (64 MB)
Suggestions:
Small cluster and large data set: the default block size will create a large number of map tasks.
Example: input data size = 160 GB
dfs.block.size = 64 MB: minimum number of maps = (160*1024)/64 = 2560
dfs.block.size = 128 MB: minimum number of maps = (160*1024)/128 = 1280
dfs.block.size = 256 MB: minimum number of maps = (160*1024)/256 = 640
In a small cluster (6-10 nodes), the map task creation overhead is considerable. So dfs.block.size should be large in this case, but still small enough to utilize all the cluster resources.
The block size should be set according to the size of the cluster, map task complexity, map task capacity of the cluster and the average size of input files.
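The map-count arithmetic above as code, plus the number of "waves" a cluster needs to run those maps. It assumes large, splittable input files (each small file adds at least one map of its own), and the 80-slot cluster in the usage is hypothetical.

```java
// Hypothetical helper reproducing the slide's block-size arithmetic.
final class MapCount {
    /** Minimum number of maps: one per block, i.e. ceil(inputMb / blockMb). */
    static long minMaps(long inputMb, long blockMb) {
        return (inputMb + blockMb - 1) / blockMb;
    }

    /** Rounds of map execution when the cluster runs totalMapSlots maps at once. */
    static long waves(long maps, long totalMapSlots) {
        return (maps + totalMapSlots - 1) / totalMapSlots;
    }

    public static void main(String[] args) {
        long inputMb = 160L * 1024; // 160 GB, from the example above
        long clusterSlots = 80;     // hypothetical: 10 nodes x 8 map slots
        for (long blockMb : new long[] {64, 128, 256}) {
            long maps = minMaps(inputMb, blockMb);
            System.out.println(blockMb + " MB blocks -> " + maps + " maps, "
                    + waves(maps, clusterSlots) + " waves");
        }
    }
}
```

On that hypothetical 80-slot cluster, 64 MB blocks mean 32 waves of short maps, while 256 MB blocks mean 8 waves, each paying the task start-up cost far fewer times.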
Sort size
io.sort.mb: Buffer size (MB) for sorting map output
Default: 100
Suggestions:
For large jobs (jobs with very large map output), this value should be increased, keeping in mind that it increases the memory required by each map task. So the increase should be in line with the memory available on the node.
The greater the value of io.sort.mb, the fewer the spills to disk, which saves disk writes.
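A rough estimate of how io.sort.mb drives the spill count. It assumes a spill starts when the buffer is io.sort.spill.percent full (0.80 by default in 0.20) and ignores the record-accounting share (io.sort.record.percent), combiners and compression, so treat the result as an order of magnitude, not a prediction.

```java
// Hypothetical, simplified spill model; Hadoop's real buffer accounting is more detailed.
final class SpillEstimate {
    static final double DEFAULT_SPILL_PERCENT = 0.80; // io.sort.spill.percent default

    /** Approximate spill files written by one map task producing mapOutputMb of output. */
    static long estimatedSpills(double mapOutputMb, int ioSortMb, double spillPercent) {
        double perSpillMb = ioSortMb * spillPercent; // buffer fill level that triggers a spill
        return Math.max(1L, (long) Math.ceil(mapOutputMb / perSpillMb));
    }

    public static void main(String[] args) {
        for (int sortMb : new int[] {100, 200}) {
            System.out.println("io.sort.mb=" + sortMb + " -> ~"
                    + estimatedSpills(1000, sortMb, DEFAULT_SPILL_PERCENT) + " spills");
        }
    }
}
```

For 1000 MB of output from one map, the default 100 MB buffer gives about 13 spills; doubling io.sort.mb to 200 brings it to about 7, each of which later has to be merged.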
Sort factor
io.sort.factor: Stream merge factor (number of streams merged at once)
Default: 10
Suggestions:
For large jobs (jobs with very large map output and a large number of maps) which have a large number of spills to disk, this value should be increased.
The number of input streams (files) to be merged at once in the map/reduce tasks, as specified by io.sort.factor, should be set to a sufficiently large value (for example, 100) to minimize disk accesses.
Increasing io.sort.factor also benefits merging at the reducers, since the last batch of streams (up to io.sort.factor of them) is sent to the reduce function without merging, saving merge time.
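A simplified model of why a larger io.sort.factor means fewer merge passes: each round merges up to io.sort.factor segments into one, until at most io.sort.factor remain for the final streamed merge. Hadoop's actual merger sizes its first pass differently and also merges in memory, so this only approximates the number of passes over the data.

```java
// Hypothetical helper; a simplified k-way merge round count, not Hadoop's Merger logic.
final class MergePasses {
    /** Intermediate rounds before at most `factor` segments remain for the final merge. */
    static int intermediateRounds(long segments, int factor) {
        if (factor < 2) throw new IllegalArgumentException("io.sort.factor must be >= 2");
        int rounds = 0;
        while (segments > factor) {
            segments = (segments + factor - 1) / factor; // each group of `factor` becomes one
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        long mapOutputs = 2560; // one segment per map, as in the 64 MB block example
        for (int factor : new int[] {10, 100}) {
            System.out.println("io.sort.factor=" + factor + " -> "
                    + intermediateRounds(mapOutputs, factor) + " intermediate rounds");
        }
    }
}
```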
JVM reuse
mapred.job.reuse.jvm.num.tasks: Number of tasks run per JVM (reuse a single JVM)
Default: 1
Suggestions: The minimum overhead of JVM creation for each task is around 1 second. So for tasks which run for seconds or a few minutes and have lengthy initialization, this value can be increased to gain performance.
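The JVM start-up cost above, per task slot, as simple arithmetic. The 1-second figure is the slide's estimate; a value of -1 for mapred.job.reuse.jvm.num.tasks means no limit, so one JVM serves all of a job's tasks in that slot. The class and method names are hypothetical.

```java
// Hypothetical helper: JVM launches and start-up overhead for one task slot of one job.
final class JvmReuse {
    static final double JVM_STARTUP_SECONDS = 1.0; // "around 1 second", from the slide

    /** JVMs launched for tasksPerSlot tasks; reuse == -1 means no limit. */
    static long jvmLaunches(long tasksPerSlot, int reuse) {
        if (tasksPerSlot <= 0) return 0;
        if (reuse == -1) return 1;
        if (reuse < 1) throw new IllegalArgumentException("reuse must be -1 or >= 1");
        return (tasksPerSlot + reuse - 1) / reuse; // ceiling division
    }

    static double startupSeconds(long tasksPerSlot, int reuse) {
        return jvmLaunches(tasksPerSlot, reuse) * JVM_STARTUP_SECONDS;
    }
}
```

For 100 short tasks in one slot, the default of 1 costs about 100 s of JVM start-up, 10 costs about 10 s, and -1 about 1 s.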
Reduce parallel copies
mapred.reduce.parallel.copies: Threads for parallel copy at reducer
Default: 5
Description: The number of threads used to copy map outputs to the
reducer.
Suggestions: For large jobs (jobs with very large map output), this value can be increased, keeping in mind that it will increase total CPU usage.
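A toy Java model of what the copy threads do: a fixed pool of parallelCopies threads fetches map outputs, so at most that many copies are in flight at once. The fetch is simulated with Thread.sleep, and nothing here uses Hadoop's actual shuffle classes; the class name is hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical toy model of the reduce-side copy phase; fetches are simulated.
final class ParallelCopyModel {

    /** Returns {segmentsFetched, peakConcurrentFetches}. */
    static int[] fetchAll(int mapOutputs, int parallelCopies, long fetchMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelCopies);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger fetched = new AtomicInteger();
        for (int i = 0; i < mapOutputs; i++) {
            pool.execute(() -> {
                peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max); // track peak
                try {
                    Thread.sleep(fetchMillis); // stand-in for copying one map output over HTTP
                    fetched.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inFlight.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {fetched.get(), peak.get()};
    }

    public static void main(String[] args) {
        for (int copies : new int[] {5, 20}) {
            long start = System.nanoTime();
            int[] r = fetchAll(200, copies, 10);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(copies + " threads: fetched " + r[0] + ", peak " + r[1] + ", " + ms + " ms");
        }
    }
}
```

Raising the pool size cuts wall-clock time for the same number of segments while more threads compete for CPU, which mirrors the trade-off noted above.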
The Other Threads
dfs.namenode.handler.count / mapred.job.tracker.handler.count: server threads that handle remote procedure calls (RPCs)
Default: 10
Suggestions: This can be increased for larger servers (50-64).
dfs.datanode.handler.count: server threads that handle remote procedure calls (RPCs)
Default: 3
Suggestions: This can be increased for a larger number of HDFS clients (6-8).
tasktracker.http.threads: number of worker threads on the HTTP server on each TaskTracker
Default: 40
Suggestions: This can be increased for larger clusters (50).
Other hotspots
Revelation: Temporary space
Temporary space allocation:
Jobs which generate large intermediate data (map output) need enough temporary space, controlled by the property mapred.local.dir. This property specifies the list of directories where MapReduce stores intermediate data for jobs. The data is cleaned up after the job completes.
By default, the replication factor for file storage on HDFS is 3, which means that every file has three replicas. As a rule of thumb, at least 25% of the total hard disk should be allocated for intermediate temporary output. So effectively only ¼ of the hard disk space is available for business use: the remaining 75% is divided among the 3 replicas.
The default value of mapred.local.dir is ${hadoop.tmp.dir}/mapred/local, so if mapred.local.dir is not set, hadoop.tmp.dir must have enough space to hold the job's intermediate data. If the node doesn't have enough temporary space, the task attempt fails and a new attempt is started, which hurts performance.
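The ¼ figure above follows from simple arithmetic: reserve 25% of raw disk for intermediate output, then divide what is left by the replication factor of 3, since (1 - 0.25) / 3 = 0.25. The helper name and the 100 TB raw capacity in main are hypothetical.

```java
// Hypothetical back-of-the-envelope disk budget.
final class DiskBudget {
    /** Fraction of raw disk left for business data after temp reservation and replication. */
    static double usableFraction(double tempReservedFraction, int replication) {
        if (replication < 1) throw new IllegalArgumentException("replication must be >= 1");
        return (1.0 - tempReservedFraction) / replication;
    }

    /** Usable capacity, in the same unit as rawCapacity. */
    static double usableCapacity(double rawCapacity, double tempReservedFraction, int replication) {
        return rawCapacity * usableFraction(tempReservedFraction, replication);
    }

    public static void main(String[] args) {
        System.out.println(usableCapacity(100.0, 0.25, 3) + " TB usable out of 100 TB raw");
    }
}
```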
Java- JVM
JVM tuning:
Besides normal Java code optimizations, the JVM settings for each child task also affect processing time. On a slave node, the TaskTracker and DataNode use 1 GB of RAM each. Making effective use of the remaining RAM, and choosing the right GC mechanism for each map or reduce task, is very important for maximum utilization of hardware resources. The default maximum heap for child tasks is 200 MB (-Xmx200m), which might be insufficient for many production-grade jobs. The JVM settings for child tasks are governed by the mapred.child.java.opts property.
Use 64-bit JDK 1.6
-XX:+UseCompressedOops is helpful in dealing with OOM errors
Remember to raise the Linux open file descriptor limit
Set java.net.preferIPv4Stack to true to avoid timeouts in cases where the OS/JVM picks up an IPv6 address and must resolve the hostname.
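A sketch of turning the memory guidance above into a child heap size and an example mapred.child.java.opts value. It assumes the RAM left after the TaskTracker and DataNode (1 GB each) is split evenly across all map and reduce slots, with no headroom for the OS or non-heap JVM memory, so the result is an upper bound; the helper and the node size in main are hypothetical.

```java
// Hypothetical helper; the even split and zero OS headroom are simplifying assumptions.
final class ChildJvmOpts {
    static final int DAEMON_MB = 1024; // TaskTracker and DataNode use about 1 GB each

    /** Upper bound on heap (MB) per child task. */
    static int heapPerChildMb(int nodeRamMb, int mapSlots, int reduceSlots) {
        int slots = mapSlots + reduceSlots;
        if (slots <= 0) throw new IllegalArgumentException("need at least one slot");
        return (nodeRamMb - 2 * DAEMON_MB) / slots;
    }

    /** Example mapred.child.java.opts value combining the JVM options listed above. */
    static String childJavaOpts(int heapMb) {
        return "-Xmx" + heapMb + "m -XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true";
    }

    public static void main(String[] args) {
        // Hypothetical node: 16 GB RAM, 8 map slots + 4 reduce slots.
        System.out.println(childJavaOpts(heapPerChildMb(16384, 8, 4)));
    }
}
```

In practice you would round the heap down further to leave room for the OS page cache and the JVM's own non-heap memory.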
Logging
Logging is a friend to developers, a foe in production
Default: INFO level
dfs.namenode.logging.level
hadoop.job.history
hadoop.logfile.size/count
Static Data strategies
Available Approaches
JobConf.set("key", "value")
Distributed cache
HDFS shared file
Suggested approaches if the above are not efficient
Memcached
Tokyo Cabinet/Tokyo Tyrant
Berkeley DB
HBase
Debugging and profiling (Arun C Murthy)
From Arun C Murthy's presentation "Hadoop Map-Reduce – Tuning and Debugging"
Debugging
Log files/UI view
Local runner
Single machine mode
Set keep.failed.task.files to true and use the IsolationRunner
Profiling
Set mapred.task.profile to true
Use mapred.task.profile.{maps|reduces}
hprof support is built-in
Use mapred.task.profile.params to set options for the profiler
Possibly DistributedCache for the profiler’s agent
Tuning (Arun C Murthy)
From Arun C Murthy's presentation "Hadoop Map-Reduce – Tuning and Debugging"
Tuning
Tell HDFS and Map-Reduce about your network! – Rack locality script: topology.script.file.name
Number of maps – Data locality
Number of reduces – You don’t need a single output file!
Amount of data processed per map – Consider fatter maps, custom input format
Combiner – Multi-level combiners at both map and reduce
Check to ensure the combiner is useful!
Map-side sort – io.sort.mb, io.sort.factor, io.sort.record.percent, io.sort.spill.percent
Shuffle
Compression for map outputs – mapred.compress.map.output, mapred.map.output.compression.codec, LZO via libhadoop.so
tasktracker.http.threads, mapred.reduce.parallel.copies, mapred.reduce.copy.backoff,
mapred.job.shuffle.input.buffer.percent, mapred.job.shuffle.merge.percent,
mapred.inmem.merge.threshold, mapred.job.reduce.input.buffer.percent
Compress the job output
Miscellaneous – Speculative execution, heap size for the child, reuse JVM for maps/reduces, raw comparators
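One way to "check to ensure the combiner is useful": compare the number of map output records with the number left after local aggregation. The word-count style sum below is a hypothetical stand-in for a real combiner; on an actual job the same ratio can be read from the job's combine input/output record counters.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a ratio near 1.0 means the combiner adds CPU cost without shrinking the shuffle.
final class CombinerCheck {
    /** Word-count style local aggregation: sums a count of 1 per occurrence of each key. */
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> sums = new HashMap<>();
        for (String key : mapOutputKeys) {
            sums.merge(key, 1, Integer::sum);
        }
        return sums;
    }

    /** Combiner output records divided by combiner input records. */
    static double reductionRatio(List<String> mapOutputKeys) {
        if (mapOutputKeys.isEmpty()) return 1.0;
        return (double) combine(mapOutputKeys).size() / mapOutputKeys.size();
    }

    public static void main(String[] args) {
        List<String> repetitive = Arrays.asList("a", "a", "a", "b");
        List<String> unique = Arrays.asList("x", "y", "z");
        System.out.println("repetitive keys: " + reductionRatio(repetitive)); // combiner helps
        System.out.println("unique keys: " + reductionRatio(unique));         // combiner is wasted work
    }
}
```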
Next steps
Hadoop Vaidya (since 0.20.0)
Job configuration analyzer (WIP, to be contributed back to Hadoop)
Part of Analyze Job web UI
Analyze and suggest config parameters from job.xml
Smart suggestion engine/auto-correction
Conclusion
Performance of Hadoop MapReduce jobs can be
improved without increasing the hardware costs,
by tuning several key configuration parameters
for cluster specifications, input data size and
processing complexity.
References
Hadoop.apache.org
Hadoop-performance tuning--white paper v1
1.pdf – Arun C Murthy
Intel_White_Paper_Optimizing_Hadoop_Deployments.pdf