Hadoop Tuning for Data Engineers

Impetus provides expert consulting services around Hadoop implementations, including R&D, assessment, deployment (on private and public clouds), and optimizations for enhanced static shared data implementations. This presentation covers advanced Hadoop tuning and optimization.

Uploaded by

Impetus


Advanced Hadoop Tuning and Optimizations

Presented By:
Sanjay Sharma

Download the Whitepaper: Deriving Intelligence from Large Data

Using Hadoop and Applying Analytics at http://bit.ly/cNCCGj
Hadoop- The Good/Bad/Ugly

 Hadoop is GOOD- that is why we all are here

 Hadoop is not BAD- else we would not be here

 Hadoop is sometimes Ugly- why?

 Out-of-the-box configuration is not friendly
 Difficult to debug
 Performance: tuning/optimization is a black art

Configuration parameters
Compression
mapred.compress.map.output: Map Output Compression
 Default: False
 Pros: Faster disk writes, lower disk space usage, and less time spent on data transfer (from mappers to reducers).
 Cons: CPU overhead for compression at the mappers and decompression at the reducers.
 Suggestions: For large clusters and large jobs, set this property to true. The compression codec can be set through the property mapred.map.output.compression.codec (default is org.apache.hadoop.io.compress.DefaultCodec).
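As a rough illustration of the settings above, the following Python sketch renders the two properties in the <configuration>/<property> layout used by Hadoop site files. The helper name render_site_xml is hypothetical, and which file the output belongs in (for example mapred-site.xml) depends on the Hadoop version in use, so treat this as an assumption rather than a recipe.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> XML string.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Property names and the default codec are taken from the slide above.
map_output_compression = {
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.DefaultCodec",
}

xml_text = render_site_xml(map_output_compression)
print(xml_text)
```

The same dictionary could instead be applied per job, which keeps the cluster-wide default untouched while large jobs opt in.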

Speculative Execution
mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution: Enable/disable speculative execution of map/reduce tasks
 Default: True
 Pros: Reduces the job time when task progress is slow because of memory unavailability or hardware degradation.
 Cons: Increases the job time when task progress is slow because of complex and large calculations. On a busy cluster, speculative execution can reduce overall throughput, since redundant tasks are executed in an attempt to bring down the execution time of a single job.
 Suggestions: For large jobs where the average task completion time is significant (> 1 hr) because of complex and large calculations, and high throughput is required, set speculative execution to false.

Number of Maps/Reducers
 mapred.tasktracker.map.tasks.maximum / mapred.tasktracker.reduce.tasks.maximum: Maximum number of map/reduce tasks run simultaneously by a TaskTracker
 Default: 2
 Suggestions:
 Recommended range: (cores_per_node)/2 to 2 x (cores_per_node), especially for large clusters.
 Set this value according to the hardware specification of the cluster nodes and the resource requirements of the tasks (map/reduce).
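The recommended range above is simple arithmetic; a minimal Python sketch follows. The function name slot_range is hypothetical, and the lower bound is clamped to 1 as an added assumption so that a single-core node still gets a slot.

```python
def slot_range(cores_per_node):
    """Return (low, high) task slots per TaskTracker for a given core count.

    Implements the slide's rule of thumb: cores/2 up to 2 x cores.
    The clamp to at least 1 is an assumption added for illustration.
    """
    low = max(1, cores_per_node // 2)
    high = 2 * cores_per_node
    return low, high


for cores in (4, 8, 16):
    low, high = slot_range(cores)
    print(f"{cores} cores: {low} to {high} slots")
```

Where in that range to land still depends on how much memory and I/O each task needs, as the slide notes.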

File block size
dfs.block.size: File system block size
 Default: 67108864 (bytes)
 Suggestions:
 Small cluster and large data set: the default block size will create a large number of map tasks.
 e.g. if the input data size = 160 GB and dfs.block.size = 64 MB, the minimum number of maps = (160*1024)/64 = 2560.
 If dfs.block.size = 128 MB, the minimum number of maps = (160*1024)/128 = 1280.
 If dfs.block.size = 256 MB, the minimum number of maps = (160*1024)/256 = 640.

 In a small cluster (6-10 nodes) the map task creation overhead is considerable, so dfs.block.size should be large in this case, but small enough to still utilize all the cluster resources.
 The block size should be set according to the size of the cluster, map task complexity, map task capacity of the cluster and average size of the input files.
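The map-count arithmetic above can be reproduced with a short Python sketch. It assumes one map task per block and ignores small files and custom input formats; the function name min_map_tasks is hypothetical.

```python
import math


def min_map_tasks(input_gb, block_mb):
    """Minimum number of map tasks, assuming one map task per HDFS block."""
    return math.ceil(input_gb * 1024 / block_mb)


# Same inputs as the slide: 160 GB of data at three block sizes.
for block_mb in (64, 128, 256):
    print(f"dfs.block.size = {block_mb} MB -> {min_map_tasks(160, block_mb)} maps")
```

Comparing these counts with the cluster's total map slots gives a quick feel for how many waves of map tasks a job will need.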
Sort size
io.sort.mb: Buffer size (MBs) for sorting
 Default: 100
 Suggestions:
 For large jobs (jobs in which the map output is very large), increase this value, keeping in mind that it also increases the memory required by each map task. Size the increase according to the memory available on the node.
 The greater the value of io.sort.mb, the fewer the spills to disk, which saves disk writes.
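To see why a larger io.sort.mb means fewer spills, here is a deliberately simplified Python estimate. It assumes a spill is triggered when the buffer reaches io.sort.spill.percent (default 0.80) and ignores the share of the buffer reserved for record metadata (io.sort.record.percent), so real jobs may spill more often. The function name estimate_spills is hypothetical.

```python
import math


def estimate_spills(map_output_mb, io_sort_mb=100, spill_percent=0.80):
    """Rough number of spills for one map task.

    Simplified model: one spill each time the sort buffer fills to
    io_sort_mb * spill_percent. Record metadata overhead is ignored.
    """
    threshold_mb = io_sort_mb * spill_percent
    return math.ceil(map_output_mb / threshold_mb)


# A map task producing 400 MB of output, with two buffer sizes.
print(estimate_spills(400, io_sort_mb=100))
print(estimate_spills(400, io_sort_mb=200))
```

The buffer lives inside the child task's heap, so any increase here has to fit within the heap set through mapred.child.java.opts.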

Sort factor
io.sort.factor: Stream merge factor
 Default: 10
 Suggestions:
 For large jobs (jobs in which the map output is very large and the number of maps is also large) that have a large number of spills to disk, increase the value of this property.
 The number of input streams (files) merged at once in the map/reduce tasks, as specified by io.sort.factor, should be set to a sufficiently large value (for example, 100) to minimize disk accesses.
 Increasing io.sort.factor also benefits merging at the reducers, since the last batch of streams (equal in number to io.sort.factor) is sent to the reduce function without a separate merge, thus saving merge time.
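The effect of io.sort.factor on merge work can be sketched with a simple model in Python: each pass merges up to io.sort.factor streams into one, and passes repeat until a single stream remains. Hadoop's actual merge scheduling differs in its details, so this only illustrates the trend; the function name merge_passes is hypothetical.

```python
import math


def merge_passes(num_streams, io_sort_factor=10):
    """Number of merge passes needed to reduce num_streams to one stream.

    Simplified model: every pass merges groups of io_sort_factor streams.
    """
    if io_sort_factor < 2:
        raise ValueError("io_sort_factor must be at least 2")
    passes = 0
    while num_streams > 1:
        num_streams = math.ceil(num_streams / io_sort_factor)
        passes += 1
    return passes


# 100 spill files: default factor versus the value suggested on the slide.
print(merge_passes(100, io_sort_factor=10))
print(merge_passes(100, io_sort_factor=100))
```

Each extra pass rereads and rewrites the intermediate data, which is where the disk accesses mentioned above come from.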

JVM reuse
mapred.job.reuse.jvm.num.tasks: Reuse single JVM
 Default: 1
 Suggestions: The minimum overhead of JVM creation for each task is around 1 second. For tasks that live for only seconds or a few minutes and have lengthy initialization, this value can be increased to gain performance.
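A back-of-the-envelope Python sketch of the saving, using the roughly 1 second per JVM quoted above. It assumes tasks pack evenly into reused JVMs and reports total startup time summed over all tasks, not wall-clock time; the unlimited setting (-1) is not modeled. The function name jvm_startup_overhead_s is hypothetical.

```python
import math


def jvm_startup_overhead_s(num_tasks, reuse_num_tasks=1, startup_s=1.0):
    """Total JVM startup time (seconds) summed over all tasks of a job.

    Assumes ideal packing: each JVM runs reuse_num_tasks tasks in sequence.
    """
    if reuse_num_tasks < 1:
        raise ValueError("this sketch only models reuse_num_tasks >= 1")
    jvms = math.ceil(num_tasks / reuse_num_tasks)
    return jvms * startup_s


# 2560 short map tasks, without reuse and with 10 tasks per JVM.
print(jvm_startup_overhead_s(2560, reuse_num_tasks=1))
print(jvm_startup_overhead_s(2560, reuse_num_tasks=10))
```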

Reduce parallel copies
mapred.reduce.parallel.copies: Threads for parallel copy at reducer
 Default: 5
 Description: The number of threads used to copy map outputs to the
reducer.
 Suggestions: For large jobs (jobs in which the map output is very large), this value can be increased, keeping in mind that it will increase the total CPU usage.

The Other Threads
dfs.namenode.handler.count / mapred.job.tracker.handler.count: server threads that handle remote procedure calls (RPCs)
 Default: 10
 Suggestions: This can be increased for larger servers (50-64).

dfs.datanode.handler.count: server threads that handle remote procedure calls (RPCs)
 Default: 3
 Suggestions: This can be increased for a larger number of HDFS clients (6-8).

tasktracker.http.threads: number of worker threads on the HTTP server on each TaskTracker
 Default: 40
 Suggestions: This can be increased for larger clusters (50).
Other hotspots
Revelation-Temporary space
Temporary space allocation:
 Jobs that generate large intermediate data (map output) need enough temporary space, which is controlled by the property mapred.local.dir. This property specifies the list of directories where MapReduce stores intermediate data for jobs. The data is cleaned up after the job completes.
 By default, the replication factor for file storage on HDFS is 3, which means that every file has three replicas. As a rule of thumb, at least 25% of the total hard disk should be allocated for intermediate temporary output. The remaining 75% is shared by three replicas, so effectively only ¼ of the hard disk space is available for business use.
 The default value for mapred.local.dir is ${hadoop.tmp.dir}/mapred/local. So if mapred.local.dir is not set, hadoop.tmp.dir must have enough space to hold the job's intermediate data. If the node doesn't have enough temporary space, the task attempt fails and a new attempt is started, which hurts performance.
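The disk budget described above works out as follows; this is a minimal Python sketch of the rule of thumb (25% temporary space, replication factor 3), with the function name usable_hdfs_capacity being hypothetical.

```python
def usable_hdfs_capacity(raw_tb, temp_fraction=0.25, replication=3):
    """Disk space left for unique business data after temp space and replicas."""
    hdfs_tb = raw_tb * (1 - temp_fraction)  # space left for HDFS blocks
    return hdfs_tb / replication            # each file is stored `replication` times


# A node with 12 TB of raw disk: 9 TB for HDFS, 3 TB of unique data.
print(usable_hdfs_capacity(12))
```

With the default values the result is always one quarter of the raw capacity, which matches the ¼ figure on the slide.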
Java- JVM
JVM tuning:
 Besides normal Java code optimizations, the JVM settings for each child task also affect the processing time. On each slave node, the TaskTracker and the DataNode use 1 GB RAM each. Effective use of the remaining RAM, as well as choosing the right GC mechanism for each map or reduce task, is very important for maximum utilization of hardware resources. The default maximum RAM for child tasks is 200 MB, which might be insufficient for many production-grade jobs. The JVM settings for child tasks are governed by the mapred.child.java.opts property.
 Use 64-bit JDK 1.6
 -XX:+UseCompressedOops is helpful in dealing with OOM errors
 Remember to raise the Linux open file descriptor limit
 Set java.net.preferIPv4Stack to true, to avoid timeouts in cases where the OS/JVM picks up an IPv6 address and must resolve the hostname.
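One way to size the child heap from the figures above is sketched below in Python. It subtracts 2 GB for the TaskTracker and DataNode (1 GB each, as stated on the slide) plus an assumed 1 GB reserve for the operating system, then divides the rest across all task slots. The os_reserve_gb default and the function name child_java_opts are assumptions for illustration, and a JVM's real footprint is somewhat larger than its -Xmx value.

```python
def child_java_opts(node_ram_gb, map_slots, reduce_slots,
                    daemon_gb=2, os_reserve_gb=1):
    """Suggest a -Xmx value for mapred.child.java.opts.

    daemon_gb: TaskTracker + DataNode (1 GB each, per the slide).
    os_reserve_gb: assumed headroom for the OS; not from the slide.
    """
    free_mb = (node_ram_gb - daemon_gb - os_reserve_gb) * 1024
    slots = map_slots + reduce_slots
    heap_mb = free_mb // slots
    return f"-Xmx{heap_mb}m"


# A 16 GB node with 8 map slots and 4 reduce slots.
print(child_java_opts(16, map_slots=8, reduce_slots=4))
```

If the result falls near or below the 200 MB default, the slot counts are probably too high for the node's memory.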

Logging

 A friend to developers, a foe in production

 Default - INFO level
 dfs.namenode.logging.level
 hadoop.job.history
 hadoop.logfile.size/count

Static Data strategies
 Available Approaches
 JobConf.set("key", "value")
 Distributed cache
 HDFS shared file

 Suggested approaches if the above are not efficient

 Memcached
 Tokyo Cabinet/Tokyo Tyrant
 Berkeley DB
 HBase

Debugging and profiling- Arun C Murthy
Hadoop Map-Reduce – Tuning and Debugging, from Arun C Murthy's presentation
 Debugging
 Log files/UI view
 Local runner
 Single machine mode
 Set keep.failed.task.files to true and use the IsolationRunner
 Profiling
 Set mapred.task.profile to true
 Use mapred.task.profile.{maps|reduces} to select which tasks to profile
 hprof support is built-in
 Use mapred.task.profile.params to set options for the profiler
 Possibly DistributedCache for the profiler's agent
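The profiling properties above can be passed per job as -D name=value generic options, provided the job's driver parses them (for example through ToolRunner). The Python sketch below only builds the argument list; the helper name profiling_args and the task ranges "0-2" are illustrative assumptions.

```python
def profiling_args(map_range="0-2", reduce_range="0-2"):
    """Build -D options that switch on task profiling for a single job.

    Property names come from the slide; the ranges are example values.
    """
    props = {
        "mapred.task.profile": "true",
        "mapred.task.profile.maps": map_range,
        "mapred.task.profile.reduces": reduce_range,
    }
    args = []
    for name, value in props.items():
        args += ["-D", f"{name}={value}"]
    return args


print(" ".join(profiling_args()))
```

Profiling only a few tasks keeps the overhead small while still showing where a typical task spends its time.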
Tuning - Arun C Murthy
Hadoop Map-Reduce – Tuning and Debugging, from Arun C Murthy's presentation
 Tuning
 Tell HDFS and Map-Reduce about your network! – Rack locality script: topology.script.file.name
 Number of maps – Data locality
 Number of reduces – You don't need a single output file!
 Amount of data processed per map – Consider fatter maps, custom input format
 Combiner – multi-level combiners at both map and reduce
 Check to ensure the combiner is useful!
 Map-side sort – io.sort.mb, io.sort.factor, io.sort.record.percent, io.sort.spill.percent
 Shuffle
 Compression for map outputs – mapred.compress.map.output, mapred.map.output.compression.codec, lzo via libhadoop.so, tasktracker.http.threads
 mapred.reduce.parallel.copies, mapred.reduce.copy.backoff, mapred.job.shuffle.input.buffer.percent, mapred.job.shuffle.merge.percent, mapred.inmem.merge.threshold, mapred.job.reduce.input.buffer.percent
 Compress the job output
 Miscellaneous – Speculative execution, heap size for the child, re-use JVM for maps/reduces, raw comparators
Next steps

 Hadoop Vaidya (since 0.20.0)

 Job configuration analyzer (WIP, to be contributed back to Hadoop)
 Part of Analyze Job web ui
 Analyze and suggest config parameters from job.xml
 Smart suggestion engine/auto-correction

Conclusion
 Performance of Hadoop MapReduce jobs can be improved without increasing hardware costs, by tuning several key configuration parameters for the cluster specifications, input data size and processing complexity.

References

 Hadoop.apache.org
 Hadoop-performance tuning--white paper v1 1.pdf – Arun C Murthy
 Intel_White_Paper_Optimizing_Hadoop_Deployments.pdf

