Module 2

Chapter 5 – Introduction to Hadoop

Prepared By:
Mr. Rajesh Nayak
Department of Artificial Intelligence and Data Science
5.1 Introducing Hadoop

• 5.1.1. Data: The Treasure Trove:
1. Provides business advantages such as generating product
recommendations, inventing new products, analysing the market,
and many more.
2. Provides a few early key indicators that can turn the fortunes of a
business.
3. Provides room for precise analysis. If we have more data for
analysis, then we have greater precision of analysis.

• 5.2. Why Hadoop?


• The key consideration behind the popularity of Hadoop is its
capability to handle massive amounts of data, of different
categories, fairly quickly.
• Other considerations: low cost, computing power, scalability,
storage flexibility, and inherent data protection.
5.1 Introducing Hadoop

Figure 5.2 Key considerations of Hadoop

• 5.3. Why Not RDBMS?
• 5.4 RDBMS versus Hadoop

• 5.5. Distributed Computing Challenges
• 5.5.1. Hardware Failure
• In a distributed system, several servers are networked together.
This implies that, more often than not, there is a real possibility of
hardware failure.
• Hadoop uses the replication factor to decide how many copies of
each data block to keep, so that a failed node does not cause data
loss (see the sketch below).
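For illustration, a minimal sketch of setting the replication factor through the Hadoop Java configuration API (the file path is hypothetical; cluster-wide defaults normally live in hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);            // default replication factor for new files
        FileSystem fs = FileSystem.get(conf);

        // Replication can also be changed per file after it has been written.
        fs.setReplication(new Path("/data/sample.txt"), (short) 4);
    }
}
```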

• 5.5.2. How to Process this Gigantic Store of Data?
• In a distributed system, data is spread across the network on
several machines.
• The key challenge is to integrate the data spread across these
machines for processing.
• Hadoop uses the MapReduce programming model to solve this
challenge.

• 5.6. History of Hadoop

• 5.7. Hadoop Overview
• 5.7.1. Key aspects of Hadoop

• 5.7.2. Hadoop Components

• 5.7.3. Hadoop Conceptual Layer
• It consists of a Data Storage Layer and a Data Processing Layer.
• 5.7.4 High-level Architecture of Hadoop
• The HDFS master partitions the data storage across the slave
nodes and keeps track of the locations of data on the DataNodes.
• The MapReduce master decides and schedules computation
tasks on the slave nodes.
• 5.8. Use Case of Hadoop
• ClickStream Data: Clickstream data is mouse-click data that helps
you understand the purchasing behaviour of customers. It helps
online marketers optimize their product web pages, promotional
content, etc. to improve their business.

• 5.10. HDFS (Hadoop Distributed File System)
• Key points of HDFS:
1. Storage component of Hadoop.
2. Distributed File System.
3. Modelled after Google File System.
4. Optimized for high throughput.
5. Can replicate a file a configured number of times, which makes it
tolerant of both software and hardware failures.
6. Automatically re-replicates data blocks that were stored on failed
nodes.
7. Suited to reading and writing large files.
8. Sits on top of a native file system such as ext3 or ext4 (see the
read/write sketch below).
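As an illustration of points 7 and 8, a minimal sketch (paths are hypothetical) that writes a small file to HDFS and reads it back through the Java FileSystem API; block placement and replication are handled transparently by HDFS:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to HDFS.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello HDFS\n");
        }

        // Read it back line by line.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```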

• 5.10.1 HDFS Daemons
• NameNode: HDFS divides a large file into smaller pieces called
blocks.
• The NameNode uses a rack ID to identify the DataNodes in a rack.
• A rack is a collection of DataNodes within the cluster.
• The NameNode keeps track of the blocks of a file as they are placed
on various DataNodes.
• The NameNode manages file-related operations such as read,
write, create, and delete.
• Its main job is managing the file system namespace, which is the
collection of files in the cluster. The NameNode stores the HDFS
namespace.
• There is a single NameNode per cluster (see the sketch below for
how a client can ask the NameNode where the blocks of a file live).
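A minimal sketch (the file path is hypothetical) of asking the NameNode, via the FileSystem API, which DataNodes hold each block of a file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.log"));

        // The NameNode answers with the DataNodes that hold each block.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                + " length=" + block.getLength()
                + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}
```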

• DataNode: There are multiple DataNodes per cluster. During
pipeline reads and writes, DataNodes communicate with each
other. A DataNode also continuously sends a "heartbeat" message
to the NameNode to confirm connectivity between the NameNode
and the DataNode.
• If there is no heartbeat from a DataNode, the NameNode replicates
the blocks that were stored on that DataNode onto other nodes in
the cluster, and the cluster keeps running as if nothing had
happened.

• Secondary NameNode: It takes snapshots of the HDFS metadata at
intervals specified in the Hadoop configuration. Since the memory
requirements of the Secondary NameNode are the same as those of
the NameNode, it is better to run the NameNode and the Secondary
NameNode on different machines.
• The Secondary NameNode does not record real-time changes that
happen to the HDFS metadata.

• 5.10.2 Anatomy of File Read

• 5.10.3 Anatomy of File Write

• 5.10.4 Replica Placement Strategy
• As per the Hadoop replica placement strategy, the first replica is
placed on the same node as the client. The second replica is placed
on a node on a different rack. The third replica is placed on the
same rack as the second, but on a different node in that rack. Once
the replica locations have been decided, a pipeline is built. This
strategy provides good reliability.

• 5.10.6 Special Features of HDFS
• Data Replication: There is absolutely no need for a client
application to track all blocks. The NameNode directs the client to
the nearest replica to ensure high performance.
• Data Pipeline: A client application writes a block to the first
DataNode in the pipeline. That DataNode then takes over and
forwards the data to the next node in the pipeline. This process
continues for all the data blocks, and subsequently all the replicas
are written to disk.

• 5.11 Processing Data with Hadoop
• In MapReduce programming, the input dataset is split into
independent chunks. Map tasks process these independent chunks
completely in parallel. The output produced by the map tasks
serves as intermediate data and is stored on the local disk of the
server that ran the task.
• The output of the mappers is automatically shuffled and sorted by
the framework.
• The MapReduce framework sorts this output based on keys. The
sorted output becomes the input to the reduce tasks.
• A reduce task produces the reduced output by combining the
output of the various mappers.
• Job inputs and outputs are stored in the file system (HDFS).
• The framework also handles scheduling, monitoring, and
re-executing failed tasks. A minimal word-count driver is sketched
below.
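The driver below is a minimal sketch of how a MapReduce job is configured and submitted; the WordCountMapper and WordCountReducer class names are illustrative and are sketched in Chapter 8 (Sections 8.2 and 8.3). Splitting, shuffling, sorting, and scheduling are handled by the framework.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);      // map phase (Section 8.2)
        job.setReducerClass(WordCountReducer.class);    // reduce phase (Section 8.3)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a job would typically be launched with something like `hadoop jar wordcount.jar WordCountDriver /input /output` (jar name and paths are illustrative).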
• 5.11.1 MapReduce Daemons
1. JobTracker: It provides connectivity between Hadoop and the user
application. When a user submits code to the cluster, the JobTracker
creates the execution plan by deciding which task to assign to which
node. It also monitors all running tasks.
2. TaskTracker: This daemon is responsible for executing the individual
tasks assigned to it by the JobTracker. There is a single TaskTracker
per slave node, and it spawns multiple Java Virtual Machines to
handle multiple map and reduce tasks in parallel.
• 5.11.2 How Does MapReduce Work?

• 5.12 Managing Resources and Applications with Hadoop
YARN
• YARN is part of the Hadoop 2.x architecture and handles the
resource management task.
• 5.12.1. Limitations of Hadoop 1.0 Architecture:
1. A single NameNode is responsible for managing the entire
namespace of the Hadoop cluster.
2. It has a restricted processing model which is suitable only for
batch-oriented MapReduce jobs.
3. Hadoop MapReduce is not suitable for interactive analysis.
4. Hadoop 1.0 is not suitable for machine learning, graph, and other
memory-intensive algorithms.
5. MapReduce is responsible for both cluster resource management
and data processing.
• 5.12.2. HDFS Limitations:
• The NameNode keeps all file metadata in main memory. Although
main memory today is neither as small nor as expensive as before,
there is still a limit on the number of objects that a single
NameNode can hold in memory.
• This problem is resolved with the help of HDFS Federation in
Hadoop 2.x.
• 5.12.3 Hadoop 2: HDFS
• It consists of two major components: 1) the namespace service and
2) the block storage service.
• The namespace service takes care of file-related operations, such
as creating and modifying files and directories. The block storage
service handles DataNode cluster management and replication.
• HDFS 2 Features:
– Horizontal scalability – HDFS Federation uses multiple independent
NameNodes to support scalability. These NameNodes do not need to
coordinate with each other.
– High availability – This is obtained with the help of a passive standby
NameNode, and failover is handled automatically. The passive
NameNode reads the edit log from shared storage and keeps its
metadata up to date. If the active NameNode fails, the passive
NameNode becomes active automatically (a client-side configuration
sketch is shown below).
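A minimal client-side configuration sketch for an HA nameservice (the nameservice name, NameNode ids, and hostnames are illustrative assumptions; these settings would normally live in hdfs-site.xml and core-site.xml):

```java
import org.apache.hadoop.conf.Configuration;

public class HaClientConfig {
    // Returns a configuration that lets clients address the cluster by its
    // logical nameservice name instead of a single NameNode host.
    public static Configuration haConf() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
        // The failover proxy provider picks whichever NameNode is currently active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```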

• 5.12.4 Hadoop 2 YARN: Taking Hadoop beyond Batch:

• The fundamental idea behind this architecture is splitting the JobTracker's
responsibilities of resource management and job scheduling/monitoring into
separate daemons. The daemons that are part of the YARN architecture are
described below.
• A global ResourceManager: Its main responsibility is to distribute resources
among the various applications in the system. It has two main components: the
Scheduler and the ApplicationsManager (which accepts jobs, negotiates
resources, and restarts failed ApplicationMasters).
• NodeManager: This is a per-machine slave daemon. The NodeManager's
responsibility is launching the containers in which applications execute. It
monitors resource usage such as memory, CPU, disk, and network, and reports
this usage to the global ResourceManager.
• Per-application ApplicationMaster: This is an application-specific entity. Its
responsibility is to negotiate the resources required for execution from the
ResourceManager. It works along with the NodeManager to execute and monitor
component tasks.
• YARN Architecture:

Module 2
Chapter 8 – Introduction to MapReduce
Programming
• 8.1 Introduction
• In MapReduce programming, jobs (applications) are split into a set
of map tasks and reduce tasks. These tasks are then executed in a
distributed fashion on the Hadoop cluster.
• Each task processes the small subset of data that has been assigned
to it. In this way, Hadoop distributes the load across the cluster. A
MapReduce job takes a set of files stored in HDFS (Hadoop
Distributed File System) as input.
• A map task takes care of loading, parsing, transforming, and
filtering. The responsibility of a reduce task is grouping and
aggregating the data produced by the map tasks to generate the
final output.
• Each map task is broken into the following phases: 1) RecordReader,
2) Mapper, 3) Combiner, 4) Partitioner.
• Each reduce task is broken into the following phases: 1) Shuffle,
2) Sort, 3) Reducer, 4) Output Format.
• 8.2 Mapper
• A mapper maps the input key-value pairs into a set of
intermediate key-value pairs. Maps are individual tasks that have
the responsibility of transforming input records into
intermediate key-value pairs.
1. RecordReader: The RecordReader converts a byte-oriented view of the input
(as generated by the InputSplit) into a record-oriented view and presents it
to the mapper tasks as key-value pairs. Generally the key is the positional
information and the value is the chunk of data that constitutes the record.
2. Map: The map function works on the key-value pair produced by the
RecordReader and generates zero or more intermediate key-value pairs.
The choice of key-value pair depends on the application.
3. Combiner: It is an optional function, but it improves performance in terms
of network bandwidth and disk space. It takes the intermediate key-value
pairs provided by a mapper and applies a user-specified aggregate function
to the output of that mapper only. It is also known as a local reducer.
4. Partitioner: The partitioner takes the intermediate key-value pairs produced
by the mapper, splits them into shards, and sends each shard to a particular
reducer as per the user-specified code. Records with the same key go to the
same reducer. The partitioned data of each map task is written to the local
disk of that machine and pulled by the respective reducer. A word-count
mapper illustrating these phases is sketched below.
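A word-count mapper as a minimal sketch of the map phase (class and variable names are illustrative): with the default text input, the RecordReader supplies the byte offset of each line as the key and the line itself as the value, and the mapper emits one (word, 1) pair per token.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input: (byte offset, line of text). Output: (word, 1) intermediate pairs.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit an intermediate key-value pair
        }
    }
}
```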
• 8.3 Reducer
• The primary chore of the Reducer is to reduce a set of intermediate values
(the ones that share a common key) to a smaller set of values. The Reducer
has three primary phases: Shuffle and Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and
downloads it onto the local machine where the reducer is running. These
individual data pipes are then sorted by key into one larger data list. The
main purpose of this sort is to group equivalent keys together so that their
values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and
sort phase, applies the reduce function, and processes one group at a time.
The reduce function iterates over all the values associated with a key. The
reducer function can perform operations such as aggregating, filtering, and
combining data. Once it is done, the output of the reducer (zero or more
key-value pairs) is sent to the output format.
3. Output Format: The output format separates the key and value with a tab
(by default) and writes the pair out to a file using a record writer. A
word-count reducer is sketched below.
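A word-count reducer as a minimal sketch of the reduce phase (class names are illustrative): the framework has already grouped and sorted the intermediate pairs by key, so the reducer simply sums the values for each word and hands the result to the output format.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: (word, list of 1s) grouped by the shuffle/sort. Output: (word, count).
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {   // iterate over all values for this key
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);           // handed to the output format
    }
}
```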

• 8.4 Combiner
• It is an optimization technique for a MapReduce job. Generally, the
reducer class is set as the combiner class (see the one-line driver
addition below). The difference between the combiner class and the
reducer class is as follows:
1. The output generated by the combiner is intermediate data and is passed
to the reducer.
2. The output of the reducer is passed to the output file on disk.
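In the word-count example, the reducer can safely double as the combiner because summing counts is associative and commutative. A one-line addition to the driver sketched in Section 5.11 (class name is illustrative):

```java
// Run a local reduce on each mapper's output before the shuffle,
// cutting down network traffic and disk usage.
job.setCombinerClass(WordCountReducer.class);
```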
• 8.5 Partitioner
• The partitioning phase happens after the map phase and before the
reduce phase. Usually the number of partitions is equal to the
number of reducers. The default partitioner is the hash partitioner;
an equivalent custom partitioner is sketched below.
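A custom partitioner for the word-count job, shown as a minimal sketch equivalent to hash partitioning (class name is illustrative); it could be registered in the driver with job.setPartitionerClass(WordPartitioner.class).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Records with the same key always land in the same reduce partition.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```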
• 8.6 Compression
• In MapReduce programming, you can compress the MapReduce
output file. Compression provides two benefits (see the driver
addition sketched below):
1. It reduces the space needed to store files.
2. It speeds up data transfer across the network.
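A minimal sketch of enabling output compression in the word-count driver (the choice of gzip is illustrative; other codecs can be used):

```java
// Added to the driver from Section 5.11: compress the final job output with gzip.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(
        job, org.apache.hadoop.io.compress.GzipCodec.class);
// Intermediate map output can also be compressed by setting
// "mapreduce.map.output.compress" to true in the job configuration.
```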
