What is Hadoop?
• An open source framework that allows distributed processing of large data-sets across clusters of machines
• Runs on commodity hardware: economic / affordable, typically low-performance hardware
• Open source framework written in Java
• Inspired by Google's Map-Reduce programming model as well
as its file system (GFS)
Hadoop History
• 2002: Doug Cutting started working on the project (within Apache Lucene/Nutch)
• 2003–2004: Google published the GFS & MapReduce papers
• 2005: Doug Cutting added DFS & MapReduce
• 2006: Development of Hadoop started as a Lucene sub-project
• 2007: 4 TB of image archives were converted over 100 EC2 instances
• 2008: Hadoop became a top-level Apache project and defeated a supercomputer in the terabyte sort benchmark; Hive launched, adding SQL support for Hadoop
• 2009: Doug Cutting joined Cloudera
What is Hadoop?
• Open source software framework designed for storage and processing of large-scale datasets on large clusters of commodity hardware
• Large datasets → Terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes
• Uses for Hadoop
• Data-intensive text processing
• Graph mining
• Machine learning and data mining
• Large scale social network analysis
What is Hadoop (Cont’d)
• Hadoop framework consists of two main layers
• Hadoop Distributed File System (HDFS)
• Execution engine (MapReduce)
Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave architecture
• Master node (single node)
• Many slave nodes
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
• Large number of low-end cheap machines working in parallel
to solve a computing problem
Properties of HDFS
• Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
• Replication: Each data block is replicated many times
(default is 3)
• Failure: Failure is the norm rather than the exception
• Fault Tolerance: Detection of faults and quick,
automatic recovery from them is a core architectural goal
of HDFS
Hadoop: How it Works
Hadoop Architecture
• Distributed file system (HDFS)
• Execution engine (MapReduce)
• Master node (single node)
• Many slave nodes
Hadoop Distributed File System (HDFS)
• Centralized NameNode
- Maintains metadata info about files
• Many DataNodes (1000s)
- Store the actual data
- Files are divided into blocks (64 MB each), e.g. a file F is stored as a set of blocks
- Each block is replicated N times (default N = 3)
Hadoop Distributed File System
• NameNode:
• Stores metadata (file names, block locations, etc.)
• DataNode:
• Stores the actual HDFS data blocks
(Diagram: File1 is split into blocks 1–4; each block is stored with multiple replicas spread across the DataNodes.)
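For illustration, a minimal Java sketch that asks the NameNode for a file’s replication factor and block placement through Hadoop’s FileSystem client API; the class name and the path /user/data/file1.txt are just placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/data/file1.txt");    // placeholder path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // The NameNode answers this metadata query; no file data is read here.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("Block at offset " + b.getOffset()
                    + " stored on " + String.join(", ", b.getHosts()));
        }
        fs.close();
    }
}
```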
Data Retrieval
• When a client wants to retrieve data, it communicates with the NameNode to determine which blocks make up a file and on which DataNodes those blocks are stored
• It then communicates directly with the DataNodes to read the data
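A minimal sketch of such a read in Java, using Hadoop’s FileSystem API (the path is a placeholder); the NameNode lookup and the DataNode reads are hidden behind the returned stream:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/data/file1.txt");   // placeholder path

        // open() asks the NameNode which blocks make up the file and where they live;
        // the returned stream then reads the bytes directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```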
MapReduce
Distributing computation across nodes
MapReduce Overview
• A method for distributing computation across
multiple nodes
• Each node processes the data that is stored at that
node
• Consists of two main phases
• Map
• Reduce
The Mapper
• Reads data as key/value pairs
• The key is often discarded
• Outputs zero or more key/value pairs
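A rough sketch of what a mapper looks like in Hadoop’s Java API, assuming a word-count style job (the class name and whitespace tokenization are illustrative): the input key, the byte offset of the line, is discarded, and zero or more (word, 1) pairs are emitted per record:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: the input key (byte offset of the line) is discarded,
// and zero or more (word, 1) pairs are emitted per input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // zero or more output pairs per input record
            }
        }
    }
}
```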
Shuffle and Sort
• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to
the same machine
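The routing of keys to machines is done by a partitioner. The sketch below mirrors the behaviour of Hadoop’s default HashPartitioner (the class name is illustrative): records with the same key always hash to the same partition, and therefore reach the same reducer:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same idea as Hadoop's default HashPartitioner: all records sharing a key
// land in the same partition, and hence on the same reducer/machine.
public class ExamplePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```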
The Reducer
• Called once for each unique key
• Gets a list of all values associated with a key as
input
• The reducer outputs zero or more final key/value
pairs
• Usually just one output per input key
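A matching reducer sketch in Hadoop’s Java API, continuing the word-count example (class name illustrative): it is called once per unique key with all of that key’s values and emits one (word, total) pair:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: called once per unique key with the list of its values,
// emitting a single (word, total) pair per key.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));   // usually just one output per input key
    }
}
```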
JobTracker and TaskTracker
• JobTracker
  • Determines the execution plan for the job
  • Assigns individual tasks
• TaskTracker
  • Keeps track of the performance of an individual mapper or reducer
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
• Receives the user’s job
• Decides on how many tasks will run (number of mappers)
• Decides on where to run each mapper (concept of locality)
• Example (blocks spread across Node 1, Node 2, and Node 3): this file has 5 blocks → run 5 map tasks
• Where to run the task reading block “1”? Try to run it on Node 1 or Node 3, the nodes holding a replica of that block
Properties of MapReduce Engine (Cont’d)
• Task Tracker is the slave node (runs on each datanode)
• Receives the task from Job Tracker
• Runs the task until completion (either map or reduce task)
• Always in communication with the Job Tracker reporting progress
(Diagram: in this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks; each map output is parsed, hashed by key, and shuffled to the reducers.)
MapReduce Phases
• Deciding on what will be the key and what will be the value ➔ developer’s responsibility
Map-Reduce Execution Engine (Example: Color Count)
• Input blocks on HDFS are fed to the map tasks
• Each map task parses its records and produces (k, v) pairs, e.g. (color, 1)
• Shuffle & sorting groups the pairs by key k
• Each reduce task consumes (k, [v]) pairs, e.g. (color, [1,1,1,1,1,1, ...]), and produces (k’, v’) pairs, e.g. (color, 100)
• Users only provide the “Map” and “Reduce” functions
Key-Value Pairs
• Mappers and Reducers are users’ code (provided functions)
• Just need to obey the Key-Value pairs interface
• Mappers:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume <key, <list of values>>
• Produce <key, value>
• Shuffling and Sorting:
• Hidden phase between mappers and reducers
• Groups all values sharing the same key from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
Example 1: Word Count
• Job: Count the occurrences of each word in a data set
(Diagram: map tasks feeding reduce tasks.)
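A plausible driver for such a job, wiring together the mapper and reducer sketched earlier (the job name and the command-line input/output paths are illustrative choices):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordCountMapper.class);     // mapper sketched earlier
        job.setReducerClass(WordCountReducer.class);   // reducer sketched earlier
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        // Submit the job (to the JobTracker in classic MapReduce) and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```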
Example 2: Color Count
Job: Count the number of each color in a data set
• Input blocks on HDFS are fed to the map tasks
• Each map task parses its records and produces (k, v) pairs, e.g. (color, 1)
• Shuffle & sorting groups the pairs by key k
• Each reduce task consumes (k, [v]) pairs, e.g. (color, [1,1,1,1,1,1, ...]), and produces (k’, v’) pairs, e.g. (color, 100)
• Each reducer writes its own part of the output file (Part0001, Part0002, Part0003): the output has 3 parts, probably on 3 different machines
Example 3: Color Filter
Job: Select only the blue and the green colors
• Each map task will select only the blue or green colors
• No need for a reduce phase
• Input blocks on HDFS are fed to the map tasks; each map task produces its (k, v) pairs, e.g. (color, 1), and writes them directly to HDFS
• That gives an output file with 4 parts (Part0001 ... Part0004), probably on 4 different machines
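A self-contained sketch of such a map-only job in Hadoop’s Java API, assuming the input is plain text with one color per record (class and color names are illustrative); setting the number of reduce tasks to zero makes each map task write its output straight to HDFS:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorFilter {

    // Illustrative map-only filter: keep a record only if it is "blue" or "green".
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String color = record.toString().trim();
            if (color.equals("blue") || color.equals("green")) {
                context.write(record, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color filter");
        job.setJarByClass(ColorFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // No reduce phase: with zero reducers each map task writes its output
        // straight to HDFS, giving one output part per map task (the four parts above).
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```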
Other Tools
• Hive
• Hadoop processing with SQL
• Pig
• Hadoop processing with scripting
• HBase
• Database model built on top of Hadoop
Who Uses Hadoop?