What is Hadoop?
• An open source framework that allows distributed processing of large data-sets across clusters of machines
• Runs on commodity hardware: economic / affordable, typically low-performance hardware
• Open source framework written in Java
• Inspired by Google's Map-Reduce programming model as well
as its file system (GFS)
Hadoop History
• 2002: Doug Cutting started working on the project (within Apache Lucene/Nutch)
• 2003–2004: Google published the GFS & MapReduce papers
• 2005: Doug Cutting added DFS & MapReduce
• 2006: Development of Hadoop started as a Lucene sub-project
• 2007: 4 TB of image archives were converted over 100 EC2 instances
• 2008: Hadoop became a top-level Apache project and defeated a supercomputer in the terabyte sort benchmark; Hive launched, adding SQL support for Hadoop
• 2009: Doug Cutting joined Cloudera
What is Hadoop?
• Open source software framework designed for storage and processing of large-scale datasets on large clusters of commodity hardware
• Large datasets → Terabytes or petabytes of data
• Large clusters → hundreds or thousands of nodes
• Uses for Hadoop
• Data-intensive text processing
• Graph mining
• Machine learning and data mining
• Large scale social network analysis
What is Hadoop (Cont’d)
• Hadoop framework consists of two main layers
• Hadoop Distributed File System (HDFS)
• Execution engine (MapReduce)
Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave architecture
• Master node (single node)
• Many slave nodes
Design Principles of Hadoop
• Need to process big data
• Need to parallelize computation across thousands of nodes
• Commodity hardware
• Large number of low-end cheap machines working in parallel
to solve a computing problem
Properties of HDFS
• Large: An HDFS instance may consist of thousands of server machines, each storing part of the file system’s data
• Replication: Each data block is replicated many times
(default is 3)
• Failure: Failure is the norm rather than the exception
• Fault Tolerance: Detection of faults and quick,
automatic recovery from them is a core architectural goal
of HDFS
Hadoop: How it Works
Hadoop Architecture
• Distributed file system (HDFS)
• Execution engine (MapReduce)
• Master node (single node)
• Many slave nodes
Hadoop Distributed File System (HDFS)
• Centralized NameNode
- Maintains metadata info about files
• Many DataNodes (1000s)
- Store the actual data
- Files are divided into blocks (64 MB each), e.g. a file F is stored as a set of blocks
- Each block is replicated N times (default N = 3)
Hadoop Distributed File System
• NameNode:
• Stores metadata (file names, block locations, etc.)
• DataNode:
• Stores the actual HDFS data blocks
(Diagram: File1 is split into blocks 1–4; each block is stored with multiple replicas spread across the DataNodes.)
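For illustration, a minimal Java sketch that asks the NameNode for a file’s replication factor and block placement through Hadoop’s FileSystem client API; the class name and the path /user/data/file1.txt are just placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/data/file1.txt");    // placeholder path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // The NameNode answers this metadata query; no file data is read here.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("Block at offset " + b.getOffset()
                    + " stored on " + String.join(", ", b.getHosts()));
        }
        fs.close();
    }
}
```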
Data Retrieval
• When a client wants to retrieve data, it communicates with the NameNode to determine which blocks make up a file and on which DataNodes those blocks are stored
• It then communicates directly with the DataNodes to read the data
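A minimal sketch of such a read in Java, using Hadoop’s FileSystem API (the path is a placeholder); the NameNode lookup and the DataNode reads are hidden behind the returned stream:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/data/file1.txt");   // placeholder path

        // open() asks the NameNode which blocks make up the file and where they live;
        // the returned stream then reads the bytes directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```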
MapReduce
Distributing computation across nodes
MapReduce Overview
• A method for distributing computation across
multiple nodes
• Each node processes the data that is stored at that
node
• Consists of two main phases
• Map
• Reduce
The Mapper
• Reads data as key/value pairs
• The key is often discarded
• Outputs zero or more key/value pairs
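A rough sketch of what a mapper looks like in Hadoop’s Java API, assuming a word-count style job (the class name and whitespace tokenization are illustrative): the input key, the byte offset of the line, is discarded, and zero or more (word, 1) pairs are emitted per record:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: the input key (byte offset of the line) is discarded,
// and zero or more (word, 1) pairs are emitted per input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // zero or more output pairs per input record
            }
        }
    }
}
```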
Shuffle and Sort
• Output from the mapper is sorted by key
• All values with the same key are guaranteed to go to
the same machine
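The routing of keys to machines is done by a partitioner. The sketch below mirrors the behaviour of Hadoop’s default HashPartitioner (the class name is illustrative): records with the same key always hash to the same partition, and therefore reach the same reducer:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same idea as Hadoop's default HashPartitioner: all records sharing a key
// land in the same partition, and hence on the same reducer/machine.
public class ExamplePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```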
The Reducer
• Called once for each unique key
• Gets a list of all values associated with a key as
input
• The reducer outputs zero or more final key/value
pairs
• Usually just one output per input key
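A matching reducer sketch in Hadoop’s Java API, continuing the word-count example (class name illustrative): it is called once per unique key with all of that key’s values and emits one (word, total) pair:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: called once per unique key with the list of its values,
// emitting a single (word, total) pair per key.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));   // usually just one output per input key
    }
}
```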
JobTracker and TaskTracker
• JobTracker
  • Determines the execution plan for the job
  • Assigns individual tasks
• TaskTracker
  • Keeps track of the performance of an individual mapper or reducer
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
• Receives the user’s job
• Decides on how many tasks will run (number of mappers)
• Decides on where to run each mapper (concept of locality)
• Example (blocks spread across Node 1, Node 2, and Node 3): this file has 5 blocks → run 5 map tasks
• Where to run the task reading block “1”? Try to run it on Node 1 or Node 3, the nodes holding a replica of that block
Properties of MapReduce Engine (Cont’d)
• Task Tracker is the slave node (runs on each datanode)
• Receives the task from Job Tracker
• Runs the task until completion (either map or reduce task)
• Always in communication with the Job Tracker reporting progress
(Diagram: in this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks; each map output is parsed, hashed by key, and shuffled to the reducers.)
MapReduce Phases
• Deciding on what will be the key and what will be the value ➔ developer’s responsibility
Map-Reduce Execution Engine (Example: Color Count)
• Input blocks on HDFS are fed to the map tasks
• Each map task parses its records and produces (k, v) pairs, e.g. (color, 1)
• Shuffle & sorting groups the pairs by key k
• Each reduce task consumes (k, [v]) pairs, e.g. (color, [1,1,1,1,1,1, ...]), and produces (k’, v’) pairs, e.g. (color, 100)
• Users only provide the “Map” and “Reduce” functions
Key-Value Pairs
• Mappers and Reducers are users’ code (provided functions)
• Just need to obey the Key-Value pairs interface
• Mappers:
• Consume <key, value> pairs
• Produce <key, value> pairs
• Reducers:
• Consume <key, <list of values>>
• Produce <key, value>
• Shuffling and Sorting:
• Hidden phase between mappers and reducers
• Groups all values sharing the same key from all mappers, sorts them, and passes them to a certain reducer in the form of <key, <list of values>>
Example 1: Word Count
• Job: Count the occurrences of each word in a data set
(Diagram: map tasks feeding reduce tasks.)
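A plausible driver for such a job, wiring together the mapper and reducer sketched earlier (the job name and the command-line input/output paths are illustrative choices):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordCountMapper.class);     // mapper sketched earlier
        job.setReducerClass(WordCountReducer.class);   // reducer sketched earlier
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        // Submit the job (to the JobTracker in classic MapReduce) and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```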
Example 2: Color Count
Job: Count the number of each color in a data set
• Input blocks on HDFS are fed to the map tasks
• Each map task parses its records and produces (k, v) pairs, e.g. (color, 1)
• Shuffle & sorting groups the pairs by key k
• Each reduce task consumes (k, [v]) pairs, e.g. (color, [1,1,1,1,1,1, ...]), and produces (k’, v’) pairs, e.g. (color, 100)
• Each reducer writes its own part of the output file (Part0001, Part0002, Part0003): the output has 3 parts, probably on 3 different machines
Example 3: Color Filter
Job: Select only the blue and the green colors
• Each map task will select only the blue or green colors
• No need for a reduce phase
• Input blocks on HDFS are fed to the map tasks; each map task produces its (k, v) pairs, e.g. (color, 1), and writes them directly to HDFS
• That gives an output file with 4 parts (Part0001 ... Part0004), probably on 4 different machines
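A self-contained sketch of such a map-only job in Hadoop’s Java API, assuming the input is plain text with one color per record (class and color names are illustrative); setting the number of reduce tasks to zero makes each map task write its output straight to HDFS:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ColorFilter {

    // Illustrative map-only filter: keep a record only if it is "blue" or "green".
    public static class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String color = record.toString().trim();
            if (color.equals("blue") || color.equals("green")) {
                context.write(record, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "color filter");
        job.setJarByClass(ColorFilter.class);
        job.setMapperClass(FilterMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // No reduce phase: with zero reducers each map task writes its output
        // straight to HDFS, giving one output part per map task (the four parts above).
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```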
Other Tools
• Hive
• Hadoop processing with SQL
• Pig
• Hadoop processing with scripting
• HBase
• Database model built on top of Hadoop
Who Uses Hadoop?