3 Unit

MapReduce is a framework for processing large data sets in a distributed manner, consisting of two main tasks: Map, which extracts key-value pairs from input data, and Reduce, which aggregates results. The process involves splitting input data into chunks, processing them in parallel, and managing tasks through Jobtracker and Tasktracker. Additionally, tools like Hive, Pig, and Spark enhance data processing capabilities within the Hadoop ecosystem.


MapReduce is a powerful framework used in Big Data (especially in Hadoop) to process and generate large data sets with a parallel, distributed algorithm.


Instead of processing the whole huge file at once, it divides the work into small chunks and
processes them in parallel (multiple computers or nodes working together).

It mainly has two important tasks:

 Map

 Reduce

Before jumping deep, remember:

 Map → Break down and extract important information

 Reduce → Summarize and combine the results

🔵 Detailed Steps of MapReduce

1. Input Data

 A huge file containing massive data.

 Example: Billions of weather records, web logs, social media posts, etc.

 Data is stored in HDFS (Hadoop Distributed File System).

2. Splitting

 Hadoop splits the input file into fixed-size pieces called input splits (by default the same size as an HDFS block: 128 MB in recent Hadoop versions, 64 MB in older ones).

 Each split is processed independently by a different mapper (a small task running on a node).

3. Map Phase

This is the first main function.

 Map function reads the input line by line.

 It processes the input data and outputs a set of (key, value) pairs.

👉 What happens inside Map function?


 It extracts useful data from the input.

 Example: From a weather record line, it extracts (year, temperature).

🔵 Output of map phase:


(1950, 22)

(1949, 78)

(1950, -11)

...

🔵 Note:

 Each output is a key-value pair.

 It can produce one or many key-value pairs for each input line.
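
A minimal Java sketch of such a map function is shown below. The record layout assumed here (year in the first four characters, temperature after a separator) is purely illustrative and not from the notes:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads one weather record per line and emits (year, temperature).
// The field positions below are assumptions for illustration; a real record
// format (e.g., NCDC data) has its own fixed offsets.
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(0, 4);                            // assumed: year in columns 0-3
        int temperature = Integer.parseInt(line.substring(5).trim());  // assumed: temperature after a separator
        context.write(new Text(year), new IntWritable(temperature));
    }
}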

4. Shuffle and Sort Phase (System-Managed)

This is automatic (you don’t have to code it).

 Hadoop groups all the values that have the same key.

 It also sorts the data by key.

👉 Example:

 All temperatures from the year 1950 are grouped together.

 All temperatures from the year 1949 are grouped together.

🔵 Result after shuffle:


1949 → [78]

1950 → [22, -11, 0]

This step ensures that all data for a given key (like 1950) goes to the same Reducer.
5. Reduce Phase

This is the second main function.

 Reduce function takes the key and the list of values.

 It processes them and produces the final output for that key.

👉 What happens inside Reduce function?

 It can calculate things like maximum, minimum, sum, average, etc.

 Example: From [22, -11, 0] for 1950, find the maximum value: 22.

🔵 Output of reduce phase:


(1949, 78)

(1950, 22)
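
A matching Java reduce function, sketched under the same assumptions as the mapper above (it simply keeps the maximum of all temperatures seen for a year):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (year, [temp1, temp2, ...]) after the shuffle and emits (year, maxTemp).
public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}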

6. Final Output

 The final result is written back to HDFS (or any storage).

 Each output is in the form of (key, final value).

🎯 In Short:

Phase – Action

Input Split – Divides the input file into blocks

Map – Extracts key-value pairs from data

Shuffle/Sort – Groups all values by the same key

Reduce – Aggregates and finalizes the results

Output – Stores the result


Input Data

Splitting

Map Phase: (key, value) → (Year, Temperature)

Shuffle and Sort: group by Year

Reduce Phase: Maximum temperature per Year

Final Output

Java MapReduce


MapReduce Job Execution Flow in Hadoop (Using Java)

1. Writing Mapper, Reducer, and Driver Classes

The process begins by writing Java classes that define the Mapper, Reducer, and Driver (Main)
logic. These classes contain the business logic for processing input and generating the required
output.
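
A minimal driver sketch is shown below; the Mapper and Reducer class names reuse the hypothetical weather example sketched earlier and are not taken from the notes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the Mapper and Reducer together and submits the job to Hadoop.
public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once compiled and packaged into a JAR (steps 2–4 below), this driver is the main class named on the hadoop jar command line.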

2. Compilation of Java Code

The Java source files are compiled using the Hadoop classpath. This results in the generation
of .class files for each Java class.

3. Creation of JAR File


All compiled .class files are packaged into a single JAR (Java Archive) file. This JAR file contains all
the necessary classes for executing the MapReduce job.

4. Job Submission

The MapReduce job is submitted to the Hadoop framework using the hadoop jar command,
specifying the JAR file, the main class, input path, and output path.

5. Initialization of JVM and Job ID Assignment

Upon submission, Hadoop internally starts a Java Virtual Machine (JVM) to run the job. The
system loads the JAR file using the classpath. The job is then assigned a unique Job ID (e.g.,
job_local_0001) for tracking purposes.

6. Task Creation

Hadoop divides the job into multiple smaller tasks:

 Map Tasks: Responsible for reading and processing the input data.

 Reduce Tasks: Handle the aggregation of intermediate results produced by the mappers.

7. Task Attempt Identification

Each map and reduce task is assigned a unique attempt ID. For example:

 Map Task: attempt_local_0001_m_000000_0

 Reduce Task: attempt_local_0001_r_000000_0

8. Execution of Tasks by JVM

Each task (both Map and Reduce) is executed by a separate instance of the JVM. The Map
function processes input key-value pairs, and the Reduce function aggregates intermediate
results based on keys.

9. Use of Counters
Hadoop maintains various built-in and custom counters to monitor:

 Number of records read

 Number of records written

 Number of tasks succeeded or failed

10. Input Processing

The input data is divided into logical splits. Each split is processed by a map task, which
transforms the input into intermediate key-value pairs.

11. Output Generation

The output from reduce tasks is stored in the specified output directory. Each reducer writes its
result into separate output files (e.g., part-r-00000).

12. Result Verification

The output data is examined to ensure correctness and to verify that the result matches the
expected output.

13. Job Completion

Once all map and reduce tasks are successfully executed, and the output is written, the
MapReduce job is marked as complete.
Apache Hadoop Ecosystem

Hive lets you run SQL-like queries on big data stored in Hadoop. Instead of writing complicated Java programs (like MapReduce), you can use simple Hive queries. It's good for data analysis and reporting.

Pig uses a special scripting language called Pig Latin to process and transform data. It's easier and quicker than writing full MapReduce programs. Mostly used by programmers who want simple data processing.

HBase is a NoSQL database that runs on top of HDFS. It stores data in a table format but doesn't use SQL. It's great when you need fast reading and writing of individual records (example: random access).

Sqoop is a tool used to import data from traditional databases (like MySQL, Oracle) into Hadoop and also export processed data back to these databases. It helps in moving structured data easily.

Flume is mainly used to collect and move huge volumes of real-time streaming data (like log files, social media feeds) into Hadoop for storage and later analysis.

Oozie is a workflow scheduler. It helps organize and schedule multiple Hadoop jobs (like MapReduce, Hive, Pig) into a sequence or workflow, running them in a set order automatically.

Zookeeper manages coordination and synchronization between all Hadoop services. It keeps track of configuration, manages distributed locks, and helps all parts of Hadoop work together smoothly.

Spark is a fast, in-memory data processing engine. It can process big data much faster than MapReduce because it keeps data in memory (RAM) instead of reading from disk again and again. It supports batch processing, real-time data processing, machine learning, and more.

Mahout provides machine learning algorithms like clustering, classification, and recommendation systems that can work on Hadoop. It helps to build smart applications (like product recommendations).

In simple words:

Tool – You use it when you want to...

Hive – Write SQL queries instead of coding programs.

Pig – Write simple scripts to process data.

HBase – Quickly read/write data (like a NoSQL database).

Sqoop – Move data between databases and Hadoop.

Flume – Bring live/streaming data into Hadoop.

Oozie – Schedule and manage multiple Hadoop tasks.

Zookeeper – Keep all Hadoop services working together correctly.

Spark – Process data faster than MapReduce (for batch and real-time).

Mahout – Build machine learning models on big data.


SCALING OUT

 What is Scaling Out?

o It's a method to add more computers (or machines) to your system to handle
more data.

o Instead of using just one computer, we use many to process data faster and
better.

o The data is stored across all machines using something called HDFS (Hadoop
Distributed File System).

o Hadoop then spreads the MapReduce work across all the machines.

3.4.1 Data Flow – Terminologies

What is a MapReduce job?

 A MapReduce job is a task that we give to Hadoop.

 It includes:

o Input data (what needs to be processed)

o The MapReduce program (how to process it)

o Configuration information (settings)

How does Hadoop handle the job?

 Hadoop splits the job into small tasks.

 There are two types of tasks:

1. Map tasks – they break the data into parts and process them.

2. Reduce tasks – they collect and combine the results from the map tasks.

Who manages the tasks?


There are two important managers:

1. Jobtracker

o Think of this like the boss.

o It decides what needs to be done and who should do it.

o It keeps track of the overall progress.

o If a task fails, the jobtracker gives it to someone else.

2. Tasktracker

o These are the workers.

o They actually do the map and reduce tasks.

o They report back to the jobtracker with updates.

Summary (in simple words):

 Hadoop works like a team.

 The jobtracker is the team leader.

 The tasktrackers are the team members.

 Together, they break big jobs into smaller ones and finish them quickly across multiple
computers.

MapReduce – How it Processes Data (Easy Explanation)

🔹 Step-by-Step Procedure:

1. Splitting the Input:

o Hadoop splits the input data into smaller parts called input splits.

o Each split gets a map task to process it.

🔹 Why Splits?

 Processing a split is faster than processing the whole input.

 If we run all splits at the same time (parallel processing), the job finishes much faster.
 Faster machines finish their splits earlier and take more splits – so the system works
efficiently.

🔸 More splits = better load balancing


🔸 But too many small splits = more overhead (more tasks to manage)

🔹 Good Split Size:

 A common choice is the HDFS block size (128 MB by default in recent Hadoop versions; 64 MB in older ones).

 You can change it based on your system’s setup (see the sketch below).
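
If you need a different split size for a particular job, FileInputFormat exposes setters for it. A minimal sketch, assuming the Job object from a driver like the one above (the sizes here are illustrative):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Inside the driver, after creating the Job:
// cap each split at 64 MB and never go below 32 MB (illustrative values).
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);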

Data Locality Optimization:

 Hadoop tries to run map tasks close to where the data is stored in HDFS.

 This avoids using network bandwidth, which is slower.

There are 3 types of data placement:

1. Data-local: Map task and data on the same node.

2. Rack-local: Map task and data on different nodes but same rack.

3. Off-rack: Map task and data on different racks – slowest.

✅ Best: Data-local
⚠️ Okay: Rack-local
❌ Worst: Off-rack (slow due to network transfer)

Map and Reduce Task Flow:

🔹 What Happens After Mapping?

 Map tasks write their output to local disk (not HDFS).

 This is temporary output used by the reduce task.

🔹 Reduce Tasks:

 They collect outputs from all map tasks.

 These outputs are sent across the network to the node where reduce happens.
 Reduce tasks merge and process the data to generate final results.

🔹 Final Output:

 Reduce output is stored in HDFS for safety and reliability.

 Usually, one copy stays on the same node, and others are stored on different nodes for
backup.

🔹 Diagram Explanation (data flow with a single reduce task):

 Shows input splits (split 0, split 1, split 2)

 Each split goes to a map task

 Map outputs are collected by the reduce task

 Reduce task merges everything and saves it to HDFS (part 0)

🔹 Key Terms Simplified:

 Map Task = Takes small part of input and processes it

 Reduce Task = Takes all map outputs, combines them, and gives final result

 HDFS = Hadoop’s storage system

 Split = A small chunk of the big input file

 Rack = A group of computers in a network

MapReduce with Multiple Reduce Tasks

🔹 What happens when there are multiple reducers?

 After the map phase, the output is partitioned by keys.

 This process is called the shuffle phase.

 Each reducer gets its own partition of the data based on key values.

 When you increase the number of reduce tasks, Hadoop:


 Distributes the keys across multiple reducers.

 Each reducer works independently on its set of keys.

 This improves:

 Speed (parallel processing).

 Scalability (can handle large data).

 Fault tolerance (if one fails, others continue).
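
By default, Hadoop routes keys to reducers with a hash partitioner (HashPartitioner). If you want explicit control over the partitioning, a custom partitioner can be plugged in; the sketch below is illustrative only and mirrors what the default already does:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: decides which reducer receives each (year, temperature) pair.
// The default HashPartitioner behaves essentially the same way.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text year, IntWritable temperature, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (year.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(YearPartitioner.class), alongside job.setNumReduceTasks(n) to set how many reducers run.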

🔹 No Reduce Tasks?

 Sometimes, we don’t need reduce tasks.

 Map output can go directly to HDFS (no shuffling).

 This is useful when no aggregation or combining is needed.
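
In the driver this is just a request for zero reducers; a minimal sketch, again assuming the Job object from the driver example earlier:

// Map-only job: each mapper's output is written straight to HDFS
// as part-m-NNNNN files; the shuffle and sort steps are skipped.
job.setNumReduceTasks(0);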

Combiner Functions (Section 3.4.2)

🔹 Why use a Combiner?

 To reduce network usage between map and reduce stages.

 It runs after the map task, doing some local aggregation.

 Example: Summing or finding the maximum temperature before data is sent to the reducers.

Suppose two map tasks process records for the year 1950: the first map task produces the first three pairs below, and the second produces the last two:


(1950, 0)

(1950, 20)

(1950, 10)

(1950, 25)
(1950, 15)

Instead of sending all these to the reducer, we can use a combiner.

🔹 Combiner Function:

 Acts like a mini-reducer.

 Example: We want to find the maximum temperature for 1950.

 Each combiner locally finds the maximum of its own map task's output:

(1950, 20) ← from the first map task

(1950, 25) ← from the second map task

Then only one value per map task is sent to the reducer – saving bandwidth.

🔹 Final Reduce:

 The reducer still gets (1950, [20, 25]) and again finds the max.

 So result = (1950, 25).

🧠 Formula:

max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

✅ Summary:

 Combiner = mini-reducer

 Helps reduce data transfer between map and reduce tasks

 Especially useful when doing things like sum, max, count, etc.
 Not always used, and not guaranteed to run every time
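
Because max is commutative and associative, the reducer class itself can usually double as the combiner. A minimal sketch, reusing the hypothetical MaxTemperatureReducer and driver Job from the earlier examples:

// In the driver: run the same max logic locally on each map's output
// before anything is sent across the network to the reducers.
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);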

Why not all functions can be used as Combiners?

 Some operations (like mean) can't be split and recombined.

 Example:


mean(0, 20, 10, 25, 15) = 14 ✅

But:

mean(mean(0,20,10), mean(25,15)) = mean(10,20) = 15 ❌

 This means mean is not suitable for a combiner.

🔹 Combiner ≠ Reducer

 Combiner just pre-processes map output locally to reduce data transfer.

 The final result is always done by the reducer.

🧱 What is a Block in HDFS?


In HDFS, a block is the minimum unit of data storage. The default block size is 128 MB (clusters are often configured with larger blocks, such as 256 MB), which is far larger than traditional OS file-system blocks of around 4 KB. Files in HDFS are split into blocks and distributed across multiple nodes in the cluster.

✅ Benefits of Using Blocks

1. Supports Very Large Files

“A file can be larger than any single disk in the network...”

 Files can span multiple disks because they are split into blocks.

 These blocks don’t need to stay on the same machine or disk.


 It allows HDFS to store files much larger than what a single disk could hold by using the
whole cluster’s storage capacity.

📌 Example: A 500 MB file on a system with 128 MB block size will be split into 4 blocks:

 Block 1: 128 MB

 Block 2: 128 MB

 Block 3: 128 MB

 Block 4: 116 MB

These blocks can go to different machines in the cluster.

2. Simplifies Storage Subsystem

“Making the unit of abstraction a block rather than a file simplifies the storage subsystem...”

 Storage deals with fixed-size blocks, not variable-size files.

 Easy to manage, calculate, and allocate space.

 Reduces complexity of metadata handling since each block is a fixed size.

📌 Example: On a 1 TB disk with 128 MB block size, it's simple to know that ~8000 blocks can fit
on the disk.

3. Helps with Replication and Fault Tolerance

“Blocks fit well with replication for providing fault tolerance and availability...”

 Blocks are replicated, usually 3 copies, on different nodes.

 If one copy is lost due to disk failure, others can be used.

 Improves reliability and high availability of data.

📌 Example:

 Block A is stored on Node1, Node2, and Node3.

 If Node2 crashes, Block A is still available from Node1 or Node3.
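
The replication factor can also be changed per file through the HDFS Java API; a small illustrative sketch (the file path is hypothetical, and the cluster-wide default comes from the dfs.replication property):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // Connects to the default filesystem named in the cluster configuration.
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep 3 copies of this file's blocks (hypothetical path).
        fs.setReplication(new Path("/data/weather/1950.txt"), (short) 3);
        fs.close();
    }
}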


NameNode (Master)

 Role: Manages the filesystem namespace (structure of files and directories).

 Stores:

o Namespace image: Snapshot of the file system tree.

o Edit log: Record of recent changes.

 Knows: Which DataNodes hold the blocks of a file.

 Does NOT store block locations permanently — this info is re-collected from DataNodes
at system startup.

📌 DataNode (Slave)

 Role: Actual storage workhorse.

 Stores and retrieves file blocks as told by NameNode or clients.

 Reports: Periodically sends block information to the NameNode.

 Can't work alone – depends on NameNode for coordination.

NameNode Failure Protection

Hadoop provides two ways to handle NameNode failure:

1. Backups: Save the namespace image and edit log regularly.

2. Secondary NameNode (not a backup NameNode):

o Periodically merges the namespace image + edit log.

o Runs on a separate machine with high resources.

o Helps prevent the edit log from growing too large.
