3 Unit

MapReduce is a framework for processing large data sets in a distributed manner, consisting of two main tasks: Map, which extracts key-value pairs from input data, and Reduce, which aggregates results. The process involves splitting input data into chunks, processing them in parallel, and managing tasks through Jobtracker and Tasktracker. Additionally, tools like Hive, Pig, and Spark enhance data processing capabilities within the Hadoop ecosystem.


MapReduce is a powerful framework used in Big Data (especially in Hadoop) to process and generate large data sets with a parallel, distributed algorithm.


Instead of processing the whole huge file at once, it divides the work into small chunks and
processes them in parallel (multiple computers or nodes working together).

It mainly has two important tasks:

 Map

 Reduce

Before jumping deep, remember:

 Map → Break down and extract important information

 Reduce → Summarize and combine the results

🔵 Detailed Steps of MapReduce

1. Input Data

 A huge file containing massive data.

 Example: Billions of weather records, web logs, social media posts, etc.

 Data is stored in HDFS (Hadoop Distributed File System).

2. Splitting

 Hadoop splits the input file into fixed-size pieces called input splits (by default the same size as an HDFS block: 128 MB in recent Hadoop versions, 64 MB in older ones).

 Each split is processed independently by a different mapper (a small task running on a node).

3. Map Phase

This is the first main function.

 Map function reads the input line by line.

 It processes the input data and outputs a set of (key, value) pairs.

👉 What happens inside Map function?


 It extracts useful data from the input.

 Example: From a weather record line, it extracts (year, temperature).

🔵 Output of map phase:


(1950, 22)

(1949, 78)

(1950, -11)

...

🔵 Note:

 Each output is a key-value pair.

 It can produce one or many key-value pairs for each input line.
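
A minimal Java sketch of such a map function is shown below. The record layout assumed here (year in the first four characters, temperature after a separator) is purely illustrative and not from the notes:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads one weather record per line and emits (year, temperature).
// The field positions below are assumptions for illustration; a real record
// format (e.g., NCDC data) has its own fixed offsets.
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(0, 4);                            // assumed: year in columns 0-3
        int temperature = Integer.parseInt(line.substring(5).trim());  // assumed: temperature after a separator
        context.write(new Text(year), new IntWritable(temperature));
    }
}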

4. Shuffle and Sort Phase (System-Managed)

This is automatic (you don’t have to code it).

 Hadoop groups all the values that have the same key.

 It also sorts the data by key.

👉 Example:

 All temperatures from the year 1950 are grouped together.

 All temperatures from the year 1949 are grouped together.

🔵 Result after shuffle:


1949 → [78]

1950 → [22, -11, 0]

This step ensures that all data for a given key (like 1950) goes to the same Reducer.
5. Reduce Phase

This is the second main function.

 Reduce function takes the key and the list of values.

 It processes them and produces the final output for that key.

👉 What happens inside Reduce function?

 It can calculate things like maximum, minimum, sum, average, etc.

 Example: From [22, -11, 0] for 1950, find the maximum value: 22.

🔵 Output of reduce phase:


(1949, 78)

(1950, 22)
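
A matching Java reduce function, sketched under the same assumptions as the mapper above (it simply keeps the maximum of all temperatures seen for a year):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (year, [temp1, temp2, ...]) after the shuffle and emits (year, maxTemp).
public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }
}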

6. Final Output

 The final result is written back to HDFS (or any storage).

 Each output is in the form of (key, final value).

🎯 In Short:

Phase – Action

Input Split – Divides the input file into blocks

Map – Extracts key-value pairs from data

Shuffle/Sort – Groups all values by the same key

Reduce – Aggregates and finalizes the results

Output – Stores the result


Input Data

Splitting

Map Phase: (key, value) → (Year, Temperature)

Shuffle and Sort: group by Year

Reduce Phase: Maximum temperature per Year

Final Output

Java MapReduce


MapReduce Job Execution Flow in Hadoop (Using Java)

1. Writing Mapper, Reducer, and Driver Classes

The process begins by writing Java classes that define the Mapper, Reducer, and Driver (Main)
logic. These classes contain the business logic for processing input and generating the required
output.
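
A minimal driver sketch is shown below; the Mapper and Reducer class names reuse the hypothetical weather example sketched earlier and are not taken from the notes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the Mapper and Reducer together and submits the job to Hadoop.
public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once compiled and packaged into a JAR (steps 2–4 below), this driver is the main class named on the hadoop jar command line.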

2. Compilation of Java Code

The Java source files are compiled using the Hadoop classpath. This results in the generation
of .class files for each Java class.

3. Creation of JAR File


All compiled .class files are packaged into a single JAR (Java Archive) file. This JAR file contains all
the necessary classes for executing the MapReduce job.

4. Job Submission

The MapReduce job is submitted to the Hadoop framework using the hadoop jar command,
specifying the JAR file, the main class, input path, and output path.

5. Initialization of JVM and Job ID Assignment

Upon submission, Hadoop internally starts a Java Virtual Machine (JVM) to run the job. The
system loads the JAR file using the classpath. The job is then assigned a unique Job ID (e.g.,
job_local_0001) for tracking purposes.

6. Task Creation

Hadoop divides the job into multiple smaller tasks:

 Map Tasks: Responsible for reading and processing the input data.

 Reduce Tasks: Handle the aggregation of intermediate results produced by the mappers.

7. Task Attempt Identification

Each map and reduce task is assigned a unique attempt ID. For example:

 Map Task: attempt_local_0001_m_000000_0

 Reduce Task: attempt_local_0001_r_000000_0

8. Execution of Tasks by JVM

Each task (both Map and Reduce) is executed by a separate instance of the JVM. The Map
function processes input key-value pairs, and the Reduce function aggregates intermediate
results based on keys.

9. Use of Counters
Hadoop maintains various built-in and custom counters to monitor:

 Number of records read

 Number of records written

 Number of tasks succeeded or failed

10. Input Processing

The input data is divided into logical splits. Each split is processed by a map task, which
transforms the input into intermediate key-value pairs.

11. Output Generation

The output from reduce tasks is stored in the specified output directory. Each reducer writes its
result into separate output files (e.g., part-r-00000).

12. Result Verification

The output data is examined to ensure correctness and to verify that the result matches the
expected output.

13. Job Completion

Once all map and reduce tasks are successfully executed, and the output is written, the
MapReduce job is marked as complete.
Apache Hadoop Ecosystem

Hive lets you run SQL-like queries on big data stored in Hadoop. Instead of writing complicated Java programs (like MapReduce), you can use simple Hive queries. It's good for data analysis and reporting.

Pig uses a special scripting language called Pig Latin to process and transform data. It's easier and quicker than writing full MapReduce programs. Mostly used by programmers who want simple data processing.

HBase is a NoSQL database that runs on top of HDFS. It stores data in a table format but doesn't use SQL. It's great when you need fast reading and writing of individual records (example: random access).

Sqoop is a tool used to import data from traditional databases (like MySQL, Oracle) into Hadoop and also export processed data back to these databases. It helps in moving structured data easily.

Flume is mainly used to collect and move huge volumes of real-time streaming data (like log files, social media feeds) into Hadoop for storage and later analysis.

Oozie is a workflow scheduler. It helps organize and schedule multiple Hadoop jobs (like MapReduce, Hive, Pig) into a sequence or workflow, running them in a set order automatically.

Zookeeper manages coordination and synchronization between all Hadoop services. It keeps track of configuration, manages distributed locks, and helps all parts of Hadoop work together smoothly.

Spark is a fast, in-memory data processing engine. It can process big data much faster than MapReduce because it keeps data in memory (RAM) instead of reading from disk again and again. It supports batch processing, real-time data processing, machine learning, and more.

Mahout provides machine learning algorithms like clustering, classification, and recommendation systems that can work on Hadoop. It helps to build smart applications (like product recommendations).

In simple words:

Tool – You use it when you want to...

Hive – Write SQL queries instead of coding programs.

Pig – Write simple scripts to process data.

HBase – Quickly read/write data (like a NoSQL database).

Sqoop – Move data between databases and Hadoop.

Flume – Bring live/streaming data into Hadoop.

Oozie – Schedule and manage multiple Hadoop tasks.

Zookeeper – Keep all Hadoop services working together correctly.

Spark – Process data faster than MapReduce (for batch and real-time).

Mahout – Build machine learning models on big data.


SCALING OUT

 What is Scaling Out?

o It's a method to add more computers (or machines) to your system to handle
more data.

o Instead of using just one computer, we use many to process data faster and
better.

o The data is stored across all machines using something called HDFS (Hadoop
Distributed File System).

o Hadoop then spreads the MapReduce work across all the machines.

3.4.1 Data Flow – Terminologies

What is a MapReduce job?

 A MapReduce job is a task that we give to Hadoop.

 It includes:

o Input data (what needs to be processed)

o The MapReduce program (how to process it)

o Configuration information (settings)

How does Hadoop handle the job?

 Hadoop splits the job into small tasks.

 There are two types of tasks:

1. Map tasks – they break the data into parts and process them.

2. Reduce tasks – they collect and combine the results from the map tasks.

Who manages the tasks?


There are two important managers:

1. Jobtracker

o Think of this like the boss.

o It decides what needs to be done and who should do it.

o It keeps track of the overall progress.

o If a task fails, the jobtracker gives it to someone else.

2. Tasktracker

o These are the workers.

o They actually do the map and reduce tasks.

o They report back to the jobtracker with updates.

Summary (in simple words):

 Hadoop works like a team.

 The jobtracker is the team leader.

 The tasktrackers are the team members.

 Together, they break big jobs into smaller ones and finish them quickly across multiple
computers.

MapReduce – How it Processes Data (Easy Explanation)

🔹 Step-by-Step Procedure:

1. Splitting the Input:

o Hadoop splits the input data into smaller parts called input splits.

o Each split gets a map task to process it.

🔹 Why Splits?

 Processing a split is faster than processing the whole input.

 If we run all splits at the same time (parallel processing), the job finishes much faster.
 Faster machines finish their splits earlier and take more splits – so the system works
efficiently.

🔸 More splits = better load balancing


🔸 But too many small splits = more overhead (more tasks to manage)

🔹 Good Split Size:

 A common choice is the HDFS block size (128 MB by default in recent Hadoop versions; 64 MB in older ones).

 You can change it based on your system’s setup (see the sketch below).
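
If you need a different split size for a particular job, FileInputFormat exposes setters for it. A minimal sketch, assuming the Job object from a driver like the one above (the sizes here are illustrative):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Inside the driver, after creating the Job:
// cap each split at 64 MB and never go below 32 MB (illustrative values).
FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);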

Data Locality Optimization:

 Hadoop tries to run map tasks close to where the data is stored in HDFS.

 This avoids using network bandwidth, which is slower.

There are 3 types of data placement:

1. Data-local: Map task and data on the same node.

2. Rack-local: Map task and data on different nodes but same rack.

3. Off-rack: Map task and data on different racks – slowest.

✅ Best: Data-local
⚠️ Okay: Rack-local
❌ Worst: Off-rack (slow due to network transfer)

Map and Reduce Task Flow:

🔹 What Happens After Mapping?

 Map tasks write their output to local disk (not HDFS).

 This is temporary output used by the reduce task.

🔹 Reduce Tasks:

 They collect outputs from all map tasks.

 These outputs are sent across the network to the node where reduce happens.
 Reduce tasks merge and process the data to generate final results.

🔹 Final Output:

 Reduce output is stored in HDFS for safety and reliability.

 Usually, one copy stays on the same node, and others are stored on different nodes for
backup.

🔹 Diagram Explanation (data flow with a single reduce task):

 Shows input splits (split 0, split 1, split 2)

 Each split goes to a map task

 Map outputs are collected by the reduce task

 Reduce task merges everything and saves it to HDFS (part 0)

🔹 Key Terms Simplified:

 Map Task = Takes small part of input and processes it

 Reduce Task = Takes all map outputs, combines them, and gives final result

 HDFS = Hadoop’s storage system

 Split = A small chunk of the big input file

 Rack = A group of computers in a network

MapReduce with Multiple Reduce Tasks

🔹 What happens when there are multiple reducers?

 After the map phase, the output is partitioned by keys.

 This process is called the shuffle phase.

 Each reducer gets its own partition of the data based on key values.

 When you increase the number of reduce tasks, Hadoop:


 Distributes the keys across multiple reducers.

 Each reducer works independently on its set of keys.

 This improves:

 Speed (parallel processing).

 Scalability (can handle large data).

 Fault tolerance (if one fails, others continue).
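
By default, Hadoop routes keys to reducers with a hash partitioner (HashPartitioner). If you want explicit control over the partitioning, a custom partitioner can be plugged in; the sketch below is illustrative only and mirrors what the default already does:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: decides which reducer receives each (year, temperature) pair.
// The default HashPartitioner behaves essentially the same way.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text year, IntWritable temperature, int numReduceTasks) {
        // Mask off the sign bit so the result is never negative.
        return (year.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(YearPartitioner.class), alongside job.setNumReduceTasks(n) to set how many reducers run.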

🔹 No Reduce Tasks?

 Sometimes, we don’t need reduce tasks.

 Map output can go directly to HDFS (no shuffling).

 This is useful when no aggregation or combining is needed.
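
In the driver this is just a request for zero reducers; a minimal sketch, again assuming the Job object from the driver example earlier:

// Map-only job: each mapper's output is written straight to HDFS
// as part-m-NNNNN files; the shuffle and sort steps are skipped.
job.setNumReduceTasks(0);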

Combiner Functions (Section 3.4.2)

🔹 Why use a Combiner?

 To reduce network usage between map and reduce stages.

 It runs after the map task, doing some local aggregation.

 Example: Summing or finding the maximum temperature before data is sent to the reducers.

Suppose two map tasks process records for the year 1950: the first map task produces the first three pairs below, and the second produces the last two:


(1950, 0)

(1950, 20)

(1950, 10)

(1950, 25)
(1950, 15)

Instead of sending all these to the reducer, we can use a combiner.

🔹 Combiner Function:

 Acts like a mini-reducer.

 Example: We want to find the maximum temperature for 1950.

 Each combiner locally finds the maximum of its own map task's output:

(1950, 20) ← from the first map task

(1950, 25) ← from the second map task

Then only one value per map task is sent to the reducer – saving bandwidth.

🔹 Final Reduce:

 The reducer still gets (1950, [20, 25]) and again finds the max.

 So result = (1950, 25).

🧠 Formula:

max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

✅ Summary:

 Combiner = mini-reducer

 Helps reduce data transfer between map and reduce tasks

 Especially useful when doing things like sum, max, count, etc.
 Not always used, and not guaranteed to run every time
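
Because max is commutative and associative, the reducer class itself can usually double as the combiner. A minimal sketch, reusing the hypothetical MaxTemperatureReducer and driver Job from the earlier examples:

// In the driver: run the same max logic locally on each map's output
// before anything is sent across the network to the reducers.
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);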

Why not all functions can be used as Combiners?

 Some operations (like mean) can't be split and recombined.

 Example:


mean(0, 20, 10, 25, 15) = 14 ✅

But:

mean(mean(0,20,10), mean(25,15)) = mean(10,20) = 15 ❌

 This means mean is not suitable for a combiner.

🔹 Combiner ≠ Reducer

 Combiner just pre-processes map output locally to reduce data transfer.

 The final result is always done by the reducer.

🧱 What is a Block in HDFS?


In HDFS, a block is the minimum unit of data storage. The default block size is 128 MB (clusters are often configured with larger blocks, such as 256 MB), which is far larger than traditional OS file-system blocks of around 4 KB. Files in HDFS are split into blocks and distributed across multiple nodes in the cluster.

✅ Benefits of Using Blocks

1. Supports Very Large Files

“A file can be larger than any single disk in the network...”

 Files can span multiple disks because they are split into blocks.

 These blocks don’t need to stay on the same machine or disk.


 It allows HDFS to store files much larger than what a single disk could hold by using the
whole cluster’s storage capacity.

📌 Example: A 500 MB file on a system with 128 MB block size will be split into 4 blocks:

 Block 1: 128 MB

 Block 2: 128 MB

 Block 3: 128 MB

 Block 4: 116 MB

These blocks can go to different machines in the cluster.

2. Simplifies Storage Subsystem

“Making the unit of abstraction a block rather than a file simplifies the storage subsystem...”

 Storage deals with fixed-size blocks, not variable-size files.

 Easy to manage, calculate, and allocate space.

 Reduces complexity of metadata handling since each block is a fixed size.

📌 Example: On a 1 TB disk with 128 MB block size, it's simple to know that ~8000 blocks can fit
on the disk.

3. Helps with Replication and Fault Tolerance

“Blocks fit well with replication for providing fault tolerance and availability...”

 Blocks are replicated, usually 3 copies, on different nodes.

 If one copy is lost due to disk failure, others can be used.

 Improves reliability and high availability of data.

📌 Example:

 Block A is stored on Node1, Node2, and Node3.

 If Node2 crashes, Block A is still available from Node1 or Node3.
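
The replication factor can also be changed per file through the HDFS Java API; a small illustrative sketch (the file path is hypothetical, and the cluster-wide default comes from the dfs.replication property):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // Connects to the default filesystem named in the cluster configuration.
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep 3 copies of this file's blocks (hypothetical path).
        fs.setReplication(new Path("/data/weather/1950.txt"), (short) 3);
        fs.close();
    }
}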


NameNode (Master)

 Role: Manages the filesystem namespace (structure of files and directories).

 Stores:

o Namespace image: Snapshot of the file system tree.

o Edit log: Record of recent changes.

 Knows: Which DataNodes hold the blocks of a file.

 Does NOT store block locations permanently — this info is re-collected from DataNodes
at system startup.

📌 DataNode (Slave)

 Role: Actual storage workhorse.

 Stores and retrieves file blocks as told by NameNode or clients.

 Reports: Periodically sends block information to the NameNode.

 Can't work alone – depends on NameNode for coordination.

NameNode Failure Protection

Hadoop provides two ways to handle NameNode failure:

1. Backups: Save the namespace image and edit log regularly.

2. Secondary NameNode (not a backup NameNode):

o Periodically merges the namespace image + edit log.

o Runs on a separate machine with high resources.

o Helps prevent the edit log from growing too large.
