Introducing Hadoop
• Hadoop is an open-source framework that divides large data into smaller parts and
processes them across multiple computers at the same time.
• It is part of the Apache Software Foundation and is widely used for Big
Data processing.
Big Data Problems: Companies like Google, Facebook, and Amazon generate
enormous volumes of data every day.
Slow Processing: A single computer cannot handle such huge data efficiently.
Storage Limitations: Traditional databases have limits on how much data they can
store.
Hadoop helps to store and process large-scale data efficiently and quickly.
HDFS (Hadoop Distributed File System) – Like Google Drive, it stores data across
multiple computers.
MapReduce – Like group work, it processes data in parallel and then combines
results.
YARN (Yet Another Resource Negotiator) – Like a manager, it assigns tasks to
different computers.
Why Hadoop?
Low cost: Hadoop is an open-source framework and runs on commodity hardware, so large
quantities of data can be stored cheaply.
Computing power: Hadoop's distributed computing model processes very large volumes of
data quickly; the more nodes you add, the more processing power you get.
Scalability:
This boils down to simply adding nodes as the system grows and requires much
less administration.
• RDBMS is not suitable for storing and processing large files, images, and
videos.
RDBMS vs Hadoop

Parameter  | RDBMS                                                | Hadoop
System     | Relational Database Management System                | Node-based flat structure
Processing | OLTP (Online Transaction Processing)                 | Analytical, Big Data processing
Processor  | Needs expensive hardware or high-end processors to store huge volumes of data | In a Hadoop cluster, a node requires only a processor, a network card, and a few hard drives
Choose Hadoop for large-scale batch processing, big data analytics, and cost-
effective distributed storage with HDFS.
Although there are several challenges with distributed computing, we will focus on
two major challenges.
Hardware Failure
In a distributed system, several servers are networked together, so there is always a
real possibility of hardware failure somewhere in the cluster. When such a failure does
happen, how does one retrieve the data that was stored on the failed machine? To put it
in perspective: a regular hard disk may fail once in about 3 years, but when you have
1000 such hard disks, a few of them are likely to be down on any given day.
Combining the Data Before Processing
In a distributed system, the data is spread across the network on several machines.
A key challenge here is to integrate the data available on several machines prior to
processing it.
History of Hadoop
The history of Hadoop is usually presented as a timeline of key events in Hadoop and
related technologies (the timeline figure is not reproduced here).
Hadoop Overview
Framework: Means that everything you need to develop and execute an application
is provided - programs, tools, etc.
Core Components
1. HDFS: The distributed storage layer of Hadoop.
2. MapReduce: The distributed, parallel data processing layer of Hadoop.
1. HIVE: A data warehouse tool that provides an SQL-like query language (HiveQL) on top of Hadoop.
2. PIG: A high-level data-flow language (Pig Latin) and execution framework for analyzing large data sets.
3. SQOOP: A tool for transferring data between relational databases (RDBMS) and HDFS.
4. HBASE: A column-oriented NoSQL database that runs on top of HDFS.
5. FLUME: A service for collecting and moving large amounts of log and streaming data into HDFS.
6. OOZIE: A workflow scheduler that manages and chains Hadoop jobs.
7. MAHOUT: A library of scalable machine learning algorithms that run on Hadoop.
MapReduce is the data processing layer; it processes data in parallel to extract richer
and more meaningful insights from the data.
1. Master HDFS (NameNode): Its main responsibility is partitioning the data storage across the
slave nodes. It also keeps track of the locations of data on the DataNodes.
2. Master MapReduce (JobTracker): It decides and schedules the computation tasks on the slave nodes.
ClickStream Data
ClickStream data (mouse clicks) helps you to understand the purchasing behavior of
customers. ClickStream analysis helps online marketers to optimize their product web
pages, promotional content, etc. to improve their business.
The ClickStream analysis, as shown in above figure, using Hadoop provides three key
benefits:
1. Hadoop helps to join ClickStream data with other data sources such as Customer
Relationship Management Data (Customer Demographics Data, Sales Data, and
Information on Advertising Campaigns). This additional data often provides the
much needed information to understand customer behavior.
2. Hadoop's scalability property helps you to store years of data without much
incremental cost. This helps you to perform temporal or year-over-year analysis on
ClickStream data which your competitors may miss.
3. Business analysts can use Apache Pig or Apache Hive for website analysis. With
these tools, you can organize ClickStream data by user session, refine it, and feed it
to visualization or analytics tools.
Key Points of HDFS
4. Optimized for high throughput (HDFS leverages large block sizes and moves
computation to where the data is stored).
7. You can realize the power of HDFS when you perform read or write on large files
(gigabytes and larger).
8. It operates above native file systems like ext3 and ext4, as illustrated in the
figure below. This abstraction enables additional functionality and flexibility in data
management.
The figure highlights key aspects of the Hadoop Distributed File System (HDFS). It
mentions that HDFS uses a block-structured file system, has a default replication
factor of 3 for fault tolerance, and a default block size of 64 MB for efficient
storage and processing.
The figure illustrates the Hadoop Distributed File System (HDFS) architecture,
showing how a file is stored and managed across multiple nodes.
Key Components:
Client Application
The client interacts with HDFS through the Hadoop File System Client to
read/write data.
NameNode
Manages file-to-block mapping (e.g., Sample.txt is split into Block A, Block B, and
Block C).
DataNodes
The figure shows three DataNodes (A, B, and C), each storing different copies of
Blocks A, B, and C based on the replication factor.
Working of HDFS:
The NameNode manages block locations but does not store actual data.
The Hadoop File System Client communicates with the NameNode to fetch block
locations and then retrieves data from DataNodes.
This design ensures high availability, fault tolerance, and parallel processing
for large-scale data applications.
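The block layout that the NameNode manages can also be inspected from the command line.
As an illustrative example (the file path is assumed), the fsck utility reports the blocks of
a file, their replication status, and the DataNodes that hold them:
hdfs fsck /sample/test.txt -files -blocks -locations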
HDFS Daemons
NameNode
• When NameNode starts up, it reads FsImage and EditLog from disk and
applies all transactions from the EditLog to in-memory representation of
the FsImage.
• Then it flushes out new version of FsImage on disk and truncates the
old EditLog because the changes are updated in the FsImage.
DataNode
Heartbeat Mechanism: Each DataNode sends a periodic heartbeat signal to the NameNode to
confirm that it is alive and functioning.
Data Replication: Each block is replicated on multiple DataNodes (default replication factor
of 3), so copies remain available even if a node fails.
The mechanism helps detect node failures and automatically redistribute data, ensuring
Hadoop's reliability and resilience.
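The replication factor is a per-file setting and can be changed after a file has been written.
As an illustrative example (the path is assumed), the following asks HDFS to keep two replicas
and waits until re-replication completes:
hadoop fs -setrep -w 2 /sample/test.txt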
Secondary NameNode
Anatomy of a File Read
1. The client opens the file that it wishes to read by calling open() on
the DistributedFileSystem.
4. Client calls read() repeatedly to stream the data from the DataNode.
6. When the client completes the reading of the file, it calls close() on
the FSDataInputStream to close the connection.
Anatomy of a File Write
6. When the client finishes writing the file, it calls close() on the stream.
7. This flushes all the remaining packets to the DataNode pipeline and waits
for relevant acknowledgments before communicating with the NameNode to
inform the client that the creation of the file is complete.
As per the Hadoop replica placement strategy, the first replica is placed on the same
node as the client. The second replica is placed on a node on a different rack. The
third replica is placed on the same rack as the second, but on a different node in that
rack. Once the replica locations have been set, a pipeline is
built. This strategy provides good reliability. Figure below describes the typical
replica pipeline.
Objective: To get the list of directories and files at the root of HDFS.
hadoop fs -ls /
hadoop fs -ls -R /
The command hadoop fs -ls -R / is used to recursively list all files and
directories in Hadoop Distributed File System (HDFS) starting from the root
(/). This command is useful for searching files, checking storage structure, and
debugging file locations in HDFS.
Objective: To copy a file from Hadoop file system to local file system via
copyToLocal command.
hadoop fs -copyToLocal: Copies a file from HDFS to the local file system.
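For example (the local destination directory is assumed), the following copies test.txt
from HDFS to the local file system:
hadoop fs -copyToLocal /sample/test.txt /home/hadoop/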
hadoop fs -cat /sample/test.txt
Displays the contents of /sample/test.txt (stored in HDFS) on the terminal.
Hadoop Distributed File System and MapReduce Framework run on the same
set of nodes. This configuration allows effective scheduling of tasks on the
nodes where data is present (Data Locality). This in turn results in very high
throughput.
MapReduce applications implement the map and reduce functions and specify the
input/output locations using suitable interfaces to construct the job. The
application and the job parameters together are known as the job configuration.
The Hadoop job client submits the job (jar/executable, etc.) and its configuration
to the JobTracker, which is then responsible for scheduling tasks on the
slaves. In addition to scheduling, it also monitors the task and provides status
information to the job-client.
MapReduce Daemons: JobTracker (the master, one per cluster) and TaskTracker (the slave, one per DataNode).
The above figure shows how the MapReduce programming model works, which is
fundamental to Hadoop's data processing framework. Here is a step-by-step
explanation of the workflow illustrated in the figure:
1. First, the input dataset is split into multiple pieces of data (several small
subsets).
2. Next, the framework creates a master and several workers processes and
executes the worker processes remotely.
3. Several map tasks work simultaneously, each reading the piece of data assigned to
it. Each map worker applies the map function to the data present on its server and
generates key/value pairs for the extracted data.
4. Map worker uses partitioner function to divide the data into regions. Partitioner
decides which reducer should get the output of the specified mapper.
5. When the map workers complete their work, the master instructs the reduce
workers to begin their work. The reduce workers in turn contact the map workers
to get the key/value data for their partition. The data thus received is shuffled and
sorted as per keys.
6. Each reduce worker then calls the reduce function for every unique key. This
function writes its output to the output file.
7. When all the reduce workers complete their work, the master transfers the
control to the user program.
MapReduce Example
The famous example for MapReduce Programming is Word Count. For example,
consider you need to count the occurrences of similar words across 50 files. You
can achieve this using MapReduce programming, as shown in the figure below.
1. Driver Class: This class configures the job (Mapper, Reducer, input/output paths,
and key/value types) and submits it.
2. Mapper Class: This class overrides the Map Function based on the problem
statement.
3. Reducer Class: This class overrides the Reduce Function based on the problem
statement.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
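The class bodies themselves are not reproduced above. A minimal sketch of how the word-count
Mapper and Reducer could be written with the org.apache.hadoop.mapreduce API is shown below;
the class names and the whitespace-based tokenization are illustrative choices, and the Mapper
additionally assumes an import of org.apache.hadoop.io.LongWritable.

// WordCountMapper.java - emits (word, 1) for every token in each input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// WordCountReducer.java - sums the counts received for each word
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}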
Hadoop 2: YARN
In Hadoop 2, YARN lets multiple applications share a common resource management layer.
Now Hadoop can be used for workloads beyond batch MapReduce, such as interactive,
streaming, and graph processing.
HDFS Limitation
NameNode saves all its file metadata in main memory. Although the main memory
today is not as small and as expensive as it used to be two decades ago, still there
is a limit on the number of objects that one can have in the memory on a single
NameNode. The NameNode can quickly become overwhelmed as the load on the
system increases.
In Hadoop 2.x, this is resolved with the help of HDFS Federation.
Hadoop 2: HDFS
HDFS 2 consists of two major components: (a) namespace, (b) blocks storage
service. Namespace service takes care of file-related operations, such as creating
files, modifying files, and creating directories. The block storage service handles
DataNode cluster management and replication.
HDFS 2 Features
1. Horizontal scalability.
2. High availability.
HDFS Federation uses multiple independent NameNodes for horizontal scalability.
NameNodes are independent of each other; they do not need to coordinate with one
another. The DataNodes are common storage for blocks and are shared by all
NameNodes. All DataNodes in the cluster register with each NameNode in the cluster.
High availability of NameNode is obtained with the help of Passive Standby
NameNode. In Hadoop 2.x, Active-Passive NameNode handles failover
automatically. All namespace edits are recorded to a shared NFS storage and there
is a single writer at any point of time. Passive NameNode reads edits from shared
storage and keeps updated metadata information. In case of Active NameNode
failure, Passive NameNode becomes an Active NameNode automatically. Then it
starts writing to the shared storage. Figure below describes the Active-Passive
NameNode interaction.
Fundamental Idea
The fundamental idea behind this architecture is splitting the Job Tracker
responsibility of resource management and Job Scheduling/Monitoring into
separate daemons. Daemons that are part of YARN Architecture are described
below.
1. A Global ResourceManager: Its main responsibility is to distribute resources
among various applications in the system. It has two main components:
(a) Scheduler: The pluggable scheduler of ResourceManager decides allocation of
resources to various running applications. The scheduler is just that, a pure
scheduler, meaning it does NOT monitor or track the status of the application.
(b) ApplicationManager: ApplicationManager does the following:
• Accepting job submissions.
• Negotiating resources (container) for executing the application-specific ApplicationMaster.
• Restarting the ApplicationMaster in case of failure.
2. NodeManager: This is a per-machine slave daemon. NodeManager responsibility
is launching the application containers for application execution. NodeManager
monitors the resource usage such as memory, CPU, disk, network, etc. It then
reports the usage of resources to the global ResourceManager.
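On a running YARN cluster, the NodeManagers that have registered with the ResourceManager and
the applications currently running can be listed from the command line (output varies by cluster):
yarn node -list
yarn application -list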
Basic Concepts
Application:
1. Application is a job submitted to the framework.
2. Example - MapReduce Job.
Container:
1. Basic unit of allocation.
2. Fine-grained resource allocation across multiple resource types (Memory, CPU,
disk, network, etc.)
(a) container_0 = 2 GB, 1 CPU
(b) container_1 = 1GB, 6 CPU
3. Replaces the fixed map/reduce slots.
YARN ARCHITECTURE
The figure below shows YARN architecture.
The reduce tasks consume the output of the map tasks to generate the final output.
Each map task is broken into the following phases:
1. RecordReader.
2. Mapper.
3. Combiner.
4. Partitioner.
The output produced by map task is known as intermediate keys and values. These
intermediate keys and values are sent to reducer. The reduce tasks are broken into
the following phases:
1. Shuffle.
2. Sort.
3. Reducer.
4. Output Format.
Hadoop assigns map tasks to the DataNode where the actual data to be processed
resides. This ensures data locality. Data locality means that data is not moved over
network; only computational code is moved to process data which saves network
bandwidth.
Mapper
A mapper maps the input key-value pairs into a set of intermediate key-value pairs.
Maps are individual tasks that have the responsibility of transforming input records
into intermediate key-value pairs.
1. RecordReader: RecordReader converts a byte-oriented view of the input (as
generated by the InputSplit) into a record-oriented view and presents it to the
Mapper tasks. It presents the tasks with keys and values. Generally the key is the
positional information and value is a chunk of data that constitutes the record.
2. Map: The map function works on the key-value pair produced by RecordReader and
generates zero or more intermediate key-value pairs. What the intermediate key and
value are depends on the problem being solved.
Reducer
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that
share a common key) to a smaller set of values. The Reducer has three primary phases:
Shuffle and Sort, Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and
downloads them into the local machine where the reducer is running. Then these
individual data pipes are sorted by keys which produce larger data list. The main
purpose of this sort is grouping similar words so that their values can be easily
iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort
phase, applies reduce function, and processes one group at a time. The reduce
function iterates all the values associated with that key. Reducer function provides
various operations such as aggregation, filtering, and combining data. Once it is
done, the output (zero or more key-value pairs) of reducer is sent to the output
format.
3. Output Format: The output format separates key-value pair with tab (default)
and writes it out to a file using record writer.
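The default tab separator can be overridden through the job configuration when TextOutputFormat is
used. A minimal sketch, assuming a Configuration object named conf (the property is
mapreduce.output.textoutputformat.separator in Hadoop 2.x; older releases use
mapred.textoutputformat.separator):
conf.set("mapreduce.output.textoutputformat.separator", ",");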
Figure describes the chores of Mapper, Combiner, Partitioner, and Reducer for the word
count problem.
The Word Count problem has been discussed under "Combiner" and "Partitioner".
Combiner
It is an optimization technique for MapReduce Job. Generally, the reducer class is
set to be the combiner class. The difference between combiner class and reducer
class is as follows:
1. Output generated by combiner is intermediate data and it is passed to the
reducer.
2. Output of the reducer is passed to the output file on disk.
The worked examples that follow are organized as follows:
Objective: What is it that we are trying to achieve here?
Input Data: What is the input that has been given to us to act upon?
Act: How do we act on the input (the MapReduce job that is run)?
Output: What is the result produced?
/mapreducedemos/output/wordcount/part-r-00000
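In the driver, using the reducer as the combiner is a single call. A minimal sketch, assuming the
word-count reducer class is named WordCountReducer as in the earlier sketch:
job.setCombinerClass(WordCountReducer.class);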
Partitioner
• The partitioning phase happens after the map phase and before the
reduce phase.
• The number of partitions equals the number of reducers.
• The default partitioner in Hadoop is the hash partitioner, but custom
partitioners can be implemented.
Objective of the Exercise
• Implement a MapReduce program to count word occurrences.
• Use a custom partitioner to divide words based on their starting
alphabet.
• This ensures that words beginning with the same letter are sent to the
same reducer.
Input Data Example
• Welcome to Hadoop Session
• Introduction to Hadoop
• Introducing Hive
• Hive Session
• Pig Session
Custom Partitioner (WordCountPartitioner):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        int partitionNumber = 0; // partition 0 is used for words that do not start with A-Z
        char alphabet = key.toString().toUpperCase().charAt(0);
        switch (alphabet) {
            case 'H': partitionNumber = 8;  break; // e.g. "Hadoop", "Hive"
            case 'W': partitionNumber = 23; break; // e.g. "Welcome"
            // ...one case per letter, from 'A' = 1 up to 'Z' = 26...
        }
        return partitionNumber;
    }
}
In the driver, the number of reduce tasks (27: one for partition 0 plus one per letter) and the
custom partitioner are then set:
job.setNumReduceTasks(27);
job.setPartitionerClass(WordCountPartitioner.class);
FileOutputFormat.setOutputPath(job, new Path("/mapreducedemos/output/wordcountpartitioner/"));
The input is tokenized into the following words:
• Welcome
• to
• Hadoop
• Session
• Introduction
• to
• Hadoop
• Introducing
• Hive
• Hive
• Session
• Pig
• Session
After mapping, each word gets a count of 1 (words emitted by the mapper):
• Welcome → 1
• to → 1
• Hadoop → 1
• Session → 1
• Introduction → 1
• to → 1
• Hadoop → 1
• Introducing → 1
• Hive → 1
• Hive → 1
• Session → 1
• Pig → 1
• Session → 1
Word            First Letter        Partition Number
Welcome W 23
to T 20
Hadoop H 8
Session S 19
Introduction I 9
Introducing I 9
Hive H 8
Pig P 16
Each reducer generates an output file with words starting with a specific letter. Taken
together, the reducer outputs contain the following counts:
Hadoop 2
Hive 2
Introduction 1
Introducing 1
Pig 1
Session 3
to 2
Welcome 1
/mapreducedemos/output/wordcountpartitioner/
Searching
Objective
To search for a given keyword (here, "Jack") in a file stored in HDFS. The program will
output the lines containing the keyword along with the file name and position.
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
This configures the Hadoop job and specifies the Mapper, Reducer, and
input/output paths.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordSearcher {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "Word Search");
job.setJarByClass(WordSearcher.class);
job.setMapperClass(WordSearchMapper.class);
job.setReducerClass(WordSearchReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.getConfiguration().set("keyword", "Jack");
job.setNumReduceTasks(1);
    FileInputFormat.addInputPath(job, new Path("/mapreduce/student.csv"));     // input path, matching the sort example below
    FileOutputFormat.setOutputPath(job, new Path("/mapreduce/output/search/")); // output location shown later
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
• Configures the job, sets the keyword, and specifies input and output paths.
• Uses only one reducer for simplicity.
The Mapper scans each line, searches for the keyword, and outputs matching
lines.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class WordSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
  static String keyword;
  static int pos = 0;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration configuration = context.getConfiguration();
    keyword = configuration.get("keyword"); // the keyword ("Jack") set in the driver
  }

  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    pos++; // running position counter, incremented for every record seen
    if (value.toString().contains(keyword)) {
      int wordPos = pos;
      context.write(value, new Text(fileName + "," + wordPos));
    }
  }
}
The Reducer simply writes the filtered results from the Mapper.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordSearchReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(key, value); // pass each matching line straight through
    }
  }
}
The output is stored in HDFS at:
/mapreduce/output/search/part-r-00000
Content of part-r-00000:
1002,Jack,39 student.csv,5
• The program correctly identifies and outputs the row that contains "Jack".
Sorting
Objective: To sort the student records by name. The input file (student.csv) is:
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SortStudNames {

  // Mapper Class
  public static class SortMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] token = value.toString().split(",");
      context.write(new Text(token[1]), new Text(token[0] + "," + token[2])); // Key: Name, Value: ID,Score
    }
  }

  // Reducer Class
  public static class SortReducer extends Reducer<Text, Text, NullWritable, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text details : values) {
        context.write(NullWritable.get(), new Text(key.toString() + "," + details.toString()));
      }
    }
  }
// Driver Class
public static void main(String[] args) throws IOException, InterruptedException,
ClassNotFoundException {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Sort Students by Name");
job.setJarByClass(SortStudNames.class);
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from the command line
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path from the command line
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Output:
Alex,1003,44
Bob,1005,33
Jack,1002,39
John,1001,45
Smith,1004,38
hadoop jar StudentSort.jar SortStudNames /mapreduce/student.csv /mapreduce/output
COMPRESSION
In MapReduce programming, you can compress the MapReduce output file.
Compression provides two benefits as follows:
1. Reduces the space to store files.
2. Speeds up data transfer across the network.
You can specify compression format in the Driver Program as shown below:
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Here, codec is the implementation of a compression and decompression algorithm.
GzipCodec is the compression algorithm for gzip. This compresses the output file.
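With the newer org.apache.hadoop.mapreduce API, the same effect can be achieved through
FileOutputFormat in the driver. A minimal sketch, assuming a Job object named job and imports of
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat and org.apache.hadoop.io.compress.GzipCodec:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);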
****END****