Big Data and Hadoop
Course Objectives:
The main objective of the course is to process Big Data with advanced architectures such as Spark.
Course Outcomes:
Develop MapReduce programs to analyze large datasets using Hadoop and Spark
Write Hive queries to analyze large datasets
Outline the Spark ecosystem and its components
Perform the filter, count, distinct, map, and flatMap RDD operations in Spark
Make use of Sqoop to import and export data from Hadoop to a database and vice versa
List of Experiments:
(ii) Knowing the difference between single-node clusters and multi-node clusters
(iv) Installing and accessing environments such as Hive and Sqoop
(xii) Zipping and unzipping files with & without permissions and pasting them to a location
4. MapReduce
(iii) Word-count program to understand MapReduce (Mapper phase, Reducer phase, Driver code)
6. Compute Average Salary and Total Salary by Gender for an Enterprise
(ii) Loading data into external Hive tables from SQL tables (or structured CSV files) using Sqoop
8. Create SQL tables for employees: an Employee table (id, designation) and a Salary table (salary, dept id). Create external tables in Hive with a similar schema to the above tables, move the data to Hive using Sqoop and load the contents into the tables, filter into a new table, and write a UDF to encrypt the table with the AES algorithm; decrypt it with the key to show the contents
9. (i) PySpark definition (Apache PySpark) and the difference between PySpark, Scala, and pandas
(ii) Removing the words that are not necessary to analyze the text
(iii) groupBy
(iv) Calculating how many times each word occurs in the corpus
(v) Performing a task (say, counting the words 'spark' and 'apache' in rdd3) separately on each partition and getting the output of the task performed on those partitions
(ii) Using SparkConf, create a SparkSession to read details from a CSV file into a DataFrame and …
TEXT BOOKS:
WEB LINKS:
1. https://infyspringboard.onwingspan.com/web/en/app/toc/lex_auth_01330150584451891225182_shared/overview
2. https://infyspringboard.onwingspan.com/web/en/app/toc/lex_auth_01258388119638835242_shared/overview
3. https://infyspringboard.onwingspan.com/web/en/app/toc/lex_auth_0126052684230082561692_shared/overview
1. To Study Big Data Analytics and Hadoop Architecture
Big data architecture is specifically designed to manage data ingestion, data processing, and analysis of data that is too large or complex for conventional relational databases to store, process, and manage. The solution is to organize the technology into a big data architecture that can manage and process such data.
The following figure shows a big data architecture as a sequential arrangement of components, where the output of one component serves as the input to the next until the processed data is produced.
Most big data architectures include some or all of the following components:
Data sources. All big data solutions start with one or more data sources. Examples include:
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
Data storage. Data for batch processing operations is typically stored in a distributed file store that can
hold high volumes of large files in various formats. This kind of store is often called a data lake. Options
for implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
Batch processing. Because the data sets are so large, often a big data solution must process data files using
long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs
involve reading source files, processing them, and writing the output to new files. Options include running
U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight
Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
Real-time message ingestion. If the solution includes real-time sources, the architecture must include a
way to capture and store real-time messages for stream processing. This might be a simple data store,
where incoming messages are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics. This portion of a streaming architecture is often referred to
as stream buffering. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
Stream processing. After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an
output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually
running SQL queries that operate on unbounded streams. You can also use open source Apache streaming
technologies like Spark Streaming in an HDInsight cluster.
Analytical data store. Many big data solutions prepare data for analysis and then serve the processed data
in a structured format that can be queried using analytical tools. The analytical data store used to serve
these queries can be a Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data
files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can
also be used to serve data for analysis.
Analysis and reporting. The goal of most big data solutions is to provide insights into the data through
analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling
layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might
also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or
Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data
scientists or data analysts. For these scenarios, many Azure services support analytical notebooks, such as
Jupyter, enabling these users to use their existing skills with Python or Microsoft R. For large-scale data
exploration, you can use Microsoft R Server, either standalone or with Spark.
Orchestration. Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, that transform source data, move data between multiple sources and sinks, load the processed
data into an analytical data store, or push the results straight to a report or dashboard. To automate these
workflows, you can use an orchestration technology such as Azure Data Factory, or Apache Oozie with Sqoop.
Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to store and process big data. Hadoop works on the MapReduce programming model that was introduced by Google. Today, many big-brand companies use Hadoop in their organizations to deal with big data, e.g., Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of four components.
The Hadoop architecture uses several core components for parallel processing of large data volumes:
The Hadoop Distributed File System (HDFS) stores large data volumes by dividing them into blocks. HDFS is designed for fault tolerance and ensures high availability by replicating data across multiple nodes in a Hadoop cluster, allowing data to be processed and analyzed efficiently in parallel.
HDFS has three components – 1. NameNode, 2. Secondary NameNode, and 3. DataNodes (slave nodes).
1. NameNode, which is the master server containing all metadata information such as block size, location,
etc. It is the NameNode that permits the user to read or write a file in HDFS. The NameNode holds all this
metadata information on the various DataNodes. There can be several of these DataNodes retrieving blocks
and sending block reports to the NameNode.
2. Secondary NameNode, which periodically merges the NameNode's edit log with the file-system image (checkpointing) and keeps a copy of this metadata on disk so the NameNode can be restored if it fails. It is not a hot standby; in a high-availability cluster that role is played by a separate Standby NameNode.
3. DataNodes (slave nodes), which store the actual data as blocks.
YARN (Yet Another Resource Negotiator) is the cluster resource management layer in the Hadoop architecture. It is responsible for job scheduling and for managing the cluster. YARN handles task distribution, job prioritization, dependency management, and other aspects across the Hadoop cluster for optimum processing efficiency. It allows multi-tenancy, supports easy scalability, and optimizes cluster utilization.
YARN resides as a middle layer between HDFS and MapReduce in the Hadoop architecture. It has three core
elements – 1. ResourceManager 2. ApplicationMaster and 3. NodeManagers
1. YARN ResourceManager is the sole authority for resource allocation and tracking of resources in the
cluster. It features two main components – the Scheduler, which schedules resources for various
applications, and the Application Manager, which accepts job submissions and monitors the application
clusters.
2. YARN ApplicationMaster handles the resource-management side of an individual application, fulfilling its resource requirements through interactions with the Scheduler.
3. YARN NodeManagers run on the worker nodes, track the jobs, and monitor resource utilization in the containers (bundles of RAM and CPU) that they launch.
Hadoop uses the MapReduce programming model for parallel processing of large datasets. It is a fundamental
component in the Hadoop ecosystem for big data analytics.
MapReduce consists of two main phases: the Map Phase and the Reduce Phase.
In the Map Phase, input data is divided into smaller chunks and processed in parallel across multiple nodes in a
distributed computing environment. The input data is typically represented as key-value pairs.
In the Reduce Phase, the results from the Map phase are aggregated by key to produce the final output.
Hadoop Common or Common Utilities in the Hadoop Architecture
This crucial component of the Hadoop architecture ensures the proper functioning of Hadoop modules by providing
shared libraries and utilities. Hadoop Common contains the Java Archive (JAR) files and scripts required to start
Hadoop.
Data Storage and Scalability: The Hadoop Distributed File System’s ability to store and process vast data
volumes at speed is its biggest strength. As data grows, organizations can scale their Hadoop clusters easily
by adding more nodes for increased storage capacity and processing power.
Batch and Real-time Data Processing: Hadoop's MapReduce module supports batch processing, and real-time stream processing becomes possible when Hadoop is integrated with frameworks like Apache Spark. This versatility allows organizations to address a wide range of advanced-analytics use cases.
Cost-Effectiveness: Hadoop is designed to run on commodity hardware, which is more cost-effective than
investing in high-end, specialized hardware. This makes it an attractive option for organizations looking to
manage large datasets without incurring substantial infrastructure costs.
Data Locality and Data Integrity: Hadoop processes data on the same node where it is stored, minimizing
data movement across the network. This approach enhances performance by reducing latency and
improving overall efficiency. Hadoop minimizes data loss and ensures data integrity through duplication on
multiple nodes.
Community Support: Hadoop users enjoy a large open-source community for continuous updates,
improvement, and collaboration. Hadoop also offers a rich repository of documentation and resources.
These are some of the many advantages that Hadoop architecture provides to its users. Having said that, the Hadoop
architecture does present some limitations such as security management complexities, vulnerability to cyber threats
due to Java, and challenges in handling small datasets, among others. This often prompts organizations to seek
modern cloud-based alternatives such as Databricks, Snowflake, or the Azure suite of tools.
First, verify that Java is installed:
java -version
Then switch to the dedicated hadoop user:
su - hadoop
Go to the Apache Hadoop downloads page and copy the link for the latest stable version of Hadoop.
wget https://downloads.apache.org/hadoop/common/hadoop-3.x.x/hadoop-3.x.x.tar.gz
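The exact archive name depends on the version you downloaded. A typical next step (a sketch, assuming the /usr/local/hadoop path used for HADOOP_HOME below and that the hadoop user has sudo rights) is to extract the archive and move it into place:
tar -xzf hadoop-3.x.x.tar.gz
sudo mv hadoop-3.x.x /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop
With the files in place, add the Hadoop environment variables to ~/.bashrc: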
nano ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source ~/.bashrc
nano $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoop_data/hdfs/datanode</value>
</property>
</configuration>
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>localhost:8033</value>
</property>
</configuration>
Format the HDFS NameNode (needed only before the first start):
hdfs namenode -format
Then start the HDFS and YARN daemons:
start-dfs.sh
start-yarn.sh
You can test the setup by running a simple word count job on input files uploaded to HDFS.
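A typical test sequence (a sketch: sample.txt stands for any local text file, and the examples JAR ships with Hadoop under $HADOOP_HOME/share/hadoop/mapreduce) is:
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put sample.txt /user/hadoop/input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output
hdfs dfs -cat /user/hadoop/output/part-r-00000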
(ii) Knowing the difference between single-node clusters and multi-node clusters.
The key difference between a single-node and multi-node Hadoop cluster lies in the number of machines (or nodes)
involved in the setup and the distribution of resources and workloads across those machines.
In a Hadoop cluster, both the NameNode and ResourceManager provide Web UIs that allow you to monitor and
manage the cluster's health, performance, and resources. The web UIs are accessible via specific port numbers on
the system where the Hadoop cluster is running.
Here's an overview of the Hadoop Web UIs and their corresponding ports:
1. Hadoop NameNode Web UI
Purpose: The NameNode Web UI allows you to monitor the status of the HDFS (Hadoop Distributed File
System), check the health of the cluster, view the status of data nodes, and explore the file system.
Default Port: 50070 (Hadoop 1.x and 2.x) or 9870 (Hadoop 3.x)
o URL: http://<namenode_host>:50070
o For a single-node setup, it will be: http://localhost:50070 (or http://localhost:9870 on Hadoop 3.x)
Key Features:
File System Status: Shows the total capacity, used space, and remaining space in the HDFS.
Datanode Health: Displays information about the DataNodes, including their health, storage, and data
block replication status.
HDFS Browser: Allows you to browse files stored in HDFS, upload files, and perform other file operations.
2. Hadoop ResourceManager Web UI
Purpose: The ResourceManager Web UI allows you to monitor and manage YARN (Yet Another Resource Negotiator), which is responsible for job scheduling and resource management across the cluster.
Default Port: 8088
o URL: http://<resourcemanager_host>:8088
o For a single-node setup, it will be: http://localhost:8088
Key Features:
Cluster Summary: Displays the total number of nodes, available resources, and running applications.
Applications: Shows the status of running and completed applications, such as MapReduce jobs, Spark
jobs, etc.
Resource Utilization: Displays the resource utilization (memory and CPU) across the cluster.
3. MapReduce JobHistory Server Web UI
Purpose: This Web UI allows you to monitor and view details about completed MapReduce jobs, including job execution times, tasks, and logs.
Default Port: 19888
o URL: http://<history_server_host>:19888
o For a single-node setup, it will be: http://localhost:19888
Key Features:
Job History: Allows you to view the details of past MapReduce jobs.
Job Logs: Provides access to job logs, helping you troubleshoot and debug failed jobs.
4. Hadoop NodeManager Web UI
Purpose: The NodeManager Web UI shows the status of the NodeManager and its resource usage (memory, CPU, etc.) for the node where it's running.
Default Port: 8042
o URL: http://<nodemanager_host>:8042
o For a single-node setup, it will be: http://localhost:8042
Key Features:
Node Health: Shows the health and status of the NodeManager on a particular machine.
Resource Usage: Displays memory, CPU, and disk usage for the node.
Container Details: Shows information about running containers on the node.
If you're using a single-node Hadoop cluster on your local machine (Ubuntu), you can access the Web UIs by navigating to the URLs listed above in your browser: http://localhost:50070 (or 9870), http://localhost:8088, http://localhost:19888, and http://localhost:8042.
Troubleshooting:
Port Conflict: Ensure no other services are occupying the default Hadoop ports (e.g., 50070/9870, 8088, 19888, 8042).
Permissions: Check if the user running the Hadoop services has the appropriate permissions to access and
open these ports.
(iv) Installing and accessing environments such as Hive and Sqoop
Both Hive and Sqoop are popular tools in the Hadoop ecosystem for managing and querying large datasets stored in
HDFS and integrating with external databases, respectively. Here's how to install and access these tools in a Hadoop
environment.
Installing Hive
Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and
analysis. It allows you to query data stored in HDFS using a SQL-like interface.
1. Download Hive
First, download the latest version of Hive from the official Apache Hive website: Apache Hive Download.
wget https://downloads.apache.org/hive/hive-3.x.x/apache-hive-3.x.x-bin.tar.gz
You need to configure environment variables for Hive, so edit the .bashrc file to include the following lines:
nano ~/.bashrc
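For example, assuming Hive was extracted to /usr/local/hive (the same path used for the metastore database below), the lines to add are:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin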
source ~/.bashrc
4. Configure Hive
You need to configure Hive by editing the hive-site.xml file. First, navigate to the configuration directory:
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
nano hive-site.xml
Add the following properties to configure Hive to use Derby (default) as the metastore database:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=/usr/local/hive/metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>sa</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value></value>
</property>
</configuration>
You need to initialize the metastore database. Run the following command to do this:
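With the Derby metastore configured above, the initialization is typically done with Hive's schematool utility:
schematool -dbType derby -initSchema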
After configuring Hive, you can start the Hive shell using:
hive
This will open the Hive command-line interface (CLI). You can start running queries using HiveQL (Hive Query
Language) just like SQL.
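For instance, a few HiveQL statements (using a hypothetical employees table and CSV file for illustration) look almost identical to SQL:
CREATE TABLE employees (id INT, name STRING, salary FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/hadoop/employees.csv' INTO TABLE employees;
SELECT COUNT(*) FROM employees;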
Installing Sqoop
Sqoop is a tool used to transfer data between Hadoop and relational databases (e.g., MySQL, PostgreSQL, Oracle).
1. Download Sqoop
Go to the Apache Sqoop Download page to download the latest version. Or, you can use the following command to
download:
wget https://downloads.apache.org/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
nano ~/.bashrc
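Add the Sqoop environment variables, for example (assuming Sqoop was extracted to /usr/local/sqoop):
export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
Then reload the shell configuration with source ~/.bashrc.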
Sqoop requires JDBC connectors to connect to databases like MySQL. If you're connecting to MySQL, download the MySQL JDBC driver:
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.49.tar.gz
Extract the archive and copy the mysql-connector-java JAR file into $SQOOP_HOME/lib so that Sqoop can find the driver.
You can test the Sqoop installation by running the following command to check the available options:
sqoop help
To import data from a MySQL database to HDFS, you can use a command like this:
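A typical import command (the database name, credentials, and table below are placeholders for your own) is:
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root \
  --password yourpassword \
  --table employees \
  --target-dir /user/hadoop/employees \
  -m 1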
Accessing Hive:
To access the Hive CLI, simply run the hive command from the terminal:
hive
You can also access Hive using Beeline, a JDBC client for Hive:
beeline -u jdbc:hive2://localhost:10000
Accessing Sqoop:
To run a Sqoop job or access Sqoop commands, use the sqoop command followed by the appropriate
subcommands.
If you're using HiveServer2 (for remote clients like Beeline or JDBC), you can configure and start
HiveServer2 to allow connections via a JDBC client. It runs on port 10000 by default.
With Hive and Sqoop installed and configured, you can now perform SQL-like queries over Hadoop with Hive, and
transfer data between relational databases and HDFS using Sqoop. To access them, use their respective command-
line interfaces (hive for Hive and sqoop for Sqoop). If you need a more interactive interface, you can also
configure HiveServer2 for remote access through JDBC.
(xii) Zipping and unzipping files with & without permissions and pasting them to a location
Hadoop Distributed File System (HDFS) allows you to store large files across multiple machines, but managing files
within it is different from managing files in the local filesystem. You can use Hadoop’s hdfs dfs command to
interact with HDFS.
To create a directory in HDFS, you need to use the -mkdir command with hdfs dfs.
Steps to Create a Directory in HDFS
1. Start the Hadoop services (if they are not already running): start-dfs.sh
2. Create the directory:
o hdfs dfs -mkdir /user/hadoop/mydir
o Here, /user/hadoop/mydir is the path in HDFS where you want to create the directory. You can replace this path with the desired location.
3. Verify the Directory Creation:
o To confirm the directory was created, use the -ls command to list the contents of the directory:
o hdfs dfs -ls /user/hadoop
o This will list all the files and directories within /user/hadoop, and you should see your newly
created mydir listed.
Additional Options:
You can also create nested directories in a single command by using the -p option, which will create the
entire path if it doesn’t already exist:
hdfs dfs -mkdir -p /user/hadoop/dir1/dir2
In addition to HDFS commands, here are some basic Linux commands that can be useful for file management on
your local filesystem:
1. Creating a Directory:
o To create a directory in Linux, you use the mkdir command:
o mkdir mydirectory
2. Listing Files:
o To list files in the current directory:
o ls
o To list files with detailed information (permissions, size, etc.):
o ls -l
3. Changing Directory:
o To navigate into a directory:
o cd mydirectory
o To go up one level (to the parent directory):
o cd ..
4. Removing a Directory:
o To remove an empty directory:
o rmdir mydirectory
o To remove a non-empty directory (with its contents):
o rm -r mydirectory
7. Removing Files:
o To remove a file:
o rm file.txt
To create a directory in HDFS, you use the hdfs dfs -mkdir command, specifying the path where you want
the directory. Remember to ensure your Hadoop services are running, and you can verify the directory's creation
with hdfs dfs -ls. Additionally, understanding basic Linux commands such as mkdir, cd, ls, and rm can be
very helpful in managing files and directories on your local system.
Let’s walk through the specific tasks related to file management in both HDFS and Linux, covering your requested
operations.
In HDFS and Linux, moving between directories is done with the cd command.
In Linux:
o Go to a directory: cd /path/to/directory
o Go back to the previous directory: cd -
o Go up one level: cd ..
In HDFS:
o Move into a directory in HDFS: Use hdfs dfs -ls to check contents.
hdfs dfs -ls /user/hadoop
o No direct cd in HDFS, but you can view contents of directories with hdfs dfs -ls.
In Linux:
o To list the files in the current directory:
o ls
o To get a detailed listing with permissions, sizes, and timestamps:
o ls -l
o To include hidden files (those starting with ., e.g., .bashrc):
o ls -a
In HDFS:
o To list the contents of a directory in HDFS:
o hdfs dfs -ls /path/to/directory
o To recursively list all files:
o hdfs dfs -ls -R /path/to/directory
In Linux:
o To view the contents of a file:
o cat filename.txt
o To view the contents with pagination:
o less filename.txt
o To view the first few lines of a file:
o head filename.txt
In HDFS:
o To view the contents of a file in HDFS:
o hdfs dfs -cat /path/to/file
In Linux:
o Copying files:
o cp source_file destination_file
o Moving files (also renaming):
o mv source_file destination_file
In HDFS:
o Copying files in HDFS:
o hdfs dfs -cp /hdfs/source_file /hdfs/destination_file
o Moving files in HDFS:
o hdfs dfs -mv /hdfs/source_file /hdfs/destination_file
Moving a file from local to HDFS (same as copying but removes from local):
hdfs dfs -moveFromLocal /local/path/to/file /hdfs/path/to/destination
Moving a file from HDFS to local (same as copying but removes from HDFS):
hdfs dfs -moveToLocal /hdfs/path/to/file /local/path/to/destination
(Note: -moveToLocal is not implemented in many Hadoop releases; a reliable alternative is hdfs dfs -get followed by hdfs dfs -rm.)
In Linux:
o Removing a file:
o rm filename.txt
o Removing an empty directory:
o rmdir directory_name
o Removing a directory and its contents:
o rm -r directory_name
In HDFS:
o Removing a file:
o hdfs dfs -rm /hdfs/path/to/file
o Removing a directory (and its contents):
o hdfs dfs -rm -r /hdfs/path/to/directory
In Linux:
o Displaying the first few lines of a file:
o head filename.txt
o To display a specific number of lines:
o head -n 20 filename.txt
In HDFS:
o Displaying the first few lines of a file:
o hdfs dfs -cat /path/to/file | head
In Linux:
o To get the size of a file:
o wc -c filename.txt
In HDFS:
o To get the size of a file in HDFS:
o hdfs dfs -du -s /path/to/file
In Linux:
o To check the permissions of a file:
o ls -l filename.txt
In HDFS:
o To check the permissions of a file in HDFS:
o hdfs dfs -ls /path/to/file
In Linux:
o To zip a file:
o zip myarchive.zip file1.txt file2.txt
o To unzip a file:
o unzip myarchive.zip
o To zip a directory:
o zip -r myarchive.zip mydirectory/
o To unzip a directory:
o unzip myarchive.zip -d mydirectory/
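Putting these pieces together for this experiment, one possible sequence (the paths here are examples) zips a directory, pastes the archive to another location with and without preserving the original permissions, and copies it into HDFS:
zip -r myarchive.zip mydirectory/
cp myarchive.zip /tmp/backup/
cp -p myarchive.zip /tmp/backup/
hdfs dfs -put myarchive.zip /user/hadoop/
unzip /tmp/backup/myarchive.zip -d /tmp/restore/
The -p option of cp preserves the file's mode, ownership, and timestamps, while a plain cp applies the default permissions at the destination.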
In both Linux and HDFS, the cp and mv commands serve as the basic commands for copying and moving files.
Copying in Linux:
cp source.txt destination.txt
Moving in Linux:
mv source.txt destination.txt
Copying in HDFS:
hdfs dfs -cp /hdfs/source_file /hdfs/destination_file
Moving in HDFS:
hdfs dfs -mv /hdfs/source_file /hdfs/destination_file
This covers a wide range of file management operations, both for Linux and HDFS. In HDFS, you typically interact
with files using commands like hdfs dfs -put, hdfs dfs -get, and hdfs dfs -rm. For regular Linux
file management, commands like cp, mv, rm, and ls are used. Understanding both environments helps with
effectively managing and processing large datasets in Hadoop, especially when moving between local and
distributed storage systems.
4. MapReduce
MapReduce is a programming model used for processing and generating large datasets in a distributed computing
environment, such as Hadoop. It allows you to process data in parallel across many machines.
(i) Definition of MapReduce
MapReduce is a framework that allows you to process large amounts of data by splitting tasks into two main phases:
1. Map Phase: The input data is split into smaller chunks, and each chunk is processed independently. The
task in this phase is to process and transform data into key-value pairs.
2. Reduce Phase: After the map phase, the key-value pairs are grouped by key. The reduce function then
processes each group of key-value pairs to perform an aggregation or transformation.
The MapReduce framework is designed for distributed computing, where tasks are divided across a large number of
nodes (machines) to improve performance and handle massive amounts of data.
1. Input Splitting:
o The input data is split into chunks called input splits, which are processed in parallel.
o Each chunk is processed by a separate Mapper task.
2. Mapping:
o Each Mapper applies the map function to its split and emits intermediate key-value pairs.
3. Shuffle and Sort:
o The intermediate key-value pairs are grouped and sorted by key.
4. Reducing:
o Each group of values for a key is processed by a Reducer to produce aggregated results.
5. Output:
o The final output of the MapReduce job is written to a specified location, usually in the form of a
set of key-value pairs.
Key Terminologies in MapReduce
Mapper: A function that processes the input data and produces a set of intermediate key-value pairs.
Reducer: A function that takes the intermediate key-value pairs (grouped by key) and processes them to
produce the final output.
Input Split: The chunk of data that is passed to a mapper.
Key-Value Pair: The fundamental unit of data processed in MapReduce, consisting of a key and a value.
Shuffle and Sort: The process of grouping the intermediate key-value pairs by key before they are passed
to the reducer.
Job: A single MapReduce task that consists of a Map phase and a Reduce phase.
In this example, we will implement a Word Count program in MapReduce. This program counts the number of
occurrences of each word in a large text file.
1. Mapper: Tokenizes the input text into words and emits a key-value pair (word, 1).
2. Reducer: Sums the counts for each word and outputs the final count.
3. Driver: Configures and runs the MapReduce job.
1. Mapper Phase:
o The Mapper reads input data (a line of text), splits it into words, and emits each word as a key
with a value of 1.
o Example: "Hello world hello"
Output: ("Hello", 1), ("world", 1), ("hello", 1)
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line into words
        String[] words = value.toString().split("\\s+");
        for (String w : words) {
            word.set(w);
            context.write(word, one); // Emit (word, 1)
        }
    }
}
2. Shuffle and Sort:
o The framework groups the (word, 1) pairs emitted by the Mapper by word, producing (word, [1, 1, ...]) lists.
3. Reducer Phase:
o The Reducer receives the grouped key-value pairs (word, list of counts) and sums the counts for
each word.
o Example: ("Hello", [1, 1, 1]) → ("Hello", 3)
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // Add up the counts emitted for this word
        }
        result.set(sum);
        context.write(key, result); // Emit the word and its total count
    }
}
4. Driver Code:
The Driver code sets up and configures the MapReduce job, specifying the input/output formats, mapper,
and reducer classes.
Driver Code (Java):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
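The imports above are all the driver needs. A minimal main method (a sketch, assuming the Mapper and Reducer classes shown earlier are named WordCountMapper and WordCountReducer) might look like this:
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file(s) in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}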
Execution Flow:
5. Map Phase:
o Each line of the input is processed by the Mapper. (For example, assume the input file contains the lines "Hello world", "Hello Hadoop", and "Word count map reduce".)
o Output of the Mapper for this input:
o ("Hello", 1)
o ("world", 1)
o ("Hello", 1)
o ("Hadoop", 1)
o ("Word", 1)
o ("count", 1)
o ("map", 1)
o ("reduce", 1)
6. Shuffle & Sort Phase:
o The intermediate pairs are grouped by key, e.g. ("Hello", [1, 1]).
7. Reduce Phase:
o The reducer aggregates the values for each word:
o ("Hello", 2)
o ("world", 1)
o ("Hadoop", 1)
o ("Word", 1)
o ("count", 1)
o ("map", 1)
o ("reduce", 1)
The MapReduce paradigm helps process large datasets by splitting tasks into manageable chunks (Map phase) and
then aggregating results (Reduce phase). Understanding the Map, Shuffle & Sort, and Reduce phases is crucial for
writing efficient MapReduce programs. The Word Count example is a simple but classic use case that highlights
how MapReduce works in Hadoop.
Problem Setup:
Matrix A of dimensions m x n.
Matrix B of dimensions n x p.
We need to compute the product of these two matrices to get Matrix C of dimensions m x p, where each element is
C[i][j] = sum over k of ( A[i][k] * B[k][j] ),
where i is the row index of Matrix A, j is the column index of Matrix B, and k is the common dimension for the summation.
MapReduce Approach
We can break down the process of matrix multiplication into the following steps in the
MapReduce framework:
1. Mapper Phase:
o Matrix A will be processed such that each element is emitted as a key-value pair, where
the key is a combination of the row index and a constant (denoting Matrix A), and the
value is the element of the matrix.
o Matrix B will be processed such that each element is emitted as a key-value pair, where
the key is a combination of the column index and a constant (denoting Matrix B), and
the value is the element of the matrix.
o We will emit intermediate key-value pairs based on the indices i, j, and the value of the
matrix elements to facilitate the multiplication in the reducer.
3. Reducer Phase:
o For each unique combination of (i, j) key, the reducer will perform the multiplication
and summation as described by the matrix multiplication formula.
Let’s now implement the Matrix Multiplication using MapReduce with the following
components:
1. Mapper: This will handle matrix elements from Matrix A and Matrix B.
2. Reducer: This will handle the computation of the matrix product using the emitted key-value
pairs.
1. Mapper Class:
In the Mapper class, we will process both matrices (Matrix A and Matrix B), emit intermediate
key-value pairs, and ensure they are aligned by indices.
For Matrix A: Emit (i, k) as the key, and the value will be the element of the matrix A[i, k].
For Matrix B: Emit (k, j) as the key, and the value will be the element of the matrix B[k, j].
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
@Override
public void map(Object key, Text value, Context context) throws
IOException, InterruptedException {
String line = value.toString().trim();
String[] elements = line.split("\\s+");
2. Reducer Class:
In the Reducer, we will combine the results from both Matrix A and Matrix B by iterating
through all combinations and performing the multiplication and summation.
For each (i, j) key from the mapper, we will collect the corresponding values from Matrix A
and Matrix B and calculate their products. Finally, we emit the result for C[i, j].
Reducer Code (Java):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
List<Integer> matrixAValues = new ArrayList<>();
List<Integer> matrixBValues = new ArrayList<>();
result.set(sum);
context.write(key, result); // Emit (i, j) => C[i,j]
}
}
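Putting the pieces together, a complete minimal version of the mapper and reducer might look like the following. This is a sketch: each class goes in its own .java file with the imports it needs from the block at the top, the A and B contributions are paired by their shared index k using maps rather than plain lists, and the matrix dimensions are assumed to be passed through the Configuration keys m and p, as set in the driver below.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MatrixMultiplicationMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m")); // rows of Matrix A, set by the driver
        int p = Integer.parseInt(conf.get("p")); // columns of Matrix B, set by the driver
        String[] e = value.toString().trim().split("\\s+"); // line format: <matrix> <row> <col> <value>
        String matrix = e[0];
        int row = Integer.parseInt(e[1]);
        int col = Integer.parseInt(e[2]);
        String val = e[3];
        if (matrix.equals("A")) {
            // A[i,k] contributes to every C[i,j], for j = 0 .. p-1
            for (int j = 0; j < p; j++) {
                context.write(new Text(row + "," + j), new Text("A," + col + "," + val));
            }
        } else {
            // B[k,j] contributes to every C[i,j], for i = 0 .. m-1
            for (int i = 0; i < m; i++) {
                context.write(new Text(i + "," + col), new Text("B," + row + "," + val));
            }
        }
    }
}

public class MatrixMultiplicationReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Pair up the A and B contributions for this (i, j) cell by their shared index k
        Map<Integer, Integer> aValues = new HashMap<>();
        Map<Integer, Integer> bValues = new HashMap<>();
        for (Text val : values) {
            String[] parts = val.toString().split(","); // "A,k,value" or "B,k,value"
            int k = Integer.parseInt(parts[1]);
            int v = Integer.parseInt(parts[2]);
            if (parts[0].equals("A")) {
                aValues.put(k, v);
            } else {
                bValues.put(k, v);
            }
        }
        // C[i,j] = sum over k of A[i,k] * B[k,j]
        int sum = 0;
        for (Map.Entry<Integer, Integer> entry : aValues.entrySet()) {
            Integer b = bValues.get(entry.getKey());
            if (b != null) {
                sum += entry.getValue() * b;
            }
        }
        context.write(key, new Text(Integer.toString(sum))); // Emit (i,j) => C[i,j]
    }
}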
3. Driver Code:
In the Driver class, we will configure the MapReduce job by specifying the mapper, reducer,
input, and output paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MatrixMultiplicationDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("m", "2"); // number of rows of Matrix A (read by the mapper)
        conf.set("p", "2"); // number of columns of Matrix B (read by the mapper)
        Job job = Job.getInstance(conf, "matrix multiplication");
        job.setJarByClass(MatrixMultiplicationDriver.class);
        job.setMapperClass(MatrixMultiplicationMapper.class);
        job.setReducerClass(MatrixMultiplicationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // directory containing matrixA.txt and matrixB.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory for Matrix C
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Input Format:
Matrix A (matrixA.txt):
A 0 0 1
A 0 1 2
A 1 0 3
A 1 1 4
Matrix B (matrixB.txt):
B 0 0 5
B 0 1 6
B 1 0 7
B 1 1 8
Execution Flow:
1. Mapper emits, for each element of Matrix A and Matrix B, intermediate key-value pairs keyed by the target cell (i, j).
2. Reducer receives the intermediate key-value pairs, performs the multiplication for the specific (i, j) pair, and outputs the result for C[i,j].
3. Output: The final result will be a file representing matrix C (for the sample matrices above):
0,0 19
0,1 22
1,0 43
1,1 50
To compute the Average Salary and Total Salary by gender for an enterprise using Hadoop MapReduce, we can
break down the process into two phases:
1. Mapper Phase: This will extract relevant information (salary and gender) from the input data and emit
key-value pairs.
2. Reducer Phase: This will aggregate the total salary and calculate the average salary by gender.
Problem Setup
We want to compute the total salary and average salary for each gender.
MapReduce Approach
1. Mapper Phase:
o The mapper will read each record (employee), and for each employee, it will emit a key-value
pair:
The key will be the Gender (Male or Female).
The value will be a composite value containing:
The salary (to calculate the total).
A count (to compute the average salary).
2. Reducer Phase:
o The reducer will receive all the key-value pairs grouped by gender.
o For each gender, the reducer will calculate:
The Total Salary by summing all the salaries for that gender.
The Average Salary by dividing the total salary by the number of employees of that
gender.
Step-by-Step Code Implementation
In the Mapper, we will read the input (which contains EmployeeID, Name, Gender, and Salary), and for each
employee, we will emit the gender as the key and the salary with a count of 1 as the value.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class SalaryMapper extends Mapper<Object, Text, Text, Text> {
    private final Text gender = new Text();
    private final Text salaryData = new Text();
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Skip the header row if there is one
        if (line.startsWith("EmployeeID")) {
            return;
        }
        // Input format: EmployeeID,Name,Gender,Salary
        String[] fields = line.split(",");
        String empGender = fields[2].trim();
        String salary = fields[3].trim();
        gender.set(empGender);
        salaryData.set(salary + ",1"); // Send salary and count (1) for average computation
        context.write(gender, salaryData);
    }
}
Explanation:
The Mapper extracts the Gender and Salary fields from each record and emits the gender as the key with the value "salary,1".
The Reducer will receive all the salary values grouped by gender. It will then sum the salaries and count the number of records for each gender.
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    int totalSalary = 0;
    int count = 0;
    for (Text val : values) {
        String[] parts = val.toString().split(","); // each value has the form "salary,count"
        totalSalary += Integer.parseInt(parts[0]);  // add this salary to the running total
        count += Integer.parseInt(parts[1]);        // add the record count
    }
    double averageSalary = (double) totalSalary / count;
    context.write(key, new Text("Total Salary: " + totalSalary + ", Average Salary: " + String.format("%.2f", averageSalary)));
}
Explanation:
We accumulate the total salary and count of records for each gender.
The average salary is computed by dividing the total salary by the count.
The output is written as Gender => (Total Salary, Average Salary).
In the Driver class, we will configure and set up the MapReduce job, specifying the input, output paths, and the
Mapper and Reducer classes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SalaryDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "salary by gender");
        job.setJarByClass(SalaryDriver.class);
        job.setMapperClass(SalaryMapper.class);
        job.setReducerClass(SalaryReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // employee CSV file in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Explanation:
The Mapper and Reducer classes are set in the job configuration.
FileInputFormat is used to specify the input file (containing the employee data).
FileOutputFormat is used to specify where the output will be written.
Input Format:
The input file should have employee data in the following format (CSV):
EmployeeID,Name,Gender,Salary
1,John,Male,5000
2,Jane,Female,6000
3,Alice,Female,7000
4,Bob,Male,8000
5,Charlie,Male,5500
Execution Flow:
1. Mapper Phase:
o The Mapper processes each employee record, extracting the Gender and Salary.
o For each record, it emits a key-value pair where the key is the gender and the value is the salary
and a count of 1.
2. Reducer Phase:
o The Reducer sums the salaries and the counts for each gender and computes the average salary.
Output Format:
For the sample input above, the final output looks like:
Female  Total Salary: 13000, Average Salary: 6500.00
Male    Total Salary: 18500, Average Salary: 6166.67
This MapReduce job calculates the Total Salary and Average Salary for employees by gender in an enterprise. By
using the Mapper to emit gender-based salary data and the Reducer to aggregate the results, we can efficiently
compute these statistics even for large datasets distributed across a Hadoop cluster.