Big Data and Hadoop


BS605PC: BIG DATA - SPARK

B.Tech. III Year II Sem.

L T P C: 0 0 4 2

Course Objectives:

 The main objective of the course is to process Big Data with advanced architectures such as Spark, and to work with streaming data in Spark

Course Outcomes:

 Develop MapReduce programs to analyze large datasets using Hadoop and Spark

 Write Hive queries to analyze large datasets

 Outline the Spark ecosystem and its components

 Perform the filter, count, distinct, map, and flatMap RDD operations in Spark

 Build queries using Spark SQL

 Apply Spark joins on sample data sets

 Make use of Sqoop to import and export data between Hadoop and a database

List of Experiments:

1. Study of Big Data Analytics and Hadoop Architecture

(i) know the concept of big data architecture

(ii) know the concept of Hadoop architecture

2. Loading a Dataset into HDFS for Spark Analysis

Installation of Hadoop and cluster management

(i) Installing Hadoop single node cluster in ubuntu environment

(ii) Knowing the difference between single-node clusters and multi-node clusters

(iii) Accessing WEB-UI and the port number

(iv) Installing and accessing the environments such as hive and sqoop

3. File management tasks & Basic linux commands

(i) Creating a directory in HDFS

(ii) Moving forth and back to directories

(iii) Listing directory contents


(iv) Uploading and downloading a file in HDFS

(v) Checking the contents of the file

(vi) Copying and moving files

(vii) Copying and moving files between local to HDFS environment

(viii) Removing files and paths

(ix) Displaying few lines of a file

(x) Display the aggregate length of a file

(xi) Checking the permissions of a file

(xii) Zipping and unzipping files with & without preserving permissions, and pasting them to a location

(xiii) Copy, Paste commands

4. Map-reducing

(i) Definition of Map-reduce

(ii) Its stages and terminologies

(iii) Word-count program to understand map-reduce (Mapper phase, Reducer phase, Driver

code)

5. Implementing Matrix-Multiplication with Hadoop Map-reduce

6. Compute Average Salary and Total Salary by Gender for an Enterprise.

7. (i) Creating hive tables (External and internal)

(ii) Loading data into external Hive tables from SQL tables (or) structured CSV files using Sqoop

(iii) Performing operations like filtering and updating

(iv) Performing joins (inner, outer, etc.)

(v) Writing user-defined functions on Hive tables

8. Create SQL tables for employees: an Employee table (id, designation) and a Salary table (salary, dept id). Create external tables in Hive with schemas similar to the above tables, move the data to Hive using Sqoop and load the contents into the tables, filter into a new table, write a UDF to encrypt the table with the AES algorithm, and decrypt it with the key to show the contents.

9. (i) PySpark definition (Apache PySpark) and the differences between PySpark, Scala, and pandas

(ii) PySpark files and class methods

(iii) get(filename)

(iv) getRootDirectory()

10. PySpark - RDDs

(i) What are RDDs?

(ii) Ways to create RDDs

(iii) parallelized collections

(iv) external datasets

(v) existing RDDs

(vi) Spark RDD operations (count(), foreach(), collect(), join(), cache())

11. Perform PySpark transformations

(i) map and flatMap

(ii) Remove the words that are not necessary for analyzing the text

(iii) groupBy

(iv) Calculate how many times each word occurs in the corpus

(v) Perform a task (say, counting the words 'spark' and 'apache' in rdd3) separately on each partition and get the output of the task performed in each partition

(vi) Unions of RDDs

(vii) Join two pair RDDs based on their keys

12. PySpark SparkConf - attributes and applications

(i) What is PySpark SparkConf()?

(ii) Using SparkConf, create a Spark session to read details from a CSV file into a DataFrame and later move that CSV file to another location

TEXT BOOKS:

1. Spark in Action, Marko Bonaci and Petar Zecevic, Manning.


2. PySpark SQL Recipes: With HiveQL, Dataframe and Graphframes, Raju Kumar Mishra and

Sundar Rajan Raman, Apress Media.

WEB LINKS:

1. https://infyspringboard.onwingspan.com/web/en/app/toc/lex_auth_01330150584451891225182_shared/overview

2. https://infyspringboard.onwingspan.com/web/en/app/toc/lex_auth_01258388119638835242_shared/overview

3. https://infyspringboard.onwingspan.com/web/en/app/toc/lex_auth_0126052684230082561692_shared/overview

1. Study of Big Data Analytics and Hadoop Architecture

(i) know the concept of big data architecture

What is Big Data Architecture?

Big data architecture is specifically designed to manage the ingestion, processing, and analysis of data that is too large or complex for conventional systems. Such data cannot be stored, processed, and managed by conventional relational databases. The solution is to organize the technology into a big data architecture that is able to manage and process the data.

Key Aspects of Big Data Architecture

The following are some key aspects of big data architecture −

 To store and process large volumes of data, for example 100 GB or more in size.

 To aggregate and transform a wide variety of unstructured data for analysis and reporting.
 To access, process, and analyze streamed data in real time.

Diagram of Big Data Architecture

 The following figure shows a big data architecture as a sequential arrangement of different components: the output of one component serves as the input to the next, and this flow continues until the data has been fully processed.
 [Figure: big data architecture diagram - not reproduced here]

Components of a big data architecture

Most big data architectures include some or all of the following components:

 Data sources. All big data solutions start with one or more data sources. Examples include:
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
 Data storage. Data for batch processing operations is typically stored in a distributed file store that can
hold high volumes of large files in various formats. This kind of store is often called a data lake. Options
for implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.
 Batch processing. Because the data sets are so large, often a big data solution must process data files using
long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs
involve reading source files, processing them, and writing the output to new files. Options include running
U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight
Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
 Real-time message ingestion. If the solution includes real-time sources, the architecture must include a
way to capture and store real-time messages for stream processing. This might be a simple data store,
where incoming messages are dropped into a folder for processing. However, many solutions need a
message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable
delivery, and other message queuing semantics. This portion of a streaming architecture is often referred to
as stream buffering. Options include Azure Event Hubs, Azure IoT Hub, and Kafka.
 Stream processing. After capturing real-time messages, the solution must process them by filtering,
aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an
output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually
running SQL queries that operate on unbounded streams. You can also use open source Apache streaming
technologies like Spark Streaming in an HDInsight cluster.
 Analytical data store. Many big data solutions prepare data for analysis and then serve the processed data
in a structured format that can be queried using analytical tools. The analytical data store used to serve
these queries can be a Kimball-style relational data warehouse, as seen in most traditional business
intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL
technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data
files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can
also be used to serve data for analysis.
 Analysis and reporting. The goal of most big data solutions is to provide insights into the data through
analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling
layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might
also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or
Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data
scientists or data analysts. For these scenarios, many Azure services support analytical notebooks, such as
Jupyter, enabling these users to use their existing skills with Python or Microsoft R. For large-scale data
exploration, you can use Microsoft R Server, either standalone or with Spark.
 Orchestration. Most big data solutions consist of repeated data processing operations, encapsulated in
workflows, that transform source data, move data between multiple sources and sinks, load the processed
data into an analytical data store, or push the results straight to a report or dashboard. To automate these
workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.

(ii) know the concept of Hadoop architecture

Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to store and manage big data. Hadoop works on the MapReduce programming model, which was introduced by Google. Today many large companies use Hadoop to deal with big data, e.g., Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of four components:

 HDFS(Hadoop Distributed File System)


 YARN(Yet Another Resource Negotiator)
 MapReduce
 Common Utilities or Hadoop Common
Components of Hadoop Architecture

The Hadoop architecture uses several core components for parallel processing of large data volumes:

HDFS: The Storage Layer in the Hadoop Architecture

The Hadoop Distributed File System allows the storage of large data volumes by dividing it into blocks. HDFS is
designed for fault tolerance and ensures high availability by replicating data across multiple nodes in a Hadoop
cluster, allowing for efficient data processing and analysis in parallel.

HDFS has three components: 1. NameNode, 2. Secondary NameNode, and 3. DataNode (slave node).

1. NameNode, the master server that holds all the metadata, such as block locations, replication information, and permissions. It is the NameNode that permits a user to read or write a file in HDFS, and it keeps track of where each block resides on the various DataNodes. There can be many DataNodes storing and retrieving blocks and sending block reports to the NameNode.
2. Secondary NameNode, a helper server that periodically merges the NameNode's edit log into the filesystem image (checkpointing) and keeps a copy of this metadata on disk. It is not a hot standby: it does not perform the same failover role as the standby NameNode in a high-availability cluster.
3. DataNode (slave node), which stores the actual data as blocks.
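For a quick look at these roles on a running cluster, the following commands (a minimal sketch; the exact output depends on your setup) ask the NameNode to report its registered DataNodes and to check block replication health:

# Summarize cluster capacity and the DataNodes registered with the NameNode
hdfs dfsadmin -report

# Check files, blocks, and replication health under a given path
hdfs fsck / -files -blocks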

YARN: The Cluster Resource Management Layer in the Hadoop Architecture

YARN (Yet Another Resource Negotiator) is the cluster resource management layer in the Hadoop architecture. It is responsible for job scheduling and for managing the cluster's resources. YARN helps with task distribution, job prioritization, dependency management, and other aspects across the Hadoop cluster for optimum processing efficiency. It allows multi-tenancy, supports easy scalability, and optimizes cluster utilization.

YARN resides as a middle layer between HDFS and MapReduce in the Hadoop architecture. It has three core
elements – 1. ResourceManager 2. ApplicationMaster and 3. NodeManagers

1. The YARN ResourceManager is the sole authority for resource allocation and tracking of resources in the cluster. It features two main components: the Scheduler, which schedules resources for the various applications, and the Application Manager, which accepts job submissions and monitors the applications running on the cluster.
2. The YARN ApplicationMaster handles the application side of resource management, fulfilling the resource requirements of an individual application through interactions with the Scheduler.
3. The YARN NodeManager runs on each worker node, tracking the tasks on that node and monitoring resource utilization (RAM and CPU) in the containers it launches.

MapReduce: Distributed Parallel Processing Model in the Hadoop Architecture

Hadoop uses the MapReduce programming model for parallel processing of large datasets. It is a fundamental
component in the Hadoop ecosystem for big data analytics.

MapReduce consists of two main phases: the Map Phase and the Reduce Phase.

In the Map Phase, input data is divided into smaller chunks and processed in parallel across multiple nodes in a
distributed computing environment. The input data is typically represented as key-value pairs.

In the Reduce Phase, the results from the Map phase are aggregated by key to produce the final output.
Hadoop Common or Common Utilities in the Hadoop Architecture

This crucial component of the Hadoop architecture ensures the proper functioning of Hadoop modules by providing
shared libraries and utilities. Hadoop Common contains the Java Archive (JAR) files and scripts required to start
Hadoop.

Advantages of a Well-designed Hadoop Architecture

 Data Storage and Scalability: The Hadoop Distributed File System’s ability to store and process vast data
volumes at speed is its biggest strength. As data grows, organizations can scale their Hadoop clusters easily
by adding more nodes for increased storage capacity and processing power.

 Batch and Real-time Data Processing: Hadoop’s MapReduce module supports batch processing, and real-time stream processing becomes possible when Hadoop is integrated with frameworks like Apache Spark. This versatility allows
organizations to address various use cases of advanced analytics.

 Cost-Effectiveness: Hadoop is designed to run on commodity hardware, which is more cost-effective than
investing in high-end, specialized hardware. This makes it an attractive option for organizations looking to
manage large datasets without incurring substantial infrastructure costs.

 Data Locality and Data Integrity: Hadoop processes data on the same node where it is stored, minimizing
data movement across the network. This approach enhances performance by reducing latency and
improving overall efficiency. Hadoop minimizes data loss and ensures data integrity through duplication on
multiple nodes.

 Community Support: Hadoop users enjoy a large open-source community for continuous updates,
improvement, and collaboration. Hadoop also offers a rich repository of documentation and resources.

These are some of the many advantages that Hadoop architecture provides to its users. Having said that, the Hadoop
architecture does present some limitations such as security management complexities, vulnerability to cyber threats
due to Java, and challenges in handling small datasets, among others. This often prompts organizations to seek
modern cloud-based alternatives such as Databricks, Snowflake, or the Azure suite of tools.

2. Loading a Dataset into HDFS for Spark Analysis

Installation of Hadoop and cluster management

(i) Installing Hadoop single node cluster in ubuntu environment

To install a single-node Hadoop cluster on an Ubuntu system, follow these steps:

Step 1: Update the System

Start by updating your system packages to ensure everything is up to date.

sudo apt update


sudo apt upgrade -y
Step 2: Install Java

Hadoop requires Java, so we need to install it.

sudo apt install openjdk-11-jdk -y

Verify the Java installation:

java -version

Ensure that it returns the correct version of Java.

Step 3: Create a Hadoop User

Create a user to run Hadoop.

sudo adduser hadoop


sudo usermod -aG sudo hadoop

Switch to the hadoop user:

su - hadoop

Step 4: Download Hadoop

Go to the Apache Hadoop downloads page and copy the link for the latest stable version of Hadoop.

To download Hadoop (replace hadoop-3.x.x.tar.gz with the latest version):

wget https://downloads.apache.org/hadoop/common/hadoop-3.x.x/hadoop-3.x.x.tar.gz

Step 5: Extract Hadoop

Extract the downloaded Hadoop archive:

tar -xvzf hadoop-3.x.x.tar.gz

Move it to the desired installation location:

sudo mv hadoop-3.x.x /usr/local/hadoop

Step 6: Set Hadoop Environment Variables

Edit the ~/.bashrc file to set Hadoop-related environment variables.

nano ~/.bashrc

Add the following lines at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Apply the changes:

source ~/.bashrc
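To confirm the variables have taken effect in the current shell, a quick check such as the following (output omitted; it simply echoes the install path and prints the Hadoop version) can be run before continuing:

echo $HADOOP_HOME
hadoop version
which hdfs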

Step 7: Configure Hadoop

Now configure Hadoop by editing several configuration files.

1. Core Site Configuration (core-site.xml)

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

2. HDFS Site Configuration (hdfs-site.xml)

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following configuration:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- dfs.name.dir and dfs.data.dir are the deprecated older names for these two properties -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoop_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoop_data/hdfs/datanode</value>
</property>
</configuration>

3. MapReduce Site Configuration (mapred-site.xml)

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

4. YARN Site Configuration (yarn-site.xml)

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following configuration:

<configuration>
<!-- Required so that MapReduce jobs can shuffle data through YARN -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>localhost:8033</value>
</property>
</configuration>
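Two additional steps are usually needed on a fresh Ubuntu machine before formatting and starting the daemons, although your environment may differ: setting JAVA_HOME inside hadoop-env.sh (the start scripts do not always inherit it from ~/.bashrc), and enabling passwordless SSH to localhost, which start-dfs.sh and start-yarn.sh use to launch the daemons. A minimal sketch:

# Point Hadoop's own environment file at the JDK installed earlier
echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# Passwordless SSH to localhost for the hadoop user
sudo apt install openssh-server -y
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost   # should log in without prompting for a password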

Step 8: Format the HDFS

You need to format the Hadoop file system (HDFS):

hdfs namenode -format

Step 9: Start Hadoop Daemons

Start the Hadoop daemons (NameNode, DataNode, ResourceManager, and NodeManager).

start-dfs.sh
start-yarn.sh

Step 10: Verify the Installation

Check if everything is running fine.

1. Access the Hadoop NameNode web interface at http://localhost:9870 (for Hadoop 3.x; Hadoop 2.x used port 50070).


2. Check the ResourceManager web UI at http://localhost:8088.
Step 11: Test the Setup

You can test the setup by running a simple word count example:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output

This runs the built-in word count example. Before running it, create the input directory and upload an input file to HDFS, for example:

hadoop fs -mkdir -p /user/hadoop/input
hadoop fs -put localfile.txt /user/hadoop/input

Now you should have a working single-node Hadoop cluster on Ubuntu!
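As a final sanity check, the jps command should list the daemons started above; the process IDs and the exact set shown below are illustrative for a single-node setup:

jps
# Typical output on a healthy single-node cluster:
# 12345 NameNode
# 12418 DataNode
# 12601 SecondaryNameNode
# 12789 ResourceManager
# 12901 NodeManager
# 13050 Jps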

(ii) Knowing the difference between single-node clusters and multi-node clusters.

The key difference between a single-node and multi-node Hadoop cluster lies in the number of machines (or nodes)
involved in the setup and the distribution of resources and workloads across those machines.

Here’s a brief comparison (see the workers-file sketch after this list):

 Single-Node Cluster is suitable for small-scale applications, testing, and learning.


 Multi-Node Cluster is used for large-scale production environments where fault tolerance, scalability,
and high performance are essential.
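One concrete place the difference shows up is the workers file ($HADOOP_HOME/etc/hadoop/workers in Hadoop 3.x; older 2.x releases call it slaves), which lists the hosts that run the DataNode and NodeManager daemons. The hostnames below are only examples:

# Single-node cluster: the workers file lists only the local machine
cat $HADOOP_HOME/etc/hadoop/workers
# localhost

# Multi-node cluster: one worker host per line (example hostnames)
# node01
# node02
# node03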

(iii) Accessing WEB-UI and the port number

In a Hadoop cluster, both the NameNode and ResourceManager provide Web UIs that allow you to monitor and
manage the cluster's health, performance, and resources. The web UIs are accessible via specific port numbers on
the system where the Hadoop cluster is running.

Here's an overview of the Hadoop Web UIs and their corresponding ports:
1. Hadoop NameNode Web UI

 Purpose: The NameNode Web UI allows you to monitor the status of the HDFS (Hadoop Distributed File
System), check the health of the cluster, view the status of data nodes, and explore the file system.
 Default Port: 9870 in Hadoop 3.x (earlier 2.x releases used 50070)
o URL: http://<namenode_host>:9870
o For a single-node setup, it will be: http://localhost:9870

Key Features:

 File System Status: Shows the total capacity, used space, and remaining space in the HDFS.
 Datanode Health: Displays information about the DataNodes, including their health, storage, and data
block replication status.
 HDFS Browser: Allows you to browse files stored in HDFS, upload files, and perform other file operations.

2. Hadoop ResourceManager Web UI

 Purpose: The ResourceManager Web UI allows you to monitor and manage the YARN (Yet Another
Resource Negotiator), which is responsible for job scheduling and resource management across the
cluster.
 Default Port: 8088
o URL: http://<resourcemanager_host>:8088
o For a single-node setup, it will be: http://localhost:8088

Key Features:

 Cluster Summary: Displays the total number of nodes, available resources, and running applications.
 Applications: Shows the status of running and completed applications, such as MapReduce jobs, Spark
jobs, etc.
 Resource Utilization: Displays the resource utilization (memory and CPU) across the cluster.

3. Hadoop JobHistory Server Web UI

 Purpose: This Web UI allows you to monitor and view details about completed MapReduce jobs, including
job execution times, tasks, and logs.
 Default Port: 19888
o URL: http://<history_server_host>:19888
o For a single-node setup, it will be: http://localhost:19888

Key Features:

 Job History: Allows you to view the details of past MapReduce jobs.
 Job Logs: Provides access to job logs, helping you troubleshoot and debug failed jobs.

4. Hadoop NodeManager Web UI

 Purpose: The NodeManager Web UI shows the status of the NodeManager and its resource usage
(memory, CPU, etc.) for the node where it's running.
 Default Port: 8042
o URL: http://<nodemanager_host>:8042
o For a single-node setup, it will be: http://localhost:8042

Key Features:

 Node Health: Shows the health and status of the NodeManager on a particular machine.
 Resource Usage: Displays memory, CPU, and disk usage for the node.
 Container Details: Shows information about running containers on the node.

Accessing the Web UI in a Single-Node Setup

If you're using a single-node Hadoop cluster on your local machine (Ubuntu), you can access the Web UIs by
navigating to the following URLs in your browser:

1. NameNode Web UI:


http://localhost:9870
o For monitoring HDFS and checking the health of your data nodes.

2. ResourceManager Web UI:


http://localhost:8088
o For managing YARN resources and viewing running applications.

3. JobHistory Server Web UI:


http://localhost:19888
o For viewing completed MapReduce jobs and their details.

4. NodeManager Web UI:


http://localhost:8042
o For checking the status and resource usage of the NodeManager.

How to Ensure These Ports are Open and Accessible:

1. Verify that Hadoop daemons are running:


o You can check the status of Hadoop daemons using the following commands:
o jps

This should show running processes like NameNode, DataNode, ResourceManager, NodeManager, etc.

2. Access Web UIs:


o Open a web browser and enter the URL corresponding to the desired Web UI, such as
http://localhost:9870 for the NameNode.

3. Check if ports are open (in case of firewall restrictions):


o You can verify if the ports are open using tools like netstat or ss:
o sudo netstat -tuln | grep 9870
o sudo netstat -tuln | grep 8088
o sudo netstat -tuln | grep 19888
o sudo netstat -tuln | grep 8042
If the ports are closed or not accessible, make sure that:

 The firewall is not blocking the ports.


 Hadoop daemons are correctly configured and running.

Troubleshooting:

 Port Conflict: Ensure no other services are occupying the default Hadoop ports (e.g., 9870, 8088, 19888, 8042).
 Permissions: Check if the user running the Hadoop services has the appropriate permissions to access and
open these ports.

(iv) Installing and accessing the environments such as hive and sqoop

Installing and Accessing Hive and Sqoop on a Hadoop Single-Node Cluster

Both Hive and Sqoop are popular tools in the Hadoop ecosystem for managing and querying large datasets stored in
HDFS and integrating with external databases, respectively. Here's how to install and access these tools in a Hadoop
environment.

Installing Hive

Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and
analysis. It allows you to query data stored in HDFS using a SQL-like interface.

1. Download Hive

First, download the latest version of Hive from the official Apache Hive website: Apache Hive Download.

To download Hive from the command line, you can use:

wget https://downloads.apache.org/hive/hive-3.x.x/apache-hive-3.x.x-bin.tar.gz

Replace 3.x.x with the latest version.

2. Extract Hive Archive

After downloading, extract the Hive tarball:

tar -xvzf apache-hive-3.x.x-bin.tar.gz

Move the extracted Hive directory to a desired location:

sudo mv apache-hive-3.x.x-bin /usr/local/hive

3. Set Hive Environment Variables

You need to configure environment variables for Hive, so edit the .bashrc file to include the following lines:
nano ~/.bashrc

Add the following lines at the end of the file:

# Hive Environment Variables


export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

Apply the changes:

source ~/.bashrc

4. Configure Hive

You need to configure Hive by editing the hive-site.xml file. First, navigate to the configuration directory:

cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml

Edit the hive-site.xml file:

nano hive-site.xml

Add the following properties to configure Hive to use Derby (default) as the metastore database:

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:derby:;databaseName=/usr/local/hive/metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>sa</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value></value>
</property>
</configuration>

5. Initialize the Metastore Database

You need to initialize the metastore database. Run the following command to do this:

schematool -initSchema -dbType derby


6. Start Hive

After configuring Hive, you can start the Hive shell using:

hive

This will open the Hive command-line interface (CLI). You can start running queries using HiveQL (Hive Query
Language) just like SQL.

For example, to create a table and load data:

CREATE TABLE IF NOT EXISTS test_table (id INT, name STRING);


LOAD DATA LOCAL INPATH '/path/to/data.txt' INTO TABLE test_table;

Installing Sqoop

Sqoop is a tool used to transfer data between Hadoop and relational databases (e.g., MySQL, PostgreSQL, Oracle).

1. Download Sqoop

Go to the Apache Sqoop Download page to download the latest version. Or, you can use the following command to
download:

wget https://downloads.apache.org/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

2. Extract Sqoop Archive

Once the file is downloaded, extract it:

tar -xvzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

Move the extracted directory to a location of your choice:

sudo mv sqoop-1.4.7.bin__hadoop-2.6.0 /usr/local/sqoop

3. Set Sqoop Environment Variables

Add the following environment variables to the .bashrc file:

nano ~/.bashrc

Add the following lines:

# Sqoop Environment Variables


export SQOOP_HOME=/usr/local/sqoop
export PATH=$PATH:$SQOOP_HOME/bin

Then apply the changes:


source ~/.bashrc

4. Configure Sqoop to Connect to MySQL

Sqoop requires JDBC connectors to connect to databases like MySQL. If you're connecting to MySQL, download
the MySQL JDBC driver:

wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.49.tar.gz

Extract and copy the MySQL JDBC driver JAR file:

tar -xvzf mysql-connector-java-5.1.49.tar.gz


sudo cp mysql-connector-java-5.1.49/mysql-connector-java-5.1.49.jar $SQOOP_HOME/lib/

5. Test the Sqoop Installation

You can test the Sqoop installation by running the following command to check the available options:

sqoop help

To import data from a MySQL database to HDFS, you can use a command like this:

sqoop import --connect jdbc:mysql://localhost/database_name --username root --password password --table table_name --target-dir /user/hadoop/hdfs_directory

Accessing Hive and Sqoop

Accessing Hive:

 To access the Hive CLI, simply run the hive command from the terminal:

hive

 You can also access Hive using Beeline, a JDBC client for Hive:

beeline -u jdbc:hive2://localhost:10000

Accessing Sqoop:

 To run a Sqoop job or access Sqoop commands, use the sqoop command followed by the appropriate
subcommands.

Example of a simple import command:

sqoop import --connect jdbc:mysql://localhost/testdb --table employees --username root --password password --target-dir /user/hadoop/employees
Additional Configuration (Optional):

 If you're using HiveServer2 (for remote clients like Beeline or JDBC), you can configure and start
HiveServer2 to allow connections via a JDBC client. It runs on port 10000 by default.

With Hive and Sqoop installed and configured, you can now perform SQL-like queries over Hadoop with Hive, and
transfer data between relational databases and HDFS using Sqoop. To access them, use their respective command-
line interfaces (hive for Hive and sqoop for Sqoop). If you need a more interactive interface, you can also
configure HiveServer2 for remote access through JDBC.
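As a quick smoke test of both tools, the commands below run a HiveQL statement non-interactively through Beeline and list the tables Sqoop can reach in a MySQL database. The connection URLs, database name, and credentials are placeholders for your own environment:

# Run a HiveQL statement through Beeline (requires HiveServer2 listening on port 10000)
beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"

# List the tables Sqoop can reach in a MySQL database (-P prompts for the password)
sqoop list-tables --connect jdbc:mysql://localhost/testdb --username root -P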

3. File management tasks & Basic linux commands

(i) Creating a directory in HDFS

(ii) Moving forth and back to directories

(iii) Listing directory contents

(iv) Uploading and downloading a file in HDFS

(v) Checking the contents of the file

(vi) Copying and moving files

(vii) Copying and moving files between local to HDFS environment

(viii) Removing files and paths

(ix) Displaying few lines of a file

(x) Display the aggregate length of a file

(xi) Checking the permissions of a file

(xii) Zipping and unzipping files with & without preserving permissions, and pasting them to a location

(xiii) Copy, Paste commands

(i) Creating a Directory in HDFS

Hadoop Distributed File System (HDFS) allows you to store large files across multiple machines, but managing files
within it is different from managing files in the local filesystem. You can use Hadoop’s hdfs dfs command to
interact with HDFS.

To create a directory in HDFS, you need to use the -mkdir command with hdfs dfs.
Steps to Create a Directory in HDFS

1. Ensure Hadoop is running:


o Make sure your Hadoop services (NameNode, DataNode, ResourceManager, etc.) are up and
running.
o You can check the Hadoop services by running:
o jps

2. Use the hdfs dfs -mkdir command:


o To create a directory in HDFS, use the following command:
o hdfs dfs -mkdir /path/to/directory

o For example, to create a directory called mydir in HDFS, run:


o hdfs dfs -mkdir /user/hadoop/mydir

o Here, /user/hadoop/mydir is the path in HDFS where you want to create the directory. You
can replace this path with the desired location.
3. Verify the Directory Creation:
o To confirm the directory was created, use the -ls command to list the contents of the directory:
o hdfs dfs -ls /user/hadoop

o This will list all the files and directories within /user/hadoop, and you should see your newly
created mydir listed.

Example of Creating a Directory in HDFS:


# Create a directory named "mydir" in HDFS under /user/hadoop/
hdfs dfs -mkdir /user/hadoop/mydir

# Verify the directory has been created


hdfs dfs -ls /user/hadoop

Additional Options:

 You can also create nested directories in a single command by using the -p option, which will create the
entire path if it doesn’t already exist:
 hdfs dfs -mkdir -p /user/hadoop/dir1/dir2

This will create both dir1 and dir2 inside /user/hadoop/.

Basic Linux Commands for File Management:

In addition to HDFS commands, here are some basic Linux commands that can be useful for file management on
your local filesystem:

1. Creating a Directory:
o To create a directory in Linux, you use the mkdir command:
o mkdir mydirectory

2. Listing Files:
o To list files in the current directory:
o ls
o To list files with detailed information (permissions, size, etc.):
o ls -l

3. Changing Directory:
o To navigate into a directory:
o cd mydirectory
o To go back to the previous directory:
o cd ..

4. Removing a Directory:
o To remove an empty directory:
o rmdir mydirectory
o To remove a non-empty directory (with its contents):
o rm -r mydirectory

5. Copying Files or Directories:


o To copy a file:
o cp source.txt destination.txt
o To copy a directory and its contents:
o cp -r sourcedir/ destinationdir/

6. Moving or Renaming Files:


o To move or rename a file:
o mv oldname.txt newname.txt

7. Removing Files:
o To remove a file:
o rm file.txt

8. Viewing File Contents:


o To display the contents of a file:
o cat file.txt

To create a directory in HDFS, you use the hdfs dfs -mkdir command, specifying the path where you want
the directory. Remember to ensure your Hadoop services are running, and you can verify the directory's creation
with hdfs dfs -ls. Additionally, understanding basic Linux commands such as mkdir, cd, ls, and rm can be
very helpful in managing files and directories on your local system.

File Management Commands in HDFS & Linux

Let’s walk through the specific tasks related to file management in both HDFS and Linux, covering your requested
operations.

(ii) Moving Forth and Back to Directories

In Linux, moving between directories is done with the cd command; HDFS has no notion of a current working directory, so you address files and directories by their full paths instead.
 In Linux:
o Go to a directory: cd /path/to/directory
o Go back to the previous directory: cd -
o Go up one level: cd ..

 In HDFS:
o Move into a directory in HDFS: Use hdfs dfs -ls to check contents.
 hdfs dfs -ls /user/hadoop
o No direct cd in HDFS, but you can view contents of directories with hdfs dfs -ls.

(iii) Listing Directory Contents

 In Linux:
o To list the files in the current directory:
o ls
o To get a detailed listing with permissions, sizes, and timestamps:
o ls -l
o To include hidden files (those starting with ., e.g., .bashrc):
o ls -a

 In HDFS:
o To list the contents of a directory in HDFS:
o hdfs dfs -ls /path/to/directory
o To recursively list all files:
o hdfs dfs -ls -R /path/to/directory

(iv) Uploading and Downloading a File in HDFS

 Uploading a file to HDFS:


 hdfs dfs -put /local/path/to/file /hdfs/path/to/destination

Example:

hdfs dfs -put /home/hadoop/test.txt /user/hadoop/

 Downloading a file from HDFS:


 hdfs dfs -get /hdfs/path/to/file /local/path/to/destination

Example:

hdfs dfs -get /user/hadoop/test.txt /home/hadoop/

(v) Checking the Contents of the File

 In Linux:
o To view the contents of a file:
o cat filename.txt
o To view the contents with pagination:
o less filename.txt
o To view the first few lines of a file:
o head filename.txt

 In HDFS:
o To view the contents of a file in HDFS:
o hdfs dfs -cat /path/to/file

(vi) Copying and Moving Files

 In Linux:
o Copying files:
o cp source_file destination_file
o Moving files (also renaming):
o mv source_file destination_file

 In HDFS:
o Copying files in HDFS:
o hdfs dfs -cp /hdfs/source_file /hdfs/destination_file
o Moving files in HDFS:
o hdfs dfs -mv /hdfs/source_file /hdfs/destination_file

(vii) Copying and Moving Files Between Local and HDFS

 Copying a file from local to HDFS:


 hdfs dfs -put /local/path/to/file /hdfs/path/to/destination

 Copying a file from HDFS to local:


 hdfs dfs -get /hdfs/path/to/file /local/path/to/destination

 Moving a file from local to HDFS (same as copying but removes from local):
 hdfs dfs -moveFromLocal /local/path/to/file /hdfs/path/to/destination

 Moving a file from HDFS to local (same as copying but removes from HDFS):
 hdfs dfs -moveToLocal /hdfs/path/to/file /local/path/to/destination
(Note: in many Hadoop releases -moveToLocal is not yet implemented and reports an error; using -get followed by -rm achieves the same effect.)

(viii) Removing Files and Paths

 In Linux:
o Removing a file:
o rm filename.txt
o Removing an empty directory:
o rmdir directory_name
o Removing a directory and its contents:
o rm -r directory_name

 In HDFS:
o Removing a file:
o hdfs dfs -rm /hdfs/path/to/file
o Removing a directory (and its contents):
o hdfs dfs -rm -r /hdfs/path/to/directory

(ix) Displaying Few Lines of a File

 In Linux:
o Displaying the first few lines of a file:
o head filename.txt
o To display a specific number of lines:
o head -n 20 filename.txt

 In HDFS:
o Displaying the first few lines of a file:
o hdfs dfs -cat /path/to/file | head

(x) Displaying the Aggregate Length of a File

 In Linux:
o To get the size of a file:
o wc -c filename.txt

This will show the number of bytes in the file.

 In HDFS:
o To get the size of a file in HDFS:
o hdfs dfs -du -s /path/to/file

(xi) Checking the Permissions of a File

 In Linux:
o To check the permissions of a file:
o ls -l filename.txt

 In HDFS:
o To check the permissions of a file in HDFS:
o hdfs dfs -ls /path/to/file

(xii) Zipping and Unzipping Files (With & Without Permission)

 In Linux:
o To zip a file:
o zip myarchive.zip file1.txt file2.txt
o To unzip a file:
o unzip myarchive.zip
o To zip a directory:
o zip -r myarchive.zip mydirectory/
o To unzip a directory:
o unzip myarchive.zip -d mydirectory/

 Preserving permissions:

o zip does not preserve file permissions by default.
o To preserve permissions while archiving, you can use tar instead:
o tar -czvf myarchive.tar.gz myfile

 In HDFS (to copy the zip file into HDFS):


 hdfs dfs -put myarchive.zip /user/hadoop/

(xiii) Copy and Paste Commands

In both Linux and HDFS, the cp and mv commands serve as the basic commands for copying and moving files.

 Copying in Linux:
 cp source.txt destination.txt

 Moving in Linux:
 mv source.txt destination.txt

 Copying in HDFS:
 hdfs dfs -cp /hdfs/source_file /hdfs/destination_file

 Moving in HDFS:
 hdfs dfs -mv /hdfs/source_file /hdfs/destination_file

This covers a wide range of file management operations, both for Linux and HDFS. In HDFS, you typically interact
with files using commands like hdfs dfs -put, hdfs dfs -get, and hdfs dfs -rm. For regular Linux
file management, commands like cp, mv, rm, and ls are used. Understanding both environments helps with
effectively managing and processing large datasets in Hadoop, especially when moving between local and
distributed storage systems.
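Putting several of these commands together, the short session below (file and directory names are only examples) walks a file from the local filesystem into HDFS, inspects it, and brings it back:

# Create a working directory in HDFS and upload a local file
hdfs dfs -mkdir -p /user/hadoop/demo
hdfs dfs -put notes.txt /user/hadoop/demo/

# Inspect it: listing, first lines, size
hdfs dfs -ls /user/hadoop/demo
hdfs dfs -cat /user/hadoop/demo/notes.txt | head
hdfs dfs -du -s /user/hadoop/demo/notes.txt

# Copy it back to the local filesystem and clean up
hdfs dfs -get /user/hadoop/demo/notes.txt notes_copy.txt
hdfs dfs -rm -r /user/hadoop/demo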

4. Map-reducing

(i) Definition of Map-reduce

(ii) Its stages and terminologies

(iii) Word-count program to understand map-reduce (Mapper phase, Reducer phase,


Driver code)

MapReduce

MapReduce is a programming model used for processing and generating large datasets in a distributed computing
environment, such as Hadoop. It allows you to process data in parallel across many machines.
(i) Definition of MapReduce

MapReduce is a framework that allows you to process large amounts of data by splitting tasks into two main phases:

1. Map Phase: The input data is split into smaller chunks, and each chunk is processed independently. The
task in this phase is to process and transform data into key-value pairs.
2. Reduce Phase: After the map phase, the key-value pairs are grouped by key. The reduce function then
processes each group of key-value pairs to perform an aggregation or transformation.

The MapReduce framework is designed for distributed computing, where tasks are divided across a large number of
nodes (machines) to improve performance and handle massive amounts of data.

(ii) Stages and Terminologies of MapReduce

MapReduce operates in several stages:

1. Input Splitting:
o The input data is split into chunks called input splits, which are processed in parallel.
o Each chunk is processed by a separate Mapper task.

2. Map Phase (Mapper):


o The Mapper function processes each input split independently and outputs key-value pairs.
o The main job of the mapper is to process and filter input data into useful intermediate key-value
pairs.

3. Shuffle and Sort:


o After the map phase, shuffle and sort is the process of grouping the output of the mappers by
key.
o This step ensures that all values associated with the same key are sent to the same reducer.

4. Reduce Phase (Reducer):


o The Reducer function processes the sorted key-value pairs. It performs the computation (such as
sum, average, etc.) on each group of values with the same key.
o The output of the reducer is the final result of the MapReduce job.

5. Output:
o The final output of the MapReduce job is written to a specified location, usually in the form of a
set of key-value pairs.
Key Terminologies in MapReduce

 Mapper: A function that processes the input data and produces a set of intermediate key-value pairs.
 Reducer: A function that takes the intermediate key-value pairs (grouped by key) and processes them to
produce the final output.
 Input Split: The chunk of data that is passed to a mapper.
 Key-Value Pair: The fundamental unit of data processed in MapReduce, consisting of a key and a value.
 Shuffle and Sort: The process of grouping the intermediate key-value pairs by key before they are passed
to the reducer.
 Job: A single MapReduce task that consists of a Map phase and a Reduce phase.

(iii) Word Count Program to Understand MapReduce

In this example, we will implement a Word Count program in MapReduce. This program counts the number of
occurrences of each word in a large text file.

The MapReduce job consists of three main parts:

1. Mapper: Tokenizes the input text into words and emits a key-value pair (word, 1).
2. Reducer: Sums the counts for each word and outputs the final count.
3. Driver: Configures and runs the MapReduce job.

Word Count Program: Map-Reduce Example

1. Mapper Phase:
o The Mapper reads input data (a line of text), splits it into words, and emits each word as a key
with a value of 1.
o Example: "Hello world hello"
 Output: ("Hello", 1), ("world", 1), ("hello", 1)

Mapper Code (Java):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Split the line into words
        String[] words = value.toString().split("\\s+");

        for (String str : words) {
            word.set(str);
            context.write(word, one); // Emit each word with count 1
        }
    }
}

2. Shuffle and Sort:


o This phase occurs automatically between the Map and Reduce phases. It groups all values with
the same key (in our case, all occurrences of the same word).
o The output of the Mapper is shuffled and sorted by the key (word) before being sent to the
Reducer.

3. Reducer Phase:
o The Reducer receives the grouped key-value pairs (word, list of counts) and sums the counts for
each word.
o Example: ("Hello", [1, 1, 1]) → ("Hello", 3)

Reducer Code (Java):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;

        // Sum up all the counts for a particular word
        for (IntWritable val : values) {
            sum += val.get();
        }

        result.set(sum);
        context.write(key, result); // Emit the word and its total count
    }
}

4. Driver Code:

The Driver code sets up and configures the MapReduce job, specifying the input/output formats, mapper,
and reducer classes.
Driver Code (Java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        // Create a Hadoop job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // Set the main classes
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // Optional: use a combiner to optimize performance
        job.setReducerClass(WordCountReducer.class);

        // Set the output types for the job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));   // Input file
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output directory

        // Submit the job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Execution Flow:

1. Input: A text file, for example:

   Hello world
   Hello Hadoop
   Word count map reduce

2. Map Phase:
   o Each line is processed by the Mapper.
   o Output of the Mapper for the above input:
     ("Hello", 1)
     ("world", 1)
     ("Hello", 1)
     ("Hadoop", 1)
     ("Word", 1)
     ("count", 1)
     ("map", 1)
     ("reduce", 1)

3. Shuffle and Sort:
   o The output is grouped by key:
     ("Hello", [1, 1])
     ("world", [1])
     ("Hadoop", [1])
     ("Word", [1])
     ("count", [1])
     ("map", [1])
     ("reduce", [1])

4. Reduce Phase:
   o The reducer aggregates the values for each word:
     ("Hello", 2)
     ("world", 1)
     ("Hadoop", 1)
     ("Word", 1)
     ("count", 1)
     ("map", 1)
     ("reduce", 1)

5. Output: The final output is written to a file in the output directory:

   Hello 2
   world 1
   Hadoop 1
   Word 1
   count 1
   map 1
   reduce 1

The MapReduce paradigm helps process large datasets by splitting tasks into manageable chunks (Map phase) and
then aggregating results (Reduce phase). Understanding the Map, Shuffle & Sort, and Reduce phases is crucial for
writing efficient MapReduce programs. The Word Count example is a simple but classic use case that highlights
how MapReduce works in Hadoop.
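To run the three classes above on the cluster, one common approach (file, directory, and jar names here are only examples) is to compile against the Hadoop classpath, package a jar, and submit it with hadoop jar:

# Compile the mapper, reducer, and driver against the Hadoop classpath
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes WordCountMapper.java WordCountReducer.java WordCountDriver.java
jar cvf wordcount.jar -C classes .

# Put some input into HDFS and run the job
hdfs dfs -mkdir -p /user/hadoop/wc_input
hdfs dfs -put input.txt /user/hadoop/wc_input/
hadoop jar wordcount.jar WordCountDriver /user/hadoop/wc_input /user/hadoop/wc_output

# Inspect the result
hdfs dfs -cat /user/hadoop/wc_output/part-r-00000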

5. Implementing Matrix-Multiplication with Hadoop Map-reduce

Implementing Matrix Multiplication with Hadoop MapReduce

Problem Setup:

Consider two matrices:

 Matrix A of dimensions m x n.
 Matrix B of dimensions n x p.
We need to compute the product of these two matrices to get Matrix C of dimensions m x p.

The matrix multiplication formula is:

C[i, j] = \sum_{k=0}^{n-1} A[i, k] \times B[k, j]

where i is the row index of Matrix A, j is the column index of Matrix B, and k is the common dimension over which the summation runs.

MapReduce Approach

We can break down the process of matrix multiplication into the following steps in the
MapReduce framework:

1. Mapper Phase:
o Matrix A is processed so that each element A[i, k] is emitted once for every column j of Matrix B, with the key (i, j) and a value carrying the tag "A", the index k, and the element value.
o Matrix B is processed so that each element B[k, j] is emitted once for every row i of Matrix A, with the key (i, j) and a value carrying the tag "B", the index k, and the element value.
o In this way every output cell (i, j) receives all of the A and B values it needs, and the reducer can pair them up by the shared index k to perform the multiplication.

2. Shuffle and Sort:


o The Hadoop framework automatically performs this stage, where the key-value pairs are
grouped by their keys. This ensures that all values related to a particular index
combination (i, j) are sent to the same reducer.

3. Reducer Phase:
o For each unique combination of (i, j) key, the reducer will perform the multiplication
and summation as described by the matrix multiplication formula.

Matrix Multiplication with MapReduce:

Let’s now implement the Matrix Multiplication using MapReduce with the following
components:

1. Mapper: This will handle matrix elements from Matrix A and Matrix B.
2. Reducer: This will handle the computation of the matrix product using the emitted key-value
pairs.
1. Mapper Class:

In the Mapper class, we process both matrices (Matrix A and Matrix B) and emit intermediate key-value pairs keyed by the output cell (i, j):

 For Matrix A: for element A[i, k], emit the key (i, j) for every column j of B, with the value "A,k,A[i,k]".
 For Matrix B: for element B[k, j], emit the key (i, j) for every row i of A, with the value "B,k,B[k,j]".

Mapper Code (Java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class MatrixMultiplicationMapper extends Mapper<Object, Text, Text, Text> {

    private int m; // number of rows of Matrix A (= rows of C)
    private int p; // number of columns of Matrix B (= columns of C)

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        m = conf.getInt("matrix.m", 2); // set in the driver
        p = conf.getInt("matrix.p", 2); // set in the driver
    }

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Each input line has the form: <matrix tag> <row> <col> <value>, e.g. "A 0 1 2"
        String[] elements = value.toString().trim().split("\\s+");
        String tag = elements[0];

        if (tag.equals("A")) {
            // A[i, k] is needed for every column j of the result
            int i = Integer.parseInt(elements[1]);
            int k = Integer.parseInt(elements[2]);
            int valA = Integer.parseInt(elements[3]);
            for (int j = 0; j < p; j++) {
                context.write(new Text(i + "," + j), new Text("A," + k + "," + valA));
            }
        } else if (tag.equals("B")) {
            // B[k, j] is needed for every row i of the result
            int k = Integer.parseInt(elements[1]);
            int j = Integer.parseInt(elements[2]);
            int valB = Integer.parseInt(elements[3]);
            for (int i = 0; i < m; i++) {
                context.write(new Text(i + "," + j), new Text("B," + k + "," + valB));
            }
        }
    }
}

2. Reducer Class:

In the Reducer, we combine the contributions from Matrix A and Matrix B for each output cell.

For each (i, j) key from the mapper, we collect the corresponding values from Matrix A and Matrix B, match them on the shared index k, multiply each matched pair, and sum the products. Finally, we emit the result for C[i, j].
Reducer Code (Java):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MatrixMultiplicationReducer extends Reducer<Text, Text, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Values from A and B, indexed by the shared dimension k
        Map<Integer, Integer> matrixAValues = new HashMap<>();
        Map<Integer, Integer> matrixBValues = new HashMap<>();

        // Each value has the form "A,k,value" or "B,k,value"
        for (Text val : values) {
            String[] parts = val.toString().split(",");
            int k = Integer.parseInt(parts[1]);
            int v = Integer.parseInt(parts[2]);
            if (parts[0].equals("A")) {
                matrixAValues.put(k, v);
            } else {
                matrixBValues.put(k, v);
            }
        }

        // Perform matrix multiplication: sum of A[i,k] * B[k,j] over each k
        int sum = 0;
        for (Map.Entry<Integer, Integer> entry : matrixAValues.entrySet()) {
            Integer b = matrixBValues.get(entry.getKey());
            if (b != null) {
                sum += entry.getValue() * b;
            }
        }

        result.set(sum);
        context.write(key, result); // Emit (i, j) => C[i,j]
    }
}

3. Driver Code:

In the Driver class, we will configure the MapReduce job by specifying the mapper, reducer,
input, and output paths.

Driver Code (Java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiplicationDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Dimensions of the result matrix C (m x p); here 2 x 2 for the sample input
        conf.setInt("matrix.m", 2);
        conf.setInt("matrix.p", 2);

        Job job = Job.getInstance(conf, "Matrix Multiplication");

        job.setJarByClass(MatrixMultiplicationDriver.class);
        job.setMapperClass(MatrixMultiplicationMapper.class);
        job.setReducerClass(MatrixMultiplicationReducer.class);

        // The mapper emits (Text, Text); the reducer emits (Text, IntWritable)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // Input path for matrices A and B
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output path for result matrix C

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Input Format:

Assume the input files are stored in the following format:

Matrix A (matrixA.txt):

A 0 0 1
A 0 1 2
A 1 0 3
A 1 1 4

Matrix B (matrixB.txt):

B 0 0 5
B 0 1 6
B 1 0 7
B 1 1 8

Execution Flow:

1. Mapper processes both matrices and replicates each element to every result cell that needs it:

o For matrix A, element A[i, k] is emitted with the key (i, j) for every column j, tagged "A,k,value".
o For matrix B, element B[k, j] is emitted with the key (i, j) for every row i, tagged "B,k,value".

2. Reducer receives the intermediate key-value pairs for a specific (i, j) pair, matches the A and B values on k, performs the multiplication and summation, and outputs the result for C[i, j].
3. Output: The final result will be a file representing matrix C:

0,0 19
0,1 22
1,0 43
1,1 50
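To try this end to end, the sample input files can be uploaded to HDFS and the job submitted in the same way as the word-count example; the jar name and paths below are illustrative:

hdfs dfs -mkdir -p /user/hadoop/matrix_input
hdfs dfs -put matrixA.txt matrixB.txt /user/hadoop/matrix_input/
hadoop jar matrixmult.jar MatrixMultiplicationDriver /user/hadoop/matrix_input /user/hadoop/matrix_output
hdfs dfs -cat /user/hadoop/matrix_output/part-r-00000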

6. Compute Average Salary and Total Salary by Gender for an Enterprise.

To compute the Average Salary and Total Salary by gender for an enterprise using Hadoop MapReduce, we can
break down the process into two phases:

1. Mapper Phase: This will extract relevant information (salary and gender) from the input data and emit
key-value pairs.
2. Reducer Phase: This will aggregate the total salary and calculate the average salary by gender.

Problem Setup

Let's assume we have a dataset of employees with the following structure:

EmployeeID Name Gender Salary

1 John Male 5000

2 Jane Female 6000

3 Alice Female 7000

4 Bob Male 8000

5 Charlie Male 5500

We want to compute the total salary and average salary for each gender.

MapReduce Approach

1. Mapper Phase:
o The mapper will read each record (employee), and for each employee, it will emit a key-value
pair:
 The key will be the Gender (Male or Female).
 The value will be a composite value containing:
 The salary (to calculate the total).
 A count (to compute the average salary).
2. Reducer Phase:
o The reducer will receive all the key-value pairs grouped by gender.
o For each gender, the reducer will calculate:
 The Total Salary by summing all the salaries for that gender.
 The Average Salary by dividing the total salary by the number of employees of that
gender.
Step-by-Step Code Implementation

1. Mapper Class (Java)

In the Mapper, we will read the input (which contains EmployeeID, Name, Gender, and Salary), and for each
employee, we will emit the gender as the key and the salary with a count of 1 as the value.

Mapper Code (Java):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class SalaryMapper extends Mapper<Object, Text, Text, Text> {

    private Text gender = new Text();
    private Text salaryData = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();

        // Skip the header row if there is one
        if (line.startsWith("EmployeeID")) {
            return;
        }

        // Split the input line to get the employee details
        String[] fields = line.split(",");

        String empGender = fields[2]; // Gender column
        String salary = fields[3];    // Salary column

        gender.set(empGender);
        salaryData.set(salary + ",1"); // Send salary and count (1) for average computation

        context.write(gender, salaryData);
    }
}

Explanation:

 Gender is the key (Male/Female).


 SalaryData is a value consisting of two parts: the salary and a count (1). We will use this for both total and
average salary computation.

2. Reducer Class (Java)

The Reducer will receive all the salary values grouped by gender. It will then sum the salaries and count the number
of records for each gender.

Reducer Code (Java):


import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class SalaryReducer extends Reducer<Text, Text, Text, Text> {

    private Text result = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        int totalSalary = 0;
        int count = 0;

        // Iterate through the values to calculate the total salary and count
        for (Text val : values) {
            String[] salaryData = val.toString().split(",");
            int salary = Integer.parseInt(salaryData[0]);
            int num = Integer.parseInt(salaryData[1]);

            totalSalary += salary;
            count += num;
        }

        // Calculate the average salary
        double averageSalary = (double) totalSalary / count;

        // Create the result text with the total and average salary
        result.set("Total Salary: " + totalSalary + ", Average Salary: " + averageSalary);

        // Write output in the form: gender => (Total Salary, Average Salary)
        context.write(key, result);
    }
}

Explanation:

 We accumulate the total salary and count of records for each gender.
 The average salary is computed by dividing the total salary by the count.
 The output is written as Gender => (Total Salary, Average Salary).

3. Driver Code (Java)

In the Driver class, we will configure and set up the MapReduce job, specifying the input, output paths, and the
Mapper and Reducer classes.

Driver Code (Java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalaryDriver {


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Salary Calculation by Gender");

job.setJarByClass(SalaryDriver.class);
job.setMapperClass(SalaryMapper.class);
job.setReducerClass(SalaryReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new Path(args[0])); // Input path
FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output path

System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Explanation:

 The Mapper and Reducer classes are set in the job configuration.
 FileInputFormat is used to specify the input file (containing the employee data).
 FileOutputFormat is used to specify where the output will be written.

Input Format:

The input file should have employee data in the following format (CSV):

EmployeeID,Name,Gender,Salary
1,John,Male,5000
2,Jane,Female,6000
3,Alice,Female,7000
4,Bob,Male,8000
5,Charlie,Male,5500

Execution Flow:

1. Mapper Phase:
o The Mapper processes each employee record, extracting the Gender and Salary.
o For each record, it emits a key-value pair where the key is the gender and the value is the salary
and a count of 1.

2. Shuffle and Sort:


o Hadoop automatically groups all values by the gender key, so the reducer receives all salary
records for each gender.
3. Reducer Phase:
o The Reducer aggregates the total salary and counts the number of employees for each gender.
o It calculates the average salary for each gender and emits the result.

Output Format:

The output will look like:

Male Total Salary: 18500, Average Salary: 6166.666666666667

Female Total Salary: 13000, Average Salary: 6500.0

This MapReduce job calculates the Total Salary and Average Salary for employees by gender in an enterprise. By
using the Mapper to emit gender-based salary data and the Reducer to aggregate the results, we can efficiently
compute these statistics even for large datasets distributed across a Hadoop cluster.
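Because the sample dataset is small, the expected totals can be sanity-checked locally before (or after) running the job; the one-liner below (the file name is illustrative) groups the CSV by the Gender column and prints totals and averages:

# Local sanity check of total and average salary per gender
awk -F, 'NR > 1 { sum[$3] += $4; cnt[$3]++ } END { for (g in sum) printf "%s Total Salary: %d, Average Salary: %.2f\n", g, sum[g], sum[g]/cnt[g] }' employees.csv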
