EX-1.
Install Hadoop and Implement the following file management tasks in Hadoop:
• Adding files and directories
• Retrieving files
• Deleting files and directories.
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies
them into HDFS using the command-line utilities shown in Step 2 below.
1. Prerequisites
• Install Java (JDK 1.8 or later)
o Download and install Java from Oracle or OpenJDK.
o Set JAVA_HOME in environment variables.
• JAVA_HOME = C:\Java\jdk-XX.X
• Install WinRAR or 7-Zip
o Needed for extracting Hadoop binaries.
• Install Hadoop for Windows
https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
o Download Hadoop binary for Windows.
o Extract the Hadoop folder (e.g., C:\Hadoop).
Note: Command to extract the archive:
tar -xvzf C:\Users\SBDR2\Downloads\hadoop-3.3.1.tar.gz -C C:/Hadoop/
2. Configure Environment Variables
1. Add the following system variables:
o HADOOP_HOME → C:\hadoop
o Add %HADOOP_HOME%\bin to the Path variable.
2. Set the HADOOP_CONF_DIR environment variable:
o HADOOP_CONF_DIR → %HADOOP_HOME%\etc\hadoop
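To verify the variables, open a new Command Prompt and run:
hadoop version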
3. Configure Hadoop
1. Edit hadoop-env.cmd
Open C:\hadoop\etc\hadoop\hadoop-env.cmd and update:
set JAVA_HOME=C:\Java\jdk-XX.X
(Use a path without spaces; a JAVA_HOME under C:\Program Files causes hadoop-env.cmd to fail.)
2. Edit core-site.xml (C:\hadoop\etc\hadoop\core-site.xml):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
3. Edit hdfs-site.xml (C:\hadoop\etc\hadoop\hdfs-site.xml):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\Hadoop\hadoop-3.3.1\data\namenode\</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\Hadoop\hadoop-3.3.1\data\datanode\</value>
  </property>
</configuration>
4. Edit mapred-site.xml (C:\hadoop\etc\hadoop\mapred-site.xml):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
5. Edit yarn-site.xml (C:\hadoop\etc\hadoop\yarn-site.xml):
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
6. Format the NameNode:
o Open Command Prompt as Administrator and run:
hdfs namenode -format
7. Start Hadoop services:
o Open Command Prompt, navigate to C:\hadoop\sbin, and run:
start-dfs.cmd (or start-all.cmd to also start YARN)
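To confirm the daemons are up, jps (part of the JDK) should list NameNode and DataNode, plus ResourceManager and NodeManager if YARN was started:
jps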
Step 2: Perform File Management in Hadoop (HDFS)
Use the Hadoop file system (HDFS) commands to manage files.
1. Adding Files and Directories
• Create a new directory in HDFS:
hdfs dfs -mkdir /mydir
• Upload a file to HDFS:
hdfs dfs -put C:\example.txt /mydir/
2. Retrieving Files
• List files in a directory:
hdfs dfs -ls /mydir
• Download a file from HDFS:
hdfs dfs -get /mydir/example.txt C:\retrieved_example.txt
• Read a file from HDFS:
hdfs dfs -cat /mydir/example.txt
3. Deleting Files and Directories
• Delete a file:
hdfs dfs -rm /mydir/example.txt
• Delete a directory:
hdfs dfs -rm -r /mydir
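Putting the three tasks together, a minimal end-to-end session in the spirit of the hint above (assuming a local log file C:\logs\app.log exists):
hdfs dfs -mkdir -p /logs
hdfs dfs -put C:\logs\app.log /logs/
hdfs dfs -ls /logs
hdfs dfs -cat /logs/app.log
hdfs dfs -get /logs/app.log C:\app_copy.log
hdfs dfs -rm -r /logs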
Step 3: Stop Hadoop Services
To stop Hadoop services, run:
stop-dfs.cmd (and stop-yarn.cmd if YARN was started)
EX-2 Develop a MapReduce program to implement Matrix Multiplication
This MapReduce program performs matrix multiplication on Hadoop and follows the standard
MapReduce paradigm, where:
1. Mapper: Processes input matrix elements and emits intermediate key-value pairs.
2. Reducer: Aggregates values by key and performs the multiplication and summation.
Matrix Multiplication Logic
Given two matrices:
• Matrix A (m × n)
• Matrix B (n × p)
the resulting matrix C (m × p) is computed as:
C(i, j) = Σ_{k=0}^{n-1} A(i, k) × B(k, j)
Each element of the result matrix is the sum of element-wise multiplications across the shared
dimension.
Hadoop MapReduce Implementation
The code consists of:
• Mapper: Reads input matrices and emits intermediate key-value pairs.
• Reducer: Aggregates values and computes the final multiplication results.
• Driver: Configures and runs the job.
Here's the complete implementation:
1. Mapper Class
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class MatrixMultiplicationMapper extends Mapper<Object, Text, Text, Text> {
    // Dimensions of the sample matrices: A is 2 x 3, B is 3 x 2 (so C is 2 x 2)
    private static final int M = 2; // rows of A
    private static final int P = 2; // columns of B

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String[] tokens = value.toString().split(",");
        String matrixName = tokens[0]; // identifies whether the element belongs to matrix A or B
        int row = Integer.parseInt(tokens[1]);
        int col = Integer.parseInt(tokens[2]);
        int val = Integer.parseInt(tokens[3]);
        if (matrixName.equals("A")) {
            // A(row, col) contributes to C(row, j) for every column j of B
            for (int j = 0; j < P; j++) {
                context.write(new Text(row + "," + j), new Text("A," + col + "," + val));
            }
        } else if (matrixName.equals("B")) {
            // B(row, col) contributes to C(i, col) for every row i of A
            for (int i = 0; i < M; i++) {
                context.write(new Text(i + "," + col), new Text("B," + row + "," + val));
            }
        }
    }
}
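The matrix dimensions above are hardcoded for the sample input. One possible refinement (a sketch, not part of the original program) is to pass them through the job Configuration, using hypothetical property names such as matrix.m and matrix.p:
// In the driver, before submitting the job (hypothetical property names):
conf.setInt("matrix.m", 2); // rows of A
conf.setInt("matrix.p", 2); // columns of B

// In the mapper, replace the hardcoded constants with fields initialized in setup():
private int m;
private int p;

@Override
protected void setup(Context context) {
    m = context.getConfiguration().getInt("matrix.m", 2);
    p = context.getConfiguration().getInt("matrix.p", 2);
}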
2. Reducer Class
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MatrixMultiplicationReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Collect the A and B contributions for this output cell, keyed by the shared index k
        Map<Integer, Integer> matrixA = new HashMap<>();
        Map<Integer, Integer> matrixB = new HashMap<>();
        for (Text val : values) {
            String[] parts = val.toString().split(",");
            String matrixName = parts[0];
            int index = Integer.parseInt(parts[1]);
            int value = Integer.parseInt(parts[2]);
            if (matrixName.equals("A")) {
                matrixA.put(index, value);
            } else if (matrixName.equals("B")) {
                matrixB.put(index, value);
            }
        }
        // C(i, j) = sum over k of A(i, k) * B(k, j)
        int sum = 0;
        for (Map.Entry<Integer, Integer> entry : matrixA.entrySet()) {
            int k = entry.getKey();
            if (matrixB.containsKey(k)) {
                sum += entry.getValue() * matrixB.get(k);
            }
        }
        context.write(key, new Text(Integer.toString(sum)));
    }
}
3. Driver Class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiplicationDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Matrix Multiplication");
        job.setJarByClass(MatrixMultiplicationDriver.class);
        job.setMapperClass(MatrixMultiplicationMapper.class);
        job.setReducerClass(MatrixMultiplicationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
4. Input Format
Input matrices are given in CSV format, one element per line: matrix,row,column,value.
Input File (input.txt)
A,0,0,1
A,0,1,2
A,0,2,3
A,1,0,4
A,1,1,5
A,1,2,6
B,0,0,7
B,0,1,8
B,1,0,9
B,1,1,10
B,2,0,11
B,2,1,12
5. Running the Program
Compilation & Execution
hadoop com.sun.tools.javac.Main MatrixMultiplicationMapper.java MatrixMultiplicationReducer.java MatrixMultiplicationDriver.java
jar cf matrixmultiplication.jar *.class
hadoop jar matrixmultiplication.jar MatrixMultiplicationDriver input output
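Once the job finishes, the result can be read directly from HDFS (the output directory matches the second argument above):
hdfs dfs -cat output/part-r-00000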
Output Format
Output will contain matrix elements in the format row,column \t value.
For the above input, expected output:
0,0 58
0,1 64
1,0 139
1,1 154
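As a quick check against the formula: C(0,0) = A(0,0)×B(0,0) + A(0,1)×B(1,0) + A(0,2)×B(2,0) = 1×7 + 2×9 + 3×11 = 58, which matches the first output row; the remaining cells follow the same pattern.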
EX-3 Develop a MapReduce program that mines weather data and displays appropriate
messages indicating the weather conditions of the day.
1. Create WeatherAnalyzer.java
Save the following code as WeatherAnalyzer.java:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherAnalyzer {

    // Mapper Class: emits (date, temperature) for each input record
    public static class WeatherMapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text date = new Text();
        private IntWritable temperature = new IntWritable();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 4) {
                String dateStr = fields[0] + "-" + fields[1] + "-" + fields[2]; // YYYY-MM-DD format
                int temp = Integer.parseInt(fields[3].trim());
                date.set(dateStr);
                temperature.set(temp);
                context.write(date, temperature);
            }
        }
    }

    // Reducer Class: classifies each day from its maximum temperature
    public static class WeatherReducer extends Reducer<Text, IntWritable, Text, Text> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int maxTemp = Integer.MIN_VALUE;
            for (IntWritable val : values) {
                maxTemp = Math.max(maxTemp, val.get());
            }
            String weatherCondition;
            if (maxTemp > 35) {
                weatherCondition = "Hot Day";
            } else if (maxTemp < 15) {
                weatherCondition = "Cold Day";
            } else {
                weatherCondition = "Moderate Day";
            }
            context.write(key, new Text(weatherCondition + " (Max Temp: " + maxTemp + "°C)"));
        }
    }

    // Main method to configure and run the MapReduce job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Weather Analysis");
        job.setJarByClass(WeatherAnalyzer.class);
        job.setMapperClass(WeatherMapper.class);
        job.setReducerClass(WeatherReducer.class);
        // Mapper and reducer emit different value types, so set them separately
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
2. Compile the Program
Open Command Prompt (cmd) and navigate to the folder where WeatherAnalyzer.java is
saved:
javac -classpath C:\hadoop\share\hadoop\common\hadoop-common-3.x.x.jar;C:\hadoop\share\hadoop\mapreduce\hadoop-mapreduce-client-core-3.x.x.jar -d . WeatherAnalyzer.java
Create a JAR file:
jar cf WeatherAnalyzer.jar WeatherAnalyzer*.class
3. Prepare Input Data
Create a sample weather_data.txt file with the following content:
2025,03,09,32,Sunny
2025,03,10,15,Rainy
2025,03,11,40,Sunny
2025,03,12,10,Cloudy
2025,03,13,25,Clear
Upload it to HDFS:
hdfs dfs -mkdir /user/weather
hdfs dfs -put weather_data.txt /user/weather/
4. Run the MapReduce Job
Execute the Hadoop job using:
hadoop jar WeatherAnalyzer.jar WeatherAnalyzer /user/weather/weather_data.txt
/user/weather/output
5. View the Output
To check results, run:
hdfs dfs -cat /user/weather/output/part-r-00000
Expected Output
2025-03-09 Moderate Day (Max Temp: 32°C)
2025-03-10 Moderate Day (Max Temp: 15°C)
2025-03-11 Hot Day (Max Temp: 40°C)
2025-03-12 Cold Day (Max Temp: 10°C)
2025-03-13 Moderate Day (Max Temp: 25°C)
• This guide sets up Hadoop on Windows 11.
• The MapReduce job processes weather data and classifies days as Hot, Cold, or
Moderate.
• The program successfully runs and displays weather conditions.
EX-4 Develop a MapReduce program to find the tags associated with each movie by
analyzing MovieLens data.
Sample Input (tags.csv):
userId,movieId,tag,timestamp
15,4141,funny,1234567890
20,4141,hilarious,1234567891
33,1234,inspiring,1234567892
42,1234,emotional,1234567893
56,7896,action,1234567894
Expected Output:
1234 inspiring emotional
4141 funny hilarious
7896 action
Complete Java Program (MovieTags.java):
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class MovieTags {

    public static class MovieMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Skip header line
            if (key.get() == 0 && value.toString().contains("userId")) return;
            String[] tokens = value.toString().split(",");
            if (tokens.length < 3) return; // skip malformed lines
            String movieId = tokens[1].trim();
            String tag = tokens[2].trim();
            context.write(new Text(movieId), new Text(tag));
        }
    }

    public static class MovieReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Concatenate all tags seen for this movieId
            StringBuilder tags = new StringBuilder();
            for (Text val : values) {
                tags.append(val.toString()).append(" ");
            }
            context.write(key, new Text(tags.toString().trim()));
        }
    }
}
Driver Class (MovieTagsDriver.java):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MovieTagsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Movie Tags");
        job.setJarByClass(MovieTags.class);
        job.setMapperClass(MovieTags.MovieMapper.class);
        job.setReducerClass(MovieTags.MovieReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input file path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
How to Compile and Run:
Compile:
javac -classpath $(hadoop classpath) -d . MovieTags.java MovieTagsDriver.java
jar -cvf movietags.jar *.class
Run:
hadoop jar movietags.jar MovieTagsDriver /input/tags.csv /output/movietags
EX-5: Implement Functions: Count – Sort – Limit – Skip – Aggregate using MongoDB
db.movies.count();
db.movies.find().sort({ rating: 1 });
db.movies.find().limit(5);
db.movies.find().skip(10);
db.movies.aggregate([{ $group: { _id: "$genre", count: { $sum: 1 } } }]);
EX-6: Develop Pig Latin scripts to sort, group, join, project, and filter the data.
students = LOAD 'students.csv' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
sorted_students = ORDER students BY marks DESC;
grouped_students = GROUP students BY name;
filtered_students = FILTER students BY marks > 50;
DUMP sorted_students;
DUMP grouped_students;
DUMP filtered_students;
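The script above covers sorting, grouping, and filtering; the exercise also asks for projection and a join. A minimal sketch, assuming a second (hypothetical) input file courses.csv with columns (id, course):
courses = LOAD 'courses.csv' USING PigStorage(',') AS (id:int, course:chararray);
projected_students = FOREACH students GENERATE id, name;  -- projection
joined_data = JOIN students BY id, courses BY id;         -- join on student id
DUMP projected_students;
DUMP joined_data;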
EX-7: Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
CREATE DATABASE mydb;
USE mydb;
CREATE TABLE students (id INT, name STRING, marks INT) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',';
ALTER TABLE students ADD COLUMNS (age INT);
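-- The exercise also asks for views, functions, and indexes. A minimal sketch that assumes
-- the students table above (run before the DROP statements below); note that index DDL
-- exists only in Hive 1.x/2.x and was removed in Hive 3:
CREATE VIEW top_students AS SELECT id, name, marks FROM students WHERE marks > 80;
SHOW FUNCTIONS;
DESCRIBE FUNCTION upper;
CREATE INDEX idx_marks ON TABLE students (marks) AS 'COMPACT' WITH DEFERRED REBUILD;
DROP INDEX idx_marks ON students;
DROP VIEW top_students;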
DROP TABLE students;
DROP DATABASE mydb;
EX-8: Implement a word count program in Hadoop and Spark.
Hadoop Streaming (Python)
mapper.py
import sys

# Emit (word, 1) for every word on stdin
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print(f"{word}\t1")
reducer.py
import sys
from collections import defaultdict

# Sum the counts emitted by the mapper
word_count = defaultdict(int)
for line in sys.stdin:
    word, count = line.strip().split("\t")
    word_count[word] += int(count)
for word, count in word_count.items():
    print(f"{word}\t{count}")
Run in Hadoop:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -input input.txt -output output \
  -mapper mapper.py -reducer reducer.py
PySpark Word Count
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
text_file = spark.read.text("hdfs://path-to-textfile").rdd.map(lambda r: r[0])
word_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
word_counts.saveAsTextFile("hdfs://output")
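To run the PySpark script on the cluster (saved here, hypothetically, as wordcount.py):
spark-submit wordcount.py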
EX-9: Use CDH (Cloudera Distribution for Hadoop) and HUE (Hadoop User Experience) to analyze data
and generate reports for sample datasets.
Installation (Linux)
wget https://archive.cloudera.com/cm7/7.4.0/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin
Hue Interface for Data Reports
1. Access Hue via http://localhost:8888
2. Log in and use the Query Editors for Hive, Pig, or Spark.
Objective:
Use Cloudera CDH and HUE to:
• Upload and analyze a sample dataset (e.g., movies.csv, tags.csv, etc.)
• Run Hive or Impala queries
• Generate reports and visualizations
Tools Used:
• Cloudera CDH – Hadoop ecosystem (HDFS, Hive, Impala, etc.)
• HUE (Hadoop User Experience) – Web-based GUI to interact with Hadoop components
Sample Dataset: movies.csv
Format:
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
Steps to Perform the Task:
Step 1: Login to HUE
URL: http://<cloudera-ip>:8888
Log in with your credentials.
Step 2: Upload Dataset to HDFS via HUE
1. Go to File Browser
2. Click Upload
3. Upload your dataset (e.g., movies.csv) to /user/<your-username>/
Step 3: Create Table in Hive/Impala
Go to Query Editors → Hive (or Impala) and run:
CREATE TABLE movies (
  movieId INT,
  title STRING,
  genres STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
Step 4: Load Data into Table
LOAD DATA INPATH '/user/<your-username>/movies.csv' INTO TABLE movies;
Step 5: Analyze Data Using Queries
Example 1: Get all movies of genre "Comedy"
SELECT title FROM movies WHERE genres LIKE '%Comedy%';
Example 2: Count movies by genre
SELECT genres, COUNT(*) as total_movies
FROM movies
GROUP BY genres
ORDER BY total_movies DESC;
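Since genres holds a pipe-separated list, counting by individual genre requires splitting the field first; a sketch using Hive's LATERAL VIEW explode:
SELECT genre, COUNT(*) AS total_movies
FROM movies
LATERAL VIEW explode(split(genres, '\\|')) g AS genre
GROUP BY genre
ORDER BY total_movies DESC;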
Step 6: Generate Reports
In HUE:
1. Run any query (e.g., above)
2. Click on "Visualize" (pie chart or bar chart symbol)
3. Choose chart type (bar, pie, line, etc.)
4. Set dimensions (e.g., genres vs total_movies)
5. Save or export the report
Example Report Output:
Genre        Total Movies
Comedy       1200
Drama        950
Action       700
Adventure    450