BDA

The document outlines the installation and configuration of Hadoop on Windows, including prerequisites and file management tasks such as adding, retrieving, and deleting files in HDFS. It also details the development of MapReduce programs for matrix multiplication, weather data analysis, and movie tag extraction, providing code examples and expected outputs. Additionally, it includes steps for compiling and running the MapReduce jobs along with input data preparation.

EX-1.

Install Hadoop and Implement the following file management tasks in Hadoop:

• Adding files and directories

• Retrieving files

• Deleting files and directories.

Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the command-line utilities shown below.

1. Prerequisites

• Install Java (JDK 1.8 or later)

o Download and install Java from Oracle or OpenJDK.

o Set JAVA_HOME in environment variables.

• JAVA_HOME = C:\Java\jdk-XX.X

• Install WinRAR or 7-Zip

o Needed for extracting Hadoop binaries.

• Install Hadoop for Windows

https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz

o Download Hadoop binary for Windows.

o Extract the Hadoop folder (e.g., C:\Hadoop).

Note: command to extract the tar.gz archive:

tar -xvzf C:\Users\SBDR2\Downloads\hadoop-3.3.1.tar.gz -C C:/Hadoop/

2. Configure Environment Variables

1. Add the following system variables:

o HADOOP_HOME → C:\hadoop

o Add %HADOOP_HOME%\bin to the Path variable.

2. Set the HADOOP_CONF_DIR environment variable:

o HADOOP_CONF_DIR → %HADOOP_HOME%\etc\hadoop
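These can also be set from an elevated Command Prompt instead of the GUI dialog. A minimal sketch, assuming Hadoop was extracted to C:\hadoop (setx /M writes system-wide variables and silently truncates values longer than 1024 characters, so it is safer to edit Path itself through the GUI):

setx /M HADOOP_HOME "C:\hadoop"
setx /M HADOOP_CONF_DIR "C:\hadoop\etc\hadoop"

Then append %HADOOP_HOME%\bin to Path via System Properties → Environment Variables.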

3. Configure Hadoop

1. Edit hadoop-env.cmd

Open C:\hadoop\etc\hadoop\hadoop-env.cmd and update:

set JAVA_HOME=C:\Program Files\Java\jdk-XX.X

(If your JDK path contains spaces, as C:\Program Files does, use the 8.3 short form instead, e.g. set JAVA_HOME=C:\Progra~1\Java\jdk-XX.X, since Hadoop's Windows scripts do not handle spaces in JAVA_HOME well.)


2. Edit core-site.xml (C:\hadoop\etc\hadoop\core-site.xml):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

3. Edit hdfs-site.xml (C:\hadoop\etc\hadoop\hdfs-site.xml):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\Hadoop\hadoop-3.3.1\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\Hadoop\hadoop-3.3.1\data\datanode</value>
  </property>
</configuration>

4. Edit mapred-site.xml (C:\hadoop\etc\hadoop\mapred-site.xml):

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

5. Edit yarn-site.xml (C:\hadoop\etc\hadoop\yarn-site.xml):

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

6. Format the Namenode:

o Open Command Prompt as Administrator and run:

hdfs namenode -format

7. Start Hadoop services:

o Open Command Prompt, navigate to C:\hadoop\sbin (the start scripts live in sbin, not bin), and run:

o start-dfs.cmd or start-all.cmd
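To confirm the daemons started, you can run the JDK's jps tool in a new Command Prompt:

jps

A healthy single-node setup typically lists NameNode and DataNode, plus ResourceManager and NodeManager if YARN was started with start-all.cmd.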

Step 2: Perform File Management in Hadoop (HDFS)

Use the Hadoop file system (HDFS) commands to manage files.

1. Adding Files and Directories

• Create a new directory in HDFS:

hdfs dfs -mkdir /mydir

• Upload a file to HDFS:

hdfs dfs -put C:\example.txt /mydir/

2. Retrieving Files

• List files in a directory:

hdfs dfs -ls /mydir


• Download a file from HDFS:

hdfs dfs -get /mydir/example.txt C:\retrieved_example.txt

• Read a file from HDFS:

hdfs dfs -cat /mydir/example.txt

3. Deleting Files and Directories

• Delete a file:

hdfs dfs -rm /mydir/example.txt

• Delete a directory:

hdfs dfs -rm -r /mydir

Step 3: Stop Hadoop Services

To stop Hadoop services, run:

stop-dfs.cmd

EX-2 Develop a MapReduce program to implement Matrix Multiplication

MapReduce program to perform matrix multiplication using Hadoop. The program follows the standard MapReduce paradigm, where:

1. Mapper: Processes input matrix elements and emits intermediate key-value pairs.

2. Reducer: Aggregates values based on the key and performs multiplication and summation.

Matrix Multiplication Logic

Given two matrices:

• Matrix A (m × n)

• Matrix B (n × p)

The resulting matrix C (m × p) is computed as:

C(i, j) = \sum_{k=0}^{n-1} A(i, k) \times B(k, j)

Each element in the result matrix is the sum of element-wise multiplications across a shared dimension.
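For example, with the sample input given below (A is 2 × 3 and B is 3 × 2), the first element of the result works out as:

C(0, 0) = A(0,0)·B(0,0) + A(0,1)·B(1,0) + A(0,2)·B(2,0) = 1×7 + 2×9 + 3×11 = 58

which matches the first line of the expected output.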

Hadoop MapReduce Implementation

The code consists of:

• Mapper: Reads input matrices and emits intermediate key-value pairs.

• Reducer: Aggregates values and computes the final multiplication results.

• Driver: Configures and runs the job.

Here's the complete implementation:

1. Mapper Class

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MatrixMultiplicationMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] tokens = value.toString().split(",");
        String matrixName = tokens[0]; // Identify if it's Matrix A or B
        int row = Integer.parseInt(tokens[1]);
        int col = Integer.parseInt(tokens[2]);
        int val = Integer.parseInt(tokens[3]);

        if (matrixName.equals("A")) {
            // Emit A(row, col) once per column of the result matrix
            for (int k = 0; k < 2; k++) { // Assuming B has 2 columns, as in the sample input (modify accordingly)
                context.write(new Text(row + "," + k), new Text("A," + col + "," + val));
            }
        } else if (matrixName.equals("B")) {
            // Emit B(row, col) once per row of the result matrix
            for (int i = 0; i < 2; i++) { // Assuming A has 2 rows, as in the sample input (modify accordingly)
                context.write(new Text(i + "," + col), new Text("B," + row + "," + val));
            }
        }
    }
}
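To make the shuffle concrete: the sample record A,0,1,2 (i.e., A(0,1) = 2) is emitted under the keys "0,0" and "0,1" as the value "A,1,2", while B,1,0,9 (B(1,0) = 9) is emitted under "0,0" and "1,0" as "B,1,9". The reducer for key "0,0" therefore receives both values, matches them on the shared index 1, and adds 2 × 9 = 18 toward C(0,0).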

2. Reducer Class

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MatrixMultiplicationReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<Integer, Integer> matrixA = new HashMap<>();
        Map<Integer, Integer> matrixB = new HashMap<>();

        // Separate incoming values by source matrix, keyed by the shared index k
        for (Text val : values) {
            String[] parts = val.toString().split(",");
            String matrixName = parts[0];
            int index = Integer.parseInt(parts[1]);
            int value = Integer.parseInt(parts[2]);
            if (matrixName.equals("A")) {
                matrixA.put(index, value);
            } else if (matrixName.equals("B")) {
                matrixB.put(index, value);
            }
        }

        // Multiply matching pairs A(i,k) * B(k,j) and sum over k
        int sum = 0;
        for (Map.Entry<Integer, Integer> entry : matrixA.entrySet()) {
            int index = entry.getKey();
            if (matrixB.containsKey(index)) {
                sum += entry.getValue() * matrixB.get(index);
            }
        }
        context.write(key, new Text(Integer.toString(sum)));
    }
}

3. Driver Class

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiplicationDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Matrix Multiplication");
        job.setJarByClass(MatrixMultiplicationDriver.class);
        job.setMapperClass(MatrixMultiplicationMapper.class);
        job.setReducerClass(MatrixMultiplicationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4. Input Format

Input matrices should be in CSV format:

Input File (input.txt)

A,0,0,1

A,0,1,2

A,0,2,3
A,1,0,4

A,1,1,5

A,1,2,6

B,0,0,7

B,0,1,8

B,1,0,9

B,1,1,10

B,2,0,11

B,2,1,12

5. Running the Program

Compilation & Execution

hadoop com.sun.tools.javac.Main MatrixMultiplicationMapper.java MatrixMultiplicationReducer.java MatrixMultiplicationDriver.java

jar cf matrixmultiplication.jar *.class

hadoop jar matrixmultiplication.jar MatrixMultiplicationDriver input output

Output Format

Output will contain matrix elements in the format row,column \t value.

For the above input, expected output:

0,0 58

0,1 64

1,0 139

1,1 154

EX-3 Develop a MapReduce program that mines weather data and displays appropriate messages indicating the weather conditions of the day.

1. Create WeatherAnalyzer.java

Save the following code as WeatherAnalyzer.java:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherAnalyzer {

    // Mapper Class
    public static class WeatherMapper extends Mapper<Object, Text, Text, IntWritable> {
        private Text date = new Text();
        private IntWritable temperature = new IntWritable();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 4) {
                String dateStr = fields[0] + "-" + fields[1] + "-" + fields[2]; // YYYY-MM-DD format
                int temp = Integer.parseInt(fields[3]);
                date.set(dateStr);
                temperature.set(temp);
                context.write(date, temperature);
            }
        }
    }

    // Reducer Class
    public static class WeatherReducer extends Reducer<Text, IntWritable, Text, Text> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxTemp = Integer.MIN_VALUE;
            for (IntWritable val : values) {
                maxTemp = Math.max(maxTemp, val.get());
            }
            String weatherCondition;
            if (maxTemp > 35) {
                weatherCondition = "Hot Day";
            } else if (maxTemp < 15) {
                weatherCondition = "Cold Day";
            } else {
                weatherCondition = "Moderate Day";
            }
            context.write(key, new Text(weatherCondition + " (Max Temp: " + maxTemp + "°C)"));
        }
    }

    // Main method to configure and run the MapReduce job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Weather Analysis");
        job.setJarByClass(WeatherAnalyzer.class);
        job.setMapperClass(WeatherMapper.class);
        job.setReducerClass(WeatherReducer.class);
        // Mapper and reducer emit different value types, so set both explicitly
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2. Compile the Program

Open Command Prompt (cmd) and navigate to the folder where WeatherAnalyzer.java is saved:

javac -classpath C:\hadoop\share\hadoop\common\hadoop-common-3.x.x.jar;C:\hadoop\share\hadoop\mapreduce\hadoop-mapreduce-client-core-3.x.x.jar -d . WeatherAnalyzer.java

Create a JAR file:

jar cf WeatherAnalyzer.jar WeatherAnalyzer*.class

3. Prepare Input Data

Create a sample weather_data.txt file with the following content:

2025,03,09,32,Sunny

2025,03,10,15,Rainy

2025,03,11,40,Sunny

2025,03,12,10,Cloudy

2025,03,13,25,Clear

Upload it to HDFS:

hdfs dfs -mkdir /user/weather

hdfs dfs -put weather_data.txt /user/weather/

4. Run the MapReduce Job

Execute the Hadoop job using:

hadoop jar WeatherAnalyzer.jar WeatherAnalyzer /user/weather/weather_data.txt

/user/weather/output

5. View the Output

To check results, run:

hdfs dfs -cat /user/weather/output/part-r-00000

Expected Output

2025-03-09 Moderate Day (Max Temp: 32°C)

2025-03-10 Moderate Day (Max Temp: 15°C)

2025-03-11 Hot Day (Max Temp: 40°C)

2025-03-12 Cold Day (Max Temp: 10°C)


2025-03-13 Moderate Day (Max Temp: 25°C)

• This guide sets up Hadoop on Windows 11.

• The MapReduce job processes weather data and classifies days as Hot, Cold, or

Moderate.

• The program successfully runs and displays weather conditions.

EX-4 Develop a MapReduce program to find the tags associated with each movie by analyzing MovieLens data.

Sample input (tags.csv):

userId,movieId,tag,timestamp
15,4141,funny,1234567890
20,4141,hilarious,1234567891
33,1234,inspiring,1234567892
42,1234,emotional,1234567893
56,7896,action,1234567894

Expected Output:

1234 inspiring emotional
4141 funny hilarious
7896 action

(The order of tags within a movie may vary, since MapReduce does not guarantee the ordering of values for a key.)

Complete Java Program (MovieTags.java):

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class MovieTags {

    public static class MovieMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Skip header line
            if (key.get() == 0 && value.toString().contains("userId")) return;
            String[] tokens = value.toString().split(",");
            if (tokens.length < 3) return; // skip malformed lines
            String movieId = tokens[1].trim();
            String tag = tokens[2].trim();
            context.write(new Text(movieId), new Text(tag));
        }
    }

    public static class MovieReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder tags = new StringBuilder();
            for (Text val : values) {
                tags.append(val.toString()).append(" ");
            }
            context.write(key, new Text(tags.toString().trim()));
        }
    }
}

Driver Class (MovieTagsDriver.java):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MovieTagsDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Movie Tags");
        job.setJarByClass(MovieTags.class);
        job.setMapperClass(MovieTags.MovieMapper.class);
        job.setReducerClass(MovieTags.MovieReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // Input file path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output file path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

How to Compile and Run:

Compile:

javac -classpath $(hadoop classpath) -d . MovieTags.java MovieTagsDriver.java

jar -cvf movietags.jar *.class

Run:

hadoop jar movietags.jar MovieTagsDriver /input/tags.csv /output/movietags

EX-5: Implement Functions: Count – Sort – Limit – Skip – Aggregate using MongoDB

db.movies.count();
db.movies.find().sort({rating: 1});
db.movies.find().limit(5);
db.movies.find().skip(10);
db.movies.aggregate([{$group: {_id: "$genre", count: {$sum: 1}}}]);

EX-6: Develop Pig Latin scripts to sort, group, join, project, and filter the data.

students = LOAD 'students.csv' USING PigStorage(',') AS (id:int, name:chararray, marks:int);

sorted_students = ORDER students BY marks DESC;

grouped_students = GROUP students BY name;

filtered_students = FILTER students BY marks > 50;

DUMP sorted_students;
DUMP grouped_students;
DUMP filtered_students;
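The exercise also asks for join and projection, which the script above does not show. A minimal sketch, assuming a second hypothetical file courses.csv with (id, course) columns:

courses = LOAD 'courses.csv' USING PigStorage(',') AS (id:int, course:chararray);

-- Join: match students to courses on id
joined = JOIN students BY id, courses BY id;

-- Project: keep only the name and marks columns
projected = FOREACH students GENERATE name, marks;

DUMP joined;
DUMP projected;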

EX-7: Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

CREATE DATABASE mydb;

USE mydb;

CREATE TABLE students (id INT, name STRING, marks INT) ROW FORMAT DELIMITED FIELDS
TERMINATED BY ',';

ALTER TABLE students ADD COLUMNS (age INT);

DROP TABLE students;

DROP DATABASE mydb;
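The statements above cover databases and tables; views, functions, and indexes follow the same create/alter/drop pattern. A minimal sketch to run while the students table still exists, i.e., before the DROP statements (the jar path and UDF class are hypothetical placeholders, and note that Hive 3.0 removed index support, so the index statements apply only to older versions):

CREATE VIEW top_students AS SELECT id, name FROM students WHERE marks > 75;
DROP VIEW top_students;

ADD JAR /path/to/myudfs.jar;  -- hypothetical jar containing the UDF class
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.MyUpperUDF';
DROP TEMPORARY FUNCTION my_upper;

-- Hive 2.x and earlier only:
CREATE INDEX idx_marks ON TABLE students (marks) AS 'COMPACT' WITH DEFERRED REBUILD;
DROP INDEX idx_marks ON students;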


EX-8: Implement a word count program in Hadoop and Spark.

Hadoop Streaming (Python)

mapper.py

import sys

# Read lines from stdin and emit each word with a count of 1
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print(f"{word}\t1")

reducer.py

import sys
from collections import defaultdict

# Accumulate the total count for each word
word_count = defaultdict(int)
for line in sys.stdin:
    word, count = line.strip().split("\t")
    word_count[word] += int(count)

for word, count in word_count.items():
    print(f"{word}\t{count}")

Run in Hadoop:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input input.txt -output output \
  -mapper mapper.py -reducer reducer.py \
  -file mapper.py -file reducer.py

(The -file options ship the scripts to the cluster nodes; newer releases also accept -files.)

PySpark Word Count

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read the input file as an RDD of lines
text_file = spark.read.text("hdfs://path-to-textfile").rdd.map(lambda r: r[0])

# Split lines into words, pair each word with 1, and sum counts per word
word_counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

word_counts.saveAsTextFile("hdfs://output")
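A sketch of how to submit the script (saved here under the assumed name wordcount.py) to a cluster:

spark-submit wordcount.py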

EX-9: Use CDH (Cloudera Distribution for Hadoop) and HUE (Hadoop User Interface) to analyze data and generate reports for sample datasets.

Installation (Linux)

wget https://archive.cloudera.com/cm7/7.4.0/cloudera-manager-installer.bin

chmod +x cloudera-manager-installer.bin

sudo ./cloudera-manager-installer.bin

Hue Interface for Data Reports

1. Access Hue via http://localhost:8888

2. Log in and use the Query Editors for Hive, Pig, or Spark.

Objective:

Use Cloudera CDH and HUE to:

 Upload and analyze a sample dataset (e.g., movies.csv, tags.csv, etc.)

 Run Hive or Impala queries

 Generate reports and visualizations


Tools Used:

 Cloudera CDH – Hadoop ecosystem (HDFS, Hive, Impala, etc.)

 HUE (Hadoop User Experience) – Web-based GUI to interact with Hadoop components

Sample Dataset: movies.csv

Format:

movieId,title,genres

1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy

2,Jumanji (1995),Adventure|Children|Fantasy

3,Grumpier Old Men (1995),Comedy|Romance

Steps to Perform the Task:

Step 1: Login to HUE

 URL: http://<cloudera-ip>:8888

 Login with your credentials.

Step 2: Upload Dataset to HDFS via HUE

1. Go to File Browser

2. Click Upload

3. Upload your dataset (e.g., movies.csv) to /user/<your-username>/

Step 3: Create Table in Hive/Impala

Go to Query Editors → Hive (or Impala) and run:

CREATE TABLE movies (
  movieId INT,
  title STRING,
  genres STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");

Step 4: Load Data into Table

LOAD DATA INPATH '/user/<your-username>/movies.csv' INTO TABLE movies;

Step 5: Analyze Data Using Queries

Example 1: Get all movies of genre "Comedy"

SELECT title FROM movies WHERE genres LIKE '%Comedy%';

Example 2: Count movies by genre

SELECT genres, COUNT(*) as total_movies

FROM movies

GROUP BY genres

ORDER BY total_movies DESC;
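Note that genres is a pipe-separated list, so the query above counts genre combinations rather than individual genres. A sketch that counts individual genres instead, using Hive's built-in split and explode:

SELECT genre, COUNT(*) AS total_movies
FROM movies
LATERAL VIEW explode(split(genres, '\\|')) g AS genre
GROUP BY genre
ORDER BY total_movies DESC;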

Step 6: Generate Reports

In HUE:

1. Run any query (e.g., above)

2. Click on "Visualize" (pie chart or bar chart symbol)

3. Choose chart type (bar, pie, line, etc.)

4. Set dimensions (e.g., genres vs total_movies)

5. Save or export the report

Example Report Output:

Genre Total Movies

Comedy 1200

Drama 950

Action 700

Adventure 450
