
Bachelor of Engineering (B.E)
Department of Computer Science and Engineering (Data Science)

2022 Scheme
Subject: Big Data Analytics Lab
Course Code: BAD601 Semester: VI
Prepared by:
Ms. Bhoomika S Babu
Asst. Professor

Vision

“To enrich the next generation of young data practitioners, accomplish academic excellence and
bring forward the Data Scientists”

Mission

M1: Groom students by equipping them with advanced technical knowledge to be industry-ready
and globally competent.
M2: Facilitate quality data science education and enable students to become skilled professionals
who solve real-time problems through industry collaboration.
M3: Encourage ethical, value-based transformation to serve society responsibly, with emphasis
on innovation and research methods.

Program Educational Objectives


PEO1. Apply structured statistical and mathematical methodology to process massive
amounts of data, detect underlying patterns, make predictions under realistic constraints, and
visualize the data.

PEO2. Promote design, research, product implementation and services in the field of Data
Science by using modern tools.

Program Specific Outcomes


PSO1: Apply the skills in the multi-disciplinary area of Data Science.

PSO2: Demonstrate engineering practice and learn to solve real-time problems in various domains.
Program Outcomes-POs

1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2 Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3 Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4 Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of
the information to provide valid conclusions.
5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
7 Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9 Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
11 Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.

12 Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and lifelong learning in the broadest context of technological change.

Syllabus

BIG DATA ANALYTICS (Semester 6)

Course Code: BAD601 | CIE Marks: 50

Teaching Hours/Week (L:T:P:S): 3:0:2:0 | SEE Marks: 50

Total Hours of Pedagogy: 40 hours Theory + 8-10 Lab slots | Total Marks: 100

Credits: 04 | Exam Hours: 3

Examination nature (SEE): Theory/Practical


Course objectives:
1 To implement MapReduce programs for processing big data.
2 To realize storage and processing of big data using MongoDB, Pig, Hive and Spark.
3 To analyze big data using machine learning techniques.

Teaching-Learning Process (General Instructions)


These are sample strategies that teachers can use to accelerate the attainment of the various course
outcomes.
1 The lecture method (L) need not be only the traditional lecture; alternative effective teaching
methods may be adopted to attain the outcomes.
2 Use video/animation to explain the functioning of various concepts.
3 Encourage collaborative (group) learning in the class.
4 Ask at least three HOT (Higher Order Thinking) questions in the class, which promotes critical thinking.
5 Discuss how every concept can be applied to the real world; when possible, this helps improve
the students' understanding.
6 Use any of these methods: chalk and board, active learning, case studies.

MODULE-1

Classification of data, Characteristics, Evolution and definition of Big data, What is Big data, Why Big
data, Traditional Business Intelligence Vs Big Data, Typical data warehouse and Hadoop environment.
Big Data Analytics: What is Big data Analytics, Classification of Analytics, Importance of Big Data
Analytics, Technologies used in Big data Environments, Few Top Analytical Tools, NoSQL, Hadoop.

TB1: Ch 1: 1.1, Ch2: 2.1-2.5,2.7,2.9-2.11, Ch3: 3.2,3.5,3.8,3.12, Ch4: 4.1,4.2

MODULE-2

Introduction to Hadoop: Introducing hadoop, Why hadoop, Why not RDBMS, RDBMS Vs Hadoop, History
of Hadoop, Hadoop overview, Use case of Hadoop, HDFS (Hadoop Distributed File System),Processing data
with Hadoop, Managing resources and applications with Hadoop YARN(Yet Another Resource Negotiator).
Introduction to Map Reduce Programming: Introduction, Mapper, Reducer, Combiner, Partitioner, Searching,
Sorting, Compression.

TB1: Ch 5: 5.1-5.8, 5.10-5.12, Ch 8: 8.1-8.8

MODULE-3

Introduction to MongoDB: What is MongoDB, Why MongoDB, Terms used in RDBMS and MongoDB,
Data Types in MongoDB, MongoDB Query Language.

TB1: Ch 6: 6.1-6.5

MODULE-4

Introduction to Hive: What is Hive, Hive Architecture, Hive data types, Hive file formats, Hive Query
Language (HQL), RC File implementation, User Defined Function (UDF).
Introduction to Pig: What is Pig, Anatomy of Pig, Pig on Hadoop, Pig Philosophy, Use case for Pig, Pig Latin
Overview, Data types in Pig, Running Pig, Execution Modes of Pig, HDFS Commands, Relational Operators,
Eval Function, Complex Data Types, Piggy Bank, User Defined Function, Pig Vs Hive.

TB1: Ch 9: 9.1-9.6,9.8, Ch 10: 10.1 - 10.15, 10.22

MODULE-5
Spark and Big Data Analytics: Spark, Introduction to Data Analysis with Spark.

Text, Web Content and Link Analytics: Introduction, Text Mining, Web Mining, Web Content and Web
Usage Analytics, Page Rank, Structure of Web and Analyzing a Web Graph.

TB2: Ch5: 5.2,5.3, Ch 9: 9.1-9.4

PRACTICAL COMPONENT OF IPCC


Sl.NO Experiments (Java/Python/R)

1 Install Hadoop and Implement the following file management tasks in Hadoop:
Adding files and directories
Retrieving files
Deleting files and directories.
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them
into HDFS using one of the above command line utilities.

2 Develop a MapReduce program to implement Matrix Multiplication

3 Develop a Map Reduce program that mines weather data and displays appropriate messages
indicating the weather conditions of the day.

4 Develop a MapReduce program to find the tags associated with each movie by analyzing MovieLens
data.

5 Implement Functions: Count – Sort – Limit – Skip – Aggregate using MongoDB

6 Develop Pig Latin scripts to sort, group, join, project, and filter the data.

7 Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

8 Implement a word count program in Hadoop and Spark.

9 Use CDH (Cloudera Distribution for Hadoop) and HUE (Hadoop User Interface) to analyze data
and generate reports for sample datasets.

Course outcomes (Course Skill Set):


At the end of the course, the student will be able to:

1 Identify and list various Big Data concepts, tools and applications.
2 Develop programs using the HADOOP framework.
3 Make use of a Hadoop cluster to deploy MapReduce jobs, Pig, Hive and Spark programs.
4 Analyze the given data set and identify deep insights from the data set.
5 Demonstrate Text, Web Content and Link Analytics.
Suggested Learning Resources:

Books:

1 Seema Acharya and Subhashini Chellappan, “Big Data and Analytics”, Wiley India Publishers,
2nd Edition, 2019.
2 Rajkamal and Preeti Saxena, “Big Data Analytics: Introduction to Hadoop, Spark and
Machine Learning”, McGraw Hill Publication, 2019.

Reference Books:
1 Adam Shook and Donald Miner, “MapReduce Design Patterns: Building Effective Algorithms and
Analytics for Hadoop and Other Systems”, O'Reilly, 2012.
2 Tom White, “Hadoop: The Definitive Guide”, 4th Edition, O'Reilly Media, 2015.
3 Thomas Erl, Wajid Khattak, and Paul Buhler, “Big Data Fundamentals: Concepts, Drivers &
Techniques”, Pearson India Education Service Pvt. Ltd., 1st Edition, 2016.
4 John D. Kelleher, Brian Mac Namee, and Aoife D'Arcy, “Fundamentals of Machine Learning for
Predictive Data Analytics: Algorithms, Worked Examples”, MIT Press, 2nd Edition, 2020.

Web links and Video Lectures (e-Resources):


● https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset
● https://www.youtube.com/watch?v=bAyrObl7TYE&list=PLEiEAq2VkUUJqp1k-g5W1mo37urJQOd
● https://www.youtube.com/watch?v=VmO0QgPCbZY&list=PLEiEAq2VkUUJqp1kg5W1mo37urJQOdC&index=4
● https://www.youtube.com/watch?v=GG-VRm6XnNk
● https://www.youtube.com/watch?v=JglO2Nv_9

CO-PO MAPPINGS
       PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2

CO1

CO2

CO3

CO4

INDEX
Subject: BIG DATA ANALYTICS Course Code: BAD601
Sl. No.  LAB EXPERIMENTS
1. Install Hadoop and implement the following file management tasks in Hadoop: adding files and
directories, retrieving files, deleting files and directories. Hint: A typical Hadoop workflow creates
data files (such as log files) elsewhere and copies them into HDFS using one of the above command
line utilities.
2. Develop a MapReduce program to implement Matrix Multiplication.
3. Develop a MapReduce program that mines weather data and displays appropriate messages
indicating the weather conditions of the day.
4. Develop a MapReduce program to find the tags associated with each movie by analyzing
MovieLens data.
5. Implement Functions: Count – Sort – Limit – Skip – Aggregate using MongoDB.
6. Develop Pig Latin scripts to sort, group, join, project, and filter the data.
7. Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
8. Implement a word count program in Hadoop and Spark.
9. Use CDH (Cloudera Distribution for Hadoop) and HUE (Hadoop User Interface) to analyze data
and generate reports for sample datasets.
VIVA QUESTIONS


EXPERIMENT-01:
Install Hadoop and Implement the following file management tasks in Hadoop:
Adding files and directories
Retrieving files
Deleting files and directories.
Hint: A typical Hadoop workflow creates data files (such as log files)
elsewhere and copies them into HDFS using one of the above command line
utilities.

Program:
AIM: To install Hadoop and perform the following file management tasks on it:
∙ Adding files and directories
∙ Retrieving files
∙ Deleting files and directories

Installation of OpenSSH Server:


Step-01:
sudo apt update
sudo apt install openssh-server -y


Step-02:
Verify SSH Connection
Now, check if you can SSH into your own machine:
ssh localhost
Step-03:
Start and Enable SSH Service

Once installed, start and enable the SSH service:


sudo systemctl start ssh
sudo systemctl enable ssh
Check if the SSH is running:
sudo systemctl status ssh
(If SSH is working, you should see “active (running)”)

If it asks for a password, configure passwordless SSH by
running: ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys


1. Installing HADOOP
1.1 Prerequisites
Java (JDK 8 or later): Required for Hadoop to run.
∙ SSH: Required for Hadoop to communicate between nodes.
∙ Linux (Ubuntu 20.04 recommended)
1.2 Install JAVA
Check the version if Java is already installed:
java -version
if not installed, install using:
sudo apt update
sudo apt install openjdk-11-jdk -y
1.3 Download and install Hadoop:
Download hadoop:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Extract the files:
tar -xvzf hadoop-3.3.1.tar.gz
sudo mv hadoop-3.3.1 /usr/local/hadoop

Set environment:


nano ~/.bashrc
To find the Java installation path for JAVA_HOME, run: readlink -f $(which java)

Add the following lines at the end:


export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME

Save and exit (CTRL+X, Y, Enter). Apply changes:


source ~/.bashrc

1.4 Configure Hadoop (single-node, pseudo-distributed mode):


1. Edit core-site.xml:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
ADD:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

</configuration>

2. Edit hdfs-site.xml:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
ADD:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

3. Edit mapred-site.xml:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
ADD:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
4. Edit yarn-site.xml:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
ADD:
<configuration>
<property>

<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

1.5 Format HDFS:


- Format the Hadoop file system:
hdfs namenode -format

1.6 Start Hadoop Services


Start Hadoop processes:
start-dfs.sh
start-yarn.sh

- Verify that all services are running using the `jps` command.

2. File Management in Hadoop (HDFS)


2.1 Adding Files and Directories
Create a directory in HDFS:
bash
hdfs dfs -mkdir /usr
hdfs dfs -mkdir /usr/’usn’

Copy a local file to HDFS:


bash
hdfs dfs -put /home/bhoomika/sample.txt /usr/'usn'/

hdfs dfs -ls /usr/'usn'/ -- to check whether the file or directory was created properly

hdfs dfs -cat /usr/'usn'/sample.txt -- to check whether the file contents were copied correctly

2.2 Retrieving Files


List files in a directory:
bash
hdfs dfs -ls /usr/’usn’/

Read a file from HDFS:


bash
hdfs dfs -cat /usr/’usn’/sample.txt

Copy a file from HDFS to the local system:


bash
hdfs dfs -get /user/’usn’/sample.txt /home/user/

2.3 Deleting Files and Directories


Delete a file:
bash
hdfs dfs -rm /usr/'usn'/sample.txt

Delete a directory:
bash
hdfs dfs -rm -r /’usn’/vtu

3. Stopping Hadoop Services


To stop Hadoop services:
bash
stop-dfs.sh
stop-yarn.sh

4. Verification:
To check HDFS status:
bash
hdfs dfsadmin -report

RESULTS:


Thus Hadoop was successfully downloaded and installed in single-node (pseudo-distributed) mode,
and the file management tasks of adding, retrieving, and deleting files and directories in HDFS
were carried out.

EXPERIMENT-02:
Develop a MapReduce program to implement Matrix Multiplication.

Program:
AIM: To Develop a MapReduce program to implement Matrix Multiplication.
MapReduce Program for Matrix Multiplication
Overview
MapReduce is used to multiply two matrices stored in HDFS. The input is a plain text file with one
matrix element per line, in the form <matrix> <row> <col> <value>:
• Matrix A (m × n)
• Matrix B (n × p)
MapReduce Implementation
We will create a MapReduce program in which:
1. The Mapper reads matrix elements and emits intermediate key-value pairs.
2. The Reducer aggregates the pairs and computes the final product.
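For reference, the input file for this job can be generated locally. The following is a minimal sketch (illustrative 2 × 2 values, not part of the original manual) that writes matrix.txt in the one-element-per-line format the mapper below expects:

# write_matrix.py -- create a small sample input file for the MapReduce job
A = [[1.0, 2.0],
     [3.0, 4.0]]   # Matrix A (2 x 2)
B = [[5.0, 6.0],
     [7.0, 8.0]]   # Matrix B (2 x 2)

with open("matrix.txt", "w") as f:
    for i, row in enumerate(A):
        for j, value in enumerate(row):
            f.write(f"A {i} {j} {value}\n")   # "<matrix> <row> <col> <value>"
    for i, row in enumerate(B):
        for j, value in enumerate(row):
            f.write(f"B {i} {j} {value}\n")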

1. Start Hadoop and HDFS


Ensure Hadoop is running. Format the HDFS NameNode only if this is a fresh installation
(formatting erases existing HDFS data):
hdfs namenode -format

Start the Hadoop daemons:
start-dfs.sh
start-yarn.sh
2. Create a Directory in HDFS
Create a directory for the input file:
hdfs dfs -mkdir -p /usr/hadoop/matrix

3. Upload the Text File to HDFS
Copy the local text file to HDFS:
hdfs dfs -put matrix.txt /usr/hadoop/matrix/

4. Verify the File Upload
Check if the file was successfully uploaded:
hdfs dfs -ls /usr/hadoop/matrix/

Hadoop MapReduce Python Program for Matrix Multiplication


from mrjob.job import MRJob
from collections import defaultdict


class MatrixMultiplication(MRJob):

    def configure_args(self):
        super(MatrixMultiplication, self).configure_args()
        self.add_passthru_arg('--rows-a', type=int, help="Number of rows in matrix A (M)")
        self.add_passthru_arg('--cols-a', type=int, help="Number of columns in matrix A (N)")
        self.add_passthru_arg('--cols-b', type=int, help="Number of columns in matrix B (P)")

    def mapper(self, _, line):
        parts = line.strip().split()
        matrix_name = parts[0]        # 'A' or 'B'
        i = int(parts[1])             # Row index
        j = int(parts[2])             # Column index
        value = float(parts[3])       # Value at (i, j)

        # Get matrix dimensions from arguments
        cols_a = self.options.cols_a  # N (Columns in A)
        cols_b = self.options.cols_b  # P (Columns in B)
        rows_a = self.options.rows_a  # M (Rows in A)

        if matrix_name == "A":
            # A[i][j] is needed for every output cell (i, k)
            for k in range(cols_b):
                yield (i, k), ("A", j, value)

        elif matrix_name == "B":
            # B[i][j] is needed for every output cell (r, j); keep i (the row of B)
            # as the join key so it matches column j of A in the reducer
            for r in range(rows_a):
                yield (r, j), ("B", i, value)

    def reducer(self, key, values):
        a_values = defaultdict(float)
        b_values = defaultdict(float)

        for value in values:
            if value[0] == "A":
                a_values[value[1]] = value[2]
            else:
                b_values[value[1]] = value[2]

        # Compute the result for output cell (i, k)
        result = sum(a_values[k] * b_values[k] for k in a_values if k in b_values)

        yield key, result


if __name__ == "__main__":
    MatrixMultiplication.run()

Steps to Run the Program


Install mrjob (a Python MapReduce library) on the local system:


pip install mrjob
1. Upload the Files to HDFS
hdfs dfs -mkdir -p /usr/hadoop/matrix/
hdfs dfs -put matrix_multiplication.py /usr/hadoop/matrix/
hdfs dfs -put matrix.txt /usr/hadoop/matrix/
To check whether they were created:
hdfs dfs -ls /usr/hadoop/matrix/

To see the content of the text:


hdfs dfs -cat /usr/hadoop/matrix/matrix.txt

2. Run the MapReduce Job
Run the job from the local system against the input stored in HDFS (the -r hadoop runner submits it to the Hadoop cluster):
python matrix_multiplication.py -r hadoop --rows-a 2 --cols-a 2 --cols-b 2 hdfs:///usr/hadoop/matrix/matrix.txt > output.txt

3. Retrieve Output
The redirected results are in the local file output.txt. If the job was instead run with an HDFS output directory (--output-dir), list and view it with:
hdfs dfs -ls /usr/hadoop/output
hdfs dfs -cat /usr/hadoop/output/part-*

RESULT:
Thus the MapReduce program to implement Matrix Multiplication was
successfully executed.

EXPERIMENT-03:
Develop a Map Reduce program that mines weather data and displays appropriate
messages indicating the weather conditions of the day.

Program:
AIM: To develop a MapReduce program that mines weather data from a CSV file,
classifies weather conditions, and calculates the average temperature for each day.

Step-by-Step Guide to Install Python in a Hadoop Environment

1. Check If Python Is Already Installed


python3 --version
If it is not installed, install it:
sudo apt update
sudo apt install python3 -y
sudo apt install python3-pip -y
Check if pip is installed correctly:
pip3 --version
This should display something like:
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)

2. Set Python as the Default Interpreter


Hadoop streaming may use Python 2 by default. If you want to use Python 3,
create a symbolic link:
sudo ln -s /usr/bin/python3 /usr/bin/python
python --version

3. Install Required Python Libraries (If Needed)


If your Hadoop MapReduce jobs use additional Python libraries (like pandas,
numpy, etc.), install them on all nodes:
pip3 install pandas numpy
For cluster-wide installation:
sudo pip3 install pandas numpy

4. Verify Python Installation



Run:
python3 -c "print('Python is installed in Hadoop environment')"
If you see:
Python is installed in Hadoop environment
then Python is ready for use with Hadoop streaming.


Steps to Import and Process CSV in a MapReduce Program:

1. Ensure Your CSV File Is Readable:

• Your CSV file should have columns like date, temperature, weather_condition, etc.
• Example format:
date,temp,condition
2024-03-10,25,Sunny
2024-03-10,30,Cloudy
2024-03-11,28,Rainy
2. Use Hadoop Streaming for MapReduce (If Using Python)
• Hadoop Streaming allows running MapReduce programs as Python scripts.
• Your CSV file will be passed as input to the MapReduce job.
3. Write the Mapper (mapper.py)
• Read CSV lines, extract relevant fields, and emit key-value pairs.
4. Write the Reducer (reducer.py)
• Aggregate temperature values by date and compute the average.

• Classify the day's weather condition.

Mapper (mapper.py)

#!/usr/bin/env python3
import sys
import csv

# Read input line by line
for line in sys.stdin:
    line = line.strip()
    reader = csv.reader([line])
    for row in reader:
        if len(row) == 3:                  # Ensure proper format
            date, temp, condition = row
            try:
                temp = float(temp)         # Convert temperature to float
                print(f"{date}\t{temp}\t{condition}")
            except ValueError:
                continue                   # Skip if temperature is not a number

Reducer (reducer.py)

#!/usr/bin/env python3
import sys
from collections import defaultdict

weather_data = defaultdict(lambda: {"temp_sum": 0, "count": 0, "conditions": []})

for line in sys.stdin:
    line = line.strip()
    parts = line.split("\t")
    if len(parts) == 3:
        date, temp, condition = parts
        try:
            temp = float(temp)
            weather_data[date]["temp_sum"] += temp
            weather_data[date]["count"] += 1
            weather_data[date]["conditions"].append(condition)
        except ValueError:
            continue

for date, data in weather_data.items():
    avg_temp = data["temp_sum"] / data["count"]
    most_common_condition = max(set(data["conditions"]), key=data["conditions"].count)
    print(f"{date}\tAverage Temp: {avg_temp:.2f}°C\tCondition: {most_common_condition}")

Running the Program on Hadoop


Start Hadoop
If Hadoop is not running, start the HDFS and YARN services. Format the Hadoop file system only on a fresh installation (formatting erases existing HDFS data):
hdfs namenode -format

Start Hadoop processes:

start-dfs.sh
start-yarn.sh
2. Create a Directory in HDFS
Before uploading, create a directory in HDFS to store your files:
hdfs dfs -mkdir -p /usr/hadoop/weather_data
Check if it was created:
hdfs dfs -ls /usr/hadoop/

3. Upload Your CSV Data File to HDFS
If your data file (weather_data.csv) is in /home/bhoomika/, copy it, along with the mapper and reducer scripts, into HDFS:
hdfs dfs -put /home/bhoomika/mapper.py /usr/hadoop/weather_data/
hdfs dfs -put /home/bhoomika/reducer.py /usr/hadoop/weather_data/
hdfs dfs -put /home/bhoomika/weather_data.csv /usr/hadoop/weather_data/

To check that all files were copied into HDFS:
hdfs dfs -ls /usr/hadoop/weather_data

Check File Contents


To verify the uploaded file contents:
hdfs dfs -cat /usr/hadoop/weather_data/weather_data.csv | head -10

Run the MapReduce Job



NOTE: To find the JAR location:


find / -name "hadoop-streaming*.jar" 2>/dev/null

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.1.jar \
  -files /home/bhoomika/mapper.py,/home/bhoomika/reducer.py \
  -input /usr/hadoop/weather_data/weather_data.csv \
  -output /usr/hadoop/weather_output \
  -mapper mapper.py \
  -reducer reducer.py

NOTE:
Delete Old Output Directory
If you have already run the job before, delete the output directory in HDFS, because Hadoop does not allow overwriting an existing output folder:
hdfs dfs -rm -r /usr/hadoop/weather_output

Final Steps: Check the Output


Once the job completes, check the output using:
hdfs dfs -ls /usr/hadoop/weather_output/

To view the results:
hdfs dfs -cat /usr/hadoop/weather_output/part-*


Results:
This will give you an output with the average temperature and dominant
weather condition for each day.

EXPERIMENT-04:

Develop a MapReduce program to find the tags associated with each movie by
analyzing movie lens data.

Program:

AIM: The aim of this project is to develop a MapReduce program using Hadoop
Streaming to analyze the MovieLens dataset and extract the tags associated with
each movie. The program processes large-scale movie datasets stored in HDFS and
groups the user-assigned tags by movieId, ultimately providing a meaningful list of
tags for each movie.

This implementation demonstrates how MapReduce efficiently processes structured data in a
distributed environment, making it suitable for big data analytics.

Mapper: Reads input data and emits (movieId, tag) key-value pairs.


Reducer: Aggregates tags for each movieId and outputs (movieId, list of tags).

Python Code for Mapper (mapper_mov.py)

#!/usr/bin/env python3
import sys

# Read input line by line
for line in sys.stdin:
    fields = line.strip().split(",")          # Splitting CSV fields
    if len(fields) < 3 or fields[0] == "userId":
        continue                               # Skip header or malformed lines
    movie_id = fields[1]                       # Extract movieId
    tag = fields[2]                            # Extract tag
    print(f"{movie_id}\t{tag}")                # Emit (movieId, tag)

Python Code for Reducer (reducer_mov.py)

#!/usr/bin/env python3
import sys

current_movie = None
tags = []

# Read input key-value pairs from the Mapper (sorted by movieId)
for line in sys.stdin:
    movie_id, tag = line.strip().split("\t")

    if current_movie == movie_id:
        tags.append(tag)
    else:
        if current_movie:
            print(f"{current_movie}\t{', '.join(tags)}")   # Output collected tags
        current_movie = movie_id
        tags = [tag]

# Output the last movieId's tags
if current_movie:
    print(f"{current_movie}\t{', '.join(tags)}")

Running the Hadoop Streaming Job


Start Hadoop
If Hadoop is not running, start the HDFS and YARN services. Format the Hadoop file system only on a fresh installation:
hdfs namenode -format

Start Hadoop processes:

start-dfs.sh
start-yarn.sh


Create a Directory in HDFS


Before uploading, create a directory in HDFS to store your files:
hdfs dfs -mkdir -p /usr/hadoop/movielens
Check if it was created:
hdfs dfs -ls /usr/hadoop/

Step 1: Upload Data to HDFS


hdfs dfs -put tags.csv /usr/hadoop/movielens/
hdfs dfs -put movies.csv /usr/hadoop/movielens/
hdfs dfs -put mapper_mov.py /usr/hadoop/movielens/
hdfs dfs -put reducer_mov.py /usr/hadoop/movielens/
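The streaming job itself can then be submitted exactly as in Experiment 3, passing mapper_mov.py and reducer_mov.py via -files and the tags.csv path as -input. Before submitting, the pair can be sanity-checked locally; this is a minimal sketch (the userId,movieId,tag,timestamp rows below are illustrative, not taken from the real dataset):

# local_test_movielens.py -- run mapper_mov.py and reducer_mov.py on sample rows without Hadoop
import subprocess

sample = (
    "userId,movieId,tag,timestamp\n"
    "2,60756,funny,1445714994\n"
    "2,60756,will ferrell,1445714992\n"
    "7,1221,mafia,1169687325\n"
)

mapped = subprocess.run(["python3", "mapper_mov.py"], input=sample,
                        capture_output=True, text=True).stdout

# Emulate the shuffle/sort phase so tags for the same movieId arrive adjacent to each other
shuffled = "".join(sorted(mapped.splitlines(keepends=True)))

reduced = subprocess.run(["python3", "reducer_mov.py"], input=shuffled,
                         capture_output=True, text=True).stdout
print(reduced)   # one line per movieId with its comma-separated tags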

EXPERIMENT-05:
Implement Functions: Count – Sort – Limit – Skip – Aggregate using MongoDB

Program:
AIM: To implement and understand the functions count(), sort(), limit(), skip(), and aggregate() in
MongoDB, a NoSQL database that supports high-performance operations. These functions help in
retrieving, managing, and processing data efficiently.

Does MongoDB Use Hadoop?


MongoDB itself is a NoSQL database that operates independently of Hadoop.
However, MongoDB can be integrated with Hadoop using connectors such as:
1. MongoDB Hadoop Connector – enables reading and writing data between MongoDB and Hadoop's HDFS.
2. Apache Spark with MongoDB – allows distributed processing of MongoDB data.


MongoDB can work without Hadoop for normal operations. If required for big
data analysis, we can integrate it with Hadoop.

Steps to Install MongoDB on Ubuntu:


Step 1: Update System Packages
sudo apt update && sudo apt upgrade -y
Step 2: Import MongoDB Public Key
curl -fsSL https://pgp.mongodb.com/server-6.0.asc | sudo gpg --dearmor -o
/usr/share/keyrings/mongodb-server-6.0.gpg
Step 3: Add MongoDB Repository
echo "deb [ signed-by=/usr/share/keyrings/mongodb-server-6.0.gpg ]
https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" | sudo tee
/etc/apt/sources.list.d/mongodb-org-6.0.list

Step 4: Install MongoDB


sudo apt update
sudo apt install -y mongodb-org
Step 5: Start MongoDB Service
sudo systemctl start mongod
sudo systemctl enable mongod
Step 6: Verify Installation
mongod --version
mongosh --eval 'db.runCommand({ connectionStatus: 1 })'

Using MongoDB Functions: Count, Sort, Limit, Skip, Aggregate


Step 1: Open the MongoDB Shell
mongosh
Step 2: Create a Database
use studentDB
Step 3: Insert Sample Data
db.students.insertMany([
  { name: "Alice", age: 23, grade: "A", city: "New York" },
  { name: "Bob", age: 25, grade: "B", city: "Los Angeles" },
  { name: "Charlie", age: 22, grade: "A", city: "Chicago" },
  { name: "David", age: 24, grade: "C", city: "Houston" },
  { name: "Eva", age: 21, grade: "B", city: "San Francisco" },
  { name: "Frank", age: 23, grade: "A", city: "Miami" }
]);
1. Count Function
Counts the number of documents in the collection.
db.students.count()

Count with Condition:


db.students.count({ grade: "A" })
2. Sort Function
Sorts data in ascending or descending order.
∙ Sort by age (ascending):
db.students.find().sort({ age: 1 })
Sort by name (descending):
db.students.find().sort({ name: -1 })
3. Limit Function
Limits the number of results.
db.students.find().limit(3)

4. Skip Function
Skips a certain number of records.
db.students.find().skip(2)

Skips the first two entries and returns the rest.


5. Aggregate Function
Aggregation allows complex queries like grouping and summarizing data.

Group by grade and count students:


db.students.aggregate([
{ $group: { _id: "$grade", count: { $sum: 1 } } }
])


Average age of students per grade:


db.students.aggregate([
{ $group: { _id: "$grade", averageAge: { $avg: "$age" } } }
])
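The same operations can also be driven from Python. This is a minimal sketch assuming the pymongo package is installed (pip install pymongo) and mongod is listening on the default port 27017:

# mongo_functions.py -- count, sort, limit, skip and aggregate via pymongo
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["studentDB"]

# Count documents, with and without a condition
print(db.students.count_documents({}))
print(db.students.count_documents({"grade": "A"}))

# Sort by age ascending, skip the first two results, return at most three
for doc in db.students.find().sort("age", 1).skip(2).limit(3):
    print(doc["name"], doc["age"])

# Aggregate: average age of students per grade
pipeline = [{"$group": {"_id": "$grade", "averageAge": {"$avg": "$age"}}}]
for row in db.students.aggregate(pipeline):
    print(row)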

Running MongoDB with Hadoop


MongoDB and Hadoop can be integrated for large-scale data processing.
Step 1: Install Hadoop

Follow the Hadoop installation steps from Experiment 1 to set up Hadoop on Ubuntu.


Step 2: Install MongoDB Hadoop Connector
wget https://repo1.maven.org/maven2/org/mongodb/mongo-hadoop/mongo-hadoop-core/2.0.2/mongo-hadoop-core-2.0.2.jar

Step 3: Configure Hadoop to Read from MongoDB

Modify the Hadoop job configuration to use MongoDB as the data source.

1. Update core-site.xml (HDFS Configuration)

This is necessary if you want MongoDB to be used as an input source for Hadoop
jobs.

Location: $HADOOP_HOME/etc/hadoop/core-site.xml
Add MongoDB Configuration:
<configuration>
<property>
<name>mongo.input.uri</name>
<value>mongodb://localhost:27017/studentDB.students</value>
</property>
<property>

<name>mongo.output.uri</name>
<value>mongodb://localhost:27017/studentDB.output</value>
</property>
</configuration>
mongo.input.uri → Defines the MongoDB collection as input.
mongo.output.uri → Defines the MongoDB collection as output.

2. Update mapred-site.xml (MapReduce Configuration)
This ensures that MongoDB is used as an input format for MapReduce.

Location: $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add MongoDB Job Properties:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.job.inputformat.class</name>
<value>com.mongodb.hadoop.MongoInputFormat</value>
</property>
<property>
<name>mapreduce.job.outputformat.class</name>
<value>com.mongodb.hadoop.MongoOutputFormat</value>
</property>
</configuration>

∙ MongoInputFormat→ Reads data from MongoDB.


∙ MongoOutputFormat → Writes output data back to MongoDB.


3. Update hdfs-site.xml (Optional for HDFS Storage)

If you want to store the output of your MongoDB processing into HDFS, update:

Location : $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
This ensures that HDFS runs with minimal replication for small-scale
processing.

5. Restart Hadoop & MongoDB

After updating the XML files, restart the services:

sudo systemctl restart mongod


stop-dfs.sh
start-dfs.sh
stop-yarn.sh
start-yarn.sh

Step 4: Run a Hadoop MapReduce Job with MongoDB:


hadoop jar mongo-hadoop-core-2.0.2.jar com.mongodb.hadoop.examples.wordcount.MongoWordCount

Conclusion

1. We installed MongoDB on Ubuntu.
2. We used the MongoDB functions count, sort, limit, skip, and aggregate.
3. We explored MongoDB's integration with Hadoop for big data processing.

MongoDB is a NoSQL database that does not require Hadoop, but it can be
integrated with Hadoop for large-scale distributed computing.

EXPERIMENT-06:
Develop Pig Latin scripts to sort, group, join, project, and filter the data.
Introduction to Apache Pig
Apache Pig is a high-level scripting platform for processing and analyzing
large datasets in Hadoop. It provides a language called Pig Latin, which
simplifies complex data transformations by breaking them into a sequence of
steps.
Can We Use Pig on Hadoop?
Yes! Pig is designed to work on Hadoop’s HDFS (Hadoop Distributed File
System) and can execute its scripts as MapReduce jobs in the Hadoop ecosystem.

Program:
AIM: To implement sorting, grouping, joining, projecting, and filtering data
using Apache Pig in the Hadoop system.


Step 1: Set Up Apache Pig


Installation on Hadoop Cluster
1 Download Apache Pig:
wget https://dlcdn.apache.org/pig/latest/pig-0.17.0.tar.gz
2 Extract and set environment variables:
tar -xzf pig-0.17.0.tar.gz
export PIG_HOME=/home/user/pig-0.17.0
export PATH=$PIG_HOME/bin:$PATH
3 Run Pig in Interactive Mode (Grunt Shell)
pig

Step 2: Load Data into HDFS


We assume we have a dataset students.txt with the following structure (no spaces after the commas,
since PigStorage(',') treats everything between delimiters as part of the field):
101,John,20,CS
102,Alice,21,IT
103,Bob,22,CS
104,Eve,21,IT
105,Charlie,22,EE

Upload the file to HDFS


hdfs dfs -put students.txt /data/
Step 3: Write Pig Latin Scripts
Each Pig script is explained step by step.
(i) Loading Data
students = LOAD '/data/students.txt' USING PigStorage(',')
AS (id:int, name:chararray, age:int, dept:chararray);

(ii) Sorting Data (By Age)


sorted_students = ORDER students BY age;
DUMP sorted_students;

(iii) Grouping Data (By Department)


grouped_students = GROUP students BY dept;
DUMP grouped_students;


(iv) Filtering Data (Age > 21)


filtered_students = FILTER students BY age > 21;
DUMP filtered_students;

(v) Projecting Specific Columns (ID, Name Only)


projected_students = FOREACH students GENERATE id, name;
DUMP projected_students;

(vi) Joining Two Datasets (Students and Marks)
Assume a second dataset marks.txt with the structure:
101,85
102,90
103,88
104,76
105,92

Upload to HDFS:
hdfs dfs -put marks.txt /data/

Join Students and Marks Dataset


marks = LOAD '/data/marks.txt' USING PigStorage(',') AS (id:int, marks:int);
joined_data = JOIN students BY id, marks BY id;
DUMP joined_data;
RESULT:
The Pig scripts successfully executed the required operations, processing student
data in HDFS.

EXPERIMENT-07:
Use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

Apache Hive is a data warehouse system that runs on top of Hadoop and allows querying large
datasets using SQL-like commands. In this guide, we will cover:


∙ Installing Hive
∙ Creating, Altering, and Dropping
o Databases
o Tables
o Views
o Functions
o Indexes
∙ Running Hive on Hadoop vs. Standalone Mode

Step 1: Install Apache Hive


Prerequisites:

1 Java installed (java -version)
2 Hadoop installed and running (hdfs dfs -ls /)

1. Download and Install Hive


On Linux (Ubuntu/Red Hat)
wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -xvzf apache-hive-3.1.3-bin.tar.gz
mv apache-hive-3.1.3-bin /usr/local/hive
2. Configure Hive Environment Variables
nano ~/.bashrc
Add the following lines at the end:
export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH
export HIVE_CONF_DIR=$HIVE_HOME/conf
export HADOOP_HOME=/usr/local/hadoop
Save and apply changes:
source ~/.bashrc
3. Create Hive Warehouse in HDFS
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
4. Initialize Hive Metastore
schematool -dbType derby -initSchema


Step 2: Starting Hive


To start the Hive shell:
hive

1. Database Operations
Create a Database
CREATE DATABASE student_db;
Use a Database
USE student_db;
Show Databases
SHOW DATABASES;
Alter Database
ALTER DATABASE student_db SET DBPROPERTIES ('edited by'='admin');
Drop a Database
DROP DATABASE student_db CASCADE;

2. Table Operations
Create a Table
CREATE TABLE students (
id INT,
name STRING,
age INT,
marks FLOAT
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Load Data into Table
LOAD DATA LOCAL INPATH '/home/user/students.csv' INTO TABLE
students;
Describe a Table
DESCRIBE students;
Alter Table (Rename)
ALTER TABLE students RENAME TO student_info;

Drop a Table
DROP TABLE student_info;

3. View Operations
Create a View
CREATE VIEW top_students AS
SELECT name, marks FROM students WHERE marks > 80;
Show Views
SHOW VIEWS;
Drop a View
DROP VIEW top_students;

4. User-Defined Function (UDF)

Hive allows custom functions for data processing.
Create a Temporary Function
CREATE TEMPORARY FUNCTION lower_case AS
'org.apache.hadoop.hive.ql.udf.UDFLower';
Registers a built-in UDF class to convert text to lowercase.
Use the Function
SELECT lower_case(name) FROM students;
Converts name column values to lowercase.

Drop a Function
DROP FUNCTION lower_case;
Removes the function.

5. Index Operations
Indexes improve query performance. (Index support was removed in Hive 3.0, so the statements below apply to Hive 2.x and earlier.)
Create an Index
CREATE INDEX idx_students ON TABLE students (marks)

AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
∙ Creates an index on the marks column.
∙ WITH DEFERRED REBUILD → defers building the index until an ALTER INDEX ... REBUILD is run.

Show Indexes
SHOW INDEXES ON students;
Lists all indexes on the students table.

Drop an Index
DROP INDEX idx_students ON students;
Deletes the index on marks.
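Hive queries can also be issued from Python. This is a minimal sketch assuming HiveServer2 is running on localhost:10000 and the pyhive package is installed; the connection parameters are placeholders, not part of the original manual:

# hive_query.py -- run a HiveQL query from Python via PyHive
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000,
                       username="hadoop", database="student_db")
cursor = conn.cursor()

# Same query used for the top_students view above
cursor.execute("SELECT name, marks FROM students WHERE marks > 80")
for name, marks in cursor.fetchall():
    print(name, marks)

cursor.close()
conn.close()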

EXPERIMENT-08:
Implement a word count program in Hadoop and Spark.

AIM: To implement a Word Count program using Hadoop MapReduce and Apache Spark,
demonstrating distributed data processing techniques for counting occurrences of words in a given
text file.

Introduction

In this guide, we will implement a Word Count Program using:

∙ Hadoop MapReduce
∙ Apache Spark

Part 1: Word Count in Hadoop MapReduce


Step 1: Install and Configure Hadoop
1 Ensure Java is installed:
java -version

2 Ensure Hadoop is installed and running (format the NameNode only on a fresh installation):
hdfs namenode -format
start-dfs.sh
start-yarn.sh
Step 2: Write the Mapper Program (Python)

Hadoop MapReduce consists of two parts:
1 Mapper → processes input and generates key-value pairs.
2 Reducer → aggregates key-value pairs and produces the final count.

Create a mapper.py file:

#!/usr/bin/env python
import sys

# Read input from standard input
for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print(f"{word}\t1")   # Emit (word, 1) pair
Step 3: Write the Reducer Program (Python)

Create a reducer.py file:

#!/usr/bin/env python
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split("\t")
    count = int(count)

    if word == current_word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:
    print(f"{current_word}\t{current_count}")

Step 4: Upload Input File to HDFS


Create a text file (input.txt) with sample data:
echo "hello world hello Hadoop" > input.txt
Upload to HDFS:
hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put input.txt /wordcount/input/

Step 5: Run the MapReduce Job


hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -input /wordcount/input/ \
  -output /wordcount/output/ \
  -mapper mapper.py \
  -reducer reducer.py

Step 6: View Output


hdfs dfs -cat /wordcount/output/part-00000

Part 2: Word Count in Apache Spark (Python - PySpark)


Step 1: Install and Configure Spark
1 Install Spark
sudo apt update
sudo apt install spark
2 Verify Installation
spark-shell

Step 2: Create a Word Count Program in Spark


Create a file wordcount_spark.py:

from pyspark.sql import SparkSession

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Step 2: Read text file
lines = spark.read.text("input.txt").rdd.map(lambda x: x[0])

# Step 3: Split lines into words and assign count 1
words = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))

# Step 4: Perform word count
word_counts = words.reduceByKey(lambda a, b: a + b)

# Step 5: Save output
word_counts.saveAsTextFile("output_spark")

# Step 6: Stop Spark session
spark.stop()
Step 3: Run Spark Job
Run the script:
spark-submit wordcount_spark.py
Output Location:
cat output_spark/part-00000
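As a design note, the same count can be expressed with Spark's DataFrame API instead of RDD operations; this is a minimal sketch assuming the same local input.txt:

# wordcount_spark_df.py -- word count using the DataFrame API (alternative to the RDD version above)
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("WordCountDF").getOrCreate()

# Each input line becomes a row with a single "value" column
lines = spark.read.text("input.txt")

# Split each line on spaces, explode into one word per row, then group and count
counts = (lines.select(explode(split(col("value"), " ")).alias("word"))
               .groupBy("word")
               .count())

counts.show()
spark.stop()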

RESULT:
The Word Count program was successfully executed using Hadoop MapReduce and Apache Spark.
The output displayed the correct count of words from the input text file, demonstrating efficient
distributed data processing in both HDFS (Hadoop Distributed File System) and Apache Spark
RDDs (Resilient Distributed Datasets).


The comparison between Hadoop MapReduce and Spark shows that Spark is
faster due to in-memory computation, while Hadoop is reliable for large-scale
batch processing.
EXPERIMENT-09:

Use CDH (Cloudera Distribution for Hadoop) and HUE (Hadoop User Interface)
to analyze data and generate reports for sample datasets

AIM: To set up Cloudera Distribution for Hadoop (CDH) and HUE (Hadoop User Interface) on
Ubuntu, load a sample dataset, analyze data, and generate reports using HUE.

Step 1: Install Cloudera Distribution for Hadoop (CDH) on Ubuntu
CDH is an enterprise version of Hadoop that includes HDFS, YARN, MapReduce, Hive, Impala, and more.
1.1: Add Cloudera Repository
sudo wget https://archive.cloudera.com/cdh7/7.1.7/ubuntu1804/apt/cloudera.list -O /etc/apt/sources.list.d/cloudera.list
wget https://archive.cloudera.com/cdh7/7.1.7/ubuntu1804/apt/archive.key
sudo apt-key add archive.key
1.2: Update and Install CDH Components
sudo apt update
sudo apt install -y cloudera-manager-daemons cloudera-manager-server cloudera-manager-agent
1.3: Start Cloudera Manager Server
sudo systemctl start cloudera-scm-server


Step 2: Install and Start HUE (Hadoop User Interface)


HUE provides a GUI for Hadoop to interact with HDFS, Hive, Impala, and more.
2.1: Install HUE
sudo apt install -y hue

2.2: Start HUE Service


sudo systemctl start hue
Access HUE Web UI:
∙ Open a browser and go to:
http://<your-ip>:8888

Step 3: Upload Sample Dataset to HDFS


3.1: Download a Sample Dataset
wget https://people.sc.fsu.edu/~jburkardt/data/csv/hw_200.csv -O sample_data.csv

3.2: Upload Dataset to HDFS using HUE

1 Open HUE UI (http://<your-ip>:8888)


2 Navigate to File Browser → Upload
3 Select sample_data.csv and upload it to HDFS (/user/hadoop/).

Alternative (Command Line)


hdfs dfs -mkdir -p /user/hadoop/
hdfs dfs -put sample_data.csv /user/hadoop/

Step 4: Create a Table in Hive (Using HUE SQL Editor)


4.1: Open HUE’s Hive Query Editor
1 Go to HUE UI → Query Editors → Hive
2 Run the following SQL command:

CREATE DATABASE sample_db;


USE sample_db;

CREATE TABLE sample_table (
  Index INT,
  Height FLOAT,
  Weight FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

4.2: Load Data into the Table


LOAD DATA INPATH '/user/hadoop/sample_data.csv' INTO TABLE sample_table;
Step 5: Run Data Analysis Queries in HUE
5.1: Count Total Records
SELECT COUNT(*) FROM sample_table;
5.2: Find the Average Height
SELECT AVG(Height) FROM sample_table;
5.3: Find Maximum and Minimum Weight
SELECT MAX(Weight), MIN(Weight) FROM sample_table;

Step 6: Generate Reports Using HUE


6.1: Open HUE Dashboards

1 Go to HUE UI → Dashboards
2 Select "Create New Dashboard"
3 Choose the sample_table and create visualizations such as:
o Bar charts (Height vs Weight)
o Pie charts (Height distribution)
o Tables (Displaying records)
6.2: Save and Export Reports
∙ Save the dashboard as sample_report.
∙ Export it as PDF or CSV for analysis.

Step 7: Stop Hadoop and HUE Services


After completing the analysis, stop services:
sudo systemctl stop cloudera-scm-server

sudo systemctl stop hue

To check and access HUE on your browser, follow these steps:

Step 1: Find Your System's IP Address


Run the following command in your Ubuntu terminal:
hostname -I
or
ip a | grep inet

Step 2: Verify HUE Service is Running


Check whether the HUE service is active:
sudo systemctl status hue
If HUE is not running, start it using:
sudo systemctl start hue

Step 3: Access HUE in the Browser


Open a web browser and enter:
http://<your-ip>:8888
example: http://192.168.1.100:8888

Step 4: Login to HUE


∙ Default Username: admin
∙ Default Password: admin

RESULT: The CDH (Cloudera Distribution for Hadoop) and HUE (Hadoop User
Interface) were successfully installed and configured on Ubuntu.
