Installation of Hadoop

Prerequisites/requirements:

1. Java 8 Runtime Environment (JRE): Hadoop 3 requires a Java 8 installation. I prefer using the offline installer.
   https://www.java.com/en/download/windows_offline.jsp
   Java 8 Development Kit (JDK):
   https://www.oracle.com/java/technologies/downloads/#java8-windows
   7-Zip (to unzip the downloaded Hadoop binaries, we should install 7-Zip):
   https://www.7-zip.org/download.html
   Hadoop 3.2.4 binaries:
   https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
Steps:
Extract hadoop-3.2.4.tar.gz:
Create a folder with the name hadoopsetup (This PC > C: drive > hadoopsetup).
Copy the downloaded hadoop-3.2.4.tar.gz file and paste it into the hadoopsetup folder.
Right-click on the .gz file > Show more options > 7-Zip > Extract to "hadoop-3.2.4.tar\".
Right-click on the hadoop-3.2.4.tar file > Show more options > 7-Zip > Extract to "hadoop-3.2.4\" (the actual contents of Hadoop are taken out and we can now access the files).
Move the hadoop-3.2.4 folder out to C:\hadoopsetup.
2. Download the libraries from the following link:
https://1drv.ms/f/s!ArSg3Xpur4Grml7l087JBp_4bzks?e=aSqIQV
After unpacking the package, we should add the Hadoop native I/O libraries (typically winutils.exe, hadoop.dll, and the related native binaries): copy the 7 files and paste all of them into the hadoop-3.2.4\bin folder.
3. Setting up environment variables:
Open Advanced system settings > Environment Variables and create two variables:
HADOOP_HOME = C:\hadoopsetup\hadoop-3.2.4
JAVA_HOME = C:\Progra~1\Java\jdk-1.8
Then click on Path and add two entries:
%HADOOP_HOME%\bin
%JAVA_HOME%\bin
Inside the Hadoop folder, open etc > hadoop > core-site.xml with Notepad, paste the given content inside the configuration tags, and save.

CORE-SITE
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9820</value>
</property>
(fs.default.name is the deprecated alias of fs.defaultFS; either name works here.)
Now open the file hadoop-env.cmd with Notepad, set the JAVA_HOME line as given below, and save. (This is a Windows command file, not XML, so there are no configuration tags.)

HADOOP ENV
set JAVA_HOME=C:\Progra~1\Java\jdk-1.8 (path of Java)
HDFS-SITE
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///C:/hadoopsetup/hadoop-3.2.4/data/dfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///C:/hadoopsetup/hadoop-3.2.4/data/dfs/datanode</value>
</property>
MAPRED-SITE
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>MapReduce framework name</description>
</property>
YARN-SITE
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>Yarn Node Manager Aux Service</description>
</property>
Formatting the File System
hdfs namenode -format

STARTING HADOOP
.\start-dfs.cmd
.\start-yarn.cmd
jps
Important Links
http://localhost:9870/dfshealth.html
http://localhost:9864/datanode.html
http://localhost:8088/cluster
COMMANDS:
1. cd C:\hadoopsetup\hadoop-3.2.4\sbin
2. start-all.cmd
3. hdfs dfs -ls /
4. hdfs dfs -mkdir /data
5. hdfs dfs -touchz /data/test.dat
6. hdfs dfs -ls /data/
7. hdfs dfs -du /data/test.dat
8. hdfs dfs -put "C:\Users\ASUS\Desktop\Queries.txt" /zahra
9. hdfs dfs -ls /zahra
10. hdfs dfs -cat /zahra/Queries.txt
11. hdfs dfs -rm -r /abc/student.txt
12. hdfs dfs -copyToLocal /zahra/Queries.txt C:\
13. hdfs dfs -get /data/folder "C:\Users\ASUS\Desktop\HADOOPFILES"
14. hdfs dfs -appendToFile - /data/folder
15. hdfs dfs -cp /abc/student.txt /data/
16. hdfs dfs -mv /data/student.txt /zeenat/
17. hdfs dfs -rmdir /test ----------------------- to remove an empty directory
18. hdfs dfs -rm /zeenat/student.txt --------------------- to remove files
19. hdfs dfs -rm -r /abc/student.txt -------------------- to remove directories/files recursively
20. C:\hadoopsetup\hadoop-3.2.4\sbin>hdfs dfs -usage mkdir
21. C:\hadoopsetup\hadoop-3.2.4\sbin>hdfs dfs -help
22. hdfs dfs -moveFromLocal "C:\Users\ASUS\Desktop\HADOOPFILES\hdfs.txt" /zeenat
23. hdfs dfs -getmerge /zeenat/myfile.txt /zeenat/hdfs.txt "C:/Users/ASUS/Desktop/HADOOPFILES/result.txt"
24. Command that is used to list files of the local file system:
hdfs dfs -ls file:///C:/Users/ASUS/Desktop/HADOOPFILES
25. Command that is used to display content of the local file system:
hdfs dfs -cat file:///C:/Users/ASUS/Desktop/HADOOPFILES/result.txt
26. hdfs dfs -checksum /zeenat/hdfs.txt
27. hdfs dfs -chgrp zeenat /zeenat/hdfs.txt
28. hdfs dfs -chown Asus:zeenat /zeenat/myfile.txt
29. hdfs dfs -expunge
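The same operations are also available programmatically. Below is a minimal Java sketch (the class name and paths are illustrative assumptions, not from the steps above) using the Hadoop FileSystem API to mirror -mkdir, -put, and -ls:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode configured in core-site.xml
        conf.set("fs.defaultFS", "hdfs://localhost:9820");
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -mkdir /data
        fs.mkdirs(new Path("/data"));

        // Equivalent of: hdfs dfs -put "C:\Users\ASUS\Desktop\Queries.txt" /data
        fs.copyFromLocalFile(new Path("C:/Users/ASUS/Desktop/Queries.txt"),
                new Path("/data/Queries.txt"));

        // Equivalent of: hdfs dfs -ls /data
        for (FileStatus st : fs.listStatus(new Path("/data"))) {
            System.out.println(st.getPath());
        }
        fs.close();
    }
}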
Word Count example:
cd C:\hadoopsetup\hadoop-3.2.4\sbin
start-all.cmd
start-yarn.cmd
jps

The examples jar is located at hadoopsetup > hadoop-3.2.4 > share > hadoop > mapreduce > hadoop-mapreduce-examples-3.2.4.jar.

Open Notepad, write/add some content, and save the file.

hdfs dfs -mkdir /directory1
hdfs dfs -put "C:\Users\ASUS\Desktop\HADOOPFILES\ab.txt" /directory1
hadoop jar C:\hadoopsetup\hadoop-3.2.4\share\hadoop\mapreduce\hadoop-mapreduce-examples-3.2.4.jar wordcount /directory1 outputdir1

(Copy the path of the jar file and the name of the jar file, paste the file name adding .jar, then enter the class name (wordcount), the input directory, and the output directory.)

localhost:9870
localhost:8088 (to check the status)

In localhost:9870, click on user Asus, click on the output directory, and download the file.
Problem 1: Character Count
Objective: Count the number of occurrences of each character in a text file.
Step-by-Step Solution:
1. Setup the Project:
   - Create a new Java project in Eclipse.
   - Add the Hadoop library to the build path.
2. Create the Mapper Class:
   - This Mapper reads each line of text, splits it into characters, and emits each character with a count of one.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CharCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text character = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Emit each character of the line with a count of one
        for (char c : line.toCharArray()) {
            character.set(Character.toString(c));
            context.write(character, one);
        }
    }
}
3. Create the Reducer Class:
This Reducer sums the counts for each
character.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CharCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all counts received for this character
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
4. Create the Driver Class:
This class sets up and runs the
MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CharCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "character count");
        job.setJarByClass(CharCount.class);
        job.setMapperClass(CharCountMapper.class);
        // The reducer is associative, so it can also serve as the combiner
        job.setCombinerClass(CharCountReducer.class);
        job.setReducerClass(CharCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
5. Run the Job:
Create input and output directories.
Place the input text file in the input directory.
Run the job from Eclipse, passing the input and output paths as arguments (a jar-based run is sketched below).
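Alternatively, if the three classes are exported to a jar (the jar path, jar name, and HDFS paths here are assumed for illustration), the job can be launched from the command line just like the WordCount example:

hadoop jar C:\Users\ASUS\Desktop\HADOOPFILES\CharCount.jar CharCount /directory1 /charoutput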
2. WordCount
hdfs dfs -mkdir /input
Copy a text file from the local system and paste it into the "input" directory in HDFS.
Open Eclipse.
Click on File...> New...> Java Project.
Enter the project name "MapReduceWordCount".
Click on Next...> Finish.
Right-click on "MapReduceWordCount"...> New...> Package...> enter the package name "com.mapreduce.wc"...> Finish.
Again right-click on "MapReduceWordCount"...> New...> Build Path...> Configure Build Path...> Libraries...> Add External JARs.
Go to hadoopsetup...> hadoop-3.2.4...> share...> hadoop...> client...> add all JAR files.
Click on Add External JARs...> common...> add all JAR files.
Click on Add External JARs...> common...> lib...> add all JAR files.
Click on Add External JARs...> yarn...> add all JAR files.
Click on Add External JARs...> mapreduce...> add all JAR files.
Click on Add External JARs...> hdfs...> add all JAR files.
After adding all JAR files, click on Apply and Close.
Click on the package name "com.mapreduce.wc"...> New...> Class...> in the Name field enter "WordCount"...> Finish.
Copy and paste the program...> File...> Save.
If any error occurs in the program...> right-click on "MapReduceWordCount"...> Build Path...> Configure Build Path...> Libraries...> Add External JARs...> JAR files of the lib folders of yarn, hdfs, mapreduce, and common.
Go to the project file...> right-click...> Export...> inside the Java folder click on JAR file...> Next.
To change the path, click on Browse and choose any location...> create a folder with the name "JARFILES"...> click on the JARFILES folder, save the file as WordCountMapReduce...> Save...> Finish...> OK.

hdfs dfs -mkdir /input
hdfs dfs -put C:\Users\ASUS\Desktop\HADOOPFILES\bigdata.txt /input
hadoop jar C:\Users\ASUS\Downloads\b.jar com.mapreduce.wc.WordCount /input/bigdata.txt /output
hdfs dfs -cat /output/*
Mapper class: In Hadoop's MapReduce
framework, the Mapper class is a core component
responsible for processing the input data and
producing key-value pairs that are used as the input
for the subsequent stages of the MapReduce job.
Context:
It helps us interact with the outside world (other components of Hadoop such as YARN, MapReduce, ...).
Mapper class: Processes input data and produces key-value pairs.
map() method: Transforms each input record into intermediate key-value pairs.
Context: Used to emit key-value pairs to the framework for further processing.
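For reference, here is a minimal sketch of the TokenizerMapper that the WordCount driver below refers to, following the shape of the classic Hadoop WordCount example:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Split each input line into tokens and emit (word, 1) for every token
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}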
MAPPER OUTPUT <K,V> ...> PARTITIONER OUTPUT <K, LIST[V]> ...> REDUCER
PARTITIONER OUTPUT: Generates a list of values against every key.
Reducer class: In Hadoop's MapReduce
framework, the Reducer class plays a crucial role
in processing the intermediate key-value pairs
generated by the Mapper. It is responsible for
aggregating, summarizing, or otherwise processing
the data to produce the final output of the
MapReduce job.
Reducer class: Processes the intermediate key-
value pairs generated by the Mapper.
reduce() method: Aggregates or processes the list of values for each key to produce the final output.
Context: Used to emit the final key-value
pairs, which are written as the output of the
MapReduce job.
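Likewise, a minimal sketch of the IntSumReducer used by the driver below, again following the classic WordCount example:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Sum the counts for each word and emit (word, total)
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}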
Driver Class: The major component in a MapReduce job is the Driver Class. It is responsible for setting up a MapReduce job to run in Hadoop.
public static void main(String[] args) throws
Exception {
Configuration conf = new Configuration();
The Configuration object contains all Hadoop settings necessary to launch your app. It is in key-value format and is read from the XML files in etc/hadoop. You can also use Configuration to change configuration parameters.
Job job = Job.getInstance(conf, "word count");
It allows the user to configure the job, submit it,
control its execution, and query the state.
job.setJarByClass(WordCount.class);
//specify various job-specific parameters
job.setMapperClass(TokenizerMapper.class);
//setting mapper class
job.setCombinerClass(IntSumReducer.class);
//setting combiner class
job.setReducerClass(IntSumReducer.class);
//setting reducer class
job.setOutputKeyClass(Text.class);
//setting output key
job.setOutputValueClass(IntWritable.class);
//setting output value
FileInputFormat.addInputPath(job, new Path(args[0]));
//setting the input path
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//setting the output path
System.exit(job.waitForCompletion(true) ? 0 : 1);
//submitting the job and waiting for it to complete
}
Driver Class: The main entry point of a
MapReduce job, responsible for setting up,
configuring, and submitting the job to the Hadoop
cluster.
Job Configuration: Defines input/output paths,
Mapper/Reducer classes, and other job settings.
Job Submission: Submits the job to the cluster and
monitors its progress until completion.
Hive:
cd C:\hadoopsetup\hadoop-3.2.4\sbin
start-all.cmd
start-yarn.cmd
cd C:\hive\apache-hive-3.1.2-bin\apache-hive-3.1.2-bin\bin
hive --service schematool -dbType derby -initSchema
hdfs dfsadmin -safemode leave
hive --service schematool -dbType derby -initSchema
(if the first schematool run fails because HDFS is still in safe mode, leave safe mode and run it again)
C:\hive\apache-hive-3.1.2-bin\apache-hive-3.1.2-bin\bin>hive
hive> create database if not exists abc;
hive> show databases;
show databases like 'm*';
describe database abc;
drop database abc;
Create table syntax:
CREATE TABLE table_name (
  column_name1 data_type,
  column_name2 data_type, ...
)
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION 'path']

ROW FORMAT: Optional specification of how rows are formatted (e.g., DELIMITED).
STORED AS: Optional file format for storing the data (e.g., TEXTFILE, PARQUET).

hive> create table customer(id INT, fname STRING, lname STRING, city STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '|'
    > STORED AS TEXTFILE;
describe customer;
Create any text file, insert data into the file, and save it.
LOAD DATA LOCAL INPATH 'C:/Users/ASUS/Desktop/HADOOPFILES/hive.txt' INTO TABLE customer;
(paste the path of the saved text file)
select * from customer;
drop table customer;
alter table customer rename to employees;
alter table employees add columns (salary int);
hive> alter table employees
    > change column lname mname string;
hive> alter table employees replace columns(id int, fname string, mname string, city string);
DML:
hive> insert into table stu values(100, 'Rohan', 10, 'ECE');
hive> insert into stu values (200, 'Priya', 9, 'CE'), (300, 'Amit', 7, 'CSE'), (400, 'mohit', 10, 'CSE');
hive> create table result(id INT, name STRING, marks INT, branch STRING);
Append data from an existing table:
hive> insert into result select id, name, marks, course from stu;
Truncate: hive> truncate table result;
INSERT OVERWRITE TABLE result SELECT * FROM stu;
Hive Partitioning:
Partitioning in Hive is a way of
dividing a large table into smaller, more
manageable pieces based on the value
of one or more columns. This helps in
faster query execution by scanning only
relevant partitions instead of the entire
table.
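For example, once the part_stu_branch table built below is in place, a query that filters on the partition column (a hypothetical query, assuming a CSE partition exists):

select * from part_stu_branch where branch = 'CSE';

scans only the branch=CSE directory instead of the whole table.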
cd C:\hadoopsetup\hadoop-3.2.4\sbin
start-all.cmd
start-yarn.cmd
cd C:\hive\apache-hive-3.1.2-bin\apache-hive-3.1.2-bin\bin
hdfs dfsadmin -safemode leave
hive --service schematool -dbType derby -initSchema
hive
hive> show tables;
create table students(id INT, name STRING, branch STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE;
load data local inpath 'C:\Users\ASUS\Desktop\HADOOPFILES\hivepartitioning.txt' into table students;
create table part_stu_branch(id INT, name STRING)
partitioned by (branch STRING);
set hive.exec.dynamic.partition.mode = nonstrict;
insert overwrite table part_stu_branch partition(branch)
    > select id, name, branch from students;
Open another command prompt terminal:
cd C:\hadoopsetup\hadoop-3.2.4\sbin
start-all.cmd
hdfs dfs -ls /user/hive/warehouse/part_stu_branch
hdfs dfs -ls "/user/hive/warehouse/part_stu_branch/branch=CSE"
hdfs dfs -cat "/user/hive/warehouse/part_stu_branch/branch=CSE/000000_0"
Hive Bucketing:
SET hive.enforce.bucketing=true;
hive> create table st_bucket(id INT, name STRING, branch STRING)
    > clustered by (id) into 3 buckets
    > row format delimited
    > fields terminated by ',';
insert overwrite table st_bucket select * from students;
Open a new cmd terminal:
hdfs dfs -ls "/user/hive/warehouse/st_bucket"
hdfs dfs -cat "/user/hive/warehouse/st_bucket/000000_0"
Hive Operators:
Hive operators are used in Hive Query Language
(HiveQL) to perform various types of data
manipulation and calculations, much like operators
in SQL.
1. Relational Operators:
Relational operators compare two values and return
a Boolean result (TRUE or FALSE).
= (Equal to): Checks if two values are equal.
!= or <> (Not equal to): Checks if two values are not equal.
> (Greater than): Checks if the left value is
greater than the right value.
< (Less than): Checks if the left value is less than
the right value.
>= (Greater than or equal to): Checks if the left
value is greater than or equal to the right value.
<= (Less than or equal to): Checks if the left
value is less than or equal to the right value.
E.g.:
SELECT * FROM students WHERE id > 20;
SELECT * FROM students WHERE id >= 20 AND branch = 'ME';