EXPERIMENT NO – 1
1. Install Apache Hadoop
AIM: Installation of Single Node Hadoop Cluster on Ubuntu 20.04.4
PROCEDURE:
Prerequisites:
1. Install OpenJDK on Ubuntu.
2. Install OpenSSH on Ubuntu.
3. Create Hadoop User.
Step 1: Installing Java on Ubuntu.
The Hadoop framework is written in Java, and its services require a compatible Java Runtime Environment
(JRE) and Java Development Kit (JDK). Use the following command to update your system before initiating
a new installation:
sudo apt update
The OpenJDK 8 package in Ubuntu contains both the runtime environment and development kit.
Type the following command in your terminal to install OpenJDK 8:
sudo apt install openjdk-8-jdk
The OpenJDK or Oracle Java version can affect how elements of a Hadoop ecosystem interact.
Step 2: Find Version of Java Installed
Once the installation process is complete, verify the current Java version:
java -version; javac -version
Step 3: To know the Java path
Type the following command in your terminal.
sudo update-alternatives --config java
sudo update-alternatives --config javac
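If only the installation paths are needed (without the interactive selection menu), the following optional variant of the same tool simply lists the registered Java alternatives; it is a convenience shortcut, not part of the original procedure:
update-alternatives --list java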
Step 4: Install OpenSSH on Ubuntu
Install the OpenSSH server and client using the following command:
sudo apt install openssh-server openssh-client
If OpenSSH is already installed, the output will confirm that the latest version is already present.
Step 5: Create Hadoop User
The adduser command is used to create a new Hadoop user:
sudo adduser hdoop
The username in the above command is hdoop. You can use any username and password. Switch to
the newly created user and enter the corresponding password:
su - hdoop
Step 6: Verify SSH Installation
Run the following commands to verify whether SSH is installed:
which ssh
Result: /usr/bin/ssh
which sshd
Result: /usr/bin/sshd
Step 7:
Hadoop uses SSH (to access its nodes) which would normally require the user to enter a password.
However, this requirement can be eliminated by creating and setting up SSH certificates using the following
command. If asked for a filename just leave it blank and press the enter key to continue.
su - hdoop
The following command generates an SSH key pair and defines the location where it is to be stored:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
The system proceeds to generate and save the SSH key pair.
The following command adds the newly created key to the list of authorized keys so that Hadoop can use
ssh without prompting for a password.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
The new user is now able to SSH without needing to enter a password every time. Verify everything is set
up correctly by using the hdoop user to SSH to localhost:
ssh localhost
The Hadoop user is now able to establish an SSH connection to the localhost.
Download and Install Hadoop on Ubuntu
Note: Modify the commands below based on your Hadoop version.
Step 8: Visit the official Apache Hadoop page, and select the version of Hadoop you want to implement.
This procedure uses the binary download for Hadoop version 3.2.1.
Select your preferred option, and you will get a mirror link that allows you to download the Hadoop tar
package.
Step 9: Use the provided mirror link and download the Hadoop package with the wget command:
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
Step 10:
Once the download is complete, extract the files to initiate the Hadoop installation by using the following
command:
tar xzf hadoop-3.2.1.tar.gz
Step 11:
To move the extracted Hadoop folder to the /usr/local/hadoop directory, use the following command:
sudo mv hadoop-3.2.1 /usr/local/hadoop
Step 12: Set read/write permission
sudo chown -R hdoop:hdoop /usr/local/hadoop
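As an optional check, the following command should now show hdoop as the owner of the Hadoop directory (assuming the ownership change above succeeded):
ls -ld /usr/local/hadoop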
Setup Configuration Files
Hadoop excels when deployed in a fully distributed mode on a large cluster of networked
servers. However, if you are new to Hadoop and want to explore basic commands or test
applications, you can configure Hadoop on a single node.
This setup, also called pseudo-distributed mode, allows each Hadoop daemon to run as a single
Java process. A Hadoop environment is configured by editing a set of configuration files:
bashrc
hadoop-env.sh
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
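All of these files except .bashrc live in the Hadoop configuration directory. Once HADOOP_HOME is defined in Step 13, they can be listed, as an optional check, with:
ls $HADOOP_HOME/etc/hadoop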
Step 13: Configure Hadoop Environment Variables (bashrc)
Before editing the .bashrc file in hdoop's home directory, find the path where Java has been
installed in order to set the JAVA_HOME environment variable (see Step 3).
sudo gedit ~/.bashrc
Use the above command to open the file, then define the Hadoop environment variables by adding the following
content to the end of the file. Adjust HADOOP_HOME so that it points to the directory where Hadoop was actually
extracted or moved (for example, /usr/local/hadoop if you followed Step 11, or /home/hdoop/hadoop-3.2.1 if the
archive was extracted in the hdoop home directory):
#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Once you add the variables, save and exit the .bashrc file.
Step 14: To apply the changes to the current running environment use the following command:
source ~/.bashrc
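As an optional sanity check (assuming HADOOP_HOME in Step 13 points to the actual Hadoop directory), the following commands should print the configured path and the Hadoop version:
echo $HADOOP_HOME
hadoop version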
Step 15: Edit hadoop-env.sh File
The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related
project settings.
When setting up a single node Hadoop cluster, you need to define which Java implementation is to be
utilized. Use the previously created $HADOOP_HOME variable to access the hadoop-env.sh file:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to
the OpenJDK installation on your system by adding the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
The path needs to match the location of the Java installation on your system.
If you need help to locate the correct Java path, run the following command in your terminal window:
which javac
The resulting output provides the path to the Java binary directory.
Use the provided path to find the OpenJDK directory with the following command:
readlink -f /usr/bin/javac
The section of the path just before the /bin/javac directory needs to be assigned to
the $JAVA_HOME variable.
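For example, if readlink prints /usr/lib/jvm/java-8-openjdk-amd64/bin/javac, then JAVA_HOME is /usr/lib/jvm/java-8-openjdk-amd64. A small optional one-liner that strips the /bin/javac suffix automatically (assuming the default OpenJDK package layout) is:
readlink -f /usr/bin/javac | sed 's|/bin/javac||'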
Step 16: Edit core-site.xml File
The core-site.xml file defines HDFS and Hadoop core properties.
To set up Hadoop in a pseudo-distributed mode, you need to specify the URL for your NameNode, and the
temporary directory Hadoop uses for the map and reduce process.
Open the core-site.xml file in a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration to override the default values for the temporary directory and add your
HDFS URL to replace the default local file system setting:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
This example uses values specific to the local system. You should use values that
match your system's requirements. The data needs to be consistent throughout the
configuration process.
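The temporary directory specified above does not exist by default; if you use the example value, it can be created as the hdoop user with:
mkdir -p /home/hdoop/tmpdata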
Step 17: Edit hdfs-site.xml File
The properties in the hdfs-site.xml file govern the location for storing node metadata, fsimage file, and edit
log file. Configure the file by defining the NameNode and DataNode storage directories.
Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single node
setup.
Use the following command to open the hdfs-site.xml file for editing:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration to the file and, if needed, adjust the NameNode and
DataNode directories to your custom locations:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
If necessary, create the specific directories you defined for the NameNode and DataNode storage values.
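For the example values shown above, the NameNode and DataNode directories can be created as the hdoop user with:
mkdir -p /home/hdoop/dfsdata/namenode /home/hdoop/dfsdata/datanode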
Step 18: Edit mapred-site.xml File
Use the following command to access the mapred-site.xml file and define MapReduce values:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration to change the default MapReduce framework name value to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
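Note: On Hadoop 3.x, MapReduce jobs submitted to YARN may fail with class-not-found errors unless the MapReduce classpath is also set. If that happens, the Apache single-node setup guide adds a property along the following lines to this same file (shown here as an optional sketch; adjust it to your environment):
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>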
Step 19: Edit yarn-site.xml File
The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the Node
Manager, Resource Manager, Containers, and Application Master.
Open the yarn-site.xml file in a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Append the following configuration to the file:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
Step 20: Format HDFS NameNode
It is important to format the NameNode before starting Hadoop services for the first time:
hdfs namenode -format
The shutdown notification signifies the end of the NameNode format process.
Step 21: Starting Hadoop
Navigate to the hadoop-3.2.1/sbin directory and execute the following command to
start the HDFS daemons (NameNode, DataNode, and SecondaryNameNode):
start-dfs.sh
The system takes a few moments to initiate the necessary nodes.
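start-dfs.sh starts only the HDFS daemons. Because YARN was configured in Step 19, the ResourceManager and NodeManager can be started afterwards with the companion script from the same sbin directory:
start-yarn.sh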
Step 22:
To check the processes running in the Hadoop cluster, use the jps command. jps stands
for Java Virtual Machine Process Status Tool.
After running the jps command, the Hadoop daemons started in the previous steps should be listed.
Note: The Hadoop installation is successful only if these daemons are running.
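A minimal check, assuming both start-dfs.sh and start-yarn.sh were run, is simply:
jps
The listing should typically include NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and the Jps tool itself (the process IDs will differ from system to system). ResourceManager and NodeManager appear only if the YARN daemons were started.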
Step 23: Access Hadoop from Browser
Use your preferred browser and navigate to your localhost URL or IP. The default port
number 9870 gives you access to the Hadoop NameNode:
http://localhost:9870
The NameNode user interface provides a comprehensive overview of the entire cluster.
The default port 9864 is used to access individual DataNodes directly from your
browser:
http://localhost:9864
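If the YARN daemons were started as well, the ResourceManager web interface is by default available on port 8088:
http://localhost:8088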
Result: The installation of a single-node Hadoop cluster on Ubuntu 20.04.4 is successfully completed.