Installing Hadoop on Ubuntu
AMRITPAL SINGH
Introduction
• Hadoop is a Java-based programming framework that supports the
processing and storage of extremely large datasets on a cluster of
inexpensive machines.
• It was the first major open source project in the big data playing field
and is sponsored by the Apache Software Foundation.
Introduction
• Hadoop 2.7 consists of four main layers:
• Hadoop Common is the collection of utilities and libraries that
support other Hadoop modules.
• HDFS, which stands for Hadoop Distributed File System, is responsible
for persisting data to disk.
Introduction
• YARN, short for Yet Another Resource Negotiator, acts as the
cluster's "operating system," scheduling and managing the compute
resources that work with the data stored in HDFS.
• MapReduce is the original processing model for Hadoop clusters. It
distributes work within the cluster in a map phase, then organizes and
reduces the results from the nodes into a response to a query.
• Many other processing models are available for the 2.x version of
Hadoop.
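As a loose analogy (not Hadoop itself), the map-then-reduce pattern resembles a familiar shell pipeline: tokenizing input is the "map" step, and grouping and counting identical keys is the "reduce" step:

```shell
# "Map": split the text into one word per line.
# "Reduce": sort groups identical words together, then uniq -c counts each group.
printf 'to be or not to be\n' | tr ' ' '\n' | sort | uniq -c
```

The output lists each distinct word with its count (for example, `2 be` and `2 to`), which is exactly the shape of a word-count MapReduce job.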
Introduction
• Hadoop clusters are relatively complex to set up, so the project
includes a stand-alone mode which is suitable for learning about
Hadoop, performing simple operations, and debugging.
• We'll install Hadoop in stand-alone mode and run one of the example
MapReduce programs it includes to verify the installation.
Prerequisites
• An Ubuntu 16.04 server with a non-root user with sudo privileges
• Java
Steps
• Step 1 — Installing Java
• To get started, we'll update our package list:
• sudo apt-get update
• Next, install OpenJDK, the default Java Development Kit on Ubuntu
16.04.
Steps
• sudo apt-get install default-jdk
• Once the installation is complete, let's check the version.
• java -version
• openjdk version "1.8.0_91"
• OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-
3ubuntu1~16.04.1-b14)
• OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
Steps
• Step 2 — Installing Hadoop
• With Java in place, we'll visit the Apache Hadoop Releases page to
find the most recent stable release.
• http://hadoop.apache.org/releases.html
Steps
• On the server, we'll use wget to fetch it:
• wget http://apache.mirrors.tds.net/hadoop/common/hadoop-
2.7.3/hadoop-2.7.3.tar.gz
• In order to make sure that the file we downloaded hasn't been
altered, we'll do a quick check using SHA-256.
Steps
• Next, we'll copy the link to the .mds checksum file from the releases
page, then use wget to transfer it:
• wget
https://dist.apache.org/repos/dist/release/hadoop/common/hadoop
-2.7.3/hadoop-2.7.3.tar.gz.mds
Steps
• Then run the verification:
• shasum -a 256 hadoop-2.7.3.tar.gz
• Output
• d489df3808244b906eb38f4d081ba49e50c4603db03efd5e594a1e98b
09259c2 hadoop-2.7.3.tar.gz
Steps
• Compare this value with the SHA-256 value in the .mds file:
• cat hadoop-2.7.3.tar.gz.mds
Steps
• You can safely ignore the difference in case and the spaces.
• The output of the command we ran against the file we downloaded
from the mirror should match the value in the file we downloaded
from apache.org.
Steps
• Now that we've verified that the file wasn't corrupted or changed,
we'll use the tar command with the -x flag to extract, -z to
uncompress, -v for verbose output, and -f to specify that we're
extracting from a file.
• Use tab-completion or substitute the correct version number in the
command below:
Steps
• tar -xzvf hadoop-2.7.3.tar.gz
• Finally, we'll move the extracted files into /usr/local, the appropriate
place for locally installed software.
• Change the version number, if needed, to match the version you
downloaded.
Steps
• sudo mv hadoop-2.7.3 /usr/local/hadoop
• With the software in place, we're ready to configure its environment.
Steps
• Step 3 — Configuring Hadoop's Java Home
• Hadoop requires that you set the path to Java, either as an
environment variable or in the Hadoop configuration file.
Steps
• The path to Java, /usr/bin/java, is a symlink to /etc/alternatives/java,
which is in turn a symlink to the default Java binary.
• We will use readlink with the -f flag to follow every symlink in every
part of the path, recursively.
• Then, we'll use sed to trim bin/java from the output to give us the
correct value for JAVA_HOME.
Steps
• To find the default Java path
• readlink -f /usr/bin/java | sed "s:bin/java::"
• Output
• /usr/lib/jvm/java-8-openjdk-amd64/jre/
Steps
• You can copy this output to set Hadoop's Java home to this specific
version, which ensures that if the default Java changes, this value will
not.
• Alternatively, you can use the readlink command dynamically in the
file so that Hadoop will automatically use whatever Java version is set
as the system default.
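As a sketch, the JAVA_HOME line in hadoop-env.sh could be written either way; the static path below is the one readlink reported above, so adjust it if your output differs:

```shell
# In /usr/local/hadoop/etc/hadoop/hadoop-env.sh, choose ONE of the following:

# Option 1: pin a specific Java installation (value copied from the readlink output above)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/

# Option 2: resolve whatever Java the system default points to, each time Hadoop starts
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
```

Option 1 keeps Hadoop on a known-good JDK even if the system default later changes; Option 2 tracks the system default automatically.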
Steps
• To begin, open hadoop-env.sh:
• sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Step 4 — Running Hadoop
• /usr/local/hadoop/bin/hadoop
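To verify the installation end to end, one of the example MapReduce programs that ships with Hadoop can be run against the bundled configuration files. This is a sketch: the jar name assumes version 2.7.3 (substitute yours if it differs), and the output directory must not exist before the job runs:

```shell
# Use Hadoop's bundled config files as sample input
mkdir ~/input
cp /usr/local/hadoop/etc/hadoop/*.xml ~/input

# Run the example grep job (jar name assumes Hadoop 2.7.3; adjust to your version).
# It counts matches of the given regular expression across the input files.
/usr/local/hadoop/bin/hadoop jar \
    /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar \
    grep ~/input ~/grep_example 'principal[.]*'

# Inspect the results written to the output directory
cat ~/grep_example/*
```

If the job prints a match count, stand-alone Hadoop is working.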