Hadoop Cluster

Installation and configuration of a 3-node Hadoop cluster with HDFS HA and YARN on Amazon EC2 instances, covering ZooKeeper setup, CDH 5 package installation, and core Hadoop configuration.

Description:

 3-node cluster with NameNode HA and 2 DataNodes.
 Automatic failover.
 CDH5Namenode (ip-172-31-32-60.ec2.internal) – NameNode 1, DataNode 1, NodeManager 1
 CDH5DN2 (ip-172-31-41-205.ec2.internal) – DataNode 2, NodeManager 2, NameNode 2
 CDH5RM (ip-172-31-36-158.ec2.internal) – ResourceManager

SW/HW requirements:

PuTTYgen, PuTTY, Amazon AWS CentOS 6.4 (or later) instances, Java 1.7.0_55

Installation and configuration

1. Create three CentOS 7 instances on Amazon AWS in the free tier (8 GB disk each).

Sequence  Commands                                    Machine  Remarks
1         sudo yum update                             All      To apply patch updates.
2         sudo yum install gcc                        All      To install the gcc compiler and the kernel
          sudo yum install kernel-devel                        development package.
3         sudo yum install java-1.7.0-openjdk-devel   All      To install the latest Java development
                                                               package on all the machines (same version
                                                               on every machine).
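
To confirm that every machine ended up on the same JDK build after step 3, a quick check on each node (the exact version string will vary with your environment):

$ java -version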

To add the CDH 5 repository:

Click the entry in the table below that matches your Red Hat or CentOS system, navigate to the repo file
for your system, and save it in the /etc/yum.repos.d/ directory.

For OS Version                    Click this Link

Red Hat/CentOS/Oracle 5           Red Hat/CentOS/Oracle 5 link
Red Hat/CentOS/Oracle 6 (64-bit)  Red Hat/CentOS/Oracle 6 link

Optionally Add a Repository Key


Before installing YARN or MRv1: (Optionally) add a repository key on each system in the cluster. Add
the Cloudera Public GPG Key to your repository by executing one of the following commands:

 For Red Hat/CentOS/Oracle 6 systems:

$ sudo rpm --import http://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

Install ZooKeeper

Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This
is a requirement if you are deploying high availability (HA) for the NameNode.

Installing the ZooKeeper Packages


There are two ZooKeeper server packages:

 The zookeeper base package provides the basic libraries and scripts that are necessary to run
ZooKeeper servers and clients. The documentation is also included in this package.
 The zookeeper-server package contains the init.d scripts necessary to run ZooKeeper as a
daemon process. Because zookeeper-server depends on zookeeper, installing the server package
automatically installs the base package.

Installing the ZooKeeper Base Package


To install ZooKeeper On Red Hat-compatible systems:

$ sudo yum install zookeeper

Installing the ZooKeeper Server Package and Starting ZooKeeper on a Single Server

The instructions provided here deploy a single ZooKeeper server in "standalone" mode. This is
appropriate for evaluation, testing and development purposes, but may not provide sufficient reliability for
a production application. See Installing ZooKeeper in a Production Environment for more information.

To install the ZooKeeper Server On Red Hat-compatible systems:

$ sudo yum install zookeeper-server

To create /var/lib/zookeeper and set permissions:

sudo mkdir -p /var/lib/zookeeper

sudo chown -R zookeeper /var/lib/zookeeper/
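
With the data directory in place, the standalone server can be initialized and started. A minimal sketch, assuming the service script installed by the zookeeper-server package:

$ sudo service zookeeper-server init

$ sudo service zookeeper-server start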

Installing ZooKeeper in a Production Environment


In a production environment, you should deploy ZooKeeper as an ensemble with an odd number of
nodes. As long as a majority of the servers in the ensemble are available, the ZooKeeper service will be
available. The minimum recommended ensemble size is three ZooKeeper servers, and it is
recommended that each server run on a separate machine.

Deploying a ZooKeeper ensemble requires some additional configuration. The configuration file
(zoo.cfg) on each server must include a list of all servers in the ensemble, and each server must also
have a myid file in its data directory (by default /var/lib/zookeeper) that identifies it as one of the
servers in the ensemble. Proceed as follows on each server.
1. Use the commands under Installing the ZooKeeper Server Package and Starting ZooKeeper on a
Single Server to install zookeeper-server on each host.
2. Test the expected loads to set the Java heap size so as to avoid swapping. Make sure you are well below
the threshold at which the system would start swapping; for example 12GB for a machine with 16GB of
RAM.
3. Create a configuration file. This file can be called anything you like, and must specify settings for at least
the parameters shown under "Minimum Configuration" in the ZooKeeper Administrator's Guide. You
should also configure values for initLimit, syncLimit, and server.n; see the explanations in the
administrator's guide. For example:

tickTime=2000

dataDir=/var/lib/zookeeper/

clientPort=2181

initLimit=5

syncLimit=2

server.1=zoo1:2888:3888

server.2=zoo2:2888:3888

server.3=zoo3:2888:3888

In this example, the final three lines are in the form server.id=hostname:port:port. The first port is
for a follower in the ensemble to listen on for the leader; the second is for leader election. You set id for
each server in the next step.

4. Create a file named myid in the server's dataDir; in this example, /var/lib/zookeeper/myid.
The file must contain only a single line, and that line must consist of a single unique number between 1
and 255; this is the id component mentioned in the previous step. In this example, the server whose
hostname is zoo1 must have a myid file that contains only 1. (A sketch of creating this file appears after
this list.)
5. Start each server as described in the previous section.
6. Test the deployment by running a ZooKeeper client:

zookeeper-client -server hostname:port

For example:

zookeeper-client -server zoo1:2181
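
For instance, on the host acting as server.1 in the sample zoo.cfg above (hostname zoo1 in this example), creating the myid file and starting the daemon might look like the following sketch:

$ echo 1 | sudo tee /var/lib/zookeeper/myid

$ sudo chown zookeeper /var/lib/zookeeper/myid

$ sudo service zookeeper-server start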

For more information on configuring a multi-server deployment, see Clustered (Multi-Server) Setup in
the ZooKeeper Administrator's Guide.

Install CDH 5 with YARN


Note:
If you decide to configure HA for the NameNode, do not install hadoop-hdfs-secondarynamenode.
After completing the HA software configuration, follow the installation instructions under Deploying
HDFS High Availability.

Install each type of daemon package on the appropriate system(s), as follows.

Resource Manager host (analogous to MRv1 JobTracker), Red Hat/CentOS compatible:

sudo yum clean all; sudo yum install hadoop-yarn-resourcemanager

NameNode host, Red Hat/CentOS compatible:

sudo yum clean all; sudo yum install hadoop-hdfs-namenode

All cluster hosts except the Resource Manager, Red Hat/CentOS compatible:

sudo yum clean all; sudo yum install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce

All client hosts, Red Hat/CentOS compatible:

sudo yum clean all; sudo yum install hadoop-client

Customizing Configuration Files


The following tables show the most important properties that you must configure for your cluster.

Note:

For information on other important configuration properties, and the configuration files, see the Apache
Cluster Setup page.

Sample Configuration
core-site.xml:

<property>

<name>fs.defaultFS</name>

<value>hdfs://namenode-host.company.com:8020</value>
</property>

hdfs-site.xml:

<property>

<name>dfs.permissions.superusergroup</name>

<value>hadoop</value>

</property>

Configuring Local Storage Directories


You need to specify, create, and assign the correct permissions to the local directories where you want
the HDFS daemons to store data. You specify the directories by configuring the following two properties
in the hdfs-site.xml file.

Property: fs.defaultFS
Configuration File: core-site.xml
Description: Specifies the NameNode and the default file system, in the form hdfs://<namenode
host>:<namenode port>/. (Note: fs.default.name is deprecated.) The default value is file:///. The
default file system is used to resolve relative paths; for example, if fs.default.name or fs.defaultFS
is set to hdfs://mynamenode/, the relative URI /mydir/myfile resolves to
hdfs://mynamenode/mydir/myfile. Note: for the cluster to function correctly, the <namenode> part
of the string must be the hostname (for example mynamenode), or the HA-enabled logical URI, not the
IP address.

Property: dfs.permissions.superusergroup
Configuration File: hdfs-site.xml
Description: Specifies the UNIX group containing users that will be treated as superusers by HDFS. You
can stick with the value of 'hadoop' or pick your own group depending on the security policies at your
site.

Note:
dfs.data.dir and dfs.name.dir are deprecated; you should use dfs.datanode.data.dir and
dfs.namenode.name.dir instead, though dfs.data.dir and dfs.name.dir will still work.

Sample configuration:

hdfs-site.xml on the NameNode:


<property>

<name>dfs.namenode.name.dir</name>

<value>file:///data/1/dfs/nn,file:///nfsmount/dfs/nn</value>

</property>

hdfs-site.xml on each DataNode:

<property>

<name>dfs.datanode.data.dir</name>

<value>file:///data/1/dfs/dn,file:///data/2/dfs/dn,file:///data/3/dfs/dn,file:///data/4/dfs/dn</value>

</property>

After specifying these directories as shown above, you must create the directories and assign the correct
file permissions to them on each node in your cluster.

In the following instructions, local path examples are used to represent Hadoop parameters. Change the
path examples to match your configuration.

Local directories:

 The dfs.name.dir or dfs.namenode.name.dir parameter is represented by the /data/1/dfs/nn
and /nfsmount/dfs/nn path examples.
 The dfs.data.dir or dfs.datanode.data.dir parameter is represented by
the /data/1/dfs/dn, /data/2/dfs/dn, /data/3/dfs/dn, and /data/4/dfs/dn examples.

To configure local storage directories for use by HDFS:

1. On a NameNode host: create the dfs.name.dir or dfs.namenode.name.dir local directories:

$ sudo mkdir -p /data/1/dfs/nn /nfsmount/dfs/nn

Important:

If you are using High Availability (HA), you should not configure these directories on an NFS mount;
configure them on local storage.

2. On all DataNode hosts: create the dfs.data.dir or dfs.datanode.data.dir local directories:

$ sudo mkdir -p /data/1/dfs/dn /data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn

3. Configure the owner of the dfs.name.dir or dfs.namenode.name.dir directory, and of
the dfs.data.dir or dfs.datanode.data.dir directory, to be the hdfs user:
$ sudo chown -R hdfs:hdfs /data/1/dfs/nn /nfsmount/dfs/nn /data/1/dfs/dn
/data/2/dfs/dn /data/3/dfs/dn /data/4/dfs/dn

Here is a summary of the correct owner and permissions of the local directories:

Directory                               Owner      Permissions
dfs.name.dir or dfs.namenode.name.dir   hdfs:hdfs  drwx------
dfs.data.dir or dfs.datanode.data.dir   hdfs:hdfs  drwx------

Footnote: The Hadoop daemons automatically set the correct permissions for you
on dfs.data.dir or dfs.datanode.data.dir. But in the case
of dfs.name.dir or dfs.namenode.name.dir, permissions are currently incorrectly set to the file-
system default, usually drwxr-xr-x (755). Use the chmod command to reset permissions for
these dfs.name.dir or dfs.namenode.name.dir directories to drwx------ (700); for example:

$ sudo chmod 700 /data/1/dfs/nn /nfsmount/dfs/nn

or

$ sudo chmod go-rx /data/1/dfs/nn /nfsmount/dfs/nn

Note:

If you specified nonexistent directories for the dfs.data.dir or dfs.datanode.data.dir property in
the hdfs-site.xml file, CDH 5 will shut down. (In previous releases, CDH silently ignored nonexistent
directories for dfs.data.dir.)

Formatting the NameNode


Before starting the NameNode for the first time you need to format the file system.

Important:

 Make sure you format the NameNode as user hdfs.
 If you are re-formatting the NameNode, keep in mind that this invalidates the DataNode storage locations,
so you should remove the data under those locations after the NameNode is formatted.

$ sudo -u hdfs hdfs namenode -format

 Note: the JournalNodes and ZooKeeper must be up before formatting the NameNode.

Configuring Software for HDFS HA


This section describes the software configuration required for HDFS HA.

Note:
The subsections that follow explain how to configure HDFS HA using Quorum-based storage. This is the
only implementation supported in CDH 5.

Configuration Overview
As with HDFS Federation configuration, HA configuration is backward compatible and allows existing
single NameNode configurations to work without change. The new configuration is designed such that all
the nodes in the cluster can have the same configuration without the need for deploying different
configuration files to different machines based on the type of the node.

HA clusters reuse the NameService ID to identify a single HDFS instance that may consist of multiple HA
NameNodes. In addition, there is a new abstraction called NameNode ID. Each distinct NameNode in the
cluster has a different NameNode ID. To support a single configuration file for all of the NameNodes, the
relevant configuration parameters include the NameService ID as well as the NameNode ID.

Changes to Existing Configuration Parameters


The following configuration parameter has changed for YARN implementations:

fs.defaultFS - formerly fs.default.name, the default path prefix used by the Hadoop FS client
when none is given. (fs.default.name is deprecated for YARN implementations, but will still work.)

Optionally, you can configure the default path for Hadoop clients to use the HA-enabled logical URI. For
example, if you use mycluster as the NameService ID as shown below, this will be the value of the
authority portion of all of your HDFS paths. You can configure the default path in your core-
site.xml file:

 For YARN:

<property>

<name>fs.defaultFS</name>

<value>hdfs://mycluster</value>

</property>

New Configuration Parameters


To configure HA NameNodes, you must add several configuration options to your hdfs-
site.xml configuration file.

The order in which you set these configurations is unimportant, but the values you choose
for dfs.nameservices and dfs.ha.namenodes.[NameService ID] will determine the keys of
those that follow. This means that you should decide on these values before setting the rest of the
configuration options.

Configure dfs.nameservices

dfs.nameservices - the logical name for this new nameservice

Choose a logical name for this nameservice, for example mycluster, and use this logical name for the
value of this configuration option. The name you choose is arbitrary. It will be used both for configuration
and as the authority component of absolute HDFS paths in the cluster.

Note:
If you are also using HDFS Federation, this configuration setting should also include the list of other
nameservices, HA or otherwise, as a comma-separated list.
<property>

<name>dfs.nameservices</name>

<value>mycluster</value>

</property>

Configure dfs.ha.namenodes.[nameservice ID]

dfs.ha.namenodes.[nameservice ID] - unique identifiers for each NameNode in the nameservice

Configure a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the
NameNodes in the cluster. For example, if you used mycluster as the NameService ID previously, and
you wanted to use nn1 and nn2 as the individual IDs of the NameNodes, you would configure this as
follows:

<property>

<name>dfs.ha.namenodes.mycluster</name>

<value>nn1,nn2</value>

</property>

Note:
In this release, you can configure a maximum of two NameNodes per nameservice.

Configure dfs.namenode.rpc-address.[nameservice ID]

dfs.namenode.rpc-address.[nameservice ID].[name node ID] - the fully-qualified RPC
address for each NameNode to listen on

For both of the previously-configured NameNode IDs, set the full address and RPC port of the NameNode
process. Note that this results in two separate configuration options. For example:

<property>

<name>dfs.namenode.rpc-address.mycluster.nn1</name>

<value>machine1.example.com:8020</value>

</property>

<property>

<name>dfs.namenode.rpc-address.mycluster.nn2</name>

<value>machine2.example.com:8020</value>

</property>

Note:
If necessary, you can similarly configure the servicerpc-address setting.

Configure dfs.namenode.http-address.[nameservice ID]

dfs.namenode.http-address.[nameservice ID].[name node ID] - the fully-qualified HTTP
address for each NameNode to listen on

Similarly to rpc-address above, set the addresses for both NameNodes' HTTP servers to listen on. For
example:

<property>

<name>dfs.namenode.http-address.mycluster.nn1</name>

<value>machine1.example.com:50070</value>

</property>

<property>

<name>dfs.namenode.http-address.mycluster.nn2</name>

<value>machine2.example.com:50070</value>

</property>

Note:
If you have Hadoop's Kerberos security features enabled, and you intend to use HSFTP, you should also
set the https-address similarly for each NameNode.

Configure dfs.namenode.shared.edits.dir

dfs.namenode.shared.edits.dir - the location of the shared storage directory

Configure the addresses of the JournalNodes which provide the shared edits storage, written to by the
Active NameNode and read by the Standby NameNode to stay up-to-date with all the file system changes
the Active NameNode makes. Though you must specify several JournalNode addresses, you should
only configure one of these URIs. The URI should be in the form:

qjournal://<host1:port1>;<host2:port2>;<host3:port3>/<journalId>

The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to
provide storage for multiple federated namesystems. Though it is not a requirement, it's a good idea to
reuse the nameservice ID for the journal identifier.

For example, if the JournalNodes for this cluster were running on the
machines node1.example.com, node2.example.com, and node3.example.com, and the
nameservice ID were mycluster, you would use the following as the value for this setting (the default
port for the JournalNode is 8485):

<property>

<name>dfs.namenode.shared.edits.dir</name>

<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>

</property>

Configure dfs.journalnode.edits.dir

dfs.journalnode.edits.dir - the path where the JournalNode daemon will store its local state

On each JournalNode machine, configure the absolute path where the edits and other local state
information used by the JournalNodes will be stored; use only a single path per JournalNode. (The other
JournalNodes provide redundancy; you can also configure this directory on a locally-attached RAID-1 or
RAID-10 array.)

For example:

<property>

<name>dfs.journalnode.edits.dir</name>

<value>/data/1/dfs/jn</value>

</property>

Now create the directory (if it doesn't already exist) and make sure its owner is hdfs, for example:

$ sudo mkdir -p /data/1/dfs/jn

$ sudo chown -R hdfs:hdfs /data/1/dfs/jn

Client Failover Configuration


dfs.client.failover.proxy.provider.[nameservice ID] - the Java class that HDFS clients
use to contact the Active NameNode

Configure the name of the Java class which the DFS Client will use to determine which NameNode is the
current Active, and therefore which NameNode is currently serving client requests. The only
implementation which currently ships with Hadoop is the ConfiguredFailoverProxyProvider, so
use this unless you are using a custom one. For example:

<property>

<name>dfs.client.failover.proxy.provider.mycluster</name>

<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>

</property>

Fencing Configuration
dfs.ha.fencing.methods - a list of scripts or Java classes which will be used to fence the Active
NameNode during a failover

It is desirable for correctness of the system that only one NameNode be in the Active state at any given
time.

Configuring the sshfence fencing method

sshfence - SSH to the Active NameNode and kill the process

The sshfence option uses SSH to connect to the target node and uses fuser to kill the process
listening on the service's TCP port. In order for this fencing option to work, it must be able to SSH to the
target node without providing a passphrase. Thus, you must also configure
the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH
private key files.

Important:
The files must be accessible to the user running the NameNode processes (typically the hdfs user on
the NameNode hosts).

For example:

<property>

<name>dfs.ha.fencing.methods</name>

<value>shell(/bin/true)</value>

</property>

For a test cluster, use this value; refer to the Cloudera documentation for a production setup.
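
If you later switch to sshfence on a production cluster, a hedged sketch of the two related properties
might look like the following (the private key path is an assumption; point it at the key the hdfs user
actually uses):

<property>

<name>dfs.ha.fencing.methods</name>

<value>sshfence</value>

</property>

<property>

<name>dfs.ha.fencing.ssh.private-key-files</name>

<value>/var/lib/hadoop-hdfs/.ssh/id_rsa</value>

</property>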

Automatic Failover Configuration


The above sections describe how to configure manual failover. In that mode, the system will not
automatically trigger a failover from the active to the standby NameNode, even if the active node has
failed. This section describes how to configure and deploy automatic failover.

Component Overview

Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and
the ZKFailoverController process (abbreviated as ZKFC).

sudo yum install hadoop-hdfs-zkfc (on both NameNodes)

Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data,
notifying clients of changes in that data, and monitoring clients for failures. The implementation of
automatic HDFS failover relies on ZooKeeper for the following things:

 Failure detection - each of the NameNode machines in the cluster maintains a persistent session in
ZooKeeper. If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode
that a failover should be triggered.
 Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as
active. If the current active NameNode crashes, another node can take a special exclusive lock in
ZooKeeper indicating that it should become the next active NameNode.
The ZKFailoverController (ZKFC) is a new component - a ZooKeeper client which also monitors
and manages the state of the NameNode. Each of the machines which runs a NameNode also runs a
ZKFC, and that ZKFC is responsible for:

 Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check
command. So long as the NameNode responds promptly with a healthy status, the ZKFC considers the
node healthy. If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor
will mark it as unhealthy.
 ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session
open in ZooKeeper. If the local NameNode is active, it also holds a special lock znode. This lock uses
ZooKeeper's support for "ephemeral" nodes; if the session expires, the lock node will be automatically
deleted.
 ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node
currently holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has "won the
election", and is responsible for running a failover to make its local NameNode active. The failover
process is similar to the manual failover described above: first, the previous active is fenced if necessary,
and then the local NameNode transitions to active state.
Deploying ZooKeeper

In a typical deployment, ZooKeeper daemons are configured to run on three or five nodes. Since
ZooKeeper itself has light resource requirements, it is acceptable to collocate the ZooKeeper nodes on
the same hardware as the HDFS NameNode and Standby Node. Operators using MapReduce v2 (MRv2)
often choose to deploy the third ZooKeeper process on the same node as the YARN ResourceManager.
It is advisable to configure the ZooKeeper nodes to store their data on separate disk drives from the
HDFS metadata for best performance and isolation.

See the ZooKeeper documentation for instructions on how to set up a ZooKeeper ensemble. In the
following sections we assume that you have set up a ZooKeeper cluster running on three or more nodes,
and have verified its correct operation by connecting using the ZooKeeper command-line interface (CLI).

Configuring Automatic Failover

Note:

Before you begin configuring automatic failover, you must shut down your cluster. It is not currently
possible to transition from a manual failover setup to an automatic failover setup while the cluster is
running.

Configuring automatic failover requires two additional configuration parameters. In your hdfs-
site.xml file, add:

<property>

<name>dfs.ha.automatic-failover.enabled</name>

<value>true</value>

</property>

This specifies that the cluster should be set up for automatic failover. In your core-site.xml file, add:

<property>

<name>ha.zookeeper.quorum</name>

<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>

</property>

This lists the host-port pairs running the ZooKeeper service.

As with the parameters described earlier in this document, these settings may be configured on a per-
nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster
with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by
setting dfs.ha.automatic-failover.enabled.my-nameservice-id.
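
A minimal sketch of that per-nameservice form, reusing the mycluster nameservice ID from earlier in this
document:

<property>

<name>dfs.ha.automatic-failover.enabled.mycluster</name>

<value>true</value>

</property>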

There are several other configuration parameters which you can set to control the behavior of automatic
failover, but they are not necessary for most installations. See the configuration section of the Hadoop
documentation for details.
Initializing the HA state in ZooKeeper
After you have added the configuration keys, the next step is to initialize the required state in ZooKeeper.
You can do so by running the following command from one of the NameNode hosts.

Note:

The ZooKeeper ensemble must be running when you use this command; otherwise it will not work
properly.

$ hdfs zkfc -formatZK

Deploying HDFS High Availability


After you have set all of the necessary configuration options, you are ready to start the JournalNodes and
the two HA NameNodes.

Important: Before you start:

 If you are setting up a new HDFS cluster, you should first format the NameNode you will use as your
primary NameNode; see Formatting the NameNode.
 Make sure you have performed all the configuration and setup tasks described under Configuring
Hardware for HDFS HA and Configuring Software for HDFS HA, including initializing the HA state in
ZooKeeper if you are deploying automatic failover.

Install and Start the JournalNodes

1. Install the JournalNode daemons on each of the machines where they will run.
To install JournalNode on Red Hat-compatible systems:

$ sudo yum install hadoop-hdfs-journalnode

2. Start the JournalNode daemons on each of the machines where they will run:

sudo service hadoop-hdfs-journalnode start

Wait for the daemons to start before starting the NameNodes.

Initialize the Shared Edits directory


If you are converting a non-HA NameNode to HA, initialize the shared edits directory with the edits data
from the local NameNode edits directories:

hdfs namenode -initializeSharedEdits

Start the NameNodes

1. Start the primary (formatted) NameNode:

$ sudo service hadoop-hdfs-namenode start


2. Start the standby NameNode:

$ sudo -u hdfs hdfs namenode -bootstrapStandby

$ sudo service hadoop-hdfs-namenode start

Note:

If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail
with a security error. Instead, use the following commands: $ kinit <user> (if you are using a
password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each
command executed by this user, $ <command>

Starting the standby NameNode with the -bootstrapStandby option copies over the contents of the
primary NameNode's metadata directories (including the namespace information and most recent
checkpoint) to the standby NameNode. (The location of the directories containing the NameNode
metadata is configured via the configuration
options dfs.namenode.name.dir and/or dfs.namenode.edits.dir.)

You can visit each NameNode's web page by browsing to its configured HTTP address. Notice that next
to the configured address is the HA state of the NameNode (either "Standby" or "Active".) Whenever an
HA NameNode starts and automatic failover is not enabled, it is initially in the Standby state. If automatic
failover is enabled the first NameNode that is started will become active.

Restart Services
If you are converting from a non-HA to an HA configuration, you need to restart the JobTracker and
TaskTracker (for MRv1, if used), or ResourceManager, NodeManager, and JobHistory Server (for YARN),
and the DataNodes:

On each DataNode:

$ sudo service hadoop-hdfs-datanode start

On the ResourceManager system (YARN):

$ sudo service hadoop-yarn-resourcemanager start

On each NodeManager system (YARN; typically the same ones where DataNode service runs):

$ sudo service hadoop-yarn-nodemanager start

On the MapReduce JobHistory Server system (YARN):

$ sudo service hadoop-mapreduce-historyserver start

Deploy Automatic Failover


If you have configured automatic failover using the ZooKeeper FailoverController (ZKFC), you must install
and start the zkfc daemon on each of the machines that runs a NameNode. Proceed as follows.

To install ZKFC on Red Hat-compatible systems:


$ sudo yum install hadoop-hdfs-zkfc

To start the zkfc daemon:

$ sudo service hadoop-hdfs-zkfc start

It is not important that you start the ZKFC and NameNode daemons in a particular order. On any given
node you can start the ZKFC before or after its corresponding NameNode.

You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains
running. In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should
be restarted to ensure that the system is ready for automatic failover.

Additionally, you should monitor each of the servers in the ZooKeeper quorum. If ZooKeeper crashes,
then automatic failover will not function. If the ZooKeeper cluster crashes, no automatic failovers will be
triggered. However, HDFS will continue to run without any impact. When ZooKeeper is restarted, HDFS
will reconnect with no issues.
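
A simple way to spot-check that the ZKFC daemon is still running on a NameNode host is the init script's
status action, for example:

$ sudo service hadoop-hdfs-zkfc status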

Verifying Automatic Failover


After the initial deployment of a cluster with automatic failover enabled, you should test its operation. To
do so, first locate the active NameNode. As mentioned above, you can tell which node is active by visiting
the NameNode web interfaces.
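
Alternatively, a quick command-line check (nn1 and nn2 are the NameNode IDs configured earlier; run it
as the hdfs user):

$ sudo -u hdfs hdfs haadmin -getServiceState nn1

$ sudo -u hdfs hdfs haadmin -getServiceState nn2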

Once you have located your active NameNode, you can cause a failure on that node. For example, you
can use kill -9 <pid of NN> to simulate a JVM crash. Or you can power-cycle the machine or its
network interface to simulate different kinds of outages. After you trigger the outage you want to test, the
other NameNode should automatically become active within several seconds. The amount of time
required to detect a failure and trigger a failover depends on the configuration
of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.

If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as
well as the NameNode daemons in order to further diagnose the issue.

Deploying MapReduce v2 (YARN) on a Cluster


This section describes configuration tasks for YARN clusters only, and is specifically tailored for
administrators who have installed YARN from packages.

Important:

Do the following tasks after you have configured and deployed HDFS:

Note: Running Services

When starting, stopping and restarting CDH components, always use the service(8) command rather
than running scripts in /etc/init.d directly. This is important because service sets the current
working directory to / and removes most environment variables (passing only LANG and TERM) so as to
create a predictable environment in which to administer the service. If you run the scripts
in /etc/init.d, any environment variables you have set remain in force, and could produce
unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux
Standard Base (LSB).)
Important:

Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is
not supported; it will degrade performance and may result in an unstable cluster deployment.

 If you have installed YARN from packages, follow the instructions below to deploy it. (To deploy MRv1
instead, see Deploying MapReduce v1 (MRv1) on a Cluster.)
 If you have installed CDH 5 from tarballs, the default deployment is YARN. Keep in mind that the
instructions on this page are tailored for a deployment following installation from packages.

Step 1: Configure Properties for YARN Clusters


Note:

Edit these files in the custom directory you created when you copied the Hadoop configuration. When
you have finished, you will push this configuration to all the nodes in the cluster; see Step 5.

mapred-site.xml:

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>
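
yarn-site.xml: in the same custom configuration directory, the ResourceManager location typically also
needs to be set. A minimal sketch, assuming the CDH5RM host from the cluster description above:

<property>

<name>yarn.resourcemanager.hostname</name>

<value>ip-172-31-36-158.ec2.internal</value>

</property>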

References

http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/CDH5-Installation-Guide.html

http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig_cdh5_install.html#topic_4_4_1_unique_1

http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig_hdfs_cluster_deploy.html

http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-High-Availability-Guide/cdh5hag_hdfs_ha_software_config.html

http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-High-Availability-Guide/cdh5hag_hdfs_ha_deploy.html

http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig_yarn_cluster_deploy.html

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
