Being interested in distributed programming and Hadoop, I decided to set up my own VM cluster using the latest version (2.7.2) on my Windows desktop. It's actually not so difficult, so I will outline all the tasks here. Even though the desktop is Windows 10, the VMs run (Arch) Linux, so it helps if you are/have been a Linux user. The actual Linux distribution you use is not that important, but I chose Arch since it is a small install and that makes for a small VM size.
At any point in this guide you can attempt to blindly follow the steps, but it's best if you read the relevant material along the way. It may take you a bit longer to get set up, but you will be more confident if you have some understanding of the intermediate steps.
Control Panel > Programs > Programs and Features > Turn Windows features on or off > Hyper-V
If for some reason you are unable to check the Hyper-V box, you may need to enable virtualization from the BIOS or do some other troubleshooting.
The majority of Linux and Hadoop installations can be done once and cloned. This is nice for later when you want to show how easy it is to add new DataNodes to your VM cluster.
- Start downloading a Linux ISO. Again, the specific distro doesn't matter; you simply want to be able to install:
- wget
- Java
- SSH
I prefer Arch Linux simply because the base install is relatively small (approx. 4 GB per VM). However, Arch does not have a graphical installer, and if you are not used to command-line installations it may seem daunting. If you've never done it before, you should consider it a rite of passage and give it a shot.
- Open Hyper-V
- In the menu on the right hand side:
- Virtual Switch Manager
- Select "External"
- Create Virtual Switch
- Pick a name
- External Network > Select your ethernet card
- Ok/Apply
- Open Hyper-V
- In the menu on the right hand side
- New > Virtual Machine
- Choose a name, e.g. `ArchHadoop-Template`
- Choose a different storage location if desired
- Generation 1
- Memory ... something small
- Connection ... the virtual switch created above
- Create a virtual hard disk, change size if desired
- Install an operating system from a bootable CD/DVD-ROM > Image File > the ISO downloaded above
- Finish
- Open Hyper-V
- Select your VM from the list
- Start/Connect ... you should boot into the installer
- Do the installation as normal
- Create a user that has sudo rights.
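For reference, here is a minimal sketch of that last step on Arch. The user name `hadoop` is only an assumption for this guide; any name works, as long as you use the same one on every VM.

```bash
# Run as root inside the installed system (e.g. from arch-chroot or after the
# first boot). The user name "hadoop" is an assumption, not a requirement.
useradd -m -G wheel hadoop
passwd hadoop
# Install sudo and allow members of the wheel group to use it by uncommenting
# the "%wheel ALL=(ALL) ALL" line in the sudoers file.
pacman -S sudo
EDITOR=nano visudo
```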
Arch Linux
- See the Beginner's Guide
- See the Hyper-V Guide (this also contains steps for creating a Hyper-V VM)
- Special Notes
  - The Hyper-V Guide tells you how you need to format your VM's hard drive. The `cgdisk` tool is easiest to use. This is needed near the beginning of the installation.
  - Near the middle/end you will install a bootloader. The Hyper-V Guide also explains how you should do this, e.g. use `grub`.
  - You do not need to set a `hostname` in `/etc/hostname`.
- Install Java (Arch: `sudo pacman -S jdk8-openjdk`)
- Install `wget` (Arch: `sudo pacman -S wget`)
- Install ssh (Arch: `sudo pacman -S openssh`, then `sudo systemctl enable sshd.socket`)
- Setup passwordless login for ssh (a sketch follows this list)
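Here is a minimal sketch of the passwordless ssh step, assuming you do it as the same user (here `hadoop`) while still on the template. Because the key pair and `authorized_keys` file get cloned along with the VM, every clone will accept the same key, so `namenode` can later ssh to `resourcemanager`, `datanode1` and itself without a password, which Hadoop's start scripts rely on.

```bash
mkdir -p ~/.ssh && chmod 700 ~/.ssh
# Generate a passphrase-less key pair and authorize it locally.
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# After cloning and setting hostnames, verify from namenode with e.g.:
#   ssh datanode1 hostname
```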
Now we are getting to the good stuff. If you made it to this point but you have no idea what Hadoop is, you should probably read some of the documentation. Some good pages to read are:
- HDFS Architecture
- HDFS User Guide
- Yarn Architecture
- MapReduce Tutorial (this one can wait a little bit)
Here is the general guide for Hadoop's cluster setup that we will apply to our VMs.
The major players in a Hadoop cluster are:
- NameNode: the HDFS master node that manages all HDFS metadata
- DataNode(s): HDFS slaves that store and serve up (chunks of) files
- ResourceManager: the Yarn master node responsible for arbitrating cluster resources
- NodeManager: the Yarn slave monitoring a node's resources and reporting to the ResourceManager

I'm not sure whether you can run all of these processes on just one or two nodes (VMs), but in this tutorial we are going to create three:
- `namenode`: will be the NameNode and a DataNode.
- `resourcemanager`: will be the ResourceManager and a DataNode.
- `datanode1`: will be a dedicated DataNode.

The configuration files that reside on each VM are identical, so we only need to install Hadoop and the configurations on the template VM, and then make sure to set up the IPs/hostnames so that the VMs can communicate with one another. Setting up VM hostnames and IPs will be done after the Hadoop installation.
- In your home folder:
  - `wget http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz`
  - `tar -xf hadoop-2.7.2.tar.gz`
  - `mv hadoop-2.7.2 hadoop`
  - `mkdir -p hadoop-hdfs/{name,data}`
  - `cd hadoop`
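As an optional sanity check, the unpacked distribution should be able to report its version once it can find Java. The JDK path below is an assumed Arch location for `jdk8-openjdk`; verify it on your VM first.

```bash
# Optional: confirm the tarball unpacked correctly and Java is visible.
# Check the real path with `archlinux-java status` or `ls /usr/lib/jvm`.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
~/hadoop/bin/hadoop version
```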
This repo contains the configuration files for the cluster I set up; on your VM they live in `~/hadoop/etc/hadoop`. To keep things simple, I am going to assume that you are going to set up the same three-VM cluster with the same hostnames. Feel free to poke around the configs and change names if you like, but if you are following this guide, odds are you are also new to Hadoop and it may benefit you to have the VMs suggestively named.
Important Files in etc/hadoop:
These are the suggestively named, core configuration files. I currently don't understand everything that can be done through them, but what you see in this repo are the minimal configurations to get a cluster up and running.
If you want to start with this repo rather than copy/paste, do:
- `cd etc`
- `rm -rf hadoop`
- `git clone https://github.com/jayflo/hadoop-config.git hadoop`
Now...
- `core-site.xml`: (no changes required) sets the location of the HDFS filesystem. Note that its value `hdfs://namenode:9000` targets our `namenode` VM as the NameNode.
- `hdfs-site.xml`: change the file paths to point to your home directory. These file paths tell the NameNode and DataNodes where to store data.
- `mapred-site.xml`: (no changes required) tells Yarn to use mapreduce.
- `yarn-site.xml`: (no changes required) specifies the hostname of the ResourceManager.
- `hadoop-env.sh`, `mapred-env.sh`, `yarn-env.sh`: change the `JAVA_HOME` path to point to your installation of Java. These files set up the environment for Hadoop tasks (see the sketch after this list).
- `slaves`: (no changes required) a list of hostnames to use as DataNodes. This is not used by the Java code, but utilized by helper scripts to spin up processes throughout the cluster.
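For example, here is a sketch of the `JAVA_HOME` change. The JDK path is an assumed location for Arch's `jdk8-openjdk` package, so verify it first (e.g. with `archlinux-java status` or `ls /usr/lib/jvm`).

```bash
cd ~/hadoop/etc/hadoop
# Prepend an explicit JAVA_HOME to each env script so every Hadoop process can
# find the JDK regardless of the login shell's environment. Path is assumed.
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
  sed -i '1i export JAVA_HOME=/usr/lib/jvm/java-8-openjdk' "$f"
done
# For hdfs-site.xml, open it in an editor and point the storage properties
# (the standard dfs.namenode.name.dir / dfs.datanode.data.dir keys) at the
# ~/hadoop-hdfs/name and ~/hadoop-hdfs/data folders created earlier.
```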
After the VMs are cloned, each one will be given a hostname (namenode, resourcemanager and datanode1) and a reserved IP address on your local network. The association IP address --> hostname for machines on your network is stored in the OS's /etc/hosts file (on Windows, C:\Windows\System32\drivers\etc\hosts) and will be the same for each VM in the cluster. If you already know which three addresses you will assign these VMs, you can edit the /etc/hosts file as follows:
::1 localhost
127.0.0.1 localhost
<ip_1> namenode
<ip_2> resourcemanager
<ip_3> datanode1
However, if you are not yet sure what the IP addresses should be, you can wait to edit this file till later. You will simply need to ssh to each machine to change it.
That's it! Our Linux+Hadoop template is complete. Log out and shut down your VM.
Export
- Open Hyper-V
- Right Click > Export
- Pick a location
Import x 3
- Open Hyper-V
- In the right hand menu > Import Virtual Machine
- Browse to the folder chosen in Export above.
- Next, Next
- Copy the virtual machine (create a new unique ID)
- Store the machine in a different location: enter the same path for all three text boxes
- When choosing where to store the virtual hard drive, enter the path from the previous step + `\Virtual Hard Disks\`.
- Finish
- When it's complete, rename the new VM to `namenode`.
Repeat the import steps two more times, but name the next VMs `resourcemanager` and `datanode1`.
Hostnames
- Start up and log in to all three VMs: `namenode`, `resourcemanager` and `datanode1`.
- In each VM, write its respective name (`namenode`, `resourcemanager` or `datanode1`) into `/etc/hostname`. The file should contain only this name (a sketch follows this list).
- Restart the VM.
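A quick sketch of the hostname step on each VM; `hostnamectl` is the systemd helper available on Arch, and writing the file directly works just as well.

```bash
# Run on the VM that should become "namenode"; repeat with "resourcemanager"
# and "datanode1" on the other two VMs.
sudo hostnamectl set-hostname namenode
# Equivalent manual edit:
echo namenode | sudo tee /etc/hostname
```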
Each VM will need its own IP address, which is most easily handled with DHCP reserved IPs. In general this is easy to do through your router's configuration UI. You will likely need a wired connection to reach the router's IP.
- Use a web browser to navigate to your router's configuration UI. If you don't know the IP:
  - Open a command prompt (Windows Key + "cmd" + enter)
  - Run `ipconfig`
  - It's the IP associated with "Default Gateway".
- Usually there are two ways IPs are reserved:
- There is a menu where you can specify a hostname/mac-address and the IP you want to reserve for that hostname/mac-address.
- There is a menu listing currently connected devices by hostname/mac-address and you can "edit" the connection, providing an IP.

The IP address you choose doesn't really matter as long as it's valid. If you like to keep things "in order," your cluster IPs should start after all other IPs. This way, if you want to add more machines in the future, the next spot won't be taken. Otherwise, you can simply start right after the router's IP.
- If you didn't already edit the `/etc/hosts` file, now you must. You can visit each VM or `ssh` freely between them to make the edits. The form of the `/etc/hosts` file is shown above.
- Shut down all the VMs.
- Edit the `C:\Windows\System32\drivers\etc\hosts` file of the host (your Windows desktop) to include the IP --> VM hostname mappings. You don't need to include the `localhost` lines, but it wouldn't hurt anything if you did.
- Open another cmd prompt on the host (your Windows desktop) and do:
  - `ipconfig /release`
  - `ipconfig /all`
  - `ipconfig /flushdns`
  - `ipconfig /renew`
- Restart your computer.
Time to reap the benefits. Everything should now be in place; we simply need to spin up the cluster and start all the HDFS/Yarn processes.
- Open Hyper-V
- Connect/Start the cluster VMs `namenode`, `resourcemanager` and `datanode1`
- On `namenode` run:
  - `cd ~/hadoop`
  - `bin/hdfs namenode -format`
  - `sbin/start-dfs.sh` (it should start 5 things)
- On `resourcemanager` run (again from `~/hadoop`):
  - `sbin/start-yarn.sh`
If everything is working correctly, all of the following should be true:
- The `jps` command (`man jps` for more info) should have the following output:
  - `namenode` --> `NameNode`, `SecondaryNameNode`, `DataNode`, `NodeManager`
  - `resourcemanager` --> `ResourceManager`, `DataNode`, `NodeManager`
  - `datanode1` --> `DataNode`, `NodeManager`
- You should be able to visit the NameNode Web UI from your host's browser at URL `namenode:50070`
- You should be able to visit the ResourceManager Web UI from your host's browser at URL `resourcemanager:8088`
- `cd ~/hadoop && grep -iR error logs` should return nothing on each VM.
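As an optional end-to-end smoke test (a sketch using the examples jar that ships with the 2.7.2 release), you can submit the bundled pi estimator from `namenode` and watch it show up in the ResourceManager Web UI:

```bash
cd ~/hadoop
# Submits a tiny MapReduce job (2 maps, 10 samples each) through Yarn,
# exercising HDFS and the ResourceManager end to end.
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 2 10
```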
If everything isn't working correctly, then you are in for some more work. Be certain to look at the logs in `~/hadoop/logs` to see if you can find specific errors which you can Google. From my own searching on the web, more often than not the problem will be a network connectivity issue requiring firewall modifications (i.e. the nodes are unable to connect to one another). For these you are on your own... feel free to open an issue, though I promise nothing!
At this point you should also be confident in all the steps you would need to follow in order to add another (data)node, datanode2, to the cluster! (should you choose to be so bold)
I made this a repo so that anyone can help make changes and keep it up to date/provide more explanation where needed. If you'd like to add some info, submit a PR and we can include it. Aside from adding more detail to this tutorial, it would be nice to populate the "core" xml configuration files with more options+descriptions (commented out so they are not "on" by default). This will provide a quick way for other Hadoop hopefuls to learn about common cluster settings. See these pages for comprehensive lists of configurations and their default values:
And of course, read as much as you can about Hadoop from the general/API documentation.
Next I will work on setting up IntelliJ for Hadoop development so that one can easily test their mapreduce jobs on the cluster created in this tutorial. I will create another repo containing the project code as well as a comprehensive README for setting up the project and executing jobs. When it is finished, I'll link it here.
Good luck!