Clone this project to create a 4 node Apache Hadoop cluster with the Cascading SDK pre-installed.
The Cascading 2.2 SDK includes Cascading and many of its sub-projects:
- Lingual - ANSI SQL Command Shell and JDBC Driver
- Pattern - Machine Learning
- Cascalog - Clojure DSL over Cascding
- Scalding - Scala DSL over Cascading
- Multitool - Command line tool for managing large files
- Load - Command line tool for load testing Hadoop
To make getting started as easy as possible does this setup include build tools used by parts of the SDK:
- gradle - build tool used by Cascading and its related projects
- leiningen 2 - a popular build tool in the clojure community, which is used in the cascalog tutorial included in the SDK
- sbt - a popular build tool in the scala community, which is used in the scalding tutorial included in the SDK
This work is based on: http://cscarioni.blogspot.co.uk/2012/09/setting-up-hadoop-virtual-cluster-with.html
First install both Virtual Box and Vagrant for your platform.
Then simply clone this repository, change into the new cloned directory, and run the following:
$ vagrant box add cascading-hadoop-base http://files.vagrantup.com/precise64.box
$ vagrant up
This will set up 4 machines - master, hadoop1, hadoop2 and hadoop3. Each
of them will have two CPUs and .5GB of RAM. If this is too much for your machine,
adjust the Vagrantfile.
The machines will be deployed using Puppet. All of them will have hadoop (apache-hadoop-1.2.1) installed, ssh will be configured and local name resolution also works.
Hadoop is installed in /opt/hadoop-1.2.1 and all tools are in the PATH.
The master machine acts as the namenode and jobtracker, the 3 others are data
nodes and task trackers.
This cluster uses the ssh-into-all-the-boxes-and-start-things-up-approach,
which is fine for testing. Also for simplicity, everything is running as root
(patches welcome).
Once all machines are up and provisioned, the cluster can be started. Log into the master, format hdfs and start the cluster.
 $ vagrant ssh master
 $ (master) sudo hadoop namenode -format -force
 $ (master) sudo start-all.sh
After a little while, all daemons will be running and you have a fully working hadoop cluster.
If you want to shut down your cluster, but want to keep it around for later use, shut down all the services and tell vagrant to stop the machines like this:
 $ vagrant ssh master
 $ (master) sudo stop-all.sh
 $ exit or Ctrl-D
 $ vagrant halt
When you want to use your cluster again, simply do this:
 $ vagrant up
 $ vagrant ssh master
 $ (master) sudo start-all.sh
If you don't need the cluster anymore and want to get your disk-space back do this:
 $ vagrant destroy
This will only delete the VMs all local files in the directory stay untouched and can be used again, if you decide to start up a new cluster.
The namenode webinterface is available under http://master.local:50070/dfshealth.jsp and the jobtracker is available under http://master.local:50030/jobtracker.jsp
The cluster uses zeroconf
(a.k.a. bonjour) for name resolution. This means, that
you never have to remember any IP nor will you have to fiddle with your
/etc/hosts file.
Name resolution works from the host to all VMs and between all VMs as well.  If
you are using linux, make sure you have avahi-daemon installed and it is
running. On a Mac everything should just work (TM) witouth doing anything.
(Windows testers and patches welcome).
The network used is 192.168.7.0/24. If that causes any problems, change the
Vagrantfile and modules/avahi/file/hosts files to something that works for
you. Since everything else is name based, no other change is required.
To interact with the cluster on the command line, log into the master and use the hadoop command.
$ vagrant ssh master
$ (master) hadoop fs -ls /
$ ...
You can access the host file system from the /vagrant directory, which means that
you can drop your hadoop job in there and run it on your own fully distributed hadoop
cluster.
Since this is a fully virtualized environment running on your computer, it will not be super-fast. This is not the goal of this setup. The goal is to have a fully distributed cluster for testing and troubleshooting.
To not overload the host machine, has each tasktracker a hard limit of 1 map task and 1 reduce task at a time.
Puppet will download the Cascading SDK 2.2-wip and put all SDK
tools in the PATH. The SDK itself can be found in /opt/CascadingSDK.
The namenode stores the fsimage in /srv/hadoop/namenode. The datanodes  are
storing all data in /srv/hadoop/datanode.
If you change any of the puppet modules, you can simply apply the changes with vagrants built-in provisioner.
$ vagrant provision
In order to save bandwidth and time we try to download hadoop only once and
store it in the /vagrant directory, so that the other vms can reuse it. If the
download fails for some reason, delete the tarball and rerun vagrant provision.
We are also downloading a file containing checksums for the tarball. They are
verified, before the cluster is started. If something went wrong during the
download, you will see the verify_tarball part of puppet fail. If that is the
case, delete the tarball and the checksum file (<tarball>.mds) and rerun
vagrant provision.
- have it working on windows
- run as other user than root
- have a way to configure the names/ips in only one file