Michigan Hadoop (madoop) is a light weight MapReduce framework for education. Madoop implements the Hadoop Streaming interface. Madoop is implemented in Python and runs on a single machine.
For an in-depth explanation of how to write MapReduce programs in Python for Hadoop Streaming, see our Hadoop Streaming tutorial.
Install Madoop.
$ pip install madoopCreate example MapReduce program with input files.
$ madoop --example
$ tree example
example
├── input
│ ├── input01.txt
│ └── input02.txt
├── map.py
└── reduce.pyRun example word count MapReduce program.
$ madoop \
-input example/input \
-output example/output \
-mapper example/map.py \
-reducer example/reduce.pyConcatenate and print the output.
$ cat example/output/part-*
Goodbye 1
Bye 1
Hadoop 2
World 2
Hello 2Madoop implements a subset of the Hadoop Streaming interface. You can simulate the Hadoop Streaming interface at the command line with cat and sort.
Here's how to run our example MapReduce program on Apache Hadoop.
$ hadoop \
jar path/to/hadoop-streaming-X.Y.Z.jar
-input example/input \
-output output \
-mapper example/map.py \
-reducer example/reduce.py
$ cat output/part-*Here's how to run our example MapReduce program at the command line using cat and sort.
$ cat input/* | ./map.py | sort | ./reduce.py| Madoop | Hadoop | cat/sort |
|---|---|---|
| Implement some Hadoop options | All Hadoop options | No Hadoop options |
| Multiple mappers and reducers | Multiple mappers and reducers | One mapper, one reducer |
| Single machine | Many machines | Single Machine |
jar hadoop-streaming-X.Y.Z.jar argument ignored |
jar hadoop-streaming-X.Y.Z.jar argument required |
No arguments |
| Lines within a group are sorted | Lines within a group are sorted | Lines within a group are sorted |
Contributions from the community are welcome! Check out the guide for contributing.
Michigan Hadoop is written by Andrew DeOrio [email protected].