Credit goes to jugad2.blogspot.com

Showing posts with label Disco. Show all posts

Monday, August 13, 2012

Inferno on Disco, Python MapReduce library / daemon for structured text

By Vasudev Ram


Inferno is an open-source Python MapReduce library. It has (from the site):

[ A query language for large amounts of structured text (CSV, JSON, etc).

A continuous and scheduled MapReduce daemon with an HTTP interface that automatically launches MapReduce jobs to handle a constant stream of incoming data. ]

Overview of Inferno.

This overview page has a nice step-by-step example: starting with a small set of test data, it shows how to query for a certain result in SQL and then in AWK (both are easy one-liners), and then goes on to show how to achieve the same result using Inferno.

The interesting point is that the Inferno version is also small: a "rule" of ~10 lines (presumably stored in a config file) plus a one-line command. The difference from the SQL and AWK examples is that it runs a Disco MapReduce job that distributes the work across the nodes of a cluster. Almost nothing in the Inferno code indicates that this is a distributed MapReduce job.
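To make the map and reduce steps behind such a query concrete, here is a minimal single-process sketch in plain Python. It does not use the Inferno or Disco APIs. The records, the field names and the "count by last name" query are all hypothetical, chosen only for illustration, and are not taken from the Inferno overview page.

```python
import json
from collections import defaultdict

# Hypothetical test data: one JSON record per line (an assumption for this sketch).
LINES = [
    '{"first": "Ann", "last": "Lee"}',
    '{"first": "Bob", "last": "Lee"}',
    '{"first": "Cy", "last": "Ray"}',
]


def map_phase(line):
    """Emit one (key, 1) pair per record, keyed on the 'last' field."""
    record = json.loads(line)
    yield record["last"], 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence, in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


result = run_job(LINES)
print(result)  # {'Lee': 2, 'Ray': 1}
```

On a real cluster, the map calls would run on different nodes and the framework would handle the grouping and data movement. This sketch only shows the shape of the computation.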

Inferno uses Disco.

Disco is "a distributed computing framework based on the MapReduce paradigm. Disco is open-source; developed by Nokia Research Center to solve real problems in handling massive amounts of data."

Some users of Disco: Chango, Nokia, Zemanta. Chango staff seem to be the developers of Inferno.

- Vasudev Ram - Dancing Bison Enterprises