Posts

Showing posts with the label pig

Extending Hadoop Pig for Hierarchical Data

I've been playing with Hadoop Pig lately, and having a fun time.  Pig is an easy to use language for writing Map Reduce jobs against Hadoop. Our data is very hierarchical, and we calculate a lot of aggregates for self nodes, their children nodes, and self plus children.  We have a few tricks up our sleeves for SQL for handling these types of aggregates, but of course with Map Reduce an entirely new way of thinking is required. Luckily, Pig allows for easily created User Defined Functions (UDFs) that extend the Pig language.  I was able to take an existing Pig UDF, TOKENIZE, and alter it to suite my needs. Specifically, our data looks like this: 111,/A/B/C 222,/A/B 333,/A/B/C We need to answer questions such as "How many records for A and all of its children?" In this case, the answer is three. We also need to answer "How many records for just A?" which is zero, or "for just C?" which is two. Our strategy is to take the path (eg /A/B/C )...