jboulon edited this page Mar 3, 2013 · 19 revisions

Honu

Honu is a large-scale streaming data collection and processing pipeline built in the cloud, for the cloud.

After months of inactivity for several reasons, I’m glad to say that the code base will soon be refreshed and a fully functional data collection and processing pipeline will be available.
More coming soon!!


What’s Honu

Honu, also known as Chukwa-Streaming, is an agentless solution compatible with the Apache Chukwa project.
Like Chukwa, the goal is to collect a large volume of structured and unstructured logs, process them, and gain business insights.

Mailing-list: http://groups.google.com/group/honu-dev

Why this project on GitHub

Chukwa requires:

  • An agent installed on every single machine
  • Read access to a file in order to send it over to the collector
  • Java (it is a Java-only solution)

For at least those reasons, I had to rewrite a large portion of Chukwa to meet my needs, but because of time constraints I couldn’t fix those issues and submit patches back to Chukwa at the same time.
Honu has been in production and stable for 3 months now here at Netflix, collecting over 135 million log events every day.
So it’s time to open it up!

Future of Honu

The goal is to make the production version of Honu running at Netflix available here on GitHub, so others can take advantage of it and contribute back to Honu.

The main advantages of Honu over the standard Chukwa project are:

  • Proven scalability: Honu currently processes over 70 billion events/day at Netflix
  • Agentless solution (chunks are sent directly to one or more remote collectors)
  • Multi-language support for the collector (internally, Honu uses Thrift for transport and encoding)
    • Java is fully implemented, with batching, queuing, and so on
    • All other Thrift-supported languages can easily be added
  • New Demux (MapReduce jobs that automatically process all the data)
    • Dynamic and multiple output formats for the Mapper and/or Reducer
    • Hive output format and Hive schemas are natively supported
  • Structured log API
    • Key/value log helper
    • Dynamic Hive table creation
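To give a feel for the key/value log helper mentioned above, here is a minimal sketch of what structured key/value logging can look like. The class and method names are hypothetical (the actual Honu client API may differ), and the real client batches events and ships them to a remote collector over Thrift rather than returning a string:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a key/value log helper in the spirit of
// Honu's structured log API. Illustration only; not the real client.
public class KeyValueLogSketch {

    // Encode each field as key=value, tab-separated, so a downstream
    // Demux job (or a Hive table) can split the columns deterministically.
    static String format(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('\t');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, so columns stay stable.
        Map<String, String> event = new LinkedHashMap<>();
        event.put("app", "api");
        event.put("latencyMs", "42");
        System.out.println(format(event)); // prints app=api	latencyMs=42
    }
}
```

A fixed, ordered key/value encoding like this is what makes the dynamic Hive table creation above possible: the keys become column names and the values become the row.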

Roadmap

  • Milestone 3 goals (Done)
    • HBase integration for near real time data access (Done)
    • Generic forwarder
    • (06/01/2012) Honu is running in production collecting over 70 billion events/day
  • Milestone 2 goals (Done)
    • Multiple writers on the collector side. This is required so that not everything has to be processed at the same time, enabling SLA-driven processing.
    • 600 million log events/day
    • Open source the agent-less solution + collector
    • (06/28/2010) Honu is running in production collecting over 1 billion events/day
    • (02/03/2011) Honu is running in production collecting over 12 billion events/day
  • Milestone 1 goals (Done)
    • Agent-less streaming solution
    • Compress output
    • 50 million log events/day on 4 EC2 small instances
    • Usage:
      • (03/01/2010) In production, collecting over 95 million log events a day.
      • CPU usage is between 0.8% and 2%.
      • Compression is done using LZO.

Contacts

Jerome Boulon – (jboulon at apache.org)
Mailing-list: http://groups.google.com/group/honu-dev
