jboulon edited this page Mar 3, 2013 · 19 revisions

Honu

Honu is a large-scale streaming data collection and processing pipeline built in the cloud, for the cloud.

After months of inactivity for several reasons, I’m glad to say that the code base will soon be refreshed and a fully functional data collection and processing pipeline will be available.
More coming soon!!


What’s Honu

Honu, also known as Chukwa-Streaming, is an agentless solution compatible with the Apache Chukwa project.
Like Chukwa, the goal is to collect a large volume of structured and unstructured logs, process them, and gain business insights.

Mailing-list: http://groups.google.com/group/honu-dev

Why this project on GitHub

Chukwa requires:

  • An agent installed on every single machine
  • Read access to a file in order to send it over to the collector
  • Java (it is a Java-only solution)

For at least those reasons, I had to rewrite a large portion of Chukwa to meet my needs, but because of time constraints I couldn’t fix those issues and submit patches back to Chukwa at the same time.
Honu has been in production and stable for 3 months now here at Netflix, collecting over 135 million log events every day.
So it’s time to open it up!

Future of Honu

The goal is to make the production version of Honu running at Netflix available here on GitHub, so others can take advantage of it and contribute back to Honu.

The main advantages of Honu over the standard Chukwa project are:

  • Proven scalability: Honu currently processes over 70 billion events/day at Netflix
  • Agentless solution (chunks are sent directly to one or more remote collectors)
  • Multi-language support for the collector (internally, Honu uses Thrift for transport and encoding)
    • Java is fully implemented, with batching, queuing, and so on
    • All other Thrift-supported languages can easily be added
  • New Demux (MapReduce jobs that automatically process all the data)
    • Dynamic and multiple output formats for the Mapper and/or Reducer
    • Hive output format and Hive schemas are natively supported
  • Structured log API
    • Key/value log helper
    • Dynamic Hive table creation
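To give a feel for the key/value log helper mentioned above, here is a minimal sketch of what structured key/value logging can look like. The class and method names are hypothetical (the actual Honu client API may differ), and the real client batches events and ships them to a remote collector over Thrift rather than returning a string:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a key/value log helper in the spirit of
// Honu's structured log API. Illustration only; not the real client.
public class KeyValueLogSketch {

    // Encode each field as key=value, tab-separated, so a downstream
    // Demux job (or a Hive table) can split the columns deterministically.
    static String format(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) sb.append('\t');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, so columns stay stable.
        Map<String, String> event = new LinkedHashMap<>();
        event.put("app", "api");
        event.put("latencyMs", "42");
        System.out.println(format(event)); // prints app=api	latencyMs=42
    }
}
```

A fixed, ordered key/value encoding like this is what makes the dynamic Hive table creation above possible: the keys become column names and the values become the row.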

Roadmap

  • Milestone 3 goals (Done)
    • HBase integration for near real time data access (Done)
    • Generic forwarder
    • (06/01/2012) Honu is running in production collecting over 70 billion events/day
  • Milestone 2 goals (Done)
    • Multiple writers on the collector side. This is required so that not everything has to be processed at the same time, enabling SLA-driven processing.
    • 600 million log events/day
    • Open source the agent-less solution + collector
    • (06/28/2010) Honu is running in production collecting over 1 billion events/day
    • (02/03/2011) Honu is running in production collecting over 12 billion events/day
  • Milestone 1 goals (Done)
    • Agent-less streaming solution
    • Compress output
    • 50 million log events/day on 4 EC2 small instances
    • Usage:
      • (03/01/2010) In production, collecting over 95 million log events a day.
      • CPU usage is between 0.8% and 2%.
      • Compression is done using LZO.

Contacts

Jerome Boulon – (jboulon at apache.org)
Mailing-list: http://groups.google.com/group/honu-dev
