Apache Flume
Introduction to Apache Flume
• Definition: Distributed, reliable, and available system for collecting,
aggregating, and moving large amounts of log data.
• Not limited to logs—supports network traffic, social media data, emails, etc.
• A top-level project at the Apache Software Foundation.
Why was Flume Created?
• Traditional methods of log collection were inefficient for large-scale,
distributed systems.
• Needed a fault-tolerant and scalable solution for handling real-time log
data ingestion.
• Flume was developed to simplify and automate data flow from multiple
sources to storage systems.
System Requirements
• Java Runtime Environment - Java 1.8 or later
• Directory Permissions - Read/Write permissions for the directories used by the agent
Minimum Hardware Requirements:
• RAM: 4GB
• Disk Space: 10GB (for logs and temporary storage)
• Network: 1 Gbps Ethernet
Recommended Hardware for Large-Scale Deployments:
• RAM: 16GB or more
• Disk Space: 100GB+ (especially if logs are stored locally before forwarding)
• Network: 10 Gbps Ethernet for high-throughput environments
Flume Event and Agent
Flume Event
• An event is the basic unit of data transported inside Flume.
• It contains a byte-array payload that is transported from the source to the destination, accompanied by optional headers.
Flume Agent
• An agent is an independent daemon process (JVM) in Flume.
• It receives the data (events) from clients or other agents and
forwards it to its next destination (sink or agent).
• A Flume deployment may have more than one agent.
Flume Source
• A Source collects incoming data and converts it into Flume Events.
Flume Channel
• A Channel acts as a buffer between the Source and the Sink.
Flume Sink
• A Sink takes events from a Channel and delivers them to a final destination.
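The three components above are wired together in an agent's .properties configuration file. The following sketch follows the well-known quick-start example from the Flume documentation: an agent named a1 with a NetCat source, an in-memory channel, and a logger sink (the names a1, r1, c1, and k1 are arbitrary labels):

```properties
# Name the components of agent a1 (labels are arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# NetCat source: listens for text lines on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Memory channel: buffers events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Logger sink: writes events to the agent's log
a1.sinks.k1.type = logger

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such an agent is typically started with `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`.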
Interceptors, Selectors and Multiplexers
• An Interceptor inspects and modifies Flume Events before they reach the Channel.
• A Selector determines which Channel an event should be sent to.
• A Multiplexer allows the same event to be sent to multiple destinations.
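As a sketch of how selection is configured, a multiplexing channel selector routes each event by the value of a header (the header name "state", the mapped values, and the channel names below are illustrative assumptions):

```properties
# Route events from source r1 by the value of the "state" header
# (header name, values, and channel names are illustrative)
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.US = c1
a1.sources.r1.selector.mapping.IN = c2
a1.sources.r1.selector.default = c3
```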
Apache Flume Architecture
• 1. Data Generators: Data generators produce real-time streaming data, which is collected by individual Flume agents running on them. Common data generators are Facebook, Twitter, etc.
• 2. Flume Agent: An agent is a JVM process in Flume. It receives events from clients or other agents and transfers them to the destination or to other agents. It consists of three components, a source, a channel, and a sink, through which data flows in Flume.
• 2a. Source: A Flume source is the component of a Flume agent that consumes data (events) from data generators such as a web server and delivers it to one or more channels. The data generator sends data (events) to Flume in a format recognized by the target Flume source. Flume supports different types of sources, each receiving events from a specific kind of data generator. Examples of Flume sources: Avro source, Exec source, Thrift source, NetCat source, HTTP source, Scribe source, Twitter 1% source, etc.
• 2b. Channel: When a Flume source receives an event from a data generator, it stores it in one or more channels. A Flume channel is a passive store that holds events from the Flume source until Flume sinks consume them. The channel acts as a bridge between Flume sources and Flume sinks. Flume channels are fully transactional and can work with any number of sources and sinks. Examples of Flume channels: Memory channel, File channel, JDBC channel, Custom channel, etc.
• 2c. Sink: The Flume sink retrieves events from the Flume channel and pushes them to a centralized store such as HDFS or HBase, or passes them to the next agent. Examples of Flume sinks: HDFS sink, HBase sink, Avro sink, Elasticsearch sink, etc.
• 3. Data Collector: The data collector collects the data from individual agents and aggregates it, then pushes the collected data to a centralized store.
• 4. Centralized Store: Centralized stores are Hadoop HDFS, HBase, etc.
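A typical sink configuration for pushing events into a centralized store such as HDFS might look like the following sketch (the HDFS path and roll settings are illustrative assumptions):

```properties
# HDFS sink: writes events into date-bucketed directories
# (path and roll interval below are illustrative)
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
# Let the sink use the local time to resolve the %y-%m-%d escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Without useLocalTimeStamp, the date escape sequences are resolved from a timestamp header on each event, which can be added with the timestamp interceptor.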
DATA FLOW MODEL
Data Flow Strategies
• Single-hop flow: Source → Channel → Sink.
• Multi-hop flow: Data passes through multiple agents before reaching the final destination.
• Fan-in/Fan-out: Data flows from multiple sources into a single sink (fan-in) or from one source to multiple sinks (fan-out).
• Failover and Load Balancing: Ensures reliability and efficiency.
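A multi-hop flow is usually built by chaining agents over Avro: the upstream agent's Avro sink points at the downstream agent's Avro source. A minimal sketch, where the hostname and port are assumptions:

```properties
# Upstream agent: Avro sink forwards events to the collector host
agent1.sinks.k1.type = avro
agent1.sinks.k1.channel = c1
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4545

# Downstream (collector) agent: Avro source receives those events
agent2.sources.r1.type = avro
agent2.sources.r1.channels = c1
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
```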
Failover and load balancing
• Apache Flume has an in-built feature of failure handling. For each event in Flume, two transactions take place: one at the sender side and another at the receiver side.
• The sender sends the event to the receiver. The receiver, after receiving the event, commits its own transaction and sends a "received" signal back to the sender.
• The sender will not commit its transaction until it receives the "received" signal from the receiver.
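Failover and load balancing are configured through sink groups. In this sketch (sink names and priority values are illustrative), the failover processor sends events to the highest-priority healthy sink, here k2, and falls back to k1 if it fails:

```properties
# Group two sinks under a failover processor
# (sink names and priorities are illustrative)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

Setting processor.type = load_balance instead distributes events across the sinks in the group (round-robin or random selection).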
Use Cases of Apache Flume
• Log Collection
• Social Media Data Ingestion
• Data Replication
• Data Aggregation
• Clickstream Analysis
• Internet of Things (IoT) Data Ingestion
• Data Pipeline
• ETL (Extract, Transform, Load)
• Data Streaming and Real-Time Analytics
• Data Archiving
Features of Flume
• Reliable and Scalable Data Ingestion:
Flume efficiently handles massive data streams while ensuring no loss.
It uses durable channels like File Channel to store data even during
failures. Social media platforms use Flume to collect and ingest user
activity logs into Hadoop reliably.
• Distributed and Fault-Tolerant Architecture:
Flume operates across multiple agents, ensuring continuous data flow
even if one agent fails. It supports failover and load balancing to
prevent disruptions. Banks use Flume to log financial transactions,
redirecting data automatically during failures.
• High Throughput and Low Latency:
Flume processes large data volumes quickly using memory-based
channels and batch processing. This reduces network overhead and
ensures fast data ingestion. E-commerce websites use Flume for real-time user activity tracking to enhance recommendations.
Flexible Data Routing:
Flume supports Fan-in and Fan-out. It uses interceptors and selectors to modify or filter data
before routing. News websites use this feature to store logs in Hadoop while analyzing real-time
data in Elasticsearch.
Support for Multiple Data Sources and Sinks:
Flume can collect data from sources like Syslog, Kafka, and Twitter and send it to HDFS, Hive,
and Elasticsearch. Its versatility makes it ideal for cybersecurity applications that aggregate logs
from firewalls and networks. This enables real-time threat detection and long-term storage.
Simple and Extensible Configuration:
Flume uses easy-to-edit .properties files for quick configuration. It allows custom
extensions for sources, sinks, and interceptors. Retail companies optimize storage by
filtering unnecessary logs before saving them.
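For example, attaching the built-in timestamp and host interceptors to a source takes only a few lines in the same .properties file (the agent and source names are illustrative):

```properties
# Add two interceptors to source r1: stamp each event with
# the current time and the agent's host
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
```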
Seamless Integration with Big Data Ecosystem:
Flume connects with Hadoop, Spark, and Kafka for real-time and batch processing. It enables
organizations to stream data into distributed storage or analytics frameworks. Telecom providers
use Flume for real-time call log analysis in Spark Streaming.
Advantages & Disadvantages
of Flume
Apache Kafka vs Apache Flume:
• Apache Kafka is a distributed data system. Apache Flume is an available, reliable, and distributed system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
• Kafka is optimized for ingesting and processing streaming data in real time. Flume is specially designed for Hadoop.
• Kafka basically works as a pull model. Flume basically works as a push model.
• Kafka is easy to scale. Flume is not as scalable as Kafka.
• Kafka is a fault-tolerant, efficient, and scalable messaging system; it supports automatic recovery and is resilient to node failure. In Flume, events in the channel can be lost if the agent fails.
• Kafka runs as a cluster that handles incoming high-volume data streams in real time. Flume is a tool to collect log data from distributed web servers.
• Kafka treats each topic partition as an ordered set of messages. Flume can take in streaming data from multiple sources for storage and analysis in Hadoop.
Companies & Platforms Using Apache
Flume
1. Facebook – Handling Massive Log Data
2. Twitter – Streaming Real-time User Activity
3. LinkedIn – Analyzing User Engagement & System Logs
4. Netflix – Optimizing Streaming Performance
5. eBay – Monitoring Transactions & Customer Behavior
6. Banks & Financial Institutions – Fraud Detection &
Compliance
7. Telecom Companies – Network Performance & Call Data
Analysis
8. Government & Smart Cities – IoT Data Collection
DEMONSTRATION
• https://www.geeksforgeeks.org/difference-between-apache-kafka-and-apache-flume/
• https://www.confluent.io/learn/apache-flume/
• https://www.tutorialspoint.com/apache_flume/apache_flume_quick_guide.htm
• https://flume.apache.org/releases/content/1.11.0/FlumeUserGuide.html