Apache Flume
Introduction to Apache Flume
• Definition: Distributed, reliable, and available system for collecting,
aggregating, and moving large amounts of log data.
• Not limited to logs—supports network traffic, social media data, emails, etc.
• A top-level project at the Apache Software Foundation.
Why was Flume Created?
• Traditional methods of log collection were inefficient for large-scale,
distributed systems.
• Needed a fault-tolerant and scalable solution for handling real-time log
data ingestion.
• Flume was developed to simplify and automate data flow from multiple
sources to storage systems.
System Requirements
• Java Runtime Environment - Java 1.8 or later
• Directory Permissions - Read/Write permissions for the directories used by the agent
Minimum Hardware Requirements:
• RAM: 4GB
• Disk Space: 10GB (for logs and temporary storage)
• Network: 1 Gbps Ethernet
Recommended Hardware for Large-Scale Deployments:
• RAM: 16GB or more
• Disk Space: 100GB+ (especially if logs are stored locally before forwarding)
• Network: 10 Gbps Ethernet for high-throughput environments
Flume Event and Agent
Flume Event
• An event is the basic unit of data transported inside Flume.
• It contains a byte-array payload that is transported from the source to the destination, accompanied by optional headers.
Flume Agent
• An agent is an independent daemon process (JVM) in Flume.
• It receives the data (events) from clients or other agents and
forwards it to its next destination (sink or agent).
• A Flume deployment may have more than one agent.
Flume Source
• A Source collects incoming data and converts it into Flume Events.
Flume Channel
• A Channel acts as a buffer between the Source and the Sink.
Flume Sink
• A Sink takes events from a Channel and delivers them to a final destination.
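The three components above are wired together in an agent's .properties configuration file. The following sketch follows the well-known quick-start example from the Flume documentation: an agent named a1 with a NetCat source, an in-memory channel, and a logger sink (the names a1, r1, c1, and k1 are arbitrary labels):

```properties
# Name the components of agent a1 (labels are arbitrary)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# NetCat source: listens for text lines on a TCP port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Memory channel: buffers events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Logger sink: writes events to the agent's log
a1.sinks.k1.type = logger

# Bind the source and the sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Such an agent is typically started with `bin/flume-ng agent --conf conf --conf-file example.conf --name a1`.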
Interceptors, Selectors and Multiplexers
• An Interceptor inspects and modifies Flume Events before they reach the Channel.
• A Selector determines which Channel an event should be sent to.
• A Multiplexer allows the same event to be sent to multiple destinations.
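As a sketch of how selection is configured, a multiplexing channel selector routes each event by the value of a header (the header name "state", the mapped values, and the channel names below are illustrative assumptions):

```properties
# Route events from source r1 by the value of the "state" header
# (header name, values, and channel names are illustrative)
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.US = c1
a1.sources.r1.selector.mapping.IN = c2
a1.sources.r1.selector.default = c3
```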
Apache Flume Architecture
• 1. Data Generators: Data generators produce real-time streaming data, which is collected by individual Flume agents running on them. Common data generators are Facebook, Twitter, etc.
• 2. Flume Agent: An agent is a JVM process in Flume. It receives events from clients or other agents and transfers them to the destination or to other agents. It consists of three components, a source, a channel, and a sink, through which data flows in Flume.
• 2a. Source: A Flume source is the component of a Flume agent that consumes data (events) from data generators such as a web server and delivers it to one or more channels. The data generator sends data (events) to Flume in a format recognized by the target Flume source. Flume supports different types of sources, each receiving events from a specific kind of data generator. Examples of Flume sources: Avro source, Exec source, Thrift source, NetCat source, HTTP source, Scribe source, Twitter 1% source, etc.
• 2b. Channel: When a Flume source receives an event from a data generator, it stores it in one or more channels. A Flume channel is a passive store that holds events from the Flume source until Flume sinks consume them. The channel acts as a bridge between Flume sources and Flume sinks. Flume channels are fully transactional and can work with any number of sources and sinks. Examples of Flume channels: Memory channel, File channel, JDBC channel, Custom channel, etc.
• 2c. Sink: The Flume sink retrieves events from the Flume channel and pushes them to a centralized store such as HDFS or HBase, or passes them to the next agent. Examples of Flume sinks: HDFS sink, HBase sink, Avro sink, Elasticsearch sink, etc.
• 3. Data Collector: The data collector collects the data from individual agents and aggregates it, then pushes the collected data to a centralized store.
• 4. Centralized Store: Centralized stores are Hadoop HDFS, HBase, etc.
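A typical sink configuration for pushing events into a centralized store such as HDFS might look like the following sketch (the HDFS path and roll settings are illustrative assumptions):

```properties
# HDFS sink: writes events into date-bucketed directories
# (path and roll interval below are illustrative)
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
# Let the sink use the local time to resolve the %y-%m-%d escapes
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Without useLocalTimeStamp, the date escape sequences are resolved from a timestamp header on each event, which can be added with the timestamp interceptor.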
DATA FLOW MODEL
Data Flow Strategies
• Single-hop flow: Source → Channel → Sink.
• Multi-hop flow: Data passes through multiple agents before reaching the final destination.
• Fan-in/Fan-out: Data flows from multiple sources into a single sink (fan-in) or from one source to multiple sinks (fan-out).
• Failover and Load Balancing: Ensures reliability and efficiency.
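A multi-hop flow is usually built by chaining agents over Avro: the upstream agent's Avro sink points at the downstream agent's Avro source. A minimal sketch, where the hostname and port are assumptions:

```properties
# Upstream agent: Avro sink forwards events to the collector host
agent1.sinks.k1.type = avro
agent1.sinks.k1.channel = c1
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4545

# Downstream (collector) agent: Avro source receives those events
agent2.sources.r1.type = avro
agent2.sources.r1.channels = c1
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
```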
Failover and load balancing
• Apache Flume has an in-built feature of failure handling. For each event in Flume, two transactions take place: one at the sender side and another at the receiver side.
• The sender sends the event to the receiver. The receiver, after receiving the event, commits its own transaction and sends a "received" signal back to the sender.
• The sender will not commit its transaction until it receives the "received" signal from the receiver.
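Failover and load balancing are configured through sink groups. In this sketch (sink names and priority values are illustrative), the failover processor sends events to the highest-priority healthy sink, here k2, and falls back to k1 if it fails:

```properties
# Group two sinks under a failover processor
# (sink names and priorities are illustrative)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

Setting processor.type = load_balance instead distributes events across the sinks in the group (round-robin or random selection).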
Use Cases of Apache Flume
• Log Collection
• Social Media Data Ingestion
• Data Replication
• Data Aggregation
• Clickstream Analysis
• Internet of Things (IoT) Data Ingestion
• Data Pipeline
• ETL (Extract, Transform, Load)
• Data Streaming and Real-Time Analytics
• Data Archiving
Features of Flume
• Reliable and Scalable Data Ingestion:
Flume efficiently handles massive data streams while ensuring no loss.
It uses durable channels like File Channel to store data even during
failures. Social media platforms use Flume to collect and ingest user
activity logs into Hadoop reliably.
• Distributed and Fault-Tolerant Architecture:
Flume operates across multiple agents, ensuring continuous data flow
even if one agent fails. It supports failover and load balancing to
prevent disruptions. Banks use Flume to log financial transactions,
redirecting data automatically during failures.
• High Throughput and Low Latency:
Flume processes large data volumes quickly using memory-based
channels and batch processing. This reduces network overhead and
ensures fast data ingestion. E-commerce websites use Flume for real-time user activity tracking to enhance recommendations.
Flexible Data Routing:
Flume supports Fan-in and Fan-out. It uses interceptors and selectors to modify or filter data
before routing. News websites use this feature to store logs in Hadoop while analyzing real-time
data in Elasticsearch.
Support for Multiple Data Sources and Sinks:
Flume can collect data from sources like Syslog, Kafka, and Twitter and send it to HDFS, Hive,
and Elasticsearch. Its versatility makes it ideal for cybersecurity applications that aggregate logs
from firewalls and networks. This enables real-time threat detection and long-term storage.
Simple and Extensible Configuration:
Flume uses easy-to-edit .properties files for quick configuration. It allows custom
extensions for sources, sinks, and interceptors. Retail companies optimize storage by
filtering unnecessary logs before saving them.
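For example, attaching the built-in timestamp and host interceptors to a source takes only a few lines in the same .properties file (the agent and source names are illustrative):

```properties
# Add two interceptors to source r1: stamp each event with
# the current time and the agent's host
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
```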
Seamless Integration with Big Data Ecosystem:
Flume connects with Hadoop, Spark, and Kafka for real-time and batch processing. It enables
organizations to stream data into distributed storage or analytics frameworks. Telecom providers
use Flume for real-time call log analysis in Spark Streaming.
Advantages & Disadvantages
of Flume
Apache Kafka vs Apache Flume:
• Apache Kafka is a distributed data system. Apache Flume is an available, reliable, and distributed system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store.
• Kafka is optimized for ingesting and processing streaming data in real time. Flume is specially designed for Hadoop.
• Kafka basically works as a pull model. Flume basically works as a push model.
• Kafka is easy to scale. Flume is not as scalable as Kafka.
• Kafka is a fault-tolerant, efficient, and scalable messaging system; it supports automatic recovery and is resilient to node failure. In Flume, events in the channel can be lost if the agent fails.
• Kafka runs as a cluster that handles incoming high-volume data streams in real time. Flume is a tool to collect log data from distributed web servers.
• Kafka treats each topic partition as an ordered set of messages. Flume can take in streaming data from multiple sources for storage and analysis in Hadoop.
Companies & Platforms Using Apache
Flume
1. Facebook – Handling Massive Log Data
2. Twitter – Streaming Real-time User Activity
3. LinkedIn – Analyzing User Engagement & System Logs
4. Netflix – Optimizing Streaming Performance
5. eBay – Monitoring Transactions & Customer Behavior
6. Banks & Financial Institutions – Fraud Detection &
Compliance
7. Telecom Companies – Network Performance & Call Data
Analysis
8. Government & Smart Cities – IoT Data Collection
DEMONSTRATION
• https://www.geeksforgeeks.org/difference-between-apache-kafka-and-apache-flume/
• https://www.confluent.io/learn/apache-flume/
• https://www.tutorialspoint.com/apache_flume/apache_flume_quick_guide.htm
• https://flume.apache.org/releases/content/1.11.0/FlumeUserGuide.html