Putting Apache Kafka to Use
Building a Real-time Data Platform for Event Streams
JAY KREPS, CONFLUENT
A Couple of Themes
Theme 1: Rise of Events
Theme 2: Immutability Everywhere
Level | Example | Immutable Alternative
Mutable local state | Counter in a for loop | Functional Programming
Mutable process-wide state | ConcurrentHashMap | Functional Programming
Mutable on-disk structures | B-Tree | LSM
Distributed systems | Dynamo-like key-value store | State machine replication
Mutability in databases | RDBMS | Event Sourcing
Company-wide data flow | Double write | Kafka
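The "Event Sourcing" row can be made concrete with a small sketch. It is not from the talk: the `Event` type, the account names, and the `balances` function are hypothetical, chosen only to show state being derived from an immutable list of events rather than updated in place.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)  # frozen: an event cannot be modified once recorded
class Event:
    account: str
    delta: int


def balances(events: Iterable[Event]) -> Dict[str, int]:
    """Fold a sequence of events into the current balance per account."""
    state: Dict[str, int] = {}
    for e in events:
        state[e.account] = state.get(e.account, 0) + e.delta
    return state


# The log is only ever appended to; a tuple keeps it immutable here.
log = (Event("alice", 100), Event("bob", 50), Event("alice", -30))

current = balances(log)                 # state after all events
as_of_second_event = balances(log[:2])  # any earlier state can be recomputed
```

Because the events are never overwritten, an earlier state such as `as_of_second_event` can be rebuilt by replaying a prefix of the log, which an in-place update would have destroyed.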
Theme 3: Datacenter-Level Thinking
Experience at LinkedIn
2009: We want all our data in Hadoop!
What is all our data?
Initial approach: “gut it out”
Problems
• Data coverage
• Many source systems
◦ Relational DBs
◦ Log files
◦ Metrics
◦ Messaging systems
• Many data formats
• Constant change
◦ New schemas
◦ New data sources
Needed: organizational scalability
Θ(N) => Θ(1)
How does everything else work?
Relational database changes
[Diagram: Apps and Services; OLTP Queries; Relational Databases; Data Guard; CSV Dump; Poll For Changes; Cache; ODS; Hadoop; Apps; Transforms; Relational Data Warehouse; Caches & Derived Stores]
NoSQL
[Diagram: Apps; Key-value Store; ETL; Load; Hadoop]
User events
[Diagram: Apps and Services; HTTP; Log Aggregation; NFS; rsync; Load; Transform & Load; Hadoop; Relational Data Warehouse; Transform]
Application Logs
[Diagram: Apps and Services; Splunk]
Messaging
[Diagram: Apps; Brokers; Processors]
Metrics and operational data
[Diagram: Apps; Monitoring]
This is a giant mess
[Diagram: the pipelines above combined in one picture: Apps and Services; OLTP Queries; HTTP; ActiveMQ; Monitoring; Relational Databases; Log Aggregation; Splunk; Key-value Store; Data Guard; CSV Dump; NFS; rsync; Cache; Poll For Changes; ODS; Hadoop; Load; Transform & Load; Transforms; Relational Data Warehouse; Caches & Derived Stores]
Impossible ideas
• Publish data from Hadoop to a search index
• Run a SQL query to find the biggest latency bottleneck
• Run a SQL query to find common error patterns
• Low latency monitoring of database changes or user activity
• Incorporate popularity in real-time display and relevance algorithms
• Products that incorporate user activity
An infrastructure solution?
Idea: Stream Data Platform
[Diagram: a Stream Data Platform (marked "?") in the center; connected systems: Search, Apps, Monitoring, RDBMS, NoSQL, Stream Processing, Real-time Analytics, DWH, and HADOOP: Offline Data (Impala, Hive, Map-Reduce, Spark). Latency axis: Synchronous Req/Response (0 - 100s ms); Near real time (> 100s ms); Offline batch (> 1 hour)]
First Attempt: Messaging systems!
Problems
• Throughput
• Batch systems
• Persistence
• Stream Processing
• Ordering guarantees
• Partitioning
Second Attempt: Build Kafka!
What does it do?
[Diagram: Producers; Kafka Cluster; Consumers]
Commit Log Abstraction
[Diagram: a log with offsets 0 to 12, ordered old to new; writes arrive at the new end; Reader 1 and Reader 2 read from the log]
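A minimal in-memory sketch of the commit log idea, under stated assumptions: the `CommitLog` class, the `poll` helper, and the reader names are hypothetical and are not Kafka's API. It only illustrates an append-only sequence addressed by offset, with each reader tracking its own position.

```python
class CommitLog:
    """Append-only sequence of records; a record's offset is its index."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the new end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


log = CommitLog()
for i in range(13):  # offsets 0..12, matching the slide
    log.append(f"event-{i}")

# Each reader keeps its own position; reading never removes data.
reader_positions = {"reader1": 0, "reader2": 0}


def poll(reader, max_records):
    """Read the next batch for one reader and advance only that reader."""
    start = reader_positions[reader]
    batch = log.read(start, max_records)
    reader_positions[reader] = start + len(batch)
    return batch


first = poll("reader1", 4)   # reader1 is now at offset 4
second = poll("reader2", 9)  # reader2 is now at offset 9, independently
```

Keeping the position on the reader side is what lets slow and fast readers share one log without interfering with each other.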
Logs & Publish-Subscribe Messaging
[Diagram: a Source System writes to a Log with offsets 0 to 12; Destination System A and Destination System B each read from it]
A Kafka Topic
[Diagram: a topic made of Partition 0 (offsets 0 to 12), Partition 1 (offsets 0 to 9), and Partition 2 (offsets 0 to 12), each ordered old to new, with writes arriving at the new end]
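One way to picture a partitioned topic is the sketch below. It assumes records carry a key and that the partition is chosen by hashing that key; the CRC32 hash, the three-partition count, and the `send` helper are illustrative assumptions, not a description of how Kafka's clients choose partitions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumption: three partitions, as drawn on the slide


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Each partition is its own append-only log.
partitions = defaultdict(list)


def send(key, value):
    """Append to the key's partition; return (partition, offset)."""
    p = partition_for(key)
    partitions[p].append((key, value))
    return p, len(partitions[p]) - 1


for n in range(3):
    for user in ("alice", "bob", "carol"):
        send(user, f"{user}-click-{n}")
```

In this sketch all records for one key land in the same partition, so they keep their relative order there; no order is defined between records in different partitions.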
Replication
[Diagram: replicas of partitions A:0, A:1, and B:0 spread across Server 1, Server 2, and Server 3, with one server marked as Controller]
Scaling Consumers
[Diagram: a Kafka Cluster with partitions P0 and P3 on Server 1 and P1 and P2 on Server 2, read by consumers C1 to C6 split between Consumer Group A and Consumer Group B]
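The consumer-group picture can be sketched as a partition assignment. Two assumptions are made here: that Group A contains C1 and C2 and Group B contains C3 to C6 (the extracted slide does not show the split), and that partitions are handed out round-robin. The `assign` function is hypothetical and is not one of Kafka's assignment strategies.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer in the group, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Every group receives all partitions; members of a group split them.
group_a = assign(partitions, ["C1", "C2"])
group_b = assign(partitions, ["C3", "C4", "C5", "C6"])
```

Under these assumptions each consumer in Group A reads two partitions and each consumer in Group B reads one, so adding consumers to a group spreads the same topic over more readers.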
Kafka: A Modern Distributed System for Streams
Scalability of a filesystem
◦ Hundreds of MB/sec/server throughput
◦ Many TB per server
Guarantees of a database
◦ Messages strictly ordered
◦ All data persistent
Distributed by default
◦ Replication
◦ Partitioning model
Producers, Consumers, and Brokers all fault tolerant and horizontally scalable
Stream Data Platform
[Diagram: KAFKA as the Stream Data Platform in the center; connected systems: Search, Apps, Monitoring, RDBMS, NoSQL, Stream Processing, Real-time Analytics, DWH, and HADOOP: Offline Data (Impala, Hive, Map-Reduce, Spark). Latency axis: Synchronous Req/Response (0 - 100s ms); Near real time (> 100s ms); Offline batch (> 1 hour)]
Batch Data => Batch Processing
Stream processing is a generalization of batch processing and request/response processing
Request/Response processing: One input => One output
Batch processing: All inputs => All outputs
Stream Processing: Some inputs => some outputs
(you choose how much “some” is)
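The three cases above can be read as one function with a tunable chunk size. The sketch below is an illustration only: `process_in_chunks`, the sample numbers, and the use of `sum` as the transform are assumptions.

```python
from itertools import islice


def process_in_chunks(inputs, chunk_size, transform):
    """Apply transform to successive chunks of chunk_size inputs."""
    it = iter(inputs)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield transform(chunk)


events = [3, 1, 4, 1, 5, 9]

# Request/response: one input => one output.
one_at_a_time = list(process_in_chunks(events, 1, sum))
# Stream processing: some inputs => some outputs ("some" is 2 here).
some = list(process_in_chunks(events, 2, sum))
# Batch processing: all inputs => all outputs.
everything = list(process_in_chunks(events, len(events), sum))
```

With a chunk size of 1 the function behaves like request/response, and with a chunk size equal to the whole input it behaves like a batch job; everything in between is the "some" that the slide leaves to the user.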
Stream Processing a la carte
[Diagram: Input Kafka Topic; Transforms (your code); Intermediate Kafka Topic; Transforms; Output Kafka Topic; Hadoop; Live Data Store]
cat input | grep "foo" | wc -l
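The pipe analogy can be sketched with plain lists standing in for topics. All names here (`run_stage`, `running_count`, the sample lines) are hypothetical; this is not Kafka client code, only an illustration of transforms that read one topic and append to the next.

```python
def run_stage(input_topic, output_topic, transform):
    """Read each record from input_topic; append transform's results to output_topic."""
    for record in input_topic:
        output_topic.extend(transform(record))


def running_count():
    """Return a stateful transform that emits the number of records seen so far."""
    seen = 0

    def transform(_record):
        nonlocal seen
        seen += 1
        return [seen]

    return transform


input_topic = ["foo=1", "bar=2", "food", "baz"]
intermediate_topic = []
output_topic = []

# Stage 1, like grep "foo": keep only matching lines.
run_stage(input_topic, intermediate_topic,
          lambda line: [line] if "foo" in line else [])
# Stage 2, like wc -l, but emitting a running count per record.
run_stage(intermediate_topic, output_topic, running_count())
```

Unlike a Unix pipe, the intermediate result is kept in its own topic, so another stage could read `intermediate_topic` later without rerunning the first one.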
Stream Processing with Frameworks
[Diagram: (logo) + (logo) = Stream Processing]
Unix Pipes, Modernized
cat /usr/share/dict/words | wc -l
On Schemas
Bad Schemas < No Schemas < Good Schemas
Put it all together
[Diagram: Kafka in the center; connected systems: Apps, Oracle, Key-Value Storage, Search, Social Graph, Newsfeed, OLAP, Log Search, Monitoring, Security & Fraud, Real-time Analytics, Samza, Hadoop, Teradata]
At LinkedIn
• Everything in the company is a real-time stream
• > 800 billion messages written per day
• > 2.9 trillion messages read per day
• ~ 1 PB of stream data
• Tens of thousands of producer processes
• Backbone for data stores
◦ Search
◦ Social Graph
◦ Newsfeed
◦ Primary storage (in progress)
• Basis for stream processing
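To put the daily totals in per-second terms, a short calculation helps. The figures on the slide are lower bounds ("> 800 billion", "> 2.9 trillion"), so the results below are lower bounds on the daily average, not peak rates; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

written_per_day = 800_000_000_000   # "> 800 billion messages written per day"
read_per_day = 2_900_000_000_000    # "> 2.9 trillion messages read per day"

avg_writes_per_sec = written_per_day / SECONDS_PER_DAY
avg_reads_per_sec = read_per_day / SECONDS_PER_DAY
reads_per_write = read_per_day / written_per_day

print(f"average writes/sec: {avg_writes_per_sec:,.0f}")  # about 9.26 million
print(f"average reads/sec:  {avg_reads_per_sec:,.0f}")   # about 33.6 million
print(f"reads per write:    {reads_per_write:.3f}")      # 3.625
```

On these numbers, each message written is read between three and four times on average.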
Elsewhere
Why this is the future
1. System diversity is increasing
2. Data diversity and volume are increasing
3. The world is getting faster
4. The technology exists
• Mission: Make this a practical reality everywhere
• Product
◦ Apache Kafka
◦ Schemas and metadata management
◦ Connectors for common systems
◦ Monitor data flow end-to-end
◦ Stream processing integration
Questions?
• Confluent
◦ @confluentinc
◦ http://confluent.io
◦ http://blog.confluent.io/2015/02/25/stream-data-platform-1
• Apache Kafka
◦ @apachekafka
◦ http://kafka.apache.org
◦ http://linkd.in/199iMwY
• Me
◦ @jaykreps