Disruptor Guide for Developers
Version 4.0.0.RC2-SNAPSHOT
Table of Contents
Using the Disruptor
o Introduction
o Getting Started
o Advanced Techniques
Design and Implementation
Known Issues
Batch Rewind
o The Feature
o Use Case
The LMAX Disruptor is a high performance inter-thread messaging library. It grew out of
LMAX’s research into concurrency, performance and non-blocking algorithms and today
forms a core part of their Exchange’s infrastructure.
Core Concepts
Before we can understand how the Disruptor works, it is worthwhile defining a number of terms that will be
used throughout the documentation and the code. For those with a DDD bent, think of this as the ubiquitous
language of the Disruptor domain.
Ring Buffer: The Ring Buffer is often considered the main aspect of the Disruptor. However, from
3.0 onwards, the Ring Buffer is only responsible for the storing and updating of the data (Events)
that move through the Disruptor. For some advanced use cases, it can even be completely
replaced by the user.
Sequence: The Disruptor uses Sequences as a means to identify where a particular component is
up to. Each consumer (Event Processor) maintains a Sequence as does the Disruptor itself. The
majority of the concurrent code relies on the movement of these Sequence values, hence
the Sequence supports many of the current features of an AtomicLong. In fact the only real
difference between the two is that the Sequence contains additional functionality to prevent false
sharing between Sequences and other values.
Sequencer: The Sequencer is the real core of the Disruptor. The two implementations (single
producer, multi producer) of this interface implement all the concurrent algorithms for fast, correct
passing of data between producers and consumers.
Sequence Barrier: The Sequencer produces a Sequence Barrier that contains references to the
main published Sequence from the Sequencer and the Sequences of any dependent consumer. It
contains the logic to determine if there are any events available for the consumer to process.
Wait Strategy: The Wait Strategy determines how a consumer will wait for events to be placed
into the Disruptor by a producer. More details are available in the section about being optionally
lock-free.
Event: The unit of data passed from producer to consumer. There is no specific code representation
of the Event as it is defined entirely by the user.
Event Processor: The main event loop for handling events from the Disruptor; it has ownership of the
consumer's Sequence. There is a single representation called BatchEventProcessor that contains
an efficient implementation of the event loop and will call back onto a user-supplied
implementation of the EventHandler interface.
Event Handler: An interface that is implemented by the user and represents a consumer for the
Disruptor.
Producer: This is the user code that calls the Disruptor to enqueue Events. This concept also has
no representation in the code.
To put these elements into context, below is an example of how LMAX uses the Disruptor within its high
performance core services, e.g. the exchange.
Figure 1. Disruptor with a set of dependent consumers.
Multicast Events
This is the biggest behavioural difference between queues and the Disruptor.
When you have multiple consumers listening on the same Disruptor, it publishes all events to all consumers.
In contrast, a queue will only send a single event to a single consumer. You can use this behaviour of the
Disruptor when you need to perform multiple independent parallel operations on the same data.
Example use-case
The canonical example from LMAX is where we have three operations: journalling (writing the input data to
a persistent journal file), replication (sending the input data to another machine to ensure that there is a
remote copy of the data), and business logic (the real processing work).
Firstly we need to ensure that the producers do not overrun consumers. This is handled by adding
the relevant consumers to the Disruptor by calling RingBuffer.addGatingConsumers().
Secondly, the case referred to previously is implemented by constructing a SequenceBarrier
containing Sequences from the components that must complete their processing first.
Referring to Figure 1 there are 3 consumers listening for Events from the Ring Buffer. There is a dependency
graph in this example.
The ApplicationConsumer depends on the JournalConsumer and ReplicationConsumer. This means that
the JournalConsumer and ReplicationConsumer can run freely in parallel with each other. The dependency
relationship can be seen by the connection from the ApplicationConsumer's SequenceBarrier to
the Sequences of the JournalConsumer and ReplicationConsumer.
It is also worth noting the relationship that the Sequencer has with the downstream consumers. One of its
roles is to ensure that publication does not wrap the Ring Buffer. To do this none of the downstream
consumers may have a Sequence that is lower than the Ring Buffer’s Sequence less the size of the Ring
Buffer.
However, by using the graph of dependencies an interesting optimisation can be made. Because
the ApplicationConsumer's Sequence is guaranteed to be less than or equal to that of
the JournalConsumer and ReplicationConsumer (that is what that dependency relationship ensures)
the Sequencer need only look at the Sequence of the ApplicationConsumer. In a more general sense
the Sequencer only needs to be aware of the Sequences of the consumers that are the leaf nodes in the
dependency tree.
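For illustration, this kind of dependency graph can be expressed directly with the Disruptor DSL; a minimal sketch (the disruptor instance, the event type and the handler variables are hypothetical stand-ins for the consumers in Figure 1):
// JournalConsumer and ReplicationConsumer run freely in parallel;
// ApplicationConsumer only sees an event once both have processed it.
disruptor.handleEventsWith(journalHandler, replicationHandler)
         .then(applicationHandler);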
Event Pre-allocation
One of the goals of the Disruptor is to enable use within a low latency environment. Within low-latency
systems it is necessary to reduce or remove memory allocations. In Java-based systems the purpose is to
reduce the number of stalls due to garbage collection [1].
To support this the user is able to preallocate the storage required for the events within the Disruptor.
During construction an EventFactory is supplied by the user and will be called for each entry in the
Disruptor’s Ring Buffer. When publishing new data to the Disruptor the API will allow the user to get hold of
the constructed object so that they can call methods or update fields on that stored object. The Disruptor
provides guarantees that these operations will be concurrency-safe as long as they are implemented
correctly.
Optionally Lock-free
Another key implementation detail pushed by the desire for low-latency is the extensive use of lock-free
algorithms to implement the Disruptor. All memory visibility and correctness guarantees are implemented
using memory barriers and/or compare-and-swap operations.
There is only one use-case where an actual lock is required and that is within the BlockingWaitStrategy.
This is done solely for the purpose of using a condition so that a consuming thread can be parked while
waiting for new events to arrive. Many low-latency systems will use a busy-wait to avoid the jitter that can
be incurred by using a condition; however, in a number of systems busy-wait operations can lead to
significant degradation in performance, especially where the CPU resources are heavily constrained, e.g.
web servers in virtualised-environments.
Getting Started
Getting the Disruptor
The Disruptor jar file is available from Maven Central (group com.lmax, artifact disruptor) and can be
integrated into your dependency manager of choice from there.
Firstly we will define the Event that will carry the data and is common to all following examples:
Example LongEvent
public class LongEvent
{
    private long value;

    public void set(long value) { this.value = value; }
}
In order to allow the Disruptor to preallocate these events for us, we need an EventFactory that will
perform the construction. This could be a method reference, such as LongEvent::new or an explicit
implementation of the EventFactory interface:
Example LongEventFactory
public class LongEventFactory implements EventFactory<LongEvent>
{
@Override
public LongEvent newInstance()
{
return new LongEvent();
}
}
Once we have the event defined, we need to create a consumer that will handle these events. As an
example, we will create an EventHandler that will print the value out to the console.
Example LongEventHandler
public class LongEventHandler implements EventHandler<LongEvent>
{
@Override
public void onEvent(LongEvent event, long sequence, boolean endOfBatch)
{
System.out.println("Event: " + event);
}
}
Finally, we will need a source for these events. For simplicity, we will assume that the data is coming from
some sort of I/O device, e.g. network or file in the form of a ByteBuffer.
Publishing
Using Lambdas
Since version 3.0 of the Disruptor it has been possible to use a Lambda-style API to write publishers. This is
the preferred approach, as it encapsulates much of the complexity of the alternatives.
Disruptor<LongEvent> disruptor =
new Disruptor<>(LongEvent::new, bufferSize, DaemonThreadFactory.INSTANCE);
Given that method references can be used instead of anonymous lambdas, it is possible to rewrite the
example in this fashion:
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.examples.longevent.LongEvent;
import com.lmax.disruptor.util.DaemonThreadFactory;
import java.nio.ByteBuffer;

public class LongEventMain
{
    public static void handleEvent(LongEvent event, long sequence, boolean endOfBatch)
    {
        System.out.println(event);
    }

    public static void main(String[] args)
    {
        int bufferSize = 1024;  // must be a power of 2
        Disruptor<LongEvent> disruptor =
                new Disruptor<>(LongEvent::new, bufferSize, DaemonThreadFactory.INSTANCE);
        disruptor.handleEventsWith(LongEventMain::handleEvent);
        disruptor.start();

        RingBuffer<LongEvent> ringBuffer = disruptor.getRingBuffer();
        ByteBuffer bb = ByteBuffer.allocate(8);
        for (long l = 0; l < 100; l++)
        {
            bb.putLong(0, l);
            ringBuffer.publishEvent((event, sequence, buffer) -> event.set(buffer.getLong(0)), bb);
        }
    }
}
Prior to version 3.0, the preferred way of publishing messages was via the Event Publisher/Event Translator
interfaces:
Example LongEventProducer
import com.lmax.disruptor.EventTranslatorOneArg;
import com.lmax.disruptor.RingBuffer;
import java.nio.ByteBuffer;
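The body of the producer is omitted above; a minimal reconstruction using the EventTranslatorOneArg interface might look as follows (treat the exact field and method names as illustrative):
public class LongEventProducer
{
    private final RingBuffer<LongEvent> ringBuffer;

    public LongEventProducer(RingBuffer<LongEvent> ringBuffer)
    {
        this.ringBuffer = ringBuffer;
    }

    // The translator copies the incoming long into the pre-allocated event.
    private static final EventTranslatorOneArg<LongEvent, ByteBuffer> TRANSLATOR =
            (event, sequence, bb) -> event.set(bb.getLong(0));

    public void onData(ByteBuffer bb)
    {
        ringBuffer.publishEvent(TRANSLATOR, bb);
    }
}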
This approach uses a number of extra classes (e.g. handler, translator) that are not explicitly required when
using lambdas. The advantage here is that the translator code can be pulled into a separate class and
easily unit tested.
What becomes immediately obvious is that event publication becomes more involved than using a simple
queue. This is due to the desire for Event pre-allocation. It requires (at the lowest level) a 2-phase approach
to message publication, i.e. claim the slot in the ring buffer and then publish the available data.
If we claim a slot in the Ring Buffer (calling RingBuffer#next()) then we must publish this sequence.
Failing to do so can result in corruption of the state of the Disruptor.
Specifically, in the multi-producer case, this will result in the consumers stalling and being unable to
recover without a restart. Therefore, it is recommended that either the lambda or EventTranslator APIs
be used.
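When the lower-level API is used directly, the publish call should sit in a finally block so that the claimed sequence is always published, even if filling the event fails; a minimal sketch of the claim-then-publish pattern described above:
long sequence = ringBuffer.next();  // claim the next slot
try
{
    LongEvent event = ringBuffer.get(sequence);
    event.set(bb.getLong(0));       // copy the data into the pre-allocated event
}
finally
{
    ringBuffer.publish(sequence);   // always publish the claimed sequence
}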
The final step is to wire the whole thing together. Whilst it is possible to wire up each component manually,
this can be complicated and so a DSL is provided to simplify construction.
Some of the more complicated options are not available via the DSL; however, it is suitable for most circumstances.
Example using the legacy LongEventProducer
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.examples.longevent.LongEvent;
import com.lmax.disruptor.examples.longevent.LongEventFactory;
import com.lmax.disruptor.examples.longevent.LongEventHandler;
import com.lmax.disruptor.util.DaemonThreadFactory;
import java.nio.ByteBuffer;
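The wiring itself might look roughly as follows; a sketch built on the LongEventFactory and LongEventHandler defined earlier and the producer sketched above (the buffer size is illustrative):
public class LongEventMain
{
    public static void main(String[] args)
    {
        LongEventFactory factory = new LongEventFactory();
        int bufferSize = 1024;  // must be a power of 2

        Disruptor<LongEvent> disruptor =
                new Disruptor<>(factory, bufferSize, DaemonThreadFactory.INSTANCE);
        disruptor.handleEventsWith(new LongEventHandler());
        disruptor.start();

        RingBuffer<LongEvent> ringBuffer = disruptor.getRingBuffer();
        LongEventProducer producer = new LongEventProducer(ringBuffer);

        ByteBuffer bb = ByteBuffer.allocate(8);
        for (long l = 0; l < 100; l++)
        {
            bb.putLong(0, l);
            producer.onData(bb);
        }
    }
}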
One of the best ways to improve performance in concurrent systems is to adhere to the Single Writer
Principle; this applies to the Disruptor. If you are in the situation where there will only ever be a single thread
producing events into the Disruptor, then you can take advantage of this to gain additional performance.
To give an indication of how much of a performance advantage can be achieved through this technique we
can change the producer type in the OneToOne performance test. Tests run on i7 Sandy Bridge MacBook Air.
Multiple producer:
Run 0, Disruptor=26,553,372 ops/sec
Run 1, Disruptor=28,727,377 ops/sec
Run 2, Disruptor=29,806,259 ops/sec
Run 3, Disruptor=29,717,682 ops/sec
Run 4, Disruptor=28,818,443 ops/sec
Run 5, Disruptor=29,103,608 ops/sec
Run 6, Disruptor=29,239,766 ops/sec
Single producer:
Run 0, Disruptor=89,365,504 ops/sec
Run 1, Disruptor=77,579,519 ops/sec
Run 2, Disruptor=78,678,206 ops/sec
Run 3, Disruptor=80,840,743 ops/sec
Run 4, Disruptor=81,037,277 ops/sec
Run 5, Disruptor=81,168,831 ops/sec
Run 6, Disruptor=81,699,346 ops/sec
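When only one thread will ever publish, the single-producer sequencer can be selected when constructing the Disruptor; a minimal sketch (the wait strategy shown is just one possible choice):
Disruptor<LongEvent> disruptor = new Disruptor<>(
        LongEvent::new,
        bufferSize,
        DaemonThreadFactory.INSTANCE,
        ProducerType.SINGLE,           // promise that only a single thread publishes
        new BlockingWaitStrategy());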
Knowledge of the deployed system can allow for additional performance by choosing a more appropriate
wait strategy:
SleepingWaitStrategy
Like the BlockingWaitStrategy, the SleepingWaitStrategy attempts to be conservative with CPU usage
by using a simple busy-wait loop. The difference is that the SleepingWaitStrategy uses a call
to LockSupport.parkNanos(1) in the middle of the loop. On a typical Linux system this will pause the thread
for around 60µs.
This has the benefits that the producing thread does not need to take any action other than to increment the
appropriate counter, and that it does not require the cost of signalling a condition variable. However, the
mean latency of moving the event between the producer and consumer threads will be higher.
It works best in situations where low latency is not required, but a low impact on the producing thread is
desired. A common use case is for asynchronous logging.
YieldingWaitStrategy
The YieldingWaitStrategy is one of two WaitStrategy implementations that can be used in low-latency systems. It is
designed for cases where there is the option to burn CPU cycles with the goal of improving latency.
The YieldingWaitStrategy will busy spin, waiting for the sequence to increment to the appropriate value.
Inside the body of the loop Thread#yield() will be called allowing other queued threads to run.
This is the recommended wait strategy when you need very high performance, and the number
of EventHandler threads is lower than the total number of logical cores, e.g. you have hyper-threading
enabled.
BusySpinWaitStrategy
The BusySpinWaitStrategy is the highest performing WaitStrategy. Like the YieldingWaitStrategy, it can
be used in low-latency systems, but puts the highest constraints on the deployment environment.
This wait strategy should only be used if the number of EventHandler threads is lower than the number of
physical cores on the box, e.g. hyper-threading should be disabled.
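The chosen wait strategy is supplied when constructing the Disruptor; for example, a latency-sensitive system with spare cores might opt for the YieldingWaitStrategy. A minimal sketch (the other parameters are as in the earlier examples):
Disruptor<LongEvent> disruptor = new Disruptor<>(
        LongEvent::new,
        bufferSize,
        DaemonThreadFactory.INSTANCE,
        ProducerType.SINGLE,
        new YieldingWaitStrategy());   // busy spins, calling Thread.yield() between checks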
Clearing Objects From the Ring Buffer
Because events are pre-allocated and re-used, an object reference stored in an event will otherwise remain
reachable until that slot is overwritten once the ring buffer has wrapped around. To allow such objects to be
garbage collected promptly, the event can be cleared once it has been fully processed.
Example ObjectEvent
class ObjectEvent<T>
{
T val;
void clear()
{
val = null;
}
}
Example ClearingEventHandler
import com.lmax.disruptor.EventHandler;
The clearing handler is then placed at the end of the chain so that it runs once the event has been fully processed:
disruptor
    .handleEventsWith(new ProcessingEventHandler())
    .then(new ClearingEventHandler());
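A minimal sketch of what the ClearingEventHandler itself might look like (reconstructed; the essential step is simply calling clear() on the event once it is no longer needed):
public class ClearingEventHandler<T> implements EventHandler<ObjectEvent<T>>
{
    @Override
    public void onEvent(ObjectEvent<T> event, long sequence, boolean endOfBatch)
    {
        // Failing to clear here would keep the referenced object alive until the
        // ring buffer wraps around and this slot is overwritten.
        event.clear();
    }
}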
Advanced Techniques
Dealing With Large Batches
If a consumer falls behind, the BatchEventProcessor can hand its EventHandler very large batches in one go.
While working through such a batch, the handler can periodically report its progress by setting its sequence
via a callback, releasing the slots it has already processed before the whole batch completes. This is shown
in the "Early Release" example below.
Example of "Early Release"
public class EarlyReleaseHandler implements EventHandler<LongEvent>
{
    private Sequence sequenceCallback;
    private int batchRemaining = 20;

    @Override
    public void setSequenceCallback(final Sequence sequenceCallback)
    {
        this.sequenceCallback = sequenceCallback;
    }

    @Override
    public void onEvent(final LongEvent event, final long sequence, final boolean endOfBatch)
    {
        processEvent(event);
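        // The remainder of the handler is a reconstruction: once a logical chunk of
        // work is complete, report progress by setting the sequence callback, which
        // releases the slots processed so far even though the batch has not ended.
        boolean logicalChunkOfWorkComplete = isLogicalChunkOfWorkComplete();
        if (logicalChunkOfWorkComplete)
        {
            sequenceCallback.set(sequence);
        }
        batchRemaining = logicalChunkOfWorkComplete || endOfBatch ? 20 : batchRemaining;
    }

    private boolean isLogicalChunkOfWorkComplete()
    {
        // Illustrative policy: treat every 20 events as one logical chunk of work.
        return --batchRemaining == 0;
    }

    private void processEvent(final LongEvent event)
    {
        // Actual event processing would go here.
    }
}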
Batch Rewind
The Feature
When using the BatchEventProcessor to handle events as batches, there is a feature called "Batch Rewind"
that can be used to recover from a failure part-way through a batch.
If something goes wrong while handling an event but the failure is recoverable, the user can throw
a RewindableException. Instead of invoking the usual ExceptionHandler, this invokes the BatchRewindStrategy,
which decides whether the sequence number should be rewound to the beginning of the batch so that the
batch can be reattempted, or whether the exception should be rethrown and delegated to the ExceptionHandler.
e.g.
150, 151, 152, 153(failed -> rewind), 150, 151, 152, 153(succeeded this time), 154, 155
batchEventProcessor.setRewindStrategy(batchRewindStrategy);
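As an illustration, a handler can wrap a transient failure in a RewindableException, and the processor can then be configured with one of the built-in strategies; a minimal sketch (assuming RewindableException accepts the underlying cause and that SimpleBatchRewindStrategy, which always rewinds, is available):
EventHandler<LongEvent> handler = (event, sequence, endOfBatch) ->
{
    try
    {
        // Handle the event, e.g. execute a database statement.
    }
    catch (RuntimeException transientFailure)
    {
        // Recoverable: ask the BatchEventProcessor to rewind to the start of the batch.
        throw new RewindableException(transientFailure);
    }
};

batchEventProcessor.setRewindStrategy(new SimpleBatchRewindStrategy());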
Use Case
This can be very useful when batches are handled as database transactions: the start of a batch opens a
transaction, each event is handled as a statement, and the transaction is only committed at the end of the
batch. If a recoverable failure occurs part-way through, the batch can be rewound and reattempted from its start.
1. In low-latency C/C++ systems, heavy memory allocation is also problematic due to the contention that can be placed on the
memory allocator.
Abstract
LMAX was established to create a very high performance financial exchange. As part of our work to
accomplish this goal we have evaluated several approaches to the design of such a system, but as we began
to measure these we ran into some fundamental limits with conventional approaches.
Many applications depend on queues to exchange data between processing stages. Our performance testing
showed that the latency costs, when using queues in this way, were in the same order of magnitude as the
cost of IO operations to disk (RAID or SSD based disk system) – dramatically slow. If there are multiple
queues in an end-to-end operation, this will add hundreds of microseconds to the overall latency. There is
clearly room for optimisation.
Further investigation and a focus on the computer science made us realise that the conflation of concerns
inherent in conventional approaches (e.g. queues and processing nodes) leads to contention in multi-
threaded implementations, suggesting that there may be a better approach.
Thinking about how modern CPUs work, something we like to call “mechanical sympathy”, using good
design practices with a strong focus on teasing apart the concerns, we came up with a data structure and a
pattern of use that we have called the Disruptor.
Testing has shown that the mean latency using the Disruptor for a three-stage pipeline is 3 orders of
magnitude lower than an equivalent queue-based approach. In addition, the Disruptor handles
approximately 8 times more throughput for the same configuration.
These performance improvements represent a step change in the thinking around concurrent programming.
This new pattern is an ideal foundation for any asynchronous event processing architecture where high
throughput and low latency are required.
At LMAX we have built an order matching engine, real-time risk management, and a highly available in-
memory transaction processing system all on this pattern to great success. Each of these systems has set
new performance standards that, as far as we can tell, are unsurpassed.
However this is not a specialist solution that is only of relevance in the Finance industry. The Disruptor is a
general-purpose mechanism that solves a complex problem in concurrent programming in a way that
maximizes performance, and that is simple to implement. Although some of the concepts may seem unusual
it has been our experience that systems built to this pattern are significantly simpler to implement than
comparable mechanisms.
The Disruptor has significantly less write contention, a lower concurrency overhead and is more cache
friendly than comparable approaches, all of which results in greater throughput with less jitter at lower
latency. On processors at moderate clock rates we have seen over 25 million messages per second and
latencies lower than 50 nanoseconds. This performance is a significant improvement compared to any other
implementation that we have seen. This is very close to the theoretical limit of a modern processor to
exchange data between cores.
1. Overview
The Disruptor is the result of our efforts to build the world’s highest performance financial exchange at
LMAX. Early designs focused on architectures derived from SEDA [1] and Actors [2] using pipelines for
throughput. After profiling various implementations it became evident that the queuing of events between
stages in the pipeline was dominating the costs. We found that queues also introduced latency and high
levels of jitter. We expended significant effort on developing new queue implementations with better
performance. However it became evident that queues as a fundamental data structure are limited due to
the conflation of design concerns for the producers, consumers, and their data storage. The Disruptor is the
result of our work to build a concurrent structure that cleanly separates these concerns.
Concurrent execution of code is about two things, mutual exclusion and visibility of change. Mutual
exclusion is about managing contended updates to some resource. Visibility of change is about controlling
when such changes are made visible to other threads. It is possible to avoid the need for mutual exclusion if
you can eliminate the need for contended updates. If your algorithm can guarantee that any given resource
is modified by only one thread, then mutual exclusion is unnecessary. Read and write operations require
that all changes are made visible to other threads. However only contended write operations require the
mutual exclusion of the changes.
The most costly operation in any concurrent environment is a contended write access. To have multiple
threads write to the same resource requires complex and expensive coordination. Typically this is achieved
by employing a locking strategy of some kind.
We will illustrate the cost of locks with a simple demonstration. The focus of this experiment is to call a
function which increments a 64-bit counter in a loop 500 million times. This can be executed by a single
thread on a 2.4Ghz Intel Westmere EP in just 300ms if written in Java. The language is unimportant to this
experiment and results will be similar across all languages with the same basic primitives.
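As a sketch of the experiment described above (illustrative only; absolute timings depend on the hardware):
// Increment a 64-bit counter 500 million times on a single thread, no lock.
long counter = 0;
for (long i = 0; i < 500_000_000L; i++)
{
    counter++;
}

// The same work under mutual exclusion: even while uncontended the lock adds
// significant cost, and contention from a second thread adds far more.
final Object lock = new Object();
long lockedCounter = 0;
for (long i = 0; i < 500_000_000L; i++)
{
    synchronized (lock)
    {
        lockedCounter++;
    }
}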
Once a lock is introduced to provide mutual exclusion, even when the lock is as yet un-contended, the cost
goes up significantly. The cost increases again, by orders of magnitude, when two or more threads begin to
contend. The results of this simple experiment are shown in the table below:
If the critical section of the program is more complex than a simple increment of a counter it may take a
complex state machine using multiple CAS operations to orchestrate the contention. Developing concurrent
programs using locks is difficult; developing lock-free algorithms using CAS operations and memory barriers
is many times more complex and it is very difficult to prove that they are correct.
The ideal algorithm would be one with only a single thread owning all writes to a single resource with other
threads reading the results. To read the results in a multi-processor environment requires memory barriers
to make the changes visible to threads running on other processors.
Modern CPUs are now much faster than the current generation of memory systems. To bridge this divide
CPUs use complex cache systems which are effectively fast hardware hash tables without chaining. These
caches are kept coherent with other processor cache systems via message passing protocols. In addition,
processors have “store buffers” to offload writes to these caches, and “invalidate queues” so that the cache
coherency protocols can acknowledge invalidation messages quickly for efficiency when a write is about to
happen.
What this means for data is that the latest version of any value could, at any stage after being written, be in
a register, a store buffer, one of many layers of cache, or in main memory. If threads are to share this value,
it needs to be made visible in an ordered fashion and this is achieved through the coordinated exchange of
cache coherency messages. The timely generation of these messages can be controlled by memory barriers.
A read memory barrier orders load instructions on the CPU that executes it by marking a point in the
invalidate queue for changes coming into its cache. This gives it a consistent view of the world for write
operations ordered before the read barrier.
A write barrier orders store instructions on the CPU that executes it by marking a point in the store buffer,
thus flushing writes out via its cache. This barrier gives an ordered view to the world of what store
operations happen before the write barrier.
A full memory barrier orders both loads and stores but only on the CPU that executes it.
Some CPUs have more variants in addition to these three primitives but these three are sufficient to
understand the complexities of what is involved. In the Java memory model the read and write of a volatile
field implements the read and write barriers respectively. This was made explicit in the Java Memory
Model [3] as defined with the release of Java 5.
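For example, in Java a value can be published to another thread by writing a volatile field after the data it guards; a minimal sketch:
class Publication
{
    private long data;              // plain field
    private volatile boolean ready; // volatile write acts as the write barrier

    void publish(long value)
    {
        data = value;   // ordinary store
        ready = true;   // write barrier: makes the preceding store visible
    }

    long read()
    {
        while (!ready)  // volatile read acts as the read barrier
        {
            Thread.onSpinWait();
        }
        return data;    // guaranteed to see the value written before ready = true
    }
}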
Our hardware does not move memory around in bytes or words. For efficiency, caches are organised into
cache-lines that are typically 32-256 bytes in size, the most common cache-line being 64 bytes. This is the
level of granularity at which cache coherency protocols operate. This means that if two variables are in the
same cache line, and they are written to by different threads, then they present the same problems of write
contention as if they were a single variable. This is a concept known as “false sharing”. For high performance
then, it is important to ensure that independent, but concurrently written, variables do not share the same
cache-line if contention is to be minimised.
When accessing memory in a predictable manner CPUs are able to hide the latency cost of accessing main
memory by predicting which memory is likely to be accessed next and pre-fetching it into the cache in the
background. This only works if the processors can detect a pattern of access such as walking memory with a
predictable “stride”. When iterating over the contents of an array the stride is predictable and so memory
will be pre-fetched in cache lines, maximizing the efficiency of the access. Strides typically have to be less
than 2048 bytes in either direction to be noticed by the processor. However, data structures like linked lists
and trees tend to have nodes that are more widely distributed in memory with no predictable stride of
access. The lack of a consistent pattern in memory constrains the ability of the system to pre-fetch cache-
lines, resulting in main memory accesses which can be more than 2 orders of magnitude less efficient.
Queue implementations tend to have write contention on the head, tail, and size variables. When in use,
queues are typically always close to full or close to empty due to the differences in pace between consumers
and producers. They very rarely operate in a balanced middle ground where the rate of production and
consumption is evenly matched. This propensity to be always full or always empty results in high levels of
contention and/or expensive cache coherence. The problem is that even when the head and tail
mechanisms are separated using different concurrent objects such as locks or CAS variables, they generally
occupy the same cache-line.
The concerns of managing producers claiming the head of a queue, consumers claiming the tail, and the
storage of nodes in between make the designs of concurrent implementations very complex to manage
beyond using a single large-grain lock on the queue. Large grain locks on the whole queue for put and take
operations are simple to implement but represent a significant bottleneck to throughput. If the concurrent
concerns are teased apart within the semantics of a queue then the implementations become very complex
for anything other than a single producer – single consumer implementation.
In Java there is a further problem with the use of queues, as they are significant sources of garbage. Firstly,
objects have to be allocated and placed in the queue. Secondly, if linked-list backed, objects have to be
allocated representing the nodes of the list. When no longer referenced, all these objects allocated to
support the queue implementation need to be re-claimed.
This approach is not cheap - at each stage we have to incur the cost of en-queuing and de-queuing units of
work. The number of targets multiplies this cost when the path must fork, and incurs an inevitable cost of
contention when it must re-join after such a fork.
It would be ideal if the graph of dependencies could be expressed without incurring the cost of putting the
queues between stages.
The LMAX disruptor is designed to address all of the issues outlined above in an attempt to maximize the
efficiency of memory allocation, and operate in a cache-friendly manner so that it will perform optimally on
modern hardware.
At the heart of the disruptor mechanism sits a pre-allocated bounded data structure in the form of a ring-
buffer. Data is added to the ring buffer through one or more producers and processed by one or more
consumers.
Garbage collection can be problematic when developing low-latency systems in a managed runtime
environment like Java. The more memory that is allocated the greater the burden this puts on the garbage
collector. Garbage collectors work at their best when objects are either very short-lived or effectively
immortal. The pre-allocation of entries in the ring buffer means that it is immortal as far as the garbage
collector is concerned and so represents little burden.
Under heavy load queue-based systems can back up, which can lead to a reduction in the rate of processing,
and results in the allocated objects surviving longer than they should, thus being promoted beyond the
young generation with generational garbage collectors. This has two implications: first, the objects have to
be copied between generations, which causes latency jitter; second, these objects have to be collected from
the old generation which is typically a much more expensive operation and increases the likelihood of “stop
the world” pauses that result when the fragmented memory space requires compaction. In large memory
heaps this can cause pauses of seconds per GB in duration.
When designing a financial exchange in a language that uses garbage collection, too much memory
allocation can be problematic. So, as we have described, linked-list backed queues are not a good
approach. Garbage collection is minimized if the entire storage for the exchange of data between processing
stages can be pre-allocated. Further, if this allocation can be performed in a uniform chunk, then traversal of
that data will be done in a manner that is very friendly to the caching strategies employed by modern
processors. A data-structure that meets this requirement is an array with all the slots pre-filled. On creation
of the ring buffer the Disruptor utilises the abstract factory pattern to pre-allocate the entries. When an
entry is claimed, a producer can copy its data into the pre-allocated structure.
On most processors there is a very high cost for the remainder calculation on the sequence number, which
determines the slot in the ring. This cost can be greatly reduced by making the ring size a power of 2. A bit
mask of size minus one can be used to perform the remainder operation efficiently.
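For example, the slot index can then be computed with a bit mask rather than the much more expensive remainder operator:
int bufferSize = 1024;                      // must be a power of 2
int indexMask = bufferSize - 1;

long sequence = 123_456_789L;
int index = (int) (sequence & indexMask);   // equivalent to sequence % bufferSize, but far cheaper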
As we described earlier bounded queues suffer from contention at the head and tail of the queue. The ring
buffer data structure is free from this contention and concurrency primitives because these concerns have
been teased out into producer and consumer barriers through which the ring buffer must be accessed. The
logic for these barriers is described below.
In most common usages of the Disruptor there is usually only one producer. Typical producers are file
readers or network listeners. In cases where there is a single producer there is no contention on
sequence/entry allocation. In more unusual usages where there are multiple producers, producers will race
one another to claim the next entry in the ring-buffer. Contention on claiming the next available entry can
be managed with a simple CAS operation on the sequence number for that slot.
Once a producer has copied the relevant data to the claimed entry it can make it public to consumers by
committing the sequence. This can be done without CAS by a simple busy spin until the other producers
have reached this sequence in their own commit. Then this producer can advance the cursor signifying the
next available entry for consumption. Producers can avoid wrapping the ring by tracking the sequence of
consumers as a simple read operation before they write to the ring buffer.
Consumers wait for a sequence to become available in the ring buffer before they read the entry. Various
strategies can be employed while waiting. If CPU resource is precious they can wait on a condition variable
within a lock that gets signalled by the producers. This obviously is a point of contention and only to be used
when CPU resource is more important than latency or throughput. The consumers can also loop checking the
cursor which represents the currently available sequence in the ring buffer. This could be done with or
without a thread yield by trading CPU resource against latency. This scales very well as we have broken the
contended dependency between the producers and consumers if we do not use a lock and condition
variable. Lock free multi-producer – multi-consumer queues do exist but they require multiple CAS
operations on the head, tail, size counters. The Disruptor does not suffer this CAS contention.
3.3. Sequencing
Sequencing is the core concept to how the concurrency is managed in the Disruptor. Each producer and
consumer works off a strict sequencing concept for how it interacts with the ring buffer. Producers claim the
next slot in sequence when claiming an entry in the ring. This sequence of the next available slot can be a
simple counter in the case of only one producer or an atomic counter updated using CAS operations in the
case of multiple producers. Once a sequence value is claimed, this entry in the ring buffer is now available to
be written to by the claiming producer. When the producer has finished updating the entry it can commit the
changes by updating a separate counter which represents the cursor on the ring buffer for the latest entry
available to consumers. The ring buffer cursor can be read and written in a busy spin by the producers using
memory barriers, without requiring a CAS operation, as sketched below.
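A minimal sketch of that commit step (illustrative, using an AtomicLong for the cursor; the real implementation uses its own padded sequence type):
// Wait until all earlier claimed sequences have been committed by other producers,
// then advance the cursor with an ordered write. No CAS is required.
void commit(AtomicLong cursor, long claimedSequence)
{
    while (cursor.get() != claimedSequence - 1)
    {
        Thread.onSpinWait();             // busy spin until the preceding entry is committed
    }
    cursor.lazySet(claimedSequence);     // ordered store; entry becomes visible to consumers
}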
Consumers wait for a given sequence to become available by using a memory barrier to read the cursor.
Once the cursor has been updated the memory barriers ensure the changes to the entries in the ring buffer
are visible to the consumers who have waited on the cursor advancing.
Consumers each contain their own sequence which they update as they process entries from the ring buffer.
These consumer sequences allow the producers to track consumers to prevent the ring from wrapping.
Consumer sequences also allow consumers to coordinate work on the same entry in an ordered manner.
In the case of having only one producer, and regardless of the complexity of the consumer graph, no locks
or CAS operations are required. The whole concurrency coordination can be achieved with just memory
barriers on the discussed sequences.
Because the producer and consumer concerns are separated with the Disruptor pattern, it is possible to
represent a complex graph of dependencies between consumers while only using a single ring buffer at the
core. This results in greatly reduced fixed costs of execution thus increasing throughput while reducing
latency.
A single ring buffer can be used to store entries with a complex structure representing the whole workflow in
a cohesive place. Care must be taken in the design of such a structure so that the state written by
independent consumers does not result in false sharing of cache lines.
Separating the concerns normally conflated in queue implementations allows for a more flexible design.
A RingBuffer exists at the core of the Disruptor pattern providing storage for data exchange without
contention. The concurrency concerns are separated out for the producers and consumers interacting with
the RingBuffer. The ProducerBarrier manages any concurrency concerns associated with claiming slots in
the ring buffer, while tracking dependant consumers to prevent the ring from wrapping.
The ConsumerBarrier notifies consumers when new entries are available, and Consumers can be
constructed into a graph of dependencies representing multiple stages in a processing pipeline.
3.7. Code Example
The code below is an example of a single producer and single consumer using the convenience
interface BatchHandler for implementing a consumer. The consumer runs on a separate thread receiving
entries as they become available.
RingBuffer<ValueEntry> ringBuffer =
new RingBuffer<ValueEntry>(ValueEntry.ENTRY_FACTORY, SIZE,
ClaimStrategy.Option.SINGLE_THREADED,
WaitStrategy.Option.YIELDING);
ConsumerBarrier<ValueEntry> consumerBarrier = ringBuffer.createConsumerBarrier();
BatchConsumer<ValueEntry> batchConsumer =
new BatchConsumer<ValueEntry>(consumerBarrier, batchHandler);
ProducerBarrier<ValueEntry> producerBarrier = ringBuffer.createProducerBarrier(batchConsumer);
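The remainder of the paper's example runs the consumer on its own thread and has the producer claim, fill and commit entries (batchHandler above is an implementation of the BatchHandler callback interface, not shown here). The following is a reconstruction using the legacy, pre-3.0 API names described in the text, so treat the exact signatures as indicative rather than current:
// Each consumer runs on its own thread.
EXECUTOR.submit(batchConsumer);

// Producers claim entries in sequence, copy data in, then commit to make
// the entry visible to consumers.
ValueEntry entry = producerBarrier.nextEntry();
// ... copy data into the entry container ...
producerBarrier.commit(entry);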
Running the tests requires a system capable of executing at least 4 threads in parallel.
Figure 1. Unicast: 1P – 1C
Figure 2. Three Step Pipeline: 1P – 3C
Figure 3. Sequencer: 3P – 1C
Figure 4. Multicast: 1P – 3C
Figure 5. Diamond: 1P – 3C
For the above configurations an ArrayBlockingQueue was used for each arc of data flow, compared to the
barrier configuration with the Disruptor. The following table shows the performance results in operations per
second using a Java 1.6.0_25 64-bit Sun JVM, Windows 7, Intel Core i7 860 @ 2.8 GHz without HT and Intel
Core i7-2720QM, Ubuntu 11.04, and taking the best of 3 runs when processing 500 million messages.
Results can vary substantially across different JVM executions and the figures below are not the highest we
have observed.
(Table: throughput in ops/sec on Nehalem 2.8GHz – Windows 7 SP1 64-bit and Sandy Bridge 2.2GHz – Linux 2.6.38 64-bit)
6. Conclusion
The Disruptor is a major step forward for increasing throughput, reducing latency between concurrent
execution contexts and ensuring predictable latency, an important consideration in many applications. Our
testing shows that it out-performs comparable approaches for exchanging data between threads. We believe
that this is the highest performance mechanism for such data exchange. By concentrating on a clean
separation of the concerns involved in cross-thread data exchange, by eliminating write contention,
minimizing read contention and ensuring that the code worked well with the caching employed by modern
processors, we have created a highly efficient mechanism for exchanging data between threads in any
application.
The batching effect that allows consumers to process entries up to a given threshold, without any
contention, introduces a new characteristic in high performance systems. For most systems, as load and
contention increase there is an exponential increase in latency, the characteristic “J” curve. As load
increases on the Disruptor, latency remains almost flat until the memory sub-system saturates.
We believe that the Disruptor establishes a new benchmark for high-performance computing and is very well
placed to continue to take advantage of current trends in processor and computer design.
4. Phasers - https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/Phaser.html
7. ArrayBlockingQueue - https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/ArrayBlockingQueue.html