11 Low Latency Programming
Low-latency programming
Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB
2022.06.16
The microsecond
Colocation (i)
From Latency equalisation: The need for fair and non-discriminatory colocation services1
(2018.08.17):
The significant upgrades to trading infrastructure, improved transparency and general best
practices necessitated by MiFID II have turned a spotlight on low-latency market access and
are driving a renewed demand for functional and scalable colocation provision.
It’s no longer enough to be optimally connected for maximum trading advantage—under Mi-
FID II, it also has to be equitable as per the regulation’s ‘latency equilisation’ provisions.
Under Articles 48(8) and (9) of Directive 2014/65/EU in MiFID II, trading venues are required
to provide “transparent, fair and non-discriminatory” colocation services that “do not create
incentives for disorderly trading conditions or market abuse.” The regulation extended the
provisions to cover multilateral and organised trading facilities, and the common requirements
of the regulation apply to all types of colocation services as well as to trading venues that
organise their own data centres or that use data centres owned or managed by third parties.
Trading venues do have a modicum of control—they are able to determine their own com-
mercial colocation policy, provided that it is based on “objective, transparent and non-
discriminatory criteria to the different types of users of the venue within the limits of the space,
power, cooling or similar facilities available.”
1 https://www.interxion.com/uk/blogs/2018/082/
latency-equalisation-the-need-for-fair-and-non-discriminatory-colocation-services
Colocation (ii)
In other words, trading venues cannot discriminate between users—they must provide all sub-
scribers with access to the same services on the same platform under the same conditions—
from space, power, cooling and cable length to data access, market connectivity, technology,
technical support and messaging types. In addition, trading venues have to monitor all con-
nections and latency measurements within the colocation services they offer, to ensure the
non-discriminatory treatment of all users with each type of service offered.
Fee structure plays an important part in this. Trading venues must be sure that users are
able to subscribe to only certain colocation services, and are not required to purchase bun-
dled services. They must also use objective criteria when determining rebates, incentives
and disincentives. Fee structures that contribute to disorderly trading conditions by encouraging
intensive trading, and that may place stress on market infrastructures, are therefore
prohibited—although volume discounts are allowed under certain circumstances.
Transparency is a significant component of this requirement too. Trading venues must pub-
lish the details of their arrangements—including information on the colocation services they
offer, the price of the services, the conditions for accessing the services, the different types
of latency access that are available, procedures for the allocation of colocation space and the
requirements that third-party providers of colocation services must meet.
Colocation (iii)
In the light of these new requirements for electronic execution venues, and with the new Sys-
tematic Internaliser regime arriving on September 1, 2018, choosing the right hosting venue
has become increasingly important. Financial services firms must be sure they are working
with a data centre operator that understands the specific requirements of MiFID II and is able
to partner with the firm towards complete compliance.
Interxion is leading the way with the launch of LON3 in October - the latest addition to its Lon-
don data centre campus, situated right between the Square Mile, the centre of global finance,
and Tech City, the 3rd largest technology start-up cluster in the world. Its geographical prox-
imity to the London Stock Exchange, equidistant between the key hosting centres of Slough
(BATS, EBS) and Basildon (ICE Futures, Euronext), and access to microwave connectivity to
Frankfurt (Eurex) make it unequaled in terms of European trading venue access. This central
location yields major speed advantages for multi-venue trading strategies and enables optimal
order book aggregation / consolidation for best execution under MiFID II.
Kernel bypass
I Vanilla Linux can only process about 1M packets per second (pps).
I Modern 10 Gbps NICs can usually process at least 10M pps.
I The only way to squeeze more packets from the hardware is by working around the
Linux kernel networking stack. This is called a kernel bypass.
First experiment
I The fastest way to drop packets in Linux, without hacking the kernel sources, is by
placing a DROP rule in the PREROUTING iptables chain.
I By manipulating an indirection table on a NIC with ethtool -X, we direct all the
packets to RX queue #0.
I ethtool statistics show that the network card receives a line rate of 12M pps.
I With this setup, the kernel is able to process about 1.4M pps on that queue with a single CPU.
Branch prediction
I Branch prediction is a technique used in CPU design that attempts to guess the
outcome of a conditional operation and prepare for the most likely result.
I A digital circuit that performs this operation is known as a branch predictor. It is an
important component of modern CPU architectures, such as the x86.
I Two-way branching is usually implemented with a conditional jump instruction. A
conditional jump can either be “not taken” and continue execution with the first branch
of code which follows immediately after the conditional jump, or it can be “taken” and
jump to a different place in program memory where the second branch of code is
stored.
I It is not known for certain whether a conditional jump will be taken or not taken until the
condition has been calculated and the conditional jump has passed the execution
stage in the instruction pipeline.
Instruction pipeline
Figure: Example of 4-stage pipeline. The coloured boxes represent instructions independent of each
other.
I Without branch prediction, the processor would have to wait until the conditional jump
instruction has passed the execute stage before the next instruction can enter the
fetch stage in the pipeline.
I The branch predictor attempts to avoid this waste of time by trying to guess whether
the conditional jump is most likely to be taken or not taken.
I The branch that is guessed to be the most likely is then fetched and speculatively
executed.
I If it is later detected that the guess was wrong, then the speculatively executed or
partially executed instructions are discarded and the pipeline starts over with the
correct branch incurring a delay.
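Where the programmer knows which way a branch almost always goes, the code can be laid out so that the common case falls through. A minimal sketch, assuming C++20 (the function name and branch are made up for illustration):

// C++20 [[likely]] / [[unlikely]] attributes hint which path is the common one,
// so the compiler can keep the hot path straight-line and fall-through.
int process_order(int quantity) {
    if (quantity <= 0) [[unlikely]] {
        return -1;           // rare error path
    }
    return quantity * 2;     // common case
}

On older compilers, GCC/Clang's __builtin_expect serves the same purpose.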
Branch misprediction
I The time that is wasted in case of a branch misprediction is equal to the number of
stages in the pipeline from the fetch stage to the execute stage.
I Modern microprocessors tend to have quite long pipelines so that the misprediction
delay is between 10 and 20 clock cycles.
I As a result, making a pipeline longer increases the need for a more advanced branch
predictor.
I The first time a conditional jump instruction is encountered, there is not much
information to base a prediction on.
I But the branch predictor keeps records of whether branches are taken or not taken.
I When it encounters a conditional jump that has been seen several times before, then it
can base the prediction on the history.
I The branch predictor may, for example, recognize that the conditional jump is taken
more often than not, or that it is taken every second time.
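A rough illustration of how much this matters (a sketch, not from the slides; the 128 threshold and the data generator are arbitrary): the same loop runs markedly faster on sorted data, because the branch becomes predictable.

#include <algorithm>
#include <cstdlib>
#include <vector>

// The branch `x >= 128` is essentially a coin flip on random data (frequent
// mispredictions), but perfectly predictable once the data has been sorted.
long long sum_above_threshold(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v) {
        if (x >= 128) sum += x;
    }
    return sum;
}

std::vector<int> make_data(std::size_t n, bool sorted) {
    std::vector<int> v(n);
    for (auto& x : v) x = std::rand() % 256;
    if (sorted) std::sort(v.begin(), v.end());   // sorting makes the branch predictable
    return v;
}

Timing sum_above_threshold() over make_data(1000000, false) versus make_data(1000000, true) with the benchmarking tools discussed later often shows a large difference.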
Static dispatch
class A {
public:
    void foo();
};
I Here the compiler will create a routine for foo() and remember its address.
I This routine will be executed every time the compiler finds a call to foo() on an
instance of A.
I Keep in mind that only one routine exists per class method, and is shared by all
instances of the class.
I This process is known as static dispatch or early binding: the compiler knows which
routine to execute during compilation.
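A minimal usage sketch of the static dispatch described above:

A a;
a.foo();   // the call target A::foo() is fixed at compile time (early binding)

A* p = &a;
p->foo();  // still static dispatch: foo() is not virtual, so no runtime lookup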
The curiously recurring template pattern (CRTP)
Virtual functions
I Virtual functions can be overridden in derived classes, as in the following example:
class B {
public:
    virtual void bar();
    virtual void qux();
};

class C : public B {
public:
    void bar() override;
};
Dynamic dispatch
B* b = new C();
b->bar();
I If we used static dispatch as above, the call b->bar() would execute B::bar(), since
(from the point of view of the compiler) b points to an object of type B.
I This would be wrong, because b actually points to an object of type C and C::bar()
should be called instead.
I Given that virtual functions can be redefined in subclasses, calls via pointers (or
references) to a base type cannot be dispatched at compile time.
I The compiler has to find the right function definition (i.e. the most specific one) at
runtime.
I This process is called dynamic dispatch or late method binding.
Implementation (i)
I For every class that contains virtual functions, the compiler constructs a virtual table
(vtable).
I The vtable contains an entry for each virtual function accessible by the class and
stores a pointer to its definition.
I Only the most specific function definition callable by the class is stored in the vtable.
I Entries in the vtable can point to either functions declared in the class itself (e.g.
C::bar()), or virtual functions inherited from a base class (e.g. C::qux()).
I In our example, the compiler will create the following virtual tables.
Implementation (ii)
Implementation (iii)
I The vtable of class B has two entries, one for each of the two virtual functions declared
in B’s scope: bar() and qux().
I Additionally, the vtable of B points to the local definitions of these functions, since they
are the most specific (and only) ones from B’s point of view.
I More interesting is C’s vtable. In this case, the entry for bar() points to C’s own
implementation, given that it is more specific than B::bar().
I Since C doesn’t override qux(), its entry in the vtable points to B’s definition (the most
specific definition).
I Note that vtables exist at the class level, meaning there is a single vtable per class,
and it is shared by all instances.
Vpointers (i)
I When the compiler sees b->bar() in the example above, it will look up B’s vtable for
bar’s entry and follow the corresponding function pointer, right?
I We would still be calling B::bar() and not C::bar()...
I We haven’t told you the second part of the story: vpointers.
I Every time the compiler creates a vtable for a class, it also adds a hidden member to
that class: a pointer to the corresponding virtual table, called the vpointer.
I The vpointer is just another class member added by the compiler, and it increases the
size of every object that has a vtable by sizeof(vpointer).
I When a call to a virtual function on an object is performed, the vpointer of the object is
used to find the corresponding vtable of the class.
I Next, the function’s index in the vtable is used to find the correct (most
specific) routine to be executed.
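The size overhead mentioned above is easy to observe; a small sketch (the struct names are made up):

#include <cstdio>

struct PlainOrder   { int quantity; };                           // no vtable, no vpointer
struct VirtualOrder { int quantity; virtual void place() {} };   // compiler adds a vpointer

int main() {
    // On a typical 64-bit platform: 4 bytes versus 16 bytes (8-byte vpointer plus padding).
    std::printf("sizeof(PlainOrder)   = %zu\n", sizeof(PlainOrder));
    std::printf("sizeof(VirtualOrder) = %zu\n", sizeof(VirtualOrder));
}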
Vpointers (ii)
Virtual destructors
An example (i)
class Base
{
public:
    ~Base()
    {
        std::cout << "Destroying base" << std::endl;
    }
};
An example (ii)
class Derived : public Base
{
public:
    Derived(int value) : some_resource_(new int(value)) {}

    ~Derived()
    {
        std::cout << "Destroying derived" << std::endl;
        delete some_resource_;
    }

private:
    int* some_resource_;
};
An example (iii)
int main()
{
    Base* p = new Derived(5);
    delete p;
}
I Because Base’s destructor is not virtual, delete p only runs ~Base(): ~Derived() is never
called and some_resource_ is leaked (formally, deleting a derived object through a base
pointer without a virtual destructor is undefined behaviour). The fix is sketched below.
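A minimal sketch of the fix (not shown in the extracted slide): declare the base class destructor virtual, so that deleting through a Base* dispatches to the most derived destructor first.

class Base
{
public:
    virtual ~Base()   // virtual: delete through a Base* now calls ~Derived() first
    {
        std::cout << "Destroying base" << std::endl;
    }
};

With this change, delete p prints "Destroying derived" followed by "Destroying base", and some_resource_ is released.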
Summary
Classic polymorphism
class order {
    virtual void place_order() { /* generic implementation */ }
};
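The extracted slides do not show the CRTP counterpart; a minimal sketch, assuming an interface that mirrors the snippet above (the derived class name and its implementation are made up):

#include <iostream>

// CRTP: the base class is a template parameterised on the derived class, so the
// call is resolved at compile time -- no vtable, no vpointer, and it can be inlined.
template <typename Derived>
class order {
public:
    void place_order() {
        static_cast<Derived*>(this)->place_order_impl();
    }
};

class limit_order : public order<limit_order> {
public:
    void place_order_impl() { std::cout << "placing a limit order\n"; }
};

template <typename T>
void place(order<T>& o) { o.place_order(); }   // static dispatch through the template

int main() {
    limit_order o;
    place(o);   // resolves to limit_order::place_order_impl() at compile time
}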
Microbenchmarking (i)
void test_for_loop() {
    const int iter_count = 1000000000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iter_count; ++i) {}
    auto elapsed = std::chrono::steady_clock::now() - start;
    auto elapsed_per_iteration =
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed) / iter_count;
    std::cout << "Time elapsed for each iteration: "
              << elapsed_per_iteration.count() << " ns\n";
}
I Compilers can see that the loop does nothing and will not generate any code for it.
The values of elapsed and elapsed_per_iteration end up useless.
Benchmarking
Microbenchmarking (ii)
void test_for_loop() {
    const int iter_count = 1000000000;
    int sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iter_count; ++i) {
        ++sum;
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    auto elapsed_per_iteration =
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed) / iter_count;
    std::cout << "Time elapsed for each iteration: "
              << elapsed_per_iteration.count() << " ns\n";
}
I The compiler may see that the variable sum isn’t used and optimize it away along with
the loop.
Microbenchmarking (iii)
void test_for_loop() {
    const int iter_count = 1000000000;
    int sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iter_count; ++i) {
        ++sum;
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    auto elapsed_per_iteration =
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed) / iter_count;
    std::cout << "Time elapsed for each iteration: "
              << elapsed_per_iteration.count() << " ns\n";
    std::cout << "Sum: " << sum << "\n"; // Added
}
I The compiler may see that the variable sum will always have a constant value and
optimize all that away as well.
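One common workaround (a sketch, not from the slides) is to make the result observable to the optimizer, for example with an empty inline-assembly "escape" on GCC/Clang, in the style popularised by Chandler Carruth's CppCon 2015 talk listed below, or with benchmark::DoNotOptimize() when using Google Benchmark:

// GCC/Clang only: the empty asm block claims it may read the pointed-to memory,
// so the compiler cannot delete the computation that produced it.
static void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}

// In test_for_loop(), after the loop:
//     escape(&sum);   // keeps `sum` (and therefore the loop) alive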
More caveats
I When we are testing I/O, the OS may have preloaded or cached files in memory, which
skews the measurements.
I Locality of reference (e.g. arrays versus linked lists).
I Effects of caches.
I Effects of memory bandwidth.
I Compiler inlining.
I Compiler implementation.
I Compiler switches.
I Number of processor cores.
I Optimizations at the processor level.
I Operating system schedulers.
I Operating system background processes.
Profiling
Google Benchmark
#ifdef _WIN32
#pragma comment(lib, "Shlwapi.lib")
#endif

std::vector<int> rng(int begin, int end, int count) {
    std::vector<int> v;
    for (int i = 0; i < count; ++i) {
        v.push_back((std::rand() % end) + begin);
    }
    return v;
}
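The snippet above only sets up a helper; a minimal sketch of how it might be driven from Google Benchmark (the benchmark name and arguments are assumptions):

#include <benchmark/benchmark.h>
#include <vector>

static void BM_rng(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v = rng(1, 100, static_cast<int>(state.range(0)));
        benchmark::DoNotOptimize(v.data());   // stop the work from being optimized away
    }
}
BENCHMARK(BM_rng)->Arg(1000)->Arg(100000);    // run for two different element counts

BENCHMARK_MAIN();

Built against the benchmark library, this reports wall-clock and CPU time per iteration for each argument.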
Source: [PZ12]
Cache warming
No cache warming
Cache warming
Consequences
Thread
std::thread
void foo() {
    std::cout << "foo starting" << std::endl;
    for (int i = 0; i < 10; ++i) {
        std::cout << i << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    std::cout << "foo finished" << std::endl;
}

void bar(int x) {
    std::cout << "bar starting" << std::endl;
    for (int i = 0; i < x; i += 10) {
        std::cout << i << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
    }
    std::cout << "bar finished" << std::endl;
}
Multithreading
int main() {
    std::thread first(foo);        // Spawn a new thread that calls foo()
    std::thread second(bar, 100);  // Spawn a new thread that calls bar(100)
    std::cout << "main, foo, and bar are now executing concurrently...\n";
    // Synchronize threads:
    first.join();   // Pauses until first finishes
    second.join();  // Pauses until second finishes
    return EXIT_SUCCESS;
}
std::future
Example: std::future
#include <cstdlib>
#include <future>
#include <iostream>
#include <thread>

int main() {
    // std::future from a packaged task.
    std::packaged_task<int()> task([] { return 7; }); // Wrap the function.
    std::future<int> f1 = task.get_future();          // Get a future.
    std::thread t(std::move(task));

    // std::future from an async().
    std::future<int> f2 = std::async(std::launch::async, [] { return 8; });

    // std::future from a std::promise.
    std::promise<int> p;
    std::future<int> f3 = p.get_future();
    std::thread([&p] { p.set_value_at_thread_exit(9); }).detach();

    std::cout << "Results: " << f1.get() << ' ' << f2.get() << ' ' << f3.get() << '\n';
    t.join();
    return EXIT_SUCCESS;
}
std::promise
Example: std::promise
#include <chrono>
#include <future>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

void accumulate(
        std::vector<int>::iterator first,
        std::vector<int>::iterator last,
        std::promise<int> accumulate_promise) {
    int sum = std::accumulate(first, last, 0);
    accumulate_promise.set_value(sum);
}

int main() {
    // Demonstrating using promise<int> to transmit a result between threads.
    std::vector<int> numbers = { 1, 2, 3, 4, 5, 6 };
    std::promise<int> accumulate_promise;
    std::future<int> accumulate_future = accumulate_promise.get_future();
    std::thread work_thread(accumulate, numbers.begin(), numbers.end(),
                            std::move(accumulate_promise));
    // future::get() will wait until the future has a valid result and retrieves it.
    // Calling accumulate_future.wait() before accumulate_future.get() is not needed.
    std::cout << "result = " << accumulate_future.get() << '\n';
    work_thread.join(); // Wait for thread completion.
}
Mutex
I A mutex is a lockable object that is designed to signal when critical sections of the
code need exclusive access, preventing other threads from executing concurrently.
I std::mutex objects provide exclusive ownership and do not support recursive locking
(i.e. a thread must not lock a mutex it already owns; that is what std::recursive_mutex is for).
I For example, we can use a mutex to prevent multiple threads from performing I/O at
the same time or accessing the same memory locations.
Example: std::mutex
int main() {
    std::thread threads[10];
    for (int i = 0; i < 10; ++i)
        threads[i] = std::thread(print_thread_id, i + 1);
    for (auto& t : threads)
        t.join();   // Joinable threads must be joined before destruction.
    return EXIT_SUCCESS;
}
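The slide does not show print_thread_id; a minimal sketch of what it might look like, serialising output with a std::mutex and std::lock_guard (the function body is an assumption):

#include <iostream>
#include <mutex>

std::mutex cout_mutex;   // protects std::cout across threads

void print_thread_id(int id) {
    std::lock_guard<std::mutex> lock(cout_mutex);   // held until the end of the scope
    std::cout << "thread #" << id << '\n';
}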
The Disruptor
Cost of contention
Lock-free programming
I People often describe lock-free programming as programming without mutexes, which
are also referred to as locks.
I That’s true, but it’s only part of the story.
I The generally accepted definition, based on academic literature, is a bit more broad.
At its essence, lock-free is a property used to describe some code, without saying too
much about how that code was actually written.
I Jeff Preshing (https://preshing.com/) proposes the following definition:
The Disruptor
Going parallel
Videos on benchmarking
I Bart Massey (Portland C/Assembly Users Group). Popcount and the Black Art of
Microbenchmarking. pdxbyte, 2014.03.11,
https://www.youtube.com/watch?v=opJvsJk1B68
I Chandler Carruth (Google). Tuning C++: Benchmarks, and CPUs, and Compilers!
Oh My! CppCon 2015, https://www.youtube.com/watch?v=nXaxk27zwlk
I Alexander Radchenko (Optiver). Benchmarking C++: From video games to
algorithmic trading. Pacific++, 2018.10,
https://www.youtube.com/watch?v=MHaZwyhpl-M
I Alexander Radchenko (Optiver). Benchmarking C++: From video games to
algorithmic trading. Meeting C++ 2018,
https://www.youtube.com/watch?v=7YVMC5v4qCA
I CoffeeBeforeArch. C++ Crash Course: Google Benchmark. 2019.07.21,
https://www.youtube.com/watch?v=eKODykkIZTE
The Disruptor
I Trisha Gee and Michael Barker (LMAX). The Disruptor—A Beginners Guide to
Hardcore Concurrency. JAX London 2011,
https://www.youtube.com/watch?v=DCdGlxBbKU4
I Trisha Gee (LMAX). Concurrent Programming Using The Disruptor. 2013.01.25,
https://www.youtube.com/watch?v=eTeWxZvlCZ8
I Peter Lawrey. Low Latency Code in Java 8. MelbJVM, April, 2015,
https://www.youtube.com/watch?v=t49bfPLp0B0
I Daniel Shaya. How Low Can You Go? Ultra Low Latency Java in the Real World.
London Java Community, 2018, https://www.youtube.com/watch?v=BD9cRbxWQx8
Books on optimization
Dennis Andriesse.
Practical Binary Analysis: Build Your Own Linux Tools for Binary Instrumentation,
Analysis, and Disassembly.
No Starch Press, 2018.
Björn Andrist and Viktor Sehr.
C++ High Performance: Master the art of optimizing the functioning of your C++ code.
Packt, 2020.
Michael Born.
Binary Analysis Cookbook: Actionable recipes for disassembling and analyzing
binaries for security risks.
Packt, 2019.
Benjamin J. Evans and James Gough.
Optimizing Java: Practical Techniques for Improved Performance Tuning.
O’Reilly, 2016.
Randall Hyde.
Write Great Code, Volume 1: Understanding the Machine.
No Starch Press, 2004.
Randall Hyde.
Write Great Code, Volume 2: Thinking Low-Level, Writing High-Level.
No Starch Press, 2006.
Christopher Kormanyos.