MAX OLIVEBERG
Abstract
We study optimal order placement in a limit order book. By modelling the
limit order book dynamics as a Markov chain, we can frame the purchase of
a single share as a Markov Decision Process. Within the framework of the
model, we can estimate optimal decision policies numerically. The trade rate
is varied using a running cost control variable. The optimal policy is found to
result in a lower cost of trading as a function of the trade rate compared to a
market order only strategy.
Keywords
Optimal order placement, Limit order book, Markov
Sammanfattning
We study optimal order placement in a limit order book. By modelling the dynamics of incoming orders as a Markov chain, we can formulate optimal order placement as a Markov Decision Process. Within the framework of the model, we can estimate optimal strategies numerically. A running cost is used as a control variable for the trade rate of the optimal strategy. We find that the optimal strategy results in a lower cost of trading as a function of participation compared to a market order strategy.
Keywords
Optimal order placement, Order book, Markov
Acknowledgments
First and foremost, I would like to express my sincere gratitude to Jonas
Kiessling and Anders Szepessy for their supervision of this thesis. Jonas
patiently answered any of my questions, be it regarding quantitative finance or
anything else. Anders provided invaluable advice on the project and academic
work in general. Thank you both!
I would also like to acknowledge the support and encouragement from Jakob
Stigenberg, who always made time to discuss implementation, methodology,
and VR-headsets.
Finally, thank you Ebba for your support while writing this thesis.
Contents
1 Introduction
  1.1 The Limit Order Book
    1.1.1 Modelling the Limit Order Book
  1.2 Markov Processes & Decision Policies
  1.3 Value Iteration
4.2 Results
  4.2.1 Results on synthetic data
  4.2.2 Results on historic data
  4.2.3 Discussion
References
A Markov Processes
  A.1 Potentials
  A.2 State Indexation
B Historical Data
  B.1 Order Matching
    B.1.1 A simple order matching algorithm
  B.2 'Synthetic Layer' on Historical Data
  B.3 Parameter fitting
  B.4 Historic data used
C Simple Strategies
  C.1 Market Order
  C.2 Bid Plus One
List of Figures
3.1 Results for synthetic data, using the matrix state representation.
The abstract entity which takes action in one of our MDPs is referred to as
the agent. It takes these actions according to some policy. All actions are
either representations of placing some type of order, or ’waiting’ for the state
to transition. In the latter case, the transition models another market participant
submitting an order to the LOB. Thus, we sometimes refer to such transitions
as a non-agent order arriving.
Chapter 1
Introduction
The optimal order placement problem, and the related optimal order execution
problem∗ , have been studied extensively in the literature. In [3], reinforcement
learning is used on historic order book data in order to execute some inventory
of N orders before time T . Similarly, [4] explores several reinforcement
learning approaches to optimal order execution in the context of a Markovian
order book. Both studies consider the execution of larger orders, while we
study the placement and execution of singular orders. In [5], recurrent neural networks are proposed as a method for generating synthetic LOB data. This
approach is intended to relax the Markovian assumption on the limit order
book.
∗ In [1], the two are distinguished by timescale and focus on the micro-structure of the LOB. It is challenging to make a strict distinction between the two, e.g. in [3] the optimal execution problem is solved with a micro-structure centered approach.
If the order is placed so that its price is higher (lower) than or equal to the best ask (bid), then the order is matched and a trade occurs. Both the incoming
order and counter party order are then removed from the LOB. Placing such
an order is referred to as placing a market order.
In the stylised LOB studied in this work, prices p are restricted to fixed
increments, or ticks, of some size δp appropriate to the instrument in question.
A participant can only buy (sell) at prices
p = N δp, (1.1)
for integers N > 0. Consequently, the order book is divided into levels, with
some number of outstanding limit orders at each level. We refer to the number
of orders at a price level as the depth at that price level. In this work, we will most often refer to prices by some number of ticks i relative to a price p0 :
p = p0 + iδp. (1.2)
[Figure: (a) A representation of an order book, with price levels along the x-axis and the corresponding depths along the y-axis. Buy orders are represented as negative. (b) A limit order arrives at price level i = 4, represented by a green block. This increases the depth from one to two. (c) The order represented by a red block is cancelled, resulting in a depth of two at price level i = 0. (d) A market order arrives at price level i = 3, and matches with the limit order represented by a purple block. This lowers the depth at the price level from three to two.]
only the current price is modelled, e.g. the random walk model of [6].
The LOB consists of some finite set of limit orders. If we want to create
a micro-structure model for the LOB, it is natural that we represent this set
of orders by some state s in a countable state space S, which represents
all possible sets of orders. The precise information contained in one of
these orders is model dependent. For example, on a real exchange each
order originates from a specific market participant. However, we may not be
concerned with whom we trade with, only that we have executed a trade. If
one stores a limit order of size n as n separate, unit sized limit orders, then the
bare minimum information regarding an order is the price, the priority and the
direction.
Orders are added to the state through limit orders, and removed via market
orders or cancellations. We approximate that all market participants place
their orders independently, and we model that there is almost surely some time
δt > 0 between two subsequent orders from different market participants. This
allows us to view the LOB as random process changing states at discrete times.
In the sequel, we will model the various order types as being independent
Poisson arrival processes. An alternative approach is to explicitly model the
behaviour of every actor on the exchange. An open source framework for this
is provided in [7]. Such a model is very fine-grained, but cumbersome to calibrate.
Our agent can interact with our model by placing a limit order, cancelling
a limit order, placing a market order or waiting for another market participant
to act. The agent's actions may be subject to latency, to model real-life delays in communication between an actor and an exchange. In this work, the agent will have zero latency when interacting with the LOB. In Section 5, we give a
brief example of how latency can be modelled in the context of the model.
We say that a random process fulfilling Equation (1.3) has the Markov
Property. An example is a Discrete Time Markov Chain (DTMC). A DTMC
is a random process s(i), i ∈ N , which assumes values in some countable
space S. Its transitions are governed by a stochastic matrix:

1. $P_{s,s'} \ge 0$,

2. $\sum_{s' \in S} P_{s,s'} = 1$.
Since each row sums to unity, the element Ps,s′ can be interpreted as the
transition probability from state s to state s′ . A DTMC can then be defined
as:
Definition 1.2.3 Let S be a state space, Ps,s′ a stochastic matrix on that state
space and s0 ∈ S. A discrete time Markov chain s(i) : N × Ω → S is a
random sequence such that
1. $s(0) = s_0$,

2. $\mathbb{P}\big(s(i+1) = s' \mid s(i) = s\big) = P_{s,s'}$.
In Section 2, we will model the limit order book as a Continuous Time Markov
Chain (CTMC). A CTMC is a random process s(t) with a countable state space
S with transition intensities defined by a rate matrix $Q_{s,s'}$:

1. $Q_{s,s'} \ge 0$ for $s \ne s'$,

2. $|Q_{s,s}| < \infty$,

3. $\sum_{s'} Q_{s,s'} = 0$.

The first and third points in the definition imply that the diagonal of the rate matrix fulfills
$$Q_{s,s} = -\sum_{s' \ne s} Q_{s,s'}. \tag{1.4}$$
The time that the process spends in a state is referred to as the holding time
T (s). The holding time is a random variable, and is exponentially distributed:
$$T(s) \in \mathrm{Exp}(-Q_{s,s}). \tag{1.5}$$
We can define a continuous time Markov chain as:

Definition 1.2.5 Let s(t) be a Markov process as per Definition 1.2.1 which takes values in a countable space S, and let $Q_{s,s'}$ be a rate matrix as per Definition 1.2.4. We say that the process s(t) is a Continuous Time Markov Chain∗ if for every time t and some infinitesimal δt, we have
$$\mathbb{P}\big(s(t+\delta t) = s' \mid s(t) = s\big) = \mathbb{I}\{s = s'\} + \delta t\, Q_{s,s'} + o(\delta t). \tag{1.6}$$
Given that we are in some state s(t) at time t, we can calculate the probability
of the next state being s′ as
$$P_{s,s'} = \frac{Q_{s,s'}}{\sum_{s'' \ne s} Q_{s,s''}} = -\frac{Q_{s,s'}}{Q_{s,s}}, \qquad s \ne s'. \tag{1.7}$$
This is the jump matrix for the process, and we see that it is a stochastic matrix
as per Definition 1.2.2. In the work, we will primarily be concerned with the
jump chain. The jump chain is the discrete time random process of the states
of some CTMC:
Definition 1.2.6 Let s(t) be a continuous time Markov chain with rate matrix
Qs,s′ , and initial value s(0). The jump chain s̄(n) of s(t) is the discrete Markov
chain with initial value s̄(0) = s(0) and transition matrix Ps,s′ as in (1.7).
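To make the relationship in (1.7) concrete, the sketch below builds the jump matrix from a rate matrix numerically. It is a minimal Python/NumPy illustration; the three-state rate matrix is a made-up example, not one of the thesis's model matrices.

    import numpy as np

    # An arbitrary example rate matrix Q: off-diagonals are intensities,
    # and each row sums to zero (Definition 1.2.4).
    Q = np.array([[-3.0, 2.0, 1.0],
                  [ 0.5, -1.5, 1.0],
                  [ 1.0, 1.0, -2.0]])

    def jump_matrix(Q: np.ndarray) -> np.ndarray:
        """Jump matrix of Equation (1.7): P[s, s'] = -Q[s, s'] / Q[s, s] for s != s'."""
        P = -Q / np.diag(Q)[:, None]   # divide each row by its diagonal entry
        np.fill_diagonal(P, 0.0)       # the jump chain never stays in place
        return P

    P = jump_matrix(Q)
    assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution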
The reason for the focus on the jump chain is twofold: First, the optimal policy
for a Markov Decision Process (MDP) is stationary, i.e independent of time.
Second, the optimal strategy for the model in Section 2.1 is not dependent on
the transition times, only the jump matrix. The jump matrix is unchanged if
we scale all intensities $Q_{s,s'}$ by some scalar c, i.e the transform
$$Q_{s,s'} \to \frac{Q_{s,s'}}{c} \tag{1.8}$$
leaves (1.7) unchanged but scales the rate of the holding time (1.5) by a factor $\frac{1}{c}$.†
∗ This is the infinitesimal definition. One can equivalently define the CTMC through the transition probability and holding time definitions. See Theorem 2.8.2 in [8].
† See the remarks at the end of Section 2.1 for an example of how such scaling can be exploited in the context of the model presented.
Despite the focus on the jump chain, we will discuss the transition times where
appropriate, such as in Section 2.2.1. Through abuse of notation, we will refer
to the jump chain s̄(n) as s(t), with the understanding that the optimal strategy
for s̄(n) is, in practical terms, the optimal strategy for s(t) if the agent has zero
latency when taking action.
A Markov chain may have some set of terminal states ∂S ⊆ S. These are
absorbing states, and if the process s(t) reaches the absorbing states it will
end:
Definition 1.2.7 A state s is terminal, s ∈ ∂S, if
$$\mathbb{P}\big(s(t+\Delta t) = s \mid s(t) = s\big) = 1. \tag{1.9}$$
Consider a MDP, where every policy almost surely terminates in finite time.
We have some set of terminal states ∂S ⊆ S, and a reward matrix R which is
bounded, |Ri,j | ≤ R0 . For each state s ∈ S, define
$$V_0(s) = \begin{cases} v_T(s), & s \in \partial S \\ \inf_{s'} v_T(s'), & \text{otherwise.} \end{cases} \tag{1.14}$$
for all n, s.
is bounded.
That the limit V∞ (s) is bounded follows directly from bounded rewards, and
that all strategies almost surely terminate in finite time.
Proposition 1.3.3 The policy α∞ (s), which maximises the expected value in
V∞ (s), is an optimal, stationary policy.
for a suitable threshold ϵ > 0. Our estimation of the optimal policy, α∗ , is then the maximising argument in each state s. This process is referred to as value iteration, and is the method used to estimate optimal policies in this work.
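As an illustration of the procedure just described, the sketch below runs value iteration for a small episodic MDP with terminal states. It is a minimal Python/NumPy sketch under assumed data structures: P[a] is the transition matrix of action a, R[a] the corresponding reward matrix, terminal a boolean mask over states and v_T the terminal values; none of these names come from the thesis.

    import numpy as np

    def value_iteration(P, R, terminal, v_T, eps=1e-8):
        """Estimate the optimal stationary policy of an episodic MDP.

        P, R: dicts mapping each action a to |S| x |S| transition and reward matrices.
        terminal: boolean array marking the terminal states in ∂S.
        v_T: terminal values, used to initialise V as in Equation (1.14).
        """
        n = len(terminal)
        V = np.where(terminal, v_T, v_T.min())      # V_0 of Equation (1.14)
        policy = np.zeros(n, dtype=int)
        actions = list(P)
        while True:
            # For each action: expected one-step reward plus value of the next state.
            Q = np.stack([(P[a] * (R[a] + V[None, :])).sum(axis=1) for a in actions])
            V_new = np.where(terminal, v_T, Q.max(axis=0))
            policy = np.where(terminal, policy, Q.argmax(axis=0))
            if np.max(np.abs(V_new - V)) < eps:      # stop at a suitable threshold ϵ
                return V_new, policy
            V = V_new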
Now that we have familiarised ourselves with MDPs, we are ready to present
the model for the LOB.
Chapter 2
A Model for the Limit Order Book
In this chapter, we model the Limit Order Book as a Continuous Time Markov
Chain. In the first section, we present a model suggested by [2] and then
modify the state representation to make it more flexible with regards to agent
orders. In the sequel, we consider how to simulate and value policies. Finally,
we adapt the model to constraints imposed by the available data.
valid state within our model. We denote the set of all valid states by S. The
subset ∂S ⊂ S is called the set of terminal states. A terminal state is a
state corresponding to the conclusion of a trading strategy, for instance the
execution of an order in a single order execution strategy.
In addition to the variable orders in the state, we assume that one can always
sell a share at price i = −1, and always buy a share at price n. These
assumptions constrain the best bid iB and best ask iA :
−1 ≤ iB < iA ≤ n. (2.2)
The best bid is the highest price a market participant is willing to purchase a
share for, i.e the outstanding limit buy order with the highest price. Likewise,
the best ask is the lowest price a market participant is willing to sell a share
for. We can define the bid price as
$$i_B = \max_i\big(\{i : s_i < 0\} \cup \{i = -1\}\big), \tag{2.3}$$
Next, we assume that we can observe and react to each individual order arriving at the order book. The orders are modelled as being either a limit
order, a market order or a cancellation arriving at a specific price level i. This
limits transitions to the general form
x → x ± kei , (2.5)
where k is the order size and ei is a unit vector. We will model the three order
types as independent Poisson arrival processes. These are subject to some constraints: for example, a cancellation can not arrive at an empty price level, nor can a limit sell order arrive below the bid.
Limit Orders: Limit buy orders arrive at price levels i < iA . Limit sell
orders only arrive at price levels i > iB . A limit order results in a transition
$$s \to \begin{cases} s - k e_i, & i < i_A \\ s + k e_i, & i > i_B. \end{cases} \tag{2.6}$$
The intensity of limit buy (sell) orders is modelled as proportional to the distance to the ask (bid). Further, there is some distribution pk governing the proportion of limit orders having size k, the specifics of which we will discuss towards the end of this section. Limit buy orders of size k arrive at price i with intensity
$$\lambda = p_k \lambda^B_L (i_A - i), \qquad i < i_A, \tag{2.7}$$
and limit sell orders arrive with intensity
$$\lambda = p_k \lambda^S_L (i - i_B), \qquad i > i_B. \tag{2.8}$$
Market orders: Market sell orders are modelled as arriving precisely at the
bid iB , and market buy orders arrive precisely at the ask iA :
$$s \to \begin{cases} s + k e_{i_B} \\ s - k e_{i_A}. \end{cases} \tag{2.9}$$
$$\lambda = q_k \lambda^S_M, \tag{2.10}$$
$$\lambda = q_k \lambda^B_M, \tag{2.11}$$
for parameters $\lambda^S_M$, $\lambda^B_M$. Here, $q_k$ is, similar to $p_k$, a distribution governing the sizes of market orders. We have that
$$\sum_k q_k \lambda^B_M = \lambda^B_M. \tag{2.12}$$
As a result, we can directly interpret $\lambda^B_M$ and $\lambda^S_M$ as the intensity of market orders per unit of time.
price levels i ≥ iA where |si| > 0. A cancel order transition takes the form
$$s \to \begin{cases} s + e_i, & s_i < 0,\ i \le i_B \\ s - e_i, & s_i > 0,\ i \ge i_A. \end{cases} \tag{2.13}$$
where αL , αM are parameters. We will use this distribution for our synthetic
results in Section 3 and then replace it with a geometric distribution for a closer
The main drawback of the method is the exponential growth of the state space with the number of price levels n under consideration.

$$\Lambda = (\lambda_0, \ldots, \lambda_n) \to \left(1, \ldots, \frac{\lambda_n}{\lambda_0}\right) = \Lambda/\lambda_0, \tag{2.19}$$
where the indexation is an arbitrary ordering of the $\lambda^j_i$ parameters such that $\lambda_0$ is non-zero. Doing this will affect the average transition time T of the jump process by a factor of $\lambda_0^{-1}$
$$E[T] = \sum \lambda_i \to \sum \frac{\lambda_i}{\lambda_0}, \tag{2.20}$$
yet leave the jump matrix $P_{s,s'}$ unchanged. A strategy which is optimal for the parameters Λ is still optimal for the parameters Λ/λ0. If we also want to include the case where our normalising parameter λ0 = 0, we can view the parameter space of Λ as being the projective space $P^n$. These considerations are useful when trying to reduce the computational complexity, as one parameter is effectively removed.
To generalise to a case where the agent can have a limit order at any price
level, an additional state variable j was added, representing the price level of
the limit order:
s′ = (s, y, j). (2.22)
To handle cancellations at the agent's price level, we have to sample which order is cancelled according to some scheme. The method used here is to sample uniformly which order is cancelled.
The n × m-matrix x represents the order book state. It is best illustrated with
an example, here of a 3 × 3-matrix
$$x = \begin{pmatrix} -2 & -2 & -1 \\ 1 & 0 & 0 \\ 0 & 2 & 0 \end{pmatrix}. \tag{2.24}$$
Similarly to the previous vector representation, the number of rows equals the
number of price levels in the model. The sum of each row is the depth at
the corresponding price level, meaning that we can recover the previous state
representation by multiplying x with the n-vector of ones 1n
We can then obtain the bid level iB and ask level iA in the same manner as
Equations (2.3) and (2.4). In the example of Equation (2.24), we would have
iB = 0, iA = 1.
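A small sketch of how the depths, bid and ask can be read off from the matrix state follows; this is not code from the thesis, only an illustration in Python/NumPy of Equations (2.3) and (2.24), using the convention that the bid defaults to −1 and the ask to n when the respective side is empty.

    import numpy as np

    # The example order book state of Equation (2.24).
    x = np.array([[-2, -2, -1],
                  [ 1,  0,  0],
                  [ 0,  2,  0]])

    depth = x.sum(axis=1)              # row sums: signed depth per price level
    n = len(depth)

    buy_levels = np.flatnonzero(depth < 0)
    sell_levels = np.flatnonzero(depth > 0)
    i_B = buy_levels.max() if buy_levels.size else -1   # Equation (2.3)
    i_A = sell_levels.min() if sell_levels.size else n  # symmetric definition of the ask

    print(depth, i_B, i_A)   # depth = [-5, 1, 2], i_B = 0, i_A = 1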
If only the first entry of a row is non-zero, then there are no agent orders
and this entry is the depth. In Equation (2.24), we see that at price level i = 1,
there is only a singular limit sell order, and this order does not belong to the
agent.
Following columns represent a slot for an agent order at each price level. If an entry is zero, then there is no agent order. Otherwise, the entry is the number of orders between the order and the next agent order further up the queue, including the order itself.
If there are no agent orders at higher column indices, then the entry is the order's position in the execution queue. Having an agent order at a price level changes the meaning of the first column to be the number of orders behind the lowest priority agent order. In Equation (2.24)
• The agent has two limit buy orders at price level iB = 0. The first is next
in line to execute, and there is a single non-agent order between the two.
Behind the second order, there are two additional non-agent orders.
• The agent has no orders at the ask iA = 1, where there is a single limit
sell order.
• Above the ask, the agent has a single order. There is also a single non-
agent order ahead of it, and none behind.
The matrix representation makes the model comparable with the representa-
tion of Equation (2.1). For example, we can now write all transitions to x
as
x → x + kei,j . (2.26)
Next, we note that the various transitions occur at distinct positions in the
matrix:
• When a market order arrives, we always begin deducting from the right-
most non-zero column.
The intensities per order type are not changed from Table 2.1, though we have to adapt them to the new state representation. In particular, the intensity of a limit buy cancellation arriving at position $e_{i,j}$, $i \le i_B$, in the matrix is
$$\lambda = \lambda^B_C (i_A - i)\big(|x_{i,j}| - \mathbb{I}\{j > 0\}\big), \tag{2.27}$$
Now, we are ready to turn our attention to the agent's position. The n-vector p represents the agent's net position, with the index of an entry denoting the trade price and the sign the direction of the trade. For example, the vector
$$p = \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix} \tag{2.29}$$
implies that the agent has purchased one share at price level i = 0 and sold one at price level i = 2.
∗ This can be formalised by defining an equivalence relation on S, though we will omit this here for brevity.
From the agent's point of view, an episode would proceed as follows: First,
we are shown some state si . Then, we react to this state by taking some action
a. As a result of this action, the state transitions to some new state, si → sj
and we receive reward Rsi ,sj . This process repeats until we hit a state s ∈ ∂S.
Concretely, an action in our case would entail either placing an order, waiting
for the state to transition or terminating. We refer to a decision policy as a
strategy, in reference to trading strategies. A stationary strategy α(s) returns
an action, a, as a function of the observed state, s, i.e
α(s) = a, (2.30)
strategy for Section 2.3 and for the moment assume that such a value exists,
and can be calculated upon termination.
Next, we define a market as the counter party to the strategy, responsible for
transitioning the state. The market need not transition the state by sampling a
Markov process. For instance, we can transition the state using historic data.
We represent this as the interface of Listing 2.2∗ .
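Listing 2.2 itself is not reproduced here; the sketch below is only a guess at what such a market interface could look like in Python. The class name and method signatures are assumptions for illustration, not the thesis's actual listing: the essential point is that the strategy only sees an opaque counter party that transitions the state, whether by sampling the Markov model or by replaying historic data.

    import numpy as np

    class Market:
        """Counter party responsible for transitioning the state (cf. Listing 2.2)."""

        def transition(self, state: np.ndarray, action) -> np.ndarray:
            # Apply the agent's action, then produce the next state: either by
            # sampling the Markov dynamics of Section 2.1 or by replaying
            # historic order book events (Appendix B).
            raise NotImplementedError

        def is_terminal(self, state: np.ndarray) -> bool:
            # True if the state lies in the terminal set ∂S.
            raise NotImplementedError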
Thus, we can
1. Sample the outcome $s_j$ with weights $\omega_j \propto \lambda_{i,j}$

2. Sample the transition time $T \in \mathrm{Exp}\big(\sum_j \lambda_{i,j}\big)$
Using the discrimination sampling method makes transitioning the state an
∼ O(1) computation on average, assuming intensities of similar magnitudes.
Uniformly sampling from an array of cumulative intensities can also be done,
with worst case complexity ∼ O(ln(nm)) if one uses an efficient search
algorithm. The latter requires constructing arrays of cumulative intensity,
implying a one-time additional complexity of ∼ O(nm), along with additional
memory footprint.
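As an illustration of the second variant (cumulative intensities plus an efficient search), here is a minimal Python/NumPy sketch of sampling the next transition given a flat array of intensities; the variable names are mine and do not come from the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_transition(intensities: np.ndarray):
        """Sample (outcome index, holding time) from an array of transition intensities."""
        total = intensities.sum()
        cumulative = np.cumsum(intensities)            # one O(nm) pass
        u = rng.uniform(0.0, total)
        outcome = int(np.searchsorted(cumulative, u))  # O(ln(nm)) binary search
        holding_time = rng.exponential(1.0 / total)    # T ~ Exp(sum of intensities)
        return outcome, holding_time

    # Example: three possible transitions with intensities 0.5, 1.0 and 2.0.
    print(sample_transition(np.array([0.5, 1.0, 2.0])))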
In our experiments, we benchmark all trades against some future price Rδt .
This allows us to value an arbitrary set of trades.
This method becomes ill-defined when ns ̸= nb . The prices {bi }, {si } may no
longer be relevant, and even if our trades have recently executed, there is no
guarantee that we could execute the opposite trade(s) at the same price(s). Our
solution is a reference price R to benchmark the prices against. This redefines
the profit/loss as:
$$P = \sum_{i=1}^{n_s} (s_i - R) + \sum_{j=1}^{n_b} (R - b_j). \tag{2.34}$$
$$R_{\delta t} = \frac{i_A + i_B}{2}. \tag{2.35}$$
Since we calculate the reference price after termination, it enters the value
iteration equations in the terminal value vT . In terminal state s ∈ ∂S, the
value of our policy is
V (s, α) = E[vT (s)]. (2.36)
The value vT (s) is the profit/loss of Equation (2.34)
$$E[v_T(s)] = E\Big[\sum_{i=1}^{n_s} (s_i - R_{\delta t}) + \sum_{j=1}^{n_b} (R_{\delta t} - b_j)\Big] = (n_b - n_s)\, E[R_{\delta t} \mid s] + \sum_{i=1}^{n_s} s_i - \sum_{j=1}^{n_b} b_j. \tag{2.37}$$
To find an optimal strategy through value iteration, we need to calculate the
expected reference price, given s ∈ ∂S:
Before moving on, we observe how the reference price behaves in some special
cases. First, managing to execute a limit order leads to a profit of half a spread
in all cases where there are additional orders behind the agent order. For
example
$$\left(0,\ \begin{pmatrix} 0 & 0 \\ -1 & -1 \\ 1 & 0 \end{pmatrix}\right) \to \left(-e_1,\ \begin{pmatrix} 0 & 0 \\ -1 & 0 \\ 1 & 0 \end{pmatrix}\right), \tag{2.39}$$
where the transition is caused by a unit size market order to the bid. Here, the value at termination would be precisely half the spread
$$v_T = R_0 - 1 = \frac{1+2}{2} - 1 = \frac{1}{2}. \tag{2.40}$$
However, if the agent order is the only order at the price level and a market
order of unit size arrives
$$\left(0,\ \begin{pmatrix} 0 & 0 \\ 0 & -1 \\ 1 & 0 \end{pmatrix}\right) \to \left(-e_1,\ \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \end{pmatrix}\right), \tag{2.41}$$
then the order causes the price to move to the agent's disadvantage. The result is a net loss:
$$v_T = \frac{2-1}{2} - 1 = -\frac{1}{2}. \tag{2.42}$$
Inversely, if the agent places a market order at the ask for a single share when
the depth is larger than one
$$\left(0,\ \begin{pmatrix} 2 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}\right) \to \left(-e_0,\ \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}\right), \tag{2.43}$$
These behaviours have probabilistic analogues for δt > 0, but are deterministic
in the δt = 0 case.
Since it is the most readily available, we use level 1 data for historical
evaluations. Having only two price levels also results in a tractable state space.
Level 1 data requires some adaptation of the model of Section 2.1. For example,
the matrix state model of Section 2.1 presumes detailed knowledge of all price
levels. In Appendix B, we give the algorithm used for making the historic data
compatible with the modified model of Section 2.4.1.
There are mathematical methods for dealing with MDPs under limited
information. For example, a Partially Observable Markov Decision Process
(POMDP) treats the observable state s as only containing partial information
about the true state s̄. This is directly applicable to our situation: We can only
observe the bid and ask, not the price levels above or below. Reframing the
optimal order placement problem as a POMDP would not necessarily increase
the real-world applicability of the Markov model approach. This is because
other actors have access to data of a higher level, leading to a disadvantageous
information asymmetry. Rather, it would be more fruitful to obtain a higher
level of data.
where dB , dA are the depths at the bid and ask respectively, and ∆ is the spread.
Since we are evaluating strategies by comparing prices to the instantaneous
The variable θ is some representation of the agent's orders. When the agent has no orders, we adopt the convention
$$\theta = 0. \tag{2.48}$$
The intensities described in Section 2.1 remain the same, with the caveat that we no longer model transitions due to orders below (above) the bid (ask). As a result, we can not permit the agent to have deep orders below the bid or above the ask. Say the agent has some order ∆θ < ∆ above the bid. If a limit buy order of size k arrives at ∆ > ∆′ > ∆θ , then we immediately cancel the agent's order.
∗ See Appendix B.4
As we do not model any time delays between the agent and the order book,
this is equivalent to forcing the agent to cancel its order if it becomes deep, with the added benefit of saving on state space, since we do not need to index the intermediary state
$$s = \big([k, d_A, \Delta - \Delta'],\ \theta\big). \tag{2.52}$$
This state space representation can be extended to a case where one has access
to more price levels than just the best bid and ask. If one wants to include the
price levels below and above the bid and ask, a representation such as
$$\big([d_{B-1}, \Delta_{-1}, d_B, \Delta_0, d_A, \Delta_1, d_{A+1}],\ \theta\big), \tag{2.53}$$
is a natural extension of Equation (2.47). If more price levels are needed, pairs $(d_{A/B \pm i}, \Delta_{\pm i})$ can be appended to the state.
Chapter 3
Optimal Order Placement on Synthetic Data Using the Matrix State Representation
We study the execution of either a buy or sell order before the price moves
out of the price range represented in the state space S. The states s ∈ S are
represented using the matrix representation of Section 2.1.2. We use synthetic
data generated using the dynamics of Section 2.1 to evaluate and study the
optimal strategy under ideal circumstances.
The order book has n = 4 price levels, and we allow a maximum depth of
dmax = 3 limit orders at each price level. We restrict the agent to a single
limit order in the order book at a time. When the agent places a limit order, it must wait for at least one non-agent order to arrive before cancelling∗ . With these hyper-parameters and restrictions, the state space size is
For the model parameters, we set all linear intensity parameters to unity:
The distributions for the sizes of limit and market orders are modelled as being exponential, as in Equation (2.17). For the distribution parameters we choose
αM = αL = 2. (3.6)
We truncate the maximum order size to kmax = 2. With this truncation, the
occupancy rate of the jump matrix for waiting, Ps,s′ (aW ), is ∼ 6.29 · 10−4 .
∗ This restriction is to ensure that all strategies a.s terminate in finite time. If we had no such restriction on limit orders, a strategy which repeatedly places a limit order and then immediately cancels it would never terminate.
As we force the agent to wait at least one transition after placing a limit
order, the action of placing a limit order does not result in a deterministic
transition. We denote a deterministic limit order action as aL , i.e an action
with a transition matrix as in Equation (3.8), and a wait action as aW . Placing
a limit order and then waiting can then be viewed as the compounded action
aLW = aW aL , with transition matrix elements
$$R_{s,s'}(a_{LW}) = R_{s,s''}(a_L)\, R_{s'',s'}(a_W) = \mathbb{I}\{s' \in \partial S\}\, v_T(s'). \tag{3.10}$$
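The composition of a deterministic action with a wait action is simply a product of the corresponding matrices. Below is a minimal NumPy sketch of this composition for a toy three-state example; the matrices are made up for illustration and are not the model's actual transition matrices.

    import numpy as np

    # Toy example with three states; state 2 is terminal.
    P_L = np.array([[0.0, 1.0, 0.0],   # placing the limit order deterministically
                    [0.0, 1.0, 0.0],   # moves the book into state 1
                    [0.0, 0.0, 1.0]])
    P_W = np.array([[0.5, 0.3, 0.2],   # waiting lets a non-agent order arrive
                    [0.1, 0.5, 0.4],
                    [0.0, 0.0, 1.0]])

    # Compound action a_LW = a_W a_L: first place the order, then wait.
    P_LW = P_L @ P_W
    assert np.allclose(P_LW.sum(axis=1), 1.0)
    print(P_LW)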
3.2 Results
3.2.1 Observed behaviour of the optimal strategy
Having defined the MDP in Section 3.1.1, we can obtain an approximation for
the optimal strategy using value iteration. Studying the resulting strategy α∗ ,
we find that it exploits the properties of the instantaneous reference price R0
discussed in the latter part of Section 2.3. The strategy seeks to have a limit
order executed such that there is depth behind the agent order and the spread
the agent will choose to wait, with the optimal outcome being a market order
of unit size. This would result in a terminal reward of
$$v_T(s) = \frac{1+3}{2} - 1 = 1. \tag{3.13}$$
An order book state such as
$$x = \begin{pmatrix} -1 & 0 \\ -1 & 0 \\ 0 & -1 \\ 2 & 0 \end{pmatrix} \tag{3.14}$$
will cause the agent to cancel its order, as a unit size market order would lead to a net-zero terminal reward
$$v_T(s) = \frac{3+1}{2} - 2 = 0, \tag{3.15}$$
and an order of size two would lead to a net loss
$$v_T(s) = \frac{2+1}{2} - 2 = -0.5. \tag{3.16}$$
The agent engages in greedy behaviour, where it will avoid exposure to
profitable termination when the spread is small. In the order book state
$$x = \begin{pmatrix} 0 & 0 \\ -2 & 0 \\ 1 & 1 \\ 1 & 0 \end{pmatrix}, \tag{3.17}$$
∗ Here, we say "order book state x" to distinguish from the full state s = (p, x). For this experiment, the position vector, p, can not be discarded. It is omitted from the notation for simplicity as it is zero in all non-terminal states.
the agent will cancel its limit order. If it would instead choose to wait, and a
market order of unit size arrives, then the terminal reward would be
$$v_T(s) = 2 - \frac{1+2}{2} = 0.5. \tag{3.18}$$
A market order of size two would result in a net zero profit. The expected
reward is less than the reward for the same event under the larger spread of
Equations (3.12) and (3.13). Finally, the agent will only place a market order
in six states, all being akin to
$$x = \begin{pmatrix} -1 & 0 \\ 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}, \tag{3.19}$$
where the order would be placed at the ask. The reason for this is that the
market impact of the order causes the reference price to increase, in the above
resulting in a terminal reward of
$$v_T(s) = \frac{0+4}{2} - 1 = 1. \tag{3.20}$$
In Figure 3.1a, we see the distribution of action types for states s ̸∈ ∂S. The
expected MDP rewards are shown in Figure 3.1b.
[(a) Histogram over action type ("Wait", limit order, "Market") for non-terminal states. Six states result in a market order, corresponding to states where the market order moves the reference price, R0, so that the trade is profitable. (b) Results for n = 1000 iterations per initial state, with non-terminal states sorted by V[i] on the x-axis and value in price ticks on the y-axis. The states have been sorted in ascending order of V(si, α).]
Figure 3.1: Results for synthetic data, using the matrix state representation.
Chapter 4
Optimal Order Placement on Synthetic and Historic Data Using the Reduced State Representation
In this section, we study the optimal strategy for the purchase of a single
share on a state space S represented by the reduced state of Section 2.4.1.
We formulate a MDP to be used both on synthetic data and historic data. This
allows us to directly compare the ideal results on synthetic data with the results
on historic data.
θ = [δ, ∆θ ]. (4.2)
The variable δ is the number of orders ahead of the agent, including itself. If δ = 0, then the order has executed. The second variable, ∆θ , is the price of the agent order relative to the current bid. If the agent has no order, then we will denote this θ = 0∗ . Having established the representation, we can write the set of terminal states as
$$\partial S = \{s : \theta \ne 0,\ \delta = 0\}. \tag{4.3}$$
As in Section 3.1.1, we force the agent to wait after placing a limit order. In addition, we force the agent to wait after cancelling a limit order† . The construction of the cancel-then-wait action aCW = aW aC is analogous to the construction of aLW in Section 3.1.1.1.
We resolve this by introducing a penalty c > 0 for each wait action, aW , the agent chooses to take. This penalty enters the reward matrix:
The penalty acts as a control variable for the execution time of the strategy. Consider the following: If we approximate that the spread at execution is always zero, ∆ = 0, then the profit for a limit order executing would be half a tick, and the loss for placing a market order would be half a tick. The difference between the two is one full tick. If we have a limit order in the order book, and we expect it to execute in an average of N transitions, then we would prefer to wait as opposed to placing a market order if
$$N < \frac{1}{c}. \tag{4.5}$$
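For example, under this zero-spread approximation a penalty of c = 0.05 means that waiting is preferred whenever the limit order is expected to execute within N < 1/0.05 = 20 transitions.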
∗ If one wants to formulate a MDP which both allows buy and sell orders, a viable representation is [δ, ∆θ , ℓ], where ℓ ∈ {−1, 1} is the polarity. In a code implementation, the variable ℓ can then also be used to determine order existence, with ℓ = 0 representing no order.
† This is to prevent certain behaviours resulting from the penalty only being applied for a wait action aW .
We can expect the optimal strategy to converge towards placing a market order
at t = 0, as c is increased.
$$c_T = \frac{\delta p}{\bar{R}}\, E[v_T(s)]. \tag{4.7}$$
As a measure for execution time T , we use the trade rate r
$$r = E\left[\frac{1}{T}\right]. \tag{4.8}$$
This has some similarity to the participation rate, which is the proportion of occurring trades in which the agent is one of the parties. The trade rate r is easier to define in terms of the MDP, and is chosen for expediency.
depth of dmax = 8 and a maximum spread of ∆max = 2 for a state space size
of
|S| = 2592, |∂S| = 1152. (4.9)
The state space hyper-parameters have been chosen through consideration of the trade-off between data admissibility and computational expedience. For the transition hyper-parameters, we use (δ∆)max = 0, kmax = 4 after
considering the distributions observed. Plots of the typical distributions are
shown in Appendix B.4.
For each pair of initial state and parameters, (sk , Λk ), and penalty, c, we sample
the outcome of the optimal strategy niter = 300 times. Per penalty, we then
average these outcomes to obtain estimates for the rate, r, and the cost of trading, cT , as functions of the penalty. The expected MDP reward need not be
estimated using Monte-Carlo methods. It is precisely V (sk , α∗ ) on synthetic
data generated using the dynamics the strategy is optimal for.
4.2 Results
4.2.1 Results on synthetic data
The results of the simulations are shown in Figure 4.1.
In Figure 4.1a, we see the expected Markov reward for a range of penalties
c ∈ [0.01, 0.35]. The optimal strategy, α∗ , by definition fulfills V (s, α∗ ) ≥
V (s, α) for any other strategy, α. As a result, we expect the optimal strategy to
r=1 (4.10)
Finally, Figure 4.1c shows the cost of trading, cT , as a function of the rate,
r. Here, we see that the optimal strategy offers a better cost of trading than the
Market Order strategy for each rate r. The Bid Plus One strategy lands along
the line of optimal strategies, but has the disadvantage that the rate can not be
varied.
The Market Order strategy had identical performance on historic and synthetic
data. Since the Market Order strategy places a market order at t = 0, its value
is a deterministic function of the initial state. The initial state and parameter
pair (sk , Λk ) was used to generate the synthetic episodes, leading to the same
results.
[Figure 4.1: results on synthetic data, comparing the Optimal Policy, Bid Plus One and Market Order strategies. (a) Episode reward as a function of the penalty. (b) The rate r of the optimal strategy as a function of the penalty c. (c) The cost of trading cT as a function of the rate r. Bid Plus One becomes a single point and Market Order a constant line, see Appendix C.]
[Figure: the same comparison of the Optimal Policy, Bid Plus One and Market Order strategies; axes show episode reward, rate and penalty. (b) The rate of the optimal strategy, as a function of the penalty. (c) The cost of trading as a function of the rate, r.]
[Figure 4.3: (a) Because the historic data is not generated according to the model dynamics, we can not expect the MDP rewards to be V(s, α∗) for each s. (b) Rate as a function of penalty.]
4.2.3 Discussion
Since historic data is not generated according to the model dynamics, we can not expect the optimal strategy to have the expected reward V (s, α∗ ) in all states. However, the comparison in Figure 4.3 implies that the model does capture some aspects of the limit order book's behaviour. The optimal strategy still outperforms the Bid Plus One strategy in terms of the MDP, and is within the margin of error of the Market Order strategy.
As we see in Figure 4.3a, the historic evaluation led to a lower reward for all penalties. Two contributing factors could be that the strategy overestimates
the value of its eventual execution, and underestimates the time to execution.
In Figure 4.3c, we see that the execution value is indeed overestimated per
penalty, c. In Figure 4.3b, we see that the optimal strategy has a higher trade
rate on historic data. The higher trade rate could imply that we overestimate
the time to execution. This could be the result of the strategy underestimating
the probability that we encounter a state where the estimated optimal action
is to place a market order. Placing more market orders than expected would
explain both the increased trade rate and decreased execution value: Market
orders terminate the episode instantly, and generally lead to worse execution
values compared to limit orders.
There is no guarantee that the optimal strategy should perform worse in terms
of the MDP on historic data. If the model were to systematically underestimate
the execution value, then this would be to the optimal strategy’s advantage
when evaluating on historic data.
Chapter 5
Conclusions and Future work
The actions available to the agent can be extended to also model latency
$$P_{s,s'}(a_L^\gamma) = \gamma\, P_{s,s''}(a_W) P_{s'',s'}(a_L) + (1-\gamma)\, P_{s,s''}(a_L) \tag{5.1}$$
∗ Here, we have brazenly assumed that aL ∈ A(s′′ ) in the first term of the r.h.s. We can numerically handle the cases where aL ̸∈ A(s′′ ) by putting Ps′′ ,s′′ (aL ) = 1.
References
[5] ——, "A generative model of a limit order book using recurrent neural networks," qC 20210531. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-295414
Appendix A
Markov Processes
This section briefly goes over aspects of Markov processes relevant to the work that were omitted from the introductory Section 1.2.
A.1 Potentials
Suppose we have a discrete jump process s(n), n ≥ 0 on a countable state
space S. The process hits some set of terminal states ∂S after T transitions
such that
P(T < ∞) = 1. (A.1)
If we define the agent to receive reward Rsi ,sj for each transition si → sj then,
starting at state s(0), the total reward is
$$R = \sum_{n=0}^{T-1} R_{s(n),s(n+1)} + v_T(s(T)), \tag{A.2}$$
$$\phi(s) = E\Big[\sum_{n=0}^{T-1} R_{s(n),s(n+1)} + v_T(s(T)) \,\Big|\, s(0) = s\Big], \tag{A.3}$$
where the termination time T and future states s(n ≥ 1) are random variables.
Due to the Markov property, we can rewrite Equation (A.3) as
The moniker potential comes from the superficial similarity to the path-independent potentials of physics. Indeed, the potential in state s is independent of how the process arrived at s.
The value can directly be interpreted as the potential in state s given the policy
α.
f : Jn → S. (A.8)
This map is bijective, and its existence is a prerequisite for the set S being
countable. Its inverse,
f −1 : S → Jn , (A.9)
maps the states to their respective indices. In this work, this map is left implicit, but needs to be explicitly defined if one wants to implement a decision policy utilising the methods of this thesis. In code, the map can be abstracted by an
interface given in Listing A.1.
import numpy as np

class Indexation:
    def to_index(self, state: np.ndarray) -> int:
        raise NotImplementedError

    def to_state(self, index: int) -> np.ndarray:
        raise NotImplementedError
If the state space is relatively small it is then expedient to store the indexation
in memory through dictionaries. This allows for a few advantages
• Calculating f (i) and f −1 (s) are efficient O(1) operations, since they
entail accessing an entry in a dictionary.
The downsides of this approach become evident as the state space grows
That is, two price levels with no room for agent orders. There are $(2D+1)^2$ such states. We can index each of these through
$$s_n = \begin{bmatrix} n \bmod (2D+1) \\ \big(n - (n \bmod (2D+1))\big)/(2D+1) \end{bmatrix}. \tag{A.11}$$
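As an illustration, a minimal Python sketch of this two-level indexation and its inverse follows. The function names and the shift by D (mapping the digits 0, ..., 2D to signed depths −D, ..., D) are assumptions for illustration; Equation (A.11) itself only specifies the unshifted digits.

    import numpy as np

    D = 3                       # assumed maximum absolute depth per price level
    BASE = 2 * D + 1

    def to_state(n: int) -> np.ndarray:
        # Equation (A.11): the two base-(2D+1) digits of n, shifted to {-D, ..., D}.
        d0 = n % BASE
        d1 = (n - d0) // BASE
        return np.array([d0 - D, d1 - D])

    def to_index(s: np.ndarray) -> int:
        # Inverse map f^{-1}: undo the shift and read the digits back as an integer.
        return int((s[0] + D) + (s[1] + D) * BASE)

    assert all(to_index(to_state(n)) == n for n in range(BASE ** 2))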
Appendix B
Historical Data
The data is provided as two separate time series: quotes and trades. Quotes
contain the best bid/ask, and the corresponding depths. Trades contain a price,
and size, for trades that have occurred. For a fully detailed account of the data,
we refer to the data source [13].
Sizes and prices: The depth of the quotes is given in units of round lots,
with the size of a lot varying between shares. For the $AAPL data used, the
lot size is one hundred shares. Trades are always given in units of singular
shares. Both time series give prices in the currency the stock is denoted in.
Trades need not occur at the best bid or ask, but can occur at any price, inside
or outside the spread. We use lots as the unit, i.e a depth at the bid of dB = 2
represents two lots. If the agent places a market order of unit size, then this
represents purchasing a single lot.
Timestamps: Various timestamps are given for both time series, correspond-
ing to the different times the actors involved generated or received the data.
All timestamps are given in units of nanoseconds. Some timestamp types may
be missing for a quote or trade. Since it is guaranteed to exist and makes for a
well defined comparison, we will use the timestamp of when the data arrived
at the data-broker and fully disregard the other timestamps.
Mismatched order books: On rare occasions the quote can imply a crossed
order book, iB ≥ iA , with equality being the most common. These states are
as a rule resolved within milliseconds. That such a situation can occur is a
result of the bid and ask being reported from two different exchanges.
The quotes qtm are tuples of time, exchanges, bid, ask and corresponding
depths
qtm = [tm , idB , idA , pB , pA , dB , dA ]. (B.2)
We have omitted some additional auxiliary fields which we do not use in the
order matching algorithm. Before attempting to deduce the states and orders,
we will remove entries from both time series by a sequence of filters, meant to
First, we remove trades which we do not believe are relevant. Since trades
are reported from all exchanges simultaneously, but the quotes only come
from one or two, we will disregard all trades which do not originate from the
exchange(s) currently reporting the bid/ask. Let tnm be the time of the latest
quote for trade n
$$t^n_m = \max_{t_m}\{t_m : t_m < t_n\}. \tag{B.3}$$
We also introduce some cutoff on trade size nmin , and disregard trades smaller
than this as insignificant. This results in the next reduced trade set T2
The two operations described above result in most trades occurring at the bid
or ask. Any remaining trades which do not occur at either have their price
modified to the closer of the two. That is,
where
$$\psi(\tau_{t_n}) = \begin{cases} \tau_{t_n}, & \text{if } p_{t_n} = (p_B)_{t^n_m} \lor p_{t_n} = (p_A)_{t^n_m} \\ [\ldots, (p_B)_{t^n_m}, \ldots], & \text{if } |(p_B)_{t^n_m} - p_{t_n}| \le |(p_A)_{t^n_m} - p_{t_n}| \\ [\ldots, (p_A)_{t^n_m}, \ldots], & \text{otherwise.} \end{cases} \tag{B.7}$$
and
Tsynthetic = {[tm , (idB )tm , (pB )tm , |(dA )tm − (dB )tm |] : qtm ̸∈ Q1 }. (B.9)
T4 = T3 ∪ Tsynthetic . (B.10)
We now have our final sets of quotes Q1 and trades T4 , and want to parse these into a single sequence of events
e = [t, s, θ, s′ ], (B.11)
where t is the timestamp, s is the incoming state, θ is the order causing the
transition and s′ is the outgoing state. Note that we require both θ and s′ , since
we must be able to differentiate cancellations from market orders, and receive
information regarding empty price levels if the bid/ask is emptied.
2. Deduce what order(s) this implies. See if a trade has occurred at the
relevant price within δt to differentiate cancellations and market orders.
3. Package these events into a sequence {[(t)i , s0 , θ1 , s1 ], ..., [(t)i , sk−1 , θk , sk ]}.
• For each i, the incoming state s0 is always the state implied by qi−1 , and
the final outgoing state sk is always the state implied by qi .
• Trade sizes are disregarded, and instead the depth differential is used to
size market orders.
all duplicate quotes, the sequence is never empty. If one extends the
algorithm to level 2 data, then the length of the sequence can exceed
two entries.
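To make the classification step concrete, below is a heavily simplified Python sketch of how a depth change on one side of the book can be classified as a limit order, a market order or a cancellation by checking for a nearby trade. The data structures (plain dicts for quotes and trades) and the function name are assumptions for illustration; the actual algorithm of Appendix B.1.1 also handles price changes, synthetic trades and multi-event sequences.

    def classify_depth_change(prev_quote: dict, quote: dict, trades: list, dt: float) -> str:
        """Classify the change in bid depth between two consecutive quotes.

        prev_quote, quote: dicts with keys 't', 'bid_price' and 'bid_depth'.
        trades: list of dicts with keys 't' and 'price'.
        dt: time window used to match trades to the quote update.
        """
        change = quote["bid_depth"] - prev_quote["bid_depth"]
        if change > 0:
            return "limit buy order"          # depth increased
        if change == 0:
            return "no bid-side event"
        # Depth decreased: a trade at the bid within dt indicates a market sell order,
        # otherwise we interpret the change as a cancellation.
        traded = any(
            abs(tr["t"] - quote["t"]) <= dt and tr["price"] == prev_quote["bid_price"]
            for tr in trades
        )
        return "market sell order" if traded else "cancellation"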
Here we will give a brief account of how we handle various situations when
evaluating on historic data. As we are only concerned with singular limit buy
orders, θ consists of
θ = [δ, ∆θ , ℓ], (B.14)
where δ is the position in the execution queue, ∆ ≥ ∆θ ≥ 0 is the offset
from the bid price and ℓ denotes the order polarity (presumed to be ℓ = −1,
denoting a buy order).
Limit sell orders: If the agent has an order above the bid price, and a limit sell order of size k arrives at price ∆′ ≤ ∆θ , then we consider the agent's order matched. In addition, we reduce the size of the order to k − 1. An order of unit size would thus not impact the spread.
A limit sell order at price ∆′ > ∆θ does not impact the agent's order, and thus we simply transition the state as usual.
Limit buy orders: If a limit buy order comes in at ∆′ > ∆θ , then the agent's order is immediately cancelled, i.e θ → 0.
Should the buy order decrease the spread, without triggering the automatic cancellation described above, we adjust ∆θ accordingly. Finally, any limit buy order is placed behind the agent's order in the execution queue.
Market sell orders: If the agent has an order at ∆θ > 0, then the agent's order is given priority. As a result, we reduce the size k of the market order by one unit to k − 1.
If the agent has an order at the bid, then priority is given according to the execution queue. Should the agent's order be executed as a result of the market order, we again reduce the size by one unit before applying it to the remaining state.
Should the market order deplete the depth at the bid dB , then ∆θ is adjusted accordingly.
Cancellations at bid: Should the agent have an order at the bid, ∆θ = 0, then we sample which non-agent order is cancelled uniformly. If a cancellation comes in when the depth dB = 1, then we must adjust ∆θ to reflect the new spread.
Agent market orders: If the agent places a market order when the depth at the
ask dA = 1, then the number of empty price levels above the ask is sampled
from the estimated parameters. This introduces a small area of ideal behaviour
around these market orders, which is to the advantage of the optimal strategy.
However, in the data observed
P(δ∆ = 0) = 1 (B.15)
sizes and the number of empty price levels below/above the bid/ask are
modelled to be geometrically distributed, i.e
Market orders: Market orders arrive with intensities λSM , λBM for sell orders
and buy orders respectively. In this case, the intensity is easily estimated by
dividing the number of buy/sell market orders n over the time interval δT ,
$$\lambda^{\cdot}_M \sim \frac{n}{\delta T}. \tag{B.18}$$
Limit orders & Cancellations: Both limit orders and cancellations of limit
orders have intensities λ proportional to some parameter λi and some function
f (s) of the current state s
λ = λi f (s). (B.19)
Over the interval δT , the state s will transition at times ti , such that tn −
t0 ≤ δT . The net effect of the intensity, or rather the expected number of
occurrences N , over the time δT is
$$E[N] = \int_{t_0}^{t_n} \lambda_i f(s(t))\, dt = \lambda_i \sum_{i=1}^{n} (t_i - t_{i-1}) f(s(t_{i-1})). \tag{B.20}$$
$$\lambda_i = \frac{E[N]}{\sum_{i=1}^{n} (t_i - t_{i-1}) f(s(t_{i-1}))}. \tag{B.21}$$
This process is the same for all limit order and cancellation parameters, though
the function f (s) varies as described in Section 2.1.
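A minimal Python sketch of this fitting procedure follows. The observed event count stands in for E[N]; the arrays and function names are my own for illustration and do not come from the thesis.

    import numpy as np

    def fit_market_order_intensity(n_orders: int, dT: float) -> float:
        # Equation (B.18): number of observed market orders divided by the interval length.
        return n_orders / dT

    def fit_state_dependent_intensity(times: np.ndarray, f_values: np.ndarray, n_events: int) -> float:
        """Equation (B.21): lambda_i = N / sum_k (t_k - t_{k-1}) f(s(t_{k-1})).

        times: transition times t_0, ..., t_n within the interval.
        f_values: f(s(t_0)), ..., f(s(t_{n-1})), the state factor before each transition.
        n_events: observed number of occurrences of the order type, standing in for E[N].
        """
        exposure = np.sum(np.diff(times) * f_values)
        return n_events / exposure

    # Example with made-up numbers.
    t = np.array([0.0, 1.0, 2.5, 4.0])
    f = np.array([1.0, 2.0, 0.5])
    print(fit_market_order_intensity(30, 60.0), fit_state_dependent_intensity(t, f, 12))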
select data, as opposed to times. This ensured that an equal amount of data
was selected from all dates, in the first step. The data was obtained through
the Polygon API [13].
Episodes were selected from the data such that they were at least nmin = 300
events long, and fully within the state space defined by the hyper parameters
in Section 4.1.3. All such episodes were selected. If an episode was above
2nmin in size, it was split into multiple episodes of 300 events, with the
chronologically last episode containing any remainder. In Figure B.1, we give
samples of typical distributions for the order sizes, the number of empty price
levels, and the depth at the first non empty price level. In Table B.1, details
regarding dates and data are given.
[Figure B.1: (a) The distribution for sizes of limit orders. A geometric fit is shown in orange. (b) The distribution for sizes of market orders. A geometric fit is shown in orange. (c) The distribution for the number of empty price levels, when the bid or ask is emptied, including the recently emptied price level. This was approximated as P(δ∆ = 0) = 1. (d) The distribution for the depth at the next non-empty price level.]
Appendix C
Simple Strategies
The Market Order strategy can have an arbitrary participation rate r. Consider the following: If we want to purchase n shares with market participation rate r using only market orders, then we can start by placing a market order, then waiting 1/r transitions before placing the next.
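A minimal sketch of this idea follows, assuming a market object with a wait() method that advances the book by one transition and a market_order() method that buys a single share; both names are assumptions for illustration.

    def market_order_strategy(market, n: int, r: float) -> None:
        # Purchase n shares with participation rate r: one market order every 1/r transitions.
        wait_between = int(round(1.0 / r))
        for _ in range(n):
            market.market_order()
            for _ in range(wait_between):
                market.wait()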
The Bid Plus One strategy was chosen as a basis for comparison to the optimal Markov strategy because it approximates the optimal strategy for a running penalty c ∼ 0.05, whilst still being very simple.
The participation rate and average execution cost are constant as functions
of the penalty c. However, the MDP rewards,
vary accordingly.
TRITA-SCI-GRU 2023:407