Deep Multi-Agent Reinforcement Learning
Jakob N. Foerster
Magdalen College
University of Oxford
Michaelmas 2018
To my parents,
Bärbel and Claus
Acknowledgements
This thesis would not have been possible without the support of a large number of
fantastic individuals and institutions. First and foremost I would like to thank my
advisor Shimon Whiteson for having been supportive of my endeavours throughout
my PhD, always providing useful advice and perspectives. I would also like to thank
Nando de Freitas for co-advising me during the first year of my PhD, until our
ways parted. I thank Peter Stone and Phil Blunsom for carefully examining this
thesis and for their constructive feedback. I thank Gregory Farquhar for having been an
alter-ego for a lot of my PhD, happily discussing the wildest ideas. I thank Yannis
Assael, Brendan Shillingford, and Oana Camburu for making the first year of my
PhD unforgettable, successful, and fun. I also thank my collaborators and friends at
WhiRL: Nantas Nardelli, Tabish Rashid, Christian Schroeder, Jelena Luketina, Max
Igl, Luisa Zintgraf, Tim Rocktäschel, Wendelin Boehmer. I am extremely grateful
to Francis Song from DeepMind for believing in the ‘Bayesian Action Decoder’
long before it started being a real thing. Without his support this section of the
thesis would not have been possible. I am also thankful to Mike Bowling, Marc
Bellemare, Marc Lanctot, Neil Burch, Nolan Bard, and others from DeepMind for
the long lasting collaboration around Hanabi and to Thore Graepel for valuable
advice. My thanks also go to Richard Chen and Maruan Al-Shedivat for being
fantastic collaborators, and to Pieter Abbeel for his mentorship, during the LOLA
project and beyond. I am indebted to DeepMind for the Oxford-Google-DeepMind
scholarship. I am grateful to my close and/or long term friends for mostly bearing
with me: Christoph, Jenny, the Rasumov & Ahnen families, Raphaele C. & M.,
Jakob, Julia, Brendan, Andreas, Dave, Dhruv, Molly, Kirill, Kasim, Linus, Adri,
Yoni, Avital, Oiwi, Matthias, Cinjon, Paula, Fred, Paulina, Minqi, Valerie, Karol,
Olesya, John, Javier, Katrina, Ashly, Oli, Tim, Lisa, Lars, Anne, Thais, Diego,
Jaleh, James, Anshia, Ashish, Susanna, Ben, Fernanda, Will, Ian, Ryan, Tasha,
Ronny, Leo, Ursin, Remo, and others. I thank Angelique for the walk we took in the
alps in 2015, which inspired my thesis topic. I would also like to thank some of those
that inspired and mentored me along the way: Odiê & Elias, Edgar, Harald and
Sven, to name a few. Last but not least, I would like to thank my family: My parents,
Bärbel and Claus, for always believing in me and encouraging me to be who I am.
My brothers, Fridolin, Till, Moritz, and Peter, for continuously challenging me and
helping me to grow. May the sun always shine on your paths.
Abstract
A plethora of real-world problems, such as the control of autonomous vehicles
and drones, packet delivery, and many others, consist of a number of agents that
need to take actions based on local observations; they can thus be formulated in
the multi-agent reinforcement learning (MARL) setting. Furthermore, as more
machine learning systems are deployed in the real world, they will start having
impact on each other, effectively turning most decision making problems into multi-
agent problems. In this thesis we develop and evaluate novel deep multi-agent RL
(DMARL) methods that address the unique challenges which arise in these settings.
These challenges include learning to collaborate, to communicate, and to reciprocate
amongst agents. In most of these real world use cases, during decentralised execution,
the final policies can only rely on local observations. However, in many cases it
is possible to carry out centralised training, for example when training policies
on a simulator or when using extra state information and free communication
between agents during the training process.
The first part of the thesis investigates the challenges that arise when multiple
agents need to learn to collaborate to obtain a common objective. One difficulty is
the question of multi-agent credit assignment: Since the actions of all agents impact
the reward of an episode, it is difficult for any individual agent to isolate the impact
of their actions on the reward. In this thesis we propose Counterfactual Multi-Agent
Policy Gradients (COMA) to address this issue. In COMA each agent estimates the
impact of their action on the team return by comparing the estimated return with
a counterfactual baseline. We also investigate the importance of common knowledge
for learning coordinated actions: In Multi-Agent Common Knowledge Reinforcement
Learning (MACKRL) we use a hierarchy of controllers that condition on the common
knowledge of subgroups of agents in order to either act in the joint-action space
of the group or delegate to smaller subgroups that have more common knowledge.
The key insight here is that all policies can still be executed in a fully decentralised
fashion, since each agent can independently compute the common knowledge of
the group. In MARL, since all agents are learning at the same time, the world
appears nonstationary from the perspective of any given agent. This can lead to
learning difficulties in the context of off-policy reinforcement learning which relies
on replay buffers. In order to overcome this problem we propose and evaluate a
metadata fingerprint that effectively disambiguates training episodes in the replay
buffer based on the time of collection and the randomness of policies at that time.
So far we have assumed the agents act fully decentralised, i.e., without directly
communicating with each other. In the second part of the thesis we propose three
different methods that allow agents to learn communication protocols. The first
method, Differentiable Inter-Agent Learning (DIAL), uses differentiation across a
discrete communication channel (specifically a cheap-talk channel) during centralised
training to discover a communication protocol suited for solving a given task. The
second method, Reinforced Inter-Agent Learning (RIAL), simply uses RL for learning
the protocol, effectively treating the messages as actions. Neither of these methods
directly reasons over the beliefs of the agents. In contrast, when humans observe
the actions of others, they immediately form theories about why a given action
was taken and what this indicates about the state of the world. Inspired by this
insight, in our third method, the Bayesian Action Decoder (BAD), agents directly
consider the beliefs of other agents using an approximate Bayesian update and
learn to communicate both through observable actions and through grounded
communication actions. Using BAD we obtain the best known performance on
the imperfect information, cooperative card game Hanabi.
While in the first two parts of the thesis all agents are optimising a team
reward, in the real world there commonly are conflicting interests between different
agents. This can introduce learning difficulties for MARL methods, including
unstable learning and the convergence to poorly performing policies. In the third
part of the thesis we address these issues using Learning with Opponent-Learning
Awareness (LOLA). In LOLA agents take into account the learning behaviour
of the other agents in the environment and aim to find policies that shape the
learning of their opponents in a way that is favourable to themselves. Indeed,
instead of converging to the poorly performing defect-defect equilibrium in the
iterated prisoner’s dilemma, LOLA agents discover the tit-for-tat strategy. LOLA
agents effectively reciprocate with each other, leading to overall higher returns. We
also introduce the Infinitely Differentiable Monte-Carlo Estimator (DiCE), a new
computational tool for estimating the higher order gradients that arise when one
agent is accounting for the learning behaviour of other agents in the environment.
Beyond being useful for LOLA, DiCE is also a general-purpose objective that
generates higher order gradient estimators for stochastic computation graphs when
differentiated in an auto-differentiation library.
To conclude, this thesis makes progress on a broad range of the challenges that
arise in multi-agent settings and also opens up a number of exciting questions for
future research. These include how agents can learn to account for the learning
of other agents when their rewards or observations are unknown, how to learn
communication protocols in settings of partial common interest, and how to account
for the agency of humans in the environment.
Contents
1 Introduction 1
1.1 The Industrial Revolution, Cognition, and Computers . . . . . . . . 1
1.2 Deep Multi-Agent Reinforcement-Learning . . . . . . . . . . . . . . 4
1.3 Overall Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 13
2.1 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Multi-Agent Settings . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Centralised vs Decentralised Control . . . . . . . . . . . . . . . . . 15
2.4 Cooperative, Zero-sum, and General-Sum . . . . . . . . . . . . . . . 16
2.5 Partial Observability . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Centralised Training, Decentralised Execution . . . . . . . . . . . . 17
2.7 Value Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.8 Nash Equilibria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.9 Deep Learning for MARL . . . . . . . . . . . . . . . . . . . . . . . 20
2.10 Q-Learning and DQN . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.11 Reinforce and Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . 23
I Learning to Collaborate 25
3 Counterfactual Multi-Agent Policy Gradients 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Multi-Agent StarCraft Micromanagement . . . . . . . . . . . . . . . 32
3.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Independent Actor-Critic . . . . . . . . . . . . . . . . . . . . 35
3.4.2 Counterfactual Multi-Agent Policy Gradients . . . . . . . . . 36
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Conclusions & Future Work . . . . . . . . . . . . . . . . . . . . . . 44
II Learning to Communicate 77
10 Afterword 169
References 173
List of Figures
3.1 Starting position with example local field of view for the 2d_3z map. 33
3.2 An example of the observations obtained by all agents at each time
step t. The function f provides a set of features for each unit in
the agent’s field of view, which are concatenated. The feature set
is {distance, relative x, relative y, health points, weapon
cooldown}. Each quantity is normalised by its maximum possible
value. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 In (a), information flow between the decentralised actors, the en-
vironment and the centralised critic in COMA; red arrows and
components are only required during centralised learning. In (b)
and (c), architectures of the actor and critic. . . . . . . . . . . . . 37
3.4 Win rates for COMA and competing algorithms on four different
scenarios. COMA outperforms all baseline methods. Centralised
critics also clearly outperform their decentralised counterparts. The
legend at the top applies across all plots. . . . . . . . . . . . . . . . 43
4.1 Three agents and their fields of view. A and B’s locations are common
knowledge to A and B as they are within each other’s fields of view.
However, even though C can see A and B, it shares no common
knowledge with them. . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Pairwise MACKRL. Left: the pair selector must assign agents to
pairs (plus a singleton in this case). Middle: the pair controller
can either partition the pair or select among the pair’s joint actions;
Right, at the last level, the controller must select an action for a
single agent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Matrix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Matrix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.6 Mean and standard error of the mean of test win rates for two levels
in StarCraft II: one with 5 marines (left), and one with 2 stalkers
and 3 zealots on each side (right). Also shown is the number of runs
[in brackets]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Results for the matrix game. . . . . . . . . . . . . . . . . . . . . . . 59
4.7 Delegation rate vs. number of enemies (2s3z) in the common knowl-
edge of the pair controller over training. . . . . . . . . . . . . . . . 61
6.1 In RIAL (a), all Q-values are fed to the action selector, which selects
both environment and communication actions. Gradients, shown
in red, are computed using DQN for the selected action and flow
only through the Q-network of a single agent. In DIAL (b), the
message m^a_t bypasses the action selector and instead is processed by
the DRU (Section 6.4.2) and passed as a continuous value to the next
C-network. Hence, gradients flow across agents, from the recipient
to the sender. For simplicity, at each time step only one agent is
highlighted, while the other agent is greyed out. . . . . . . . . . . . 87
6.2 DIAL architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.3 Switch: Every day one prisoner gets sent to the interrogation room
where he sees the switch and chooses from “On”, “Off”, “Tell” and
“None”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.4 Switch: (a-b) Performance of DIAL and RIAL, with and without (
-NS) parameter sharing, and NoComm-baseline, for n = 3 and n = 4
agents. (c) The decision tree extracted for n = 3 to interpret the
communication protocol discovered by DIAL . . . . . . . . . . . . 94
7.4 a) Training curves for BAD on Hanabi and the V0 and V1 baseline
methods using LSTMs rather than the Bayesian belief. The thick
line for each agent type shows the final evaluated agent for each
type; upward kinks are generally due to agents ‘evolving’ in PBT
by copying the weights and hyperparameters (plus perturbations)
from a superior agent. b) Distribution of game scores for BAD on
Hanabi under testing conditions. BAD achieves a perfect score in
nearly 60% of the games. The dashed line shows the proportion of
perfect games reported for SmartBot, the best known heuristic for
two-player Hanabi. c) Per-card cross entropy with the true hand for
different belief mechanisms in Hanabi during BAD play. V0 is the
basic belief based on hints and card counts, V1 is the self-consistent
belief, and V2 is the BAD belief which also includes the Bayesian
update. The BAD agent conveys around 40% of the information via
conventions, rather than grounded information. . . . . . . . . . . . 115
1 Introduction
Contents
1.1 The Industrial Revolution, Cognition, and Computers 1
1.2 Deep Multi-Agent Reinforcement-Learning . . . . . . . 4
1.3 Overall Structure . . . . . . . . . . . . . . . . . . . . . . 7
1.1 The Industrial Revolution, Cognition, and Computers
of cognition, at first sight there are striking differences between computers and
brains. While computers operate using deterministic binary gates implemented in
silicon, the brain uses noisy biological neurones with probabilistic firing patterns [3].
Furthermore, there is a stark difference between the instruction sets that the brain
and the computer execute: Computers require carefully constructed programs to
carry out computation, in which any single bit can cause failure. In contrast, all
neural-programs running on the brain are learned via repeated interaction with
the environment, rather than provided by a programmer.
Historically, artificial intelligence (AI) research has been focused on recreating
human-like reasoning abilities through expert systems that execute rule sets provided
by the designer [4]. However, while recognising a dog in a picture is a trivial task for
most humans, specifying a set of rules that achieve this reliably across a variety of
different points of view and backgrounds has proven to be a superhuman challenge.
Machine Learning (ML) is an alternative approach for bringing cognitive abilities
to machines. Importantly, in the ML paradigm the designer no longer needs to specify
a set of rules for recognising a dog. Instead, it is sufficient to specify a set of learning
rules which in conjunction with a labeled dataset of examples allow the algorithm to
extract decision rules. Over the last 30 years, this approach has proven successful and
transformed many areas of modern life. Example learning algorithms include linear
regression, support-vector-machines [5], gaussian processes [6], and many others.
During the last decade, Deep Learning [7], in particular has shown tremendous
success. Prominent success stories include speech-recognition [8], image recogni-
tion [9], lip reading [10], and language translation [11] amongst many others. All of
these success stories consist of a large dataset of inputs and desired outputs, a setting
commonly referred to as ‘supervised learning’. Importantly, in supervised learning
the training dataset is always assumed to be independent of the classification
decisions that the algorithm makes.
This assumption is violated as soon as the classification decisions that the
algorithm makes actively change the future training data. One area where this
commonly occurs is when algorithms take actions that affect a stateful environment.
For example, when a self-driving car takes a specific action during training, this
changes the kind of data and experiences that the car is exposed to in the later parts
of the training process. A large number of real world problems fall into this category.
For example, a ranking algorithm has impact on the kind of decisions users are
facing and thus changes the future training data. Similarly, when a cleaning robot
knocks over a flower pot, this changes the future state distribution.
All of these problems can be formalised in the Reinforcement Learning (RL)
framework [12]. In RL, an agent, e.g., the robot, sequentially interacts with the
environment, e.g., the living room, by taking actions based upon the observation it
makes, e.g., the camera input. The action space defines which actions are available
to the agent, for the cleaning robot these might be navigation actions such as
‘move left’, ‘move right’, and so on. At every time-step the agent receives both
an observation and a reward from the environment.
The agent’s mapping from observations to actions is called the ‘policy’ and RL
aims to find a policy that maximises the expected sum of discounted rewards across
an episode. Here ‘discounted’ means that rewards occurring later in the episode,
i.e., further in the future, have less importance than early ones. An episode consists
of a sequence of observations, rewards, and actions and ends whenever a ‘terminal
state’ is reached, in which no further rewards are given and from which the agent is
unable to leave. Importantly, the action chosen can change the immediate reward,
but also changes the probability distribution over the next state, which in turn
impacts future rewards. Furthermore, the agent is not provided a priori with the
rules governing the state-transition probabilities or the reward function, but rather
has to learn them from interaction with the environment.
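To make the interaction loop and the discounted return concrete, the sketch below runs one episode against a generic environment; the `env` and `policy` objects and their interface are hypothetical placeholders, not part of any specific framework used in this thesis.

```python
# Minimal sketch of one episode of agent-environment interaction, accumulating
# the discounted return. `env` and `policy` are hypothetical placeholders: any
# environment exposing reset()/step() and any observation-to-action mapping work.

def run_episode(env, policy, gamma=0.99):
    obs = env.reset()              # initial observation
    discounted_return = 0.0
    discount = 1.0                 # gamma^t, starting at gamma^0 = 1
    done = False
    while not done:
        action = policy(obs)                    # policy maps observation to action
        obs, reward, done = env.step(action)    # environment transition and reward
        discounted_return += discount * reward  # later rewards are discounted more
        discount *= gamma
    return discounted_return
```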
Deep RL (DRL) refers to a subcategory of RL in which deep neural networks [7]
are used as function approximators. In particular DRL allows agents to process high
dimensional inputs and learn relevant feature representations as well as policies. This
comes at the cost of requiring a large number of tuneable parameters. Fortunately
these parameters can be trained efficiently using backpropagation. In recent years
DRL has successfully been applied to a number of domains, including the playing
of Atari games [13], Go [14], and other challenging settings. All algorithms in
this thesis can be applied to the DRL setting.
Learning to collaborate includes the challenge of nonstationarity that arises from multiple
agents learning at the same time, rendering the learning problem continuously
changing, or nonstationary. It also includes the problem of multi-agent credit
assignment: Due to the large number of agents taking actions, it is commonly
unclear whether a given action by a specific agent had an overall positive or negative
effect. Effectively the other agents in the environment act as confounding factors in
the reward attribution of a given agent. Furthermore, the optimal action of each
agent can depend crucially on the (unobserved) action selection of other agents,
making the learning of coordinated policies challenging.
Communication addresses the challenges associated with learning communication
protocols. In many real world settings agents have to take decentralised actions
but have access to a limited bandwidth discrete communication channel. Learning
how to exchange information through this channel in a way that is useful for
solving a given task is a hard problem.
While all of the challenges addressed so far appear in fully cooperative settings
(i.e., when agents are aiming to maximise a joint team-reward), reciprocity is a
difficult challenge that appears in general-sum settings. In these settings agents
can commonly obtain higher rewards if they manage to encourage other agents to
collaborate with them. Humans naturally reciprocate with other humans, even in
situations of military conflict [16], but bringing these abilities to learning algorithms
is an open problem.
When addressing these problems we will commonly take advantage of centralised
training with decentralised execution: For many real world problems training can be
carried out in a centralised fashion, for example by using a simulator or providing
extra state information during the training process, while during execution each
agent needs to choose their action independently based on local observations only.
Importantly, centralised training can greatly facilitate the learning process in
multi-agent settings, assuming algorithms are able to exploit the central state
information during training without requiring it during execution. This setting of
centralised training and decentralised execution in MARL is thus an important
avenue for deploying reinforcement learning algorithms in the real world and is
used throughout this thesis. Interestingly, we can use centralised training even
in non-cooperative settings in order to learn better strategies, since the central
state information is not required during execution.
Background
of selecting this (optimal) action. This issue arises in particular in NL since each
agent tracks the expected reward as a function of its own action selection, omitting
the choices made by other agents.
In Chapter 3 we propose Counterfactual Multi-Agent Policy Gradients (COMA).
COMA exploits the centralised training regime by using a centralised critic which
learns a value function that conditions on the central state and the joint-action
of all agents. Inspired by difference rewards [18], we use this value function to
calculate a counterfactual baseline. This baseline is an estimate of what would have
happened on average had the agent chosen a different action. Applied to a multi-
agent version of StarCraft micromanagement, we find that COMA outperforms
a set of strong baselines.
While COMA learns a joint value function, the policies of the agents are
fully factorised. In other words, actions are sampled independently such that the
probability of a joint action is the product of probabilities across the different
agents. In some settings these kinds of factorised policies will not be able to
learn optimal strategies.
In particular, whenever the optimal action selection depends crucially on the
other agents also selecting the optimal action, the exploration of one agent can shift
the best response of the other agent away from the optimal action.
These kinds of settings can in principle be solved by centralised controllers which
learn to act in the joint action space. However, due to partial observability these
centralised policies cannot in general be executed in a decentralised fashion.
In Chapter 4 we introduce Multi-Agent Common Knowledge Reinforcement
Learning (MACKRL). MACKRL uses the common knowledge of a group of agents
in order to learn a joint-action policy which can be executed in a fully decentralised
fashion. Here the common knowledge of a group of agents are those things that
all agents know and that all agents know that all agents know, ad infinitum.
Interestingly, in a variety of MARL settings the agents can observe other agents
and thereby form common knowledge. In particular MACKRL relies on hierarchical
controllers, which can either assign a joint action for a group of agents or decide
COMA and MACKRL are on-policy methods, i.e., they use training data that
was collected under the current policy. However, off-policy methods such as Q-
learning can provide better sample efficiency. In order to stabilise learning, off-policy
DRL relies heavily on using a replay memory: During training experiences are
stored in the replay memory and then sampled randomly to provide a diverse
range of state-action pairs to the agent.
These chapters are based on the following papers and pre-prints. Here and
throughout, the ‘*’ indicates equal contribution:
This part of the thesis is based on the following publications and pre-prints:
All of the methods proposed so far assume a fully cooperative setting in which
agents learn to cooperate and coordinate in order to maximise a team reward.
However, in many real world problems agents aim to maximise diverse, individual
rewards, potentially leading to a conflict of interest between different agents. For
example, each driver typically wants to reach their destination as soon as possible,
rather than to maximise the overall efficiency of the traffic. Game theory has a long
history of studying optimal strategies in these settings. Here the core concept is
a Nash equilibrium, which is achieved when none of the agents can improve their
return through a unilateral change in policy. However, game theory commonly
assumes that all Nash equilibria are known and can be computed, which is generally
not the case in RL settings. In particular, in MARL the agents have to rely on
interaction with the environment in order to learn any policy in the first place.
While NL has a surprisingly strong track record in fully cooperative settings, in
general sum settings issues can arise. First of all, all agents maximising their own
objective can lead to unstable learning behaviour. Secondly, agents can fail to
reciprocate, leading to convergence at Nash equilibria in which all agents are worse
off. Both of these issues are due to the fact that the other agents are treated as
a static part of the environment. Learning with Opponent-Learning Awareness
(LOLA) aims to overcome these issues and to allow the agents to converge to Nash
equilibria with high returns. Rather than assuming that other agents are static,
each agent instead assumes others are naive learners and optimises the expected
return after one step of opponent-learning. Importantly, the agent can differentiate
through the learning step of the opponent, leading to a shaping of their policy.
LOLA agents manage to discover the famous tit-for-tat strategy in the iterated
prisoner’s dilemma. One technical difficulty is that differentiating through the
learning step of an agent produces a higher order derivative which needs to be
estimated using samples from the environment. In Chapter 9 we introduce DiCE:
The Infinitely Differentiable Monte-Carlo Estimator (DiCE), a new way to estimate
higher order gradients in stochastic computation graphs.
This part is based on the following publications:
2 Background

2.1 Reinforcement Learning
A fully observable, single-agent RL problem can be formalised as a Markov Decision
Process (MDP), consisting of a state space S, a transition function P, a
reward function R(s_t, u_t), an action space U, and a discount factor, γ. The transition
function defines the probability, P(s_{t+1} | s_t, u_t), of the next state, s_{t+1}, given a
current state, s_t, and action, u_t. Here S is the state space and U is the action
space. While some of the methods proposed in this thesis extend to continuous
actions, we assume discrete action spaces throughout. Since the probability of the
next state conditions only on the current state and action, the process is Markovian. In
fully observable settings the agent observes the Markov state of the environment,
s_t, and chooses an action u_t ∈ U from their policy π(u_t | s_t) : S × U → [0, 1]. The
environment then transitions to a next state, s_{t+1} ∼ P(s_{t+1} | s_t, u_t), and provides
a reward, r_t. Both the transition function and the reward function are unknown to
the agent, but instead have to be discovered by interacting with the environment.
The task also contains a distribution over initial states, P(s_0).

The goal of the agent is to update the policy in order to maximise the total
expected discounted return, $R_t = \sum_{l=0}^{\infty} \gamma^l r_{t+l}$, from time step t
onwards. Using the Markov assumption we can express the probability of a trajectory as
$$P(s_0) \prod_{t=0}^{T} P(r_t \mid s_{t+1}, s_t, u_t)\, P(u_t \mid s_t)\, P(s_{t+1} \mid s_t, u_t).$$
In this expression we recognise both the transition function, P(s_{t+1} | s_t, u_t), and
the policy, P(u_t | s_t) = π(u_t | s_t).
2.2 Multi-Agent Settings
Across this thesis we will be considering multi-agent settings. One way to formalise
them is as a stochastic game [19], G, defined by a tuple G = ⟨S, U, P, r, Z, O, n, γ⟩,
in which n agents identified by a ∈ A ≡ {1, ..., n} choose sequential actions. As
in the single agent setting the environment has a true state s ∈ S. At each time
step, each agent takes an action u^a ∈ U, forming a joint action u ∈ U ≡ U^n,
which induces a transition in the environment according to the state transition
function P(s' | s, u) : S × U × S → [0, 1]. Here U ≡ U^n is the joint action space; the
reward function specifies an agent-specific reward, r(s, u, a) : S × U × A → R; and, as
before, γ ∈ [0, 1] is a discount factor.
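To make the stochastic-game tuple concrete, the following sketch outlines a minimal environment interface in which all n agents submit a joint action and receive agent-specific observations and rewards; the class and method names are illustrative only and do not correspond to any particular library.

```python
from typing import List, Tuple

class StochasticGame:
    """Illustrative interface mirroring the tuple <S, U, P, r, Z, O, n, gamma>.
    Names and signatures are assumptions made for exposition."""

    def __init__(self, n_agents: int, gamma: float = 0.99):
        self.n_agents = n_agents
        self.gamma = gamma

    def reset(self) -> List[object]:
        """Sample an initial state s_0 and return each agent's observation o^a."""
        raise NotImplementedError

    def step(self, joint_action: List[int]) -> Tuple[List[object], List[float], bool]:
        """Apply the joint action u, sample s' ~ P(.|s, u), and return per-agent
        observations z^a = O(s', a), per-agent rewards r(s, u, a), and a done flag."""
        raise NotImplementedError
```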
We denote joint quantities over agents in bold, e.g., u, and joint quantities over
agents other than a given agent a with the superscript −a, e.g., u−a .
2.3 Centralised vs Decentralised Control

One option is centralised control, in which a single controller selects a joint action
conditioned on the full state. However, there are two fundamental challenges with this approach: First of all
the joint action space, U, is exponential in the number of agents and, secondly,
in many real world settings agents have to act based on their local observations,
oa , making centralised control impossible.
Due to the two reasons above, this thesis focuses on decentralised control. In
this setting each agent has a local policy, π a (ua |st ), which maps from states (or
observations) to a probability distribution over actions for the given agent. We
note that this factorises the probability distribution over the joint action:
$$\boldsymbol{\pi}(\mathbf{u} \mid s_t) = \prod_a \pi^a(u^a \mid s_t).$$
This resolves both of the challenges stated above: Rather than considering an
exponential action space, each agent only needs to consider their own action space.
Furthermore, in partially observable settings (see below) each of the policies can
condition on the local observations of the agent.
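The following fragment sketches decentralised execution under such a factorised policy: each agent samples its own action from its own distribution, and the probability of the resulting joint action is the product of the per-agent probabilities. The `agent_policies` callables are hypothetical stand-ins for learned per-agent policies.

```python
import numpy as np

def sample_joint_action(agent_policies, local_obs):
    """Each pi_a(obs_a) is assumed to return a probability vector over agent a's
    actions; actions are sampled independently, so the joint probability factorises."""
    joint_action, joint_prob = [], 1.0
    for pi_a, obs_a in zip(agent_policies, local_obs):
        probs = pi_a(obs_a)                            # pi^a(. | o^a)
        u_a = int(np.random.choice(len(probs), p=probs))
        joint_action.append(u_a)
        joint_prob *= probs[u_a]                       # product over agents
    return joint_action, joint_prob
```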
2.4 Cooperative, Zero-sum, and General-Sum

In fully cooperative settings all agents share, and jointly maximise, a single team
reward. The opposite is the zero-sum setting, in which the reward summed across
agents is zero in all states:
$$\sum_a r(s, \mathbf{u}, a) = 0, \quad \forall s, \mathbf{u}. \qquad (2.4.2)$$
In zero-sum settings one agent’s gain is the loss of the other agents in the environment.
A middle ground are general-sum settings. These include cases in which agents
are neither fully cooperative nor entirely adversarial.
Depending on the setting, different challenges arise: For example, partially
observable, cooperative settings are well suited for investigating methods that
learn communication protocols. In contrast, in general-sum settings, such as the
iterated prisoner’s dilemma (see Chapter 8) agents may need to learn to reciprocate
and in zero-sum settings they need to avoid the instabilities arising from different
agents optimising opposing losses.
The reward structure of the problem and the number of agents also change
the properties of and the relationship between the different Nash equilibria, as
explained in Section 2.8.
agents will need to learn a sufficient statistic for the action-observation history τ a ,
e.g., using function approximators such as recurrent neural networks. Note that
throughout the thesis we do not consider the rewards as being directly observable
state spaces and the formalism for common knowledge. In Chapter 7 we introduce
In this thesis we focus on settings in which agents need to take actions based on
their local observations during execution. However, during training they may
utilise extra information (such as access to the Markov state) or free communication
between the agents, as long as the final policies do not rely on this information.
This is commonly possible in real world settings, e.g., when training is carried out
on a simulator while the final policies are later deployed on real systems. Centralised
training for decentralised execution is a standard approach for Dec-POMDP
solving [20] and has recently seen application in deep MARL. Even in non-cooperative
settings it is possible to carry out centralised training, as long as the final policies
do not require access to the centralised information during execution.
The agents' joint policy induces a value function, i.e., an expectation over the return
R_t^a given a current state, s_t:
$$V^a(s_t) = \mathbb{E}\left[R_t^a \mid s_t\right],$$
as well as a state-action value function, $Q^a(s_t, \mathbf{u}_t) = \mathbb{E}\left[R_t^a \mid s_t, \mathbf{u}_t\right]$, and an
advantage function, $A^a(s_t, \mathbf{u}_t) = Q^a(s_t, \mathbf{u}_t) - V^a(s_t)$.
The latter is called the advantage function since it measures the difference between
the expected return given both the state and the action and the expected
return of just being in the state, s_t. In other words, it measures the increase (or
decrease) in expected return due to the agents having chosen action u_t.
So far these value functions are presented as mathematical entities and definitions.
In Section 2.10 and Section 2.11 we introduce methods for training value functions
that are parameterised as deep neural networks, using samples generated via
interaction with the environment. Throughout this thesis we are using model-free
methods, building on the success of deep RL in the last few years. For more
background on RL, including model-based approaches and planning please see [21].
2.8 Nash Equilibria
When multiple agents are learning to maximise their own reward, it is unclear what
the right metric for measuring success is; see, e.g., Shoham, Powers, Grenager, et al.
[22] for more details. For example, in zero-sum settings the overall reward
summed across agents is always zero, independent of the policy.
One important concept here is the Nash equilibrium [23]: A Nash equilibrium
describes any set of policies π*, such that the return for each agent a, given the
policies of the other agents, π*^{-a}, is maximised under π*^a:
$$\forall a: \quad J^a(\pi^{*a}, \pi^{*-a}) \geq J^a(\pi^{a}, \pi^{*-a}), \quad \forall \pi^a. \qquad (2.8.1)$$
Clearly, the goal of converging to Nash equilibria is a good starting point for
multi-agent learning algorithms. However, there commonly can be multiple Nash
equilibria with different payouts, leading to greater difficulty in evaluating multi-
agent learning compared to single agent RL. It can also lead to agents converging
to Nash equilibria in which every single agent is worse off than in an alternative
equilibrium. We provide example failure cases in Chapter 8.
The existence of multiple Nash equilibria with different payouts can also lead to
local minima in the learning dynamics of multi-agent problems: Whenever agents
have converged to a Nash equilibrium, there is no incentive for either of the agents
to change their strategy unilaterally, even though everyone might be able to improve
their return under a coordinated switch. This holds even in cooperative scenarios,
a setting which we expand on and propose a solution to in Chapter 4.
Even multiple equilibria with identical payouts can be problematic: When
different agents are playing their component of different Nash strategies, all
agents can end up receiving lower returns. This shows that, in general, any
single agent playing their component of a Nash strategy does not provide any
performance guarantees.
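A small two-player coordination game illustrates both failure modes; the payoffs below are purely illustrative and not taken from any experiment in this thesis:
$$
\begin{array}{c|cc}
 & B_1 & B_2 \\ \hline
A_1 & (10, 10) & (0, 0) \\
A_2 & (0, 0) & (1, 1)
\end{array}
$$
Both (A_1, B_1) and (A_2, B_2) are Nash equilibria, but the latter yields strictly lower returns for both players; and if the row player follows its component of one equilibrium while the column player follows its component of the other, both receive zero.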
A special class of games are two-player zero-sum settings. In these settings
different Nash equilibria are interchangeable: playing their component, π*^a, of any
Nash equilibrium guarantees agent a at least the equilibrium return, regardless of
the policy the other player follows:
$$J^a(\pi^{*a}, \pi^{-a}) \geq J^a(\pi^{*a}, \pi^{*-a}), \quad \forall \pi^{-a}. \qquad (2.8.2)$$
While in game theory it is common to assume that Nash equilibria are known
or can be solved for, in MARL we typically investigate settings which are too
complex for solving them exactly. Instead, policies have to be learned based on
repeated interaction with the environment.
Another important concept is the best response: Given a set of policies π −a , the
best response to π −a is the policy π̂ a (π −a ) that maximises J a (π a , π −a ):
$$\hat{\pi}^a(\pi^{-a}) = \operatorname*{arg\,max}_{\pi^a} J^a(\pi^a, \pi^{-a}). \qquad (2.8.3)$$
In settings where we do not share parameters, θa are the weights of the policy of
agent a. We furthermore use superscripts, θ^π and θ^C, to distinguish between weights
for the policy and the value function / critic when this is ambiguous.
Another property of deep learning that we use commonly in this thesis is
that neural networks can be differentiated efficiently using the aforementioned
backpropagation algorithm. We use this, for example, in Chapter 6, where agents
learn to communicate by differentiating through a communication channel during
training. In Chapter 9 we use the auto-differentiation mechanism in order to
construct gradient estimators.
2.10 Q-Learning and DQN

The Bellman optimality recursion, $Q^*(s, u) = \mathbb{E}_{s'}\left[r + \gamma \max_{u'} Q^*(s', u')\right]$,
is the identity underlying Q-learning. In deep Q-learning (DQN) [13], the
Q-function is represented by a deep neural network parameterised by θ, Q(s, u; θ).
DQNs can be optimised by minimising the Bellman error, using the recursion above:
$$\mathcal{L}_i(\theta_i) = \mathbb{E}_{s,u,r,s'}\left[\big(y_i^{\text{DQN}} - Q(s, u; \theta_i)\big)^2\right]$$
at each iteration i, with target $y_i^{\text{DQN}} = r + \gamma \max_{u'} Q(s', u'; \theta_i^-)$. Here, θ_i^- are
the parameters of a target network that is kept constant for a number of iterations
while updating the online network Q(s, u; θ_i) by gradient descent. The action u is
chosen from Q(s, u; θ_i) by an action selector, which typically implements an ε-greedy
policy that selects the action that maximises the Q-value with a probability of
1 − ε and chooses randomly with a probability of ε in order to ensure that agents
keep exploring the environment. In order to stabilise learning and to improve
sample efficiency, DQN also uses experience replay: during learning, the agent
builds a dataset of episodic experiences and then trains by sampling mini-batches of
experiences. Maintaining the experience replay memory also prevents the network
from overfitting to recent experiences.
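For concreteness, a single DQN update step under these definitions might look as follows; `q_net`, `target_net`, and the batch layout are assumptions for illustration, not the exact implementation used later in this thesis.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimiser, batch, gamma=0.99):
    """One gradient step on the squared Bellman error for a sampled mini-batch.
    q_net(s) and target_net(s) are assumed to return Q-values of shape [batch, |U|];
    u is a LongTensor of chosen actions, done a float mask of terminal states."""
    s, u, r, s_next, done = batch
    q_taken = q_net(s).gather(1, u.unsqueeze(1)).squeeze(1)   # Q(s, u; theta_i)
    with torch.no_grad():
        # target y = r + gamma * max_u' Q(s', u'; theta_i^-), no bootstrap at terminal states
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_taken, y)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```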
Tampuu et al. [26] address this setting with a framework that combines DQN
with IQL: In their work, each agent a independently and simultaneously learns its
own Q-network, Qa (s, ua ; θia ). While independent Q-learning can in principle lead
to convergence problems (since one agent’s learning makes the environment appear
nonstationary to other agents), it has a strong empirical track record [17, 27], and was
successfully applied to two-player pong. Note that in IQL each agent independently
estimates the total return for the episode given their action and the state.
Deep Recurrent Q-Networks. Both DQN and IQL assume full observability,
i.e., the agent receives the Markov state of the environment, st , as input. By
contrast, in the partially observable environments we consider, st is hidden and the
agent receives only an observation oat that is correlated with st , but in general
does not disambiguate it.
2.11 Reinforce and Actor-Critic

Critics in this thesis are typically trained using TD(λ) targets,
$$y^{(\lambda)}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},$$
where the n-step returns, $G_t^{(n)} = \sum_{l=1}^{n} \gamma^{l-1} r_{t+l} + \gamma^{n} V(s_{t+n})$, are calculated
using the first n rewards and a bootstrapped value estimate.
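The sketch below computes these quantities for a single finished episode from its rewards and bootstrapped value estimates; it follows the standard definitions and is not tied to any particular critic architecture in this thesis.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): the first n discounted rewards plus a bootstrapped value estimate.
    rewards[t] is r_{t+1}; values has length T+1, with values[T] = 0 at termination."""
    T = len(rewards)
    n = min(n, T - t)                           # truncate at the end of the episode
    g = sum(gamma ** l * rewards[t + l] for l in range(n))
    return g + gamma ** n * values[t + n]

def lambda_return(rewards, values, t, gamma, lam):
    """y_t^(lambda): exponentially weighted mixture of n-step returns."""
    T = len(rewards)
    total, weight = 0.0, (1.0 - lam)
    for n in range(1, T - t):                   # weights (1 - lambda) * lambda^(n-1)
        total += weight * n_step_return(rewards, values, t, n, gamma)
        weight *= lam
    # the remaining weight lambda^(T-t-1) falls on the full return to episode end
    total += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total
```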
This concludes the common background section, covering the most important
common concepts for understanding the rest of the thesis. Further required
background is introduced within the different chapters as needed.
Part I
Learning to Collaborate
Abstract
In this part of the thesis we focus on methods that allow agents to learn to collaborate
in cooperative, partially observable multi-agent systems. All methods are applied
to a multi-agent version of StarCraft micromanagement. In this formulation of the
problem each unit corresponds to a learning agent that needs to select an action
based on local observations. In Chapter 3 we address multi-agent credit assignment
using a centralised, counterfactual baseline. In Chapter 4 we propose a common
knowledge based learning algorithm that allows agents to learn a centralised policy
which can be executed in a decentralised fashion. While the first two chapters
in this part are based on actor-critic algorithms, in Chapter 5 we address the
nonstationarity that arises when using DQN in a multi-agent context.
3 Counterfactual Multi-Agent Policy Gradients
Contents
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Multi-Agent StarCraft Micromanagement . . . . . . . 32
3.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.1 Independent Actor-Critic . . . . . . . . . . . . . . . . . 35
3.4.2 Counterfactual Multi-Agent Policy Gradients . . . . . . 36
3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Conclusions & Future Work . . . . . . . . . . . . . . . . 44
3.1 Introduction
Prior approaches to StarCraft micromanagement rely on centralised control with access to
powerful macro-actions, using StarCraft's built-in planner, that combine movement and attack
actions. To produce a meaningfully decentralised benchmark that proves challenging
for scenarios with even relatively few agents, we propose a variant that massively
reduces each agent’s field-of-view and removes access to these macro-actions.
Our empirical results on this new benchmark show that COMA can significantly
improve performance over other multi-agent actor-critic methods, as well as ablated
versions of COMA itself. In addition, COMA’s best agents are competitive with
state-of-the-art centralised controllers which are given access to full state information
and macro-actions.
architecture of the controllers exploits the multi-agent nature of the problem. Usunier
et al. [40] use a greedy MDP, which at each timestep sequentially chooses actions
for agents given all previous actions, in combination with zero-order optimisation,
while Peng et al. [41] use an actor-critic method that relies on RNNs to exchange
information between the agents. Since Usunier et al. [40] address similar scenarios
to our experiments and implement a DQN baseline in a fully observable setting, in
Section 3.5 we report our competitive performance against these state-of-the-art
baselines, while maintaining decentralised control. Omidshafiei et al. [50] also
address the stability of experience replay in multi-agent settings, but assume a
fully decentralised training regime.
Rashid et al. [51] and Sunehag et al. [52] propose learning a centralised value
function that factors into per-agent components to allow for decentralisation. Lowe et
al. [53] propose a single centralised critic that conditions on all available information
during training and is used to train decentralised actors; this work was done
concurrently with the work presented here. None of these approaches explicitly
address the question of multi-agent credit assignment.
Lowe et al. [53] concurrently propose a multi-agent policy-gradient algorithm using
centralised critics. Their approach does not address multi-agent credit assignment.
Unlike our work, it learns a separate centralised critic for each agent and is applied
to competitive environments with continuous action spaces.
Our work builds directly off of the idea of difference rewards [18]. The relationship
of COMA to this line of work is discussed in Section 6.4.
Settings in which difference rewards have previously been applied typically
have full simulators with controlled randomness that can be freely set to any
state in order to perfectly replay experiences. This makes it possible, though
computationally expensive, to compute difference rewards via extra simulations.
In StarCraft, as in the real world, this is not possible.
3.3 Multi-Agent StarCraft Micromanagement

In this and the next two chapters, we focus on the problem of micromanagement
in StarCraft, which refers to the low-level control of individual units’ positioning
and attack commands as they fight enemies. This task is naturally represented
as a multi-agent system, where each StarCraft unit is replaced by a decentralised
controller. We consider several scenarios with symmetric teams, formed of: 3
marines (3m), 5 marines (5m), 5 wraiths (5w), or 2 dragoons with 3 zealots (2d_3z).
The enemy team is controlled by the StarCraft AI, which uses a set of reasonable
but suboptimal hand-crafted heuristics.
We allow the agents to choose from a set of discrete actions: move[direction],
attack[enemy_id], stop, and noop. In the StarCraft game, when a unit selects
an attack action, it first moves into attack range before firing, using the game’s
built-in pathfinding to choose a route. These powerful attack-move macro-actions
make the control problem considerably easier.
To create a more challenging benchmark that is meaningfully decentralised, we
impose a restricted field of view on the agents, equal to the firing range of the
weapons of the ranged units, shown in Figure 3.1. This departure from the standard
setup for centralised StarCraft control has three effects.
First, it introduces significant partial observability. Second, it means units can
only attack when they are in range of enemies, removing access to the StarCraft
macro-actions. Third, agents cannot distinguish between enemies who are dead and
those who are out of range and so can issue invalid attack commands at such
enemies, which results in no action being taken.

Figure 3.1: Starting position with example local field of view for the 2d_3z map.
Figure 3.2: An example of the observations obtained by all agents at each time step t.
The function f provides a set of features for each unit in the agent’s field of view, which are
concatenated. The feature set is {distance, relative x, relative y, health points,
weapon cooldown}. Each quantity is normalised by its maximum possible value.
This substantially increases the average size of the action space, which in turn increases the difficulty
of both exploration and control.
Under these difficult conditions, scenarios with even relatively small numbers of
units become much harder to solve. As seen in Table 3.1, we compare against a
simple hand-coded heuristic that instructs the agents to run forwards into range
and then focus their fire, attacking each enemy in turn until it dies. This heuristic
achieves a 98% win rate on m5v5 with a full field of view, but only 66% in our
setting. To perform well in this task, the agents must learn to cooperate by
positioning properly and focussing their fire, while remembering which enemy and
ally units are alive or out of view.
All agents receive the same global reward at each time step, equal to the sum
of damage inflicted on the opponent units minus half the damage taken. Killing
an opponent generates a reward of 10 points, and winning the game generates a
reward equal to the team’s remaining total health plus 200. This damage-based
reward signal is comparable to that used by Usunier et al. [40]. Unlike [41], our
approach does not require estimating local rewards.
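The shared reward can be summarised by a small helper; the numeric constants follow the description above, while the function and argument names are purely illustrative.

```python
def team_reward(damage_dealt, damage_taken, enemies_killed,
                won=False, remaining_team_health=0.0):
    """Global reward per time step: damage dealt minus half the damage taken,
    +10 per enemy killed, and remaining team health + 200 on winning the game."""
    reward = damage_dealt - 0.5 * damage_taken
    reward += 10.0 * enemies_killed
    if won:
        reward += remaining_team_health + 200.0
    return reward
```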
State Features. The actor and critic receive different input features,
corresponding to local observations and global state, respectively. Both include
features for allies and enemies. Units can be either allies or enemies, while agents
are the decentralised controllers that command ally units.
The local observations for every agent are drawn only from a circular subset of
the map centred on the unit it controls and include for each unit within this field of
view: distance, relative x, relative y, unit type and shield.2 All features
are normalized by their maximum values. We do not include any information
about the units’ current target.
The global state representation consists of similar features, but for all units on
the map regardless of fields of view. Absolute distance is not included, and x-y
locations are given relative to the centre of the map rather than to a particular
agent. The global state also includes health points and cooldown for all agents.
The representation fed to the centralised Q-function critic is the concatenation of
the global state representation with the local observation of the agent whose actions
are being evaluated. Our centralised critic that estimates V (s), and is therefore
agent-agnostic, receives the global state concatenated with all agents’ observations.
The observations contain no new information but include the egocentric distances
relative to that agent.
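As a sketch, a local observation of the kind described above can be assembled by concatenating normalised per-unit feature vectors for every unit inside the field of view; the attribute names, normalisation constants, and the feature set (taken from Figure 3.2) are assumptions made for illustration.

```python
import numpy as np

def unit_features(unit, agent, sight_range, max_health, max_cooldown):
    """Normalised features of one visible unit, relative to the observing agent."""
    dx, dy = unit.x - agent.x, unit.y - agent.y
    dist = np.hypot(dx, dy)
    return np.array([dist / sight_range, dx / sight_range, dy / sight_range,
                     unit.health / max_health, unit.cooldown / max_cooldown])

def local_observation(agent, units, sight_range, max_health, max_cooldown):
    """Concatenate feature vectors of all units within the agent's field of view."""
    feats = [unit_features(u, agent, sight_range, max_health, max_cooldown)
             for u in units
             if np.hypot(u.x - agent.x, u.y - agent.y) <= sight_range]
    return np.concatenate(feats) if feats else np.zeros(0)
```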
3.4 Methods
In this section, we describe approaches for extending policy gradients to our
multi-agent setting.
3.4.1 Independent Actor-Critic

The simplest way to apply policy gradients to multiple agents is to have each agent
learn independently, with its own actor and critic, from its own action-observation
history. This is essentially the idea behind independent Q-learning [25], which is
perhaps the most popular multi-agent learning algorithm, but with actor-critic in
place of Q-learning. Hence, we call this approach independent actor-critic (IAC).
In our implementation of IAC, we speed learning by sharing parameters among
the agents, i.e., we learn only one actor and one critic, which are used by all agents.
² After firing, a unit's cooldown is reset and must drop to zero before the unit can fire
again. Shields absorb damage until they break, after which units start losing health. Dragoons
and zealots have shields but marines do not.
The agents can still behave differently because they receive different observations,
including an agent-specific ID, and thus evolve different hidden states. Learning
remains independent in the sense that each agent’s critic estimates only a local
value function, i.e., one that conditions on ua , not u. Though we are not aware of
previous applications of this specific algorithm, we do not consider it a significant
contribution but instead merely a baseline algorithm.
We consider two variants of IAC. In the first, each agent’s critic estimates V (τ a )
and follows a gradient based on the TD error, as described in Section 7.2. In the
second, each agent’s critic estimates Q(τ a , ua ) and follows a gradient based on the
advantage: $A(\tau^a, u^a) = Q(\tau^a, u^a) - V(\tau^a)$, where $V(\tau^a) = \sum_{u^a} \pi(u^a \mid \tau^a)\, Q(\tau^a, u^a)$.
3.4.2 Counterfactual Multi-Agent Policy Gradients

The difficulties discussed above arise because, beyond parameter sharing, IAC fails
to exploit the fact that learning is centralised in our setting. In this section, we
propose counterfactual multi-agent (COMA) policy gradients, which overcome this
limitation. Three main ideas underlie COMA: 1) centralisation of the critic, 2) use
of a counterfactual baseline, and 3) use of a critic representation that allows efficient
evaluation of the baseline. The remainder of this section describes these ideas.
First, COMA uses a centralised critic. Note that in IAC, each actor π(ua |τ a ) and
each critic Q(τ a , ua ) or V (τ a ) conditions only on the agent’s own action-observation
history τ a . However, the critic is used only during learning and only the actor
is needed during execution. Since learning is centralised, we can therefore use a
centralised critic that conditions on the true global state s, if it is available, or
the joint action-observation histories τ otherwise. Each actor conditions on its
own action-observation histories τ^a, with parameter sharing, as in IAC. Figure 3.3a
illustrates this setup.
Figure 3.3: In (a), information flow between the decentralised actors, the environment
and the centralised critic in COMA; red arrows and components are only required during
centralised learning. In (b) and (c), architectures of the actor and critic.
A naive way to use this centralised critic would be for each actor to follow a
gradient based on the TD error estimated from this critic:
$$g = \nabla_{\theta^\pi} \log \pi(u^a \mid \tau^a_t)\,\big(r + \gamma V(s_{t+1}) - V(s_t)\big).$$
Such a gradient, however, does not address multi-agent credit assignment, since
the TD error considers only global rewards. An alternative is difference rewards [18],
which compare the global reward to the reward obtained when agent a's action is
replaced by a default action; prior work proposes estimating such difference rewards
using function approximation rather than a simulator. However, this still
requires a user-specified default action c^a that can be difficult to choose in many
applications. In an actor-critic architecture, this approach would also introduce
an additional source of approximation error.
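In the notation of this chapter, the difference reward of [18] for agent a can be written as follows, with c^a the user-specified default action mentioned above:
$$D^a = r(s, \mathbf{u}) - r\big(s, (\mathbf{u}^{-a}, c^a)\big).$$
Since the second term does not depend on agent a's own action, any action that increases D^a also increases the global reward, which is what makes difference rewards attractive for credit assignment.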
A key insight underlying COMA is that a centralised critic can be used to
implement difference rewards in a way that avoids these problems. COMA learns a
centralised critic, Q(s, u) that estimates Q-values for the joint action u conditioned
on the central state s. For each agent a we can then compute an advantage function
that compares the Q-value for the current action ua to a counterfactual baseline
that marginalises out u^a, while keeping the other agents' actions u^{-a} fixed:
$$A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a(u'^a \mid \tau^a)\, Q\big(s, (\mathbf{u}^{-a}, u'^a)\big). \qquad (3.4.2)$$
Hence, Aa (s, ua ) computes a separate baseline for each agent that uses the centralised
critic to reason about counterfactuals in which only a’s action changes, learned
directly from agents’ experiences instead of relying on extra simulations, a reward
model, or a user-designed default action.
This advantage has the same form as the aristocrat utility [55]. However, opti-
mising for an aristocrat utility using value-based methods creates a self-consistency
problem because the policy and utility function depend recursively on each other. As
a result, prior work focused on difference evaluations using default states and actions.
COMA is different because the counterfactual baseline’s expected contribution to
the gradient, as with other policy gradient baselines, is zero. Thus, while the
baseline does depend on the policy, its expectation does not. Consequently, COMA
can use this form of the advantage without creating a self-consistency problem.
While COMA’s advantage function replaces potential extra simulations with
evaluations of the critic, those evaluations may themselves be expensive if the
critic is a deep neural network. Furthermore, in a typical representation, the
number of output nodes of such a network would equal |U |n , the size of the joint
action space, making it impractical to train. To address both these issues, COMA
uses a critic representation that allows for efficient evaluation of the baseline. In
particular, the actions of the other agents, u_t^{-a}, are part of the input to the network,
which outputs a Q-value for each of agent a's actions, as shown in Figure 3.3c.
Consequently, the counterfactual advantage can be calculated efficiently by a single
forward pass of the actor and critic, for each agent. Furthermore, the number of
outputs is only |U| instead of |U|^n. While the network has a large input space
that scales linearly in the number of agents and actions, deep neural networks
can generalise well across such spaces.
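The resulting computation is simple enough to express in a few lines; the sketch below assumes the critic has already produced a vector of Q-values for agent a's candidate actions (given s and u^{-a}) and combines it with the actor's policy to form the advantage of Equation 3.4.2. Tensor names are illustrative.

```python
import torch

def coma_advantage(q_values, pi, u_taken):
    """q_values: [batch, |U|] Q(s, (u^-a, u'^a)) for every candidate action u'^a;
    pi: [batch, |U|] the actor's policy pi^a(.|tau^a); u_taken: [batch] LongTensor
    of actions actually taken. Returns A^a(s, u) as in Equation 3.4.2."""
    q_taken = q_values.gather(1, u_taken.unsqueeze(1)).squeeze(1)   # Q(s, u)
    baseline = (pi * q_values).sum(dim=1)   # sum_{u'^a} pi^a(u'^a|tau^a) Q(s, (u^-a, u'^a))
    return q_taken - baseline
```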
In this chapter, we focus on settings with discrete actions. However, COMA can
be easily extended to continuous action spaces by estimating the expectation in
(3.4.2) with Monte Carlo samples or by using functional forms that render it analytical,
e.g., Gaussian policies and critics.
The following lemma establishes the convergence of COMA to a locally optimal
policy. The proof follows directly from the convergence of single-agent actor-critic
algorithms [29, 37], and is subject to the same assumptions.
at each iteration k,
$$\liminf_k \, \lVert \nabla J \rVert = 0 \quad \text{w.p. } 1. \qquad (3.4.4)$$
Proof. The expected COMA policy gradient is
$$g = \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, A^a(s, \mathbf{u})\right], \qquad (3.4.5)$$
with
$$A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - b(s, \mathbf{u}^{-a}), \qquad (3.4.6)$$
where θ are the parameters of all actor policies, e.g., θ = {θ^1, ..., θ^{|A|}}, and b(s, u^{-a})
is the counterfactual baseline defined in Equation 3.4.2.
First consider the expected contribution of this baseline b(s, u^{-a}):
$$\begin{aligned}
g_b &= -\mathbb{E}_{\boldsymbol{\pi}}\left[\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, b(s, \mathbf{u}^{-a})\right] && (3.4.7)\\
&= -\sum_s d^{\boldsymbol{\pi}}(s) \sum_a \sum_{\mathbf{u}^{-a}} \boldsymbol{\pi}(\mathbf{u}^{-a} \mid \boldsymbol{\tau}^{-a}) \sum_{u^a} \pi^a(u^a \mid \tau^a)\, \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, b(s, \mathbf{u}^{-a}) && (3.4.8)\\
&= -\sum_s d^{\boldsymbol{\pi}}(s) \sum_a \sum_{\mathbf{u}^{-a}} \boldsymbol{\pi}(\mathbf{u}^{-a} \mid \boldsymbol{\tau}^{-a}) \sum_{u^a} \nabla_\theta\, \pi^a(u^a \mid \tau^a)\, b(s, \mathbf{u}^{-a}) && (3.4.9)\\
&= -\sum_s d^{\boldsymbol{\pi}}(s) \sum_a \sum_{\mathbf{u}^{-a}} \boldsymbol{\pi}(\mathbf{u}^{-a} \mid \boldsymbol{\tau}^{-a})\, b(s, \mathbf{u}^{-a})\, \nabla_\theta 1 \\
&= 0. && (3.4.10)
\end{aligned}$$
Clearly, the per-agent baseline, although it reduces variance, does not change the
expected gradient, and therefore does not affect the convergence of COMA.
The remainder of the expected policy gradient is given by:
$$\begin{aligned}
g &= \mathbb{E}_{\boldsymbol{\pi}}\left[\sum_a \nabla_\theta \log \pi^a(u^a \mid \tau^a)\, Q(s, \mathbf{u})\right] && (3.4.11)\\
&= \mathbb{E}_{\boldsymbol{\pi}}\left[\nabla_\theta \log \prod_a \pi^a(u^a \mid \tau^a)\, Q(s, \mathbf{u})\right]. && (3.4.12)
\end{aligned}$$
This is the standard actor-critic gradient for the joint policy $\boldsymbol{\pi}(\mathbf{u} \mid \boldsymbol{\tau}) = \prod_a \pi^a(u^a \mid \tau^a)$.
Konda and Tsitsiklis [37] prove that an actor-critic following this gradient
converges to a local maximum of the expected return J π , given that:
2. the update timescales for Q and π are sufficiently slow, and that π is updated
sufficiently slower than Q, and
The actor update accumulates policy gradients and then applies them with learning rate α:
Δθ^π ← Δθ^π + ∇_{θ^π} log π(u | h^a_t) A^a(s^a_t, u)    // accumulate actor gradients
θ^π_{i+1} ← θ^π_i + α Δθ^π    // update actor weights
Architecture & Training. The actor consists of 128-bit gated recurrent units
(GRUs) [38] that use fully connected layers both to process the input and to produce
the output values from the hidden state, $h^a_t$. The IAC critics use extra output heads
appended to the last layer of the actor network. Action probabilities are produced
from the final layer, $z$, via a bounded softmax distribution that lower-bounds the
probability of any given action by $\epsilon/|U|$: $P(u) = (1 - \epsilon)\,\mathrm{softmax}(z)_u + \epsilon/|U|$. We
anneal $\epsilon$ linearly from 0.5 to 0.02 across 750 training episodes. The centralised critic
is a feedforward network with multiple ReLU layers combined with fully connected
layers. Hyperparameters were coarsely tuned on the m5v5 scenario and then used
for all other maps. We found that the most sensitive parameter was TD(λ), but
settled on λ = 0.8 which worked best for both COMA and our baselines. Our
implementation uses TorchCraft [58] and Torch 7 [59].
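As an illustration of the bounded softmax just described, here is a minimal NumPy sketch (the function name is an assumption) that lower-bounds every action probability by ε/|U|:

```python
import numpy as np

def bounded_softmax(logits, epsilon):
    """Mix a softmax with a uniform distribution so that every action keeps
    probability at least epsilon / |U| (illustrative sketch)."""
    z = logits - logits.max()               # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()     # standard softmax
    n_actions = len(logits)
    return (1.0 - epsilon) * probs + epsilon / n_actions

# epsilon is annealed linearly from 0.5 to 0.02 over training, as in the text.
print(bounded_softmax(np.array([2.0, 1.0, 0.1]), epsilon=0.5))
```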
Ablations. We perform ablation experiments to validate three key elements
of COMA. First, we test the importance of centralising the critic by comparing
against two IAC variants, IAC-Q and IAC-V . These take the same decentralised
input as the actor, and share the actor network parameters up to the final layer.
IAC-Q then outputs |U | Q-values, one for each action, while IAC-V outputs a
single state-value. Second, we test the significance of learning Q instead of V . The
method central-V still uses a central state for the critic, but learns V (s), and uses
the TD error to estimate the advantage for policy gradient updates. Third, we
test the utility of our counterfactual baseline. The method central-QV learns both
Q and V simultaneously and estimates the advantage as Q − V , replacing our
counterfactual baseline with V . All methods use the same architecture and training
scheme for the actors, and all critics are trained with TD(λ).
Table 3.1: Mean win rates averaged across final 1000 evaluation episodes for the different
maps, for all methods and the hand-coded heuristic in the decentralised setting with a
limited field of view. The best result for this setting is in bold. Also shown, maximum
win rates for COMA (decentralised), in comparison to the heuristic and published results
(evaluated in the centralised setting).
[Plots: average win rate vs. number of episodes for (a) 3m, (b) 5m, (c) 5w, (d) 2d_3z.]
Figure 3.4: Win rates for COMA and competing algorithms on four different scenarios.
COMA outperforms all baseline methods. Centralised critics also clearly outperform their
decentralised counterparts. The legend at the top applies across all plots.
3.5 Results
Figure 3.4 shows average win rates as a function of episode for each method and
each StarCraft scenario. For each method, we conducted 35 independent trials and
froze learning every 100 training episodes to evaluate the learned policies across
200 episodes per method, plotting the average across episodes and trials. Also
shown is one standard deviation in performance.
The results show that COMA is superior to the IAC baselines in all scenarios.
Interestingly, the IAC methods also eventually learn reasonable policies in m5v5,
although they need substantially more episodes to do so. This may seem counter-
intuitive since in the IAC methods, the actor and critic networks share parameters
in their early layers (see Section 3.4.2). This could be expected to speed learning,
but these results suggest that the improved accuracy of policy evaluation made
possible by conditioning on the global state outweighs the overhead of training
a separate network.
³ The 5w DQN and GMEZO benchmark performances are of a policy trained on a larger map and tested on 5w.
Furthermore, COMA strictly dominates central-QV , both in training speed and
in final performance across all settings. This is a strong indicator that our counter-
factual baseline is crucial when using a central Q-critic to train decentralised policies.
Learning a state-value function has the obvious advantage of not conditioning
on the joint action. Still, we find that COMA outperforms our baseline central-V
in final performance. Further, COMA typically achieves good policies faster, which
is expected as COMA provides a shaped training signal. Training is also more
stable than central-V , which is a consequence of the COMA gradient tending
to zero as the policy becomes greedy. Overall, COMA is the best performing
and most consistent method.
Usunier et al. [40] report the performance of their best agents trained with their
state-of-the-art centralised controller labelled GMEZO (greedy-MDP with episodic
zero-order optimisation), and for a centralised DQN controller, both given a full
field of view and access to attack-move macro-actions. These results are compared
in Table 3.1 against the best agents trained with COMA for each map. Clearly, in
most settings these agents achieve performances comparable to the best published
win rates despite being restricted to decentralised policies and a local field of view.
Contents
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Dec-POMDP and Features . . . . . . . . . . . . . . . . . 52
4.4 Common Knowledge . . . . . . . . . . . . . . . . . . . . 53
4.5 Multi-Agent Common Knowledge Reinforcement Learning . . . . . 55
4.6 Pairwise MACKRL . . . . . . . . . . . . . . . . . . . . . 58
4.7 Experiments and Results . . . . . . . . . . . . . . . . . . 59
4.8 Conclusion & Future Work . . . . . . . . . . . . . . . . . 62
4.1 Introduction
Figure 4.1: Three agents and their fields of view. A and B’s locations are common
knowledge to A and B as they are within each other’s fields of view. However, even
though C can see A and B, it shares no common knowledge with them.
their individual observations that would in principle be useful for maximising reward,
because acting on it would make their behaviour less predictable to their teammates.
In this chapter, we propose multi-agent common knowledge reinforcement
learning (MACKRL), which strikes a middle ground between these two extremes.
The key insight is that, even in partially observable settings, subsets of the agents
often possess some common knowledge that they can exploit to coordinate their
behaviour. Common knowledge for a set of agents consists of facts that all agents
know and “each individual knows that all other individuals know it, each individual
knows that all other individuals know that all the individuals know it, and so on” [62].
We formalise a multi-agent setting that suffices to give rise to common knowledge.
The setting involves assumptions that, while restrictive, are naturally satisfied in a
range of multi-agent problems. Intuitively, common knowledge can arise between
two agents when each agent can observe the other, and doing so disambiguates
what the other has observed. For example, if each agent can reliably observe things
that are within its field of view and the agents know each other’s fields of view,
then two agents possess common knowledge whenever they see each other. Such
a scenario, illustrated in Figure 4.1, arises in tasks such as robo-soccer [63] and
multi-agent StarCraft Micromanagement [58]. It could also arise in a range of real-
world scenarios, e.g. a self-driving car that knows which sensors other autonomous
vehicles that it observes are equipped with.
Common knowledge is useful because, by definition, each agent in a group can
independently deduce the common knowledge shared by that group. Consequently,
a centralised joint policy that conditions only on the common knowledge of all
agents can be executed in a decentralised fashion: each agent simply queries the
centralised policy for a joint action and then executes their part of that joint action.
Since each agent supplies the same common knowledge as input, they return the
same joint action, yielding coordinated behaviour.
However, the presence of common knowledge gives rise to a major challenge.
Smaller groups of agents often possess more common knowledge than larger groups,
making it unclear at what level agents should coordinate. On the one hand,
coordination would ideally happen globally, that is, a fully centralised policy would
select a joint action for all agents. However, the common knowledge shared among
all agents may be weak, limiting what such a policy can condition on. On the other
hand, the agents could be broken into subgroups that share more common knowledge
within them. Coordination would no longer occur across the subgroups, but action
selection within each group could condition on a richer common knowledge signal.
MACKRL addresses this challenge using a hierarchical approach. At each level
of the hierarchy, a controller can either select a joint action for the agents in its
subgroup, or propose a partition of the agents into smaller subgroups, whose actions
are then selected by controllers at the next level of the hierarchy. During action
selection, MACKRL simply samples sequentially from the hierarchy of common
knowledge controllers. However, during training, the total probability of the joint
action is evaluated by marginalising over all choices that could have been taken
at every level of the hierarchy. Thus, even the parameters governing a subgroup
that was not selected during the action sampling can receive gradients.
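A minimal sketch of this hierarchical action selection (Python; the controller interface, the dictionary of common knowledge keyed by group, and all names are assumptions made for illustration rather than the thesis code):

```python
def select_actions(group, controller, common_knowledge, rng):
    """Recursively sample a joint action for `group` (illustrative sketch).

    `controller(group, ck, rng)` returns either ('action', {agent: action})
    or ('partition', [subgroup, ...]), conditioning only on the group's
    common knowledge `ck` and the shared random number generator `rng`.
    """
    kind, choice = controller(group, common_knowledge[frozenset(group)], rng)
    if kind == 'action':
        return choice                                  # joint action for the whole group
    joint_action = {}
    for subgroup in choice:                            # delegate to the next level
        joint_action.update(select_actions(subgroup, controller, common_knowledge, rng))
    return joint_action
```

Because every agent evaluates the same function on the same common knowledge with the same shared seed, all agents recover the same joint action and simply execute their own component of it.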
Using a matrix game, we show results for a tabular version of MACKRL that
outperforms both independent and joint-action learners; the code is available at
bit.ly/2PvFJAB. Furthermore, we develop a DMARL variant called pairwise
MACKRL, which uses pairwise coordinated strategies implemented with RNN
policies and a centralised critic similar to what was proposed in Chapter 3. On a
multi-agent variant of StarCraft II (SCII) micromanagement, similar to Section 3.3,
pairwise MACKRL outperforms a baseline agent which uses a centralised critic.
Figure 4.2: Pairwise MACKRL. Left: the pair selector must assign agents to pairs (plus
a singleton in this case). Middle: the pair controller can either partition the pair or select
among the pair's joint actions. Right: at the last level, the controller must select an action
for a single agent.
We also show that the delegation decisions are semantically meaningful and relate
to the amount of common knowledge between the agents.
4.2 Related Work
MARL has been studied extensively in small environments, see e.g., [42, 43], but
scaling it to large state spaces or many agents has proved problematic. One
reason is that the joint action space grows exponentially in the number of agents,
making learning intractable. Guestrin, Koller, and Parr [64] propose the use of
coordination graphs, which exploit conditional independence properties between
agents that are captured in an undirected graphical model, in order to efficiently
select joint actions. Sparse cooperative Q-learning [65] also uses coordination
graphs to efficiently maximise over joint actions in the Q-learning update rule.
Whilst these approaches allow agents to coordinate optimally, they require the
coordination graph to be known and for the agents to either observe the global
state or to be able to freely communicate. In addition, in the worst case there are
no conditional independencies to exploit and maximisation must still be performed
over an intractably large joint action space.
Genter, Agmon, and Stone [66] and Barrett, Stone, and Kraus [67] examine ad
hoc teamwork: how agents can cooperate with previously unseen teammates when
there are a variable number of non-learning agents. Albrecht and Stone [68] address
ad hoc teamwork by maintaining a belief over a set of parameterised hypothetical
behaviours and updating them after each observation. Panella and Gmytrasiewicz
[69] treat the behaviours of teammates as a stochastic process and maintain beliefs
over these in order to learn how to coordinate with previously unseen agents. Makino
and Aihara [70] develop an algorithm that reasons over the beliefs over policies of
other agents in fully observable settings. In ad-hoc teamwork, the learning agents
learn to coordinate with other agents with fixed, non-learning behaviour; in our
setting all agents learn at the same time, with the same known algorithm.
Thomas et al. [71] explore the psychology of common knowledge and coordination.
Rubinstein [72] shows that any finite number of reasoning steps, short of the
infinite number required for common knowledge, can be insufficient for achieving
coordination. Korkmaz et al. [73] examine common knowledge in scenarios where
agents use Facebook-like communication and show that a complete bipartite graph
is required for common knowledge to be shared amongst a group. Brafman and
Tennenholtz [74] use a common-knowledge protocol to improve coordination in
common interest stochastic games but, in contrast to our approach, establish
common knowledge about agents’ action sets and not about subsets of their
observation spaces.
Aumann et al. [75] introduce the concept of a correlated equilibrium, whereby
a shared correlation device helps agents coordinate better. Cigler and Faltings
[76] examine how the agents can reach such an equilibrium when given access to
a simple shared correlation vector and a communication channel. Boutilier [77]
augments the state space with a coordination mechanism, to ensure coordination
between agents is possible in a fully observable multi-agent setting. This is in
general not possible in the partially observable setting we consider. Instead of
relying on a shared communication channel or full observability, MACKRL achieves
coordination by utilising common knowledge.
Amato, Konidaris, and Kaelbling [78] propose MacDec-POMDPs, which use hier-
archically optimal policies that allow agents to undertake temporally extended macro
actions. Liu et al. [79] investigate how to learn such models in environments where
the transition dynamics are not known. Makar, Mahadevan, and Ghavamzadeh
[80] extend the MAXQ single-agent hierarchical framework by Dietterich [81] to the
multi-agent domain. They enable certain policies in the hierarchy to learn the joint
action-value function, which allows for faster coordination across agents. However,
unlike MACKRL this requires the agents to communicate during execution.
Kumar et al. [82] use a hierarchical controller that produces subtasks for each
agent and chooses which pairs of agents should communicate in order to select their
actions. In contrast to our approach, they allow communication during execution,
and do not test on a sequential decision making task.
The goal of the agents is to maximise the expected discounted forward-looking
return $R_t = \sum_{t'=t}^{T} \gamma^{t'-t}\, r(s_{t'}, \mathbf{u}_{t',\mathrm{env}})$ experienced during trajectories of length $T$.
The joint policy $\pi(\mathbf{u}_{\mathrm{env}}|s)$ is restricted to a set of decentralised policies $\pi^a(u^a_{\mathrm{env}}|\tau^a)$
that can be executed independently, i.e. each agent's policy conditions only on
its own action-observation history $\tau^a$. We denote a policy across the joint action
space $U^G_{\mathrm{env}}$ of the group $G \subseteq A$ of agents as $\pi^G$. Like in the rest of the thesis, we
allow centralised learning of decentralised policies.
Our setting is thus a special case of the Dec-POMDP in which the state space
factors into entities and the observation function is deterministic, yielding perceptual
aliasing in which the state features of each entity are either accurately observed or
completely occluded. In the next section, we leverage these properties to establish
common knowledge among the agents. The Dec-POMDP could be augmented with
additional state features that do not correspond to entities, as well as additional
possibly noisy observation features, without disrupting the common knowledge we
establish about entities. For simplicity, we omit such additions.
Lemma 4.4.1. In the setting described in the previous section, and when all masks
are known to all agents, the common knowledge of a group of agents $G$ in state $s$ is
$$\mathcal{I}^G_s = \begin{cases} \mathcal{M}^G_s, & \text{if } \bigwedge_{a,b \in G} \mu^a(s^a, s^b) \\ \emptyset, & \text{otherwise.} \end{cases} \tag{4.4.3}$$
Therefore, starting from the knowledge of any agent of a group $G$ in which all agents
can see each other, the $m$-th order mutual knowledge becomes the common knowledge
for all $m \geq 3$.
The common knowledge can be computed using only the visible set $\mathcal{M}^a_s$ of every
agent $a \in G$. Moreover, actions that have been chosen by a policy that is itself
common knowledge, and that further depends only on common knowledge and
a shared random seed, can also be considered common knowledge. The common
knowledge of group $G$ up to time $t$ thus comprises some common prior knowledge $\tau_0$
and the commonly known trajectory $\tau^G_t = (\tau_0, o^G_1, \mathbf{u}^G_1, \ldots, o^G_t, \mathbf{u}^G_t)$, with $o^G_k = \{s^e_k \mid e \in \mathcal{I}^G_{s_k}\}$.
Knowing all binary masks $\mu^a$ makes it possible to derive $\tau^G_t = \mathcal{I}^G(\tau^a_t)$
from the observation trajectory $\tau^a_t = (\tau_0, o^a_1, \ldots, o^a_t)$ of every agent $a \in G$, and
a function that conditions on $\tau^G$ can therefore be computed independently by
every member of $G$. This idea of common knowledge based on local observations
is illustrated in Figure 4.1.
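The following sketch (Python; circular fields of view, entity positions, and the helper names are assumed purely for illustration) shows how a group's common knowledge could be computed from mutual visibility alone, in the spirit of equation (4.4.3):

```python
def visible(a_pos, b_pos, radius):
    """Binary mask mu: can an agent at a_pos see an entity at b_pos?"""
    return (a_pos[0] - b_pos[0]) ** 2 + (a_pos[1] - b_pos[1]) ** 2 <= radius ** 2

def common_knowledge(group, positions, entities, radius):
    """Entities whose features are common knowledge to `group` (sketch).

    If any pair of agents in the group cannot see each other, the group shares
    no common knowledge; otherwise it is the set of entities visible to every
    member of the group.
    """
    if not all(visible(positions[a], positions[b], radius)
               for a in group for b in group):
        return set()
    return {e for e in entities
            if all(visible(positions[a], positions[e], radius) for a in group)}

# Example: agents 'A' and 'B' see each other, so entity 'C' they both see is CK.
pos = {'A': (0, 0), 'B': (1, 0), 'C': (0, 1)}
print(common_knowledge(['A', 'B'], pos, ['A', 'B', 'C'], radius=2))
```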
where $U^G = \{U^G_{\mathrm{env}} \cup \mathcal{P}(G)\}$ and $\mathcal{P}(G)$ is the set of all partitions of $G$. Algorithm
2 summarises MACKRL's hierarchical action selection. Since the hierarchical
controllers condition only on the common knowledge as defined in (4.4.2), the
actions can be executed by each agent. We use $\mathbf{u}^P = \prod_{G \in P} \mathbf{u}^G$ to denote the joint
action across all groups $G$ of a partition $P$.
However, rather than training a set of factorised policies, one for each agent,
we train the marginalised joint policy $P(\mathbf{u}_{\mathrm{env}}|s)$ induced by the hierarchical sampling
process:
$$P(\mathbf{u}_{\mathrm{env}}|s) = \sum_{\mathrm{path} \in \mathrm{Paths}} P(\mathbf{u}_{\mathrm{env}}|s, \mathrm{path})\, P(\mathrm{path}|s), \tag{4.5.2}$$
where Paths is the set of all possible action assignments of the hierarchical controllers,
that is, each path is a possible outcome of the action selection in Algorithm 2. In
general, sampling from a joint probability is problematic since the number of logits
grows exponentially in the number of agents. Furthermore, naively evaluating the
joint probability of a given joint action is similarly demanding.
The key insight of MACKRL is that we can use the hierarchical, decentralised
process for sampling actions, while the marginal probabilities need to be computed
only during training in order to obtain the joint probability of the joint action.
During training we condition a centralised critic on the central state and the last actions of all agents.
induces a correlated probability across the joint action space, it effectively turns
training into a single agent problem and renders the COMA baseline proposed
in Chapter 3 non-applicable.
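For a single pair, the marginalisation in (4.5.2) reduces to summing over the two ways a joint action can arise: the pair controller selects it directly, or it delegates and both agents select independently. A minimal sketch (Python; all names are illustrative assumptions):

```python
def pair_joint_action_prob(u_a, u_b, pair_probs, p_delegate, probs_a, probs_b):
    """P(u_a, u_b) under a single pair controller (illustrative sketch).

    pair_probs[(u_a, u_b)]: probability the pair controller assigns to that joint
                            action (joint actions plus the delegate action sum to one).
    p_delegate:             probability of the delegate action u_d.
    probs_a, probs_b:       the decentralised controllers' action distributions.
    """
    # Path 1: the pair controller selects the joint action itself.
    p_joint = pair_probs[(u_a, u_b)]
    # Path 2: it delegates, and each agent selects its action independently.
    p_delegated = p_delegate * probs_a[u_a] * probs_b[u_b]
    return p_joint + p_delegated

# Two actions per agent; the pair controller delegates with probability 0.4.
pair = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.2}
print(pair_joint_action_prob(1, 1, pair, 0.4, [0.3, 0.7], [0.5, 0.5]))  # 0.2 + 0.4*0.35
```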
The value function is learned by gradient descent on the TD(λ) loss [21], as is
standard in this thesis. As in Central-V , all critics are trained on-policy using
samples from the environment, and all controllers within one layer of the hierarchy
share parameters. The value function conditions both on the state, st , and on the
last joint-action at each time step, ut−1 , in order to account for durative actions.
In general, both the number of possible partitions and groups per partition can
be large, making learning difficult. However, the next section describes an example
Figure 4.3: Matrix A. Figure 4.4: Matrix B. [Payoff matrices of the two normal-form games.]
of MACKRL: pairwise MACKRL. The hierarchy consists of three levels. At the top level, the
controller $\pi_{ps}$ is a pair selector that is not allowed to directly select a joint action
but can only choose among pairwise partitions, i.e., it must group the agents into
disjoint pairs. If there is an odd number of agents, then one agent is put in a
singleton group. At the second level, each controller $\pi_{pc}^{aa'}$ is a pair controller whose
action space consists of the joint action space of the pair of agents plus a delegate
action $u_d$ that further partitions the pair into singletons: $U^G = U^a_{\mathrm{env}} \times U^{a'}_{\mathrm{env}} \cup \{u_d\}$,
where $G = \{a, a'\}$. At the third level, each controller selects an individual action
for a single agent.
Table 4.1: Hierarchy of pairwise MACKRL, where $h$ is the hidden state of the RNNs and
$o^a_t$ are observations. #π shows the number of controllers on this level for 3 agents.
[Plots: test win rate (SEM) vs. environmental steps; left: MACKRL [24] and Central-V [20]; right: MACKRL [22] and Central-V [24].]
Figure 4.6: Mean and standard error of the mean of test win rates for two levels in
StarCraft II: one with 5 marines (left), and one with 2 stalkers and 3 zealots on each side
(right). Also shown is the number of runs [in brackets].
[Figure 4.5: train performance (expected return) of MACKRL, JAL, and IAC on the matrix game as a function of P(common knowledge).]
4.7 Experiments and Results
We evaluate MACKRL on two tasks. The first is a matrix game with partial
observability. The state consists of two randomly sampled bits, which are drawn
iid. The first bit indicates the information state and is always observable by both
agents. The second bit selects which one of the two normal-form games the agents
are playing and is sampled uniformly at random. When the first bit is in the
'common knowledge' state, which happens with probability P(common knowledge),
the matrix bit is always observable by both agents and is thus common knowledge.
In contrast, when the first bit is in the ‘obfuscation state’, each of the agents
observes the ‘matrix bit’ with a probability of 50% (iid), indicating whether the
reward will be chosen using the matrix shown in Figure 4.3 or the matrix shown
in Figure 4.4, and receives a ‘no observation’ otherwise. The code including a
proof of principle implementation is available in the Supplementary Material and
published online. We explore how MACKRL performs, for a range of values for
P (common knowledge), in comparison to an always centralised policy using only
the common knowledge and fully decentralised learners.
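A sketch of the observation-generating process just described (Python; names and the tuple format of the observations are illustrative assumptions):

```python
import random

def sample_matrix_game(p_common_knowledge, rng=random):
    """Sample one matrix-game state and both agents' observations (sketch)."""
    ck_bit = 1 if rng.random() < p_common_knowledge else 0   # 'common knowledge' state?
    matrix_bit = rng.randrange(2)                            # which normal-form game (A or B)
    observations = []
    for _ in range(2):
        if ck_bit == 1 or rng.random() < 0.5:
            observations.append((ck_bit, matrix_bit))        # matrix bit observed
        else:
            observations.append((ck_bit, None))              # 'no observation' of the matrix bit
    return matrix_bit, observations

print(sample_matrix_game(0.5))
```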
Figure 4.5 shows performance on the matrix game as a function of the probability
of the common knowledge bit. When there is no common knowledge MACKRL repro-
duces the performance of independent actor critic (IAC), while a centralised policy,
learned with joint-action-learning (JAL) [42], fails to take advantage of the private
[Plot: delegation rate (%) vs. number of enemies in the common knowledge of the selected pair controller, at 400, 500k, 1.2M, and 1.6M environment steps.]
Figure 4.7: Delegation rate vs. number of enemies (2s3z) in the common knowledge of
the pair controller over training.
We found that this both stabilised and accelerated training compared to doing
full-batch updates for the critic. The target network for the critic was updated
after every 200 critic update steps. We also used TD(λ) with λ = 0.8 to
accelerate reward propagation across time.
Figure 4.6 shows the win rate of StarCraft II agents during test trajectories on
our two maps. We omit independent learning since it is known to do poorly in
this setting [51]. On the easier map (3m), both methods achieve near state-of-
the-art performance above 90%. Since Central-V has around three times fewer
parameters, it is able to initially learn faster on this simple map, but MACKRL
achieves marginally higher final performance. On the more challenging map with
mixed unit types (2s3z), MACKRL learns faster and achieves a higher final score.
These results show that MACKRL can outperform a fully factored policy when
all other aspects of training are the same.
To demonstrate that the pair controller can indeed learn to delegate strategically,
we plot in Figure 4.7 the percentage of delegation actions ud against the number
of enemies in the common knowledge of the selected pair controller, in situations
where there is at least some common knowledge. Since we start with randomly
initialised policies, at the beginning of training the pair controller delegates only
rarely to the decentralised controllers. As training proceeds, it learns to delegate in
most situations where the number of enemies in the common knowledge of the pair
is small, the exception being no visible enemies, which happens too rarely (5% of
cases). This shows that MACKRL can learn to delegate in order to take advantage
of the private observations of the agents, but also learns to coordinate in the joint
action space when there is substantial common knowledge.
5.1 Introduction
part of the environment. While IQL avoids the scalability problems of centralised
learning, it introduces a new problem: the environment becomes nonstationary
from the point of view of each agent, as it contains other agents who are themselves
learning, ruling out any convergence guarantees. Fortunately, substantial empirical
evidence has shown that IQL often works well in practice [83].
In recent years, the use of deep neural networks has dramatically improved
the scalability of single-agent RL [13]. However, one element key to the success
of such approaches is the reliance on an experience replay memory, which stores
experience tuples that are sampled during training. Experience replay not only
helps to stabilise the training of a deep neural network, it also improves sample
efficiency by repeatedly reusing experience tuples. Unfortunately, the combination
of experience replay with IQL appears to be problematic: the nonstationarity
introduced by IQL means that the dynamics that generated the data in the agent’s
replay memory no longer reflect the current dynamics in which it is learning. While
IQL without a replay memory can learn well despite nonstationarity so long as each
agent is able to gradually track the other agents’ policies, that seems hopeless with
a replay memory constantly confusing the agent with obsolete experience.
To avoid this problem, previous work on DMARL has limited the use of
experience replay to short, recent buffers [44] or simply disabled replay altogether,
e.g., our work in Chapter 6. However, these workarounds limit the sample efficiency
and threaten the stability of multi-agent RL. Consequently, the incompatibility
of experience replay with IQL is emerging as a key stumbling block to scaling
DMARL to complex tasks.
In this chapter, we propose two approaches for effectively incorporating experi-
ence replay into multi-agent RL. The first approach interprets the experience in
the replay memory as off-environment data [84]. By augmenting each tuple in the
replay memory with the probability of the joint action in that tuple, according to
the policies in use at that time, we can compute an importance sampling correction
when the tuple is later sampled for training. Since older data tends to generate lower
Beyond the related work mentioned in Chapter 3 our work is also broadly related to
methods that attempt to allow for faster learning for value based methods. These
include prioritised experience replay [86], a version of the standard replay memory
that biases the sampling distribution based on the TD error. However, this method
does not account for nonstationary environments and does not take into account
the unique properties of the multi-agent setting.
Methods like hyper Q-learning [85] and AWESOME [87] try to tackle nonstation-
arity by tracking and conditioning each agent’s learning process on their teammates’
current policy, while Da Silva et al. [88] propose detecting and tracking different
classes of traces on which to condition policy learning. Kok and Vlassis [89] show
that coordination can be learnt by estimating a global Q-function in the classical
distributed setting supplemented with a coordination graph. In general, these
techniques have so far not successfully been scaled to high-dimensional state spaces.
¹ StarCraft and its expansion StarCraft: Brood War are trademarks of Blizzard Entertainment™.
5.3 Methods
To avoid the difficulty of combining IQL with experience replay, previous work
on DMARL has limited the use of experience replay to short, recent buffers [44]
or simply disabled replay altogether, e.g., our work in Chapter 6. However, these
workarounds limit the sample efficiency and threaten the stability of multi-agent RL.
In this section, we propose two approaches for effectively incorporating experience
replay into multi-agent RL.
setting. If the Q-functions can condition directly on the true state s, we can write the
Bellman optimality equation for a single agent given the policies of all other agents:
$$Q^*_a(s, u^a \mid \boldsymbol{\pi}^{-a}) = \sum_{\mathbf{u}^{-a}} \pi^{-a}(\mathbf{u}^{-a}|s) \Big[ r(s, u^a, \mathbf{u}^{-a}) + \gamma \sum_{s'} P(s'|s, u^a, \mathbf{u}^{-a}) \max_{u'^a} Q^*_a(s', u'^a) \Big]. \tag{5.3.1}$$
The nonstationary component of this equation is $\pi^{-a}(\mathbf{u}^{-a}|s) = \prod_{a' \in -a} \pi^{a'}(u^{a'}|s)$,
which changes as the other agents' policies change over time. Therefore, to enable
importance sampling, at the time of collection $t_c$, we record $\pi^{-a}_{t_c}(\mathbf{u}^{-a}|s)$ in the replay
memory, forming an augmented transition tuple $\langle s, u^a, r, \pi(\mathbf{u}^{-a}|s), s' \rangle^{(t_c)}$.
At the time of replay $t_r$, we train off-environment by minimising an importance-weighted loss function:
$$\mathcal{L}(\theta) = \sum_{i=1}^{b} \frac{\pi^{-a}_{t_r}(\mathbf{u}^{-a}|s)}{\pi^{-a}_{t_i}(\mathbf{u}^{-a}|s)} \left( y_i^{\mathrm{DQN}} - Q(s, u; \theta) \right)^2, \tag{5.3.2}$$
where $b$ is the batch size and $t_i$ is the collection time of the $i$-th sampled transition.
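A minimal sketch of the importance-weighted loss (5.3.2) in Python/NumPy, assuming the other agents' joint action probabilities have been stored alongside each transition (variable names are illustrative):

```python
import numpy as np

def importance_weighted_loss(q_pred, q_target, pi_replay, pi_collect):
    """Off-environment DQN loss with multi-agent importance weights (sketch).

    q_pred:     Q(s, u; theta) for each sampled transition.
    q_target:   the DQN targets y^DQN for those transitions.
    pi_replay:  pi^{-a}_{t_r}(u^{-a}|s), the other agents' current joint probability.
    pi_collect: pi^{-a}_{t_i}(u^{-a}|s), recorded at collection time in the replay memory.
    """
    weights = pi_replay / pi_collect          # importance ratio per transition
    return np.sum(weights * (q_target - q_pred) ** 2)

q_pred = np.array([0.5, 1.2])
q_target = np.array([1.0, 1.0])
print(importance_weighted_loss(q_pred, q_target,
                               pi_replay=np.array([0.3, 0.6]),
                               pi_collect=np.array([0.6, 0.3])))
```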
All other elements of the augmented game Ĝ are adopted from the original game G.
This also includes T , the space of action-observation histories. The augmented game
is then specified by $\hat{G} = \langle \hat{S}, U, \hat{P}, \hat{r}, Z, \hat{O}, n, \gamma \rangle$. We can now write a Bellman
equation for $\hat{G}$:
$$Q(\tau, u) = \sum_{\hat{s}} p(\hat{s}|\tau) \Big[ \hat{r}(\hat{s}, u) + \gamma \sum_{\tau', \hat{s}', u'} \hat{P}(\hat{s}'|\hat{s}, u)\, \pi(u'|\tau')\, p(\tau'|\tau, \hat{s}', u)\, Q(\tau', u') \Big]. \tag{5.3.4}$$
stationary if it conditioned on the policies of the other agents. This is exactly the
philosophy behind hyper Q-learning [85]: each agent’s state space is augmented
with an estimate of the other agents’ policies computed via Bayesian inference.
Intuitively, this reduces each agent’s learning problem to a standard, single-agent
problem in a stationary, but much larger, environment.
The practical difficulty of hyper Q-learning is that it increases the dimensionality
of the Q-function, making it potentially infeasible to learn. This problem is
exacerbated in deep learning, when the other agents’ policies consist of high
dimensional deep neural networks. Consider a naive approach to combining hyper Q-
learning with deep RL that includes the weights of the other agents' networks, $\theta^{-a}$,
in the observation function. The new observation function is then $O'(s) = \{O(s), \theta^{-a}\}$.
The agent could in principle then learn a mapping from the weights θ−a , and its
own trajectory τ , into expected returns. Clearly, if the other agents are using deep
models, then θ−a is far too large to include as input to the Q-function.
However, a key observation is that, to stabilise experience replay, each agent does
not need to be able to condition on any possible θ−a , but only those values of θ−a
that actually occur in its replay memory. The sequence of policies that generated
the data in this buffer can be thought of as following a single, one-dimensional
trajectory through the high-dimensional policy space. To stabilise experience replay,
it should be sufficient if each agent’s observations disambiguate where along this
trajectory the current training sample originated from.
The question then, is how to design a low-dimensional fingerprint that contains
this information. Clearly, such a fingerprint must be correlated with the true
value of state-action pairs given the other agents’ policies. It should typically
vary smoothly over training, to allow the model to generalise across experiences
in which the other agents execute policies of varying quality as they learn. An
obvious candidate for inclusion in the fingerprint is the training iteration number
e. One potential challenge is that after policies have converged, this requires the
model to fit multiple fingerprints to the same value, making the function somewhat
harder to learn and more difficult to generalise from.
Another key factor in the performance of the other agents is the rate of
exploration, $\epsilon$. Typically an annealing schedule is set for $\epsilon$ such that it varies smoothly
throughout training and is quite closely correlated with performance. Therefore, we
further augment the input to the Q-function with $\epsilon$, such that the observation
function becomes $O'(s) = \{O(s), \epsilon, e\}$. Our results in Section 5.5 show that even
this simple fingerprint is remarkably effective.
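In code, the fingerprint simply amounts to concatenating two scalars onto each observation before it enters the Q-network; a sketch (NumPy; the normalisation of the iteration number is an assumption):

```python
import numpy as np

def add_fingerprint(obs, epsilon, episode, max_episodes):
    """Augment an observation with the exploration rate and training progress (sketch)."""
    fingerprint = np.array([epsilon, episode / max_episodes])  # both roughly in [0, 1]
    return np.concatenate([obs, fingerprint])

obs = np.zeros(16)
print(add_fingerprint(obs, epsilon=0.3, episode=900, max_episodes=2500).shape)  # (18,)
```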
5.4 Experiments
In this section, we describe our experiments applying experience replay with finger-
prints (XP+FP), with importance sampling (XP+IS), and with the combination
(XP+IS+FP), to the StarCraft domain. We run experiments with both feedforward
(FF) and recurrent (RNN) models, to test the hypothesis that in StarCraft recurrent
models can use trajectory information to more easily disambiguate experiences
from different stages of training.
5.4.1 Architecture
We use the recurrent DQN architecture described in Chapter 6 with a few mod-
ifications. We do not consider communicating agents, so there are no message
connections. As mentioned above, we use two different models: one a feedforward
model with two fully connected hidden layers, and the other a recurrent model with a
single-layer GRU. For both models, every hidden layer has 128 neurons.
We linearly anneal $\epsilon$ from 1.0 to 0.02 over 1500 episodes, and train the network
for $e_{\max} = 2500$ training episodes. In the standard training loop, we collect a single
episode and add it to the replay memory at each training step. We sample batches of
$30/n$ episodes uniformly from the replay memory and train on fully unrolled episodes.
In order to reduce the variance of the multi-agent importance weights, we clip them
to the interval $[0.01, 2]$. We also normalise the importance weights by the number
of agents, by raising them to the power of $1/(n-1)$. Lastly, we divide the importance
weights by their running average in order to keep the overall learning rate constant.
All other hyperparameters are identical to the ones used in Chapter 6.
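The importance-weight post-processing described above can be summarised in a few lines (NumPy sketch; the exponential running average is one reasonable choice, not necessarily the exact bookkeeping used in our implementation):

```python
import numpy as np

def process_importance_weights(weights, n_agents, running_avg, momentum=0.99):
    """Clip, renormalise, and rescale multi-agent importance weights (sketch)."""
    w = np.clip(weights, 0.01, 2.0)            # reduce variance by clipping
    w = w ** (1.0 / (n_agents - 1))            # normalise by the number of agents
    running_avg = momentum * running_avg + (1 - momentum) * w.mean()
    return w / running_avg, running_avg        # keep the overall learning rate constant

w, avg = process_importance_weights(np.array([0.005, 0.7, 3.0]), n_agents=3, running_avg=1.0)
print(w, avg)
```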
Figure 5.1: Performance of our methods compared to the two baselines XP and NOXP,
for both RNN and FF; (a) and (b) show the 3v3 setting, in which IS and FP are only
required with feedforward networks; (c) and (d) show the 5v5 setting, in which FP clearly
improves performance over the baselines, while IS shows a small improvement only in
the feedforward setting. Overall, the FP is a more effective method for resolving the
nonstationarity and there is no additional benefit from combining IS with FP. Confidence
intervals show one standard deviation of the sample mean.
5.5 Results
In this section, we present the results of our StarCraft experiments, summarised
in Figure 5.1. Across all tasks and models, the baseline without experience
replay (NOXP) performs poorly. Without the diversity in trajectories provided
by experience replay, NOXP overfits to the greedy policy once $\epsilon$ becomes small.
When exploratory actions do occur, agents visit areas of the state space that have
not had their Q-values updated for many iterations, and bootstrap off of values
which have become stale or distorted by updates to the Q-function elsewhere. This
effect can harm or destabilise the policy. With a recurrent model, performance
simply degrades, while in the feedforward case, it begins to drop significantly later
in training. We hypothesise that full trajectories are inherently more diverse than
single observations, as they include compounding chances for exploratory actions.
Consequently, it is easier to overfit to single observations, and experience replay
is more essential for a feedforward model.
With a naive application of experience replay (XP), the model tries to si-
multaneously learn a best-response policy to every historical policy of the other
agents. Despite the nonstationarity, the stability of experience replay enables
XP to outperform NOXP in each case. However, due to limited disambiguating
information, the model cannot appropriately account for the impact of any particular
policy of the other agents, or keep track of their current policy. The experience
replay is therefore used inefficiently, and the model cannot generalise properly
from experiences early in training.
5.5.2 Fingerprints
Our results show that the simple fingerprint of adding $e$ and $\epsilon$ to the observation
(XP+FP) dramatically improves performance for the feedforward model. This
fingerprint provides sufficient disambiguation for the model to track the quality
Figure 5.2: Estimated value of a single initial observation with different $\epsilon$ in its fingerprint
input, at different stages of training (0, 500, 1000, 1500, and 2000 episodes). The network learns to smoothly vary its value
estimates across different stages of training.
of the other agents’ policies over the course of training, and make proper use
of the experience buffer. The network still sees a diverse array of input states
across which to generalise but is able to modify its predicted value in accordance
with the known stage of training.
Figure 5.1 also shows that there is no extra benefit from combining importance
sampling with fingerprints (XP+IS+FP). This makes sense given that the two ap-
proaches both address the same problem of nonstationarity, albeit in different ways.
Figure 5.2, which shows the estimated value for XP+FP of a single initial
state observation with different $\epsilon$ inputs, demonstrates that the network learns
to smoothly vary its value estimates across different stages of training, correctly
associating high values with the low $\epsilon$ seen later in training. This approach allows
the model to generalise between best responses to different policies of other agents.
In effect, a larger dataset is available in this case than when using importance
sampling, where most experiences are strongly discounted during training. The
Figure 5.3: (upper) Sampled trajectories of agents, from the beginning (a) and end
(b) of training. Each agent is one colour and the starting points are marked as black
squares. (lower) Linear regression predictions of $\epsilon$ from the hidden state halfway through
each episode in the replay buffer: (c) with only XP, the hidden state still contains
disambiguating information drawn from the trajectories; (d) with XP+FP, the hidden
state is more informative about the stage of training.
When using recurrent networks, the performance gains of XP+IS and XP+FP
are not as large; in the 3v3 task, neither method helps. The reason is that, in
StarCraft, the observed trajectories are significantly informative about the state
of training, as shown in Figure 5.3a and 5.3b. For example, the agent can observe
that it or its allies have taken many seemingly random actions, and infer that the
sampled experience comes from early in training. This is a demonstration of the
power of recurrent architectures in sequential tasks with partial observability: even
without explicit additional information, the network is able to partially disambiguate
experiences from different stages of training. To illustrate this, we train a linear
model to predict the training $\epsilon$ from the hidden state of the recurrent model. Figure
5.3c shows a reasonably strong predictive accuracy even for a model trained with
XP but no fingerprint, indicating that disambiguating information is indeed kept in
the hidden states. However, the hidden states of a recurrent model trained with
a fingerprint (Figure 5.3d) are even more informative.
5.6 Conclusion & Future Work
This chapter proposed two methods for stabilising experience replay in deep multi-
agent reinforcement learning: 1) using a multi-agent variant of importance sampling
to naturally decay obsolete data and 2) conditioning each agent’s value function on a
fingerprint that disambiguates the age of the data sampled from the replay memory.
Results on a challenging decentralised variant of StarCraft unit micromanagement
confirmed that these methods enable the successful combination of experience
replay with multiple agents. In the future, we would like to apply these methods
to a broader range of nonstationary training problems, such as classification on
changing data, and extend them to multi-agent actor-critic methods. So far we
have assumed that the agents have to execute their policies fully decentralised, i.e.,
without having access to a communication channel. In the next part of the thesis
we will investigate methods that allow agents to learn to use limited bandwidth
Learning to Communicate
Abstract
So far we have focussed on settings that do not require implicit or explicit
communication between the agents. In this part of the thesis we investigate different
methods for learning communication protocols using DMARL. In Chapter 6 we
investigate how to learn communication protocols in the presence of cheap-talk
channels, i.e., in settings where the messages sent do not have any direct impact on
the reward or the transition probability. In these settings we can use differentiation
across the communication channel in order to learn what messages to send. In
contrast, in Chapter 7 we investigate a setting in which agents need to communicate
through grounded hint-actions and by making their actions themselves informative,
when observed by another agent. We solve this by allowing agents to reason over
the beliefs of other agents in the environment.
6
Learning to Communicate with Deep Multi-Agent Reinforcement Learning
Contents
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.4.1 Reinforced Inter-Agent Learning . . . . . . . . . . . . . 86
6.4.2 Differentiable Inter-Agent Learning . . . . . . . . . . . . 87
6.5 DIAL Details . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.6.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . 92
6.6.2 Switch Riddle . . . . . . . . . . . . . . . . . . . . . . . . 92
6.6.3 MNIST Games . . . . . . . . . . . . . . . . . . . . . . . 95
6.6.4 Effect of Channel Noise . . . . . . . . . . . . . . . . . . 98
6.7 Conclusion & Future Work . . . . . . . . . . . . . . . . . 100
6.1 Introduction
are: Why does language use discrete structures? What role does the environment
play? What is innate and what is learned? And so on. Some of the debates on
these questions have been so fiery that in 1866 the French Academy of Sciences
banned publications about the origin of human language.
The rapid progress in recent years of machine learning, and deep learning in
particular, opens the door to a new perspective on this debate. How can agents
use machine learning to automatically discover the communication protocols they
need to coordinate their behaviour? What, if anything, can deep learning offer
to such agents? What insights can we glean from the success or failure of agents
that learn to communicate?
In this chapter, we take the first steps towards answering these questions.
Our approach is programmatic: first, we propose a set of multi-agent benchmark
tasks that require communication; then, we formulate several learning algorithms
for these tasks; finally, we analyse how these algorithms learn, or fail to learn,
communication protocols for the agents.
The tasks that we consider are fully cooperative, partially observable, sequential
multi-agent decision making problems. All the agents share the goal of maximising
the same discounted sum of rewards. While no agent can observe the underlying
Markov state, each agent receives a private observation correlated with that state.
In addition to taking actions that affect the environment, each agent can also
communicate with its fellow agents via a discrete limited-bandwidth cheap-talk
channel. Due to the partial observability and limited channel capacity, the agents
must discover a communication protocol that enables them to coordinate their
behaviour and solve the task.
As we do in the rest of the thesis, here we focus on settings with centralised
learning but decentralised execution. In other words, communication between agents
is not restricted during learning, which is performed by a centralised algorithm;
however, during execution of the learned policies, the agents can communicate only
via the limited-bandwidth channel. While not all real-world problems can be solved
in this way, a great many can, e.g., when training a group of robots on a simulator.
the experimental section, that are essential for learning communication protocols
in our proposed benchmarks.
6.3 Setting
Like the previous chapters, we consider cooperative RL problems with both multiple
agents and partial observability. All the agents share the goal of maximising the same
discounted sum of rewards Rt . While no agent can observe the underlying Markov
state $s_t$, each agent receives a private observation $o^a_t$ correlated with $s_t$. However, in
contrast to previous chapters, the environment includes a cheap-talk channel:
in each time-step, the agents select an environment action $u \in U$ that affects the
environment, and a communication action m ∈ M that is observed by other agents
but has no direct impact on the environment or reward. We are interested in such
settings because it is only when multiple agents and partial observability coexist that
agents have the incentive to communicate. As no communication protocol is given
a priori, the agents must develop and agree upon such a protocol to solve the task.
Since protocols are mappings from action-observation histories to sequences of
messages, the space of protocols is extremely high-dimensional. Automatically
discovering effective protocols in this space remains an elusive challenge. In
particular, the difficulty of exploring this space of protocols is exacerbated by
the need for agents to coordinate the sending and interpreting of messages. For
example, if one agent sends a useful message to another agent, it will only receive a
positive reward if the receiving agent correctly interprets and acts upon that message.
If it does not, the sender will be discouraged from sending that message again.
Hence, positive rewards are sparse, arising only when sending and interpreting are
properly coordinated, which is hard to discover via random exploration.
As described above, we focus on the setting of centralised learning but decentralised
execution: communication between agents is not restricted during learning, which is
performed by a centralised algorithm, but during execution of the learned policies the
agents can communicate only via the limited-bandwidth channel.
6.4 Methods
In this section, we present two approaches for learning communication protocols.
Figure 6.1: In RIAL (a), all Q-values are fed to the action selector, which selects both
environment and communication actions. Gradients, shown in red, are computed using
DQN for the selected action and flow only through the Q-network of a single agent. In
DIAL (b), the message mat bypasses the action selector and instead is processed by the
DRU (Section 6.4.2) and passed as a continuous value to the next C-network. Hence,
gradients flow across agents, from the recipient to the sender. For simplicity, at each time
step only one agent is highlighted, while the other agent is greyed out.
While RIAL can share parameters among agents, it still does not take full advantage
of centralised learning. In particular, the agents do not give each other feedback
about their communication actions. Contrast this with human communication,
which is rich with tight feedback loops. For example, during face-to-face interaction,
listeners send fast nonverbal cues to the speaker indicating the level of understanding
and interest. RIAL lacks this feedback mechanism, which is intuitively important
for learning communication protocols.
in order to minimise the downstream DQN loss, reducing the need for trial and
error exploration to learn good protocols.
While we limit our analysis to discrete messages, DIAL naturally handles continuous
message spaces, as they are used anyway during centralised learning. DIAL can also scale
naturally to large discrete message spaces, since it learns binary encodings instead
of the one-hot encoding in RIAL, $|m| = O(\log(|M|))$.
$$Q(\cdot),\, m^a_t = \text{C-Net}\big(o^a_t,\, \hat{m}^{a'}_{t-1},\, h^a_{t-1},\, u^a_{t-1},\, a;\, \theta_i\big). \tag{6.5.1}$$
We feed in the previous action, $u^a_{t-1}$, the agent index, $a$, along with the observation
$o^a_t$, the previous internal state, $h^a_{t-1}$, and the incoming messages $\hat{m}^{a'}_{t-1}$ from other
agents. After all agents have taken their actions, we query the environment for
a state update and reward information.
When we reach the final time-step or a terminal state, we proceed to the
backwards pass. Here, for each agent $a$ and time-step $t$, we calculate a target
Q-value, $y^a_t$, using the observed reward, $r_t$, and the discounted target network. We
then accumulate the gradients, $\nabla\theta$, by regressing the Q-value estimate
$$Q(o^a_t, \hat{m}^{a'}_{t-1}, h^a_{t-1}, u^a_{t-1}, a, u; \theta_i), \tag{6.5.2}$$
against the target Q-value, $y^a_t$, for the chosen action, $u^a_t$. We also update the message
gradient chain $\mu^a_t$, which contains the derivative of the downstream bootstrap errors
$\sum_{m, t' > t} (\Delta Q^{a'}_{t+1})^2$ with respect to the outgoing message $m^a_t$.
To allow for efficient calculation, this sum can be broken into two parts. The
first part, $\frac{\partial}{\partial \hat{m}^a_t} \sum_{m' \neq m} (\Delta Q^{a'}_{t+1})^2$, captures the impact of the message on the total
estimation error of the next step. The impact of the message $m^a_t$ on all rewards
further downstream, $t' > t + 1$, can be calculated using the partial derivative of the
outgoing messages of the agents at time $t + 1$ with respect to the incoming message
$m^a_t$, multiplied with their message gradients, $\mu^{a'}_{t+1}$. Using the message gradient, we
can then calculate the derivative with respect to the parameters, $\mu^a_t \frac{\partial \hat{m}^a_t}{\partial \theta}$.
Having accumulated all gradients, we conduct two parameter updates: first
$\theta_{i+1} = \theta_i + \alpha \nabla\theta$ in the direction of the accumulated gradients, and then, every
$C$ steps, we reset the target network, $\theta^-_i = \theta_i$.
During decentralised execution, the outgoing activations in the channel are mapped
into a binary vector, $\hat{m}^a_t = \mathbb{1}\{m^a_t > 0\}$. This ensures that discrete messages are
exchanged, as required by the task.
In order to minimise the discretisation error when mapping from continuous
values to discrete encodings, two measures are taken during centralised learning.
First, Gaussian noise is added in order to limit the number of bits that can be
encoded in a given range of m values. Second, the noisy message is passed through
a logistic function to restrict the range available for encoding information. Together,
these two measures regularise the information transmitted through the bottleneck.
Furthermore, the noise also perturbs values in the middle of the range, due to the
steeper slope, but leaves the tails of the distribution unchanged. Formally, during
centralised learning, m is mapped to m̂ = Logistic (N (m, σ)), where σ is chosen to be
comparable to the width of the logistic function. In Algorithm 3, the mapping logic
from $m$ to $\hat{m}$ during training and execution is contained in the $\mathrm{DRU}(m^a_t)$ function.
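A minimal sketch of the DRU (NumPy; the function signature is an assumption, while the noise, logistic, and thresholding follow the description above):

```python
import numpy as np

_rng = np.random.default_rng(0)

def dru(m, sigma=2.0, training=True):
    """Discretise/Regularise Unit (illustrative sketch): noisy logistic during
    centralised learning, hard binarisation during decentralised execution."""
    m = np.asarray(m, dtype=np.float64)
    if training:
        # Gaussian noise limits how many bits can be encoded in a given range of m;
        # the logistic then squashes the noisy message into (0, 1).
        noisy = m + _rng.normal(0.0, sigma, size=m.shape)
        return 1.0 / (1.0 + np.exp(-noisy))
    # Execution: exchange a hard, discrete bit.
    return (m > 0).astype(np.float64)

print(dru(np.array([1.5])))                   # continuous, noisy value in (0, 1)
print(dru(np.array([1.5]), training=False))   # [1.]
```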
6.6 Experiments
In this section, we evaluate RIAL and DIAL with and without parameter sharing
in two multi-agent problems and compare it with a no-communication shared
parameters baseline (NoComm). Results presented are the average performance
across several runs, where those without parameter sharing (-NS), are represented
by dashed lines. Across plots, rewards are normalised by the highest average reward
achievable given access to the true state (Oracle).
In our experiments, we use an $\epsilon$-greedy policy with $\epsilon = 0.05$, the discount factor
is γ = 1, and the target network is reset every 100 episodes. To stabilise learning,
we execute parallel episodes in batches of 32. The parameters are optimised using
RMSProp with momentum of 0.95 and a learning rate of 5 × 10−4 . The architecture
makes use of rectified linear units (ReLU), and gated recurrent units (GRU) [38],
which have similar performance to long short-term memory [102] (LSTM) [103,
104]. Unless stated otherwise we set σ = 2, which was found to be essential for
good performance. We intend to publish the source code online.
[Diagram: the recurrent agent architecture unrolled over time-steps, maintaining an internal state $h^a_t$ and receiving $(o^a_t, m^a_{t-1}, u_{t-1}, a)$ as input at each step.]
6.6.2 Switch Riddle
The first task is inspired by a well-known riddle described as follows: “One hundred
prisoners have been newly ushered into prison. The warden tells them that starting
tomorrow, each of them will be placed in an isolated cell, unable to communicate
amongst each other. Each day, the warden will choose one of the prisoners uniformly
at random with replacement, and place him in a central interrogation room containing
only a light bulb with a toggle switch. The prisoner will be able to observe the current
state of the light bulb. If he wishes, he can toggle the light bulb. He also has the
option of announcing that he believes all prisoners have visited the interrogation
Figure 6.3: Switch: Every day one prisoner gets sent to the interrogation room where
he sees the switch and chooses from “On”, “Off”, “Tell” and “None”.
room at some point in time. If this announcement is true, then all prisoners are
set free, but if it is false, all prisoners are executed. The warden leaves and the
prisoners huddle together to discuss their fate. Can they agree on a protocol that
will guarantee their freedom?” [106].
Architecture. In our formalisation, at time-step $t$, agent $a$ observes $o^a_t \in \{0, 1\}$,
which indicates if the agent is in the interrogation room. Since the switch has
two positions, it can be modelled as a 1-bit message, $m^a_t$. If agent $a$ is in the
interrogation room, then its actions are $u^a_t \in \{\text{“None”}, \text{“Tell”}\}$; otherwise the only
action is “None”. The episode ends when an agent chooses “Tell” or when the
maximum time-step, $T$, is reached. The reward $r_t$ is 0 unless an agent chooses
“Tell”, in which case it is 1 if all agents have been to the interrogation room and
$-1$ otherwise. Following the riddle definition, in this experiment $m^a_{t-1}$ is available
only to the agent $a$ in the interrogation room. Finally, we set the time horizon
$T = 4n - 6$ in order to keep the experiments computationally tractable.
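A compact environment sketch for this formalisation (Python; the policy interface and reward bookkeeping are assumptions chosen for readability, not the thesis code):

```python
import random

def switch_riddle_episode(agents_policies, n_agents, seed=0):
    """Run one switch-riddle episode and return its reward (illustrative sketch).

    agents_policies(agent, switch_bit) -> (action, new_switch_bit), where action
    is 'None' or 'Tell' and new_switch_bit is the bulb state the agent leaves behind.
    """
    rng = random.Random(seed)
    T = 4 * n_agents - 6
    switch, visited = 0, set()
    for _ in range(T):
        agent = rng.randrange(n_agents)                    # warden's uniform choice
        visited.add(agent)
        action, switch = agents_policies(agent, switch)    # only this agent sees the bulb
        if action == 'Tell':
            return 1.0 if len(visited) == n_agents else -1.0
    return 0.0                                             # horizon reached without 'Tell'
```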
Complexity. The switch riddle poses significant protocol learning challenges. At any time-step t, there are |o|^t possible observation histories for a given agent, with |o| = 3: the agent either is not in the interrogation room or receives one of two messages when he is. For each of these histories, an agent can choose between 4 = |U||M| different options, so at time-step t the single-agent policy space is (|U||M|)^{|o|^t} = 4^{3^t}. The product over all time-steps defines the total policy space for a single agent, 4^{Σ_t 3^t} = 4^{(3^{T+1}−3)/2}. The size of the multi-agent policy space grows exponentially in n, the number of agents: 4^{n(3^{T+1}−3)/2}. We consider a setting where T is proportional to the number of agents, so the total policy space is 4^{n3^{O(n)}}. For n = 4, the size is 4^{354288}.
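As a quick sanity check on these numbers, the exponent n(3^{T+1} − 3)/2 can be evaluated directly; the short snippet below reproduces the figure quoted for n = 4 (with T = 4n − 6).

```python
def policy_space_exponent(n):
    # Multi-agent policy space is 4 raised to n * (3**(T + 1) - 3) // 2, with T = 4n - 6.
    T = 4 * n - 6
    return n * (3 ** (T + 1) - 3) // 2

print(policy_space_exponent(3))   # exponent for n = 3
print(policy_space_exponent(4))   # 354288, i.e. a policy space of size 4**354288
```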
Figure 6.4: Switch: (a-b) Performance of DIAL and RIAL, with and without (-NS) parameter sharing, and the NoComm baseline, for n = 3 and n = 4 agents. (c) The decision tree extracted for n = 3 to interpret the communication protocol discovered by DIAL; the tree branches on the day (2 vs. 3+), on whether the prisoner has been in the room before, and on the switch position, selecting among “On”, “Off”, “Tell” and “None”.
[Figure: schematic of the two agents exchanging messages m_1, ..., m_4 over successive steps and selecting actions u^a_1, ..., u^a_5.]
Protocol. The decision tree in Figure 6.4(c) shows that a prisoner entering the interrogation room after day two needs to distinguish only whether one or two prisoners have been in the room before. If three prisoners had been, the third prisoner would have finished the game. The remaining two options can be encoded via the “On” and “Off” positions respectively.
In this section, we consider two tasks based on the well-known MNIST digit classification dataset [7].
Colour-Digit MNIST is a two-player game in which each agent observes the pixel values of a random MNIST digit in red or green of size 2 × 28 × 28, while the colour label, c^a ∈ {0, 1}, and digit value, d^a ∈ {0, ..., 9}, are hidden. For each agent, the reward consists of two components that are antisymmetric in the action, colour, and parity (odd, even) of the digits. Only one bit of information can be sent, so agents must agree to encode/decode either colour or parity, with parity yielding greater rewards. The game has two steps: in the first step, both agents send a 1-bit message, and in the second step they select a binary action u^a_2. The reward for each agent is r(a) = 2(−1)^{u^a_2 + c^a + d^{a′}} + (−1)^{u^a_2 + d^a + c^{a′}}, where a′ denotes the other agent, and the total cooperative reward is the sum of the two agents' rewards.
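For concreteness, a direct transcription of this reward (as reconstructed above) into code is shown below; the function name and argument order are illustrative.

```python
def colour_digit_reward(u, c_own, d_own, c_other, d_other):
    # Two antisymmetric components: the first couples the action with the agent's
    # own colour and the other agent's digit parity, the second with the agent's
    # own digit parity and the other agent's colour.
    return 2 * (-1) ** (u + c_own + d_other) + (-1) ** (u + d_own + c_other)

# Example: agent with colour 0 and digit 3, partner with colour 1 and digit 6.
print(colour_digit_reward(u=0, c_own=0, d_own=3, c_other=1, d_other=6))
```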
Multi-Step MNIST is a game in which each agent again observes a random MNIST digit and must ultimately guess the digit of the other agent, communicating over five steps. Only at the final step, t = 5, is reward given: r_5 = 0.5 for each correctly guessed digit, u^a_5 = d^{a′}. As only 1 bit is sent per step, agents must find a protocol that integrates information across the four messages they exchange (the last message is not received). The protocol can be trained using gradients in DIAL, but it also needs to have a low discretisation error.
Architecture. The input processing network is a 2-layer MLP, TaskMLP[(|c| × 28 × 28), 128, 128](o^a_t). Figure 6.5 depicts the generalised setting for both games. Our experimental evaluation showed improved training time when using batch normalisation after the first layer.
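A minimal PyTorch rendering of such an input network, assuming a two-channel 28 × 28 input and batch normalisation after the first layer as described, might look as follows; the layer sizes follow the text and everything else is illustrative.

```python
import torch
import torch.nn as nn

class TaskMLP(nn.Module):
    """2-layer MLP over the flattened (|c| x 28 x 28) colour-digit image."""
    def __init__(self, channels=2, hidden=128):
        super().__init__()
        in_dim = channels * 28 * 28
        self.fc1 = nn.Linear(in_dim, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)    # batch norm after the first layer
        self.fc2 = nn.Linear(hidden, hidden)

    def forward(self, obs):
        x = obs.flatten(start_dim=1)
        x = torch.relu(self.bn1(self.fc1(x)))
        return torch.relu(self.fc2(x))

emb = TaskMLP()(torch.randn(32, 2, 28, 28))   # batch of 32 observations
```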
Experimental results. Figures 6.6(a) and 6.6(b) show that DIAL substantially
outperforms the other methods on both games. Furthermore, parameter sharing
is crucial for reaching the optimal protocol. In multi-step MNIST, results were
obtained with σ = 0.5. In this task, RIAL fails to learn, while in colour-digit
MNIST it fluctuates around local minima in the protocol space; the NoComm
baseline is stagnant at zero. DIAL’s performance can be attributed to directly
optimising the messages in order to reduce the global DQN error while RIAL
must rely on trial and error. DIAL can also optimise the message content with
respect to rewards taking place many time-steps later, due to the gradient passing
between agents, leading to optimal performance in multi-step MNIST. To analyse
the protocol that DIAL learned, we sampled 1K episodes. Figure 6.6(c) illustrates
the communication bit sent at time-step t by agent 1, as a function of its input
digit. Thus, each agent has learned a binary encoding and decoding of the digits.
These results illustrate that differentiable communication in DIAL is essential to
fully exploiting the power of centralised learning and thus is an important tool
for studying the learning of communication protocols.
Our results show that DIAL deals more effectively with the high-dimensional input space in the colour-digit MNIST game than RIAL. To better understand why, as a thought experiment, consider a simpler two-agent problem with a structurally similar reward function r = (−1)^{s_1 + s_2 + u_2}, which is antisymmetric in the observations and action of the agents. Here random digits s_1, s_2 ∈ {0, 1} are input
Figure 6.6: MNIST games: (a, b) Performance of DIAL and RIAL, with and without (-NS) parameter sharing, and NoComm, for both MNIST games. (c) Extracted coding scheme for multi-step MNIST.
to agent 1 and agent 2, and u_2 ∈ {1, 2} is a binary action. Agent 1 can send a single-bit message, m_1. Until a protocol has been learned, the average reward for any action by agent 2 is 0, since averaged over s_1 the reward has an equal probability of being +1 or −1. Equally, the TD error for agent 1, the sender, is zero for any message m:

E[ΔQ(s_1, m_1)] = Q(s_1, m_1) − E_{s_2, u_2}[r(s_2, u_2, s_1)] = 0 − 0.    (6.6.1)
By contrast, DIAL allows for learning. Unlike the TD error, the gradient is a function of the action and the observation of the receiving agent, so summed across the different +1/−1 outcomes the gradient updates for the message m no longer cancel:

E[∇θ] = E_{⟨s_2, u_2⟩}[ (Q(s_2, m_1, u_2) − r(s_2, u_2, s_1)) ∂/∂m Q(s_2, m_1, u_2) ] ∂/∂θ m_1(s_1).    (6.6.2)
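To illustrate this cancellation argument numerically, the following sketch enumerates the toy problem above with a randomly initialised linear Q-function for agent 2 (a stand-in we introduce purely for illustration). It shows that the expected TD target for the sender is zero regardless of its observation, whereas the DIAL-style message gradient depends on s_1 (the reward-dependent part flips sign with s_1), so it carries a learning signal about the sender's observation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative receiver Q-function: Q2(s2, m, u2) = w[s2, u2] * m + b[s2, u2].
w, b = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))

def reward(s1, s2, u2):
    return (-1) ** (s1 + s2 + u2)

for s1 in (0, 1):
    m = 1.0  # any fixed message value sent by agent 1 when observing s1
    td_target, dial_grad = 0.0, 0.0
    for s2 in (0, 1):            # uniform random digit for agent 2
        for u2 in (0, 1):        # actions indexed 0/1; uniform before a protocol exists
            p = 0.25
            r = reward(s1, s2, u2)
            q2 = w[s2, u2] * m + b[s2, u2]
            td_target += p * r                       # RIAL: expected reward as Q-target
            dial_grad += p * (q2 - r) * w[s2, u2]    # DIAL: (Q2 - r) * dQ2/dm
    print(f"s1={s1}: expected TD target = {td_target:+.3f}, "
          f"message gradient = {dial_grad:+.3f}")
```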
The question of why language evolved to be discrete has been studied for
centuries, see e.g., the overview in [107]. Since DIAL learns to communicate in a
continuous channel, our results offer an illuminating perspective on this topic.
In particular, Figure 6.7 shows that, in the switch riddle, DIAL without noise in
the communication channel learns centred activations. By contrast, the presence of
noise forces messages into two different modes during learning. Similar observations
have been made in relation to adding noise when training document models [100]
and performing classification [99]. In our work, we found that adding noise was
essential for successful training.
Figure 6.7: DIAL's learned activations with and without noise in the DRU.

This raises the question of how many distinct values can pass through the channel. A first intuition can be gained by looking at the width of the sigmoid: taking the decodable range of the logistic function to be the x values corresponding to y values between 0.01 and 0.99, an initial estimate for the range is ≈ 10. Thus, requiring distinct x values to be at least six standard deviations apart, with σ = 2, only two values can be encoded reliably in this range. To get a better understanding of the required σ we can visualise the capacity of the channel, including the logistic function and the Gaussian noise. To do so, we must first derive an expression for the probability distribution of outgoing messages, m̂, given incoming activations, m, P(m̂|m):

P(m̂|m) = 1/(√(2π) σ m̂(1 − m̂)) exp( −(m − log(1/m̂ − 1))² / σ² ).    (6.6.3)
For any m, this captures the distribution of messages leaving the channel. Two
m values m1 and m2 can be distinguished when the outgoing messages have a
small probability of overlapping. Given a value m1 we can thus pick a next value
m2 to be distinguishable when the highest value m̂1 that m1 is likely to produce
is less than the lowest value m̂2 that m2 is likely to produce. An approximation
for when this happens is when (maxm̂ s.t.P (m̂|m1 ) > ) = (minm̂ s.t.P (m̂|m2 ) > ).
Figure 6.8 illustrates this for three different values of σ. For σ > 2, only two
options can be reliably encoded using = 0.1, resulting in a channel that effectively
transmits only one bit of information.
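A quick way to reproduce this intuition empirically is to push samples through the noisy channel and inspect how much the outgoing distributions for different inputs overlap. The sketch below does this by Monte Carlo rather than via Equation 6.6.3, assuming the channel applies a logistic squashing to the message plus Gaussian noise of scale σ, as suggested by the form of that equation.

```python
import numpy as np

def dru_samples(m, sigma, n=100_000, seed=0):
    # Regularised channel during training: m_hat = logistic(m + Gaussian noise).
    rng = np.random.default_rng(seed)
    z = m + sigma * rng.standard_normal(n)
    return 1.0 / (1.0 + np.exp(-z))

for sigma in (0.5, 1.0, 2.0):
    for m in (-5.0, 0.0, 5.0):
        samples = dru_samples(m, sigma)
        lo, hi = np.quantile(samples, [0.1, 0.9])
        print(f"sigma={sigma:3.1f}  m={m:+4.1f}  80% of outputs in [{lo:.2f}, {hi:.2f}]")
```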
Interestingly, the amount of noise required to regularise the channel depends
greatly on the benefits of over-encoding information. More specifically, as illustrated
Figure 6.8: Distribution of regularised messages, P(m̂|m), for different noise levels (σ = 0.5, 1.0, 2.0), plotted as output m̂ against input m. Shading indicates P(m̂|m) > 0.1. Blue bars show a division of the x-range into intervals such that the resulting y-values have a small probability of overlap, leading to decodable values.
in Figure 6.9, in tasks where sending more bits does not lead to higher rewards,
small amounts of noise are sufficient to encourage discretisation, as the network
can maximise reward by pushing activations to the tails of the sigmoid, where the
noise is minimised. The figure illustrates the final average evaluation performance, normalised by the training performance, of three runs after 50K epochs of the multi-step MNIST game, under different noise regularisation levels σ ∈ {0, 0.5, 1, 1.5, 2} and different numbers of steps ∈ {2, ..., 5}. When the lines exceed “Regularised”, the test reward, after discretisation, is higher than the training reward, i.e., the channel is properly regularised and is being used as a single bit at the end of
learning. Given that there are 10 digits to encode, four bits are required to get
full rewards. Reducing the number of steps directly reduces the number of bits
that can be communicated, #bits = steps − 1, and thus creates an incentive for
the network to “over-encode” information in the channel, which leads to greater
discretisation error. This is confirmed by the normalised performance for σ = 0.5,
which is around 0.7 for 2 steps (1 bit) and then goes up to > 1 for 5 steps (4
bits). We also note that, without noise, regularisation is not possible and that
with enough noise the channel is always regularised, even if it would yield higher
training rewards to over-encode information.
Figure 6.9: Final evaluation performance on multi-step MNIST of DIAL normalised
by training performance after 50K epochs, under different noise regularisation levels
σ ∈ {0, 0.5, 1, 1.5, 2}, and different numbers of steps step ∈ [2, . . . , 5].
7 Bayesian Action Decoder
Figure 7.1: a) In an MDP the action u is sampled from a policy π that conditions on the state features (here separated into f^pub and f^a). The next state is sampled from P(s′|s, u). b) In a PuB-MDP, the public features f^pub generated by the environment and the public belief together constitute the Markov state s_BAD. The ‘action’ sampled by the BAD agent is in fact a deterministic partial policy π̂ ∼ π_BAD(π̂|s_BAD) that maps from private observations f^a to actions. Only the acting agent observes f^a and deterministically computes u = π̂(f^a). u is provided to the environment, which transitions to state s′ and produces the new observation f^pub′. BAD then uses the public belief update to compute a new belief B′ conditioned on u and π̂ (Equation 7.3.1), thereby completing the state transition.
7.1 Introduction
The goal in Hanabi is to play a legal sequence of cards and, to aid this process,
players are allowed to give each other hints indicating which cards are of a specific
rank or colour. These hints have two levels of semantics. The first level is the
surface-level content of the hint, which is grounded in the properties of the cards
that they describe. This level of semantics is independent of any possible intent of
the agent in providing the hint, and would be equally meaningful if provided by
a random agent. For example, knowing which cards are of a specific colour often
does not indicate whether they can be safely played or discarded.
A second level of semantics arises from information contained in the actions
themselves, i.e., the very fact that an agent decided to take a particular action
and not another, rather than the information resulting from the state transition
induced by the action. This is essential to the formation of conventions and to
discovering good strategies in Hanabi.
To address these challenges, we propose the Bayesian action decoder (BAD), a
novel multi-agent RL algorithm for discovering effective communication protocols
and policies in cooperative, partially observable multi-agent settings. Inspired by the
work of Nayyar, Mahajan, and Teneketzis [110], BAD uses all publicly observable
features in the environment to compute a public belief over the players’ private
features. This effectively defines a new Markov process, the public belief Markov
decision process (PuB-MDP), in which the action space is the set of deterministic
partial policies, parameterised by deep neural networks, that can be sampled for a
given public state. By acting in the space of deterministic partial policies that map
from private observations into environment actions, an agent acting only on this
public belief state can still learn an optimal policy. Using approximate, factorised
Bayesian updates and deep neural networks, we show for the first time how a method
using the public belief of Nayyar, Mahajan, and Teneketzis [110], can scale to large
state spaces and allow agents to carry out a form of counterfactual reasoning.
When an agent observes the action of another agent, the public belief is updated
by sampling a set of possible private states from the public belief and filtering for
those states in which the teammate chose the observed action. This process is closely
related to the kind of theory of mind reasoning that humans routinely undertake [111].
Such reasoning seeks to understand why a person took a specific action among several,
and what information this contains about the distribution over private observations.
We experimentally validate an exact version of BAD on a simple two-step
matrix game, showing that it outperforms policy gradient methods. We then
apply an approximate version to Hanabi, where BAD achieves an average score of
24.174 points in the two-player setting, surpassing the best previously published
results for learning agents by around 9 points and approaching the best known
performance of 24.9 points for (cheating) open-hand gameplay. BAD thus establishes a new state of the art on the Hanabi Learning Environment [Bard et al., 2019] for the two-player self-play setting. We further show that the beliefs obtained
via Bayesian reasoning have 40% less uncertainty over possible hands than those
using only grounded information.
7.2 Setting
7.3 Method
Below we introduce the Bayesian Action Decoder (BAD). BAD scales the public belief of Nayyar, Mahajan, and Teneketzis [110] to large state spaces using approximate, factorised Bayesian updates and deep neural networks.
Since π̂ and u^a_t are public information, observing u^a_t induces a posterior belief over the possible private state features f^pri_t, given by the public belief update of Equation 7.3.1.
Using this Bayesian belief update, we can define a new Markov process, the public
belief MDP (PuB-MDP), as illustrated in Figure 7.1b. The state sBAD ∈ SBAD of
the PuB-MDP consists of the public observation and public belief; the action space
is the set of deterministic partial policies that map from private observations to
environment actions; and the transition function is given by P (s0BAD |sBAD , π̂). The
next state contains the new public belief calculated using the public belief update.
The reward function marginalises over the private state features:
rBAD (sBAD , π̂) = B(f pri )r(s, π̂(f pri )), (7.3.3)
X
f pri
where sBAD = {B, f pub }. Since s0BAD includes the new public belief, and that belief
is computed via an update which conditions on π̂, the PuB-MDP transition function
conditions on all of π̂, not just the selected action uat . Thus the state transition
depends not just on the executed action, but on the counterfactual actions, i.e.,
those specified by π̂ for private observations other than fta .
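As a small illustration of Equation 7.3.3, the expected reward of a sampled partial policy can be computed by marginalising over the public belief; the sketch below assumes a finite set of private features and uses illustrative names.

```python
def pub_mdp_reward(belief, partial_policy, reward_fn, public_state):
    """Marginalise the environment reward over the public belief (Eq. 7.3.3).

    belief:         dict mapping private feature f_pri -> probability B(f_pri)
    partial_policy: dict mapping private feature f_pri -> environment action
    reward_fn:      callable (public_state, action) -> reward, standing in for r(s, u)
    """
    return sum(
        prob * reward_fn(public_state, partial_policy[f_pri])
        for f_pri, prob in belief.items()
    )

# Toy usage: two possible private cards with a belief over them.
belief = {"red1": 0.7, "blue2": 0.3}
partial_policy = {"red1": "play", "blue2": "discard"}
reward_fn = lambda s, u: 1.0 if u == "play" else 0.0
print(pub_mdp_reward(belief, partial_policy, reward_fn, public_state=None))  # 0.7
```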
In the remainder of this section, we describe how factorised beliefs and policies
can be used to learn a public policy πBAD for the PuB-MDP efficiently.
For each public state, π_BAD must select a distribution π_BAD(π̂|s_BAD) over deterministic partial policies. The size of this space is exponential in the number of possible private observations |f^a|, but we can reduce this to a linear dependence by assuming a distribution over π̂ that factorises across the different private observations, i.e., for all π̂,

π_BAD(π̂|s_BAD) = Π_{f^a} π_BAD(π̂(f^a) | s_BAD, f^a).

With this restriction, we can easily parameterise π_BAD with factors of the form π^θ_BAD(u^a | B_t, f^pub, f^a) using a function approximator such as a deep neural network.
In order for all of the agents to perform the public belief update, the sampled π̂ must be public. We resolve this by having π̂ sampled deterministically from a given B_t and f^pub_t, using a common knowledge random seed ξ_t. The seeds are then shared prior to the game so that all agents sample the same π̂: this resembles the way humans share common ways of reasoning in card games and allows the agents to explore alternative policies jointly as a team.
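One simple way to realise this common-knowledge sampling is to seed every agent's random number generator identically before drawing the partial policy, so that all agents deterministically arrive at the same π̂; the sketch below is illustrative and not tied to our implementation.

```python
import numpy as np

def sample_partial_policy(policy_probs, seed):
    """Sample one action per private observation from the factorised policy.

    policy_probs: array of shape (n_private_obs, n_actions) holding
                  pi_BAD(u | s_BAD, f^a) for every possible private observation.
    seed:         common-knowledge seed xi_t shared by all agents.
    """
    rng = np.random.default_rng(seed)
    n_obs, n_actions = policy_probs.shape
    # Every agent with the same seed draws the identical deterministic mapping
    # f^a -> u, even though only the acting agent knows its true f^a.
    return np.array([rng.choice(n_actions, p=policy_probs[i]) for i in range(n_obs)])

probs = np.full((4, 3), 1.0 / 3.0)            # toy factorised policy
pi_hat_agent1 = sample_partial_policy(probs, seed=123)
pi_hat_agent2 = sample_partial_policy(probs, seed=123)
assert (pi_hat_agent1 == pi_hat_agent2).all()
```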
In general, representing exact beliefs is intractable in all but the smallest state
spaces. For example, in card games the number of possible hands is typically
exponential in the number of cards held by all players. To avoid this unfavourable
scaling, we can instead represent an approximate factorised belief state
P(f^pri_t | f^pub_{≤t}) ≈ Π_i P(f^pri_t[i] | f^pub_{≤t}) := B^fact_t.    (7.3.5)
From here on we drop the superscript and use B exclusively to refer to the factorised
belief. In a card game each factor represents per-card probability distributions,
assuming approximate independence across the different cards both within a hand
and across players. This approximation makes it possible to represent and reason
over the otherwise intractably large state spaces that commonly occur in many
settings, including card games.
To carry out the public belief update with a factorised representation we maintain
factorised likelihood terms Lt [f [i]] for each private feature that we update recursively:
B^0 = B_t,    (7.3.9)

B^{k+1}(f[i]) ∝ E_{f[−i] ∼ B^k}[ L_t(f[i]) P(f[i] | f[−i], f^pub_t) ],    (7.3.11)
Figure 7.2: Payoffs for the toy matrix-like game. The two outer dimensions correspond
to the card held by each player, the two inner dimensions to the action chosen by each
player. Payouts are structured such that Player 1 must encode information about their
card in the action they chose in order to obtain maximum payoffs. Although presented
here in matrix form for compactness, this is a two-step, turn-based game, with Player 1
always taking the first action and Player 2 taking an action after observing Player 1’s
action.
where f[−i] denotes all features excluding f[i]. In the last step we rewrote the sum as an expectation under the current belief, so that we can use samples to approximate the intractable sum across features. The notion of refining the probability across one feature while keeping the probability across all other features fixed is similar to the Expectation Propagation algorithm used in factor graphs [114]. However, the card counts constitute a global factor, which renders the factor graph formulation less useful.
7.4 Experiments and Results

7.4.1 Matrix Game

We first validate an exact version of BAD on a simple two-step, partially observable matrix-like game (Figure 7.2). The state consists of 2 random bits (the cards for Player 1 and 2) and the action space consists of 3 discrete actions. Each player observes its own card, with Player 2 also observing Player 1's action before acting, which in principle allows Player 1 to encode information about its card with its action. The reward is specified by a payoff tensor, r = Payoff[card^1][card^2][u^1][u^2], where card^a and u^a are the card and action of the two players, respectively. The payoff tensor is structured such that the optimal reward can only be achieved if Player 1 encodes information about its card in its action.
Figure 7.3: BAD, both with and without counterfactual gradients, outperforms vanilla policy gradient on the matrix game. Shown is the mean of 1000 games.

However, the additional improvement in performance from using counterfactual (CF) gradients is minor compared to the initial performance gain from using a counterfactual belief state. The code for the matrix game is available at https://bit.ly/2P3YOyd.
7.4.2 Hanabi
Let N_h be the number of cards in a hand and n the number of players. In the standard game of Hanabi, N_h = 5 for n = 2 or 3 and N_h = 4 for n = 4 or 5 players. For generality, we consider that for each colour there are three cards with rank = 1, one with rank = N_rank, and two each of rank = 2, ..., (N_rank − 1), i.e., 2N_rank cards per colour, for a total of N_deck = 2 N_color N_rank cards in the deck. In the standard game of Hanabi, N_color = N_rank = 5, for a total of N_deck = 50. While this is a modestly large number of cards, even for 2 players it leads to 6.2 × 10^13 possible joint hands at the beginning of the game.
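The per-colour count works out directly from the rank multiplicities given above; the snippet below simply re-derives N_deck.

```python
def deck_size(n_color=5, n_rank=5):
    # Per colour: three 1s, two each of ranks 2..(n_rank - 1), and one n_rank.
    per_colour = 3 + 2 * (n_rank - 2) + 1     # equals 2 * n_rank
    return n_color * per_colour

print(deck_size())   # 50 cards in the standard game
```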
Each player observes the hands of all other players, but not their own. The action
space consists of Nh × 2 options for discarding and playing cards, and Ncolor + Nrank
options per teammate for hinting colours and ranks. Hints reveal all cards of a
specific rank or colour to one of the teammates, e.g., ‘Player 2’s cards 3 and 5 are red’. Hinting for colours and ranks not present in the hand of the teammate
(so-called ‘empty hints’) is not allowed.
Each hint costs one hint token. The game starts with 8 hint tokens, which can
be recovered by discarding cards. After a player has played or discarded a card, it
draws a new card from the deck. When a player picks up the last card, everyone
(including that player) gets to take one more action before the game terminates.
Legal gameplay consists of building Ncolor fireworks, which are piles of ascending
numbers, starting at 1, for each colour. When the Nrank -th card has been added
to a pile the firework is complete and the team obtains another hint token (unless
they already have 8). Each time an agent plays an illegal card the players lose a life token; after three mistakes the game also terminates. Players receive 1 point after playing any playable card, with a perfect score being N_color N_rank.
The number of hint and life tokens at any time are observed by all players,
as are the played and discarded cards, the last action of the acting player and
any hints provided.
We call this the ‘V0 belief’, in which the belief for each card depends only on
the publicly available information for that card. In our experiments, we focus
on baseline agents that receive this basic belief, rather than the raw hints, as
public observation inputs; while the problem of simply remembering all hints and
their most immediate implication for card counts is potentially challenging for
humans in recreational play, we are here more interested in the problem of forming
effective conventions for high-level play.
As noted above, this basic belief misses an important interaction between the
hints for different slots. We can calculate an approximate version of the self-
consistent beliefs that avoids the potentially expensive and noisy sampling step in
Equation 7.3.11 (note that this sampling is distinct from the sampling required
to compute the marginal likelihood in Equation 7.3.8).
In the last two lines we are normalising the probability, since the probability of
the i-th feature being one of the possible values must sum to 1. For convenience
we also introduced the notation βi for the normalisation factor.
Next we apply the same logic to the iterative belief update. The key insight
here is to note that conditioning on the features f [−i], i.e., the other cards in the
slots, corresponds to reducing the card counts in the candidates. Below we use
M (f [i]) = HM(f [i]) × L(f [i]) for notational convenience:
B^{k+1}(f[i]) = Σ_{g[−i]} B^k(g[−i]) β_i ( C(f) − Σ_{j≠i} 1(g[j] = f) ) M(f[i]).    (7.4.6)

In the last line we relabelled the dummy index f[−i] to g[−i] for clarity and used the result from above. Next we substitute the factorised belief assumption across the features, B^k(g[−i]) = Π_{j≠i} B^k(g[j]):
B^{k+1}(f[i]) = Σ_{g[−i]} B^k(g[−i]) β_i ( C(f) − Σ_{j≠i} 1(g[j] = f) ) M(f[i])    (7.4.7)
             = Σ_{g[−i]} Π_{j≠i} B^k(g[j]) β_i ( C(f) − Σ_{j≠i} 1(g[j] = f) ) M(f[i]).    (7.4.8)
In the last line we have omitted the dependency of β_i on the sampled hands f[−i]. It corresponds to calculating the average across sampled hands first and then normalising (which is approximate but tractable) rather than normalising and then averaging (which is exact but intractable). We can now use product-sum rules to simplify the expression:
B^{k+1}(f[i]) ≃ β_i ( C(f) − Σ_{j≠i} Σ_g B^k(g[j]) 1(g[j] = f) ) M(f[i])    (7.4.10)
             = β_i ( C(f) − Σ_{j≠i} B^k(f[j]) ) M(f[i])    (7.4.12)
             ∝ ( C(f) − Σ_{j≠i} B^k(f[j]) ) M(f[i]).    (7.4.13)
We call the resulting belief at convergence (or after a maximum number of iterations)
the ‘V1 belief’. It does not condition on the Bayesian probabilities but considers
interactions between hints for different cards. In essence, at each iteration the
belief for a given slot is updated by reducing the candidate count by the number
of cards believed to be held across all other slots.
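The iterative update in Equations 7.4.10–7.4.13 is straightforward to implement with per-slot belief vectors; the following numpy sketch is our own illustrative code (variable names are ours) and takes the per-slot factor M(f[i]) from the text as an input array.

```python
import numpy as np

def iterative_belief(counts, M, n_slots, n_iters=100):
    """Iterate B^{k+1}(f[i]) proportional to (C(f) - sum_{j != i} B^k(f[j])) * M[i, f].

    counts: array (n_card_types,) of remaining candidate counts C(f).
    M:      array (n_slots, n_card_types); the per-slot factor M(f[i]).
    """
    B = M / M.sum(axis=1, keepdims=True)          # initial per-slot beliefs
    for _ in range(n_iters):
        total = B.sum(axis=0)                     # expected cards held over all slots
        for i in range(n_slots):
            others = total - B[i]                 # sum over j != i of B^k(f[j])
            unnorm = np.clip(counts - others, 0.0, None) * M[i]   # clip: practical safeguard
            B[i] = unnorm / unnorm.sum()          # beta_i normalisation
    return B

# Toy example: 3 card types with counts (3, 2, 1), two slots, uniform factor M.
print(iterative_belief(np.array([3.0, 2.0, 1.0]), np.ones((2, 3)), n_slots=2))
```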
By running the same algorithm but including the likelihood terms L, we obtain the Bayesian beliefs, BB, that lie at the core of BAD.
In practice, to ensure stability, the final ‘V2 belief’ that we use is an interpolation
between the Bayesian belief and the V1 belief: V2 = (1 − α)BB + αV1 with
α = 0.01 (we found α = 0.1 to also work).
Figure 7.4: a) Training curves for BAD on Hanabi and the V0 and V1 baseline methods
using LSTMs rather than the Bayesian belief. The thick line for each agent type shows
the final evaluated agent for each type; upward kinks are generally due to agents ‘evolving’ in PBT by copying their weights and hyperparameters (plus perturbations) from a superior agent. b) Distribution of game scores for BAD on Hanabi under testing conditions. BAD
achieves a perfect score in nearly 60% of the games. The dashed line shows the proportion
of perfect games reported for SmartBot, the best known heuristic for two-player Hanabi.
c) Per-card cross entropy with the true hand for different belief mechanisms in Hanabi
during BAD play. V0 is the basic belief based on hints and card counts, V1 is the
self-consistent belief, and V2 is the BAD belief which also includes the Bayesian update.
The BAD agent conveys around 40% of the information via conventions, rather than
grounded information.
For the LSTM-based agents, the value baseline was computed by an MLP with a 256-unit hidden layer and ReLU activations, which then projected linearly to a
single value. Since the baseline network is only used to compute gradient updates,
we followed the centralised critic from Chapter 3 in feeding each agent’s own hand
(i.e., the other agent’s private observation) into the baseline by concatenating
it with the LSTM output; thus we make the common assumption of centralised
training and decentralised execution. We note that the V0 and V1-LSTM agents
differed only in their public belief inputs.
The BAD agent consisted of an MLP with two 384-unit hidden layers and
ReLU activations that processed all observations, followed by a linear softmax
policy readout. To compute the baseline, we used the same MLP as the policy
but included the agent’s own hand in the input (this input was present but zeroed
out for the computation of the policy).
For all agents, illegal actions (such as hinting ‘red’ when there are no red cards in the teammate's hand) were masked out by setting the corresponding policy logits to a large
negative value before sampling an action. In particular, for the non-acting agent
at each turn the only allowed action was the ‘no-action’.
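Masking of this kind is typically applied directly on the logits before the softmax; the minimal sketch below is ours and not the exact training code.

```python
import torch

def masked_policy(logits, legal_mask, neg=-1e9):
    # Set logits of illegal actions to a large negative value, then softmax.
    masked_logits = torch.where(legal_mask, logits, torch.full_like(logits, neg))
    return torch.softmax(masked_logits, dim=-1)

logits = torch.randn(1, 20)                        # e.g. 20 Hanabi actions
legal = torch.zeros(1, 20, dtype=torch.bool)
legal[0, :5] = True                                # only the first five are legal
probs = masked_policy(logits, legal)
action = torch.multinomial(probs, num_samples=1)   # illegal actions carry ~zero mass
```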
7.4.6 Hyperparameters
For the toy matrix game, we used a batch size of 32 and the Adam optimiser with
all default TensorFlow settings; we did not tune hyperparameters for any runs.
For Hanabi, we used the RMSProp optimiser with ε = 10^−10, momentum 0, and decay 0.99. The RL discount factor γ was set to 0.999. The baseline loss was
multiplied by 0.25 and added to the policy-gradient loss. We used population-based
training (PBT) [116, 117] to ‘evolve’ the learning rate and entropy regularisation
parameter during the course of training, with each training run consisting of a
population of 30 agents. For the LSTM agents, learning rates were sampled log-
uniformly from the interval [1, 4) × 10−4 while the entropy regularisation parameter
was sampled log-uniformly from the interval [1, 5) × 10−2 . For the BAD agents,
learning rates were sampled log-uniformly from the interval [9×10−5 , 3×10−4 ) while
the entropy regularisation parameter was sampled log-uniformly from the interval
[3, 7) × 10−2 . Agents evolved within the PBT framework by copying weights and
hyperparameters (plus perturbations) according to each agent’s rating, which was
an exponentially moving average of the episode rewards with factor 0.01. An agent
was considered for copying roughly every 200M steps if a randomly chosen copy-to
agent had a rating at least 0.5 points higher. To allow the best hyperparameters to
manifest sufficiently, PBT was turned off for the first 1B steps of training.
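For reference, log-uniform sampling over such intervals is a one-liner; the snippet below shows how an initial population of 30 LSTM-agent hyperparameter settings could be drawn (illustrative, not our training code).

```python
import numpy as np

def sample_loguniform(rng, low, high):
    # Sample log-uniformly from the interval [low, high).
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

rng = np.random.default_rng(0)
population = [
    {
        "lr": sample_loguniform(rng, 1e-4, 4e-4),            # LSTM agents
        "entropy_coef": sample_loguniform(rng, 1e-2, 5e-2),
    }
    for _ in range(30)                                        # population of 30 agents
]
```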
The BAD agent was trained with 100 self-consistent iterations, a V1 mix-in
of α = 0.01, inverse temperature 1.0, and 3000 sampled hands. Since sampling
from card-factorised beliefs can result in hands that are not compatible with the
deck, we sampled 5 times the number of hands and accepted the first 3000 legal
hands, zeroing out any hands that were illegal.
The BAD agent achieves a new state-of-the-art mean performance of 24.174 points
on two-player Hanabi. In Figure 7.4a we show training curves and test performance
for BAD and two LSTM-based baseline methods, as well as the performance of the best known hand-coded bot for two-player Hanabi. For the LSTM agents,
test performance was obtained by using the greedy version of the trained policy,
resulting in slightly higher scores than during training. To select the agent, we first
performed a sweep over all agents for 10,000 games, then carried out a final test
run of 100,000 games on the best agent from the sweep, since taking the maximum
of a sweep introduces bias in the score. We carried out a similar procedure for
the BAD agent but with additional hyperparameters, also varying the V1 mix-in
factor, and number of sampled hands.
The results for other learning methods from the literature fall well below the range of the y-axis (far below 20 points) and are omitted for readability. We note
that, under a strict interpretation of the rules of Hanabi, games in which all three
error tokens are exhausted should be awarded a score of zero. Under these rules the
same BAD agent achieves 23.917 ± 0.009 s.e.m., still better than the hand-coded
Table 7.1: Test scores on 100K games. The LSTM agents were tested with a greedy
version of the trained policy, while the final BAD agent was evaluated with V1 mix-in
α = 0.01, 20K sampled hands, and inverse softmax temperature 100.0.
bot (for whom only results in which all points are counted have been reported).
This is true even though our agents were not trained under these conditions.
While not all of the game play BAD learns is easy to follow, some conventions
can be understood simply from inspecting the game. Printouts of 100 random
games can be found at https://bit.ly/2zeEShh. One convention stands out:
Hinting for ‘red’ or ‘yellow’ indicates that the newest card of the other player is
playable. We found that in over 80% of cases when an agent hints ‘red’ or ‘yellow’,
the next action of the other agent is to play the newest card. This convention
is very powerful: Typically agents know the least about the newest card, so by
hinting ‘red’ or ‘yellow’, agents can use a single hint to tell the other agent that
the card is playable. Indeed, the use of two colours to indicate ‘play newest card’
was present all of the highest-performing agents we studied. Hinting ‘white’ and
‘blue’ are followed by a discard of the newest card in over 25% of cases. We also
found that the agent sometimes attempts to play cards which are not playable in
order to convey information to their teammate. In general, unlike human players, the agents predominantly play and discard the newest card.
Figure 7.4c) shows the quality of the different beliefs. While the iterated belief
update leads to a reduction in cross entropy compared to the basic belief, a much
greater reduction in cross entropy is obtained using counterfactual beliefs. This
clearly demonstrates the importance of learning conventions for successful gameplay
in Hanabi: Roughly 40% of the information is obtained through conventions rather
than through the grounded information and card counting.
7.5 Related Work
Many works have addressed problem settings where agents must learn to communicate in order to cooperatively solve a toy problem. These tasks typically involve a cheap-talk communication channel that can be modelled as a continuous variable during training, which allows differentiation through the channel. This was first proposed by us, in Chapter 6, and by Sukhbaatar, Fergus, et al. [48], and has since been applied to a number of different settings. In this work we focused on the case where, rather than relying on a cheap-talk channel, agents must learn to communicate via grounded hinting actions and observable environment actions. This setting is closest to the ‘hat game’ in Foerster et al. [118]. In that work, we proposed a simple extension to recurrent deep Q-networks rather than explicitly modelling action-conditioned Bayesian beliefs. An idea very similar to the PuB-MDP was introduced in the
context of decentralised stochastic control by Nayyar, Mahajan, and Teneketzis
[110], who also formulated a coordinator that uses ‘common information’ to map
local controller information to actions. However, they did not provide a concrete
solution method that can scale to a high-dimensional problem like Hanabi.
A number of papers have been published on Hanabi. Baffier et al. [119] showed
that optimal gameplay in Hanabi is NP-hard even when players can observe their
own cards. Encoding schemes similar to the hat game essentially solve the 5-player case [120], but only achieve 17.8 points in the two-player setting [121]. Walton-
Rivers et al. [122] developed a variety of Monte Carlo tree search and rule-based
methods for learning to play the game, but the reported scores were roughly 50%
lower than those achieved by BAD. Osawa [123] defined a number of heuristics for
the two-player case that reason over possible hands given the other player’s action.
While this is similar in spirit to our approach, the work was limited to hand-coded
heuristics, and the reported scores were around 8 points lower than our results.
Eger, Martens, and Cordoba [124] investigated humans playing with hand-coded
agents, but no pairing resulted in scores higher than 15 points on average.
The best result for two-player Hanabi we could find was for the ‘SmartBot’
described at github.com/Quuxplusone/Hanabi, which has been reported to achieve
an average of 23.09 points (29.52% perfect games). While SmartBot uses the
same game rules as those used in our work, it is entirely hand-coded and in-
volves no learning.
The continual re-solving (nested solving) algorithm used by DeepStack [125] and
Libratus [126] for poker also used a belief state space. Like BAD, when making a
decision in a player state, continual re-solving considers the belief state associated
with the current player and generates a joint policy across all player states consistent
with this belief. The policy for the actual player is then selected from this joint
policy. Continual re-solving also does a Bayesian update of the beliefs after an action.
There are key differences, however. Continual re-solving performed exact belief
updates, which requires that the joint policy space be small enough to enumerate;
belief states were also augmented with opponent values. Continual re-solving is a
value-based method, where the training process consists of learning the values of
belief states under optimal play. Finally, the algorithm was designed for two-player,
zero-sum games, where it can independently consider player state values while
guaranteeing that an optimal choice for the joint action policy can be found.
7.6 Conclusion & Future Work

In this chapter we introduced the Bayesian action decoder (BAD) and showed that using the Bayesian update leads to a reduction in uncertainty across the private hands in Hanabi by around 40%. To the best of our knowledge, this is the
first instance in which DMARL has been successfully applied to a problem setting
that both requires the discovery of communication protocols and was originally
designed to be challenging for humans. BAD also illustrates clearly that using an
explicit belief computation achieves better performance in such settings than current
state-of-the-art RL methods using implicit beliefs, such as recurrent neural networks.
In the future we would like to apply BAD to more than 2 players and further
generalise BAD by learning more of the components. While the belief update
necessarily involves a sampling step, most of the other components can likely be
learned end-to-end. We also plan to extend the BAD mechanism to value-based
methods and further investigate the relevance of counterfactual gradients.
Part III
Learning to Reciprocate
Abstract
So far we have assumed fully cooperative settings, in which all agents work together
to maximise a common return. However, in the real world a number of different agents in the same environment commonly pursue their own goals, leading to potential
conflicts of interest. In Chapter 8 we propose a novel framework for allowing
agents to consider the learning behaviour of other agents in the environment as
part of their policy-optimisation. We show that when each agent anticipates one
step of opponent learning, we can obtain drastically different learning outcomes
for self-interested agents. When accounting for the learning behaviour of others
in RL settings, higher order gradients need to be estimated using samples. In
Chapter 9 we propose a new objective which generates the correct gradient estimators
under automatic differentiation.
8 Learning with Opponent-Learning Awareness
Contents
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 127
8.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.3 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
8.3.1 Naive Learner . . . . . . . . . . . . . . . . . . . . . . . . 134
8.3.2 Learning with Opponent Learning Awareness . . . . . . 134
8.3.3 Learning via Policy Gradient . . . . . . . . . . . . . . . 135
8.3.4 LOLA with Opponent modelling . . . . . . . . . . . . . 136
8.3.5 Higher-Order LOLA . . . . . . . . . . . . . . . . . . . . 137
8.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 138
8.4.1 Iterated Games . . . . . . . . . . . . . . . . . . . . . . . 138
8.4.2 Coin Game . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.4.3 Training Details . . . . . . . . . . . . . . . . . . . . . . 141
8.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.5.1 Iterated Games . . . . . . . . . . . . . . . . . . . . . . . 143
8.5.2 Coin Game . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.5.3 Exploitability of LOLA . . . . . . . . . . . . . . . . . . 145
8.6 Conclusion & Future Work . . . . . . . . . . . . . . . . . 146
8.1 Introduction
In the previous two parts of the thesis we focused on methods that address the learning challenges associated with fully cooperative multi-agent RL. However, when self-interested agents learn alongside one another, most current methods ignore the learning process of the other agents and simply treat the other agent as a static part of the environment [127].
As a step towards reasoning over the learning behaviour of other agents in
social settings, we propose Learning with Opponent-Learning Awareness (LOLA).
The LOLA learning rule includes an additional term that accounts for the impact
of one agent’s parameter update on the learning step of the other agents. For
convenience we use the word ‘opponents’ to describe the other agents, even though
the method is not limited to zero-sum games and can be applied in the general-
sum setting. We show that this additional term, when applied by both agents,
leads to emergent reciprocity and cooperation in the iterated prisoner’s dilemma
(IPD). Experimentally we also show that in the IPD, each agent is incentivised
to switch from naive learning to LOLA, while there are no additional gains in
attempting to exploit LOLA with higher order gradient terms. This suggests that
within the space of local, gradient-based learning rules both agents using LOLA
is a stable equilibrium. This is further supported by the good performance of
the LOLA agent in a round-robin tournament, where it successfully manages to
shape the learning of a number of multi-agent learning algorithms from the literature.
This leads to the overall highest average return on the IPD and good performance
on Iterated Matching Pennies (IMP).
We also present a version of LOLA adapted to the DMARL setting using likelihood ratio policy gradients, making LOLA scalable to settings with high-dimensional input and parameter spaces.
We evaluate the policy gradient version of LOLA on the IPD and iterated
matching pennies (IMP), a simplified version of rock-paper-scissors. We show that
LOLA leads to cooperation with high social welfare, while independent policy
gradients, a standard multi-agent RL approach, does not. The policy gradient
finding is consistent with prior work, e.g., Sandholm and Crites [128]. We also
extend LOLA to settings where the opponent policy is unknown and needs to be
inferred from state-action trajectories of the opponent’s behaviour.
8.2 Related Work
The study of general-sum games has a long history in game theory and evolution.
Many papers address the iterated prisoner’s dilemma (IPD) in particular, including
the seminal work on the topic by Axelrod [16]. This work popularised tit-for-tat
(TFT), a strategy in which an agent cooperates on the first move and then copies
the opponent’s most recent move, as an effective and simple strategy in the IPD.
A number of methods in multi-agent RL aim to achieve convergence in self-play
and rationality in sequential, general sum games. Seminal work includes the family of
WoLF algorithms [129], which uses different learning rates depending on whether an
agent is winning or losing, joint-action-learners (JAL), and AWESOME [87]. Unlike
LOLA, these algorithms typically have well understood convergence behaviour
given an appropriate set of constraints. However, none of these algorithms has the ability to shape the learning behaviour of the opponents in order to obtain higher
payouts at convergence. AWESOME aims to learn the equilibria of the one-shot
game, a subset of the equilibria of the iterated game.
Detailed studies have analysed the dynamics of JALs in general sum settings:
This includes work by Uther and Veloso [130] in zero-sum settings and by Claus
and Boutilier [131] in cooperative settings. Sandholm and Crites [128] study
the dynamics of independent Q-learning in the IPD under a range of different
exploration schedules and function approximators. Wunder, Littman, and Babes
[132] and Zinkevich, Greenwald, and Littman [133] explicitly study the convergence
dynamics and equilibria of learning in iterated games. Unlike LOLA, these papers
do not propose novel learning rules.
Figure 8.1: a) shows the probability of cooperation in the iterated prisoner’s dilemma (IPD) at the end of 50 training runs for both agents as a function of state under naive
learning (NL-Ex) and b) displays the results for LOLA-Ex when using the exact gradients
of the value function. c) shows the normalised discounted return for both agents in NL-Ex
vs. NL-Ex and LOLA-Ex vs. LOLA-Ex, with the exact gradient. d) plots the normalised
discounted return for both agents in NL-PG vs. NL-PG and LOLA-PG vs. LOLA-PG,
with policy gradient approximation. We see that NL-Ex leads to DD, resulting in an
average reward of ca. −2. In contrast, the LOLA-Ex agents play tit-for-tat in b): When in
the last move agent 1 defected and agent 2 cooperated (DC, green points), most likely in
the next move agent 1 will cooperate and agent 2 will defect, indicated by a concentration
of the green points in the bottom right corner. Similarly, the yellow points (CD), are
concentrated in the top left corner. While the results for the NL-PG and LOLA-PG with
policy gradient approximation are more noisy, they are qualitatively similar. Best viewed
in colour.
8.3 Methods
In this section, we review the naive learner’s strategy and introduce the LOLA
learning rule. We first derive the update rules when agents have access to exact
gradients and Hessians of their expected discounted future return in Sections 8.3.1
and 8.3.2. In Section 8.3.3, we derive the learning rules purely based on policy
gradients, thus removing access to exact gradients and Hessians. This renders LOLA
suitable for DMARL. However, we still assume agents have access to opponents’
policy parameters in policy gradient-based LOLA. Next, in Section 8.3.4, we
incorporate opponent modelling into the LOLA learning rule, such that each LOLA
agent only infers the opponent’s policy parameter from experience. Finally, we
discuss higher order LOLA in Section 8.3.5.
For simplicity, here we assume the number of agents is n = 2 and display the
update rules for agent 1 only. The same derivation holds for arbitrary num-
bers of agents.
8.3.1 Naive Learner
A pair of naive learners iteratively optimise their expected returns against the current parameters of the other agent:

θ^1_{i+1} = argmax_{θ^1} J^1(θ^1, θ^2_i),    θ^2_{i+1} = argmax_{θ^2} J^2(θ^1_i, θ^2).

In the gradient-based version, each naive learner instead takes a step along its own gradient,

θ^1_{i+1} = θ^1_i + f^1_nl(θ^1_i, θ^2_i),    with f^1_nl(θ^1_i, θ^2_i) = ∇_{θ^1_i} J^1(θ^1_i, θ^2_i) · δ.    (8.3.1)
8.3.2 Learning with Opponent Learning Awareness

A LOLA learner optimises its return under one step of look-ahead of opponent learning: instead of optimising the expected return under the current parameters, J^1(θ^1_i, θ^2_i), a LOLA agent optimises J^1(θ^1_i, θ^2_i + Δθ^2_i), which is the expected return after the opponent updates its policy with one naive learning step, Δθ^2_i. Going forward we drop the subscript i for clarity. Assuming a small Δθ^2, a first-order Taylor expansion results in

J^1(θ^1, θ^2 + Δθ^2) ≈ J^1(θ^1, θ^2) + (Δθ^2)^T ∇_{θ^2} J^1(θ^1, θ^2).    (8.3.2)
The LOLA objective (8.3.2) differs from prior work, e.g., Zhang and Lesser
[147], that predicts the opponent’s policy parameter update and learns a best
response. LOLA learners attempt to actively influence the opponent’s future policy
update, and explicitly differentiate through the ∆θ2 with respect to θ1 . Since LOLA
focuses on this shaping of the learning direction of the opponent, the dependency
of ∇θ2 J 1 (θ1 , θ2 ) on θ1 is dropped during the backward pass. Investigation of
how differentiating through this term would affect the learning outcomes is left
for future work.
By substituting the opponent’s naive learning step,

Δθ^2 = ∇_{θ^2} J^2(θ^1, θ^2) · η,    (8.3.3)

into (8.3.2) and taking the derivative of (8.3.2) with respect to θ^1, we obtain our LOLA learning rule:
θ^1_{i+1} = θ^1_i + f^1_lola(θ^1_i, θ^2_i),
f^1_lola(θ^1, θ^2) = ∇_{θ^1} J^1(θ^1, θ^2) · δ + (∇_{θ^2} J^1(θ^1, θ^2))^T ∇_{θ^1}∇_{θ^2} J^2(θ^1, θ^2) · δη,    (8.3.4)
where the step sizes δ, η are for the first and second order updates. Exact LOLA
and NL agents (LOLA-Ex and NL-Ex) have access to the gradients and Hessians of
{J 1 , J 2 } at the current policy parameters (θi1 , θi2 ) and can evaluate (8.3.1) and
(8.3.4) exactly.
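When J^1 and J^2 are available as differentiable functions of the policy parameters, the exact LOLA update (8.3.4) is convenient to compute with automatic differentiation. The sketch below (PyTorch, illustrative names, not our released code) detaches ∇_{θ^2} J^1 so that, as described above, its dependency on θ^1 is dropped during the backward pass.

```python
import torch

def lola_update_agent1(theta1, theta2, J1, J2, delta=1.0, eta=1.0):
    """One exact LOLA-Ex step for agent 1 (Equation 8.3.4) via autograd.

    theta1, theta2: leaf tensors with requires_grad=True.
    J1, J2:         callables mapping (theta1, theta2) -> scalar expected returns.
    """
    j1 = J1(theta1, theta2)
    grad1_j1, grad2_j1 = torch.autograd.grad(j1, (theta1, theta2), create_graph=True)

    j2 = J2(theta1, theta2)
    grad2_j2 = torch.autograd.grad(j2, theta2, create_graph=True)[0]

    # (grad_{theta2} J1)^T grad_{theta1} grad_{theta2} J2: differentiate the inner
    # product w.r.t. theta1, with grad_{theta2} J1 detached so that its own
    # dependency on theta1 is dropped.
    inner = torch.sum(grad2_j1.detach() * grad2_j2)
    second_order = torch.autograd.grad(inner, theta1)[0]

    return (theta1 + delta * grad1_j1.detach() + delta * eta * second_order).detach()
```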
8.3.3 Learning via Policy Gradient

When agents do not have access to exact gradients or Hessians, we derive the update rules f^1_{nl,pg} and f^1_{lola,pg} based on approximations of the derivatives in (8.3.1) and (8.3.4).
We denote an episode of horizon T as τ = (s_0, u^1_0, u^2_0, r^1_0, r^2_0, ..., s_T, u^1_T, u^2_T, r^1_T, r^2_T) and its corresponding discounted return for agent a at time-step t as

R^a_t(τ) = Σ_{t′=t}^{T} γ^{t′−t} r^a_{t′}.

Given this definition, the expected episodic returns conditioned on the agents’ policies (π^1, π^2), E[R^1_0(τ)] and E[R^2_0(τ)], approximate J^1 and J^2 respectively, as do the gradients and Hessians.
In the policy gradient estimators, b(s_t) is a baseline used for variance reduction. The update rule f^1_{nl,pg} for the policy gradient-based naive learner (NL-PG) is then

f^1_{nl,pg} = ∇_{θ^1} E[R^1_0(τ)] · δ.    (8.3.5)
For the LOLA update, we derive the following estimator of the second-order term in (8.3.4) based on policy gradients. The derivation (omitted) closely resembles the standard proof of the policy gradient theorem, exploiting the fact that agents sample actions independently. We further note that this second-order term is exact in expectation. In Chapter 9 we provide novel tools for algorithmically generating these gradient estimators:

∇_{θ^1} ∇_{θ^2} E[R^2_0(τ)] = E[ Σ_{t=0}^{T} γ^t r^2_t · ( Σ_{l=0}^{t} ∇_{θ^1} log π^1(u^1_l | s_l) ) ( Σ_{l=0}^{t} ∇_{θ^2} log π^2(u^2_l | s_l) )^T ].    (8.3.6)

The policy gradient-based LOLA update is then

f^1_{lola,pg} = ∇_{θ^1} E[R^1_0(τ)] · δ + (∇_{θ^2} E[R^1_0(τ)])^T ∇_{θ^1} ∇_{θ^2} E[R^2_0(τ)] · δη.    (8.3.7)
8.3.4 LOLA with Opponent Modelling

Both versions (8.3.4) and (8.3.7) of LOLA learning assume that each agent has
access to the exact parameters of the opponent. However, in adversarial settings
the opponent’s parameters are typically obscured and have to be inferred from the
opponent’s state-action trajectories. Our proposed opponent modelling is similar to
Figure 8.2: a) the probability of playing heads in the iterated matching pennies (IMP)
at the end of 50 independent training runs for both agents as a function of state under
naive learning NL-Ex. b) the results of LOLA-Ex when using the exact gradients of
the value function. c) the normalised discounted return for both agents in NL-Ex vs.
NL-Ex and LOLA-Ex vs. LOLA-Ex with exact gradient. d) the normalised discounted
return for both agents in NL-PG vs. NL-PG and LOLA-PG vs. LOLA-PG with policy
gradient approximation. We can see in a) that NL-Ex results in near deterministic
strategies, indicated by the accumulation of points in the corners. These strategies are
easily exploitable by other deterministic strategies leading to unstable training and high
variance in the reward per step in c). In contrast, LOLA agents learn to play the only Nash
strategy, 50%/50%, leading to low variance in the reward per step. One interpretation
is that LOLA agents anticipate that exploiting a deviation from Nash increases their
immediate return, but also renders them more exploitable by the opponent’s next learning
step. Best viewed in colour.
behavioural cloning [150, 151]. Instead of accessing agent 2's true policy parameters θ^2, agent 1 models the opponent's behaviour with θ̂^2, where θ̂^2 is estimated from agent 2's trajectories using maximum likelihood:

θ̂^2 = argmax_{θ^2} Σ_t log π_{θ^2}(u^2_t | s_t).
Then, θ̂2 replaces θ2 in the LOLA update rule, both for the exact version (8.3.4)
using the value function and the gradient based approximation (8.3.7). We compare
the performance of policy-gradient based LOLA agents (8.3.7) with and without
opponent modelling in our experiments. In particular we can obtain θ̂2 using the
past action-observation history. In our experiments we incrementally fit to the most
recent data in order to address the nonstationarity of the opponent.
8.3.5 Higher-Order LOLA

Because it substitutes the naive learning rule (8.3.3) into the LOLA objective (8.3.2), the LOLA learning rule presented so far assumes that the opponent is a naive learner. We
call this setting first-order LOLA, which corresponds to the first-order learning
rule of the opponent agent. However, we can also consider a higher-order LOLA
agent that assumes the opponent applies a first-order LOLA learning rule, thus
replacing (8.3.3). This leads to third-order derivatives in the learning rule. While the third-order terms are typically difficult to estimate with policy gradient methods, due to high variance, they are tractable when the exact value function is available. We examine the benefits of higher-order LOLA in our experiments.
8.4 Experimental Setup

In this section, we summarise the settings in which we compare the learning behaviour of NL and LOLA agents. The first setting (Sec. 8.4.1) consists of two classic infinitely iterated games, the iterated prisoner's dilemma (IPD) [152] and iterated matching pennies (IMP) [153]. Each round in these two environments requires a single action from each agent, and we can obtain the discounted future return of each player given both players' policies, which leads to exact policy updates for NL and LOLA agents. The second setting (Sec. 8.4.2) is the Coin Game, a more difficult two-player environment in which each round requires the agents to take a sequence of actions and the exact discounted future reward cannot be calculated. The policy of each player is parameterised with a deep recurrent neural network.
In the policy gradient experiments with LOLA, we assume off-line learning, i.e.,
agents play many (batch-size) parallel episodes using their latest policies. Policies
remain unchanged within each episode, with learning happening between episodes.
One setting in which this kind of offline learning naturally arises is when policies
are trained on real-world data. For example, in the case of autonomous cars, the
data from a fleet of cars is used to periodically train and dispatch new policies.
8.4.1 Iterated Games

We first review the two iterated games, the IPD and IMP, and explain how we can model iterated games as a memory-1 two-agent MDP.
Table 8.1: Per-step payoff matrix of the prisoner's dilemma.

            C           D
  C     (−1, −1)    (−3, 0)
  D     (0, −3)     (−2, −2)
Table 8.1 shows the per-step payoff matrix of the prisoner’s dilemma. In a
single-shot prisoner’s dilemma, there is only one Nash equilibrium [154], where
both agents defect. In the infinitely iterated prisoner’s dilemma, the folk theorem
[155] shows that there are infinitely many Nash equilibria. Two notable ones are
the always defect strategy (DD), and tit-for-tat (TFT). In TFT each agent starts
out with cooperation and then repeats the previous action of the opponent. The
average returns per step in self-play are −1 and −2 for TFT and DD respectively.
Matching pennies [156] is a zero-sum game, with per-step payouts shown in
Table 8.2. This game only has a single mixed strategy Nash equilibrium which
is both players playing 50%/50% heads / tails.
Table 8.2: Per-step payoff matrix of matching pennies.

             Head        Tail
  Head    (+1, −1)    (−1, +1)
  Tail    (−1, +1)    (+1, −1)
Agents in both the IPD and IMP can condition their actions on past history.
Agents in an iterated game are endowed with a memory of length K, i.e., the agents
act based on the results of the last K rounds. Press and Dyson [157] proved that
agents with a good memory-1 strategy can effectively force the iterated game to be
played as memory-1. Thus, we consider memory-1 iterated games in our work.
We can model the memory-1 IPD and IMP as a two-agent MDP, where the state
at time 0 is empty, denoted as s0 , and at time t ≥ 1 is both agents’ actions from t−1:
Each agent's policy is fully specified by 5 probabilities. For agent a in the case of the IPD, they are the probability of cooperation at game start, π^a(C|s_0), and the cooperation probabilities in the four memory states: π^a(C|CC), π^a(C|CD), π^a(C|DC), and π^a(C|DD).
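Given these 5 probabilities per agent, the exact discounted return is a geometric series over the Markov chain induced on the four memory states. The sketch below is our own illustrative code for the IPD payoffs of Table 8.1, with a discount factor chosen arbitrarily for the example.

```python
import numpy as np

def ipd_exact_returns(p, q, gamma=0.96):
    """Normalised exact discounted returns for memory-1 IPD policies.

    p, q: length-5 arrays of cooperation probabilities for agents 1 and 2 in the
          states (s0, CC, CD, DC, DD), each written from that agent's own view.
    """
    # Joint-state order (agent 1 action, agent 2 action): CC, CD, DC, DD.
    r1 = np.array([-1.0, -3.0, 0.0, -2.0])
    r2 = np.array([-1.0, 0.0, -3.0, -2.0])

    def joint(pc1, pc2):
        return np.array([pc1 * pc2, pc1 * (1 - pc2),
                         (1 - pc1) * pc2, (1 - pc1) * (1 - pc2)])

    # Agent 2 sees the previous joint action with the roles swapped (CD <-> DC).
    P = np.stack([joint(p[s], q[[1, 3, 2, 4][s - 1]]) for s in (1, 2, 3, 4)])
    p0 = joint(p[0], q[0])                       # distribution over the first joint action
    M = p0 @ np.linalg.inv(np.eye(4) - gamma * P)
    return (1 - gamma) * (M @ r1), (1 - gamma) * (M @ r2)

# Tit-for-tat for both agents: cooperate first, then copy the opponent's last move.
tft = np.array([1.0, 1.0, 0.0, 1.0, 0.0])        # (s0, CC, CD, DC, DD)
print(ipd_exact_returns(tft, tft))               # approximately (-1, -1) per step
```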
Table 8.3: We summarise results for NL vs. NL and LOLA vs. LOLA settings with either exact gradient evaluation (-Ex) or policy gradient approximation (-PG). Shown is the probability of agents playing TFT and Nash for the IPD and IMP respectively, as well as the average reward per step, R, and standard deviation (std) at the end of training for 50 training runs.

                    IPD                      IMP
              %TFT    R (std)          %Nash    R (std)
  NL-Ex.      20.8    −1.98 (0.14)      0.0     0 (0.37)
  LOLA-Ex.    81.0    −1.06 (0.19)     98.8     0 (0.02)
  NL-PG       20.0    −1.98 (0.00)     13.2     0 (0.19)
  LOLA-PG     66.4    −1.17 (0.34)     93.2     0 (0.06)
8.4.2 Coin Game

Next, we study LOLA in a more high-dimensional setting called Coin Game. This
is a sequential game and the agents’ policies are parameterised as deep neural
networks. Coin Game was first proposed by Lerer and Peysakhovich [139] as a
higher dimensional alternative to the IPD with multi-step actions. As shown in
Figure 8.3, in this setting two agents, ‘red’ and ‘blue’, are tasked with collecting coins.
The coins are either blue or red, and appear randomly on the grid-world. A new
coin with random colour and random position appears after the last one is picked
up. Agents pick up coins by moving onto the position where the coin is located.
While every agent receives a point for picking up a coin of any colour, whenever an agent picks up a coin of a different colour, the other agent loses 2 points.
As a result, if both agents greedily pick up any coin available, they receive 0
points in expectation. Since the agents' policies are parameterised as a recurrent neural network, one cannot obtain the future discounted reward as a function of both agents' policies in closed form; policy gradient-based learning is therefore applied in this setting.
Figure 8.3: In the Coin Game, two agents, ‘red’ and ‘blue’, get 1 point for picking up
any coin. However, the ‘red agent’ loses 2 points when the ‘blue agent’ picks up a red
coin and vice versa. Effectively, this is a world with an embedded social dilemma where
cooperation and defection are temporally extended.
Figure 8.4: Normalised returns of a round-robin tournament on the IPD (left) and
IMP (right). LOLA-Ex agents achieve the best performance in the IPD and are within
error bars for IMP. Shading indicates a 95% confidence interval of the error of the mean.
Baselines from [129]: naive Q-learner (NL-Q), joint-action Q-learner (JAL-Q), policy
hill-climbing (PHC), and “Win or Learn Fast” (WoLF).
For the tournament, we use baseline algorithms and the corresponding hyperpa-
rameter values as provided in the literature [129]. The tournament is played in a
round-robin fashion between all pairs of agents for 1000 episodes, 200 steps each.
8.5 Results
In this section, we summarise the experimental results. We denote LOLA and
naive agents with exact policy updates as LOLA-Ex and NL-Ex respectively. We
abbreviate LOLA and naive agents with policy gradient updates as LOLA-PG and
NL-PG. We aim to answer the following questions:
2. Using policy gradient updates instead, how do LOLA-PG agents and NL-PG
agents behave?
5. Does LOLA-PG maintain its behaviour when replacing access to the exact
parameters of the opponent agent with opponent modelling?
6. Can LOLA agents be exploited by using higher order gradients, i.e., does
LOLA lead to an arms race of ever higher order corrections, or is LOLA vs.
LOLA stable?
We answer the first three questions in Sec. 8.5.1, the next two questions in Sec. 8.5.2,
and the last one in Sec. 8.5.3.
We first compare the behaviours of LOLA agents with NL agents, with either exact
policy updates or policy gradient updates.
Figures 8.1a) and 8.1b) show the policy for both agents at the end of training
under NL-Ex and LOLA-Ex when the agents have access to exact gradients and
Hessians of {J 1 , J 2 }. Here we consider the settings of NL-Ex vs. NL-Ex and
LOLA-Ex vs. LOLA-Ex. We study mixed learning of one LOLA-Ex agent vs.
an NL-Ex agent in Section 8.5.3. Under NL-Ex, the agents learn to defect in all
states, indicated by the accumulation of points in the bottom left corner of the plot.
However, under LOLA-Ex, in most cases the agents learn TFT. In particular agent 1
cooperates in the starting state s0 , CC and DC, while agent 2 cooperates in s0 , CC
and CD. As a result, Figure 8.1c) shows that the normalised discounted reward1 is
close to −1 for LOLA-Ex vs. LOLA-Ex, corresponding to TFT, while NL-Ex vs.
NL-Ex results in a normalised discounted reward of −2, corresponding to the fully
defective (DD) equilibrium. Figure 8.1d) shows the normalised discounted reward
for NL-PG and LOLA-PG where agents learn via policy gradient. LOLA-PG also
demonstrates cooperation while agents defect in NL-PG.
We conduct the same analysis for IMP in Figure 8.2. In this game, under naive
learning the agents’ strategies fail to converge. In contrast, under LOLA the agents’
policies converge to the only Nash equilibrium, playing 50%/50% heads/tails.
Table 8.3 summarises the numerical results comparing LOLA with NL agents in
both the exact and policy gradient settings in the two iterated game environments.
In the IPD, LOLA agents learn policies consistent with TFT at a much higher
rate than NL agents (81.0% vs. 20.8% in the exact setting), and obtain correspondingly
higher average rewards per step.

¹ We use the following definition for the normalised discounted reward: (1 − γ) Σ_{t=0}^{T} γ^t r_t.
Figure 8.5: The percentage of all picked up coins that match in colour (left) and the
total points obtained per episode (right) for a pair of naive learners using policy gradient
(NL-PG), LOLA-agents (LOLA-PG), and a pair of LOLA-agents with opponent modelling
(LOLA-OM). Also shown is the standard error of the mean (shading), based on 30 training
runs. While LOLA-PG and LOLA-OM agents learn to cooperate, LOLA-OM is less
stable and obtains lower returns than LOLA-PG. Best viewed in colour.
Removing the assumption that agents can access the exact parameters of
opponents, we examine LOLA agents with opponent modelling (Section 8.3.4).
Figure 8.5 demonstrates that without access to the opponent’s policy parameters,
LOLA agents with opponent modelling pick up coins of their own colour around
60% of the time, inferior to the performance of LOLA-PG agents. We emphasise
that with opponent modelling neither agent can recover the exact policy parameters
of the opponent, since there is a large amount of redundancy in the neural network
parameters. For example, each agent could permute the weights of their fully
connected layers. Opponent modelling introduces noise in the opponent agent’s
policy parameters, thus increasing the variance and bias of the gradients (8.3.7)
during policy updates, which leads to inferior performance of LOLA-OM vs. LOLA-
PG in Figure 8.5.
In this section we address the exploitability of the LOLA learning rule. We consider
the IPD, where one can calculate the exact value function of each agent given
the policies. Thus, we can evaluate the higher-order LOLA terms. We pitch an
NL-Ex or LOLA-Ex agent against NL-Ex, LOLA-Ex, and a 2nd-order LOLA agent.
We compare the normalised discounted return of each agent in all settings and
address the question of whether there is an arms race to incorporate ever higher
orders of LOLA correction terms.
Table 8.4 shows that a LOLA-Ex learner can achieve higher payouts against
NL-Ex. Thus, there is an incentive for either agent to switch from naive learning to
first order LOLA. Furthermore, two LOLA-Ex agents playing against each other
both receive higher normalised discounted reward than a LOLA-Ex agent playing
against a NL-Ex. This makes LOLA a dominant learning rule in the IPD compared
to naive learning. We further find that 2nd-order LOLA provides no incremental
gains when playing against a LOLA-Ex agent, leading to a reduction in payouts
for both agents. These experiments were carried out with a δ of 0.5. While it is
beyond the scope of this work to prove that LOLA vs LOLA is a dominant learning
rule in the space of gradient-based rules, these initial results are encouraging.
Table 8.4: Higher-order LOLA results on the IPD. A LOLA-Ex agent obtains higher
normalised return compared to an NL-Ex agent. However, in this setting there is no
incremental gain from using higher-order LOLA in order to exploit another LOLA agent
in the IPD. In fact, both agents do worse with 2nd order LOLA (incl. 3rd order
corrections).
Contents
9.1 Introduction    149
9.2 Background    152
    9.2.1 Stochastic Computation Graphs    152
    9.2.2 Surrogate Losses    153
9.3 Higher Order Gradients    153
    9.3.1 Higher Order Gradient Estimators    154
    9.3.2 Higher Order Surrogate Losses    154
    9.3.3 Simple Failing Example    156
9.4 Correct Gradient Estimators with DiCE    158
9.5 Case Studies    163
9.6 Related Work    167
9.7 Conclusion & Future Work    168
9.1 Introduction
Moreover, to match the first order gradient after a single differentiation, the SL
treats part of the cost as a fixed sample, severing the dependency on the parameters.
We show that this yields missing and incorrect terms in higher order gradient
estimators. We believe that these difficulties have limited the usage and exploration
of higher order methods in reinforcement learning tasks and other application
areas that may be formulated as SCGs.
Therefore, we propose a novel technique, the Infinitely Differentiable Monte-
Carlo Estimator (DiCE), to address all these shortcomings. DiCE constructs a
single objective that evaluates to an estimate of the original objective, but can
also be differentiated repeatedly to obtain correct gradient estimators of any order.
Unlike the SL approach, DiCE relies on auto-diff as implemented for instance in
TensorFlow [162] or PyTorch [163] to automatically perform the complex graph
manipulations required for these higher order gradient estimators.
DiCE uses a novel operator, MagicBox (□), which wraps around the set of
those stochastic nodes W_c that influence each of the original losses in an SCG.
Upon differentiation, this operator generates the correct gradients associated with
the sampling distribution:

∇_θ □(W_c) = □(W_c) Σ_{w ∈ W_c} ∇_θ log(p(w; θ)).

MagicBox can be implemented straightforwardly as □(W) = exp(τ − ⊥(τ)), with τ = Σ_{w ∈ W} log(p(w; θ)),
where ⊥ is an operator that sets the gradient of the operand to zero, so ∇x ⊥(x) = 0.
In addition, we show how to use a baseline for variance reduction in our formulation.
We verify the correctness of DiCE both through a proof and through numerical
evaluation of the DiCE gradient estimates. We also open-source our code in
TensorFlow. Last not least, to demonstrate the utility of DiCE, we also propose
a novel approach for LOLA. Using DiCE we can express LOLA in a formulation
that closely resembles meta-learning. We hope this powerful and convenient novel
objective will unlock further exploration and adoption of higher order learning
methods in meta-learning, reinforcement learning, and other applications of SCGs.
9.2 Background
Suppose x is a random variable, x ∼ p(x; θ), f is a function of x, and we want to
compute ∇_θ E_x[f(x)]. If the analytical gradients ∇_θ f are unavailable or nonexistent,
we can employ the score function (SF) estimator:

∇_θ E_x[f(x)] = E_x[f(x) ∇_θ log(p(x; θ))].
Gradient estimators for single random variables can be generalised using the
formalism of a stochastic computation graph [SCG, 31]. An SCG is a directed
acyclic graph with four types of nodes: input nodes, Θ; deterministic nodes, D;
cost nodes, C; and stochastic nodes, S. Input nodes are set externally and can hold
parameters we seek to optimise. Deterministic nodes are functions of their parent
nodes, while stochastic nodes are distributions conditioned on their parent nodes.
The set of cost nodes C are those associated with an objective L = E[Σ_{c∈C} c].
Let v ≺ w denote that node v influences node w, i.e., there exists a path in the
graph from v to w. If every node along the path is deterministic, v influences w
deterministically which is denoted by v ≺D w. See Figure 9.1 (top) for a simple
SCG with an input node θ, a stochastic node x and a cost function f . Note that θ
influences f deterministically (θ ≺D f ) as well as stochastically via x (θ ≺ f ).
The surrogate loss (SL) of Schulman et al. [31] is defined as

SL(Θ, S) := Σ_{w ∈ S} log(p(w; deps_w)) Q̂_w + Σ_{c ∈ C} ĉ(deps_c).

Here deps_w are the “dependencies” of w: the set of stochastic or input nodes that
deterministically influence the node w. Furthermore, Q̂_w is the sum of sampled
costs ĉ corresponding to the cost nodes influenced by w.
The SL produces a gradient estimator when differentiated once [31, Corollary 1]: ∇_θ L = E[∇_θ SL(Θ, S)].
The hat notation on Q̂w indicates that, inside the SL, these costs are treated
as fixed samples, thus severing the functional dependency on θ that was present
in the original stochastic computation graph. This ensures that the first order
gradients of the SL match the score function estimator, which does not contain
a term of the form log(p)∇θ Q.
Although Schulman et al. [31] focus on first order gradients, they argue that the
SL gradient estimates themselves can be treated as costs in an SCG and that the SL
approach can be applied repeatedly to construct higher order gradient estimators.
However, the use of sampled costs in the SL leads to missing dependencies and
wrong estimates when calculating such higher order gradients, as we discuss in
Section 9.3.2.
In this section, we illustrate how to estimate higher order gradients via repeated
application of the score function (SF) trick and show that repeated application of
the surrogate loss (SL) approach in stochastic computation graphs (SCGs) fails to
capture all of the relevant terms for higher order gradient estimates.
We begin by revisiting the derivation of the score function estimator for the gradient
of the expectation L of f (x; θ) over x ∼ p(x; θ):
∇_θ L = ∇_θ E_x[f(x; θ)]
      = ∇_θ Σ_x p(x; θ) f(x; θ)
      = Σ_x [ f(x; θ) ∇_θ p(x; θ) + p(x; θ) ∇_θ f(x; θ) ]
      = Σ_x [ f(x; θ) p(x; θ) ∇_θ log(p(x; θ)) + p(x; θ) ∇_θ f(x; θ) ]
      = E_x[g(x; θ)].
The estimator g(x; θ) of the gradient of Ex [f (x; θ)] consists of two distinct terms:
(1) The term f (x; θ)∇θ log(p(x; θ)) originating from f (x; θ)∇θ p(x; θ) via the SF
trick, and (2) the term ∇θ f (x; θ), due to the direct dependence of f on θ. The
second term is often ignored because f is often only a function of x but not of
θ. However, even in that case, the gradient estimator g depends on both x and θ.
We might be tempted to again apply the SL approach to ∇θ Ex [g(x; θ)] to produce
estimates of higher order gradients of L, but below we demonstrate that this fails. In
Section 9.4, we subsequently introduce a practical algorithm for correctly producing
such higher order gradient estimators in SCGs.
While Schulman et al. [31] focus on the first order gradients, they state that a
recursive application of SL can generate higher order gradient estimators. However,
as we demonstrate in this section, because the SL approach treats part of the
objective as a sampled cost, the corresponding terms lose a functional dependency
on the sampling distribution. This leads to missing terms in the estimators of
the higher order gradients.
Figure 9.1: Simple example illustrating the difference of the Surrogate Loss (SL)
approach to DiCE. Stochastic nodes are depicted in orange, costs in gray, surrogate losses
in blue, DiCE in purple, and gradient estimators in red. Note that for second-order
gradients, SL requires the construction of an intermediate stochastic computation graph
and due to taking a sample of the cost ĝSL , the dependency on θ is lost, leading to an
incorrect second-order gradient estimator. Arrows from θ, x and f to gradient estimators
omitted for clarity.
The corresponding SCG is depicted at the top of Figure 9.1. Comparing (9.3.1)
and (9.3.2), note that the first term, f̂(x), has lost its functional dependency on
θ, as indicated by the hat notation and the lack of a θ argument. While these
terms evaluate to the same estimate of the first order gradient, the lack of the
functional dependency yields a discrepancy between the exact derivation of the
second order gradient and the estimate obtained by differentiating the SL-based estimator again.
Since gSL (x; θ) differs from g(x; θ) only in its functional dependencies on θ, gSL
and g are identical when evaluated. However, due to the missing dependencies in
gSL , the gradients w.r.t. θ, which appear in the higher order gradient estimates
in (9.3.3) and (9.3.4), differ:
We lose the term ∇θ f (x; θ)∇θ log(p(x; θ)) in the second order SL gradient because
∇θ fˆ(x) = 0 (see left part of Figure 9.1). This issue occurs immediately in the
second order gradients when f depends directly on θ. However, as g(x; θ) always
depends on θ, the SL approach always fails to produce correct third or higher order
gradient estimates even if f depends only indirectly on θ.
Here is a toy example to illustrate a possible failure case. Let x ∼ Ber(θ) and
f(x, θ) = x(1 − θ) + (1 − x)(1 + θ). For this simple example we can compute the
objective and its derivatives exactly:
L = θ(1 − θ) + (1 − θ)(1 + θ) = −2θ² + θ + 1,
∇_θ L = −4θ + 1,
∇²_θ L = −4,
(∇²_θ L)_SL = −2.
Even with an infinite number of samples, the SL estimator produces the wrong second
order gradient. If, for example, these wrong estimates were used in combination
with the Newton-Raphson method for optimising L, then θ would never converge
to the correct value. In contrast, this method would converge in a single step
using the correct gradients.
The failure mode seen in this toy example will appear whenever the objective
includes a regularisation term that depends on θ, and is also impacted by the
stochastic samples. One example in a practical algorithm is soft Q-learning for RL
[164], which regularises the policy by adding an entropy penalty to the rewards.
This penalty encourages the agent to maintain an exploratory policy, reducing the
probability of getting stuck in local optima. Clearly the penalty depends on the
policy parameters θ. However, the policy entropy will also depend on the states
visited, which in turn depend on the stochastically sampled actions. As a result,
the entropy regularised RL objective in this algorithm will have the exact property
leading to the failure of the SL approach shown above. Unlike our toy analytic
example, the consequent errors will not just appear as a rescaling of the proper
higher order gradients, but will depend in a complex way on the parameters θ.
Any second order methods with such a regularised objective will therefore require
an alternative strategy for generating gradient estimators, even setting aside the
awkwardness of repeatedly generating new surrogate objectives.
An expression for a gradient estimator that preserves all required dependencies for further
differentiation is g(x; θ) = f(x; θ) ∇_θ log(p(x; θ)) + ∇_θ f(x; θ). In contrast, the SL
approach severs the functional dependency of the sampled costs on θ in order to construct
its first order gradient estimators. As described in Section 9.2.2, this was done so that the
SL approach reproduces the score function estimator after a single differentiation
and can thus be used as an objective for backpropagation in a deep learning library.
To support correct higher order gradient estimators, we propose DiCE, which
relies heavily on a novel operator, MagicBox (□). MagicBox takes a set of
stochastic nodes W as input and has the following two properties by design:

1. □(W) always evaluates to 1,
2. ∇_θ □(W) = □(W) Σ_{w ∈ W} ∇_θ log(p(w; θ)).

Using MagicBox, the DiCE objective is defined as L_□ = Σ_{c ∈ C} □(W_c) c, where W_c
is the set of stochastic nodes that influence the cost node c.
Below we prove that the DiCE objective indeed produces correct arbitrary order
gradient estimators under differentiation.
We define c_0 = c and let c_{n+1} denote the estimator obtained by applying the score
function trick (9.4.1) to L = E[c_n], so that E[c_{n+1}] = ∇_θ E[c_n].
By induction it follows that E[c_n] = ∇^n_θ E[c] for all n, i.e., that c_n is an estimator of the
nth order derivative of the objective E[c].
∇_θ (c_n □(W_{c_n})) = c_n □(W_{c_n}) Σ_{w ∈ W_{c_n}} ∇_θ log(p(w; θ)) + □(W_{c_n}) ∇_θ c_n
                     = □(W_{c_n}) ( c_n Σ_{w ∈ W_{c_n}} ∇_θ log(p(w; θ)) + ∇_θ c_n )        (9.4.4)
                     = □(W_{c_{n+1}}) c_{n+1}.        (9.4.5)
To proceed from (9.4.4) to (9.4.5), we need two additional steps. First, we require
an expression for cn+1 . Substituting L = E[cn ] into (9.4.1) and comparing to (9.4.3),
we find the following map from cn to cn+1 :
c_{n+1} = c_n Σ_{w ∈ W_{c_n}} ∇_θ log(p(w; θ)) + ∇_θ c_n.        (9.4.6)
The term inside the brackets in (9.4.4) is identical to cn+1 . Secondly, note that
(9.4.6) shows that cn+1 depends only on cn and Wcn . Therefore, the stochastic nodes
which influence cn+1 are the same as those which influence cn . So Wcn = Wcn+1 ,
and we arrive at (9.4.5).
To conclude the proof, recall that c_n is the estimator for the nth derivative of c,
and that ∇^n_θ (□(W_c) c) = □(W_{c_n}) c_n, which evaluates to c_n. Summing over c ∈ C then gives the desired result.
(W) = exp τ − ⊥(τ ) ,
τ= log(p(w; θ)),
X
w∈W
where ⊥ is an operator that sets the gradient of the operand to zero, so ∇_x ⊥(x) = 0.¹
Under differentiation, MagicBox then behaves as required:

∇_θ □(W) = ∇_θ exp(τ − ⊥(τ)) = □(W)(∇_θ τ + 0) = □(W) Σ_{w ∈ W} ∇_θ log(p(w; θ)).
E[∇^n_θ J_□] = ∇^n_θ J, so J_□ can both be used to estimate the return and to produce
estimators for any order gradients under auto-diff, which can be used for higher
order methods such as TRPO [165].
Causality. In Theorem 1 of Schulman et al. [31], two expressions for the
gradient estimator are provided:

1. The first expression sums over the stochastic nodes, w, and multiplies each
gradient of a log-probability, ∇ log(p(w)), with the sum of the sampled costs
downstream of w, Q̂_w. We can think of this as forward looking causality.

2. In contrast, the second expression sums over costs, c, and multiplies each cost
with a sum over the gradients of log-probabilities from upstream stochastic
nodes, Σ_{w ∈ W_c} ∇ log(p(w)). We can think of this as backward looking causality.
In both cases, integrating causality into the gradient estimator leads to a reduction
of variance compared to the naive approach of multiplying the full sum over costs
with the full sum over grad-log-probabilities.
¹ This operator exists in PyTorch as detach and in TensorFlow as stop_gradient.
While the SL approach is based on the first expression, DiCE uses the second
formulation. Both yield the same terms for the gradient estimator. However, the second formulation leads to
greatly reduced conceptual complexity when calculating higher order terms, which
we exploit in the definition of the DiCE objective. This is because each further
gradient estimator maintains the same backward looking dependencies for each term
in the original sum over costs, i.e., W_{c_n} = W_{c_{n+1}}. In contrast, the SL approach is
centred around the stochastic nodes, which each become associated with a growing
number of downstream costs after every differentiation. We also believe
that our DiCE objective is more intuitive, as it is conceptually centred around the
original objective.

So far, the DiCE objective addresses variance reduction only via implementing causality, since each
cost term is associated with the □ that captures all of its causal dependencies. However,
we can also include a baseline term in the definition of the DiCE objective:
L_□ = Σ_{c ∈ C} □(W_c) c + Σ_{w ∈ S} (1 − □({w})) b_w.
The baseline bw is a design choice and can be any function of nodes not influenced
by w. As long as this condition is met, the baseline will not change the expectation
of the gradient estimates, but can considerably reduce the variance, including the
variance of the higher order gradient estimators. Since (1 − □({w})) evaluates to 0,
the addition of the baseline leaves the evaluation of the DiCE objective itself unchanged.

Hessian-vector products. Using DiCE, v^⊤H can be implemented efficiently without having to compute the full Hessian:
(Figure: stochastic computation graph of the RL objective, with states s_1, …, s_t, actions u_1, …, u_t, and rewards r_1, …, r_t.)
v^⊤H = v^⊤ ∇²L_□
     = v^⊤ (∇^⊤ ∇L_□)
     = ∇^⊤ (v^⊤ ∇L_□).

In particular, (v^⊤ ∇L_□) is a scalar, making this implementation well suited for auto-diff.
Figure 9.3: For each of the two agents (1 top row, 2 bottom row) in the iterated
prisoner’s dilemma, shown is the flattened true (red) and estimated (green) Gradient
(left) and Hessian (right) using the first and second derivative of DiCE and the exact
value function respectively. The correlation coefficients are 0.999 for the gradients and
0.97 for the Hessian; the sample size is 100k.
Figure 9.4: Shown in (a) is the correlation of the gradient estimator (averaged across
agents) as a function of the estimation error of the baseline when using a sample size of
128 and in (b) as a function of sample size when using a converged baseline (in blue) and
no baseline (in green). In both plots error bars indicate the standard deviation.
Figure 9.5: Joint average per step returns for different training methods. (a) Agents
naively optimize expected returns w.r.t. their policy parameters only, without lookahead
steps. (b) The original LOLA algorithm, see Chapter 8, that uses gradient corrections.
(c) LOLA-DiCE with lookahead of up to 3 gradient steps. Shaded areas represent the
95% confidence intervals based on 5 runs. All agents used batches of size 64, which is
more than 60 times smaller than the size required in the original LOLA method.
Since the standard policy gradient learning step for one agent has no dependency
on the parameters of the other agent (which it treats as part of the environment),
LOLA relies on a Taylor expansion of the expected return in combination with
an analytical derivation of the second order gradients to be able to differentiate
through the expected return after the opponent’s learning step.
In this chapter we take a more direct approach, made possible by DiCE. Let
πθ1 be the policy of the LOLA agent and let πθ2 be the policy of its opponent and
vice versa. Assuming that the opponent learns using policy gradients, LOLA-DiCE
agents learn by directly optimising the following stochastic objective w.r.t. θ1 :
L^1_LOLA(θ_1, θ_2) = E_{π_{θ_1}, π_{θ_2 + Δθ_2(θ_1, θ_2)}} [ L^1 ],   where
Δθ_2(θ_1, θ_2) = α ∇_{θ_2} E_{π_{θ_1}, π_{θ_2}} [ L^2 ].        (9.5.1)

Here α is a scalar step size and L^a = Σ_{t=0}^{T} γ^t r^a_t is the sum of discounted rewards
for agent a. To follow the SCG formalism, here and for the rest of the case
study we use L^a rather than J^a to describe the return.
To evaluate these terms directly, our variant of LOLA unrolls the learning
process of the opponent, which is functionally similar to model-agnostic meta-
learning [MAML, 159]. In the MAML formulation, the gradient update of the
opponent, ∆θ2 , corresponds to the inner loop (typically training objective) and the
gradient update of the agent itself to the outer loop (typically test objective).
Using the following DiCE objective to estimate gradient steps for agent a, we
are able to preserve all dependencies:

L^a_□(θ_1, θ_2) = Σ_t □( { u^{a'}_{t' ≤ t}, a' ∈ {1, 2} } ) γ^t r^a_t,        (9.5.2)

where { u^{a'}_{t' ≤ t}, a' ∈ {1, 2} } is the set of all actions taken by both agents up to time t. We note
that for computational reasons, we cache the ∆θa of the inner loop when unrolling
the outer loop policies in order to avoid recalculating them at every time step.
Importantly, using DiCE, differentiating through ∆θ2 produces the correct
higher order gradients, which is vital for LOLA to function. In contrast, simply
differentiating through the SL-based first order gradient estimator again, as was
done for MAML by Finn, Abbeel, and Levine [159], results in omitted gradient
terms and a biased gradient estimator, as pointed out by Al-Shedivat et al. [160]
and Stadie et al. [167].
Figure 9.5 shows a comparison between the LOLA-DiCE agents and the original
formulation of LOLA. In our experiments, we use a time horizon of 150 steps and a
reduced batch size of 64; the lookahead gradient step, α, is set to 1 and the learning
rate is 0.3. Importantly, given the approximation used, the LOLA method was
restricted to a single step of opponent learning. In contrast, using DiCE we can
differentiate through an arbitrary number of opponent learning steps.
The original LOLA implemented via second order gradient corrections shows
no stable learning, as it requires much larger batch sizes (∼ 4000). In contrast,
LOLA-DiCE agents discover strategies of high social welfare, replicating the results
of the original LOLA method in a way that is both more direct and more efficient, and that
establishes a common formulation between MAML and LOLA.
9.6 Related Work

Finn, Abbeel, and Levine [159] propose model-agnostic meta-learning (MAML), which optimises
a loss after a number of policy gradient learning steps, differentiating through the
learning step to find parameters that can be quickly fine-tuned for different tasks. Li
et al. [161] extend this work to also meta-learn the fine-tuning step direction
and magnitude.
Al-Shedivat et al. [160] and Stadie et al. [167] derive the proper higher order
gradient estimators for their work by reapplying the score function trick. We also
presented an analytical derivation of the higher order gradient estimator in the
previous section. No previous research presents a general strategy for constructing
higher order gradient estimators for arbitrary graphs.
10 Afterword

A number of open challenges remain for deep multi-agent RL, some of which are outlined below.
• Many real world settings are both partially observable, general-sum and involve
opponents with unknown or partially known rewards and policies. In these
settings accounting for the learning behaviour of other agents is a difficult and
relevant challenge. Applications include online marketplaces and financial
markets, where different participants interact, each aiming to maximise their own
reward. However, there are currently no promising approaches that extend
LOLA to these settings.
• The partially observable, general sum setting also relates to the question
of strategic communication: Humans commonly communicate with others
in settings of partial common interest, for example in order to formulate a
compromise, form a coalition or to negotiate an agreement. Currently this is a
vastly under-explored area in the context of agents that learn to communicate.
• While we show proof-of-principle results for agents that discover their own
communication protocols, these protocols are extremely simplistic and lack
the compositional, structural, and linguistic richness of human language. How
to obtain rich emergent language is thus an important challenge.
• One of the key innovations that advanced machine learning over the last decade
was the move to graphics processing units (GPUs), which allowed for great parallelism
across a number of different compute cores. However, communication constraints
between different computing processes present challenges to further parallelism.
One of the appeals of multi-agent learning is that it might be able to provide a
long term solution to the problem of parallelism: Each agent can in principle be
instantiated on a different datacenter and carry out training in a decentralised
form. These different agents could then use a learned protocol to exchange
important high level information or coordinate whenever it is required.
All of these constitute formidable challenges with the potential to not only keep
scientists engaged for the years and decades to come, but also to profoundly change
the way we think about agency and, ultimately, consciousness.
References
[1] John Ruggles. Locomotive steam-engine for rail and other roads. US Patent 1.
July 1836.
[2] Jeremy Rifkin. The end of work: The decline of the global labor force and the
dawn of the post-market era. ERIC, 1995.
[3] William M Siebert. “Frequency discrimination in the auditory system: Place or
periodicity mechanisms?” In: Proceedings of the IEEE 58.5 (1970), pp. 723–730.
[4] Donald Waterman. “A guide to expert systems”. In: (1986).
[5] Marti A. Hearst et al. “Support vector machines”. In: IEEE Intelligent Systems
and their applications 13.4 (1998), pp. 18–28.
[6] Carl Edward Rasmussen. “Gaussian processes in machine learning”. In: Advanced
lectures on machine learning. Springer, 2004, pp. 63–71.
[7] Yann LeCun et al. “Gradient-based learning applied to document recognition”. In:
Proceedings of the IEEE 86.11 (1998), pp. 2278–2324.
[8] Geoffrey Hinton et al. “Deep neural networks for acoustic modeling in speech
recognition: The shared views of four research groups”. In: IEEE Signal processing
magazine 29.6 (2012), pp. 82–97.
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification
with deep convolutional neural networks”. In: Advances in neural information
processing systems. 2012, pp. 1097–1105.
[10] Brendan Shillingford et al. “Large-scale visual speech recognition”. In: arXiv
preprint arXiv:1807.05162 (2018).
[11] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning
with neural networks”. In: Advances in neural information processing systems.
2014, pp. 3104–3112.
[12] Richard S Sutton. “Learning to predict by the methods of temporal differences”.
In: Machine learning 3.1 (1988), pp. 9–44.
[13] Volodymyr Mnih et al. “Human-level control through deep reinforcement
learning”. In: Nature 518.7540 (2015), pp. 529–533.
[14] David Silver et al. “Mastering the game of Go with deep neural networks and tree
search”. In: Nature 529.7587 (2016), pp. 484–489.
[15] Robin IM Dunbar. “Neocortex size as a constraint on group size in primates”. In:
Journal of human evolution 22.6 (1992), pp. 469–493.
[16] Robert M Axelrod. The evolution of cooperation: revised edition. Basic books,
2006.
[17] Erik Zawadzki, Asher Lipson, and Kevin Leyton-Brown. “Empirically evaluating
multiagent learning algorithms”. In: arXiv preprint arXiv:1401.8074 (2014).
[18] Kagan Tumer and Adrian Agogino. “Distributed agent-based air traffic flow
management”. In: Proceedings of the 6th international joint conference on
Autonomous agents and multiagent systems. ACM. 2007, p. 255.
[19] Lloyd S Shapley. “Stochastic games”. In: Proceedings of the national academy of
sciences 39.10 (1953), pp. 1095–1100.
[20] Frans A. Oliehoek, Matthijs T. J. Spaan, and Nikos Vlassis. “Optimal and
Approximate Q-value Functions for Decentralized POMDPs”. In: 32 (2008),
pp. 289–353.
[21] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction.
Vol. 1. 1. MIT press Cambridge, 1998.
[22] Yoav Shoham, Rob Powers, Trond Grenager, et al. “If multi-agent learning is the
answer, what is the question?” In: Artificial Intelligence 171.7 (2007), pp. 365–377.
[23] John F Nash et al. “Equilibrium points in n-person games”. In: Proceedings of the
national academy of sciences 36.1 (1950), pp. 48–49.
[24] Ian Goodfellow et al. Deep learning. Vol. 1. MIT press Cambridge, 2016.
[25] Ming Tan. “Multi-agent reinforcement learning: Independent vs. cooperative
agents”. In: Proceedings of the tenth international conference on machine learning.
1993, pp. 330–337.
[26] Ardi Tampuu et al. “Multiagent cooperation and competition with deep
reinforcement learning”. In: arXiv preprint arXiv:1511.08779 (2015).
[27] Y. Shoham and K. Leyton-Brown. Multiagent Systems: Algorithmic,
Game-Theoretic, and Logical Foundations. New York: Cambridge University Press,
2009.
[28] Matthew Hausknecht and Peter Stone. “Deep recurrent q-learning for partially
observable mdps”. In: arXiv preprint arXiv:1507.06527 (2015).
[29] Richard S Sutton et al. “Policy gradient methods for reinforcement learning with
function approximation.” In: NIPS. Vol. 99. 1999, pp. 1057–1063.
[30] Ronald J Williams. “Simple statistical gradient-following algorithms for
connectionist reinforcement learning”. In: Machine learning 8.3-4 (1992),
pp. 229–256.
[31] John Schulman et al. “Gradient Estimation Using Stochastic Computation
Graphs”. In: Advances in Neural Information Processing Systems 28: Annual
Conference on Neural Information Processing Systems 2015, December 7-12, 2015,
Montreal, Quebec, Canada. 2015, pp. 3528–3536.
[32] Hajime Kimura, Shigenobu Kobayashi, et al. “An analysis of actor-critic
algorithms using eligibility traces: reinforcement learning with imperfect value
functions”. In: Journal of Japanese Society for Artificial Intelligence 15.2 (2000),
pp. 267–275.
[33] John Schulman et al. “High-Dimensional Continuous Control Using Generalized
Advantage Estimation”. In: CoRR abs/1506.02438 (2015). url:
http://arxiv.org/abs/1506.02438.
[34] Ziyu Wang et al. “Sample Efficient Actor-Critic with Experience Replay”. In:
arXiv preprint arXiv:1611.01224 (2016).
[35] Roland Hafner and Martin Riedmiller. “Reinforcement learning in feedback
control”. In: Machine learning 84.1 (2011), pp. 137–169.
[36] Lex Weaver and Nigel Tao. “The optimal reward baseline for gradient-based
reinforcement learning”. In: Proceedings of the Seventeenth conference on
Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc. 2001,
pp. 538–545.
[37] Vijay R Konda and John N Tsitsiklis. “Actor-Critic Algorithms.” In: NIPS.
Vol. 13. 2000, pp. 1008–1014.
[38] Kyunghyun Cho et al. “On the properties of neural machine translation:
Encoder-decoder approaches”. In: arXiv preprint arXiv:1409.1259 (2014).
[39] Yu-Han Chang, Tracey Ho, and Leslie Pack Kaelbling. “All learning is Local:
Multi-agent Learning in Global Reward Games.” In: NIPS. 2003, pp. 807–814.
[40] Nicolas Usunier et al. “Episodic Exploration for Deep Deterministic Policies: An
Application to StarCraft Micromanagement Tasks”. In: arXiv preprint
arXiv:1609.02993 (2016).
[41] Peng Peng et al. “Multiagent Bidirectionally-Coordinated Nets for Learning to
Play StarCraft Combat Games”. In: arXiv preprint arXiv:1703.10069 (2017).
[42] Lucian Busoniu, Robert Babuska, and Bart De Schutter. “A comprehensive
survey of multiagent reinforcement learning”. In: IEEE Transactions on Systems
Man and Cybernetics Part C Applications and Reviews 38.2 (2008), p. 156.
[43] Erfu Yang and Dongbing Gu. Multiagent reinforcement learning for multi-robot
systems: A survey. Tech. rep. tech. rep, 2004.
[44] Joel Z Leibo et al. “Multi-agent Reinforcement Learning in Sequential Social
Dilemmas”. In: arXiv preprint arXiv:1702.03037 (2017).
[45] Abhishek Das et al. “Learning Cooperative Visual Dialog Agents with Deep
Reinforcement Learning”. In: arXiv preprint arXiv:1703.06585 (2017).
[46] Igor Mordatch and Pieter Abbeel. “Emergence of Grounded Compositional
Language in Multi-Agent Populations”. In: arXiv preprint arXiv:1703.04908
(2017).
[47] Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. “Multi-agent
cooperation and the emergence of (natural) language”. In: arXiv preprint
arXiv:1612.07182 (2016).
[48] Sainbayar Sukhbaatar, Rob Fergus, et al. “Learning multiagent communication
with backpropagation”. In: Advances in Neural Information Processing Systems.
2016, pp. 2244–2252.
[49] Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. “Cooperative
Multi-Agent Control Using Deep Reinforcement Learning”. In: (2017).
[50] Shayegan Omidshafiei et al. “Deep Decentralized Multi-task Multi-Agent RL
under Partial Observability”. In: arXiv preprint arXiv:1703.06182 (2017).
[51] Tabish Rashid et al. “QMIX: Monotonic Value Function Factorisation for Deep
Multi-Agent Reinforcement Learning”. In: Proceedings of The 35th International
Conference on Machine Learning. 2018.
[52] Peter Sunehag et al. “Value-Decomposition Networks For Cooperative
Multi-Agent Learning”. In: arXiv preprint arXiv:1706.05296 (2017).
[53] Ryan Lowe et al. “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive
Environments”. In: arXiv preprint arXiv:1706.02275 (2017).
[54] Danny Weyns, Alexander Helleboogh, and Tom Holvoet. “The packet-world: A
test bed for investigating situated multi-agent systems”. In: Software Agent-Based
Applications, Platforms and Development Kits. Springer, 2005, pp. 383–408.
[55] David H Wolpert and Kagan Tumer. “Optimal payoff functions for members of
collectives”. In: Modeling complexity in economic and social systems. World
Scientific, 2002, pp. 355–369.
[56] Scott Proper and Kagan Tumer. “Modeling difference rewards for multiagent
learning”. In: Proceedings of the 11th International Conference on Autonomous
Agents and Multiagent Systems-Volume 3. International Foundation for
Autonomous Agents and Multiagent Systems. 2012, pp. 1397–1398.
[57] Mitchell K Colby, William Curran, and Kagan Tumer. “Approximating difference
evaluations with local information”. In: Proceedings of the 2015 International
Conference on Autonomous Agents and Multiagent Systems. International
Foundation for Autonomous Agents and Multiagent Systems. 2015, pp. 1659–1660.
[58] Gabriel Synnaeve et al. “TorchCraft: a Library for Machine Learning Research on
Real-Time Strategy Games”. In: arXiv preprint arXiv:1611.00625 (2016).
[59] R. Collobert, K. Kavukcuoglu, and C. Farabet. “Torch7: A Matlab-like
Environment for Machine Learning”. In: BigLearn, NIPS Workshop. 2011.
[60] Landon Kraemer and Bikramjit Banerjee. “Multi-agent reinforcement learning as
a rehearsal for decentralized planning”. In: Neurocomputing 190 (2016), pp. 82–94.
[61] Emilio Jorge, Mikael Kageback, and Emil Gustavsson. “Learning to Play Guess
Who? and Inventing a Grounded Language as a Consequence”. In: arXiv preprint
arXiv:1611.03218 (2016).
[62] Martin J Osborne and Ariel Rubinstein. A course in game theory. MIT press,
1994.
[63] Katie Genter, Tim Laue, and Peter Stone. “Three years of the RoboCup standard
platform league drop-in player competition”. In: Autonomous Agents and
Multi-Agent Systems 31.4 (2017), pp. 790–820.
[64] Carlos Guestrin, Daphne Koller, and Ronald Parr. “Multiagent planning with
factored MDPs”. In: Advances in neural information processing systems. 2002,
pp. 1523–1530.
[65] Jelle R Kok and Nikos Vlassis. “Sparse cooperative Q-learning”. In: Proceedings of
the twenty-first international conference on Machine learning. ACM. 2004, p. 61.
[66] Katie Genter, Noa Agmon, and Peter Stone. “Ad hoc teamwork for leading a
flock”. In: Proceedings of the 2013 international conference on Autonomous agents
and multi-agent systems. International Foundation for Autonomous Agents and
Multiagent Systems. 2013, pp. 531–538.
[67] Samuel Barrett, Peter Stone, and Sarit Kraus. “Empirical evaluation of ad hoc
teamwork in the pursuit domain”. In: The 10th International Conference on
Autonomous Agents and Multiagent Systems-Volume 2. International Foundation
for Autonomous Agents and Multiagent Systems. 2011, pp. 567–574.
[68] Stefano V Albrecht and Peter Stone. “Reasoning about hypothetical agent
behaviours and their parameters”. In: Proceedings of the 16th Conference on
Autonomous Agents and MultiAgent Systems. International Foundation for
Autonomous Agents and Multiagent Systems. 2017, pp. 547–555.
[69] Alessandro Panella and Piotr Gmytrasiewicz. “Interactive POMDPs with
finite-state models of other agents”. In: Autonomous Agents and Multi-Agent
Systems 31.4 (2017), pp. 861–904.
[70] Takaki Makino and Kazuyuki Aihara. “Multi-agent reinforcement learning
algorithm to handle beliefs of other agents’ policies and embedded beliefs”. In:
Proceedings of the fifth international joint conference on Autonomous agents and
multiagent systems. ACM. 2006, pp. 789–791.
[71] Kyle A Thomas et al. “The psychology of coordination and common knowledge.”
In: Journal of personality and social psychology 107.4 (2014), p. 657.
[72] Ariel Rubinstein. “The Electronic Mail Game: Strategic Behavior Under ‘Almost
Common Knowledge’”. In: The American Economic Review (1989), pp. 385–391.
[73] Gizem Korkmaz et al. “Collective action through common knowledge using a
facebook model”. In: Proceedings of the 2014 international conference on
Autonomous agents and multi-agent systems. International Foundation for
Autonomous Agents and Multiagent Systems. 2014, pp. 253–260.
[74] Ronen I. Brafman and Moshe Tennenholtz. “Learning to Coordinate Efficiently: A
Model-based Approach”. In: Journal of Artificial Intelligence Research. Vol. 19.
2003, pp. 11–23.
[75] Robert J Aumann et al. “Subjectivity and correlation in randomized strategies”.
In: Journal of mathematical Economics 1.1 (1974), pp. 67–96.
[76] Ludek Cigler and Boi Faltings. “Decentralized anti-coordination through
multi-agent learning”. In: Journal of Artificial Intelligence Research 47 (2013),
pp. 441–473.
[77] Craig Boutilier. “Sequential optimality and coordination in multiagent systems”.
In: IJCAI. Vol. 99. 1999, pp. 478–485.
[78] Christopher Amato, George D Konidaris, and Leslie P Kaelbling. “Planning with
macro-actions in decentralized POMDPs”. In: Proceedings of the 2014
international conference on Autonomous agents and multi-agent systems.
International Foundation for Autonomous Agents and Multiagent Systems. 2014,
pp. 1273–1280.
[79] Miao Liu et al. “Learning for Multi-robot Cooperation in Partially Observable
Stochastic Environments with Macro-actions”. In: arXiv preprint
arXiv:1707.07399 (2017).
[80] Rajbala Makar, Sridhar Mahadevan, and Mohammad Ghavamzadeh.
“Hierarchical multi-agent reinforcement learning”. In: Proceedings of the fifth
international conference on Autonomous agents. ACM. 2001, pp. 246–253.
[97] C. L. Giles and K. C. Jim. “Learning communication for multi-agent systems”. In:
Innovative Concepts for Agent-Based Systems. Springer, 2002, pp. 377–390.
[98] Karol Gregor et al. “DRAW: A recurrent neural network for image generation”.
In: arXiv preprint arXiv:1502.04623 (2015).
[99] Matthieu Courbariaux and Yoshua Bengio. “BinaryNet: Training deep neural
networks with weights and activations constrained to +1 or -1”. In: arXiv preprint
arXiv:1602.02830 (2016).
[100] Geoffrey Hinton and Ruslan Salakhutdinov. “Discovering binary codes for
documents by learning deep generative models”. In: Topics in Cognitive Science
3.1 (2011), pp. 74–91.
[101] Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. “Language
understanding for text-based games using deep reinforcement learning”. In: arXiv
preprint arXiv:1506.08941 (2015).
[102] Sepp Hochreiter and Jurgen Schmidhuber. “Long short-term memory”. In: Neural
computation 9.8 (1997), pp. 1735–1780.
[103] Junyoung Chung et al. “Empirical evaluation of gated recurrent neural networks
on sequence modeling”. In: arXiv preprint arXiv:1412.3555 (2014).
[104] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. “An empirical
exploration of recurrent network architectures”. In: Proceedings of the 32nd
International Conference on Machine Learning (ICML-15). 2015, pp. 2342–2350.
[105] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep
network training by reducing internal covariate shift”. In: arXiv preprint
arXiv:1502.03167 (2015).
[106] W. Wu. 100 prisoners and a lightbulb. Tech. rep. OCF, UC Berkeley, 2002.
[107] Michael Studdert-Kennedy. “How Did Language go Discrete?” In: Language
Origins: Perspectives on Evolution. Ed. by Maggie Tallerman. Oxford University
Press, 2005. Chap. 3.
[108] H. P. Grice. “Logic and Conversation”. In: Syntax and Semantics: Vol. 3: Speech
Acts. Ed. by Peter Cole and Jerry L. Morgan. New York: Academic Press, 1975,
pp. 41–58. url: http://www.ucl.ac.uk/ls/studypacks/Grice-Logic.pdf.
[109] Michael C. Frank and Noah D. Goodman. “Predicting pragmatic reasoning in
language games”. In: Science 336.6084 (2012), p. 998. arXiv: 0602092 [physics].
[110] Ashutosh Nayyar, Aditya Mahajan, and Demosthenis Teneketzis. “Decentralized
stochastic control with partial history sharing: A common information approach”.
In: IEEE Trans. Automat. Contr. 58.7 (2013), pp. 1644–1658. arXiv: 1209.1695.
url: https://arxiv.org/abs/1209.1695.
[111] Chris L. Baker et al. “Rational quantitative attribution of beliefs, desires and
percepts in human mentalizing”. In: Nat. Hum. Behav. 1.4 (2017), pp. 1–10. url:
http://dx.doi.org/10.1038/s41562-017-0064.
[112] L P Kaelbling, M L Littman, and A R Cassandra. “Planning and acting in
partially observable stochastic domains”. In: Artif. Intell. 101.1-2 (1998),
pp. 99–134. url: http://dx.doi.org/10.1016/S0004-3702(98)00023-X.
[129] Michael Bowling and Manuela Veloso. “Multiagent learning using a variable
learning rate”. In: Artificial Intelligence 136.2 (2002), pp. 215–250.
[130] William Uther and Manuela Veloso. Adversarial reinforcement learning. Tech. rep.
Carnegie Mellon University, 1997. Unpublished.
[131] C. Claus and C. Boutilier. “The Dynamics of Reinforcement Learning in
Cooperative Multiagent Systems”. In: Proceedings of the Fifteenth National
Conference on Artificial Intelligence. June 1998, pp. 746–752.
[132] Michael Wunder, Michael L Littman, and Monica Babes. “Classes of multiagent
q-learning dynamics with epsilon-greedy exploration”. In: Proceedings of the 27th
International Conference on Machine Learning (ICML-10). 2010, pp. 1167–1174.
[133] Martin Zinkevich, Amy Greenwald, and Michael L Littman. “Cyclic equilibria in
Markov games”. In: Advances in Neural Information Processing Systems. 2006,
pp. 1641–1648.
[134] Michael L Littman. “Friend-or-foe Q-learning in general-sum games”. In: ICML.
Vol. 1. 2001, pp. 322–328.
[135] Doran Chakraborty and Peter Stone. “Multiagent learning in the presence of
memory-bounded agents”. In: Autonomous agents and multi-agent systems 28.2
(2014), pp. 182–213.
[136] Ronen I. Brafman and Moshe Tennenholtz. “Efficient Learning Equilibrium”. In:
Advances in Neural Information Processing Systems. Vol. 9. 2003, pp. 1635–1643.
[137] Marc Lanctot et al. “A Unified Game-Theoretic Approach to Multiagent
Reinforcement Learning”. In: Advances in Neural Information Processing Systems
(NIPS). 2017.
[138] Johannes Heinrich and David Silver. “Deep reinforcement learning from self-play
in imperfect-information games”. In: arXiv preprint arXiv:1603.01121 (2016).
[139] Adam Lerer and Alexander Peysakhovich. “Maintaining cooperation in complex
social dilemmas using deep reinforcement learning”. In: arXiv preprint
arXiv:1707.01068 (2017).
[140] Enrique Munoz de Cote and Michael L. Littman. “A Polynomial-time Nash
Equilibrium Algorithm for Repeated Stochastic Games”. In: 24th Conference on
Uncertainty in Artificial Intelligence (UAI’08). 2008. url:
http://uai2008.cs.helsinki.fi/UAI_camera_ready/munoz.pdf.
[141] Jacob W Crandall and Michael A Goodrich. “Learning to compete, coordinate,
and cooperate in repeated games using reinforcement learning”. In: Machine
Learning 82.3 (2011), pp. 281–314.
[142] George W Brown. “Iterative solution of games by fictitious play”. In: (1951).
[143] Richard Mealing and Jonathan Shapiro. “Opponent Modelling by
Expectation-Maximisation and Sequence Prediction in Simplified Poker”. In:
IEEE Transactions on Computational Intelligence and AI in Games (2015).
[144] Neil C Rabinowitz et al. “Machine Theory of Mind”. In: arXiv preprint
arXiv:1802.07740 (2018).
[164] John Schulman, Pieter Abbeel, and Xi Chen. “Equivalence Between Policy
Gradients and Soft Q-Learning”. In: CoRR abs/1704.06440 (2017). arXiv:
1704.06440.
[165] John Schulman et al. “Trust region policy optimization”. In: International
Conference on Machine Learning. 2015, pp. 1889–1897.
[166] Barak A Pearlmutter. “Fast exact multiplication by the Hessian”. In: Neural
computation 6.1 (1994), pp. 147–160.
[167] Bradly Stadie et al. Some Considerations on Learning to Explore via
Meta-Reinforcement Learning. 2018. url:
https://openreview.net/forum?id=Skk3Jm96W.
[168] Michael C Fu. “Gradient estimation”. In: Handbooks in operations research and
management science 13 (2006), pp. 575–616.
[169] Ivo Grondman et al. “A survey of actor-critic reinforcement learning: Standard
and natural policy gradients”. In: IEEE Transactions on Systems, Man, and
Cybernetics, Part C (Applications and Reviews) 42.6 (2012), pp. 1291–1307.
[170] Peter W Glynn. “Likelihood ratio gradient estimation for stochastic systems”. In:
Communications of the ACM 33.10 (1990), pp. 75–84.
[171] David Wingate and Theophane Weber. “Automated Variational Inference in
Probabilistic Programming”. In: CoRR abs/1301.1299 (2013). arXiv: 1301.1299.
[172] Rajesh Ranganath, Sean Gerrish, and David M. Blei. “Black Box Variational
Inference”. In: Proceedings of the Seventeenth International Conference on
Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April
22-25, 2014. 2014, pp. 814–822.
[173] Diederik P. Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In:
CoRR abs/1312.6114 (2013). arXiv: 1312.6114.
[174] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. “Stochastic
Backpropagation and Approximate Inference in Deep Generative Models”. In:
(2014), pp. 1278–1286.
[175] Atilim Gunes Baydin, Barak A. Pearlmutter, and Alexey Andreyevich Radul.
“Automatic differentiation in machine learning: a survey”. In: CoRR
abs/1502.05767 (2015). arXiv: 1502.05767.