Efficient Reinforcement Learning Using Gaussian Processes
by
Marc Peter Deisenroth
Dissertation, Karlsruher Institut für Technologie
Fakultät für Informatik, 2009
Imprint
Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe
www.ksp.kit.edu
ISSN 1867-3813
ISBN 978-3-86644-569-7
Acknowledgements
I want to thank my adviser Prof. Dr.-Ing. Uwe D. Hanebeck for accepting me as an
external PhD student and for his longstanding support since my undergraduate
student times.
I am deeply grateful to my supervisor Dr. Carl Edward Rasmussen for his
excellent supervision, numerous productive and valuable discussions and inspi-
rations, and his patience. Carl is a gifted and inspiring teacher and he creates a
positive atmosphere that made the three years of my PhD pass by too quickly.
I am wholeheartedly appreciative of Dr. Jan Peters’ great support and his
helpful advice throughout the years. Jan happily provided me with a practical
research environment and always pointed out the value of “real” applications.
This thesis emerged from my research at the Max Planck Institute for Biological
Cybernetics in Tübingen and the Department of Engineering at the University of
Cambridge. I am very thankful to Prof. Dr. Bernhard Schölkopf, Prof. Dr. Daniel
Wolpert, and Prof. Dr. Zoubin Ghahramani for giving me the opportunity to join
their outstanding research groups. I remain grateful to Daniel and Zoubin for
sharing many interesting discussions and their fantastic support during the last
few years.
I wish to thank my friends and colleagues in Tübingen, Cambridge, and Karls-
ruhe for their friendship and support, sharing ideas, thought-provoking discus-
sions over beers, and a generally great time. I especially thank Aldo Faisal, Cheng
Soon Ong, Christian Wallenta, David Franklin, Dietrich Brunn, Finale Doshi-Velez,
Florian Steinke, Florian Weißel, Frederik Beutler, Guillaume Charpiat, Hannes
Nickisch, Henrik Ohlsson, Hiroto Saigo, Ian Howard, Jack DiGiovanna, James In-
gram, Janet Milne, Jarno Vanhatalo, Jason Farquhar, Jens Kober, Jurgen Van Gael,
Karsten Borgwardt, Lydia Knüfing, Marco Huber, Markus Maier, Matthias Seeger,
Michael Hirsch, Miguel Lázaro-Gredilla, Mikkel Schmidt, Nora Toussaint, Pedro
Ortega, Peter Krauthausen, Peter Orbanz, Rachel Fogg, Ruth Mokgokong, Ryan
Turner, Sabrina Rehbaum, Sae Franklin, Shakir Mohamed, Simon Lacoste-Julien,
Sinead Williamson, Suzanne Oliveira-Martens, Tom Stepleton, and Yunus Saatçi.
Furthermore, I am grateful to Cynthia Matuszek, Finale Doshi-Velez, Henrik Ohls-
son, Jurgen Van Gael, Marco Huber, Mikkel Schmidt, Pedro Ortega, Peter Orbanz,
Ryan Turner, Shakir Mohamed, Simon Lacoste-Julien, and Sinead Williamson for
thorough proof-reading, and valuable comments on several drafts of this thesis.
Finally, I sincerely thank my family for their unwavering support and their
confidence in my decision to join this PhD course.
I acknowledge the financial support toward this PhD from the German Re-
search Foundation (DFG) through grant RA 1030/1-3.
Marc Deisenroth
Seattle, November 2010
Contents
Zusammenfassung
Abstract
1 Introduction
3.8.2 Pendubot
3.8.3 Cart-Double Pendulum
3.8.4 5 DoF Robotic Unicycle
3.9 Practical Considerations
3.9.1 Large Data Sets
3.9.2 Noisy Measurements of the State
3.10 Discussion
3.11 Further Reading
3.12 Summary
5 Conclusions
E Implementation
E.1 Gaussian Process Predictions at Uncertain Inputs
Bibliography
Zusammenfassung
Reinforcement learning (RL) is concerned with autonomous learning and
sequential decision making under uncertainty. To date, however, most RL
algorithms are either very inefficient or they require problem-specific prior
knowledge. RL is therefore often not practically applicable when decisions
are to be learned fully autonomously. This dissertation is primarily concerned
with making RL more efficient by modeling the available data well and by
carefully extracting information from them.
The scientific contributions of this dissertation are as follows:
1. With pilco, we present a fully Bayesian method for efficient RL in
continuous-valued state and action spaces. Pilco is based on well-established
techniques from statistics and machine learning. Pilco's key ingredient is a
probabilistic dynamics model implemented by means of a Gaussian process (GP).
The GP quantifies uncertainty through a probability distribution over all
plausible dynamics models. Taking all of these models into account during
planning and decision making allows pilco to reduce the systematic model
error, which can be severe when deterministic models are used in model-based
RL.
2. Due to its generality and efficiency, pilco can be regarded as a conceptual
and practical approach to algorithmically learning both models and controllers
when expert knowledge is difficult or impossible to obtain.
For exactly this scenario, we examine pilco's properties and its applicability
using challenging real and simulated nonlinear control problems. For example,
models and controllers are learned for balancing a unicycle with five degrees
of freedom or for swinging up a double pendulum—fully autonomously. Pilco
finds good models and controllers more efficiently than any learning method
known to us that does not use expert knowledge.
3. As a first step toward extending pilco to partially observable Markov
decision processes, we present algorithms for robust filtering and smoothing
in GP dynamic systems. In contrast to common Gaussian filters, however, our
method is based neither on linearization nor on particle approximations of
Gaussian densities. Instead, our algorithm relies on exact moment matching
for predictions, where all computations can be performed analytically. We
present promising results that underline the robustness and the advantages
of our method over the unscented Kalman filter, the cubature Kalman filter,
and the extended Kalman filter.
Abstract
In many research areas, including control and medical applications, we
face decision-making problems where data are limited and/or the under-
lying generative process is complicated and partially unknown. In these
scenarios, we can profit from algorithms that learn from data and aid
decision making.
Reinforcement learning (RL) is a general computational approach to ex-
perience-based goal-directed learning for sequential decision making un-
der uncertainty. However, RL often lacks efficiency in terms of the num-
ber of required trials when no task-specific knowledge is available. This
lack of efficiency often makes RL inapplicable to (optimal) control prob-
lems. Thus, a central issue in RL is to speed up learning by extracting
more information from available experience.
The contributions of this dissertation are threefold:
1. We propose pilco, a fully Bayesian approach for efficient RL in con-
tinuous-valued state and action spaces when no expert knowledge is
available. Pilco is based on well-established ideas from statistics and
machine learning. Pilco’s key ingredient is a probabilistic dynam-
ics model learned from data, which is implemented by a Gaussian
process (GP). The GP carefully quantifies knowledge by a probability
distribution over plausible dynamics models. By averaging over all
these models during long-term planning and decision making, pilco
takes uncertainties into account in a principled way and, therefore,
reduces model bias, a central problem in model-based RL.
2. Due to its generality and efficiency, pilco can be considered a concep-
tual and practical approach to jointly learning models and controllers
when expert knowledge is difficult to obtain or simply not available.
For this scenario, we investigate pilco’s properties and its applicability
to challenging real and simulated nonlinear control problems. For
example, we consider the tasks of learning to swing up a double
pendulum attached to a cart or to balance a unicycle with five degrees
of freedom. Across all tasks, we report an unprecedented degree of automation
and unprecedented learning efficiency.
3. As a step toward pilco’s extension to partially observable Markov de-
cision processes, we propose a principled algorithm for robust filter-
ing and smoothing in GP dynamic systems. Unlike commonly used
Gaussian filters for nonlinear systems, it relies neither on function
linearization nor on finite-sample representations of densities.
Our algorithm profits from exact moment matching for predictions
while keeping all computations analytically tractable. We present
experimental evidence that demonstrates the robustness and the ad-
vantages of our method over the unscented Kalman filter, the cubature
Kalman filter, and the extended Kalman filter.
1 Introduction
As a joint field of artificial intelligence and modern statistics, machine learning
is concerned with the design and development of algorithms and techniques that
allow computers to automatically extract information and “learn” structure from
data. The learned structure can be described by a statistical model that compactly
represents the data.
As a branch of machine learning, reinforcement learning (RL) is a computational
approach to learning from interactions with the surrounding world and concerned
with sequential decision making in unknown environments to achieve high-level
goals. Usually, no sophisticated prior knowledge is available and all required in-
formation to achieve the goal has to be obtained through trials. The following (pic-
torial) setup emerged as a general framework to solve this kind of problem (Kael-
bling et al., 1996). An agent interacts with its surrounding world by taking actions
(see Figure 1.1). In turn, the agent perceives sensory inputs that reveal some infor-
mation about the state of the world. Moreover, the agent perceives a reward/penalty
signal that reveals information about the quality of the chosen action and the state
of the world. The history of taken actions and perceived information gathered
from interacting with the world forms the agent’s experience. As opposed to super-
vised and unsupervised learning, the agent’s experience is solely based on former
interactions with the world and forms the basis for its next decisions. The agent’s
objective in RL is to find a sequence of actions, a strategy, that minimizes an ex-
pected long-term cost. Solely describing the world is therefore insufficient to solve
the RL problem: The agent must also decide how to use the knowledge about the
world in order to make decisions and to choose actions. Since RL is inherently
based on collected experience, it provides a general, intuitive, and theoretically
powerful framework for autonomous learning and sequential decision making un-
der uncertainty. The general RL concept can be found in solutions to a variety of problems.
Figure 1.1: Typical RL setup: The agent interacts with the world by taking actions. After each action, the
agent perceives sensory information about the state of the world and a scalar signal rating the previously
chosen action in the previous state of the world.
The RL concept is related to optimal control although the fields are traditionally
separate: Like in RL, optimal control is concerned with sequential decision making
to minimize an expected long-term cost. In the context of control, the world can be
identified as the dynamic system, the decision-making algorithm within the agent
corresponds to the controller, and actions correspond to control signals. The RL
objective can also be mapped to the optimal control objective: Find a strategy that
minimizes an expected long-term cost. In optimal control, it is typically assumed
that the dynamic system is known. Since the problem of determining the param-
eters of the dynamic system is typically not dealt with in optimal control, finding a
good strategy essentially boils down to an optimization problem. Since the knowl-
edge of the parameterization of the dynamic system is often a requisite for optimal
control (Bertsekas, 2005), this parameterization can be used for internal simulations
without the need to directly interact with the dynamic system.
Unlike optimal control, the general concept of RL does not require expert knowl-
edge, that is, task-specific prior knowledge, or an intricate prior understanding of
the underlying world (in control: dynamic system). Instead, RL largely relies upon
experience gathered from directly interacting with the surrounding world; a model
of the world and the consequences of actions applied to the world are unknown
in advance. To gather information about the world, the RL agent has to explore
the world. The RL agent has to trade off exploration, which often means to act
sub-optimally, and exploitation of its current knowledge to act locally optimally.
For illustration purposes, let us consider the maze in Figure 1.2 from the book
by Russell and Norvig (2003). The agent, denoted by the green disc in the lower-
right corner, has already found a locally optimal path (indicated by the arrows) from
previous interactions leading to the high-reward region in the upper-right corner.
Although the path avoids the high-cost region, which yields a reward of −20, it
is not globally optimal since each step taken by the agent causes a reward of −1.
Here, the exploration-exploitation dilemma becomes clearer: Either the agent sticks
to the current suboptimal path or it explores new paths, which might be better, but
which might also lead to the high-cost region. Potentially even more problematic
is that the agent does not know whether there exists a better path than the current
one. Due to its fairly general assumptions, RL typically requires many interactions
with the surrounding world to find a good strategy. In some cases, however, it
can be proven that RL can converge to a globally optimal strategy (Jaakkola et al.,
1994).
Figure 1.2: The agent (green disc in the lower-right corner) found a suboptimal path to the high-reward
region in the upper-right corner. The black square denotes a wall, the numerical values within the squares
denote the immediate reward. The arrows represent a suboptimal path to the high-reward region.
A model of the world compactly represents the collected experience and can be used for generalization and hypothesizing about the conse-
quences in the world of taking a particular action. Therefore, using a model that
represents the world is a promising approach to make RL more efficient.
The model of the world is often described by a transition function that maps
state-action pairs to successor states. However, when only a few samples from
the world are available they can be explained by many transition functions. Let
us assume we decide on a single function, say, the most likely function given the
collected experience so far. When we use this function to learn a good strategy, we
implicitly believe that this most likely function describes the dynamics of the world
exactly—everywhere! This is a rather strong assumption since our decision on the
most likely function was based on little data. We face the problem that a strategy
based on a model that does not describe dynamically relevant regions of the world
sufficiently well can have disastrous effects in the world. We would be more con-
fident if we could select multiple “plausible” transition functions, rank them, and
learn a strategy based on a weighted average over these plausible models.
Gaussian processes (GPs) provide a consistent and principled probabilistic
framework for ranking functions according to their plausibility by defining a corre-
sponding probability distribution over functions (Rasmussen and Williams, 2006).
When we use a GP distribution on transition functions to describe the dynamics
of the world, we can incorporate all plausible functions into the decision-making
process by Bayesian averaging according to the GP distribution. This allows us
to reason about things we do not know for sure. Thus, GPs provide a practical
tool to reduce the problem of model bias, which frequently occurs when deter-
ministic models are used (Atkeson and Schaal, 1997b; Schaal, 1997; Atkeson and
Santamaría, 1997).
This thesis presents a principled and practical Bayesian framework for efficient
RL in continuous-valued domains by imposing fairly general prior assumptions
on the world and carefully modeling collected experience. By carefully modeling
uncertainties, our proposed method achieves unprecedented speed of learning and
an unprecedented degree of automation by reducing model bias in a principled
way: Bayesian inference with GPs is used to explicitly incorporate the uncertainty
about the world into long-term planning and decision making. Our framework
assumes a fully observable world and is applicable to episodic tasks. Hence, our
approach combines ideas from optimal control with the generality of reinforcement
learning and narrows the gap between control and RL.
A logical extension of the proposed RL framework is to consider the case where
the world is no longer fully observable, that is, only noisy or partial measurements
of the state of the world are available. In such a case, the true state of the world
is unknown (hidden/latent), but it can be described by a probability distribution,
the belief state. For an extension of our learning framework to this case, we re-
quire two ingredients: First, we need to learn the transition function in latent
space, a problem that is related to system identification in a control context. Sec-
ond, if the transition function is known, we need to infer the latent state of the
world based on noisy and partial measurements. The latter problem corresponds
to filtering and smoothing in stochastic and nonlinear systems. We do not fully ad-
dress the extension of our RL framework to partially observable Markov decision
processes, but we provide first steps in this direction and present an implemen-
tation of the forward-backward algorithm for filtering and smoothing targeted at
Gaussian-process dynamic systems.
The main contributions of this thesis are threefold:
• We present pilco (probabilistic inference and learning for control), a practical
and general Bayesian framework for efficient RL in continuous-valued state
and action spaces when no task-specific expert knowledge is available. We
demonstrate the viability of our framework by applying it to learning compli-
cated nonlinear control tasks in simulation and hardware. Across all tasks, we
report an unprecedented efficiency in learning and an unprecedented degree
of automation.
• Due to its generality and efficiency, pilco can be considered a conceptual and
practical approach to learning models and controllers when expert knowledge
is difficult to obtain or simply not available, which makes system identification
hard.
• We introduce a robust algorithm for filtering and smoothing in Gaussian-
process dynamic systems. Our algorithm belongs to the class of Gaussian filters
and smoothers. These algorithms are a requisite for system identification and
the extension of our RL framework to partially observable Markov decision
processes.
Based on well-established ideas from Bayesian statistics and machine learning, this
dissertation touches upon problems of approximate Bayesian inference, regression,
reinforcement learning, optimal control, system identification, adaptive control,
dual control, state estimation, and robust control. We now summarize the contents
of the central chapters of this dissertation:
where kh (X, X) is the full covariance matrix of all function values h(X) under
consideration and N denotes a normalized Gaussian probability density function.
The graphical model of a GP is given in Figure 2.1. We denote a function that is
GP distributed by h ∼ GP or h ∼ GP(mh , kh ).
Figure 2.1: Factor graph of a GP model. The node hi is a short-hand notation for h(xi ). The plate notation
is a compact representation of an n-fold copy of the node hi , i = x1 , . . . , xn . The black square is a factor
representing the GP prior connecting all variables hi . In the GP model any finite collection of function
values h(x1 ), . . . , h(xn ) has a joint Gaussian distribution.
2.2.1 Prior
When modeling a latent function with Gaussian processes, we place a GP prior
p(h) directly on the space of functions. In the GP model, we have to specify the
prior mean function and the prior covariance function. Unless stated otherwise,
we consider a prior mean function mh ≡ 0 and use the squared exponential (SE)
covariance function with automatic relevance determination
k_{SE}(x_p, x_q) := \alpha^2 \exp\big(-\tfrac{1}{2}(x_p - x_q)^\top \Lambda^{-1} (x_p - x_q)\big) , \quad x_p, x_q \in \mathbb{R}^D ,    (2.3)
plus a noise covariance function δ_pq σ_ε², such that k_h = k_SE + δ_pq σ_ε². In equation (2.3),
Λ = diag(ℓ_1², . . . , ℓ_D²) is a diagonal matrix of squared characteristic length-scales
ℓ_i, i = 1, . . . , D, and α² is the signal variance of the latent function h. In the
noise covariance function, δpq denotes the Kronecker symbol that is unity when
p = q and zero otherwise, which essentially encodes that the measurement noise
is independent.1 With the SE covariance function in equation (2.3) we assume that
the latent function h is smooth and stationary.
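To make this concrete, the following minimal NumPy sketch (our own illustration; names such as k_se and k_h_matrix are not from the thesis) evaluates the SE covariance function with ARD from equation (2.3) and adds the noise covariance δ_pq σ_ε² only where the sample indices coincide, that is, on the diagonal of the training covariance matrix.

    import numpy as np

    def k_se(Xp, Xq, lengthscales, alpha2):
        """SE covariance with ARD, eq. (2.3): alpha^2 exp(-1/2 (xp-xq)^T Lambda^{-1} (xp-xq))."""
        diff = (Xp[:, None, :] - Xq[None, :, :]) / lengthscales   # length-scale-weighted differences
        return alpha2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

    def k_h_matrix(X, lengthscales, alpha2, sigma_eps2):
        """k_h = k_SE + delta_pq sigma_eps^2, evaluated on the training inputs X.
        The Kronecker delta acts on the sample indices, so noise is added on the diagonal only."""
        return k_se(X, X, lengthscales, alpha2) + sigma_eps2 * np.eye(len(X))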
The length-scales `1 , . . . , `D , the signal variance α2 , and the noise variance σε2
are so-called hyper-parameters of the latent function, which are collected in the
hyper-parameter vector θ. Figure 2.2 is a graphical model that describes the hier-
archical structure we consider: The bottom is an observed level given by the data
D = {X, y}. Above the data is the latent function h, the random “variable” we
are primarily interested in. At the top level, the hyper-parameters θ specify the
distribution on the function values h(x). A third level of models Mi , for example
different covariance functions, could be added on top. This case is not discussed
in this thesis since we always choose a single covariance function. Rasmussen
and Williams (2006) provide the details on a three-level inference in the context of
model selection.
1 I thank Ed Snelson for pointing out that the Kronecker symbol is defined on the indices of the samples but not on input
locations. Therefore, xp and xq are uncorrelated according to the noise covariance function even if xp = xq , but p 6= q.
Figure 2.2: Hierarchical model with the observed data D at the bottom, the latent function h at level 1, and the hyper-parameters θ at level 2.
h(x) = \lim_{N \to \infty} \frac{1}{N} \sum_{i \in \mathbb{Z}} \sum_{n=1}^{N} \gamma_n \exp\Big(-\frac{\big(x - (i + \tfrac{n}{N})\big)^2}{\lambda^2}\Big) , \quad x \in \mathbb{R} , \ \lambda \in \mathbb{R}_+ ,    (2.4)
represented by infinitely many Gaussian-shaped basis functions along the real axis
with variance λ2 . Let us also assume a standard-normal prior distribution N (0, 1)
on the weights γn , n = 1, . . . , N . The model in equation (2.4) is typically considered
a universal function approximator. In the limit N → ∞, we can replace the sums
with an integral over R and rewrite equation (2.4) as
h(x) = \sum_{i \in \mathbb{Z}} \int_{i}^{i+1} \gamma(s) \exp\Big(-\frac{(x - s)^2}{\lambda^2}\Big) \, \mathrm{d}s = \int_{-\infty}^{\infty} \gamma(s) \exp\Big(-\frac{(x - s)^2}{\lambda^2}\Big) \, \mathrm{d}s ,    (2.5)
where γ(s) ∼ N (0, 1). The integral in equation 2.5 can be considered a convolu-
tion of the white-noise process γ with a Gaussian-shaped kernel. Therefore, the
function values of h are jointly normal and h is a Gaussian process according to
Definition 2.
Let us now compute the prior mean function and the prior covariance function
of h: The only uncertain variables in equation (2.5) are the weights γ(s). Comput-
ing the expected function of this model, that is, the prior mean function, requires
averaging over γ(s) and yields
E_\gamma[h(x)] \stackrel{(2.5)}{=} \int h(x) \, p(\gamma(s)) \, \mathrm{d}\gamma(s) = \int \exp\Big(-\frac{(x - s)^2}{\lambda^2}\Big) \int \gamma(s) \, p(\gamma(s)) \, \mathrm{d}\gamma(s) \, \mathrm{d}s = 0    (2.6)
since Eγ [γ(s)] = 0. Hence, the prior mean function of h equals zero everywhere.
Let us now find the prior covariance function. Since the prior mean function
equals zero, we obtain
\mathrm{cov}_\gamma[h(x), h(x')] = E_\gamma[h(x) h(x')] = \int h(x) h(x') \, p(\gamma(s)) \, \mathrm{d}\gamma(s)    (2.7)
= \int \exp\Big(-\frac{(x - s)^2}{\lambda^2}\Big) \exp\Big(-\frac{(x' - s)^2}{\lambda^2}\Big) \int \gamma(s)^2 \, p(\gamma(s)) \, \mathrm{d}\gamma(s) \, \mathrm{d}s , \quad x, x' \in \mathbb{R} ,    (2.8)
for the prior covariance function, where we used the definition of h in equa-
tion (2.5). With varγ [γ(s)] = 1 and by completing the squares, the prior covariance
function is given as
\mathrm{cov}_\gamma[h(x), h(x')] = \int \exp\Bigg(-\frac{2\big(s^2 - s(x + x') + \tfrac{x^2 + 2xx' + (x')^2}{4}\big) - xx' + \tfrac{x^2 + (x')^2}{2}}{\lambda^2}\Bigg) \mathrm{d}s    (2.9)
= \int \exp\Bigg(-\frac{2\big(s - \tfrac{x + x'}{2}\big)^2 + \tfrac{(x - x')^2}{2}}{\lambda^2}\Bigg) \mathrm{d}s    (2.10)
= \alpha^2 \exp\Big(-\frac{(x - x')^2}{2\lambda^2}\Big)    (2.11)
for suitable α2 .
From equations (2.6) and (2.11), we see that the prior mean function and the
prior covariance function of the universal function approximator in equation (2.4)
correspond to the GP model assumptions we made earlier: a prior mean func-
tion mh ≡ 0 and the SE covariance function given in equation (2.3) for a one-
dimensional input space. Hence, the considered GP prior implicitly assumes
latent functions h that can be described by the universal function approximator
in equation (2.5). Examples of covariance functions that encode different model
assumptions are given in the book by Rasmussen and Williams (2006, Chapter 4).
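This correspondence can also be checked numerically. The short NumPy sketch below (our own illustration, not from the thesis) discretizes the basis-function model from equation (2.4) on a dense grid of centers and verifies that the resulting correlation between function values matches the SE form exp(−(x − x')²/(2λ²)) from equation (2.11); the grid range, spacing, and test inputs are arbitrary choices.

    import numpy as np

    lam = 0.7                                  # basis-function width lambda
    centers = np.arange(-6.0, 6.0, 0.01)       # dense grid of basis-function centers
    x = np.array([-0.3, 0.0, 0.5, 1.2])        # inputs at which we evaluate h

    # Features Phi[c, j] = exp(-(x_j - c)^2 / lambda^2); h(x_j) = sum_c gamma_c Phi[c, j].
    Phi = np.exp(-((x[None, :] - centers[:, None]) ** 2) / lam**2)

    # With i.i.d. standard-normal weights gamma_c: cov[h(x_j), h(x_k)] = sum_c Phi[c, j] Phi[c, k].
    C = Phi.T @ Phi
    corr = C / np.sqrt(np.outer(np.diag(C), np.diag(C)))

    # Normalized SE covariance from eq. (2.11): exp(-(x - x')^2 / (2 lambda^2)).
    se_corr = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * lam**2))
    print(np.max(np.abs(corr - se_corr)))      # small, up to grid-discretization error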
2.2.2 Posterior
After having observed function values y with yi = h(xi ) + εi , i = 1, . . . , n, for a set
of input vectors X, Bayes’ theorem yields
p(h \,|\, X, y, \theta) = \frac{p(y \,|\, h, X, \theta) \, p(h \,|\, \theta)}{p(y \,|\, X, \theta)} ,    (2.12)
the posterior (GP) distribution over h. Note that p(y|h, X, θ) is a finite-dimensional probability
distribution, but the GP prior p(h|θ) is a distribution over functions, an infinite-
dimensional object. Hence, the (GP) posterior is also infinite dimensional.
We assume that the observations yi are conditionally independent given X.
Therefore, the likelihood of h factors according to
p(y \,|\, h, X, \theta) = \prod_{i=1}^{n} p(y_i \,|\, h(x_i), \theta) = \prod_{i=1}^{n} \mathcal{N}(y_i \,|\, h(x_i), \sigma_\varepsilon^2) = \mathcal{N}(y \,|\, h(X), \sigma_\varepsilon^2 I) .    (2.13)
(a) Samples from the GP prior. Without any observations, the prior uncertainty about the underlying function is constant everywhere.
(b) Samples from the GP posterior after having observed 8 function values (black crosses). The posterior uncertainty varies and depends on the location of the training inputs.
Figure 2.3: Samples from the GP prior and the GP posterior for fixed hyper-parameters. The solid black lines represent the mean functions, the shaded areas represent the 95% confidence intervals of the (marginal) GP distribution. The colored dashed lines represent three sample functions from the GP prior and the GP posterior, Panel (a) and Panel (b), respectively. The black crosses in Panel (b) represent the training targets.
The likelihood in equation (2.13) essentially encodes the assumed noise model.
Here, we assume additive independent and identically distributed (i.i.d.) Gaussian
noise.
For given hyper-parameters θ, the Gaussian likelihood p(y|X, h, θ) in equa-
tion (2.13) and the GP prior p(h|θ) lead to the GP posterior in equation (2.12). The
mean function and the covariance function of this posterior GP are given by
E_h[h(\tilde{x}) \,|\, X, y, \theta] = k_h(\tilde{x}, X) (K + \sigma_\varepsilon^2 I)^{-1} y ,    (2.14)
\mathrm{cov}_h[h(\tilde{x}), h(x') \,|\, X, y, \theta] = k_h(\tilde{x}, x') - k_h(\tilde{x}, X) (K + \sigma_\varepsilon^2 I)^{-1} k_h(X, x') ,    (2.15)
respectively, where x̃, x′ ∈ RD are arbitrary vectors, which we call the test inputs.
For notational convenience, we write kh (X, x0 ) for [kh (x1 , x0 ), . . . , kh (xn , x0 )] ∈ Rn×1 .
Note that kh (x0 , X) = kh (X, x0 )> . Figure 2.3 shows samples from the GP prior and
the GP posterior. With only a few observations, the prior uncertainty about the
function has been noticeably reduced. At test inputs that are relatively far away
from the training inputs, the GP posterior falls back to the GP prior. This can be
seen at the left border of Panel 2.3(b).
Let us have a closer look at the two-level inference scheme in Figure 2.2. Thus
far, we determined the GP posterior for a given set of hyper-parameters θ. In the
following, we treat the hyper-parameters θ as latent variables since their values are
not known a priori. In a fully Bayesian setup, we place a hyper-prior p(θ) on the hyper-parameters and marginalize them out.
The integration required for p(y|X) is analytically intractable since p(y|X, θ) with
\log p(y \,|\, X, \theta) = -\tfrac{1}{2} y^\top (K_\theta + \sigma_\varepsilon^2 I)^{-1} y - \tfrac{1}{2} \log |K_\theta + \sigma_\varepsilon^2 I| - \tfrac{n}{2} \log(2\pi)    (2.19)
is a nasty function of θ, where n is the number of training targets and K
is the kernel matrix with Kij = k(xi , xj ). We made the dependency of K on the
hyper-parameters θ explicit by writing Kθ . Approximate averaging over the hyper-
parameters can be done using computationally demanding Monte Carlo methods.
In this thesis, we do not follow the Bayesian path to the end. Instead, we find a
good point estimate θ̂ of hyper-parameters on which we condition our inference.
To do so, let us go through the hierarchical inference structure in Figure 2.2.
Level-1 Inference
When we condition on the hyper-parameters, the GP posterior on the function is
p(h \,|\, X, y, \theta) = \frac{p(y \,|\, X, h, \theta) \, p(h \,|\, \theta)}{p(y \,|\, X, \theta)} ,    (2.20)
where p(y|X, h, θ) is the likelihood of the function h, see equation (2.13), and p(h|θ)
is the GP prior on h. The posterior mean and covariance functions are given in
equations (2.14) and (2.15), respectively. The normalizing constant
p(y \,|\, X, \theta) = \int p(y \,|\, X, h, \theta) \, p(h \,|\, \theta) \, \mathrm{d}h    (2.21)
in equation (2.20) is the marginal likelihood, also called the evidence. The marginal
likelihood is the likelihood of the hyper-parameters given the data after having
marginalized out the function h.
Level-2 Inference
The posterior on the hyper-parameters is
p(\theta \,|\, X, y) = \frac{p(y \,|\, X, \theta) \, p(\theta)}{p(y \,|\, X)} ,    (2.22)
where p(θ) is the hyper-prior. The marginal likelihood at the second level is
p(y \,|\, X) = \int p(y \,|\, X, \theta) \, p(\theta) \, \mathrm{d}\theta .    (2.23)
Evidence Maximization
When choosing the hyper-prior p(θ), we must not exclude any possible settings
of the hyper-parameters a priori. By choosing a “flat” prior, we assume that
any values for the hyper-parameters are possible a priori. The flat prior on the
hyper-parameters has computational advantages: It makes the posterior distribu-
tion over θ (see equation (2.22)) proportional to the marginal likelihood in equa-
tion (2.21), that is, p(θ|X, y) ∝ p(y|X, θ). This means the maximum a posteri-
ori (MAP) estimate of the hyper-parameters θ equals the maximum (marginal)
likelihood estimate. To find a vector of “good” hyper-parameters, we therefore
maximize the marginal likelihood in equation (2.21) with respect to the hyper-
parameters as recommended by MacKay (1999). In particular, the log-marginal
likelihood (log-evidence) is
\log p(y \,|\, X, \theta) = \log \int p(y \,|\, h, X, \theta) \, p(h \,|\, \theta) \, \mathrm{d}h = \underbrace{-\tfrac{1}{2} y^\top (K_\theta + \sigma_\varepsilon^2 I)^{-1} y}_{\text{data-fit term}} \; \underbrace{- \tfrac{1}{2} \log |K_\theta + \sigma_\varepsilon^2 I|}_{\text{complexity term}} - \tfrac{n}{2} \log(2\pi) .    (2.24)
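The following sketch shows one way of implementing ML-II training for the zero-mean SE-ARD model: it evaluates the negative log-evidence from equation (2.24) using a Cholesky factorization and minimizes it with a generic optimizer. The log-parameterization of the hyper-parameters, the optimizer choice, and the toy data are our own assumptions, not prescriptions from the thesis.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_evidence(log_theta, X, y):
        """Negative log-marginal likelihood, eq. (2.24), for the zero-mean SE-ARD model.
        log_theta = [log l_1, ..., log l_D, log alpha, log sigma_eps]."""
        D = X.shape[1]
        ell = np.exp(log_theta[:D])
        alpha2 = np.exp(2 * log_theta[D])
        noise = np.exp(2 * log_theta[D + 1])
        diff = (X[:, None, :] - X[None, :, :]) / ell
        K = alpha2 * np.exp(-0.5 * np.sum(diff**2, axis=-1)) + noise * np.eye(len(X))
        L = np.linalg.cholesky(K)                           # K = L L^T
        a = np.linalg.solve(L.T, np.linalg.solve(L, y))     # a = K^{-1} y
        return 0.5 * y @ a + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

    # Toy data (hypothetical); in practice X and y form the recorded training set.
    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(30, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

    theta0 = np.zeros(X.shape[1] + 2)                       # start from unit hyper-parameters
    res = minimize(neg_log_evidence, theta0, args=(X, y), method="L-BFGS-B")
    print(np.exp(res.x))                                    # learned [length-scales, alpha, sigma_eps]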
Figure 2.4: Graphical model distinguishing function values at training inputs, test inputs, and “other” inputs. Training inputs are the vectors based on which the hyper-parameters have been learned; test inputs are query points for predictions. All “other” variables are marginalized out during training and testing and are added to the figure for completeness.
2.3 Predictions
The main focus of this thesis lies on how we can use GP models for reinforcement
learning and smoothing. Both tasks require iterative predictions with GPs when
the input is given by a probability distribution. In the following, we provide the
central theoretical foundations of this thesis by discussing predictions with GPs
in detail. We cover both predictions at deterministic and random inputs, and for
univariate and multivariate targets.
In the following, we always assume a GP posterior, that is, we gathered training
data and learned the hyper-parameters using marginal-likelihood maximization.
The posterior GP can be used to compute the posterior predictive distribution
of h(x∗ ) for any test input x∗ . From now on, we call the “posterior predictive
distribution” simply a “predictive distribution” and omit the explicit dependence
on the ML-II estimate θ̂ of the hyper-parameters. Since we assume that the GP
has been trained before, we sometimes additionally omit the explicit dependence
of the predictions on the training set X, y.
where we define h := [h(x1 ), . . . , h(xn )]> and h∗ := [h(x∗1 ), . . . , h(x∗m )]> . All
“other” function values have been integrated out.
Univariate Predictions: x∗ ∈ RD , y∗ ∈ R
Let us start with scalar training targets yi ∈ R and a deterministic test input x∗ . From
equation (2.26), it follows that the predictive marginal distribution p(h(x∗ )|D, x∗ )
of the function value h(x∗ ) is Gaussian. Its mean and variance are given by
\mu_* := m_h(x_*) := E_h[h(x_*) \,|\, X, y] = k_h(x_*, X) (K + \sigma_\varepsilon^2 I)^{-1} y    (2.27)
= k_h(x_*, X) \beta = \sum_{i=1}^{n} \beta_i \, k(x_i, x_*) ,    (2.28)
\sigma_*^2 := \sigma_h^2(x_*) := \mathrm{var}_h[h(x_*) \,|\, X, y] = k_h(x_*, x_*) - k_h(x_*, X) (K + \sigma_\varepsilon^2 I)^{-1} k_h(X, x_*) ,    (2.29)
respectively, where β := (K + σε2 I)−1 y. The predictive mean in equation (2.28) can
therefore be expressed as a finite kernel expansion with weights βi (Schölkopf and
Smola, 2002; Rasmussen and Williams, 2006). Note that kh (x∗ , x∗ ) in equation (2.29)
is the prior model uncertainty plus measurement noise. From this prior uncer-
tainty, we subtract a quadratic form that encodes how much information we can
transfer from the training set to the test set. Since (K + σ_ε² I)^{-1} is positive definite,
the quadratic form k_h(x_*, X)(K + σ_ε² I)^{-1} k_h(X, x_*) is non-negative, so the posterior
variance σ_h²(x_*) is not larger than the prior variance given by k_h(x_*, x_*).
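Equations (2.27)–(2.29) translate directly into code. The sketch below is a minimal NumPy illustration (our own, with illustrative names); following the convention above, k_h(x_*, x_*) includes the measurement-noise variance, and a practical implementation would use a Cholesky factorization instead of repeated linear solves.

    import numpy as np

    def gp_predict(x_star, X, y, lengthscales, alpha2, sigma_eps2):
        """Posterior mean and variance at a deterministic test input, eqs. (2.27)-(2.29)."""
        def k(A, B):
            d = (A[:, None, :] - B[None, :, :]) / lengthscales
            return alpha2 * np.exp(-0.5 * np.sum(d**2, axis=-1))

        A = k(X, X) + sigma_eps2 * np.eye(len(X))       # K + sigma_eps^2 I
        beta = np.linalg.solve(A, y)                    # beta = (K + sigma_eps^2 I)^{-1} y
        k_star = k(x_star[None, :], X)[0]               # k_h(x_*, X), shape (n,)
        mu = k_star @ beta                              # eq. (2.28)
        # k_h(x_*, x_*) = alpha^2 + sigma_eps^2: prior model uncertainty plus noise, eq. (2.29)
        var = alpha2 + sigma_eps2 - k_star @ np.linalg.solve(A, k_star)
        return mu, var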
Multivariate Predictions: x∗ ∈ RD , y∗ ∈ RE
If y ∈ RE is a multivariate target, we train E independent GP models using the
same training inputs X = [x1 , . . . , xn ], xi ∈ RD , but different training targets
ya = [y1a , . . . , yna ]> , a = 1, . . . , E. Under this model, we assume that the function
values h1 (x), . . . , hE (x) are conditionally independent given an input x. Within
the same dimension, however, the function values are still fully jointly Gaussian.
The graphical model in Figure 2.5 shows the independence structure in the model
across dimensions. Intuitively, the target values of different dimensions can only
“communicate” via x. If x is deterministic, it d-separates the training targets, as de-
tailed in the books by Bishop (2006) and Pearl (1988). Therefore, the target values
covary only if x is uncertain and we integrate it out.
For a known x∗ , the distribution of a predicted function value for a single
target dimension is given by the equations (2.28) and (2.29), respectively. Under
Figure 2.5: Directed graphical model if the latent function h maps into R^E. The function values across dimensions are conditionally independent given the input.
Figure 2.6: GP prediction at an uncertain test input. To determine the expected function value, we
average over both the input distribution (blue, lower-right panel) and the function distribution (GP model,
upper-right panel). The exact predictive distribution (shaded area, left panel) is approximated by a
Gaussian (blue, left panel) that possesses the mean and the covariance of the exact predictive distribution
(moment matching). Therefore, the blue Gaussian distribution q in the left panel is the optimal Gaussian
approximation of the true distribution p since it minimizes the Kullback-Leibler divergence KL(p||q).
the model described by Figure 2.5, the predictive distribution of h(x∗ ) is Gaussian
with mean and covariance
\mu_* = \big[ m_{h_1}(x_*) \ \dots \ m_{h_E}(x_*) \big]^\top ,    (2.30)
\Sigma_* = \mathrm{diag}\big( \sigma_{h_1}^2(x_*) \ \dots \ \sigma_{h_E}^2(x_*) \big) ,    (2.31)
respectively.
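In code, equations (2.30) and (2.31) simply stack E independent univariate predictions; a short sketch reusing the gp_predict function from the previous snippet (again with hypothetical names and per-dimension hyper-parameters stored in a list of dictionaries):

    import numpy as np

    def gp_predict_multi(x_star, X, Y, hypers):
        """Independent per-dimension predictions, eqs. (2.30)-(2.31).
        Y has shape (n, E); hypers[a] holds the hyper-parameters of target dimension a."""
        E = Y.shape[1]
        mu = np.zeros(E)
        var = np.zeros(E)
        for a in range(E):
            mu[a], var[a] = gp_predict(x_star, X, Y[:, a], **hypers[a])
        return mu, np.diag(var)     # diagonal covariance: no cross-covariances for a known input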
In the following, we discuss how to predict with Gaussian processes when the test
input x∗ has a probability distribution. Many derivations in the following are based
on the thesis by Kuss (2006) and the work by Quiñonero-Candela et al. (2003a,b).
Consider the problem of predicting a function value h(x∗ ), h : RD → R, at
an uncertain test input x∗ ∼ N (µ, Σ), where h ∼ GP with an SE covariance func-
tion kh plus a noise covariance function. This situation is illustrated in Figure 2.6.
The input distribution p(x∗ ) is the blue Gaussian in the lower-right panel. The
upper-right panel shows the posterior GP represented by the posterior mean func-
tion (black) and twice the (marginal) standard deviation (shaded). Generally, if a
Gaussian distribution is mapped through the (nonlinear) posterior GP, the exact predictive distribution
p(h(x_*) \,|\, \mu, \Sigma) = \int p(h(x_*) \,|\, x_*) \, p(x_* \,|\, \mu, \Sigma) \, \mathrm{d}x_*    (2.32)
is neither Gaussian nor unimodal, as shown in the left panel of Figure 2.6, where
the shaded area represents the exact distribution over function values. On the
left-hand-side in equation (2.32), we conditioned on µ and Σ to indicate that the
test input x∗ is uncertain. As mentioned before, we omitted the conditioning on the
training data X, y and the posterior hyper-parameters θ̂. By explicitly conditioning
on x∗ in p(h(x∗ )|x∗ ), we emphasize that x∗ is a deterministic argument of h in this
conditional distribution.
Generally, the predictive distribution in equation (2.32) cannot be computed
analytically since a Gaussian distribution mapped through a nonlinear function
(or GP) leads to a non-Gaussian predictive distribution. In the considered case,
however, we approximate the predictive distribution p(h(x∗ )|µ, Σ) by a Gaussian
(blue in left panel of Figure 2.6) that possesses the same mean and variance (mo-
ment matching). To determine the moments of the predictive function value, we
average over both the input distribution and the distribution of the function given
by the GP.
In particular, this is true for kernels that involve polynomials, squared exponentials, and trigonometric functions.
The predictive mean in equation (2.34) depends explicitly on the mean and covariance of the distribution of the input x∗ . The values qi in
equation (2.36) correspond to the standard SE kernel kh (xi , µ), which has been
“inflated” by Σ. For a deterministic input x∗ with Σ ≡ 0, we obtain µ = x∗ and
recover qi = kh (xi , x∗ ). Then, the predictive mean (2.34) equals the predictive mean
for certain inputs given in equation (2.28).
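Although equation (2.36) is not reproduced in this excerpt, the inflated-kernel expression for the SE covariance has the standard closed form q_i = α² |Λ^{-1}Σ + I|^{-1/2} exp(−½ (x_i − µ)ᵀ(Σ + Λ)^{-1}(x_i − µ)), which the following NumPy sketch (our own illustration; β computed as before, names hypothetical) uses to evaluate the predictive mean µ_* = βᵀq from equation (2.34).

    import numpy as np

    def predictive_mean_uncertain(mu, Sigma, X, beta, lengthscales, alpha2):
        """Moment-matched predictive mean mu_* = beta^T q for x_* ~ N(mu, Sigma), eq. (2.34)."""
        D = X.shape[1]
        Lam = np.diag(lengthscales**2)                  # Lambda = diag(l_1^2, ..., l_D^2)
        S_inv = np.linalg.inv(Sigma + Lam)              # (Sigma + Lambda)^{-1}
        diff = X - mu                                   # rows are x_i - mu
        # |Sigma Lambda^{-1} + I| = |Lambda^{-1} Sigma + I|
        scale = alpha2 / np.sqrt(np.linalg.det(Sigma / lengthscales**2 + np.eye(D)))
        # q_i: SE kernel k_h(x_i, mu) "inflated" by the input covariance Sigma
        q = scale * np.exp(-0.5 * np.sum(diff @ S_inv * diff, axis=1))
        return beta @ q                                 # eq. (2.34)

For Σ = 0 the scale reduces to α² and q_i = k_h(x_i, x_*), recovering the deterministic-input mean from equation (2.28), as stated above.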
Using Fubini’s theorem, we obtain the variance σ∗2 of the predictive distribution
p(h(x∗ )|µ, Σ) as
\sigma_*^2 = \mathrm{var}_{x_*, h}[h(x_*) \,|\, \mu, \Sigma] = E_{x_*}\big[\mathrm{var}_h[h(x_*) \,|\, x_*] \,\big|\, \mu, \Sigma\big] + \mathrm{var}_{x_*}\big[E_h[h(x_*) \,|\, x_*] \,\big|\, \mu, \Sigma\big]    (2.37)
= E_{x_*}\big[\sigma_h^2(x_*) \,\big|\, \mu, \Sigma\big] + E_{x_*}\big[m_h(x_*)^2 \,\big|\, \mu, \Sigma\big] - E_{x_*}\big[m_h(x_*) \,\big|\, \mu, \Sigma\big]^2 ,    (2.38)
where we used mh (x∗ ) = Eh [h(x∗ )|x∗ ] and σh2 (x∗ ) = varh [h(x∗ )|x∗ ], respectively. By
taking the expectation with respect to the test input x∗ and by using equation (2.28)
for mh (x∗ ) and equation (2.29) for σh2 (x∗ ) we obtain
\sigma_*^2 = \int k_h(x_*, x_*) - k_h(x_*, X)(K + \sigma_\varepsilon^2 I)^{-1} k_h(X, x_*) \, p(x_*) \, \mathrm{d}x_* + \int \underbrace{k_h(x_*, X)}_{1 \times n} \beta \beta^\top \underbrace{k_h(X, x_*)}_{n \times 1} \, p(x_*) \, \mathrm{d}x_* - (\beta^\top q)^2 ,    (2.39)
where we additionally used E_{x∗}[m_h(x∗)|µ, Σ] = βᵀq from equation (2.34). Plug-
ging in the definition of the SE kernel in equation (2.3) for kh , the desired predicted
variance at an uncertain input x∗ is
\sigma_*^2 = \alpha^2 - \mathrm{tr}\Big((K + \sigma_\varepsilon^2 I)^{-1} \int k_h(X, x_*) k_h(x_*, X) \, p(x_*) \, \mathrm{d}x_*\Big) + \beta^\top \underbrace{\int k_h(X, x_*) k_h(x_*, X) \, p(x_*) \, \mathrm{d}x_*}_{=: \tilde{Q}} \, \beta - (\beta^\top q)^2    (2.40)
= \underbrace{\alpha^2 - \mathrm{tr}\big((K + \sigma_\varepsilon^2 I)^{-1} \tilde{Q}\big)}_{= E_{x_*}[\mathrm{var}_h[h(x_*)|x_*] \,|\, \mu, \Sigma]} + \underbrace{\beta^\top \tilde{Q} \beta - \mu_*^2}_{= \mathrm{var}_{x_*}[E_h[h(x_*)|x_*] \,|\, \mu, \Sigma]} ,    (2.41)
where we re-arranged the inner products to pull the expressions that are indepen-
dent of x∗ out of the integrals. The entries of Q̃ ∈ Rn×n are given by
with z̃_ij := (x_i + x_j)/2. Like the predicted mean in equation (2.34), the predictive
variance depends explicitly on the mean µ and the covariance matrix Σ of the
input distribution p(x∗ ).
In the multivariate case, the predictive mean vector µ∗ of p(h(x∗ )|µ, Σ) is the
collection of all E independently predicted means computed according to equa-
tion (2.34). We obtain the predicted mean
\mu_* \,|\, \mu, \Sigma = \big[ \beta_1^\top q_1 \ \dots \ \beta_E^\top q_E \big]^\top ,    (2.43)
where the vectors qi for all target dimensions i = 1, . . . , E are given by equa-
tion (2.36).
Unlike predicting at deterministic inputs, the target dimensions now covary
(see Figure 2.5), and the corresponding predictive covariance matrix
\Sigma_* \,|\, \mu, \Sigma = \begin{bmatrix} \mathrm{var}_{h,x_*}[h_1^* \,|\, \mu, \Sigma] & \cdots & \mathrm{cov}_{h,x_*}[h_1^*, h_E^* \,|\, \mu, \Sigma] \\ \vdots & \ddots & \vdots \\ \mathrm{cov}_{h,x_*}[h_E^*, h_1^* \,|\, \mu, \Sigma] & \cdots & \mathrm{var}_{h,x_*}[h_E^* \,|\, \mu, \Sigma] \end{bmatrix}    (2.44)
is no longer diagonal. The cross-covariances are given by
\mathrm{cov}_{h,x_*}[h_a^*, h_b^* \,|\, \mu, \Sigma] = E_{h,x_*}[h_a^* h_b^* \,|\, \mu, \Sigma] - (\mu_*)_a (\mu_*)_b .    (2.45)
which leads to
E_{h,x_*}[h_a^* h_b^* \,|\, \mu, \Sigma] \stackrel{(2.46)}{=} \int m_h^a(x_*) \, m_h^b(x_*) \, p(x_*) \, \mathrm{d}x_*    (2.48)
\stackrel{(2.47)}{=} \int \underbrace{k_{h_a}(x_*, X) \beta_a}_{\in \mathbb{R}} \, \underbrace{k_{h_b}(x_*, X) \beta_b}_{\in \mathbb{R}} \, p(x_*) \, \mathrm{d}x_*    (2.49)
= \beta_a^\top \underbrace{\int k_{h_a}(x_*, X)^\top k_{h_b}(x_*, X) \, p(x_*) \, \mathrm{d}x_*}_{=: Q} \, \beta_b ,    (2.50)
where we re-arranged the inner products to pull terms out of the integral that are
independent of the test input x∗ . The entries of Q are given by
Q_{ij} = \alpha_a^2 \alpha_b^2 \, \big| (\Lambda_a^{-1} + \Lambda_b^{-1}) \Sigma + I \big|^{-\frac{1}{2}} \cdots
If a = b, we have to include the term E_{x_*}[\mathrm{cov}_h[h_a(x_*), h_b(x_*) \,|\, x_*] \,|\, \mu, \Sigma] = \alpha_a^2 - \mathrm{tr}\big((K_a + \sigma_\varepsilon^2 I)^{-1} Q\big), which equals zero for a ≠ b due to the assumption that the tar-
get dimensions a and b are conditionally independent given the input as depicted
in the graphical model in Figure 2.5. The function implementing multivariate
predictions at uncertain inputs is given in Appendix E.1.
These results yield the exact mean µ∗ and the exact covariance Σ∗ of the
generally non-Gaussian predictive distribution p(h(x∗ )|µ, Σ), where h ∼ GP and
x∗ ∼ N (µ, Σ). Table 2.1 summarizes how to predict with Gaussian processes.
                  predictive covariance
                  deterministic input      uncertain input
    univariate    eq. (2.29)               eq. (2.41)
    multivariate  eq. (2.31)               eq. (2.44), eq. (2.55)
is desired. The marginal distributions of x∗ and h(x∗ ) are either given or computed
according to Section 2.3.2. The missing piece is the cross-covariance matrix
\Sigma_{x_*, h_*} = E_{x_*, h}[x_* h(x_*)^\top] - E_{x_*}[x_*] \, E_{x_*, h}[h(x_*)]^\top = E_{x_*, h}[x_* h(x_*)^\top] - \mu \, \mu_*^\top ,    (2.57)
where we defined
c_1^{-1} := \alpha_a^{-2} \, (2\pi)^{-\frac{D}{2}} \, |\Lambda_a|^{-\frac{1}{2}} ,    (2.62)
such that k_{h_a}(x_*, x_i) = c_1 \mathcal{N}(x_* \,|\, x_i, \Lambda_a) is a normalized Gaussian probability distribution in the test input x_*, where x_i, i = 1, . . . , n, are the training inputs.
We now recognize that c_1 c_2^{-1} = q_{ai}, see equation (2.36), and with ψ_i = Σ(Σ + Λ_a)^{-1} x_i + Λ_a(Σ + Λ_a)^{-1} µ, we can simplify equation (2.67). We move the term µ(µ_*)_a into the sum, use equation (2.34) for (µ_*)_a, and obtain
\mathrm{cov}_{x_*, h_*}[x_*, h_a(x_*) \,|\, \mu, \Sigma] = \sum_{i=1}^{n} \beta_{a i} q_{a i} \big( \Sigma(\Sigma + \Lambda_a)^{-1} x_i + (\Lambda_a(\Sigma + \Lambda_a)^{-1} - I) \mu \big)    (2.68)
= \sum_{i=1}^{n} \beta_{a i} q_{a i} \big( \Sigma(\Sigma + \Lambda_a)^{-1} (x_i - \mu) + (\Lambda_a(\Sigma + \Lambda_a)^{-1} + \Sigma(\Sigma + \Lambda_a)^{-1} - I) \mu \big)    (2.69)
= \sum_{i=1}^{n} \beta_{a i} q_{a i} \, \Sigma(\Sigma + \Lambda_a)^{-1} (x_i - \mu) ,    (2.70)
which fully determines the joint distribution p(x∗ , h(x∗ )|µ, Σ) in equation (2.56).
An efficient implementation that computes input-output covariances is given in
Appendix E.1.
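Equation (2.70) is straightforward to implement directly; the following minimal sketch (our own illustration, hypothetical names) computes the input-output covariance for a single target dimension a, reusing β and q as obtained in the earlier snippets.

    import numpy as np

    def input_output_cov(mu, Sigma, X, beta_a, q_a, lengthscales_a):
        """cov[x_*, h_a(x_*)] = sum_i beta_ai q_ai Sigma (Sigma + Lambda_a)^{-1} (x_i - mu), eq. (2.70)."""
        Lam_a = np.diag(lengthscales_a**2)
        W = Sigma @ np.linalg.inv(Sigma + Lam_a)      # Sigma (Sigma + Lambda_a)^{-1}
        weights = beta_a * q_a                        # element-wise products beta_ai * q_ai, shape (n,)
        return W @ ((X - mu).T @ weights)             # D-dimensional input-output covariance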
• The matrix Q has to be computed for each entry of the predictive covariance
matrix Σ_* ∈ R^{E×E}, where E is the dimension of the training targets, which
gives a total computational complexity of O(E²n²D).
where the inducing inputs are integrated out. Quiñonero-Candela and Rasmussen
(2005) show that most sparse approximations assume that the training targets h and the
test targets h∗ are conditionally independent given the pseudo targets h̄, that is,
p(h, h∗ |h̄) = p(h|h̄)p(h∗ |h̄). The posterior distribution over the pseudo targets can
be computed analytically and is again Gaussian, as shown by Snelson and Ghahra-
mani (2006), Snelson (2007), and Titsias (2009). Therefore, the pseudo targets
can always be integrated out analytically—at least in the standard GP regression
model.
It remains to determine the pseudo-input locations X̄. Like in the standard
GP regression model, Snelson and Ghahramani (2006), Snelson (2007), and Titsias
(2009) find the pseudo-input locations X̄ by evidence maximization. The GP pre-
dictive distribution from the pseudo-data set is used as a parameterized marginal
likelihood
p(y \,|\, X, \bar{X}) = \int p(y \,|\, X, \bar{X}, \bar{h}) \, p(\bar{h} \,|\, \bar{X}) \, \mathrm{d}\bar{h}    (2.72)
= \int \mathcal{N}\big(y \,|\, K_{nM} K_{MM}^{-1} \bar{h}, \, \Gamma\big) \, \mathcal{N}\big(\bar{h} \,|\, 0, \, K_{MM}\big) \, \mathrm{d}\bar{h}    (2.73)
= \mathcal{N}(y \,|\, 0, \, Q_{nn} + \Gamma) ,    (2.74)
Girard et al. (2002, 2003) use a second-order Taylor series expansion of the mean function and the covariance
function to compute the predictive moments approximately. Quiñonero-Candela
et al. (2003a,b) derive the analytic expressions for the exact predictive moments
and show their superiority over the second-order Taylor series expansion employed
by Girard et al. (2002, 2003).
For an overview of sparse approximations, we refer to the paper by Quiñonero-
Candela and Rasmussen (2005), which gives a unifying view on most of the sparse
approximations presented throughout the last decades, such as those by Silverman
(1985), Wahba et al. (1999), Smola and Bartlett (2001), Csató and Opper (2002),
Seeger et al. (2003), Titsias (2009), Snelson and Ghahramani (2006), Snelson (2007),
Walder et al. (2008), or Lázaro-Gredilla et al. (2010).
Figure 3.1: Simplified mass-spring system described in the book by Khalil (2002). The mass is subjected
to an external force F . The restoring force of the spring is denoted by Fsp .
“idealized” assumptions.
on the nature of the system, expert knowledge might not be available or may be
expensive to obtain.
Sometimes, neither valid idealized assumptions about a dynamic system can
be made, due to the complexity of the dynamics or too many hidden parameters,
nor is sufficient expert knowledge available. Then, (computational) learning
techniques can be a valuable complement to automatic control. Because of this,
learning algorithms have been used more often in automatic control and robotics
during the last decades. In particular, in the context of system identification, learn-
ing has been employed to reduce the dependency on idealized assumptions, see,
for example, the papers by Atkeson et al. (1997a,b), Vijayakumar and Schaal (2000),
Kocijan et al. (2003), Murray-Smith et al. (2003), Grancharova et al. (2008), or Kober
and Peters (2009). A learning algorithm can be considered a method to automati-
cally extract relevant structure from data. The extracted information can be used
to learn a model of the system dynamics. Subsequently, the model can be used for
predictions and decision making by speculating about the long-term consequences
of particular actions.
Computational approaches for artificial learning from collected data, the expe-
rience, are studied in neuroscience, reinforcement learning (RL), approximate dy-
namic programming, and adaptive control, amongst others. Although these fields
have been studied for decades, the rate at which artificial systems learn typically
lags behind that of biological learners with respect to the amount of experience, that is,
the data used for learning, required to learn a task when no expert knowledge is avail-
able. Experience can be gathered by direct interaction with the environment. Inter-
action, however, can be time consuming or wear out mechanical systems. Hence,
a central issue in RL is to speed up artificial learning algorithms by making them
more efficient in terms of required interactions with the system.
Broadly, there are two ways to increase the (interaction) efficiency of RL. One
approach is to exploit expert knowledge to constrain the task in various ways and
to simplify learning. This approach is problem dependent and relies on an intri-
cate understanding of the characteristics of the task and the solution. A second
approach to making RL more efficient is to extract more useful information from
available experience. This approach does not rely on expert knowledge, but re-
quires careful modeling of available data. In a practical application, one would
typically combine these two approaches. In this thesis, however, we are solely
concerned with the second approach:
How can we learn as fast as possible given only very general prior understanding
of a task?
To achieve our objective, we exploit two properties that make biological ex-
perience-based learning so efficient: First, humans can generalize from current ex-
perience to unknown situations. Second, humans explicitly model and incorpo-
rate uncertainty when making decisions, as experimentally shown by Körding and
Wolpert (2004a, 2006).
Unlike for discrete domains (Poupart et al., 2006), generalization and incor-
poration of uncertainty into planning and the decision-making process are not
consistently and fully present in continuous RL, although impressively successful
heuristics exist (Abbeel et al., 2006). In the context of motor control, generalization
typically requires a model or a simulator, that is, an internal representation of the
(system) dynamics. Learning models of the system dynamics is well studied in
the literature, see for example the work by Sutton (1990), Atkeson et al. (1997a), or
Schaal (1997). These learning approaches for parametric or non-parametric system
identification rely on the availability of sufficiently many data to learn an “accu-
rate” system model. As already pointed out by Atkeson and Schaal (1997a) and
Atkeson and Santamaría (1997), an “inaccurate” model used for planning often
leads to a useless policy in the real world.
Now we have the following dilemma: On the one hand we want to speed up
RL (reduce the number of interactions with the physical system to learn tasks)
by using models for internal simulations, on the other hand existing model-based
RL methods still require many interactions with the system to find a sufficiently
accurate dynamics model. In this thesis, we bridge this gap by carefully modeling
and representing uncertainties about the dynamics model, which allows us to deal
with fairly limited experience in a principled way:
With pilco (probabilistic inference and learning for control), we present a general and
fully Bayesian framework for efficient autonomous learning, planning, and deci-
sion making in a control context. Pilco’s success is due to a principled use of
probabilistic dynamics models and embodies our requirements of a faithful repre-
sentation and careful incorporation of available experience into decision making.
Due to its principled treatment of uncertainties, pilco does not rely on task-specific
expert knowledge, but still allows for efficient learning from scratch. To the best
of our knowledge, pilco is the first continuous RL algorithm that consistently
Figure 3.2: Directed graphical model for the problem setup. The state x of the dynamic system follows
Markovian dynamics and can be influenced by applying external controls u. The cost ct := c(xt ) is either
computed or can be observed.
are not known in advance. We additionally assume that the immediate cost func-
tion c( · ) is a design criterion.2 A directed graphical model of the setup being
considered is shown in Figure 3.2.
The objective in RL is to find a policy π ∗ that minimizes the expected long-term
cost
V^\pi(x_0) = E_\tau\Big[\sum_{t=0}^{T} c(x_t)\Big] = \sum_{t=0}^{T} E_{x_t}[c(x_t)]    (3.2)
(a) Interaction phase. An action is applied to the real world. The world changes its state and returns the state to the policy. The policy selects a corresponding action and applies it to the real world again. The model takes the applied actions and the states of the real world and refines itself.
(b) Simulation phase. An action is applied to the model of the world. The model simulates the real world and returns a state of which it thinks the world might be in. The policy determines an action according to the state returned by the model and applies it again. Using this simulated experience, the policy is refined.
Figure 3.3: Two alternating phases in model-based reinforcement learning. We distinguish between the real world, an internal model of the real world, and a policy. Yellow color indicates that the corresponding component is being refined. In the interaction phase, the model of the world is refined. In the simulation phase, this model is used to simulate experience, which in turn is used to improve the policy. The improved policy can be used in the next interaction phase.
In the context of motor control problems, we aim to find a good policy π* that
leads to a low value V^{π*}(x_0) given an initial state distribution p(x_0) using only a
small number of interactions with the system.3 We assume that no task-specific
expert knowledge is available. The setup can be considered an RL problem with
very limited interaction resources.
In the interaction phase, the policy is applied to the real world. Data in the form of (state, action, successor state)-tuples (x_t, u_t, x_{t+1})
are collected to train a world model for f : (x_t, u_t) ↦ x_{t+1}. In the simulation phase,
this model is used to emulate the world and to generate simulated experience. The
policy is optimized using the simulated experience of the model. Figure 3.3 also
emphasizes the dependency of the policy on the model: The policy is refined in
the light of the model of the world, not the world itself (see Figure 3.3(b)). This
model bias of the policy is a major weakness of model-based RL. If the model does
not capture the important characteristics of the dynamic system, the found pol-
icy can be far from optimal in practice. Schaal (1997), Atkeson and Schaal (1997b),
Atkeson and Santamaría (1997), and many others report problems with this type of
“incorrect” models, which makes their use unattractive for learning from scratch.
The model bias and the resulting problems can be sidestepped by using model-
free RL. Model-free algorithms do not learn a model of the system. Instead, they
use experience from interaction to determine an optimal policy directly (Sutton
and Barto, 1998; Bertsekas and Tsitsiklis, 1996). Unlike model-based RL, model-
free RL is statistically inefficient, but computationally congenial since it learns by
bootstrapping.
In the context of biological learning, Daw et al. (2005) found that humans and
animals use model-based learning when only a moderate amount of experience
is available. Körding and Wolpert (2004a, 2006), and Miall and Wolpert (1996)
concluded that this internal forward model is used for planning by averaging over
uncertainties when predicting or making decisions.
To make RL more efficient—that is, to reduce the number of interactions with
the surrounding world—model-free RL cannot be employed due to its statistical
inefficiency. However, in order to use model-based RL, we have to do something
about the model bias. In particular, we need to be careful about representing un-
certainties, just as humans are when only a moderate amount of experience is
available. There are some heuristics for making model-based RL more robust by
expressing uncertainties about the model of the system. For example, Abbeel et al.
(2006) used approximate models for RL. Their algorithm was initialized with a pol-
icy, which was locally optimal for the initial (approximate) model of the dynam-
ics. Moreover, a time-dependent bias term was used to account for discrepancies
between real experience and the model’s predictions. To generate approximate
models, expert knowledge in terms of a parametric dynamics model was used,
where the model parameters were randomly initialized around ground truth or
good-guess values. A different approach to dealing with model inaccuracies is to
add a noise/uncertainty term to the system equation, a common practice in sys-
tems engineering and robust control when the system function cannot be identified
sufficiently well.
Our approach to efficient RL, pilco, alleviates model bias by not focusing on
a single dynamics model, but by using a probabilistic dynamics model, a distribu-
tion over all plausible dynamics models that could have generated the observed
experience. The probabilistic model is used for two purposes:
• The model uncertainty is incorporated into long-term planning and decision making: Pilco averages according to the posterior distribution over plausible dynamics models.
Therefore, pilco implements some key properties of biological learners in a principled way.
Figure 3.4: The learning problem can be divided into three hierarchical problems. At the bottom layer, the transition dynamics f are learned. Based on the transition dynamics, the value function V π can be evaluated using approximate inference techniques. At the top layer, an optimal control problem has to be solved to determine a model-optimal policy π∗.
Remark 1. We use a probabilistic model for the deterministic system in equa-
tion (3.1). The probabilistic model does not imply that we assume a stochastic
system. Instead, it is solely used to describe uncertainty about the model itself.
In the extreme case of a test input x∗ at the exact location xi of a training in-
put, the prediction of the probabilistic model will be absolutely certain about the
corresponding function value p(f (x∗ )) = δ(f (xi )).
In our model-based setup, the policy learning problem can be decomposed into
a hierarchy of three sub-problems as described in Figure 3.4. At the bottom level,
a probabilistic model of the transition function f is learned (Section 3.4). Given the
model of the transition dynamics and a policy π, the expected long-term cost in
equation (3.2) is evaluated. This policy evaluation requires the computation of the
predictive state distributions p(xt ) for t = 1, . . . , T (intermediate layer in Figure 3.4
and Section 3.5). At the top layer (Section 3.6), the policy parameters ψ are learned
based on the result of the policy evaluation, which is called an indirect policy search.
The search is typically non-convex and requires iterative optimization techniques.
Therefore, the policy evaluation and policy improvement steps alternate until the
policy search converges to a local optimum. For given transition dynamics, the
two top layers in Figure 3.4 correspond to an optimal control problem.
Algorithm 1 Pilco
1: set policy to random                                   ▷ policy initialization
2: loop
3:   execute policy                                       ▷ interaction
4:   record collected experience
5:   learn probabilistic dynamics model                   ▷ bottom layer
6:   loop                                                 ▷ policy search
7:     simulate system with policy π                      ▷ intermediate layer
8:     compute expected long-term cost V π, eq. (3.2)     ▷ policy evaluation
9:     improve policy                                     ▷ top layer
10:  end loop
11: end loop
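The following Python sketch mirrors the structure of Algorithm 1. It is an illustration only: the callables rollout, learn_model, and policy_value_and_grad are hypothetical placeholders for the interaction phase, the dynamics-model learning, and the approximate-inference policy evaluation, and the plain gradient step stands in for the conjugate-gradient optimizer discussed in Section 3.6.

import numpy as np

def pilco_loop(rollout, learn_model, policy_value_and_grad, init_params,
               n_trials=10, n_policy_steps=50, lr=1e-2):
    """Skeleton of Algorithm 1; the callables encapsulate the three layers."""
    params = np.asarray(init_params, dtype=float)   # line 1: (random) policy initialization
    data = []
    for _ in range(n_trials):
        data.extend(rollout(params))                # lines 3-4: interaction, record experience
        model = learn_model(data)                   # line 5: probabilistic dynamics model
        for _ in range(n_policy_steps):             # lines 6-10: policy search
            value, grad = policy_value_and_grad(model, params)  # simulate and evaluate V^pi
            params = params - lr * grad             # line 9: gradient-based policy improvement
    return params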
Figure 3.5: Three necessary components in an RL framework. To learn a good policy π ∗ , it is necessary
to determine the dynamics model, to specify a cost function, and then to apply an RL algorithm. The
interplay of these three components can be crucial for success.
Then, pilco applies the model-optimized policy to the system (line 3) to gather
novel experience (line 4). These two stages of learning correspond to the interac-
tion phase and the simulation phase, respectively, see Figure 3.3. The subsequent
model update (line 5) accounts for possible discrepancies between the predicted
and the actually encountered state trajectory.
With increasing experience, the probabilistic model describes the dynamics
with high certainty in regions of the state space that have been explored well and
the GP model converges to the underlying system function f . If pilco finds a so-
lution to the learning problem, the well-explored regions of the state space contain
trajectories with low expected long-term cost. Note that the dynamics model is
updated after each trial and not online. Therefore, Algorithm 1 describes batch
learning.
Generally, three components are crucial to determining a good policy π ∗ us-
ing model-based RL: a dynamics model, a cost function, and the RL algorithm
itself. Figure 3.5 illustrates the relationship between these three components for
successfully learning π ∗ . A bad interplay of these three components can lead to
a failure of the learning algorithm. In pilco, the dynamics model is probabilistic
and implemented by a Gaussian process (Section 3.4). Pilco is an indirect policy
search algorithm using approximate inference for policy evaluation (Section 3.5)
and gradient-based policy learning (Section 3.6). The cost function we use in our
RL framework is a saturating function (Section 3.7).
3.4 Bottom Layer: Learning the Transition Dynamics
Figure 3.6: Gaussian process posterior as a distribution over transition functions. The x-axis represents
state-action pairs (xi , ui ), the y-axis represents the successor states f (xi , ui ). The shaded area represents
the (marginal) model uncertainty (95% confidence intervals). The black crosses are observed successor
states for given state-action pairs. The colored functions are transition function samples drawn from the
posterior GP distribution.
As we move away from the data, the model uncertainty increases, which is illustrated
by the bumps of the shading between the crosses (see for example a state-action
pair around −3 on the x-axis). The increase in uncertainty is reasonable since the
model cannot be certain about the function values for a test input (x∗ , u∗ ) that is
not close to the training set (xi , ui ). Since the GP is non-degenerate5 , far away from
the training set, the model uncertainty falls back to the prior uncertainty, which
can be seen at the left end or right end of the figure. The GP model captures all
transition functions that plausibly could have generated the observed (training)
values represented by the black crosses. Examples of such plausible functions are
given by the three colored functions in Figure 3.6. With increasing experience, the
probabilistic GP model becomes more confident about its own accuracy and even-
tually converges to the true function (if the true function is in the class of smooth
functions); the GP is a consistent estimator.
Let us provide a few more details about training the GP dynamics models:
For a D-dimensional state space, we use D separate GPs, one for each state di-
mension. The GP dynamics models take as input a representation of state-action
pairs (xi , ui ), i = 1, . . . , n. The corresponding training targets for the dth target
dimension are
∆xid := fd (xi , ui ) − xid , d = 1, . . . , D , (3.3)
where fd maps the input to the dth dimension of the successor state. The GP tar-
gets ∆xid are the differences between the dth dimension of a state xi and the dth
dimension of the successor state f (xi , ui ) of an input (xi , ui ). As opposed to learn-
ing the function values directly, learning the differences can be advantageous since
they vary less than the original function. Learning differences ∆xid approximately
corresponds to learning the gradient of the function. The mean and the variance of
the Gaussian successor state distribution p(fd (x∗ , u∗ )) for a deterministically given
state-action pair (x∗ , u∗ ) are given by
Ef [fd (x∗ , u∗ )|x∗ , u∗ ] = x∗d + Ef [∆x∗d |x∗ , u∗ ] , (3.4)
varf [fd (x∗ , u∗ )|x∗ , u∗ ] = varf [∆x∗d |x∗ , u∗ ] , (3.5)
respectively, d = 1, . . . , D. The predictive mean and variances from the GP model
are computed according to equations (2.28) and (2.29), respectively. Note that
the predictive distribution defined by the mean and variance in equation (3.4)
and (3.5), respectively, reflects the uncertainty about the underlying function. The
full predictive state distribution p(f (x∗ , u∗ )|x∗ , u∗ ) is then given by the Gaussian
N\left( \begin{bmatrix} E_f[f_1(x_*, u_*) \mid x_*, u_*] \\ \vdots \\ E_f[f_D(x_*, u_*) \mid x_*, u_*] \end{bmatrix},\;
\begin{bmatrix} \mathrm{var}_f[f_1(x_*, u_*) \mid x_*, u_*] & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \mathrm{var}_f[f_D(x_*, u_*) \mid x_*, u_*] \end{bmatrix} \right)   (3.6)
5 With “degeneracy” we mean that the uncertainty declines to zero when going away from the training set. The non-
degeneracy is due to the fact that in this thesis the GP is an infinite model, where the SE covariance function has an infinite
number of non-zero eigenfunctions. Any finite model with dimension N gives rise to a degenerate covariance function with
≤ N non-zero eigenfunctions.
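As a concrete illustration of the training setup in equations (3.3)–(3.5), the following sketch implements a single GP with an SE covariance function on toy data: the training inputs are state-action pairs, the targets are state differences, and the predictive mean of the successor state is obtained by adding the current state dimension back. The data, the hyper-parameters, and the one-dimensional target are illustrative assumptions, not the settings used in the experiments.

import numpy as np

def se_kernel(A, B, ell, sf2):
    d = (A[:, None, :] - B[None, :, :]) / ell               # pairwise scaled differences
    return sf2 * np.exp(-0.5 * np.sum(d**2, axis=-1))

def fit_gp(X, y, ell, sf2, sn2):
    K = se_kernel(X, X, ell, sf2) + sn2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return X, L, alpha

def predict_delta(gp, Xs, ell, sf2):
    X, L, alpha = gp
    Ks = se_kernel(Xs, X, ell, sf2)
    mean = Ks @ alpha                                        # predictive mean of the difference
    v = np.linalg.solve(L, Ks.T)
    var = sf2 - np.sum(v**2, axis=0)                         # predictive variance (model uncertainty)
    return mean, var

rng = np.random.default_rng(1)
XU = rng.uniform(-1.0, 1.0, size=(50, 3))                    # 50 state-action pairs, D + F = 3
delta = np.sin(XU @ np.array([1.0, 0.5, -0.3]))              # toy targets: differences, eq. (3.3)
gp = fit_gp(XU, delta, ell=1.0, sf2=1.0, sn2=1e-4)

xu_star = np.array([[0.2, -0.1, 0.4]])                       # deterministic test state-action pair
m_delta, v_delta = predict_delta(gp, xu_star, ell=1.0, sf2=1.0)
mean_successor = xu_star[0, 0] + m_delta                     # eq. (3.4): add the d-th state dimension
var_successor = v_delta                                      # eq. (3.5)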
3.5 Intermediate Layer: Long-Term Predictions
Thus far, we know how to predict with the dynamics GP when the test input is
deterministic, see equation (3.6). To evaluate the expected long-term cost V π in
equation (3.2), the predictive state distributions p(x1 ), . . . , p(xT ) are required. In
the following, we give a rough overview of how pilco computes the predictive
state distributions p(xt ), t = 1, . . . , T .
Generally, by propagating model uncertainties forward, pilco cascades one-
step predictions to obtain the long-term predictions p(x1 ), . . . , p(xT ) at the inter-
mediate layer in Figure 3.4. Pilco needs to propagate uncertainties since even for
a deterministically given input pair (xt−1, ut−1), the GP dynamics model returns
a Gaussian predictive distribution p(xt | xt−1, ut−1) to account for the model uncertainty,
see equation (3.6). Thus, when pilco simulates the system dynamics T steps forward
in time, the state at each time step t > 0 is described by a probability distribution.
6 The GP can naturally treat system noise and/or measurement noise. The modifications are straightforward, but we do
not go into further details as they are not required at this point.
Figure 3.7: Moment-matching approximation when propagating uncertainties through the dynamics
GP. The blue Gaussian in the lower-right panel represents the (approximate) joint Gaussian distribution
p(xt−1 , ut−1 ). The posterior GP model is shown in the upper-right panel. The true predictive distribution
p(xt ) when squashing p(xt−1 , ut−1 ) through the dynamics GP is represented by the shaded area in the
left panel. Pilco computes the blue Gaussian approximation of the true predictive distribution (left
panel) using exact moment matching.
Figure 3.8: Cascading predictions during planning without and with a policy.
where the GP model for the one-step transition dynamics f yields the transi-
tion probability p(xt |xt−1 , ut−1 ), see equation (3.6). For a Gaussian distribution
p(xt−1 , ut−1 ), pilco adopts the results from Section 2.3.2 and approximates the typ-
ically non-Gaussian distributions p(xt ) by a Gaussian with the exact mean and the
exact covariance matrix (moment matching). This is pictorially shown in Figure 3.7.
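The moment-matching step can be illustrated with a simple Monte-Carlo computation of the same two moments that pilco obtains analytically: for a Gaussian input, the exact predictive mean is E[m(x)] and the exact variance is E[v(x)] + var[m(x)] by the law of total variance, where m and v denote the predictive mean and variance of the one-step model. The one-step model below is a toy stand-in for the learned GP.

import numpy as np

def one_step_model(x):
    """Toy GP-like predictor: returns predictive mean m(x) and variance v(x)."""
    return np.sin(x), 0.01 + 0.05 * x**2

def moment_match(mu, var, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, np.sqrt(var), size=n_samples)   # samples from the Gaussian input p(x_{t-1})
    m, v = one_step_model(x)
    return m.mean(), v.mean() + m.var()                 # exact moments, estimated by Monte Carlo

mu_t, var_t = 0.5, 0.3
for t in range(5):                                      # cascade one-step predictions
    mu_t, var_t = moment_match(mu_t, var_t)
    print(t + 1, mu_t, var_t)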
Figure 3.9: “Micro-states” being mapped through the preliminary policy π̃ to a set of “micro-actions”.
The deterministic preliminary policy π̃ maps any micro-state in the state distribution (blue ellipse) to
possibly different micro-actions resulting in a distribution over actions (red ellipse).
(a) Preliminary policy π̃ as a function of the state.
(b) Policy π = sin(π̃(x)) as a function of the state.
Figure 3.10: Constraining the control signal. Panel (a) shows an example of an unconstrained preliminary
policy π̃ as a function of the state x. Panel (b) shows the constrained policy π = sin(π̃) as a function of
the state x.
sine function has the nice property that it actually attains its extreme values ±1 for
finite values of π̃(x), namely π̃(x) = π/2 + kπ, k ∈ Z. Therefore, it is sufficient for
the preliminary policy π̃ to describe a function with function values in the range
of ±π. By contrast, if we mapped π̃(x) through a sigmoid function that attains ±1
only in the limit π̃(x) → ±∞, the function values of π̃ would need to be extreme in order
to apply the control signals ±umax, which can lead to numerical instabilities. Another
advantageous property of the sine is that it allows for an analytic computation of
the mean and the covariance of p(π̃(x)) if π̃(x) is Gaussian distributed. Details are
given in Appendix A.1. We note that the cumulative Gaussian also allows for the
computation of the mean and the covariance of the predictive distribution for a
Gaussian distributed input π̃(x). For details, we refer to the book by Rasmussen
and Williams (2006, Chapter 3.9).
In order to work with constrained control signals during predictions, we re-
quire a preliminary policy π̃ that allows for the computation of a distribution over
actions p(π̃(xt )), where xt is a Gaussian distributed state vector. We compute ex-
actly the mean and the variance of p(π̃(x)) and approximate p(π̃(x)) by a Gaussian
with these moments (exact moment matching). This distribution is subsequently
squashed through the sine function to yield the desired distribution over actions.
To summarize, we follow the scheme
p(xt ) 7→ p(π̃(xt )) 7→ p(umax sin(π̃(xt ))) = p(ut ) , (3.9)
where we map the Gaussian state distribution p(xt ) through the preliminary policy
π̃. Then, we squash the Gaussian distribution p(π̃(x)) through the sine according
to equation (3.8), which allows for an analytical computation of the mean and the
covariance of the distribution over actions
p(π(x)) = p(u) = p(umax sin(π̃(x))) (3.10)
using the results from Appendix A.1.
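The scalar version of the squashing step can be written down directly: for π̃(x) = z with z ∼ N(m, s²), the mean and variance of umax sin(z) follow from standard Gaussian identities (the thesis derives the multivariate case in Appendix A.1). The short check below compares the closed form with a Monte-Carlo estimate; the numerical values are arbitrary illustration choices.

import numpy as np

def squashed_moments(m, s2, u_max=1.0):
    """Mean and variance of u_max*sin(z) for z ~ N(m, s2)."""
    mean = u_max * np.exp(-s2 / 2) * np.sin(m)
    e_sin2 = 0.5 * (1 - np.exp(-2 * s2) * np.cos(2 * m))    # E[sin(z)^2]
    var = u_max**2 * e_sin2 - mean**2
    return mean, var

rng = np.random.default_rng(0)
m, s2, u_max = 0.7, 0.4, 2.0
z = rng.normal(m, np.sqrt(s2), size=1_000_000)
u = u_max * np.sin(z)
print(squashed_moments(m, s2, u_max))    # closed-form moments
print(u.mean(), u.var())                 # Monte-Carlo estimate, should agree closely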
Linear Model
The linear preliminary policy is given by
π̃(x∗ ) = Ψx∗ + ν , (3.11)
where Ψ is a parameter matrix of weights and ν is an offset/bias vector. In each
control dimension d, the policy (3.11) is a linear combination of the states (the
weights are given by the dth row in Ψ) plus an offset νd .
Predictive Distribution. The predictive distribution p(π̃(x∗)|µ, Σ) for a state distribution
x∗ ∼ N(µ, Σ) is an exact Gaussian with mean Ψµ + ν and covariance ΨΣΨ⊤, since a linear
transformation of a Gaussian random variable remains Gaussian.
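A two-line numerical illustration of this exact propagation (the parameter values are arbitrary):

import numpy as np

Psi = np.array([[0.5, -1.0, 0.2]])       # 1 x D weight matrix of the linear policy (3.11)
nu = np.array([0.1])                     # offset
mu = np.array([0.0, 0.3, -0.2])          # mean of the state distribution
Sigma = np.diag([0.04, 0.09, 0.01])      # covariance of the state distribution

mean_u = Psi @ mu + nu                   # mean of p(pi_tilde(x))
cov_u = Psi @ Sigma @ Psi.T              # covariance of p(pi_tilde(x))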
Predictive Distribution. The RBF network in equation (3.14) allows for a closed-
form computation of a predictive distribution p(π̃(x∗ )):
• The predictive mean of p(π̃(x∗)) for a known state x∗ is equivalent to the RBF
policy in equation (3.14), which itself is identical to the predictive mean of a
GP in equation (2.28). In contrast to the GP model, both the predictive variance
and the uncertainty about the underlying function in an RBF network are zero.
Thus, the predictive distribution p(π̃(x∗ )) for a given state x∗ has zero variance.
• For a Gaussian distributed state x∗ ∼ N (µ, Σ) the predictive mean and the
predictive covariance can be computed similarly to Section 2.3.2 when we
consider the RBF network a “deterministic GP” with the restriction that the
variance varπ̃ [π̃(x∗ )] = 0 for all x∗ . In particular, the predictive mean is given
by
E_{x_*,\tilde{\pi}}[\tilde{\pi}(x_*) \mid \mu, \Sigma] = E_{x_*}\big[\underbrace{E_{\tilde{\pi}}[\tilde{\pi}(x_*) \mid x_*]}_{=m_{\tilde{\pi}}(x_*) = \tilde{\pi}(x_*)} \,\big|\, \mu, \Sigma\big] = \beta_\pi^\top q ,   (3.15)
where the matrix Q is defined in equation (2.53). Note that the predictive
variance in equation (3.16) equals the cross-covariance entries in the covariance
matrix for a multivariate GP prediction, equation (2.55).
Figure 3.11: Computational steps required to determine p(xt ) from p(xt−1 ) and a policy π(xt−1 ) = umax sin(π̃(xt−1 )).
Figure 3.11 recaps and summarizes the computational steps required to compute
the distribution p(xt ) of the successor state from p(xt−1 ):
1. The computation of a distribution over actions p(ut−1 ) from the state distribution p(xt−1 ) requires two steps: First, the mean and the covariance of p(π̃(xt−1 )) are computed exactly and p(π̃(xt−1 )) is approximated by a Gaussian with these moments. Second, this Gaussian is squashed through the sine according to equation (3.8), which yields the distribution over actions p(ut−1 ), see equation (3.9).
Due to the choice of the SE covariance function for the GP dynamics model, the
linear or RBF representations of the preliminary policy π̃, and the sine function
for squashing the preliminary policy (see equation (3.8)), all computations can be
performed analytically.
Once the distributions p(xt ) have been computed according to the steps detailed in Figure 3.11, it remains to compute the expected immediate cost
E_{x_t}[c(x_t)] = \int c(x_t)\, \underbrace{p(x_t)}_{\text{Gaussian}}\, \mathrm{d}x_t ,   (3.21)
which corresponds to convolving the cost function c with the approximate Gaus-
sian state distribution p(xt ). Depending on the representation of the immediate
cost function c, this integral can be solved analytically. If the cost function c is
unknown (not discussed in this thesis) and the values c(x) are only observed, a
GP can be employed to represent the cost function c(x). This immediate-cost GP
would also allow for the analytic computation of Ex,c [c(x)], where we additionally
would have to average according to the immediate-cost GP.
In our case, the policy class Π defines a constrained policy space and is given either
by the class of linear functions or by the class of functions that are represented by
an RBF with N Gaussian basis functions after squashing them through the sine
function, see equation (3.8). Restricting the policy search to the class Π generally
leads to suboptimal policies. However, depending on the expressiveness of Π, the
policy found can incur an expected long-term cost V π similar to that of a globally optimal
policy. In the following, we do not distinguish between a globally optimal policy
π ∗ and π ∗ ∈ Π.
To learn the policy, we employ the deterministic conjugate gradients minimizer
described by Rasmussen (1996), which requires the gradient of the value function
V πψ with respect to the policy parameters ψ.8 Other algorithms such as the (L-
)BFGS method by Lu et al. (1994) can be employed as well. Since approximate
inference for policy evaluation can be performed in closed form (Section 3.5), the
required derivatives can be computed analytically by repeated application of the
chain rule.
Linear Policy
The linear policy model in equation (3.11) possesses D + 1 parameters per control dimension: For policy
dimension d there are D weights in the dth row of the matrix Ψ. One additional
parameter originates from the offset parameter νd .
RBF Policy
In short, the parameters of the nonlinear RBF policy are the locations Xπ of the
centers, the training targets yπ , the hyper-parameters of the Gaussian basis func-
tions (D length-scale parameters and the signal variance), and the variance of the
measurement noise. The RBF policy is represented as the predictive mean function of a GP, see equation (3.14).
9 For notational convenience, with a linear/RBF policy we mean the linear/RBF preliminary policy π̃ squashed through the sine function in equation (3.8).
(a) The training targets are treated as parameters.
(b) The training targets and the corresponding support points define the parameter set to be optimized.
Figure 3.12: Parameters of a function approximator for the preliminary policy π̃ ∗ . The x-axes show
the support points in the state space, the y-axes show the corresponding function values of an optimal
preliminary policy π̃ ∗ . For given support points, the corresponding function values of an optimal policy
are uncertain as illustrated in Panel (a). Panel (b) shows the situation where both the support points and
the corresponding function values are treated as parameters and jointly optimized.
Figure 3.13: Preliminary policy π̃ (blue) implemented by an RBF network using a pseudo-training set.
The values xi and yi are the pseudo-inputs and pseudo-targets, respectively. The blue function does not
pass exactly through the pseudo-training targets since they are noisy.
Example 3 (Number of parameters of the RBF policy). In the most general case,
where the entire pseudo-training set and the hyper-parameters are considered pa-
rameters to be optimized, the RBF policy in equation (3.14) for a scalar control
law contains N D parameters for the pseudo-inputs, N parameters for the pseudo-
targets, and D + 2 hyper-parameters. Here, N is the number of basis functions
of the RBF network, and D is the dimensionality of the pseudo-inputs. Generally,
if the policy implements an F -dimensional control signal, we need to optimize
N (D + F ) + (D + 2)F parameters. As an example, for N = 100, D = 10, and F = 2,
this leads to a 1,224-dimensional optimization problem.
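For reference, the parameter count in Example 3 can be verified directly:

N, D, F = 100, 10, 2
n_params = N * (D + F) + (D + 2) * F     # pseudo-inputs and pseudo-targets + hyper-parameters
print(n_params)                          # 1224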
10 The concept of a pseudo-training set is related to the sparse GP approximation by Snelson and Ghahramani (2006) and the Gaussian process latent variable model by Lawrence (2005).
11 The variance of the function is related to the amplitude of π̃.
The gradient of the expected long-term cost V π along a predicted trajectory τ with
respect to the policy parameters is given by
\frac{\mathrm{d}V^{\pi_\psi}(x_0)}{\mathrm{d}\psi} = \sum_{t=0}^{T} \frac{\mathrm{d}}{\mathrm{d}\psi}\, E_{x_t}[c(x_t) \mid \pi_\psi] ,   (3.26)
where the subscript ψ emphasizes that the policy π depends on the parameter set
ψ. Moreover, we conditioned explicitly on πψ in the expected value to emphasize
the dependence of the expected cost E_{x_t}[c(x_t)] on the policy parameters. The total
derivative with respect to the policy parameters is denoted by \mathrm{d}/\mathrm{d}\psi. The expected
immediate cost Ext [c(xt )] requires averaging with respect to the state distribution
p(xt ). Note, however, that the moments µt and Σt of p(xt ) ≈ N (xt | µt , Σt ), which
are essentially given by the equations (2.34), (2.53), and (2.55), are functionally
dependent on both the policy parameter vector ψ and the moments µt−1 and Σt−1
of the state distribution p(xt−1 ) at time t − 1. The total derivative in equation (3.26)
is therefore given by
\frac{\mathrm{d}}{\mathrm{d}\psi} E_{x_t}[c(x_t)] = \frac{\partial E_{x_t}[c(x_t)]}{\partial \mu_t} \frac{\mathrm{d}\mu_t}{\mathrm{d}\psi} + \frac{\partial E_{x_t}[c(x_t)]}{\partial \Sigma_t} \frac{\mathrm{d}\Sigma_t}{\mathrm{d}\psi}   (3.27)
and the covariance Σt−1 of the input distribution p(xt−1 ) and the derivatives of
the predicted mean µt and covariance Σt with respect to the policy parameters
ψ := {θ π , Xπ , yπ }. For the RBF policy in equation (3.14) the policy parameters ψ
are the policy hyper-parameters θ π (the length-scales and the noise variance), the
pseudo-training inputs Xπ , and the pseudo-training targets yπ .
The function values from which the derivatives in equations (3.37)–(3.38) can
be obtained are summarized in Table 2.1, equation (2.56), Appendix A.1, and equations (3.15) and (3.16). Additionally, we require the derivatives ∂E_{x_t}[c(x_t)]/∂µ_t and ∂E_{x_t}[c(x_t)]/∂Σ_t, see equation (3.28), which depend on the realization of the immediate cost func-
tion. We will explicitly provide these derivatives for a saturating cost function in
Section 3.7.1 and a quadratic cost function in Section 3.7.2, respectively.
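The chain-rule bookkeeping behind equations (3.26) and (3.27) can be sketched for a scalar state mean (the full scheme additionally tracks the covariance and all cross-derivatives). The callables step_mean and cost_mean are hypothetical stand-ins for the moment-matching prediction and the expected immediate cost, each returning its value together with the required partial derivatives.

import numpy as np

def value_and_grad(psi, mu0, step_mean, cost_mean, T):
    """Accumulate V^pi and dV^pi/dpsi along a predicted trajectory (scalar mean only)."""
    mu = mu0
    dmu_dpsi = np.zeros_like(psi)                    # d mu_0 / d psi = 0
    c, dc_dmu = cost_mean(mu)
    V, dV_dpsi = c, np.zeros_like(psi)
    for _ in range(T):
        m, dm_dmu, dm_dpsi = step_mean(mu, psi)      # one-step prediction and its partials
        dmu_dpsi = dm_dpsi + dm_dmu * dmu_dpsi       # total derivative, analogue of eq. (3.27)
        mu = m
        c, dc_dmu = cost_mean(mu)
        V += c
        dV_dpsi += dc_dmu * dmu_dpsi                 # accumulate the sum in eq. (3.26)
    return V, dV_dpsi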
Example 4 (Derivative of the predictive mean with respect to the input distribu-
tion). We present two partial derivatives from equation (3.29) that are independent
of the policy model. These derivatives can be derived from equation (2.34) and
equation (2.36).
The derivative of the a-th dimension of the predictive mean µt ∈ R^D with
respect to the mean µt−1 ∈ R^{D+F} of the input distribution13 is given by
\frac{\partial \mu_t^{(a)}}{\partial \mu_{t-1}} = \beta_a^\top \frac{\partial q_a}{\partial \mu_{t-1}} = \underbrace{(\beta_a \odot q_a)^\top}_{1 \times n}\, \underbrace{T}_{n \times (D+F)} \;\in\; \mathbb{R}^{1 \times (D+F)} ,   (3.39)
where n is the size of the training set for the dynamics model, ⊙ denotes an element-wise
matrix product, q_a is defined in equation (2.36), and
Here, X ∈ Rn×(D+F ) are the training inputs of the dynamics GP and 11×n is a 1 × n
matrix of ones. The operator ⊗ is a Kronecker product.
The derivative of the ath dimension of the predictive mean µt with respect to
the input covariance matrix Σt−1 is given by
\frac{\partial \mu_t^{(a)}}{\partial \Sigma_{t-1}} = \tfrac{1}{2}\, T^\top\, \underbrace{\big[(\beta_a \odot q_a) \otimes 1_{1 \times (D+F)}\big] \odot T}_{n \times (D+F)} \;-\; \mu_t^{(a)} R^{-1} \;\in\; \mathbb{R}^{(D+F) \times (D+F)} .   (3.42)
Figure 3.14: Quadratic (red, dashed) and saturating (blue, solid) cost functions. The x-axis shows the dis-
tance of the state to the target, the y-axis shows the corresponding immediate cost. Unlike the quadratic
cost function, the saturating cost function can encode that a state is simply “far away” from the target.
The quadratic cost function pays much attention to how “far away” the state really is.
the target state. Using distance penalties only should be sufficient: An autonomous
learner must be able to figure out that reaching a target xtarget with high speed leads
to “overshooting” and, therefore, to high costs in the long term.
The saturating immediate cost can be written as
c(x) = 1 - \exp\!\big(-\tfrac{1}{2}(x - x_{\text{target}})^\top T^{-1} (x - x_{\text{target}})\big) ,   (3.44)
where T^{−1} = a^{−2} C^⊤ C for suitable C is the precision matrix of the unnormalized
Gaussian in equation (3.44).14 If x is an input vector that has the same representation
as the target vector, T^{−1} is a diagonal matrix with entries either unity or zero.
Hence, for x ∼ N (µ, Σ) we obtain the expected immediate cost
E_x[c(x)] = 1 - |I + \Sigma T^{-1}|^{-1/2} \exp\!\big(-\tfrac{1}{2}(\mu - x_{\text{target}})^\top \tilde{S}_1 (\mu - x_{\text{target}})\big) ,   (3.45)
\tilde{S}_1 := T^{-1}(I + \Sigma T^{-1})^{-1} .   (3.46)
14 The covariance matrix does not necessarily exist and is not required to compute the expected cost. In particular, T−1 may not be invertible.
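Equations (3.45) and (3.46) can be checked numerically against a Monte-Carlo estimate; the dimensions and parameter values below are arbitrary illustration choices.

import numpy as np

def expected_saturating_cost(mu, Sigma, x_target, T_inv):
    I = np.eye(len(mu))
    S1 = T_inv @ np.linalg.inv(I + Sigma @ T_inv)          # eq. (3.46)
    diff = mu - x_target
    return 1 - np.linalg.det(I + Sigma @ T_inv) ** -0.5 * np.exp(-0.5 * diff @ S1 @ diff)

rng = np.random.default_rng(0)
mu = np.array([0.4, -0.2])
Sigma = np.array([[0.3, 0.1], [0.1, 0.2]])
x_target = np.zeros(2)
T_inv = np.diag([4.0, 4.0])                                # precision of the cost, cf. eq. (3.44)

x = rng.multivariate_normal(mu, Sigma, size=500_000)
d = x - x_target
mc = np.mean(1 - np.exp(-0.5 * np.einsum('ni,ij,nj->n', d, T_inv, d)))
print(expected_saturating_cost(mu, Sigma, x_target, T_inv), mc)   # should agree closely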
(a) Initially, when the mean of the state is far away from the target, uncertain states (red, dashed-dotted) are preferred to more certain states with a more peaked distribution (black, dashed). This leads to initial exploration.
(b) Finally, when the mean of the state is close to the target, certain states with peaked distributions cause less expected cost and are therefore preferred to more uncertain states (red, dashed-dotted). This leads to exploitation once close to the target.
Figure 3.15: Automatic exploration and exploitation due to the saturating cost function (blue, solid). The
x-axes describe the state space. The target state is the origin.
During learning (see Algorithm 1), the saturating cost function in equation (3.43)
allows for “natural” exploration when the policy aims to minimize the expected
long-term cost in equation (3.2). This property is illustrated in Figure 3.15 for a sin-
gle time step where we assume a Gaussian state distribution p(xt ). If the mean of
a state distribution p(xt ) is far away from the target xtarget , a wide state distribution
is more likely to have substantial tails in some low-cost region than a fairly peaked
distribution (Figure 3.15(a)). In the early stages of learning, the state uncertainty
is essentially due to model uncertainty. If we encounter a state distribution in a
high-cost region during internal simulation (Figure 3.3(b) and intermediate layer
in Figure 3.4), the saturating cost then leads to automatic exploration by favoring
uncertain states, that is, regions that are expected to be close to the target but where
the dynamics model is still poor. When visiting these regions in the interaction phase (Figure 3.3(a)), the
subsequent model update (line 5 in Algorithm 1) reduces the model uncertainty.
In the subsequent policy evaluation, we would then end up with a tighter state
distribution in the situation described either in Figure 3.15(a) or in Figure 3.15(b).
If the mean of the state distribution is close to the target as in Figure 3.15(b),
wide distributions are likely to have substantial tails in high-cost regions. By con-
trast, the mass of a peaked distribution is more concentrated in low-cost regions.
In this case, the policy prefers peaked distributions close to the target, which leads
to exploitation.
Hence, even for a policy aiming at the expected cost only, the combination of
a probabilistic dynamics model and a saturating cost function leads to exploration
as long as the states are far away from the target. Once close to the target, the
policy does not substantially veer from a trajectory that led the system to certain
states close to the target.
One way to encourage further exploration is to modify the objective function
in equation (3.2). Incorporation of the state uncertainty itself is an option, but
where Ex [c(x)] is given in equation (3.45). The second moment Ex [c(x)2 ] can be
computed analytically and is given by
The variance of the cost (3.47) is then given by subtracting the square of equa-
tion (3.45) from equation (3.48).15
To encourage goal-directed exploration, we minimize the objective function
V^{\pi}(x_0) = \sum_{t=0}^{T} \big( E_{x_t}[c(x_t)] + b\, \sigma_{x_t}[c(x_t)] \big) .   (3.50)
Here, σxt is the standard deviation of the predicted cost. For b < 0 uncertainty in
the cost is encouraged, for b > 0 uncertainty in the cost is penalized. Note that the
modified value function in equation (3.50) is just an approximation to
E_\tau\Big[\sum_{t=0}^{T} c(x_t)\Big] + b\, \sigma_\tau\Big[\sum_{t=0}^{T} c(x_t)\Big] ,   (3.51)
in which the standard deviation of the predicted long-term cost along the predicted
trajectory τ = (x0 , . . . , xT ) is considered.
What is the difference between taking the variance of the state and the variance
of the cost? The variance of the predicted cost at time t depends on the variance of
the state: If the state distribution is fairly peaked, the variance of the corresponding
cost is always small. However, an uncertain state does not necessarily cause a
wide cost distribution: If the mean of the state distribution is in a high-cost region
and the tails of the distribution do not substantially cover low-cost regions, the
uncertainty of the predicted cost is very low. The only case in which the cost distribution
can be uncertain is when a) the state is uncertain and b) a non-negligible part of the
mass of the state distribution lies in a low-cost region. Hence, using the uncertainty
of the cost for exploration avoids extreme designs: it solely explores regions along
trajectories that pass somewhat close to the target, since otherwise the
objective function in equation (3.50) does not return small values.
15 We represent the cost distribution p(c(x )|µ , Σ ) by its mean and variance. This representation is good when the
t t t
mean is around 1/2, but can be fairly bad when the mean is close to a boundary, that is, zero or one. Then, the cost
distribution resembles a Beta distribution with a one-sided heavy tail and a mode close to the boundary. By contrast, our
chosen representation of the cost distribution can be interpreted to be a Gaussian distribution.
The partial derivatives
\frac{\partial}{\partial \mu_t} E_{x_t}[c(x_t)] , \qquad \frac{\partial}{\partial \Sigma_t} E_{x_t}[c(x_t)]   (3.52)
of the immediate cost with respect to the mean and the covariance of the state
distribution p(x_t) = N(\mu_t, \Sigma_t), which are required in equation (3.28), are given by
\frac{\partial}{\partial \mu_t} E_{x_t}[c(x_t)] = -E_{x_t}[c(x_t)]\, (\mu_t - x_{\text{target}})^\top \tilde{S}_1 ,   (3.53)
\frac{\partial}{\partial \Sigma_t} E_{x_t}[c(x_t)] = \tfrac{1}{2}\, E_{x_t}[c(x_t)]\, \big(\tilde{S}_1 (\mu_t - x_{\text{target}})(\mu_t - x_{\text{target}})^\top - I\big)\, \tilde{S}_1 ,   (3.54)
respectively, where S̃1 is given in equation (3.46). Additional partial derivatives are
required if the objective function (3.50) is used to encourage additional exploration.
These partial derivatives are
\frac{\partial}{\partial \mu_t} E_{x_t}[c(x_t)^2] = -2\, E_{x_t}[c(x_t)^2]\, (\mu - x_{\text{target}})^\top \tilde{S}_2 ,   (3.55)
\frac{\partial}{\partial \Sigma_t} E_{x_t}[c(x_t)^2] = 2\, E_{x_t}[c(x_t)^2]\, \tilde{S}_2 (\mu - x_{\text{target}})(\mu - x_{\text{target}})^\top \tilde{S}_2 ,   (3.56)
In equation (3.57), d is the distance from the current state to the target state and a2
is a scalar parameter controlling the width of the cost parabola. In a general form,
the quadratic cost, its expectation, and its variance are given by
respectively, where x ∼ N (µ, Σ) and T−1 is a symmetric matrix that also contains
the scaling parameter a2 in equation (3.58). In the quadratic cost function, the scal-
ing parameter a2 has theoretically no influence on the solution to the optimization
problem: The optimum of V π is always the same independent of a2 . From a prac-
tical point of view, the gradient-based optimizer can be very sensitive to the choice
of a2 .
The variance of the quadratic cost in equation (3.60) increases if the state xt be-
comes uncertain and/or if the mean µt of xt is far away from the target xtarget .
Unlike the saturating cost function in equation (3.43), the variance of the quadratic
cost function in equation (3.58) is unbounded and depends quadratically on the
state covariance matrix. Moreover, the term involving the state covariance matrix
is additively connected with a term that depends on the scaled squared distance
between the mean µt and the target state xtarget . Thus, exploration using σx [c(x)]
is not goal-directed: Along a predicted trajectory, uncertain states16 and/or states
far away from the target are favored. Hence, the variance of the quadratic cost
function essentially grows unboundedly with the state covariance, a
setting that can lead to “extreme designs” (MacKay, 1992).
Due to these issues, we use the saturating cost in equation (3.43) instead.
3.8 Results
We applied pilco to learning models and controllers for inherently unstable dy-
namic systems, where the applicable control signals were constrained. We present
typical empirical results, that is, results that are representative of the experiments
we conducted. In all control tasks, pilco followed the high-level idea described in
Algorithm 1 on page 36.
Interactions
Pictorially, pilco used the learned model as an internal representation of the world
as described in Figure 3.3. When we worked with hardware, the world was given
by the mechanical system itself. Otherwise, the world was defined as the em-
ulation of a mechanical system. For this purpose, we used an ODE solver for
numerical simulation of the corresponding equations of motion. The equations of
motion were given by a set of coupled ordinary differential equations s̈ = f(ṡ, s, u),
which we rewrote as a first-order system in the state (s, ṡ). Then, the ODE solver numerically computed the state
x_{t+1} := x_{t+\Delta_t} := \begin{bmatrix} s(t+1) \\ \dot{s}(t+1) \end{bmatrix} = \int_t^{t+\Delta_t} \begin{bmatrix} \dot{s}(\tau) \\ f(\dot{s}(\tau), s(\tau), u(\tau)) \end{bmatrix} \mathrm{d}\tau   (3.66)
during the interaction phase, where ∆t is a time discretization constant (in sec-
onds). Note that the system was simulated in continuous time, but the control
decision could only be changed every ∆t , that is, at the discrete time instances
when the state of the system was measured.
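A minimal sketch of this simulated world for a toy second-order system: the control is held constant over one sampling interval ∆t (zero-order hold) and the equations of motion are integrated numerically. The pendulum dynamics f below are a stand-in, not the cart-pole equations from Appendix C.2.

import numpy as np
from scipy.integrate import solve_ivp

def f(s_dot, s, u):
    """Toy pendulum acceleration: s'' = f(s', s, u)."""
    return -0.1 * s_dot - 9.81 * np.sin(s) + u

def step(x, u, dt=0.1):
    """One zero-order-hold step: x = (s, s_dot) at time t -> state at t + dt."""
    def ode(_, y):
        s, s_dot = y
        return [s_dot, f(s_dot, s, u)]
    sol = solve_ivp(ode, (0.0, dt), x, rtol=1e-8, atol=1e-8)
    return sol.y[:, -1]

x = np.array([np.pi - 0.1, 0.0])       # initial state (angle, angular velocity)
x_next = step(x, u=0.5)                # control decision held constant for dt seconds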
Learning Procedure
Algorithm 2 describes a more detailed implementation of pilco. Here, we dis-
16 This means either states that are far away from the current GP training set or states where the function value highly
varies.
tinguish between “automatic” steps that directly follow from the proposed learn-
ing framework and fairly general heuristics, which we used for computational
convenience. Let us start with the automatic steps:
1. The policy parameters ψ were initialized randomly (line 1): For the linear policy
in equation (3.11), the parameters were sampled from a zero-mean Gaussian
with variance 0.01. For the RBF policy in equation (3.14), the means of the ba-
sis functions were sampled from a Gaussian with variance 0.1 around x0 . The
hyper-parameters and the pseudo-training targets were sampled from a zero-
mean Gaussian with variance 0.01. Moreover, we passed in the distribution
p(x0 ) of the initial state, the width a of the saturating immediate cost function
c in equation (3.43), the exploration parameter b, the prediction horizon T , the
number PS of policy searches, and the time discretization constant ∆t .
2. An initial training set Dinit for the dynamics model was generated by applying
actions randomly (drawn uniformly from [−umax , umax ]) to the system (line 3).
3. For PS policy searches, pilco followed the steps in the loop in Algorithm 2: A
probabilistic model of the transition dynamics was learned using all previous
experience collected in the data set D (line 5). Using this model, pilco sim-
ulated the dynamics forward internally for long-term predictions, evaluated
the objective function V π (see Section 3.5), and learned the policy parameters
ψ for the current model (line 6 and Section 3.6).
4. The policy was applied to the system (line 11) and the data set was augmented
by the new experience from this interaction phase (line 12).
These steps were automatic steps and did not require any deep heuristic.
Thus far, we have not explained line 7 and the if-statement in line 8, where
the latter one uses a heuristic. In line 7, pilco computed the expected values of
the predicted costs, A_t := E_{x_t}[c(x_t) | π^∗_{ψ^∗}], t = 1, . . . , T, when following the
optimized policy π^∗_{ψ^∗}. The function task_learned uses A_T to determine whether a
good path to the target is expected to be found (line 8): When the expected cost A_T
at the end of the prediction horizon was small (below 0.2), the system state at time
T was predicted to be close to the target.17 When task_learned reported success,
the current prediction horizon T was increased by 25% (line 9). Then, the new
task was initialized for the extended horizon by the policy that was expected to
succeed for the shorter horizon. Initializing the learning task for a longer horizon
with the solution for the shorter horizon can be considered a continuation method
for learning. We refer to the paper by Richter and DeCarlo (1983) for details on
continuation methods.
Stepwise increasing the horizon (line 9) mimics human learning for episodic
tasks and simplifies artificial learning since the prediction and the optimization
problems are easier for relatively short time horizons T : In particular, in the early
stages of learning when the dynamics model was very uncertain, most visited
states along a long trajectory did not contribute much to goal-directed learning as
they were almost random. Instead, they made learning the dynamics model more
difficult since only the first seconds of a controlled trajectory conveyed valuable
information for solving the task.
We applied pilco to four different nonlinear control tasks: the cart-pole prob-
lem (Section 3.8.1), the Pendubot (Section 3.8.2), the cart-double pendulum (Sec-
tion 3.8.3), and the problem of riding a unicycle (Section 3.8.4). In all cases, pilco
learned models of the system dynamics and controllers that solved the individual
tasks from scratch.
The objective of this section is to shed light on the generality, applicability,
and performance of the proposed pilco framework by addressing the following
questions in our experiments:
• Is pilco applicable to different tasks by solely adapting the parameter initial-
izations (line 1 in Algorithm 2), but not the high-level assumptions?
• Are the probabilistic model and/or the saturating cost function and/or the
approximate inference algorithm crucial for pilco’s success? What is the effect
of replacing our suggested components in Figure 3.5?
17 The following simplified illustration might clarify our statement: Suppose a cost of 1 is incurred if the task cannot be solved, and a cost of 0 is incurred if the task can be solved. An expected value of 0.2 of this Bernoulli variable requires a predicted success rate of 0.8.
• Can pilco naturally deal with multivariate control signals, which corresponds to having multiple actuators?
• What are the effects of different implementations of the continuous-time control signal u(τ) in equation (3.66)?

Table 3.1:
experiment                                  section
single nonlinear controller                 3.8.1, 3.8.2, 3.8.3
hardware applicability                      3.8.1
importance of RL ingredients                3.8.1
multivariate controls                       3.8.2, 3.8.4
different implementations of u(τ)           3.8.1
Table 3.1 gives an overview in which sections these questions are addressed. Most
of the questions are discussed in the context of the cart-pole problem (Section 3.8.1)
and the Pendubot (Section 3.8.2). The hardware applicability was only tested on
the cart-pole task since no other hardware equipment was available at the time of
writing.
Evaluation Setup
Algorithm 3 describes the setup that was used to evaluate pilco’s performance. For
each task, initially, a policy π∗ was learned by following Algorithm 2. Then, an initial
state x_0^{(i)} was sampled from the state distribution p(x_0), where i = 1, . . . , 1,000.
Starting from x_0^{(i)}, the policy was applied and the resulting state trajectory
τ_i = (x_0^{(i)}, . . . , x_T^{(i)}) was observed and used for the evaluation. Note that the policy was
fixed in all rollouts during the evaluation.
The cart-pole system consists of a cart running on an infinite track and a pendulum attached to the cart. An external force
can be applied to the cart, but not to the pendulum. The dynamics of the cart-pole
system are derived from first principles in Appendix C.2.
The state x of the system was given by the position x1 of the cart, the velocity
ẋ1 of the cart, the angle θ2 of the pendulum, and the angular velocity θ̇2 of the
pendulum. The angle was measured anti-clockwise from hanging down. During
simulation, the angle was represented as a complex number on the unit circle,
that is, the angle was mapped to its sine and cosine. Therefore, the state was represented as
x = \big[x_1 \;\; \dot{x}_1 \;\; \dot{\theta}_2 \;\; \sin\theta_2 \;\; \cos\theta_2\big]^\top \in \mathbb{R}^5 .   (3.67)
On the one hand, we could exploit the periodicity of the angle that naturally
comes along with this augmented state representation; on the other hand, the
dimensionality of the state increased by one, namely by the number of angles.
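A minimal sketch of this state representation:

import numpy as np

def augment_state(x1, x1_dot, theta2, theta2_dot):
    """Map the angle to its sine and cosine, eq. (3.67)."""
    return np.array([x1, x1_dot, theta2_dot, np.sin(theta2), np.cos(theta2)])

print(augment_state(0.0, 0.0, np.pi, 0.0))     # inverted position maps to sin = 0, cos = -1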
Initially, the system was expected to be in a state where the pendulum hangs
down (µ0 = 0). By pushing the cart to the left and to the right, the objective
was to swing the pendulum up and to balance it in the inverted position around
the target state (green cross in Figure 3.16), such that x1 = 0 and θ2 = π +
2kπ, k ∈ Z. Doya (2000) considered a similar task, where the target position was
upright, but the target location of the cart was arbitrary. Instead of just balancing
the pendulum, we additionally required the tip of the pendulum to be balanced at
a specific location given by the cross in Figure 3.16. Thus, optimally solving the
task required the cart to stop at a particular position.
Note that a globally linear controller is incapable of swinging the pendulum up
and balancing it in the inverted position (Raiko and Tornio, 2009), although it can be
successful locally around the equilibrium. Therefore, pilco learned the nonlinear
RBF policy given in equation (3.14). The cart-pole problem is nontrivial to solve
since sometimes actions have to be taken that temporarily move the pendulum
further away from the target. Thus, optimizing a short-term cost (myopic control)
does not lead to success.
In control theory, the cart-pole task is a classical benchmark for nonlinear op-
timal control. However, typically the task is solved using two separate controllers:
one for the swing up and the second one for the balancing task. The control the-
ory solution is based on an intricate understanding of the dynamics of the system
(parametric system identification) and of how to solve the task (switching con-
trollers), neither of which we assumed in this dissertation. Instead, our objective
was to find a good policy without an intricate understanding of the system, which
we consider expert knowledge. Unlike the classical solution, pilco attempted to
learn a single nonlinear RBF controller to solve the entire cart-pole task.
The parameters used in the experiment are given in Appendix D.1. The chosen
time discretization of ∆t = 0.1 s corresponds to a rather slow sampling frequency
of 10 Hz. Therefore, the control signal could only be changed every ∆t = 0.1 s,
which made the control task fairly challenging.
Cost Function
Every ∆t = 0.1 s, the squared Euclidean distance
d(x, x_{\text{target}})^2 = x_1^2 + 2 x_1 l \sin\theta_2 + 2 l^2 + 2 l^2 \cos\theta_2   (3.68)
from the tip of the pendulum to the inverted position (green cross in Figure 3.16)
was measured. Note that d only depends on the position variables x1 , sin θ2 ,
and cos θ2 . In particular, it does not depend on the velocity variables ẋ1 and θ̇2 .
An approximate Gaussian joint distribution p(j) = N(m, S) of the involved state variables
j := \big[x_1 \;\; \sin\theta_2 \;\; \cos\theta_2\big]^\top   (3.69)
was computed using the results from Appendix A.1. The target vector in j-space
was j_target = [0, 0, −1]^⊤. The saturating immediate cost was then given as
c(x) = c(j) = 1 - \exp\!\big(-\tfrac{1}{2}(j - j_{\text{target}})^\top T^{-1} (j - j_{\text{target}})\big) \in [0, 1] .   (3.70)
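The cost computation in equations (3.68)–(3.70) can be sketched as follows. The matrix C is one choice for which T^{-1} = a^{-2} C^⊤ C reproduces the squared tip distance in equation (3.68); the pendulum length l and the cost width a are illustrative values, not the experimental settings from Appendix D.1.

import numpy as np

def cartpole_cost(x1, theta2, l=0.6, a=0.25):
    j = np.array([x1, np.sin(theta2), np.cos(theta2)])        # eq. (3.69)
    j_target = np.array([0.0, 0.0, -1.0])
    C = np.array([[1.0, l, 0.0],
                  [0.0, 0.0, l]])
    d2 = np.sum((C @ (j - j_target))**2)                      # equals eq. (3.68)
    T_inv = C.T @ C / a**2
    c = 1 - np.exp(-0.5 * (j - j_target) @ T_inv @ (j - j_target))   # eq. (3.70)
    return d2, c

print(cartpole_cost(x1=0.0, theta2=np.pi))                    # at the target: (~0, ~0)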
Figure 3.17: Histogram of the distances d from the tip of the pendulum to the target of 1,000 rollouts. The
x-axis shows the time, the heights of the bars represent the percentages of trajectories that fell into the
respective categories of distances to the target. After a transient phase, the controller could either solve
the problem very well (black bars) or it could not solve it at all (gray bars).
Zero-order-hold Control
Figure 3.18: Medians and quantiles of the predictive immediate cost distribution (blue) and the empirical
immediate cost distribution (red). The x-axis shows the time, the y-axis shows the immediate cost.
The size of the gray bars slightly increases again, which indicates that in a few cases the
pendulum moved away from the target and fell over. In approximately 1.5% of the
rollouts, the controller could not balance the pendulum close to the target state for
t ≥ 2 s. This is shown by the gray bar that is approximately constant for t ≥ 2.5 s.
The red and the yellow bars almost disappear after about 2 s. Hence, the controller
could either (in 98.3% of the rollouts) solve the problem very well (height of the
black bar) or it could not solve it at all (height of the gray bar).
In the 1,000 rollouts used for testing, the controller failed to solve the cart-
pole task when the actual rollout encountered states that were in tail regions of the
predicted marginals p(xt ), t = 1, . . . , T . In some of these cases, the succeeding states
were even more unlikely under the predicted distributions p(xt ), which were used
for policy learning, see Section 3.5. The controller did not pay much attention to
these tail regions. Note that the dynamics model was trained on twelve trajectories
only. The controller can be made more robust if the data of a trajectory that led
to a failure is incorporated into the dynamics model. However, we have not yet
evaluated this thoroughly.
Figure 3.18 shows the medians (solid lines), the α-quantiles, and the 1 − α-
quantiles, α = 0.1, of the distribution of the predicted immediate cost (blue,
dashed), t = 0 s, . . . , 6.3 s = Tmax , and the corresponding empirical cost distribution
(red, shaded) after 1,000 rollouts.
In Figure 3.18, the medians of the distributions are close to each other. How-
ever, the error bars of the predictive distribution (blue, dashed) are larger than
the error bars of the empirical distribution (red, shaded) when the predicted cost
approaches zero at about t = 1 s. This is due to the fact that the predicted cost
distribution is represented by its mean and its variance, but the empirical cost dis-
tribution at t = 1 s rather resembles a Beta distribution with a one-sided heavy tail:
(a) Cost when applying a policy based on 2.5 s experience.
(b) Cost when applying a policy based on 10 s experience.
(c) Cost when applying a policy based on 15 s experience.
(d) Cost when applying a policy based on 18.2 s experience.
(e) Cost when applying a policy based on 22.2 s experience.
(f) Cost when applying a policy based on 46.1 s experience.
Figure 3.19: Predicted cost and incurred immediate cost during learning (after 1, 4, 6, 7, 8, and 12 policy
searches, from top left to bottom right). The x-axes show the time in seconds, the y-axes show the
immediate cost. The black dashed line is the minimum immediate cost (zero). The blue solid line is
the mean of the predicted cost. The error bars show the corresponding 95% confidence interval of the
predictions. The cyan solid line is the cost incurred when the new policy is applied to the system. The
prediction horizon T was increased when a low cost at the end of the current horizon was predicted (see
line 9 in Algorithm 2). The cart-pole task can be considered learned after eight policy searches since the
error bars at the end of the trajectories are negligible and do not increase when the prediction horizon
increases.
In most cases, the tip of the pendulum was fairly close to the target (see also the
black and red bars in Figure 3.17), but when the pendulum could not be stabilized,
the tip of the pendulum went far away from the target incurring full immediate
cost (gray bars in Figure 3.17). However, after about 1.6 s, the quantiles of both dis-
tributions converge toward the medians, and the medians are almost identical to
zero. The median of the predictive distribution of the immediate cost implies that
no failure was predicted. Otherwise, the median of the Gaussian approximation
of the predicted immediate cost would significantly differ from zero. The small
bump in the error bars between 1 s and 1.6 s occurs at the point in time when the
RBF controller switched from the swing-up action to balancing. Note, however,
that only a single controller was learned.
Figure 3.19 shows six examples of the predicted cost p(c(xt )) and the cost c(xt )
of an actual rollout during pilco’s learning progress. We show plots after 1, 4, 6,
7, 8, and 12 policy searches. The predicted cost distributions p(c(xt )), t = 1, . . . , T ,
are represented by their means and their 95% confidence intervals shown by the
error bars. The cost distributions p(c(xt )) are based on t-step ahead predictions of
p(x0 ) when following the currently optimal policy.
In Figure 3.19(a), that is, after one policy search, we see that for the first roughly
0.7 s the system did not enter states with a cost that significantly differed from
unity. Between t = 0.8 s and t = 1 s a decline in cost was predicted. Simulta-
neously, a rapid increase of the predicted error bars can be observed, which is
largely due to the poor initial dynamics model: The dynamics training set so far
was solely based on a single random initial trajectory. When applying the found
policy to the system, we see (solid cyan graph) that indeed the cost did decrease
after about 0.8 s. The predicted error bars at around t = 1.4 s were small and the
means of the predicted states were far from the target, that is, almost full cost was
predicted. After the predicted fall-over, pilco lost track of the state, as indicated by
the increasing error bars and the mean of the predicted cost settling down at a
value of approximately 0.8. More specifically, while the position of the cart was
predicted to be close to its final destination with small uncertainty, pilco predicted
that the pendulum swung through, but it simultaneously lost track of the phase;
the GP model used to predict the angular velocity was very uncertain at this stage
of learning. Thus, roughly speaking, to compute the mean and the variance of
the immediate cost distribution for t ≥ 1.5 s, the predictor had to average over all
angles on the unit circle. Since some of the angles in this predicted distribution
were in a low-cost region the mean of the corresponding predicted immediate cost
settled down below unity (see also the discussion on exploration/exploitation in
Section 3.7.1). The trajectory of the incurred cost (cyan) confirms the prediction.
The observed state trajectory led to an improvement in the subsequently updated
dynamics model (see line 5 in Algorithm 2).
In Figure 3.19(b), the learned GP dynamics model had seven and a half seconds
more experience, also including some states closer to the goal state. The trough of
predicted low cost at t = 1 s got extended to a length of approximately a second.
After t = 2 s, the mean of the predicted cost increased and fell back close to unity
at the end of the predicted trajectory. Compared to Figure 3.19(a), the error bars
for t ≤ 1 s and for t ≥ 2.5 s got smaller due to an improved dynamics model. The
actual rollout shown in cyan was in agreement with the prediction.
In Figure 3.19(c) (after six policy searches), the mean of the predicted immediate
cost no longer fell back to unity, but stayed at a value slightly below 0.2
with relatively small error bars. This means that the controller was fairly certain
that the cart-pole task could be solved (otherwise the predicted mean would not
have leveled out below 0.2), although the dynamics model still conveyed a fair
amount of uncertainty. The actual rollout did not only agree with the prediction,
but it solved the task better than expected. Therefore, many data points close to
the target state could be incorporated into the subsequent update of the dynam-
ics model. Since the mean of the predicted cost at the end of the trajectory was
smaller than 0.2, we employed the heuristic in line 9 of Algorithm 2 and increased
the prediction horizon T by 25% for the next policy search.
The result after the next policy search and with the increased horizon T = 3.2 s
is shown in Figure 3.19(d). Due to the improved dynamics model, the expected
cost at the end of the trajectory declined to a value slightly above zero; the error
bars shrank substantially compared to the previous predictions. The bump in
the error bars between t = 1 s and t = 2 s reflects the time point when the RBF
controller switched from swinging up to balancing: Slight oscillations around the
target state could occur. Since the predicted mean at the end of the trajectory had
a small value, using the heuristic in line 9 of Algorithm 2, the prediction horizon
was increased again (to T = 4 s) for the subsequent policy search. The predictions
after the eighth policy search are shown in Figure 3.19(e). The error bars at the end
of the trajectory collapsed, and the predicted mean leveled out at a value of zero.
The task was essentially solved at this time.
Iterating the learning procedure for more steps resulted in a quicker swing-up
action and in even smaller error bars both during the swing up (between 0.8 s and
1 s) and during the switch from the swing up to balancing (between 1 s and 1.8 s).
From now onward, the prediction horizon was increased after each policy search
until T = Tmax . The result after the last policy search in our experiment is shown in
Figure 3.19(f). The learned controller found a way to reduce the oscillations around
the target, which is shown by the reduction of the bump after t = 1 s. Furthermore,
the error bars collapsed after t = 1.6 s, and the mean of the predicted cost stayed at
zero. The actual trajectory was in agreement with the prediction.
Remark 4. Besides the heuristic for increasing the prediction horizon, the entire
learning procedure for finding a good policy was fully automatic. The heuristic
employed was not crucial, but it made learning easier.
When a trajectory did not agree with its predicted distribution, this “surprising”
trajectory led to learning. When incorporating this information
into the dynamics model, the model changed correspondingly in regions where
the discrepancies between predictions and true states could not be explained by
the previous error bars.
The obtained policy was not optimal18, but it solved the task fairly well, and
the system found it automatically using less than 30 s of interaction. The task could
be considered solved after eight policy searches since the error bars of the pre-
dicted immediate cost vanished and the rollouts confirmed the prediction. In a
successful rollout when following this policy, the controller typically brought the
angle exactly upright and kept the cart approximately 1 mm left from its optimal
position.
18 We sometimes found policies that could swing up quicker.
Summary Table 3.2 summarizes the results for the learned zero-order-hold con-
troller. The total interaction time with the system was of the order of magnitude
of a minute. However, to effectively learn the cart-pole task, that is, to the point where
increasing the prediction horizon did not lead to increased error bars and a predicted failure,
only about 20 s–30 s of interaction time was necessary. The subsequent interactions were
used to collect more data to further improve the dynamics model and the policy;
the controller swung the pendulum up more quickly and became more robust:
After eight policy searches19 (when the task was essentially solved), the failure
rate was about 10% and declined to about 1.5% after four more policy searches
and roughly twice as much interaction time. Since only states that could not
be predicted well contributed much to an improvement of the dynamics model,
most data in the last trajectories were redundant. Occasional failures happened
when the system encountered states where the policy was not good, for exam-
ple, when the actual states were in tail regions of the predicted state distributions
p(xt ), t = 0, . . . , T , which were used for policy learning.
First-order-hold Control
Thus far, we considered control signals being constantly applied to the system for
a time interval of ∆t (zero-order hold). In many practical applications including
robotics, this control method requires robustness of the physical system and of the motor
that applies the control signal: Instantaneously switching from a large positive
control signal to a large negative one can accelerate wear, for example.
One way to increase the lifetime of the system is to implement a continuous-time
control signal u(τ ) (see equation (3.66)) that smoothly interpolates between the
control decisions ut−1 and ut at time steps t − 1 and t, respectively.
Suppose the control signal is piecewise linear. At each discrete time step, the
controller decides on a new signal to be applied to the system. Instead of constantly
applying it for ∆t , the control signal interpolates linearly between the previous
control decision π(xt−1 ) = ut−1 and the new control decision π(xt ) = ut according
to
\[
u(\tau) = \frac{\Delta_t - \tau}{\Delta_t}\,\underbrace{\pi(x_{t-1})}_{=u_{t-1}} \;+\; \frac{\tau}{\Delta_t}\,\underbrace{\pi(x_t)}_{=u_t}\,, \qquad \tau \in [0, \Delta_t]. \tag{3.72}
\]
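For illustration, the interpolation rule in equation (3.72) can be written as a small function. This is a sketch only, not the implementation used in the experiments; the function and variable names are chosen here for readability.

```python
import numpy as np

def first_order_hold(u_prev, u_next, tau, dt):
    """Linearly interpolate between two control decisions, cf. equation (3.72).

    u_prev : control decision pi(x_{t-1}) from the previous time step
    u_next : control decision pi(x_t) from the current time step
    tau    : time elapsed since the current decision, 0 <= tau <= dt
    dt     : sampling interval Delta_t
    """
    tau = np.clip(tau, 0.0, dt)
    return (dt - tau) / dt * u_prev + tau / dt * u_next

# Example: moving from u_{t-1} = +10 N to u_t = -10 N over dt = 0.1 s.
print(first_order_hold(10.0, -10.0, tau=0.05, dt=0.1))  # 0.0, i.e. halfway
```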
Figure 3.20: Zero-order-hold control (black, solid) and first-order-hold control (red, dashed). The x-axis
shows the time in seconds, the y-axis shows the applied control signal. The control decision can be
changed every ∆t = 0.1 s. Zero-order-hold control applies a constant control signal for ∆t to the system.
First-order-hold control linearly interpolates between control decisions at time t and t + 1 according to
equation (3.72).
Figure 3.20 illustrates both control schemes when the control decision is changed
every ∆t = 0.1 s. The smoothing effect of first-order hold becomes apparent after
0.3 s: Instead of instantaneously switching from a large positive to a large negative
control signal, first-order hold linearly interpolates between these values over a
time span that corresponds to the sampling interval length ∆t . It takes ∆t for the
first-order-hold signal to achieve the full effect of the control decision ut , whereas
the zero-order-hold signal instantaneously switches to the new control signal. The
smoothing effect of first-order hold diminishes in the limit ∆t → 0.
We implemented the first-order-hold control using state augmentation. More
precisely, the state was augmented by the previous control decision. With u−1 := 0
we defined the augmented state at time step t
\[
x_t^{\mathrm{aug}} := \begin{bmatrix} x_t^\top & u_{t-1}^\top \end{bmatrix}^\top, \qquad t \geq 0. \tag{3.73}
\]
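A minimal sketch of this state augmentation, assuming plain numpy arrays for the state and the control (the example state and its ordering are for illustration only):

```python
import numpy as np

def augment_state(x, u_prev):
    """Append the previous control decision to the state, cf. equation (3.73)."""
    return np.concatenate([np.atleast_1d(x), np.atleast_1d(u_prev)])

# Example: a four-dimensional cart-pole-like state with u_{-1} := 0 at t = 0.
x0_aug = augment_state(np.array([0.0, 0.0, 0.0, np.pi]), u_prev=0.0)
print(x0_aug.shape)  # (5,)
```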
To learn a first-order-hold controller for the cart-pole problem, the same pa-
rameter setup as for the zero-order-hold controller was used (see Appendix D.1).
In particular, the same sampling frequency 1/∆t was chosen. We followed Algo-
rithm 3 for the evaluation of the first-order-hold controller.
In Figure 3.21, typical state trajectories are shown, which are based on con-
trollers applying zero-order-hold control and first-order-hold control starting from
the same initial state x0 = 0. The first-order-hold controller induced a delay
when solving the cart-pole task. This delay can easily be seen in the position of
the cart. To stabilize the pendulum around equilibrium, the first-order-hold con-
troller caused longer oscillations in both state variables than the zero-order-hold
controller.
Summary Table 3.3 summarizes the results for the learned first-order-hold con-
troller. Compared to the zero-order-hold controller, the expected long-term cost
V π∗ was a bit higher. We explain this by the delay induced by the first-order hold.
However, in this experiment, pilco with a first-order-hold controller learned the
cart-pole task a bit quicker (in terms of interaction time) than with the zero-order-
hold controller.
(a) Trajectory of the position of the cart. (b) Trajectory of the angle of the pendulum.
Figure 3.21: Rollouts of four seconds for the cart position and the angle of the pendulum when applying
zero-order-hold control and first-order-hold control. The delay induced by the first-order hold control
can be observed in both state variables.
Position-independent Controller
Let us now discuss the case where the initial position of the cart is very uncertain.
For this case, we set the marginal variance of the position of the cart to a value,
such that about 95% of the possible initial positions of the cart were in the interval
[−4.5, 4.5] m. Due to the width of this interval, we informally call the controller for
this problem “position independent”. Besides the initial state uncertainty and the
number of basis functions for the RBF controller, which we increased to 200, the
basic setup was equivalent to the setup for the zero-order-hold controller earlier in
this section. The full set of parameters is given in Appendix D.1.
Directly performing the policy search with a large initial state uncertainty is
a difficult optimization problem. To simplify this step, we employed ideas from
continuation methods (see the tutorial paper by Richter and DeCarlo (1983) for
detailed information): Initially, pilco learned a controller for a fairly peaked initial
state distribution with Σ0 = 10−2 I. When success was predicted20 , the initial state
distribution was broadened and learning was continued. The policy found for
the narrower initial state distribution served as the parameter initialization of the
policy to be learned for the broadened initial state distribution. This “problem
shaping” is typical for a continuation method and simplified policy learning.
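The continuation-style schedule can be sketched as follows. The callables learn_policy and success_predicted are placeholders for pilco's policy search and the success heuristic, and the widening factor and loop structure are assumptions made for this sketch rather than the exact procedure used in the experiments.

```python
import numpy as np

def continuation_policy_search(learn_policy, success_predicted,
                               Sigma0_start, Sigma0_target,
                               widen_factor=4.0, max_searches=50):
    """Continuation-style heuristic: learn a policy for a peaked initial state
    distribution N(mu0, Sigma0) and, whenever success is predicted, widen Sigma0
    toward the target covariance, warm-starting the next policy search with the
    previously found parameters.

    learn_policy(params, Sigma0) -> params     (placeholder for the policy search)
    success_predicted(params, Sigma0) -> bool  (placeholder for the heuristic)
    """
    Sigma0 = np.array(Sigma0_start, dtype=float)
    params = None                        # assume learn_policy initializes if None
    for _ in range(max_searches):
        params = learn_policy(params, Sigma0)
        if success_predicted(params, Sigma0):
            if np.allclose(Sigma0, Sigma0_target):
                break                    # widest initial distribution reached
            # broaden p(x0); the current parameters initialize the next search
            Sigma0 = np.minimum(widen_factor * Sigma0, Sigma0_target)
    return params, Sigma0
```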
For the evaluation, we followed the steps described in Algorithm 3. Figure 3.22
shows the resulting cost distributions.
20 To predict success, we employed the task-learned function in Algorithm 2 as a heuristic.
(a) Histogram of the distances d of the tip of the pendulum to the target. After a fairly long transient phase due to the widespread initial positions of the cart, the controller could either solve the cart-pole problem very well (height of the black bar) or it could not solve it at all (height of the gray bar).
(b) Medians and quantiles of the predictive immediate cost distribution (blue) and the empirical immediate cost distribution (red). The fairly large uncertainty between t = 0.7 s and t = 2 s is due to the potential offset of the cart, which led to different arrival times at the “swing-up location”: Typically, the cart drove to a particular location from where the pendulum was swung up.
Figure 3.22: Cost distributions using the position-independent controller after 1,000 rollouts.
(a) Predicted position of the cart and true latent position. The cart is 4 m away from the target. It took about 2 s to move to the origin. After a small overshoot, which was required to stabilize the pendulum, the cart stayed close to the origin.
(b) Predicted angle of the pendulum and true latent angle. After swinging the pendulum up, the controller balanced the pendulum.
Figure 3.23: Predictions (blue) of the position of the cart and the angle of the pendulum when the position
of the cart was far away from the target. The true trajectories (cyan) were within the 95% confidence
interval of the predicted trajectories (blue). Note that even in predictions without any state update the
uncertainty decreased (see Panel (b) between t ∈ [1, 2] s.) since the system was being controlled.
During the transient phase, the quantiles of both the predicted cost distribution and
the empirical cost distribution are very broad. This effect is caused by the almost arbi-
trary initial position of the cart: When the position of the cart was close to zero, the
controller implemented a policy that resembled the policy found for a small initial
distribution (see Figure 3.18). Otherwise the cart reached the low-cost region with
a “delay”. According to Figure 3.22(b), this delay could be up to 1 s. However,
after about 3.5 s, the quantiles of both the predicted distribution and the empirical
distribution of the immediate cost converge toward the medians, and the medians
are almost identically zero.
In a typical successful rollout (when following π ∗ ), the controller swung the
pendulum up and balanced it exactly in the inverted position while the cart had
an offset of 2 mm from the optimal cart position just below the cross in Figure 3.16.
Figure 3.23 shows predictions of the position of the cart and the pendulum
angle starting from x0 = [−4, 0, 0, 0]> . We passed the initial state distribution p(x0 )
and the learned controller implementing π ∗ to the model of the transition dynam-
ics. The model predicted the state 6 s ahead without any evidence from measure-
ments to refine the predicted state distribution. However, the learned policy π ∗
was incorporated explicitly into these predictions following the approximate infer-
ence algorithm described in Section 3.5. The figure illustrates that the predictive
dynamics model was fairly accurate and not overconfident since the true trajecto-
ries were within the predicted 95% confidence bounds. The error bars in the angle
grow slightly around t = 1.3 s when the controller decelerated the pendulum to
stabilize it. Note that the initial position of the cart was 4 m away from its optimal
target position.
Summary Table 3.4 summarizes the results for the learned position-independent
controller. Compared to the results of the “standard” setup (see Table 3.2), the
failure rate and the expected long-term cost are higher. In particular, the higher
value of V π originates from the high uncertainty of the initial position of the cart.
Interaction time and the speed of learning do not differ much compared to the
results of the standard setup.
When the position-independent controller was applied to the simpler problem
with Σ0 = 10−2 I, we obtained a failure rate of 0% after 1,000 trials compared to
1.5% when we directly learned a controller for this tight distribution (see Table 3.2).
The increased robustness is due to the fact that the tails of the tighter distribution
with Σ0 = 10−2 I were better covered by the position-independent controller since
during learning the initial states were drawn from the wider distribution. The
swing up was not performed as quickly as shown in Figure 3.18, though.
Hardware Experiment
The graphical model in Figure 3.2 was employed although it is only approximately
correct (see Section 3.9.2 for a more detailed discussion).
Table 3.5 reports some physical parameters of the cart-pole system in Fig-
ure 3.24. Note that none of the parameters in Table 3.5 was directly required
to apply pilco.
Example 5 (Heuristics to set the parameters). We used the length of the pendulum
for heuristics to determine the width a of the cost function in equation (3.70) and
the sampling frequency 1/∆t . The width of the cost function encodes what states
were considered “far” from the target. If a state x was further away from the tar-
get than twice the width a of the cost function in equation (3.70), the state was
considered “far away”. The pendulum length was also used to find the character-
istic frequency of the system: The swing period of the pendulum is approximately
2π√(l/g) ≈ 0.7 s, where g is the acceleration of gravity and l is the length of the
pendulum. Since l = 125 mm, we set the sampling frequency to 10 Hz, which is
about seven times faster than the characteristic frequency of the system. Further-
more, we chose the cost function in equation (3.43) with a ≈ 0.07 m, such that the
cost incurred did not substantially differ from unity if the distance between the
pendulum tip and the target state was greater than the length of the pendulum.
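The back-of-the-envelope computation in Example 5 can be reproduced directly; the numbers follow the text (l = 125 mm, g ≈ 9.81 m/s²), and the printed ratios are purely illustrative.

```python
import numpy as np

l = 0.125          # pendulum length in m (from the text)
g = 9.81           # acceleration of gravity in m/s^2

swing_period = 2 * np.pi * np.sqrt(l / g)   # approx. 0.71 s
char_freq = 1.0 / swing_period              # approx. 1.4 Hz
sampling_freq = 10.0                        # chosen sampling frequency in Hz
cost_width = 0.07                           # width a of the saturating cost in m

print(f"swing period: {swing_period:.2f} s")
print(f"sampling rate is {sampling_freq / char_freq:.1f} times the characteristic frequency")
```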
Unlike classical control methods, pilco learned a probabilistic non-parametric
model of the system dynamics (in equation (3.1)) and a good controller fully auto-
matically from data. It was therefore not necessary to provide a detailed parametric
description of the transition dynamics that might have included parameters, such
as friction coefficients, motor constants, and delays. The torque motor limits im-
plied force constraints of u ∈ [−10, 10] N. The length of each trial was constant and
set to Tinit = 2.5 s = Tmax , that is, we did not use the heuristic to stepwise increase
the prediction horizon.
Figure 3.25(a) shows the initial configuration of the system. The objective is
to swing the pendulum up and to balance it around the red cross. Following the
automatic procedure described in Algorithm 1 on page 36, pilco was initialized
with two trials of length T = 2.5 s each, where actions (horizontal forces to the
cart), randomly drawn from U[−10, 10] N, were applied. The five seconds of data
collected in these trials were used to train an initial probabilistic dynamics model.
Pilco used this model to simulate the cart-pole system internally (long-term plan-
ning) and to learn the parameters of the RBF controller (line 6 in Algorithm 1). In
the third trial, which was the first non-random trial, the RBF policy was applied
to the system. The controller managed to keep the cart in the middle of the track,
but the pendulum did not cross the level of the cart’s track—the system had never
encountered states before with the pendulum being above the track. In the subse-
quent model update (line 5), pilco took these observations into account. Due to
the incorporated new experience, the uncertainty in the predictions decreased and
pilco learned a policy for the updated model. Then, the new policy was applied
to the system for another 2.5 s, which led to the fourth trial where the controller
swung the pendulum up. The pendulum drastically overshot, but for the first
time, states close to the target state were encountered. In the subsequent iteration,
these observations were taken into account by the updated dynamics model and
the corresponding controller was learned. In the fifth trial, the controller learned
to reduce the angular velocity close to the target since falling over leads to high
cost. After two more trials, the learned controller solved the cart-pole task based
on a total of 17.5 s experience only.
Pilco found the controller and the corresponding dynamics model fully auto-
matically by simply following the steps in Algorithm 1. Figure 3.25 shows snap-
shots of a test trajectory of 20 s length. A video showing the entire learning process
can be found at http://mlg.eng.cam.ac.uk/marc/.
Pilco is very general and worked immediately when we applied it to hard-
ware. Since we could derive the correct orders of magnitude of all required pa-
rameters (the width of the cost function and the time sampling frequency) from
the length of the pendulum, no parameter tuning was necessary.
Summary Table 3.6 summarizes the results of the hardware experiment. Although
pilco used two random initial trials, only seven trials in total were required to
learn the task. We report an unprecedented degree of automation for learning
control: the interaction time required to solve the task was far less than a minute,
a learning efficiency that, to our knowledge, had not been achieved before for this kind of problem. Note that sys-
tem identification in a classical sense alone typically requires more data from
interaction.
(a) Test trajectory, t = 0.000 s. (b) Test trajectory, t = 0.300 s. (c) Test trajectory, t = 0.466 s.
(d) Test trajectory, t = 0.700 s. (e) Test trajectory, t = 1.100 s. (f) Test trajectory, t = 1.300 s.
(g) Test trajectory, t = 1.500 s. (h) Test trajectory, t = 3.666 s. (i) Test trajectory, t = 7.400 s.
Figure 3.25: Inverted pendulum in hardware; snapshots of a controlled trajectory after having learned
the task. The pendulum was swung up and balanced in the inverted position close to the target state (red
cross). To solve this task, our algorithm required only 17.5 s of interaction with the physical system.
In the following, we provide evidence of the relevance of the probabilistic dynamics
model, the saturating cost function, and the approximate inference algorithm using
moment matching.
Deterministic model. Let us start by replacing the probabilistic dynamics model
with a deterministic model while keeping the cost function and the learning algo-
rithm the same.
The deterministic model employed was a “non-parametric RBF network”: Like
in the previous sections, we trained a GP model using the collected experience in
form of tuples (state, action, successor state). However, during planning, we set
the model uncertainty to zero. In equation (3.7), this corresponds to p(xt |xt−1 , ut−1 )
being a delta function centered at xt . The resulting model was an RBF network
(functionally equivalent to the mean function of the GP), where the number of basis
functions increased with the size of the training set. A different way of looking at
this model is to take the most likely function under the GP distribution as a point
estimate of the underlying function; this most likely function is the GP mean
function.
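The relation between the GP mean and an RBF network can be made concrete in a few lines. The following one-dimensional sketch is illustrative only and is not the model used in the experiments: the posterior mean is a weighted sum of squared-exponential basis functions, one per training point, and using only this mean discards all model uncertainty.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential (RBF) covariance between the rows of A and B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_mean_predictor(X, y, lengthscale=1.0, signal_var=1.0, noise_var=1e-2):
    """Return the GP posterior mean, i.e. an RBF network with one basis function
    per training point: mean(x*) = k(x*, X) (K + noise_var I)^{-1} y. Using only
    this mean (zero predictive variance) gives the deterministic model above."""
    K = se_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(len(X))
    beta = np.linalg.solve(K, y)                 # RBF weights
    return lambda Xs: se_kernel(Xs, X, lengthscale, signal_var) @ beta

# Tiny 1D example: fit a dynamics-like mapping and predict deterministically.
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = np.sin(X).ravel() + 0.05 * np.random.randn(20)
f = gp_mean_predictor(X, y)
print(f(np.array([[0.5]])))                      # point prediction, no uncertainty
```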
Quadratic cost function. In our next experiment, we chose the quadratic cost
function instead of the saturating cost function. We kept the probabilistic dynam-
ics model and the indirect policy search RL algorithm. This means, the second
component of the three essential components in Figure 3.5 was replaced.
22 Note that the model was trained on differences xt+1 − xt , see equation (3.3).
23 To generate this expert knowledge, we used a working policy (learned by using a probabilistic dynamics model) to
generate the initial training set (instead of random actions).
(a) Histogram of the distances d of the tip of the pendulum to the target of 1,000 rollouts. The x-axis shows the time in seconds, the colors encode distances to the target. The heights of the bars correspond to their respective percentages in the 1,000 rollouts.
(b) Quantiles of the predictive immediate cost distribution (blue) and the empirical immediate cost distribution (red). The x-axis shows the time in seconds, the y-axis shows the immediate cost.
Figure 3.26: Distance histogram in Panel 3.26(a) and quantiles of cost distributions in Panel 3.26(b). The
controller was learned using the quadratic immediate cost in equation (3.58).
By contrast, the saturating cost function in equation (3.43) does not
much distinguish between states where the tip of the pendulum is further away
from the target state than one meter: The cost is unity with high confidence.
Table 3.7 summarizes the results of the learning algorithm when using the
quadratic cost function in equation (3.58). No exploration was employed, that is,
the exploration parameter b in equation (3.50) equals zero. Compared to the sat-
urating cost function, learning with the quadratic cost function was often slower
since the quadratic cost function paid much attention to essentially minor details
of the state distribution when the state was far away from the target. Our expe-
rience is that if a solution was found when using the quadratic cost function, it
was often worse than the solution with a saturating cost function. It also seemed
that the optimization using the quadratic cost suffered more severely from local
optima. Better solutions existed, though: When plugging in the controller that was
found with the saturating cost, the expected sum of immediate quadratic costs
could halve.
Summarizing, the saturating cost function was not crucial but helpful for the
success of the learning framework: The saturating immediate cost function typi-
cally led to faster learning and better solutions than the quadratic cost function.
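The qualitative difference between the two cost functions is easy to see numerically. The sketch below uses the saturating form of equation (3.43) and a plain d²/a² as a stand-in for the quadratic cost; the width a and the distances are illustrative values rather than the exact experimental settings.

```python
import numpy as np

def saturating_cost(d, a):
    """Saturating immediate cost, cf. equation (3.43): flat near 1 for d >> a."""
    return 1.0 - np.exp(-0.5 * (d / a) ** 2)

def quadratic_cost(d, a):
    """Quadratic immediate cost (illustrative scaling only): keeps growing with d,
    so faraway states dominate the expected cost."""
    return (d / a) ** 2

d = np.array([0.05, 0.5, 1.0, 2.0])   # distances of the pendulum tip in m
a = 0.25                              # illustrative cost width in m
print(saturating_cost(d, a))          # saturates near 1 for d well above a
print(quadratic_cost(d, a))           # grows without bound
```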
Pegasus. Typically, the successor state xt is sampled from p(xt |xt−1 , ut−1 ) using an
internal random number generator and then propagated forward. This standard
sampling procedure is inefficient and cannot be used for gradient-based policy
search methods using a small number of samples only. The key idea in Pegasus is
to draw a sample from the augmented state distribution p̃(xt |xt−1 , ut−1 , q), where
q is a random number provided externally. If q is given externally, the distribution
p̃(xt |xt−1 , ut−1 , q) collapses to a deterministic function of q. Therefore, a determinis-
tic simulative model is obtained that is very powerful as detailed by Ng and Jordan
(2000). For each rollout during the policy search, the same random numbers are
used.
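The fixed-randomness idea can be sketched as follows; gp_predict and policy are placeholders (assumptions), and the Gaussian one-step sampling is only a schematic stand-in for a full GP rollout.

```python
import numpy as np

def pegasus_rollout(x0, policy, gp_predict, noise, T):
    """Deterministic rollout in the spirit of Pegasus: the random numbers used to
    sample each transition are drawn once and fixed, so repeated rollouts with the
    same policy parameters yield identical trajectories (and well-defined gradients).

    policy(x) -> u                                        (placeholder)
    gp_predict(x, u) -> (mean, std) of p(x_{t+1}|x_t, u)  (placeholder)
    noise : array of shape (T, dim_x), provided externally and reused
    """
    x, traj = np.asarray(x0, dtype=float), []
    for t in range(T):
        u = policy(x)
        mean, std = gp_predict(x, u)
        x = mean + std * noise[t]        # externally provided randomness q
        traj.append(x)
    return np.array(traj)

# The same fixed noise (one array per scenario) is reused for every candidate policy.
rng = np.random.default_rng(0)
noise = rng.standard_normal((50, 4))     # T = 50 steps, four-dimensional state
```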
Since Pegasus solely requires a generative model for the transition dynamics,
we used our probabilistic GP model for this purpose and performed the policy
search with Pegasus. To optimize the policy parameters for an initial distribution
p(x0 ), we sampled S start states x0^(i), so-called scenarios, from which we started
the sample trajectories determined by Pegasus. The value function V π (x0 ) in equa-
tion (3.2) is then approximated by a Monte Carlo sample average over V π (x0^(i)). The
derivatives of the value function with respect to the policy parameters (required
for the policy search) can be computed analytically for each scenario. Algorithm 4
summarizes the policy evaluation step in Algorithm 8 with Pegasus. Note that the
predictive state distribution p(xt+1^(i) | xt^(i), π ∗ (xt^(i))) is a standard GP predictive distri-
bution where xt^(i) is given deterministically, which saves computations compared to
our approximate inference algorithm that requires predictions with uncertain in-
puts. However, Pegasus performs these computations for S scenarios, which can
easily become computationally cumbersome.
It turned out to be difficult for pilco to learn a good policy for the cart-pole
task using Pegasus for the policy evaluation. The combination of Pegasus with the
saturating cost function and the gradient-based policy search using a deterministic
gradient-based optimizer often led to slow learning if a good policy was found at
all.
In our experiments, we used between 10 and 100 scenarios, which might not
be enough to compute V π sufficiently well. From a computational perspective,
we were not able to sample more scenarios, since the learning algorithm using
Figure 3.27: Learning efficiency for the cart-pole task in the absence of expert knowledge. The horizontal
axis references the literature, the vertical axis shows the required interaction time with the system on a
logarithmic scale.
Pegasus and 25 policy searches took about a month. Sometimes, a policy was
learned that led the pendulum close to the target point, but it was not yet able
to stabilize it. However, we believe that Pegasus would have found a good solution
with more policy searches, which would have taken another couple of weeks.
A difference between Pegasus and our approximate inference algorithm using
moment matching is that Pegasus does not approximate the distribution of the
state xt by a Gaussian. Instead, the distribution of a state at time t is represented
by a set of samples. This representation might be more realistic (at least in the
case of many scenarios), but it does not necessarily lead to quicker learning. Using
the sample-based representation, Pegasus does not average over unseen states.
This fact might also be a reason why exploration with Pegasus can be difficult—
and exploration is clearly required in pilco since pilco does not rely on expert
knowledge.
Summarizing, although not tested thoroughly, the Pegasus algorithm did not seem
to be a very successful approximate inference algorithm in the context of the cart-
pole problem. More efficient code would allow for more scenarios. However, we
are sceptical that Pegasus can eventually reach the efficiency of pilco.
Table 3.8: Some cart-pole results in the literature (using no expert knowledge).
Doya (2000) used both model-free and model-based algorithms to learn the cart-pole task. The value function was mod-
eled by an RBF network. If the dynamics model was learned (using a different
RBF network), Doya (2000) showed that the resulting algorithm was more efficient
than the model-free learning algorithm. Coulom (2002) presented a model-free
TD-based algorithm that approximates the value function by a multilayer percep-
tron (MLP). Although the method looks rather inefficient compared to the work
by Doya (2000), better value function models can be obtained. Wawrzynski and Pa-
cut (2004) used multilayer perceptrons to approximate the value function and the
randomized policy in an actor-critic context. Riedmiller (2005) applied the Neural
Fitted Q Iteration (NFQ) to the cart-pole balancing problem without swing up. NFQ
is a generalization of Q-learning by Watkins (1989), where the Q-state-action value
function is modeled by an MLP. The range of interaction times in Table 3.8 depends
on the quality of the Q-function approximation.25 Raiko and Tornio (2005, 2009)
employed a model-based learning algorithm to solve the cart-pole task. The system
model was learned using the Nonlinear dynamical factor analysis (NDFA) proposed
by Valpola and Karhunen (2002). Raiko and Tornio (2009) used NDFA for sys-
tem identification in partially observable Markov decision processes (POMDPs),
where MLPs were used to model both the system equation and the measurement
equation. The results in Table 3.8 are reported for three different control strategies,
direct control, optimistic inference control, and nonlinear model predictive control,
which led to different interaction times required to solve the task. In this thesis,
we showed that pilco requires only 17.5 s to learn the cart-pole task. This means,
pilco requires at least one order of magnitude less interaction than any other RL
algorithm we found in the literature.
3.8.2 Pendubot
We applied pilco to learning a dynamics model and a controller for the Pendubot
task depicted in Figure 3.28. The Pendubot is a two-link, under-actuated robotic
25 NFQ code is available online at http://sourceforge.net/projects/clss/.
Figure 3.28: Pendubot system. The Pendubot is an under-actuated two-link arm, where the inner link
can exert torque. The goal is to swing up both links and to balance them in the inverted position.
arm and was introduced by Spong and Block (1995). The inner joint (attached to the
ground) exerts a torque u, but the outer joint cannot. The system has four contin-
uous state variables: two joint angles and two joint angular velocities. The angles
of the joints, θ2 and θ3 , are measured anti-clockwise from the upright position. The
dynamics of the Pendubot are derived from first principles in Appendix C.3.
The state of the system was given by x = [θ̇2 , θ̇3 , θ2 , θ3 ]> , where θ2 , θ3 are the an-
gles of the inner pendulum and the outer pendulum, respectively (see Figure 3.28),
and θ̇2 , θ̇3 are the corresponding angular velocities. During simulation, the state
was represented as
\[
x = \begin{bmatrix} \dot\theta_2 & \dot\theta_3 & \sin\theta_2 & \cos\theta_2 & \sin\theta_3 & \cos\theta_3 \end{bmatrix}^\top \in \mathbb{R}^6. \tag{3.74}
\]
Initially, the system was expected to be in a state, where both links hung down
(θ2 = π = θ3 ). By applying a torque to the inner joint, the objective was to swing
both links up and to balance them in the inverted position around the target state
(θ2 = 2k2 π, θ3 = 2k3 π, where k2 , k3 ∈ Z) as depicted in the right panel of Fig-
ure 3.28. The Pendubot system is a chaotic and inherently unstable system. A
globally linear controller is not capable of solving the Pendubot task, although it
can be successful locally in balancing the Pendubot around the upright position.
Furthermore, a myopic strategy does not lead to success either.
Typically, two controllers are employed to solve the Pendubot task, one to solve
the swing up and a linear controller for the balancing (Spong and Block, 1995;
Orlov et al., 2008). Unlike this engineered solution, pilco learned a single nonlinear
RBF controller fully automatically to solve both subtasks.
The parameters used for the computer simulation are given in Appendix D.2.
The chosen time discretization ∆t = 0.075 s corresponds to a fairly slow sampling
frequency of 13.3̄ Hz: For comparison, O’Flaherty et al. (2008) chose a sampling
frequency of 2,000 Hz.
Cost Function
Every ∆t = 0.075 s, the squared Euclidean distance
\[
\begin{aligned}
d(x, x_{\mathrm{target}})^2 &= (-l_2 \sin\theta_2 - l_3 \sin\theta_3)^2 + (l_2 + l_3 - l_2\cos\theta_2 - l_3\cos\theta_3)^2 \\
&= l_2^2 + l_3^2 + (l_2+l_3)^2 + 2 l_2 l_3 \sin\theta_2 \sin\theta_3 - 2(l_2+l_3)\,l_2 \cos\theta_2 \\
&\quad - 2(l_2+l_3)\,l_3 \cos\theta_3 + 2 l_2 l_3 \cos\theta_2 \cos\theta_3
\end{aligned}
\tag{3.75}
\]
86 Chapter 3. Probabilistic Models for Efficient Learning in Control
from the tip of the outer pendulum to the target state was measured. Here, the
lengths of the two pendulums are denoted by li with li = 0.6 m, i = 2, 3.
Note that the distance d in equation (3.75) and, therefore, the cost function
in equation (3.43), only depends on the sines and cosines of the angles θi . In
particular, it does not depend on the angular velocities θ̇i and the torque u. An
approximate Gaussian joint distribution p(j) = N (m, S) of the involved states
\[
j := \begin{bmatrix} \sin\theta_2 & \cos\theta_2 & \sin\theta_3 & \cos\theta_3 \end{bmatrix}^\top \tag{3.76}
\]
was computed using the results from Section A.1. The target vector in j-space was
jtarget = [0, 1, 0, 1]> . The matrix T−1 in equation (3.44) was given by
\[
T^{-1} := a^{-2}
\begin{bmatrix}
l_2^2 & 0 & l_2 l_3 & 0 \\
0 & l_2^2 & 0 & l_2 l_3 \\
l_2 l_3 & 0 & l_3^2 & 0 \\
0 & l_2 l_3 & 0 & l_3^2
\end{bmatrix}
= a^{-2}\, C^\top C
\quad\text{with}\quad
C^\top :=
\begin{bmatrix}
l_2 & 0 \\
0 & l_2 \\
l_3 & 0 \\
0 & l_3
\end{bmatrix},
\tag{3.77}
\]
where a controlled the width of the saturating immediate cost function in equa-
tion (3.43). Note that when multiplying (j − jtarget ) from the left and the right
to C> C, the squared Euclidean distance d2 in equation (3.75) is recovered. The
saturating immediate cost was then given as
\[
c(x) = c(j) = 1 - \exp\!\big(-\tfrac{1}{2}\,(j - j_{\mathrm{target}})^\top T^{-1} (j - j_{\mathrm{target}})\big) \in [0, 1]\,. \tag{3.78}
\]
The width a = 0.5 m of the cost function in equation (3.77) was chosen, such that
the immediate cost was about unity as long as the distance between the tip of the
outer pendulum and the target state was greater than the length of both pendu-
lums. Thus, the tip of the outer pendulum had to cross horizontal to reduce the
immediate cost significantly from unity.
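As a numerical sanity check of equations (3.75)–(3.78), the sketch below evaluates the saturating cost for the hanging-down and upright configurations with l2 = l3 = 0.6 m and a = 0.5 m as stated in the text.

```python
import numpy as np

l2, l3, a = 0.6, 0.6, 0.5   # pendulum lengths and cost width from the text

def pendubot_cost(theta2, theta3):
    """Saturating cost of equation (3.78), built from the tip distance (3.75)."""
    d2 = (-l2 * np.sin(theta2) - l3 * np.sin(theta3)) ** 2 \
         + (l2 + l3 - l2 * np.cos(theta2) - l3 * np.cos(theta3)) ** 2
    return 1.0 - np.exp(-0.5 * d2 / a ** 2)

print(pendubot_cost(np.pi, np.pi))   # both links hanging down: cost close to 1
print(pendubot_cost(0.0, 0.0))       # both links upright (target): cost 0
```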
Zero-order-hold Control
By exactly following the steps of Algorithm 2, pilco learned a zero-order-hold
controller, where the control decision could be changed every ∆t = 0.075 s. When
following the learned policy π ∗ , Figure 3.29(a) shows a histogram of the empirical
distribution of the distance d from the tip of the outer pendulum to the inverted
position based on 1,000 rollouts from start positions randomly sampled from p(x0 )
(see Algorithm 3). It took about 2 s to leave the high-cost region represented by
the gray bars. After about 2 s, the tip of the outer pendulum was closer to the
target than its own length in most of the rollouts. In these cases, the tip of the
outer pendulum was certainly above horizontal. After about 2.5 s, the tip of the
outer pendulum came close to the target in the first rollouts, which is illustrated
by the increasing black bars. After about 3 s the black bars “peak” meaning that
at this time point the tip of the outer pendulum was close to the target in almost
(a) Histogram of the distances d from the tip of the outer pendulum to the upright position of 1,000 rollouts. At the end of the horizon, the controller could either solve the problem very well (black bar) or it could not solve it at all, that is, d > l3 (gray bar).
(b) Quantiles of the predictive immediate cost distribution (blue) and the empirical immediate cost distribution (red).
Figure 3.29: Cost distributions for the Pendubot task (zero-order-hold control).
all trajectories. The decrease of the black bars and the increase of the red bars
between 3.1 s and 3.5 s is due to a slight over-swing of the Pendubot. Here, the RBF-
controller had to switch from swinging up to balancing. However, the pendulums
typically did not fall over. After 3.5 s, the red bars vanish, and the black bars level
out at 94%. Like for the cart-pole task (Figure 3.17), the controller either brought
the system close to the target, or it failed completely.
Figure 3.29(b) shows the α-quantiles and the 1 − α-quantiles, α = 0.1, of a
Gaussian approximation of the distribution of the predicted immediate costs c(xt ),
t = 0 s, . . . , 10 s = Tmax (using the controller implementing π ∗ after the last policy
search), and the corresponding empirical cost distribution after 1,000 rollouts. The
medians are described by the solid lines. The quantiles of the predicted cost (blue,
dashed) and the empirical quantiles (red, shaded) are similar to each other. The
quantiles of the predicted cost cover a larger area than the empirical quantiles due
to the Gaussian representation of the immediate cost. The Pendubot required about
1.8 s for the immediate cost to fall below unity. After about 2.2 s, the Pendubot was
balanced in the inverted position. The error bars of both the empirical immediate
cost and the predicted immediate cost declined to close to zero for t ≥ 5 s.
Figure 3.30 shows six examples of the predicted cost and the real cost during
learning the Pendubot task. In Figure 3.30(a), after 23 trials, we see that the learned
controller managed to bring the Pendubot close to the target state. This took about
2.2 s. After that, the error bars of the prediction increased. The prediction horizon
was increased for the next policy search as shown in Figure 3.30(b). Here, the
error bars increased when the time exceeded 4 s. It was predicted that the Pendubot
could not be stabilized. The actual rollout shown in cyan, however, did not incur
much cost at the end of the prediction horizon and was therefore surprising. The
rollout was not explained well by the prediction, which led to learning as discussed
in Section 3.8.1. In Figure 3.30(c), pilco predicted (with high confidence) that
the Pendubot could be stabilized, which was confirmed by the actual rollout. In
Figures 3.30(d)–3.30(f), the prediction horizon keeps increasing until T = Tmax and
(a) Cost when applying a policy based on 66.3 s experience. (b) Cost when applying a policy based on 71.4 s experience.
(c) Cost when applying a policy based on 77.775 s experience. (d) Cost when applying a policy based on 85.8 s experience.
(e) Cost when applying a policy based on 95.85 s experience. (f) Cost when applying a policy based on 105.9 s experience.
Figure 3.30: Predicted cost and incurred immediate cost during learning the Pendubot task (after 23, 24,
25, 26, 27, and 28 policy searches, from top left to bottom right). The x-axis is the time in seconds, the
y-axis is the immediate cost. The black dashed line is the minimum immediate cost. The blue solid line
is the mean of the predicted cost. The error bars show the 95% confidence interval. The cyan solid line is
the cost incurred when the new policy is applied to the system. The prediction horizon T increases when
a low cost at the end of the current horizon was predicted (see line 9 in Algorithm 2). The Pendubot task
could be considered solved after 26 policy searches.
the error bars are getting even smaller. The Pendubot task was considered solved
after 26 policy searches.
Figure 3.31 illustrates a learned solution to the Pendubot task. The learned
controller attempted to keep both pendulums aligned. Substantial reward was
gained after crossing the horizontal. From the viewpoint of mechanics, alignment
of the two pendulums increases the total moment of inertia leading to a faster
swing-up movement. However, it might require more energy for swinging up than
a strategy where the two pendulums are not aligned.26 Since the torque applied to
the inner pendulum was constrained, but not penalized in the cost function defined
in equations (3.77) and (3.78), alignment of the two pendulums was an efficient
strategy for solving the Pendubot task. In a typical successful rollout, the
learned controller swung the Pendubot up and balanced it in an almost exactly
inverted position: the inner joint had a deviation of up to 0.072 rad (4.125 ◦ ), the
26 I thank Toshiyuki Ohtsuka for pointing this relationship out.
Figure 3.31: Illustration of the learned Pendubot task. Six snapshots of the swing up (top left to bottom
right) are shown. The cross marks the target state of the tip of the outer pendulum. The green bar shows
the torque exerted by the inner joint. The gray bar shows the reward (unity minus immediate cost). The
learned controller attempts to keep the pendulums aligned.
outer joint had a deviation of up to −0.012 rad (0.688 ◦ ) from the respective inverted
positions. This non-optimal solution (also shown in Figure 3.31) was maintained
by the inner joint exerting small (negative) torques.
Table 3.9 summarizes the results of the Pendubot learning task for a zero-
order-hold controller. Pilco required between one and two minutes to learn the
Pendubot task fully automatically, which is longer than the time required to learn
the cart-pole task. This is essentially due to the more complicated dynamics, which
required more training data to learn a good model. Like in the cart-pole task,
the learned controller for the Pendubot was fairly robust.
Figure 3.32: Model assumption for multivariate control. The control signals are independent given the
state x. However, when x is unobserved, the controls u1 , . . . , uF covary.
(a) Trajectories of the angle of the inner joint. (b) Trajectories of the angle of the outer joint.
Figure 3.33: Example trajectories of the two angles for the two-link arm with two actuators when applying
the learned controller. The x-axis shows the time, the y-axis shows the angle in radians. The blue solid
lines are predicted trajectories when starting from a given state. The corresponding error bars show the
95% confidence intervals. The cyan lines are the actual trajectories when applying the controller. In both
cases, the predictions are very certain, that is, the error bars are small. Moreover, the actual rollouts are
in correspondence with the predictions.
Table 3.10: Experimental results: Pendubot with two actuators (zero-order-hold control).
The inner pendulum, attached to the ground, swung left and then right and up. By
contrast, the outer pendulum, attached to the tip of the inner one, swung up directly
(Figure 3.33(b)) due to the synergetic effects. Note that both pendulums did not
reach the target state (black dashed lines) exactly; both joints exerted small torques
to maintain the slightly bent posture. In this posture, the tip of the outer pendulum
was very close to the target, which means that it was not costly to maintain the
posture.
Summary Table 3.10 summarizes the results of the Pendubot learning task for a
zero-order-hold RBF controller with two actuators when following the evaluation
setup in Algorithm 3. With an interaction time of about three minutes, pilco
successfully learned a fairly robust controller fully automatically. Note that the
task was essentially learned after 10 trials or 40 s, which is less than half the trials
and about half the interaction time required to learn the Pendubot task with a
single actuator (see Table 3.9).
Figure 3.34: Cart with attached double pendulum. The cart can be pushed to the left and to the right in
order to swing the double pendulum up and to balance it in the inverted position. The target state of the
tip of the outer pendulum is denoted by the green cross.
2003; Graichen et al., 2007). Unlike this engineered solution, pilco learned a single
nonlinear RBF controller to solve both subtasks together.
The parameter settings for the cart-double pendulum system are given in Ap-
pendix D.3. The chosen sampling frequency of 13.3̄ Hz is fairly slow for this kind
of problem. For example, both Alamir and Murilo (2008) and Graichen et al. (2007)
sampled with 1,000 Hz and Bogdanov (2004) sampled with 50 Hz to control the
cart-double pendulum, where Bogdanov (2004), however, solely considered the
stabilization of the system, a problem where the system dynamics are fairly slow.
Cost Function
Every ∆t = 0.075 s, the squared Euclidean distance
\[
\begin{aligned}
d(x, x_{\mathrm{target}})^2 &= \big(x_1 - l_2 \sin\theta_2 - l_3 \sin\theta_3\big)^2 + \big(l_2 + l_3 - l_2\cos\theta_2 - l_3\cos\theta_3\big)^2 \\
&= x_1^2 + l_2^2 + l_3^2 + (l_2+l_3)^2 - 2 x_1 l_2 \sin\theta_2 - 2 x_1 l_3 \sin\theta_3 + 2 l_2 l_3 \sin\theta_2 \sin\theta_3 \\
&\quad - 2(l_2+l_3)\,l_2 \cos\theta_2 - 2(l_2+l_3)\,l_3 \cos\theta_3 + 2 l_2 l_3 \cos\theta_2 \cos\theta_3
\end{aligned}
\tag{3.80}
\]
from the tip of the outer pendulum to the target state was measured, where li =
0.6 m, i = 2, 3, are the lengths of the pendulums. The relevant variables of the
state x were the position x1 and the sines and the cosines of the angles θi . An
approximate Gaussian joint distribution p(j) = N (m, S) of the involved parameters
\[
j := \begin{bmatrix} x_1 & \sin\theta_2 & \cos\theta_2 & \sin\theta_3 & \cos\theta_3 \end{bmatrix}^\top \tag{3.81}
\]
was computed using the results from Appendix A.1. The target vector in j-space
was jtarget = [0, 0, 1, 0, 1]> . The first coordinate of jtarget is the optimal position of
the cart when both pendulums are in the inverted position. The matrix T−1 in
equation (3.44) was given by
\[
T^{-1} = a^{-2}
\begin{bmatrix}
1 & -l_2 & 0 & -l_3 & 0 \\
-l_2 & l_2^2 & 0 & l_2 l_3 & 0 \\
0 & 0 & l_2^2 & 0 & l_2 l_3 \\
-l_3 & l_2 l_3 & 0 & l_3^2 & 0 \\
0 & 0 & l_2 l_3 & 0 & l_3^2
\end{bmatrix}
= a^{-2}\, C^\top C,
\quad
C^\top =
\begin{bmatrix}
1 & 0 \\
-l_2 & 0 \\
0 & l_2 \\
-l_3 & 0 \\
0 & l_3
\end{bmatrix},
\tag{3.82}
\]
where a controlled the width of the saturating immediate cost function in equa-
tion (3.43). The saturating immediate cost was then
\[
c(x) = c(j) = 1 - \exp\!\big(-\tfrac{1}{2}\,(j - j_{\mathrm{target}})^\top T^{-1} (j - j_{\mathrm{target}})\big) \in [0, 1]\,. \tag{3.83}
\]
The width a = 0.5 m of the cost function in equation (3.43) was chosen, such that the
immediate cost was about unity as long as the distance between the tip of the outer
pendulum and the target state was greater than the combined length of both pendulums. Thus,
the tip of the outer pendulum had to cross horizontal to reduce the immediate cost
significantly from unity.
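A short numerical check that the matrix in equation (3.82) reproduces the squared distance in equation (3.80), that is, (j − jtarget)⊤ T−1 (j − jtarget) = d²/a²; the test configuration below is arbitrary.

```python
import numpy as np

l2, l3, a = 0.6, 0.6, 0.5

# C from equation (3.82): d^2 = |C (j - j_target)|^2, j = [x1, sin t2, cos t2, sin t3, cos t3]
C = np.array([[1.0, -l2, 0.0, -l3, 0.0],
              [0.0, 0.0,  l2, 0.0,  l3]])
T_inv = C.T @ C / a ** 2

x1, t2, t3 = 0.3, 2.0, 1.5                       # arbitrary test configuration
j = np.array([x1, np.sin(t2), np.cos(t2), np.sin(t3), np.cos(t3)])
j_target = np.array([0.0, 0.0, 1.0, 0.0, 1.0])

d2 = (x1 - l2 * np.sin(t2) - l3 * np.sin(t3)) ** 2 \
     + (l2 + l3 - l2 * np.cos(t2) - l3 * np.cos(t3)) ** 2   # equation (3.80)
print(np.isclose((j - j_target) @ T_inv @ (j - j_target), d2 / a ** 2))  # True
```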
(a) Histogram of the distances of the tip of the outer pendulum to the target of 1,000 rollouts (distance bins: d ≤ 6 cm, d ∈ (6, 10] cm, d ∈ (10, 60] cm, d > 60 cm).
(b) Quantiles of the predictive immediate cost distribution (blue) and the empirical immediate cost distribution (red).
Figure 3.35: Cost distribution for the cart-double pendulum problem (zero-order-hold control).
Zero-order-hold Control
As described by Algorithm 3, we considered 1,000 controlled trajectories of 20 s
length each to evaluate the performance of the learned controller. The start states of
the trajectories were independent samples from p(x0 ) = N (µ0 , Σ0 ), the distribution
for which the controller was learned.
Figure 3.35 shows cost distributions for the cart-double pendulum learning
task. Figure 3.35(a) shows a histogram of the empirical distribution of the distance
d of the tip of the outer pendulum to the target over 6 s after 1,000 rollouts from
start positions randomly sampled from p(x0 ). The histogram is cut at 6 s since the
cost distribution looks alike for t ≥ 6 s. It took the system about 1.5 s to leave the
high-cost region denoted by the gray bars. After about 1.5 s, in many trajectories,
the tip of the outer pendulum was closer to the target than its own length l3 = 60 cm
as shown by the appearing yellow bars. This means, the tip of the outer pendulum
was certainly above horizontal. After about 2 s, the tip of the outer pendulum
was close to the target for the first rollouts, which is illustrated by the increasing
black bars. After about 2.8 s the black bars “peak” meaning that at this time point
in many trajectories the tip of the outer pendulum was very close to the target
state. The decrease of the black bars and the increase of the red bars between 2.8 s
and 3.2 s is due to a slight overshooting of the cart to reduce the energy in the
system; the RBF controller switched from swinging up to balancing. However, the
pendulums typically did not fall over. After 4 s, the red bars vanish, and the black
bars level out at 99.1%. Like for the cart-pole task (Figure 3.17), the controller either
brought the system close to the target, or it failed completely.
Figure 3.35(b) shows the median and the lower and upper 0.1-quantiles of both
a Gaussian approximation of the predicted immediate cost and the empirical im-
mediate cost over 7.5 s. For the first approximately 1.2 s, both immediate cost dis-
tributions are at unity without variance. Between 1.2 s and 1.875 s the cost distri-
butions transition from a high-cost regime to a low-cost regime with increasing
uncertainty. At 1.875 s, the medians of both the predicted and the empirical cost
distributions have a local minimum. Note that at this point in time, the red bars
in the cost histogram in Figure 3.35(a) start appearing. The uncertainty in both
(a) Position of the cart. The initial uncertainty is very small. After about 1.5 s the cart was slowed down and the predicted uncertainty increased. After approximately 4 s, the uncertainty decreased again.
(b) Angle of the inner pendulum.
(c) Angle of the outer pendulum.
Figure 3.36: Example trajectories of the cart position and the two angles of the pendulums for the cart-
double pendulum when applying the learned controller. The x-axes show the time, the y-axes show the
cart position in meters and the angles in radians. The blue solid lines are the predicted trajectories when
starting from a given state. The dashed blue lines show the 95% confidence intervals. The cyan lines are
the actual trajectories when applying the controller. The actual rollouts agree with the predictions. The
increase in the predicted uncertainty in all three state variables between t = 1.5 s and t = 4 s indicates
the time interval when the controller removed energy from the system to stabilize the double pendulum
at the target state.
the predicted and the empirical immediate cost in Figure 3.35(b) significantly in-
creased between 1.8 s and 2.175 s since the controller had to switch from the swing
up to decelerating the speeds of the cart and both pendulums and balancing. After
t = 2.175 s the error bars and the medians declined toward zero. The error bars
of the predicted immediate cost were generally larger than the error bars of the
empirical immediate cost for two reasons: First, the model uncertainty was taken
into account. Second, the predictive immediate cost distribution was always repre-
sented by its mean and variance only, which ignores the skew of the distribution.
As shown in Figure 3.35(a), the true distribution of the immediate cost had a strong
peak close to zero and some outliers where the controller did not succeed. These
outliers were not predicted by pilco, otherwise the predicted mean would have
been shifted toward unity.
Let us consider a single trajectory starting from a given position x0 . For this
case, Figure 3.36 shows the corresponding predicted trajectories p(xt ) and the ac-
tual trajectories of the position of the cart, the angle of the inner pendulum, and
the angle of the outer pendulum, respectively. Note that for the predicted state dis-
tributions p(xt ) pilco predicted t steps ahead using the learned controller for an
internal simulation of the system—before applying the policy to the real system.
In all three cases, the actual rollout agreed with the predictions. In particular in the
position of the cart in Figure 3.36(a), it can be seen that the predicted uncertainty
grew and declined although no new additional information was incorporated. The
uncertainty increase was exactly during the phase where the controller switched
from swinging the pendulums up to balancing them in the inverted position. Fig-
ure 3.36(b) and Figure 3.36(c) nicely show that the angles of the inner and outer
pendulums were very much aligned from 1 s onward.
Figure 3.37: Sketches of the learned cart-double pendulum task (top left to bottom right). The green cross
marks the target state of the tip of the outer pendulum. The green bars show the force applied to the cart.
The gray bars show the reward (unity minus immediate cost). To reach the target exactly, the cart has to
be directly below the target. The ends of the track the cart is running on denote the maximum applicable
force and the maximum reward (at the right-hand side).
Summary Table 3.11 summarizes the experimental results of the cart-double pen-
dulum learning task. With an interaction time of between one and two minutes,
pilco successfully learned a robust controller fully automatically. Occasional fail-
ures can be explained by encountering unlikely states (according to the predictive
state trajectory) where the policy was not particularly good.
Figure 3.38: Unicycle system. The 5 DoF unicycle consists of a wheel, a frame, and a turntable (flywheel)
mounted perpendicular to the frame. We assume the wheel of the unicycle rolls without slip on a
horizontal plane along the x0 -axis (x-axis rotated by the angle φ around the z-axis of the world-coordinate
system). The contact point of the wheel with the surface is [xc , yc ]> . The wheel can fall sideways, that is,
it can be considered a body rotating around the x0 -axis. The sideways tilt is denoted by θ. The frame can
fall forward/backward in the plane of the wheel. The angle of the frame with respect to the axis z 0 is ψf .
The rotational angles of the wheel and the turntable are given by ψw and ψt , respectively.
The equations of motion are independent of the absolute position of the contact point [xc , yc ] of the unicycle,
which is irrelevant for stabilization. Thus, the dynamics of the robotic unicycle
can be described by ten coupled first-order ordinary differential equations, see the
work by Forster (2009) for details. The state of the unicycle system was given as
\[
x = \begin{bmatrix} \dot\theta & \dot\phi & \dot\psi_w & \dot\psi_f & \dot\psi_t & \theta & \phi & \psi_w & \psi_f & \psi_t \end{bmatrix} \in \mathbb{R}^{10}. \tag{3.84}
\]
Pilco represented the state x as an R15 -vector
\[
\begin{bmatrix} \dot\theta, \dot\phi, \dot\psi_w, \dot\psi_f, \dot\psi_t, \sin\theta, \cos\theta, \sin\phi, \cos\phi, \sin\psi_w, \cos\psi_w, \sin\psi_f, \cos\psi_f, \sin\psi_t, \cos\psi_t \end{bmatrix}^\top \tag{3.85}
\]
representing angles by their sines and cosines. The objective was to balance the
unicycle, that is, to prevent it from falling over.
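The mapping from the state in equation (3.84) to the representation in equation (3.85) is a simple deterministic transformation; a minimal sketch (the function name is ours):

```python
import numpy as np

def unicycle_representation(x):
    """Map the state of equation (3.84), [5 angular velocities, 5 angles], to the
    15-dimensional representation of equation (3.85): the velocities are kept and
    each angle is replaced by its sine and cosine."""
    x = np.asarray(x, dtype=float)
    velocities, angles = x[:5], x[5:]
    trig = np.stack([np.sin(angles), np.cos(angles)], axis=1).ravel()
    return np.concatenate([velocities, trig])

print(unicycle_representation(np.zeros(10)).shape)  # (15,)
```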
Remark 5 (Non-holonomic constraints). The assumption that the unicycle rolls
without slip induces non-integrable constraints on the velocity variables and makes
the unicycle a non-holonomic vehicle (Bloch et al., 2003). The non-holonomic con-
straints reduce the number of the degrees of freedom from seven to five. Note that
we only use this assumption to simplify the idealized dynamics model for data
generation; pilco does not incorporate the knowledge of whether the wheel slips or not.
The target application we have in mind is to learn a stabilizing controller for
balancing the robotic unicycle in the Department of Engineering, University of
Cambridge, UK. A photograph of the robotic unicycle, which is assembled accord-
ing to the descriptions above, is shown in Figure 3.39.27 In the following, however,
we only consider an idealized computer simulation of the robotic unicycle. Learning
the controller using data from the hardware realization of the unicycle remains
future work.
The robotic unicycle is a challenging control problem due to its intrinsically
nonlinear dynamics. Without going into details, following a Lagrangian approach
to deriving the equations of motion, the unicycle’s non-holonomic constraints on
the speed variables [ẋc , ẏc ] were incorporated into the remaining state variables.
The state ignores the absolute position of the contact point of the unicycle, which
is irrelevant for stabilization.
We employed a linear policy inside pilco for the stabilization of the robotic
unicycle and followed the steps of Algorithm 2. In contrast to the previous learning
tasks in this chapter, 15 trajectories with random torques were used to initialize the
dynamics model. Furthermore, we aborted the simulation when the sideways tilt θ
of the wheel or the angle ψf of the frame exceeded an angle of π/3. For θ = π/2 the
unicycle would lie flat on the ground. The fifteen initial trajectories were typically
short since the unicycle quickly fell over when applying torques randomly to the
wheel and the turntable.
Cost Function
The objective was to balance the unicycle. Therefore, the tip of the unicycle should
have a z-coordinate of rw + rf , that is, the radius rw of the wheel plus the length rf
of the frame.
27 Detailed information about the project can be found at http://www.roboticunicycle.info/.
Figure 3.39: Robotic unicycle in the Department of Engineering, University of Cambridge, UK. Photograph reproduced with permission from http://www.roboticunicycle.info/.
Every ∆t = 0.05 s,
the squared Euclidean distance
\[
\begin{aligned}
d(x, x_{\mathrm{target}})^2 &= \big(\underbrace{r_w + r_f}_{\text{upright}} -\, r_w \cos\theta - r_f \cos\theta \cos\psi_f\big)^2 \\
&= \big(r_w + r_f - r_w \cos\theta - \tfrac{r_f}{2}\cos(\theta - \psi_f) - \tfrac{r_f}{2}\cos(\theta + \psi_f)\big)^2
\end{aligned}
\tag{3.86}
\]
from the tip of the unicycle to the upright position (a z-coordinate of rw + rf ) was
measured. The squared distance in equation (3.86) did not penalize the position
of the contact point of the unicycle since the task was to balance the unicycle
somewhere. Note that d only depends on the angles θ (sideways tilt of the wheel)
and ψf (tilt of the frame in the hyperplane of the tilted wheel). In particular, d does
not depend on the angular velocities and the applied torques u.
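The two forms of equation (3.86), before and after the product-to-sum rewriting, can be checked numerically; the wheel radius and frame length below are illustrative values, not the ones used in the simulation.

```python
import numpy as np

r_w, r_f = 0.2, 0.5   # wheel radius and frame length (illustrative values only)

def distance_sq_product_form(theta, psi_f):
    """Squared distance of the unicycle tip to the upright position, eq. (3.86)."""
    return (r_w + r_f - r_w * np.cos(theta) - r_f * np.cos(theta) * np.cos(psi_f)) ** 2

def distance_sq_sum_form(theta, psi_f):
    """Same quantity after applying cos(a)cos(b) = (cos(a-b) + cos(a+b)) / 2."""
    return (r_w + r_f - r_w * np.cos(theta)
            - 0.5 * r_f * np.cos(theta - psi_f)
            - 0.5 * r_f * np.cos(theta + psi_f)) ** 2

theta, psi_f = 0.3, -0.2
print(np.isclose(distance_sq_product_form(theta, psi_f),
                 distance_sq_sum_form(theta, psi_f)))  # True
```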
The state variables that were relevant to compute the squared distance in equa-
tion (3.86) were the cosines of θ and the difference/sum of the angles θ and ψf .
Therefore, we defined χ := θ − ψf and ξ := θ + ψf . An approximate Gaussian
distribution p(j) = N (m, S) of the state variables
\[
j = \begin{bmatrix} \cos\theta & \cos\chi & \cos\xi \end{bmatrix}^\top \tag{3.87}
\]
that are relevant for the computation of the cost function was computed using the
results from Appendix A.1. The target vector in j-space was jtarget = [1, 1, 1]> .
Figure 3.40: Histogram of the distances from the top of the unicycle to the fully upright position after
1,000 test rollouts.
The matrix T−1 in equation (3.44) was defined analogously to equations (3.77)
and (3.82), such that (j − jtarget )> T−1 (j − jtarget ) = d2 /a2 , where a = 0.1 m controlled
the width of the saturating cost function in equation (3.43). Note that almost the
full cost was incurred if the tip of the unicycle exceeded a distance of 20 cm from
the upright position.
Zero-order-hold Control
Pilco followed the steps described in Algorithm 2 to automatically learn a dynam-
ics model and a (linear) controller to balance the robotic unicycle.
As described in Algorithm 3, the controller was tested in 1,000 independent
runs of 30 s length each starting from a randomly drawn initial state x0 ∼ p(x0 )
with p(x0 ) = N (0, 0.252 I). Note that the distribution p(x0 ) of the initial state was
fairly wide. The (marginal) standard deviation for each angle was 0.25 rad ≈ 15 ◦ .
A test run was aborted in case the unicycle fell over.
Figure 3.40 shows a histogram of the empirical distribution of the distance d
from the top of the unicycle to the upright position over 5 s after 1,000 rollouts
from random start positions sampled from p(x0 ). The histogram is cut at t = 5 s
since the cost distribution does not change afterward. The histogram distinguishes
between states close to the target (black bars), states fairly close to the upright
position (red bars), states that might cause a fall-over (yellow bars), and states,
where the unicycle already fell over or could not be prevented from falling over
(gray bars). The initial distribution of the distances gives an intuition of how far
the random initial states were from the upright position: In approximately 80% of
the initial configurations, the top of the unicycle was closer than ten centimeters
to the upright position. About 20% of the states had a distance between ten and
fifty centimeters to the upright position. Within the first 0.4 s, the distances to
the target state grew for many states that used to be in the black regime. Often,
this depended on the particular realization of the sampled joint configuration of
the angles. Most of the states that were previously in the black regime moved to
the red regime. Additionally, some states from the red regime became parts of
the yellow regime of states. In some cases, the initial configuration was so bad
that a fall-over could not be prevented, which is indicated by the gray bars.
Between 0.4 s and 0.7 s, the black bars increase and the yellow bars almost vanish.
The yellow bars vanish since either the state could be controlled (yellow becomes
red) or it could not and the unicycle fell over (yellow becomes gray). The heights of
the black bars increase since some of the states in the red regime got closer to the
target again. After about 1.2 s, the result is essentially binary: Either the unicycle
fell over or the linear controller managed to balance it very close to the desired
upright position. The success rate was approximately 93%.
In a typical successful run, the learned controller kept the unicycle upright, but
drove it straight ahead with relatively high speed. Intuitively, this solution makes
sense: Driving the unicycle straight ahead leads to more mechanical stability than
just keeping it upright, due to the conservation of the angular momentum. The
same effect can be experienced in ice-skating or riding a bicycle, for example. When
just keeping the unicycle upright (without rolling), the unicycle can fall into all
directions. By contrast, a unicycle rolling straight ahead is unlikely to fall sideways.
Summary Table 3.12 summarizes the results of the unicycle task. The interaction
time of 32.85 s was sufficient to learn a fairly robust (linear) policy. After 23 trials
(15 of which were random) the task was essentially solved. In about 7% of 1,000
test runs (each of length 30 s) the learned controller was incapable of balancing the 5
DoF unicycle starting from a randomly drawn initial state x0 ∼ N (0, 0.25² I). Note,
however, that the covariance matrix Σ0 allowed for some initial states where the
angles deviated by more than 30° from the upright position. Bringing the unicycle
upright from these extreme angles was sometimes impossible due to the torque
constraints.
Although training a GP on a training set of 500 data points can be done in a
short time (see Section 2.3.4 for the computational complexity), repeated predictions
during approximate inference for policy evaluation (Section 3.5) and the compu-
tation of the derivatives for the gradient-based policy search (Section 3.6) become
computationally expensive: On top of the computations required for multivariate
predictions with uncertain inputs (see Section 2.3.4), computing the derivatives of
the predictive covariance matrix Σt with respect to the covariance matrix Σt−1 of
the input distribution and with respect to the policy parameters ψ of the RBF pol-
icy requires O(F 2 n2 D) operations per time step, where F is the dimensionality
of the control signal to be applied, n is the size of the training set, and D is the
dimension of the training inputs. Hence, repeated prediction and derivative com-
putation for planning and policy learning become very demanding although the
computational effort scales linearly with the prediction horizon T .28 Therefore, we
use sparse approximations to speed up dynamics training, policy evaluation, and
the computation of the derivatives.
When the locations of the inducing inputs and the kernel hyper-parameters
are optimized jointly, the FITC sparse approximation proposed by Snelson and
Ghahramani (2006) and Snelson (2007) fits the model well, but its predictions can
be poor. In our experience, it can suffer from overfitting, indicated by a learned
noise-variance hyper-parameter that is too small by several orders of magnitude.
By contrast, the recently proposed algorithm by Titsias (2009) attempts to
avoid overfitting but can suffer from underfitting. As mentioned in the beginning
of this section, the main computational burden arises from repeated predictions
and computations of the derivatives, but not necessarily from training the GPs. To
avoid the issues with overfitting and underfitting, we train the full GP to obtain the
28 In case of stochastic transition dynamics, that is, $x_t = f(x_{t-1}, u_{t-1}) + w_t$, where $w_t \sim \mathcal{N}(0, \Sigma_w)$ is a random offset that
affects the state of the system, the derivatives with respect to the distribution of the previous state still require O(E²n²D)
arithmetic operations per time step, where E is the dimension of the GP training targets. However, when using a stochastic
policy, the derivatives with respect to the policy parameters require O(F²n²D + Fn³) arithmetic operations per time step.
(a) Initial placement of basis functions. The blue cross is a new data point. (b) A better model can require
moving the red-dashed basis function to the location of the desired basis function (blue).
Figure 3.41: The FITC sparse approximation encounters optimization problems when moving the location
of an unimportant basis function “through” other basis functions if the locations of these other basis
functions are crucial for the model quality. These problems are due to the gradient-based optimization of
the basis function locations.
typically fixed a priori. The panel on the right-hand side of Figure 3.41 contains
the same four basis functions, one of which is dashed and red. Let us assume that
the blue basis function is the optimal location of the red-dashed basis function in
order to model the data set after obtaining new experience. If the black basis func-
tions are crucial to model the latent function, it is difficult to move the red-dashed
basis function to the location of the blue basis function using a gradient-based op-
timizer. To sidestep this problem in our implementation, we followed Algorithm 5.
The main idea of the heuristic in Algorithm 5 is to replace some pseudo training
inputs x̄i ∈ X̄ with “real” inputs xj ∈ X from the full training set if the model
improves. In the context of Figure 3.41, the red Gaussian corresponds to x̄i∗ in
line 8. The blue Gaussian is an input xj of the full training set and represents a
potentially “better” location of a pseudo input. The quality of the model is eval-
uated by the sparse nlml-function (lines 2, 5, and 10) that computes the negative
log-marginal likelihood (negative log-evidence) in equation (2.78) for the sparse
GP approximation. The log-marginal likelihood can be evaluated efficiently since
swapping a data point in or out only requires a rank-one-update/downdate of the
low-rank approximation of the kernel matrix.
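The following Python sketch illustrates the flavor of this swapping heuristic; it is not Algorithm 5 itself. It greedily tries to replace pseudo inputs by real training inputs whenever a placeholder sparse negative log-marginal likelihood improves, and it recomputes the objective from scratch instead of using the rank-one updates mentioned above; the function names and the toy objective are hypothetical.

    import numpy as np

    def swap_heuristic(X_pseudo, X_full, nlml):
        """Greedy sketch of a pseudo-input swapping heuristic: replace a pseudo
        input by a real training input if the sparse negative log-marginal
        likelihood (nlml) improves. Rank-one updates/downdates are omitted."""
        best = nlml(X_pseudo)
        for x_new in X_full:
            for i in range(len(X_pseudo)):
                candidate = X_pseudo.copy()
                candidate[i] = x_new
                score = nlml(candidate)
                if score < best:
                    X_pseudo, best = candidate, score
        return X_pseudo, best

    # Toy usage with a stand-in objective (sum of squared distances to the data).
    X_full = np.random.randn(50, 2)
    X_pseudo = np.random.randn(5, 2)
    toy_nlml = lambda Xp: ((X_full[:, None, :] - Xp[None, :, :]) ** 2).sum(-1).min(1).sum()
    X_pseudo, score = swap_heuristic(X_pseudo, X_full, toy_nlml)
    print(score)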
We emphasize that the problems in the sequential-data setup stem from the
locality of the Gaussian basis function. Stepping away from local basis functions to
global basis functions should avoid the problems with sequential data completely.
(a) The true problem is a POMDP with deterministic latent transitions. The hidden states xt form a Markov chain. The measurements zt of the hidden states are corrupted by additive Gaussian noise ν. The applied control signal is a function of the state xt and is denoted by ut.
(b) Stochastic MDP. There are no latent states x in this model (in contrast to Panel (a)). Instead it is assumed that the measured states zt form a Markov chain and that the control signal ut directly affects the measurement zt+1. The measurements are corrupted by Gaussian system noise ε, which makes the assumed MDP stochastic.
(c) Implications for the true POMDP. The control decision u does not directly depend on the hidden state x, but on the observed state z. However, the control does affect the latent state x, in contrast to the simplified assumption in Panel (b). Thus, the measurement noise from Panel (b) propagates through as system noise.
Figure 3.42: True POMDP, simplified stochastic MDP, and its implication for the true POMDP (without in-
curring costs). Hidden states, observed states, and control signals are denoted by x, z, and u, respectively.
Panel (a) shows the true POMDP. Panel (b) is the graphical model when the simplifying assumption of
the absence of a hidden layer is employed. This means that the POMDP with deterministic transitions is
transformed into an MDP with stochastic transitions. Panel (c) illustrates the effect of this simplifying
assumption on the true POMDP.
$x_t = f(x_{t-1}, u_{t-1})\,, \qquad z_t = x_t + \nu_t\,, \quad \nu_t \sim \mathcal{N}(0, \Sigma_\nu)\,, \qquad (3.89)$
Inferring a generative model governing the latent Markov chain is a hard problem
that is closely related to nonlinear system identification in a control context. If we
assume the measurement function in equation (3.89) and a small covariance matrix
Σν of the noise, we pretend the hidden layer of states xt in Figure 3.42(a) does not
exist. Thus, we approximate the POMDP by an MDP, where the (autoregressive)
transition dynamics are given by
$z_t = \tilde{f}(z_{t-1}, u_{t-1}) + \varepsilon_t\,, \quad \varepsilon_t \sim \mathcal{N}(0, \Sigma_\varepsilon)\,, \qquad (3.90)$
where εt is white independent Gaussian noise. The graphical model for this setup
is shown in Figure 3.42(b). When the model in equation (3.90) is used, the con-
trol signal ut is directly related to the (noisy) observed state zt , and no longer a
function of the hidden state xt . Furthermore, in the model in Figure 3.42(b), the
control signal ut directly influences the consecutive observed state zt+1 . Therefore,
the noise in the observation at time t directly translates into noise in the control
signal. If this noise is additive, the measurement noise ν t in equation (3.89) can be
considered system noise εt in equation (3.90). Hence, we approximate the POMDP
in equation (3.89) by a stochastic MDP in equation (3.90).
Figure 3.42(c) illustrates the effect of this simplified model on the true under-
lying POMDP in Figure 3.42(a). The control decision ut is based on the observed
state zt . However, unlike in the assumed model in Figure 3.42(b), the control in
Figure 3.42(c) does not directly affect the next consecutive observed state zt+1 , but
only indirectly through the hidden state xt+1 . When the simplified model in equa-
tion (3.90) is employed, both zt−1 and zt can be considered samples, either from
N (f (xt−2 , ut−2 ), Σν ) or from N (f (xt−1 , ut−1 ), Σν ). Thus, the variance of the noise ε
in Figure 3.42(b) must be larger than the variance of the measurement noise ν in
Figure 3.42(c). In particular, Σε = 2 Σν + 2 cov[zt−1 , zt ], which makes the learning
problem harder compared to having direct access to the hidden state. Note that
the measurements zt−1 and zt are not uncorrelated since the noise εt−1 in state zt−1
does affect zt through the control signal ut−1 (Figure 3.42(c)).
Hence, the presented approach of approximating the POMDP with determin-
istic latent transitions by an MDP with stochastic transitions is only applicable if
the covariance Σν is small and all state variables are measured.
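As an illustration of how such an approximate MDP model could be trained in practice, the following sketch arranges an observed rollout into regression inputs (z_{t−1}, u_{t−1}) and targets z_t for the autoregressive dynamics in equation (3.90); the function name and array shapes are illustrative only.

    import numpy as np

    def autoregressive_training_set(Z, U):
        """Arrange a rollout of observed states Z = [z_0, ..., z_T] and controls
        U = [u_0, ..., u_{T-1}] as a regression data set for the approximate MDP
        model z_t = f_tilde(z_{t-1}, u_{t-1}) + eps_t in equation (3.90)."""
        inputs = np.hstack([Z[:-1], U])   # (z_{t-1}, u_{t-1})
        targets = Z[1:]                   # z_t
        return inputs, targets

    # Toy rollout: 3-dimensional observed state, 1-dimensional control.
    Z = np.random.randn(11, 3)
    U = np.random.randn(10, 1)
    X, Y = autoregressive_training_set(Z, U)
    print(X.shape, Y.shape)   # (10, 4) (10, 3)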
3.10 Discussion
Strengths. Pilco’s major strength is that it is general, practical, robust, and co-
herent since it carefully models uncertainties. Pilco learns from scratch in the
absence of expert knowledge fully automatically; only general prior knowledge is
required.
Pilco is based on fairly simple, but well-understood approximations: For in-
stance, all predictive distributions are approximated by unimodal Gaussian dis-
tributions, one of the simplest approximations one can make. To faithfully de-
scribe model uncertainty, pilco employs a distribution over functions instead of
a commonly used point estimate. With a Gaussian process model, pilco uses the
simplest realization of a distribution over functions.
The three ingredients required for finding an optimal policy (see Figure 3.5) us-
ing pilco, namely the probabilistic dynamics model, the saturating cost function,
and the indirect policy search algorithm form a successful and efficient RL frame-
work. Although it is difficult to tear them apart, we provided some evidence that
the probabilistic dynamics model and in particular the incorporation of the model
uncertainty into the planning and the decision-making process, are essential for
pilco’s success in learning from scratch.
Model-bias motivates the use of data-inefficient model-free RL and is a strong
argument against data-efficient model-based RL when learning from scratch. Due
to incorporation of a distribution over all plausible dynamics models into planning
and policy learning, pilco is a conceptual and practical framework for reducing
model bias in RL.
Pilco was able to learn complicated nonlinear control tasks from scratch. Pilco
achieves an unprecedented learning efficiency (in terms of required interactions)
and an unprecedented degree of automation for all control tasks presented in
this chapter. To the best of our knowledge, pilco is the first learning method that
can learn the cart-double pendulum problem a) without expert knowledge and b)
with only a single nonlinear controller. We demonstrated that pilco can directly
be applied to hardware and tasks with multiple actuators.
Pilco is not restricted to comprehensible mechanical control problems, but it
can theoretically also be applied to control of more complicated mechanical con-
trol systems, biological and chemical process control, and medical processes, for
example. In these cases, pilco would profit from its generality and from the fact
that it does not rely on expert knowledge: Modeling slack, protein interactions,
or responses of humans to drug treatments are just a few examples, where non-
parametric Bayesian models can be superior to parametric approaches although
the physical and biological interpretations are not directly given.
Current limitations. Pilco learns very fast in terms of the amount of experience
(interactions with the system) required to solve a task, but the computational de-
mand is not negligible. In our current implementation, a single policy search for a
typically-sized data set takes on the order of thirty minutes CPU time on a standard
PC. Performance can certainly be improved by writing more efficient and parallel
code. Recently, graphics processing units (GPUs) have been shown promising for
demanding computations in machine learning. Catanzaro et al. (2008) used them
in the context of support vector machines whereas Raina et al. (2009) applied them
to large-scale deep learning. Nevertheless, it is not obvious that our algorithms
can necessarily learn in real time, which would be required to move from batch
learning to online learning. However, once the policy has been learned, the com-
putational requirements of applying the policy to a control task are fairly trivial
and real-time capable as demonstrated in Section 3.8.1.
Thus far, pilco can only deal with unconstrained state spaces. A principled
incorporation of prior knowledge about constraints such as obstacles in the state
space is left to future work. We might be able to adopt ideas from approximate
inference control discussed by Toussaint (2009) and Toussaint and Goerick (2010).
Parameters. The parameters to be set for each task are essentially described in
the upper half of Table D.1: the time discretization ∆t , the width a of the immediate
cost, the exploration parameter b, and the prediction horizon. We give some rule-
of-thumb heuristics for how we chose these parameters, although the algorithms are
fairly robust against other parameter choices. The key problem is to find the right
order of magnitude of the parameters.
The time discretization is set somewhat faster than the characteristic frequency
of the system. The tradeoff with the ∆t -parameter is that for small ∆t the dynamics
can be learned fairly easily, but more prediction steps are needed resulting in
higher computational burden. Thus, we attempt to set ∆t to a large value, which
presumably requires more interaction time to learn the dynamics. The width of
the saturating cost function should be set in a way that the cost function can easily
distinguish between a “good” state and a “bad” state. Making the cost function
too wide can result in numerical problems. In the experiments discussed in this
dissertation, we typically set the exploration parameter to a small negative value,
say, b ∈ [−0.5, 0]. Besides the exploration effect, a negative value of b smoothes
the value function out and simplifies the optimization problem. First experiments
with the cart-double pendulum indicated that a negative exploration parameter
simplifies learning. Since we have no representative results yet, we do not discuss
this issue thoroughly in this thesis. The (initial) prediction horizon Tinit should be
set in a way that the controller can approximately solve the task. Thus, Tinit is also
related to the characteristic frequency of the system. Since the computational effort
increases linearly with the length of the prediction horizon, shorter horizons are
desirable in the early stages of learning when the dynamics model is still fairly
poor. Furthermore, the learning task is difficult for long horizons since it is easy to
lose track of the state in the early stages of learning.
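Purely as an illustration, the following snippet collects these quantities in one place; the numbers are placeholders and not the task-specific values listed in Table D.1.

    # Illustrative placeholder settings only; the actual per-task values are listed
    # in Table D.1. Rules of thumb from the text: Delta_t somewhat faster than the
    # characteristic frequency, cost width a separating "good" from "bad" states,
    # a small negative exploration parameter b, and an initial horizon T_init over
    # which the task can approximately be solved.
    params = {
        "dt": 0.05,           # time discretization Delta_t in seconds
        "cost_width": 0.1,    # width a of the saturating cost
        "exploration": -0.2,  # exploration parameter b, chosen in [-0.5, 0]
        "T_init": 2.0,        # initial prediction horizon in seconds
    }
    print(params)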
Linearity of the policy. Strictly speaking, the policy based on a linear model (see
equation (3.11)) is nonlinear since the linear function (the preliminary policy) is
squashed through the sine function to account for constrained control signals in
a fully Bayesian way (we can compute predictive distributions after squashing the
preliminary policy).
Noise in the policy training set and policy parameterization. The pseudo-training
targets yπ = π̃(Xπ ) + επ , επ ∼ N (0, Σπ ), for the RBF policy in equation (3.14) are
considered noisy. We optimize the training targets (and the corresponding noise
variance), such that the fitted RBF policy minimizes the expected long-term cost
in equation (3.2) or likewise in equation (3.50). The pseudo-targets yπ do not
necessarily have to be noisy since they are not real observations. However, noisy
pseudo-targets smooth the latent function out and make policy learning easier.
The parameterization of the RBF policy via the mean function of a GP is
unusual. A “standard” RBF is usually given as
$\sum_{i=1}^{N} \beta_i\, k(x_i, x_*)\,, \qquad (3.91)$
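A minimal sketch of such an RBF policy, squashed through a sine as described above, is given below. The centers, weights, and length-scales are hypothetical, and the exact squashing function used in this thesis may differ in its scaling.

    import numpy as np

    def se_kernel(x, xi, lengthscales):
        """Squared exponential kernel with automatic relevance determination."""
        diff = (x - xi) / lengthscales
        return np.exp(-0.5 * np.dot(diff, diff))

    def rbf_preliminary_policy(x, centers, beta, lengthscales):
        """Preliminary RBF policy of the form sum_i beta_i k(x_i, x), cf. equation (3.91)."""
        return sum(b * se_kernel(x, xi, lengthscales) for b, xi in zip(beta, centers))

    def squashed_policy(x, centers, beta, lengthscales, u_max):
        """Squash the preliminary policy through a sine to respect torque limits;
        the scaling here is illustrative only."""
        return u_max * np.sin(rbf_preliminary_policy(x, centers, beta, lengthscales))

    centers = np.random.randn(10, 4)   # hypothetical support points
    beta = np.random.randn(10)         # hypothetical weights
    lengthscales = np.ones(4)
    print(squashed_policy(np.zeros(4), centers, beta, lengthscales, u_max=2.0))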
Model of the transition dynamics. From a practical perspective, the main chal-
lenge in learning for control seems to be finding a good model of the transition
dynamics, which can be used for internal simulation. Many model-based RL algo-
rithms can be applied when an “accurate” model is given. However, if the model
does not coherently describe the system, the policy found by the RL algorithm
can be arbitrarily bad. The probabilistic GP model appropriately represents the
transition dynamics: Since the dynamics GP can be considered a distribution over
all models that plausibly explain the experience (collected in the training set), in-
corporation of novel experience usually does not make previously plausible mod-
els implausible. By contrast, if a deterministic model is used, incorporation of
novel experience always changes the model and, therefore, makes plausible mod-
els implausible and vice versa. We observed that this model change can have a
strong influence on the optimization procedure and is an additional reason why
deterministic models and gradient-based policy search algorithms do not fit well
together.
The dynamics GP model, which models the general input-output behavior, can
be considered an efficient machine learning approach to non-parametric system
identification. All involved parameters are implicitly determined. A drawback
(from a system engineer’s point of view) of a GP is that the hyper-parameters in a
non-parametric model do not usually yield a mechanical or physical interpretation.
If some parameters in system identification cannot be determined with cer-
tainty, classic robust control (minimax/H∞ -control) aims to minimize the worst-
case error. This methodology often leads to suboptimal and conservative solu-
tions. Possibly, a fully probabilistic Gaussian process model of the system dy-
namics can be incorporated as follows: As the GP model coherently describes
the uncertainty about the underlying function, it implicitly covers all transition
dynamics that plausibly explain observed data. By Bayesian averaging over all
these models, we appropriately treat uncertainties and can potentially bridge the
gap between optimal and robust control. GPs for system identification and robust
model predictive control have been employed for example by Kocijan et al. (2004),
Murray-Smith et al. (2003), Murray-Smith and Sbarbaro (2002), Grancharova et al.
(2008), or Kocijan and Likar (2008).
given in the book by Bishop (2006, Chapter 10.1). The moment-matching approx-
imation employed is equivalent to a unimodal approximation using Expectation
Propagation (Minka, 2001b).
Unimodal distributions are usually fairly bad representations of state distri-
butions at symmetry-breaking points. Consider for example a pendulum in the
inverted position: It can fall to the left and to the right with equal probability. We
approximate this bimodal distribution by a Gaussian with high variance. When we
predict, we have to propagate this Gaussian forward and we lose track of the state
very quickly. However, we can control the state by applying actions to the system.
We are interested in minimizing the expected long-term cost. High variances are
therefore not favorable in the long term. Our experience is that the controller en-
sures that it decides on one mode and completely ignores the other mode of the
bimodal distribution. Hence, the symmetry can be broken by applying actions to
the system.
The projection of the predictive distribution of a GP with uncertain inputs
onto a unimodal Gaussian is a simplification in general since the true distribution
can easily be multi-modal (see Figure 2.6). If we want to consider and propagate
multi-modal distributions in a time series, we need to compute a multi-modal pre-
dictive distribution from a multi-modal input distribution. Consider a multi-modal
distribution p(x). It is possible to compute a multi-modal predictive distribution
following the results from Section 2.3.2 for each mode. Expectation Correction by Barber
(2006) leads into this direction. However, the multi-modal predictive distribution
is generally not an optimal29 multi-modal approximation of the true predictive dis-
tribution. Finding an optimal multi-modal approximation of the true distribution
is an open research problem. Only in the unimodal case can we easily find a uni-
modal approximation of the predictive distribution in the exponential family that
minimizes the Kullback-Leibler divergence between the true distribution and the
approximate distribution: the Gaussian with the mean and the covariance of the
true predictive distribution.
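A small numerical illustration of this moment matching: the KL-optimal Gaussian approximation of a bimodal mixture (the pendulum falling left or right with equal probability) simply takes the mean and variance of the mixture, yielding a wide unimodal distribution. The mixture parameters below are made up for the example.

    import numpy as np

    # Moment-match a bimodal Gaussian mixture by the single Gaussian that
    # minimizes KL(p || q): it has the mixture's mean and variance.
    means = np.array([-1.0, 1.0])
    variances = np.array([0.1, 0.1])
    weights = np.array([0.5, 0.5])

    mean = np.sum(weights * means)
    variance = np.sum(weights * (variances + means ** 2)) - mean ** 2
    print(mean, variance)   # 0.0 and 1.1: a wide unimodal approximation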
Incorporation of prior knowledge. Prior knowledge about the policy and the tran-
sition dynamics can be incorporated easily: A good-guess policy or a demonstra-
tion of the task can be used instead of a random initialization of the policy. In a
practical application, if idealized transition dynamics are known, they can be used
29 In this context, “optimality” corresponds to a minimal Kullback-Leibler divergence between the true distribution and the approximate distribution.
as a prior mean function as proposed for example by Ko et al. (2007a) and Ko and
Fox (2008, 2009a,b) in the context of RL and mechanical control. In this thesis, we
used a prior mean that was zero everywhere.
Curriculum learning. Humans and animals learn much faster when they learn in
small steps: A complicated task can be learned faster if it is similar to an easy
task that has been learned before. In the context of machine learning, Bengio et al.
(2009) call this type of learning curriculum learning. Curriculum learning can be
considered a continuation method across tasks. In continuation methods, a diffi-
cult optimization problem is solved by first solving a much simpler initial problem
and then, step by step, shaping the simple problem into the original problem by
tracking the solution to the optimization problem. Details on continuation methods
can be found in the paper by Richter and DeCarlo (1983). Bengio et al. (2009) hy-
pothesize that curriculum learning improves the speed of learning and the quality
of the solution to the complicated task.
In the context of learning motor control, we can apply curriculum learning by
learning a fairly easy control task and initialize a more difficult control task with
the solution of the easy problem. For example, if we first learn to swing up and
balance a single pendulum, we can exploit this knowledge when learning to swing
up and balance a double pendulum. Curriculum learning for control tasks is left
to future work.
zt = g(xt ) + vt , (3.93)
where vt is a noise term. Suppose a GP model for f is given. For the inter-
nal simulation of the system (Figure 3.3(b) and intermediate layer in Figure 3.4),
our learning framework can be applied without any practical changes: We simply
need to predict multiple steps ahead when the initial state is uncertain—this is
done already in the current implementation. The difference is simply where the
initial state distribution originates from. Right now, it represents a set of possible
initial states; in the POMDP case it would be the prior on the initial state. Dur-
ing the interaction phase (see Figure 3.3(a)), where we obtain noisy and partial
measurements of the latent state xt , it is advantageous to update the predicted
state distribution p(xt ) using the measurements z1:t . To do so, we require efficient
filtering algorithms suitable for GP transition functions and potentially GP mea-
surement functions. The distribution p(xt |z1:t ) can be used to compute a control
signal applied to the system (for example the mean of this distribution). The re-
maining problem is to determine the GP models for the transition function f (and
potentially the measurement function g). This problem corresponds to nonlinear
(non-parametric) system identification. Parameter learning and system identifica-
tion go beyond the scope of this thesis and are left to future work. However, a
practical tool for parameter learning is smoothing. With smoothing we can com-
pute the posterior distributions p(x1:T |z1:T ). We present algorithms for filtering and
smoothing in Gaussian-process dynamic systems in Chapter 4.
strong task-specific prior assumptions. In the context of robotics, one popular so-
lution employs prior knowledge provided by a human expert to restrict the space
of possible solutions. Successful applications of this kind of learning in control
were described by Atkeson and Schaal (1997b), Abbeel et al. (2006), Schaal (1997),
Abbeel and Ng (2005), Peters and Schaal (2008b), Nguyen-Tuong et al. (2009), and
Kober and Peters (2009). A human demonstrated a possible solution to the task.
Subsequently, the policy was improved locally by using RL methods. This kind of
learning is known as learning from demonstration, imitation learning, or apprenticeship
learning.
Engel et al. (2003, 2005), Engel (2005), and Deisenroth et al. (2008, 2009b) used
probabilistic GP models to describe the RL value function. While Engel et al. (2003,
2005) and Engel (2005) focused on model-free TD-methods to approximate the
value function, Deisenroth et al. (2008, 2009b) focused on model-based algorithms
in the context of dynamic programming/value iteration using GPs for the transi-
tion dynamics. Kuss (2006) and Rasmussen and Kuss (2004) considered model-
based policy iteration using probabilistic dynamics models. Rasmussen and Kuss
(2004) derived the policy from the value function, which was modeled globally
using GPs. The policy was not a parameterized function, but the actions at the
support points of the value function model were directly optimized. Kuss (2006)
additionally discussed Q-learning, where the Q-function was modeled by GPs.
Moreover, Kuss proposed an algorithm without an explicit global model of the
value function or the Q-function. He instead determined an open-loop sequence
of T actions (u1 , . . . , uT ) that optimized the expected reward along a predicted
trajectory.
Ng and Jordan (2000) proposed Pegasus, a policy search method for large
MDPs and POMDPs, where the transition dynamics were given by a stochastic
model. Bagnell and Schneider (2001), Ng et al. (2004a), Ng et al. (2004b), and
Michels et al. (2005) successfully incorporated Pegasus into learning complicated
control problems.
Indirect policy search algorithms often require the gradients of the value func-
tion with respect to the policy parameters. If the gradient of the value function
with respect to the policy parameters cannot be computed analytically, it has to
be estimated. To estimate the gradient, a range of policy gradient methods can be
applied starting from a finite difference approximation of the gradient to more ef-
ficient gradient estimation using Monte-Carlo rollouts as discussed by Baxter et al.
(2001). Williams (1992) approximated the value function V π by the immediate cost
and discounted future costs. Kakade (2002) and Peters et al. (2003) derived the nat-
ural policy gradient. A good overview of policy gradient methods with estimated
gradients and their application to robotics is given in the work by Peters and Schaal
(2006, 2008a,b) and Bhatnagar et al. (2009).
Several probabilistic models have been used to address the exploration-exploi-
tation issue in RL. R-max by Brafman and Tennenholtz (2002) is a model-based RL
algorithm that maintains a complete, but possibly inaccurate model of the environ-
ment and acts based on the model-optimal policy. R-max relies on the assumption
that acting optimally with respect to the model results either in acting (close to)
cannot fall forward or backward. In 2008, the Murata Company designed the
MURATA GIRL, a robot that could balance the unicycle.30
3.12 Summary
We proposed pilco, a practical, data-efficient, and fully probabilistic learning
framework that learns continuous-valued transition dynamics and controllers fully
automatically from scratch using only very general assumptions about the under-
lying system. Two of pilco’s key ingredients are the probabilistic dynamics model
and the coherent incorporation of uncertainty into decision making. The proba-
bilistic dynamics model reduces model bias, a typical argument against model-
based learning algorithms. Beyond specifying a reasonable cost function, pilco
does not require expert knowledge to learn the task. We showed that pilco
is directly applicable to real mechanical systems and that it learns fairly high-
dimensional and complicated control tasks in only a few trials. We demonstrated
pilco’s learning success on chaotic systems and systems with up to five degrees
of freedom. Across all tasks, we reported an unprecedented speed of learning and
unprecedented automation. To emphasize this point, pilco requires at least one
order of magnitude less interaction time than other recent RL algorithms that learn
from scratch. Since pilco extracts useful information from data only, it is widely
applicable, for example to biological and chemical process control, where classical
control is difficult to implement.
Pilco is conceptually simple and relies on well-established ideas from Bayesian
statistics. A decisive difference to common RL algorithms, including the Dyna ar-
chitecture proposed by Sutton (1990), is that pilco explicitly requires probabilis-
tic models of the transition dynamics that carefully quantify uncertainties. These
model uncertainties have to be taken into account during long-term planning to
reduce model bias, a major argument against model-based learning. The success
of our proposed framework stems from the principled approach to handling the
model’s uncertainty. To the best of our knowledge, pilco is the first RL algorithm
that can learn the cart-double pendulum problem a) without expert knowledge
and b) with a single nonlinear controller for both swing up and balancing.
xt = f (xt−1 ) + wt , (4.1)
Figure 4.1: Graphical model of a dynamic system. Shaded nodes are observed variables, the other nodes
are latent variables. The noise variables wt and vt are not shown to keep the graphical model clear.
zt = g(xt ) + vt , (4.2)
since xt+1 is conditionally independent of xt−1 given xt . This can also be seen
in the graphical model in Figure 4.1. A sufficient statistic of the joint distribution
p(X|Z) in equation (4.3) is given by the means and covariances of the marginal smoothing
distributions p(xt |Z) and the joint distributions p(xt−1 , xt |Z) for t = 1, . . . , T .
Filtering and smoothing are discussed in the context of the forward-backward
algorithm by Rabiner (1989). Forward-backward smoothing follows the steps in
Algorithm 6. The forward-backward algorithm consists of two main steps, the
forward sweep and the backward sweep, where “forward” and “backward” refer
to the temporal direction in which the computations are performed. The forward
sweep is called filtering, whereas both sweeps together are called smoothing. The
forward sweep computes the distributions p(xt |z1:t ), t = 1, . . . , T of the hidden
state at time t given all measurements up to and including time step t. For RTS
smoothing, the hidden state distributions p(xt |z1:t ) from the forward sweep are
updated to incorporate the evidence of the future measurements zt+1 , . . . , zT , t =
T −1, . . . , 0. These updates are determined in the backward sweep, which computes
the joint posterior distributions p(xt−1 , xt |z1:T ).
In the following, we provide a general probabilistic perspective on Gaussian
filtering and smoothing in dynamic systems. Initially, we do not focus on partic-
ular implementations of filtering and smoothing. Instead, we identify the high-
level concepts and the components required for Gaussian filtering and smoothing,
while avoiding getting lost in the implementation and computational details of
particular algorithms (see for instance the standard derivations of the Kalman fil-
ter given by Anderson and Moore (2005) or Thrun et al. (2005)). We show that
Gaussian filtering at time t can be split into two major steps, the time update and
the measurement update, as described for instance by Anderson and Moore (2005)
and Thrun et al. (2005).
Proposition 1 (Filter Distribution). For any Gaussian filter, the mean and the covariance
of the filter distribution p(xt |z1:t ) are
$\mu^x_{t|t} := \mathrm{E}[x_t|z_{1:t}] = \mathrm{E}[x_t|z_{1:t-1}] + \mathrm{cov}[x_t, z_t|z_{1:t-1}]\,\mathrm{cov}[z_t|z_{1:t-1}]^{-1}\big(z_t - \mathrm{E}[z_t|z_{1:t-1}]\big) \qquad (4.6)$
$= \mu^x_{t|t-1} + \Sigma^{xz}_{t|t-1}\big(\Sigma^z_{t|t-1}\big)^{-1}\big(z_t - \mu^z_{t|t-1}\big)\,, \qquad (4.7)$
$\Sigma^x_{t|t} := \mathrm{cov}[x_t|z_{1:t}] = \mathrm{cov}[x_t|z_{1:t-1}] - \mathrm{cov}[x_t, z_t|z_{1:t-1}]\,\mathrm{cov}[z_t|z_{1:t-1}]^{-1}\,\mathrm{cov}[z_t, x_t|z_{1:t-1}] \qquad (4.8)$
$= \Sigma^x_{t|t-1} - \Sigma^{xz}_{t|t-1}\big(\Sigma^z_{t|t-1}\big)^{-1}\Sigma^{zx}_{t|t-1}\,, \qquad (4.9)$
respectively.
Proof. To prove Proposition 1, we derive the filter distribution by repeatedly apply-
ing Bayes’ theorem and explicit computation of the intermediate distributions.
Filtering proceeds by alternating between predicting (time update) and correct-
ing (measurement update):
1. Time update (predictor)
(a) Compute predictive distribution p(xt |z1:t−1 ).
2. Measurement update (corrector) using Bayes’ theorem:
$p(x_t|z_{1:t}) = \dfrac{p(x_t, z_t|z_{1:t-1})}{p(z_t|z_{1:t-1})} \qquad (4.10)$
(a) Compute p(xt , zt |z1:t−1 ), that is, the joint distribution of the next latent
state and the next measurement given the measurements up to the current
time step.
(b) Measure zt .
(c) Compute the posterior p(xt |z1:t ).
In the following, we examine these steps more carefully.
where we exploited that the noise term wt has mean zero and is independent.
Similarly, the corresponding predictive covariance is
$\Sigma^x_{t|t-1} := \mathrm{cov}_{x_t}[x_t|z_{1:t-1}] = \mathrm{cov}_{f(x_{t-1})}[f(x_{t-1})|z_{1:t-1}] + \mathrm{cov}_{w_t}[w_t|z_{1:t-1}] \qquad (4.14)$
$= \underbrace{\int f(x_{t-1}) f(x_{t-1})^\top p(x_{t-1}|z_{1:t-1})\, \mathrm{d}x_{t-1} - \mu^x_{t|t-1}(\mu^x_{t|t-1})^\top}_{=\,\mathrm{cov}_{f(x_{t-1})}[f(x_{t-1})|z_{1:t-1}]} + \underbrace{\Sigma_w}_{=\,\mathrm{cov}_{w_t}[w_t]}\,, \qquad (4.15)$
such that the Gaussian approximation of the predictive distribution is
p(xt |z1:t−1 ) = N (µxt|t−1 , Σxt|t−1 ) . (4.16)
$\Sigma^z_{t|t-1} := \mathrm{cov}_{z_t}[z_t|z_{1:t-1}] = \mathrm{cov}_{x_t}[g(x_t)|z_{1:t-1}] + \mathrm{cov}_{v_t}[v_t|z_{1:t-1}] \qquad (4.21)$
$= \underbrace{\int g(x_t) g(x_t)^\top p(x_t|z_{1:t-1})\, \mathrm{d}x_t - \mu^z_{t|t-1}(\mu^z_{t|t-1})^\top}_{=\,\mathrm{cov}_{x_t}[g(x_t)|z_{1:t-1}]} + \underbrace{\Sigma_v}_{=\,\mathrm{cov}_{v_t}[v_t]}\,. \qquad (4.22)$
with the mean and covariance given in equations (4.20) and (4.22), respec-
tively.
• Due to the independence of vt , the measurement noise does not influence
the cross-covariance entries in the joint Gaussian in equation (4.18), which
are given by
$\Sigma^{xz}_{t|t-1} := \mathrm{cov}_{x_t,z_t}[x_t, z_t|z_{1:t-1}] \qquad (4.24)$
$= \mathrm{E}_{x_t,z_t}[x_t z_t^\top|z_{1:t-1}] - \mathrm{E}_{x_t}[x_t|z_{1:t-1}]\,\mathrm{E}_{z_t}[z_t|z_{1:t-1}]^\top \qquad (4.25)$
$= \iint x_t z_t^\top\, p(x_t, z_t|z_{1:t-1})\, \mathrm{d}z_t\, \mathrm{d}x_t - \mu^x_{t|t-1}(\mu^z_{t|t-1})^\top\,, \qquad (4.26)$
where we plugged in the mean µxt|t−1 of the time update in equation (4.16)
and the mean µzt|t−1 of the predicted next measurement, whose distribu-
tion is given in equation (4.23). Exploiting now the independence of the
noise, we obtain the cross-covariance
$\Sigma^{xz}_{t|t-1} = \int x_t\, g(x_t)^\top p(x_t|z_{1:t-1})\, \mathrm{d}x_t - \mu^x_{t|t-1}(\mu^z_{t|t-1})^\top\,. \qquad (4.27)$
(b) Measure zt .
(c) Compute the posterior p(xt |z1:t ). The computation of the posterior distri-
bution using Bayes’ theorem, see equation (4.10), essentially boils down to
applying the rules of computing a conditional from a joint Gaussian distri-
bution, see Appendix A.3 or the books by Bishop (2006) or Rasmussen and
Williams (2006). Using the expressions from equations (4.13), (4.15), (4.20),
(4.22), and (4.27), which fully specify the joint Gaussian p(xt , zt |z1:t−1 ), we
obtain the desired filter distribution at time instance t as the conditional
The generic filtering distribution in equation (4.28) holds for any (Gaussian) Bayes
filter as described by Thrun et al. (2005) and incorporates the time update and
the measurement update. We identify that it is sufficient to compute the mean
and the covariance of the joint distribution p(xt , zt |z1:t−1 ) for the generic filtering
distribution in equation (4.28).
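A minimal sketch of this generic measurement update, assuming the moments of the joint p(x_t, z_t|z_{1:t−1}) are already available, is given below; it directly implements equations (4.7) and (4.9). Names and the toy numbers are illustrative only.

    import numpy as np

    def gaussian_filter_update(mu_x, S_x, mu_z, S_z, S_xz, z):
        """Generic Gaussian measurement update, cf. equations (4.7) and (4.9)."""
        K = S_xz @ np.linalg.inv(S_z)    # Sigma_xz (Sigma_z)^{-1}
        mu = mu_x + K @ (z - mu_z)       # filter mean
        S = S_x - K @ S_xz.T             # filter covariance
        return mu, S

    # Toy usage with a 2-D state and a 1-D measurement.
    mu_x, S_x = np.zeros(2), np.eye(2)
    mu_z, S_z = np.zeros(1), np.array([[1.5]])
    S_xz = np.array([[0.5], [0.2]])
    print(gaussian_filter_update(mu_x, S_x, mu_z, S_z, S_xz, np.array([0.3])))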
Note that in general the integrals in equations (4.13), (4.15), (4.20), (4.22), and
(4.27) cannot be computed analytically. One exception is the case of linear functions
f and g, where the analytic solutions to the integrals are embodied in the Kalman
filter introduced by Kalman (1960): Using the rules of predicting in linear Gaussian
systems, the Kalman filter equations can be recovered when plugging in the respec-
tive means and covariances into equations (4.29) and (4.30), which is detailed by
Roweis and Ghahramani (1999), Minka (1998), Anderson and Moore (2005), Bishop
(2006), and Thrun et al. (2005), for example. In many nonlinear dynamic systems,
deterministic filter algorithms either approximate the state distribution (for exam-
ple, the UKF and the CKF) or the functions f and g (for example, the EKF). Using
the means and (cross-)covariances computed by these algorithms and plugging
them into the generic filter equations (4.29)–(4.30) recovers the corresponding filter
update equations.
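For illustration, the following sketch computes the joint moments for a linear transition x_t = A x_{t−1} + w_t and a linear measurement z_t = C x_t + v_t; combined with the gaussian_filter_update sketch above, this recovers the Kalman filter update. The matrices below are placeholders.

    import numpy as np

    def linear_joint_moments(mu_prev, S_prev, A, C, S_w, S_v):
        """Closed-form moments of p(x_t, z_t | z_{1:t-1}) in the linear-Gaussian case."""
        mu_x = A @ mu_prev                 # time-update mean
        S_x = A @ S_prev @ A.T + S_w       # time-update covariance
        mu_z = C @ mu_x                    # predicted measurement mean
        S_z = C @ S_x @ C.T + S_v          # predicted measurement covariance
        S_xz = S_x @ C.T                   # cross-covariance cov[x_t, z_t | z_{1:t-1}]
        return mu_x, S_x, mu_z, S_z, S_xz

    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    C = np.array([[1.0, 0.0]])
    moments = linear_joint_moments(np.zeros(2), np.eye(2), A, C,
                                   0.01 * np.eye(2), np.array([[0.1]]))
    # Plugging into the generic update from the previous sketch gives the Kalman update.
    mu, S = gaussian_filter_update(*moments, np.array([0.4]))
    print(mu, S)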
Marginal Distributions
The smoothed marginal state distribution is the posterior distribution of the hidden
state given all measurements
Proof. In the considered case of a finite time series, the smoothed state distribution
at the terminal time step T is equivalent to the filter distribution p(xT |z1:T ). By
integrating out the hidden state at time step t the distributions p(xt−1 |z1:T ), t =
T, . . . , 1, of the smoothed states can be computed recursively according to
$p(x_{t-1}|z_{1:T}) = \int p(x_{t-1}|x_t, z_{1:T})\, p(x_t|z_{1:T})\, \mathrm{d}x_t = \int p(x_{t-1}|x_t, z_{1:t-1})\, p(x_t|z_{1:T})\, \mathrm{d}x_t\,. \qquad (4.37)$
We now examine these steps in detail, where we assume that the filter distributions
p(xt |z1:t ) for t = 1, . . . , T are known. Moreover, we assume a known smoothed state
distribution p(xt |z1:T ) to compute the smoothed state distribution p(xt−1 |z1:T ).
(a) Compute the conditional p(xt−1 |xt , z1:t−1 ). We compute the conditional in two
steps: First, we compute a joint Gaussian distribution p(xt , xt−1 |z1:t−1 ). Sec-
ond, we apply the rules of computing conditionals from a joint Gaussian
distribution.
Let us start with the joint distribution. We compute an approximate Gaussian
joint distribution
$p(x_{t-1}, x_t|z_{1:t-1}) = \mathcal{N}\!\left( \begin{bmatrix} \mu^x_{t-1|t-1} \\ \mu^x_{t|t-1} \end{bmatrix}, \begin{bmatrix} \Sigma^x_{t-1|t-1} & \Sigma^x_{t-1,t|t-1} \\ (\Sigma^x_{t-1,t|t-1})^\top & \Sigma^x_{t|t-1} \end{bmatrix} \right)\,, \qquad (4.38)$
where Σxt−1,t|t−1 = covxt−1 ,xt [xt−1 , xt |z1:t−1 ] denotes the cross-covariance be-
tween xt−1 and xt given the measurements z1:t−1 .
Let us look closer at the components of the joint distribution in equation (4.38).
The filter distribution p(xt−1 |z1:t−1 ) = N (µxt−1|t−1 , Σxt−1|t−1 ) at time step t − 1 is
known from equation (4.28) and is the first marginal distribution in equa-
tion (4.38). The second marginal p(xt |z1:t−1 ) = N (µxt|t−1 , Σxt|t−1 ) is the time up-
date equation (4.16), which is also known from filtering. The missing bit is the
cross-covariance Σxt−1,t|t−1 , which can also be pre-computed during filtering
since it does not depend on future measurements. We obtain
where we used the means µxt−1|t−1 and µxt|t−1 of the measurement update and
the time update, respectively. The zero-mean independent noise in the system
equation (4.1) does not influence the cross-covariance matrix. This concludes
the first step (computation of the joint Gaussian) of the computation of the
desired conditional.
In the second step, we apply the rules of Gaussian conditioning to obtain the
desired conditional distribution p(xt−1 |xt , z1:t−1 ). For a shorthand notation, we
define the matrix
Jt−1 := Σxt−1,t|t−1 (Σxt|t−1 )−1 ∈ RD×D (4.41)
and obtain the conditional Gaussian distribution
p(xt−1 |xt , z1:t−1 ) = N (m, S) , (4.42)
m = µxt−1|t−1 + Jt−1 (xt − µxt|t−1 ) , (4.43)
S = Σxt−1|t−1 − Jt−1 (Σxt−1,t|t−1 )> (4.44)
by applying the rules of computing conditionals from a joint Gaussian as
outlined in Appendix A.3 or in the book by Bishop (2006).
(b) Formulate p(xt−1 |xt , z1:t−1 ) as an unnormalized distribution in xt . The expo-
nent of the Gaussian distribution p(xt−1 |xt , z1:t−1 ) = N (xt−1 | m, S) contains
$x_{t-1} - m = \underbrace{x_{t-1} - \mu^x_{t-1|t-1} + J_{t-1}\mu^x_{t|t-1}}_{=:\,r(x_{t-1})} - J_{t-1}x_t\,, \qquad (4.45)$
which is a linear function of both xt−1 and xt . Thus, we can reformulate the
conditional Gaussian in equation (4.42) as a Gaussian in Jt−1 xt with mean
r(xt−1 ) and the unchanged covariance matrix S, that is,
N (xt−1 | m, S) = N (Jt−1 xt | r(xt−1 ), S) . (4.46)
Following Petersen and Pedersen (2008), we obtain the conditional distribu-
tion
p(xt−1 |xt , z1:t−1 ) = N (xt−1 | m, S) = N (Jt−1 xt | r(xt−1 ), S) (4.47)
= c1 N (xt | a, A) , (4.48)
$a = J_{t-1}^{-1}\, r(x_{t-1})\,, \qquad (4.49)$
$A = \big(J_{t-1}^\top S^{-1} J_{t-1}\big)^{-1}\,, \qquad (4.50)$
$c_1 = \dfrac{\sqrt{\big|2\pi\big(J_{t-1}^\top S^{-1} J_{t-1}\big)^{-1}\big|}}{\sqrt{|2\pi S|}}\,, \qquad (4.51)$
which is an unnormalized Gaussian in xt , where c1 makes the Gaussian un-
normalized. Note that the matrix Jt−1 ∈ RD×D defined in equation (4.41) not
necessarily invertible, for example, if the cross-covariance Σxt−1,t|t−1 is rank de-
ficient. In this case, we would take the pseudo-inverse of J. At the end of
the day, we will see that this (pseudo-)inverse matrix cancels out and is not
necessary to compute the final smoothing distribution.
(c) Multiply the new distribution with p(xt |z1:T ). To determine p(xt−1 |z1:T ), we
multiply the Gaussian in equation (4.48) with the Gaussian smoothing dis-
tribution p(xt |z1:T ) = N (xt | µxt|T , Σxt|T ) of the state at time t. This
yields
c1 N (xt | a, A)N (xt | µxt|T , Σxt|T ) = c1 c2 (a)N (xt | b, B) (4.52)
for some b, B, where c2 (a) is the inverse normalization constant of N (xt | b, B).
Figure 4.2: Joint covariance matrix of all hidden states given all measurements. The light-gray block-
diagonal expressions are the marginal covariance matrices of the smoothed states given in equation (4.56).
Due to the Markov property in latent space, only the cross-covariances between consecutive time steps,
marked by the dark-gray blocks, need to be computed in addition.
(d) Solve the integral. Since we integrate over xt in equation (4.37), we are solely
interested in the parts that make equation (4.52) unnormalized, that is, the
constants c1 and c2 (a), which are independent of xt . The constant c2 (a) in
equation (4.52) can be rewritten as c2 (xt−1 ) by reversing the step that inverted
the matrix Jt−1 , see equation (4.48). Then, c2 (xt−1 ) is given by
Since $c_1 c_1^{-1} = 1$ (plug equation (4.53) into equation (4.52)), the smoothed state
distribution is
p(xt−1 |z1:T ) = N (xt−1 | µxt−1|T , Σxt−1|T ) , (4.56)
where the mean and the covariance are given in equation (4.54) and equa-
tion (4.55), respectively.
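A minimal sketch of one backward smoothing step is given below. It uses J_{t−1} from equation (4.41); since equations (4.54) and (4.55) are not reproduced here, the sketch assumes the standard RTS form of the smoothed mean and covariance to which the above derivation leads. All names and toy numbers are illustrative.

    import numpy as np

    def rts_backward_step(mu_f, S_f, mu_p, S_p, S_cross, mu_s_next, S_s_next):
        """One backward smoothing step: filter moments (mu_f, S_f) at t-1,
        time-update moments (mu_p, S_p) of p(x_t|z_{1:t-1}), the cross-covariance
        S_cross = cov[x_{t-1}, x_t|z_{1:t-1}], and the smoothed moments at t."""
        J = S_cross @ np.linalg.inv(S_p)          # J_{t-1}, equation (4.41)
        mu_s = mu_f + J @ (mu_s_next - mu_p)      # smoothed mean at t-1 (RTS form)
        S_s = S_f + J @ (S_s_next - S_p) @ J.T    # smoothed covariance at t-1 (RTS form)
        S_joint = J @ S_s_next                    # cross-covariance, cf. equation (4.64) below
        return mu_s, S_s, S_joint

    # Toy usage with a 2-D state.
    out = rts_backward_step(np.zeros(2), np.eye(2), np.zeros(2), 2 * np.eye(2),
                            0.5 * np.eye(2), np.ones(2), 1.5 * np.eye(2))
    print(out[0], out[1])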
Full Distribution
In the following, we extend the marginal smoothing distributions p(xt |Z) to a
smoothing distribution on the entire time series x0 , . . . , xT .
The mean of the joint distribution is simply the concatenation of the means of
the marginals computed according to equation (4.56). Computing the covariance
can be simplified by looking at the definition of the dynamic system in equa-
tions (4.1)–(4.2). Due to the Markov assumption, the marginal posteriors and the
joint distributions p(xt−1 , xt |Z), t = 1, . . . , T , are a sufficient statistic to describe a
Gaussian posterior p(X|Z) of the entire time series. Therefore, the covariance of the
conditional joint distribution p(X|Z) is block-diagonal as depicted in Figure 4.2.
The covariance blocks on the diagonal are the marginal smoothing distributions
from equation (4.56).
$\Sigma^x_{t-1,t|T} := \mathrm{cov}_{x_{t-1},x_t}[x_{t-1}, x_t|Z] = \mathrm{E}[x_{t-1} x_t^\top|Z] - \mu^x_{t-1|T}(\mu^x_{t|T})^\top \qquad (4.58)$
$= \iint x_{t-1}\, p(x_{t-1}|x_t, Z)\, x_t^\top\, p(x_t|Z)\, \mathrm{d}x_{t-1}\, \mathrm{d}x_t - \mu^x_{t-1|T}(\mu^x_{t|T})^\top\,. \qquad (4.59)$
Here, we plugged in the means µxt−1|T , µxt|T of the marginal smoothing distributions
p(xt−1 |Z) and p(xt |Z), respectively, and factored p(xt−1 , xt |Z) = p(xt−1 |xt , Z)p(xt |Z).
The inner integral in equation (4.59) determines the mean of the conditional distri-
bution p(xt−1 |xt , Z), which is given by
Ext−1 [xt−1 |xt , Z] = µxt−1|T + Jt−1 (xt − µxt|T ) . (4.60)
The matrix Jt−1 in equation (4.60) is defined in equation (4.41).
Remark 7. In contrast to equation (4.42), in equation (4.60), we conditioned on all
measurements, not only on z1 , . . . , zt−1 , and computed an updated state distribu-
tion p(xt |Z), which is used in equation (4.60).
Using equation (4.60) for the inner integral in equation (4.59), the desired cross-
covariance is
$\Sigma^x_{t-1,t|T} = \int \mathrm{E}_{x_{t-1}}[x_{t-1}|x_t, Z]\, x_t^\top\, p(x_t|Z)\, \mathrm{d}x_t - \mu^x_{t-1|T}(\mu^x_{t|T})^\top \qquad (4.61)$
$\overset{(4.60)}{=} \int \big(\mu^x_{t-1|T} + J_{t-1}(x_t - \mu^x_{t|T})\big)\, x_t^\top\, p(x_t|z_{1:T})\, \mathrm{d}x_t - \mu^x_{t-1|T}(\mu^x_{t|T})^\top\,. \qquad (4.62)$
The smoothing results for the marginals, see equations (4.54) and (4.55), yield
$\mathrm{E}_{x_t}[x_t|Z] = \mu^x_{t|T}$ and $\mathrm{cov}_{x_t}[x_t|Z] = \Sigma^x_{t|T}$, respectively. After factorizing equation (4.62),
we subsequently plug these moments in and obtain
$\Sigma^x_{t-1,t|T} = \mu^x_{t-1|T}(\mu^x_{t|T})^\top + J_{t-1}\big(\Sigma^x_{t|T} + \mu^x_{t|T}(\mu^x_{t|T})^\top - \mu^x_{t|T}(\mu^x_{t|T})^\top\big) - \mu^x_{t-1|T}(\mu^x_{t|T})^\top \qquad (4.63)$
$= J_{t-1}\,\Sigma^x_{t|T}\,, \qquad (4.64)$
which concludes the proof of Proposition 3.
With Proposition 3 the joint posterior distribution p(xt−1 , xt |Z) is given by
$p(x_{t-1}, x_t|Z) = \mathcal{N}(d, D)\,, \qquad (4.65)$
$d = \begin{bmatrix} \mu^x_{t-1|T} \\ \mu^x_{t|T} \end{bmatrix}\,, \qquad D = \begin{bmatrix} \Sigma^x_{t-1|T} & \Sigma^x_{t-1,t|T} \\ \Sigma^x_{t,t-1|T} & \Sigma^x_{t|T} \end{bmatrix}\,. \qquad (4.66)$
With these joint distributions, we can fill the shaded blocks in Figure 4.2 and a
Gaussian approximation of the distribution p(X|Z) is determined.
4.2.3 Implications
Using the results from Sections 4.2.1 and 4.2.2, we conclude that Gaussian filtering
and smoothing only require the determination of the means and the covariances of
two joint distributions: the joint p(xt−1 , xt |z1:t−1 ) of two consecutive states (smoothing
only) and the joint p(xt , zt |z1:t−1 ) of a state and the subsequent measurement (filtering
and smoothing). This result has two implications:
Table 4.1: Computations of the means and the covariances of the Gaussian approximate joints
p(xt , zt |z1:t−1 ) and p(xt−1 , xt |z1:t−1 ) using different filtering algorithms.
(a) Graphical model for training the GPs in a nonlinear state space model. The transitions xτ, τ = 1, . . . , n, of the states are observed during training. The hyper-parameters θf and θg of the GP models for f and g are learned from the training sets (xτ−1, xτ) and (xτ, zτ), respectively, for τ = 1, . . . , n. The function nodes and the nodes for the hyper-parameters are unshaded, whereas the latent states are shaded.
(b) Graphical model used during inference (after training the GPs for f and g using the training scheme described in Panel (a)). The function nodes are still unshaded since the functions are GP distributed. However, the corresponding hyper-parameters are fixed, which implies that xt−1 and xt+1 are independent if xt is known. Unlike in Panel (a), the latent states are no longer observed. The graphical model is now functionally equivalent to the graphical model in Figure 4.1.
Figure 4.3: Graphical models for GP training and inference in dynamic systems. The graphical models
extend the graphical model in Figure 4.1 by adding nodes for the transition function, the measurement
function, and the GP hyper-parameters. The latent states are denoted by x, the measurements are denoted
by z. Shaded nodes represent directly observed variables. Other nodes represent latent variables. The
dashed nodes for f ( · ) and g( · ) represent variables in function space.
and Ko and Fox (2008, 2009a). The graphical model for training is shown in Fig-
ure 4.3(a). After having trained the models, we assume that we can no longer
access the hidden states. Instead, they have to be inferred using the evidence from
observations z. During inference, we condition on the learned hyper-parameters
θ f and θ g , respectively, and the corresponding training sets of both GP models.1
Thus, in the graphical model for the inference (Figure 4.3(b)), the nodes for the
hyper-parameters θ f and θ g are shaded.
As shown in Section 4.2.3, it suffices to compute the means and covari-
ances of the joint distributions p(xt , zt |z1:t−1 ) and p(xt−1 , xt |z1:t−1 ) for filtering and
smoothing. Ko and Fox (2009a) suggest computing the means and covariances
by either linearizing the mean function of the GP and applying the Kalman fil-
ter equations (GP-EKF) or by mapping samples through the GP resulting in the
GP-UKF when using deterministically chosen samples or the GP-PF when using
random samples. Like the EKF and the UKF, the GP-EKF and the GP-UKF are not
moment-preserving in the GP dynamic system.
1 For notational convenience, we do not explicitly condition on these “parameters” of the GP models in the following.
Following Section 4.2.1, it suffices to compute the mean and the covariance of the
joint distribution p(xt , zt |z1:t−1 ) to design a filtering algorithm. In the following, we
derive the GP-ADF by computing these moments analytically. Let us divide the
computation of the joint distribution
$p(x_t, z_t|z_{1:t-1}) = \mathcal{N}\!\left( \begin{bmatrix} \mu^x_{t|t-1} \\ \mu^z_{t|t-1} \end{bmatrix}, \begin{bmatrix} \Sigma^x_{t|t-1} & \Sigma^{xz}_{t|t-1} \\ \Sigma^{zx}_{t|t-1} & \Sigma^z_{t|t-1} \end{bmatrix} \right) \qquad (4.67)$
into the computation of the marginals and the cross-covariance.
Let us start with the computation of the mean and the covariance of the time
update p(xt |z1:t−1 ).
Mean Equation (4.13) describes how to compute the mean µxt|t−1 in a standard
model where the transition function f is known. In our case, the mapping xt−1 →
xt is not known, but instead it is distributed according to a GP, a distribution over
functions. Therefore, the computation of the mean requires additionally taking
the GP model uncertainty into account by Bayesian averaging according to the GP
distribution. Using the system equation (4.1) and integrating over all sources of
uncertainties (the system noise, the state xt−1 , and the system function itself), we
obtain
where the law of total expectation has been exploited to nest the expectations. The
expectations are taken with respect to the GP distribution and the filter distribution
p(xt−1 |z1:t−1 ) = N (µxt−1|t−1 , Σxt−1|t−1 ) at time t − 1.
Equation (4.69) can be rewritten as
with
$q^x_{a_i} = \alpha_{f_a}^2\, \big|\Sigma^x_{t-1|t-1}\Lambda_a^{-1} + I\big|^{-\frac{1}{2}} \exp\!\big(-\tfrac{1}{2}(x_i - \mu^x_{t-1|t-1})^\top(\Sigma^x_{t-1|t-1} + \Lambda_a)^{-1}(x_i - \mu^x_{t-1|t-1})\big)\,,$
$i = 1, \ldots, n$, being the solution to the integral in equation (4.72). Here, $\alpha_{f_a}^2$ is the sig-
nal variance of the ath target dimension of GP f , a learned (evidence maximization)
hyper-parameter of the SE covariance function in equation (2.3).
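For illustration, the sketch below computes the vector q^x_a and the corresponding component of the time-update mean, assuming the mean has the form (μ^x_{t|t−1})_a = (β^x_a)^⊤ q^x_a analogous to equation (4.82); all inputs are placeholders.

    import numpy as np

    def gp_adf_mean_component(X, beta_a, alpha_a, Lambda_a, mu, S):
        """Time-update mean for one target dimension: q_a follows the expression
        above (analogous to equation (4.83)); the mean is assumed to be
        beta_a^T q_a (analogous to equation (4.82)). X are the GP training inputs,
        Lambda_a the diagonal matrix of squared length-scales, and (mu, S) the
        moments of the filter distribution at time t-1."""
        D = X.shape[1]
        det = np.linalg.det(S @ np.linalg.inv(Lambda_a) + np.eye(D)) ** -0.5
        Sinv = np.linalg.inv(S + Lambda_a)
        diffs = X - mu
        q = alpha_a ** 2 * det * np.exp(-0.5 * np.einsum('nd,de,ne->n', diffs, Sinv, diffs))
        return q, beta_a @ q

    # Toy usage with random placeholders.
    n, D = 5, 3
    X, beta_a = np.random.randn(n, D), np.random.randn(n)
    q, mean_a = gp_adf_mean_component(X, beta_a, 1.0, np.eye(D), np.zeros(D), 0.1 * np.eye(D))
    print(q.shape, mean_a)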
Covariance We now compute the covariance Σxt|t−1 of the time update. The
computation is split up into two steps: First, the computation of the variances
$\operatorname{var}_{x_{t-1},f,w_{t_a}}[x_t^{(a)}|z_{1:t-1}]$ for all state dimensions $a = 1, \ldots, D$, and second, the com-
putation of the cross-covariances $\operatorname{cov}_{x_{t-1},f}[x_t^{(a)}, x_t^{(b)}|z_{1:t-1}]$ for $a \neq b$, $a, b = 1, \ldots, D$.
The law of total variance yields the diagonal entries of $\Sigma^x_{t|t-1}$:
$\operatorname{var}_{x_{t-1},f,w_t}[x_t^{(a)}|z_{1:t-1}] = \mathrm{E}_{x_{t-1}}\big[\operatorname{var}_{f,w_t}[f_a(x_{t-1}) + w_{t_a}|x_{t-1}]\,\big|\,z_{1:t-1}\big] \qquad (4.75)$
$+ \operatorname{var}_{x_{t-1}}\big[\mathrm{E}_f[f_a(x_{t-1})|x_{t-1}]\,\big|\,z_{1:t-1}\big]\,. \qquad (4.76)$
Using Ef [fa (xt−1 )|xt−1 ] = mf (xt−1 ) and plugging in the finite kernel expansion for
mf (see also equation (4.72)), we can reformulate this equation into
$\operatorname{var}_{x_{t-1},f,w_t}[x_t^{(a)}|z_{1:t-1}] = \underbrace{(\beta^x_a)^\top Q^x \beta^x_a - (\mu^x_{t|t-1})_a^2}_{\operatorname{var}_{x_{t-1}}[\mathrm{E}_f[f_a(x_{t-1})|x_{t-1}]|z_{1:t-1}]} + \underbrace{\sigma_{w_a}^2 + \alpha_{f_a}^2 - \operatorname{tr}\big((K_{f_a} + \sigma_{w_a}^2 I)^{-1} Q^x\big)}_{\mathrm{E}_{x_{t-1}}[\operatorname{var}_{f,w_t}[f_a(x_{t-1}) + w_{t_a}|x_{t-1}]|z_{1:t-1}]}\,. \qquad (4.77)$
By defining $R := \Sigma^x_{t-1|t-1}(\Lambda_a^{-1} + \Lambda_b^{-1}) + I$, $\zeta_i := x_i - \mu^x_{t-1|t-1}$, and $z_{ij} := \Lambda_a^{-1}\zeta_i + \Lambda_b^{-1}\zeta_j$,
the entries of $Q^x \in \mathbb{R}^{n\times n}$ are
$Q^x_{ij} = \dfrac{k_{f_a}(x_i, \mu^x_{t-1|t-1})\, k_{f_b}(x_j, \mu^x_{t-1|t-1})}{\sqrt{|R|}} \exp\!\big(\tfrac{1}{2} z_{ij}^\top R^{-1}\Sigma^x_{t-1|t-1} z_{ij}\big) = \dfrac{\exp(n_{ij}^2)}{\sqrt{|R|}}\,, \qquad (4.78)$
$n_{ij}^2 = 2\big(\log(\alpha_{f_a}) + \log(\alpha_{f_b})\big) - \dfrac{\zeta_i^\top \Lambda_a^{-1}\zeta_i + \zeta_j^\top \Lambda_b^{-1}\zeta_j - z_{ij}^\top R^{-1}\Sigma^x_{t-1|t-1} z_{ij}}{2}\,, \qquad (4.79)$
where σw2 a is the variance of the system noise in the ath target dimension, another
hyper-parameter that has been learned from data. Note that the computation of
Qxij is numerically stable.
The off-diagonal entries of $\Sigma^x_{t|t-1}$ are given as
$\operatorname{cov}_{x_{t-1},f}[x_t^{(a)}, x_t^{(b)}|z_{1:t-1}] = (\beta^x_a)^\top Q^x \beta^x_b - (\mu^x_{t|t-1})_a(\mu^x_{t|t-1})_b\,, \quad a \neq b\,. \qquad (4.80)$
Comparing this expression to the diagonal entries in equation (4.77), we see that
for $a = b$ we include the terms $\mathrm{E}_{x_{t-1}}[\operatorname{var}_f[f_a(x_{t-1})|x_{t-1}]|z_{1:t-1}] = \alpha_{f_a}^2 - \operatorname{tr}\big((K_{f_a} + \sigma_{w_a}^2 I)^{-1}Q^x\big)$
and $\sigma_{w_a}^2$, which are both zero for $a \neq b$. This reflects the assumptions
that the target dimensions $a$ and $b$ are conditionally independent given $x_{t-1}$ and
that the noise covariance $\Sigma_w$ is diagonal.
With this, we determined the mean µxt|t−1 and the covariance Σxt|t−1 of the
marginal distribution p(xt |z1:t−1 ) in equation (4.67).
$\mu^z_{t|t-1} = \mathrm{E}_{z_t}[z_t|z_{1:t-1}] = \mathrm{E}_{x_t,g,v_t}[g(x_t) + v_t|z_{1:t-1}] = \mathrm{E}_{x_t}\big[\mathrm{E}_g[g(x_t)|x_t]\,\big|\,z_{1:t-1}\big]\,, \qquad (4.81)$
where the law of total expectation has been exploited to nest the expectations. The
expectations are taken with respect to the GP distribution and the time update
p(xt |z1:t−1 ) = N (µxt|t−1 , Σxt|t−1 ).
The computations closely follow the steps we have taken in the time update. By
using p(xt |z1:t−1 ) as a prior instead of p(xt−1 |z1:t−1 ), GP g instead of GP f , vt instead
of wt , we obtain the marginal mean
$(\mu^z_{t|t-1})_a = (\beta^z_a)^\top q^z_a\,, \quad a = 1, \ldots, E\,, \qquad (4.82)$
with $\beta^z_a := (K_{g_a} + \sigma_{v_a}^2 I)^{-1} y_a$ depending on the training set of GP$_g$ and
$q^z_{a_i} = \alpha_{g_a}^2\, \big|\Sigma^x_{t|t-1}\Lambda_a^{-1} + I\big|^{-\frac{1}{2}} \exp\!\big(-\tfrac{1}{2}(x_i - \mu^x_{t|t-1})^\top(\Sigma^x_{t|t-1} + \Lambda_a)^{-1}(x_i - \mu^x_{t|t-1})\big)\,. \qquad (4.83)$
Covariance The computation of the covariance Σzt|t−1 is also similar to the com-
putation of the time update covariance Σxt|t−1 . Therefore, we only state the results,
which we obtain when we use p(xt |z1:t−1 ) as a prior instead of p(xt−1 |z1:t−1 ), GP g
instead of GP f , vt instead of wt in the computations of Σxt|t−1 .
The diagonal entries of Σ^z_{t|t-1} are given as

$\mathrm{var}_{x_t,g,v_t}[z_t^{(a)}\,|\,z_{1:t-1}] = \underbrace{(\beta^z_a)^\top Q^z\beta^z_a - (\mu^z_{t|t-1})_a^2}_{\mathrm{var}_{x_t}[\mathrm{E}_g[g_a(x_t)\,|\,x_t]\,|\,z_{1:t-1}]} + \underbrace{\sigma_{v_a}^2 + \alpha_{g_a}^2 - \mathrm{tr}\big((K_{g_a} + \sigma_{v_a}^2 I)^{-1}Q^z\big)}_{\mathrm{E}_{x_t}[\mathrm{var}_{g,v_t}[g_a(x_t) + v_{t_a}\,|\,x_t]\,|\,z_{1:t-1}]}$   (4.84)

for a = 1, . . . , E. By defining R := Σ^x_{t|t-1}(Λ_a^{-1} + Λ_b^{-1}) + I, ζ_i := x_i − µ^x_{t|t-1}, and z_{ij} := Λ_a^{-1}ζ_i + Λ_b^{-1}ζ_j, the entries of Q^z ∈ R^{n×n} are

$Q^z_{ij} = \frac{k_{g_a}(x_i, \mu^x_{t|t-1})\,k_{g_b}(x_j, \mu^x_{t|t-1})}{\sqrt{|R|}}\exp\big(\tfrac{1}{2}z_{ij}^\top R^{-1}\Sigma^x_{t|t-1}z_{ij}\big) = \frac{\exp(n_{ij}^2)}{\sqrt{|R|}}\,,$   (4.85)

$n_{ij}^2 = 2\big(\log(\alpha_{g_a}) + \log(\alpha_{g_b})\big) - \tfrac{1}{2}\big(\zeta_i^\top\Lambda_a^{-1}\zeta_i + \zeta_j^\top\Lambda_b^{-1}\zeta_j - z_{ij}^\top R^{-1}\Sigma^x_{t|t-1}z_{ij}\big)\,.$   (4.86)
The off-diagonal entries of Σ^z_{t|t-1} are given as

$\mathrm{cov}_{x_t,g}[z_t^{(a)}, z_t^{(b)}\,|\,z_{1:t-1}] = (\beta^z_a)^\top Q^z\beta^z_b - (\mu^z_{t|t-1})_a(\mu^z_{t|t-1})_b\,, \quad a \neq b\,,$   (4.87)

and the covariance matrix Σ^z_{t|t-1} of the second marginal p(z_t | z_{1:t-1}) of the joint distribution in equation (4.67) is fully determined by the means and covariances in equations (4.73), (4.77), (4.80), (4.82), (4.84), and (4.87).
The cross-covariance between x_t and z_t is

$\Sigma^{xz}_{t|t-1} = \mathrm{cov}_{x_t,z_t}[x_t, z_t\,|\,z_{1:t-1}] = \mathrm{E}_{x_t,z_t}[x_t z_t^\top\,|\,z_{1:t-1}] - \mu^x_{t|t-1}(\mu^z_{t|t-1})^\top\,,$   (4.88)
where µ^x_{t|t-1} and µ^z_{t|t-1} are the means of the marginal distributions p(x_t | z_{1:t-1}) and p(z_t | z_{1:t-1}), respectively. Using the law of iterated expectations, for each dimension a = 1, . . . , E, the expectation in equation (4.88) can be written as E_{x_t,g}[x_t m_{g_a}(x_t) | z_{1:t-1}], where we used the fact that E_{g,v_t}[g(x_t) + v_t | x_t] = m_g(x_t) is the mean function of GP_g, which models the mapping x_t → z_t. Writing out the expectation as an integral average over the time update p(x_t | z_{1:t-1}) yields
$\Sigma^{xz}_{t|t-1} = \int x_t\, m_g(x_t)^\top\, p(x_t\,|\,z_{1:t-1})\,\mathrm{d}x_t - \mu^x_{t|t-1}(\mu^z_{t|t-1})^\top\,.$   (4.92)
We now write m_g(x_t) as a finite sum over kernel functions. For each target dimension a = 1, . . . , E of GP_g, we then obtain

$\int x_t\, m_{g_a}(x_t)\, p(x_t\,|\,z_{1:t-1})\,\mathrm{d}x_t = \int x_t\Big(\sum_{i=1}^n \beta^z_{a_i} k_{g_a}(x_t, x_i)\Big) p(x_t\,|\,z_{1:t-1})\,\mathrm{d}x_t$   (4.93)
$\qquad = \sum_{i=1}^n \beta^z_{a_i}\int x_t\, k_{g_a}(x_t, x_i)\, p(x_t\,|\,z_{1:t-1})\,\mathrm{d}x_t\,.$   (4.94)
With the SE covariance function defined in equation (2.3), we can compute the integral in equation (4.94) according to

$\int x_t\, m_{g_a}(x_t)\, p(x_t\,|\,z_{1:t-1})\,\mathrm{d}x_t = \sum_{i=1}^n \beta^z_{a_i}\int x_t\, c_1\,\mathcal{N}(x_t\,|\,x_i, \Lambda_a)\,\mathcal{N}(x_t\,|\,\mu^x_{t|t-1}, \Sigma^x_{t|t-1})\,\mathrm{d}x_t\,,$   (4.95)

where c_1 := α_{g_a}^2 (2π)^{D/2} |Λ_a|^{1/2} normalizes the SE covariance function such that k_{g_a}(x_t, x_i) = c_1 N(x_t | x_i, Λ_a). The product of the two Gaussians in equation (4.95) is an unnormalized Gaussian c_2^{-1} N(x_t | ψ_i, Ψ) with

$c_2^{-1} := (2\pi)^{-\frac{D}{2}}\,|\Lambda_a + \Sigma^x_{t|t-1}|^{-\frac{1}{2}}\exp\big(-\tfrac{1}{2}(x_i - \mu^x_{t|t-1})^\top(\Lambda_a + \Sigma^x_{t|t-1})^{-1}(x_i - \mu^x_{t|t-1})\big)\,,$   (4.97)
$\Psi := \big(\Lambda_a^{-1} + (\Sigma^x_{t|t-1})^{-1}\big)^{-1}\,,$   (4.98)
$\psi_i := \Psi\big(\Lambda_a^{-1}x_i + (\Sigma^x_{t|t-1})^{-1}\mu^x_{t|t-1}\big)\,, \quad a = 1,\dots,E\,,\ i = 1,\dots,n\,.$   (4.99)
Pulling all constants outside the integral in equation (4.95), the integral determines the expected value of the product of the two Gaussians, ψ_i. We obtain

$\mathrm{E}_{x_t,g}[x_t z_t^{(a)}\,|\,z_{1:t-1}] = \sum_{i=1}^n c_1 c_2^{-1}\beta^z_{a_i}\psi_i\,, \quad a = 1,\dots,E\,,$   (4.100)
$\mathrm{cov}_{x_t,g}[x_t, z_t^{(a)}\,|\,z_{1:t-1}] = \sum_{i=1}^n c_1 c_2^{-1}\beta^z_{a_i}\psi_i - \mu^x_{t|t-1}(\mu^z_{t|t-1})_a\,, \quad a = 1,\dots,E\,.$   (4.101)

Using c_1 c_2^{-1} = q^z_{a_i} from equation (4.83) and matrix identities, which closely follow equation (2.67), we obtain

$\mathrm{cov}_{x_t,g}[x_t, z_t^{(a)}\,|\,z_{1:t-1}] = \sum_{i=1}^n \beta^z_{a_i} q^z_{a_i}\,\Sigma^x_{t|t-1}\big(\Sigma^x_{t|t-1} + \Lambda_a\big)^{-1}(x_i - \mu^x_{t|t-1})\,, \quad a = 1,\dots,E\,.$   (4.102)
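A minimal MATLAB sketch of equation (4.102), computing the ath column of Σ^{xz}_{t|t-1} from quantities assumed to be precomputed (all names are illustrative):

function Sxz_a = crosscov_column_sketch(X, beta_a, q_a, mu, Sigma, La)
% Minimal sketch of eq. (4.102): cov[x_t, z_t^{(a)} | z_{1:t-1}] as a D-by-1
% vector. X: n-by-D GP_g training inputs; beta_a, q_a: n-by-1 vectors from
% eqs. (4.82)-(4.83); mu, Sigma: moments of the time update p(x_t|z_{1:t-1});
% La: length-scale matrix Lambda_a.
A = Sigma/(Sigma + La);                              % Sigma*(Sigma + Lambda_a)^{-1}
Sxz_a = A*(bsxfun(@minus, X, mu')'*(beta_a.*q_a));   % sum_i beta_i q_i A (x_i - mu)
end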
The marginal distributions p(x_{t-1} | z_{1:t-1}) = N(µ^x_{t-1|t-1}, Σ^x_{t-1|t-1}), which is the filter distribution at time step t − 1, and p(x_t | z_{1:t-1}) = N(µ^x_{t|t-1}, Σ^x_{t|t-1}), which is the time update, are both known: the filter distribution at time t − 1 has been computed during the forward sweep (see Algorithm 6), and the time update has been computed as a marginal distribution of p(x_t, z_t | z_{1:t-1}) in Section 4.3.1. Hence, the only remaining part is the cross-covariance Σ^x_{t-1,t|t-1}, which we compute in the following.

By the definition of a covariance and the system equation (3.1), the missing cross-covariance matrix in equation (4.103) is
$\Sigma^x_{t-1,t|t-1} = \mathrm{cov}_{x_{t-1},x_t}[x_{t-1}, x_t\,|\,z_{1:t-1}]$   (4.104)
$\qquad = \mathrm{E}_{x_{t-1},f,w_t}\big[x_{t-1}\big(f(x_{t-1}) + w_t\big)^\top\,\big|\,z_{1:t-1}\big] - \mu^x_{t-1|t-1}(\mu^x_{t|t-1})^\top\,,$   (4.105)

where µ^x_{t-1|t-1} is the mean of the filter update at time step t − 1 and µ^x_{t|t-1} is the mean of the time update, see equation (4.69). Note that besides the system noise w_t, GP_f explicitly takes the model uncertainty into account. Using the law of total expectation, we obtain

$\Sigma^x_{t-1,t|t-1} = \mathrm{E}_{x_{t-1}}\big[x_{t-1}\,\mathrm{E}_{f,w_t}[f(x_{t-1}) + w_t\,|\,x_{t-1}]^\top\,\big|\,z_{1:t-1}\big] - \mu^x_{t-1|t-1}(\mu^x_{t|t-1})^\top\,.$   (4.106)
Writing m_f(x_{t-1}) as a finite sum over kernels (Rasmussen and Williams, 2006; Schölkopf and Smola, 2002), the integration turns into

$\int x_{t-1}\, m_{f_a}(x_{t-1})\, p(x_{t-1}\,|\,z_{1:t-1})\,\mathrm{d}x_{t-1}$   (4.109)
$\qquad = \int x_{t-1}\Big(\sum_{i=1}^n \beta^x_{a_i} k_{f_a}(x_{t-1}, x_i)\Big) p(x_{t-1}\,|\,z_{1:t-1})\,\mathrm{d}x_{t-1}$   (4.110)
$\qquad = \sum_{i=1}^n \beta^x_{a_i}\int x_{t-1}\, k_{f_a}(x_{t-1}, x_i)\, p(x_{t-1}\,|\,z_{1:t-1})\,\mathrm{d}x_{t-1}$   (4.111)
for each state dimension a = 1, . . . , D, where k_{f_a} is the covariance function for the ath target dimension of GP_f, x_i, i = 1, . . . , n, are the training inputs, and β^x_a = (K_{f_a} + σ_{w_a}^2 I)^{-1} y_a ∈ R^n. Note that β^x_a is independent of x_{t-1}, which plays the role of an uncertain test input in a GP regression context.
With the SE covariance function defined in equation (2.3), we compute the integral analytically and obtain

$\int x_{t-1}\, m_{f_a}(x_{t-1})\, p(x_{t-1}\,|\,z_{1:t-1})\,\mathrm{d}x_{t-1}$   (4.112)
$\qquad = \sum_{i=1}^n \beta^x_{a_i}\int x_{t-1}\, c_3\,\mathcal{N}(x_{t-1}\,|\,x_i, \Lambda_a)\,\mathcal{N}(x_{t-1}\,|\,\mu^x_{t-1|t-1}, \Sigma^x_{t-1|t-1})\,\mathrm{d}x_{t-1}\,,$   (4.113)

where c_3^{-1} = (α_{f_a}^2 (2π)^{D/2} |Λ_a|^{1/2})^{-1} "normalizes" the squared exponential covariance function k_{f_a}(x_{t-1}, x_i), see equation (2.3), such that k_{f_a}(x_{t-1}, x_i) = c_3 N(x_{t-1} | x_i, Λ_a). In the definition of c_3, α_{f_a}^2 is a hyper-parameter of GP_f responsible for the variance of the latent function. The product of the two Gaussians in equation (4.113) results in a new (unnormalized) Gaussian c_4^{-1} N(x_{t-1} | ψ_i, Ψ) with

$c_4^{-1} = (2\pi)^{-\frac{D}{2}}\,|\Lambda_a + \Sigma^x_{t-1|t-1}|^{-\frac{1}{2}}\exp\big(-\tfrac{1}{2}(x_i - \mu^x_{t-1|t-1})^\top(\Lambda_a + \Sigma^x_{t-1|t-1})^{-1}(x_i - \mu^x_{t-1|t-1})\big)\,,$

where Ψ and ψ_i are defined analogously to equations (4.98) and (4.99) with Σ^x_{t|t-1} and µ^x_{t|t-1} replaced by Σ^x_{t-1|t-1} and µ^x_{t-1|t-1}.
Algorithm 7 Forward and backward sweeps with the GP-ADF and the GP-RTSS
1: initialize forward sweep with p(x_0) = N(µ^x_{0|∅}, Σ^x_{0|∅})   ▷ prior state distribution
2: for t = 1 to T do   ▷ forward sweep
3:   compute p(x_t, z_t | z_{1:t-1})   ▷ 1st Gaussian joint, eq. (4.67); includes time update
4:   compute p(x_{t-1}, x_t | z_{1:t-1})   ▷ 2nd Gaussian joint, eq. (4.103); required for smoothing
5:   compute J_{t-1}   ▷ eq. (4.41); required for smoothing
6:   measure z_t   ▷ measure latent state
7:   compute filter distribution p(x_t | z_{1:t})   ▷ measurement update, eq. (4.28)
8: end for
9: initialize backward sweep with p(x_T | Z) = N(µ^x_{T|T}, Σ^x_{T|T})
10: for t = T to 1 do
11:   compute posterior marginal p(x_{t-1} | Z)   ▷ smoothing marginal distribution, eq. (4.56)
12:   compute joint posterior p(x_{t-1}, x_t | Z)   ▷ smoothing joint distribution, eq. (4.65)
13: end for
Pulling all constants outside the integral in equation (4.113), the integral determines the expected value of the product of the two Gaussians, ψ_i. We finally obtain

$\mathrm{E}_{x_{t-1},f,w_t}[x_{t-1} x_t^{(a)}\,|\,z_{1:t-1}] = \sum_{i=1}^n c_3 c_4^{-1}\beta^x_{a_i}\psi_i\,, \quad a = 1,\dots,D\,,$   (4.117)
$\mathrm{cov}_{x_{t-1},f}[x_{t-1}, x_t^{(a)}\,|\,z_{1:t-1}] = \sum_{i=1}^n c_3 c_4^{-1}\beta^x_{a_i}\psi_i - \mu^x_{t-1|t-1}(\mu^x_{t|t-1})_a\,, \quad a = 1,\dots,D\,.$   (4.118)

With c_3 c_4^{-1} = q^x_{a_i}, which is defined in equation (4.74), and by applying standard matrix identities, we obtain the simplified covariance

$\mathrm{cov}_{x_{t-1},f}[x_{t-1}, x_t^{(a)}\,|\,z_{1:t-1}] = \sum_{i=1}^n \beta^x_{a_i} q^x_{a_i}\,\Sigma^x_{t-1|t-1}\big(\Sigma^x_{t-1|t-1} + \Lambda_a\big)^{-1}(x_i - \mu^x_{t-1|t-1})\,,$   (4.119)

and Σ^x_{t-1,t|t-1}, the covariance matrix of the joint p(x_{t-1}, x_t | z_{1:t-1}), and, hence, the full Gaussian approximation in equation (4.103) are fully determined.

With the mean and the covariance of the joint distribution p(x_{t-1}, x_t | z_{1:t-1}) given by equations (4.73), (4.77), (4.80), and (4.119), and the filter step, all components for the (Gaussian) smoothing distribution p(x_t | z_{1:T}) can be computed analytically.
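The measurement update in line 7 of Algorithm 7 amounts to conditioning the joint Gaussian p(x_t, z_t | z_{1:t-1}) from line 3 on the observed measurement z_t. A minimal MATLAB sketch of this conditioning step, with all variable names being assumptions:

function [mu_f, S_f] = gauss_condition_sketch(mx, Sx, mz, Sz, Sxz, z)
% Minimal sketch: condition a joint Gaussian N([mx; mz], [Sx Sxz; Sxz' Sz])
% on an observed value z of its second component.
K    = Sxz/Sz;            % "gain" Sxz*Sz^{-1}
mu_f = mx + K*(z - mz);   % conditional mean
S_f  = Sx - K*Sxz';       % conditional covariance
end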
4.4 Results
We evaluated the performance of the GP-ADF and the GP-RTSS for filtering and smoothing and compared them to the commonly used EKF/EKS (Maybeck, 1979; Roweis and Ghahramani, 1999), the UKF/URTSS (Julier and Uhlmann, 2004; Särkkä, 2008), the CKF (Arasaratnam and Haykin, 2009), and the corresponding CKS. Furthermore, we analyzed the performance of the GP-UKF (Ko and Fox, 2009a) and an RTS smoother based on the GP-UKF, which we therefore call GP-URTSS. As ground truth we considered a Gaussian filtering and smoothing algorithm based on Gibbs sampling (Deisenroth and Ohlsson, 2010).
All filters and smoothers approximate the posterior distribution of the hidden
state by a Gaussian. The filters and smoothers were evaluated using up to three
performance measures:
• Root Mean Squared Error (RMSE). The RMSE

$\mathrm{RMSE}_x := \sqrt{\mathrm{E}_x\big[|x_{\text{true}} - \mu^x_{t|\bullet}|^2\big]}\,, \quad \bullet \in \{t, T\}\,,$   (4.120)

is an error measure often used in the literature. Smaller values indicate better performance. For notational convenience, we introduced •: in filtering, • = t, whereas in smoothing • = T. The RMSE is small if the mean of the inferred state distribution is close to the true state. However, the RMSE does not take the covariance of the state distribution into account. The units of the RMSE are the units of the quantity x_true. Hence, the RMSE does not directly generalize to multivariate states.
• Mean Absolute Error (MAE). The MAE

$\mathrm{MAE}_x := \mathrm{E}_x\big[|x_{\text{true}} - \mu^x_{t|\bullet}|\big]\,, \quad \bullet \in \{t, T\}\,,$   (4.121)

penalizes the difference between the true state and the mean of the latent state distribution. However, it does not take uncertainty into account. As for the RMSE, the units of the MAE are the units of the quantity x_true. Like the RMSE, the MAE does not directly generalize to multivariate states.
• Negative Log-Likelihood (NLL). The NLL measure penalizes the negative log-likelihood of the true state under the inferred Gaussian state distribution,

$\mathrm{NLL}_x := -\log\mathcal{N}\big(x_{\text{true}}\,|\,\mu^x_{t|\bullet}, \Sigma^x_{t|\bullet}\big) = \tfrac{1}{2}\big((x_{\text{true}} - \mu^x_{t|\bullet})^\top(\Sigma^x_{t|\bullet})^{-1}(x_{\text{true}} - \mu^x_{t|\bullet}) + \log|\Sigma^x_{t|\bullet}| + D\log(2\pi)\big)\,,$   (4.123)

where • ∈ {t, T} and D is the dimension of the state x. The last term in equation (4.123) is a constant. The negative log-likelihood penalizes both the volume of the posterior covariance matrix and the Mahalanobis distance of the true state from the posterior mean; optimal NLL values require a tradeoff between coherence (Mahalanobis term) and confidence (log-determinant term). Smaller values of the NLL measure indicate better performance. The units of the negative log-likelihood are nats. A minimal code sketch of all three measures is given after this list.
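The following minimal MATLAB sketch computes all three measures for a sequence of scalar filter estimates; it is an illustration with assumed variable names, not the evaluation code used for the experiments.

function [rmse, mae, nll] = filter_scores_sketch(x_true, mu, s2)
% Minimal sketch: RMSE (4.120), MAE (4.121), and the average Gaussian negative
% log-likelihood for scalar states. x_true, mu, s2: vectors of true states,
% posterior means, and posterior variances over the evaluation runs.
err  = x_true(:) - mu(:);
rmse = sqrt(mean(err.^2));
mae  = mean(abs(err));
nll  = mean(0.5*(err.^2./s2(:) + log(s2(:)) + log(2*pi)));   % in nats
end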
Remark 10. The negative log-likelihood is related to the deviance and can be con-
sidered a Bayesian measure of the total model fit: For a large sample size, the
expected negative log-likelihood equals the posterior probability up to a constant.
Therefore, the model with the lowest expected negative log-likelihood will have
the highest posterior probability amongst all considered models (Gelman et al.,
2004).
Definition 3 (Incoherent distribution). We call a distribution p(x) incoherent if the
value xtrue is an outlier under the distribution, that is, the value of the probability
density function at xtrue is close to zero.
4.4.1 Filtering
In the following, we evaluate the GP-ADF for linear and nonlinear systems. The
evaluation of the linear system shows that the GP-ADF leads to coherent estimates
of the latent-state posterior distribution even in cases where there are only a few
training examples. The analysis in the case of nonlinear systems confirms that
the GP-ADF is a robust estimator in contrast to commonly used Gaussian filters
including the EKF, the UKF, and the CKF.
Linear System
We first consider the linear stochastic dynamic system
Figure 4.4: RMSE_x as a function of the dimensionality and the size of the training set. The x-axes show the size of the training set employed, the y-axes show the RMSE of the GP-ADF (red line) and the RMSE of the optimal filter (blue line). The error bars show the 95% confidence intervals. The different panels show the graphs for D = 2, 5, 10, 20 dimensions, respectively. Note the different scales of the y-axes.
Figure 4.5: MAE_x as a function of the dimensionality and the size of the training set. The x-axes show the size of the training set employed, the y-axes show the MAE of the GP-ADF (red line) and the MAE of the optimal filter (blue line). The error bars show the 95% confidence intervals. The different panels show the graphs for D = 2, 5, 10, 20 dimensions, respectively. Note the different scales of the y-axes.
We average the performance measures over multiple trajectories, each of length 100 time steps, and compare the GP-ADF to the optimal linear filter (Kalman filter). Note, however, that in the GP dynamic system we still use the SE covariance function in equation (2.3), which models linear functions exactly only in the extreme case of infinitely long length-scales. For reasons of numerical stability, we restricted the length-scales to be smaller than 10^5.
Figure 4.4 shows the RMSEx -values of the GP-ADF and the optimal RMSEx
achieved by the Kalman filter as a function of the training set size and the di-
mensionality. The RMSE of the GP-ADF gets closer to the optimal RMSE the more
training data are used. We also see that the more training points we use, the less the
RMSEx -values vary. However, the rate of decrease is also slower with increasing
dimension.
Figure 4.5 allows for drawing similar conclusions for the MAEx -values. Note
that both the RMSEx and the MAEx only evaluate how close the mean of the filtered
state distribution is to the true state. Figure 4.6 shows the NLLx -values, which
reflect the coherence, that is, the “accuracy” in both the mean and the covariance
of the filter distribution. Here, it can be seen that with about 50 data points even
in 20 dimensions, the GP-ADF yields good results.
Figure 4.6: NLL_x as a function of the dimensionality and the size of the training set. The x-axes show the size of the training set employed, the y-axes show the NLL of the GP-ADF (red line) and the NLL of the optimal filter (blue line). The error bars show the 95% confidence intervals. The different panels show the graphs for D = 2, 5, 10, 20 dimensions, respectively. Note the different scales of the y-axes.
Nonlinear System
We consider the nonlinear stochastic dynamic system in equations (4.126)–(4.127), which is a modified version of the model used in the papers by Kitagawa (1996)
and Doucet et al. (2000). The system is modified in two ways: First, we excluded
a purely time-dependent term in the system equation (4.126), which would not
allow for learning stationary transition dynamics required by the GP models. Sec-
ond, we substituted a sinusoidal measurement function for the originally quadratic
measurement function used by Kitagawa (1996) and Doucet et al. (2000). The si-
nusoidal measurement function increases the difficulty in computing the marginal
distribution p(zt |z1:t−1 ) if the time update distribution p(xt |z1:t−1 ) is fairly uncertain:
While the quadratic measurement function can only lead to bimodal distributions
(assuming a Gaussian input distribution), the sinusoidal measurement function in
equation (4.127) can lead to an arbitrary number of modes—for a sufficiently broad
input distribution.
Using the dynamic system defined in equations (4.126)–(4.127), we analyze the
performances of a single filter step of all considered filtering algorithms against the
ground truth, which is approximated by the Gibbs-filter proposed by Deisenroth
and Ohlsson (2010). Compared to the evaluation of longer trajectories, evaluating
a single filter step makes it easier to find out when and why particular filtering
algorithms fail.
The Gibbs-filter computes the means and covariances of the joint distributions p(x_{t-1}, x_t | z_{1:t-1}) and p(x_t, z_t | z_{1:t-1}) from samples; the burn-in periods were set to 100 steps.
The prior variance was set to σ_0^2 = 0.5^2, the system noise variance to σ_w^2 = 0.2^2, and the measurement noise variance to σ_v^2 = 0.2^2. With this experimental setup, the initial uncertainty is fairly high, but the system and measurement noises are fairly small considering the amplitudes of the system function and the measurement function.

A linear grid of mean values (µ_0^x)_i in the interval [−3, 3], i = 1, . . . , 100, was defined. Then, a single latent (initial) state x_0^(i) was sampled from the prior p(x_0^(i)) = N((µ_0^x)_i, σ_0^2), i = 1, . . . , 100. For 100 independent pairs (x_0^(i), z_1^(i)) of initial states and measurements of the successor states, we assessed the performances of the filters.
Table 4.2: Expected filter performances (RMSE, MAE, NLL) with standard deviations (68% confidence interval) in latent and observed space for the dynamic system defined in equations (4.126)–(4.127). We also provide p-values for each performance measure from a one-sided t-test under the null hypothesis that the corresponding filtering algorithm performs at least as well as the GP-ADF.

Filter         RMSE                               MAE                                NLL
UKF            10.5 ± 17.2  (p < 10^-4)           8.6 ± 14.5  (p < 10^-4)            26.0 ± 54.9  (p < 10^-4)
CKF            9.2 ± 17.9   (p = 3.0 × 10^-4)     7.3 ± 14.8  (p = 4.5 × 10^-4)      2.2 × 10^2 ± 2.4 × 10^2  (p < 10^-4)
GP-UKF         5.4 ± 7.3    (p = 9.1 × 10^-4)     3.8 ± 7.3   (p = 3.5 × 10^-3)      6.0 ± 7.9  (p < 10^-4)
GP-ADF         2.9 ± 2.8                          2.2 ± 2.4                          2.0 ± 1.0
Gibbs-filter   2.8 ± 2.7    (p = 0.54)            2.1 ± 2.3   (p = 0.56)             2.0 ± 1.1  (p = 0.57)
PF             1.6 ± 1.2    (p = 1.0)             1.1 ± 0.9   (p = 1.0)              1.0 ± 1.3  (p = 1.0)
Table 4.2 summarizes the expected filter performances for a single filter step per run (see the experimental setup in Algorithm 8). We consider the EKF, the UKF, and the CKF to be “conventional” Gaussian filters. The GP-filters assume that the transition mapping and the measurement mapping are given by GPs (GP dynamic system). The Gibbs-filter and the particle filter are based on random sampling. However, of the latter two filters only the Gibbs-filter can be considered a classical Gaussian filter since it computes the joint distributions p(x_t, z_t | z_{1:t-1}) and p(x_{t-1}, x_t | z_{1:t-1}) following Algorithm 6. Since the particle filter represents densities by particles, we used these particles to compute the sample mean and the sample covariance in order to compare to the Gaussian filters. For each performance measure, Table 4.2 also provides p-values from a one-sided t-test under the null hypothesis that, on average, the corresponding filtering algorithm performs at least as well as the GP-ADF. Small values of p cast doubt on the validity of the null hypothesis.
Let us first have a look at the performance measures in latent space (upper half of Table 4.2): compared to the results from the ground-truth Gibbs-filter, the EKF performs reasonably well in the RMSEx and the MAEx measures; however, it performs catastrophically in the NLLx-measure. This indicates that the mean of the EKF’s filter distributions is on average close to the true state, but the filter variances are underestimated, leading to incoherent filter distributions. Compared to the Gibbs-filter and the EKF, the performance of the CKF and the UKF according to the RMSE and MAE measures is worse. Although the UKF is significantly worse than the Gibbs-filter according to the NLLx-measure, it is not as incoherent as the EKF and the CKF. In the considered example, the GP-UKF somewhat alleviates the shortcomings of the UKF, although this shall not be taken as a general statement: the GP-UKF employs two approximations (density approximation using sigma points and function approximation using GPs) instead of a single one (UKF: density approximation using sigma points). The GP approximation of the transition mapping and the measurement mapping alleviates the problems from which the original UKF suffers in the considered case. In latent space, the GP-ADF is not statistically significantly different from the Gibbs-filter (p-values are between 0.5 and 0.6). However, the GP-ADF statistically significantly outperforms the GP-UKF. In the considered example, the particle filter generally outperforms all other filters; note, however, that the particle filter is not a Gaussian filter. For the observed space (lower half of Table 4.2), we draw similar conclusions, although the performances of all filters are closer to the Gibbs-filter ground truth; the tendencies toward (in)coherent distributions remain unchanged. We note that the p-values of the Gibbs-filter in observed space are relatively large given that the average filter performances of the Gibbs-filter and the GP-ADF are almost identical. In the NLLz-measure, for example, the Gibbs-filter is always better than the GP-ADF. However, the average and maximum discrepancies are small, with values of 0.06 and 0.18, respectively.
Figure 4.7 visually confirms the numerical results in Table 4.2: it shows the filter distributions in a single run for 100 test inputs, which have been sampled from the prior distributions N((µ_0^x)_i, σ_0^2 = 0.5^2), i = 1, . . . , 100. At first glance, the EKF and the CKF suffer from too optimistic (overconfident) filter distributions since the error-bars are vanishingly small (Panels (a) and (c)), leading to incoherent filter distributions. The incoherence of the EKF is largely due to the linearization error in combination with uncertain input distributions. The UKF in Panel (b) and partially the GP-UKF in Panel (d) also suffer from incoherent filter distributions, although the error-bars are generally more reasonable than the ones of the EKF/CKF. However, we notice that the means of the filter distributions of the (GP-)UKF can be far from the true latent state, which in these cases often leads to inconsistencies. The reason for this is the degeneracy of finite-sample approximations of densities (the CKF suffers from the same problem), which will be discussed in the next paragraph. The filter distribution of the particle filter in Panel (f) is largely coherent, meaning that the true latent state (black diamonds) can be explained by the filter distributions (red error-bars). Note that the error-bars for |µ_0^x| > 1 are fairly small, whereas the error-bars for |µ_0^x| ≤ 1 are relatively large. This is because a) the prior state distribution p(x_0) is relatively broad, b) the variability of the system function f is fairly high for |x| ≤ 1 (which leads to a large uncertainty in the time update), and c) the measurement function is multi-modal. Both the GP-ADF in Panel 4.7(e) and the Gibbs-filter in Panel (g) generally provide coherent filter distributions, which is due to the fact that both filters are assumed density filters.
Figure 4.7: Example filter distributions for the nonlinear dynamic system defined in equations (4.126)–(4.127) for 100 test inputs. The x-axis shows the means (µ_0^x)_i of the prior distributions N((µ_0^x)_i, σ_0^2 = 0.5^2), i = 1, . . . , 100. The y-axis shows the distributions of the filtered states x_1^(i) given measurements z_1. True latent states are represented by black diamonds, the (Gaussian) filter distributions are represented by the red error-bars (95% confidence intervals). Panels: (a) EKF, (b) UKF, and (c) CKF distributions can be overconfident and incoherent; (d) GP-UKF distributions can be overconfident and incoherent; (e) GP-ADF distributions are coherent; (f) PF; (g) Gibbs-filter.

Degeneracy of Finite-Sample Filters. Using the example of the UKF for the dynamic system given in equations (4.126)–(4.127), Figure 4.8 illustrates the danger of degenerate predictions based on the unscented transformation and, more generally, why prediction, filtering, and smoothing algorithms based on small-sample approximations
(for example the UKF, the CKF, or the GP-UKF) can fail: the representation of densities by only a few samples can underestimate the variance of the true distribution. An extreme case of this is shown in Figure 4.8(b), where all sigma points are mapped onto function values that are close to each other. Here, the UT suffers severely from the high uncertainty of the input distribution and the multi-modality of the measurement function, and its predictive distribution does not credibly take the variability of the function into account. The true predictive distribution (shaded area) is almost uniform, essentially encoding that no function value (in the range of g) has very low probability. Turner and Rasmussen (2010) propose a learning algorithm to reduce the UKF's vulnerability to degeneracy by finding optimal placements of the sigma points.
Figure 4.8: Degeneracy of the unscented transformation underlying the UKF. Input distributions to the UT are the Gaussians in the sub-figures at the bottom of each panel. The function to which the UT is applied is shown in the top-right sub-figures. In Panel (a), the function is the transition mapping from equation (4.126); in Panel (b), the function is the measurement mapping from equation (4.127). Sigma points are marked by red dots. The predictive distributions are shown in the left sub-figures of each panel. The true predictive distributions are represented by the shaded areas, the predictive distributions determined by the UT are represented by the solid Gaussians. The predictive distribution of the time update in Panel (a) corresponds to the input distribution at the bottom of Panel (b). The demonstrated degeneracy of the predictive distribution based on the UT also occurs in the GP-UKF and the CKF, the latter of which uses an even smaller set of cubature points to determine predictive distributions. Panel (a): for the input distribution p(x_0), the UT determines three sigma points that are mapped through the transition function f; the resulting UT prediction misses out a significant mode of the true predictive distribution p(x_1). Panel (b): the input distribution p(x_1) equals the UT prediction from Panel (a); the mapped sigma points happen to have approximately the same function value, so the resulting predictive distribution p(z_1) is fairly peaked and cannot explain the actual measurement z_1.
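The following minimal MATLAB sketch reproduces the qualitative effect illustrated in Figure 4.8(b): the unscented transform of a broad Gaussian through a multi-modal function can severely underestimate the predictive variance. The function g and all numerical values are illustrative stand-ins, not the mappings from equations (4.126)–(4.127).

function ut_degeneracy_sketch
% Minimal sketch: the UT with three sigma points (1-D input, kappa = 2) versus
% a Monte Carlo estimate of the true predictive moments.
g  = @(x) 5*sin(x);                        % stand-in for a multi-modal mapping
mu = 0; s2 = 4;                            % broad input distribution N(0, 2^2)
n  = 1; kappa = 2;
xs = [mu, mu + sqrt((n+kappa)*s2), mu - sqrt((n+kappa)*s2)];   % sigma points
w  = [kappa/(n+kappa), 1/(2*(n+kappa)), 1/(2*(n+kappa))];      % UT weights
gs = g(xs);
mu_ut = w*gs';                             % UT predictive mean
s2_ut = w*((gs - mu_ut).^2)';              % UT predictive variance
xmc = mu + sqrt(s2)*randn(1e6,1);          % Monte Carlo reference
fprintf('UT: mean %.2f, var %.2f | MC: mean %.2f, var %.2f\n', ...
        mu_ut, s2_ut, mean(g(xmc)), var(g(xmc)));
end

With these values the UT variance comes out substantially smaller than the Monte Carlo estimate, mirroring the peaked UT prediction in Panel (b).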
4.4.2 Smoothing
The problem of tracking a pendulum is considered. In the following, we evaluate
the performances of the EKF/EKS, the UKF/URTSS, the GP-UKF/GP-URTSS, the
CKF/CKS, the Gibbs-filter/smoother, and the GP-ADF/GP-RTSS.
The pendulum possesses a mass m = 1 kg and a length l = 1 m. The pendulum angle ϕ is measured anti-clockwise from hanging down. A constrained torque u ∈ [−5, 5] Nm can be applied to the pendulum. The state x = [ϕ̇, ϕ]^⊤ of the pendulum is given by the angle ϕ and the angular velocity ϕ̇. The equations of motion are given by the two coupled ordinary differential equations

$\frac{\mathrm{d}}{\mathrm{d}t}\begin{bmatrix}\dot\varphi\\ \varphi\end{bmatrix} = \begin{bmatrix}\dfrac{u - b\dot\varphi - \tfrac{1}{2}mlg\sin\varphi}{\tfrac{1}{4}ml^2 + I}\\[2mm] \dot\varphi\end{bmatrix}\,,$   (4.128)
Experimental Setup
The latent state was computed using an ODE solver for equation (4.129) with a zero-order-hold control signal u(τ). Every time increment Δ_t, the latent state was measured according to

$z_t = \arctan\left(\frac{-1 - l\sin(\varphi_t)}{\tfrac{1}{2} - l\cos(\varphi_t)}\right) + v_t\,, \quad \sigma_v^2 = 0.05^2\,.$   (4.131)
The measurement equation (4.131) solely depends on the angle; only a scalar mea-
surement of the two-dimensional latent state was obtained. Thus, the full distri-
bution of the latent state x had to be reconstructed by using the cross-correlation
information between the angle and the angular velocity.
Algorithm 9 details the setup of the tracking experiment. 1,000 trajectories were started from initial states independently sampled from the distribution x_0 ∼ N(µ_0, Σ_0) with µ_0 = [0, 0]^⊤ and Σ_0 = diag([0.01^2, (π/16)^2]). For each trajectory, GP models GP_f and GP_g were learned based on randomly generated data. The filters tracked the state for T = 30 time steps, which corresponds to 6 s. After filtering, the smoothing distributions were computed. For both filtering and smoothing, the corresponding performance measures were computed.
The GP-RTSS was implemented according to Section 4.3.2. Furthermore, we
implemented the EKS, the CKS, the Gibbs-smoother, the URTSS, and the GP-
URTSS by extending the corresponding filtering algorithms: In the forward sweep,
we just need to compute the matrix Jt−1 given in equation (4.41), which requires the
covariance Σxt−1,t|t−1 to be computed on top of the standard filtering distribution.
Having computed Jt−1 , the smoothed marginal distribution p(xt−1 |Z) can be com-
puted using matrix-matrix and matrix-vector multiplications of known quantities
as described in equations (4.54) and (4.55).
Evaluation
We now discuss the RTS smoothers in the context of the pendulum dynamic
system, where we assume a frictionless system.
Table 4.3 reports the expected values of the NLLx -measure for the EKF/EKS,
the UKF/URTSS, the GP-UKF/GP-URTSS, the GP-ADF/GP-RTSS, and the CK-
F/CKS when tracking the pendulum over a horizon of 6 s, averaged over 1,000
runs. As in the example in Section 4.4.1, the NLLx-measure hints at the robustness of our proposed method: the GP-RTSS is the only smoother that consistently reduced the negative log-likelihood value compared to the corresponding filtering algorithm. An increase of the NLLx-values occurred only when the corresponding filter distribution was already incoherent.
Table 4.3: Expected filtering and smoothing performances (NLL) with standard deviations (68%
confidence interval) for tracking the pendulum.
Figure 4.9: 0.99-quantile worst smoothing performances of the EKS, the URTSS, the CKS, the GP-URTSS, and the GP-RTSS across all our experiments. The x-axis denotes time, the y-axis shows the angular velocity in the corresponding example trajectory. The shaded areas represent the 95% confidence intervals of the smoothing distributions, the black diamonds represent the trajectory of the true latent angular velocity. Panels: (a) 0.99-quantile worst smoothing performance of the EKS, resulting in NLL_x = 4.5 × 10^3; (b) URTSS, NLL_x = 9.6 × 10^2; (c) CKS, NLL_x = 2.0 × 10^3; (d) GP-URTSS, NLL_x = 2.7 × 10^2; (e) GP-RTSS.
In these cases, the smoothers did not recover from their tracking failures. Figure 4.9(c) shows the 0.99-quantile worst case
performance of the CKS. After about 1.2 s, the CKS lost track of the state, and the
smoothing distribution became incoherent; after 2.2 s the 95% confidence interval
of the smoothed state distribution was nowhere close to explaining the latent state
variable; the smoother did not recover from its tracking failure. In Figure 4.9(d),
initially, the GP-URTSS lost track of the angular velocity without being aware of
this. However, after about 2.2 s, the GP-URTSS could track the angular velocity
again. Although the GP-RTSS in Figure 4.9(e) lost track of the angular velocity, the
smoother was aware of this, which is indicated by the increasing uncertainty over
Table 4.4: Expected filtering and smoothing performances (NLL) with standard deviations (68% confi-
dence interval) for tracking the pendulum with the GP-ADF and the GP-RTSS using differently sized
training sets.
time. The smoother lost track of the state since the pendulum’s trajectory went
through regions of the state space that were dissimilar to the GP training inputs.
Since the GP-RTSS took the model uncertainty fully into account, the smoothing
distribution was still coherent.
These results hint at the robustness of the GP-ADF and the GP-RTSS for fil-
tering and smoothing: In our experiments they were the only algorithms that pro-
duced coherent posterior distributions on the latent state even when they lost track
of the state. By contrast, all other filtering and smoothing algorithms could become
overconfident leading to incoherent posterior distributions.
Thus far, we have considered only cases where the GP models were trained with a reasonable amount of data to learn good approximations of the underlying dynamic systems. In the following, we present results for the GP-based filters when only a relatively small data set is available to train the GP models GP_f and GP_g. We chose exactly the same experimental setup as described in Algorithm 9 besides the size of the training sets, which we set to 50 or 20 randomly sampled data points.
Table 4.4 details the NLL performance measure for using the GP-ADF and the GP-
RTSS for tracking the pendulum, where we added the number of training points as
a subindex. We observe the following: First, with only 50 or 20 training points, the
GP-ADF and the GP-RTSS still outperform the commonly used EKF/EKS, UKF/
URTSS, and CKF/CKS (see Table 4.3). Second, the smoother (GP-RTSS) still im-
proves the filtering result (GP-ADF), which, together with the small standard de-
viation, hints at the robustness of our proposed methods. This also indicates that
the posterior distributions computed by the GP-ADF and the GP-RTSS are still co-
herent, that is, the true state of the pendulum can be explained by the posterior
distributions.
Although the GP-RTSS is computationally more involved than the URTSS, the EKS, and the CKS, this does not necessarily imply that smoothing with the GP-RTSS is slower: function evaluations, which are heavily used by the EKS/CKS/URTSS, are not necessary in the GP-RTSS (after training). In the pendulum example, repeatedly calling the ODE solver caused the EKS/CKS/URTSS to be slower than the GP-RTSS by a factor of two.
4.5 Discussion
Strengths. The GP-ADF and the GP-RTSS represent first steps toward robust fil-
tering and smoothing in GP dynamic systems. Model uncertainty and stochas-
ticity of the system and measurement functions are explicitly taken into account
during filtering and smoothing. The GP-ADF and the GP-RTSS do neither rely on
explicit linearization nor on sigma-point representations of Gaussian state distri-
butions. Instead, our methods profit from analytically tractable computation of the
moments of predictive distributions.
Limitations. One can ask why we should consider using GPs (and filtering/smoothing in GP dynamic systems) at all if the dynamics and measurement functions are only mildly nonlinear. Covering high-dimensional spaces with uniform samples to train the (GP) models is impractical, and training the models and performing filtering/smoothing can be orders of magnitude slower than standard filtering algorithms. A potential weakness of the classical EKF/UKF/CKF in a more realistic application is their strong dependence on an "accurate" parametric model; this is independent of the approximation used by the EKF, the UKF, or the CKF. In a practical application, a promising approach is to combine idealized parametric assumptions as an informative prior with Bayesian non-parametric models. Ko et al. (2007a) and Ko and Fox (2009a) successfully applied this idea in the context of control and tracking problems.
Similarity to results for linear systems. The equations in the forward and back-
ward sweeps resemble the results for the unscented Rauch-Tung-Striebel smoother
(URTSS) by Särkkä (2008) and for linear dynamic systems given by Ghahramani
and Hinton (1996), Murphy (2002), Minka (1998), and Bishop (2006). This fol-
lows directly from the generic filtering and smoothing results in equations (4.28)
and (4.56). Note, however, that unlike the linear case, the distributions in our
algorithm are computed by explicitly incorporating nonlinear dynamics and mea-
surement models when propagating full Gaussian distributions. This means that the mean and the covariance of the filtering distribution at time t depend nonlinearly on the state at time t − 1, due to Bayesian averaging over the nonlinearities described by the GP models GP_f and GP_g, respectively.
Computational complexity. The computationally most expensive operations are the computations of the cross-covariance terms in equations (4.80) and (4.87) in the predictive covariance matrix for both the transition model and the measurement model. Note that the computational complexity scales linearly with the number of time steps. The computational demand of classical Gaussian smoothers, such as the URTSS and the EKS, is O(T(D^3 + E^3)).
Implicit linearizations. The forward inferences (from xt−1 to xt via the dynam-
ics model and from xt to zt via the measurement model) compute the means and
the covariances of the true predictive distributions analytically. However, we im-
plicitly linearize the backward inferences (from zt to xt in the forward sweep and
from xt to xt−1 in the backward sweep) by modeling the corresponding joint dis-
tributions as Gaussians. Therefore, our approximate inference algorithm relies on
two implicit linearizations: first, in the forward pass, the measurement model is implicitly linearized when computing the filter distribution; second, in the backward pass, the transition model is implicitly linearized when computing the smoothing distribution.
Figure 4.10: GP predictions using the GP-UKF and the GP-ADF. The GP distribution is described by the mean function (black line) and the shaded area (representing the 95% confidence intervals of the marginals) in the upper-right panel. Assume we want to predict the outcome at an uncertain test input x*, whose distribution is represented by the Gaussian in the lower-right panel. The GP-UKF approximates this Gaussian by the red sigma points (lower-right panel), maps them through the GP model (upper-right panel), and approximates the true predictive distribution (shaded area in the upper-left panel) by the sample distribution of the red sigma points (red Gaussian in the upper-left panel). By contrast, the GP-ADF maps the full Gaussian (lower-right panel) through the GP model (upper-right panel) and computes the mean and the variance of the true predictive distribution (exact moment matching). The corresponding Gaussian approximation of the true predictive distribution is the blue Gaussian in the upper-left panel.
Preserving the moments. The moments of p(xt |z1:t ) and p(xt−1 |Z) are not com-
puted exactly (exact moment matching was only done for predictions): These dis-
tributions are based on the conditionals p(xt |zt ) and p(xt−1 |xt , Z), which themselves
were computed using the assumption of joint Gaussianity of p(xt , zt |z1:t−1 ) and
p(xt−1 , xt |Z), respectively. In the context of Gaussian filters and smoothers, these
approximations are common and also appear for example in the URTSS (Särkkä,
2008) and in sequential importance sampling (Godsill et al., 2004).
GPs are typically used in cases where D is small and n ≫ D, which reduces the
computational advantage of the GP-UKF over the GP-ADF. Summarizing, there is
no obvious reason to prefer the GP-UKF to the GP-ADF.
Without proof, we believe that the GP-UKF can be considered a finite-sample
approximation of the GP-ADF: In the limit of infinitely many sigma points drawn
from the Gaussian input distribution, the GP-UKF predictive distribution con-
verges to the GP-ADF predictive distribution if additionally the function values
are drawn according to the GP predictive distribution for certain inputs.
4.6 Further Reading

In a control context, one often prefers to define the controller as a function of the hidden state x, which
requires identification of the transition function and the measurement function. In
nonlinear systems, this problem is ill posed since there are infinitely many dy-
namic systems that could have generated the measurement sequence z1:T , which
is the only data that are available. To obtain interpretability, it is possible to train
the measurement model first via sensor calibration. This can be done by feeding
known values into the sensors and measuring the corresponding outputs. Using
this model for the mapping g in equation (4.2), the transition function in latent
space is constrained.
Learning the GP models GP f and GP g from a sequence of measurements is
similar to parameter learning or system identification. The filtering and smoothing
algorithms proposed in this chapter allow for gradient-based system identification,
for example using Expectation Maximization (EM) or variational methods (Bishop,
2006), which, however, is left to future work. First approaches for learning GP
dynamic systems have been proposed by Wang et al. (2006, 2008), Ko and Fox
(2009b), and Turner et al. (2010).
Wang et al. (2006, 2008) and Ko and Fox (2009b) propose parameter-learning
approaches based on Gaussian Process Latent Variable Models (GPLVMs) intro-
duced by Lawrence (2005). Wang et al. (2006, 2008) apply their method primarily
to motion-capture data to produce smooth measurement sequences for animation.
Unlike our model, Wang et al. (2006, 2008) and Ko and Fox (2009b) cannot exploit
the Markov property in latent space, and the corresponding algorithms scale cubically in the number of time steps: the sequence of latent states is inferred at once
since neither Wang et al. (2006, 2008) nor Ko and Fox (2009b) use GP training sets
that can be generated independently of the test set, that is, the set that generates
the observed time series.
Large data sets. In case of large data sets, sparse GP approximations can be di-
rectly incorporated into the filtering and smoothing algorithm. The standard GP
predictions given in the equations (2.28) and (2.29) are replaced with the sparse pre-
dictive distribution given in equations (2.84) and (2.85), respectively. Furthermore,
prediction with uncertain inputs requires a few straightforward changes.
Ypma and Heskes (2005), Zoeter et al. (2004), Zoeter and Heskes (2005), and
Zoeter et al. (2006) discuss sampling-based inference in nonlinear systems within
an EP framework. Ghahramani and Roweis (1999) and Roweis and Ghahramani
(2001) discuss inference in nonlinear dynamic systems in the context of the EKS.
The transition dynamics and the measurement function are modeled by radial
basis function networks. Analytic approximations are presented, which also allow
for gradient-based parameter learning via EM.
Instead of representing densities by Gaussians, they can also be described by
a finite set of random samples, the particles. The corresponding estimators, the
particle filters, are essentially based on Monte Carlo methods. In the literature, they
also appear under the names sampling/importance re-sampling (SIR) filter (Gordon
et al., 1993), interacting particle filter (del Moral, 1996), or sequential Monte Carlo
(SMC) methods (Liu and Chen, 1998; Doucet et al., 2000).
4.7 Summary
From a fully probabilistic perspective, we derived general expressions for Gaussian
filtering and smoothing. We showed that the determination of two joint probability distributions is sufficient for filtering and smoothing. Based on this
insight, we introduced a coherent and general approximate inference algorithm for
filtering and smoothing in nonlinear stochastic dynamic systems for the case that
the latent transition mapping and the measurement mapping are modeled by GPs.
Our algorithm profits from the fact that the moments of predictive distributions can
be computed analytically, which allows for exact moment matching. Therefore, our
inference algorithm does not rely on sampling methods or on finite-sample approx-
imations of densities, which can lead to incoherent filter distributions. Filtering in
the forward sweep implicitly linearizes the measurement function, whereas the
backward sweep implicitly linearizes the transition dynamics, however, without
requiring an explicit backward (inverse) dynamics model.
A shortcoming of our algorithm is that it still requires ground-truth observa-
tions of the latent state to train the GP models. In a real-world application, this
assumption is fairly restrictive. We provided some ideas of how to incorporate our
inference algorithm into system identification using EM or variational learning.
Due to the dependence on ground-truth observations of the latent state, we did
not evaluate our algorithms on real data sets but only on artificial data sets.
First results with artificial data sets for both filtering and smoothing were
promising and pointed at the incoherences of state-of-the-art Gaussian filters and
smoothers including the unscented Kalman filter/smoother, their extension to GP
dynamic systems, the extended Kalman filter/smoother, and the cubature Kalman
filter/smoother. Due to exact moment matching for predictions and the faithful
representation of model uncertainty, our algorithm did not suffer severely from
these incoherences.
5 Conclusions
A central objective of this thesis was to make artificial systems more efficient in
terms of the number of interactions required to learn a task even if no task-specific
expert knowledge is available. Based on well-established ideas from Bayesian
statistics and machine learning, we proposed pilco, a general and fully Bayesian
framework for efficient autonomous learning in Markov decision processes (see
Chapter 3). In the light of limited experience, the key ingredient of the pilco
framework is a probabilistic model that carefully models available data and faith-
fully quantifies its own fidelity. In the context of control-related problems, this
model represents the transition dynamics of the system and mimics two important
features of biological learners: the ability to generalize and the explicit incorporation of uncertainty into the decision-making process. By Bayesian averaging, pilco coherently incorporates all sources of uncertainty into long-term planning and policy learning. This conceptual simplicity is part of the beauty of our framework.
To apply pilco to arbitrary tasks with continuous-valued state and control
spaces, one simply needs to specify the immediate cost/reward and a handful of
parameters that are fairly easy to set. Using this information only, pilco learns
a policy fully automatically. This makes pilco a practical framework. Since only
fairly general assumptions are required, pilco is also applicable to problems where
expert knowledge is either expensive or simply not available. This includes, for
example, control of complicated robotic systems as well as control of biological
and/or chemical processes.
We provided experimental evidence that a coherent treatment of uncertainty
is crucial for rapid learning success in the absence of expert knowledge. To faith-
fully describe model uncertainty, we employed Gaussian processes. To represent
uncertainty in predicted states and/or actions, we used multivariate Gaussian dis-
tributions. These fairly simple representations of unknown quantities (the GP for
a latent function and the Gaussian distribution for a latent variable) were sufficient
to extract enough valuable information from small-sized data sets to learn fairly
complicated nonlinear control tasks in computer simulation and on a mechanical
system in only a few trials. For instance, pilco learned to swing up and balance
a double pendulum attached to a cart or to balance a unicycle. In the case of the
cart-double pendulum problem, data of about two minutes were sufficient to learn
both the dynamics of the system and a single nonlinear controller that solved the
problem. Learning to balance the unicycle required about half a minute of data.
Pilco found these solutions automatically and without an intricate understanding
of the corresponding tasks. Nevertheless, across all considered control tasks, we
reported an unprecedented learning efficiency and an unprecedented degree of automation.
In some cases, pilco reduced the required number of interactions with the real
system by orders of magnitude compared to state-of-the-art RL algorithms.
A Mathematical Tools
A.1 Integration
This section gives exact integral equations for trigonometric functions, which are
required to implement the discussed algorithms. The following expressions can
be found in the book by Gradshteyn and Ryzhik (2000), where x ∼ N (µ, σ 2 ) is
Gaussian distributed with mean µ and variance σ 2 .
$\mathrm{E}_x[\sin(x)] = \int \sin(x)\,p(x)\,\mathrm{d}x = \exp\big(-\tfrac{\sigma^2}{2}\big)\sin(\mu)\,,$   (A.1)
$\mathrm{E}_x[\cos(x)] = \int \cos(x)\,p(x)\,\mathrm{d}x = \exp\big(-\tfrac{\sigma^2}{2}\big)\cos(\mu)\,,$   (A.2)
$\mathrm{E}_x[\sin(x)^2] = \int \sin(x)^2\,p(x)\,\mathrm{d}x = \tfrac{1}{2}\big(1 - \exp(-2\sigma^2)\cos(2\mu)\big)\,,$   (A.3)
$\mathrm{E}_x[\cos(x)^2] = \int \cos(x)^2\,p(x)\,\mathrm{d}x = \tfrac{1}{2}\big(1 + \exp(-2\sigma^2)\cos(2\mu)\big)\,,$   (A.4)
$\mathrm{E}_x[\sin(x)\cos(x)] = \int \sin(x)\cos(x)\,p(x)\,\mathrm{d}x = \tfrac{1}{2}\int\sin(2x)\,p(x)\,\mathrm{d}x$   (A.5)
$\qquad = \tfrac{1}{2}\exp(-2\sigma^2)\sin(2\mu)\,.$   (A.6)
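These identities are easy to verify numerically; a minimal MATLAB check with arbitrarily chosen values of µ and σ²:

mu = 0.7; s2 = 1.3;
p  = @(x) exp(-(x - mu).^2/(2*s2))/sqrt(2*pi*s2);     % Gaussian density N(mu, s2)
lhs = [integral(@(x) sin(x).*p(x), -inf, inf), ...
       integral(@(x) cos(x).*p(x), -inf, inf), ...
       integral(@(x) sin(x).^2.*p(x), -inf, inf), ...
       integral(@(x) cos(x).^2.*p(x), -inf, inf)];
rhs = [exp(-s2/2)*sin(mu), exp(-s2/2)*cos(mu), ...
       (1 - exp(-2*s2)*cos(2*mu))/2, (1 + exp(-2*s2)*cos(2*mu))/2];
disp(max(abs(lhs - rhs)))                             % close to zero, eqs. (A.1)-(A.4)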
Gradshteyn and Ryzhik (2000) also provide a more general solution to an integral involving squared exponentials, polynomials, and trigonometric functions,

$\int x^n \exp\big(-(ax^2 + bx + c)\big)\sin(px + q)\,\mathrm{d}x$   (A.7)
$\quad = -\Big(\frac{-1}{2a}\Big)^n\sqrt{\frac{\pi}{a}}\exp\Big(\frac{b^2 - p^2}{4a} - c\Big)$   (A.8)
$\quad\times \sum_{k=0}^{\lfloor n/2\rfloor}\frac{n!}{(n-2k)!\,k!}\,a^k\sum_{j=0}^{n-2k}\binom{n-2k}{j}\,b^{n-2k-j}p^j\sin\Big(\frac{pb}{2a} - q + \frac{\pi}{2}j\Big)\,, \quad a > 0\,,$   (A.9)

$\int x^n \exp\big(-(ax^2 + bx + c)\big)\cos(px + q)\,\mathrm{d}x$   (A.10)
$\quad = \Big(\frac{-1}{2a}\Big)^n\sqrt{\frac{\pi}{a}}\exp\Big(\frac{b^2 - p^2}{4a} - c\Big)$   (A.11)
$\quad\times \sum_{k=0}^{\lfloor n/2\rfloor}\frac{n!}{(n-2k)!\,k!}\,a^k\sum_{j=0}^{n-2k}\binom{n-2k}{j}\,b^{n-2k-j}p^j\cos\Big(\frac{pb}{2a} - q + \frac{\pi}{2}j\Big)\,, \quad a > 0\,.$   (A.12)
The Searle identity in equation (A.28) is useful if the individual inverses of A and B
do not exist or if they are ill conditioned. The Woodbury identity in equation (A.29)
can be used to reduce the computational burden: If Z ∈ Rp×p is diagonal, the
inverse Z^{-1} can be computed in O(p). Consider the case where U ∈ R^{p×q}, W ∈ R^{q×q}, and V^⊤ ∈ R^{q×p} with p ≫ q. The inverse (Z + UWV^⊤)^{-1} ∈ R^{p×p} would
require O(p3 ) computations (naively implemented). Using equation (A.29), the
computational burden reduces to O(p) for the inverse of the diagonal matrix Z plus
O(q 3 ) for the inverse of W and the inverse of W−1 + V> Z−1 U ∈ Rq×q . Therefore,
the inversion of a p × p matrix can be reduced to the inversion of q × q matrices, the
inversion of a diagonal p × p matrix, and some matrix multiplications, all of which
require less than O(p3 ) computations. The Kailath inverse in equation (A.30) is a
special case of the Woodbury identity in equation (A.29) with W = I. The Kailath
inverse makes the inversion of A + BC numerically a bit more stable if A + BC is
ill-conditioned and A−1 exists.
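A minimal MATLAB sketch of this use of the Woodbury identity in equation (A.29), with a diagonal Z and p ≫ q (all numerical values are arbitrary test data):

p = 1000; q = 20;
z = rand(p,1) + 1;  U = randn(p,q);  V = randn(p,q);
W = randn(q); W = W*W' + eye(q);                       % a valid q-by-q W
Zi = diag(1./z);                                       % O(p) inverse of diagonal Z
inv_wood = Zi - Zi*U*((inv(W) + V'*Zi*U)\(V'*Zi));     % only a q-by-q system to solve
inv_ref  = inv(diag(z) + U*W*V');                      % naive O(p^3) inverse
disp(norm(inv_wood - inv_ref)/norm(inv_ref))           % small relative error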
The unscented Kalman filter (UKF) and the unscented Kalman RTS smoother (URTSS), see the work by Särkkä (2008) and Särkkä and Hartikainen (2010), provide solutions to the approximate inference problem of computing p(x_t | Z) for t = 0, . . . , T. As in the EKF case, the true predictive moments are not exactly matched since the sample approximation by the sigma points is finite. The UT provides higher accuracy than linearization by first-order Taylor series expansion (used by the EKF), as shown by Julier and Uhlmann (1997). Moreover, it does not constrain the mapping function itself by imposing differentiability. In the most general case, it simply requires access to a "black-box" system and an estimate of the noise covariance matrices. For more detailed information, we refer to the work by Julier and Uhlmann (1997, 2004), Lefebvre et al. (2005), Wan and van der Merwe (2000, 2001), van der Merwe et al. (2000), and Thrun et al. (2005).
C Equations of Motion
C.1 Pendulum
The pendulum shown in Figure C.1 possesses a mass m and a length l. The pen-
dulum angle ϕ is measured anti-clockwise from hanging down. A torque u can be
applied to the pendulum. Typical values are: m = 1 kg and l = 1 m.
The coordinates x and y of the midpoint of the pendulum are

$x = \tfrac{1}{2}l\sin\varphi\,,$   (C.1)
$y = -\tfrac{1}{2}l\cos\varphi\,,$   (C.2)

and the squared velocity of the midpoint of the pendulum is

$v^2 = \dot x^2 + \dot y^2 = \tfrac{1}{4}l^2\dot\varphi^2\,.$   (C.3)
We derive the equations of motion via the system Lagrangian L, which is the difference between the kinetic energy T and the potential energy V and is given by

$L = T - V = \tfrac{1}{2}mv^2 + \tfrac{1}{2}I\dot\varphi^2 + \tfrac{1}{2}mlg\cos\varphi\,,$   (C.4)

where g = 9.82 m/s^2 is the acceleration of gravity and I = (1/12) m l^2 is the moment of inertia of a thin pendulum around the pendulum midpoint.
The equations of motion can generally be derived from a set of equations defined through

$\frac{\mathrm{d}}{\mathrm{d}t}\frac{\partial L}{\partial\dot q_i} - \frac{\partial L}{\partial q_i} = Q_i\,,$   (C.5)

where Q_i are the non-conservative forces and q_i and q̇_i are the state variables of the system. In our case,

$\frac{\partial L}{\partial\dot\varphi} = \tfrac{1}{4}ml^2\dot\varphi + I\dot\varphi\,,$   (C.6)
$\frac{\partial L}{\partial\varphi} = -\tfrac{1}{2}mlg\sin\varphi$   (C.7)
yield

$\ddot\varphi\big(\tfrac{1}{4}ml^2 + I\big) + \tfrac{1}{2}mlg\sin\varphi = u - b\dot\varphi\,,$   (C.8)

where b is a friction coefficient. Collecting both variables in z = [ϕ̇, ϕ]^⊤, the equations of motion can be conveniently expressed as two coupled ordinary differential equations

$\frac{\mathrm{d}z}{\mathrm{d}t} = \begin{bmatrix}\dfrac{u - bz_1 - \tfrac{1}{2}mlg\sin z_2}{\tfrac{1}{4}ml^2 + I}\\[2mm] z_1\end{bmatrix}\,,$   (C.9)

which can be simulated numerically.
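A minimal MATLAB sketch simulating equation (C.9) with ode45; the constant torque u and the friction coefficient b are assumed values used only for illustration:

m = 1; l = 1; g = 9.82; I = m*l^2/12;    % typical values from the text
b = 0.1; u = 3;                          % assumed friction and constant torque
dzdt = @(t,z) [(u - b*z(1) - 0.5*m*l*g*sin(z(2)))/(0.25*m*l^2 + I); z(1)];
[t, z] = ode45(dzdt, [0 5], [0; 0]);     % start at rest, hanging down
plot(t, z(:,2)), xlabel('time in s'), ylabel('pendulum angle in rad')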
C.2 Cart Pole

For the cart-pole system, where m_1 is the mass of the cart, m_2 the mass of the pendulum, and l the pendulum length, the partial derivatives of the Lagrangian are

$\frac{\partial L}{\partial\dot x_1} = (m_1 + m_2)\dot x_1 + \tfrac{1}{2}m_2 l\dot\theta_2\cos\theta_2\,,$   (C.17)
$\frac{\partial L}{\partial x_1} = 0\,,$   (C.18)
$\frac{\partial L}{\partial\dot\theta_2} = \tfrac{1}{3}m_2 l^2\dot\theta_2 + \tfrac{1}{2}m_2 l\dot x_1\cos\theta_2\,,$   (C.19)
$\frac{\partial L}{\partial\theta_2} = -\tfrac{1}{2}m_2 l(\dot x_1\dot\theta_2 + g)\sin\theta_2\,.$   (C.20)
Collecting the four variables in z = [x_1, ẋ_1, θ̇_2, θ_2]^⊤, the equations of motion can be conveniently expressed as four coupled ordinary differential equations

$\frac{\mathrm{d}z}{\mathrm{d}t} = \begin{bmatrix} z_2\\[1mm] \dfrac{2m_2 l z_3^2\sin z_4 + 3m_2 g\sin z_4\cos z_4 + 4u - 4bz_2}{4(m_1 + m_2) - 3m_2\cos^2 z_4}\\[3mm] \dfrac{-3m_2 l z_3^2\sin z_4\cos z_4 - 6(m_1 + m_2)g\sin z_4 - 6(u - bz_2)\cos z_4}{4l(m_1 + m_2) - 3m_2 l\cos^2 z_4}\\[3mm] z_3 \end{bmatrix}\,.$   (C.23)
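Analogously, a minimal MATLAB sketch simulating equation (C.23); the parameter values and the constant force u are assumptions used only for illustration:

m1 = 0.5; m2 = 0.5; l = 0.6; b = 0.1; g = 9.82; u = 1;   % assumed values
f = @(t,z) [ z(2);
             (2*m2*l*z(3)^2*sin(z(4)) + 3*m2*g*sin(z(4))*cos(z(4)) ...
              + 4*u - 4*b*z(2))/(4*(m1+m2) - 3*m2*cos(z(4))^2);
             (-3*m2*l*z(3)^2*sin(z(4))*cos(z(4)) - 6*(m1+m2)*g*sin(z(4)) ...
              - 6*(u - b*z(2))*cos(z(4)))/(4*l*(m1+m2) - 3*m2*l*cos(z(4))^2);
             z(3) ];
[t, z] = ode45(f, [0 4], zeros(4,1));                    % cart and pendulum at rest
plot(t, z(:,1)), xlabel('time in s'), ylabel('cart position in m')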
C.3 Pendubot
The Pendubot in Figure C.3 is a two-link, underactuated robot (masses m_2 and m_3 and lengths l_2 and l_3, respectively), as described by Spong and Block (1995). A torque can be applied to the first joint, but not to the second joint. The system has four continuous state variables: two joint positions and two joint velocities. The angles of the joints, θ_2 and θ_3, are measured anti-clockwise from upright. An applied external torque u controls the first joint. Typical values are: m_2 = 0.5 kg, m_3 = 0.5 kg, l_2 = 0.6 m, l_3 = 0.6 m.
C.4 Cart-Double Pendulum
The cart-double pendulum dynamic system (see Figure C.4) consists of a cart with mass m_1 and an attached double pendulum with masses m_2 and m_3 and lengths l_2 and l_3 for the two links, respectively. The double pendulum swings freely in the plane. The angles of the pendulum, θ_2 and θ_3, are measured anti-clockwise from upright. The cart can move horizontally, with an applied external force u and a coefficient of friction b. Typical values are: m_1 = 0.5 kg, m_2 = 0.5 kg, m_3 = 0.5 kg, l_2 = 0.6 m, l_3 = 0.6 m, and b = 0.1 Ns/m.
The system Lagrangian is

$L = \tfrac{1}{2}(m_1 + m_2 + m_3)\dot x_1^2 - \big(\tfrac{1}{2}m_2 + m_3\big)l_2\dot x_1\dot\theta_2\cos\theta_2 - \tfrac{1}{2}m_3 l_3\dot x_1\dot\theta_3\cos\theta_3 + \tfrac{1}{8}m_2 l_2^2\dot\theta_2^2 + \tfrac{1}{2}I_2\dot\theta_2^2 + \tfrac{1}{2}m_3\big(l_2^2\dot\theta_2^2 + \tfrac{1}{4}l_3^2\dot\theta_3^2 + l_2 l_3\dot\theta_2\dot\theta_3\cos(\theta_2 - \theta_3)\big) + \tfrac{1}{2}I_3\dot\theta_3^2 - \tfrac{1}{2}m_2 g l_2\cos\theta_2 - m_3 g\big(l_2\cos\theta_2 + \tfrac{1}{2}l_3\cos\theta_3\big)\,.$   (C.45)

The angular moment of inertia I_j, j = 2, 3, around the pendulum midpoint is I_j = (1/12) m_j l_j^2, and g = 9.82 m/s^2 is the acceleration of gravity. This moment of inertia implies the assumption that the pendulums are infinitely thin (but rigid) wires.
The equations of motion are

$\frac{\mathrm{d}}{\mathrm{d}t}\frac{\partial L}{\partial\dot q_i} - \frac{\partial L}{\partial q_i} = Q_i\,,$   (C.46)

where Q_i are the non-conservative forces. We obtain the partial derivatives
$\frac{\partial L}{\partial\dot x_1} = (m_1 + m_2 + m_3)\dot x_1 - \big(\tfrac{1}{2}m_2 + m_3\big)l_2\dot\theta_2\cos\theta_2 - \tfrac{1}{2}m_3 l_3\dot\theta_3\cos\theta_3\,,$   (C.47)
$\frac{\partial L}{\partial x_1} = 0\,,$   (C.48)
$\frac{\partial L}{\partial\dot\theta_2} = \big(m_3 l_2^2 + \tfrac{1}{4}m_2 l_2^2 + I_2\big)\dot\theta_2 - \big(\tfrac{1}{2}m_2 + m_3\big)l_2\dot x_1\cos\theta_2 + \tfrac{1}{2}m_3 l_2 l_3\dot\theta_3\cos(\theta_2 - \theta_3)\,,$   (C.49)
$\frac{\partial L}{\partial\theta_2} = \big(\tfrac{1}{2}m_2 + m_3\big)l_2(\dot x_1\dot\theta_2 + g)\sin\theta_2 - \tfrac{1}{2}m_3 l_2 l_3\dot\theta_2\dot\theta_3\sin(\theta_2 - \theta_3)\,,$   (C.50)
$\frac{\partial L}{\partial\dot\theta_3} = m_3 l_3\big(-\tfrac{1}{2}\dot x_1\cos\theta_3 + \tfrac{1}{2}l_2\dot\theta_2\cos(\theta_2 - \theta_3) + \tfrac{1}{4}l_3\dot\theta_3\big) + I_3\dot\theta_3\,,$   (C.51)
$\frac{\partial L}{\partial\theta_3} = \tfrac{1}{2}m_3 l_3\big((\dot x_1\dot\theta_3 + g)\sin\theta_3 + l_2\dot\theta_2\dot\theta_3\sin(\theta_2 - \theta_3)\big)\,.$   (C.52)
Substituting these partial derivatives into (C.46) yields three equations that are linear in (ẍ1, θ̈2, θ̈3) and can be rewritten as the linear equation system
\[
\begin{bmatrix}
m_1 + m_2 + m_3 & -\tfrac{1}{2} (m_2 + 2 m_3) l_2 \cos\theta_2 & -\tfrac{1}{2} m_3 l_3 \cos\theta_3 \\[1mm]
-\big(\tfrac{1}{2} m_2 + m_3\big) l_2 \cos\theta_2 & m_3 l_2^2 + I_2 + \tfrac{1}{4} m_2 l_2^2 & \tfrac{1}{2} m_3 l_2 l_3 \cos(\theta_2 - \theta_3) \\[1mm]
-\tfrac{1}{2} m_3 l_3 \cos\theta_3 & \tfrac{1}{2} m_3 l_2 l_3 \cos(\theta_2 - \theta_3) & \tfrac{1}{4} m_3 l_3^2 + I_3
\end{bmatrix}
\begin{bmatrix}
\ddot{x}_1 \\ \ddot{\theta}_2 \\ \ddot{\theta}_3
\end{bmatrix}
=
\begin{bmatrix}
c_1 \\ c_2 \\ c_3
\end{bmatrix} , \tag{C.56}
\]
where
\[
\begin{bmatrix}
c_1 \\ c_2 \\ c_3
\end{bmatrix}
=
\begin{bmatrix}
u - b \dot{x}_1 - \tfrac{1}{2} (m_2 + 2 m_3) l_2 \dot{\theta}_2^2 \sin\theta_2 - \tfrac{1}{2} m_3 l_3 \dot{\theta}_3^2 \sin\theta_3 \\[1mm]
\big(\tfrac{1}{2} m_2 + m_3\big) l_2 g \sin\theta_2 - \tfrac{1}{2} m_3 l_2 l_3 \dot{\theta}_3^2 \sin(\theta_2 - \theta_3) \\[1mm]
\tfrac{1}{2} m_3 l_3 \big( g \sin\theta_3 + l_2 \dot{\theta}_2^2 \sin(\theta_2 - \theta_3) \big)
\end{bmatrix} . \tag{C.57}
\]
This linear equation system can be solved for ẍ1 , θ̈2 , θ̈3 and used for numerical
simulation.
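As an illustration, the following MATLAB sketch (the function name, parameter struct p, and state ordering are assumptions made for this example, not the thesis implementation) assembles the matrix in (C.56) and the right-hand side (C.57) and solves for the accelerations with the backslash operator:

function dzdt = cart_double_pend_dynamics(t, z, u, p)
% Sketch of the cart-double pendulum dynamics based on (C.56)-(C.57).
% Assumed state ordering: z = [x1; dx1; th2; dth2; th3; dth3].
dx1 = z(2); th2 = z(3); dth2 = z(4); th3 = z(5); dth3 = z(6);
m1 = p.m1; m2 = p.m2; m3 = p.m3; l2 = p.l2; l3 = p.l3; b = p.b; g = 9.82;
I2 = m2*l2^2/12; I3 = m3*l3^2/12;                 % moments of inertia around the midpoints
A = [m1+m2+m3,                 -0.5*(m2+2*m3)*l2*cos(th2),  -0.5*m3*l3*cos(th3);
     -(0.5*m2+m3)*l2*cos(th2),  m3*l2^2+I2+0.25*m2*l2^2,     0.5*m3*l2*l3*cos(th2-th3);
     -0.5*m3*l3*cos(th3),       0.5*m3*l2*l3*cos(th2-th3),   0.25*m3*l3^2+I3];
c = [u - b*dx1 - 0.5*(m2+2*m3)*l2*dth2^2*sin(th2) - 0.5*m3*l3*dth3^2*sin(th3);
     (0.5*m2+m3)*l2*g*sin(th2) - 0.5*m3*l2*l3*dth3^2*sin(th2-th3);
     0.5*m3*l3*(g*sin(th3) + l2*dth2^2*sin(th2-th3))];
acc = A\c;                                        % solve (C.56) for [ddx1; ddth2; ddth3]
dzdt = [dx1; acc(1); dth2; acc(2); dth3; acc(3)];
end

Wrapped like this, the dynamics can be passed to ode45 in the same way as the pendulum sketch above.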
D Parameter Settings
D.2 Pendubot
E Implementation
E.1 Gaussian Process Predictions at Uncertain Inputs
% memory allocation
M = zeros(E,1); V = zeros(D,E); S = zeros(E);
log_k = zeros(n,E); beta = zeros(n,E);

%%
% steps
% 1) compute predicted mean and covariance between input and prediction
% 2) predictive covariance matrix
%    2a) non-central moments
%    2b) central moments

% 1) predicted mean and input-prediction covariance
for i=1:E
  % computes M(i), V(:,i), log_k(:,i), and beta(:,i); body not reproduced in this excerpt
end

% 2) predictive covariance matrix (symmetric)
% 2a) non-central moments
for i=1:E                               % for all target dimensions
  for j=1:i
    % efficient implementation of eq. (2.53); n-by-n
    Q = t*exp(bsxfun(@plus,log_k(:,i),log_k(:,j)') + maha(Zeta_i,-Zeta_j,R\s/2));
    A = beta(:,i)*beta(:,j)';           % n-by-n
    A = A.*Q;                           % n-by-n
    S(i,j) = sum(sum(A));               % if i==j: 1st term in eq. (2.41), else eq. (2.50)
    S(j,i) = S(i,j);                    % copy entries
  end
end
% 2b) centralize moments
S = S - M*M';                           % centralize moments, eq. (2.41), (2.45); E-by-E
% ------------------------------------------------------------------------------
function K = maha(a, b, Q)
%
% Squared Mahalanobis distance (a-b)*Q*(a-b)'; vectors are row vectors
% a, b   matrices containing n length-d row vectors, n by d
% Q      weight matrix, d by d, default eye(d)
% K      squared distances, n by n
if nargin == 2                          % assume identity Q
  K = bsxfun(@plus, sum(a.*a,2), sum(b.*b,2)') - 2*a*b';
else
  aQ = a*Q;
  K = bsxfun(@plus, sum(aQ.*a,2), sum(b*Q.*b,2)') - 2*aQ*b';
end
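As a usage sketch (the matrices below are made up solely for illustration), maha computes all pairwise squared distances between the rows of its two arguments:

a = randn(4,3); b = randn(4,3);         % two sets of 4 row vectors in R^3
K1 = maha(a, b);                        % squared Euclidean distances, 4-by-4
Q  = diag([1 2 0.5]);                   % example weight matrix
K2 = maha(a, b, Q);                     % squared Mahalanobis distances, 4-by-4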
% ------------------------------------------------------------------------------
% References
%
% Marc Peter Deisenroth:
% Efficient Reinforcement Learning using Gaussian Processes
% PhD Thesis, Karlsruhe Institute of Technology
Lists of Figures, Tables, Algorithms, and Examples

List of Figures
3.21 Rollouts for the cart position and the angle of the pendulum when
applying zero-order-hold control and first-order-hold control. . . . . 72
3.22 Cost distributions using the position-independent controller. . . . . 73
3.23 Predictions of the position of the cart and the angle of the pendulum
when the position of the cart was far away from the target. . . . . . . 74
3.24 Hardware setup of the cart-pole system. . . . . . . . . . . . . . . . . . 75
3.25 Inverted pendulum in hardware. . . . . . . . . . . . . . . . . . . . . . 78
3.26 Distance histogram and quantiles of the immediate cost. . . . . . . . 80
3.27 Learning efficiency for the cart-pole task in the absence of expert
knowledge. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.28 Pendubot system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.29 Cost distributions for the Pendubot task (zero-order-hold control). . 87
3.30 Predicted cost and incurred immediate cost during learning the Pen-
dubot task. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.31 Illustration of the learned Pendubot task. . . . . . . . . . . . . . . . . 89
3.32 Model assumption for multivariate control. . . . . . . . . . . . . . . . 90
3.33 Example trajectories of the two angles for the two-link arm with two
actuators when applying the learned controller. . . . . . . . . . . . . 91
3.34 Cart with attached double pendulum. . . . . . . . . . . . . . . . . . . 92
3.35 Cost distribution for the cart-double pendulum problem. . . . . . . . 94
3.36 Example trajectories of the cart position and the two angles of the
pendulums for the cart-double pendulum when applying the learned
controller. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.37 Sketches of the learned cart-double pendulum task. . . . . . . . . . . 96
3.38 Unicycle system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.39 Photograph of the robotic unicycle. . . . . . . . . . . . . . . . . . . . . 99
3.40 Histogram of the distances from the top of the unicycle to the fully
upright position. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.41 Illustration of problems with the FITC sparse GP approximation. . . 103
3.42 True POMDP, simplified stochastic MDP, and its implication to the
true POMDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
List of Tables
2.1 Predictions with Gaussian processes—overview. . . . . . . . . . . . . 23
List of Algorithms
1 Pilco . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2 Detailed implementation of pilco . . . . . . . . . . . . . . . . . . . . 60
3 Evaluation setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 Policy evaluation with Pegasus . . . . . . . . . . . . . . . . . . . . . . 82
5 Sparse swap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6 Forward-backward (RTS) smoothing . . . . . . . . . . . . . . . . . . . 120
7 Forward and backward sweeps with the GP-ADF and the GP-RTSS 141
8 Experimental setup for dynamic system (4.126)–(4.127) . . . . . . . . 146
9 Experimental setup (pendulum tracking), equations (4.129)–(4.131) . 152
Bibliography
Abbeel, P., Coates, A., Quigley, M., and Ng, A. Y. (2007). An Application of Rein-
forcement Learning to Aerobatic Helicopter Flight. In Schölkopf, B., Platt, J. C.,
and Hoffman, T., editors, Advances in Neural Information Processing Systems 19,
volume 19, page 2007. The MIT Press, Cambridge, MA, USA. Cited on p. 2.
Abbeel, P. and Ng, A. Y. (2005). Exploration and Apprenticeship Learning in Rein-
forcement Learning. In Proceedings of the 22nd International Conference on Machine
Learning, pages 1–8, Bonn, Germany. Cited on p. 114.
Abbeel, P., Quigley, M., and Ng, A. Y. (2006). Using Inaccurate Models in Rein-
forcement Learning. In Proceedings of the 23rd International Conference on Machine
Learning, pages 1–8, Pittsburgh, PA, USA. Cited on pp. 31, 34, 113, and 114.
Alamir, M. and Murilo, A. (2008). Swing-up and Stabilization of a Twin-Pendulum
under State and Control Constraints by a Fast NMPC Scheme. Automatica,
44(5):1319–1324. Cited on pp. 92 and 93.
Alspach, D. L. and Sorenson, H. W. (1972). Nonlinear Bayesian Estimation using
Gaussian Sum Approximations. IEEE Transactions on Automatic Control, 17(4):439–
448. Cited on pp. 130, 157, 169, and 170.
Anderson, B. D. O. and Moore, J. B. (2005). Optimal Filtering. Dover Publications,
Mineola, NY, USA. Cited on pp. 46, 117, 120, 121, 124, 157, and 169.
Arasaratnam, I. and Haykin, S. (2009). Cubature Kalman Filters. IEEE Transac-
tions on Automatic Control, 54(6):1254–1269. Cited on pp. 117, 121, 130, 142, 169,
and 170.
Asmuth, J., Li, L., Littman, M. L., Nouri, A., and Wingate, D. (2009). A Bayesian
Sampling Approach to Exploration in Reinforcement Learning. In Proceedings of
the 25th Conference on Uncertainty in Artificial Intelligence. Cited on p. 115.
Åström, K. J. (2006). Introduction to Stochastic Control Theory. Dover Publications,
Inc., New York, NY, USA. Cited on pp. 8, 46, and 117.
Atkeson, C., Moore, A., and Schaal, S. (1997a). Locally Weighted Learning for
Control. Artificial Intelligence Review, 11:75–113. Cited on pp. 30 and 31.
Atkeson, C. G., Moore, A. G., and Schaal, S. (1997b). Locally Weighted Learning.
AI Review, 11:11–73. Cited on p. 30.
Atkeson, C. G. and Santamaría, J. C. (1997). A Comparison of Direct and Model-
Based Reinforcement Learning. In Proceedings of the International Conference on
Robotics and Automation. Cited on pp. 3, 31, 34, 41, 113, and 118.
Aubin, J.-P. (2000). Applied Functional Analysis. Pure and Applied Mathematics.
John Wiley & Sons, Inc., Scientific, Technical, and Medical Division, 605 Third
Avenue, New York City, NY, USA, 2nd edition. Cited on p. 132.
Barto, A. G., Sutton, R. S., and Anderson, C. W. (1983). Neuronlike Elements that
Can Solve Difficult Learning Control Problems. IEEE Transactions on Systems,
Man, and Cybernetics, 13(5):835–846. Cited on p. 113.
Baxter, J., Bartlett, P. L., and Weaver, L. (2001). Experiments with Infinite-Horizon,
Policy-Gradient Estimation. Journal of Artificial Intelligence Research, 15:351–381.
Cited on p. 114.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum Learn-
ing. In Bottou, L. and Littman, M., editors, Proceedings of the 26th International
Conference on Machine Learning, pages 41–48, Montreal, QC, Canada. Omnipress.
Cited on p. 112.
Daw, N. D., Niv, Y., and Dayan, P. (2005). Uncertainty-based Competition be-
tween Prefrontal and Dorsolateral Striatal Systems for Behavioral Control. Nature
Neuroscience, 8(12):1704–1711. Cited on p. 34.
del Moral, P. (1996). Non Linear Filtering: Interacting Particle Solution. Markov
Processes and Related Fields, 2(4):555–580. Cited on p. 162.
Doucet, A., Godsill, S. J., and Andrieu, C. (2000). On Sequential Monte Carlo
Sampling Methods for Bayesian Filtering. Statistics and Computing, 10:197–208.
Cited on pp. 117, 145, and 162.
Engel, Y., Mannor, S., and Meir, R. (2003). Bayes Meets Bellman: The Gaussian
Process Approach to Temporal Difference Learning. In Proceedings of the 20th
International Conference on Machine Learning (ICML-2003), volume 20, pages 154–
161, Washington, DC, USA. Cited on p. 114.
Engel, Y., Mannor, S., and Meir, R. (2005). Reinforcement Learning with Gaussian
Processes. In Proceedings of the 22nd International Conference on Machine Learning
(ICML-2005), volume 22, pages 201–208, Bonn, Germany. Cited on p. 114.
Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-Based Batch Mode Rein-
forcement Learning. Journal of Machine Learning Research, 6:503–556. Cited
on p. 113.
Ernst, D., Stan, G., Goncalves, J., and Wehenkel, L. (2006). Clinical Data based
Optimal STI Strategies for HIV: A Reinforcement Learning Approach. In 45th
IEEE Conference on Decision and Control, pages 13–15, San Diego, CA, USA. Cited
on p. 2.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data
Analysis. Chapman & Hall/CRC, 2nd edition. Cited on pp. 9 and 143.
Girard, A., Rasmussen, C. E., and Murray-Smith, R. (2002). Gaussian Process Priors
with Uncertain Inputs: Multiple-Step Ahead Prediction. Technical Report TR-
2002-119, University of Glasgow. Cited on pp. 27 and 28.
Girard, A., Rasmussen, C. E., Quiñonero Candela, J., and Murray-Smith, R. (2003).
Gaussian Process Priors with Uncertain Inputs—Application to Multiple-Step
Ahead Time Series Forecasting. In Becker, S., Thrun, S., and Obermayer, K.,
editors, Advances in Neural Information Processing Systems 15, pages 529–536. The
MIT Press, Cambridge, MA, USA. Cited on pp. 27 and 28.
Godsill, S. J., Doucet, A., and West, M. (2004). Monte Carlo Smoothing for Non-
linear Time Series. Journal of the American Statistical Association, 99(465):438–449.
Cited on pp. 117 and 158.
Graichen, K., Treuer, M., and Zeitz, M. (2007). Swing-up of the Double Pendulum
on a Cart by Feedforward and Feedback Control with Experimental Validation.
Automatica, 43(1):63–71. Cited on p. 93.
Grancharova, A., Kocijan, J., and Johansen, T. A. (2007). Explicit Stochastic Non-
linear Predictive Control Based on Gaussian Process Models. In Proceedings of the
9th European Control Conference 2007 (ECC 2007), pages 2340–2347, Kos, Greece.
Cited on p. 115.
Grancharova, A., Kocijan, J., and Johansen, T. A. (2008). Explicit Stochastic
Predictive Control of Combustion Plants based on Gaussian Process Models.
Automatica, 44(6):1621–1631. Cited on pp. 30, 110, and 115.
Hjort, N. L., Holmes, C., Müller, P., and Walker, S. G., editors (2010). Bayesian
Nonparametrics. Cambridge University Press. Cited on p. 8.
Huang, C. and Fu, L. (2003). Passivity Based Control of the Double Inverted Pendu-
lum Driven by a Linear Induction Motor. In Proceedings of the 2003 IEEE Conference
on Control Applications, volume 2, pages 797–802. Cited on p. 92.
Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). Convergence of Stochastic It-
erative Dynamic Programming Algorithms. Neural Computation, 6:1185–1201.
Cited on p. 2.
Julier, S. J. and Uhlmann, J. K. (1996). A General Method for Approximating Non-
linear Transformations of Probability Distributions. Technical report, Robotics
Research Group, Department of Engineering Science, University of Oxford,
Oxford, UK. Cited on p. 169.
Julier, S. J. and Uhlmann, J. K. (1997). A New Extension of the Kalman Filter to
Nonlinear Systems. In Proceedings of AeroSense: 11th Symposium on Aerospace/De-
fense Sensing, Simulation and Controls, pages 182–193, Orlando, FL, USA. Cited
on pp. 117, 130, 169, and 170.
Julier, S. J. and Uhlmann, J. K. (2004). Unscented Filtering and Nonlinear Estima-
tion. Proceedings of the IEEE, 92(3):401–422. Cited on pp. 121, 130, 142, 159, 169,
and 170.
Julier, S. J., Uhlmann, J. K., and Durrant-Whyte, H. F. (1995). A New Method for the
Nonlinear Transformation of Means and Covariances in Filters and Estimators.
In Proceedings of the American Control Conference, pages 1628–1632, Seattle, WA,
USA. Cited on p. 169.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement Learning:
A Survey. Journal of Artificial Intelligence Research, 4:237–285. Cited on pp. 1
and 113.
Kakade, S. M. (2002). A Natural Policy Gradient. In Dietterich, T. G., Becker, S., and
Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14,
pages 1531–1538. The MIT Press, Cambridge, MA, USA. Cited on p. 114.
Kalman, R. E. (1960). A New Approach to Linear Filtering and Prediction Prob-
lems. Transactions of the ASME — Journal of Basic Engineering, 82(Series D):35–45.
Cited on pp. 117, 119, 124, and 130.
Kitagawa, G. (1996). Monte Carlo Filter and Smoother for Non-Gaussian Nonlinear
State Space Models. Journal of Computational and Graphical Statistics, 5(1):1–25.
Cited on p. 145.
Ko, J. and Fox, D. (2008). GP-BayesFilters: Bayesian Filtering using Gaussian Pro-
cess Prediction and Observation Models. In Proceedings of the 2008 IEEE/RSJ In-
ternational Conference on Intelligent Robots and Systems (IROS), pages 3471–3476,
Nice, France. Cited on pp. 112, 132, and 133.
Ko, J. and Fox, D. (2009a). GP-BayesFilters: Bayesian Filtering using Gaussian Pro-
cess Prediction and Observation Models. Autonomous Robots, 27(1):75–90. Cited
on pp. 112, 132, 133, 134, 142, and 156.
Ko, J. and Fox, D. (2009b). Learning GP-BayesFilters via Gaussian Process Latent
Variable Models. In Proceedings of Robotics: Science and Systems, Seattle, USA.
Cited on pp. 112 and 161.
Ko, J., Klein, D. J., Fox, D., and Haehnel, D. (2007a). Gaussian Processes and
Reinforcement Learning for Identification and Control of an Autonomous Blimp.
In Proceedings of the International Conference on Robotics and Automation (ICRA),
pages 742–747, Rome, Italy. Cited on pp. 27, 112, 115, 132, and 156.
Ko, J., Klein, D. J., Fox, D., and Haehnel, D. (2007b). GP-UKF: Unscented Kalman
Filters with Gaussian Process Prediction and Observation Models. In Proceedings
of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages
1901–1907, San Diego, CA, USA. Cited on pp. 132 and 158.
Kober, J. and Peters, J. (2009). Policy Search for Motor Primitives in Robotics. In
Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural
Information Processing Systems 21, pages 849–856. The MIT Press. Cited on pp. 30
and 114.
Kocijan, J., Murray-Smith, R., Rasmussen, C. E., and Girard, A. (2004). Gaussian
Process Model Based Predictive Control. In Proceedings of the 2004 American Con-
trol Conference (ACC 2004), pages 2214–2219, Boston, MA, USA. Cited on pp. 110
and 115.
Kocijan, J., Murray-Smith, R., Rasmussen, C. E., and Likar, B. (2003). Predictive
Control with Gaussian Process Models. In Zajc, B. and Tkalčič, M., editors,
Proceedings of IEEE Region 8 Eurocon 2003: Computer as a Tool, pages 352–356,
Piscataway, NJ, USA. Cited on pp. 30 and 115.
Kolter, J. Z., Plagemann, C., Jackson, D. T., Ng, A. Y., and Thrun, S. (2010). A
Probabilistic Approach to Mixed Open-loop and Closed-loop Control, with Ap-
plication to Extreme Autonomous Driving. In Proceedings of the IEEE International
Conference on Robotics and Automation. Cited on p. 2.
Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. (2001). Factor Graphs and the
Sum-Product Algorithm. IEEE Transactions on Information Theory, 47:498–519.
Cited on p. 119.
Kuss, M. (2006). Gaussian Process Models for Robust Regression, Classification, and
Reinforcement Learning. PhD thesis, Technische Universität Darmstadt, Germany.
Cited on pp. 18, 27, 114, and 164.
Mitrovic, D., Klanke, S., and Vijayakumar, S. (2010). From Motor Learning to Interac-
tion Learning in Robots, chapter Adaptive Optimal Feedback Control with Learned
Internal Dynamics Models, pages 65–84. Springer-Verlag. Cited on p. 115.
Murray-Smith, R., Sbarbaro, D., Rasmussen, C. E., and Girard, A. (2003). Adap-
tive, Cautious, Predictive Control with Gaussian Process Priors. In 13th IFAC
Symposium on System Identification, Rotterdam, Netherlands. Cited on pp. 30, 110,
and 115.
Naveh, Y., Bar-Yoseph, P. Z., and Halevi, Y. (1999). Nonlinear Modeling and Control
of a Unicycle. Journal of Dynamics and Control, 9(4):279–296. Cited on p. 115.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. PhD thesis, Department
of Computer Science, University of Toronto. Cited on p. 27.
Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., and
Liang, E. (2004a). Autonomous Inverted Helicopter Flight via Reinforcement
Learning. In Ang Jr., M. H. and Khatib, O., editors, International Symposium on
Experimental Robotics, volume 21 of Springer Tracts in Advanced Robotics, pages
363–372. Springer. Cited on p. 114.
Ng, A. Y. and Jordan, M. (2000). Pegasus: A Policy Search Method for Large MDPs
and POMDPs. In Proceedings of the 16th Conference on Uncertainty in Artificial
Intelligence, pages 406–415. Cited on pp. 81, 82, and 114.
Ng, A. Y., Kim, H. J., Jordan, M. I., and Sastry, S. (2004b). Autonomous Helicopter
Flight via Reinforcement Learning. In Thrun, S., Saul, L. K., and Schölkopf, B.,
editors, Advances in Neural Information Processing Systems 16, Cambridge, MA,
USA. The MIT Press. Cited on p. 114.
Nguyen-Tuong, D., Seeger, M., and Peters, J. (2009). Local Gaussian Process Re-
gression for Real Time Online Model Learning. In Koller, D., Schuurmans, D.,
Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Sys-
tems 21, pages 1193–1200. The MIT Press, Cambridge, MA, USA. Cited on pp. 114
and 115.
O’Flaherty, R., Sanfelice, R. G., and Teel, A. R. (2008). Robust Global Swing-Up
of the Pendubot Via Hybrid Control. In Proceedings of the 2008 American Control
Conference, pages 1424–1429. Cited on p. 85.
O’Hagan, A. (1978). Curve Fitting and Optimal Design for Prediction. Journal of
the Royal Statistical Society, Series B, 40(1):1–42. Cited on p. 27.
Opper, M. (1998). A Bayesian Approach to Online Learning. In Online Learning in
Neural Networks, pages 363–378. Cambridge University Press. Cited on pp. 130,
157, 169, and 170.
Orlov, Y., Aguilar, L., Acho, L., and Ortiz, A. (2008). Robust Orbital Stabilization
of Pendubot: Algorithm Synthesis, Experimental Verification, and Application
to Swing up and Balancing Control. In Modern Sliding Mode Control Theory, vol-
ume 375/2008 of Lecture Notes in Control and Information Sciences, pages 383–400.
Springer. Cited on p. 85.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann. Cited on pp. 17 and 119.
Peters, J. and Schaal, S. (2006). Policy Gradient Methods for Robotics. In Proceedings
of the 2006 IEEE/RSJ International Conference on Intelligent Robotics Systems, pages
2219–2225, Beijing, China. Cited on p. 114.
Peters, J. and Schaal, S. (2008a). Natural Actor-Critic. Neurocomputing, 71(7–
9):1180–1190. Cited on p. 114.
Peters, J. and Schaal, S. (2008b). Reinforcement Learning of Motor Skills with
Policy Gradients. Neural Networks, 21:682–697. Cited on pp. 50 and 114.
Peters, J., Vijayakumar, S., and Schaal, S. (2003). Reinforcement Learning for Hu-
manoid Robotics. In Third IEEE-RAS International Conference on Humanoid Robots,
Karlsruhe, Germany. Cited on p. 114.
Petersen, K. B. and Pedersen, M. S. (2008). The Matrix Cookbook. Version 20081110.
Cited on pp. 127 and 166.
Poupart, P. and Vlassis, N. (2008). Model-based Bayesian Reinforcement Learning
in Partially Observable Domains. In Proceedings of the International Symposium on
Artificial Intelligence and Mathematics (ISAIM), Fort Lauderdale, FL, USA. Cited
on p. 113.
Poupart, P., Vlassis, N., Hoey, J., and Regan, K. (2006). An Analytic Solution to
Discrete Bayesian Reinforcement Learning. In Proceedings of the 23rd International
Conference on Machine Learning, pages 697–704, Pittsburgh, PA, USA. ACM. Cited
on p. 31.
Quiñonero-Candela, J., Girard, A., Larsen, J., and Rasmussen, C. E. (2003a). Propa-
gation of Uncertainty in Bayesian Kernel Models—Application to Multiple-Step
Ahead Forecasting. In IEEE International Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP 2003), volume 2, pages 701–704. Cited on pp. 18, 27, 28,
and 164.
Quiñonero-Candela, J., Girard, A., and Rasmussen, C. E. (2003b). Prediction at
an Uncertain Input for Gaussian Processes and Relevance Vector Machines—
Application to Multiple-Step Ahead Time-Series Forecasting. Technical Re-
port IMM-2003-18, Technical University of Denmark, 2800 Kongens Lyngby,
Denmark. Cited on pp. 18, 27, and 28.
Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A Unifying View of Sparse
Approximate Gaussian Process Regression. Journal of Machine Learning Research,
6(2):1939–1960. Cited on pp. 25 and 28.
Rabiner, L. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition. Proceedings of the IEEE, 77(2):257–286. Cited on pp. 119 and 120.
Raiko, T. and Tornio, M. (2005). Learning Nonlinear State-Space Models for Con-
trol. In Proceedings of the International Joint Conference on Neural Networks, pages
815–820, Montreal, QC, Canada. Cited on p. 84.
Raiko, T. and Tornio, M. (2009). Variational Bayesian Learning of Nonlinear
Hidden State-Space Models for Model Predictive Control. Neurocomputing,
72(16–18):3702–3712. Cited on pp. 63 and 84.
Raina, R., Madhavan, A., and Ng, A. Y. (2009). Large-scale Deep Unsupervised
Learning using Graphics Processors. In Bottou, L. and Littman, M. L., editors,
Proceedings of the 26th International Conference on Machine Learning, Montreal, QC,
Canada. Omnipress. Cited on p. 107.
Rasmussen, C. E. (1996). Evaluation of Gaussian Processes and other Methods for Non-
linear Regression. PhD thesis, Department of Computer Science, University of
Toronto. Cited on pp. 27 and 48.
Rasmussen, C. E. and Ghahramani, Z. (2001). Occam’s Razor. In Advances in Neural
Information Processing Systems 13, pages 294–300. The MIT Press. Cited on p. 15.
Rasmussen, C. E. and Kuss, M. (2004). Gaussian Processes in Reinforcement Learn-
ing. In Thrun, S., Saul, L. K., and Schölkopf, B., editors, Advances in Neural In-
formation Processing Systems 16, pages 751–759. The MIT Press, Cambridge, MA,
USA. Cited on p. 114.
Rasmussen, C. E. and Quiñonero-Candela, J. (2005). Healing the Relevance Vector
Machine through Augmentation. In Raedt, L. D. and Wrobel, S., editors, Proceed-
ings of the 22nd International Conference on Machine Learning, pages 689–696. Cited
on p. 27.
Rauch, H. E., Tung, F., and Striebel, C. T. (1965). Maximum Likelihood Estimates
of Linear Dynamical Systems. AIAA Journal, 3:1445–1450. Cited on pp. 117, 119,
and 130.
Seeger, M., Williams, C. K. I., and Lawrence, N. D. (2003). Fast Forward Selection
to Speed up Sparse Gaussian Process Regression. In Bishop, C. M. and Frey, B. J.,
editors, Ninth International Workshop on Artificial Intelligence and Statistics. Society
for Artificial Intelligence and Statistics. Cited on pp. 26 and 28.
Silverman, B. W. (1985). Some Aspects of the Spline Smoothing Approach to Non-
Parametric Regression Curve Fitting. Journal of the Royal Statistical Society, Series
B, 47(1):1–52. Cited on pp. 26 and 28.
Simao, H. P., Day, J., George, A. P., Gifford, T., Nienow, J., and Powell, W. B.
(2009). An Approximate Dynamic Programming Algorithm for Large-Scale Fleet
Management: A Case Application. Transportation Science, 43(2):178–197. Cited
on p. 2.
Smola, A. J. and Bartlett, P. (2001). Sparse Greedy Gaussian Process Regression. In
Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information
Processing Systems 13, pages 619–625. The MIT Press, Cambridge, MA, USA.
Cited on pp. 26 and 28.
Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian Processes using Pseudo-
inputs. In Weiss, Y., Schölkopf, B., and Platt, J. C., editors, Advances in Neural
Information Processing Systems 18, pages 1257–1264. The MIT Press, Cambridge,
MA, USA. Cited on pp. 25, 26, 27, 28, 50, 102, and 103.
Snelson, E. L. (2007). Flexible and Efficient Gaussian Process Models for Machine Learn-
ing. PhD thesis, Gatsby Computational Neuroscience Unit, University College
London. Cited on pp. 25, 26, 27, 28, 102, and 103.
Spong, M. W. and Block, D. J. (1995). The Pendubot: A Mechatronic System for
Control Research and Education. In Proceedings of the Conference on Decision and
Control, pages 555–557. Cited on pp. 85 and 173.
Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer
Series in Statistics. Springer Verlag. Cited on p. 27.
Strens, M. J. A. (2000). A Bayesian Framework for Reinforcement Learning. In
Proceedings of the 17th International Conference on Machine Learning, pages 943–950.
Morgan Kaufmann Publishers Inc. Cited on p. 115.
Sutton, R. S. (1990). Integrated Architectures for Learning, Planning, and React-
ing Based on Approximate Dynamic Programming. In Proceedings of the Seventh
International Conference on Machine Learning, pages 215–224. Morgan Kaufman
Publishers. Cited on pp. 31 and 116.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. Adap-
tive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA.
Cited on pp. 34, 41, and 113.
Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Synthesis Lectures
on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers.
Cited on p. 113.
Wahba, G., Lin, X., Gao, F., Xiang, D., Klein, R., and Klein, B. (1999). The Bias-
variance Tradeoff and the Randomized GACV. In Advances in Neural Information
Processing Systems 8, pages 620–626. The MIT Press, Cambridge, MA, USA. Cited
on pp. 26 and 28.
Walder, C., Kim, K. I., and Schölkopf, B. (2008). Sparse Multiscale Gaussian Process
Regression. In Proceedings of the 25th International Conference on Machine Learning,
pages 1112–1119, Helsinki, Finland. ACM. Cited on pp. 27 and 28.
Wan, E. A. and van der Merwe, R. (2000). The Unscented Kalman Filter for Non-
linear Estimation. In Proceedings of Symposium 2000 on Adaptive Systems for Sig-
nal Processing, Communication and Control, pages 153–158, Lake Louise, Alberta,
Canada. Cited on pp. 130, 169, and 170.
Wan, E. A. and van der Merwe, R. (2001). Kalman Filtering and Neural Networks,
chapter The Unscented Kalman Filter, pages 221–280. Wiley. Cited on pp. 118,
161, and 170.
Wang, J. M., Fleet, D. J., and Hertzmann, A. (2006). Gaussian Process Dynami-
cal Models. In Weiss, Y., Schölkopf, B., and Platt, J., editors, Advances in Neu-
ral Information Processing Systems, volume 18, pages 1441–1448. The MIT Press,
Cambridge, MA, USA. Cited on p. 161.
Wang, J. M., Fleet, D. J., and Hertzmann, A. (2008). Gaussian Process Dynamical
Models for Human Motion. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 30(2):283–298. Cited on p. 161.