
On Kernelized Multi-armed Bandits

Sayak Ray Chowdhury  SRCHOWDHURY@ECE.IISC.ERNET.IN
Electrical Communication Engineering, Indian Institute of Science, Bangalore 560012, India

Aditya Gopalan  ADITYA@ECE.IISC.ERNET.IN
Electrical Communication Engineering, Indian Institute of Science, Bangalore 560012, India

arXiv:1704.00445v2 [cs.LG] 17 May 2017

Abstract
We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function
over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for
continuous bandit optimization – Improved GP-UCB (IGP-UCB) and GP-Thompson sampling (GP-TS), and
derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs
to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to a Gaussian process kernel used as
input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector-
valued martingales of arbitrary, possibly infinite, dimension. Finally, experimental evaluation and comparisons
to existing algorithms on synthetic and real-world environments are carried out that highlight the favorable
gains of the proposed strategies in many cases.

1. Introduction
Optimization over large domains under uncertainty is an important subproblem arising in a variety of sequen-
tial decision making problems, such as dynamic pricing in economics (Besbes and Zeevi, 2009), reinforce-
ment learning with continuous state/action spaces (Kaelbling et al., 1996; Smart and Kaelbling, 2000), and
power control in wireless communication (Chiang et al., 2008). A typical feature of such problems is a large,
or potentially infinite, domain of decision points or covariates (prices, actions, transmit powers), together
with only partial and noisy observability of the associated outcomes (demand, state/reward, communication
rate); reward/loss information is revealed only for decisions that are chosen. This often makes it hard to
balance exploration and exploitation, as available knowledge must be transferred efficiently from a finite set
of observations so far to estimates of the values of infinitely many decisions. A classic case in point is that
of the canonical stochastic MAB with finitely many arms, where the effort to optimize scales with the total
number of arms or decisions; the effect of this is catastrophic for large or infinite arm sets.
With suitable structure in the values or rewards of arms, however, the challenge of sequential optimization
can be efficiently addressed. Parametric bandits, especially linearly parameterized bandits (Rusmevichien-
tong and Tsitsiklis, 2010), represent a well-studied class of structured decision making settings. Here, every
arm corresponds to a known, finite dimensional vector (its feature vector), and its expected reward is assumed
to be an unknown linear function of its feature vector. This allows for a large, or even infinite, set of arms
all lying in a space of finite dimension, say d, and a rich line of work gives algorithms that attain sublinear
regret with a polynomial dependence on the dimension, e.g., Confidence Ball (Dani et al., 2008), OFUL
(Abbasi-Yadkori et al., 2011) (a strengthening of Confidence Ball) and Thompson sampling for linear bandits

(Agrawal and Goyal, 2013).1 The insight here is that even though the number of arms can be large, the number
of unknown parameters (or degrees of freedom) in the problem is really only d, which makes it possible to
learn about the values of many other arms by playing a single arm.
A different approach to modelling bandit problems with a continuum of arms is via the framework of
Gaussian processes (GPs) (Rasmussen and Williams, 2006). GPs are a flexible class of nonparametric mod-
els for expressing uncertainty over functions on rather general domain sets, which generalize multivariate
Gaussian random vectors. GPs allow tractable regression for estimating an unknown function given a set of
(noisy) measurements of its values at chosen domain points. The fact that GPs, being distributions on func-
tions, can also help quantify function uncertainty makes it attractive for basing decision making strategies on
them. This has been exploited to great advantage to build nonparametric bandit algorithms, such as GP-UCB
(Srinivas et al., 2009), GP-EI and GP-PI (Hoffman et al., 2011). In fact, GP models for bandit optimization,
in terms of their kernel maps, can be viewed as the parametric linear bandit paradigm pushed to the extreme,
where each feature vector associated with an arm can have infinite dimension.2
Against this backdrop, our work revisits the problem of bandit optimization with stochastic rewards.
Specifically, we consider stochastic multiarmed bandit (MAB) problems with a continuous arm set, and
whose (unknown) expected reward function is assumed to lie in a reproducing kernel Hilbert space (RKHS),
with bounded RKHS norm – this effectively enforces smoothness on the function.3 We make the following
contributions:

• We design a new algorithm – Improved Gaussian Process-Upper Confidence Bound (IGP-UCB) – for
stochastic bandit optimization. The algorithm can be viewed as a variant of GP-UCB (Srinivas et al.,
2009), but uses a significantly reduced confidence interval width resulting in an order-wise improve-
ment in regret compared to GP-UCB. IGP-UCB also shows a markedly improved numerical perfor-
mance over GP-UCB.

• We develop a nonparametric version of Thompson sampling, called Gaussian Process Thompson sampling (GP-TS), and show that it enjoys a regret bound of Õ(γ_T √(dT)). Here, T is the total time horizon and γ_T is a quantity depending on the RKHS containing the reward function. This is, to our knowledge, the first known regret bound for Thompson sampling in the agnostic setup with nonparametric structure.
• We prove a new self-normalized concentration inequality for infinite-dimensional vector-valued mar-
tingales, which is not only key to the design and analysis of the IGP-UCB and GP-TS algorithms, but
also potentially of independent interest. The inequality generalizes a corresponding self-normalized
bound for martingales in finite dimension proven by Abbasi-Yadkori et al. (2011).

• Empirical comparisons of the algorithms developed above, with other GP-based algorithms, are pre-
sented, over both synthetic and real-world setups, demonstrating performance improvements of the
proposed algorithms, as well as their performance under misspecification.

2. Problem Statement
We consider the problem of sequentially maximizing a fixed but unknown reward function f : D → R over
a (potentially infinite) set of decisions D ⊂ Rd , also called actions or arms. An algorithm for this problem
chooses, at each round t, an action xt ∈ D, and subsequently observes a reward yt = f (xt ) + εt , which is a
noisy version of the function value at xt . The action xt is chosen causally depending upon the arms played
and rewards obtained upto round t − 1, denoted by the history Ht−1 = {(xs , ys ) : s = 1, . . . , t − 1}. We
1. Roughly, for rewards bounded in [−1, 1], these algorithms achieve optimal regret Õ(d√T), where Õ(·) hides polylog(T) factors.
2. The completion of the linear span of all feature vectors (images of the kernel map) is precisely the reproducing kernel Hilbert space
(RKHS) that characterizes the GP.
3. Kernels, and their associated RKHSs,

assume that the noise sequence {ε_t}_{t=1}^∞ is conditionally R-sub-Gaussian for a fixed constant R ≥ 0, i.e.,
\[ \forall t \ge 0, \ \forall \lambda \in \mathbb{R}, \qquad \mathbb{E}\left[ e^{\lambda \varepsilon_t} \,\middle|\, \mathcal{F}_{t-1} \right] \le \exp\left( \frac{\lambda^2 R^2}{2} \right), \tag{1} \]
where F_{t−1} is the σ-algebra generated by the random variables {x_s, ε_s}_{s=1}^{t−1} and x_t. This is a mild assump-
tion on the noise (it holds, for instance, for distributions bounded in [−R, R]) and is standard in the bandit
literature (Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013).
Regret. The goal of an algorithm is to maximize its cumulative reward or alternatively minimize its cumu-
lative regret – the loss incurred due to not knowing f ’s maximum point beforehand. Let x? ∈ argmaxx∈D f (x)
be a maximum point of f (assuming the maximum is attained). The instantaneous regret incurred at time t is
r_t = f(x*) − f(x_t), and the cumulative regret in a time horizon T (not necessarily known a priori) is defined to be R_T = Σ_{t=1}^T r_t. A sub-linear growth of R_T in T signifies that R_T/T → 0 as T → ∞, or vanishing
per-round regret.
Regularity Assumptions. Attaining sub-linear regret is impossible in general for arbitrary reward func-
tions f and domains D, and thus some regularity assumptions are in order. In what follows, we assume
that D is compact. The smoothness assumption we make on the reward function f is motivated by Gaus-
sian processes4 and their associated reproducing kernel Hilbert spaces (RKHSs, see Schölkopf and Smola
(2002)). Specifically, we assume that f has small norm in the RKHS of functions D → R, with positive
semi-definite kernel function k : D × D → R. This RKHS, denoted by Hk (D), is completely specified
by its kernel function k(·, ·) and vice-versa, with an inner product h·, ·ik obeying the reproducing property:
f(x) = ⟨f, k(x, ·)⟩_k for all f ∈ H_k(D). In other words, the kernel plays the role of delta functions to represent the evaluation map at each point x ∈ D via the RKHS inner product. The RKHS norm ‖f‖_k = √⟨f, f⟩_k is a measure of the smoothness5 of f, with respect to the kernel function k, and satisfies: f ∈ H_k(D) if and only if ‖f‖_k < ∞.
We assume a known bound on the RKHS norm of the unknown target function6: ‖f‖_k ≤ B. Moreover, we assume bounded variance by restricting k(x, x) ≤ 1, for all x ∈ D. Two common kernels that satisfy the bounded variance property are the Squared Exponential and Matérn kernels, defined as
\[ k_{SE}(x, x') = \exp\left( -\frac{s^2}{2l^2} \right), \]
\[ k_{\text{Matérn}}(x, x') = \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{s\sqrt{2\nu}}{l} \right)^{\!\nu} B_{\nu}\!\left( \frac{s\sqrt{2\nu}}{l} \right), \]
where l > 0 and ν > 0 are hyperparameters, s = ‖x − x'‖_2 encodes the similarity between two points x, x' ∈ D, and B_ν(·) is the modified Bessel function. Generally the bounded variance property holds for any stationary kernel, i.e. kernels for which k(x, x') = k(x − x') for all x, x' ∈ ℝ^d.
required to make the regret bounds scale-free and are standard in the literature (Agrawal and Goyal, 2013). If instead k(x, x) ≤ c or ‖f‖_k ≤ cB, then our regret bounds would increase by a factor of c.
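As a concrete illustration, the following is a minimal Python sketch of these two kernels; it assumes B_ν is the modified Bessel function of the second kind (as in the standard Matérn form, available as scipy.special.kv), and the function names and default hyperparameter values are ours, not the paper's.

import numpy as np
from scipy.special import gamma, kv  # kv: modified Bessel function of the second kind

def k_se(x, x_prime, l=0.2):
    # Squared Exponential kernel; s = ||x - x'||_2
    s = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(x_prime))
    return np.exp(-s ** 2 / (2 * l ** 2))

def k_matern(x, x_prime, l=0.2, nu=2.5):
    # Matérn kernel with smoothness nu and lengthscale l
    s = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(x_prime))
    if s == 0.0:
        return 1.0  # k(x, x) = 1, consistent with the bounded-variance assumption
    z = np.sqrt(2 * nu) * s / l
    return (2 ** (1 - nu) / gamma(nu)) * (z ** nu) * kv(nu, z)

Both functions satisfy k(x, x) = 1, matching the bounded variance assumption above.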

3. Algorithms
Design philosophy. Both the algorithms we propose use Gaussian likelihood models for observations, and
Gaussian process (GP) priors for uncertainty over reward functions. A Gaussian process over D, denoted
by GPD (µ(·), k(·, ·)), is a collection of random variables (f (x))x∈D , one for each x ∈ D, such that every
finite sub-collection of random variables (f(x_i))_{i=1}^m is jointly Gaussian with mean E[f(x_i)] = µ(x_i) and
covariance E [(f (xi ) − µ(xi ))(f (xj ) − µ(xj ))] = k(xi , xj ), 1 ≤ i, j ≤ m, m ∈ N. The algorithms use
4. Other work has also studied continuum-armed bandits with weaker smoothness assumptions such as Lipschitz continuity – see
Related work for details and comparison.
5. One way to see this is that for every element g in the RKHS, |g(x) − g(y)| = |hg, k(x, ·) − k(y, ·)i| ≤ kgkk kk(x, ·) − k(y, ·)kk
by Cauchy-Schwarz.
6. This is analogous to the bound on the weight θ typically assumed in regret analyses of linear parametric bandits.

GP_D(0, v²k(·, ·)), v > 0, as an initial prior distribution for the unknown reward function f over D, where
k(·, ·) is the kernel function associated with the RKHS Hk (D) in which f is assumed to have ‘small’ norm at
most B. The algorithms also assume that the noise variables εt = yt − f (xt ) are drawn independently, across
t, from N(0, λv²), with λ ≥ 0. Thus, the prior distribution for each f(x) is assumed to be N(0, v²k(x, x)), x ∈ D. Moreover, given a set of sampling points A_t = (x_1, . . . , x_t) within D, it follows, under the assumptions above, that the corresponding vector of observed rewards y_{1:t} = [y_1, . . . , y_t]^T has the multivariate Gaussian distribution N(0, v²(K_t + λI)), where K_t = [k(x, x')]_{x,x'∈A_t} is the kernel matrix at time t. Then, by the
properties of GPs, we have that y1:t and f (x) are jointly Gaussian given At :
\[ \begin{bmatrix} f(x) \\ y_{1:t} \end{bmatrix} \sim N\left( 0, \begin{bmatrix} v^2 k(x, x) & v^2 k_t(x)^T \\ v^2 k_t(x) & v^2 (K_t + \lambda I) \end{bmatrix} \right), \]

where k_t(x) = [k(x_1, x), . . . , k(x_t, x)]^T. Therefore, conditioned on the history H_t, the posterior distribution over f is GP_D(µ_t(·), v²k_t(·, ·)), where

\[ \mu_t(x) = k_t(x)^T (K_t + \lambda I)^{-1} y_{1:t}, \tag{2} \]
\[ k_t(x, x') = k(x, x') - k_t(x)^T (K_t + \lambda I)^{-1} k_t(x'), \tag{3} \]
\[ \sigma_t^2(x) = k_t(x, x). \tag{4} \]

Thus for every x ∈ D, the posterior distribution of f(x), given H_t, is N(µ_t(x), v²σ_t²(x)).
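For concreteness, a minimal Python sketch of the posterior update (2)-(4); the helper name gp_posterior and the direct O(t³) solve are illustrative choices on our part, not part of the paper, and the sketch assumes at least one observation.

import numpy as np

def gp_posterior(X, y, x, k, lam=1.0):
    # Posterior mean mu_t(x) and variance sigma_t^2(x) as in Equations (2)-(4),
    # given t >= 1 queried points X, observed rewards y, query point x, kernel k, lambda = lam
    t = len(X)
    K_t = np.array([[k(X[i], X[j]) for j in range(t)] for i in range(t)])
    k_t = np.array([k(X[i], x) for i in range(t)])
    sol = np.linalg.solve(K_t + lam * np.eye(t), np.column_stack([y, k_t]))
    mu = k_t @ sol[:, 0]             # Equation (2)
    var = k(x, x) - k_t @ sol[:, 1]  # Equations (3)-(4)
    return mu, var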
Remark. Note that the GP prior and Gaussian likelihood model described above is only an aid to al-
gorithm design, and has nothing to do with the actual reward distribution or noise model as in the problem
statement (Section 2). The reward function f is a fixed, unknown, member of the RKHS Hk (D), and the
true sequence of noise variables εt is allowed to be a conditionally R-sub-Gaussian martingale difference
sequence (Equation 1). In general, thus, this represents a misspecified prior and noise model, also termed the
agnostic setting by Srinivas et al. (2009).
The proposed algorithms, to follow, assume the knowledge of only the sub-Gaussianity parameter R,
kernel function k and upper bound B on the RKHS norm of f . Note that v, λ are free parameters (possibly
time-dependent) that can be set specific to the algorithm.

3.1 Improved GP-UCB (IGP-UCB) Algorithm


We introduce the IGP-UCB algorithm (Algorithm 1), that uses a combination of the current posterior mean
µt−1 (x) and standard deviation vσt−1 (x) to (a) construct an upper confidence bound (UCB) envelope for the
actual function f over D, and (b) choose an action to maximize it. Specifically it chooses, at each round t,
the action
\[ x_t = \operatorname*{argmax}_{x \in D} \; \mu_{t-1}(x) + \beta_t \sigma_{t-1}(x), \tag{5} \]

with the scale parameter v set to 1. Such a rule trades off exploration (picking points with high uncertainty σ_{t−1}(x)) with exploitation (picking points with high reward µ_{t−1}(x)), with β_t = B + R√(2(γ_{t−1} + 1 + ln(1/δ))) being the parameter governing the tradeoff, which we later show is related to the width of the confidence interval for f at round t. δ ∈ (0, 1) is a free confidence parameter used by the algorithm, and γ_t is the maximum
information gain at time t, defined as:

\[ \gamma_t := \max_{A \subset D : |A| = t} I(y_A; f_A). \]

Here, I(y_A; f_A) denotes the mutual information between f_A = [f(x)]_{x∈A} and y_A = f_A + ε_A, where ε_A ∼ N(0, λv²I); it quantifies the reduction in uncertainty about f after observing y_A at points A ⊂ D.
γt is a problem dependent quantity and can be found given the knowledge of domain D and kernel function
k. For a compact subset D of Rd , γT is O((ln T )d+1 ) and O(T d(d+1)/(2ν+d(d+1)) ln T ), respectively, for the
Squared Exponential and Matérn kernels (Srinivas et al., 2009), depending only polylogarithmically on the
time T .
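For a fixed set of points A, the information gain I(y_A; f_A) equals ½ ln det(I + λ⁻¹K_A) (Equation 10 and Lemma 3 in the appendix). A minimal sketch that evaluates this quantity, which can be used to estimate γ_t numerically on a candidate set; the helper name and the direct log-determinant evaluation are illustrative, not part of the algorithms.

import numpy as np

def information_gain(A, k, lam=1.0):
    # I(y_A; f_A) = 0.5 * ln det(I + lam^{-1} K_A) for the points in A
    t = len(A)
    K = np.array([[k(a, b) for b in A] for a in A])
    sign, logdet = np.linalg.slogdet(np.eye(t) + K / lam)
    return 0.5 * logdet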

Algorithm 1 Improved-GP-UCB (IGP-UCB)
Input: Prior GP(0, k), parameters B, R, λ, δ.
for t = 1, 2, 3, . . . , T do
    Set β_t = B + R√(2(γ_{t−1} + 1 + ln(1/δ))).
    Choose x_t = argmax_{x∈D} µ_{t−1}(x) + β_t σ_{t−1}(x).
    Observe reward y_t = f(x_t) + ε_t.
    Perform update to get µ_t and σ_t using Equations 2, 3 and 4.
end for
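A minimal, self-contained sketch of one IGP-UCB round over a finite candidate set (the continuous argmax over D is replaced by a grid purely for illustration; the helper name and its arguments are ours):

import numpy as np

def igp_ucb_select(X, y, candidates, k, gamma_prev, B, R, delta, lam=1.0):
    # One round of IGP-UCB: pick argmax of mu_{t-1}(x) + beta_t * sigma_{t-1}(x)
    t = len(X)
    beta = B + R * np.sqrt(2 * (gamma_prev + 1 + np.log(1.0 / delta)))
    K_inv = np.linalg.inv(np.array([[k(a, b) for b in X] for a in X]) + lam * np.eye(t))
    best_x, best_ucb = None, -np.inf
    for x in candidates:
        kx = np.array([k(a, x) for a in X])
        mu = kx @ K_inv @ y                      # Equation (2)
        var = k(x, x) - kx @ K_inv @ kx          # Equations (3)-(4)
        ucb = mu + beta * np.sqrt(max(var, 0.0))
        if ucb > best_ucb:
            best_x, best_ucb = x, ucb
    return best_x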

Discussion. Srinivas et al. (2009) have proposed the GP-UCB algorithm, and Valko et al. (2013) the KernelUCB algorithm, for sequentially optimizing reward functions lying in the RKHS H_k(D). Both algorithms play an arm at time t using the rule: x_t = argmax_{x∈D} µ_{t−1}(x) + β̃_t σ_{t−1}(x). GP-UCB uses the exploration parameter β̃_t = √(2B² + 300γ_{t−1} ln³(t/δ)), with λ set to σ², where σ is additionally assumed to be a known, uniform (i.e., almost-sure) upper bound on all noise variables ε_t (Srinivas et al., 2009, Theorem 3). Compared to GP-UCB, IGP-UCB (Algorithm 1) reduces the width of the confidence interval by a factor of roughly O(ln^{3/2} t) at every round t, and, as we will see, this small but critical adjustment leads to much better theoretical and empirical performance compared to GP-UCB. In KernelUCB, β̃_t is set as η/λ^{1/2}, where η is the exploration parameter and λ is the regularization constant. Thus IGP-UCB can be viewed as a special case of KernelUCB where η = β_t.

3.2 Gaussian Process Thompson Sampling (GP-TS)


Our second algorithm, GP-TS (Algorithm 2), inspired by the success of Thompson sampling for standard
and parametric bandits (Agrawal and Goyal, 2012; Kaufmann et al., 2012; Gopalan et al., 2014; Agrawal and Goyal, 2013), uses the time-varying scale parameter v_t = B + R√(2(γ_{t−1} + 1 + ln(2/δ))) and operates as follows. At each round t, GP-TS samples a random function f_t(·) from the GP with mean function µ_{t−1}(·) and covariance function v_t²k_{t−1}(·, ·). Next, it chooses a decision set D_t ⊂ D, and plays the arm x_t ∈ D_t that maximizes f_t.7 We call it GP-Thompson-Sampling as it falls under the general framework of
Thompson Sampling, i.e., (a) assume a prior on the underlying parameters of the reward distribution, (b) play
the arm according to the prior probability that it is optimal, and (c) observe the outcome and update the prior.
However, note that the prior is nonparametric in this case.

Algorithm 2 GP-Thompson-Sampling (GP-TS)
Input: Prior GP(0, k), parameters B, R, λ, δ.
for t = 1, 2, 3, . . . do
    Set v_t = B + R√(2(γ_{t−1} + 1 + ln(2/δ))).
    Sample f_t(·) from GP_D(µ_{t−1}(·), v_t²k_{t−1}(·, ·)).
    Choose the current decision set D_t ⊂ D.
    Choose x_t = argmax_{x∈D_t} f_t(x).
    Observe reward y_t = f(x_t) + ε_t.
    Perform update to get µ_t and k_t using Equations 2 and 3.
end for

7. If Dt = D for all t, then this is simply exact Thompson sampling. For technical reasons, however, our regret bound is valid when
Dt is chosen as a suitable discretization of D, so we include Dt as an algorithmic parameter.
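A minimal sketch of one GP-TS round on a finite discretization D_t, drawing the sample f_t jointly at the candidate points from the scaled posterior; the helper name, the joint multivariate normal draw, and the jitter added for numerical stability are implementation choices on our part.

import numpy as np

def gp_ts_select(X, y, D_t, k, gamma_prev, B, R, delta, lam=1.0, rng=None):
    # One round of GP-TS: sample f_t on D_t from GP(mu_{t-1}, v_t^2 k_{t-1}) and maximize it
    rng = rng or np.random.default_rng()
    t, m = len(X), len(D_t)
    v_t = B + R * np.sqrt(2 * (gamma_prev + 1 + np.log(2.0 / delta)))
    K_inv = np.linalg.inv(np.array([[k(a, b) for b in X] for a in X]) + lam * np.eye(t))
    kX = np.array([[k(a, x) for a in X] for x in D_t])           # m x t cross-kernel
    mu = kX @ K_inv @ y                                          # posterior mean on D_t
    kDD = np.array([[k(xi, xj) for xj in D_t] for xi in D_t])    # m x m prior kernel matrix
    cov = v_t ** 2 * (kDD - kX @ K_inv @ kX.T)                   # scaled posterior covariance
    f_t = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(m))    # sampled function values
    return D_t[int(np.argmax(f_t))]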

4. Main Results
We begin by presenting two key concentration inequalities which are essential in bounding the regret of the
proposed algorithms.
Theorem 1 Let {x_t}_{t=1}^∞ be an ℝ^d-valued discrete time stochastic process predictable with respect to the filtration {F_t}_{t=0}^∞, i.e., x_t is F_{t−1}-measurable for all t ≥ 1. Let {ε_t}_{t=1}^∞ be a real-valued stochastic process such that for some R ≥ 0 and for all t ≥ 1, ε_t is (a) F_t-measurable, and (b) R-sub-Gaussian conditionally on F_{t−1}. Let k : ℝ^d × ℝ^d → ℝ be a symmetric, positive-semidefinite kernel, and let 0 < δ ≤ 1. For a given η > 0, with probability at least 1 − δ, the following holds simultaneously over all t ≥ 0:
\[ \|\varepsilon_{1:t}\|^2_{((K_t+\eta I)^{-1}+I)^{-1}} \le 2R^2 \ln \frac{\sqrt{\det((1+\eta)I + K_t)}}{\delta}. \tag{6} \]
(Here, K_t denotes the t × t matrix K_t(i, j) = k(x_i, x_j), 1 ≤ i, j ≤ t, and for any x ∈ ℝ^t and A ∈ ℝ^{t×t}, ‖x‖_A := √(xᵀAx).) Moreover, if K_t is positive definite for all t ≥ 1 with probability 1, then the conclusion above holds with η = 0.
Theorem 1 represents a self-normalized concentration inequality: the ‘size’ of the increasing-length se-
quence {εt }t of martingale differences is normalized by the growing quantity ((Kt + ηI)−1 + I)−1 that
explicitly depends on the sequence. The following lemma helps provide an alternative, abstract, view of the
self-normalized process of Theorem 1, based on the feature space representation induced by a kernel.
Lemma 1 Let k : ℝ^d × ℝ^d → ℝ be a symmetric, positive-semidefinite kernel, with associated feature map ϕ : ℝ^d → H_k and reproducing kernel Hilbert space8 (RKHS) H_k. Letting S_t = Σ_{s=1}^t ε_s ϕ(x_s) and the (possibly infinite dimensional) matrix9 V_t = I + Σ_{s=1}^t ϕ(x_s)ϕ(x_s)^T, we have, whenever K_t is positive definite, that
\[ \|\varepsilon_{1:t}\|_{(K_t^{-1}+I)^{-1}} = \|S_t\|_{V_t^{-1}}, \]
where ‖S_t‖_{V_t^{-1}} := ‖V_t^{-1/2} S_t‖_{H_k} denotes the norm of V_t^{-1/2} S_t in the RKHS H_k.
 
Observe that S_t is F_t-measurable and also E[S_t | F_{t−1}] = S_{t−1}. The process {S_t}_{t≥0} is thus a mar-
tingale with values10 in the RKHS H, which can possibly be infinite-dimensional, and moreover, whose
deviation is measured by the norm weighted by Vt−1 , which is itself derived from St . Theorem 1 represents
the kernelized generalization of the finite-dimensional result of Abbasi-Yadkori et al. (2011), and we recover
their result under the special case of a linear kernel: ϕ(x) = x for all x ∈ Rd .
We remark that when ϕ is a mapping to a finite-dimensional Hilbert space, the argument of Abbasi-
Yadkori et al. (2011, Theorem 1) can be lifted to establish Theorem 1, but it breaks down in the generalized,
infinite-dimensional RKHS setting, as the self-normalized bound in their paper has an explicit, growing
dependence on the feature dimension. Specifically, the method of mixtures (de la Pena et al., 2009) or Laplace
method, as dubbed by Maillard (2016) (Lemma 5.2), fails to hold in infinite dimension. The primary reason
for this is that the mixture distribution for finite dimensional spaces can be chosen independently of time,
but in a nonparametric setup like ours, where the dimensionality of the self-normalizing factor (K_t^{-1} + I)^{-1} itself grows with time, the use of (random) stopping times precludes using time-dependent mixtures. We get
around this difficulty by applying a novel ‘double mixture’ construction, in which a pair of mixtures on (a)
the space of real-valued functions on Rd , i.e., the support of a Gaussian process, and (b) on real sequences
is simultaneously used to obtain a more general result, of potentially independent interest (see Section 5 and
the appendix for details).
Our next result shows how the posterior mean is concentrated around the unknown reward function f.
8. Such a pair (ϕ, Hk ) always exists, see e.g., Rasmussen and Williams (2006).
9. More formally, V_t : H_k → H_k is the linear operator defined by V_t(z) = z + Σ_{s=1}^t ϕ(x_s)⟨ϕ(x_s), z⟩ for all z ∈ H_k.
10. We ignore issues of measurability here.

Theorem 2 Under the same hypotheses as those of Theorem 1, let D ⊂ Rd , and f : D → R be a member
of the RKHS of real-valued functions on D with kernel k, with RKHS norm bounded by B. Then, with
probability at least 1 − δ, the following holds for all x ∈ D and t ≥ 1:
\[ |\mu_{t-1}(x) - f(x)| \le \left( B + R\sqrt{2(\gamma_{t-1} + 1 + \ln(1/\delta))} \right) \sigma_{t-1}(x), \]
where γ_{t−1} is the maximum information gain after t − 1 rounds and µ_{t−1}(x), σ²_{t−1}(x) are the mean and variance of the posterior distribution defined as in Equations 2, 3, 4, with λ set to 1 + η and η = 2/T.
Theorem 3.5 of Maillard (2016) states a similar result on the estimation of the unknown reward function from
the RKHS. We improve upon it in the sense that the confidence bound in Theorem 2 is simultaneous over all
x ∈ D, while the bound has been shown only for a single, fixed x in the Kernel Least-squares setting. We are
able to achieve this result by virtue of Theorem 1.

4.1 Regret Bound of IGP-UCB


Theorem 3 Let δ ∈ (0, 1), ‖f‖_k ≤ B, and ε_t be conditionally R-sub-Gaussian. Running IGP-UCB for a function f lying in the RKHS H_k(D), we obtain a regret bound of O(√T(B√γ_T + γ_T)) with high probability. More precisely, with probability at least 1 − δ, R_T = O(B√(Tγ_T) + √(Tγ_T(γ_T + ln(1/δ)))).

Improvement over GP-UCB. Srinivas et al. (2009), in the course of analyzing the GP-UCB algorithm, show that when the reward function lies in the RKHS H_k(D), GP-UCB obtains regret O(√T(B√γ_T + γ_T ln^{3/2}(T))) with high probability (see Theorem 3 therein for the exact bound). Furthermore, they assume
that the noise εt is uniformly bounded by σ, while our sub-Gaussianity assumption (see Equation 1) is slightly
more general, and we are able to obtain an O(ln^{3/2} T) multiplicative factor improvement in the final regret bound thanks to the new self-normalized inequality (Theorem 1). Additionally, in our numerical experiments, we observe a significantly improved performance of IGP-UCB over GP-UCB, both on synthetically generated functions and on real-world sensor measurement data (see Section 6).
Comparison with KernelUCB. Valko et al. (2013) show that the cumulative regret of KernelUCB is Õ(√(d̃T)), where d̃, defined as the effective dimension, measures, in a sense, the number of principal directions over which the projection of the data in the RKHS is spread. They show that d̃ is at least as good as γ_T, precisely γ_T ≥ Ω(d̃ ln ln T), and thus the regret bound of KernelUCB is roughly Õ(√(Tγ_T)), which is a √γ_T factor better than IGP-UCB. However, KernelUCB requires the number of actions to be finite, so the regret bound is not applicable for infinite or continuum action spaces.

4.2 Regret Bound of GP-TS


For technical reasons, we will analyze the following version of GP-TS. At each round t, the decision set used
by GP-TS is restricted to be a unique discretization D_t of D with the property that |f(x) − f([x]_t)| ≤ 1/t² for all x ∈ D, where [x]_t is the closest point to x in D_t. This can always be achieved by choosing a compact and convex domain D ⊂ [0, r]^d and discretization D_t with size |D_t| = (BLrdt²)^d such that ‖x − [x]_t‖_1 ≤ rd/(BLrdt²) = 1/(BLt²) for all x ∈ D, where L = sup_{x∈D} sup_{j∈[d]} ( ∂²k(p, q)/∂p_j∂q_j |_{p=q=x} )^{1/2}. This implies, for every x ∈ D,
\[ |f(x) - f([x]_t)| \le \|f\|_k L \|x - [x]_t\|_1 \le 1/t^2, \tag{7} \]
as any f ∈ H_k(D) is Lipschitz continuous with constant ‖f‖_k L (De Freitas et al., 2012, Lemma 1).
Theorem 4 (Regret bound for GP-TS) Let δ ∈ (0, 1), D ⊂ [0, r]^d be compact and convex, ‖f‖_k ≤ B and {ε_t}_t a conditionally R-sub-Gaussian sequence. Running GP-TS for a function f lying in the RKHS H_k(D) and with decision sets D_t chosen as above, with probability at least 1 − δ, the regret of GP-TS satisfies
\[ R_T = O\left( \sqrt{(\gamma_T + \ln(2/\delta))\, d \ln(BdT)\, T \gamma_T} + B\sqrt{T \ln(2/\delta)} \right). \]

Comparison with IGP-UCB. Observe that the regret scaling of GP-TS is Õ(γ_T √(dT)), which is a multiplicative √d factor away from the bound Õ(γ_T √T) obtained for IGP-UCB, and similar behavior is reflected in our simulations on synthetic data. The additional multiplicative factor of √(d ln(BdT)) in the regret bound of GP-TS is essentially a consequence of discretization. How to remove this extra logarithmic dependency, and make the analysis discretization-independent, remains an open question.
Remark. The regret bound for GP-TS is inferior to that of IGP-UCB in terms of the dependency
on dimension d, but to the best of our knowledge, Theorem 4 is the first (frequentist) regret guarantee of
Thompson Sampling in the agnostic, non-parametric setting of infinite action spaces.
Linear Models and a Matching Lower Bound. If the mean rewards are perfectly linear, i.e., if there exists a θ ∈ ℝ^d such that f(x) = θᵀx for all x ∈ D, then we are in the parametric setup, and one way of casting this in the kernelized framework is by using the linear kernel k(x, x') = xᵀx'. For this kernel, γ_T = O(d ln T), and the regret scaling of IGP-UCB is Õ(d√T) and that of GP-TS is Õ(d^{3/2}√T), which recovers the regret bounds of their linear, parametric analogues OFUL (Abbasi-Yadkori et al., 2011) and Linear Thompson sampling (Agrawal and Goyal, 2013), respectively. Moreover, in this case d̃ = d, thus the regret of IGP-UCB is a √d factor away from that of KernelUCB. But the regret bound of KernelUCB also depends on the number of arms N, and if N is exponential in d, then it also suffers Õ(d√T) regret. We remark that a similar O(ln^{3/2} T) factor improvement, as obtained by IGP-UCB over GP-UCB, was achieved in the linear parametric setting by Abbasi-Yadkori et al. (2011) in the OFUL algorithm, over its predecessor ConfidenceBall (Dani et al., 2008). Finally, we see that for the linear bandit problem with infinitely many actions, IGP-UCB attains the information theoretic lower bound of Ω(d√T) (see Dani et al. (2008)), but GP-TS is a factor of √d away from it.

5. Overview of Techniques
We briefly outline here the key arguments for all the theorems in Section 4. Formal proofs and auxiliary
lemmas required are given in the appendix.
Proof Sketch for Theorem 1. It is convenient to assume that K_t, the induced kernel matrix at time t, is invertible, since this is where the crux of the argument lies. First we show that for any function g : D → ℝ and for all t ≥ 0, thanks to the sub-Gaussian property (1), the process {M_t^g := exp(ε_{1:t}^T g_{1:t} − ½‖g_{1:t}‖²)}_t is a non-negative super-martingale with respect to the filtration F_t, where g_{1:t} := [g(x_1), . . . , g(x_t)]^T, and in fact satisfies E[M_t^g] ≤ 1. The chief difficulty is to handle the behavior of M_t at a (random) stopping time, since the sizes of quantities such as ε_{1:t} at the stopping time will be random.
We next construct a mixture martingale M_t by mixing M_t^g over g drawn from an independent GP_D(0, k) Gaussian process, which is a measure over a large space of functions, i.e., the space ℝ^D. Then, by a change of measure argument, we show that this induces a mixture distribution which is essentially N(0, K_t) over any desired finite dimension t, thus obtaining
\[ M_t = \frac{1}{\sqrt{\det(I + K_t)}} \exp\left( \frac{1}{2} \|\varepsilon_{1:t}\|^2_{(I + K_t^{-1})^{-1}} \right). \]
Next, from the fact that E[M_τ] ≤ 1 and from Markov's inequality, for any δ ∈ (0, 1), we obtain
\[ \mathbb{P}\left[ \|\varepsilon_{1:\tau}\|^2_{(K_\tau^{-1} + I)^{-1}} > 2 \ln\left( \sqrt{\det(I + K_\tau)} / \delta \right) \right] \le \delta. \]
Finally, we lift this bound simultaneously over all t through a standard stopping time construction as in Abbasi-Yadkori et al. (2011).
Proof Sketch for Theorem 2. Here we sketch the special case of η = 0, i.e., λ = 1. Observe that |µ_t(x) − f(x)| is upper bounded by the sum of two terms, P := |k_t(x)^T(K_t + I)^{-1}ε_{1:t}| and Q := |k_t(x)^T(K_t + I)^{-1}f_{1:t} − f(x)|. Now we observe that σ_t²(x) = ϕ(x)^T(Φ_t^TΦ_t + I)^{-1}ϕ(x) and use this observation to show that P = |ϕ(x)^T(Φ_t^TΦ_t + I)^{-1}Φ_t^Tε_{1:t}| and Q = |ϕ(x)^T(Φ_t^TΦ_t + I)^{-1}f|, which are in turn upper bounded by the terms σ_t(x)‖S_t‖_{V_t^{-1}} and ‖f‖_k σ_t(x) respectively. Then the result follows using Theorem 1, along with the assumption that ‖f‖_k ≤ B and the fact that ½ ln(det(I + K_t)) ≤ γ_t almost surely (see Lemma 3) when K_t is invertible.
Proof Sketch for Theorem 3. First, from Theorem 2 and the choice of x_t in Algorithm 1, we show that the instantaneous regret r_t at round t is upper bounded by 2β_tσ_{t−1}(x_t) with probability at least 1 − δ. Then the result follows by essentially upper bounding the term Σ_{t=1}^T σ_{t−1}(x_t) by O(√(Tγ_T)) (Lemma 4 in the appendix).
Proof Sketch for Theorem 4. We follow an approach similar to that of Agrawal and Goyal (2013) to prove the regret bound of GP-TS. First observe that, from our choice of discretization sets D_t, the instantaneous regret at round t is given by r_t = f(x*) − f([x*]_t) + f([x*]_t) − f(x_t) ≤ 1/t² + ∆_t(x_t), where ∆_t(x) := f([x*]_t) − f(x) and [x*]_t is the closest point to x* in D_t. Now at each round t, after an action is chosen, our algorithm improves the confidence about the true reward function f, via an update of µ_t(·) and k_t(·, ·). However, if we play a suboptimal arm, the regret suffered can be much higher than the improvement of our knowledge. To overcome this difficulty, at any round t, we divide the arms (in the present discretization D_t) into two groups: saturated arms S_t, defined as those with ∆_t(x) > c_tσ_{t−1}(x), and unsaturated arms otherwise, where c_t is an appropriate constant (see Definitions 1, 3). The idea is to show that the probability of playing a saturated arm is small and then bound the regret of playing an unsaturated arm in terms of the standard deviation. This is useful because the inequality Σ_{t=1}^T σ_{t−1}(x_t) ≤ O(√(Tγ_T)) (Lemma 4) allows us to bound the total regret due to unsaturated arms.
First we lower bound the probability of playing an unsaturated arm at round t. We define a filtration F'_{t−1} as the history H_{t−1} up to round t − 1 and prove that for "most" (in a high probability sense) F'_{t−1}, P[x_t ∈ D_t \ S_t | F'_{t−1}] ≥ p − 1/t², where p = 1/(4e√π) (Lemma 9). This observation, along with concentration bounds for f_t(x) and f(x) (Lemma 6) and "smoothness" of f (Equation 7), allows us to show that the expected regret at round t is upper bounded in terms of σ_{t−1}(x_t), i.e., in terms of the regret due to playing an unsaturated arm. More precisely, we show that for "most" F'_{t−1},
\[ \mathbb{E}\left[ r_t \,\middle|\, \mathcal{F}'_{t-1} \right] \le \frac{11 c_t}{p} \mathbb{E}\left[ \sigma_{t-1}(x_t) \,\middle|\, \mathcal{F}'_{t-1} \right] + \frac{2B+1}{t^2} \]
(Lemma 10), and use it to prove that X_t ≃ r_t − (11c_t/p)σ_{t−1}(x_t) − (2B+1)/t², t ≥ 1, is a super-martingale difference sequence adapted to the filtration {F'_t}_{t≥1} (Lemma 12). Now, using the Azuma-Hoeffding inequality (Lemma 13), along with the bound on Σ_{t=1}^T σ_{t−1}(x_t), we obtain the desired high-probability regret bound.

6. Experiments
In this section we provide numerical results on both synthetically generated test functions and functions from
real-world data. We compare GP-UCB, IGP-UCB and GP-TS with GP-EI and GP-PI11 .
Synthetic Test Functions. We use the following procedure to generate test functions from the RKHS. First we sample 100 points uniformly from the interval [0, 1] and use that as our decision set. Then we compute a kernel matrix K on those points and draw a reward vector y ∼ N(0, K). Finally, the mean of the resulting posterior distribution is used as the test function f. We set the noise parameter R² to be 1% of the function range and use λ = R². We used the Squared Exponential kernel with lengthscale parameter l = 0.2 and the Matérn kernel with parameters ν = 2.5, l = 0.2. Parameters β_t, β̃_t, v_t of IGP-UCB, GP-UCB and GP-TS are chosen as given in Section 3, with δ = 0.1, B² = fᵀKf and γ_t set according to the theoretical upper bounds for the corresponding kernels. We run each algorithm for T = 30000 iterations, over 25 independent trials (samples from the RKHS), and plot the average cumulative regret along with standard deviations (Figure 1). We see a significant improvement in the performance of IGP-UCB over GP-UCB. In fact IGP-UCB performs the best in the pool of competitors, while GP-TS also fares reasonably well compared to GP-UCB and GP-EI/GP-PI.
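A minimal sketch of this test-function generation procedure (the helper name is ours; we interpret "function range" as the range of the sampled reward vector, which is an assumption, and the small jitter is for numerical stability only):

import numpy as np

def make_rkhs_test_function(n=100, l=0.2, noise_frac=0.01, seed=0):
    # Test function on n points in [0, 1]: the posterior mean of a GP fit to y ~ N(0, K)
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=n)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * l ** 2))       # SE kernel matrix
    y = rng.multivariate_normal(np.zeros(n), K + 1e-10 * np.eye(n))  # reward vector y ~ N(0, K)
    R2 = noise_frac * (y.max() - y.min())                            # R^2 = 1% of the (sampled) range
    f_test = K @ np.linalg.solve(K + R2 * np.eye(n), y)              # posterior mean used as f, lambda = R^2
    return X, f_test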
11. GP-EI and PI perform similarly and thus are not separately distinguishable in the plots.
Figure 1: Cumulative regret for functions lying in the RKHS corresponding to (a) Squared Exponential kernel and (b) Matérn kernel.

Figure 2: Cumulative regret for functions lying in the GP corresponding to (a) Squared Exponential kernel and (b) Matérn kernel.

We next sample 25 random functions from GP(0, K) and perform the same experiment (Figure 2) for both kernels with exactly the same set of parameters. The relative performance of all methods is similar to that in the previous experiment, which is arguably the harder "agnostic" setting of a fixed, unknown target function.
Standard Test Functions. We consider two well-known synthetic benchmark functions for Bayesian Optimization: Rosenbrock and Hartman3 (see Azimi et al. (2012) for exact analytical expressions). We sample 100d points uniformly from the domain of each benchmark function, d being the dimension of the respective domain, as the decision set. We consider the Squared Exponential kernel with l = 0.2 and set all parameters exactly as in the previous experiment. The cumulative regret for 25 independent trials on the Rosenbrock and Hartman3 benchmarks is shown in Figure 3. We see GP-EI/PI perform better than the rest, while IGP-UCB and GP-TS show competitive performance.
Figure 3: Cumulative regret for (a) Rosenbrock and (b) Hartman3 benchmark function.
Here no algorithm is aware of the underlying kernel function, hence we conjecture that the UCB- and TS-based algorithms are somewhat less robust to the choice of kernel than EI/PI.
Temperature Sensor Data. We use temperature data12 collected from 54 sensors deployed in the Intel
Berkeley Research lab between February 28th and April 5th, 2004 with samples collected at 30 second
intervals. We tested all algorithms in the context of learning the maximum reading of the sensors collected between 8 am and 9 am. We take measurements of the first 5 consecutive days (starting Feb. 28th, 2004) to learn the algorithm parameters. Following Srinivas et al. (2009), we calculate the empirical covariance matrix of the sensor measurements and use it as the kernel matrix in the algorithms. Here R² is set to be 5% of the average empirical variance of the sensor readings, and the other algorithm parameters are set similarly as in the previous experiment, with γ_t = 1 (found via cross-validation). The functions for testing consist of one set of measurements from all sensors in the two following days, and the cumulative regret is plotted over all such test functions. From Figure 4, we see that IGP-UCB and GP-UCB perform the same, while GP-TS outperforms all its competitors.
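A minimal sketch of how the kernel matrix can be built from the training measurements as described; the helper name is ours, and it assumes the rows of train are time samples and the columns are the 54 sensors (an assumption about the data layout):

import numpy as np

def empirical_kernel(train, noise_frac=0.05):
    # Empirical covariance of sensor readings, used as the kernel matrix over arms (sensors)
    K = np.cov(train, rowvar=False)           # 54 x 54 empirical covariance matrix
    R2 = noise_frac * np.mean(np.diag(K))     # R^2 = 5% of the average empirical variance
    return K, R2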
Light Sensor Data. We take light sensor data collected in the CMU Intelligent Workplace in Nov 2005, which is available online as a Matlab structure13 and contains locations of 41 sensors, 601 train samples and 192
test samples. We compute the kernel matrix, estimate the noise and set other algorithm parameters exactly
as in the previous experiment. Here also GP-TS is found to perform better than the others, with IGP-UCB
performing better than GP-EI/PI (Figure 4).
Related work. An alternative line of work pertaining to X -armed bandits Kleinberg et al. (2008); Bubeck
et al. (2011); Carpentier and Valko (2015); Azar et al. (2014) studies continuum-armed bandits with smooth-
ness structure. For instance, Bubeck et al. (2011) show that with a Lipschitzness assumption on the reward
function, algorithms based on discretizing the domain yield nontrivial regret guarantees, of order Ω(T^{(d+1)/(d+2)}), in ℝ^d. Other Bayesian approaches to function optimization are GP-EI Močkus (1975), GP-PI Kushner (1964), GP-EST Wang et al. (2016) and GP-UCB, including the contextual Krause and Ong (2011), high-dimensional Djolonga et al. (2013); Wang et al. (2013), time-varying Bogunovic et al. (2016), safety-aware Gotovos et al. (2015), budget-constrained Hoffman et al. (2013) and noise-free De Freitas et al. (2012) settings. Other relevant work focuses on the best arm identification problem in the Bayesian setup considering pure exploration Grünewälder et al. (2010).
12. http://db.csail.mit.edu/labdata/labdata.html
13. http://www.cs.cmu.edu/˜guestrin/Class/10708-F08/projects/lightsensor.zip

Figure 4: Cumulative regret plots for (a) temperature data and (b) light sensor data.
For Thompson sampling (TS), Russo and Van Roy (2014) analyze the Bayesian regret of TS, which includes the case where the target function is sampled from a GP prior. Our work obtains the first frequentist regret bound of TS for unknown, fixed functions from an RKHS.

7. Conclusion
For bandit optimization, we have improved upon the existing GP-UCB algorithm, and introduced a new GP-
TS algorithm. The proposed algorithms perform well in practice both on synthetic and real-world data. An
interesting case is when the kernel function is also not known to the algorithms a priori and needs to be
learnt adaptively. Moreover, one can consider classes of time varying functions from the RKHS, and general
reinforcement learning with GP techniques. There are also important questions on computational aspects of
optimizing functions drawn from GPs.

References
Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits.
In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
Shipra Agrawal and Navin Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In
COLT, pages 39–1, 2012.
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML,
pages 127–135, 2013.
Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Online stochastic optimization
under correlated bandit feedback. In ICML, pages 1557–1565, 2014.
Javad Azimi, Ali Jalali, and Xiaoli Fern. Hybrid batch bayesian optimization. arXiv preprint
arXiv:1202.5597, 2012.
Omar Besbes and Assaf Zeevi. Dynamic pricing without knowing the demand function: Risk bounds and
near-optimal algorithms. Operations Research, 57(6):1407–1420, 2009.
Ilija Bogunovic, Jonathan Scarlett, and Volkan Cevher. Time-varying gaussian process bandit optimization.
arXiv preprint arXiv:1601.06650, 2016.

Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine
Learning Research, 12(May):1655–1695, 2011.
Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. In ICML, pages
1133–1141, 2015.
Mung Chiang, Prashanth Hande, Tian Lan, and Chee Wei Tan. Power control in wireless cellular networks.
Foundations and Trends in Networking, 2(4):381–533, 2008. ISSN 1554-057X. doi: 10.1561/1300000009.
Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback.
In COLT, pages 355–366, 2008.
Nando De Freitas, Alex Smola, and Masrour Zoghi. Exponential regret bounds for gaussian process bandits
with deterministic observations. arXiv preprint arXiv:1206.6457, 2012.
Victor H de la Pena, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes. Probability and its Applications, 2009.
Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional gaussian process bandits. In Ad-
vances in Neural Information Processing Systems, pages 1025–1033, 2013.
Rick Durrett. Probability: Theory and Examples. Brooks/Cole - Thomson Learning, Belmont, CA, 2005.
Aditya Gopalan, Shie Mannor, and Yishay Mansour. Thompson sampling for complex online problems. In
ICML, volume 14, pages 100–108, 2014.
Alkis Gotovos, ETHZ CH, and Joel W Burdick. Safe exploration for optimization with gaussian processes.
2015.
Steffen Grünewälder, Jean-Yves Audibert, Manfred Opper, and John Shawe-Taylor. Regret bounds for gaus-
sian process bandit problems. In AISTATS, pages 273–280, 2010.
Matthew D Hoffman, Eric Brochu, and Nando de Freitas. Portfolio allocation for bayesian optimization. In
UAI, pages 327–336, 2011.
Matthew W Hoffman, Bobak Shahriari, and Nando de Freitas. Exploiting correlation and budget constraints
in bayesian multi-armed bandit optimization. arXiv preprint arXiv:1303.6746, 2013.
Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal
of artificial intelligence research, 4:237–285, 1996.
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically opti-
mal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213.
Springer, 2012.
Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandits in metric spaces. In Proceedings
of the fortieth annual ACM symposium on Theory of computing, pages 681–690. ACM, 2008.
Andreas Krause and Cheng S Ong. Contextual gaussian process bandit optimization. In Advances in Neural
Information Processing Systems, pages 2447–2455, 2011.
Harold J Kushner. A new method of locating the maximum point of an arbitrary multipeak curve in the
presence of noise. Journal of Basic Engineering, 86(1):97–106, 1964.
Odalric-Ambrym Maillard. Self-normalization techniques for streaming confident regression. 2016.
J Močkus. On bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical
Conference, pages 400–404. Springer, 1975.

Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning. 2006.
Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Math. Oper. Res., 35(2):
395–411, May 2010.
Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Opera-
tions Research, 39(4):1221–1243, 2014.
Bernhard Schölkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization,
optimization, and beyond. MIT press, 2002.
William D Smart and Leslie Pack Kaelbling. Practical reinforcement learning in continuous spaces. In ICML,
pages 903–910, 2000.
Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in
the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of
kernelised contextual bandits. arXiv preprint arXiv:1309.6869, 2013.
Zi Wang, Bolei Zhou, and Stefanie Jegelka. Optimization as estimation with gaussian processes in bandit settings. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, N Freitas, et al. Bayesian optimization in high di-
mensions via random embeddings. AAAI Press/International Joint Conferences on Artificial Intelligence,
2013.
Fuzhen Zhang. The Schur complement and its applications, volume 4. Springer Science & Business Media,
2006.

Appendix
A. Proof of Theorem 1
For a function g : D → ℝ and a sequence of reals n ≡ (n_t)_{t=1}^∞, define for any t ≥ 0
\[ M_t^{g,n} = \exp\left( \varepsilon_{1:t}^T g_{1:t,n} - \frac{R^2}{2} \|g_{1:t,n}\|^2 \right), \]
where the vector g_{1:t,n} := [g(x_1) + n_1, . . . , g(x_t) + n_t]^T. We first establish the following technical result, which resembles Abbasi-Yadkori et al. (2011, Lemma 8).
Lemma 2 For fixed g and n, {M_t^{g,n}}_{t=0}^∞ is a super-martingale with respect to the filtration {F_t}_{t=0}^∞.

Proof First, define
\[ \Delta_t^{g,n} := \exp\left( \varepsilon_t (g(x_t) + n_t) - \frac{R^2}{2} (g(x_t) + n_t)^2 \right). \]
Since x_t is F_{t−1}-measurable and ε_t is F_t-measurable, M_t^{g,n} as well as Δ_t^{g,n} are F_t-measurable. Also, by the conditional R-sub-Gaussianity of ε_t, we have
\[ \forall \lambda \in \mathbb{R}, \quad \mathbb{E}\left[ e^{\lambda \varepsilon_t} \,\middle|\, \mathcal{F}_{t-1} \right] \le \exp\left( \frac{\lambda^2 R^2}{2} \right), \]
which in turn implies E[Δ_t^{g,n} | F_{t−1}] ≤ 1. We also have
\[ \mathbb{E}\left[ M_t^{g,n} \,\middle|\, \mathcal{F}_{t-1} \right] = \mathbb{E}\left[ M_{t-1}^{g,n} \Delta_t^{g,n} \,\middle|\, \mathcal{F}_{t-1} \right] = M_{t-1}^{g,n}\, \mathbb{E}\left[ \Delta_t^{g,n} \,\middle|\, \mathcal{F}_{t-1} \right] \le M_{t-1}^{g,n}, \]
showing that {M_t^{g,n}}_{t=0}^∞ is a non-negative super-martingale and proving the lemma.

Also observe that E[M_t^{g,n}] ≤ 1 for all t, as
\[ \mathbb{E}\left[ M_t^{g,n} \right] \le \mathbb{E}\left[ M_{t-1}^{g,n} \right] \le \cdots \le \mathbb{E}\left[ M_0^{g,n} \right] = \mathbb{E}[1] = 1. \]
Now, let τ be a stopping time with respect to the filtration {F_t}_{t=0}^∞. By the convergence theorem for non-negative super-martingales (Durrett, 2005), M_∞^{g,n} = lim_{t→∞} M_t^{g,n} exists almost surely, and thus M_τ^{g,n} is well-defined. Now let Q_t^{g,n} = M_{min{τ,t}}^{g,n}, t ≥ 0, be a stopped version of {M_t^{g,n}}_t. By Fatou's lemma (Durrett, 2005),
\[ \mathbb{E}\left[ M_\tau^{g,n} \right] = \mathbb{E}\left[ \lim_{t\to\infty} Q_t^{g,n} \right] = \mathbb{E}\left[ \liminf_{t\to\infty} Q_t^{g,n} \right] \le \liminf_{t\to\infty} \mathbb{E}\left[ Q_t^{g,n} \right] = \liminf_{t\to\infty} \mathbb{E}\left[ M_{\min\{\tau,t\}}^{g,n} \right] \le 1, \tag{8} \]
since the stopped super-martingale (M_{min{τ,t}}^{g,n})_t is also a super-martingale (Durrett, 2005).
Now, let F_∞ be the σ-algebra generated by {F_t}_{t=0}^∞, and let N ≡ (N_t)_{t=1}^∞ be a sequence of independent and identically distributed Gaussian random variables with mean 0 and variance η, independent of F_∞. Let h : D → ℝ be a random function distributed according to the Gaussian process measure GP_D(0, k), and independent of both F_∞ and (N_t)_{t=1}^∞.
For each t ≥ 0, define M_t = E[M_t^{h,N} | F_∞]. In words, (M_t)_t is a mixture of super-martingales of the form M_t^{g,n}, and it is not hard to see that (M_t)_t is also a (non-negative) super-martingale w.r.t. the filtration {F_t}_t, hence M_∞ = lim_{t→∞} M_t is well-defined almost surely. We can write
\[ \mathbb{E}[M_t] = \mathbb{E}\left[ M_t^{h,N} \right] = \mathbb{E}\left[ \mathbb{E}\left[ M_t^{h,N} \,\middle|\, h, N \right] \right] \le \mathbb{E}[1] = 1 \quad \forall t. \]

An argument similar to (8) also shows that E[M_τ] ≤ 1 for any stopping time τ. Now, without loss of generality, we assume R = 1 (this can always be achieved through appropriate scaling), and compute
\[ M_t = \mathbb{E}\left[ \exp\left( \varepsilon_{1:t}^T h_{1:t,N} - \tfrac{1}{2}\|h_{1:t,N}\|^2 \right) \,\middle|\, \mathcal{F}_\infty \right] \]
\[ = \int_{\mathbb{R}^D} \int_{\mathbb{R}^t} \exp\left( \varepsilon_{1:t}^T \big([h(x_1) \ldots h(x_t)]^T + z\big) - \tfrac{1}{2}\left\| [h(x_1) \ldots h(x_t)]^T + z \right\|^2 \right) d\mu_2(z)\, d\mu_1(h) \]
\[ = \int_{\mathbb{R}^t} \exp\left( \varepsilon_{1:t}^T \lambda - \tfrac{1}{2}\|\lambda\|^2 \right) f(\lambda)\, d\lambda, \]
where µ_1 is the Gaussian process measure GP_D(0, k) over the function space ℝ^D ≡ {g : D → ℝ}, µ_2 is the multivariate Gaussian distribution on ℝ^t with mean 0 and covariance ηI where I is the identity, dλ is the standard Lebesgue measure on ℝ^t, and f is the density of the random vector [h(x_1) . . . h(x_t)]^T + z, which is distributed as the multivariate Gaussian N(0, K_t + ηI) given the sampled points x_1, . . . , x_t up to round t, where K_t is the induced kernel matrix at time t given by K_t(i, j) = k(x_i, x_j), 1 ≤ i, j ≤ t. (Note: K_t is not positive definite and invertible when there are repetitions among (x_1, . . . , x_t), but K_t + ηI is.)
Thus, we have
\[ M_t = \frac{1}{\sqrt{(2\pi)^t \det(K_t + \eta I)}} \int_{\mathbb{R}^t} \exp\left( \varepsilon_{1:t}^T \lambda - \frac{\|\lambda\|^2}{2} - \frac{\|\lambda\|^2_{(K_t+\eta I)^{-1}}}{2} \right) d\lambda \]
\[ = \frac{\exp\left( \frac{\|\varepsilon_{1:t}\|^2}{2} \right)}{\sqrt{(2\pi)^t \det(K_t + \eta I)}} \int_{\mathbb{R}^t} \exp\left( -\frac{\|\lambda - \varepsilon_{1:t}\|^2}{2} - \frac{\|\lambda\|^2_{(K_t+\eta I)^{-1}}}{2} \right) d\lambda. \]
Now, for positive-definite matrices P and Q,
\[ \|x - a\|^2_P + \|x\|^2_Q = \left\| x - (P+Q)^{-1} P a \right\|^2_{P+Q} + \|a\|^2_P - \|Pa\|^2_{(P+Q)^{-1}}. \]
Therefore,
\[ \|\lambda - \varepsilon_{1:t}\|^2_I + \|\lambda\|^2_{(K_t+\eta I)^{-1}} = \left\| \lambda - (I + (K_t+\eta I)^{-1})^{-1} I \varepsilon_{1:t} \right\|^2_{I + (K_t+\eta I)^{-1}} + \|\varepsilon_{1:t}\|^2_I - \|I\varepsilon_{1:t}\|^2_{(I + (K_t+\eta I)^{-1})^{-1}}, \]
which yields
\[ M_t = \frac{1}{\sqrt{(2\pi)^t \det(K_t + \eta I)}} \exp\left( \frac{1}{2} \|\varepsilon_{1:t}\|^2_{(I+(K_t+\eta I)^{-1})^{-1}} \right) \int_{\mathbb{R}^t} \exp\left( -\frac{1}{2} \left\| \lambda - (I + (K_t+\eta I)^{-1})^{-1} \varepsilon_{1:t} \right\|^2_{I+(K_t+\eta I)^{-1}} \right) d\lambda \]
\[ = \frac{1}{\sqrt{\det(K_t + \eta I)\det((K_t+\eta I)^{-1} + I)}} \exp\left( \frac{1}{2} \|\varepsilon_{1:t}\|^2_{(I+(K_t+\eta I)^{-1})^{-1}} \right) \]
\[ = \frac{1}{\sqrt{\det(I + K_t + \eta I)}} \exp\left( \frac{1}{2} \|\varepsilon_{1:t}\|^2_{(I+(K_t+\eta I)^{-1})^{-1}} \right), \]

since for any positive definite matrix A ∈ ℝ^{t×t},
\[ \int_{\mathbb{R}^t} \exp\left( -\frac{1}{2}(x-a)^T A (x-a) \right) dx = \int_{\mathbb{R}^t} \exp\left( -\frac{1}{2}\|x-a\|^2_A \right) dx = \sqrt{(2\pi)^t / \det(A)}. \]
Now as E[M_τ] ≤ 1, using Markov's inequality gives, for any δ ∈ (0, 1),
\[ \mathbb{P}\left[ \|\varepsilon_{1:\tau}\|^2_{((K_\tau+\eta I)^{-1}+I)^{-1}} > 2 \ln \frac{\sqrt{\det((1+\eta)I + K_\tau)}}{\delta} \right] = \mathbb{P}\left[ M_\tau > 1/\delta \right] \le \delta\, \mathbb{E}[M_\tau] \le \delta. \tag{9} \]

To complete the proof, we now employ a stopping time construction as in Abbasi-Yadkori et al. (2011). For each t ≥ 0, define the 'bad' event
\[ B_t(\delta) = \left\{ \omega \in \Omega : \|\varepsilon_{1:t}\|^2_{((K_t+\eta I)^{-1}+I)^{-1}} > 2 \ln \frac{\sqrt{\det((1+\eta)I + K_t)}}{\delta} \right\}, \]
so that
\[ \mathbb{P}\left[ \bigcup_{t \ge 0} B_t(\delta) \right] = \mathbb{P}\left[ \exists t \ge 0 : \|\varepsilon_{1:t}\|^2_{((K_t+\eta I)^{-1}+I)^{-1}} > 2 \ln \frac{\sqrt{\det((1+\eta)I + K_t)}}{\delta} \right], \]
which is the probability required to be bounded by δ in the statement of the theorem.
Let τ' be the first time when the bad event B_t(δ) happens, i.e., τ'(ω) := min{t ≥ 0 : ω ∈ B_t(δ)}, with min{∅} := ∞ by convention. Clearly, τ' is a stopping time, and
\[ \bigcup_{t \ge 0} B_t(\delta) = \{ \omega \in \Omega : \tau'(\omega) < \infty \}. \]
Therefore, we can write
\[ \mathbb{P}\left[ \bigcup_{t \ge 0} B_t(\delta) \right] = \mathbb{P}\left[ \tau' < \infty \right] = \mathbb{P}\left[ \|\varepsilon_{1:\tau'}\|^2_{((K_{\tau'}+\eta I)^{-1}+I)^{-1}} > 2 \ln \frac{\sqrt{\det((1+\eta)I + K_{\tau'})}}{\delta}, \ \tau' < \infty \right] \]
\[ \le \mathbb{P}\left[ \|\varepsilon_{1:\tau'}\|^2_{((K_{\tau'}+\eta I)^{-1}+I)^{-1}} > 2 \ln \frac{\sqrt{\det((1+\eta)I + K_{\tau'})}}{\delta} \right] \le \delta, \]
by the inequality (9).


When Kt is positive definite (and hence invertible) for each t ≥ 1, one can use a similar construction
as in Part 1, with η = 0 (i.e., N is the all-zeros sequence with probability 1), to recover the corresponding
conclusion (6) with η = 0.

Proof of Lemma 1
Define, for each time t, the t × ∞ matrix Φ_t := [ϕ(x_1) · · · ϕ(x_t)]^T, and observe that V_t = I + Φ_t^TΦ_t and K_t = Φ_tΦ_t^T. With this, we can compute
\[ \|S_t\|^2_{V_t^{-1}} = S_t^T V_t^{-1} S_t = \left( \sum_{s=1}^t \varepsilon_s \varphi(x_s) \right)^{\!T} \left( I + \Phi_t^T \Phi_t \right)^{-1} \left( \sum_{s=1}^t \varepsilon_s \varphi(x_s) \right) = \varepsilon_{1:t}^T \Phi_t \left( I + \Phi_t^T \Phi_t \right)^{-1} \Phi_t^T \varepsilon_{1:t} \]
\[ = \varepsilon_{1:t}^T \Phi_t \Phi_t^T \left( \Phi_t \Phi_t^T + I \right)^{-1} \varepsilon_{1:t} = \varepsilon_{1:t}^T K_t (K_t + I)^{-1} \varepsilon_{1:t} = \varepsilon_{1:t}^T \left( K_t^{-1} + I \right)^{-1} \varepsilon_{1:t} = \|\varepsilon_{1:t}\|^2_{(K_t^{-1}+I)^{-1}}, \]
completing the proof.

B. Information Theoretic Results


Lemma 3 For every t ≥ 0, the maximum information gain γ_t, for the points chosen by Algorithms 1 and 2, satisfies, almost surely,
\[ \gamma_t \ge \frac{1}{2} \ln\left( \det(I + \lambda^{-1} K_t) \right), \]
\[ \gamma_t \ge \frac{1}{2} \sum_{s=1}^t \ln\left( 1 + \lambda^{-1} \sigma_{s-1}^2(x_s) \right). \]

Proof At round t, after observing the reward vector y_{1:t} at points A_t = {x_1, . . . , x_t} ⊂ D, the information gain (by the algorithm) about the unknown reward function f is given by the mutual information between f_{1:t} and y_{1:t} sampled at points A_t:
\[ I(y_{1:t}; f_{1:t}) = H(y_{1:t}) - H(y_{1:t} \mid f_{1:t}), \]
where y_{1:t} = f_{1:t} + ε_{1:t} = [y_1, . . . , y_t]^T, f_{1:t} = [f(x_1), . . . , f(x_t)]^T and ε_{1:t} = [ε_1, . . . , ε_t]^T. Clearly, given f_{1:t}, the randomness (as perceived by the algorithm) in y_{1:t} is only in the noise vector ε_{1:t}, and thus
\[ H(y_{1:t} \mid f_{1:t}) = \frac{1}{2} \ln\left( \det(2\pi e \lambda v^2 I) \right) = \frac{t}{2} \ln(2\pi e \lambda v^2), \]
as ε_{1:t} is assumed to follow the distribution N(0, λv²I) and H(N(µ, Σ)) = ½ ln(det(2πeΣ)). Now y_{1:t} sampled at points A_t is believed to be distributed as N(0, v²(K_t + λI)), which gives H(y_{1:t}) = ½ ln(det(2πev²(λI + K_t))) = (t/2) ln(2πeλv²) + ½ ln(det(I + λ⁻¹K_t)), and therefore
\[ I(y_{1:t}; f_{1:t}) = \frac{1}{2} \ln\left( \det(I + \lambda^{-1} K_t) \right). \tag{10} \]
Again, conditioned on the reward vector y_{1:s−1} observed at points A_{s−1}, the reward y_s at round s observed at x_s is believed to follow the distribution N(µ_{s−1}(x_s), v²(λ + σ²_{s−1}(x_s))), which gives H(y_s | y_{1:s−1}) = ½ ln(2πev²(λ + σ²_{s−1}(x_s))) = ½ ln(2πeλv²) + ½ ln(1 + λ⁻¹σ²_{s−1}(x_s)). Now by the chain rule H(y_{1:t}) = Σ_{s=1}^t H(y_s | y_{1:s−1}) = (t/2) ln(2πeλv²) + ½ Σ_{s=1}^t ln(1 + λ⁻¹σ²_{s−1}(x_s)), and therefore
\[ I(y_{1:t}; f_{1:t}) = \frac{1}{2} \sum_{s=1}^t \ln\left( 1 + \lambda^{-1} \sigma_{s-1}^2(x_s) \right). \tag{11} \]
Now I(y_{1:t}; f_{1:t}) is a function of A_t ⊂ D, the random points chosen by the algorithm, and thus
\[ I(y_{1:t}; f_{1:t}) \le \max_{A \subset D : |A| = t} I(y_A; f_A) = \gamma_t, \quad \text{a.s.} \]
Now the proof follows from Equations 10 and 11.

Lemma 4 Let x_1, . . . , x_T be the points selected by the algorithms. The sum of predictive standard deviations at those points can be expressed in terms of the maximum information gain. More precisely,
\[ \sum_{t=1}^T \sigma_{t-1}(x_t) \le \sqrt{4(T+2)\gamma_T}. \]
Proof First note that, by the Cauchy-Schwarz inequality, Σ_{t=1}^T σ_{t−1}(x_t) ≤ √(T Σ_{t=1}^T σ²_{t−1}(x_t)). Now since 0 ≤ σ²_{t−1}(x) ≤ 1 for all x ∈ D, and by our choice of λ = 1 + η, η ≥ 0, we have λ⁻¹σ²_{t−1}(x_t) ≤ 2 ln(1 + λ⁻¹σ²_{t−1}(x_t)), where in the last inequality we used the fact that for any 0 ≤ α ≤ 1, ln(1 + α) ≥ α/2. Thus we get σ²_{t−1}(x_t) ≤ 2λ ln(1 + λ⁻¹σ²_{t−1}(x_t)). This implies
\[ \sum_{t=1}^T \sigma_{t-1}(x_t) \le \sqrt{2T \sum_{t=1}^T \lambda \ln\left( 1 + \lambda^{-1}\sigma_{t-1}^2(x_t) \right)} \le \sqrt{4T\lambda \cdot \frac{1}{2}\sum_{t=1}^T \ln\left( 1 + \lambda^{-1}\sigma_{t-1}^2(x_t) \right)} \le \sqrt{4T(1+\eta)\gamma_T}, \]
where the last inequality follows from Lemma 3. Now the result follows by choosing η = 2/T.

C. Proof of Theorem 2
First define ϕ(x) as k(x, ·), where ϕ : ℝ^d → H maps any point x in the primal space ℝ^d to the RKHS H associated with the kernel function k. For any two functions g, h ∈ H, define the inner product ⟨g, h⟩_k as g^T h and the RKHS norm ‖g‖_k as √(g^T g). Now as the unknown reward function f lies in the RKHS H_k(D), these definitions, along with the reproducing property of the RKHS, imply f(x) = ⟨f, k(x, ·)⟩_k = ⟨f, ϕ(x)⟩_k = f^T ϕ(x) and k(x, x') = ⟨k(x, ·), k(x', ·)⟩_k = ⟨ϕ(x), ϕ(x')⟩_k = ϕ(x)^T ϕ(x') for all x, x' ∈ D. Now defining Φ_t := [ϕ(x_1)^T, . . . , ϕ(x_t)^T]^T, we get the kernel matrix K_t = Φ_tΦ_t^T, k_t(x) = Φ_tϕ(x) for all x ∈ D, and f_{1:t} = Φ_t f. Since the matrices (Φ_t^TΦ_t + λI) and (Φ_tΦ_t^T + λI) are strictly positive definite and (Φ_t^TΦ_t + λI)Φ_t^T = Φ_t^T(Φ_tΦ_t^T + λI), we have
\[ \Phi_t^T (\Phi_t \Phi_t^T + \lambda I)^{-1} = (\Phi_t^T \Phi_t + \lambda I)^{-1} \Phi_t^T. \tag{12} \]
Also from the definitions above, (Φ_t^TΦ_t + λI)ϕ(x) = Φ_t^T k_t(x) + λϕ(x), and thus from (12) we deduce that
\[ \varphi(x) = \Phi_t^T (\Phi_t \Phi_t^T + \lambda I)^{-1} k_t(x) + \lambda (\Phi_t^T \Phi_t + \lambda I)^{-1} \varphi(x), \]
which gives
\[ \varphi(x)^T \varphi(x) = k_t(x)^T (\Phi_t \Phi_t^T + \lambda I)^{-1} k_t(x) + \lambda \varphi(x)^T (\Phi_t^T \Phi_t + \lambda I)^{-1} \varphi(x). \]
This implies
\[ \lambda \varphi(x)^T (\Phi_t^T \Phi_t + \lambda I)^{-1} \varphi(x) = k(x, x) - k_t(x)^T (K_t + \lambda I)^{-1} k_t(x) = \sigma_t^2(x). \tag{13} \]

Now observe that

$$\begin{aligned}
f(x) - k_t(x)^T(K_t + \lambda I)^{-1}f_{1:t} &= \varphi(x)^T f - \varphi(x)^T\Phi_t^T(\Phi_t\Phi_t^T + \lambda I)^{-1}\Phi_t f\\
&= \varphi(x)^T f - \varphi(x)^T(\Phi_t^T\Phi_t + \lambda I)^{-1}\Phi_t^T\Phi_t f\\
&= \lambda\varphi(x)^T(\Phi_t^T\Phi_t + \lambda I)^{-1} f\\
&\le \big\|\lambda(\Phi_t^T\Phi_t + \lambda I)^{-1}\varphi(x)\big\|_k\,\|f\|_k\\
&= \|f\|_k\sqrt{\lambda\varphi(x)^T(\Phi_t^T\Phi_t + \lambda I)^{-1}\lambda I(\Phi_t^T\Phi_t + \lambda I)^{-1}\varphi(x)}\\
&\le B\sqrt{\lambda\varphi(x)^T(\Phi_t^T\Phi_t + \lambda I)^{-1}(\Phi_t^T\Phi_t + \lambda I)(\Phi_t^T\Phi_t + \lambda I)^{-1}\varphi(x)}\\
&= B\,\sigma_t(x),
\end{aligned}$$
where the second equality uses Equation 12, the first inequality is by Cauchy-Schwarz and the final equality is from Equation 13. Again see that

$$\begin{aligned}
k_t(x)^T(K_t + \lambda I)^{-1}\varepsilon_{1:t} &= \varphi(x)^T\Phi_t^T(\Phi_t\Phi_t^T + \lambda I)^{-1}\varepsilon_{1:t}\\
&= \varphi(x)^T(\Phi_t^T\Phi_t + \lambda I)^{-1}\Phi_t^T\varepsilon_{1:t}\\
&\le \big\|(\Phi_t^T\Phi_t + \lambda I)^{-1/2}\varphi(x)\big\|_k\,\big\|(\Phi_t^T\Phi_t + \lambda I)^{-1/2}\Phi_t^T\varepsilon_{1:t}\big\|_k\\
&= \sqrt{\varphi(x)^T(\Phi_t^T\Phi_t + \lambda I)^{-1}\varphi(x)}\,\sqrt{(\Phi_t^T\varepsilon_{1:t})^T(\Phi_t^T\Phi_t + \lambda I)^{-1}\Phi_t^T\varepsilon_{1:t}}\\
&= \lambda^{-1/2}\sigma_t(x)\sqrt{\varepsilon_{1:t}^T\Phi_t\Phi_t^T(\Phi_t\Phi_t^T + \lambda I)^{-1}\varepsilon_{1:t}}\\
&= \lambda^{-1/2}\sigma_t(x)\sqrt{\varepsilon_{1:t}^T K_t(K_t + \lambda I)^{-1}\varepsilon_{1:t}},
\end{aligned}$$
where the second equality is from Equation 12, the first inequality is by Cauchy-Schwarz and the fourth equality uses both Equations 12 and 13. Now recall that, at round $t$, the posterior mean function $\mu_t(x) = k_t(x)^T(K_t + \lambda I)^{-1}y_{1:t} = k_t(x)^T(K_t + \lambda I)^{-1}(f_{1:t} + \varepsilon_{1:t})$, where $f_{1:t} = \big[f(x_1), \ldots, f(x_t)\big]^T$ and $\varepsilon_{1:t} = \big[\varepsilon_1, \ldots, \varepsilon_t\big]^T$. Thus we have
$$\begin{aligned}
|\mu_t(x) - f(x)| &\le \big|k_t(x)^T(K_t + \lambda I)^{-1}\varepsilon_{1:t}\big| + \big|f(x) - k_t(x)^T(K_t + \lambda I)^{-1}f_{1:t}\big|\\
&\le \sigma_t(x)\Big(B + (1+\eta)^{-1/2}\sqrt{\varepsilon_{1:t}^T K_t\big(K_t + (1+\eta)I\big)^{-1}\varepsilon_{1:t}}\Big),
\end{aligned}$$

where we have used λ = 1 + η, where η ≥ 0 as stated in Theorem 1. Now observe that when K is invertible,
K(K + I)−1 = ((K + I)K −1 )−1 = (I + K −1 )−1 . Using K = Kt + ηI, we get

(Kt + ηI)(Kt + (1 + η)I)−1 = ((Kt + ηI)−1 + I)−1 .

Now see that

εT1:t Kt (Kt + (1 + η)I)−1 ε1:t ≤ εT1:t (Kt + ηI)(Kt + (1 + η)I)−1 ε1:t = εT1:t ((Kt + ηI)−1 + I)−1 ε1:t

Now using Theorem 1, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, $\forall t \ge 0$, $\forall x \in D$, we obtain
$$|\mu_t(x) - f(x)| \le \sigma_t(x)\Big(B + \|\varepsilon_{1:t}\|_{((K_t+\eta I)^{-1}+I)^{-1}}\Big) \le \sigma_t(x)\left(B + R\sqrt{2\ln\Big(\frac{\sqrt{\det\big((1+\eta)I + K_t\big)}}{\delta}\Big)}\right).$$
Now observe that $\det\big((1+\eta)I + K_t\big) = \det\big(I + (1+\eta)^{-1}K_t\big)\det\big((1+\eta)I\big)$. Thus we have
$$\ln\big(\det((1+\eta)I + K_t)\big) = \ln\big(\det(I + (1+\eta)^{-1}K_t)\big) + t\ln(1+\eta) \le 2\gamma_t + \eta t,$$
from Lemma 3. Now choosing $\eta = 2/T$ we have $|\mu_t(x) - f(x)| \le \sigma_t(x)\Big(B + R\sqrt{2\big(\gamma_t + 1 + \ln(1/\delta)\big)}\Big)$, and hence the result follows.
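
As an illustration of how the resulting confidence band can be formed in practice, the sketch below computes $\mu_t$, $\sigma_t$ and the width $B + R\sqrt{2(\gamma_t + 1 + \ln(1/\delta))}$ on a grid. It is a minimal sketch, not the paper's implementation: the squared exponential kernel, the values of $B$, $R$ and $\delta$, and the use of the realized information gain $\frac{1}{2}\ln\det(I + K_t)$ as a computable stand-in for $\gamma_t$ are all assumptions made here for illustration.

```python
# Sketch of the Theorem 2 confidence band mu_t(x) +/- beta * sigma_t(x);
# the kernel, B, R, delta and the gamma_t proxy below are illustrative.
import numpy as np

def se_kernel(X, Y, lengthscale=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

def posterior(X, y, Xq, lam=1.0):
    # mu_t(x) = k_t(x)^T (K_t + lam I)^{-1} y_{1:t}
    # sigma_t^2(x) = k(x, x) - k_t(x)^T (K_t + lam I)^{-1} k_t(x)
    K = se_kernel(X, X) + lam * np.eye(len(X))
    Kq = se_kernel(X, Xq)
    mu = Kq.T @ np.linalg.solve(K, y)
    var = np.diag(se_kernel(Xq, Xq)) - np.sum(Kq * np.linalg.solve(K, Kq), axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

rng = np.random.default_rng(1)
B, R, delta = 2.0, 0.1, 0.1                 # assumed norm bound, noise scale, confidence level
f = lambda x: np.sin(3 * x).ravel()         # stand-in reward function
X = rng.uniform(0, 1, (15, 1))              # points queried so far
y = f(X) + R * rng.standard_normal(len(X))

Xq = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = posterior(X, y, Xq)
gamma_t = 0.5 * np.linalg.slogdet(np.eye(len(X)) + se_kernel(X, X))[1]  # proxy for gamma_t
beta = B + R * np.sqrt(2 * (gamma_t + 1 + np.log(1 / delta)))
lower, upper = mu - beta * sigma, mu + beta * sigma  # f lies in this band with high probability
```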

D. Analysis of IGP-UCB (Theorem 3)


Observe that at each round t ≥ 1, by the choice of xt in Algorithm 1, we have µt−1 (xt ) + βt σt−1 (xt ) ≥
µt−1 (x? ) + βt σt−1 (x? ) and from Lemma 2, we have f (x? ) ≤ µt−1 (x? ) + βt σt−1 (x? ) and µt−1 (xt ) −
f (xt ) ≤ βt σt−1 (xt ). Therefore for all t ≥ 1 with probability at least 1 − δ,

rt = f (x? ) − f (xt )
≤ βt σt−1 (xt ) + µt−1 (xt ) − f (xt )
≤ 2βt σt−1 (xt ),
and hence $\sum_{t=1}^{T} r_t \le 2\beta_T\sum_{t=1}^{T}\sigma_{t-1}(x_t)$. Now from Lemma 4, $\sum_{t=1}^{T}\sigma_{t-1}(x_t) = O(\sqrt{T\gamma_T})$, and by definition $\beta_T \le B + R\sqrt{2(\gamma_T + 1 + \ln(1/\delta))}$. Hence with probability at least $1 - \delta$,
$$R_T = \sum_{t=1}^{T} r_t = O\Big(B\sqrt{T\gamma_T} + \sqrt{T\gamma_T\big(\gamma_T + \ln(1/\delta)\big)}\Big),$$
and thus with high probability,
$$R_T = O\Big(\sqrt{T}\big(B\sqrt{\gamma_T} + \gamma_T\big)\Big).$$
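
The analysis above only uses the selection rule $x_t = \operatorname{argmax}_x\,\mu_{t-1}(x) + \beta_t\sigma_{t-1}(x)$ together with the bound $r_t \le 2\beta_t\sigma_{t-1}(x_t)$. A minimal sketch of one such round over a finite candidate grid is given below; the toy posterior values and the constants $B$, $R$, $\gamma_{t-1}$, $\delta$ are placeholders for illustration, not quantities taken from the paper.

```python
# One round of the IGP-UCB selection rule over a finite candidate grid;
# the toy posterior and the constants are illustrative placeholders.
import numpy as np

def igp_ucb_round(mu_prev, sigma_prev, B, R, gamma_prev, delta):
    # beta_t = B + R * sqrt(2 (gamma_{t-1} + 1 + ln(1/delta)))
    beta_t = B + R * np.sqrt(2 * (gamma_prev + 1 + np.log(1 / delta)))
    idx = int(np.argmax(mu_prev + beta_t * sigma_prev))
    return idx, beta_t

mu_prev = np.array([0.10, 0.40, 0.30, 0.00, 0.20])     # mu_{t-1} on the grid
sigma_prev = np.array([0.50, 0.20, 0.40, 0.90, 0.10])  # sigma_{t-1} on the grid
idx, beta_t = igp_ucb_round(mu_prev, sigma_prev, B=2.0, R=0.1, gamma_prev=1.5, delta=0.1)

# On the confidence event of Lemma 2, the instantaneous regret of the chosen
# point obeys r_t <= 2 * beta_t * sigma_prev[idx], the quantity summed above.
print(idx, 2 * beta_t * sigma_prev[idx])
```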

E. Analysis of GP-TS (Theorem 4)


Lemma 5 For any $\delta \in (0, 1)$ and any finite subset $D_0$ of $D$,
$$P\Big[\forall x \in D_0,\ |f_t(x) - \mu_{t-1}(x)| \le v_t\sqrt{2\ln(|D_0|\,t^2)}\,\sigma_{t-1}(x) \,\Big|\, H_{t-1}\Big] \ge 1 - 1/t^2,$$
for all possible realizations of the history $H_{t-1}$.

Proof Fix $x \in D$ and $t \ge 1$. Given history $H_{t-1}$, $f_t(x) \sim N\big(\mu_{t-1}(x), v_t^2\sigma_{t-1}^2(x)\big)$. Thus using Lemma B4 of Hoffman et al. (2013), for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$|f_t(x) - \mu_{t-1}(x)| \le \sqrt{2\ln(1/\delta)}\,v_t\,\sigma_{t-1}(x),$$
and now applying a union bound,
$$|f_t(x) - \mu_{t-1}(x)| \le v_t\sqrt{2\ln(|D_0|/\delta)}\,\sigma_{t-1}(x) \quad \forall x \in D_0$$
holds with probability at least $1 - \delta$, given any possible realization of the history $H_{t-1}$. Now setting $\delta = 1/t^2$, the result follows.
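
Lemma 5 controls the deviation of the sampled function $f_t$ from $\mu_{t-1}$ over a finite discretization. For reference, the sketch below shows the GP-TS sampling step being analysed: a joint draw from the posterior with mean $\mu_{t-1}$ and covariance $v_t^2 k_{t-1}$ over $D_t$, followed by playing its maximizer. The kernel, the data, the ingredients of $v_t$ and the jitter added for numerical stability are illustrative assumptions.

```python
# Sketch of one GP-TS round: draw f_t ~ N(mu_{t-1}, v_t^2 k_{t-1}) on the
# discretization D_t and play its argmax; all numerical choices are illustrative.
import numpy as np

def se_kernel(X, Y, lengthscale=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * lengthscale ** 2))

rng = np.random.default_rng(2)
lam = 1.0
X = rng.uniform(0, 1, (10, 1))                      # points played so far
y = np.sin(3 * X).ravel() + 0.1 * rng.standard_normal(len(X))
D_t = np.linspace(0, 1, 100).reshape(-1, 1)         # discretization at round t

K = se_kernel(X, X) + lam * np.eye(len(X))
Kq = se_kernel(X, D_t)
mu = Kq.T @ np.linalg.solve(K, y)                            # mu_{t-1} on D_t
cov = se_kernel(D_t, D_t) - Kq.T @ np.linalg.solve(K, Kq)    # k_{t-1} on D_t

B, R, delta, gamma_prev = 2.0, 0.1, 0.1, 1.5        # assumed constants
v_t = B + R * np.sqrt(2 * (gamma_prev + 1 + np.log(2 / delta)))

f_t = rng.multivariate_normal(mu, v_t ** 2 * (cov + 1e-10 * np.eye(len(D_t))))
x_t = D_t[np.argmax(f_t)]                           # the point played by GP-TS
```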

Definition 1 For all $t \ge 1$, define $\tilde{c}_t = \sqrt{4\ln t + 2d\ln(BLrdt^2)}$ and $c_t = v_t(1 + \tilde{c}_t)$, where $v_t = B + R\sqrt{2(\gamma_{t-1} + 1 + \ln(2/\delta))}$. Clearly, $c_t$ increases with $t$.

Definition 2 Define E f (t) as the event that for all x ∈ D,

|µt−1 (x) − f (x)| ≤ vt σt−1 (x),

and E ft (t) as the event that for all x ∈ Dt ,

|ft (x) − µt−1 (x)| ≤ vt c̃t σt−1 (x).

Definition 3 Define the set of saturated points St in discretization Dt at round t as

St := {x ∈ Dt : ∆t (x) > ct σt−1 (x)},

where ∆t (x) := f ([x? ]t ) − f (x), the difference between function values at the closest point to x? in Dt and
at x. Clearly ∆t ([x? ]t ) = 0 for all t, and hence [x? ]t ∈ Dt is unsaturated at every t.
Definition 4 Define the filtration $F'_{t-1}$ as the history available at the start of round $t$, i.e., $F'_{t-1} = H_{t-1}$. By definition, $F'_1 \subseteq F'_2 \subseteq \cdots$. Observe that given $F'_{t-1}$, the set $S_t$ and the event $E^f(t)$ are completely deterministic.
Lemma 6 Given any $\delta \in (0, 1)$, $P\big[\forall t \ge 1,\ E^f(t)\big] \ge 1 - \delta/2$, and for all possible filtrations $F'_{t-1}$, $P\big[E^{f_t}(t) \,\big|\, F'_{t-1}\big] \ge 1 - 1/t^2$.
Proof The probability bound for the event $E^f(t)$ follows from Theorem 2 by replacing $\delta$ with $\delta/2$, and the bound for the event $E^{f_t}(t)$ follows from Lemma 5 by setting $D_0 = D_t$ and $H_{t-1} = F'_{t-1}$.

Lemma 7 (Gaussian Anti-concentration) For a Gaussian random variable $X$ with mean $\mu$ and standard deviation $\sigma$, for any $\beta > 0$,
$$P\left[\frac{X - \mu}{\sigma} > \beta\right] \ge \frac{e^{-\beta^2}}{4\sqrt{\pi}\beta}.$$

Lemma 8 For any filtration $F'_{t-1}$ such that $E^f(t)$ is true,
$$P\big[f_t(x) > f(x) \,\big|\, F'_{t-1}\big] \ge p,$$
for any $x \in D$, where $p = \frac{1}{4e\sqrt{\pi}}$.

Proof Fix any $x \in D$. Given the filtration $F'_{t-1}$, $f_t(x)$ is a Gaussian random variable with mean $\mu_{t-1}(x)$ and standard deviation $v_t\sigma_{t-1}(x)$, and since the event $E^f(t)$ is true, $|\mu_{t-1}(x) - f(x)| \le v_t\sigma_{t-1}(x)$. Now using the anti-concentration inequality in Lemma 7, we have
$$\begin{aligned}
P\big[f_t(x) > f(x) \,\big|\, F'_{t-1}\big] &= P\left[\frac{f_t(x) - \mu_{t-1}(x)}{v_t\sigma_{t-1}(x)} > \frac{f(x) - \mu_{t-1}(x)}{v_t\sigma_{t-1}(x)} \,\Big|\, F'_{t-1}\right]\\
&\ge P\left[\frac{f_t(x) - \mu_{t-1}(x)}{v_t\sigma_{t-1}(x)} > \frac{|f(x) - \mu_{t-1}(x)|}{v_t\sigma_{t-1}(x)} \,\Big|\, F'_{t-1}\right]\\
&\ge \frac{e^{-\theta_t^2}}{4\sqrt{\pi}\,\theta_t},
\end{aligned}$$
where, from Definition 2, $\theta_t = \frac{|f(x) - \mu_{t-1}(x)|}{v_t\sigma_{t-1}(x)} \le 1$. Therefore $P\big[f_t(x) > f(x) \,\big|\, F'_{t-1}\big] \ge \frac{1}{4e\sqrt{\pi}}$, and hence the result follows.

Lemma 9 For any filtration $F'_{t-1}$ such that $E^f(t)$ is true,
$$P\big[x_t \in D_t \setminus S_t \,\big|\, F'_{t-1}\big] \ge p - 1/t^2.$$

Proof At round $t$, our algorithm chooses the point $x_t \in D_t$ at which the highest value of $f_t$ within the current decision set $D_t$ is attained. Now if $f_t([x^\star]_t)$ is greater than $f_t(x)$ for all saturated points at round $t$, i.e., $f_t([x^\star]_t) > f_t(x)$, $\forall x \in S_t$, then one of the unsaturated points (which include $[x^\star]_t$) in $D_t$ must be played, and hence $x_t \in D_t \setminus S_t$. This implies
$$P\big[x_t \in D_t \setminus S_t \,\big|\, F'_{t-1}\big] \ge P\big[f_t([x^\star]_t) > f_t(x),\ \forall x \in S_t \,\big|\, F'_{t-1}\big]. \qquad (14)$$

Now from Definition 3, $\Delta_t(x) > c_t\sigma_{t-1}(x)$ for all $x \in S_t$. Also, if both the events $E^f(t)$ and $E^{f_t}(t)$ are true, then from Definitions 1 and 2, $f_t(x) \le f(x) + c_t\sigma_{t-1}(x)$ for all $x \in D_t$. Thus for all $x \in S_t$, $f_t(x) < f(x) + \Delta_t(x)$. Therefore, for any filtration $F'_{t-1}$ such that $E^f(t)$ is true, either $E^{f_t}(t)$ is false, or else for all $x \in S_t$, $f_t(x) < f([x^\star]_t)$. Hence, for any $F'_{t-1}$ such that $E^f(t)$ is true,
$$P\big[f_t([x^\star]_t) > f_t(x),\ \forall x \in S_t \,\big|\, F'_{t-1}\big] \ge P\big[f_t([x^\star]_t) > f([x^\star]_t) \,\big|\, F'_{t-1}\big] - P\big[\overline{E^{f_t}(t)} \,\big|\, F'_{t-1}\big] \ge p - 1/t^2,$$
where we have used Lemma 6 and Lemma 8. Now the proof follows from Equation 14.

Lemma 10 For any filtration $F'_{t-1}$ such that $E^f(t)$ is true,
$$E\big[r_t \,\big|\, F'_{t-1}\big] \le \frac{11c_t}{p}E\big[\sigma_{t-1}(x_t) \,\big|\, F'_{t-1}\big] + \frac{2B+1}{t^2},$$
where $r_t$ is the instantaneous regret at round $t$.
Proof Let $\bar{x}_t$ be the unsaturated point in $D_t$ with the smallest $\sigma_{t-1}(x)$, i.e.,
$$\bar{x}_t = \operatorname{argmin}_{x \in D_t \setminus S_t} \sigma_{t-1}(x). \qquad (15)$$
Since $\sigma_{t-1}(\cdot)$ and $S_t$ are deterministic given $F'_{t-1}$, so is $\bar{x}_t$. Now for any $F'_{t-1}$ such that $E^f(t)$ is true,
$$E\big[\sigma_{t-1}(x_t) \,\big|\, F'_{t-1}\big] \ge E\big[\sigma_{t-1}(x_t) \,\big|\, F'_{t-1},\ x_t \in D_t \setminus S_t\big]\,P\big[x_t \in D_t \setminus S_t \,\big|\, F'_{t-1}\big] \ge \sigma_{t-1}(\bar{x}_t)\big(p - 1/t^2\big), \qquad (16)$$

where we have used Equation 15 and Lemma 9. Now, if both the events $E^f(t)$ and $E^{f_t}(t)$ are true, then from Definitions 1 and 2, $f(x) - c_t\sigma_{t-1}(x) \le f_t(x) \le f(x) + c_t\sigma_{t-1}(x)$ for all $x \in D_t$. Using this observation along with Definition 3 and the facts that $f_t(x_t) \ge f_t(x)$ for all $x \in D_t$ and $\bar{x}_t \in D_t \setminus S_t$, we have
$$\begin{aligned}
\Delta_t(x_t) &= f([x^\star]_t) - f(\bar{x}_t) + f(\bar{x}_t) - f(x_t)\\
&\le \Delta_t(\bar{x}_t) + f_t(\bar{x}_t) + c_t\sigma_{t-1}(\bar{x}_t) - f_t(x_t) + c_t\sigma_{t-1}(x_t)\\
&\le c_t\sigma_{t-1}(\bar{x}_t) + c_t\sigma_{t-1}(\bar{x}_t) + c_t\sigma_{t-1}(x_t)\\
&\le c_t\big(2\sigma_{t-1}(\bar{x}_t) + \sigma_{t-1}(x_t)\big).
\end{aligned}$$
Therefore, for any $F'_{t-1}$ such that $E^f(t)$ is true, either $\Delta_t(x_t) \le c_t\big(2\sigma_{t-1}(\bar{x}_t) + \sigma_{t-1}(x_t)\big)$, or $E^{f_t}(t)$ is false. Now from our assumption of bounded variance, for all $x \in D$, $|f(x)| \le \|f\|_k\sqrt{k(x, x)} \le B$, and hence $\Delta_t(x) \le 2\sup_{x \in D}|f(x)| \le 2B$. Thus, using Equation 16, we get
$$\begin{aligned}
E\big[\Delta_t(x_t) \,\big|\, F'_{t-1}\big] &\le E\big[c_t\big(2\sigma_{t-1}(\bar{x}_t) + \sigma_{t-1}(x_t)\big) \,\big|\, F'_{t-1}\big] + 2B\,P\big[\overline{E^{f_t}(t)} \,\big|\, F'_{t-1}\big]\\
&\le \frac{2c_t}{p - 1/t^2}E\big[\sigma_{t-1}(x_t) \,\big|\, F'_{t-1}\big] + c_t E\big[\sigma_{t-1}(x_t) \,\big|\, F'_{t-1}\big] + \frac{2B}{t^2}\\
&\le \frac{11c_t}{p}E\big[\sigma_{t-1}(x_t) \,\big|\, F'_{t-1}\big] + \frac{2B}{t^2}, \qquad (17)
\end{aligned}$$
where in the last inequality we used that $1/(p - 1/t^2) \le 5/p$, which holds trivially for $t \le 4$ and also holds for $t \ge 5$, as $t^2 > 5e\sqrt{\pi}$. Now using Equation 7, the instantaneous regret at round $t$ satisfies
$$r_t = f(x^\star) - f([x^\star]_t) + f([x^\star]_t) - f(x_t) \le \frac{1}{t^2} + \Delta_t(x_t),$$
and then taking conditional expectations on both sides, the result follows from Equation 17.

Definition 5 Let us define $Y_0 = 0$, and for all $t = 1, \ldots, T$:
$$\bar{r}_t = r_t \cdot I\{E^f(t)\}, \qquad X_t = \bar{r}_t - \frac{11c_t}{p}\sigma_{t-1}(x_t) - \frac{2B+1}{t^2}, \qquad Y_t = \sum_{s=1}^{t}X_s.$$

Definition 6 A sequence of random variables $(Z_t;\ t \ge 0)$ is called a super-martingale corresponding to a filtration $F_t$ if, for all $t$, $Z_t$ is $F_t$-measurable, and for $t \ge 1$,
$$E\big[Z_t \,\big|\, F_{t-1}\big] \le Z_{t-1}.$$

Lemma 11 (Azuma-Hoeffding Inequality) If a super-martingale $(Z_t;\ t \ge 0)$, corresponding to a filtration $F_t$, satisfies $|Z_t - Z_{t-1}| \le \alpha_t$ for some constant $\alpha_t$, for all $t = 1, \ldots, T$, then for any $\delta \in (0, 1)$,
$$P\left[Z_T - Z_0 \le \sqrt{2\ln(1/\delta)\sum_{t=1}^{T}\alpha_t^2}\right] \ge 1 - \delta.$$

Lemma 12 $(Y_t;\ t = 0, \ldots, T)$ is a super-martingale process with respect to the filtration $F'_t$.

Proof From Definition 6, we need to prove that for all $t \in \{1, \ldots, T\}$ and any possible $F'_{t-1}$, $E\big[Y_t - Y_{t-1} \,\big|\, F'_{t-1}\big] \le 0$, i.e.,
$$E\big[\bar{r}_t \,\big|\, F'_{t-1}\big] \le \frac{11c_t}{p}E\big[\sigma_{t-1}(x_t) \,\big|\, F'_{t-1}\big] + \frac{2B+1}{t^2}. \qquad (18)$$
Now if $F'_{t-1}$ is such that $E^f(t)$ is false, then $\bar{r}_t = r_t \cdot I\{E^f(t)\} = 0$, and Equation 18 holds trivially. Moreover, for $F'_{t-1}$ such that $E^f(t)$ is true, Equation 18 follows from Lemma 10.

Lemma 13 Given any $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$R_T = \sum_{t=1}^{T}r_t \le \frac{11c_T}{p}\sum_{t=1}^{T}\sigma_{t-1}(x_t) + \frac{(2B+1)\pi^2}{6} + \frac{(4B+11)c_T}{p}\sqrt{2T\ln(2/\delta)},$$
where $T$ is the total number of rounds played.


Proof First note that from Definition 5, for all $t = 1, \ldots, T$,
$$|Y_t - Y_{t-1}| = |X_t| \le |\bar{r}_t| + \frac{11c_t}{p}\sigma_{t-1}(x_t) + \frac{2B+1}{t^2}.$$
Now as $\bar{r}_t \le r_t \le 2\sup_{x \in D}|f(x)| \le 2B$ and $\sigma_{t-1}^2(x_t) \le \sigma_0^2(x_t) \le 1$, we have
$$|Y_t - Y_{t-1}| \le 2B + \frac{11c_t}{p} + \frac{2B+1}{t^2} \le \frac{(4B+11)c_t}{p},$$
which follows from the facts that $2B \le 2Bc_t/p$ and $(2B+1)/t^2 \le 2Bc_t/p$. Thus, we can apply the Azuma-Hoeffding inequality (Lemma 11) to obtain that with probability at least $1 - \delta/2$,
$$\begin{aligned}
\sum_{t=1}^{T}\bar{r}_t &\le \sum_{t=1}^{T}\frac{11c_t}{p}\sigma_{t-1}(x_t) + \sum_{t=1}^{T}\frac{2B+1}{t^2} + \sqrt{2\ln(2/\delta)\sum_{t=1}^{T}\frac{(4B+11)^2c_t^2}{p^2}}\\
&\le \frac{11c_T}{p}\sum_{t=1}^{T}\sigma_{t-1}(x_t) + \frac{(2B+1)\pi^2}{6} + \frac{(4B+11)c_T}{p}\sqrt{2T\ln(2/\delta)},
\end{aligned}$$
as by definition $c_t \le c_T$ for all $t \in \{1, \ldots, T\}$. Now, as the event $E^f(t)$ holds for all $t$ with probability at least $1 - \delta/2$ (see Lemma 6), from Definition 5 we have $\bar{r}_t = r_t$ for all $t$ with probability at least $1 - \delta/2$. Now by applying a union bound, the result follows.

Proof of Theorem 4
From Lemma 4 we have $\sum_{t=1}^{T}\sigma_{t-1}(x_t) = O(\sqrt{T\gamma_T})$. Also from Definition 1,
$$\begin{aligned}
c_T &\le B + R\sqrt{2(\gamma_T + 1 + \ln(2/\delta))} + \Big(B + R\sqrt{2(\gamma_T + 1 + \ln(2/\delta))}\Big)\sqrt{4\ln T + 2d\ln(BLrdT^2)}\\
&= O\Big(\sqrt{(\gamma_T + \ln(2/\delta))(\ln T + d\ln(BdT))} + B\sqrt{d\ln(BdT)}\Big)\\
&= O\Big(\sqrt{(\gamma_T + \ln(2/\delta))\,d\ln(BdT)}\Big).
\end{aligned}$$

Hence, from Lemma 13, with probability at least $1 - \delta$,
$$R_T = O\Big(\sqrt{(\gamma_T + \ln(2/\delta))\,d\ln(BdT)}\cdot\sqrt{T\gamma_T} + B\sqrt{T\ln(2/\delta)}\Big),$$
and thus with high probability,
$$R_T = O\Big(\sqrt{T\gamma_T^2\,d\ln(BdT)} + B\sqrt{T\gamma_T\,d\ln(BdT)}\Big) = O\Big(\sqrt{T\,d\ln(BdT)}\,\big(B\sqrt{\gamma_T} + \gamma_T\big)\Big).$$

F. Recursive Updates of Posterior Mean and Covariance


We now describe a procedure to update the posterior mean and covariance functions in a recursive fashion through the properties of the Schur complement (Zhang, 2006), rather than evaluating Equations 2 and 3 at each round. Specifically, for all $t \ge 1$ we show the following:
$$\mu_t(x) = \mu_{t-1}(x) + \frac{k_{t-1}(x, x_t)}{\lambda + \sigma_{t-1}^2(x_t)}\big(y_t - \mu_{t-1}(x_t)\big), \qquad (19)$$
$$k_t(x, x') = k_{t-1}(x, x') - \frac{k_{t-1}(x, x_t)\,k_{t-1}(x_t, x')}{\lambda + \sigma_{t-1}^2(x_t)}, \qquad (20)$$
$$\sigma_t^2(x) = \sigma_{t-1}^2(x) - \frac{k_{t-1}^2(x, x_t)}{\lambda + \sigma_{t-1}^2(x_t)}. \qquad (21)$$

These update rules make our algorithms easy to implement and we are not aware of any literature which
explicitly states or uses these relations.  
A B
First we write the matrix Kt + λI as , where A = Kt−1 + λI, B = kt−1 (xt ), C = B T and
C D
D = λ + k(xt , xt ). Now using Schur’s complement we get
−1  −1
A + A−1 BβCA−1 −A−1 Bβ
 
A B
=
C D −βCA−1 β
 −1 −1 T −1
−βA−1 B

A + βA BB A
=
−βB T A−1 β
 −1 
A + βγ −βα
= ,
−βαT β

where β = (D − CA−1 B)−1 = 1/(D − B T A−1 B), γ = A−1 BB T A−1 and α = A−1 B. Therefore we
have

$$\begin{aligned}
\mu_t(x) &= k_t(x)^T(K_t + \lambda I)^{-1}y_{1:t}\\
&= \begin{pmatrix} k_{t-1}(x)^T & k(x_t, x) \end{pmatrix}\begin{pmatrix} A^{-1} + \beta\gamma & -\beta\alpha \\ -\beta\alpha^T & \beta \end{pmatrix}\begin{pmatrix} y_{1:t-1} \\ y_t \end{pmatrix}\\
&= k_{t-1}(x)^T(A^{-1} + \beta\gamma)y_{1:t-1} - \beta k(x_t, x)\alpha^T y_{1:t-1} - \beta y_t\alpha^T k_{t-1}(x) + \beta y_t k(x_t, x)\\
&= k_{t-1}(x)^T A^{-1}y_{1:t-1} + \beta\Big(k_{t-1}(x)^T\gamma y_{1:t-1} - k(x_t, x)\alpha^T y_{1:t-1} - y_t\alpha^T k_{t-1}(x) + y_t k(x_t, x)\Big),
\end{aligned}$$

where
$$\begin{aligned}
k_{t-1}(x)^T A^{-1}y_{1:t-1} &= k_{t-1}(x)^T(K_{t-1} + \lambda I)^{-1}y_{1:t-1} = \mu_{t-1}(x),\\
k_{t-1}(x)^T\gamma y_{1:t-1} &= k_{t-1}(x)^T A^{-1}k_{t-1}(x_t)\,k_{t-1}(x_t)^T A^{-1}y_{1:t-1} = \big(k_{t-1}(x_t)^T A^{-1}k_{t-1}(x)\big)\,\mu_{t-1}(x_t),\\
\alpha^T y_{1:t-1} &= k_{t-1}(x_t)^T A^{-1}y_{1:t-1} = \mu_{t-1}(x_t),\\
\alpha^T k_{t-1}(x) &= k_{t-1}(x_t)^T A^{-1}k_{t-1}(x).
\end{aligned}$$

Thus we have
$$\begin{aligned}
\mu_t(x) &= \mu_{t-1}(x) + \beta\Big(k_{t-1}(x_t)^T A^{-1}k_{t-1}(x)\big(\mu_{t-1}(x_t) - y_t\big) + k(x_t, x)\big(y_t - \mu_{t-1}(x_t)\big)\Big)\\
&= \mu_{t-1}(x) + \beta\big(y_t - \mu_{t-1}(x_t)\big)\Big(k(x_t, x) - k_{t-1}(x_t)^T A^{-1}k_{t-1}(x)\Big)\\
&= \mu_{t-1}(x) + \beta k_{t-1}(x_t, x)\big(y_t - \mu_{t-1}(x_t)\big).
\end{aligned}$$

Now as
$$D - B^T A^{-1}B = \lambda + k(x_t, x_t) - k_{t-1}(x_t)^T(K_{t-1} + \lambda I)^{-1}k_{t-1}(x_t) = \lambda + \sigma_{t-1}^2(x_t),$$
putting $\beta = 1/(\lambda + \sigma_{t-1}^2(x_t))$, we obtain Equation 19. Again observe that

$$\begin{aligned}
k_t(x, x') &= k(x, x') - k_t(x)^T(K_t + \lambda I)^{-1}k_t(x')\\
&= k(x, x') - k_{t-1}(x)^T A^{-1}k_{t-1}(x')\\
&\quad - \beta\Big(k_{t-1}(x)^T\gamma k_{t-1}(x') - k(x_t, x)\alpha^T k_{t-1}(x') - k(x_t, x')\alpha^T k_{t-1}(x) + k(x_t, x)k(x_t, x')\Big).
\end{aligned}$$

Now we have
$$k(x, x') - k_{t-1}(x)^T A^{-1}k_{t-1}(x') = k(x, x') - k_{t-1}(x)^T(K_{t-1} + \lambda I)^{-1}k_{t-1}(x') = k_{t-1}(x, x'),$$
also
$$\begin{aligned}
k_{t-1}(x)^T\gamma k_{t-1}(x') - k(x_t, x)\alpha^T k_{t-1}(x') &= k_{t-1}(x)^T A^{-1}k_{t-1}(x_t)\,k_{t-1}(x_t)^T A^{-1}k_{t-1}(x') - k(x_t, x)\,k_{t-1}(x_t)^T A^{-1}k_{t-1}(x')\\
&= \big(k_{t-1}(x)^T A^{-1}k_{t-1}(x_t) - k(x_t, x)\big)\,k_{t-1}(x_t)^T A^{-1}k_{t-1}(x')\\
&= -k_{t-1}(x_t, x)\,k_{t-1}(x_t)^T A^{-1}k_{t-1}(x'),
\end{aligned}$$
and
$$k(x_t, x)k(x_t, x') - k(x_t, x')\alpha^T k_{t-1}(x) = k(x_t, x')\big(k(x_t, x) - k_{t-1}(x_t)^T A^{-1}k_{t-1}(x)\big) = k(x_t, x')\,k_{t-1}(x_t, x).$$

Putting all these together we get
$$\begin{aligned}
k_t(x, x') &= k_{t-1}(x, x') - \beta\Big(k(x_t, x')\,k_{t-1}(x_t, x) - k_{t-1}(x_t, x)\,k_{t-1}(x_t)^T A^{-1}k_{t-1}(x')\Big)\\
&= k_{t-1}(x, x') - \beta k_{t-1}(x_t, x)\Big(k(x_t, x') - k_{t-1}(x_t)^T A^{-1}k_{t-1}(x')\Big)\\
&= k_{t-1}(x, x') - \beta k_{t-1}(x_t, x)\,k_{t-1}(x_t, x').
\end{aligned}$$
Now Equations 20 and 21 follow by using $\beta = 1/(\lambda + \sigma_{t-1}^2(x_t))$ and $\sigma_t^2(x) = k_t(x, x)$.
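
A small numerical check of the recursions (19)-(21) against the batch formulas $\mu_t(x) = k_t(x)^T(K_t + \lambda I)^{-1}y_{1:t}$ and $\sigma_t^2(x) = k(x, x) - k_t(x)^T(K_t + \lambda I)^{-1}k_t(x)$ is sketched below; the squared exponential kernel, the points and $\lambda = 1$ are illustrative choices only, not part of the derivation above.

```python
# Verify the recursive updates (19)-(21) against the batch posterior formulas;
# the kernel, lambda and the points below are illustrative assumptions.
import numpy as np

def k(a, b, ls=0.5):
    return np.exp(-np.sum((a - b) ** 2) / (2 * ls ** 2))

rng = np.random.default_rng(3)
lam = 1.0
xs = rng.uniform(0, 1, (8, 1))          # x_1, ..., x_t
ys = rng.standard_normal(8)             # y_1, ..., y_t
grid = rng.uniform(0, 1, (5, 1))        # a few query points x

# recursive posterior, maintained on the grid and on the data points
mu = np.zeros(len(grid))                                        # mu_0 = 0
mu_at = np.zeros(len(xs))                                       # mu_0 at the x_s
cov = np.array([[k(a, b) for b in grid] for a in grid])         # k_0(grid, grid)
cov_x = np.array([[k(a, b) for b in xs] for a in grid])         # k_0(grid, x_s)
cov_xx = np.array([[k(a, b) for b in xs] for a in xs])          # k_0(x_r, x_s)

for t in range(len(xs)):
    denom = lam + cov_xx[t, t]                  # lam + sigma_{t-1}^2(x_t)
    resid = ys[t] - mu_at[t]                    # y_t - mu_{t-1}(x_t)
    mu = mu + cov_x[:, t] * resid / denom                       # Equation 19
    mu_at = mu_at + cov_xx[:, t] * resid / denom
    cov = cov - np.outer(cov_x[:, t], cov_x[:, t]) / denom      # Equation 20
    cov_x = cov_x - np.outer(cov_x[:, t], cov_xx[t, :]) / denom
    cov_xx = cov_xx - np.outer(cov_xx[:, t], cov_xx[t, :]) / denom

# batch posterior for comparison
K = np.array([[k(a, b) for b in xs] for a in xs]) + lam * np.eye(len(xs))
kt = np.array([[k(a, b) for b in xs] for a in grid])
mu_batch = kt @ np.linalg.solve(K, ys)
var_batch = np.array([k(g, g) for g in grid]) - np.sum(kt.T * np.linalg.solve(K, kt.T), axis=0)

assert np.allclose(mu, mu_batch) and np.allclose(np.diag(cov), var_batch)
```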

