Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design

Niranjan Srinivas
Andreas Krause
California Institute of Technology, Pasadena, CA, USA
Sham Kakade
University of Pennsylvania, Philadelphia, PA, USA
Matthias Seeger
Saarland University, Saarbrücken, Germany

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm. We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristical GP optimization approaches.

1. Introduction

In most stochastic optimization settings, evaluating the unknown function is expensive, and sampling is to be minimized. Examples include choosing advertisements in sponsored search to maximize profit in a click-through model (Pandey & Olston, 2007) or learning optimal control strategies for robots (Lizotte et al., 2007). Predominant approaches to this problem include the multi-armed bandit paradigm (Robbins, 1952), where the goal is to maximize cumulative reward by optimally balancing exploration and exploitation, and experimental design (Chaloner & Verdinelli, 1995), where the function is to be explored globally with as few evaluations as possible, for example by maximizing information gain. The challenge in both approaches is twofold: we have to estimate an unknown function $f$ from noisy samples, and we must optimize our estimate over some high-dimensional input space. For the former, much progress has been made in machine learning through kernel methods and Gaussian process (GP) models (Rasmussen & Williams, 2006), where smoothness assumptions about $f$ are encoded through the choice of kernel in a flexible nonparametric fashion. Beyond Euclidean spaces, kernels can be defined on diverse domains such as spaces of graphs, sets, or lists.

We are concerned with GP optimization in the multi-armed bandit setting, where $f$ is sampled from a GP distribution or has low "complexity" measured in terms of its RKHS norm under some kernel. We provide the first sublinear regret bounds in this nonparametric setting, which imply convergence rates for GP optimization. In particular, we analyze the Gaussian Process Upper Confidence Bound (GP-UCB) algorithm, a simple and intuitive Bayesian method (Auer et al., 2002; Auer, 2002; Dani et al., 2008). While objectives are different in the multi-armed bandit and experimental design paradigm, our results draw a close technical connection between them: our regret bounds come in terms of an information gain quantity, measuring how fast $f$ can be learned in an information theoretic sense. The submodularity of this function allows us to prove sharp regret bounds for particular covariance functions, which we demonstrate for commonly used Squared Exponential and Matérn kernels.

Related Work. Our work generalizes stochastic linear optimization in a bandit setting, where the unknown function comes from a finite-dimensional linear space. GPs are nonlinear random functions, which can be represented in an infinite-dimensional linear space. For the standard linear setting, Dani et al. (2008)
provide a near-complete characterization explicitly dependent on the dimensionality. In the GP setting, the challenge is to characterize complexity in a different manner, through properties of the kernel function. Our technical contributions are twofold: first, we show how to analyze the nonlinear setting by focusing on the concept of information gain, and second, we explicitly bound this information gain measure using the concept of submodularity (Nemhauser et al., 1978) and knowledge about kernel operator spectra.

Kleinberg et al. (2008) provide regret bounds under weaker and less configurable assumptions (only Lipschitz-continuity w.r.t. a metric is assumed; Bubeck et al. 2008 consider arbitrary topological spaces), which however degrade rapidly with the dimensionality of the problem ($\Omega(T^{\frac{d+1}{d+2}})$). In practice, linearity w.r.t. a fixed basis is often too stringent an assumption, while Lipschitz-continuity can be too coarse-grained, leading to poor rate bounds. Adopting GP assumptions, we can model levels of smoothness in a fine-grained way. For example, our rates for the frequently used Squared Exponential kernel, enforcing a high degree of smoothness, have weak dependence on the dimensionality: $O(\sqrt{T(\log T)^{d+1}})$ (see Fig. 1).

Kernel          Linear kernel    RBF kernel                  Matérn kernel
Regret $R_T$    $d\sqrt{T}$      $\sqrt{T(\log T)^{d+1}}$    $T^{\frac{\nu+d(d+1)}{2\nu+d(d+1)}}$

Figure 1. Our regret bounds (up to polylog factors) for linear, radial basis, and Matérn kernels; $d$ is the dimension, $T$ is the time horizon, and $\nu$ is a Matérn parameter.

There is a large literature on GP (response surface) optimization. Several heuristics for trading off exploration and exploitation in GP optimization have been proposed (such as Expected Improvement, Mockus et al. 1978, and Most Probable Improvement, Mockus 1989) and successfully applied in practice (c.f., Lizotte et al. 2007). Brochu et al. (2009) provide a comprehensive review of and motivation for Bayesian optimization using GPs. The Efficient Global Optimization (EGO) algorithm for optimizing expensive black-box functions is proposed by Jones et al. (1998) and extended to GPs by Huang et al. (2006). Little is known about theoretical performance of GP optimization. While convergence of EGO is established by Vazquez & Bect (2007), convergence rates have remained elusive. Grünewälder et al. (2010) consider the pure exploration problem for GPs, where the goal is to find the optimal decision over $T$ rounds, rather than maximize cumulative reward (with no exploration/exploitation dilemma). They provide sharp bounds for this exploration problem. Note that this methodology would not lead to bounds for minimizing the cumulative regret. Our cumulative regret bounds translate to the first performance guarantees (rates) for GP optimization.

Summary. Our main contributions are:
• We analyze GP-UCB, an intuitive algorithm for GP optimization, when the function is either sampled from a known GP, or has low RKHS norm.
• We bound the cumulative regret for GP-UCB in terms of the information gain due to sampling, establishing a novel connection between experimental design and GP optimization.
• By bounding the information gain for popular classes of kernels, we establish sublinear regret bounds for GP optimization for the first time. Our bounds depend on kernel choice and parameters in a fine-grained fashion.
• We evaluate GP-UCB on sensor network data, demonstrating that it compares favorably to existing algorithms for GP optimization.

2. Problem Statement and Background

Consider the problem of sequentially optimizing an unknown reward function $f : D \to \mathbb{R}$: in each round $t$, we choose a point $x_t \in D$ and get to see the function value there, perturbed by noise: $y_t = f(x_t) + \epsilon_t$. Our goal is to maximize the sum of rewards $\sum_{t=1}^{T} f(x_t)$, thus to perform essentially as well as $x^* = \operatorname{argmax}_{x \in D} f(x)$ (as rapidly as possible). For example, we might want to find locations of highest temperature in a building by sequentially activating sensors in a spatial network and regressing on their measurements. $D$ consists of all sensor locations, $f(x)$ is the temperature at $x$, and sensor accuracy is quantified by the noise variance. Each activation draws battery power, so we want to sample from as few sensors as possible.

Regret. A natural performance metric in this context is cumulative regret, the loss in reward due to not knowing $f$'s maximum points beforehand. Suppose the unknown function is $f$, its maximum point$^1$ $x^* = \operatorname{argmax}_{x \in D} f(x)$. For our choice $x_t$ in round $t$, we incur instantaneous regret $r_t = f(x^*) - f(x_t)$. The cumulative regret $R_T$ after $T$ rounds is the sum of instantaneous regrets: $R_T = \sum_{t=1}^{T} r_t$. A desirable asymptotic property of an algorithm is to be no-regret: $\lim_{T \to \infty} R_T / T = 0$. Note that neither $r_t$ nor $R_T$ are ever revealed to the algorithm. Bounds on the average regret $R_T / T$ translate to convergence rates for GP optimization: the maximum $\max_{t \le T} f(x_t)$ in the first $T$ rounds is no further from $f(x^*)$ than the average.

$^1$ $x^*$ need not be unique; only $f(x^*)$ occurs in the regret.
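The regret definitions above can be made concrete with a short sketch. It assumes a finite decision set represented as an array of true function values (which a real algorithm never sees); the function name `regret_summary` and the toy numbers are illustrative assumptions, not part of the paper.

```python
import numpy as np

def regret_summary(f_values, chosen):
    """Regret quantities for a finite decision set D.

    f_values: true f(x) for every x in D (hypothetical; unknown to the algorithm).
    chosen:   indices of the points x_1, ..., x_T picked in rounds 1..T.
    """
    f_values = np.asarray(f_values, dtype=float)
    chosen = np.asarray(chosen, dtype=int)
    f_star = f_values.max()                        # f(x*)
    r = f_star - f_values[chosen]                  # instantaneous regret r_t
    R = np.cumsum(r)                               # cumulative regret R_T for each T
    avg = R / np.arange(1, len(chosen) + 1)        # average regret R_T / T
    # Gap between f(x*) and the best value sampled in the first T rounds.
    best_gap = f_star - np.maximum.accumulate(f_values[chosen])
    return r, R, avg, best_gap

# Toy example with made-up values.
f_values = np.array([0.1, 0.7, 0.4, 1.0, 0.3])
chosen = [0, 2, 1, 3, 3, 3]
r, R, avg, best_gap = regret_summary(f_values, chosen)
```

In this toy run `best_gap` never exceeds `avg`, which is the sense in which a bound on average regret yields a convergence rate for the best point found so far.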
2.1. Gaussian Processes and RKHS's

Gaussian Processes. Some assumptions on $f$ are required to guarantee no-regret. While rigid parametric assumptions such as linearity may not hold in practice, a certain degree of smoothness is often warranted. In our sensor network, temperature readings at closeby locations are highly correlated (see Figure 2(a)). We can enforce implicit properties like smoothness without relying on any parametric assumptions, modeling $f$ as a sample from a Gaussian process (GP): a collection of dependent random variables, one for each $x \in D$, every finite subset of which is multivariate Gaussian distributed in an overall consistent way (Rasmussen & Williams, 2006). A $\mathrm{GP}(\mu(x), k(x, x'))$ is specified by its mean function $\mu(x) = \mathbb{E}[f(x)]$ and covariance (or kernel) function $k(x, x') = \mathbb{E}[(f(x) - \mu(x))(f(x') - \mu(x'))]$. For GPs not conditioned on data, we assume$^2$ that $\mu \equiv 0$. Moreover, we restrict $k(x, x) \le 1$, $x \in D$, i.e., we assume bounded variance. By fixing the correlation behavior, the covariance function $k$ encodes smoothness properties of sample functions $f$ drawn from the GP. A range of commonly used kernel functions is given in Section 5.2.

$^2$ This is w.l.o.g. (Rasmussen & Williams, 2006).

In this work, GPs play multiple roles. First, some of our results hold when the unknown target function is a sample from a known GP distribution $\mathrm{GP}(0, k(x, x'))$. Second, the Bayesian algorithm we analyze generally uses $\mathrm{GP}(0, k(x, x'))$ as prior distribution over $f$. A major advantage of working with GPs is the existence of simple analytic formulae for mean and covariance of the posterior distribution, which allows easy implementation of algorithms. For a noisy sample $\mathbf{y}_T = [y_1 \dots y_T]^T$ at points $A_T = \{x_1, \dots, x_T\}$, $y_t = f(x_t) + \epsilon_t$ with $\epsilon_t \sim N(0, \sigma^2)$ i.i.d. Gaussian noise, the posterior over $f$ is a GP distribution again, with mean $\mu_T(x)$, covariance $k_T(x, x')$ and variance $\sigma_T^2(x)$:

$$\mu_T(x) = \mathbf{k}_T(x)^T (\mathbf{K}_T + \sigma^2 \mathbf{I})^{-1} \mathbf{y}_T, \tag{1}$$

$$k_T(x, x') = k(x, x') - \mathbf{k}_T(x)^T (\mathbf{K}_T + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_T(x'), \qquad \sigma_T^2(x) = k_T(x, x), \tag{2}$$

where $\mathbf{k}_T(x) = [k(x_1, x) \dots k(x_T, x)]^T$ and $\mathbf{K}_T$ is the positive definite kernel matrix $[k(x, x')]_{x, x' \in A_T}$.

RKHS. Instead of the Bayes case, where $f$ is sampled from a GP prior, we also consider the more agnostic case where $f$ has low "complexity" as measured under an RKHS norm (and distribution free assumptions on the noise process). The notion of reproducing kernel Hilbert spaces (RKHS, Wahba 1990) is intimately related to GPs and their covariance functions $k(x, x')$. The RKHS $H_k(D)$ is a complete subspace of $L_2(D)$ of nicely behaved functions, with an inner product $\langle \cdot, \cdot \rangle_k$ obeying the reproducing property: $\langle f, k(x, \cdot) \rangle_k = f(x)$ for all $f \in H_k(D)$. The induced RKHS norm $\|f\|_k = \sqrt{\langle f, f \rangle_k}$ measures smoothness of $f$ w.r.t. $k$: in much the same way as $k_1$ would generate smoother samples than $k_2$ as GP covariance functions, $\|\cdot\|_{k_1}$ assigns larger penalties than $\|\cdot\|_{k_2}$. $\langle \cdot, \cdot \rangle_k$ can be extended to all of $L_2(D)$, in which case $\|f\|_k < \infty$ iff $f \in H_k(D)$. For most kernels discussed in Section 5.2, members of $H_k(D)$ can uniformly approximate any continuous function on any compact subset of $D$.

2.2. Information Gain & Experimental Design

One approach to maximizing $f$ is to first choose points $x_t$ so as to estimate the function globally well, then play the maximum point of our estimate. How can we learn about $f$ as rapidly as possible? This question comes down to Bayesian Experimental Design (henceforth "ED"; see Chaloner & Verdinelli 1995), where the informativeness of a set of sampling points $A \subset D$ about $f$ is measured by the information gain (c.f., Cover & Thomas 1991), which is the mutual information between $f$ and observations $\mathbf{y}_A = \mathbf{f}_A + \boldsymbol{\epsilon}_A$ at these points:

$$I(\mathbf{y}_A; f) = H(\mathbf{y}_A) - H(\mathbf{y}_A \mid f), \tag{3}$$

quantifying the reduction in uncertainty about $f$ from revealing $\mathbf{y}_A$. Here, $\mathbf{f}_A = [f(x)]_{x \in A}$ and $\boldsymbol{\epsilon}_A \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$. For a Gaussian, $H(N(\boldsymbol{\mu}, \boldsymbol{\Sigma})) = \frac{1}{2} \log |2\pi e \boldsymbol{\Sigma}|$, so that in our setting $I(\mathbf{y}_A; f) = I(\mathbf{y}_A; \mathbf{f}_A) = \frac{1}{2} \log |\mathbf{I} + \sigma^{-2} \mathbf{K}_A|$, where $\mathbf{K}_A = [k(x, x')]_{x, x' \in A}$. While finding the information gain maximizer among $A \subset D$, $|A| \le T$ is NP-hard (Ko et al., 1995), it can be approximated by an efficient greedy algorithm. If $F(A) = I(\mathbf{y}_A; f)$, this algorithm picks $x_t = \operatorname{argmax}_{x \in D} F(A_{t-1} \cup \{x\})$ in round $t$, which can be shown to be equivalent to

$$x_t = \operatorname*{argmax}_{x \in D} \sigma_{t-1}(x), \tag{4}$$

where $A_{t-1} = \{x_1, \dots, x_{t-1}\}$. Importantly, this simple algorithm is guaranteed to find a near-optimal solution: for the set $A_T$ obtained after $T$ rounds, we have that

$$F(A_T) \ge (1 - 1/e) \max_{|A| \le T} F(A), \tag{5}$$

at least a constant fraction of the optimal information gain value. This is because $F(A)$ satisfies a diminishing returns property called submodularity (Krause & Guestrin, 2005), and the greedy approximation guarantee (5) holds for any submodular function (Nemhauser et al., 1978).

While sequentially optimizing Eq. 4 is a provably good way to explore $f$ globally, it is not well suited for function optimization. For the latter, we only need to identify points $x$ where $f(x)$ is large, in order to concentrate sampling there as rapidly as possible, thus exploit our knowledge about maxima. In fact, the ED rule (4) does not even depend on observations $y_t$ obtained along the way. Nevertheless, the maximum information gain after $T$ rounds will play a prominent role in our regret bounds, forging an important connection between GP optimization and experimental design.

Algorithm 1 The GP-UCB algorithm.
Input: Input space $D$; GP Prior $\mu_0 = 0$, $\sigma_0$, $k$
for $t = 1, 2, \dots$ do
    Choose $x_t = \operatorname{argmax}_{x \in D} \mu_{t-1}(x) + \sqrt{\beta_t}\, \sigma_{t-1}(x)$
    Sample $y_t = f(x_t) + \epsilon_t$
    Perform Bayesian update to obtain $\mu_t$ and $\sigma_t$
end for
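The following is a minimal sketch of Algorithm 1 for a finite decision set, using the posterior formulae (1) and (2) directly. It is an illustration under stated assumptions rather than the authors' implementation: the function names `gp_posterior` and `gp_ucb`, the dense linear solve in every round, the stand-in objective `sin(6x)`, and the 50-point grid are all choices made here. The confidence parameter follows the finite-$D$ choice $\beta_t = 2\log(|D| t^2 \pi^2 / 6\delta)$ stated in Theorem 1 of Section 4.

```python
import numpy as np

def gp_posterior(K, idx, y, sigma2):
    """Posterior mean and variance on all of D given noisy observations (Eqs. 1-2).

    K:      prior kernel matrix over the finite set D, shape (n, n).
    idx:    list of indices of the sampled points x_1..x_T (repeats allowed).
    y:      list of the corresponding noisy observations.
    sigma2: noise variance sigma^2.
    """
    n = K.shape[0]
    if len(idx) == 0:
        return np.zeros(n), np.diag(K).copy()          # prior: mu_0 = 0, sigma_0^2 = k(x, x)
    K_T = K[np.ix_(idx, idx)]                           # kernel matrix K_T
    k_T = K[idx, :]                                     # column j holds k_T(x_j), shape (T, n)
    A = K_T + sigma2 * np.eye(len(idx))
    mu = k_T.T @ np.linalg.solve(A, np.asarray(y, dtype=float))
    var = np.diag(K) - np.einsum("ij,ij->j", k_T, np.linalg.solve(A, k_T))
    return mu, np.maximum(var, 0.0)                     # clip tiny negative round-off

def gp_ucb(K, f, T, sigma2, delta=0.1, rng=None):
    """Run GP-UCB for T rounds on a finite D; f holds the (hidden) true values."""
    rng = np.random.default_rng() if rng is None else rng
    n = K.shape[0]
    idx, y = [], []
    for t in range(1, T + 1):
        mu, var = gp_posterior(K, idx, y, sigma2)
        beta_t = 2.0 * np.log(n * t**2 * np.pi**2 / (6.0 * delta))
        x_t = int(np.argmax(mu + np.sqrt(beta_t * var)))   # UCB rule of Algorithm 1
        idx.append(x_t)
        y.append(f[x_t] + rng.normal(0.0, np.sqrt(sigma2)))  # noisy sample y_t
    return idx, y

# Illustrative run on a small grid with a Squared Exponential kernel (assumed setup).
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
K = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / (2 * 0.2**2))
f = np.sin(6 * grid)                                    # stand-in objective, not from the paper
idx, y = gp_ucb(K, f, T=30, sigma2=0.025, rng=rng)
```

Recomputing the posterior from scratch in each round keeps the sketch close to Eqs. (1) and (2); an incremental Cholesky update would be the usual choice when $T$ is large.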
3. GP-UCB Algorithm

For sequential optimization, the ED rule (4) can be wasteful: it aims at decreasing uncertainty globally, not just where maxima might be. Another idea is to pick points as $x_t = \operatorname{argmax}_{x \in D} \mu_{t-1}(x)$, maximizing the expected reward based on the posterior so far. However, this rule is too greedy too soon and tends to get stuck in shallow local optima. A combined strategy is to choose

$$x_t = \operatorname*{argmax}_{x \in D} \mu_{t-1}(x) + \beta_t^{1/2} \sigma_{t-1}(x), \tag{6}$$

where $\beta_t$ are appropriate constants. This latter objective prefers both points $x$ where $f$ is uncertain (large $\sigma_{t-1}(\cdot)$) and such where we expect to achieve high rewards (large $\mu_{t-1}(\cdot)$): it implicitly negotiates the exploration–exploitation tradeoff. A natural interpretation of this sampling rule is that it greedily selects points $x$ such that $f(x)$ should be a reasonable upper bound on $f(x^*)$, since the argument in (6) is an upper quantile of the marginal posterior $P(f(x) \mid \mathbf{y}_{t-1})$. We call this choice the Gaussian process upper confidence bound rule (GP-UCB), where $\beta_t$ is specified depending on the context (see Section 4). Pseudocode for the GP-UCB algorithm is provided in Algorithm 1. Figure 2 illustrates two subsequent iterations, where GP-UCB both explores (Figure 2(b)) by sampling an input $x$ with large $\sigma_{t-1}^2(x)$ and exploits (Figure 2(c)) by sampling $x$ with large $\mu_{t-1}(x)$.

[Figure 2 panels: (a) Temperature data, (b) Iteration $t$, (c) Iteration $t + 1$]

Figure 2. (a) Example of temperature data collected by a network of 46 sensors at Intel Research Berkeley. (b,c) Two iterations of the GP-UCB algorithm. It samples points that are either uncertain (b) or have high posterior mean (c).

The GP-UCB selection rule Eq. 6 is motivated by the UCB algorithm for the classical multi-armed bandit problem (Auer et al., 2002; Kocsis & Szepesvári, 2006). Among competing criteria for GP optimization (see Section 1), a variant of the GP-UCB rule has been demonstrated to be effective for this application (Dorard et al., 2009). To our knowledge, strong theoretical results of the kind provided for GP-UCB in this paper have not been given for any of these search heuristics. In Section 6, we show that in practice GP-UCB compares favorably with these alternatives.

If $D$ is infinite, finding $x_t$ in (6) may be hard: the upper confidence index is multimodal in general. However, global search heuristics are very effective in practice (Brochu et al., 2009). It is generally assumed that evaluating $f$ is more costly than maximizing the UCB index.

UCB algorithms (and GP optimization techniques in general) have been applied to a large number of problems in practice (Kocsis & Szepesvári, 2006; Pandey & Olston, 2007; Lizotte et al., 2007). Their performance is well characterized in both the finite arm setting and the linear optimization setting, but no convergence rates for GP optimization are known.

4. Regret Bounds

We now establish cumulative regret bounds for GP optimization, treating a number of different settings: $f \sim \mathrm{GP}(0, k(x, x'))$ for finite $D$, $f \sim \mathrm{GP}(0, k(x, x'))$ for general compact $D$, and the agnostic case of arbitrary $f$ with bounded RKHS norm.

GP optimization generalizes stochastic linear optimization, where a function $f$ from a finite-dimensional linear space is optimized over. For the linear case, Dani et al. (2008) provide regret bounds that explicitly depend on the dimensionality $d$. (In general, $d$ is the dimensionality of the input space $D$, which in the finite-dimensional linear case coincides with the feature space.) GPs can be seen as random functions in some infinite-dimensional linear space, so their results do not apply in this case. This problem is circumvented in our regret bounds. The quantity governing them is the maximum information gain $\gamma_T$ after $T$ rounds, defined as:

$$\gamma_T := \max_{A \subset D : |A| = T} I(\mathbf{y}_A; \mathbf{f}_A), \tag{7}$$

where $I(\mathbf{y}_A; \mathbf{f}_A) = I(\mathbf{y}_A; f)$ is defined in (3). Recall that $I(\mathbf{y}_A; \mathbf{f}_A) = \frac{1}{2} \log |\mathbf{I} + \sigma^{-2} \mathbf{K}_A|$, where $\mathbf{K}_A = [k(x, x')]_{x, x' \in A}$ is the covariance matrix of $\mathbf{f}_A = [f(x)]_{x \in A}$ associated with the samples $A$. Our regret bounds are of the form $O^*(\sqrt{T \beta_T \gamma_T})$, where $\beta_T$ is the confidence parameter in Algorithm 1, while the bounds of Dani et al. (2008) are of the form $O^*(\sqrt{T \beta_T d})$ ($d$ the dimensionality of the linear function space). Here and below, the $O^*$ notation is a variant of $O$, where log factors are suppressed. While our proofs (all provided in the longer version, Srinivas et al., 2009) use techniques similar to those of Dani et al. (2008), we face a number of additional significant technical challenges. Besides avoiding the finite-dimensional analysis, we must handle confidence issues, which are more
delicate for nonlinear random functions.

Importantly, note that the information gain is a problem dependent quantity: properties of both the kernel and the input space will determine the growth of regret. In Section 5, we provide general methods for bounding $\gamma_T$, either by efficient auxiliary computations or by direct expressions for specific kernels of interest. Our results match known lower bounds (up to log factors) in both the $K$-armed bandit and the $d$-dimensional linear optimization case.

Bounds for a GP Prior. For finite $D$, we obtain the following bound.

Theorem 1 Let $\delta \in (0, 1)$ and $\beta_t = 2 \log(|D| t^2 \pi^2 / 6\delta)$. Running GP-UCB with $\beta_t$ for a sample $f$ of a GP with mean function zero and covariance function $k(x, x')$, we obtain a regret bound of $O^*(\sqrt{T \gamma_T \log |D|})$ with high probability. Precisely,

$$\Pr\left\{ R_T \le \sqrt{C_1 T \beta_T \gamma_T} \quad \forall T \ge 1 \right\} \ge 1 - \delta,$$

where $C_1 = 8 / \log(1 + \sigma^{-2})$.

The proof essentially relates the regret to the growth of the log volume of the confidence ellipsoid, and in a novel manner, shows how this growth is characterized by the information gain.

This theorem shows that, with high probability over samples from the GP, the cumulative regret is bounded in terms of the maximum information gain, forging a novel connection between GP optimization and experimental design. This link is of fundamental technical importance, allowing us to generalize Theorem 1 to infinite decision spaces. Moreover, the submodularity of $I(\mathbf{y}_A; \mathbf{f}_A)$ allows us to derive sharp a priori bounds, depending on choice and parameterization of $k$ (see Section 5). In the following theorem, we generalize our result to any compact and convex $D \subset \mathbb{R}^d$ under mild assumptions on the kernel function $k$.

Theorem 2 Let $D \subset [0, r]^d$ be compact and convex, $d \in \mathbb{N}$, $r > 0$. Suppose that the kernel $k(x, x')$ satisfies the following high probability bound on the derivatives of GP sample paths $f$: for some constants $a, b > 0$,

$$\Pr\left\{ \sup_{x \in D} |\partial f / \partial x_j| > L \right\} \le a e^{-(L/b)^2}, \quad j = 1, \dots, d.$$

Pick $\delta \in (0, 1)$, and define

$$\beta_t = 2 \log(t^2 2\pi^2 / (3\delta)) + 2d \log\left( t^2 d b r \sqrt{\log(4da/\delta)} \right).$$

Running the GP-UCB with $\beta_t$ for a sample $f$ of a GP with mean function zero and covariance function $k(x, x')$, we obtain a regret bound of $O^*(\sqrt{d T \gamma_T})$ with high probability. Precisely, with $C_1 = 8 / \log(1 + \sigma^{-2})$ we have

$$\Pr\left\{ R_T \le \sqrt{C_1 T \beta_T \gamma_T} + 2 \quad \forall T \ge 1 \right\} \ge 1 - \delta.$$

The main challenge in our proof is to lift the regret bound in terms of the confidence ellipsoid to general $D$. The smoothness assumption on $k(x, x')$ disqualifies GPs with highly erratic sample paths. It holds for stationary kernels $k(x, x') = k(x - x')$ which are four times differentiable (Theorem 5 of Ghosal & Roy (2006)), such as the Squared Exponential and Matérn kernels with $\nu > 2$ (see Section 5.2), while it is violated for the Ornstein-Uhlenbeck kernel (Matérn with $\nu = 1/2$; a stationary variant of the Wiener process). For the latter, sample paths $f$ are nondifferentiable almost everywhere with probability one and come with independent increments. We conjecture that a result of the form of Theorem 2 does not hold in this case.

Bounds for Arbitrary f in the RKHS. Thus far, we have assumed that the target function $f$ is sampled from a GP prior and that the noise is $N(0, \sigma^2)$ with known variance $\sigma^2$. We now analyze GP-UCB in an agnostic setting, where $f$ is an arbitrary function from the RKHS corresponding to kernel $k(x, x')$. Moreover, we allow the noise variables $\epsilon_t$ to be an arbitrary martingale difference sequence (meaning that
$\mathbb{E}[\epsilon_t \mid \epsilon_{<t}] = 0$ for all $t \in \mathbb{N}$), uniformly bounded by $\sigma$. Note that we still run the same GP-UCB algorithm, whose prior and noise model are misspecified in this case. Our following result shows that GP-UCB attains sublinear regret even in the agnostic setting.

Theorem 3 Let $\delta \in (0, 1)$. Assume that the true underlying $f$ lies in the RKHS $H_k(D)$ corresponding to the kernel $k(x, x')$, and that the noise $\epsilon_t$ has zero mean conditioned on the history and is bounded by $\sigma$ almost surely. In particular, assume $\|f\|_k^2 \le B$ and let $\beta_t = 2B + 300 \gamma_t \log^3(t/\delta)$. Running GP-UCB with $\beta_t$, prior $\mathrm{GP}(0, k(x, x'))$ and noise model $N(0, \sigma^2)$, we obtain a regret bound of $O^*(\sqrt{T}(B\sqrt{\gamma_T} + \gamma_T))$ with high probability (over the noise). Precisely,

$$\Pr\left\{ R_T \le \sqrt{C_1 T \beta_T \gamma_T} \quad \forall T \ge 1 \right\} \ge 1 - \delta,$$

where $C_1 = 8 / \log(1 + \sigma^{-2})$.

Note that while our theorem implicitly assumes that GP-UCB has knowledge of an upper bound on $\|f\|_k$, standard guess-and-doubling approaches suffice if no such bound is known a priori. Comparing Theorem 2 and Theorem 3, the latter holds uniformly over all functions $f$ with $\|f\|_k < \infty$, while the former is a probabilistic statement requiring knowledge of the GP that $f$ is sampled from. In contrast, if $f \sim \mathrm{GP}(0, k(x, x'))$, then $\|f\|_k = \infty$ almost surely (Wahba, 1990): sample paths are rougher than RKHS functions. Neither Theorem 2 nor 3 encompasses the other.

5. Bounding the Information Gain

Since the bounds developed in Section 4 depend on the information gain, the key remaining question is how to bound the quantity $\gamma_T$ for practical classes of kernels.

5.1. Submodularity and Greedy Maximization

In order to bound $\gamma_T$, we have to maximize the information gain $F(A) = I(\mathbf{y}_A; f)$ over all subsets $A \subset D$ of size $T$: a combinatorial problem in general. However, as noted in Section 2, $F(A)$ is a submodular function, which implies the performance guarantee (5) for maximizing $F$ sequentially by the greedy ED rule (4). Dividing both sides of (5) by $1 - 1/e$, we can upper-bound $\gamma_T$ by $(1 - 1/e)^{-1} I(\mathbf{y}_{A_T}; f)$, where $A_T$ is constructed by the greedy procedure. Thus, somewhat counterintuitively, instead of using submodularity to prove that $F(A_T)$ is near-optimal, we use it in order to show that $\gamma_T$ is "near-greedy". As noted in Section 2, the ED rule does not depend on observations $y_t$ and can be run without evaluating $f$.

The importance of this greedy bound is twofold. First, it allows us to numerically compute highly problem-specific bounds on $\gamma_T$, which can be plugged into our results in Section 4 to obtain high-probability bounds on $R_T$. This being a laborious procedure, one would prefer a priori bounds for $\gamma_T$ in practice which are simple analytical expressions of $T$ and parameters of $k$. In this section, we sketch a general procedure for obtaining such expressions, instantiating them for a number of commonly used covariance functions, once more relying crucially on the greedy ED rule upper bound. Suppose that $D$ is finite for now, and let $\mathbf{f} = [f(x)]_{x \in D}$, $\mathbf{K}_D = [k(x, x')]_{x, x' \in D}$. Sampling $f$ at $x_t$, we obtain $y_t \sim N(\mathbf{v}_t^T \mathbf{f}, \sigma^2)$, where $\mathbf{v}_t \in \mathbb{R}^{|D|}$ is the indicator vector associated with $x_t$. We can upper-bound the greedy maximum once more, by relaxing this constraint to $\|\mathbf{v}_t\| = 1$ in round $t$ of the sequential method. For this relaxed greedy procedure, all $\mathbf{v}_t$ are leading eigenvectors of $\mathbf{K}_D$, since successive covariance matrices of $P(\mathbf{f} \mid \mathbf{y}_{t-1})$ share their eigenbasis with $\mathbf{K}_D$, while eigenvalues are damped according to how many times the corresponding eigenvector is selected. We can upper-bound the information gain by considering the worst-case allocation of $T$ samples to the $\min\{T, |D|\}$ leading eigenvectors of $\mathbf{K}_D$:

$$\gamma_T \le \frac{1/2}{1 - e^{-1}} \max_{(m_t)} \sum_{t=1}^{|D|} \log(1 + \sigma^{-2} m_t \hat{\lambda}_t), \tag{8}$$

subject to $\sum_t m_t = T$, and $\operatorname{spec}(\mathbf{K}_D) = \{\hat{\lambda}_1 \ge \hat{\lambda}_2 \ge \dots\}$. We can split the sum into two parts in order to obtain a bound to leading order. The following Theorem captures this intuition:

Theorem 4 For any $T \in \mathbb{N}$ and any $T_* = 1, \dots, T$:

$$\gamma_T \le O\left( \sigma^{-2} [B(T_*) T + T_* (\log n_T T)] \right),$$

where $n_T = \sum_{t=1}^{|D|} \hat{\lambda}_t$ and $B(T_*) = \sum_{t=T_*+1}^{|D|} \hat{\lambda}_t$.

Therefore, if for some $T_* = o(T)$ the first $T_*$ eigenvalues carry most of the total mass $n_T$, the information gain will be small. The more rapidly the spectrum of $\mathbf{K}_D$ decays, the slower the growth of $\gamma_T$. Figure 3 illustrates this intuition.

[Figure 3 panels: left, Eigenvalue against Eigenvalue rank; right, Bound on $\gamma_T$ against $T$; curves for Independent, Linear (d=4), Squared exponential, and Matérn ($\nu = 2.5$) kernels]

Figure 3. Spectral decay (left) and information gain bound (right) for independent (diagonal), linear, squared exponential and Matérn kernels ($\nu = 2.5$) with equal trace.

5.2. Bounds for Common Kernels

In this section we bound $\gamma_T$ for a range of commonly used covariance functions: finite dimensional linear,
Squared Exponential and Matérn kernels. Together with our results in Section 4, these imply sublinear regret bounds for GP-UCB in all cases.

Finite dimensional linear kernels have the form $k(x, x') = x^T x'$. GPs with this kernel correspond to random linear functions $f(x) = w^T x$, $w \sim N(\mathbf{0}, \mathbf{I})$.

The Squared Exponential kernel is $k(x, x') = \exp(-(2l^2)^{-1} \|x - x'\|^2)$, $l$ a lengthscale parameter. Sample functions are differentiable to any order almost surely (Rasmussen & Williams, 2006).

The Matérn kernel is given by $k(x, x') = (2^{1-\nu} / \Gamma(\nu))\, r^{\nu} B_{\nu}(r)$, $r = (\sqrt{2\nu} / l) \|x - x'\|$, where $\nu$ controls the smoothness of sample paths (the smaller, the rougher) and $B_{\nu}$ is a modified Bessel function.

Theorem 5 Let $D \subset \mathbb{R}^d$ be compact and convex, $d \in \mathbb{N}$. Assume the kernel function satisfies $k(x, x') \le 1$.
1. Finite spectrum. For the $d$-dimensional Bayesian linear regression case: $\gamma_T = O(d \log T)$.
2. Exponential spectral decay. For the Squared Exponential kernel: $\gamma_T = O((\log T)^{d+1})$.
3. Power law spectral decay. For Matérn kernels with $\nu > 1$: $\gamma_T = O(T^{d(d+1)/(2\nu + d(d+1))} (\log T))$.

We now provide a sketch of the proof. $\gamma_T$ is bounded by Theorem 4 in terms of the eigendecay of the kernel matrix $\mathbf{K}_D$. If $D$ is infinite or very large, we can use the operator spectrum of $k(x, x')$, which likewise decays rapidly. For the kernels of interest here, asymptotic expressions for the operator eigenvalues are given in Seeger et al. (2008), who derived bounds on the information gain for fixed and random designs (in contrast to the worst-case information gain considered here, which is substantially more challenging to bound). The main challenge in the proof is to ensure the existence of discretizations $D_T \subset D$, dense in the limit, for which tail sums $B(T_*) / n_T$ in Theorem 4 are close to corresponding operator spectra tail sums.

Together with Theorems 2 and 3, this result guarantees sublinear regret of GP-UCB for any dimension (see Figure 1). For the Squared Exponential kernel,

For synthetic data, we sample random functions from a squared exponential kernel with lengthscale parameter 0.2. The sampling noise variance $\sigma^2$ was set to 0.025 or 5% of the signal variance. Our decision set $D = [0, 1]$ is uniformly discretized into 1000 points. We run each algorithm for $T = 1000$ iterations with $\delta = 0.1$, averaging over 30 trials (samples from the kernel). While the choice of $\beta_t$ as recommended by Theorem 1 leads to competitive performance of GP-UCB, we find (using cross-validation) that the algorithm is improved by scaling $\beta_t$ down by a factor 5. Note that we did not optimize constants in our regret bounds.

Next, we use temperature data collected from 46 sensors deployed at Intel Research Berkeley over 5 days at 1 minute intervals, pertaining to the example in Section 2. We take the first two-thirds of the data set to compute the empirical covariance of the sensor readings, and use it as the kernel matrix. The functions $f$ for optimization consist of one set of observations from all the sensors taken from the remaining third of the data set, and the results (for $T = 46$, $\sigma^2 = 0.5$ or 5% noise, $\delta = 0.1$) were averaged over 2000 possible choices of the objective function.

Lastly, we take data from traffic sensors deployed along the highway I-880 South in California. The goal was to find the point of minimum speed in order to identify the most congested portion of the highway; we used traffic speed data for all working days from 6 AM to 11 AM for one month, from 357 sensors. We again use the covariance matrix from two-thirds of the data set as kernel matrix, and test on the other third. The results (for $T = 357$, $\sigma^2 = 4.78$ or 5% noise, $\delta = 0.1$) were averaged over 900 runs.

Figure 4 compares the mean average regret incurred by the different heuristics and the GP-UCB algorithm on synthetic and real data. For temperature data, the GP-UCB algorithm and EI heuristic clearly outperform the others, and do not exhibit significant difference between each other. On synthetic and traffic data MPI does equally well. In summary, GP-UCB performs at least on par with the existing approaches which are not equipped with regret bounds.
the dimension d appears as exponent of√log T only, so
d+1
that the regret grows at most as O∗ ( T (log T ) 2 ) 7. Conclusions
– the high degree of smoothness of the sample paths We prove the first sublinear regret bounds for GP
effectively combats the curse of dimensionality. optimization with commonly used kernels (see Fig-
ure 1), both for f sampled from a known GP and f of
6. Experiments
low RKHS norm. We analyze GP-UCB, an intuitive,
We compare GP-UCB with heuristics such as the Bayesian upper confidence bound based sampling rule.
Expected Improvement (EI) and Most Probable Our regret bounds crucially depend on the information
Improvement (MPI), and with naive methods which gain due to sampling, establishing a novel connection
choose points of maximum mean or variance only, between bandit optimization and experimental design.
both on synthetic and real sensor network data.
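To make the dependence of the regret on the information gain concrete, the rates can be worked out by direct substitution. The block below is a sketch under one stated assumption: that for $f$ sampled from a GP the regret scales as $\sqrt{T \gamma_T}$ up to the factors hidden in $O^*$, which is the form suggested by the Squared Exponential rate quoted in Section 5. It only plugs the three bounds on $\gamma_T$ from Theorem 5 into that expression; the low RKHS norm setting is not treated here.

```latex
\begin{align*}
\text{Linear:} \quad
  & \gamma_T = O(d \log T)
  && \Rightarrow \quad \sqrt{T \gamma_T} = O\big(\sqrt{d\,T \log T}\big), \\
\text{Squared Exponential:} \quad
  & \gamma_T = O\big((\log T)^{d+1}\big)
  && \Rightarrow \quad \sqrt{T \gamma_T} = O\big(\sqrt{T}\,(\log T)^{(d+1)/2}\big), \\
\text{Mat\'ern } (\nu > 1): \quad
  & \gamma_T = O\big(T^{\frac{d(d+1)}{2\nu + d(d+1)}} \log T\big)
  && \Rightarrow \quad \sqrt{T \gamma_T}
     = O\big(T^{\frac{\nu + d(d+1)}{2\nu + d(d+1)}} \sqrt{\log T}\big).
\end{align*}
% Matern exponent: (1/2) * (1 + d(d+1) / (2 nu + d(d+1)))
%                = (nu + d(d+1)) / (2 nu + d(d+1)), which is < 1 whenever nu > 0.
```

In all three cases the exponent of $T$ is strictly below 1, so the average regret $\sqrt{T \gamma_T}/T$ vanishes as $T$ grows.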
[Figure 4: three panels plotting mean average regret (vertical axis) against iterations (horizontal axis) for Var only, Mean only, MPI, EI, and UCB. Panels: (a) Squared exponential, (b) Temperature data, (c) Traffic data.]
Figure 4. Comparison of performance: GP-UCB and various heuristics on synthetic (a), and sensor network data (b, c).
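The synthetic setting of panel (a) can be approximated with a short script. The following is a minimal sketch under stated assumptions, not the authors' code: the names `se_kernel`, `cholesky`, `sample_gp`, `gp_ucb`, and `beta_scale` are hypothetical; the schedule $\beta_t = 2 \log(|D| t^2 \pi^2 / (6\delta))$ is taken to be the finite-$D$ choice of Theorem 1, which this section does not restate; the grid has 50 points and the horizon is 100 rounds (the paper uses 1000 of each) to keep a pure standard-library run fast; and only a single trial of GP-UCB is run, with none of the competing heuristics.

```python
import math
import random


def se_kernel(x, y, lengthscale=0.2):
    """Squared Exponential kernel on the real line, exp(-(x - y)^2 / (2 l^2))."""
    return math.exp(-((x - y) ** 2) / (2.0 * lengthscale ** 2))


def cholesky(a):
    """Return lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def sample_gp(kernel_matrix, rng, jitter=1e-6):
    """Draw one function f ~ N(0, K) on the grid.

    The diagonal jitter is a numerical-stability assumption, not part of the paper.
    """
    n = len(kernel_matrix)
    stabilized = [row[:] for row in kernel_matrix]
    for i in range(n):
        stabilized[i][i] += jitter
    low = cholesky(stabilized)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


def gp_ucb(kernel_matrix, f, noise_var, horizon, delta, rng, beta_scale=1.0):
    """Run GP-UCB on a finite decision set; return the mean average regret per round."""
    n = len(f)
    mean = [0.0] * n                         # posterior mean, starts at the prior mean 0
    cov = [row[:] for row in kernel_matrix]  # posterior covariance, starts at the prior K
    best = max(f)
    cumulative = 0.0
    average_regret = []
    for t in range(1, horizon + 1):
        # Assumed finite-D schedule beta_t = 2 log(|D| t^2 pi^2 / (6 delta)), optionally rescaled.
        beta = beta_scale * 2.0 * math.log(n * t * t * math.pi ** 2 / (6.0 * delta))
        ucb = [mean[i] + math.sqrt(beta * max(cov[i][i], 0.0)) for i in range(n)]
        i = max(range(n), key=ucb.__getitem__)            # point with the largest upper bound
        y = f[i] + rng.gauss(0.0, math.sqrt(noise_var))   # noisy observation of f at that point
        # Rank-one Gaussian conditioning on the new observation at index i.
        col = [cov[j][i] for j in range(n)]
        denom = col[i] + noise_var
        resid = y - mean[i]
        for j in range(n):
            gain = col[j] / denom
            mean[j] += gain * resid
            for k in range(n):
                cov[j][k] -= gain * col[k]
        cumulative += best - f[i]
        average_regret.append(cumulative / t)
    return average_regret


rng = random.Random(0)
grid = [i / 49 for i in range(50)]   # 50-point grid on [0, 1] (the paper uses 1000 points)
K = [[se_kernel(a, b) for b in grid] for a in grid]
f = sample_gp(K, rng)
regret = gp_ucb(K, f, noise_var=0.025, horizon=100, delta=0.1, rng=rng, beta_scale=0.2)
print(f"mean average regret after {len(regret)} rounds: {regret[-1]:.3f}")
```

Setting `beta_scale=0.2` mirrors the reported scaling of $\beta_t$ down by a factor of 5; `beta_scale=1.0` uses the assumed schedule unchanged. Reproducing the figure would additionally require averaging over 30 sampled functions and implementing the EI, MPI, mean-only, and variance-only rules, which this sketch omits.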

We bound the information gain in terms of the kernel spectrum, providing a general methodology for obtaining regret bounds with kernels of interest. Our experiments on real sensor network data indicate that GP-UCB performs at least on par with competing criteria for GP optimization, for which no regret bounds are known at present. Our results provide an interesting step towards understanding exploration–exploitation tradeoffs with complex utility functions.

Acknowledgements. We thank Marcus Hutter for insightful comments on an earlier version. Research partially supported by ONR grant N00014-09-1-1044, NSF grant CNS-0932392, a gift from Microsoft Corporation and the Excellence Initiative of the German research foundation (DFG).

References

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 3:397–422, 2002.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, 2002.
Brochu, E., Cora, M., and de Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. In TR-2009-23, UBC, 2009.
Bubeck, S., Munos, R., Stoltz, G., and Szepesvári, C. Online optimization in X-armed bandits. In NIPS, 2008.
Chaloner, K. and Verdinelli, I. Bayesian experimental design: A review. Stat. Sci., 10(3):273–304, 1995.
Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley Interscience, 1991.
Dani, V., Hayes, T. P., and Kakade, S. M. Stochastic linear optimization under bandit feedback. In COLT, 2008.
Dorard, L., Glowacka, D., and Shawe-Taylor, J. Gaussian process modelling of dependencies in multi-armed bandit problems. In Int. Symp. Op. Res., 2009.
Ghosal, S. and Roy, A. Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann. Stat., 34(5):2413–2429, 2006.
Grünewälder, S., Audibert, J-Y., Opper, M., and Shawe-Taylor, J. Regret bounds for gaussian process bandit problems. In AISTATS, 2010.
Huang, D., Allen, T. T., Notz, W. I., and Zeng, N. Global optimization of stochastic black-box systems via sequential kriging meta-models. J. Glob. Opt., 34:441–466, 2006.
Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. J. Glob. Opt., 13:455–492, 1998.
Kleinberg, R., Slivkins, A., and Upfal, E. Multi-armed bandits in metric spaces. In STOC, pp. 681–690, 2008.
Ko, C., Lee, J., and Queyranne, M. An exact algorithm for maximum entropy sampling. Ops Res, 43(4):684–691, 1995.
Kocsis, L. and Szepesvári, C. Bandit based monte-carlo planning. In ECML, 2006.
Krause, A. and Guestrin, C. Near-optimal nonmyopic value of information in graphical models. In UAI, 2005.
Lizotte, D., Wang, T., Bowling, M., and Schuurmans, D. Automatic gait optimization with Gaussian process regression. In IJCAI, pp. 944–949, 2007.
Mockus, J. Bayesian Approach to Global Optimization. Kluwer Academic Publishers, 1989.
Mockus, J., Tiesis, V., and Zilinskas, A. Toward Global Optimization, volume 2, chapter Bayesian Methods for Seeking the Extremum, pp. 117–128. 1978.
Nemhauser, G., Wolsey, L., and Fisher, M. An analysis of the approximations for maximizing submodular set functions. Math. Prog., 14:265–294, 1978.
Pandey, S. and Olston, C. Handling advertisements of unknown quality in search advertising. In NIPS, 2007.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
Robbins, H. Some aspects of the sequential design of experiments. Bul. Am. Math. Soc., 58:527–535, 1952.
Seeger, M. W., Kakade, S. M., and Foster, D. P. Information consistency of nonparametric Gaussian process methods. IEEE Tr. Inf. Theo., 54(5):2376–2382, 2008.
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv:0912.3995, 2009.
Vazquez, E. and Bect, J. Convergence properties of the expected improvement algorithm, 2007.
Wahba, G. Spline Models for Observational Data. SIAM, 1990.
