Contextual IDS for Bandit Problems

This paper explores the concept of Information-Directed Sampling (IDS) in the context of contextual bandits, highlighting the limitations of traditional conditional IDS and proposing a new version that incorporates context distribution. The authors demonstrate that their contextual IDS approach outperforms conditional IDS in terms of Bayesian regret bounds for both contextual bandits with graph feedback and sparse linear contextual bandits. They also introduce a computationally efficient algorithm based on Actor-Critic methods and provide empirical evaluations to support their findings.


Contextual Information-Directed Sampling

Botao Hao¹  Tor Lattimore¹  Chao Qin²

arXiv:2205.10895v1 [cs.LG] 22 May 2022

Abstract

Information-directed sampling (IDS) has recently demonstrated its potential as a data-efficient reinforcement learning algorithm (Lu et al., 2021). However, it is still unclear what is the right form of information ratio to optimize when contextual information is available. We investigate the IDS design through two contextual bandit problems: contextual bandits with graph feedback and sparse linear contextual bandits. We provably demonstrate the advantage of contextual IDS over conditional IDS and emphasize the importance of considering the context distribution. The main message is that an intelligent agent should invest more in the actions that are beneficial for future unseen contexts, while conditional IDS can be myopic. We further propose a computationally efficient version of contextual IDS based on Actor-Critic and evaluate it empirically on a neural network contextual bandit.

*Equal contribution. ¹DeepMind ²Columbia University. Correspondence to: Botao Hao <[email protected]>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Information-directed sampling (IDS) (Russo & Van Roy, 2018) is a promising approach to balancing the exploration-exploitation tradeoff (Lu et al., 2021). Its theoretical properties have been systematically studied in a range of problems (Kirschner & Krause, 2018; Liu et al., 2018a; Kirschner et al., 2020a;b; Hao et al., 2021a). However, the aforementioned analyses¹ are limited to a fixed action set.

In this work, we study the IDS design for contextual bandits (Langford & Zhang, 2007), which can be viewed as a simplified case of full reinforcement learning. The context at each round is generated independently, rather than determined by the actions. Contextual bandits are widely used in real-world applications such as recommender systems (Li et al., 2010).

¹One exception is Kirschner et al. (2020a), who extended IDS to contextual partial monitoring.

A natural generalization of IDS to the contextual case is conditional IDS. Given the current context, conditional IDS acts as if it were facing a fixed action set. However, this design ignores the context distribution and can be myopic. We mainly want to address the question of what the right form of information ratio is in the contextual setting, and to highlight the necessity of utilizing the context distribution.

Contributions. Our contribution is four-fold:

• We introduce a version of contextual IDS that takes the context distribution into consideration for contextual bandits with graph feedback and sparse linear contextual bandits.

• For contextual bandits with graph feedback, we prove that conditional IDS suffers an Ω(√(β(G)n)) Bayesian regret lower bound for a particular prior and graph, while contextual IDS can achieve an Õ(min{√(β(G)n), δ(G)^{1/3} n^{2/3}}) Bayesian regret upper bound for any prior. Here, n is the time horizon, G is a directed feedback graph over the set of actions, β(G) is the independence number and δ(G) is the weak domination number of the graph. In the regime where β(G) ≳ (δ(G)² n)^{1/3}, contextual IDS achieves a better regret bound than conditional IDS.

• For sparse linear contextual bandits, we prove that conditional IDS suffers an Ω(√(nds)) Bayesian regret lower bound for a particular sparse prior, while contextual IDS can achieve an Õ(min{√(nds), s n^{2/3}}) Bayesian regret upper bound for any sparse prior. Here, d is the feature dimension and s is the sparsity. In the data-poor regime where d ≳ s n^{1/3}, contextual IDS achieves a better regret bound than conditional IDS.

• We further propose a computationally efficient algorithm to approximate contextual IDS based on Actor-Critic (Konda & Tsitsiklis, 2000) and evaluate it empirically on a neural network contextual bandit.

2. Related works

Russo & Van Roy (2018) introduced IDS and derived Bayesian regret bounds for multi-armed bandits, linear bandits and combinatorial bandits. Kirschner & Krause (2018) and Kirschner et al. (2020a) investigated the use of frequentist IDS for bandits with heteroscedastic noise and for partial monitoring. Kirschner et al. (2020b) proved the asymptotic optimality of frequentist IDS for linear bandits. Hao et al. (2021a) developed a class of information-theoretic Bayesian regret bounds for sparse linear bandits that nearly match existing lower bounds on a variety of problem instances. Recently, Lu et al. (2021) proposed a general reinforcement learning framework and used IDS as a key component to build a data-efficient agent. Arumugam & Van Roy (2021) investigated different versions of learning targets when designing IDS. However, no concrete regret bounds are provided in those works.

Graph feedback. Mannor & Shamir (2011) and Alon et al. (2015) gave a full characterization of online learning with graph feedback. Tossou et al. (2017) provided the first information-theoretic analysis of Thompson sampling for bandits with graph feedback, and Liu et al. (2018a) derived a Bayesian regret bound for IDS. Both bounds depend on the clique cover number, which can be much larger than the independence number. Liu et al. (2018b) further improved the analysis when the feedback graph is unknown and derived a Bayesian regret bound for a mixture policy in terms of the independence number. In contrast, our work is the first to consider contextual bandits with graph feedback, and it derives nearly optimal regret bounds in terms of both the independence number and the domination number for a single algorithm. Very recently, Dann et al. (2020) considered reinforcement learning with graph feedback. Their O(√n) upper bound depends on the mas-number rather than the independence number. We briefly summarize the comparison in Table 1.

Sparse linear bandits. For sparse linear bandits with a fixed action set, Abbasi-Yadkori et al. (2012) proposed an inefficient online-to-confidence-set conversion approach that achieves an Õ(√(sdn)) upper bound for an arbitrary action set. Lattimore et al. (2015) developed a selective explore-then-commit algorithm that only works when the action set is exactly the binary hypercube, and derived an optimal O(s√n) upper bound. Hao et al. (2020) introduced the notion of an exploratory action set and proved a Θ(poly(s) n^{2/3}) minimax rate for the data-poor regime using an explore-then-commit algorithm. Note that the minimax lower bound automatically holds for the contextual setting when the context distribution is a Dirac on the hard problem instance. Hao et al. (2021b) extended this concept to an MDP setting.

It recently became popular to study the contextual setting. Those results cannot be reduced to our setting, since they rely on careful assumptions on the context distribution to achieve Õ(poly(s)√n) or Õ(poly(s) log(n)) regret bounds (Bastani & Bayati, 2020; Wang et al., 2018; Kim & Paik, 2019; Wang et al., 2020; Ren & Zhou, 2020; Oh et al., 2021), such that classical high-dimensional statistics can be used. However, minimax lower bounds are missing to understand whether those assumptions are fundamental. Another line of work has polynomial dependency on the number of actions (Agarwal et al., 2014; Foster & Rakhlin, 2020; Simchi-Levi & Xu, 2020). We briefly summarize the comparison in Table 2.

3. Problem Setting

We study the stochastic contextual bandit problem with a horizon of n rounds and a finite set of possible contexts S. For each context m ∈ S, there is an action set A_m. The interaction protocol is as follows. First, the environment samples a sequence of independent contexts (s_t)_{t=1}^n from a distribution ξ over S. At the start of round t, the context s_t is revealed to the agent, who may use their observations and possibly an external source of randomness to choose an action A_t ∈ A_t = A_{s_t}. Then the agent receives an observation O_{t,A_t} including an immediate reward Y_{t,A_t} as well as some side information. Here

Y_{t,a} = f(s_t, a, θ*) + η_{t,a},  if A_t = a,

where f is the reward function, θ* is the unknown parameter and η_{t,a} is 1-sub-Gaussian noise. Sometimes we write Y_t = Y_{t,A_t} for short.
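The interaction protocol above can be sketched directly in code. Below is a minimal simulation, assuming a linear form f(s, a, θ*) = φ(s, a)^⊤θ* and Gaussian noise purely for illustration; the feature map, context distribution, and the uniform placeholder policy are not part of the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: 3 contexts, 4 actions per context, linear mean reward.
n, num_contexts, num_actions, d = 100, 3, 4, 5
theta_star = rng.normal(size=d)                        # unknown parameter theta*
phi = rng.normal(size=(num_contexts, num_actions, d))  # feature map phi(s, a)
xi = np.array([0.5, 0.3, 0.2])                         # context distribution xi

def f(s, a):
    """Mean reward f(s, a, theta*) under the assumed linear model."""
    return phi[s, a] @ theta_star

history, regret = [], 0.0
for t in range(n):
    s_t = rng.choice(num_contexts, p=xi)   # context drawn i.i.d. from xi
    A_t = int(rng.integers(num_actions))   # placeholder agent: uniform policy
    Y_t = f(s_t, A_t) + rng.normal()       # reward with 1-sub-Gaussian noise
    history.append((s_t, A_t, Y_t))        # observations available at round t+1
    regret += max(f(s_t, a) for a in range(num_actions)) - f(s_t, A_t)
```

Averaging the accumulated `regret` over the prior on θ* gives the Bayesian regret studied below.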

We consider the Bayesian setting in the sense that θ* is sampled from some prior distribution. Let A = ∪_{m∈S} A_m and let the history be F_t = {(s_1, A_1, O_1), ..., (s_{t−1}, A_{t−1}, O_{t−1})}. A policy π = (π_t)_{t∈N} is a sequence of (suitably measurable) deterministic mappings from the history F_t and the context space S to a probability distribution in P(A), with the constraint that for each context m, π_t(·|m) can only put mass on P(A_m). Then the Bayesian regret of a policy π is defined as

BR(n; π) = E[ Σ_{t=1}^n max_{a∈A_t} f(s_t, a, θ*) − Σ_{t=1}^n Y_t ],  (3.1)

where the expectation is over the interaction sequence induced by the agent and environment, the context distribution and the prior distribution over θ*.

Notations. Given a measure P and jointly distributed random variables X and Y, we let P_X denote the law of X and P_{X|Y} the conditional law of X given Y, so that P_{X|Y}(·) = P(X ∈ ·|Y). The mutual information between X and Y is I(X; Y) = E[D_KL(P_{X|Y} ‖ P_X)], where D_KL is the relative entropy. We write P_t(·) = P(·|F_t) for the posterior measure, where P is the probability measure over θ* and the history, and E_t[·] = E[·|F_t]. Denote I_t(X; Y) = E_t[D_KL(P_{t,X|Y} ‖ P_{t,X})]. We write 1{·} for the indicator function. For a positive semidefinite matrix A, we let σ_min(A) be its minimum eigenvalue. Let P(A) denote the space of probability measures over a set A, equipped with the Borel σ-algebra.

Table 1. Comparison with existing results on regret upper and lower bounds for bandits with graph feedback. Here, β(G) is the independence number, δ(G) is the weak domination number and c(G) is the clique cover number of the graph. Note that β(G) ≤ c(G).

| Upper bound          | Regret                                | Contextual? | Algorithm                             |
| Alon et al. (2015)   | Õ(√(β(G)n))                           | no          | Exp3.G for strongly observable graphs |
| Alon et al. (2015)   | Õ(δ(G)^{1/3} n^{2/3})                 | no          | Exp3.G for weakly observable graphs   |
| Tossou et al. (2017) | Õ(√(c(G)n))                           | no          | Thompson sampling                     |
| Liu et al. (2018a)   | Õ(√(c(G)n))                           | no          | fixed action-set IDS                  |
| This paper           | Õ(min{√(β(G)n), δ(G)^{1/3} n^{2/3}}) | yes         | contextual IDS                        |

| Minimax lower bound  | Regret                | Contextual? | Setting                   |
| Alon et al. (2015)   | Ω(√(β(G)n))           | no          | strongly observable graph |
| Alon et al. (2015)   | Ω(δ(G)^{1/3} n^{2/3}) | no          | weakly observable graph   |

Table 2. Comparison with existing results on regret upper and lower bounds for sparse linear bandits. Here, s is the sparsity, d is the feature dimension, n is the number of rounds, K is the number of arms, C_min is the minimum eigenvalue of the data matrix and τ is a problem-dependent parameter that may have a complicated form.

| Upper bound                  | Regret                                          | Contextual? | Assumption                   | Algorithm            |
| Abbasi-Yadkori et al. (2012) | Õ(√(sdn))                                       | yes         | none                         | UCB                  |
| Sivakumar et al. (2020)      | Õ(√(sdn))                                       | yes         | adversarial + Gaussian noise | greedy               |
| Bastani & Bayati (2020)      | Õ(τ K s² (log n)²)                              | yes         | compatibility condition      | Lasso greedy         |
| Lattimore et al. (2015)      | Õ(s√n)                                          | no          | action set is hypercube      | explore-then-commit  |
| Hao et al. (2021a)           | Õ(min{√(sdn), C_min^{−1/3} s n^{2/3}})          | no          | none                         | fixed action-set IDS |
| This paper                   | Õ(min{√(sdn), C_min^{−1/3} s n^{2/3}})          | yes         | none                         | contextual IDS       |

| Minimax lower bound          | Regret                                          | Contextual? | Assumption | Algorithm |
| Hao et al. (2020)            | Ω(min{√(sdn), C_min^{−1/3} s^{1/3} n^{2/3}})    | no          | N.A.       | N.A.      |

4. Conditional IDS versus Contextual IDS

We introduce the formal definitions of conditional IDS and contextual IDS for general contextual bandits. Suppose the optimal action associated with context s_t is a*_t = argmax_{a∈A_t} f(s_t, a, θ*), which in the Bayesian setting is a random variable. We first define the one-step expected regret of taking action a ∈ A_t as

Δ_t(a|s_t) := E_t[ f(s_t, a*_t, θ*) − f(s_t, a, θ*) | s_t ],  (4.1)

and write Δ_t(s_t) ∈ R^{|A_t|} for the corresponding vector.

4.1. Conditional IDS

A natural generalization of the standard IDS design (Russo & Van Roy, 2018) to the contextual setting is conditional IDS. It has been investigated empirically by Lu et al. (2021) for the full reinforcement learning setting.
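Both conditional and contextual IDS are built from the one-step regret in Eq. (4.1), which is a posterior expectation and can therefore be approximated by Monte Carlo over posterior samples of θ*. A minimal sketch, assuming a linear reward model and using i.i.d. Gaussian draws as a stand-in for a real posterior sampler:

```python
import numpy as np

rng = np.random.default_rng(1)
num_actions, d, num_samples = 4, 5, 1000

# Stand-in for samples from the posterior P_t(theta* in . | F_t).
theta_samples = rng.normal(size=(num_samples, d))
Phi = rng.normal(size=(num_actions, d))     # features phi(s_t, a), illustrative

def one_step_regret(theta_samples, Phi):
    """Monte Carlo estimate of the vector Delta_t(s_t) from Eq. (4.1)."""
    means = theta_samples @ Phi.T                # f(s_t, a, theta) per sample
    best = means.max(axis=1)                     # f(s_t, a*_t, theta) per sample
    return (best[:, None] - means).mean(axis=0)  # average posterior regret

delta = one_step_regret(theta_samples, Phi)
```

Each entry of `delta` is nonnegative by construction, and the action minimizing it is the greedy (pure-exploitation) choice.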

Conditional on the current context, we define the information gain of taking action a with respect to the current optimal action a*_t as

I_t(a*_t; O_{t,a}|s_t) := E_t[ D_KL(P_t(a*_t ∈ ·|s_t, O_{t,a}) ‖ P_t(a*_t ∈ ·|s_t)) | s_t ],

and write I_t(a*_t, s_t) ∈ R^{|A_t|} such that [I_t(a*_t, s_t)]_a = I_t(a*_t; O_{t,a}|s_t).

We introduce the (α, λ)-conditional information ratio (CIR) as Γ^λ_{t,α} : P(A) → R such that

Γ^λ_{t,α}(π(·|s_t)) = (max{0, Δ_t(s_t)^⊤ π(·|s_t) − α})^λ / ( I_t(a*_t, s_t)^⊤ π(·|s_t) ),  (4.2)

for some parameters α, λ > 0. Conditional IDS minimizes the (α, λ)-CIR to find a probability distribution over the action space:

π_t(·|s_t) = argmin_{π(·|s_t) ∈ P(A_t)} Γ^λ_{t,α}(π(·|s_t)).

Remark 4.1. In comparison to the standard information ratio of Russo & Van Roy (2018), who specified λ = 2 and α = 0 in the non-contextual setting, we consider a generalized version introduced by Lattimore & György (2020). As observed by Lattimore & György (2020), the right value of λ depends on the dependence of the regret on the horizon. The parameter α is always chosen of order O(1/√n) for certain problems, so that its regret contribution is negligible. Their roles will become clearer in Sections 5 and 6.

4.2. Contextual IDS

A possible limitation of conditional IDS is that the CIR does not take the context distribution into consideration. As we show later, this can result in sub-optimality for certain problems. Contextual IDS, which was first proposed by Kirschner et al. (2020a) for partial monitoring, is introduced to remedy this shortcoming.

Let π* : S → A be the optimal policy such that for each m ∈ S, π*(m) = argmax_{a∈A_m} f(m, a, θ*), with ties broken arbitrarily. We define the information gain of taking action a with respect to π* as

I_t(π*; O_{t,a}) := E_t[ D_KL(P_t(π* ∈ ·|O_{t,a}) ‖ P_t(π* ∈ ·)) ],

and write I_t(π*) ∈ R^{|A_t|} for the corresponding vector. Define a policy class Π = {π : S → P(A), supp(π(·|m)) ⊆ A_m}, where supp(·) is the support of a vector. We introduce the (α, λ)-marginal information ratio (MIR) as Ψ^λ_{t,α} : Π → R such that

Ψ^λ_{t,α}(π) = (max{0, E_{s_t}[Δ_t(s_t)^⊤ π(·|s_t)] − α})^λ / E_{s_t}[ I_t(π*)^⊤ π(·|s_t) ],  (4.3)

where the expectation is taken with respect to the context distribution. Contextual IDS minimizes the (α, λ)-MIR to find a mapping from the context space to the action space:

π_t = argmin_{π ∈ Π} Ψ^λ_{t,α}(π).

When receiving context s_t at round t, the agent plays actions according to π_t(·|s_t).

We highlight the key differences between conditional IDS and contextual IDS as follows:

• Conditional IDS only optimizes a probability distribution for the current context, while contextual IDS optimizes a full mapping from the context space to the action space.

• Conditional IDS only seeks information about the current optimal action, while contextual IDS seeks information about the whole optimal policy.

Remark 4.2. We can also define the information gain in conditional IDS and contextual IDS with respect to the unknown parameter θ*. This can bring a certain computational advantage when approximating the information gain. However, θ* contains much more information to learn than the optimal action or policy, so the agent may suffer more regret.

4.3. Why can conditional IDS be myopic?

Conditional IDS myopically balances exploration and exploitation without taking the context distribution into consideration. This has the advantage of being simple to implement, but it sometimes leads to both under- and over-exploration, as the following examples illustrate. In both examples there are two contexts arriving with equal probability.

• Example 1 [UNDER-EXPLORATION] Consider a noiseless case. Context set 1 contains k actions, of which one is the optimal action and the remaining k − 1 actions yield regret 1. Context set 2 contains a revealing action with regret 1 and one action with no regret. The revealing action provides an observation of the rewards for all the k actions in context set 1. When context set 2 arrives, conditional IDS will never play the revealing action, since it incurs high immediate regret with no useful information for the current context set. However, this ignores the fact that the revealing action could be informative for the unseen context set 1. Conditional IDS under-explores and suffers O(k) regret.

In contrast, contextual IDS exploits the context distribution, plays the revealing action in context set 2, and only suffers O(1) regret.

• Example 2 [OVER-EXPLORATION] Context set 1 contains a single revealing action (hence no regret). Context set 2 has k actions. The first is a revealing action with a (known) regret of Θ(√k Δ), where Δ = Θ(1/√n). Of the remaining actions, one is optimal (zero regret) and the others have regret Δ, with the prior such that the identity of the optimal action is unknown. Contextual IDS will avoid the revealing action in context set 2, because it understands that this information can be obtained more cheaply in context set 1. Its regret is O(√n). Meanwhile, if the constants are tuned appropriately, conditional IDS will play the revealing action in context set 2 and suffer regret Ω(√(nk)).

Remark 4.3. A similar shortcoming also holds for the class of UCB and Thompson sampling algorithms, since they have no explicit way to encode the information of the context distribution.

4.4. Generic regret bound for contextual IDS

Before we step into specific examples, we first provide a generic bound for contextual IDS. Denote by I_{α,λ} the worst-case MIR, such that for any t ∈ [n], Ψ^λ_{t,α}(π_t) ≤ I_{α,λ} almost surely.

Theorem 4.4. Let π^MIR = (π_t)_{t∈[n]} be the contextual IDS policy such that π_t minimizes the (α, 2)-MIR at each round. Then the following regret upper bound holds:

BR(n; π^MIR) ≤ nα + inf_{λ≥2} 2^{1−2/λ} (I_{α,λ})^{1/λ} n^{1−1/λ} ( E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] )^{1/λ}.

The proof is deferred to Appendix A.1. An interesting consequence of Theorem 4.4 is that contextual IDS minimizing with λ = 2 can adapt to λ > 2. This shows the power of contextual IDS to adapt to different information-regret structures.

5. Contextual Bandits with Graph Feedback

We consider a Bayesian formulation of contextual k-armed bandits with graph feedback, which generalizes the setup in Alon et al. (2015). Let G = (A, E) be a directed feedback graph over the set of actions A with |A| = k. For each i ∈ A, we define its in-neighborhood and out-neighborhood as N^in(i) = {j ∈ A : (j, i) ∈ E} and N^out(i) = {j ∈ A : (i, j) ∈ E}. The feedback graph G is fixed and known to the agent.

Let the unknown parameter be θ* ∈ R^k. At each round t, the agent receives an observation characterized by G, in the sense that when taking action a ∈ A_t, the observation O_{t,a} contains the rewards Y_{t,a'} = θ*_{a'} + η_{t,a'} for each action a' ∈ N^out(a). We review two standard quantities describing the graph structure, used in Alon et al. (2015).

Definition 5.1 (Independence number). An independent set of a graph G = (A, E) is a subset S ⊆ A such that no two different i, j ∈ S are connected by an edge in E. The cardinality of a largest independent set is the independence number of G, denoted by β(G).

A vertex a is observable if N^in(a) ≠ ∅. A vertex is strongly observable if it has either a self-loop or incoming edges from all other vertices. A vertex is weakly observable if it is observable but not strongly observable. A graph G is observable if all its vertices are observable, and it is strongly observable if all its vertices are strongly observable. A graph is weakly observable if it is observable but not strongly observable.

Definition 5.2 (Weak domination number). In a directed graph G = (A, E) with a set of weakly observable vertices W ⊆ A, a weakly dominating set D ⊆ A is a set of vertices that dominates W: for any w ∈ W there exists d ∈ D such that w ∈ N^out(d). The weak domination number of G, denoted by δ(G), is the size of the smallest weakly dominating set.

5.1. Lower bound for conditional IDS

We first prove that conditional IDS suffers an Ω(√(nβ(G))) Bayesian regret lower bound for a particular prior, and show later that the optimal rate for this prior can be much smaller when β(G) is very large (Section 5.2).

Theorem 5.3. Let π^CIR be a conditional IDS policy. There exists a contextual bandit with graph feedback instance such that β(G) = k, δ(G) = 1 and

BR(n; π^CIR) ≥ (1/16) √(n(β(G) − 3)).

Proof. Let us construct a hard problem instance that generalizes Example 1 in Section 4.3. Suppose {x_1, ..., x_k} ⊆ R^k is the set of basis vectors corresponding to the k arms. The first k − 1 arms form a standard multi-armed Gaussian bandit with unit variance. When the kth arm is pulled, the agent always suffers a regret of 1, but obtains samples for the first k − 1 arms.

In terms of the language of graph feedback, the first k − 1 arms only contain self-loops, while the out-neighborhood of the last arm contains all of the first k − 1 arms. There are two context sets, and the arrival probability of each context is 1/2. Context set 1 consists of {x_{k−1}, x_k}, while context set 2 consists of {x_1, ..., x_{k−2}}. One can verify that β(G) = k and δ(G) = 1.

For each i ∈ {2, ..., k − 2}, let θ^(i) ∈ R^k with θ^(i)_j = 1{i = j} γ for j ∈ {2, ..., k − 2}, θ^(i)_1 = 0, θ^(i)_{k−1} = 0, and θ^(i)_k = γ − 1, where γ > 0 is a gap that will be chosen later. Assume the prior of θ* is uniformly distributed over {θ^(2), ..., θ^(k−2)}.

We write E_i[·] = E_{θ^(i)}[·] for short, where the expectation is taken with respect to the measures on the sequence of outcomes (A_1, Y_1, ..., A_n, Y_n) determined by θ^(i). Define the cumulative regret of policy π interacting with bandit θ^(i) as

R_{θ^(i)}(n; π) = Σ_{t=1}^n E_i[(⟨x*_{s_t}, θ^(i)⟩ − Y_t) 1(s_t = 1)] + Σ_{t=1}^n E_i[(⟨x*_{s_t}, θ^(i)⟩ − Y_t) 1(s_t = 2)],

such that BR(n; π) = (1/(k − 3)) Σ_{i=2}^{k−2} R_{θ^(i)}(n; π).

Step 1. Fix a conditional IDS policy π. From the definition of conditional IDS in Eq. (4.2), when context set 1 arrives, conditional IDS will always pull x_{k−1} for this prior, since x_{k−1} is always the optimal arm for context set 1. This means conditional IDS suffers no regret on context set 1, which implies that for any i ∈ {2, ..., k − 2},

Σ_{t=1}^n E_i[(⟨x*_{s_t}, θ^(i)⟩ − Y_t) 1(s_t = 1)] = 0.

Step 2. Define T_i(n) as the number of pulls of arm i over n rounds. On the other hand,

E_i[ Σ_{t=1}^n (⟨x*_{s_t}, θ^(i)⟩ − Y_t) 1(s_t = 2) ] = E_i[ Σ_{j=1, j≠i}^{k−2} T_j(n) ] γ = E_i[n_2 − T_i(n)] γ,

where the first equation comes from the fact that context sets 1 and 2 are disjoint, and n_2 is the number of times context 2 arrives over n rounds. Since the contexts are generated independently of the learning process, we have E_i[n_2] = n/2. By Pinsker's inequality and the divergence decomposition lemma (Lattimore & Szepesvári, 2020, Lemma 15.1), we know that with γ = (1/2)√(k/n),

Σ_{i=2}^{k−2} E_i[T_i(n)] ≤ Σ_{i=2}^{k−2} E_1[T_i(n)] + (1/4) Σ_{i=2}^{k−2} √(n k E_1[T_i(n)]) ≤ n + (1/4) k n.

Therefore,

BR(n; π) = (1/(k − 3)) Σ_{i=2}^{k−2} E_i[n_2 − T_i(n)] γ = (1/(k − 3)) ( ((k − 3)/2) n − Σ_{i=2}^{k−2} E_i[T_i(n)] ) γ ≥ (1/16) √(n(k − 3)).

This finishes the proof.

Remark 5.4. We can strengthen the lower bound so that it holds for any graph by replacing the standard multi-armed bandit instance in context set 2 with the hard instance in Alon et al. (2017, Theorem 5).

5.2. Upper bound achieved by contextual IDS

In this section, we derive a Bayesian regret upper bound achieved by contextual IDS. According to Theorem 4.4, it suffices to bound the worst-case MIR I_{α,λ} for λ = 2, 3, as well as the cumulative information gain. Let R_max be an upper bound on the maximum expected reward.

Lemma 5.5 (Squared information ratio). For any ε ∈ [0, 1] and a strongly observable graph G, the (2εR_max, 2)-MIR can be bounded as

I_{2εR_max,2} ≤ ((4R_max² + 4)/(1 − ε)) β(G) log(4k²/(β(G)ε)).

The proof is deferred to Appendix B.1; the proof idea is to bound the worst-case squared information ratio by the squared information ratio of Thompson sampling.

Next we bound I_{α,λ} for λ = 3 in terms of an explorability constant.

Definition 5.6 (Explorability constant). We define the explorability constant as

ϑ(G, ξ) := max_{π:S→P(A)} min_{d∈A} E_{s∼ξ, a∼π(·|s)}[ 1{d ∈ N^out(a)} ].
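On small instances, the explorability constant in Definition 5.6 can be probed numerically. The sketch below uses an illustrative toy graph and context distribution: it scores a stochastic policy by min_d E_{s∼ξ, a∼π(·|s)}[1{d ∈ N^out(a)}] and brute-forces the maximum over deterministic policies, which lower-bounds ϑ(G, ξ):

```python
import itertools
import numpy as np

# Toy instance (illustrative): 2 contexts over the same 3 actions, and
# adj[a, d] = 1 iff d is in N_out(a), self-loops included.
xi = np.array([0.5, 0.5])
adj = np.array([[1, 0, 0],
                [0, 1, 1],
                [1, 1, 0]])
num_contexts, num_actions = 2, 3

def explorability(policy):
    """min over vertices d of E_{s~xi, a~pi(.|s)}[1{d in N_out(a)}],
    for a row-stochastic policy of shape (num_contexts, num_actions)."""
    marginal = xi @ policy         # action distribution averaged over contexts
    return (marginal @ adj).min()  # chance the least-covered vertex is observed

best = 0.0
for assignment in itertools.product(range(num_actions), repeat=num_contexts):
    policy = np.eye(num_actions)[list(assignment)]   # deterministic policy
    best = max(best, explorability(policy))
```

Since the true maximum in Definition 5.6 is over stochastic policies, `best` is only a lower bound in general; on some graphs the maximizer needs to randomize.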

Lemma 5.7 (Cubic information ratio). For any ε ∈ [0, 1] and a weakly observable graph G, the (2εR_max, 3)-MIR can be bounded as

I_{2εR_max,3} ≤ (R_max³ + R_max) / ϑ(G, ξ).

The proof is deferred to Appendix B.2; the proof idea is to bound the worst-case cubic information ratio by the cubic information ratio of a policy that mixes Thompson sampling with a policy maximizing the explorability constant.

Lemma 5.8. The cumulative information gain can be bounded as

E[ Σ_{t=1}^n E_t[ I_t(π*)^⊤ π_t(·|s_t) ] ] ≤ H(π*) ≤ M log(k),

where H(·) is the entropy and M is the number of available contexts.

The proof is deferred to Appendix B.3. Combining the results in Lemmas 5.5-5.8 with the generic regret bound in Theorem 4.4, we obtain the following theorem.

Theorem 5.9 (Regret bound for graph contextual IDS). Suppose π^MIR = (π_t)_{t∈N}, where π_t minimizes the (2R_max/√n, 2)-MIR at each round. If G is strongly observable, we have

BR(n; π^MIR) ≤ C R_max min{ (2M log(k)/ϑ(G, ξ))^{1/3} n^{2/3}, √( β(G) log(4k²√n/β(G)) · n M log(k) ) },

where C is an absolute constant.

Ignoring constant and logarithmic terms, the Bayesian regret upper bound in Theorem 5.9 can be simplified to

Õ( min{ √(β(G)Mn), (M/ϑ(G, ξ))^{1/3} n^{2/3} } ).

When the independence number is large enough that β(G) ≥ (n/(ϑ(G, ξ)² M))^{1/3}, this bound is dominated by Õ((M/ϑ(G, ξ))^{1/3} n^{2/3}), which is independent of the independence number. Together with Theorem 5.3, we can argue that conditional IDS is sub-optimal compared with contextual IDS in this regime.

Remark 5.10 (Connection between ϑ(G, ξ) and δ(G)). Let D ⊆ A be the smallest weakly dominating set of the full graph, with |D| = δ(G). Consider a policy µ such that µ(·|s_t) is uniform over D ∩ A_t. Then we have

min_{d∈A} E_{s∼ξ, a∼µ(·|s)}[ 1{d ∈ N^out(a)} ] ≥ 1/(M δ(G)),

which implies 1/ϑ(G, ξ) ≤ M δ(G). When M = 1, Alon et al. (2015) demonstrated that the minimax optimal rate for weakly observable graphs is Θ̃(δ(G)^{1/3} n^{2/3}), which implies 1/ϑ(G) ≳ δ(G) up to logarithmic factors.

Remark 5.11 (Open problem). For the fixed action set setting, Alon et al. (2015) proved that the independence number and the weak domination number are the fundamental quantities describing the graph structure. However, our regret upper bound involves the number of contexts M. It is interesting to investigate whether M can be removed for contextual bandits with graph feedback.

Note that one can also bound H(π*) ≤ H(θ*) ≤ k, in which case the dependency on M does not appear. But the right dependency on M and k for the optimal regret rate in the contextual setting remains unclear.

6. Sparse Linear Contextual Bandits

We consider a Bayesian formulation of sparse linear contextual bandits. The notion of sparsity can be defined through the parameter space Θ:

Θ = { θ ∈ R^d : Σ_{j=1}^d 1{θ_j ≠ 0} ≤ s, ‖θ‖_2 ≤ 1 }.

We assume s is known; this can be relaxed by putting a prior on it. We consider the case where A_m = A for all m ∈ S. Suppose φ : S × A → R^d is a feature map known to the agent. In sparse linear contextual bandits, the mean reward of taking action a is f(s_t, a, θ*) = φ(s_t, a)^⊤ θ*, observed as Y_{t,a} = φ(s_t, a)^⊤ θ* + η_{t,a}, where θ* is a random variable taking values in Θ; we denote its prior distribution by ρ.

We first define a quantity describing the explorability of the feature set.

Definition 6.1 (Explorability constant). We define the explorability constant as

C_min(φ, ξ) := max_{π:S→P(A)} σ_min( E_{s∼ξ} E_{a∼π(·|s)}[ φ(s, a) φ(s, a)^⊤ ] ).
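Like ϑ(G, ξ) above, C_min(φ, ξ) can be probed numerically on small instances: fix a candidate policy, form the expected design matrix E_{s∼ξ}E_{a∼π(·|s)}[φ(s, a)φ(s, a)^⊤], and take its smallest eigenvalue; the best value over any family of candidate policies lower-bounds C_min(φ, ξ). A minimal sketch with illustrative random features:

```python
import numpy as np

rng = np.random.default_rng(2)
num_contexts, num_actions, d = 3, 4, 3
xi = np.full(num_contexts, 1.0 / num_contexts)         # context distribution
phi = rng.normal(size=(num_contexts, num_actions, d))  # illustrative features

def min_eig(policy):
    """sigma_min of the expected design matrix under a row-stochastic policy."""
    design = np.zeros((d, d))
    for s in range(num_contexts):
        for a in range(num_actions):
            design += xi[s] * policy[s, a] * np.outer(phi[s, a], phi[s, a])
    return np.linalg.eigvalsh(design)[0]   # eigvalsh sorts ascending

# Lower-bound C_min(phi, xi): uniform policy plus random softmax policies.
uniform = np.full((num_contexts, num_actions), 1.0 / num_actions)
best = min_eig(uniform)
for _ in range(200):
    logits = rng.normal(size=(num_contexts, num_actions))
    policy = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    best = max(best, min_eig(policy))
```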

Remark 6.2. The explorability constant is a problem-dependent constant that was previously introduced in Hao et al. (2020; 2021a) for non-contextual sparse linear bandits. For an easy instance where {φ(s_t, a)}_{a∈A} is a full hypercube for every s_t ∈ S, C_min(φ, ξ) can be as large as 1, attained by the policy that samples uniformly from the corners of the hypercube no matter which context arrives.

6.1. Lower bound for conditional IDS

We prove an algorithm-dependent lower bound for a sparse linear contextual bandit instance.

Theorem 6.3. Let π^CIR be a conditional IDS policy. There exists a sparse linear contextual bandit instance whose explorability constant is 1/2 such that

BR(n; π^CIR) ≥ (1/16) √(nds).

How about the principle of optimism? Hao et al. (2021a, Section 4) have shown that even in the non-contextual case, optimism-based algorithms such as UCB or Thompson sampling fail to optimally balance information and regret, and result in a sub-optimal regret bound.

6.2. Upper bound achieved by contextual IDS

In this section, we prove that contextual IDS can achieve a nearly dimension-free regret bound. We say that the feature set has sparse optimal actions if the optimal action is s-sparse for each context, almost surely with respect to the prior.

Lemma 6.4 (Bounds for the information ratio). We have I_{0,2} ≤ d/2. When the feature set has sparse optimal actions, we have I_{0,3} ≤ s²/(4 C_min(φ, ξ)).

From the definition of π*(m) = argmax_{a∈A_m} φ(m, a)^⊤ θ*, we know that π* is a deterministic function of θ*. By the data processing lemma, we have I(π*; F_{n+1}) ≤ I(θ*; F_{n+1}). According to Lemma 5.8 in Hao et al. (2021a), we have the following lemma.

Lemma 6.5. The cumulative information gain can be bounded as

E[ Σ_{t=1}^n E_t[ I_t(π*)^⊤ π_t(·|s_t) ] ] ≤ 2s log(d n^{1/2}/s).

Combining the results in Lemmas 6.4-6.5 with the generic regret bound in Theorem 4.4, we obtain the following regret bound.

Theorem 6.6 (Regret bound for sparse contextual IDS). Suppose π^MIR = (π_t)_{t∈N}, where π_t = argmin_π Ψ²_{t,0}(π). When the feature set has sparse optimal actions, the following regret bound holds:

BR(n; π^MIR) ≲ min{ √(nds log(dn^{1/2}/s)), s n^{2/3} (log(dn^{1/2}/s))^{1/3} / (C_min(φ, ξ))^{1/3} }.

In the data-poor regime where d ≳ n^{1/3} s / C_min^{2/3}, contextual IDS achieves an Õ(sn^{2/3}) regret bound that is tighter than the lower bound for conditional IDS. This rate matches the minimax lower bound derived in Hao et al. (2020) up to an s factor in the data-poor regime.

7. Practical Algorithm

As shown in Section 4.1, conditional IDS only needs to optimize over probability distributions. As proved by Russo & Van Roy (2018, Proposition 6), it suffices for conditional IDS to randomize over two actions at each round, and the computational complexity was further improved to O(|A|) by Kirschner (2021, Lemma 2.7). However, contextual IDS as described in Section 4.2 needs to optimize over a mapping from the context space to the action space at each round, which in general is much more computationally demanding.

To obtain a practical and scalable algorithm, we approximate contextual IDS using Actor-Critic (Konda & Tsitsiklis, 2000). We parametrize the policy class by a neural network and optimize the information ratio by multi-step stochastic gradient descent.

Consider a parametrized policy class Π_θ = {π_θ : S → P(A)}. The policy, which is a conditional probability distribution π_θ, can be parameterized with a neural network that maps (deterministically) from a context s to a probability distribution over A. We further parametrize the critic, which is the reward function, by a value network Q_θ.

To avoid additional approximation errors, we assume we can sample from the true context distribution. At each round, the parametrized contextual IDS minimizes the following empirical MIR loss through SGD:

min_{π∈Π_θ} ( Σ_{i=1}^w Δ_t(s_t^(i))^⊤ π(·|s_t^(i)) )² / ( Σ_{i=1}^w I_t(π*)^⊤ π(·|s_t^(i)) ),  (7.1)

where {s_t^(1), ..., s_t^(w)} are independent samples of contexts. We use Epistemic Neural Networks (ENN) (Osband et al., 2021a) to quantify the posterior uncertainty of the value network. For each given s_t^(i), following the procedure in Sections 6.3.3 and 6.3.4 of Lu et al. (2021), we
Contextual Information-Directed Sampling

can approximate the one-step expected regret and the in-


formation gain efficiently using the samples outputted by
ENN.
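The empirical MIR objective in Eq. (7.1) is cheap to evaluate once per-action regret and information-gain estimates are available. Below is a minimal pure-Python sketch; `regret_fn` and `info_fn` are hypothetical stand-ins for the ENN-based estimators, and the numbers are arbitrary. It illustrates only the objective, not the SGD training loop over a policy network.

```python
def empirical_mir_loss(policy, contexts, regret_fn, info_fn):
    """Empirical MIR objective of Eq. (7.1) for a fixed round t:
    (sum_i Delta_t(s_i)^T pi(.|s_i))^2 / (sum_i I_t(pi*)^T pi(.|s_i))."""
    num = 0.0
    den = 0.0
    for s in contexts:
        pi = policy(s)                     # action distribution for context s
        num += sum(p * d for p, d in zip(pi, regret_fn(s)))
        den += sum(p * g for p, g in zip(pi, info_fn(s)))
    return num ** 2 / den

# Two-action toy: action 0 is low regret / low information,
# action 1 is high regret / high information (illustrative values).
regret_fn = lambda s: [0.1, 1.0]
info_fn = lambda s: [0.01, 1.0]
contexts = [0.0] * 20                      # stand-ins for the samples s_t^(1..w)

loss_greedy = empirical_mir_loss(lambda s: [1.0, 0.0], contexts, regret_fn, info_fn)
loss_mixed = empirical_mir_loss(lambda s: [0.5, 0.5], contexts, regret_fn, info_fn)
```

A mixed policy already attains a smaller empirical ratio than always playing the low-regret action, which is exactly the trade-off the parametrized contextual IDS descends over the weights of the policy network.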
8. Experiment

We conduct some preliminary experiments to evaluate the parametrized contextual IDS on a neural network contextual bandit.
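A minimal pure-Python sketch of such a neural network contextual bandit, using the configuration reported below in this section (contexts x_t ∈ R^10 drawn from N(0, I_10), 5 actions, a 2-hidden-layer ReLU MLP link function f_θ*, standard Gaussian reward noise); the weight scaling is an illustrative choice, not the paper's exact initialization.

```python
import math
import random

random.seed(0)
D, H, K = 10, 10, 5     # context dimension, hidden width, number of actions

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1 / math.sqrt(cols)) for _ in range(cols)]
            for _ in range(rows)]

W1, W2, W3 = rand_matrix(H, D), rand_matrix(H, H), rand_matrix(K, H)

def mlp(x):
    """2-hidden-layer ReLU MLP f_theta*: R^D -> R^K (the link function)."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    h = [max(0.0, sum(w * hi for w, hi in zip(row, h))) for row in W2]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W3]

def step(action):
    """One round: draw x_t ~ N(0, I_D), return the context and a noisy reward."""
    x = [random.gauss(0, 1) for _ in range(D)]
    means = mlp(x)
    reward = means[action] + random.gauss(0, 1)   # Y_{t,a} = [f(x_t)]_a + eta_{t,a}
    return x, reward

x, r = step(action=2)
```

In the experiment the agents only observe (x_t, reward) pairs and maintain ENN posteriors over f_θ*; this sketch only defines the environment side.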
At each round t, the environment independently generates an observation in the form of a d-dimensional contextual vector x_t from some distribution. Let f_θ* : R^d → R^{|A|} be a neural network link function. When the agent takes action a, she receives a reward of the form Y_{t,a} = [f_θ*(x_t)]_a + η_{t,a}. This is the non-parameter-sharing formulation of contextual bandits, which means each arm has its own parameters.

We set the generative model f_θ* to be a 2-hidden-layer ReLU MLP with 10 hidden neurons. The number of actions is 5. The contextual vector x_t ∈ R^10 is sampled from N(0, I_10) and the noise is sampled from a standard Gaussian distribution.

We compare contextual IDS with conditional IDS and Thompson sampling. For a fair comparison, we use the same ENN architecture to obtain posterior samples for all three agents. As reported by Osband et al. (2021b), ensemble sampling with a randomized prior function tends to be the ENN that best balances computation and accuracy, so we use 10 ensembles in our experiment. With 200 posterior samples, we use the approach described by Lu et al. (2021) to approximate the one-step regret and information gain for both conditional IDS and contextual IDS. We sample 20 independent contextual vectors at each round. Both the policy network and the value network are 2-hidden-layer ReLU MLPs with 20 hidden neurons, optimized by Adam with learning rate 0.001. We report our results in Figure 1, where parametrized contextual IDS achieves reasonably good regret.

[Figure 1. Cumulative regret for a neural network contextual bandit.]

9. Discussion and Future Work

In this work, we investigate the right form of information ratio for contextual bandits and emphasize the importance of utilizing context-distribution information, through contextual bandits with graph feedback and sparse linear contextual bandits. For linear contextual bandits with a moderately small number of actions, one direction for future work is to see whether contextual IDS can achieve an O(√(dn log(k))) regret bound.

References

Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1-9, 2012.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638-1646. PMLR, 2014.

Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pp. 23-35. PMLR, 2015.

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785-1826, 2017.

Arumugam, D. and Van Roy, B. The value of information when deciding what to learn. Advances in Neural Information Processing Systems, 34, 2021.

Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 68(1):276-294, 2020.

Dann, C., Mansour, Y., Mohri, M., Sekhari, A., and Sridharan, K. Reinforcement learning with feedback graphs. Advances in Neural Information Processing Systems, 33, 2020.

Foster, D. and Rakhlin, A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199-3210. PMLR, 2020.
Hao, B., Lattimore, T., and Wang, M. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 33:10753-10763, 2020.

Hao, B., Lattimore, T., and Deng, W. Information directed sampling for sparse linear bandits. Advances in Neural Information Processing Systems, 34, 2021a.

Hao, B., Lattimore, T., Szepesvári, C., and Wang, M. Online sparse reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 316-324. PMLR, 2021b.

Kim, G.-S. and Paik, M. C. Doubly-robust lasso bandit. In Advances in Neural Information Processing Systems, pp. 5869-5879, 2019.

Kirschner, J. Information-Directed Sampling: Frequentist Analysis and Applications. PhD thesis, ETH Zurich, 2021.

Kirschner, J. and Krause, A. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pp. 358-384. PMLR, 2018.

Kirschner, J., Lattimore, T., and Krause, A. Information directed sampling for linear partial monitoring. In Conference on Learning Theory, pp. 2328-2369. PMLR, 2020a.

Kirschner, J., Lattimore, T., Vernade, C., and Szepesvári, C. Asymptotically optimal information-directed sampling. arXiv preprint arXiv:2011.05944, 2020b.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008-1014, 2000.

Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems, 20(1):96-1, 2007.

Lattimore, T. and György, A. Mirror descent and the information ratio. arXiv preprint arXiv:2009.12228, 2020.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Lattimore, T., Crammer, K., and Szepesvári, C. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pp. 964-972, 2015.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661-670, 2010.

Liu, F., Buccapatnam, S., and Shroff, N. Information directed sampling for stochastic bandits with graph feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018a.

Liu, F., Zheng, Z., and Shroff, N. Analysis of Thompson sampling for graphical bandits without the graphs. arXiv preprint arXiv:1805.08930, 2018b.

Lu, X., Van Roy, B., Dwaracherla, V., Ibrahimi, M., Osband, I., and Wen, Z. Reinforcement learning, bit by bit. arXiv preprint arXiv:2103.04047, 2021.

Mannor, S. and Shamir, O. From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems, 24:684-692, 2011.

Oh, M.-h., Iyengar, G., and Zeevi, A. Sparsity-agnostic lasso bandit. In International Conference on Machine Learning, pp. 8271-8280. PMLR, 2021.

Osband, I., Wen, Z., Asghari, M., Ibrahimi, M., Lu, X., and Van Roy, B. Epistemic neural networks. arXiv preprint arXiv:2107.08924, 2021a.

Osband, I., Wen, Z., Asghari, S. M., Dwaracherla, V., Hao, B., Ibrahimi, M., Lawson, D., Lu, X., O'Donoghue, B., and Van Roy, B. Evaluating predictive distributions: Does Bayesian deep learning work? arXiv preprint arXiv:2110.04629, 2021b.

Ren, Z. and Zhou, Z. Dynamic batch learning in high-dimensional sparse linear contextual bandits. arXiv preprint arXiv:2008.11918, 2020.

Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.

Russo, D. and Van Roy, B. Learning to optimize via information-directed sampling. Operations Research, 66(1):230-252, 2018.

Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN, 2020.
Sivakumar, V., Wu, Z. S., and Banerjee, A. Structured linear contextual bandits: A sharp and geometric smoothed analysis. International Conference on Machine Learning, 2020.

Tossou, A. C., Dimitrakakis, C., and Dubhashi, D. Thompson sampling for stochastic bandits with graph feedback. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Wang, X., Wei, M., and Yao, T. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning, pp. 5200-5208, 2018.

Wang, Y., Chen, Y., Fang, E. X., Wang, Z., and Li, R. Nearly dimension-independent sparse linear bandit over small action spaces via best subset selection. arXiv preprint arXiv:2009.02003, 2020.
A. General Results

A.1. Proof of Theorem 4.4

Proof. From the definitions in Eqs. (3.1) and (4.1), we can decompose the Bayesian regret of contextual IDS as

    BR(n; π^MIR) = nα + E[ Σ_{t=1}^n E_{s_t}[ (Δ_t(s_t) − α)^⊤ π_t(·|s_t) ] ].    (A.1)

Recall that at each round contextual IDS follows

    π_t = argmin_{π:S→P(A)} Ψ²_{t,α}(π) = argmin_{π:S→P(A)} max{0, E_{s_t}[Δ_t(s_t)^⊤ π(·|s_t)] − α}² / E_{s_t}[I_t(π*)^⊤ π(·|s_t)].

Moreover, we define

    q_{t,λ} = argmin_{π:S→P(A)} Ψ^λ_{t,α}(π) = argmin_{π:S→P(A)} max{0, E_{s_t}[Δ_t(s_t)^⊤ π(·|s_t)] − α}^λ / E_{s_t}[I_t(π*)^⊤ π(·|s_t)].

Suppose for a moment that E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α > 0. Let M be the number of contexts. Then the derivative can be written as

    ∇_{π(·|m)} Ψ²_{t,α}(π_t) = 2 p_m Δ_t(m) (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α) / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)] − p_m I_t(m) (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)² / (E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)])²,

where p_m is the arrival probability of context m. Using the first-order optimality condition, for each m ∈ [M],

    E_{s_t}[ ⟨∇_{π(·|s_t)} Ψ²_{t,α}(π_t), q_{t,λ}(·|s_t) − π_t(·|s_t)⟩ ] = Σ_{m=1}^M p_m ⟨∇_{π(·|m)} Ψ²_{t,α}(π_t), q_{t,λ}(·|m) − π_t(·|m)⟩ ≥ 0.

This implies

    2 E_{s_t}[Δ_t(s_t)^⊤ (q_{t,λ}(·|s_t) − π_t(·|s_t))] (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α) / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]
    ≥ E_{s_t}[I_t(π*)^⊤ (q_{t,λ}(·|s_t) − π_t(·|s_t))] (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)² / (E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)])².

Since we assume E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α > 0 and the information gain is always non-negative, we have

    2 E_{s_t}[Δ_t(s_t)^⊤ (q_{t,λ}(·|s_t) − π_t(·|s_t))] ≥ E_{s_t}[I_t(π*)^⊤ (q_{t,λ}(·|s_t) − π_t(·|s_t))] (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α) / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)],

which implies

    2 (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α) ≥ (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α) (1 + E_{s_t}[I_t(π*)^⊤ q_{t,λ}(·|s_t)] / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]) ≥ E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α.

Putting the above results together,

    (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)^λ / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]
    = (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)² (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)^{λ−2} / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]
    ≤ 2^{λ−2} (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)² (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α)^{λ−2} / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]
    ≤ 2^{λ−2} (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α)² (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α)^{λ−2} / E_{s_t}[I_t(π*)^⊤ q_{t,λ}(·|s_t)]
    = 2^{λ−2} (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α)^λ / E_{s_t}[I_t(π*)^⊤ q_{t,λ}(·|s_t)].

Recall that I_{α,λ} is the worst-case information ratio, so that for any t ∈ [n], Ψ^λ_{t,α}(q_{t,λ}) ≤ I_{α,λ} almost surely. Therefore,

    E[Δ_t(s_t)^⊤ π_t(·|s_t)] − α ≤ 2^{1−2/λ} (E[I_t(π*)^⊤ π_t(·|s_t)])^{1/λ} I_{α,λ}^{1/λ},    (A.2)

which is obvious when E[Δ_t(s_t)^⊤ π_t(·|s_t)] − α ≤ 0. Combining Eqs. (A.1) and (A.2) and using Hölder's inequality, we obtain

    BR(n; π^MIR) ≤ nα + 2^{1−2/λ} I_{α,λ}^{1/λ} n^{1−1/λ} ( E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] )^{1/λ}.

This ends the proof.
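The pivotal step above — that the Ψ² minimizer π_t is within a factor 2^{λ−2} of the optimal Ψ^λ value — can be sanity-checked numerically. The sketch below grid-searches the simplex for a single context with three actions; the regret and information-gain values are arbitrary test numbers, not quantities from the paper.

```python
def psi(pi, delta, info, lam):
    """Information ratio Psi^lambda_{t,0}(pi) for a single context."""
    reg = sum(p * d for p, d in zip(pi, delta))
    gain = sum(p * g for p, g in zip(pi, info))
    return max(0.0, reg) ** lam / gain

def argmin_on_simplex(obj, n=50):
    """Coarse grid search over the probability simplex on 3 actions."""
    grid = ((i / n, j / n, (n - i - j) / n)
            for i in range(n + 1) for j in range(n + 1 - i))
    return min(grid, key=obj)

delta = (0.2, 0.7, 1.0)   # per-action one-step regret (arbitrary)
info = (0.05, 0.6, 1.2)   # per-action information gain (arbitrary)
lam = 3
pi_t = argmin_on_simplex(lambda p: psi(p, delta, info, 2))    # IDS: minimize Psi^2
q_t = argmin_on_simplex(lambda p: psi(p, delta, info, lam))   # Psi^lambda minimizer
lhs = psi(pi_t, delta, info, lam)
rhs = 2 ** (lam - 2) * psi(q_t, delta, info, lam)
```

Up to grid resolution, `lhs <= rhs` reproduces Ψ^λ_{t,0}(π_t) ≤ 2^{λ−2} Ψ^λ_{t,0}(q_{t,λ}), and the Ψ² minimizer beats every single-action ratio Δ_a²/I_a.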

B. Contextual Bandits with Graph Feedback

B.1. Proof of Lemma 5.5

Proof. We bound the squared information ratio in terms of the independence number β(G). Let π_t^TS(·|s_t) = P_t(a*_t = ·) and consider a mixture policy π_t^mix = (1 − ε)π_t^TS + ε/k with some mixing parameter ε ∈ [0, 1] that will be determined later.

First, we derive an upper bound for the one-step regret. From the definition of the one-step regret,

    Δ_t(s_t)^⊤ π_t^mix(·|s_t) = Σ_{a∈A_t} π_t^mix(a|s_t) E_t[Y_{t,a*_t} − Y_{t,a} | s_t]
    = (1 − ε) Σ_{a∈A_t} π_t^TS(a|s_t) E_t[Y_{t,a*_t} − Y_{t,a} | s_t] + ε Σ_{a∈A_t} (1/k) E_t[Y_{t,a*_t} − Y_{t,a} | s_t].    (B.1)

Recall that R_max is the upper bound of the maximum expected reward and let d_t(a, a′) = D_KL( P_t(Y_{t,a} ∈ · | a*_t = a′) || P_t(Y_{t,a} ∈ ·) ). It is easy to see that Y_{t,a} is a √(R²_max + 1)-sub-Gaussian random variable. For the first term of Eq. (B.1), according to Lemma 3 in Russo & Van Roy (2014), we have

    Σ_{a∈A_t} π_t^TS(a|s_t) E_t[Y_{t,a*_t} − Y_{t,a} | s_t] ≤ Σ_{a∈A_t} π_t^TS(a|s_t) √( (R²_max + 1)/2 · d_t(a, a) ).    (B.2)

For the second term of Eq. (B.1), we directly bound it by

    ε Σ_{a∈A_t} (1/k) E_t[Y_{t,a*_t} − Y_{t,a} | s_t] ≤ 2εR_max.    (B.3)

Putting Eqs. (B.1)-(B.3) together, we have

    Δ_t(s_t)^⊤ π_t^mix(·|s_t) − 2εR_max ≤ Σ_{a∈A_t} π_t^TS(a|s_t) √( (R²_max + 1)/2 · d_t(a, a) ).

Second, we derive a lower bound for the information gain. Let E = (a_{ij})_{1≤i,j≤|A|} be the adjacency matrix representing the graph feedback structure G, so that a_{ij} = 1 if there exists an edge (i, j) ∈ E and a_{ij} = 0 otherwise. Since for any a_i, a_j ∈ N^out(a), Y_{t,a_i} and Y_{t,a_j} are mutually independent,

    I_t(π*; (Y_{t,a′})_{a′∈N^out(a)}) ≥ Σ_{a′∈N^out(a)} I_t(π*; Y_{t,a′}) = Σ_{a′∈A_t} E(a, a′) I_t(π*; Y_{t,a′}).

Therefore,

    I_t(π*)^⊤ π_t^mix(·|s_t) ≥ Σ_{a∈A_t} I_t(π*; (Y_{t,a′})_{a′∈N^out(a)}) π_t^mix(a|s_t)
    ≥ Σ_{a∈A_t} Σ_{a′∈A_t} E(a, a′) I_t(π*; Y_{t,a′}) π_t^mix(a|s_t)
    = Σ_{a∈A_t} ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) I_t(π*; Y_{t,a})
    = Σ_{a∈A_t} ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) Σ_{a′∈A_t} π_t^TS(a′|s_t) d_t(a, a′)
    ≥ Σ_{a∈A_t} ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) π_t^TS(a|s_t) d_t(a, a)
    = Σ_{a∈A_t} [ ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) / π_t^TS(a|s_t) ] π_t^TS(a|s_t)² d_t(a, a).

Let us denote

    U_{a,t} = ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) / π_t^TS(a|s_t).

Putting the above together,

    Δ_t(s_t)^⊤ π_t^mix(·|s_t) − 2εR_max ≤ √( (R²_max + 1)/2 ) √( Σ_{a∈A_t} 1/U_{a,t} ) √( Σ_{a∈A_t} U_{a,t} π_t^TS(a|s_t)² d_t(a, a) )
    ≤ √( (R²_max + 1)/2 ) √( Σ_{a∈A_t} 1/U_{a,t} ) √( I_t(π*)^⊤ π_t^mix(·|s_t) ),

where the first inequality is due to the Cauchy-Schwarz inequality. This implies

    (Δ_t(s_t)^⊤ π_t^mix(·|s_t) − 2εR_max)² / ( I_t(π*)^⊤ π_t^mix(·|s_t) )
    ≤ (R²_max + 1)/2 · Σ_{a∈A_t} π_t^TS(a|s_t) / ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) )
    ≤ (R²_max + 1)/(2(1 − ε)) · Σ_{a∈A_t} π_t^mix(a|s_t) / ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ),

where we use π_t^mix ≥ (1 − ε)π_t^TS. Using Jensen's inequality,

    (E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] − 2εR_max)² / E_{s_t}[I_t(π*)^⊤ π_t^mix(·|s_t)]
    ≤ E_{s_t}[ (Δ_t(s_t)^⊤ π_t^mix(·|s_t) − 2εR_max)² / ( I_t(π*)^⊤ π_t^mix(·|s_t) ) ]
    ≤ (R²_max + 1)/(2(1 − ε)) · E_{s_t}[ Σ_{a∈A_t} π_t^mix(a|s_t) / ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) ].

Next we restate Lemma 5 from Alon et al. (2015) to bound the right-hand side.

Lemma B.1. Let G = (A, E) be a directed graph with |A| = k. Assume a distribution π over A such that π(i) ≥ η for all i ∈ A with some constant 0 < η < 0.5. Then we have

    Σ_{i∈A} π(i) / ( π(i) + Σ_{j∈N^in(i)} π(j) ) ≤ 4β(G) log( 4k / (β(G)η) ).

Using Lemma B.1 with η = ε/k,

    (E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] − 2εR_max)² / E_{s_t}[I_t(π*)^⊤ π_t^mix(·|s_t)] ≤ (2R²_max + 2)/(1 − ε) · 2β(G) log( 4k² / (β(G)ε) ).

According to the definition of contextual IDS, this ends the proof.
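Lemma B.1 (restated from Alon et al., 2015) is easy to check numerically on a small feedback graph; the graph and the distribution below are arbitrary illustrations, not objects from the paper.

```python
import itertools
import math

k = 4
# Out-neighborhoods of a directed 4-cycle with self-loops: playing i reveals N_out(i).
n_out = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 0}}

def independence_number(k, n_out):
    """beta(G): size of the largest set with no edge between distinct members."""
    for r in range(k, 0, -1):
        for S in itertools.combinations(range(k), r):
            if all(j not in n_out[i] for i in S for j in S if i != j):
                return r
    return 0

eta = 0.05
pi = [eta, eta, eta, 1 - 3 * eta]          # a distribution with pi(i) >= eta
n_in = {i: {j for j in range(k) if i in n_out[j] and j != i} for i in range(k)}
lhs = sum(pi[i] / (pi[i] + sum(pi[j] for j in n_in[i])) for i in range(k))
beta = independence_number(k, n_out)
rhs = 4 * beta * math.log(4 * k / (beta * eta))
```

For the 4-cycle, β(G) = 2 and the left-hand side evaluates to 2, far below the bound 8 log(160) ≈ 40.6, so the lemma is loose but valid here.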

B.2. Proof of Lemma 5.7

Proof. We bound the worst-case marginal information ratio with λ = 3. Define an exploratory policy μ : S → P(A) such that

    μ = argmax_π min_d E_{s, a∼π(·|s)}[ 1(d ∈ N^out(a)) ].

Consider a mixture policy π_t^mix = (1 − ε)π_t^TS + εμ for some mixing parameter ε ∈ [0, 1]. Since for any a_i, a_j ∈ N^out(a), Y_{t,a_i} and Y_{t,a_j} are mutually independent, we have

    E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] = E_{s_t, a∼π_t^mix(·|s_t)}[ I_t(π*; (Y_{t,a′})_{a′∈N^out(a)}) ] ≥ E_{s_t, a∼π_t^mix(·|s_t)}[ Σ_{a′∈N^out(a)} I_t(π*; Y_{t,a′}) ].

From the definition of the mixture policy, π_t^mix(a|s_t) ≥ εμ(a|s_t) for any a ∈ A_t. Then we have

    E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≥ ε E_{s_t, a∼μ(·|s_t)}[ Σ_{a′∈N^out(a)} I_t(π*; Y_{t,a′}) ]
    = ε E_{s_t, a∼μ(·|s_t)}[ Σ_{a′∈A_t} I_t(π*; Y_{t,a′}) 1(a′ ∈ N^out(a)) ].

Then we have

    E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≥ ε E_{s_t}[ Σ_{a′∈A_t} I_t(π*; Y_{t,a′}) ] · min_{a′} E_{s_t, a∼μ(·|s_t)}[ 1(a′ ∈ N^out(a)) ]
    ≥ ε ϑ(G, ξ) E_{s_t}[ Σ_{a′∈A_t} I_t(π*; Y_{t,a′}) ]    (B.4)
    ≥ ( 2εϑ(G, ξ) / (R²_max + 1) ) E_{s_t}[ Σ_ζ P_t(π* = ζ) Σ_{a∈A_t} ( E_t[Y_{t,a} | π* = ζ] − E_t[Y_{t,a}] )² ],

where the last inequality is from Pinsker's inequality.

On the other hand, by the definition of the mixture policy and Jensen's inequality,

    E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] ≤ (1 − ε) E_{s_t, a∼π_t^TS(·|s_t)}[ E_t[Y_{t,a*_t} − Y_{t,a}] ] + 2εR_max
    ≤ E_{s_t, a∼π_t^TS(·|s_t)}[ E_t[Y_{t,a} | a*_t = a] − E_t[Y_{t,a}] ] + 2εR_max
    ≤ √( E_{s_t}[ Σ_{a∈A_t} P_t(a*_t = a) ( E_t[Y_{t,a} | a*_t = a] − E_t[Y_{t,a}] )² ] ) + 2εR_max
    ≤ √( E_{s_t}[ Σ_ζ Σ_{a∈A_t} P_t(π* = ζ) ( E_t[Y_{t,a} | π* = ζ] − E_t[Y_{t,a}] )² ] ) + 2εR_max.

Combining with the lower bound on the information gain in Eq. (B.4),

    E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] ≤ √( (R²_max + 1)/(2εϑ(G, ξ)) · E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ) + 2εR_max.

By choosing

    ε = ( (R²_max + 1)/R²_max · 1/(8ϑ(G, ξ)) · E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] )^{1/3},

we reach

    E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] ≤ 2R_max ( (R²_max + 1)/R²_max · 1/(8ϑ(G, ξ)) · E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] )^{1/3}.

This implies

    ( E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] )³ / E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≤ (R³_max + R_max) / ϑ(G, ξ).

This ends the proof.
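The exploratory policy μ defined at the start of this proof can be found by brute force in small cases. With a single context, a policy is just a distribution over the k actions, so a coarse grid search over the simplex approximates ϑ(G, ξ); the graph below is an arbitrary example, and this is a sketch rather than an exact solver.

```python
k = 3
# Out-neighborhoods of a directed 3-cycle with self-loops.
n_out = {0: {0, 1}, 1: {1, 2}, 2: {2, 0}}

def coverage(pi):
    """min over actions d of P_{a~pi}(d in N_out(a)): the worst-case
    probability that one round of play observes the reward of d."""
    return min(sum(p for a, p in enumerate(pi) if d in n_out[a]) for d in range(k))

n = 50
theta_hat = max(coverage((i / n, j / n, (n - i - j) / n))
                for i in range(n + 1) for j in range(n + 1 - i))
```

For this graph the exact optimum is 2/3, attained by the uniform policy; the grid value sits just below because 50 is not divisible by 3.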

B.3. Proof of Lemma 5.8

Proof. According to the definition of the cumulative information gain,

    E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] = E_{t,s_t}[ I_t(π*; O_t | A_t, s_t) ].    (B.5)

Note that

    I_t(π*; (s_t, A_t, O_t)) = E_{t,s_t}[ I_t(π*; O_t | A_t, s_t) ] + E_{t,s_t}[ I_t(π*; A_t | s_t) ] + I_t(π*; s_t) = E_{t,s_t}[ I_t(π*; O_t | A_t, s_t) ],

where the last two terms vanish since s_t is independent of π* and the action A_t is determined by the history, the context, and independent randomization. By the chain rule,

    I(π*; F_{n+1}) = Σ_{t=1}^n E[ I_t(π*; (s_t, A_t, O_t)) ] = Σ_{t=1}^n E[ E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ].

On the other hand,

    I(π*; F_{n+1}) ≤ H(π*) ≤ M log(k),

where H(·) is the entropy. Putting the above together,

    E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] ≤ M log(k).

This ends the proof.

Remark B.2. We analyze the upper bound H(π*) in the following examples. Denote the number of contexts by M, which could be infinite.

1. The number of possible choices of π* is at most k^M, and thus H(π*) ≤ log(k^M) = M log(k). When M = ∞, this bound is vacuous.

2. We can divide the context space into fixed and disjoint clusters such that, for each parameter in the parameter space, the optimal policy maps all the contexts in a single cluster to a single action. For example, a naive partition is one in which each cluster contains a single context; in this case the number of clusters equals the number of contexts M. Let P be the minimum number of such clusters, so P ≤ M. The number of possible choices of π* is k^P, and thus H(π*) ≤ log(k^P) = P log(k). When M = ∞ but P < ∞, we obtain a valid upper bound instead of an infinite one.
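A toy instance of point 2 of Remark B.2, in which all M contexts share one optimal action and hence form a single cluster (P = 1); the two-point parameter space below is an arbitrary illustration.

```python
import math
from collections import Counter

M, k = 3, 2                    # number of contexts and actions
thetas = [-1.0, +1.0]          # uniform prior over a two-point parameter space

def optimal_policy(theta):
    # The sign of theta decides the optimal action in *every* context,
    # so the whole context space is one cluster in the sense of Remark B.2.
    a = 1 if theta > 0 else 0
    return tuple(a for _ in range(M))

counts = Counter(optimal_policy(t) for t in thetas)
probs = [c / len(thetas) for c in counts.values()]
entropy = -sum(p * math.log(p) for p in probs)   # H(pi*)
P = 1
```

Here H(π*) = log 2, matching the cluster bound P log k = log 2 and well below the naive bound M log k = 3 log 2.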
B.4. Proof of Theorem 5.9

Proof. According to Theorem 4.4, we have

    BR(n; π^MIR) ≤ 2R_max√n + inf_{λ≥2} 2^{1−2/λ} I_{2R_max/√n, λ}^{1/λ} n^{1−1/λ} ( E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] )^{1/λ}.

Using Lemma 5.5 with ε = 1/√n, we have

    I_{2R_max/√n, 2} ≤ (2R²_max + 2)/(1 − 1/√n) · 2β(G) log( 4k²√n / β(G) ) ≤ 16(R²_max + 1) β(G) log( 4k²√n / β(G) ).

Combining Lemmas 5.7-5.8, we have

    BR(n; π^MIR) ≤ 2R_max√n + min{ ( 16(R²_max + 1) β(G) log(4k²√n/β(G)) · n M log(k) )^{1/2}, ( (R³_max + R_max)/ϑ(G, ξ) )^{1/3} (2M log(k))^{1/3} n^{2/3} }
    ≤ C R_max min{ √( β(G) log(4k²√n/β(G)) · n M log(k) ), ( 2M log(k)/ϑ(G, ξ) )^{1/3} n^{2/3} },

for some absolute constant C > 0. This ends the proof.
C. Sparse Linear Contextual Bandits

C.1. Proof of Theorem 6.3

Proof. We assume there are two available context sets. Context set 1 consists of a single action x_0 = (0, …, 0)^⊤ ∈ R^{d+1} and an informative action set H defined as follows:

    H = { x ∈ R^{d+1} : x_j ∈ {−κ, κ} for j ∈ [d], x_{d+1} = −1 },    (C.1)

where 0 < κ ≤ 1 is a constant. Let d = sp for some integer p ≥ 2. Context set 2 consists of a multi-task bandit action set A = {({e_i ∈ R^p : i ∈ [p]}^s, 0)} ⊂ R^{d+1}, whose elements' last coordinate is always 0. The arrival probability of each context set is 1/2. One can verify that for this feature set the explorability constant is C_min(φ, ξ) = 1/2, achieved by a policy that samples uniformly from H when context set 1 arrives.

Fix a conditional IDS policy π^CIR. Let Δ > 0 and Θ = {Δe_i : i ∈ [p]} ⊂ R^p. Given θ ∈ {(Θ^s, 1)} ⊂ R^{d+1} and i ∈ [s], let θ^{(i)} ∈ R^p be defined by θ^{(i)}_k = θ_{(i−1)p+k}, which means that

    θ^⊤ = [θ^{(1)⊤}, …, θ^{(s)⊤}, 1].

Assume the prior over θ* is uniform on {(Θ^s, 1)}. Define the cumulative regret of policy π interacting with bandit θ as

    R_θ(n; π) = Σ_{t=1}^n E_θ[ ⟨x*_{s_t}, θ⟩ − Y_t ]
    = Σ_{t=1}^n E_θ[ (⟨x*_{s_t}, θ⟩ − Y_t) 1(s_t = 1) ] + Σ_{t=1}^n E_θ[ (⟨x*_{s_t}, θ⟩ − Y_t) 1(s_t = 2) ],    (C.2)

where we write x*_{s_t} = φ(s_t, a*_t) for short. Therefore,

    BR(n; π) = (1/|Θ|^s) Σ_{θ∈{(Θ^s,1)}} R_θ(n; π).

Note that when context set 1 arrives, any action from H suffers at least 1 − sΔ regret, since x_0 is always the optimal action for context set 1. From the definition of conditional IDS in Eq. (4.2), when context set 1 arrives and sΔ < 1, conditional IDS will always pull x_0 under this prior. That means conditional IDS suffers no regret on context set 1, which implies that for any θ ∈ {(Θ^s, 1)},

    Σ_{t=1}^n E_θ[ (⟨x*_{s_t}, θ⟩ − Y_t) 1(s_t = 1) ] = Σ_{t=1}^n E_θ[ ⟨x*_1, θ⟩ − ⟨x_t, θ⟩ | s_t = 1 ] P(s_t = 1) = 0.

It remains to bound the second term in Eq. (C.2). This essentially follows the proof of Theorem 24.3 in Lattimore & Szepesvári (2020). From the proof of the multi-task bandit lower bound, we have

    (1/|Θ|^s) Σ_{θ∈{(Θ^s,1)}} Σ_{t=1}^n E_θ[ (⟨x*_{s_t}, θ⟩ − Y_t) 1(s_t = 2) ] ≥ (1/16) √(dsn).

This ends the proof.
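The claim that uniform sampling over H makes this instance explorable can be checked directly: over the full hypercube of sign patterns, the second-moment matrix E[xx^⊤] is exactly diagonal, so its smallest eigenvalue is simply min(κ², 1). The sketch uses a small d for illustration (the proof's constant 1/2 additionally accounts for the context-arrival probability).

```python
import itertools

d, kappa = 4, 0.5
# Informative set H of Eq. (C.1): all sign patterns on the first d coordinates.
H = [list(signs) + [-1.0] for signs in itertools.product([-kappa, kappa], repeat=d)]

m = len(H)                                    # 2**d corners, sampled uniformly
second_moment = [[sum(x[i] * x[j] for x in H) / m for j in range(d + 1)]
                 for i in range(d + 1)]
off_diag = max(abs(second_moment[i][j])
               for i in range(d + 1) for j in range(d + 1) if i != j)
diag = [second_moment[i][i] for i in range(d + 1)]
```

Every off-diagonal entry cancels exactly over the 2^d sign patterns, leaving the diagonal (κ², …, κ², 1), so no eigenvalue computation is needed to read off the smallest eigenvalue.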

C.2. Proof of Lemma 6.4

Proof. If one can derive a worst-case bound on Ψ^λ_{t,α}(π̃) for a particular policy π̃, one automatically obtains an upper bound on I_{α,λ}. The remaining step is to choose a proper policy π̃ for λ = 2 and λ = 3 separately.

First, we bound the information ratio with λ = 2. By the definition of mutual information, for any a ∈ A_t, we have

    I_t(π*; Y_{t,a}) = Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) D_KL( P_t(Y_{t,a} = · | π* = ζ) || P_t(Y_{t,a} = ·) ).    (C.3)

Let (Ω, F) be a measurable space and let P, Q : F → [0, 1] be two probability measures. By Pinsker's inequality,

    δ(P, Q) = sup_{A∈F} ( P(A) − Q(A) ) ≤ √( (1/2) D_KL(P || Q) ).

Let a < b and X : Ω → [a, b]. Exercise 14.4 in Lattimore & Szepesvári (2020) indicates that

    | ∫_Ω X(ω) dP(ω) − ∫_Ω X(ω) dQ(ω) | ≤ (b − a) δ(P, Q) ≤ (b − a) √( (1/2) D_KL(P || Q) ).

Overall,

    D_KL(P || Q) ≥ ( 2/(b − a)² ) ( ∫_Ω X(ω) dP(ω) − ∫_Ω X(ω) dQ(ω) )².

Recall that R_max is the upper bound of the maximum expected reward. It is easy to see that Y_{t,a} is a √(R²_max + 1)-sub-Gaussian random variable. According to Lemma 3 in Russo & Van Roy (2014), we have

    I_t(π*; Y_{t,a}) ≥ ( 2/(R²_max + 1) ) Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) ( E_t[Y_{t,a} | π* = ζ] − E_t[Y_{t,a}] )².    (C.4)

The MIR of contextual IDS can be bounded by the MIR of Thompson sampling:

    I_{0,2} ≤ max_{t∈[n]} max{0, E[Δ_t(s_t)^⊤ π_t^TS(·|s_t)]}² / E[I_t(π*)^⊤ π_t^TS(·|s_t)] ≤ max_{t∈[n]} E[ (Δ_t(s_t)^⊤ π_t^TS(·|s_t))² / ( I_t(π*)^⊤ π_t^TS(·|s_t) ) ],

where the second inequality follows from Jensen's inequality. Using the matrix trace-rank trick described in Proposition 5 of Russo & Van Roy (2014), we obtain I_{0,2} ≤ (R²_max + 1)d/2 in the end.

Second, we bound the information ratio with λ = 3. Define an exploratory policy μ such that

    μ = argmax_{π:S→P(A)} σ_min( E_{s∼ξ} E_{a∼π(·|s)}[ φ(s, a) φ(s, a)^⊤ ] ).    (C.5)

Consider a mixture policy π_t^mix = (1 − ε)π_t^TS + εμ, where the mixture rate ε ≥ 0 will be decided later.
Step 1: Bound the information gain. According to the lower bound on the information gain in Eq. (C.4),

    E[ I_t(π*)^⊤ π_t^mix(·|s_t) ]
    ≥ ( 2/(R²_max + 1) ) E_{s_t∼ξ, a∼π_t^mix(·|s_t)}[ Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) ( E_t[Y_{t,a} | π* = ζ] − E_t[Y_{t,a}] )² ]
    = ( 2/(R²_max + 1) ) E_{s_t∼ξ, a∼π_t^mix(·|s_t)}[ Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) ( φ(s_t, a)^⊤ E_t[θ* | π* = ζ] − φ(s_t, a)^⊤ E_t[θ*] )² ].

By the definition of the mixture policy, we know that π_t^mix(a|s_t) ≥ εμ(a|s_t) for any a ∈ A_t. Then we have

    E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≥ ( 2ε/(R²_max + 1) ) Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ)
        · E_{s_t∼ξ, a∼μ(·|s_t)}[ ( E_t[θ* | π* = ζ] − E_t[θ*] )^⊤ φ(s_t, a) φ(s_t, a)^⊤ ( E_t[θ* | π* = ζ] − E_t[θ*] ) ].

From the definition of μ in Eq. (C.5), we have

    E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≥ ( 2ε/(R²_max + 1) ) Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) C_min(φ, ξ) ‖ E_t[θ* | π* = ζ] − E_t[θ*] ‖²₂.

Step 2: Bound the instant regret. We decompose the regret into the contribution of the exploratory policy and that of TS:

    E[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] = (1 − ε) E_{s_t}[ Σ_a P_t(a*_t = a) ( E_t[φ(s_t, a)^⊤ θ* | a*_t = a] − E_t[φ(s_t, a)^⊤ θ*] ) ]
    + ε E_{s_t}[ Σ_a E_t[ φ(s_t, a*_t)^⊤ θ* − φ(s_t, a)^⊤ θ* ] μ(a|s_t) ].    (C.6)

Since R_max is the upper bound of the maximum expected reward, the second term can be bounded by 2R_max ε. Using Jensen's inequality, the first term can be bounded by

    E_{s_t}[ Σ_a P_t(a*_t = a) ( E_t[φ(s_t, a)^⊤ θ* | a*_t = a] − E_t[φ(s_t, a)^⊤ θ*] ) ]
    ≤ √( E_{s_t}[ Σ_a P_t(a*_t = a) ( E_t[φ(s_t, a)^⊤ θ* | a*_t = a] − E_t[φ(s_t, a)^⊤ θ*] )² ] ).

Since all the optimal actions are sparse, any action a with P_t(a*_t = a) > 0 must be sparse. Then we have

    ( φ(s_t, a)^⊤ ( E_t[θ* | a*_t = a] − E_t[θ*] ) )² ≤ s² ‖ E_t[θ* | a*_t = a] − E_t[θ*] ‖²₂,
for any action a with P_t(a*_t = a) > 0. This further implies

    E_{s_t}[ Σ_a P_t(a*_t = a) ( E_t[φ(s_t, a)^⊤ θ* | a*_t = a] − E_t[φ(s_t, a)^⊤ θ*] ) ]
    ≤ √( E_{s_t}[ Σ_a P_t(a*_t = a) s² ‖ E_t[θ* | a*_t = a] − E_t[θ*] ‖²₂ ] )
    ≤ √( E_{s_t}[ Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) s² ‖ E_t[θ* | π* = ζ] − E_t[θ*] ‖²₂ ] )    (C.7)
    = √( ( s²(R²_max + 1)/(2εC_min(φ, ξ)) ) · ( 2εC_min(φ, ξ)/(R²_max + 1) ) E_{s_t}[ Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) ‖ E_t[θ* | π* = ζ] − E_t[θ*] ‖²₂ ] )
    ≤ √( ( s²(R²_max + 1)/(2εC_min(φ, ξ)) ) E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ).

Putting Eqs. (C.6) and (C.7) together, we have

    E[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] ≤ √( ( s²(R²_max + 1)/(2εC_min(φ, ξ)) ) E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ) + 2R_max ε.

By optimizing the mixture rate ε, we have

    I_{0,3} ≤ ( E[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] )³ / E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≤ s²(R²_max + 1)/(8R²_max C_min(φ, ξ)) ≤ s²/(4C_min(φ, ξ)).

This ends the proof.
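The mean-difference lower bound on the KL divergence derived above (via Pinsker's inequality and Exercise 14.4 of Lattimore & Szepesvári, 2020) can be sanity-checked on discrete distributions; the distributions below are arbitrary.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# For X with values in [a, b]: KL(P||Q) >= 2 (E_P X - E_Q X)^2 / (b - a)^2.
values = [0.0, 0.5, 1.0]          # support of X, so b - a = 1
P = [0.6, 0.3, 0.1]
Q = [0.2, 0.3, 0.5]
mean = lambda dist: sum(v * p for v, p in zip(values, dist))
lower = 2 * (mean(P) - mean(Q)) ** 2 / (1.0 - 0.0) ** 2
```

Here KL(P‖Q) ≈ 0.50 comfortably exceeds the lower bound 2(E_P X − E_Q X)² = 0.32, as the derivation requires.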


This figure "1.png" is available in "png" format from:

http://arxiv.org/ps/2205.10895v1
