Contextual IDS for Bandit Problems

This paper explores the concept of Information-Directed Sampling (IDS) in the context of contextual bandits, highlighting the limitations of traditional conditional IDS and proposing a new version that incorporates context distribution. The authors demonstrate that their contextual IDS approach outperforms conditional IDS in terms of Bayesian regret bounds for both contextual bandits with graph feedback and sparse linear contextual bandits. They also introduce a computationally efficient algorithm based on Actor-Critic methods and provide empirical evaluations to support their findings.


Contextual Information-Directed Sampling

Botao Hao¹  Tor Lattimore¹  Chao Qin²

arXiv:2205.10895v1 [cs.LG] 22 May 2022

Abstract

Information-directed sampling (IDS) has recently demonstrated its potential as a data-efficient reinforcement learning algorithm (Lu et al., 2021). However, it is still unclear what is the right form of information ratio to optimize when contextual information is available. We investigate the IDS design through two contextual bandit problems: contextual bandits with graph feedback and sparse linear contextual bandits. We provably demonstrate the advantage of contextual IDS over conditional IDS and emphasize the importance of considering the context distribution. The main message is that an intelligent agent should invest more in the actions that are beneficial for future unseen contexts, while conditional IDS can be myopic. We further propose a computationally efficient version of contextual IDS based on Actor-Critic and evaluate it empirically on a neural network contextual bandit.

*Equal contribution. ¹DeepMind ²Columbia University. Correspondence to: Botao Hao <[email protected]>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

1. Introduction

Information-directed sampling (IDS) (Russo & Van Roy, 2018) is a promising approach to balancing the exploration-exploitation tradeoff (Lu et al., 2021). Its theoretical properties have been systematically studied in a range of problems (Kirschner & Krause, 2018; Liu et al., 2018a; Kirschner et al., 2020a;b; Hao et al., 2021a). However, the aforementioned analyses¹ are limited to a fixed action set.

In this work, we study the IDS design for contextual bandits (Langford & Zhang, 2007), which can be viewed as a simplified case of full reinforcement learning. The context at each round is generated independently, rather than determined by the actions. Contextual bandits are widely used in real-world applications such as recommender systems (Li et al., 2010).

¹One exception is Kirschner et al. (2020a), who extended IDS to contextual partial monitoring.

A natural generalization of IDS to the contextual case is conditional IDS. Given the current context, conditional IDS acts as if it were facing a fixed action set. However, this design ignores the context distribution and can be myopic. We mainly want to address the question of what the right form of information ratio is in the contextual setting, and to highlight the necessity of utilizing the context distribution.

Contributions. Our contribution is four-fold:

• We introduce a version of contextual IDS that takes the context distribution into consideration for contextual bandits with graph feedback and sparse linear contextual bandits.

• For contextual bandits with graph feedback, we prove that conditional IDS suffers an Ω(√(β(G)n)) Bayesian regret lower bound for a particular prior and graph, while contextual IDS can achieve an Õ(min{√(β(G)n), δ(G)^{1/3} n^{2/3}}) Bayesian regret upper bound for any prior. Here, n is the time horizon, G is a directed feedback graph over the set of actions, β(G) is the independence number and δ(G) is the weak domination number of the graph. In the regime where β(G) ≳ (δ(G)² n)^{1/3}, contextual IDS achieves a better regret bound than conditional IDS.

• For sparse linear contextual bandits, we prove that conditional IDS suffers an Ω(√(nds)) Bayesian regret lower bound for a particular sparse prior, while contextual IDS can achieve an Õ(min{√(nds), s n^{2/3}}) Bayesian regret upper bound for any sparse prior. Here, d is the feature dimension and s is the sparsity. In the data-poor regime where d ≳ s n^{1/3}, contextual IDS achieves a better regret bound than conditional IDS.

• We further propose a computationally efficient algorithm to approximate contextual IDS based on Actor-Critic (Konda & Tsitsiklis, 2000) and evaluate it empirically on a neural network contextual bandit.

2. Related works

Russo & Van Roy (2018) introduced IDS and derived Bayesian regret bounds for multi-armed bandits, linear bandits and combinatorial bandits. Kirschner & Krause (2018) and Kirschner et al. (2020a) investigated the use of frequentist IDS for bandits with heteroscedastic noise and for partial monitoring. Kirschner et al. (2020b) proved the asymptotic optimality of frequentist IDS for linear bandits. Hao et al. (2021a) developed a class of information-theoretic Bayesian regret bounds for sparse linear bandits that nearly match existing lower bounds on a variety of problem instances. Recently, Lu et al. (2021) proposed a general reinforcement learning framework and used IDS as a key component to build a data-efficient agent. Arumugam & Van Roy (2021) investigated different versions of learning targets when designing IDS. However, no concrete regret bounds are provided in those works.

Graph feedback. Mannor & Shamir (2011) and Alon et al. (2015) gave a full characterization of online learning with graph feedback. Tossou et al. (2017) provided the first information-theoretic analysis of Thompson sampling for bandits with graph feedback, and Liu et al. (2018a) derived a Bayesian regret bound for IDS. Both bounds depend on the clique cover number, which can be much larger than the independence number. Liu et al. (2018b) further improved the analysis when the feedback graph is unknown and derived a Bayesian regret bound for a mixture policy in terms of the independence number. In contrast, our work is the first to consider contextual bandits with graph feedback, and it derives nearly optimal regret bounds in terms of both the independence number and the domination number for a single algorithm. Very recently, Dann et al. (2020) considered reinforcement learning with graph feedback. Their O(√n) upper bound depends on the mas-number rather than the independence number. We briefly summarize the comparison in Table 1.

Sparse linear bandits. For sparse linear bandits with a fixed action set, Abbasi-Yadkori et al. (2012) proposed an inefficient online-to-confidence-set conversion approach that achieves an Õ(√(sdn)) upper bound for an arbitrary action set. Lattimore et al. (2015) developed a selective explore-then-commit algorithm that only works when the action set is exactly the binary hypercube, and derived an optimal O(s√n) upper bound. Hao et al. (2020) introduced the notion of an exploratory action set and proved a Θ(poly(s) n^{2/3}) minimax rate for the data-poor regime using an explore-then-commit algorithm. Note that the minimax lower bound automatically holds for the contextual setting when the context distribution is a Dirac on the hard problem instance. Hao et al. (2021b) extended this concept to an MDP setting.

It recently became popular to study the contextual setting. Those results cannot be reduced to our setting, since they rely on careful assumptions on the context distribution to achieve Õ(poly(s)√n) or Õ(poly(s) log(n)) regret bounds (Bastani & Bayati, 2020; Wang et al., 2018; Kim & Paik, 2019; Wang et al., 2020; Ren & Zhou, 2020; Oh et al., 2021), such that classical high-dimensional statistics can be used. However, minimax lower bounds are missing to understand whether those assumptions are fundamental. Another line of work has polynomial dependency on the number of actions (Agarwal et al., 2014; Foster & Rakhlin, 2020; Simchi-Levi & Xu, 2020). We briefly summarize the comparison in Table 2.

3. Problem Setting

We study the stochastic contextual bandit problem with a horizon of n rounds and a finite set of possible contexts S. For each context m ∈ S, there is an action set A_m. The interaction protocol is as follows. First, the environment samples a sequence of independent contexts (s_t)_{t=1}^n from a distribution ξ over S. At the start of round t, the context s_t is revealed to the agent, who may use their observations and possibly an external source of randomness to choose an action A_t ∈ A_t = A_{s_t}. Then the agent receives an observation O_{t,A_t} including an immediate reward Y_{t,A_t} as well as some side information. Here

Y_{t,a} = f(s_t, a, θ*) + η_{t,a},  if A_t = a,

where f is the reward function, θ* is the unknown parameter and η_{t,a} is 1-sub-Gaussian noise. Sometimes we write Y_t = Y_{t,A_t} for short.
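The interaction protocol above can be sketched directly in code. Below is a minimal simulation, assuming a linear form f(s, a, θ*) = φ(s, a)^⊤θ* and Gaussian noise purely for illustration; the feature map, context distribution, and the uniform placeholder policy are not part of the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative instance: 3 contexts, 4 actions per context, linear mean reward.
n, num_contexts, num_actions, d = 100, 3, 4, 5
theta_star = rng.normal(size=d)                        # unknown parameter theta*
phi = rng.normal(size=(num_contexts, num_actions, d))  # feature map phi(s, a)
xi = np.array([0.5, 0.3, 0.2])                         # context distribution xi

def f(s, a):
    """Mean reward f(s, a, theta*) under the assumed linear model."""
    return phi[s, a] @ theta_star

history, regret = [], 0.0
for t in range(n):
    s_t = rng.choice(num_contexts, p=xi)   # context drawn i.i.d. from xi
    A_t = int(rng.integers(num_actions))   # placeholder agent: uniform policy
    Y_t = f(s_t, A_t) + rng.normal()       # reward with 1-sub-Gaussian noise
    history.append((s_t, A_t, Y_t))        # observations available at round t+1
    regret += max(f(s_t, a) for a in range(num_actions)) - f(s_t, A_t)
```

Averaging the accumulated `regret` over the prior on θ* gives the Bayesian regret studied below.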

We consider the Bayesian setting in the sense that θ* is sampled from some prior distribution. Let A = ∪_{m∈S} A_m and let the history be F_t = {(s_1, A_1, O_1), ..., (s_{t−1}, A_{t−1}, O_{t−1})}. A policy π = (π_t)_{t∈N} is a sequence of (suitably measurable) deterministic mappings from the history F_t and the context space S to a probability distribution in P(A), with the constraint that for each context m, π_t(·|m) can only put mass on P(A_m). Then the Bayesian regret of a policy π is defined as

BR(n; π) = E[ Σ_{t=1}^n max_{a∈A_t} f(s_t, a, θ*) − Σ_{t=1}^n Y_t ],  (3.1)

where the expectation is over the interaction sequence induced by the agent and environment, the context distribution and the prior distribution over θ*.

Notations. Given a measure P and jointly distributed random variables X and Y, we let P_X denote the law of X and P_{X|Y} the conditional law of X given Y, so that P_{X|Y}(·) = P(X ∈ ·|Y). The mutual information between X and Y is I(X; Y) = E[D_KL(P_{X|Y} ‖ P_X)], where D_KL is the relative entropy. We write P_t(·) = P(·|F_t) for the posterior measure, where P is the probability measure over θ* and the history, and E_t[·] = E[·|F_t]. Denote I_t(X; Y) = E_t[D_KL(P_{t,X|Y} ‖ P_{t,X})]. We write 1{·} for the indicator function. For a positive semidefinite matrix A, we let σ_min(A) be its minimum eigenvalue. Let P(A) denote the space of probability measures over a set A, equipped with the Borel σ-algebra.

Table 1. Comparison with existing results on regret upper and lower bounds for bandits with graph feedback. Here, β(G) is the independence number, δ(G) is the weak domination number and c(G) is the clique cover number of the graph. Note that β(G) ≤ c(G).

| Upper bound          | Regret                                | Contextual? | Algorithm                             |
| Alon et al. (2015)   | Õ(√(β(G)n))                           | no          | Exp3.G for strongly observable graphs |
| Alon et al. (2015)   | Õ(δ(G)^{1/3} n^{2/3})                 | no          | Exp3.G for weakly observable graphs   |
| Tossou et al. (2017) | Õ(√(c(G)n))                           | no          | Thompson sampling                     |
| Liu et al. (2018a)   | Õ(√(c(G)n))                           | no          | fixed action-set IDS                  |
| This paper           | Õ(min{√(β(G)n), δ(G)^{1/3} n^{2/3}}) | yes         | contextual IDS                        |

| Minimax lower bound  | Regret                | Contextual? | Setting                   |
| Alon et al. (2015)   | Ω(√(β(G)n))           | no          | strongly observable graph |
| Alon et al. (2015)   | Ω(δ(G)^{1/3} n^{2/3}) | no          | weakly observable graph   |

Table 2. Comparison with existing results on regret upper and lower bounds for sparse linear bandits. Here, s is the sparsity, d is the feature dimension, n is the number of rounds, K is the number of arms, C_min is the minimum eigenvalue of the data matrix and τ is a problem-dependent parameter that may have a complicated form.

| Upper bound                  | Regret                                          | Contextual? | Assumption                   | Algorithm            |
| Abbasi-Yadkori et al. (2012) | Õ(√(sdn))                                       | yes         | none                         | UCB                  |
| Sivakumar et al. (2020)      | Õ(√(sdn))                                       | yes         | adversarial + Gaussian noise | greedy               |
| Bastani & Bayati (2020)      | Õ(τ K s² (log n)²)                              | yes         | compatibility condition      | Lasso greedy         |
| Lattimore et al. (2015)      | Õ(s√n)                                          | no          | action set is hypercube      | explore-then-commit  |
| Hao et al. (2021a)           | Õ(min{√(sdn), C_min^{−1/3} s n^{2/3}})          | no          | none                         | fixed action-set IDS |
| This paper                   | Õ(min{√(sdn), C_min^{−1/3} s n^{2/3}})          | yes         | none                         | contextual IDS       |

| Minimax lower bound          | Regret                                          | Contextual? | Assumption | Algorithm |
| Hao et al. (2020)            | Ω(min{√(sdn), C_min^{−1/3} s^{1/3} n^{2/3}})    | no          | N.A.       | N.A.      |

4. Conditional IDS versus Contextual IDS

We introduce the formal definitions of conditional IDS and contextual IDS for general contextual bandits. Suppose the optimal action associated with context s_t is a*_t = argmax_{a∈A_t} f(s_t, a, θ*), which in the Bayesian setting is a random variable. We first define the one-step expected regret of taking action a ∈ A_t as

Δ_t(a|s_t) := E_t[ f(s_t, a*_t, θ*) − f(s_t, a, θ*) | s_t ],  (4.1)

and write Δ_t(s_t) ∈ R^{|A_t|} for the corresponding vector.

4.1. Conditional IDS

A natural generalization of the standard IDS design (Russo & Van Roy, 2018) to the contextual setting is conditional IDS. It has been investigated empirically by Lu et al. (2021) for the full reinforcement learning setting.
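Both conditional and contextual IDS are built from the one-step regret in Eq. (4.1), which is a posterior expectation and can therefore be approximated by Monte Carlo over posterior samples of θ*. A minimal sketch, assuming a linear reward model and using i.i.d. Gaussian draws as a stand-in for a real posterior sampler:

```python
import numpy as np

rng = np.random.default_rng(1)
num_actions, d, num_samples = 4, 5, 1000

# Stand-in for samples from the posterior P_t(theta* in . | F_t).
theta_samples = rng.normal(size=(num_samples, d))
Phi = rng.normal(size=(num_actions, d))     # features phi(s_t, a), illustrative

def one_step_regret(theta_samples, Phi):
    """Monte Carlo estimate of the vector Delta_t(s_t) from Eq. (4.1)."""
    means = theta_samples @ Phi.T                # f(s_t, a, theta) per sample
    best = means.max(axis=1)                     # f(s_t, a*_t, theta) per sample
    return (best[:, None] - means).mean(axis=0)  # average posterior regret

delta = one_step_regret(theta_samples, Phi)
```

Each entry of `delta` is nonnegative by construction, and the action minimizing it is the greedy (pure-exploitation) choice.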

Conditional on the current context, we define the information gain of taking action a with respect to the current optimal action a*_t as

I_t(a*_t; O_{t,a}|s_t) := E_t[ D_KL(P_t(a*_t ∈ ·|s_t, O_{t,a}) ‖ P_t(a*_t ∈ ·|s_t)) | s_t ],

and write I_t(a*_t, s_t) ∈ R^{|A_t|} such that [I_t(a*_t, s_t)]_a = I_t(a*_t; O_{t,a}|s_t).

We introduce the (α, λ)-conditional information ratio (CIR) as Γ^λ_{t,α} : P(A) → R such that

Γ^λ_{t,α}(π(·|s_t)) = (max{0, Δ_t(s_t)^⊤ π(·|s_t) − α})^λ / ( I_t(a*_t, s_t)^⊤ π(·|s_t) ),  (4.2)

for some parameters α, λ > 0. Conditional IDS minimizes the (α, λ)-CIR to find a probability distribution over the action space:

π_t(·|s_t) = argmin_{π(·|s_t) ∈ P(A_t)} Γ^λ_{t,α}(π(·|s_t)).

Remark 4.1. In comparison to the standard information ratio of Russo & Van Roy (2018), who specified λ = 2 and α = 0 in the non-contextual setting, we consider a generalized version introduced by Lattimore & György (2020). As observed by Lattimore & György (2020), the right value of λ depends on the dependence of the regret on the horizon. The parameter α is always chosen of order O(1/√n) for certain problems, so that its regret contribution is negligible. Their roles will become clearer in Sections 5 and 6.

4.2. Contextual IDS

A possible limitation of conditional IDS is that the CIR does not take the context distribution into consideration. As we show later, this can result in sub-optimality for certain problems. Contextual IDS, which was first proposed by Kirschner et al. (2020a) for partial monitoring, is introduced to remedy this shortcoming.

Let π* : S → A be the optimal policy such that for each m ∈ S, π*(m) = argmax_{a∈A_m} f(m, a, θ*), with ties broken arbitrarily. We define the information gain of taking action a with respect to π* as

I_t(π*; O_{t,a}) := E_t[ D_KL(P_t(π* ∈ ·|O_{t,a}) ‖ P_t(π* ∈ ·)) ],

and write I_t(π*) ∈ R^{|A_t|} for the corresponding vector. Define a policy class Π = {π : S → P(A), supp(π(·|m)) ⊆ A_m}, where supp(·) is the support of a vector. We introduce the (α, λ)-marginal information ratio (MIR) as Ψ^λ_{t,α} : Π → R such that

Ψ^λ_{t,α}(π) = (max{0, E_{s_t}[Δ_t(s_t)^⊤ π(·|s_t)] − α})^λ / E_{s_t}[ I_t(π*)^⊤ π(·|s_t) ],  (4.3)

where the expectation is taken with respect to the context distribution. Contextual IDS minimizes the (α, λ)-MIR to find a mapping from the context space to the action space:

π_t = argmin_{π ∈ Π} Ψ^λ_{t,α}(π).

When receiving context s_t at round t, the agent plays actions according to π_t(·|s_t).

We highlight the key differences between conditional IDS and contextual IDS as follows:

• Conditional IDS only optimizes a probability distribution for the current context, while contextual IDS optimizes a full mapping from the context space to the action space.

• Conditional IDS only seeks information about the current optimal action, while contextual IDS seeks information about the whole optimal policy.

Remark 4.2. We can also define the information gain in conditional IDS and contextual IDS with respect to the unknown parameter θ*. This can bring a certain computational advantage when approximating the information gain. However, θ* contains much more information to learn than the optimal action or policy, so the agent may suffer more regret.

4.3. Why can conditional IDS be myopic?

Conditional IDS myopically balances exploration and exploitation without taking the context distribution into consideration. This has the advantage of being simple to implement, but it sometimes leads to both under- and over-exploration, as the following examples illustrate. In both examples there are two contexts arriving with equal probability.

• Example 1 [UNDER-EXPLORATION] Consider a noiseless case. Context set 1 contains k actions, of which one is the optimal action and the remaining k − 1 actions yield regret 1. Context set 2 contains a revealing action with regret 1 and one action with no regret. The revealing action provides an observation of the rewards for all the k actions in context set 1. When context set 2 arrives, conditional IDS will never play the revealing action, since it incurs high immediate regret with no useful information for the current context set. However, this ignores the fact that the revealing action could be informative for the unseen context set 1. Conditional IDS under-explores and suffers O(k) regret.

In contrast, contextual IDS exploits the context distribution, plays the revealing action in context set 2, and only suffers O(1) regret.

• Example 2 [OVER-EXPLORATION] Context set 1 contains a single revealing action (hence no regret). Context set 2 has k actions. The first is a revealing action with a (known) regret of Θ(√k Δ), where Δ = Θ(1/√n). Of the remaining actions, one is optimal (zero regret) and the others have regret Δ, with the prior such that the identity of the optimal action is unknown. Contextual IDS will avoid the revealing action in context set 2, because it understands that this information can be obtained more cheaply in context set 1. Its regret is O(√n). Meanwhile, if the constants are tuned appropriately, conditional IDS will play the revealing action in context set 2 and suffer regret Ω(√(nk)).

Remark 4.3. A similar shortcoming also holds for the class of UCB and Thompson sampling algorithms, since they have no explicit way to encode the information of the context distribution.

4.4. Generic regret bound for contextual IDS

Before we step into specific examples, we first provide a generic bound for contextual IDS. Denote by I_{α,λ} the worst-case MIR, such that for any t ∈ [n], Ψ^λ_{t,α}(π_t) ≤ I_{α,λ} almost surely.

Theorem 4.4. Let π^MIR = (π_t)_{t∈[n]} be the contextual IDS policy such that π_t minimizes the (α, 2)-MIR at each round. Then the following regret upper bound holds:

BR(n; π^MIR) ≤ nα + inf_{λ≥2} 2^{1−2/λ} (I_{α,λ})^{1/λ} n^{1−1/λ} ( E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] )^{1/λ}.

The proof is deferred to Appendix A.1. An interesting consequence of Theorem 4.4 is that contextual IDS minimizing with λ = 2 can adapt to λ > 2. This shows the power of contextual IDS to adapt to different information-regret structures.

5. Contextual Bandits with Graph Feedback

We consider a Bayesian formulation of contextual k-armed bandits with graph feedback, which generalizes the setup in Alon et al. (2015). Let G = (A, E) be a directed feedback graph over the set of actions A with |A| = k. For each i ∈ A, we define its in-neighborhood and out-neighborhood as N^in(i) = {j ∈ A : (j, i) ∈ E} and N^out(i) = {j ∈ A : (i, j) ∈ E}. The feedback graph G is fixed and known to the agent.

Let the unknown parameter be θ* ∈ R^k. At each round t, the agent receives an observation characterized by G, in the sense that when taking action a ∈ A_t, the observation O_{t,a} contains the rewards Y_{t,a'} = θ*_{a'} + η_{t,a'} for each action a' ∈ N^out(a). We review two standard quantities describing the graph structure, used in Alon et al. (2015).

Definition 5.1 (Independence number). An independent set of a graph G = (A, E) is a subset S ⊆ A such that no two different i, j ∈ S are connected by an edge in E. The cardinality of a largest independent set is the independence number of G, denoted by β(G).

A vertex a is observable if N^in(a) ≠ ∅. A vertex is strongly observable if it has either a self-loop or incoming edges from all other vertices. A vertex is weakly observable if it is observable but not strongly observable. A graph G is observable if all its vertices are observable, and it is strongly observable if all its vertices are strongly observable. A graph is weakly observable if it is observable but not strongly observable.

Definition 5.2 (Weak domination number). In a directed graph G = (A, E) with a set of weakly observable vertices W ⊆ A, a weakly dominating set D ⊆ A is a set of vertices that dominates W: for any w ∈ W there exists d ∈ D such that w ∈ N^out(d). The weak domination number of G, denoted by δ(G), is the size of the smallest weakly dominating set.

5.1. Lower bound for conditional IDS

We first prove that conditional IDS suffers an Ω(√(nβ(G))) Bayesian regret lower bound for a particular prior, and show later that the optimal rate for this prior can be much smaller when β(G) is very large (Section 5.2).

Theorem 5.3. Let π^CIR be a conditional IDS policy. There exists a contextual bandit with graph feedback instance such that β(G) = k, δ(G) = 1 and

BR(n; π^CIR) ≥ (1/16) √(n(β(G) − 3)).

Proof. Let us construct a hard problem instance that generalizes Example 1 in Section 4.3. Suppose {x_1, ..., x_k} ⊆ R^k is the set of basis vectors corresponding to the k arms. The first k − 1 arms form a standard multi-armed Gaussian bandit with unit variance. When the kth arm is pulled, the agent always suffers a regret of 1, but obtains samples for the first k − 1 arms.

In terms of the language of graph feedback, the first k − 1 arms only contain self-loops, while the out-neighborhood of the last arm contains all of the first k − 1 arms. There are two context sets, and the arrival probability of each context is 1/2. Context set 1 consists of {x_{k−1}, x_k}, while context set 2 consists of {x_1, ..., x_{k−2}}. One can verify that β(G) = k and δ(G) = 1.

For each i ∈ {2, ..., k − 2}, let θ^(i) ∈ R^k with θ^(i)_j = 1{i = j} γ for j ∈ {2, ..., k − 2}, θ^(i)_1 = 0, θ^(i)_{k−1} = 0, and θ^(i)_k = γ − 1, where γ > 0 is a gap that will be chosen later. Assume the prior of θ* is uniformly distributed over {θ^(2), ..., θ^(k−2)}.

We write E_i[·] = E_{θ^(i)}[·] for short, where the expectation is taken with respect to the measures on the sequence of outcomes (A_1, Y_1, ..., A_n, Y_n) determined by θ^(i). Define the cumulative regret of policy π interacting with bandit θ^(i) as

R_{θ^(i)}(n; π) = Σ_{t=1}^n E_i[(⟨x*_{s_t}, θ^(i)⟩ − Y_t) 1(s_t = 1)] + Σ_{t=1}^n E_i[(⟨x*_{s_t}, θ^(i)⟩ − Y_t) 1(s_t = 2)],

such that BR(n; π) = (1/(k − 3)) Σ_{i=2}^{k−2} R_{θ^(i)}(n; π).

Step 1. Fix a conditional IDS policy π. From the definition of conditional IDS in Eq. (4.2), when context set 1 arrives, conditional IDS will always pull x_{k−1} for this prior, since x_{k−1} is always the optimal arm for context set 1. This means conditional IDS suffers no regret on context set 1, which implies that for any i ∈ {2, ..., k − 2},

Σ_{t=1}^n E_i[(⟨x*_{s_t}, θ^(i)⟩ − Y_t) 1(s_t = 1)] = 0.

Step 2. Define T_i(n) as the number of pulls of arm i over n rounds. On the other hand,

E_i[ Σ_{t=1}^n (⟨x*_{s_t}, θ^(i)⟩ − Y_t) 1(s_t = 2) ] = E_i[ Σ_{j=1, j≠i}^{k−2} T_j(n) ] γ = E_i[n_2 − T_i(n)] γ,

where the first equation comes from the fact that context sets 1 and 2 are disjoint, and n_2 is the number of times context 2 arrives over n rounds. Since the contexts are generated independently of the learning process, we have E_i[n_2] = n/2. By Pinsker's inequality and the divergence decomposition lemma (Lattimore & Szepesvári, 2020, Lemma 15.1), we know that with γ = (1/2)√(k/n),

Σ_{i=2}^{k−2} E_i[T_i(n)] ≤ Σ_{i=2}^{k−2} E_1[T_i(n)] + (1/4) Σ_{i=2}^{k−2} √(n k E_1[T_i(n)]) ≤ n + (1/4) k n.

Therefore,

BR(n; π) = (1/(k − 3)) Σ_{i=2}^{k−2} E_i[n_2 − T_i(n)] γ = (1/(k − 3)) ( ((k − 3)/2) n − Σ_{i=2}^{k−2} E_i[T_i(n)] ) γ ≥ (1/16) √(n(k − 3)).

This finishes the proof.

Remark 5.4. We can strengthen the lower bound so that it holds for any graph by replacing the standard multi-armed bandit instance in context set 2 with the hard instance in Alon et al. (2017, Theorem 5).

5.2. Upper bound achieved by contextual IDS

In this section, we derive a Bayesian regret upper bound achieved by contextual IDS. According to Theorem 4.4, it suffices to bound the worst-case MIR I_{α,λ} for λ = 2, 3, as well as the cumulative information gain. Let R_max be an upper bound on the maximum expected reward.

Lemma 5.5 (Squared information ratio). For any ε ∈ [0, 1] and a strongly observable graph G, the (2εR_max, 2)-MIR can be bounded as

I_{2εR_max,2} ≤ ((4R_max² + 4)/(1 − ε)) β(G) log(4k²/(β(G)ε)).

The proof is deferred to Appendix B.1; the proof idea is to bound the worst-case squared information ratio by the squared information ratio of Thompson sampling.

Next we bound I_{α,λ} for λ = 3 in terms of an explorability constant.

Definition 5.6 (Explorability constant). We define the explorability constant as

ϑ(G, ξ) := max_{π:S→P(A)} min_{d∈A} E_{s∼ξ, a∼π(·|s)}[ 1{d ∈ N^out(a)} ].
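On small instances, the explorability constant in Definition 5.6 can be probed numerically. The sketch below uses an illustrative toy graph and context distribution: it scores a stochastic policy by min_d E_{s∼ξ, a∼π(·|s)}[1{d ∈ N^out(a)}] and brute-forces the maximum over deterministic policies, which lower-bounds ϑ(G, ξ):

```python
import itertools
import numpy as np

# Toy instance (illustrative): 2 contexts over the same 3 actions, and
# adj[a, d] = 1 iff d is in N_out(a), self-loops included.
xi = np.array([0.5, 0.5])
adj = np.array([[1, 0, 0],
                [0, 1, 1],
                [1, 1, 0]])
num_contexts, num_actions = 2, 3

def explorability(policy):
    """min over vertices d of E_{s~xi, a~pi(.|s)}[1{d in N_out(a)}],
    for a row-stochastic policy of shape (num_contexts, num_actions)."""
    marginal = xi @ policy         # action distribution averaged over contexts
    return (marginal @ adj).min()  # chance the least-covered vertex is observed

best = 0.0
for assignment in itertools.product(range(num_actions), repeat=num_contexts):
    policy = np.eye(num_actions)[list(assignment)]   # deterministic policy
    best = max(best, explorability(policy))
```

Since the true maximum in Definition 5.6 is over stochastic policies, `best` is only a lower bound in general; on some graphs the maximizer needs to randomize.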

Lemma 5.7 (Cubic information ratio). For any ε ∈ [0, 1] and a weakly observable graph G, the (2εR_max, 3)-MIR can be bounded as

I_{2εR_max,3} ≤ (R_max³ + R_max) / ϑ(G, ξ).

The proof is deferred to Appendix B.2; the proof idea is to bound the worst-case cubic information ratio by the cubic information ratio of a policy that mixes Thompson sampling with a policy maximizing the explorability constant.

Lemma 5.8. The cumulative information gain can be bounded as

E[ Σ_{t=1}^n E_t[ I_t(π*)^⊤ π_t(·|s_t) ] ] ≤ H(π*) ≤ M log(k),

where H(·) is the entropy and M is the number of available contexts.

The proof is deferred to Appendix B.3. Combining the results in Lemmas 5.5-5.8 with the generic regret bound in Theorem 4.4, we obtain the following theorem.

Theorem 5.9 (Regret bound for graph contextual IDS). Suppose π^MIR = (π_t)_{t∈N}, where π_t minimizes the (2R_max/√n, 2)-MIR at each round. If G is strongly observable, we have

BR(n; π^MIR) ≤ C R_max min{ (2M log(k)/ϑ(G, ξ))^{1/3} n^{2/3}, √( β(G) log(4k²√n/β(G)) · n M log(k) ) },

where C is an absolute constant.

Ignoring constant and logarithmic terms, the Bayesian regret upper bound in Theorem 5.9 can be simplified to

Õ( min{ √(β(G)Mn), (M/ϑ(G, ξ))^{1/3} n^{2/3} } ).

When the independence number is large enough that β(G) ≥ (n/(ϑ(G, ξ)² M))^{1/3}, this bound is dominated by Õ((M/ϑ(G, ξ))^{1/3} n^{2/3}), which is independent of the independence number. Together with Theorem 5.3, we can argue that conditional IDS is sub-optimal compared with contextual IDS in this regime.

Remark 5.10 (Connection between ϑ(G, ξ) and δ(G)). Let D ⊆ A be the smallest weakly dominating set of the full graph, with |D| = δ(G). Consider a policy µ such that µ(·|s_t) is uniform over D ∩ A_t. Then we have

min_{d∈A} E_{s∼ξ, a∼µ(·|s)}[ 1{d ∈ N^out(a)} ] ≥ 1/(M δ(G)),

which implies 1/ϑ(G, ξ) ≤ M δ(G). When M = 1, Alon et al. (2015) demonstrated that the minimax optimal rate for weakly observable graphs is Θ̃(δ(G)^{1/3} n^{2/3}), which implies 1/ϑ(G) ≳ δ(G) up to logarithmic factors.

Remark 5.11 (Open problem). For the fixed action set setting, Alon et al. (2015) proved that the independence number and the weak domination number are the fundamental quantities describing the graph structure. However, our regret upper bound involves the number of contexts M. It is interesting to investigate whether M can be removed for contextual bandits with graph feedback.

Note that one can also bound H(π*) ≤ H(θ*) ≤ k, in which case the dependency on M does not appear. But the right dependency on M and k for the optimal regret rate in the contextual setting remains unclear.

6. Sparse Linear Contextual Bandits

We consider a Bayesian formulation of sparse linear contextual bandits. The notion of sparsity can be defined through the parameter space Θ:

Θ = { θ ∈ R^d : Σ_{j=1}^d 1{θ_j ≠ 0} ≤ s, ‖θ‖_2 ≤ 1 }.

We assume s is known; this can be relaxed by putting a prior on it. We consider the case where A_m = A for all m ∈ S. Suppose φ : S × A → R^d is a feature map known to the agent. In sparse linear contextual bandits, the mean reward of taking action a is f(s_t, a, θ*) = φ(s_t, a)^⊤ θ*, observed as Y_{t,a} = φ(s_t, a)^⊤ θ* + η_{t,a}, where θ* is a random variable taking values in Θ; we denote its prior distribution by ρ.

We first define a quantity describing the explorability of the feature set.

Definition 6.1 (Explorability constant). We define the explorability constant as

C_min(φ, ξ) := max_{π:S→P(A)} σ_min( E_{s∼ξ} E_{a∼π(·|s)}[ φ(s, a) φ(s, a)^⊤ ] ).
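Like ϑ(G, ξ) above, C_min(φ, ξ) can be probed numerically on small instances: fix a candidate policy, form the expected design matrix E_{s∼ξ}E_{a∼π(·|s)}[φ(s, a)φ(s, a)^⊤], and take its smallest eigenvalue; the best value over any family of candidate policies lower-bounds C_min(φ, ξ). A minimal sketch with illustrative random features:

```python
import numpy as np

rng = np.random.default_rng(2)
num_contexts, num_actions, d = 3, 4, 3
xi = np.full(num_contexts, 1.0 / num_contexts)         # context distribution
phi = rng.normal(size=(num_contexts, num_actions, d))  # illustrative features

def min_eig(policy):
    """sigma_min of the expected design matrix under a row-stochastic policy."""
    design = np.zeros((d, d))
    for s in range(num_contexts):
        for a in range(num_actions):
            design += xi[s] * policy[s, a] * np.outer(phi[s, a], phi[s, a])
    return np.linalg.eigvalsh(design)[0]   # eigvalsh sorts ascending

# Lower-bound C_min(phi, xi): uniform policy plus random softmax policies.
uniform = np.full((num_contexts, num_actions), 1.0 / num_actions)
best = min_eig(uniform)
for _ in range(200):
    logits = rng.normal(size=(num_contexts, num_actions))
    policy = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    best = max(best, min_eig(policy))
```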

Remark 6.2. The explorability constant is a problem-dependent constant that was previously introduced in Hao et al. (2020; 2021a) for non-contextual sparse linear bandits. For an easy instance where {φ(s_t, a)}_{a∈A} is a full hypercube for every s_t ∈ S, C_min(φ, ξ) can be as large as 1, attained by the policy that samples uniformly from the corners of the hypercube no matter which context arrives.

6.1. Lower bound for conditional IDS

We prove an algorithm-dependent lower bound for a sparse linear contextual bandit instance.

Theorem 6.3. Let π^CIR be a conditional IDS policy. There exists a sparse linear contextual bandit instance whose explorability constant is 1/2 such that

BR(n; π^CIR) ≥ (1/16) √(nds).

How about the principle of optimism? Hao et al. (2021a, Section 4) have shown that even in the non-contextual case, optimism-based algorithms such as UCB or Thompson sampling fail to optimally balance information and regret, and result in a sub-optimal regret bound.

6.2. Upper bound achieved by contextual IDS

In this section, we prove that contextual IDS can achieve a nearly dimension-free regret bound. We say that the feature set has sparse optimal actions if the optimal action is s-sparse for each context, almost surely with respect to the prior.

Lemma 6.4 (Bounds for the information ratio). We have I_{0,2} ≤ d/2. When the feature set has sparse optimal actions, we have I_{0,3} ≤ s²/(4 C_min(φ, ξ)).

From the definition of π*(m) = argmax_{a∈A_m} φ(m, a)^⊤ θ*, we know that π* is a deterministic function of θ*. By the data processing lemma, we have I(π*; F_{n+1}) ≤ I(θ*; F_{n+1}). According to Lemma 5.8 in Hao et al. (2021a), we have the following lemma.

Lemma 6.5. The cumulative information gain can be bounded as

E[ Σ_{t=1}^n E_t[ I_t(π*)^⊤ π_t(·|s_t) ] ] ≤ 2s log(d n^{1/2}/s).

Combining the results in Lemmas 6.4-6.5 with the generic regret bound in Theorem 4.4, we obtain the following regret bound.

Theorem 6.6 (Regret bound for sparse contextual IDS). Suppose π^MIR = (π_t)_{t∈N}, where π_t = argmin_π Ψ²_{t,0}(π). When the feature set has sparse optimal actions, the following regret bound holds:

BR(n; π^MIR) ≲ min{ √(nds log(dn^{1/2}/s)), s n^{2/3} (log(dn^{1/2}/s))^{1/3} / (C_min(φ, ξ))^{1/3} }.

In the data-poor regime where d ≳ n^{1/3} s / C_min^{2/3}, contextual IDS achieves an Õ(sn^{2/3}) regret bound that is tighter than the lower bound for conditional IDS. This rate matches the minimax lower bound derived in Hao et al. (2020) up to an s factor in the data-poor regime.

7. Practical Algorithm

As shown in Section 4.1, conditional IDS only needs to optimize over probability distributions. As proved by Russo & Van Roy (2018, Proposition 6), it suffices for conditional IDS to randomize over two actions at each round, and the computational complexity was further improved to O(|A|) by Kirschner (2021, Lemma 2.7). However, contextual IDS as described in Section 4.2 needs to optimize over a mapping from the context space to the action space at each round, which in general is much more computationally demanding.

To obtain a practical and scalable algorithm, we approximate contextual IDS using Actor-Critic (Konda & Tsitsiklis, 2000). We parametrize the policy class by a neural network and optimize the information ratio by multi-step stochastic gradient descent.

Consider a parametrized policy class Π_θ = {π_θ : S → P(A)}. The policy, which is a conditional probability distribution π_θ, can be parameterized with a neural network that maps (deterministically) from a context s to a probability distribution over A. We further parametrize the critic, which is the reward function, by a value network Q_θ.

To avoid additional approximation errors, we assume we can sample from the true context distribution. At each round, the parametrized contextual IDS minimizes the following empirical MIR loss through SGD:

min_{π∈Π_θ} ( Σ_{i=1}^w Δ_t(s_t^(i))^⊤ π(·|s_t^(i)) )² / ( Σ_{i=1}^w I_t(π*)^⊤ π(·|s_t^(i)) ),  (7.1)

where {s_t^(1), ..., s_t^(w)} are independent samples of contexts. We use Epistemic Neural Networks (ENN) (Osband et al., 2021a) to quantify the posterior uncertainty of the value network. For each given s_t^(i), following the procedure in Sections 6.3.3 and 6.3.4 of Lu et al. (2021), we
Contextual Information-Directed Sampling

can approximate the one-step expected regret and the in-


formation gain efficiently using the samples outputted by
ENN.
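The empirical MIR objective in Eq. (7.1) is cheap to evaluate once per-action regret and information-gain estimates are available. Below is a minimal pure-Python sketch; `regret_fn` and `info_fn` are hypothetical stand-ins for the ENN-based estimators, and the numbers are arbitrary. It illustrates only the objective, not the SGD training loop over a policy network.

```python
def empirical_mir_loss(policy, contexts, regret_fn, info_fn):
    """Empirical MIR objective of Eq. (7.1) for a fixed round t:
    (sum_i Delta_t(s_i)^T pi(.|s_i))^2 / (sum_i I_t(pi*)^T pi(.|s_i))."""
    num = 0.0
    den = 0.0
    for s in contexts:
        pi = policy(s)                     # action distribution for context s
        num += sum(p * d for p, d in zip(pi, regret_fn(s)))
        den += sum(p * g for p, g in zip(pi, info_fn(s)))
    return num ** 2 / den

# Two-action toy: action 0 is low regret / low information,
# action 1 is high regret / high information (illustrative values).
regret_fn = lambda s: [0.1, 1.0]
info_fn = lambda s: [0.01, 1.0]
contexts = [0.0] * 20                      # stand-ins for the samples s_t^(1..w)

loss_greedy = empirical_mir_loss(lambda s: [1.0, 0.0], contexts, regret_fn, info_fn)
loss_mixed = empirical_mir_loss(lambda s: [0.5, 0.5], contexts, regret_fn, info_fn)
```

A mixed policy already attains a smaller empirical ratio than always playing the low-regret action, which is exactly the trade-off the parametrized contextual IDS descends over the weights of the policy network.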
8. Experiment

We conduct some preliminary experiments to evaluate the parametrized contextual IDS on a neural network contextual bandit.
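A minimal pure-Python sketch of such a neural network contextual bandit, using the configuration reported below in this section (contexts x_t ∈ R^10 drawn from N(0, I_10), 5 actions, a 2-hidden-layer ReLU MLP link function f_θ*, standard Gaussian reward noise); the weight scaling is an illustrative choice, not the paper's exact initialization.

```python
import math
import random

random.seed(0)
D, H, K = 10, 10, 5     # context dimension, hidden width, number of actions

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1 / math.sqrt(cols)) for _ in range(cols)]
            for _ in range(rows)]

W1, W2, W3 = rand_matrix(H, D), rand_matrix(H, H), rand_matrix(K, H)

def mlp(x):
    """2-hidden-layer ReLU MLP f_theta*: R^D -> R^K (the link function)."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    h = [max(0.0, sum(w * hi for w, hi in zip(row, h))) for row in W2]
    return [sum(w * hi for w, hi in zip(row, h)) for row in W3]

def step(action):
    """One round: draw x_t ~ N(0, I_D), return the context and a noisy reward."""
    x = [random.gauss(0, 1) for _ in range(D)]
    means = mlp(x)
    reward = means[action] + random.gauss(0, 1)   # Y_{t,a} = [f(x_t)]_a + eta_{t,a}
    return x, reward

x, r = step(action=2)
```

In the experiment the agents only observe (x_t, reward) pairs and maintain ENN posteriors over f_θ*; this sketch only defines the environment side.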
At each round t, the environment independently generates an observation in the form of a d-dimensional contextual vector x_t from some distribution. Let f_θ* : R^d → R^{|A|} be a neural network link function. When the agent takes action a, she receives a reward of the form Y_{t,a} = [f_θ*(x_t)]_a + η_{t,a}. This is the non-parameter-sharing formulation of contextual bandits, which means each arm has its own parameters.

We set the generative model f_θ* to be a 2-hidden-layer ReLU MLP with 10 hidden neurons. The number of actions is 5. The contextual vector x_t ∈ R^10 is sampled from N(0, I_10) and the noise is sampled from a standard Gaussian distribution.

We compare contextual IDS with conditional IDS and Thompson sampling. For a fair comparison, we use the same ENN architecture to obtain posterior samples for all three agents. As reported by Osband et al. (2021b), ensemble sampling with a randomized prior function tends to be the ENN that best balances computation and accuracy, so we use 10 ensembles in our experiment. With 200 posterior samples, we use the approach described by Lu et al. (2021) to approximate the one-step regret and information gain for both conditional IDS and contextual IDS. We sample 20 independent contextual vectors at each round. Both the policy network and the value network are 2-hidden-layer ReLU MLPs with 20 hidden neurons, optimized by Adam with learning rate 0.001. We report our results in Figure 1, where parametrized contextual IDS achieves reasonably good regret.

[Figure 1. Cumulative regret for a neural network contextual bandit.]

9. Discussion and Future Work

In this work, we investigate the right form of information ratio for contextual bandits and emphasize the importance of utilizing context-distribution information, through contextual bandits with graph feedback and sparse linear contextual bandits. For linear contextual bandits with a moderately small number of actions, one direction for future work is to see whether contextual IDS can achieve an O(√(dn log(k))) regret bound.

References

Abbasi-Yadkori, Y., Pal, D., and Szepesvari, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1-9, 2012.

Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638-1646. PMLR, 2014.

Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pp. 23-35. PMLR, 2015.

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785-1826, 2017.

Arumugam, D. and Van Roy, B. The value of information when deciding what to learn. Advances in Neural Information Processing Systems, 34, 2021.

Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 68(1):276-294, 2020.

Dann, C., Mansour, Y., Mohri, M., Sekhari, A., and Sridharan, K. Reinforcement learning with feedback graphs. Advances in Neural Information Processing Systems, 33, 2020.

Foster, D. and Rakhlin, A. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pp. 3199-3210. PMLR, 2020.
Hao, B., Lattimore, T., and Wang, M. High-dimensional sparse linear bandits. Advances in Neural Information Processing Systems, 33:10753-10763, 2020.

Hao, B., Lattimore, T., and Deng, W. Information directed sampling for sparse linear bandits. Advances in Neural Information Processing Systems, 34, 2021a.

Hao, B., Lattimore, T., Szepesvári, C., and Wang, M. Online sparse reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp. 316-324. PMLR, 2021b.

Kim, G.-S. and Paik, M. C. Doubly-robust lasso bandit. In Advances in Neural Information Processing Systems, pp. 5869-5879, 2019.

Kirschner, J. Information-Directed Sampling: Frequentist Analysis and Applications. PhD thesis, ETH Zurich, 2021.

Kirschner, J. and Krause, A. Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pp. 358-384. PMLR, 2018.

Kirschner, J., Lattimore, T., and Krause, A. Information directed sampling for linear partial monitoring. In Conference on Learning Theory, pp. 2328-2369. PMLR, 2020a.

Kirschner, J., Lattimore, T., Vernade, C., and Szepesvári, C. Asymptotically optimal information-directed sampling. arXiv preprint arXiv:2011.05944, 2020b.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008-1014, 2000.

Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in Neural Information Processing Systems, 20(1):96-1, 2007.

Lattimore, T. and György, A. Mirror descent and the information ratio. arXiv preprint arXiv:2009.12228, 2020.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.

Lattimore, T., Crammer, K., and Szepesvári, C. Linear multi-resource allocation with semi-bandit feedback. In Advances in Neural Information Processing Systems, pp. 964-972, 2015.

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661-670, 2010.

Liu, F., Buccapatnam, S., and Shroff, N. Information directed sampling for stochastic bandits with graph feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018a.

Liu, F., Zheng, Z., and Shroff, N. Analysis of Thompson sampling for graphical bandits without the graphs. arXiv preprint arXiv:1805.08930, 2018b.

Lu, X., Van Roy, B., Dwaracherla, V., Ibrahimi, M., Osband, I., and Wen, Z. Reinforcement learning, bit by bit. arXiv preprint arXiv:2103.04047, 2021.

Mannor, S. and Shamir, O. From bandits to experts: On the value of side-observations. Advances in Neural Information Processing Systems, 24:684-692, 2011.

Oh, M.-h., Iyengar, G., and Zeevi, A. Sparsity-agnostic lasso bandit. In International Conference on Machine Learning, pp. 8271-8280. PMLR, 2021.

Osband, I., Wen, Z., Asghari, M., Ibrahimi, M., Lu, X., and Van Roy, B. Epistemic neural networks. arXiv preprint arXiv:2107.08924, 2021a.

Osband, I., Wen, Z., Asghari, S. M., Dwaracherla, V., Hao, B., Ibrahimi, M., Lawson, D., Lu, X., O'Donoghue, B., and Van Roy, B. Evaluating predictive distributions: Does Bayesian deep learning work? arXiv preprint arXiv:2110.04629, 2021b.

Ren, Z. and Zhou, Z. Dynamic batch learning in high-dimensional sparse linear contextual bandits. arXiv preprint arXiv:2008.11918, 2020.

Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221-1243, 2014.

Russo, D. and Van Roy, B. Learning to optimize via information-directed sampling. Operations Research, 66(1):230-252, 2018.

Simchi-Levi, D. and Xu, Y. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Available at SSRN, 2020.
Sivakumar, V., Wu, Z. S., and Banerjee, A. Structured linear contextual bandits: A sharp and geometric smoothed analysis. International Conference on Machine Learning, 2020.

Tossou, A. C., Dimitrakakis, C., and Dubhashi, D. Thompson sampling for stochastic bandits with graph feedback. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Wang, X., Wei, M., and Yao, T. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In International Conference on Machine Learning, pp. 5200-5208, 2018.

Wang, Y., Chen, Y., Fang, E. X., Wang, Z., and Li, R. Nearly dimension-independent sparse linear bandit over small action spaces via best subset selection. arXiv preprint arXiv:2009.02003, 2020.
A. General Results

A.1. Proof of Theorem 4.4

Proof. From the definitions in Eqs. (3.1) and (4.1), we can decompose the Bayesian regret of contextual IDS as

    BR(n; π^MIR) = nα + E[ Σ_{t=1}^n E_{s_t}[ (Δ_t(s_t) − α)^⊤ π_t(·|s_t) ] ].    (A.1)

Recall that at each round contextual IDS follows

    π_t = argmin_{π:S→P(A)} Ψ²_{t,α}(π) = argmin_{π:S→P(A)} max{0, E_{s_t}[Δ_t(s_t)^⊤ π(·|s_t)] − α}² / E_{s_t}[I_t(π*)^⊤ π(·|s_t)].

Moreover, we define

    q_{t,λ} = argmin_{π:S→P(A)} Ψ^λ_{t,α}(π) = argmin_{π:S→P(A)} max{0, E_{s_t}[Δ_t(s_t)^⊤ π(·|s_t)] − α}^λ / E_{s_t}[I_t(π*)^⊤ π(·|s_t)].

Suppose for a moment that E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α > 0. Let M be the number of contexts. Then the derivative can be written as

    ∇_{π(·|m)} Ψ²_{t,α}(π_t) = 2 p_m Δ_t(m) (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α) / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)] − p_m I_t(m) (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)² / (E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)])²,

where p_m is the arrival probability of context m. Using the first-order optimality condition, for each m ∈ [M],

    E_{s_t}[ ⟨∇_{π(·|s_t)} Ψ²_{t,α}(π_t), q_{t,λ}(·|s_t) − π_t(·|s_t)⟩ ] = Σ_{m=1}^M p_m ⟨∇_{π(·|m)} Ψ²_{t,α}(π_t), q_{t,λ}(·|m) − π_t(·|m)⟩ ≥ 0.

This implies

    2 E_{s_t}[Δ_t(s_t)^⊤ (q_{t,λ}(·|s_t) − π_t(·|s_t))] (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α) / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]
    ≥ E_{s_t}[I_t(π*)^⊤ (q_{t,λ}(·|s_t) − π_t(·|s_t))] (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)² / (E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)])².

Since we assume E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α > 0 and the information gain is always non-negative, we have

    2 E_{s_t}[Δ_t(s_t)^⊤ (q_{t,λ}(·|s_t) − π_t(·|s_t))] ≥ E_{s_t}[I_t(π*)^⊤ (q_{t,λ}(·|s_t) − π_t(·|s_t))] (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α) / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)],

which implies

    2 (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α) ≥ (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α) (1 + E_{s_t}[I_t(π*)^⊤ q_{t,λ}(·|s_t)] / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]) ≥ E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α.

Putting the above results together,

    (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)^λ / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]
    = (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)² (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)^{λ−2} / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]
    ≤ 2^{λ−2} (E_{s_t}[Δ_t(s_t)^⊤ π_t(·|s_t)] − α)² (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α)^{λ−2} / E_{s_t}[I_t(π*)^⊤ π_t(·|s_t)]
    ≤ 2^{λ−2} (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α)² (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α)^{λ−2} / E_{s_t}[I_t(π*)^⊤ q_{t,λ}(·|s_t)]
    = 2^{λ−2} (E_{s_t}[Δ_t(s_t)^⊤ q_{t,λ}(·|s_t)] − α)^λ / E_{s_t}[I_t(π*)^⊤ q_{t,λ}(·|s_t)].

Recall that I_{α,λ} is the worst-case information ratio, so that for any t ∈ [n], Ψ^λ_{t,α}(q_{t,λ}) ≤ I_{α,λ} almost surely. Therefore,

    E[Δ_t(s_t)^⊤ π_t(·|s_t)] − α ≤ 2^{1−2/λ} (E[I_t(π*)^⊤ π_t(·|s_t)])^{1/λ} I_{α,λ}^{1/λ},    (A.2)

which is obvious when E[Δ_t(s_t)^⊤ π_t(·|s_t)] − α ≤ 0. Combining Eqs. (A.1) and (A.2) and using Hölder's inequality, we obtain

    BR(n; π^MIR) ≤ nα + 2^{1−2/λ} I_{α,λ}^{1/λ} n^{1−1/λ} ( E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] )^{1/λ}.

This ends the proof.
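The pivotal step above — that the Ψ² minimizer π_t is within a factor 2^{λ−2} of the optimal Ψ^λ value — can be sanity-checked numerically. The sketch below grid-searches the simplex for a single context with three actions; the regret and information-gain values are arbitrary test numbers, not quantities from the paper.

```python
def psi(pi, delta, info, lam):
    """Information ratio Psi^lambda_{t,0}(pi) for a single context."""
    reg = sum(p * d for p, d in zip(pi, delta))
    gain = sum(p * g for p, g in zip(pi, info))
    return max(0.0, reg) ** lam / gain

def argmin_on_simplex(obj, n=50):
    """Coarse grid search over the probability simplex on 3 actions."""
    grid = ((i / n, j / n, (n - i - j) / n)
            for i in range(n + 1) for j in range(n + 1 - i))
    return min(grid, key=obj)

delta = (0.2, 0.7, 1.0)   # per-action one-step regret (arbitrary)
info = (0.05, 0.6, 1.2)   # per-action information gain (arbitrary)
lam = 3
pi_t = argmin_on_simplex(lambda p: psi(p, delta, info, 2))    # IDS: minimize Psi^2
q_t = argmin_on_simplex(lambda p: psi(p, delta, info, lam))   # Psi^lambda minimizer
lhs = psi(pi_t, delta, info, lam)
rhs = 2 ** (lam - 2) * psi(q_t, delta, info, lam)
```

Up to grid resolution, `lhs <= rhs` reproduces Ψ^λ_{t,0}(π_t) ≤ 2^{λ−2} Ψ^λ_{t,0}(q_{t,λ}), and the Ψ² minimizer beats every single-action ratio Δ_a²/I_a.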

B. Contextual Bandits with Graph Feedback

B.1. Proof of Lemma 5.5

Proof. We bound the squared information ratio in terms of the independence number β(G). Let π_t^TS(·|s_t) = P_t(a*_t = ·) and consider a mixture policy π_t^mix = (1 − ε)π_t^TS + ε/k with some mixing parameter ε ∈ [0, 1] that will be determined later.

First, we derive an upper bound for the one-step regret. From the definition of the one-step regret,

    Δ_t(s_t)^⊤ π_t^mix(·|s_t) = Σ_{a∈A_t} π_t^mix(a|s_t) E_t[Y_{t,a*_t} − Y_{t,a} | s_t]
    = (1 − ε) Σ_{a∈A_t} π_t^TS(a|s_t) E_t[Y_{t,a*_t} − Y_{t,a} | s_t] + ε Σ_{a∈A_t} (1/k) E_t[Y_{t,a*_t} − Y_{t,a} | s_t].    (B.1)

Recall that R_max is the upper bound of the maximum expected reward and let d_t(a, a′) = D_KL( P_t(Y_{t,a} ∈ · | a*_t = a′) || P_t(Y_{t,a} ∈ ·) ). It is easy to see that Y_{t,a} is a √(R²_max + 1)-sub-Gaussian random variable. For the first term of Eq. (B.1), according to Lemma 3 in Russo & Van Roy (2014), we have

    Σ_{a∈A_t} π_t^TS(a|s_t) E_t[Y_{t,a*_t} − Y_{t,a} | s_t] ≤ Σ_{a∈A_t} π_t^TS(a|s_t) √( (R²_max + 1)/2 · d_t(a, a) ).    (B.2)

For the second term of Eq. (B.1), we directly bound it by

    ε Σ_{a∈A_t} (1/k) E_t[Y_{t,a*_t} − Y_{t,a} | s_t] ≤ 2εR_max.    (B.3)

Putting Eqs. (B.1)-(B.3) together, we have

    Δ_t(s_t)^⊤ π_t^mix(·|s_t) − 2εR_max ≤ Σ_{a∈A_t} π_t^TS(a|s_t) √( (R²_max + 1)/2 · d_t(a, a) ).

Second, we derive a lower bound for the information gain. Let E = (a_{ij})_{1≤i,j≤|A|} be the adjacency matrix representing the graph feedback structure G, so that a_{ij} = 1 if there exists an edge (i, j) ∈ E and a_{ij} = 0 otherwise. Since for any a_i, a_j ∈ N^out(a), Y_{t,a_i} and Y_{t,a_j} are mutually independent,

    I_t(π*; (Y_{t,a′})_{a′∈N^out(a)}) ≥ Σ_{a′∈N^out(a)} I_t(π*; Y_{t,a′}) = Σ_{a′∈A_t} E(a, a′) I_t(π*; Y_{t,a′}).

Therefore,

    I_t(π*)^⊤ π_t^mix(·|s_t) ≥ Σ_{a∈A_t} I_t(π*; (Y_{t,a′})_{a′∈N^out(a)}) π_t^mix(a|s_t)
    ≥ Σ_{a∈A_t} Σ_{a′∈A_t} E(a, a′) I_t(π*; Y_{t,a′}) π_t^mix(a|s_t)
    = Σ_{a∈A_t} ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) I_t(π*; Y_{t,a})
    = Σ_{a∈A_t} ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) Σ_{a′∈A_t} π_t^TS(a′|s_t) d_t(a, a′)
    ≥ Σ_{a∈A_t} ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) π_t^TS(a|s_t) d_t(a, a)
    = Σ_{a∈A_t} [ ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) / π_t^TS(a|s_t) ] π_t^TS(a|s_t)² d_t(a, a).

Let us denote

    U_{a,t} = ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) / π_t^TS(a|s_t).

Putting the above together,

    Δ_t(s_t)^⊤ π_t^mix(·|s_t) − 2εR_max ≤ √( (R²_max + 1)/2 ) √( Σ_{a∈A_t} 1/U_{a,t} ) √( Σ_{a∈A_t} U_{a,t} π_t^TS(a|s_t)² d_t(a, a) )
    ≤ √( (R²_max + 1)/2 ) √( Σ_{a∈A_t} 1/U_{a,t} ) √( I_t(π*)^⊤ π_t^mix(·|s_t) ),

where the first inequality is due to the Cauchy-Schwarz inequality. This implies

    (Δ_t(s_t)^⊤ π_t^mix(·|s_t) − 2εR_max)² / ( I_t(π*)^⊤ π_t^mix(·|s_t) )
    ≤ (R²_max + 1)/2 · Σ_{a∈A_t} π_t^TS(a|s_t) / ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) )
    ≤ (R²_max + 1)/(2(1 − ε)) · Σ_{a∈A_t} π_t^mix(a|s_t) / ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ),

where we use π_t^mix ≥ (1 − ε)π_t^TS. Using Jensen's inequality,

    (E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] − 2εR_max)² / E_{s_t}[I_t(π*)^⊤ π_t^mix(·|s_t)]
    ≤ E_{s_t}[ (Δ_t(s_t)^⊤ π_t^mix(·|s_t) − 2εR_max)² / ( I_t(π*)^⊤ π_t^mix(·|s_t) ) ]
    ≤ (R²_max + 1)/(2(1 − ε)) · E_{s_t}[ Σ_{a∈A_t} π_t^mix(a|s_t) / ( π_t^mix(a|s_t) + Σ_{a′∈N^in(a)} π_t^mix(a′|s_t) ) ].

Next we restate Lemma 5 from Alon et al. (2015) to bound the right-hand side.

Lemma B.1. Let G = (A, E) be a directed graph with |A| = k. Assume a distribution π over A such that π(i) ≥ η for all i ∈ A with some constant 0 < η < 0.5. Then we have

    Σ_{i∈A} π(i) / ( π(i) + Σ_{j∈N^in(i)} π(j) ) ≤ 4β(G) log( 4k / (β(G)η) ).

Using Lemma B.1 with η = ε/k,

    (E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] − 2εR_max)² / E_{s_t}[I_t(π*)^⊤ π_t^mix(·|s_t)] ≤ (2R²_max + 2)/(1 − ε) · 2β(G) log( 4k² / (β(G)ε) ).

According to the definition of contextual IDS, this ends the proof.
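Lemma B.1 (restated from Alon et al., 2015) is easy to check numerically on a small feedback graph; the graph and the distribution below are arbitrary illustrations, not objects from the paper.

```python
import itertools
import math

k = 4
# Out-neighborhoods of a directed 4-cycle with self-loops: playing i reveals N_out(i).
n_out = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 0}}

def independence_number(k, n_out):
    """beta(G): size of the largest set with no edge between distinct members."""
    for r in range(k, 0, -1):
        for S in itertools.combinations(range(k), r):
            if all(j not in n_out[i] for i in S for j in S if i != j):
                return r
    return 0

eta = 0.05
pi = [eta, eta, eta, 1 - 3 * eta]          # a distribution with pi(i) >= eta
n_in = {i: {j for j in range(k) if i in n_out[j] and j != i} for i in range(k)}
lhs = sum(pi[i] / (pi[i] + sum(pi[j] for j in n_in[i])) for i in range(k))
beta = independence_number(k, n_out)
rhs = 4 * beta * math.log(4 * k / (beta * eta))
```

For the 4-cycle, β(G) = 2 and the left-hand side evaluates to 2, far below the bound 8 log(160) ≈ 40.6, so the lemma is loose but valid here.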

B.2. Proof of Lemma 5.7

Proof. We bound the worst-case marginal information ratio with λ = 3. Define an exploratory policy μ : S → P(A) such that

    μ = argmax_π min_d E_{s, a∼π(·|s)}[ 1(d ∈ N^out(a)) ].

Consider a mixture policy π_t^mix = (1 − ε)π_t^TS + εμ for some mixing parameter ε ∈ [0, 1]. Since for any a_i, a_j ∈ N^out(a), Y_{t,a_i} and Y_{t,a_j} are mutually independent, we have

    E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] = E_{s_t, a∼π_t^mix(·|s_t)}[ I_t(π*; (Y_{t,a′})_{a′∈N^out(a)}) ] ≥ E_{s_t, a∼π_t^mix(·|s_t)}[ Σ_{a′∈N^out(a)} I_t(π*; Y_{t,a′}) ].

From the definition of the mixture policy, π_t^mix(a|s_t) ≥ εμ(a|s_t) for any a ∈ A_t. Then we have

    E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≥ ε E_{s_t, a∼μ(·|s_t)}[ Σ_{a′∈N^out(a)} I_t(π*; Y_{t,a′}) ]
    = ε E_{s_t, a∼μ(·|s_t)}[ Σ_{a′∈A_t} I_t(π*; Y_{t,a′}) 1(a′ ∈ N^out(a)) ].

Then we have

    E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≥ ε E_{s_t}[ Σ_{a′∈A_t} I_t(π*; Y_{t,a′}) ] · min_{a′} E_{s_t, a∼μ(·|s_t)}[ 1(a′ ∈ N^out(a)) ]
    ≥ ε ϑ(G, ξ) E_{s_t}[ Σ_{a′∈A_t} I_t(π*; Y_{t,a′}) ]    (B.4)
    ≥ ( 2εϑ(G, ξ) / (R²_max + 1) ) E_{s_t}[ Σ_ζ P_t(π* = ζ) Σ_{a∈A_t} ( E_t[Y_{t,a} | π* = ζ] − E_t[Y_{t,a}] )² ],

where the last inequality is from Pinsker's inequality.

On the other hand, by the definition of the mixture policy and Jensen's inequality,

    E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] ≤ (1 − ε) E_{s_t, a∼π_t^TS(·|s_t)}[ E_t[Y_{t,a*_t} − Y_{t,a}] ] + 2εR_max
    ≤ E_{s_t, a∼π_t^TS(·|s_t)}[ E_t[Y_{t,a} | a*_t = a] − E_t[Y_{t,a}] ] + 2εR_max
    ≤ √( E_{s_t}[ Σ_{a∈A_t} P_t(a*_t = a) ( E_t[Y_{t,a} | a*_t = a] − E_t[Y_{t,a}] )² ] ) + 2εR_max
    ≤ √( E_{s_t}[ Σ_ζ Σ_{a∈A_t} P_t(π* = ζ) ( E_t[Y_{t,a} | π* = ζ] − E_t[Y_{t,a}] )² ] ) + 2εR_max.

Combining with the lower bound on the information gain in Eq. (B.4),

    E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] ≤ √( (R²_max + 1)/(2εϑ(G, ξ)) · E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ) + 2εR_max.

By choosing

    ε = ( (R²_max + 1)/R²_max · 1/(8ϑ(G, ξ)) · E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] )^{1/3},

we reach

    E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] ≤ 2R_max ( (R²_max + 1)/R²_max · 1/(8ϑ(G, ξ)) · E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] )^{1/3}.

This implies

    ( E_{s_t}[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] )³ / E_{s_t}[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≤ (R³_max + R_max) / ϑ(G, ξ).

This ends the proof.
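The exploratory policy μ defined at the start of this proof can be found by brute force in small cases. With a single context, a policy is just a distribution over the k actions, so a coarse grid search over the simplex approximates ϑ(G, ξ); the graph below is an arbitrary example, and this is a sketch rather than an exact solver.

```python
k = 3
# Out-neighborhoods of a directed 3-cycle with self-loops.
n_out = {0: {0, 1}, 1: {1, 2}, 2: {2, 0}}

def coverage(pi):
    """min over actions d of P_{a~pi}(d in N_out(a)): the worst-case
    probability that one round of play observes the reward of d."""
    return min(sum(p for a, p in enumerate(pi) if d in n_out[a]) for d in range(k))

n = 50
theta_hat = max(coverage((i / n, j / n, (n - i - j) / n))
                for i in range(n + 1) for j in range(n + 1 - i))
```

For this graph the exact optimum is 2/3, attained by the uniform policy; the grid value sits just below because 50 is not divisible by 3.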

B.3. Proof of Lemma 5.8

Proof. According to the definition of the cumulative information gain,

    E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] = E_{t,s_t}[ I_t(π*; O_t | A_t, s_t) ].    (B.5)

Note that

    I_t(π*; (s_t, A_t, O_t)) = E_{t,s_t}[ I_t(π*; O_t | A_t, s_t) ] + E_{t,s_t}[ I_t(π*; A_t | s_t) ] + I_t(π*; s_t) = E_{t,s_t}[ I_t(π*; O_t | A_t, s_t) ],

where the last two terms vanish since s_t is independent of π* and the action A_t is determined by the history, the context, and independent randomization. By the chain rule,

    I(π*; F_{n+1}) = Σ_{t=1}^n E[ I_t(π*; (s_t, A_t, O_t)) ] = Σ_{t=1}^n E[ E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ].

On the other hand,

    I(π*; F_{n+1}) ≤ H(π*) ≤ M log(k),

where H(·) is the entropy. Putting the above together,

    E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] ≤ M log(k).

This ends the proof.

Remark B.2. We analyze the upper bound H(π*) in the following examples. Denote the number of contexts by M, which could be infinite.

1. The number of possible choices of π* is at most k^M, and thus H(π*) ≤ log(k^M) = M log(k). When M = ∞, this bound is vacuous.

2. We can divide the context space into fixed and disjoint clusters such that, for each parameter in the parameter space, the optimal policy maps all the contexts in a single cluster to a single action. For example, a naive partition is one in which each cluster contains a single context; in this case the number of clusters equals the number of contexts M. Let P be the minimum number of such clusters, so P ≤ M. The number of possible choices of π* is k^P, and thus H(π*) ≤ log(k^P) = P log(k). When M = ∞ but P < ∞, we obtain a valid upper bound instead of an infinite one.
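A toy instance of point 2 of Remark B.2, in which all M contexts share one optimal action and hence form a single cluster (P = 1); the two-point parameter space below is an arbitrary illustration.

```python
import math
from collections import Counter

M, k = 3, 2                    # number of contexts and actions
thetas = [-1.0, +1.0]          # uniform prior over a two-point parameter space

def optimal_policy(theta):
    # The sign of theta decides the optimal action in *every* context,
    # so the whole context space is one cluster in the sense of Remark B.2.
    a = 1 if theta > 0 else 0
    return tuple(a for _ in range(M))

counts = Counter(optimal_policy(t) for t in thetas)
probs = [c / len(thetas) for c in counts.values()]
entropy = -sum(p * math.log(p) for p in probs)   # H(pi*)
P = 1
```

Here H(π*) = log 2, matching the cluster bound P log k = log 2 and well below the naive bound M log k = 3 log 2.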
B.4. Proof of Theorem 5.9

Proof. According to Theorem 4.4, we have

    BR(n; π^MIR) ≤ 2R_max√n + inf_{λ≥2} 2^{1−2/λ} I_{2R_max/√n, λ}^{1/λ} n^{1−1/λ} ( E[ Σ_{t=1}^n E_{s_t}[ I_t(π*)^⊤ π_t(·|s_t) ] ] )^{1/λ}.

Using Lemma 5.5 with ε = 1/√n, we have

    I_{2R_max/√n, 2} ≤ (2R²_max + 2)/(1 − 1/√n) · 2β(G) log( 4k²√n / β(G) ) ≤ 16(R²_max + 1) β(G) log( 4k²√n / β(G) ).

Combining Lemmas 5.7-5.8, we have

    BR(n; π^MIR) ≤ 2R_max√n + min{ ( 16(R²_max + 1) β(G) log(4k²√n/β(G)) · n M log(k) )^{1/2}, ( (R³_max + R_max)/ϑ(G, ξ) )^{1/3} (2M log(k))^{1/3} n^{2/3} }
    ≤ C R_max min{ √( β(G) log(4k²√n/β(G)) · n M log(k) ), ( 2M log(k)/ϑ(G, ξ) )^{1/3} n^{2/3} },

for some absolute constant C > 0. This ends the proof.
C. Sparse Linear Contextual Bandits

C.1. Proof of Theorem 6.3

Proof. We assume there are two available context sets. Context set 1 consists of a single action x_0 = (0, …, 0)^⊤ ∈ R^{d+1} and an informative action set H defined as follows:

    H = { x ∈ R^{d+1} : x_j ∈ {−κ, κ} for j ∈ [d], x_{d+1} = −1 },    (C.1)

where 0 < κ ≤ 1 is a constant. Let d = sp for some integer p ≥ 2. Context set 2 consists of a multi-task bandit action set A = {({e_i ∈ R^p : i ∈ [p]}^s, 0)} ⊂ R^{d+1}, whose elements' last coordinate is always 0. The arrival probability of each context set is 1/2. One can verify that for this feature set the explorability constant is C_min(φ, ξ) = 1/2, achieved by a policy that samples uniformly from H when context set 1 arrives.

Fix a conditional IDS policy π^CIR. Let Δ > 0 and Θ = {Δe_i : i ∈ [p]} ⊂ R^p. Given θ ∈ {(Θ^s, 1)} ⊂ R^{d+1} and i ∈ [s], let θ^{(i)} ∈ R^p be defined by θ^{(i)}_k = θ_{(i−1)p+k}, which means that

    θ^⊤ = [θ^{(1)⊤}, …, θ^{(s)⊤}, 1].

Assume the prior over θ* is uniform on {(Θ^s, 1)}. Define the cumulative regret of policy π interacting with bandit θ as

    R_θ(n; π) = Σ_{t=1}^n E_θ[ ⟨x*_{s_t}, θ⟩ − Y_t ]
    = Σ_{t=1}^n E_θ[ (⟨x*_{s_t}, θ⟩ − Y_t) 1(s_t = 1) ] + Σ_{t=1}^n E_θ[ (⟨x*_{s_t}, θ⟩ − Y_t) 1(s_t = 2) ],    (C.2)

where we write x*_{s_t} = φ(s_t, a*_t) for short. Therefore,

    BR(n; π) = (1/|Θ|^s) Σ_{θ∈{(Θ^s,1)}} R_θ(n; π).

Note that when context set 1 arrives, any action from H suffers at least 1 − sΔ regret, since x_0 is always the optimal action for context set 1. From the definition of conditional IDS in Eq. (4.2), when context set 1 arrives and sΔ < 1, conditional IDS will always pull x_0 under this prior. That means conditional IDS suffers no regret on context set 1, which implies that for any θ ∈ {(Θ^s, 1)},

    Σ_{t=1}^n E_θ[ (⟨x*_{s_t}, θ⟩ − Y_t) 1(s_t = 1) ] = Σ_{t=1}^n E_θ[ ⟨x*_1, θ⟩ − ⟨x_t, θ⟩ | s_t = 1 ] P(s_t = 1) = 0.

It remains to bound the second term in Eq. (C.2). This essentially follows the proof of Theorem 24.3 in Lattimore & Szepesvári (2020). From the proof of the multi-task bandit lower bound, we have

    (1/|Θ|^s) Σ_{θ∈{(Θ^s,1)}} Σ_{t=1}^n E_θ[ (⟨x*_{s_t}, θ⟩ − Y_t) 1(s_t = 2) ] ≥ (1/16) √(dsn).

This ends the proof.
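The claim that uniform sampling over H makes this instance explorable can be checked directly: over the full hypercube of sign patterns, the second-moment matrix E[xx^⊤] is exactly diagonal, so its smallest eigenvalue is simply min(κ², 1). The sketch uses a small d for illustration (the proof's constant 1/2 additionally accounts for the context-arrival probability).

```python
import itertools

d, kappa = 4, 0.5
# Informative set H of Eq. (C.1): all sign patterns on the first d coordinates.
H = [list(signs) + [-1.0] for signs in itertools.product([-kappa, kappa], repeat=d)]

m = len(H)                                    # 2**d corners, sampled uniformly
second_moment = [[sum(x[i] * x[j] for x in H) / m for j in range(d + 1)]
                 for i in range(d + 1)]
off_diag = max(abs(second_moment[i][j])
               for i in range(d + 1) for j in range(d + 1) if i != j)
diag = [second_moment[i][i] for i in range(d + 1)]
```

Every off-diagonal entry cancels exactly over the 2^d sign patterns, leaving the diagonal (κ², …, κ², 1), so no eigenvalue computation is needed to read off the smallest eigenvalue.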

C.2. Proof of Lemma 6.4

Proof. If one can derive a worst-case bound on Ψ^λ_{t,α}(π̃) for a particular policy π̃, one automatically obtains an upper bound on I_{α,λ}. The remaining step is to choose a proper policy π̃ for λ = 2 and λ = 3 separately.

First, we bound the information ratio with λ = 2. By the definition of mutual information, for any a ∈ A_t, we have

    I_t(π*; Y_{t,a}) = Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) D_KL( P_t(Y_{t,a} = · | π* = ζ) || P_t(Y_{t,a} = ·) ).    (C.3)

Let (Ω, F) be a measurable space and let P, Q : F → [0, 1] be two probability measures. By Pinsker's inequality,

    δ(P, Q) = sup_{A∈F} ( P(A) − Q(A) ) ≤ √( (1/2) D_KL(P || Q) ).

Let a < b and X : Ω → [a, b]. Exercise 14.4 in Lattimore & Szepesvári (2020) indicates that

    | ∫_Ω X(ω) dP(ω) − ∫_Ω X(ω) dQ(ω) | ≤ (b − a) δ(P, Q) ≤ (b − a) √( (1/2) D_KL(P || Q) ).

Overall,

    D_KL(P || Q) ≥ ( 2/(b − a)² ) ( ∫_Ω X(ω) dP(ω) − ∫_Ω X(ω) dQ(ω) )².

Recall that R_max is the upper bound of the maximum expected reward. It is easy to see that Y_{t,a} is a √(R²_max + 1)-sub-Gaussian random variable. According to Lemma 3 in Russo & Van Roy (2014), we have

    I_t(π*; Y_{t,a}) ≥ ( 2/(R²_max + 1) ) Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) ( E_t[Y_{t,a} | π* = ζ] − E_t[Y_{t,a}] )².    (C.4)

The MIR of contextual IDS can be bounded by the MIR of Thompson sampling:

    I_{0,2} ≤ max_{t∈[n]} max{0, E[Δ_t(s_t)^⊤ π_t^TS(·|s_t)]}² / E[I_t(π*)^⊤ π_t^TS(·|s_t)] ≤ max_{t∈[n]} E[ (Δ_t(s_t)^⊤ π_t^TS(·|s_t))² / ( I_t(π*)^⊤ π_t^TS(·|s_t) ) ],

where the second inequality follows from Jensen's inequality. Using the matrix trace-rank trick described in Proposition 5 of Russo & Van Roy (2014), we obtain I_{0,2} ≤ (R²_max + 1)d/2 in the end.

Second, we bound the information ratio with λ = 3. Define an exploratory policy μ such that

    μ = argmax_{π:S→P(A)} σ_min( E_{s∼ξ} E_{a∼π(·|s)}[ φ(s, a) φ(s, a)^⊤ ] ).    (C.5)

Consider a mixture policy π_t^mix = (1 − ε)π_t^TS + εμ, where the mixture rate ε ≥ 0 will be decided later.
Step 1: Bound the information gain. According to the lower bound on the information gain in Eq. (C.4),

    E[ I_t(π*)^⊤ π_t^mix(·|s_t) ]
    ≥ ( 2/(R²_max + 1) ) E_{s_t∼ξ, a∼π_t^mix(·|s_t)}[ Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) ( E_t[Y_{t,a} | π* = ζ] − E_t[Y_{t,a}] )² ]
    = ( 2/(R²_max + 1) ) E_{s_t∼ξ, a∼π_t^mix(·|s_t)}[ Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) ( φ(s_t, a)^⊤ E_t[θ* | π* = ζ] − φ(s_t, a)^⊤ E_t[θ*] )² ].

By the definition of the mixture policy, we know that π_t^mix(a|s_t) ≥ εμ(a|s_t) for any a ∈ A_t. Then we have

    E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≥ ( 2ε/(R²_max + 1) ) Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ)
        · E_{s_t∼ξ, a∼μ(·|s_t)}[ ( E_t[θ* | π* = ζ] − E_t[θ*] )^⊤ φ(s_t, a) φ(s_t, a)^⊤ ( E_t[θ* | π* = ζ] − E_t[θ*] ) ].

From the definition of μ in Eq. (C.5), we have

    E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≥ ( 2ε/(R²_max + 1) ) Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) C_min(φ, ξ) ‖ E_t[θ* | π* = ζ] − E_t[θ*] ‖²₂.

Step 2: Bound the instant regret. We decompose the regret into the contribution of the exploratory policy and that of TS:

    E[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] = (1 − ε) E_{s_t}[ Σ_a P_t(a*_t = a) ( E_t[φ(s_t, a)^⊤ θ* | a*_t = a] − E_t[φ(s_t, a)^⊤ θ*] ) ]
    + ε E_{s_t}[ Σ_a E_t[ φ(s_t, a*_t)^⊤ θ* − φ(s_t, a)^⊤ θ* ] μ(a|s_t) ].    (C.6)

Since R_max is the upper bound of the maximum expected reward, the second term can be bounded by 2R_max ε. Using Jensen's inequality, the first term can be bounded by

    E_{s_t}[ Σ_a P_t(a*_t = a) ( E_t[φ(s_t, a)^⊤ θ* | a*_t = a] − E_t[φ(s_t, a)^⊤ θ*] ) ]
    ≤ √( E_{s_t}[ Σ_a P_t(a*_t = a) ( E_t[φ(s_t, a)^⊤ θ* | a*_t = a] − E_t[φ(s_t, a)^⊤ θ*] )² ] ).

Since all the optimal actions are sparse, any action a with P_t(a*_t = a) > 0 must be sparse. Then we have

    ( φ(s_t, a)^⊤ ( E_t[θ* | a*_t = a] − E_t[θ*] ) )² ≤ s² ‖ E_t[θ* | a*_t = a] − E_t[θ*] ‖²₂,
for any action a with P_t(a*_t = a) > 0. This further implies

    E_{s_t}[ Σ_a P_t(a*_t = a) ( E_t[φ(s_t, a)^⊤ θ* | a*_t = a] − E_t[φ(s_t, a)^⊤ θ*] ) ]
    ≤ √( E_{s_t}[ Σ_a P_t(a*_t = a) s² ‖ E_t[θ* | a*_t = a] − E_t[θ*] ‖²₂ ] )
    ≤ √( E_{s_t}[ Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) s² ‖ E_t[θ* | π* = ζ] − E_t[θ*] ‖²₂ ] )    (C.7)
    = √( ( s²(R²_max + 1)/(2εC_min(φ, ξ)) ) · ( 2εC_min(φ, ξ)/(R²_max + 1) ) E_{s_t}[ Σ_{ζ∈A_1×···×A_M} P_t(π* = ζ) ‖ E_t[θ* | π* = ζ] − E_t[θ*] ‖²₂ ] )
    ≤ √( ( s²(R²_max + 1)/(2εC_min(φ, ξ)) ) E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ).

Putting Eqs. (C.6) and (C.7) together, we have

    E[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] ≤ √( ( s²(R²_max + 1)/(2εC_min(φ, ξ)) ) E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ) + 2R_max ε.

By optimizing the mixture rate ε, we have

    I_{0,3} ≤ ( E[Δ_t(s_t)^⊤ π_t^mix(·|s_t)] )³ / E[ I_t(π*)^⊤ π_t^mix(·|s_t) ] ≤ s²(R²_max + 1)/(8R²_max C_min(φ, ξ)) ≤ s²/(4C_min(φ, ξ)).

This ends the proof.
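The mean-difference lower bound on the KL divergence derived above (via Pinsker's inequality and Exercise 14.4 of Lattimore & Szepesvári, 2020) can be sanity-checked on discrete distributions; the distributions below are arbitrary.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# For X with values in [a, b]: KL(P||Q) >= 2 (E_P X - E_Q X)^2 / (b - a)^2.
values = [0.0, 0.5, 1.0]          # support of X, so b - a = 1
P = [0.6, 0.3, 0.1]
Q = [0.2, 0.3, 0.5]
mean = lambda dist: sum(v * p for v, p in zip(values, dist))
lower = 2 * (mean(P) - mean(Q)) ** 2 / (1.0 - 0.0) ** 2
```

Here KL(P‖Q) ≈ 0.50 comfortably exceeds the lower bound 2(E_P X − E_Q X)² = 0.32, as the derivation requires.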


This figure "1.png" is available in "png" format from:

http://arxiv.org/ps/2205.10895v1
