
Proceedings of Machine Learning Research vol 195:1–68, 2023

A Blackbox Approach to Best of Both Worlds in Bandits and Beyond


Christoph Dann cdann@cdann.net
Google Research
Chen-Yu Wei chenyuw@mit.edu
MIT Institute for Data, Systems, and Society
Julian Zimmert zimmert@google.com
Google Research

Abstract
Best-of-both-worlds algorithms for online learning which achieve near-optimal regret in both the
adversarial and the stochastic regimes have received growing attention recently. Existing techniques
often require careful adaptation to every new problem setup, including specialised potentials and
careful tuning of algorithm parameters. Yet, in domains such as linear bandits, it is still unknown if
there exists an algorithm that can simultaneously obtain O(log(T)) regret in the stochastic regime
and Õ(√T) regret in the adversarial regime. In this work, we resolve this question positively and
present a general reduction from best of both worlds to a wide family of follow-the-regularized-
leader (FTRL) and online-mirror-descent (OMD) algorithms. We showcase the capability of this
reduction by transforming existing algorithms that are only known to achieve worst-case guarantees
into new algorithms with best-of-both-worlds guarantees in contextual bandits, graph bandits and
tabular Markov decision processes.

1. Introduction
Multi-armed bandits and its various extensions have a long history (Lai et al., 1985; Auer et al.,
2002a;b). Traditionally, the stochastic regime in which all losses/rewards are i.i.d. and the adver-
sarial regime in which an adversary chooses the losses in an arbitrary fashion have been studied
in isolation. However, it is often unclear in practice whether an environment is best modelled by
the stochastic regime, a slightly contaminated regime, or a fully adversarial regime. This is why
the question of automatically adapting to the hardness of the environment, also called achieving
best-of-both-worlds guarantees, has received growing attention (Bubeck and Slivkins, 2012; Seldin
and Slivkins, 2014; Auer and Chiang, 2016; Seldin and Lugosi, 2017; Wei and Luo, 2018; Zimmert
and Seldin, 2019; Ito, 2021b; Ito et al., 2022a; Masoudian and Seldin, 2021; Gaillard et al., 2014;
Mourtada and Gaı̈ffas, 2019; Ito, 2021a; Lee et al., 2021; Rouyer et al., 2021; Amir et al., 2022;
Huang et al., 2022; Tsuchiya et al., 2022; Erez and Koren, 2021; Rouyer et al., 2022; Ito et al.,
2022b; Kong et al., 2022; Jin and Luo, 2020; Jin et al., 2021; Chen et al., 2022; Saha and Gaillard,
2022; Masoudian et al., 2022; Honda et al., 2023). One of the most successful approaches has been
to derive carefully tuned FTRL algorithms, which are canonically suited for the adversarial regime,
and then show that these algorithms are also close to optimal in the stochastic regime.
This is achieved via proving a crucial self-bounding property, which then translates into stochastic
guarantees. This type of algorithm has been proposed for multi-armed bandits (Wei and Luo,
2018; Zimmert and Seldin, 2019; Ito, 2021b; Ito et al., 2022a), combinatorial semi-bandits (Zim-
mert et al., 2019), bandits with graph feedback (Ito et al., 2022b), tabular MDPs (Jin and Luo, 2020;
Jin et al., 2021) and others. Algorithms with self-bounding properties also automatically adapt to

© 2023 C. Dann, C.-Y. Wei & J. Zimmert.



Setting | Algorithm | Adversarial | Stochastic | o(C) | Rd. 1 | Rd. 2
Linear bandit | Lee et al. (2021) | √(dT log(|X|T)) | d log(|X|T) log(T)/∆ | | |
Linear bandit | Theorem 3 | d√(νL⋆ log T) | d²ν log(T)/∆ | | |
Linear bandit | Corollary 7 | d√T log T | d² log(T)/∆ | ✓ | ✓ |
Linear bandit | Corollary 12 | √(dT log|X|) | d log|X| log(T)/∆ | ✓ | ✓ | ✓
Contextual bandit | Corollary 13 | √(KT log|X|) | K log|X| log(T)/∆ | ✓ | ✓ | ✓
Graph bandit, strongly observable | Ito et al. (2022b) | √(αT log(T) log(KT)) | α log²(KT) log(T)/∆ | | |
Graph bandit, strongly observable | Rouyer et al. (2022) | √(α̃T log K) | α̃ log(KT) log(T)/∆ | | |
Graph bandit, strongly observable | Corollary 15 | √(min{α̃, α log K} T log K) | min{α̃, α log K} log(K) log(T)/∆ | ✓ | ✓ | ✓
Graph bandit, weakly observable | Ito et al. (2022b) | (δ log(KT) log T)^{1/3} T^{2/3} | δ log(KT) log(T)/∆² | | |
Graph bandit, weakly observable | Corollary 19 | (δ log K)^{1/3} T^{2/3} | δ log(K) log(T)/∆² | ✓ | ✓ | ✓
Tabular MDP | Jin et al. (2021) | √(H²S²AT log² T) | H⁶S²A log²(T)/∆′ | | |
Tabular MDP | Theorem 26 | √(HS²A L⋆ log⁴ T) | H²S²A log³(T)/∆ | ✓ | ✓ | ✓

Table 1: Overview of regret bounds. The o(C) column specifies whether the algorithm achieves the
optimal dependence on the amount of corruption (√C or C^{2/3}, depending on the setting). “Rd. 1”
and “Rd. 2” indicate whether the result leverages the first and the second reductions described in
Section 4, respectively. L⋆ is the cumulative loss of the best action or policy. ν in Theorem 3 is the
self-concordance parameter; it holds that ν ≤ O(d) (see Appendix B for details). ∆′ in Jin et al.
(2021) is the gap in the Q⋆ function, which is different from the policy gap ∆ we use; it holds that
∆ ≤ ∆′. α, α̃, δ are complexity measures of feedback graphs defined in Section 5.

intermediate regimes of stochastic losses with adversarial corruptions (Lykouris et al., 2018; Gupta
et al., 2019; Zimmert and Seldin, 2019; Ito, 2021a), which highlights the strong robustness of this
algorithm design. However, every problem variation required careful design of potential functions
and learning rate schedules. For linear bandits, it has still been unknown whether self-bounding is
possible, and therefore the state-of-the-art best-of-both-worlds algorithm (Lee et al., 2021) neither
obtains the optimal log T stochastic rate nor canonically adapts to corruptions.
In this work, we make the following contributions: 1) We propose a general reduction from best
of both worlds to typical FTRL/OMD algorithms, sidestepping the need for customized potentials
and learning rates. 2) We derive the first best-of-both-worlds algorithm for linear bandits that obtains
log T regret in the stochastic regime, optimal adversarial worst-case regret and adapts canonically
to corruptions. 3) We derive the first best-of-both-worlds algorithm for linear bandits and tabular
MDPs with first-order guarantees in the adversarial regime. 4) We obtain the first best-of-both-
worlds algorithms for bandits with graph feedback and bandits with expert advice with optimal
log T stochastic regret.

2. Related Work
Our reduction procedure is related to a class of model selection algorithms that use a meta bandit
algorithm to learn over a set of base bandit algorithms, with the goal of achieving performance
comparable to running the best base algorithm alone. For the adversarial regime, Agarwal et al.
(2017) introduced the Corral algorithm to learn over adversarial bandit algorithms that satisfy the


stability condition. This framework is further improved by Foster et al. (2020); Luo et al. (2022).
For the stochastic regime, Arora et al. (2021); Abbasi-Yadkori et al. (2020); Cutkosky et al. (2021)
introduced another set of techniques to achieve similar guarantees, but without explicitly relying
on the stability condition. While most of these results focus on obtaining a √T regret, Arora et al.
(2021); Cutkosky et al. (2021); Wei et al. (2022); Pacchiano et al. (2022) made some progress in
obtaining polylog(T ) regret in the stochastic regime. Among them, Wei et al. (2022); Pacchiano
et al. (2022) are most related to us since they also pursue a best-of-both-worlds guarantee across
adversarial and stochastic regimes. However, their regret bounds, when applied to our problems, all
suffer from highly sub-optimal dependence on log(T ) and the amount of corruptions.

3. Preliminaries
We consider sequential decision making problems where the learner interacts with the environment
in T rounds. In each round t, the environment generates a loss vector (ℓ_{t,u})_{u∈X} and the learner
generates a distribution over actions p_t and chooses an action A_t ∼ p_t. The learner then suffers
loss ℓ_{t,A_t} and receives some information about (ℓ_{t,u})_{u∈X} depending on the concrete setting. In all
settings but MDPs, we assume ℓ_{t,u} ∈ [−1, 1].
In the adversarial regime, (ℓ_{t,u})_{u∈X} is generated arbitrarily subject to the structure of the concrete
setting (e.g., in linear bandits, E[ℓ_{t,u}] = ⟨u, ℓ_t⟩ for arbitrary ℓ_t ∈ R^d). In the stochastic regime,
we further assume that there exists an action x⋆ ∈ X and a gap ∆ > 0 such that E[ℓ_{t,x} − ℓ_{t,x⋆}] ≥ ∆
for all x ≠ x⋆. We also consider the corrupted stochastic regime, where the assumption is relaxed
to E[ℓ_{t,x} − ℓ_{t,x⋆}] ≥ ∆ − C_t for some C_t ≥ 0. We define C = \sum_{t=1}^T C_t. The goal of the learner is
to minimize the pseudo-regret, defined as
\[ \mathrm{Reg} \;=\; \max_{u\in X}\, E\Big[\sum_{t=1}^{T} (\ell_{t,A_t} - \ell_{t,u})\Big]. \]
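To make the protocol concrete, the following Python sketch (our own illustration, not an algorithm from this paper) simulates the interaction in the stochastic regime and accumulates an empirical estimate of the pseudo-regret; the learner object with distribution() and update(action, loss) methods is a hypothetical interface.

import numpy as np

def run(learner, T=10000, K=5, Delta=0.2, seed=0):
    # Stochastic regime: arm 0 is optimal, every other arm has mean loss larger by Delta.
    rng = np.random.default_rng(seed)
    means = np.full(K, 0.5 + Delta)
    means[0] = 0.5
    regret = 0.0
    for t in range(T):
        p = learner.distribution()               # p_t over the K actions
        a = rng.choice(K, p=p)                   # A_t ~ p_t
        loss = float(rng.random() < means[a])    # Bernoulli loss for the chosen arm
        learner.update(a, loss)                  # bandit feedback: only the chosen loss is revealed
        regret += means[a] - means[0]            # contribution to the pseudo-regret
    return regret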

4. Main Techniques
4.1. The standard global self-bounding condition and a new linear bandit algorithm
To obtain a best-of-both-world regret bound, previous works show the following property for their
algorithm:

Definition 1 (α-global-self-bounding condition, or α-GSB)
\[ \forall u \in X:\quad E\Big[\sum_{t=1}^{T}(\ell_{t,A_t}-\ell_{t,u})\Big] \;\le\; \min\bigg\{ c_0^{1-\alpha} T^{\alpha},\ (c_1\log T)^{1-\alpha}\, E\Big[\sum_{t=1}^{T}(1-p_{t,u})\Big]^{\alpha} \bigg\} + c_2\log T, \]
where c_0, c_1, c_2 are problem-dependent constants and p_{t,u} is the probability of the learner choosing
u in round t.

With this condition, one can use the standard self-bounding technique to obtain best-of-both-worlds
regret bounds:

Proposition 2 If an algorithm satisfies α-GSB, then its pseudo-regret is bounded by
\( O\big(c_0^{1-\alpha} T^{\alpha} + c_2\log T\big) \) in the adversarial regime, and by
\( O\big(c_1\log(T)\,\Delta^{-\frac{\alpha}{1-\alpha}} + (c_1\log T)^{1-\alpha}(C\Delta^{-1})^{\alpha} + c_2\log(T)\big) \)
in the corrupted stochastic regime.
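To see where the stochastic rate comes from, the standard self-bounding calculation for α = 1/2 and C = 0 (the same argument as the proof in Appendix D, spelled out here for illustration, with non-essential constants) reads as follows. Write S = E[∑_{t=1}^T (1 − p_{t,x⋆})]. The GSB upper bound gives Reg ≤ √(c_1 log(T) S) + c_2 log T, while the gap assumption gives Reg ≥ ∆S. Hence, for any λ ∈ (0, 1],
\[
\mathrm{Reg} \;=\; (1+\lambda)\,\mathrm{Reg} - \lambda\,\mathrm{Reg}
\;\le\; (1+\lambda)\Big(\sqrt{c_1\log(T)\,S} + c_2\log T\Big) - \lambda\,\Delta S
\;\le\; \frac{(1+\lambda)^2\,c_1\log T}{4\lambda\,\Delta} + (1+\lambda)\,c_2\log T,
\]
where the last step maximizes over S ≥ 0. Taking λ = 1 yields Reg = O(c_1 log T/∆ + c_2 log T), matching Proposition 2 with α = 1/2.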



For completeness, we provide a proof of Proposition 2 in Appendix D. Previous works have found
algorithms with GSB in various settings, such as multi-armed bandits, semi-bandits, graph-bandits,
and MDPs. Here, we provide one such algorithm (Algorithm 3) for linear bandits based on the
framework of SCRiBLe (Abernethy et al., 2008). The guarantee of Algorithm 3 is stated in the
next theorem under the more general “learning with predictions” setting introduced in Rakhlin and
Sridharan (2013), where in every round t, the learner receives a loss predictor mt before making
decisions.

Theorem 3 In the “learning with predictions” setting, Algorithm 3 achieves a second-order regret bound of
\[ O\Big(d\sqrt{\nu\log(T)\,\textstyle\sum_{t=1}^T(\ell_{t,A_t}-m_{t,A_t})^2} \;+\; d\nu\log T\Big) \]
in the adversarial regime, where d is the feature dimension, ν ≤ O(d) is the self-concordance parameter of the
regularizer, and m_{t,x} = ⟨x, m_t⟩ is the loss predictor. This also implies a first-order regret bound
of \( O\big(d\sqrt{\nu\log(T)\,\sum_{t=1}^T \ell_{t,u}}\big) \) if ℓ_{t,x} ≥ 0 and m_t = 0; it simultaneously achieves
\[ \mathrm{Reg} \;=\; O\Big(\frac{d^2\nu\log T}{\Delta} + d\sqrt{\frac{\nu\log T}{\Delta}\,C}\Big) \]
in the corrupted stochastic regime.

See Appendix B for the algorithm and the proof. We call this algorithm Variance-Reduced SCRiBLe
(VR-SCRiBLe) since it is based on the original SCRiBLe updates, but with some refinement in the
construction of the loss estimator to reduce its variance. A good property of a SCRiBLe-based
algorithm is that it simultaneously achieves data-dependent bounds (i.e., first- and second-order
bounds), similar to the case of multi-armed bandit using FTRL/OMD with log-barrier regularizer
(Wei and Luo, 2018; Ito, 2021b). Like Rakhlin and Sridharan (2013); Bubeck et al. (2019); Ito
(2021b); Ito et al. (2022a), we can also use another procedure to learn mt and obtain path-length or
variance bounds. The details are omitted.
We notice, however, that the bound in Theorem 3 is sub-optimal in d, since the best-known regret
for linear bandits is either d√(T log T) or √(dT log |X|), depending on whether the number of
actions is larger than T^d. These bounds also hint at the possibility of getting a better dependence
on d in the stochastic regime if we can also establish GSB for these algorithms. Therefore, we ask:
can we achieve best-of-both-worlds bounds for linear bandits with the optimal d dependence?
An attempt is to try to show GSB for existing algorithms with optimal d dependence, including
the EXP2 algorithm (Bubeck et al., 2012) and the logdet-FTRL algorithm (Zimmert and Lattimore,
2022)1 . Based on our attempt, it is unclear how to adapt the analysis of logdet-FTRL to show
GSB. For EXP2, using the learning-rate tuning technique by Ito et al. (2022b), one can make it
satisfy a similar guarantee as GSB, albeit with an additional log(T) factor, resulting in a bound of
d log(|X|T) log(T)/∆ in the stochastic regime. This gives a sub-optimal rate of O(log² T). Therefore, we
further ask: can we achieve O(log T) regret in the stochastic regime with an optimal d dependence?
Motivated by these questions, we resort to approaches that do not rely on GSB, which appears
for the first time in the best-of-both-world literature. Our approach is in fact a general reduction —
it not only successfully achieves the desired bounds for linear bandits, but also provides a principled
way to convert an algorithm with a worst-case guarantee to one with a best-of-both-world guarantee
for general settings. In the next two subsections, we introduce our reduction techniques.
1. Another potential algorithm is the truncated continuous exponential weight algorithm by Ito et al. (2020), which
obtains optimal regret dependence on d, is computationally efficient, and can achieve first/second-order bounds.
However, the regret bound shown by Ito et al. (2020) has sub-optimal dependence on log T , making it impossible to
obtain an O(log T ) regret bound in the stochastic regime using the analysis framework of Ito et al. (2020).


Algorithm 1 BOBW via LSB algorithms

Input: LSB algorithm A
T_1 ← 0, T_0 ← −c_2 log(T).
x̂_1 ∼ unif(X).
t ← 1.
for k = 1, 2, . . . do
    Initialize A with candidate x̂_k.
    Set counters N_k(x) = 0 for all x ∈ X.
    for t = T_k + 1, T_k + 2, . . . do
        Play action A_t as suggested by A, and advance A by one step.
        N_k(A_t) ← N_k(A_t) + 1.
        if t − T_k ≥ 2(T_k − T_{k−1}) and ∃x ∈ X \ {x̂_k} such that N_k(x) ≥ (t − T_k)/2 then
            x̂_{k+1} ← x.
            T_{k+1} ← t.
            break.
        end
    end
end

4.2. First reduction: Best of Both Worlds → Local Self-Bounding


Our reduction approach relies on the base algorithm satisfying a weaker condition than GSB, defined
as follows:

Definition 4 (α-local-self-bounding condition, or α-LSB) We say an algorithm satisfies the α-
local-self-bounding condition if it takes a candidate action x̂ ∈ X as input and satisfies the following
pseudo-regret guarantee for any stopping time t′ ∈ [1, T]:
\[ \forall u\in X:\quad E\Big[\sum_{t=1}^{t'}(\ell_{t,A_t}-\ell_{t,u})\Big] \;\le\; \min\bigg\{ c_0^{1-\alpha}\, E[t']^{\alpha},\ (c_1\log T)^{1-\alpha}\, E\Big[\sum_{t=1}^{t'}\big(1-\mathbb{I}\{u=\hat x\}\, p_{t,u}\big)\Big]^{\alpha} \bigg\} + c_2\log T, \]
where c_0, c_1, c_2 are problem-dependent constants.

The difference between LSB in Definition 4 and GSB in Definition 1 is that LSB only requires the
self-bounding term ∑_t (1 − p_{t,u}) to appear when u is a particular action x̂ given as an input to the
algorithm; for all other actions, the worst-case bound suffices. A minor additional requirement is
that the pseudo-regret bound needs to hold for any stopping time (because our reduction may stop this
algorithm during the learning procedure), but this is relatively easy to satisfy — for all algorithms
in this paper, their regret bounds naturally hold for any stopping time without additional effort.
For linear bandits, we find that an adaptation of the logdet-FTRL algorithm (Zimmert and Lattimore,
2022) satisfies 1/2-LSB, as stated in the following lemma. The proof is provided in Appendix C.

Lemma 5 For d-dimensional linear bandits, by transforming the action set into
\( \big\{ \binom{x}{0} : x \in X \setminus \{\hat x\} \big\} \cup \big\{ \binom{\mathbf{0}}{1} \big\} \subset \mathbb{R}^{d+1} \),
Algorithm 4 satisfies 1/2-LSB with c_0 = O((d + 1)² log T) and c_1 = c_2 = O((d + 1)²).


With the LSB condition defined, we develop a reduction procedure (Algorithm 1) which turns
any algorithm with LSB into a best-of-both-world algorithm that has a similar guarantee as in Propo-
sition 2. The guarantee is formally stated in the next theorem.

Theorem 6 If A satisfies α-LSB, then the regret of Algorithm 1 with A as the base algorithm
is upper bounded by \( O\big(c_0^{1-\alpha} T^{\alpha} + c_2\log^2 T\big) \) in the adversarial regime and by
\( O\big(c_1\log(T)\,\Delta^{-\frac{\alpha}{1-\alpha}} + (c_1\log T)^{1-\alpha}(C\Delta^{-1})^{\alpha} + c_2\log(T)\log(C\Delta^{-1})\big) \)
in the corrupted stochastic regime.


See Appendix E.1 for a proof of Theorem 6. In particular, Theorem 6 together with Lemma 5
directly lead to a better best-of-both-world bound for linear bandits.

Corollary 7 Combining Algorithm 1 and Algorithm 4 results in a linear bandit algorithm with
\( O\big(d\sqrt{T}\log T\big) \) regret in the adversarial regime and \( O\Big(\frac{d^2\log T}{\Delta} + d\sqrt{\frac{C\log T}{\Delta}}\Big) \) regret in the corrupted
stochastic regime simultaneously.

Ideas of the first reduction Algorithm 1 proceeds in epochs. In epoch k, an action x̂_k ∈ X is
selected as the candidate (x̂_1 is drawn uniformly at random from X). The procedure simply executes
the base algorithm A with input x̂ = x̂_k, and monitors the number of draws of each action. If in
epoch k there exists some x ≠ x̂_k that is drawn more than half of the time, and the length of epoch k
already exceeds twice the length of epoch k − 1, then a new epoch is started with x̂_{k+1} set to x.
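The epoch logic is simple enough to write out directly; the following Python sketch is our own minimal illustration of Algorithm 1, assuming a hypothetical factory make_lsb(candidate) that returns a base LSB algorithm exposing a step() method which plays one round and returns the chosen action.

import math, random

def bobw_via_lsb(make_lsb, actions, T, c2):
    # First reduction (Algorithm 1): restart the LSB base algorithm with a new candidate
    # whenever some other action is played at least half the time and the current epoch
    # is at least twice as long as the previous one.
    T_prev, T_cur = -c2 * math.log(T), 0          # T_{k-1} and T_k
    candidate = random.choice(actions)            # hat x_1
    t = 0
    while t < T:
        base = make_lsb(candidate)                # initialize A with candidate hat x_k
        counts = {x: 0 for x in actions}          # N_k(x)
        while t < T:
            a = base.step()                       # play A_t as suggested by A
            counts[a] += 1
            t += 1
            epoch_len = t - T_cur
            if epoch_len >= 2 * (T_cur - T_prev):
                switch = [x for x in actions
                          if x != candidate and counts[x] >= epoch_len / 2]
                if switch:
                    candidate = switch[0]         # hat x_{k+1}
                    T_prev, T_cur = T_cur, t      # T_{k+1} <- t
                    break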
Here, we give a proof sketch for Theorem 6 with α = 1/2. It is straightforward in the adversarial
regime: Let τ_k = T_{k+1} − T_k be the length of epoch k. In each epoch k, the regret is upper bounded
by O(√(c_0 τ_k)) due to the LSB property of the base algorithm. Thus, the overall regret is upper
bounded by O(∑_k √(c_0 τ_k)) = O(√(c_0 ∑_k τ_k)) = O(√(c_0 T)) because τ_k ≥ 2τ_{k−1} (except for the last
epoch).
Next, we analyze the regret in the stochastic regime assuming c_2 = 0 and C = 0 (see the formal
proof for the general case). Let m = max{k : x̂_k ≠ x⋆} be the index of the last epoch in which x̂_k
is not x⋆. Below, we show that x̂_m ≠ x⋆ is drawn at least Ω(T_{m+1}) times. If τ_m > 2τ_{m−1}, then
by the termination condition of epoch m, we know that just one step before epoch m terminates,
x̂_m is drawn at least half of the time in epoch m, which is at least τ_m/2 − 1 = Ω(τ_m) = Ω(T_{m+1})
times. If τ_m ≤ 2τ_{m−1}, then by the termination condition of epoch m − 1, we know that x̂_m must
have been drawn τ_{m−1}/2 = Ω(T_{m+1}) times in epoch m − 1.
Notice that the regret up to the end of epoch m is upper bounded by O(√(c_1 T_{m+1} log T)), but
also lower bounded by Ω(T_{m+1}∆) by the argument above. This implies that T_{m+1} = O(c_1 log T / ∆²)
and thus the regret up to the end of epoch m is upper bounded by O(c_1 log T / ∆). Finally, if epoch m is
not the last epoch, then in the last epoch m + 1, it must be that x̂_{m+1} = x⋆. By the LSB property,
in the last epoch, the regret is upper bounded by O(√(c_1 ∑_{t=T_{m+1}+1}^T (1 − p_{t,x⋆}) log T)) and lower
bounded by ∑_{t=T_{m+1}+1}^T (1 − p_{t,x⋆})∆. This implies that ∑_{t=T_{m+1}+1}^T (1 − p_{t,x⋆}) = O(c_1 log T / ∆²) and
thus the regret is upper bounded by O(c_1 log T / ∆).

Theorem 6 is not sufficient to be considered a black-box reduction approach, since algorithms


with LSB are not common. Therefore, our next step is to present a more general reduction that
makes use of recent advances in Corral algorithms.


4.3. Second reduction: Local Self-Bounding → Importance-Weighting Stable


In this subsection, we show that one can achieve LSB using the idea of model selection. Specifically,
we will run a variant of the Corral algorithm (Agarwal et al., 2017) over two instances: one is x̂, and
the other is an importance-weighting-stable algorithm (see Definition 8) over the action set X \ {x̂}.
Here, we focus on the case of α = 1/2, which covers most standard settings where the worst-case
regret is √T; an example with α = 2/3 in the graph bandit setting is discussed in Section 5.
First, we introduce the notion of importance-weighting stability.

Definition 8 (1/2-iw-stable) An algorithm is 1/2-iw-stable (importance-weighting stable) if, given an
adaptive sequence of weights q_1, q_2, · · · ∈ (0, 1] and the assertion that the feedback in round t
is observed with probability q_t, it obtains the following pseudo-regret guarantee for any stopping
time t′:
\[ E\Big[\sum_{t=1}^{t'}(\ell_{t,A_t}-\ell_{t,u})\Big] \;\le\; E\Bigg[\sqrt{c_1\sum_{t=1}^{t'}\frac{1}{q_t}} \;+\; \frac{c_2}{\min_{t\le t'} q_t}\Bigg]. \]

Definition 8 requires that even if the algorithm only receives the desired feedback with probability
q_t in round t, it still achieves a meaningful regret bound that smoothly degrades with ∑_t 1/q_t. In
previous works on Corral and its variants (Agarwal et al., 2017; Foster et al., 2020), a similar notion
of 1/2-stability is defined as having a regret of \( \sqrt{c_1\big(\max_{\tau\le t'}\frac{1}{q_\tau}\big)\,t'} \), which is a weaker assumption
than our 1/2-iw-stability. Our stronger notion of stability is crucial to get a best-of-both-worlds bound,
but it is still very natural and holds for a wide range of algorithms.
Below, we provide two examples of 1/2-iw-stable algorithms. Their analysis is mostly standard
FTRL analysis (accounting for the extra importance weighting) and can be found in Appendix G.

Lemma 9 For linear bandits, EXP2 with adaptive learning rate (Algorithm 9) is 1/2-iw-stable with
constants c1 = c2 = O(d log |X |).

Lemma 10 For contextual bandits, EXP4 with adaptive learning rate (Algorithm 10) is 1/2-iw-stable
with constants c1 = O(K log |X |), c2 = 0.
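Both lemmas concern FTRL-style algorithms whose learning rate shrinks with the realized importance weights. As a rough illustration (our own sketch, not Algorithm 10 itself; the exact learning-rate schedule in Appendix G may differ), an EXP4-style update under the observation model of Definition 8 looks as follows. Here losses are assumed to lie in [0, 1], and advice[t] is a hypothetical (number of experts) × K matrix of expert recommendations.

import numpy as np

def exp4_iw_sketch(K, n_experts, T, advice, losses, q, rng):
    # Importance-weighted EXP4-style update: the loss estimate divides by q_t * p_t(A_t),
    # and the learning rate shrinks with the realized sum of 1/q_t, as Definition 8 requires.
    w = np.zeros(n_experts)                        # cumulative estimated expert losses
    inv_q_sum = 0.0
    for t in range(T):
        inv_q_sum += 1.0 / q[t]
        eta = np.sqrt(np.log(n_experts) / (K * inv_q_sum))
        pi = np.exp(-eta * (w - w.min()))
        pi /= pi.sum()                             # distribution over experts
        p = pi @ advice[t]                         # induced distribution over the K arms
        a = rng.choice(K, p=p)
        if rng.random() < q[t]:                    # feedback observed with probability q_t
            est = losses[t, a] / (q[t] * p[a])     # importance-weighted loss estimate
            w += advice[t][:, a] * est             # charge each expert by its probability of arm a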

Our reduction procedure is Algorithm 2, which turns a 1/2-iw-stable algorithm into one satisfying 1/2-LSB.
The guarantee is formalized in the next theorem, whose proof is in Appendix F.1.

Theorem 11 If B is a 1/2-iw-stable algorithm with constants (c_1, c_2), then Algorithm 2 with B as the
base algorithm satisfies 1/2-LSB with constants (c′_0, c′_1, c′_2) where c′_0 = c′_1 = O(c_1) and c′_2 = O(c_2).

Cascading Theorem 11 and Theorem 6, and combining them with Lemma 9 and Lemma 10,
respectively, we get the following corollaries:

Corollary 12 Combining Algorithm 1, Algorithm 2 and Algorithm 9 results in a linear bandit
algorithm with \( O\big(\sqrt{dT\log|X|}\big) \) regret in the adversarial regime and
\( O\Big(\frac{d\log|X|\log T}{\Delta} + \sqrt{\frac{d\log|X|\log T}{\Delta}\,C}\Big) \) regret in the corrupted stochastic regime simultaneously.


Algorithm 2 LSB via Corral (for α = 1/2)

Input: candidate action x̂, 1/2-iw-stable algorithm B over X \ {x̂} with constants c_1, c_2.
Define: \( \psi_t(q) = -\frac{2}{\eta_t}\sum_{i=1}^{2}\sqrt{q_i} + \frac{1}{\beta}\sum_{i=1}^{2}\ln\frac{1}{q_i} \).
B_0 = 0.
for t = 1, 2, . . . do
    Let
    \[ \bar q_t = \operatorname*{argmin}_{q\in\Delta_2}\Big\{\Big\langle q,\ \sum_{\tau=1}^{t-1} z_\tau - \binom{0}{B_{t-1}}\Big\rangle + \psi_t(q)\Big\}, \qquad q_t = \Big(1-\frac{1}{2t^2}\Big)\bar q_t + \frac{1}{4t^2}\mathbf{1}, \]
    with \( \eta_t = \frac{1}{\sqrt{t} + 8\sqrt{c_1}} \), \( \beta = \frac{1}{8c_2} \).
    Sample i_t ∼ q_t.
    if i_t = 1 then draw A_t = x̂ and observe ℓ_{t,A_t};
    else draw A_t according to base algorithm B and observe ℓ_{t,A_t};
    Define \( z_{t,i} = \frac{(\ell_{t,A_t}+1)\,\mathbb{I}\{i_t=i\}}{q_{t,i}} - 1 \) and
    \[ B_t = \sqrt{c_1\sum_{\tau=1}^{t}\frac{1}{q_{\tau,2}}} + \frac{c_2}{\min_{\tau\le t} q_{\tau,2}}. \]
end

Corollary 13 Combining Algorithm 1, Algorithm 2 and Algorithm 10 results in a contextual bandit
algorithm with \( O\big(\sqrt{KT\log|X|}\big) \) regret in the adversarial regime and
\( O\Big(\frac{K\log|X|\log T}{\Delta} + \sqrt{\frac{K\log|X|\log T}{\Delta}\,C}\Big) \) regret in the corrupted stochastic regime simultaneously,
where K is the number of arms.

Ideas of the second reduction Algorithm 2 is related to the Corral algorithm (Agarwal et al.,
2017; Foster et al., 2020; Luo et al., 2022). We use a special version which is essentially an FTRL
algorithm (with a hybrid 1/2-Tsallis entropy and log-barrier regularizer) over two base algorithms:
the candidate x̂, and an algorithm B operating on the reduced action set X \ {x̂}. For the base
algorithm B, the Corral algorithm adds a bonus term that upper bounds the regret of B under importance
weighting (i.e., the quantity B_t defined in Algorithm 6). In the regret analysis, the bonus creates a
negative term as well as a bonus overhead in the learner’s regret. The negative term can be used to
cancel the positive regret of B, and the analysis reduces to bounding the bonus overhead. Showing
that the bonus overhead is upper bounded by the order of \( \sqrt{c_1\log(T)\sum_{t=1}^T(1-p_{t,\hat x})} \) is the key to
establishing the LSB property.
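Because the meta-decision is only over two arms, the FTRL step in Algorithm 2 is a one-dimensional minimization and can be solved by direct search. The following Python sketch (our own illustration) computes q̄_t and q_t for one round from the accumulated loss estimates and the bonus B_{t−1}; the regularizer follows the reconstructed pseudocode above, and the fallback value of β when c_2 = 0 is our own assumption.

import math
import numpy as np

def corral_weights(z_sum, B_prev, t, c1, c2, grid=10**5):
    # One FTRL step of Algorithm 2: minimize <q, sum of z - (0, B_{t-1})> + psi_t(q)
    # over the 2-dimensional simplex, then clip toward uniform for stability.
    eta = 1.0 / (math.sqrt(t) + 8.0 * math.sqrt(c1))
    beta = 1.0 / (8.0 * c2) if c2 > 0 else 1.0       # assumption: any finite beta when c2 = 0
    shifted = np.array([z_sum[0], z_sum[1] - B_prev])
    q1 = np.linspace(1e-6, 1 - 1e-6, grid)           # candidate q = (q1, 1 - q1)
    q = np.stack([q1, 1 - q1], axis=1)
    psi = -(2.0 / eta) * np.sqrt(q).sum(axis=1) + (1.0 / beta) * np.log(1.0 / q).sum(axis=1)
    obj = q @ shifted + psi
    q_bar = q[np.argmin(obj)]
    return (1 - 1 / (2 * t * t)) * q_bar + 1 / (4 * t * t)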
Combining Algorithm 2 and Algorithm 1, we have the following general reduction: as long
as we have a 1/2-iw-stable algorithm over X \ {x̂} for any x̂ ∈ X, we have an algorithm with the
best-of-both-worlds guarantee when the optimal action is unique. Notice that 1/2-iw-stable algorithms
are quite common — usually an FTRL/OMD algorithm with an adaptive learning rate suffices.


The overall reduction is reminiscent of the G-COBE procedure by Wei et al. (2022), which
also runs model selection over x̂ and a base algorithm for X \ {x̂} (similar to Algorithm 2), and
dynamically restarts the procedure with increasing epoch lengths (similar to Algorithm 1). However,
besides being more complicated, G-COBE only guarantees a bound of \( \frac{\mathrm{polylog}(T)+C}{\Delta} \) in the corrupted
stochastic regime (omitting dependencies on c_1, c_2), which is sub-optimal in both C and T.2

5. Case Study: Graph Bandits


In this section, we show the power of our reduction by improving the state of the art of best-of-both-
worlds algorithms for bandits with graph feedback. In bandits with graph feedback, the learner is
given a directed feedback graph G = (V, E), where the nodes V = [K] correspond to the K arms
of the bandit. Instead of observing the loss of the played action A_t, the player observes the loss ℓ_{t,j}
iff (A_t, j) ∈ E. Learnable graphs are divided into strongly observable graphs and weakly observable
graphs. In the first case, every arm i ∈ [K] must either receive its own feedback, i.e. (i, i) ∈ E, or all
other arms do, i.e. ∀j ∈ [K] \ {i} : (j, i) ∈ E. In the weakly observable case, every arm i ∈ [K] must
be observable by at least one arm, i.e. ∃j ∈ [K] : (j, i) ∈ E. The following graph properties
characterize the difficulty of the two regimes. An independence set is any subset V′ ⊂ V such that no
edge exists between any two distinct nodes in V′, i.e. ∀i, j ∈ V′, i ≠ j we have (i, j) ∉ E and (j, i) ∉ E.
The independence number α is the size of the largest independence set. A related quantity is the weak
independence number α̃, which is the independence number of the subgraph (V, E′) obtained by
removing all one-sided edges, i.e. (i, j) ∈ E′ iff (i, j) ∈ E and (j, i) ∈ E. For undirected graphs, the
two notions coincide, but in general α̃ can be larger than α by up to a factor of K. Finally, a weakly
dominating set is a subset D ⊂ V such that every node in V is observable by some node in D, i.e.
∀j ∈ V ∃i ∈ D : (i, j) ∈ E. The weak domination number δ is the size of the smallest weakly
dominating set.
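For small graphs, the three quantities can be computed by brute force, which helps in comparing the bounds below; the following Python snippet (our own illustration) enumerates subsets, so it is only practical for small K, and it assumes the graph is observable when computing δ.

from itertools import combinations

def graph_numbers(K, edges):
    # edges: set of directed pairs (i, j) meaning that playing i reveals the loss of j.
    E = set(edges)
    def independent(S):
        return all((i, j) not in E and (j, i) not in E
                   for i in S for j in S if i != j)
    # alpha: largest set with no edge (in either direction) between distinct nodes.
    alpha = max(len(S) for r in range(K + 1)
                for S in combinations(range(K), r) if independent(S))
    # alpha_tilde: independence number after removing one-sided edges,
    # i.e. only pairs connected in both directions count as connected.
    E2 = {(i, j) for (i, j) in E if (j, i) in E}
    def independent2(S):
        return all((i, j) not in E2 for i in S for j in S if i != j)
    alpha_t = max(len(S) for r in range(K + 1)
                  for S in combinations(range(K), r) if independent2(S))
    # delta: smallest D such that every node is observed by some node in D.
    delta = min((len(D) for r in range(K + 1)
                 for D in combinations(range(K), r)
                 if all(any((i, j) in E for i in D) for j in range(K))), default=K)
    return alpha, alpha_t, delta

# Example: a 3-armed bandit where every arm observes only itself (standard bandit feedback)
# has alpha = alpha_tilde = delta = 3.
print(graph_numbers(3, {(i, i) for i in range(3)}))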
Alon et al. (2015) provide an almost tight characterization of the adversarial regime, providing
upper bounds of \( O(\sqrt{\alpha T\log T\log K}) \) and \( O(\sqrt{\tilde\alpha T\log K}) \) for the strongly observable case and
\( O((\delta\log K)^{1/3}\,T^{2/3}) \) for the weakly observable case, as well as matching lower bounds up to all log
factors. Zimmert and Lattimore (2019) have shown that the log(T) dependency can be avoided and
that hence α is the right notion of difficulty for strongly observable graphs even in the limit T → ∞
(though they pay for this with a larger log K dependency).
State-of-the-art best-of-both-worlds guarantees for bandits with graph feedback are derivations
of EXP3 and obtain either \( O\big(\frac{\tilde\alpha\log^2(T)}{\Delta}\big) \) or \( O\big(\frac{\alpha\log(T)\log(TK)}{\Delta}\big) \) regret for strongly observable graphs
and \( O\big(\frac{\delta\log^2(T)}{\Delta^2}\big) \) for weakly observable graphs (Rouyer et al., 2022; Ito et al., 2022b)3 . Our black-
box reduction leads directly to algorithms with optimal log(T) regret in the stochastic regime.

5.1. Strongly observable graphs


We begin by providing an example of a 1/2-iw-stable algorithm for bandits with strongly observable
graph feedback. The algorithm and the proof are in Appendix G.3.

2. However, Wei et al. (2022) handles a more general type of corruption.


3. Rouyer et al. (2022) obtains better bounds when the gaps are not uniform, while Ito et al. (2022b) can handle graphs
that are unions of strongly observable and weakly observable sub-graphs. We do not extend our analysis to the latter
case for the sake of simplicity.


Lemma 14 For bandits with strongly observable graph feedback, (1 − 1/log K)-Tsallis-INF
with adaptive learning rate (Algorithm 11) is 1/2-iw-stable with constants
c_1 = O(min{α̃, α log K} log K), c_2 = 0.
Before we apply the reduction, note that Algorithm 2 requires that the player observes the loss
ℓ_{t,A_t} when playing arm A_t, which is not necessarily the case for all strongly observable graphs.
However, this can be overcome by defining an observable surrogate loss that is used in Algorithm 2
instead. We explain how this works in detail in Section H and assume from now on that this technical
issue does not arise.
Cascading the two reduction steps, we immediately obtain the following.
Corollary 15 Combining Algorithm 1, Algorithm 2 and Algorithm 11 results in a graph
bandit algorithm with \( O\big(\sqrt{\min\{\tilde\alpha,\alpha\log K\}\,T\log K}\big) \) regret in the adversarial regime and
\( O\Big(\frac{\min\{\tilde\alpha,\alpha\log K\}\log T\log K}{\Delta} + \sqrt{\frac{\min\{\tilde\alpha,\alpha\log K\}\log T\log K}{\Delta}\,C}\Big) \) regret in the corrupted stochastic regime
simultaneously.

5.2. Weakly observable graphs


Weakly observable graphs motivate our general definition of LSB beyond α = 1/2. We first
need to define the equivalent of 1/2-iw-stability for this regime.
Definition 16 (2/3-iw-stable) An algorithm is 2/3-iw-stable (importance-weighting stable) if, given
an adaptive sequence of weights q_1, q_2, . . . and the assertion that the feedback in round t is observed
with probability q_t, it obtains the following pseudo-regret guarantee for any stopping time t′:
\[ E\Big[\sum_{t=1}^{t'}(\ell_{t,A_t}-\ell_{t,u})\Big] \;\le\; E\Bigg[ c_1^{1/3}\Big(\sum_{t=1}^{t'}\frac{1}{\sqrt{q_t}}\Big)^{2/3} + \max_{t\le t'}\frac{c_2}{q_t} \Bigg]. \]

An example of such an algorithm is given by the following lemma (see Appendix G.4 for the algo-
rithm and the proof).
Lemma 17 For bandits with weakly observable graph feedback, EXP3 with adaptive learning rate
(Algorithm 12) is 2/3-iw-stable with constants c_1 = c_2 = O(δ log K).
Similar to the 1/2 case, this allows for a general reduction to LSB algorithms.
Theorem 18 If B is a 2/3-iw-stable algorithm with constants (c_1, c_2), then Algorithm 5 with B
as the base algorithm satisfies 2/3-LSB with constants (c′_0, c′_1, c′_2) where c′_0 = c′_1 = O(c_1) and
c′_2 = O(c_1^{1/3} + c_2).
See Appendix F.2 for the proof. Algorithm 5 works almost the same as Algorithm 2, but we need to
adapt the bonus terms to the definition of 2/3-iw-stability. The major additional difference is that we do
not necessarily observe the loss of the action played and hence need to play exploratory actions with
probability γ_t in every round to estimate the performance difference between x̂ and B. A second
point to notice is that the base algorithm B uses the action set X \ {x̂}, but is still allowed to use
x̂ in its internal exploration policy. This is necessary because the sub-graph with one arm removed
might have a larger dominating number, or might even not be learnable at all. By allowing x̂ in the
internal exploration, we guarantee that the regret of the base algorithm is not larger than over the full
action set. Finally, cascading the lemma and the reduction, we obtain


Corollary 19 Combining Algorithm 1, Algorithm 5 and Algorithm 12 results in a graph bandit
algorithm with \( O\big((\delta\log K)^{1/3}\,T^{2/3}\big) \) regret in the adversarial regime and
\( O\Big(\frac{\delta\log K\log T}{\Delta^2} + \Big(\frac{C^2\,\delta\log K\log T}{\Delta^2}\Big)^{1/3}\Big) \) regret in the corrupted stochastic regime simultaneously.

6. More Adaptations
To demonstrate the versatility of our reduction framework, we provide two more adaptations. The
first demonstrates that our reduction can be easily adapted to obtain a data-dependent bound (i.e., a
first- or second-order bound) in the adversarial regime, provided that the base algorithm achieves
a corresponding data-dependent bound. The second eliminates the undesired requirement of the
Corral algorithm (Algorithm 2) that the base algorithm needs to operate in X \ {x̂} instead
of the more natural X. We show that under a stronger stability assumption, we can indeed just
use a base algorithm that operates in X. This is helpful for settings where excluding one
single action/policy in the algorithm is not straightforward (e.g., MDPs). Finally, we combine the
two techniques to obtain a first-order best-of-both-worlds bound for tabular MDPs.

6.1. Reduction with first- and second-order bounds


A first-order regret bound refers to a regret bound of order \( O\big(\sqrt{c_1\,\mathrm{polylog}(T)\,L_\star} + c_2\,\mathrm{polylog}(T)\big) \),
where \( L_\star = \min_{x\in X} E\big[\sum_{t=1}^T \ell_{t,x}\big] \) is the best action’s cumulative loss. This is meaningful under
the assumption that ℓ_{t,x} ≥ 0 for all t, x. A second-order regret bound refers to a bound of order
\( O\big(\sqrt{c_1\,\mathrm{polylog}(T)\,E\big[\sum_{t=1}^T(\ell_{t,A_t}-m_{t,A_t})^2\big]} + c_2\,\mathrm{polylog}(T)\big) \), where m_{t,x} is a loss predictor for
action x that is available to the learner at the beginning of round t. We refer the reader to Wei and
Luo (2018); Ito (2021b) for more discussions on data-dependent bounds.
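As a side remark (ours, not from the paper), the two notions are related when the trivial predictor m_t = 0 is used and the losses satisfy the first-order assumption together with the boundedness assumption of Section 3, i.e. ℓ_{t,x} ∈ [0, 1]:
\[ \sum_{t=1}^{T}(\ell_{t,A_t}-m_{t,A_t})^2 \;=\; \sum_{t=1}^{T}\ell_{t,A_t}^2 \;\le\; \sum_{t=1}^{T}\ell_{t,A_t}, \]
so a second-order bound implies a bound in terms of the learner’s own cumulative loss, which can in turn be converted into a bound in terms of L⋆ by solving the resulting quadratic inequality in the regret.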
We first define counterparts of the LSB condition and iw-stability condition with data-dependent
guarantees:
Definition 20 (1/2-dd-LSB) We say an algorithm satisfies 1/2-dd-LSB (data-dependent LSB) if it takes
a candidate action x̂ ∈ X as input and satisfies the following pseudo-regret guarantee for any
stopping time t′ ∈ [1, T]:
\[ \forall u\in X:\quad E\Big[\sum_{t=1}^{t'}(\ell_{t,A_t}-\ell_{t,u})\Big] \;\le\; \sqrt{(c_1\log T)\, E\Big[\sum_{t=1}^{t'}\Big(\sum_x p_{t,x}\xi_{t,x} - \mathbb{I}\{u=\hat x\}\, p_{t,u}^2\,\xi_{t,u}\Big)\Big]} + c_2\log T, \]
where c_1, c_2 are problem-dependent constants, ξ_{t,x} = (ℓ_{t,x} − m_{t,x})² in the second-order bound
case, and ξ_{t,x} = ℓ_{t,x} in the first-order bound case.
Definition 21 (1/2-dd-iw-stable) An algorithm is 1/2-dd-iw-stable (data-dependent iw-stable) if,
given an adaptive sequence of weights q_1, q_2, · · · ∈ (0, 1] and the assertion that the feedback in
round t is observed with probability q_t, it obtains the following pseudo-regret guarantee for any
stopping time t′:
\[ E\Big[\sum_{t=1}^{t'}(\ell_{t,A_t}-\ell_{t,u})\Big] \;\le\; \sqrt{c_1\, E\Big[\sum_{t=1}^{t'}\frac{\mathrm{upd}_t\cdot\xi_{t,A_t}}{q_t^2}\Big]} + E\Big[\frac{c_2}{\min_{t\le t'} q_t}\Big], \]


where updt = 1 if feedback is observed in round t and updt = 0 otherwise, and ξt,x is defined in the
same way as in Definition 20.
We can turn a dd-iw-stable algorithm into one with dd-LSB (see Appendix F.3 for the proof):
Theorem 22 If B is 1/2-dd-iw-stable with constants (c_1, c_2), then Algorithm 6 with B as the base
algorithm satisfies 1/2-dd-LSB with constants (c′_1, c′_2) where c′_1 = O(c_1) and c′_2 = O(√c_1 + c_2).
Then we turn an algorithm with dd-LSB into one with a data-dependent best-of-both-worlds guarantee
(see Appendix E.2 for the proof):
Theorem 23 If an algorithm A satisfies 1/2-dd-LSB, then the regret of Algorithm 1 with A as the
base algorithm is upper bounded by \( O\Big(\sqrt{c_1\, E\big[\sum_{t=1}^{T}\xi_{t,A_t}\big]\log^2 T} + c_2\log^2 T\Big) \) in the adversarial
regime and \( O\Big(\frac{c_1\log T}{\Delta} + \sqrt{\frac{c_1\log T}{\Delta}\,C} + c_2\log(T)\log(C\Delta^{-1})\Big) \) in the corrupted stochastic regime.

6.2. Achieving LSB without excluding x̂

Our reduction in Section 4.3 requires the base algorithm B to operate in the action set X \ {x̂}.
This is sometimes not easy to implement for structured problems where actions share common
components (e.g., MDPs or combinatorial bandits). To eliminate this requirement so that we can
simply use a base algorithm B that operates in the original action space X, we propose the following
stronger notion of iw-stability that we call strongly-iw-stable:
Definition 24 (1/2-strongly-iw-stable) An algorithm is 1/2-strongly-iw-stable if the following holds:
given an adaptive sequence of weights q_1, q_2, · · · ∈ (0, 1]^X and the assertion that the feedback in
round t is observed with probability q_t(x) if the algorithm chooses A_t = x, it obtains the following
pseudo-regret guarantee for any stopping time t′:
\[ E\Big[\sum_{t=1}^{t'}(\ell_{t,A_t}-\ell_{t,u})\Big] \;\le\; E\Bigg[\sqrt{c_1\sum_{t=1}^{t'}\frac{1}{q_t(A_t)}} + \frac{c_2}{\min_{t\le t'}\min_x q_t(x)}\Bigg]. \]

Compared with iw-stability defined in Definition 8, strong-iw-stability requires the bound to have
additional flexibility: if choosing action x results in a probability q_t(x) of observing the feedback,
the regret bound needs to adapt to ∑_t 1/q_t(A_t). For the class of FTRL/OMD algorithms, strong-
iw-stability holds if the stability term is bounded by a constant no matter what action is chosen.
Examples include log-barrier-OMD/FTRL for multi-armed bandits, SCRiBLe (Abernethy et al.,
2008) or a truncated version of continuous exponential weights (Ito et al., 2020) for linear bandits.
In fact, Upper-Confidence-Bound (UCB) algorithms also satisfy strong-iw-stability, though they are
mainly designed for the pure stochastic setting; however, we will need this property when we
design algorithms for adversarial MDPs, where the transition estimation part is done through UCB
approaches. This will be demonstrated in Section 6.3. With strong-iw-stability, the second reduction
only requires a base algorithm over X:
Theorem 25 If B is a 1/2-strongly-iw-stable algorithm with constants (c_1, c_2), then Algorithm 7
with B as the base algorithm satisfies 1/2-LSB with constants (c′_0, c′_1, c′_2) where c′_0 = c′_1 = O(c_1) and
c′_2 = O(√c_1 + c_2).
The proof is provided in Appendix F.4.


6.3. Application to MDPs


Finally, we combine the techniques developed in Section 6.1 and Section 6.2 to obtain a best-of-
both-world guarantee for tabular MDPs with a first-order regret bound in the adversarial regime. We
use Algorithm 13 as the base algorithm, which is adapted from the UCB-log-barrier Policy Search
algorithm by Lee et al. (2020) to satisfy both the data-dependent property (Definition 21) and the
strongly iw-stable property (Definition 24). The corral algorithm we use is Algorithm 8, which takes
a base algorithm with the dd-strongly-iw-stable property and turns it into a data-dependent best-of-
both-world algorithm. The details and notations are all provided in Appendix I. The guarantee of
the algorithm is formally stated in the next theorem.

Theorem 26 Combining Algorithm 1, Algorithm 8, and Algorithm 13 results in an MDP algorithm
with \( O\big(\sqrt{S^2AH\,L_\star\log^2(T)\,\iota^2} + S^5A^2\log^2(T)\,\iota^2\big) \) regret in the adversarial regime, and
\( O\Big(\frac{H^2S^2A\iota^2\log T}{\Delta} + \sqrt{\frac{H^2S^2A\iota^2\log T}{\Delta}\,C} + S^5A^2\iota^2\log(T)\log(C\Delta^{-1})\Big) \) regret in the corrupted stochastic
regime, where S is the number of states, A is the number of actions, H is the horizon, L⋆ is the
cumulative loss of the best policy, and ι = log(SAT).

7. Conclusion
We provided a general reduction from the best-of-both-worlds problem to a wide range of
FTRL/OMD algorithms, which improves the state of the art in several problem settings. We showed
the versatility of our approach by extending it to preserving data-dependent bounds.
Another potential application of our framework is partial monitoring, where one might improve
the log²(T) rates to log(T) for both the T^{1/2} and the T^{2/3} regimes using our respective reductions.
A weakness of our approach is the uniqueness requirement of the best action. While this as-
sumption is typical in the best-of-both-worlds literature, it is not merely an artifact of the analysis
for us due to the doubling procedure in the first reduction. Additionally, our reduction can only
obtain worst-case ∆ dependent bounds in the stochastic regime, which can be significantly weaker
than more refined notions of complexity.

References
Yasin Abbasi-Yadkori, Aldo Pacchiano, and My Phan. Regret balancing for bandit and rl model
selection. arXiv preprint arXiv:2006.05491, 2020.

Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algo-
rithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008,
2008.

Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of
bandit algorithms. In Conference on Learning Theory, pages 12–38. PMLR, 2017.

Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts:
A tale of domination and independence. Advances in Neural Information Processing Systems,
26, 2013.


Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback
graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35. PMLR, 2015.

Idan Amir, Guy Azov, Tomer Koren, and Roi Livni. Better best of both worlds bounds for bandits
with switching costs. In Advances in Neural Information Processing Systems, 2022.

Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. Corralling stochastic bandit algo-
rithms. In International Conference on Artificial Intelligence and Statistics, pages 2116–2124.
PMLR, 2021.

Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochas-
tic and adversarial bandits. In Conference on Learning Theory, pages 116–120. PMLR, 2016.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine learning, 47:235–256, 2002a.

Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multi-
armed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b.

Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial
bandits. In Conference on Learning Theory, pages 42–1. JMLR Workshop and Conference Pro-
ceedings, 2012.

Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online
linear optimization with bandit feedback. In Conference on Learning Theory, pages 41–1. JMLR
Workshop and Conference Proceedings, 2012.

Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, and Chen-Yu Wei. Improved path-length regret
bounds for bandits. In Conference On Learning Theory, pages 508–528. PMLR, 2019.

Cheng Chen, Canzhe Zhao, and Shuai Li. Simultaneously learning stochastic and adversarial ban-
dits under the position-based model. In Proceedings of the AAAI Conference on Artificial Intelli-
gence, volume 36, pages 6202–6210, 2022.

Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, and Manish
Purohit. Dynamic balancing for model selection in bandits and rl. In International Conference
on Machine Learning, pages 2276–2285. PMLR, 2021.

Liad Erez and Tomer Koren. Towards best-of-all-worlds online learning with feedback graphs.
Advances in Neural Information Processing Systems, 34:28511–28521, 2021.

Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification
in contextual bandits. Advances in Neural Information Processing Systems, 33:11478–11489,
2020.

Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In
Conference on Learning Theory, pages 176–196. PMLR, 2014.

Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with
adversarial corruptions. In Conference on Learning Theory, pages 1562–1578. PMLR, 2019.


Junya Honda, Shinji Ito, and Taira Tsuchiya. Follow-the-perturbed-leader achieves best-of-both-
worlds for bandit problems. In International Conference on Algorithmic Learning Theory.
PMLR, 2023.

Jiatai Huang, Yan Dai, and Longbo Huang. Adaptive best-of-both-worlds algorithm for heavy-
tailed multi-armed bandits. In International Conference on Machine Learning, pages 9173–9200.
PMLR, 2022.

Shinji Ito. On optimal robustness to adversarial corruption in online decision problems. Advances
in Neural Information Processing Systems, 34:7409–7420, 2021a.

Shinji Ito. Parameter-free multi-armed bandit algorithms with hybrid data-dependent regret bounds.
In Conference on Learning Theory, pages 2552–2583. PMLR, 2021b.

Shinji Ito, Shuichi Hirahara, Tasuku Soma, and Yuichi Yoshida. Tight first-and second-order regret
bounds for adversarial linear bandits. Advances in Neural Information Processing Systems, 33:
2028–2038, 2020.

Shinji Ito, Taira Tsuchiya, and Junya Honda. Adversarially robust multi-armed bandit algorithm
with variance-dependent regret bounds. In Conference on Learning Theory, pages 1421–1422.
PMLR, 2022a.

Shinji Ito, Taira Tsuchiya, and Junya Honda. Nearly optimal best-of-both-worlds algorithms for
online learning with feedback graphs. In Advances in Neural Information Processing Systems,
2022b.

Tiancheng Jin and Haipeng Luo. Simultaneously learning stochastic and adversarial episodic mdps
with known transition. Advances in neural information processing systems, 33:16557–16566,
2020.

Tiancheng Jin, Longbo Huang, and Haipeng Luo. The best of both worlds: stochastic and adversar-
ial episodic mdps with unknown transition. Advances in Neural Information Processing Systems,
34:20491–20502, 2021.

Fang Kong, Yichi Zhou, and Shuai Li. Simultaneously learning stochastic and adversarial bandits
with general graph feedback. In International Conference on Machine Learning, pages 11473–
11482. PMLR, 2022.

Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances
in applied mathematics, 6(1):4–22, 1985.

Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. Bias no more: high-probability
data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information
processing systems, 33:15522–15533, 2020.

Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang, and Xiaojin Zhang. Achieving near
instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simulta-
neously. In International Conference on Machine Learning, pages 6142–6151. PMLR, 2021.


Haipeng Luo. Homework 3 solution, introduction to online optimization/learning.
http://haipeng-luo.net/courses/CSCI659/2022_fall/homework/HW3_solutions.pdf, November 2022.

Haipeng Luo, Mengxiao Zhang, Peng Zhao, and Zhi-Hua Zhou. Corralling a larger band of bandits:
A case study on switching regret for linear bandits. In Conference on Learning Theory, pages
3635–3684. PMLR, 2022.

Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adver-
sarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of
Computing, pages 114–122, 2018.

Saeed Masoudian and Yevgeny Seldin. Improved analysis of the tsallis-inf algorithm in stochas-
tically constrained adversarial bandits and stochastic bandits with adversarial corruptions. In
Conference on Learning Theory, pages 3330–3350. PMLR, 2021.

Saeed Masoudian, Julian Zimmert, and Yevgeny Seldin. A best-of-both-worlds algorithm for ban-
dits with delayed feedback. In Advances in Neural Information Processing Systems, 2022.

Jaouad Mourtada and Stéphane Gaïffas. On the optimality of the hedge algorithm in the stochastic
regime. Journal of Machine Learning Research, 20:1–28, 2019.

Aldo Pacchiano, Christoph Dann, and Claudio Gentile. Best of both worlds model selection. In
Advances in Neural Information Processing Systems, 2022.

Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable se-
quences. Advances in Neural Information Processing Systems, 26, 2013.

Chloé Rouyer, Yevgeny Seldin, and Nicolo Cesa-Bianchi. An algorithm for stochastic and adver-
sarial bandits with switching costs. In International Conference on Machine Learning, pages
9127–9135. PMLR, 2021.

Chloé Rouyer, Dirk van der Hoeven, Nicolò Cesa-Bianchi, and Yevgeny Seldin. A near-optimal
best-of-both-worlds algorithm for online learning with feedback graphs. In Advances in Neural
Information Processing Systems, 2022.

Aadirupa Saha and Pierre Gaillard. Versatile dueling bandits: Best-of-both world analyses for
learning from relative preferences. In International Conference on Machine Learning, pages
19011–19026. PMLR, 2022.

Yevgeny Seldin and Gábor Lugosi. An improved parametrization and analysis of the exp3++ al-
gorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pages 1743–
1759. PMLR, 2017.

Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial
bandits. In International Conference on Machine Learning, pages 1287–1295. PMLR, 2014.

Taira Tsuchiya, Shinji Ito, and Junya Honda. Best-of-both-worlds algorithms for partial monitoring.
arXiv preprint arXiv:2207.14550, 2022.


Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Conference
On Learning Theory, pages 1263–1291. PMLR, 2018.

Chen-Yu Wei, Christoph Dann, and Julian Zimmert. A model selection approach for corruption
robust reinforcement learning. In International Conference on Algorithmic Learning Theory,
pages 1043–1096. PMLR, 2022.

Julian Zimmert and Tor Lattimore. Connections between mirror descent, thompson sampling and
the information ratio. Advances in Neural Information Processing Systems, 32, 2019.

Julian Zimmert and Tor Lattimore. Return of the bias: Almost minimax optimal high probability
bounds for adversarial linear bandits. In Conference on Learning Theory, pages 3285–3312.
PMLR, 2022.

Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits.
In The 22nd International Conference on Artificial Intelligence and Statistics, pages 467–475.
PMLR, 2019.

Julian Zimmert, Haipeng Luo, and Chen-Yu Wei. Beating stochastic and adversarial semi-bandits
optimally and simultaneously. In International Conference on Machine Learning, pages 7683–
7692. PMLR, 2019.


Appendices

A FTRL Analysis
B Analysis for Variance-Reduced SCRiBLe (Algorithm 3 / Theorem 3)
C Analysis for LSB Log-Determinant FTRL (Algorithm 4 / Lemma 5)
D Proof of Proposition 2
E Analysis for the First Reduction
  E.1 BOBW to LSB (Algorithm 1 / Theorem 6)
  E.2 dd-BOBW to dd-LSB (Algorithm 1 / Theorem 23)
F Analysis for the Second Reduction
  F.1 1/2-LSB to 1/2-iw-stable (Algorithm 2 / Theorem 11)
  F.2 2/3-LSB to 2/3-iw-stable (Algorithm 5 / Theorem 18)
  F.3 1/2-dd-LSB to 1/2-dd-iw-stable (Algorithm 6 / Theorem 22)
  F.4 1/2-LSB to 1/2-strongly-iw-stable (Algorithm 7 / Theorem 25)
  F.5 1/2-dd-LSB to 1/2-dd-strongly-iw-stable (Algorithm 8 / Lemma 42)
G Analysis for IW-Stable Algorithms
  G.1 EXP2 (Algorithm 9 / Lemma 9)
  G.2 EXP4 (Algorithm 10 / Lemma 10)
  G.3 (1 − 1/log(K))-Tsallis-INF (Algorithm 11 / Lemma 14)
  G.4 EXP3 for weakly observable graphs (Algorithm 12 / Lemma 17)
H Surrogate Loss for Strongly Observable Graph Problems
I Tabular MDP (Theorem 26)
  I.1 Base algorithm
  I.2 Corraling


Appendix A. FTRL Analysis


Lemma 27 The optimistic FTRL algorithm over a convex set Ω:
\[ p_t = \operatorname*{argmin}_{p\in\Omega}\Big\{\Big\langle p,\ \sum_{\tau=1}^{t-1}\ell_\tau + m_t\Big\rangle + \psi_t(p)\Big\} \]
guarantees the following for any t′:
\[ \sum_{t=1}^{t'}\langle p_t - u, \ell_t\rangle \;\le\; \psi_0(u) - \min_{p\in\Omega}\psi_0(p) + \sum_{t=1}^{t'}\big(\psi_t(u)-\psi_{t-1}(u)-\psi_t(p_t)+\psi_{t-1}(p_t)\big) + \sum_{t=1}^{t'}\underbrace{\max_{p\in\Omega}\big(\langle p_t-p,\ \ell_t-m_t\rangle - D_{\psi_t}(p,p_t)\big)}_{\text{stability}}. \]

Proof Let \( L_t \triangleq \sum_{\tau=1}^{t}\ell_\tau \). Define \( F_t(p) = \langle p, L_{t-1} + m_t\rangle + \psi_t(p) \) and \( G_t(p) = \langle p, L_t\rangle + \psi_t(p) \).
Therefore, p_t is the minimizer of F_t over Ω. Let p′_{t+1} be the minimizer of G_t over Ω. Then by the
first-order optimality condition, we have
\[ F_t(p_t) - G_t(p'_{t+1}) \le F_t(p'_{t+1}) - G_t(p'_{t+1}) - D_{\psi_t}(p'_{t+1}, p_t) = -\langle p'_{t+1}, \ell_t - m_t\rangle - D_{\psi_t}(p'_{t+1}, p_t). \tag{1} \]
By definition, we also have
\[ G_t(p'_{t+1}) - F_{t+1}(p_{t+1}) \le G_t(p_{t+1}) - F_{t+1}(p_{t+1}) = \psi_t(p_{t+1}) - \psi_{t+1}(p_{t+1}) - \langle p_{t+1}, m_{t+1}\rangle. \tag{2} \]
Thus,
\begin{align*}
\sum_{t=1}^{t'}\langle p_t, \ell_t\rangle
&\le \sum_{t=1}^{t'}\Big(\langle p_t, \ell_t\rangle - \langle p'_{t+1}, \ell_t - m_t\rangle - D_{\psi_t}(p'_{t+1}, p_t) + G_t(p'_{t+1}) - F_t(p_t)\Big) \qquad\text{(by (1))}\\
&= \sum_{t=1}^{t'}\Big(\langle p_t, \ell_t\rangle - \langle p'_{t+1}, \ell_t - m_t\rangle - D_{\psi_t}(p'_{t+1}, p_t) + G_{t-1}(p'_t) - F_t(p_t)\Big) + G_{t'}(p'_{t'+1}) - G_0(p'_1)\\
&\le \sum_{t=1}^{t'}\Big(\langle p_t - p'_{t+1}, \ell_t - m_t\rangle - D_{\psi_t}(p'_{t+1}, p_t) - \psi_t(p_t) + \psi_{t-1}(p_t)\Big) + G_{t'}(u) - \min_{p\in\Omega}\psi_0(p)\\
&\qquad\text{(by (2) shifted by one index, and using that $p'_{t'+1}$ is the minimizer of $G_{t'}$)}\\
&\le \sum_{t=1}^{t'}\Big(\max_{p\in\Omega}\{\langle p_t - p, \ell_t - m_t\rangle - D_{\psi_t}(p, p_t)\} - \psi_t(p_t) + \psi_{t-1}(p_t)\Big) + \sum_{t=1}^{t'}\langle u, \ell_t\rangle + \psi_{t'}(u) - \min_{p\in\Omega}\psi_0(p)\\
&= \sum_{t=1}^{t'}\max_{p\in\Omega}\{\langle p_t - p, \ell_t - m_t\rangle - D_{\psi_t}(p, p_t)\} + \sum_{t=1}^{t'}\big(\psi_t(u)-\psi_{t-1}(u)-\psi_t(p_t)+\psi_{t-1}(p_t)\big) + \sum_{t=1}^{t'}\langle u, \ell_t\rangle + \psi_0(u) - \min_{p\in\Omega}\psi_0(p).
\end{align*}
Re-arranging finishes the proof.

Lemma 28 Consider the optimistic FTRL algorithm with bonus b_t ≥ 0:
\[ p_t = \operatorname*{argmin}_{p\in\Omega}\Big\{\Big\langle p,\ \sum_{\tau=1}^{t-1}(\ell_\tau - b_\tau) + m_t\Big\rangle + \psi_t(p)\Big\} \]
with \( \psi_t(p) = \frac{1}{\eta_t}\psi^{Ts}(p) + \frac{1}{\beta}\psi^{Lo}(p) \), where \( \psi^{Ts}(p) = \frac{-1}{1-\alpha}\sum_i p_i^{\alpha} \) for some α ∈ (0, 1), \( \psi^{Lo}(p) = \sum_i\ln\frac{1}{p_i} \),
and η_t is non-increasing. We have
\begin{align*}
\sum_{t=1}^{t'}\langle p_t - u, \ell_t\rangle
&\le \frac{1}{1-\alpha}\cdot\frac{K^{1-\alpha}}{\eta_0}
+ \frac{1}{1-\alpha}\sum_{t=1}^{t'}\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)\Big(\sum_{i=1}^K p_{t,i}^{\alpha}-1\Big)
+ \frac{K\ln\frac{1}{\delta}}{\beta}
+ \sum_{t=1}^{t'}\frac{\eta_t}{\alpha}\sum_{i=1}^K p_{t,i}^{2-\alpha}(\ell^{Ts}_{t,i}-x_t)^2\\
&\quad + 2\beta\sum_{t=1}^{t'}\sum_{i=1}^K p_{t,i}^2(\ell^{Lo}_{t,i}-w_t)^2
+ \Big(1+\frac{1}{\alpha}\Big)\sum_{t=1}^{t'}\langle p_t, b_t\rangle
- \sum_{t=1}^{t'}\langle u, b_t\rangle
+ \delta\sum_{t=1}^{t'}\Big\langle -u+\tfrac{1}{K}\mathbf{1},\ \ell_t - b_t\Big\rangle
\end{align*}
for any δ ∈ (0, 1) and any \( \ell^{Ts}_t, b^{Ts}_t \in \mathbb{R}^K \), \( \ell^{Lo}_t, b^{Lo}_t \in \mathbb{R}^K \), \( x_t, w_t \in \mathbb{R} \) such that
\begin{align}
\ell^{Ts}_t + \ell^{Lo}_t &= \ell_t - m_t, \tag{3}\\
b^{Ts}_t + b^{Lo}_t &= b_t, \tag{4}\\
\eta_t\, p_{t,i}^{1-\alpha}(\ell^{Ts}_{t,i} - x_t) &\ge -\tfrac{1}{4}, \tag{5}\\
\beta\, p_{t,i}(\ell^{Lo}_{t,i} - w_t) &\ge -\tfrac{1}{4}, \tag{6}\\
0 \le \eta_t\, p_{t,i}^{1-\alpha} b^{Ts}_{t,i} &\le \tfrac{1}{4}, \tag{7}\\
0 \le \beta\, p_{t,i} b^{Lo}_{t,i} &\le \tfrac{1}{4} \tag{8}
\end{align}
for all t, i.


Proof Let \( u' = (1-\delta)u + \frac{\delta}{K}\mathbf{1} \). By Lemma 27, we have
\begin{align*}
\sum_{t=1}^{t'}\langle p_t - u', \ell_t - b_t\rangle
&\le \frac{1}{1-\alpha}\cdot\frac{1}{\eta_0}\sum_{i=1}^K p_{1,i}^{\alpha}
+ \frac{1}{1-\alpha}\sum_{t=1}^{t'}\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)\Big(\sum_{i=1}^K p_{t,i}^{\alpha} - \sum_{i=1}^K u_i'^{\alpha}\Big)
+ \frac{1}{\beta}\sum_{i=1}^K\ln\frac{p_{1,i}}{u_i'}\\
&\quad + \sum_{t=1}^{t'}\max_{p}\Big(\langle p_t - p,\ \ell_t - b_t - m_t\rangle - \frac{1}{\eta_t}D_{\psi^{Ts}}(p, p_t) - \frac{1}{\beta}D_{\psi^{Lo}}(p, p_t)\Big)\\
&\le \frac{1}{1-\alpha}\cdot\frac{K^{1-\alpha}}{\eta_0}
+ \frac{1}{1-\alpha}\sum_{t=1}^{t'}\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)\Big(\sum_{i=1}^K p_{t,i}^{\alpha}-1\Big)
+ \frac{K\ln\frac{1}{\delta}}{\beta}\\
&\quad + \sum_{t=1}^{t'}\underbrace{\max_{p}\Big(\langle p_t - p,\ \ell^{Ts}_t - x_t\mathbf{1}\rangle - \frac{1}{2\eta_t}D_{\psi^{Ts}}(p, p_t)\Big)}_{\text{stability-1}}
+ \sum_{t=1}^{t'}\underbrace{\max_{p}\Big(\langle p_t - p,\ -b^{Ts}_t\rangle - \frac{1}{2\eta_t}D_{\psi^{Ts}}(p, p_t)\Big)}_{\text{stability-2}}\\
&\quad + \sum_{t=1}^{t'}\underbrace{\max_{p}\Big(\langle p_t - p,\ \ell^{Lo}_t - w_t\mathbf{1}\rangle - \frac{1}{2\beta}D_{\psi^{Lo}}(p, p_t)\Big)}_{\text{stability-3}}
+ \sum_{t=1}^{t'}\underbrace{\max_{p}\Big(\langle p_t - p,\ -b^{Lo}_t\rangle - \frac{1}{2\beta}D_{\psi^{Lo}}(p, p_t)\Big)}_{\text{stability-4}},
\end{align*}
where in the last inequality we use ⟨p_t − p, 1⟩ = 0.
By Problem 1 in Luo (2022), we can bound stability-1 by \( \frac{\eta_t}{\alpha}\sum_{i=1}^K p_{t,i}^{2-\alpha}(\ell^{Ts}_{t,i}-x_t)^2 \) under
condition (5). Similarly, stability-2 can be upper bounded by \( \frac{\eta_t}{\alpha}\sum_{i=1}^K p_{t,i}^{2-\alpha}\,b^{Ts\,2}_{t,i} \le \frac{1}{\alpha}\sum_{i=1}^K p_{t,i}\,b^{Ts}_{t,i} \)
under condition (7). Using Lemma 30, we can bound stability-3 by \( 2\beta\sum_{i=1}^K p_{t,i}^2(\ell^{Lo}_{t,i}-w_t)^2 \)
under condition (6). Similarly, we can bound stability-4 by \( 2\beta\sum_{i=1}^K p_{t,i}^2\,b^{Lo\,2}_{t,i} \le \sum_{i=1}^K p_{t,i}\,b^{Lo}_{t,i} \)
under (8). Collecting all terms and using the definition of u′ finishes the proof.

Lemma 29 Consider the optimistic FTRL algorithm with bonus b_t ≥ 0:
\[ p_t = \operatorname*{argmin}_{p\in\Omega}\Big\{\Big\langle p,\ \sum_{\tau=1}^{t-1}(\ell_\tau - b_\tau) + m_t\Big\rangle + \psi_t(p)\Big\} \]
with \( \psi_t(p) = \frac{1}{\eta_t}\sum_i\ln\frac{1}{p_i} \), and η_t non-increasing. We have
\[
\sum_{t=1}^{t'}\langle p_t - u, \ell_t\rangle \;\le\; \frac{K\ln\frac{1}{\delta}}{\eta_{t'}} + 2\sum_{t=1}^{t'}\sum_{i=1}^K \eta_t\, p_{t,i}^2(\ell_{t,i}-m_{t,i}-x_t)^2
+ 2\sum_{t=1}^{t'}\langle p_t, b_t\rangle - \sum_{t=1}^{t'}\langle u, b_t\rangle + \delta\sum_{t=1}^{t'}\Big\langle -u + \tfrac{1}{K}\mathbf{1},\ \ell_t - b_t\Big\rangle
\]
for any δ ∈ (0, 1) and any x_t ∈ R if the following hold: η_t p_{t,i}(ℓ_{t,i} − m_{t,i} − x_t) ≥ −1/4 and
η_t p_{t,i} b_{t,i} ≤ 1/4 for all t, i.

Proof Let \( u' = (1-\delta)u + \frac{\delta}{K}\mathbf{1} \). By Lemma 27 and letting η_0 = ∞, we have
\begin{align*}
\sum_{t=1}^{t'}\langle p_t - u', \ell_t - b_t\rangle
&\le \sum_{t=1}^{t'}\sum_{i=1}^K \ln\frac{p_{t,i}}{u_i'}\Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}\Big) + \sum_{t=1}^{t'}\max_p\big(\langle p_t - p,\ \ell_t - b_t - m_t\rangle - D_{\psi_t}(p, p_t)\big)\\
&\le \frac{K\ln\frac{K}{\delta}}{\eta_{t'}} + \sum_{t=1}^{t'}\underbrace{\max_p\Big(\langle p_t - p,\ \ell_t - m_t - x_t\mathbf{1}\rangle - \frac{1}{2}D_{\psi_t}(p, p_t)\Big)}_{\text{stability-1}}
+ \sum_{t=1}^{t'}\underbrace{\max_p\Big(\langle p_t - p,\ -b_t\rangle - \frac{1}{2}D_{\psi_t}(p, p_t)\Big)}_{\text{stability-2}},
\end{align*}
where in the last inequality we use ⟨p_t − p, 1⟩ = 0.
Using Lemma 30, we can bound stability-1 by \( 2\eta_t\sum_{i=1}^K p_{t,i}^2(\ell_{t,i}-m_{t,i}-x_t)^2 \), and bound
stability-2 by \( 2\eta_t\sum_{i=1}^K p_{t,i}^2 b_{t,i}^2 \le \sum_{i=1}^K p_{t,i} b_{t,i} \) under the specified conditions. Collecting all terms
and using the definition of u′ finishes the proof.

Lemma 30 (Stability under log barrier) Let $\psi(p) = \frac{1}{\beta}\sum_i \ln\frac{1}{p_i}$, and let $\ell_t\in\mathbb{R}^K$ be such that $\beta p_{t,i}\,\ell_{t,i} \ge -\frac{1}{2}$. Then
\[
\max_{p\in\Delta([K])}\{\langle p_t - p, \ell_t\rangle - D_\psi(p, p_t)\} \le \sum_i \beta p_{t,i}^2 \ell_{t,i}^2.
\]

Proof
\[
\max_{p\in\Delta([K])}\{\langle p_t - p, \ell_t\rangle - D_\psi(p, p_t)\} \le \max_{q\in\mathbb{R}^K_+}\{\langle p_t - q, \ell_t\rangle - D_\psi(q, p_t)\}.
\]
Define $f(q) = \langle p_t - q, \ell_t\rangle - D_\psi(q, p_t)$. Let $q^\star$ be the solution of the last expression. Next, we verify that under the specified conditions, we have $\nabla f(q^\star) = 0$. It suffices to show that there exists $q\in\mathbb{R}^K_+$ such that $\nabla f(q) = 0$, since if such $q$ exists, then it must be the maximizer of $f$ and thus $q^\star = q$. We have
\[
[\nabla f(q)]_i = -\ell_{t,i} - [\nabla\psi(q)]_i + [\nabla\psi(p_t)]_i = -\ell_{t,i} + \frac{1}{\beta q_i} - \frac{1}{\beta p_{t,i}}.
\]


By the condition, we have $-\frac{1}{\beta p_{t,i}} - \ell_{t,i} < 0$ for all $i$, and so $\nabla f(q) = 0$ has a solution in $\mathbb{R}^K_+$, which is $q_i = \big(\frac{1}{p_{t,i}} + \beta\ell_{t,i}\big)^{-1}$.
Therefore, $\nabla f(q^\star) = -\ell_t - \nabla\psi(q^\star) + \nabla\psi(p_t) = 0$, and we have
\[
\max_{q\in\mathbb{R}^K_+}\{\langle p_t - q, \ell_t\rangle - D_\psi(q, p_t)\} = \langle p_t - q^\star, \nabla\psi(p_t) - \nabla\psi(q^\star)\rangle - D_\psi(q^\star, p_t) = D_\psi(p_t, q^\star).
\]
It remains to bound $D_\psi(p_t, q^\star)$, which by definition can be written as
\[
D_\psi(p_t, q^\star) = \sum_i \frac{1}{\beta}\, h\!\left(\frac{p_{t,i}}{q_i^\star}\right)
\]
where $h(x) = x - 1 - \ln(x)$. By the relation between $q_i^\star$ and $p_{t,i}$ we just derived, it holds that $\frac{p_{t,i}}{q_i^\star} = 1 + \beta p_{t,i}\ell_{t,i}$. By the fact that $\ln(1+x) \ge x - x^2$ for all $x \ge -\frac{1}{2}$, we have
\[
h\!\left(\frac{p_{t,i}}{q_i^\star}\right) = \beta p_{t,i}\ell_{t,i} - \ln(1 + \beta p_{t,i}\ell_{t,i}) \le \beta^2 p_{t,i}^2 \ell_{t,i}^2,
\]
which gives the desired bound.

Lemma 31 (Stability under negentropy) If $\psi$ is the negentropy $\psi(p) = \sum_i p_i\log p_i$ and $\ell_{t,i} > 0$, then for any $\eta > 0$ the stability is bounded by
\[
\max_{p\in\mathbb{R}^K_{\ge 0}}\ \langle p_t - p, \ell_t\rangle - \frac{1}{\eta}D_\psi(p, p_t) \le \frac{\eta}{2}\sum_i p_{t,i}\ell_{t,i}^2.
\]
If $\ell_t > -\frac{1}{\eta}$, then the stability is bounded by
\[
\max_{p\in\mathbb{R}^K_{\ge 0}}\ \langle p_t - p, \ell_t\rangle - \frac{1}{\eta}D_\psi(p, p_t) \le \eta\sum_i p_{t,i}\ell_{t,i}^2
\]
instead.

Proof Let $f_i(p_i) = (p_{t,i} - p_i)\ell_{t,i} - \frac{1}{\eta}\big(p_i(\log p_i - 1) - p_i\log p_{t,i} + p_{t,i}\big)$. Then we maximize $\sum_i f_i(p_i)$, which is upper bounded by maximizing the expression in each coordinate for $p_i \ge 0$. We have
\[
f_i'(p) = -\ell_{t,i} - \frac{1}{\eta}(\log p - \log p_{t,i}),
\]
and hence the maximum is obtained at $p^\star = p_{t,i}\exp(-\eta\ell_{t,i})$. Plugging this in leads to
\begin{align*}
f_i(p^\star) &= p_{t,i}\ell_{t,i}\big(1 - \exp(-\eta\ell_{t,i})\big) - \frac{1}{\eta}\Big({-p_{t,i}}\exp(-\eta\ell_{t,i})\eta\ell_{t,i} + p_{t,i}\big(1 - \exp(-\eta\ell_{t,i})\big)\Big)\\
&= p_{t,i}\ell_{t,i} - \frac{p_{t,i}}{\eta}\big(1 - \exp(-\eta\ell_{t,i})\big) \le p_{t,i}\ell_{t,i} - \frac{p_{t,i}}{\eta}\left(\eta\ell_{t,i} - \frac{1}{2}\eta^2\ell_{t,i}^2\right) = \frac{\eta}{2}p_{t,i}\ell_{t,i}^2
\end{align*}
for non-negative $\ell_t$, and
\[
f_i(p^\star) \le \eta p_{t,i}\ell_{t,i}^2
\]
for $\ell_{t,i} \ge -1/\eta$ respectively, where we used the bound $\exp(-x) \le 1 - x + \frac{x^2}{2}$, which holds for any $x \ge 0$, and $\exp(-x) \le 1 - x + x^2$, which holds for $x \ge -1$, respectively. Summing over all coordinates finishes the proof.

Lemma 32 (Stability of Tsallis-INF) For the potential $\psi(p) = -\sum_i \frac{p_i^\alpha}{\alpha(1-\alpha)}$, any $p_t\in\Delta([K])$, any non-negative loss $\ell_t \ge 0$ and any positive learning rate $\eta > 0$, we have
\[
\max_p\ \langle p_t - p, \ell_t\rangle - \frac{1}{\eta}D_\psi(p, p_t) \le \frac{\eta}{2}\sum_i p_{t,i}^{2-\alpha}\ell_{t,i}^2.
\]

Proof We upper bound this term by maximizing over $p \ge 0$ instead of $p\in\Delta([K])$. Since $\ell_t$ is non-negative, the optimal $p^\star$ satisfies $p_i^\star \le p_{t,i}$ in all components. We have
\[
D_\psi(p^\star, p_t) = \sum_{i=1}^K \int_{p_{t,i}}^{p_i^\star}\frac{p_{t,i}^{\alpha-1} - p^{\alpha-1}}{1-\alpha}\,dp
= \sum_{i=1}^K \int_{p_{t,i}}^{p_i^\star}\int_{p_{t,i}}^{p}\tilde p^{\alpha-2}\,d\tilde p\,dp
\ge \sum_{i=1}^K \int_{p_{t,i}}^{p_i^\star}\int_{p_{t,i}}^{p} p_{t,i}^{\alpha-2}\,d\tilde p\,dp
= \sum_{i=1}^K \frac{1}{2}(p_i^\star - p_{t,i})^2 p_{t,i}^{\alpha-2}.
\]
By the AM-GM inequality, we further have
\[
\max_p\ \langle p_t - p, \ell_t\rangle - \frac{1}{\eta}D_\psi(p, p_t) \le \max_p \sum_i\left((p_{t,i} - p_i)\ell_{t,i} - \frac{(p_i - p_{t,i})^2 p_{t,i}^{\alpha-2}}{2\eta}\right) \le \frac{\eta}{2}\sum_i p_{t,i}^{2-\alpha}\ell_{t,i}^2.
\]


Appendix B. Analysis for Variance-Reduced SCRiBLe (Algorithm 3 / Theorem 3)

Algorithm 3 VR-SCRiBLe
Define: Let $\psi(\cdot)$ be a $\nu$-self-concordant barrier of $\mathrm{conv}(\mathcal{X})\subset\mathbb{R}^d$.
for $t = 1, 2, \ldots$ do
    Receive $m_t\in\mathbb{R}^d$. Compute
    \[
    w_t = \operatorname*{argmin}_{w\in\mathrm{conv}(\mathcal{X})}\left\{\Big\langle w, \sum_{\tau=1}^{t-1}\widehat\ell_\tau + m_t\Big\rangle + \frac{1}{\eta_t}\psi(w)\right\} \tag{9}
    \]
    where
    \[
    \eta_t = \min\left\{\frac{1}{16d},\ \sqrt{\frac{\nu\log T}{\sum_{\tau=1}^{t-1}\big\|\widehat\ell_\tau - m_\tau\big\|^2_{H_\tau^{-1}}}}\right\}
    \]
    and $H_t = \nabla^2\psi(w_t)$.
    Sample $s_t$ uniformly from $\mathbb{S}_d$ (the unit sphere in $d$ dimensions).
    Define
    \[
    w_t^+ = w_t + H_t^{-\frac12}s_t, \qquad w_t^- = w_t - H_t^{-\frac12}s_t.
    \]
    Find distributions $q_t^+$ and $q_t^-$ over actions such that
    \[
    w_t^+ = \sum_{x\in\mathcal{X}} q^+_{t,x}\, x, \qquad w_t^- = \sum_{x\in\mathcal{X}} q^-_{t,x}\, x.
    \]
    Sample $A_t \sim q_t \triangleq \frac{q_t^+ + q_t^-}{2}$, receive $\ell_{t,A_t}\in[-1,1]$ with $\mathbb{E}[\ell_{t,x}] = \langle x, \ell_t\rangle$, and define
    \[
    \widehat\ell_t = d(\ell_{t,A_t} - m_{t,A_t})\left(\frac{q^+_{t,A_t} - q^-_{t,A_t}}{q^+_{t,A_t} + q^-_{t,A_t}}\right) H_t^{\frac12}s_t + m_t
    \]
    where $m_{t,x} := \langle x, m_t\rangle$.
end
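To make the per-round sampling and estimator construction of Algorithm 3 concrete, here is a minimal Python sketch of one round. The barrier Hessian hess_psi and the convex decomposition decompose_over_vertices (writing a point of conv(X) as a convex combination of actions) are assumed helpers supplied by the caller; all names are illustrative rather than taken from the paper.

import numpy as np

def vr_scrible_round(w_t, m_t, actions, hess_psi, decompose_over_vertices, sample_loss, rng):
    """One round of VR-SCRiBLe's sampling and loss-estimation step (sketch)."""
    d = w_t.shape[0]
    H = hess_psi(w_t)                                   # H_t = Hessian of the barrier at w_t
    evals, evecs = np.linalg.eigh(H)
    H_sqrt = (evecs * np.sqrt(evals)) @ evecs.T         # H_t^{1/2}
    H_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T     # H_t^{-1/2}
    s_t = rng.normal(size=d)
    s_t /= np.linalg.norm(s_t)                          # uniform direction on the unit sphere
    q_plus = decompose_over_vertices(w_t + H_inv_sqrt @ s_t, actions)
    q_minus = decompose_over_vertices(w_t - H_inv_sqrt @ s_t, actions)
    q_t = (q_plus + q_minus) / 2
    A_t = rng.choice(len(actions), p=q_t)
    loss = sample_loss(A_t)                             # realized loss in [-1, 1]
    m_A = actions[A_t] @ m_t
    ratio = (q_plus[A_t] - q_minus[A_t]) / (q_plus[A_t] + q_minus[A_t])
    ell_hat = d * (loss - m_A) * ratio * (H_sqrt @ s_t) + m_t
    return A_t, ell_hat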

Lemma 33 In Algorithm 3, we have $\mathbb{E}\big[\widehat\ell_t\big] = \ell_t$ and
\[
\mathbb{E}\left[\big\|\widehat\ell_t - m_t\big\|^2_{\nabla^{-2}\psi(w_t)}\right] \le d^2\,\mathbb{E}\left[\sum_{x\in\mathcal{X}}\min\{p_{t,x}, 1 - p_{t,x}\}(\ell_{t,x} - m_{t,x})^2\right],
\]
where $p_{t,x}$ is the probability of choosing action $x$ in round $t$.


Proof

+ −
" ! #
h i qt,A t
− qt,A t
1
E ℓbt = E d(ℓt,At − mt,At ) + −
2
Ht st + mt
qt,A t
+ qt,A t
+ −
" " ! # #
qt,A t
− qt,A t
1
= E dE (ℓt,At − mt,At ) + −
2
st Ht st + mt
qt,A t
+ qt,A t
(note that qt+ , qt− depend on st )

" " +
! # #
X qt,x − qt,x 1
= E dE qt,x (ℓt,x − mt,x ) + − st Ht2 st + mt
x
qt,x + qt,x

" " # +
# !
X − qt,x
1 qt,x
= E dE ⟨x, ℓt − mt ⟩ st Ht2 st + mt
x
2
⟨wt − wt− , ℓt − mt ⟩
  +  1 
2
= E dE st Ht st + mt
2
 1

− 12
= E d⟨H st , ℓt − mt ⟩Ht st + mt
2

 1 1

⊤ −2
= E dHt st st Ht (ℓt − mt ) + mt
2

= ℓt . (because E[st s⊤ 1
t ] = d I)

 !2 
+ − 2
qt,A − qt,A
 
2 1
E ℓbt − mt = E d2 (ℓt,At − mt,At )2 +
t

t 2
Ht st 
∇−2 ψ(wt ) qt,A t
+ qt,A t Ht−1
 !2 
+ −
qt,A − qt,A
= E d2 (ℓt,At − mt,At )2 +
t

t 
qt,A t
+ qt,A t
  !2 
+ −
qt,A − qt,A
= E  E d2 (ℓt,At − mt,At )2 +
t

t
st  
qt,A t
+ qt,A t


" " +
##
X qt,x − qt,x
≤E E qt,x d2 (ℓt,x − mt,x )2 + − st
x
qt,x + qt,x

For any $x$, we have $q_{t,x}\frac{q^+_{t,x} - q^-_{t,x}}{q^+_{t,x} + q^-_{t,x}} \le q_{t,x}$ and
\[
q_{t,x}\left|\frac{q^+_{t,x} - q^-_{t,x}}{q^+_{t,x} + q^-_{t,x}}\right| = \frac{|q^+_{t,x} - q^-_{t,x}|}{2} \le 1 - \frac{q^+_{t,x} + q^-_{t,x}}{2} = 1 - q_{t,x}.
\]


 
Therefore, we continue to bound $\mathbb{E}\big[\|\widehat\ell_t - m_t\|^2_{\nabla^{-2}\psi(w_t)}\big]$ by
\[
\mathbb{E}\left[\mathbb{E}\left[\sum_{x\in\mathcal{X}}\min\{q_{t,x}, 1 - q_{t,x}\}\, d^2(\ell_{t,x} - m_{t,x})^2 \,\Big|\, s_t\right]\right]
\le \mathbb{E}\left[\sum_{x\in\mathcal{X}}\min\{p_{t,x}, 1 - p_{t,x}\}\, d^2(\ell_{t,x} - m_{t,x})^2\right].
\]
($\mathbb{E}[\min(\cdot,\cdot)] \le \min(\mathbb{E}[\cdot], \mathbb{E}[\cdot])$ and $p_{t,x} = \mathbb{E}[\mathbb{E}[q_{t,x}\mid s_t]]$)

 
Lemma 34 If $\eta_t \le \frac{1}{16d}$, then $\max_w\ \langle w_t - w, \widehat\ell_t - m_t\rangle - \frac{1}{\eta_t}D_\psi(w, w_t) \le 8\eta_t\big\|\widehat\ell_t - m_t\big\|^2_{\nabla^{-2}\psi(w_t)}$.

Proof We first show that $\eta_t\|\widehat\ell_t - m_t\|_{\nabla^{-2}\psi(w_t)} \le \frac{1}{16}$. By the definition of $\widehat\ell_t$, we have
\[
\eta_t\big\|\widehat\ell_t - m_t\big\|_{\nabla^{-2}\psi(w_t)} \le \frac{1}{16d}\cdot d\,\big\|H_t^{\frac12}s_t\big\|_{H_t^{-1}} \le \frac{1}{16}.
\]
Define
\[
F(w) = \langle w_t - w, \widehat\ell_t - m_t\rangle - \frac{1}{\eta_t}D_\psi(w, w_t).
\]

Define $\lambda = \|\widehat\ell_t - m_t\|_{\nabla^{-2}\psi(w_t)}$. Let $w'$ be the maximizer of $F$. It suffices to show $\|w' - w_t\|_{\nabla^2\psi(w_t)} \le 8\eta_t\lambda$ because this leads to
\[
F(w') \le \|w' - w_t\|_{\nabla^2\psi(w_t)}\,\big\|\widehat\ell_t - m_t\big\|_{\nabla^{-2}\psi(w_t)} \le 8\eta_t\lambda^2.
\]

To show ∥w′ − wt ∥∇2 ψ(wt ) ≤ 8ηt λ, it suffices to show that for all u such that ∥u − wt ∥∇2 ψ(wt ) =
8ηt λ, F (u) ≤ 0. To see why this is the case, notice that F (wt ) = 0, and F (w′ ) ≥ 0 because w′ is
the maximizer of F . Therefore, if ∥w′ − wt ∥∇2 ψ(wt ) > 8ηt λ, then there exists u in the line segment
between wt and w′ with ∥u − wt ∥∇2 ψ(wt ) = 8ηt λ such that F (u) ≤ 0 ≤ min{F (wt ), F (w′ )},
contradicting that F is strictly concave.
Below, consider any u with ∥u − wt ∥∇2 ψ(wt ) = 8ηt λ. By Taylor expansion, there exists u′ in
the line segment between u and wt such that
1
F (u) ≤ ∥u − wt ∥∇2 ψ(wt ) ∥ℓbt − mt ∥∇−2 ψ(wt ) − ∥u − wt ∥2∇2 ψ(u′ ) .
2ηt

Because ψ is a self-concordant barrier and that ∥u′ − wt ∥∇2 ψ(wt ) ≤ ∥u − wt ∥∇2 ψ(wt ) = 8ηt λ ≤ 12 ,
we have ∇2 ψ(u′ ) ⪰ 14 ∇2 ψ(wt ). Continuing from the previous inequality,

1 (8ηt λ)2
F (u) ≤ ∥u − wt ∥∇2 ψ(wt ) ∥ℓbt − mt ∥∇−2 ψ(wt ) − ∥u − wt ∥2∇2 ψ(wt ) = 8ηt λ · λ − = 0.
8ηt 8ηt


This concludes the proof.

Proof [Proof of Theorem 3] By the standard analysis of optimistic-FTRL and Lemma 34, we have
for any u,
" T #
X
E (ℓt,At − ℓt,u )
t=1
T
" #!
ν log T X 2
≤O E + ηt ℓbt − mt
ηT ∇−2 ψ(wt )
t=1
v
u " T # 
u X 2
= O tν log T E ℓbt − mt + dν log(T ) . (by the tuning of ηt )
∇−2 ψ(wt )
t=1

(10)
In the adversarial regime, using Lemma 33, we continue to bound (10) by
 v u " T # 
u X
O dtν log T E (ℓt,At − mt,At )2 + dν log(T )
t=1

When losses are non-negative and mt = 0, we can further upper bound it by


 v u " T # 
u X
O dtν log T E ℓt,At + dν log(T ) .
t=1
Then solving the inequality for $\mathbb{E}\big[\sum_{t=1}^T \ell_{t,A_t}\big]$, we can further get the first-order regret bound of

" T #  v u " T # 
X u X
E (ℓt,At − ℓt,u ) ≤ O dtν log T E ℓt,u + dν log(T ) .
t=1 t=1

In the corrupted stochastic setting, using Lemma 33, we continue to bound (10) by
 vu   
u X T X
O dtν log(T )E 
 u (1 − pt,u )(ℓt,u − mt,u )2 + pt,x (ℓt,x − mt,x )2 

t=1 x̸=u
 v
u " T #
u X
≤ O dtν log(T )E (1 − pt,u )  .
t=1

Then by the self-bounding technique stated in Proposition 2, we can further bound the regret in the
corrupted stochastic setting by
r !
d2 ν log(T ) d2 ν log(T )
O + C .
∆ ∆


Appendix C. Analysis for LSB Log-Determinant FTRL (Algorithm 4 / Lemma 5)

Algorithm 4 LSB-logdet
Input: X (= { x0 }), x
b = 01 .
 

Define: H(α) := Ex∼α [xx⊤ ] , µα := Ex∼α [x].


Let πX be John’s exploration over X , i.e., πX is such that maxx∈X ∥x∥2H(πX )−1 ≤ d.
for t = 1, 2, . . . do
Let
( s )
1 log(T )
ηt = min ,
Pt−1 ,
4d τ =1 (1 − pτ,bx)
t−1
* +
X 1  
pt := argmin µα , ℓbτ − log det H(α) − µα µ⊤
α ,
α∈∆(X ∪{b
x}) ηt
τ =1
pet := (1 − ηt d)pt + ηt d((1 − pt,bx )πX + pt,bx πxb)

where πxb denotes the sampling distribution that picks x b with probability 1.
Sample an action At ∼ pet .
Construct loss estimator:
 −1
ℓbt (a) = a⊤ H(ept ) − µpet µ⊤

pet at − µpet ℓt (at ) .

end

We begin by using the following simplifying notation. For a distribution α ∈ ∆(X ∪ {b x}), we
αx
define αX as the restricted distribution over X such that αX ∝ α over X , i.e. αX ,x = 1−α x
b
for any
x ∈ X . We further define

H(α) = Ex∼α [xx⊤ ] , µα = Ex∼α [x],


HαV = H(α) − µα µ⊤
α,
G(α) = H(αX ), mα = µαX ,
GV
α = G(α) − mα m⊤
α .

Lemma 35 Assume α ∈ ∆(X ∪ {b x})) is such that GV α is of rank d − 1 (i.e. full rank over X ).
Then we have the following properties:
 

HαV = (1 − αxb) GV α + α x
b (m α − x
b )(m α − x
b ) ,
   
V −1 1 V + V + ⊤ ⊤ V + 2 1 ⊤
[Hα ] = [Gα ] + [Gα ] mα x b +x bmα [Gα ] + ∥mα ∥[GVα ]+ + x
bx ,
1 − αxb
b
αxb

where [GV +
α ] denotes the pseudo-inverse.


Proof The first identity is a simple algebraic identity

HαV = H(α) − µα µ⊤
α = (1 − αxb )G(α) + αx
bx b⊤ − ((1 − αxb)mα + αxbx
bx b)⊤
b)((1 − αxb)mα + αxbx
 
= (1 − αxb) GV b (mα − x
α + αx b)⊤ ,
b)(mα − x

b = 01 and ∀x ∈ X :

which holds for any α. For the second identity note that by the definition of x
⟨x, x
b⟩ = 0, we have GV b = [GV
αx
+ b = 0. Furthermore GV [GV ]+ = I − x
α] x α α b−1 due to the rank d − 1
bx
assumption. Multiplying the two matrices yields
   
1
 
V ⊤ V + V + ⊤ ⊤ V + 2 ⊤
Gα + αxb(mα − x b)(mα − x b) · [Gα ] + [Gα ] mα x b +x bmα [Gα ] + ∥mα ∥[GVα ]+ + x
bx b
αxb
   ⊤
⊤ ⊤ V + 2 V + 2 1
=I −x bx
b + mα x b + αxb(mα − x b) [Gα ] mα + ∥mα ∥[GVα ]+ x b − [Gα ] mα − ∥mα ∥[GVα ]+ + x
b
αxb
=I −x
bxb⊤ + mα x
b⊤ − (mα − x x⊤ = I
b)b

which implies for any x, y ∈ span(X )

⟨x, [GV +
α ] y⟩
⟨x, [HαV ]−1 y⟩ = , (11)
1 − αxb
⟨x, [HαV ]−1 (b
x − mα )⟩ = 0, (12)
1
x − mα ∥2[HαV ]−1 =
∥b . (13)
(1 − αxb)αxb

Lemma 36 Let π1 , π2 ∈ ∆(X ∪ {b


x}), then for any λ ∈ [0, 1]:
V
Hλπ1 +(1−λ)π2
⪰ λHπV1 .

Proof Simple algebra shows


V
Hλπ 1 +(1−λ)π2
= λHπV1 + (1 − λ)HπV2 + λ(1 − λ)(µπ1 − µπ2 )(µπ1 − µπ2 )⊤ .

Lemma 37 Let π ∈ ∆(X ) be arbitrary, and π̃ = (1 − ηt d)π + ηt dπX for (ηt d) ∈ (0, 1), then it
holds for any x ∈ X :
s
ηt d
∥mπ − mπ̃ ∥[GV ]+ ≤
π̃ 1 − ηt d
2
∥x − mπ̃ ∥[GV ]+ ≤ √
π̃ ηt


Proof We have

GV V V
e = ηt dGπX + (1 − ηt d)Gπ + ηt d(1 − ηt d)(mπ − mπX )(mπ − mπX ) ,
π

hence

∥mπ − mπ̃ ∥2[GV ]+ = (ηt d)2 ∥mπ − mπX ∥2 −1


π̃ [ηt dGVπX +(1−ηt d)GVπ +ηt d(1−ηt d)(mπ −mπX )(mπ −mπX )⊤ ]
ηt d
≤ (ηt d)2 ∥mπ − mπX ∥2 + = .
[ηt d(1−ηt d)(mπ −mπX )(mπ −mπX )⊤ ] 1 − ηt d
For the second inequality, we have

∥x − mπ̃ ∥[GV ]+ ≤ ∥x − mπX ∥[GV ]+ + ∥mπ̃ − mπX ∥[GV ]+


π̃ π̃ π̃
1   2
≤√ ∥x − mπX ∥[GV ]+ + ∥mπ̃ − mπX ∥[GV ]+ ≤ √ ,
ηt d πX πX ηt

where the last inequality uses that John’s exploration satisfies ∥x − mπX ∥2[GV ]+ ≤ d for all x ∈ X .
πX

Lemma 38 The Bregman divergence  between two distributions α, β over X ∪ {b


x} with respect to
the function F (α) = − log det Hα is bounded by
V

1 − αxb
D(α, β) ≥ Dlog (αxb, βxb) + Dlog (1 − αxb, 1 − βxb) + ∥mα − mβ ∥2[GV ]+
1 − βxb β

where Dlog is the Bregman divergence of − log(x).

Proof We begin by simplifying F (α). Note that


 1  1
⊤ 2 ⊤ 2
HαV = (1 − αxb) GV
α + α x
b x
b M GV
α + α x
b x
b ,
√ √
M = I + ( αxb[GV +
b)( αxb[GV
α ] mα − x
+
α ] mα − xb))⊤ − x b⊤ .
bx

M is a matrix with d − 2 eigenvalues of size 1, since it is the identity with two rank-1 updates. The
product of the remaining eigenvalues is given by considering the determinant of the 2×2 sub-matrix
[GV +
α ] mα
with coordinates ∥[G V ]+ m ∥ and x
b. We have that
α α

2
m⊤ V + + +

⊤ α [Gα ] [GV
α ] mα ⊤ [GV
α ] mα
det(M ) = x
b Mx b· +
M +
− x b M +
∥[GVα ] mα ∥ ∥[GVα ] mα ∥ ∥[GV
α ] mα ∥
 
2 2
+ +
= 1 · 1 + αxb [GV α ] mα − αxb [GVα ] mα = 1.

Hence we have

F (α) = − log(1 − αxb) − log(αxb) − log detd−1 ((1 − αxb)GV


α)


Where detd−1 is the determinant over the first d − 1 eigenvalues of the submatrix of the first (d − 1)
coordinates. The derivative term is given by
X
⟨α − β, ∇F (β)⟩ = (αx − βx ) ∥x − µβ ∥2[H V ]−1
β
x∈X ∪{b
x}
X
= αx ∥x − µβ ∥2[H V ]−1 − d
β
x∈X ∪{b
x}
 
X
= Tr  αx (x − µβ )(x − µβ )⊤ [HβV ]−1  − d
x∈X ∪{b
x}
  
= Tr HαV + (µα − µβ )(µα − µβ )⊤ [HβV ]−1 − d
 
= Tr HαV [HβV ]−1 + ∥µα − µβ ∥2[H V ]−1 − d.
β

The first term is


 
Tr HαV [HβV ]−1
 
= Tr (1 − αxb)GV α [H V −1
β ] x − mα ∥2[H V ]−1
+ αxb(1 − αxb) ∥b (by Lemma 35)
β
 
1 − αxb V V +  
= Trd−1 Gα [Gβ ] + αxb(1 − αxb) ∥b x − mβ ∥2[H V ]−1 + ∥mα − mβ ∥2[H V ]−1
1 − βxb β β

(by Lemma 35)


2
αxb(1 − αxb) αxb(1 − αxb) ∥mα − mβ ∥[GVβ ]+
 
1 − αxb V V +
= Trd−1 G [G ] + + . (by (13))
1 − βxb α β βxb(1 − βxb) 1 − βxb

The norm term is

∥µα − µβ ∥2[H V ]−1


β

b + (1 − αxb)mβ )∥2[H V ]−1 + ∥(αxbx


= ∥µα − (αxbx b + (1 − αxb)mβ ) − µβ ∥2[H V ]−1
β β
2
= (1 − αxb) ∥mα − mβ ∥2[H V ]−1 + (αxb − βxb) ∥b
x− 2
mβ ∥2[H V ]−1
β β

(1 − αxb)2 ∥mα − mβ ∥2[GV ]+ (αxb − βxb)2


β
= + .
1 − βxb βxb(1 − βxb)

Combining both terms

⟨α − β, ∇F (β)⟩
α (1 − αxb) + (αxb − βxb)2 1 − αxb
 
1 − αxb V V +
= Trd−1 Gα [Gβ ] + xb + ∥mα − mβ ∥2[GV ]+ − d
1 − βxb βxb(1 − βxb) 1 − βxb β
 
1 − αxb V V + α 1 − αxb 1 − αxb
= Trd−1 G [G ] + xb + −1+ ∥mα − mβ ∥2[GV ]+ − d .
1 − βxb α β βxb 1 − βxb 1 − βxb β


Combining everything,

D(α, β)
1 − αxb  
= Dlog (αxb, βxb) + Dlog (1 − αxb, 1 − βxb) + ∥mα − mβ ∥2[GV ]+ + Dd−1 (1 − αxb)GV
α , (1 − βx )GV
β
1 − βxb
b
β

1 − αxb
≥ Dlog (αxb, βxb) + Dlog (1 − αxb, 1 − βxb) + ∥mα − mβ ∥2[GV ]+ ,
1 − βxb β

where the last inequality follows from the positiveness of Bregman divergences.

Lemma 39 For any $b\in(0,1)$, any $x\in\mathbb{R}$ and $\eta \le \frac{1}{2|x|}$, it holds that
\[
\sup_{a\in[0,1]}\ |a - b|\,x - \frac{1}{\eta}D_{\log}(a, b) \le \eta b^2 x^2.
\]

Proof The statement is equivalent to
\[
\sup_{a\in[0,1]} f(a) = \sup_{a\in[0,1]}\ (b - a)x - \frac{1}{\eta}D_{\log}(a, b) \le \eta b^2 x^2,
\]
since $x$ can take positive or negative sign. The function is concave, so setting the derivative to 0 gives the optimal solution if that value lies in $(0, \infty)$:
\[
f'(a) = -x + \frac{1}{\eta}\left(\frac{1}{a} - \frac{1}{b}\right), \qquad a^\star = \frac{b}{1 + \eta b x}.
\]
Plugging this in leads to
\begin{align*}
f(a^\star) &= \frac{\eta b^2 x^2}{1 + \eta b x} - \frac{1}{\eta}\left(\log(1 + \eta b x) + \frac{1}{1 + \eta b x} - 1\right)\\
&\le \frac{\eta b^2 x^2}{1 + \eta b x} - \frac{1}{\eta}\left(\eta b x - \eta^2 b^2 x^2 - \frac{\eta b x}{1 + \eta b x}\right) = \eta b^2 x^2,
\end{align*}
where the last line uses $\log(1 + x) \ge x - x^2$ for any $x \ge -\frac{1}{2}$.
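As a quick numerical sanity check of Lemma 39 (not part of the proof), the following hypothetical Python snippet compares a grid supremum over $a$ with the bound $\eta b^2 x^2$ for random choices of $b$, $x$ and $\eta \le 1/(2|x|)$.

import numpy as np

rng = np.random.default_rng(0)

def d_log(a, b):
    # Bregman divergence of -log(x): D(a, b) = a/b - 1 - ln(a/b)
    return a / b - 1.0 - np.log(a / b)

for _ in range(1000):
    b = rng.uniform(1e-3, 1.0)                       # b in (0, 1)
    x = rng.uniform(-5.0, 5.0)                       # arbitrary real x
    eta = rng.uniform(0.01, 1.0) / (2 * abs(x))      # eta <= 1/(2|x|)
    a_grid = np.linspace(1e-6, 1.0, 20001)
    f = np.abs(a_grid - b) * x - d_log(a_grid, b) / eta
    assert f.max() <= eta * b ** 2 * x ** 2 + 1e-8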

Lemma 40 The stability term

1
stabt := sup ⟨µpt − µα , ℓbt ⟩ − D(α, pt ) ,
α∈∆(X ∪{b
x}) ηt

satisfies

Et [stabt ] = O((1 − pt,bx )ηt d) .


Proof We have

x + (1 − pt,bx )mpt − (1 − αxb)mα


µpt − µα = (pt,bx − αxb)b
x − mpet + mpet − mpt ) + (1 − αxb)(mpet − mα ) .
= (pt,bx − αxb)(b

Hence for At = y ∈ X :

⟨µpt − µα , ℓbt ⟩ = ⟨µpt − µα , [HpeVt ]−1 (y − µpet )⟩ℓt,At


x − mpet , [HpeVt ]−1 (y − µpet )⟩ℓt,At
= (pt,bx − αxb)⟨b
+ (pt,bx − αxb)⟨mpet − mpt , [HpeVt ]−1 (y − µpet )⟩ℓt,At
+ (1 − αxb)⟨mpet − mα , [HpeVt ]−1 (y − µpet )⟩ℓt,At
x − mpet , [HpeVt ]−1 (mpet − µpet )⟩ℓt,At
= (pt,bx − αxb)⟨b (By Eq. (12))
+ (pt,bx − αxb)⟨mpet − mpt , [HpeVt ]−1 (y − mpet )⟩ℓt,At
+ (1 − αxb)⟨mpet − mα , [HpeVt ]−1 (y − mpet )⟩ℓt,At
|αxb − pt,bx | |pt,bx − αxb|
≤ + |⟨mpet − mpt , [HpeVt ]−1 (y − mpet )⟩|
1 − pt,bx 1 − pt,bx
1 − αxb +
+ |⟨mpet − mα , [GVpet ] (y − mpet )⟩| (By Eq. (13) and Eq. (11))
1 − pt,bx
|pt,bx − αxb| √
≤ (1 + 2 d)
1 − pt,bx
4 1 − αxb
+ × ∥mpt − mα ∥[GV ]+ y − mpet ) [GV ]+
3 1 − pt,bx pt p
et

(Cauchy-Schwarz and Lemma 37, Lemma 36)

Equivalently for At = x
b,

⟨µpt − µα , ℓbt ⟩ = ⟨µpet − µα , [HpeVt ]−1 (b


x − µpet )⟩ℓt,At
x − mpet , [HpeVt ]−1 (b
= (pt,bx − αxb)⟨b x − µpet )⟩ℓbt,At
|pt,bx − αxb|
≤ .
pt,bx

Hence the stability term for At ∈ X , is bounded by


1
sup ⟨µpt − µα , ℓbt ⟩ − D(α, pt )
α∈∆(X ∪{b
x}) ηt
|αxb − pt,bx | √ 1
≤ sup (1 + 2 d) − D(1 − αxb, 1 − pt,bx )
α∈∆(X ∪{b x}) 1 − pt,b
x η t
 
1 − αxb 4 1 2
+ mα − mpet [GV ]+ y − mpet [GV ]+ − mα − mpet [GV ]+
1 − pt,bx 3 pet p
et ηt p
et

2 |αxb − pt,bx | √
≤ ηt y − mpet [GV ]+ + sup (1 + 2 d)
p
et αxb ∈[0,1] 1 − pt,bx


|αxb − pt,bx | 2 1
+ ηt y − mpet [GV ]+
− D(1 − αxb, 1 − pt,bx ) (AM-GM inequality)
1 − pt,bx p
e t ηt
2
≤ ηt y − mpet [GV ]+
p
et

|αxb − pt,bx | √ 1
+ sup (5 + 2 d) − D(1 − αxb, 1 − pt,bx ) (Lemma 37)
αxb ∈[0,1] 1 − pt,bx ηt
2
≤ ηt y − mpet [GV ]+
+ O(ηt d) . (Lemma 39)
p
et

Taking the expectation over y ∼ pet,X leads to

EAt ∼ept,X [stabt ] = O(ηt d)

b we have two cases. If pt,bx ≥ 12 , then by Lemma 39,


For At = x
!
|pt,bx − αxb| 1 1
− Dlog (1 − αxb, 1 − pt,bx ) ≤ O ηt (1 − pt,bx )2 × 2

stabt ≤ sup ≤ O ηt (1 − pt,bx ) .
αxb ∈[0,1] pt,bx ηt pt,bx

Otherwise, if $1 - p_{t,\hat x} \ge \frac{1}{2}$, then by Lemma 39,


!
|pt,bx − αxb| 1 1
stabt ≤ sup − Dlog (αxb, pt,bx ) ≤ O ηt p2t,bx × 2 ≤ O(ηt (1 − pt,bx )) .
αxb ∈[0,1] pt,bx ηt pt,bx

Finally we have

Et [stabt ] = (1 − pt,bx )EAt ∼pt,X [stabt ] + pt,bx EAt ∼πxb [stabt ] = O((1 − pt,bx )ηt d)

Proof [Proof of Lemma 5] By standard analysis of FTRL (Lemma 27) and Lemma 40, for any τ
and x
v 
Xτ h i d log T Xτ u τ
X
u
Et ⟨pt , ℓbt ⟩ − ℓbt (x) ≤ + O(ηt (1 − pt,bx )d) ≤ O td log T (1 − pet,bx ) ,
ητ
t=1 t=1 t=1

where the last inequality follows from the definition of learning rate and pet,bx = pt,bx . Additionally,
we have
h i
Et ⟨ept − pt , ℓbt ⟩ = ⟨e
pt − pt , ℓt ⟩ ≤ ηt d(1 − pt,bx ) ,

so that
v 
τ
X h i u τ
X
u
Et ⟨e
pt , ℓbt ⟩ − ℓbt (x) = O td log T (1 − pet,bx ) .
t=1 t=1

Taking expectations on both sides finishes the proof.


Appendix D. Proof of Proposition 2


Proof [Proof of Proposition 2] The guarantee in the adversarial regime follows directly from Definition 1. In the corrupted stochastic regime, if $x^\star \ne \operatorname*{argmin}_{u\in\mathcal{X}}\mathbb{E}\big[\sum_{t=1}^T\ell_{t,u}\big]$, then it holds that $T\Delta - C = \sum_{t=1}^T(\Delta - C_t) \le \sum_{t=1}^T\mathbb{E}[\ell_{t,u} - \ell_{t,x^\star}] < 0$ for some $u \ne x^\star$. Therefore, by Definition 1, we have that for any $u\in\mathcal{X}$,
" T #
X α
E (ℓt,At − ℓt,u ) ≤ (c1 log T )1−α T α + c2 log T ≤ (c1 log T )1−α C∆−1 + c2 log T.
t=1

Now, assume that x⋆ = argminu∈X E[ Tt=1 ℓt,u ]. By Definition 1, we have


P

" T # " T #α
X X
E (ℓt,At − ℓt,x⋆ ) ≤ (c1 log T )1−α E (1 − pt,x⋆ ) + c2 log T.
t=1 t=1

On the other hand,


" T # " T
#
X X
E (ℓt,At − ℓt,x⋆ ) ≥ ∆E (1 − pt,x⋆ ) − C.
t=1 t=1

Combining the two inequalities, we get


" T #
X
E (ℓt,At − ℓt,x⋆ )
t=1
T
" # " T #
X X
= sup (1 + λ)E (ℓt,At − ℓt,x⋆ ) − λE (ℓt,At − ℓt,x⋆ )
λ∈[0,1] t=1 t=1
" T #α " T
# !
X X
≤ sup 2(c1 log T )1−α E (1 − pt,x⋆ ) + 2c2 log T − λ ∆E (1 − pt,x⋆ ) − C
λ∈[0,1] t=1 t=1
 α

− 1−α 1−α −1 α
≤ O c1 log(T )∆ + (c1 log T ) (C∆ ) + c2 log T . (using Lemma 41)

Appendix E. Analysis for the First Reduction


E.1. BOBW to LSB (Algorithm 1 / Theorem 6)
Proof [Proof of Theorem 6] We use τk = Tk+1 −Tk to denote the length of epoch k, and let n be the
last epoch (define Tn+1 = T ). Also, we use Et [·] to denote expectation conditioned on all history
up to time t. We first consider the adversarial regime. By the definition of local-self-bounding
algorithms, we have for any u,
 
Tk+1
X
ETk  (ℓt,At − ℓt,u ) ≤ c01−α ETk [τk ]α + c2 log(T ),
t=Tk +1


which implies
 
Tk+1
X
E (ℓt,At − ℓt,u ) ≤ c01−α E[τk ]α + c2 log(T )
t=Tk +1

using the property E[xα ] ≤ E[x]α for x ∈ R+ and 0 < α < 1. Summing the bounds over
k = 1, 2, . . . , n, and using the fact that τk ≥ 2τk−1 for all k < n, we get

T
" #
X
E (ℓt,At − ℓt,u )
t=1
    
α α 1 1 T
≤ c1−α
0 E [τn ] + E [τn−1 ] 1 + α + 2α + · · · + c2 log(T ) log2
2 2 c2 log(T )
≤ O c1−α T α + c2 log2 (T ) .

0 (14)

The same analysis also gives

" T #
X
(ℓt,At − ℓt,u ) ≤ O (c1 log T )1−α T α + c2 log2 (T ) .

E (15)
t=1

Next, consider the corrupted stochastic regime. We first argue that it suffices to consider the regret comparator $x^\star$. This is because if $x^\star \ne \operatorname*{argmin}_{u\in\mathcal{X}}\mathbb{E}\big[\sum_{t=1}^T\ell_{t,u}\big]$, then it holds that $C \ge T\Delta$. Then the right-hand side of (15) is further upper bounded by

O (c1 log T )1−α (C∆−1 )α + c2 log(T ) log(C∆−1 ) ,




which fulfills the requirement for the stochastic regime. Below, we focus on bounding the pseudo-
regret with respect to x⋆ .
Let $m = \max\{k\in\mathbb{N} \mid \hat x_k \ne x^\star\}$. Notice that $m$ is either the last epoch (i.e., $m = n$) or the second last (i.e., $m = n-1$), because for any two consecutive epochs, at least one of them must have $\hat x_k \ne x^\star$. Below we show that $|\{t\in[T_{m+1}] \mid A_t \ne x^\star\}| \ge \frac{T_{m+1}}{8} - 2\tau_0$.
If $\tau_m > 2\tau_{m-1}$, by the fact that the termination condition was not triggered one round earlier, the number of plays $N_m(x^\star)$ in epoch $m$ is at most $\frac{\tau_m - 1}{2} + 1 = \frac{\tau_m + 1}{2}$; in other words, the number of times $\sum_{t=T_m+1}^{T_{m+1}}\mathbb{I}\{A_t \ne x^\star\}$ is at least $\frac{\tau_m - 1}{2}$. Further notice that because $\tau_m > 2\tau_{m-1} \ge 4\tau_{m-2} > \cdots$, we have $\frac{\tau_m - 1}{2} \ge \frac{1}{4}\sum_{k=1}^m\tau_k - \frac{1}{2} = \frac{T_{m+1}}{4} - \frac{1}{2}$.
Now consider the case $m > 1$ and $\tau_m \le 2\tau_{m-1}$. Recall that $\hat x_m$ is the action with $N_{m-1}(\hat x_m) \ge \frac{\tau_{m-1}}{2}$. This implies that $\sum_{t=T_{m-1}+1}^{T_m}\mathbb{I}\{A_t \ne x^\star\} \ge \frac{\tau_{m-1}}{2} \ge \frac{1}{2}\max\big\{\frac{\tau_m}{2}, \frac{1}{2}\sum_{k=1}^{m-1}\tau_k\big\} \ge \frac{1}{8}\sum_{k=1}^m\tau_k = \frac{T_{m+1}}{8}$.
Finally, consider the case $m = 1$ and $\tau_1 \le 2\tau_0$; then we have $T_2 - 2\tau_0 \le 0$ and the statement holds trivially.


The regret up to and including epoch m can be lower and upper bounded using the self-bounding
technique:
     
Tm+1 Tm+1 Tm+1
X X X
E (ℓt,At − ℓt,x⋆ ) = (1 + λ)E  (ℓt,At − ℓt,x⋆ ) − λE  (ℓt,At − ℓt,x⋆ )
t=1 t=1 t=1
(for 0 ≤ λ ≤ 1)
     
1−α α E[Tm+1 ] 1
≤ O (c1 log T ) E [Tm+1 ] + c2 log(T ) log −λ E [Tm+1 ] − c2 log(T ) ∆ − C
c2 log(T ) 8
(the first term is by a similar calculation as (14), but replacing T by Tm and c0 by c1 log T )
 α α 
≤ O c1 log(T )∆− 1−α + (c1 log T )1−α C∆−1 + c2 log(T ) log(C∆−1 )

where in the last inequality we use Lemma 41.


bn = x⋆ for the final epoch n. In this case, the regret
If m is not the last epoch, then it holds that x
in the final epoch is
" T # " T # " T #
X X X
E (ℓt,At − ℓt,x⋆ ) = (1 + λ)E (ℓt,At − ℓt,x⋆ ) − λE (ℓt,At − ℓt,x⋆ )
t=Tn +1 t=Tn +1 t=Tn +1
" T
#α ! " T
# !
X X
1−α
≤ O (c1 log T ) E (1 − pt,x⋆ ) −λ E (1 − pt,x⋆ ) ∆ − C + c2 log(T )
t=Tn +1 t=Tn +1
 α

− 1−α 1−α −1 α

≤ O c1 log(T )∆ + (c1 log T ) C∆ + c2 log(T ). (Lemma 41)

Lemma 41 For ∆ ∈ (0, 1] and c, c′ , X ≥ 1 and C ≥ 0, we have


 α
c′ + C
   
1 1−α α 1 ′ α C
min c X + c log X − λ(X∆ − C) ≤ c∆− 1−α + 2c1−α + 2c′ log 1 +
λ∈[0,1] 2 2 ∆ ∆

Proof If c1−α X α ≥ c′ log T , we have


1 1−α α 1 ′ α α
c X + c log X ≤ c1−α X α ≤ λX∆ + cλ− 1−α ∆− 1−α .
2 2
where the last inequality is by the weighted AM-GM inequality. Therefore,
 
1 1−α α 1 ′ α α
min c X + c log X − λ(X∆ − C) ≤ min cλ− 1−α ∆− 1−α + Cλ.
λ∈[0,1] 2 2 λ∈[0,1]

Choosing λ = min{1, c1−α C −(1−α) ∆−α }, we bound the last expression by


  − α  α
1−α −(1−α) −α
∆− 1−α + c1−α C α ∆−α
1−α
c max 1, c C ∆
 
α2 α
−α α 1−α
≤ c max 1, c C ∆ ∆− 1−α + c1−α C α ∆−α
α
≤ c∆− 1−α + 2c1−α C α ∆−α .


If c1−α X α ≤ c′ log T , we have


c′2
 
1 1−α α 1 ′ ′ ′
c X + c log X − λ(X∆ − C) ≤ c log X − λX∆ + λC ≤ c log 1 + 2 2 + λC
2 2 λ ∆
′2 √
where the last inequality is because if X ≥ λ2c∆2 then c′ log X − λX∆ ≤ c′ log X − c′ X < 0.
c′
Choosing λ = min{1, C }, we bound the last expression by c′ log(1+(c′2 +C 2 )/∆2 ) ≤ 2c′ log(1+
(c′ + C)/∆). Combining cases finishes the proof.

E.2. dd-BOBW to dd-LSB (Algorithm 1 / Theorem 23)


Proof [Proof of Theorem 23] In the adversarial regime, we have that the regret in each phase k is
bounded by
  v u  
Tk+1 u Tk+1
X X X
(ℓt,At − ℓt,u ) ≤ tc1 log(T )ETk  pt,x ξt,x  + c2 log T .
u
ETk 
t=Tk +1 t=Tk +1 x

We have maximally log T episodes, since the length doubles every time. Via Cauchy-Schwarz, we
get
  v u  
kX
max T k+1 u kX
max Tk+1
X X X p
(ℓt,At − ℓt,u ) ≤ tc1 log T pt,x ξt,x  log T + c2 log2 T .
u
ETk  ETk 
k=1 t=Tk +1 k=1 t=Tk +1 x

Taking the expectation on both sides and applying the tower rule of expectations finishes the bound for the
adversarial regime. For the stochastic regime, note that ξt,x ≤ 1 and hence
X
pt,x ξt,x − I{u = xb}p2t,u ξt,u
x
b}p2t,u
≤ 1 − I{u = x
= (1 − I{u = x
b}pt,u )(1 + I{u = x
b}pt,u ) ≤ 2(1 − I{u = x
b}pt,u ) .
dd-LSB implies regular LSB (up to a factor of 2) and hence the stochastic bound of regular LSB
applies.

Appendix F. Analysis for the Second Reduction


1
F.1. 2 -LSB to 12 -iw-stable (Algorithm 2 / Theorem 11)
Proof [Proof of Theorem 11] The per-step bonus $b_t = B_t - B_{t-1}$ is the sum of two terms:
\begin{align*}
b^{Ts}_t &= \sqrt{c_1\sum_{\tau=1}^{t}\frac{1}{q_{\tau,2}}} - \sqrt{c_1\sum_{\tau=1}^{t-1}\frac{1}{q_{\tau,2}}} \le \frac{c_1/q_{t,2}}{\sqrt{c_1\sum_{\tau=1}^{t}\frac{1}{q_{\tau,2}}}} \le \sqrt{\frac{c_1}{q_{t,2}}},\\
b^{Lo}_t &= \frac{c_2}{\min_{\tau\le t} q_{\tau,2}} - \frac{c_2}{\min_{\tau\le t-1} q_{\tau,2}} = \frac{c_2}{q_{t,2}}\left(1 - \frac{\min_{\tau\le t} q_{\tau,2}}{\min_{\tau\le t-1} q_{\tau,2}}\right).
\end{align*}


Since $\bar q_{t,2}/q_{t,2} \le 2$, using the inequalities above, we have
\begin{align}
\eta_t\sqrt{\bar q_{t,2}}\; b^{Ts}_t &\le \eta_t\sqrt{\frac{\bar q_{t,2}}{q_{t,2}}\,c_1} \le \eta_t\sqrt{2c_1} \le \frac{1}{4}, \tag{16}\\
\beta\,\bar q_{t,2}\; b^{Lo}_t &\le \beta\,\frac{\bar q_{t,2}}{q_{t,2}}\,c_2 \le 2\beta c_2 \le \frac{1}{4}. \tag{17}
\end{align}
By Lemma 28 and that $\bar q_{t,2}/q_{t,2} \le 2$, we have for any $u$,

t ′ t ′
X X
⟨qt − u, zt ⟩ ≤ ⟨qt − q̄t , zt ⟩
t=1 t=1
| {z }
term1
t √
′ t ′ t t ′ t ′ ′
qt,2 log t′ X
 
√ X 3
2
X X X
+O c1 + √ + + ηt min qt,i (zt,i − θt ) +
2 Ts
qt,2 bt + Lo
qt,2 bt − u2 bt .
t β |θt |≤1
t=1 | {z } t=1 t=1 t=1 t=1
| {z } term | {z } | {z } | {z }
3
term2 term4 term5 term6
(18)
We bound individual terms below:
" t′ # t′
!
X X 1
E[term1 ] = E ⟨qt − q̄t , zt ⟩ ≤ O = O(1).
t2
t=1 t=1

v  
u t′ √ 
uX
term2 ≤ O min t′ , t qt,2 log T  .
 
t=1

log t′
term3 = ≤ O (c2 log T ) .
β

" t ′ #
X 3
E [term4 ] = E ηt min qt,i (zt,i − θt )2
2
θt ∈[−1,1]
t=1
t′ 2
" #
X X 3
2
≤E ηt qt,i (zt,i − ℓt,At )
2

t=1 i=1
t′ 2
" #
X 1 X 2
=E ηt √ (I[it = i]ℓt,At − qt,i ℓt,At )
qt,i
t=1 i=1
" t′ 2  #
X X √ 3
≤E ηt qt,i (1 − qt,i )2 + (1 − qt,i )qt,i
2

t=1 i=1
t′
" #!
X √
≤O E ηt qt,2 ≤ O (E[term2 ]) .
t=1


 
t′ t′ c1
X X qt,2
term5 = qt,2 bTs
t ≤ qt,2  q P 
t=1 t=1 c1 tτ =1 1
qτ,2
t′ √1
√ X qt,2 √
= c1 qP × qt,2 (19)
t 1
t=1 τ =1 qτ,2
v v
u t′ 1 u t′
√ u X qt,2
uX
≤ c1 t Pt 1
t q t,2 (Cauchy-Schwarz)
t=1 τ =1 qτ,2 t=1
v !v
u t′ u t′
√ tu X 1 u X
≤ c1 1 + log t qt,2
q
t=1 t,2 t=1
v
u t′ 
u X
≤ O tc1 qt,2 log T  .
t=1

Pt′ √ Pt′ 1

Continuing from (19), we also have term5 ≤ t=1 qt,2 bt
Ts
≤ c1 t=1

t
≤ 2 c1 t′ because
qτ,2 ≤ 1.

t ′ t  ′  t  ′

X
Lo
X minτ ≤t qτ,2 X minτ ≤t−1 qτ,2
term6 = qt,2 bt = c2 1− ≤ c2 log ≤ O (c2 log T ) .
minτ ≤t−1 qτ,2 minτ ≤t qτ,2
t=1 t=1 t=1

hP ′ i
t
Using all bounds above in (18), we can bound E t=1 ⟨qt − u, zt ⟩ by

  v "   v 
u t′ # u t′
p u X  u X 1 c2
O min c1 E[t′ ], tc
1 qt,2 log T + c2 log T  −u2 E tc1 + .
  qt,2 mint≤t′ qt,2
t=1 t=1
| {z } | {z }
pos-term neg-term

hP ′ i
t
For comparator x
b, we choose u = e1 and bound E − ℓt,bx ) by the pos-term above. For
t=1 (ℓt,At
hP ′ i
t
comparator x ̸= x
b, we first choose u = e2 and upper bound E t=1 (ℓt,A t − ℓ t,At
e ) by pos-term−
hP ′
i
t
neg-term. On the other hand, by the 21 -iw-stable assumption, E (ℓ e − ℓt,x ) ≤ neg-term.
hP ′t=1 t,At i
t
Combining them, we get that for all x ̸= x b, we also have E t=1 (ℓ t,A t − ℓt,x ) ≤ pos-term.
Comparing the coefficients in pos-term with those in Definition 4, we see that Algorithm 2 satisfies
1 ′ ′ ′ ′ ′ ′
2 -LSB with constants (c0 , c1 , c2 ) where c0 = c1 = O(c1 ) and c2 = O(c2 ).


2
F.2. 3 -LSB to 23 -iw-stable (Algorithm 5 / Theorem 18)

Algorithm 5 LSB via Corral (for α = 23 )


Input: candidate action x b, 32 -iw-stable algorithm B over X \{b x} with constants c1 , c2 .
2
3 P 2 1 P 2 1
Define: ψt (q) = − ηt i=1 qi3 + β i=1 ln qi .
for t = 1, 2, . . . do
Let B generate an action A et .
Let
*  + 
t−1
 X 0 
q̄t = argmin q, zτ −   + ψt (q) , qt = (1 − γt )q̄t ,
q∈∆2  τ =1 Bt−1 
 
1 1 √ 23 1
3
where ηt = 2 1 , β = , and γt = max ηt qt,2 , ηt qt,2 .
t + 8c1
3 3 8c 2

Sample it ∼ q̄t .
if it = 1 then set Āt = x
b;
else set Āt = Aet ;
Sample jt ∼ γt .
if jt = 1 then draw a revealing action of Āt and observe ℓt,Āt ;
else draw At = Āt ;
ℓt,Āt I{it =i}I{jt =1}
Define zt,i = γt and

t
! 23
1
3
X 1 c2
Bt = c1 √ + .
qτ,2 minτ ≤t qτ,2
τ =1

end
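For intuition, here is a small numerical sketch (not the paper's implementation) of the two-point FTRL update used by Algorithm 5: the regularizer mixes a Tsallis term with a log-barrier as in the algorithm box, the accumulated bonus B_{t-1} is credited to the base algorithm's coordinate, and the argmin over the two-dimensional simplex is found by a one-dimensional grid search over q_2. All names and numbers are illustrative.

import numpy as np

def corral_update(cum_z, B_prev, eta_t, beta):
    """Compute qbar_t = argmin_{q in simplex_2} <q, L> + psi_t(q) (sketch of Algorithm 5's step).

    cum_z: length-2 array of cumulative loss estimates sum_{s<t} z_s;
    B_prev: cumulative bonus B_{t-1} subtracted from coordinate 2 (the base algorithm);
    eta_t, beta: learning rates of the Tsallis and log-barrier parts.
    """
    L = cum_z - np.array([0.0, B_prev])
    q2 = np.linspace(1e-6, 1 - 1e-6, 20001)
    q = np.stack([1 - q2, q2], axis=0)                # candidate points on the 2-simplex
    # psi_t(q) = -(3/eta_t) * sum_i q_i^{2/3} + (1/beta) * sum_i ln(1/q_i)
    psi = -(3.0 / eta_t) * (q ** (2.0 / 3.0)).sum(axis=0) + (1.0 / beta) * (-np.log(q)).sum(axis=0)
    obj = L[0] * q[0] + L[1] * q[1] + psi
    return q[:, np.argmin(obj)]

# example usage with hypothetical numbers
q_bar = corral_update(cum_z=np.array([3.2, 4.1]), B_prev=1.5, eta_t=0.05, beta=0.01)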

Proof [Proof of Theorem 18] The per-step bonus bt = Bt − Bt−1 is the sum of two terms:
 ! 23 ! 23 
t t−1 √1  1
Ts
1 X 1 X 1 1 qt,2 c1 3
bt = c13 
√ − √  ≤ c1
3
1 ≤ ,
qτ,2 qτ,2  Pt 1

3 qt,2
τ =1 τ =1 √
τ =1 qτ,2
 
c2 c2 c2 minτ ≤t qτ,2
bLo
t = − = 1− .
minτ ≤t qτ,2 minτ ≤t−1 qτ,2 qt,2 minτ ≤t−1 qτ,2
q̄t,2
Since qt,2 ≤ 2, using the inequalities above, we have
 1
1
Ts
q̄t,2 c1 3 1 1
ηt q̄t,2 bt ≤ ηt
3
≤ ηt (2c1 ) 3 ≤ . (20)
qt,2 4
q̄t,2 1
β q̄t,2 bLo
t ≤β c2 ≤ 2βc2 ≤ . (21)
qt,2 4


q̄t,2
By Lemma 28 and that qt,2 ≤ 2, we have for any u,

t ′ t ′
X X
⟨qt − u, zt ⟩ ≤ ⟨qt − q̄t , zt ⟩
t=1 t=1
| {z }
term1
2
t′ t′ t′ t′ t′
q 3
log t′ X
 1 X 4

t,2
X X X
2 Ts Lo
3
+ O c1 + 1 + + ηt min qt,i (zt,i − θt ) +
3
qt,2 bt + qt,2 bt − u2 bt
β |θt |≤1
t=1 t 3 t=1 t=1 t=1 t=1
| {z } |term {z } | {z } | {z } | {z }
3
term2 term4 term5 term6
(22)

We bound individual terms below:


 2 1 
t′ t′ t′
" # !
3 3
X X X qt,2 qt,2
E[term1 ] = E ⟨qt − q̄t , zt ⟩ ≤ O γt = O  1 + 2

t=1 t=1 t=1 t 3 t3
  
 t′ ! 23 
 2 X 1

′3
= O min t , qt (log T ) + log T  .
 3

 
 t=1 

term2 ≤ O (term1 )
log t′
term3 = ≤ O (c2 log T ) .
β

" t′ #
X 4
E [term4 ] = E ηt qt,2 (zt,2 − zt,1 )2
3

t=1
 4 
t′
" t′ #
3 X γ2
X qt,i t
≤ E ηt ≤E = E [O(term1 )] .
γt γt
t=1 t=1

 
′ ′
t t 1 √1
X X qt,2
term5 = qt,2 bTs
t ≤ qt,2 c1 3
 
Pt 1 
3
t=1 t=1 √1
τ =1 qτ,2

t ′ −1
1 X qt,26 2

 1 × qt,2
3 3
= c1 P (23)
t 3
t=1 √1
τ =1 qτ,2
 1 ! 23
′ ′
√1
t 3 t
1 X qt,2 X
≤ c1  3
Pt  qt,2 (Cauchy-Schwarz)
√1
t=1 τ =1 qτ,2 t=1


t ′ !! 13 t ′ ! 23
1 X 1 X
≤ c1 3
1 + log qt,2
qt,2
t=1 t=1
 
t′
! 23
 1 X 1
≤ O c13 qt,2 (log T ) 3  .
t=1

Pt′ 1 Pt′ 1 2
Continuing from (23), we also have term5 ≤ t=1 qt,2 bt
Ts
≤ c13 t=1
1
1 ≤ O(c13 t′ 3 ) because
t3
qτ,2 ≤ 1.
t ′ t  ′  t   ′
X
Lo
X minτ ≤t qτ,2 X minτ ≤t−1 qτ,2
term6 = qt,2 bt = c2 1− ≤ c2 log ≤ O (c2 log T ) .
minτ ≤t−1 qτ,2 minτ ≤t qτ,2
t=1 t=1 t=1
hP ′ i
t
Using all bounds above in (22), we can bound E t=1 ⟨qt − u, zt ⟩ by
     
 " t′ # 32  t ′ ! 23
 1 2 1 X 
 1 X 1 c2
O min c13 E[t′ ] 3 , (c1 log T ) 3 qt,2 + c2 log T  −u2 E c13 + .
  
  qt,2 mint≤t′ qt,2
 t=1  t=1
| {z } | {z }
pos-term neg-term
i
hP ′
t
For comparator x
b, we choose u = e1 and bound E − ℓt,bx ) by the pos-term above. For
t=1 (ℓt,At
hP ′ i
t
comparator x ̸= x
b, we first choose u = e2 and upper bound E t=1 (ℓt,A t − ℓ t,A
et ) by pos-term−
hP ′
i
t
neg-term. On the other hand, by the 21 -iw-stable assumption, E (ℓ e − ℓt,x ) ≤ neg-term.
hP ′ t=1 t,At i
t
Combining them, we get that for all x ̸= xb, we also have E t=1 (ℓt,At − ℓt,x ) ≤ pos-term.
Finally, notice that $\mathbb{E}_t[\mathbb{I}\{A_t \ne \hat x\}] \ge \bar q_{t,2}(1 - \gamma_t) = q_{t,2}$. This implies that Algorithm 5 is $\frac{2}{3}$-LSB with coefficients $(c_0', c_1', c_2')$ where $c_0' = c_1' = O(c_1)$ and $c_2' = O(c_1^{1/3} + c_2)$.

1
F.3. 2 -dd-LSB to 12 -dd-iw-stable (Algorithm 6 / Theorem 22)
Proof [Proof of Theorem 22] Define bt = Bt − Bt−1 . Notice that we have

ηt q̄t,2 bt ≤ 2ηt qt,2 bt


 
c1 ξt,At I[it =2]  
 2
qt,2
 1 1
≤ 2ηt qt,2  r  + 2c2 −
 Pt−1 ξτ,Aτ I[iτ =2]  minτ ≤t qτ,2 minτ ≤t−1 qτ,2
c1 τ =1 2
qτ,2
 
√ minτ ≤t qτ,2
≤ 2ηt c1 + 2ηt c2 1 −
minτ ≤t−1 qτ,2
1
≤ . (24)
4


Algorithm 6 dd-LSB via Corral (for α = 12 )


Input: candidate action x b, 21 -iw-stable algorithm B over X \{b x} with constant c.
P2
Define: ψ(q) = i=1 ln q1i . B0 = 0.
Define: For first-order bound, ξt,x = ℓt,x and mt,x = 0; for second-order bound, ξt,x = (ℓt,x −
mt,x )2 where mt,x is the loss predictor.
for t = 1, 2, . . . do
Let B generate an action A et (which is the action to be chosen if B is selected in this round).
Receive prediction mt,x for all x ∈ X , and set yt,1 = mt,bx and yt,2 = mt,Aet .
Let
*  + 
t−1
0
 
 X 1  1 1
q̄t = argmin q, zτ + yt −   + ψ(q) , qt = 1 − 2 q̄t + 2 1,
q∈∆2  Bt−1 η t  2t 4t
τ =1

t−1
!− 21
1 1 X
where ηt = (log T ) 2 (I[iτ = i] − qτ,i )2 ξτ,Aτ + (c1 + c22 ) log T .
4
τ =1

Sample it ∼ qt .
if it = 1 then draw At = xb and observe ℓt,At ;
else draw At = A et and observe ℓt,At ;
(ℓt,At −yt,i )I{it =i}
Define zt,i = qt,i + yt,i and
v
u t
u X ξt,At I[iτ = 2] c2
Bt = tc1 2 + .
q τ,2 minτ ≤t qτ,2
τ =1

end

q̄t,2
By Lemma 29 and that qt,2 ≤ 2, we have for any u,

t ′ t ′ 2
X log T X X
⟨qt − u, zt ⟩ ≤ O + ηt min 2
qt,i (zt,i − yt,i − θt )2
ηt′ |θ|≤1
t=1 | {z } |t=1 i=1
term1
{z }
term2
t′ t′ t′
!
X X X
+ ⟨qt − q̄t , zt ⟩ + qt,2 bt − u2 bt
t=1 t=1 t=1
| {z } | {z }
term3 term4


s 
Pt′ −1 P2
t=1 i=1 (I[it = i] − qt,i )2 ξ t,At + (c1 + c22 ) log T
term1 ≤ O  log T 
log T
v
u t′ 2 
uX X √
≤ O t (I[it = i] − qt,i )2 ξt,At log T + ( c1 + c2 ) log T  .
t=1 i=1

" t ′ 2  2 #
X X
2 (ℓt,At − mt,At )I[it = i]
E [term2 ] ≤ E ηt min qt,i −θ
θ qt,i
t=1 i=1
t′ 2
" #
X X
≤E ηt (I[it = i] − qt,i )2 (ℓt,At − mt,At )2 (choosing θ = ℓt,At − mt,At )
t=1 i=1
v
u t′ 2 
uX X
≤ E t (I[it = i] − qt,i )2 ξt,At log T  .
t=1 i=1

" t′ # " t′  #
X X 1 1
E [term3 ] = E ⟨qt − q̄t , Et [zt ]⟩ = E − 2 q̄t + 2 1, Et [zt ] ≤ O(1).
2t 4t
t=1 t=1

t′
X
term4 ≤ qt,2 bt
t=1
 
c1 ξt,At I[it =2]
t′  
X  2
qt,2 1 1 
= qt,2 
 r Pt ξ + c2 − 
τ,A τ I[i τ =2] min τ ≤t q τ,2 min τ ≤t−1 q τ,2

t=1 c1 τ =1 2
qτ,2
v
ξt,At I[it =2]
t′ u t′ 
u 
√ Xu 2
qt,2
q X minτ ≤t qτ,2
≤ c1 tP
ξτ,Aτ 1[iτ =2]
× ξt,At I[it = 2] + c2 1−
t min q τ ≤t−1 τ,2
t=1 τ =1 2
qτ,2 t=1
v
ξt,At I[it =2]
v

u ′ u t′
u t t  
√ uX 2
qt,2 uX X minτ ≤t−1 qτ,2
≤ c1 t ξτ,Aτ 1[iτ =2]
×t ξ t,At I[it = 2] + c2 log
P t minτ ≤t qτ,2
t=1 τ =1 2
qτ,2 t=1 t=1

(Cauchy-Schwarz)
v v
t′ u t′
u !
√ u X ξt,At I[it = 2] u X 1
≤ c1 1 + log
t
2
t ξt,At I[it = 2] + c2 log
qt,2 minτ ≤t′ qτ,2
t=1 t=1
v
u 
u X t′
≤ O tc1 ξt,At I[it = 2] log T + c2 log T  .
t=1


v
u t′ 
u X ξt,A I[it = 2] c2
t
term5 = −u2 Bt′ = −u2 tc1 2 + .
qt,2 mint≤t ′ qt,2
t=1

hP ′ i
t
Combining all inequalities above, we can bound E t=1 ⟨q t − u, zt ⟩ by
 v v  
u t′ 2 u t′
u XX u X √
O E tc1 (I[it = i] − qt,i )2 ξt,At log T + tc1 ξt,At I[it = 2] log T  + ( c1 + c2 ) log T 
t=1 i=1 t=1
| {z }
pos-term
(25)
vu t′ 
u X ξt,A I[it = 2] c2
t
− u2 tc1 2 + .
q t,2 mint≤t′ qt,2
t=1
| {z }
neg-term

Similar
hP ′ to the arguments i in the proofs of Theorem 11 and Theorem 18, we end up bounding
t
E t=1 (ℓt,At − ℓt,x ) by the pos-term above for all x ∈ X . Finally, we process pos-term. Ob-
serve that
" 2 #
X
Et (I[it = i] − qt,i )2 ξt,At
i=1
h i
= Et qt,1 (1 − qt,1 )2 ξt,bx + (1 − qt,1 )qt,1 2
ξt,Aet + qt,2 (1 − qt,2 )2 ξt,Aet + (1 − qt,2 )qt,2
2
ξt,bx
h i
2 2
= Et 2qt,1 qt,2 ξt,bx + 2qt,1 qt,2 ξt,Aet
 
X
2 2 
= 2qt,1 qt,2 ξt,bx + 2qt,1 pt,x ξt,x 
x̸=x
b
 
X
≤ 2pt,bx (1 − pt,bx )ξt,bx + 2  pt,x ξt,x 
x̸=x
b
!
X
=2 pt,x ξt,x − p2t,bx ξt,bx ,
x

and that
h i X X
Et ξt,Aet I[it = 2] = pt,bx ξt,x ≤ pt,x ξt,x − p2t,bx ξt,bx
x̸=x
b x

Thus, for any u ∈ X ,


"t ′ #
X
E (ℓt,At − ℓt,u )
t=1


v
u " t′ !# 
u X X √
≤ E [pos-term] ≤ O tc1 E pt,x ξt,x − p2t,bx ξt,bx log T + ( c1 + c2 ) log T  ,
t=1 x

which implies that the algorithm satisfies 12 -dd-LSB with constants (O(c1 ), O( c1 + c2 )).

1
F.4. 2 -LSB to 12 -strongly-iw-stable (Algorithm 7 / Theorem 25)

Algorithm 7 LSB via Corral (for α = 12 , using a 21 -strongly-iw-stable algorithm)


Input: candidate action x b, 21 -iw-stable algorithm B over X with constant c.
−2 P2 √ 1 P2 1
Define: ψt (q) = ηt i=1 qi + β i=1 ln qi .
B0 = 0.
for t = 1, 2, . . . do
Let B generate an action A et (which is the action to be chosen if B is selected in this round).
Let
*  + 
t−1
0
 
 X  1 1
q̄t = argmin q, zτ −   + ψt (q) , qt = 1 − 2 q̄t + 2 1,
q∈∆2  Bt−1  2t 4t
τ =1
1 1
where ηt = qP , β= .
t e ̸= x √ 8c2
τ =1 I{Aτ b} + 8 c1

Sample it ∼ qt .
if it = 1 then draw At = xb and observe ℓt,At ;
else draw At = At and observe ℓt,At ;
e
ℓ I{i =i}
Define zt,i = t,At t I{A
qt,i
et ̸= x
b} and
v n o
u
u X
u t I Aet ̸
= x
b 1
Bt = tc1 + c2 max .
qτ,2 τ ≤t qτ,2
τ =1

end

Proof [Proof of Theorem 25] The proof of this theorem mostly follows that of Theorem 11. The
difference is that the regret of the base algorithm B is now bounded by
v v
u t′ u t′
u X 1 c2 log T u X I{A et ̸= x
b} p ′ c2 log T
tc
1 + ≤ tc1 + c1 t +
qt,2 + qt,1 I{A
t=1
et = x b} mint≤t′ qt,2 t=1
qt,2 mint≤t′ qt,2
(26)

because when A et = xb, the base algorithm B is able to receive feedback no matter which side the
Corral algorithm chooses. The goal of adding bonus is now only to cancel the first term and the
third term on the right-hand side of (26).


Similar to Eq. (18), we have

t ′ t ′
X X
⟨qt − u, zt ⟩ ≤ ⟨qt − q̄t , zt ⟩
t=1 t=1
| {z }
term1
t ′ √ et ̸= x t′ t′ t′
qt,2 I{A b}
 
√ X log T X 3
2
X
Ts
X
Lo
+O c1 + qP + + ηt min qt,i (zt,i − θt ) +
2
qt,2 bt + qt,2 bt
t eτ ̸= x β |θt |≤1
t=1 τ =1 I{ A b } | {z } t=1 t=1 t=1
| {z } term3 | {z
term
} | {z } | {z }
term term
4 5 6
term2
t ′
X
− u2 bt (27)
t=1

where

v v
c1 I{A et ̸=xb}
u t eτ ̸= x
u t−1
u X I{A b} u X I{A eτ ̸= x
b} c1
r
Ts qt,2
bt = c1
t − c1
t ≤r ≤ ,
qτ,2 qτ,2 t I{A ̸
= x } qt,2
τ =1 τ =1
P eτ
c1 τ =1 qτ,2 b

 
Lo
c2 c2 c2 minτ ≤t qτ,2
bt = − = 1−
minτ ≤t qτ,2 minτ ≤t−1 qτ,2 qt,2 minτ ≤t−1 qτ,2

√ 1 1
t ≤
satisfying ηt q̄t,2 bTs 2 t ≤
and β q̄t,2 bLo 2 as in (20) and (21).
Below, we bound term1 , . . . , term6 .

" t′ # t ′ !
X X 1
E[term1 ] = E ⟨qt − q̄t , zt ⟩ ≤ O = O(1).
t2
t=1 t=1

v  v 
u
uXt′ u t′
uX 
term2 ≤ O min
 t I{A ̸ x
et = b}, t et ̸= x
qt,2 I{A b} log T 
 
t=1 t=1
log T
term3 = ≤ O (c2 log T ) .
β


" t′ #
X 3
E [term4 ] = E ηt min qt,i (zt,i − θt )2
2
θt ∈[−1,1]
t=1
" t′ 2
#
X X 3
2
≤E ηt qt,i (zt,i − ℓt,At ) I{A
2 et ̸= x
b} (when A
et = x
b, zt,1 = zt,2 = 0)
t=1 i=1
" t′ 2
#
X 1 X 2
=E ηt √ (I[it = i]ℓt,At − qt,i ℓt,At ) I{A et ̸= x
b}
qt,i
t=1 i=1
" t′ 2   #
X X √ 3
≤E ηt qt,i (1 − qt,i )2 + (1 − qt,i )qt,i
2
I{Aet ̸= x
b}
t=1 i=1
" t′ q #
X
≤E et ̸= x
ηt qt,2 I{A b} ≤ O (E[term2 ]) .
t=1

t ′
X
term5 + term6 = qt,2 (bTs Lo
t + bt )
t=1
 
c1 I{Aet ̸=x
b}
t′  
X  qt,2 1 1 
= qt,2 
r + c2 − 
Pt eτ ̸=x
I{A b} minτ ≤t qτ,2 minτ ≤t−1 qτ,2 
t=1 c1 τ =1 qτ,2

′ e ̸=x
I{A b}
t √t t′  
√ minτ ≤t qτ,2
q
X qt,2 X
= c1 r × qt,2 I{At ̸= x
e b} + c2 1−
Pt eτ ̸=x
I{A b} minτ ≤t−1 qτ,2
t=1 t=1
τ =1 qτ,2
(28)
v v
et ̸=x
I{A b} t′
u ′ u t′
u t  
√ uX qt,2
uX X minτ ≤t−1 qτ,2
≤ c1 t t q t,2 I{Aet ̸
= x
b } + c2 log
P t I{Aeτ ̸=x
b} minτ ≤t qτ,2
t=1 τ =1 qτ,2 t=1 t=1

(Cauchy-Schwarz)
v v
t′ u t′
u !
√ u X I{A et ̸= x
b} u X 1
≤ c1 1 + log
t t et ̸= x
qt,2 I{A b} + c2 log
qt,2 minτ ≤t′ qτ,2
t=1 t=1
v
u t′ 
u X
≤ O tc1 qt,2 I{A et ̸= xb} log T + c2 log T  . (29)
t=1

√ Pt′ et ̸=x
I{A b} √
Continuing from (28), we also have term5 ≤ c1 t=1 Pt eτ ̸=x ≤ 2 c1 t′ .
τ =1 I{ A b}


hP ′ i
t
Using all the inequalities above in (27), we can bound E ⟨q
t=1 t − u, zt ⟩ by

  v "  
t′
u #
p u X  √
O min c1 E[t′ ], tc
1 qt,2 I{Aet ̸= x
b} log T + ( c1 + c2 ) log T 
 
t=1
| {z }
pos-term
v
u t′ 
u X I{A et ̸= x
b} c2
− u2 E tc1 + .
qt,2 mint≤t′ qt,2
t=1
| {z }
neg-term

P′
For comparator hP ′x b, we set u = ie1 . Then hP we have E[ i tt=1 (ℓt,At − ℓt,bx )] ≤ pos-term.
t t′
Observe that E t=1 qt,2 I{At ̸= xb} = E t=1 (1 − pt,b
x ) , so this gives the desired prop-
e
hP ′ i
t
erty of 21 -LSB for action x b. For x ̸= x b, we set u = e2 and get E (ℓ
t=1 t,At − ℓt,x ) =
hP ′ i hP ′ i
t t
E t=1 (ℓt,At − ℓt,Aet ) + E et − ℓt,x ) ≤ (pos-term − neg-term) + (neg-term +
t=1 (ℓt,A
p √
c1 E[t′ ]) where the additional c1 t′ term comes from the second term on the right-hand side
of (26), which is not cancelled by the negative term. Note, however, that this still satisfies the re-
quirement of 12 -LSB for x ̸= x b because for these actions, we only require the worst-case bound to
hold. Overall, we have justified that Algorithm 7 satisfies $\frac{1}{2}$-LSB with constants $(c_0', c_1', c_2')$ where $c_0' = c_1' = O(c_1)$ and $c_2' = O(\sqrt{c_1} + c_2)$.

1
F.5. 2 -dd-LSB to 12 -dd-strongly-iw-stable (Algorithm 8 / Lemma 42)

Lemma 42 Let B be an algorithm with the following stability guarantee: given an adaptive se-
quence of weights q1 , q2 , · · · ∈ (0, 1]X such that the feedback in round t is observed with probability
qt (x) if x is chosen, and an adaptive sequence {mt,x }x∈X available at the beginning of round t, it
obtains the following pseudo regret guarantee for any stopping time t′ :

" t′ # v
u t′ 
X u X upd · ξt,A c2
t t
E (ℓt,At − ℓt,u ) ≤ E tc1 + ,
qt (At )2 mint≤t′ minx qt (x)
t=1 t=1

where updt = 1 if feedback is observed in round t and updt = 0 otherwise. ξt,x = (ℓt,x − mt,x )2 in
the second-order bound case, and ξt,x = ℓt,x in the first-order bound case. Then Algorithm 8 with
B as input satisfies 12 -dd-LSB.

Proof The proof of this lemma is a combination of the elements in Theorem 22 (data-dependent
iw-stable) and Theorem 25 (strongly-iw-stable), so we omit the details here and only provide a
sketch.


Algorithm 8 dd-LSB via Corral (for α = 12 , using a 12 -dd-strongly-iw-stable algorithm)


Input: candidate action x b, 21 -iw-stable algorithm B over X with constant (c1 , c2 ).
P2
Define: ψ(q) = i=1 ln q1i . B0 = 0.
Define: For first-order bound, ξt,x = ℓt,x and mt,x = 0; for second-order bound, ξt,x = (ℓt,x −
mt,x )2 where mt,x is the loss predictor.
for t = 1, 2, . . . do
Receive prediction mt,x for all x ∈ X , and set yt,1 = mt,bx and yt,2 = mt,Aet .
Let B generate an action A et (which is the action to be chosen if B is selected in this round).
Let
*  + 
t−1
0
 
 X 1  1 1
q̄t = argmin q, zτ + yt −   + ψ(q) , qt = 1 − 2 q̄t + 2 1,
q∈∆2  Bt−1 ηt  2t 4t
τ =1

t−1
!− 12
1 1 X
where ηt = (log T ) 2 (I[iτ = i] − qτ,i )2 ξτ,Aτ I[A b] + (c1 + c22 ) log T
eτ ̸= x .
4
τ =1

Sample it ∼ qt .
if it = 1 then draw At = x b and observe ℓt,At ;
else draw At = At and observe ℓt,At ;
e
 
(ℓt,At −yt,i )I{it =i} et ̸= x
Define zt,i = qt,i + y t,i I{A b} and
v
u t et ̸= x
u X ξt,At I{iτ = 2}I{A b} c2
Bt = c1
t
2 + . (30)
qτ,2 minτ ≤t qτ,2
τ =1

end

In Algorithm 8, since B is 21 -dd-strongly-iw-stable, its regret is upper bounded by the order of


v
u t′ !
u X
tc I{A
et = x b} · ξt,At I{A et ̸= x b}I{it = 2}ξt,At c2
1 + 2 +
1 qt,2 mint≤t minx qt (x)

t=1
v v
u t′ u t′
u X I{A et ̸= x }I{i t = 2} · ξ t,A t
u X c2
≤ tc1
b
2 + tc
1 ξt,At + (31)
qt,2 mint≤t′ min x qt (x)
t=1 t=1

because if B chooses xb, then the probability of observing the feedback is 1, and is qt,2 otherwise.
This motivates the choice of the bonus in (30). Then we can follow the proof of Theorem 22 step-
by-step, and show that the regret compared to xb is upper bounded by the order of
v
u t′ 2 
u XX
E tc1 (I[it = i] − qt,i )2 ξt,At I{A
et ̸= x
b} log T 
t=1 i=1


v
u t′ 
u X
et ̸= x √
+ E tc1 ξt,At I[it = 2]I{A b} log T  + ( c1 + c2 ) log T
t=1
v " t′ #
u
u X 
≤ tc1 E 2 ξ 2
2qt,1 qt,2 et I{At ̸= x
x + 2qt,1 qt,2 ξt,A
t,b b}
e
t=1
v " t′ #
u
u X
et ̸= x √
+ c1 E
t ξt,Aet qt,2 I{A b} log T + ( c1 + c2 ) log T
t=1
(taking expectation over it and following the calculation in the proof of Theorem 22)
v   
u
u Xt′ X
2 ξ 2 q
= tc1 E  x Pr[At ̸= x
u 2qt,1 qt,2 t,b
e b] + 2qt,1 t,2 Pr[Aet = x]ξt,x 
t=1 x̸=x
b
v  
u
t′
et = x]ξt,x  log T + (√c1 + c2 ) log T
u X X
+ tc1 E  qt,2 Pr[A
u
t=1 x̸=x
b

(taking expectation over A


et )
v   
u
u Xt′ X
≤ tc1 E 
u 2
2qt,1 (1 − pt,bx )ξt,bx + 2qt,1 pt,x ξt,x 
t=1 x̸=x
b
v  
u
t′ X

u X
+ tc1 E  pt,x ξt,x  log T + ( c1 + c2 ) log T
u
t=1 x̸=x
b

(using the property that for x ̸= x


b, pt,x = Pr[A et = x]qt,2 and thus 1 − pt,bx = Pr[A et ̸= x
b]qt,2 )
v    
u
u Xt′ X
≤ tc1 E 2pt,bx (1 − pt,bx )ξt,bx + 2 pt,x ξt,x  (qt,1 ≤ pt,bx ≤ 1)
u 
t=1 x̸=x
b
v  
u
t′ X

u X
+ tc1 E  pt,x ξt,x  log T + ( c1 + c2 ) log T
u
t=1 x̸=x
b
v
u " t′ !# 
u X X √
≤ O tc1 E pt,x ξt,x − p2t,bx ξt,bx log T  + ( c1 + c2 ) log T
t=1 x

which satisfies the requirement of 12 -dd-LSB for the regret against x


b. For the regret against x ̸= x
b,
similar to the proof of Theorem 25, an extra positive regret comes from the second term in (31).
Therefore, the regret against x ̸= x
b, can be upper bounded by
v u " t′ # 
u XX √
O tc E 1 p ξ log T  + ( c + c ) log T
t,x t,x 1 2
t=1 x


which also satisfies the requirement of 21 -dd-LSB for x ̸= x


b.

Appendix G. Analysis for IW-Stable Algorithms

Algorithm 9 EXP2
Input: X .
for t = 1, 2, . . . do
Receive update probability qt .
Let
t−1
(s ) !
ln |X | 1 X
ηt = min , min qτ , Pt (a) ∝ exp −ηt ℓbτ (a) .
d tτ =1 q1τ 2d τ ≤t
P
τ =1

Sample an action at ∼ pt = (1 − dη t dηt


qt )Pt + qt ν, where ν is John’s exploration.
With probability qt , receive ℓt (at ) = ⟨at , θt ⟩ + noise
(in this case, set updt = 1; otherwise, set updt = 0).
Construct loss estimator:
updt ⊤  −1
ℓbt (a) = a Eb∼pt [bb⊤ ] at ℓt (at )
qt

end

G.1. EXP2 (Algorithm 9 / Lemma 9)

Proof [Proof of Lemma 9] Consider the EXP2 algorithm (Algorithm 9) which corresponds to FTRL
with negentropy potential. By standard analysis of FTRL (Lemma 27), for any τ and any a⋆ ,

τ τ τ
" #   
X X X h i ln |X | X 1

Et Pt (a)ℓbt (a) − Et ℓt (a ) ≤
b + Et max ⟨Pt − P, ℓt ⟩ − Dψ (P, Pt ) .
a
ητ P ηt
t=1 t=1 t=1

To apply Lemma 31, weqneed to show that ηq t ℓt (a) ≥ −1. We have by Cauchy Schwarz
b
−1 −1 ⊤ −1 a . For each term, we have
|a⊤ Eb∼pt [bb⊤ ] at | ≤ a⊤ (Eb∼pt [bb⊤ ]) a a⊤

t (Eb∼pt [bb ]) t

 −1 qt ⊤  −1 qt
a⊤ Eb∼pt [bb⊤ ] a≤ a Eb∼ν [bb⊤ ] a= ,
dηt ηt


due to the properties of John’s exploration. Hence |ηt ℓbt (a)| ≤ |ℓt (a)| ≤ 1. We can apply Lemma 31
for the stability term, resulting in

  
1
Et max ⟨Pt − P, ℓt ⟩ − Dψ (P, Pt )
P ηt
" −1 ⊤ −1 a
#
a⊤ Eb∼pt [bb⊤ ] at a⊤

t Eb∼pt [bb ]
X
≤ ηt Et Pt (a)
a
qt2
−1
X a⊤ Eb∼pt [bb⊤ ] a
= ηt Pt (a)
a
qt
−1
X a⊤ Eb∼Pt [bb⊤ ] a 2ηt d
= 2ηt Pt (a) = .
a
q t q t

By |ℓt (a)| ≤ 1 we have furthermore

τ τ τ
" # " #
X X X X X dηt
Et (Pt (a) − pt (a))ℓt (a) =
b Et (Pt (a) − pt (a))ℓt (a) ≤ .
a a
qt
t=1 t=1 t=1

Combining everything and taking the expectation on both sides leads to

τ X τ
" # " #
X log |X | X 3dη t
E pt (a)ℓt (a) − ℓt (a⋆ ) ≤ E +
a
ητ qt
t=1 t=1
 v 
u τ
u X 1 1
≤ E 7td log |X | + 2 log |X |d .
qt mint≤τ qt
t=1

G.2. EXP4 (Algorithm 10 / Lemma 10)

In this section, we use the more standard notation Π to denote the policy class.


Algorithm 10 EXP4
Input: Π (policy class), K (number of arms)
for t = 1, 2, . . . do
Receive context xt .
Receive update probability qt .
Let
t−1
s !
ln |Π| X
ηt = , Pt (π) ∝ exp −ηt ℓbτ (π(xτ )) .
K tτ =1 q1τ
P
τ =1
P
Sample an arm at ∼ pt where pt (a) = π:π(xt )=a Pt (π)π(xt ).
With probability qt , receive ℓt (at ) (in this case, set updt = 1; otherwise, set updt = 0).
Construct loss estimator:
updt I[at = a]ℓt (a)
ℓbt (a) = .
qt pt (a)

end
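A minimal Python sketch of the EXP4 update with the update-probability-weighted loss estimator is given below. The policy class is represented as an integer array mapping contexts to arms; the class and method names are illustrative, not from the paper.

import numpy as np

class EXP4Sketch:
    def __init__(self, policies, num_arms, seed=0):
        self.policies = policies                       # shape (|Pi|, num_contexts), integer arm ids
        self.K = num_arms
        self.cum_loss = np.zeros(len(policies))        # cumulative estimated loss per policy
        self.sum_inv_q = 0.0                           # running sum of 1/q_tau for the step size
        self.rng = np.random.default_rng(seed)

    def act(self, context, q_t):
        self.sum_inv_q += 1.0 / q_t
        eta_t = np.sqrt(np.log(len(self.policies)) / (self.K * self.sum_inv_q))
        logw = -eta_t * (self.cum_loss - self.cum_loss.min())   # shifted for numerical stability
        P_t = np.exp(logw)
        P_t /= P_t.sum()
        arms = self.policies[:, context]
        p_t = np.bincount(arms, weights=P_t, minlength=self.K)  # induced arm distribution
        p_t /= p_t.sum()
        a_t = self.rng.choice(self.K, p=p_t)
        return a_t, p_t

    def update(self, context, a_t, loss, observed, q_t, p_t):
        # importance-weighted estimator: nonzero only for the played arm and only when the
        # loss was actually revealed (which happens with probability q_t)
        ell_hat = np.zeros(self.K)
        if observed:
            ell_hat[a_t] = loss / (q_t * p_t[a_t])
        self.cum_loss += ell_hat[self.policies[:, context]]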

Proof [Proof of Lemma 10] Consider the EXP4 algorithm with adaptive stepsize (Algorithm 10),
which corresponds to FTRL with negentropy regularization. By standard analysis of FTRL
(Lemma 27) and Lemma 31, for any τ and any π ⋆

τ τ τ
" # " #
h i ln |Π| X ηt
X X X X
⋆ 2
Et Pt (π)ℓbt (π(xt )) − Et ℓbt (π (xt )) ≤ + Et Pt (π)ℓbt (π(xt ))
π
ητ 2 π
t=1 t=1 t=1
τ
ln |Π| X ηt X ℓt (π(xt ))
≤ + Pt (π)
ητ 2 π
qt pt (π(xt ))
t=1
v
u τ
ln |Π| ηt u X 1
≤ +K ≤ 2 K ln |Π|
t .
ητ 2qt qt
t=1

Taking expectation on both sides finishes the proof.


G.3. (1 − 1/ log(K))-Tsallis-INF (Algorithm 11 / Lemma 14)

Algorithm 11 (1 − 1/ log(K))-Tsallis-Inf (for strongly observable graphs)


Input: G \ x
b.

Define: ψ(x) = K
P
i=1 β(1−β) , where β = 1 − 1/ log(K).
i

for t = 1, 2, . . . do
Receive update probability qt .
Let
( t−1 )
X 1
pt = argmin ⟨x, ℓbτ ⟩ + ψ(x)
x∈∆([K]) τ =1 ηt
s
log K
where ηt = Pt 1
.
s=1 qs (1 + min{e α, α log K})

Sample At ∼ pt .
With probability qt receive ℓt,i for all At ∈ N (i) (in this case, set updt = 1; otherwise, set
updt = 0).
Define
!
(ℓt,i − I{i ∈ It })I{A t ∈ N (i)}
ℓbt,i = updt P + I{i ∈ It } ,
j∈N (i) pt,j qt
 
1
where It = i ∈ [K] : i ̸∈ N (i) ∧ pt,i > and N (i) = {j ∈ G \ x
b | (j, i) ∈ E}.
2

end

We require the following graph theoretic Lemmas.

Lemma 43 Let G = (V, E) be a directed graph with independence number α and vertex weights
pi > 0, then
pi 2pi α
∃i ∈ V : P ≤P .
pi + j:(j,i)∈E pj i pi

Proof Without loss of generality, assume that p ∈ ∆(V ), since the statement is scale invariant. The
statement is equivalent to
X 1
min pi + pj ≥ .
i 2α
j:(j,i)∈E

Let
   
X X X
min min pi + pj  ≤ min p2i + pi pj  .
p∈∆(V ) i p∈∆(V )
j:(j,i)∈E i j:(j,i)∈E


Via K.K.T. conditions, there exists λ ∈ R such that for an optimal solution p⋆ it holds
X X
∀i ∈ V : either 2p⋆i + p⋆j + p⋆j = λ
j:(j,i)∈E j:(i,j)∈E
X X
or 2p⋆i + p⋆j + p⋆j ≥ λ and p⋆i = 0 .
j:(j,i)∈E j:(i,j)∈E

Next we bound λ. Take the sub-graph over V+ = {i : p⋆i > 0} and take a maximally independence
set S over V+ (|S| ≤ α, since V+ ⊂ V ). We have
 
X X X X
1≤ p⋆j + 2p⋆i + p⋆j + p⋆j  = |S|λ ≤ αλ.
j̸∈V+ i∈S j:(j,i)∈E j:(i,j)∈E
| {z }
=0

Hence λ ≥ α1 . Finally,
   
X X 1 X X X X p⋆ λ 1
(p⋆i )2 + p⋆i p⋆j  = 2(p⋆i )2 + p⋆i p⋆j + p⋆i p⋆j  ≥ i
≥ .
2 2 2α
i j:(j,i)∈E i j:(j,i)∈E j:(i,j)∈E i

Lemma 44 (Lemma 10 Alon et al. (2013)) let G = (V, E) be a directed graph. Then, for any
distribution p ∈ ∆(V ) we have:
p
P i
X
≤ mas(G) ,
pi + j:(j,i)∈E pj
i

where mas(G) ≤ α
e is the maximal acyclic sub-graph.

With these lemmas, we are ready to prove the iw-stability.


Proof [Proof of Lemma 14] Consider the (1 − 1/ log(K))-Tsallis-INF algorithm with adaptive
stepsize (Algorithm 11). Applying Lemma 27, we have for any stopping time τ and a⋆ ∈ [K]:
τ τ  
X 2e log K X 1
Et [⟨pt , ℓt ⟩ − ℓt,a⋆ ] ≤
b b + Et max⟨p − pt , ℓt ⟩ − Dψ (p, pt )
b
ητ p ηt
t=1 t=1
τ  
2e log K X 1
= + Et max⟨p − pt , ℓt + ct 1⟩ − Dψ (p, pt ) ,
b
ητ p ηt
t=1

(ℓt,It −1)I{At ∈N (It )}


where ct = −updt P , which is well defined because It contains by definition
j∈N (I ) pt,j qt
t
maximally one arm.
We can apply Lemma 32, since the losses are strictly positive.
  " #
1 η t
X 1+ log1K
Et max⟨p − pt , ℓbt + ct 1⟩ − Dψ (p, pt ) ≤ Et pt,i (ℓbt,i + ct )2 .
p ηt 2
i


We split the vertices in three sets. Let M1 = {i | i ∈ N (i)} the nodes with self-loops. Let M2 =
{i | i ̸∈ M1 , ̸∈ It } and M3 = It . we have
" #
ηt X 1+ log1K
2
Et pt,i (ℓbt,i + ct )
2
i
1+ log1 K
  
p 4p
P t,i t,i
X X X
≤ ηt Et updt  Pt,i c2t + + + pt,It  ,
( j∈N (i) pt,j qt )2 qt2
i∈M1 ∪M2 i∈M1 i∈M2
P 1
since for any i ∈ M2 , we have j∈N (i) pt,j > 2 . The first term is in expectation
4pt,i updt
hP i
2 ≤ 1 , the third term is in expectation E ≤ q4t , while
P 
Et p c
i∈M1 ∪M2 t,i t qt t i∈M2 qt2
the second is
1+ 1 1+ 1
 
X pt,i log K updt 2 X pt,i log K
Et  ≤ .
( j∈N (i) pt,j qt )2
P P
qt j∈N (i) pt,j
i∈M1 i∈M1

The subgraph consisting of $M_1$ contains self-loops, hence we can apply Lemma 44 to bound this term by $\frac{2\widetilde\alpha}{q_t}$. If $\alpha\log K < \widetilde\alpha$, we instead take a permutation $\tilde p$ of $p$, such that $M_1$ is in position
1, .., |M1 | and for any i ∈ [|M1 |]
pet,i pet,i 2αept,i
P ≤P ≤ Pi .
j∈N (i) p
et,j j∈N (i)∩[i] p
et,j j=1 p
et,j

The existence of such a permutation is guaranteed by Lemma 43: Apply Lemma 43 on the graph
M1 to select p̃|M1 | . Recursively remove the i-th vertex from the graph and apply Lemma 43 on the
resulting sub-graph to pick p̃i−1 . Applying this permutation yields
1 1
log K
1+ M1 1+ log K M1
2 X pt,i 4α X pe 4α X pei 4α log K
P ≤ Pi i ≤ 1− 1 ≤ .
qt j∈N (i) pt,j qt j=1 p
ej qt P
i log K qt
i∈M1 i=1 i=1
j=1 p
ej

Combining everything
 
1 ηt
Et max⟨p − pt , ℓbt + ct 1⟩ − Dψ (p, pt ) ≤ (6 + 4 min{e
α, α log K}) .
p ηt qt
By the definition of the learning rate, we obtain
τ τ
X 2e log K X ηt
Et [⟨pt , ℓbt ⟩ − ℓbt,a⋆ ] ≤ + (6 + 4 min{eα, α log K})
ητ qt
t=1 t=1
v
u τ
u X 1
≤ 18 (1 + min{e
t α, α log K}) log K .
qt
t=1


G.4. EXP3 for weakly observable graphs (Algorithm 12 / Lemma 17)

Algorithm 12 EXP3 (for weakly observable graphs)


Input: G \ xb, dominating set D (potentially including x b).
Define: ψ(x) = K
P
x
i=1 i log(x i ). νD is the uniform distribution over D.
for t = 1, 2, . . . do
Receive update probability qt .
Let
( t−1 )
X 1
Pt = argmin ⟨x, ℓbτ ⟩ + ψ(x) , pt = (1 − γt )Pt + γt νD
x∈∆([K]) τ =1 ηt
 √ 2
−1
Pt 1 !3
s
δ s=1 qs √
4δ ηt δ
where ηt =  +  and γt = .
 
log(K) mins≤t qs qt

Sample At ∼ pt .
With probability qt receive ℓt,i for all At ∈ N (i) (in this case, set updt = 1; otherwise, set
updt = 0).
Define
ℓt,i updt I{At ∈ N (i)}
ℓbt,i = P .
j∈N (i) pt,j qt

end

Proof [Proof of Lemma 17] Consider the EXP3 algorithm (Algorithm 12). By standard analysis of
FTRL (Lemma 27) and Lemma 31 by the non-negativity of all losses, for any τ and any a⋆ ,

τ τ
" #
X X X h i
Et Pt (a)ℓbt (a) − Et ℓbt (a⋆ )
t=1 a t=1
τ
" #
log K X ηt X
≤ + Et Pt,i ℓb2t,i
ητ 2
t=1 i
τ
log K X ηt X 1
≤ + Pt,i P
ητ 2 j∈N (i) pt,j qt
t=1 i
s
τ
log K X ηt δ
≤ +
ητ 4qt
t=1


q
ηt P δ ηt δ
1 1
where in the last inequality we use i Pt,i × γt × qt
= qt . We have due to Et [⟨Pt −pt , ℓt ⟩] ≤
b
2 2
γt
s
τ τ τ
" #
X X X h i log K X ηt δ

Et pt (a)ℓbt (a) − Et ℓbt (a ) ≤ + 9
a
η τ 4qt
t=1 t=1 t=1
τ
!2
3
1 X 1 4δ log(K)
≤ 6(δ log(K)) 3 √ + .
qt mint≤T qt
t=1

Taking expectation on both sides finishes the proof.

Appendix H. Surrogate Loss for Strongly Observable Graph Problems


When there exist arms such that i ̸∈ N (i), we cannot directly apply Algorithm 2. To make the
algorithm applicable to all strongly observable graphs, define the surrogate losses ℓ̃t in the following
way:

X
ℓ̃t,bx = ℓt,bx I{(b b) ∈ E} −
x, x pt,j ℓt,j I{(j, j) ̸∈ E}
x}
j∈[K]\{b

∀j ∈ [K] \ {b
x} : ℓ̃t,j = ℓt,j I{(j, j) ∈ E} − ℓt,bx I{(b b) ̸∈ E} ,
x, x
where pt is the distribution of the base algorithm B over [K] \ {b x} at round t. By construction
and the definition of strongly observable graphs, ℓ̃t,j is observed when playing arm j. (When the
player does not have access to pt , one can also sample one action from the current distribution for
an unbiased estimate of ℓt,bx .) The losses ℓ̃t are in range [−1, 1] instead of [0, 1] and can further be
shifted to be strictly non-negative. Finally observe that
 
X
E[ℓ̃t,At − ℓ̃t,bx ] = qt,2  pt,j ℓ̃t,j − ℓ̃t,bx 
x}
j∈[K]\{b
 
X
= qt,2  pt,j ℓt,j − ℓt,bx 
x}
j∈[K]\{b

= E[ℓt,At − ℓt,bx ],
as well as
   
X X
E ℓ̃t,At − pt,j ℓ̃t,j  = qt,1 ℓ̃t,bx − pt,j ℓ̃t,j 
x}
j∈[K]\{b x}
j∈[K]\{b
 
X
= qt,1 ℓ̃t,bx − pt,j ℓt,j 
x}
j∈[K]\{b
 
X
= E ℓt,At − pt,j ℓt,j  .
x}
j∈[K]\{b


That means running Algorithm 2, replacing ℓt by ℓ̃t allows to apply Theorem 11 to strongly observ-
able graphs where not every arm receives its own loss as feedback.
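Concretely, the surrogate losses above can be computed in a few lines. The following hypothetical Python helper takes the realized losses, the base-algorithm distribution over the arms other than the candidate, and the self-loop indicators, and returns the vector of surrogate losses (before any shift to make them non-negative); the function and argument names are illustrative.

import numpy as np

def surrogate_losses(ell_t, p_t, has_self_loop, xhat):
    """Surrogate losses for strongly observable graphs (sketch).

    ell_t: length-K losses; p_t: base-algorithm distribution over [K] \\ {xhat}
           (the entry p_t[xhat] is ignored); has_self_loop[i]: True iff (i, i) in E.
    """
    K = len(ell_t)
    tilde = np.zeros(K)
    others = [j for j in range(K) if j != xhat]
    # candidate arm: its own loss if it has a self-loop, minus the base algorithm's expected
    # loss over arms whose own losses it cannot observe directly
    tilde[xhat] = ell_t[xhat] * has_self_loop[xhat] - sum(
        p_t[j] * ell_t[j] for j in others if not has_self_loop[j]
    )
    # every other arm: its own loss if it has a self-loop, minus xhat's loss if xhat has none
    for j in others:
        tilde[j] = ell_t[j] * has_self_loop[j] - ell_t[xhat] * (not has_self_loop[xhat])
    return tilde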

Appendix I. Tabular MDP (Theorem 26)


In this section, we consider using the UOB-Log-barrier Policy Search algorithm in Lee et al. (2020)
(their Algorithm 4) as our base algorithm. To this end, we need to show that it satisfies a dd-
strongly-iw-stable condition specified in Lemma 42. We consider a variant of their algorithm by
incorporating the feedback probability qt′ = qt + (1 − qt )I[πt = π
b]. The algorithm is Algorithm 13.

We refer the readers to Section C.3 of Lee et al. (2020) for the setting descriptions and notations,
since we will follow them tightly in this section.

I.1. Base algorithm


Lemma 45 Algorithm 13 ensures, for any $u^\star$,
$$\mathbb{E}\left[\sum_{t=1}^{T}\langle w_t - u^\star, \ell_t\rangle\right] \le O\left(\sqrt{HS^2A\iota^2\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{s,a}\frac{\mathrm{upd}_t I_t(s,a)\ell_t(s,a)}{q_t'^2}\right]} + \mathbb{E}\left[\frac{S^5A^2\iota^2}{\min_t q_t'}\right]\right).$$

Proof The regret is decomposed as follows (similar to Section C.3 in Lee et al. (2020)):
$$\sum_{t=1}^{T}\langle w_t - u^\star,\ell_t\rangle = \underbrace{\sum_{t=1}^{T}\langle w_t - \hat w_t,\ell_t\rangle}_{\textsc{Error}} + \underbrace{\sum_{t=1}^{T}\langle \hat w_t,\ell_t - \hat\ell_t\rangle}_{\textsc{Bias-1}} + \underbrace{\sum_{t=1}^{T}\langle \hat w_t - u,\hat\ell_t\rangle}_{\textsc{Reg-term}} + \underbrace{\sum_{t=1}^{T}\langle u,\hat\ell_t - \ell_t\rangle}_{\textsc{Bias-2}} + \underbrace{\sum_{t=1}^{T}\langle u - u^\star,\ell_t\rangle}_{\textsc{Bias-3}},$$
with the same definition of $u$ as in (24) of Lee et al. (2020). $\textsc{Bias-3}\le H$ trivially (see (25) of Lee et al. (2020)). For the remaining four terms $\textsc{Error}$, $\textsc{Bias-1}$, $\textsc{Bias-2}$, and $\textsc{Reg-term}$, we use Lemma 49, Lemma 50, Lemma 51, and Lemma 52, respectively. Combining all terms finishes the proof.

Lemma 46 (Lemma C.2 of Lee et al. (2020)) With probability $1 - O(\delta)$, for all $t, s, a, s'$,
$$\left|P(s'|s,a) - \bar{P}_t(s'|s,a)\right| \le \frac{\epsilon_t(s'|s,a)}{2}.$$
Lemma 47 (cf. Lemma C.3 of Lee et al. (2020)) With probability at least $1-\delta$, for all $h$,
$$\sum_{t=1}^{T}\sum_{s\in S_h, a\in A}\frac{q_t'\cdot w_t(s,a)}{\max\{1, N_t(s,a)\}} = O\left(|S_h|A\log(T) + \ln(H/\delta)\right).$$

Proof The proof of this lemma is identical to the original one; we only need to notice that, in our case, the probability of obtaining a sample of $(s,a)$ is $q_t'\cdot w_t(s,a)$.


Algorithm 13 UOB-Log-Barrier Policy Search

Input: state space $S$, action space $A$, candidate policy $\hat\pi$.
Define: $\Omega = \big\{\hat w : \hat w(s,a,s') \ge \frac{1}{T^3S^2A}\big\}$, $\psi(w) = \sum_h\sum_{(s,a,s')\in S_h\times A\times S_{h+1}}\ln\frac{1}{w(s,a,s')}$, $\delta = \frac{1}{T^5S^3A}$.
Initialization:
$$\hat w_1(s,a,s') = \frac{1}{|S_h||A||S_{h+1}|}, \qquad \pi_1 = \pi^{\hat w_1}, \qquad t^\star \leftarrow 1.$$
for $t = 1, 2, \dots$ do
    If $\pi_t = \hat\pi$, set $\mathrm{upd}_t = 1$; otherwise, $\mathrm{upd}_t = 1$ w.p. $q_t$ and $\mathrm{upd}_t = 0$ w.p. $1-q_t$.
    If $\mathrm{upd}_t = 1$, obtain the trajectory $s_h, a_h, \ell_t(s_h,a_h)$ for all $h = 1,\dots,H$.
    Construct loss estimators
    $$\hat\ell_t(s,a) = \frac{\mathrm{upd}_t}{q_t'}\cdot\frac{\ell_t(s,a)I_t(s,a)}{\phi_t(s,a)}, \quad\text{where } I_t(s,a) = \mathbb{I}\{(s_{h(s)}, a_{h(s)}) = (s,a)\}$$
    and $q_t' = q_t + (1-q_t)\mathbb{I}[\pi_t = \hat\pi]$.
    Update counters: for all $s, a, s'$,
    $$N_{t+1}(s,a) \leftarrow N_t(s,a) + I_t(s,a), \qquad N_{t+1}(s,a,s') \leftarrow N_t(s,a,s') + I_t(s,a)\mathbb{I}\{s_{h(s)+1} = s'\}.$$
    Compute confidence set
    $$\mathcal{P}_{t+1} = \Big\{\hat P : \big|\hat P(s'|s,a) - \bar P_{t+1}(s'|s,a)\big| \le \epsilon_{t+1}(s'|s,a), \ \forall (s,a,s')\Big\}$$
    where $\bar P_{t+1}(s'|s,a) = \frac{N_{t+1}(s,a,s')}{\max\{1, N_{t+1}(s,a)\}}$ and
    $$\epsilon_{t+1}(s'|s,a) = 4\sqrt{\frac{\bar P_{t+1}(s'|s,a)\ln(SAT/\delta)}{\max\{1, N_{t+1}(s,a)\}}} + \frac{28\ln(SAT/\delta)}{3\max\{1, N_{t+1}(s,a)\}}.$$
    if $\sum_{\tau=t^\star}^{t}\sum_{s,a}\frac{\mathrm{upd}_\tau I_\tau(s,a)\ell_\tau(s,a)^2}{q_\tau'^2} + \max_{\tau\le t}\frac{H}{q_\tau'^2} \ge \frac{S^2A\ln(SAT)}{\eta_t^2}$ then
        $\eta_{t+1} \leftarrow \frac{\eta_t}{2}$
        $\hat w_{t+1} = \operatorname*{argmin}_{w\in\Delta(\mathcal{P}_{t+1})\cap\Omega}\psi(w)$
        $t^\star \leftarrow t+1$.
    else
        $\eta_{t+1} = \eta_t$
        $\hat w_{t+1} = \operatorname*{argmin}_{w\in\Delta(\mathcal{P}_{t+1})\cap\Omega}\Big\{\langle w,\hat\ell_t\rangle + \frac{1}{\eta_t}D_\psi(w,\hat w_t)\Big\}$.
    end
    Update policy $\pi_{t+1} = \pi^{\hat w_{t+1}}$.
end
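To make the bookkeeping in Algorithm 13 concrete, here is a minimal Python sketch of the per-episode updates (loss estimates, visit counters, confidence widths, and the learning-rate halving test). It deliberately omits the occupancy-measure optimization steps (the two argmin's over $\Delta(\mathcal{P}_{t+1})\cap\Omega$), which require a constrained convex solver; the trajectory format and names such as `phi` (the upper occupancy bound $\phi_t$) are our own assumptions.

```python
import numpy as np

def mdp_round_update(traj, upd, q_prime, phi, N_sa, N_sas):
    """Bookkeeping of one episode of the UOB-Log-Barrier variant (illustrative sketch).

    traj    : list of (h, s, a, loss) tuples for the visited layers (empty if upd == 0)
    upd     : 1 if feedback was obtained this episode, else 0
    q_prime : feedback probability q'_t of this episode
    phi     : dict (s, a) -> upper occupancy bound phi_t(s, a)
    N_sa    : dict (s, a) -> visit count, N_sas : dict (s, a, s') -> transition count
    """
    ell_hat = {}
    if upd:
        for k, (h, s, a, loss) in enumerate(traj):
            # Loss estimate (upd_t / q'_t) * ell_t(s,a) I_t(s,a) / phi_t(s,a); I_t = 1 on the trajectory.
            ell_hat[(s, a)] = loss / (q_prime * phi[(s, a)])
            N_sa[(s, a)] = N_sa.get((s, a), 0) + 1
            if k + 1 < len(traj):
                s_next = traj[k + 1][1]
                N_sas[(s, a, s_next)] = N_sas.get((s, a, s_next), 0) + 1
    return ell_hat

def confidence_width(N_sa, N_sas, iota):
    """Empirical transitions and widths eps_{t+1}(s'|s,a); iota = ln(S A T / delta)."""
    P_bar, eps = {}, {}
    for (s, a, s2), n in N_sas.items():
        m = max(1, N_sa.get((s, a), 0))
        P_bar[(s, a, s2)] = n / m
        eps[(s, a, s2)] = 4 * np.sqrt(P_bar[(s, a, s2)] * iota / m) + 28 * iota / (3 * m)
    return P_bar, eps

def should_halve(stat_sum, max_H_over_q2, eta, S, A, log_SAT):
    """Halving test: stat_sum + max_t H/q'^2_t >= S^2 A ln(SAT) / eta^2."""
    return stat_sum + max_H_over_q2 >= S * S * A * log_SAT / eta ** 2
```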


Lemma 48 (cf. Lemma C.6 of Lee et al. (2020)) With probability at least $1-O(\delta)$, for any $t$ and any collection of transition functions $\{P_t^s\}_{s\in S}$ such that $P_t^s\in\mathcal{P}_t$ for all $s$, we have
$$\sum_{t=1}^{T}\sum_{s\in S, a\in A}\left|w_{P_t^s,\pi_t}(s,a) - w_t(s,a)\right|\ell_t(s,a) = O\left(S\sqrt{HA\iota^2\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle}{q_t'}} + \frac{S^5A^2\iota^2}{\min_t q_t'}\right),$$
where $\iota \triangleq \log(SAT/\delta)$ and $\ell_t^2(s,a) \triangleq (\ell_t(s,a))^2$.

Proof Following the proof of Lemma C.6 in Lee et al. (2020), we have
$$\sum_{t=1}^{T}\sum_{s\in S, a\in A}\left|w_{P_t^s,\pi_t}(s,a) - w_t(s,a)\right|\ell_t(s,a) \le B_1 + S B_2,$$
where
$$B_2 \le O\left(\sum_{t=1}^{T}\sum_{s,a,s'}\sum_{x,y,x'} w_t(s,a)\sqrt{\frac{P(s'|s,a)\iota}{\max\{1,N_t(s,a)\}}}\; w_t(x,y|s')\sqrt{\frac{P(x'|x,y)\iota}{\max\{1,N_t(x,y)\}}}\right) + O\left(\sum_{t=1}^{T}\sum_{s,a,s'}\sum_{x,y,x'}\frac{w_t(s,a)\iota}{\max\{1,N_t(s,a)\}}\right). \tag{32}$$

The first term in (32) can be upper bounded by the order of


$$\sum_{t=1}^{T}\sum_{s,a,s'}\sum_{x,y,x'}\sqrt{\frac{w_t(s,a)P(x'|x,y)w_t(x,y|s')\iota}{\max\{1,N_t(s,a)\}}}\sqrt{\frac{w_t(s,a)P(s'|s,a)w_t(x,y|s')\iota}{\max\{1,N_t(x,y)\}}}$$
$$\le \sum_{t=1}^{T}\sqrt{\sum_{s,a,s'}\sum_{x,y,x'}\frac{w_t(s,a)P(x'|x,y)w_t(x,y|s')\iota}{\max\{1,N_t(s,a)\}}}\sqrt{\sum_{s,a,s'}\sum_{x,y,x'}\frac{w_t(s,a)P(s'|s,a)w_t(x,y|s')\iota}{\max\{1,N_t(x,y)\}}}$$
$$\le \sum_{t=1}^{T}\sqrt{H\sum_{s,a,s'}\frac{w_t(s,a)\iota}{\max\{1,N_t(s,a)\}}}\sqrt{H\sum_{x,y,x'}\frac{w_t(x,y)\iota}{\max\{1,N_t(x,y)\}}}$$
$$\le HS\sum_{t=1}^{T}\sum_{s,a,s'}\frac{w_t(s,a)\iota}{\max\{1,N_t(s,a)\}}$$
$$\le O\left(\frac{HS^2\iota}{\min_t q_t'}\left(SA\ln(T) + H\log(H/\delta)\right)\right). \qquad\text{(by Lemma 47)}$$
mint qt′
The second term in (32) can be upper bounded by the order of
$$S^2A\sum_{t=1}^{T}\sum_{s,a,s'}\frac{w_t(s,a)\iota}{\max\{1,N_t(s,a)\}} = O\left(\frac{S^3A\iota}{\min_t q_t'}\left(SA\ln(T) + H\log(H/\delta)\right)\right). \qquad\text{(by Lemma 47)}$$

Combining the two parts, we get
$$B_2 \le O\left(\frac{1}{\min_t q_t'}\Big(S^4A^2\ln(T)\iota + HS^3A\iota\log(H/\delta)\Big)\right) \le O\left(\frac{S^4A^2\iota^2}{\min_t q_t'}\right).$$


Next, we bound B1 . By the same calculation as in Lemma C.6 of Lee et al. (2020),

$$B_1 \le O\left(\sum_{t=1}^{T}\sum_{s,a,s'}w_t(s,a)\sqrt{\frac{P(s'|s,a)\iota}{\max\{1,N_t(s,a)\}}}\sum_{x,y}w_t(x,y|s')\ell_t(x,y) + H\sum_{t=1}^{T}\sum_{s,a,s'}\frac{w_t(s,a)\iota}{\max\{1,N_t(s,a)\}}\right)$$
$$\le O\left(\sum_{t=1}^{T}\sum_{s,a,s'}w_t(s,a)\sqrt{\frac{P(s'|s,a)\iota}{\max\{1,N_t(s,a)\}}}\sum_{x,y}w_t(x,y|s')\ell_t(x,y) + \frac{HS^2A\iota^2}{\min_t q_t'}\right).$$

For the first term above, we consider the summation over (s, a, s′ ) ∈ Th ≜ Sh × A × Sh+1 . We
continue to bound it by the order of
$$\sum_{t=1}^{T}\sum_{(s,a,s')\in T_h}w_t(s,a)\sqrt{\frac{P(s'|s,a)\iota}{\max\{1,N_t(s,a)\}}}\sum_{x,y}w_t(x,y|s')\ell_t(x,y)$$
$$\le \alpha\sum_{t=1}^{T}\frac{1}{q_t'}\sum_{(s,a,s')\in T_h}\sum_{x,y}w_t(s,a)P(s'|s,a)w_t(x,y|s')\ell_t(x,y)^2 + \frac{1}{\alpha}\sum_{t=1}^{T}q_t'\sum_{(s,a,s')\in T_h}\sum_{x,y}w_t(s,a)w_t(x,y|s')\cdot\frac{\iota}{\max\{1,N_t(s,a)\}}$$
(by AM-GM, holds for any $\alpha>0$)
$$\le \alpha\sum_{t=1}^{T}\frac{1}{q_t'}\sum_{x,y}w_t(x,y)\ell_t(x,y)^2 + \frac{H|S_{h+1}|}{\alpha}\sum_{t=1}^{T}\sum_{(s,a)\in S_h\times A}\frac{q_t'\,w_t(s,a)\iota}{N_t(s,a)}$$
$$\le \alpha\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle}{q_t'} + \frac{H|S_{h+1}||S_h|A\iota^2}{\alpha} \qquad\text{(by Lemma 47)}$$
$$\le O\left(\sqrt{H|S_h||S_{h+1}|A\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle\iota^2}{q_t'}}\right) \qquad\text{(choose the optimal $\alpha$)}$$
$$\le O\left((|S_h|+|S_{h+1}|)\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle\iota^2}{q_t'}}\right).$$

Therefore,
$$B_1 \le O\left(\sum_h(|S_h|+|S_{h+1}|)\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle\iota^2}{q_t'}} + \frac{HS^2A\iota^2}{\min_t q_t'}\right) \le O\left(S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle\iota^2}{q_t'}} + \frac{HS^2A\iota^2}{\min_t q_t'}\right).$$

Combining the bounds finishes the proof.


Lemma 49 (cf. Lemma C.7 of Lee et al. (2020))
$$\mathbb{E}[\textsc{Error}] = \mathbb{E}\left[\sum_{t=1}^{T}\langle\hat w_t - w_t, \ell_t\rangle\right] \le O\left(\mathbb{E}\left[S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle\iota^2}{q_t'}} + \frac{S^5A^2\iota^2}{\min_t q_t'}\right]\right).$$

Proof By Lemma 48, with probability at least 1 − O(δ),


$$\textsc{Error} = \sum_{t=1}^{T}\langle\hat w_t - w_t,\ell_t\rangle \le \sum_{t=1}^{T}\sum_{s,a}\left|\hat w_t(s,a) - w_t(s,a)\right|\ell_t(s,a) \le O\left(S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle\iota^2}{q_t'}} + \frac{S^5A^2\iota^2}{\min_t q_t'}\right).$$

Furthermore, $|\textsc{Error}| \le O(SAT)$ with probability 1. Therefore,
$$\mathbb{E}[\textsc{Error}] \le O\left(\mathbb{E}\left[S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle\iota^2}{q_t'}} + \frac{S^5A^2\iota^2}{\min_t q_t'}\right] + \delta SAT\right) \le O\left(\mathbb{E}\left[S\sqrt{HA\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle\iota^2}{q_t'}} + \frac{S^5A^2\iota^2}{\min_t q_t'}\right]\right)$$

by our choice of δ.

Lemma 50 (cf. Lemma C.8 of Lee et al. (2020))
$$\mathbb{E}[\textsc{Bias-1}] = \mathbb{E}\left[\sum_{t=1}^{T}\langle\hat w_t, \ell_t - \hat\ell_t\rangle\right] \le O\left(\mathbb{E}\left[S\sqrt{HA\iota^2\sum_{t=1}^{T}\frac{\langle w_t,\ell_t\rangle}{q_t'}} + \frac{S^5A^2\iota^2}{\min_t q_t'}\right]\right).$$

Proof Let $\mathcal{E}_t$ be the event that $P\in\mathcal{P}_\tau$ for all $\tau\le t$.
$$\mathbb{E}_t\left[\langle\hat w_t,\ell_t - \hat\ell_t\rangle \,\Big|\, \mathcal{E}_t\right] = \sum_{s,a}\hat w_t(s,a)\ell_t(s,a)\left(1 - \frac{w_t(s,a)}{\phi_t(s,a)}\right) \le \sum_{s,a}\left|\phi_t(s,a) - w_t(s,a)\right|\ell_t(s,a).$$

Thus,
$$\mathbb{E}_t\left[\langle\hat w_t,\ell_t - \hat\ell_t\rangle\right] \le \sum_{s,a}\left|\phi_t(s,a) - w_t(s,a)\right|\ell_t(s,a) + O\left(H\,\mathbb{I}[\bar{\mathcal{E}}_t]\right).$$

Summing this over t,


$$\sum_{t=1}^{T}\mathbb{E}_t\left[\langle\hat w_t,\ell_t - \hat\ell_t\rangle\right] \le \underbrace{\sum_{t=1}^{T}\sum_{s,a}\left|\phi_t(s,a) - w_t(s,a)\right|\ell_t(s,a)}_{(\star)} + O\left(H\sum_{t=1}^{T}\mathbb{I}[\bar{\mathcal{E}}_t]\right).$$

By Lemma 48, $(\star)$ is upper bounded by $O\Big(S\sqrt{HA\iota^2\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle}{q_t'}} + \frac{S^5A^2\iota^2}{\min_t q_t'}\Big)$ with probability $1-O(\delta)$. Taking expectations on both sides and using that $\Pr[\bar{\mathcal{E}}_t]\le\delta$, we get
$$\mathbb{E}\left[\sum_{t=1}^{T}\langle\hat w_t,\ell_t - \hat\ell_t\rangle\right] \le O\left(\mathbb{E}\left[S\sqrt{HA\iota^2\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle}{q_t'}} + \frac{S^5A^2\iota^2}{\min_t q_t'}\right]\right) + O(\delta SAT) \le O\left(\mathbb{E}\left[S\sqrt{HA\iota^2\sum_{t=1}^{T}\frac{\langle w_t,\ell_t^2\rangle}{q_t'}} + \frac{S^5A^2\iota^2}{\min_t q_t'}\right]\right).$$

Lemma 51 (cf. Lemma C.9 of Lee et al. (2020))
$$\mathbb{E}[\textsc{Bias-2}] = \mathbb{E}\left[\sum_{t=1}^{T}\langle u,\hat\ell_t - \ell_t\rangle\right] \le O(1).$$

Proof Let Et be the event that P ∈ Pτ for all τ ≤ t. By the construction of the loss estimator, we
have
$$\mathbb{E}_t\left[\langle u,\hat\ell_t - \ell_t\rangle \,\Big|\, \mathcal{E}_t\right] \le 0,$$

and thus
$$\mathbb{E}_t\left[\langle u,\hat\ell_t - \ell_t\rangle\right] \le \mathbb{I}[\bar{\mathcal{E}}_t]\cdot H\cdot T^3S^2A \qquad\text{(by Lemma C.5 of Lee et al. (2020), }\hat\ell_t(s,a)\le T^3S^2A\text{)}$$

and
$$\mathbb{E}\left[\sum_{t=1}^{T}\langle u,\hat\ell_t - \ell_t\rangle\right] \le HT^4S^2A\,\mathbb{E}\left[\sum_{t=1}^{T}\mathbb{I}[\bar{\mathcal{E}}_t]\right] \le O(\delta HT^5S^2A) \le O(1),$$

where the last inequality is by our choice of δ.

Lemma 52 (cf. Lemma C.10 of Lee et al. (2020))


$$\mathbb{E}[\textsc{Reg-term}] \le O\left(\sqrt{S^2A\ln(SAT)\,\mathbb{E}\left[\sum_{t=1}^{T}\sum_{s,a}\frac{\mathrm{upd}_t I_t(s,a)\ell_t(s,a)^2}{q_t'^2} + \max_t\frac{H}{q_t'^2}\right]}\right).$$

Proof By the same calculation as in the proof of Lemma C.10 in Lee et al. (2020),
$$\langle\hat w_t - u,\hat\ell_t\rangle \le \frac{D_\psi(u,\hat w_t) - D_\psi(u,\hat w_{t+1})}{\eta_t} + \eta_t\sum_{s,a}\hat w_t(s,a)^2\hat\ell_t(s,a)^2 \le \frac{D_\psi(u,\hat w_t) - D_\psi(u,\hat w_{t+1})}{\eta_t} + \eta_t\sum_{s,a}\frac{\mathrm{upd}_t I_t(s,a)\ell_t(s,a)^2}{q_t'^2}.$$


Let $t_1, t_2, \dots$ be the time indices at which $\eta_t = \eta_{t-1}/2$, and let $t_{i^\star}$ be the last such time. Summing the inequalities above and telescoping, we get


 
$$\sum_{t=1}^{T}\langle\hat w_t - u,\hat\ell_t\rangle \le \sum_i\left(\frac{D_\psi(u,\hat w_{t_i})}{\eta_{t_i}} + \eta_{t_i}\sum_{t=t_i}^{t_{i+1}-1}\sum_{s,a}\frac{\mathrm{upd}_t I_t(s,a)\ell_t(s,a)^2}{q_t'^2}\right)$$
$$\le \sum_i\left(\frac{O(S^2A\ln(SAT))}{\eta_{t_i}} + \eta_{t_i}\sum_{t=t_i}^{t_{i+1}-1}\sum_{s,a}\frac{\mathrm{upd}_t I_t(s,a)\ell_t(s,a)^2}{q_t'^2}\right) \qquad\text{(computed in the proof of Lemma C.10 in Lee et al. (2020))}$$
$$\le \sum_i\frac{O(S^2A\ln(SAT))}{\eta_{t_i}} \qquad\text{(by the timing at which we halve the learning rate)}$$
$$= O\left(\frac{S^2A\ln(SAT)}{\eta_{t_{i^\star}}}\right)$$
$$\le O\left(\sqrt{S^2A\ln(SAT)\left(\sum_{t=1}^{T}\sum_{s,a}\frac{\mathrm{upd}_t I_t(s,a)\ell_t(s,a)^2}{q_t'^2} + \max_t\frac{H}{q_t'^2}\right)}\right).$$
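(A brief note on the last two steps, stated as our reading rather than the authors' wording: since the learning rate halves at every $t_i$, the sum $\sum_i 1/\eta_{t_i}$ is a geometric series dominated by its last term, so $\sum_i 1/\eta_{t_i}\le 2/\eta_{t_{i^\star}}$. Moreover, assuming at least one halving occurs, the halving rule fired when entering round $t_{i^\star}$, i.e.,
$$\frac{S^2A\ln(SAT)}{\eta_{t_{i^\star}-1}^2} \le \sum_{t=1}^{T}\sum_{s,a}\frac{\mathrm{upd}_t I_t(s,a)\ell_t(s,a)^2}{q_t'^2} + \max_t\frac{H}{q_t'^2},$$
which together with $\eta_{t_{i^\star}} = \eta_{t_{i^\star}-1}/2$ gives the final square-root bound.)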

I.2. Corralling
We use Algorithm 8 as the corral algorithm for the MDP setting, with Algorithm 13 being the base
algorithm. The guarantee of Algorithm 8 is provided in Lemma 42.
Proof [Proof of Theorem 26] To apply Lemma 42, we have to rescale the losses, because for MDPs the loss of a policy in one round can be as large as $H$. Therefore, we scale down all losses by a factor of $\frac{1}{H}$. Then, by Lemma 45, the base algorithm satisfies the condition in Lemma 42 with $c_1 = S^2A\iota^2$ and $c_2 = S^5A^2\iota^2/H$, where $\xi_{t,\pi}$ is defined via $\ell'_{t,\pi} = \ell_{t,\pi}/H$, the expected loss of policy $\pi$ after scaling. Therefore, by Lemma 42, we can transform it into a $\frac{1}{2}$-dd-LSB algorithm with $c_1 = S^2A\iota^2$ and $c_2 = S^5A^2\iota^2/H$. Finally, using Theorem 23 and scaling the loss range back, we get
$$O\left(\sqrt{S^2AHL_\star\log^2(T)\iota^2} + S^5A^2\log^2(T)\iota^2\right)$$
regret in the adversarial regime, and
$$O\left(\frac{H^2S^2A\iota^2\log T}{\Delta} + \sqrt{\frac{H^2S^2A\iota^2\log T}{\Delta}\cdot C} + S^5A^2\iota^2\log(T)\log(C\Delta^{-1})\right)$$
regret in the corrupted stochastic regime.
