Abstract
Best-of-both-worlds algorithms for online learning which achieve near-optimal regret in both the
adversarial and the stochastic regimes have received growing attention recently. Existing techniques
often require careful adaptation to every new problem setup, including specialised potentials and
careful tuning of algorithm parameters. Yet, in domains such as linear bandits, it is still unknown if
there exists an algorithm that can simultaneously obtain O(log(T)) regret in the stochastic regime
and Õ(√T) regret in the adversarial regime. In this work, we resolve this question positively and
present a general reduction from best of both worlds to a wide family of follow-the-regularized-
leader (FTRL) and online-mirror-descent (OMD) algorithms. We showcase the capability of this
reduction by transforming existing algorithms that are only known to achieve worst-case guarantees
into new algorithms with best-of-both-worlds guarantees in contextual bandits, graph bandits and
tabular Markov decision processes.
1. Introduction
Multi-armed bandits and their various extensions have a long history (Lai et al., 1985; Auer et al.,
2002a;b). Traditionally, the stochastic regime in which all losses/rewards are i.i.d. and the adver-
sarial regime in which an adversary chooses the losses in an arbitrary fashion have been studied
in isolation. However, it is often unclear in practice whether an environment is best modelled by
the stochastic regime, a slightly contaminated regime, or a fully adversarial regime. This is why
the question of automatically adapting to the hardness of the environment, also called achieving
best-of-both-worlds guarantees, has received growing attention (Bubeck and Slivkins, 2012; Seldin
and Slivkins, 2014; Auer and Chiang, 2016; Seldin and Lugosi, 2017; Wei and Luo, 2018; Zimmert
and Seldin, 2019; Ito, 2021b; Ito et al., 2022a; Masoudian and Seldin, 2021; Gaillard et al., 2014;
Mourtada and Gaı̈ffas, 2019; Ito, 2021a; Lee et al., 2021; Rouyer et al., 2021; Amir et al., 2022;
Huang et al., 2022; Tsuchiya et al., 2022; Erez and Koren, 2021; Rouyer et al., 2022; Ito et al.,
2022b; Kong et al., 2022; Jin and Luo, 2020; Jin et al., 2021; Chen et al., 2022; Saha and Gaillard,
2022; Masoudian et al., 2022; Honda et al., 2023). One of the most successful approaches has been
to derive carefully tuned FTRL algorithms, which are canonically suited for the adversarial regime,
and then show that these algorithms are also close to optimal in the stochastic regime.
This is achieved by proving a crucial self-bounding property, which then translates into stochastic
guarantees. Algorithms of this type have been proposed for multi-armed bandits (Wei and Luo,
2018; Zimmert and Seldin, 2019; Ito, 2021b; Ito et al., 2022a), combinatorial semi-bandits (Zim-
mert et al., 2019), bandits with graph feedback (Ito et al., 2022b), tabular MDPs (Jin and Luo, 2020;
Jin et al., 2021) and others. Algorithms with self-bounding properties also automatically adapt to
Table 1: Overview of regret bounds. The o(C) column specifies whether the algorithm achieves the
optimal dependence on the amount of corruption (√C or C^(2/3), depending on the setting). “Rd. 1”
and “Rd. 2” indicate whether the result leverages the first and the second reductions described in
Section 4, respectively. L⋆ is the cumulative loss of the best action or policy. ν in Theorem 3 is the
self-concordance parameter; it holds that ν ≤ O(d) (see Appendix B for details). ∆′ in Jin et al.
(2021) is the gap in the Q⋆ function, which is different from the policy gap ∆ we use; it holds that
∆ ≤ ∆′. α, α̃, δ are complexity measures of feedback graphs defined in Section 5.
intermediate regimes of stochastic losses with adversarial corruptions (Lykouris et al., 2018; Gupta
et al., 2019; Zimmert and Seldin, 2019; Ito, 2021a), which highlights the strong robustness of this
algorithm design. However, every problem variation has required careful design of potential functions
and learning rate schedules. For linear bandits, it has still been unknown whether self-bounding is
possible, and therefore the state-of-the-art best-of-both-worlds algorithm (Lee et al., 2021) neither
obtains the optimal log T stochastic rate nor canonically adapts to corruptions.
In this work, we make the following contributions: 1) We propose a general reduction from best
of both worlds to typical FTRL/OMD algorithms, sidestepping the need for customized potentials
and learning rates. 2) We derive the first best-of-both-worlds algorithm for linear bandits that obtains
log T regret in the stochastic regime, optimal adversarial worst-case regret and adapts canonically
to corruptions. 3) We derive the first best-of-both-worlds algorithm for linear bandits and tabular
MDPs with first-order guarantees in the adversarial regime. 4) We obtain the first best-of-both-
worlds algorithms for bandits with graph feedback and bandits with expert advice with optimal
log T stochastic regret.
2. Related Work
Our reduction procedure is related to a class of model selection algorithms that uses a meta bandit
algorithm to learn over a set of base bandit algorithms, where the goal is to achieve performance
comparable to running the best base algorithm alone. For the adversarial regime, Agarwal et al.
(2017) introduced the Corral algorithm to learn over adversarial bandit algorithms that satisfy the
stability condition. This framework is further improved by Foster et al. (2020); Luo et al. (2022).
For the stochastic regime, Arora et al. (2021); Abbasi-Yadkori et al. (2020); Cutkosky et al. (2021)
introduced another set of techniques to achieve similar guarantees, but without explicitly relying
on the stability condition. While most of these results focus on obtaining a √T regret, Arora et al.
(2021); Cutkosky et al. (2021); Wei et al. (2022); Pacchiano et al. (2022) made some progress in
obtaining polylog(T ) regret in the stochastic regime. Among them, Wei et al. (2022); Pacchiano
et al. (2022) are most related to us since they also pursue a best-of-both-worlds guarantee across
adversarial and stochastic regimes. However, their regret bounds, when applied to our problems, all
suffer from highly sub-optimal dependence on log(T ) and the amount of corruptions.
3. Preliminaries
We consider sequential decision making problems where the learner interacts with the environment
in T rounds. In each round t, the environment generates a loss vector (ℓt,u )u∈X and the learner
generates a distribution over actions pt and chooses an action At ∼ pt . The learner then suffers
loss ℓt,At and receives some information about (ℓt,u )u∈X depending on the concrete setting. In all
settings but MDPs, we assume ℓt,u ∈ [−1, 1].
In the adversarial regime, (ℓt,u )u∈X is generated arbitrarily subject to the structure of the con-
crete setting (e.g., in linear bandits, E[ℓt,u ] = ⟨u, ℓt ⟩ for arbitrary ℓt ∈ Rd ). In the stochastic regime,
we further assume that there exists an action x⋆ ∈ X and a gap ∆ > 0 such that E[ℓt,x − ℓt,x⋆ ] ≥ ∆
for all x ≠ x⋆. We also consider the corrupted stochastic regime, where the assumption is relaxed
to E[ℓt,x − ℓt,x⋆] ≥ ∆ − Ct for some Ct ≥ 0. We define C = Σ_{t=1}^T Ct. The goal of the learner is
to minimize the pseudo-regret, defined as
\[
\mathrm{Reg} = \max_{u\in\mathcal{X}} \mathbb{E}\Bigg[\sum_{t=1}^{T} (\ell_{t,A_t} - \ell_{t,u})\Bigg].
\]
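To make the interaction protocol concrete, the following is a minimal simulation sketch of a K-armed instance of this setting (purely illustrative; the uniform learner below is a placeholder and not an algorithm from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, Delta, corruption_budget = 5, 10_000, 0.2, 50.0

base_mean = np.full(K, 0.3)
base_mean[1:] += Delta                  # action 0 plays the role of x* with gap Delta

pseudo_regret, remaining_C = 0.0, corruption_budget
for t in range(T):
    loss = np.clip(base_mean + rng.uniform(-0.2, 0.2, size=K), -1.0, 1.0)
    if remaining_C > 0:                 # corrupted stochastic regime: spend C_t to shrink the gap
        c_t = min(0.1, remaining_C)
        loss[0] = min(loss[0] + c_t, 1.0)
        remaining_C -= c_t

    p_t = np.full(K, 1.0 / K)           # placeholder learner: plays uniformly at random
    a_t = rng.choice(K, p=p_t)          # only loss[a_t] would be revealed under bandit feedback

    # pseudo-regret against comparator u = 0 (the optimal action of the uncorrupted instance)
    pseudo_regret += float(p_t @ loss - loss[0])

print(f"pseudo-regret of the uniform learner over T={T} rounds: {pseudo_regret:.1f}")
```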
4. Main Techniques
4.1. The standard global self-bounding condition and a new linear bandit algorithm
To obtain a best-of-both-worlds regret bound, previous works show the following global self-bounding (GSB) property for their algorithm:
\[
\forall u\in\mathcal{X}:\quad \mathbb{E}\Bigg[\sum_{t=1}^{T}(\ell_{t,A_t}-\ell_{t,u})\Bigg] \le \min\Bigg\{c_0^{1-\alpha}\,T^{\alpha},\;(c_1\log T)^{1-\alpha}\,\mathbb{E}\Bigg[\sum_{t=1}^{T}(1-p_{t,u})\Bigg]^{\alpha}\Bigg\} + c_2\log T,
\]
where c0, c1, c2 are problem-dependent constants and pt,u is the probability of the learner choosing
u in round t.
With this condition, one can use the standard self-bounding technique to obtain best-of-both-worlds
regret bounds (Proposition 2).
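To illustrate how such a self-bounding property yields the stochastic guarantee, here is a sketch of the standard calculation for the case α = 1/2, under the assumption that the bound takes the form Reg ≤ √(c1 log T · E[N]) + c2 log T with N = Σ_{t=1}^T (1 − pt,x⋆), and using that in the corrupted stochastic regime Reg ≥ ∆E[N] − C (this is an illustration only, not the statement of Proposition 2):
\[
\mathrm{Reg} = (1+\lambda)\,\mathrm{Reg} - \lambda\,\mathrm{Reg}
\le (1+\lambda)\sqrt{c_1 \log T\,\mathbb{E}[N]} + c_2\log T - \lambda\big(\Delta\,\mathbb{E}[N] - C\big)
\quad\text{for any } \lambda\in(0,1].
\]
By AM-GM, (1+λ)√(c1 log T · E[N]) ≤ (1+λ)² c1 log T/(4λ∆) + λ∆E[N], so the E[N] terms cancel and
\[
\mathrm{Reg} \le \frac{(1+\lambda)^2 c_1\log T}{4\lambda\Delta} + \lambda C + c_2\log T
= O\Big(\frac{c_1\log T}{\Delta} + \sqrt{\frac{c_1 C\log T}{\Delta}}\Big) + c_2\log T,
\]
where the last step chooses λ = min{1, √(c1 log T/(C∆))}.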
For completeness, we provide a proof of Proposition 2 in Appendix D. Previous works have found
algorithms with GSB in various settings, such as multi-armed bandits, semi-bandits, graph-bandits,
and MDPs. Here, we provide one such algorithm (Algorithm 3) for linear bandits based on the
framework of SCRiBLe (Abernethy et al., 2008). The guarantee of Algorithm 3 is stated in the
next theorem under the more general “learning with predictions” setting introduced in Rakhlin and
Sridharan (2013), where in every round t, the learner receives a loss predictor mt before making
decisions.
where d is the feature dimension, ν ≤ O(d) is the self-concordance parameter of the regularizer,
and mt,x = ⟨x, mt⟩ is the loss predictor. This also implies a first-order regret bound of
O(d√(ν log(T) Σ_{t=1}^T ℓt,u)) if ℓt,x ≥ 0 and mt = 0; it simultaneously achieves
Reg = O(d²ν log T/∆ + √(d²ν log(T) C/∆)) in the corrupted stochastic regime.
See Appendix B for the algorithm and the proof. We call this algorithm Variance-Reduced SCRiBLe
(VR-SCRiBLe) since it is based on the original SCRiBLe updates, but with some refinement in the
construction of the loss estimator to reduce its variance. A good property of a SCRiBLe-based
algorithm is that it simultaneously achieves data-dependent bounds (i.e., first- and second-order
bounds), similar to the case of multi-armed bandit using FTRL/OMD with log-barrier regularizer
(Wei and Luo, 2018; Ito, 2021b). Like Rakhlin and Sridharan (2013); Bubeck et al. (2019); Ito
(2021b); Ito et al. (2022a), we can also use another procedure to learn mt and obtain path-length or
variance bounds. The details are omitted.
We notice, however, that the bound in Theorem 3 is sub-optimal in d since the best-known
regret for linear bandits is either d√(T log T) or √(dT log |X|), depending on whether the number of
actions is larger than T^d. These bounds also hint at the possibility of getting better dependence on d
in the stochastic regime if we can also establish GSB for these algorithms. Therefore, we ask: can
we achieve best-of-both-world bounds for linear bandits with the optimal d dependence?
A natural attempt is to show GSB for existing algorithms with optimal d dependence, including
the EXP2 algorithm (Bubeck et al., 2012) and the logdet-FTRL algorithm (Zimmert and Lattimore,
2022)1. Based on our attempt, it is unclear how to adapt the analysis of logdet-FTRL to show
GSB. For EXP2, using the learning-rate tuning technique by Ito et al. (2022b), one can make it
satisfy a similar guarantee as GSB, albeit with an additional log(T) factor, resulting in a bound of
O(d log(|X|T) log(T)/∆) in the stochastic regime. This gives a sub-optimal rate of O(log² T). Therefore, we
further ask: can we achieve O(log T ) regret in the stochastic regime with an optimal d dependence?
Motivated by these questions, we resort to an approach that does not rely on GSB, which appears
for the first time in the best-of-both-worlds literature. Our approach is in fact a general reduction:
it not only successfully achieves the desired bounds for linear bandits, but also provides a principled
way to convert an algorithm with a worst-case guarantee to one with a best-of-both-world guarantee
for general settings. In the next two subsections, we introduce our reduction techniques.
1. Another potential algorithm is the truncated continuous exponential weight algorithm by Ito et al. (2020), which
obtains optimal regret dependence on d, is computationally efficient, and can achieve first/second-order bounds.
However, the regret bound shown by Ito et al. (2020) has sub-optimal dependence on log T, making it impossible to
obtain an O(log T) regret bound in the stochastic regime using the analysis framework of Ito et al. (2020).
\[
\forall u \in \mathcal{X}:\quad
\mathbb{E}\Bigg[\sum_{t=1}^{t'} (\ell_{t,A_t} - \ell_{t,u})\Bigg]
\le \min\Bigg\{ c_0^{1-\alpha}\, \mathbb{E}[t']^{\alpha},\;
(c_1 \log T)^{1-\alpha}\, \mathbb{E}\Bigg[\sum_{t=1}^{t'} \big(1 - \mathbb{I}\{u = \hat{x}\}\, p_{t,u}\big)\Bigg]^{\alpha} \Bigg\}
+ c_2 \log T
\]
Lemma 5 For d-dimensional linear bandits, by transforming the action set into x0 x ∈ X \
With the LSB condition defined, we develop a reduction procedure (Algorithm 1) which turns
any algorithm with LSB into a best-of-both-world algorithm that has a similar guarantee as in Propo-
sition 2. The guarantee is formally stated in the next theorem.
See Appendix E.1 for a proof of Theorem 6. In particular, Theorem 6 together with Lemma 5
directly lead to a better best-of-both-world bound for linear bandits.
Corollary 7 Combining Algorithm 1 and Algorithm 4 results in a linear bandit algorithm with
O(d√(T log T)) regret in the adversarial regime and O(d² log T/∆ + d√(C log T/∆)) regret in the corrupted
stochastic regime simultaneously.
Ideas of the first reduction Algorithm 1 proceeds in epochs. In epoch k, there is an action
x̂k ∈ X selected as the candidate (x̂1 is randomly drawn from X). The procedure simply
executes the base algorithm A with input x̂ = x̂k, and monitors the number of draws of each action.
If in epoch k there exists some x ≠ x̂k that has been drawn more than half of the time, and the length of
epoch k already exceeds two times the length of epoch k − 1, then a new epoch is started with x̂k+1
set to x.
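As an illustration of this epoch logic, here is a minimal sketch in code (the interface names `base.reset`, `base.act`, `base.update`, and `env` are assumptions made for illustration; all constants and the interplay with the LSB analysis are omitted, so this is not the exact Algorithm 1):

```python
import collections
import random

def first_reduction(actions, base, env, T, rng=random.Random(0)):
    """Sketch of the epoch/restart scheme described above (not the exact Algorithm 1).
    `base` is assumed to expose reset(x_hat), act() and update(action, loss);
    `env(action)` returns the observed loss of the played action."""
    x_hat = rng.choice(actions)          # candidate of the first epoch
    prev_epoch_len, t = 0, 0
    while t < T:
        base.reset(x_hat)                # restart the base algorithm with the new candidate
        counts, epoch_len = collections.Counter(), 0
        while t < T:
            a = base.act()               # base algorithm, run with input x_hat, picks an action
            loss = env(a)
            base.update(a, loss)
            counts[a] += 1
            epoch_len += 1
            t += 1
            leader, n_leader = counts.most_common(1)[0]
            # restart: some x != x_hat was drawn more than half the time and the
            # epoch is already more than twice as long as the previous one
            if leader != x_hat and n_leader > epoch_len / 2 and epoch_len > 2 * prev_epoch_len:
                x_hat, prev_epoch_len = leader, epoch_len
                break
```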
Here, we give a proof sketch for Theorem 6 with α = 1/2. It is straightforward in the adversarial
regime: Let τk = Tk+1 − Tk be the length of epoch k. In each epoch k, the regret is upper bounded
by O(√(c0 τk)) due to the LSB property of the base algorithm. Thus, the overall regret is upper
bounded by O(Σk √(c0 τk)) = O(√(c0 Σk τk)) = O(√(c0 T)) because τk ≥ 2τk−1 (except for the last
epoch).
Next, we analyze the regret in the stochastic regime assuming c2 = 0 and C = 0 (see the formal
proof for the general case). Let m = max{k : x̂k ≠ x⋆} be the index of the last epoch in which x̂k
is not x⋆. Below, we show that actions other than x⋆ are drawn at least Ω(Tm+1) times up to the end of
epoch m. If τm > 2τm−1, then since the termination condition of epoch m was not triggered one step
earlier, x⋆ ≠ x̂m was drawn at most half of the time in epoch m; hence other actions were drawn at
least τm/2 − 1 = Ω(τm) = Ω(Tm+1) times. If τm ≤ 2τm−1, then by the termination condition of
epoch m − 1, x̂m ≠ x⋆ must have been drawn at least τm−1/2 = Ω(Tm+1) times in epoch m − 1.
Notice that the regret up to the end of epoch m is upper bounded by O(√(c1 Tm+1 log T)), but
also lower bounded by Ω(Tm+1 ∆) by the argument above. This implies that Tm+1 = O(c1 log T/∆²)
and thus the regret up to the end of epoch m is upper bounded by O(c1 log T/∆). Finally, if epoch m is
not the last epoch, then in the last epoch m + 1, it must be that x̂m+1 = x⋆. By the LSB property,
the regret in the last epoch is upper bounded by O(√(c1 Σ_{t=Tm+1+1}^T (1 − pt,x⋆) log T)) and lower
bounded by Σ_{t=Tm+1+1}^T (1 − pt,x⋆)∆. This implies that Σ_{t=Tm+1+1}^T (1 − pt,x⋆) = O(c1 log T/∆²) and
thus the regret in the last epoch is also upper bounded by O(c1 log T/∆).
Definition 8 requires that even if the algorithm only receives the desired feedback with probability
qt in round t, it still achieves a meaningful regret bound that degrades smoothly with Σt 1/qt. In
previous works on Corral and its variants (Agarwal et al., 2017; Foster et al., 2020), a similar notion
of 1/2-stability is defined as having a regret of √(c1 max_{τ≤t'} (1/qτ) · t'), which is a weaker assumption
than our 1/2-iw-stability. Our stronger notion of stability is crucial to get a best-of-both-worlds bound,
but it is still very natural and holds for a wide range of algorithms.
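As a small illustration of the importance weighting underlying this notion (a sketch only, not tied to any specific algorithm in this paper): when feedback arrives only with probability qt, an unbiased loss estimate is obtained by dividing the observed loss by qt, and iw-stability asks that the base algorithm's regret degrades gracefully when it is fed such inflated estimates.

```python
import numpy as np

rng = np.random.default_rng(1)

def iw_loss_estimate(loss_value, observed, q_t):
    """Importance-weighted estimate of a loss observed only with probability q_t.
    It is unbiased, but its magnitude (and hence variance) scales with 1/q_t."""
    return loss_value / q_t if observed else 0.0

# toy check of unbiasedness for a fixed loss of 0.7 observed with probability 0.25
q_t, true_loss = 0.25, 0.7
estimates = [iw_loss_estimate(true_loss, rng.random() < q_t, q_t) for _ in range(200_000)]
print(np.mean(estimates))   # close to 0.7
```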
Below, we provide two examples of 1/2-iw-stable algorithms. Their analysis is mostly standard
FTRL analysis (considering extra importance weighting) and can be found in Appendix G.
Lemma 9 For linear bandits, EXP2 with adaptive learning rate (Algorithm 9) is 1/2-iw-stable with
constants c1 = c2 = O(d log |X|).
Lemma 10 For contextual bandits, EXP4 with adaptive learning rate (Algorithm 10) is 1/2-iw-stable
with constants c1 = O(K log |X|), c2 = 0.
Our reduction procedure is Algorithm 2, which turns a 1/2-iw-stable algorithm into one with 1/2-LSB.
The guarantee is formalized in the next theorem, whose proof is in Appendix F.1.
Theorem 11 If B is a 1/2-iw-stable algorithm with constants (c1, c2), then Algorithm 2 with B as the
base algorithm satisfies 1/2-LSB with constants (c0', c1', c2') where c0' = c1' = O(c1) and c2' = O(c2).
Cascading Theorem 11 and Theorem 6, and combining it with Lemma 9 and Lemma 10, re-
spectively, we get the following corollaries:
Sample it ∼ qt.
if it = 1 then draw At = x̂ and observe ℓt,At;
else draw At according to base algorithm B and observe ℓt,At;
Define zt,i = (ℓt,At + 1) I{it = i}/qt,i − 1 and
\[
B_t = \sqrt{c_1 \sum_{\tau=1}^{t} \frac{1}{q_{\tau,2}}} + \frac{c_2}{\min_{\tau \le t} q_{\tau,2}}.
\]
end
Ideas of the second reduction Algorithm 2 is related to the Corral algorithm (Agarwal et al.,
2017; Foster et al., 2020; Luo et al., 2022). We use a special version which is essentially an FTRL
algorithm (with a hybrid 1/2-Tsallis entropy and log-barrier regularizer) over two base algorithms:
the candidate x̂, and an algorithm B operating on the reduced action set X \ {x̂}. For base algo-
rithm B, the Corral algorithm adds a bonus term that upper bounds the regret of B under importance
weighting (i.e., the quantity Bt defined in Algorithm 6). In the regret analysis, the bonus creates a
negative term as well as a bonus overhead in the learner's regret. The negative term can be used to
cancel the positive regret of B, and the analysis reduces to bounding the bonus overhead. Showing
that the bonus overhead is upper bounded by the order of √(c1 log(T) Σ_{t=1}^T (1 − pt,x̂)) is the key to
establishing the LSB property.
Combining Algorithm 2 and Algorithm 1, we have the following general reduction: as long
as we have a 1/2-iw-stable algorithm over X \ {x̂} for any x̂ ∈ X, we have an algorithm with the
best-of-both-worlds guarantee when the optimal action is unique. Notice that 1/2-iw-stable algorithms
are quite common: usually an FTRL/OMD algorithm with an adaptive learning rate suffices.
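For concreteness, the following is a heavily simplified sketch of such a two-arm meta-learner: FTRL with a 1/2-Tsallis-entropy regularizer over {x̂, B}, importance-weighted loss estimates, and the bonus B_t credited to the base arm. It is an illustration of the idea only (it drops the log-barrier component, the c2 part of the bonus, and other details of Algorithm 2), and the interface names `base.act`, `base.update`, and `env` are assumptions:

```python
import numpy as np

def tsallis_ftrl_2arms(cum_loss, eta):
    """FTRL over the 2-simplex with regularizer -(2/eta) * sum(sqrt(p)).
    The minimizer satisfies p_i = 1 / (eta * (cum_loss_i - lam))^2 for a normalizer lam < min(cum_loss)."""
    lo, hi = min(cum_loss) - 4.0 / eta, min(cum_loss) - 1e-12
    for _ in range(100):                       # binary search for the normalizer lam
        lam = (lo + hi) / 2
        p = 1.0 / (eta * (cum_loss - lam)) ** 2
        if p.sum() > 1.0:
            hi = lam                           # probabilities too large: decrease lam
        else:
            lo = lam
    return p / p.sum()

def corral_sketch(x_hat, base, env, T, c1=1.0, rng=np.random.default_rng(0)):
    cum = np.zeros(2)                          # estimated cumulative losses of the two arms
    inv_q_sum, bonus_prev = 0.0, 0.0
    for t in range(1, T + 1):
        q = tsallis_ftrl_2arms(cum, eta=1.0 / np.sqrt(t))
        i = int(rng.choice(2, p=q))            # arm 0 = candidate x_hat, arm 1 = base algorithm B
        action = x_hat if i == 0 else base.act()
        loss = env(action)                     # observed loss of the played action, in [-1, 1]
        # importance-weighted loss estimates, analogous to z_{t,i} in Algorithm 2
        z = np.array([(loss + 1.0) * float(i == j) / q[j] - 1.0 for j in range(2)])
        # bonus increment of B_t = sqrt(c1 * sum_tau 1/q_{tau,2}), credited to the base arm
        inv_q_sum += 1.0 / q[1]
        bonus_t = np.sqrt(c1 * inv_q_sum)
        z[1] -= bonus_t - bonus_prev
        bonus_prev = bonus_t
        cum += z
        if i == 1:
            base.update(action, loss)          # the base algorithm learns only when it is played
```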
The overall reduction is reminiscent of the G-COBE procedure by Wei et al. (2022), which
also runs model selection over xb and a base algorithm for X \{b x} (similar to Algorithm 2), and
dynamically restarts the procedure with increasing epoch lengths (similar to Algorithm 1). However,
besides being more complicated, G-COBE only guarantees a bound of (polylog(T) + C)/∆ in the corrupted
stochastic regime (omitting dependencies on c1, c2), which is sub-optimal in both C and T.2
factors. Zimmert and Lattimore (2019) have shown that the log(T ) dependency can be avoided and
that hence α is the right notion of difficulty for strongly observable graphs even in the limit T → ∞
(though they pay for this with a larger log K dependency).
State-of-the-art best-of-both-worlds guarantees for bandits with graph feedback are derivations
of EXP3 and obtain either O(α̃ log²(T)/∆) or O(α log(T) log(TK)/∆) regret for strongly observable graphs
and O(δ log²(T)/∆²) for weakly observable graphs (Rouyer et al., 2022; Ito et al., 2022b)3. Our black-
box reduction leads directly to algorithms with optimal log(T) regret in the stochastic regime.
Lemma 14 For bandits with strongly observable graph feedback, (1 − 1/log K)-Tsallis-INF
with adaptive learning rate (Algorithm 11) is 1/2-iw-stable with constants
c1 = O(min{α̃, α log K} log K), c2 = 0.
Before we apply the reduction, note that Algorithm 2 requires that the player observes the loss
ℓt,At when playing arm At , which is not necessarily the case for all strongly observable graphs.
However, this can be overcome via defining an observable surrogate loss that is used in Algorithm 2
instead. We explain how this works in detail in Appendix H and assume from now on that this technical
issue does not arise.
Cascading the two reduction steps, we immediately obtain the following.
Corollary 15 Combining Algorithm 1, Algorithm 2 and Algorithm 11 results in a graph
bandit algorithm with O(√(min{α̃, α log K} T log K)) regret in the adversarial regime and
O(min{α̃, α log K} log T log K/∆ + √(min{α̃, α log K} log T log K · C/∆)) regret in the corrupted stochastic regime
simultaneously.
An example of such an algorithm is given by the following lemma (see Appendix G.4 for the algo-
rithm and the proof).
Lemma 17 For bandits with weakly observable graph feedback, EXP3 with adaptive learning rate
(Algorithm 12) is 2/3-iw-stable with constants c1 = c2 = O(δ log K).
Similar to the 1/2-case, this allows for a general reduction to LSB algorithms.
Theorem 18 If B is a 2/3-iw-stable algorithm with constants (c1, c2), then Algorithm 5 with B
as the base algorithm satisfies 2/3-LSB with constants (c0', c1', c2') where c0' = c1' = O(c1) and
c2' = O(c1^{1/3} + c2).
See Appendix F.2 for the proof. Algorithm 5 works almost the same as Algorithm 2, but we need to
adapt the bonus terms to the definition of 2/3-iw-stability. The major additional difference is that we do
not necessarily observe the loss of the action played and hence need to play exploratory actions with
probability γt in every round to estimate the performance difference between x̂ and B. A second
point to notice is that the base algorithm B uses the action set X \ {x̂}, but is still allowed to use
x̂ in its internal exploration policy. This is necessary because the sub-graph with one arm removed
might have a larger dominating number, or might even not be learnable at all. By allowing x̂ in the
internal exploration, we guarantee that the regret of the base algorithm is not larger than over the full
action set. Finally, cascading the lemma and the reduction, we obtain
Corollary 19 Combining Algorithm 1, Algorithm 5 and Algorithm 12 results in a graph ban-
dit algorithm with O((δ log K)^{1/3} T^{2/3}) regret in the adversarial regime and O(δ log K log T/∆² +
(C² δ log K log T/∆²)^{1/3}) regret in the corrupted stochastic regime simultaneously.
6. More Adaptations
To demonstrate the versatility of our reduction framework, we provide two more adaptations. The
first demonstrates that our reduction can be easily adapted to obtain a data-dependent bound (i.e.,
first- or second-order bound) in the adversarial regime, provided that the base algorithm achieves
a corresponding data-dependent bound. The second tries to eliminate the undesired requirement
by the Corral algorithm (Algorithm 2) that the base algorithm needs to operate in X \ {x̂} instead
of the more natural X . We show that under a stronger stability assumption, we can indeed just
use a base algorithm that operates in X . This would be helpful for settings where excluding one
single action/policy in the algorithm is not straightforward (e.g., MDP). Finally, we combine the
two techniques to obtain a first-order best-of-both-world bound for tabular MDPs.
where c1 , c2 are some problem dependent constants, ξt,x = (ℓt,x −mt,x )2 in the second-order bound
case, and ξt,x = ℓt,x in the first-order bound case.
Definition 21 (1/2-dd-iw-stable) An algorithm is 1/2-dd-iw-stable (data-dependent-iw-stable) if,
given an adaptive sequence of weights q1, q2, · · · ∈ (0, 1] and the assertion that the feedback in
round t is observed with probability qt, it obtains the following pseudo-regret guarantee for any
stopping time t′:
\[
\mathbb{E}\Bigg[\sum_{t=1}^{t'} (\ell_{t,A_t} - \ell_{t,u})\Bigg]
\le c_1\, \mathbb{E}\Bigg[\sqrt{\sum_{t=1}^{t'} \frac{\mathrm{upd}_t \cdot \xi_{t,A_t}}{q_t^2}}\Bigg]
+ \mathbb{E}\Bigg[\frac{c_2}{\min_{t\le t'} q_t}\Bigg],
\]
where updt = 1 if feedback is observed in round t and updt = 0 otherwise, and ξt,x is defined in the
same way as in Definition 20.
We can turn a dd-iw-stable algorithm into one with dd-LSB (see Appendix F.3 for the proof):
Theorem 22 If B is 1/2-dd-iw-stable with constants (c1, c2), then Algorithm 6 with B as the base
algorithm satisfies 1/2-dd-LSB with constants (c1', c2') where c1' = O(c1) and c2' = O(√c1 + c2).
Then we turn an algorithm with dd-LSB into one with data-dependent best-of-both-world guarantee
(see Appendix E.2 for the proof):
Theorem 23 If an algorithm A satisfies 1/2-dd-LSB, then the regret of Algorithm 1 with A as the
base algorithm is upper bounded by O(√(c1 E[Σ_{t=1}^T ξt,At] log² T) + c2 log² T) in the adversarial
regime and O(c1 log T/∆ + √(c1 log T C/∆) + c2 log(T) log(C∆⁻¹)) in the corrupted stochastic regime.
Compared with iw-stability defined in Definition 8, strong-iw-stability requires the bound to have
additional flexibility: if choosing action x results in a probability qt(x) of observing the feedback,
the regret bound needs to adapt to Σt 1/qt(At). For the class of FTRL/OMD algorithms, strong-
iw-stability holds if the stability term is bounded by a constant no matter what action is chosen.
Examples include log-barrier-OMD/FTRL for multi-armed bandits, SCRiBLe (Abernethy et al.,
2008) or a truncated version of continuous exponential weights (Ito et al., 2020) for linear ban-
dits. In fact, Upper-Confidence-Bound (UCB) algorithms also satisfy strong-iw-stability, though they
are mainly designed for the pure stochastic setting; however, we will need this property when we
design algorithms for adversarial MDPs, where the transition estimation part is done through UCB
approaches. This will be demonstrated in Section 6.3. With strong-iw-stability, the second reduction
only requires a base algorithm over X :
Theorem 25 If B is a 1/2-strongly-iw-stable algorithm with constants (c1, c2), then Algorithm 7
with B as the base algorithm satisfies 1/2-LSB with constants (c0', c1', c2') where c0' = c1' = O(c1) and
c2' = O(√c1 + c2).
The proof is provided in Appendix F.4.
Theorem 26 Combining Algorithm 1, Algorithm 8, and Algorithm 13 results in an MDP algo-
rithm with O(√(S²AHL⋆ log²(T)ι²) + S⁵A² log²(T)ι²) regret in the adversarial regime, and
O(H²S²Aι² log T/∆ + √(H²S²Aι² log T C/∆) + S⁵A²ι² log(T) log(C∆⁻¹)) regret in the corrupted stochastic
regime, where S is the number of states, A is the number of actions, H is the horizon, L⋆ is the
cumulative loss of the best policy, and ι = log(SAT).
7. Conclusion
We provided a general reduction from the best-of-both-worlds problem to a wide range of
FTRL/OMD algorithms, which improves the state of the art in several problem settings. We showed
the versatility of our approach by extending it to preserving data-dependent bounds.
Another potential application of our framework is partial monitoring, where one might improve
the log²(T) rates to log(T) for both the T^{1/2} and the T^{2/3} regimes using our respective reductions.
A weakness of our approach is the uniqueness requirement of the best action. While this as-
sumption is typical in the best-of-both-worlds literature, it is not merely an artifact of the analysis
for us due to the doubling procedure in the first reduction. Additionally, our reduction can only
obtain worst-case ∆ dependent bounds in the stochastic regime, which can be significantly weaker
than more refined notions of complexity.
References
Yasin Abbasi-Yadkori, Aldo Pacchiano, and My Phan. Regret balancing for bandit and rl model
selection. arXiv preprint arXiv:2006.05491, 2020.
Jacob Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algo-
rithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008,
2008.
Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of
bandit algorithms. In Conference on Learning Theory, pages 12–38. PMLR, 2017.
Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts:
A tale of domination and independence. Advances in Neural Information Processing Systems,
26, 2013.
Noga Alon, Nicolo Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback
graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35. PMLR, 2015.
Idan Amir, Guy Azov, Tomer Koren, and Roi Livni. Better best of both worlds bounds for bandits
with switching costs. In Advances in Neural Information Processing Systems, 2022.
Raman Arora, Teodor Vanislavov Marinov, and Mehryar Mohri. Corralling stochastic bandit algo-
rithms. In International Conference on Artificial Intelligence and Statistics, pages 2116–2124.
PMLR, 2021.
Peter Auer and Chao-Kai Chiang. An algorithm with nearly optimal pseudo-regret for both stochas-
tic and adversarial bandits. In Conference on Learning Theory, pages 116–120. PMLR, 2016.
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit
problem. Machine learning, 47:235–256, 2002a.
Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multi-
armed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b.
Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial
bandits. In Conference on Learning Theory, pages 42–1. JMLR Workshop and Conference Pro-
ceedings, 2012.
Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online
linear optimization with bandit feedback. In Conference on Learning Theory, pages 41–1. JMLR
Workshop and Conference Proceedings, 2012.
Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, and Chen-Yu Wei. Improved path-length regret
bounds for bandits. In Conference On Learning Theory, pages 508–528. PMLR, 2019.
Cheng Chen, Canzhe Zhao, and Shuai Li. Simultaneously learning stochastic and adversarial ban-
dits under the position-based model. In Proceedings of the AAAI Conference on Artificial Intelli-
gence, volume 36, pages 6202–6210, 2022.
Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, and Manish
Purohit. Dynamic balancing for model selection in bandits and rl. In International Conference
on Machine Learning, pages 2276–2285. PMLR, 2021.
Liad Erez and Tomer Koren. Towards best-of-all-worlds online learning with feedback graphs.
Advances in Neural Information Processing Systems, 34:28511–28521, 2021.
Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification
in contextual bandits. Advances in Neural Information Processing Systems, 33:11478–11489,
2020.
Pierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In
Conference on Learning Theory, pages 176–196. PMLR, 2014.
Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with
adversarial corruptions. In Conference on Learning Theory, pages 1562–1578. PMLR, 2019.
Junya Honda, Shinji Ito, and Taira Tsuchiya. Follow-the-perturbed-leader achieves best-of-both-
worlds for bandit problems. In International Conference on Algorithmic Learning Theory.
PMLR, 2023.
Jiatai Huang, Yan Dai, and Longbo Huang. Adaptive best-of-both-worlds algorithm for heavy-
tailed multi-armed bandits. In International Conference on Machine Learning, pages 9173–9200.
PMLR, 2022.
Shinji Ito. On optimal robustness to adversarial corruption in online decision problems. Advances
in Neural Information Processing Systems, 34:7409–7420, 2021a.
Shinji Ito. Parameter-free multi-armed bandit algorithms with hybrid data-dependent regret bounds.
In Conference on Learning Theory, pages 2552–2583. PMLR, 2021b.
Shinji Ito, Shuichi Hirahara, Tasuku Soma, and Yuichi Yoshida. Tight first-and second-order regret
bounds for adversarial linear bandits. Advances in Neural Information Processing Systems, 33:
2028–2038, 2020.
Shinji Ito, Taira Tsuchiya, and Junya Honda. Adversarially robust multi-armed bandit algorithm
with variance-dependent regret bounds. In Conference on Learning Theory, pages 1421–1422.
PMLR, 2022a.
Shinji Ito, Taira Tsuchiya, and Junya Honda. Nearly optimal best-of-both-worlds algorithms for
online learning with feedback graphs. In Advances in Neural Information Processing Systems,
2022b.
Tiancheng Jin and Haipeng Luo. Simultaneously learning stochastic and adversarial episodic mdps
with known transition. Advances in neural information processing systems, 33:16557–16566,
2020.
Tiancheng Jin, Longbo Huang, and Haipeng Luo. The best of both worlds: stochastic and adversar-
ial episodic mdps with unknown transition. Advances in Neural Information Processing Systems,
34:20491–20502, 2021.
Fang Kong, Yichi Zhou, and Shuai Li. Simultaneously learning stochastic and adversarial bandits
with general graph feedback. In International Conference on Machine Learning, pages 11473–
11482. PMLR, 2022.
Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances
in applied mathematics, 6(1):4–22, 1985.
Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. Bias no more: high-probability
data-dependent regret bounds for adversarial bandits and mdps. Advances in neural information
processing systems, 33:15522–15533, 2020.
Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, Mengxiao Zhang, and Xiaojin Zhang. Achieving near
instance-optimality and minimax-optimality in stochastic and adversarial linear bandits simulta-
neously. In International Conference on Machine Learning, pages 6142–6151. PMLR, 2021.
Haipeng Luo, Mengxiao Zhang, Peng Zhao, and Zhi-Hua Zhou. Corralling a larger band of bandits:
A case study on switching regret for linear bandits. In Conference on Learning Theory, pages
3635–3684. PMLR, 2022.
Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adver-
sarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of
Computing, pages 114–122, 2018.
Saeed Masoudian and Yevgeny Seldin. Improved analysis of the tsallis-inf algorithm in stochas-
tically constrained adversarial bandits and stochastic bandits with adversarial corruptions. In
Conference on Learning Theory, pages 3330–3350. PMLR, 2021.
Saeed Masoudian, Julian Zimmert, and Yevgeny Seldin. A best-of-both-worlds algorithm for ban-
dits with delayed feedback. In Advances in Neural Information Processing Systems, 2022.
Jaouad Mourtada and Stéphane Gaı̈ffas. On the optimality of the hedge algorithm in the stochastic
regime. Journal of Machine Learning Research, 20:1–28, 2019.
Aldo Pacchiano, Christoph Dann, and Claudio Gentile. Best of both worlds model selection. In
Advances in Neural Information Processing Systems, 2022.
Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable se-
quences. Advances in Neural Information Processing Systems, 26, 2013.
Chloé Rouyer, Yevgeny Seldin, and Nicolo Cesa-Bianchi. An algorithm for stochastic and adver-
sarial bandits with switching costs. In International Conference on Machine Learning, pages
9127–9135. PMLR, 2021.
Chloé Rouyer, Dirk van der Hoeven, Nicolò Cesa-Bianchi, and Yevgeny Seldin. A near-optimal
best-of-both-worlds algorithm for online learning with feedback graphs. In Advances in Neural
Information Processing Systems, 2022.
Aadirupa Saha and Pierre Gaillard. Versatile dueling bandits: Best-of-both world analyses for
learning from relative preferences. In International Conference on Machine Learning, pages
19011–19026. PMLR, 2022.
Yevgeny Seldin and Gábor Lugosi. An improved parametrization and analysis of the exp3++ al-
gorithm for stochastic and adversarial bandits. In Conference on Learning Theory, pages 1743–
1759. PMLR, 2017.
Yevgeny Seldin and Aleksandrs Slivkins. One practical algorithm for both stochastic and adversarial
bandits. In International Conference on Machine Learning, pages 1287–1295. PMLR, 2014.
Taira Tsuchiya, Shinji Ito, and Junya Honda. Best-of-both-worlds algorithms for partial monitoring.
arXiv preprint arXiv:2207.14550, 2022.
Chen-Yu Wei and Haipeng Luo. More adaptive algorithms for adversarial bandits. In Conference
On Learning Theory, pages 1263–1291. PMLR, 2018.
Chen-Yu Wei, Christoph Dann, and Julian Zimmert. A model selection approach for corruption
robust reinforcement learning. In International Conference on Algorithmic Learning Theory,
pages 1043–1096. PMLR, 2022.
Julian Zimmert and Tor Lattimore. Connections between mirror descent, thompson sampling and
the information ratio. Advances in Neural Information Processing Systems, 32, 2019.
Julian Zimmert and Tor Lattimore. Return of the bias: Almost minimax optimal high probability
bounds for adversarial linear bandits. In Conference on Learning Theory, pages 3285–3312.
PMLR, 2022.
Julian Zimmert and Yevgeny Seldin. An optimal algorithm for stochastic and adversarial bandits.
In The 22nd International Conference on Artificial Intelligence and Statistics, pages 467–475.
PMLR, 2019.
Julian Zimmert, Haipeng Luo, and Chen-Yu Wei. Beating stochastic and adversarial semi-bandits
optimally and simultaneously. In International Conference on Machine Learning, pages 7683–
7692. PMLR, 2019.
Appendices
A FTRL Analysis
D Proof of Proposition 2
\[
\sum_{t=1}^{t'} \langle p_t - u, \ell_t\rangle
\le \psi_{t'}(u) - \min_{p\in\Omega}\psi_0(p)
+ \sum_{t=1}^{t'} \big(\psi_t(u) - \psi_{t-1}(u) - \psi_t(p_t) + \psi_{t-1}(p_t)\big)
+ \sum_{t=1}^{t'} \underbrace{\max_{p\in\Omega}\big(\langle p_t - p, \ell_t - m_t\rangle - D_{\psi_t}(p, p_t)\big)}_{\text{stability}}.
\]
Proof Let Lt ≜ Σ_{τ=1}^t ℓτ. Define Ft(p) = ⟨p, Lt−1 + mt⟩ + ψt(p) and Gt(p) = ⟨p, Lt⟩ + ψt(p).
Therefore, pt is the minimizer of Ft over Ω. Let p′t+1 be the minimizer of Gt over Ω. Then by the
first-order optimality condition, we have
\[
G_t(p'_{t+1}) - F_{t+1}(p_{t+1}) \le G_t(p_{t+1}) - F_{t+1}(p_{t+1}) = \psi_t(p_{t+1}) - \psi_{t+1}(p_{t+1}) - \langle p_{t+1}, m_{t+1}\rangle. \tag{2}
\]
Thus,
\[
\begin{aligned}
\sum_{t=1}^{t'} \langle p_t, \ell_t\rangle
&\le \sum_{t=1}^{t'} \Big(\langle p_t, \ell_t\rangle - \langle p'_{t+1}, \ell_t - m_t\rangle - D_{\psi_t}(p'_{t+1}, p_t) + G_t(p'_{t+1}) - F_t(p_t)\Big)
&& \text{(by (1))}\\
&= \sum_{t=1}^{t'} \Big(\langle p_t, \ell_t\rangle - \langle p'_{t+1}, \ell_t - m_t\rangle - D_{\psi_t}(p'_{t+1}, p_t) + G_{t-1}(p'_t) - F_t(p_t)\Big) + G_{t'}(p'_{t'+1}) - G_0(p'_1)\\
&\le \sum_{t=1}^{t'} \Big(\langle p_t, \ell_t\rangle - \langle p'_{t+1}, \ell_t - m_t\rangle - D_{\psi_t}(p'_{t+1}, p_t) - \psi_t(p_t) + \psi_{t-1}(p_t) - \langle p_t, m_t\rangle\Big) + G_{t'}(p'_{t'+1}) - G_0(p'_1)
&& \text{(by (2))}\\
&\le \sum_{t=1}^{t'} \Big(\langle p_t - p'_{t+1}, \ell_t - m_t\rangle - D_{\psi_t}(p'_{t+1}, p_t) - \psi_t(p_t) + \psi_{t-1}(p_t)\Big) + G_{t'}(u) - \min_{p\in\Omega}\psi_0(p)\\
&\qquad \text{(using that } p'_{t'+1} \text{ is the minimizer of } G_{t'}\text{)}\\
&\le \sum_{t=1}^{t'} \Big(\max_{p\in\Omega}\{\langle p_t - p, \ell_t - m_t\rangle - D_{\psi_t}(p, p_t)\} - \psi_t(p_t) + \psi_{t-1}(p_t)\Big) + \sum_{t=1}^{t'}\langle u, \ell_t\rangle + \psi_{t'}(u) - \min_{p\in\Omega}\psi_0(p)\\
&= \sum_{t=1}^{t'} \max_{p\in\Omega}\{\langle p_t - p, \ell_t - m_t\rangle - D_{\psi_t}(p, p_t)\}
+ \sum_{t=1}^{t'}\big(\psi_t(u) - \psi_{t-1}(u) - \psi_t(p_t) + \psi_{t-1}(p_t)\big)
+ \sum_{t=1}^{t'}\langle u, \ell_t\rangle + \psi_0(u) - \min_{p\in\Omega}\psi_0(p).
\end{aligned}
\]
Re-arranging finishes the proof.
\[
\begin{aligned}
\sum_{t=1}^{t'} \langle p_t - u, \ell_t\rangle
&\le \frac{1}{1-\alpha}\cdot\frac{K^{1-\alpha}}{\eta_0}
+ \frac{1}{1-\alpha}\sum_{t=1}^{t'}\Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big)\Big(\sum_{i=1}^{K} p_{t,i}^{\alpha} - 1\Big)
+ \frac{K\ln\frac{1}{\delta}}{\beta}
+ \frac{1}{\alpha}\sum_{t=1}^{t'}\sum_{i=1}^{K} \eta_t\, p_{t,i}^{2-\alpha}\,(\ell^{\mathrm{Ts}}_{t,i} - x_t)^2\\
&\quad + 2\beta \sum_{t=1}^{t'}\sum_{i=1}^{K} p_{t,i}^{2}\,(\ell^{\mathrm{Lo}}_{t,i} - w_t)^2
+ \Big(1 + \frac{1}{\alpha}\Big)\sum_{t=1}^{t'}\langle p_t, b_t\rangle
- \sum_{t=1}^{t'}\langle u, b_t\rangle
+ \sum_{t=1}^{t'}\Big\langle \delta\Big(\tfrac{1}{K}\mathbf{1} - u\Big),\, \ell_t - b_t\Big\rangle,
\end{aligned}
\]
provided that
\[
\ell^{\mathrm{Ts}}_t + \ell^{\mathrm{Lo}}_t = \ell_t - m_t, \tag{3}
\]
\[
b^{\mathrm{Ts}}_t + b^{\mathrm{Lo}}_t = b_t, \tag{4}
\]
\[
\eta_t\, p_{t,i}^{1-\alpha}\,(\ell^{\mathrm{Ts}}_{t,i} - x_t) \ge -\tfrac{1}{4}, \tag{5}
\]
\[
\beta\, p_{t,i}\,(\ell^{\mathrm{Lo}}_{t,i} - w_t) \ge -\tfrac{1}{4}, \tag{6}
\]
\[
0 \le \eta_t\, p_{t,i}^{1-\alpha}\, b^{\mathrm{Ts}}_{t,i} \le \tfrac{1}{4}, \tag{7}
\]
\[
0 \le \beta\, p_{t,i}\, b^{\mathrm{Lo}}_{t,i} \le \tfrac{1}{4} \tag{8}
\]
for all t, i.
t′ K t′ X K K
X
′ 1 1 X α 1 X 1 1 α ′α
1X p1,i
⟨pt − u , ℓt − bt ⟩ ≤ p1,i + − pt,i − ui + ln ′
1 − α η0 1−α ηt ηt−1 β ui
t=1 i=1 t=1 i=1 i=1
t′
X 1 1
+ max ⟨pt − p, ℓt − bt − mt ⟩ − DψTs (p, pt ) − DψLo (p, pt )
p ηt β
t=1
t′ K
!
1 K 1−α K ln 1δ
X
1 X 1 1
≤ + − pαt,i − 1 +
1 − α η0 1−α ηt ηt−1 β
t=1 i=1
t′
X
Ts
1
+ max ⟨pt − p, ℓ − xt 1⟩ − D Ts (p, pt )
p 2ηt ψ
t=1 | {z }
stability-1
t′
X 1 Ts
+ max ⟨pt − p, −bt 1⟩ − D Ts (p, pt )
p 2ηt ψ
t=1 | {z }
stability-2
t′
X 1
Lo
+ max ⟨pt − p, ℓ − wt 1⟩ − D Lo (p, pt )
p 2β ψ
t=1 | {z }
stability-3
t′
X 1 Lo
+ max ⟨pt − p, −bt 1⟩ − D Lo (p, pt )
p 2β ψ
t=1 | {z }
stability-4
\[
p_t = \operatorname*{argmin}_{p\in\Omega}\Bigg\{\Bigg\langle p,\; \sum_{\tau=1}^{t-1}(\ell_\tau - b_\tau) + m_t\Bigg\rangle + \psi_t(p)\Bigg\}
\]
with ψt(p) = (1/ηt) Σi ln(1/pi), and ηt is non-increasing. We have
\[
\sum_{t=1}^{t'} \langle p_t - u, \ell_t\rangle
\le \frac{K\ln\frac{1}{\delta}}{\eta_{t'}}
+ 2\sum_{t=1}^{t'}\sum_{i=1}^{K}\eta_t\, p_{t,i}^2\,(\ell_{t,i} - m_{t,i} - x_t)^2
+ 2\sum_{t=1}^{t'}\langle p_t, b_t\rangle
- \sum_{t=1}^{t'}\langle u, b_t\rangle
+ \sum_{t=1}^{t'}\Big\langle \delta\Big(\tfrac{1}{K}\mathbf{1} - u\Big),\, \ell_t - b_t\Big\rangle
\]
for any δ ∈ (0, 1) and any xt ∈ R if the following hold: ηt pt,i (ℓt,i − mt,i − xt) ≥ −1/4 and
ηt pt,i bt,i ≤ 1/4 for all t, i.
Lemma 30 (Stability under log barrier) Let ψ(p) = (1/β) Σi ln(1/pi), and let ℓt ∈ R^K be such that
β pt,i ℓt,i ≥ −1/2. Then
\[
\max_{p\in\Delta([K])} \big\{\langle p_t - p, \ell_t\rangle - D_{\psi}(p, p_t)\big\} \le \sum_{i} \beta\, p_{t,i}^2\,\ell_{t,i}^2.
\]
Proof Define f(q) = ⟨pt − q, ℓt⟩ − Dψ(q, pt), and let q⋆ be its maximizer. Next, we
verify that under the specified conditions, we have ∇f(q⋆) = 0. It suffices to show that there exists
q ∈ R^K_+ such that ∇f(q) = 0, since if such a q exists, then it must be the maximizer of f and thus q⋆ = q.
We have
\[
[\nabla f(q)]_i = -\ell_{t,i} - [\nabla\psi(q)]_i + [\nabla\psi(p_t)]_i = -\ell_{t,i} + \frac{1}{\beta q_i} - \frac{1}{\beta p_{t,i}}.
\]
By the condition, we have −1/(β pt,i) − ℓt,i < 0 for all i, and so ∇f(q) = 0 has a solution in R^K_+,
given by qi = (1/pt,i + β ℓt,i)⁻¹.
Therefore, ∇f(q⋆) = −ℓt − ∇ψ(q⋆) + ∇ψ(pt) = 0, and we have
\[
f(q^\star) = \frac{1}{\beta}\sum_{i} h\Big(\frac{p_{t,i}}{q_i^\star}\Big),
\]
where h(x) = x − 1 − ln(x). By the relation between q⋆i and pt,i we just derived, it holds that
pt,i/q⋆i = 1 + β pt,i ℓt,i. By the fact that ln(1 + x) ≥ x − x² for all x ≥ −1/2, we have
\[
h\Big(\frac{p_{t,i}}{q_i^\star}\Big) = \beta p_{t,i}\ell_{t,i} - \ln(1 + \beta p_{t,i}\ell_{t,i}) \le \beta^2 p_{t,i}^2 \ell_{t,i}^2.
\]
Summing over i and multiplying by 1/β finishes the proof.
qi⋆
P
Lemma 31 (Stability under negentropy) If ψ is the negentropy ψ(p) = i pi log pi and ℓt,i > 0,
then for any η > 0 the stability is bounded by
1 ηX
max ⟨pt − p, ℓt ⟩ − Dψ (p, pt ) ≤ pt,i ℓ2t,i .
K
p∈R≥0 η 2
i
1 X
max ⟨pt − p, ℓt ⟩ − Dψ (p, pt ) ≤ η pt,i ℓ2t,i .
p∈RK
≥0
η
i
instead.
Proof Let fi(pi) = (pt,i − pi)ℓt,i − (1/η)(pi(log pi − 1) − pi log pt,i + pt,i). Then we maximize
Σi fi(pi), which is upper bounded by maximizing the expression in each coordinate over pi ≥ 0.
We have
\[
f_i'(p) = -\ell_{t,i} - \frac{1}{\eta}(\log p - \log p_{t,i}),
\]
and hence the maximum is obtained at p⋆ = pt,i exp(−ηℓt,i). Plugging this in leads to
\[
\begin{aligned}
f_i(p^\star) &= p_{t,i}\ell_{t,i}\big(1 - \exp(-\eta\ell_{t,i})\big) - \frac{1}{\eta}\Big(-p_{t,i}\exp(-\eta\ell_{t,i})\,\eta\ell_{t,i} + p_{t,i}\big(1 - \exp(-\eta\ell_{t,i})\big)\Big)\\
&= p_{t,i}\ell_{t,i} - \frac{p_{t,i}}{\eta}\big(1 - \exp(-\eta\ell_{t,i})\big)
\le p_{t,i}\ell_{t,i} - \frac{p_{t,i}}{\eta}\Big(\eta\ell_{t,i} - \frac{1}{2}\eta^2\ell_{t,i}^2\Big) = \frac{\eta}{2}\,p_{t,i}\ell_{t,i}^2
\end{aligned}
\]
for ℓt,i ≥ 0, and fi(p⋆) ≤ η pt,i ℓ²t,i for ℓt,i ≥ −1/η, respectively, where we used the bounds
exp(−x) ≤ 1 − x + x²/2, which holds for any x ≥ 0, and exp(−x) ≤ 1 − x + x², which holds for
x ≥ −1, respectively. Summing over all coordinates finishes the proof.
Lemma 32 (Stability of Tsallis-INF) For the potential ψ(p) = −Σi pi^α/(α(1−α)), any pt ∈ ∆([K]),
any non-negative loss ℓt ≥ 0 and any positive learning rate η > 0, we have
\[
\max_{p} \langle p_t - p, \ell_t\rangle - \frac{1}{\eta} D_{\psi}(p, p_t) \le \frac{\eta}{2}\sum_{i} p_{t,i}^{2-\alpha}\,\ell_{t,i}^2.
\]
Proof We upper bound this term by maximizing over p ≥ 0 instead of p ∈ ∆([K]). Since ℓt is
positive, the optimal p⋆ satisfies p⋆i ≤ pt,i in all components. We have
\[
D_{\psi}(p^\star, p_t)
= \sum_{i=1}^{K}\int_{p_{t,i}}^{p^\star_i} \frac{p_{t,i}^{\alpha-1} - p^{\alpha-1}}{1-\alpha}\, dp
= \sum_{i=1}^{K}\int_{p_{t,i}}^{p^\star_i} \int_{p_{t,i}}^{p} \tilde{p}^{\alpha-2}\, d\tilde{p}\, dp
\ge \sum_{i=1}^{K}\int_{p_{t,i}}^{p^\star_i} \int_{p_{t,i}}^{p} p_{t,i}^{\alpha-2}\, d\tilde{p}\, dp
= \sum_{i=1}^{K} \frac{1}{2}(p^\star_i - p_{t,i})^2\, p_{t,i}^{\alpha-2}.
\]
Algorithm 3 VR-SCRiBLe
Define: Let ψ(·) be a ν-self-concordant barrier of conv(X) ⊂ R^d.
for t = 1, 2, . . . do
Receive mt ∈ R^d. Compute
\[
w_t = \operatorname*{argmin}_{w \in \mathrm{conv}(\mathcal{X})} \Bigg\{ \Bigg\langle w, \sum_{\tau=1}^{t-1} \hat{\ell}_\tau + m_t \Bigg\rangle + \frac{1}{\eta_t}\psi(w) \Bigg\} \tag{9}
\]
where
\[
\eta_t = \min\Bigg\{ \frac{1}{16d},\; \sqrt{\frac{\nu \log T}{\sum_{\tau=1}^{t-1} \|\hat{\ell}_\tau - m_\tau\|^2_{H_\tau^{-1}}}} \Bigg\}
\]
and Ht = ∇²ψ(wt).
Sample st uniformly from S_d (the unit sphere in d dimensions).
Define
\[
w_t^{+} = w_t + H_t^{-\frac{1}{2}} s_t, \qquad w_t^{-} = w_t - H_t^{-\frac{1}{2}} s_t.
\]
Sample At ∼ qt ≜ (qt⁺ + qt⁻)/2, receive ℓt,At ∈ [−1, 1] with E[ℓt,x] = ⟨x, ℓt⟩, and define
\[
\hat{\ell}_t = d(\ell_{t,A_t} - m_{t,A_t}) \cdot \frac{q^{+}_{t,A_t} - q^{-}_{t,A_t}}{q^{+}_{t,A_t} + q^{-}_{t,A_t}} \cdot H_t^{\frac{1}{2}} s_t + m_t
\]
end
Lemma 33 In Algorithm 3, we have E ℓbt = ℓt and
" #
2 X
E ℓbt − mt ≤ d2 E min{pt,x , 1 − pt,x }(ℓt,x − mt,x )2 .
∇−2 ψ(wt )
x∈X
Proof
+ −
" ! #
h i qt,A t
− qt,A t
1
E ℓbt = E d(ℓt,At − mt,At ) + −
2
Ht st + mt
qt,A t
+ qt,A t
+ −
" " ! # #
qt,A t
− qt,A t
1
= E dE (ℓt,At − mt,At ) + −
2
st Ht st + mt
qt,A t
+ qt,A t
(note that qt+ , qt− depend on st )
−
" " +
! # #
X qt,x − qt,x 1
= E dE qt,x (ℓt,x − mt,x ) + − st Ht2 st + mt
x
qt,x + qt,x
−
" " # +
# !
X − qt,x
1 qt,x
= E dE ⟨x, ℓt − mt ⟩ st Ht2 st + mt
x
2
⟨wt − wt− , ℓt − mt ⟩
+ 1
2
= E dE st Ht st + mt
2
1
− 12
= E d⟨H st , ℓt − mt ⟩Ht st + mt
2
1 1
⊤ −2
= E dHt st st Ht (ℓt − mt ) + mt
2
= ℓt . (because E[st s⊤ 1
t ] = d I)
!2
+ − 2
qt,A − qt,A
2 1
E ℓbt − mt = E d2 (ℓt,At − mt,At )2 +
t
−
t 2
Ht st
∇−2 ψ(wt ) qt,A t
+ qt,A t Ht−1
!2
+ −
qt,A − qt,A
= E d2 (ℓt,At − mt,At )2 +
t
−
t
qt,A t
+ qt,A t
!2
+ −
qt,A − qt,A
= E E d2 (ℓt,At − mt,At )2 +
t
−
t
st
qt,A t
+ qt,A t
−
" " +
##
X qt,x − qt,x
≤E E qt,x d2 (ℓt,x − mt,x )2 + − st
x
qt,x + qt,x
+ −
qt,x −qt,x
For any x, we have qt,x +
qt,x −
+qt,x
≤ qt,x and
+ − + − + −
qt,x − qt,x |qt,x − qt,x | qt,x + qt,x
qt,x + − = ≤1− = 1 − qt,x .
qt,x + qt,x 2 2
2
Therefore, we continue to bound E ℓbt − mt by
∇−2 ψ(wt )
" " ##
X
2 2
E E min{qt,x , 1 − qt,x }d (ℓt,x − mt,x ) st
x∈X
" #
X
min pt,x , 1 − pt,x d2 (ℓt,x − mt,x )2
≤E .
x∈X
(E[min(·, ·)] ≤ min(E[·], E[·]) and pt,x = E[E[qt,x | st ]])
Lemma 34 If ηt ≤ 1/(16d), then max_w {⟨wt − w, ℓ̂t − mt⟩ − (1/ηt) Dψ(w, wt)} ≤ 8ηt ∥ℓ̂t − mt∥²_{∇⁻²ψ(wt)}.
Proof We first show that ηt ∥ℓ̂t − mt∥_{∇⁻²ψ(wt)} ≤ 1/16. By the definition of ℓ̂t, we have
\[
\eta_t \|\hat{\ell}_t - m_t\|_{\nabla^{-2}\psi(w_t)} \le \frac{1}{16d}\cdot d\,\|H_t^{\frac{1}{2}} s_t\|_{H_t^{-1}} \le \frac{1}{16}.
\]
Define
\[
F(w) = \langle w_t - w, \hat{\ell}_t - m_t\rangle - \frac{1}{\eta_t} D_{\psi}(w, w_t)
\]
and define λ = ∥ℓ̂t − mt∥_{∇⁻²ψ(wt)}. Let w′ be the maximizer of F. It suffices to show ∥w′ −
wt∥_{∇²ψ(wt)} ≤ 8ηtλ, because this leads to
\[
F(w') \le \langle w_t - w', \hat{\ell}_t - m_t\rangle \le \|w' - w_t\|_{\nabla^2\psi(w_t)}\,\|\hat{\ell}_t - m_t\|_{\nabla^{-2}\psi(w_t)} \le 8\eta_t\,\|\hat{\ell}_t - m_t\|^2_{\nabla^{-2}\psi(w_t)}.
\]
To show ∥w′ − wt∥_{∇²ψ(wt)} ≤ 8ηtλ, it suffices to show that for all u such that ∥u − wt∥_{∇²ψ(wt)} =
8ηtλ, we have F(u) ≤ 0. To see why this is the case, notice that F(wt) = 0, and F(w′) ≥ 0 because w′ is
the maximizer of F. Therefore, if ∥w′ − wt∥_{∇²ψ(wt)} > 8ηtλ, then there exists u on the line segment
between wt and w′ with ∥u − wt∥_{∇²ψ(wt)} = 8ηtλ such that F(u) ≤ 0 ≤ min{F(wt), F(w′)},
contradicting that F is strictly concave.
Below, consider any u with ∥u − wt∥_{∇²ψ(wt)} = 8ηtλ. By Taylor expansion, there exists u′ on
the line segment between u and wt such that
\[
F(u) \le \|u - w_t\|_{\nabla^2\psi(w_t)} \|\hat{\ell}_t - m_t\|_{\nabla^{-2}\psi(w_t)} - \frac{1}{2\eta_t}\|u - w_t\|^2_{\nabla^2\psi(u')}.
\]
Because ψ is a self-concordant barrier and ∥u′ − wt∥_{∇²ψ(wt)} ≤ ∥u − wt∥_{∇²ψ(wt)} = 8ηtλ ≤ 1/2,
we have ∇²ψ(u′) ⪰ (1/4)∇²ψ(wt). Continuing from the previous inequality,
\[
F(u) \le \|u - w_t\|_{\nabla^2\psi(w_t)} \|\hat{\ell}_t - m_t\|_{\nabla^{-2}\psi(w_t)} - \frac{1}{8\eta_t}\|u - w_t\|^2_{\nabla^2\psi(w_t)} = 8\eta_t\lambda\cdot\lambda - \frac{(8\eta_t\lambda)^2}{8\eta_t} = 0.
\]
Proof [Proof of Theorem 3] By the standard analysis of optimistic FTRL and Lemma 34, we have
for any u,
\[
\mathbb{E}\Bigg[\sum_{t=1}^{T}(\ell_{t,A_t} - \ell_{t,u})\Bigg]
\le O\Bigg(\mathbb{E}\Bigg[\frac{\nu\log T}{\eta_T} + \sum_{t=1}^{T}\eta_t\,\|\hat{\ell}_t - m_t\|^2_{\nabla^{-2}\psi(w_t)}\Bigg]\Bigg)
= O\Bigg(\sqrt{\nu\log T\;\mathbb{E}\Bigg[\sum_{t=1}^{T}\|\hat{\ell}_t - m_t\|^2_{\nabla^{-2}\psi(w_t)}\Bigg]} + d\nu\log(T)\Bigg), \tag{10}
\]
where the last step is by the tuning of ηt. In the adversarial regime, using Lemma 33, we continue to bound (10) by
\[
O\Bigg(d\sqrt{\nu\log T\;\mathbb{E}\Bigg[\sum_{t=1}^{T}(\ell_{t,A_t} - m_{t,A_t})^2\Bigg]} + d\nu\log(T)\Bigg).
\]
In particular, when ℓt,x ≥ 0 and mt = 0, we have (ℓt,At)² ≤ ℓt,At, and solving the resulting inequality
for the regret gives
\[
\mathbb{E}\Bigg[\sum_{t=1}^{T}(\ell_{t,A_t} - \ell_{t,u})\Bigg] \le O\Bigg(d\sqrt{\nu\log T\;\mathbb{E}\Bigg[\sum_{t=1}^{T}\ell_{t,u}\Bigg]} + d\nu\log(T)\Bigg).
\]
In the corrupted stochastic setting, using Lemma 33, we continue to bound (10) by
\[
O\Bigg(d\sqrt{\nu\log(T)\;\mathbb{E}\Bigg[\sum_{t=1}^{T}\Big((1 - p_{t,u})(\ell_{t,u} - m_{t,u})^2 + \sum_{x\ne u} p_{t,x}(\ell_{t,x} - m_{t,x})^2\Big)\Bigg]}\Bigg)
\le O\Bigg(d\sqrt{\nu\log(T)\;\mathbb{E}\Bigg[\sum_{t=1}^{T}(1 - p_{t,u})\Bigg]}\Bigg).
\]
Then by the self-bounding technique stated in Proposition 2, we can further bound the regret in the
corrupted stochastic setting by
\[
O\Bigg(\frac{d^2\nu\log(T)}{\Delta} + \sqrt{\frac{d^2\nu\log(T)}{\Delta}\,C}\Bigg).
\]
Algorithm 4 LSB-logdet
Input: X (= { x0 }), x
b = 01 .
where πxb denotes the sampling distribution that picks x b with probability 1.
Sample an action At ∼ pet .
Construct loss estimator:
−1
ℓbt (a) = a⊤ H(ept ) − µpet µ⊤
pet at − µpet ℓt (at ) .
end
We begin by using the following simplifying notation. For a distribution α ∈ ∆(X ∪ {x̂}), we
define α_X as the restricted distribution over X such that α_X ∝ α over X, i.e. α_{X,x} = α_x/(1 − α_{x̂})
for any x ∈ X. We further define
Lemma 35 Assume α ∈ ∆(X ∪ {b x})) is such that GV α is of rank d − 1 (i.e. full rank over X ).
Then we have the following properties:
⊤
HαV = (1 − αxb) GV α + α x
b (m α − x
b )(m α − x
b ) ,
V −1 1 V + V + ⊤ ⊤ V + 2 1 ⊤
[Hα ] = [Gα ] + [Gα ] mα x b +x bmα [Gα ] + ∥mα ∥[GVα ]+ + x
bx ,
1 − αxb
b
αxb
where [GV +
α ] denotes the pseudo-inverse.
HαV = H(α) − µα µ⊤
α = (1 − αxb )G(α) + αx
bx b⊤ − ((1 − αxb)mα + αxbx
bx b)⊤
b)((1 − αxb)mα + αxbx
= (1 − αxb) GV b (mα − x
α + αx b)⊤ ,
b)(mα − x
b = 01 and ∀x ∈ X :
which holds for any α. For the second identity note that by the definition of x
⟨x, x
b⟩ = 0, we have GV b = [GV
αx
+ b = 0. Furthermore GV [GV ]+ = I − x
α] x α α b−1 due to the rank d − 1
bx
assumption. Multiplying the two matrices yields
1
V ⊤ V + V + ⊤ ⊤ V + 2 ⊤
Gα + αxb(mα − x b)(mα − x b) · [Gα ] + [Gα ] mα x b +x bmα [Gα ] + ∥mα ∥[GVα ]+ + x
bx b
αxb
⊤
⊤ ⊤ V + 2 V + 2 1
=I −x bx
b + mα x b + αxb(mα − x b) [Gα ] mα + ∥mα ∥[GVα ]+ x b − [Gα ] mα − ∥mα ∥[GVα ]+ + x
b
αxb
=I −x
bxb⊤ + mα x
b⊤ − (mα − x x⊤ = I
b)b
⟨x, [GV +
α ] y⟩
⟨x, [HαV ]−1 y⟩ = , (11)
1 − αxb
⟨x, [HαV ]−1 (b
x − mα )⟩ = 0, (12)
1
x − mα ∥2[HαV ]−1 =
∥b . (13)
(1 − αxb)αxb
Lemma 37 Let π ∈ ∆(X ) be arbitrary, and π̃ = (1 − ηt d)π + ηt dπX for (ηt d) ∈ (0, 1), then it
holds for any x ∈ X :
s
ηt d
∥mπ − mπ̃ ∥[GV ]+ ≤
π̃ 1 − ηt d
2
∥x − mπ̃ ∥[GV ]+ ≤ √
π̃ ηt
Proof We have
⊤
GV V V
e = ηt dGπX + (1 − ηt d)Gπ + ηt d(1 − ηt d)(mπ − mπX )(mπ − mπX ) ,
π
hence
where the last inequality uses that John’s exploration satisfies ∥x − mπX ∥2[GV ]+ ≤ d for all x ∈ X .
πX
1 − αxb
D(α, β) ≥ Dlog (αxb, βxb) + Dlog (1 − αxb, 1 − βxb) + ∥mα − mβ ∥2[GV ]+
1 − βxb β
M is a matrix with d − 2 eigenvalues of size 1, since it is the identity with two rank-1 updates. The
product of the remaining eigenvalues is given by considering the determinant of the 2×2 sub-matrix
[GV +
α ] mα
with coordinates ∥[G V ]+ m ∥ and x
b. We have that
α α
2
m⊤ V + + +
⊤ α [Gα ] [GV
α ] mα ⊤ [GV
α ] mα
det(M ) = x
b Mx b· +
M +
− x b M +
∥[GVα ] mα ∥ ∥[GVα ] mα ∥ ∥[GV
α ] mα ∥
2 2
+ +
= 1 · 1 + αxb [GV α ] mα − αxb [GVα ] mα = 1.
Hence we have
Where detd−1 is the determinant over the first d − 1 eigenvalues of the submatrix of the first (d − 1)
coordinates. The derivative term is given by
X
⟨α − β, ∇F (β)⟩ = (αx − βx ) ∥x − µβ ∥2[H V ]−1
β
x∈X ∪{b
x}
X
= αx ∥x − µβ ∥2[H V ]−1 − d
β
x∈X ∪{b
x}
X
= Tr αx (x − µβ )(x − µβ )⊤ [HβV ]−1 − d
x∈X ∪{b
x}
= Tr HαV + (µα − µβ )(µα − µβ )⊤ [HβV ]−1 − d
= Tr HαV [HβV ]−1 + ∥µα − µβ ∥2[H V ]−1 − d.
β
⟨α − β, ∇F (β)⟩
α (1 − αxb) + (αxb − βxb)2 1 − αxb
1 − αxb V V +
= Trd−1 Gα [Gβ ] + xb + ∥mα − mβ ∥2[GV ]+ − d
1 − βxb βxb(1 − βxb) 1 − βxb β
1 − αxb V V + α 1 − αxb 1 − αxb
= Trd−1 G [G ] + xb + −1+ ∥mα − mβ ∥2[GV ]+ − d .
1 − βxb α β βxb 1 − βxb 1 − βxb β
D(α, β)
1 − αxb
= Dlog (αxb, βxb) + Dlog (1 − αxb, 1 − βxb) + ∥mα − mβ ∥2[GV ]+ + Dd−1 (1 − αxb)GV
α , (1 − βx )GV
β
1 − βxb
b
β
1 − αxb
≥ Dlog (αxb, βxb) + Dlog (1 − αxb, 1 − βxb) + ∥mα − mβ ∥2[GV ]+ ,
1 − βxb β
where the last inequality follows from the positiveness of Bregman divergences.
1
Lemma 39 For any b ∈ (0, 1), any x ∈ R and η ≤ 2|x| , it holds that
1
sup |a − b|x − Dlog (a, b) ≤ ηb2 x2 .
α∈[0,1] η
1
sup f (a) = sup (b − a)x − Dlog (a, b) ≤ ηb2 x2 ,
a∈[0,1] a∈[0,1] η
since x can take positive or negative sign. The function is concave, so setting the derivative to 0 is
the optimal solution if that value lies in (0, ∞).
′ 1 1 1 b
f (a) = −x + − a⋆ = .
η a b 1 + ηbx
ηb2 x2
⋆ 1 1
f (a ) = − log(1 + ηbx) + −1
1 + ηbx η 1 + ηbx
ηb2 x2
1 2 2 2 ηbx
≤ − ηbx − η b x − = ηb2 x2 ,
1 + ηbx η 1 + ηbx
1
stabt := sup ⟨µpt − µα , ℓbt ⟩ − D(α, pt ) ,
α∈∆(X ∪{b
x}) ηt
satisfies
Proof We have
Hence for At = y ∈ X :
Equivalently for At = x
b,
2 |αxb − pt,bx | √
≤ ηt y − mpet [GV ]+ + sup (1 + 2 d)
p
et αxb ∈[0,1] 1 − pt,bx
|αxb − pt,bx | 2 1
+ ηt y − mpet [GV ]+
− D(1 − αxb, 1 − pt,bx ) (AM-GM inequality)
1 − pt,bx p
e t ηt
2
≤ ηt y − mpet [GV ]+
p
et
|αxb − pt,bx | √ 1
+ sup (5 + 2 d) − D(1 − αxb, 1 − pt,bx ) (Lemma 37)
αxb ∈[0,1] 1 − pt,bx ηt
2
≤ ηt y − mpet [GV ]+
+ O(ηt d) . (Lemma 39)
p
et
Finally we have
Et [stabt ] = (1 − pt,bx )EAt ∼pt,X [stabt ] + pt,bx EAt ∼πxb [stabt ] = O((1 − pt,bx )ηt d)
Proof [Proof of Lemma 5] By standard analysis of FTRL (Lemma 27) and Lemma 40, for any τ
and x
v
Xτ h i d log T Xτ u τ
X
u
Et ⟨pt , ℓbt ⟩ − ℓbt (x) ≤ + O(ηt (1 − pt,bx )d) ≤ O td log T (1 − pet,bx ) ,
ητ
t=1 t=1 t=1
where the last inequality follows from the definition of learning rate and pet,bx = pt,bx . Additionally,
we have
h i
Et ⟨ept − pt , ℓbt ⟩ = ⟨e
pt − pt , ℓt ⟩ ≤ ηt d(1 − pt,bx ) ,
so that
v
τ
X h i u τ
X
u
Et ⟨e
pt , ℓbt ⟩ − ℓbt (x) = O td log T (1 − pet,bx ) .
t=1 t=1
" T # " T #α
X X
E (ℓt,At − ℓt,x⋆ ) ≤ (c1 log T )1−α E (1 − pt,x⋆ ) + c2 log T.
t=1 t=1
which implies
\[
\mathbb{E}\Bigg[\sum_{t=T_k+1}^{T_{k+1}} (\ell_{t,A_t} - \ell_{t,u})\Bigg] \le c_0^{1-\alpha}\,\mathbb{E}[\tau_k]^{\alpha} + c_2\log(T)
\]
using the property E[x^α] ≤ E[x]^α for x ∈ R₊ and 0 < α < 1. Summing the bounds over
k = 1, 2, . . . , n, and using the fact that τk ≥ 2τk−1 for all k < n, we get
\[
\mathbb{E}\Bigg[\sum_{t=1}^{T} (\ell_{t,A_t} - \ell_{t,u})\Bigg]
\le c_0^{1-\alpha}\Big(\mathbb{E}[\tau_n]^{\alpha} + \mathbb{E}[\tau_{n-1}]^{\alpha}\big(1 + \tfrac{1}{2^{\alpha}} + \tfrac{1}{2^{2\alpha}} + \cdots\big)\Big) + c_2\log(T)\log_2\frac{T}{c_2\log(T)}
\le O\big(c_0^{1-\alpha} T^{\alpha} + c_2\log^2(T)\big). \tag{14}
\]
Similarly, using the second argument of the min in the LSB condition, we get
\[
\mathbb{E}\Bigg[\sum_{t=1}^{T} (\ell_{t,A_t} - \ell_{t,u})\Bigg] \le O\big((c_1\log T)^{1-\alpha} T^{\alpha} + c_2\log^2(T)\big). \tag{15}
\]
Next, consider the corrupted stochastic regime. We first argue that it suffices to consider the regret
against comparator x⋆. This is because if x⋆ ≠ argmin_{u∈X} E[Σ_{t=1}^T ℓt,u], then it holds that C ≥ T∆.
In that case the right-hand side of (15) is further upper bounded by
\[
O\big((c_1\log T)^{1-\alpha}(C/\Delta)^{\alpha} + c_2\log^2(T)\big),
\]
which fulfills the requirement for the stochastic regime. Below, we focus on bounding the pseudo-
regret with respect to x⋆.
Let m = max{k ∈ N | x̂k ≠ x⋆}. Notice that m is either the last epoch (i.e., m = n) or the
second last (i.e., m = n − 1), because for any two consecutive epochs, at least one of them must
have x̂k ≠ x⋆. Below we show that |{t ∈ [Tm+1] | At ≠ x⋆}| ≥ Tm+1/8 − 2τ0.
If τm > 2τm−1, by the fact that the termination condition was not triggered one round earlier,
the number of plays Nm(x⋆) in epoch m is at most (τm − 1)/2 + 1 = (τm + 1)/2; in other words, the number of
times Σ_{t=Tm+1}^{Tm+1} I{At ≠ x⋆} is at least (τm − 1)/2. Further notice that because τm > 2τm−1 ≥ 4τm−2 >
· · ·, we have (τm − 1)/2 ≥ (1/4) Σ_{k=1}^m τk − 1/2 = Tm+1/4 − 1/2.
Now consider the case m > 1 and τm ≤ 2τm−1. Recall that x̂m is the action with Nm−1(x̂m) ≥
τm−1/2. This implies that Σ_{t=Tm−1+1}^{Tm} I{At ≠ x⋆} ≥ τm−1/2 ≥ (1/2) max{τm/2, (1/2) Σ_{k=1}^{m−1} τk} ≥
(1/8) Σ_{k=1}^{m} τk = Tm+1/8.
Finally, consider the case m = 1 and τ1 ≤ 2τ0; then we have T2 − 2τ0 ≤ 0 and the statement
holds trivially.
The regret up to and including epoch m can be lower and upper bounded using the self-bounding
technique:
\[
\begin{aligned}
\mathbb{E}\Bigg[\sum_{t=1}^{T_{m+1}} (\ell_{t,A_t} - \ell_{t,x^\star})\Bigg]
&= (1+\lambda)\,\mathbb{E}\Bigg[\sum_{t=1}^{T_{m+1}} (\ell_{t,A_t} - \ell_{t,x^\star})\Bigg] - \lambda\,\mathbb{E}\Bigg[\sum_{t=1}^{T_{m+1}} (\ell_{t,A_t} - \ell_{t,x^\star})\Bigg]
\qquad (\text{for } 0\le\lambda\le 1)\\
&\le O\Bigg((c_1\log T)^{1-\alpha}\,\mathbb{E}[T_{m+1}]^{\alpha} + c_2\log(T)\log\frac{\mathbb{E}[T_{m+1}]}{c_2\log(T)}\Bigg)
- \lambda\Bigg(\Big(\tfrac{1}{8}\,\mathbb{E}[T_{m+1}] - c_2\log(T)\Big)\Delta - C\Bigg)\\
&\qquad (\text{the first term is by a similar calculation as (14), but replacing } T \text{ by } T_m \text{ and } c_0 \text{ by } c_1\log T)\\
&\le O\Big(c_1\log(T)\,\Delta^{-\frac{\alpha}{1-\alpha}} + (c_1\log T)^{1-\alpha}\,(C\Delta^{-1})^{\alpha} + c_2\log(T)\log(C\Delta^{-1})\Big).
\end{aligned}
\]
We have maximally log T episodes, since the length doubles every time. Via Cauchy-Schwarz, we
get
\[
\sum_{k=1}^{k_{\max}} \mathbb{E}_{T_k}\Bigg[\sum_{t=T_k+1}^{T_{k+1}} (\ell_{t,A_t} - \ell_{t,u})\Bigg]
\le \sqrt{c_1 \log T \sum_{k=1}^{k_{\max}} \mathbb{E}_{T_k}\Bigg[\sum_{t=T_k+1}^{T_{k+1}} \sum_{x} p_{t,x}\,\xi_{t,x}\Bigg]\,\log T} + c_2\log^2 T.
\]
Taking the expectation on both sides and the tower rule of expectations finishes the bound for the
adversarial regime. For the stochastic regime, note that ξt,x ≤ 1 and hence
\[
\sum_{x} p_{t,x}\,\xi_{t,x} - \mathbb{I}\{u = \hat{x}\}\,p_{t,u}^2\,\xi_{t,u}
\le 1 - \mathbb{I}\{u = \hat{x}\}\,p_{t,u}^2
= (1 - \mathbb{I}\{u = \hat{x}\}\,p_{t,u})(1 + \mathbb{I}\{u = \hat{x}\}\,p_{t,u})
\le 2(1 - \mathbb{I}\{u = \hat{x}\}\,p_{t,u}).
\]
dd-LSB therefore implies regular LSB (up to a factor of 2) and hence the stochastic bound of regular LSB
applies.
q̄t,2
Since qt,2 ≤ 2, using the inequalities above, we have
s
p q̄t,2 √ 1
ηt q̄t,2 bTs
t ≤ ηt c1 ≤ ηt 2c1 ≤ . (16)
qt,2 4
q̄t,2 1
β q̄t,2 bLo
t ≤β c2 ≤ 2βc2 ≤ . (17)
qt,2 4
q̄t,2
By Lemma 28 and that qt,2 ≤ 2, we have for any u,
t ′ t ′
X X
⟨qt − u, zt ⟩ ≤ ⟨qt − q̄t , zt ⟩
t=1 t=1
| {z }
term1
t √
′ t ′ t t ′ t ′ ′
qt,2 log t′ X
√ X 3
2
X X X
+O c1 + √ + + ηt min qt,i (zt,i − θt ) +
2 Ts
qt,2 bt + Lo
qt,2 bt − u2 bt .
t β |θt |≤1
t=1 | {z } t=1 t=1 t=1 t=1
| {z } term | {z } | {z } | {z }
3
term2 term4 term5 term6
(18)
We bound individual terms below:
" t′ # t′
!
X X 1
E[term1 ] = E ⟨qt − q̄t , zt ⟩ ≤ O = O(1).
t2
t=1 t=1
v
u t′ √
uX
term2 ≤ O min t′ , t qt,2 log T .
t=1
log t′
term3 = ≤ O (c2 log T ) .
β
" t ′ #
X 3
E [term4 ] = E ηt min qt,i (zt,i − θt )2
2
θt ∈[−1,1]
t=1
t′ 2
" #
X X 3
2
≤E ηt qt,i (zt,i − ℓt,At )
2
t=1 i=1
t′ 2
" #
X 1 X 2
=E ηt √ (I[it = i]ℓt,At − qt,i ℓt,At )
qt,i
t=1 i=1
" t′ 2 #
X X √ 3
≤E ηt qt,i (1 − qt,i )2 + (1 − qt,i )qt,i
2
t=1 i=1
t′
" #!
X √
≤O E ηt qt,2 ≤ O (E[term2 ]) .
t=1
40
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
t′ t′ c1
X X qt,2
term5 = qt,2 bTs
t ≤ qt,2 q P
t=1 t=1 c1 tτ =1 1
qτ,2
t′ √1
√ X qt,2 √
= c1 qP × qt,2 (19)
t 1
t=1 τ =1 qτ,2
v v
u t′ 1 u t′
√ u X qt,2
uX
≤ c1 t Pt 1
t q t,2 (Cauchy-Schwarz)
t=1 τ =1 qτ,2 t=1
v !v
u t′ u t′
√ tu X 1 u X
≤ c1 1 + log t qt,2
q
t=1 t,2 t=1
v
u t′
u X
≤ O tc1 qt,2 log T .
t=1
Pt′ √ Pt′ 1
√
Continuing from (19), we also have term5 ≤ t=1 qt,2 bt
Ts
≤ c1 t=1
√
t
≤ 2 c1 t′ because
qτ,2 ≤ 1.
t ′ t ′ t ′
X
Lo
X minτ ≤t qτ,2 X minτ ≤t−1 qτ,2
term6 = qt,2 bt = c2 1− ≤ c2 log ≤ O (c2 log T ) .
minτ ≤t−1 qτ,2 minτ ≤t qτ,2
t=1 t=1 t=1
hP ′ i
t
Using all bounds above in (18), we can bound E t=1 ⟨qt − u, zt ⟩ by
v " v
u t′ # u t′
p u X u X 1 c2
O min c1 E[t′ ], tc
1 qt,2 log T + c2 log T −u2 E tc1 + .
qt,2 mint≤t′ qt,2
t=1 t=1
| {z } | {z }
pos-term neg-term
hP ′ i
t
For comparator x
b, we choose u = e1 and bound E − ℓt,bx ) by the pos-term above. For
t=1 (ℓt,At
hP ′ i
t
comparator x ̸= x
b, we first choose u = e2 and upper bound E t=1 (ℓt,A t − ℓ t,At
e ) by pos-term−
hP ′
i
t
neg-term. On the other hand, by the 21 -iw-stable assumption, E (ℓ e − ℓt,x ) ≤ neg-term.
hP ′t=1 t,At i
t
Combining them, we get that for all x ̸= x b, we also have E t=1 (ℓ t,A t − ℓt,x ) ≤ pos-term.
Comparing the coefficients in pos-term with those in Definition 4, we see that Algorithm 2 satisfies
1 ′ ′ ′ ′ ′ ′
2 -LSB with constants (c0 , c1 , c2 ) where c0 = c1 = O(c1 ) and c2 = O(c2 ).
41
DANN W EI Z IMMERT
2
F.2. 3 -LSB to 23 -iw-stable (Algorithm 5 / Theorem 18)
Sample it ∼ q̄t .
if it = 1 then set Āt = x
b;
else set Āt = Aet ;
Sample jt ∼ γt .
if jt = 1 then draw a revealing action of Āt and observe ℓt,Āt ;
else draw At = Āt ;
ℓt,Āt I{it =i}I{jt =1}
Define zt,i = γt and
t
! 23
1
3
X 1 c2
Bt = c1 √ + .
qτ,2 minτ ≤t qτ,2
τ =1
end
Proof [Proof of Theorem 18] The per-step bonus bt = Bt − Bt−1 is the sum of two terms:
! 23 ! 23
t t−1 √1 1
Ts
1 X 1 X 1 1 qt,2 c1 3
bt = c13
√ − √ ≤ c1
3
1 ≤ ,
qτ,2 qτ,2 Pt 1
3 qt,2
τ =1 τ =1 √
τ =1 qτ,2
c2 c2 c2 minτ ≤t qτ,2
bLo
t = − = 1− .
minτ ≤t qτ,2 minτ ≤t−1 qτ,2 qt,2 minτ ≤t−1 qτ,2
q̄t,2
Since qt,2 ≤ 2, using the inequalities above, we have
1
1
Ts
q̄t,2 c1 3 1 1
ηt q̄t,2 bt ≤ ηt
3
≤ ηt (2c1 ) 3 ≤ . (20)
qt,2 4
q̄t,2 1
β q̄t,2 bLo
t ≤β c2 ≤ 2βc2 ≤ . (21)
qt,2 4
42
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
q̄t,2
By Lemma 28 and that qt,2 ≤ 2, we have for any u,
t ′ t ′
X X
⟨qt − u, zt ⟩ ≤ ⟨qt − q̄t , zt ⟩
t=1 t=1
| {z }
term1
2
t′ t′ t′ t′ t′
q 3
log t′ X
1 X 4
t,2
X X X
2 Ts Lo
3
+ O c1 + 1 + + ηt min qt,i (zt,i − θt ) +
3
qt,2 bt + qt,2 bt − u2 bt
β |θt |≤1
t=1 t 3 t=1 t=1 t=1 t=1
| {z } |term {z } | {z } | {z } | {z }
3
term2 term4 term5 term6
(22)
term2 ≤ O (term1 )
log t′
term3 = ≤ O (c2 log T ) .
β
" t′ #
X 4
E [term4 ] = E ηt qt,2 (zt,2 − zt,1 )2
3
t=1
4
t′
" t′ #
3 X γ2
X qt,i t
≤ E ηt ≤E = E [O(term1 )] .
γt γt
t=1 t=1
′ ′
t t 1 √1
X X qt,2
term5 = qt,2 bTs
t ≤ qt,2 c1 3
Pt 1
3
t=1 t=1 √1
τ =1 qτ,2
t ′ −1
1 X qt,26 2
1 × qt,2
3 3
= c1 P (23)
t 3
t=1 √1
τ =1 qτ,2
1 ! 23
′ ′
√1
t 3 t
1 X qt,2 X
≤ c1 3
Pt qt,2 (Cauchy-Schwarz)
√1
t=1 τ =1 qτ,2 t=1
43
DANN W EI Z IMMERT
t ′ !! 13 t ′ ! 23
1 X 1 X
≤ c1 3
1 + log qt,2
qt,2
t=1 t=1
t′
! 23
1 X 1
≤ O c13 qt,2 (log T ) 3 .
t=1
Pt′ 1 Pt′ 1 2
Continuing from (23), we also have term5 ≤ t=1 qt,2 bt
Ts
≤ c13 t=1
1
1 ≤ O(c13 t′ 3 ) because
t3
qτ,2 ≤ 1.
t ′ t ′ t ′
X
Lo
X minτ ≤t qτ,2 X minτ ≤t−1 qτ,2
term6 = qt,2 bt = c2 1− ≤ c2 log ≤ O (c2 log T ) .
minτ ≤t−1 qτ,2 minτ ≤t qτ,2
t=1 t=1 t=1
hP ′ i
t
Using all bounds above in (22), we can bound E t=1 ⟨qt − u, zt ⟩ by
" t′ # 32 t ′ ! 23
1 2 1 X
1 X 1 c2
O min c13 E[t′ ] 3 , (c1 log T ) 3 qt,2 + c2 log T −u2 E c13 + .
qt,2 mint≤t′ qt,2
t=1 t=1
| {z } | {z }
pos-term neg-term
i
hP ′
t
For comparator x
b, we choose u = e1 and bound E − ℓt,bx ) by the pos-term above. For
t=1 (ℓt,At
hP ′ i
t
comparator x ̸= x
b, we first choose u = e2 and upper bound E t=1 (ℓt,A t − ℓ t,A
et ) by pos-term−
hP ′
i
t
neg-term. On the other hand, by the 21 -iw-stable assumption, E (ℓ e − ℓt,x ) ≤ neg-term.
hP ′ t=1 t,At i
t
Combining them, we get that for all x ̸= xb, we also have E t=1 (ℓt,At − ℓt,x ) ≤ pos-term.
Finally, notice that Et [I{At ̸= x
b}] ≥ q̄t,2 (1 − γt ) = qt,2 ,. This implies that Algorithm 5 is
1
2
3 -LSB with coefficient (c′0 , c′1 , c′2 ) with c′0 = c′1 = O(c1 ), and c′2 = O(c13 + c2 ).
1
F.3. 2 -dd-LSB to 12 -dd-iw-stable (Algorithm 6 / Theorem 22)
Proof [Proof of Theorem 22] Define bt = Bt − Bt−1 . Notice that we have
44
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
t−1
!− 21
1 1 X
where ηt = (log T ) 2 (I[iτ = i] − qτ,i )2 ξτ,Aτ + (c1 + c22 ) log T .
4
τ =1
Sample it ∼ qt .
if it = 1 then draw At = xb and observe ℓt,At ;
else draw At = A et and observe ℓt,At ;
(ℓt,At −yt,i )I{it =i}
Define zt,i = qt,i + yt,i and
v
u t
u X ξt,At I[iτ = 2] c2
Bt = tc1 2 + .
q τ,2 minτ ≤t qτ,2
τ =1
end
q̄t,2
By Lemma 29 and that qt,2 ≤ 2, we have for any u,
t ′ t ′ 2
X log T X X
⟨qt − u, zt ⟩ ≤ O + ηt min 2
qt,i (zt,i − yt,i − θt )2
ηt′ |θ|≤1
t=1 | {z } |t=1 i=1
term1
{z }
term2
t′ t′ t′
!
X X X
+ ⟨qt − q̄t , zt ⟩ + qt,2 bt − u2 bt
t=1 t=1 t=1
| {z } | {z }
term3 term4
45
DANN W EI Z IMMERT
s
Pt′ −1 P2
t=1 i=1 (I[it = i] − qt,i )2 ξ t,At + (c1 + c22 ) log T
term1 ≤ O log T
log T
v
u t′ 2
uX X √
≤ O t (I[it = i] − qt,i )2 ξt,At log T + ( c1 + c2 ) log T .
t=1 i=1
" t ′ 2 2 #
X X
2 (ℓt,At − mt,At )I[it = i]
E [term2 ] ≤ E ηt min qt,i −θ
θ qt,i
t=1 i=1
t′ 2
" #
X X
≤E ηt (I[it = i] − qt,i )2 (ℓt,At − mt,At )2 (choosing θ = ℓt,At − mt,At )
t=1 i=1
v
u t′ 2
uX X
≤ E t (I[it = i] − qt,i )2 ξt,At log T .
t=1 i=1
" t′ # " t′ #
X X 1 1
E [term3 ] = E ⟨qt − q̄t , Et [zt ]⟩ = E − 2 q̄t + 2 1, Et [zt ] ≤ O(1).
2t 4t
t=1 t=1
t′
X
term4 ≤ qt,2 bt
t=1
c1 ξt,At I[it =2]
t′
X 2
qt,2 1 1
= qt,2
r Pt ξ + c2 −
τ,A τ I[i τ =2] min τ ≤t q τ,2 min τ ≤t−1 q τ,2
t=1 c1 τ =1 2
qτ,2
v
ξt,At I[it =2]
t′ u t′
u
√ Xu 2
qt,2
q X minτ ≤t qτ,2
≤ c1 tP
ξτ,Aτ 1[iτ =2]
× ξt,At I[it = 2] + c2 1−
t min q τ ≤t−1 τ,2
t=1 τ =1 2
qτ,2 t=1
v
ξt,At I[it =2]
v
′
u ′ u t′
u t t
√ uX 2
qt,2 uX X minτ ≤t−1 qτ,2
≤ c1 t ξτ,Aτ 1[iτ =2]
×t ξ t,At I[it = 2] + c2 log
P t minτ ≤t qτ,2
t=1 τ =1 2
qτ,2 t=1 t=1
(Cauchy-Schwarz)
v v
t′ u t′
u !
√ u X ξt,At I[it = 2] u X 1
≤ c1 1 + log
t
2
t ξt,At I[it = 2] + c2 log
qt,2 minτ ≤t′ qτ,2
t=1 t=1
v
u
u X t′
≤ O tc1 ξt,At I[it = 2] log T + c2 log T .
t=1
46
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
v
u t′
u X ξt,A I[it = 2] c2
t
term5 = −u2 Bt′ = −u2 tc1 2 + .
qt,2 mint≤t ′ qt,2
t=1
hP ′ i
t
Combining all inequalities above, we can bound E t=1 ⟨q t − u, zt ⟩ by
v v
u t′ 2 u t′
u XX u X √
O E tc1 (I[it = i] − qt,i )2 ξt,At log T + tc1 ξt,At I[it = 2] log T + ( c1 + c2 ) log T
t=1 i=1 t=1
| {z }
pos-term
(25)
vu t′
u X ξt,A I[it = 2] c2
t
− u2 tc1 2 + .
q t,2 mint≤t′ qt,2
t=1
| {z }
neg-term
Similar
hP ′ to the arguments i in the proofs of Theorem 11 and Theorem 18, we end up bounding
t
E t=1 (ℓt,At − ℓt,x ) by the pos-term above for all x ∈ X . Finally, we process pos-term. Ob-
serve that
" 2 #
X
Et (I[it = i] − qt,i )2 ξt,At
i=1
h i
= Et qt,1 (1 − qt,1 )2 ξt,bx + (1 − qt,1 )qt,1 2
ξt,Aet + qt,2 (1 − qt,2 )2 ξt,Aet + (1 − qt,2 )qt,2
2
ξt,bx
h i
2 2
= Et 2qt,1 qt,2 ξt,bx + 2qt,1 qt,2 ξt,Aet
X
2 2
= 2qt,1 qt,2 ξt,bx + 2qt,1 pt,x ξt,x
x̸=x
b
X
≤ 2pt,bx (1 − pt,bx )ξt,bx + 2 pt,x ξt,x
x̸=x
b
!
X
=2 pt,x ξt,x − p2t,bx ξt,bx ,
x
and that
h i X X
Et ξt,Aet I[it = 2] = pt,bx ξt,x ≤ pt,x ξt,x − p2t,bx ξt,bx
x̸=x
b x
47
DANN W EI Z IMMERT
v
u " t′ !#
u X X √
≤ E [pos-term] ≤ O tc1 E pt,x ξt,x − p2t,bx ξt,bx log T + ( c1 + c2 ) log T ,
t=1 x
√
which implies that the algorithm satisfies 12 -dd-LSB with constants (O(c1 ), O( c1 + c2 )).
1
F.4. 2 -LSB to 12 -strongly-iw-stable (Algorithm 7 / Theorem 25)
Sample it ∼ qt .
if it = 1 then draw At = xb and observe ℓt,At ;
else draw At = At and observe ℓt,At ;
e
ℓ I{i =i}
Define zt,i = t,At t I{A
qt,i
et ̸= x
b} and
v n o
u
u X
u t I Aet ̸
= x
b 1
Bt = tc1 + c2 max .
qτ,2 τ ≤t qτ,2
τ =1
end
Proof [Proof of Theorem 25] The proof of this theorem mostly follows that of Theorem 11. The
difference is that the regret of the base algorithm B is now bounded by
v v
u t′ u t′
u X 1 c2 log T u X I{A et ̸= x
b} p ′ c2 log T
tc
1 + ≤ tc1 + c1 t +
qt,2 + qt,1 I{A
t=1
et = x b} mint≤t′ qt,2 t=1
qt,2 mint≤t′ qt,2
(26)
because when A et = xb, the base algorithm B is able to receive feedback no matter which side the
Corral algorithm chooses. The goal of adding bonus is now only to cancel the first term and the
third term on the right-hand side of (26).
48
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
t ′ t ′
X X
⟨qt − u, zt ⟩ ≤ ⟨qt − q̄t , zt ⟩
t=1 t=1
| {z }
term1
t ′ √ et ̸= x t′ t′ t′
qt,2 I{A b}
√ X log T X 3
2
X
Ts
X
Lo
+O c1 + qP + + ηt min qt,i (zt,i − θt ) +
2
qt,2 bt + qt,2 bt
t eτ ̸= x β |θt |≤1
t=1 τ =1 I{ A b } | {z } t=1 t=1 t=1
| {z } term3 | {z
term
} | {z } | {z }
term term
4 5 6
term2
t ′
X
− u2 bt (27)
t=1
where
v v
c1 I{A et ̸=xb}
u t eτ ̸= x
u t−1
u X I{A b} u X I{A eτ ̸= x
b} c1
r
Ts qt,2
bt = c1
t − c1
t ≤r ≤ ,
qτ,2 qτ,2 t I{A ̸
= x } qt,2
τ =1 τ =1
P eτ
c1 τ =1 qτ,2 b
Lo
c2 c2 c2 minτ ≤t qτ,2
bt = − = 1−
minτ ≤t qτ,2 minτ ≤t−1 qτ,2 qt,2 minτ ≤t−1 qτ,2
√ 1 1
t ≤
satisfying ηt q̄t,2 bTs 2 t ≤
and β q̄t,2 bLo 2 as in (20) and (21).
Below, we bound term1 , . . . , term6 .
" t′ # t ′ !
X X 1
E[term1 ] = E ⟨qt − q̄t , zt ⟩ ≤ O = O(1).
t2
t=1 t=1
v v
u
uXt′ u t′
uX
term2 ≤ O min
t I{A ̸ x
et = b}, t et ̸= x
qt,2 I{A b} log T
t=1 t=1
log T
term3 = ≤ O (c2 log T ) .
β
49
DANN W EI Z IMMERT
" t′ #
X 3
E [term4 ] = E ηt min qt,i (zt,i − θt )2
2
θt ∈[−1,1]
t=1
" t′ 2
#
X X 3
2
≤E ηt qt,i (zt,i − ℓt,At ) I{A
2 et ̸= x
b} (when A
et = x
b, zt,1 = zt,2 = 0)
t=1 i=1
" t′ 2
#
X 1 X 2
=E ηt √ (I[it = i]ℓt,At − qt,i ℓt,At ) I{A et ̸= x
b}
qt,i
t=1 i=1
" t′ 2 #
X X √ 3
≤E ηt qt,i (1 − qt,i )2 + (1 − qt,i )qt,i
2
I{Aet ̸= x
b}
t=1 i=1
" t′ q #
X
≤E et ̸= x
ηt qt,2 I{A b} ≤ O (E[term2 ]) .
t=1
t ′
X
term5 + term6 = qt,2 (bTs Lo
t + bt )
t=1
c1 I{Aet ̸=x
b}
t′
X qt,2 1 1
= qt,2
r + c2 −
Pt eτ ̸=x
I{A b} minτ ≤t qτ,2 minτ ≤t−1 qτ,2
t=1 c1 τ =1 qτ,2
′ e ̸=x
I{A b}
t √t t′
√ minτ ≤t qτ,2
q
X qt,2 X
= c1 r × qt,2 I{At ̸= x
e b} + c2 1−
Pt eτ ̸=x
I{A b} minτ ≤t−1 qτ,2
t=1 t=1
τ =1 qτ,2
(28)
v v
et ̸=x
I{A b} t′
u ′ u t′
u t
√ uX qt,2
uX X minτ ≤t−1 qτ,2
≤ c1 t t q t,2 I{Aet ̸
= x
b } + c2 log
P t I{Aeτ ̸=x
b} minτ ≤t qτ,2
t=1 τ =1 qτ,2 t=1 t=1
(Cauchy-Schwarz)
v v
t′ u t′
u !
√ u X I{A et ̸= x
b} u X 1
≤ c1 1 + log
t t et ̸= x
qt,2 I{A b} + c2 log
qt,2 minτ ≤t′ qτ,2
t=1 t=1
v
u t′
u X
≤ O tc1 qt,2 I{A et ̸= xb} log T + c2 log T . (29)
t=1
√ Pt′ et ̸=x
I{A b} √
Continuing from (28), we also have term5 ≤ c1 t=1 Pt eτ ̸=x ≤ 2 c1 t′ .
τ =1 I{ A b}
50
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
hP ′ i
t
Using all the inequalities above in (27), we can bound E ⟨q
t=1 t − u, zt ⟩ by
v "
t′
u #
p u X √
O min c1 E[t′ ], tc
1 qt,2 I{Aet ̸= x
b} log T + ( c1 + c2 ) log T
t=1
| {z }
pos-term
v
u t′
u X I{A et ̸= x
b} c2
− u2 E tc1 + .
qt,2 mint≤t′ qt,2
t=1
| {z }
neg-term
P′
For comparator hP ′x b, we set u = ie1 . Then hP we have E[ i tt=1 (ℓt,At − ℓt,bx )] ≤ pos-term.
t t′
Observe that E t=1 qt,2 I{At ̸= xb} = E t=1 (1 − pt,b
x ) , so this gives the desired prop-
e
hP ′ i
t
erty of 21 -LSB for action x b. For x ̸= x b, we set u = e2 and get E (ℓ
t=1 t,At − ℓt,x ) =
hP ′ i hP ′ i
t t
E t=1 (ℓt,At − ℓt,Aet ) + E et − ℓt,x ) ≤ (pos-term − neg-term) + (neg-term +
t=1 (ℓt,A
p √
c1 E[t′ ]) where the additional c1 t′ term comes from the second term on the right-hand side
of (26), which is not cancelled by the negative term. Note, however, that this still satisfies the re-
quirement of 12 -LSB for x ̸= x b because for these actions, we only require the worst-case bound to
hold. Overall, we have justified that Algorithm 7 satisfies 21 -LSB with constants (c′0 , c′1 , c′2 ) where
√
c′0 = c′1 = O(c1 ) and c2 = ( c1 + c2 ).
1
F.5. 2 -dd-LSB to 12 -dd-strongly-iw-stable (Algorithm 8 / Lemma 42)
Lemma 42 Let B be an algorithm with the following stability guarantee: given an adaptive se-
quence of weights q1 , q2 , · · · ∈ (0, 1]X such that the feedback in round t is observed with probability
qt (x) if x is chosen, and an adaptive sequence {mt,x }x∈X available at the beginning of round t, it
obtains the following pseudo regret guarantee for any stopping time t′ :
" t′ # v
u t′
X u X upd · ξt,A c2
t t
E (ℓt,At − ℓt,u ) ≤ E tc1 + ,
qt (At )2 mint≤t′ minx qt (x)
t=1 t=1
where updt = 1 if feedback is observed in round t and updt = 0 otherwise. ξt,x = (ℓt,x − mt,x )2 in
the second-order bound case, and ξt,x = ℓt,x in the first-order bound case. Then Algorithm 8 with
B as input satisfies 12 -dd-LSB.
Proof The proof of this lemma is a combination of the elements in Theorem 22 (data-dependent
iw-stable) and Theorem 25 (strongly-iw-stable), so we omit the details here and only provide a
sketch.
51
DANN W EI Z IMMERT
t−1
!− 12
1 1 X
where ηt = (log T ) 2 (I[iτ = i] − qτ,i )2 ξτ,Aτ I[A b] + (c1 + c22 ) log T
eτ ̸= x .
4
τ =1
Sample it ∼ qt .
if it = 1 then draw At = x b and observe ℓt,At ;
else draw At = At and observe ℓt,At ;
e
(ℓt,At −yt,i )I{it =i} et ̸= x
Define zt,i = qt,i + y t,i I{A b} and
v
u t et ̸= x
u X ξt,At I{iτ = 2}I{A b} c2
Bt = c1
t
2 + . (30)
qτ,2 minτ ≤t qτ,2
τ =1
end
because if B chooses xb, then the probability of observing the feedback is 1, and is qt,2 otherwise.
This motivates the choice of the bonus in (30). Then we can follow the proof of Theorem 22 step-
by-step, and show that the regret compared to xb is upper bounded by the order of
v
u t′ 2
u XX
E tc1 (I[it = i] − qt,i )2 ξt,At I{A
et ̸= x
b} log T
t=1 i=1
52
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
v
u t′
u X
et ̸= x √
+ E tc1 ξt,At I[it = 2]I{A b} log T + ( c1 + c2 ) log T
t=1
v " t′ #
u
u X
≤ tc1 E 2 ξ 2
2qt,1 qt,2 et I{At ̸= x
x + 2qt,1 qt,2 ξt,A
t,b b}
e
t=1
v " t′ #
u
u X
et ̸= x √
+ c1 E
t ξt,Aet qt,2 I{A b} log T + ( c1 + c2 ) log T
t=1
(taking expectation over it and following the calculation in the proof of Theorem 22)
v
u
u Xt′ X
2 ξ 2 q
= tc1 E x Pr[At ̸= x
u 2qt,1 qt,2 t,b
e b] + 2qt,1 t,2 Pr[Aet = x]ξt,x
t=1 x̸=x
b
v
u
t′
et = x]ξt,x log T + (√c1 + c2 ) log T
u X X
+ tc1 E qt,2 Pr[A
u
t=1 x̸=x
b
53
DANN W EI Z IMMERT
Algorithm 9 EXP2
Input: X .
for t = 1, 2, . . . do
Receive update probability qt .
Let
t−1
(s ) !
ln |X | 1 X
ηt = min , min qτ , Pt (a) ∝ exp −ηt ℓbτ (a) .
d tτ =1 q1τ 2d τ ≤t
P
τ =1
end
Proof [Proof of Lemma 9] Consider the EXP2 algorithm (Algorithm 9) which corresponds to FTRL
with negentropy potential. By standard analysis of FTRL (Lemma 27), for any τ and any a⋆ ,
τ τ τ
" #
X X X h i ln |X | X 1
⋆
Et Pt (a)ℓbt (a) − Et ℓt (a ) ≤
b + Et max ⟨Pt − P, ℓt ⟩ − Dψ (P, Pt ) .
a
ητ P ηt
t=1 t=1 t=1
To apply Lemma 31, weqneed to show that ηq t ℓt (a) ≥ −1. We have by Cauchy Schwarz
b
−1 −1 ⊤ −1 a . For each term, we have
|a⊤ Eb∼pt [bb⊤ ] at | ≤ a⊤ (Eb∼pt [bb⊤ ]) a a⊤
t (Eb∼pt [bb ]) t
−1 qt ⊤ −1 qt
a⊤ Eb∼pt [bb⊤ ] a≤ a Eb∼ν [bb⊤ ] a= ,
dηt ηt
54
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
due to the properties of John’s exploration. Hence |ηt ℓbt (a)| ≤ |ℓt (a)| ≤ 1. We can apply Lemma 31
for the stability term, resulting in
1
Et max ⟨Pt − P, ℓt ⟩ − Dψ (P, Pt )
P ηt
" −1 ⊤ −1 a
#
a⊤ Eb∼pt [bb⊤ ] at a⊤
t Eb∼pt [bb ]
X
≤ ηt Et Pt (a)
a
qt2
−1
X a⊤ Eb∼pt [bb⊤ ] a
= ηt Pt (a)
a
qt
−1
X a⊤ Eb∼Pt [bb⊤ ] a 2ηt d
= 2ηt Pt (a) = .
a
q t q t
τ τ τ
" # " #
X X X X X dηt
Et (Pt (a) − pt (a))ℓt (a) =
b Et (Pt (a) − pt (a))ℓt (a) ≤ .
a a
qt
t=1 t=1 t=1
τ X τ
" # " #
X log |X | X 3dη t
E pt (a)ℓt (a) − ℓt (a⋆ ) ≤ E +
a
ητ qt
t=1 t=1
v
u τ
u X 1 1
≤ E 7td log |X | + 2 log |X |d .
qt mint≤τ qt
t=1
In this section, we use the more standard notation Π to denote the policy class.
55
DANN W EI Z IMMERT
Algorithm 10 EXP4
Input: Π (policy class), K (number of arms)
for t = 1, 2, . . . do
Receive context xt .
Receive update probability qt .
Let
t−1
s !
ln |Π| X
ηt = , Pt (π) ∝ exp −ηt ℓbτ (π(xτ )) .
K tτ =1 q1τ
P
τ =1
P
Sample an arm at ∼ pt where pt (a) = π:π(xt )=a Pt (π)π(xt ).
With probability qt , receive ℓt (at ) (in this case, set updt = 1; otherwise, set updt = 0).
Construct loss estimator:
updt I[at = a]ℓt (a)
ℓbt (a) = .
qt pt (a)
end
Proof [Proof of Lemma 10] Consider the EXP4 algorithm with adaptive stepsize (Algorithm 10),
which corresponds to FTRL with negentropy regularization. By standard analysis of FTRL
(Lemma 27) and Lemma 31, for any τ and any π ⋆
τ τ τ
" # " #
h i ln |Π| X ηt
X X X X
⋆ 2
Et Pt (π)ℓbt (π(xt )) − Et ℓbt (π (xt )) ≤ + Et Pt (π)ℓbt (π(xt ))
π
ητ 2 π
t=1 t=1 t=1
τ
ln |Π| X ηt X ℓt (π(xt ))
≤ + Pt (π)
ητ 2 π
qt pt (π(xt ))
t=1
v
u τ
ln |Π| ηt u X 1
≤ +K ≤ 2 K ln |Π|
t .
ητ 2qt qt
t=1
56
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
for t = 1, 2, . . . do
Receive update probability qt .
Let
( t−1 )
X 1
pt = argmin ⟨x, ℓbτ ⟩ + ψ(x)
x∈∆([K]) τ =1 ηt
s
log K
where ηt = Pt 1
.
s=1 qs (1 + min{e α, α log K})
Sample At ∼ pt .
With probability qt receive ℓt,i for all At ∈ N (i) (in this case, set updt = 1; otherwise, set
updt = 0).
Define
!
(ℓt,i − I{i ∈ It })I{A t ∈ N (i)}
ℓbt,i = updt P + I{i ∈ It } ,
j∈N (i) pt,j qt
1
where It = i ∈ [K] : i ̸∈ N (i) ∧ pt,i > and N (i) = {j ∈ G \ x
b | (j, i) ∈ E}.
2
end
Lemma 43 Let G = (V, E) be a directed graph with independence number α and vertex weights
pi > 0, then
pi 2pi α
∃i ∈ V : P ≤P .
pi + j:(j,i)∈E pj i pi
Proof Without loss of generality, assume that p ∈ ∆(V ), since the statement is scale invariant. The
statement is equivalent to
X 1
min pi + pj ≥ .
i 2α
j:(j,i)∈E
Let
X X X
min min pi + pj ≤ min p2i + pi pj .
p∈∆(V ) i p∈∆(V )
j:(j,i)∈E i j:(j,i)∈E
57
DANN W EI Z IMMERT
Via K.K.T. conditions, there exists λ ∈ R such that for an optimal solution p⋆ it holds
X X
∀i ∈ V : either 2p⋆i + p⋆j + p⋆j = λ
j:(j,i)∈E j:(i,j)∈E
X X
or 2p⋆i + p⋆j + p⋆j ≥ λ and p⋆i = 0 .
j:(j,i)∈E j:(i,j)∈E
Next we bound λ. Take the sub-graph over V+ = {i : p⋆i > 0} and take a maximally independence
set S over V+ (|S| ≤ α, since V+ ⊂ V ). We have
X X X X
1≤ p⋆j + 2p⋆i + p⋆j + p⋆j = |S|λ ≤ αλ.
j̸∈V+ i∈S j:(j,i)∈E j:(i,j)∈E
| {z }
=0
Hence λ ≥ α1 . Finally,
X X 1 X X X X p⋆ λ 1
(p⋆i )2 + p⋆i p⋆j = 2(p⋆i )2 + p⋆i p⋆j + p⋆i p⋆j ≥ i
≥ .
2 2 2α
i j:(j,i)∈E i j:(j,i)∈E j:(i,j)∈E i
Lemma 44 (Lemma 10 Alon et al. (2013)) let G = (V, E) be a directed graph. Then, for any
distribution p ∈ ∆(V ) we have:
p
P i
X
≤ mas(G) ,
pi + j:(j,i)∈E pj
i
where mas(G) ≤ α
e is the maximal acyclic sub-graph.
58
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
We split the vertices in three sets. Let M1 = {i | i ∈ N (i)} the nodes with self-loops. Let M2 =
{i | i ̸∈ M1 , ̸∈ It } and M3 = It . we have
" #
ηt X 1+ log1K
2
Et pt,i (ℓbt,i + ct )
2
i
1+ log1 K
p 4p
P t,i t,i
X X X
≤ ηt Et updt Pt,i c2t + + + pt,It ,
( j∈N (i) pt,j qt )2 qt2
i∈M1 ∪M2 i∈M1 i∈M2
P 1
since for any i ∈ M2 , we have j∈N (i) pt,j > 2 . The first term is in expectation
4pt,i updt
hP i
2 ≤ 1 , the third term is in expectation E ≤ q4t , while
P
Et p c
i∈M1 ∪M2 t,i t qt t i∈M2 qt2
the second is
1+ 1 1+ 1
X pt,i log K updt 2 X pt,i log K
Et ≤ .
( j∈N (i) pt,j qt )2
P P
qt j∈N (i) pt,j
i∈M1 i∈M1
The subgraph consisting of M1 contains sef-loops, hence we can apply Lemma 44 to bound this
term by 2e α
qt . If α log K < α e, we instead take a permutation pe of p, such that M1 is in position
1, .., |M1 | and for any i ∈ [|M1 |]
pet,i pet,i 2αept,i
P ≤P ≤ Pi .
j∈N (i) p
et,j j∈N (i)∩[i] p
et,j j=1 p
et,j
The existence of such a permutation is guaranteed by Lemma 43: Apply Lemma 43 on the graph
M1 to select p̃|M1 | . Recursively remove the i-th vertex from the graph and apply Lemma 43 on the
resulting sub-graph to pick p̃i−1 . Applying this permutation yields
1 1
log K
1+ M1 1+ log K M1
2 X pt,i 4α X pe 4α X pei 4α log K
P ≤ Pi i ≤ 1− 1 ≤ .
qt j∈N (i) pt,j qt j=1 p
ej qt P
i log K qt
i∈M1 i=1 i=1
j=1 p
ej
Combining everything
1 ηt
Et max⟨p − pt , ℓbt + ct 1⟩ − Dψ (p, pt ) ≤ (6 + 4 min{e
α, α log K}) .
p ηt qt
By the definition of the learning rate, we obtain
τ τ
X 2e log K X ηt
Et [⟨pt , ℓbt ⟩ − ℓbt,a⋆ ] ≤ + (6 + 4 min{eα, α log K})
ητ qt
t=1 t=1
v
u τ
u X 1
≤ 18 (1 + min{e
t α, α log K}) log K .
qt
t=1
59
DANN W EI Z IMMERT
Sample At ∼ pt .
With probability qt receive ℓt,i for all At ∈ N (i) (in this case, set updt = 1; otherwise, set
updt = 0).
Define
ℓt,i updt I{At ∈ N (i)}
ℓbt,i = P .
j∈N (i) pt,j qt
end
Proof [Proof of Lemma 17] Consider the EXP3 algorithm (Algorithm 12). By standard analysis of
FTRL (Lemma 27) and Lemma 31 by the non-negativity of all losses, for any τ and any a⋆ ,
τ τ
" #
X X X h i
Et Pt (a)ℓbt (a) − Et ℓbt (a⋆ )
t=1 a t=1
τ
" #
log K X ηt X
≤ + Et Pt,i ℓb2t,i
ητ 2
t=1 i
τ
log K X ηt X 1
≤ + Pt,i P
ητ 2 j∈N (i) pt,j qt
t=1 i
s
τ
log K X ηt δ
≤ +
ητ 4qt
t=1
60
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
q
ηt P δ ηt δ
1 1
where in the last inequality we use i Pt,i × γt × qt
= qt . We have due to Et [⟨Pt −pt , ℓt ⟩] ≤
b
2 2
γt
s
τ τ τ
" #
X X X h i log K X ηt δ
⋆
Et pt (a)ℓbt (a) − Et ℓbt (a ) ≤ + 9
a
η τ 4qt
t=1 t=1 t=1
τ
!2
3
1 X 1 4δ log(K)
≤ 6(δ log(K)) 3 √ + .
qt mint≤T qt
t=1
X
ℓ̃t,bx = ℓt,bx I{(b b) ∈ E} −
x, x pt,j ℓt,j I{(j, j) ̸∈ E}
x}
j∈[K]\{b
∀j ∈ [K] \ {b
x} : ℓ̃t,j = ℓt,j I{(j, j) ∈ E} − ℓt,bx I{(b b) ̸∈ E} ,
x, x
where pt is the distribution of the base algorithm B over [K] \ {b x} at round t. By construction
and the definition of strongly observable graphs, ℓ̃t,j is observed when playing arm j. (When the
player does not have access to pt , one can also sample one action from the current distribution for
an unbiased estimate of ℓt,bx .) The losses ℓ̃t are in range [−1, 1] instead of [0, 1] and can further be
shifted to be strictly non-negative. Finally observe that
X
E[ℓ̃t,At − ℓ̃t,bx ] = qt,2 pt,j ℓ̃t,j − ℓ̃t,bx
x}
j∈[K]\{b
X
= qt,2 pt,j ℓt,j − ℓt,bx
x}
j∈[K]\{b
= E[ℓt,At − ℓt,bx ],
as well as
X X
E ℓ̃t,At − pt,j ℓ̃t,j = qt,1 ℓ̃t,bx − pt,j ℓ̃t,j
x}
j∈[K]\{b x}
j∈[K]\{b
X
= qt,1 ℓ̃t,bx − pt,j ℓt,j
x}
j∈[K]\{b
X
= E ℓt,At − pt,j ℓt,j .
x}
j∈[K]\{b
61
DANN W EI Z IMMERT
That means running Algorithm 2, replacing ℓt by ℓ̃t allows to apply Theorem 11 to strongly observ-
able graphs where not every arm receives its own loss as feedback.
We refer the readers to Section C.3 of Lee et al. (2020) for the setting descriptions and notations,
since we will follow them tightly in this section.
Proof The regret is decomposed as the following (similar to Section C.3 in Lee et al. (2020)):
T
X T
X T
X T
X
⋆
⟨wt − u , ℓt ⟩ = ⟨wt − w
bt , ℓt ⟩ + ⟨w
bt , ℓt − ℓt ⟩ +
b ⟨w
bt − u, ℓbt ⟩
t=1 t=1 t=1 t=1
| {z } | {z } | {z }
Error Bias-1 Reg-term
T
X T
X
+ ⟨u, ℓbt − ℓt ⟩ + ⟨u − u⋆ , ℓt ⟩,
t=1 t=1
| {z } | {z }
Bias-2 Bias-3
with the same definition of u as in (24) of Lee et al. (2020). Bias-3 ≤ H trivially (see (25) of
Lee et al. (2020)). For the remaining four terms, we use Lemma 49, Lemma 50, Lemma 51, and
Lemma 52, respectively. Combining all terms finishes the proof.
Lemma 46 (Lemma C.2 of Lee et al. (2020)) With probability 1 − O(δ), for all t, s, a, s′ ,
ϵt (s′ |s, a)
P (s′ |s, a) − P t (s′ |s, a) ≤ .
2
Lemma 47 (cf. Lemma C.3 of Lee et al. (2020)) With probability at least 1 − δ, for all h,
T
X X qt′ · wt (s, a)
= O (|Sh |A log(T ) + ln(H/δ))
max{1, Nt (s, a)}
t=1 s∈Sh ,a∈A
Proof The proof of this lemma is identical to the original one — only need to notice that in our
case, the probability of obtaining a sample of (s, a) is qt′ · wt (s, a).
62
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
for t = 1, 2, . . . do
b, updt = 1; otherwise, updt = 1 w.p. qt and updt = 0 w.p. 1 − qt .
If πt = π
If updt = 1, obtain the trajectory sh , ah , ℓt (sh , ah ) for all h = 1, . . . , H.
Construct loss estimators
upd ℓt (s, a)It (s, a)
ℓbt (s, a) = ′ t · , where It (s, a) = I{(sh(s) , ah(s) ) = (s, a)}
qt ϕt (s, a)
Nt+1 (s, a) ← Nt (s, a) + It (s, a), Nt+1 (s, a, s′ ) ← Nt (s, a, s′ ) + It (s, a)I{sh(s)+1 = s′ }.
Nt+1 (s,a,s′ )
where P t+1 (s′ |s, a) = max{1,Nt+1 (s,a)} and
s
P t+1 (s′ |s, a) ln(SAT /δ) 28 ln(SAT /δ)
ϵt+1 (s′ |s, a) = 4 +
max{1, Nt+1 (s, a)} 3 max{1, Nt+1 (s, a)}
Pt P updτ Iτ (s,a)ℓτ (s,a)2 H S 2 A ln(SAT )
if τ =t⋆ s,a qt′2
+ maxτ ≤t qτ′2
≥ ηt2
then
ηt
ηt+1 ← 2
wbt+1 = argminw∈∆(Pt+1 )∩Ω ψ(w)
t⋆ ← t + 1.
end
else
ηt+1 = ηt n o
1
wbt+1 = argminw∈∆(Pt+1 )∩Ω ⟨w, ℓbt ⟩ + ηt Dψ (w, w
bt ) .
end
Update policy πt+1 = π wbt+1 .
end
63
DANN W EI Z IMMERT
Lemma 48 (cf. Lemma C.6 of Lee et al. (2020)) With probability at least 1 − O(δ), for any t and
any collection of transition functions {Pts }s∈S such that Pts ∈ Pt for all s, we have
v
T u T 2⟩ 5 A2 ι2
X X s
u X ⟨w , ℓ
t t 2 S
wPt ,πt (s, a) − wt (s, a) ℓt (s, a) = O S tHA ι + .
qt′ mint qt′
t=1 s∈S,a∈A t=1
Proof Following the proof of Lemma C.6 in Lee et al. (2020), we have
T
s
X X
wPt ,πt (s, a) − wt (s, a) ℓt (s, a) ≤ B1 + SB2
t=1 s∈S,a∈A
where
s s
T X X
X P (s′ |s, a)ι P (x′ |x, y)ι
B2 ≤ O wt (s, a) wt (x, y|s′ )
′ ′
max{1, Nt (s, a)} max{1, Nt (x, y)}
t=1 s,a,s x,y,x
T X X
X w t (s, a)ι
+O (32)
′ ′
max{1, Nt (s, a)}
t=1 s,a,s x,y,x
T X
X wt (s, a)ι
≤ HS
max{1, Nt (s, a)}
t=1 s,a,s′
HS 2 ι
≤O (SA ln(T ) + H log(H/δ)) . (by Lemma 47)
mint qt′
The second term in (32) can be upper bounded by the order of
T X 3
2
X wt (s, a)ι S Aι
S A =O (SA ln(T ) + H log(H/δ)) . (by Lemma 47)
′
max{1, Nt (s, a)} mint qt′
t=1 s,a,s
64
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
Next, we bound B1 . By the same calculation as in Lemma C.6 of Lee et al. (2020),
B1
s
T X X T X
X P (s′ |s, a)ι X w t (x, a)ι
≤ O wt (s, a) wt (x, y|s′ )ℓt (x, y) + H
max{1, N t (s, a)} max{1, N t (s, a)}
t=1 s,a,s′ x,y t=1 s,a,s′
s
T X X ′ 2 2
X P (s |s, a)ι HS Aι
≤ O wt (s, a) wt (x, y|s′ )ℓt (x, y) +
′ x,y
max{1, Nt (s, a)} mint qt′
t=1 s,a,s
For the first term above, we consider the summation over (s, a, s′ ) ∈ Th ≜ Sh × A × Sh+1 . We
continue to bound it by the order of
s
T
X X X P (s′ |s, a)ι
wt (s, a) wt (x, y|s′ )ℓt (x, y)
′ x,y
max{1, N t (s, a)}
t=1 s,a,s ∈Th
T
X 1 X X
≤α wt (s, a)P (s′ |s, a)wt (x, y|s′ )ℓt (x, y)2
qt′
t=1 s,a,s′ ∈Th x,y
T
1X X X ι
+ qt′ wt (s, a)wt (x, y|s′ ) ·
α max{1, Nt (s, a)}
t=1 s,a,s′ ∈Th x,y
(by AM-GM, holds for any α > 0)
T T
X 1 X 2 H|Sh+1 | X X qt′ wt (s, a)ι
≤α w t (x, y)ℓ t (x, y) +
qt′ x,y α Nt (s, a)
t=1 t=1 s,a∈Sh ×A
T
X ⟨wt , ℓ2 ⟩ t H|Sh+1 ||Sh |Aι2
≤α + (by Lemma 47)
qt′ α
t=1
v
u T 2 2
u X ⟨wt , ℓt ⟩ι
≤ O tH|Sh ||Sh+1 |A (choose the optimal α)
qt′
t=1
v
u T 2 ⟩ι2
u X ⟨w ,
t tℓ
≤ O (|Sh | + |Sh+1 |)tHA
qt′
t=1
Therefore,
v
u T 2 ⟩ι2 2 Aι2
X u X ⟨w ,
t t ℓ HS
B1 ≤ O (|Sh | + |Sh+1 |)tHA +
qt′ mint qt′
h t=1
v
u T 2 2 2 2
u X ⟨wt , ℓt ⟩ι HS Aι
≤ O S tHA + .
qt′ mint qt′
t=1
65
DANN W EI Z IMMERT
by our choice of δ.
Thus,
h i X
Et ⟨w
bt , ℓt − ℓbt ⟩ ≤ |ϕt (s, a) − wt (s, a)|ℓt (s, a) + O(HI[E t ])
s,a
66
A B LACKBOX A PPROACH TO B EST OF B OTH W ORLDS IN BANDITS AND B EYOND
r
⟨wt ,ℓ2t ⟩ 2 S 5 A2 ι2
By Lemma 48, (⋆) is upper bounded by O S HA Tt=1
P
qt′
ι + mint qt′
with probability
1 − O(δ). Taking expectations on both sides and using that Pr[E t ] ≤ δ, we get
" T v
T
# u
⟨w , ℓ2⟩ S 5 A2 ι 2
t t 2
X u X
E ⟨w
bt , ℓt − ℓbt ⟩ ≤ O E S tHA ι + + O (δSAT )
qt′ mint qt′
t=1 t=1
v
u T 2⟩ 5 A2 ι 2
u X ⟨w , ℓ
t t 2 S
≤ O E S tHA ι + .
qt′ mint qt′
t=1
Proof Let Et be the event that P ∈ Pτ for all τ ≤ t. By the construction of the loss estimator, we
have
h i
Et ⟨u, ℓbt − ℓt ⟩ Et ≤ 0
and thus
h i
Et ⟨u, ℓbt − ℓt ⟩ ≤ I[E t ] · H · T 3 S 2 A (by Lemma C.5 of Lee et al. (2020), ℓbt (s, a) ≤ T 3 S 2 A)
and
" T # " T
#
X X
E ⟨u, ℓbt − ℓt ⟩ ≤ HT 4 S 2 AE I[E t ] ≤ O(δHT 5 S 2 A) ≤ O(1),
t=1 t=1
Proof By the same calculation as in the proof of Lemma C.10 in Lee et al. (2020),
bt ) − Dψ (u, w
Dψ (u, w bt+1 ) X
⟨w
bt − u, ℓbt ⟩ ≤ + ηt bt (s, a)2 ℓbt (s, a)2
w
ηt s,a
bt ) − Dψ (u, w
Dψ (u, w bt+1 ) X upd It (s, a)ℓt (s, a)2
t
≤ + ηt ′2 .
ηt s,a
q t
67
DANN W EI Z IMMERT
Let t1 , t2 , . . . be the time indices when ηt = ηt−12 , and let ti be the last time this happens. Summing
⋆
I.2. Corraling
We use Algorithm 8 as the corral algorithm for the MDP setting, with Algorithm 13 being the base
algorithm. The guarantee of Algorithm 8 is provided in Lemma 42.
Proof [Proof of Theorem 26] To apply Lemma 42, we have to perform re-scaling on the loss because
for MDPs, the loss of a policy in one round is H. Therefore, scale down all losses by a factor of H1 .
Then by Lemma 45, the base algorithm satisfies the condition in Lemma 42 with c1 = S 2 Aι2 and
c2 = S 5 A2 ι2 /H where ξt,π is defined as ℓ′t,π = ℓt,π /H, the expected loss of policy π after scaling.
Therefore, by Lemma 42, the we can transform it to an 12 -dd-LSB algorithm with c1 = S 2 Aι2 and
c2 = S 5 A2 ι2 /H. Finally, using Theorem 23 and scaling back the loss range, we can get
q
2 2 2 5 2 2 2
O S AHL⋆ log (T )ι + S A log (T )ι
68