Asymptotically Optimal Policies for Markovian
Restless Bandits
December 15, 2022
Presented by Chen Yan
Reviewers: David Goldberg, Bruno Scherrer
Examiners: Benjamin Legros, Jérôme Malick, Kim Thang Nguyen
Supervisors: Nicolas Gast, Bruno Gaujal
Table of Contents
1 Motivation and historical review
2 Infinite horizon case
3 Finite horizon case
4 Generalization to WCMDP
5 Conclusion and future work
Theme: Markovian bandits
Left: Stochastic bandits
Right: Markovian bandits
Theme: Markovian bandits
Many Markovian processes ("arms")
Taking an action on an arm changes its Markovian evolution
Each arm generates rewards depending on its state
Applying an action consumes resources
We want to maximize the rewards subject to the resource constraint
Important: all parameters of the model are known
This is an optimization problem of a computational nature
Example: to interview or not to interview
Applicants apply for a job, each with a quality level unknown to us
Arm: an applicant
Action: interview, not interview
State: the current belief about the quality
Evolution: Bayesian update depending on the results of an interviewing round
Goal: optimize the quality levels of the admitted applicants
Example: to observe or not to observe
Each channel evolves as a 2-state Markov chain. We can learn the state of a channel only by observing it, and only a channel in the "good" state can transmit data
Arm: a channel
Action: observe, not observe
State: the pair formed by the last observed channel state and the time elapsed since the most recent observation
Goal: optimize the data throughput
Markovian restless bandit: historical review
The special case of the discounted rested bandit can be solved by the Gittins index policy [Gittins 1979]
For the more general restless bandit (RB), the problem is PSPACE-hard [Papadimitriou and Tsitsiklis 1999]
Peter Whittle proposed the Whittle index policy (WIP) that generalizes the Gittins index policy to RB [Whittle 1988]
Under several additional technical assumptions, the WIP is asymptotically optimal in the large N limit: $\lim_{N \to \infty} V^{(N)}_{\mathrm{WIP}} = V_{\mathrm{LP}}$ [Weber and Weiss 1990]
Markovian restless bandit: historical review
Sub-optimality gap: $V_{\mathrm{LP}} - V^{(N)}_{\mathrm{WIP}}$
“. . . The size of the asymptotic sub-optimality of the index policy was
no more than 0.002% in any example. The evidence so far is that the
degree of sub-optimality is very small. It appears that in most cases
the index policy is a very good heuristic.”
—— [Weber and Weiss 1990]
Question: how fast does this gap converge to 0?
Our contribution
We study the problem with N statistically identical arms
Under both infinite and finite horizon:
We propose a class of heuristics that incorporates the WIP
The sub-optimality gap of these heuristics can be either $O(1/\sqrt{N})$ or $e^{-O(N)}$, depending on:
1 Infinite horizon: global attractor + non-degeneracy
2 Finite horizon: non-degeneracy
We generalize our results from the RB model to weakly coupled MDPs (WCMDP)
2 Infinite horizon case
The model: general formulation
An arm is modeled with
a state space: S = {1 . . . S}
an action space: A = {0, 1}
evolution: Markov transition kernels depending on the action
reward: depending on the state and action
Each arm alone is itself a Markov decision process (MDP)
It constitutes one component of the whole RB
The model: general formulation
We have N such arms that evolve independently, giving rise to a very
large MDP for the bandit as a whole.
They are coupled via a budget constraint: αN arms are pulled at each
time period with 0 < α < 1
Goal
Maximize the total expected reward subject to the budget constraint.
Intuition: decoupling for large N
When N grows, the following two things take place:
The problem size grows exponentially
The coupling between arms weakens
The model: relaxation by taking expectation
$M_s(t)$: the fraction of arms in state $s$ at time $t$
Given $M(t)$, a policy $\pi$ decides $Y_{s,a}(t)$: the fraction of arms in state $s$ taking action $a$ at time $t$
$$\max_{\pi}\ \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[\sum_{t=1}^{T} \sum_{s,a} R_s^a \cdot Y_{s,a}(t)\Big]$$
s.t. $Y_{s,0}(t) + Y_{s,1}(t) = M_s(t)$ $\forall t, s$,
$\sum_s Y_{s,1}(t) = \alpha$ $\forall t$,
$M(t) \xrightarrow{\ \pi\ } Y(t) \xrightarrow{\text{Markov transition}} M(t+1)$ $\forall t$
The model: relaxation by taking expectation
$$\sum_s Y_{s,1}(t) = \alpha \ \ \forall t \quad \xrightarrow{\ \text{relaxed}\ } \quad \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[\sum_{t=1}^{T} \sum_s Y_{s,1}(t)\Big] = \alpha$$
Denote by $y_{s,a} := \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\pi\big[Y_{s,a}(t)\big]$
⇒ A single constraint $\sum_s y_{s,1} = \alpha$ after the relaxation
The model: relaxation by taking expectation
Originally: a stochastic N-armed bandit involving the variables $Y_{s,a}$
After relaxation: a deterministic LP involving the variables $y_{s,a}$
Original stochastic program:
$$\max_{\pi}\ \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[\sum_{t=1}^{T} \sum_{s,a} R_s^a \cdot Y_{s,a}(t)\Big]$$
s.t. $\sum_a Y_{s,a}(t) = M_s(t)$ $\forall t, s$, $\quad \sum_s Y_{s,1}(t) = \alpha$ $\forall t$,
$M(t) \xrightarrow{\ \pi\ } Y(t) \xrightarrow{\text{Markov transition}} M(t+1)$ $\forall t$
Relaxed LP:
$$\max_{y \ge 0}\ \sum_{s,a} R_s^a \cdot y_{s,a}$$
s.t. $\sum_{s,a} y_{s,a} = 1$, $\quad \sum_s y_{s,1} = \alpha$,
$m_s = y_{s,0} + y_{s,1} = \sum_{s',a} y_{s',a} \cdot P^a_{s's}$ $\forall s$
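To make the construction concrete, here is a minimal numerical sketch of this relaxed LP (our own illustration; the function name, the arguments P0, P1, R0, R1, alpha and the use of scipy.optimize.linprog are assumptions, not part of the thesis):

```python
import numpy as np
from scipy.optimize import linprog

def solve_relaxed_lp(P0, P1, R0, R1, alpha):
    """Solve the infinite-horizon relaxed LP over the variables y[s, a] >= 0.

    P0, P1 : (S, S) row-stochastic transition matrices for actions 0 and 1
    R0, R1 : length-S reward vectors for actions 0 and 1
    alpha  : fraction of arms pulled at each time step
    """
    S = len(R0)
    n = 2 * S                                   # flatten y[s, a] as y[2*s + a]
    c = -np.stack([np.asarray(R0), np.asarray(R1)], axis=1).ravel()  # linprog minimizes

    A_eq, b_eq = [], []
    # Total mass: sum_{s,a} y[s,a] = 1
    A_eq.append(np.ones(n)); b_eq.append(1.0)
    # Budget: sum_s y[s,1] = alpha
    row = np.zeros(n); row[1::2] = 1.0
    A_eq.append(row); b_eq.append(alpha)
    # Stationarity: y[s,0] + y[s,1] = sum_{s',a} y[s',a] * P^a[s', s]
    # (one of these S rows is redundant and is harmlessly handled by the solver)
    for s in range(S):
        row = np.zeros(n)
        row[2 * s] += 1.0
        row[2 * s + 1] += 1.0
        for sp in range(S):
            row[2 * sp] -= P0[sp, s]
            row[2 * sp + 1] -= P1[sp, s]
        A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n)
    return res.x.reshape(S, 2)                  # y_star[s, a]
```

The support pattern of the returned y_star is what the next slide uses to define the LP-priority policies.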
Construction of the heuristic from the LP
Suppose we obtain an optimal solution y∗ from the LP
Define:
$S^{+} := \{s \mid y^*_{s,1} > 0,\ y^*_{s,0} = 0\}$
$S^{0} := \{s \mid y^*_{s,1} > 0,\ y^*_{s,0} > 0\}$
$S^{-} := \{s \mid y^*_{s,1} = 0,\ y^*_{s,0} > 0\}$
$S^{\emptyset} := \{s \mid y^*_{s,1} = 0,\ y^*_{s,0} = 0\}$
An LP-priority policy is any strict priority policy such that the priority order satisfies $S^{+} > S^{0} > S^{-} \cup S^{\emptyset}$.
Remark: the WIP is an LP-priority policy
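As an illustration, a minimal sketch (with hypothetical helper names of our own) of how the partition and a strict-priority allocation could be coded from an optimal solution y*, e.g. the output of the LP sketch above:

```python
def classify_states(y_star, tol=1e-9):
    """Partition the states according to the support of an optimal y_star[s, a]."""
    S_plus  = [s for s in range(len(y_star)) if y_star[s, 1] > tol and y_star[s, 0] <= tol]
    S_zero  = [s for s in range(len(y_star)) if y_star[s, 1] > tol and y_star[s, 0] > tol]
    S_minus = [s for s in range(len(y_star)) if y_star[s, 1] <= tol and y_star[s, 0] > tol]
    S_empty = [s for s in range(len(y_star)) if y_star[s, 1] <= tol and y_star[s, 0] <= tol]
    return S_plus, S_zero, S_minus, S_empty

def priority_allocation(counts, priority, budget):
    """Pull `budget` arms following a strict priority order on the states.

    counts   : dict state -> number of arms currently in that state
    priority : list of states, highest priority first (S+ before S0 before the rest)
    budget   : integer number of pulls per time step (roughly alpha * N)
    """
    pulls, remaining = {s: 0 for s in counts}, budget
    for s in priority:
        pulls[s] = min(counts.get(s, 0), remaining)
        remaining -= pulls[s]
    return pulls        # pulls[s] arms in state s receive action 1
```

Any strict order that ranks $S^{+}$ before $S^{0}$ before the remaining states qualifies as an LP-priority policy; ties within a group may be broken arbitrarily.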
Performance theorem: infinite horizon
Infinite horizon RB performance theorem [Gast, Gaujal, and Yan
2020]
Any LP-priority policy that satisfies the global attractor condition and
the non-degenerate condition is asymptotically optimal with
exponential convergence rate:
$$0 \le V_{\mathrm{LP}} - V^{(N)}_{\mathrm{prio}} \le C_1 \cdot e^{-C_2 N}$$
The global attractor condition: the deterministic dynamics $\phi_{\mathrm{prio}}$ of the LP-priority policy has a global attractor
The non-degenerate condition: $|S^{0}| \ge 1$
Proof of exponential convergence rate (1)
An example with 3 states:
$\phi_{\mathrm{prio}}(\cdot)$ is a map from $\Delta_3$ to $\Delta_3$, which is continuous and piecewise linear with 3 linear pieces:
$$\phi_{\mathrm{prio}}(m) = \begin{cases} Z_1 \cdot m + b_1, & \text{if } m \in Z_1 \\ Z_2 \cdot m + b_2, & \text{if } m \in Z_2 \\ Z_3 \cdot m + b_3, & \text{if } m \in Z_3 \end{cases}$$
(Figure: the simplex in coordinates $(M_3, M_2)$, showing the partition lines for $\alpha = 0.5$, the fixed point $m^*(\alpha)$, and a deterministic trajectory.)
Proof of exponential convergence rate (2)
The global attractor condition implies that φprio has a unique fixed
point m∗ (∞)
The non-degenerate condition implies that m∗ (∞) is not on the
boundary of two linear pieces
(Figure: the same simplex partition for $\alpha = 0.5$, with the fixed point $m^*(\alpha)$ and a deterministic trajectory; the fixed point lies in the interior of one linear piece.)
Proof of exponential convergence rate (3)
We need to show
$$\mathbb{E}\Big[\big\|M^{(N)}(\infty) - m^*(\infty)\big\|_1\Big] = e^{-O(N)}$$
We choose a small neighbourhood $\mathcal{N}$ of $m^*(\infty)$ in which $\phi_{\mathrm{prio}}$ is linear (thanks to the non-degenerate condition)
(Figure: simulated trajectories for N = 100 and N = 1000 in coordinates $(M_3, M_2)$, together with the partition lines and the fixed point $m^*(\alpha)$ for $\alpha = 0.5$.)
Proof of exponential convergence rate (4)
By Hoeffding's inequality, $\mathbb{P}\big(M^{(N)}(\infty) \notin \mathcal{N}\big) = e^{-O(N)}$
Conditional on $M^{(N)}(\infty) \in \mathcal{N}$, we apply Stein's method and use local linearity to show $\mathbb{E}\big[\|M^{(N)}(\infty) - m^*(\infty)\|_1\big] = e^{-O(N)}$
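Putting the two estimates together (a sketch of the final step; the factor 2 bounds the $\ell_1$ distance between two points of the simplex):
$$\mathbb{E}\big[\|M^{(N)}(\infty) - m^*(\infty)\|_1\big] \;\le\; \mathbb{E}\big[\|M^{(N)}(\infty) - m^*(\infty)\|_1 \,\big|\, M^{(N)}(\infty) \in \mathcal{N}\big] \;+\; 2\, \mathbb{P}\big(M^{(N)}(\infty) \notin \mathcal{N}\big) \;=\; e^{-O(N)}$$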
Conditions for exponential convergence rate
The non-degenerate condition is easy to check, and almost
always holds
The global attractor condition is difficult to test
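One pragmatic option is to collect numerical evidence. Below is a minimal sketch (our own illustration, not a procedure from the thesis) that iterates the deterministic dynamics of a strict priority policy from many random starting points and checks whether they all converge to the same point:

```python
import numpy as np

def phi_prio(m, P0, P1, priority, alpha):
    """One step of the deterministic (mean-field) dynamics under a strict priority policy."""
    y1 = np.zeros(len(m))
    remaining = alpha
    for s in priority:                       # allocate the budget greedily by priority
        y1[s] = min(m[s], remaining)
        remaining -= y1[s]
    y0 = m - y1
    return y0 @ P0 + y1 @ P1                 # next configuration on the simplex

def attractor_evidence(P0, P1, priority, alpha, n_starts=200, horizon=2000, tol=1e-8):
    """Iterate from many random starting points and check whether all trajectories
    end up (numerically) at the same point. This is only evidence, not a proof."""
    rng = np.random.default_rng(0)
    limits = []
    for _ in range(n_starts):
        m = rng.dirichlet(np.ones(len(P0)))
        for _ in range(horizon):
            m = phi_prio(m, P0, P1, priority, alpha)
        limits.append(m)
    return max(np.linalg.norm(m - limits[0], 1) for m in limits) < tol
```

For the cyclic example on the next slide, such a check would report that the trajectories do not settle on a common fixed point.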
The global attractor condition revisited
A 4-state example with cyclic dynamics of $\phi_{\mathrm{prio}}$
(Figure, left: the deterministic trajectory, which cycles instead of converging. Figure, right: a simulated trajectory.)
3 Finite horizon case
Construction of the heuristic from the LP
In finite horizon, our heuristic is time-dependent
A policy is a sequence of T decision rules π = (π1 . . . πT ) such
that πt specifies the fraction of arms for both actions at time t:
If Y = πt (M), then N · Ys,a among the N · Ms arms in state s take
action a
Construction of the heuristic from the LP
$$\max_{y \ge 0}\ \sum_{t=1}^{T} \sum_{s,a} R_s^a \cdot y_{s,a}(t)$$
s.t. $y_{s,0}(1) + y_{s,1}(1) = m_s(1)$ $\forall s$,
$\sum_s y_{s,1}(t) = \alpha$ $\forall t$,
$m_s(t+1) = y_{s,0}(t+1) + y_{s,1}(t+1) = \sum_{s',a} y_{s',a}(t) \cdot P^a_{s's}$ $\forall s, t$
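A minimal sketch of how this finite-horizon LP could be assembled and solved numerically (our own illustration; the function name, the arguments P0, P1, R0, R1, m1 and the use of scipy.optimize.linprog are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def solve_finite_horizon_lp(P0, P1, R0, R1, alpha, m1, T):
    """Finite-horizon relaxed LP over y[t, s, a] >= 0, for t = 1..T (0-indexed below)."""
    S = len(R0)
    n = T * S * 2
    idx = lambda t, s, a: (t * S + s) * 2 + a

    # Objective: maximize sum_{t,s,a} R^a_s * y[t,s,a]   (linprog minimizes, so negate)
    c = np.zeros(n)
    for t in range(T):
        for s in range(S):
            c[idx(t, s, 0)] = -R0[s]
            c[idx(t, s, 1)] = -R1[s]

    A_eq, b_eq = [], []
    # Initial condition: y[0,s,0] + y[0,s,1] = m1[s]
    for s in range(S):
        row = np.zeros(n)
        row[idx(0, s, 0)] = 1.0
        row[idx(0, s, 1)] = 1.0
        A_eq.append(row); b_eq.append(m1[s])
    # Budget at every time step: sum_s y[t,s,1] = alpha
    for t in range(T):
        row = np.zeros(n)
        for s in range(S):
            row[idx(t, s, 1)] = 1.0
        A_eq.append(row); b_eq.append(alpha)
    # Flow: y[t+1,s,0] + y[t+1,s,1] = sum_{s',a} y[t,s',a] * P^a[s', s]
    for t in range(T - 1):
        for s in range(S):
            row = np.zeros(n)
            row[idx(t + 1, s, 0)] = 1.0
            row[idx(t + 1, s, 1)] = 1.0
            for sp in range(S):
                row[idx(t, sp, 0)] -= P0[sp, s]
                row[idx(t, sp, 1)] -= P1[sp, s]
            A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n)
    return res.x.reshape(T, S, 2)    # y_star[t, s, a]
```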
Construction of the heuristic from the LP
$\pi_t(M(t)) =\ ?$
The LP suggests $\pi_t(m^*(t)) = y^*(t)$, but $M(t) \neq m^*(t)$ due to stochastic noise
Main idea:
$\pi_t(\cdot)$ should be a "regular" function that takes values in the feasible region and satisfies $\pi_t(m^*(t)) = y^*(t)$.
Construction of the heuristic from the LP
A policy $\pi = (\pi_1, \dots, \pi_T)$ is called
Admissible: if for all $t$, writing $Y(t) = \pi_t(M(t))$, we have
$$\sum_s Y_{s,1}(t) = \alpha, \quad \text{and} \quad \sum_a Y_{s,a}(t) = M_s(t) \ \ \forall s$$
LP-compatible: if $\pi_t(m^*(t)) = y^*(t)$ $\forall t$
Performance theorem: finite horizon
Finite horizon RB performance theorem [Gast, Gaujal, and Yan
2021]
Let π be an admissible and LP-compatible policy. Then
1 If the $\pi_t$ are all Lipschitz continuous, then $V_{\mathrm{LP}} - V^{(N)}_{\pi} \le C/\sqrt{N}$
2 If the $\pi_t$ are all locally linear, then $V_{\mathrm{LP}} - V^{(N)}_{\pi} \le C_1 \cdot e^{-C_2 N}$
Locally linear: there exists $\varepsilon > 0$ such that $\pi_t(\cdot)$ is linear on the ball of radius $\varepsilon$ centered at $m^*(t)$.
Performance theorem: finite horizon
Local linearity cannot be achieved for every model!
Characterization of the non-degenerate condition [Gast, Gaujal, and Yan 2021]
For a finite horizon RB, there exists an admissible and LP-compatible policy $\pi$ that is locally linear if and only if the non-degenerate condition is met.
The non-degenerate condition: $|S^{0}(t)| \ge 1$ for all $1 \le t \le T$, where
$S^{0}(t) := \{s \mid y^*_{s,1}(t) > 0,\ y^*_{s,0}(t) > 0\}$
The non-degenerate condition revisited
(Figure: the green curve shows the deterministic trajectory, the orange curve the stochastic one, and the blue region the locally linear neighbourhood.)
The non-degenerate condition revisited
In infinite horizon, the non-degenerate condition is almost always met: $|S^{0}| = 1$
Not true in finite horizon: $\sum_{t=1}^{T} |S^{0}(t)| = T$ almost always holds, but it may happen that $|S^{0}(t)| = 0$ and $|S^{0}(t')| = 2$ for some $t \neq t'$
Construction of π using water-filling
We can show that the policy induced by water-filling is
Lipschitz-continuous in general
locally linear if non-degenerate
Construction of π using water-filling
We fill a volume $\alpha$ of water into buckets enumerated by $s \in S$, each with capacity $M_s$:
1 Fill completely the buckets in $S^{+}$
2 Fill the buckets in $S^{0}$ with no more than a volume of $y^*_{s,1}$
3 Fill completely the buckets in $S^{0}$
4 Fill completely the buckets in $S^{-} \cup S^{\emptyset}$
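A minimal sketch of this water-filling allocation as we understand it (the names, and the use of fractions rather than integer arm counts, are our own simplifications):

```python
def water_filling(M, y_star1, S_plus, S_zero, S_rest, alpha):
    """Allocate a volume alpha of 'water' (action 1) into buckets of capacity M[s].

    M       : dict state -> fraction of arms currently in that state
    y_star1 : dict state -> LP target y*_{s,1}
    S_plus, S_zero, S_rest : the sets S+, S0, and S- union S_empty
    Returns the fraction Y[s] of arms in state s that take action 1.
    """
    Y = {s: 0.0 for s in M}
    water = alpha

    def pour(s, level):
        nonlocal water
        add = max(0.0, min(level - Y[s], water))
        Y[s] += add
        water -= add

    for s in S_plus:                      # 1. fill the S+ buckets completely
        pour(s, M[s])
    for s in S_zero:                      # 2. fill S0 buckets up to the LP target y*_{s,1}
        pour(s, min(M[s], y_star1[s]))
    for s in S_zero:                      # 3. if water remains, fill S0 buckets completely
        pour(s, M[s])
    for s in S_rest:                      # 4. spill the rest into S- and S_empty
        pour(s, M[s])
    return Y
```

When $M = m^*(t)$, this allocation returns exactly $y^*(t)$, which matches the LP-compatibility requirement.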
4 Generalization to WCMDP
Weakly coupled MDP
Multiple-action and multiple-constraint bandits:
⇒ weakly coupled MDP (WCMDP)
In finite horizon, our performance theorem for RB can be generalized
to WCMDP
Weakly coupled MDP
What is not obvious:
How to characterize the non-degenerate condition for WCMDP?
How to construct a policy that is Lipschitz-continuous in general,
and locally linear if non-degenerate?
Weakly coupled MDP
Our contribution in [Gast, Gaujal, and Yan 2022]:
Investigating the saturation of the constraints:
Generalize the non-degenerate condition from RB to WCMDP
The LP-update policy:
For each 1 ≤ t ≤ T , the decision πt (M(t)) is given by solving a
new LP with initial condition M(t) and horizon T − t
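As an illustration only, a sketch of this LP-update loop for the RB special case, reusing the hypothetical solve_finite_horizon_lp, classify_states and water_filling sketches from the earlier slides; the rounding step actually used in [Gast, Gaujal, and Yan 2022] may differ:

```python
import numpy as np

def lp_update(M1, T, P0, P1, R0, R1, alpha, simulate_one_step):
    """Re-solve the finite-horizon LP at every step and apply its first decision rule.

    simulate_one_step(M, Y1) stands for sampling the transitions of the N arms
    and is left abstract here.
    """
    M = {s: M1[s] for s in range(len(M1))}          # empirical configuration at t = 1
    for t in range(T):
        m_vec = np.array([M[s] for s in sorted(M)])
        # Fresh LP with the current configuration and the remaining horizon T - t
        y = solve_finite_horizon_lp(P0, P1, R0, R1, alpha, m_vec, T - t)
        y_now = y[0]                                # keep only the first decision rule
        S_plus, S_zero, S_minus, S_empty = classify_states(y_now)
        Y1 = water_filling(M, {s: y_now[s, 1] for s in M},
                           S_plus, S_zero, S_minus + S_empty, alpha)
        M = simulate_one_step(M, Y1)                # advance the N arms by one period
    return M
```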
5 Conclusion and future work
Recapitulation
In both infinite and finite horizon:
1 Identify and characterize a non-degenerate condition on the RB
2 Construct policies that achieve exponentially fast convergence if this condition is met
3 Show that the square-root rate cannot be improved if the RB degenerates
In finite horizon:
Generalize all the results from RB to WCMDP
The story continues...
Infinite horizon WCMDP:
Modeling arrival and departure of arms
Potential applications in multi-class queueing networks
Piecewise linear dynamics:
Identify other mean field models with piecewise linear drift
Establish faster than square root convergence rate
The story continues...
Is degeneracy the ultimate obstacle?
$V_{\mathrm{LP}} - V^{(N)}_{\mathrm{OPT}} = \Omega(1/\sqrt{N})$ because of degeneracy
Ultimate goal: $V^{(N)}_{\mathrm{OPT}} - V^{(N)}_{\text{powerful policy}} = O(1/N)$
Possible approach: $V_{???} - V^{(N)}_{\text{powerful policy}} = O(1/N)$ even when degenerate
References
Gast, Nicolas, Bruno Gaujal, and Chen Yan (2020). “Exponential Convergence Rate for the Asymptotic Optimality of Whittle Index
Policy”. In: arXiv preprint arXiv:2012.09064.
– (2021). “(Close to) Optimal Policies for Finite Horizon Restless Bandits”. In: arXiv preprint arXiv:2106.10067.
– (2022). “The LP-update policy for weakly coupled Markov decision processes”. In: arXiv preprint arXiv:2211.01961.
Gittins, John C (1979). “Bandit processes and dynamic allocation indices”. In: Journal of the Royal Statistical Society: Series B
(Methodological) 41.2, pp. 148–164.
Papadimitriou, Christos H. and John N. Tsitsiklis (1999). “The Complexity of Optimal Queueing Network Control”. In: Mathematics of Operations Research 24.2, pp. 293–305.
Weber, Richard R. and Gideon Weiss (1990). “On an Index Policy for Restless Bandits”. In: Journal of Applied Probability 27.3,
pp. 637–648. ISSN: 00219002.
Whittle, P. (1988). “Restless Bandits: activity allocation in a changing world”. In: Journal of Applied Probability 25A, pp. 287–298.