
Asymptotically Optimal Policies for Markovian Restless Bandits

December 15, 2022

Presented by Chen Yan

Reviewers: David Goldberg, Bruno Scherrer
Examiners: Benjamin Legros, Jérôme Malick, Kim Thang Nguyen
Supervisors: Nicolas Gast, Bruno Gaujal
Table of Contents

1 Motivation and historical review

2 Infinite horizon case

3 Finite horizon case

4 Generalization to WCMDP

5 Conclusion and future work



Theme: Markovian bandits

Figure: stochastic bandits (left) vs. Markovian bandits (right)



Theme: Markovian bandits

Many Markovian processes ("arms")

Taking an action on an arm changes its Markovian evolution

Each arm generates rewards depending on its state

Applying an action consumes resources

We want to maximize the total reward subject to the resource constraint

Important: all parameters of the model are known

It is an optimization problem of a computational nature



Example: to interview or not to interview

Applicants apply for a job, each with a quality level unknown to us

Arm: an applicant

Action: interview, not interview

State: belief value of the quality

Evolution: Bayesian update depending on the results of an interviewing round

Goal: optimize the quality levels of the admitted applicants



Example: to observe or not to observe

Each channel evolves as a two-state Markov chain. We can know the state of a channel only after observing it, and only a channel in the "good" state can transmit data

Arm: a channel

Action: observe, not observe

State: a two-tuple of the last observed state and the time elapsed since the most recent observation

Goal: optimize the data throughput



Markovian restless bandit: historical review

The special case of the discounted rested bandit can be solved by the Gittins index policy [Gittins 1979]

For the more general restless bandit (RB), the problem is PSPACE-hard [Papadimitriou and Tsitsiklis 1999]

Peter Whittle proposed the Whittle index policy (WIP), which generalizes the Gittins index policy to RB [Whittle 1988]

Under several additional technical assumptions, the WIP is asymptotically optimal in the large-N limit: $\lim_{N\to\infty} V^{(N)}_{\mathrm{WIP}} = V_{\mathrm{LP}}$ [Weber and Weiss 1990]



Markovian restless bandit: historical review

Sub-optimality gap: $V_{\mathrm{LP}} - V^{(N)}_{\mathrm{WIP}}$

"... The size of the asymptotic sub-optimality of the index policy was no more than 0.002% in any example. The evidence so far is that the degree of sub-optimality is very small. It appears that in most cases the index policy is a very good heuristic."
—— [Weber and Weiss 1990]

Question: how fast does this gap converge to 0?



Our contribution

We study the problem with N statistically identical arms

Under both infinite and finite horizon:

We propose a class of heuristics that incorporates the WIP

The sub-optimality gap of these heuristics is either $O(1/\sqrt{N})$ or $e^{-O(N)}$, depending on:
1 Infinite horizon: global attractor + non-degeneracy
2 Finite horizon: non-degeneracy

We generalize our results from the RB model to weakly coupled MDP



Table of Contents

1 Motivation and historical review

2 Infinite horizon case

3 Finite horizon case

4 Generalization to WCMDP

5 Conclusion and future work



The model: general formulation

An arm is modeled with

a state space: S = {1, . . . , S}
an action space: A = {0, 1}
evolution: Markov transition kernels depending on the action
reward: depending on the state and action

Each arm alone is itself a Markov decision process (MDP)
It constitutes one component of the whole RB
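To make the data concrete, here is a minimal sketch of one arm (not from the thesis; the names and the random numbers are purely illustrative): one transition kernel and one reward vector per action.

```python
import numpy as np

# Illustrative sketch of a single restless arm (hypothetical names, random data):
# one transition kernel and one reward vector per action a in {0, 1}.
rng = np.random.default_rng(0)
S = 3                                           # number of states

def random_kernel(size):
    """Return a random size x size row-stochastic matrix."""
    P = rng.random((size, size))
    return P / P.sum(axis=1, keepdims=True)

P = {0: random_kernel(S), 1: random_kernel(S)}  # P[a][s, s'] = transition probability under action a
R = {0: np.zeros(S), 1: rng.random(S)}          # R[a][s] = reward for taking action a in state s

def step(s, a):
    """One transition of the arm: collect the reward, then move according to P[a]."""
    reward = R[a][s]
    s_next = rng.choice(S, p=P[a][s])
    return s_next, reward
```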



The model: general formulation

We have N such arms that evolve independently, giving rise to a very large MDP for the bandit as a whole.

They are coupled via a budget constraint: αN arms are pulled at each
time period with 0 < α < 1

Goal
Maximize the total expected reward subject to the budget constraint.



Intuition: decoupling for large N

When N grows, the following two things take place:

The problem size grows exponentially

The coupling between arms weakens



The model: relaxation by taking expectation

Ms (t): the fraction of arms in state s at time t

Given M(t), a policy π decides Ys,a (t): the fraction of arms in state s
taking action a at time t

$$\max_{\pi}\ \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_{\pi}\Big[ \sum_{t=1}^{T} \sum_{s,a} R_s^a \cdot Y_{s,a}(t) \Big]$$
$$\text{s.t.}\quad Y_{s,0}(t) + Y_{s,1}(t) = M_s(t) \quad \forall t, s,$$
$$\sum_{s} Y_{s,1}(t) = \alpha \quad \forall t,$$
$$M(t) \xrightarrow{\ \pi\ } Y(t) \xrightarrow{\text{Markov transition}} M(t+1) \quad \forall t$$



The model: relaxation by taking expectation

$$\sum_{s} Y_{s,1}(t) = \alpha \ \ \forall t \quad \xrightarrow{\ \text{relaxed}\ } \quad \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_{\pi}\Big[ \sum_{t=1}^{T} \sum_{s} Y_{s,1}(t) \Big] = \alpha$$

Denote by $y_{s,a} := \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{\pi}[Y_{s,a}(t)]$

⇒ A single constraint $\sum_{s} y_{s,1} = \alpha$ after the relaxation



The model: relaxation by taking expectation

Originally: a stochastic N-armed bandit involving the variables Ys,a

After relaxation: a deterministic LP involving the variables ys,a

The stochastic program from the earlier slide relaxes to the deterministic LP:

$$\max_{y \ge 0}\ \sum_{s,a} R_s^a \cdot y_{s,a}$$
$$\text{s.t.}\quad \sum_{s,a} y_{s,a} = 1,$$
$$\sum_{s} y_{s,1} = \alpha,$$
$$m_s = y_{s,0} + y_{s,1} = \sum_{s',a} y_{s',a} \cdot P^a_{s's} \quad \forall s$$
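As an illustration only (not the authors' code), this LP can be solved with an off-the-shelf solver; the sketch below uses scipy's `linprog`, with `P`, `R` and `alpha` as in the single-arm sketch above and the variables stacked as x[2s + a] = y_{s,a}.

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp_relaxation(P, R, alpha):
    """Sketch: solve the infinite-horizon LP relaxation for a single arm type."""
    S = R[0].shape[0]
    n = 2 * S
    idx = lambda s, a: 2 * s + a

    # Objective: maximize sum_{s,a} R[a][s] * y_{s,a}  (linprog minimizes, so negate).
    c = np.zeros(n)
    for s in range(S):
        for a in (0, 1):
            c[idx(s, a)] = -R[a][s]

    A_eq, b_eq = [], []
    # (1) total mass: sum_{s,a} y_{s,a} = 1
    A_eq.append(np.ones(n)); b_eq.append(1.0)
    # (2) budget: sum_s y_{s,1} = alpha
    row = np.zeros(n)
    for s in range(S):
        row[idx(s, 1)] = 1.0
    A_eq.append(row); b_eq.append(alpha)
    # (3) stationarity: y_{s,0} + y_{s,1} = sum_{s',a} y_{s',a} * P[a][s', s] for all s
    #     (one of these rows is redundant with (1); the solver's presolve handles it)
    for s in range(S):
        row = np.zeros(n)
        row[idx(s, 0)] += 1.0
        row[idx(s, 1)] += 1.0
        for sp in range(S):
            for a in (0, 1):
                row[idx(sp, a)] -= P[a][sp, s]
        A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n, method="highs")
    y = res.x.reshape(S, 2)          # y[s, a] = optimal y_{s,a}
    return y, -res.fun               # optimal occupation measure and LP value
```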



Construction of the heuristic from the LP

Suppose we obtain an optimal solution $y^*$ from the LP

Define:
$$S^{+} := \{ s \mid y^*_{s,1} > 0,\ y^*_{s,0} = 0 \}$$
$$S^{0} := \{ s \mid y^*_{s,1} > 0,\ y^*_{s,0} > 0 \}$$
$$S^{-} := \{ s \mid y^*_{s,1} = 0,\ y^*_{s,0} > 0 \}$$
$$S^{\emptyset} := \{ s \mid y^*_{s,1} = 0,\ y^*_{s,0} = 0 \}$$

An LP-priority policy is any strict priority policy such that the priority satisfies $S^{+} > S^{0} > S^{-} \cup S^{\emptyset}$.

Remark: the WIP is an LP-priority policy
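A hedged sketch (hypothetical helper, not the authors' implementation) of how an LP-priority decision can be computed from y* and the current empirical distribution M:

```python
import numpy as np

def lp_priority_decision(y_star, M, alpha, tol=1e-9):
    """Sketch: activate a fraction alpha of the arms, with priority S^+ > S^0 > the rest."""
    S = len(M)
    s_plus = [s for s in range(S) if y_star[s, 1] > tol and y_star[s, 0] <= tol]
    s_zero = [s for s in range(S) if y_star[s, 1] > tol and y_star[s, 0] > tol]
    s_rest = [s for s in range(S) if y_star[s, 1] <= tol]
    priority = s_plus + s_zero + s_rest      # any internal order within each set is allowed here

    Y = np.zeros((S, 2))
    budget = alpha
    for s in priority:
        pulled = min(M[s], budget)           # activate as many arms in state s as the budget allows
        Y[s, 1] = pulled
        Y[s, 0] = M[s] - pulled
        budget -= pulled
    return Y                                 # Y[s, a] = fraction of arms in state s taking action a
```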



Performance theorem: infinite horizon

Infinite horizon RB performance theorem [Gast, Gaujal, and Yan 2020]
Any LP-priority policy that satisfies the global attractor condition and the non-degenerate condition is asymptotically optimal with exponential convergence rate:
$$0 \le V_{\mathrm{LP}} - V^{(N)}_{\mathrm{prio}} \le C_1 \cdot e^{-C_2 N}$$

The global attractor condition: the deterministic dynamics $\phi_{\mathrm{prio}}$ of the LP-priority policy has a global attractor
The non-degenerate condition: $|S^{0}| \ge 1$



Proof of exponential convergence rate (1)

An example with 3 states:


$\phi_{\mathrm{prio}}(\cdot)$ is a map from $\Delta^3$ to $\Delta^3$, which is continuous and piecewise linear with 3 linear pieces:

$$\phi_{\mathrm{prio}}(m) = \begin{cases} Z_1 \cdot m + b_1, & \text{if } m \in Z_1 \\ Z_2 \cdot m + b_2, & \text{if } m \in Z_2 \\ Z_3 \cdot m + b_3, & \text{if } m \in Z_3 \end{cases}$$

Figure: the partition lines, the fixed point $m(\infty)$, and a deterministic trajectory for $\alpha = 0.5$, plotted in the $(M_3, M_2)$ coordinates.



Proof of exponential convergence rate (2)
The global attractor condition implies that $\phi_{\mathrm{prio}}$ has a unique fixed point $m^*(\infty)$
The non-degenerate condition implies that $m^*(\infty)$ is not on the boundary of two linear pieces

Figure: the deterministic trajectory converges to the fixed point $m^*(\infty)$, which lies in the interior of one linear piece (same example, $\alpha = 0.5$).
Proof of exponential convergence rate (3)
We need to show
$$\mathbb{E}\Big[ \big\| M^{(N)}(\infty) - m^*(\infty) \big\|_1 \Big] = e^{-O(N)}$$
We choose a small neighbourhood $\mathcal{N}$ of $m^*(\infty)$ in which $\phi_{\mathrm{prio}}$ is linear (thanks to the non-degenerate condition)

Figure: simulated trajectories for N = 100 (left) and N = 1000 (right) around the fixed point, for $\alpha = 0.5$.
Proof of exponential convergence rate (4)

By Hoeffding's inequality, $\mathbb{P}\big( M^{(N)}(\infty) \notin \mathcal{N} \big) = e^{-O(N)}$

Conditional on $M^{(N)}(\infty) \in \mathcal{N}$, we apply Stein's method and use local linearity to show $\mathbb{E}\big[ \| M^{(N)}(\infty) - m^*(\infty) \|_1 \big] = e^{-O(N)}$
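For completeness, a sketch of how the two estimates combine (my own filled-in step, not from the slides; it uses only that the $L_1$ diameter of the simplex is at most 2):

$$\mathbb{E}\big[\|M^{(N)}(\infty)-m^*(\infty)\|_1\big] \;\le\; \mathbb{E}\big[\|M^{(N)}(\infty)-m^*(\infty)\|_1 \,\big|\, M^{(N)}(\infty)\in\mathcal{N}\big] \;+\; 2\,\mathbb{P}\big(M^{(N)}(\infty)\notin\mathcal{N}\big) \;=\; e^{-O(N)}$$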


1.0 1.0
partition lines for = 0.5 partition lines for = 0.5
fixed point m( ) for = 0.5 fixed point m( ) for = 0.5
0.8
simulated trajectory for N=100 0.8
simulated trajectory for N=1000

0.6 0.6
M2

M2
2 2
0.4 0.4

0.2 1 0.2 1
3 3
0.0 0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
M3 M3
Conditions for exponential convergence rate

The non-degenerate condition is easy to check, and almost always holds

The global attractor condition is difficult to test



The global attractor condition revisited

A 4-state example with cyclic dynamics of $\phi_{\mathrm{prio}}$

Figure: The deterministic trajectory (left) and a simulated trajectory (right).



Table of Contents

1 Motivation and historical review

2 Infinite horizon case

3 Finite horizon case

4 Generalization to WCMDP

5 Conclusion and future work



Construction of the heuristic from the LP

In finite horizon, our heuristic is time-dependent

A policy is a sequence of T decision rules $\pi = (\pi_1, \dots, \pi_T)$ such that $\pi_t$ specifies the fraction of arms taking each action at time t:
If $Y = \pi_t(M)$, then $N \cdot Y_{s,a}$ among the $N \cdot M_s$ arms in state s take action a



Construction of the heuristic from the LP

$$\max_{y \ge 0}\ \sum_{t=1}^{T} \sum_{s,a} R_s^a \cdot y_{s,a}(t)$$
$$\text{s.t.}\quad y_{s,0}(1) + y_{s,1}(1) = m_s(1) \quad \forall s,$$
$$\sum_{s} y_{s,1}(t) = \alpha \quad \forall t,$$
$$m_s(t+1) = y_{s,0}(t+1) + y_{s,1}(t+1) = \sum_{s',a} y_{s',a}(t) \cdot P^a_{s's} \quad \forall s, t$$
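As an illustration only (not the authors' code), the finite-horizon LP can be written almost verbatim with a modeling tool such as cvxpy; `P`, `R`, `alpha` and the initial distribution `m1` are assumed given as in the earlier sketches.

```python
import cvxpy as cp

def solve_finite_horizon_lp(P, R, alpha, m1, T):
    """Sketch: the finite-horizon LP relaxation, one variable per (t, s, a)."""
    S = len(m1)
    y0 = cp.Variable((T, S), nonneg=True)    # y0[t, s] = y_{s,0}(t+1), fraction left passive
    y1 = cp.Variable((T, S), nonneg=True)    # y1[t, s] = y_{s,1}(t+1), fraction activated

    constraints = [
        y0[0] + y1[0] == m1,                 # initial condition
        cp.sum(y1, axis=1) == alpha,         # budget constraint at every time step
    ]
    for t in range(T - 1):                   # flow conservation under the two kernels
        constraints.append(y0[t + 1] + y1[t + 1] == y0[t] @ P[0] + y1[t] @ P[1])

    objective = cp.Maximize(cp.sum(y0 @ R[0] + y1 @ R[1]))
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return y0.value, y1.value, problem.value # optimal y*(t) and LP value V_LP
```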



Construction of the heuristic from the LP

$\pi_t(M(t)) = \;???$

The LP suggests $\pi_t(m^*(t)) = y^*(t)$, but $M(t) \neq m^*(t)$ due to stochastic noise

Main idea:
$\pi_t(\cdot)$ should be a "regular" function taking values in the feasible region and satisfying $\pi_t(m^*(t)) = y^*(t)$.



Construction of the heuristic from the LP

A policy $\pi = (\pi_1, \dots, \pi_T)$ is called

Admissible: if for all t, writing $Y(t) = \pi_t(M(t))$, we have
$$\sum_{s} Y_{s,1}(t) = \alpha, \quad \text{and} \quad \sum_{a} Y_{s,a}(t) = M_s(t) \quad \forall s$$

LP-compatible: if $\pi_t(m^*(t)) = y^*(t)$ for all t
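For concreteness, a tiny hypothetical helper (not from the thesis) that checks the admissibility conditions for a decision Y = πt(M), with Y of shape (S, 2) and M of shape (S,):

```python
import numpy as np

def is_admissible(Y, M, alpha, tol=1e-9):
    """Sketch: verify the budget and mass-conservation conditions up to a tolerance."""
    budget_ok = abs(Y[:, 1].sum() - alpha) <= tol        # sum_s Y_{s,1} = alpha
    mass_ok = np.all(np.abs(Y.sum(axis=1) - M) <= tol)   # Y_{s,0} + Y_{s,1} = M_s for all s
    nonneg_ok = np.all(Y >= -tol)
    return budget_ok and mass_ok and nonneg_ok
```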



Performance theorem: finite horizon

Finite horizon RB performance theorem [Gast, Gaujal, and Yan 2021]
Let π be an admissible and LP-compatible policy. Then
1 If the $\pi_t$ are all Lipschitz continuous, then $V_{\mathrm{LP}} - V^{(N)}_{\pi} \le C/\sqrt{N}$
2 If the $\pi_t$ are all locally linear, then $V_{\mathrm{LP}} - V^{(N)}_{\pi} \le C_1 \cdot e^{-C_2 N}$

Locally linear: there exists ε > 0 such that $\pi_t(\cdot)$ is linear on the ball of radius ε centered at $m^*(t)$.



Performance theorem: finite horizon

Local linearity cannot be met in every model!

Characterization of the non-degenerate condition [Gast, Gaujal, and Yan 2021]
For a finite horizon RB, there exists a feasible and LP-compatible policy π that is locally linear if and only if the non-degenerate condition is met.

The non-degenerate condition: $|S^{0}(t)| \ge 1$ for all $1 \le t \le T$, where
$$S^{0}(t) := \{ s \mid y^*_{s,1}(t) > 0,\ y^*_{s,0}(t) > 0 \}$$



The non-degenerate condition revisited
Figure: the deterministic trajectory (green), a stochastic trajectory (orange), and the locally linear neighbourhood (blue).



The non-degenerate condition revisited

In infinite horizon, the non-degenerate condition is almost always met: $|S^{0}| = 1$

Not true in finite horizon: $\sum_{t=1}^{T} |S^{0}(t)| = T$ almost always holds, but it may happen that
$$|S^{0}(t)| = 0 \quad \text{and} \quad |S^{0}(t')| = 2 \quad \text{for some } t \neq t'$$



Construction of π using water-filling

We can show that the policy induced by water-filling is

Lipschitz continuous in general
locally linear if non-degenerate



Construction of π using water-filling

We fill a volume α of water into buckets enumerated by s ∈ S, each with capacity $M_s$:

1 Fill completely the buckets in $S^{+}$
2 Fill the buckets in $S^{0}$, each with no more than a volume of $y^*_{s,1}$
3 Then fill completely the buckets in $S^{0}$
4 Finally, fill completely the buckets in $S^{-} \cup S^{\emptyset}$
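A hedged sketch of the water-filling rule just described (hypothetical helper, not the authors' implementation); it pours a total volume alpha into the buckets in the order listed above, with y_star the LP solution for the current time step:

```python
import numpy as np

def water_filling_decision(y_star, M, alpha, tol=1e-9):
    """Sketch: pour a budget alpha into the buckets following the water-filling order."""
    S = len(M)
    act = np.zeros(S)                       # fraction of arms activated in each state
    budget = alpha

    def pour(states, caps):
        nonlocal budget
        for s in states:
            add = min(caps[s] - act[s], budget)
            if add > 0:
                act[s] += add
                budget -= add

    s_plus = [s for s in range(S) if y_star[s, 1] > tol and y_star[s, 0] <= tol]
    s_zero = [s for s in range(S) if y_star[s, 1] > tol and y_star[s, 0] > tol]
    s_rest = [s for s in range(S) if y_star[s, 1] <= tol]

    pour(s_plus, M)                                   # 1. fill the buckets of S^+ completely
    pour(s_zero, np.minimum(M, y_star[:, 1]))         # 2. fill S^0 up to a volume of y*_{s,1}
    pour(s_zero, M)                                   # 3. then fill S^0 completely
    pour(s_rest, M)                                   # 4. finally fill S^- and S^empty
    return np.stack([M - act, act], axis=1)           # Y[s, a] = fraction taking action a
```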


Table of Contents

1 Motivation and historical review

2 Infinite horizon case

3 Finite horizon case

4 Generalization to WCMDP

5 Conclusion and future work



Weakly coupled MDP

Multiple-action and multiple-constraint bandits ⇒ weakly coupled MDP (WCMDP)

In finite horizon, our performance theorem for RB can be generalized to WCMDP



Weakly coupled MDP

What is not obvious:

How to characterize the non-degenerate condition for WCMDP?

How to construct a policy that is Lipschitz continuous in general, and locally linear if non-degenerate?



Weakly coupled MDP

Our contribution in [Gast, Gaujal, and Yan 2022]:

By investigating the saturation of the constraints, we generalize the non-degenerate condition from RB to WCMDP

The LP-update policy: for each 1 ≤ t ≤ T, the decision $\pi_t(M(t))$ is given by solving a new LP with initial condition M(t) and horizon T − t (sketched below)
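A hedged sketch of the LP-update idea in the RB special case, reusing the hypothetical `solve_finite_horizon_lp` and `water_filling_decision` helpers from the earlier sketches; the WCMDP version replaces the inner LP by the corresponding weakly coupled LP.

```python
import numpy as np

def lp_update_step(P, R, alpha, M_t, remaining_horizon):
    """Sketch: at time t, re-solve the LP from the current configuration M(t)
    over the remaining horizon, then turn its first-step solution into a feasible decision."""
    y0, y1, _ = solve_finite_horizon_lp(P, R, alpha, m1=M_t, T=remaining_horizon)
    y_star_t = np.stack([y0[0], y1[0]], axis=1)          # first-step LP solution y*(t)
    return water_filling_decision(y_star_t, M_t, alpha)  # one possible rounding into a feasible decision
```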



Table of Contents

1 Motivation and historical review

2 Infinite horizon case

3 Finite horizon case

4 Generalization to WCMDP

5 Conclusion and future work



Recapitulation

In both infinite and finite horizon:

1 We identify and characterize a non-degenerate condition on the RB
2 We construct policies that achieve exponentially fast convergence if this condition is met
3 The square-root rate cannot be improved if the RB is degenerate

In finite horizon:
We generalize all the results from RB to WCMDP



The story continues...

Infinite horizon WCMDP:

Modeling arrival and departure of arms
Potential applications in multi-class queueing networks

Piecewise linear dynamics:

Identify other mean field models with piecewise linear drift
Establish a faster-than-square-root convergence rate



The story continues...
Is degeneracy the ultimate obstacle?
$V_{\mathrm{LP}} - V^{(N)}_{\mathrm{OPT}} = \Omega(1/\sqrt{N})$ because of degeneracy

Ultimate goal: $V^{(N)}_{\mathrm{OPT}} - V^{(N)}_{\text{powerful policy}} = O(1/N)$

Possible approach: $V_{???} - V^{(N)}_{\text{powerful policy}} = O(1/N)$ even in the degenerate case



References
Gast, Nicolas, Bruno Gaujal, and Chen Yan (2020). “Exponential Convergence Rate for the Asymptotic Optimality of Whittle Index
Policy”. In: arXiv preprint arXiv:2012.09064.
– (2021). “(Close to) Optimal Policies for Finite Horizon Restless Bandits”. In: arXiv preprint arXiv:2106.10067.
– (2022). “The LP-update policy for weakly coupled Markov decision processes”. In: arXiv preprint arXiv:2211.01961.
Gittins, John C (1979). “Bandit processes and dynamic allocation indices”. In: Journal of the Royal Statistical Society: Series B
(Methodological) 41.2, pp. 148–164.
Papadimitriou, Christos H. and John N. Tsitsiklis (1999). “The Complexity Of Optimal Queuing Network Control”. In: Mathematics of Operations Research 24.2, pp. 293–305.
Weber, Richard R. and Gideon Weiss (1990). “On an Index Policy for Restless Bandits”. In: Journal of Applied Probability 27.3,
pp. 637–648. ISSN: 00219002.
Whittle, P. (1988). “Restless Bandits: activity allocation in a changing world”. In: Journal of Applied Probability 25A, pp. 287–298.
