Asymptotically Optimal Policies for Markovian
Restless Bandits
December 15, 2022
Presented by Chen Yan
Reviewers: David Goldberg, Bruno Scherrer
Examiners: Benjamin Legros, Jérôme Malick, Kim Thang Nguyen
Supervisors: Nicolas Gast, Bruno Gaujal
Table of Contents
1 Motivation and historical review
2 Infinite horizon case
3 Finite horizon case
4 Generalization to WCMDP
5 Conclusion and future work
Theme: Markovian bandits
Left: Stochastic bandits
Right: Markovian bandits
Theme: Markovian bandits
Many Markovian processes ("arms")
Taking an action on an arm changes its Markovian evolution
Each arm generates rewards depending on its state
Applying an action consumes resources
We want to maximize the rewards subject to the resource constraint
Important: all parameters of the model are known
This is an optimization problem of a computational nature
Example: to interview or not to interview
Applicants apply for a job, each with a quality level unknown to us
Arm: an applicant
Action: interview, not interview
State: the current belief about the quality
Evolution: Bayesian update depending on the results of an interviewing round
Goal: optimize the quality levels of the admitted applicants
Example: to observe or not to observe
Each channel evolves as a 2-state Markov chain. We can learn the state of a channel only by observing it, and only a channel in the "good" state can transmit data
Arm: a channel
Action: observe, not observe
State: the pair formed by the last observed channel state and the time elapsed since the most recent observation
Goal: optimize the data throughput
Markovian restless bandit: historical review
The special case of the discounted rested bandit can be solved by the Gittins index policy [Gittins 1979]
For the more general restless bandit (RB), the problem is PSPACE-hard [Papadimitriou and Tsitsiklis 1999]
Peter Whittle proposed the Whittle index policy (WIP) that generalizes the Gittins index policy to RB [Whittle 1988]
Under several additional technical assumptions, the WIP is asymptotically optimal in the large N limit: $\lim_{N \to \infty} V^{(N)}_{\mathrm{WIP}} = V_{\mathrm{LP}}$ [Weber and Weiss 1990]
Markovian restless bandit: historical review
Sub-optimality gap: $V_{\mathrm{LP}} - V^{(N)}_{\mathrm{WIP}}$
“. . . The size of the asymptotic sub-optimality of the index policy was
no more than 0.002% in any example. The evidence so far is that the
degree of sub-optimality is very small. It appears that in most cases
the index policy is a very good heuristic.”
—— [Weber and Weiss 1990]
Question: how fast does this gap converge to 0?
Our contribution
We study the problem with N statistically identical arms
Under both infinite and finite horizon:
We propose a class of heuristics that incorporates the WIP
The sub-optimality gap of these heuristics can be either $O(1/\sqrt{N})$ or $e^{-O(N)}$, depending on:
1 Infinite horizon: global attractor + non-degeneracy
2 Finite horizon: non-degeneracy
We generalize our results from the RB model to weakly coupled MDPs (WCMDP)
2 Infinite horizon case
The model: general formulation
An arm is modeled with
a state space: S = {1 . . . S}
an action space: A = {0, 1}
evolution: Markov transition kernels depending on the action
reward: depending on the state and action
Each arm alone is itself a Markov decision process (MDP)
It constitutes one component of the whole RB
The model: general formulation
We have N such arms that evolve independently, giving rise to a very
large MDP for the bandit as a whole.
They are coupled via a budget constraint: αN arms are pulled at each
time period with 0 < α < 1
Goal
Maximize the total expected reward subject to the budget constraint.
Intuition: decoupling for large N
When N grows, the following two things take place:
The problem size grows exponentially
The coupling between arms weakens
The model: relaxation by taking expectation
$M_s(t)$: the fraction of arms in state $s$ at time $t$
Given $M(t)$, a policy $\pi$ decides $Y_{s,a}(t)$: the fraction of arms in state $s$ taking action $a$ at time $t$
$$\max_{\pi}\ \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[\sum_{t=1}^{T} \sum_{s,a} R_s^a \cdot Y_{s,a}(t)\Big]$$
s.t. $Y_{s,0}(t) + Y_{s,1}(t) = M_s(t)$ $\forall t, s$,
$\sum_s Y_{s,1}(t) = \alpha$ $\forall t$,
$M(t) \xrightarrow{\ \pi\ } Y(t) \xrightarrow{\text{Markov transition}} M(t+1)$ $\forall t$
The model: relaxation by taking expectation
$$\sum_s Y_{s,1}(t) = \alpha \ \ \forall t \quad \xrightarrow{\ \text{relaxed}\ } \quad \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[\sum_{t=1}^{T} \sum_s Y_{s,1}(t)\Big] = \alpha$$
Denote by $y_{s,a} := \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\pi\big[Y_{s,a}(t)\big]$
⇒ A single constraint $\sum_s y_{s,1} = \alpha$ after the relaxation
The model: relaxation by taking expectation
Originally: a stochastic N-armed bandit involving the variables $Y_{s,a}$
After relaxation: a deterministic LP involving the variables $y_{s,a}$
Original stochastic program:
$$\max_{\pi}\ \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\Big[\sum_{t=1}^{T} \sum_{s,a} R_s^a \cdot Y_{s,a}(t)\Big]$$
s.t. $\sum_a Y_{s,a}(t) = M_s(t)$ $\forall t, s$, $\quad \sum_s Y_{s,1}(t) = \alpha$ $\forall t$,
$M(t) \xrightarrow{\ \pi\ } Y(t) \xrightarrow{\text{Markov transition}} M(t+1)$ $\forall t$
Relaxed LP:
$$\max_{y \ge 0}\ \sum_{s,a} R_s^a \cdot y_{s,a}$$
s.t. $\sum_{s,a} y_{s,a} = 1$, $\quad \sum_s y_{s,1} = \alpha$,
$m_s = y_{s,0} + y_{s,1} = \sum_{s',a} y_{s',a} \cdot P^a_{s's}$ $\forall s$
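To make the construction concrete, here is a minimal numerical sketch of this relaxed LP (our own illustration; the function name, the arguments P0, P1, R0, R1, alpha and the use of scipy.optimize.linprog are assumptions, not part of the thesis):

```python
import numpy as np
from scipy.optimize import linprog

def solve_relaxed_lp(P0, P1, R0, R1, alpha):
    """Solve the infinite-horizon relaxed LP over the variables y[s, a] >= 0.

    P0, P1 : (S, S) row-stochastic transition matrices for actions 0 and 1
    R0, R1 : length-S reward vectors for actions 0 and 1
    alpha  : fraction of arms pulled at each time step
    """
    S = len(R0)
    n = 2 * S                                   # flatten y[s, a] as y[2*s + a]
    c = -np.stack([np.asarray(R0), np.asarray(R1)], axis=1).ravel()  # linprog minimizes

    A_eq, b_eq = [], []
    # Total mass: sum_{s,a} y[s,a] = 1
    A_eq.append(np.ones(n)); b_eq.append(1.0)
    # Budget: sum_s y[s,1] = alpha
    row = np.zeros(n); row[1::2] = 1.0
    A_eq.append(row); b_eq.append(alpha)
    # Stationarity: y[s,0] + y[s,1] = sum_{s',a} y[s',a] * P^a[s', s]
    # (one of these S rows is redundant and is harmlessly handled by the solver)
    for s in range(S):
        row = np.zeros(n)
        row[2 * s] += 1.0
        row[2 * s + 1] += 1.0
        for sp in range(S):
            row[2 * sp] -= P0[sp, s]
            row[2 * sp + 1] -= P1[sp, s]
        A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n)
    return res.x.reshape(S, 2)                  # y_star[s, a]
```

The support pattern of the returned y_star is what the next slide uses to define the LP-priority policies.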
Construction of the heuristic from the LP
Suppose we obtain an optimal solution y∗ from the LP
Define:
$S^{+} := \{s \mid y^*_{s,1} > 0,\ y^*_{s,0} = 0\}$
$S^{0} := \{s \mid y^*_{s,1} > 0,\ y^*_{s,0} > 0\}$
$S^{-} := \{s \mid y^*_{s,1} = 0,\ y^*_{s,0} > 0\}$
$S^{\emptyset} := \{s \mid y^*_{s,1} = 0,\ y^*_{s,0} = 0\}$
An LP-priority policy is any strict priority policy such that the priority order satisfies $S^{+} > S^{0} > S^{-} \cup S^{\emptyset}$.
Remark: the WIP is an LP-priority policy
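As an illustration, a minimal sketch (with hypothetical helper names of our own) of how the partition and a strict-priority allocation could be coded from an optimal solution y*, e.g. the output of the LP sketch above:

```python
def classify_states(y_star, tol=1e-9):
    """Partition the states according to the support of an optimal y_star[s, a]."""
    S_plus  = [s for s in range(len(y_star)) if y_star[s, 1] > tol and y_star[s, 0] <= tol]
    S_zero  = [s for s in range(len(y_star)) if y_star[s, 1] > tol and y_star[s, 0] > tol]
    S_minus = [s for s in range(len(y_star)) if y_star[s, 1] <= tol and y_star[s, 0] > tol]
    S_empty = [s for s in range(len(y_star)) if y_star[s, 1] <= tol and y_star[s, 0] <= tol]
    return S_plus, S_zero, S_minus, S_empty

def priority_allocation(counts, priority, budget):
    """Pull `budget` arms following a strict priority order on the states.

    counts   : dict state -> number of arms currently in that state
    priority : list of states, highest priority first (S+ before S0 before the rest)
    budget   : integer number of pulls per time step (roughly alpha * N)
    """
    pulls, remaining = {s: 0 for s in counts}, budget
    for s in priority:
        pulls[s] = min(counts.get(s, 0), remaining)
        remaining -= pulls[s]
    return pulls        # pulls[s] arms in state s receive action 1
```

Any strict order that ranks $S^{+}$ before $S^{0}$ before the remaining states qualifies as an LP-priority policy; ties within a group may be broken arbitrarily.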
Performance theorem: infinite horizon
Infinite horizon RB performance theorem [Gast, Gaujal, and Yan
2020]
Any LP-priority policy that satisfies the global attractor condition and
the non-degenerate condition is asymptotically optimal with
exponential convergence rate:
$$0 \le V_{\mathrm{LP}} - V^{(N)}_{\mathrm{prio}} \le C_1 \cdot e^{-C_2 N}$$
The global attractor condition: the deterministic dynamics $\phi_{\mathrm{prio}}$ of the LP-priority policy has a global attractor
The non-degenerate condition: $|S^{0}| \ge 1$
Proof of exponential convergence rate (1)
An example with 3 states:
$\phi_{\mathrm{prio}}(\cdot)$ is a map from $\Delta_3$ to $\Delta_3$, which is continuous and piecewise linear with 3 linear pieces:
$$\phi_{\mathrm{prio}}(m) = \begin{cases} Z_1 \cdot m + b_1, & \text{if } m \in Z_1 \\ Z_2 \cdot m + b_2, & \text{if } m \in Z_2 \\ Z_3 \cdot m + b_3, & \text{if } m \in Z_3 \end{cases}$$
(Figure: the simplex in coordinates $(M_3, M_2)$, showing the partition lines for $\alpha = 0.5$, the fixed point $m^*(\alpha)$, and a deterministic trajectory.)
Proof of exponential convergence rate (2)
The global attractor condition implies that φprio has a unique fixed
point m∗ (∞)
The non-degenerate condition implies that m∗ (∞) is not on the
boundary of two linear pieces
(Figure: the same simplex partition for $\alpha = 0.5$, with the fixed point $m^*(\alpha)$ and a deterministic trajectory; the fixed point lies in the interior of one linear piece.)
Proof of exponential convergence rate (3)
We need to show
$$\mathbb{E}\Big[\big\|M^{(N)}(\infty) - m^*(\infty)\big\|_1\Big] = e^{-O(N)}$$
We choose a small neighbourhood $\mathcal{N}$ of $m^*(\infty)$ in which $\phi_{\mathrm{prio}}$ is linear (thanks to the non-degenerate condition)
(Figure: simulated trajectories for N = 100 and N = 1000 in coordinates $(M_3, M_2)$, together with the partition lines and the fixed point $m^*(\alpha)$ for $\alpha = 0.5$.)
Proof of exponential convergence rate (4)
By Hoeffding's inequality, $\mathbb{P}\big(M^{(N)}(\infty) \notin \mathcal{N}\big) = e^{-O(N)}$
Conditional on $M^{(N)}(\infty) \in \mathcal{N}$, we apply Stein's method and use local linearity to show $\mathbb{E}\big[\|M^{(N)}(\infty) - m^*(\infty)\|_1\big] = e^{-O(N)}$
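Putting the two estimates together (a sketch of the final step; the factor 2 bounds the $\ell_1$ distance between two points of the simplex):
$$\mathbb{E}\big[\|M^{(N)}(\infty) - m^*(\infty)\|_1\big] \;\le\; \mathbb{E}\big[\|M^{(N)}(\infty) - m^*(\infty)\|_1 \,\big|\, M^{(N)}(\infty) \in \mathcal{N}\big] \;+\; 2\, \mathbb{P}\big(M^{(N)}(\infty) \notin \mathcal{N}\big) \;=\; e^{-O(N)}$$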
Conditions for exponential convergence rate
The non-degenerate condition is easy to check, and almost
always holds
The global attractor condition is difficult to test
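One pragmatic option is to collect numerical evidence. Below is a minimal sketch (our own illustration, not a procedure from the thesis) that iterates the deterministic dynamics of a strict priority policy from many random starting points and checks whether they all converge to the same point:

```python
import numpy as np

def phi_prio(m, P0, P1, priority, alpha):
    """One step of the deterministic (mean-field) dynamics under a strict priority policy."""
    y1 = np.zeros(len(m))
    remaining = alpha
    for s in priority:                       # allocate the budget greedily by priority
        y1[s] = min(m[s], remaining)
        remaining -= y1[s]
    y0 = m - y1
    return y0 @ P0 + y1 @ P1                 # next configuration on the simplex

def attractor_evidence(P0, P1, priority, alpha, n_starts=200, horizon=2000, tol=1e-8):
    """Iterate from many random starting points and check whether all trajectories
    end up (numerically) at the same point. This is only evidence, not a proof."""
    rng = np.random.default_rng(0)
    limits = []
    for _ in range(n_starts):
        m = rng.dirichlet(np.ones(len(P0)))
        for _ in range(horizon):
            m = phi_prio(m, P0, P1, priority, alpha)
        limits.append(m)
    return max(np.linalg.norm(m - limits[0], 1) for m in limits) < tol
```

For the cyclic example on the next slide, such a check would report that the trajectories do not settle on a common fixed point.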
The global attractor condition revisited
A 4-state example with cyclic dynamics of $\phi_{\mathrm{prio}}$
(Figure, left: the deterministic trajectory, which cycles instead of converging. Figure, right: a simulated trajectory.)
3 Finite horizon case
Construction of the heuristic from the LP
In finite horizon, our heuristic is time-dependent
A policy is a sequence of T decision rules π = (π1 . . . πT ) such
that πt specifies the fraction of arms for both actions at time t:
If Y = πt (M), then N · Ys,a among the N · Ms arms in state s take
action a
Construction of the heuristic from the LP
$$\max_{y \ge 0}\ \sum_{t=1}^{T} \sum_{s,a} R_s^a \cdot y_{s,a}(t)$$
s.t. $y_{s,0}(1) + y_{s,1}(1) = m_s(1)$ $\forall s$,
$\sum_s y_{s,1}(t) = \alpha$ $\forall t$,
$m_s(t+1) = y_{s,0}(t+1) + y_{s,1}(t+1) = \sum_{s',a} y_{s',a}(t) \cdot P^a_{s's}$ $\forall s, t$
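A minimal sketch of how this finite-horizon LP could be assembled and solved numerically (our own illustration; the function name, the arguments P0, P1, R0, R1, m1 and the use of scipy.optimize.linprog are assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def solve_finite_horizon_lp(P0, P1, R0, R1, alpha, m1, T):
    """Finite-horizon relaxed LP over y[t, s, a] >= 0, for t = 1..T (0-indexed below)."""
    S = len(R0)
    n = T * S * 2
    idx = lambda t, s, a: (t * S + s) * 2 + a

    # Objective: maximize sum_{t,s,a} R^a_s * y[t,s,a]   (linprog minimizes, so negate)
    c = np.zeros(n)
    for t in range(T):
        for s in range(S):
            c[idx(t, s, 0)] = -R0[s]
            c[idx(t, s, 1)] = -R1[s]

    A_eq, b_eq = [], []
    # Initial condition: y[0,s,0] + y[0,s,1] = m1[s]
    for s in range(S):
        row = np.zeros(n)
        row[idx(0, s, 0)] = 1.0
        row[idx(0, s, 1)] = 1.0
        A_eq.append(row); b_eq.append(m1[s])
    # Budget at every time step: sum_s y[t,s,1] = alpha
    for t in range(T):
        row = np.zeros(n)
        for s in range(S):
            row[idx(t, s, 1)] = 1.0
        A_eq.append(row); b_eq.append(alpha)
    # Flow: y[t+1,s,0] + y[t+1,s,1] = sum_{s',a} y[t,s',a] * P^a[s', s]
    for t in range(T - 1):
        for s in range(S):
            row = np.zeros(n)
            row[idx(t + 1, s, 0)] = 1.0
            row[idx(t + 1, s, 1)] = 1.0
            for sp in range(S):
                row[idx(t, sp, 0)] -= P0[sp, s]
                row[idx(t, sp, 1)] -= P1[sp, s]
            A_eq.append(row); b_eq.append(0.0)

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, None)] * n)
    return res.x.reshape(T, S, 2)    # y_star[t, s, a]
```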
Construction of the heuristic from the LP
$\pi_t(M(t)) =\ ?$
The LP suggests $\pi_t(m^*(t)) = y^*(t)$, but $M(t) \neq m^*(t)$ due to stochastic noise
Main idea:
$\pi_t(\cdot)$ should be a "regular" function that takes values in the feasible region and satisfies $\pi_t(m^*(t)) = y^*(t)$.
Construction of the heuristic from the LP
A policy $\pi = (\pi_1, \dots, \pi_T)$ is called
Admissible: if for all $t$, writing $Y(t) = \pi_t(M(t))$, we have
$$\sum_s Y_{s,1}(t) = \alpha, \quad \text{and} \quad \sum_a Y_{s,a}(t) = M_s(t) \ \ \forall s$$
LP-compatible: if $\pi_t(m^*(t)) = y^*(t)$ $\forall t$
Performance theorem: finite horizon
Finite horizon RB performance theorem [Gast, Gaujal, and Yan
2021]
Let π be an admissible and LP-compatible policy. Then
1 If the $\pi_t$ are all Lipschitz continuous, then $V_{\mathrm{LP}} - V^{(N)}_{\pi} \le C/\sqrt{N}$
2 If the $\pi_t$ are all locally linear, then $V_{\mathrm{LP}} - V^{(N)}_{\pi} \le C_1 \cdot e^{-C_2 N}$
Locally linear: there exists $\varepsilon > 0$ such that $\pi_t(\cdot)$ is linear on the ball of radius $\varepsilon$ centered at $m^*(t)$.
Performance theorem: finite horizon
Local linearity cannot be achieved for every model!
Characterization of the non-degenerate condition [Gast, Gaujal, and Yan 2021]
For a finite horizon RB, there exists an admissible and LP-compatible policy $\pi$ that is locally linear if and only if the non-degenerate condition is met.
The non-degenerate condition: $|S^{0}(t)| \ge 1$ for all $1 \le t \le T$, where
$S^{0}(t) := \{s \mid y^*_{s,1}(t) > 0,\ y^*_{s,0}(t) > 0\}$
The non-degenerate condition revisited
(Figure: the green curve shows the deterministic trajectory, the orange curve the stochastic one, and the blue region the locally linear neighbourhood.)
The non-degenerate condition revisited
In infinite horizon, the non-degenerate condition is almost always met: $|S^{0}| = 1$
Not true in finite horizon: $\sum_{t=1}^{T} |S^{0}(t)| = T$ almost always holds, but it may happen that $|S^{0}(t)| = 0$ and $|S^{0}(t')| = 2$ for some $t \neq t'$
Construction of π using water-filling
We can show that the policy induced by water-filling is
Lipschitz-continuous in general
locally linear if non-degenerate
Construction of π using water-filling
We fill a volume $\alpha$ of water into buckets enumerated by $s \in S$, each with capacity $M_s$:
1 Fill completely the buckets in $S^{+}$
2 Fill the buckets in $S^{0}$ with no more than a volume of $y^*_{s,1}$
3 Fill completely the buckets in $S^{0}$
4 Fill completely the buckets in $S^{-} \cup S^{\emptyset}$
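A minimal sketch of this water-filling allocation as we understand it (the names, and the use of fractions rather than integer arm counts, are our own simplifications):

```python
def water_filling(M, y_star1, S_plus, S_zero, S_rest, alpha):
    """Allocate a volume alpha of 'water' (action 1) into buckets of capacity M[s].

    M       : dict state -> fraction of arms currently in that state
    y_star1 : dict state -> LP target y*_{s,1}
    S_plus, S_zero, S_rest : the sets S+, S0, and S- union S_empty
    Returns the fraction Y[s] of arms in state s that take action 1.
    """
    Y = {s: 0.0 for s in M}
    water = alpha

    def pour(s, level):
        nonlocal water
        add = max(0.0, min(level - Y[s], water))
        Y[s] += add
        water -= add

    for s in S_plus:                      # 1. fill the S+ buckets completely
        pour(s, M[s])
    for s in S_zero:                      # 2. fill S0 buckets up to the LP target y*_{s,1}
        pour(s, min(M[s], y_star1[s]))
    for s in S_zero:                      # 3. if water remains, fill S0 buckets completely
        pour(s, M[s])
    for s in S_rest:                      # 4. spill the rest into S- and S_empty
        pour(s, M[s])
    return Y
```

When $M = m^*(t)$, this allocation returns exactly $y^*(t)$, which matches the LP-compatibility requirement.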
4 Generalization to WCMDP
Weakly coupled MDP
Multiple-action and multiple-constraint bandits:
⇒ weakly coupled MDP (WCMDP)
In finite horizon, our performance theorem for RB can be generalized
to WCMDP
Weakly coupled MDP
What is not obvious:
How to characterize the non-degenerate condition for WCMDP?
How to construct a policy that is Lipschitz-continuous in general,
and locally linear if non-degenerate?
Weakly coupled MDP
Our contribution in [Gast, Gaujal, and Yan 2022]:
Investigating the saturation of the constraints:
Generalize the non-degenerate condition from RB to WCMDP
The LP-update policy:
For each 1 ≤ t ≤ T , the decision πt (M(t)) is given by solving a
new LP with initial condition M(t) and horizon T − t
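As an illustration only, a sketch of this LP-update loop for the RB special case, reusing the hypothetical solve_finite_horizon_lp, classify_states and water_filling sketches from the earlier slides; the rounding step actually used in [Gast, Gaujal, and Yan 2022] may differ:

```python
import numpy as np

def lp_update(M1, T, P0, P1, R0, R1, alpha, simulate_one_step):
    """Re-solve the finite-horizon LP at every step and apply its first decision rule.

    simulate_one_step(M, Y1) stands for sampling the transitions of the N arms
    and is left abstract here.
    """
    M = {s: M1[s] for s in range(len(M1))}          # empirical configuration at t = 1
    for t in range(T):
        m_vec = np.array([M[s] for s in sorted(M)])
        # Fresh LP with the current configuration and the remaining horizon T - t
        y = solve_finite_horizon_lp(P0, P1, R0, R1, alpha, m_vec, T - t)
        y_now = y[0]                                # keep only the first decision rule
        S_plus, S_zero, S_minus, S_empty = classify_states(y_now)
        Y1 = water_filling(M, {s: y_now[s, 1] for s in M},
                           S_plus, S_zero, S_minus + S_empty, alpha)
        M = simulate_one_step(M, Y1)                # advance the N arms by one period
    return M
```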
5 Conclusion and future work
Recapitulation
In both infinite and finite horizon:
1 Identify and characterize a non-degenerate condition on the RB
2 Construct policies that achieve exponentially fast convergence if this condition is met
3 Show that the square-root rate cannot be improved if the RB degenerates
In finite horizon:
Generalize all the results from RB to WCMDP
The story continues...
Infinite horizon WCMDP:
Modeling arrival and departure of arms
Potential applications in multi-class queueing networks
Piecewise linear dynamics:
Identify other mean field models with piecewise linear drift
Establish faster than square root convergence rate
The story continues...
Is degeneracy the ultimate obstacle?
$V_{\mathrm{LP}} - V^{(N)}_{\mathrm{OPT}} = \Omega(1/\sqrt{N})$ because of degeneracy
Ultimate goal: $V^{(N)}_{\mathrm{OPT}} - V^{(N)}_{\text{powerful policy}} = O(1/N)$
Possible approach: $V_{???} - V^{(N)}_{\text{powerful policy}} = O(1/N)$ even when degenerate
References
Gast, Nicolas, Bruno Gaujal, and Chen Yan (2020). “Exponential Convergence Rate for the Asymptotic Optimality of Whittle Index
Policy”. In: arXiv preprint arXiv:2012.09064.
– (2021). “(Close to) Optimal Policies for Finite Horizon Restless Bandits”. In: arXiv preprint arXiv:2106.10067.
– (2022). “The LP-update policy for weakly coupled Markov decision processes”. In: arXiv preprint arXiv:2211.01961.
Gittins, John C (1979). “Bandit processes and dynamic allocation indices”. In: Journal of the Royal Statistical Society: Series B
(Methodological) 41.2, pp. 148–164.
Papadimitriou, Christos H. and John N. Tsitsiklis (1999). “The Complexity of Optimal Queueing Network Control”. In: Mathematics of Operations Research 24.2, pp. 293–305.
Weber, Richard R. and Gideon Weiss (1990). “On an Index Policy for Restless Bandits”. In: Journal of Applied Probability 27.3,
pp. 637–648. ISSN: 00219002.
Whittle, P. (1988). “Restless Bandits: activity allocation in a changing world”. In: Journal of Applied Probability 25A, pp. 287–298.