Reinforcement Learning
Exploration and Control
Natnael Argaw
Lecture by Natnael Argaw (PhD), © Hado van Hasselt
Recap
► Reinforcement learning is the science of learning to make decisions
► Agents can learn a policy, value function and/or a model
► The general problem involves taking into account time and consequences
► Decisions affect the reward, the agent state, and environment state
► Learning is active: decisions impact data
This Lecture
In this lecture, we simplify the setting
► The environment is assumed to have only a single state
► =⇒ actions no longer have long-term consequences in the environment
► =⇒ actions still do impact the immediate reward
► =⇒ other observations can be ignored
► We discuss how to learn a policy in this setting
Exploration vs. Exploitation
► Learning agents need to trade off two things
► Exploitation: Maximise performance based on current knowledge
► Exploration: Increase knowledge
► We need to gather information to make the best overall decisions
► The best long-term strategy may involve short-term sacrifices
Formalising the problem
Algorithms
The Multi-Armed Bandit
► A multi-armed bandit is a set of reward distributions {R_a | a ∈ A}
► A is a (known) set of actions (or “arms”)
► R_a is a distribution on rewards, given action a
► At each step t the agent selects an action At ∈ A
► The environment generates a reward Rt ∼ R_{At}
► The goal is to maximise the cumulative reward Σ_{i=1}^t Ri
► We do this by learning a policy: a distribution on A
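To make the interaction loop concrete, here is a minimal sketch (not from the slides) of a bandit with Gaussian reward distributions and a uniformly random placeholder policy; the class and parameter names are assumptions for illustration:

```python
import numpy as np

class GaussianBandit:
    """A multi-armed bandit: one reward distribution R_a per action ("arm")."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)  # true expected rewards q(a)
        self.rng = rng if rng is not None else np.random.default_rng()

    @property
    def num_actions(self):
        return len(self.means)

    def step(self, action):
        # Sample a reward Rt ~ R_{At}; the unit-variance Gaussian is an assumption.
        return self.rng.normal(self.means[action], 1.0)

# Interaction loop with a uniformly random policy, as a placeholder for the
# algorithms discussed below; the goal is to maximise the cumulative reward.
bandit = GaussianBandit(means=[0.1, 0.5, 0.9])
rng = np.random.default_rng(0)
total_reward = 0.0
for t in range(1000):
    action = rng.integers(bandit.num_actions)
    total_reward += bandit.step(action)
print(total_reward)
```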
Values and Regret
► The action value for action a is the expected reward
q(a) = E [Rt |At = a]
► The optimal value is
v∗ = max_{a∈A} q(a) = max_{a∈A} E [Rt | At = a]
► Regret of an action a is
∆a = v∗ − q(a)
► The regret for the optimal action is zero
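As a small worked example (with illustrative numbers, not from the slides): for a three-armed bandit with q(a1) = 0.1, q(a2) = 0.5 and q(a3) = 0.9, the optimal value is v∗ = 0.9, so the regrets are ∆a1 = 0.8, ∆a2 = 0.4 and ∆a3 = 0.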
Regret
► We want to minimise total regret:
Lt = Σ_{n=1}^t (v∗ − q(An)) = Σ_{n=1}^t ∆_{An}
► The summation spans over the full ‘lifetime of learning’
Algorithms
► We will discuss several algorithms:
► Greedy
► ε-greedy
► UCB
► Thompson sampling
► Policy gradients
► The first three all use action value estimates Qt (a) ≈ q(a)
Action values
► The action value for action a is the expected reward
q(a) = E [Rt |At = a]
► A simple estimate is the average of the sampled rewards:
Qt(a) = (Σ_{n=1}^t I(An = a) Rn) / (Σ_{n=1}^t I(An = a)) ,
where I(·) is the indicator function: I(True) = 1 and I(False) = 0
► The count for action a is Nt(a) = Σ_{n=1}^t I(An = a)
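A minimal sketch of the sample-average estimate (function and argument names are assumptions, not from the slides):

```python
import numpy as np

def sample_average_values(actions, rewards, num_actions):
    """Estimate Q_t(a) as the average of the rewards observed for each action."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    q = np.zeros(num_actions)
    for a in range(num_actions):
        mask = actions == a            # I(A_n = a)
        count = mask.sum()             # N_t(a)
        q[a] = rewards[mask].mean() if count > 0 else 0.0
    return q

# Example: three observed interactions with a 2-armed bandit.
print(sample_average_values(actions=[0, 1, 0], rewards=[1.0, 0.0, 0.5], num_actions=2))
# -> [0.75, 0.0]
```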
Action values
► This can also be updated incrementally:
Qt(At) = Qt−1(At) + αt (Rt − Qt−1(At)) ,  with αt = 1/Nt(At) and Nt(At) = Nt−1(At) + 1,
where N0(a) = 0.
► We will later consider other step sizes α
► For instance, constant α would lead to tracking, rather than averaging
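A minimal sketch of the incremental update (names are assumptions, not from the slides); the default step size 1/Nt(a) recovers the sample average, while a constant step_size gives the tracking behaviour mentioned above:

```python
import numpy as np

class ActionValueEstimator:
    """Incremental action-value estimates Q_t(a) with counts N_t(a)."""

    def __init__(self, num_actions, step_size=None):
        self.q = np.zeros(num_actions)   # Q_0(a) = 0 (an arbitrary initialisation)
        self.n = np.zeros(num_actions)   # N_0(a) = 0
        self.step_size = step_size       # None -> alpha_t = 1/N_t(a) (averaging)

    def update(self, action, reward):
        self.n[action] += 1
        # A constant alpha tracks (recent rewards weigh more); 1/N_t(a) averages.
        alpha = self.step_size if self.step_size is not None else 1.0 / self.n[action]
        self.q[action] += alpha * (reward - self.q[action])

est = ActionValueEstimator(num_actions=2)
est.update(0, 1.0); est.update(0, 0.0)
print(est.q)  # -> [0.5, 0.0]
```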
Algorithms: greedy
The greedy policy
► One of the simplest policies is greedy:
► Select the action with the highest estimated value: At = argmax_{a∈A} Qt(a)
► Equivalently: πt(a) = I(a = argmax_{b∈A} Qt(b)) (assuming no ties are possible)
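A minimal sketch of greedy action selection (random tie-breaking is an implementation choice, not prescribed by the slides):

```python
import numpy as np

def greedy_action(q_values, rng=None):
    """Greedy policy: pick the action with the highest estimated value.

    Ties are broken uniformly at random rather than by index."""
    rng = rng if rng is not None else np.random.default_rng()
    q_values = np.asarray(q_values)
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))

print(greedy_action([0.2, 0.7, 0.7]))  # -> 1 or 2
```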
Algorithms: ε-Greedy
ε-Greedy Algorithm
► Greedy can get stuck on a suboptimal action forever
=⇒ linear expected total regret
► The ε-greedy algorithm:
► With probability 1 − ε select the greedy action: At = argmax_{a∈A} Qt(a)
► With probability ε select a random action (uniformly)
► Equivalently: πt(a) = (1 − ε) I(a = argmax_{b∈A} Qt(b)) + ε/|A|
► ε-greedy continues to explore
=⇒ ε-greedy with constant ε has linear expected total regret
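A minimal sketch of ε-greedy action selection (names are assumptions, not from the slides):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """epsilon-greedy: random action with probability epsilon, greedy otherwise."""
    rng = rng if rng is not None else np.random.default_rng()
    q_values = np.asarray(q_values)
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

# Example: mostly picks action 1, but explores ~10% of the time.
print(epsilon_greedy_action([0.2, 0.7, 0.1], epsilon=0.1))
```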
Algorithms: Policy gradients
Policy search
► Can we learn policies π (a) directly, instead of learning values?
► For instance, define action preferences Ht(a) and a (softmax) policy
πt(a) = e^{Ht(a)} / Σ_{b∈A} e^{Ht(b)}
► The preferences are not values: they are just learnable policy parameters
► Goal: learn by optimising the preferences
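A minimal sketch of a softmax policy over preferences (assuming the softmax parameterisation above; names are illustrative):

```python
import numpy as np

def softmax_policy(preferences):
    """Turn action preferences H_t(a) into a probability distribution pi_t(a)."""
    h = np.asarray(preferences, dtype=float)
    h = h - h.max()                    # shift for numerical stability
    e = np.exp(h)
    return e / e.sum()

print(softmax_policy([0.0, 1.0, 2.0]))  # -> roughly [0.09, 0.24, 0.67]
```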
Policy gradients
► Idea: update policy parameters such that expected value increases
► We can use gradient ascent
► In the bandit case, we want to update:
θt+1 = θt + α ∇θ E[Rt |πθt ] ,
where θt are the current policy parameters
► Can we compute this gradient?
Gradient bandits
► Log-likelihood trick (also known as REINFORCE trick, Williams 1992):
∇θ E[Rt |θ ] = E [Rt ∇θ log πθ (At )]
► We can sample this!
► So
θt+1 = θt + α Rt ∇θ log πθt (At) ,
which is stochastic gradient ascent on the (true) value of the policy
► Can use sampled rewards — does not need value estimates
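A minimal sketch of this update for the softmax-preference policy above (names are assumptions; the commonly used reward baseline is omitted for simplicity):

```python
import numpy as np

def softmax_policy(preferences):
    h = np.asarray(preferences, dtype=float)
    e = np.exp(h - h.max())
    return e / e.sum()

def reinforce_update(preferences, action, reward, step_size=0.1):
    """One stochastic gradient ascent step on the preferences H_t(a).

    For a softmax policy, grad_H log pi(A_t) = I(a = A_t) - pi(a)."""
    pi = softmax_policy(preferences)
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0
    return np.asarray(preferences, dtype=float) + step_size * reward * grad_log_pi

h = np.zeros(3)
h = reinforce_update(h, action=2, reward=1.0)
print(softmax_policy(h))  # the probability of action 2 has increased
```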
Theory: what is possible?
How well can we do?
Theorem (Lai and Robbins)
Asymptotic total regret is at least logarithmic in the number of steps:
lim_{t→∞} Lt / log t ≥ Σ_{a : ∆a > 0} ∆a / KL(Ra ‖ Ra∗) , where a∗ is an optimal action
► Note that regret grows at least logarithmically
► That’s still a whole lot better than linear growth! Can we get it in practice?
► Are there algorithms for which the upper bound is logarithmic as well?
Counting Regret
► Recall ∆a = v∗ − q(a)
► Total regret depends on the action regrets ∆a and the action counts Nt(a): Lt = Σ_{a∈A} Nt(a) ∆a
► A good algorithm ensures small counts for large action regrets
Optimism in the face of uncertainty
Optimism in the Face of Uncertainty
[Figure: belief distributions P[q(a1)], P[q(a2)], P[q(a3)] shown as probability densities over the expected value]
► Which action should we pick?
► More uncertainty about its value: more important to explore that action
Algorithms: UCB
Upper Confidence Bounds
► Estimate an upper confidence Ut(a) for each action value, such that q(a) ≤ Qt(a) + Ut(a) with high probability
► Select the action maximising the upper confidence bound (UCB):
At = argmax_{a∈A} [ Qt(a) + Ut(a) ]
► The uncertainty should depend on the number of times Nt (a) action a has been selected
► Small Nt (a) ⇒ large Ut (a) (estimated value is uncertain)
► Large Nt (a) ⇒ small Ut (a) (estimated value is accurate)
► Then a is only selected if either...
► ...Qt (a) is large (=good action), or
► ...Ut (a) is large (=high uncertainty) (or both)
► Can we derive an optimal bound?
Theory: the optimality of UCB
Hoeffding’s Inequality
Theorem (Hoeffding’s Inequality)
Let X1, . . . , Xn be i.i.d. random variables in [0, 1], with sample mean X̄n = (1/n) Σ_{i=1}^n Xi. Then, for u > 0,
p(X̄n + u ≤ E[X]) ≤ e^{−2nu²}
Applied to the rewards of the bandit (assuming rewards in [0, 1]), conditioned on selecting action a:
p(Qt(a) + Ut(a) ≤ q(a)) ≤ e^{−2Nt(a)Ut(a)²}
Calculating Upper Confidence Bounds
► We can pick a maximal desired probability p that the true value exceeds the upper bound, and solve for this bound Ut(a):
e^{−2Nt(a)Ut(a)²} = p  =⇒  Ut(a) = √(−log p / (2Nt(a)))
We then know the probability that this happens is smaller than p
► Idea: reduce p as we observe more rewards, e.g., p = 1/t, giving
Ut(a) = √(log t / (2Nt(a)))
► This ensures that we always keep exploring, but not too much
UCB
► UCB: At = argmax_{a∈A} [ Qt(a) + √(log t / (2Nt(a))) ]
► Intuition:
► If ∆a is large, then Nt (a) is small, because Qt (a) is likely to be small
► So either ∆a is small or Nt (a) is small
► In fact, we can prove ∆aNt (a) ≤ O(log t), for all a
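A minimal sketch of UCB action selection using the bound derived above (the exploration constant c and the handling of untried actions are implementation choices, not from the slides):

```python
import numpy as np

def ucb_action(q_values, counts, t, c=1.0):
    """UCB selection: argmax_a Q_t(a) + c * sqrt(log t / (2 N_t(a))).

    Untried actions (N_t(a) = 0) are selected first; c trades off exploration."""
    q_values = np.asarray(q_values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / (2.0 * counts))
    return int(np.argmax(q_values + bonus))

# Action 1 has a lower estimate but far fewer tries, so its exploration bonus can win.
print(ucb_action(q_values=[0.6, 0.5], counts=[100, 2], t=102))  # -> 1
```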
Bayesian approaches
Bayesian Bandits
► We could adopt Bayesian approach and model distributions over values p(q(a) | θt )
► This is interpreted as our belief that, e.g., q(a) = x for all x ∈ R
► E.g., θt could contain the means and variances of Gaussian belief distributions
► Allows us to inject rich prior knowledge θ0
► We can then use posterior belief to guide exploration
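A minimal sketch of one common choice, assumed here for illustration: Bernoulli rewards with per-arm Beta posteriors, where θt consists of the (α, β) counts:

```python
import numpy as np

class BetaBernoulliBeliefs:
    """Posterior beliefs p(q(a) | theta_t) for Bernoulli rewards: one Beta(alpha, beta) per arm."""

    def __init__(self, num_actions, prior_alpha=1.0, prior_beta=1.0):
        self.alpha = np.full(num_actions, prior_alpha)  # Beta parameter: pseudo-count of successes
        self.beta = np.full(num_actions, prior_beta)    # Beta parameter: pseudo-count of failures

    def update(self, action, reward):
        # Bayesian updating: the posterior after a 0/1 reward is again a Beta distribution.
        self.alpha[action] += reward
        self.beta[action] += 1.0 - reward

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

beliefs = BetaBernoulliBeliefs(num_actions=3)
beliefs.update(0, 1.0); beliefs.update(0, 0.0); beliefs.update(1, 1.0)
print(beliefs.mean())  # -> [0.5, 0.667, 0.5]
```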
Bayesian Approach
► Prior Distribution: Initialize a prior distribution representing the agent's initial
beliefs about the parameters. This distribution captures the uncertainty before
observing any data.
► Observation and Likelihood: As the agent interacts with the environment, it collects
data (observations) about state-action pairs and rewards.
► Posterior Distribution: The updated distribution is the posterior distribution, which
combines the prior beliefs with the likelihood of the observed data.
► Exploration via Uncertainty:
○ Exploit: Choose actions that are currently believed to be optimal
according to the current posterior distribution.
○ Explore: Choose actions that have higher uncertainty or where the
posterior distribution is spread out.
► Bayesian updating: after each observation, the posterior becomes the prior for the next update
Bayesian Bandits with Upper Confidence Bounds
[Figure: posterior densities pt(Q(a1)), pt(Q(a2)), pt(Q(a3)) with means µ(a1), µ(a2), µ(a3) and confidence widths cσ(a1), cσ(a2), cσ(a3)]
► We can estimate upper confidences from the posterior
► e.g., Ut(a) = c σt(a), where σt(a) is the standard deviation of pt(q(a))
► Then, pick the action that maximises Qt(a) + c σt(a)
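A minimal sketch of this rule (the constant c = 2.0 and the Beta posterior are assumptions for illustration, reusing the beliefs sketched above):

```python
import numpy as np

def bayesian_ucb_action(posterior_means, posterior_stds, c=2.0):
    """Pick argmax_a Q_t(a) + c * sigma_t(a), using posterior means and standard deviations."""
    means = np.asarray(posterior_means, dtype=float)
    stds = np.asarray(posterior_stds, dtype=float)
    return int(np.argmax(means + c * stds))

# With Beta(alpha, beta) posteriors, means and standard deviations follow in closed form.
alpha, beta = np.array([2.0, 2.0, 1.0]), np.array([2.0, 1.0, 1.0])
means = alpha / (alpha + beta)
stds = np.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0)))
print(bayesian_ucb_action(means, stds))  # -> 1 (highest mean plus uncertainty bonus)
```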
Algorithms: Thompson sampling
Probability Matching
► A different option is to use probability matching:
Select action a according to the probability (belief) that a is optimal
► Probability matching is optimistic in the face of uncertainty:
Actions have higher probability when either the estimated value is high, or the
uncertainty is high
► Can be difficult to compute π (a) analytically from posterior (but can be done
numerically)
Thompson Sampling
► Thompson sampling (Thompson 1933) implements probability matching by sampling:
► Sample Qt(a) ∼ pt(q(a)) from the posterior, for each action a
► Select the action with the highest sample: At = argmax_{a∈A} Qt(a)
► For Bernoulli bandits, Thompson sampling achieves the Lai and Robbins lower bound on regret, and is therefore optimal
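A minimal sketch of Thompson sampling for Bernoulli bandits with Beta posteriors (an assumed setting, matching the earlier Bayesian sketch):

```python
import numpy as np

def thompson_sample_action(alpha, beta, rng=None):
    """Thompson sampling with Beta(alpha, beta) posteriors per arm:
    sample one plausible value per arm from the posterior, then act greedily on the samples."""
    rng = rng if rng is not None else np.random.default_rng()
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

# Arms with posteriors Beta(2, 2), Beta(2, 1), Beta(1, 1); arm 1 is picked most often.
print(thompson_sample_action(alpha=[2.0, 2.0, 1.0], beta=[2.0, 1.0, 1.0]))
```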
Planning to explore
Information State Space
► We have viewed bandits as one-step decision-making problems
► Can also view as sequential decision-making problems
► Each step the agent updates state St to summarise the past
► Each action At causes a transition to a new information state St+1 (by adding
information), with probability p(St+1 | At , St )
► We now have a Markov decision problem
► The state is fully internal to the agent
► State transitions are random due to rewards & actions
► Even in bandits actions affect the future after all, via learning
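A minimal sketch of such an information state for a Bernoulli bandit (an assumed example, not from the slides): the state is the per-arm (successes, failures) counts, and the randomness in the transition comes entirely from the sampled reward:

```python
from collections import namedtuple

# Information state for a 2-armed Bernoulli bandit: per-arm success/failure counts.
InfoState = namedtuple("InfoState", ["wins", "losses"])

def transition(state, action, reward):
    """Deterministic update of the information state given the (random) reward.

    The randomness in S_{t+1} comes from the reward R_t, as in p(S_{t+1} | A_t, S_t)."""
    wins = list(state.wins)
    losses = list(state.losses)
    if reward > 0:
        wins[action] += 1
    else:
        losses[action] += 1
    return InfoState(tuple(wins), tuple(losses))

s0 = InfoState(wins=(0, 0), losses=(0, 0))
s1 = transition(s0, action=0, reward=1.0)
print(s1)  # -> InfoState(wins=(1, 0), losses=(0, 0))
```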
End of lecture
Background
Recommended reading:
Sutton & Barto 2018, Chapter 2
Further background material:
Bandit Algorithms, Lattimore & Szepesvári, 2020
Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi, Fischer, 2002