
Reinforcement Learning

Exploration and Control

Natnael Argaw

Lecture by Natnael Argaw (PhD); slides © Hado van Hasselt

Recap

► Reinforcement learning is the science of learning to make decisions


► Agents can learn a policy, value function and/or a model
► The general problem involves taking into account time and consequences
► Decisions affect the reward, the agent state, and environment state
► Learning is active: decisions impact data

This Lecture

In this lecture, we simplify the setting


► The environment is assumed to have only a single state
=⇒ actions no longer have long-term consequences in the environment
=⇒ actions still do impact the immediate reward
=⇒ other observations can be ignored
► We discuss how to learn a policy in this setting

Exploration vs. Exploitation

► Learning agents need to trade off two things


► Exploitation: Maximise performance based on current knowledge
► Exploration: Increase knowledge
► We need to gather information to make the best overall decisions
► The best long-term strategy may involve short-term sacrifices

Formalising the problem

Algorithms

The Multi-Armed Bandit

► A multi-armed bandit is a set of distributions {Ra |a ∈ A}


► A is a (known) set of actions (or “arms”)
► Ra is a distribution on rewards, given action a
► At each step t the agent selects an action At ∈ A
► The environment generates a reward Rt ∼ R_{At}
► The goal is to maximise the cumulative reward Σ_{i=1}^t Ri

► We do this by learning a policy: a distribution on A

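To make this setup concrete, here is a minimal sketch of a bandit environment in Python, assuming Gaussian reward distributions and NumPy; the class and variable names are illustrative choices, not something given in the slides.

```python
import numpy as np

class GaussianBandit:
    """A multi-armed bandit: one reward distribution R_a per action a."""

    def __init__(self, true_means, reward_std=1.0, seed=0):
        self.true_means = np.asarray(true_means, dtype=float)  # q(a) for each arm
        self.reward_std = reward_std
        self.rng = np.random.default_rng(seed)

    @property
    def num_actions(self):
        return len(self.true_means)

    def step(self, action):
        """Sample a reward R_t ~ R_{A_t} for the chosen arm."""
        return self.rng.normal(self.true_means[action], self.reward_std)

# Example: a 3-armed bandit in which arm 2 is optimal (v* = 1.5).
bandit = GaussianBandit(true_means=[0.0, 0.5, 1.5])
reward = bandit.step(action=2)
```
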
Values and Regret

► The action value for action a is the expected reward

q(a) = E [Rt |At = a]

► The optimal value is


v∗ = max_{a∈A} q(a) = max_{a∈A} E[Rt | At = a]

► Regret of an action a is
∆a = v∗ − q(a)
► The regret for the optimal action is zero

Regret

► We want to minimise total regret:

Lt = Σ_{n=1}^t (v∗ − q(An)) = Σ_{n=1}^t ∆_{An}

► The summation spans over the full ‘lifetime of learning’

Algorithms

► We will discuss several algorithms:


► Greedy
► ε-greedy
► UCB
► Thompson sampling
► Policy gradients
► The first three all use action value estimates Qt (a) ≈ q(a)

Action values

► The action value for action a is the expected reward

q(a) = E [Rt |At = a]

► A simple estimate is the average of the sampled rewards:

Qt(a) = Σ_{n=1}^t I(An = a) Rn / Σ_{n=1}^t I(An = a)

where I(·) is the indicator function: I(True) = 1 and I(False) = 0
► The count for action a is Nt(a) = Σ_{n=1}^t I(An = a)

Action values

► This can also be updated incrementally:

Qt(At) = Qt−1(At) + αt (Rt − Qt−1(At)),  with αt = 1/Nt(At) and Nt(At) = Nt−1(At) + 1,

where N0(a) = 0.
► We will later consider other step sizes α
► For instance, a constant α would lead to tracking, rather than averaging

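A minimal, self-contained sketch of this incremental update; the variable names and the use of NumPy arrays are my choices, not from the slides.

```python
import numpy as np

num_actions = 3
Q = np.zeros(num_actions)  # Q_0(a): initial value estimates
N = np.zeros(num_actions)  # N_0(a) = 0: action counts

def update(action, reward, step_size=None):
    """Incrementally move Q(a) towards the observed reward.

    With step_size = 1/N(a) this computes the sample average;
    a constant step_size instead tracks a non-stationary value.
    """
    N[action] += 1
    alpha = (1.0 / N[action]) if step_size is None else step_size
    Q[action] += alpha * (reward - Q[action])

# e.g. after observing reward 1.2 for action 0:
update(action=0, reward=1.2)
```
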
Algorithms: greedy

The greedy policy

► One of the simplest policies is greedy:


► Select the action with the highest value: At = argmax_{a∈A} Qt(a)
► Equivalently: πt(a) = I(a = argmax_{b∈A} Qt(b)) (assuming no ties are possible)

Algorithms: ε-Greedy

ε-Greedy Algorithm

► Greedy can get stuck on a suboptimal action forever
=⇒ linear expected total regret
► The ε-greedy algorithm:
► With probability 1 − ε select the greedy action: At = argmax_{a∈A} Qt(a)
► With probability ε select a random action
► Equivalently:

πt(a) = (1 − ε) I(a = argmax_{b∈A} Qt(b)) + ε/|A|

► ε-greedy continues to explore
=⇒ ε-greedy with a constant ε has linear expected total regret

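A minimal sketch of ε-greedy action selection over the value estimates sketched earlier; setting epsilon = 0 recovers the pure greedy policy. The function name and the random tie-breaking are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise pick a greedy action (ties broken at random)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))
    greedy_actions = np.flatnonzero(Q == Q.max())  # handle ties explicitly
    return int(rng.choice(greedy_actions))

# Typical interaction loop with the bandit and update sketched earlier:
# for t in range(1000):
#     a = epsilon_greedy(Q, epsilon=0.1)
#     r = bandit.step(a)
#     update(a, r)
```
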
Algorithms: Policy gradients

Policy search

► Can we learn policies π(a) directly, instead of learning values?

► For instance, define action preferences Ht(a) and a (softmax) policy

πt(a) = e^{Ht(a)} / Σ_{b∈A} e^{Ht(b)}

► The preferences are not values: they are just learnable policy parameters
► Goal: learn by optimising the preferences

Policy gradients

► Idea: update policy parameters such that expected value increases


► We can use gradient ascent
► In the bandit case, we want to update:

θt+1 = θt + α ∇θ E[Rt |πθt ] ,

where θt are the current policy parameters


► Can we compute this gradient?

Gradient bandits

► Log-likelihood trick (also known as REINFORCE trick, Williams 1992):

∇θ E[Rt |θ ] = E [Rt ∇θ log πθ (At )]

► We can sample this!


► So we can update

θt+1 = θt + α Rt ∇θ log πθt(At)

► This is stochastic gradient ascent on the (true) value of the policy
► Can use sampled rewards — does not need value estimates

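A minimal sketch of this update for the softmax policy over preferences H defined earlier, using the fact that ∇_H log π(At) = onehot(At) − π for a softmax. No baseline is used, matching the update above; function names and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

num_actions = 3
H = np.zeros(num_actions)      # action preferences H_t(a): the policy parameters
step_size = 0.1

def softmax(h):
    z = h - h.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gradient_bandit_update(H, action, reward):
    """REINFORCE update: H <- H + alpha * R_t * grad_H log pi(A_t).

    For a softmax policy, grad_H log pi(a) = onehot(a) - pi.
    """
    pi = softmax(H)
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0
    return H + step_size * reward * grad_log_pi

# One interaction: sample an action from the policy, observe a reward, update.
pi = softmax(H)
action = int(rng.choice(num_actions, p=pi))
reward = 1.0                   # placeholder; would come from the bandit
H = gradient_bandit_update(H, action, reward)
```
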
Theory: what is possible?

How well can we do?

Theorem (Lai and Robbins)

Asymptotic total regret is at least logarithmic in the number of steps

► Note that regret grows at least logarithmically


► That’s still a whole lot better than linear growth! Can we get it in practice?
► Are there algorithms for which the upper bound is logarithmic as well?

Counting Regret

► Recall ∆a = v∗ − q(a)
► Total regret depends on the action regrets ∆a and the action counts Nt(a):

Lt = Σ_{n=1}^t ∆_{An} = Σ_{a∈A} Nt(a) ∆a

► A good algorithm ensures small counts Nt(a) for actions with large regret ∆a

Optimism in the face of uncertainty

Optimism in the Face of Uncertainty

[Figure: belief distributions P[q(a1)], P[q(a2)], P[q(a3)] (probability density over the expected value of each action)]

► Which action should we pick?


► The more uncertainty about an action's value, the more important it is to explore that action

Algorithms: UCB

Upper Confidence Bounds

► Estimate an upper confidence Ut(a) for each action value, such that q(a) ≤ Qt(a) + Ut(a) with high probability
► Select the action maximising the upper confidence bound (UCB):

At = argmax_{a∈A} (Qt(a) + Ut(a))

► The uncertainty should depend on the number of times Nt(a) that action a has been selected
► Small Nt (a) ⇒ large Ut (a) (estimated value is uncertain)
► Large Nt (a) ⇒ small Ut (a) (estimated value is accurate)
► Then a is only selected if either...
► ...Qt (a) is large (=good action), or
► ...Ut (a) is large (=high uncertainty) (or both)
► Can we derive an optimal bound?
Theory: the optimality of UCB

Hoeffding’s Inequality

Theorem (Hoeffding’s Inequality)

Let X1, . . . , Xn be i.i.d. random variables in [0, 1], and let X̄n = (1/n) Σ_{i=1}^n Xi denote the sample mean. Then

P( X̄n + u ≤ E[X] ) ≤ e^{−2nu²}

Applied to the rewards of action a (conditioned on having selected a), this gives

P( Qt(a) + Ut(a) ≤ q(a) ) ≤ e^{−2 Nt(a) Ut(a)²}

Calculating Upper Confidence Bounds

► We can pick a maximal desired probability p that the true value exceeds the upper
bound, and solve for this bound Ut(a) (see the derivation sketched below)

We then know the probability that this happens is smaller than p

► Idea: reduce p as we observe more rewards, e.g., p = 1/t

► This ensures that we always keep exploring, but not too much
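A sketch of the derivation referred to above, under the standard assumption of rewards bounded in [0, 1]; this is my reconstruction of the intermediate algebra, which the slides omit.

```latex
\begin{align*}
  e^{-2 N_t(a) U_t(a)^2} = p
  \;\Longrightarrow\;
  U_t(a) &= \sqrt{\frac{-\log p}{2 N_t(a)}} \\
  \text{and with } p = 1/t: \qquad
  U_t(a) &= \sqrt{\frac{\log t}{2 N_t(a)}}
\end{align*}
```
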
UCB

► UCB:

At = argmax_{a∈A} ( Qt(a) + c √(log t / Nt(a)) )

where c is a hyper-parameter controlling the amount of exploration (c = √(1/2) corresponds to the bound derived above)

► Intuition:
► If ∆a is large, then Nt (a) is small, because Qt (a) is likely to be small
► So either ∆a is small or Nt (a) is small
► In fact, we can prove ∆aNt (a) ≤ O(log t), for all a

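A minimal sketch of UCB action selection with this bound; giving unvisited arms an infinite bonus (so each arm is tried at least once) is a common convention and an assumption here, as are the names.

```python
import numpy as np

def ucb_action(Q, N, t, c=np.sqrt(0.5)):
    """Pick argmax_a Q(a) + c * sqrt(log t / N(a)).

    c = sqrt(1/2) corresponds to the Hoeffding-derived bound with p = 1/t.
    Arms that have never been tried get an infinite bonus.
    """
    Q = np.asarray(Q, dtype=float)
    N = np.asarray(N, dtype=float)
    bonus = np.full_like(Q, np.inf)          # untried arms: infinite bonus
    tried = N > 0
    bonus[tried] = c * np.sqrt(np.log(t) / N[tried])
    return int(np.argmax(Q + bonus))

# e.g. at step t = 10 with 3 arms:
Q = np.array([0.2, 0.5, 0.4])
N = np.array([3, 4, 3])
action = ucb_action(Q, N, t=10)
```
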
Bayesian approaches

Bayesian Bandits

► We could adopt a Bayesian approach and model distributions over values p(q(a) | θt)
► This is interpreted as our belief that q(a) = x, for each possible value x ∈ R
► E.g., θt could contain the means and variances of Gaussian belief distributions
► Allows us to inject rich prior knowledge θ0
► We can then use posterior belief to guide exploration

Bayesian Approach

► Prior Distribution: Initialize a prior distribution representing the agent's initial
beliefs about the parameters. This distribution captures the uncertainty before
observing any data.
► Observation and Likelihood: As the agent interacts with the environment, it collects
data (observations) about state-action pairs and rewards.
► Posterior Distribution: The updated distribution is the posterior distribution, which
combines the prior beliefs with the likelihood of the observed data.
► Exploration via Uncertainty:
○ Exploit: Choose actions that are currently believed to be optimal
according to the current posterior distribution.
○ Explore: Choose actions that have higher uncertainty, i.e., where the
posterior distribution is spread out.
► Bayesian Updating: the posterior becomes the prior for the next interaction, and the cycle repeats

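A minimal sketch of this cycle for a Bernoulli bandit with a Beta prior per arm; the conjugate Beta-Bernoulli model is my choice for concreteness, not something the slides prescribe.

```python
import numpy as np

num_actions = 3
# Beta(1, 1) prior for each arm = uniform belief over its success probability.
alpha = np.ones(num_actions)   # 1 + number of observed successes
beta = np.ones(num_actions)    # 1 + number of observed failures

def update_posterior(action, reward):
    """Bayesian updating: the Beta prior is conjugate to the Bernoulli
    likelihood, so the posterior is again a Beta with updated counts."""
    if reward == 1:
        alpha[action] += 1
    else:
        beta[action] += 1

# Posterior summaries that can guide exploration:
posterior_mean = alpha / (alpha + beta)
posterior_var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))
```
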
Bayesian Bandits with Upper Confidence Bounds

[Figure: posterior distributions pt(Q(a1)), pt(Q(a2)), pt(Q(a3)) with means µ(a1), µ(a2), µ(a3) and upper-confidence widths cσ(a1), cσ(a2), cσ(a3)]

► We can estimate upper confidences from the posterior


► e.g., Ut(a) = c σt(a), where σt(a) is the standard deviation of pt(q(a))
► Then pick the action that maximises Qt(a) + c σt(a)

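A minimal sketch of this rule, assuming each posterior pt(q(a)) is summarised by a mean and a standard deviation; the names and the example numbers are illustrative.

```python
import numpy as np

def bayesian_ucb_action(posterior_mean, posterior_std, c=2.0):
    """Pick argmax_a Q_t(a) + c * sigma_t(a), where the posterior over q(a)
    is summarised by its mean and standard deviation."""
    scores = np.asarray(posterior_mean) + c * np.asarray(posterior_std)
    return int(np.argmax(scores))

# e.g. three arms; arm 0 has the best mean, arm 2 the widest posterior:
mean = [1.0, 0.6, 0.4]
std = [0.1, 0.2, 0.8]
action = bayesian_ucb_action(mean, std, c=2.0)   # picks arm 2: 0.4 + 1.6 = 2.0
```
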
Algorithms: Thompson sampling

Probability Matching

► A different option is to use probability matching:


Select action a according to the probability (belief) that a is optimal

► Probability matching is optimistic in the face of uncertainty:


Actions have higher probability when either the estimated value is high, or the
uncertainty is high
► It can be difficult to compute πt(a) analytically from the posterior (but it can be done
numerically)

Thompson Sampling

► Thompson sampling (Thompson 1933) implements probability matching by sampling:
► Sample a value Qt(a) ∼ pt(q(a)) from the posterior for each action a
► Select the action with the highest sampled value: At = argmax_{a∈A} Qt(a)
► For Bernoulli bandits, Thompson sampling achieves the Lai and Robbins lower bound
on regret, and is therefore optimal

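A minimal sketch of Thompson sampling for Bernoulli bandits, reusing the Beta posteriors sketched earlier; the Beta-Bernoulli model is an assumption, while the sample-then-argmax step is the Thompson-sampling rule itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_action(alpha, beta):
    """Sample one plausible value of q(a) from each arm's posterior and act
    greedily with respect to the samples (probability matching by sampling)."""
    samples = rng.beta(alpha, beta)   # one draw per arm from Beta(alpha_a, beta_a)
    return int(np.argmax(samples))

# Interaction loop sketch (assuming a bandit that returns 0/1 rewards):
# for t in range(1000):
#     a = thompson_action(alpha, beta)
#     r = bandit.step(a)
#     update_posterior(a, r)
```
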
Planning to explore

Information State Space

► We have viewed bandits as one-step decision-making problems


► We can also view them as sequential decision-making problems
► At each step the agent updates its state St to summarise the past
► Each action At causes a transition to a new information state St+1 (by adding
information), with probability p(St+1 | At , St )
► We now have a Markov decision problem
► The state is fully internal to the agent
► State transitions are random due to rewards & actions
► Even in bandits actions affect the future after all, via learning

End of lecture

Background

Recommended reading:
Sutton & Barto 2018, Chapter 2

Further background material:


Bandit Algorithms, Lattimore & Szepesvári, 2020
Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi, Fischer, 2002

