Reinforcement Learning
Exploration and Control
Natnael Argaw
Lecture by Natnael Argaw (PhD), © Hado van Hasselt
Recap
► Reinforcement learning is the science of learning to make decisions
► Agents can learn a policy, value function and/or a model
► The general problem involves taking into account time and consequences
► Decisions affect the reward, the agent state, and environment state
► Learning is active: decisions impact data
This Lecture
In this lecture, we simplify the setting
► The environment is assumed to have only a single state
► =⇒ actions no longer have long-term consequences in the environment
► =⇒ actions still do impact the immediate reward
► =⇒ other observations can be ignored
► We discuss how to learn a policy in this setting
Exploration vs. Exploitation
► Learning agents need to trade off two things
► Exploitation: Maximise performance based on current knowledge
► Exploration: Increase knowledge
► We need to gather information to make the best overall decisions
► The best long-term strategy may involve short-term sacrifices
Formalising the problem
Algorithms
The Multi-Armed Bandit
► A multi-armed bandit is a set of reward distributions {R_a | a ∈ A}
► A is a (known) set of actions (or “arms”)
► R_a is a distribution on rewards, given action a
► At each step t the agent selects an action At ∈ A
► The environment generates a reward Rt ∼ R_{At}
► The goal is to maximise the cumulative reward Σ_{i=1}^t Ri
► We do this by learning a policy: a distribution on A
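To make the interaction loop concrete, here is a minimal sketch (not from the slides) of a bandit with Gaussian reward distributions and a uniformly random placeholder policy; the class and parameter names are assumptions for illustration:

```python
import numpy as np

class GaussianBandit:
    """A multi-armed bandit: one reward distribution R_a per action ("arm")."""

    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)  # true expected rewards q(a)
        self.rng = rng if rng is not None else np.random.default_rng()

    @property
    def num_actions(self):
        return len(self.means)

    def step(self, action):
        # Sample a reward Rt ~ R_{At}; the unit-variance Gaussian is an assumption.
        return self.rng.normal(self.means[action], 1.0)

# Interaction loop with a uniformly random policy, as a placeholder for the
# algorithms discussed below; the goal is to maximise the cumulative reward.
bandit = GaussianBandit(means=[0.1, 0.5, 0.9])
rng = np.random.default_rng(0)
total_reward = 0.0
for t in range(1000):
    action = rng.integers(bandit.num_actions)
    total_reward += bandit.step(action)
print(total_reward)
```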
Values and Regret
► The action value for action a is the expected reward
q(a) = E [Rt |At = a]
► The optimal value is
v∗ = max_{a∈A} q(a) = max_{a∈A} E [Rt | At = a]
► Regret of an action a is
∆a = v∗ − q(a)
► The regret for the optimal action is zero
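As a small worked example (with illustrative numbers, not from the slides): for a three-armed bandit with q(a1) = 0.1, q(a2) = 0.5 and q(a3) = 0.9, the optimal value is v∗ = 0.9, so the regrets are ∆a1 = 0.8, ∆a2 = 0.4 and ∆a3 = 0.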
Regret
► We want to minimise total regret:
Lt = Σ_{n=1}^t (v∗ − q(An)) = Σ_{n=1}^t ∆_{An}
► The summation spans over the full ‘lifetime of learning’
Algorithms
► We will discuss several algorithms:
► Greedy
► ε-greedy
► UCB
► Thompson sampling
► Policy gradients
► The first three all use action value estimates Qt (a) ≈ q(a)
Action values
► The action value for action a is the expected reward
q(a) = E [Rt |At = a]
► A simple estimate is the average of the sampled rewards:
Qt(a) = (Σ_{n=1}^t I(An = a) Rn) / (Σ_{n=1}^t I(An = a)) ,
where I(·) is the indicator function: I(True) = 1 and I(False) = 0
► The count for action a is Nt(a) = Σ_{n=1}^t I(An = a)
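A minimal sketch of the sample-average estimate (function and argument names are assumptions, not from the slides):

```python
import numpy as np

def sample_average_values(actions, rewards, num_actions):
    """Estimate Q_t(a) as the average of the rewards observed for each action."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    q = np.zeros(num_actions)
    for a in range(num_actions):
        mask = actions == a            # I(A_n = a)
        count = mask.sum()             # N_t(a)
        q[a] = rewards[mask].mean() if count > 0 else 0.0
    return q

# Example: three observed interactions with a 2-armed bandit.
print(sample_average_values(actions=[0, 1, 0], rewards=[1.0, 0.0, 0.5], num_actions=2))
# -> [0.75, 0.0]
```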
Action values
► This can also be updated incrementally:
Qt(At) = Qt−1(At) + αt (Rt − Qt−1(At)) ,  with αt = 1/Nt(At) and Nt(At) = Nt−1(At) + 1,
where N0(a) = 0.
► We will later consider other step sizes α
► For instance, constant α would lead to tracking, rather than averaging
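A minimal sketch of the incremental update (names are assumptions, not from the slides); the default step size 1/Nt(a) recovers the sample average, while a constant step_size gives the tracking behaviour mentioned above:

```python
import numpy as np

class ActionValueEstimator:
    """Incremental action-value estimates Q_t(a) with counts N_t(a)."""

    def __init__(self, num_actions, step_size=None):
        self.q = np.zeros(num_actions)   # Q_0(a) = 0 (an arbitrary initialisation)
        self.n = np.zeros(num_actions)   # N_0(a) = 0
        self.step_size = step_size       # None -> alpha_t = 1/N_t(a) (averaging)

    def update(self, action, reward):
        self.n[action] += 1
        # A constant alpha tracks (recent rewards weigh more); 1/N_t(a) averages.
        alpha = self.step_size if self.step_size is not None else 1.0 / self.n[action]
        self.q[action] += alpha * (reward - self.q[action])

est = ActionValueEstimator(num_actions=2)
est.update(0, 1.0); est.update(0, 0.0)
print(est.q)  # -> [0.5, 0.0]
```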
Algorithms: greedy
The greedy policy
► One of the simplest policies is greedy:
► Select the action with the highest estimated value: At = argmax_{a∈A} Qt(a)
► Equivalently: πt(a) = I(a = argmax_{b∈A} Qt(b)) (assuming no ties are possible)
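A minimal sketch of greedy action selection (random tie-breaking is an implementation choice, not prescribed by the slides):

```python
import numpy as np

def greedy_action(q_values, rng=None):
    """Greedy policy: pick the action with the highest estimated value.

    Ties are broken uniformly at random rather than by index."""
    rng = rng if rng is not None else np.random.default_rng()
    q_values = np.asarray(q_values)
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))

print(greedy_action([0.2, 0.7, 0.7]))  # -> 1 or 2
```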
Algorithms: ε-Greedy
ε-Greedy Algorithm
► Greedy can get stuck on a suboptimal action forever
=⇒ linear expected total regret
► The ε-greedy algorithm:
► With probability 1 − ε select the greedy action: At = argmax_{a∈A} Qt(a)
► With probability ε select a random action (uniformly)
► Equivalently: πt(a) = (1 − ε) I(a = argmax_{b∈A} Qt(b)) + ε/|A|
► ε-greedy continues to explore
=⇒ ε-greedy with constant ε has linear expected total regret
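A minimal sketch of ε-greedy action selection (names are assumptions, not from the slides):

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng=None):
    """epsilon-greedy: random action with probability epsilon, greedy otherwise."""
    rng = rng if rng is not None else np.random.default_rng()
    q_values = np.asarray(q_values)
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

# Example: mostly picks action 1, but explores ~10% of the time.
print(epsilon_greedy_action([0.2, 0.7, 0.1], epsilon=0.1))
```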
Algorithms: Policy gradients
Policy search
► Can we learn policies π (a) directly, instead of learning values?
► For instance, define action preferences Ht(a) and a (softmax) policy
πt(a) = e^{Ht(a)} / Σ_{b∈A} e^{Ht(b)}
► The preferences are not values: they are just learnable policy parameters
► Goal: learn by optimising the preferences
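A minimal sketch of a softmax policy over preferences (assuming the softmax parameterisation above; names are illustrative):

```python
import numpy as np

def softmax_policy(preferences):
    """Turn action preferences H_t(a) into a probability distribution pi_t(a)."""
    h = np.asarray(preferences, dtype=float)
    h = h - h.max()                    # shift for numerical stability
    e = np.exp(h)
    return e / e.sum()

print(softmax_policy([0.0, 1.0, 2.0]))  # -> roughly [0.09, 0.24, 0.67]
```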
Policy gradients
► Idea: update policy parameters such that expected value increases
► We can use gradient ascent
► In the bandit case, we want to update:
θt+1 = θt + α ∇θ E[Rt |πθt ] ,
where θt are the current policy parameters
► Can we compute this gradient?
Gradient bandits
► Log-likelihood trick (also known as REINFORCE trick, Williams 1992):
∇θ E[Rt |θ ] = E [Rt ∇θ log πθ (At )]
► We can sample this!
► So
θt+1 = θt + α Rt ∇θ log πθt (At) ,
which is stochastic gradient ascent on the (true) value of the policy
► Can use sampled rewards — does not need value estimates
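A minimal sketch of this update for the softmax-preference policy above (names are assumptions; the commonly used reward baseline is omitted for simplicity):

```python
import numpy as np

def softmax_policy(preferences):
    h = np.asarray(preferences, dtype=float)
    e = np.exp(h - h.max())
    return e / e.sum()

def reinforce_update(preferences, action, reward, step_size=0.1):
    """One stochastic gradient ascent step on the preferences H_t(a).

    For a softmax policy, grad_H log pi(A_t) = I(a = A_t) - pi(a)."""
    pi = softmax_policy(preferences)
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0
    return np.asarray(preferences, dtype=float) + step_size * reward * grad_log_pi

h = np.zeros(3)
h = reinforce_update(h, action=2, reward=1.0)
print(softmax_policy(h))  # the probability of action 2 has increased
```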
Theory: what is possible?
How well can we do?
Theorem (Lai and Robbins)
Asymptotic total regret is at least logarithmic in the number of steps:
lim_{t→∞} Lt / log t ≥ Σ_{a : ∆a > 0} ∆a / KL(Ra ‖ Ra∗) , where a∗ is an optimal action
► Note that regret grows at least logarithmically
► That’s still a whole lot better than linear growth! Can we get it in practice?
► Are there algorithms for which the upper bound is logarithmic as well?
Counting Regret
► Recall ∆a = v∗ − q(a)
► Total regret depends on the action regrets ∆a and the action counts Nt(a): Lt = Σ_{a∈A} Nt(a) ∆a
► A good algorithm ensures small counts for large action regrets
Optimism in the face of uncertainty
Optimism in the Face of Uncertainty
[Figure: belief distributions P[q(a1)], P[q(a2)], P[q(a3)] shown as probability densities over the expected value]
► Which action should we pick?
► More uncertainty about its value: more important to explore that action
Algorithms: UCB
Upper Confidence Bounds
► Estimate an upper confidence Ut(a) for each action value, such that q(a) ≤ Qt(a) + Ut(a) with high probability
► Select the action maximising the upper confidence bound (UCB):
At = argmax_{a∈A} [ Qt(a) + Ut(a) ]
► The uncertainty should depend on the number of times Nt (a) action a has been selected
► Small Nt (a) ⇒ large Ut (a) (estimated value is uncertain)
► Large Nt (a) ⇒ small Ut (a) (estimated value is accurate)
► Then a is only selected if either...
► ...Qt (a) is large (=good action), or
► ...Ut (a) is large (=high uncertainty) (or both)
► Can we derive an optimal bound?
Theory: the optimality of UCB
Hoeffding’s Inequality
Theorem (Hoeffding’s Inequality)
Let X1, . . . , Xn be i.i.d. random variables in [0, 1], with sample mean X̄n = (1/n) Σ_{i=1}^n Xi. Then, for u > 0,
p(X̄n + u ≤ E[X]) ≤ e^{−2nu²}
Applied to the rewards of the bandit (assuming rewards in [0, 1]), conditioned on selecting action a:
p(Qt(a) + Ut(a) ≤ q(a)) ≤ e^{−2Nt(a)Ut(a)²}
Calculating Upper Confidence Bounds
► We can pick a maximal desired probability p that the true value exceeds the upper bound, and solve for this bound Ut(a):
e^{−2Nt(a)Ut(a)²} = p  =⇒  Ut(a) = √(−log p / (2Nt(a)))
We then know the probability that this happens is smaller than p
► Idea: reduce p as we observe more rewards, e.g., p = 1/t, giving
Ut(a) = √(log t / (2Nt(a)))
► This ensures that we always keep exploring, but not too much
UCB
► UCB: At = argmax_{a∈A} [ Qt(a) + √(log t / (2Nt(a))) ]
► Intuition:
► If ∆a is large, then Nt (a) is small, because Qt (a) is likely to be small
► So either ∆a is small or Nt (a) is small
► In fact, we can prove ∆aNt (a) ≤ O(log t), for all a
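A minimal sketch of UCB action selection using the bound derived above (the exploration constant c and the handling of untried actions are implementation choices, not from the slides):

```python
import numpy as np

def ucb_action(q_values, counts, t, c=1.0):
    """UCB selection: argmax_a Q_t(a) + c * sqrt(log t / (2 N_t(a))).

    Untried actions (N_t(a) = 0) are selected first; c trades off exploration."""
    q_values = np.asarray(q_values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / (2.0 * counts))
    return int(np.argmax(q_values + bonus))

# Action 1 has a lower estimate but far fewer tries, so its exploration bonus can win.
print(ucb_action(q_values=[0.6, 0.5], counts=[100, 2], t=102))  # -> 1
```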
Bayesian approaches
Bayesian Bandits
► We could adopt Bayesian approach and model distributions over values p(q(a) | θt )
► This is interpreted as our belief that, e.g., q(a) = x for all x ∈ R
► E.g., θt could contain the means and variances of Gaussian belief distributions
► Allows us to inject rich prior knowledge θ0
► We can then use posterior belief to guide exploration
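A minimal sketch of one common choice, assumed here for illustration: Bernoulli rewards with per-arm Beta posteriors, where θt consists of the (α, β) counts:

```python
import numpy as np

class BetaBernoulliBeliefs:
    """Posterior beliefs p(q(a) | theta_t) for Bernoulli rewards: one Beta(alpha, beta) per arm."""

    def __init__(self, num_actions, prior_alpha=1.0, prior_beta=1.0):
        self.alpha = np.full(num_actions, prior_alpha)  # Beta parameter: pseudo-count of successes
        self.beta = np.full(num_actions, prior_beta)    # Beta parameter: pseudo-count of failures

    def update(self, action, reward):
        # Bayesian updating: the posterior after a 0/1 reward is again a Beta distribution.
        self.alpha[action] += reward
        self.beta[action] += 1.0 - reward

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

beliefs = BetaBernoulliBeliefs(num_actions=3)
beliefs.update(0, 1.0); beliefs.update(0, 0.0); beliefs.update(1, 1.0)
print(beliefs.mean())  # -> [0.5, 0.667, 0.5]
```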
Bayesian Approach
► Prior Distribution: Initialize a prior distribution representing the agent's initial
beliefs about the parameters. This distribution captures the uncertainty before
observing any data.
► Observation and Likelihood: As the agent interacts with the environment, it collects
data (observations) about state-action pairs and rewards.
► Posterior Distribution: The updated distribution is the posterior distribution, which
combines the prior beliefs with the likelihood of the observed data.
► Exploration via Uncertainty:
○ Exploit: Choose actions that are currently believed to be optimal
according to the current posterior distribution.
○ Explore: Choose actions that have higher uncertainty or where the
posterior distribution is spread out.
► Bayesian updating: after each observation, the posterior becomes the prior for the next update
Bayesian Bandits with Upper Confidence Bounds
[Figure: posterior densities pt(Q(a1)), pt(Q(a2)), pt(Q(a3)) with means µ(a1), µ(a2), µ(a3) and confidence widths cσ(a1), cσ(a2), cσ(a3)]
► We can estimate upper confidences from the posterior
► e.g., Ut(a) = c σt(a), where σt(a) is the standard deviation of pt(q(a))
► Then, pick the action that maximises Qt(a) + c σt(a)
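A minimal sketch of this rule (the constant c = 2.0 and the Beta posterior are assumptions for illustration, reusing the beliefs sketched above):

```python
import numpy as np

def bayesian_ucb_action(posterior_means, posterior_stds, c=2.0):
    """Pick argmax_a Q_t(a) + c * sigma_t(a), using posterior means and standard deviations."""
    means = np.asarray(posterior_means, dtype=float)
    stds = np.asarray(posterior_stds, dtype=float)
    return int(np.argmax(means + c * stds))

# With Beta(alpha, beta) posteriors, means and standard deviations follow in closed form.
alpha, beta = np.array([2.0, 2.0, 1.0]), np.array([2.0, 1.0, 1.0])
means = alpha / (alpha + beta)
stds = np.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0)))
print(bayesian_ucb_action(means, stds))  # -> 1 (highest mean plus uncertainty bonus)
```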
Algorithms: Thompson sampling
Probability Matching
► A different option is to use probability matching:
Select action a according to the probability (belief) that a is optimal
► Probability matching is optimistic in the face of uncertainty:
Actions have higher probability when either the estimated value is high, or the
uncertainty is high
► Can be difficult to compute π (a) analytically from posterior (but can be done
numerically)
Thompson Sampling
► Thompson sampling (Thompson 1933) implements probability matching by sampling:
► Sample Qt(a) ∼ pt(q(a)) from the posterior, for each action a
► Select the action with the highest sample: At = argmax_{a∈A} Qt(a)
► For Bernoulli bandits, Thompson sampling achieves the Lai and Robbins lower bound on regret, and is therefore optimal
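A minimal sketch of Thompson sampling for Bernoulli bandits with Beta posteriors (an assumed setting, matching the earlier Bayesian sketch):

```python
import numpy as np

def thompson_sample_action(alpha, beta, rng=None):
    """Thompson sampling with Beta(alpha, beta) posteriors per arm:
    sample one plausible value per arm from the posterior, then act greedily on the samples."""
    rng = rng if rng is not None else np.random.default_rng()
    samples = rng.beta(alpha, beta)
    return int(np.argmax(samples))

# Arms with posteriors Beta(2, 2), Beta(2, 1), Beta(1, 1); arm 1 is picked most often.
print(thompson_sample_action(alpha=[2.0, 2.0, 1.0], beta=[2.0, 1.0, 1.0]))
```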
Planning to explore
Information State Space
► We have viewed bandits as one-step decision-making problems
► Can also view as sequential decision-making problems
► Each step the agent updates state St to summarise the past
► Each action At causes a transition to a new information state St+1 (by adding
information), with probability p(St+1 | At , St )
► We now have a Markov decision problem
► The state is fully internal to the agent
► State transitions are random due to rewards & actions
► Even in bandits actions affect the future after all, via learning
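A minimal sketch of such an information state for a Bernoulli bandit (an assumed example, not from the slides): the state is the per-arm (successes, failures) counts, and the randomness in the transition comes entirely from the sampled reward:

```python
from collections import namedtuple

# Information state for a 2-armed Bernoulli bandit: per-arm success/failure counts.
InfoState = namedtuple("InfoState", ["wins", "losses"])

def transition(state, action, reward):
    """Deterministic update of the information state given the (random) reward.

    The randomness in S_{t+1} comes from the reward R_t, as in p(S_{t+1} | A_t, S_t)."""
    wins = list(state.wins)
    losses = list(state.losses)
    if reward > 0:
        wins[action] += 1
    else:
        losses[action] += 1
    return InfoState(tuple(wins), tuple(losses))

s0 = InfoState(wins=(0, 0), losses=(0, 0))
s1 = transition(s0, action=0, reward=1.0)
print(s1)  # -> InfoState(wins=(1, 0), losses=(0, 0))
```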
End of lecture
Background
Recommended reading:
Sutton & Barto 2018, Chapter 2
Further background material:
Bandit Algorithms, Lattimore & Szepesvári, 2020
Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi, Fischer, 2002