Lecture 5: Model-Free Prediction
Hado van Hasselt
UCL, 2021
Background
Sutton & Barto 2018, Chapters 5 + 6 + 7 + 9 + 12
Don’t worry about reading all of this at once!
Most important chapters, for now: 5 + 6
You can also defer some reading, e.g., until the reading week
Recap
I Reinforcement learning is the science of learning to make decisions
I Agents can learn a policy, value function and/or a model
I The general problem involves taking into account time and consequences
I Decisions affect the reward, the agent state, and environment state
Lecture overview
I Last lectures (3+4):
I Planning by dynamic programming to solve a known MDP
I This and next lectures (5→8):
I Model-free prediction to estimate values in an unknown MDP
I Model-free control to optimise values in an unknown MDP
I Function approximation and (some) deep reinforcement learning (but more to follow later)
I Off-policy learning
I Later lectures:
I Model-based learning and planning
I Policy gradients and actor critic systems
I More deep reinforcement learning
I More advanced topics and current research
Model-Free Prediction:
Monte Carlo Algorithms
Monte Carlo Algorithms
I We can use experience samples to learn without a model
I We call direct sampling of episodes Monte Carlo
I MC is model-free: no knowledge of MDP required, only samples
Monte Carlo: Bandits
I Simple example, multi-armed bandit:
I For each action, average the reward samples:
qt(a) = ( Σ_{i=0}^{t} I(Ai = a) Ri+1 ) / ( Σ_{i=0}^{t} I(Ai = a) ) ≈ E[Rt+1 | At = a] = q(a)
I Equivalently:
qt+1(At) = qt(At) + αt (Rt+1 − qt(At))
qt+1(a) = qt(a) ∀a ≠ At
with αt = 1/Nt(At) = 1 / Σ_{i=0}^{t} I(Ai = At)
I Note: we changed notation Rt → Rt+1 for the reward after At
In MDPs, the reward is said to arrive on the time step after the action
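To make the incremental update concrete, here is a minimal tabular sketch in Python; the uniformly random action selection and the sample_reward function are placeholders invented for this example:

```python
import numpy as np

def run_bandit(num_actions, num_steps, sample_reward, seed=0):
    """Estimate q(a) by incremental sample averages, i.e. alpha_t = 1/N(A_t)."""
    rng = np.random.default_rng(seed)
    q = np.zeros(num_actions)   # value estimates q_t(a)
    n = np.zeros(num_actions)   # visit counts N(a)
    for _ in range(num_steps):
        a = rng.integers(num_actions)   # placeholder behaviour: uniform random
        r = sample_reward(a)            # observe R_{t+1}
        n[a] += 1
        q[a] += (r - q[a]) / n[a]       # q(A_t) <- q(A_t) + alpha_t (R_{t+1} - q(A_t))
    return q

# Example usage: three arms with Gaussian rewards centred at 0, 1, 2.
q_hat = run_bandit(3, 10_000, lambda a: np.random.normal(a, 1.0))
```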
Monte Carlo: Bandits with States
I Consider bandits with different states
I episodes are still one step
I actions do not affect state transitions
I =⇒ no long-term consequences
I Then, we want to estimate
q(s, a) = E [Rt+1 |St = s, At = a]
I These are called contextual bandits
Introduction: Function Approximation
Value Function Approximation
I So far we mostly considered lookup tables
I Every state s has an entry v(s)
I Or every state-action pair s, a has an entry q(s, a)
I Problem with large MDPs:
I There are too many states and/or actions to store in memory
I It is too slow to learn the value of each state individually
I Individual states are often not fully observable
Value Function Approximation
Solution for large MDPs:
I Estimate value function with function approximation
vw (s) ≈ vπ (s) (or v∗ (s))
qw (s, a) ≈ qπ (s, a) (or q∗ (s, a))
I Update parameter w (e.g., using MC or TD learning)
I Generalise from seen states to unseen states
Agent state update
Solution for large MDPs, if the environment state is not fully observable
I Use the agent state:
St = uω (St−1, At−1, Ot )
with parameters ω (typically ω ∈ Rn )
I Henceforth, St denotes the agent state
I Think of this as either a vector inside the agent,
or, in the simplest case, just the current observation: St = Ot
I For now we are not going to talk about how to learn the agent state update
I Feel free to consider St an observation
Linear Function Approximation
Feature Vectors
I A useful special case: linear functions
I Represent state by a feature vector
x(s) = (x1(s), . . . , xm(s))⊤
I x : S → Rm is a fixed mapping from agent state (e.g., observation) to features
I Short-hand: xt = x(St )
I For example:
I Distance of robot from landmarks
I Trends in the stock market
I Piece and pawn configurations in chess
Linear Value Function Approximation
I Approximate value function by a linear combination of features
vw(s) = w⊤x(s) = Σ_{j=1}^{m} xj(s) wj
I Objective function (‘loss’) is quadratic in w
L(w) = E_{S∼d} [(vπ(S) − w⊤x(S))²]
I Stochastic gradient descent converges to the global optimum
I Update rule is simple
∇w vw (St ) = x(St ) = xt =⇒ ∆w = α(vπ (St ) − vw (St ))xt
Update = step-size × prediction error × feature vector
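As a sketch, the linear value function and its SGD update can be written as follows; the target argument stands in for vπ(St), which in practice is replaced by a sampled return (MC) or a bootstrapped target (TD), as discussed later:

```python
import numpy as np

def linear_v(w, x):
    """v_w(s) = w^T x(s)."""
    return w @ x

def sgd_update(w, x, target, alpha):
    """w <- w + alpha * (target - v_w(s)) * x(s): step-size x prediction error x features."""
    return w + alpha * (target - linear_v(w, x)) * x
```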
Table Lookup Features
I Table lookup is a special case of linear value function approximation
I Let the n states be given by S = {s1, . . . , sn }.
I Use one-hot feature:
x(s) = (I(s = s1), . . . , I(s = sn))⊤
I The parameter vector w then just contains one value estimate per state
v(s) = w⊤x(s) = Σ_j wj xj(s) = ws .
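A tiny sketch (with arbitrary numbers) showing that with one-hot features the linear value function reduces to a table lookup:

```python
import numpy as np

def one_hot(state_index, num_states):
    """Table-lookup features: x_j(s) = I(s = s_j)."""
    x = np.zeros(num_states)
    x[state_index] = 1.0
    return x

w = np.array([0.1, -0.4, 0.7])     # one (arbitrary) value estimate per state
assert w @ one_hot(1, 3) == w[1]   # v(s_2) = w^T x(s_2) = w_2
```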
Model-Free Prediction:
Monte Carlo Algorithms
(Continuing from before...)
Monte Carlo: Bandits with States
I q could be a parametric function, e.g., a neural network, and we could use the loss
L(w) = E[ ½ (Rt+1 − qw(St, At))² ]
I Then the gradient update is
wt+1 = wt − α ∇wt L(wt)
     = wt − α ∇wt E[ ½ (Rt+1 − qwt(St, At))² ]
     = wt + α E[ (Rt+1 − qwt(St, At)) ∇wt qwt(St, At) ] .
We can sample this to get a stochastic gradient descent (SGD) update
I The tabular case is a special case (only updates the value in cell [St , At ])
I Also works for large (continuous) state spaces S — this is just regression
Monte Carlo: Bandits with States
I When using linear functions, q(s, a) = w⊤x(s, a) and
∇wt qwt(St, At) = x(St, At)
I Then the SGD update is
wt+1 = wt + α(Rt+1 − qwt(St, At)) x(St, At) .
I Linear update = step-size × prediction error × feature vector
I Non-linear update = step-size × prediction error × gradient
Monte-Carlo Policy Evaluation
I Now we consider sequential decision problems
I Goal: learn vπ from episodes of experience under policy π
S1, A1, R2, ..., ST ∼ π
I The return is the total discounted reward (for an episode ending at time T > t):
Gt = Rt+1 + γRt+2 + ... + γ^{T−t−1} RT
I The value function is the expected return:
vπ (s) = E [Gt | St = s, π]
I We can just use sample average return instead of expected return
I We call this Monte Carlo policy evaluation
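A minimal every-visit Monte Carlo prediction sketch; it assumes each episode is given as a list of (state, reward) pairs, where the reward is the Rt+1 received after leaving that state:

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma):
    """Every-visit Monte Carlo prediction: average sampled returns per state."""
    total = defaultdict(float)   # sum of returns observed from each state
    count = defaultdict(int)     # number of visits to each state
    for episode in episodes:
        g = 0.0
        # Walk backwards so g is the return G_t from each visited state onwards.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            total[state] += g
            count[state] += 1
    return {s: total[s] / count[s] for s in total}
```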
Example: Blackjack
Blackjack Example
I States (200 of them):
I Current sum (12-21)
I Dealer’s showing card (ace-10)
I Do I have a "useable" ace? (yes-no)
I Action stick: Stop receiving cards (and terminate)
I Action draw: Take another card (random, no replacement)
I Reward for stick:
I +1 if sum of cards > sum of dealer cards
I 0 if sum of cards = sum of dealer cards
I -1 if sum of cards < sum of dealer cards
I Reward for draw:
I -1 if sum of cards > 21 (and terminate)
I 0 otherwise
I Transitions: automatically draw if sum of cards < 12
Blackjack Value Function after Monte-Carlo Learning
Disadvantages of Monte-Carlo Learning
I We have seen MC algorithms can be used to learn value predictions
I But when episodes are long, learning can be slow
I ...we have to wait until an episode ends before we can learn
I ...return can have high variance
I Are there alternatives? (Spoiler: yes)
Temporal-Difference Learning
Temporal Difference Learning by Sampling Bellman Equations
I Previous lecture: Bellman equations,
vπ (s) = E [Rt+1 + γvπ (St+1 ) | St = s, At ∼ π(St )]
I Previous lecture: Approximate by iterating,
vk+1 (s) = E [Rt+1 + γvk (St+1 ) | St = s, At ∼ π(St )]
I We can sample this!
vt+1 (St ) = Rt+1 + γvt (St+1 )
I This is likely quite noisy — better to take a small step (with parameter α):
vt+1(St) = vt(St) + αt (Rt+1 + γvt(St+1) − vt(St))
where Rt+1 + γvt(St+1) is the target
(Note: tabular update)
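A minimal sketch of this tabular TD update; v is a dict (or array) of value estimates, and the terminal value is taken to be zero:

```python
def td0_update(v, s, r, s_next, alpha, gamma, terminal=False):
    """One tabular TD(0) update: v(S_t) += alpha * (R_{t+1} + gamma * v(S_{t+1}) - v(S_t))."""
    target = r + (0.0 if terminal else gamma * v[s_next])
    td_error = target - v[s]
    v[s] += alpha * td_error
    return td_error
```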
Temporal difference learning
I Prediction setting: learn vπ online from experience under policy π
I Monte-Carlo
I Update value vn (St ) towards sampled return Gt
vn+1 (St ) = vn (St ) + α (Gt − vn (St ))
I Temporal-difference learning:
I Update value vt (St ) towards estimated return Rt+1 + γv(St+1 )
vt+1(St) ← vt(St) + α (Rt+1 + γvt(St+1) − vt(St))
where Rt+1 + γvt(St+1) is the target, and the whole term in parentheses is the TD error
I δt = Rt+1 + γvt (St+1 ) − vt (St ) is called the TD error
Dynamic Programming Backup
v(St ) ← E [Rt+1 + γv(St+1 ) | At ∼ π(St )]
Monte-Carlo Backup
v(St ) ← v(St ) + α (Gt − v(St ))
Temporal-Difference Backup
v(St ) ← v(St ) + α (Rt+1 + γv(St+1 ) − v(St ))
Bootstrapping and Sampling
I Bootstrapping: update involves an estimate
I MC does not bootstrap
I DP bootstraps
I TD bootstraps
I Sampling: update samples an expectation
I MC samples
I DP does not sample
I TD samples
Temporal difference learning
I We can apply the same idea to action values
I Temporal-difference learning for action values:
I Update value qt (St , At ) towards estimated return Rt+1 + γq(St+1 , At+1 )
qt+1(St, At) ← qt(St, At) + α (Rt+1 + γqt(St+1, At+1) − qt(St, At))
where Rt+1 + γqt(St+1, At+1) is the target, and the term in parentheses is the TD error
I This algorithm is known as SARSA, because it uses (St , At , Rt+1, St+1, At+1 )
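A minimal tabular SARSA update sketch, with q stored as a dict keyed by (state, action) pairs (e.g. a collections.defaultdict(float)):

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    """One SARSA update using the transition (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})."""
    target = r + (0.0 if terminal else gamma * q[(s_next, a_next)])
    q[(s, a)] += alpha * (target - q[(s, a)])
```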
Temporal-Difference Learning
I TD is model-free (no knowledge of MDP) and learns directly from experience
I TD can learn from incomplete episodes, by bootstrapping
I TD can learn during each episode
Example: Driving Home
Driving Home Example
State                Elapsed Time (minutes)    Predicted Time to Go    Predicted Total Time
leaving office       0                         30                      30
reach car, raining   5                         35                      40
exit highway         20                        15                      35
behind truck         30                        10                      40
home street          40                        3                       43
arrive home          43                        0                       43
Driving Home Example: MC vs. TD
[Figure: changes recommended by Monte Carlo methods (α = 1), left, vs. changes recommended by TD methods (α = 1), right.]
Comparing MC and TD
Advantages and Disadvantages of MC vs. TD
I TD can learn before knowing the final outcome
I TD can learn online after every step
I MC must wait until end of episode before return is known
I TD can learn without the final outcome
I TD can learn from incomplete sequences
I MC can only learn from complete sequences
I TD works in continuing (non-terminating) environments
I MC only works for episodic (terminating) environments
I TD is independent of the temporal span of the prediction
I TD can learn from single transitions
I MC must store all predictions (or states) to update at the end of an episode
I TD needs reasonable value estimates
Bias/Variance Trade-Off
I MC return Gt = Rt+1 + γRt+2 + . . . is an unbiased estimate of vπ (St )
I TD target Rt+1 + γvt (St+1 ) is a biased estimate of vπ (St ) (unless vt (St+1 ) = vπ (St+1 ))
I But the TD target has lower variance:
I Return depends on many random actions, transitions, rewards
I TD target depends on one random action, transition, reward
Bias/Variance Trade-Off
I In some cases, TD can have irreducible bias
I The world may be partially observable
I MC would implicitly account for all the latent variables
I The function to approximate the values may fit poorly
I In the tabular case, both MC and TD will converge: vt → vπ
Example: Random Walk
Random Walk Example
I Uniform random transitions (50% left, 50% right)
I Initial values are v(s) = 0.5, for all s
I True values happen to be
v(A) = 1/6, v(B) = 2/6, v(C) = 3/6, v(D) = 4/6, v(E) = 5/6
Random Walk Example
Random Walk: MC vs. TD
[Figure: root mean squared error (RMSE) on the random walk over the first 100 episodes, for TD (left) and MC (right), each with step sizes α ∈ {0.01, 0.03, 0.1, 0.3}.]
Batch MC and TD
Batch MC and TD
I Tabular MC and TD converge: vt → vπ as experience → ∞ and αt → 0
I But what about finite experience?
I Consider a fixed batch of experience:
episode 1: S1^1, A1^1, R2^1, ..., S_{T1}^1
...
episode K: S1^K, A1^K, R2^K, ..., S_{TK}^K
I Repeatedly sample each episode k ∈ [1, K] and apply MC or TD(0)
I = sampling from an empirical model
Example:
Batch Learning in Two States
Example: Batch Learning in Two States
Two states A, B; no discounting; 8 episodes of experience
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is v(A), v(B)?
Differences in batch solutions
I MC converges to best mean-squared fit for the observed returns
Σ_{k=1}^{K} Σ_{t=1}^{Tk} (Gt^k − v(St^k))²
I In the AB example, v(A) = 0
I TD converges to solution of max likelihood Markov model, given the data
I Solution to the empirical MDP (S, A, p̂, γ) that best fits the data
I In the AB example: p̂(St+1 = B | St = A) = 1, and therefore v(A) = v(B) = 0.75
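For illustration, a small self-contained sketch that reproduces both batch solutions on the AB data above (the step size and the number of replay sweeps are arbitrary choices for this example):

```python
from collections import defaultdict

# The eight episodes as (state, reward) transitions; no discounting (gamma = 1).
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]

# Batch MC: average the observed returns from each state.
total, count = defaultdict(float), defaultdict(int)
for episode in episodes:
    g = 0.0
    for s, r in reversed(episode):
        g = r + g                                 # undiscounted return
        total[s] += g
        count[s] += 1
v_mc = {s: total[s] / count[s] for s in total}    # v(A) = 0.0, v(B) = 0.75

# Batch TD(0): replay the episodes many times with a small step size.
v_td = {'A': 0.0, 'B': 0.0}
for _ in range(5000):
    for episode in episodes:
        for i, (s, r) in enumerate(episode):
            nxt = v_td[episode[i + 1][0]] if i + 1 < len(episode) else 0.0
            v_td[s] += 0.01 * (r + nxt - v_td[s])
# v_td['A'] and v_td['B'] both approach 0.75, the max-likelihood Markov solution.
```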
Advantages and Disadvantages of MC vs. TD
I TD exploits Markov property
I Can help in fully-observable environments
I MC does not exploit Markov property
I Can help in partially-observable environments
I With finite data, or with function approximation, the solutions may differ
Between MC and TD:
Multi-Step TD
Unified View of Reinforcement Learning
Multi-Step Updates
I TD uses value estimates which might be inaccurate
I In addition, information can propagate back quite slowly
I In MC information propagates faster, but the updates are noisier
I We can go in between TD and MC
Multi-Step Prediction
I Let TD target look n steps into the future
Multi-Step Returns
I Consider the following n-step returns for n = 1, 2, ∞:
n = 1 (TD)   Gt^(1) = Rt+1 + γv(St+1)
n = 2        Gt^(2) = Rt+1 + γRt+2 + γ²v(St+2)
...
n = ∞ (MC)   Gt^(∞) = Rt+1 + γRt+2 + ... + γ^{T−t−1} RT
I In general, the n-step return is defined by
Gt^(n) = Rt+1 + γRt+2 + ... + γ^{n−1} Rt+n + γ^n v(St+n)
I Multi-step temporal-difference learning
v(St) ← v(St) + α (Gt^(n) − v(St))
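A sketch of computing Gt^(n), assuming rewards[k] holds Rk+1 and values[k] holds the current estimate v(Sk); if the episode ends before t + n, the return is simply truncated at the terminal step (no bootstrapping):

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n v(S_{t+n})."""
    T = len(rewards)                       # episode length
    g, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        g += discount * rewards[k]         # accumulate discounted rewards
        discount *= gamma
    if t + n < T:                          # bootstrap only if S_{t+n} is non-terminal
        g += discount * values[t + n]
    return g
```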
Multi-Step Examples
Grid Example
(Reminder: SARSA is TD for action values q(s, a))
Large Random Walk Example
...as before, but with 19 states rather than 5
[Figure 7.2 from Sutton & Barto: RMS error over the first 10 episodes of on-line (left) and off-line (right) n-step TD methods on the 19-state random walk, as a function of the step size α, for n ∈ {1, 2, 4, 8, 16, 32, 64, 128, 256, 512}.]
First note that the on-line methods generally worked better on this task, both reaching lower levels of absolute error and doing so over a larger range of the step-size parameter α (in fact, all the off-line methods were unstable for α much above 0.3). Second, note that methods with an intermediate value of n worked best. This illustrates how the generalization of TD and Monte Carlo methods to multi-step methods can potentially perform better than either of the two extreme methods.
Mixed Multi-Step Returns
Mixing multi-step returns
I Multi-step returns bootstrap on one state, v(St+n ):
Gt^(n) = Rt+1 + γGt+1^(n−1)     (while n > 1, continue)
Gt^(1) = Rt+1 + γv(St+1) .      (truncate & bootstrap)
I You can also bootstrap a little bit on multiple states:
Gt^λ = Rt+1 + γ((1 − λ)v(St+1) + λGt+1^λ)
This gives a weighted average of n-step returns:
Gt^λ = Σ_{n=1}^{∞} (1 − λ) λ^{n−1} Gt^(n)
(Note: Σ_{n=1}^{∞} (1 − λ) λ^{n−1} = 1)
Mixing multi-step returns
Gt^λ = Rt+1 + γ((1 − λ)v(St+1) + λGt+1^λ)
Special cases:
Gt^(λ=0) = Rt+1 + γv(St+1)     (TD)
Gt^(λ=1) = Rt+1 + γGt+1        (MC)
Mixing multi-step returns
Intuition: 1/(1 − λ) is the ‘horizon’. E.g., λ = 0.9 ≈ n = 10.
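To make the λ-return concrete, here is a sketch that computes Gt^λ by the recursion Gt^λ = Rt+1 + γ((1 − λ)v(St+1) + λGt+1^λ), under the same indexing assumptions as the n-step sketch above (rewards[k] holds Rk+1, values[k] holds v(Sk), and the terminal value is 0):

```python
def lambda_return(rewards, values, t, lam, gamma):
    """G_t^lambda = R_{t+1} + gamma * ((1 - lam) * v(S_{t+1}) + lam * G_{t+1}^lambda)."""
    T = len(rewards)
    g = 0.0
    for k in reversed(range(t, T)):
        if k == T - 1:
            g = rewards[k]   # last step: S_T is terminal, so both terms bootstrap on 0
        else:
            g = rewards[k] + gamma * ((1 - lam) * values[k + 1] + lam * g)
    return g
```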
Benefits of Multi-Step Learning
Benefits of multi-step returns
I Multi-step returns have benefits from both TD and MC
I Bootstrapping can have issues with bias
I Monte Carlo can have issues with variance
I Typically, intermediate values of n or λ are good (e.g., n = 10, λ = 0.9)
Eligibility Traces
Independence of temporal span
I MC and multi-step returns are not independent of the span of the predictions:
To update values in a long episode, you have to wait
I TD can update immediately, and is independent of the span of the predictions
I Can we get both?
Eligibility traces
I Recall linear function approximation
I The Monte Carlo and TD updates to vw(s) = w⊤x(s) for a state s = St are
∆wt = α(Gt − v(St ))xt (MC)
∆wt = α(Rt+1 + γv(St+1 ) − v(St ))xt (TD)
I MC updates all states in episode k at once:
∆wk+1 = Σ_{t=0}^{T−1} α(Gt − v(St)) xt
where t ∈ {0, . . . , T − 1} enumerates the time steps in this specific episode
I Recall: tabular is a special case, with one-hot vector xt
Eligibility traces
I Accumulating a whole episode of updates:
∆wt ≡ αδt et (one time step)
where et = γλet−1 + xt
I Note: if λ = 0, we get one-step TD
I Intuition: decay the eligibility of past states for the current TD error, then add it
I This is kind of magical: we can update all past states (to account for the new TD error)
with a single update! No need to recompute their values.
I This idea extends to function approximation: xt does not have to be one-hot
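A minimal sketch of online TD(λ) with an accumulating trace and linear values, assuming features[t] holds x(St) for the non-terminal states of one episode, rewards[t] holds Rt+1, and the terminal value is zero:

```python
import numpy as np

def td_lambda_episode(features, rewards, w, alpha, gamma, lam):
    """Online TD(lambda): decay the trace by gamma*lambda, add x_t, apply alpha*delta_t*e_t."""
    e = np.zeros_like(w)                              # eligibility trace e_t
    for t in range(len(rewards)):
        x = features[t]
        done = (t == len(rewards) - 1)                # next state is terminal
        v_next = 0.0 if done else w @ features[t + 1]
        delta = rewards[t] + gamma * v_next - w @ x   # TD error delta_t
        e = gamma * lam * e + x                       # e_t = gamma*lambda*e_{t-1} + x_t
        w = w + alpha * delta * e                     # updates all eligible past states at once
    return w
```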
Eligibility traces
Eligibility traces
We can rewrite the MC error as a sum of TD errors:
Gt − v(St) = Rt+1 + γGt+1 − v(St)
           = Rt+1 + γv(St+1) − v(St) + γ(Gt+1 − v(St+1))
           = δt + γ(Gt+1 − v(St+1))
           = δt + γδt+1 + γ²(Gt+2 − v(St+2))
           = ...
           = Σ_{k=t}^{T} γ^{k−t} δk        (used on the next slide)
Eligibility traces
I Now consider accumulating a whole episode (from time t = 0 to T ) of updates:
∆wk = Σ_{t=0}^{T−1} α(Gt − v(St)) xt
    = Σ_{t=0}^{T−1} α (Σ_{k=t}^{T−1} γ^{k−t} δk) xt     (using the result from the previous slide)
    = Σ_{k=0}^{T−1} Σ_{t=0}^{k} α γ^{k−t} δk xt         (using Σ_{i=0}^{m} Σ_{j=i}^{m} z_ij = Σ_{j=0}^{m} Σ_{i=0}^{j} z_ij)
    = Σ_{k=0}^{T−1} αδk Σ_{t=0}^{k} γ^{k−t} xt          (the inner sum is ≡ ek)
    = Σ_{k=0}^{T−1} αδk ek = Σ_{t=0}^{T−1} αδt et .     (renaming k → t)
Eligibility traces
Accumulating a whole episode of updates:
∆wk = Σ_{t=0}^{T−1} αδt et     where   et = Σ_{j=0}^{t} γ^{t−j} xj
                                          = Σ_{j=0}^{t−1} γ^{t−j} xj + xt
                                          = γ (Σ_{j=0}^{t−1} γ^{t−1−j} xj) + xt
                                          = γet−1 + xt .
The vector et is called an eligibility trace
Every step, it decays (according to γ ) and then the current feature xt is added
Eligibility traces
I Accumulating a whole episode of updates:
∆wt ≡ αδt et (one time step)
∆wk = Σ_{t=0}^{T−1} ∆wt     (whole episode)
where et = γet−1 + xt .
(And then apply ∆w at the end of the episode)
I Intuition: the same TD error shows up in multiple MC errors—grouping them allows
applying it to all past states in one update
Eligibility Traces: Intuition
Eligibility traces
Consider a batch update on an episode with four steps: t ∈ {0, 1, 2, 3}
∆v = δ0 e0 + δ1 e1 + δ2 e2 + δ3 e3; each row below is one MC update (Gt − v(St))xt, and the column for time t sums to δt et:
                    δ0 e0      δ1 e1      δ2 e2       δ3 e3
(G0 − v(S0))x0      δ0 x0      γδ1 x0     γ²δ2 x0     γ³δ3 x0
(G1 − v(S1))x1                 δ1 x1      γδ2 x1      γ²δ3 x1
(G2 − v(S2))x2                            δ2 x2       γδ3 x2
(G3 − v(S3))x3                                        δ3 x3
Mixed Multi-Step Returns
and Eligibility Traces
Mixing multi-step returns & traces
I Reminder: mixed multi-step return
Gt^λ = Rt+1 + γ((1 − λ)v(St+1) + λGt+1^λ)
I The associated error and trace update are
Gt^λ − v(St) = Σ_{k=0}^{T−t} (γλ)^k δ_{t+k}     (same as before, but with γλ instead of γ)
=⇒ et = γλet−1 + xt and ∆wt = αδt et .
I This is called an accumulating trace with decay γλ
I It is exact for batched episodic updates (‘offline’), similar traces exist for online updating
End of Lecture
Next lecture:
Model-free control