Overview
1. Function Approximation
Approximate Dynamic Programming
Does ADP converge?
2. Policy Improvement
Policy Improvement Theorem
Policy Iteration
Value Iteration
Q-Learning
SARSA
Deep Networks
Recap: Dynamic Programming
Remember the Bellman update?
$$Q(s, a) \leftarrow r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) V(s'), \qquad V(s) \leftarrow \sum_a \pi(a \mid s) Q(s, a)$$
Here, $V$ and $Q$ are tables. How can we represent value functions that have infinitely many states?
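As a quick illustration (my own sketch, not part of the original slides), the tabular Bellman update can be written in a few lines of numpy; the arrays `P`, `R`, and `pi` are hypothetical placeholders for the transition model, reward table, and policy.

```python
import numpy as np

def bellman_update(V, Q, P, R, pi, gamma):
    """One sweep of the tabular Bellman update for policy evaluation.

    P[s, a, s'] : transition probabilities, R[s, a] : rewards,
    pi[s, a]    : policy probabilities,     gamma   : discount factor.
    """
    # Q(s, a) <- r(s, a) + gamma * sum_s' p(s'|s, a) V(s')
    Q_new = R + gamma * P @ V
    # V(s) <- sum_a pi(a|s) Q(s, a)
    V_new = (pi * Q_new).sum(axis=1)
    return V_new, Q_new

# Tiny random example: 3 states, 2 actions, uniform policy
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))   # P[s, a, :] sums to 1
R = rng.standard_normal((3, 2))
pi = np.full((3, 2), 0.5)
V, Q = np.zeros(3), np.zeros((3, 2))
for _ in range(100):
    V, Q = bellman_update(V, Q, P, R, pi, gamma=0.9)
```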
Dynamic Programming with Function Approximation
Two ideas:
values can be represented as parametric functions, i.e., $V_\theta(s)$ (where $\theta$ is a parameter vector)
since we have infinitely many states, we need to operate with samples
Dynamic Programming with Function Approximation
Values can be parametrized, i.e., $V_\theta(s) \approx V^\pi(s)$:
Linear parametrization: $V_\theta(s) = \theta^\top \phi(s)$, with a feature vector $\phi(s)$
Neural networks: $V_\theta(s) = f_\theta(s)$, with a network $f_\theta$ whose weights are $\theta$
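To make the linear case concrete, here is a tiny sketch (my own illustration; the polynomial feature map `phi` is a hypothetical choice):

```python
import numpy as np

def phi(s, n_features=4):
    """Hypothetical feature map: polynomial features of a scalar state."""
    return np.array([s**i for i in range(n_features)], dtype=float)

def V(s, theta):
    """Linear value function V_theta(s) = theta^T phi(s)."""
    return theta @ phi(s)

theta = np.zeros(4)      # parameter vector
print(V(0.5, theta))     # value of the (continuous) state s = 0.5
```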
Dynamic Programming with Function Approximation
Assume you have a dataset of states $s_i$, actions $a_i$, rewards $r_i$, next states $s'_i$ and next actions $a'_i$, where $a_i \sim \pi(\cdot \mid s_i)$ and $a'_i \sim \pi(\cdot \mid s'_i)$.
Approximate dynamic programming consists of iterating the following equations:
$$\theta_{k+1} = \arg\min_\theta \sum_i \big( r_i + \gamma V_{\omega_k}(s'_i) - Q_\theta(s_i, a_i) \big)^2, \qquad \omega_{k+1} = \arg\min_\omega \sum_i \big( Q_{\theta_{k+1}}(s_i, a_i) - V_\omega(s_i) \big)^2$$
Approximate Dynamic Programming for Policy Evaluation
1: Input: policy $\pi$, number of episodes $N$, and parameter vectors $\theta_0$, $\omega_0$
2: Collect samples $\{(s_i, a_i, r_i, s'_i, a'_i)\}_i$ using the policy $\pi$
3: for $k = 0, 1, 2, \dots$ do
4: Minimize $\sum_i \big( r_i + \gamma V_{\omega_k}(s'_i) - Q_\theta(s_i, a_i) \big)^2$ w.r.t. $\theta$ using, e.g., gradient descent
5: Minimize $\sum_i \big( Q_{\theta_{k+1}}(s_i, a_i) - V_\omega(s_i) \big)^2$ w.r.t. $\omega$
6: end for
7: Return the final $\theta$ and $\omega$.
Approximate Dynamic Programming for Policy Evaluation
1: Input: policy $\pi$, number of episodes $N$, and a parameter vector $\theta_0$
2: Collect samples $\{(s_i, a_i, r_i, s'_i, a'_i)\}_i$ using the policy $\pi$
3: for $k = 0, 1, 2, \dots$ do
4: Minimize $\sum_i \big( r_i + \gamma Q_{\theta_k}(s'_i, a'_i) - Q_\theta(s_i, a_i) \big)^2$ w.r.t. $\theta$ using, e.g., gradient descent
5: $\theta_{k+1} \leftarrow \theta$
6: end for
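A minimal sketch of this sampled fixed-point iteration, assuming a linear $Q$-function over a hypothetical one-hot feature map `phi` (closed-form least squares stands in for gradient descent):

```python
import numpy as np

def phi(s, a, n_states=3, n_actions=2):
    """Hypothetical one-hot features for a small discrete MDP."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def fitted_policy_evaluation(data, gamma=0.9, n_iters=50, n_features=6):
    """data: list of tuples (s, a, r, s_next, a_next) collected with the policy pi."""
    theta = np.zeros(n_features)
    Phi = np.array([phi(s, a) for s, a, _, _, _ in data])
    Phi_next = np.array([phi(sn, an) for _, _, _, sn, an in data])
    rewards = np.array([r for _, _, r, _, _ in data])
    for _ in range(n_iters):
        # bootstrapped targets r_i + gamma * Q_{theta_k}(s'_i, a'_i), with theta_k frozen
        targets = rewards + gamma * Phi_next @ theta
        # minimize sum_i (targets_i - phi(s_i, a_i)^T theta)^2 in closed form
        theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return theta
```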
Gradient Descent Approximate Dynamic Programming
Let us compute the gradient of the optimization problem introduced,
$$\nabla_\theta \sum_i \big( r_i + \gamma Q_{\theta_k}(s'_i, a'_i) - Q_\theta(s_i, a_i) \big)^2 = -2 \sum_i \big( r_i + \gamma Q_{\theta_k}(s'_i, a'_i) - Q_\theta(s_i, a_i) \big) \nabla_\theta Q_\theta(s_i, a_i).$$
This gradient allows us to write an online version of dynamic programming, i.e., temporal difference with function approximation.
Gradient Updates
Let $Q_\theta$ be parametric, i.e., $Q_\theta(s, a) = \theta^\top \phi(s, a)$,
where $\phi(s, a)$ is a feature vector, and $\nabla_\theta Q_\theta(s, a) = \phi(s, a)$.
Temporal Difference with Function Approximation
Temporal Difference with Function Approximation for Value Functions
1: Input: policy $\pi$, number of episodes $K$, parameter vector $\theta$
2: for Episodes do
3: Sample first state $s$
4: for Single episode do
5: Sample action $a \sim \pi(\cdot \mid s)$
6: Update the learning rate $\alpha$
7: Apply $a$ on the environment and receive reward $r$ and next state $s'$
8: $\theta \leftarrow \theta + \alpha \big( r + \gamma V_\theta(s') - V_\theta(s) \big) \nabla_\theta V_\theta(s)$
9: $s \leftarrow s'$
10: end for
11: end for
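A compact sketch of this algorithm for a linear value function; the environment interface (`env.reset()`, `env.step(a)`) and the feature map `phi` are hypothetical placeholders:

```python
import numpy as np

def td0_linear(env, policy, phi, n_features, n_episodes=100, gamma=0.99, alpha=0.05):
    """TD(0) with linear function approximation: V_theta(s) = theta^T phi(s)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)                          # sample a ~ pi(.|s)
            s_next, r, done = env.step(a)          # interact with the environment
            v, v_next = theta @ phi(s), theta @ phi(s_next)
            td_error = r + gamma * (0.0 if done else v_next) - v
            theta += alpha * td_error * phi(s)     # nabla_theta V_theta(s) = phi(s)
            s = s_next
    return theta
```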
Temporal Difference with Function Approximation
Temporal Difference with Function Approximation for Action-Value Functions
1: Input: policy $\pi$, number of episodes $K$, parameter vector $\theta$
2: for Episodes do
3: Sample first state $s$
4: Sample action $a \sim \pi(\cdot \mid s)$
5: for Single episode do
6: Update the learning rate $\alpha$
7: Apply $a$ on the environment and receive reward $r$ and next state $s'$
8: Sample $a' \sim \pi(\cdot \mid s')$
9: $\theta \leftarrow \theta + \alpha \big( r + \gamma Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$
10: $s \leftarrow s'$, $a \leftarrow a'$
11: end for
12: end for
Convergence
Approximate dynamic programming and temporal difference with function approximation converge to a biased solution when the parametrization is linear.
When the function approximation is not linear (e.g., neural networks), ADP and TD with function approximation are not guaranteed to converge.
In either case, their estimate is biased. Such bias can be mitigated by introducing many parameters, but many parameters cause high estimation variance, which can only be compensated by using many samples.
State-of-the-art reinforcement learning tends to use deep neural networks with millions (or even billions) of parameters, and a vast number of samples.
Our Objective
Our objective is to find a policy that performs at least as well as any other policy. In mathematical terms:
$$V^{\pi^*}(s) \geq V^{\pi}(s) \quad \text{for all } s \text{ and for all } \pi.$$
We call such a policy optimal. Note: there might be several optimal policies.
Policy Improvement Theorem
Policy Improvement Theorem
If $Q^{\pi}(s, \pi'(s)) \geq V^{\pi}(s)$ for all $s$, then $V^{\pi'}(s) \geq V^{\pi}(s)$ for all $s$.
Note: the policy improvement theorem is more general in textbooks; however, this simplified (and more specific) version still allows us to derive the following corollaries and algorithms.
Greedy Policies: A policy $\pi'$ is said to be greedy (w.r.t. $Q^{\pi}$) if $\pi'(s) \in \arg\max_a Q^{\pi}(s, a)$ for all $s$.
Tabular Policy Iteration
Tabular Policy Iteration
1: Create a table $\pi(s)$ that represents a deterministic policy, and initialize it randomly
2: for $k = 1, \dots, K$ do
3: $Q^{\pi} \leftarrow$ PolicyEvaluation$(\pi)$ (e.g., use DP for policy evaluation as introduced in the previous lecture)
4: for each state $s$ do
5: $\pi(s) \leftarrow \arg\max_a Q^{\pi}(s, a)$
6: end for
7: end for
Question: does this algorithm converge to the optimal policy? Yes (for tabular $\pi$ and $Q$). However, it is computationally expensive because after each policy improvement we need to evaluate the policy.
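A small numpy sketch of tabular policy iteration (my own illustration; `P[s, a, s']` and `R[s, a]` are hypothetical model arrays, and policy evaluation is done by sweeping the Bellman update):

```python
import numpy as np

def policy_evaluation(pi, P, R, gamma, n_sweeps=200):
    """Evaluate a deterministic policy pi[s] by repeated Bellman updates."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_sweeps):
        V = Q[np.arange(n_states), pi]          # V(s) = Q(s, pi(s))
        Q = R + gamma * P @ V                   # Q(s,a) = r(s,a) + gamma * E[V(s')]
    return Q

def policy_iteration(P, R, gamma=0.9, n_iters=20):
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
    for _ in range(n_iters):
        Q = policy_evaluation(pi, P, R, gamma)  # policy evaluation
        pi = Q.argmax(axis=1)                   # greedy policy improvement
    return pi, Q
```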
Policy Improvement Theorem
Corollary 1 If there exists a state $s$ such that $\max_a Q^{\pi}(s, a) > V^{\pi}(s)$, then the greedy policy $\pi'$ w.r.t. $Q^{\pi}$ is a strict improvement, i.e., $V^{\pi'}(s) > V^{\pi}(s)$ for that state.
Corollary 2 (Optimality Bellman Equation) The optimal $Q$-function satisfies the following optimality Bellman equation:
$$Q^*(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q^*(s', a').$$
Can we use the optimality Bellman equation to obtain some guarantees on the convergence of policy iteration, and to derive a more efficient algorithm?
Optimality Bellman Operator
Consider the optimality Bellman operator
$$(T^* Q)(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a').$$
The optimality Bellman operator is contractive, and, thanks to Banach's fixed-point theorem, we can state that
$$\lim_{k \to \infty} (T^*)^k Q = Q^* \quad \text{for any initial } Q.$$
Once we obtain the optimal action-value function $Q^*$, obtaining an optimal policy is trivial:
$$\pi^*(s) \in \arg\max_a Q^*(s, a)$$
(Note: such a policy is deterministic).
On the Contractivity of the Optimality Bellman Operator
Consider the optimality Bellman operator
$$(T^* Q)(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a'),$$
which is a $\gamma$-contraction in the max norm: $\| T^* Q_1 - T^* Q_2 \|_\infty \leq \gamma \| Q_1 - Q_2 \|_\infty$.
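A standard derivation sketch of the contraction property (my reconstruction; it uses the inequality $|\max_x f(x) - \max_x g(x)| \leq \max_x |f(x) - g(x)|$):

\begin{align*}
\big| (T^* Q_1)(s, a) - (T^* Q_2)(s, a) \big|
  &= \gamma \Big| \sum_{s'} p(s' \mid s, a) \big( \max_{a'} Q_1(s', a') - \max_{a'} Q_2(s', a') \big) \Big| \\
  &\leq \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} \big| Q_1(s', a') - Q_2(s', a') \big| \\
  &\leq \gamma \, \| Q_1 - Q_2 \|_\infty .
\end{align*}

Taking the maximum over $(s, a)$ on the left-hand side gives $\| T^* Q_1 - T^* Q_2 \|_\infty \leq \gamma \| Q_1 - Q_2 \|_\infty$.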
Value Iteration
Tabular Value Iteration
1: Create two tables $Q$ and $Q'$ that represent the $Q$-functions, and $\pi$ that represents the policy
2: for $k = 1, \dots, K$ do
3: for each state-action pair $(s, a)$ do
4: $Q'(s, a) \leftarrow r(s, a) + \gamma \sum_{s'} p(s' \mid s, a) \max_{a'} Q(s', a')$
5: end for
6: for each state $s$ do
7: $\pi(s) \leftarrow \arg\max_a Q'(s, a)$
8: end for
9: $Q \leftarrow Q'$
10: end for
11: Return $\pi$
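A corresponding numpy sketch of tabular value iteration (same hypothetical `P` and `R` arrays as in the policy iteration sketch):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, n_iters=200):
    """Tabular value iteration on the optimality Bellman operator."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # Q'(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) max_a' Q(s',a')
        Q = R + gamma * P @ Q.max(axis=1)
    pi = Q.argmax(axis=1)                       # greedy policy w.r.t. Q
    return pi, Q
```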
Value Iteration
Value iteration is basically dynamic programming with a max operator that selects the action at the next state.
Value iteration is model-based (it needs knowledge of the transition model and the reward function). Similarly to what we have seen in the previous lecture, we can derive a model-free, online version, called $Q$-learning.
Online Algorithms
Like in the previous lecture, we want to devise an online algorithm that uses samples to update the $Q$-function and the policy.
Online Algorithm
1: Initialize a $Q$-function
2: for Episodes do
3: Sample first state $s$
4: for Single episode do
5: Sample action $a$ according to a policy (which policy?)
6: Apply $a$ on the environment and receive reward $r$ and next state $s'$
7: Use $(s, a, r, s')$ to update the $Q$-function
8: $s \leftarrow s'$
9: end for
10: end for
Q-learning
Like in the previous lecture, we can use online averaging and bootstrapping, i.e.,
$$Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big),$$
where $r$ is the reward and $s'$ is the next state.
$\epsilon$-Greedy Policies
We want to obtain an "online" algorithm: i.e., an algorithm that improves the policy while interacting with the environment.
A popular strategy is to use $\epsilon$-greedy policies: policies that select the greedy action with probability $1 - \epsilon$, and select a random (possibly sub-optimal) action with probability $\epsilon$.
Such policies select good actions with high probability (usually $\epsilon$ is small), while continuing to explore and avoiding getting stuck in local optima. (For the most curious: see the exploration-exploitation trade-off.)
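An $\epsilon$-greedy selection rule fits in a few lines (an illustrative sketch, assuming a hypothetical Q-table indexed as `Q[s, a]`):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    """Select the greedy action w.r.t. Q[s, :] with prob. 1 - epsilon, a random one otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniformly random action
    return int(Q[s].argmax())                  # exploit: greedy action
```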
Q-learning
Tabular Q-learning
1: Input: discount factor $\gamma$, number of episodes $K$
2: Initialize: a table of state-action visitation counts $n(s, a)$, a table of $Q$-values $Q(s, a)$, both initialized with zeros
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
6: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q(s, a')$, otherwise select $a$ randomly # Greedy Policy
7: $n(s, a) \leftarrow n(s, a) + 1$, $\alpha \leftarrow 1 / n(s, a)$ # Learning Rate Update
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: $Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)$ # Bellman Update
10: $s \leftarrow s'$
11: end for
12: end for
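Putting the pieces together, a minimal sketch of tabular Q-learning; the environment interface (`env.reset()`, `env.step(a)` returning `(s_next, r, done)`) is a hypothetical placeholder:

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))         # Q-table initialized with zeros
    n = np.zeros((n_states, n_actions))         # state-action visitation counts
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
            n[s, a] += 1
            alpha = 1.0 / n[s, a]               # learning rate from visitation counts
            s_next, r, done = env.step(a)
            # Bellman update with the max over next actions (off-policy target)
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next].max()) - Q[s, a])
            s = s_next
    return Q
```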
Simulation Time
Let's simulate $Q$-learning. We use the "investor MDP" as an example (next slide). There are three states: {rich, well-off, poor}.
There are two actions: {0: no-invest, 1: invest}.
We assume a constant learning rate $\alpha$, discount factor $\gamma$, and $\epsilon$-greedy exploration. Episodes are truncated after 5 steps.
I need 5 students:
One student executes the MDP (simulate.py)
One student executes the epsilon-greedy policy (epsilon_greedy.py)
One student computes the TD target $r + \gamma \max_{a'} Q(s', a')$
One student computes the entire update
One student keeps track of the Q-table on the blackboard.
Tabular SARSA
1: Input: discount factor $\gamma$, number of episodes $K$
2: Initialize: a table of state-action visitation counts $n(s, a)$, a table of $Q$-values $Q(s, a)$, both initialized with zeros.
3: for Episodes do
4: Sample first state $s$.
5: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q(s, a')$, otherwise select $a$ randomly.
6: for Single episode do
7: $n(s, a) \leftarrow n(s, a) + 1$, $\alpha \leftarrow 1 / n(s, a)$ # Learning Rate Update
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: With probability $1 - \epsilon$ select $a' \in \arg\max_{a''} Q(s', a'')$, otherwise select a random action $a'$. # Greedy Policy
10: $Q(s, a) \leftarrow Q(s, a) + \alpha \big( r + \gamma Q(s', a') - Q(s, a) \big)$ # Bellman Update
11: $s \leftarrow s'$, $a \leftarrow a'$
12: end for
13: end for
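For comparison with the Q-learning sketch above, a SARSA sketch under the same hypothetical environment interface; note that the bootstrap uses the action actually chosen by the $\epsilon$-greedy policy:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=500, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    eps_greedy = lambda Q, s: rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
    Q = np.zeros((n_states, n_actions))
    n = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = eps_greedy(Q, s)
        while not done:
            n[s, a] += 1
            alpha = 1.0 / n[s, a]
            s_next, r, done = env.step(a)
            a_next = eps_greedy(Q, s_next)      # on-policy: next action from the same policy
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s_next, a_next]) - Q[s, a])
            s, a = s_next, a_next
    return Q
```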
Q-learning
Q-learning with Function Approximation
1: Input: discount factor $\gamma$, number of episodes $K$
2: Initialize: a parameter vector $\theta$ of the $Q$-function $Q_\theta$ (e.g., with zeros)
3: for Episodes do
4: Sample first state $s$
5: for Single episode do
6: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q_\theta(s, a')$, otherwise select a random action $a$.
7: Update the learning rate $\alpha$.
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: $\theta \leftarrow \theta + \alpha \big( r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$
10: $s \leftarrow s'$
11: end for
12: end for
SARSA with Function Approximation
1: Input: discount factor $\gamma$, number of episodes $K$
2: Initialize: a parameter vector $\theta$ of the $Q$-function $Q_\theta$ (e.g., with zeros).
3: for Episodes do
4: Sample first state $s$.
5: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q_\theta(s, a')$, otherwise select $a$ randomly.
6: for Single episode do
7: Update the learning rate $\alpha$
8: Apply $a$ on the environment and receive reward $r$ and next state $s'$
9: With probability $1 - \epsilon$ select $a' \in \arg\max_{a''} Q_\theta(s', a'')$, otherwise select $a'$ randomly.
10: $\theta \leftarrow \theta + \alpha \big( r + \gamma Q_\theta(s', a') - Q_\theta(s, a) \big) \nabla_\theta Q_\theta(s, a)$
11: $s \leftarrow s'$, $a \leftarrow a'$
12: end for
13: end for
SARSA and Q-Learning
SARSA can be seen as an online version of policy iteration: the Bellman update only evaluates the current policy, and the current policy is ($\epsilon$-)greedy w.r.t. the $Q$-function.
Q-learning can be seen as an online version of value iteration (the Bellman update evaluates the greedy policy).
For this reason, SARSA is an on-policy algorithm (i.e., it evaluates the current policy), while $Q$-learning is off-policy, since it evaluates the greedy policy while using an $\epsilon$-greedy policy on the environment.
Deep Q-Network
Q-learning with function approximation tends to be a bit unstable.
DQN aims to mitigate these instabilities by 1) introducing a target $Q$-function, and 2) introducing randomized mini-batch updates (replay buffer).
Target $Q$-functions. To stabilize learning, it is useful to avoid bootstrapping on the parameters that are currently being updated. The idea of DQN is to keep two separate $Q$-functions, as in DP. This can be done by having two separate sets of parameters $\theta$ and $\theta^-$: the online network $Q_\theta$ and the target network $Q_{\theta^-}$.
The target parameters are updated once in a while ($\theta^- \leftarrow \theta$).
Replay buffer. In classic $Q$-learning, samples are very correlated, as they are obtained by running the MDP. Using non-i.i.d. samples with function approximation is problematic. The idea of the replay buffer is to store the last $N$ samples, and to sample each time a minibatch of (approximately) i.i.d. samples from it.
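A minimal sketch of these two ingredients, using a hypothetical parametric Q-function `q(theta, s)` that returns one value per action and its gradient `grad_q(theta, s, a)`:

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores the last `capacity` transitions and samples random minibatches."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    def append(self, transition):               # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def dqn_update(theta, theta_target, batch, q, grad_q, alpha=1e-3, gamma=0.99):
    """One semi-gradient step on a minibatch; bootstraps from the frozen target parameters."""
    for s, a, r, s_next, done in batch:
        target = r + gamma * (0.0 if done else q(theta_target, s_next).max())
        td_error = target - q(theta, s)[a]
        theta = theta + alpha * td_error * grad_q(theta, s, a)
    return theta
```

Here `theta_target` is meant to be copied from `theta` only every few episodes, matching the target update step in the DQN pseudocode below.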
Deep Q-Network
DQN
1: for Episodes do
2: Sample first state $s$
3: for Single episode do
4: With probability $1 - \epsilon$ select $a \in \arg\max_{a'} Q_\theta(s, a')$, otherwise select $a$ randomly.
5: Update the learning rate $\alpha$.
6: Apply $a$ on the environment and receive reward $r$ and next state $s'$
7: Append $(s, a, r, s')$ to the replay buffer $D$.
8: Sample a minibatch $B$ of transitions $(s_i, a_i, r_i, s'_i)$ from $D$
9: $\theta \leftarrow \theta + \alpha \sum_{i \in B} \big( r_i + \gamma \max_{a'} Q_{\theta^-}(s'_i, a') - Q_\theta(s_i, a_i) \big) \nabla_\theta Q_\theta(s_i, a_i)$
10: $s \leftarrow s'$
11: end for
12: Every $C$ episodes, $\theta^- \leftarrow \theta$ # Target Update
13: end for