CMPSC 448: Machine Learning
Lecture 18. Dynamic Programming for Markov Decision Processes
Rui Zhang
Fall 2024
Outline of RL
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world
● Learning in MDP: When we don't know the world
○ Monte Carlo Methods
○ Temporal-Difference Learning (TD): SARSA and Q-Learning
Note: All of these lectures use tabular methods; we will only briefly discuss the
motivation for function approximation methods (e.g., DQN, policy gradient, deep
reinforcement learning)
Two Problems in MDP
Input: a perfect model of the environment as a finite MDP
Two problems:
1. evaluation (prediction): given a policy $\pi$, what is its value function $v_\pi$?
2. control: find the optimal policy $\pi_*$ or the optimal value functions, i.e., $v_*$ or $q_*$
In fact, in order to solve problem 2, we must first know how to solve problem 1.
Solution 1: Write out Bellman Equations and Solve Them
Solve systems of equations
● Write Bellman Equations or Bellman Optimality Equations for all states and state-action pairs
● Solve systems of linear equations for Evaluation (i.e., compute $v_\pi$ and $q_\pi$)
● Solve systems of nonlinear equations for Control (i.e., compute $v_*$ and $q_*$)
We discussed this in our previous lecture.
Solution 2: Dynamic Programming (DP) for MDP
Idea: Use Dynamic Programming on Bellman equations for value functions to
organize and structure the search.
Dynamic Programming, in the context of MDP/RL, refers to a collection of algorithms that
compute optimal policies given a perfect model of the environment as a Markov
Decision Process (MDP). Note that we know all the information about the MDP,
including states, rewards, transition probabilities, etc.
This is the focus of this lecture.
Outline of DP for MDP
We introduce two DP methods to find an optimal policy for a given MDP:
Policy Iteration
● Policy Evaluation
● Policy Improvement
Value Iteration
● One-sweep Policy Evaluation + One-step Policy Improvement
Both methods rely on the Bellman condition for the optimality of a policy (from the
previous lecture): the value of a state under an optimal policy must equal the
expected return for the best action from that state.
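In symbols, with the standard dynamics notation $p(s', r \mid s, a)$, this condition is the Bellman optimality equation:
$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_*(s')\,\bigr]$$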
Policy Iteration
Policy Evaluation: Estimate $v_\pi$ (iterative policy evaluation)
Policy Improvement: Generate an improved policy $\pi' \geq \pi$ (greedy policy improvement)
Policy Evaluation
Policy Evaluation: for a given arbitrary policy $\pi$, compute the state-value function $v_\pi$
Solution 1: Solving A System of Linear Equations
Recall: state-value function for policy $\pi$:
$$v_\pi(s) = \mathbb{E}_\pi\bigl[\,G_t \mid S_t = s\,\bigr] = \mathbb{E}_\pi\Bigl[\,\textstyle\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\,\Bigr]$$
Recall: Bellman equation for $v_\pi$:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_\pi(s')\,\bigr]$$
This is a system of $|\mathcal{S}|$ simultaneous linear equations in $|\mathcal{S}|$ unknowns, one equation per state.
Note: the environment dynamics $p(s', r \mid s, a)$ are completely known
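For a small MDP this system can be solved directly. Below is a minimal NumPy sketch; the array layout (per-action transition matrices `P[a]`, expected-reward vectors `R[a]`, and a stochastic policy matrix `pi`) is an illustrative assumption, not the lecture's notation.

```python
import numpy as np

def solve_v_pi(P, R, pi, gamma):
    """Solve the Bellman equations for v_pi exactly as one linear system.

    Assumed (illustrative) tabular model:
      P[a][s, s'] -- transition probability to s' when taking action a in s
      R[a][s]     -- expected immediate reward for taking action a in s
      pi[s, a]    -- probability the policy takes action a in state s
    """
    n_states, n_actions = pi.shape
    # Policy-averaged dynamics: P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a),
    #                           r_pi[s]     = sum_a pi(a|s) R(s,a)
    P_pi = sum(pi[:, a][:, None] * P[a] for a in range(n_actions))
    r_pi = sum(pi[:, a] * R[a] for a in range(n_actions))
    # Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
    # => (I - gamma * P_pi) v = r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

For undiscounted episodic tasks ($\gamma = 1$), the system should be restricted to the nonterminal states so that the matrix $I - \gamma P_\pi$ is invertible.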
Solution 2: Iterative Policy Evaluation using Dynamic Programming
Start with an arbitrary initial guess $v_0(s)$ for all states, then iteratively update it using
the Bellman equation as an update rule.
Iterative Policy Evaluation by Bellman Expectation Backup Operator
A sweep consists of applying a backup operation
to each state.
Bellman Expectation Backup Operator
Recall Bellman Equation for
From this, let's define Bellman Expectation Backup Operator on
12
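A minimal sketch of iterative policy evaluation built on this backup operator, using the same assumed tabular-model arrays as in the linear-system sketch above:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Repeatedly sweep the Bellman expectation backup over all states
    until no value changes by more than theta."""
    n_states, n_actions = pi.shape
    v = np.zeros(n_states)                       # arbitrary initial guess v_0
    while True:
        # One sweep: v_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        v_new = sum(pi[:, a] * (R[a] + gamma * P[a] @ v) for a in range(n_actions))
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```

This version computes each sweep from the previous estimate; the in-place variant that overwrites values during the sweep also converges (usually faster).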
Example: a small GridWorld
An undiscounted ($\gamma = 1$) episodic MDP with:
Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
Iterative Policy Evaluation for the small GridWorld
The final estimate is in fact $v_\pi$, which in this case (evaluating the equiprobable random policy) gives, for each state, the
negative of the expected number of steps from that state until termination
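A self-contained sketch of this example: it builds the 4x4 GridWorld and runs iterative policy evaluation with $\gamma = 1$. The state numbering (cells 0-15 with 0 and 15 as the terminal corners) and the policy being evaluated (the equiprobable random policy) are my reading of the figure, so treat them as assumptions.

```python
import numpy as np

# 4x4 GridWorld: cells 0..15; 0 and 15 are the terminal corners, 1..14 are
# the nonterminal states. Every transition gives reward -1, and gamma = 1.
N = 16
TERMINAL = {0, 15}
MOVES = {"up": -4, "down": +4, "left": -1, "right": +1}

def step(s, a):
    """Deterministic dynamics: moves that would leave the grid leave the state unchanged."""
    row, col = divmod(s, 4)
    if (a == "up" and row == 0) or (a == "down" and row == 3) \
            or (a == "left" and col == 0) or (a == "right" and col == 3):
        return s, -1.0
    return s + MOVES[a], -1.0

def evaluate_random_policy(theta=1e-6):
    """Iterative policy evaluation for the equiprobable random policy."""
    v = np.zeros(N)
    while True:
        delta = 0.0
        for s in range(N):
            if s in TERMINAL:
                continue                           # terminal value stays 0
            # Bellman expectation backup: average over the 4 equally likely actions
            new_v = sum(0.25 * (r + v[s2]) for s2, r in (step(s, a) for a in MOVES))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v                           # in-place (sweep) update
        if delta < theta:
            return v

print(evaluate_random_policy().reshape(4, 4).round(1))
```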
Iterative Policy Evaluation
Policy Improvement
Suppose we have computed $v_\pi$ for a deterministic policy $\pi$.
For a given state $s$, $v_\pi(s)$ tells us how good it is to follow $\pi$.
For a given state $s$, would it be better to take some action $a \neq \pi(s)$?
Let's take action $a$ in $s$ and thereafter follow the policy $\pi$, and see what happens to the agent's return; this is just the action value
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_\pi(s')\,\bigr]$$
Policy Improvement Theorem
Let $\pi$ and $\pi'$ be deterministic policies with $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for all states $s$. Then $\pi'$ is at least as good as $\pi$, i.e., $v_{\pi'}(s) \geq v_\pi(s)$ for all states $s$.
The theorem can be easily generalized to stochastic policies (where actions are selected
with different probabilities in each state, which is more realistic)
Example of Greedification
Policy Improvement Theorem - Proof
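The proof is presumably the standard chain of inequalities (as in Sutton and Barto): start from the assumption $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for all $s$, then repeatedly expand $q_\pi$ one step and re-apply the assumption:
$$\begin{aligned}
v_\pi(s) &\leq q_\pi(s, \pi'(s)) = \mathbb{E}\bigl[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\bigr] \\
&\leq \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s\bigr] \\
&\leq \cdots \leq \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\bigr] = v_{\pi'}(s).
\end{aligned}$$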
Policy improvement with greedification
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$:
$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_\pi(s')\,\bigr]$$
Note that $q_\pi(s, \pi'(s)) = \max_a q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)$; then, from the policy
improvement theorem, we have $v_{\pi'}(s) \geq v_\pi(s)$ for all $s$.
What if the policy is unchanged by greedification, i.e., $\pi' = \pi$? Then $v_\pi(s) = \max_a q_\pi(s, a)$ for all $s$, so the policy satisfies the Bellman
Optimality Equation, and it must be an optimal policy!
Policy Iteration: Iterate between Evaluation and Improvement
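Putting evaluation and improvement together, here is a minimal sketch of policy iteration for a tabular MDP, again with the assumed `P[a]`/`R[a]` model arrays from the earlier sketches; a deterministic policy is stored as an array of action indices.

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-8):
    """Alternate (1) policy evaluation and (2) greedy policy improvement
    until the policy no longer changes (then it is optimal)."""
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
    while True:
        # 1) Policy evaluation: sweep Bellman expectation backups to convergence
        #    (assumes gamma < 1, or an episodic task where evaluation converges)
        v = np.zeros(n_states)
        while True:
            v_new = np.array([R[policy[s]][s] + gamma * P[policy[s]][s] @ v
                              for s in range(n_states)])
            if np.max(np.abs(v_new - v)) < theta:
                break
            v = v_new
        # 2) Policy improvement: greedify with respect to v_pi
        q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])   # shape (A, S)
        new_policy = np.argmax(q, axis=0)
        if np.array_equal(new_policy, policy):      # unchanged by greedification => optimal
            return policy, v
        policy = new_policy
```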
Policy Iteration for the Small GridWorld
(Same undiscounted GridWorld MDP as before: actions {up, down, right, left}, nonterminal states {1, 2, ..., 14}, one terminal state, and reward –1 per step until the terminal state is reached.)
Policy Improvement in the Middle
Do we need to run policy evaluation until convergence before greedification, or could it be truncated somehow?
From Policy Iteration to Value Iteration
Recall Policy Iteration alternates between the following two steps:
1. Policy Evaluation: Multiple Sweeps of Bellman Expectation Backup Operation until Convergence
2. Policy Improvement: One step of greedification
From Policy Iteration to Value Iteration
But we don't need to run policy evaluation until convergence.
Instead, in Value Iteration:
1. Just one sweep of the Bellman Expectation Backup Operation
2. One step of greedification
An example MDP
The dynamics of the MDP are given as a transition diagram.
An example MDP
Our goal is to learn the optimal policy such that, if we follow this policy, we maximize the cumulative reward!
We can find the optimal policy in many ways. Let's use value iteration.
value iteration: initialization
value iteration: one sweep evaluation (one backup applied to each state of the example MDP, worked step by step)
value iteration: one sweep greedification (the greedy action computed for each state from the one-sweep values, worked step by step)
Value Iteration: Combine two steps in a single update
Let's interleave the evaluation and greedification: ONE sweep of evaluation is followed by ONE step of greedification.
Combining the two gives a single value iteration update:
$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_k(s')\,\bigr]$$
We call this the Bellman Optimality Backup Operator.
In this way, we don't need to explicitly maintain a policy.
Bellman Optimality Backup Operator
Bellman Optimality Equation for $v_*$:
$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_*(s')\,\bigr]$$
Bellman Optimality Backup Operator (written here as $T^*$) on a value-function estimate $v$:
$$(T^* v)(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v(s')\,\bigr]$$
Value iteration repeatedly applies this operator: $v_{k+1} = T^* v_k$.
Value Iteration
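A minimal sketch of value iteration with the Bellman optimality backup, using the same assumed tabular-model arrays as before. No policy is maintained during the sweeps; a greedy policy is extracted only at the end.

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Sweep Bellman optimality backups until the values converge, then
    read off a greedy policy with respect to the (near-)optimal values."""
    n_actions, n_states = len(P), P[0].shape[0]
    v = np.zeros(n_states)                           # arbitrary initialization
    while True:
        # One sweep: v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])   # shape (A, S)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    policy = np.argmax(np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)]), axis=0)
    return policy, v
```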
Convergence of Policy Iteration and Value Iteration
Both Policy Iteration and Value Iteration are guaranteed to converge to the optimal
policy and the optimal value functions!
Summary
Policy Evaluation: Bellman expectation backup operators (without a max)
Policy Improvement: form a greedy policy, if only locally
Policy Iteration: alternate the above two processes
Value Iteration: Bellman optimality backup operators (with a max)
DP is used when we know how the world works. The biggest limitation of DP is that it
requires a probability model (as opposed to a generative or simulation model).
DP uses Full Backups (to be contrasted later with sample backups)
Next Lecture: MC and TD, for when we don't know how the world works