CMPSC 448: Machine Learning
Lecture 18. Dynamic Programming for Markov Decision Processes
Rui Zhang
Fall 2024
Outline of RL
● Introduction to Reinforcement learning
● Multi-armed Bandits
● Markov Decision Processes (MDP)
○ Dynamic Programming when we know the world
● Learning in MDP: When we don't know the world
○ Monte Carlo Methods
○ Temporal-Difference Learning (TD): SARSA and Q-Learning
Note: All of these lectures use tabular methods; we will only briefly discuss the
motivation for function approximation methods (e.g., DQN, policy gradient, deep
reinforcement learning)
Two Problems in MDP
Input: a perfect model of the environment as a finite MDP
Two problems:
1. evaluation (prediction): given a policy $\pi$, what is its value function $v_\pi$?
2. control: find the optimal policy $\pi_*$ or the optimal value functions, i.e., $v_*$ or $q_*$
In fact, in order to solve problem 2, we must first know how to solve problem 1.
Solution 1: Write out Bellman Equations and Solve Them
Solve systems of equations
● Write Bellman Equations or Bellman Optimality Equations for all states and state-action pairs
● Solve systems of linear equations for Evaluation (i.e., compute $v_\pi$ and $q_\pi$)
● Solve systems of nonlinear equations for Control (i.e., compute $v_*$ and $q_*$)
We discussed this in our previous lecture.
Solution 2: Dynamic Programming (DP) for MDP
Idea: Use Dynamic Programming on Bellman equations for value functions to
organize and structure the search.
Dynamic Programming, in the context of MDP/RL, refers to a collection of algorithms that
compute optimal policies given a perfect model of the environment as a Markov
Decision Process (MDP). Note that we know all the information about the MDP,
including states, rewards, transition probabilities, etc.
This is the focus of this lecture.
Outline of DP for MDP
We introduce two DP methods to find an optimal policy for a given MDP:
Policy Iteration
● Policy Evaluation
● Policy Improvement
Value Iteration
● One-sweep Policy Evaluation + One-step Policy Improvement
Both methods rely on the Bellman condition for the optimality of a policy (from the
previous lecture): the value of a state under an optimal policy must equal the
expected return for the best action from that state.
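In symbols, with the standard dynamics notation $p(s', r \mid s, a)$, this condition is the Bellman optimality equation:
$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_*(s')\,\bigr]$$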
Policy Iteration
Policy Evaluation: Estimate $v_\pi$ (iterative policy evaluation)
Policy Improvement: Generate an improved policy $\pi' \geq \pi$ (greedy policy improvement)
Policy Evaluation
Policy Evaluation: for a given arbitrary policy $\pi$, compute the state-value function $v_\pi$
Solution 1: Solving A System of Linear Equations
Recall: state-value function for policy $\pi$:
$$v_\pi(s) = \mathbb{E}_\pi\bigl[\,G_t \mid S_t = s\,\bigr] = \mathbb{E}_\pi\Bigl[\,\textstyle\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\,\Bigr]$$
Recall: Bellman equation for $v_\pi$:
$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_\pi(s')\,\bigr]$$
This is a system of $|\mathcal{S}|$ simultaneous linear equations in $|\mathcal{S}|$ unknowns, one equation per state.
Note: the environment dynamics $p(s', r \mid s, a)$ are completely known
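For a small MDP this system can be solved directly. Below is a minimal NumPy sketch; the array layout (per-action transition matrices `P[a]`, expected-reward vectors `R[a]`, and a stochastic policy matrix `pi`) is an illustrative assumption, not the lecture's notation.

```python
import numpy as np

def solve_v_pi(P, R, pi, gamma):
    """Solve the Bellman equations for v_pi exactly as one linear system.

    Assumed (illustrative) tabular model:
      P[a][s, s'] -- transition probability to s' when taking action a in s
      R[a][s]     -- expected immediate reward for taking action a in s
      pi[s, a]    -- probability the policy takes action a in state s
    """
    n_states, n_actions = pi.shape
    # Policy-averaged dynamics: P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a),
    #                           r_pi[s]     = sum_a pi(a|s) R(s,a)
    P_pi = sum(pi[:, a][:, None] * P[a] for a in range(n_actions))
    r_pi = sum(pi[:, a] * R[a] for a in range(n_actions))
    # Bellman equation in matrix form: v = r_pi + gamma * P_pi @ v
    # => (I - gamma * P_pi) v = r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```

For undiscounted episodic tasks ($\gamma = 1$), the system should be restricted to the nonterminal states so that the matrix $I - \gamma P_\pi$ is invertible.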
Solution 2: Iterative Policy Evaluation using Dynamic Programming
Start with an arbitrary initial guess $v_0(s)$ for all states, then iteratively update it using
the Bellman equation as an update rule.
Iterative Policy Evaluation by Bellman Expectation Backup Operator
A sweep consists of applying a backup operation
to each state.
Bellman Expectation Backup Operator
Recall Bellman Equation for
From this, let's define Bellman Expectation Backup Operator on
12
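A minimal sketch of iterative policy evaluation built on this backup operator, using the same assumed tabular-model arrays as in the linear-system sketch above:

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma, theta=1e-8):
    """Repeatedly sweep the Bellman expectation backup over all states
    until no value changes by more than theta."""
    n_states, n_actions = pi.shape
    v = np.zeros(n_states)                       # arbitrary initial guess v_0
    while True:
        # One sweep: v_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        v_new = sum(pi[:, a] * (R[a] + gamma * P[a] @ v) for a in range(n_actions))
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```

This version computes each sweep from the previous estimate; the in-place variant that overwrites values during the sweep also converges (usually faster).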
Example: a small GridWorld
An undiscounted ($\gamma = 1$) episodic MDP with:
Actions = {up, down, right, left}
Nonterminal states: {1, 2, . . ., 14}
One terminal state (shown twice as shaded squares)
Actions that would take the agent off the grid leave the state unchanged
Reward is –1 until the terminal state is reached
Iterative Policy Evaluation for the small GridWorld
The final estimate is in fact $v_\pi$, which in this case (evaluating the equiprobable random policy) gives, for each state, the
negative of the expected number of steps from that state until termination
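A self-contained sketch of this example: it builds the 4x4 GridWorld and runs iterative policy evaluation with $\gamma = 1$. The state numbering (cells 0-15 with 0 and 15 as the terminal corners) and the policy being evaluated (the equiprobable random policy) are my reading of the figure, so treat them as assumptions.

```python
import numpy as np

# 4x4 GridWorld: cells 0..15; 0 and 15 are the terminal corners, 1..14 are
# the nonterminal states. Every transition gives reward -1, and gamma = 1.
N = 16
TERMINAL = {0, 15}
MOVES = {"up": -4, "down": +4, "left": -1, "right": +1}

def step(s, a):
    """Deterministic dynamics: moves that would leave the grid leave the state unchanged."""
    row, col = divmod(s, 4)
    if (a == "up" and row == 0) or (a == "down" and row == 3) \
            or (a == "left" and col == 0) or (a == "right" and col == 3):
        return s, -1.0
    return s + MOVES[a], -1.0

def evaluate_random_policy(theta=1e-6):
    """Iterative policy evaluation for the equiprobable random policy."""
    v = np.zeros(N)
    while True:
        delta = 0.0
        for s in range(N):
            if s in TERMINAL:
                continue                           # terminal value stays 0
            # Bellman expectation backup: average over the 4 equally likely actions
            new_v = sum(0.25 * (r + v[s2]) for s2, r in (step(s, a) for a in MOVES))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v                           # in-place (sweep) update
        if delta < theta:
            return v

print(evaluate_random_policy().reshape(4, 4).round(1))
```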
Iterative Policy Evaluation
Policy Improvement
Suppose we have computed $v_\pi$ for a deterministic policy $\pi$.
For a given state $s$, $v_\pi(s)$ tells us how good it is to follow $\pi$.
For a given state $s$, would it be better to take some action $a \neq \pi(s)$?
Let's take action $a$ in $s$ and thereafter follow the policy $\pi$, and see what happens to the agent's return; this is just the action value
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_\pi(s')\,\bigr]$$
Policy Improvement Theorem
Let $\pi$ and $\pi'$ be deterministic policies with $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for all states $s$. Then $\pi'$ is at least as good as $\pi$, i.e., $v_{\pi'}(s) \geq v_\pi(s)$ for all states $s$.
The theorem can be easily generalized to stochastic policies (where actions are selected
with different probabilities in each state, which is more realistic)
Example of Greedification
Policy Improvement Theorem - Proof
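The proof is presumably the standard chain of inequalities (as in Sutton and Barto): start from the assumption $q_\pi(s, \pi'(s)) \geq v_\pi(s)$ for all $s$, then repeatedly expand $q_\pi$ one step and re-apply the assumption:
$$\begin{aligned}
v_\pi(s) &\leq q_\pi(s, \pi'(s)) = \mathbb{E}\bigl[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\bigr] \\
&\leq \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\bigr] \\
&= \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s\bigr] \\
&\leq \cdots \leq \mathbb{E}_{\pi'}\bigl[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\bigr] = v_{\pi'}(s).
\end{aligned}$$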
Policy improvement with greedification
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $v_\pi$:
$$\pi'(s) = \arg\max_a q_\pi(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_\pi(s')\,\bigr]$$
Note that $q_\pi(s, \pi'(s)) = \max_a q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)$; then, from the policy
improvement theorem, we have $v_{\pi'}(s) \geq v_\pi(s)$ for all $s$.
What if the policy is unchanged by greedification, i.e., $\pi' = \pi$? Then $v_\pi(s) = \max_a q_\pi(s, a)$ for all $s$, so the policy satisfies the Bellman
Optimality Equation, and it must be an optimal policy!
Policy Iteration: Iterate between Evaluation and Improvement
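Putting evaluation and improvement together, here is a minimal sketch of policy iteration for a tabular MDP, again with the assumed `P[a]`/`R[a]` model arrays from the earlier sketches; a deterministic policy is stored as an array of action indices.

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-8):
    """Alternate (1) policy evaluation and (2) greedy policy improvement
    until the policy no longer changes (then it is optimal)."""
    n_actions, n_states = len(P), P[0].shape[0]
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
    while True:
        # 1) Policy evaluation: sweep Bellman expectation backups to convergence
        #    (assumes gamma < 1, or an episodic task where evaluation converges)
        v = np.zeros(n_states)
        while True:
            v_new = np.array([R[policy[s]][s] + gamma * P[policy[s]][s] @ v
                              for s in range(n_states)])
            if np.max(np.abs(v_new - v)) < theta:
                break
            v = v_new
        # 2) Policy improvement: greedify with respect to v_pi
        q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])   # shape (A, S)
        new_policy = np.argmax(q, axis=0)
        if np.array_equal(new_policy, policy):      # unchanged by greedification => optimal
            return policy, v
        policy = new_policy
```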
Policy Iteration for the Small GridWorld
(Same undiscounted GridWorld MDP as before: actions {up, down, right, left}, nonterminal states {1, 2, ..., 14}, one terminal state, and reward –1 per step until the terminal state is reached.)
Policy Improvement in the Middle
Do we need to run policy evaluation until convergence before greedification, or could it be truncated somehow?
From Policy Iteration to Value Iteration
Recall Policy Iteration alternates between the following two steps:
1. Policy Evaluation: Multiple Sweeps of Bellman Expectation Backup Operation until Convergence
2. Policy Improvement: One step of greedification
From Policy Iteration to Value Iteration
But we don't need to run policy evaluation until convergence.
Instead, in Value Iteration:
1. Just one sweep of the Bellman Expectation Backup Operation
2. One step of greedification
An example MDP
The dynamics of the MDP are given as a transition diagram.
An example MDP
Our goal is to learn the optimal policy such that, if we follow this policy, we maximize the cumulative reward!
We can find the optimal policy in many ways. Let's use value iteration.
value iteration: initialization
value iteration: one sweep evaluation (one backup applied to each state of the example MDP, worked step by step)
value iteration: one sweep greedification (the greedy action computed for each state from the one-sweep values, worked step by step)
Value Iteration: Combine two steps in a single update
Let's interleave the evaluation and greedification: ONE sweep of evaluation is followed by ONE step of greedification.
Combining the two gives a single value iteration update:
$$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_k(s')\,\bigr]$$
We call this the Bellman Optimality Backup Operator.
In this way, we don't need to explicitly maintain a policy.
Bellman Optimality Backup Operator
Bellman Optimality Equation for $v_*$:
$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v_*(s')\,\bigr]$$
Bellman Optimality Backup Operator (written here as $T^*$) on a value-function estimate $v$:
$$(T^* v)(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[\,r + \gamma\, v(s')\,\bigr]$$
Value iteration repeatedly applies this operator: $v_{k+1} = T^* v_k$.
Value Iteration
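A minimal sketch of value iteration with the Bellman optimality backup, using the same assumed tabular-model arrays as before. No policy is maintained during the sweeps; a greedy policy is extracted only at the end.

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Sweep Bellman optimality backups until the values converge, then
    read off a greedy policy with respect to the (near-)optimal values."""
    n_actions, n_states = len(P), P[0].shape[0]
    v = np.zeros(n_states)                           # arbitrary initialization
    while True:
        # One sweep: v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        q = np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)])   # shape (A, S)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    policy = np.argmax(np.array([R[a] + gamma * P[a] @ v for a in range(n_actions)]), axis=0)
    return policy, v
```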
Convergence of Policy Iteration and Value Iteration
Both Policy Iteration and Value Iteration are guaranteed to converge to the optimal
policy and the optimal value functions!
Summary
Policy Evaluation: Bellman expectation backup operators (without a max)
Policy Improvement: form a greedy policy, if only locally
Policy Iteration: alternate the above two processes
Value Iteration: Bellman optimality backup operators (with a max)
DP is used when we know how the world works. The biggest limitation of DP is that it
requires a probability model (as opposed to a generative or simulation model).
DP uses Full Backups (to be contrasted later with sample backups)
Next Lecture: MC and TD, for when we don't know how the world works