MDP Cheatsheet Reference

Author: John Schulman
(F) = facts that are a bit more technical

1 Markov Decision Process

Infinite-horizon, discounted setting:
• S: state space
• A: action space
• P(s, a, s'): transition kernel
• R(s, a, s'): reward function
• γ ∈ [0, 1): discount
• µ: initial state distribution (optional)
2 Backup Operators

At the core of policy and value iteration are the "Bellman backup operators" T, T^π, which are mappings R^|S| → R^|S| that update the value function:

  T V(s)   := max_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V(s')]
  T^π V(s) := Σ_{s'} P(s, π(s), s') [R(s, π(s), s') + γ V(s')]

Note that T V(s) means that we are evaluating T V (a vector, in the finite case) at state s, i.e., it would more properly be written (T V)(s). The same convention is used when considering T^n V(s) and so forth.
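The following is a minimal NumPy sketch of both backup operators. It is an illustration added to this writeup, not code from the original cheatsheet, and it assumes a tabular encoding of P[s, a, s'] and R[s, a, s'] as (S, A, S)-shaped arrays with a deterministic policy given as an integer array pi[s].

```python
import numpy as np

def backup_T(P, R, V, gamma):
    """Bellman optimality backup: (T V)(s) = max_a sum_s' P[s,a,s'] (R[s,a,s'] + gamma V[s'])."""
    Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])  # Q[s, a]
    return Q.max(axis=1)

def backup_T_pi(P, R, V, pi, gamma):
    """Policy backup: (T^pi V)(s) = sum_s' P[s,pi(s),s'] (R[s,pi(s),s'] + gamma V[s'])."""
    Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
    return Q[np.arange(P.shape[0]), pi]
```

Iterating backup_T to its fixed point is value iteration (Algorithm 1 below), and iterating backup_T_pi converges to V^π, matching the fixed-point properties listed next.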

Properties of T

• Unique fixed point is V*, defined by V*(s) = E[R_0 + γ R_1 + ... | s_0 = s], where actions are chosen according to an optimal policy: a_t = π*(s_t).
• The nth iterate can be interpreted as the optimal expected return in the n-step finite-horizon problem: T^n V(s) = max_{π_0, π_1, ..., π_{n−1}} E[R_0 + γ R_1 + ··· + γ^{n−1} R_{n−1} + γ^n V(s_n) | s_0 = s], where a_t = π_t(s_t) ∀t, we use the shorthand R_t := R(s_t, a_t, s_{t+1}), and the expectation is taken with respect to all states s_t for t > 0.
• (F) T is a contraction under the max norm |·|_∞ (a one-line argument is sketched after these lists).
• T is monotonic, so V ≤ T V ⇒ V ≤ T V ≤ T^2 V ≤ ··· ≤ V*, and V ≥ T V ⇒ V ≥ T V ≥ T^2 V ≥ ··· ≥ V*.

Properties of T^π

• Unique fixed point is V^π, defined by V^π(s) = E[R_0 + γ R_1 + ... | s_0 = s], where actions are chosen according to the policy: a_t = π(s_t).
• The nth iterate can be interpreted as the expected return of an n-step rollout under π, with terminal cost V: (T^π)^n V(s) = E[R_0 + γ R_1 + ··· + γ^{n−1} R_{n−1} + γ^n V(s_n) | s_0 = s], where a_t = π(s_t) ∀t.
• (F) T^π is a contraction under the weighted ℓ_2 norm ‖·‖_ρ, where ρ is the steady-state distribution of the Markov chain induced by executing policy π. T^π is also a contraction under the max norm |·|_∞.
• T^π is monotonic.
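Why the max-norm contraction holds (a brief addition, not part of the original cheatsheet): using |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|, for any value functions V, W and any state s,

  |T V(s) − T W(s)| ≤ max_a Σ_{s'} P(s, a, s') γ |V(s') − W(s')| ≤ γ |V − W|_∞,

so |T V − T W|_∞ ≤ γ |V − W|_∞. The same steps with max_a removed give the corresponding bound for T^π.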
3 Algorithms

Algorithm 1 Value Iteration
  Initialize V^(0).
  for n = 1, 2, ... do
    for s ∈ S do
      V^(n)(s) = max_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V^(n−1)(s')]
    end for
    ▷ The above loop over s could be written as V^(n) = T V^(n−1)
  end for

Properties of value iteration

• If initialized with V^(0) = 0 and R(s, a, s') ≥ 0, values monotonically increase, i.e., V^(0)(s) ≤ V^(1)(s) ≤ ... ∀s.
• The error V^(n) − V* and the maximum suboptimality of the resulting policy are bounded by γ^n |R|_∞ / (1 − γ).
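Here is a minimal runnable sketch of Algorithm 1. It is an illustration under the same assumed tabular P and R arrays as above (not the author's code), stopping when the max-norm change falls below a tolerance.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration: V^(n) = T V^(n-1); returns V (approximately V*) and a greedy policy."""
    S = P.shape[0]
    V = np.zeros(S)                                        # V^(0) = 0
    for _ in range(max_iter):
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                              # apply the Bellman backup T
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    # Greedy policy G V with respect to the final value estimate
    Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
    return V, Q.argmax(axis=1)
```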
Algorithm 2 Policy Iteration
  Initialize π^(0).
  for n = 1, 2, ... do
    V^(n−1) = Solve[V = T^{π^(n−1)} V]
    for s ∈ S do
      π^(n)(s) = argmax_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V^(n−1)(s')]
               = argmax_a Q^{π^(n−1)}(s, a)
    end for
  end for

The policy update step could be written in "operator form" as π^(n) = G V^(n−1), where G V denotes the greedy policy for value function V, i.e., G V(s) = argmax_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V(s')], ∀s ∈ S.

Properties of policy iteration

• Computes the optimal policy and value function in a finite number of iterations.
• (F) Performance of the policy monotonically increases. In fact, at the nth iteration, the policy improves by (1 − γ P^{π^(n)})^{−1} (T V^(n−1) − V^(n−1)), where P^π is the matrix defined by P^π(s, s') = P(s, π(s), s').
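A compact sketch of Algorithm 2 follows (again an added illustration with the same assumed tabular arrays). The policy-evaluation step Solve[V = T^π V] is the linear system (I − γ P^π) V = r^π, solved exactly here.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iter=1_000):
    """Policy iteration: exact policy evaluation, then a greedy policy update."""
    S = P.shape[0]
    pi = np.zeros(S, dtype=int)                        # arbitrary initial policy pi^(0)
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma P_pi) V = r_pi
        P_pi = P[np.arange(S), pi]                     # rows P(s, pi(s), .), shape (S, S)
        r_pi = np.einsum('st,st->s', P_pi, R[np.arange(S), pi])
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy G V
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):                 # converged (finitely many policies)
            break
        pi = pi_new
    return V, pi
```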
Algorithm 3 Modified Policy Iteration
  Initialize V^(0).
  for n = 1, 2, ... do
    π^(n) = G V^(n−1)
    V^(n) = (T^{π^(n)})^k V^(n−1), for integer k ≥ 1
  end for

Properties of modified policy iteration

• Computes the optimal policy in a finite number of iterations, and the value function converges to the optimal one: V^(n) → V*.
• k = 1 gives value iteration, and the k = ∞ limit gives policy iteration (except at the first iteration).
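A sketch of Algorithm 3 under the same assumptions as the previous snippets: each outer step takes the greedy policy and then applies the T^π backup k times, so k = 1 recovers value iteration and large k approaches policy iteration.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, k=5, n_iter=500):
    """Modified policy iteration: greedy policy update, then k applications of T^pi."""
    S = P.shape[0]
    V = np.zeros(S)                                          # V^(0)
    for _ in range(n_iter):
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        pi = Q.argmax(axis=1)                                # pi^(n) = G V^(n-1)
        for _ in range(k):                                   # V^(n) = (T^pi)^k V^(n-1)
            V = np.einsum('st,st->s', P[np.arange(S), pi],
                          R[np.arange(S), pi] + gamma * V[None, :])
    return V, pi
```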
4 Value Functions and Bellman Equations

The term "value function" in general refers to a function that returns the expected sum of future rewards. However, there are several different types of value function. A "state-value function" V(s) is a function of state, whereas a "state-action-value function" Q(s, a) is a function of a state-action pair.

Below, we list the most common value functions with a pair of equations: the first one involving an infinite sum of rewards, the second one providing a self-consistency equation (a "Bellman equation") with a unique solution. All of the expectations are taken with respect to all states s_t for t > 0.

  V^π(s) = E[R_0 + γ R_1 + ... | s_0 = s], where a_t = π(s_t) ∀t
  V^π(s) = Σ_{s'} P(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]

  Q^π(s, a) = E[R_0 + γ R_1 + ... | s_0 = s, a_0 = a], where a_t = π(s_t) ∀t
  Q^π(s, a) = Σ_{s'} P(s, a, s') [R(s, a, s') + γ Q^π(s', π(s'))]

  V*(s) = E[R_0 + γ R_1 + ... | s_0 = s], where a_t = π*(s_t) ∀t
  V*(s) = max_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V*(s')]

  Q*(s, a) = E[R_0 + γ R_1 + ... | s_0 = s, a_0 = a], where a_t = π*(s_t) ∀t
  Q*(s, a) = Σ_{s'} P(s, a, s') [R(s, a, s') + γ max_{a'} Q*(s', a')]
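For V^π in particular, it may help to see why the Bellman equation has a unique solution (this remark is an addition): writing (P^π)_{ss'} := P(s, π(s), s') and (r^π)_s := Σ_{s'} P(s, π(s), s') R(s, π(s), s'), the second equation reads V^π = r^π + γ P^π V^π. For γ < 1 the matrix I − γ P^π is invertible, so V^π = (I − γ P^π)^{−1} r^π; this is exactly the Solve step of Algorithm 2, and the same matrix appears in the policy-improvement bound above.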
5 Some Definitions

Contraction: a function f is a contraction under norm |·| with modulus γ < 1 iff |f(x) − f(y)| ≤ γ |x − y|. By the Banach fixed point theorem, a contraction mapping on R^d has a unique fixed point.

Stationary Distribution: Given a transition matrix P_{ss'}, the stationary distribution ρ is the left eigenvector satisfying ρ_{s'} = Σ_s ρ_s P_{ss'}. If the transition matrix satisfies appropriate conditions (see Markov chain theory [3]), then ρ = lim_{k→∞} ν P^k for any initial distribution ν. In the context of MDPs, we speak of the transition matrix induced by policy π, defined by P_{ss'} = P(s, π(s), s'), and similarly there is a stationary distribution induced by the policy, ρ^π.
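Assuming the limit above exists, ρ^π can be computed by power iteration, as in this small sketch (an added illustration; P_pi is the S × S matrix with entries P(s, π(s), s')):

```python
import numpy as np

def stationary_distribution(P_pi, tol=1e-10, max_iter=100_000):
    """Power iteration for the stationary distribution: rho = lim_k nu P^k."""
    S = P_pi.shape[0]
    rho = np.full(S, 1.0 / S)              # any initial distribution nu
    for _ in range(max_iter):
        rho_next = rho @ P_pi              # left-multiply: (nu P)_{s'} = sum_s nu_s P[s, s']
        if np.max(np.abs(rho_next - rho)) < tol:
            return rho_next
        rho = rho_next
    return rho
```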
Monotonic: a function f is monotonic if x ≤ y ⟹ f(x) ≤ f(y). This definition can be extended to the case that f : R^d → R^d, in which case the inequalities hold componentwise on the LHS and RHS.

References

[1] D. P. Bertsekas. Dynamic programming and optimal control, vol. 1. Athena Scientific, Belmont, MA, 1995.
[2] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2005.
[3] Wikipedia. Markov chain — Wikipedia, the free encyclopedia, 2015.
