MDP Cheatsheet Reference

Author: John Schulman
(F) = facts that are a bit more technical

1 Markov Decision Process

Infinite-horizon, discounted setting:
• S: state space
• A: action space
• P(s, a, s'): transition kernel
• R(s, a, s'): reward function
• γ ∈ [0, 1): discount
• µ: initial state distribution (optional)
2 Backup Operators

At the core of policy and value iteration are the "Bellman backup operators" T, T^π, which are mappings R^|S| → R^|S| that update the value function:

  T V(s)   := max_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V(s')]
  T^π V(s) := Σ_{s'} P(s, π(s), s') [R(s, π(s), s') + γ V(s')]

Note that T V(s) means that we are evaluating T V (a vector, in the finite case) at state s, i.e., it would more properly be written (T V)(s). The same convention is used when considering T^n V(s) and so forth.
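The following is a minimal NumPy sketch of both backup operators. It is an illustration added to this writeup, not code from the original cheatsheet, and it assumes a tabular encoding of P[s, a, s'] and R[s, a, s'] as (S, A, S)-shaped arrays with a deterministic policy given as an integer array pi[s].

```python
import numpy as np

def backup_T(P, R, V, gamma):
    """Bellman optimality backup: (T V)(s) = max_a sum_s' P[s,a,s'] (R[s,a,s'] + gamma V[s'])."""
    Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])  # Q[s, a]
    return Q.max(axis=1)

def backup_T_pi(P, R, V, pi, gamma):
    """Policy backup: (T^pi V)(s) = sum_s' P[s,pi(s),s'] (R[s,pi(s),s'] + gamma V[s'])."""
    Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
    return Q[np.arange(P.shape[0]), pi]
```

Iterating backup_T to its fixed point is value iteration (Algorithm 1 below), and iterating backup_T_pi converges to V^π, matching the fixed-point properties listed next.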

Properties of T

• Unique fixed point is V*, defined by V*(s) = E[R_0 + γ R_1 + ... | s_0 = s], where actions are chosen according to an optimal policy: a_t = π*(s_t).
• The nth iterate can be interpreted as the optimal expected return in the n-step finite-horizon problem: T^n V(s) = max_{π_0, π_1, ..., π_{n−1}} E[R_0 + γ R_1 + ··· + γ^{n−1} R_{n−1} + γ^n V(s_n) | s_0 = s], where a_t = π_t(s_t) ∀t, we use the shorthand R_t := R(s_t, a_t, s_{t+1}), and the expectation is taken with respect to all states s_t for t > 0.
• (F) T is a contraction under the max norm |·|_∞ (a one-line argument is sketched after these lists).
• T is monotonic, so V ≤ T V ⇒ V ≤ T V ≤ T^2 V ≤ ··· ≤ V*, and V ≥ T V ⇒ V ≥ T V ≥ T^2 V ≥ ··· ≥ V*.

Properties of T^π

• Unique fixed point is V^π, defined by V^π(s) = E[R_0 + γ R_1 + ... | s_0 = s], where actions are chosen according to the policy: a_t = π(s_t).
• The nth iterate can be interpreted as the expected return of an n-step rollout under π, with terminal cost V: (T^π)^n V(s) = E[R_0 + γ R_1 + ··· + γ^{n−1} R_{n−1} + γ^n V(s_n) | s_0 = s], where a_t = π(s_t) ∀t.
• (F) T^π is a contraction under the weighted ℓ_2 norm ‖·‖_ρ, where ρ is the steady-state distribution of the Markov chain induced by executing policy π. T^π is also a contraction under the max norm |·|_∞.
• T^π is monotonic.
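Why the max-norm contraction holds (a brief addition, not part of the original cheatsheet): using |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)|, for any value functions V, W and any state s,

  |T V(s) − T W(s)| ≤ max_a Σ_{s'} P(s, a, s') γ |V(s') − W(s')| ≤ γ |V − W|_∞,

so |T V − T W|_∞ ≤ γ |V − W|_∞. The same steps with max_a removed give the corresponding bound for T^π.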
3 Algorithms

Algorithm 1 Value Iteration
  Initialize V^(0).
  for n = 1, 2, ... do
    for s ∈ S do
      V^(n)(s) = max_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V^(n−1)(s')]
    end for
    ▷ The above loop over s could be written as V^(n) = T V^(n−1)
  end for

Properties of value iteration

• If initialized with V^(0) = 0 and R(s, a, s') ≥ 0, values monotonically increase, i.e., V^(0)(s) ≤ V^(1)(s) ≤ ... ∀s.
• The error V^(n) − V* and the maximum suboptimality of the resulting policy are bounded by γ^n |R|_∞ / (1 − γ).
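Here is a minimal runnable sketch of Algorithm 1. It is an illustration under the same assumed tabular P and R arrays as above (not the author's code), stopping when the max-norm change falls below a tolerance.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration: V^(n) = T V^(n-1); returns V (approximately V*) and a greedy policy."""
    S = P.shape[0]
    V = np.zeros(S)                                        # V^(0) = 0
    for _ in range(max_iter):
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)                              # apply the Bellman backup T
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    # Greedy policy G V with respect to the final value estimate
    Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
    return V, Q.argmax(axis=1)
```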
Algorithm 2 Policy Iteration
  Initialize π^(0).
  for n = 1, 2, ... do
    V^(n−1) = Solve[V = T^{π^(n−1)} V]
    for s ∈ S do
      π^(n)(s) = argmax_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V^(n−1)(s')]
               = argmax_a Q^{π^(n−1)}(s, a)
    end for
  end for

The policy update step could be written in "operator form" as π^(n) = G V^(n−1), where G V denotes the greedy policy for value function V, i.e., G V(s) = argmax_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V(s')], ∀s ∈ S.

Properties of policy iteration

• Computes the optimal policy and value function in a finite number of iterations.
• (F) Performance of the policy monotonically increases. In fact, at the nth iteration, the policy improves by (1 − γ P^{π^(n)})^{−1} (T V^(n−1) − V^(n−1)), where P^π is the matrix defined by P^π(s, s') = P(s, π(s), s').
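A compact sketch of Algorithm 2 follows (again an added illustration with the same assumed tabular arrays). The policy-evaluation step Solve[V = T^π V] is the linear system (I − γ P^π) V = r^π, solved exactly here.

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iter=1_000):
    """Policy iteration: exact policy evaluation, then a greedy policy update."""
    S = P.shape[0]
    pi = np.zeros(S, dtype=int)                        # arbitrary initial policy pi^(0)
    for _ in range(max_iter):
        # Policy evaluation: solve (I - gamma P_pi) V = r_pi
        P_pi = P[np.arange(S), pi]                     # rows P(s, pi(s), .), shape (S, S)
        r_pi = np.einsum('st,st->s', P_pi, R[np.arange(S), pi])
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy G V
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):                 # converged (finitely many policies)
            break
        pi = pi_new
    return V, pi
```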
Algorithm 3 Modified Policy Iteration
  Initialize V^(0).
  for n = 1, 2, ... do
    π^(n) = G V^(n−1)
    V^(n) = (T^{π^(n)})^k V^(n−1), for integer k ≥ 1
  end for

Properties of modified policy iteration

• Computes the optimal policy in a finite number of iterations, and the value function converges to the optimal one: V^(n) → V*.
• k = 1 gives value iteration, and the k = ∞ limit gives policy iteration (except at the first iteration).
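A sketch of Algorithm 3 under the same assumptions as the previous snippets: each outer step takes the greedy policy and then applies the T^π backup k times, so k = 1 recovers value iteration and large k approaches policy iteration.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma, k=5, n_iter=500):
    """Modified policy iteration: greedy policy update, then k applications of T^pi."""
    S = P.shape[0]
    V = np.zeros(S)                                          # V^(0)
    for _ in range(n_iter):
        Q = np.einsum('sat,sat->sa', P, R + gamma * V[None, None, :])
        pi = Q.argmax(axis=1)                                # pi^(n) = G V^(n-1)
        for _ in range(k):                                   # V^(n) = (T^pi)^k V^(n-1)
            V = np.einsum('st,st->s', P[np.arange(S), pi],
                          R[np.arange(S), pi] + gamma * V[None, :])
    return V, pi
```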
4 Value Functions and Bellman Equations

The term "value function" in general refers to a function that returns the expected sum of future rewards. However, there are several different types of value function. A "state-value function" V(s) is a function of state, whereas a "state-action-value function" Q(s, a) is a function of a state-action pair.

Below, we list the most common value functions with a pair of equations: the first one involving an infinite sum of rewards, the second one providing a self-consistency equation (a "Bellman equation") with a unique solution. All of the expectations are taken with respect to all states s_t for t > 0.

  V^π(s) = E[R_0 + γ R_1 + ... | s_0 = s], where a_t = π(s_t) ∀t
  V^π(s) = Σ_{s'} P(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]

  Q^π(s, a) = E[R_0 + γ R_1 + ... | s_0 = s, a_0 = a], where a_t = π(s_t) ∀t
  Q^π(s, a) = Σ_{s'} P(s, a, s') [R(s, a, s') + γ Q^π(s', π(s'))]

  V*(s) = E[R_0 + γ R_1 + ... | s_0 = s], where a_t = π*(s_t) ∀t
  V*(s) = max_a Σ_{s'} P(s, a, s') [R(s, a, s') + γ V*(s')]

  Q*(s, a) = E[R_0 + γ R_1 + ... | s_0 = s, a_0 = a], where a_t = π*(s_t) ∀t
  Q*(s, a) = Σ_{s'} P(s, a, s') [R(s, a, s') + γ max_{a'} Q*(s', a')]
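For V^π in particular, it may help to see why the Bellman equation has a unique solution (this remark is an addition): writing (P^π)_{ss'} := P(s, π(s), s') and (r^π)_s := Σ_{s'} P(s, π(s), s') R(s, π(s), s'), the second equation reads V^π = r^π + γ P^π V^π. For γ < 1 the matrix I − γ P^π is invertible, so V^π = (I − γ P^π)^{−1} r^π; this is exactly the Solve step of Algorithm 2, and the same matrix appears in the policy-improvement bound above.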
5 Some Definitions

Contraction: a function f is a contraction under norm |·| with modulus γ < 1 iff |f(x) − f(y)| ≤ γ |x − y|. By the Banach fixed point theorem, a contraction mapping on R^d has a unique fixed point.

Stationary Distribution: Given a transition matrix P_{ss'}, the stationary distribution ρ is the left eigenvector satisfying ρ_{s'} = Σ_s ρ_s P_{ss'}. If the transition matrix satisfies appropriate conditions (see Markov chain theory [3]), then ρ = lim_{k→∞} ν P^k for any initial distribution ν. In the context of MDPs, we speak of the transition matrix induced by policy π, defined by P_{ss'} = P(s, π(s), s'), and similarly there is a stationary distribution induced by the policy, ρ^π.
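Assuming the limit above exists, ρ^π can be computed by power iteration, as in this small sketch (an added illustration; P_pi is the S × S matrix with entries P(s, π(s), s')):

```python
import numpy as np

def stationary_distribution(P_pi, tol=1e-10, max_iter=100_000):
    """Power iteration for the stationary distribution: rho = lim_k nu P^k."""
    S = P_pi.shape[0]
    rho = np.full(S, 1.0 / S)              # any initial distribution nu
    for _ in range(max_iter):
        rho_next = rho @ P_pi              # left-multiply: (nu P)_{s'} = sum_s nu_s P[s, s']
        if np.max(np.abs(rho_next - rho)) < tol:
            return rho_next
        rho = rho_next
    return rho
```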
Monotonic: a function f is monotonic if x ≤ y ⟹ f(x) ≤ f(y). This definition can be extended to the case that f : R^d → R^d, in which case the inequalities hold componentwise on the LHS and RHS.

References

[1] D. P. Bertsekas. Dynamic programming and optimal control, vol. 1. Athena Scientific, Belmont, MA, 1995.
[2] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2005.
[3] Wikipedia. Markov chain — Wikipedia, the free encyclopedia, 2015.
