Reinforcement Learning
2. Dynamic Programming
Thomas Bonald
2024 – 2025
Markov decision process → Model
At time t = 0, 1, 2, ..., the agent in state st takes action at and:
▶ receives reward rt
▶ moves to state st+1
The rewards and transitions are stochastic in general.
Some states may be terminal.
Definition
A Markov decision process (MDP) is defined by:
▶ the reward distribution, p(rt | st, at)
▶ the state transition distribution, p(st+1 | st, at)
We denote by S the set of non-terminal states.
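To make the model concrete, here is one possible tabular encoding of a small finite MDP in Python. The dictionary layout, the state names and the rewards are illustrative assumptions of these notes, not part of the lecture; the later sketches reuse the same layout.

# A minimal tabular MDP encoding (hypothetical layout, not from the lecture).
# For each non-terminal state s and action a, mdp[s][a] is a list of
# (probability, reward, next_state) outcomes; terminal states are not keys.
mdp = {
    "A": {
        "left":  [(1.0, +1.0, "C")],                    # deterministic outcome
        "right": [(1.0, -2.0, "D")],
    },
    "B": {
        "left":  [(0.5, +5.0, "C"), (0.5, -3.0, "D")],  # stochastic outcome
    },
}
# "C" and "D" are terminal states: they do not appear as keys of mdp.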
Policy → Agent
Definition
Given a Markov decision process, the policy of an agent defines
the action taken in each non-terminal state:
∀s ∈ S, π(a|s) = P(at = a | st = s)
When deterministic, we use the simpler notation a = π(s).
Gain → Objective
Definition
Given the rewards r0, r1, r2, ..., we define the gain as:
G = r0 + γr1 + γ²r2 + ...
The parameter γ ∈ [0, 1] is the discount factor.
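As a small sketch, the gain of a finite reward sequence can be computed directly from this definition (the function name and the example rewards are illustrative):

def gain(rewards, gamma):
    """Discounted gain G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(gain([1.0, 0.0, 5.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*5 = 5.05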
Value function → Expectation
Consider some policy π.
Definition
The value function of π is the expected gain from each state:
∀s, Vπ(s) = E(G | s0 = s)
Bellman’s equation
The value function Vπ is the unique solution to the fixed-point equation:
∀s, V(s) = E(r0 + γV(s1) | s0 = s)
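For a finite MDP and a fixed policy, Bellman's equation is a linear system; it can be solved directly or by fixed-point iteration. Below is a minimal sketch of the iterative solution, assuming the hypothetical tabular encoding introduced above (mdp[s][a] lists (probability, reward, next_state) outcomes; terminal states are not keys and have value 0).

def evaluate_policy(mdp, policy, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- E(r0 + gamma*V(s1) | s0 = s, a0 = policy[s]) to a fixed point."""
    V = {s: 0.0 for s in mdp}                          # start from the zero value function
    while True:
        delta = 0.0
        for s in mdp:
            v = sum(p * (r + gamma * V.get(s1, 0.0))   # terminal states have value 0
                    for p, r, s1 in mdp[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Hypothetical single-state example: V(A) solves V = 1 + 0.9*V, i.e. V(A) = 10,
# which the iteration recovers up to the tolerance.
print(evaluate_policy({"A": {"go": [(1.0, 1.0, "A")]}}, {"A": "go"}))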
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Optimal policy
Partial ordering
Policy π′ is better than policy π if:
∀s, Vπ′(s) ≥ Vπ(s)
Notation: π′ ≥ π
Optimal policy
A policy π⋆ is optimal if it is better than any other policy:
∀π, π⋆ ≥ π
A or B (optimal policy)
[Figure: the "A or B" example, with A, B, C, D and rewards +1, −2, +5, −3.]
Optimal value function
Recall Bellman’s equation for a policy π:
∀s ∈ S, V(s) = E(r0 + γV(s1) | s0 = s)
Proposition
There is a unique solution V⋆ to the equation:
∀s ∈ S, V(s) = max_a E(r0 + γV(s1) | s0 = s, a0 = a)
Notes:
▶ Not a linear system!
▶ Can still be solved by fixed-point iteration
A or B (optimal value function)
[Figure: the "A or B" example, with A, B, C, D and rewards +1, −2, +5, −3.]
Solution to Bellman’s optimality equation
Write Bellman’s optimality equation as the fixed-point equation:
V = F⋆(V)
with F⋆(V)(s) = max_a E(r0 + γV(s1) | s0 = s, a0 = a) for all s ∈ S.
Proposition
If γ < 1, the solution is unique and:
∀V, lim_{k→+∞} (F⋆)^k(V) = V⋆
Proof: The mapping F⋆ is a contraction:
∀U, V, ||F⋆(V) − F⋆(U)||∞ ≤ γ||V − U||∞
→ Banach fixed-point theorem
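The contraction property can be checked numerically. The sketch below applies the Bellman optimality operator to two random value functions on a small hypothetical MDP; the encoding, the state names and the rewards are assumptions of these notes.

import random

gamma = 0.9
# Hypothetical MDP: mdp[s][a] lists (probability, reward, next_state); "T" is terminal.
mdp = {
    "A": {"x": [(1.0, 1.0, "B")], "y": [(0.5, 0.0, "A"), (0.5, 2.0, "T")]},
    "B": {"x": [(1.0, -1.0, "A")], "y": [(1.0, 0.5, "T")]},
}

def F_star(V):
    """Bellman optimality operator on a tabular value function (terminal value 0)."""
    return {s: max(sum(p * (r + gamma * V.get(s1, 0.0)) for p, r, s1 in outcomes)
                   for outcomes in mdp[s].values())
            for s in mdp}

def dist(U, V):
    """Sup norm ||U - V||_inf over non-terminal states."""
    return max(abs(U[s] - V[s]) for s in U)

U = {s: random.uniform(-5, 5) for s in mdp}
V = {s: random.uniform(-5, 5) for s in mdp}
print(dist(F_star(U), F_star(V)) <= gamma * dist(U, V))   # True: F⋆ is a γ-contraction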
Optimal policy
An optimal policy follows from the optimal value function:
∀s ∈ S, π⋆(s) = a⋆ ∈ arg max_a E(r0 + γV⋆(s1) | s0 = s, a0 = a)
Bellman’s optimality theorem
The policy π⋆ has value function V⋆ and is optimal.
Note that:
▶ The optimal policy is not unique in general.
▶ There always exists a deterministic optimal policy.
▶ The optimal value function V⋆ is unique.
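A minimal sketch of extracting a deterministic greedy policy from a value function, under the same hypothetical tabular encoding as above. Applied to V⋆ it yields an optimal policy; applied to Vπ it is exactly the policy improvement step used later.

def greedy_policy(mdp, V, gamma=0.9):
    """pi(s) = argmax_a E(r0 + gamma*V(s1) | s0 = s, a0 = a)."""
    def q(s, a):                                   # expected one-step lookahead value
        return sum(p * (r + gamma * V.get(s1, 0.0)) for p, r, s1 in mdp[s][a])
    return {s: max(mdp[s], key=lambda a, s=s: q(s, a)) for s in mdp}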
A or B (optimal policy)
[Figure: the "A or B" example revisited, with A, B, C, D and rewards +1, −2, +5, −3.]
Maze (optimal policy)
[Figure: maze example showing the optimal value function V⋆ and the optimal policy π⋆.]
Tic-Tac-Toe (optimal player, random adversary)
[Figure: eight Tic-Tac-Toe positions with their optimal values against a random adversary: V⋆ ≈ 0.99, 0.99, 0.96, 0.96 (top row) and V⋆ ≈ 0.75, 0.75, 0, 0 (bottom row).]
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Policy improvement
Let Vπ be the value function of policy π.
Policy improvement
New policy π′ defined by:
π′(s) = a⋆ ∈ arg max_a E(r0 + γVπ(s1) | s0 = s, a0 = a)
Proposition
The policy π′ is better than π.
Maze (from the random policy)
[Figure: maze example showing the value function Vπ of the random policy and the improved policy π′.]
Exercise
[Figure: exercise MDP with A, B, C, D and rewards −1, +1, +3, +2, +3, +1.]
The initial policy is random.
What is the new policy after policy improvement?
Policy iteration
Algorithm
Starting from some arbitrary policy π = π0, iterate until convergence:
1. Evaluate the policy (by solving Bellman’s equation)
2. Improve the policy:
∀s, π(s) ← arg max_a E(r0 + γVπ(s1) | s, a)
▶ The sequence π0, π1, π2, ... is monotonic and converges in finite time (for finite numbers of states and actions).
▶ The limit is an optimal policy.
▶ These results assume perfect policy evaluation.
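A minimal policy-iteration sketch, again under the hypothetical tabular encoding used above (mdp[s][a] lists (probability, reward, next_state) outcomes, terminal states are not keys); the policy evaluation step here is itself iterative.

def policy_iteration(mdp, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    def q(s, a, V):   # expected value of taking action a in state s, then following V
        return sum(p * (r + gamma * V.get(s1, 0.0)) for p, r, s1 in mdp[s][a])

    policy = {s: next(iter(mdp[s])) for s in mdp}        # arbitrary initial policy
    while True:
        # 1. Policy evaluation: fixed-point iteration of Bellman's equation
        V = {s: 0.0 for s in mdp}
        while True:
            delta = 0.0
            for s in mdp:
                v = q(s, policy[s], V)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # 2. Policy improvement: greedy policy with respect to V
        new_policy = {s: max(mdp[s], key=lambda a, s=s: q(s, a, V)) for s in mdp}
        if new_policy == policy:
            return policy, V
        policy = new_policy

In practice the inner evaluation loop is often truncated rather than run to convergence, which leads naturally to the value-iteration variant discussed next.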
Maze (policy iteration)
[Figure: maze policies after iteration 1 and iteration 2.]
Practical considerations
▶ The policy evaluation step is time-consuming
(it requires solving Bellman’s equation)
▶ Do we need the exact solution?
No, since it is only used to improve the policy!
▶ Why not directly improve the value function?
This is value iteration!
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Value iteration
Algorithm
Starting from some arbitrary value function V = V0, iterate until convergence:
∀s, V(s) ← max_a E(r0 + γV(s1) | s, a)
▶ The sequence V0, V1, V2, ... converges (but not in finite time in general)
▶ The limit is the optimal value function.
▶ The corresponding policy is optimal.
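A minimal value-iteration sketch under the same hypothetical tabular encoding; compared with policy iteration, the maximization over actions is applied directly to the value function, and the greedy policy is read off only at the end.

def value_iteration(mdp, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a E(r0 + gamma*V(s1) | s, a), then return the greedy policy."""
    def q(s, a, V):   # expected value of taking action a in state s, then following V
        return sum(p * (r + gamma * V.get(s1, 0.0)) for p, r, s1 in mdp[s][a])

    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v = max(q(s, a, V) for a in mdp[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    policy = {s: max(mdp[s], key=lambda a, s=s: q(s, a, V)) for s in mdp}
    return policy, V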
Maze (optimal policy)
[Figure: maze example showing the optimal value function V⋆ and the optimal policy π⋆.]
Exercise
[Figure: exercise MDP with A, B, C, D and rewards −1, +1, +3, +2, +3, +1.]
Assume V0 = 0.
What is V1 (after one iteration)?
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Games
In many games (Tic-Tac-Toe, Chess, etc.), we have:
▶ 2 players
▶ No reward except in terminal states → r ∈ {−1, 0, +1}
▶ No discount → G ∈ {−1, 0, +1}
▶ Deterministic state transitions
Two approaches:
1. Learn to play against a given adversary
→ The adversary is part of the environment
2. Learn to play against a perfect adversary
→ The adversary is part of the agent
(who controls both players)
Learning to play perfectly
Players:
▶ π1 → policy of the first player (player 1)
▶ π2 → policy of the second player (player 2)
States:
▶ S1 → states where it is player 1’s turn
▶ S2 → states where it is player 2’s turn
Rewards:
▶ +1 if player 1 wins
▶ −1 if player 2 wins
▶ 0 otherwise (tie)
Value function
The value function of the policy π = (π1, π2) is the expected gain from each state:
∀s, Vπ(s) = E(G | s0 = s)
Bellman’s equation
The value function of π is the unique solution to:
∀s, Vπ(s) = E(r0 + Vπ(s1) | s0 = s)
Note: For deterministic players, Vπ ∈ {−1, 0, +1}.
Optimal policy
Partial ordering
Policy π′ = (π1′, π2′) is better than policy π = (π1, π2) if:
∀s ∈ S1, Vπ′(s) ≥ Vπ(s)
∀s ∈ S2, Vπ′(s) ≤ Vπ(s)
Optimal policy
A policy π⋆ is optimal if it is better than any other policy.
Optimal value function
Value function V⋆ of perfect players.
Bellman’s optimality equation
The optimal value function V⋆ is the unique solution to:
∀s ∈ S1, V(s) = max_a E(r0 + V(s1) | s0 = s, a0 = a)
∀s ∈ S2, V(s) = min_a E(r0 + V(s1) | s0 = s, a0 = a)
Note: The optimal value function satisfies V⋆ ∈ {−1, 0, +1}.
Value iteration
Algorithm
Starting from some arbitrary value function V = V0 (e.g., 0 in non-terminal states), iterate until convergence:
∀s ∈ S1, V(s) ← max_a E(r0 + V(s1) | s, a)
∀s ∈ S2, V(s) ← min_a E(r0 + V(s1) | s, a)
▶ The sequence V0, V1, V2, ... converges in finite time (takes values in {−1, 0, +1})
▶ The limit is the optimal value function.
▶ The corresponding policy is optimal.
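A minimal sketch of value iteration for a two-player game, on a tiny hand-built game tree. The tree, the state names and the encoding are all hypothetical; with deterministic transitions, the expectation reduces to a plain max or min over actions.

# game[s] = (player, {action: (reward, next_state)}); terminal states are not keys.
game = {
    "root": (1, {"a": (0, "L"), "b": (0, "R")}),      # player 1 to move
    "L":    (2, {"a": (+1, "end"), "b": (0, "end")}), # player 2 to move
    "R":    (2, {"a": (0, "end"), "b": (-1, "end")}),
}

def game_value_iteration(game):
    """V(s) <- max_a (r0 + V(s1)) on player 1's turns, min_a on player 2's turns."""
    V = {s: 0.0 for s in game}
    for _ in range(len(game)):                        # enough sweeps for this acyclic tree
        for s, (player, moves) in game.items():
            values = [r + V.get(s1, 0.0) for r, s1 in moves.values()]
            V[s] = max(values) if player == 1 else min(values)
    return V

print(game_value_iteration(game))   # {'root': 0.0, 'L': 0.0, 'R': -1.0}

In this toy example, player 1 avoids "R" because a perfect player 2 would then force the gain −1, so the game value at the root is 0.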
Tic-Tac-Toe (perfect players)
[Figure: the same Tic-Tac-Toe positions as in the earlier figure, now with their values under perfect play by both players: V⋆ = 0 in every case.]
Summary
Dynamic programming
▶ Optimal policy
π⋆ with V⋆ ≥ Vπ for every policy π
▶ Bellman’s optimality equation
V⋆ = max E(r + γV⋆)
▶ Policy iteration
π0 → V0 → π1 → V1 → ... → π⋆
▶ Value iteration
V0 → V1 → ... → V⋆
Next lecture
How to evaluate a policy online?