Reinforcement Learning
2. Dynamic Programming
Thomas Bonald
2024 – 2025
Markov decision process → Model
At time t = 0, 1, 2, ..., the agent in state st takes action at and:
▶ receives reward rt
▶ moves to state st+1
The rewards and transitions are stochastic in general.
Some states may be terminal.
Definition
A Markov decision process (MDP) is defined by:
▶ the reward distribution, p(rt | st, at)
▶ the state transition distribution, p(st+1 | st, at)
We denote by S the set of non-terminal states.
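To make the model concrete, here is one possible tabular encoding of a small finite MDP in Python. The dictionary layout, the state names and the rewards are illustrative assumptions of these notes, not part of the lecture; the later sketches reuse the same layout.

# A minimal tabular MDP encoding (hypothetical layout, not from the lecture).
# For each non-terminal state s and action a, mdp[s][a] is a list of
# (probability, reward, next_state) outcomes; terminal states are not keys.
mdp = {
    "A": {
        "left":  [(1.0, +1.0, "C")],                    # deterministic outcome
        "right": [(1.0, -2.0, "D")],
    },
    "B": {
        "left":  [(0.5, +5.0, "C"), (0.5, -3.0, "D")],  # stochastic outcome
    },
}
# "C" and "D" are terminal states: they do not appear as keys of mdp.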
Policy → Agent
Definition
Given a Markov decision process, the policy of an agent defines
the action taken in each non-terminal state:
∀s ∈ S, π(a|s) = P(at = a | st = s)
When deterministic, we use the simpler notation a = π(s).
Gain → Objective
Definition
Given the rewards r0, r1, r2, ..., we define the gain as:
G = r0 + γr1 + γ²r2 + ...
The parameter γ ∈ [0, 1] is the discount factor.
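As a small sketch, the gain of a finite reward sequence can be computed directly from this definition (the function name and the example rewards are illustrative):

def gain(rewards, gamma):
    """Discounted gain G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(gain([1.0, 0.0, 5.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*5 = 5.05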
Value function → Expectation
Consider some policy π.
Definition
The value function of π is the expected gain from each state:
∀s, Vπ(s) = E(G | s0 = s)
Bellman’s equation
The value function Vπ is the unique solution to the fixed-point equation:
∀s, V(s) = E(r0 + γV(s1) | s0 = s)
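For a finite MDP and a fixed policy, Bellman's equation is a linear system; it can be solved directly or by fixed-point iteration. Below is a minimal sketch of the iterative solution, assuming the hypothetical tabular encoding introduced above (mdp[s][a] lists (probability, reward, next_state) outcomes; terminal states are not keys and have value 0).

def evaluate_policy(mdp, policy, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- E(r0 + gamma*V(s1) | s0 = s, a0 = policy[s]) to a fixed point."""
    V = {s: 0.0 for s in mdp}                          # start from the zero value function
    while True:
        delta = 0.0
        for s in mdp:
            v = sum(p * (r + gamma * V.get(s1, 0.0))   # terminal states have value 0
                    for p, r, s1 in mdp[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Hypothetical single-state example: V(A) solves V = 1 + 0.9*V, i.e. V(A) = 10,
# which the iteration recovers up to the tolerance.
print(evaluate_policy({"A": {"go": [(1.0, 1.0, "A")]}}, {"A": "go"}))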
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Optimal policy
Partial ordering
Policy π′ is better than policy π if:
∀s, Vπ′(s) ≥ Vπ(s)
Notation: π′ ≥ π
Optimal policy
A policy π⋆ is optimal if it is better than any other policy:
∀π, π⋆ ≥ π
A or B (optimal policy)
[Figure: the "A or B" example, with A, B, C, D and rewards +1, −2, +5, −3.]
Optimal value function
Recall Bellman’s equation for a policy π:
∀s ∈ S, V(s) = E(r0 + γV(s1) | s0 = s)
Proposition
There is a unique solution V⋆ to the equation:
∀s ∈ S, V(s) = max_a E(r0 + γV(s1) | s0 = s, a0 = a)
Notes:
▶ Not a linear system!
▶ Can still be solved by fixed-point iteration
A or B (optimal value function)
[Figure: the "A or B" example, with A, B, C, D and rewards +1, −2, +5, −3.]
Solution to Bellman’s optimality equation
Write Bellman’s optimality equation as the fixed-point equation:
V = F⋆(V)
with F⋆(V)(s) = max_a E(r0 + γV(s1) | s0 = s, a0 = a) for all s ∈ S.
Proposition
If γ < 1, the solution is unique and:
∀V, lim_{k→+∞} (F⋆)^k(V) = V⋆
Proof: The mapping F⋆ is a contraction:
∀U, V, ||F⋆(V) − F⋆(U)||∞ ≤ γ||V − U||∞
→ Banach fixed-point theorem
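The contraction property can be checked numerically. The sketch below applies the Bellman optimality operator to two random value functions on a small hypothetical MDP; the encoding, the state names and the rewards are assumptions of these notes.

import random

gamma = 0.9
# Hypothetical MDP: mdp[s][a] lists (probability, reward, next_state); "T" is terminal.
mdp = {
    "A": {"x": [(1.0, 1.0, "B")], "y": [(0.5, 0.0, "A"), (0.5, 2.0, "T")]},
    "B": {"x": [(1.0, -1.0, "A")], "y": [(1.0, 0.5, "T")]},
}

def F_star(V):
    """Bellman optimality operator on a tabular value function (terminal value 0)."""
    return {s: max(sum(p * (r + gamma * V.get(s1, 0.0)) for p, r, s1 in outcomes)
                   for outcomes in mdp[s].values())
            for s in mdp}

def dist(U, V):
    """Sup norm ||U - V||_inf over non-terminal states."""
    return max(abs(U[s] - V[s]) for s in U)

U = {s: random.uniform(-5, 5) for s in mdp}
V = {s: random.uniform(-5, 5) for s in mdp}
print(dist(F_star(U), F_star(V)) <= gamma * dist(U, V))   # True: F⋆ is a γ-contraction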
Optimal policy
An optimal policy follows from the optimal value function:
∀s ∈ S, π⋆(s) = a⋆ ∈ arg max_a E(r0 + γV⋆(s1) | s0 = s, a0 = a)
Bellman’s optimality theorem
The policy π⋆ has value function V⋆ and is optimal.
Note that:
▶ The optimal policy is not unique in general.
▶ There always exists a deterministic optimal policy.
▶ The optimal value function V⋆ is unique.
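A minimal sketch of extracting a deterministic greedy policy from a value function, under the same hypothetical tabular encoding as above. Applied to V⋆ it yields an optimal policy; applied to Vπ it is exactly the policy improvement step used later.

def greedy_policy(mdp, V, gamma=0.9):
    """pi(s) = argmax_a E(r0 + gamma*V(s1) | s0 = s, a0 = a)."""
    def q(s, a):                                   # expected one-step lookahead value
        return sum(p * (r + gamma * V.get(s1, 0.0)) for p, r, s1 in mdp[s][a])
    return {s: max(mdp[s], key=lambda a, s=s: q(s, a)) for s in mdp}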
A or B (optimal policy)
[Figure: the "A or B" example revisited, with A, B, C, D and rewards +1, −2, +5, −3.]
Maze (optimal policy)
[Figure: maze example showing the optimal value function V⋆ and the optimal policy π⋆.]
Tic-Tac-Toe (optimal player, random adversary)
[Figure: eight Tic-Tac-Toe positions with their optimal values against a random adversary: V⋆ ≈ 0.99, 0.99, 0.96, 0.96 (top row) and V⋆ ≈ 0.75, 0.75, 0, 0 (bottom row).]
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Policy improvement
Let Vπ be the value function of policy π.
Policy improvement
New policy π′ defined by:
π′(s) = a⋆ ∈ arg max_a E(r0 + γVπ(s1) | s0 = s, a0 = a)
Proposition
The policy π′ is better than π.
Maze (from the random policy)
[Figure: maze example showing the value function Vπ of the random policy and the improved policy π′.]
Exercise
[Figure: exercise MDP with A, B, C, D and rewards −1, +1, +3, +2, +3, +1.]
The initial policy is random.
What is the new policy after policy improvement?
Policy iteration
Algorithm
Starting from some arbitrary policy π = π0, iterate until convergence:
1. Evaluate the policy (by solving Bellman’s equation)
2. Improve the policy:
∀s, π(s) ← arg max_a E(r0 + γVπ(s1) | s, a)
▶ The sequence π0, π1, π2, ... is monotonic and converges in finite time (for finite numbers of states and actions).
▶ The limit is an optimal policy.
▶ These results assume perfect policy evaluation.
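A minimal policy-iteration sketch, again under the hypothetical tabular encoding used above (mdp[s][a] lists (probability, reward, next_state) outcomes, terminal states are not keys); the policy evaluation step here is itself iterative.

def policy_iteration(mdp, gamma=0.9, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    def q(s, a, V):   # expected value of taking action a in state s, then following V
        return sum(p * (r + gamma * V.get(s1, 0.0)) for p, r, s1 in mdp[s][a])

    policy = {s: next(iter(mdp[s])) for s in mdp}        # arbitrary initial policy
    while True:
        # 1. Policy evaluation: fixed-point iteration of Bellman's equation
        V = {s: 0.0 for s in mdp}
        while True:
            delta = 0.0
            for s in mdp:
                v = q(s, policy[s], V)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # 2. Policy improvement: greedy policy with respect to V
        new_policy = {s: max(mdp[s], key=lambda a, s=s: q(s, a, V)) for s in mdp}
        if new_policy == policy:
            return policy, V
        policy = new_policy

In practice the inner evaluation loop is often truncated rather than run to convergence, which leads naturally to the value-iteration variant discussed next.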
Maze (policy iteration)
[Figure: maze policies after iteration 1 and iteration 2.]
Practical considerations
▶ The policy evaluation step is time-consuming
(it requires solving Bellman’s equation)
▶ Do we need the exact solution?
No, since it is only used to improve the policy!
▶ Why not directly improve the value function?
This is value iteration!
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Value iteration
Algorithm
Starting from some arbitrary value function V = V0, iterate until convergence:
∀s, V(s) ← max_a E(r0 + γV(s1) | s, a)
▶ The sequence V0, V1, V2, ... converges (but not in finite time in general)
▶ The limit is the optimal value function.
▶ The corresponding policy is optimal.
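A minimal value-iteration sketch under the same hypothetical tabular encoding; compared with policy iteration, the maximization over actions is applied directly to the value function, and the greedy policy is read off only at the end.

def value_iteration(mdp, gamma=0.9, tol=1e-8):
    """Iterate V(s) <- max_a E(r0 + gamma*V(s1) | s, a), then return the greedy policy."""
    def q(s, a, V):   # expected value of taking action a in state s, then following V
        return sum(p * (r + gamma * V.get(s1, 0.0)) for p, r, s1 in mdp[s][a])

    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in mdp:
            v = max(q(s, a, V) for a in mdp[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    policy = {s: max(mdp[s], key=lambda a, s=s: q(s, a, V)) for s in mdp}
    return policy, V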
Maze (optimal policy)
[Figure: maze example showing the optimal value function V⋆ and the optimal policy π⋆.]
Exercise
[Figure: exercise MDP with A, B, C, D and rewards −1, +1, +3, +2, +3, +1.]
Assume V0 = 0.
What is V1 (after one iteration)?
Outline
▶ Optimal policy
▶ Policy iteration
▶ Value iteration
▶ Games
Games
In many games (Tic-Tac-Toe, Chess, etc.), we have:
▶ 2 players
▶ No reward except in terminal states → r ∈ {−1, 0, +1}
▶ No discount → G ∈ {−1, 0, +1}
▶ Deterministic state transitions
Two approaches:
1. Learn to play against a given adversary
→ The adversary is part of the environment
2. Learn to play against a perfect adversary
→ The adversary is part of the agent
(who controls both players)
Learning to play perfectly
Players:
▶ π1 → policy of the first player (player 1)
▶ π2 → policy of the second player (player 2)
States:
▶ S1 → states where it is player 1’s turn
▶ S2 → states where it is player 2’s turn
Rewards:
▶ +1 if player 1 wins
▶ −1 if player 2 wins
▶ 0 otherwise (tie)
Value function
The value function of the policy π = (π1, π2) is the expected gain from each state:
∀s, Vπ(s) = E(G | s0 = s)
Bellman’s equation
The value function of π is the unique solution to:
∀s, Vπ(s) = E(r0 + Vπ(s1) | s0 = s)
Note: For deterministic players, Vπ ∈ {−1, 0, +1}.
Optimal policy
Partial ordering
Policy π′ = (π1′, π2′) is better than policy π = (π1, π2) if:
∀s ∈ S1, Vπ′(s) ≥ Vπ(s)
∀s ∈ S2, Vπ′(s) ≤ Vπ(s)
Optimal policy
A policy π⋆ is optimal if it is better than any other policy.
Optimal value function
Value function V⋆ of perfect players.
Bellman’s optimality equation
The optimal value function V⋆ is the unique solution to:
∀s ∈ S1, V(s) = max_a E(r0 + V(s1) | s0 = s, a0 = a)
∀s ∈ S2, V(s) = min_a E(r0 + V(s1) | s0 = s, a0 = a)
Note: The optimal value function satisfies V⋆ ∈ {−1, 0, +1}.
Value iteration
Algorithm
Starting from some arbitrary value function V = V0 (e.g., 0 in non-terminal states), iterate until convergence:
∀s ∈ S1, V(s) ← max_a E(r0 + V(s1) | s, a)
∀s ∈ S2, V(s) ← min_a E(r0 + V(s1) | s, a)
▶ The sequence V0, V1, V2, ... converges in finite time (takes values in {−1, 0, +1})
▶ The limit is the optimal value function.
▶ The corresponding policy is optimal.
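A minimal sketch of value iteration for a two-player game, on a tiny hand-built game tree. The tree, the state names and the encoding are all hypothetical; with deterministic transitions, the expectation reduces to a plain max or min over actions.

# game[s] = (player, {action: (reward, next_state)}); terminal states are not keys.
game = {
    "root": (1, {"a": (0, "L"), "b": (0, "R")}),      # player 1 to move
    "L":    (2, {"a": (+1, "end"), "b": (0, "end")}), # player 2 to move
    "R":    (2, {"a": (0, "end"), "b": (-1, "end")}),
}

def game_value_iteration(game):
    """V(s) <- max_a (r0 + V(s1)) on player 1's turns, min_a on player 2's turns."""
    V = {s: 0.0 for s in game}
    for _ in range(len(game)):                        # enough sweeps for this acyclic tree
        for s, (player, moves) in game.items():
            values = [r + V.get(s1, 0.0) for r, s1 in moves.values()]
            V[s] = max(values) if player == 1 else min(values)
    return V

print(game_value_iteration(game))   # {'root': 0.0, 'L': 0.0, 'R': -1.0}

In this toy example, player 1 avoids "R" because a perfect player 2 would then force the gain −1, so the game value at the root is 0.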
Tic-Tac-Toe (perfect players)
[Figure: the same Tic-Tac-Toe positions as in the earlier figure, now with their values under perfect play by both players: V⋆ = 0 in every case.]
Summary
Dynamic programming
▶ Optimal policy
π⋆ with V⋆ ≥ Vπ for every policy π
▶ Bellman’s optimality equation
V⋆ = max E(r + γV⋆)
▶ Policy iteration
π0 → V0 → π1 → V1 → ... → π⋆
▶ Value iteration
V0 → V1 → ... → V⋆
Next lecture
How to evaluate a policy online?