
Reinforcement Learning

Markov decision process & Dynamic programming

value function, Bellman equation, optimality, Markov property, Markov decision process,
dynamic programming, value iteration, policy iteration.

Vien Ngo
MLR, University of Stuttgart
Outline
Reinforcement learning problem.
Elements of reinforcement learning
Markov Process
Markov Reward Process
Markov decision process.

Dynamic programming
Value iteration
Policy iteration

Reinforcement Learning Problem
Elements of Reinforcement Learning Problem

Agent vs. Environment.

State, Action, Reward, Goal, Return.

The Markov property.

Markov decision process.

Bellman equations.

Optimality and Approximation.

Agent vs. Environment

The learner and decision-maker is called the agent.

The thing it interacts with, comprising everything outside the agent, is called the environment.

The environment is formally formulated as a Markov Decision Process, which is a mathematically principled framework for sequential decision problems.

(from the Introduction to RL book, Sutton & Barto)
The Markov property
A state that summarizes past sensations compactly yet in such a way
that all relevant information is retained. This normally requires more
than the immediate sensations, but never more than the complete
history of all past sensations. A state that succeeds in retaining all
relevant information is said to be Markov, or to have the Markov
property.
(Introduction to RL book, Sutton & Barto)

Formally,

$$\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, \ldots, s_0, a_0, r_0) = \Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t)$$

Examples: the current configuration of the chess board for predicting the next moves; the position and velocity of the cart together with the angle and angular velocity of the pole in the cart-pole domain.

Markov Process
A Markov Process (Markov Chain) is defined as a 2-tuple $(\mathcal{S}, P)$.
$\mathcal{S}$ is a state space.
$P$ is a state transition probability matrix: $P_{ss'} = P(s_{t+1} = s' \mid s_t = s)$

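To make this concrete, here is a minimal NumPy sketch of a two-state Markov chain; the transition probabilities are made up for illustration and are not the recycling-robot values on the next slide.

```python
import numpy as np

# Minimal sketch of a two-state Markov chain (S, P); the transition
# probabilities below are illustrative, not taken from the slides.
P = np.array([[0.9, 0.1],      # P[s, s'] = P(s_{t+1} = s' | s_t = s)
              [0.5, 0.5]])
rng = np.random.default_rng(0)

def sample_trajectory(P, s0=0, length=10):
    """Sample a state sequence by repeatedly drawing s' ~ P[s, :]."""
    states = [s0]
    for _ in range(length):
        states.append(rng.choice(P.shape[0], p=P[states[-1]]))
    return states

print(sample_trajectory(P))
```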
Markov Process: Example
Recycling Robot Markov Chain

[Figure: state-transition diagram with states "Battery: high" and "Battery: low"; the edges for search, wait, recharge, and stop carry transition probabilities such as 0.9, 0.1, 0.5, and 1.0.]

Markov Reward Process
A Markov Reward Process is defined as a 4-tuple $(\mathcal{S}, P, R, \gamma)$.
$\mathcal{S}$ is a state space of $n$ states.
$P$ is a state transition probability matrix: $P_{ss'} = P(s_{t+1} = s' \mid s_t = s)$
$R$ is a reward vector with entries $R_s$.
$\gamma$ is a discount factor, $\gamma \in [0, 1]$.
The total return is

$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots$$

Markov Reward Process: Example

[Figure: the recycling-robot Markov chain from the previous slide, with each transition now labeled "probability; reward" (e.g. 0.5; -1.0 and 0.1; -10.0).]
Markov Reward Process: Bellman Equations
The value function $V(s)$:

$$V(s) = \mathbb{E}\big[ G_t \mid s_t = s \big] = \mathbb{E}\big[ R_t + \gamma\, V(s_{t+1}) \mid s_t = s \big]$$

In matrix form, $V = R + \gamma P V$, hence $V = (I - \gamma P)^{-1} R$.

We will revisit this for MDPs.

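As an illustration of the closed form above, a minimal NumPy sketch that solves $V = (I - \gamma P)^{-1} R$ for a small made-up MRP (the numbers are illustrative, not the recycling-robot values):

```python
import numpy as np

# Solve V = R + gamma * P V  =>  V = (I - gamma * P)^{-1} R
P = np.array([[0.9, 0.1],     # P[s, s'] = P(s_{t+1} = s' | s_t = s)
              [0.5, 0.5]])
R = np.array([1.0, -1.0])     # expected reward per state (illustrative)
gamma = 0.9

V = np.linalg.solve(np.eye(2) - gamma * P, R)  # avoids forming an explicit inverse
print(V)
```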
Markov Reward Process: Discount Factor?
Many meanings:
Weighting the importance of differently timed rewards: rewards received sooner count more than rewards received later.
Representing uncertainty about whether future rewards will be received at all, i.e. a geometrically distributed horizon.
Representing human/animal preferences over the ordering of received rewards.

Markov decision process

Markov decision process
A reinforcement learning problem that satisfies the Markov property is
called a Markov decision process, or MDP.
$\mathrm{MDP} = \{\mathcal{S}, \mathcal{A}, T, R, P_0, \gamma\}$.
$\mathcal{S}$: the set of all possible states.
$\mathcal{A}$: the set of all possible actions.
$T$: a transition function defining the probability $T(s', s, a) = \Pr(s' \mid s, a)$.
$R$: a reward function defining the reward $R(s, a)$.
$P_0$: the probability distribution over initial states.
$\gamma \in [0, 1]$: a discount factor.

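A tabular MDP can be stored as a handful of arrays; the sketch below uses array shapes of my own choosing (they are not prescribed by the slides), and the later sketches reuse the same conventions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularMDP:
    """A finite MDP {S, A, T, R, P0, gamma} stored as NumPy arrays."""
    T: np.ndarray    # transitions, shape (A, S, S): T[a, s, s'] = Pr(s' | s, a)
    R: np.ndarray    # rewards, shape (S, A): R[s, a]
    P0: np.ndarray   # initial-state distribution, shape (S,)
    gamma: float     # discount factor in [0, 1]

    @property
    def num_states(self) -> int:
        return self.T.shape[1]

    @property
    def num_actions(self) -> int:
        return self.T.shape[0]
```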
Example: Recycling Robot MDP

[Figure: the agent-environment interaction unrolled over time, with states $s_0, s_1, s_2, \ldots$, actions $a_0, a_1, a_2, \ldots$ and rewards $r_0, r_1, r_2, \ldots$]

A policy is a mapping from state space to action space:

$$\pi : \mathcal{S} \mapsto \mathcal{A}$$

Objective function:
Expected average reward:

$$\eta^\pi_{\mathrm{avg}} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1}) \Big]$$

Expected discounted reward:

$$\eta^\pi_{\gamma} = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t, s_{t+1}) \Big]$$

Singh et al. 1994 relate the two criteria; roughly, $\eta^\pi_{\gamma} = \dfrac{\eta^\pi_{\mathrm{avg}}}{1-\gamma}$ as $\gamma \to 1$.
Dynamic Programming

Dynamic Programming
State Value Functions
Bellman Equations
Value Iteration
Policy Iteration

State value function
The value (expected discounted return) of a policy $\pi$ when started in state $s$:

$$V^\pi(s) = \mathbb{E}\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s; \pi \} \qquad (1)$$

with discount factor $\gamma \in [0, 1]$.

Definition of optimality: a behavior $\pi^*$ is optimal iff

$$\forall s:\; V^{\pi^*}(s) = V^*(s) \quad \text{where} \quad V^*(s) = \max_\pi V^\pi(s)$$

(simultaneously maximising the value in all states)

(In MDPs there always exists (at least one) optimal deterministic policy.)

Bellman optimality equation

$$\begin{aligned}
V^\pi(s) &= \mathbb{E}\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s; \pi \} \\
&= \mathbb{E}\{ r_0 \mid s_0 = s; \pi \} + \gamma\, \mathbb{E}\{ r_1 + \gamma r_2 + \cdots \mid s_0 = s; \pi \} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\; \mathbb{E}\{ r_1 + \gamma r_2 + \cdots \mid s_1 = s'; \pi \} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\; V^\pi(s')
\end{aligned}$$

We can write this in vector notation: $V^\pi = R^\pi + \gamma P^\pi V^\pi$
with vectors $V^\pi_s = V^\pi(s)$, $R^\pi_s = R(\pi(s), s)$ and matrix $P^\pi_{s's} = P(s' \mid \pi(s), s)$.

For a stochastic policy $\pi(a \mid s)$: $V^\pi(s) = \sum_a \pi(a \mid s) R(a, s) + \gamma \sum_{s', a} \pi(a \mid s) P(s' \mid a, s)\, V^\pi(s')$

Bellman optimality equation

$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname*{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$

(Sketch of proof: if $\pi^*$ selected another action than $\operatorname{argmax}_a[\cdot]$, then the policy $\pi'$ which equals $\pi^*$ everywhere except $\pi'(s) = \operatorname{argmax}_a[\cdot]$ would be better.)

This is the principle of optimality in the stochastic case (related to Viterbi, max-product algorithm).
Richard E. Bellman (1920-1984)
Bellman's principle of optimality

[Figure: an optimal path from A to B; any remaining portion of an optimal path is itself optimal.]

$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname*{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
Value Iteration
Given the Bellman equation

$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$

iterate

$$\forall s:\; V_{k+1}(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V_k(s') \Big]$$

with stopping criterion

$$\max_s |V_{k+1}(s) - V_k(s)| \le \epsilon$$

Value Iteration converges to the optimal value function $V^*$ (proof below).
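A minimal NumPy sketch of tabular value iteration, using the same illustrative array conventions as the earlier sketches (R[s, a], T[a, s, s']); the greedy policy extracted at convergence is an optimal policy.

```python
import numpy as np

def value_iteration(R, T, gamma=0.9, eps=1e-6):
    """Tabular value iteration (minimal sketch).

    R: rewards, shape (S, A) with R[s, a].
    T: transitions, shape (A, S, S) with T[a, s, s'] = P(s' | s, a).
    (These array conventions are my own, not prescribed by the slides.)
    """
    num_states = R.shape[0]
    V = np.zeros(num_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P(s' | s, a) * V[s']
        Q = R + gamma * np.einsum('asp,p->sa', T, V)
        V_new = Q.max(axis=1)                    # Bellman backup: max over actions
        if np.max(np.abs(V_new - V)) <= eps:     # stopping criterion
            return V_new, Q.argmax(axis=1)       # value estimate and greedy policy
        V = V_new
```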
2x2 Maze

[Figure: a 2x2 maze; one cell has reward 1.0, the others 0.0. Actions move in the intended direction with probability 80% and slip to either side with probability 10% each.]

Solve manually.

State-action value function (Q-function)
The state-action value function (or Q-function) is the expected discounted return when starting in state $s$ and taking first action $a$:

$$Q^\pi(a, s) = \mathbb{E}\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a; \pi \} = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, Q^\pi(\pi(s'), s')$$

(Note: $V^\pi(s) = Q^\pi(\pi(s), s)$.)

Bellman optimality equation for the Q-function

$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q^*(a', s')$$
$$\pi^*(s) = \operatorname*{argmax}_a Q^*(a, s)$$
Q-Iteration
Given the Bellman equation

$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q^*(a', s')$$

iterate

$$\forall a, s:\; Q_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q_k(a', s')$$

with stopping criterion

$$\max_{a,s} |Q_{k+1}(a, s) - Q_k(a, s)| \le \epsilon$$

Q-Iteration converges to the optimal state-action value function $Q^*$.
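The same kind of sketch for Q-Iteration; again the array conventions (R[s, a], T[a, s, s']) are illustrative choices of mine.

```python
import numpy as np

def q_iteration(R, T, gamma=0.9, eps=1e-6):
    """Tabular Q-iteration (minimal sketch, same array conventions as above)."""
    Q = np.zeros_like(R, dtype=float)            # Q[s, a], shape (S, A)
    while True:
        # Q_{k+1}[s, a] = R[s, a] + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q_k[s', a']
        Q_new = R + gamma * np.einsum('asp,p->sa', T, Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) <= eps:     # stopping criterion
            return Q_new
        Q = Q_new
```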
Proof of convergence
Let $\Delta_k = \|Q^* - Q_k\|_\infty = \max_{a,s} |Q^*(a, s) - Q_k(a, s)|$. Then

$$\begin{aligned}
Q_{k+1}(a, s) &= R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q_k(a', s') \\
&\le R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} \big[ Q^*(a', s') + \Delta_k \big] \\
&= R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q^*(a', s') + \gamma \Delta_k \\
&= Q^*(a, s) + \gamma \Delta_k
\end{aligned}$$

Similarly: $Q_k \ge Q^* - \Delta_k \;\Longrightarrow\; Q_{k+1} \ge Q^* - \gamma \Delta_k$.
Convergence
Contraction property: $\|U_{k+1} - V_{k+1}\| \le \gamma\, \|U_k - V_k\|$,
which guarantees convergence from different initial values $U_0, V_0$ of two approximations:

$$\|U_{k+1} - V_{k+1}\| \le \gamma\, \|U_k - V_k\| \le \ldots \le \gamma^{k+1} \|U_0 - V_0\|$$

Stopping condition: $\|V_{k+1} - V_k\| \le \epsilon \;\Longrightarrow\; \|V_{k+1} - V^*\| \le \epsilon\gamma/(1-\gamma)$

Proof:

$$\|V_{k+1} - V^*\| \le \gamma\, \|V_k - V^*\| \le \gamma\, \|V_k - V_{k+1}\| + \gamma\, \|V_{k+1} - V^*\| \le \gamma\epsilon + \gamma\, \|V_{k+1} - V^*\|$$
Policy Evaluation
Value Iteration and Q-Iteration directly compute $V^*$ and $Q^*$.
If we want to evaluate a given policy $\pi$, we want to compute $V^\pi$ or $Q^\pi$:

Iterate using $\pi$ instead of $\max_a$:

$$\forall s:\; V^\pi_{k+1}(s) = R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\, V^\pi_k(s')$$
$$\forall a, s:\; Q^\pi_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, Q^\pi_k(\pi(s'), s')$$

Or, invert the matrix equation:

$$V^\pi = R^\pi + \gamma P^\pi V^\pi \;\Longrightarrow\; (I - \gamma P^\pi) V^\pi = R^\pi \;\Longrightarrow\; V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$$

This requires inversion of an $n \times n$ matrix for $|S| = n$, which costs $O(n^3)$.
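A minimal sketch of iterative policy evaluation for a deterministic policy, with the same illustrative array conventions as before.

```python
import numpy as np

def policy_evaluation(policy, R, T, gamma=0.9, eps=1e-6):
    """Iterative policy evaluation (minimal sketch).

    policy: array of shape (S,) with policy[s] = action chosen in state s.
    R, T: illustrative conventions as before: R[s, a], T[a, s, s'].
    """
    num_states = R.shape[0]
    V = np.zeros(num_states)
    states = np.arange(num_states)
    while True:
        # V_{k+1}[s] = R[s, pi(s)] + gamma * sum_{s'} P(s' | s, pi(s)) * V_k[s']
        V_new = R[states, policy] + gamma * T[policy, states] @ V
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```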
Policy Iteration
How does computing $V^\pi$ or $Q^\pi$ for a given policy help us find the optimal policy?

Policy Iteration
1. Initialise $\pi_0$ somehow (e.g. randomly)
2. Iterate:
   Policy Evaluation: compute $V^{\pi_k}$ or $Q^{\pi_k}$
   Policy Improvement: $\pi_{k+1}(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$

demo: 2x2 maze
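A minimal sketch of policy iteration, using exact policy evaluation via the linear system above (the "invert the matrix equation" route); replacing it with a few sweeps of iterative evaluation would give modified policy iteration.

```python
import numpy as np

def policy_iteration(R, T, gamma=0.9):
    """Policy iteration (minimal sketch): exact evaluation, then greedy improvement.
    Array conventions as in the earlier sketches: R[s, a], T[a, s, s']."""
    num_states, num_actions = R.shape
    states = np.arange(num_states)
    policy = np.zeros(num_states, dtype=int)           # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: V = (I - gamma * P^pi)^{-1} R^pi
        P_pi = T[policy, states]                        # P_pi[s, s'] = P(s' | s, pi(s))
        R_pi = R[states, policy]
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy w.r.t. Q(s, a) = R[s, a] + gamma * sum_{s'} P V
        Q = R + gamma * np.einsum('asp,p->sa', T, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):          # stable policy => done
            return policy, V
        policy = new_policy
```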
Convergence proof
The key facts are:
After policy improvement: $V^{\pi_k} \le V^{\pi_{k+1}}$ (a sketch of the proof is in Rich Sutton's book).
The policy space is finite: there are $|\mathcal{A}|^{|\mathcal{S}|}$ deterministic policies.
The Bellman operator has a unique fixed point (due to the strict contraction property, $0 < \gamma < 1$, on a Banach space). The same argument is used to prove convergence of the VI algorithm to its fixed point.
VI vs. PI
VI is PI with one step of policy evaluation.
PI converges surprisingly rapidly, but each iteration is computationally expensive because of the policy evaluation step (which waits for convergence of $V^\pi$).
PI is preferred if the action set is large.

Asynchronous Dynamic Programming
The value function table is updated asynchronously.
Computation is significantly reduced.
If every state keeps being updated infinitely often, convergence is still guaranteed.
Three simple algorithms:
Gauss-Seidel Value Iteration
Real-time dynamic programming
Prioritised sweeping

Gauss-Seidel Value Iteration
The standard VI algorithm updates all states in the next iteration using the old values from the previous iteration (an iteration finishes when all states have been updated).

Algorithm 1 Standard Value Iteration Algorithm
1: while (!converged) do
2:   V_old = V
3:   for (each s ∈ S) do
4:     V(s) = max_a { R(s, a) + γ Σ_{s'} P(s'|s, a) V_old(s') }

Gauss-Seidel VI updates each state using the most recently computed values.

Algorithm 2 Gauss-Seidel Value Iteration Algorithm
1: while (!converged) do
2:   for (each s ∈ S) do
3:     V(s) = max_a { R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') }
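A minimal sketch of Gauss-Seidel VI; the only difference from the standard sweep is the in-place update of V during the sweep.

```python
import numpy as np

def gauss_seidel_value_iteration(R, T, gamma=0.9, eps=1e-6):
    """Gauss-Seidel VI (minimal sketch): sweep the states in a fixed order and
    reuse freshly updated values within the same sweep."""
    num_states = R.shape[0]
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            # backup for state s uses the current V, including values already
            # updated earlier in this sweep (the Gauss-Seidel part)
            v_new = np.max(R[s] + gamma * T[:, s, :] @ V)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # in-place update
        if delta <= eps:
            return V
```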
Prioritised Sweeping
Similar to Gauss-Seidel VI, but the order in which states are updated is determined by their update magnitudes (Bellman errors).
Define the Bellman error as $E(s; V_t) = |V_{t+1}(s) - V_t(s)|$, i.e. the change of $s$'s value after its most recent update.

Algorithm 3 Prioritised Sweeping VI Algorithm
1: Initialize V_0(s) and priority values H_0(s), ∀s ∈ S.
2: for k = 0, 1, 2, 3, ... do
3:   pick the state with the highest priority: s_k ← argmax_{s∈S} H_k(s)
4:   value update: V_{k+1}(s_k) = max_{a∈A} [ R(s_k, a) + γ Σ_{s'} P(s'|s_k, a) V_k(s') ]
5:   for s ≠ s_k: V_{k+1}(s) = V_k(s)
6:   update priority values: ∀s ∈ S, H_{k+1}(s) ← E(s; V_{k+1}) (Note: the error is w.r.t. the future update.)
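A minimal sketch of the prioritisation idea above, taking the priority of a state to be its current Bellman residual; a practical implementation would update priorities incrementally (e.g. via predecessor states and a priority queue) instead of recomputing them for all states after every backup, as done here for simplicity.

```python
import numpy as np

def prioritized_sweeping_vi(R, T, gamma=0.9, eps=1e-6, max_updates=100_000):
    """Prioritised-sweeping VI (minimal, deliberately naive sketch)."""
    num_states = R.shape[0]
    V = np.zeros(num_states)

    def bellman_backup(V):
        # greedy one-step backup for every state at once, shape (S,)
        return np.max(R + gamma * np.einsum('asp,p->sa', T, V), axis=1)

    for _ in range(max_updates):
        H = np.abs(bellman_backup(V) - V)     # priorities = current Bellman errors
        if H.max() <= eps:
            break
        s = int(np.argmax(H))                 # back up only the highest-priority state
        V[s] = np.max(R[s] + gamma * T[:, s, :] @ V)
    return V
```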
Real-Time Dynamic Programming
Similar to Gauss-Seidel VI, but the sequence of states to be updated is generated by simulating the transitions.

Algorithm 4 Real-Time Value Iteration Algorithm
1: start at an arbitrary s_0, and initialize V_0(s), ∀s ∈ S.
2: for k = 0, 1, 2, 3, ... do
3:   action selection: a_k ← argmax_{a∈A} [ R(s_k, a) + γ Σ_{s'} P(s'|s_k, a) V_k(s') ]
4:   value update: V_{k+1}(s_k) = R(s_k, a_k) + γ Σ_{s'} P(s'|s_k, a_k) V_k(s')
5:   for s ≠ s_k: V_{k+1}(s) = V_k(s)
6:   simulate the next state: s_{k+1} ∼ P(s'|s_k, a_k)
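A minimal sketch of real-time DP following Algorithm 4: only the currently visited state is backed up, and the next state is sampled from the model.

```python
import numpy as np

def real_time_dp(R, T, gamma=0.9, num_steps=10_000, rng=None):
    """Real-time value iteration (minimal sketch, same array conventions)."""
    rng = np.random.default_rng() if rng is None else rng
    num_states = R.shape[0]
    V = np.zeros(num_states)
    s = 0                                       # arbitrary start state s_0
    for _ in range(num_steps):
        q = R[s] + gamma * T[:, s, :] @ V       # q[a] for the current state only
        a = int(np.argmax(q))                   # greedy action selection
        V[s] = q[a]                             # back up only the visited state
        s = rng.choice(num_states, p=T[a, s])   # simulate the next state
    return V
```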
So far, we have introduced the basic notions of an MDP and value functions, and methods to compute optimal policies assuming that we know the world (i.e. we know $P(s' \mid a, s)$ and $R(a, s)$):

Value Iteration / Q-Iteration → $V^*$, $Q^*$, $\pi^*$
Policy Evaluation → $V^\pi$, $Q^\pi$
Policy Improvement: $\pi(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$
Policy Iteration (iterate Policy Evaluation and Policy Improvement)

Reinforcement Learning?

