
Reinforcement Learning

Markov decision process & Dynamic programming

value function, Bellman equation, optimality, Markov property, Markov decision process,
dynamic programming, value iteration, policy iteration.

Vien Ngo
MLR, University of Stuttgart
Outline
Reinforcement learning problem.
Elements of reinforcement learning
Markov Process
Markov Reward Process
Markov decision process.

Dynamic programming
Value iteration
Policy iteration

Reinforcement Learning Problem
Elements of Reinforcement Learning Problem

Agent vs. Environment.

State, Action, Reward, Goal, Return.

The Markov property.

Markov decision process.

Bellman equations.

Optimality and Approximation.

Agent vs. Environment

The learner and decision-maker is called the agent.

The thing it interacts with, comprising everything outside the agent, is called the environment.

The environment is formally formulated as a Markov Decision Process, which is a mathematically principled framework for sequential decision problems.

(from the Introduction to RL book, Sutton & Barto)
The Markov property
A state that summarizes past sensations compactly yet in such a way
that all relevant information is retained. This normally requires more
than the immediate sensations, but never more than the complete
history of all past sensations. A state that succeeds in retaining all
relevant information is said to be Markov, or to have the Markov
property.
(Introduction to RL book, Sutton & Barto)

Formally,

$$\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, \ldots, s_0, a_0, r_0) = \Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t)$$

Examples: the current configuration of the chess board for predicting the next moves; the position and velocity of the cart together with the angle and angular velocity of the pole in the cart-pole domain.

Markov Process
A Markov Process (Markov Chain) is defined as a 2-tuple $(\mathcal{S}, P)$.
$\mathcal{S}$ is a state space.
$P$ is a state transition probability matrix: $P_{ss'} = P(s_{t+1} = s' \mid s_t = s)$

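To make this concrete, here is a minimal NumPy sketch of a two-state Markov chain; the transition probabilities are made up for illustration and are not the recycling-robot values on the next slide.

```python
import numpy as np

# Minimal sketch of a two-state Markov chain (S, P); the transition
# probabilities below are illustrative, not taken from the slides.
P = np.array([[0.9, 0.1],      # P[s, s'] = P(s_{t+1} = s' | s_t = s)
              [0.5, 0.5]])
rng = np.random.default_rng(0)

def sample_trajectory(P, s0=0, length=10):
    """Sample a state sequence by repeatedly drawing s' ~ P[s, :]."""
    states = [s0]
    for _ in range(length):
        states.append(rng.choice(P.shape[0], p=P[states[-1]]))
    return states

print(sample_trajectory(P))
```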
Markov Process: Example
Recycling Robot Markov Chain

[Figure: state-transition diagram with states "Battery: high" and "Battery: low"; the edges for search, wait, recharge, and stop carry transition probabilities such as 0.9, 0.1, 0.5, and 1.0.]

Markov Reward Process
A Markov Reward Process is defined as a 4-tuple $(\mathcal{S}, P, R, \gamma)$.
$\mathcal{S}$ is a state space of $n$ states.
$P$ is a state transition probability matrix: $P_{ss'} = P(s_{t+1} = s' \mid s_t = s)$
$R$ is a reward vector with entries $R_s$.
$\gamma$ is a discount factor, $\gamma \in [0, 1]$.
The total return is

$$G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots$$

Markov Reward Process: Example

[Figure: the recycling-robot Markov chain from the previous slide, with each transition now labeled "probability; reward" (e.g. 0.5; -1.0 and 0.1; -10.0).]
Markov Reward Process: Bellman Equations
The value function $V(s)$:

$$V(s) = \mathbb{E}\big[ G_t \mid s_t = s \big] = \mathbb{E}\big[ R_t + \gamma\, V(s_{t+1}) \mid s_t = s \big]$$

In matrix form, $V = R + \gamma P V$, hence $V = (I - \gamma P)^{-1} R$.

We will revisit this for MDPs.

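As an illustration of the closed form above, a minimal NumPy sketch that solves $V = (I - \gamma P)^{-1} R$ for a small made-up MRP (the numbers are illustrative, not the recycling-robot values):

```python
import numpy as np

# Solve V = R + gamma * P V  =>  V = (I - gamma * P)^{-1} R
P = np.array([[0.9, 0.1],     # P[s, s'] = P(s_{t+1} = s' | s_t = s)
              [0.5, 0.5]])
R = np.array([1.0, -1.0])     # expected reward per state (illustrative)
gamma = 0.9

V = np.linalg.solve(np.eye(2) - gamma * P, R)  # avoids forming an explicit inverse
print(V)
```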
Markov Reward Process: Discount Factor?
Many meanings:
Weighting the importance of differently timed rewards: rewards received sooner count more than rewards received later.
Representing uncertainty about whether future rewards will be received at all, i.e. a geometrically distributed horizon.
Representing human/animal preferences over the ordering of received rewards.

Markov decision process

Markov decision process
A reinforcement learning problem that satisfies the Markov property is
called a Markov decision process, or MDP.
$\mathrm{MDP} = \{\mathcal{S}, \mathcal{A}, T, R, P_0, \gamma\}$.
$\mathcal{S}$: the set of all possible states.
$\mathcal{A}$: the set of all possible actions.
$T$: a transition function defining the probability $T(s', s, a) = \Pr(s' \mid s, a)$.
$R$: a reward function defining the reward $R(s, a)$.
$P_0$: the probability distribution over initial states.
$\gamma \in [0, 1]$: a discount factor.

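A tabular MDP can be stored as a handful of arrays; the sketch below uses array shapes of my own choosing (they are not prescribed by the slides), and the later sketches reuse the same conventions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularMDP:
    """A finite MDP {S, A, T, R, P0, gamma} stored as NumPy arrays."""
    T: np.ndarray    # transitions, shape (A, S, S): T[a, s, s'] = Pr(s' | s, a)
    R: np.ndarray    # rewards, shape (S, A): R[s, a]
    P0: np.ndarray   # initial-state distribution, shape (S,)
    gamma: float     # discount factor in [0, 1]

    @property
    def num_states(self) -> int:
        return self.T.shape[1]

    @property
    def num_actions(self) -> int:
        return self.T.shape[0]
```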
Example: Recycling Robot MDP

[Figure: the agent-environment interaction unrolled over time, with states $s_0, s_1, s_2, \ldots$, actions $a_0, a_1, a_2, \ldots$ and rewards $r_0, r_1, r_2, \ldots$]

A policy is a mapping from state space to action space:

$$\pi : \mathcal{S} \mapsto \mathcal{A}$$

Objective function:
Expected average reward:

$$\eta^\pi_{\mathrm{avg}} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1}) \Big]$$

Expected discounted reward:

$$\eta^\pi_{\gamma} = \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t, s_{t+1}) \Big]$$

Singh et al. 1994 relate the two criteria; roughly, $\eta^\pi_{\gamma} = \dfrac{\eta^\pi_{\mathrm{avg}}}{1-\gamma}$ as $\gamma \to 1$.
Dynamic Programming

Dynamic Programming
State Value Functions
Bellman Equations
Value Iteration
Policy Iteration

State value function
The value (expected discounted return) of a policy $\pi$ when started in state $s$:

$$V^\pi(s) = \mathbb{E}\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s; \pi \} \qquad (1)$$

with discount factor $\gamma \in [0, 1]$.

Definition of optimality: a behavior $\pi^*$ is optimal iff

$$\forall s:\; V^{\pi^*}(s) = V^*(s) \quad \text{where} \quad V^*(s) = \max_\pi V^\pi(s)$$

(simultaneously maximising the value in all states)

(In MDPs there always exists (at least one) optimal deterministic policy.)

Bellman optimality equation

$$\begin{aligned}
V^\pi(s) &= \mathbb{E}\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s; \pi \} \\
&= \mathbb{E}\{ r_0 \mid s_0 = s; \pi \} + \gamma\, \mathbb{E}\{ r_1 + \gamma r_2 + \cdots \mid s_0 = s; \pi \} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\; \mathbb{E}\{ r_1 + \gamma r_2 + \cdots \mid s_1 = s'; \pi \} \\
&= R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\; V^\pi(s')
\end{aligned}$$

We can write this in vector notation: $V^\pi = R^\pi + \gamma P^\pi V^\pi$
with vectors $V^\pi_s = V^\pi(s)$, $R^\pi_s = R(\pi(s), s)$ and matrix $P^\pi_{s's} = P(s' \mid \pi(s), s)$.

For a stochastic policy $\pi(a \mid s)$: $V^\pi(s) = \sum_a \pi(a \mid s) R(a, s) + \gamma \sum_{s', a} \pi(a \mid s) P(s' \mid a, s)\, V^\pi(s')$

Bellman optimality equation

$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname*{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$

(Sketch of proof: if $\pi^*$ selected another action than $\operatorname{argmax}_a[\cdot]$, then the policy $\pi'$ which equals $\pi^*$ everywhere except $\pi'(s) = \operatorname{argmax}_a[\cdot]$ would be better.)

This is the principle of optimality in the stochastic case (related to Viterbi, max-product algorithm).
Richard E. Bellman (1920-1984)
Bellman's principle of optimality

[Figure: an optimal path from A to B; any remaining portion of an optimal path is itself optimal.]

$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
$$\pi^*(s) = \operatorname*{argmax}_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$
Value Iteration
Given the Bellman equation

$$V^*(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V^*(s') \Big]$$

iterate

$$\forall s:\; V_{k+1}(s) = \max_a \Big[ R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, V_k(s') \Big]$$

with stopping criterion

$$\max_s |V_{k+1}(s) - V_k(s)| \le \epsilon$$

Value Iteration converges to the optimal value function $V^*$ (proof below).
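A minimal NumPy sketch of tabular value iteration, using the same illustrative array conventions as the earlier sketches (R[s, a], T[a, s, s']); the greedy policy extracted at convergence is an optimal policy.

```python
import numpy as np

def value_iteration(R, T, gamma=0.9, eps=1e-6):
    """Tabular value iteration (minimal sketch).

    R: rewards, shape (S, A) with R[s, a].
    T: transitions, shape (A, S, S) with T[a, s, s'] = P(s' | s, a).
    (These array conventions are my own, not prescribed by the slides.)
    """
    num_states = R.shape[0]
    V = np.zeros(num_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_{s'} P(s' | s, a) * V[s']
        Q = R + gamma * np.einsum('asp,p->sa', T, V)
        V_new = Q.max(axis=1)                    # Bellman backup: max over actions
        if np.max(np.abs(V_new - V)) <= eps:     # stopping criterion
            return V_new, Q.argmax(axis=1)       # value estimate and greedy policy
        V = V_new
```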
2x2 Maze

[Figure: a 2x2 maze; one cell has reward 1.0, the others 0.0. Actions move in the intended direction with probability 80% and slip to either side with probability 10% each.]

Solve manually.

State-action value function (Q-function)
The state-action value function (or Q-function) is the expected discounted return when starting in state $s$ and taking first action $a$:

$$Q^\pi(a, s) = \mathbb{E}\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid s_0 = s, a_0 = a; \pi \} = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, Q^\pi(\pi(s'), s')$$

(Note: $V^\pi(s) = Q^\pi(\pi(s), s)$.)

Bellman optimality equation for the Q-function

$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q^*(a', s')$$
$$\pi^*(s) = \operatorname*{argmax}_a Q^*(a, s)$$
Q-Iteration
Given the Bellman equation

$$Q^*(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q^*(a', s')$$

iterate

$$\forall a, s:\; Q_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q_k(a', s')$$

with stopping criterion

$$\max_{a,s} |Q_{k+1}(a, s) - Q_k(a, s)| \le \epsilon$$

Q-Iteration converges to the optimal state-action value function $Q^*$.
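The same kind of sketch for Q-Iteration; again the array conventions (R[s, a], T[a, s, s']) are illustrative choices of mine.

```python
import numpy as np

def q_iteration(R, T, gamma=0.9, eps=1e-6):
    """Tabular Q-iteration (minimal sketch, same array conventions as above)."""
    Q = np.zeros_like(R, dtype=float)            # Q[s, a], shape (S, A)
    while True:
        # Q_{k+1}[s, a] = R[s, a] + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q_k[s', a']
        Q_new = R + gamma * np.einsum('asp,p->sa', T, Q.max(axis=1))
        if np.max(np.abs(Q_new - Q)) <= eps:     # stopping criterion
            return Q_new
        Q = Q_new
```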
Proof of convergence
Let $\Delta_k = \|Q^* - Q_k\|_\infty = \max_{a,s} |Q^*(a, s) - Q_k(a, s)|$. Then

$$\begin{aligned}
Q_{k+1}(a, s) &= R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q_k(a', s') \\
&\le R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} \big[ Q^*(a', s') + \Delta_k \big] \\
&= R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, \max_{a'} Q^*(a', s') + \gamma \Delta_k \\
&= Q^*(a, s) + \gamma \Delta_k
\end{aligned}$$

Similarly: $Q_k \ge Q^* - \Delta_k \;\Longrightarrow\; Q_{k+1} \ge Q^* - \gamma \Delta_k$.
Convergence
Contraction property: $\|U_{k+1} - V_{k+1}\| \le \gamma\, \|U_k - V_k\|$,
which guarantees convergence from different initial values $U_0, V_0$ of two approximations:

$$\|U_{k+1} - V_{k+1}\| \le \gamma\, \|U_k - V_k\| \le \ldots \le \gamma^{k+1} \|U_0 - V_0\|$$

Stopping condition: $\|V_{k+1} - V_k\| \le \epsilon \;\Longrightarrow\; \|V_{k+1} - V^*\| \le \epsilon\gamma/(1-\gamma)$

Proof:

$$\|V_{k+1} - V^*\| \le \gamma\, \|V_k - V^*\| \le \gamma\, \|V_k - V_{k+1}\| + \gamma\, \|V_{k+1} - V^*\| \le \gamma\epsilon + \gamma\, \|V_{k+1} - V^*\|$$
Policy Evaluation
Value Iteration and Q-Iteration directly compute $V^*$ and $Q^*$.
If we want to evaluate a given policy $\pi$, we want to compute $V^\pi$ or $Q^\pi$:

Iterate using $\pi$ instead of $\max_a$:

$$\forall s:\; V^\pi_{k+1}(s) = R(\pi(s), s) + \gamma \sum_{s'} P(s' \mid \pi(s), s)\, V^\pi_k(s')$$
$$\forall a, s:\; Q^\pi_{k+1}(a, s) = R(a, s) + \gamma \sum_{s'} P(s' \mid a, s)\, Q^\pi_k(\pi(s'), s')$$

Or, invert the matrix equation:

$$V^\pi = R^\pi + \gamma P^\pi V^\pi \;\Longrightarrow\; (I - \gamma P^\pi) V^\pi = R^\pi \;\Longrightarrow\; V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$$

This requires inversion of an $n \times n$ matrix for $|S| = n$, which costs $O(n^3)$.
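A minimal sketch of iterative policy evaluation for a deterministic policy, with the same illustrative array conventions as before.

```python
import numpy as np

def policy_evaluation(policy, R, T, gamma=0.9, eps=1e-6):
    """Iterative policy evaluation (minimal sketch).

    policy: array of shape (S,) with policy[s] = action chosen in state s.
    R, T: illustrative conventions as before: R[s, a], T[a, s, s'].
    """
    num_states = R.shape[0]
    V = np.zeros(num_states)
    states = np.arange(num_states)
    while True:
        # V_{k+1}[s] = R[s, pi(s)] + gamma * sum_{s'} P(s' | s, pi(s)) * V_k[s']
        V_new = R[states, policy] + gamma * T[policy, states] @ V
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new
        V = V_new
```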
Policy Iteration
How does computing $V^\pi$ or $Q^\pi$ for a given policy help us find the optimal policy?

Policy Iteration
1. Initialise $\pi_0$ somehow (e.g. randomly)
2. Iterate:
   Policy Evaluation: compute $V^{\pi_k}$ or $Q^{\pi_k}$
   Policy Improvement: $\pi_{k+1}(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$

demo: 2x2 maze
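A minimal sketch of policy iteration, using exact policy evaluation via the linear system above (the "invert the matrix equation" route); replacing it with a few sweeps of iterative evaluation would give modified policy iteration.

```python
import numpy as np

def policy_iteration(R, T, gamma=0.9):
    """Policy iteration (minimal sketch): exact evaluation, then greedy improvement.
    Array conventions as in the earlier sketches: R[s, a], T[a, s, s']."""
    num_states, num_actions = R.shape
    states = np.arange(num_states)
    policy = np.zeros(num_states, dtype=int)           # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: V = (I - gamma * P^pi)^{-1} R^pi
        P_pi = T[policy, states]                        # P_pi[s, s'] = P(s' | s, pi(s))
        R_pi = R[states, policy]
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy w.r.t. Q(s, a) = R[s, a] + gamma * sum_{s'} P V
        Q = R + gamma * np.einsum('asp,p->sa', T, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):          # stable policy => done
            return policy, V
        policy = new_policy
```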
Convergence proof
The key facts are:
After policy improvement: $V^{\pi_k} \le V^{\pi_{k+1}}$ (a sketch of the proof is in Rich Sutton's book).
The policy space is finite: there are $|\mathcal{A}|^{|\mathcal{S}|}$ deterministic policies.
The Bellman operator has a unique fixed point (due to the strict contraction property, $0 < \gamma < 1$, on a Banach space). The same argument is used to prove convergence of the VI algorithm to its fixed point.
VI vs. PI
VI is PI with one step of policy evaluation.
PI converges surprisingly rapidly, but each iteration is computationally expensive because of the policy evaluation step (which waits for convergence of $V^\pi$).
PI is preferred if the action set is large.

Asynchronous Dynamic Programming
The value function table is updated asynchronously.
Computation is significantly reduced.
If every state keeps being updated infinitely often, convergence is still guaranteed.
Three simple algorithms:
Gauss-Seidel Value Iteration
Real-time dynamic programming
Prioritised sweeping

Gauss-Seidel Value Iteration
The standard VI algorithm updates all states in the next iteration using the old values from the previous iteration (an iteration finishes when all states have been updated).

Algorithm 1 Standard Value Iteration Algorithm
1: while (!converged) do
2:   V_old = V
3:   for (each s ∈ S) do
4:     V(s) = max_a { R(s, a) + γ Σ_{s'} P(s'|s, a) V_old(s') }

Gauss-Seidel VI updates each state using the most recently computed values.

Algorithm 2 Gauss-Seidel Value Iteration Algorithm
1: while (!converged) do
2:   for (each s ∈ S) do
3:     V(s) = max_a { R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') }
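A minimal sketch of Gauss-Seidel VI; the only difference from the standard sweep is the in-place update of V during the sweep.

```python
import numpy as np

def gauss_seidel_value_iteration(R, T, gamma=0.9, eps=1e-6):
    """Gauss-Seidel VI (minimal sketch): sweep the states in a fixed order and
    reuse freshly updated values within the same sweep."""
    num_states = R.shape[0]
    V = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            # backup for state s uses the current V, including values already
            # updated earlier in this sweep (the Gauss-Seidel part)
            v_new = np.max(R[s] + gamma * T[:, s, :] @ V)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # in-place update
        if delta <= eps:
            return V
```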
Prioritised Sweeping
Similar to Gauss-Seidel VI, but the order in which states are updated is determined by their update magnitudes (Bellman errors).
Define the Bellman error as $E(s; V_t) = |V_{t+1}(s) - V_t(s)|$, i.e. the change of $s$'s value after its most recent update.

Algorithm 3 Prioritised Sweeping VI Algorithm
1: Initialize V_0(s) and priority values H_0(s), ∀s ∈ S.
2: for k = 0, 1, 2, 3, ... do
3:   pick the state with the highest priority: s_k ← argmax_{s∈S} H_k(s)
4:   value update: V_{k+1}(s_k) = max_{a∈A} [ R(s_k, a) + γ Σ_{s'} P(s'|s_k, a) V_k(s') ]
5:   for s ≠ s_k: V_{k+1}(s) = V_k(s)
6:   update priority values: ∀s ∈ S, H_{k+1}(s) ← E(s; V_{k+1}) (Note: the error is w.r.t. the future update.)
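A minimal sketch of the prioritisation idea above, taking the priority of a state to be its current Bellman residual; a practical implementation would update priorities incrementally (e.g. via predecessor states and a priority queue) instead of recomputing them for all states after every backup, as done here for simplicity.

```python
import numpy as np

def prioritized_sweeping_vi(R, T, gamma=0.9, eps=1e-6, max_updates=100_000):
    """Prioritised-sweeping VI (minimal, deliberately naive sketch)."""
    num_states = R.shape[0]
    V = np.zeros(num_states)

    def bellman_backup(V):
        # greedy one-step backup for every state at once, shape (S,)
        return np.max(R + gamma * np.einsum('asp,p->sa', T, V), axis=1)

    for _ in range(max_updates):
        H = np.abs(bellman_backup(V) - V)     # priorities = current Bellman errors
        if H.max() <= eps:
            break
        s = int(np.argmax(H))                 # back up only the highest-priority state
        V[s] = np.max(R[s] + gamma * T[:, s, :] @ V)
    return V
```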
Real-Time Dynamic Programming
Similar to Gauss-Seidel VI, but the sequence of states to be updated is generated by simulating the transitions.

Algorithm 4 Real-Time Value Iteration Algorithm
1: start at an arbitrary s_0, and initialize V_0(s), ∀s ∈ S.
2: for k = 0, 1, 2, 3, ... do
3:   action selection: a_k ← argmax_{a∈A} [ R(s_k, a) + γ Σ_{s'} P(s'|s_k, a) V_k(s') ]
4:   value update: V_{k+1}(s_k) = R(s_k, a_k) + γ Σ_{s'} P(s'|s_k, a_k) V_k(s')
5:   for s ≠ s_k: V_{k+1}(s) = V_k(s)
6:   simulate the next state: s_{k+1} ∼ P(s'|s_k, a_k)
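A minimal sketch of real-time DP following Algorithm 4: only the currently visited state is backed up, and the next state is sampled from the model.

```python
import numpy as np

def real_time_dp(R, T, gamma=0.9, num_steps=10_000, rng=None):
    """Real-time value iteration (minimal sketch, same array conventions)."""
    rng = np.random.default_rng() if rng is None else rng
    num_states = R.shape[0]
    V = np.zeros(num_states)
    s = 0                                       # arbitrary start state s_0
    for _ in range(num_steps):
        q = R[s] + gamma * T[:, s, :] @ V       # q[a] for the current state only
        a = int(np.argmax(q))                   # greedy action selection
        V[s] = q[a]                             # back up only the visited state
        s = rng.choice(num_states, p=T[a, s])   # simulate the next state
    return V
```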
So far, we have introduced the basic notions of an MDP and value functions, and methods to compute optimal policies assuming that we know the world (i.e. we know $P(s' \mid a, s)$ and $R(a, s)$):

Value Iteration / Q-Iteration → $V^*$, $Q^*$, $\pi^*$
Policy Evaluation → $V^\pi$, $Q^\pi$
Policy Improvement: $\pi(s) \leftarrow \operatorname{argmax}_a Q^{\pi_k}(a, s)$
Policy Iteration (iterate Policy Evaluation and Policy Improvement)

Reinforcement Learning?

