
Reinforcement Learning

CS786
28th January 2022
MDPRL
• In MDP, {S,A,R,P} are known
• In RL, R and P are not known to begin with
• They are learned from experience
• Optimal policy is updated sequentially to account for
increased information about rewards and transition
probabilities
• Model-based RL
– Learns transition probabilities P as well as optimal policy
• Model-free RL
– Learns only optimal policy, not the transition probabilities P
Q-learning
• Derived from the Bush-Mosteller update rule
• Agent sees a set of states S
• Possesses a set A of actions applicable to these states
• Does not try to learn p(s, a, s')
• Tries to learn a quality belief about each state-action combination, Q: S × A → Real
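A minimal sketch of how such a Q table can be held in Python with NumPy; the state and action counts are illustrative (a 4x4 grid with 4 moves), not taken from the slides:

import numpy as np

n_states, n_actions = 16, 4              # illustrative sizes: 4x4 grid, 4 moves
Q = np.zeros((n_states, n_actions))      # Q: S x A -> Real, one belief per pair
Q[0, 2] = 0.5                            # e.g. belief about action 2 in state 0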
Q-learning update rule
• Start with random Q
• Update using

Q(s, a) ← Q(s, a) + α [ r + λ · max_a' Q(s', a') - Q(s, a) ]

• Parameter α controls the learning rate
• Parameter λ controls the time-discounting of future reward
• s’ is the state accessed from s
• a’ are actions available in s’
Q-learning algorithm
• Initialize Q(s,a) for all s and a
• For each episode
– Initialize s
– For each move
• Choose a from s using Q (softmax / ε-greedy)
• Perform action a, observe R and s’
• Update Q(s,a)
• Move to s’
– Until s’ is terminal/moves run out
Q-learning update

[Diagram: the Q(s, .) values for state s, with Q(s, a) highlighted as the value of taking action a in state s]

1. Select a using choice rule on Q
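The choice rule itself is left abstract here (the algorithm slide mentions softmax and ε-greedy); a minimal softmax sketch over the Q(s, .) row, with an illustrative temperature parameter:

import numpy as np

def softmax_choice(q_row, temperature=1.0):
    # action probabilities proportional to exp(Q / temperature)
    q_row = np.asarray(q_row, dtype=float)
    prefs = np.exp((q_row - q_row.max()) / temperature)   # shift for numerical stability
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(q_row), p=probs))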


Q-learning update

[Diagram: taking action a moves the agent from state s to state s']

1. Select a using choice rule on Q
2. Take action a from state s
3. Observe r and s'
Q-learning update

[Diagram: from s' there are many possible actions a1', a2', a3', each with its own entry in Q(s', .); there are many possible a' from the state you reach]

1. Select a using choice rule on Q
2. Take action a from state s
3. Observe r and s'
4. Recall Q(s', a') for all a' available from s'
Q-learning update

[Diagram: Q(s, a) is updated from the best of the Q(s', .) values; assume the maximally rewarding action will be selected at s']

1. Select a using choice rule on Q
2. Take action a from state s
3. Observe r and s'
4. Recall Q(s', a') for all a' available from s'
5. Update Q(s, a)
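A worked single step with made-up numbers, only to illustrate steps 4 and 5; none of these values come from the lecture. Suppose r = 0, α = 0.1, λ = 0.9, the current belief is Q(s, a) = 1.0, and the recalled values at s' are 2, 5 and 1:

alpha, lam, r = 0.1, 0.9, 0.0
q_sa = 1.0                       # current belief Q(s, a), hypothetical
q_next = [2.0, 5.0, 1.0]         # recalled Q(s', a') for a1', a2', a3', hypothetical

target = r + lam * max(q_next)   # assume the maximally rewarding a' (here a2') is taken: 4.5
q_sa += alpha * (target - q_sa)  # 1.0 + 0.1 * (4.5 - 1.0) = 1.35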
Q-learning example
• OpenAI Gym's FrozenLake environment
• Setup: the agent is a character that has to walk from a start point (S) across a frozen lake (F) with holes (H) in some locations to reach the goal (G)
• Specific instantiation:

S F F F
F H F H
F F F H
H F F G
Q-learning example
• Agent starts with an empty Q-matrix
• Action possibilities = {left, right, up, down}
• Reward settings
– H = -100
– G = +100
– F = 0

[Grid: the initial value estimates, all zeros]
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
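A minimal hand-rolled version of this grid for experimenting with the update; this is only a sketch that follows the slide's reward settings, not OpenAI Gym's actual FrozenLake interface (whose built-in rewards differ):

GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
REWARD = {"S": 0, "F": 0, "H": -100, "G": 100}
MOVES = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}   # left, right, up, down

def step(state, action):
    # state is a (row, col) pair; returns (next_state, reward, done)
    dr, dc = MOVES[action]
    row = min(max(state[0] + dr, 0), 3)
    col = min(max(state[1] + dc, 0), 3)
    cell = GRID[row][col]
    return (row, col), REWARD[cell], cell in "HG"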
Q-learning example
• Learning occurs via exploration episodes
• One episode is a sequence of moves
• Let's work through one episode

[Series of grid diagrams: value annotations accumulate move by move along the explored path; moves that land on frozen squares (F) are updated to 0, moves toward holes (H) pick up -80, and the final move into the goal (G) picks up +80]

S F F F
F H F H
F F F H
H F F G
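The -80 and +80 annotations are consistent with a single update of a zero-initialised Q value; a quick check, assuming a learning rate of α = 0.8 (the slides do not state α, so this value is an inference):

alpha, lam = 0.8, 0.9            # alpha assumed; lam does not matter here
q_old, best_next = 0.0, 0.0      # zero-initialised table, so max Q(s', .) = 0

for r in (0, -100, 100):         # stepping onto F, toward H, into G
    print(q_old + alpha * (r + lam * best_next - q_old))   # 0.0, -80.0, 80.0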
Generalized model-free RL
• Bush-Mosteller style models simply update value based on a discounted average of received rewards
– Useless for predicting the value of sequential events, e.g. A → B → reward
• A more generalized notion of reward learning was
needed
– Q-learning is one instance of temporal difference
learning
– Other flavors of model-free reinforcement learning also
exist, e.g. policy gradient methods
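A sketch of the contrast, using a simplified Bush-Mosteller-style rule rather than the lecture's exact formulation: the running-average update has no term for what happens next, while the temporal-difference target bootstraps from the next state's value, so reward can propagate back along A → B → reward:

def bush_mosteller(v, r, alpha=0.1):
    # exponentially weighted average of rewards received here; blind to what follows
    return v + alpha * (r - v)

def td_target(r, v_next, lam=0.9):
    # immediate reward plus discounted value of the next state
    return r + lam * v_next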
SARSA update rule
• Start with random Q
• Update using

Q(s, a) ← Q(s, a) + α [ r + λ · Q(s', a') - Q(s, a) ]

• Parameter α controls the learning rate
• Parameter λ controls the time-discounting of future reward
• s' is the state accessed from s
• a' is the action selected in s'
– Different from Q-learning, which assumes the maximally rewarding a'
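A minimal sketch of this update in Python, mirroring the Q-learning function earlier but using the a' that was actually selected instead of the maximum; α and λ are placeholders:

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, lam=0.9):
    # on-policy target: uses the action a' actually chosen in s'
    target = r + lam * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q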
SARSA algorithm
• Start with random Q(s, a) for all s and a
• For each episode
– Initialize s
– Choose a using Q (softmax/greedy)
– For each move
• Take action a, observe r, s’
• Choose a’ from s’ by comparing Q(s’, . )
• Update Q(s, a)
• Move to s’, remember a’
– Until s’ is terminal/moves run out
SARSA update

[Diagram: the Q(s, .) values for state s, with Q(s, a) highlighted as the value of taking action a in state s]

1. Start with a selected in the previous iteration
SARSA update

[Diagram: taking action a moves the agent from state s to state s']

1. Start with a selected in the previous iteration
2. Take action a from state s
3. Observe r and s'
SARSA update

[Diagram: from s' there are many possible actions a1', a2', a3', each with its own entry in Q(s', .); there are many possible a' from the state you reach]

1. Start with a from the previous iteration
2. Take action a from state s
3. Observe r and s'
4. Recall Q(s', a') for all a' available from s'
SARSA update

[Diagram: among the Q(s', .) values, a' is selected using the choice rule]

1. Start with a from the previous iteration
2. Take action a from state s
3. Observe r and s'
4. Recall Q(s', a') for all a' available from s'
5. Select a' using choice rule on Q
6. Update Q(s, a)
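The only difference between the two updates is the target used in the final step; a side-by-side sketch with Q as a NumPy array:

import numpy as np

def q_learning_target(Q, r, s_next, lam=0.9):
    return r + lam * np.max(Q[s_next])        # off-policy: assume the best a' at s'

def sarsa_target(Q, r, s_next, a_next, lam=0.9):
    return r + lam * Q[s_next, a_next]        # on-policy: the a' actually selected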
