Reinforcement Learning
CS786
28th January 2022
MDP → RL
• In an MDP, {S, A, R, P} are known
• In RL, R and P are not known to begin with
• They are learned from experience
• Optimal policy is updated sequentially to account for
increased information about rewards and transition
probabilities
• Model-based RL
– Learns transition probabilities P as well as optimal policy
• Model-free RL
– Learns only optimal policy, not the transition probabilities P
Q-learning
• Derived from the Bush-Mosteller update rule
• Agent sees a set of states S
• Possesses a set of actions A applicable to
these states
• Does not try to learn p(s, a, s’)
• Tries to learn a quality belief about a state-
action combination, Q: S × A → ℝ
Q-learning update rule
• Start with random Q
• Update using
Q(s,a) ← Q(s,a) + α [ r + λ max_a’ Q(s’,a’) − Q(s,a) ]
• Parameter α controls the learning rate
• Parameter λ controls the time-discounting of
future reward
Q-learning
• Agent sees a set of states S
• Possesses a set of actions A applicable to
these states
• Does not try to learn p(s, a, s’)
• Tries to learn a quality belief about a state-
action combination, Q: S × A → ℝ
Q-learning update rule
• Start with random Q
• Update using
Q(s,a) ← Q(s,a) + α [ r + λ max_a’ Q(s’,a’) − Q(s,a) ]
• Parameter α controls the learning rate
• Parameter λ controls the time-discounting of
future reward
• s’ is the state accessed from s
• a’ ranges over the actions available in s’
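A minimal sketch of this update as code; the dict-based Q-table and the function name are illustrative assumptions, not part of the slides:

def q_learning_update(Q, s, a, r, s_next, actions_next, alpha, lam):
    # Q maps (state, action) pairs to values.
    # The target uses the best action available in the next state s'
    # (use 0 when s' is terminal and has no available actions).
    best_next = max(Q[(s_next, a_next)] for a_next in actions_next) if actions_next else 0.0
    Q[(s, a)] += alpha * (r + lam * best_next - Q[(s, a)])
    return Q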
Q-learning algorithm
• Initialize Q(s,a) for all s and a
• For each episode
– Initialize s
– For each move
• Choose a from s using Q (softmax/ε-greedy)
• Perform action a, observe R and s’
• Update Q(s,a)
• Move to s’
– Until s’ is terminal/moves run out
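A sketch of this full loop, assuming a classic Gym-style environment where reset() returns the initial state and step() returns (s’, r, done, info), and an ε-greedy choice rule; all names here are assumptions for illustration:

import random
from collections import defaultdict

def train_q_learning(env, episodes, alpha=0.1, lam=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q(s, a), initialized to 0
    n_actions = env.action_space.n
    for _ in range(episodes):
        s = env.reset()                          # initialize s
        done = False
        while not done:
            # choose a from s using an epsilon-greedy rule on Q
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[(s, act)])
            s_next, r, done, _ = env.step(a)     # perform a, observe R and s'
            best_next = max(Q[(s_next, act)] for act in range(n_actions))
            Q[(s, a)] += alpha * (r + lam * best_next - Q[(s, a)])  # update Q(s,a)
            s = s_next                           # move to s'
    return Q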
Q-learning update
[Figure: the Q(s,·) table; Q(s,a) is the value of taking action a in state s]
1. Select a using choice rule on Q
Q-learning update
[Figure: the agent takes action a, moving from state s to state s’]
1. Select a using choice rule on Q
2. Take action a from state s
3. Observe r and s’
Q-learning update
[Figure: the Q(s’,·) table; from s’ several actions a1’, a2’, a3’ are available. There are many possible a’ from the state you reach]
1. Select a using choice rule on Q
2. Take action a from state s
3. Observe r and s’
4. Recall Q(s’,a’) for all a’ available from s’
Q-learning update
[Figure: Q(s,a) is updated assuming the maximally rewarding action will be selected at s’]
1. Select a using choice rule on Q
2. Take action a from state s
3. Observe r and s’
4. Recall Q(s’,a’) for all a’ available from s’
5. Update Q(s,a)
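As a concrete instance of step 5, with purely hypothetical numbers: suppose α = 0.5, λ = 0.9, Q(s,a) = 2, r = 1, and the largest Q(s’,a’) over the available a’ is 4. Then

Q(s,a) ← 2 + 0.5 × (1 + 0.9 × 4 − 2) = 2 + 0.5 × 2.6 = 3.3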
Q-learning example
• OpenAI Gym’s FrozenLake environment
• Setup: agent is a character that has to walk
from a start point (S) across a frozen lake (F)
with holes (H) in some locations to reach G
• Specific instantiation
S F F F
F H F H
F F F H
H F F G
Q-learning example
• Agent starts with an empty Q-matrix
• Action possibilities = {left, right, up, down}
• Reward settings
– H = -100
– G = +100
– F = 0
[Figure: 4×4 grid of zeros, the initial (empty) Q-matrix]
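A sketch of this setup in code; the slide’s reward values differ from Gym’s default FrozenLake rewards, so the grid and rewards below are written out by hand to match the slide:

# 4x4 lake layout from the slide
lake = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

# reward for entering each kind of tile, per the slide
tile_reward = {"H": -100, "G": +100, "F": 0, "S": 0}

actions = ["left", "right", "up", "down"]

# empty Q-matrix: one value per (cell, action) pair, all zeros to start
Q = {((row, col), a): 0.0
     for row in range(4) for col in range(4) for a in actions}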
Q-learning example
• Learning occurs via exploration episodes
• One episode is a sequence of moves
• Let’s work through one episode
[Grid: same 4×4 lake; the first move from S is marked with a Q-value of 0]
Q-learning example
• Learning occurs via exploration episodes
• One episode is a sequence of moves
• Let’s work through one episode
[Grid: a second move along the top row is also marked 0]
Q-learning example
• Learning occurs via exploration episodes
• One episode is a sequence of moves
• Let’s work through one episode
[Grid: a third move, heading downward, is also marked 0]
Q-learning example
• Learning occurs via exploration episodes
• One episode is a sequence of moves
• Let’s work through one episode
[Grid: a move adjacent to one of the holes is now marked -80]
Q-learning example
• Learning occurs via exploration episodes
• One episode is a sequence of moves
• Let’s work through one episode
[Grid: two moves adjacent to holes are now marked -80]
Q-learning example
• Learning occurs via exploration episodes
• One episode is a sequence of moves
• Let’s work through one episode
[Grid: the move into the goal G is marked +80]
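The -80 and +80 values are consistent with a zero-initialized Q-matrix and a learning rate of α = 0.8; the slides do not state α explicitly, so treat that value as an inferred assumption:

Falling into a hole: Q ← 0 + 0.8 × (-100 + λ·0 - 0) = -80
Reaching the goal: Q ← 0 + 0.8 × (+100 + λ·0 - 0) = +80
(the λ term vanishes because every Q(s’,a’) is still zero at this point)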
Generalized model-free RL
• Bush-Mosteller-style models simply update value
based on a discounted average of received rewards
– Useless for predicting the value of sequential
events, e.g. A → B → reward
• A more generalized notion of reward learning was
needed
– Q-learning is one instance of temporal difference
learning
– Other flavors of model-free reinforcement learning also
exist, e.g. policy gradient methods
SARSA update rule
• Start with random Q
• Update using
Q(s,a) ← Q(s,a) + α [ r + λ Q(s’,a’) − Q(s,a) ]
• Parameter α controls the learning rate
• Parameter λ controls the time-discounting of
future reward
• s’ is the state accessed from s
• a’ is the action selected in s’
– Different from Q-learning, which backs up the best available a’ rather than the selected one
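A minimal sketch of this update next to the Q-learning one from earlier, under the same dict-style Q-table assumption (all names are illustrative):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, lam):
    # SARSA target uses the action a' actually selected in s'
    Q[(s, a)] += alpha * (r + lam * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions_next, alpha, lam):
    # Q-learning target uses the best action available in s'
    best_next = max(Q[(s_next, an)] for an in actions_next)
    Q[(s, a)] += alpha * (r + lam * best_next - Q[(s, a)])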
SARSA algorithm
• Start with random Q(s, a) for all s and a
• For each episode
– Initialize s
– Choose a using Q (softmax/greedy)
– For each move
• Take action a, observe r, s’
• Choose a’ from s’ by comparing Q(s’, . )
• Update Q(s, a)
• Move to s’, remember a’
– Until s’ is terminal/moves run out
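A sketch of this loop under the same assumptions as the Q-learning sketch above (classic Gym-style reset()/step(), ε-greedy choice rule; all names illustrative):

import random
from collections import defaultdict

def choose(Q, s, n_actions, epsilon):
    # epsilon-greedy choice rule on Q(s, .)
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])

def train_sarsa(env, episodes, alpha=0.1, lam=0.9, epsilon=0.1):
    Q = defaultdict(float)
    n_actions = env.action_space.n
    for _ in range(episodes):
        s = env.reset()                              # initialize s
        a = choose(Q, s, n_actions, epsilon)         # choose a using Q
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)         # take a, observe r, s'
            a_next = choose(Q, s_next, n_actions, epsilon)   # choose a' from s'
            Q[(s, a)] += alpha * (r + lam * Q[(s_next, a_next)] - Q[(s, a)])  # update Q(s,a)
            s, a = s_next, a_next                    # move to s', remember a'
    return Q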
SARSA update
[Figure: the Q(s,·) table; Q(s,a) is the value of taking action a in state s]
1. Start with a selected in the previous iteration
SARSA update
[Figure: the agent takes action a, moving from state s to state s’]
1. Start with a selected in the previous iteration
2. Take action a from state s
3. Observe r and s’
SARSA update
[Figure: the Q(s’,·) table; from s’ several actions a1’, a2’, a3’ are available. There are many possible a’ from the state you reach]
1. Start with a from the previous iteration
2. Take action a from state s
3. Observe r and s’
4. Recall Q(s’,a’) for all a’ available from s’
SARSA update
[Figure: a’ is selected from Q(s’,·) using the choice rule]
1. Start with a from the previous iteration
2. Take action a from state s
3. Observe r and s’
4. Recall Q(s’,a’) for all a’ available from s’
5. Select a’ using choice rule on Q
6. Update Q(s,a)
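To make step 6 concrete: the only difference from the Q-learning update is the target term. Q-learning backs up the maximum of Q(s’,·), while SARSA backs up the Q of the a’ that the choice rule actually picked. A toy illustration with made-up numbers:

# Hypothetical values of Q(s', .) for three actions a1', a2', a3'
Q_s_next = {"a1'": 4.0, "a2'": -1.0, "a3'": 2.0}

r, alpha, lam, Q_sa = 1.0, 0.5, 0.9, 2.0

q_learning_target = r + lam * max(Q_s_next.values())  # uses the best a' (4.0)
sarsa_target = r + lam * Q_s_next["a3'"]               # uses the a' the choice rule picked (2.0 here)

print(Q_sa + alpha * (q_learning_target - Q_sa))       # 3.3
print(Q_sa + alpha * (sarsa_target - Q_sa))            # 2.4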