Artificial Intelligence
Lecture 14 – Reinforcement Learning
School of Information and Communication
Technology - HUST
1
Reinforcement Learning (RL)
• RL is a machine learning method that optimizes reward
• A class of tasks
• A process of trial-and-error learning
• Good actions are “rewarded”
• Bad actions are “punished”
2
Features of RL
• Learning from numerical rewards
• Interaction with the task; sequences of states,
actions and rewards
• Uncertainty and non-deterministic worlds
• Delayed consequences
• The explore/exploit dilemma
• The whole problem of goal-directed learning
3
Points of view
• From the point of view of agents
• RL is a process of trial-and-error learning
• How much reward will I get if I do this action?
• From the point of view of trainers
• RL is training by rewards and punishments
• Train computers like we train animals
4
Applications of RL
• Robot
• Animal training
• Scheduling
• Games
• Control systems
•…
5
Supervised Learning vs.
Reinforcement Learning
• Supervised learning
• Teacher: Is this an AI course or a Math course?
• Learner: Math
• Teacher: No, AI
• …
• Teacher: Is this an AI course or a Math course?
• Learner: AI
• Teacher: Yes
• Reinforcement learning
• World: You are in state 9. Choose action A or B
• Learner: A
• World: Your reward is 100
• …
• World: You are in state 15. Choose action C or D
• Learner: D
• World: Your reward is 50
6
Examples
• Chess
• Win: +1, lose: -1
• Elevator dispatching
• Reward based on the mean squared time for the elevator to arrive
(an optimization problem)
• Channel allocation for cellular phones
• Lower reward the more calls are blocked
7
Policy, Reward and Goal
• Policy
• defines the agent’s behaviour at a given time
• maps from perceptions to actions
• can be defined by: look-up table, neural net, search algorithm...
• may be stochastic
• Reward Function
• defines the goal(s) in an RL problem
• maps from states, state-action pairs, or state-action-successor-state
triplets to a numerical reward
• goal of the agent is to maximise the total reward in the long run
• the policy is altered to achieve this goal
8
Reward and Return
• The reward function indicates how good things are right now
• But the agent wants to maximize reward in the long term, i.e. over many
time steps
• We refer to long-term (multi-step) reward as return
R_t = r_{t+1} + r_{t+2} + ... + r_T
where
• T is the last time step of the world
9
Discounted Return
• The geometrically discounted model of return
R_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + ... + γ^(T−t−1)·r_T,   with 0 ≤ γ ≤ 1
• γ is called the discount rate, and is used to
• Bound the infinite sum
• Favor earlier rewards, in other words to give preference to
shorter paths
10
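A minimal sketch of this computation (Python; the function name, the sample reward list, and γ = 0.9 are illustrative assumptions, not taken from the slides):

    def discounted_return(rewards, gamma):
        # rewards[0] is r_{t+1}, rewards[1] is r_{t+2}, and so on
        total = 0.0
        for k, r in enumerate(rewards):
            total += (gamma ** k) * r
        return total

    # three steps with no reward, then a terminal reward of 100
    print(discounted_return([0, 0, 0, 100], gamma=0.9))   # 72.9

Because γ < 1, a reward received k steps later is worth only γ^k as much, which is exactly why shorter paths to the goal are preferred.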
Optimal Policies
• An RL agent adapts its policy in order to increase
return
• A policy π1 is at least as good as a policy π2 if its
expected return is at least as great in each possible
initial state
• An optimal policy π* is at least as good as any other
policy
11
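Written out in symbols (a standard formulation added for reference, not copied from the slide): π1 ≥ π2 if and only if V^π1(s) ≥ V^π2(s) for every state s, where V^π(s) is the expected return from state s when following policy π; an optimal policy π* then satisfies V^π*(s) ≥ V^π(s) for every policy π and every state s.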
Policy Adaptation Methods
• Value function-based methods
• Learn a value function for the policy
• Generate a new policy from the value function
• Q-learning, Dynamic Programming
12
Value Functions
• A value function maps each state to an estimate of
return under a policy
• An action-value function maps from state-action
pairs to estimates of return
• Learning a value function is referred to as the
“prediction” problem, or “policy evaluation” in the
Dynamic Programming literature
13
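As formulas (standard definitions, added here for reference rather than taken from the slide): the state-value function is V^π(s) = E[ R_t | s_t = s, actions chosen by π ], and the action-value function is Q^π(s, a) = E[ R_t | s_t = s, a_t = a, π followed thereafter ], where R_t is the (discounted) return defined earlier.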
Q-learning
• Learns action-values Q(s,a) rather than state-values
V(s)
• Action-value learning rule (deterministic world; T(s,a) is the state
reached by taking action a in state s)
Q(s, a) = R(s, a) + γ · max_a' Q(T(s, a), a')
• Q-learning improves its action-value estimates iteratively until
they converge
14
Q-learning Algorithm
1. Algorithm Q {
2.   For each pair (s,a), initialize Q’(s,a) to zero
3.   Observe the current state s
4.   Iterate forever {
5.     Choose and execute an action a
6.     Get the immediate reward r
7.     Observe the new state s’
8.     Update Q’(s,a):  Q’(s,a) ← r + γ · max_a' Q’(s’,a’)
9.     s ← s’
10.  }
11. }
15
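A runnable sketch of the algorithm above in Python. The slide does not specify how actions are chosen, so the ε-greedy choice below, the episode structure, and the env interface (reset() returning a start state, step(state, action) returning (next_state, reward, done)) are illustrative assumptions; the update of step 8 is the part taken from the pseudocode.

    import random
    from collections import defaultdict

    def q_learning(env, actions, gamma=0.9, epsilon=0.1, episodes=500):
        Q = defaultdict(float)                    # step 2: Q'(s,a) = 0 for every pair
        for _ in range(episodes):
            s = env.reset()                       # step 3: observe the current state s
            done = False
            while not done:                       # step 4: iterate
                # step 5: choose and execute an action (epsilon-greedy is an assumption)
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])
                s_next, r, done = env.step(s, a)  # steps 6-7: reward r, new state s'
                # step 8: Q'(s,a) <- r + gamma * max_a' Q'(s',a')
                Q[(s, a)] = r + gamma * max(Q[(s_next, x)] for x in actions)
                s = s_next                        # step 9: s <- s'
        return Q

    # Hypothetical 1-D world: states 0, 1, 2; stepping into state 2 (the goal)
    # pays 100 and ends the episode. Layout and names are illustrative only.
    class ChainEnv:
        def reset(self):
            return 0
        def step(self, state, action):
            nxt = min(2, max(0, state + action))
            return nxt, (100 if nxt == 2 else 0), nxt == 2

    Q = q_learning(ChainEnv(), actions=[-1, +1])
    print(round(Q[(1, +1)]), round(Q[(0, +1)]))   # 100 and 90, as in the grid example below

As in the pseudocode, there is no learning rate: the new estimate simply overwrites the old one, which is appropriate for the deterministic worlds used in the examples that follow; a stochastic environment would instead blend the old and new values.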
Example
• Initially: the immediate rewards of the grid world (100 for an action that
enters the goal state G, 0 for every other action)
• Initialization: all Q-values set to 0
[Figure: the grid world drawn twice, once labelled with the immediate rewards
and once with the all-zero initial Q-values]
16
Example
• Assume γ = 0.9
• Current state: s1
• Go right, reaching s2
• Reward: 0
[Figure: the grid world before and after the move; all Q-values are still 0]
17
Example
• From s2, go right into the goal state G
• Reward: 100
• Update Q(s2, right) ← 100 + 0.9 · max_a' Q(G, a') = 100
[Figure: the grid world with Q(s2, right) set to 100; all other Q-values still 0]
18
Example
• Moving right from s1 again, update Q(s1, right) ← 0 + 0.9 · max_a' Q(s2, a')
= 0.9 · 100 = 90
• The agent is now in s2
[Figure: the grid world with Q(s1, right) = 90 and Q(s2, right) = 100]
19
Example: result of Q-learning
[Figure: the converged Q-values of the grid world with γ = 0.9: actions entering G
are worth 100, and values fall by a factor of 0.9 for each extra step from G
(90, 81, 72, ...)]
20
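These converged values follow directly from the discount factor: an action whose path reaches the goal k steps later is worth γ^k of the goal reward, so with γ = 0.9 the entries are 100, 0.9 × 100 = 90, 0.9² × 100 = 81, and 0.9³ × 100 = 72.9 (shown as 72).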
Exercise
• The agent is in room C of the building
• The goal is to get out of the building
21
Modeling the problem
Immediate reward matrix R (rows: current room, columns: room moved into;
F represents the outside; only non-zero rewards are shown, every other
allowed move has reward 0)

     A    B    C    D    E    F
A
B                             100
C
D
E                             100
F                             100
22
Result
γ = 0.8

     A    B    C    D    E    F
A                        400
B                   320        500
C                   320
D         400  255        400
E   320             320        500
F         400             400  500

Divide all values by 5 to normalize (the largest value becomes 100)
Resulting paths from C: C => D => B => F
or C => D => E => F
23
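The table above can be reproduced with a few lines of Python that simply sweep the update Q(s,a) ← R(s,a) + γ · max_a' Q(s',a') over all room pairs until the values settle (Q-value iteration rather than the sampled algorithm of slide 15). The door list below is an assumption read off the non-zero entries of the result table, since the building plan itself is given only as a figure:

    # Rooms A-E plus F = outside; any move that ends up outside (in F) pays 100.
    doors = [("A", "E"), ("B", "D"), ("B", "F"), ("C", "D"), ("D", "E"), ("E", "F"), ("F", "F")]

    moves = {}                                  # moves[s] = rooms reachable from s in one step
    for x, y in doors:
        moves.setdefault(x, []).append(y)
        if x != y:
            moves.setdefault(y, []).append(x)

    gamma = 0.8
    Q = {(s, a): 0.0 for s in moves for a in moves[s]}

    for _ in range(100):                        # sweep until the values stop changing
        for (s, a) in Q:
            reward = 100 if a == "F" else 0     # reward for the room moved into
            Q[(s, a)] = reward + gamma * max(Q[(a, a2)] for a2 in moves[a])

    for pair in [("B", "F"), ("C", "D"), ("D", "C"), ("A", "E")]:
        print(pair, round(Q[pair]))             # 500, 320, 256, 400

The sweep converges to the values in the table (the table lists 255 for D => C; the exact fixed point of the update is 0.8 × 320 = 256). Dividing everything by 5 rescales the best value to 100, and reading off the best action in each room gives the paths C => D => B => F and C => D => E => F.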