
Artificial Intelligence

Lecture 14 – Reinforcement Learning

School of Information and Communication Technology - HUST

1
Reinforcement Learning (RL)
• RL is a machine learning method that optimizes reward
• A class of tasks
• A process of trial-and-error learning
• Good actions are “rewarded”
• Bad actions are “punished”

2
Features of RL
• Learning from numerical rewards
• Interaction with the task; sequences of states,
actions and rewards
• Uncertainty and non-deterministic worlds
• Delayed consequences
• The explore/exploit dilemma
• The whole problem of goal-directed learning

3
Points of view
• From the point of view of agents
• RL is a process of trial-and-error learning
• How much reward will I get if I do this action?
• From the point of view of trainers
• RL is training by rewards and punishments
• Train computers like we train animals

4
Applications of RL
• Robotics
• Animal training
• Scheduling
• Games
• Control systems
•…

5
Supervised Learning vs. Reinforcement Learning
• Supervised learning
  • Teacher: Is this an AI course or a Math course?
  • Learner: Math
  • Teacher: No, AI
  • …
  • Teacher: Is this an AI course or a Math course?
  • Learner: AI
  • Teacher: Yes
• Reinforcement learning
  • World: You are in state 9. Choose action A or B
  • Learner: A
  • World: Your reward is 100
  • …
  • World: You are in state 15. Choose action C or D
  • Learner: D
  • World: Your reward is 50

6
Examples
• Chess
• Win +1, lose -1
• Elevator dispatching
• reward based on mean squared time for elevator to arrive
(optimization problem)
• Channel allocation for cellular phones
• Lower rewards the more calls are blocked

7
Policy, Reward and Goal
• Policy
• defines the agent’s behaviour at a given time
• maps from perceptions to actions
• can be defined by: look-up table (see the sketch after this slide), neural net, search algorithm...
• may be stochastic
• Reward Function
• defines the goal(s) in an RL problem
• maps from states, state-action pairs, or state-action-successor-state
triplets to a numerical reward
• goal of the agent is to maximise the total reward in the long run
• the policy is altered to achieve this goal
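
The look-up-table policy mentioned above can be made concrete in a few lines of Python. This is a minimal illustrative sketch; the state and action names are invented, not taken from the lecture. A stochastic policy would instead map each state to a probability distribution over actions.

# A deterministic look-up-table policy: each state maps to one action.
# The states and actions below are illustrative placeholders.
policy = {
    "s1": "right",
    "s2": "right",
    "s3": "up",
}

def act(state):
    # The agent consults the table to pick its action in the current state.
    return policy[state]

print(act("s1"))   # -> right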

8
Reward and Return
• The reward function indicates how good things are right now
• But the agent wants to maximize reward in the long term, i.e. over
many time steps
• We refer to long-term (multi-step) reward as return

R_t = r_{t+1} + r_{t+2} + ... + r_T
where
• T is the last time step of the world

9
Discounted Return
• The geometrically discounted model of return

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{T-t-1} r_T,   0 ≤ γ ≤ 1

• γ is called the discount rate, used to
• Bound the infinite sum
• Favor earlier rewards, in other words to give preference to shorter paths
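
As a small illustration (not from the lecture), here is a Python sketch that computes a discounted return from a list of rewards; the reward values are made up.

# Discounted return: R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
def discounted_return(rewards, gamma):
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Illustrative rewards received after time t, with gamma = 0.9 as in the later example.
print(discounted_return([0, 0, 100], 0.9))   # 0 + 0.9*0 + 0.81*100 ≈ 81.0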

10
Optimal Policies
• An RL agent adapts its policy in order to increase
return
• A policy π1 is at least as good as a policy π2 if its
expected return is at least as great in each possible
initial state
• An optimal policy π* is at least as good as any other
policy

11
Policy Adaptation Methods
• Value function-based methods
• Learn a value function for the policy
• Generate a new policy from the value function
• Q-learning, Dynamic Programming

12
Value Functions
• A value function maps each state to an estimate of
return under a policy
• An action-value function maps from state-action
pairs to estimates of return
• Learning a value function is referred to as the
“prediction” problem or ‘policy evaluation’ in the
Dynamic Programming literature
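
For small problems both functions can be stored as plain tables. The sketch below is illustrative Python with invented states and numbers, not the lecture's code; the last function uses the fact that, for a greedy policy, a state's value is its best action-value.

# State-value function V: state -> estimated return under the policy.
V = {"s1": 81.0, "s2": 90.0, "s3": 100.0}            # invented numbers

# Action-value function Q: (state, action) -> estimated return.
Q = {("s1", "right"): 90.0, ("s1", "down"): 72.0}    # invented numbers

def v_from_q(Q, state):
    # For a greedy policy, the value of a state is its best action-value.
    return max(q for (s, a), q in Q.items() if s == state)

print(v_from_q(Q, "s1"))   # 90.0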

13
Q-learning
• Learns action-values Q(s,a) rather than state-values
V(s)
• Action-value learning rule

Q(s, a) = R(s, a) + γ max_{a'} Q(T(s, a), a')

where T(s, a) is the state reached by taking action a in state s, and
R(s, a) is the immediate reward
• Q-learning improves the action-values iteratively until they converge

14
Q-learning Algorithm
1. Algorithm Q {
2.   For each (s, a) initialize Q'(s, a) to zero
3.   Observe the current state s
4.   Iterate infinitely {
5.     Choose and execute an action a
6.     Get the immediate reward r
7.     Observe the new state s'
8.     Update Q'(s, a) as follows: Q'(s, a) ← r + γ max_{a'} Q'(s', a')
9.     s ← s'
10.  }
11. }
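
A runnable Python sketch of this tabular algorithm is given below. The environment interface (the reset, step, and actions functions), the episode-based loop in place of the slide's infinite iteration, and the epsilon-greedy action choice are assumptions added for illustration; the update itself is the one in step 8 of the pseudocode.

import random
from collections import defaultdict

def q_learning(reset, step, actions, gamma=0.9, epsilon=0.1, episodes=1000):
    # Tabular action-values; unseen (state, action) pairs default to 0.
    Q = defaultdict(float)
    for _ in range(episodes):
        s = reset()                     # observe the current (initial) state
        done = False
        while not done:
            # Epsilon-greedy choice: mostly exploit current Q, sometimes explore.
            if random.random() < epsilon:
                a = random.choice(actions(s))
            else:
                a = max(actions(s), key=lambda act: Q[(s, act)])
            s2, r, done = step(s, a)    # execute a, get reward r and new state s'
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions(s2))
            Q[(s, a)] = r + gamma * best_next   # Q'(s,a) <- r + gamma * max_a' Q'(s',a')
            s = s2                      # s <- s'
    return Q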

15
Example
• Initially: all Q-values are initialized to 0
[Grid-world diagrams: the immediate reward is 100 for actions entering the goal state G and 0 for all other actions]

16
Example
• Assume γ = 0.9
• The agent is in state s1; it goes right, reaching s2
• Immediate reward: 0
[Grid-world diagrams showing the agent A before and after the move]

17
Example
• The agent goes right from s2, entering the goal G; immediate reward: 100
• Q(s2, right) is updated to 100
[Grid-world diagrams before and after the update]

18
Example
• Update of s1: Q(s1, right) ← 0 + 0.9 × 100 = 90
• The agent is in s2
[Grid-world diagrams showing the value 90 on the action from s1 to s2]

19
Example: result of Q-learning

[Grid-world diagram with the converged Q-values: 100 and 90 on the actions nearest the goal G, 81 and 72 on actions farther away]
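
These converged values are consistent with the update rule at γ = 0.9; a quick check (not on the slide):

Q(action entering G)       = 100 + 0.9 × 0   = 100
Q(one action away from G)  = 0 + 0.9 × 100   = 90
Q(two actions away from G) = 0 + 0.9 × 90    = 81
Q(three actions away)      = 0 + 0.9 × 81    ≈ 72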

20
Exercise
• The agent is in room C of the building
• The goal is to get out of the building

21
Modeling the problem

Reward matrix R (rows = current room, columns = room moved to; F = outside, the goal):
• moves that reach F are rewarded: R(B, F) = 100, R(E, F) = 100, R(F, F) = 100
• all other possible moves have reward 0
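
To connect the exercise back to the algorithm, the sketch below encodes this reward structure and reuses the q_learning function from the sketch given after the algorithm slide. The door layout between rooms is not recoverable from the table above, so the connectivity used here (A-E, B-D, B-F, C-D, D-E, E-F) is an assumption, as is treating arrival at F as the end of an episode; with that treatment the learned values come out equal to the slide's final table divided by 5.

import random

doors = {                 # ASSUMED layout of doors between rooms; F is outside (the goal)
    "A": ["E"],
    "B": ["D", "F"],
    "C": ["D"],
    "D": ["B", "C", "E"],
    "E": ["A", "D", "F"],
}

def actions(s):
    return doors[s]

def reset():
    return random.choice(list(doors))   # start each episode in a random room

def step(s, a):
    reward = 100 if a == "F" else 0     # reward only for getting outside
    return a, reward, a == "F"          # reaching F ends the episode (an assumption)

Q = q_learning(reset, step, actions, gamma=0.8, epsilon=0.5, episodes=5000)
print(round(Q[("C", "D")]), round(Q[("D", "B")]), round(Q[("B", "F")]))
# ≈ 64 80 100, i.e. the slide's 320 / 400 / 500 after dividing by 5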

22
Result
 = 0,8
Converged Q-values (non-zero entries, listed by state):
• A: 400
• B: 320, 500
• C: 320
• D: 400, 255, 400
• E: 320, 320, 500
• F: 400, 400, 500

Divide all values by 5 to normalize the largest to 100

Result: two optimal routes lead out of the building from C
C => D => B => F
C => D => E => F
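
As a quick consistency check (not on the slide), the values along the path C => D => E => F follow from the update rule with γ = 0.8:

Q(E, F) = 100 + 0.8 × 500 = 500   (self-consistent at the goal)
Q(D, E) = 0 + 0.8 × 500 = 400
Q(C, D) = 0 + 0.8 × 400 = 320

Dividing everything by 5 rescales these to 100, 80 and 64.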

23
