CSE 513 Soft Computing, Dr. Djamel Bouchaffra
Chapter 10:
Learning from Reinforcement
Introduction (10.1)
Failure is the surest path to success (10.2)
Jackpot Journey
Credit Assignment Problem
Evaluation Functions
Temporal Difference Learning (10.3)
Jyh-Shing Roger Jang et al., Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence,
First Edition, Prentice Hall, 1997
Introduction (10.1)
Learning from reinforcement is a fundamental paradigm for machine learning, with an emphasis on the computational aspects of learning.
It is based on a trial-and-error learning scheme whereby a computational agent learns to perform an appropriate action by receiving a reinforcement signal (a performance measure) through interaction with the environment.
The learner reinforces itself by drawing lessons from its failures.
Introduction (10.1) (cont.)
Reinforcement learning was first observed & experimented with in animals, particularly chimpanzees coping with a physical environment.
It is also used by the most intelligent creatures on earth: humans.
If an action is followed by a satisfactory state of affairs, or by an improvement in the state of affairs, then the tendency to produce that action is reinforced (rewarded!); otherwise, that tendency is weakened or inhibited (penalized!).
Introduction (10.1) (cont.)
There are four basic representative architectures for reinforcement learning [Sutton 1990]:
Policy only (probability-based actions)
Reinforcement comparison
Adaptive heuristic critic
Q-learning
Failure is the surest path to success (10.2)
Jackpot journey
Goal: Find an optimal policy for selecting a series of actions
by means of a reward-penalty scheme
Principle:
Starting from a vertex of a graph, a traveler needs to cross the graph, vertex by vertex, in order to reach gold hidden in one of the terminal vertices.
At each vertex there is a signpost holding a box with some white & black stones in it. The traveler picks a stone from the signpost box & follows its instruction: when a white stone is picked, go diagonally upward, denoted by action u; when a black stone is picked, go diagonally downward, denoted by action d.
The Jackpot Journey Problem
Failure is the surest path to success (10.2) (cont.)
Jackpot journey (cont.)
During this travel (starting from vertex A) until one of the terminal vertices {G, H, I, J} is reached, the following scheme is applied:
When the gold is found, prepare a reward scheme; when it is not found, prepare a penalty scheme. Then trace back to the starting vertex A; at each visited vertex, either put the picked stone back into the signpost box together with an additional stone of the same color (reward), or take the picked stone away from the signpost box (penalty). After the traveler returns, the next traveler will have a better chance of finding the gold!
Failure is the surest path to success (10.2) (cont.)
Jackpot journey (cont.)
Obviously, the probability of finding an optimal policy increases as more & more journeys are undertaken.

\( P_{\text{down}} = \dfrac{\#\,\text{black stones}}{\#\,\text{black stones} + \#\,\text{white stones}} \)

\( P_{\text{up}} = 1 - P_{\text{down}} \)   (exclusivity & exhaustivity)
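To make the scheme concrete, here is a minimal Python sketch of repeated jackpot journeys with the stone-based action probabilities & the reward/penalty update. The graph layout, the gold location (vertex I), the initial one-white/one-black stone counts, & the rule of always keeping at least one stone of each color are assumptions made for illustration; the lecture's figure may use different choices.

```python
import random

# Assumed 4-level graph: vertex -> (up-neighbor, down-neighbor).
GRAPH = {
    "A": ("B", "C"),
    "B": ("D", "E"), "C": ("E", "F"),
    "D": ("G", "H"), "E": ("H", "I"), "F": ("I", "J"),
}
GOLD = "I"  # assumed terminal vertex hiding the gold

# Each non-terminal vertex starts with one white & one black stone (assumption).
stones = {v: {"white": 1, "black": 1} for v in GRAPH}

def one_journey():
    """Walk from A to a terminal vertex, then reward or penalize the visited signposts."""
    path, vertex = [], "A"
    while vertex in GRAPH:                       # stop once a terminal vertex is reached
        box = stones[vertex]
        p_down = box["black"] / (box["black"] + box["white"])
        color = "black" if random.random() < p_down else "white"
        path.append((vertex, color))
        up, down = GRAPH[vertex]
        vertex = down if color == "black" else up
    found = (vertex == GOLD)
    for v, color in path:                        # trace back to the starting vertex A
        if found:
            stones[v][color] += 1                # reward: return the stone plus one of the same color
        elif stones[v][color] > 1:
            stones[v][color] -= 1                # penalty: keep the picked stone (guard keeps >= 1 per color)
    return found

successes = sum(one_journey() for _ in range(2000))
print(f"gold found in {successes} of 2000 journeys")
print("stone counts at A:", stones["A"])
```

Over many journeys the stone counts along the gold-bearing path grow, so the \(P_{\text{up}}\)/\(P_{\text{down}}\) values at each signpost drift toward the optimal policy.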
Failure is the Surest Path to Success (10.2) (cont.)
Credit assignment problem
The jackpot journey is strictly success-or-failure driven (a reward & penalty scheme!).
Its tuning (the modification of the number of stones) is performed only when the final outcome becomes known: it is a supervised learning method.
This form of reinforcement learning ignores the intrinsic sequential structure of the problem, which could otherwise be exploited to make adjustments at each state.
Failure is the Surest Path to Success (10.2) (cont.)
Credit assignment problem (cont.)
This goal-driven learning scheme is not applicable to game playing such as chess.
In chess, the learner needs to make better moves with no performance indication regarding winning during the game.
The problem of rewarding or penalizing each move (or state) individually in such a long sequence toward an eventual victory or loss is called the temporal credit assignment problem.
Failure is the Surest Path to Success (10.2) (cont.)
Credit assignment problem (cont.)
Apportioning credit to the agent's internal action structures is called the structural credit assignment problem.
The structural credit assignment problem deals with the development of appropriate internal representations.
Both temporal & structural credit assignment are involved in any connectionist learning model, such as a neural network (NN).
Failure is the Surest Path to Success (10.2) (cont.)
Credit assignment problem (cont.)
The power of reinforcement learning lies in the fact that the learner need not wait until it receives feedback at the very end in order to make adjustments (a fundamental key concept!).
In conclusion, we need an evaluation function that gives each move a local score that can be optimized.
Failure is the Surest Path to Success (10.2) (cont.)
Evaluation functions
Evaluation functions provide scalar values (reinforcement signals) of states to aid in finding optimal trajectories: they play the role of emotions in the biological brain.
Failure is the Surest Path to Success (10.2) (cont.)
Evaluation functions (cont.)
Example (Manhattan distance in the eight-puzzle problem):

Current position        Target position
   2  8  _                 1  2  3
   5  6  4                 8  _  4
   1  3  7                 7  6  5

Estimated number of moves to reach the goal = the sum of each tile's vertical & horizontal distances from its target position:
tile 2: from (1,1) to (1,2): |1 - 1| + |1 - 2| = 1
tile 8: from (1,2) to (2,1): |1 - 2| + |2 - 1| = 2
This distance is used as a heuristic function in the A* algorithm.
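The heuristic can be computed directly from the two board configurations; below is a small Python sketch (encoding the blank as 0 is an assumption made for illustration):

```python
# Manhattan-distance heuristic for the eight puzzle (0 denotes the blank tile).
def manhattan_distance(current, target):
    """Sum of each tile's vertical & horizontal distance from its target cell."""
    # Map each tile value to its (row, col) position in the target board.
    goal_pos = {tile: (r, c)
                for r, row in enumerate(target)
                for c, tile in enumerate(row)}
    total = 0
    for r, row in enumerate(current):
        for c, tile in enumerate(row):
            if tile == 0:                 # the blank does not count
                continue
            gr, gc = goal_pos[tile]
            total += abs(r - gr) + abs(c - gc)
    return total

current = [[2, 8, 0], [5, 6, 4], [1, 3, 7]]   # current position from the slide
target  = [[1, 2, 3], [8, 0, 4], [7, 6, 5]]   # target position from the slide
print(manhattan_distance(current, target))    # tile 2 contributes 1, tile 8 contributes 2, ...
```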
Temporal Difference Learning (10.3)
General form: the modifiable parameter w of the agent's predictor obeys the following update rule, called TD(\(\lambda\)):

\( \Delta w_t = \eta\,(V_{t+1} - V_t)\sum_{k=1}^{t}\lambda^{\,t-k}\,\nabla_w V_k \)

where: \(V_t\) is the prediction value at time t, \(\lambda\) is a discounting parameter in [0, 1], & \(\eta\) is the learning rate.

t = 1: \( \Delta w_1 = \eta\,(V_2 - V_1)\,\nabla_w V_1 \)
t = 2: \( \Delta w_2 = \eta\,(V_3 - V_2)\,(\lambda\,\nabla_w V_1 + \nabla_w V_2) \)
t = 3: \( \Delta w_3 = \eta\,(V_4 - V_3)\,(\lambda^{2}\,\nabla_w V_1 + \lambda\,\nabla_w V_2 + \nabla_w V_3) \)
(the most recent prediction makes the biggest contribution)
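For a linear predictor \(V_t = w^{T}s_t\) (so \(\nabla_w V_t = s_t\)), the inner sum can be accumulated with an eligibility trace \(e_t = \lambda e_{t-1} + \nabla_w V_t\). The following Python sketch applies the rule to a toy trajectory; the one-hot state features, the outcome z = 1, & the values of \(\eta\) & \(\lambda\) are assumptions made for illustration.

```python
import numpy as np

def td_lambda_episode(states, z, w, eta=0.1, lam=0.8):
    """One episode of the TD(lambda) rule for a linear predictor V_t = w . s_t.

    `states` holds the feature vectors s_1, ..., s_m visited during the episode
    and `z` is the final outcome, so that V_{m+1} = z as in the text.
    """
    trace = np.zeros_like(w)                      # e_t = sum_k lambda^(t-k) grad_w V_k
    for t, s_t in enumerate(states):
        trace = lam * trace + s_t                 # grad_w V_t = s_t for a linear predictor
        v_t = w @ s_t
        v_next = z if t == len(states) - 1 else w @ states[t + 1]
        w = w + eta * (v_next - v_t) * trace      # Delta w_t = eta (V_{t+1} - V_t) e_t
    return w

# Toy example: 4 one-hot states and a final outcome of 1 (assumptions for illustration).
states = [np.eye(4)[i] for i in range(4)]
w = np.zeros(4)
for _ in range(100):                              # repeat the same episode
    w = td_lambda_episode(states, z=1.0, w=w)
print(np.round(w, 3))                             # predictions drift toward the final outcome
```

With \(\lambda = 1\) every past gradient keeps full weight in the trace & with \(\lambda = 0\) only the current gradient survives, matching the TD(1) & TD(0) special cases discussed next.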
Temporal Difference Learning (10.3) (cont.)
More recent predictions make greater weight changes:
recent stimuli should be used in combination with the
current ones in order to determine actions
TD(1): \( \Delta w_t = \eta\,(V_{t+1} - V_t)\sum_{k=1}^{t}\nabla_w V_k \)

TD(0): \( \Delta w_t = \eta\,(V_{t+1} - V_t)\,\nabla_w V_t \)

In TD(1) all past predictions make equal contributions to the weight alterations: all states are equally weighted, whereas in TD(0) only the most recent prediction counts.
Temporal Difference Learning (10.3) (cont.)
TD(\(\lambda\)) can be viewed as a supervised learning procedure for the pair (current prediction \(V_t\), desired output \(V_{t+1}\)), with the error term

\( E_{td} = \frac{1}{2}\,(z - V_t)^{2} = \frac{1}{2}\Big(\sum_{k=t}^{m}(V_{k+1} - V_k)\Big)^{2} \)

where: \(V_{m+1} = z\) is the final outcome & \(V_t\) is the current (actual) output.

Since \( \Delta w_t = -\eta\,\nabla_w E_{td} \) & \( -\nabla_w E_{td} = (z - V_t)\,\nabla_w V_t \), it follows that:

\( \Delta w_t = \eta\,(z - V_t)\,\nabla_w V_t \)
Temporal Difference Learning (10.3) (cont.)
The weight variation at time t is proportional to the
difference between the final outcome & the prediction
at time t
This result shows that the scheme is similar to ordinary supervised learning: \(\Delta w_t\) can be determined only after the whole sequence of actions has been completed (i.e., once z is made available!).
Hence \(\Delta w_t\) cannot be computed incrementally in multiple-step problems (such as the jackpot example).
Temporal Difference Learning (10.3) (cont.)
However, the equation

\( \Delta w_t = \eta\,(V_{t+1} - V_t)\sum_{k=1}^{t}\nabla_w V_k \)

for:
t = 1: \( \Delta w_1 = \eta\,(V_2 - V_1)\,\nabla_w V_1 \)
t = 2: \( \Delta w_2 = \eta\,(V_3 - V_2)\,(\nabla_w V_1 + \nabla_w V_2) \)
t = 3: \( \Delta w_3 = \eta\,(V_4 - V_3)\,(\nabla_w V_1 + \nabla_w V_2 + \nabla_w V_3) \)
provides an incremental scheme.
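As a sanity check, the short Python sketch below (with made-up feature vectors & a made-up learning rate) verifies that, for a linear predictor whose weights are held fixed during the episode, accumulating these incremental TD(1) updates reproduces the supervised updates \(\Delta w_t = \eta\,(z - V_t)\,\nabla_w V_t\) summed over the episode.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 3))          # s_1 ... s_m (assumed random features)
w = rng.normal(size=3)                    # weights held fixed during the episode
z = 1.0                                   # final outcome, so V_{m+1} = z
V = np.append(states @ w, z)              # V_1 ... V_m followed by V_{m+1} = z
eta = 0.1

# Supervised form: sum_t eta (z - V_t) grad_w V_t, with grad_w V_t = s_t.
supervised = sum(eta * (z - V[t]) * states[t] for t in range(len(states)))

# Incremental TD(1) form: sum_t eta (V_{t+1} - V_t) sum_{k<=t} s_k.
incremental = np.zeros(3)
running_grad = np.zeros(3)
for t in range(len(states)):
    running_grad += states[t]
    incremental += eta * (V[t + 1] - V[t]) * running_grad

print(np.allclose(supervised, incremental))   # True: the two schemes agree
```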
Temporal Difference Learning (10.3) (cont.)
When \(\lambda = 0\), \( \Delta w_t = \eta\,(V_{t+1} - V_t)\,\nabla_w V_t \): only the most recent prediction affects the weight alteration; this is close to dynamic programming.
Temporal Difference Learning (10.3) (cont.)
Expected Jackpot
We use TD(0) to update the weights & a lookup-table perceptron to output the expected values.
TD perceptron
TD(0) provides:

\( \Delta w_t = \eta\,\big(w^{T} s_{t+1} - w^{T} s_t\big)\,s_t \)

(since \( V_t = w^{T} s_t \) & thus \( \nabla_w V_t = s_t \))
A lookup-table perceptron to approximate expected
values in the jackpot journey problem
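A minimal Python sketch of such a lookup-table perceptron: each vertex is encoded as a one-hot vector \(s_t\), so \(w^{T}s_t\) is just a table lookup of that vertex's expected-jackpot estimate. The graph, the gold location, a terminal payoff of 1, the value of \(\eta\), & the use of a uniform random action choice (instead of the stone-based policy) are assumptions made for illustration.

```python
import random
import numpy as np

# Assumed graph and gold location, as in the earlier journey sketch.
GRAPH = {"A": ("B", "C"), "B": ("D", "E"), "C": ("E", "F"),
         "D": ("G", "H"), "E": ("H", "I"), "F": ("I", "J")}
VERTICES = list("ABCDEFGHIJ")
GOLD, ETA = "I", 0.1

one_hot = {v: np.eye(len(VERTICES))[i] for i, v in enumerate(VERTICES)}
w = np.zeros(len(VERTICES))               # lookup table: w[i] estimates E[jackpot | vertex i]

for _ in range(5000):
    vertex = "A"
    while vertex in GRAPH:                # random walk: pick up or down with equal probability
        nxt = random.choice(GRAPH[vertex])
        s_t, s_next = one_hot[vertex], one_hot[nxt]
        v_next = 1.0 if nxt == GOLD else w @ s_next   # terminal payoff: 1 at the gold vertex
        w += ETA * (v_next - w @ s_t) * s_t           # TD(0): eta (w.s_{t+1} - w.s_t) s_t
        vertex = nxt

for v in "ABCDEF":
    print(v, round(float(w @ one_hot[v]), 2))  # estimated expected jackpot at each decision vertex
```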