CSE 513 Soft Computing, Dr. Djamel Bouchaffra
Chapter 10:
Learning from Reinforcement
Introduction (10.1)
Failure is the surest path to success (10.2)
Jackpot Journey
Credit Assignment Problem
Evaluation Functions
Temporal Difference Learning (10.3)
Jyh-Shing Roger Jang et al., Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence,
First Edition, Prentice Hall, 1997
Introduction (10.1)
Learning from reinforcement is a fundamental paradigm for machine learning, with an emphasis on the computational aspects of learning.
It is based on a trial-and-error learning scheme whereby a computational agent learns to perform an appropriate action by receiving a reinforcement signal (a performance measure) through interaction with the environment.
The learner reinforces itself by drawing lessons from its failures.
Introduction (10.1) (cont.)
Reinforcement learning was first observed & experimented with in animals, particularly chimpanzees coping with a physical environment.
It is also used by the most intelligent creatures on earth: humans.
If an action is followed by a satisfactory state of affairs, or by an improvement in the state of affairs, then the tendency to produce that action is reinforced (rewarded!); otherwise, that tendency is weakened or inhibited (penalized!).
Introduction (10.1) (cont.)
There are four basic representative architectures for reinforcement learning [Sutton 1990]:
Policy only (probability-based actions)
Reinforcement comparison
Adaptive heuristic critic
Q-learning
Failure is the surest path to success (10.2)
Jackpot journey
Goal: Find an optimal policy for selecting a series of actions
by means of a reward-penalty scheme
Principle:
Starting from a vertex of a graph, a traveler needs to cross the graph, vertex by vertex, in order to reach gold hidden in one of the terminal vertices.
At each vertex there is a signpost holding a box with some white & black stones in it. The traveler picks a stone from the signpost box & follows its instruction: when a white stone is picked, go diagonally upward, denoted by action u; when a black stone is picked, go diagonally downward, denoted by action d.
The Jackpot Journey Problem
Failure is the surest path to success (10.2) (cont.)
Jackpot journey (cont.)
During this travel (starting from vertex A) until one of the terminal vertices {G, H, I, J} is reached, the following scheme is applied:
When the gold is found, prepare a reward scheme; when it is not found, prepare a penalty scheme. Then trace back to the starting vertex A; at each visited vertex, either put the picked stone back into the signpost box together with an additional stone of the same color (reward), or take the picked stone away from the signpost box (penalty). After the traveler returns, the next traveler will have a better chance of finding the gold!
Failure is the surest path to success (10.2) (cont.)
Jackpot journey (cont.)
Obviously, the probability of finding an optimal policy increases as more & more journeys are undertaken.

\( P_{\text{down}} = \dfrac{\#\,\text{black stones}}{\#\,\text{black stones} + \#\,\text{white stones}} \)

\( P_{\text{up}} = 1 - P_{\text{down}} \)   (exclusivity & exhaustivity)
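To make the scheme concrete, here is a minimal Python sketch of repeated jackpot journeys with the stone-based action probabilities & the reward/penalty update. The graph layout, the gold location (vertex I), the initial one-white/one-black stone counts, & the rule of always keeping at least one stone of each color are assumptions made for illustration; the lecture's figure may use different choices.

```python
import random

# Assumed 4-level graph: vertex -> (up-neighbor, down-neighbor).
GRAPH = {
    "A": ("B", "C"),
    "B": ("D", "E"), "C": ("E", "F"),
    "D": ("G", "H"), "E": ("H", "I"), "F": ("I", "J"),
}
GOLD = "I"  # assumed terminal vertex hiding the gold

# Each non-terminal vertex starts with one white & one black stone (assumption).
stones = {v: {"white": 1, "black": 1} for v in GRAPH}

def one_journey():
    """Walk from A to a terminal vertex, then reward or penalize the visited signposts."""
    path, vertex = [], "A"
    while vertex in GRAPH:                       # stop once a terminal vertex is reached
        box = stones[vertex]
        p_down = box["black"] / (box["black"] + box["white"])
        color = "black" if random.random() < p_down else "white"
        path.append((vertex, color))
        up, down = GRAPH[vertex]
        vertex = down if color == "black" else up
    found = (vertex == GOLD)
    for v, color in path:                        # trace back to the starting vertex A
        if found:
            stones[v][color] += 1                # reward: return the stone plus one of the same color
        elif stones[v][color] > 1:
            stones[v][color] -= 1                # penalty: keep the picked stone (guard keeps >= 1 per color)
    return found

successes = sum(one_journey() for _ in range(2000))
print(f"gold found in {successes} of 2000 journeys")
print("stone counts at A:", stones["A"])
```

Over many journeys the stone counts along the gold-bearing path grow, so the \(P_{\text{up}}\)/\(P_{\text{down}}\) values at each signpost drift toward the optimal policy.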
Failure is the Surest Path to Success (10.2) (cont.)
Credit assignment problem
The jackpot journey is strictly success-or-failure driven (a reward & penalty scheme!).
Its tuning (the modification of the number of stones) is performed only when the final outcome becomes known: it is a supervised learning method.
This form of reinforcement learning ignores the intrinsic sequential structure of the problem, which could otherwise be exploited to make adjustments at each state.
Failure is the Surest Path to Success (10.2) (cont.)
Credit assignment problem (cont.)
This goal-driven learning scheme is not applicable to game playing such as chess.
In chess, the learner needs to make better moves with no performance indication regarding winning during the game.
The problem of rewarding or penalizing each move (or state) individually in such a long sequence toward an eventual victory or loss is called the temporal credit assignment problem.
Failure is the Surest Path to Success (10.2) (cont.)
Credit assignment problem (cont.)
Apportioning credit to the agent's internal action structures is called the structural credit assignment problem.
The structural credit assignment problem deals with the development of appropriate internal representations.
Both temporal & structural credit assignment are involved in any connectionist learning model, such as a neural network (NN).
Failure is the Surest Path to Success (10.2) (cont.)
Credit assignment problem (cont.)
The power of reinforcement learning lies in the fact that the learner need not wait until it receives feedback at the very end in order to make adjustments (a fundamental key concept!).
In conclusion, we need an evaluation function that gives each move a local score that can be optimized.
Failure is the Surest Path to Success (10.2) (cont.)
Evaluation functions
Evaluation functions provide scalar values (reinforcement signals) of states to aid in finding optimal trajectories: they play the role of emotions in the biological brain.
Failure is the Surest Path to Success (10.2) (cont.)
Evaluation functions (cont.)
Example (Manhattan distance in the eight-puzzle problem):

Current position        Target position
   2  8  _                 1  2  3
   5  6  4                 8  _  4
   1  3  7                 7  6  5

Estimated number of moves to reach the goal = the sum of each tile's vertical & horizontal distances from its target position:
tile 2: from (1,1) to (1,2): |1 - 1| + |1 - 2| = 1
tile 8: from (1,2) to (2,1): |1 - 2| + |2 - 1| = 2
This distance is used as a heuristic function in the A* algorithm.
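The heuristic can be computed directly from the two board configurations; below is a small Python sketch (encoding the blank as 0 is an assumption made for illustration):

```python
# Manhattan-distance heuristic for the eight puzzle (0 denotes the blank tile).
def manhattan_distance(current, target):
    """Sum of each tile's vertical & horizontal distance from its target cell."""
    # Map each tile value to its (row, col) position in the target board.
    goal_pos = {tile: (r, c)
                for r, row in enumerate(target)
                for c, tile in enumerate(row)}
    total = 0
    for r, row in enumerate(current):
        for c, tile in enumerate(row):
            if tile == 0:                 # the blank does not count
                continue
            gr, gc = goal_pos[tile]
            total += abs(r - gr) + abs(c - gc)
    return total

current = [[2, 8, 0], [5, 6, 4], [1, 3, 7]]   # current position from the slide
target  = [[1, 2, 3], [8, 0, 4], [7, 6, 5]]   # target position from the slide
print(manhattan_distance(current, target))    # tile 2 contributes 1, tile 8 contributes 2, ...
```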
Temporal Difference Learning (10.3)
General form: the modifiable parameter w of the agent's predictor obeys the following update rule, called TD(\(\lambda\)):

\( \Delta w_t = \eta\,(V_{t+1} - V_t)\sum_{k=1}^{t}\lambda^{\,t-k}\,\nabla_w V_k \)

where: \(V_t\) is the prediction value at time t, \(\lambda\) is a discounting parameter in [0, 1], & \(\eta\) is the learning rate.

t = 1: \( \Delta w_1 = \eta\,(V_2 - V_1)\,\nabla_w V_1 \)
t = 2: \( \Delta w_2 = \eta\,(V_3 - V_2)\,(\lambda\,\nabla_w V_1 + \nabla_w V_2) \)
t = 3: \( \Delta w_3 = \eta\,(V_4 - V_3)\,(\lambda^{2}\,\nabla_w V_1 + \lambda\,\nabla_w V_2 + \nabla_w V_3) \)
(the most recent prediction makes the biggest contribution)
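For a linear predictor \(V_t = w^{T}s_t\) (so \(\nabla_w V_t = s_t\)), the inner sum can be accumulated with an eligibility trace \(e_t = \lambda e_{t-1} + \nabla_w V_t\). The following Python sketch applies the rule to a toy trajectory; the one-hot state features, the outcome z = 1, & the values of \(\eta\) & \(\lambda\) are assumptions made for illustration.

```python
import numpy as np

def td_lambda_episode(states, z, w, eta=0.1, lam=0.8):
    """One episode of the TD(lambda) rule for a linear predictor V_t = w . s_t.

    `states` holds the feature vectors s_1, ..., s_m visited during the episode
    and `z` is the final outcome, so that V_{m+1} = z as in the text.
    """
    trace = np.zeros_like(w)                      # e_t = sum_k lambda^(t-k) grad_w V_k
    for t, s_t in enumerate(states):
        trace = lam * trace + s_t                 # grad_w V_t = s_t for a linear predictor
        v_t = w @ s_t
        v_next = z if t == len(states) - 1 else w @ states[t + 1]
        w = w + eta * (v_next - v_t) * trace      # Delta w_t = eta (V_{t+1} - V_t) e_t
    return w

# Toy example: 4 one-hot states and a final outcome of 1 (assumptions for illustration).
states = [np.eye(4)[i] for i in range(4)]
w = np.zeros(4)
for _ in range(100):                              # repeat the same episode
    w = td_lambda_episode(states, z=1.0, w=w)
print(np.round(w, 3))                             # predictions drift toward the final outcome
```

With \(\lambda = 1\) every past gradient keeps full weight in the trace & with \(\lambda = 0\) only the current gradient survives, matching the TD(1) & TD(0) special cases discussed next.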
Temporal Difference Learning (10.3) (cont.)
More recent predictions make greater weight changes:
recent stimuli should be used in combination with the
current ones in order to determine actions
TD(1): \( \Delta w_t = \eta\,(V_{t+1} - V_t)\sum_{k=1}^{t}\nabla_w V_k \)

TD(0): \( \Delta w_t = \eta\,(V_{t+1} - V_t)\,\nabla_w V_t \)

In TD(1) all past predictions make equal contributions to the weight alterations: all states are equally weighted, whereas in TD(0) only the most recent prediction counts.
Temporal Difference Learning (10.3) (cont.)
TD(\(\lambda\)) can be viewed as a supervised learning procedure for the pair (current prediction \(V_t\), desired output \(V_{t+1}\)), with the error term

\( E_{td} = \frac{1}{2}\,(z - V_t)^{2} = \frac{1}{2}\Big(\sum_{k=t}^{m}(V_{k+1} - V_k)\Big)^{2} \)

where: \(V_{m+1} = z\) is the final outcome & \(V_t\) is the current (actual) output.

Since \( \Delta w_t = -\eta\,\nabla_w E_{td} \) & \( -\nabla_w E_{td} = (z - V_t)\,\nabla_w V_t \), it follows that:

\( \Delta w_t = \eta\,(z - V_t)\,\nabla_w V_t \)
Temporal Difference Learning (10.3) (cont.)
The weight variation at time t is proportional to the
difference between the final outcome & the prediction
at time t
This result shows that the scheme is similar to ordinary supervised learning: \(\Delta w_t\) can be determined only after the whole sequence of actions has been completed (i.e., once z is made available!).
Hence \(\Delta w_t\) cannot be computed incrementally in multiple-step problems (such as the jackpot example).
Temporal Difference Learning (10.3) (cont.)
However, the equation

\( \Delta w_t = \eta\,(V_{t+1} - V_t)\sum_{k=1}^{t}\nabla_w V_k \)

for:
t = 1: \( \Delta w_1 = \eta\,(V_2 - V_1)\,\nabla_w V_1 \)
t = 2: \( \Delta w_2 = \eta\,(V_3 - V_2)\,(\nabla_w V_1 + \nabla_w V_2) \)
t = 3: \( \Delta w_3 = \eta\,(V_4 - V_3)\,(\nabla_w V_1 + \nabla_w V_2 + \nabla_w V_3) \)
provides an incremental scheme.
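As a sanity check, the short Python sketch below (with made-up feature vectors & a made-up learning rate) verifies that, for a linear predictor whose weights are held fixed during the episode, accumulating these incremental TD(1) updates reproduces the supervised updates \(\Delta w_t = \eta\,(z - V_t)\,\nabla_w V_t\) summed over the episode.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 3))          # s_1 ... s_m (assumed random features)
w = rng.normal(size=3)                    # weights held fixed during the episode
z = 1.0                                   # final outcome, so V_{m+1} = z
V = np.append(states @ w, z)              # V_1 ... V_m followed by V_{m+1} = z
eta = 0.1

# Supervised form: sum_t eta (z - V_t) grad_w V_t, with grad_w V_t = s_t.
supervised = sum(eta * (z - V[t]) * states[t] for t in range(len(states)))

# Incremental TD(1) form: sum_t eta (V_{t+1} - V_t) sum_{k<=t} s_k.
incremental = np.zeros(3)
running_grad = np.zeros(3)
for t in range(len(states)):
    running_grad += states[t]
    incremental += eta * (V[t + 1] - V[t]) * running_grad

print(np.allclose(supervised, incremental))   # True: the two schemes agree
```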
Temporal Difference Learning (10.3) (cont.)
When \(\lambda = 0\), \( \Delta w_t = \eta\,(V_{t+1} - V_t)\,\nabla_w V_t \): only the most recent prediction affects the weight alteration; this is close to dynamic programming.
Temporal Difference Learning (10.3) (cont.)
Expected Jackpot
We use TD(0) to update the weights & a lookup-table perceptron to output the expected values.
TD perceptron
TD(0) provides:

\( \Delta w_t = \eta\,\big(w^{T} s_{t+1} - w^{T} s_t\big)\,s_t \)

(since \( V_t = w^{T} s_t \) & thus \( \nabla_w V_t = s_t \))
A lookup-table perceptron to approximate expected
values in the jackpot journey problem
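A minimal Python sketch of such a lookup-table perceptron: each vertex is encoded as a one-hot vector \(s_t\), so \(w^{T}s_t\) is just a table lookup of that vertex's expected-jackpot estimate. The graph, the gold location, a terminal payoff of 1, the value of \(\eta\), & the use of a uniform random action choice (instead of the stone-based policy) are assumptions made for illustration.

```python
import random
import numpy as np

# Assumed graph and gold location, as in the earlier journey sketch.
GRAPH = {"A": ("B", "C"), "B": ("D", "E"), "C": ("E", "F"),
         "D": ("G", "H"), "E": ("H", "I"), "F": ("I", "J")}
VERTICES = list("ABCDEFGHIJ")
GOLD, ETA = "I", 0.1

one_hot = {v: np.eye(len(VERTICES))[i] for i, v in enumerate(VERTICES)}
w = np.zeros(len(VERTICES))               # lookup table: w[i] estimates E[jackpot | vertex i]

for _ in range(5000):
    vertex = "A"
    while vertex in GRAPH:                # random walk: pick up or down with equal probability
        nxt = random.choice(GRAPH[vertex])
        s_t, s_next = one_hot[vertex], one_hot[nxt]
        v_next = 1.0 if nxt == GOLD else w @ s_next   # terminal payoff: 1 at the gold vertex
        w += ETA * (v_next - w @ s_t) * s_t           # TD(0): eta (w.s_{t+1} - w.s_t) s_t
        vertex = nxt

for v in "ABCDEF":
    print(v, round(float(w @ one_hot[v]), 2))  # estimated expected jackpot at each decision vertex
```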