NPTEL
Video Course on Machine Learning
Professor Carl Gustaf Jansson, KTH
Week 5 Machine Learning enabled
by prior Theories
Video 5.4 Reinforcement Learning – Part 1 - Introduction
Reinforcement Learning (RL)
The intuitive scenario for Reinforcement Learning is an Agent that
learns from interaction with an Environment to achieve some
long-term goals related to the State of the environment by performing
a sequence of actions and by receiving feedback.
The mapping from states to possible actions is called a
Policy.
The achievement of goals is defined by Rewards or reward signals
being the feedback on actions from the environment.
The Return is defined as the cumulative (possibly discounted)
rewards over an Episode, i.e. an action sequence leading to a
Terminal state.
The goal of any RL algorithm is to establish a policy that maximizes
the Returns.
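As a small worked illustration of a discounted return (the discount factor and reward values here are assumed for illustration, not taken from the lecture): with γ = 0.9 and an episode whose per-step rewards are 0, 0 and +1,

    G = 0 + γ·0 + γ^2·1 = 0 + 0.9·0 + 0.81·1 = 0.81

so the later a reward arrives, the less it contributes to the Return.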
[Figure: Micromouse]
Reinforcement Learning terminology
Environment: The 'microworld' defined for the particular RL problem, including the agent. Designated as E.
Agent: The entity that learns from interaction with the environment. Designated as A.
State: A particular configuration of the agent within the environment. Designated as s.
Terminal states: Defined end states for the particular RL problem. Designated as TS.
Action: An agent selects an action a based upon the current state s and the policy π.
Policy: A policy π is a mapping from states of the environment to the potential actions of an agent in those states. π(s) can be deterministic, depending only on s, or stochastic, π(s,a), depending also on a.
Transition probability function: T(s′|s,a) specifies the probability that the environment will transition to state s′ if the agent takes action a in state s. When the environment is deterministic this reduces to a mapping s′ = T(s,a).
Episode: Sometimes called an epoch. A sequence of states, actions and rewards which ends in a terminal state.
Reward or reward signal: The reward R(s′|s,a) gives feedback from the environment on the effect of a single action a taken in state s and leading to s′.
Discounted reward: When calculating the Return, the expected rewards for future steps can be weighted with a discount factor γ in the interval [0,1].
Return: The accumulated rewards for an episode. Designated as G.
Value function: The value function V = V(s) is an estimate of the Value or Utility of s with respect to its average Return over all episodes possible under the current policy. It must continually be re-estimated as actions are taken. The state-value function v(s) for a policy π satisfies the Bellman equation given later in this lecture. Note that the value of a terminal state (if any) is always zero.
Model of the environment: A set of tuples T(s′|s,a) and R(s′|s,a) representing the possible state transitions and the rewards given.
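As a minimal sketch of how this terminology can be mapped to data structures (the state names, action names and reward values below are illustrative assumptions, not part of the lecture), a deterministic model of the environment can be stored as two lookup tables:

    # Illustrative, assumed model of a tiny deterministic environment.
    # T maps (state, action) -> next state; R maps (state, action, next_state) -> reward.
    T = {("s0", "right"): "s1", ("s1", "right"): "terminal"}
    R = {("s0", "right", "s1"): 0.0, ("s1", "right", "terminal"): 1.0}

    # A deterministic policy is simply a mapping from states to actions.
    policy = {"s0": "right", "s1": "right"}

    def step(state, action):
        """Apply the model: return (next_state, reward) for taking action in state."""
        next_state = T[(state, action)]
        return next_state, R[(state, action, next_state)]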
Example 1 - the 4x3 world
[Figure: the 4x3 board, with the start position at (1,1), the excluded position (2,2), +1 at (4,3) and -1 at (4,2)]
Environment and agent: The 4x3 world is a board of 4 * 3 positions, indexed by two coordinates. The agent has a start position of (1,1). The position (2,2) is excluded.
Reward: There are two terminal positions, (4,3) and (4,2). Reaching (4,3) gives a reward = +1. Reaching (4,2) gives a reward = -1. Reaching all other positions gives reward = 0.
Actions: up, down, left, right, but restricted by the board configuration. The transitions are deterministic in the sense that an action in a specific state can only lead to one other state.
Episode: e.g. a sequence of actions leading from (1,1) to (4,3) via (1,2), (1,3), (2,3) and (3,3).
Examples of episodes with returns.
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3)    Return = +1
(1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3)    Return = +1
(1,1) → (2,1) → (3,1) → (3,2) → (4,2)    Return = -1
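The returns of the episodes above can be reproduced with a small sketch (assuming, as in the example, reward +1 for entering (4,3), -1 for entering (4,2) and 0 otherwise; the function name and the default γ = 1 are choices made here, not given in the lecture):

    # Reward received when entering a position of the 4x3 world (0 for all other positions).
    terminal_rewards = {(4, 3): 1.0, (4, 2): -1.0}

    def episode_return(positions, gamma=1.0):
        """Accumulate the (optionally discounted) rewards along a sequence of visited positions."""
        g = 0.0
        for t, pos in enumerate(positions[1:]):      # a reward is given for each position entered
            g += (gamma ** t) * terminal_rewards.get(pos, 0.0)
        return g

    episode = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
    print(episode_return(episode))        # 1.0: the undiscounted return of this episode
    print(episode_return(episode, 0.9))   # about 0.66: the same episode with discount factor 0.9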
Reinforcement Learning modelled as Markov Decision Process (MDP)
A suitable model for the above intuitive scenario is a basic Markov Decision Process (MDP).
An MDP is typically defined by a 4-tuple (S, A, R, T) where:
• S is the set of states; s is the state of the environment (including the agent)
• A is the set of actions for each s that the agent can choose between, as defined by a policy π
• R(s′|s,a) is a function that returns the reward received for taking action a in state s leading to s′
• T(s′|s,a) is a transition probability function, specifying the probability that the environment will transition to state s′ if the agent takes action a in state s. The Markov property: transition probabilities depend on the current state only, not on the path to the state.
[Figure: state-transition diagram with terminal rewards +1 and -1. State transitions: a state transitions to its neighboring states, which in this case have equal probabilities that sum to 1.]
The goal is to find a policy π that maximizes the return = expected
future accumulated (discounted) reward for episodes from a Start
state s to a Terminal state.
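As a minimal sketch of how the 4-tuple can be represented in code (the state names, actions and probabilities below are illustrative assumptions, loosely mirroring the diagram's idea of neighboring states with probabilities that sum to 1):

    # T[(s, a)] is a probability distribution over next states s'; each distribution sums to 1.
    T = {("s0", "move"): {"s1": 0.5, "s2": 0.5},
         ("s1", "move"): {"s0": 0.25, "s2": 0.25, "plus": 0.5},
         ("s2", "move"): {"s0": 0.5, "minus": 0.5}}

    # R[(s, a, s')] is the reward for the transition from s to s' under action a (0 if not listed).
    R = {("s1", "move", "plus"): 1.0, ("s2", "move", "minus"): -1.0}

    # The Markov property is built in: the distribution depends only on (s, a),
    # never on how the agent reached s.
    for (s, a), dist in T.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9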
Dynamic Programming versus Reinforcement Learning
In one MDP scenario, there is a complete and exact model of the MDP in the
sense that T(s′|s,a) and R(s′|s,a) are fully defined.
In this case the MDP problem is a Planning Problem that can be solved exactly
by use of e.g. Dynamic Programming.
However, in many cases T(s′|s,a) and R(s′|s,a) are not completely known, meaning
that the model of the MDP is not complete.
In such cases we have a true Reinforcement Learning problem.
Dynamic Programming (DP)
Dynamic Programming is an algorithm design technique for optimization problems: often for minimizing or
maximizing. The method was developed by Richard Bellman in the 1950s.
Like divide and conquer, DP simplifies a complicated problem by breaking it down into simpler
sub-problems in a recursive manner and then combining the solutions to the sub-problems to form the total
solution.
If a problem can be optimally solved by breaking it recursively into sub-problems and then forming the
solution from optimal solutions to the sub-problems, then it is said to have optimal substructure.
Unlike divide and conquer, the sub-problems are not independent: sub-problems may share sub-sub-problems,
so each shared sub-problem needs to be solved only once and its solution can be reused.
Typically the utility value of a solution in a certain state is defined by a value function calculated recursively
from the value functions for the remaining steps of an episode, weighted by the relevant probabilities
and by a discount parameter that gives higher weight to closer reward contributions. Typically the policy function
can be calculated in a similar fashion.
The defining equation for finding an optimal value function is called the Bellman equation.
Dynamic Programming (DP) – motives behind name
Bellman´s principle of optimality and the Bellman equation
Dynamic programming algorithms have the ambition to optimize the following two functions:
Policy function: π(s) := argmax_a ∑_s′ T(s′|s,a) * ( R(s′|s,a) + γ * V(s′) )
Value function: V(s) := ∑_s′ T(s′|s,a) * ( R(s′|s,a) + γ * V(s′) ), with a = π(s)
Richard Bellman's Principle of Optimality
An optimal policy has the property that whatever the initial state and initial decision are, the remaining
decisions must constitute an optimal policy with regard to the state resulting from the first decision.
An optimal value function is a solution to the so-called Bellman equation, here written for deterministic transitions s′ = T(s,a_i):
V(s) = max_{i=1..n} ( R(s′|s,a_i) + γ * V(s′) ), where s′ = T(s,a_i)
An optimal value function can be the basis for an optimal policy function.
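As a worked illustration using the 4x3 world from earlier (deterministic transitions; γ = 1 is assumed here for simplicity): from state (3,3) the action right leads to the terminal state (4,3) with reward +1 and V(4,3) = 0, so

    V(3,3) = max_{i=1..n} ( R(s′|s,a_i) + γ * V(s′) ) = 1 + 1·0 = 1

and, working backwards, V(2,3) = 0 + 1·V(3,3) = 1, and so on towards the start state (1,1).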
Three standard approaches to solve the Bellman Equation
Value iteration (a Python sketch is given after this list)
Initialise V arbitrarily
repeat
Improve the estimate V(k+1) using the previous estimate V(k)
until convergence
Policy iteration
Initialise V and π arbitrarily
repeat
Evaluate V using π
Improve π using V
until convergence.
Linear programming
Prioritized sweeping - use priority queue to update states with large potential for change
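The value-iteration scheme above can be sketched in Python as follows (the function signature, the model representation with T[(s, a)] as a distribution over next states and R[(s, a, s')] as rewards, and the default parameters are assumptions made here for illustration, not the lecture's notation):

    def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
        """Repeatedly apply the Bellman backup until the value function stops changing."""
        V = {s: 0.0 for s in states}                       # initialise V arbitrarily (zeros here)
        while True:
            delta = 0.0
            for s in states:
                if not actions.get(s):                     # terminal states keep the value 0
                    continue
                best = max(sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                               for s2, p in T[(s, a)].items())
                           for a in actions[s])
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:                              # convergence reached
                break
        # Extract a greedy policy from the converged value function.
        policy = {s: max(actions[s],
                         key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                           for s2, p in T[(s, a)].items()))
                  for s in states if actions.get(s)}
        return V, policy

Policy iteration alternates the same two ingredients explicitly: evaluate V under the current policy π, then improve π greedily with respect to V, and repeat until π stops changing.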
Simple example to illustrate Dynamic Programming: The shortest path in multi-stage graphs
The shortest path is (S, C, F, T) => 5+2+2 = 9. A greedy method gives (S, A, D, T) => 1+4+18 = 23.
Example with Dynamic programming approach
Forward approach
• d(S, T) = min{1+d(A, T), 2+d(B, T), 5+d(C, T)}
• d(A,T) = min{4+d(D,T), 11+d(E,T)} = min{4+18, 11+13} = 22.
• d(B, T) = min{9+d(D, T), 5+d(E, T), 16+d(F, T)} = min{9+18, 5+13, 16+2} = 18.
• d(C, T) = min{ 2+d(F, T) } = 2+2 = 4
• d(S, T) = min{1+d(A, T), 2+d(B, T), 5+d(C, T)} = min{1+22, 2+18, 5+4} = 9.
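The same backward recursion can be written as a short memoized sketch (the dictionary below encodes exactly the edge weights used in the example; the function name d mirrors the notation above):

    from functools import lru_cache

    # Edge weights of the multi-stage graph from the example.
    edges = {"S": {"A": 1, "B": 2, "C": 5},
             "A": {"D": 4, "E": 11},
             "B": {"D": 9, "E": 5, "F": 16},
             "C": {"F": 2},
             "D": {"T": 18},
             "E": {"T": 13},
             "F": {"T": 2}}

    @lru_cache(maxsize=None)
    def d(node):
        """Shortest distance from node to the target T, computed recursively with memoization."""
        if node == "T":
            return 0
        return min(weight + d(succ) for succ, weight in edges[node].items())

    print(d("S"))   # 9, corresponding to the path S -> C -> F -> T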
Markov Decision Process (MDP) can be generalized to
Partially Observable Markov Decision Process (POMDP)
A POMDP models an agent decision process in which it is assumed that
the system dynamics are determined by an MDP, but the agent cannot
directly observe the underlying state. Instead, it must maintain a
probability distribution over the set of possible states, based on a set
of observations, the observation probabilities and the underlying MDP.
POMDPs enlarge the applicability of model-based MDPs.
A POMDP has two additional elements on top of the MDP elements:
• A set of observations
• A set of observation probabilities.
Because the agent does not directly observe the environment's state, it
must make decisions under uncertainty about the true environment state.
The details of POMDP are outside the scope of this lecture.
To be continued in Part 2