Foundations of Machine Learning
DSA 5102 • Lecture 11
Li Qianxiao
Department of Mathematics
So far
We introduced two classes of machine learning problems
• Supervised Learning
• Unsupervised Learning
Today, we will look at another class of problems that lies
somewhere in between, called reinforcement learning
Motivation
Some general observations of learning:
• Interactions with the environment
• Learning from experience
• Reward vs demonstrations
• Planning
The Reward Hypothesis
All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Examples
• Studying and getting good grades
• Learning to play a new musical instrument
• Winning at chess
• Navigating a maze
• An infant learning to walk
The Basic Components
[Diagram: the agent–environment interaction loop]
• Agent
• Environment
• Interpreter
• State
• Action
• Reward
Examples
• Chess: agent = player, environment = board state, interpreter = vision, reward = win/loss at the end
• Learning to walk: agent = infant, environment = the world, interpreter = senses, reward = not falling, getting to places
• Navigating a maze: agent = player, environment = the maze, interpreter = vision, reward = getting out of the maze
Key Differences in Reinforcement
Learning
• Vs unsupervised learning: not completely unsupervised due
to a reward signal
• Vs supervised learning: not completely supervised, since
optimal actions to take are never given
Example: The Recycling Robot
Actions
• Search for cans
• Pick up or drop cans
• Stop and wait
• Go back and charge
Rewards:
• +10 for each can picked up
• -1 for each meter moved
• -1000 for running out of battery
The Reinforcement Learning Problem
The RL problem can be posed as follows:
An agent navigates an environment through the lens of an
interpreter. It interacts with the environment through performing
actions, and the environment in turn provides the agent with a
reward signal. The agent's goal is to learn, through experience,
how to maximize the long-term accumulated reward.
Finite Markov Decision Processes
Finite State, Discrete Time Markov Chains
• Sequence of time steps: $t = 0, 1, 2, \dots$
• State space: $\mathcal{S} = \{s_1, s_2, \dots, s_n\}$ such that $|\mathcal{S}| = n < \infty$
• States: $S_t \in \mathcal{S}$
The states form a stochastic process, which evolves according to a transition probability
$$\Pr(S_{t+1} = s' \mid S_t, S_{t-1}, \dots, S_0).$$
Markov Property and Time Homogeneity
Markov Property
$$\Pr(S_{t+1} = s' \mid S_t, S_{t-1}, \dots, S_0) = \Pr(S_{t+1} = s' \mid S_t)$$
Time Homogeneous Markov Chain
• The transition probability is independent of time, i.e. $\Pr(S_{t+1} = s' \mid S_t = s) = P_{ss'}$ for all $t$
• The matrix $P = [P_{ss'}]_{s, s' \in \mathcal{S}}$ is called the transition (probability) matrix
Example
• State space: $\mathcal{S} = \{s_1, s_2, s_3\}$
• Transition probability: the probabilities of moving between $s_1, s_2, s_3$ can be collected into a $3 \times 3$ transition probability matrix $P$
[Diagram: a three-state Markov chain, from https://en.wikipedia.org/wiki/Markov_chain]
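To make the transition matrix concrete, here is a minimal Python sketch (an illustration, not from the lecture) that simulates a time-homogeneous Markov chain on three states; the matrix entries are made-up assumptions, chosen only so that each row sums to one.

```python
import numpy as np

# Hypothetical 3x3 transition matrix: P[i, j] = Pr(S_{t+1} = s_j | S_t = s_i).
# The numbers are illustrative assumptions; each row must sum to 1.
P = np.array([
    [0.90, 0.075, 0.025],
    [0.15, 0.80,  0.05],
    [0.25, 0.25,  0.50],
])

rng = np.random.default_rng(0)

def simulate(P, s0=0, steps=10):
    """Sample a trajectory S_0, S_1, ..., S_steps from the chain."""
    states = [s0]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(simulate(P))  # a list of 11 state indices, starting from state 0
```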
Non-Markovian or Non-time-homogeneous Stochastic Processes
Example of a non-Markovian process
• Drawing coins without replacement out of a bag containing 10 each of $1, 50c and 10c coins. Let $S_t$ be the total value of the coins drawn up to time $t$.
Example of a non-time-homogeneous process
• Drawing coins at time $t$ in a way that depends on $t$, so that the transition probability changes with time.
Essential Components of Markov Decision Processes
A Markov decision process (MDP) is a generalization of a Markov process, with actions and rewards.
Essential elements
• Sequence of time steps: $t = 0, 1, 2, \dots$
• States: $S_t \in \mathcal{S}$
• Actions: $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}(s)$ (union over all states $s$)
• Rewards: $R_t \in \mathcal{R} \subset \mathbb{R}$
State Evolution
[Diagram: the agent–environment loop. The agent observes the state $S_t$ and reward $R_t$ through the interpreter, chooses an action $A_t$, and the environment (via the interpreter) returns the next state $S_{t+1}$ and reward $R_{t+1}$.]
Transition Probability
For Markov chains, we have the transition probability
$$\Pr(S_{t+1} = s' \mid S_t = s).$$
For Markov decision processes, we additionally need to account for:
• The reward $R_{t+1}$
• The action $A_t$
Hence, we specify the MDP transition probability
$$p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a).$$
Markov Decision Processes
A Markov decision process (MDP) is the evolution of $(S_t, A_t, R_t)$ according to
$$p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a).$$
An MDP is finite if $\mathcal{S}$ and $\mathcal{R}$ are finite and $\mathcal{A}(s)$ is finite for each $s \in \mathcal{S}$.
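As a concrete (hypothetical) illustration of this data, the sketch below stores $p(s', r \mid s, a)$ for a tiny made-up MDP loosely inspired by the recycling robot; the states, actions, rewards and probabilities are all assumptions, not the lecture's.

```python
# A tiny, made-up finite MDP stored as p[(s, a)] = list of (prob, next_state, reward).
# States, actions, rewards and probabilities are illustrative assumptions.
p = {
    ("low", "search"):   [(0.6, "low", 1.0), (0.4, "dead", -10.0)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
    ("high", "search"):  [(0.7, "high", 1.0), (0.3, "low", 1.0)],
    ("high", "wait"):    [(1.0, "high", 0.5)],
}

def actions(s):
    """A(s): the actions available in state s."""
    return sorted({a for (state, a) in p if state == s})

# Sanity check: for each (s, a), the transition probabilities sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9

print(actions("low"))  # ['recharge', 'search']
```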
Example: The Recycling Robot
State: $s = $ (position, charge, weight)
Actions: the allowed set $\mathcal{A}(s)$ depends on the state, e.g.
• If the charge is zero, then $\mathcal{A}(s)$ is empty
• If the robot is at a position with a can (and is not yet full), then picking up the can is in $\mathcal{A}(s)$
• …
Reward: as listed earlier (+10 for each can picked up, -1 for each meter moved, -1000 for running out of battery)
[Diagram: the robot's environment, including a charging station]
The “Decision” Aspect: The Policy
The only way the agent has control over this system is through the choice of actions.
This is done by specifying a policy
$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s).$$
Deterministic policies: $\pi(a \mid s) \in \{0, 1\}$ for all $s, a$.
Then we write $a = \pi(s)$, i.e. deterministic policies are functions $\pi : \mathcal{S} \to \mathcal{A}$.
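A minimal sketch of how a stochastic policy can be represented and sampled in Python, assuming a made-up table $\pi(a \mid s)$; none of the names come from the slides.

```python
import numpy as np

# A made-up stochastic policy pi(a | s), stored as a table of probabilities.
pi = {
    "s1": {"left": 0.3, "right": 0.7},
    "s2": {"left": 1.0, "right": 0.0},  # effectively deterministic in s2: pi(s2) = "left"
}

rng = np.random.default_rng(0)

def sample_action(s):
    """Sample A_t ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return actions[rng.choice(len(actions), p=probs)]

print(sample_action("s1"))  # "left" or "right", with probabilities 0.3 / 0.7
```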
The Goal of Choosing a Policy: Returns
We want to maximize long-term rewards…
Define the return
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$
Here, $\gamma \in [0, 1]$ is the discount rate.
This includes both finite and infinite time MDPs.
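As a quick worked example (not from the slides), the snippet below evaluates the discounted return $G_0$ for an assumed reward sequence and discount rate.

```python
# Discounted return G_0 = sum_k gamma^k * R_{k+1} for an assumed reward sequence.
gamma = 0.9                       # assumed discount rate
rewards = [1.0, 0.0, -1.0, 10.0]  # assumed rewards R_1, R_2, R_3, R_4

G0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G0)  # 1 + 0 - 0.81 + 7.29 ≈ 7.48
```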
The Objective of RL
The goal of RL is to maximize, by choosing a good policy $\pi$, the expected return
$$\mathbb{E}_\pi[G_0 \mid S_0 = s],$$
where we start from some state $s$.
We will consider time-homogeneous cases where this is the same as
$$\mathbb{E}_\pi[G_t \mid S_t = s] \quad \text{for any } t.$$
Dynamic Programming
Example
[Diagram: a small directed graph whose edges carry rewards (+2, -1, +3, …); the task is to find a path that maximizes the total collected reward.]
How long does it take to check all possibilities?
The Curse of Dimensionality
A term coined by R. Bellman (1957)
The number of states grows exponentially when the
dimensionality of the problem increases
Can we have a non-brute-force algorithm?
Dynamic Programming Principle
On an optimal path (following the optimal policy), if we start at any state on that path, the rest of the path must again be optimal.
Dynamic Programming in Action
Define the value of a state as the maximal total reward that can be collected starting from that state.
[Diagram: the same graph as before, with the value of each state computed backwards from the terminal state by adding edge rewards and taking maxima.]
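Since the diagram itself is not reproduced here, the following sketch illustrates the same backward-induction idea on a small made-up directed graph: the value of a node is the best edge reward plus the value of the node it leads to, and acting greedily on these values recovers the optimal path. The graph and rewards are assumptions, not the slide's.

```python
# Backward induction (dynamic programming) on a tiny, made-up DAG.
# edges[u] = list of (v, reward) for each edge u -> v; "goal" is terminal.
edges = {
    "start": [("a", 2.0), ("b", -1.0)],
    "a":     [("b", 3.0), ("goal", 1.0)],
    "b":     [("goal", 4.0)],
    "goal":  [],
}

value = {}   # value[u] = best total reward collectible starting from u
best = {}    # best[u]  = the greedy (optimal) next node from u

def V(u):
    """Value of node u, computed recursively and cached."""
    if u in value:
        return value[u]
    if not edges[u]:            # terminal node
        value[u] = 0.0
        return 0.0
    # Bellman-style recursion: best one-step reward plus downstream value.
    v_next, val = max(((v, r + V(v)) for v, r in edges[u]), key=lambda x: x[1])
    value[u], best[u] = val, v_next
    return val

print(V("start"))  # 9.0, via the path start -> a -> b -> goal
print(best)        # greedy choices: {'b': 'goal', 'a': 'b', 'start': 'a'}
```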
The Complexity of Dynamic Programming
We have shown that brute-force search takes a number of steps that grows exponentially with the size of the problem.
What about dynamic programming?
Summary of Key Ideas
• Come up with a measure of the “value” of each state
• Come up with a recursive way to compute the value
• Find the optimal policy by acting greedily according to the value
Bellman’s Equations and Optimal Policies
Value Function
As motivated earlier, we define the value function
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
and the action value function
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a].$$
Our goal: derive a recursion for $v_\pi$ and $q_\pi$.
These are known as Bellman’s equations.
Relationship between $v_\pi$ and $q_\pi$
Using the definitions, we can show the following relationships:
$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)$$
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big]$$
Combining, we get
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big].$$
This is known as Bellman’s equation for the value function.
Bellman’s Equation
For finite MDPs, Bellman’s equation can be written in matrix form as
$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
where $[P_\pi]_{s s'} = \sum_{a} \pi(a \mid s) \sum_{r} p(s', r \mid s, a)$ and $[r_\pi]_s = \sum_{a} \pi(a \mid s) \sum_{s', r} r\, p(s', r \mid s, a)$.
This is a linear equation, and we can show that there exists a unique solution for $v_\pi$.
In fact, it is just
$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi,$$
whose existence and uniqueness follow from the invertibility of $I - \gamma P_\pi$, which in turn follows from $\gamma < 1$ (since $P_\pi$ is a stochastic matrix, its eigenvalues have modulus at most 1).
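As an illustration of this closed-form solution (a sketch with assumed data, not lecture code), the snippet below computes $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ with numpy for a made-up two-state $P_\pi$ and $r_\pi$.

```python
import numpy as np

# Made-up policy-induced quantities for a 2-state MDP (illustrative assumptions).
P_pi = np.array([[0.8, 0.2],    # [P_pi]_{s s'} = Pr(next state s' | state s) under pi
                 [0.3, 0.7]])
r_pi = np.array([1.0, -0.5])    # [r_pi]_s = expected one-step reward in state s under pi
gamma = 0.9                     # discount rate

# Solve the linear Bellman equation v = r + gamma * P v, i.e. (I - gamma P) v = r.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)

# Sanity check: v_pi satisfies Bellman's equation.
assert np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi)
```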
Bellman’s Equation for Action-Value Function
Using similar methods, one can show that the action value function satisfies a similar recursion
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Big].$$
Exercise: derive this equation and show that there exists a unique solution.
Comparing Policies
We can compare policies via their values
• Given two policies $\pi, \pi'$, we say $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s$
• This is a partial order
Examples
• If one policy's value is at least as large as the other's at every state, then the policies are comparable, e.g. $\pi \geq \pi'$
• If each policy has the strictly larger value at some state, then neither $\pi \geq \pi'$ nor $\pi' \geq \pi$ holds
Optimal Policy
We define an optimal policy $\pi_*$ to be any policy satisfying
$$\pi_* \geq \pi \quad \text{for all policies } \pi.$$
In other words, $v_{\pi_*}(s) \geq v_\pi(s)$ for all $s$ and all $\pi$.
• Does such a $\pi_*$ exist?
• Is it unique?
Policy Improvement
We can derive the following result:
For any two policies $\pi, \pi'$, if
$$q_\pi(s, \pi'(s)) \geq v_\pi(s) \quad \text{for all } s,$$
then we must have
$$v_{\pi'}(s) \geq v_\pi(s) \quad \text{for all } s.$$
In addition, if the first inequality is strict for some $s$, then the second inequality is strict for at least one $s$.
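To connect this result to computation, here is a small sketch (with an assumed action-value table) of the greedy improvement step: from $q_\pi$, build the deterministic policy $\pi'(s) = \arg\max_a q_\pi(s, a)$, which by the result above is at least as good as $\pi$.

```python
# Greedy policy improvement from a made-up action-value table q_pi[s][a].
q_pi = {
    "s1": {"left": 0.2, "right": 1.5},
    "s2": {"left": 0.7, "right": 0.1},
}

# pi'(s) = argmax_a q_pi(s, a): in each state, pick the action with the largest value.
pi_improved = {s: max(action_values, key=action_values.get)
               for s, action_values in q_pi.items()}
print(pi_improved)  # {'s1': 'right', 's2': 'left'}
```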
Bellman’s Optimality Condition
A policy $\pi$ is optimal if and only if for any state-action pair $(s, a)$ such that $\pi(a \mid s) > 0$, we have
$$q_\pi(s, a) = \max_{a'} q_\pi(s, a').$$
This means that an optimal policy must choose an action that maximizes its associated action value function.
This then implies the existence of an optimal policy!
Bellman’s Optimality Equation
Corresponding to an optimal policy $\pi_*$, with $v_* = v_{\pi_*}$ and $q_* = q_{\pi_*}$, we obtain the following recursions
$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_*(s')\big]$$
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma \max_{a'} q_*(s', a')\big]$$
These are known as Bellman’s optimality equations.
Some Remarks
The optimal value function $v_*$ is unique. Is an optimal policy unique?
Observe that the policy generated from $v_*$ (or $q_*$) by acting greedily can be taken to be deterministic.
In fact, for every policy $\pi$ there exists a deterministic policy $\pi'$ such that $\pi' \geq \pi$!
Summary
We introduced
• Basic formulation of reinforcement learning
• MDP as the mathematical framework
• Bellman’s equations characterizing optimal policies
Next time: algorithms to solve RL problems