Foundations of Machine Learning
DSA 5102 • Lecture 11
Li Qianxiao
Department of Mathematics
So far
We introduced two classes of machine learning problems
• Supervised Learning
• Unsupervised Learning
Today, we will look at another class of problems that lies
somewhere in between, called reinforcement learning
Motivation
Some general observations of learning:
• Interactions with the environment
• Learning from experience
• Reward vs demonstrations
• Planning
The Reward Hypothesis
All of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Examples
• Studying and getting good grades
• Learning to play a new musical instrument
• Winning at chess
• Navigating a maze
• An infant learning to walk
The Basic Components
[Diagram: the agent–environment interaction loop]
• Agent
• Environment
• Interpreter
• State
• Action
• Reward
Examples
• Chess: agent = player, environment = board state, interpreter = vision, reward = win/loss at the end
• Learning to walk: agent = infant, environment = the world, interpreter = senses, reward = not falling, getting to places
• Navigating a maze: agent = player, environment = the maze, interpreter = vision, reward = getting out of the maze
Key Differences in Reinforcement
Learning
• Vs unsupervised learning: not completely unsupervised due
to a reward signal
• Vs supervised learning: not completely supervised, since
optimal actions to take are never given
Example: The Recycling Robot
Actions
• Search for cans
• Pick up or drop cans
• Stop and wait
• Go back and charge
Rewards:
• +10 for each can picked up
• -1 for each meter moved
• -1000 for running out of battery
The Reinforcement Learning Problem
The RL problem can be posed as follows:
An agent navigates an environment through the lens of an
interpreter. It interacts with the environment through performing
actions, and the environment in turn provides the agent with a
reward signal. The agent's goal is to learn, through experience,
how to maximize the long-term accumulated reward.
Finite Markov Decision Processes
Finite State, Discrete Time Markov Chains
• Sequence of time steps: $t = 0, 1, 2, \dots$
• State space: $\mathcal{S} = \{s_1, s_2, \dots, s_n\}$ such that $|\mathcal{S}| = n < \infty$
• States: $S_t \in \mathcal{S}$
The states form a stochastic process, which evolves according to a transition probability
$$\Pr(S_{t+1} = s' \mid S_t, S_{t-1}, \dots, S_0).$$
Markov Property and Time Homogeneity
Markov Property
$$\Pr(S_{t+1} = s' \mid S_t, S_{t-1}, \dots, S_0) = \Pr(S_{t+1} = s' \mid S_t)$$
Time Homogeneous Markov Chain
• The transition probability is independent of time, i.e. $\Pr(S_{t+1} = s' \mid S_t = s) = P_{ss'}$ for all $t$
• The matrix $P = [P_{ss'}]_{s, s' \in \mathcal{S}}$ is called the transition (probability) matrix
Example
• State space: $\mathcal{S} = \{s_1, s_2, s_3\}$
• Transition probability: the probabilities of moving between $s_1, s_2, s_3$ can be collected into a $3 \times 3$ transition probability matrix $P$
[Diagram: a three-state Markov chain, from https://en.wikipedia.org/wiki/Markov_chain]
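To make the transition matrix concrete, here is a minimal Python sketch (an illustration, not from the lecture) that simulates a time-homogeneous Markov chain on three states; the matrix entries are made-up assumptions, chosen only so that each row sums to one.

```python
import numpy as np

# Hypothetical 3x3 transition matrix: P[i, j] = Pr(S_{t+1} = s_j | S_t = s_i).
# The numbers are illustrative assumptions; each row must sum to 1.
P = np.array([
    [0.90, 0.075, 0.025],
    [0.15, 0.80,  0.05],
    [0.25, 0.25,  0.50],
])

rng = np.random.default_rng(0)

def simulate(P, s0=0, steps=10):
    """Sample a trajectory S_0, S_1, ..., S_steps from the chain."""
    states = [s0]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(simulate(P))  # a list of 11 state indices, starting from state 0
```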
Non-Markovian or Non-time-homogeneous Stochastic Processes
Example of a non-Markovian process
• Drawing coins without replacement out of a bag containing 10 each of $1, 50c and 10c coins. Let $S_t$ be the total value of the coins drawn up to time $t$.
Example of a non-time-homogeneous process
• Drawing coins at time $t$ in a way that depends on $t$, so that the transition probability changes with time.
Essential Components of Markov Decision Processes
A Markov decision process (MDP) is a generalization of a Markov process, with actions and rewards.
Essential elements
• Sequence of time steps: $t = 0, 1, 2, \dots$
• States: $S_t \in \mathcal{S}$
• Actions: $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}(s)$ (union over all states $s$)
• Rewards: $R_t \in \mathcal{R} \subset \mathbb{R}$
State Evolution
[Diagram: the agent–environment loop. The agent observes the state $S_t$ and reward $R_t$ through the interpreter, chooses an action $A_t$, and the environment (via the interpreter) returns the next state $S_{t+1}$ and reward $R_{t+1}$.]
Transition Probability
For Markov chains, we have the transition probability
$$\Pr(S_{t+1} = s' \mid S_t = s).$$
For Markov decision processes, we additionally need to account for:
• The reward $R_{t+1}$
• The action $A_t$
Hence, we specify the MDP transition probability
$$p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a).$$
Markov Decision Processes
A Markov decision process (MDP) is the evolution of $(S_t, A_t, R_t)$ according to
$$p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a).$$
An MDP is finite if $\mathcal{S}$ and $\mathcal{R}$ are finite and $\mathcal{A}(s)$ is finite for each $s \in \mathcal{S}$.
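As a concrete (hypothetical) illustration of this data, the sketch below stores $p(s', r \mid s, a)$ for a tiny made-up MDP loosely inspired by the recycling robot; the states, actions, rewards and probabilities are all assumptions, not the lecture's.

```python
# A tiny, made-up finite MDP stored as p[(s, a)] = list of (prob, next_state, reward).
# States, actions, rewards and probabilities are illustrative assumptions.
p = {
    ("low", "search"):   [(0.6, "low", 1.0), (0.4, "dead", -10.0)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
    ("high", "search"):  [(0.7, "high", 1.0), (0.3, "low", 1.0)],
    ("high", "wait"):    [(1.0, "high", 0.5)],
}

def actions(s):
    """A(s): the actions available in state s."""
    return sorted({a for (state, a) in p if state == s})

# Sanity check: for each (s, a), the transition probabilities sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9

print(actions("low"))  # ['recharge', 'search']
```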
Example: The Recycling Robot
State: $s = $ (position, charge, weight)
Actions: the allowed set $\mathcal{A}(s)$ depends on the state, e.g.
• If the charge is zero, then $\mathcal{A}(s)$ is empty
• If the robot is at a position with a can (and is not yet full), then picking up the can is in $\mathcal{A}(s)$
• …
Reward: as listed earlier (+10 for each can picked up, -1 for each meter moved, -1000 for running out of battery)
[Diagram: the robot's environment, including a charging station]
The “Decision” Aspect: The Policy
The only way the agent has control over this system is through the choice of actions.
This is done by specifying a policy
$$\pi(a \mid s) = \Pr(A_t = a \mid S_t = s).$$
Deterministic policies: $\pi(a \mid s) \in \{0, 1\}$ for all $s, a$.
Then we write $a = \pi(s)$, i.e. deterministic policies are functions $\pi : \mathcal{S} \to \mathcal{A}$.
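A minimal sketch of how a stochastic policy can be represented and sampled in Python, assuming a made-up table $\pi(a \mid s)$; none of the names come from the slides.

```python
import numpy as np

# A made-up stochastic policy pi(a | s), stored as a table of probabilities.
pi = {
    "s1": {"left": 0.3, "right": 0.7},
    "s2": {"left": 1.0, "right": 0.0},  # effectively deterministic in s2: pi(s2) = "left"
}

rng = np.random.default_rng(0)

def sample_action(s):
    """Sample A_t ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return actions[rng.choice(len(actions), p=probs)]

print(sample_action("s1"))  # "left" or "right", with probabilities 0.3 / 0.7
```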
The Goal of Choosing a Policy: Returns
We want to maximize long-term rewards…
Define the return
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$
Here, $\gamma \in [0, 1]$ is the discount rate.
This includes both finite and infinite time MDPs.
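As a quick worked example (not from the slides), the snippet below evaluates the discounted return $G_0$ for an assumed reward sequence and discount rate.

```python
# Discounted return G_0 = sum_k gamma^k * R_{k+1} for an assumed reward sequence.
gamma = 0.9                       # assumed discount rate
rewards = [1.0, 0.0, -1.0, 10.0]  # assumed rewards R_1, R_2, R_3, R_4

G0 = sum(gamma**k * r for k, r in enumerate(rewards))
print(G0)  # 1 + 0 - 0.81 + 7.29 ≈ 7.48
```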
The Objective of RL
The goal of RL is to maximize, by choosing a good policy $\pi$, the expected return
$$\mathbb{E}_\pi[G_0 \mid S_0 = s],$$
where we start from some state $s$.
We will consider time-homogeneous cases where this is the same as
$$\mathbb{E}_\pi[G_t \mid S_t = s] \quad \text{for any } t.$$
Dynamic Programming
Example
[Diagram: a small directed graph whose edges carry rewards (+2, -1, +3, …); the task is to find a path that maximizes the total collected reward.]
How long does it take to check all possibilities?
The Curse of Dimensionality
A term coined by R. Bellman (1957)
The number of states grows exponentially when the
dimensionality of the problem increases
Can we have a non-brute-force algorithm?
Dynamic Programming Principle
On an optimal path (following the optimal policy), if we start at any state on that path, the rest of the path must again be optimal.
Dynamic Programming in Action
Define the value of a state as the maximal total reward that can be collected starting from that state.
[Diagram: the same graph as before, with the value of each state computed backwards from the terminal state by adding edge rewards and taking maxima.]
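Since the diagram itself is not reproduced here, the following sketch illustrates the same backward-induction idea on a small made-up directed graph: the value of a node is the best edge reward plus the value of the node it leads to, and acting greedily on these values recovers the optimal path. The graph and rewards are assumptions, not the slide's.

```python
# Backward induction (dynamic programming) on a tiny, made-up DAG.
# edges[u] = list of (v, reward) for each edge u -> v; "goal" is terminal.
edges = {
    "start": [("a", 2.0), ("b", -1.0)],
    "a":     [("b", 3.0), ("goal", 1.0)],
    "b":     [("goal", 4.0)],
    "goal":  [],
}

value = {}   # value[u] = best total reward collectible starting from u
best = {}    # best[u]  = the greedy (optimal) next node from u

def V(u):
    """Value of node u, computed recursively and cached."""
    if u in value:
        return value[u]
    if not edges[u]:            # terminal node
        value[u] = 0.0
        return 0.0
    # Bellman-style recursion: best one-step reward plus downstream value.
    v_next, val = max(((v, r + V(v)) for v, r in edges[u]), key=lambda x: x[1])
    value[u], best[u] = val, v_next
    return val

print(V("start"))  # 9.0, via the path start -> a -> b -> goal
print(best)        # greedy choices: {'b': 'goal', 'a': 'b', 'start': 'a'}
```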
The Complexity of Dynamic Programming
We have shown that brute-force search takes a number of steps that grows exponentially with the size of the problem.
What about dynamic programming?
Summary of Key Ideas
• Come up with a measure of the “value” of each state
• Come up with a recursive way to compute the value
• Find the optimal policy by acting greedily according to the value
Bellman’s Equations and Optimal Policies
Value Function
As motivated earlier, we define the value function
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
and the action value function
$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a].$$
Our goal: derive a recursion for $v_\pi$ and $q_\pi$.
These are known as Bellman’s equations.
Relationship between $v_\pi$ and $q_\pi$
Using the definitions, we can show the following relationships:
$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)$$
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big]$$
Combining, we get
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big].$$
This is known as Bellman’s equation for the value function.
Bellman’s Equation
For finite MDPs, Bellman’s equation can be written in matrix form as
$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$
where $[P_\pi]_{s s'} = \sum_{a} \pi(a \mid s) \sum_{r} p(s', r \mid s, a)$ and $[r_\pi]_s = \sum_{a} \pi(a \mid s) \sum_{s', r} r\, p(s', r \mid s, a)$.
This is a linear equation, and we can show that there exists a unique solution for $v_\pi$.
In fact, it is just
$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi,$$
whose existence and uniqueness follow from the invertibility of $I - \gamma P_\pi$, which in turn follows from $\gamma < 1$ (since $P_\pi$ is a stochastic matrix, its eigenvalues have modulus at most 1).
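As an illustration of this closed-form solution (a sketch with assumed data, not lecture code), the snippet below computes $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ with numpy for a made-up two-state $P_\pi$ and $r_\pi$.

```python
import numpy as np

# Made-up policy-induced quantities for a 2-state MDP (illustrative assumptions).
P_pi = np.array([[0.8, 0.2],    # [P_pi]_{s s'} = Pr(next state s' | state s) under pi
                 [0.3, 0.7]])
r_pi = np.array([1.0, -0.5])    # [r_pi]_s = expected one-step reward in state s under pi
gamma = 0.9                     # discount rate

# Solve the linear Bellman equation v = r + gamma * P v, i.e. (I - gamma P) v = r.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)

# Sanity check: v_pi satisfies Bellman's equation.
assert np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi)
```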
Bellman’s Equation for Action-Value Function
Using similar methods, one can show that the action value function satisfies a similar recursion
$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Big[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Big].$$
Exercise: derive this equation and show that there exists a unique solution.
Comparing Policies
We can compare policies via their values
• Given two policies $\pi, \pi'$, we say $\pi \geq \pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s$
• This is a partial order
Examples
• If one policy's value is at least as large as the other's at every state, then the policies are comparable, e.g. $\pi \geq \pi'$
• If each policy has the strictly larger value at some state, then neither $\pi \geq \pi'$ nor $\pi' \geq \pi$ holds
Optimal Policy
We define an optimal policy $\pi_*$ to be any policy satisfying
$$\pi_* \geq \pi \quad \text{for all policies } \pi.$$
In other words, $v_{\pi_*}(s) \geq v_\pi(s)$ for all $s$ and all $\pi$.
• Does such a $\pi_*$ exist?
• Is it unique?
Policy Improvement
We can derive the following result:
For any two policies $\pi, \pi'$, if
$$q_\pi(s, \pi'(s)) \geq v_\pi(s) \quad \text{for all } s,$$
then we must have
$$v_{\pi'}(s) \geq v_\pi(s) \quad \text{for all } s.$$
In addition, if the first inequality is strict for some $s$, then the second inequality is strict for at least one $s$.
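To connect this result to computation, here is a small sketch (with an assumed action-value table) of the greedy improvement step: from $q_\pi$, build the deterministic policy $\pi'(s) = \arg\max_a q_\pi(s, a)$, which by the result above is at least as good as $\pi$.

```python
# Greedy policy improvement from a made-up action-value table q_pi[s][a].
q_pi = {
    "s1": {"left": 0.2, "right": 1.5},
    "s2": {"left": 0.7, "right": 0.1},
}

# pi'(s) = argmax_a q_pi(s, a): in each state, pick the action with the largest value.
pi_improved = {s: max(action_values, key=action_values.get)
               for s, action_values in q_pi.items()}
print(pi_improved)  # {'s1': 'right', 's2': 'left'}
```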
Bellman’s Optimality Condition
A policy $\pi$ is optimal if and only if for any state-action pair $(s, a)$ such that $\pi(a \mid s) > 0$, we have
$$q_\pi(s, a) = \max_{a'} q_\pi(s, a').$$
This means that an optimal policy must choose an action that maximizes its associated action value function.
This then implies the existence of an optimal policy!
Bellman’s Optimality Equation
Corresponding to an optimal policy $\pi_*$, with $v_* = v_{\pi_*}$ and $q_* = q_{\pi_*}$, we obtain the following recursions
$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma\, v_*(s')\big]$$
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma \max_{a'} q_*(s', a')\big]$$
These are known as Bellman’s optimality equations.
Some Remarks
The optimal value function $v_*$ is unique. Is an optimal policy unique?
Observe that the policy generated from $v_*$ (or $q_*$) by acting greedily can be taken to be deterministic.
In fact, for every policy $\pi$ there exists a deterministic policy $\pi'$ such that $\pi' \geq \pi$!
Summary
We introduced
• Basic formulation of reinforcement learning
• MDP as the mathematical framework
• Bellman’s equations characterizing optimal policies
Next time: algorithms to solve RL problems