NPTEL
Video Course on Machine Learning
Professor Carl Gustaf Jansson, KTH
Week 5 Machine Learning enabled
by prior Theories
Video 5.4 Reinforcement Learning – Part 1 - Introduction
Reinforcement Learning (RL)
The intuitive scenario for Reinforcement Learning is an Agent that
learns from interaction with an Environment to achieve some
long-term goals related to the State of the environment by performing
a sequence of actions and by receiving feedback.
The mapping from states to possible actions is called a
Policy.
The achievement of goals is defined by Rewards or reward signals
being the feedback on actions from the environment.
The Return is defined as the cumulative (possibly discounted)
rewards over an Episode, i.e. an action sequence leading to a
Terminal state.
The goal of any RL algorithm is to establish a policy that maximizes
the Returns.
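As a small worked illustration of a discounted return (the discount factor and reward values here are assumed for illustration, not taken from the lecture): with γ = 0.9 and an episode whose per-step rewards are 0, 0 and +1,

    G = 0 + γ·0 + γ^2·1 = 0 + 0.9·0 + 0.81·1 = 0.81

so the later a reward arrives, the less it contributes to the Return.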
[Figure: Micromouse]
Reinforcement Learning terminology
Environment: The 'microworld' defined for the particular RL problem, including the agent. Designated as E.
Agent: The entity that learns from interaction with the environment. Designated as A.
State: A particular configuration of the agent within the environment. Designated as s.
Terminal states: Defined end states for the particular RL problem. Designated as TS.
Action: An agent selects an action a based upon the current state s and the policy π.
Policy: A policy π is a mapping from states of the environment to the potential actions of an agent in those states. π(s) can be deterministic, depending only on s, or stochastic, π(s,a), depending also on a.
Transition probability function: T(s′|s,a) specifies the probability that the environment will transition to state s′ if the agent takes action a in state s. When the environment is deterministic this reduces to a mapping s′ = T(s,a).
Episode: Sometimes called an epoch. A sequence of states, actions and rewards which ends in a terminal state.
Reward or reward signal: The reward R(s′|s,a) gives feedback from the environment on the effect of a single action a taken in state s and leading to s′.
Discounted reward: When calculating the Return, the expected rewards for future steps can be weighted with a discount factor γ in the interval [0,1].
Return: The accumulated rewards for an episode. Designated as G.
Value function: The value function V = V(s) is an estimate of the Value or Utility of s with respect to its average Return over all episodes possible under the current policy. It must continually be re-estimated as actions are taken. The state-value function v(s) for a policy π satisfies the Bellman equation given later in this lecture. Note that the value of a terminal state (if any) is always zero.
Model of the environment: A set of tuples T(s′|s,a) and R(s′|s,a) representing the possible state transitions and the rewards given.
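As a minimal sketch of how this terminology can be mapped to data structures (the state names, action names and reward values below are illustrative assumptions, not part of the lecture), a deterministic model of the environment can be stored as two lookup tables:

    # Illustrative, assumed model of a tiny deterministic environment.
    # T maps (state, action) -> next state; R maps (state, action, next_state) -> reward.
    T = {("s0", "right"): "s1", ("s1", "right"): "terminal"}
    R = {("s0", "right", "s1"): 0.0, ("s1", "right", "terminal"): 1.0}

    # A deterministic policy is simply a mapping from states to actions.
    policy = {"s0": "right", "s1": "right"}

    def step(state, action):
        """Apply the model: return (next_state, reward) for taking action in state."""
        next_state = T[(state, action)]
        return next_state, R[(state, action, next_state)]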
Example 1 - the 4x3 world
[Figure: the 4x3 board, with the start position at (1,1), the excluded position (2,2), +1 at (4,3) and -1 at (4,2)]
Environment and agent: The 4x3 world is a board of 4 * 3 positions, indexed by two coordinates. The agent has a start position of (1,1). The position (2,2) is excluded.
Reward: There are two terminal positions, (4,3) and (4,2). Reaching (4,3) gives a reward = +1. Reaching (4,2) gives a reward = -1. Reaching all other positions gives reward = 0.
Actions: up, down, left, right, but restricted by the board configuration. The transitions are deterministic in the sense that an action in a specific state can only lead to one other state.
Episode: e.g. a sequence of actions leading from (1,1) to (4,3) via (1,2), (1,3), (2,3) and (3,3).
Examples of episodes with returns.
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3)    Return = +1
(1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (3,2) → (3,3) → (4,3)    Return = +1
(1,1) → (2,1) → (3,1) → (3,2) → (4,2)    Return = -1
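The returns of the episodes above can be reproduced with a small sketch (assuming, as in the example, reward +1 for entering (4,3), -1 for entering (4,2) and 0 otherwise; the function name and the default γ = 1 are choices made here, not given in the lecture):

    # Reward received when entering a position of the 4x3 world (0 for all other positions).
    terminal_rewards = {(4, 3): 1.0, (4, 2): -1.0}

    def episode_return(positions, gamma=1.0):
        """Accumulate the (optionally discounted) rewards along a sequence of visited positions."""
        g = 0.0
        for t, pos in enumerate(positions[1:]):      # a reward is given for each position entered
            g += (gamma ** t) * terminal_rewards.get(pos, 0.0)
        return g

    episode = [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]
    print(episode_return(episode))        # 1.0: the undiscounted return of this episode
    print(episode_return(episode, 0.9))   # about 0.66: the same episode with discount factor 0.9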
Reinforcement Learning modelled as Markov Decision Process (MDP)
A suitable model for the above intuitive scenario is a basic Markov Decision Process (MDP).
An MDP is typically defined by a 4-tuple (S, A, R, T) where:
• S is the set of states; s is the state of the environment (including the agent)
• A is the set of actions for each s that the agent can choose between, as defined by a policy π
• R(s′|s,a) is a function that returns the reward received for taking action a in state s leading to s′
• T(s′|s,a) is a transition probability function, specifying the probability that the environment will transition to state s′ if the agent takes action a in state s. The Markov property: transition probabilities depend on the current state only, not on the path to the state.
[Figure: state-transition diagram with terminal rewards +1 and -1. State transitions: a state transitions to its neighboring states, which in this case have equal probabilities that sum to 1.]
The goal is to find a policy π that maximizes the return = expected
future accumulated (discounted) reward for episodes from a Start
state s to a Terminal state.
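As a minimal sketch of how the 4-tuple can be represented in code (the state names, actions and probabilities below are illustrative assumptions, loosely mirroring the diagram's idea of neighboring states with probabilities that sum to 1):

    # T[(s, a)] is a probability distribution over next states s'; each distribution sums to 1.
    T = {("s0", "move"): {"s1": 0.5, "s2": 0.5},
         ("s1", "move"): {"s0": 0.25, "s2": 0.25, "plus": 0.5},
         ("s2", "move"): {"s0": 0.5, "minus": 0.5}}

    # R[(s, a, s')] is the reward for the transition from s to s' under action a (0 if not listed).
    R = {("s1", "move", "plus"): 1.0, ("s2", "move", "minus"): -1.0}

    # The Markov property is built in: the distribution depends only on (s, a),
    # never on how the agent reached s.
    for (s, a), dist in T.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9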
Dynamic Programming versus Reinforcement Learning
In one MDP scenario, there is a complete and exact model of the MDP in the
sense that T(s′|s,a) and R(s′|s,a) are fully defined.
In this case the MDP problem is a Planning Problem that can be solved exactly
by use of e.g. Dynamic Programming.
However, in many cases T(s′|s,a) and R(s′|s,a) are not completely known, meaning
that the model of the MDP is not complete.
In such cases we have a true Reinforcement Learning problem.
Dynamic Programming (DP)
Dynamic Programming is an algorithm design technique for optimization problems: often for minimizing or
maximizing. The method was developed by Richard Bellman in the 1950s.
Like divide and conquer, DP simplifies a complicated problem by breaking it down into simpler
sub-problems in a recursive manner and then combining the solutions to the sub-problems to form the total
solution.
If a problem can be optimally solved by breaking it recursively into sub-problems and then forming the
solution from optimal solutions to the sub-problems, then it is said to have optimal substructure.
Unlike divide and conquer, the sub-problems are not independent: sub-problems may share sub-sub-problems,
so each shared sub-problem needs to be solved only once and its solution can be reused.
Typically the utility value of a solution in a certain state is defined by a value function calculated recursively
from the value functions for the remaining steps of an episode, weighted by the relevant probabilities
and by a discount parameter that gives higher weight to closer reward contributions. Typically the policy function
can be calculated in a similar fashion.
The defining equation for finding an optimal value function is called the Bellman equation.
Dynamic Programming (DP) – motives behind name
Bellman´s principle of optimality and the Bellman equation
Dynamic programming algorithms have the ambition to optimize the following two functions:
Policy function: π(s) := argmax_a ∑_s′ T(s′|s,a) * ( R(s′|s,a) + γ * V(s′) )
Value function: V(s) := ∑_s′ T(s′|s,a) * ( R(s′|s,a) + γ * V(s′) ), with a = π(s)
Richard Bellman's Principle of Optimality
An optimal policy has the property that whatever the initial state and initial decision are, the remaining
decisions must constitute an optimal policy with regard to the state resulting from the first decision.
An optimal value function is a solution to the so-called Bellman equation, here written for deterministic transitions s′ = T(s,a_i):
V(s) = max_{i=1..n} ( R(s′|s,a_i) + γ * V(s′) ), where s′ = T(s,a_i)
An optimal value function can be the basis for an optimal policy function.
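As a worked illustration using the 4x3 world from earlier (deterministic transitions; γ = 1 is assumed here for simplicity): from state (3,3) the action right leads to the terminal state (4,3) with reward +1 and V(4,3) = 0, so

    V(3,3) = max_{i=1..n} ( R(s′|s,a_i) + γ * V(s′) ) = 1 + 1·0 = 1

and, working backwards, V(2,3) = 0 + 1·V(3,3) = 1, and so on towards the start state (1,1).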
Three standard approaches to solve the Bellman Equation
Value iteration (a Python sketch is given after this list)
Initialise V arbitrarily
repeat
Improve the estimate V(k+1) using the previous estimate V(k)
until convergence
Policy iteration
Initialise V and π arbitrarily
repeat
Evaluate V using π
Improve π using V
until convergence.
Linear programming
Prioritized sweeping - use priority queue to update states with large potential for change
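The value-iteration scheme above can be sketched in Python as follows (the function signature, the model representation with T[(s, a)] as a distribution over next states and R[(s, a, s')] as rewards, and the default parameters are assumptions made here for illustration, not the lecture's notation):

    def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
        """Repeatedly apply the Bellman backup until the value function stops changing."""
        V = {s: 0.0 for s in states}                       # initialise V arbitrarily (zeros here)
        while True:
            delta = 0.0
            for s in states:
                if not actions.get(s):                     # terminal states keep the value 0
                    continue
                best = max(sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                               for s2, p in T[(s, a)].items())
                           for a in actions[s])
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < theta:                              # convergence reached
                break
        # Extract a greedy policy from the converged value function.
        policy = {s: max(actions[s],
                         key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                           for s2, p in T[(s, a)].items()))
                  for s in states if actions.get(s)}
        return V, policy

Policy iteration alternates the same two ingredients explicitly: evaluate V under the current policy π, then improve π greedily with respect to V, and repeat until π stops changing.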
Simple example to illustrate Dynamic Programming: The shortest path in multi-stage graphs
The shortest path is (S, C, F, T) => 5+2+2 = 9. A greedy method gives (S, A, D, T) => 1+4+18 = 23.
Example with Dynamic programming approach
Forward approach
• d(S, T) = min{1+d(A, T), 2+d(B, T), 5+d(C, T)}
• d(A,T) = min{4+d(D,T), 11+d(E,T)} = min{4+18, 11+13} = 22.
• d(B, T) = min{9+d(D, T), 5+d(E, T), 16+d(F, T)} = min{9+18, 5+13, 16+2} = 18.
• d(C, T) = min{ 2+d(F, T) } = 2+2 = 4
• d(S, T) = min{1+d(A, T), 2+d(B, T), 5+d(C, T)} = min{1+22, 2+18, 5+4} = 9.
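The same backward recursion can be written as a short memoized sketch (the dictionary below encodes exactly the edge weights used in the example; the function name d mirrors the notation above):

    from functools import lru_cache

    # Edge weights of the multi-stage graph from the example.
    edges = {"S": {"A": 1, "B": 2, "C": 5},
             "A": {"D": 4, "E": 11},
             "B": {"D": 9, "E": 5, "F": 16},
             "C": {"F": 2},
             "D": {"T": 18},
             "E": {"T": 13},
             "F": {"T": 2}}

    @lru_cache(maxsize=None)
    def d(node):
        """Shortest distance from node to the target T, computed recursively with memoization."""
        if node == "T":
            return 0
        return min(weight + d(succ) for succ, weight in edges[node].items())

    print(d("S"))   # 9, corresponding to the path S -> C -> F -> T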
Markov Decision Process (MDP) can be generalized to
Partially Observable Markov Decision Process (POMDP)
A POMDP models an agent decision process in which it is assumed that
the system dynamics are determined by an MDP, but the agent cannot
directly observe the underlying state. Instead, it must maintain a
probability distribution over the set of possible states, based on a set
of observations, the observation probabilities and the underlying MDP.
POMDPs enlarge the applicability of model-based MDPs.
A POMDP has two additional elements on top of the MDP elements:
• A set of observations
• A set of observation probabilities.
Because the agent does not directly observe the environment's state, it
must make decisions under uncertainty about the true environment state.
The details of POMDP are outside the scope of this lecture.
To be continued in Part 2