EE5076
Reinforcement
Learning
By: Theekshana Wijewardhana
Overview of this presentation
1. Introduction to RL
2. RL formalization
3. Concepts in RL
a. State
b. Action
c. Reward
d. Policy
e. Q function
4. Bellman’s equation
5. Introduction to OpenAI Gym
Introduction
Rewards
★Positive: Meat shop
★Negative: Scary Dog
Introduction
Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent takes actions, receives rewards or penalties, and continuously improves its strategy (policy) to maximize long-term rewards.
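This act-reward-learn loop can be sketched in a few lines of Python. The sketch below is illustrative only: "Environment" and "Agent" are hypothetical placeholder classes, not part of any library, and the stub dynamics exist just to make the loop runnable.

# A minimal sketch of the RL interaction loop described above.
# "Environment" and "Agent" are hypothetical placeholders, not a real API.

class Environment:
    def reset(self):
        """Return the initial state."""
        return 0

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        return 0, 0.0, True  # stub dynamics: one step, then done

class Agent:
    def act(self, state):
        """Pick an action for the current state (the policy)."""
        return 0

    def learn(self, state, action, reward, next_state):
        """Update the policy from the observed transition."""

env, agent = Environment(), Agent()
state = env.reset()
done = False
while not done:
    action = agent.act(state)                       # agent acts
    next_state, reward, done = env.step(action)     # environment responds with a reward
    agent.learn(state, action, reward, next_state)  # agent improves its policy
    state = next_state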
AI learns to park
Link: https://youtu.be/VMp6pq6_QjI
Reinforcement Learning formalization
State  :   1   2   3   4   5   6
Reward : 100   0   0   0   0  40
Move left   : R(4) + R(3) + R(2) + R(1) = 0 + 0 + 0 + 100 = 100
Move right  : R(4) + R(5) + R(6) = 0 + 0 + 40 = 40
Move random : R(4) + R(5) + R(4) + R(3) + R(2) + R(1) = 0 + 0 + 0 + 0 + 0 + 100 = 100
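These three sums are easy to check in code. A small Python sketch, assuming the six-state world above with rewards [100, 0, 0, 0, 0, 40] and 1-indexed states:

# Undiscounted return: the plain sum of rewards along a path of states.
rewards = [100, 0, 0, 0, 0, 40]  # rewards[s - 1] is R(s) for states 1..6

def undiscounted_return(path):
    return sum(rewards[s - 1] for s in path)

print(undiscounted_return([4, 3, 2, 1]))        # move left   -> 100
print(undiscounted_return([4, 5, 6]))           # move right  -> 40
print(undiscounted_return([4, 5, 4, 3, 2, 1]))  # move random -> 100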
The return in Reinforcement Learning
State  :   1   2   3   4   5   6
Reward : 100   0   0   0   0  40

Discount factor: γ ∈ (0, 1)
Move left   : R(4) + γR(3) + γ²R(2) + γ³R(1) = 0 + 0 + 0 + γ³·100
Move right  : R(4) + γR(5) + γ²R(6) = 0 + 0 + γ²·40
Move random : R(4) + γR(5) + γ²R(4) + γ³R(3) + γ⁴R(2) + γ⁵R(1) = 0 + 0 + 0 + 0 + 0 + γ⁵·100
The return in Reinforcement Learning
State  :   1   2   3   4   5   6
Reward : 100   0   0   0   0  40

Discount factor: γ = 0.9
Move left   : 0 + 0.9·0 + 0.9²·0 + 0.9³·100 = 72.9
Move right  : 0 + 0.9·0 + 0.9²·40 = 32.4
Move random : 0 + 0.9·0 + 0.9²·0 + 0.9³·0 + 0.9⁴·0 + 0.9⁵·100 = 59.049
The return in Reinforcement Learning
State  :   1   2   3   4   5   6
Reward : 100   0   0   0   0  40

Discount factor: γ = 0.1
Move left   : 0 + 0.1·0 + 0.1²·0 + 0.1³·100 = 0.1
Move right  : 0 + 0.1·0 + 0.1²·40 = 0.4
Move random : 0 + 0.1·0 + 0.1²·0 + 0.1³·0 + 0.1⁴·0 + 0.1⁵·100 = 0.001
With a small γ the agent becomes short-sighted: the nearby reward of 40 now beats the distant reward of 100, so moving right looks best.
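Adding the discount factor to the earlier check shows how γ reorders the three options. A short sketch reusing the reward list from before:

# Discounted return: R(s0) + γ·R(s1) + γ²·R(s2) + ...
rewards = [100, 0, 0, 0, 0, 40]

def discounted_return(path, gamma):
    return sum(gamma ** t * rewards[s - 1] for t, s in enumerate(path))

for gamma in (0.9, 0.1):
    print(gamma,
          discounted_return([4, 3, 2, 1], gamma),        # move left
          discounted_return([4, 5, 6], gamma),           # move right
          discounted_return([4, 5, 4, 3, 2, 1], gamma))  # move random
# γ = 0.9 -> 72.9, 32.4, 59.049 (left is best)
# γ = 0.1 -> 0.1, 0.4, 0.001    (right is best), up to float rounding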
Policies in Reinforcement Learning
State  :   1   2   3   4   5   6
Reward : 100   0   0   0   0  40

If I am in state 4,
➢ Should I move left?
➢ Should I move right?
To get the best long-term return?
Policies in Reinforcement Learning
State  :   1   2   3   4   5   6
Reward : 100   0   0   0   0  40

A policy helps the agent find the best action to take in each state:
State (s) → Policy (π) → Best Action (a)
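In a world this small, a policy can literally be a lookup table from state to action. A toy sketch; the all-"left" entries anticipate the optimal policy derived on the next slides:

# π as a plain dictionary: π(s) is just a lookup.
policy = {2: "left", 3: "left", 4: "left", 5: "left"}  # states 1 and 6 are terminal

def pi(state):
    return policy[state]

print(pi(4))  # -> "left"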
Reinforcement learning
Key concepts
1. Agent: The learner or decision maker, e.g., a self-driving car or a robot.
2. Environment: The world in which the agent operates.
3. State (s): The current situation of the agent in the environment.
4. Action (a): A choice the agent can make.
5. Reward (r): A numerical value given to the agent based on its action.
6. Policy (π): A strategy that defines how the agent selects actions.
7. Return: The cumulative discounted reward the agent collects from a given state.
8. Q function (Q(s,a)): The expected return of taking a particular action in a given state and behaving optimally afterwards.
Optimal policy
The optimal policy (denoted π*) is the strategy that achieves the maximum expected cumulative reward over time: it defines the best action to take in every state.
γ = 0.9
State                :   1     2      3      4      5      6
Reward               : 100     0      0      0      0     40
Return, always left  :   –    90     81     72.9   65.61   –
Return, always right :   –    26.24  29.16  32.4   36      –

π*(S4): go left    π*(S3): go left
(The "always left" return is larger in every non-terminal state, so π* moves left everywhere.)
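These returns are easy to reproduce: committing to "left" from state s collects 100 after s - 1 steps, and committing to "right" collects 40 after 6 - s steps. A sketch, assuming deterministic moves and γ = 0.9:

# Reproduce the table: discounted return of always-left vs. always-right.
gamma = 0.9

for s in range(2, 6):                     # non-terminal states 2..5
    always_left = gamma ** (s - 1) * 100  # s - 1 steps to state 1 (reward 100)
    always_right = gamma ** (6 - s) * 40  # 6 - s steps to state 6 (reward 40)
    print(s, round(always_left, 3), round(always_right, 3))
# state 2: 90.0   26.244  (the slide rounds to 26.24)
# state 3: 81.0   29.16
# state 4: 72.9   32.4
# state 5: 65.61  36.0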
State - Action value function/Q function
γ = 0.9
State       :   1     2      3      4       5      6
Reward      : 100     0      0      0       0     40
Q(s, left)  :   –    90     81     72.9    65.61   –
Q(s, right) :   –    72.9   65.61  59.049  36      –

Note: for states 2, 3, and 4, taking "right" once and then behaving optimally means turning back toward state 1, e.g. Q(4, right) = γ⁵·100 = 59.049.
Q(s, a) = the return if you:
1. Start in state s
2. Take action a (once)
3. Behave optimally after that

Example:
Q(4, right) = R(4) + γR(5) + γ²R(4) + … + γ⁵R(1)
Q(4, right) = R(4) + γ[R(5) + γR(4) + … + γ⁴R(1)]
Q(4, right) = R(4) + γ max[Q(5, left), Q(5, right)]
Bellman's Equation
Q(s, a) = R(s) + γ max_a′ Q(s′, a′)

s    : current state
R(s) : reward of the current state
a    : current action
γ    : discount factor
s′   : state you reach after taking action a
a′   : action you take in state s′
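Bellman's equation also gives an iterative way to compute these Q-values: start from zeros and repeatedly apply the update Q(s, a) ← R(s) + γ max_a′ Q(s′, a′) until nothing changes (value iteration). A minimal sketch for the six-state world, where a terminal state is simply worth its own reward:

# Q-value iteration with Bellman's equation on the six-state line world.
rewards = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}
gamma = 0.9
actions = {"left": -1, "right": +1}

# Q[s][a] for the non-terminal states 2..5, initialised to zero.
Q = {s: {a: 0.0 for a in actions} for s in range(2, 6)}

def value(s):
    """max_a Q(s, a); terminal states 1 and 6 are worth their reward."""
    return rewards[s] if s in (1, 6) else max(Q[s].values())

for _ in range(100):  # far more sweeps than this tiny world needs to converge
    for s in Q:
        for a, move in actions.items():
            Q[s][a] = rewards[s] + gamma * value(s + move)

print(Q[4])  # -> left ≈ 72.9, right ≈ 59.049 (right once, then back left)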
Problem
When an agent is introduced to a new environment, it does not initially know the terminal states or the optimal policy to follow. So how can the agent determine the Q-values for a given state?
The agent should be capable of:
- Exploring the environment
- Learning to estimate the Q-value for each possible action in a given state
- Developing a policy that selects the optimal action for each state
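These three capabilities are exactly what tabular Q-learning with ε-greedy exploration provides, and OpenAI Gym (now maintained as Gymnasium) supplies ready-made environments to try it on. A minimal sketch, assuming the gymnasium package is installed (pip install gymnasium); the hyperparameter values are arbitrary illustrative choices:

# Tabular Q-learning with ε-greedy exploration on FrozenLake (Gymnasium).
import random
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = [[0.0] * n_actions for _ in range(n_states)]

alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Explore with probability ε, otherwise exploit current Q estimates.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update: nudge Q(s, a) toward r + γ max_a' Q(s', a').
        target = reward + gamma * max(Q[next_state])
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

# Read off the greedy policy from the learned Q-table.
print([max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)])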
Thank you