2021/09/30
Reinforcement learning:
Basic concepts and Applications in networking
Nguyen Phi Le
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Example
Interactive learning
◦ Examples: Multi-Armed Bandit, traffic light control, wireless sensor networks (sensor nodes reporting to a base station)
◦ The learner is not told which actions to take → trial and error
◦ An action may also affect future situations → delayed reward
Definition
◦ Reinforcement learning
◦ Learning what to do, i.e., how to map situations to actions, to achieve the goal
◦ Reward hypothesis
◦ That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward) (Richard S. Sutton, RL: An Introduction, 2018)
Goal and reward
◦ Reward: a scalar signal received from the environment at each step
◦ Goal: maximizing cumulative reward in the long run
◦ Examples
◦ Multi-Armed Bandit: Goal = maximizing total gain; Reward = gain at every turn
◦ Traffic light control: Goal = decreasing traffic jams; Reward = waiting time, queue length, lane speed, …
◦ Wireless sensor networks: Goal = increasing network lifetime; Reward = load balancing, route length, …
RL vs other learning techniques
◦ RL vs supervised learning
◦ Supervised learning: learning from a training set of labeled examples
◦ Know the true action to take
◦ Reinforcement learning: do not know the optimal action
◦ RL vs unsupervised learning
◦ Unsupervised learning: finding structure hidden in collections of unlabeled data
◦ Reinforcement learning: maximizing the reward signal
RL framework
◦ Policy: a mapping from states to actions
◦ may be stochastic, specifying probabilities for each action
◦ Reward signal: the goal of a reinforcement learning problem
◦ objective is to maximize the total reward the agent receives over the long run
◦ Value: the total amount of cumulative reward over the future, starting from that state
◦ Reward: immediate; Value: long-run
◦ Action: can be any decision we want to learn how to make
◦ State: can be anything we can know that might be useful in making it
[Diagram: the agent observes a state, its policy selects an action, and the environment returns a reward signal and a new state]
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Value-based RL
◦ RL's objective: making decisions
◦ Input: state
◦ Output: action
◦ How to decide on an action?
◦ Using an action selection policy
◦ Based on action/state values
◦ How to measure the values?
◦ Value functions: estimate the goodness of states/actions based on experience
◦ Exploration-exploitation strategy
[Diagram: the agent-environment loop as before, with a value function informing the policy's action selection]
Value functions
◦ Key point of value-based RL models
◦ Questions?
◦ What can be used to measure the goodness of an action, state?
◦ How to measure?
Goal: maximize the long-term cumulative reward
$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^k R_{t+k+1} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
($G_t$: the reward accumulated after performing action $a$ at time $t$, following policy $\pi$ from time $t+1$ onward)
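To make the return concrete, here is a minimal Python sketch (not from the slides; the rewards and discount factor are made up) that computes the discounted return $G_t$ for a finite list of sampled rewards:

```python
# Minimal sketch: the discounted return G_t = R_{t+1} + gamma*R_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * rewards[k] over a finite episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example with made-up rewards R_{t+1}, R_{t+2}, R_{t+3}:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```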
Value functions
◦ State value function: how good it is for the agent to be in a given state
◦ The expected return when starting in $s$ and following policy $\pi$ thereafter
◦ Action value function: how good it is to perform a given action in a given state
◦ The expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$
State value function for policy $\pi$:
$v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$
Action value function for policy $\pi$:
$q_\pi(s,a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$
Action value estimation
◦ $q_\pi(s,a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$
◦ How to estimate $q_\pi(s,a)$?
◦ Computing the expectation exactly would require performing the same action in the same state infinitely many times → impossible
◦ Solution
◦ Replace the expected value with a sample-based estimate
◦ Estimate $\mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$ from a few samples
◦ Estimation method: iterative estimation
◦ Initialize $q_\pi(s,a)$
◦ Update $q_\pi(s,a)$ after performing one action and receiving a reward, blending the current $q_\pi(s,a)$ with a sample estimate $\hat{q}_\pi(s,a)$ of the expected return:
new $q_\pi(s,a) \leftarrow (1-\alpha)\times$ old $q_\pi(s,a) + \alpha\,\hat{q}_\pi(s,a)$
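A small illustrative sketch of the iterative estimate above, blending the current value with each new sample (all numbers and the step size are made up):

```python
# Incremental estimate: new_q = (1 - alpha) * old_q + alpha * sample_estimate
def update_estimate(old_q, sample, alpha=0.1):
    return (1 - alpha) * old_q + alpha * sample

q = 0.0                                   # initialize q_pi(s, a)
for sample in [1.0, 0.5, 1.5, 1.0]:       # made-up sampled returns for one (s, a) pair
    q = update_estimate(q, sample)
print(q)                                  # the estimate drifts toward the sample average
```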
Action value estimation
◦ How to estimate the action value?
◦ Bellman equation
$q_\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \sum_{k=1}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \gamma\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
(the action value of $(s,a)$ is the expectation of the immediate reward after performing action $a$ in state $s$, plus the discounted action value at the next step)
Action value estimation
◦ Bellman equation
◦ $q_\pi(s,a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
◦ The sampled quantity $r + \gamma Q(s',a')$ serves as the estimate $\hat{q}_\pi(s,a)$ of this expectation:
new $q_\pi(s,a) = (1-\alpha)\times$ current $q_\pi(s,a) + \alpha\,\hat{q}_\pi(s,a)$
$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma Q(s',a')\right]$ (SARSA)
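A hedged sketch of the SARSA update above, with the Q-table stored as a Python dict keyed by (state, action); all names are illustrative:

```python
# SARSA: the target uses the action a_next actually taken in s_next.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q.get((s_next, a_next), 0.0)        # sampled estimate of q_pi(s, a)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return Q
```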
Action value estimation
Bellman equation
$q_\pi(s,a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
$q_{\pi^*}(s,a) = \mathbb{E}_{\pi^*}\left[R_{t+1} + \gamma q_{\pi^*}(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_{\pi^*}\left[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
Estimation: $\hat{q}_{\pi^*}(s,a) = r + \gamma \max_{a'} Q(s',a')$
$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]$ (Q-learning)
$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma Q(s',a')\right]$ (SARSA)
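For comparison, a matching sketch of the Q-learning update, which differs from SARSA only in using the max over next actions (again, names are illustrative):

```python
# Q-learning: the target uses max over the next actions instead of the action actually taken.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q
```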
State value estimation
Bellman equation for the state-value function
$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$
$= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \gamma\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_t = s\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right]$
New $v_\pi(s) = (1-\alpha)\times$ current $v_\pi(s) + \alpha\times$ estimate
$V(S_t) \leftarrow (1-\alpha)V(S_t) + \alpha G_t$ (Monte Carlo)
$V(S_t) \leftarrow (1-\alpha)V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1})\right)$ (Temporal Difference, TD)
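The two state-value updates above can be sketched the same way (V is a dict keyed by state; values and step sizes are illustrative):

```python
def mc_update(V, s, G, alpha=0.1):
    # Monte Carlo: move V(s) toward the full observed return G_t
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * G

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # TD(0): move V(s) toward the bootstrapped target R_{t+1} + gamma * V(S_{t+1})
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * (r + gamma * V.get(s_next, 0.0))
```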
Action selection policy
◦ Based on value functions
◦ Exploitation – exploration strategy
◦ Choose the action $a$ whose $Q(s,a)$ is greatest ← exploitation
◦ Problem: can get stuck in a local maximum → need to try new actions ← exploration
◦ Some exploration methods
◦ 𝜖-greedy
◦ Time-adaptive epsilon (annealing 𝜖)
◦ 𝜖-soft
◦ Value-adaptive epsilon
◦ Sigmoid epsilon
◦ Optimistic initialization
◦ Upper confidence bound (UCB)
𝜖-greedy
◦ Define a small number 𝜖 (e.g., 𝜖 = 0.01)
◦ Each time the agent needs to choose an action
◦ Generate a probability $p \in (0,1)$
◦ $A_t = \begin{cases}\arg\max_a q(s,a), & \text{if } p \geq \epsilon\\ \text{a random action}, & \text{otherwise}\end{cases}$
[Figure: expected selection frequency of the best action vs. the other actions under 𝜖-greedy with 𝜖 = 0.1]
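A minimal sketch of the 𝜖-greedy rule above (q_values maps each action to its current estimate; the interface is an assumption for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon=0.01):
    if random.random() >= epsilon:                     # exploit with probability 1 - epsilon
        return max(q_values, key=q_values.get)         # argmax_a q(s, a)
    return random.choice(list(q_values))               # explore: a random action
```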
𝜖-greedy multi-armed bandit
◦ Action space: $A = \{a_1, \dots, a_k\}$
◦ State space: a single state
◦ Reward: the gain received
◦ Value function
◦ $Q_t(a) \doteq \dfrac{\text{sum of rewards when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}$
◦ Exploration strategy: 𝜖-greedy
(Goal = maximizing total gain; Reward = gain at every turn)
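Putting the pieces together, here is a toy 𝜖-greedy bandit loop with sample-average value estimates; the arm reward probabilities are invented for the example:

```python
import random

def run_bandit(arm_probs=(0.2, 0.5, 0.8), steps=1000, epsilon=0.1):
    k = len(arm_probs)
    Q = [0.0] * k            # Q_t(a): average gain of arm a so far
    N = [0] * k              # number of times arm a was pulled
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                              # explore
        else:
            a = max(range(k), key=lambda i: Q[i])                # exploit
        r = 1.0 if random.random() < arm_probs[a] else 0.0       # Bernoulli gain
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                                # incremental sample average
        total += r
    return Q, total

print(run_bandit())
```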
Exploration vs Exploitation
The higher the 𝜖, the more exploration
→ the more unstable the average reward
Exploration vs Exploitation
The frequency of selecting the optimal action is inversely proportional to 𝜖
Optimistic initialization
◦ Initialize $Q_0(a) = X\ \forall a$ ($X$ is a significantly large number)
◦ $Q_t(a) \doteq \dfrac{\text{sum of rewards when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}$
◦ Action selection strategy
◦ At every step $t$: choose $A_t = \arg\max_a Q_t(a)$
◦ Guarantees that every action is chosen at least once
Optimistic initialization
◦ High exploration at first
◦ But totally greedy afterwards
◦ An appropriate value of $X$ is needed
Upper confidence bound (UCB)
◦ Actions that have been selected frequently → should be explored less
◦ The confidence bound is defined by the number of times the action has not been selected
◦ Prioritizing actions with
◦ High estimated action value → exploitation
◦ High confidence bound → exploration
[Figure: each action's estimated value plus its confidence bound]
Upper confidence bound (UCB)
◦ Mathematical expression
◦ The number of times an action $a$ has been taken: $N_t(a)$
◦ The total number of time steps so far: $N_t$
◦ Confidence bound
◦ $CB_t(a) = c\sqrt{\dfrac{\ln N_t}{N_t(a)}}$ ($c$: a constant)
◦ Action selection policy
◦ At every step $t$: choose $A_t = \arg\max_a \left[Q_t(a) + c\sqrt{\dfrac{\ln N_t}{N_t(a)}}\right]$
◦ Intuition
◦ If an action $a$ has not been selected for a long time → $N_t(a)$ stays small relative to $N_t$ → $CB_t(a)$ increases → $a$ will be prioritized in the next steps
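A sketch of UCB selection using the quantities defined above; the $\sqrt{\ln N_t / N_t(a)}$ term is the standard UCB form and is my reading of the slide's confidence bound:

```python
import math

def ucb_select(Q, N, t, c=2.0):
    # Try each arm once first, then pick argmax_a [ Q(a) + c*sqrt(ln(t) / N(a)) ].
    for a in range(len(Q)):
        if N[a] == 0:
            return a
    return max(range(len(Q)), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```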
Other exploration strategies
◦ Time-adaptive epsilon (annealing 𝜖)
◦ $\epsilon = \dfrac{1}{\max(\textit{time}, c)}$
◦ $c$: a small positive parameter
◦ $\textit{time}$: the total number of actions performed so far → the larger $\textit{time}$, the smaller 𝜖 → the more experience, the less exploration
◦ 𝜖-soft
◦ Guarantees that the probability of choosing any action is at least $\dfrac{\epsilon}{|A_s|}$
◦ $A_s$: the action space
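A brief sketch of these two schedules; the max(time, c) form of the annealed 𝜖 is my reconstruction of the slide's formula:

```python
def annealed_epsilon(time, c=0.01):
    # Exploration probability shrinks as the amount of experience grows.
    return min(1.0, 1.0 / max(time, c))

def epsilon_soft_probs(q_values, epsilon=0.1):
    # Every action keeps probability at least epsilon / |A_s|; the greedy one gets the rest.
    k = len(q_values)
    best = max(range(k), key=lambda a: q_values[a])
    probs = [epsilon / k] * k
    probs[best] += 1.0 - epsilon
    return probs
```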
Other exploration strategies
◦ Value-adaptive epsilon
◦ Adjust the exploration probability based on the variation of the value function
◦ When the value varies a lot → unstable → the agent seems not to know much about the environment → should explore more
◦ When the value is stable → no need to explore much
◦ Mathematical expression
◦ $f(s_t, a_t, \sigma) = \dfrac{1 - e^{-|Q_{t+1}(s,a) - Q_t(s,a)|/\sigma}}{1 + e^{-|Q_{t+1}(s,a) - Q_t(s,a)|/\sigma}} = \dfrac{2}{1 + e^{-|Q_{t+1}(s,a) - Q_t(s,a)|/\sigma}} - 1$ → the larger $|Q_{t+1}(s,a) - Q_t(s,a)|$, the larger $f$
◦ $\epsilon_{t+1}(s) = \delta f(s_t, a_t, \sigma) + (1 - \delta)\,\epsilon_t(s)$
Other exploration strategies
◦ Sigmoid epsilon
◦ $P_{\text{explore}} = 1 - \dfrac{1}{1 + e^{-\omega \times \mathrm{var}(q)}}$
◦ $\omega$: a hyper-parameter
◦ $\mathrm{var}(q)$: the variance of the action values
◦ When $\mathrm{var}(q)$ is sufficiently large → there is an action that dominates the others → it should be chosen
◦ When $\mathrm{var}(q)$ is small → all actions are roughly equal → exploration should increase
Tabular RL
◦ Use a table (the Q-table) to represent the action values
◦ Each row stores a (State, Action, Q-value) entry: $(S_1, A_1, Q(S_1,A_1))$, $(S_1, A_2, Q(S_1,A_2))$, …, $(S_n, A_m, Q(S_n,A_m))$
◦ Update the Q-table after performing every action
◦ Drawbacks
◦ When the state/action spaces are too large → the table size gets too large
◦ Cannot deal with newly appearing states/actions when the model is trained offline
Approximate RL
◦ Represent the relation of $(s, a, q)$ by a function or a model: $\hat{q}(s,a)$
◦ Input: a vector representing $s$
◦ The parameters of $\hat{q}(s,a)$ are updated after every action
◦ Using gradient descent
◦ Advantages
◦ Handles a large state/action space
◦ Can deal with newly appearing states
Approximate RL
◦ Supervised learning view: target $q^*(s,a)$, prediction $\hat{q}_{\mathbb{w}}(s,a)$
◦ Loss: $L = \frac{1}{2}\left(q^*(s,a) - \hat{q}_{\mathbb{w}}(s,a)\right)^2$
◦ Update: $\mathbb{w} \leftarrow \mathbb{w} - \alpha\nabla L$
◦ Problem: in most cases, $q^*(s,a)$ is unknown
◦ Solution: estimate $q^*(s,a)$
Recall: tabular RL
◦ Tabular Q-learning:
◦ $Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha\left[r + \gamma\max_{a'} Q(s',a')\right]$
$= Q(s,a) + \alpha\left[r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right]$
(observed value $X = r + \gamma\max_{a'} Q(s',a')$; estimated value $Y = Q(s,a)$)
◦ $L = \frac{1}{2}\left(r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right)^2 \;\Rightarrow\; \frac{\partial L}{\partial Q} = -\left(r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right)$
◦ $Q(s,a) \leftarrow Q(s,a) - \alpha\nabla L$ → gradient descent
Approximate RL
◦ Tabular Q-learning
◦ $L = \frac{1}{2}\left(r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right)^2$; $\quad Q(s,a) \leftarrow Q(s,a) - \alpha\nabla L$
◦ Approximate Q-learning: $Q_{\mathbb{w}}(s,a)$
◦ $L = \frac{1}{2}\left(r + \gamma\max_{a'} Q_{\mathbb{w}}(s',a') - Q_{\mathbb{w}}(s,a)\right)^2$
◦ E.g., $Q_{\mathbb{w}}(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a)$
◦ $w_i \leftarrow w_i + \alpha\left(r + \gamma\max_{a'} Q_{\mathbb{w}}(s',a') - Q_{\mathbb{w}}(s,a)\right) f_i(s,a)$
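A sketch of the linear approximate Q-learning update above; the feature vectors and step sizes are placeholders:

```python
# Q_w(s, a) = sum_i w_i * f_i(s, a); weights are updated from the TD error.
def q_value(w, features):
    return sum(wi * fi for wi, fi in zip(w, features))

def approx_q_update(w, feats_sa, r, next_feats_per_action, alpha=0.01, gamma=0.9):
    best_next = max(q_value(w, f) for f in next_feats_per_action)       # max_a' Q_w(s', a')
    td_error = r + gamma * best_next - q_value(w, feats_sa)
    return [wi + alpha * td_error * fi for wi, fi in zip(w, feats_sa)]  # w_i += alpha*error*f_i(s,a)
```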
Approximate RL
◦ Approximate Q-learning: $Q_{\mathbb{w}}(s,a)$
◦ $w_i \leftarrow w_i + \alpha\left(r + \gamma\max_{a'} Q_{\mathbb{w}}(s',a') - Q_{\mathbb{w}}(s,a)\right)\dfrac{\partial Q_{\mathbb{w}}}{\partial w_i}$
◦ SARSA
◦ Tabular SARSA: $Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha\left(r + \gamma Q(s',a')\right)$
◦ Approximate SARSA: $w_i \leftarrow w_i + \alpha\left(r + \gamma Q_{\mathbb{w}}(s',a') - Q_{\mathbb{w}}(s,a)\right)\dfrac{\partial Q_{\mathbb{w}}}{\partial w_i}$
Approximate RL
◦ $Q_{\mathbb{w}}(s,a)$
◦ Can be a very simple linear function
◦ Can be a machine learning model, an ANN, …
◦ When $Q_{\mathbb{w}}(s,a)$ is a deep learning model → we have deep reinforcement learning
[Diagram: a DNN takes the state as input and outputs a Q-value for each action (Action 1, Q-value 1; …; Action n, Q-value n)]
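As one possible instance of the diagram, here is a toy Q-network that maps a state vector to one Q-value per action (PyTorch is assumed here; the layer sizes are arbitrary and this is not the slides' model):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)              # shape: (batch, n_actions)

q_net = QNetwork(state_dim=4, n_actions=3)
print(q_net(torch.zeros(1, 4)))             # Q-values of the 3 actions for a dummy state
```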
Tabular RL vs Deep RL
[Diagram: for a query pair $(S_i, A_j)$, tabular RL looks up $Q(S_i, A_j)$ in the Q-table of (State, Action, Q-value) rows; in deep RL (DRL), the state $S_i$ is fed into a DNN that outputs $Q(S_i, A_1), Q(S_i, A_2), \dots, Q(S_i, A_m)$ for all actions at once]
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Policy-based RL
◦ Value-based RL: estimate values (action value, state value) with Monte Carlo, TD, SARSA, Q-learning, then choose an action based on the value
◦ What if we learn to make the decision directly, instead of estimating a value and then making the decision based on that value? → Policy-based RL
Value-based RL vs Policy-based RL
[Diagram: value-based RL = value estimation (Monte Carlo, TD, SARSA, Q-learning, …) followed by action selection (𝜖-greedy, 𝜖-soft, …)]
Value-based RL vs Policy-based RL
[Diagram: policy-based RL = a single end-to-end policy that replaces the separate value-estimation and action-selection blocks]
Policy-based RL
◦ Objective
◦ Policy function $\pi(a \mid s)$: the probability of performing action $a$ in state $s$
◦ Main idea
◦ $\pi(a \mid s, \Theta)$: $\pi$ can be a function or a model with parameters $\Theta$ to be learned
◦ $\mathcal{J}(\Theta)$: a scalar performance measure
◦ $\theta_{t+1} = \theta_t + \alpha\,\widehat{\nabla\mathcal{J}(\Theta_t)}$, where $\widehat{\nabla\mathcal{J}(\Theta_t)}$ is an estimate of $\nabla\mathcal{J}(\Theta_t)$ → gradient ascent (not gradient descent, since we maximize $\mathcal{J}$)
◦ Questions:
1. How to define $\mathcal{J}(\Theta_t)$?
2. How to estimate $\nabla\mathcal{J}(\Theta_t)$?
Objective function $\mathcal{J}(\Theta)$
◦ For episodic environments
◦ $\mathcal{J}(\Theta) = \mathbb{E}_\pi\left[G_1\right] = \mathbb{E}_\pi\left[R_1 + \gamma R_2 + \cdots + \gamma^{T-1} R_T\right]$
◦ For continuing environments
◦ Expectation of the state value
◦ $\mathcal{J}(\Theta) = \mathbb{E}_\pi\left[V(s)\right] = \sum_s d(s)\,V(s)$: the expectation over all the states, where $d(s)$ is the fraction of time spent in state $s$
◦ Expectation of the reward
◦ $\mathcal{J}(\Theta) = \mathbb{E}_\pi\left[r\right] = \sum_s d(s)\sum_a \pi(a \mid s, \Theta)\,R(a,s)$
$\mathcal{J}(\Theta)$ estimation
◦ $\theta_{t+1} = \theta_t + \alpha\,\widehat{\nabla\mathcal{J}(\Theta_t)}$
◦ Goals
◦ Find an expression proportional to $\nabla\mathcal{J}(\Theta)$: $F \propto \nabla\mathcal{J}(\Theta)$
◦ Preferably, $F$ is the expected value of some variable $e$
◦ → $F$ can then be approximated by samples of $e$
$\mathcal{J}(\Theta)$ estimation
◦ Definition of $\mathcal{J}(\Theta)$
◦ $\mathcal{J}(\Theta) = \mathbb{E}_\pi\left[G_1\right] = \mathbb{E}_\pi\left[R_1 + \gamma R_2 + \cdots + \gamma^{T-1} R_T\right] = v_\pi(s_0)$: the state value of the first state
→ $v_\pi(s) = \sum_a \pi(a \mid s)\,q_\pi(s,a)$
→ $\nabla v_\pi(s) = \sum_a \left[\nabla\pi(a \mid s)\,q_\pi(s,a) + \pi(a \mid s)\,\nabla q_\pi(s,a)\right]$
→ $\nabla\mathcal{J}(\Theta) = \nabla v_\pi(s_0) \propto \sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\,q_\pi(s,a)$ (Policy Gradient Theorem)
($\mu(s)$: the probability for $s$ to occur; $\nabla\pi(a \mid s)$: the gradient of the policy $\pi$; $q_\pi(s,a)$: the action value)
$\mathcal{J}(\Theta)$ estimation for episodic environments
$\nabla\mathcal{J}(\Theta) \propto \sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\,q_\pi(s,a) = \mathbb{E}_\pi\left[\sum_a \nabla\pi(a \mid S_t)\,q_\pi(S_t,a)\right]$
This is the expected value of a variable → approximate the expectation by a sample:
$\widehat{\nabla\mathcal{J}(\Theta)} = \sum_a \nabla\pi(a \mid S_t)\,q_\pi(S_t,a)$
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\sum_a \nabla\pi(a \mid S_t)\,q_\pi(S_t,a)$
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\sum_a \nabla\pi(a \mid S_t, \Theta_t)\,\hat{q}(S_t,a,\mathbb{w})$
REINFORCE algorithm (1992)
◦ REINFORCE = REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility
◦ Main idea
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\sum_a \nabla\pi(a \mid S_t, \Theta_t)\,q_\pi(S_t,a)$
Convert this into the expected value of a variable → approximate it by a sample. The sum is over all actions $a$ → to convert it into an expectation over the variable $a$, we need $\pi(a \mid S_t)$:
$\sum_a \nabla\pi(a \mid S_t)\,q_\pi(S_t,a) = \sum_a \pi(a \mid S_t, \Theta_t)\,\dfrac{\nabla\pi(a \mid S_t, \Theta_t)}{\pi(a \mid S_t, \Theta_t)}\,q_\pi(S_t,a)$
$= \mathbb{E}_\pi\left[q_\pi(S_t,A_t)\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}\right] = \mathbb{E}_\pi\left[G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}\right]$ (since $q_\pi(S_t,A_t) = \mathbb{E}_\pi\left[G_t \mid S_t, A_t\right]$)
$\mathbb{E}_\pi\left[G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}\right]$ can be approximated by the sample $G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
REINFORCE algorithm (1992)
◦ $\widehat{\nabla\mathcal{J}(\Theta)} = G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
→ $\Theta_{t+1} \leftarrow \Theta_t + \alpha\,G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
◦ Intuition
◦ The direction of $\nabla\pi$ increases the probability of repeating action $A_t$: if we increase $\Theta$ by an amount proportional to $\nabla\pi(A_t \mid S_t, \Theta_t)$, then $A_t$ will be performed more and more often
◦ $G_t \times \nabla\pi(A_t \mid S_t, \Theta_t)$: the larger $G_t$, the more we are encouraged to move $\Theta$ in the direction $\nabla\pi(A_t \mid S_t, \Theta_t)$
◦ $\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$: the larger $\pi(A_t \mid S_t, \Theta_t)$, the less we are encouraged to move $\Theta$ in the direction $\nabla\pi(A_t \mid S_t, \Theta_t)$
→ this avoids performing one action too many times
REINFORCE algorithm (1992)
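Below is a minimal episodic sketch of the REINFORCE update derived above, under my own assumptions (a softmax policy over discrete actions, a NumPy parameter matrix with one row per action, and a hypothetical env.reset()/env.step() interface); it is not the author's implementation:

```python
import numpy as np

def softmax_policy(theta, s):
    prefs = theta @ s                          # one preference per action (theta: n_actions x state_dim)
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99):
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                            # generate one episode following pi(.|., theta)
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)
        s_next, r, done = env.step(a)          # assumed environment interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    G = 0.0
    for t in reversed(range(len(rewards))):    # walk backwards to accumulate returns G_t
        G = rewards[t] + gamma * G
        p = softmax_policy(theta, states[t])
        grad_log = -np.outer(p, states[t])     # grad of log pi(.|s): -pi(b|s)*s for every action b ...
        grad_log[actions[t]] += states[t]      # ... plus s for the action actually taken
        theta = theta + alpha * G * grad_log   # Theta += alpha * G_t * grad log pi(A_t|S_t, Theta)
    return theta
```

Note that the ratio $\nabla\pi/\pi$ in the slide equals $\nabla\log\pi$, which is what the sketch computes.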
REINFORCE with Baseline
◦ $\nabla\mathcal{J}(\Theta) \propto \sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\,q_\pi(s,a)$
→ large values of $q_\pi(s,a)$ will dominate $\nabla\mathcal{J}(\Theta)$
→ use a baseline to normalize $q_\pi(s,a)$:
$\nabla\mathcal{J}(\Theta) \propto \sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\left(q_\pi(s,a) - b(s)\right)$
This holds because $\sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\left(q_\pi(s,a) - b(s)\right) = \sum_s \mu(s)\left[\sum_a \nabla\pi(a \mid s)\,q_\pi(s,a) - b(s)\sum_a \nabla\pi(a \mid s)\right]$ and $\sum_a \nabla\pi(a \mid s) = \nabla\sum_a \pi(a \mid s) = \nabla 1 = 0$
How to choose $b(s)$:
- $b(s)$ can be anything, as long as its value is independent of $a$
- we use $b(s)$ to decrease the variance
REINFORCE with Baseline
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\,G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
→ $\Theta_{t+1} \leftarrow \Theta_t + \alpha\left(G_t - b(S_t)\right)\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
The expected value of $G_t$ is $v(S_t)$
→ a natural selection for $b(S_t)$ is an estimate $\hat{v}(S_t, \mathbb{w})$ of $v(S_t)$, learned by
$\mathbb{w}_{t+1} \leftarrow \mathbb{w}_t + \beta\left(G_t - \hat{v}(S_t, \mathbb{w}_t)\right)\nabla\hat{v}(S_t, \mathbb{w}_t)$
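A short sketch of just these two updates with a linear state-value baseline $\hat{v}(s,\mathbb{w}) = \mathbb{w}\cdot s$; the policy helper and all shapes/step sizes are assumptions carried over from the earlier REINFORCE sketch:

```python
import numpy as np

def baseline_updates(theta, w, s, a, G, policy, alpha=0.01, beta=0.01):
    p = policy(theta, s)
    grad_log = -np.outer(p, s)
    grad_log[a] += s                          # grad log pi(a | s, Theta) for a softmax policy
    delta = G - w @ s                         # G_t - v_hat(S_t, w)
    theta = theta + alpha * delta * grad_log  # policy update with the baselined return
    w = w + beta * delta * s                  # w += beta * (G_t - v_hat) * grad v_hat (grad = s)
    return theta, w
```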
Actor-critic learning
◦ Definition: methods that learn approximations to both the policy and the value function are often called actor-critic methods
◦ Actor: the learned policy
◦ Critic: the learned value function
◦ Usually a state-value function
◦ Is policy-based RL with a baseline an actor-critic method?
◦ If the baseline is a stationary value that is never updated with experience → NOT actor-critic
◦ If the baseline is estimated from experience → it can be called an actor-critic method
Actor-critic learning
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\left(G_t - b(S_t)\right)\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
→ $\Theta_{t+1} \leftarrow \Theta_t + \alpha\left(G_t - \hat{v}(S_t, \mathbb{w})\right)\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$ (REINFORCE with baseline)
Replace $G_t$ with the bootstrapped estimate $R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbb{w})$:
→ $\Theta_{t+1} \leftarrow \Theta_t + \alpha\left(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbb{w}) - \hat{v}(S_t, \mathbb{w})\right)\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
$\mathbb{w}_{t+1} \leftarrow \mathbb{w}_t + \beta\left(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbb{w}) - \hat{v}(S_t, \mathbb{w})\right)\nabla\hat{v}(S_t, \mathbb{w}_t)$
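A one-step actor-critic sketch matching the updates above, again with a linear critic $\hat{v}(s,\mathbb{w}) = \mathbb{w}\cdot s$ and a softmax actor; every name and step size is an assumption for illustration:

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, done, policy,
                      alpha=0.01, beta=0.01, gamma=0.99):
    v_next = 0.0 if done else w @ s_next
    td_error = r + gamma * v_next - w @ s        # R_{t+1} + gamma*v_hat(S_{t+1}) - v_hat(S_t)
    p = policy(theta, s)
    grad_log = -np.outer(p, s)
    grad_log[a] += s                             # grad log pi(A_t | S_t, Theta)
    theta = theta + alpha * td_error * grad_log  # actor update
    w = w + beta * td_error * s                  # critic update
    return theta, w
```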
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Applications in networking
◦ Routing in WSNs
◦ How to forward packets from S to D?
◦ Reinforcement learning modeling
◦ Agent: sensor nodes
◦ RL model: Q-learning
◦ Negative rewarding
• Khanh Le, Nguyen Thanh Hung, Kien Nguyen, Phi Le Nguyen, “Exploiting Q-learning in Extending the Network Lifetime of Wireless Sensor Networks with Holes”, ICPADS 2019
• Phi Le Nguyen, Nang Hung Nguyen, Tuan Anh Nguyen Dinh, Khanh Le, Thanh Hung Nguyen, and Kien Nguyen, “QIH: An Efficient Q-Learning Inspired Hole-Bypassing Routing Protocol for WSNs”, IEEE Access, 2021
Applications in networking
◦ Wireless charging algorithm
◦ Where should the mobile charger move to?
◦ How long should the mobile charger charge the sensors?
◦ Techniques: Q-learning + fuzzy logic
• La Van Quan, Phi Le Nguyen, Thanh-Hung Nguyen, Kien Nguyen, “Q-learning-based, Optimized On-demand Charging Algorithm in WRSN”, NCA 2020
• Phi Le Nguyen, Van Quan La, Anh Duy Nguyen, Thanh Hung Nguyen, and Kien Nguyen, “An On-Demand Charging for Connected Target Coverage in WRSNs Using Fuzzy Logic and Q-Learning”, MDPI Sensors 2021
Applications in networking
◦ Offloading in MEC
◦ Where to offload the tasks?
◦ RL model: MAB (multi-armed bandit)
◦ Exploration strategy
• Nang Hung Nguyen, Phi Le Nguyen, Hieu Dinh, Thanh Hung Nguyen, Kien Nguyen, “Multi-Agent Multi-Armed Bandit Learning for Offloading Delay Minimization in V2X Networks”, EUC 2021
• Trung Thanh Nguyen, Truong Thao Nguyen, Tuan Anh Nguyen Dinh, Thanh-Hung Nguyen, Phi Le Nguyen, “Q-learning-based Opportunistic Communication for Real-time Mobile Air Quality Monitoring Systems”, IPCCC 2021
Source code
◦ https://colab.research.google.com/drive/1-_6DVuOjJWBp2V5x53uTZC9U5Z1IoYiT#scrollTo=GzO_xDRDr5NP