2021/09/30
Reinforcement learning:
Basic concepts and Applications in networking
Nguyen Phi Le
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Example
Interactive learning
◦ Examples: Multi-Armed Bandit, traffic light control, wireless sensor networks (sensor nodes reporting to a base station)
◦ The learner is not told which actions to take → trial and error
◦ An action may also affect future situations → delayed reward
Definition
◦ Reinforcement learning
◦ Learning what to do, i.e., how to map situations to actions, to achieve the goal
◦ Reward hypothesis
◦ That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward) (Richard S. Sutton, RL: An Introduction, 2018)
Goal and reward
◦ Reward: a scalar signal received from the environment at each step
◦ Goal: maximizing cumulative reward in the long run
◦ Examples
◦ Multi-Armed Bandit: Goal = maximizing total gain; Reward = gain at every turn
◦ Traffic light control: Goal = decreasing traffic jams; Reward = waiting time, queue length, lane speed, …
◦ Wireless sensor networks: Goal = increasing network lifetime; Reward = load balancing, route length, …
RL vs other learning techniques
◦ RL vs supervised learning
◦ Supervised learning: learning from a training set of labeled examples
◦ Know the true action to take
◦ Reinforcement learning: do not know the optimal action
◦ RL vs unsupervised learning
◦ Unsupervised learning: finding structure hidden in collections of unlabeled data
◦ Reinforcement learning: maximizing the reward signal
RL framework
◦ Policy: a mapping from states to actions
◦ may be stochastic, specifying probabilities for each action
◦ Reward signal: the goal of a reinforcement learning problem
◦ objective is to maximize the total reward the agent receives over the long run
◦ Value: the total amount of cumulative reward over the future, starting from that state
◦ Reward: immediate; Value: long-run
◦ Action: can be any decision we want to learn how to make
◦ State: can be anything we can know that might be useful in making it
[Diagram: the agent observes a state, its policy selects an action, and the environment returns a reward signal and a new state]
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Value-based RL
◦ RL's objective: making decisions
◦ Input: state
◦ Output: action
◦ How to decide on an action?
◦ Using an action selection policy
◦ Based on action/state values
◦ How to measure the values?
◦ Value functions: estimate the goodness of states/actions based on experience
◦ Exploration-exploitation strategy
[Diagram: the agent-environment loop as before, with a value function informing the policy's action selection]
Value functions
◦ Key point of value-based RL models
◦ Questions?
◦ What can be used to measure the goodness of an action, state?
◦ How to measure?
Goal: maximize the long-term cumulative reward
$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^k R_{t+k+1} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
($G_t$: the reward accumulated after performing action $a$ at time $t$, following policy $\pi$ from time $t+1$ onward)
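To make the return concrete, here is a minimal Python sketch (not from the slides; the rewards and discount factor are made up) that computes the discounted return $G_t$ for a finite list of sampled rewards:

```python
# Minimal sketch: the discounted return G_t = R_{t+1} + gamma*R_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * rewards[k] over a finite episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example with made-up rewards R_{t+1}, R_{t+2}, R_{t+3}:
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```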
Value functions
◦ State value function: how good it is for the agent to be in a given state
◦ The expected return when starting in $s$ and following policy $\pi$ thereafter
◦ Action value function: how good it is to perform a given action in a given state
◦ The expected return starting from $s$, taking the action $a$, and thereafter following policy $\pi$
State value function for policy $\pi$:
$v_\pi(s) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$
Action value function for policy $\pi$:
$q_\pi(s,a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$
Action value estimation
◦ $q_\pi(s,a) \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$
◦ How to estimate $q_\pi(s,a)$?
◦ Computing the expectation exactly would require performing the same action in the same state infinitely many times → impossible
◦ Solution
◦ Replace the expected value with a sample-based estimate
◦ Estimate $\mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$ from a few samples
◦ Estimation method: iterative estimation
◦ Initialize $q_\pi(s,a)$
◦ Update $q_\pi(s,a)$ after performing one action and receiving a reward, blending the current $q_\pi(s,a)$ with a sample estimate $\hat{q}_\pi(s,a)$ of the expected return:
new $q_\pi(s,a) \leftarrow (1-\alpha)\times$ old $q_\pi(s,a) + \alpha\,\hat{q}_\pi(s,a)$
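A small illustrative sketch of the iterative estimate above, blending the current value with each new sample (all numbers and the step size are made up):

```python
# Incremental estimate: new_q = (1 - alpha) * old_q + alpha * sample_estimate
def update_estimate(old_q, sample, alpha=0.1):
    return (1 - alpha) * old_q + alpha * sample

q = 0.0                                   # initialize q_pi(s, a)
for sample in [1.0, 0.5, 1.5, 1.0]:       # made-up sampled returns for one (s, a) pair
    q = update_estimate(q, sample)
print(q)                                  # the estimate drifts toward the sample average
```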
Action value estimation
◦ How to estimate the action value?
◦ Bellman equation
$q_\pi(s,a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \sum_{k=1}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \gamma\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
(the action value of $(s,a)$ is the expectation of the immediate reward after performing action $a$ in state $s$, plus the discounted action value at the next step)
Action value estimation
◦ Bellman equation
◦ $q_\pi(s,a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
◦ The sampled quantity $r + \gamma Q(s',a')$ serves as the estimate $\hat{q}_\pi(s,a)$ of this expectation:
new $q_\pi(s,a) = (1-\alpha)\times$ current $q_\pi(s,a) + \alpha\,\hat{q}_\pi(s,a)$
$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma Q(s',a')\right]$ (SARSA)
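A hedged sketch of the SARSA update above, with the Q-table stored as a Python dict keyed by (state, action); all names are illustrative:

```python
# SARSA: the target uses the action a_next actually taken in s_next.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q.get((s_next, a_next), 0.0)        # sampled estimate of q_pi(s, a)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return Q
```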
Action value estimation
Bellman equation
$q_\pi(s,a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma q_\pi(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
$q_{\pi^*}(s,a) = \mathbb{E}_{\pi^*}\left[R_{t+1} + \gamma q_{\pi^*}(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
$= \mathbb{E}_{\pi^*}\left[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \mid S_t = s, A_t = a\right]$
Estimation: $\hat{q}_{\pi^*}(s,a) = r + \gamma \max_{a'} Q(s',a')$
$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma \max_{a'} Q(s',a')\right]$ (Q-learning)
$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[r + \gamma Q(s',a')\right]$ (SARSA)
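For comparison, a matching sketch of the Q-learning update, which differs from SARSA only in using the max over next actions (again, names are illustrative):

```python
# Q-learning: the target uses max over the next actions instead of the action actually taken.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q
```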
State value estimation
Bellman equation for the state-value function
$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$
$= \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \gamma\sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_t = s\right]$
$= \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right]$
New $v_\pi(s) = (1-\alpha)\times$ current $v_\pi(s) + \alpha\times$ estimate
$V(S_t) \leftarrow (1-\alpha)V(S_t) + \alpha G_t$ (Monte Carlo)
$V(S_t) \leftarrow (1-\alpha)V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1})\right)$ (Temporal Difference, TD)
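The two state-value updates above can be sketched the same way (V is a dict keyed by state; values and step sizes are illustrative):

```python
def mc_update(V, s, G, alpha=0.1):
    # Monte Carlo: move V(s) toward the full observed return G_t
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * G

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # TD(0): move V(s) toward the bootstrapped target R_{t+1} + gamma * V(S_{t+1})
    V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * (r + gamma * V.get(s_next, 0.0))
```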
Action selection policy
◦ Based on value functions
◦ Exploitation – exploration strategy
◦ Choose the action $a$ whose $Q(s,a)$ is greatest ← exploitation
◦ Problem: can get stuck in a local maximum → need to try new actions ← exploration
◦ Some exploration methods
◦ 𝜖-greedy
◦ Time-adaptive epsilon (annealing 𝜖)
◦ 𝜖-soft
◦ Value-adaptive epsilon
◦ Sigmoid epsilon
◦ Optimistic initialization
◦ Upper confidence bound (UCB)
𝜖-greedy
◦ Define a small number 𝜖 (e.g., 𝜖 = 0.01)
◦ Each time the agent needs to choose an action
◦ Generate a probability $p \in (0,1)$
◦ $A_t = \begin{cases}\arg\max_a q(s,a), & \text{if } p \geq \epsilon\\ \text{a random action}, & \text{otherwise}\end{cases}$
[Figure: expected selection frequency of the best action vs. the other actions under 𝜖-greedy with 𝜖 = 0.1]
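A minimal sketch of the 𝜖-greedy rule above (q_values maps each action to its current estimate; the interface is an assumption for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon=0.01):
    if random.random() >= epsilon:                     # exploit with probability 1 - epsilon
        return max(q_values, key=q_values.get)         # argmax_a q(s, a)
    return random.choice(list(q_values))               # explore: a random action
```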
𝜖-greedy multi-armed bandit
◦ Action space: $A = \{a_1, \dots, a_k\}$
◦ State space: a single state
◦ Reward: the gain received
◦ Value function
◦ $Q_t(a) \doteq \dfrac{\text{sum of rewards when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}$
◦ Exploration strategy: 𝜖-greedy
(Goal = maximizing total gain; Reward = gain at every turn)
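Putting the pieces together, here is a toy 𝜖-greedy bandit loop with sample-average value estimates; the arm reward probabilities are invented for the example:

```python
import random

def run_bandit(arm_probs=(0.2, 0.5, 0.8), steps=1000, epsilon=0.1):
    k = len(arm_probs)
    Q = [0.0] * k            # Q_t(a): average gain of arm a so far
    N = [0] * k              # number of times arm a was pulled
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                              # explore
        else:
            a = max(range(k), key=lambda i: Q[i])                # exploit
        r = 1.0 if random.random() < arm_probs[a] else 0.0       # Bernoulli gain
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                                # incremental sample average
        total += r
    return Q, total

print(run_bandit())
```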
Exploration vs Exploitation
The higher the 𝜖, the more exploration
→ the more unstable the average reward
Exploration vs Exploitation
The frequency of selecting the optimal action is inversely proportional to 𝜖
Optimistic initialization
◦ Initialize $Q_0(a) = X\ \forall a$ ($X$ is a significantly large number)
◦ $Q_t(a) \doteq \dfrac{\text{sum of rewards when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}$
◦ Action selection strategy
◦ At every step $t$: choose $A_t = \arg\max_a Q_t(a)$
◦ Guarantees that every action is chosen at least once
Optimistic initialization
◦ High exploration at first
◦ But totally greedy afterwards
◦ An appropriate value of $X$ is needed
Upper confidence bound (UCB)
◦ Actions that have been selected frequently → should be explored less
◦ The confidence bound is defined by the number of times the action has not been selected
◦ Prioritizing actions with
◦ High estimated action value → exploitation
◦ High confidence bound → exploration
[Figure: each action's estimated value plus its confidence bound]
Upper confidence bound (UCB)
◦ Mathematical expression
◦ The number of times an action $a$ has been taken: $N_t(a)$
◦ The total number of time steps so far: $N_t$
◦ Confidence bound
◦ $CB_t(a) = c\sqrt{\dfrac{\ln N_t}{N_t(a)}}$ ($c$: a constant)
◦ Action selection policy
◦ At every step $t$: choose $A_t = \arg\max_a \left[Q_t(a) + c\sqrt{\dfrac{\ln N_t}{N_t(a)}}\right]$
◦ Intuition
◦ If an action $a$ has not been selected for a long time → $N_t(a)$ stays small relative to $N_t$ → $CB_t(a)$ increases → $a$ will be prioritized in the next steps
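A sketch of UCB selection using the quantities defined above; the $\sqrt{\ln N_t / N_t(a)}$ term is the standard UCB form and is my reading of the slide's confidence bound:

```python
import math

def ucb_select(Q, N, t, c=2.0):
    # Try each arm once first, then pick argmax_a [ Q(a) + c*sqrt(ln(t) / N(a)) ].
    for a in range(len(Q)):
        if N[a] == 0:
            return a
    return max(range(len(Q)), key=lambda a: Q[a] + c * math.sqrt(math.log(t) / N[a]))
```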
Other exploration strategies
◦ Time-adaptive epsilon (annealing 𝜖)
◦ $\epsilon = \dfrac{1}{\max(\textit{time}, c)}$
◦ $c$: a small positive parameter
◦ $\textit{time}$: the total number of actions performed so far → the larger $\textit{time}$, the smaller 𝜖 → the more experience, the less exploration
◦ 𝜖-soft
◦ Guarantees that the probability of choosing any action is at least $\dfrac{\epsilon}{|A_s|}$
◦ $A_s$: the action space
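A brief sketch of these two schedules; the max(time, c) form of the annealed 𝜖 is my reconstruction of the slide's formula:

```python
def annealed_epsilon(time, c=0.01):
    # Exploration probability shrinks as the amount of experience grows.
    return min(1.0, 1.0 / max(time, c))

def epsilon_soft_probs(q_values, epsilon=0.1):
    # Every action keeps probability at least epsilon / |A_s|; the greedy one gets the rest.
    k = len(q_values)
    best = max(range(k), key=lambda a: q_values[a])
    probs = [epsilon / k] * k
    probs[best] += 1.0 - epsilon
    return probs
```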
Other exploration strategies
◦ Value-adaptive epsilon
◦ Adjust the exploration probability based on the variation of the value function
◦ When the value varies a lot → unstable → the agent seems not to know much about the environment → should explore more
◦ When the value is stable → no need to explore much
◦ Mathematical expression
◦ $f(s_t, a_t, \sigma) = \dfrac{1 - e^{-|Q_{t+1}(s,a) - Q_t(s,a)|/\sigma}}{1 + e^{-|Q_{t+1}(s,a) - Q_t(s,a)|/\sigma}} = \dfrac{2}{1 + e^{-|Q_{t+1}(s,a) - Q_t(s,a)|/\sigma}} - 1$ → the larger $|Q_{t+1}(s,a) - Q_t(s,a)|$, the larger $f$
◦ $\epsilon_{t+1}(s) = \delta f(s_t, a_t, \sigma) + (1 - \delta)\,\epsilon_t(s)$
Other exploration strategies
◦ Sigmoid epsilon
◦ $P_{\text{explore}} = 1 - \dfrac{1}{1 + e^{-\omega \times \mathrm{var}(q)}}$
◦ $\omega$: a hyper-parameter
◦ $\mathrm{var}(q)$: the variance of the action values
◦ When $\mathrm{var}(q)$ is sufficiently large → there is an action that dominates the others → it should be chosen
◦ When $\mathrm{var}(q)$ is small → all actions are roughly equal → exploration should increase
Tabular RL
◦ Use a table (the Q-table) to represent the action values
◦ Each row stores a (State, Action, Q-value) entry: $(S_1, A_1, Q(S_1,A_1))$, $(S_1, A_2, Q(S_1,A_2))$, …, $(S_n, A_m, Q(S_n,A_m))$
◦ Update the Q-table after performing every action
◦ Drawbacks
◦ When the state/action spaces are too large → the table size gets too large
◦ Cannot deal with newly appearing states/actions when the model is trained offline
Approximate RL
◦ Represent the relation of $(s, a, q)$ by a function or a model: $\hat{q}(s,a)$
◦ Input: a vector representing $s$
◦ The parameters of $\hat{q}(s,a)$ are updated after every action
◦ Using gradient descent
◦ Advantages
◦ Handles a large state/action space
◦ Can deal with newly appearing states
Approximate RL
◦ Supervised learning view: target $q^*(s,a)$, prediction $\hat{q}_{\mathbb{w}}(s,a)$
◦ Loss: $L = \frac{1}{2}\left(q^*(s,a) - \hat{q}_{\mathbb{w}}(s,a)\right)^2$
◦ Update: $\mathbb{w} \leftarrow \mathbb{w} - \alpha\nabla L$
◦ Problem: in most cases, $q^*(s,a)$ is unknown
◦ Solution: estimate $q^*(s,a)$
Recall: tabular RL
◦ Tabular Q-learning:
◦ $Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha\left[r + \gamma\max_{a'} Q(s',a')\right]$
$= Q(s,a) + \alpha\left[r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right]$
(observed value $X = r + \gamma\max_{a'} Q(s',a')$; estimated value $Y = Q(s,a)$)
◦ $L = \frac{1}{2}\left(r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right)^2 \;\Rightarrow\; \frac{\partial L}{\partial Q} = -\left(r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right)$
◦ $Q(s,a) \leftarrow Q(s,a) - \alpha\nabla L$ → gradient descent
Approximate RL
◦ Tabular Q-learning
◦ $L = \frac{1}{2}\left(r + \gamma\max_{a'} Q(s',a') - Q(s,a)\right)^2$; $\quad Q(s,a) \leftarrow Q(s,a) - \alpha\nabla L$
◦ Approximate Q-learning: $Q_{\mathbb{w}}(s,a)$
◦ $L = \frac{1}{2}\left(r + \gamma\max_{a'} Q_{\mathbb{w}}(s',a') - Q_{\mathbb{w}}(s,a)\right)^2$
◦ E.g., $Q_{\mathbb{w}}(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a)$
◦ $w_i \leftarrow w_i + \alpha\left(r + \gamma\max_{a'} Q_{\mathbb{w}}(s',a') - Q_{\mathbb{w}}(s,a)\right) f_i(s,a)$
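A sketch of the linear approximate Q-learning update above; the feature vectors and step sizes are placeholders:

```python
# Q_w(s, a) = sum_i w_i * f_i(s, a); weights are updated from the TD error.
def q_value(w, features):
    return sum(wi * fi for wi, fi in zip(w, features))

def approx_q_update(w, feats_sa, r, next_feats_per_action, alpha=0.01, gamma=0.9):
    best_next = max(q_value(w, f) for f in next_feats_per_action)       # max_a' Q_w(s', a')
    td_error = r + gamma * best_next - q_value(w, feats_sa)
    return [wi + alpha * td_error * fi for wi, fi in zip(w, feats_sa)]  # w_i += alpha*error*f_i(s,a)
```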
Approximate RL
◦ Approximate Q-learning: $Q_{\mathbb{w}}(s,a)$
◦ $w_i \leftarrow w_i + \alpha\left(r + \gamma\max_{a'} Q_{\mathbb{w}}(s',a') - Q_{\mathbb{w}}(s,a)\right)\dfrac{\partial Q_{\mathbb{w}}}{\partial w_i}$
◦ SARSA
◦ Tabular SARSA: $Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha\left(r + \gamma Q(s',a')\right)$
◦ Approximate SARSA: $w_i \leftarrow w_i + \alpha\left(r + \gamma Q_{\mathbb{w}}(s',a') - Q_{\mathbb{w}}(s,a)\right)\dfrac{\partial Q_{\mathbb{w}}}{\partial w_i}$
Approximate RL
◦ $Q_{\mathbb{w}}(s,a)$
◦ Can be a very simple linear function
◦ Can be a machine learning model, an ANN, …
◦ When $Q_{\mathbb{w}}(s,a)$ is a deep learning model → we have deep reinforcement learning
[Diagram: a DNN takes the state as input and outputs a Q-value for each action (Action 1, Q-value 1; …; Action n, Q-value n)]
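As one possible instance of the diagram, here is a toy Q-network that maps a state vector to one Q-value per action (PyTorch is assumed here; the layer sizes are arbitrary and this is not the slides' model):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per action
        )

    def forward(self, state):
        return self.net(state)              # shape: (batch, n_actions)

q_net = QNetwork(state_dim=4, n_actions=3)
print(q_net(torch.zeros(1, 4)))             # Q-values of the 3 actions for a dummy state
```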
Tabular RL vs Deep RL
[Diagram: for a query pair $(S_i, A_j)$, tabular RL looks up $Q(S_i, A_j)$ in the Q-table of (State, Action, Q-value) rows; in deep RL (DRL), the state $S_i$ is fed into a DNN that outputs $Q(S_i, A_1), Q(S_i, A_2), \dots, Q(S_i, A_m)$ for all actions at once]
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Policy-based RL
◦ Value-based RL: estimate values (action value, state value) with Monte Carlo, TD, SARSA, Q-learning, then choose an action based on the value
◦ What if we learn to make the decision directly, instead of estimating a value and then making the decision based on that value? → Policy-based RL
Value-based RL vs Policy-based RL
[Diagram: value-based RL = value estimation (Monte Carlo, TD, SARSA, Q-learning, …) followed by action selection (𝜖-greedy, 𝜖-soft, …)]
Value-based RL vs Policy-based RL
[Diagram: policy-based RL = a single end-to-end policy that replaces the separate value-estimation and action-selection blocks]
Policy-based RL
◦ Objective
◦ Policy function $\pi(a \mid s)$: the probability of performing action $a$ in state $s$
◦ Main idea
◦ $\pi(a \mid s, \Theta)$: $\pi$ can be a function or a model with parameters $\Theta$ to be learned
◦ $\mathcal{J}(\Theta)$: a scalar performance measure
◦ $\theta_{t+1} = \theta_t + \alpha\,\widehat{\nabla\mathcal{J}(\Theta_t)}$, where $\widehat{\nabla\mathcal{J}(\Theta_t)}$ is an estimate of $\nabla\mathcal{J}(\Theta_t)$ → gradient ascent (not gradient descent, since we maximize $\mathcal{J}$)
◦ Questions:
1. How to define $\mathcal{J}(\Theta_t)$?
2. How to estimate $\nabla\mathcal{J}(\Theta_t)$?
Objective function $\mathcal{J}(\Theta)$
◦ For episodic environments
◦ $\mathcal{J}(\Theta) = \mathbb{E}_\pi\left[G_1\right] = \mathbb{E}_\pi\left[R_1 + \gamma R_2 + \cdots + \gamma^{T-1} R_T\right]$
◦ For continuing environments
◦ Expectation of the state value
◦ $\mathcal{J}(\Theta) = \mathbb{E}_\pi\left[V(s)\right] = \sum_s d(s)\,V(s)$: the expectation over all the states, where $d(s)$ is the fraction of time spent in state $s$
◦ Expectation of the reward
◦ $\mathcal{J}(\Theta) = \mathbb{E}_\pi\left[r\right] = \sum_s d(s)\sum_a \pi(a \mid s, \Theta)\,R(a,s)$
$\mathcal{J}(\Theta)$ estimation
◦ $\theta_{t+1} = \theta_t + \alpha\,\widehat{\nabla\mathcal{J}(\Theta_t)}$
◦ Goals
◦ Find an expression proportional to $\nabla\mathcal{J}(\Theta)$: $F \propto \nabla\mathcal{J}(\Theta)$
◦ Preferably, $F$ is the expected value of some variable $e$
◦ → $F$ can then be approximated by samples of $e$
$\mathcal{J}(\Theta)$ estimation
◦ Definition of $\mathcal{J}(\Theta)$
◦ $\mathcal{J}(\Theta) = \mathbb{E}_\pi\left[G_1\right] = \mathbb{E}_\pi\left[R_1 + \gamma R_2 + \cdots + \gamma^{T-1} R_T\right] = v_\pi(s_0)$: the state value of the first state
→ $v_\pi(s) = \sum_a \pi(a \mid s)\,q_\pi(s,a)$
→ $\nabla v_\pi(s) = \sum_a \left[\nabla\pi(a \mid s)\,q_\pi(s,a) + \pi(a \mid s)\,\nabla q_\pi(s,a)\right]$
→ $\nabla\mathcal{J}(\Theta) = \nabla v_\pi(s_0) \propto \sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\,q_\pi(s,a)$ (Policy Gradient Theorem)
($\mu(s)$: the probability for $s$ to occur; $\nabla\pi(a \mid s)$: the gradient of the policy $\pi$; $q_\pi(s,a)$: the action value)
$\mathcal{J}(\Theta)$ estimation for episodic environments
$\nabla\mathcal{J}(\Theta) \propto \sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\,q_\pi(s,a) = \mathbb{E}_\pi\left[\sum_a \nabla\pi(a \mid S_t)\,q_\pi(S_t,a)\right]$
This is the expected value of a variable → approximate the expectation by a sample:
$\widehat{\nabla\mathcal{J}(\Theta)} = \sum_a \nabla\pi(a \mid S_t)\,q_\pi(S_t,a)$
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\sum_a \nabla\pi(a \mid S_t)\,q_\pi(S_t,a)$
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\sum_a \nabla\pi(a \mid S_t, \Theta_t)\,\hat{q}(S_t,a,\mathbb{w})$
REINFORCE algorithm (1992)
◦ REINFORCE = REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility
◦ Main idea
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\sum_a \nabla\pi(a \mid S_t, \Theta_t)\,q_\pi(S_t,a)$
Convert this into the expected value of a variable → approximate it by a sample. The sum is over all actions $a$ → to convert it into an expectation over the variable $a$, we need $\pi(a \mid S_t)$:
$\sum_a \nabla\pi(a \mid S_t)\,q_\pi(S_t,a) = \sum_a \pi(a \mid S_t, \Theta_t)\,\dfrac{\nabla\pi(a \mid S_t, \Theta_t)}{\pi(a \mid S_t, \Theta_t)}\,q_\pi(S_t,a)$
$= \mathbb{E}_\pi\left[q_\pi(S_t,A_t)\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}\right] = \mathbb{E}_\pi\left[G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}\right]$ (since $q_\pi(S_t,A_t) = \mathbb{E}_\pi\left[G_t \mid S_t, A_t\right]$)
$\mathbb{E}_\pi\left[G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}\right]$ can be approximated by the sample $G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
REINFORCE algorithm (1992)
◦ $\widehat{\nabla\mathcal{J}(\Theta)} = G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
→ $\Theta_{t+1} \leftarrow \Theta_t + \alpha\,G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
◦ Intuition
◦ The direction of $\nabla\pi$ increases the probability of repeating action $A_t$: if we increase $\Theta$ by an amount proportional to $\nabla\pi(A_t \mid S_t, \Theta_t)$, then $A_t$ will be performed more and more often
◦ $G_t \times \nabla\pi(A_t \mid S_t, \Theta_t)$: the larger $G_t$, the more we are encouraged to move $\Theta$ in the direction $\nabla\pi(A_t \mid S_t, \Theta_t)$
◦ $\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$: the larger $\pi(A_t \mid S_t, \Theta_t)$, the less we are encouraged to move $\Theta$ in the direction $\nabla\pi(A_t \mid S_t, \Theta_t)$
→ this avoids performing one action too many times
REINFORCE algorithm (1992)
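Below is a minimal episodic sketch of the REINFORCE update derived above, under my own assumptions (a softmax policy over discrete actions, a NumPy parameter matrix with one row per action, and a hypothetical env.reset()/env.step() interface); it is not the author's implementation:

```python
import numpy as np

def softmax_policy(theta, s):
    prefs = theta @ s                          # one preference per action (theta: n_actions x state_dim)
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99):
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                            # generate one episode following pi(.|., theta)
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)
        s_next, r, done = env.step(a)          # assumed environment interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    G = 0.0
    for t in reversed(range(len(rewards))):    # walk backwards to accumulate returns G_t
        G = rewards[t] + gamma * G
        p = softmax_policy(theta, states[t])
        grad_log = -np.outer(p, states[t])     # grad of log pi(.|s): -pi(b|s)*s for every action b ...
        grad_log[actions[t]] += states[t]      # ... plus s for the action actually taken
        theta = theta + alpha * G * grad_log   # Theta += alpha * G_t * grad log pi(A_t|S_t, Theta)
    return theta
```

Note that the ratio $\nabla\pi/\pi$ in the slide equals $\nabla\log\pi$, which is what the sketch computes.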
REINFORCE with Baseline
◦ $\nabla\mathcal{J}(\Theta) \propto \sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\,q_\pi(s,a)$
→ large values of $q_\pi(s,a)$ will dominate $\nabla\mathcal{J}(\Theta)$
→ use a baseline to normalize $q_\pi(s,a)$:
$\nabla\mathcal{J}(\Theta) \propto \sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\left(q_\pi(s,a) - b(s)\right)$
This holds because $\sum_s \mu(s)\sum_a \nabla\pi(a \mid s)\left(q_\pi(s,a) - b(s)\right) = \sum_s \mu(s)\left[\sum_a \nabla\pi(a \mid s)\,q_\pi(s,a) - b(s)\sum_a \nabla\pi(a \mid s)\right]$ and $\sum_a \nabla\pi(a \mid s) = \nabla\sum_a \pi(a \mid s) = \nabla 1 = 0$
How to choose $b(s)$:
- $b(s)$ can be anything, as long as its value is independent of $a$
- we use $b(s)$ to decrease the variance
REINFORCE with Baseline
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\,G_t\,\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
→ $\Theta_{t+1} \leftarrow \Theta_t + \alpha\left(G_t - b(S_t)\right)\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
The expected value of $G_t$ is $v(S_t)$
→ a natural selection for $b(S_t)$ is an estimate $\hat{v}(S_t, \mathbb{w})$ of $v(S_t)$, learned by
$\mathbb{w}_{t+1} \leftarrow \mathbb{w}_t + \beta\left(G_t - \hat{v}(S_t, \mathbb{w}_t)\right)\nabla\hat{v}(S_t, \mathbb{w}_t)$
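A short sketch of just these two updates with a linear state-value baseline $\hat{v}(s,\mathbb{w}) = \mathbb{w}\cdot s$; the policy helper and all shapes/step sizes are assumptions carried over from the earlier REINFORCE sketch:

```python
import numpy as np

def baseline_updates(theta, w, s, a, G, policy, alpha=0.01, beta=0.01):
    p = policy(theta, s)
    grad_log = -np.outer(p, s)
    grad_log[a] += s                          # grad log pi(a | s, Theta) for a softmax policy
    delta = G - w @ s                         # G_t - v_hat(S_t, w)
    theta = theta + alpha * delta * grad_log  # policy update with the baselined return
    w = w + beta * delta * s                  # w += beta * (G_t - v_hat) * grad v_hat (grad = s)
    return theta, w
```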
Actor-critic learning
◦ Definition: methods that learn approximations to both the policy and the value function are often called actor-critic methods
◦ Actor: the learned policy
◦ Critic: the learned value function
◦ Usually a state-value function
◦ Is policy-based RL with a baseline an actor-critic method?
◦ If the baseline is a stationary value that is never updated with experience → NOT actor-critic
◦ If the baseline is estimated from experience → it can be called an actor-critic method
Actor-critic learning
$\Theta_{t+1} \leftarrow \Theta_t + \alpha\left(G_t - b(S_t)\right)\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
→ $\Theta_{t+1} \leftarrow \Theta_t + \alpha\left(G_t - \hat{v}(S_t, \mathbb{w})\right)\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$ (REINFORCE with baseline)
Replace $G_t$ with the bootstrapped estimate $R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbb{w})$:
→ $\Theta_{t+1} \leftarrow \Theta_t + \alpha\left(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbb{w}) - \hat{v}(S_t, \mathbb{w})\right)\dfrac{\nabla\pi(A_t \mid S_t, \Theta_t)}{\pi(A_t \mid S_t, \Theta_t)}$
$\mathbb{w}_{t+1} \leftarrow \mathbb{w}_t + \beta\left(R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbb{w}) - \hat{v}(S_t, \mathbb{w})\right)\nabla\hat{v}(S_t, \mathbb{w}_t)$
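A one-step actor-critic sketch matching the updates above, again with a linear critic $\hat{v}(s,\mathbb{w}) = \mathbb{w}\cdot s$ and a softmax actor; every name and step size is an assumption for illustration:

```python
import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, done, policy,
                      alpha=0.01, beta=0.01, gamma=0.99):
    v_next = 0.0 if done else w @ s_next
    td_error = r + gamma * v_next - w @ s        # R_{t+1} + gamma*v_hat(S_{t+1}) - v_hat(S_t)
    p = policy(theta, s)
    grad_log = -np.outer(p, s)
    grad_log[a] += s                             # grad log pi(A_t | S_t, Theta)
    theta = theta + alpha * td_error * grad_log  # actor update
    w = w + beta * td_error * s                  # critic update
    return theta, w
```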
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking
Applications in networking
◦ Routing in WSNs
◦ How to forward packets from S to D?
◦ Reinforcement learning modeling
◦ Agent: sensor nodes
◦ RL model: Q-learning
◦ Negative rewarding
• Khanh Le, Nguyen Thanh Hung, Kien Nguyen, Phi Le Nguyen, “Exploiting Q-learning in Extending the Network Lifetime of Wireless Sensor Networks with Holes”, ICPADS 2019
• Phi Le Nguyen, Nang Hung Nguyen, Tuan Anh Nguyen Dinh, Khanh Le, Thanh Hung Nguyen, and Kien Nguyen, “QIH: An Efficient Q-Learning Inspired Hole-Bypassing Routing Protocol for WSNs”, IEEE Access, 2021
Applications in networking
◦ Wireless charging algorithm
◦ Where should the mobile charger move to?
◦ How long should the mobile charger charge the sensors?
◦ Techniques: Q-learning + fuzzy logic
• La Van Quan, Phi Le Nguyen, Thanh-Hung Nguyen, Kien Nguyen, “Q-learning-based, Optimized On-demand Charging Algorithm in WRSN”, NCA 2020
• Phi Le Nguyen, Van Quan La, Anh Duy Nguyen, Thanh Hung Nguyen, and Kien Nguyen, “An On-Demand Charging for Connected Target Coverage in WRSNs Using Fuzzy Logic and Q-Learning”, MDPI Sensors 2021
Applications in networking
◦ Offloading in MEC
◦ Where to offload the tasks?
◦ RL model: MAB (multi-armed bandit)
◦ Exploration strategy
• Nang Hung Nguyen, Phi Le Nguyen, Hieu Dinh, Thanh Hung Nguyen, Kien Nguyen, “Multi-Agent Multi-Armed Bandit Learning for Offloading Delay Minimization in V2X Networks”, EUC 2021
• Trung Thanh Nguyen, Truong Thao Nguyen, Tuan Anh Nguyen Dinh, Thanh-Hung Nguyen, Phi Le Nguyen, “Q-learning-based Opportunistic Communication for Real-time Mobile Air Quality Monitoring Systems”, IPCCC 2021
Source code
◦ https://colab.research.google.com/drive/1-_6DVuOjJWBp2V5x53uTZC9U5Z1IoYiT#scrollTo=GzO_xDRDr5NP