
2021/09/30

Reinforcement learning:
Basic concepts and Applications in networking

Nguyen Phi Le
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking

2
Example

4
Interactive learning

[Figures: Multi-Armed Bandit, traffic light control, and a wireless sensor network (sensor nodes reporting to a base station) as examples of interactive learning]

◦ The learner is not told which actions to take → trial-and-error
◦ An action may also affect future situations → delayed reward

5
Definition
◦ Reinforcement learning
◦ Learning what to do—how to map situations to actions— to achieve the goal
◦ Reward hypothesis
◦ That all of what we mean by goals and purposes can be well thought of as the
maximization of the expected value of the cumulative sum of a received scalar
signal (called reward) (Richard S. Sutton, RL: An introduction, 2018)

6
Goal and reward
◦ Reward: a scalar signal received from the environment at each step
◦ Goal: maximizing cumulative reward in the long run

Examples:
◦ Multi-Armed Bandit: Goal = maximizing total gain; Reward = gain at every turn
◦ Traffic light control: Goal = decreasing traffic jams; Reward = waiting time, queue length, lane speed, …
◦ Wireless sensor networks (sensor nodes and a base station): Goal = increasing network lifetime; Reward = load balancing, route length, …

7
RL vs other learning techniques
◦ RL vs supervised learning
◦ Supervised learning: learning from a training set of labeled examples
◦ Know the true action to take
◦ Reinforcement learning: do not know the optimal action
◦ RL vs unsupervised learning
◦ Unsupervised learning: finding structure hidden in collections of unlabeled data
◦ Reinforcement learning: maximizing the reward signal

8
RL framework
◦ Policy: a mapping from states to actions
◦ may be stochastic, specifying probabilities for each action
◦ Reward signal: the goal of a reinforcement learning problem
◦ objective is to maximize the total reward the agent receives over the long run
◦ Value: total amount of cumulative reward over the future, starting from that state
◦ Reward: immediate; Value: long-run
◦ Action: can be any decision we want to learn how to make
◦ State: can be anything we can know that might be useful in making that decision

[Diagram: agent-environment loop: the agent observes a state, its policy selects an action, and the environment returns a reward signal and a new state]
9
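
The loop in the diagram can be written down directly. Below is a minimal sketch (not from the slides) of the agent-environment interaction, assuming a hypothetical environment with reset()/step() methods and an agent exposing select_action() and update(); all names are illustrative.

    # Minimal agent-environment interaction loop (illustrative sketch).
    # `env` and `agent` are hypothetical objects; any concrete RL code
    # (e.g., the Q-learning examples later in these slides) follows this shape.

    def run_episode(env, agent, max_steps=1000):
        state = env.reset()                    # initial state from the environment
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.select_action(state)            # policy: state -> action
            next_state, reward, done = env.step(action)    # environment returns reward + new state
            agent.update(state, action, reward, next_state, done)  # learn from experience
            total_reward += reward
            state = next_state
            if done:
                break
        return total_reward
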
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking

10
Value-based RL
◦ RL’s objective: making decisions
◦ Input: state
◦ Output: action
◦ How to decide on an action?
◦ Using an action selection policy
◦ Based on action/state values
◦ How to measure the values?
◦ Value functions: estimate the goodness of states/actions based on experience
◦ Exploration-exploitation strategy

[Diagram: agent-environment loop, with a value function inside the agent feeding the action selection policy]

11
Value functions
◦ Key point of value-based RL models
◦ Questions?
◦ What can be used to measure the goodness of an action, state?
◦ How to measure?
Maximize: the long-term cumulative reward (the return)

G_t = R_{t+1} + γ R_{t+2} + … + γ^k R_{t+k+1} + … = ∑_{k=0}^∞ γ^k R_{t+k+1}

where R_{t+1} is the reward received after performing action a at time t, and the remaining terms are the discounted rewards received from time t+1 onward while following policy π. The goal is to maximize G_t.

12
Value functions
◦ State value function: how good it is for the agent to be in a given state
◦ Expected return when starting in s and following thereafter policy 𝜋
◦ Action value function: how good it is to perform a given action in a
given state
◦ Expected return starting from s, taking the action a, and thereafter following policy
𝜋

State value function for policy 𝜋

v_π(s) ≐ 𝔼_π[ G_t | S_t = s ] = 𝔼_π[ ∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]

Action value function for policy π

q_π(s, a) ≐ 𝔼_π[ G_t | S_t = s, A_t = a ] = 𝔼_π[ ∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]

13
Action value estimation
◦ q_π(s, a) ≐ 𝔼_π[ G_t | S_t = s, A_t = a ] = 𝔼_π[ ∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]
◦ How to estimate q_π(s, a)?
◦ The expectation would require performing the same action in the same state infinitely many times → impossible
◦ Solution
◦ Replace the expected value by a sample-based estimation
◦ Estimate the value of 𝔼_π[ ∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ] using a few samples
◦ Estimation method: iterative estimation
◦ Initialize q_π(s, a)
◦ Update q_π(s, a) after performing one action and receiving a reward
◦ Let q̂_π(s, a) be the new sample estimate of 𝔼_π[ ∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]
◦ New q_π(s, a) ← (1 − α) × old q_π(s, a) + α × q̂_π(s, a)

14
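
As a small sketch of the iterative estimation above (assuming a dictionary Q keyed by (state, action) pairs and a sample return already computed), the update is a single line:

    def update_estimate(Q, state, action, sample_return, alpha=0.1):
        """Move Q(s, a) a fraction alpha toward a new sample estimate of the return."""
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = (1 - alpha) * old + alpha * sample_return
        return Q[(state, action)]
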
Action value estimation
◦ How to estimate the action value?
◦ Bellman equation
q_π(s, a) = 𝔼_π[ G_t | S_t = s, A_t = a ]
          = 𝔼_π[ ∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]
          = 𝔼_π[ R_{t+1} + ∑_{k=1}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]
          = 𝔼_π[ R_{t+1} + γ ∑_{k=0}^∞ γ^k R_{t+k+2} | S_t = s, A_t = a ]
          = 𝔼_π[ R_{t+1} + γ q_π(S_{t+1}, a′) | S_t = s, A_t = a ]

The left-hand side is the action value of (s, a); R_{t+1} is the immediate reward after performing action a in state s, and γ q_π(S_{t+1}, a′) is the discounted action value at the next step.
15
Action value estimation
◦ Bellman equation
◦ q_π(s, a) = 𝔼_π[ R_{t+1} + γ q_π(S_{t+1}, a′) | S_t = s, A_t = a ]

Quantity to be estimated from one sample: q̂_π(s, a) = r + γ Q(s′, a′)

New q_π(s, a) = (1 − α) × current q_π(s, a) + α × q̂_π(s, a)

Q(s, a) ← (1 − α) Q(s, a) + α ( r + γ Q(s′, a′) )    (SARSA)

16
Action value estimation
Bellman equation
q_π(s, a) = 𝔼_π[ R_{t+1} + γ q_π(S_{t+1}, a′) | S_t = s, A_t = a ]
q_{π*}(s, a) = 𝔼_{π*}[ R_{t+1} + γ q_{π*}(S_{t+1}, a′) | S_t = s, A_t = a ]
            = 𝔼_{π*}[ R_{t+1} + γ max_{a′} q_{π*}(S_{t+1}, a′) | S_t = s, A_t = a ]

Sample-based estimation: q̂_{π*}(s, a) = r + γ max_{a′} Q(s′, a′)

Q(s, a) ← (1 − α) Q(s, a) + α ( r + γ max_{a′} Q(s′, a′) )    (Q-learning)
Q(s, a) ← (1 − α) Q(s, a) + α ( r + γ Q(s′, a′) )             (SARSA)
17
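
A minimal sketch of the two tabular updates above, assuming Q is a dict of dicts Q[state][action] that has been initialized for every state encountered; only the target differs between Q-learning (max over next actions) and SARSA (the action actually taken next):

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Q-learning target: r + gamma * max_a' Q(s', a')
        target = r + gamma * max(Q[s_next].values())
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # SARSA target: r + gamma * Q(s', a'), with a' the action actually chosen in s'
        target = r + gamma * Q[s_next][a_next]
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
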
State value estimation
Bellman equation for state-value function
v_π(s) = 𝔼_π[ G_t | S_t = s ]
       = 𝔼_π[ ∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]
       = 𝔼_π[ R_{t+1} + γ ∑_{k=0}^∞ γ^k R_{t+k+2} | S_t = s ]
       = 𝔼_π[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s ]

New v_π(s) = (1 − α) × current v_π(s) + α × E, where E is a sample estimate of the return

Monte Carlo:              V(S_t) ← (1 − α) V(S_t) + α G_t
Temporal Difference (TD): V(S_t) ← (1 − α) V(S_t) + α ( R_{t+1} + γ V(S_{t+1}) )
18
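
A sketch of the two state-value updates, assuming V is a dict from states to value estimates: Monte Carlo waits for the full return G_t, while TD(0) bootstraps from V(S_{t+1}).

    def mc_update(V, s, G, alpha=0.1):
        # Monte Carlo: move V(s) toward the observed return G_t
        V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * G

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
        # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
        target = r + gamma * V.get(s_next, 0.0)
        V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * target
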
Action selection policy
◦ Based on value functions
◦ Exploitation – exploration strategy
◦ Choose the action a whose Q(s, a) is the greatest ← exploitation
◦ Problem: local maximum → need to explore new actions ← exploration
◦ Some exploration methods
◦ 𝜖-greedy
◦ Time-adaptive epsilon (annealing 𝜖)
◦ 𝜖-soft
◦ Value-adaptive epsilon
◦ Sigmoid epsilon
◦ Optimistic initialization
◦ Upper confidence bound (UCB)

19
𝜖-greedy
◦ Define a small number 𝜖 (e.g., 𝜖 = 0.01)
◦ Each time the agent needs to choose an action
◦ Generate a probability p ∈ (0, 1)
◦ A_t = argmax_a q(s, a) if p ≥ 𝜖; a random action otherwise

[Plot: expected selection frequency of the best action vs. the other actions under 𝜖-greedy with 𝜖 = 0.1]
20
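
A sketch of 𝜖-greedy selection for one state, assuming Q[state] is a dict from actions to estimated values:

    import random

    def epsilon_greedy(Q, state, epsilon=0.01):
        """With probability epsilon pick a random action, otherwise the greedy one."""
        actions = list(Q[state].keys())
        if random.random() < epsilon:
            return random.choice(actions)                    # exploration
        return max(actions, key=lambda a: Q[state][a])       # exploitation
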
𝜖-greedy multi-armed bandit
◦ Action space: A = {a_1, …, a_k}   (the arms of a Multi-Armed Bandit)
◦ State space: a single state
◦ Reward: the gain received
◦ Value function
◦ Q_t(a) ≐ (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)
◦ Exploration strategy: 𝜖-greedy

(Goal = maximizing total gain; Reward = gain at every turn)
21
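
Putting the pieces together for the bandit case, here is a sketch that keeps the sample-average Q_t(a) and explores with 𝜖-greedy; pull_arm is an assumed function returning the random gain of an arm.

    import random

    def epsilon_greedy_bandit(pull_arm, n_arms, n_steps=1000, epsilon=0.1):
        """pull_arm(a) is assumed to return the random gain of arm a."""
        Q = [0.0] * n_arms   # sample-average value estimate per arm
        N = [0] * n_arms     # number of times each arm was taken
        total = 0.0
        for _ in range(n_steps):
            if random.random() < epsilon:
                a = random.randrange(n_arms)                 # explore
            else:
                a = max(range(n_arms), key=lambda i: Q[i])   # exploit
            r = pull_arm(a)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]   # incremental sample average
            total += r
        return Q, total
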
Exploration vs Exploitation

The higher the 𝜖, the more exploration


→ the more unstable the average reward

22
Exploration vs Exploitation

The frequency of selecting the optimal action is inversely proportional to 𝜖

23
Optimistic initialization
◦ Initialize Q_0(a) = X for all a (X is a significantly large number)
◦ Q_t(a) ≐ (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)

◦ Action selection strategy
◦ At every step t: choose A_t = argmax_a Q_t(a)
◦ Guarantees that every action is chosen at least once: an action that has never been tried keeps its optimistic value X, which stays above the estimates of the actions already tried

24
Optimistic initialization
◦ High exploration at first
◦ But totally greedy afterwards
◦ An appropriate value of X is needed

25
Upper confidence bound (UCB)
◦ Actions that have been selected frequently → should be explored less
◦ The confidence bound is determined by how rarely the action has been selected
◦ Prioritize actions with
◦ High estimated action value → exploitation
◦ High confidence bound → exploration

[Bar chart: estimated value plus confidence bound for five actions]

26
Upper confidence bound (UCB)
◦ Mathematical expression
◦ The number of times an action a has been taken: N_t(a)
◦ The total number of time steps so far: N_t
◦ Confidence bound
◦ CB_t(a) = c √( ln N_t / N_t(a) )   (c: a constant)
◦ Action selection policy
◦ At every step t: choose A_t = argmax_a [ Q_t(a) + c √( ln N_t / N_t(a) ) ]
◦ Intuition
◦ If an action a has not been selected for a long time → N_t(a) stays small while N_t grows → CB_t(a) increases → a will be prioritized in the next steps

27
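
A sketch of UCB selection, using the standard c·√(ln N_t / N_t(a)) bonus assumed in the reconstruction above; Q and N are lists indexed by action.

    import math

    def ucb_select(Q, N, t, c=2.0):
        """Pick argmax_a [ Q(a) + c * sqrt(ln(t) / N(a)) ]; untried actions are chosen first.
        Q, N: lists indexed by action; t: total number of steps so far."""
        for a, n in enumerate(N):
            if n == 0:
                return a   # every action is tried at least once
        scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
        return max(range(len(Q)), key=lambda a: scores[a])
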
Upper confidence bound (UCB)

28
Other exploration strategies
◦ Time-adaptive epsilon (annealing 𝜖)
◦ 𝜖 = 1 / max(time, c)
◦ c: a small positive parameter
◦ time: the total number of actions performed so far → the larger the time, the smaller the 𝜖 → the more experience, the less exploration
◦ 𝜖-soft
◦ Guarantees that the probability of choosing any action is at least 𝜖 / |A_s|
◦ A_s: the action space

29
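
A small sketch of these two ideas; the annealing schedule follows the slide's formula as reconstructed above, so treat it as an assumption.

    import random

    def annealed_epsilon(time, c=1):
        # epsilon shrinks as the agent gains experience
        return 1.0 / max(time, c)

    def epsilon_soft_probs(q_values, epsilon):
        """Every action keeps probability at least epsilon/|A|; the greedy action gets the rest."""
        n = len(q_values)
        probs = [epsilon / n] * n
        greedy = max(range(n), key=lambda a: q_values[a])
        probs[greedy] += 1.0 - epsilon
        return probs
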
Other exploration strategies
◦ Value-adaptive epsilon
◦ Adjust the exploration probability based on the variation of the value function
◦ When the value varies a lot → unstable → the agent seems not to know much about the environment → it should explore more
◦ When the value is stable → no need to explore much
◦ Mathematical expression
◦ f(s_t, a_t, σ) = (1 − e^{−|Q_{t+1}(s,a) − Q_t(s,a)|/σ}) / (1 + e^{−|Q_{t+1}(s,a) − Q_t(s,a)|/σ}) → the larger the change |Q_{t+1}(s,a) − Q_t(s,a)|, the larger f
◦ 𝜖_{t+1}(s) = δ f(s_t, a_t, σ) + (1 − δ) 𝜖_t(s)

30
Other exploration strategies
◦ Sigmoid epsilon
◦ P_explore = 1 − 1 / (1 + e^{−ω × var(q)})
◦ ω: a hyper-parameter
◦ var(q): the variance of the action values
◦ When var(q) is sufficiently large → there is an action that dominates the others → it should be chosen
◦ When var(q) is small → all actions are nearly equal → exploration should increase

31
Tabular RL
◦ Use a table (named the Q-table) to represent the action values
◦ Update the Q-table after performing every action
◦ Drawbacks
◦ When the state/action spaces are too large → the table size gets large too
◦ Cannot deal with newly appearing states/actions (e.g., when the model is trained offline)

[Table: Q-table with columns State | Action | Q-value, one row per (state, action) pair: (S_1, A_1, Q(S_1, A_1)), (S_1, A_2, Q(S_1, A_2)), …, (S_n, A_m, Q(S_n, A_m))]

32
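
A sketch of the tabular representation, assuming hashable states and a fixed action list; the dictionary grows with every distinct state visited, which is exactly the scalability drawback noted above.

    from collections import defaultdict

    def make_q_table(actions, initial_value=0.0):
        """Q-table as a dict: Q[state][action] -> value. One row per (state, action) pair."""
        return defaultdict(lambda: {a: initial_value for a in actions})

    # Usage sketch:
    # Q = make_q_table(actions=["left", "right", "forward"])
    # Q[current_state]["left"] = 0.3      # updated after each action
    # len(Q)                              # number of distinct states seen so far
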
Approximate RL
◦ Represent the relation of (s, a, q) by a function or a model: q̂(s, a)
◦ Input: a vector representing s
◦ Parameters of q̂(s, a) are updated after every action
◦ Using gradient descent

◦ Advantages
◦ Handles a large state/action space
◦ Can deal with newly appearing states

33
Approximate RL

Supervised learning view:
◦ Model: q̂_w(s, a); target: q_*(s, a)
◦ Loss: L = ½ ( q_*(s, a) − q̂_w(s, a) )²
◦ Update: w ← w − α ∇L
Problem: in most cases, q_*(s, a) is unknown
Solution: estimate q_*(s, a)

34
Recall about tabular RL
◦ Tabular Q-learning:
◦ Q(s, a) = (1 − α) Q(s, a) + α ( r + γ max_{a′} Q(s′, a′) )
          = Q(s, a) + α ( r + γ max_{a′} Q(s′, a′) − Q(s, a) )
   with Y = r + γ max_{a′} Q(s′, a′) (the estimated target value) and X = Q(s, a) (the current value)

L = ½ ( r + γ max_{a′} Q(s′, a′) − Q(s, a) )²   →   ∂L/∂Q(s, a) = −( r + γ max_{a′} Q(s′, a′) − Q(s, a) )

Q(s, a) = Q(s, a) − α ∇L   →   the tabular update is gradient descent on L

35
Approximate RL
◦ Tabular Q-learning
◦ L = ½ ( r + γ max_{a′} Q(s′, a′) − Q(s, a) )² ;   Q(s, a) = Q(s, a) − α ∇L
   (Y = r + γ max_{a′} Q(s′, a′), X = Q(s, a))

◦ Approximate Q-learning: Q_w(s, a)
◦ L = ½ ( r + γ max_{a′} Q_w(s′, a′) − Q_w(s, a) )²
◦ E.g.: Q_w(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + … + w_n f_n(s, a)
◦ w_i ⟵ w_i + α ( r + γ max_{a′} Q_w(s′, a′) − Q_w(s, a) ) f_i(s, a)

36
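
A sketch of the linear-feature case above, assuming a feature function that returns the list f_1(s, a), …, f_n(s, a) for a given state-action pair:

    def q_value(w, features):
        # Q_w(s, a) = w_1 f_1(s, a) + ... + w_n f_n(s, a)
        return sum(wi * fi for wi, fi in zip(w, features))

    def approx_q_update(w, feats_sa, r, next_feats_all, alpha=0.01, gamma=0.99):
        """One approximate Q-learning step.
        feats_sa: features of the (s, a) just taken.
        next_feats_all: list of feature vectors, one per action available in s'."""
        q_sa = q_value(w, feats_sa)
        q_next = max(q_value(w, f) for f in next_feats_all) if next_feats_all else 0.0
        td_error = r + gamma * q_next - q_sa
        # w_i <- w_i + alpha * td_error * f_i(s, a)
        return [wi + alpha * td_error * fi for wi, fi in zip(w, feats_sa)]
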
Approximate RL
◦ Approximate Q-learning: Q_w(s, a)
◦ w_i ⟵ w_i + α ( r + γ max_{a′} Q_w(s′, a′) − Q_w(s, a) ) ∂Q_w/∂w_i

◦ SARSA
◦ Tabular SARSA: Q(s, a) = (1 − α) Q(s, a) + α ( r + γ Q(s′, a′) )
◦ Approximate SARSA: w_i ⟵ w_i + α ( r + γ Q_w(s′, a′) − Q_w(s, a) ) ∂Q_w/∂w_i

37
Approximate RL
◦ Q_w(s, a)
◦ Can be a very simple linear function
◦ Can be a machine learning model, an ANN, …
◦ When Q_w(s, a) is a deep learning model → we have deep reinforcement learning

[Diagram: a DNN takes the state as input and outputs one Q-value per action (Action 1, Q-value 1; …; Action n, Q-value n)]

38
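
A minimal sketch of such a deep Q-network in PyTorch (assuming torch is installed; the architecture and sizes are illustrative only): the network maps a state vector to one Q-value per action.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Maps a state vector to one Q-value per action (the deep RL picture above)."""
        def __init__(self, state_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, state):
            return self.net(state)   # shape: (batch, n_actions)

    # Usage sketch:
    # q_net = QNetwork(state_dim=8, n_actions=4)
    # q_values = q_net(torch.randn(1, 8))        # Q(s, a_1), ..., Q(s, a_4)
    # greedy_action = q_values.argmax(dim=1)
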
Tabular RL vs Deep RL
◦ Tabular RL: look up the pair (S_i, A_j) in the Q-table to read Q(S_i, A_j)
   [Table: State | Action | Q-value, with rows (S_1, A_1, Q(S_1, A_1)), (S_1, A_2, Q(S_1, A_2)), …, (S_n, A_m, Q(S_n, A_m))]
◦ Deep RL: feed S_i into a DNN, which outputs Q(S_i, A_1), Q(S_i, A_2), …, Q(S_i, A_m) for all actions at once
39
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking

40
Policy-based RL
Value-based RL:
◦ Estimate values (action value, state value): Monte Carlo, TD, SARSA, Q-learning, …
◦ Choose an action based on the value

What if we learn to make the decision directly, instead of estimating values and then making the decision based on the value?

→ Policy-based RL
41
Value-based RL vs Policy-based RL
[Diagram: the value-based RL pipeline, value estimation (Q-learning, SARSA, Monte Carlo, TD, …) followed by action selection (𝜖-greedy, 𝜖-soft, …)]

42
Value-based RL vs Policy-based RL
[Diagram: policy-based RL replaces the two-stage pipeline (value estimation + action selection) with a single end-to-end policy]

43
Policy-based RL
◦ Objective
◦ Policy function π(a|s): the probability of performing action a in state s
◦ Main idea
◦ π(a|s, Θ): π can be a function or a model with parameters Θ that need to be learned
◦ 𝒥(Θ): a scalar performance measure
◦ θ_{t+1} = θ_t + α ∇𝒥̂(Θ_t), where ∇𝒥̂(Θ_t) is an estimate of ∇𝒥(Θ_t) → a gradient ascent method (why not gradient descent? because we want to maximize 𝒥, not minimize it)
◦ Questions:
  1. How to define 𝒥(Θ)?
  2. How to estimate ∇𝒥(Θ)?
44
Objective function 𝒥 Θ
◦ For episodic environments
◦ 𝒥(Θ) = 𝔼_π[ G_1 ] = 𝔼_π[ R_1 + γ R_2 + … + γ^{T−1} R_T ]
◦ For continuing environments
◦ Expectation of the state value
◦ 𝒥(Θ) = 𝔼_π[ V(s) ] = ∑_s d(s) V(s), where d(s) is the probability of being in state s: an expectation over all the states
◦ Expectation of the reward
◦ 𝒥(Θ) = 𝔼_π[ r ] = ∑_s d(s) ∑_a π(a|s, Θ) R(a, s)

45
𝒥 Θ estimation
◦ θ_{t+1} = θ_t + α ∇𝒥̂(Θ_t)
◦ Goals
◦ Find an expression F proportional to ∇𝒥(Θ): F ∝ ∇𝒥(Θ)
◦ Preferably, F is the expected value of some variable e
◦ → F can then be approximated by samples of e

46
𝒥 Θ estimation
◦ Definition of 𝒥(Θ)
◦ 𝒥(Θ) = 𝔼_π[ G_1 ] = 𝔼_π[ R_1 + γ R_2 + … + γ^{T−1} R_T ] = v_π(s_0): the state value of the first state
→ v_π(s) = ∑_a π(a|s) q_π(s, a)
→ ∇v_π(s) = ∑_a [ ∇π(a|s) q_π(s, a) + π(a|s) ∇q_π(s, a) ]
→ ∇𝒥(Θ) = ∇v_π(s_0) ∝ ∑_s μ(s) ∑_a ∇π(a|s) q_π(s, a)   (Policy Gradient Theorem)
   where μ(s) is the probability for s to occur, ∇π(a|s) is the gradient of the policy π, and q_π(s, a) is the action value.
48
𝒥(Θ) estimation for episodic environments
∇𝒥(Θ) ∝ ∑_s μ(s) ∑_a ∇π(a|s) q_π(s, a)
       = 𝔼_π[ ∑_a ∇π(a|S_t) q_π(S_t, a) ]

This is the expected value of a variable → approximate the expectation by a sample:

∇𝒥̂(Θ) = ∑_a ∇π(a|S_t) q_π(S_t, a)

Θ_{t+1} ≐ Θ_t + α ∑_a ∇π(a|S_t) q_π(S_t, a)

Θ_{t+1} ≐ Θ_t + α ∑_a ∇π(a|S_t, Θ_t) q̂(S_t, a, w)
50
REINFORCE algorithm (1992)
◦ REINFORCE = REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility
◦ Main idea
Θ_{t+1} ≐ Θ_t + α ∑_a ∇π(a|S_t, Θ_t) q_π(S_t, a)

Convert this into the expected value of a variable → approximate it by a sample.
The sum is over all actions a → to convert it into an expectation over the action A_t ~ π(·|S_t), we need the factor π(a|S_t):

∑_a ∇π(a|S_t) q_π(S_t, a) = ∑_a π(a|S_t, Θ_t) [ ∇π(a|S_t, Θ_t) / π(a|S_t, Θ_t) ] q_π(S_t, a)
 = 𝔼_π[ ( ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t) ) q_π(S_t, A_t) ]
 = 𝔼_π[ ( ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t) ) G_t ]    (since q_π(S_t, A_t) = 𝔼_π[ G_t | S_t, A_t ])

𝔼_π[ ( ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t) ) G_t ] can be approximated by the sample ( ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t) ) G_t
51
REINFORCE algorithm (1992)
◦ ∇𝒥̂(Θ) = G_t ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t)

→ Θ_{t+1} ≐ Θ_t + α G_t ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t)

Intuition:
◦ The direction of ∇π increases the probability of repeating action A_t. That is, if Θ is increased by an amount proportional to ∇π(A_t|S_t, Θ_t), then A_t will be performed more and more often.
◦ G_t × ∇π(A_t|S_t, Θ_t): the larger G_t, the more we are encouraged to move Θ in the direction of ∇π(A_t|S_t, Θ_t).
◦ Dividing by π(A_t|S_t, Θ_t): the larger π(A_t|S_t, Θ_t) already is, the less we are encouraged to move Θ in the direction of ∇π(A_t|S_t, Θ_t) → avoids performing one action too many times.
52
REINFORCE algorithm (1992)

53
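
The original slide shows the textbook pseudocode. Below is a minimal sketch of episodic REINFORCE for a tabular softmax policy with preferences theta[s][a], using the fact that for a softmax policy the gradient of log π(a|s) with respect to the preference theta[s][b] is 1{b=a} − π(b|s); the environment interface (reset/step) is assumed, as in the earlier sketches.

    import math, random

    def softmax_policy(theta_s):
        """Action probabilities from per-state preferences theta_s = {action: preference}."""
        m = max(theta_s.values())
        exp = {a: math.exp(p - m) for a, p in theta_s.items()}
        z = sum(exp.values())
        return {a: e / z for a, e in exp.items()}

    def reinforce_episode(env, theta, actions, alpha=0.01, gamma=0.99):
        # 1) Generate one episode with the current policy
        s = env.reset()
        trajectory = []                       # (state, action, reward) triples
        done = False
        while not done:
            probs = softmax_policy(theta[s])
            a = random.choices(actions, weights=[probs[b] for b in actions])[0]
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next
        # 2) Walk backwards, compute returns, and update the preferences
        G = 0.0
        for t in reversed(range(len(trajectory))):
            s, a, r = trajectory[t]
            G = r + gamma * G                 # return from time t
            probs = softmax_policy(theta[s])
            for b in actions:                 # grad of log pi(a|s) wrt theta[s][b]
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * (gamma ** t) * G * grad_log
        return theta

    # theta can be initialized as:
    # from collections import defaultdict
    # theta = defaultdict(lambda: {a: 0.0 for a in actions})
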
REINFORCE with Baseline
◦ ∇𝒥(Θ) ∝ ∑_s μ(s) ∑_a ∇π(a|s) q_π(s, a)
→ Large values of q_π(s, a) will dominate ∇𝒥(Θ)
→ Use a baseline b(s) to normalize q_π(s, a):

∇𝒥(Θ) ∝ ∑_s μ(s) ∑_a ∇π(a|s) ( q_π(s, a) − b(s) )

This is valid because the baseline term contributes nothing to the gradient:
∑_s μ(s) ∑_a ∇π(a|s) ( q_π(s, a) − b(s) ) = ∑_s μ(s) [ ∑_a ∇π(a|s) q_π(s, a) − b(s) ∑_a ∇π(a|s) ], and ∑_a ∇π(a|s) = ∇ ∑_a π(a|s) = ∇1 = 0

How to choose b(s):
- b(s) can be anything as long as its value is independent of a
- We use b(s) to decrease the variance

54
REINFORCE with Baseline
Θ_{t+1} ≐ Θ_t + α G_t ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t)

→ Θ_{t+1} ≐ Θ_t + α ( G_t − b(S_t) ) ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t)

The expected value of G_t is v(S_t)
→ A natural selection of b(S_t) is an estimate v̂(S_t, w) of the state value, learned with

w_{t+1} ≐ w_t + β ( G_t − v̂(S_t, w_t) ) ∇v̂(S_t, w_t)

55
REINFORCE with Baseline

56
REINFORCE with Baseline

57
Actor-critic learning
◦ Definition: methods that learn approximations to both the policy and the value function are often called actor–critic methods
◦ Actor: the learned policy
◦ Critic: the learned value function
◦ Usually a state-value function

◦ Is policy-based RL with a baseline an actor–critic method?
◦ If the baseline is a stationary value that is never updated with experience → NOT actor–critic
◦ If the baseline is estimated from experience → it can be called an actor–critic method

58
Actor-critic learning
Θ_{t+1} ≐ Θ_t + α ( G_t − b(S_t) ) ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t)

→ Θ_{t+1} ≐ Θ_t + α ( G_t − v̂(S_t, w) ) ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t)    (REINFORCE with baseline)

Replace G_t with the bootstrapped one-step return R_{t+1} + γ v̂(S_{t+1}, w):

→ Θ_{t+1} ≐ Θ_t + α ( R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w) ) ∇π(A_t|S_t, Θ_t) / π(A_t|S_t, Θ_t)
   w_{t+1} ≐ w_t + β ( R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w) ) ∇v̂(S_t, w_t)

59
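
A sketch of the one-step actor-critic update above, with a tabular softmax actor (preferences theta[s][a]) and a tabular critic V (a defaultdict of floats); it reuses the softmax_policy helper from the REINFORCE sketch and omits the discounting-of-updates bookkeeping for brevity.

    def actor_critic_step(theta, V, s, a, r, s_next, done, actions,
                          alpha=0.01, beta=0.1, gamma=0.99):
        """One actor-critic update after observing (s, a, r, s_next)."""
        v_next = 0.0 if done else V[s_next]
        td_error = r + gamma * v_next - V[s]         # R + gamma*v(S') - v(S)
        V[s] += beta * td_error                      # critic update
        probs = softmax_policy(theta[s])
        for b in actions:                            # actor update along grad log pi(A|S)
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * td_error * grad_log
        return td_error
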
Agenda
◦ Definition and basic terms
◦ Value-based RL
◦ Tabular RL
◦ Approximate RL
◦ Policy-based RL
◦ Applications in networking

60
Applications in networking
◦ Routing in WSN
◦ How to forward packets from S to D?
◦ Reinforcement learning modeling
◦ Agent: sensor nodes
◦ RL model: Q-learning

Negative rewarding

• Khanh Le, Nguyen Thanh Hung, Kien Nguyen, Phi Le Nguyen, “Exploiting Q-learning in Extending the Network Lifetime of
Wireless Sensor Networks with Holes”, ICPADS 2019
• Phi Le Nguyen, Nang Hung Nguyen, Tuan Anh Nguyen Dinh, Khanh Le, Thanh Hung Nguyen, and Kien Nguyen, “QIH: An
Efficient Q-Learning Inspired Hole-Bypassing Routing Protocol for WSNs”, IEEE Access, 2021

61
Applications in networking
◦ Wireless charging algorithm
◦ Where should the mobile charger move to?
◦ How long should the mobile charger charge the sensors?
◦ Techniques: Q-learning + Fuzzy

• La Van Quan, Phi Le Nguyen, Thanh-Hung Nguyen, Kien Nguyen, “Q-learning-based, Optimized On-
demand Charging Algorithm in WRSN”, NCA 2020
• Phi Le Nguyen, Van Quan La, Anh Duy Nguyen, Thanh Hung Nguyen, and Kien Nguyen, “An On-Demand
Charging for Connected Target Coverage in WRSNs Using Fuzzy Logic and Q-Learning”, MDPI Sensors 2021

62
Applications in networking
◦ Offloading in MEC
◦ Where to offload the tasks
◦ RL model: MAB

◦ Exploration strategy

• Nang Hung Nguyen, Phi Le Nguyen, Hieu Dinh, Thanh Hung Nguyen, Kien Nguyen, “Multi-Agent Multi-Armed Bandit Learning for
Offloading Delay Minimization in V2X Networks”, EUC 2021
• Trung Thanh Nguyen, Truong Thao Nguyen, Tuan Anh Nguyen Dinh, Thanh-Hung Nguyen, Phi Le Nguyen, “Q-learning-based
Opportunistic Communication for Real-time Mobile Air Quality Monitoring Systems”, IPCCC 2021

63
Source code
◦ https://colab.research.google.com/drive/1-_6DVuOjJWBp2V5x53uTZC9U5Z1IoYiT#scrollTo=GzO_xDRDr5NP

64
