
JOURNAL OF COMMUNICATION AND INFORMATION SYSTEMS, VOL. 38, NO.1, 2023.


Reinforcement Learning-based Wi-Fi Contention Window Optimization

Sheila C. da S. J. Cruz, Messaoud Ahmed Ouameur, and Felipe A. P. de Figueiredo

Abstract—The collision avoidance mechanism adopted by the IEEE 802.11 standard is not optimal. The mechanism employs a binary exponential backoff (BEB) algorithm in the medium access control (MAC) layer. Such an algorithm increases the backoff interval whenever a collision is detected to minimize the probability of subsequent collisions. However, the increase of the backoff interval causes degradation of the radio spectrum utilization (i.e., bandwidth wastage). That problem worsens when the network has to manage the channel access to a dense number of stations, leading to a dramatic decrease in network performance. Furthermore, a wrong backoff setting increases the probability of collisions such that the stations experience numerous collisions before achieving the optimal backoff value. Therefore, to mitigate bandwidth wastage and, consequently, maximize the network performance, this work proposes using reinforcement learning (RL) algorithms, namely Deep Q Learning (DQN) and Deep Deterministic Policy Gradient (DDPG), to tackle such an optimization problem. In our proposed approach, we assess two different observation metrics, the average of the normalized level of the transmission queue of all associated stations and the probability of collisions. The overall network's throughput is defined as the reward. The action is the contention window (CW) value that maximizes throughput while minimizing the number of collisions. As for the simulations, the NS-3 network simulator is used along with a toolkit known as NS3-gym, which integrates a reinforcement-learning (RL) framework into NS-3. The results demonstrate that DQN and DDPG have much better performance than BEB for both static and dynamic scenarios, regardless of the number of stations. Additionally, our results show that observations based on the average of the normalized level of the transmission queues have a slightly better performance than observations based on the collision probability. Moreover, the performance difference with BEB is amplified as the number of stations increases, with DQN and DDPG showing a 45.52% increase in throughput with 50 stations. Furthermore, DQN and DDPG presented similar performances, meaning that either one could be employed.

Index Terms—Wi-Fi, contention-based access scheme, channel utilization optimization, machine learning, reinforcement learning, NS-3.

Sheila C. da S. J. Cruz and Felipe A. P. de Figueiredo are with the National Institute of Telecommunications (INATEL), Minas Gerais, Brazil (e-mail: [email protected], [email protected]); Messaoud Ahmed Ouameur is with Université du Québec à Trois-Rivières (UQTR), Quebec, Canada (e-mail: [email protected]).
This work was partially funded by Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) - Grant No. 2070.01.0004709/2021-28, by FADCOM - Fundo de Apoio ao Desenvolvimento das Comunicações, presidential decree no 264/10, November 26, 2010, Republic of Angola, by Huawei, under the project Advanced Academic Education in Telecommunications Networks and Systems, Grant No. PPA6001BRA23032110257684, by the Brazilian National Council for Research and Development (CNPq) under Grant Nos. 313036/2020-9 and 403827/2021-3, by São Paulo Research Foundation (FAPESP) under Grant No. 2021/06946-0, by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) and RNP, with resources from MCTIC, under Grant Nos. 01250.075413/2018-04, 01245.010604/2020-14, and 01245.020548/2021-07 under the Brazil 6G project of the Radiocommunication Reference Center (Centro de Referência em Radiocomunicações - CRR) of the National Institute of Telecommunications (Instituto Nacional de Telecomunicações - Inatel), Brazil; and by FCT/MCTES through national funds and when applicable co-funded EU funds under the Project UIDB/EEA/50008/2020.
Digital Object Identifier: 10.14209/jcis.2023.15

I. INTRODUCTION

THE IEEE 802.11, or simply Wi-Fi, is a set of wireless network standards designed and maintained by the Institute of Electrical and Electronics Engineers (IEEE) that defines MAC and physical layer (PHY) protocols for deploying wireless local area networks (WLANs). Their MAC layer implements a contention-based protocol, known as carrier-sensing multiple access with collision avoidance (CSMA/CA), for the nodes to access the wireless medium (i.e., the channel) efficiently [1], [2]. With CSMA/CA, the nodes compete to access the radio resources and, consequently, the channel [3], [4].

One of the most critical parameters of the CSMA/CA mechanism is the contention window (CW) value, also known as back-off time, which is a random delay used for reducing the risk of collisions. If the medium is busy, an about-to-transmit Wi-Fi device selects a random number uniformly distributed within the interval [0, CW] as its back-off value, which defers its transmission to a later time. CW has its value doubled every time a collision occurs (e.g., when an ACK is not received), reducing the likelihood of multiple stations selecting the same back-off value. CW values range from the minimum contention window (CWMin) value, generally equal to 15 or 31 depending on the Wi-Fi standard, to the established maximum contention window (CWMax) value, which is equal to 1023. CW is reset to CWMin when an ACK is received, or the maximum number of re-transmissions has been reached [5]. This deferring mechanism is also known as binary exponential back-off (BEB) [6] and is shown in Fig 1.

The BEB algorithm has several limitations and may not always be the best solution for collision avoidance, often providing suboptimal solutions [7], [8]. These limitations include inefficiency under high loads, lack of fairness among competing nodes, inability to adapt to changing network conditions, vulnerability to the hidden node problem, and no global optimization. While the BEB algorithm is widely used in Wi-Fi networks, it may not always be the best solution for collision avoidance. Other approaches, such as machine learning-based ones, may be better able to address the limitations of the BEB algorithm and provide more effective collision avoidance strategies.
Fig. 1. Binary Exponential Back-off: on each collision, CW_i = 2 x CW_{i-1} (CW_0 = 16, CW_1 = 32, CW_2 = 64, CW_3 = 128, CW_4 = 256, CW_5 = 512, CW_6 = 1024); on a successful transmission, CW_i = CW_0.
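To make the BEB rule above concrete, the short Python sketch below is our own illustration (not code from the paper); the constants follow the CWMin = 15 and CWMax = 1023 values quoted in the text.

```python
import random

CW_MIN = 15    # minimum contention window quoted in the text
CW_MAX = 1023  # maximum contention window

class BinaryExponentialBackoff:
    """Minimal sketch of the BEB deferring mechanism described above."""

    def __init__(self):
        self.cw = CW_MIN

    def draw_backoff(self):
        # Before transmitting on a busy medium, draw a uniform back-off in [0, CW].
        return random.randint(0, self.cw)

    def on_collision(self):
        # Missing ACK: double the window (next power of two minus one), capped at CW_MAX.
        self.cw = min(2 * (self.cw + 1) - 1, CW_MAX)

    def on_success(self):
        # ACK received (or retry limit reached): reset to the minimum window.
        self.cw = CW_MIN


beb = BinaryExponentialBackoff()
beb.on_collision()            # CW: 15 -> 31
beb.on_collision()            # CW: 31 -> 63
slots = beb.draw_backoff()    # uniform integer in [0, 63]
beb.on_success()              # CW resets to 15
print(slots, beb.cw)
```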

In scenarios with few nodes, collisions will be less frequent and impactful, especially in static scenarios, where the number of nodes remains the same. On the other hand, in dynamic scenarios, where the number of nodes increases throughout time, collisions will be commonplace. Furthermore, the high number of collisions reduces the network throughput drastically since CW has its value doubled when collisions are detected, leading to an inefficient network operation [9].

As can be seen, optimizing the CW value could be beneficial for Wi-Fi networks since the traditional BEB algorithm does not scale well when many nodes compete for the medium [10]. As network devices with high computational capabilities become increasingly common, CW can be optimized through machine learning (ML) algorithms. In ML, the most common learning paradigms are supervised, unsupervised, and reinforcement learning. Supervised algorithms require a labeled dataset where the outcomes for the respective inputs are known. Still, creating such a dataset requires a model and a solution to the problem. Developing accurate models is, in several cases, a challenging and troublesome task. Besides that, many solutions are suboptimal, and supervised ML algorithms learning from datasets created with those solutions will never surpass their performance, since the algorithm tries to replicate the input-to-label mapping present in the dataset [11].

Unsupervised learning occurs when the ML algorithm is trained using unlabeled data. The idea behind this paradigm is to find hidden (i.e., latent) patterns in the data. This paradigm could be used to identify hidden patterns in Wi-Fi traffic that could be contributing to collisions. By clustering similar traffic patterns, it may be possible to identify common sources of interference or other issues that are leading to collisions and take actions to mitigate them. Nonetheless, unsupervised learning can be limited in its applicability to collision avoidance in Wi-Fi networks due to the lack of labeled data, limited interpretability of results, limited control over the learning process, and the lack of a feedback loop for ongoing adjustment and optimization. While unsupervised learning can be useful for analyzing Wi-Fi traffic data and identifying patterns, other machine learning paradigms, such as reinforcement learning, may be better suited to the problem of collision avoidance.

Finally, reinforcement learning occurs when the ML algorithm (a.k.a. the agent in this context) interacts with the environment through trial and error action attempts without requiring labels, only reward information about the taken actions. This paradigm allows the agents (e.g., Wi-Fi nodes) to explore the environment by taking random actions and finding optimal solutions. This paradigm can be employed to optimize the actions of Wi-Fi nodes in a network to minimize collisions and maximize throughput. The nodes can learn to take actions that lead to higher reward values (e.g., successful transmissions) while avoiding actions that lead to lower reward values (e.g., collisions). This paradigm presents several advantages for collision avoidance, including the ability to learn from experience, optimize long-term performance, handle complex and dynamic environments, explore and exploit different strategies, and adapt to different contexts. RL algorithms can learn from the state of the network and adapt to changing network conditions, optimizing cumulative rewards over time. This makes RL well-suited to handle the complexity of Wi-Fi networks, find optimal solutions, and navigate different scenarios. Additionally, RL is flexible and adaptable, making it a powerful tool for collision avoidance in Wi-Fi networks. Therefore, since there is no optimal solution for the CW optimization (BEB is known to be suboptimal) [12]–[14], RL may offer more effective collision avoidance strategies. However, despite the increasing state-of-the-art contributions presented, RL-based algorithms are complex: they present high computational requirements; they require fine-tuned hyperparameters since they are sensitive to their settings; their training process might take several hours and need extensive exploration, requiring a significant amount of computing power; and they present difficulties in handling continuous and high-dimensional state and action spaces [15].

RL algorithms have addressed these limitations through various approaches. To mitigate high computational requirements, techniques like parallelization (i.e., distributed computing) and hardware acceleration have been employed, leveraging the power of multiple processors or machines and using specialized processors such as graphics processing units (GPUs) or tensor processing units (TPUs) to expedite the training process [16]. To tackle the need for extensive training data and exploration, methods such as experience replay,
where past experiences are stored and randomly sampled during training, allow for efficient and effective utilization of data [17]. Exploration strategies, such as epsilon-greedy or Thompson sampling, strike a balance between exploiting learned knowledge and exploring new possibilities [18]. Moreover, curiosity-driven or intrinsic motivation learning techniques have been explored to encourage agents to explore novel states or seek out new experiences, thereby reducing the reliance on vast amounts of explicit training data [19]. Sensitivity to hyperparameters is addressed through techniques like grid search, systematically exploring different hyperparameter combinations to find optimal settings [20]. More advanced approaches, such as Bayesian optimization, leverage probabilistic models to search the hyperparameter space intelligently [21]. Furthermore, algorithmic advancements like automatic hyperparameter tuning methods, utilizing meta-learning or reinforcement learning itself, have been developed to automate the selection of hyperparameters, enhancing the performance of the algorithms [22]. Difficulties in handling continuous and high-dimensional spaces are overcome using deep neural networks and parameterization methods. Deep neural networks are used for function approximation, allowing for learning policies directly from raw sensory inputs [23]. Continuous action spaces are handled through parameterization methods, such as using Gaussian distributions or policy parameterization [24]. Additionally, actor-critic architectures, combining policy and value function approximators, have shown promise in effectively handling high-dimensional state and action spaces [25]. These ongoing advancements aim to improve the efficiency and performance of reinforcement learning algorithms across different domains.

This work proposes using deep reinforcement learning (DRL) algorithms to optimize the CW value selection and improve the performance of Wi-Fi networks by maintaining a stable throughput and minimizing collisions¹. More specifically, we propose an RL-aided centralized solution aimed at finding the best CW value that maximizes the overall Wi-Fi network's throughput by properly setting CW values, especially in dense environments, with dozens to hundreds of stations. The proposed solution takes actions, i.e., selects a CW value, based on the observation metric adopted. In this work, we study and compare the use of two distinct DRL algorithms, namely, DQN and DDPG, and two different observation metrics, namely, the averaged normalized transmission queues' level of all associated stations and the collision probability, to tackle the CW optimization problem. DQN was chosen because it is relatively simple and has a discrete action space. However, despite its simplicity, DQN generally displays performance and flexibility that rival other methods [26]. DDPG was selected since it is a more complex method that represents actions as continuous values, yielding an exciting comparison with DQN [27]. By using DDPG, we want to explore the hypothesis that the proposed solution can adjust the CW value more precisely and in a more fine-grained fashion with a continuous action space. Therefore, the main contributions of this work are as follows:
1) Adoption of the averaged normalized transmission queue level as observation information used by the RL agent to take actions.
2) The comparison and assessment of two different observation metrics, namely, the averaged normalized transmission queue level and the collision probability.
3) A CW optimization solution that applies to any of the 802.11 standards.

¹The source code for the reproduction of the results is available at: https://github.com/sheila-janota/RLinWiFi-avg-queue-level

The remainder of the paper is organized as follows. Section II discusses related work. Section III presents a brief machine learning overview. Section IV describes the materials and methods used in the simulations. Section V presents the simulation results. Finally, Section VI presents conclusions and future works.

II. RELATED WORK

The literature presents adequate and excellent contributions of ML methods applied to CW optimization in wireless networks. For example, in [28], the authors propose a CW optimization mechanism for IEEE 802.11ax under dynamically varying network conditions employing DRL algorithms. The DRL algorithms were implemented on the NS-3 [29] simulator using the NS3-gym [30] framework, which enables integration with Python frameworks [30]. They proved to have efficiency close to optimal according to the throughput result, which remained stable even when the network topology changed dynamically. Their solution uses the collision probability, i.e., the transmission failure probability, to observe the overall network's status.

To allow channel access and a fair share of the unlicensed spectrum between wireless nodes, the authors in [31] propose an intelligent ML solution based on CWmin (minimum CW) adaptation. The issue is that aggressive nodes, as they refer to them in the paper, try to access the medium by forcefully choosing low CWmin values, while CSMA/CA-based nodes have a fixed CWmin set to 16, leading to an unfair share of the spectrum. The intelligent CW ML solution consists of a random forest, which is a supervised classification algorithm. Simulations were conducted on a C++ discrete-event-based simulator called CSIM [32] to evaluate the algorithm's performance. It was possible to obtain high throughput efficiency while maintaining fair allocation of the unlicensed channels with other coexisting nodes.

In [33], the authors present a Deep Q-learning algorithm to dynamically adapt CWmin for random access in wireless networks. The idea is to maximize a network utility function (i.e., a metric measuring the fair use of the medium) [34] under dynamic and uncertain scenarios by rewarding the actions that lead to high utilities (efficient resource usage). The proposed solution employs an intelligent node, called node 0, that implements the DQN algorithm to choose the CWmin for the next time step from historical observations. The simulation was conducted on NS-3 to evaluate the performance against the following baselines: optimal design, random forest classifier, fairness index, optimal constant, and standard protocol (with its CWmin fixed at 32). Two scenarios were considered for
the simulation. The first scenario uses two states and follows a Markov process for the CWmins of all nodes except node 0. The RL algorithm and random forest classifier reach outstanding performance for this case. The second scenario considers five states, and the CWmins of all nodes other than node 0 follow a more complex process. The RL algorithm achieves utility close to optimal when compared to a supervised random forest classifier.

In [10], the authors propose an ML-based solution using a Fixed-Share algorithm to dynamically adjust the CW value to improve network performance. The algorithm comprises CW calculation, a Loss/Gain function, and sharing weights. The algorithm considers the present and recent past conditions of the network. The NS-3 network simulator was used to evaluate the proposed solution, and the performance metrics used were average throughput, average end-to-end delay, and channel access fairness. The Fixed-Share algorithm achieves excellent performance compared to the other two conventional algorithms: binary exponential backoff (BEB) and History-Based Adaptive Backoff (HBAB).

To optimize CW in a wireless local area network, [35] presents three algorithms based on genetic fuzzy-contention window optimization (GF-CWO), which is a combination of a fuzzy logic controller and a genetic algorithm. The proposed algorithm is intended to solve issues related to success ratio, packet loss ratio, collision rate, fairness index, and energy consumption. Simulations were conducted in Matlab in order to evaluate the performance of the proposed solution, producing better results when compared to the BEB.

To avoid packet collisions in mobile ad hoc networks (MANETs) [36], [37], the authors of [38] propose a Q-learning-based solution to optimize the CW parameter in an IEEE 802.11 network. The proposed CW optimization method considers the number of packets sent and the collisions generated by each station. Simulation results show that selecting a good CW value improves the packet delivery ratio, channel access fairness, throughput, and latency. The benefits are even more significant when the queue size is less than or equal to 20.

In [39], a different approach to controlling the CW value, named contention window threshold, is used. It employs deep reinforcement learning (DRL) principles to establish a threshold value and learn the optimal settings under various network scenarios. The method used is called the smart exponential-threshold-linear backoff algorithm with a deep Q-learning network (SETL-DQN). Its results demonstrate that this algorithm can reduce collisions and improve throughput.

The authors of [40] apply DRL to the problem of random access in machine-type communications networks adopting slotted ALOHA access protocols. Their proposed solution aims at finding a better transmission policy for slotted ALOHA protocols. The proposed algorithm learns a policy that establishes a trade-off between user fairness and the overall network throughput. The solution employs centralized training, which makes the learned policy equal for all users. Their approach uses binary feedback signals and past actions to learn transmission probabilities and adapt to traffic scenarios with different arrival processes. Their results show that the proposed algorithm outperforms the classical solution, i.e., the exponential backoff, in terms of user fairness and overall throughput. Our proposed solution differs from this one because it tackles the problem of random access in slot-free and contention-based networks, aiming to maximize the network's throughput and, consequently, the throughput of individual users.

Differently from the previous works, in this work, we propose leveraging an alternative observation metric that is based on the average of all stations' normalized transmission queue levels, which seems a more informative and straightforward way to understand and capture the overall network's status, to train DRL agents to take actions aiming at finding the best CW value, which, in turn, optimizes the network performance. Moreover, we compare the observation metric proposed in [28] (i.e., the collision probability observed by the network) with the proposed one, i.e., the stations' averaged normalized transmission queue levels. The normalized transmission queues' level provides more information, taking into account the congestion level of the network and the behavior of all nodes, not just the individual node. It also leads to better performance, allows for adaptation to changing network conditions, and avoids myopic decisions. In contrast, using the collision probability alone can result in suboptimal decisions that do not take into account the impact on the overall network performance. Therefore, the averaged normalized transmission queues' level seems to be a more suitable metric for RL solutions to solve the collision avoidance problem in Wi-Fi networks.

The ML-based approaches presented in this section show how well ML algorithms can be applied to reach optimal performance in the wireless network field. Therefore, this motivated us to study and propose a DRL-based solution to reduce collisions by optimizing CW in different scenarios.

III. MACHINE LEARNING OVERVIEW

ML algorithms, also called models, have been widely applied to solve different problems related to optimization in wireless network communications systems [41]–[44]. These algorithms construct a model based on historical data, known as a training dataset, to perform tasks, for example, solving optimization problems, without being explicitly programmed to do so [45], [46]. ML algorithms can provide self-management, self-learning, and self-optimizing solutions for an extensive range of issues in the context of dynamic resource allocation, spectrum resource management, wireless network optimization, and much more [47].

The learning process of an ML model is called training, and it is used for the model to gain knowledge (i.e., infer a solution) and achieve the desired result. It is possible to classify ML model learning based on the type of its training, also called the learning paradigm [48]. The learning paradigms can be classified as supervised learning, unsupervised learning, and reinforcement learning (RL). Fig 2 shows the relations between the ML paradigms. Therefore, next, we provide a brief overview of these learning paradigms.
Fig. 2. ML learning paradigms based on the type of training.

A. Supervised Learning

In this paradigm, the ML model uses labeled data during the training phase. Each input sample is accompanied by its desired output sample, called a label. It is suitable for applications with plenty of historical data [49]. Some well-known algorithms following this paradigm are linear and logistic regression, support vector machines (SVM), K-nearest neighbors (KNN), and artificial neural networks (ANN).

B. Unsupervised Learning

In this paradigm, the ML model learns to find (sometimes hidden) useful patterns by exploring the input data without labels (i.e., without expected output values). The model is trained to create a compact representation of the data [49]. This learning paradigm includes the following algorithms: K-means, isolation forest, hierarchical clustering, and expectation-maximization.

C. Reinforcement Learning

In this learning paradigm, an agent, in this context the ML model, learns by continuously interacting with the environment and decides which action to take based on its own experience, mapping the current observed state of the environment to an action. The agent aims to learn a function, known as a policy in this context, that models the environment and maps observed states into the best actions. The agent performs a decision-making task by trial and error in a self-learning manner [49]. This paradigm includes the following algorithms: Q-learning, Deep Q-learning, policy gradient learning, deep deterministic policy gradient, and the multi-armed bandit.

The environment is where the information (observed state and reward) is produced, and it has a dynamic nature compared to the supervised and unsupervised learning paradigms. The Markov Decision Process (MDP) is generally adopted to represent the environment because it has a mathematical structure suitable for modeling decision-making problems [49]. It consists of a tuple of five elements M = {S, A, P, γ, R}, where S is the state space, A is the action space, P is the transition probability, γ is the discount factor, and R is the reward. A reward is a positive or negative numeric value that indicates the quality of the action taken at a particular state [45]. The higher the reward, the better the action taken at that state. Conversely, the lower the reward, the worse the action. RL algorithms aim to find a policy that maximizes the total future reward. Fig 3 shows the interaction of the agent with the environment, which occurs in the following way: the agent observes (senses) the current state of the environment, S_t; based on this observation, the agent selects an action, A_t, and executes it in the environment; the environment then returns information (i.e., results) in the form of a reward, R_{t+1}, and a next state, S_{t+1}, reached due to the action taken at the t-th time interval.

Fig. 3. An RL agent interacting with the environment: the agent receives the state S_t and reward R_t, takes the action A_t, and the environment returns S_{t+1} and R_{t+1}.

To better explain the RL learning cycle shown in Fig 3, we present a classic, well-known RL algorithm from the literature, namely the Q-learning algorithm, whose pseudo-code is shown in Algorithm 1. Q-learning is a tabular RL method where the training process occurs basically in a table with the rows as the states, the columns as the actions, and each element inside the table as a Q-value. The goal of Q-learning is to find the optimal Q-value (i.e., a measure of the quality of an action) that maximizes the reward through iterative updates of the Q-table Q_t(s, a) using the Bellman equation presented in (1) [50].

Q_{t+1}(s, a) = (1 − α) Q_t(s, a) + α [r + γ max_{a′} Q_t(s′, a′)],   (1)

where s is the state, a is the action, r is the reward, α is the learning rate, which takes values in the interval [0, 1], γ is the discount rate, which also takes values in [0, 1], s′ is the next state, a′ is the next action, and max_{a′} Q_t(s′, a′) represents the maximum possible reward based on the next state s′ and next action a′.

Algorithm 1 Q-learning
▷ Parameters initialization
1: Learning rate, α
2: Discount rate, γ
3: Exploration rate, ϵ, ϵ_Min, ϵ_Decay
4: Initial state, s_0
5: Action space size, j
6: Initialize the Q-table with zeros for all possible (state, action) pairs: Q_t(s, a) ← 0
7: for each episode do
8:   Reset the state to its initial state, s ← s_0
9:   if random.uniform(0, 1) < ϵ then
10:    a ← uniform random integer in [0, j]
11:  else
12:    a ← argmax(Q_t(s))
13:  end if
14:  After selecting the action a and applying it to the environment, a new reward r and next state s′ are returned from the environment.
▷ Learning stage: Q_{t+1}(s, a) is updated with new Q-values
15:  Q_{t+1}(s, a) ← (1 − α) Q_t(s, a) + α [r + γ max(Q_t(s′, a′))]
16:  s ← s′
17:  Less exploration of the environment is achieved by decreasing ϵ.
18:  if ϵ > ϵ_Min then
19:    ϵ ← ϵ × ϵ_Decay
20:  end if
21: end for

The basic idea of how Q-learning works after training is shown in Fig 4. For each state that is fed into the Q-table, there is a corresponding action, which is the action associated with the maximum Q-value for that given input state. In other words, given a row (the state), which action (the column) has the highest Q-value?

Fig. 4. Q-learning: the Q-table has states as rows and actions as columns; for an input state s, the output is Action = argmax_a Q(s, a).
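As a concrete companion to Algorithm 1 and equation (1), the following minimal Python sketch is our own illustration (not the paper's code); the state/action sizes and the example transition values are hypothetical.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 7        # hypothetical sizes for a toy problem
alpha, gamma = 0.1, 0.7           # learning rate and discount rate, both in [0, 1]

Q = np.zeros((N_STATES, N_ACTIONS))   # Q-table: rows are states, columns are actions

def bellman_update(s, a, r, s_next):
    """One iteration of equation (1): blend the old Q-value with the bootstrapped target."""
    target = r + gamma * np.max(Q[s_next])          # r + γ max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def greedy_action(s):
    """After training, pick the column with the highest Q-value for the given row (state)."""
    return int(np.argmax(Q[s]))

# Example of a single learning step with made-up transition values.
bellman_update(s=2, a=3, r=1.0, s_next=4)
print(greedy_action(2))
```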
Next, we explain how the policy is learned through a process of exploration-exploitation of the environment.

Policy: a policy is a rule that helps the agent select the best action in a specific state. The primary objective of an RL algorithm is to learn a policy that maximizes the expected cumulative reward. The policy is a function that gives the probability of taking a given action when the environment is in a given state. The policy is learned by employing a method of exploring unknown actions in a given state and exploiting the currently acquired knowledge. There must be a trade-off between exploration of the environment and exploitation of the learned policy [51]. Simple exploration-exploitation methods are the most practical and widely used ones. One such method is ϵ-greedy, where ϵ ∈ [0, 1] is a parameter controlling the amount of exploration and exploitation [52]. Normally, ϵ is a fixed hyperparameter, but its value can be decreased so that the agent explores the environment progressively less. Eq. (2) summarizes the ϵ-greedy exploration-exploitation mechanism used by RL algorithms to learn the best policy.

Action = { Random action (Exploration), if random number < ϵ; Best long-term action (Exploitation), otherwise.   (2)

Exploration: in this phase, the agent randomly selects an action from a uniformly distributed random variable with the number of possible values equal to the number of actions. Non-optimal actions are chosen to explore the environment, i.e., uncharted actions in a given state.

Exploitation: in this phase, the agent selects the action with the maximum quality value for that given state, i.e., it selects the action that has the best long-term effect in maximizing the expected cumulative reward.
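The ϵ-greedy rule in (2), together with the decay step of Algorithm 1, can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the Q-table and the decay constants below are placeholders.

```python
import random
import numpy as np

N_ACTIONS = 7
Q = np.zeros((5, N_ACTIONS))                     # placeholder Q-table (states x actions)
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995   # placeholder decay schedule

def epsilon_greedy(state):
    """Equation (2): explore with probability epsilon, otherwise exploit the Q-table."""
    if random.uniform(0.0, 1.0) < epsilon:
        return random.randint(0, N_ACTIONS - 1)  # exploration: uniform random action
    return int(np.argmax(Q[state]))              # exploitation: best known action

def decay_epsilon():
    """Progressively reduce exploration, as in lines 17-20 of Algorithm 1."""
    global epsilon
    if epsilon > eps_min:
        epsilon *= eps_decay

a = epsilon_greedy(state=0)
decay_epsilon()
print(a, epsilon)
```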
D. Deep Reinforcement Learning

DRL is an improved extension of RL that integrates deep learning (DL), i.e., ANNs, with reinforcement learning algorithms [53]. This integration happens because RL algorithms present limitations related to state/action spaces and computational and sample complexity [54]. That is, RL algorithms are not scalable and are limited to low-dimensional data issues, i.e., problems with a small number of actions and states [55]. Therefore, the integration with DL improves the scalability issue and makes RL algorithms support high-dimensional data tasks. Compared to conventional RL, DRL employs a large-dimensional neural network to speed up convergence. Fig 5 shows the deep reinforcement learning structure. The interaction with the environment occurs in the same way as with the RL agent. The only difference is that now the agent is an ANN model. DRL includes the following algorithms: DQN, DDPG, Twin Delayed Deep Deterministic Policy Gradient (TD3), and Double Deep Q-learning Network (DDQN). This work focuses on using the DQN and DDPG algorithms to assist in the optimization of CW with the primary goal of reducing node collisions while improving network performance. This way, next, we present a brief overview of these two DRL algorithms.

Fig. 5. DRL structure: the agent is a DNN-based policy π_θ(s, a) with parameters θ; it receives the reward R_t and state S_t, takes the action A_t, and the environment returns S_{t+1} and R_{t+1}.

1) Deep Q Network: DQN is an off-policy DRL algorithm based on Q-learning with a discrete action space and a continuous state space [56]. It is the result of incorporating deep learning into RL since, in many practical situations, the state space is high-dimensional and cannot be handled by traditional RL algorithms. For example, Q-learning has issues scaling to larger state and action spaces for being a tabular method. Therefore, DQN was developed to fix Q-learning's scaling problem. Being an off-policy algorithm means that DQN uses an experience replay memory, where the agent learns from a batch of randomly selected prior experiences instead of the most recent one [57]. This random set of past experiences mitigates the bias that might stem from the fact that some environments have a sequential nature [57]. Here, the agent is represented by a deep neural network (DNN) that uses the state as the input of the neural network, and the outputs are the Q-values corresponding to the possible actions, where the action taken by the agent is the one that maximizes the Q-values, as shown in Fig 6.

The Q-value represents the quality of the action in a given state. The idea is for the neural network to output the highest Q-value for the action that maximizes the cumulative expected reward. However, using a single neural network renders training very unstable [53]. The trick to mitigate this problem and add stability to the training process is using two neural networks, the predictive and target networks. They have the same structure (i.e., number of layers and neurons in each layer and activation functions) but have their weights updated at different times. The weights of the target network are not trained. Instead, they are periodically synchronized with the weights of the predictive network. The idea is that fixing the target Q-values (outputs of the target network) for several updates improves the predictive network's training stability. DQN employs batch training and an experience replay memory, making the agent learn from randomly sampled batch experiences. It also employs the ϵ-greedy exploration-exploitation mechanism.

Fig. 6. Deep Q-learning (DQN): the DNN takes the state s as input and outputs one Q-value per action, Q(s, a_i); the selected action is argmax_i Q(s, a_i).
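The target-network and replay-memory ideas described above can be summarized in the PyTorch sketch below. This is our own hedged illustration under assumed layer sizes, not the authors' implementation (which the paper only states was built with TensorFlow and PyTorch on NS3-gym); the learning rate, discount, replay size, and batch size are taken from Table II.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small fully connected Q-network: state in, one Q-value per discrete action out."""
    def __init__(self, state_dim=2, n_actions=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

predictive = QNetwork()
target = QNetwork()
target.load_state_dict(predictive.state_dict())   # target starts as a frozen copy

replay = deque(maxlen=18_000)                      # circular experience replay buffer
optimizer = torch.optim.Adam(predictive.parameters(), lr=4e-4)
gamma = 0.7

# A stored transition: (state, action index, reward, next state).
replay.append(([0.40, 0.02], 3, 0.8, [0.35, 0.03]))

def train_step(batch_size=32):
    """Sample a random mini-batch and regress Q(s, a) toward r + γ max_a' Q_target(s', a')."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s_next = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q_sa = predictive(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target_q = r + gamma * target(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Periodically copy the predictive weights into the frozen target network."""
    target.load_state_dict(predictive.state_dict())
```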

2) Deep Deterministic Policy Gradient: DDPG is another off-policy DRL algorithm, with continuous action and state spaces, proposed in [58]. It is the result of the combination of the deterministic policy gradient (DPG) and DQN algorithms, the former being related to the actor-critic algorithm [25], [59]. DQN avoids instability during the Q-function learning by employing a replay buffer and a target network. DQN has a discrete action space, while DDPG extends it to a continuous action space. The algorithm simultaneously learns a Q-function and a policy. Since DDPG inherits from the actor-critic algorithm, it is a combination of both policy (actor) and Q-value (critic) functions, where the actor takes actions according to a specific policy function, and the critic plays the role of an evaluator of the action taken [25]. Fig 7 presents a general view of how DDPG works. Here, the actor takes the state as input and outputs an action. The critic receives as input the state and the action from the actor, which are used to evaluate the actor's actions, and outputs Q-values corresponding to the set of possible actions. The Q-values output by the critic indicate to the agent how good the action taken by the actor for that specific state was.

DDPG consists of four networks: the actor prediction, critic prediction, actor target, and critic target networks. The target networks have their weights copied from the prediction networks periodically. As with DQN, this procedure is adopted in order to stabilize the learning process, moving the unstable problem of learning the action-value function to a stable supervised learning problem [58].

Similarly to DQN, DDPG uses an experience replay memory to minimize correlations between samples. Regarding the policy aspect of exploration and exploitation, DDPG differs from DQN. Since DDPG works in a continuous action space, exploring such a space constitutes a significant problem. However, as it is an off-policy algorithm, the exploration problem can be treated independently from the learning algorithm [58]. DDPG creates an exploration policy that adds a noise value to the actor policy to solve this issue. By default, the noise is added following the Ornstein-Uhlenbeck process [60].

Fig. 7. Deep Deterministic Policy Gradient (DDPG): the actor network maps the state to an action, and the critic network receives the state and the actor's action and outputs the corresponding Q-value.
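For intuition on the exploration noise mentioned above, here is a hedged Python sketch of an Ornstein-Uhlenbeck noise process added to a placeholder actor output; the θ, σ, and dt values are illustrative defaults, not the paper's settings, while the clipping to [0, 6] follows the continuous action interval defined later in Section IV.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise for continuous actions."""
    def __init__(self, action_dim=1, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(action_dim, mu, dtype=float)

    def sample(self):
        # dx = θ (μ − x) dt + σ √dt N(0, 1): a mean-reverting random walk.
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x += dx
        return self.x.copy()

noise = OUNoise(action_dim=1)

def exploratory_action(actor_output):
    """Add exploration noise to the actor's continuous action and keep it in [0, 6]."""
    return float(np.clip(actor_output + noise.sample()[0], 0.0, 6.0))

print(exploratory_action(actor_output=3.2))
```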
IV. APPLYING DRL TO THE CW OPTIMIZATION

In order to apply the DRL algorithms (DQN and DDPG) to optimize Wi-Fi networks, we propose a centralized approach to solving the CW optimization problem in this work. Our proposed approach consists of a centralized algorithm (i.e., the agent), which is a module running on the Wi-Fi access point (AP) that observes the state of the network (i.e., the environment) and chooses suitable CW values (i.e., the actions) to optimize the network's performance (i.e., the reward). Next, we present some details on the agent and its input and output values.

1) Agent: the agent represents the proposed DRL algorithms (i.e., DQN and DDPG). The agent is chosen to run on the AP since it has a general view of the whole network and can therefore control the stations associated with it through beacon frames in a centralized way. Therefore, as can be noticed, this is a centralized approach where the AP decides the best CW value that will be used across the network.

2) Current state (s): the current state of the environment is the status of all stations associated with the AP. So, it is impossible to get this information because of the nature of the problem. Therefore, we model the problem as a Partially Observable Markov Decision Process (POMDP) instead of an MDP. A POMDP assumes the environment's state cannot be perfectly observed [61].

3) Observation (o): the observation is the information acquired from the network, based either on the averaged normalized transmission queues' level of all associated stations or on the probability of collision proposed in [28]. We will employ and compare the performance attained by these two different types of information obtained from the network. Sections IV-A and IV-B explain how each one of these observation values is calculated.

4) Action (a): the action corresponds to the CW value. Since the CW value is directly connected to the network performance, longer back-off periods lead to longer
waiting times for retransmitting packets in case of collisions, degrading the spectrum usage and the network performance. The RL scheme brings the idea that for every action, there is a related maximized reward. Thus, applying RL concepts to optimize the CW value aiming to maximize the network's throughput is what this work proposes. Therefore, we use the CW value as the RL action. As we compare DRL algorithms with discrete and continuous action spaces, the actions are integer values between 0 and 6 in the discrete case and real values within the interval [0, 6] in the continuous case. The set of actions follows these specific values (i.e., 0 through 6) to define the back-off interval used by IEEE 802.11a for a station to retransmit its packet, establishing the limit condition of retransmission retries. Therefore, the CW value to be broadcast to the stations can be obtained through the application of (3). This interval is selected so that the action space is within the 802.11 standard's CW range, which ranges from 15 up to 1023. An action, a, taken by the agent in the state, s, makes the environment switch to its next state, s′, with a given transition probability, T(s′|s, a). Note that the action, a, is mapped into the CW value by using equation (3). When using (3), the CW interval, which is broadcast by the AP, is in the range of 15 to 1023, according to the 802.11 standard.

CW = ⌊2^{a+4}⌋ − 1.   (3)

5) Reward (r): the reward is defined as the normalized network throughput, which can be observed at the AP, by the agent, by taking the action, a, in the state, s. Therefore, the reward is a real value in the interval [0, 1]. This normalized metric is obtained by dividing the actual throughput by its expected maximum.

A. Averaged transmission queues' level

Since the metric characterizing the environment should provide the best possible understanding of the current network's status, we propose using the averaged normalized transmission queues' level of all associated stations, Q̄_NL, as the observation. This metric is adopted as the observation because it offers a more direct way of obtaining information about the overall network status. Next, we describe how it is calculated. Conversely, as will become clear next, the collision probability requires the AP to determine the number of transmitted and correctly received frames, which is, in turn, determined based on the number of transmitted or received acknowledgment frames.

The normalized level of the transmission queue of the i-th station, Q_NL^(i), is calculated according to (4), where (i) indicates the station number. The queue level, Q_l^(i), is normalized (i.e., divided) by the maximum queue size value of that station, Q_max^(i).

Q_NL^(i) = Q_l^(i) / Q_max^(i).   (4)

The measurement of Q_NL^(i) is carried out at predefined intervals at each station and indicates the result of the currently chosen CW value on the network's performance. For example, a value close to 1 indicates the queue is full, meaning the station cannot transmit packets as quickly as it receives them from the upper layers. On the other hand, if it is close to 0, the queue is almost empty, indicating the station can access the medium as frequently as necessary. A high Q_NL^(i) value indicates a high number of collisions. Conversely, a low value indicates a small number of collisions. The normalized level of the transmission queue of each station, Q_NL^(i), is concatenated (i.e., piggybacked) to data frames sent to the AP so that the agent has access to this information. At the AP, the agent normalizes the sum of the Q_NL^(i) coming from the stations by the total number of stations associated with the AP, N_stations, as shown in (5). This is the observation used by the agent to gather insights into the network's status.

Q̄_NL = (1 / N_stations) Σ_{i=1}^{N_stations} Q_NL^(i).   (5)

B. Collision Probability

Another metric characterizing the environment, proposed in [28], is the probability of collision, p_col, observed by the network. It can also be interpreted as the probability of transmission failure. This probability is calculated based on the number of transmitted, N_t, and correctly received, N_r, frames, as shown in (6).

p_col = (N_t − N_r) / N_t.   (6)

This collision rate approximates the actual probability of collision as the number of frames used to calculate it increases. Thus, this rate represents the probability of a frame not being received due to another station transmitting a frame at the same time. These probabilities are calculated within the interaction periods and provide information on the performance of the selected CW value.

C. Centralized DRL-based CW Optimization Method

The proposed method has three stages. The first is a pre-learning stage, where the legacy Wi-Fi contention-based mechanism manages the network. This stage is used to initialize the observation buffer with information that will be used to train the DRL algorithm being used (either DQN or DDPG). Next, in the learning stage, the agent chooses CW values (i.e., actions) according to what is shown in Algorithm 2.

The mean, µ, and variance, σ², of the history of recent observation values (either the averaged normalized queues' level or the collision probability) are calculated as a preprocessing step. A moving average with a window and stride of fixed sizes is used to calculate both statistics. This calculation renders the observation into a two-dimensional vector for each stride of the moving average. Therefore, the agent is trained based on this two-dimensional vector of observations, as illustrated in Fig 8.
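Before the preprocessing step of Fig 8, it may help to see equations (3)–(6) side by side in code. The sketch below is our own minimal illustration; the function and variable names are ours, not identifiers from the paper's implementation.

```python
import math

def action_to_cw(a):
    """Equation (3): map an agent action a in [0, 6] to a CW value in [15, 1023]."""
    return math.floor(2 ** (a + 4)) - 1

def normalized_queue_level(queue_len, queue_max):
    """Equation (4): per-station normalized transmission queue level in [0, 1]."""
    return queue_len / queue_max

def averaged_queue_level(levels):
    """Equation (5): average of the normalized queue levels reported by all stations."""
    return sum(levels) / len(levels)

def collision_probability(n_transmitted, n_received):
    """Equation (6): fraction of transmitted frames that were not correctly received."""
    return (n_transmitted - n_received) / n_transmitted

print(action_to_cw(0), action_to_cw(6))          # 15 1023
print(averaged_queue_level([0.2, 0.9, 0.5]))     # observation from queue levels
print(collision_probability(1000, 870))          # observation from collisions
```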
Fig. 8. Bi-dimensional observation vector used in the preprocessing phase: a history of observation values is condensed into (µ, σ²) pairs by the moving average.
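The moving-average preprocessing illustrated in Fig 8 can be sketched as follows. The code structure is our own illustration; the history size of 300 with a half-size window and quarter-size stride follows Table II and the text later in this section.

```python
import numpy as np

def preprocess(history, window, stride):
    """Slide a window over the observation history and emit (mean, variance) pairs."""
    history = np.asarray(history, dtype=float)
    pairs = []
    for start in range(0, len(history) - window + 1, stride):
        chunk = history[start:start + window]
        pairs.append((chunk.mean(), chunk.var()))
    return np.array(pairs)          # shape: (n_strides, 2), the 2-D observation vector

obs_history = np.random.rand(300)   # e.g., 300 recent averaged queue-level samples
observation = preprocess(obs_history, window=150, stride=75)
print(observation.shape)
```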

Exploration of the environment is enabled by adding a noisy factor to each action the agent takes. This noisy factor decreases throughout the learning stage. The addition of noise is different for each of the two considered DRL algorithms. When DQN is used, the noisy factor corresponds to the probability of taking a random action instead of an action predicted by the agent. In DDPG's case, the noisy factor comes from a Gaussian-distributed random variable and is added directly to the action taken by the agent. As mentioned, this is done to find a trade-off between exploring the environment and exploiting the acquired knowledge, which chooses the action that maximizes future rewards.

The last stage is called the operational stage. This stage starts when the training is over. The user defines the training stage's period by setting the variable trainingPeriod. At this stage, the noisy factor is set to zero, so the agent always chooses the action it learned to maximize the reward. At this stage, as the agent has already been trained, it does not receive any additional updates to its policy, so rewards are unnecessary.

Finally, it is essential to mention that the hyperparameters of the DRL models (i.e., learning rate, reward discount rate, batch size, epsilon decay) need to be fine-tuned for the agent to achieve its optimal performance. Lastly, as both DQN and DDPG employ a replay memory (called the experience replay buffer, E, in this work), a size limit has to be configured for this memory. The replay memory stores past interactions of the agent with the environment, i.e., it records the current state, the action taken at that state, the reward received in that state, and the next state resulting from the action taken. When its limit is reached, the oldest record is overwritten by a new one (i.e., it is implemented as a circular buffer).

A brief description of the pseudo-code shown in Algorithm 2 is provided next. The initialization phase goes from lines 1 to 13. It initiates the input parameters such as the observation and replay buffers, agent weights, initial state, noisy factor, CW value, and the variables used to select the current stage and the type of observation employed.

Then, the pre-learning stage (lines 15 to 24) starts by filling in the observation buffer with the selected type of observation metric (either the averaged normalized queues' level or the collision probability). The flag useQueueLevelFlag sets the type of observation value to be calculated. In this stage, the observation value (either the averaged normalized transmission queues' level or the network's collision probability) results from the application of the legacy Wi-Fi contention-based mechanism. After envStepTime, which is the period between interactions of the proposed algorithm with the environment, has elapsed, the algorithm selects the CW value, and then the observation value results from the application of the DRL-based CW optimization algorithm. Therefore, one should note that the instructions in lines 15 to 24 are shared between the pre-learning and learning stages.

Next, from lines 26 to 29, the information in the observation buffer is pre-processed by applying a moving average operation to the data, which results in a two-dimensional vector, observation, with the mean, µ, and variance, σ², of the data in the buffer. With the pre-processed data, the action function, A_θ, returns the action, a. The action is then used to determine a new CW value that will be broadcast to all associated stations. The instructions in lines 26 to 29 are shared between the learning and operational stages. The difference between these two stages is how the action, a, is selected. In the learning stage, the trainingFlag is equal to True, which tells the action function, A_θ, to choose actions following the exploration-exploitation approach described previously. In the operational stage, the trainingFlag is equal to False, which tells the action function, A_θ, to choose actions optimizing the network's performance. In the learning stage, a noisy factor is used to explore the environment, which does not happen in the operational stage.

After that, from lines 31 to 38, the algorithm is solely in the learning stage. In this section of the algorithm, the DRL agent learns from previous experiences; that is, it has its weights, θ, updated by using mini-batches of samples randomly picked from the replay buffer, E. These mini-batches are composed of the current µ, σ², and action, a, values, the reward (i.e., the normalized throughput), r, and the previous values of µ and σ².

Finally, in line 42, the algorithm checks if the training period has elapsed. If so, the trainingFlag is set to False, and the algorithm enters the operational stage. In this stage, the noisy factor is disabled, and the algorithm only exploits the acquired knowledge.

D. Experimentation Scenario

The proposed centralized DRL-based CW optimization solution is implemented on NS3-gym [30], which runs on top of the NS-3 simulator [29]. NS3-gym enables the communication between NS-3 (C++) and the OpenAI Gym framework (Python) [62]. NS-3 is a network simulator based on discrete events mainly intended for academic research. It contains the implementation of several wired and wireless network standards [29]. In this work, we use version NS-3.29 of the NS-3 simulator. The DRL algorithms used here were implemented with TensorFlow and PyTorch.
Algorithm 2 DRL-based CW Optimization


H1 H2
▷ ### Initialization ###
1: Initialize the observation buffer, O, with zeroes
2: Initialize the weights, θ, of the agent IN 1 H1 H2
3: Get the action function, Aθ , which the agent uses to choose the LSTM
action according to the current stage OUT
4: Initialize the algorithm’s interaction period with the environment, IN 2 H1 H2
envStepT ime . .
5: Initialize the number episodes, Nepisodes . .
. .
6: Initialize the number of steps per episode, Nspe
7: Initialize the training stage period, trainingP eriod Hn Hn
8: Set trainingFlag ← T rue to tell the algorithm is in the training
stage Input Layer LSTM Layer Fully Connected Layers Output Layer
9: Initialize the experience replay buffer, E, with zeroes.
10: trainingStartTime ← currentTime Fig. 9. Architecture of the Deep Learning Network.
11: lastUpdate ← currentTime
12: µprev ← 0 (previous mean value)
2
13: σprev ← 0 (previous variance value) The system model considered in this work is depicted in Fig
14: Set useQueueLevelF lag ← T rue to use the averaged normal-
10. We consider a linear topology comprised of one AP and
ized transmission queues’ level as observation.
15: CW ← 15 several stations transmitting packets. The AP plays the role of
the DRL agent, selecting a new CW value according to the
16: for e = 1, ..., Nepisodes do current observation. The stations send data packets to the AP,
17: Reset and run the environment, i.e., reset and run the NS-3 and the deployment of the stations can happen statically or
simulator
dynamically. So, two scenarios are considered, one with static
18: for t = 1, ..., Nspe do
topology and the other with dynamic topology.
▷ ### Pre-learning stage ### Table I presents the NS-3 parameters necessary to create the
19: if useQueueLevelF lag == T rue then environment in which the agent will learn. Apart from those
(i)
20: Get QN L received from each associated station parameters, we assume single-user transmissions, a packet
21: Calculate Q̄N L
22: observation ← Q̄N L
load adjusted to saturate the network with constant bit rate
23: else (CBR) UDP traffic of 150 Mbps, instant and faultless trans-
24: Nt ← get number of transmitted frames ference of network information, e.g., the normalized queue
25: Nr ← get number of received frames (i)
level of each station, QN L , or the number of transmitted
26: observation ← NtN−N
t
r
packets, Nt , of each station, to the DRL agent, and that each
27: end if
28: O.append(observation) station receives the selected CW value instantly. The last two
assumptions allow the assessment of the proposed solution
29: if currentT ime ≥ lastU pdate + envStepT ime then in an idealized scenario before going to more realistic ones.
(i)
Realistic scenarios will require the transmission of QN L or
▷ ### Learning and operational stages ### Nt from each station to the AP and the periodic broadcast
30: µ, σ 2 ← preprocess(O)
31: a ← Aθ (µ, σ 2 , trainingFlag) of the chosen CW to all stations through beacon frames.
32: CW ← 2a+4 − 1 The only differences we foresee in the results presented here
33: Broadcast the new CW value to all associated stations are a slower convergence of the agent and a slightly smaller
throughput due to the transmission of the required overhead,
34: if trainingF lag == T rue then (i)
i.e., QN L or Nt and CW.
35: NRP ← get the number of received packets.
36:
NRP
tput ← envStepT Table II presents the agent parameters used in NS3-gym
ime
37: r ← normalize(tput) for the experiments. These parameters were empirically found
2 2
38: E.append((µ, σ , a, r, µprev , σprev )) through several simulations. The network architecture used
39: µprev ← µ by both DRL algorithms has one recurrent long short-term
2
40: σprev ← σ2
memory (LSTM) layer and two fully connected layers leading
41: mb ← get random mini-batch from E
42: Update θ based on mb to an 8 × 128 × 64 topology. An illustration of the network
43: end if architecture is shown in Fig 9
The simulation experiment was executed for 15 episodes
44: lastUpdate ← currentTime of 60 seconds. This configuration was also found by employ-
45: end if
ing cross-validation strategies. The recurrent long short-term
▷ ### Makes the transition between learning and operational memory layer allows the DRL algorithms to consider past
stages ### observations when predicting the best action in a given state.
46: if currentT ime ≥ trainingStartT ime + The moving average window is half the size of the observation
trainingP eriod then history memory, and the stride is a fourth of its size.
47: trainingF lag ← F alse
48: end if Each one of the experiments consisted of 15 executions
49: end for of 60-second long simulations, where the first 14 executions
50: end for were part of the training stage (thus, the training stage period,
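To make the per-interval processing of Algorithm 2 concrete, the sketch below implements its core steps in Python. The sliding-window form of preprocess() is our reading of the description above (a window equal to half the observation history memory and a stride equal to a fourth of it); the helper names are illustrative, not the authors' exact implementation.

import numpy as np
from collections import deque

HISTORY = 300            # observation history memory size (Table II)
WINDOW = HISTORY // 2    # moving-average window: half the history size
STRIDE = HISTORY // 4    # stride: a fourth of the history size

observations = deque(maxlen=HISTORY)   # the buffer O of Algorithm 2

def preprocess(history):
    # Step 30: summarize the observation history as windowed means and
    # variances, which are then fed to the agent's LSTM-based network.
    h = np.asarray(history, dtype=np.float32)
    if h.size == 0:
        return np.zeros(1), np.zeros(1)
    if h.size < WINDOW:
        return np.array([h.mean()]), np.array([h.var()])
    starts = range(0, h.size - WINDOW + 1, STRIDE)
    windows = np.stack([h[s:s + WINDOW] for s in starts])
    return windows.mean(axis=1), windows.var(axis=1)

def action_to_cw(a):
    # Step 32: map the agent's action a to a contention window value.
    return 2 ** (a + 4) - 1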
TABLE I
NS-3 ENVIRONMENT CONFIGURATION PARAMETERS

Configuration Parameter       Value
Wi-Fi standard                IEEE 802.11ax
Number of APs                 1
Number of static stations     5, 15, 30, or 50
Number of dynamic stations    increases steadily from 5 to 50
Frame aggregation             disabled
Packet size                   1500 bytes
Max queue size                100 packets
Frequency                     5 GHz
Channel bandwidth             20 MHz
Traffic                       constant bit-rate UDP of 150 Mbps
MCS                           HeMcs (1024-QAM with a 5/6 coding rate)
Guard interval                800 [ns]
Propagation delay model       ConstantSpeedPropagationDelayModel
Propagation loss model        MatrixPropagationLossModel
Simulation time               60 [s]
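For reference, a scenario following Table I could be handed to the simulator through NS3-gym's simArgs mechanism, as in the sketch below. The constructor mirrors the examples distributed with ns3-gym [30], but the command-line flag names are hypothetical and depend on the NS-3 scenario script actually used.

from ns3gym import ns3env

# Hypothetical scenario flags; the real names depend on the NS-3 script.
sim_args = {
    "--simTime": 60,          # seconds per episode (Table I)
    "--nWifi": 30,            # number of stations (5, 15, 30, or 50)
    "--channelWidth": 20,     # MHz
    "--dataRate": "150Mbps",  # CBR UDP offered load
}

env = ns3env.Ns3Env(port=5555, stepTime=0.01, startSim=True,
                    simSeed=0, simArgs=sim_args, debug=False)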

TABLE II
NS3-GYM AGENT CONFIGURATION PARAMETERS

Configuration Parameter                     Value
DQN's learning rate                         4 × 10^-4
DDPG's actor learning rate                  4 × 10^-4
DDPG's critic learning rate                 4 × 10^-3
Reward discount rate                        0.7
Batch size                                  32
Replay memory size                          18,000
Size of observation history memory          300
trainingPeriod                              840 [s]
envStepTime (i.e., interaction interval)    10 [ms]
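As a companion to Table II and to the architecture described earlier (one LSTM layer followed by two fully connected layers), the PyTorch sketch below instantiates one plausible reading of the 8 × 128 × 64 topology. Treating 8 as the LSTM hidden size, using two input features per time step (the mean and variance produced by preprocess()), exposing seven discrete outputs for DQN, and choosing the Adam optimizer are all assumptions, not details given in the paper.

import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    # One recurrent LSTM layer followed by two fully connected layers and
    # an output layer, as sketched in Fig. 9.
    def __init__(self, n_inputs=2, lstm_hidden=8, fc1=128, fc2=64, n_outputs=7):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, lstm_hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(lstm_hidden, fc1), nn.ReLU(),
            nn.Linear(fc1, fc2), nn.ReLU(),
            nn.Linear(fc2, n_outputs),
        )

    def forward(self, x):
        # x has shape (batch, sequence length, n_inputs)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # prediction from the last time step

net = AgentNetwork()
optimizer = torch.optim.Adam(net.parameters(), lr=4e-4)  # DQN learning rate (Table II)
GAMMA, BATCH_SIZE, REPLAY_SIZE = 0.7, 32, 18_000         # remaining Table II values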

Fig. 10. Scenario for assessing the proposed centralized DRL-based CW optimization method.

V. SIMULATION RESULTS

This section presents and discusses the results obtained during the experiments. The performance of the proposed DRL algorithms is compared against the BEB algorithm, which is used in 802.11 wireless networks. Simulations were executed on the NS-3 and NS3-gym simulators, considering static and dynamic scenarios. The graphical results allow a better evaluation of the network efficiency achieved by the proposed centralized DRL-based CW optimization method in both scenarios. Next, we separate the results and discussion into two sets, static and dynamic. Our study considers and compares the averaged normalized transmission queues' level of all stations and the probability of collision as two different types of observation metrics to train the DRL-based CW optimization agent. The simulation results presented in this section demonstrate the performance achieved by the DRL-based solution employing either one of the observations in the previously mentioned scenarios.

A. Metrics and their connection with the reward function

In this work, we assess two metrics: the network's overall throughput and the CW value. The network's overall throughput measures how much data is successfully transmitted over the network during a given period. The CW value is a random parameter that determines how long a station must wait before it can transmit data. Therefore, one can see that a longer CW value means that stations must wait longer before transmitting, which can lead to lower throughput. However, a longer CW value also means a lower chance of collisions. On the other hand, the smaller the CW value, the shorter the waiting time and, consequently, the higher the chance of collisions, which is detrimental to the network's overall throughput.

The DRL agent employs the CW values as actions and the average of the normalized level of the transmission queues of the stations or the collision rate as observations. Its reward function is the normalized network's overall throughput. The agent's objective is maximizing its received reward over time based on observations of the environment. Therefore, the reward's connection with the throughput is straightforward, i.e., the higher the reward, the higher the network's overall throughput.
The reward's connection with the CW value is also direct if we understand that the longer the CW value, the lower the throughput. This happens because a station cannot transmit during the CW wait time, penalizing its throughput, since the throughput is calculated over a given period. On the other hand, the higher the CW value, the lower the chance of collisions, and vice versa. Therefore, the DRL agent aims to find a CW value that maximizes the network's overall throughput while minimizing frame collisions, i.e., a trade-off between throughput and collisions.
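To illustrate how the observations and the reward discussed in this subsection can be obtained from per-interval counters, a minimal sketch follows. The helper names and the choice of normalizing the throughput by the 150 Mbps offered load of 1500-byte packets are assumptions, since the paper only states that the reward is the normalized overall throughput.

def avg_queue_level(queue_levels):
    # Averaged normalized transmission queues' level reported by the stations.
    return sum(queue_levels) / len(queue_levels)

def collision_probability(n_tx, n_rx):
    # Collision rate estimated from transmitted and received frame counts.
    return (n_tx - n_rx) / n_tx if n_tx > 0 else 0.0

def reward_from_throughput(received_packets, step_time=0.01,
                           max_rate_pkts=150e6 / (1500 * 8)):
    # tput = NRP / envStepTime (packets per second), then normalized;
    # the reference rate used for normalization is an assumption.
    tput = received_packets / step_time
    return min(tput / max_rate_pkts, 1.0)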

B. Static scenario

In this scenario, the number of stations associated with the AP is kept constant throughout the experiment. As it is a static scenario, the optimal CW value should be constant, and so should the throughput. Therefore, this scenario is used to prove this hypothesis and to evaluate possible improvements over 802.11's BEB algorithm.

Fig. 11 shows the throughput achieved by the network for different numbers of stations. As can be seen, the network throughput decreases as the number of stations increases when BEB is employed. On the other hand, when either DQN or DDPG is used with either one of the observation metrics, it remains practically constant as the number of stations increases, proving the hypothesis. In this scenario, the graph shows that the throughput values achieved by our DRL-based solution with four different numbers of stations (5, 15, 30, and 50) are very similar when using either one of the metrics. This is indicated by the matching points on the graph, demonstrating a close value of throughput. The improvement over BEB varies from 5.19% for 5 stations to 48.5% for 50 stations. As can be seen, DDPG has a slightly better performance than DQN (with either one of the observation metrics), which can be explained by its capability to choose any real value within the [0, 6] action range. In the static scenario, the performance achieved by the solution employing either one of the two observation metrics shows a minimal difference. When compared, there is no significant variance in their behavior.

Fig. 11. Comparison of the network throughput for the static scenario.

Fig. 12 shows the mean CW value and its variance for 15 simulation episodes when DQN or DDPG is used with either one of the two observation metrics considered. This experiment considers 30 stations in the static scenario. The mean CW is the arithmetic mean of all CW values selected during one episode and serves as an indicator of how well the training agents can adjust and select the CW values toward a more stable one. It shows that as the agent learns, it exploits more of the acquired/learned knowledge than explores the environment with random actions. Therefore, as the DRL agent learns, the number of random actions decreases, decreasing the variance of the CW values. It is pretty clear that the 14 episodes selected for the learning stage are enough for the proposed solution to converge to an optimal and practically constant mean CW value for both DRL algorithms, regardless of the observation metric considered. We can see that the mean value of all considered options (DQN/DDPG and Average Tx Queue level/Pcol) stabilizes around the same value after the 10th episode. It is also possible to see that the CW variance of all options decreases along the learning stage, meaning that, initially, the algorithm explores the environment more (i.e., it takes uncharted actions). Then, as the number of episodes progresses, it exploits more of the acquired knowledge, which maximizes the received reward. Finally, the variance during the final episodes of the learning stage is small because the proposed algorithm correctly selected the CW value, which maximizes the throughput. In the final episodes, where the algorithm exploits more than explores the environment, DDPG presents a smaller variance than DQN, meaning it is more assertive in choosing the ideal CW value.

Fig. 12. Mean CW value for 30 stations in the static scenario.

C. Dynamic scenario

In the dynamic scenario, the number of stations increases progressively throughout the experiment, going from 5 to 50. The higher the number of stations, the higher the collision probability. This experiment assessed whether the DRL algorithms appropriately act upon network changes.
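The contrast between DQN's discrete choices and DDPG's continuous ones, which shapes the CW curves discussed next, follows directly from the mapping CW = 2^(a+4) − 1 of Algorithm 2. The small sketch below illustrates it; treating DQN's actions as the integers 0 to 6 and rounding DDPG's resulting CW to an integer are assumptions.

import numpy as np

def action_to_cw(a):
    # CW = 2^(a+4) - 1, as in step 32 of Algorithm 2.
    return 2 ** (a + 4) - 1

# DQN: discrete actions, assumed here to be the integers 0..6.
dqn_cw_values = [action_to_cw(a) for a in range(7)]
# -> [15, 31, 63, 127, 255, 511, 1023]

# DDPG: a continuous action anywhere in [0, 6].
a = float(np.clip(3.37, 0.0, 6.0))
ddpg_cw = round(action_to_cw(a))   # -> 164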
Fig. 13. Selected CW value for different numbers of stations in a dynamic scenario.

Fig. 14. Comparison of the instantaneous network throughput as the number of stations increases from 5 to 50.

Fig. 13 depicts the chosen CW value for the dynamic scenario, where the number of stations progressively increases from 5 to 50. As shown, the chosen CW value increases as the number of stations increases, meaning the back-off interval has to be increased to accommodate the transmissions of the higher number of stations, mitigating the number of collisions. As can be noticed, DQN jumps between discrete neighboring CW values. At the same time, DDPG continuously increases the CW value, reaching a lower CW value for 50 stations independently of the adopted observation metric, which positively reflects on the attained throughput. When comparing DQN for both observation metrics, it is possible to see a similarity in selecting the adequate CW value. Furthermore, DDPG with the averaged normalized transmission queues' level as observation presents higher CW values than when the probability of collision is used. This behavior, which is supported by the results in Fig. 14, can be partially explained by the fact that the normalized transmission queues' level of all associated stations offers more information than just the collision probability. This metric takes into account the congestion level of the network and subsequently helps the DRL-based solution select a CW value that allows the network to achieve higher throughput. This insight suggests that the use of this metric in the RL solution for collision avoidance could result in improved network performance.

Fig. 14 compares the instantaneous network throughput in the dynamic scenario when the number of stations grows from 5 to 50. The increased number of stations alters the CW value, impacting the instantaneous network throughput. When the number of stations associated with the AP reaches 50, the throughput of BEB drops to approximately 26 Mbps, well below that presented by DQN and DDPG. The proposed DRL solution (with either DQN or DDPG) presents an almost constant behavior, keeping a high and stable throughput as the number of stations progressively increases. Both DRL algorithms present an approximately constant throughput. Comparing the two observation metrics reveals a difference in the network throughput. Using the collision probability as the observation, the resulting throughput is approximately 37.5 Mbps, presenting an increase of 43.68% over BEB, whereas using the averaged normalized transmission queues' level yields a throughput of up to 37.98 Mbps with an increase of 45.52% over BEB. This difference suggests that the averaged normalized transmission queues' level is a superior observation metric for dynamic scenarios. However, it is essential to note that both observation metrics achieve significantly better throughput values than the standard BEB algorithm. This finding underscores the potential value of utilizing RL solutions for collision avoidance in Wi-Fi networks.

Fig. 15 confirms that both DRL algorithms significantly enhance the network's throughput in comparison to the BEB algorithm in dynamic scenarios. The degree of improvement varies from 6.84% (using either observation metric) for 5 stations to 25.56% for the collision probability metric when 50 stations are in use. Notably, when the averaged normalized transmission queues' level is employed for 50 stations, the throughput is enhanced by as much as 28.43%, thereby indicating the superior performance of this observation metric over the collision probability in dynamic scenarios with 50 stations. These findings demonstrate once more the potential value of using DRL algorithms to address collision avoidance challenges in dense Wi-Fi networks.

Fig. 15. Comparison of the network throughput for the dynamic scenario.

VI. CONCLUSIONS

Wireless network transmissions are prone to various impairments (e.g., interference, path loss, channel noise, etc.) that lead to packet loss and collisions, making re-transmission and channel access mechanisms necessary. Furthermore, in environments with a dense number of stations, more collisions will occur while the stations attempt to access the wireless channel. Consequently, the network efficiency and channel utilization will both degrade. This work proposes a centralized solution that employs DRL algorithms (i.e., DQN and DDPG) to optimize the CW parameter from the MAC layer. Regarding the number of stations associated with the AP, two experimental scenarios are considered for assessing the proposed centralized DRL-based CW optimization solution: static and dynamic. Simulation results show that the proposed solution outperforms the 802.11 default BEB algorithm by maintaining a stable throughput while reducing collisions. Moreover, the results attest to DQN's and DDPG's superior performance compared to BEB for both scenarios, regardless of the observation metric used and the number of stations associated with the AP. Our results show that the difference increases as the number of stations increases, with DQN and DDPG showing a 45.52% increase in throughput with 50 stations compared to BEB.
Additionally, the results show that DQN and DDPG had similar performances, which means that either could be used in a solution deployed in Wi-Fi APs. However, since DQN presents a lower computational complexity and a shorter training period, it would be the preferable choice. Furthermore, the presented results show that the network's performance can be dramatically improved when CW is chosen based on information from the network, such as the level of the transmission queues or the network's probability of collision.

Regarding the two considered observation metrics used for capturing the network's status, our simulation results show that in the static scenario, they confer similar performance when applied to the DRL-based algorithms. However, in the dynamic case, the averaged normalized transmission queues' level grants the algorithms higher throughput when compared to the observation based on the collision probability. These findings suggest that the averaged normalized transmission queues' level provides more information to the DRL agent than the collision probability. Moreover, the centralized DRL-based solution for selecting CW values outperformed the classical 802.11 BEB algorithm, regardless of which observation metric was used. Therefore, it can be concluded that either observation metric is suitable for obtaining the network's status and improving the CW selection, which ultimately leads to better network utilization and higher throughput.

Future work could use other ML algorithms to optimize CW, such as Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO). These two algorithms make an interesting topic of investigation because they both use the advantage instead of the Q-value as the operator. The difference between them is that SAC is off-policy, and PPO is on-policy. In addition, a study that included these methods could be interesting because DQN and DDPG are both off-policy and use a Q-value operator, which would contrast with SAC and PPO, potentially providing valuable insights into the effect of these different operators on the final result. Moreover, another research direction would be the assessment of decentralized DRL-based solutions.

REFERENCES

[1] X. Guo, S. Wang, H. Zhou, J. Xu, Y. Ling, and J. Cui, "Performance evaluation of the networks with wi-fi based tdma coexisting with csma/ca," Wireless Personal Communications, vol. 114, no. 2, pp. 1763–1783, 2020. doi: 10.1007/s11277-020-07447-3
[2] W. Auzinger, K. Obelovska, I. Dronyuk, K. Pelekh, and R. Stolyarchuk, "A continuous model for states in csma/ca-based wireless local networks derived from state transition diagrams," in Proceedings of International Conference on Data Science and Applications. Springer, 2022. doi: 10.1007/978-981-16-5348-3_45 pp. 571–579.
[3] G. Wang and Y. Qin, "Mac protocols for wireless mesh networks with multi-beam antennas: A survey," in Future of Information and Communication Conference. Springer, 2019. doi: 10.1007/978-3-030-12388-8_9 pp. 117–142.
[4] Z. Beheshtifard and M. R. Meybodi, "An adaptive channel assignment in wireless mesh network: the learning automata approach," Computers & Electrical Engineering, vol. 72, pp. 79–91, 2018. doi: 10.1016/j.compeleceng.2018.09.004
[5] S.-C. Wang and A. Helmy, "Performance limits and analysis of contention-based ieee 802.11 mac," in Proceedings. 2006 31st IEEE Conference on Local Computer Networks. IEEE, 2006. doi: 10.1109/LCN.2006.322129 pp. 418–425.
[6] M. Yazid, N. Sahki, L. Bouallouche-Medjkoune, and D. Aïssani, "Modeling and performance study of the packet fragmentation in an ieee 802.11e-edca network over fading channel," Multimedia Tools and Applications, vol. 74, no. 21, pp. 9507–9527, 2015. doi: 10.1007/s11042-014-2131-y
[7] P. Patel and D. K. Lobiyal, "A simple but effective collision and error aware adaptive back-off mechanism to improve the performance of ieee 802.11 dcf in error-prone environment," Wireless Personal Communications, vol. 83, pp. 1477–1518, 2015. doi: 10.1007/s11277-015-2460-9
[8] ——, "An adaptive contention slot selection mechanism for improving the performance of ieee 802.11 dcf," International Journal of Information and Communication Technology, vol. 10, no. 3, pp. 318–349, 2017. doi: 10.1504/IJICT.2017.083272
[9] R. A. da Silva and M. Nogueira, "Mac protocols for ieee 802.11ax: Avoiding collisions on dense networks," arXiv preprint arXiv:1611.06609, 2016. doi: 10.48550/arXiv.1611.06609
[10] Y. Edalat and K. Obraczka, "Dynamically tuning ieee 802.11's contention window using machine learning," in Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, 2019. doi: 10.1145/3345768.3355920 pp. 19–26.
[11] E. Bjornson and P. Giselsson, "Two applications of deep learning in the physical layer of communication systems [lecture notes]," IEEE Signal Processing Magazine, vol. 37, no. 5, pp. 134–140, 2020. doi: 10.1109/MSP.2020.2996545
[12] H. Anouar and C. Bonnet, "Optimal constant-window backoff scheme for ieee 802.11 dcf in single-hop wireless networks under finite load conditions," Wireless Personal Communications, vol. 43, no. 4, pp. 1583–1602, 2007. doi: 10.1145/1164717.1164765
[13] M. A. Bender, J. T. Fineman, S. Gilbert, and M. Young, "How to scale exponential backoff: Constant throughput, polylog access attempts, and robustness," in Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2016. doi: 10.1137/1.9781611974331.ch47 pp. 636–654.
[14] H. Al-Ammal, L. A. Goldberg, and P. MacKenzie, "Binary exponential backoff is stable for high arrival rates," in Annual Symposium on Theoretical Aspects of Computer Science. Springer, 2000. doi: 10.1007/3-540-46541-3_14 pp. 169–180.
[15] M. Kulin, T. Kazaz, E. De Poorter, and I. Moerman, "A survey on machine learning-based performance improvement of wireless networks: Phy, mac and network layer," Electronics, vol. 10, no. 3, p. 318, 2021. doi: 10.3390/electronics10030318
[16] C. Silvano, D. Ielmini, F. Ferrandi, L. Fiorin, S. Curzel, L. Benini, F. Conti, A. Garofalo, C. Zambelli, E. Calore et al., "A survey on deep learning hardware accelerators for heterogeneous hpc platforms," arXiv preprint arXiv:2306.15552, 2023. doi: 10.48550/arXiv.2306.15552
[17] Q. Cai, C. Cui, Y. Xiong, W. Wang, Z. Xie, and M. Zhang, "A survey on deep reinforcement learning for data processing and analytics," IEEE Transactions on Knowledge & Data Engineering, vol. 35, no. 05, pp. 4446–4465, May 2023. doi: 10.1109/TKDE.2022.3155196
[18] G. Aridor, Y. Mansour, A. Slivkins, and Z. S. Wu, "Competing bandits: The perils of exploration under competition," arXiv preprint arXiv:2007.10144, 2020. doi: 10.48550/arXiv.2007.10144
[19] X. Lu, B. V. Roy, V. Dwaracherla, M. Ibrahimi, I. Osband, and Z. Wen, "Reinforcement learning, bit by bit," Foundations and Trends® in Machine Learning, vol. 16, no. 6, pp. 733–865, 2023. doi: 10.1561/2200000097
[20] S. Shekhar, A. Bansode, and A. Salim, "A comparative study of hyper-parameter optimization tools," in 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE). Los Alamitos, CA, USA: IEEE Computer Society, Dec 2021. doi: 10.1109/CSDE53843.2021.9718485 pp. 1–6.
[21] P. I. Frazier, "A tutorial on bayesian optimization," arXiv preprint arXiv:1807.02811, 2018. doi: 10.48550/arXiv.1807.02811
[22] A. Karras, C. Karras, N. Schizas, M. Avlonitis, and S. Sioutas, "Automl with bayesian optimizations for big data management," Information, vol. 14, no. 4, p. 223, 2023. doi: 10.3390/info14040223
[23] P. Beneventano, P. Cheridito, R. Graeber, A. Jentzen, and B. Kuckuck, "Deep neural network approximation theory for high-dimensional functions," arXiv preprint arXiv:2112.14523, 2021. doi: 10.48550/arXiv.2112.14523
[24] J. Zhu, F. Wu, and J. Zhao, "An overview of the action space for deep reinforcement learning," in Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence, 2021. doi: 10.1145/3508546.3508598 pp. 1–10.
[25] V. Konda and J. Tsitsiklis, "Actor-critic algorithms," Advances in neural information processing systems, vol. 12, 1999.
[26] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.
[27] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015. doi: 10.48550/arXiv.1509.02971
[28] W. Wydmański and S. Szott, "Contention window optimization in ieee 802.11ax networks with deep reinforcement learning," in 2021 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2021. doi: 10.1109/WCNC49053.2021.9417575 pp. 1–6.
[29] G. F. Riley and T. R. Henderson, "The ns-3 network simulator," in Modeling and tools for network simulation. Springer, 2010, pp. 15–34.
[30] P. Gawłowicz and A. Zubow, "Ns-3 meets openai gym: The playground for machine learning in networking research," in Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, 2019. doi: 10.1145/3345768.3355908 pp. 113–120.
[31] A. H. Y. Abyaneh, M. Hirzallah, and M. Krunz, "Intelligent-cw: Ai-based framework for controlling contention window in wlans," in 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN). IEEE, 2019. doi: 10.1109/DySPAN.2019.8935851 pp. 1–10.
[32] Y. Xiao, M. Hirzallah, and M. Krunz, "Distributed resource allocation for network slicing over licensed and unlicensed bands," IEEE Journal on Selected Areas in Communications, vol. 36, no. 10, pp. 2260–2274, 2018. doi: 10.1109/JSAC.2018.2869964
[33] A. Kumar, G. Verma, C. Rao, A. Swami, and S. Segarra, "Adaptive contention window design using deep q-learning," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021. doi: 10.1109/ICASSP39728.2021.9414805 pp. 4950–4954.
[34] X. Fu and E. Modiano, "Learning-num: Network utility maximization with unknown utility functions and queueing delay," in Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 2021. doi: 10.48550/arXiv.2012.09222 pp. 21–30.
[35] I. A. Qureshi and S. Asghar, "A genetic fuzzy contention window optimization approach for ieee 802.11 wlans," Wireless Networks, vol. 27, no. 4, pp. 2323–2336, 2021. doi: 10.1007/s11276-021-02572-8
[36] T. K. Saini and S. C. Sharma, "Prominent unicast routing protocols for mobile ad hoc networks: Criterion, classification, and key attributes," Ad Hoc Networks, vol. 89, pp. 58–77, 2019. doi: 10.1016/j.adhoc.2019.03.001
[37] S. Giannoulis, C. Donato, R. Mennes, F. A. de Figueiredo, I. Jabandžic, Y. De Bock, M. Camelo, J. Struye, P. Maddala, M. Mehari et al., "Dynamic and collaborative spectrum sharing: The scatter approach," in 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN). IEEE, 2019. doi: 10.1109/DySPAN.2019.8935774 pp. 1–6.
[38] N. Zerguine, M. Mostefai, Z. Aliouat, and Y. Slimani, "Intelligent cw selection mechanism based on q-learning (misq)," Ingénierie des Systèmes d'Inf., vol. 25, no. 6, pp. 803–811, 2020. doi: 10.18280/isi.250610
[39] C.-H. Ke and L. Astuti, "Applying deep reinforcement learning to improve throughput and reduce collision rate in ieee 802.11 networks," KSII Transactions on Internet and Information Systems (TIIS), vol. 16, no. 1, pp. 334–349, 2022. doi: 10.3837/tiis.2022.01.019
[40] M. A. Jadoon, A. Pastore, M. Navarro, and F. Perez-Cruz, "Deep reinforcement learning for random access in machine-type communication," in 2022 IEEE Wireless Communications and Networking Conference (WCNC), 2022. doi: 10.1109/WCNC51071.2022.9771953 pp. 2553–2558.
[41] R. Mennes, F. A. P. De Figueiredo, and S. Latré, "Multi-agent deep learning for multi-channel access in slotted wireless networks," IEEE Access, vol. 8, pp. 95032–95045, 2020. doi: 10.1109/ACCESS.2020.2995456
[42] R. Mennes, M. Claeys, F. A. P. De Figueiredo, I. Jabandžić, I. Moerman, and S. Latré, "Deep learning-based spectrum prediction collision avoidance for hybrid wireless environments," IEEE Access, vol. 7, pp. 45818–45830, 2019. doi: 10.1109/ACCESS.2019.2909398
[43] M. Ahmed Ouameur, L. D. T. Anh, D. Massicotte, G. Jeon, and F. A. P. de Figueiredo, "Adversarial bandit approach for ris-aided ofdm communication," EURASIP Journal on Wireless Communications and Networking, vol. 2022, no. 1, pp. 1–18, 2022. doi: 10.1186/s13638-022-02184-6
[44] M. V. C. Aragão, S. B. Mafra, and F. A. P. de Figueiredo, "Otimizando o treinamento e a topologia de um decodificador de canal baseado em redes neurais," Polar, vol. 2, p. 1. doi: 10.14209/sbrt.2022.1570823833
[45] F. Adib Yaghmaie and L. Ljung, "A crash course on reinforcement learning," arXiv e-prints, pp. arXiv–2103, 2021. doi: 10.48550/arXiv.2103.04910
[46] M. G. Kibria, K. Nguyen, G. P. Villardi, O. Zhao, K. Ishizu, and F. Kojima, "Big data analytics, machine learning, and artificial intelligence in next-generation wireless networks," IEEE Access, vol. 6, pp. 32328–32338, 2018. doi: 10.1109/ACCESS.2018.2837692
[47] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, "Machine learning for wireless networks with artificial intelligence: A tutorial on neural networks," arXiv preprint arXiv:1710.02913, vol. 9, 2017. doi: 10.48550/arXiv.1710.02913
[48] H. Yang, A. Alphones, Z. Xiong, D. Niyato, J. Zhao, and K. Wu, "Artificial-intelligence-enabled intelligent 6g networks," IEEE Network, vol. 34, no. 6, pp. 272–280, 2020. doi: 10.1109/MNET.011.2000195
[49] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. USA: Prentice Hall Press, 2009. ISBN 0136042597
[50] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, "Q-learning algorithms: A comprehensive classification and applications," IEEE Access, vol. 7, pp. 133653–133667, 2019. doi: 10.1109/ACCESS.2019.2941229
[51] A. N. Burnetas and M. N. Katehakis, "Optimal adaptive policies for markov decision processes," Mathematics of Operations Research, vol. 22, no. 1, pp. 222–255, 1997. doi: 10.1287/moor.22.1.222
[52] M. Tokic and G. Palm, "Value-difference based exploration: adaptive control between epsilon-greedy and softmax," in Annual conference on artificial intelligence. Springer, 2011. doi: 10.1007/978-3-642-24455-1_33 pp. 335–346.
[53] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015. doi: 10.1038/nature14236
[54] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman, "Pac model-free reinforcement learning," in Proceedings of the 23rd international conference on Machine learning, 2006. doi: 10.1145/1143844.1143955 pp. 881–888.
[55] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," arXiv preprint arXiv:1708.05866, 2017. doi: 10.48550/arXiv.1708.05866
[56] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double q-learning," in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016. doi: 10.48550/arXiv.1509.06461
[57] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015. doi: 10.48550/arXiv.1511.05952
[58] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015. doi: 10.48550/arXiv.1509.02971
[59] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," in International conference on machine learning. PMLR, 2014, pp. 387–395.
[60] G. E. Uhlenbeck and L. S. Ornstein, "On the theory of the brownian motion," Physical Review, vol. 36, no. 5, p. 823, 1930. doi: 10.1103/PhysRev.36.823
[61] A. R. Cassandra, "A survey of pomdp applications," in Working notes of AAAI 1998 fall symposium on planning with partially observable Markov decision processes, vol. 1724, 1998.
[62] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "Openai gym," arXiv preprint arXiv:1606.01540, 2016. doi: 10.48550/arXiv.1606.01540

Sheila C. da S. J. Cruz received a bachelor's degree in computer engineering from the National Institute of Telecommunications (Inatel), Brazil, in 2016. She is currently working towards completing her master's degree at Inatel. Her research interests include digital communications, Wi-Fi, link adaptation, and machine learning. Orcid ID: 0000-0002-4905-518X

Felipe A. P. de Figueiredo received the B.Sc. and M.Sc. degrees in telecommunication engineering from the National Institute of Telecommunications (Inatel), Brazil, in 2004 and 2011, respectively. He received his first Ph.D. degree from the State University of Campinas (UNICAMP), Brazil, in 2019 and the second one from the University of Ghent (UGhent), Belgium, in 2021. He has been working on the research and development of telecommunication systems for more than 15 years. His research interests include digital signal processing, digital communications, mobile communications, MIMO, multicarrier modulations, FPGA development, and machine learning. Orcid ID: 0000-0002-2167-7286

Messaoud Ahmed Ouameur received a bachelor's degree in electrical engineering from the Institute national d'électronique et d'électricité (INELEC), Boumerdes, Algeria, in 1998, the M.B.A. degree from the Graduate School of International Studies, Ajou University, Suwon, South Korea, in 2000, and the master's and Ph.D. degrees (Hons.) in electrical engineering from the Université du Québec à Trois-Rivières (UQTR), QC, Canada, in 2002 and 2006, respectively. He has been a Regular Professor at UQTR since 2018. His research interests include embedded real-time systems, parallel and distributed processing with applications to distributed Massive MIMO, deep learning and machine learning for communication system design, and the Internet of Things with an emphasis on end-to-end systems prototyping and edge computing. Orcid ID: 0000-0003-1095-8012
