Reinforcement Learning-Based Wi-Fi Contention Window Optimization
Abstract—The collision avoidance mechanism adopted by the IEEE 802.11 standard is not optimal. The mechanism employs a binary exponential backoff (BEB) algorithm in the medium access control (MAC) layer. Such an algorithm increases the backoff interval whenever a collision is detected to minimize the probability of subsequent collisions. However, the increase of the backoff interval causes degradation of the radio spectrum utilization (i.e., bandwidth wastage). That problem worsens when the network has to manage the channel access of a dense set of stations, leading to a dramatic decrease in network performance. Furthermore, a wrong backoff setting increases the probability of collisions such that the stations experience numerous collisions before achieving the optimal backoff value. Therefore, to mitigate bandwidth wastage and, consequently, maximize the network performance, this work proposes using reinforcement learning (RL) algorithms, namely Deep Q-Learning (DQN) and Deep Deterministic Policy Gradient (DDPG), to tackle such an optimization problem. In our proposed approach, we assess two different observation metrics: the average of the normalized level of the transmission queue of all associated stations and the probability of collisions. The overall network's throughput is defined as the reward. The action is the contention window (CW) value that maximizes throughput while minimizing the number of collisions. As for the simulations, the NS-3 network simulator is used along with a toolkit known as NS3-gym, which integrates a reinforcement-learning (RL) framework into NS-3. The results demonstrate that DQN and DDPG have much better performance than BEB for both static and dynamic scenarios, regardless of the number of stations. Additionally, our results show that observations based on the average of the normalized level of the transmission queues have a slightly better performance than observations based on the collision probability. Moreover, the performance difference with BEB is amplified as the number of stations increases, with DQN and DDPG showing a 45.52% increase in throughput with 50 stations. Furthermore, DQN and DDPG presented similar performances, meaning that either one could be employed.

Index Terms—Wi-Fi, contention-based access scheme, channel utilization optimization, machine learning, reinforcement learning, NS-3.

Sheila C. da S. J. Cruz and Felipe A. P. de Figueiredo are with the National Institute of Telecommunications (INATEL), Minas Gerais, Brazil (e-mail: [email protected], [email protected]); Messaoud Ahmed Ouameur is with Université du Québec à Trois-Rivières (UQTR), Quebec, Canada (e-mail: [email protected]).
This work was partially funded by Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG) - Grant No. 2070.01.0004709/2021-28, by FADCOM - Fundo de Apoio ao Desenvolvimento das Comunicações, presidential decree no 264/10, November 26, 2010, Republic of Angola, by Huawei, under the project Advanced Academic Education in Telecommunications Networks and Systems, Grant No. PPA6001BRA23032110257684, by the Brazilian National Council for Research and Development (CNPq) under Grant Nos. 313036/2020-9 and 403827/2021-3, by São Paulo Research Foundation (FAPESP) under Grant No. 2021/06946-0, by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) and RNP, with resources from MCTIC, under Grant Nos. 01250.075413/2018-04, 01245.010604/2020-14, and 01245.020548/2021-07 under the Brazil 6G project of the Radiocommunication Reference Center (Centro de Referência em Radiocomunicações - CRR) of the National Institute of Telecommunications (Instituto Nacional de Telecomunicações - Inatel), Brazil; and by FCT/MCTES through national funds and, when applicable, co-funded EU funds under the Project UIDB/EEA/50008/2020.
Digital Object Identifier: 10.14209/jcis.2023.15

I. INTRODUCTION

The IEEE 802.11, or simply Wi-Fi, is a set of wireless network standards designed and maintained by the Institute of Electrical and Electronics Engineers (IEEE) that defines MAC and physical layer (PHY) protocols for deploying wireless local area networks (WLANs). Their MAC layer implements a contention-based protocol, known as carrier-sensing multiple access with collision avoidance (CSMA/CA), for the nodes to access the wireless medium (i.e., the channel) efficiently [1], [2]. With CSMA/CA, the nodes compete to access the radio resources and, consequently, the channel [3], [4].

One of the most critical parameters of the CSMA/CA mechanism is the contention window (CW) value, also known as back-off time, which is a random delay used for reducing the risk of collisions. If the medium is busy, an about-to-transmit Wi-Fi device selects a random number uniformly distributed within the interval [0, CW] as its back-off value, which defers its transmission to a later time. CW has its value doubled every time a collision occurs (e.g., when an ACK is not received), reducing the likelihood of multiple stations selecting the same back-off value. CW values range from the minimum contention window (CWMin) value, generally equal to 15 or 31 depending on the Wi-Fi standard, to the established maximum contention window (CWMax) value, which is equal to 1023. CW is reset to CWMin when an ACK is received or the maximum number of re-transmissions has been reached [5]. This deferring mechanism is also known as binary exponential back-off (BEB) [6] and is shown in Fig. 1.

The BEB algorithm has several limitations and may not always be the best solution for collision avoidance, often providing suboptimal results [7], [8]. These limitations include inefficiency under high loads, lack of fairness among competing nodes, inability to adapt to changing network conditions, vulnerability to the hidden node problem, and no global optimization. Although the BEB algorithm is widely used in Wi-Fi networks, other approaches, such as machine learning-based ones, may be better able to address these limitations and provide more effective collision avoidance strategies.
Fig. 1. Binary exponential back-off (BEB): the CW doubles after each collision (CW0 = 16, CW1 = 32, CW2 = 64, CW3 = 128, CW4 = 256, CW5 = 512, CW6 = 1024).
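To make the BEB rule concrete, the following minimal Python sketch (not from the paper; it assumes CWMin = 15 and CWMax = 1023, the standard limits cited above) shows how the CW evolves on collisions and successful transmissions.

```python
import random

CW_MIN, CW_MAX = 15, 1023  # 802.11 limits mentioned above

def next_cw(cw: int, collided: bool) -> int:
    """Binary exponential back-off: double the window on a collision,
    reset it to CWMin after a successful transmission."""
    if collided:
        return min(2 * (cw + 1) - 1, CW_MAX)  # 15 -> 31 -> 63 -> ... -> 1023
    return CW_MIN

def backoff_slots(cw: int) -> int:
    """Random back-off drawn uniformly from [0, CW]."""
    return random.randint(0, cw)
```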
In scenarios with few nodes, collisions will be less frequent and impactful, especially in static scenarios, where the number of nodes remains the same. On the other hand, in dynamic scenarios, where the number of nodes increases over time, collisions will be commonplace. Furthermore, the high number of collisions reduces the network throughput drastically since CW has its value doubled when collisions are detected, leading to an inefficient network operation [9].

As can be seen, optimizing the CW value could be beneficial for Wi-Fi networks since the traditional BEB algorithm does not scale well when many nodes compete for the medium [10]. As network devices with high computational capabilities become increasingly common, CW can be optimized through machine learning (ML) algorithms. In ML, the most common learning paradigms are supervised, unsupervised, and reinforcement learning. Supervised algorithms require a labeled dataset where the outcomes for the respective inputs are known. Still, creating such a dataset requires a model and a solution to the problem. Developing accurate models is, in several cases, a challenging and troublesome task. Besides that, many solutions are suboptimal, and supervised ML algorithms learning from datasets created with those solutions will never surpass their performance since the algorithm tries to replicate the input-to-label mapping present in the dataset [11].

Unsupervised learning occurs when the ML algorithm is trained using unlabeled data. The idea behind this paradigm is to find hidden (i.e., latent) patterns in the data. This paradigm could be used to identify hidden patterns in Wi-Fi traffic that could be contributing to collisions. By clustering similar traffic patterns, it may be possible to identify common sources of interference or other issues that are leading to collisions and take actions to mitigate them. Nonetheless, unsupervised learning can be limited in its applicability to collision avoidance in Wi-Fi networks due to the lack of labeled data, limited interpretability of results, limited control over the learning process, and lack of a feedback loop for ongoing adjustment and optimization. While unsupervised learning can be useful for analyzing Wi-Fi traffic data and identifying patterns, other machine learning paradigms, such as reinforcement learning, may be better suited to the problem of collision avoidance.

Finally, reinforcement learning occurs when the ML algorithm (a.k.a. agent in this context) interacts with the environment through trial-and-error action attempts without requiring labels, only reward information about the taken actions. This paradigm allows the agents (e.g., Wi-Fi nodes) to explore the environment by taking random actions and finding optimal solutions. It can be employed to optimize the actions of Wi-Fi nodes in a network to minimize collisions and maximize throughput. The nodes can learn to take actions that lead to higher reward values (e.g., successful transmissions) while avoiding actions that lead to lower reward values (e.g., collisions). This paradigm presents several advantages for collision avoidance, including the ability to learn from experience, optimize long-term performance, handle complex and dynamic environments, explore and exploit different strategies, and adapt to different contexts. RL algorithms can learn from the state of the network and adapt to changing network conditions, optimizing cumulative rewards over time. This makes RL well-suited to handle the complexity of Wi-Fi networks, find optimal solutions, and navigate different scenarios. Additionally, RL is flexible and adaptable, making it a powerful tool for collision avoidance in Wi-Fi networks. Therefore, since there is no optimal solution for the CW optimization (BEB is known to be suboptimal) [12]–[14], RL may offer more effective collision avoidance strategies. However, despite the increasing number of state-of-the-art contributions, RL-based algorithms are complex: they present high computational requirements, require fine-tuned hyperparameters since they are sensitive to their settings, need a training process that might take several hours and demand extensive exploration (requiring a significant amount of computing power), and have difficulties in handling continuous and high-dimensional state and action spaces [15].

RL algorithms have addressed these limitations through various approaches. To mitigate the high computational requirements, parallelization (i.e., distributed computing) has been employed to leverage the power of multiple processors or machines, and hardware acceleration using specialized processors like graphics processing units (GPUs) or tensor processing units (TPUs) has been used to expedite the training process [16]. To tackle the need for extensive training data and exploration, methods such as experience replay are employed,
in which past experiences are stored and randomly sampled during training, allowing for efficient and effective utilization of data [17]. Exploration strategies, such as epsilon-greedy or Thompson sampling, strike a balance between exploiting learned knowledge and exploring new possibilities [18]. Moreover, curiosity-driven or intrinsic-motivation learning techniques have been explored to encourage agents to explore novel states or seek out new experiences, thereby reducing the reliance on vast amounts of explicit training data [19]. Sensitivity to hyperparameters is addressed through techniques like grid search, systematically exploring different hyperparameter combinations to find optimal settings [20]. More advanced approaches, such as Bayesian optimization, leverage probabilistic models to search the hyperparameter space intelligently [21]. Furthermore, algorithmic advancements like automatic hyperparameter tuning methods, utilizing meta-learning or reinforcement learning itself, have been developed to automate the selection of hyperparameters, enhancing the performance of the algorithms [22]. Difficulties in handling continuous and high-dimensional spaces are overcome using deep neural networks and parameterization methods. Deep neural networks are used for function approximation, allowing policies to be learned directly from raw sensory inputs [23]. Continuous action spaces are handled through parameterization methods, such as using Gaussian distributions or policy parameterization [24]. Additionally, actor-critic architectures, combining policy and value function approximators, have shown promise in effectively handling high-dimensional state and action spaces [25]. These ongoing advancements aim to improve the efficiency and performance of reinforcement learning algorithms across different domains.

This work proposes using deep reinforcement learning (DRL) algorithms to optimize the CW value selection and improve the performance of Wi-Fi networks by maintaining a stable throughput and minimizing collisions¹. More specifically, we propose an RL-aided centralized solution aimed at finding the best CW value that maximizes the overall Wi-Fi network's throughput by properly setting CW values, especially in dense environments with dozens to hundreds of stations. The proposed solution takes actions, i.e., selects a CW value, based on the observation metric adopted. In this work, we study and compare the use of two distinct DRL algorithms, namely, DQN and DDPG, and two different observation metrics, namely, the averaged normalized transmission queues' level of all associated stations and the collision probability, to tackle the CW optimization problem. DQN was chosen because it is relatively simple and has a discrete action space. However, despite its simplicity, DQN generally displays performance and flexibility that rival other methods [26]. DDPG was selected since it is a more complex method that represents actions as continuous values, yielding an interesting comparison with DQN [27]. By using DDPG, we want to explore the hypothesis that the proposed solution can adjust the CW value more precisely and in a more fine-grained fashion with a continuous action space. Therefore, the main contributions of this work are as follows:

1) Adoption of the averaged normalized transmission queue level as the observation information used by the RL agent to take actions.
2) The comparison and assessment of two different observation metrics, namely, the averaged normalized transmission queue level and the collision probability.
3) A CW optimization solution that applies to any of the 802.11 standards.

¹The source code for the reproduction of the results is available at: https://github.com/sheila-janota/RLinWiFi-avg-queue-level

The remainder of the paper is organized as follows. Section II discusses related work. Section III presents a brief machine learning overview. Section IV describes the materials and methods used in the simulations. Section V presents the simulation results. Finally, Section VI presents conclusions and future works.

II. RELATED WORK

The literature presents adequate and excellent contributions of ML methods applied to CW optimization in wireless networks. For example, in [28], the authors propose a CW optimization mechanism for IEEE 802.11ax under dynamically varying network conditions employing DRL algorithms. The DRL algorithms were implemented on the NS-3 simulator [29] using the NS3-gym framework [30], which enables integration with Python frameworks. They proved to have efficiency close to optimal according to the throughput results, which remained stable even when the network topology changed dynamically. Their solution uses the collision probability, i.e., the transmission failure probability, to observe the overall network's status.

To allow channel access and a fair share of the unlicensed spectrum between wireless nodes, the authors in [31] propose an intelligent ML solution based on CWmin (minimum CW) adaptation. The issue is that aggressive nodes, as they are referred to in the paper, try to access the medium by forcefully choosing low CWmin values, while CSMA/CA-based nodes have a fixed CWmin set to 16, leading to an unfair share of the spectrum. The intelligent CW ML solution consists of a random forest, which is a supervised classification algorithm. Simulations were conducted on a C++ discrete-event-based simulator called CSIM [32] to evaluate the algorithm's performance. It was possible to obtain high throughput efficiency while maintaining fair allocation of the unlicensed channels with other coexisting nodes.

In [33], the authors present a Deep Q-learning algorithm to dynamically adapt CWmin for random access in wireless networks. The idea is to maximize a network utility function (i.e., a metric measuring the fair use of the medium) [34] under dynamic and uncertain scenarios by rewarding the actions that lead to high utilities (efficient resource usage). The proposed solution employs an intelligent node, called node 0, that implements the DQN algorithm to choose the CWmin for the next time step from historical observations. The simulation was conducted on NS-3 to evaluate the performance against the following baselines: optimal design, random forest classifier, fairness index, optimal constant, and standard protocol (with its CWmin fixed at 32). Two scenarios were considered for
the simulation. The first scenario uses two states and follows a Markov process for the CWmins of all nodes except node 0. The RL algorithm and the random forest classifier reach outstanding performance for this case. The second scenario considers five states, and the CWmins of all nodes other than node 0 follow a more complex process. The RL algorithm achieves utility close to optimal when compared to a supervised random forest classifier.

In [10], the authors propose an ML-based solution using a Fixed-Share algorithm to dynamically adjust the CW value and improve network performance. The algorithm comprises CW calculation, a Loss/Gain function, and sharing weights. The algorithm considers the present and recent past conditions of the network. The NS-3 network simulator was used to evaluate the proposed solution, and the performance metrics used were average throughput, average end-to-end delay, and channel access fairness. The Fixed-Share algorithm achieves excellent performance compared to two conventional algorithms: binary exponential backoff (BEB) and History-Based Adaptive Backoff (HBAB).

To optimize CW in a wireless local area network, [35] presents three algorithms based on genetic fuzzy contention window optimization (GF-CWO), which is a combination of a fuzzy logic controller and a genetic algorithm. The proposed algorithm is intended to solve issues related to success ratio, packet loss ratio, collision rate, fairness index, and energy consumption. Simulations were conducted in Matlab in order to evaluate the performance of the proposed solution, producing better results when compared to BEB.

To avoid packet collisions in mobile ad hoc networks (MANETs) [36], [37], the authors of [38] propose a Q-learning-based solution to optimize the CW parameter in an IEEE 802.11 network. The proposed CW optimization method considers the number of packets sent and the collisions generated by each station. Simulation results show that selecting a good CW value improves the packet delivery ratio, channel access fairness, throughput, and latency. The benefits are even more significant when the queue size is less than or equal to 20.

In [39], a different approach to controlling the CW value, named contention window threshold, is used. It employs deep reinforcement learning (DRL) principles to establish a threshold value and learn the optimal settings under various network scenarios. The method used is called the smart exponential-threshold-linear backoff algorithm with a deep Q-learning network (SETL-DQN). Its results demonstrate that this algorithm can reduce collisions and improve throughput.

The authors of [40] apply DRL to the problem of random access in machine-type communication networks adopting slotted ALOHA access protocols. Their proposed solution aims at finding a better transmission policy for slotted ALOHA protocols. The proposed algorithm learns a policy that establishes a trade-off between user fairness and the overall network throughput. The solution employs centralized training, which makes the learned policy equal for all users. Their approach uses binary feedback signals and past actions to learn transmission probabilities and adapt to traffic scenarios with different arrival processes. Their results show that the proposed algorithm outperforms the classical solution, i.e., the exponential backoff, in terms of user fairness and overall throughput. Our proposed solution differs from this one because it tackles the problem of random access in slot-free, contention-based networks, aiming to maximize the network's throughput and, consequently, the throughput of individual users.

Differently from the previous works, in this work, we propose leveraging an alternative observation metric, based on the average of all stations' normalized transmission queue levels, which seems a more informative and straightforward way to understand and capture the overall network's status, to train DRL agents to take actions aiming at finding the best CW value, which, in turn, optimizes the network performance. Moreover, we compare the observation metric proposed in [28] (i.e., the collision probability observed by the network) with the proposed one, i.e., the stations' averaged normalized transmission queue levels. The normalized transmission queues' level provides more information, taking into account the congestion level of the network and the behavior of all nodes, not just the individual node. It also leads to better performance, allows for adaptation to changing network conditions, and avoids myopic decisions. In contrast, using the collision probability alone can result in suboptimal decisions that do not take into account the impact on the overall network performance. Therefore, the averaged normalized transmission queues' level seems to be a more suitable metric for RL solutions to solve the collision avoidance problem in Wi-Fi networks.

The ML-based approaches presented in this section show how well ML algorithms can be applied to reach optimal performance in the wireless network field. Therefore, this motivated us to study and propose a DRL-based solution to reduce collisions by optimizing CW in different scenarios.

III. MACHINE LEARNING OVERVIEW

ML algorithms, also called models, have been widely applied to solve different problems related to optimization in wireless network communications systems [41]–[44]. These algorithms construct a model based on historical data, known as a training dataset, to perform tasks, for example, solving optimization problems, without being explicitly programmed to do so [45], [46]. ML algorithms can provide self-management, self-learning, and self-optimizing solutions for an extensive range of issues in the context of dynamic resource allocation, spectrum resource management, wireless network optimization, and much more [47].

The learning process of an ML model is called training, and it is used for the model to gain knowledge (i.e., infer a solution) and achieve the desired result. It is possible to classify the ML model learning based on the type of its training, also called the learning paradigm [48]. The learning paradigms can be classified as supervised learning, unsupervised learning, and reinforcement learning (RL). Fig. 2 shows the relations between the ML paradigms. Therefore, next, we provide a brief overview of these learning paradigms.
[Figure: RL agent-environment interaction loop, in which the environment returns the next state S_{t+1} and the reward R_{t+1} to the agent.]
[Figure: policy-based agent represented by a DNN with policy π_θ(s, a) and parameters θ, mapping an input state to output actions.]
Algorithm 1 Q-learning (parameter initialization: learning rate α, discount rate γ, exploration rates ϵ, ϵ_Min, ϵ_Decay; remainder of the listing not recoverable from this excerpt).
Fig. 6. Deep Q-learning (DQN).
Fig. 7. Deep Deterministic Policy Gradient (DDPG).
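The following minimal Python sketch (not the paper's listing) illustrates the standard tabular Q-learning update that uses the parameters initialized in Algorithm 1: learning rate α, discount rate γ, and ϵ-greedy exploration with decay. The numeric values and the seven-action set are illustrative assumptions.

```python
import random
from collections import defaultdict

alpha, gamma = 0.1, 0.9                 # learning rate and discount rate (illustrative)
eps, eps_min, eps_decay = 1.0, 0.01, 0.995
Q = defaultdict(float)                  # Q[(state, action)] -> estimated return
ACTIONS = list(range(7))                # e.g., discrete CW exponents 0..6

def select_action(state):
    """Epsilon-greedy exploration-exploitation."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """Standard Q-learning temporal-difference update."""
    global eps
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    eps = max(eps_min, eps * eps_decay)  # decay the exploration rate
```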
1) Deep Q-learning (DQN): In DQN, the agent is represented by a deep neural network (DNN) that uses the state as the input of the neural network, and the outputs are the Q-values corresponding to the possible actions, where the action taken by the agent is the one that maximizes the Q-values, as shown in Fig. 6.

The Q-value represents the quality of the action in a given state. The idea is for the neural network to output the highest Q-value for the action that maximizes the cumulative expected reward. However, using a single neural network renders training very unstable [53]. The trick to mitigate this problem and add stability to the training process is using two neural networks, the predictive and target networks. They have the same structure (i.e., number of layers and neurons in each layer and activation functions) but have their weights updated at different times. The weights of the target network are not trained. Instead, they are periodically synchronized with the weights of the predictive network. The idea is that fixing the target Q-values (outputs of the target network) for several updates improves the predictive network's training stability. DQN employs batch training and an experience replay memory, making the agent learn from randomly sampled batches of experiences. It also employs the ϵ-greedy exploration-exploitation mechanism.
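As an illustration of the two-network trick and the ϵ-greedy mechanism described above, the following PyTorch sketch (our own simplification, not the paper's code; the layer sizes and synchronization policy are assumptions) shows a predictive network, a frozen target network that is periodically synchronized, and ϵ-greedy action selection.

```python
import copy, random
import torch
import torch.nn as nn

n_obs, n_actions = 2, 7  # e.g., a (mean, variance) observation and 7 discrete CW actions

def make_net():
    # Small fully connected network: state in, one Q-value per action out.
    return nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_actions))

predictive_net = make_net()
target_net = copy.deepcopy(predictive_net)   # same structure, weights not trained
for p in target_net.parameters():
    p.requires_grad = False

def select_action(obs, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else argmax Q."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(predictive_net(torch.as_tensor(obs, dtype=torch.float32)).argmax())

def sync_target():
    """Periodically copy the predictive weights into the target network."""
    target_net.load_state_dict(predictive_net.state_dict())
```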
2) Deep Deterministic Policy Gradient: DDPG is another off-policy DRL algorithm, with continuous action and state spaces, proposed in [58]. It is the result of the combination of the deterministic policy gradient (DPG) and DQN algorithms, the former being related to the actor-critic algorithm [25], [59]. DQN avoids instability during the Q-function learning by employing a replay buffer and a target network. DQN has a discrete action space, while DDPG extends it to a continuous action space. The algorithm simultaneously learns a Q-function and a policy. Since DDPG inherits from the actor-critic algorithm, it is a combination of both policy (actor) and Q-value (critic) functions, where the actor takes actions according to a specific policy function, and the critic plays the role of an evaluator of the action taken [25]. Fig. 7 presents a general view of how DDPG works. Here, the actor takes the state as input and outputs an action. The critic receives as input the state and the action from the actor, which are used to evaluate the actor's actions, and outputs Q-values corresponding to the set of possible actions. The Q-values outputted by the critic indicate to the agent how good the action taken by the actor for that specific state was.

DDPG consists of four networks: the actor prediction, critic prediction, actor target, and critic target networks. The target networks have their weights copied from the prediction networks periodically. As with DQN, this procedure is adopted in order to stabilize the learning process, moving the unstable problem of learning the action-value function to a stable supervised learning problem [58].

Similarly to DQN, DDPG uses an experience replay memory to minimize correlations between samples. Regarding the policy aspect of exploration and exploitation, DDPG differs from DQN. Since DDPG works in a continuous action space, exploring such a space constitutes a significant problem. However, as it is an off-policy algorithm, the exploration problem can be treated independently from the learning algorithm [58]. DDPG creates an exploration policy that adds a noise value to the actor policy to solve this issue. By default, the noise is added following the Ornstein-Uhlenbeck process [60].
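To illustrate this exploration policy, the sketch below (our own simplification, not the paper's code; the noise scale parameters are assumptions) adds Ornstein-Uhlenbeck noise, or alternatively Gaussian noise as adopted later in this work's DDPG agent, to a deterministic actor action in a continuous action space.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = mu

    def sample(self):
        self.x += self.theta * (self.mu - self.x) + self.sigma * np.random.randn()
        return self.x

def noisy_action(actor_action, noise, low=0.0, high=6.0):
    """Add exploration noise and keep the action inside the valid range [0, 6]."""
    return float(np.clip(actor_action + noise.sample(), low, high))

def gaussian_noisy_action(actor_action, sigma=0.5, low=0.0, high=6.0):
    """Gaussian alternative: noise drawn from a zero-mean normal distribution."""
    return float(np.clip(actor_action + sigma * np.random.randn(), low, high))
```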
IV. APPLYING DRL TO THE CW OPTIMIZATION

In order to apply DRL algorithms (DQN and DDPG) to optimize Wi-Fi networks, we propose, in this work, a centralized approach to solving the CW optimization problem. Our proposed approach consists of a centralized algorithm (i.e., the agent), which is a module running on the Wi-Fi access point (AP) that observes the state of the network (i.e., the environment) and chooses suitable CW values (i.e., the actions) to optimize the network's performance (i.e., the reward). Next, we present some details on the agent and its input and output values.

1) Agent: the agent represents the proposed DRL algorithms (i.e., DQN and DDPG). The agent is chosen to run on the AP since it has a general view of the whole network and can then control the stations associated with it through beacon frames in a centralized way. Therefore, as can be noticed, this is a centralized approach where the AP decides the best CW value that will be used across the network.

2) Current state (s): the current state of the environment is the status of all stations associated with the AP. However, it is impossible to get this information because of the nature of the problem. Therefore, we model the problem as a Partially Observable Markov Decision Process (POMDP) instead of an MDP. A POMDP assumes the environment's state cannot be perfectly observed [61].

3) Observation (o): the observation is the information acquired from the network, based either on the averaged normalized transmission queues' level of all associated stations or on the probability of collision proposed in [28]. We will employ and compare the performance attained by these two different types of information obtained from the network. Sections IV-A and IV-B explain how each one of these observation values is calculated.

4) Action (a): the action corresponds to the CW value. Since the CW value is directly connected to the network performance, longer back-off periods lead to longer
waiting times for retransmitting packets in case of collisions, degrading the spectrum usage and the network performance. The RL scheme brings the idea that for every action there is a related maximized reward. Thus, applying RL concepts to optimize the CW value, aiming to maximize the network's throughput, is what this work proposes. Therefore, we use the CW value as the RL action. As we compare DRL algorithms with discrete and continuous action spaces, the actions are integer values between 0 and 6 in the discrete case and real values within the interval [0, 6] in the continuous case. The set of actions follows these specific values (i.e., 0 through 6) to define the back-off interval used by IEEE 802.11a for a station to retransmit its packet, establishing the limit condition of retransmission retries. Therefore, the CW value to be broadcast to the stations can be obtained through the application of (3). This interval is selected so that the action space lies within the 802.11 standard's CW range, which goes from 15 up to 1023. An action, a, taken by the agent in state s makes the environment switch to its next state, s′, with a given probability.
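As a hedged illustration of the action-to-CW mapping of (3), the sketch below assumes the exponential rule CW = 2^(a+4) − 1, which matches the stated action range 0 through 6 and the CW range 15 to 1023; the actual equation (3) may differ.

```python
def action_to_cw(action: float) -> int:
    """Map an agent action in [0, 6] to a CW value in [15, 1023].

    Assumption: CW = 2**(a + 4) - 1, so a = 0 -> 15 and a = 6 -> 1023.
    DQN produces integer actions; DDPG produces real-valued actions,
    which are rounded here before the mapping is applied.
    """
    a = int(round(min(max(action, 0.0), 6.0)))  # clamp and discretize
    return 2 ** (a + 4) - 1

# Example: the seven discrete actions cover the standard CW range.
print([action_to_cw(a) for a in range(7)])  # [15, 31, 63, 127, 255, 511, 1023]
```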
The measurement of Q_NL^(i) is carried out at predefined intervals at each station and indicates the result of the currently chosen CW value on the network's performance. For example, a value close to 1 indicates the queue is full, meaning the station cannot transmit packets as quickly as it receives them from the upper layers. On the other hand, if it is close to 0, the queue is almost empty, indicating the station can access the medium as frequently as necessary. A high Q_NL^(i) value indicates a high number of collisions. Conversely, a low value indicates a small number of collisions. The normalized level of the transmission queue of each station, Q_NL^(i), is concatenated (i.e., piggybacked) to data frames sent to the AP so that the agent has access to this information. At the AP, the agent normalizes the sum of the Q_NL^(i) values coming from the stations by the total number of stations associated with the AP, N_stations, as shown in (5). This is the observation used by the agent to gather insights into the network's status.

$$\bar{Q}_{NL} = \frac{1}{N_{\mathrm{stations}}} \sum_{i=1}^{N_{\mathrm{stations}}} Q_{NL}^{(i)}. \qquad (5)$$
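As a small worked example of (5), the snippet below (illustrative only; the names are ours) averages the piggybacked per-station queue levels at the AP.

```python
def averaged_queue_level(per_station_levels):
    """Compute the observation of (5): the mean of the normalized
    transmission-queue levels reported by all associated stations."""
    n_stations = len(per_station_levels)
    return sum(per_station_levels) / n_stations

# Example: five associated stations reporting their normalized queue levels.
print(averaged_queue_level([0.43, 0.41, 0.10, 1.00, 0.51]))  # 0.49
```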
[Figure: history of observation values collected from the stations and the resulting pre-processed (mean, variance) pairs.]
Exploration of the environment is enabled by adding a noisy factor to each action the agent takes. This noisy factor decreases throughout the learning stage. The addition of noise is different for each of the two considered DRL algorithms. When DQN is used, the noisy factor corresponds to the probability of taking a random action instead of an action predicted by the agent. In DDPG's case, the noisy factor comes from a Gaussian-distributed random variable and is added directly to the action taken by the agent. As mentioned, this is done to find a trade-off between exploring the environment and exploiting the acquired knowledge, which chooses the action that maximizes future rewards.

The last stage is called the operational stage. This stage starts when the training is over. The user defines the training stage's period by setting the variable trainingPeriod. At this stage, the noisy factor is set to zero, so the agent always chooses the action it learned to maximize the reward. As the agent has already been trained at this stage, it does not receive any additional updates to its policy, so rewards are unnecessary.

Finally, it is essential to mention that the hyperparameters of the DRL models (i.e., learning rate, reward discount rate, batch size, epsilon decay) need to be fine-tuned for the agent to achieve its optimal performance. Lastly, as both DQN and DDPG employ a replay memory (called in this work the experience replay buffer, E), a size limit has to be configured for this memory. The replay memory stores past interactions of the agent with the environment, i.e., it records the current state, the action taken at that state, the reward received in that state, and the next state resulting from the action taken. When its limit is reached, the oldest record is overwritten by a new one (i.e., it is implemented as a circular buffer).
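A circular experience replay buffer like the one just described can be sketched as follows (illustrative Python, not the paper's implementation; the capacity value is arbitrary).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size circular buffer of (state, action, reward, next_state) records.
    When the size limit is reached, the oldest record is overwritten."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Random mini-batch used to decorrelate the training samples.
        return random.sample(self.buffer, batch_size)
```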
A brief description of the pseudo-code shown in Algorithm 2 is provided next. The initialization phase goes from lines 1 to 13. It initializes the input parameters, such as the observation and replay buffers, agent weights, initial state, noisy factor, CW value, and the variables used to select the current stage and the type of observation employed.

Then, the pre-learning stage (lines 15 to 24) starts by filling in the observation buffer with the selected type of observation metric (either the averaged normalized queues' level or the collision probability). The flag useQueueLevelFlag sets the type of observation value to be calculated. In this stage, the observation value (either the averaged normalized transmission queues' level or the network's collision probability) results from the application of the legacy Wi-Fi contention-based mechanism. After envStepTime, which is the period between interactions of the proposed algorithm with the environment, has elapsed, the algorithm selects the CW value, and then the observation value results from the application of the DRL-based CW optimization algorithm. Therefore, one should note that the instructions in lines 15 to 24 are shared between the pre-learning and learning stages.

Next, in lines 26 to 29, the information in the observation buffer is pre-processed by applying a moving average operation to the data, which results in a two-dimensional vector, observation, with the mean, µ, and variance, σ², of the data in the buffer. With the pre-processed data, the action function, A_θ, returns the action, a. The action is then used to determine a new CW value that will be broadcast to all associated stations. The instructions in lines 26 to 29 are shared between the learning and operational stages. The difference between these two stages is how the action, a, is selected. In the learning stage, the trainingFlag is equal to True, which tells the action function, A_θ, to choose actions following the exploration-exploitation approach described previously. In the operational stage, the trainingFlag is equal to False, which tells the action function, A_θ, to choose actions that optimize the network's performance. In the learning stage, a noisy factor is used to explore the environment, which does not happen in the operational stage.
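This pre-processing step can be pictured with the small sketch below (our own illustration; the window size and interfaces are assumptions), which reduces the recent history of observation values to the (µ, σ²) pair fed to the action function.

```python
import numpy as np

def preprocess_observations(observation_buffer, window=16):
    """Reduce the recent observation history to a two-dimensional vector:
    the mean and the variance of the last `window` samples."""
    recent = np.asarray(observation_buffer[-window:], dtype=np.float32)
    return np.array([recent.mean(), recent.var()], dtype=np.float32)

# Example: history of averaged normalized queue levels.
obs = preprocess_observations([0.43, 0.41, 0.10, 0.12, 1.00, 0.51, 0.34, 0.76])
print(obs)  # -> array([mean, variance], dtype=float32)
```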
After that, lines 31 to 38 apply solely in the learning stage. In this section of the algorithm, the DRL agent learns from previous experiences; that is, it has its weights, θ, updated by using mini-batches of samples randomly picked from the replay buffer, E. These mini-batches are composed of the current µ, σ², and action, a, values, the reward (i.e., the normalized throughput), r, and the previous values of µ and σ².

Finally, in line 42, the algorithm checks if the training period has elapsed. If so, the trainingFlag is set to False, and the algorithm enters the operational stage. In this stage, the noisy factor is disabled, and the algorithm only exploits the acquired knowledge.
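Putting the stages together, the following high-level sketch is a paraphrase of Algorithm 2 based only on the description above; all object names are ours, the listing itself may differ, and it reuses the preprocess_observations helper sketched earlier.

```python
def run(agent, env, replay, training_period, env_step_time, use_queue_level=True):
    """High-level control loop mirroring the pre-learning, learning, and
    operational stages described above (agent, env, and replay are any
    objects exposing the methods used here)."""
    training_flag = True
    elapsed, obs_history, prev_obs = 0.0, [], None
    while True:
        # Pre-learning / learning: collect the chosen observation metric.
        obs_history.append(env.measure(use_queue_level))
        observation = preprocess_observations(obs_history)   # (mu, sigma^2)

        # Learning stage explores (noisy actions); operational stage exploits.
        action = agent.act(observation, explore=training_flag)
        reward = env.apply_cw(action)                         # normalized throughput

        if training_flag and prev_obs is not None and len(replay.buffer) >= 32:
            replay.store(prev_obs, action, reward, observation)
            agent.learn(replay.sample(batch_size=32))         # update weights theta
        prev_obs = observation

        elapsed += env_step_time
        if elapsed >= training_period:
            training_flag = False                             # enter operational stage
```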
D. Experimentation Scenario

The proposed centralized DRL-based CW optimization solution is implemented on NS3-gym [30], which runs on top of the NS-3 simulator [29]. NS3-gym enables the communication between NS-3 (C++) and the OpenAI Gym framework (Python) [62]. NS-3 is a network simulator based on discrete events mainly intended for academic research. It contains the implementation of several wired and wireless network standards [29]. In this work, we use version NS-3.29 of the NS-3 simulator. The DRL algorithms used here were implemented in Python.
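For orientation, a minimal agent-environment loop on top of NS3-gym could look like the sketch below; it assumes the standard ns3-gym Python API (Ns3Env with the usual Gym reset/step interface) and uses placeholder port and step-time values, so it should be read as illustrative rather than as the exact setup used in this work.

```python
from ns3gym import ns3env  # ns3-gym Python bindings

# Placeholder values: the actual simulation arguments are defined on the NS-3 side.
env = ns3env.Ns3Env(port=5555, stepTime=0.01, startSim=True, simSeed=0, debug=False)

try:
    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()          # replace with a trained DQN/DDPG agent
        obs, reward, done, info = env.step(action)  # reward: normalized network throughput
finally:
    env.close()
```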
TABLE I. NS-3 environment configuration parameters.
TABLE II. NS3-gym agent configuration parameters.
Fig. 13. Selected CW value for different numbers of stations in a dynamic scenario.
Fig. 14. Comparison of the instantaneous network throughput as the number of stations increases from 5 to 50.
[16] C. Silvano, D. Ielmini, F. Ferrandi, L. Fiorin, S. Curzel, L. Benini, F. Conti, A. Garofalo, C. Zambelli, E. Calore et al., "A survey on deep learning hardware accelerators for heterogeneous HPC platforms," arXiv preprint arXiv:2306.15552, 2023. doi: 10.48550/arXiv.2306.15552
[17] Q. Cai, C. Cui, Y. Xiong, W. Wang, Z. Xie, and M. Zhang, "A survey on deep reinforcement learning for data processing and analytics," IEEE Transactions on Knowledge & Data Engineering, vol. 35, no. 5, pp. 4446–4465, May 2023. doi: 10.1109/TKDE.2022.3155196
[18] G. Aridor, Y. Mansour, A. Slivkins, and Z. S. Wu, "Competing bandits: The perils of exploration under competition," arXiv preprint arXiv:2007.10144, 2020. doi: 10.48550/arXiv.2007.10144
[19] X. Lu, B. V. Roy, V. Dwaracherla, M. Ibrahimi, I. Osband, and Z. Wen, "Reinforcement learning, bit by bit," Foundations and Trends® in Machine Learning, vol. 16, no. 6, pp. 733–865, 2023. doi: 10.1561/2200000097
[20] S. Shekhar, A. Bansode, and A. Salim, "A comparative study of hyper-parameter optimization tools," in 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE). Los Alamitos, CA, USA: IEEE Computer Society, Dec. 2021, pp. 1–6. doi: 10.1109/CSDE53843.2021.9718485
[21] P. I. Frazier, "A tutorial on Bayesian optimization," arXiv preprint arXiv:1807.02811, 2018. doi: 10.48550/arXiv.1807.02811
[22] A. Karras, C. Karras, N. Schizas, M. Avlonitis, and S. Sioutas, "AutoML with Bayesian optimizations for big data management," Information, vol. 14, no. 4, p. 223, 2023. doi: 10.3390/info14040223
[23] P. Beneventano, P. Cheridito, R. Graeber, A. Jentzen, and B. Kuckuck, "Deep neural network approximation theory for high-dimensional functions," arXiv preprint arXiv:2112.14523, 2021. doi: 10.48550/arXiv.2112.14523
[24] J. Zhu, F. Wu, and J. Zhao, "An overview of the action space for deep reinforcement learning," in Proceedings of the 2021 4th International Conference on Algorithms, Computing and Artificial Intelligence, 2021, pp. 1–10. doi: 10.1145/3508546.3508598
[25] V. Konda and J. Tsitsiklis, "Actor-critic algorithms," Advances in Neural Information Processing Systems, vol. 12, 1999.
[26] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[27] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015. doi: 10.48550/arXiv.1509.02971
[28] W. Wydmański and S. Szott, "Contention window optimization in IEEE 802.11ax networks with deep reinforcement learning," in 2021 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2021, pp. 1–6. doi: 10.1109/WCNC49053.2021.9417575
[29] G. F. Riley and T. R. Henderson, "The ns-3 network simulator," in Modeling and Tools for Network Simulation. Springer, 2010, pp. 15–34.
[30] P. Gawłowicz and A. Zubow, "ns-3 meets OpenAI Gym: The playground for machine learning in networking research," in Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, 2019, pp. 113–120. doi: 10.1145/3345768.3355908
[31] A. H. Y. Abyaneh, M. Hirzallah, and M. Krunz, "Intelligent-CW: AI-based framework for controlling contention window in WLANs," in 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN). IEEE, 2019, pp. 1–10. doi: 10.1109/DySPAN.2019.8935851
[32] Y. Xiao, M. Hirzallah, and M. Krunz, "Distributed resource allocation for network slicing over licensed and unlicensed bands," IEEE Journal on Selected Areas in Communications, vol. 36, no. 10, pp. 2260–2274, 2018. doi: 10.1109/JSAC.2018.2869964
[33] A. Kumar, G. Verma, C. Rao, A. Swami, and S. Segarra, "Adaptive contention window design using deep Q-learning," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 4950–4954. doi: 10.1109/ICASSP39728.2021.9414805
[34] X. Fu and E. Modiano, "Learning-NUM: Network utility maximization with unknown utility functions and queueing delay," in Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, 2021, pp. 21–30. doi: 10.48550/arXiv.2012.09222
[35] I. A. Qureshi and S. Asghar, "A genetic fuzzy contention window optimization approach for IEEE 802.11 WLANs," Wireless Networks, vol. 27, no. 4, pp. 2323–2336, 2021. doi: 10.1007/s11276-021-02572-8
[36] T. K. Saini and S. C. Sharma, "Prominent unicast routing protocols for mobile ad hoc networks: Criterion, classification, and key attributes," Ad Hoc Networks, vol. 89, pp. 58–77, 2019. doi: 10.1016/j.adhoc.2019.03.001
[37] S. Giannoulis, C. Donato, R. Mennes, F. A. de Figueiredo, I. Jabandžic, Y. De Bock, M. Camelo, J. Struye, P. Maddala, M. Mehari et al., "Dynamic and collaborative spectrum sharing: The SCATTER approach," in 2019 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN). IEEE, 2019, pp. 1–6. doi: 10.1109/DySPAN.2019.8935774
[38] N. Zerguine, M. Mostefai, Z. Aliouat, and Y. Slimani, "Intelligent CW selection mechanism based on Q-learning (MISQ)," Ingénierie des Systèmes d'Information, vol. 25, no. 6, pp. 803–811, 2020. doi: 10.18280/isi.250610
[39] C.-H. Ke and L. Astuti, "Applying deep reinforcement learning to improve throughput and reduce collision rate in IEEE 802.11 networks," KSII Transactions on Internet and Information Systems (TIIS), vol. 16, no. 1, pp. 334–349, 2022. doi: 10.3837/tiis.2022.01.019
[40] M. A. Jadoon, A. Pastore, M. Navarro, and F. Perez-Cruz, "Deep reinforcement learning for random access in machine-type communication," in 2022 IEEE Wireless Communications and Networking Conference (WCNC), 2022, pp. 2553–2558. doi: 10.1109/WCNC51071.2022.9771953
[41] R. Mennes, F. A. P. De Figueiredo, and S. Latré, "Multi-agent deep learning for multi-channel access in slotted wireless networks," IEEE Access, vol. 8, pp. 95032–95045, 2020. doi: 10.1109/ACCESS.2020.2995456
[42] R. Mennes, M. Claeys, F. A. P. De Figueiredo, I. Jabandžić, I. Moerman, and S. Latré, "Deep learning-based spectrum prediction collision avoidance for hybrid wireless environments," IEEE Access, vol. 7, pp. 45818–45830, 2019. doi: 10.1109/ACCESS.2019.2909398
[43] M. Ahmed Ouameur, L. D. T. Anh, D. Massicotte, G. Jeon, and F. A. P. de Figueiredo, "Adversarial bandit approach for RIS-aided OFDM communication," EURASIP Journal on Wireless Communications and Networking, vol. 2022, no. 1, pp. 1–18, 2022. doi: 10.1186/s13638-022-02184-6
[44] M. V. C. Aragão, S. B. Mafra, and F. A. P. de Figueiredo, "Otimizando o treinamento e a topologia de um decodificador de canal baseado em redes neurais," Polar, vol. 2, p. 1. doi: 10.14209/sbrt.2022.1570823833
[45] F. Adib Yaghmaie and L. Ljung, "A crash course on reinforcement learning," arXiv preprint arXiv:2103.04910, 2021. doi: 10.48550/arXiv.2103.04910
[46] M. G. Kibria, K. Nguyen, G. P. Villardi, O. Zhao, K. Ishizu, and F. Kojima, "Big data analytics, machine learning, and artificial intelligence in next-generation wireless networks," IEEE Access, vol. 6, pp. 32328–32338, 2018. doi: 10.1109/ACCESS.2018.2837692
[47] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, "Machine learning for wireless networks with artificial intelligence: A tutorial on neural networks," arXiv preprint arXiv:1710.02913, vol. 9, 2017. doi: 10.48550/arXiv.1710.02913
[48] H. Yang, A. Alphones, Z. Xiong, D. Niyato, J. Zhao, and K. Wu, "Artificial-intelligence-enabled intelligent 6G networks," IEEE Network, vol. 34, no. 6, pp. 272–280, 2020. doi: 10.1109/MNET.011.2000195
[49] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. USA: Prentice Hall Press, 2009. ISBN 0136042597
[50] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, "Q-learning algorithms: A comprehensive classification and applications," IEEE Access, vol. 7, pp. 133653–133667, 2019. doi: 10.1109/ACCESS.2019.2941229
[51] A. N. Burnetas and M. N. Katehakis, "Optimal adaptive policies for Markov decision processes," Mathematics of Operations Research, vol. 22, no. 1, pp. 222–255, 1997. doi: 10.1287/moor.22.1.222
[52] M. Tokic and G. Palm, "Value-difference based exploration: Adaptive control between epsilon-greedy and softmax," in Annual Conference on Artificial Intelligence. Springer, 2011, pp. 335–346. doi: 10.1007/978-3-642-24455-1_33
[53] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015. doi: 10.1038/nature14236
[54] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman, "PAC model-free reinforcement learning," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 881–888. doi: 10.1145/1143844.1143955
[55] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," arXiv preprint arXiv:1708.05866, 2017. doi: 10.48550/arXiv.1708.05866
[56] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016. doi: 10.48550/arXiv.1509.06461
[57] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," arXiv preprint arXiv:1511.05952, 2015. doi: 10.48550/arXiv.1511.05952