MARL For Networks
is often a simulator mimicking the real-world system since it is expensive to directly interact with the real system. These collected experiences constitute the training dataset of the RL agent and will be used to learn the optimal decision-making rule. In Deep RL (DRL), DNNs are used to approximate the agent's optimal strategy or policy and/or its optimal utility function (see Figure 1). In this case, given the system's current state, the DNN learns to predict either a distribution over actions or the expected reward Q(s, a) of each action. Therefore, the agent chooses its next action as the one that has either the highest probability or the highest expected reward. After receiving the reward from the environment, the DNN parameters are updated accordingly. The generalization power of DNNs enables solving high-dimensional problems with continuous or combinatorial state spaces. In the context of wireless communications, DRL is advantageous compared to traditional optimization methods thanks to its real-time inference. However, the training phase of DNNs requires a considerable amount of computation power, which necessitates the use of GPUs and high-performance CPU clusters. The most popular DL frameworks are TensorFlow [4] and PyTorch [5]. Once the training is complete, the agent can make decisions in real time, which is a considerable advantage compared to traditional optimization methods. To accelerate the inference of DNNs further, different libraries implement sophisticated compression techniques such as quantization to speed up the execution of DNNs on mobile or edge devices.

Fig. 1: Representation of a DRL framework.
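To make the two action-selection rules mentioned above concrete, the following minimal NumPy sketch contrasts picking the most likely action from a predicted distribution with picking the action of highest predicted Q-value. The network outputs are hard-coded illustrative numbers, not taken from any system described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network outputs for one state with 4 possible actions.
action_probs = np.array([0.1, 0.6, 0.2, 0.1])   # policy head: distribution over actions
q_values     = np.array([1.3, 0.4, 2.1, -0.5])  # value head: expected reward Q(s, a)

# Policy-based selection: pick the most likely action (or sample from the distribution).
greedy_policy_action = int(np.argmax(action_probs))
sampled_action       = int(rng.choice(len(action_probs), p=action_probs))

# Value-based selection: pick the action with the highest predicted Q-value.
greedy_value_action = int(np.argmax(q_values))

print(greedy_policy_action, sampled_action, greedy_value_action)
```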
A. Scope of this Tutorial

In this work, we emphasize the role of DRL in future 6G networks. In particular, our objective is to discuss several DRL learning frameworks to advance the current state-of-the-art and accommodate the requirements of 6G networks. First, we overview single-agent RL methods and shed light on MBRL techniques. Although MBRL has received less interest, it can show a considerable advantage compared to model-free algorithms. MBRL consists in learning a model representing the environment dynamics and utilizing the learned model to compute an optimal policy. The main advantage of having a model of the environment is the ability to plan ahead, which makes these methods more sample-efficient. In addition, MBRL is more robust to changes in the environment dynamics or rewards and has better exploratory behaviors. Recent progress in MBRL, especially for robotics, has shown that MBRL can be more efficient than MFRL. However, for most applications, it is challenging to learn an accurate model of the world. For this reason, model-free algorithms are preferred. However, in the second part of the tutorial, we argue that single-agent RL is not sufficient to model scalable and self-organizing systems often containing a considerable number of interconnected agents. This claim is justified since single-agent RL algorithms learn a decision-making rule for one entity without considering the existence of other entities that can impact its behavior. Thus, we will study extensions of both model-free and model-based single-agent approaches to multi-agent decision-making.

MARL is the generalization of single-agent RL that enables a set of agents to learn optimal policies using interactions with the environment and each other. Thus, MARL does not ignore the presence of the other agents during the learning process, which makes it a harder problem. Essentially, several challenges arise in the multi-agent case, such as i) non-stationarity of the system due to the agents simultaneously changing their behaviors; ii) a scalability issue, since the joint action space grows exponentially with the number of agents; iii) a partial observability problem, arising often in real-world applications where agents have access only to partial information of the system; and iv) privacy and security, which are also core challenges of the deployment of MARL systems in real-world scenarios. More details on these issues will be presented in Section V-A.

MARL is generally formulated as a Markov Game (MG), also called a Stochastic Game (SG). MGs generalize Markov Decision Processes (MDPs), used to model single-agent RL problems, and repeated games from the game theory literature. In repeated games, the same players repeatedly play a given game called the stage game. Thus, repeated games consider a stateless static environment, and the agents' utilities are only impacted by the interactions between agents. This is a crucial limitation of normal-form game theory frameworks to model multi-agent problems. MGs remedy this shortcoming by considering a dynamic environment impacting the agents' rewards. MGs can be classified into three families: fully cooperative, fully competitive, or mixed. Fully cooperative scenarios assume that the agents have the same utility or reward function, whereas fully competitive settings involve agents with opposite goals, often known as zero-sum games. The mixed setting covers the general case where no restriction on the rewards is considered. This is also referred to as general-sum games. In this paper, we focus on fully cooperative MARL and consider MGs as the mathematical formalism to model such problems. However, MGs handle only problems with full observability or perfect information. Other extensions to model partial observability will be discussed as well.

Because fully cooperative agents share the same reward function, they are obliged to choose optimal joint actions. In this context, coordination between agents is crucial in selecting optimal joint strategies. To illustrate the importance of coordination, we consider the example from [6]. Let us examine a scenario with two agents at a given state of the environment where they can choose between two actions a1
of cooperation or coordination in the network since the agent considers all the other agents as a part of the environment. For this reason, we highlight the importance of MARL, particularly cooperative MARL, in the development of scalable and decentralized systems for 6G networks. In this context, [17] showcases the potential applications of MARL to build decentralized and scalable solutions for vehicle-to-everything problems. In addition, the authors in [18] provide an overview of the evolution of cooperative MARL with an emphasis on distributed optimization. Our work does not only consider cooperative MARL but also MBRL as enabling techniques for future 6G networks, and we focus on delivering a more applied perspective of MARL to solve wireless communication problems. Table II summarizes the existing surveys on DRL and 6G and highlights the key differences compared to our work.

TABLE II: Summary of existing surveys on DRL and MARL for 5G and beyond wireless networks

TABLE I: Summary of notations and symbols

S, A, O        State, action, and observation spaces
A, O           Joint action and observation spaces
R, P           Reward and transition functions, respectively
H              Episode horizon or length of a trajectory
D              Replay buffer
γ              Discount factor
π∗             Agent's optimal policy
b(s)           Belief state of a state s ∈ S
π_θ            Parametrized policy with parameters θ
π              Joint policy of multiple agents
Q^π, V^π       Q-/V-function under the policy π
Q_φ            Parameterized Q-function with parameters φ
Q̂, V̂, π̂        Approximate Q-/V-function and policy
Q̄              Target Q-network
J              Infinite-horizon discounted return
Adv            Advantage function
C. Contributions and Organization of this Paper
The main contributions of this paper can be summarized as
follows:
• We provide a comprehensive tutorial on single-agent DRL frameworks. Model-free RL (MFRL) is based on learning whereas MBRL is based on planning. To the best of our knowledge, this is the first initiative to present MBRL fundamentals and potentials in future 6G networks. Recent developments in the MBRL literature render these methods appealing for their sample efficiency (which is measured in terms of the minimum number of samples required to achieve a near-optimal policy) and their adaptation capabilities to changes in the environment;
• We present different MARL frameworks and summarize relevant MARL algorithms for 6G networks. In this work, we focus on the following: emergent communication, where agents can learn to establish communication protocols to share information; learning cooperation, which details different algorithms to learn collaborative behaviors in a decentralized manner; and networked agents, to enable cooperation between heterogeneous agents with limited shared information;
• We also review the literature on applications of MARL in several enabling technologies for 6G networks such as MEC and control of aerial (e.g. drone-based) networks, beamforming in cell-free massive Multiple-Input Multiple-Output (MIMO) communications, spectrum management in Heterogeneous Networks (HetNets) and in THz communications, and distributed deployment and control of intelligent reflecting surface (IRS)-aided wireless systems;
• We present open research directions and challenges related to the deployment of efficient, scalable, and decentralized algorithms based on RL.

The rest of the paper is organized as follows. In Section II, we introduce the mathematical background for both single-agent RL and MARL. Standard algorithms for single-agent RL are reviewed in Section III. In Section IV, we introduce MBRL and detail potential applications for 6G systems. Section V first summarizes the different challenges of MARL and afterward dwells on the cooperative MARL algorithms according to the type of cooperation they address. Section VI is dedicated to recent contributions of the mentioned algorithms in several wireless communication problems, followed by a conclusion and future research directions outlined in Section VII. A summary of key notations and symbols is given in Table I.

II. BACKGROUND

The objective of this section is to present the mathematical background and preliminaries for both single-agent and multi-agent RL.

A. Single-Agent Reinforcement Learning

1) Markov Decision Process

In RL, a learning agent interacts with an environment to solve a sequential decision-making problem. Fully observable environments are modeled as MDPs defined as a tuple (S, A, P, R, γ). S and A define the state and the action spaces respectively; P : S × A × S → [0, 1] denotes the probability of transiting from a state s to a state s' after executing an action a; R : S × A × S → ℝ is the reward function that defines the agent's immediate reward for executing an action a at a state s and resulting in the transition to s'; and γ ∈ [0, 1] is a discount factor that trades off the immediate and upcoming rewards. The full observability assumption of MDPs enables the agent to access the exact state of the system s at every time step t. Given the state s, the agent decides to take an action a, transiting the system to a new state s' sampled from the probability distribution P(·|s, a). The agent is rewarded with an immediate compensation R(s, a, s'). Thus, the agent's expected return is expressed as E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 ]. This is referred to as the infinite-horizon discounted return. Another popular formulation is the undiscounted finite-horizon return E[ Σ_{t=0}^{H} R(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 ], where the return is computed over a finite horizon H. This is common in episodic tasks (i.e. tasks that have an end). Note that the finite-horizon setting can be viewed as an infinite-horizon case by
augmenting the state space with an absorbing state transiting continuously to itself with zero rewards.
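As a rough illustration of the discounted return defined above, the following NumPy sketch samples trajectories in a randomly generated toy MDP under a uniform policy and averages their discounted returns. The transition and reward tables, the horizon truncation, and all numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy MDP with 3 states and 2 actions (illustrative numbers only).
n_states, n_actions, gamma, H = 3, 2, 0.9, 20
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = distribution over s'
R = rng.normal(size=(n_states, n_actions, n_states))              # R[s, a, s'] = immediate reward

def rollout_return(policy, s0):
    """Sample one trajectory under a stochastic policy and accumulate
    the discounted return sum_t gamma^t R(s_t, a_t, s_{t+1})."""
    s, ret = s0, 0.0
    for t in range(H):  # finite-horizon truncation of the infinite sum
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        ret += (gamma ** t) * R[s, a, s_next]
        s = s_next
    return ret

uniform_policy = np.full((n_states, n_actions), 1.0 / n_actions)
returns = [rollout_return(uniform_policy, s0=0) for _ in range(1000)]
print("Monte Carlo estimate of the expected return:", np.mean(returns))
```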
The agent aims to find an optimal policy π∗, a mapping from the environment states to actions, that maximizes the expected return. A policy or a strategy describes the agent's behaviour at every time step t. A deterministic policy returns the action to be taken in each perceived state. On the other hand, a stochastic policy outputs a distribution over actions. Under a given policy π, we can define a value function or a Q-function which measures the expected accumulated rewards starting from any given state s_t or any pair (s_t, a_t) and following the policy π, as shown below:

V^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 = s ],

Q^π(s, a) = E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 = s, a_0 = a ].

Using Dynamic Programming (DP) methods such as Value Iteration or Policy Iteration [19] to solve an MDP mandates that the dynamics of the environment (P and R) are known, which is often not possible. This motivates the model-free RL approaches that find the optimal policy without knowing the world's dynamics. MBRL methods learn a model of the environment by estimating the transition function and/or the reward function and use the approximate model to learn or improve a policy. Model-free RL algorithms are discussed in detail in Section III and MBRL is further investigated in Section IV.

Example: Several wireless problems have been formulated as MDPs. As an example, [20] presented the downlink power allocation problem in a multi-cell environment as an MDP. The agent is an ensemble of K base stations (or a controller for K base stations). The state space consists of the users' channel quality and their localization with respect to a given base station. The agent selects the power levels for the K base stations to maximize the entire network throughput.

2) Partially Observable Markov Decision Process

In the previous section, it was assumed that the agent has access to the full state information. However, this assumption is violated in most real-world applications. For example, IoT devices collect information about their environments using sensors. The sensor measurements are noisy and limited, hence the agent will only have partial information about the world. Several problems such as perceptual aliasing prevent the agent from knowing the full state information using the sensors' observations. In this context, Partially Observable Markov Decision Processes (POMDPs) generalize the MDP framework to take into account the uncertainty about the state information. A POMDP is described by a 7-tuple (S, A, P, R, γ, O, Z) where the first five elements are the same as defined in §II-A1; O is the observation space and Z : S × A × O → [0, 1] denotes the probability distribution over observations given a state s ∈ S and an action a ∈ A. To solve a POMDP, we distinguish two main approaches. The first one is history-based methods, where the agent maintains an observation history H_t = {o_1, . . . , o_{t−1}} or an action-observation history H_t = {(o_0, a_0), . . . , (o_{t−1}, a_{t−1})}. The history is used to learn a policy π(·|H_t) or a Q-function Q(H_t, a_t). As an analogy with MDPs, the agent state becomes the history H_t. As a result, this method has a large state space, which can be alleviated by using a truncated version of the history (i.e. a k-order Markov model). However, limited histories also have caveats, since long histories need more computational power and short histories suffer from possible information loss. It is not straightforward how the value of k is chosen. Another way to avoid the increasing dimension of the history with time is by defining the notion of a belief state b_t(s) = p(s|H_t), ∀s ∈ S, as a distribution over states. Thus, the history H_t is indirectly used to estimate the probability of being at a state s. Therefore, a Q-function Q(b(s), a) or a policy π(·|b(s)) (see Table I) can be learned using the belief states instead of the history. If the POMDP is known, the belief states are updated using Bayes' rule. Otherwise, a Bayesian approach can be considered. Another approach to solve a POMDP is Predictive State Representation [21]. The main idea consists in predicting what will happen in the future instead of relying on past actions and observations.
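A minimal sketch of the Bayes-rule belief update mentioned above is given below, assuming a known POMDP. The tiny transition and observation tables are made-up illustrative values.

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """One Bayes-rule belief update for a known POMDP.

    b: current belief over states, shape (S,)
    a: action index taken
    o: observation index received
    P: transition probabilities, P[s, a, s']
    Z: observation probabilities, Z[s', a, o]
    Returns the posterior belief b'(s') proportional to
    Z(s', a, o) * sum_s P(s, a, s') * b(s).
    """
    predicted = b @ P[:, a, :]          # sum_s b(s) P(s, a, s')
    unnormalized = Z[:, a, o] * predicted
    return unnormalized / unnormalized.sum()

# Tiny illustrative POMDP (2 states, 1 action, 2 observations).
P = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])   # P[s, a, s']
Z = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])   # Z[s', a, o]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, P=P, Z=Z))
```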
Example: The POMDP formulation has been applied to solve different wireless problems characterized by partial access to the environment state. For example, [22] proposed a POMDP representation of dynamic task offloading in IoT fog systems. The authors assume that the IoT devices have imperfect Channel State Information (CSI). In this scenario, the agent is the IoT device, and based on the estimated CSI and the queue state, it decides which tasks are to be executed locally or offloaded. POMDPs are widely used in wireless sensor networks.

B. Multi-Agent Reinforcement Learning

MARL tackles sequential decision-making problems involving a set of agents. Hence, the system dynamics is influenced by the joint action of all the agents. More intuitively, the reward received by an agent is no longer a function of its own actions but a function of all the agents' actions. Therefore, to maximize the long-term reward, an agent should take into consideration the policies of the other agents. In what follows, we will present the mathematical background for MARL. Please refer to Section VI for examples on how to formulate wireless communication problems using the mathematical frameworks discussed below.

1) Markov/Stochastic Games

MGs or SGs [23] extend the MDP formalism to the multi-agent setting to take into account the relation between agents [24]. Let N > 1 be the number of agents, S be the state space, and A_i denote the action space of agent i. The joint action space of all agents is given by A := A_1 × · · · × A_N. From now on, we will use bold and underlined characters to differentiate between joint and individual functions.

At a state s, each agent i selects an action a_i and the joint action a = [a_i]_{i∈N} will be executed in the environment. The transition from the state s to the new state s' is governed by the transition probability function P : S × A × S → [0, 1]. Each agent i will receive an immediate reward r_i defined by the reward function R_i : S × A × S → ℝ. Therefore, the MG is
formally defined by the tuple (N, S, (A_i)_{i∈N}, P, (R_i)_{i∈N}, γ), where γ is a discount factor. Note that the transition and reward functions in an MG depend on the joint action space A. Each agent i seeks to find the optimal policy π_i∗ : S → A_i that maximizes its long-term return. The joint policy π of all agents is defined as π(a|s) = Π_{i∈N} π_i(a_i|s). Hence, the value function of an agent i is defined as follows:

V_i^π(s) = E_π[ Σ_{t=0}^{∞} γ^t R_i(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 = s ].

Consequently, the optimal policy of agent i is a function of its opponents' policies π_{−i}.

The complexity of MARL systems arises from this property because the other agents' policies are non-stationary and change during learning. See Section V-A for a detailed discussion of MARL challenges. As mentioned before, we distinguish three solution concepts for MGs: fully cooperative, fully competitive, and mixed. In fully cooperative settings, all the agents have the same reward function R_i = R and hence the same value or state-action function. Fully cooperative MGs are also referred to as Multi-agent MDPs (MMDPs). This simplifies the problem since standard single-agent RL algorithms can be applied if all the agents are coordinated by a central unit. On the other hand, fully competitive MGs (Σ_i R_i = 0) and general-sum MGs (Σ_i R_i ∈ ℝ) are addressed by searching for a Nash Equilibrium (NE). We will focus in the subsequent sections on extensions of MGs for cooperative problems.

Example: MGs are the most straightforward generalization of single-agent wireless problems to the multi-agent scenario. As an example, the problem of field coverage by a team of UAVs is modeled as an MG in [25].

In fact, each agent needs full information about the other agents' actions to maximize its long-term rewards. Consequently, the uncertainty about the other opponents, in addition to the state uncertainty, calls for an extension of the MG framework to model cooperative agents under partially observable environments. In this context, the Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [26] is the adopted mathematical framework to study cooperative sequential decision-making problems under uncertainty. This is a direct generalization of POMDPs to the multi-agent setting. A Dec-POMDP is described as (N, S, (A_i)_{i∈N}, P, R, γ, (O_i)_{i∈N}, Z), where the first six elements are the same as defined in §II-B1; R is a global reward function shared by all the agents; O_i is the observation space of agent i, with O := O_1 × · · · × O_N the joint observation space; and Z : S × A × O → [0, 1] is the observation function, which provides the probability P(o|a, s') of the agents observing o = [o_1, . . . , o_N] after executing a joint action a ∈ A and transiting to a new state s' ∈ S. A Dec-POMDP is a specific case of the Partially Observable Stochastic Game (POSG) [27], defined as a tuple (N, S, (A_i)_{i∈N}, P, (R_i)_{i∈N}, γ, (O_i)_{i∈N}, Z) where all the elements are the same as in a Dec-POMDP except the reward function R_i, which becomes individual for each agent. POSGs enable the modeling of self-interested agents, whereas Dec-POMDPs exclusively model cooperative agents in partially observable environments. At a state s, each agent i receives its own observation o_i without knowing the other agents' observations. Thus, each agent i chooses an action a_i, yielding a joint action to be executed in the environment. Based on a common immediate reward, each agent strives to find a local policy π_i : O_i → A_i that maximizes the team long-term reward. Thus, the joint policy is given by π = [π_1, . . . , π_N]. The policy π_i is called local because each agent acts according to its own local observations without communicating or sharing information with the other agents.

Example: Multi-agent task offloading [28] and multi-agent cooperative edge caching [29] are wireless problems which can be modeled as Dec-POMDP problems.
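To make the joint-action dynamics of the cooperative formalisms above concrete, here is a minimal sketch of a two-agent, fully cooperative Markov game with a shared team reward. The environment class, its random transition/reward tables, and the collapse to a fully observed shared state are illustrative simplifications, not a faithful Dec-POMDP implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

class TwoAgentMarkovGame:
    """Toy fully cooperative Markov game: 2 agents, 2 states, 2 actions each.
    Both agents receive the same reward (MMDP-style shared objective)."""

    def __init__(self):
        self.n_states = 2
        self.state = 0
        # Joint-action indexed transition and reward tables: index = (s, a1, a2).
        self.P = rng.dirichlet(np.ones(self.n_states), size=(2, 2, 2))
        self.R = rng.normal(size=(2, 2, 2))

    def step(self, a1, a2):
        s = self.state
        s_next = rng.choice(self.n_states, p=self.P[s, a1, a2])
        reward = self.R[s, a1, a2]          # shared team reward
        self.state = s_next
        # In a Dec-POMDP, each agent would receive only a partial observation of s_next here.
        return s_next, reward

env = TwoAgentMarkovGame()
for _ in range(3):
    print(env.step(a1=rng.integers(2), a2=rng.integers(2)))
```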
Henceforth, agents know their local and neighboring information and seek to learn an optimal joint policy by maximizing the team-average reward R̄(s, a, s') = (1/N) Σ_{i∈N} R_i(s, a, s') for any (s, a, s') ∈ S × A × S. To summarize, the advantages of Networked MGs compared to classical MGs are: (i) the possibility to model heterogeneous agents with different reward functions; (ii) the reduction of the coordination cost by considering neighbor-to-neighbor communication, which facilitates the design of decentralized MARL algorithms; and (iii) the privacy-preserving property, since agents are not mandated to share their reward functions.

Example: Networked MDPs can be applied in multiple wireless scenarios where agents are linked with a communication graph. For example, base stations in cell-free networks can collaborate to compute optimal beamforming while minimizing interference [30]. The communication graph will enable the base stations to share information with their neighbors. Thus, better collaboration is possible.
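The following short sketch illustrates the team-average objective and one round of neighbor-to-neighbor information exchange over a communication graph. The adjacency matrix, the mixing weights, and the local rewards are made-up examples, and the single averaging step only hints at the consensus-style updates used in networked MARL algorithms.

```python
import numpy as np

# Undirected communication graph over 4 base stations (illustrative adjacency;
# self-loops included so an agent keeps part of its own value).
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)      # row-stochastic mixing weights

local_rewards = np.array([1.0, 0.2, -0.5, 0.8])
team_average = local_rewards.mean()        # the quantity the networked agents try to maximize

# One neighbor-to-neighbor averaging (consensus) step: each agent only uses
# information from its neighbors, yet the estimates move toward the team average.
estimates = W @ local_rewards
print(team_average, estimates)
```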
III. SINGLE AGENT MODEL-FREE RL ALGORITHMS

A. Preliminaries

We start by defining useful notions and concepts for the understanding of the algorithms discussed below.

MFRL methods can be categorized into two classes depending on the agent's learned objective. In value-based methods, an approximate value function V̂ is learned and the agent's policy π is obtained by acting greedily with respect to V̂. Thus, state values are essential for action selection. Policy evaluation methods seek to learn an estimate of the value function V̂ = V^π for a given policy π. Alternatively, policy-based methods aim to directly learn a parameterized policy without resorting to a value function. A well-known variant of policy-based methods learns an approximation of the value function, but the action selection is still independent of the value estimates. These are the actor-critic methods, where approximations to both the policy and the value function are learned. The actor refers to the policy and the critic is the approximate value function. Henceforth, we will denote by θ the policy's parameters and by π_θ the policy parametrized by θ. In actor-critic methods, V_φ^{π_θ} denotes the approximate value function under the policy π_θ, where φ is a learnable parameter vector.

We distinguish between two main learning principles: (i) Monte Carlo (MC) and (ii) DP methods. The former methods utilize experience to approximate value functions and policies. In contrast, DP methods are known for solving the Bellman optimality equations. More details will be provided in the following sections. Temporal Difference (TD) is a famous combination of these two learning frameworks. Therefore, an important question arises when MC and TD methods are adopted: how are actions selected and samples generated for learning? This leads to the two key approaches for learning from experience, namely, off-policy and on-policy methods. Recall that the agent interacts with its environment by executing actions and afterwards improves or evaluates its policy using the collected data. Therefore, we can distinguish between two distinct processes: the policy used for data collection and the policy being improved or evaluated. The former is called the behavior policy and the latter is referred to as the target policy or control policy. In off-policy methods, the behavior policy is different from the target policy. However, in on-policy methods, the behavior and the target policies are the same, meaning that the policy used to collect the data samples is the same as the one being evaluated or improved. Thus, the notation V_φ^{π_θ} means that the value function V is learned using samples from the policy π_θ. If π_θ is the same as the policy the agent is learning, this is an on-policy algorithm. The advantage of an off-policy setting is the possibility to use a more exploratory behavior policy to continue to visit all the possible actions. This is why off-policy methods encourage exploration. The exploration-exploitation trade-off is a well-known challenge in RL: the agent can exploit the knowledge from its past experiences to choose actions with the highest expected rewards, but it also needs to explore other actions to improve its action selection.
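A classic example of such an exploratory behavior policy is ε-greedy action selection, sketched below with made-up Q-value estimates.

```python
import numpy as np

rng = np.random.default_rng(3)

def epsilon_greedy(q_values, epsilon=0.1):
    """Exploratory behavior policy: with probability epsilon pick a random action,
    otherwise exploit the action with the highest estimated value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

q_estimates = np.array([0.1, 0.5, 0.3])
actions = [epsilon_greedy(q_estimates, epsilon=0.2) for _ in range(10)]
print(actions)
```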
Another key distinction between RL learning frameworks is the use of bootstrapping. The general idea of bootstrapping is that estimated values of states are updated based on estimates of the values of the next states. DP and TD methods use bootstrapping, whereas MC algorithms rely on actual complete episodic returns.

Furthermore, another dimension to consider while designing an RL algorithm is how to represent the approximate value function. In the tabular setting, a table of state values is maintained and updated for every visited state s. Function approximators have enabled the recent revolution in RL thanks to their generalization power with high-dimensional state data. For example, DNNs are famous non-linear function approximators used to compute value functions or policies.

As mentioned above, the source of the training data is crucial for learning. On one hand, batch/offline RL considers that the agent is provided with a dataset of interactions and learns a policy using the given dataset without interacting with the environment. On the other hand, the agent can collect data by querying the real environment or a simulator. This is referred to as online RL.

Figure 4 summarizes the different categorizations of RL methods. In this tutorial, we will focus on both online policy-based and value-based methods with DNNs as function approximators. Table IV provides a comparative overview of the algorithms discussed below. This review of MFRL methods is not exhaustive since several resources with in-depth descriptions are already available (i.e. in [19], [9]).

Fig. 4: Categorization of different RL settings. The classes colored in blue are covered in Section II.

B. Policy-Based Algorithms

Policy-based methods directly search for the optimal policy by maximizing the agent's expected long-term reward J as in (1). The policy is parameterized by a function approximator π_θ(a|s), typically a DNN with learnable weights θ. The Policy Gradient (PG) methods, introduced in [31], learn the optimal parameters θ∗ by performing gradient ascent on the objective J. Using the PG theorem [31], the policy gradients are expressed as in (2) and estimated using samples or trajectories collected under the current policy. This is why PG methods
are on-policy methods. For each gradient update, the agent needs to interact with the environment and collect trajectories. Samples collected at iteration k cannot be reused for the next policy update. This sample inefficiency represents one of the major drawbacks of PG methods.

J(θ) = E_{π_θ}[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ].    (1)

∇_θ J(θ) = E_{π_θ}[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) Q^{π_θ}(s_t, a_t) ].    (2)

In (2), Q^{π_θ} is not known and needs to be estimated. Several approaches are possible. The well-known REINFORCE algorithm [32] uses the rewards-to-go defined as Σ_{k=t}^{T} R(s_k, a_k). The major caveat of the REINFORCE algorithm is that it is well-defined for episodic problems only, since the rewards-to-go are computed at the end of an episode. Furthermore, the REINFORCE algorithm suffers from high variance. In (2), action likelihoods are multiplied by their expected return; thus the PG algorithm shifts the action distribution such that good actions are more likely than bad ones. Consequently, small variations in the returns can lead to a completely different policy. This motivates actor-critic methods, where an approximation of Q^{π_θ} is learned. Note that it is also possible to estimate the value function V^{π_θ} or the advantage function Adv^{π_θ} = Q^{π_θ} − V^{π_θ}. Learning a critic reduces the variance of gradient estimates since different samples are used, whereas in the rewards-to-go only one sample trajectory is considered. However, actor-critic methods introduce bias since the Q^{π_θ} estimate can be biased as well. In this context, [33] proposed Generalized Advantage Estimation based on the idea of n-step returns to reduce the bias. In what follows, we will examine the most common policy gradient algorithms.
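Before turning to specific algorithms, the following self-contained sketch shows a REINFORCE-style estimate of (2) for a softmax policy over three actions in a stateless toy problem. The reward means, step sizes, and episode counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Softmax policy over 3 actions for a single (featureless) state: pi(a) ∝ exp(theta[a]).
theta = np.zeros(3)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # d/dtheta log pi(a) = one_hot(a) - pi for a softmax policy.
    pi = softmax(theta)
    g = -pi
    g[a] += 1.0
    return g

# Hypothetical reward distribution: action 2 is best on average.
true_means = np.array([0.0, 0.5, 1.0])

alpha, episodes, T = 0.05, 2000, 5
for _ in range(episodes):
    pi = softmax(theta)
    actions = rng.choice(3, size=T, p=pi)
    rewards = rng.normal(true_means[actions], 1.0)
    # REINFORCE: sum_t grad log pi(a_t) * reward-to-go from step t.
    grad = np.zeros_like(theta)
    for t in range(T):
        grad += grad_log_pi(theta, actions[t]) * rewards[t:].sum()
    theta += alpha * grad / T
print(softmax(theta))   # probability mass should concentrate on action 2
```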
Asynchronous Advantage Actor-Critic (A3C) [34] proposes a parallel implementation of the actor-critic algorithm. In the original version of A3C, a global NN outputs the action probabilities and an estimate of the agent's advantage function. Thus, the actor and the critic share the network layers. Several workers are instantiated with local copies of the global network parameters and the environment. These workers are created as CPU threads in the same machine. In parallel, each worker interacts with its local environment and collects experiences to estimate the gradients with respect to the network parameters. Afterward, the worker propagates its gradients and updates the parameters of the global network. Therefore, the global model is constantly updated by the workers. This learning scheme enables the collection of more diverse experiences since each worker interacts with its local copy of the environment independently. The drawback of the asynchronous training scheme is that some workers will be using old versions of the global network. In this context, Advantage Actor-Critic (A2C) adopts a synchronous and deterministic implementation where all the workers' gradients are aggregated and averaged to update the global network.

As mentioned before, PG algorithms suffer from sample inefficiency since only one gradient update is performed per batch of collected data. This motivates the goal to use the data more efficiently. Besides, it is hard to pick the learning rate since it affects the training performance and can dramatically alter the visitation distribution. Intuitively, a high learning rate can result in a bad policy update, which means that the next batch of data is collected using a bad policy. Recovering from a bad policy update is not guaranteed. This motivates Trust Region Policy Optimization (TRPO) [35], where the original optimization problem in (1) is solved under the constraint of ensuring that the new updated policy is close to the old one. To do so, the constraint is defined in terms of the Kullback–Leibler divergence (D_KL), which measures the difference between two probability distributions. More formally, let θ_k be the policy parameters at iteration k. We would like to find the new parameters θ_{k+1} such that

θ_{k+1} = arg max_θ L(θ) = arg max_θ E_{(s,a)∼π_{θ_k}}[ (π_θ(a|s) / π_{θ_k}(a|s)) Adv^{π_{θ_k}}(s, a) ]
s.t. D_KL(θ || θ_k) ≤ δ,
where δ is the trust region radius. Let F be the Fisher information matrix. With a first-order approximation of the objective (L(θ) ≈ ∇_θ L(θ)^T (θ − θ_k)) and a second-order Taylor expansion of the constraint (D_KL(θ||θ_k) ≈ (1/2) (θ − θ_k)^T F (θ − θ_k)), the update rule is given by

θ_{k+1} = θ_k + sqrt( 2δ / (∇_θ L(θ)^T F^{-1} ∇_θ L(θ)) ) F^{-1} ∇_θ L(θ).

The term F^{-1} ∇_θ L(θ) is called the natural gradient. Consequently, evaluating the natural gradients necessitates inverting the matrix F, which is expensive. To overcome this issue, TRPO implements the conjugate gradient algorithm to solve the system F x = ∇_θ L(θ), which involves evaluating F x instead. Finally, the matrix-vector product F x is computed as ∇_θ(∇_θ D_KL(θ||θ_k)^T x), which is easy to evaluate using any auto-differentiation library like TensorFlow. In a similar vein, the Proximal Policy Optimization (PPO) [36] algorithm solves the same optimization problem as TRPO but proposes a simpler implementation by introducing a new loss function:

L_PPO(θ) = min( (π_θ(a|s)/π_{θ_k}(a|s)) Adv^{π_{θ_k}}(s, a), clip(π_θ(a|s)/π_{θ_k}(a|s), 1 − δ, 1 + δ) Adv^{π_{θ_k}}(s, a) ),

where "clip" is a function used to keep the value of the ratio π_θ(a|s)/π_{θ_k}(a|s) between 1 − δ and 1 + δ to penalize the new policy if it gets far from the old policy.
Example: The advantage of policy-based algorithms is that they are applicable to both discrete and continuous action spaces. For example, [37] applies the TRPO algorithm to find optimal routing strategies in a network.

C. Value-Based Algorithms

As explained above, value-based algorithms focus on estimating the agent's value function. Thus, the policy is computed implicitly or greedily with respect to the approximate value function.

The MC method approximates the value of a state s by averaging the rewards obtained after visiting the state s until the end of an episode. Consequently, MC methods are defined only for episodic tasks. In DP, the optimal value function V∗ and state-action function Q∗ are computed by solving the Bellman optimality equations (3)-(4), and thus the optimal policy is obtained greedily with respect to the Q-values (5):

V∗(s) = max_a ( R(s, a) + γ E_{s'}[ V∗(s') ] ),    (3)
Q∗(s, a) = R(s, a) + γ E_{s'}[ max_{a'} Q∗(s', a') ],    (4)
π∗(s) = arg max_a Q∗(s, a).    (5)

Let B : ℝ^{|S×A|} → ℝ^{|S×A|} denote the Bellman optimality operator such that [BQ](s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q(s', a'). Therefore, equation (4) can be written in a more compact way as Q∗ = BQ∗. As a result, Q∗ is called the fixed point of the Bellman optimality operator, and the methods solving for this fixed point can be called fixed-point methods. The value iteration algorithm approximates Q∗ by iteratively applying the Bellman optimality operator: Q̂_k = B Q̂_{k−1}. This algorithm is guaranteed to converge to Q∗ since B is a contraction and Q∗ always exists and is unique. Besides, value iteration relies on bootstrapping to estimate the value of next states. However, to evaluate the Bellman operator, the transition function is needed. This is the major drawback of DP methods, which assume the environment dynamics are known. To overcome this issue, TD methods combine the main ideas of MC and DP. They use experience as in MC and bootstrapping as in DP. The update rule of the TD algorithm is as follows:

Q̂(s, a) = (1 − α) Q̂(s, a) + α ( R(s, a) + γ max_{a'} Q̂(s', a') ),    (6)

where V̂ and Q̂ are the approximate value and state-action functions and α is a learning rate.

TD methods can be on-policy or off-policy. Let π̂ be the policy derived from Q̂ (i.e. ε-greedy). For on-policy TD, the samples used to estimate Q̂ are generated using the current policy π̂, continuously updated greedily with respect to Q̂. SARSA is a well-known on-policy TD algorithm where the agent collects experiences in the form {(s, a, r, s', a')}. Since the action in the next state is known, the max operator in the RHS of the TD update (6) is removed.
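The contrast between the Q-learning and SARSA targets in (6) fits in a few lines of tabular code; the sketch below uses arbitrary state/action indices and rewards purely for illustration.

```python
import numpy as np

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, a_next=None):
    """Tabular TD update of equation (6).
    Q-learning uses the max over next actions; SARSA replaces it with the
    action actually taken in the next state (a_next)."""
    if a_next is None:                      # Q-learning target
        target = r + gamma * Q[s_next].max()
    else:                                   # SARSA target
        target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

Q = np.zeros((3, 2))                        # 3 states, 2 actions
Q = td_update(Q, s=0, a=1, r=1.0, s_next=2)            # off-policy (Q-learning) step
Q = td_update(Q, s=2, a=0, r=0.5, s_next=1, a_next=1)  # on-policy (SARSA) step
print(Q)
```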
The Q-learning algorithm [38] revolutionized the RL world by allowing the development of an off-policy TD algorithm. Any policy π̃ ≠ π̂ can be used to generate experiences. When Q̂ is represented with a function approximator with parameters φ, the Q-learning algorithm minimizes the Bellman error (7) and updates the Q-function parameters as in (9). The Bellman error is not a contraction anymore, thus the convergence guarantees discussed earlier are no longer valid. Equation (8) defines the targets. Note that the update rule in (9) does not consider the full gradient of the Bellman error since it ignores the gradients of the targets with respect to the parameters φ. This is why this learning algorithm is also called a semi-gradient method [19].

φ∗ = arg min_φ (1/2) Σ_{(s,a,r,s')} || Q̂_φ(s, a) − y ||².    (7)
y = R(s, a) + γ max_{a'} Q̂_φ(s', a').    (8)
φ = φ − α Σ_{(s,a,r,s')} ∇_φ Q̂_φ(s, a) ( Q̂_φ(s, a) − y ).    (9)

Deep versions of the Q-learning methods such as Deep Q-Networks (DQN) [39] have been developed. In particular, the Q-function is parameterized using a DNN with weights φ. DQN and its variants are the most popular online Q-learning algorithms and have shown impressive results in several communication applications. To stabilize learning with DNNs, DQN introduces techniques such as an experience replay buffer D = {(s, a, r, s')}, to avoid correlated samples and get a better gradient estimation, and a target network Q̄, whose parameters φ' are periodically updated with the most recent φ, making the targets (8) stationary so that they do not depend on the learned parameters. The targets are then computed as follows:

y_DQN = R(s, a) + γ max_{a'} Q̄_{φ'}(s', a').    (10)
y_DDQN = R(s, a) + γ Q̂_2(s', arg max_{a'} Q̂_1(s', a')).    (11)
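As a rough illustration of how the targets in (10) and (11) differ, the sketch below uses hard-coded arrays standing in for the outputs of an online network and a target network (mapping the two networks of (11) to the online/target pair is a simplifying assumption of this example, not a prescription from [39]).

```python
import numpy as np

gamma = 0.99

# Hypothetical Q-value predictions for a batch of 2 next states and 3 actions.
q_target_net = np.array([[1.0, 2.0, 0.5],    # Q-bar_{phi'}(s', .)
                         [0.2, 0.1, 0.9]])
q_online_net = np.array([[0.8, 2.5, 0.4],    # Q-hat_{phi}(s', .)
                         [0.3, 0.0, 1.1]])
rewards = np.array([1.0, -0.2])

# DQN target (10): bootstrap with the max of the target network.
y_dqn = rewards + gamma * q_target_net.max(axis=1)

# Double-DQN-style target (11): one network selects the action,
# the other evaluates it, which reduces the maximization bias.
best_actions = q_online_net.argmax(axis=1)
y_ddqn = rewards + gamma * q_target_net[np.arange(2), best_actions]

print(y_dqn, y_ddqn)
```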
Although Q-learning algorithms are sample efficient, they lack convergence guarantees for non-linear function approximators and also suffer from the maximization bias, which results in an unstable learning process. The max operator in equation (8) makes the algorithm overestimate the true Q-values and is also problematic for continuous action spaces. To overcome the maximization bias, the double learning technique uses two networks Q̂_1 and Q̂_2 with different parameters φ_1 and φ_2 and decouples the action selection from the action evaluation, as shown in (11). Other variants of DQN have been suggested to guarantee a better convergence. Prioritized experience replay is proposed in [40] to ensure that the sampling of rare and task-related experiences is more frequent than that of redundant ones. The Dueling Network [41] suggests separating the Q-function estimator into two separate networks: a V̂ network to estimate the state values and an Âdv network to approximate the state-dependent action advantages.

Example: DQN and its variants have become the go-to RL algorithms to solve wireless problems with discrete action spaces. If the action space is continuous, researchers typically propose a careful discretization scheme to fit into the Q-learning framework. As an example, [42] solves the dynamic multichannel access problem in wireless networks using DQN. Table II lists different references that used DQN in multi-agent scenarios.

D. Deterministic Policy Gradient (DPG) Algorithms

The max operator in (8) limits the Q-learning algorithms to discrete action spaces (see Table IV). In fact, when the action space is discrete, it is tractable to compute the maximum of the Q-values. However, when the action space becomes continuous, finding the maximum involves a costly optimization problem. In this context, DPG algorithms [43] can be considered as an extension of Q-learning to continuous action spaces by replacing max_{a'} Q_φ(s', a') in (8) by Q_φ(s', µ_θ(s')), where µ_θ(s') = arg max_{a'} Q_φ(s', a'). Thus, DPG algorithms concurrently learn a Q-function Q_φ and a policy µ_θ. The Q-function is learned by minimizing the Bellman error as explained in the previous section. As for the policy, the objective is to learn a deterministic policy that outputs the action corresponding to max_a Q_φ(s, a). The policy is called deterministic because it gives the exact action to take at each state s. Hence, the learning process consists in performing gradient ascent with respect to θ to solve the objective in (12):

J(θ) = E_{s∼D}[ Q_φ(s, µ_θ(s)) ].    (12)
∇_θ J(θ) = E_{s∼D}[ ∇_θ µ_θ(s) ∇_a Q_φ(s, a)|_{a=µ_θ(s)} ].    (13)

Note that in the gradient formula above, the term ∇_a Q_φ(s, a) requires a continuous action space. Therefore, one drawback of the DPG methods is that they cannot be applied to discrete action spaces. Observe as well that in (12), the state is sampled from a replay buffer D, which means that DPG algorithms are off-policy.
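The actor update in (13) can be illustrated with a one-dimensional toy problem in which the critic has a known quadratic form, so ∇_a Q is available in closed form. The critic shape, the linear policy class, and all hyperparameters are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative critic with a known form so that grad_a Q is available analytically:
# Q(s, a) = -(a - 2*s)^2, i.e. the best action for state s is a* = 2*s.
def grad_a_Q(s, a):
    return -2.0 * (a - 2.0 * s)

# Deterministic linear policy mu_theta(s) = theta * s, so grad_theta mu = s.
theta = 0.0
alpha = 0.05
replay_states = rng.uniform(-1.0, 1.0, size=1000)   # states sampled from a replay buffer

for _ in range(200):
    batch = rng.choice(replay_states, size=32)
    actions = theta * batch
    # DPG update (13): grad_theta J = E[ grad_theta mu(s) * grad_a Q(s, a)|_{a=mu(s)} ]
    grad_theta = np.mean(batch * grad_a_Q(batch, actions))
    theta += alpha * grad_theta

print(theta)   # should approach 2.0, the optimal linear policy coefficient
```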
Deep Deterministic Policy Gradient (DDPG) [44] has been widely used to solve wireless communication problems. It is an extension of the DPG method where the policy µ_θ and the critic Q_φ are both DNNs. Recently, two variants of the DDPG algorithm have been proposed: Twin Delayed DDPG (TD3) [45] addresses the overestimation problem discussed in §III-C, whereas in Soft Actor-Critic (SAC) [46] the agent maximizes not only its expected return but also the policy entropy to improve robustness and stability.

Example: DDPG has been widely used in continuous wireless problems. In [47], DDPG is applied in energy harvesting wireless communications. Similarly, Table II lists several references using DDPG agents to solve multi-agent wireless problems such as computation offloading, edge caching, UAV management, etc.

E. Theoretical Analysis and Challenges of RL

After reviewing the key MFRL algorithms, we will discuss selected theoretical problems in both policy-based and value-based methods. First, we will study the instability problems in value-based methods with function approximators. Stability means that the error Q̂ − Q∗ gets smaller when the number of iterations increases. As mentioned in the previous section, the stability of the tabular Q-learning algorithm is guaranteed thanks to the contraction property of the Bellman operator. However, the contraction property is not satisfied when the Bellman error is minimized. Next, we focus on the convergence rate and sample complexity of policy-based methods. Sample complexity is defined as the minimal number of samples or transitions needed to estimate Q∗ and achieve a near-optimal policy, and the convergence rate determines how fast the learned policy converges to the optimal solution. Finally, we examine the interpretability issue of DRL methods, which is a crucial challenge towards the application of DRL in real-world problems.

a) Instability of off-policy TD learning with function approximators

In tabular RL, value-based RL methods are guaranteed to converge to the optimal value function, which is the fixed point of the Bellman optimality equation. This fundamental result is justified by the contraction property of the Bellman operator. Hence, successive applications of the Bellman operator converge to the unique fixed point. However, these methods combined with function approximators suffer from instability and divergence. This is commonly referred to as the deadly triad: function approximation, bootstrapping, and off-policy training [48]. These three elements are important to consider in any RL method. Function approximation is crucial to handle large state spaces. Bootstrapping is known to learn faster than MC methods [19], thus this data efficiency is an advantage of using bootstrapping. Finally, off-policy learning enables exploration since the behavior policy is often more exploratory than the target policy. Therefore, trading off one of these techniques means losing either generalization power, data efficiency, or exploration. This is why several research efforts are dedicated to finding stable and convergent algorithms for off-policy learning with (nonlinear) function approximators and bootstrapping.
Fixed-point methods rely on reducing the Bellman error to learn an approximation of the optimal state-value or state-action functions. As a reminder, here is the expression of the objective function governed by the Bellman error:

L(Q̂) = E_{s,a}[ ( R(s, a) + γ E_{s'}[ max_{a'} Q̂(s', a') ] − Q̂(s, a) )² ].

To compute an unbiased estimate of the loss, two samples of the next state s' are needed because of the inner expectation. This is a well-known problem called the double sampling issue [49]. The implication of this issue is that a state s must be visited twice to collect two independent samples of s', which is impractical. To overcome this problem, different approaches have been adopted. The first approach reformulates the Bellman error as a saddle-point optimization problem where a new concave function ν : S × A → ℝ is introduced such that the loss becomes L(Q̂, ν) = 2 E_{s,a,s'}[ ν(s, a) ( Q̂(s, a) − R(s, a) − γ max_{a'} Q̂(s', a') ) ] − E_{s,a,s'}[ ν(s, a)² ] [50]. This is called a saddle-point problem since the loss is minimized with respect to the Q-function parameters and maximized with respect to the ν parameters. In this context, a recent work [51] proposed a convergent algorithm with nonlinear function approximators (e.g. NNs) and off-policy data.

Another approach to tackle this issue is to replace the Q-function in the inner expectation term with another target function. The choice of the target function can be an old version of the Q-function, as in the famous DQN algorithm, or the minimum of two target Q-functions, as in TD3 and SAC. This gives a theoretical explanation of the role of target networks in stabilizing Q-learning with DNNs.

b) Convergence of PG with neural function approximators

In this part, we will summarize the recent convergence results of PG algorithms with NNs as function approximators. Due to the non-convexity of the expected cumulative rewards in (1) with respect to the policy and its parameters, the analysis of the global optimality of stationary points is a hard problem. Besides, the policy gradients in (2) are obtained using sampling and, in practice, an approximate Q-function is learned to estimate the expected return. In finite-horizon tasks, estimating the Q-function using MC rollouts results in an unbiased approximation, but in a biased Q-function for the discounted infinite-horizon setting.

To tackle this bias issue, [31] introduces the compatible function approximation theorem requiring that the approximate Q-function Q_φ satisfy two conditions: (i) a compatibility condition with the policy π_θ given by ∇_φ Q_φ(s, a) = ∇_θ log π_θ(a|s), and (ii) Q_φ is learned to minimize E_{π_θ}[ (1/2)(Q^{π_θ}(s, a) − Q_φ(s, a))² ]. If these two conditions are verified, the estimates of the policy gradients are unbiased. Another approach, presented in [52], is called random-horizon PG and proposes to use rollouts with random geometric time horizons to unbiasedly estimate the Q-function in the infinite-horizon setting.

Armed with the advances in non-convex optimization, several research efforts ([53], [54], [55], [52]) propose variants of the REINFORCE algorithm with a rate of convergence to first- or second-order stationary points. [56] studies the non-asymptotic global convergence rate of PPO and TRPO algorithms parametrized with neural networks. These methods converge to the optimal policies at a rate of O(√(1/T)), where T is the number of iterations. In addition, the work in [57] establishes the global optimality and convergence rate of (natural) actor-critic methods where both the actor and the critic are represented by a NN. A rate of O(√(1/T)) is also proved, and the authors emphasize the importance of the "compatibility" condition to achieve convergence and global optimality. In [58], better bounds on the sample complexity of (natural) actor-critic are provided. In fact, the authors demonstrate that the overall sample complexity for the mini-batch actor-critic algorithm to reach an ε-accurate stationary point is of O(ε^{-2} log(1/ε)) and that the natural actor-critic method requires O(ε^{-2} log(1/ε)) samples to attain an ε-accurate globally optimal point. In the same vein, the works in [59] and [60] establish the convergence rates and sample complexity bounds for the two-time-scale scheme of (natural) actor-critic methods where the actor and the critic are updated simultaneously with different learning rates. Furthermore, [61] studies the convergence of PG in the context of constrained RL where the objective and the constraints are both non-convex
Another approach to planning relies on the learned model to select actions at the current state s. This is called decision-time planning [19]. In this category, neither a policy nor a value function is required to act in the environment. In fact, action selection is formulated as an optimization problem (in (14)) where the agent chooses a sequence of actions, or a plan, that maximizes the expected rewards over a trajectory of length H:

a_1, . . . , a_H = arg max_{a_1,...,a_H} E[ Σ_{t=1}^{H} R(s_t, a_t) | a_1, . . . , a_H ].    (14)

There are two approaches to solve this planning problem according to the size of the action space. For discrete action spaces, decision-time planning encompasses heuristic search, MC rollouts, and Monte Carlo Tree Search (MCTS). Heuristic search consists in considering a tree of possible continuations from a state s, where the values of the next actions are estimated recursively by backing up the values from the leaf nodes. Then, the action with the highest value is selected. MC rollout planning uses the approximated model to generate simulated trajectories starting from each possible next action to estimate its value (see Figure 5a.2.a). At the end of this process, the action with the highest value is chosen in the current state. The MCTS algorithm keeps track of the expected return of state-action pairs encountered during MC rollouts to direct the search toward more rewarding pairs. At the current state, MCTS expands a tree by executing an action according to the estimated values. In the next state, MCTS evaluates the value of the obtained state by simulating MC rollouts and propagates it backward to the parent nodes. The same process is repeated in the next state. Two important observations can be made regarding these approaches: (i) the estimated state-action values are completely discarded after the action is selected (none of these algorithms stores Q-functions), and (ii) all these approaches are gradient-free.

The continuous action setting is more involved because it is complicated to perform a tree search. Alternatively, trajectory optimization methods consider one possible action sequence a = {a_1, . . . , a_H} sampled from a fixed distribution. This sequence is then executed and the trajectory return J(a) = Σ_{t=0}^{H} R(s_t, a_t) is computed. Notice that the return is a function of the action sequence. If the learned model is differentiable, it is possible to compute the gradients of J(a) with respect to the actions and update the action sequence accordingly (i.e. a = a + ∇_a J(a)). This planning algorithm, known as the random shooting method, exhibits several caveats such as sensitivity to the initial action selection and poor convergence guarantees. This has motivated many variants of planning methods for continuous action spaces. For example, the Cross-Entropy Method (CEM) is a famous planning approach to escape the local optima that shooting methods suffer from. Compared to the shooting method, the main idea of CEM is to consider a normal distribution with parametrized mean and covariance. The trajectories are sampled around the mean and the average reward is computed for each sample to evaluate its fitness. Afterward, the mean and the covariance are updated using the best samples. CEM is a simple and gradient-free approach and can exhibit fast convergence. For a detailed comparison and benchmark of MBRL methods, we refer the interested reader to [69].
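To ground the CEM planner described above, the sketch below plans over a hypothetical learned model of a one-dimensional system (the dynamics, reward, and all planner hyperparameters are invented for illustration and only use a diagonal covariance).

```python
import numpy as np

rng = np.random.default_rng(6)

# Assumed learned dynamics/reward model: the state drifts by the action and
# the reward penalizes the squared distance from a goal state.
def model_rollout_return(s0, action_seq, goal=5.0):
    s, ret = s0, 0.0
    for a in action_seq:
        s = s + a                       # predicted next state
        ret += -(s - goal) ** 2         # predicted reward
    return ret

def cem_plan(s0, horizon=10, pop=64, elites=8, iters=20):
    """Cross-Entropy Method: sample action sequences from a Gaussian, keep the
    best ones, and refit the mean and (diagonal) spread on those elites."""
    mean = np.zeros(horizon)
    std = np.ones(horizon)
    for _ in range(iters):
        candidates = mean + std * rng.standard_normal((pop, horizon))
        scores = np.array([model_rollout_return(s0, c) for c in candidates])
        elite_set = candidates[np.argsort(scores)[-elites:]]
        mean, std = elite_set.mean(axis=0), elite_set.std(axis=0) + 1e-3
    return mean                          # planned action sequence; execute the first action

plan = cem_plan(s0=0.0)
print(plan[:3])
```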
In Table 5b, a comparison between MFRL and MBRL methods, inspired from [70], is provided. Although model-free methods can exhibit better asymptotic reward performance and are more computationally efficient for deployment, model-based algorithms are more data efficient, robust to changes in the environment dynamics and rewards, and support richer forms of exploration. MBRL is preferred to model-free RL in multi-task settings. In fact, the same learned dynamics can be used to perform multiple tasks without further training.

                                   MF      MB
  Asymptotic rewards               +       +/−
  Computation at deployment        +       +/−
  Data efficiency                  −       +
  Adaptation to changing rewards   −       +
  Adaptation to changing dynamics  −       +
  Exploration                      −       +

However, several practical considerations should be taken into account for MBRL. Since the model is learned via interactions with the environment, it is prone to the following problems: (1) insufficient experience, (2) function approximation error, (3) small model error propagation, (4) error exploitation by the planner, and (5) less reliability for longer model roll-outs [71]. To avoid these problems, it is recommended to continuously re-plan to avoid error accumulation. Limited data can also cause model uncertainty, which can be reduced by observing more data. Thus, it is better to estimate the uncertainty in the model predictions to know when to trust the model to generate reliable plans/actions. One approach to estimate model uncertainty is Bayesian, where methods such as Gaussian Processes [66] or Bayesian NNs (i.e. [72]) are applied. The work in [73] proposed to use ensemble methods (bootstrapping) for uncertainty estimation, where an ensemble of models is learned and the final predictions are the combination of the models' predictions.

B. Applications of MBRL

MBRL approaches have received less interest from the wireless communication community compared to their model-free counterparts. However, we argue that MBRL is important to build practical systems. For instance, since it is hard to collect data from a real wireless network, models are trained using a simulator. The most important barrier to deploying such models in real-world settings is the reality gap, where the policy learned in a simulator does not perform as well in the real world. This is a serious issue in the context of building RL-based algorithms for 6G networks. This line of work is called sim2real, which is an active area of research in robotics. In this context, the learned models are a means to bridge the gap between the simulation and the real world. Furthermore, MBRL is advantageous because a learned model for a source task can be re-used to learn a new task faster. Coupled with meta-learning techniques, MBRL is applied to generalize to new environments and changes in the agent's world. As an example, aerial networks or drone networks, a key enabler of 6G systems, can benefit from MBRL for a wide variety of applications such as hovering and maneuvering tasks [74]. Another potential application of MBRL is related to task offloading in MEC. A model can be learned to predict the load levels in edge servers. This will help the edge users to make more efficient offloading decisions, especially for delay-sensitive tasks. The main challenge in applying MBRL in multi-agent problems is the non-stationarity issue. One key
The main challenge in applying MBRL to multi-agent problems is the non-stationarity issue. One key application of MBRL in multi-agent problems is opponent modeling. It consists in learning models to represent the behaviors of other agents. In multi-agent systems, opponent modeling is useful not only to promote cooperation and coordination but also to account for the behavior of the opponents and to compensate for partial observability (see Section V-A). In addition, modeling other agents enables the decentralization of the problem because the agent can use the learned models to infer the strategies and/or rewards.

The work in [75] proposes a model-based algorithm for cooperative MARL called Cooperative Prioritized Sweeping. This paper extends the prioritized sweeping algorithm mentioned above to the multi-agent setting. The environment is modeled as a factored MMDP and represented by a Dynamic Bayesian Network in which state-action interactions are represented as a coordination graph. This assumption allows the factorization of the transition function, the reward function, as well as the Q-values. Thus, these functions can be learned in an efficient and sparse manner. One drawback of this method is that it assumes that the structure of the factored MDP is available beforehand, which is impractical in some applications with high mobility.

The authors in [76] consider two-agent competitive and cooperative settings in continuous control problems. The problem addressed in [76] is how to learn separable models for each agent while capturing the interactions between them. The approach is based on multi-step generative models. Instead of learning a model from one-step samples (s, a, s′), multi-step generative models utilize past trajectory segments T_p of length H, T_p = {(s_{t−H−1}, a_{t−H−1}), . . . , (s_t, a_t)}, to learn a distribution over the future segments T_f = {(s_{t+1}, a_{t+1}), . . . , (s_{t+H}, a_{t+H})}. Hence, an encoder learns a distribution Q(Z|T_f) over a latent variable Z conditioned on the future trajectory segment, and a decoder reconstructs T_f such that T̂_f = D(T_p, Z) (a simplified sketch of such an encoder-decoder model is given below). In the two-agent setting, the joint distribution P(T_f^x, T_f^y) is modeled, where T_f^x and T_f^y are the future segments of players x and y, respectively. The key idea is to learn two disentangled latent spaces Z_x and Z_y. To do so, the algorithm proposed in the paper uses a variational lower bound on the mutual information [77].
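The following is a minimal, single-agent PyTorch sketch of the encoder-decoder structure described above: trajectory segments are flattened into vectors and trained with a standard variational objective. It is an illustration under simplifying assumptions (layer sizes, the flattening, and the KL weight are all arbitrary choices), not the architecture or training procedure of [76].

```python
import torch
import torch.nn as nn

H, state_dim, action_dim, latent_dim = 8, 4, 2, 16
seg_dim = H * (state_dim + action_dim)  # flattened trajectory segment

class SegmentEncoder(nn.Module):
    """Q(Z | T_f): maps a future segment to a Gaussian over the latent Z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(seg_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
    def forward(self, t_f):
        h = self.net(t_f)
        return self.mu(h), self.logvar(h)

class SegmentDecoder(nn.Module):
    """D(T_p, Z): reconstructs the future segment from the past segment and Z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(seg_dim + latent_dim, 128),
                                 nn.ReLU(), nn.Linear(128, seg_dim))
    def forward(self, t_p, z):
        return self.net(torch.cat([t_p, z], dim=-1))

enc, dec = SegmentEncoder(), SegmentDecoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

t_p = torch.randn(32, seg_dim)   # batch of past segments (placeholder data)
t_f = torch.randn(32, seg_dim)   # batch of future segments (placeholder data)

mu, logvar = enc(t_f)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
recon = dec(t_p, z)
recon_loss = ((recon - t_f) ** 2).mean()
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + 1e-3 * kl    # ELBO-style objective (weight is an assumption)
opt.zero_grad(); loss.backward(); opt.step()
```

In the two-agent case of [76], two such latent variables are learned and an additional mutual-information term encourages them to remain disentangled.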
V. COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING

A. Challenges and Implementation Schemes

This section is dedicated to discussing the challenges that arise in multi-agent problems. Several research endeavours have proposed algorithms to address these issues, which led to different training schemes for cooperative agents. We start by summarizing the MARL challenges that we consider fundamental in developing systems for wireless communications.

Non-stationarity: As mentioned before, in multi-agent environments, players update their policies concurrently. As a consequence, the agents' rewards and state transitions depend not only on their own actions but also on the actions taken by their opponents. Hence, the Markov property, stating that the reward and transition functions depend only on the previous state and the agent's action, is violated, and the convergence guarantees of single-agent RL are no longer valid [78]. Due to the non-stationarity, the learning agent needs to consider the behaviour of other participants to maximize its return. One way to overcome this issue is to use a central coordinator collecting information about the agents' observations and actions. In this case, standard single-agent RL methods can be applied. Since the centralized approach is not favorable, the non-stationarity challenge needs to be considered in designing decentralized MARL algorithms.

Scalability: To overcome the non-stationarity problem, it is common to include information about the joint action space in the learning procedure. This gives rise to a scalability issue since the joint action space grows exponentially with the number of agents. Therefore, the use of DNNs as function approximators becomes more pressing, which adds complexity to the theoretical analysis of deep MARL. For systems involving multiple agents (which is usually the case in wireless networks with many users), scalability becomes crucial. Several research endeavors aimed to overcome this issue. One example is to learn a value function that is factorized with respect to the actions (see Section V-B2).

Partial Observability: In real-world scenarios, the agents seldom have access to the true state of the system. They usually receive partial observations from the environment. Partial observability coupled with non-stationarity makes MARL more challenging. In fact, as stated before, the non-stationarity issue mandates that the individual agents become aware of the other agents' policies. With only partial information available, the individual learners will struggle to overcome the non-stationarity of the system and account for the joint behavior.

Privacy and Security: Since coordination may involve information sharing between agents, privacy and security concerns arise. Private information (e.g., rewards) shared with other agents is subject to attacks and vulnerabilities. This will hinder the applicability of MARL algorithms in real-world settings. This is why fully decentralized algorithms are preferred, so that all the agents keep their information private. Enormous efforts have been made to address privacy and security issues in supervised learning. However, in MARL, this challenge is not extensively studied. Recently, the work in [79] has shown that attackers can infer information about the training environment from a policy in single-agent RL.

To promote coordination while considering the challenges discussed above, different training schemes can be adopted.
• Fully decentralized: A simple extension of single-agent RL to multi-agent scenarios is IL, where each agent optimizes its policy independently of the other participants. Thus, the non-stationarity problem is ignored and no coordination or cooperation is considered. This technique suffers from convergence problems [80]. However, it may show satisfying results in practice. In fact, recent works (see Table V) adopted IL to solve several resource allocation and control problems in wireless communication networks;
• Fully centralized: This approach assumes the existence of a centralized unit that can gather information such as actions, rewards, and observations from all the agents. This training scheme alleviates the partial observability and non-stationarity problems, but it is impracticable for large-scale and real-time systems;
• Centralized training and decentralized execution: CTDE assumes the existence of a centralized controller that collects additional information about the agents during training, but the learned policies are decentralized and executed using the agent's local information only. CTDE is considered in several MARL algorithms since it presents a simple solution to the partial observability and non-stationarity problems while allowing the decentralization of agents' policies.

B. Algorithms and Paradigms

Coordination solutions considered in this tutorial are categorized into two families: those based on communication and those based on learning. Emergent communication studies the learning of a communication protocol to exchange information between agents. Networked agents assume the existence of a communication structure and learn cooperative behaviors through information exchange between neighbors. The second class aims to learn cooperative behaviors without information sharing. The methods that we will present are not an exhaustive list of deep MARL since we concentrate on the key concepts that can be applied to 6G technologies. We refer the readers to [81], [82], and [83] for an extensive review of the deep MARL literature.

1) Emergent Communication

This is an active research field where cooperative agents are allowed to communicate, for example explicitly via sending direct messages or implicitly by maintaining a shared memory. Deep communication problems are modeled as Dec-POMDPs where agents share a communication channel in a partially observable environment and aim to maximize their joint utility. In addition to optimizing their policies, the agents learn communication protocols to collaborate better. Direct messages can be learned concurrently with the Q-function: an NN is trained to output, in addition to the Q-values, a message to communicate to the other participants in the next round (a minimal sketch is given below). This method involves exchanging information between all the agents, which is expensive. Alternatively, memory-driven algorithms propose to use a shared memory as a communication channel. All the agents access the shared memory before taking an action and then write a response. The advantage of this method is that the agent does not communicate with the rest of the agents directly, which may eventually reduce the communication cost. Besides, the agent policy depends on its private local observations and the collective memory, and not on messages from all the agents.
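As a minimal illustration of the first scheme (learning messages jointly with Q-values), the sketch below defines an agent network with a Q-value head and a message head whose output is fed into the other agents' inputs at the next round. The layer sizes and the mean-based message aggregation are illustrative assumptions rather than a specific published architecture.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, msg_dim, n_agents = 10, 4, 8, 3

class CommAgent(nn.Module):
    """Outputs Q-values for the agent's own actions and a message vector."""
    def __init__(self):
        super().__init__()
        # Input: local observation + aggregated messages received last round.
        self.body = nn.Sequential(nn.Linear(obs_dim + msg_dim, 64), nn.ReLU())
        self.q_head = nn.Linear(64, n_actions)    # action values
        self.msg_head = nn.Linear(64, msg_dim)    # message to broadcast

    def forward(self, obs, incoming_msg):
        h = self.body(torch.cat([obs, incoming_msg], dim=-1))
        return self.q_head(h), torch.tanh(self.msg_head(h))

agents = [CommAgent() for _ in range(n_agents)]
obs = torch.randn(n_agents, obs_dim)     # placeholder local observations
msgs = torch.zeros(n_agents, msg_dim)    # no messages at the first round

# One communication round: each agent acts and broadcasts a new message.
q_values, new_msgs = [], []
for i, agent in enumerate(agents):
    q_i, m_i = agent(obs[i], msgs[i])
    q_values.append(q_i)
    new_msgs.append(m_i)
actions = [int(q.argmax()) for q in q_values]
# Simple aggregation: each agent receives the mean of the others' messages.
stacked = torch.stack(new_msgs)
msgs = (stacked.sum(dim=0, keepdim=True) - stacked) / (n_agents - 1)
```

Training would backpropagate the TD loss through both heads; bandwidth-limited variants prune or gate these messages, as in [84], [85], [86].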
Integrating these methods into 6G systems requires learning cost-efficient communication due to limited resources such as bandwidth. Lately, more research endeavors have focused on learning efficient communication protocols under limited-bandwidth constraints. Precisely, methods such as pruning, attention, and gating mechanisms are applied to reduce the number of messages communicated to the agents at each control round (e.g., [84], [85], [86]). In addition to learning cost-efficient communication protocols, this field has many other open questions. For example, the robustness of the learned policies to communication errors or delays caused by noisy channels, congestion, interference, etc., needs to be investigated. [87] discusses the challenges and difficulties of learning communication in multi-agent environments. We argue that this technique can be useful in designing intelligent 6G systems. For example, the performance of MEC or aerial communication systems can be boosted by integrating communication between agents. We strongly believe that this field can benefit from the expertise of the wireless communication community to develop more efficient communication protocols while taking into consideration the restrictions of the communication medium [86].

2) Cooperation

In this section, we will overview coordination learning methods without any explicit communication. As mentioned before, training independent and hence fully decentralized agents lacks convergence guarantees because of the non-stationarity problem. This issue is approached with different methodologies in the deep MARL literature. The first one consists in generalizing single-agent RL algorithms to the multi-agent setting. In particular, most single-agent RL algorithms such as DQN rely on experience replay buffers where state transitions are stored. In the multi-agent setting, the data stored in replay memories become obsolete because the agents update their policies in parallel. Several approaches were proposed to address this problem and therefore enable the use of replay buffers to train independent learners in multi-agent environments [81]. Another line of work focuses on training cooperative agents using the CTDE framework. For policy gradients, centralized critic(s) are learned using all agents' policies to avoid non-stationarity and variance problems, while the actors choose actions using local information only. This method is applied, for example, in [88] to extend the DDPG algorithm to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, which can be used for systems with heterogeneous and homogeneous agents. For Q-learning based methods, the approach is to learn a centralized but factorized global Q-function. For example, in [89], the team Q-function is decomposed as the sum of individual Q-functions, whereas in [90], the authors propose to use a mixing network to combine the agents' local Q-functions in a non-linear way (a simplified sketch of such a mixer is given below).
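To illustrate the value-factorization idea, the sketch below combines per-agent Q-values either by simple summation (in the spirit of value decomposition networks) or through a monotonic, state-conditioned mixing network in the spirit of QMIX. The layer sizes and the use of absolute values to enforce non-negative mixing weights are simplifying assumptions and not a faithful reproduction of [89] or [90].

```python
import torch
import torch.nn as nn

n_agents, state_dim, embed_dim = 4, 12, 32

def vdn_mix(agent_qs):
    """Value decomposition: Q_tot is the sum of the individual Q-values."""
    return agent_qs.sum(dim=-1, keepdim=True)

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: non-negative, state-dependent weights keep Q_tot
    monotonic in each agent's Q-value."""
    def __init__(self):
        super().__init__()
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, n_agents, embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)   # Q_tot

agent_qs = torch.randn(8, n_agents)   # chosen-action Q-values per agent (batch of 8)
state = torch.randn(8, state_dim)     # global state, used during training only
q_tot_vdn = vdn_mix(agent_qs)
q_tot_qmix = MonotonicMixer()(agent_qs, state)
```

The TD target is computed on Q_tot, while each agent acts greedily on its own Q_i at execution time, which is what makes the factorization compatible with decentralized execution.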
Although these methods show promising results, they can face several challenges such as representational capacity [91] and inefficient exploration [92]. As an alternative, [93] proposes to learn a single globally shared network that outputs different policies for homogeneous agents. All the methods presented above have a straightforward application in wireless communications since wireless networks are multi-agent systems by definition and coordination is crucial in such systems. In fact, MADDPG has been applied in MEC and aerial networks (i.e., UAV networks). See Section VI for more details.

3) Decentralized MARL Over Networked Agents

Cooperative agents, modeled using a cooperative MG, usually assume that the agents share the same reward function, and thus the homogeneity of the agents. This is not the case in most wireless communication problems, where agents have different preferences or reward functions. For example, MEC networks encompass several types of IoT devices. Hence, it is important to account for the heterogeneity of the agents in the design of decentralized cooperative algorithms for 6G systems. In this context, the objective is to form a team of heterogeneous agents (i.e., with different reward functions) collaborating to maximize the team-average reward R̄ = (1/N) Σ_{i∈N} R_i. As explained in Section II, networked agents cooperate and make decisions using their local observations, including information shared by the neighbors over a communication network. The existence of the communication network enables the collaboration between agents without the intervention of a central unit. Let π^i_{θ_i} be agent i's policy, parametrized as a DNN. The joint policy is given by π_θ = ∏_{i∈N} π^i_{θ_i}(a_i|s), and the global Q-function under the joint policy π_θ is Q_θ. To find optimal policies, the policy gradient for each agent can be expressed as the product of the global Q-function Q_θ and the local score function ∇_{θ_i} log π^i_{θ_i}(a_i|s). However, Q_θ is hard to estimate knowing that the agents can only use their local information. Consequently, each agent learns a local copy Q_{θ_i} of Q_θ. In [94], an actor-critic algorithm is proposed where a consensus-based approach is adopted to update the critics Q_{θ_i} as a weighted average of the local and adjacent updates (a toy version of this consensus step is sketched below). We refer the interested reader to [95] for other extensions and algorithms for this framework with a theoretical analysis.
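The following toy sketch illustrates the consensus step described above: each agent performs a local TD update of a linear critic using its own reward and then averages its parameters with its neighbors' over a communication graph. It is a simplified illustration of the networked-agents idea under arbitrary assumptions (random features, a ring graph, a doubly-stochastic weight matrix), not the exact algorithm or convergence conditions of [94].

```python
import numpy as np

n_agents, feat_dim, alpha = 4, 6, 0.05
rng = np.random.default_rng(0)

# Doubly-stochastic consensus weights over a ring communication graph.
C = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    C[i, i] = 0.5
    C[i, (i - 1) % n_agents] = 0.25
    C[i, (i + 1) % n_agents] = 0.25

w = rng.normal(size=(n_agents, feat_dim))        # local critic parameters

for _ in range(100):
    phi = rng.normal(size=feat_dim)              # feature of current (s, a)
    phi_next = rng.normal(size=feat_dim)         # feature of next (s', a')
    rewards = rng.normal(size=n_agents)          # heterogeneous local rewards

    # 1) Local TD(0) update of each agent's critic using its own reward.
    w_half = np.empty_like(w)
    for i in range(n_agents):
        td_error = rewards[i] + 0.95 * phi_next @ w[i] - phi @ w[i]
        w_half[i] = w[i] + alpha * td_error * phi

    # 2) Consensus step: mix parameters with the neighbors over the network.
    w = C @ w_half

# After repeated mixing, the critics drift toward a common estimate that
# reflects the team-average reward rather than any single agent's reward.
```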
VI. APPLICATIONS

A. MARL for MEC Systems

Multi-access edge computing (MEC) is one of the enabling technologies for 5G and beyond-5G networks. We are witnessing a proliferation of smart devices running computationally expensive tasks such as gaming and Virtual/Augmented Reality. Therefore, designing efficient algorithms for MEC systems is a crucial step in providing low-latency and high-reliability services. DRL has been extensively applied to solve several problems in MEC networks, including task/computation offloading (e.g., [96], [97]), edge caching [98], network slicing, and resource allocation [99]. Recently, more interest has been accorded to MARL in MEC networks to account for the distributed nature of these networks.

Task offloading has been studied in several works from a multi-agent perspective with a focus on decentralized execution. First, we examine the works proposing fully decentralized algorithms based on the IL framework. In [100], each mobile user is represented by a DDPG/DQN agent aiming to independently learn an optimal task offloading policy that minimizes its power consumption and task buffering delays. Therefore, this paper provides a fully decentralized algorithm where users decide, using their local observations (task buffer length, user SINR, CSI), the power levels allocated for local execution and the task offloading. Similarly, the approach in [101] is based on independent Q-learning, where each edge user selects the transmit power, the radio access technology, and the sub-channel. The problems considered in the previous papers are formulated as MGs where the global state is the concatenation of the users' local observations and the agents act simultaneously on the system to receive independent reward signals. Thus, this formalization enables the consideration of heterogeneous users with different reward functions. Furthermore, [102] formalizes the task offloading as a non-cooperative problem where each agent aims to minimize the task drop rate and execution delay. Each mobile user is represented as an A2C agent. An energy-aware algorithm is presented in [103] where independent DQN agents are deployed in every edge server and the servers decide which user(s) should offload their computations. However, the independent nature of these works rules out any coordination between the learning agents, which may hinder the convergence of these methods in practice (see Section V-A).

Cooperation is considered in [28], where the authors used the MADDPG algorithm to jointly optimize the multi-channel access and task offloading of MEC networks in Industry 4.0. The joint optimization problem is modeled as a Dec-POMDP, since the agents cannot observe the status of all the channels, and is solved using the CTDE paradigm. The use of MADDPG enables the coordination between agents without any explicit communication since the critic is learned using the information from all the agents but the actors are executed in a decentralized manner. Experimental results showcased the impact of cooperation in reducing the computation delay and enhancing the channel utilization compared to the IL case.

For edge caching, [29] and [104] propose MADDPG-like algorithms to solve the cooperative multi-agent edge caching problem. Both of these works model the cooperative edge caching as a Dec-POMDP and differ in the definition of the state space and reward functions. In [29], the edge servers receive the same reward, namely the average transmission delay reduction, whereas in [104], the weighted sum of the local and the neighbors' hit rates is considered as a reward signal to encourage cooperation between adjacent servers. Simulation results showed that cooperative edge caching outperforms traditional caching mechanisms such as Least Recently Used (LRU), Least Frequently Used (LFU), and First In First Out (FIFO).

To summarize, to offer massive URLLC services, the scalability of MEC systems is crucial. We expect to see more research efforts leveraging deep MARL techniques to study and analyze the reliability-latency-scalability trade-off of future 6G systems. For example, applying the networked agents scheme to the above-mentioned problems is one direction to explore in future works.

B. MARL for UAV-Assisted Wireless Communications

The application of deep MARL in UAV networks has been getting more attention recently. In general, these applications involve solving cooperative tasks by a team of UAVs without the intervention of a central unit. Hence, in UAV network management, decentralized MARL algorithms are preferable in terms of communication cost and energy efficiency. The decentralized scheme over networked agents is suitable for this application if we assume that the UAVs have sufficient communication capabilities to share information with the neighbors in their sensing and coverage areas. However, due to the mobility of UAVs, maintaining communication links with neighbors to coordinate represents a considerable handicap for this paradigm.

In [105], the authors study cooperation in the link discovery and selection problem. Each UAV agent u perceives the locally available channels and decides to establish a link with another UAV v over a shared channel. Due to different factors such as UAV mobility, wireless channel quality, and perception capabilities, each UAV u has a different local set of perceived channels C_u such that C_u ∩ C_v ≠ ∅. A link is established between two agents u and v if they propagate messages on the same channel simultaneously. Given the local information (i.e., state) about whether the previous message was successfully delivered, each UAV's action is a pair (v, c_u), denoting by v the other UAV and by c_u the propagation channel. Each agent receives a reward r_u defined as the number of successfully sent messages over time-varying channels. The algorithm proposed in [105] is based on independent Q-learning with two main modifications: fractional slicing to deal with high-dimensional and continuous action spaces, and mutual sampling to share information (state-action pairs and Q-function parameters) between agents to alleviate the non-stationarity issue in the fully decentralized scheme. Thus, a central coordinating unit is necessary. The problem of field coverage by a team of UAVs is addressed in [25]. The authors formulated the problem as an MG where each agent's state is defined as its position in a 3D grid. The UAVs cooperate to maximize the coverage of an unknown field. The UAVs are assumed to be homogeneous and have the same action and state spaces. The proposed algorithm is based on Q-learning where a global Q-function is decomposed using the approximation techniques Fixed Sparse Representation (FSR) and Radial Basis Function (RBF). This decomposition technique does not allow the full decentralization of the algorithm since the basis functions depend on the joint state and action spaces. Thus, the UAVs need to share information, which incurs an important communication cost. Another application of MARL in UAV networks is spectrum sharing, which is analyzed in [106]. The UAV team is divided into two clusters: the relaying UAVs, which provide relaying services for the terrestrial primary users in exchange for spectrum access for the other cluster, which groups sensing UAVs transmitting data packets to a fusion center. Each UAV's action is either to join the relaying or the sensing cluster. The authors proposed a distributed tabular Q-learning algorithm where each UAV learns a local Q-function using its local state without any coordination with the other UAVs. In a more recent work [107], the joint optimization of multi-UAV target assignment and path planning is solved using MARL. A team of UAVs positioned in a 2D environment aims to serve T targets while minimizing the flight distance. Each UAV covers only one target without collision with threat areas and other UAVs. To enforce the collision-free constraint, a collision penalty is added to the reward, thus rendering the problem a mixed RL setting with both cooperation and competition. Consequently, the MADDPG algorithm is adopted to solve the joint optimization. Furthermore, [108] formulates resource allocation in a downlink communication network as an SG and solves it using independent Q-learning. The work in [109] applies MARL to fleet control, particularly aerial surveillance and base defense, in a fully centralized fashion.

C. MARL for Beamforming Optimization in Cell-Free MIMO and THz Networks

DRL has been extensively applied for uplink/downlink beamforming optimization. Particularly, several works focused on beamforming computation in cell-free networks. In the fully centralized version of the cell-free architecture, all the access points are connected and coordinated through a central processing unit to serve users in their coverage area. Although the application of single-agent DRL to cell-free networks has empirically shown optimal performance, the computational complexity and the communication cost increase drastically with the number of users and access points. As a remedy to this issue, hybrid methods based on dynamic clustering and network partitioning are proposed. The core idea of these methods is to cluster users and/or access points to reduce the computational and communication costs as well as to enhance the coverage by reducing interference. As an example, in [30], a DDQN algorithm is implemented to perform dynamic clustering and a DDPG agent is dedicated to beamforming optimization. This joint clustering and beamforming optimization is formulated as an MDP and a central unit is used for training and execution. In [110], dynamic downlink beamforming coordination is studied in a multi-cell MISO network. The authors proposed a distributed algorithm where the base stations are allowed to share information via a limited exchange protocol. Each base station is represented as a DQN agent trying to maximize its achievable rate while minimizing the interference with the neighboring agents. The use of DQN networks required the discretization of the action space, which is continuous by definition (a small sketch of this discretization is given at the end of this subsection). The same framework can be applied with PG or DDPG methods to handle continuous action spaces.

Furthermore, THz communication channels are characterized by high attenuation and path loss, which require transmitting highly directive beams to minimize the signal power propagating in directions other than the transmission direction. In this context, directional beamforming and beam selection are possible solutions to enhance the communication range and reduce interference. Intelligent beamforming in THz MIMO systems is another promising application of MARL for future 6G networks.
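As a small illustration of the discretization strategy mentioned above for DQN-based beamforming, the sketch below maps a continuous beam steering choice onto a finite codebook and lets a Q-network pick a codebook index. The codebook construction, the network size, and the reward are illustrative assumptions only, not the design of [110] or any other cited work.

```python
import torch
import torch.nn as nn

n_beams, obs_dim = 16, 8                         # size of the discrete codebook
codebook = torch.linspace(-60.0, 60.0, n_beams)  # candidate steering angles (deg)

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_beams))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_beams))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

obs = torch.randn(32, obs_dim)        # e.g., CSI/interference features (placeholder)
next_obs = torch.randn(32, obs_dim)
actions = torch.randint(0, n_beams, (32,))       # chosen codebook indices
rewards = torch.randn(32)                        # e.g., achievable rate (placeholder)

# Standard DQN update over the discretized beam set.
q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    target = rewards + 0.9 * target_net(next_obs).max(dim=1).values
loss = nn.functional.mse_loss(q_sa, target)
opt.zero_grad(); loss.backward(); opt.step()

# At execution time, each base station greedily picks a beam from the codebook.
best_angle = codebook[int(q_net(obs[:1]).argmax(dim=1))]
```

A PG or DDPG actor would instead output the steering angle (or precoder weights) directly, avoiding the codebook but requiring a critic over continuous actions.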
D. Spectrum Management

In [111] and [112], spectrum allocation in Device-to-Device (D2D) enabled 5G HetNets is considered. The D2D transmitters aim to select the available spectrum resources with minimal interference to ensure the minimum performance requirements for the cellular users. The authors in [111] consider a non-cooperative scenario where the agents independently maximize their throughput. Based on the resource block selected in the previous time step, the D2D users choose the resource block for uplink transmission and receive a positive reward equivalent to the capacity of the D2D user if the cellular users' constraints are satisfied; otherwise, a penalty is imposed. The problem is solved using a tabular Q-learning algorithm where each D2D agent learns a local Q-function with local state information (a toy version of this scheme is sketched at the end of this section). This work does not scale to high-dimensional state spaces since it is based on a tabular approach. This problem is addressed in [112], where actor-critic based algorithms are proposed for spectrum sharing. Two approaches are used to promote cooperation between D2D users. The first is called multi-agent actor-critic, where each agent learns a Q-function in a centralized manner using the information from all the other agents. The learned policies are executed in a decentralized fashion since the actor relies on local information only. The second approach proposes to use information from neighboring agents only to train the critic, instead of the information from all the agents, to reduce the computational complexity for large-scale networks. The action selection and the reward function are similar to the ones defined in the previous work. In this work, the state space is richer and contains information about (i) the instant channel information of the corresponding D2D link, (ii) the channel information of the cellular link (e.g., from the BS to the D2D transmitter), (iii) the previous interference to the link, and (iv) the resource block selected by the D2D link in the previous time slot.

Another application of MARL is resource allocation in cellular-based vehicular communication networks. The Vehicle-to-Vehicle (V2V) transmitters jointly select the communication channel and the power level for transmission. The work in [113] proposes a fully decentralized algorithm based on DQN to maximize the Vehicle-to-Infrastructure capacity under V2V latency constraints. Although the solution is decentralized in the sense that each agent trains a Q-network locally, the state contains information about the other participants. The authors include a penalty in the reward function to account for the latency constraints in V2V communications. For more references on MARL in vehicular communications, we refer the reader to [17].

Interference mitigation will be a pressing issue in THz communications. Exploiting the THz bands is one key enabler of 6G systems for higher data rates. In [114], a multi-armed bandit based algorithm is proposed for intermittent interference mitigation from directional links in two-tier HetNets. However, the proposed solution is valid for a single target receiver. Another work in [115] proposes a two-layered distributed D2D model, where MARL is applied to maximize user coverage in dense indoor environments with limited resources (i.e., a single THz access point, limited bandwidth, and limited antenna gain). Devices in the first layer can directly access the network resources and act as relays for the second-layer devices. The objective is to find the optimal D2D links between the two layers. The devices of the first layer are modeled as Q-learning agents and decide, using local information, the D2D links to establish with the second-layer devices. The agents receive two types of rewards: a private one for serving a device in the second layer and a public reward regarding the system throughput. To promote coordination, the agents receive information about the number of their neighbors and their states. [116] studies a two-tier network with virtualized small cells powered by energy harvesters and equipped with rechargeable batteries. These cells can decide to offload baseband processes to a grid-connected edge server. MARL is applied to minimize the grid energy consumption and the traffic drop rate. The agents collaborate via exchanging information regarding their battery state. [117] inspects the problem of user association in dynamic mmWave networks where users are represented as DQN agents independently optimizing their policies using their local information.

E. Intelligent Reflecting Surfaces (IRS)-Aided Wireless Communications

Intelligent Reflecting Surfaces (IRS)-aided wireless communications have attracted increasing interest due to the coverage and spectral efficiency gains they provide. Multiple research works proposed DRL-based algorithms for joint beamforming and phase shift computation. These contributions study systems with a single IRS, which is far from the real-world case. More recent research endeavors seek to remedy this shortcoming. For example, in [118], the authors consider a communication system with multiple IRSs cooperating together under the coordination of an IRS controller. The joint beamforming and phase shift optimization problem is decoupled and solved in an alternating manner using fractional programming. Another line of work aims to provide secure and anti-jamming wireless communications by adjusting the IRS elements. This problem was also approached using single-agent RL (e.g., in [119]). For distributed deployment of multiple IRSs, secure beamforming is solved in [120] using an alternating optimization scheme based on successive convex approximation and manifold optimization techniques. At the time of writing this tutorial, we were unable to find a decentralized algorithm for a multi-IRS system based on MARL techniques. This is why we believe that proposing distributed deployment schemes for multi-IRS systems is a promising research direction.
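To close this applications section, the sketch below shows the kind of independent tabular Q-learning scheme used by several of the spectrum-management works above: each agent keeps its own Q-table over resource blocks and updates it from a local reward. The toy collision-based reward and all sizes are illustrative assumptions, not a reproduction of [111] or any other cited work.

```python
import numpy as np

n_agents, n_blocks, gamma, alpha, eps = 3, 5, 0.9, 0.1, 0.2
rng = np.random.default_rng(1)

# One Q-table per D2D agent: state = resource block used in the previous
# step, action = resource block chosen for the next transmission.
Q = np.zeros((n_agents, n_blocks, n_blocks))
prev_block = rng.integers(n_blocks, size=n_agents)

def local_reward(agent, chosen, all_choices):
    """Toy stand-in: capacity-like reward with a penalty on collisions."""
    collisions = sum(1 for j, b in enumerate(all_choices)
                     if j != agent and b == chosen)
    return 1.0 - collisions  # assumption only; real works use rates/constraints

for t in range(2000):
    # Each agent selects a block independently (epsilon-greedy on its own table).
    choices = np.array([
        rng.integers(n_blocks) if rng.random() < eps
        else int(np.argmax(Q[i, prev_block[i]]))
        for i in range(n_agents)
    ])
    # Independent Q-learning update from each agent's local reward.
    for i in range(n_agents):
        r = local_reward(i, choices[i], choices)
        best_next = np.max(Q[i, choices[i]])
        Q[i, prev_block[i], choices[i]] += alpha * (
            r + gamma * best_next - Q[i, prev_block[i], choices[i]])
    prev_block = choices
```

Because each table ignores the other agents' learning, this is exactly the IL setting whose convergence caveats were discussed in Section V-A.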
TABLE V: Papers that applied MARL techniques to wireless communication problems; the learning type refers to the MARL training scheme.

VII. CONCLUSION AND FUTURE RESEARCH DIRECTIONS

We have presented an overview of model-free and model-based single-agent RL, and of cooperative MARL frameworks and algorithms. We have provided the mathematical background of the key frameworks for both single-agent and multi-agent RL. Afterward, we have developed an understanding of the state-of-the-art algorithms in the studied fields. We have discussed the use of model-based methods for solving wireless communication problems. Focusing on cooperative MARL, we have outlined different methods to establish coordination between agents. To showcase how these methods can be applied to wireless communication systems, we have reviewed the recent research contributions that adopt a multi-agent perspective in solving communication and networking problems. These problems involve AI-enabled MEC systems, intelligent control and management of UAV networks, distributed beamforming for cell-free MIMO systems, cooperative spectrum sharing, THz communications, and IRS deployment. We have chosen to focus on cooperative MARL applications because several surveys on single-agent RL exist in the literature. Our objective has been to highlight the potential of these DRL methods, specifically MARL, in building self-organizing, reliable, and scalable systems for future wireless generations. In what follows, we discuss research directions to enrich and bridge the gap between both fields.

• Network topologies: One of the shortcomings of the MG-based formulations of MARL problems is the assumed homogeneity of the studied systems. However, this is seldom the case in real-world scenarios like MEC-IoT or sensing networks. For this reason, we have motivated the networked MARL paradigm where agents with different reward functions can cooperate. Accounting for the heterogeneity of wireless communication systems is mandatory for practical algorithm design. Mobility is also a challenge in wireless communication problems. Developing MARL algorithms with mobility considerations is an interesting research direction.

• Constrained/safe RL: RL is based on maximizing the reward feedback. The reward function is designed by human experts to guide the agent's policy search, but reward design is often challenging and can lead to unintended behavior. Wireless communication problems are often formulated as optimization problems under constraints. To account for those constraints, most of the recent works adopt a reward shaping strategy where penalties are added to the reward function for violating the defined constraints. In addition, reward shaping does not ensure that the exploration during training is constraint-satisfying. This motivates the constrained RL framework. It enables the development of more reliable algorithms ensuring that the learned policies satisfy reasonable service quality and/or respect system constraints.

• Theoretical guarantees: Despite the abundance of experimental works around RL methods, their convergence properties are still an active research area. Several endeavors, reviewed above, proposed convergence guarantees for policy gradient algorithms under specific assumptions such as unbiased gradient estimates. More pressing theoretical questions need to be addressed, such as the convergence speed to a globally optimal solution, the robustness to approximation errors, and the behavior when limited sample data is available.

• Privacy: One of the challenges of the commercialization of DRL-based solutions is privacy. These concerns are rooted in the data required to train RL agents, such as actions and rewards. Consequently, information about the environment dynamics and the reward functions can be inferred by malicious agents [79]. Privacy-preserving algorithms are attracting more attention and interest. Differential privacy was investigated in the context of Federated Learning as well as DRL [121]. Privacy is not sufficiently explored in the context of wireless communication.

• Security and robustness: DNNs are known to be vulnerable to adversarial attacks, and several recent works demonstrated the vulnerability of DRL to such attacks as well. To completely trust DRL-based methods in real-world critical applications, understanding the vulnerabilities of these methods and addressing them is a central concern in the deployment of AI-empowered systems [122], [123]. In addition to adversarial attacks, the robustness of the learned policies to differences between simulation and real-world settings needs to be addressed and studied (e.g., [124]).

REFERENCES

[1] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems: Applications, trends, technologies, and open research problems,” IEEE Network, vol. 34, no. 3, pp. 134–142, 2019.
[2] I. F. Akyildiz, A. Kak, and S. Nie, “6G and beyond: The future of wireless communications systems,” IEEE Access, vol. 8, pp. 133995–134030, 2020.
[3] L. Bariah, L. Mohjazi, S. Muhaidat, P. C. Sofotasios, G. K. Kurt, H. Yanikomeroglu, and O. A. Dobre, “A prospective look: Key enabling technologies, applications and open research topics in 6G networks,” arXiv preprint arXiv:2004.06049, 2020.
[4] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
[5] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, pp. 8026–8037, 2019.
[6] C. Boutilier, “Planning, learning and coordination in multiagent decision processes,” in Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195–210, 1996.
[7] Y. Shoham, R. Powers, and T. Grenager, “Multi-agent reinforcement learning: a critical survey,” Web manuscript, vol. 2, 2003.
[8] X. Chen, X. Deng, and S.-H. Teng, “Settling the complexity of computing two-player Nash equilibria,” Journal of the ACM (JACM), vol. 56, no. 3, pp. 1–57, 2009.
[9] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133–3174, 2019.
[10] Y. Qian, J. Wu, R. Wang, F. Zhu, and W. Zhang, “Survey on reinforcement learning applications in communication networks,” Journal of Communications and Information Networks, vol. 4, no. 2, pp. 30–39, 2019.
[11] L. Lei, Y. Tan, K. Zheng, S. Liu, K. Zhang, and X. Shen, “Deep reinforcement learning for autonomous internet of things: Model, applications and challenges,” IEEE Communications Surveys & Tutorials, 2020.
[12] Y. L. Lee and D. Qin, “A survey on applications of deep reinforcement learning in resource management for 5G heterogeneous networks,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1856–1862, IEEE, 2019.
[13] H. Zhu, Y. Cao, W. Wang, T. Jiang, and S. Jin, “Deep reinforcement learning for mobile edge caching: Review, new features, and open issues,” IEEE Network, vol. 32, no. 6, pp. 50–57, 2018.
[14] S. Ali, W. Saad, N. Rajatheva, K. Chang, D. Steinbach, B. Sliwa, C. Wietfeld, K. Mei, H. Shiri, H.-J. Zepernick, et al., “6G white paper on machine learning in wireless communication networks,” arXiv preprint arXiv:2004.13875, 2020.
[15] F. Tang, Y. Kawamoto, N. Kato, and J. Liu, “Future intelligent and secure vehicular network toward 6G: Machine-learning approaches,” Proceedings of the IEEE, vol. 108, no. 2, pp. 292–307, 2019.
[16] C. She, C. Sun, Z. Gu, Y. Li, C. Yang, H. V. Poor, and B. Vucetic, “A tutorial of ultra-reliable and low-latency communications in 6G: Integrating theoretical knowledge into deep learning,” arXiv preprint arXiv:2009.06010, 2020.
[17] I. Althamary, C.-W. Huang, and P. Lin, “A survey on multi-agent reinforcement learning methods for vehicular networks,” in 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), pp. 1154–1159, IEEE, 2019.
[18] D. Lee, N. He, P. Kamalaruban, and V. Cevher, “Optimization for reinforcement learning: From a single agent to cooperative agents,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 123–135, 2020.
[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[20] K. I. Ahmed and E. Hossain, “A deep Q-learning method for downlink power allocation in multi-cell networks,” arXiv preprint arXiv:1904.13032, 2019.
[21] M. L. Littman and R. S. Sutton, “Predictive representations of state,” in Advances in Neural Information Processing Systems, pp. 1555–1561, 2002.
[22] R. Xie, Q. Tang, C. Liang, F. R. Yu, and T. Huang, “Dynamic computation offloading in IoT fog systems with imperfect channel state information: A POMDP approach,” IEEE Internet of Things Journal, 2020.
[23] L. S. Shapley, “Stochastic games,” Proceedings of the National Academy of Sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
[24] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine Learning Proceedings 1994, pp. 157–163, Elsevier, 1994.
[25] H. X. Pham, H. M. La, D. Feil-Seifer, and A. Nefian, “Cooperative and distributed reinforcement learning of drones for field coverage,” arXiv preprint arXiv:1803.07250, 2018.
[26] F. A. Oliehoek, C. Amato, et al., A Concise Introduction to Decentralized POMDPs, vol. 1. Springer, 2016.
[27] E. A. Hansen, D. S. Bernstein, and S. Zilberstein, “Dynamic programming for partially observable stochastic games,” in AAAI, vol. 4, pp. 709–715, 2004.
[28] Z. Cao, P. Zhou, R. Li, S. Huang, and D. Wu, “Multi-agent deep reinforcement learning for joint multi-channel access and task offloading of mobile edge computing in industry 4.0,” IEEE Internet of Things Journal, 2020.
[29] C. Zhong, M. C. Gursoy, and S. Velipasalar, “Deep multi-agent reinforcement learning based cooperative edge caching in wireless networks,” in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pp. 1–6, IEEE, 2019.
[30] Y. Al-Eryani, M. Akrout, and E. Hossain, “Multiple access in cell-free networks: Outage performance, dynamic clustering, and deep reinforcement learning-based design,” IEEE Journal on Selected Areas in Communications, 2020.
[31] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
[32] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[33] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
[34] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, pp. 1928–1937, 2016.
[35] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, pp. 1889–1897, 2015.
[36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[37] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, “A machine learning approach to routing,” arXiv preprint arXiv:1708.03074, 2017.
[38] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[39] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[40] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
[41] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International Conference on Machine Learning, pp. 1995–2003, 2016.
[42] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Transactions on Cognitive Communications and Networking, vol. 4, no. 2, pp. 257–265, 2018.
[43] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on International Conference on Machine Learning, pp. 387–395, 2014.
[44] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[45] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” arXiv preprint arXiv:1802.09477, 2018.
[46] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
[47] C. Qiu, Y. Hu, Y. Chen, and B. Zeng, “Deep deterministic policy gradient (DDPG)-based energy harvesting wireless communications,” IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8577–8588, 2019.
[48] H. Van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil, “Deep reinforcement learning and the deadly triad,” arXiv preprint arXiv:1812.02648, 2018.
[49] L. Baird, “Residual algorithms: Reinforcement learning with function approximation,” in Machine Learning Proceedings 1995, pp. 30–37, Elsevier, 1995.
[50] B. Dai, N. He, Y. Pan, B. Boots, and L. Song, “Learning from conditional distributions via dual embeddings,” in Artificial Intelligence and Statistics, pp. 1458–1467, 2017.
[51] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, “SBEED: Convergent reinforcement learning with nonlinear function approximation,” in International Conference on Machine Learning, pp. 1125–1134, PMLR, 2018.
[52] K. Zhang, A. Koppel, H. Zhu, and T. Başar, “Global convergence of policy gradient methods to (almost) locally optimal policies,” arXiv preprint arXiv:1906.08383, 2019.
[53] M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli, “Stochastic variance-reduced policy gradient,” arXiv preprint arXiv:1806.05618, 2018.
[54] Z. Shen, A. Ribeiro, H. Hassani, H. Qian, and C. Mi, “Hessian aided policy gradient,” in International Conference on Machine Learning, pp. 5729–5738, 2019.
[55] P. Xu, F. Gao, and Q. Gu, “An improved convergence analysis of stochastic variance-reduced policy gradient,” in Uncertainty in Artificial Intelligence, pp. 541–551, PMLR, 2020.
[56] B. Liu, Q. Cai, Z. Yang, and Z. Wang, “Neural proximal/trust region policy optimization attains globally optimal policy,” arXiv preprint arXiv:1906.10306, 2019.
[57] L. Wang, Q. Cai, Z. Yang, and Z. Wang, “Neural policy gradient methods: Global optimality and rates of convergence,” arXiv preprint arXiv:1909.01150, 2019.
[58] T. Xu, Z. Wang, and Y. Liang, “Improving sample complexity bounds for actor-critic algorithms,” arXiv preprint arXiv:2004.12956, 2020.
[59] Y. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite time analysis of two time-scale actor critic methods,” arXiv preprint arXiv:2005.01350, 2020.
[60] T. Xu, Z. Wang, and Y. Liang, “Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms,” arXiv preprint arXiv:2005.03557, 2020.
[61] M. Yu, Z. Yang, M. Kolar, and Z. Wang, “Convergent policy optimization for safe reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 3127–3139, 2019.
[62] E. Puiutta and E. Veith, “Explainable reinforcement learning: A survey,” arXiv preprint arXiv:2005.06247, 2020.
[63] A. Heuillet, F. Couthouis, and N. D. Rodríguez, “Explainability in deep reinforcement learning,” arXiv preprint arXiv:2008.06693, 2020.
[64] H. P. van Hasselt, M. Hessel, and J. Aslanides, “When to use parametric models in reinforcement learning?,” in Advances in Neural Information Processing Systems, pp. 14322–14333, 2019.
[65] A. Zhang, S. Sukhbaatar, A. Lerer, A. Szlam, and R. Fergus, “Composable planning with attributes,” in International Conference on Machine Learning, pp. 5842–5851, 2018.
[66] M. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.
[67] R. S. Sutton, “Integrated architectures for learning, planning, and reacting based on approximating dynamic programming,” in Machine Learning Proceedings 1990, pp. 216–224, Elsevier, 1990.
[68] A. W. Moore and C. G. Atkeson, “Prioritized sweeping: Reinforcement learning with less data and less time,” Machine Learning, vol. 13, no. 1, pp. 103–130, 1993.
[69] E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, “Benchmarking model-based reinforcement learning,” arXiv preprint arXiv:1907.02057, 2019.
[70] I. Mordatch and J. Hamrick, “Tutorial on model-based methods in reinforcement learning,” ICML, 2020.
[71] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to trust your model: Model-based policy optimization,” in Advances in Neural Information Processing Systems, pp. 12519–12530, 2019.
[72] Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving PILCO with Bayesian neural network dynamics models,” in Data-Efficient Machine Learning Workshop, ICML, vol. 4, p. 34, 2016.
[73] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in Advances in Neural Information Processing Systems, pp. 4754–4765, 2018.
[74] A. S. Polydoros and L. Nalpantidis, “Survey of model-based reinforcement learning: Applications on robotics,” Journal of Intelligent & Robotic Systems, vol. 86, no. 2, pp. 153–173, 2017.
[75] E. Bargiacchi, T. Verstraeten, D. M. Roijers, and A. Nowé, “Model-based multi-agent reinforcement learning with cooperative prioritized sweeping,” arXiv preprint arXiv:2001.07527, 2020.
[76] O. Krupnik, I. Mordatch, and A. Tamar, “Multi-agent reinforcement learning with multi-step generative models,” in Conference on Robot Learning, pp. 776–790, 2020.
[77] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.
[78] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, “A survey of learning in multiagent environments: Dealing with non-stationarity,” arXiv preprint arXiv:1707.09183, 2017.
[79] X. Pan, W. Wang, X. Zhang, B. Li, J. Yi, and D. Song, “How you act tells a lot: Privacy-leakage attack on deep reinforcement learning,” arXiv preprint arXiv:1904.11082, 2019.
[80] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337, 1993.
[81] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, “A survey and critique of multiagent deep reinforcement learning,” Autonomous Agents and Multi-Agent Systems, vol. 33, no. 6, pp. 750–797, 2019.
[82] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications,” IEEE Transactions on Cybernetics, 2020.
[83] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” arXiv preprint arXiv:1911.10635, 2019.
[84] J. Jiang and Z. Lu, “Learning attentional communication for multi-agent cooperation,” in Advances in Neural Information Processing Systems, pp. 7254–7264, 2018.
[85] H. Mao, Z. Gong, Z. Zhang, Z. Xiao, and Y. Ni, “Learning multi-agent communication under limited-bandwidth restriction for internet packet routing,” arXiv preprint arXiv:1903.05561, 2019.
[86] R. Wang, X. He, R. Yu, W. Qiu, B. An, and Z. Rabinovich, “Learning efficient multi-agent communication: An information bottleneck approach,” arXiv preprint arXiv:1911.06992, 2019.
[87] R. Lowe, J. Foerster, Y.-L. Boureau, J. Pineau, and Y. Dauphin, “On the pitfalls of measuring emergent communication,” arXiv preprint arXiv:1903.05168, 2019.
[88] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
[89] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al., “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in AAMAS, pp. 2085–2087, 2018.
[90] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning,” arXiv preprint arXiv:1803.11485, 2018.
[91] J. Castellini, F. A. Oliehoek, R. Savani, and S. Whiteson, “The representational capacity of action-value networks for multi-agent reinforcement learning,” arXiv preprint arXiv:1902.07497, 2019.
[92] A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson, “MAVEN: Multi-agent variational exploration,” in Advances in Neural Information Processing Systems, pp. 7613–7624, 2019.
[93] J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83, Springer, 2017.
[94] K. Zhang, Z. Yang, and T. Başar, “Networked multi-agent reinforcement learning in continuous spaces,” in 2018 IEEE Conference on Decision and Control (CDC), pp. 2771–2776, IEEE, 2018.
[95] K. Zhang, Z. Yang, and T. Başar, “Decentralized multi-agent reinforcement learning with networked agents: Recent advances,” arXiv preprint arXiv:1912.03821, 2019.
[96] T. K. Rodrigues, K. Suto, H. Nishiyama, J. Liu, and N. Kato, “Machine learning meets computation and communication control in evolving edge and cloud: Challenges and future perspective,” IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 38–67, 2019.
[97] A. Shakarami, M. Ghobaei-Arani, and A. Shahidinejad, “A survey on the computation offloading approaches in mobile edge computing: A machine learning-based perspective,” Computer Networks, p. 107496, 2020.
[98] M. Sheraz, M. Ahmed, X. Hou, Y. Li, D. Jin, and Z. Han, “Artificial intelligence for wireless caching: Schemes, performance, and challenges,” IEEE Communications Surveys & Tutorials, 2020.
[99] M. McClellan, C. Cervelló-Pastor, and S. Sallent, “Deep learning at the mobile edge: Opportunities for 5G networks,” Applied Sciences, vol. 10, no. 14, p. 4735, 2020.
[100] Z. Chen and X. Wang, “Decentralized computation offloading for multi-user mobile edge computing: A deep reinforcement learning approach,” arXiv preprint arXiv:1812.07394, 2018.
[101] X. Liu, J. Yu, Z. Feng, and Y. Gao, “Multi-agent reinforcement learning for resource allocation in IoT networks with edge computing,” China Communications, vol. 17, no. 9, pp. 220–236, 2020.
[102] J. Heydari, V. Ganapathy, and M. Shah, “Dynamic task offloading in multi-agent mobile edge computing networks,” in 2019 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, IEEE, 2019.
[103] N. Naderializadeh and M. Hashemi, “Energy-aware multi-server mobile edge computing: A deep reinforcement learning approach,” in 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 383–387, IEEE, 2019.
[104] Y. Zhang, B. Feng, W. Quan, A. Tian, K. Sood, Y. Lin, and H. Zhang, “Cooperative edge caching: A multi-agent deep learning based approach,” IEEE Access, vol. 8, pp. 133212–133224, 2020.
[105] B. Yang and M. Liu, “Keeping in touch with collaborative UAVs: A deep reinforcement learning approach,” in IJCAI, pp. 562–568, 2018.
[106] A. Shamsoshoara, M. Khaledi, F. Afghah, A. Razi, and J. Ashdown, “Distributed cooperative spectrum sharing in UAV networks using multi-agent reinforcement learning,” in 2019 16th IEEE Annual Consumer Communications & Networking Conference (CCNC), pp. 1–6, IEEE, 2019.
[107] H. Qie, D. Shi, T. Shen, X. Xu, Y. Li, and L. Wang, “Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning,” IEEE Access, vol. 7, pp. 146264–146272, 2019.
[108] J. Cui, Y. Liu, and A. Nallanathan, “The application of multi-agent reinforcement learning in UAV networks,” in 2019 IEEE International Conference on Communications Workshops (ICC Workshops), pp. 1–6, IEEE, 2019.
[109] J. Tožička, B. Szulyovszky, G. de Chambrier, V. Sarwal, U. Wani, and M. Gribulis, “Application of deep reinforcement learning to UAV fleet control,” in Proceedings of SAI Intelligent Systems Conference, pp. 1169–1177, Springer, 2018.
[110] J. Ge, Y.-C. Liang, J. Joung, and S. Sun, “Deep reinforcement learning for distributed dynamic MISO downlink-beamforming coordination,” IEEE Transactions on Communications, 2020.