MARL For Networks
is often a simulator mimicking the real-world system since it is expensive to directly interact with the real system. These collected experiences constitute the training dataset of the RL agent and will be used to learn the optimal decision-making rule. In Deep RL (DRL), DNNs are used to approximate the agent's optimal strategy or policy and/or its optimal utility function (see Figure 1). In this case, given the system's current state, the DNN learns to predict either a distribution over actions or the expected reward Q(s, a) of each action. Therefore, the agent chooses its next action as the one that has either the highest probability or the highest expected reward. After receiving the reward from the environment, the DNN parameters are updated accordingly. The generalization power of DNNs enables solving high-dimensional problems with continuous or combinatorial state spaces. In the context of wireless communications, DRL is advantageous compared to traditional optimization methods thanks to its real-time inference. However, the training phase of DNNs requires a considerable amount of computation power, which necessitates the use of GPUs and high-performance CPU clusters. The most popular DL frameworks are TensorFlow [4] and PyTorch [5]. Once the training is complete, the agent can make decisions in real time, which is a considerable advantage compared to traditional optimization methods. To accelerate the inference of DNNs further, different libraries implement sophisticated compression techniques such as quantization to speed up the execution of DNNs on mobile or edge devices.

Fig. 1: Representation of a DRL framework.
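To make the two action-selection rules mentioned above concrete, the following minimal NumPy sketch contrasts picking the most likely action from a predicted distribution with picking the action of highest predicted Q-value. The network outputs are hard-coded illustrative numbers, not taken from any system described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical network outputs for one state with 4 possible actions.
action_probs = np.array([0.1, 0.6, 0.2, 0.1])   # policy head: distribution over actions
q_values     = np.array([1.3, 0.4, 2.1, -0.5])  # value head: expected reward Q(s, a)

# Policy-based selection: pick the most likely action (or sample from the distribution).
greedy_policy_action = int(np.argmax(action_probs))
sampled_action       = int(rng.choice(len(action_probs), p=action_probs))

# Value-based selection: pick the action with the highest predicted Q-value.
greedy_value_action = int(np.argmax(q_values))

print(greedy_policy_action, sampled_action, greedy_value_action)
```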
A. Scope of this Tutorial

In this work, we emphasize the role of DRL in future 6G networks. In particular, our objective is to discuss several DRL learning frameworks to advance the current state-of-the-art and accommodate the requirements of 6G networks. First, we overview single-agent RL methods and shed light on MBRL techniques. Although MBRL has received less interest, it can show a considerable advantage compared to model-free algorithms. MBRL consists in learning a model representing the environment dynamics and utilizing the learned model to compute an optimal policy. The main advantage of having a model of the environment is the ability to plan ahead, which makes these methods more sample-efficient. In addition, MBRL is more robust to changes in the environment dynamics or rewards and has better exploratory behaviors. Recent progress in MBRL, especially for robotics, has shown that MBRL can be more efficient than MFRL. However, for most applications, it is challenging to learn an accurate model of the world. For this reason, model-free algorithms are preferred. However, in the second part of the tutorial, we argue that single-agent RL is not sufficient to model scalable and self-organizing systems often containing a considerable number of interconnected agents. This claim is justified since single-agent RL algorithms learn a decision-making rule for one entity without considering the existence of other entities that can impact its behavior. Thus, we will study extensions of both model-free and model-based single-agent approaches to multi-agent decision-making.

MARL is the generalization of single-agent RL that enables a set of agents to learn optimal policies using interactions with the environment and each other. Thus, MARL does not ignore the presence of the other agents during the learning process, which makes it a harder problem. Essentially, several challenges arise in the multi-agent case, such as i) non-stationarity of the system due to the agents simultaneously changing their behaviors; ii) a scalability issue, since the joint action space grows exponentially with the number of agents; iii) a partial observability problem, arising often in real-world applications where agents have access only to partial information of the system; and iv) privacy and security, which are also core challenges of the deployment of MARL systems in real-world scenarios. More details on these issues will be presented in Section V-A.

MARL is generally formulated as a Markov Game (MG), also called a Stochastic Game (SG). MGs generalize Markov Decision Processes (MDPs), used to model single-agent RL problems, and repeated games from the game theory literature. In repeated games, the same players repeatedly play a given game called the stage game. Thus, repeated games consider a stateless static environment, and the agents' utilities are only impacted by the interactions between agents. This is a crucial limitation of normal-form game theory frameworks to model multi-agent problems. MGs remedy this shortcoming by considering a dynamic environment impacting the agents' rewards. MGs can be classified into three families: fully cooperative, fully competitive, or mixed. Fully cooperative scenarios assume that the agents have the same utility or reward function, whereas fully competitive settings involve agents with opposite goals, often known as zero-sum games. The mixed setting covers the general case where no restriction on the rewards is considered. This is also referred to as general-sum games. In this paper, we focus on fully cooperative MARL and consider MGs as the mathematical formalism to model such problems. However, MGs handle only problems with full observability or perfect information. Other extensions to model partial observability will be discussed as well.

Because fully cooperative agents share the same reward function, they are obliged to choose optimal joint actions. In this context, coordination between agents is crucial in selecting optimal joint strategies. To illustrate the importance of coordination, we consider the example from [6]. Let us examine a scenario with two agents at a given state of the environment where they can choose between two actions a1
of cooperation or coordination in the network since the agent considers all the other agents as a part of the environment. For this reason, we highlight the importance of MARL, particularly cooperative MARL, in the development of scalable and decentralized systems for 6G networks. In this context, [17] showcases the potential applications of MARL to build decentralized and scalable solutions for vehicle-to-everything problems. In addition, the authors in [18] provide an overview of the evolution of cooperative MARL with an emphasis on distributed optimization. Our work does not only consider cooperative MARL but also MBRL as enabling techniques for future 6G networks, and we focus on delivering a more applied perspective of MARL to solve wireless communication problems. Table II summarizes the existing surveys on DRL and 6G and highlights the key differences compared to our work.

TABLE II: Summary of existing surveys on DRL and MARL for 5G and beyond wireless networks

TABLE I: Summary of notations and symbols

S, A, O        State, action, and observation spaces
A, O           Joint action and observation spaces
R, P           Reward and transition functions, respectively
H              Episode horizon or length of a trajectory
D              Replay buffer
γ              Discount factor
π∗             Agent's optimal policy
b(s)           Belief state of a state s ∈ S
π_θ            Parametrized policy with parameters θ
π              Joint policy of multiple agents
Q^π, V^π       Q-/V-function under the policy π
Q_φ            Parameterized Q-function with parameters φ
Q̂, V̂, π̂        Approximate Q-/V-function and policy
Q̄              Target Q-network
J              Infinite-horizon discounted return
Adv            Advantage function
C. Contributions and Organization of this Paper
The main contributions of this paper can be summarized as
follows:
• We provide a comprehensive tutorial on single-agent DRL frameworks. Model-free RL (MFRL) is based on learning whereas MBRL is based on planning. To the best of our knowledge, this is the first initiative to present MBRL fundamentals and potentials in future 6G networks. Recent developments in the MBRL literature render these methods appealing for their sample efficiency (which is measured in terms of the minimum number of samples required to achieve a near-optimal policy) and their adaptation capabilities to changes in the environment;
• We present different MARL frameworks and summarize relevant MARL algorithms for 6G networks. In this work, we focus on the following: emergent communication, where agents can learn to establish communication protocols to share information; learning cooperation, which details different algorithms to learn collaborative behaviors in a decentralized manner; and networked agents, to enable cooperation between heterogeneous agents with limited shared information;
• We also review the literature on applications of MARL in several enabling technologies for 6G networks such as MEC and control of aerial (e.g. drone-based) networks, beamforming in cell-free massive Multiple-Input Multiple-Output (MIMO) communications, spectrum management in Heterogeneous Networks (HetNets) and in THz communications, and distributed deployment and control of intelligent reflecting surface (IRS)-aided wireless systems;
• We present open research directions and challenges related to the deployment of efficient, scalable, and decentralized algorithms based on RL.

The rest of the paper is organized as follows. In Section II, we introduce the mathematical background for both single-agent RL and MARL. Standard algorithms for single-agent RL are reviewed in Section III. In Section IV, we introduce MBRL and detail potential applications for 6G systems. Section V first summarizes the different challenges of MARL and afterward dwells on the cooperative MARL algorithms according to the type of cooperation they address. Section VI is dedicated to recent contributions of the mentioned algorithms in several wireless communication problems, followed by a conclusion and future research directions outlined in Section VII. A summary of key notations and symbols is given in Table I.

II. BACKGROUND

The objective of this section is to present the mathematical background and preliminaries for both single-agent and multi-agent RL.

A. Single-Agent Reinforcement Learning

1) Markov Decision Process

In RL, a learning agent interacts with an environment to solve a sequential decision-making problem. Fully observable environments are modeled as MDPs defined as a tuple (S, A, P, R, γ). S and A define the state and the action spaces respectively; P : S × A × S → [0, 1] denotes the probability of transiting from a state s to a state s' after executing an action a; R : S × A × S → ℝ is the reward function that defines the agent's immediate reward for executing an action a at a state s and resulting in the transition to s'; and γ ∈ [0, 1] is a discount factor that trades off the immediate and upcoming rewards. The full observability assumption of MDPs enables the agent to access the exact state of the system s at every time step t. Given the state s, the agent decides to take an action a, transiting the system to a new state s' sampled from the probability distribution P(·|s, a). The agent is rewarded with an immediate compensation R(s, a, s'). Thus, the agent's expected return is expressed as E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 ]. This is referred to as the infinite-horizon discounted return. Another popular formulation is the undiscounted finite-horizon return E[ Σ_{t=0}^{H} R(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 ], where the return is computed over a finite horizon H. This is common in episodic tasks (i.e. tasks that have an end). Note that the finite-horizon setting can be viewed as an infinite-horizon case by
augmenting the state space with an absorbing state transiting continuously to itself with zero rewards.
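As a rough illustration of the discounted return defined above, the following NumPy sketch samples trajectories in a randomly generated toy MDP under a uniform policy and averages their discounted returns. The transition and reward tables, the horizon truncation, and all numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy MDP with 3 states and 2 actions (illustrative numbers only).
n_states, n_actions, gamma, H = 3, 2, 0.9, 20
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = distribution over s'
R = rng.normal(size=(n_states, n_actions, n_states))              # R[s, a, s'] = immediate reward

def rollout_return(policy, s0):
    """Sample one trajectory under a stochastic policy and accumulate
    the discounted return sum_t gamma^t R(s_t, a_t, s_{t+1})."""
    s, ret = s0, 0.0
    for t in range(H):  # finite-horizon truncation of the infinite sum
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        ret += (gamma ** t) * R[s, a, s_next]
        s = s_next
    return ret

uniform_policy = np.full((n_states, n_actions), 1.0 / n_actions)
returns = [rollout_return(uniform_policy, s0=0) for _ in range(1000)]
print("Monte Carlo estimate of the expected return:", np.mean(returns))
```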
The agent aims to find an optimal policy π∗, a mapping from the environment states to actions, that maximizes the expected return. A policy or a strategy describes the agent's behaviour at every time step t. A deterministic policy returns the action to be taken in each perceived state. On the other hand, a stochastic policy outputs a distribution over actions. Under a given policy π, we can define a value function or a Q-function which measures the expected accumulated rewards starting from any given state s_t or any pair (s_t, a_t) and following the policy π, as shown below:

V^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 = s ],

Q^π(s, a) = E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 = s, a_0 = a ].

Using Dynamic Programming (DP) methods such as Value Iteration or Policy Iteration [19] to solve an MDP mandates that the dynamics of the environment (P and R) are known, which is often not possible. This motivates the model-free RL approaches that find the optimal policy without knowing the world's dynamics. MBRL methods learn a model of the environment by estimating the transition function and/or the reward function and use the approximate model to learn or improve a policy. Model-free RL algorithms are discussed in detail in Section III and MBRL is further investigated in Section IV.

Example: Several wireless problems have been formulated as MDPs. As an example, [20] presented the downlink power allocation problem in a multi-cell environment as an MDP. The agent is an ensemble of K base stations (or a controller for K base stations). The state space consists of the users' channel quality and their localization with respect to a given base station. The agent selects the power levels for the K base stations to maximize the entire network throughput.

2) Partially Observable Markov Decision Process

In the previous section, it was assumed that the agent has access to the full state information. However, this assumption is violated in most real-world applications. For example, IoT devices collect information about their environments using sensors. The sensor measurements are noisy and limited, hence the agent will only have partial information about the world. Several problems such as perceptual aliasing prevent the agent from knowing the full state information using the sensors' observations. In this context, Partially Observable Markov Decision Processes (POMDPs) generalize the MDP framework to take into account the uncertainty about the state information. A POMDP is described by a 7-tuple (S, A, P, R, γ, O, Z) where the first five elements are the same as defined in §II-A1; O is the observation space and Z : S × A × O → [0, 1] denotes the probability distribution over observations given a state s ∈ S and an action a ∈ A. To solve a POMDP, we distinguish two main approaches. The first one is history-based methods, where the agent maintains an observation history H_t = {o_1, . . . , o_{t−1}} or an action-observation history H_t = {(o_0, a_0), . . . , (o_{t−1}, a_{t−1})}. The history is used to learn a policy π(·|H_t) or a Q-function Q(H_t, a_t). As an analogy with MDPs, the agent state becomes the history H_t. As a result, this method has a large state space, which can be alleviated by using a truncated version of the history (i.e. a k-order Markov model). However, limited histories also have caveats, since long histories need more computational power and short histories suffer from possible information loss. It is not straightforward how the value of k is chosen. Another way to avoid the increasing dimension of the history with time is by defining the notion of a belief state b_t(s) = p(s|H_t), ∀s ∈ S, as a distribution over states. Thus, the history H_t is indirectly used to estimate the probability of being at a state s. Therefore, a Q-function Q(b(s), a) or a policy π(·|b(s)) (see Table I) can be learned using the belief states instead of the history. If the POMDP is known, the belief states are updated using Bayes' rule. Otherwise, a Bayesian approach can be considered. Another approach to solve a POMDP is Predictive State Representation [21]. The main idea consists in predicting what will happen in the future instead of relying on past actions and observations.
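A minimal sketch of the Bayes-rule belief update mentioned above is given below, assuming a known POMDP. The tiny transition and observation tables are made-up illustrative values.

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """One Bayes-rule belief update for a known POMDP.

    b: current belief over states, shape (S,)
    a: action index taken
    o: observation index received
    P: transition probabilities, P[s, a, s']
    Z: observation probabilities, Z[s', a, o]
    Returns the posterior belief b'(s') proportional to
    Z(s', a, o) * sum_s P(s, a, s') * b(s).
    """
    predicted = b @ P[:, a, :]          # sum_s b(s) P(s, a, s')
    unnormalized = Z[:, a, o] * predicted
    return unnormalized / unnormalized.sum()

# Tiny illustrative POMDP (2 states, 1 action, 2 observations).
P = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])   # P[s, a, s']
Z = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])   # Z[s', a, o]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, o=1, P=P, Z=Z))
```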
Example: The POMDP formulation has been applied to solve different wireless problems characterized by partial access to the environment state. For example, [22] proposed a POMDP representation of dynamic task offloading in IoT fog systems. The authors assume that the IoT devices have imperfect Channel State Information (CSI). In this scenario, the agent is the IoT device, and based on the estimated CSI and the queue state, it decides which tasks are to be executed locally or offloaded. POMDPs are widely used in wireless sensor networks.

B. Multi-Agent Reinforcement Learning

MARL tackles sequential decision-making problems involving a set of agents. Hence, the system dynamics is influenced by the joint action of all the agents. More intuitively, the reward received by an agent is no longer a function of its own actions but a function of all the agents' actions. Therefore, to maximize the long-term reward, an agent should take into consideration the policies of the other agents. In what follows, we will present the mathematical background for MARL. Please refer to Section VI for examples on how to formulate wireless communication problems using the mathematical frameworks discussed below.

1) Markov/Stochastic Games

MGs or SGs [23] extend the MDP formalism to the multi-agent setting to take into account the relation between agents [24]. Let N > 1 be the number of agents, S be the state space, and A_i denote the action space of agent i. The joint action space of all agents is given by A := A_1 × · · · × A_N. From now on, we will use bold and underlined characters to differentiate between joint and individual functions.

At a state s, each agent i selects an action a_i and the joint action a = [a_i]_{i∈N} will be executed in the environment. The transition from the state s to the new state s' is governed by the transition probability function P : S × A × S → [0, 1]. Each agent i will receive an immediate reward r_i defined by the reward function R_i : S × A × S → ℝ. Therefore, the MG is
formally defined by the tuple (N, S, (A_i)_{i∈N}, P, (R_i)_{i∈N}, γ), where γ is a discount factor. Note that the transition and reward functions in an MG depend on the joint action space A. Each agent i seeks to find the optimal policy π_i∗ : S → A_i that maximizes its long-term return. The joint policy π of all agents is defined as π(a|s) = Π_{i∈N} π_i(a_i|s). Hence, the value function of an agent i is defined as follows:

V_i^π(s) = E_π[ Σ_{t=0}^{∞} γ^t R_i(s_t, a_t, s_{t+1}) | a_t ∼ π(·|s_t), s_0 = s ].

Consequently, the optimal policy of agent i is a function of its opponents' policies π_{−i}.

The complexity of MARL systems arises from this property because the other agents' policies are non-stationary and change during learning. See Section V-A for a detailed discussion of MARL challenges. As mentioned before, we distinguish three solution concepts for MGs: fully cooperative, fully competitive, and mixed. In fully cooperative settings, all the agents have the same reward function R_i = R and hence the same value or state-action function. Fully cooperative MGs are also referred to as Multi-agent MDPs (MMDPs). This simplifies the problem since standard single-agent RL algorithms can be applied if all the agents are coordinated by a central unit. On the other hand, fully competitive MGs (Σ_i R_i = 0) and general-sum MGs (Σ_i R_i ∈ ℝ) are addressed by searching for a Nash Equilibrium (NE). We will focus in the subsequent sections on extensions of MGs for cooperative problems.

Example: MGs are the most straightforward generalization of single-agent wireless problems to the multi-agent scenario. As an example, the problem of field coverage by a team of UAVs is modeled as an MG in [25].

In fact, each agent needs full information about the other agents' actions to maximize its long-term rewards. Consequently, the uncertainty about the other opponents, in addition to the state uncertainty, calls for an extension of the MG framework to model cooperative agents under partially observable environments. In this context, the Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [26] is the adopted mathematical framework to study cooperative sequential decision-making problems under uncertainty. This is a direct generalization of POMDPs to the multi-agent setting. A Dec-POMDP is described as (N, S, (A_i)_{i∈N}, P, R, γ, (O_i)_{i∈N}, Z), where the first six elements are the same as defined in §II-B1; R is a global reward function shared by all the agents; O_i is the observation space of agent i, with O := O_1 × · · · × O_N the joint observation space; and Z : S × A × O → [0, 1] is the observation function, which provides the probability P(o|a, s') of the agents observing o = [o_1, . . . , o_N] after executing a joint action a ∈ A and transiting to a new state s' ∈ S. A Dec-POMDP is a specific case of the Partially Observable Stochastic Game (POSG) [27], defined as a tuple (N, S, (A_i)_{i∈N}, P, (R_i)_{i∈N}, γ, (O_i)_{i∈N}, Z) where all the elements are the same as in a Dec-POMDP except the reward function R_i, which becomes individual for each agent. POSGs enable the modeling of self-interested agents, whereas Dec-POMDPs exclusively model cooperative agents in partially observable environments. At a state s, each agent i receives its own observation o_i without knowing the other agents' observations. Thus, each agent i chooses an action a_i, yielding a joint action to be executed in the environment. Based on a common immediate reward, each agent strives to find a local policy π_i : O_i → A_i that maximizes the team long-term reward. Thus, the joint policy is given by π = [π_1, . . . , π_N]. The policy π_i is called local because each agent acts according to its own local observations without communicating or sharing information with the other agents.

Example: Multi-agent task offloading [28] and multi-agent cooperative edge caching [29] are wireless problems which can be modeled as Dec-POMDP problems.
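To make the joint-action dynamics of the cooperative formalisms above concrete, here is a minimal sketch of a two-agent, fully cooperative Markov game with a shared team reward. The environment class, its random transition/reward tables, and the collapse to a fully observed shared state are illustrative simplifications, not a faithful Dec-POMDP implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

class TwoAgentMarkovGame:
    """Toy fully cooperative Markov game: 2 agents, 2 states, 2 actions each.
    Both agents receive the same reward (MMDP-style shared objective)."""

    def __init__(self):
        self.n_states = 2
        self.state = 0
        # Joint-action indexed transition and reward tables: index = (s, a1, a2).
        self.P = rng.dirichlet(np.ones(self.n_states), size=(2, 2, 2))
        self.R = rng.normal(size=(2, 2, 2))

    def step(self, a1, a2):
        s = self.state
        s_next = rng.choice(self.n_states, p=self.P[s, a1, a2])
        reward = self.R[s, a1, a2]          # shared team reward
        self.state = s_next
        # In a Dec-POMDP, each agent would receive only a partial observation of s_next here.
        return s_next, reward

env = TwoAgentMarkovGame()
for _ in range(3):
    print(env.step(a1=rng.integers(2), a2=rng.integers(2)))
```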
Henceforth, agents know their local and neighboring information and seek to learn an optimal joint policy by maximizing the team-average reward R̄(s, a, s') = (1/N) Σ_{i∈N} R_i(s, a, s') for any (s, a, s') ∈ S × A × S. To summarize, the advantages of Networked MGs compared to classical MGs are: (i) the possibility to model heterogeneous agents with different reward functions; (ii) the reduction of the coordination cost by considering neighbor-to-neighbor communication, which facilitates the design of decentralized MARL algorithms; and (iii) the privacy-preserving property, since agents are not mandated to share their reward functions.

Example: Networked MDPs can be applied in multiple wireless scenarios where agents are linked with a communication graph. For example, base stations in cell-free networks can collaborate to compute optimal beamforming while minimizing interference [30]. The communication graph will enable the base stations to share information with their neighbors. Thus, better collaboration is possible.
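The following short sketch illustrates the team-average objective and one round of neighbor-to-neighbor information exchange over a communication graph. The adjacency matrix, the mixing weights, and the local rewards are made-up examples, and the single averaging step only hints at the consensus-style updates used in networked MARL algorithms.

```python
import numpy as np

# Undirected communication graph over 4 base stations (illustrative adjacency;
# self-loops included so an agent keeps part of its own value).
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)      # row-stochastic mixing weights

local_rewards = np.array([1.0, 0.2, -0.5, 0.8])
team_average = local_rewards.mean()        # the quantity the networked agents try to maximize

# One neighbor-to-neighbor averaging (consensus) step: each agent only uses
# information from its neighbors, yet the estimates move toward the team average.
estimates = W @ local_rewards
print(team_average, estimates)
```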
III. SINGLE AGENT MODEL-FREE RL ALGORITHMS

A. Preliminaries

We start by defining useful notions and concepts for the understanding of the algorithms discussed below.

MFRL methods can be categorized into two classes depending on the agent's learned objective. In value-based methods, an approximate value function V̂ is learned and the agent's policy π is obtained by acting greedily with respect to V̂. Thus, state values are essential for action selection. Policy evaluation methods seek to learn an estimate of the value function V̂ = V^π for a given policy π. Alternatively, policy-based methods aim to directly learn a parameterized policy without resorting to a value function. A well-known variant of policy-based methods learns an approximation of the value function, but the action selection is still independent of the value estimates. These are the actor-critic methods, where approximations to both the policy and the value function are learned. The actor refers to the policy and the critic is the approximate value function. Henceforth, we will denote by θ the policy's parameters and by π_θ the policy parametrized by θ. In actor-critic methods, V_φ^{π_θ} denotes the approximate value function under the policy π_θ, where φ is a learnable parameter vector.

We distinguish between two main learning principles: (i) Monte Carlo (MC) and (ii) DP methods. The former methods utilize experience to approximate value functions and policies. In contrast, DP methods are known for solving the Bellman optimality equations. More details will be provided in the following sections. Temporal Difference (TD) is a famous combination of these two learning frameworks. Therefore, an important question arises when MC and TD methods are adopted: how are actions selected and samples generated for learning? This leads to the two key approaches for learning from experience, namely, off-policy and on-policy methods. Recall that the agent interacts with its environment by executing actions and afterwards improves or evaluates its policy using the collected data. Therefore, we can distinguish between two distinct processes: the policy used for data collection and the policy being improved or evaluated. The former is called the behavior policy and the latter is referred to as the target policy or control policy. In off-policy methods, the behavior policy is different from the target policy. However, in on-policy methods, the behavior and the target policies are the same, meaning that the policy used to collect the data samples is the same as the one being evaluated or improved. Thus, the notation V_φ^{π_θ} means that the value function V is learned using samples from the policy π_θ. If π_θ is the same as the policy the agent is learning, this is an on-policy algorithm. The advantage of an off-policy setting is the possibility to use a more exploratory behavior policy to continue to visit all the possible actions. This is why off-policy methods encourage exploration. The exploration-exploitation trade-off is a well-known challenge in RL: the agent can exploit the knowledge from its past experiences to choose actions with the highest expected rewards, but it also needs to explore other actions to improve its action selection.
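A classic example of such an exploratory behavior policy is ε-greedy action selection, sketched below with made-up Q-value estimates.

```python
import numpy as np

rng = np.random.default_rng(3)

def epsilon_greedy(q_values, epsilon=0.1):
    """Exploratory behavior policy: with probability epsilon pick a random action,
    otherwise exploit the action with the highest estimated value."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

q_estimates = np.array([0.1, 0.5, 0.3])
actions = [epsilon_greedy(q_estimates, epsilon=0.2) for _ in range(10)]
print(actions)
```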
Another key distinction between RL learning frameworks is the use of bootstrapping. The general idea of bootstrapping is that estimated values of states are updated based on estimates of the values of the next states. DP and TD methods use bootstrapping, whereas MC algorithms rely on actual complete episodic returns.

Furthermore, another dimension to consider while designing an RL algorithm is how to represent the approximate value function. In the tabular setting, a table of state values is maintained and updated for every visited state s. Function approximators have enabled the recent revolution in RL thanks to their generalization power with high-dimensional state data. For example, DNNs are famous non-linear function approximators used to compute value functions or policies.

As mentioned above, the source of the training data is crucial for learning. On one hand, batch/offline RL considers that the agent is provided with a dataset of interactions and learns a policy using the given dataset without interacting with the environment. On the other hand, the agent can collect data by querying the real environment or a simulator. This is referred to as online RL.

Figure 4 summarizes the different categorizations of RL methods. In this tutorial, we will focus on both online policy-based and value-based methods with DNNs as function approximators. Table IV provides a comparative overview of the algorithms discussed below. This review of MFRL methods is not exhaustive since several resources with in-depth descriptions are already available (i.e. in [19], [9]).

Fig. 4: Categorization of different RL settings. The classes colored in blue are covered in Section II.

B. Policy-Based Algorithms

Policy-based methods directly search for the optimal policy by maximizing the agent's expected long-term reward J as in (1). The policy is parameterized by a function approximator π_θ(a|s), typically a DNN with learnable weights θ. The Policy Gradient (PG) methods, introduced in [31], learn the optimal parameters θ∗ by performing gradient ascent on the objective J. Using the PG theorem [31], the policy gradients are expressed as in (2) and estimated using samples or trajectories collected under the current policy. This is why PG methods
are on-policy methods. For each gradient update, the agent needs to interact with the environment and collect trajectories. Samples collected at iteration k cannot be reused for the next policy update. This sample inefficiency represents one of the major drawbacks of PG methods.

J(θ) = E_{π_θ}[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ].    (1)

∇_θ J(θ) = E_{π_θ}[ Σ_{t=0}^{T} ∇_θ log π_θ(a_t|s_t) Q^{π_θ}(s_t, a_t) ].    (2)

In (2), Q^{π_θ} is not known and needs to be estimated. Several approaches are possible. The well-known REINFORCE algorithm [32] uses the rewards-to-go defined as Σ_{k=t}^{T} R(s_k, a_k). The major caveat of the REINFORCE algorithm is that it is well-defined for episodic problems only, since the rewards-to-go are computed at the end of an episode. Furthermore, the REINFORCE algorithm suffers from high variance. In (2), action likelihoods are multiplied by their expected return; thus the PG algorithm shifts the action distribution such that good actions are more likely than bad ones. Consequently, small variations in the returns can lead to a completely different policy. This motivates actor-critic methods, where an approximation of Q^{π_θ} is learned. Note that it is also possible to estimate the value function V^{π_θ} or the advantage function Adv^{π_θ} = Q^{π_θ} − V^{π_θ}. Learning a critic reduces the variance of gradient estimates since different samples are used, whereas in the rewards-to-go only one sample trajectory is considered. However, actor-critic methods introduce bias since the Q^{π_θ} estimate can be biased as well. In this context, [33] proposed Generalized Advantage Estimation based on the idea of n-step returns to reduce the bias. In what follows, we will examine the most common policy gradient algorithms.
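Before turning to specific algorithms, the following self-contained sketch shows a REINFORCE-style estimate of (2) for a softmax policy over three actions in a stateless toy problem. The reward means, step sizes, and episode counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Softmax policy over 3 actions for a single (featureless) state: pi(a) ∝ exp(theta[a]).
theta = np.zeros(3)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # d/dtheta log pi(a) = one_hot(a) - pi for a softmax policy.
    pi = softmax(theta)
    g = -pi
    g[a] += 1.0
    return g

# Hypothetical reward distribution: action 2 is best on average.
true_means = np.array([0.0, 0.5, 1.0])

alpha, episodes, T = 0.05, 2000, 5
for _ in range(episodes):
    pi = softmax(theta)
    actions = rng.choice(3, size=T, p=pi)
    rewards = rng.normal(true_means[actions], 1.0)
    # REINFORCE: sum_t grad log pi(a_t) * reward-to-go from step t.
    grad = np.zeros_like(theta)
    for t in range(T):
        grad += grad_log_pi(theta, actions[t]) * rewards[t:].sum()
    theta += alpha * grad / T
print(softmax(theta))   # probability mass should concentrate on action 2
```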
Asynchronous Advantage Actor-Critic (A3C) [34] proposes a parallel implementation of the actor-critic algorithm. In the original version of A3C, a global NN outputs the action probabilities and an estimate of the agent's advantage function. Thus, the actor and the critic share the network layers. Several workers are instantiated with local copies of the global network parameters and the environment. These workers are created as CPU threads in the same machine. In parallel, each worker interacts with its local environment and collects experiences to estimate the gradients with respect to the network parameters. Afterward, the worker propagates its gradients and updates the parameters of the global network. Therefore, the global model is constantly updated by the workers. This learning scheme enables the collection of more diverse experiences since each worker interacts with its local copy of the environment independently. The drawback of the asynchronous training scheme is that some workers will be using old versions of the global network. In this context, Advantage Actor-Critic (A2C) adopts a synchronous and deterministic implementation where all the workers' gradients are aggregated and averaged to update the global network.

As mentioned before, PG algorithms suffer from sample inefficiency since only one gradient update is performed per batch of collected data. This motivates the goal to use the data more efficiently. Besides, it is hard to pick the learning rate since it affects the training performance and can dramatically alter the visitation distribution. Intuitively, a high learning rate can result in a bad policy update, which means that the next batch of data is collected using a bad policy. Recovering from a bad policy update is not guaranteed. This motivates Trust Region Policy Optimization (TRPO) [35], where the original optimization problem in (1) is solved under the constraint of ensuring that the new updated policy is close to the old one. To do so, the constraint is defined in terms of the Kullback–Leibler divergence (D_KL), which measures the difference between two probability distributions. More formally, let θ_k be the policy parameters at iteration k. We would like to find the new parameters θ_{k+1} such that

θ_{k+1} = arg max_θ L(θ) = arg max_θ E_{(s,a)∼π_{θ_k}}[ (π_θ(a|s) / π_{θ_k}(a|s)) Adv^{π_{θ_k}}(s, a) ]
s.t. D_KL(θ || θ_k) ≤ δ,
where δ is the trust region radius. Let F be the Fisher information matrix. With a first-order approximation of the objective (L(θ) ≈ ∇_θ L(θ)^T (θ − θ_k)) and a second-order Taylor expansion of the constraint (D_KL(θ||θ_k) ≈ (1/2) (θ − θ_k)^T F (θ − θ_k)), the update rule is given by

θ_{k+1} = θ_k + sqrt( 2δ / (∇_θ L(θ)^T F^{-1} ∇_θ L(θ)) ) F^{-1} ∇_θ L(θ).

The term F^{-1} ∇_θ L(θ) is called the natural gradient. Consequently, evaluating the natural gradients necessitates inverting the matrix F, which is expensive. To overcome this issue, TRPO implements the conjugate gradient algorithm to solve the system F x = ∇_θ L(θ), which involves evaluating F x instead. Finally, the matrix-vector product F x is computed as ∇_θ(∇_θ D_KL(θ||θ_k)^T x), which is easy to evaluate using any auto-differentiation library like TensorFlow. In a similar vein, the Proximal Policy Optimization (PPO) [36] algorithm solves the same optimization problem as TRPO but proposes a simpler implementation by introducing a new loss function:

L_PPO(θ) = min( (π_θ(a|s)/π_{θ_k}(a|s)) Adv^{π_{θ_k}}(s, a), clip(π_θ(a|s)/π_{θ_k}(a|s), 1 − δ, 1 + δ) Adv^{π_{θ_k}}(s, a) ),

where "clip" is a function used to keep the value of the ratio π_θ(a|s)/π_{θ_k}(a|s) between 1 − δ and 1 + δ to penalize the new policy if it gets far from the old policy.
Example: The advantage of policy-based algorithms is that they are applicable to both discrete and continuous action spaces. For example, [37] applies the TRPO algorithm to find optimal routing strategies in a network.

C. Value-Based Algorithms

As explained above, value-based algorithms focus on estimating the agent's value function. Thus, the policy is computed implicitly or greedily with respect to the approximate value function.

The MC method approximates the value of a state s by averaging the rewards obtained after visiting the state s until the end of an episode. Consequently, MC methods are defined only for episodic tasks. In DP, the optimal value function V∗ and state-action function Q∗ are computed by solving the Bellman optimality equations (3)-(4), and thus the optimal policy is obtained greedily with respect to the Q-values (5):

V∗(s) = max_a ( R(s, a) + γ E_{s'}[ V∗(s') ] ),    (3)
Q∗(s, a) = R(s, a) + γ E_{s'}[ max_{a'} Q∗(s', a') ],    (4)
π∗(s) = arg max_a Q∗(s, a).    (5)

Let B : ℝ^{|S×A|} → ℝ^{|S×A|} denote the Bellman optimality operator such that [BQ](s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) max_{a'} Q(s', a'). Therefore, equation (4) can be written in a more compact way as Q∗ = BQ∗. As a result, Q∗ is called the fixed point of the Bellman optimality operator, and the methods solving for this fixed point can be called fixed-point methods. The value iteration algorithm approximates Q∗ by iteratively applying the Bellman optimality operator: Q̂_k = B Q̂_{k−1}. This algorithm is guaranteed to converge to Q∗ since B is a contraction and Q∗ always exists and is unique. Besides, value iteration relies on bootstrapping to estimate the value of next states. However, to evaluate the Bellman operator, the transition function is needed. This is the major drawback of DP methods, which assume the environment dynamics are known. To overcome this issue, TD methods combine the main ideas of MC and DP. They use experience as in MC and bootstrapping as in DP. The update rule of the TD algorithm is as follows:

Q̂(s, a) = (1 − α) Q̂(s, a) + α ( R(s, a) + γ max_{a'} Q̂(s', a') ),    (6)

where V̂ and Q̂ are the approximate value and state-action functions and α is a learning rate.

TD methods can be on-policy or off-policy. Let π̂ be the policy derived from Q̂ (i.e. ε-greedy). For on-policy TD, the samples used to estimate Q̂ are generated using the current policy π̂, continuously updated greedily with respect to Q̂. SARSA is a well-known on-policy TD algorithm where the agent collects experiences in the form {(s, a, r, s', a')}. Since the action in the next state is known, the max operator in the RHS of the TD update (6) is removed.
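The contrast between the Q-learning and SARSA targets in (6) fits in a few lines of tabular code; the sketch below uses arbitrary state/action indices and rewards purely for illustration.

```python
import numpy as np

def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, a_next=None):
    """Tabular TD update of equation (6).
    Q-learning uses the max over next actions; SARSA replaces it with the
    action actually taken in the next state (a_next)."""
    if a_next is None:                      # Q-learning target
        target = r + gamma * Q[s_next].max()
    else:                                   # SARSA target
        target = r + gamma * Q[s_next, a_next]
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q

Q = np.zeros((3, 2))                        # 3 states, 2 actions
Q = td_update(Q, s=0, a=1, r=1.0, s_next=2)            # off-policy (Q-learning) step
Q = td_update(Q, s=2, a=0, r=0.5, s_next=1, a_next=1)  # on-policy (SARSA) step
print(Q)
```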
The Q-learning algorithm [38] revolutionized the RL world by allowing the development of an off-policy TD algorithm. Any policy π̃ ≠ π̂ can be used to generate experiences. When Q̂ is represented with a function approximator with parameters φ, the Q-learning algorithm minimizes the Bellman error (7) and updates the Q-function parameters as in (9). The Bellman error is not a contraction anymore, thus the convergence guarantees discussed earlier are no longer valid. Equation (8) defines the targets. Note that the update rule in (9) does not consider the full gradient of the Bellman error since it ignores the gradients of the targets with respect to the parameters φ. This is why this learning algorithm is also called a semi-gradient method [19].

φ∗ = arg min_φ (1/2) Σ_{(s,a,r,s')} || Q̂_φ(s, a) − y ||².    (7)
y = R(s, a) + γ max_{a'} Q̂_φ(s', a').    (8)
φ = φ − α Σ_{(s,a,r,s')} ∇_φ Q̂_φ(s, a) ( Q̂_φ(s, a) − y ).    (9)

Deep versions of the Q-learning methods such as Deep Q-Networks (DQN) [39] have been developed. In particular, the Q-function is parameterized using a DNN with weights φ. DQN and its variants are the most popular online Q-learning algorithms and have shown impressive results in several communication applications. To stabilize learning with DNNs, DQN introduces techniques such as an experience replay buffer D = {(s, a, r, s')}, to avoid correlated samples and get a better gradient estimation, and a target network Q̄, whose parameters φ' are periodically updated with the most recent φ, making the targets (8) stationary so that they do not depend on the learned parameters. The targets are then computed as follows:

y_DQN = R(s, a) + γ max_{a'} Q̄_{φ'}(s', a').    (10)
y_DDQN = R(s, a) + γ Q̂_2(s', arg max_{a'} Q̂_1(s', a')).    (11)
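As a rough illustration of how the targets in (10) and (11) differ, the sketch below uses hard-coded arrays standing in for the outputs of an online network and a target network (mapping the two networks of (11) to the online/target pair is a simplifying assumption of this example, not a prescription from [39]).

```python
import numpy as np

gamma = 0.99

# Hypothetical Q-value predictions for a batch of 2 next states and 3 actions.
q_target_net = np.array([[1.0, 2.0, 0.5],    # Q-bar_{phi'}(s', .)
                         [0.2, 0.1, 0.9]])
q_online_net = np.array([[0.8, 2.5, 0.4],    # Q-hat_{phi}(s', .)
                         [0.3, 0.0, 1.1]])
rewards = np.array([1.0, -0.2])

# DQN target (10): bootstrap with the max of the target network.
y_dqn = rewards + gamma * q_target_net.max(axis=1)

# Double-DQN-style target (11): one network selects the action,
# the other evaluates it, which reduces the maximization bias.
best_actions = q_online_net.argmax(axis=1)
y_ddqn = rewards + gamma * q_target_net[np.arange(2), best_actions]

print(y_dqn, y_ddqn)
```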
Although Q-learning algorithms are sample efficient, they lack convergence guarantees for non-linear function approximators and also suffer from the maximization bias, which results in an unstable learning process. The max operator in equation (8) makes the algorithm overestimate the true Q-values and is also problematic for continuous action spaces. To overcome the maximization bias, the double learning technique uses two networks Q̂_1 and Q̂_2 with different parameters φ_1 and φ_2 and decouples the action selection from the action evaluation, as shown in (11). Other variants of DQN have been suggested to guarantee a better convergence. Prioritized experience replay is proposed in [40] to ensure that the sampling of rare and task-related experiences is more frequent than that of redundant ones. The Dueling Network [41] suggests separating the Q-function estimator into two separate networks: a V̂ network to estimate the state values and an Âdv network to approximate the state-dependent action advantages.

Example: DQN and its variants have become the go-to RL algorithms to solve wireless problems with discrete action spaces. If the action space is continuous, researchers typically propose a careful discretization scheme to fit into the Q-learning framework. As an example, [42] solves the dynamic multichannel access problem in wireless networks using DQN. Table II lists different references that used DQN in multi-agent scenarios.

D. Deterministic Policy Gradient (DPG) Algorithms

The max operator in (8) limits the Q-learning algorithms to discrete action spaces (see Table IV). In fact, when the action space is discrete, it is tractable to compute the maximum of the Q-values. However, when the action space becomes continuous, finding the maximum involves a costly optimization problem. In this context, DPG algorithms [43] can be considered as an extension of Q-learning to continuous action spaces by replacing max_{a'} Q_φ(s', a') in (8) by Q_φ(s', µ_θ(s')), where µ_θ(s') = arg max_{a'} Q_φ(s', a'). Thus, DPG algorithms concurrently learn a Q-function Q_φ and a policy µ_θ. The Q-function is learned by minimizing the Bellman error as explained in the previous section. As for the policy, the objective is to learn a deterministic policy that outputs the action corresponding to max_a Q_φ(s, a). The policy is called deterministic because it gives the exact action to take at each state s. Hence, the learning process consists in performing gradient ascent with respect to θ to solve the objective in (12):

J(θ) = E_{s∼D}[ Q_φ(s, µ_θ(s)) ].    (12)
∇_θ J(θ) = E_{s∼D}[ ∇_θ µ_θ(s) ∇_a Q_φ(s, a)|_{a=µ_θ(s)} ].    (13)

Note that in the gradient formula above, the term ∇_a Q_φ(s, a) requires a continuous action space. Therefore, one drawback of the DPG methods is that they cannot be applied to discrete action spaces. Observe as well that in (12), the state is sampled from a replay buffer D, which means that DPG algorithms are off-policy.
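The actor update in (13) can be illustrated with a one-dimensional toy problem in which the critic has a known quadratic form, so ∇_a Q is available in closed form. The critic shape, the linear policy class, and all hyperparameters are assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative critic with a known form so that grad_a Q is available analytically:
# Q(s, a) = -(a - 2*s)^2, i.e. the best action for state s is a* = 2*s.
def grad_a_Q(s, a):
    return -2.0 * (a - 2.0 * s)

# Deterministic linear policy mu_theta(s) = theta * s, so grad_theta mu = s.
theta = 0.0
alpha = 0.05
replay_states = rng.uniform(-1.0, 1.0, size=1000)   # states sampled from a replay buffer

for _ in range(200):
    batch = rng.choice(replay_states, size=32)
    actions = theta * batch
    # DPG update (13): grad_theta J = E[ grad_theta mu(s) * grad_a Q(s, a)|_{a=mu(s)} ]
    grad_theta = np.mean(batch * grad_a_Q(batch, actions))
    theta += alpha * grad_theta

print(theta)   # should approach 2.0, the optimal linear policy coefficient
```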
Deep Deterministic Policy Gradient (DDPG) [44] has been widely used to solve wireless communication problems. It is an extension of the DPG method where the policy µ_θ and the critic Q_φ are both DNNs. Recently, two variants of the DDPG algorithm have been proposed: Twin Delayed DDPG (TD3) [45] addresses the overestimation problem discussed in §III-C, whereas in Soft Actor-Critic (SAC) [46] the agent maximizes not only its expected return but also the policy entropy to improve robustness and stability.

Example: DDPG has been widely used in continuous wireless problems. In [47], DDPG is applied in energy harvesting wireless communications. Similarly, Table II lists several references using DDPG agents to solve multi-agent wireless problems such as computation offloading, edge caching, UAV management, etc.

E. Theoretical Analysis and Challenges of RL

After reviewing the key MFRL algorithms, we will discuss selected theoretical problems in both policy-based and value-based methods. First, we will study the instability problems in value-based methods with function approximators. Stability means that the error Q̂ − Q∗ gets smaller when the number of iterations increases. As mentioned in the previous section, the stability of the tabular Q-learning algorithm is guaranteed thanks to the contraction property of the Bellman operator. However, the contraction property is not satisfied when the Bellman error is minimized. Next, we focus on the convergence rate and sample complexity of policy-based methods. Sample complexity is defined as the minimal number of samples or transitions needed to estimate Q∗ and achieve a near-optimal policy, and the convergence rate determines how fast the learned policy converges to the optimal solution. Finally, we examine the interpretability issue of DRL methods, which is a crucial challenge towards the application of DRL in real-world problems.

a) Instability of off-policy TD learning with function approximators

In tabular RL, value-based RL methods are guaranteed to converge to the optimal value function, which is the fixed point of the Bellman optimality equation. This fundamental result is justified by the contraction property of the Bellman operator. Hence, successive applications of the Bellman operator converge to the unique fixed point. However, these methods combined with function approximators suffer from instability and divergence. This is commonly referred to as the deadly triad: function approximation, bootstrapping, and off-policy training [48]. These three elements are important to consider in any RL method. Function approximation is crucial to handle large state spaces. Bootstrapping is known to learn faster than MC methods [19], thus this data efficiency is an advantage of using bootstrapping. Finally, off-policy learning enables exploration since the behavior policy is often more exploratory than the target policy. Therefore, trading off one of these techniques means losing either generalization power, data efficiency, or exploration. This is why several research efforts are dedicated to finding stable and convergent algorithms for off-policy learning with (nonlinear) function approximators and bootstrapping.
Fixed-point methods rely on reducing the Bellman error to learn an approximation of the optimal state-value or state-action functions. As a reminder, here is the expression of the objective function governed by the Bellman error:

L(Q̂) = E_{s,a}[ ( R(s, a) + γ E_{s'}[ max_{a'} Q̂(s', a') ] − Q̂(s, a) )² ].

To compute an unbiased estimate of the loss, two samples of the next state s' are needed because of the inner expectation. This is a well-known problem called the double sampling issue [49]. The implication of this issue is that a state s must be visited twice to collect two independent samples of s', which is impractical. To overcome this problem, different approaches have been adopted. The first approach reformulates the Bellman error as a saddle-point optimization problem where a new concave function ν : S × A → ℝ is introduced such that the loss becomes L(Q̂, ν) = 2 E_{s,a,s'}[ ν(s, a) ( Q̂(s, a) − R(s, a) − γ max_{a'} Q̂(s', a') ) ] − E_{s,a,s'}[ ν(s, a)² ] [50]. This is called a saddle-point problem since the loss is minimized with respect to the Q-function parameters and maximized with respect to the ν parameters. In this context, a recent work [51] proposed a convergent algorithm with nonlinear function approximators (e.g. NNs) and off-policy data.

Another approach to tackle this issue is to replace the Q-function in the inner expectation term with another target function. The choice of the target function can be an old version of the Q-function, as in the famous DQN algorithm, or the minimum of two target Q-functions, as in TD3 and SAC. This gives a theoretical explanation of the role of target networks in stabilizing Q-learning with DNNs.

b) Convergence of PG with neural function approximators

In this part, we will summarize the recent convergence results of PG algorithms with NNs as function approximators. Due to the non-convexity of the expected cumulative rewards in (1) with respect to the policy and its parameters, the analysis of the global optimality of stationary points is a hard problem. Besides, the policy gradients in (2) are obtained using sampling and, in practice, an approximate Q-function is learned to estimate the expected return. In finite-horizon tasks, estimating the Q-function using MC rollouts results in an unbiased approximation, but in a biased Q-function for the discounted infinite-horizon setting.

To tackle this bias issue, [31] introduces the compatible function approximation theorem requiring that the approximate Q-function Q_φ satisfy two conditions: (i) a compatibility condition with the policy π_θ given by ∇_φ Q_φ(s, a) = ∇_θ log π_θ(a|s), and (ii) Q_φ is learned to minimize E_{π_θ}[ (1/2)(Q^{π_θ}(s, a) − Q_φ(s, a))² ]. If these two conditions are verified, the estimates of the policy gradients are unbiased. Another approach, presented in [52], is called random-horizon PG and proposes to use rollouts with random geometric time horizons to unbiasedly estimate the Q-function in the infinite-horizon setting.

Armed with the advances in non-convex optimization, several research efforts ([53], [54], [55], [52]) propose variants of the REINFORCE algorithm with a rate of convergence to first- or second-order stationary points. [56] studies the non-asymptotic global convergence rate of PPO and TRPO algorithms parametrized with neural networks. These methods converge to the optimal policies at a rate of O(√(1/T)), where T is the number of iterations. In addition, the work in [57] establishes the global optimality and convergence rate of (natural) actor-critic methods where both the actor and the critic are represented by a NN. A rate of O(√(1/T)) is also proved, and the authors emphasize the importance of the "compatibility" condition to achieve convergence and global optimality. In [58], better bounds on the sample complexity of (natural) actor-critic are provided. In fact, the authors demonstrate that the overall sample complexity for the mini-batch actor-critic algorithm to reach an ε-accurate stationary point is of O(ε^{-2} log(1/ε)) and that the natural actor-critic method requires O(ε^{-2} log(1/ε)) samples to attain an ε-accurate globally optimal point. In the same vein, the works in [59] and [60] establish the convergence rates and sample complexity bounds for the two-time-scale scheme of (natural) actor-critic methods where the actor and the critic are updated simultaneously with different learning rates. Furthermore, [61] studies the convergence of PG in the context of constrained RL where the objective and the constraints are both non-convex
Another approach to planning relies on the learned model to select actions at the current state s. This is called decision-time planning [19]. In this category, neither a policy nor a value function is required to act in the environment. In fact, action selection is formulated as an optimization problem (in (14)) where the agent chooses a sequence of actions, or a plan, that maximizes the expected rewards over a trajectory of length H:

a_1, . . . , a_H = arg max_{a_1,...,a_H} E[ Σ_{t=1}^{H} R(s_t, a_t) | a_1, . . . , a_H ].    (14)

There are two approaches to solve this planning problem according to the size of the action space. For discrete action spaces, decision-time planning encompasses heuristic search, MC rollouts, and Monte Carlo Tree Search (MCTS). Heuristic search consists in considering a tree of possible continuations from a state s, where the values of the next actions are estimated recursively by backing up the values from the leaf nodes. Then, the action with the highest value is selected. MC rollout planning uses the approximated model to generate simulated trajectories starting from each possible next action to estimate its value (see Figure 5a.2.a). At the end of this process, the action with the highest value is chosen in the current state. The MCTS algorithm keeps track of the expected return of state-action pairs encountered during MC rollouts to direct the search toward more rewarding pairs. At the current state, MCTS expands a tree by executing an action according to the estimated values. In the next state, MCTS evaluates the value of the obtained state by simulating MC rollouts and propagates it backward to the parent nodes. The same process is repeated in the next state. Two important observations can be made regarding these approaches: (i) the estimated state-action values are completely discarded after the action is selected (none of these algorithms stores Q-functions), and (ii) all these approaches are gradient-free.

The continuous action setting is more involved because it is complicated to perform a tree search. Alternatively, trajectory optimization methods consider one possible action sequence a = {a_1, . . . , a_H} sampled from a fixed distribution. This sequence is then executed and the trajectory return J(a) = Σ_{t=0}^{H} R(s_t, a_t) is computed. Notice that the return is a function of the action sequence. If the learned model is differentiable, it is possible to compute the gradients of J(a) with respect to the actions and update the action sequence accordingly (i.e. a = a + ∇_a J(a)). This planning algorithm, known as the random shooting method, exhibits several caveats such as sensitivity to the initial action selection and poor convergence guarantees. This has motivated many variants of planning methods for continuous action spaces. For example, the Cross-Entropy Method (CEM) is a famous planning approach to escape the local optima that shooting methods suffer from. Compared to the shooting method, the main idea of CEM is to consider a normal distribution with parametrized mean and covariance. The trajectories are sampled around the mean and the average reward is computed for each sample to evaluate its fitness. Afterward, the mean and the covariance are updated using the best samples. CEM is a simple and gradient-free approach and can exhibit fast convergence. For a detailed comparison and benchmark of MBRL methods, we refer the interested reader to [69].
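To ground the CEM planner described above, the sketch below plans over a hypothetical learned model of a one-dimensional system (the dynamics, reward, and all planner hyperparameters are invented for illustration and only use a diagonal covariance).

```python
import numpy as np

rng = np.random.default_rng(6)

# Assumed learned dynamics/reward model: the state drifts by the action and
# the reward penalizes the squared distance from a goal state.
def model_rollout_return(s0, action_seq, goal=5.0):
    s, ret = s0, 0.0
    for a in action_seq:
        s = s + a                       # predicted next state
        ret += -(s - goal) ** 2         # predicted reward
    return ret

def cem_plan(s0, horizon=10, pop=64, elites=8, iters=20):
    """Cross-Entropy Method: sample action sequences from a Gaussian, keep the
    best ones, and refit the mean and (diagonal) spread on those elites."""
    mean = np.zeros(horizon)
    std = np.ones(horizon)
    for _ in range(iters):
        candidates = mean + std * rng.standard_normal((pop, horizon))
        scores = np.array([model_rollout_return(s0, c) for c in candidates])
        elite_set = candidates[np.argsort(scores)[-elites:]]
        mean, std = elite_set.mean(axis=0), elite_set.std(axis=0) + 1e-3
    return mean                          # planned action sequence; execute the first action

plan = cem_plan(s0=0.0)
print(plan[:3])
```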
In Table 5b, a comparison between MFRL and MBRL methods, inspired from [70], is provided. Although model-free methods can exhibit better asymptotic reward performance and are more computationally efficient for deployment, model-based algorithms are more data efficient, robust to changes in the environment dynamics and rewards, and support richer forms of exploration. MBRL is preferred to model-free RL in multi-task settings. In fact, the same learned dynamics can be used to perform multiple tasks without further training.

                                   MF      MB
  Asymptotic rewards               +       +/−
  Computation at deployment        +       +/−
  Data efficiency                  −       +
  Adaptation to changing rewards   −       +
  Adaptation to changing dynamics  −       +
  Exploration                      −       +

However, several practical considerations should be taken into account for MBRL. Since the model is learned via interactions with the environment, it is prone to the following problems: (1) insufficient experience, (2) function approximation error, (3) small model error propagation, (4) error exploitation by the planner, and (5) less reliability for longer model roll-outs [71]. To avoid these problems, it is recommended to continuously re-plan to avoid error accumulation. Limited data can also cause model uncertainty, which can be reduced by observing more data. Thus, it is better to estimate the uncertainty in the model predictions to know when to trust the model to generate reliable plans/actions. One approach to estimate model uncertainty is Bayesian, where methods such as Gaussian Processes [66] or Bayesian NNs (i.e. [72]) are applied. The work in [73] proposed to use ensemble methods (bootstrapping) for uncertainty estimation, where an ensemble of models is learned and the final predictions are the combination of the models' predictions.

B. Applications of MBRL

MBRL approaches have received less interest from the wireless communication community compared to their model-free counterparts. However, we argue that MBRL is important to build practical systems. For instance, since it is hard to collect data from a real wireless network, models are trained using a simulator. The most important barrier to deploying such models in real-world settings is the reality gap, where the policy learned in a simulator does not perform as well in the real world. This is a serious issue in the context of building RL-based algorithms for 6G networks. This line of work is called sim2real, which is an active area of research in robotics. In this context, the learned models are a means to bridge the gap between the simulation and the real world. Furthermore, MBRL is advantageous because a learned model for a source task can be re-used to learn a new task faster. Coupled with meta-learning techniques, MBRL is applied to generalize to new environments and changes in the agent's world. As an example, aerial networks or drone networks, a key enabler of 6G systems, can benefit from MBRL for a wide variety of applications such as hovering and maneuvering tasks [74]. Another potential application of MBRL is related to task offloading in MEC. A model can be learned to predict the load levels in edge servers. This will help the edge users to make more efficient offloading decisions, especially for delay-sensitive tasks. The main challenge in applying MBRL in multi-agent problems is the non-stationarity issue. One key
The main challenge in applying MBRL to multi-agent problems is the non-stationarity issue. One key application of MBRL in multi-agent problems is opponent modeling. It consists in learning models to represent the behaviors of other agents. In multi-agent systems, opponent modeling is useful not only to promote cooperation and coordination but also to account for the behavior of the opponents and to compensate for partial observability (see Section V-A). In addition, modeling other agents enables the decentralization of the problem because the agent can use the learned models to infer the strategies and/or rewards.

The work in [75] proposes a model-based algorithm for cooperative MARL called Cooperative Prioritized Sweeping. This paper extends the prioritized sweeping algorithm mentioned above to the multi-agent setting. The environment is modeled as a factored MMDP and represented by a Dynamic Bayesian Network in which state-action interactions are represented as a coordination graph. This assumption allows the factorization of the transition function, the reward function, as well as the Q-values. Thus, these functions can be learned in an efficient and sparse manner. One drawback of this method is that it assumes that the structure of the factored MDP is available beforehand, which is impractical in some applications with high mobility.

The authors in [76] consider two-agent competitive and cooperative settings in continuous control problems. The problem addressed in [76] is how to learn separable models for each agent while capturing the interactions between them. The approach is based on multi-step generative models. Instead of learning a model from one-step samples (s, a, s′), multi-step generative models utilize past trajectory segments T_p of length H, T_p = {(s_{t−H−1}, a_{t−H−1}), . . . , (s_t, a_t)}, to learn a distribution over the future segments T_f = {(s_{t+1}, a_{t+1}), . . . , (s_{t+H}, a_{t+H})}. Hence, an encoder learns a distribution Q(Z|T_f) over a latent variable Z conditioned on the future trajectory segment, and a decoder reconstructs T_f such that T̂_f = D(T_p, Z) (a simplified sketch of such an encoder-decoder model is given below). In the two-agent setting, the joint distribution P(T_f^x, T_f^y) is modeled, where T_f^x and T_f^y are the future segments of players x and y, respectively. The key idea is to learn two disentangled latent spaces Z_x and Z_y. To do so, the algorithm proposed in the paper uses a variational lower bound on the mutual information [77].
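The following is a minimal, single-agent PyTorch sketch of the encoder-decoder structure described above: trajectory segments are flattened into vectors and trained with a standard variational objective. It is an illustration under simplifying assumptions (layer sizes, the flattening, and the KL weight are all arbitrary choices), not the architecture or training procedure of [76].

```python
import torch
import torch.nn as nn

H, state_dim, action_dim, latent_dim = 8, 4, 2, 16
seg_dim = H * (state_dim + action_dim)  # flattened trajectory segment

class SegmentEncoder(nn.Module):
    """Q(Z | T_f): maps a future segment to a Gaussian over the latent Z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(seg_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
    def forward(self, t_f):
        h = self.net(t_f)
        return self.mu(h), self.logvar(h)

class SegmentDecoder(nn.Module):
    """D(T_p, Z): reconstructs the future segment from the past segment and Z."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(seg_dim + latent_dim, 128),
                                 nn.ReLU(), nn.Linear(128, seg_dim))
    def forward(self, t_p, z):
        return self.net(torch.cat([t_p, z], dim=-1))

enc, dec = SegmentEncoder(), SegmentDecoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

t_p = torch.randn(32, seg_dim)   # batch of past segments (placeholder data)
t_f = torch.randn(32, seg_dim)   # batch of future segments (placeholder data)

mu, logvar = enc(t_f)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
recon = dec(t_p, z)
recon_loss = ((recon - t_f) ** 2).mean()
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = recon_loss + 1e-3 * kl    # ELBO-style objective (weight is an assumption)
opt.zero_grad(); loss.backward(); opt.step()
```

In the two-agent case of [76], two such latent variables are learned and an additional mutual-information term encourages them to remain disentangled.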
V. COOPERATIVE MULTI-AGENT REINFORCEMENT LEARNING

A. Challenges and Implementation Schemes

This section is dedicated to discussing the challenges that arise in multi-agent problems. Several research endeavours have proposed algorithms to address these issues, which led to different training schemes for cooperative agents. We start by summarizing the MARL challenges that we consider fundamental in developing systems for wireless communications.

Non-stationarity: As mentioned before, in multi-agent environments, players update their policies concurrently. As a consequence, the agents' rewards and state transitions depend not only on their own actions but also on the actions taken by their opponents. Hence, the Markov property, stating that the reward and transition functions depend only on the previous state and the agent's action, is violated, and the convergence guarantees of single-agent RL are no longer valid [78]. Due to the non-stationarity, the learning agent needs to consider the behaviour of other participants to maximize its return. One way to overcome this issue is to use a central coordinator collecting information about the agents' observations and actions. In this case, standard single-agent RL methods can be applied. Since the centralized approach is not favorable, the non-stationarity challenge needs to be considered in designing decentralized MARL algorithms.

Scalability: To overcome the non-stationarity problem, it is common to include information about the joint action space in the learning procedure. This gives rise to a scalability issue since the joint action space grows exponentially with the number of agents. Therefore, the use of DNNs as function approximators becomes more pressing, which adds complexity to the theoretical analysis of deep MARL. For systems involving multiple agents (which is usually the case in wireless networks with many users), scalability becomes crucial. Several research endeavors aimed to overcome this issue. One example is to learn a value function that is factorized with respect to the actions (see Section V-B2).

Partial Observability: In real-world scenarios, the agents seldom have access to the true state of the system. They usually receive partial observations from the environment. Partial observability coupled with non-stationarity makes MARL more challenging. In fact, as stated before, the non-stationarity issue mandates that the individual agents become aware of the other agents' policies. With only partial information available, the individual learners will struggle to overcome the non-stationarity of the system and account for the joint behavior.

Privacy and Security: Since coordination may involve information sharing between agents, privacy and security concerns arise. Private information (e.g., rewards) shared with other agents is subject to attacks and vulnerabilities. This will hinder the applicability of MARL algorithms in real-world settings. This is why fully decentralized algorithms are preferred, so that all the agents keep their information private. Enormous efforts have been made to address privacy and security issues in supervised learning. However, in MARL, this challenge is not extensively studied. Recently, the work in [79] has shown that attackers can infer information about the training environment from a policy in single-agent RL.

To promote coordination while considering the challenges discussed above, different training schemes can be adopted.
• Fully decentralized: A simple extension of single-agent RL to multi-agent scenarios is IL, where each agent optimizes its policy independently of the other participants. Thus, the non-stationarity problem is ignored and no coordination or cooperation is considered. This technique suffers from convergence problems [80]. However, it may show satisfying results in practice. In fact, recent works (see Table V) adopted IL to solve several resource allocation and control problems in wireless communication networks;
• Fully centralized: This approach assumes the existence of a centralized unit that can gather information such as actions, rewards, and observations from all the agents. This training scheme alleviates the partial observability and non-stationarity problems, but it is impracticable for large-scale and real-time systems;
• Centralized training and decentralized execution: CTDE assumes the existence of a centralized controller that collects additional information about the agents during training, but the learned policies are decentralized and executed using the agent's local information only. CTDE is considered in several MARL algorithms since it presents a simple solution to the partial observability and non-stationarity problems while allowing the decentralization of agents' policies.

B. Algorithms and Paradigms

Coordination solutions considered in this tutorial are categorized into two families: those based on communication and those based on learning. Emergent communication studies the learning of a communication protocol to exchange information between agents. Networked agents assume the existence of a communication structure and learn cooperative behaviors through information exchange between neighbors. The second class aims to learn cooperative behaviors without information sharing. The methods that we will present are not an exhaustive list of deep MARL since we concentrate on the key concepts that can be applied to 6G technologies. We refer the readers to [81], [82], and [83] for an extensive review of the deep MARL literature.

1) Emergent Communication

This is an active research field where cooperative agents are allowed to communicate, for example explicitly via sending direct messages or implicitly by maintaining a shared memory. Deep communication problems are modeled as Dec-POMDPs where agents share a communication channel in a partially observable environment and aim to maximize their joint utility. In addition to optimizing their policies, the agents learn communication protocols to collaborate better. Direct messages can be learned concurrently with the Q-function: an NN is trained to output, in addition to the Q-values, a message to communicate to the other participants in the next round (a minimal sketch is given below). This method involves exchanging information between all the agents, which is expensive. Alternatively, memory-driven algorithms propose to use a shared memory as a communication channel. All the agents access the shared memory before taking an action and then write a response. The advantage of this method is that the agent does not communicate with the rest of the agents directly, which may eventually reduce the communication cost. Besides, the agent policy depends on its private local observations and the collective memory, and not on messages from all the agents.
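As a minimal illustration of the first scheme (learning messages jointly with Q-values), the sketch below defines an agent network with a Q-value head and a message head whose output is fed into the other agents' inputs at the next round. The layer sizes and the mean-based message aggregation are illustrative assumptions rather than a specific published architecture.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, msg_dim, n_agents = 10, 4, 8, 3

class CommAgent(nn.Module):
    """Outputs Q-values for the agent's own actions and a message vector."""
    def __init__(self):
        super().__init__()
        # Input: local observation + aggregated messages received last round.
        self.body = nn.Sequential(nn.Linear(obs_dim + msg_dim, 64), nn.ReLU())
        self.q_head = nn.Linear(64, n_actions)    # action values
        self.msg_head = nn.Linear(64, msg_dim)    # message to broadcast

    def forward(self, obs, incoming_msg):
        h = self.body(torch.cat([obs, incoming_msg], dim=-1))
        return self.q_head(h), torch.tanh(self.msg_head(h))

agents = [CommAgent() for _ in range(n_agents)]
obs = torch.randn(n_agents, obs_dim)     # placeholder local observations
msgs = torch.zeros(n_agents, msg_dim)    # no messages at the first round

# One communication round: each agent acts and broadcasts a new message.
q_values, new_msgs = [], []
for i, agent in enumerate(agents):
    q_i, m_i = agent(obs[i], msgs[i])
    q_values.append(q_i)
    new_msgs.append(m_i)
actions = [int(q.argmax()) for q in q_values]
# Simple aggregation: each agent receives the mean of the others' messages.
stacked = torch.stack(new_msgs)
msgs = (stacked.sum(dim=0, keepdim=True) - stacked) / (n_agents - 1)
```

Training would backpropagate the TD loss through both heads; bandwidth-limited variants prune or gate these messages, as in [84], [85], [86].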
Integrating these methods into 6G systems requires learning cost-efficient communication due to limited resources such as bandwidth. Lately, more research endeavors have focused on learning efficient communication protocols under limited-bandwidth constraints. Precisely, methods such as pruning, attention, and gating mechanisms are applied to reduce the number of messages communicated to the agents at each control round (e.g., [84], [85], [86]). In addition to learning cost-efficient communication protocols, this field has many other open questions. For example, the robustness of the learned policies to communication errors or delays caused by noisy channels, congestion, interference, etc., needs to be investigated. [87] discusses the challenges and difficulties of learning communication in multi-agent environments. We argue that this technique can be useful in designing intelligent 6G systems. For example, the performance of MEC or aerial communication systems can be boosted by integrating communication between agents. We strongly believe that this field can benefit from the expertise of the wireless communication community to develop more efficient communication protocols while taking into consideration the restrictions of the communication medium [86].

2) Cooperation

In this section, we will overview coordination learning methods without any explicit communication. As mentioned before, training independent and hence fully decentralized agents lacks convergence guarantees because of the non-stationarity problem. This issue is approached with different methodologies in the deep MARL literature. The first one consists in generalizing single-agent RL algorithms to the multi-agent setting. In particular, most single-agent RL algorithms such as DQN rely on experience replay buffers where state transitions are stored. In the multi-agent setting, the data stored in replay memories become obsolete because the agents update their policies in parallel. Several approaches were proposed to address this problem and therefore enable the use of replay buffers to train independent learners in multi-agent environments [81]. Another line of work focuses on training cooperative agents using the CTDE framework. For policy gradients, centralized critic(s) are learned using all agents' policies to avoid non-stationarity and variance problems, while the actors choose actions using local information only. This method is applied, for example, in [88] to extend the DDPG algorithm to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, which can be used for systems with heterogeneous and homogeneous agents. For Q-learning based methods, the approach is to learn a centralized but factorized global Q-function. For example, in [89], the team Q-function is decomposed as the sum of individual Q-functions, whereas in [90], the authors propose to use a mixing network to combine the agents' local Q-functions in a non-linear way (a simplified sketch of such a mixer is given below).
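To illustrate the value-factorization idea, the sketch below combines per-agent Q-values either by simple summation (in the spirit of value decomposition networks) or through a monotonic, state-conditioned mixing network in the spirit of QMIX. The layer sizes and the use of absolute values to enforce non-negative mixing weights are simplifying assumptions and not a faithful reproduction of [89] or [90].

```python
import torch
import torch.nn as nn

n_agents, state_dim, embed_dim = 4, 12, 32

def vdn_mix(agent_qs):
    """Value decomposition: Q_tot is the sum of the individual Q-values."""
    return agent_qs.sum(dim=-1, keepdim=True)

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: non-negative, state-dependent weights keep Q_tot
    monotonic in each agent's Q-value."""
    def __init__(self):
        super().__init__()
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, n_agents, embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).view(b, 1)   # Q_tot

agent_qs = torch.randn(8, n_agents)   # chosen-action Q-values per agent (batch of 8)
state = torch.randn(8, state_dim)     # global state, used during training only
q_tot_vdn = vdn_mix(agent_qs)
q_tot_qmix = MonotonicMixer()(agent_qs, state)
```

The TD target is computed on Q_tot, while each agent acts greedily on its own Q_i at execution time, which is what makes the factorization compatible with decentralized execution.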
Although these methods show promising results, they can face several challenges such as representational capacity [91] and inefficient exploration [92]. As an alternative, [93] proposes to learn a single globally shared network that outputs different policies for homogeneous agents. All the methods presented above have a straightforward application in wireless communications since wireless networks are multi-agent systems by definition and coordination is crucial in such systems. In fact, MADDPG has been applied in MEC and aerial networks (i.e., UAV networks). See Section VI for more details.

3) Decentralized MARL Over Networked Agents

Cooperative agents, modeled using a cooperative MG, usually assume that the agents share the same reward function, and thus the homogeneity of the agents. This is not the case in most wireless communication problems, where agents have different preferences or reward functions. For example, MEC networks encompass several types of IoT devices. Hence, it is important to account for the heterogeneity of the agents in the design of decentralized cooperative algorithms for 6G systems. In this context, the objective is to form a team of heterogeneous agents (i.e., with different reward functions) collaborating to maximize the team-average reward R̄ = (1/N) Σ_{i∈N} R_i. As explained in Section II, networked agents cooperate and make decisions using their local observations, including information shared by the neighbors over a communication network. The existence of the communication network enables the collaboration between agents without the intervention of a central unit. Let π^i_{θ_i} be agent i's policy, parametrized as a DNN. The joint policy is given by π_θ = ∏_{i∈N} π^i_{θ_i}(a_i|s), and the global Q-function under the joint policy π_θ is Q_θ. To find optimal policies, the policy gradient for each agent can be expressed as the product of the global Q-function Q_θ and the local score function ∇_{θ_i} log π^i_{θ_i}(a_i|s). However, Q_θ is hard to estimate knowing that the agents can only use their local information. Consequently, each agent learns a local copy Q_{θ_i} of Q_θ. In [94], an actor-critic algorithm is proposed where a consensus-based approach is adopted to update the critics Q_{θ_i} as a weighted average of the local and adjacent updates (a toy version of this consensus step is sketched below). We refer the interested reader to [95] for other extensions and algorithms for this framework with a theoretical analysis.
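The following toy sketch illustrates the consensus step described above: each agent performs a local TD update of a linear critic using its own reward and then averages its parameters with its neighbors' over a communication graph. It is a simplified illustration of the networked-agents idea under arbitrary assumptions (random features, a ring graph, a doubly-stochastic weight matrix), not the exact algorithm or convergence conditions of [94].

```python
import numpy as np

n_agents, feat_dim, alpha = 4, 6, 0.05
rng = np.random.default_rng(0)

# Doubly-stochastic consensus weights over a ring communication graph.
C = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    C[i, i] = 0.5
    C[i, (i - 1) % n_agents] = 0.25
    C[i, (i + 1) % n_agents] = 0.25

w = rng.normal(size=(n_agents, feat_dim))        # local critic parameters

for _ in range(100):
    phi = rng.normal(size=feat_dim)              # feature of current (s, a)
    phi_next = rng.normal(size=feat_dim)         # feature of next (s', a')
    rewards = rng.normal(size=n_agents)          # heterogeneous local rewards

    # 1) Local TD(0) update of each agent's critic using its own reward.
    w_half = np.empty_like(w)
    for i in range(n_agents):
        td_error = rewards[i] + 0.95 * phi_next @ w[i] - phi @ w[i]
        w_half[i] = w[i] + alpha * td_error * phi

    # 2) Consensus step: mix parameters with the neighbors over the network.
    w = C @ w_half

# After repeated mixing, the critics drift toward a common estimate that
# reflects the team-average reward rather than any single agent's reward.
```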
VI. APPLICATIONS

A. MARL for MEC Systems

Multi-access edge computing (MEC) is one of the enabling technologies for 5G and beyond-5G networks. We are witnessing a proliferation of smart devices running computationally expensive tasks such as gaming and Virtual/Augmented Reality. Therefore, designing efficient algorithms for MEC systems is a crucial step in providing low-latency and high-reliability services. DRL has been extensively applied to solve several problems in MEC networks, including task/computation offloading (e.g., [96], [97]), edge caching [98], network slicing, and resource allocation [99]. Recently, more interest has been accorded to MARL in MEC networks to account for the distributed nature of these networks.

Task offloading has been studied in several works from a multi-agent perspective with a focus on decentralized execution. First, we examine the works proposing fully decentralized algorithms based on the IL framework. In [100], each mobile user is represented by a DDPG/DQN agent aiming to independently learn an optimal task offloading policy that minimizes its power consumption and task buffering delays. Therefore, this paper provides a fully decentralized algorithm where users decide, using their local observations (task buffer length, user SINR, CSI), the power levels allocated for local execution and the task offloading. Similarly, the approach in [101] is based on independent Q-learning, where each edge user selects the transmit power, the radio access technology, and the sub-channel. The problems considered in the previous papers are formulated as MGs where the global state is the concatenation of the users' local observations and the agents act simultaneously on the system to receive independent reward signals. Thus, this formalization enables the consideration of heterogeneous users with different reward functions. Furthermore, [102] formalizes the task offloading as a non-cooperative problem where each agent aims to minimize the task drop rate and execution delay. Each mobile user is represented as an A2C agent. An energy-aware algorithm is presented in [103] where independent DQN agents are deployed in every edge server and the servers decide which user(s) should offload their computations. However, the independent nature of these works rules out any coordination between the learning agents, which may hinder the convergence of these methods in practice (see Section V-A).

Cooperation is considered in [28], where the authors used the MADDPG algorithm to jointly optimize the multi-channel access and task offloading of MEC networks in Industry 4.0. The joint optimization problem is modeled as a Dec-POMDP, since the agents cannot observe the status of all the channels, and is solved using the CTDE paradigm. The use of MADDPG enables the coordination between agents without any explicit communication since the critic is learned using the information from all the agents but the actors are executed in a decentralized manner. Experimental results showcased the impact of cooperation in reducing the computation delay and enhancing the channel utilization compared to the IL case.

For edge caching, [29] and [104] propose MADDPG-like algorithms to solve the cooperative multi-agent edge caching problem. Both of these works model the cooperative edge caching as a Dec-POMDP and differ in the definition of the state space and reward functions. In [29], the edge servers receive the same reward, namely the average transmission delay reduction, whereas in [104], the weighted sum of the local and the neighbors' hit rates is considered as a reward signal to encourage cooperation between adjacent servers. Simulation results showed that cooperative edge caching outperforms traditional caching mechanisms such as Least Recently Used (LRU), Least Frequently Used (LFU), and First In First Out (FIFO).

To summarize, to offer massive URLLC services, the scalability of MEC systems is crucial. We expect to see more research efforts leveraging deep MARL techniques to study and analyze the reliability-latency-scalability trade-off of future 6G systems. For example, applying the networked agents scheme to the above-mentioned problems is one direction to explore in future works.

B. MARL for UAV-Assisted Wireless Communications

The application of deep MARL in UAV networks has been getting more attention recently. In general, these applications involve solving cooperative tasks by a team of UAVs without the intervention of a central unit. Hence, in UAV network management, decentralized MARL algorithms are preferable in terms of communication cost and energy efficiency. The decentralized scheme over networked agents is suitable for this application if we assume that the UAVs have sufficient communication capabilities to share information with the neighbors in their sensing and coverage areas. However, due to the mobility of UAVs, maintaining communication links with neighbors to coordinate represents a considerable handicap for this paradigm.

In [105], the authors study cooperation in the link discovery and selection problem. Each UAV agent u perceives the locally available channels and decides to establish a link with another UAV v over a shared channel. Due to different factors such as UAV mobility, wireless channel quality, and perception capabilities, each UAV u has a different local set of perceived channels C_u such that C_u ∩ C_v ≠ ∅. A link is established between two agents u and v if they propagate messages on the same channel simultaneously. Given the local information (i.e., state) about whether the previous message was successfully delivered, each UAV's action is a pair (v, c_u), denoting by v the other UAV and by c_u the propagation channel. Each agent receives a reward r_u defined as the number of successfully sent messages over time-varying channels. The algorithm proposed in [105] is based on independent Q-learning with two main modifications: fractional slicing to deal with high-dimensional and continuous action spaces, and mutual sampling to share information (state-action pairs and Q-function parameters) between agents to alleviate the non-stationarity issue in the fully decentralized scheme. Thus, a central coordinating unit is necessary. The problem of field coverage by a team of UAVs is addressed in [25]. The authors formulated the problem as an MG where each agent's state is defined as its position in a 3D grid. The UAVs cooperate to maximize the coverage of an unknown field. The UAVs are assumed to be homogeneous and have the same action and state spaces. The proposed algorithm is based on Q-learning where a global Q-function is decomposed using the approximation techniques Fixed Sparse Representation (FSR) and Radial Basis Function (RBF). This decomposition technique does not allow the full decentralization of the algorithm since the basis functions depend on the joint state and action spaces. Thus, the UAVs need to share information, which incurs an important communication cost. Another application of MARL in UAV networks is spectrum sharing, which is analyzed in [106]. The UAV team is divided into two clusters: the relaying UAVs, which provide relaying services for the terrestrial primary users in exchange for spectrum access for the other cluster, which groups sensing UAVs transmitting data packets to a fusion center. Each UAV's action is either to join the relaying or the sensing cluster. The authors proposed a distributed tabular Q-learning algorithm where each UAV learns a local Q-function using its local state without any coordination with the other UAVs. In a more recent work [107], the joint optimization of multi-UAV target assignment and path planning is solved using MARL. A team of UAVs positioned in a 2D environment aims to serve T targets while minimizing the flight distance. Each UAV covers only one target without collision with threat areas and other UAVs. To enforce the collision-free constraint, a collision penalty is added to the reward, thus rendering the problem a mixed RL setting with both cooperation and competition. Consequently, the MADDPG algorithm is adopted to solve the joint optimization. Furthermore, [108] formulates resource allocation in a downlink communication network as an SG and solves it using independent Q-learning. The work in [109] applies MARL to fleet control, particularly aerial surveillance and base defense, in a fully centralized fashion.

C. MARL for Beamforming Optimization in Cell-Free MIMO and THz Networks

DRL has been extensively applied for uplink/downlink beamforming optimization. Particularly, several works focused on beamforming computation in cell-free networks. In the fully centralized version of the cell-free architecture, all the access points are connected and coordinated through a central processing unit to serve users in their coverage area. Although the application of single-agent DRL to cell-free networks has empirically shown optimal performance, the computational complexity and the communication cost increase drastically with the number of users and access points. As a remedy to this issue, hybrid methods based on dynamic clustering and network partitioning are proposed. The core idea of these methods is to cluster users and/or access points to reduce the computational and communication costs as well as to enhance the coverage by reducing interference. As an example, in [30], a DDQN algorithm is implemented to perform dynamic clustering and a DDPG agent is dedicated to beamforming optimization. This joint clustering and beamforming optimization is formulated as an MDP and a central unit is used for training and execution. In [110], dynamic downlink beamforming coordination is studied in a multi-cell MISO network. The authors proposed a distributed algorithm where the base stations are allowed to share information via a limited exchange protocol. Each base station is represented as a DQN agent trying to maximize its achievable rate while minimizing the interference with the neighboring agents. The use of DQN networks required the discretization of the action space, which is continuous by definition (a small sketch of this discretization is given at the end of this subsection). The same framework can be applied with PG or DDPG methods to handle continuous action spaces.

Furthermore, THz communication channels are characterized by high attenuation and path loss, which require transmitting highly directive beams to minimize the signal power propagating in directions other than the transmission direction. In this context, directional beamforming and beam selection are possible solutions to enhance the communication range and reduce interference. Intelligent beamforming in THz MIMO systems is another promising application of MARL for future 6G networks.
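As a small illustration of the discretization strategy mentioned above for DQN-based beamforming, the sketch below maps a continuous beam steering choice onto a finite codebook and lets a Q-network pick a codebook index. The codebook construction, the network size, and the reward are illustrative assumptions only, not the design of [110] or any other cited work.

```python
import torch
import torch.nn as nn

n_beams, obs_dim = 16, 8                         # size of the discrete codebook
codebook = torch.linspace(-60.0, 60.0, n_beams)  # candidate steering angles (deg)

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_beams))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_beams))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

obs = torch.randn(32, obs_dim)        # e.g., CSI/interference features (placeholder)
next_obs = torch.randn(32, obs_dim)
actions = torch.randint(0, n_beams, (32,))       # chosen codebook indices
rewards = torch.randn(32)                        # e.g., achievable rate (placeholder)

# Standard DQN update over the discretized beam set.
q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
with torch.no_grad():
    target = rewards + 0.9 * target_net(next_obs).max(dim=1).values
loss = nn.functional.mse_loss(q_sa, target)
opt.zero_grad(); loss.backward(); opt.step()

# At execution time, each base station greedily picks a beam from the codebook.
best_angle = codebook[int(q_net(obs[:1]).argmax(dim=1))]
```

A PG or DDPG actor would instead output the steering angle (or precoder weights) directly, avoiding the codebook but requiring a critic over continuous actions.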
D. Spectrum Management

In [111] and [112], spectrum allocation in Device-to-Device (D2D) enabled 5G HetNets is considered. The D2D transmitters aim to select the available spectrum resources with minimal interference to ensure the minimum performance requirements for the cellular users. The authors in [111] consider a non-cooperative scenario where the agents independently maximize their throughput. Based on the resource block selected in the previous time step, the D2D users choose the resource block for uplink transmission and receive a positive reward equivalent to the capacity of the D2D user if the cellular users' constraints are satisfied; otherwise, a penalty is imposed. The problem is solved using a tabular Q-learning algorithm where each D2D agent learns a local Q-function with local state information (a toy version of this scheme is sketched at the end of this section). This work does not scale to high-dimensional state spaces since it is based on a tabular approach. This problem is addressed in [112], where actor-critic based algorithms are proposed for spectrum sharing. Two approaches are used to promote cooperation between D2D users. The first is called multi-agent actor-critic, where each agent learns a Q-function in a centralized manner using the information from all the other agents. The learned policies are executed in a decentralized fashion since the actor relies on local information only. The second approach proposes to use information from neighboring agents only to train the critic, instead of the information from all the agents, to reduce the computational complexity for large-scale networks. The action selection and the reward function are similar to the ones defined in the previous work. In this work, the state space is richer and contains information about (i) the instant channel information of the corresponding D2D link, (ii) the channel information of the cellular link (e.g., from the BS to the D2D transmitter), (iii) the previous interference to the link, and (iv) the resource block selected by the D2D link in the previous time slot.

Another application of MARL is resource allocation in cellular-based vehicular communication networks. The Vehicle-to-Vehicle (V2V) transmitters jointly select the communication channel and the power level for transmission. The work in [113] proposes a fully decentralized algorithm based on DQN to maximize the Vehicle-to-Infrastructure capacity under V2V latency constraints. Although the solution is decentralized in the sense that each agent trains a Q-network locally, the state contains information about the other participants. The authors include a penalty in the reward function to account for the latency constraints in V2V communications. For more references on MARL in vehicular communications, we refer the reader to [17].

Interference mitigation will be a pressing issue in THz communications. Exploiting the THz bands is one key enabler of 6G systems for higher data rates. In [114], a multi-armed bandit based algorithm is proposed for intermittent interference mitigation from directional links in two-tier HetNets. However, the proposed solution is valid for a single target receiver. Another work in [115] proposes a two-layered distributed D2D model, where MARL is applied to maximize user coverage in dense indoor environments with limited resources (i.e., a single THz access point, limited bandwidth, and limited antenna gain). Devices in the first layer can directly access the network resources and act as relays for the second-layer devices. The objective is to find the optimal D2D links between the two layers. The devices of the first layer are modeled as Q-learning agents and decide, using local information, the D2D links to establish with the second-layer devices. The agents receive two types of rewards: a private one for serving a device in the second layer and a public reward regarding the system throughput. To promote coordination, the agents receive information about the number of their neighbors and their states. [116] studies a two-tier network with virtualized small cells powered by energy harvesters and equipped with rechargeable batteries. These cells can decide to offload baseband processes to a grid-connected edge server. MARL is applied to minimize the grid energy consumption and the traffic drop rate. The agents collaborate via exchanging information regarding their battery state. [117] inspects the problem of user association in dynamic mmWave networks where users are represented as DQN agents independently optimizing their policies using their local information.

E. Intelligent Reflecting Surfaces (IRS)-Aided Wireless Communications

Intelligent Reflecting Surfaces (IRS)-aided wireless communications have attracted increasing interest due to the coverage and spectral efficiency gains they provide. Multiple research works proposed DRL-based algorithms for joint beamforming and phase shift computation. These contributions study systems with a single IRS, which is far from the real-world case. More recent research endeavors seek to remedy this shortcoming. For example, in [118], the authors consider a communication system with multiple IRSs cooperating together under the coordination of an IRS controller. The joint beamforming and phase shift optimization problem is decoupled and solved in an alternating manner using fractional programming. Another line of work aims to provide secure and anti-jamming wireless communications by adjusting the IRS elements. This problem was also approached using single-agent RL (e.g., in [119]). For distributed deployment of multiple IRSs, secure beamforming is solved in [120] using an alternating optimization scheme based on successive convex approximation and manifold optimization techniques. At the time of writing this tutorial, we were unable to find a decentralized algorithm for a multi-IRS system based on MARL techniques. This is why we believe that proposing distributed deployment schemes for multi-IRS systems is a promising research direction.
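To close this applications section, the sketch below shows the kind of independent tabular Q-learning scheme used by several of the spectrum-management works above: each agent keeps its own Q-table over resource blocks and updates it from a local reward. The toy collision-based reward and all sizes are illustrative assumptions, not a reproduction of [111] or any other cited work.

```python
import numpy as np

n_agents, n_blocks, gamma, alpha, eps = 3, 5, 0.9, 0.1, 0.2
rng = np.random.default_rng(1)

# One Q-table per D2D agent: state = resource block used in the previous
# step, action = resource block chosen for the next transmission.
Q = np.zeros((n_agents, n_blocks, n_blocks))
prev_block = rng.integers(n_blocks, size=n_agents)

def local_reward(agent, chosen, all_choices):
    """Toy stand-in: capacity-like reward with a penalty on collisions."""
    collisions = sum(1 for j, b in enumerate(all_choices)
                     if j != agent and b == chosen)
    return 1.0 - collisions  # assumption only; real works use rates/constraints

for t in range(2000):
    # Each agent selects a block independently (epsilon-greedy on its own table).
    choices = np.array([
        rng.integers(n_blocks) if rng.random() < eps
        else int(np.argmax(Q[i, prev_block[i]]))
        for i in range(n_agents)
    ])
    # Independent Q-learning update from each agent's local reward.
    for i in range(n_agents):
        r = local_reward(i, choices[i], choices)
        best_next = np.max(Q[i, choices[i]])
        Q[i, prev_block[i], choices[i]] += alpha * (
            r + gamma * best_next - Q[i, prev_block[i], choices[i]])
    prev_block = choices
```

Because each table ignores the other agents' learning, this is exactly the IL setting whose convergence caveats were discussed in Section V-A.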
TABLE V: Papers that applied MARL techniques to wireless communication problems; the learning type refers to the MARL training scheme.

VII. CONCLUSION AND FUTURE RESEARCH DIRECTIONS

We have presented an overview of model-free and model-based single-agent RL, and of cooperative MARL frameworks and algorithms. We have provided the mathematical background of the key frameworks for both single-agent and multi-agent RL. Afterward, we have developed an understanding of the state-of-the-art algorithms in the studied fields. We have discussed the use of model-based methods for solving wireless communication problems. Focusing on cooperative MARL, we have outlined different methods to establish coordination between agents. To showcase how these methods can be applied to wireless communication systems, we have reviewed the recent research contributions that adopt a multi-agent perspective in solving communication and networking problems. These problems involve AI-enabled MEC systems, intelligent control and management of UAV networks, distributed beamforming for cell-free MIMO systems, cooperative spectrum sharing, THz communications, and IRS deployment. We have chosen to focus on cooperative MARL applications because several surveys on single-agent RL exist in the literature. Our objective has been to highlight the potential of these DRL methods, specifically MARL, in building self-organizing, reliable, and scalable systems for future wireless generations. In what follows, we discuss research directions to enrich and bridge the gap between both fields.

• Network topologies: One of the shortcomings of the MG-based formulations of MARL problems is the assumed homogeneity of the studied systems. However, this is seldom the case in real-world scenarios like MEC-IoT or sensing networks. For this reason, we have motivated the networked MARL paradigm where agents with different reward functions can cooperate. Accounting for the heterogeneity of wireless communication systems is mandatory for practical algorithm design. Mobility is also a challenge in wireless communication problems. Developing MARL algorithms with mobility considerations is an interesting research direction.

• Constrained/safe RL: RL is based on maximizing the reward feedback. The reward function is designed by human experts to guide the agent's policy search, but reward design is often challenging and can lead to unintended behavior. Wireless communication problems are often formulated as optimization problems under constraints. To account for those constraints, most of the recent works adopt a reward shaping strategy where penalties are added to the reward function for violating the defined constraints. In addition, reward shaping does not ensure that the exploration during training is constraint-satisfying. This motivates the constrained RL framework. It enables the development of more reliable algorithms ensuring that the learned policies satisfy reasonable service quality and/or respect system constraints.

• Theoretical guarantees: Despite the abundance of experimental works around RL methods, their convergence properties are still an active research area. Several endeavors, reviewed above, proposed convergence guarantees for policy gradient algorithms under specific assumptions such as unbiased gradient estimates. More pressing theoretical questions need to be addressed, such as the convergence speed to a globally optimal solution, the robustness to approximation errors, and the behavior when limited sample data is available.

• Privacy: One of the challenges of the commercialization of DRL-based solutions is privacy. These concerns are rooted in the data required to train RL agents, such as actions and rewards. Consequently, information about the environment dynamics and the reward functions can be inferred by malicious agents [79]. Privacy-preserving algorithms are attracting more attention and interest. Differential privacy was investigated in the context of Federated Learning as well as DRL [121]. Privacy is not sufficiently explored in the context of wireless communication.

• Security and robustness: DNNs are known to be vulnerable to adversarial attacks, and several recent works demonstrated the vulnerability of DRL to such attacks as well. To completely trust DRL-based methods in real-world critical applications, understanding the vulnerabilities of these methods and addressing them is a central concern in the deployment of AI-empowered systems [122], [123]. In addition to adversarial attacks, the robustness of the learned policies to differences between simulation and real-world settings needs to be addressed and studied (e.g., [124]).

REFERENCES

[1] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems: Applications, trends, technologies, and open research problems,” IEEE Network, vol. 34, no. 3, pp. 134–142, 2019.
[2] I. F. Akyildiz, A. Kak, and S. Nie, “6G and beyond: The future of wireless communications systems,” IEEE Access, vol. 8, pp. 133995–134030, 2020.
[3] L. Bariah, L. Mohjazi, S. Muhaidat, P. C. Sofotasios, G. K. Kurt, H. Yanikomeroglu, and O. A. Dobre, “A prospective look: Key enabling technologies, applications and open research topics in 6G networks,” arXiv preprint arXiv:2004.06049, 2020.
[4] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.
[5] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, pp. 8026–8037, 2019.
[6] C. Boutilier, “Planning, learning and coordination in multiagent decision processes,” in Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195–210, 1996.
[7] Y. Shoham, R. Powers, and T. Grenager, “Multi-agent reinforcement learning: a critical survey,” Web manuscript, vol. 2, 2003.
[8] X. Chen, X. Deng, and S.-H. Teng, “Settling the complexity of computing two-player Nash equilibria,” Journal of the ACM (JACM), vol. 56, no. 3, pp. 1–57, 2009.
[9] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133–3174, 2019.
[10] Y. Qian, J. Wu, R. Wang, F. Zhu, and W. Zhang, “Survey on reinforcement learning applications in communication networks,” Journal of Communications and Information Networks, vol. 4, no. 2, pp. 30–39, 2019.
[11] L. Lei, Y. Tan, K. Zheng, S. Liu, K. Zhang, and X. Shen, “Deep reinforcement learning for autonomous internet of things: Model, applications and challenges,” IEEE Communications Surveys & Tutorials, 2020.
[12] Y. L. Lee and D. Qin, “A survey on applications of deep reinforcement learning in resource management for 5G heterogeneous networks,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1856–1862, IEEE, 2019.
[13] H. Zhu, Y. Cao, W. Wang, T. Jiang, and S. Jin, “Deep reinforcement learning for mobile edge caching: Review, new features, and open issues,” IEEE Network, vol. 32, no. 6, pp. 50–57, 2018.
[14] S. Ali, W. Saad, N. Rajatheva, K. Chang, D. Steinbach, B. Sliwa, C. Wietfeld, K. Mei, H. Shiri, H.-J. Zepernick, et al., “6G white paper on machine learning in wireless communication networks,” arXiv preprint arXiv:2004.13875, 2020.
[15] F. Tang, Y. Kawamoto, N. Kato, and J. Liu, “Future intelligent and secure vehicular network toward 6G: Machine-learning approaches,” Proceedings of the IEEE, vol. 108, no. 2, pp. 292–307, 2019.
[16] C. She, C. Sun, Z. Gu, Y. Li, C. Yang, H. V. Poor, and B. Vucetic, “A tutorial of ultra-reliable and low-latency communications in 6G: Integrating theoretical knowledge into deep learning,” arXiv preprint arXiv:2009.06010, 2020.
[17] I. Althamary, C.-W. Huang, and P. Lin, “A survey on multi-agent reinforcement learning methods for vehicular networks,” in 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), pp. 1154–1159, IEEE, 2019.
[18] D. Lee, N. He, P. Kamalaruban, and V. Cevher, “Optimization for reinforcement learning: From a single agent to cooperative agents,” IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 123–135, 2020.
[19] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[20] K. I. Ahmed and E. Hossain, “A deep Q-learning method for downlink power allocation in multi-cell networks,” arXiv preprint arXiv:1904.13032, 2019.
[21] M. L. Littman and R. S. Sutton, “Predictive representations of state,” in Advances in Neural Information Processing Systems, pp. 1555–1561, 2002.
[22] R. Xie, Q. Tang, C. Liang, F. R. Yu, and T. Huang, “Dynamic computation offloading in IoT fog systems with imperfect channel state information: A POMDP approach,” IEEE Internet of Things Journal, 2020.
[23] L. S. Shapley, “Stochastic games,” Proceedings of the National Academy of Sciences, vol. 39, no. 10, pp. 1095–1100, 1953.
[24] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine Learning Proceedings 1994, pp. 157–163, Elsevier, 1994.
[25] H. X. Pham, H. M. La, D. Feil-Seifer, and A. Nefian, “Cooperative and distributed reinforcement learning of drones for field coverage,” arXiv preprint arXiv:1803.07250, 2018.
[26] F. A. Oliehoek, C. Amato, et al., A Concise Introduction to Decentralized POMDPs, vol. 1. Springer, 2016.
[27] E. A. Hansen, D. S. Bernstein, and S. Zilberstein, “Dynamic programming for partially observable stochastic games,” in AAAI, vol. 4, pp. 709–715, 2004.
[28] Z. Cao, P. Zhou, R. Li, S. Huang, and D. Wu, “Multi-agent deep reinforcement learning for joint multi-channel access and task offloading of mobile edge computing in industry 4.0,” IEEE Internet of Things Journal, 2020.
[29] C. Zhong, M. C. Gursoy, and S. Velipasalar, “Deep multi-agent reinforcement learning based cooperative edge caching in wireless networks,” in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pp. 1–6, IEEE, 2019.
[30] Y. Al-Eryani, M. Akrout, and E. Hossain, “Multiple access in cell-free networks: Outage performance, dynamic clustering, and deep reinforcement learning-based design,” IEEE Journal on Selected Areas in Communications, 2020.
[31] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.
[32] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[33] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
[34] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning, pp. 1928–1937, 2016.
[35] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, pp. 1889–1897, 2015.
[36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[37] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, “A machine learning approach to routing,” arXiv preprint arXiv:1708.03074, 2017.
[38] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[39] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[40] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
[41] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International Conference on Machine Learning, pp. 1995–2003, 2016.
[42] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, “Deep reinforcement learning for dynamic multichannel access in wireless networks,” IEEE Transactions on Cognitive Communications and Networking, vol. 4, no. 2, pp. 257–265, 2018.
[43] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in Proceedings of the 31st International Conference on International Conference on Machine Learning, pp. 387–395, 2014.
[44] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[45] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” arXiv preprint arXiv:1802.09477, 2018.
[46] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
[47] C. Qiu, Y. Hu, Y. Chen, and B. Zeng, “Deep deterministic policy gradient (DDPG)-based energy harvesting wireless communications,” IEEE Internet of Things Journal, vol. 6, no. 5, pp. 8577–8588, 2019.
[48] H. Van Hasselt, Y. Doron, F. Strub, M. Hessel, N. Sonnerat, and J. Modayil, “Deep reinforcement learning and the deadly triad,” arXiv preprint arXiv:1812.02648, 2018.
[49] L. Baird, “Residual algorithms: Reinforcement learning with function approximation,” in Machine Learning Proceedings 1995, pp. 30–37, Elsevier, 1995.
[50] B. Dai, N. He, Y. Pan, B. Boots, and L. Song, “Learning from conditional distributions via dual embeddings,” in Artificial Intelligence and Statistics, pp. 1458–1467, 2017.
[51] B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song, “SBEED: Convergent reinforcement learning with nonlinear function approximation,” in International Conference on Machine Learning, pp. 1125–1134, PMLR, 2018.
[52] K. Zhang, A. Koppel, H. Zhu, and T. Başar, “Global convergence of policy gradient methods to (almost) locally optimal policies,” arXiv preprint arXiv:1906.08383, 2019.
[53] M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli, “Stochastic variance-reduced policy gradient,” arXiv preprint arXiv:1806.05618, 2018.
[54] Z. Shen, A. Ribeiro, H. Hassani, H. Qian, and C. Mi, “Hessian aided policy gradient,” in International Conference on Machine Learning, pp. 5729–5738, 2019.
[55] P. Xu, F. Gao, and Q. Gu, “An improved convergence analysis of stochastic variance-reduced policy gradient,” in Uncertainty in Artificial Intelligence, pp. 541–551, PMLR, 2020.
[56] B. Liu, Q. Cai, Z. Yang, and Z. Wang, “Neural proximal/trust region policy optimization attains globally optimal policy,” arXiv preprint arXiv:1906.10306, 2019.
[57] L. Wang, Q. Cai, Z. Yang, and Z. Wang, “Neural policy gradient methods: Global optimality and rates of convergence,” arXiv preprint arXiv:1909.01150, 2019.
[58] T. Xu, Z. Wang, and Y. Liang, “Improving sample complexity bounds for actor-critic algorithms,” arXiv preprint arXiv:2004.12956, 2020.
[59] Y. Wu, W. Zhang, P. Xu, and Q. Gu, “A finite time analysis of two time-scale actor critic methods,” arXiv preprint arXiv:2005.01350, 2020.
[60] T. Xu, Z. Wang, and Y. Liang, “Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms,” arXiv preprint arXiv:2005.03557, 2020.
[61] M. Yu, Z. Yang, M. Kolar, and Z. Wang, “Convergent policy optimization for safe reinforcement learning,” in Advances in Neural Information Processing Systems, pp. 3127–3139, 2019.
[62] E. Puiutta and E. Veith, “Explainable reinforcement learning: A survey,” arXiv preprint arXiv:2005.06247, 2020.
[63] A. Heuillet, F. Couthouis, and N. D. Rodríguez, “Explainability in deep reinforcement learning,” arXiv preprint arXiv:2008.06693, 2020.
[64] H. P. van Hasselt, M. Hessel, and J. Aslanides, “When to use parametric models in reinforcement learning?,” in Advances in Neural Information Processing Systems, pp. 14322–14333, 2019.
[65] A. Zhang, S. Sukhbaatar, A. Lerer, A. Szlam, and R. Fergus, “Composable planning with attributes,” in International Conference on Machine Learning, pp. 5842–5851, 2018.
[66] M. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.
[67] R. S. Sutton, “Integrated architectures for learning, planning, and reacting based on approximating dynamic programming,” in Machine Learning Proceedings 1990, pp. 216–224, Elsevier, 1990.
[68] A. W. Moore and C. G. Atkeson, “Prioritized sweeping: Reinforcement learning with less data and less time,” Machine Learning, vol. 13, no. 1, pp. 103–130, 1993.
[69] E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, “Benchmarking model-based reinforcement learning,” arXiv preprint arXiv:1907.02057, 2019.
[70] I. Mordatch and J. Hamrick, “Tutorial on model-based methods in reinforcement learning,” ICML, 2020.
[71] M. Janner, J. Fu, M. Zhang, and S. Levine, “When to trust your model: Model-based policy optimization,” in Advances in Neural Information Processing Systems, pp. 12519–12530, 2019.
[72] Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving PILCO with Bayesian neural network dynamics models,” in Data-Efficient Machine Learning Workshop, ICML, vol. 4, p. 34, 2016.
[73] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in Advances in Neural Information Processing Systems, pp. 4754–4765, 2018.
[74] A. S. Polydoros and L. Nalpantidis, “Survey of model-based reinforcement learning: Applications on robotics,” Journal of Intelligent & Robotic Systems, vol. 86, no. 2, pp. 153–173, 2017.
[75] E. Bargiacchi, T. Verstraeten, D. M. Roijers, and A. Nowé, “Model-based multi-agent reinforcement learning with cooperative prioritized sweeping,” arXiv preprint arXiv:2001.07527, 2020.
[76] O. Krupnik, I. Mordatch, and A. Tamar, “Multi-agent reinforcement learning with multi-step generative models,” in Conference on Robot Learning, pp. 776–790, 2020.
[77] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.
[78] P. Hernandez-Leal, M. Kaisers, T. Baarslag, and E. M. de Cote, “A survey of learning in multiagent environments: Dealing with non-stationarity,” arXiv preprint arXiv:1707.09183, 2017.
[79] X. Pan, W. Wang, X. Zhang, B. Li, J. Yi, and D. Song, “How you act tells a lot: Privacy-leakage attack on deep reinforcement learning,” arXiv preprint arXiv:1904.11082, 2019.
[80] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337, 1993.
[81] P. Hernandez-Leal, B. Kartal, and M. E. Taylor, “A survey and critique of multiagent deep reinforcement learning,” Autonomous Agents and Multi-Agent Systems, vol. 33, no. 6, pp. 750–797, 2019.
[82] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, “Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications,” IEEE Transactions on Cybernetics, 2020.
[83] K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” arXiv preprint arXiv:1911.10635, 2019.
[84] J. Jiang and Z. Lu, “Learning attentional communication for multi-agent cooperation,” in Advances in Neural Information Processing Systems, pp. 7254–7264, 2018.
[85] H. Mao, Z. Gong, Z. Zhang, Z. Xiao, and Y. Ni, “Learning multi-agent communication under limited-bandwidth restriction for internet packet routing,” arXiv preprint arXiv:1903.05561, 2019.
[86] R. Wang, X. He, R. Yu, W. Qiu, B. An, and Z. Rabinovich, “Learning efficient multi-agent communication: An information bottleneck approach,” arXiv preprint arXiv:1911.06992, 2019.
[87] R. Lowe, J. Foerster, Y.-L. Boureau, J. Pineau, and Y. Dauphin, “On the pitfalls of measuring emergent communication,” arXiv preprint arXiv:1903.05168, 2019.
[88] R. Lowe, Y. I. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.
[89] P. Sunehag, G. Lever, A. Gruslys, W. M. Czarnecki, V. F. Zambaldi, M. Jaderberg, M. Lanctot, N. Sonnerat, J. Z. Leibo, K. Tuyls, et al., “Value-decomposition networks for cooperative multi-agent learning based on team reward,” in AAMAS, pp. 2085–2087, 2018.
[90] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson, “QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning,” arXiv preprint arXiv:1803.11485, 2018.
[91] J. Castellini, F. A. Oliehoek, R. Savani, and S. Whiteson, “The representational capacity of action-value networks for multi-agent reinforcement learning,” arXiv preprint arXiv:1902.07497, 2019.
[92] A. Mahajan, T. Rashid, M. Samvelyan, and S. Whiteson, “MAVEN: Multi-agent variational exploration,” in Advances in Neural Information Processing Systems, pp. 7613–7624, 2019.
[93] J. K. Gupta, M. Egorov, and M. Kochenderfer, “Cooperative multi-agent control using deep reinforcement learning,” in International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83, Springer, 2017.
[94] K. Zhang, Z. Yang, and T. Başar, “Networked multi-agent reinforcement learning in continuous spaces,” in 2018 IEEE Conference on Decision and Control (CDC), pp. 2771–2776, IEEE, 2018.
[95] K. Zhang, Z. Yang, and T. Başar, “Decentralized multi-agent reinforcement learning with networked agents: Recent advances,” arXiv preprint arXiv:1912.03821, 2019.
[96] T. K. Rodrigues, K. Suto, H. Nishiyama, J. Liu, and N. Kato, “Machine learning meets computation and communication control in evolving edge and cloud: Challenges and future perspective,” IEEE Communications Surveys & Tutorials, vol. 22, no. 1, pp. 38–67, 2019.
[97] A. Shakarami, M. Ghobaei-Arani, and A. Shahidinejad, “A survey on the computation offloading approaches in mobile edge computing: A machine learning-based perspective,” Computer Networks, p. 107496, 2020.
[98] M. Sheraz, M. Ahmed, X. Hou, Y. Li, D. Jin, and Z. Han, “Artificial intelligence for wireless caching: Schemes, performance, and challenges,” IEEE Communications Surveys & Tutorials, 2020.
[99] M. McClellan, C. Cervelló-Pastor, and S. Sallent, “Deep learning at the mobile edge: Opportunities for 5G networks,” Applied Sciences, vol. 10, no. 14, p. 4735, 2020.
[100] Z. Chen and X. Wang, “Decentralized computation offloading for multi-user mobile edge computing: A deep reinforcement learning approach,” arXiv preprint arXiv:1812.07394, 2018.
[101] X. Liu, J. Yu, Z. Feng, and Y. Gao, “Multi-agent reinforcement learning for resource allocation in IoT networks with edge computing,” China Communications, vol. 17, no. 9, pp. 220–236, 2020.
[102] J. Heydari, V. Ganapathy, and M. Shah, “Dynamic task offloading in multi-agent mobile edge computing networks,” in 2019 IEEE Global Communications Conference (GLOBECOM), pp. 1–6, IEEE, 2019.
[103] N. Naderializadeh and M. Hashemi, “Energy-aware multi-server mobile edge computing: A deep reinforcement learning approach,” in 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 383–387, IEEE, 2019.
[104] Y. Zhang, B. Feng, W. Quan, A. Tian, K. Sood, Y. Lin, and H. Zhang, “Cooperative edge caching: A multi-agent deep learning based approach,” IEEE Access, vol. 8, pp. 133212–133224, 2020.
[105] B. Yang and M. Liu, “Keeping in touch with collaborative UAVs: A deep reinforcement learning approach,” in IJCAI, pp. 562–568, 2018.
[106] A. Shamsoshoara, M. Khaledi, F. Afghah, A. Razi, and J. Ashdown, “Distributed cooperative spectrum sharing in UAV networks using multi-agent reinforcement learning,” in 2019 16th IEEE Annual Consumer Communications & Networking Conference (CCNC), pp. 1–6, IEEE, 2019.
[107] H. Qie, D. Shi, T. Shen, X. Xu, Y. Li, and L. Wang, “Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning,” IEEE Access, vol. 7, pp. 146264–146272, 2019.
[108] J. Cui, Y. Liu, and A. Nallanathan, “The application of multi-agent reinforcement learning in UAV networks,” in 2019 IEEE International Conference on Communications Workshops (ICC Workshops), pp. 1–6, IEEE, 2019.
[109] J. Tožička, B. Szulyovszky, G. de Chambrier, V. Sarwal, U. Wani, and M. Gribulis, “Application of deep reinforcement learning to UAV fleet control,” in Proceedings of SAI Intelligent Systems Conference, pp. 1169–1177, Springer, 2018.
[110] J. Ge, Y.-C. Liang, J. Joung, and S. Sun, “Deep reinforcement learning for distributed dynamic MISO downlink-beamforming coordination,” IEEE Transactions on Communications, 2020.