Federated Deep Reinforcement Learning for RIS-Assisted Indoor Multi-Robot Communications
Abstract—Indoor multi-robot communications face two key challenges: one is the severe signal strength degradation caused by blockages (e.g., walls) and the other is the dynamic environment ...

... as the sudden collision, efficiency reduction and operation restriction. To avoid these potential problems, the reconfigurable intelligent surface (RIS) can be deployed to create a ...
... transmit power, RIS phase shifts, and robot trajectory in a semi-distributed manner. The reduction of the control dimension greatly accelerates the convergence at the training stage. Benefiting from the decentralized implementation, F-DRL can easily adapt to changes in the robot number.

3) We conduct numerical experiments to show the superiority of the proposed F-DRL. Compared to the centralized DRL, our method takes about 86% less training time and is more robust to the dynamic multi-robot environment. Simulation results also show that the designed F-DRL can outperform benchmarks in terms of energy efficiency.

II. SYSTEM MODEL AND PROBLEM FORMULATION

A. System Model

Fig. 1: RIS-assisted indoor multi-robot communications

As illustrated in Fig. 1, we consider an indoor multi-robot communication system aided by an RIS having $M$ passive reflecting elements. Using downlink NOMA techniques, the AP serves $K$ mobile robots¹, denoted by $\mathcal{K} = \{1, 2, \ldots, K\}$. To complete given tasks, we require the k-th robot to move from a starting position $q_{S,k}$ to a destination $q_{D,k}$ within a given deadline $T_{\max}$. We define $q_k^t$ as the position of the k-th robot at the t-th time slot, where $t \in \mathcal{T}_k = \{1, 2, \ldots, T_k\}$ and $T_k \le T_{\max}$ is the total traveling time at the speed $v$. For brevity, the time index $t$ is omitted in some parameters. We assume that robots update their trajectories each time slot.

¹ With the results obtained in this paper, the considered system can be easily extended to multi-antenna cases, which will be included in our future work.
The RIS is divided into $N$ sub-surfaces, denoted by $\mathcal{N} = \{1, 2, \ldots, N\}$. Let $\theta_n \in [0, 2\pi)$ denote the phase shift of the n-th sub-surface. Then, the RIS reflection matrix is denoted by $\Theta = \mathrm{diag}(\boldsymbol{\theta}_{N \times 1} \otimes \mathbf{1}_{(M/N) \times 1}) = \mathrm{diag}(e^{j\theta_1}, \ldots, e^{j\theta_M}) \in \mathbb{C}^{M \times M}$ with $M = M_R \times M_R$, where $M_R$ is the element number in the vertical or horizontal direction. In view of the hardware implementation, we consider a practical RIS with limited $N_R = 2^b$ phase shifts [14], where $\theta_m \in \mathcal{R} = \{\tfrac{1}{2}\Delta_R, \ldots, \tfrac{2N_R-1}{2}\Delta_R\}$, $\forall m$, and $\Delta_R = 2\pi/N_R$ is the phase resolution [15].

Let $\bar{h}_k \in \mathbb{C}^{1 \times 1}$, $\mathbf{h}_k \in \mathbb{C}^{1 \times M}$ and $\mathbf{g} \in \mathbb{C}^{M \times 1}$ denote the channel coefficients from the AP to the k-th robot, from the RIS to the k-th robot, and from the AP to the RIS, respectively. Then the combined channel coefficient experienced by the k-th robot is given by $h_k = \mathbf{h}_k \Theta \mathbf{g} + \bar{h}_k$. Thus, the received signal at the k-th robot is given by

$y_k = h_k \sqrt{p_k}\, s_k + h_k \sum_{i \neq k} \sqrt{p_i}\, s_i + n_k, \quad \forall k, \qquad (1)$

where $s_k$ is the transmit symbol for the k-th robot, $p_k > 0$ is the downlink power allocated to the k-th robot, and $n_k \sim \mathcal{CN}(0, \sigma^2)$ is the additive white Gaussian noise.
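To make the RIS model concrete, the short Python sketch below (our own illustration, not code from the paper) builds the quantized phase-shift set $\mathcal{R}$, the reflection matrix $\Theta$ with $N$ sub-surfaces of $M/N$ elements each, and the combined coefficient $h_k = \mathbf{h}_k \Theta \mathbf{g} + \bar{h}_k$. The random channel draws and parameter values are purely illustrative assumptions.

```python
import numpy as np

def phase_codebook(b):
    """Quantized set R = {Delta/2, ..., (2*N_R - 1)/2 * Delta} with N_R = 2**b levels."""
    n_r = 2 ** b
    delta = 2 * np.pi / n_r
    return (np.arange(n_r) + 0.5) * delta

def reflection_matrix(theta_sub, M):
    """Theta = diag(theta_{Nx1} kron 1_{(M/N)x1}): each sub-surface repeats one phase over M/N elements."""
    N = len(theta_sub)
    per_element = np.kron(theta_sub, np.ones(M // N))
    return np.diag(np.exp(1j * per_element))

def combined_channel(h_bar_k, h_k, g, Theta):
    """Combined AP-to-robot coefficient h_k = h_k Theta g + h_bar_k, cf. the system model."""
    return h_k @ Theta @ g + h_bar_k

# toy usage with random (illustrative) channel realizations
rng = np.random.default_rng(0)
M, N, b = 16, 4, 2                                   # M elements, N sub-surfaces, N_R = 2^b levels
theta_sub = rng.choice(phase_codebook(b), size=N)    # one discrete phase per sub-surface
Theta = reflection_matrix(theta_sub, M)
h_bar_k = (rng.normal() + 1j * rng.normal()) / np.sqrt(2)                 # direct AP-robot link
h_k = (rng.normal(size=M) + 1j * rng.normal(size=M)) / np.sqrt(2)        # RIS-robot link (1 x M)
g = (rng.normal(size=M) + 1j * rng.normal(size=M)) / np.sqrt(2)          # AP-RIS link (M x 1)
print(abs(combined_channel(h_bar_k, h_k, g, Theta)))
```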
To alleviate the interference among robots, we apply the successive interference cancellation (SIC) technique. Without loss of optimality, the channel coefficients of all robots are ranked as $|h_K| \le \cdots \le |h_2| \le |h_1|$. Then, to perform SIC successfully, the transmit power at the AP satisfies the following constraint:

$\Delta_k = p_k |h_{k-1}|^2 - \sum_{i=1}^{k-1} p_i |h_{k-1}|^2 \ge \rho_{\min}, \quad \forall k \ge 2, \qquad (2)$

where $\rho_{\min} > 0$ is the required gap to distinguish the decoded signal. When the above power constraint is met, the achievable downlink data rate at the k-th robot can be obtained by

$R_k = \log_2\!\left(1 + \frac{|h_k|^2 p_k}{|h_k|^2 \sum_{i=1}^{k-1} p_i + \sigma^2}\right), \quad \forall k. \qquad (3)$
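The SIC power-gap check of (2) and the rate expression (3) translate directly into a few lines of code. The sketch below is a minimal illustration; the powers, channels, noise variance and threshold are assumed values and do not come from the paper.

```python
import numpy as np

def sic_gaps(p, h, rho_min):
    """Power gaps Delta_k of (2) for the ordering |h_K| <= ... <= |h_1| (index 0 = strongest)."""
    gaps = []
    for k in range(1, len(p)):                       # paper's k = 2, ..., K in 0-based indexing
        gap = p[k] * abs(h[k - 1]) ** 2 - sum(p[:k]) * abs(h[k - 1]) ** 2
        gaps.append(gap)
    return np.array(gaps), bool(np.all(np.array(gaps) >= rho_min))

def noma_rates(p, h, sigma2):
    """Achievable downlink rates R_k of (3) under SIC."""
    rates = []
    for k in range(len(p)):
        interference = abs(h[k]) ** 2 * sum(p[:k])   # residual signals of stronger-channel robots
        rates.append(np.log2(1 + abs(h[k]) ** 2 * p[k] / (interference + sigma2)))
    return np.array(rates)

# illustrative values: 3 robots, channels sorted strongest first, weaker channels get more power
h = np.array([1.0, 0.6, 0.3])
p = np.array([0.02, 0.06, 0.12])
gaps, feasible = sic_gaps(p, h, rho_min=0.01)
print(gaps, feasible, noma_rates(p, h, sigma2=1e-3))
```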
Since the energy consumed by motion is much larger than that consumed by communication, this paper mainly focuses on the motion energy cost. Therefore, the total motion energy consumed by the k-th robot is expressed as [2]

$E_k = E_1 T_k v + E_2 T_k, \quad \forall k, \qquad (4)$

where $E_1$ and $E_2$ are two constants related to the mechanical output power and the transforming loss, respectively [16]. Their values depend on the exact robot motion model.

B. Problem Formulation

By optimizing the transmit power at the AP, the phase shifts of the RIS, and the trajectory of robots, this paper aims to maximize the total energy efficiency of all robots during the mission. Subject to the constraints of transmit power, phase shifts and robot mobility, a long-term optimization problem is formulated as
$\max_{\Theta, Q, \mathbf{p}} \ \frac{1}{T_k} \sum_{t=1}^{T_k} \sum_{k=1}^{K} \frac{R_k^t}{E_k} \qquad (5a)$

$\text{s.t.} \quad q_k^1 = q_{S,k}, \ q_k^{T_k} = q_{D,k}, \ \forall k, \qquad (5b)$

$|h_K^t| \le \cdots \le |h_2^t| \le |h_1^t|, \ \forall t, \qquad (5c)$

$\Delta_k^t \ge \rho_{\min}, \ p_k^t > 0, \ \forall k, \forall t, \qquad (5d)$

$x_{\min} \le x_k^t \le x_{\max}, \ \forall k, \forall t, \qquad (5e)$

$y_{\min} \le y_k^t \le y_{\max}, \ \forall k, \forall t, \qquad (5f)$

$\sum_{k=1}^{K} p_k^t \le P_{\max}, \ \forall t, \qquad (5g)$

$\theta_n^t \in \mathcal{R}, \ \forall n, \forall t, \qquad (5h)$
where $Q = [q_1, q_2, \ldots, q_K]^T$ denotes the trajectory design of all robots and $\mathbf{p} = [p_1, p_2, \ldots, p_K]^T$ is the power allocation strategy at the AP. However, the formulated problem (5) is difficult to solve by existing optimization methods and is also challenging to solve optimally, due to the following reasons. First, the multiple optimization variables $\{\Theta, Q, \mathbf{p}\}$ are closely coupled in the objective function (5a). Second, the achievable data rate $R_k^t$ is not a continuous function, due to the discrete phase shifts and the position-dependent channel coefficients. Third, the simultaneous motion of multiple robots also makes problem (5) hard to solve even if only the subproblem of trajectory design is considered. To sum up, traditional one-shot optimization methods do not apply to this dynamic system.
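Before turning to the learning design, the following sketch makes the objective concrete: it evaluates (5a) from per-slot rates and the motion energy of (4). It is a simplified illustration (identical travel time for all robots, rates assumed pre-computed from (3)); the function and parameter names are ours, not the paper's.

```python
import numpy as np

def motion_energy(T_k, v, E1, E2):
    """Motion energy E_k = E1 * T_k * v + E2 * T_k of (4)."""
    return E1 * T_k * v + E2 * T_k

def energy_efficiency_objective(rates_per_slot, T, v, E1, E2):
    """Objective (5a): time-averaged sum over robots and slots of R_k^t / E_k.

    rates_per_slot: array of shape (T, K) with R_k^t already computed for the chosen
    phase shifts, powers and positions. All robots are assumed to travel for T slots here.
    """
    T_slots, K = rates_per_slot.shape
    E = np.array([motion_energy(T, v, E1, E2) for _ in range(K)])
    return (rates_per_slot / E).sum() / T_slots

# illustrative numbers only (not the paper's settings): 40 slots, 3 robots
rng = np.random.default_rng(1)
rates = rng.uniform(1.0, 4.0, size=(40, 3))
print(energy_efficiency_objective(rates, T=40, v=0.5, E1=10.0, E2=2.0))
```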
III. PROPOSED F-DRL APPROACH

In this section, we develop an F-DRL approach that is capable of accelerating the training process and obtaining high performance in terms of energy efficiency. As shown in Fig. 2, the F-DRL approach is split into two stages: the global stage for RIS configuration and the local stage for joint robot trajectory and transmit power control.

Fig. 2: Proposed F-DRL approach for communication-aware trajectory design

A. Global Decision Stage

At the global decision stage, the AP adjusts the RIS configuration with global state information. Specifically, we define the phase shift design problem as a Markov decision process (MDP), denoted by a transition tuple having three elements, $\langle S_G, A_G, R_G \rangle$, where $S_G$ is the state space, $A_G$ is the action space, and $R_G$ is the reward.

• State space: Let $s_G^t \in S_G$, $\forall t$. Since the combined channel coefficients $(h_k)_{k \in \mathcal{K}}$ remain unknown before the RIS phase shifts are designed, the coefficients of the AP-robot links $(\bar{h}_k)_{k \in \mathcal{K}}$ are considered as the channel features. Thus, the global state is defined as

$s_G^t = \{q_k^t, \bar{h}_k^t \mid \forall k \in \mathcal{K}\}, \quad \forall t, \qquad (6)$

where the position $q_k^t$ can be obtained by the simultaneous localization and mapping algorithm [17]. Meanwhile, the continuous 2D map is discretized into grids with the length of $\Delta_S$, while the sampling positions are in the center of each grid and satisfy the constraints in (5e) and (5f).²

• Action space: Let $a_G^t \in A_G$, $\forall t$. Then, the action for RIS phase shift design is defined as

$a_G^t = \{\theta_n^t \mid \forall n \in \mathcal{N}\}, \quad \forall t, \qquad (7)$

where $\theta_n^t \in \mathcal{R}$ is the discrete phase shift adopted by the n-th RIS sub-surface.

• Reward: With the aim of maximizing the sum rate, the reward is defined as

$r_G^t = \tau_1 \sum_{k=1}^{K} R_k^t, \quad \forall t, \qquad (8)$

where $\tau_1$ is a constant. We let $r_G^t < 0$ to avoid robots wandering. Additionally, it is inappropriate to put the sum of the combined channel coefficients $(h_k)_{k \in \mathcal{K}}$ into the reward function, because it is necessary for NOMA to maintain the distinctness among different signals.

² Using the default track curve [18], the discrete sampling points can be reconstructed into continuous curves.
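A minimal sketch of how the global tuple $\langle S_G, A_G, R_G \rangle$ of (6)-(8) could be encoded is given below. The class name, the state flattening, and the negative value of $\tau_1$ (our reading of the requirement $r_G^t < 0$) are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

class GlobalRISAgentView:
    """Illustrative encoding of the AP-side MDP <S_G, A_G, R_G> in (6)-(8)."""

    def __init__(self, codebook, num_subsurfaces, tau1=-0.1):
        self.codebook = codebook              # discrete set R of phase shifts
        self.N = num_subsurfaces
        self.tau1 = tau1                      # tau1 < 0 keeps r_G^t negative, cf. (8)

    def state(self, positions, h_bar):
        """Global state s_G^t = {q_k^t, h_bar_k^t | k in K}: robot positions + direct-link features."""
        return np.concatenate([np.ravel(positions), np.abs(h_bar)])

    def random_action(self, rng):
        """Action a_G^t = {theta_n^t}: one discrete phase per sub-surface (exploration branch)."""
        return rng.choice(self.codebook, size=self.N)

    def reward(self, rates):
        """Reward r_G^t = tau1 * sum_k R_k^t."""
        return self.tau1 * float(np.sum(rates))

# toy usage
rng = np.random.default_rng(2)
agent = GlobalRISAgentView(codebook=np.array([np.pi/4, 3*np.pi/4, 5*np.pi/4, 7*np.pi/4]),
                           num_subsurfaces=1)
s = agent.state(positions=np.array([[1, 2], [3, 4], [5, 6]]), h_bar=np.array([0.3 + 0.1j, 0.2, 0.5]))
print(s.shape, agent.random_action(rng), agent.reward(rates=np.array([2.0, 1.5, 1.1])))
```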
B. Local Decision Stage

At the local decision stage, each robot determines its trajectory and downlink transmit power with local state information. Because the control dimension of centralized DRL multiplies with the increase of robots, we propose to train the robots locally and then aggregate a global model in a federated manner. The MDP with the local transition tuple $\langle S_{L,k}, A_{L,k}, R_{L,k} \rangle$ maintained by the k-th robot is defined as follows.

• State space: Let $s_{L,k}^t \in S_{L,k}$, $\forall t$. Then, the local state is defined as

$s_{L,k}^t = \{q_k^t, \bar{h}_{A,k}^t\}, \quad \forall k, \forall t, \qquad (9)$

where the local state $s_{L,k}^t$ is a part of the global state $s_G^t$.

• Action space: Let $a_{L,k}^t \in A_{L,k}$, $\forall k, \forall t$. Then, the local action for trajectory design and transmit power control is defined as

$a_{L,k}^t = \{o_k^t, p_k^t\}, \quad \forall k, \forall t, \qquad (10)$

where the k-th robot orientation $o_k \in \{n, s, e, w\}$ indicates that robots move in four directions, i.e., north, south, east or west. To satisfy the constraints in (5c), (5d) and (5g), the first robot must guarantee $p_1 < P_{\max}/2^{K-1}$. Inspired by discrete power control, we have $p_k \in \{P_{\max}/2, \ldots, P_{\max}/2^{N_P}\}$ and $N_P \ge K$.

• Reward: To maximize the energy efficiency, we define the local reward $r_{L,k}^t$ as

$r_{L,k}^t = \phi R_k^t + \psi R_{D,k}^t + R_{\mathrm{time}} + R_{\mathrm{goal}}, \quad \forall k, \forall t, \qquad (11)$

where the guidance reward $R_{D,k}^t = d_{D,k}^{t-1} - d_{D,k}^t$ for $t \ge 2$, and $d_{D,k}^t$ is the distance between the k-th robot and its destination at the t-th time slot. The guidance reward $R_{D,k}^t$ leads the k-th robot to reach its destination. Moreover, the time cost $R_{\mathrm{time}}$ is a constant and $R_{\mathrm{time}} < 0$. If the k-th robot arrives at its destination, it gains a positive reward $R_{\mathrm{goal}}$; otherwise we have $R_{\mathrm{goal}} = 0$. In this paper, the parameter $\phi$ must guarantee $R_{\mathrm{time}} + \phi R_k^t < 0$ in most cases to prevent robots from wandering.
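The local reward (11) can be sketched as follows. The specific values of $\phi$, $\psi$, $R_{\mathrm{time}}$ and $R_{\mathrm{goal}}$ are illustrative choices, only meant to respect the stated condition $R_{\mathrm{time}} + \phi R_k^t < 0$ in typical slots.

```python
def guidance_reward(d_prev, d_now):
    """R_{D,k}^t = d_{D,k}^{t-1} - d_{D,k}^t: positive when the robot moves closer to its goal."""
    return d_prev - d_now

def local_reward(rate, d_prev, d_now, reached, phi=0.01, psi=1.0, r_time=-0.5, r_goal=10.0):
    """Local reward (11): r = phi*R_k^t + psi*R_{D,k}^t + R_time + R_goal.

    phi is kept small so that R_time + phi*R_k^t stays negative in most slots,
    matching the paper's anti-wandering condition; the numbers themselves are ours.
    """
    return phi * rate + psi * guidance_reward(d_prev, d_now) + r_time + (r_goal if reached else 0.0)

# one illustrative step: the robot moved 0.5 m closer and achieved 3 bit/s/Hz
print(local_reward(rate=3.0, d_prev=4.0, d_now=3.5, reached=False))
```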
C. Global Aggregation

Take training a deep Q-network (DQN) as an example. All agents collaboratively build a shared DQN, where the replay memory $\mathcal{D}$ and the $\epsilon$-greedy policy are considered. For each DQN agent $i \in \mathcal{K} \cup \{k_G\}$, the online Q-network and the target Q-network are defined as $Q(s_i^t, a_i^t; w_i^t)$ and $Q(s_i^t, a_i^t; \hat{w}_i^t)$, respectively. To update the online Q-network, each agent performs a gradient descent step with a learning rate $\alpha > 0$ on the loss function. Meanwhile, the target Q-network is reset as $\hat{w}_i^t = w_i^t$ every $N_Q$ time steps.
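As a reference point, the following PyTorch-style sketch shows a standard DQN update with fixed Q-targets: a gradient step on the temporal-difference loss and a copy of the online weights into the target network every $N_Q$ steps. It is a generic implementation under our own naming, not the paper's code.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Small fully connected Q-network Q(s, a; w): one value per discrete action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
    def forward(self, s):
        return self.net(s)

def dqn_step(online, target, optimizer, batch, gamma=0.9):
    """One gradient-descent step on the TD loss using the fixed target network."""
    s, a, r, s_next = batch                                        # a: long tensor of action indices
    q_sa = online(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a; w)
    with torch.no_grad():
        q_next = target(s_next).max(dim=1).values                  # max_a' Q(s', a'; w_hat)
    loss = nn.functional.mse_loss(q_sa, r + gamma * q_next)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def maybe_sync_target(online, target, step, n_q):
    """Fixed Q-targets: copy w into w_hat every N_Q time steps."""
    if step % n_q == 0:
        target.load_state_dict(online.state_dict())

# usage: online = QNet(9, 4); target = QNet(9, 4); target.load_state_dict(online.state_dict())
#        opt = torch.optim.Adam(online.parameters(), lr=1e-3)
```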
Besides, the k-th robot trains its networks locally and uploads the relevant weights $w_{L,k}$, $\hat{w}_{L,k}$ every $N_F$ time steps during the local decision stage. At each aggregation step, all robots upload their local weights to the AP at the t-th time slot, and the AP aggregates the global weights $w_L^t$ and $\hat{w}_L^t$ as

$w_L^t = \frac{1}{K}\sum_{k=1}^{K} w_{L,k}^t, \quad \hat{w}_L^t = \frac{1}{K}\sum_{k=1}^{K} \hat{w}_{L,k}^t, \quad \forall k, \forall t. \qquad (12)$

Then, the updated global weights are sent back to the local robots at the next time step until convergence.
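The aggregation in (12) is a plain parameter-wise average of the uploaded weights, applied to both the online and the target networks. A sketch under our assumptions (PyTorch state dicts, floating-point parameters only, equal weighting as in (12)) is given below.

```python
import torch

def federated_average(state_dicts):
    """Aggregate local weights as in (12): w_L = (1/K) * sum_k w_{L,k}, parameter by parameter."""
    K = len(state_dicts)
    return {name: sum(sd[name].float() for sd in state_dicts) / K for name in state_dicts[0]}

def aggregation_round(local_online_nets, local_target_nets, global_online, global_target):
    """Every N_F steps: robots upload (w_{L,k}, w_hat_{L,k}); the AP averages and broadcasts back."""
    global_online.load_state_dict(federated_average([n.state_dict() for n in local_online_nets]))
    global_target.load_state_dict(federated_average([n.state_dict() for n in local_target_nets]))
    for n in local_online_nets:
        n.load_state_dict(global_online.state_dict())
    for n in local_target_nets:
        n.load_state_dict(global_target.state_dict())
```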
Compared to traditional optimization algorithms, the proposed intelligent approach can adapt to the uncertainty and dynamics of indoor systems. Moreover, due to the semi-distributed training and decentralized execution, the proposed F-DRL approach can significantly reduce the communication overhead and effectively alleviate privacy leakage.
1) Overall Training Methodology: As shown in Fig. 2, the proposed F-DRL approach has four steps. (1) State observation: agents observe the environmental states. (2) RIS action execution: the AP controls the RIS phase shifts $a_G^t$ according to $Q_G(s_G, a_G; w_G)$ obtained at the global decision stage, and determines the NOMA decoding order. (3) Robot action execution: the k-th robot decides its action $a_{L,k}^t$ of the orientation and downlink transmit power based on $Q_{L,k}(s_{L,k}, a_{L,k}; w_{L,k})$. (4) Experience storage: agents obtain rewards and store transitions. Algorithm 1 shows the detailed training procedure of the proposed F-DRL approach. On account of the interaction between the local agents and the global agent, the proposed F-DRL approach operates in a semi-distributed manner.
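Putting the four steps together, one training time step of the semi-distributed loop could look like the sketch below. The environment object and all of its hooks (global_state, apply_ris_phases, step, and so on) are hypothetical placeholders introduced only for illustration, not an interface defined in the paper.

```python
def fdrl_time_step(env, global_agent, local_agents, replay_global, replay_locals, epsilon, rng):
    """One interaction step of the semi-distributed F-DRL loop (steps (1)-(4) above)."""
    # (1) State observation
    s_g = env.global_state()
    s_l = [env.local_state(k) for k in range(len(local_agents))]

    # (2) RIS action execution at the AP (epsilon-greedy), then NOMA decoding order
    a_g = global_agent.act(s_g, epsilon, rng)
    env.apply_ris_phases(a_g)
    env.update_decoding_order()

    # (3) Robot action execution: orientation + downlink power from each local Q-network
    a_l = [agent.act(s_l[k], epsilon, rng) for k, agent in enumerate(local_agents)]
    r_locals, r_g, s_g_next, s_l_next = env.step(a_l)   # r_g = tau1 * sum_k R_k^t, cf. (8)

    # (4) Experience storage for later gradient steps
    replay_global.append((s_g, a_g, r_g, s_g_next))
    for k, mem in enumerate(replay_locals):
        mem.append((s_l[k], a_l[k], r_locals[k], s_l_next[k]))
```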
2) Complexity Analysis: By reducing the control dimension, the complexity of F-DRL is lower than that of centralized learning. More precisely, the complexity for the DQN using a one-dimensional replay memory is $O(1)$. The computational complexity of each agent mainly depends on the transition and back-propagation, which can be calculated by $O(|\mathcal{D}| + abE|\mathcal{D}_0|)$, where $a$, $b$ and $E$ denote the number of layers, the transitions in each layer and the number of episodes, respectively. Moreover, the action space sizes of F-DRL at the global and local decision stages are $(N_R)^N$ and $(4N_P)^K$, respectively, but that of centralized DRL is $(4N_P)^K \times (N_R)^N$. Therefore, the proposed F-DRL has a lower complexity compared to centralized DRL. The theoretical analysis of F-DRL convergence has been completed in [19]. A detailed proof is omitted here for brevity. In the following, we conduct experiments to show the convergence behavior of F-DRL.
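As a worked example with the parameter values used later in the numerical results ($N_R = 4$, $N_P = 6$, $N = 1$, $K = 3$), the decomposition reduces the per-step decision space from a single joint action set to two much smaller ones:

$(N_R)^N = 4^1 = 4, \qquad (4N_P)^K = 24^3 = 13\,824, \qquad (4N_P)^K \times (N_R)^N = 55\,296.$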
IV. NUMERICAL RESULTS

In this section, we verify the efficiency and robustness of the proposed F-DRL approach for the considered communication system. In the simulation, the robots are randomly located, while the AP and the RIS are located at (15, 30, 2) and (30, 7.5, 2), respectively. The maximum transmit power of the AP is $P_{\max} = 20$ dBm and the noise power spectral density is $N_0 = -100$ dBm/Hz. The channel model is the same as the settings in [20]. Other parameters are given in Table I. For comparison, we consider the following baselines:

(a) $M_R = 0$   (b) $M_R = 20$   (c) $M_R = 30$   (d) $M_R = 40$

Fig. 4: Trajectory of robots under different values of $M_R$, where the red, blue and yellow points denote the robot trajectory using the QoS-based energy efficiency (EE) policy, and the black markers denote the trajectory using the QoE-based EE policy.
Fig. 4 demonstrates the trajectory of robots versus different $M_R$, where the performance of the QoS-based energy efficiency (EE) policy and the QoE-based EE policy is compared. The parameters are set as $N_R = 4$, $N_P = 6$, $N = 1$ and $K = 3$. The background in Fig. 4 reflects the communication quality of the downlink channels. As expected, we find that the RIS enhances the channel conditions, especially alleviating the severe signal strength degradation caused by the walls. The QoS-based EE policy maintains better channel conditions than Baseline 3, especially when $M_R > 20$. This is because Baseline 3 cares more about the bad channel coefficients, while the QoS-based EE policy cares more about the sum of the channel conditions. Moreover, the results show that the QoS-based EE policy in the considered system can achieve higher energy efficiency, while the robot with the worst channel condition always maintains a required data rate in NOMA-based systems under Baseline 3, because the logarithmic function is more sensitive to small data rate changes.

In Fig. 5, the energy efficiency under different environmental parameters is illustrated. When $N_R = 4$ and $N_P = 6$, the energy efficiency is evaluated versus $P_{\max}$ by changing the number of robots $K$, the multiple access technology $z \in \{\mathrm{NOMA}, \mathrm{OMA}\}$, and the number of RIS elements $M_R$. We find that the RIS is helpful to obtain higher energy efficiency. This is mainly because the RIS can overcome signal blockage by adjusting the radio environment. Meanwhile, when $K = 3$, the energy efficiency significantly increases with $M_R$, and shows a smaller improvement for $20 \le M_R \le 30$. Nevertheless, the energy efficiency increases over $0 \le M_R \le 30$ when $K = 4$. This phenomenon reveals that there exists a suitable transmit power budget $P_{\max}$ and number of RIS elements $M_R$ satisfying the communication demands with lower values. Moreover, the NOMA-RIS-based system gains higher energy efficiency than the OMA-RIS-based benchmarks, because NOMA signals are superimposed in the same time-frequency resources and obtain enhanced bandwidth efficiency. In addition, fewer robots and a smaller $P_{\max}$ lead to lower energy efficiency.

V. CONCLUSION

We studied a long-term energy efficiency maximization problem for RIS-assisted indoor multi-robot systems. By training agents in a semi-distributed manner, we developed a novel methodology for the communication-aware design problem, controlling the trajectory and downlink transmit power at the local robots and designing the RIS phase shifts at the AP. Owing to the decentralized nature of the proposed F-DRL, the dynamics of such a multi-robot system can be well handled. Numerical simulations demonstrated that our designed F-DRL converges faster than the centralized method and adapts to changes in the number of robots, while maintaining high performance in the NOMA-RIS design.

REFERENCES

[1] M. Afrin, J. Jin, A. Rahman, A. Rahman, J. Wan, and E. Hossain, "Resource allocation and service provisioning in multi-agent cloud robotics: A comprehensive survey," IEEE Commun. Surveys Tuts., vol. 23, no. 2, pp. 842–870, 2nd Quart. 2021.
[2] Y. Yan and Y. Mostofi, "To go or not to go: On energy-aware and communication-aware robotic operation," IEEE Trans. Control Netw. Syst., vol. 1, no. 3, pp. 218–231, Sep. 2014.
[3] X. Mu, Y. Liu, L. Guo, J. Lin, and R. Schober, "Intelligent reflecting surface enhanced indoor robot path planning: A radio map-based approach," IEEE Trans. Wireless Commun., vol. 20, no. 7, pp. 4732–4747, Jul. 2021.
[4] B. Di, H. Zhang, L. Song, Y. Li, Z. Han, and H. V. Poor, "Hybrid beamforming for reconfigurable intelligent surface based multi-user communications: Achievable rates with limited discrete phase shifts," IEEE J. Sel. Areas Commun., vol. 38, no. 8, pp. 1809–1822, Aug. 2020.
[5] H. Yang, Z. Xiong, J. Zhao, D. Niyato, Q. Wu, H. V. Poor, and M. Tornatore, "Intelligent reflecting surface assisted anti-jamming communications: A fast reinforcement learning approach," IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1963–1974, Mar. 2021.
[6] J. Park, S. Samarakoon, A. Elgabli, J. Kim, M. Bennis, S.-L. Kim, and M. Debbah, "Communication-efficient and distributed learning over wireless networks: Principles and applications," Proc. IEEE, vol. 109, no. 5, pp. 796–819, Feb. 2021.
[7] W. Ni, Y. Liu, Z. Yang, H. Tian, and X. Shen, "Federated learning in multi-RIS-aided systems," IEEE Internet Things J., vol. 9, no. 12, pp. 9608–9624, Jun. 2022.
[8] W. Ni, Y. Liu, Y. C. Eldar, Z. Yang, and H. Tian, "STAR-RIS integrated non-orthogonal multiple access and over-the-air federated learning: Framework, analysis, and optimization," IEEE Internet Things J., Jul. 2022, early access, doi: 10.1109/JIOT.2022.3188544.
[9] W. Ni, X. Liu, Y. Liu, H. Tian, and Y. Chen, "Resource allocation for multi-cell IRS-aided NOMA networks," IEEE Trans. Wireless Commun., vol. 20, no. 7, pp. 4253–4268, Jul. 2021.
[10] H. Yang, A. Alphones, Z. Xiong, D. Niyato, J. Zhao, and K. Wu, "Artificial-intelligence-enabled intelligent 6G networks," IEEE Netw., vol. 34, no. 6, pp. 272–280, Nov. 2020.
[11] H. Yang, Z. Xiong, J. Zhao, D. Niyato, L. Xiao, and Q. Wu, "Deep reinforcement learning-based intelligent reflecting surface for secure wireless communications," IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 375–388, Jan. 2020.
[12] Y. Nie, J. Zhao, F. Gao, and F. R. Yu, "Semi-distributed resource management in UAV-aided MEC systems: A multi-agent federated reinforcement learning approach," IEEE Trans. Veh. Technol., vol. 70, no. 12, pp. 13162–13173, Dec. 2021.
[13] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, Jan. 2021.
[14] H. Zhang, B. Di, L. Song, and Z. Han, "Reconfigurable intelligent surfaces assisted communications with limited phase shifts: How many phase shifts are enough?" IEEE Trans. Veh. Technol., vol. 69, no. 4, pp. 4498–4502, Apr. 2020.
[15] W. Ni, Y. Liu, Z. Yang, H. Tian, and X. Shen, "Integrating over-the-air federated learning and non-orthogonal multiple access: What role can RIS play?" IEEE Trans. Wireless Commun., Jun. 2022, early access, doi: 10.1109/TWC.2022.3181214.
[16] Y. Mei, Y.-H. Lu, Y. C. Hu, and C. G. Lee, "Deployment of mobile robots with energy and timing constraints," IEEE Trans. Robot., vol. 22, no. 3, pp. 507–522, Jun. 2006.
[17] X. Gao, Y. Liu, and X. Mu, "SLARM: Simultaneous localization and radio mapping for communication-aware connected robot," in Proc. ICC Workshops, Virtual, Jun. 2021, pp. 1–6.
[18] D. Rau, J. Rodina, and F. Štec, "Generating instant trajectory of an indoor UAV with respect to its dynamics," in Proc. ISMCR, Budapest, Hungary, Oct. 2020, pp. 1–5.
[19] X. Wang, C. Wang, X. Li, V. C. Leung, and T. Taleb, "Federated deep reinforcement learning for internet of things with decentralized cooperative edge caching," IEEE Internet Things J., vol. 7, no. 10, pp. 9441–9455, Apr. 2020.
[20] R. Luo, H. Tian, and W. Ni, "Communication-aware path design for indoor robots exploiting federated deep reinforcement learning," in Proc. PIMRC, Helsinki, Finland, Sep. 2021, pp. 1197–1202.
[21] T. T. Nguyen, N. D. Nguyen, and S. Nahavandi, "Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications," IEEE Trans. Cybern., vol. 50, no. 9, pp. 3826–3839, Sep. 2020.
[22] X. Liu, Y. Liu, Y. Chen, and H. V. Poor, "RIS enhanced massive non-orthogonal multiple access networks: Deployment and passive beamforming design," IEEE J. Sel. Areas Commun., vol. 39, no. 4, pp. 1057–1071, Apr. 2021.