Article
Adaptive Nonlinear Model Predictive Horizon Using Deep
Reinforcement Learning for Optimal Trajectory Planning
Younes Al Younes and Martin Barczyk *
Abstract: This paper presents an adaptive trajectory planning approach for nonlinear dynamical
systems based on deep reinforcement learning (DRL). This methodology is applied to the authors’
recently published optimization-based trajectory planning approach named nonlinear model pre-
dictive horizon (NMPH). The resulting design, which we call ‘adaptive NMPH’, generates optimal
trajectories for an autonomous vehicle based on the system’s states and its environment. This is done
by tuning the NMPH’s parameters online using two different actor-critic DRL-based algorithms, deep
deterministic policy gradient (DDPG) and soft actor-critic (SAC). Both adaptive NMPH variants are
trained and evaluated on an aerial drone inside a high-fidelity simulation environment. The results
demonstrate the learning curves, sample complexity, and stability of the DRL-based adaptation
scheme and show the superior performance of adaptive NMPH relative to our earlier designs.
Keywords: trajectory planning; nonlinear model predictive approach; adaptive design; deep
reinforcement learning; deterministic policy gradient; soft actor-critic
2. Methodologies
This section provides a background on the different methodologies used within the
adaptive NMPH framework.
The NMPH optimization problem incorporates a model of the nonlinear plant, a nonlinear control law (here, backstepping control), and a set of constraints
representing input limits plus static and dynamic obstacles in the environment. Connecting
the nonlinear plant with the control law aims to reduce the nonlinearity of the overall
closed-loop system and consequently the non-convexity of the associated optimization
problem. This greatly improves the efficiency of the optimization calculations, which
enables real-time trajectory generation to run onboard the drone vehicle.
Consider a nonlinear system with state, input, and output vectors $x \in X \subseteq \mathbb{R}^{n_x}$, $u \in U \subseteq \mathbb{R}^{n_u}$, and $\xi \in \Xi \subseteq \mathbb{R}^{n_\xi}$, respectively. The output vector is assumed to be a subset of the system state, $\Xi \subseteq X$. In addition, let $f(x(n), u(n)) : X \times U \to X$ be the smooth map that represents the plant dynamics, and $g(x(n), \xi_{ref}(n)) : X \times \Xi \to U$ the smooth nonlinear control law map.
NMPH is designed to generate estimated reference trajectories $\hat{\xi}_{ref} \in \Xi$, which will be tracked by the closed-loop system consisting of the plant and control law. As shown in (1), a copy of these closed-loop dynamics is used by the NMPH optimization problem, where the variables used by NMPH are denoted by a tilde ($\tilde{\cdot}$) to visually differentiate them from the actual system variables. For instance, within (1), $\tilde{x}$ represents the predicted system state trajectory, and $\tilde{\xi}$ is the predicted output trajectory.
The online NMPH optimization problem to bring the system from a current state $x$ to a terminal stabilization setpoint $x_{ss}$ is shown in Equation (1) [27]. Let $t_n$, $n = 0, 1, 2, \dots$ represent successive sampling times. At every sampling instant, the optimization treats the following problem for $\tilde{x}$ and $\hat{\xi}_{ref}$, running for as long as $\|x_{ss} - x(t_n)\| \geq \Delta$, where $\Delta \in \mathbb{R}^{+}$ is a user-specified tolerance:

$$\underset{\tilde{x},\,\hat{\xi}_{ref}}{\operatorname{argmin}}\; J\big(\tilde{x}, \hat{\xi}_{ref}\big) = E\big(\tilde{x}(t_n + T)\big) + \int_{t_n}^{t_n + T} L\big(\tilde{x}(\tau), \hat{\xi}_{ref}(\tau)\big)\, d\tau \tag{1}$$
where $\mathcal{X} \subseteq X$, $\mathcal{U} \subseteq U$, and $\mathcal{Z} \subseteq X$ are the constraint sets for the state, input, and output trajectories, respectively, and each $O_i(\tilde{x}) \leq 0$ in (1e) is an inequality constraint corresponding to a detected static or dynamic obstacle within the environment [27]. The stage cost $L$ and terminal cost $E$ functions in (1) are assigned as follows:
$$L\big(\tilde{x}(\tau), \hat{\xi}_{ref}(\tau)\big) = \|\tilde{x}(\tau) - x_{ss}\|^{2}_{W_x} + \|\tilde{\xi}(\tau) - \hat{\xi}_{ref}(\tau)\|^{2}_{W_\xi} \tag{2a}$$

$$E\big(\tilde{x}(t_n + T)\big) = \|\tilde{x}(t_n + T) - x_{ss}\|^{2}_{W_T} \tag{2b}$$

where the errors in (2a) and (2b) are weighted by matrices $W_x \in \mathbb{R}^{n_x \times n_x}$, $W_\xi \in \mathbb{R}^{n_\xi \times n_\xi}$, and $W_T \in \mathbb{R}^{n_x \times n_x}$, which in this work will be adaptively tuned using DRL algorithms.
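To make the weighted-norm costs in (2a) and (2b) concrete, the following minimal NumPy sketch (our illustration, not the authors' implementation; the dimensions and example weight values are assumptions) evaluates the stage and terminal costs using $\|e\|^{2}_{W} = e^{\top} W e$:

```python
import numpy as np

def weighted_norm_sq(e, W):
    """Squared weighted norm ||e||_W^2 = e^T W e."""
    return float(e.T @ W @ e)

def stage_cost(x_pred, xi_pred, x_ss, xi_ref, W_x, W_xi):
    """Stage cost L of Eq. (2a): state error plus output-tracking error."""
    return weighted_norm_sq(x_pred - x_ss, W_x) + weighted_norm_sq(xi_pred - xi_ref, W_xi)

def terminal_cost(x_pred_T, x_ss, W_T):
    """Terminal cost E of Eq. (2b), evaluated at the end of the prediction horizon."""
    return weighted_norm_sq(x_pred_T - x_ss, W_T)

# Toy example with assumed dimensions (n_x = 4 states, n_xi = 2 outputs).
W_x, W_xi, W_T = np.eye(4), 10.0 * np.eye(2), 5.0 * np.eye(4)
x_tilde, x_ss = np.zeros(4), np.ones(4)
xi_tilde, xi_ref_hat = np.zeros(2), np.ones(2)
print(stage_cost(x_tilde, xi_tilde, x_ss, xi_ref_hat, W_x, W_xi))  # 24.0
print(terminal_cost(x_tilde, x_ss, W_T))                           # 20.0
```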
The optimization problem in (1) begins with measuring the current state of the physical system $x(t_n)$ at time $t_n$. The cost function $J(\tilde{x}, \hat{\xi}_{ref})$ is then minimized over the prediction horizon $[t_n, t_n + T]$ subject to constraints (1b), (1c), and (1e) to provide a prediction of the values of $\tilde{x}$ and $\hat{\xi}_{ref}$. Finally, either the estimated reference trajectory $\hat{\xi}_{ref}$ or the predicted output trajectory $\tilde{\xi}$ (as the two converge to each other) is input into the closed-loop system
for tracking. This process is repeated in real time at a user-specified rate until the plant
reaches the desired terminal setpoint. Details about the NMPH approach can be found in
our recent works [25,26].
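The receding-horizon procedure above can be summarized by the following Python sketch; the helper functions measure_state, solve_nmph_ocp, and apply_reference are hypothetical placeholders for the state estimator, the optimization solver, and the closed-loop tracking interface, and are not part of the published implementation.

```python
import numpy as np

def nmph_planning_loop(x_ss, T, delta, measure_state, solve_nmph_ocp, apply_reference):
    """Receding-horizon NMPH loop, repeated until the terminal setpoint is reached.

    measure_state()            -> current state x(t_n)
    solve_nmph_ocp(x, x_ss, T) -> (x_tilde, xi_ref_hat) minimizing Eq. (1)
    apply_reference(ref)       -> feeds the estimated reference to the closed loop
    In practice each pass runs at a user-specified rate onboard the vehicle.
    """
    x = measure_state()
    while np.linalg.norm(x_ss - x) >= delta:              # stop once within tolerance Delta
        x_tilde, xi_ref_hat = solve_nmph_ocp(x, x_ss, T)  # optimize over [t_n, t_n + T]
        apply_reference(xi_ref_hat)                       # track xi_ref_hat (or xi_tilde)
        x = measure_state()                               # re-measure x at the next t_n
    return x
```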
In this work, the nonlinear backstepping control (BSC) law is used within the NMPH optimization problem as a constraint in (1c). The detailed development and implementation of the BSC technique within NMPH, as well as its advantages over the earlier feedback linearization (FBL)-based design [25], are described in our recent work [26].
The NMPH trajectory planning algorithm receives terminal points from a modular
global motion planner [27]. The global motion planner generates terminal points within
unexplored areas of an incrementally built-up volumetric map of the environment [28,29].
These terminal points, along with the current pose of the vehicle, the constraints represent-
ing the closed-loop system dynamics and environmental obstacles (which are extracted
from the volumetric map), and the entries of the weighting matrices (which in the present
design are adjusted online by a DRL algorithm) are sent to the NMPH optimization problem
in order to calculate optimal trajectories between the vehicle’s current pose and the next
terminal point. The results are then used as reference trajectories by the vehicle’s low-level
flight controller.
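As an illustration of the data flow described above, the sketch below bundles the inputs of one NMPH optimization call into a single container; the field names are our assumptions and do not reflect the authors' software interface.

```python
from dataclasses import dataclass
from typing import Callable, List
import numpy as np

@dataclass
class NMPHRequest:
    """Inputs assembled for one NMPH optimization call (illustrative container only;
    the field names are assumptions, not the authors' interface)."""
    current_pose: np.ndarray              # vehicle pose from onboard state estimation
    terminal_point: np.ndarray            # setpoint supplied by the global motion planner
    obstacle_constraints: List[Callable]  # functions O_i(x) <= 0 from the volumetric map
    W_x: np.ndarray                       # weighting matrices, adjusted online by the DRL agent
    W_xi: np.ndarray
    W_T: np.ndarray
```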
[Figure: agent-environment interaction loop, with the agent applying an action (a) to the environment and receiving a reward (r) in return.]
The replay buffer is a memory that collects the previous experience tuples $(s, a, r, s') \in \mathcal{B}$, which the agent reuses to increase computational efficiency and speed up learning [30].
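A minimal sketch of such a replay buffer (illustrative only, not the authors' code) is given below; it stores tuples up to a fixed capacity and returns uniformly sampled mini-batches.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity memory of experience tuples (s, a, r, s')."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)   # oldest tuples are dropped first

    def store(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Return a uniformly random mini-batch used for an update step."""
        return random.sample(self.storage, min(batch_size, len(self.storage)))
```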
We will now review the DDPG and SAC deep reinforcement learning algorithms used
within our proposed adaptive NMPH frameworks.
where a random batch of data $(s, a, r, s')$ from the replay buffer $\mathcal{B}$ is used for each update. The goal is to minimize the loss in (3) by performing gradient descent on the MSBE $J_Q(\phi, \mathcal{B})$.
As shown in (3), the neural network parameters represented by $\phi$ are used for both the action-value function approximator $Q_\phi(s, a)$ and the network that estimates $Q_\phi(s', a')$, which uses the next states and actions. Unfortunately, this makes it impossible for the gradient descent to converge. To tackle this issue, a time delay is added to the network parameters $\phi$ for $Q_\phi(s', a')$. The adjusted network is called the target Q-network $Q_{\phi_{targ}}(s', a')$ with parameters $\phi_{targ}$. A copy of the Q-network $Q_\phi(s', a')$ is used for the target Q-network $Q_{\phi_{targ}}(s', a')$, where the latter uses the weighted average of the model parameters $\phi_{targ} \leftarrow \rho\,\phi_{targ} + (1 - \rho)\,\phi$ to stabilize Q-function learning [32]. It should be noted that the parameters of the target Q-network are not trained; instead, they are periodically synchronized with the original Q-network's parameters.
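The soft target update $\phi_{targ} \leftarrow \rho\,\phi_{targ} + (1 - \rho)\,\phi$ amounts to a Polyak (weighted) average of the parameters; a minimal sketch over plain NumPy arrays is shown below (illustrative only, with the value of $\rho$ chosen arbitrarily).

```python
import numpy as np

def polyak_update(phi_targ, phi, rho=0.995):
    """Soft target update: phi_targ <- rho * phi_targ + (1 - rho) * phi.

    phi_targ, phi: lists of parameter arrays (one entry per layer).
    A rho close to 1 makes the target network change slowly, stabilizing learning.
    """
    return [rho * pt + (1.0 - rho) * p for pt, p in zip(phi_targ, phi)]
```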
The MSBE function given in (3) contains a maximization term for the Q-value. One way to perform this maximization is to apply the optimal action $a^{*}(s)$. This can be achieved by creating another approximator for the policy $\mu_\theta(s)$ with parameters $\theta$ and maximizing the associated Q-function with regard to the replay buffer $\mathcal{B}$. This new policy also requires a time delay to stabilize its learning. Therefore, a target policy $\mu_{\theta_{targ}}(s)$ is introduced to maximize $Q_{\phi_{targ}}$. The Bellman equation, MSBE, and policy learning function are respectively given by
$$y(r, s') = r + \gamma\, \overbrace{Q_{\phi_{targ}}\big(s', \underbrace{\mu_{\theta_{targ}}(s')}_{\text{target policy network}}\big)}^{\text{target Q-network}} \tag{4}$$

$$J_Q(\phi, \mathcal{B}) = \underset{(s,a,r,s') \sim \mathcal{B}}{\mathbb{E}} \Big[\big(\underbrace{Q_\phi(s, a)}_{\text{Q-network}} - y(r, s')\big)^{2}\Big] \tag{5}$$

$$J_\mu(\theta, \mathcal{B}) = \underset{s \sim \mathcal{B}}{\mathbb{E}} \Big[Q_\phi\big(s, \mu_\theta(s)\big)\Big] \tag{6}$$
Practically, for a random sample $B = \{(s, a, r, s')\}$ from the replay buffer $\mathcal{B}$ with cardinality $|B|$, Equations (5) and (6) can be expressed as

$$J_Q(\phi, B) = \frac{1}{|B|} \sum_{(s,a,r,s') \in B} \big(Q_\phi(s, a) - y(r, s')\big)^{2} \tag{7}$$

$$J_\mu(\theta, B) = \frac{1}{|B|} \sum_{s \in B} Q_\phi\big(s, \mu_\theta(s)\big) \tag{8}$$
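For illustration, the sketch below evaluates the mini-batch objectives (7) and (8) given callable approximators; the handles Q, mu, Q_targ, and mu_targ are assumptions, and in practice the gradient steps on $J_Q$ (descent) and $J_\mu$ (ascent) would be taken with an automatic-differentiation framework.

```python
def ddpg_losses(batch, Q, mu, Q_targ, mu_targ, gamma=0.99):
    """Evaluate the DDPG mini-batch objectives of Eqs. (7) and (8).

    batch: list of (s, a, r, s_next) tuples sampled from the replay buffer.
    Q, mu, Q_targ, mu_targ: callables for the critic, actor, and their target copies.
    """
    J_Q, J_mu = 0.0, 0.0
    for s, a, r, s_next in batch:
        y = r + gamma * Q_targ(s_next, mu_targ(s_next))   # Bellman target, Eq. (4)
        J_Q += (Q(s, a) - y) ** 2                         # squared Bellman error, Eq. (7)
        J_mu += Q(s, mu(s))                               # policy objective, Eq. (8)
    n = len(batch)
    return J_Q / n, J_mu / n
```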
The hyperparameters used for the DDPG algorithm are the number of training episodes, the target update factor ($\rho$), the actor and critic network learning rates, the replay buffer size, the random batch size, and the discount factor. The sensitivity to the hyperparameter values and the interaction between the Q-value and the policy approximator $\mu_\theta(s)$ make analyzing the stability and convergence of DDPG difficult [33], especially when using high-dimensional nonlinear universal function approximators [34]. Moreover, DDPG is expensive in terms of its sample complexity, which is measured by the number of training samples needed to complete the learning process.
An alternative approach that overcomes the issues of the DDPG algorithm is soft
actor-critic (SAC) [22,34], a probabilistic DRL algorithm, which is considered next.
SAC employs two Q-networks and two corresponding target Q-networks; both target Q-networks are copies of their corresponding Q-networks, but employ weighted averaging of the network parameters during training. Because of the policy's stochastic nature, SAC uses the current policy to obtain the next state-action values without needing a target policy [31]. In addition, the stochastic nature of the exploration process means it is not necessary to artificially introduce noise, as was done in the deterministic DDPG.
The objective of SAC is to maximize the sum of the expected return and entropy. The
Bellman equation within its Q-value function thus includes the expected entropy of the
policy as follows:
where $\alpha$ is the coefficient that regulates the trade-off between the expected entropy and return, $s'_{\mathcal{B}}$ indicates that the replay buffer is used to obtain the expectation of the future states, and $a'_\pi \sim \pi(\cdot|s')$ indicates that the current policy is used to obtain future actions. For simplicity of notation, we will denote $s'_{\mathcal{B}}$ by $s'$ and $a'_\pi$ by $a'$ in the sequel.
Two Bellman residuals are used within SAC [22], referred to as soft-MSBEs. In addition
to the policy network πθ , each soft-MSBE includes a Q-network and two target Q-networks
in its calculation as follows:
$$J_Q(\phi_i, \mathcal{B}) = \underset{(s,a,r,s',a') \sim \mathcal{B}}{\mathbb{E}} \Big[\big(Q_{\phi_i}(s, a) - y(r, s', a')\big)^{2}\Big], \quad i = 1, 2 \tag{10}$$
Similar to DDPG, the Q-functions are updated using gradient descent, while gradient
ascent is utilized to update the policy network.
The policy should maximize the state-value function Vπ (s), defined as follows:
$$V^{\pi}(s) = \underset{a \sim \pi}{\mathbb{E}} \big[Q^{\pi}(s, a) - \alpha \log \pi(a|s)\big] \tag{12}$$
which represents the expected return when starting from a state s and following a policy π.
For the optimal value of the action, we can employ reparameterization [22,31] to obtain a continuous action from a deterministic function that represents the policy. The function is expressed by the state and additive Gaussian noise as follows:

$$a_\theta(s, \xi) = \tanh\big(\mu_\theta(s) + \sigma_\theta(s)\, \xi\big), \quad \xi \sim \mathcal{N}\big(0, \operatorname{diag}(1, \dots, 1)\big) \tag{13}$$

and the optimum policy can be obtained by finding $\arg\max_\theta J_\pi(\theta, \mathcal{B})$ using gradient ascent.
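A minimal sketch of the reparameterized action in Equation (13) is shown below (illustrative only; the mean and standard deviation arrays stand in for the outputs of the policy network).

```python
import numpy as np

def sample_action(mu, sigma, rng=np.random.default_rng()):
    """Reparameterized action of Eq. (13): a = tanh(mu(s) + sigma(s) * xi), xi ~ N(0, I).

    mu, sigma: mean and standard deviation produced by the policy network for state s.
    The tanh squashing keeps the action inside a bounded, continuous range.
    """
    xi = rng.standard_normal(mu.shape)        # standard Gaussian noise
    return np.tanh(mu + sigma * xi)

# Example with an assumed 3-dimensional action (e.g., three NMPH weights before scaling).
a = sample_action(np.zeros(3), 0.5 * np.ones(3))
```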
For a random sample $B = \{(s, a, r, s', a')\}$ from the buffer $\mathcal{B}$, Equations (10) and (14) can be expressed as follows:

$$J_Q(\phi_i, B) = \frac{1}{|B|} \sum_{(s,a,r,s') \in B} \big(Q_{\phi_i}(s, a) - y(r, s', a')\big)^{2}, \quad i = 1, 2 \tag{15}$$

$$J_\pi(\theta, B) = \frac{1}{|B|} \sum_{s \in B} \Big[\min_{j=1,2} Q_{\phi_j}\big(s, a_\theta(s, \xi)\big) - \alpha \log \pi_\theta\big(a_\theta(s, \xi)\,|\,s\big)\Big] \tag{16}$$
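The following sketch evaluates the mini-batch objectives (15) and (16) for assumed callables; the soft Bellman target y is written in the standard clipped-double-Q form of [22,31], and the handles Q, Q_targ, and policy are assumed interfaces rather than the authors' code.

```python
def sac_losses(batch, Q, Q_targ, policy, alpha=0.2, gamma=0.99):
    """Evaluate the SAC mini-batch objectives of Eqs. (15) and (16).

    batch: (s, a, r, s_next) tuples from the replay buffer.
    Q, Q_targ: lists [Q1, Q2] of critic / target-critic callables.
    policy(s): returns a reparameterized action and its log-probability log pi(a|s).
    """
    J_Q = [0.0, 0.0]
    J_pi = 0.0
    for s, a, r, s_next in batch:
        a_next, logp_next = policy(s_next)                          # a' ~ pi(.|s')
        y = r + gamma * (min(Q_targ[0](s_next, a_next),
                             Q_targ[1](s_next, a_next)) - alpha * logp_next)  # soft target
        for i in range(2):
            J_Q[i] += (Q[i](s, a) - y) ** 2                         # soft MSBE, Eq. (15)
        a_new, logp = policy(s)
        J_pi += min(Q[0](s, a_new), Q[1](s, a_new)) - alpha * logp  # policy objective, Eq. (16)
    n = len(batch)
    return [j / n for j in J_Q], J_pi / n
```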
where $e_{t,RMS}$ is the root-mean-square (RMS) error between the generated and flight trajectories, and $r_{t,max}$ and $r_{t,th}$ are the maximum and threshold values of the trajectory tracking reward, respectively.
• Terminal setpoint reward, which reflects how close the ending point of the flight trajectory is to the terminal setpoint of the reference trajectory. The terminal setpoint reward is calculated as follows:

$$r_{ss} = \begin{cases} -\dfrac{r_{s,max}}{r_{s,th}}\, e_{ss} + r_{s,max}, & \text{for } e_{ss} \leq r_{s,th} \\ 0, & \text{otherwise} \end{cases} \tag{18}$$

where $e_{ss} = \|p_{ss} - \hat{\xi}^{\,pos}_{ref}(t_n + T)\|$ is the error between the terminal point and the final point of the reference trajectory generated by the NMPH, and $r_{s,max}$, $r_{s,th}$ are the maximum and threshold values of this reward, respectively.
• Completion reward, which reflects how far the drone travels along its prescribed flight trajectory in the associated time interval. This is given by the following:

$$r_c = \begin{cases} -\dfrac{r_{c,max}}{r_{c,th}}\, e_c + r_{c,max}, & \text{for } e_c \leq r_{c,th} \\ -5, & \text{otherwise} \end{cases} \tag{19}$$

where $e_c = \|p_{ss} - p|_{t_n + T}\|$ is the error between the drone's position at $t_n + T$ and the flight trajectory's endpoint, while $r_{c,max}$, $r_{c,th}$ are the maximum and threshold values of the completion reward, respectively. We place more importance on this factor by reducing the total reward ($r_c < 0$) whenever the error $e_c$ exceeds the assigned threshold value $r_{c,th}$. Consequently, the overall algorithm will give priority to ensuring the drone reaches the desired setpoint in the allotted timeframe.
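As an illustration, the sketch below evaluates the terminal setpoint and completion rewards of Equations (18) and (19); the maximum and threshold values are arbitrary example numbers, and only these two reward terms are shown.

```python
import numpy as np

def terminal_setpoint_reward(e_ss, r_s_max=1.0, r_s_th=0.5):
    """Terminal setpoint reward of Eq. (18); r_s_max and r_s_th are example values."""
    if e_ss <= r_s_th:
        return -(r_s_max / r_s_th) * e_ss + r_s_max
    return 0.0

def completion_reward(e_c, r_c_max=1.0, r_c_th=0.5):
    """Completion reward of Eq. (19); the -5 penalty prioritizes reaching the setpoint."""
    if e_c <= r_c_th:
        return -(r_c_max / r_c_th) * e_c + r_c_max
    return -5.0

# Example: errors measured at the end of the prediction horizon t_n + T.
p_ss = np.array([5.0, 0.0, 2.0])              # terminal setpoint
xi_ref_end = np.array([4.9, 0.1, 2.0])        # end of the generated reference trajectory
p_end = np.array([4.8, 0.2, 1.9])             # drone position at t_n + T
r_ss = terminal_setpoint_reward(np.linalg.norm(p_ss - xi_ref_end))
r_c = completion_reward(np.linalg.norm(p_ss - p_end))
```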
[Figure: adaptive NMPH framework. The environment comprises the NMPH optimization problem (cost function J over [t_n, t_n + T], nonlinear plant model, nonlinear control law, constraints, and weights), the exploration algorithm providing terminal setpoints x_ss, and the drone system providing point cloud/depth images and the current state x(t_n); the reinforcement learning agent receives an observation/state s and return R and outputs an action a through its policy update.]
[Figure: generated and flight trajectories with the quantities p_ss, p_o, v⃗_o, r⃗, and φ shown in the x-y-z frame.]
[Figure: DDPG-based adaptive NMPH architecture. The policy network μ_θ(s) outputs the action (the NMPH weights W) for the NMPH optimization problem, which produces the reference trajectory ξ̂_ref; the Q-network Q_φ(s, a) performs policy evaluation, the target policy network μ_θtarg(s) and target Q-network Q_φtarg(s, a) stabilize learning, and the drone system's observations (v_o, φ, r⃗) and reward r_traj + r_sp + r_c are stored in and sampled from the replay buffer.]
[Figure: SAC-based adaptive NMPH architecture. The stochastic policy network π_θ(a|s) outputs the NMPH weights W for the NMPH optimization problem, which produces the reference trajectory ξ̂_ref; two Q-networks Q_φi(s, a) perform policy evaluation with a min operation, two target Q-networks Q_φtarg,i(s, a) stabilize learning, the entropy coefficient α is optimized from the policy's log-probability, and the drone system's observations (v_o, φ, r⃗) and reward r_traj + r_sp + r_c are stored in and sampled from the replay buffer.]
The SAC architecture uses five networks: a policy network, two Q-networks, and two target Q-networks. The SAC policy and Q-network structures are shown in Figure 7.
Figure 6. Neural networks used by DDPG. IL: input layer; HL: hidden layer; OL: output layer.
Figure 7. Neural networks used by SAC. IL: input layer; HL: hidden layer; OL: output layer.
Figure 8 shows the average episodic reward during the training processes of the DDPG
and SAC architectures. In this comparison, each framework is learning to optimize the
values of only three actions, which represent the entries of the weight matrix corresponding
to the position states within the NMPH optimization problem.
To enhance DDPG performance in terms of sample complexity and its sensitivity to
hyperparameters, we propose and apply a ‘pre-exploration’ technique, which traverses the
RL problem spaces before the training process is started. Pre-exploration is performed by
applying a set of predefined actions and considering a random system state for each action.
The collected experiences of the pre-exploration process are then stored in the replay buffer,
which is used during the training process. It was found that using this technique helps
DDPG to improve convergence and stability over the case without pre-exploration, as can
be seen from Figure 8. On the other hand, a number of episodes must be spent on pre-exploration, which delays the learning process during real-time adaptation. The results in Figure 8 also show that SAC generally outperforms DDPG (either with or without pre-exploration) in terms of learning speed. In addition, during the training process, SAC showed noticeably better learning stability than DDPG with regard to the selection of the hyperparameter values for each algorithm.
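A hedged sketch of the pre-exploration step is given below; env.reset_random() and env.step(a) are hypothetical interfaces for setting a random system state and applying one predefined action, and buffer is the replay buffer introduced earlier.

```python
def pre_explore(buffer, env, predefined_actions):
    """Fill the replay buffer before training starts (pre-exploration sketch).

    For each predefined action, the environment is reset to a random state and
    the resulting experience tuple is stored for later off-policy updates.
    env.reset_random() and env.step(a) are assumed interfaces, not the authors' API.
    """
    for a in predefined_actions:
        s = env.reset_random()                 # random system state for this action
        s_next, r = env.step(a)                # apply the predefined action once
        buffer.store(s, a, r, s_next)
```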
Figure 8. Training curves of SAC, DDPG with pre-exploration, and DDPG without pre-exploration
for adaptively tuning three NMPH parameters.
To test the performance of the SAC approach in a higher-dimensional setting, the number of actions was increased to 12 to estimate the weight matrix entries corresponding to the position, velocity, and acceleration states $\{w_x, w_y, w_z, w_\psi, w_{\dot{x}}, w_{\dot{y}}, w_{\dot{z}}, w_{\dot{\psi}}, w_{\ddot{x}}, w_{\ddot{y}}, w_{\ddot{z}}, w_{\ddot{\psi}}\}$ within the NMPH optimization problem. Figure 9 shows the resulting training curve of
SAC; DDPG failed to complete the learning process in this case. The effect of increasing
the number of NMPH parameters being tuned can be seen by comparing the SAC training
curves in Figures 8 and 9 in terms of the average episodic reward. In the 12-parameter trial, SAC achieves better training performance than in the 3-parameter case, because the former covers a larger action space and consequently provides better solutions to the NMPH optimization problem.
To test the trajectory planning performance of NMPH with and without the proposed adaptation scheme, four different flight tests were performed within the AirSim simulation environment. In the non-adaptive case, the weighting matrices within NMPH used fixed parameter values, which also served as the initial values in the DRL-based adaptation method. Table 1
provides a comparison between the conventional NMPH design with fixed parameter
values and the adaptive NMPH-SAC design. The comparison is based on the average of
the error metrics discussed in Section 3.1, namely et,RMS , ess , and ec . Each flight trajectory
consists of ten trials, and each trial includes five iterations. The initial velocity and drone
orientation were selected randomly at the beginning of each trial. The first trial uses a
zigzag pattern, which consists of five paths, each with a length of 5.6 m. For the second trial
(square pattern), the side length was 5 m. For the third trial (ascending square pattern), the
elevation gain was set to 1 m. The fourth trial involved a set of position setpoints provided
by a graph-based exploration algorithm (see [27] for the complete details). As shown in
Table 1, the flight performance obtained with the adaptive NMPH is much better than
the one from the non-adaptive (conventional) NMPH. The reason for this is that real-time
adaptation of NMPH parameters works better than using a single set of fixed values when
performing a variety of different flying trajectories.
Figure 9. Training curve of SAC adaptively tuning 12 parameters of the NMPH optimization.
Table 1. Comparison between the conventional NMPH design (fixed values of the NMPH parameters) and the adaptive NMPH-SAC approach, for different flight trials.

Design | Average Error | Zigzag Pattern | Square Pattern | Ascending Square Pattern | Random Setpoints (Exploration)
Fixed NMPH parameters | et,RMS | 0.11353 | 0.09758 | 0.10741 | 0.09646
Fixed NMPH parameters | ess | 0.08659 | 0.07547 | 0.07663 | 0.07339
Fixed NMPH parameters | ec | 0.12033 | 0.06426 | 0.07413 | 0.07739
Adaptive NMPH-SAC | et,RMS | 0.08877 | 0.08495 | 0.09212 | 0.06749
Adaptive NMPH-SAC | ess | 0.01029 | 0.00919 | 0.01046 | 0.01150
Adaptive NMPH-SAC | ec | 0.04400 | 0.04419 | 0.04952 | 0.05874
To show how the values of the NMPH parameters are adjusted online using SAC,
Figures 10 and 11 present the results of a flight through 20 randomly generated setpoints.
Figure 10 depicts the values of the observations $v_o$, $\varphi$, and $|\vec{r}|$ at the beginning of each iteration, and Figure 11 shows the changing values of the NMPH weighting matrix entries.
An animation of this test showing the vehicle’s flight trajectory and corresponding online
calculation outputs is available as a supplementary video file.
[Figure 10. Observation values v_o (m/s), φ (rad), and |r⃗| (m) at the beginning of each of the 20 iterations.]
[Figure 11 panels: values of w_x, w_ẋ, w_ẍ; w_y, w_ẏ, w_ÿ; w_z, w_ż, w_z̈; and w_ψ, w_ψ̇, w_ψ̈ plotted against iteration number.]
Figure 11. Values of NMPH weighting matrix entries being adjusted online by SAC.
The simulation results showed that the SAC-based adaptation outperforms the DDPG-based one in terms of learning speed, ability to handle a larger set of tuning parameters, and overall flight performance.
The pros, cons, and limitations of this study are summarized as follows:
• Pros:
– The proposed design is able to dynamically adjust the parameters of the opti-
mization problem online during flight, which is preferable to tuning them before
flight and evaluating the resulting performance afterwards.
– The DRL model can adapt the gains of the optimization problem in response
to changes in the vehicle, such as new payload configurations or replaced
hardware components.
• Cons:
– DRL algorithms employ a large number of hyperparameters. While SAC is less
sensitive to hyperparameters than DDPG, finding the best combination of these
parameters to achieve fast training is a challenging task.
• Limitations:
– The present study was performed entirely within a simulation environment and
does not include hardware testing results.
Future work will include implementing NMPH-SAC onboard our hardware drone
and testing its performance in a variety of real-world environments, as well as using the
DRL algorithms for disturbance and parameter estimation.
Supplementary Materials: The following supporting information can be downloaded at: https:
//www.mdpi.com/article/10.3390/drones6110323/s1.
Author Contributions: Conceptualization, Y.A.Y. and M.B.; methodology, Y.A.Y.; software, Y.A.Y.;
validation, Y.A.Y.; formal analysis, Y.A.Y.; investigation, Y.A.Y.; resources, M.B.; data curation, Y.A.Y.;
writing—original draft preparation, Y.A.Y.; writing—review and editing, M.B.; visualization, Y.A.Y.;
supervision, M.B.; project administration, M.B.; funding acquisition, M.B. All authors have read and
agreed to the published version of the manuscript.
Funding: This research was funded by NSERC Alliance-AI Advance Program grant number 202102595.
The APC was funded by NSERC Alliance-AI Advance Program.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design
of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or
in the decision to publish the results.
References
1. Carlucho, I.; De Paula, M.; Acosta, G.G. An adaptive deep reinforcement learning approach for MIMO PID control of mobile
robots. ISA Trans. 2020, 102, 280–294. [CrossRef]
2. Åström, K.J. Theory and applications of adaptive control—A survey. Automatica 1983, 19, 471–486. [CrossRef]
3. Åström, K. History of Adaptive Control. In Encyclopedia of Systems and Control; Baillieul, J., Samad, T., Eds.; Springer-Verlag:
London, UK, 2015; pp. 526–533.
4. Bellman, R. Adaptive Control Processes; A Guided Tour; Princeton University Press: Princeton, NJ, USA, 1961.
5. Gregory, P. Proceedings of the Self Adaptive Flight Control Systems Symposium; Technical Report 59-49; Wright Air Development
Centre: Boulder, CO, USA, 1959.
6. Panda, S.K.; Lim, J.; Dash, P.; Lock, K. Gain-scheduled PI speed controller for PMSM drive. In Proceedings of the IECON’97 23rd
International Conference on Industrial Electronics, Control, and Instrumentation (Cat. No. 97CH36066), New Orleans, LA, USA,
14 November 1997; Volume 2, pp. 925–930.
7. Huang, H.P.; Roan, M.L.; Jeng, J.C. On-line adaptive tuning for PID controllers. IEE Proc.-Control. Theory Appl. 2002, 149, 60–67.
[CrossRef]
8. Gao, F.; Tong, H. Differential evolution: An efficient method in optimal PID tuning and on–line tuning. In Proceedings of the
First International Conference on Complex Systems and Applications, Wuxi, China, 10–12 September 2006.
9. Killingsworth, N.J.; Krstic, M. PID tuning using extremum seeking: Online, model-free performance optimization. IEEE Control
Syst. Mag. 2006, 26, 70–79.
10. Gheibi, O.; Weyns, D.; Quin, F. Applying machine learning in self-adaptive systems: A systematic literature review. ACM Trans.
Auton. Adapt. Syst. (TAAS) 2021, 15, 1–37. [CrossRef]
11. Jafari, R.; Dhaouadi, R. Adaptive PID control of a nonlinear servomechanism using recurrent neural networks. In Advances in
Reinforcement Learning; Mellouk, A., Ed.; IntechOpen: London, UK, 2011; pp. 275–296.
12. Dumitrache, I.; Dragoicea, M. Mobile robots adaptive control using neural networks. arXiv 2015, arXiv:1512.03345.
13. Rossomando, F.G.; Soria, C.M. Identification and control of nonlinear dynamics of a mobile robot in discrete time using an
adaptive technique based on neural PID. Neural Comput. Appl. 2015, 26, 1179–1191. [CrossRef]
14. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
15. Hu, B.; Li, J.; Yang, J.; Bai, H.; Li, S.; Sun, Y.; Yang, X. Reinforcement learning approach to design practical adaptive control for a
small-scale intelligent vehicle. Symmetry 2019, 11, 1139. [CrossRef]
16. Watkins, C. Learning from Delayed Rewards. Ph.D. Thesis, King’s College, University of Cambridge, Cambridge, UK, 1989.
17. Boubertakh, H.; Tadjine, M.; Glorennec, P.Y.; Labiod, S. Tuning fuzzy PD and PI controllers using reinforcement learning. ISA
Trans. 2010, 49, 543–551. [CrossRef]
18. Subudhi, B.; Pradhan, S.K. Direct adaptive control of a flexible robot using reinforcement learning. In Proceedings of the 2010
International Conference on Industrial Electronics, Control and Robotics, Rourkela, India, 27–29 December 2010; pp. 129–136.
19. Barto, A.G.; Sutton, R.S.; Anderson, C.W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE
Trans. Syst. Man, Cybern. 1983, 13, 834–846. [CrossRef]
20. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep
reinforcement learning. arXiv 2015, arXiv:1509.02971.
21. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the
International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
22. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with
a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden,
10–15 July 2018; pp. 1861–1870.
23. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep
reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1928–1937.
24. Sun, Q.; Du, C.; Duan, Y.; Ren, H.; Li, H. Design and application of adaptive PID controller based on asynchronous advantage
actor–critic learning method. Wirel. Netw. 2021, 27, 3537–3547. [CrossRef]
25. Al Younes, Y.; Barczyk, M. Nonlinear Model Predictive Horizon for Optimal Trajectory Generation. Robotics 2021, 10, 90.
[CrossRef]
26. Al Younes, Y.; Barczyk, M. A Backstepping Approach to Nonlinear Model Predictive Horizon for Optimal Trajectory Planning.
Robotics 2022, 11, 87. [CrossRef]
27. Younes, Y.A.; Barczyk, M. Optimal Motion Planning in GPS-Denied Environments Using Nonlinear Model Predictive Horizon.
Sensors 2021, 21, 5547. [CrossRef] [PubMed]
28. Dang, T.; Mascarich, F.; Khattak, S.; Papachristos, C.; Alexis, K. Graph-based path planning for autonomous robotic exploration
in subterranean environments. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS), The Venetian Macao, Macau, 4–8 November 2019; pp. 3105–3112.
29. Oleynikova, H.; Taylor, Z.; Fehr, M.; Siegwart, R.; Nieto, J. Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS),
Vancouver, BC, Canada, 24–28 September 2017; pp. 1366–1373.
30. Liu, R.; Zou, J. The effects of memory replay in reinforcement learning. In Proceedings of the 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 2–5 October 2018; pp. 478–485.
31. Achiam, J. Spinning Up in Deep Reinforcement Learning. 2018. Available online: https://github.com/openai/spinningup
(accessed on 2 October 2022).
32. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
33. Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In
Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1329–1338.
34. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic
algorithms and applications. arXiv 2019, arXiv:1812.05905v2.
35. Quigley, M.; Conley, K.; Gerkey, B.; Faust, J.; Foote, T.; Leibs, J.; Wheeler, R.; Ng, A.Y. ROS: An open-source Robot Operating
System. In Proceedings of the ICRA Workshop on Open Source Software in Robotics, Kobe, Japan, 12–17 May 2009.
36. Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Field
and Service Robotics; Hutter, M., Siegwart, R., Eds.; Springer: Cham, Switzerland, 2018; pp. 621–635.
37. Houska, B.; Ferreau, H.; Diehl, M. ACADO Toolkit – An Open Source Framework for Automatic Control and Dynamic
Optimization. Optim. Control. Appl. Methods 2011, 32, 298–312. [CrossRef]
38. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://tensorflow.org (accessed
on 2 October 2022).
39. Chollet, F. Keras. 2015. Available online: https://keras.io (accessed on 2 October 2022).
40. Lai, C.; Han, J.; Dong, H. TensorLayer 3.0: A Deep Learning Library Compatible with Multiple Backends. In Proceedings of the
2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; pp. 1–3.