Reinforcement Learning-Based Mobile Robot Navigation

NİHAL ALTUNTAŞ, ERKAN İMAL, NAHİT EMANET, CEYDA NUR ÖZTÜRK

Turkish Journal of Electrical Engineering & Computer Sciences (2016) 24: 1747–1767
http://journals.tubitak.gov.tr/elektrik/
© TÜBİTAK
Research Article, doi: 10.3906/elk-1311-129
Abstract: In recent decades, reinforcement learning (RL) has been widely used in research fields ranging from psychology to computer science. The infeasibility of sampling all possibilities in continuous-state problems and the absence of an explicit teacher make RL algorithms preferable to supervised learning in the machine learning area, as the optimal control problem has become a popular subject of research. In this study, a system is proposed to solve mobile robot navigation using the two most popular RL algorithms, Sarsa(λ) and Q(λ). The proposed system, developed in MATLAB, uses state and action sets defined in a novel way to increase performance. The system can guide the mobile robot to a desired goal while avoiding obstacles, with a high success rate in both simulated and real environments. Additionally, it is possible to observe the effects of the initial parameters used by the RL methods, e.g., λ, on learning, and to compare the performance of the Sarsa(λ) and Q(λ) algorithms.
Key words: Reinforcement learning, temporal difference, eligibility traces, Sarsa, Q-learning, mobile robot navigation,
obstacle avoidance
1. Introduction
With the advancement of technology, people started to prefer machines over human labor in order to increase productivity. In the beginning, machines were used only to automate work that did not require intelligence. However, the invention of computers urged people to consider machine learning (ML). Today, artificial intelligence (AI) continues to be studied in order to provide machines with learning abilities [1]. It is important to understand the nature of learning in order to achieve the goal of intelligent machines. Although a great number of supervised and unsupervised learning methods have been developed in the ML field, the fundamental idea behind reinforcement learning (RL) is learning from interaction. In order to obtain this interaction ability, various sensors are mounted on machines, including infrared, sonar, inductive, and diffuse sensors.
The term ‘reinforcement’ was first used in psychology and is built on the idea of learning by trial and error
that appeared in the 1920s. Afterwards, this idea became popular in computer science. In 1957, [2] introduced
dynamic programming, which led to optimal control and then to Markov decision processes (MDPs). Although
dynamic programming is a solution for discrete stochastic MDPs, it requires expensive computations that grow
exponentially as the number of states increases. Therefore, temporal difference learning brought a novel aspect to RL, and Q-learning, introduced by [3] in 1989, was an important breakthrough in the AI field. Finally, the Sarsa
algorithm was introduced by [4] in 1994 as ‘modified Q-learning’. There were other studies to enhance RL
techniques for faster and more accurate learning results [1,5–7].
RL techniques have started to be preferred as learning algorithms since they are more feasible and
applicable than other techniques that require prior knowledge. The RL approach has been used for many
purposes such as feature selection of classification algorithms [8], optimal path finding [5,9,10], routing for
networks [11], and coordination of communication in multiagent systems [12,13]. A number of studies using RL
to provide effective learning for different control problems are explained in the following paragraphs.
In 2009, a new RL algorithm, Ex<α>(λ), was developed by [14] to deal with the problems of using continuous actions. This algorithm is based on the k-nearest neighbors approach dealing with discrete actions and enhances the kNN-TD(λ) algorithm.
Due to the complexity of the navigation problem, RL is a widely preferred method for controlling mobile
robots. For example, [15] used prior knowledge within Q-learning in order to reduce the memory requirement
of the look-up table and increase the performance of the learning process. That study integrated fuzzy logic
into RL for this purpose, and the Q-learning algorithm was applied to provide coordination among behaviors in
the fuzzy set. Furthermore, an algorithm called ‘piecewise continuous nearest sequence memory’, which extends
the instance-based algorithm for discrete, partially observable state spaces, nearest sequence memory [16], was presented in [17]. Another study using the neural network approach in RL for obstacle avoidance of
mobile robots was performed in a simulated platform in [18]. The main aim of this study was simply to avoid
obstacles while roaming; there was no specific goal to achieve. The reason for using neural networks instead of
a look-up table was to minimize the required space and to maximize learning performance.
The paper, which is an extension of work done in [19], comprises five sections. Section 2 gives information
about RL for navigation problems and Section 3 explains the implementation details of the proposed system.
The experimental results are given in Section 4 and the paper is concluded with Section 5.
In passive learning, the aim is only to learn how good the fixed policy is. The only challenge is that the agent knows neither the transition
model nor the reward function since the environment is unknown. The agent executes a set of trials using the
fixed policy to calculate the estimated value of each state. Distinct from passive learning, active learning also
aims to learn how to act in the environment. Since the agent does not know either the environment or the
policy, it has to learn the transition model in order to find an optimal policy. However, in this case, there is
more than one choice of actions for each state, because the policy is not fixed [21]. Here, the balance between
exploration and exploitation is important for deciding which action to choose.
Certain solution classes, developed to find optimal Q-values as quickly and accurately as possible, are
explained in the following subsections.
For successful learning, all possible action values for each state must be estimated, which requires occasionally selecting actions other than the currently best-looking one. This is the exploitation/exploration trade-off problem. There are two approaches to solving this problem: on-policy and off-policy.
On-policy methods evaluate and improve the same policy that is used to make decisions during experience. To ensure that all actions are selected in order to learn better choices, ε-greedy policies are used, as mentioned above. On the other hand, off-policy methods use two different policies: a behavior policy and an estimation policy. The behavior
policy is used to make decisions and needs to ensure that all actions have a probability to be selected to explore
all possibilities. Estimation policy is the evaluated and improved policy, and since it does not affect decisions,
this policy can be totally greedy.
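As a concrete illustration of the two policy roles described above, the following MATLAB sketch shows ε-greedy action selection over a tabular Q-value matrix; the greedy choice in the else branch is also what a purely greedy estimation policy would return. The function name and matrix layout (states in rows, actions in columns) are illustrative assumptions, not the actual code of the system.

% Minimal sketch: epsilon-greedy action selection over a tabular Q matrix.
% Q is an |S|-by-|A| matrix, s is the current state index, epsilon is in [0, 1].
function a = epsilonGreedy(Q, s, epsilon)
    nActions = size(Q, 2);
    if rand() < epsilon
        a = randi(nActions);       % explore: uniformly random action
    else
        [~, a] = max(Q(s, :));     % exploit: greedy action for state s
    end
end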
When on-line and off-line updating methods are compared, empirical results show that on-line methods converge to the optimal policy faster than off-line methods [20]. Thus, the proposed system uses on-line updating.
In order to implement the TD(λ) method, an additional variable et(s, a) is defined as the eligibility trace for all state–action pairs at time t. The trace-decay parameter λ in the TD(λ) algorithm refers to the use of eligibility traces for updating the values of recently visited states. At each step, the eligibility traces of all states are decayed by γλ, where γ is a discount factor, so that the traces mark the recently visited state–action pairs. Two kinds of traces, accumulating traces and replacing traces, are used to record the traces of state–action pairs. Accumulating traces increase the eligibility trace et(s, a) by 1 for each visit of a state–action pair, while replacing traces reset the eligibility trace to 1. Replacing traces can also reset the traces of the other actions available in the visited state to 0. Generally, replacing traces perform better than accumulating traces [20]. Eq. (1) defines et(s, a), the replacing traces used by the proposed system:
et(s, a) =  1,                 if s = st and a = at
            0,                 if s = st and a ≠ at          for all s, a    (1)
            γλ et−1(s, a),     if s ≠ st
where et(s, a) is the eligibility trace for all state–action pairs at time t, λ is the trace-decay parameter referring to the use of eligibility traces in order to update the values of the recently visited states, and γ is the discount factor describing the effect of future rewards on the value of the current state. Both λ and γ range between 0 and 1.
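In matrix form, the replacing-trace rule of Eq. (1) amounts to three vectorized operations per step. The sketch below assumes the traces are kept in a matrix E of the same size as the Q-value matrix; the variable names are illustrative.

% Replacing-trace update of Eq. (1) for the pair (s, a) visited at time t.
E = gamma * lambda * E;    % decay every trace by gamma*lambda (case s ~= s_t)
E(s, :) = 0;               % zero the traces of the other actions in the visited state
E(s, a) = 1;               % set the trace of the visited state-action pair to 1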
The one-step TD error calculated by Eq. (2) is used to update the values of all recently visited state–action
pairs by their nonzero traces using Eq. (3):
δt = Ra(s, s′) + γ Qt(s′, b) − Qt(s, a)    (2)

∆Qt(s, a) = α δt et(s, a)    for all s, a    (3)
where α is the step-size parameter, also known as the learning rate, ranging between 0 and 1; Ra(s, s′) is the reward for the transition from state s to state s′ when applying action a; and Qt(s, a) is the expected value of selecting action a in state s under policy π at time t. The value of Qt(s′, b) depends on the preferred algorithm, which can be Sarsa(λ) or Q(λ), as explained in the following subsections.
2.4.1. Sarsa(λ)
The eligibility trace version of Sarsa is known as Sarsa(λ), and Eq. (4) defines the update step of the Sarsa(λ) algorithm:

Qt+1(s, a) = Qt(s, a) + ∆Qt(s, a)    for all s, a    (4)

where ∆Qt(s, a) is defined by Eq. (3), and Qt(s′, b) = Qt(s′, a′) in Eq. (2), since Sarsa(λ) is an on-policy method. Figure 1 shows the pseudocode for the complete Sarsa(λ) algorithm using the ε-greedy policy, on-line updating, and replacing traces.
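For reference, a compact MATLAB sketch of one Sarsa(λ) episode with ε-greedy selection, on-line updating, and replacing traces is given below. The functions step and isTerminal stand in for the simulator/robot interface, and Q, alpha, gamma, lambda, epsilon, and startState are assumed to be initialized beforehand; these names are illustrative and not taken from the actual implementation.

% One Sarsa(lambda) episode implementing Eqs. (1)-(4) with on-line updates.
s = startState;
a = epsilonGreedy(Q, s, epsilon);
E = zeros(size(Q));                           % eligibility traces e_t(s, a)
while ~isTerminal(s)
    [r, s2] = step(s, a);                     % execute action, observe reward and next state
    a2 = epsilonGreedy(Q, s2, epsilon);       % on-policy: b = a' in Eq. (2)
    delta = r + gamma * Q(s2, a2) - Q(s, a);  % one-step TD error, Eq. (2)
    E = gamma * lambda * E;                   % replacing traces, Eq. (1)
    E(s, :) = 0;  E(s, a) = 1;
    Q = Q + alpha * delta * E;                % Eqs. (3) and (4), applied for all s, a
    s = s2;  a = a2;
end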
2.4.2. Q(λ)
The proposed system uses Watkins's Q(λ), which is a Q-learning algorithm into which eligibility traces are integrated [3]. Since Q(λ) is an off-policy algorithm, only the traces of greedy actions are used. Thus, the eligibility traces are set to 0 when a nongreedy action is selected for exploration. Q(λ) uses Eq. (4) to update Q-values with Qt(s′, b) = maxa Qt(s′, a) in Eq. (2), because it uses the greedy method in the policy improvement process. Figure 2 shows the pseudocode for the complete Q(λ) algorithm using an ε-greedy behavior policy, a greedy estimation policy, on-line updating, and replacing traces.
Figure 1. Pseudocode for the Sarsa(λ) algorithm used by the proposed system. Figure 2. Pseudocode for the Q(λ) algorithm used by the proposed system.
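The inner loop of Watkins's Q(λ) differs from the Sarsa(λ) sketch above only in the TD target and in the trace cut after exploratory actions, as sketched below with the same illustrative variable names.

% Watkins's Q(lambda): greedy target in Eq. (2) and trace reset on nongreedy actions.
a2 = epsilonGreedy(Q, s2, epsilon);           % behavior policy (may explore)
[~, aStar] = max(Q(s2, :));                   % estimation policy (greedy)
delta = r + gamma * Q(s2, aStar) - Q(s, a);   % Eq. (2) with Q_t(s', b) = max_a Q_t(s', a)
E = gamma * lambda * E;  E(s, :) = 0;  E(s, a) = 1;
Q = Q + alpha * delta * E;
if a2 ~= aStar
    E = zeros(size(E));                       % exploratory action: set all traces to 0
end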
3. Implementation
First, it is necessary to design the system properly in order to achieve the main purpose of autonomous
navigation. The mobile agent is supposed to reach the goal by following an optimal path and avoiding obstacles
in the environment after a set of trials. The proposed system uses the on-line updating technique, meaning that
the robot improves its guess about the environment during the trials and does not wait until the termination
of the trials to update the guess. As the robot performs more trials and gains more information about its
environment, it exhibits better performance for navigation purposes.
Although there are a few applications using RL algorithms for mobile robot navigation, most of them are
implemented only on simulated platforms instead of real mobile robots. This system implements Sarsa(λ) and
Q(λ) on a real platform by using a state set, an action set, and a reward function described in a novel way.
In the implementation of the system, the Q-values used by the RL algorithms are in tabular form and are represented as matrices, which requires many matrix calculations during execution. Since these calculations can be expressed concisely with the help of MATLAB, the system was implemented as a MATLAB application (http://www.mathworks.com). Experimentation of the system was performed on a Robotino®, an educational robot by the Festo Company (http://www.festo-didactic.com). The Robotino MATLAB API was integrated into the system in order to communicate with the Robotino.
Although the Robotino has several other sensors, only three are used in the system: the bumper, the odometer, and the camera, which has a resolution of 320 × 240 pixels. The bumper is used to detect collisions with obstacles. The odometer is used to obtain the location of the robot with respect to its initial location, and the obstacle locations are given to the system as prior knowledge.
3.2. Methodology
The solution classes explained in Section 2 provide several alternative approaches for the RL algorithm. The
selection of an approach among its alternatives, i.e. model-based or model-free [29], affects the performance. In
this study, two model-free methods, Sarsa(λ) and Q( λ) , were implemented for the mobile robot to navigate in
an unknown environment with no prior knowledge.
Indeed, the main goal was to find optimal Q-values during navigation. Thus, how Q-values were
represented in the system was important. Although it is possible to use certain learning algorithms for
representation [30,31], such as neural networks, a matrix representation was used in this study. Values of
related state–action pairs were updated in the matrix for each step.
Another important issue for good learning performance when using RL is how to define the reward function, the state set, and the action set. In this study, the reward function could return 4 different values, while the state set and the action set had 81 and 3 members, respectively. Details are explained in the next subsection.
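Purely as a hypothetical illustration of how four sensed variables (the distances and orientations to the target and to the nearest obstacle, cf. Figure 7), each discretized into 3 ranges, could yield 3^4 = 81 states together with a 4-valued reward, consider the following MATLAB sketch; every threshold and reward value below is invented and does not reflect the actual definitions. With such an encoding, the Q-value table would be an 81-by-3 matrix.

% Hypothetical sketch: 4 variables x 3 bins each = 81 discrete states.
function s = encodeState(dTar, thTar, dObs, thObs)
    binOf = @(x, edges) sum(x > edges) + 1;        % bin index 1..3 for two edges
    i1 = binOf(dTar,  [0.5 1.5]);                  % distance to target (m), invented edges
    i2 = binOf(thTar, [-pi/6 pi/6]);               % orientation to target (rad)
    i3 = binOf(dObs,  [0.3 0.8]);                  % distance to nearest obstacle (m)
    i4 = binOf(thObs, [-pi/6 pi/6]);               % orientation to nearest obstacle (rad)
    s = sub2ind([3 3 3 3], i1, i2, i3, i4);        % single state index in 1..81
end

% Hypothetical 4-valued reward: goal, collision, progress, no progress.
function r = rewardOf(reachedGoal, collided, dPrev, dNow)
    if reachedGoal
        r = 100;      % target reached
    elseif collided
        r = -50;      % bumper collision
    elseif dNow < dPrev
        r = 1;        % moved closer to the target
    else
        r = -1;       % moved away or made no progress
    end
end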
The navigation path part plots the target as a green circle, the obstacles in the environment as red diamonds, the robot as a gray-filled circle, and its path during the episode as a black line. This part can be
made inactive to increase the learning speed by avoiding unnecessary graphical processes. After termination
of the episodes, the robot path is plotted to give an indication to the user about the learning process. The
reward function part plots the cumulative reward gathered from the environment after each episode. When the
agent learns the optimal behavior well enough, the reward function plotted in this part is expected to become
stable. The initial parameters part gives different options to the user for controlling the system, e.g., selecting
the performed algorithm, experiment phase, and experimental platform. The image frames part displays the
images captured by the camera of the robot at a rate of 5 Hz. This part is disabled for simulated platforms
and can be made inactive to increase the speed of the learning process in real experiments. The info box part
informs the user after each episode. This part also gives information about what the parameters are when the
user clicks the ‘help’ button at the bottom right corner. Finally, the button group includes four buttons. The
‘help’ button gives information about the initial parameters and their features. The ‘reset’ button resets all
parameters to their default values. The ‘clear’ button clears the four parts informing the user: the navigation
path, reward function, info box, and image frames parts. Since the system is not capable of detecting obstacles,
it is necessary to configure the environment before the learning process. Hence, a ‘configure’ button is added to the interface of the system.
Figure 6 shows the interface for configuration of the environment. The boundary of the environment can
be changed, and there are three different obstacle options. ‘No obstacle’ is the default option, and the system
executes in an environment without obstacles. When the ‘random obstacles’ option is selected, the system
generates N random obstacles for the environment, where N can be set by the user. The last option is ‘set
obstacles’, which enables the user to set obstacles manually.
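A possible implementation of the ‘random obstacles’ option is sketched below; the boundary size, minimum spacing, and function name are assumptions for illustration rather than the system's actual code.

% Sketch: place N random obstacles inside a W-by-H environment with a minimum gap.
function obstacles = randomObstacles(N, W, H, minGap)
    obstacles = zeros(N, 2);
    k = 0;
    while k < N
        p = [W * rand(), H * rand()];                                % candidate position
        d = sqrt(sum(bsxfun(@minus, obstacles(1:k, :), p).^2, 2));   % distances to placed obstacles
        if all(d > minGap)                                           % also true when k == 0 (d is empty)
            k = k + 1;
            obstacles(k, :) = p;
        end
    end
end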
3.3.3. Navigator
The navigator module is responsible for performing the actions selected by the reinforcement learning control
module. In order to connect to the robot and send the action orders, this module uses the functions of Robotino
MATLAB API, which are provided by the Robotino module.
3.3.6. Simulator
The simulator module combines the navigator, Robotino, and simultaneous localization and mapping (SLAM)
modules, which are used for the real platform. When the user selects the simulated platform, the simulator
module performs the actions selected by the RL control module and returns the resulting position of the robot
during the learning process. Since the system is designed to be stochastic, the simulator module is developed
to have 10% Gaussian noise, similar to the real platform.
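The 10% Gaussian noise can be sketched as a relative perturbation of each commanded displacement before the virtual robot's pose is updated; the function below describes an assumed interface for illustration, not the actual simulator module.

% Sketch: apply ~10% Gaussian noise to a commanded motion (dx forward, dth rotation).
function [x2, y2, th2] = noisyStep(x, y, th, dx, dth)
    noiseLevel = 0.10;                           % 10% relative Gaussian noise
    dxN  = dx  * (1 + noiseLevel * randn());     % perturbed forward displacement
    dthN = dth * (1 + noiseLevel * randn());     % perturbed rotation
    th2 = th + dthN;
    x2  = x + dxN * cos(th2);
    y2  = y + dxN * sin(th2);
end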
4. Experimental results
The proposed system was designed to be executed in both simulated and real platforms. The RL-based control
part of all experiments was performed using MATLAB R2007b on a PC with 2.00 GHz Duo CPU, 3 GB RAM,
and Windows XP. In addition to the control part, the same computer was used to implement the virtual robot for
the simulated platform and to communicate with the Robotino, as mentioned in Section 3.1. Experiments were
performed in two phases: the learning and testing phases. Preferably, the learning phase is carried out only on the simulated platform, since the real robot can be damaged by collisions with obstacles at the beginning of the phase. It is also hard to relocate the robot between consecutive episodes, although several episodes should be executed for successful learning.
Figure 7. The variables of the state set module are defined by robot distance and orientation, according to (a) the
target position represented as dtar and θtar , and (b) the nearest obstacle on the way, represented as dobs and θobs .
The system is tested in environments with and without obstacles. As expected, the learning performance
is improved in environments without obstacles, since there is no risk of obstacle collisions, which makes the
environment much simpler to learn.
For the simulated platform, a virtual robot is implemented to perform the selected actions at each step of
the episodes. In order to perform a realistic simulation, some errors are added to those actions during execution.
The main purpose of the simulation is to obtain an optimal Q-value matrix to be used with the real robot.
When there are no obstacles in the environment, the performance of both the Sarsa(λ) and Q( λ)
algorithms is similar, and the cumulative reward stabilizes at around 20 episodes with default initial parameter
values. However, adding obstacles to the environment makes the learning process more complicated. Thus,
the performance of both algorithms decreases. Tables 1 and 2 list the average episode numbers at which the
cumulative rewards become stable for the Sarsa(λ) and Q( λ) methods, respectively. Figure 8 illustrates the
environment containing 10 obstacles that was used for the experiments in Tables 1 and 2.
Figure 9 illustrates a chart diagram comparing the results listed in Tables 1 and 2. The x-axis shows
the number of experiments whose initial parameters are given in Tables 1 and 2. The y -axis maps the average
number of episodes necessary for the convergence of Q-values. Variation of initial parameters affects both
Sarsa(λ) and Q( λ) in the same way; thus, both methods perform better or worse for the same parameter
values. In addition, Sarsa( λ) mostly converges faster in the first experiments, whereas Q(λ) generally has
better performance for the rest of the experiments, as shown in Figure 9.
Figures 10–13 plot line diagrams to illustrate the effects of value changes on the initial parameters α ,
γ , ε, and λ , respectively. The x-axis shows other parameter values during the experiments, and the y -axis
maps the average number of episodes necessary for convergence of Q-values in Figures 10–13. The effects of
these four parameters are illustrated in Tables 1 and 2 for both the Sarsa(λ) and Q( λ) methods, which are
represented by 2 different shades of blue and purple, respectively. Figures 10 and 11 show that the values of the
α and γ parameters influence each other’s effect on the performance, e.g., when α = 0.1, the performance of
both methods is better for the first 4 experiments and worse for the last 4 experiments in Figure 10. The only
difference between the first and last 4 experiments is the value of γ . Figure 12 shows that the value of ε highly
influences the learning performance. When ε gets smaller, the policy used by the system becomes greedier,
which renders the system unable to explore better ways to reach the target. Finally, Figure 13 shows that λ
has different effects for different parameter combinations. Therefore, it is important to tune parameters to get
better performance results.
Besides the initial parameters, the obstacle locations can affect the learning performance. The more
obstacles that exist between the start and target points, the more episodes are necessary to find the optimal
path. Nevertheless, defining the state set by dynamic variables as explained in Section 3.3.7 enables the system
to minimize the effect of obstacle locations. The learning performance of the Sarsa(λ) and Q(λ) methods
for different initial Q-value matrices are compared using Table 3, which lists the average number of episodes
necessary for reward stabilization with parameters α = 0.1, γ = 0.7, ε =0.15, and λ =0.8. The resulting
experiment was performed in the environment illustrated in Figure 8. The first column of Table 3 lists the results for an initial Q-value matrix consisting of zeros. The second column lists the results for an initial Q-value matrix learned in a different environment with 10 random obstacles. Thus, Table 3 shows that the system can be executed in an unknown environment, no matter where the obstacles are, once the learned Q-values are good enough.
Figure 9. Comparison of Sarsa( λ) and Q( λ) methods depending on Tables 1 and 2. The x -axis shows the number of
experiments whose initial parameters are given in Tables 1 and 2. The y -axis maps the number of episodes necessary
for the convergence of Q-values.
Table 3. The learning performance of Sarsa( λ) and Q( λ) algorithms for different initial Q-value matrices. The results
were computed for the environment shown in Figure 8 with initial parameters α = 0.1, γ = 0.7, ε = 0.15, λ = 0.8.
Unfortunately, it is possible to complete an experiment without finding an optimal path. The robot may end up in a vicious circle at the end of certain learning processes, depending on the initial parameter values and the number and locations of the obstacles.
Finally, since the virtual robot cannot capture images of the simulated platform, information about the
location of the obstacles in the environment is gathered from the simulator module instead of the sensors.
The EA09 I/O board of the Robotino updates the PID control loops of its drive motors, and the Robotino MATLAB API provides functionality to tune the PID parameters of these motors.
In this study, the built-in controller was used for the Robotino to reach the desired speed. According to the user manual, the PID parameters take values from 0 to 255; these values are scaled by the microcontroller firmware to match the PID controller implementation. There are methods for tuning PID parameters to their optimal closed-loop values; in this system, however, the parameters were set to KP = 60, KI = 1.5, and KD = 0 after a set of trials and observations of motor behavior. Thus, in effect, a PI controller was used.
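For completeness, a discrete-time version of the control law with the reported gains (KP = 60, KI = 1.5, KD = 0, i.e. effectively a PI controller) is sketched below; the sample time, reference speed, and measured-speed vector are placeholder assumptions, since the actual loop runs inside the Robotino firmware.

% Sketch: discrete PI speed control with the gains used in this study.
KP = 60; KI = 1.5; KD = 0;        % gains from the study (KD = 0 -> PI control)
Ts = 0.01;                        % assumed sample time (s)
wRef  = 100;                      % placeholder reference wheel speed
wMeas = zeros(1, 200);            % placeholder measured speeds (would come from the encoder)
u = zeros(1, 200);                % control command history
integralE = 0; prevE = 0;
for k = 1:numel(wMeas)
    e = wRef - wMeas(k);                              % speed error
    integralE = integralE + e * Ts;                   % accumulate the integral term
    derivE = (e - prevE) / Ts;                        % derivative term (unused since KD = 0)
    u(k) = KP * e + KI * integralE + KD * derivE;     % PID law, reducing to PI here
    prevE = e;
end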
Additionally, the real movement amounts of the robot were different from what we initially expected
due to different factors such as communication delays and the position of the three wheels, which were placed
at 120 ◦ relative to each other. Besides, it is not possible to estimate how much the robot moves during the
acceleration process, so it is necessary to obtain information about robot position using the odometer on the
robot. The odometer calculates the position of the robot by measuring wheel rotations. However, using an
odometer to learn the position of the robot raises certain other questions, namely how accurate the odometer
is or whether the velocity of the robot affects odometer accuracy or not.
Empirical measurements of the Robotino were therefore taken with different linear and angular velocities for varied movement amounts. Since the action set of the proposed system consists of motion along the x-axis and rotations by θ and −θ, motion along the y-axis was not measured. For all measurements, the initial position and velocities of the Robotino were set as x = 0, y = 0, θ = 0, v = 0, and w = 0, and, as mentioned above, the PI parameters of the robot motors were set to KP = 60 and KI = 1.5. Finally, the robot kept capturing 5 image frames per second during the experiments, so that image capture, a source of delay that is not compensated by the built-in controller, was present in all measurements.
The movement amount and the linear and angular velocities can be changed by the user. Therefore, it is important to be sure that the odometer uncertainty is acceptable for all velocities and movement amounts. It was decided that the error should be at most 0.02 mm per 1 mm of movement and 0.4° per 1° of turn. After the measurements, the average odometry errors of the Robotino were calculated as 0.011 mm per 1 mm and 0.005° per 1° in linear and rotational motions, respectively. Thus, the odometry accuracy was accepted as sufficient for use in the system.
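The acceptance check described above can be expressed as an average relative-error computation over the measured movements; the numeric values in this sketch are placeholders, not the actual measurements.

% Sketch: average relative odometry error for linear moves and turns.
cmdMm  = [200 400 600 800];        % commanded linear moves (placeholder values, mm)
odoMm  = [198 396 595 792];        % corresponding odometer readings (placeholder, mm)
cmdDeg = [30 60 90 180];           % commanded turns (placeholder, degrees)
odoDeg = [29.9 59.8 89.6 179.2];   % corresponding odometer readings (placeholder, degrees)
linErr = mean(abs(odoMm  - cmdMm)  ./ cmdMm);    % error per 1 mm of movement
rotErr = mean(abs(odoDeg - cmdDeg) ./ cmdDeg);   % error per 1 degree of turn
ok = (linErr <= 0.02) && (rotErr <= 0.4);        % thresholds quoted in the text
fprintf('linear: %.3f mm/mm, rotational: %.3f deg/deg, acceptable: %d\n', linErr, rotErr, ok);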
After the learning phase of the system is executed, the user can test how well the system learned what
to do in both simulated and real platforms. In the test phase, the learned Q-values are used to decide actions.
If the Q-value matrix sufficiently converges to optimal, the robot is expected to reach the goal by avoiding
obstacles. Figure 14 shows an example execution of the system on the Robotino. The path of the robot is
illustrated as a black line.
Figure 5 illustrates the resulting system interface at the termination of the system execution demonstrated
in Figure 14. As can be seen in the figures, after a successful learning phase, the system can navigate the robot
in the test phase regardless of whether it executes on the simulated or real platform. Even if the environment used in the test phase is different from the one in the learning phase, the robot can reach the target successfully, since the states were defined by dynamic variables.
Figure 14. The Robotino can reach the target by avoiding obstacles in the test phase.
5. Conclusion
How the state set is defined determines the performance of the implemented RL algorithms. The proposed system defines a state set using
dynamic variables so that after the system learns how to behave in an environment, it can be successful in
different environments where the target and obstacles are located in different points. Another decision criterion
is how to describe the reward function, which is the response of the environment to the actions of the intelligent
agent.
Although the implemented system gives promising results, it can be enhanced to increase its learning
speed and performance. For instance, the Q-values in the algorithms used by the system are represented in tabular form, which requires a large amount of memory and complex mathematical calculations. Instead of the tabular form, it is possible to integrate a supervised learning algorithm to represent the Q-values in order to reduce memory requirements and provide faster convergence to the optimal policy.
Acknowledgment
This study was supported by the Scientific Research Fund of Fatih University (Project No: P50061202 B).
References
[1] Sebag M. A tour of machine learning: an AI perspective. AI Commun 2014; 27: 11-23.
[2] Bellman RE. A Markov decision process. J Math Mech 1957; 6: 679-684.
[3] Watkins CJCH. Learning from delayed rewards. PhD, Cambridge University, Cambridge, UK, 1989.
[4] Rummery GA, Niranjan M. On-line Q-learning Using Connectionist Systems. Cambridge, UK: Cambridge University
Engineering Department, 1994.
[5] Muhammad J, Bucak IO. An improved Q-learning algorithm for an autonomous mobile robot navigation problem.
In: IEEE 2013 International Conference on Technological Advances in Electrical, Electronics and Computer
Engineering; 9–11 May 2013; Konya, Turkey. New York, NY, USA: IEEE. pp. 239–243.
[6] Yun SC, Parasuraman S, Ganapathy V. Mobile robot navigation: neural Q-learning. Adv Comput Inf 2013; 178:
259-268.
[7] Hwang KS, Lo CY. Policy improvement by a model-free dyna architecture. IEEE T Neural Networ 2013; 24:
776-788.
[8] Fard SMH, Hamzeh A, Hashemi S. Using reinforcement learning to find an optimal set of features. Comput Math
Appl 2013; 66: 1892-1904.
[9] Zuo L, Xu X, Liu CM, Huang ZH. A hierarchical reinforcement learning approach for optimal path tracking of
wheeled mobile robots. Neural Comput Appl 2013; 23: 1873-1883.
[10] Konar A, Chakraborty IG, Singh SJ, Jain LC, Nagar AK. A deterministic improved Q-learning for path planning
of a mobile robot. IEEE T Syst Man Cy A 2013; 43: 1141-1153.
[11] Rolla VG, Curado M. A reinforcement learning-based routing for delay tolerant networks. Eng Appl Artif Intel
2013; 26: 2243-2250.
[12] Geramifard A, Redding J, How JP. Intelligent cooperative control architecture: a framework for performance
improvement using safe learning. J Intell Robot Syst 2013; 72: 83-103.
[13] Maravall D, de Lope J, Domínguez R. Coordination of communication in robot teams by reinforcement learning.
Robot Auton Syst 2013; 61: 661-666.
[14] Martín JA, de Lope J. Ex<α>: an effective algorithm for continuous actions reinforcement learning problems.
In: IEEE 2009 35th Annual Conference on Industrial Electronics; 3–5 November 2009; Porto, Portugal. New York,
NY, USA: IEEE. pp. 2063-2068.
[15] Khriji L, Touati F, Benhmed K, Al-Yahmedi A. Mobile robot navigation based on Q-learning technique. Int J Adv
Robot Syst 2011; 8: 45-51.
[16] McCallum RA. Instance-based state identification for reinforcement learning. Adv Neural In 1995; 8: 377-384.
[17] Zhumatiy V, Gomez F, Hutter M, Schmidhuber J. Metric state space reinforcement learning for a vision-capable
mobile robot. In: Arai T, Pfeifer R, Balch T, Yokoi H, editors. Intelligent Autonomous Systems 9. Amsterdam, the
Netherlands: IOS Press, 2006. pp. 272-282.
[18] Maček K, Petrović I, Perić N. A reinforcement learning approach to obstacle avoidance of mobile robots. In: IEEE
2002 7th International Workshop on Advanced Motion Control; 3–5 June 2002; Maribor, Slovenia. New York, NY,
USA: IEEE. pp. 462-466.
[19] Altuntaş N, İmal E, Emanet N, Öztürk CN. Reinforcement learning based mobile robot navigation. In: ISCSE 2013
3rd International Symposium on Computing in Science and Engineering; 24–25 October 2013; Kuşadası, Turkey.
İzmir, Turkey: Gediz University Publications. pp. 285-289.
[20] Sutton RS, Barto AG. Reinforcement Learning: an Introduction. Cambridge, MA, USA: MIT Press, 2005.
[21] Russell SJ, Norvig P. Artificial Intelligence: A Modern Approach. 2nd ed. Upper Saddle River, NJ, USA: Prentice
Hall, 2003.
[22] Xu X, Lian CQ, Zuo L, He HB. Kernel-based approximate dynamic programming for real-time online learning
control: an experimental study. IEEE T Contr Syst T 2014; 22: 146-156.
[23] Ni Z, He HB, Wen JY, Xu X. Goal representation heuristic dynamic programming on maze navigation. IEEE T
Neural Networ 2013; 24: 2038-2050.
[24] Hwang KS, Jiang WC, Chen YJ. Adaptive model learning based on dyna-Q learning. Cybernet Syst 2013; 44:
641-662.
[25] Bellman RE. Dynamic Programming. Princeton, NJ, USA: Princeton University Press, 1957.
[26] Wang YH, Li THS, Lin CJ. Backward Q-learning: the combination of Sarsa algorithm and Q-learning. Eng Appl
Artif Intel 2013; 26: 2184-2193.
[27] Chen XG, Gao Y, Fan SG. Temporal difference learning with piecewise linear basis. Chinese J Electron 2014; 23:
49-54.
[28] Chen XG, Gao Y, Wang RL. Online selective kernel-based temporal difference learning. IEEE T Neural Networ
2013; 24: 1944-1956.
[29] Kober J, Bagnell JA, Peters J. Reinforcement learning in robotics: a survey. Int J Robot Res 2013; 32: 1238-1274.
[30] Lopez-Guede JM, Fernandez-Gauna B, Graña M. State-action value function modeled by ELM in reinforcement
learning for hose control problems. Int J Uncertain Fuzz 2013; 21: 99-116.
[31] Miljković Z, Mitić M, Lazarević M, Babić B. Neural network reinforcement learning for visual control of robot
manipulators. Expert Syst Appl 2013; 40: 1721-1736.