Continuous Deep Q-Learning with Model-Based Acceleration
Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, Sergey Levine
Reinforcement learning with continuous actions is typically handled with policy search methods (Peters & Schaal, 2006; Peters et al., 2010). Integrating value function estimation into these techniques results in actor-critic algorithms (Hafner & Riedmiller, 2011; Lillicrap et al., 2016; Schulman et al., 2016), which combine the benefits of policy search and value function estimation, but at the cost of training two separate function approximators. Our proposed Q-learning algorithm for continuous domains, which we call normalized advantage functions (NAF), avoids the need for a second actor or policy function, resulting in a simpler algorithm. The simpler optimization objective and the choice of value function parameterization result in an algorithm that is substantially more sample-efficient when used with large neural network function approximators on a range of continuous control domains.

Beyond deriving an improved model-free deep reinforcement learning algorithm, we also seek to incorporate elements of model-based RL to accelerate learning, without giving up the strengths of model-free methods. One approach is for off-policy algorithms such as Q-learning to incorporate off-policy experience produced by a model-based planner. However, while this solution is a natural one, our empirical evaluation shows that it is ineffective at accelerating learning. As we discuss in our evaluation, this is due in part to the nature of value function estimation algorithms, which must experience both good and bad state transitions to accurately model the value function landscape. We propose an alternative approach to incorporating learned models into our continuous-action Q-learning algorithm based on imagination rollouts: on-policy samples generated under the learned model, analogous to the Dyna-Q method (Sutton, 1990). We show that this is extremely effective when the learned dynamics model perfectly matches the true one, but degrades dramatically with imperfect learned models. However, we demonstrate that iteratively fitting local linear models to the latest batch of on-policy or off-policy rollouts provides sufficient local accuracy to achieve substantial improvement using short imagination rollouts in the vicinity of the real-world samples.

Our paper provides three main contributions: first, we derive and evaluate a Q-function representation that allows for effective Q-learning in continuous domains. Second, we evaluate several naive options for incorporating learned models into model-free Q-learning, and we show that they are minimally effective on our continuous control tasks. Third, we propose to combine locally linear models with local on-policy imagination rollouts to accelerate model-free continuous Q-learning, and show that this produces a large improvement in sample complexity. We evaluate our method on a series of simulated robotic tasks and compare to prior methods.

2. Related Work

Deep reinforcement learning has received considerable attention in recent years due to its potential to automate the design of representations in RL. Deep reinforcement learning and related methods have been applied to learn policies to play Atari games (Mnih et al., 2015; Schaul et al., 2015) and to perform a wide variety of simulated and real-world robotic control tasks (Hafner & Riedmiller, 2011; Lillicrap et al., 2016; Levine & Koltun, 2013; de Bruin et al., 2015). While the majority of deep reinforcement learning methods in domains with discrete actions, such as Atari games, are based around value function estimation and Q-learning (Mnih et al., 2015), continuous domains typically require an explicit representation of the policy, for example in the context of a policy gradient algorithm (Schulman et al., 2015). If we wish to incorporate the benefits of value function estimation into continuous deep reinforcement learning, we must typically use two networks: one to represent the policy, and one to represent the value function (Lillicrap et al., 2016; Schulman et al., 2016). In this paper, we instead describe how the simplicity and elegance of Q-learning can be ported into continuous domains, by learning a single network that outputs both the value function and the policy. Our Q-function representation is related to dueling networks (Wang et al., 2015), though our approach applies to continuous action domains. Our empirical evaluation demonstrates that our continuous Q-learning algorithm achieves faster and more effective learning on a set of benchmark tasks compared to continuous actor-critic methods, and we believe that the simplicity of this approach will make it easier to adopt in practice. Our Q-learning method is also related to the work of Rawlik et al. (2013), but the form of our Q-function update is more standard.

As in standard RL, model-based deep reinforcement learning methods have generally been more efficient (Nguyen & Widrow, 1989; Schmidhuber, 1991; Li & Todorov, 2004; Watter et al., 2015; Wahlström et al., 2015; Levine & Koltun, 2013), while model-free algorithms tend to be more generally applicable but substantially slower (Koutník et al., 2013; Schulman et al., 2015; Lillicrap et al., 2016). Combining model-based and model-free learning has been explored in several ways in the literature. The method closest to our imagination rollouts approach is Dyna-Q (Sutton, 1990), which uses simulated experience in a learned model to supplement real-world on-policy rollouts. As we show in our evaluation, using Dyna-Q style methods to accelerate model-free RL is very effective when the learned model perfectly matches the true model, but degrades rapidly as the model becomes worse. Approximate Model-Assisted Neural Fitted Q-Iteration (AMA-NFQ) (Lampe & Riedmiller, 2014) studies a similar approach for a batch variant of Q-learning
and achieves a significant reduction in sample complexity for a simple benchmark task. However, AMA-NFQ relies on fitting neural networks for the dynamics, which we empirically find is difficult for a broader range of tasks. We demonstrate that using iteratively refitted local linear models achieves substantially better results with imagination rollouts than more complex neural network models. We hypothesize that this is likely due to the fact that the more expressive models themselves require substantially more data, and that otherwise efficient algorithms like Dyna-Q are vulnerable to poor model approximations.

3. Background

In reinforcement learning, the goal is to learn a policy to control a system with states x ∈ X and actions u ∈ U in environment E, so as to maximize the expected sum of returns according to a reward function r(x, u). The dynamical system is defined by an initial state distribution p(x_1) and a dynamics distribution p(x_{t+1}|x_t, u_t). At each time step t ∈ [1, T], the agent chooses an action u_t according to its current policy π(u_t|x_t), and observes a reward r(x_t, u_t). The agent then experiences a transition to a new state sampled from the dynamics distribution, and we can express the resulting state visitation frequency of the policy π as ρ^π(x_t). Defining R_t = Σ_{i=t}^{T} γ^{(i−t)} r(x_i, u_i), the goal is to maximize the expected sum of returns, given by R = E_{r_{i≥1}, x_{i≥1}∼E, u_{i≥1}∼π}[R_1], where γ is a discount factor that prioritizes earlier rewards over later ones. With γ < 1, we can also set T = ∞, though we use a finite horizon for all of the tasks in our experiments. The expected return R can be optimized using a variety of model-free and model-based algorithms. In this section, we review several of these methods that we build on in our work.

Model-Free Reinforcement Learning. When the system dynamics p(x_{t+1}|x_t, u_t) are not known, as is often the case with physical systems such as robots, policy gradient methods (Peters & Schaal, 2006) and value function or Q-function learning with function approximation (Sutton et al., 1999) are often preferred. Policy gradient methods provide a simple, direct approach to RL, which can succeed on high-dimensional problems, but potentially require a large number of samples (Schulman et al., 2015; 2016). Off-policy algorithms that use value or Q-function approximation can in principle achieve better data efficiency (Lillicrap et al., 2016). However, adapting such methods to continuous tasks typically requires optimizing two function approximators on different objectives. We instead build on standard Q-learning, which has a single objective. We summarize Q-learning in this section. The Q-function Q^π(x_t, u_t) corresponding to a policy π is defined as the expected return from x_t after taking action u_t and following the policy π thereafter:

Q^π(x_t, u_t) = E_{r_{i≥t}, x_{i>t}∼E, u_{i>t}∼π}[R_t | x_t, u_t]    (1)

Q-learning learns a greedy deterministic policy µ(x_t) = arg max_u Q(x_t, u_t), which corresponds to π(u_t|x_t) = δ(u_t = µ(x_t)). Let θ^Q parametrize the action-value function, and let β be an arbitrary exploration policy; the learning objective is to minimize the Bellman error, where we fix the target y_t:

L(θ^Q) = E_{x_t∼ρ^β, u_t∼β, r_t∼E}[(Q(x_t, u_t|θ^Q) − y_t)^2]    (2)
y_t = r(x_t, u_t) + γ Q(x_{t+1}, µ(x_{t+1}))
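As a concrete illustration of the update in Equation (2), the following sketch (ours, not the authors' released code) computes the fixed target y_t and the squared Bellman error in PyTorch, together with the soft target-network update used later in Algorithm 1. The callables q_net, q_target, and mu are placeholder names for Q(x, u|θ^Q), a target copy of the network, and the greedy policy µ.

import torch
import torch.nn.functional as F

def q_learning_loss(q_net, q_target, mu, batch, gamma=0.99):
    # batch: tensors (x_t, u_t, r_t, x_{t+1}) sampled from the replay buffer
    x, u, r, x_next = batch
    with torch.no_grad():
        u_next = mu(x_next)                       # greedy action mu(x_{t+1})
        y = r + gamma * q_target(x_next, u_next)  # fixed target y_t from Equation (2)
    # Bellman error E[(Q(x_t, u_t | theta^Q) - y_t)^2]
    return F.mse_loss(q_net(x, u), y)

def soft_update(target, source, tau=1e-3):
    # theta' <- tau * theta + (1 - tau) * theta', as in Algorithm 1
    for p_t, p in zip(target.parameters(), source.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)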
For continuous action problems, Q-learning becomes difficult, because it requires maximizing a complex, nonlinear function at each update. For this reason, continuous domains are often tackled using actor-critic methods (Konda & Tsitsiklis, 1999; Hafner & Riedmiller, 2011; Silver et al., 2014; Lillicrap et al., 2016), where a separate parameterized "actor" policy π is learned in addition to the Q-function or value function "critic," such as the Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al., 2016).

In order to describe our method in the following sections, it will be useful to also define the value function V^π(x_t) and advantage function A^π(x_t, u_t) of a given policy π:

V^π(x_t) = E_{r_{i≥t}, x_{i>t}∼E, u_{i≥t}∼π}[R_t | x_t]    (3)
A^π(x_t, u_t) = Q^π(x_t, u_t) − V^π(x_t).

Model-Based Reinforcement Learning. If we know the dynamics p(x_{t+1}|x_t, u_t), or if we can approximate them with some learned model p̂(x_{t+1}|x_t, u_t), we can use model-based RL and optimal control. While a wide range of model-based RL and control methods have been proposed in the literature (Deisenroth et al., 2013; Kober & Peters, 2012), two are particularly relevant for this work: iterative LQG (iLQG) (Li & Todorov, 2004) and Dyna-Q (Sutton, 1990). The iLQG algorithm optimizes trajectories by iteratively constructing locally optimal linear feedback controllers under a local linearization of the dynamics p̂(x_{t+1}|x_t, u_t) = N(f_{xt} x_t + f_{ut} u_t, F_t) and a quadratic expansion of the rewards r(x_t, u_t) (Tassa et al., 2012). Under linear dynamics and quadratic rewards, the action-value function Q(x_t, u_t) and value function V(x_t) are locally quadratic and can be computed by dynamic programming. The optimal policy can be derived analytically from the quadratic Q(x_t, u_t) and V(x_t) functions, and corresponds to a linear feedback controller g(x_t) = û_t + k_t + K_t(x_t − x̂_t), where k_t is an open-loop term, K_t is the closed-loop feedback matrix, and x̂_t and û_t are the states and actions of the nominal trajectory, which is the average trajectory of the controller. Employing
the maximum entropy objective (Levine & Koltun, 2013), we can also construct a linear-Gaussian controller, where c is a scalar to adjust for arbitrary scaling of the reward magnitudes:

π_t^{iLQG}(u_t|x_t) = N(û_t + k_t + K_t(x_t − x̂_t), −c Q_{u,u_t}^{−1})    (4)

When the dynamics are not known, a particularly effective way to use iLQG is to combine it with learned time-varying linear models p̂(x_{t+1}|x_t, u_t). In this variant of the algorithm, trajectories are sampled from the controller in Equation (4) and used to fit time-varying linear dynamics with linear regression. These dynamics are then used with iLQG to obtain a new controller, typically using a KL-divergence constraint to enforce a trust region, so that the new controller does not deviate too much from the region in which the samples were generated (Levine & Abbeel, 2014).
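The following sketch illustrates how the time-varying linear-Gaussian controller of Equation (4) could be executed. It is our illustration under assumed array layouts: x_hat, u_hat, k, K, and Sigma stand for the nominal states and actions, open-loop terms, feedback gains, and per-step covariances −c Q_{u,u_t}^{−1} produced by an iLQG backward pass.

import numpy as np

def ilqg_action(t, x, x_hat, u_hat, k, K, Sigma, rng):
    """Sample u_t ~ N(u_hat_t + k_t + K_t (x_t - x_hat_t), Sigma_t)."""
    mean = u_hat[t] + k[t] + K[t] @ (x - x_hat[t])
    return rng.multivariate_normal(mean, Sigma[t])

# Usage sketch: roll out one trajectory in a hypothetical environment `env`.
# rng = np.random.default_rng(0)
# x = env.reset()
# for t in range(T):
#     u = ilqg_action(t, x, x_hat, u_hat, k, K, Sigma, rng)
#     x, r = env.step(u)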
Besides enabling iLQG and other planning-based algorithms, a learned model of the dynamics can allow a model-free algorithm to generate synthetic experience by performing rollouts in the learned model. A particularly relevant method of this type is Dyna-Q (Sutton, 1990), which performs real-world rollouts using the policy π, and then generates synthetic rollouts using a model learned from these samples. The synthetic rollouts originate at states visited by the real-world rollouts, and serve as supplementary data for a variety of possible reinforcement learning algorithms. However, most prior Dyna-Q methods have focused on relatively small, discrete domains. In Section 5, we describe how our method can be extended into a variant of Dyna-Q to achieve substantially faster learning on a range of continuous control tasks with complex neural network policies, and in Section 6, we empirically analyze the sensitivity of this method to imperfect learned dynamics models.

4. Continuous Q-Learning with Normalized Advantage Functions

We first propose a simple method to enable Q-learning in continuous action spaces with deep neural networks, which we refer to as normalized advantage functions (NAF). The idea behind normalized advantage functions is to represent the Q-function Q(x_t, u_t) in Q-learning in such a way that its maximum, arg max_u Q(x_t, u_t), can be determined easily and analytically during the Q-learning update. While a number of representations are possible that allow for analytic maximization, the one we use in our implementation is based on a neural network that separately outputs a value function term V(x) and an advantage term A(x, u), which is parameterized as a quadratic function of nonlinear features of the state:

Q(x, u|θ^Q) = A(x, u|θ^A) + V(x|θ^V)
A(x, u|θ^A) = −(1/2) (u − µ(x|θ^µ))^T P(x|θ^P) (u − µ(x|θ^µ))

P(x|θ^P) is a state-dependent, positive-definite square matrix, which is parametrized by P(x|θ^P) = L(x|θ^P) L(x|θ^P)^T, where L(x|θ^P) is a lower-triangular matrix whose entries come from a linear output layer of a neural network, with the diagonal terms exponentiated. While this representation is more restrictive than a general neural network function, since the Q-function is quadratic in u, the action that maximizes the Q-function is always given by µ(x|θ^µ). We use this representation with a deep Q-learning algorithm analogous to Mnih et al. (2015), using target networks and a replay buffer as described by Lillicrap et al. (2016). NAF, given by Algorithm 1, is considerably simpler than DDPG.

Algorithm 1 Continuous Q-Learning with NAF
  Randomly initialize normalized Q network Q(x, u|θ^Q).
  Initialize target network Q′ with weights θ^{Q′} ← θ^Q.
  Initialize replay buffer R ← ∅.
  for episode = 1, M do
    Initialize a random process N for action exploration
    Receive initial observation state x_1 ∼ p(x_1)
    for t = 1, T do
      Select action u_t = µ(x_t|θ^µ) + N_t
      Execute u_t and observe r_t and x_{t+1}
      Store transition (x_t, u_t, r_t, x_{t+1}) in R
      for iteration = 1, I do
        Sample a random minibatch of m transitions from R
        Set y_i = r_i + γ V′(x_{i+1}|θ^{Q′})
        Update θ^Q by minimizing the loss: L = (1/m) Σ_i (y_i − Q(x_i, u_i|θ^Q))^2
        Update the target network: θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′}
      end for
    end for
  end for

Decomposing Q into an advantage term A and a state-value term V was suggested by Baird III (1993) and Harmon & Baird III (1996), and was recently explored by Wang et al. (2015) for discrete action problems. Normalized action-value functions have also been proposed by Rawlik et al. (2013) in the context of an alternative temporal difference learning algorithm. However, our method is the first to combine such representations with deep neural networks into an algorithm that can be used to learn policies for a range of challenging continuous control tasks. In general, A does not need to be quadratic, and exploring other parametric forms such as multimodal distributions is an interesting avenue for future work. The appendix provides details on an adaptive exploration rule with experimental results.
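To make the parameterization concrete, the sketch below shows one possible NAF head in PyTorch. It is our illustration rather than the released implementation, and the hidden-layer sizes and indexing details are assumptions. The network outputs V(x), µ(x), and the entries of a lower-triangular L(x) with exponentiated diagonal, and evaluates Q(x, u) = V(x) − (1/2)(u − µ(x))^T L(x)L(x)^T (u − µ(x)), whose maximizer over u is µ(x).

import torch
import torch.nn as nn

class NAFHead(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.action_dim = action_dim
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.V = nn.Linear(hidden, 1)            # state value V(x)
        self.mu = nn.Linear(hidden, action_dim)  # greedy action mu(x)
        # entries of the lower-triangular L(x): diagonal first, then strictly-lower part
        self.l_entries = nn.Linear(hidden, action_dim * (action_dim + 1) // 2)

    def forward(self, x, u):
        h = self.trunk(x)
        V = self.V(h)
        mu = self.mu(h)
        raw = self.l_entries(h)
        n, batch = self.action_dim, x.shape[0]
        L = x.new_zeros(batch, n, n)
        diag = torch.arange(n)
        L[:, diag, diag] = torch.exp(raw[:, :n])           # exponentiated diagonal
        rows, cols = torch.tril_indices(n, n, offset=-1)
        L[:, rows, cols] = raw[:, n:]                      # strictly-lower entries
        P = L @ L.transpose(1, 2)                          # P(x) = L(x) L(x)^T, positive definite
        d = (u - mu).unsqueeze(-1)
        A = -0.5 * (d.transpose(1, 2) @ P @ d).squeeze(-1) # advantage, quadratic in u
        return A + V, V, mu                                # Q(x, u), V(x), argmax_u Q(x, u)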
5. Accelerating Learning with Imagination Rollouts

While NAF provides some advantages over actor-critic model-free RL methods in continuous domains, we can improve its data efficiency substantially under some additional assumptions by exploiting learned models. We will show that incorporating a particular type of learned model into Q-learning with NAF significantly improves sample efficiency, while still allowing the final policy to be fine-tuned with model-free learning to achieve good performance without the limitations of imperfect models.

5.1. Model-Guided Exploration

One natural approach to incorporating a learned model into an off-policy algorithm such as Q-learning is to use the learned model to generate good exploratory behaviors using planning or trajectory optimization. To evaluate this idea, we utilize the iLQG algorithm to generate good trajectories under the model, and then mix these trajectories together with on-policy experience by appending them to the replay buffer. Interestingly, we show in our evaluation that, even when planning under the true model, the improvement obtained from this approach is often quite small, and varies significantly across domains and choices of exploration noise. The intuition behind this result is that off-policy iLQG exploration is too different from the learned policy, and Q-learning must consider alternatives in order to ascertain the optimality of a given action. That is, it is not enough to simply show the algorithm good actions; it must also experience bad actions to understand which actions are better and which are worse.
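As a sketch of this mixing procedure (our illustration; the environment, controller, and buffer interfaces are assumptions), an episode can be collected either with an iLQG controller or with the current policy, with both kinds of trajectories appended to the same replay buffer used for Q-learning:

import random

def collect_episode(env, policy, ilqg_controller, replay_buffer, p_ilqg, horizon):
    # with probability p_ilqg, collect this episode with the planner instead of the policy
    use_ilqg = random.random() < p_ilqg
    x = env.reset()
    for t in range(horizon):
        u = ilqg_controller(t, x) if use_ilqg else policy(x)
        x_next, r = env.step(u)
        replay_buffer.append((x, u, r, x_next))  # off-policy data for Q-learning
        x = x_next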
5.2. Imagination Rollouts

As discussed in the previous section, incorporating off-policy exploration from good, narrow distributions, such as those induced by iLQG, often does not result in significant improvement for Q-learning. These results suggest that Q-learning, which learns a policy based on minimizing temporal differences, inherently requires noisy on-policy actions to succeed. In real-world domains such as robots and autonomous vehicles, this can be undesirable for two reasons: first, it suggests that large amounts of on-policy experience are required in addition to good off-policy samples, and second, it implies that the policy must be allowed to make "its own mistakes" during training, which might involve taking undesirable or dangerous actions that can damage real-world hardware.

One way to avoid these problems while still allowing for a large amount of on-policy exploration is to generate synthetic on-policy trajectories under a learned model. Adding these synthetic samples, which we refer to as imagination rollouts, to the replay buffer effectively augments the amount of experience available for Q-learning. The particular approach we use is to perform rollouts in the real world using a mixture of planned iLQG trajectories and on-policy trajectories, with various mixing coefficients evaluated in our experiments, and then generate additional synthetic on-policy rollouts using the learned model from each state visited along the real-world rollouts. We show that using iteratively refitted linear models allows us to extend the approach to deep reinforcement learning on a range of continuous control domains. In some scenarios, we can even generate all or most of the real rollouts using off-policy iLQG controllers, which is desirable in safety-critical domains where poorly trained policies might take dangerous actions. The algorithm is given as Algorithm 2, and is an extension of Algorithm 1 that incorporates model-based RL.

Algorithm 2 Imagination Rollouts with Fitted Dynamics and Optional iLQG Exploration
  Randomly initialize normalized Q network Q(x, u|θ^Q).
  Initialize target network Q′ with weights θ^{Q′} ← θ^Q.
  Initialize replay buffer R ← ∅ and fictional buffer R_f ← ∅.
  Initialize additional buffers B ← ∅, B_old ← ∅ with size nT.
  Initialize fitted dynamics model M ← ∅.
  for episode = 1, M do
    Initialize a random process N for action exploration
    Receive initial observation state x_1
    Select µ′(x, t) from {µ(x|θ^µ), π_t^{iLQG}(u_t|x_t)} with probabilities {p, 1 − p}
    for t = 1, T do
      Select action u_t = µ′(x_t, t) + N_t
      Execute u_t and observe r_t and x_{t+1}
      Store transition (x_t, u_t, r_t, x_{t+1}, t) in R and B
      if mod(episode · T + t, m) = 0 and M ≠ ∅ then
        Sample m transitions (x_i, u_i, r_i, x_{i+1}, i) from B_old
        Use M to simulate l steps from each sample
        Store all fictional transitions in R_f
      end if
      Sample a random minibatch of m transitions I · l times from R_f and I times from R, and update θ^Q, θ^{Q′} as in Algorithm 1 per minibatch.
    end for
    if B is full then
      M ← FitLocalLinearDynamics(B) (see Section 5.3)
      π^iLQG ← iLQG_OneStep(B, M) (see appendix)
      B_old ← B, B ← ∅
    end if
  end for

Imagination rollouts can suffer from severe bias when the learned model is inaccurate. For example, we found it very difficult to train nonlinear neural network models for the dynamics that would actually improve the efficiency of Q-learning when used for imagination rollouts. As discussed in the following section, we found that using iteratively refitted time-varying linear dynamics produced substantially better results. In either case, we would still like to preserve
the generality and optimality of model-free RL while deriving the benefits of model-based learning. To that end, we observe that most of the benefit of model-based learning is derived in the early stages of the learning process, when the policy induced by the neural network Q-function is poor. As the Q-function becomes more accurate, on-policy behavior tends to outperform model-based controllers. We therefore propose to switch off imagination rollouts after a given number of iterations.¹ In this framework, the imagination rollouts can be thought of as an inexpensive way to pretrain the Q-function, such that fine-tuning using real-world experience can quickly converge to an optimal solution.

¹ In future work, it would be interesting to select this iteration adaptively based on the expected relative performance of the Q-function policy and model-based planning.

5.3. Fitting the Dynamics Model

In order to obtain good imagination rollouts and improve the efficiency of Q-learning, we needed to use an effective and data-efficient model learning algorithm. While prior methods propose a variety of model classes, including neural networks (Heess et al., 2015), Gaussian processes (Deisenroth & Rasmussen, 2011), and locally weighted regression (Atkeson et al., 1997), we found that we could obtain good results by using iteratively refitted time-varying linear models, as proposed by Levine & Abbeel (2014). In this approach, instead of learning a good global model for all states and actions, we aim only to obtain a good local model around the latest set of samples. This approach requires a few additional assumptions: namely, it requires the initial state to be either deterministic or low-variance Gaussian, and it requires the states and actions to all be continuous. To handle domains with more varied initial states, we can use a mixture of Gaussian initial states with separate time-varying linear models for each one. The model itself is given by p_t(x_{t+1}|x_t, u_t) = N(F_t [x_t; u_t] + f_t, N_t). Every n episodes, we refit the parameters F_t, f_t, and N_t by fitting a Gaussian distribution at each time step to the vectors [x_t^i; u_t^i; x_{t+1}^i], where i indicates the sample index, and conditioning this Gaussian on [x_t; u_t] to obtain the parameters of the linear-Gaussian dynamics at that step. We use n = 5 in our experiments. Although this approach introduces additional assumptions beyond the standard model-free RL setting, we show in our evaluation that it produces impressive gains in sample efficiency on tasks where it can be applied.
6. Experiments

We evaluated our approach on a set of simulated robotic tasks using the MuJoCo simulator (Todorov et al., 2012). The tasks were based on the benchmarks described by Lillicrap et al. (2016). Although we attempted to replicate the tasks in previous work as closely as possible, discrepancies in the simulator parameters and the contact model produced results that deviate slightly from those reported in prior work. In all experiments, the input to the policy consisted of the state of the system, defined in terms of joint angles and root link positions. Angles were often converted to a sine and cosine encoding. We assume the reward function is given and is not learned for model-based experience.

For both our method and the prior DDPG algorithm (Lillicrap et al., 2016) in the comparisons, we used neural networks with two layers of 200 rectified linear units (ReLU) to produce each of the output parameters: the Q-function and policy in DDPG, and the value function V, the advantage matrix L, and the mean µ for NAF. Since Q-learning was done with a replay buffer, we applied the Q-learning update 5 times per step of experience to accelerate learning (I = 5). To ensure a fair comparison, DDPG also updates both the Q-function and policy parameters 5 times per step.

6.1. Normalized Advantage Functions

In this section, we compare NAF and DDPG on 10 representative domains from Lillicrap et al. (2016), with three additional domains: a four-legged 3D ant, a six-joint 2D swimmer, and a 2D peg (see the appendix for descriptions of the task domains). We found the most sensitive hyperparameters to be the presence or absence of batch normalization, the base learning rate for ADAM (Kingma & Ba, 2014) ∈ {1e−4, 1e−3, 1e−2}, and the exploration noise scale ∈ {0.1, 0.3, 1.0}. We report the best performance for each domain. We were unable to achieve good results with the method of Rawlik et al. (2013) on our domains, likely due to the complexity of high-dimensional neural network function approximators.

Figure 1b, Figure 1c, and additional figures in the appendix show the performance on the three-joint reacher, peg insertion, and a gripper with mobile base. While the numerical gap on the reacher may be small, qualitatively there is also a very noticeable difference between NAF and DDPG. DDPG converges to a solution where the deterministic policy causes the tip to fluctuate continuously around the target, and does not reach it precisely. NAF, on the other hand, learns a smooth policy that makes the tip slow down and stabilize at the target. This difference is more noticeable in peg insertion and moving gripper, as shown by the much faster convergence rate to the optimal solution. Precision is very important in many real-world robotic tasks, and these results suggest that NAF may be preferred in such domains.

On locomotion tasks, the performance of the two methods is relatively similar. On the six-joint swimmer task and four-legged ant, NAF slightly outperforms DDPG in
terms of the convergence speed; however, DDPG is faster on cheetah and finds a better policy on walker2d. The loss in performance of NAF can potentially be explained by the downside of its mode-seeking behavior, where it is hard to explore other modes once the quadratic advantage function finds a good one. Choosing a parametric form that is more expressive than a quadratic could be used to address this limitation in future work.

Figure 1. (a) Task domains: top row from left (manipulation tasks: peg, gripper, mobile gripper), bottom row from left (locomotion tasks: cheetah, swimmer6, ant). (b, c) NAF vs. DDPG results on the three-joint reacher and peg insertion. On the reacher, the DDPG policy continuously fluctuates the tip around the target, while NAF stabilizes well at the target.

The results on all of the domains are summarized in Table 1. Overall, NAF outperformed DDPG on the majority of tasks, particularly manipulation tasks that require precision and suffer less from the lack of multimodal Q-functions. This makes this approach particularly promising for efficient learning of real-world robotic tasks.

Table 1. Best test rewards of DDPG and NAF policies, and the number of episodes required to reach within 5% of the best value. "Random" denotes the score of a random agent.

Domain      Random   DDPG     episodes   NAF      episodes
Cartpole    -2.1     -0.601   420        -0.604   190
Reacher     -2.3     -0.509   1370       -0.331   1260
Peg         -11      -0.950   690        -0.438   130
Gripper     -29      1.03     2420       1.81     1920
GripperM    -90      -20.2    1350       -12.4    730
Canada2d    -12      -4.64    1040       -4.21    900
Cheetah     -0.3     8.23     1590       7.91     2390
Swimmer6    -325     -174     220        -172     190
Ant         -4.8     -2.54    2450       -2.58    1350
Walker2d    0.3      2.96     850        1.85     1530

6.2. Evaluating Best-Case Model-Based Improvement with True Models

In order to determine how best to incorporate model-based components to accelerate model-free Q-learning, we tested several approaches using the ground-truth dynamics, to control for challenges due to model fitting. We evaluated both of the methods discussed in Section 5: the use of model-based planning to generate good off-policy rollouts in the real world, and the use of the model to generate on-policy synthetic rollouts.

Figure 2a shows the effect of mixing off-policy iLQG experience and imagination rollouts on the three-joint reacher. It is noticeable that mixing in the good off-policy experience does not significantly improve data efficiency, while imagination rollouts always improve data efficiency or final performance significantly. In the context of Q-learning, this result is not entirely surprising: Q-learning must experience both good and bad actions in order to determine which actions are preferred, while the good model-based rollouts are so far removed from the policy in the early stages of learning that they provide little useful information. Figure 2a also evaluates two different variants of the imagination rollouts approach, where the rollouts in the real world are performed either using the learned policy or using model-based planning with iLQG. In the case of this task, the iLQG rollouts achieve slightly better results, since the on-policy imagination rollouts sampled around these off-policy states provide Q-learning with additional information about alternative actions not taken by the iLQG planner. In general, we did not find that off-policy rollouts were consistently better than on-policy rollouts across all tasks, but they did consistently produce good results. Performing off-policy rollouts with iLQG may be desirable in real-world domains, where a partially learned policy might take undesirable or dangerous actions. Further details of these experiments are provided in the appendix.

6.3. Guided Imagination Rollouts with Fitted Dynamics

In this section, we evaluated the performance of imagination rollouts with learned dynamics. As seen in Figure 2b, we found that fitting time-varying linear models following the imagination rollout algorithm is substantially better than fitting neural network dynamics models for the tasks we considered. There is a fundamental tension between efficient learning and expressive models like neural networks. We cannot hope to learn useful neural network models with a small number of samples for complex tasks, which makes it
difficult to acquire a good model with fewer samples than are necessary to acquire a good policy. While the model is trained with supervised learning, which is typically more sample-efficient, it often needs to represent a more complex function (e.g., rigid-body physics). However, having such expressive models becomes more important as we seek to improve model accuracy. Figure 2b presents results that compare fitted neural network models with the true dynamics when combined with imagination rollouts. These results indicate that the learned neural network models negate the benefits of imagination rollouts on our domains.

Figure 2. Results on NAF with iLQG-guided exploration and imagination rollouts, (a) using true dynamics and (b, c) using fitted dynamics: (a) NAF on the single-target reacher, (b) NAF on the single-target reacher, (c) NAF on the single-target gripper. "ImR" denotes using imagination rollouts with l = 10 steps on the reacher and l = 5 steps on the gripper. "iLQG-x" indicates mixing in a fraction x of iLQG episodes. Fitted dynamics uses time-varying linear models with sample size n = 5, except "-NN", which fits a neural network to the global dynamics.

To evaluate imagination rollouts with fitted time-varying linear dynamics, we chose single-target variants of two of the manipulation tasks: the reacher and the gripper task. The results are shown in Figures 2b and 2c. We found that imagination rollouts of length 5 to 10 were sufficient for these tasks to achieve significant improvement over the fully model-free variant of NAF.

Adding imagination rollouts in these domains provided a factor of 2 to 5 improvement in data efficiency. In order to retain the benefit of model-free learning and allow the policy to continue improving once it exceeds the quality possible under the learned model, we switch off the imagination rollouts after 130 episodes (20,000 steps) on the gripper domain. This produces a small transient drop in the performance of the policy, but the results quickly improve again. Switching off the imagination rollouts also ensures that Q-learning does not diverge after it reaches good values, as was often observed in the gripper. This suggests that imagination rollouts, in contrast to the off-policy exploration discussed in the previous section, are an effective method for bootstrapping model-free deep RL.

It should be noted that, although time-varying linear models combined with imagination rollouts provide a substantial boost in sample efficiency, this improvement comes at some cost in generality, since effective fitting of time-varying linear models requires relatively small initial state distributions. With more complex initial state distributions, we might cluster the trajectories and fit multiple models to account for different modes. Extending the benefits of time-varying linear models to less restrictive settings is a promising direction that builds on prior work (Levine et al., 2016; Fu et al., 2015). That said, our results show that imagination rollouts are a very promising approach to accelerating model-free learning when combined with the right kind of dynamics model.

7. Discussion

In this paper, we explored several methods for improving the sample efficiency of model-free deep reinforcement learning. We first propose a method for applying standard Q-learning methods to high-dimensional, continuous domains, using the normalized advantage function (NAF) representation. This allows us to simplify the more standard actor-critic style algorithms, while preserving the benefits of nonlinear value function approximation. We show that, in comparison to recently proposed deep actor-critic algorithms, our method tends to learn faster and acquires more accurate policies. We further explore how model-free RL can be accelerated by incorporating learned models, without sacrificing the optimality of the policy in the face of imperfect model learning. We show that, although Q-learning can incorporate off-policy experience, learning primarily from off-policy exploration (via model-based planning) only rarely improves the overall sample efficiency of the algorithm. We postulate that this is caused by the need to observe both successful and unsuccessful actions in order to obtain an accurate estimate of the Q-function. We demonstrate that an alternative method based on synthetic on-policy rollouts achieves substantially improved sample complexity, but only when the model learning algorithm is chosen carefully. We demonstrate that training neural network models does not provide substantive improvement in our domains, but simple iteratively refitted time-varying linear models do provide substantial improvement on domains where they can be applied.
References

Atkeson, Christopher G., Moore, Andrew W., and Schaal, Stefan. Locally weighted learning for control. In Lazy Learning, pp. 75–113. Springer, 1997.

Baird III, Leemon C. Advantage updating. Technical report, DTIC Document, 1993.

de Bruin, Tim, Kober, Jens, Tuyls, Karl, and Babuška, Robert. The importance of experience replay database composition in deep reinforcement learning. Deep Reinforcement Learning Workshop, NIPS, 2015.

Deisenroth, Marc and Rasmussen, Carl E. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), pp. 465–472, 2011.

Deisenroth, Marc Peter, Neumann, Gerhard, Peters, Jan, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.

Fu, Justin, Levine, Sergey, and Abbeel, Pieter. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. arXiv preprint arXiv:1509.06841, 2015.

Hafner, Roland and Riedmiller, Martin. Reinforcement learning in feedback control. Machine Learning, 84(1-2):137–169, 2011.

Harmon, Mance E. and Baird III, Leemon C. Multi-player residual advantage learning with general function approximation. Wright Laboratory, WL/AACF, Wright-Patterson Air Force Base, OH 45433-7308, 1996.

Hausknecht, Matthew and Stone, Peter. Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143, 2015.

Heess, Nicolas, Wayne, Gregory, Silver, David, Lillicrap, Tim, Erez, Tom, and Tassa, Yuval. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (NIPS), pp. 2926–2934, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Koutník, Jan, Cuccu, Giuseppe, Schmidhuber, Jürgen, and Gomez, Faustino. Evolving large-scale neural networks for vision-based reinforcement learning. In Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pp. 1061–1068. ACM, 2013.

Lampe, Thomas and Riedmiller, Martin. Approximate model-assisted neural fitted Q-iteration. In Neural Networks (IJCNN), 2014 International Joint Conference on, pp. 2698–2704. IEEE, 2014.

Levine, Sergey and Abbeel, Pieter. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), pp. 1071–1079, 2014.

Levine, Sergey and Koltun, Vladlen. Guided policy search. In International Conference on Machine Learning (ICML), pp. 1–9, 2013.

Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. JMLR, 17, 2016.

Li, Weiwei and Todorov, Emanuel. Iterative linear quadratic regulator design for nonlinear biological movement systems. In ICINCO (1), pp. 222–229, 2004.

Lillicrap, Timothy P., Hunt, Jonathan J., Pritzel, Alexander, Heess, Nicolas, Erez, Tom, Tassa, Yuval, Silver, David, and Wierstra, Daan. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2016.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Nguyen, D. and Widrow, B. The truck backer-upper: An example of self-learning in neural networks, 1989.

Peters, Jan and Schaal, Stefan. Policy gradient methods for robotics. In International Conference on Intelligent Robots and Systems (IROS), pp. 2219–2225. IEEE, 2006.

Peters, Jan, Mülling, Katharina, and Altun, Yasemin. Relative entropy policy search. In AAAI, Atlanta, 2010.

Rawlik, Konrad, Toussaint, Marc, and Vijayakumar, Sethu. On stochastic optimal control and reinforcement learning by approximate inference. Robotics, pp. 353, 2013.

Schaul, Tom, Quan, John, Antonoglou, Ioannis, and Silver, David. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Watter, Manuel, Springenberg, Jost, Boedecker, Joschka, and Riedmiller, Martin. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems (NIPS), pp. 2728–2736, 2015.