
Self-Improving Robots: End-to-End Autonomous Visuomotor Reinforcement Learning


Archit Sharma§ , Ahmed M. Ahmed§ , Rehaan Ahmad, Chelsea Finn
Stanford University
{architsh, ahmedah, rehaan, cbfinn}@stanford.edu

§ Authors with substantial contributions to real-robot experimentation.

arXiv:2303.01488v1 [cs.RO] 2 Mar 2023

Abstract—In imitation and reinforcement learning, the cost of human supervision limits the amount of data that robots can be trained on. An aspirational goal is to construct self-improving robots: robots that can learn and improve on their own, from autonomous interaction with minimal human supervision or oversight. Such robots could collect and train on much larger datasets, and thus learn more robust and performant policies. While reinforcement learning offers a framework for such autonomous learning via trial-and-error, practical realizations end up requiring extensive human supervision for reward function design and repeated resetting of the environment between episodes of interaction. In this work, we propose MEDAL++, a novel design for self-improving robotic systems: given a small set of expert demonstrations at the start, the robot autonomously practices the task by learning to both do and undo the task, simultaneously inferring the reward function from the demonstrations. The policy and reward function are learned end-to-end from high-dimensional visual inputs, bypassing the need for explicit state estimation or task-specific pre-training for visual encoders used in prior work. We first evaluate our proposed algorithm on a simulated non-episodic benchmark, EARL, finding that MEDAL++ is both more data efficient and gets up to 30% better final performance compared to state-of-the-art vision-based methods. Our real-robot experiments show that MEDAL++ can be applied to manipulation problems in larger environments than those considered in prior work, and autonomous self-improvement can improve the success rate by 30-70% over behavior cloning on just the expert data. Code, training and evaluation videos along with a brief overview are available at: https://architsharma97.github.io/self-improving-robots/

Fig. 1. A robot resets the environment from the goal state to the initial state (top), in contrast to a human resetting the environment for the robot (bottom). While the latter is the norm in robotic reinforcement learning, a robot that can reset the environment and practice the task autonomously can train on more data, and thus, be more competent.

I. INTRODUCTION

To be useful in natural unstructured environments, robots have to be competent on a large set of tasks. While imitation learning methods have shown promising evidence for generalization via large-scale teleoperated data collection efforts [25, 3], human supervision is expensive and collected datasets are still incommensurate for learning robust and broadly performant control. In this context, the aspirational notion of self-improving robots becomes relevant: robots that can learn and improve from their own interactions with the environment autonomously. Reinforcement learning (RL) is a natural framework for such self-improvement, where the robots can learn from trial-and-error. However, deploying RL algorithms has several prerequisites that are time-intensive and require domain expertise: state estimation, designing reward functions, and repeated resetting of the environments after every episode of interaction, impeding the dream of self-improving robotic systems.

Some of these challenges have been addressed by prior work, for example, learning end-to-end visuomotor policies [35, 52] and learning reward functions [10, 23, 47]. More recent works have started addressing the requirement of repeated environment resets, demonstrating that complex behaviors can be learned from autonomous interaction in simulation [22, 8, 46] and on real robots [18, 19, 49]. However, these works learn from low-dimensional state and require engineered reward functions. While R3L [57] shows an autonomous real-robot system that learns from visual inputs without reward engineering, the results have been restricted to smaller and easier to explore environments due to the use of a state-novelty based perturbation controller [45]. A practical real-robot system that learns end-to-end from visual inputs autonomously without extensive task-specific engineering has been elusive.

A key challenge to autonomous policy learning is that of exploration, especially as environments grow larger. Not only is it hard to learn how to solve tasks without engineered reward functions, but in the absence of frequent resetting, the robot can reach states that are further away and harder to recover from. An effective choice to construct self-improving systems can be to use a small set of demonstrations to alleviate challenges related to exploration [39]. And since the human supervision required for collecting the demonstrations is front-loaded, i.e., before the training begins, the robot can collect data autonomously and self-improve thereon.
With this motivation, we build on MEDAL [46], an efficient and simple autonomous RL algorithm that uses a small set of expert demonstrations collected prior to training. MEDAL trains a forward policy that learns to do the task and a backward policy that matches the distribution of states visited by the expert when undoing the task. The states visited by an expert can be an efficient initial distribution to learn the forward policy from, shown theoretically [26] and empirically [46]. However, MEDAL trains on low-dimensional states and requires explicit reward functions, making it incompatible with real-world training.

We design MEDAL++, an autonomous RL algorithm that is feasible and efficient to train in the real world with minimal task-specific engineering. MEDAL++ has several crucial components that enable such real world training: First, we learn an encoder for high-dimensional visual inputs end-to-end along the lines of DrQ-v2 [52], bypassing the need for state estimation or task-specific pre-training of visual encoders used in prior works. Second, we reuse the expert demonstrations to infer a reward function online [23, 47], eliminating the need for engineering reward functions. Finally, we improve upon the learning efficiency of MEDAL by using an ensemble of Q-value functions and increasing the update steps per sample collected [6], using behavior cloning (BC) regularization on the expert data to bias policy learning towards expert actions [42], and oversampling transitions from the demonstration data when training the Q-value function [39].

Overall, we propose MEDAL++, an efficient and practically viable autonomous RL algorithm that can learn from visual inputs without reward specification, and requires minimal oversight during training. We evaluate MEDAL++ on a pixel-based control version of EARL [45], a non-episodic learning benchmark, and observe that MEDAL++ is more data efficient and gets up to 30% better performance compared to competitive methods [46, 57]. Most importantly, we conduct real-robot evaluations using a Franka Panda robot arm on four manipulation tasks, such as hanging a cloth on a hook, covering a bowl with a cloth and peg insertion, all from RGB image observations. After autonomous training using MEDAL++, we observe that the success rate of the policy can increase by 30-70% when compared to that of a behavior cloning policy learned only on the expert data, indicating that MEDAL++ is a step towards self-improving robotic systems.

II. RELATED WORK

Several works have demonstrated the emergence of complex skills on a variety of problems using reinforcement learning on real robots [32, 30, 35, 7, 27, 55, 38, 53, 28, 48, 2]. However, these prior works require the environment to be reset to a (narrow) set of initial states for every episode of interaction with the environment. Such resetting of the environment either requires repeated human interventions and constant monitoring [11, 17, 14, 5, 20] or scripting behaviors [35, 38, 56, 43, 53, 1], which can be time-intensive while resulting in brittle behaviors. Some prior works have also designed the task and environment to bypass the need for resetting the environment [41, 9, 27], but this applies to a restricted set of tasks.

More recent works have identified the need for algorithms that can work autonomously with minimal supervision required for resetting the environments [22, 57, 45]. Several works propose learning a backward policy to undo the task, in addition to learning a forward policy that does the task. Han et al. [22] and Eysenbach et al. [8] use a backward policy that learns to reach the initial state distribution, Sharma et al. [44] propose a backward policy that generates a curriculum for the forward agent, Xu et al. [51] and Lu et al. [36] use unsupervised skill discovery to create adversarial starting states and non-stationary task distributions respectively, and Xie et al. [50] enable robotic agents to learn autonomously in environments with irreversible states, asking for interventions in stuck states and learning to avoid them while interacting with the environment. In this work, we build upon MEDAL [46], where the backward policy learns to match the distribution of states visited by an expert to solve the task. While the results from these prior papers are restricted to simulated settings, some recent papers have demonstrated autonomous training on real robots [57, 18, 19, 49]. However, the results on real robots have either relied on state estimation [18, 19] or pre-specified reward functions [49]. R3L [57] also considers the setting of learning from image observations without repeated resets and specified reward functions, similar to our work. It uses a backward policy that optimizes for state-novelty while learning the reward function from a set of goal images collected prior to training [47]. However, R3L relies on frozen visual encoders trained independently on data collected in the same environment, and optimizing for state-novelty does not scale to larger environments, restricting their robot evaluations to smaller, easier to explore environments. Our simulation results indicate that MEDAL++ learns more efficiently than R3L, and real robot evaluations indicate that MEDAL++ can be used on larger environments.

Overall, our work proposes a system that can learn end-to-end from visual inputs without repeated environment resets, with real-robot evaluations on four manipulation tasks.

III. PRELIMINARIES

Problem Setting. We consider the autonomous RL problem setting [45]. We assume that the agent is in a Markov Decision Process represented by (S, A, T, r, ρ0, γ), where S is the state space, potentially corresponding to high-dimensional observations such as RGB images, A denotes the robot's action space, T : S × A × S → R≥0 denotes the transition dynamics of the environment, r : S × A → R is the (unknown) reward function, ρ0 denotes the initial state distribution, and γ denotes the discount factor. The objective is to learn a policy that maximizes E[Σ_{t=0}^∞ γ^t r(st, at)] when deployed from ρ0 during evaluation. There are two key differences from the standard episodic RL setting: First, the training environment is non-episodic, i.e., the environment does not periodically reset to the initial state distribution after a fixed number of steps.
Second, the reward function is not available during training. Instead, we assume access to a set of demonstrations collected by an expert prior to robot training. Specifically, the expert collects a small set of forward trajectories Df∗ = {(si, ai), . . .} demonstrating the task and similarly, a set of backward demonstrations Db∗ undoing the task back to the initial state distribution ρ0.

Autonomous Reinforcement Learning via MEDAL. To enable a robot to practice the task autonomously, MEDAL [46] trains a forward policy πf to solve the task, and a backward policy πb to undo the task. The forward policy πf executes for a fixed number of steps before the control is switched over to the backward policy πb for a fixed number of steps. Chaining the forward and backward policies reduces the number of interventions required to reset the environment. The forward policy is trained to maximize E[Σ_{t=0}^∞ γ^t r(st, at)], which can be done via any RL algorithm. The backward policy πb is trained to minimize the Jensen-Shannon divergence JS(ρb(s) || ρ∗(s)) between the stationary state-distribution of the backward policy ρb and the state-distribution of the expert policy ρ∗. By training a classifier Cb : S → (0, 1) to discriminate between states visited by the expert (i.e., s ∼ ρ∗) and states visited by πb (i.e., s ∼ ρb), the divergence minimization problem can be rewritten as max_{πb} −E[Σ_{t=0}^∞ γ^t log(1 − Cb(st))] [46]. The classifier used in the reward function for πb is trained using the cross-entropy loss, where the states s ∈ Df∗ are labeled 1 and states visited by πb online are labeled 0, leading to a minimax optimization between πb and Cb.
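To make the classifier-based reward concrete, the following minimal PyTorch sketch trains a state classifier with the cross-entropy loss described above and converts its prediction into the reward −log(1 − Cb(s)). This is an illustration of the idea rather than the authors' released implementation; the names StateClassifier, classifier_loss and classifier_reward are placeholders, and the MLP sizes are arbitrary.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StateClassifier(nn.Module):
        # Placeholder MLP discriminator; sigmoid of the output plays the role of C_b(s) in (0, 1).
        def __init__(self, state_dim, hidden_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, states):
            return self.net(states)  # raw logits

    def classifier_loss(classifier, expert_states, online_states):
        # Cross-entropy loss: expert states are labeled 1, states visited by the policy are labeled 0.
        expert_logits = classifier(expert_states)
        online_logits = classifier(online_states)
        return (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
                + F.binary_cross_entropy_with_logits(online_logits, torch.zeros_like(online_logits)))

    def classifier_reward(classifier, states, eps=1e-6):
        # Reward -log(1 - C(s)): large when the state looks like an expert-visited state.
        with torch.no_grad():
            prob = torch.sigmoid(classifier(states))
        return -torch.log(1.0 - prob + eps)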
Learning Reward Functions with VICE. Engineering rewards can be tedious, especially when only image observations are available. Since the transitions from the training environment are not labeled with rewards, the robot needs to learn a reward function for the forward policy πf. In this work, we consider VICE [13], particularly, the simplified version presented by Singh et al. [47] that is compatible with off-policy RL. VICE requires a small set of states representing the desired outcome (i.e., goal images) prior to training. Given a set of goal states G, VICE trains a classifier Cf : S → (0, 1), where Cf is trained using the cross-entropy loss on states ∈ G labeled as 1, and states visited by πf labeled as 0. The policy πf is trained with a reward function of log Cf(s) − log(1 − Cf(s)), which can be viewed as minimizing the KL-divergence between the stationary state distribution of πf and the goal distribution [12, 40, 15]. VICE has two benefits over pre-trained frozen classifier-based rewards: first, the negative states do not need to be collected by a person, and second, the VICE classifier is harder to exploit as the online states are iteratively added to the label-0 set, continually improving the goal-reaching reward function implicitly.
IV. MEDAL++: PRACTICAL AND EFFICIENT AUTONOMOUS REINFORCEMENT LEARNING

The goal of this section is to develop a reinforcement learning method that can learn from autonomous online interaction in the real world, given just a (small) set of forward demonstrations Df∗ and backward demonstrations Db∗ without reward labels. Particularly, we focus on design choices that make MEDAL++ viable in the real world in contrast to MEDAL: First, we describe how to learn from visual inputs without explicit state estimation. Second, we describe how to train the VICE classifier to eliminate the need for ground truth rewards when training the forward policy πf. Third, we describe the algorithmic modifications for training the Q-value function and the policy π more efficiently, namely, using an ensemble of Q-value functions and leveraging the demonstration data more effectively for efficient learning. Finally, we describe how to construct MEDAL++ using all the components described here, training a forward policy πf and a backward policy πb to learn autonomously.

A. Encoding Visual Inputs

We embed the high-dimensional RGB images into a low-dimensional feature space using a convolutional encoder E. The RGB images are augmented using random crops and shifts (up to 4 pixels) to regularize Q-value learning [52]. While some prior works incorporate explicit representation learning losses for visual encoders [34, 33], Yarats et al. [52] suggest that regularizing Q-value learning using random crop and shift augmentations is both simpler and efficient, allowing end-to-end learning without any explicit representation learning objectives. Specifically, the training loss for the Q-value function on an environment transition (s, a, s′, r) can be written as:

ℓ(Q, E) = ( Q(E(aug(s)), a) − r − γ V̄(E(aug(s′))) )²   (1)

where aug(·) denotes the augmented image, and r + γ V̄(·) is the TD-target. Equation 2 describes the exact computation of V̄ using slow-moving target networks Q̄ and the current policy π.
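As a concrete illustration of Equation 1, the sketch below applies a random shift of up to 4 pixels to a batch of images and computes the squared TD error on the augmented transition. It is a simplified sketch in the style of DrQ-v2 [52], not the released code: q_net, encoder and target_v are assumed interfaces, with target_v standing for the V̄ computation of Equation 2.

    import torch
    import torch.nn.functional as F

    def random_shift(imgs, pad=4):
        # Replicate-pad by `pad` pixels and crop back to the original size at a random offset.
        n, c, h, w = imgs.shape
        padded = F.pad(imgs, (pad, pad, pad, pad), mode='replicate')
        shifted = torch.empty_like(imgs)
        for i in range(n):
            top = torch.randint(0, 2 * pad + 1, (1,)).item()
            left = torch.randint(0, 2 * pad + 1, (1,)).item()
            shifted[i] = padded[i, :, top:top + h, left:left + w]
        return shifted

    def q_loss(q_net, encoder, target_v, batch, gamma=0.99):
        # Equation 1: (Q(E(aug(s)), a) - r - gamma * V_bar(E(aug(s'))))^2 on sampled transitions.
        s, a, r, s_next = batch['obs'], batch['action'], batch['reward'], batch['next_obs']
        z = encoder(random_shift(s))
        with torch.no_grad():
            td_target = r + gamma * target_v(encoder(random_shift(s_next)))
        return ((q_net(z, a) - td_target) ** 2).mean()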
B. Learning the Reward Function

To train a VICE classifier, we need to specify a set of goal states that can be used as positive samples. Instead of collecting the goal states separately, we use the last K states of every trajectory in Df∗ to create the goal set G. The trajectories collected by the robot's policy πf will be used to generate negative states for training Cf. The policy is trained to maximize −log(1 − Cf(·)) as the reward function, encouraging the policy to reach states that have a high probability of being labeled 1 under Cf, and thus, similar to the states in the set G. The reward signal from the classifier can be sparse if the classifier has high accuracy on distinguishing between the goal states and states visited by the policy. Since the classification problem for Cf is easier than the goal-matching problem for πf, especially early in the training when the policy is not as successful, it becomes critical to regularize the discriminator Cf [16]. We use spectral normalization [37] and mixup [54] to regularize Cf, and apply random crop and shift augmentations to input images during training to create a broader distribution for learning.
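A sketch of the mixup regularization [54] mentioned above, applied to the discriminator's cross-entropy loss: the classifier is trained on convex combinations of (state, label) pairs rather than on the raw expert/online batches. Spectral normalization would additionally be applied to the classifier's layers (e.g., with torch.nn.utils.spectral_norm), and image inputs would be passed through the same random crop and shift augmentation; classifier here is any module returning one logit per state, and the names are ours, not the authors'.

    import torch
    import torch.nn.functional as F

    def mixup_discriminator_loss(classifier, expert_states, online_states, alpha=1.0):
        x = torch.cat([expert_states, online_states], dim=0)
        y = torch.cat([torch.ones(expert_states.size(0), 1, device=x.device),
                       torch.zeros(online_states.size(0), 1, device=x.device)], dim=0)
        lam = torch.distributions.Beta(alpha, alpha).sample().to(x.device)
        perm = torch.randperm(x.size(0), device=x.device)
        x_mix = lam * x + (1.0 - lam) * x[perm]   # mixed inputs
        y_mix = lam * y + (1.0 - lam) * y[perm]   # soft labels in [0, 1]
        return F.binary_cross_entropy_with_logits(classifier(x_mix), y_mix)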
Since we assume access to expert demonstrations Df∗, why do we use VICE, which matches the policy's state distribution to the goal distribution, instead of GAIL [23, 31], which matches the policy's state-action distribution to that of the expert? In a practical robotic setup, actions demonstrated by an expert during teleoperation and optimal actions for a learned neural network policy will be different. The forward pass through a policy network introduces a delay, especially as the visual encoder E becomes larger. Matching both the states and actions to those of the expert, as is the case with GAIL, can lead to suboptimal policies and be infeasible in general. In contrast, VICE allows the robotic policies to choose actions that are different from the expert as long as they lead to a set of states similar to those in G. The exploratory benefits of matching the actions can still be recovered, as described in the next subsection.

Fig. 2. Visualizing the positive target states for the forward classifier Cf and backward classifier Cb from the expert demonstrations. For forward demonstrations, the last K states are used for Cf (orange) and the rest are used for Cb (pink). For backward demonstrations, the last K states are used for Cb.

C. Improving the Learning Efficiency

To improve the learning efficiency over MEDAL, we incorporate several changes in how we train the Q-value function and the policy π. First, we train an ensemble of Q-value networks {Qn}n=1..N and a corresponding set of target networks {Q̄n}n=1..N. When training an ensemble member Qn, the target is computed by sampling a subset of target networks and taking the minimum over the subset. The target value V̄(s′) in Eq 1 can be computed as

V̄(s′) = E_{a′∼π}[ min_{j∈M} Q̄j(s′, a′) ],   (2)

where M is a random subset of the index set {1, . . . , N} of size M. Randomizing the subset of the ensemble when computing the target allows more gradient steps to be taken to update Qn on ℓ(Qn, E) [6] without overfitting to a specific target value, increasing the overall sample efficiency of learning. The target networks Q̄n are updated as an exponential moving average of Qn in the weight space over the course of training. At iteration t, Q̄n(t) ← τ Qn(t) + (1 − τ) Q̄n(t−1), where τ ∈ (0, 1] determines how closely Q̄n tracks Qn.

Next, we leverage the expert demonstrations to optimize Q and π more efficiently. Q-value networks are typically updated on minibatches sampled uniformly from a replay buffer D. However, the transitions in the demonstrations are generated by an expert, and thus, can be more informative about the actions for reaching successful states [39]. To bias the data towards the expert distribution, we oversample transitions from the expert data such that for a batch of size B, ρB transitions are sampled from the expert data uniformly and (1 − ρ)B transitions are sampled from the replay buffer uniformly, for ρ ∈ [0, 1). Finally, we regularize the policy learning towards expert actions by introducing a behavior cloning loss in addition to maximizing the Q-values [42, 39]:

L(π) = E_{s∼D, a∼π(·|s)}[ (1/N) Σ_{n=1}^{N} Qn(E(aug(s)), a) ] + λ E_{(s∗,a∗)∼ρ∗}[ log π(a∗ | E(aug(s∗))) ],   (3)

where λ ≥ 0 denotes the hyperparameter controlling the effect of BC regularization. Note that the parameters of the encoder are frozen with respect to L(π), and are only trained through ℓ(Qn, E).
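The randomized-ensemble target of Equation 2 and the target-network update can be sketched as follows. The snippet assumes target_critics is a list of target Q-networks with a (features, action) to value interface and that policy(features) returns an action distribution; it mirrors the REDQ-style [6] min over a random subset and the exponential moving average, and omits details such as an entropy term in the target.

    import random
    import torch

    @torch.no_grad()
    def ensemble_target_v(target_critics, policy, next_features, subset_size=2):
        # Equation 2: V_bar(s') = E_{a'~pi}[ min_{j in M} Q_bar_j(s', a') ], with M a random subset.
        action = policy(next_features).sample()
        subset = random.sample(target_critics, k=subset_size)
        q_values = torch.stack([q(next_features, action) for q in subset], dim=0)
        return q_values.min(dim=0).values

    @torch.no_grad()
    def ema_update(target_net, online_net, tau=0.01):
        # Q_bar <- tau * Q + (1 - tau) * Q_bar, applied parameter-wise after each critic update.
        for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)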
D. Putting it Together: MEDAL++

Finally, we put together the components from previous sections to construct MEDAL++ for end-to-end autonomous reinforcement learning. MEDAL++ trains a forward policy that learns to solve the task and a backward policy that learns to undo the task towards the expert state distribution. The parameters and data buffers for the forward policy are represented by the tuple F ≡ (πf, Ef, {Qfn}n=1..N, {Q̄fn}n=1..N, Cf, Df∗, Df, Gf), where the symbols retain their meaning from the previous sections. Similarly, the parameters and data buffers for the backward policy are represented by the tuple B ≡ (πb, Eb, {Qbn}n=1..N, {Q̄bn}n=1..N, Cb, Db∗, Db, Gb). Noticeably, F and B have a similar structure: both πf and πb are trained with −log(1 − C(·)) as the reward function (using their respective classifiers Cf and Cb), with both classifiers trained to discriminate between the states visited by the policy and their target states. The primary difference is the set of positive target states Gf and Gb used to train Cf and Cb respectively, visualized in Figure 2. The VICE classifier Cf is trained to predict the last K states of every trajectory from Df∗ as positive, whereas we train the MEDAL classifier Cb to predict all the states of the forward demonstrations except the last K states as positive. Optionally, we can also include the last K states of the backward demonstrations from Db∗ as positives for training Cb.

The pseudocode for training is given in Algorithm 1. First, the parameters and data buffers in F and B are initialized and the forward and backward demonstrations are loaded into Df∗ and Db∗ respectively. Next, we update the forward and backward goal sets, as described above. After initializing the environment, the forward policy πf interacts with the environment and collects data, updating the networks and buffers in F. The control switches over to the backward policy πb after a fixed number of steps, and the networks and buffers in B are updated. The backward policy interacts for a fixed number of steps, after which the control is switched over to the forward policy, and this cycle is repeated thereon. When executing in the real world, humans are allowed to intervene and reset the environment intermittently, switching the control over to πf after the intervention to restart the forward-backward cycle.

Algorithm 1: MEDAL++
    initialize F, B                                        // forward, backward parameters
    F.Df∗, B.Db∗ ← load_demonstrations()
    F.Gf ← get_states(F.Df∗, −K:)                          // last K states of forward demos
    B.Gb ← get_states(F.Df∗, :−K) ∪ get_states(B.Db∗, −K:) // exclude last K states from Df∗, use only the last K states from Db∗
    s ∼ ρ0; A ← F                                          // initialize environment
    while not done do
        a ∼ A.act(s); s′ ∼ T(· | s, a)
        A.update_buffer({s, a, s′})
        A.update_classifier()
        A.update_parameters()
        if switch then                                     // switch policy after a fixed interval
            switch(A, (F, B))
        if interrupt then                                  // allow intermittent human interventions
            s ∼ ρ0; A ← F
        else
            s ← s′

Fig. 3. An overview of MEDAL++ training. The classifier is trained to discriminate states visited by an expert from the states visited online. The robot learns via reinforcement learning on a combination of self-collected and expert transitions, and the policy learning is regularized using the behavior cloning loss.

We now expand on how the networks are updated for πf during training (also visualized in Figure 3); the updates for πb are analogous. First, the new transition in the environment is added to Df. Next, we sample a batch of states from Df and label them 0, and sample a batch of equal size from Df∗ and label them 1. The classifier Cf is updated using gradient descent on the combined batch to minimize the cross-entropy loss. Note, the classifier is not updated for every step collected in the environment. As stated earlier, the classification problem is easier than learning the policy, and therefore, it helps to train the classifier slower than the policy. Finally, the policy πf, the Q-value networks {Qfn, Q̄fn}n=1..N and the encoder E are updated on a batch of transitions constructed by sampling (1 − ρ)B transitions from Df and ρB transitions from Df∗. The Q-value networks and the encoder are updated by minimizing (1/N) Σ_{n=1}^N ℓ(Qn, E) (Eq 1), and the target Q-networks are updated as an exponential moving average of the Q-value networks. The policy πf is updated by maximizing L(π). We update the Q-value networks multiple times for every step collected in the environment, whereas the policy network is updated once for every step collected in the environment [6].
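The oversampled batch construction and the BC-regularized policy update of Equation 3 can be sketched as below: the batch concatenates (1 − ρ)B replay transitions with ρB demonstration transitions, and the policy loss adds λ log π(a∗|s∗) on expert pairs to the mean ensemble Q-value, with encoder outputs detached so that L(π) does not update the encoder. policy(features) is assumed to return a distribution with rsample() and log_prob(); all names are illustrative, not from the released code.

    import torch

    def oversampled_batch(replay_batch, demo_batch):
        # Concatenate (1 - rho) * B replay transitions with rho * B expert transitions.
        return {k: torch.cat([replay_batch[k], demo_batch[k]], dim=0) for k in replay_batch}

    def policy_loss(policy, critics, encoder, obs, demo_obs, demo_actions, bc_weight=0.1):
        # Equation 3, negated so that gradient descent maximizes the objective.
        features = encoder(obs).detach()
        actions = policy(features).rsample()
        q_mean = torch.stack([q(features, actions) for q in critics], dim=0).mean(dim=0)
        demo_features = encoder(demo_obs).detach()
        bc_term = policy(demo_features).log_prob(demo_actions).sum(dim=-1)
        return -(q_mean.mean() + bc_weight * bc_term.mean())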
V. EXPERIMENTS

The goal of our experiments is to determine whether MEDAL++ can be a practical method for self-improving robotic systems. First, we modify EARL [45], a benchmark for non-episodic RL, to return RGB observations instead of low-dimensional state. We benchmark MEDAL++ against competitive methods [46, 57] to evaluate the learning efficiency from high-dimensional observations, in Section V-A. Our primary experiments in Section V-B evaluate MEDAL++ on four real robot manipulation tasks, primarily tasks with soft-body objects such as hanging a cloth on a hook and covering a bowl with cloth. The real robot evaluation considers the question of whether self-improvement is feasible via MEDAL++, and if so, how much self-improvement can MEDAL++ obtain? Finally, we run ablations to evaluate the contributions of different components to MEDAL++ in Section V-C.

A. Benchmarking MEDAL++ on EARL

First, we benchmark MEDAL++ on continuous-control environments from EARL against state-of-the-art non-episodic autonomous RL methods. To be consistent with the benchmark, we use the ground truth reward functions for all the environments.

Environments. We consider three sparse-reward continuous-control environments from the EARL benchmark [45], shown in Appendix, Fig 8. Tabletop organization is a simplified manipulation environment where a gripper is tasked to move the mug to one of the four coasters from a wide set of initial states, the sawyer door closing task requires a sawyer robot arm to learn how to close a door starting from various positions, and finally the sawyer peg insertion task requires the sawyer robot arm to grasp the peg and insert it into a goal. Not only does the robot have to learn how to do the task (i.e., close the door or insert the peg), but it has to learn how to undo the task (i.e., open the door or remove the peg) to try the task repeatedly in the non-episodic training environment. The sparse reward function is given by r(s, a) = 1(‖s − g‖ ≤ ε), where g denotes the goal, and ε is the tolerance for the task to be considered completed.

Training and Evaluation. The environments are set up to return 84×84 RGB images as observations, with a 3-dimensional action space for the tabletop organization (2D end-effector deltas in the XY plane and 1D for the gripper) and a 4-dimensional action space for the sawyer environments (3D end-effector delta control + 1D gripper). The training environment is reset to s0 ∼ ρ0 every 25,000 steps of interaction with the environment.
Fig. 4. Comparison of autonomous RL methods on vision-based manipulation tasks in simulated environments from EARL [45]. MEDAL++ is both more
efficient and learns a similarly or more successful policy compared to other methods.

This is extremely infrequent compared to episodic settings where the environment is reset to the initial state distribution every 200-1000 steps. EARL comes with 5-15 forward and backward demonstrations for every environment to help with exploration in these sparse reward environments. We evaluate the forward policy every 10,000 training steps, where the evaluation approximates E_{s0∼ρ0}[Σ_{t=0}^∞ γ^t r(st, at)] by averaging the return of the policy over 10 episodes starting from s0 ∼ ρ0. These roll-outs are used only for evaluation, and not for training.

Comparisons. We compare MEDAL++ to four methods: (1) MEDAL [46] uses a backward policy that matches the expert state distribution by minimizing JS(ρb(s) || ρ∗(s)), similar to ours. However, the method is designed for low-dimensional states and policy/Q-value networks and cannot be naïvely extended to RGB observations. For a better comparison, we improve the method to use a visual encoder with random crop and shift augmentations during training, similar to MEDAL++. (2) R3L [57] uses a perturbation controller as the backward policy, which optimizes for state-novelty computed using random network distillation [4]. Unlike our method, R3L also requires a separately collected dataset of environment observations to pre-train a VAE [29] based visual encoder, which is frozen throughout the training thereafter. (3) We consider an oracle RL method that trains just a forward policy and gets a privileged training environment that resets every 200 steps (i.e., the same episode length as during evaluation), and finally, (4) we consider a control method, naïve RL, that similar to oracle trains just a forward policy, but resets every 25,000 steps similar to the non-episodic methods. We additionally report the performance of a behavior cloning policy, trained on the forward demonstrations used in the EARL environments. The implementation details and hyperparameters can be found in Appendix A.

Results. Figure 4 plots the evaluation performance of the forward policy versus the training samples collected in the environment. MEDAL++ outperforms all other methods on both the sawyer environments, and is comparable to MEDAL on tabletop organization, the best performing method. While R3L does recover a non-trivial performance eventually on door closing and tabletop organization, the novelty-seeking perturbation controller can cause the robot to drift to states farther away from the goal in larger environments, leading to slower improvement in evaluation performance on states starting from s0 ∼ ρ0. While MEDAL and MEDAL++ have the same objective for the backward policy, optimization related improvements enable MEDAL++ to learn faster. Note, BC performs worse on the tabletop organization environment with a 45% success rate, compared to the sawyer environments with 70% and 80% success rates on peg insertion and door closing respectively. So, while BC regularization helps speed up efficiency and can lead to better policies, it can hurt the final performance of MEDAL++ if the BC policy itself has a worse success rate (at least, when true rewards are available for training, see ablations in Section V-C). While we use the same hyperparameters for all environments, reducing the weight on BC regularization when BC policies have poor success can reduce the bias in policy learning and improve the final performance.

B. Real Robot Evaluations

In line with the main goal of this paper, our experiments aim to evaluate whether self-improvement through MEDAL++ can enable real robots to learn more competent policies autonomously. On four manipulation tasks, we provide a quantitative and qualitative comparison of the policy learned by behavior cloning on the expert data to the one learned after self-improvement by MEDAL++. We recommend viewing the results on our website for a more complete overview: https://architsharma97.github.io/self-improving-robots/.

Robot Setup and Tasks. We use a Franka Emika Panda arm with a Robotiq 2F-85 gripper for all our experiments. We use an RGB camera mounted on the wrist and a third-person camera, as shown in Figure 6. The final observation space includes two 100 × 100 RGB images, the 3-dimensional end-effector position, orientation along the z-axis, and the width of the gripper. The action space is set up as either 4 DoF end-effector control, or 5 DoF end-effector control with orientation along the z-axis, depending on the task (including one degree of freedom for the gripper). Our evaluation suite consists of four manipulation tasks: grasping a cube, hanging a cloth on a hook, covering a bowl with a piece of cloth, and a (soft) peg insertion.
Fig. 5. An overview of MEDAL++ on the task of inserting the peg into the goal location. (top) Starting with a set of expert trajectories, MEDAL++ learns
a forward policy to insert the peg by matching the goal states and a backward policy to remove and randomize the peg position by matching the rest of the
states visited by an expert. (bottom) Chaining the rollouts of forward and backward policies allows the robot to practice the task autonomously. The rewards
indicate the similarity to their respective target states, output by a discriminator trained to classify online states from expert states.

Real world data and training is more pertinent for soft-body manipulation as such objects are harder to simulate, and thus, we emphasize those tasks in our evaluation suite. The tasks are shown in Figure 6, and we provide task-specific details along with the analysis.

Training and Evaluation. For every task, we first collect a set of 50 forward demonstrations and 50 backward demonstrations using an Xbox controller. We chain the forward and backward demonstrations to speed up collection and better approximate autonomous training thereafter. After collecting the demonstrations, the robot is trained for 10-30 hours using MEDAL++ as described in Section IV-D, collecting about 300,000 environment transitions in the process. For the first 30 minutes of training, we reset the environment to create enough (object) diversity in the initial data collected in the replay buffer. After the initial collection, the environment is reset intermittently, approximately every hour of real world training on average, though it is left unattended for several hours. More details related to hyperparameters, network architecture and training can be found in Appendix A. For evaluation, we roll out the policy from varying initial states, and measure the success rate over 50 evaluations. To isolate the role of self-improvement, we compare the performance to a behavior cloning policy trained on the forward demonstrations using the same network architecture for the policy as MEDAL++. For both MEDAL++ and BC, we evaluate multiple intermediate checkpoints and report the success rate of the best performing checkpoint.

Results. The success rate of the best performing BC policy and MEDAL++ policy is reported in Table 6. MEDAL++ substantially increases the success rate of the learned policies, with approximately 30-70% improvements. We expand on the tasks and analyze the performance on each of them:

(1) Cube Grasping: The goal in this task is to grasp the cube from varying initial positions and configurations and raise it. For this task, we consider a controlled setting to isolate one potential source of improvement from autonomous reinforcement learning: robustness to the initial state distribution. Specifically, all the forward demonstrations are collected starting from a narrow set of initial states (ID), but the robot is evaluated starting from both ID states and out-of-distribution (OOD) states, visualized in Appendix, Figure 9. The BC policy is competent on ID states, but it performs poorly on states that are OOD. However, after autonomous self-improvement using MEDAL++, we see an improvement of 15% on ID performance, and a large improvement of 74% on OOD performance. Autonomous training allows the robot to practice the task from a diverse set of states, including states that were OOD relative to the demonstration data. This suggests that the improvement in success rate results partly from being robust to the initial state distribution, as a small set of demonstrations is unlikely to cover all possible initial states a robot can be evaluated from.

(2) Cloth on the Hook: In this task, the robot is tasked with grasping the cloth and putting it through a fixed hook. To practice the task repeatedly, the backward policy has to remove the cloth from the hook and drop it on the platform. Here, MEDAL++ improves the success rate over BC by 36%. The BC policy has several failure modes: (1) it fails to grasp the cloth, (2) it follows through with hooking because of memorization, or (3) it hits into the hook because it drifts from the right trajectory and cannot recover. Autonomous self-improvement improves upon all these issues, but particularly, it learns to re-try grasping the cloth if it fails the first time, rather than following a memorized trajectory observed in the forward demonstrations.

(3) Bowl Covering with Cloth: The goal of this task is to cover a bowl entirely using the cloth. The cloth can be in a wide variety of initial states, ranging from 'laid out flat' to 'scrunched up' in varying locations. The task is challenging as the robot has to grasp the cloth at the correct location to successfully cover the entire bowl (partial coverage is counted as a failure). Here, MEDAL++ improves the performance over BC by 34%. The failure modes of BC are similar to the previous task, including failure to grasp, memorization and failure to re-try, and incomplete coverage due to a wrong initial grasp. Autonomous self-improvement substantially helps with the grasping (including re-trying) and issues related to memorization. While it plans the grasps better than BC, there is room for improvement to reduce failures resulting from partially covering the bowl.
(4) Peg Insertion: Finally, we consider the task of inserting a peg into a goal location. The location and orientation of the peg is randomized, in service of which we use 5 DoF control for this task. A successful insertion requires the toy to be perpendicular to the goal before insertion, and the error margin for a successful insertion is small given the size of the peg and the goal. Additionally, since the peg here is a soft toy, it can be grasped while being in the wrong orientation. Here, MEDAL++ improves the performance by 48% over BC. In addition to the failures described in the previous tasks, a common cause of failure is the insertion itself, where the agent takes an imprecise trajectory and is unable to insert the peg. After autonomous self-improvement, the robot employs an interesting strategy where it keeps retrying the insertion till it succeeds. The policy is also better at grasping, though the failures of insertion often result from orienting the gripper incorrectly before the grasp, which makes insertion infeasible.

The supplemental website features training timelapses showing the autonomous practice for the tasks above and evaluation trials for policies learned by both behavior cloning and MEDAL++. Overall, we observe that not only is MEDAL++ feasible to run in the real world with minimal task engineering, but it can also substantially improve the policy from autonomous data collection.

Fig. 6. (top) The training setup for MEDAL++. The image observations include a fixed third person view and a first person view from a wrist camera mounted above the gripper. The evaluation tasks, going clockwise: cube grasping, covering a bowl with a cloth, hanging a cloth on the hook, and peg insertion. (bottom) Evaluation performance of the best checkpoint learned by behavior cloning and MEDAL++. The table shows the final success rates over 50 trials from randomized initial states, normalized to [0, 1]. MEDAL++ substantially improves the performance over behavior cloning, validating the feasibility of self-improving robotic systems.

Task                  Behavior Cloning    MEDAL++
Cube Grasping (ID)    0.85                1.00
Cube Grasping (OOD)   0.08                0.82
Cloth Hanging         0.26                0.62
Bowl Cloth Cover      0.12                0.46
Peg Insertion         0.04                0.52

C. Ablations

Finally, we consider ablations on the simulated environments to understand the contributions of each component of MEDAL++. We benchmark four variants on the tabletop organization and peg insertion tasks in Figure 7: (1) MEDAL++, (2) MEDAL++ with the true reward function instead of the learned VICE reward, (3) MEDAL++ without the ensemble of Q-value functions, but using SAC [21], and (4) MEDAL++ without both BC-regularization and oversampling expert transitions for training the Q-value functions. While there is some room for improvement, MEDAL++ with the learned reward function approximately recovers the performance obtained with the true rewards. Both the ensemble of Q-values and BC-regularization with oversampled expert transitions improve the performance, though the latter makes a larger contribution to the improvement in performance. Note, when using the true rewards, BC-regularization/oversampling expert transitions can hurt the final performance (as discussed in Section V-A). However, when using learned rewards, they both become more important for better performance. We hypothesize that the signal from the learned reward function becomes noisier, making other components important for efficient learning and better final performance.

Fig. 7. Ablation identifying contributions from different components of MEDAL++. Improvements from BC regularization and oversampled expert transitions are important for learning efficiency and final performance.

VI. DISCUSSION

We proposed MEDAL++, a method for learning autonomously from high-dimensional image observations without engineered reward functions or human oversight for repeated resetting of the environment. MEDAL++ takes a small set of forward and backward demonstrations as input, and autonomously practices the task to improve the learned policy, as evidenced by comparison with behavior cloning policies trained on just the demonstrations.
Limitations and Future Work: While the results of MEDAL++ are promising, autonomous robotic reinforcement learning remains a challenging problem. Real robot data collection is slow, even when autonomous. Improving the speed of data collection and learning efficiency can yield faster improvement and better policies. While the control frequency is 10 Hz, training data is collected at approximately 3.5 Hz because network updates and collection steps are done sequentially. Parallelizing data collection and training and making it asynchronous can substantially increase the amount of data collected. Similarly, several further improvements can improve the learning efficiency: sharing the visual encoder, and more generally, the environment transitions between the forward and backward policies, using better network architectures, and better algorithms designed specifically for learning autonomously can improve sample efficiency. While our proposed system substantially improves the autonomy, intermittent human interventions to reset the environment can be important to learn successfully. The robotic system can get stuck in a specific state when collecting data autonomously due to poor exploration, even if that state itself is not irreversible. Human interventions ensure that the data collected in the replay buffer has sufficient diversity, which can be important for the stability of RL training. Developing and using better methods for exploration, pretraining on more offline data, or more stable optimization can further reduce human interventions in training.

Overall, self-improving robots are an exciting frontier that can enable robots to collect ever larger amounts of interaction data.

VII. ACKNOWLEDGEMENTS

We would like to acknowledge Tony Zhao, Sasha Khazatsky and Suraj Nair for help with setting up robot tasks and the control stack, Eric Mitchell, Joey Hejna, and Suraj Nair for feedback on an early draft, Abhishek Gupta for valuable conceptual discussion, and members of IRIS and SAIL for listening to AS drone about this project on several occasions, in personal and group meetings. This project was funded by ONR grants N00014-20-1-2675 and N00014-21-1-2685, and Schmidt Futures.

REFERENCES

[1] Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. Advances in Neural Information Processing Systems, 29, 2016.
[2] Michael Bloesch, Jan Humplik, Viorica Patraucean, Roland Hafner, Tuomas Haarnoja, Arunkumar Byravan, Noah Yamamoto Siegel, Saran Tunyasuvunakool, Federico Casarini, Nathan Batchelor, et al. Towards real robot learning in the wild: A case study in bipedal locomotion. In Conference on Robot Learning, pages 1502–1511. PMLR, 2022.
[3] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
[4] Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network distillation. arXiv preprint arXiv:1810.12894, 2018.
[5] Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, and Sergey Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In International Conference on Machine Learning, pages 703–711. PMLR, 2017.
[6] Xinyue Chen, Che Wang, Zijian Zhou, and Keith Ross. Randomized ensembled double Q-learning: Learning fast without a model, 2021. URL https://arxiv.org/abs/2101.05982.
[7] Frederik Ebert, Sudeep Dasari, Alex X. Lee, Sergey Levine, and Chelsea Finn. Robustness via retrying: Closed-loop robotic manipulation with self-supervised learning. In Aude Billard, Anca Dragan, Jan Peters, and Jun Morimoto, editors, Proceedings of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning Research, pages 983–993. PMLR, 2018. URL https://proceedings.mlr.press/v87/ebert18a.html.
[8] Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning, 2017. URL https://arxiv.org/abs/1711.06782.
[9] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017.
[10] Chelsea Finn, Sergey Levine, and Pieter Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58. PMLR, 2016.
[11] Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, and Pieter Abbeel. Deep spatial autoencoders for visuomotor learning. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 512–519. IEEE, 2016.
[12] Justin Fu, Katie Luo, and Sergey Levine. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.
[13] Justin Fu, Avi Singh, Dibya Ghosh, Larry Yang, and Sergey Levine. Variational inverse control with events: A general framework for data-driven reward definition, 2018. URL https://arxiv.org/abs/1805.11686.
[14] Ali Ghadirzadeh, Atsuto Maki, Danica Kragic, and Mårten Björkman. Deep predictive policy training using reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2351–2358. IEEE, 2017.
[15] Seyed Kamyar Seyed Ghasemipour, Richard Zemel, and Shixiang Gu. A divergence minimization perspective on imitation learning methods. In Conference on Robot Learning, pages 1259–1277. PMLR, 2020.
[16] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014. URL https://arxiv.org/abs/1406.2661.
[17] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3389–3396. IEEE, 2017.
[18] Abhishek Gupta, Justin Yu, Tony Z. Zhao, Vikash Kumar, Aaron Rovinsky, Kelvin Xu, Thomas Devlin, and Sergey Levine. Reset-free reinforcement learning via multi-task learning: Learning dexterous manipulation behaviors without human intervention, 2021. URL https://arxiv.org/abs/2104.11203.
[19] Abhishek Gupta, Corey Lynch, Brandon Kinman, Garrett Peake, Sergey Levine, and Karol Hausman. Bootstrapped autonomous practicing via multi-task reinforcement learning. arXiv preprint arXiv:2203.15755, 2022.
[20] Tuomas Haarnoja, Sehoon Ha, Aurick Zhou, Jie Tan, George Tucker, and Sergey Levine. Learning to walk via deep reinforcement learning. arXiv preprint arXiv:1812.11103, 2018.
[21] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018.
[22] Weiqiao Han, Sergey Levine, and Pieter Abbeel. Learning compound multi-step controllers under unknown dynamics. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2015, Hamburg, Germany, September 28 - October 2, 2015, pages 6435–6442. IEEE, 2015. doi: 10.1109/IROS.2015.7354297. URL https://doi.org/10.1109/IROS.2015.7354297.
[23] Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning, 2016. URL https://arxiv.org/abs/1606.03476.
[24] Kyle Hsu, Moo Jin Kim, Rafael Rafailov, Jiajun Wu, and Chelsea Finn. Vision-based manipulators need to also see from their hands. arXiv preprint arXiv:2203.12677, 2022.
[25] Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022.
[26] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML '02, pages 267–274, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. ISBN 1558608737.
[27] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation, 2018. URL https://arxiv.org/abs/1806.10293.
[28] Dmitry Kalashnikov, Jacob Varley, Yevgen Chebotar, Benjamin Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. MT-Opt: Continuous multi-task robotic reinforcement learning at scale, 2021. URL https://arxiv.org/abs/2104.08212.
[29] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[30] Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
[31] Ilya Kostrikov, Kumar Krishna Agrawal, Debidatta Dwibedi, Sergey Levine, and Jonathan Tompson. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning, 2018. URL https://arxiv.org/abs/1809.02925.
[32] Sascha Lange, Martin Riedmiller, and Arne Voigtländer. Autonomous reinforcement learning on raw visual input data in a real world application. In The 2012 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2012.
[33] Michael Laskin, Aravind Srinivas, and Pieter Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pages 5639–5650. PMLR, 2020.
[34] Alex X Lee, Anusha Nagabandi, Pieter Abbeel, and Sergey Levine. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. Advances in Neural Information Processing Systems, 33:741–752, 2020.
[35] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
[36] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Reset-free lifelong learning with skill-space planning. arXiv preprint arXiv:2012.03548, 2020.
[37] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
[38] Anusha Nagabandi, Kurt Konoglie, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation, 2019. URL https://arxiv.org/abs/1909.11652.
[39] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE, 2018.
[40] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing Systems, 29, 2016.
[41] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3406–3413. IEEE, 2016.
[42] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
[43] Archit Sharma, Michael Ahn, Sergey Levine, Vikash Kumar, Karol Hausman, and Shixiang Gu. Emergent real-world robotic skills via unsupervised off-policy reinforcement learning. arXiv preprint arXiv:2004.12974, 2020.
[44] Archit Sharma, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Autonomous reinforcement learning via subgoal curricula. Advances in Neural Information Processing Systems, 34:18474–18486, 2021.
[45] Archit Sharma, Kelvin Xu, Nikhil Sardana, Abhishek Gupta, Karol Hausman, Sergey Levine, and Chelsea Finn. Autonomous reinforcement learning: Formalism and benchmarking. International Conference on Learning Representations (ICLR), 2021. URL https://arxiv.org/abs/2112.09605.
[46] Archit Sharma, Rehaan Ahmad, and Chelsea Finn. A state-distribution matching approach to non-episodic reinforcement learning. In International Conference on Machine Learning, pages 19645–19657. PMLR, 2022. URL https://arxiv.org/abs/2205.05212.
[47] Avi Singh, Larry Yang, Kristian Hartikainen, Chelsea Finn, and Sergey Levine. End-to-end robotic reinforcement learning without reward engineering. ArXiv, abs/1904.07854, 2019.
[48] Laura Smith, J Chase Kew, Xue Bin Peng, Sehoon Ha, Jie Tan, and Sergey Levine. Legged robots that keep on learning: Fine-tuning locomotion policies in the real world. In 2022 International Conference on Robotics and Automation (ICRA), pages 1593–1599. IEEE, 2022.
[49] Homer Walke, Jonathan Yang, Albert Yu, Aviral Kumar, Jedrzej Orbik, Avi Singh, and Sergey Levine. Don't start from scratch: Leveraging prior data to automate robotic reinforcement learning, 2022. URL https://arxiv.org/abs/2207.04703.
[50] Annie Xie, Fahim Tajwar, Archit Sharma, and Chelsea Finn. When to ask for help: Proactive interventions in autonomous reinforcement learning. Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/abs/2210.10765.
[51] Kelvin Xu, Siddharth Verma, Chelsea Finn, and Sergey Levine. Continual learning of control primitives: Skill discovery via reset-games, 2020. URL https://arxiv.org/abs/2011.05286.
[52] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning, 2021. URL https://arxiv.org/abs/2107.09645.
[53] Andy Zeng, Shuran Song, Johnny Lee, Alberto Rodriguez, and Thomas Funkhouser. TossingBot: Learning to throw arbitrary objects with residual physics, 2019. URL https://arxiv.org/abs/1903.11239.
[54] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization, 2017. URL https://arxiv.org/abs/1710.09412.
[55] Henry Zhu, Abhishek Gupta, Aravind Rajeswaran, Sergey Levine, and Vikash Kumar. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost, 2018. URL https://arxiv.org/abs/1810.06045.
[56] Henry Zhu, Abhishek Gupta, Aravind Rajeswaran, Sergey Levine, and Vikash Kumar. Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. In 2019 International Conference on Robotics and Automation (ICRA), pages 3651–3657. IEEE, 2019.
[57] Henry Zhu, Justin Yu, Abhishek Gupta, Dhruv Shah, Kristian Hartikainen, Avi Singh, Vikash Kumar, and Sergey Levine. The ingredients of real-world robotic reinforcement learning, 2020. URL https://arxiv.org/abs/2004.12570.
Fig. 8. Environments from the EARL benchmark [45] used for simulated experiments. From left to right, the environments are: Peg insertion, Door closing
and Tabletop organization.

Fig. 9. (left) Randomized position of the cube in the grasping task. The positions marked by the violet boundary are within the distribution of expert demonstrations, and the rest are outside the distribution. (right) Architecture overview for MEDAL++.

APPENDIX
A. Implementation Details and Practical Tips
An overview of the architecture used by the forward and backward networks is shown in Figure 9.

Visual Encoder: For the encoder, we use the same architecture as DrQ-v2 [52]: 4 convolutional layers with 32 filters of size (3,
3), stride 1, followed by ReLU non-linearities. The high-dimensional output from the CNN is embedded into a 50 dimensional
feature using a fully-connected layer, followed by LayerNorm and tanh non-linearity (to output the features normalized to
[−1, 1]). For real-robot experiments, the first person and third person views are concatenated channel wise before being passed
into the encoder. The output of the encoder is fused with proprioceptive information, in this case, the end-effector position,
before being passed to actor and critic networks.
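As a reference point, a minimal PyTorch sketch of an encoder matching this description is given below. The layer sizes follow the text (four 3x3 convolutions with 32 filters and stride 1, a 50-dimensional feature with LayerNorm and tanh, channel-wise concatenation of the two camera views); the class name, padding choices and fusion details are ours and may differ from the released implementation.

    import torch
    import torch.nn as nn

    class VisualEncoder(nn.Module):
        # Sketch of the DrQ-v2-style encoder described above.
        def __init__(self, in_channels=6, img_size=100, feature_dim=50):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_channels, 32, 3, stride=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
                nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            )
            with torch.no_grad():  # infer the flattened size for the given image resolution
                n_flat = self.conv(torch.zeros(1, in_channels, img_size, img_size)).flatten(1).shape[1]
            self.proj = nn.Sequential(nn.Linear(n_flat, feature_dim),
                                      nn.LayerNorm(feature_dim), nn.Tanh())

        def forward(self, imgs, proprio=None):
            z = self.proj(self.conv(imgs).flatten(1))
            # Fuse with proprioceptive inputs (e.g., end-effector position) before actor/critic.
            return z if proprio is None else torch.cat([z, proprio], dim=-1)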
Actor and Critic Networks: Both actor and critic networks are parameterized as 4 layer fully-connected networks with 1024
ReLU hidden units for every layer. The actor parameterizes a Gaussian distribution over the actions, where a tanh non-linearity
on the output restricts the actions to [−1, 1]. We use an ensemble size of 10 critics.
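A matching sketch of the actor head is shown below: a 4-layer MLP with 1024 ReLU units that outputs the mean and log-standard-deviation of a Gaussian, with a tanh squashing sampled actions into [−1, 1]. The log-std clamping range and the sampling helper are our choices for illustration, not taken from the paper; the critics use the same MLP trunk but map (feature, action) pairs to scalar Q-values.

    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        # Sketch of the actor network described above.
        def __init__(self, feature_dim, action_dim, hidden_dim=1024, n_layers=4):
            super().__init__()
            layers, dim = [], feature_dim
            for _ in range(n_layers):
                layers += [nn.Linear(dim, hidden_dim), nn.ReLU()]
                dim = hidden_dim
            self.trunk = nn.Sequential(*layers)
            self.mean = nn.Linear(hidden_dim, action_dim)
            self.log_std = nn.Linear(hidden_dim, action_dim)

        def forward(self, feature):
            h = self.trunk(feature)
            return self.mean(h), self.log_std(h).clamp(-5.0, 2.0)

        def sample(self, feature):
            mean, log_std = self(feature)
            eps = torch.randn_like(mean)
            return torch.tanh(mean + eps * log_std.exp())  # action in [-1, 1]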
Discriminators: The discriminators for the forward and backward policies use a similar visual encoder but with 2 layers instead
of 4. The visual embedding is passed to a fully connected network with 2 hidden layers with 256 ReLU units. When training
the network, we use mixup and spectral norm regularization [54, 37] for the entire network.
Training Hyperparameters: For all our experiments, K = 20, i.e. the number of frames used as goal frames. The forward policy
interacts with the environment for 200 steps, then the backward policy interacts for 200 steps. In real world experiments, we also
reset the arm every 1000 steps to avoid hitting singular positions. Note, this reset does not require any human intervention as the
controller just resets the arm to a fixed joint position. We use a batch size of 256 to train the policy and critic networks, out of
which 64 transitions are sampled from the demonstrations (oversampling). We use a batch size of 512 to train the discriminators,
256 of the states come from expert data and the other 256 come from the online data. Further, the discriminators are updated every 1000 steps collected in the environment. The update-to-data ratio, that is, the number of gradient updates per transition collected in the environment, is 3 for the simulated environments and 1 for the real-robot experiments. We use a linearly decaying schedule for the behavior cloning regularization weight, from 1 to 0.1 over the first 50000 steps, after which it remains fixed at 0.1 for the rest of training.
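For convenience, the hyperparameters above can be collected into a small configuration object together with the linear decay schedule for the BC weight; the field names are ours, and the values simply restate the text.

    from dataclasses import dataclass

    @dataclass
    class MedalPPConfig:
        goal_frames_k: int = 20            # last K states of each demonstration used as goals
        policy_horizon: int = 200          # steps before switching forward <-> backward
        rl_batch_size: int = 256           # policy/critic batch size
        demo_transitions_per_batch: int = 64
        disc_batch_size: int = 512         # 256 expert states + 256 online states
        disc_update_every: int = 1000      # environment steps between discriminator updates
        utd_ratio_sim: int = 3             # gradient updates per environment step (simulation)
        utd_ratio_real: int = 1            # gradient updates per environment step (real robot)

    def bc_weight_schedule(step, start=1.0, end=0.1, decay_steps=50_000):
        # Linearly decay the BC weight from 1 to 0.1 over the first 50k steps, then hold at 0.1.
        if step >= decay_steps:
            return end
        return start + (end - start) * (step / decay_steps)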
For real world experiments, we use a wrist camera to improve the overall performance [24], and provide only the wrist-camera view to both discriminators. We find that this further regularizes the discriminators. Finally, we provide no proprioceptive information to the VICE discriminator, but we give the MEDAL discriminator the proprioceptive information, as it needs a stronger notion of the robot's localization to adequately reset to a varied set of initial positions for improved robustness.
Teleoperation: To collect our demonstrations on the real robot, we use an Xbox controller that manipulates the end-effector
position, orientation and the gripper state. Two salient notes: (1) The forward and backward demonstrations are collected
together, one after the other and (2) the initial position for demonstrations is randomized to cover as large a state-space as
feasible. The increased coverage helps with exploration during autonomous training.
