Self-Improving Robots
robots: robots that can learn and improve on their own, from autonomous interaction with minimal human supervision or oversight. Such robots could collect and train on much larger datasets, and thus learn more robust and performant policies. While reinforcement learning offers a framework for such autonomous learning via trial-and-error, practical realizations end up requiring extensive human supervision for reward function design and repeated resetting of the environment between episodes of interaction. In this work, we propose MEDAL++, a novel design for self-improving robotic systems: given a small set of expert demonstrations at the start, the robot autonomously practices the task by learning to both do and undo the task, simultaneously inferring the reward function from the demonstrations. The policy and reward function are learned end-to-end from high-dimensional visual inputs, bypassing the need for explicit state estimation or task-specific pre-training of visual encoders used in prior work. We first evaluate our proposed algorithm on EARL, a simulated non-episodic benchmark, finding that MEDAL++ is both more data efficient and gets up to 30% better final performance compared to state-of-the-art vision-based methods. Our real-robot experiments show that MEDAL++ can be applied to manipulation problems in larger environments than those considered in prior work, and autonomous self-improvement can improve the success rate by 30-70% over behavior cloning on just the expert data. Code, training and evaluation videos, along with a brief overview, are available at: https://architsharma97.github.io/self-improving-robots/

Fig. 1. A robot resets the environment from the goal state to the initial state (top), in contrast to a human resetting the environment for the robot (bottom). While the latter is the norm in robotic reinforcement learning, a robot that can reset the environment and practice the task autonomously can train on more data, and thus, be more competent.
I. INTRODUCTION

To be useful in natural unstructured environments, robots have to be competent on a large set of tasks. While imitation learning methods have shown promising evidence for generalization via large-scale teleoperated data collection efforts [25, 3], human supervision is expensive and the collected datasets are still incommensurate for learning robust and broadly performant control. In this context, the aspirational notion of self-improving robots becomes relevant: robots that can learn and improve from their own interactions with the environment autonomously. Reinforcement learning (RL) is a natural framework for such self-improvement, where the robots can learn from trial-and-error. However, deploying RL algorithms has several prerequisites that are time-intensive and require domain expertise: state estimation, designing reward functions, and repeated resetting of the environment after every episode of interaction, impeding the dream of self-improving robotic systems.

Some of these challenges have been addressed by prior work, for example, learning end-to-end visuomotor policies [35, 52] and learning reward functions [10, 23, 47]. More recent works have started addressing the requirement of repeated environment resets, demonstrating that complex behaviors can be learned from autonomous interaction in simulation [22, 8, 46] and on real robots [18, 19, 49]. However, these works learn from low-dimensional state and require engineered reward functions. While R3L [57] shows an autonomous real-robot system that learns from visual inputs without reward engineering, the results have been restricted to smaller and easier-to-explore environments due to the use of a state-novelty based perturbation controller [45]. A practical real-robot system that learns end-to-end from visual inputs autonomously without extensive task-specific engineering has been elusive.

A key challenge to autonomous policy learning is that of exploration, especially as environments grow larger. Not only is it hard to learn how to solve tasks without engineered reward functions, but in the absence of frequent resetting, the robot can reach states that are farther away and harder to recover from. An effective choice to construct self-improving systems can be to use a small set of demonstrations to alleviate challenges related to exploration [39]. And since the human supervision required for collecting the demonstrations is front-loaded, i.e., before the training begins, the robot can
collect data autonomously and self-improve thereon. With this motivation, we build on MEDAL [46], an efficient and simple autonomous RL algorithm that uses a small set of expert demonstrations collected prior to training. MEDAL trains a forward policy that learns to do the task and a backward policy that matches the distribution of states visited by the expert when undoing the task. The states visited by an expert can be an efficient initial distribution to learn the forward policy from, shown theoretically [26] and empirically [46]. However, MEDAL trains on low-dimensional states and requires explicit reward functions, making it incompatible with real-world training.

We design MEDAL++, an autonomous RL algorithm that is feasible and efficient to train in the real world with minimal task-specific engineering. MEDAL++ has several crucial components that enable such real-world training: First, we learn an encoder for high-dimensional visual inputs end-to-end along the lines of DrQ-v2 [52], bypassing the need for state estimation or task-specific pre-training of visual encoders used in prior works. Second, we reuse the expert demonstrations to infer a reward function online [23, 47], eliminating the need for engineering reward functions. Finally, we improve upon the learning efficiency of MEDAL by using an ensemble of Q-value functions and increasing the update steps per sample collected [6], using behavior cloning (BC) regularization to bias policy learning towards expert actions [42], and oversampling transitions from the demonstration data when training the Q-value function [39].

Overall, we propose MEDAL++, an efficient and practically viable autonomous RL algorithm that can learn from visual inputs without reward specification, and requires minimal oversight during training. We evaluate MEDAL++ on a pixel-based control version of EARL [45], a non-episodic learning benchmark, and observe that MEDAL++ is more data efficient and gets up to 30% better performance compared to competitive methods [46, 57]. Most importantly, we conduct real-robot evaluations using a Franka Panda robot arm on four manipulation tasks, such as hanging a cloth on a hook, covering a bowl with a cloth, and peg insertion, all from RGB image observations. After autonomous training using MEDAL++, we observe that the success rate of the policy can increase by 30-70% when compared to that of a behavior cloning policy learned only on the expert data, indicating that MEDAL++ is a step towards self-improving robotic systems.
II. RELATED WORK

Several works have demonstrated the emergence of complex skills on a variety of problems using reinforcement learning on real robots [32, 30, 35, 7, 27, 55, 38, 53, 28, 48, 2]. However, these prior works require the environment to be reset to a (narrow) set of initial states for every episode of interaction with the environment. Such resetting of the environment either requires repeated human interventions and constant monitoring [11, 17, 14, 5, 20] or scripting behaviors [35, 38, 56, 43, 53, 1], which can be time-intensive while resulting in brittle behaviors. Some prior works have also designed the task and environment to bypass the need for resetting the environment [41, 9, 27], but this applies to a restricted set of tasks.

More recent works have identified the need for algorithms that can work autonomously with minimal supervision required for resetting the environments [22, 57, 45]. Several works propose learning a backward policy to undo the task, in addition to learning a forward policy that does the task. Han et al. [22] and Eysenbach et al. [8] use a backward policy that learns to reach the initial state distribution, Sharma et al. [44] propose a backward policy that generates a curriculum for the forward agent, Xu et al. [51] and Lu et al. [36] use unsupervised skill discovery to create adversarial starting states and non-stationary task distributions respectively, and Xie et al. [50] enable robotic agents to learn autonomously in environments with irreversible states, asking for interventions in stuck states and learning to avoid them while interacting with the environment. In this work, we build upon MEDAL [46], where the backward policy learns to match the distribution of states visited by an expert to solve the task. While the results from these prior papers are restricted to simulated settings, some recent papers have demonstrated autonomous training on real robots [57, 18, 19, 49]. However, the results on real robots have either relied on state estimation [18, 19] or pre-specified reward functions [49]. R3L [57] also considers the setting of learning from image observations without repeated resets and specified reward functions, similar to our work. It uses a backward policy that optimizes for state-novelty while learning the reward function from a set of goal images collected prior to training [47]. However, R3L relies on frozen visual encoders trained independently on data collected in the same environment, and optimizing for state-novelty does not scale to larger environments, restricting their robot evaluations to smaller, easier-to-explore environments. Our simulation results indicate that MEDAL++ learns more efficiently than R3L, and real-robot evaluations indicate that MEDAL++ can be used in larger environments.

Overall, our work proposes a system that can learn end-to-end from visual inputs without repeated environment resets, with real-robot evaluations on four manipulation tasks.
III. PRELIMINARIES

Problem Setting. We consider the autonomous RL problem setting [45]. We assume that the agent is in a Markov Decision Process represented by (S, A, T, r, ρ0, γ), where S is the state space, potentially corresponding to high-dimensional observations such as RGB images, A denotes the robot's action space, T : S × A × S → R≥0 denotes the transition dynamics of the environment, r : S × A → R is the (unknown) reward function, ρ0 denotes the initial state distribution, and γ denotes the discount factor. The objective is to learn a policy that maximizes E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ] when deployed from ρ0 during evaluation. There are two key differences from the standard episodic RL setting: First, the training environment is non-episodic, i.e., the environment does not periodically reset to the initial state distribution after a fixed number of steps. Second, the reward function is not available during training. Instead, we assume access to a set of demonstrations collected by an expert prior to robot training. Specifically, the expert collects a small set of forward trajectories Df∗ = {(s_i, a_i), . . .} demonstrating the task and, similarly, a set of backward demonstrations Db∗ undoing the task back to the initial state distribution ρ0.

Autonomous Reinforcement Learning via MEDAL. To enable a robot to practice the task autonomously, MEDAL [46] trains a forward policy πf to solve the task, and a backward policy πb to undo the task. The forward policy πf executes for a fixed number of steps before the control is switched over to the backward policy πb for a fixed number of steps. Chaining the forward and backward policies reduces the number of interventions required to reset the environment. The forward policy is trained to maximize E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ], which can be done via any RL algorithm. The backward policy πb is trained to minimize the Jensen-Shannon divergence JS(ρb(s) || ρ∗(s)) between the stationary state distribution of the backward policy ρb and the state distribution of the expert policy ρ∗. By training a classifier Cb : S → (0, 1) to discriminate between states visited by the expert (i.e., s ∼ ρ∗) and states visited by πb (i.e., s ∼ ρb), the divergence minimization problem can be rewritten as max_{πb} −E[ Σ_{t=0}^∞ γ^t log(1 − Cb(s_t)) ] [46]. The classifier used in the reward function for πb is trained using the cross-entropy loss, where the states s ∈ Df∗ are labeled 1 and states visited by πb online are labeled 0, leading to a minimax optimization between πb and Cb.
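As a concrete illustration of this minimax setup, the sketch below (our own, not the authors' released code; `StateClassifier`, `classifier_loss`, and `backward_reward` are hypothetical names) trains a state classifier with the cross-entropy loss above and converts its output into the reward −log(1 − Cb(s)) for πb:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateClassifier(nn.Module):
    # Small MLP producing the logit of "this state looks like an expert state".
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s)

def classifier_loss(classifier, expert_states, policy_states):
    # Cross-entropy loss: expert states are labeled 1, online states 0.
    logit_pos = classifier(expert_states)
    logit_neg = classifier(policy_states)
    return (F.binary_cross_entropy_with_logits(logit_pos, torch.ones_like(logit_pos))
            + F.binary_cross_entropy_with_logits(logit_neg, torch.zeros_like(logit_neg)))

@torch.no_grad()
def backward_reward(classifier, s, eps=1e-6):
    # r_b(s) = -log(1 - Cb(s)): high when s is likely to be an expert state.
    prob = torch.sigmoid(classifier(s))
    return -torch.log((1.0 - prob).clamp(min=eps))
```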
Learning Reward Functions with VICE. Engineering rewards can be tedious, especially when only image observations are available. Since the transitions from the training environment are not labeled with rewards, the robot needs to learn a reward function for the forward policy πf. In this work, we consider VICE [13], particularly the simplified version presented by Singh et al. [47] that is compatible with off-policy RL. VICE requires a small set of states representing the desired outcome (i.e., goal images) prior to training. Given a set of goal states G, VICE trains a classifier Cf : S → (0, 1), where Cf is trained using the cross-entropy loss on states in G labeled as 1 and states visited by πf labeled as 0. The policy πf is trained with a reward function of log Cf(s) − log(1 − Cf(s)), which can be viewed as minimizing the KL-divergence between the stationary state distribution of πf and the goal distribution [12, 40, 15]. VICE has two benefits over pre-trained frozen classifier-based rewards: first, the negative states do not need to be collected by a person, and second, the VICE classifier is harder to exploit as the online states are iteratively added to the label-0 set, continually improving the goal-reaching reward function implicitly.

IV. MEDAL++: PRACTICAL AND EFFICIENT AUTONOMOUS REINFORCEMENT LEARNING

The goal of this section is to develop a reinforcement learning method that can learn from autonomous online interaction in the real world, given just a (small) set of forward demonstrations Df∗ and backward demonstrations Db∗ without reward labels. Particularly, we focus on design choices that make MEDAL++ viable in the real world in contrast to MEDAL: First, we describe how to learn from visual inputs without explicit state estimation. Second, we describe how to train the VICE classifier to eliminate the need for ground-truth rewards when training the forward policy πf. Third, we describe the algorithmic modifications for training the Q-value function and the policy π more efficiently, namely, using an ensemble of Q-value functions and leveraging the demonstration data more effectively. Finally, we describe how to construct MEDAL++ using all the components described here, training a forward policy πf and a backward policy πb to learn autonomously.

A. Encoding Visual Inputs

We embed the high-dimensional RGB images into a low-dimensional feature space using a convolutional encoder E. The RGB images are augmented using random crops and shifts (up to 4 pixels) to regularize Q-value learning [52]. While some prior works incorporate explicit representation learning losses for visual encoders [34, 33], Yarats et al. [52] suggest that regularizing Q-value learning using random crop and shift augmentations is both simpler and more efficient, allowing end-to-end learning without any explicit representation learning objectives. Specifically, the training loss for the Q-value function on an environment transition (s, a, s′, r) can be written as:

ℓ(Q, E) = ( Q(E(aug(s)), a) − r − γ V̄(E(aug(s′))) )²,    (1)

where aug(·) denotes the augmented image, and r + γ V̄(·) is the TD-target. Equation 2 describes the exact computation of V̄ using slow-moving target networks Q̄ and the current policy π.
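For concreteness, the following sketch (ours, not from the paper's codebase; the `encoder`, `q_net`, and `target_v` callables and their signatures are assumptions) implements the augmented TD loss of Equation 1 with a simplified, per-batch random shift:

```python
import torch
import torch.nn.functional as F

def random_shift(imgs, pad=4):
    # Pad by `pad` pixels (replicating edges) and crop back to the original
    # size at a random offset: a simplified, per-batch version of the
    # DrQ-style random shift.
    n, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    top = torch.randint(0, 2 * pad + 1, (1,)).item()
    left = torch.randint(0, 2 * pad + 1, (1,)).item()
    return padded[:, :, top:top + h, left:left + w]

def q_loss(q_net, target_v, encoder, batch, gamma=0.99):
    # Eq. (1): ( Q(E(aug(s)), a) - (r + gamma * V_bar(E(aug(s')))) )^2
    s, a, r, s_next = batch                       # images, actions, rewards, next images
    z = encoder(random_shift(s))
    with torch.no_grad():
        z_next = encoder(random_shift(s_next))
        td_target = r + gamma * target_v(z_next)  # V_bar computed as in Eq. (2)
    return F.mse_loss(q_net(z, a), td_target)
```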
B. Learning the Reward Function

To train a VICE classifier, we need to specify a set of goal states that can be used as positive samples. Instead of collecting the goal states separately, we use the last K states of every trajectory in Df∗ to create the goal set G. The trajectories collected by the robot's policy πf are used to generate negative states for training Cf. The policy is trained to maximize −log(1 − Cf(·)) as the reward function, encouraging the policy to reach states that have a high probability of being labeled 1 under Cf, and thus, are similar to the states in the set G. The reward signal from the classifier can be sparse if the classifier has high accuracy in distinguishing between the goal states and the states visited by the policy. Since the classification problem for Cf is easier than the goal-matching problem for πf, especially early in training when the policy is not as successful, it becomes critical to regularize the discriminator Cf [16]. We use spectral normalization [37] and mixup [54] to regularize Cf, and apply random crop and shift augmentations to the input images during training to create a broader distribution for learning.

Since we assume access to expert demonstrations Df∗, why do we use VICE, which matches the policy's state distribution to the goal distribution, instead of GAIL [23, 31], which matches the policy's state-action distribution to that of the expert? In a practical robotic setup, the actions demonstrated by an expert during teleoperation and the optimal actions for a learned neural network policy will be different. The forward pass through a policy network introduces a delay, especially as the visual encoder E becomes larger. Matching both the states and actions to those of the expert, as is the case with GAIL, can lead to suboptimal policies and be infeasible in general. In contrast, VICE allows the robotic policies to choose actions that are different from the expert's as long as they lead to a set of states similar to those in G. The exploratory benefits of matching the actions can still be recovered, as described in the next subsection.
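A minimal sketch of such a regularized discriminator update is shown below (our illustration; the paper's discriminator also has its own small convolutional encoder, and `goal_feats`/`policy_feats` here stand for features of randomly cropped and shifted images):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spectral_mlp(in_dim, hidden=256):
    # Spectral normalization on every linear layer limits how sharp the
    # discriminator's decision boundary can get.
    sn = nn.utils.spectral_norm
    return nn.Sequential(
        sn(nn.Linear(in_dim, hidden)), nn.ReLU(),
        sn(nn.Linear(hidden, hidden)), nn.ReLU(),
        sn(nn.Linear(hidden, 1)),
    )

def mixup_bce_loss(discriminator, goal_feats, policy_feats, alpha=1.0):
    # Mixup: train on convex combinations of goal states (label 1) and
    # online policy states (label 0), with correspondingly mixed labels.
    x = torch.cat([goal_feats, policy_feats], dim=0)
    y = torch.cat([torch.ones(goal_feats.size(0), 1, device=x.device),
                   torch.zeros(policy_feats.size(0), 1, device=x.device)], dim=0)
    lam = torch.distributions.Beta(alpha, alpha).sample().to(x.device)
    perm = torch.randperm(x.size(0), device=x.device)
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return F.binary_cross_entropy_with_logits(discriminator(x_mix), y_mix)
```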
C. Improving the Learning Efficiency

To improve the learning efficiency over MEDAL, we incorporate several changes in how we train the Q-value function and the policy π. First, we train an ensemble of Q-value networks {Q_n}_{n=1}^N and a corresponding set of target networks {Q̄_n}_{n=1}^N. When training an ensemble member Q_n, the target is computed by sampling a subset of target networks and taking the minimum over the subset. The target value V̄(s′) in Eq 1 can be computed as

V̄(s′) = E_{a′∼π}[ min_{j∈M} Q̄_j(s′, a′) ],    (2)

where M is a random subset of the index set {1 . . . N} of size M. Randomizing the subset of the ensemble when computing the target allows more gradient steps to be taken to update Q_n on ℓ(Q_n, E) [6] without overfitting to a specific target value, increasing the overall sample efficiency of learning. The target networks Q̄_n are updated as an exponential moving average of Q_n in the weight space over the course of training. At iteration t, Q̄_n^(t) ← τ Q_n^(t) + (1 − τ) Q̄_n^(t−1), where τ ∈ (0, 1] determines how closely Q̄_n tracks Q_n.
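A sketch of this randomized-subset target and the moving-average update follows (our code, not the released implementation; it assumes the policy returns a torch distribution over actions and uses a single action sample for the expectation):

```python
import random
import torch

@torch.no_grad()
def ensemble_target_v(target_qs, policy, z_next, subset_size=2):
    # Eq. (2): V_bar(s') = E_{a'~pi}[ min_{j in M} Q_bar_j(s', a') ], with M a
    # random subset of the N target critics.
    a_next = policy(z_next).sample()       # assumes policy(z) returns a distribution
    subset = random.sample(range(len(target_qs)), subset_size)
    qs = torch.stack([target_qs[j](z_next, a_next) for j in subset], dim=0)
    return qs.min(dim=0).values

@torch.no_grad()
def ema_update(target_qs, qs, tau=0.01):
    # Q_bar^(t) <- tau * Q^(t) + (1 - tau) * Q_bar^(t-1), applied parameter-wise.
    for q_bar, q in zip(target_qs, qs):
        for p_bar, p in zip(q_bar.parameters(), q.parameters()):
            p_bar.mul_(1 - tau).add_(tau * p)
```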
Next, we leverage the expert demonstrations to optimize Q and π more efficiently. Q-value networks are typically updated on minibatches sampled uniformly from a replay buffer D. However, the transitions in the demonstrations are generated by an expert, and thus can be more informative about the actions for reaching successful states [39]. To bias the data towards the expert distribution, we oversample transitions from the expert data such that for a batch of size B, ρB transitions are sampled uniformly from the expert data and (1 − ρ)B transitions are sampled uniformly from the replay buffer, for ρ ∈ [0, 1). Finally, we regularize the policy learning towards expert actions by introducing a behavior cloning loss in addition to maximizing the Q-values [42, 39]:

L(π) = E_{s∼D, a∼π(·|s)}[ (1/N) Σ_{n=1}^N Q_n(E(aug(s)), a) ] + λ E_{(s∗,a∗)∼ρ∗}[ log π(a∗ | E(aug(s∗))) ],    (3)

where λ ≥ 0 denotes the hyperparameter controlling the effect of BC regularization. Note that the parameters of the encoder are frozen with respect to L(π), and are only trained through ℓ(Q_n, E).
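The oversampled batch construction and the BC-regularized policy objective of Equation 3 can be sketched as follows (our code; the buffers are assumed to return dictionaries of tensors, and the policy is assumed to return a torch distribution):

```python
import torch

def mixed_batch(replay_buffer, demo_buffer, batch_size=256, rho=0.25):
    # Oversample expert transitions: rho*B from the demonstrations and
    # (1 - rho)*B from the online replay buffer.
    n_demo = int(rho * batch_size)
    demo = demo_buffer.sample(n_demo)
    online = replay_buffer.sample(batch_size - n_demo)
    return {k: torch.cat([demo[k], online[k]], dim=0) for k in demo}

def policy_loss(policy, q_ensemble, encoder, batch, demo_batch, bc_weight):
    # Eq. (3), written as a loss to minimize: mean ensemble Q-value on states
    # from D plus a BC log-likelihood term on expert (state, action) pairs.
    # Image augmentation is omitted here for brevity.
    z = encoder(batch["obs"]).detach()          # encoder is frozen w.r.t. L(pi)
    a = policy(z).rsample()
    q_mean = torch.stack([q(z, a) for q in q_ensemble], dim=0).mean(dim=0)
    z_star = encoder(demo_batch["obs"]).detach()
    bc_log_prob = policy(z_star).log_prob(demo_batch["action"])
    return -(q_mean.mean() + bc_weight * bc_log_prob.mean())
```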
Fig. 2. Visualizing the positive target states for the forward classifier Cf and the backward classifier Cb from the expert demonstrations. For forward demonstrations, the last K states are used for Cf (orange) and the rest are used for Cb (pink). For backward demonstrations, the last K states are used for Cb.

D. Putting it Together: MEDAL++

Finally, we put together the components from the previous sections to construct MEDAL++ for end-to-end autonomous reinforcement learning. MEDAL++ trains a forward policy that learns to solve the task and a backward policy that learns to undo the task towards the expert state distribution. The parameters and data buffers for the forward policy are represented by the tuple F ≡ (πf, E_f, {Q_n^f}_{n=1}^N, {Q̄_n^f}_{n=1}^N, Cf, Df∗, Df, Gf), where the symbols retain their meaning from the previous sections. Similarly, the parameters and data buffers for the backward policy are represented by the tuple B ≡ (πb, E_b, {Q_n^b}_{n=1}^N, {Q̄_n^b}_{n=1}^N, Cb, Db∗, Db, Gb). Noticeably, F and B have a similar structure: both πf and πb are trained using −log(1 − C(·)) as the reward function (using their respective classifiers Cf and Cb), with both classifiers trained to discriminate between the states visited by the policy and their target states. The primary difference is the set of positive target states Gf and Gb used to train Cf and Cb respectively, visualized in Figure 2. The VICE classifier Cf is trained to predict the last K states of every trajectory from Df∗ as positive, whereas we train the MEDAL classifier Cb to predict all the states of the forward demonstrations except the last K states as positive. Optionally, we can also include the last K states of the backward demonstrations from Db∗ as positives for training Cb.

The pseudocode for training is given in Algorithm 1. First, the parameters and data buffers in F and B are initialized and the forward and backward demonstrations are loaded into Df∗ and Db∗ respectively. Next, we update the forward and backward goal sets, as described above. After initializing the environment, the forward policy πf interacts with the environment and collects data, updating the networks and buffers in F. The control switches over to the backward policy πb after a fixed number of steps, and the networks and buffers in B are updated. The backward policy interacts for a fixed number of steps, after which the control is switched over to the forward policy, and this cycle is repeated thereon. When executing in the real world, humans are allowed to intervene and reset the environment intermittently, switching control back to the forward policy after an intervention. For every update, the Q-value networks, the policy, and the encoder E are updated on a batch of transitions constructed by sampling (1 − ρ)B transitions from Df and ρB transitions from Df∗. The Q-value networks and the encoder are updated by minimizing (1/N) Σ_{n=1}^N ℓ(Q_n, E) (Eq 1), and the target Q-networks are updated as an exponential moving average of the Q-value networks. The policy πf is updated by maximizing L(π). We update the Q-value networks multiple times for every step collected in the environment, whereas the policy network is updated once for every step collected in the environment [6].

Algorithm 1: MEDAL++
  initialize F, B                                    // forward, backward parameters
  F.Df∗, B.Db∗ ← load_demonstrations()
  F.Gf ← get_states(F.Df∗, −K:)                      // last K states of forward demos
  B.Gb ← get_states(F.Df∗, :−K) ∪ get_states(B.Db∗, −K:)
      // exclude the last K states from Df∗; use only the last K states from Db∗
  s ∼ ρ0; A ← F                                      // initialize environment
  while not done do
      a ∼ A.act(s); s′ ∼ T(· | s, a)
      A.update_buffer({s, a, s′})
      A.update_classifier()
      A.update_parameters()
      // switch policy after a fixed interval
      if switch then
          switch(A, (F, B))
      // allow intermittent human interventions
      if interrupt then
          s ∼ ρ0
          A ← F
      else
          s ← s′
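Read as Python, the outer loop of Algorithm 1 corresponds roughly to the sketch below (ours, not the authors' implementation; `F` and `B` are assumed to bundle the policy, critics, classifier, and buffers of each agent, and the method names as well as `intervened` are hypothetical):

```python
def medal_pp_loop(env, F, B, intervened, switch_every=200, total_steps=300_000):
    # Rough Python rendering of Algorithm 1: chain forward/backward policies,
    # updating whichever agent is currently in control.
    s = env.reset()                          # s ~ rho_0
    active, steps_in_phase = F, 0            # start with the forward agent
    for _ in range(total_steps):
        a = active.act(s)
        s_next = env.step(a)
        active.update_buffer(s, a, s_next)
        active.update_classifier()           # refresh Cf / Cb on recent online states
        active.update_parameters()           # critic, policy, and encoder updates
        steps_in_phase += 1
        if steps_in_phase >= switch_every:   # hand control to the other policy
            active = B if active is F else F
            steps_in_phase = 0
        if intervened():                     # rare, intermittent human reset
            s = env.reset()
            active, steps_in_phase = F, 0
        else:
            s = s_next
```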
V. EXPERIMENTS

The goal of our experiments is to determine whether MEDAL++ can be a practical method for self-improving robotic systems. First, we modify EARL [45], a benchmark for non-episodic RL, to return RGB observations instead of low-dimensional state. We benchmark MEDAL++ against competitive methods [46, 57] to evaluate the learning efficiency from high-dimensional observations in Section V-A. Our primary experiments in Section V-B evaluate MEDAL++ on four real-robot manipulation tasks, primarily tasks with soft-body objects such as hanging a cloth on a hook and covering a bowl with a cloth. The real-robot evaluation considers the question of whether self-improvement is feasible via MEDAL++, and if so, how much self-improvement MEDAL++ can obtain. Finally, we run ablations to evaluate the contributions of different components of MEDAL++ in Section V-C.
In EARL, the non-episodic training environment is reset to s0 ∼ ρ0 every 25,000 steps of interaction with the environment. This is extremely infrequent compared to episodic settings, where the environment is reset to the initial state distribution every 200-1000 steps. EARL comes with 5-15 forward and backward demonstrations for every environment to help with exploration in these sparse-reward environments. We evaluate the forward policy every 10,000 training steps, where the evaluation approximates E_{s0∼ρ0}[ Σ_{t=0}^∞ γ^t r(s_t, a_t) ] by averaging the return of the policy over 10 episodes starting from s0 ∼ ρ0. These roll-outs are used only for evaluation, and not for training.

Comparisons. We compare MEDAL++ to four methods: (1) MEDAL [46] uses a backward policy that matches the expert state distribution by minimizing JS(ρb(s) || ρ∗(s)), similar to ours. However, the method is designed for low-dimensional states and policy/Q-value networks and cannot be naïvely extended to RGB observations. For a better comparison, we improve the method to use a visual encoder with random crop and shift augmentations during training, similar to MEDAL++. (2) R3L [57] uses a perturbation controller as the backward policy, which optimizes for state-novelty computed using random network distillation [4]. Unlike our method, R3L also requires a separately collected dataset of environment observations to pre-train a VAE [29] based visual encoder, which is frozen throughout the training thereafter. (3) We consider an oracle RL method that trains just a forward policy and gets a privileged training environment that resets every 200 steps (i.e., the same episode length as during evaluation), and finally, (4) we consider a control method, naïve RL, that similar to oracle RL trains just a forward policy, but resets every 25,000 steps similar to the non-episodic methods. We additionally report the performance of a behavior cloning policy, trained on the forward demonstrations used in the EARL environments. The implementation details and hyperparameters can be found in Appendix A.

Results. Figure 4 plots the evaluation performance of the forward policy versus the training samples collected in the environment. MEDAL++ outperforms all other methods on both of the sawyer environments, and is comparable to MEDAL, the best performing method, on tabletop organization. While R3L does recover a non-trivial performance eventually on door closing and tabletop organization, the novelty-seeking perturbation controller can cause the robot to drift to states farther away from the goal in larger environments, leading to slower improvement in evaluation performance on states starting from s0 ∼ ρ0. While MEDAL and MEDAL++ have the same objective for the backward policy, optimization-related improvements enable MEDAL++ to learn faster. Note, BC performs worse on the tabletop organization environment with a 45% success rate, compared to the sawyer environments with 70% and 80% success rates on peg insertion and door closing respectively. So, while BC regularization helps speed up learning and can lead to better policies, it can hurt the final performance of MEDAL++ if the BC policy itself has a worse success rate (at least when true rewards are available for training, see ablations in Section V-C). While we use the same hyperparameters for all environments, reducing the weight on BC regularization when BC policies have poor success rates can reduce the bias in policy learning and improve the final performance.

B. Real Robot Evaluations

In line with the main goal of this paper, our experiments aim to evaluate whether self-improvement through MEDAL++ can enable real robots to learn more competent policies autonomously. On four manipulation tasks, we provide a quantitative and qualitative comparison of the policy learned by behavior cloning on the expert data to the one learned after self-improvement by MEDAL++. We recommend viewing the results on our website for a more complete overview: https://architsharma97.github.io/self-improving-robots/.

Fig. 5. An overview of MEDAL++ on the task of inserting the peg into the goal location. (top) Starting with a set of expert trajectories, MEDAL++ learns a forward policy to insert the peg by matching the goal states and a backward policy to remove and randomize the peg position by matching the rest of the states visited by an expert. (bottom) Chaining the rollouts of the forward and backward policies allows the robot to practice the task autonomously. The rewards indicate the similarity to their respective target states, output by a discriminator trained to classify online states from expert states.

Robot Setup and Tasks. We use a Franka Emika Panda arm with a Robotiq 2F-85 gripper for all our experiments. We use an RGB camera mounted on the wrist and a third-person camera, as shown in Figure 6. The final observation space includes two 100 × 100 RGB images, the 3-dimensional end-effector position, the orientation along the z-axis, and the width of the gripper. The action space is set up as either 4-DoF end-effector control or 5-DoF end-effector control with orientation along the z-axis, depending on the task (including one degree of freedom for the gripper).
Our evaluation suite consists of four manipulation tasks: grasping a cube, hanging a cloth on a hook, covering a bowl with a piece of cloth, and a (soft) peg insertion. Real-world data and training are more pertinent for soft-body manipulation as soft objects are harder to simulate, and thus we emphasize those tasks in our evaluation suite. The tasks are shown in Figure 6, and we provide task-specific details along with the analysis.

Training and Evaluation. For every task, we first collect a set of 50 forward demonstrations and 50 backward demonstrations using an Xbox controller. We chain the forward and backward demonstrations to speed up collection and better approximate autonomous training thereafter. After collecting the demonstrations, the robot is trained for 10-30 hours using MEDAL++ as described in Section IV-D, collecting about 300,000 environment transitions in the process. For the first 30 minutes of training, we reset the environment to create enough (object) diversity in the initial data collected in the replay buffer. After the initial collection, the environment is reset intermittently, approximately every hour of real-world training on average, though it is left unattended for several hours at a time. More details related to hyperparameters, network architecture, and training can be found in Appendix A. For evaluation, we roll out the policy from varying initial states and measure the success rate over 50 evaluations. To isolate the role of self-improvement, we compare the performance to a behavior cloning policy trained on the forward demonstrations, using the same network architecture for the policy as MEDAL++. For both MEDAL++ and BC, we evaluate multiple intermediate checkpoints and report the success rate of the best performing checkpoint.

Results. The success rates of the best performing BC policy and MEDAL++ policy are reported in Table 6. MEDAL++ substantially increases the success rate of the learned policies, with approximately 30-70% improvements. We expand on each task and analyze the performance below:

(1) Cube Grasping: The goal in this task is to grasp the cube from varying initial positions and configurations and raise it. For this task, we consider a controlled setting to isolate one potential source of improvement from autonomous reinforcement learning: robustness to the initial state distribution. Specifically, all the forward demonstrations are collected starting from a narrow set of initial states (ID), but the robot is evaluated starting from both ID states and out-of-distribution (OOD) states, visualized in Appendix, Figure 9. The BC policy is competent on ID states, but it performs poorly on states that are OOD. However, after autonomous self-improvement using MEDAL++, we see an improvement of 15% on ID performance, and a large improvement of 74% on OOD performance. Autonomous training allows the robot to practice the task from a diverse set of states, including states that were OOD relative to the demonstration data. This suggests that the improvement in success rate results partly from being robust to the initial state distribution, as a small set of demonstrations is unlikely to cover all possible initial states a robot can be evaluated from.

(2) Cloth on the Hook: In this task, the robot is tasked with grasping the cloth and putting it through a fixed hook. To practice the task repeatedly, the backward policy has to remove the cloth from the hook and drop it on the platform. Here, MEDAL++ improves the success rate over BC by 36%. The BC policy has several failure modes: (1) it fails to grasp the cloth, (2) it follows through with the hooking motion because of memorization even when the grasp fails, or (3) it collides with the hook because it drifts from the right trajectory and cannot recover. Autonomous self-improvement improves upon all these issues; in particular, it learns to re-try grasping the cloth if it fails the first time, rather than following a memorized trajectory observed in the forward demonstrations.

(3) Bowl Covering with Cloth: The goal of this task is to cover a bowl entirely using the cloth. The cloth can be in a wide variety of initial states, ranging from 'laid out flat' to 'scrunched up' in varying locations. The task is challenging as the robot has to grasp the cloth at the correct location to successfully cover the entire bowl (partial coverage is counted as a failure). Here, MEDAL++ improves the performance over BC by 34%. The failure modes of BC are similar to the previous task, including failure to grasp, memorization and failure to re-try, and incomplete coverage due to a wrong initial grasp. Autonomous self-improvement substantially helps with the grasping.
Task                    Behavior Cloning    MEDAL++
Cube Grasping (ID)      0.85                1.00
Cube Grasping (OOD)     0.08                0.82
Cloth Hanging           0.26                0.62
Bowl Cloth Cover        0.12                0.46
Peg Insertion           0.04                0.52

Fig. 6. (top) The training setup for MEDAL++. The image observations include a fixed third-person view and a first-person view from a wrist camera mounted above the gripper. The evaluation tasks, going clockwise: cube grasping, covering a bowl with a cloth, hanging a cloth on the hook, and peg insertion. (bottom) Evaluation performance of the best checkpoint learned by behavior cloning and MEDAL++. The table shows the final success rates over 50 trials from randomized initial states, normalized to [0, 1]. MEDAL++ substantially improves the performance over behavior cloning, validating the feasibility of self-improving robotic systems.

C. Ablations

Finally, we consider ablations on the simulated environments to understand the contributions of each component of MEDAL++. We benchmark four variants on the tabletop organization and peg insertion tasks in Figure 7: (1) MEDAL++, (2) MEDAL++ with the true reward function instead of the learned VICE reward, (3) MEDAL++ without the ensemble of Q-value functions, using SAC [21] instead, and (4) MEDAL++ without both BC-regularization and oversampling of expert transitions for training the Q-value functions. While there is some room for improvement, MEDAL++ with the learned reward function can largely recover the performance obtained with the true rewards. Both the ensemble of Q-values and BC-regularization with oversampled expert transitions improve the performance, though the latter makes a larger contribution to the improvement. Note, when using the true rewards, BC-regularization and oversampling of expert transitions can hurt the final performance (as discussed in Section V-A). However, when using learned rewards, they both become more important for better performance. We hypothesize that the signal from the learned reward function is noisier, making the other components important for efficient learning and better final performance.
Fig. 9. (left) Randomized positions of the cube in the grasping task. The positions marked by the violet boundary are within the distribution of the expert demonstrations, and the rest are outside the distribution. (right) Architecture overview for MEDAL++.
APPENDIX
A. Implementation Details and Practical Tips
An overview of the architecture used by the forward and backward networks is shown in Figure 9.
Visual Encoder: For the encoder, we use the same architecture as DrQ-v2 [52]: 4 convolutional layers with 32 filters of size (3, 3) and stride 1, followed by ReLU non-linearities. The high-dimensional output from the CNN is embedded into a 50-dimensional feature using a fully-connected layer, followed by LayerNorm and a tanh non-linearity (to output features normalized to [−1, 1]). For real-robot experiments, the first-person and third-person views are concatenated channel-wise before being passed into the encoder. The output of the encoder is fused with proprioceptive information, in this case the end-effector position, before being passed to the actor and critic networks.
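The described encoder corresponds roughly to the following sketch (our rendering, not the released implementation; the pixel normalization and the use of LazyLinear to infer the flattened size are our own choices):

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    # 4 conv layers, 32 filters of size 3x3 with stride 1 and ReLU, followed by
    # a fully-connected projection to 50 dimensions, LayerNorm, and tanh.
    def __init__(self, in_channels=6, feature_dim=50):   # 6 = two RGB views stacked channel-wise
        super().__init__()
        convs = []
        for i in range(4):
            convs += [nn.Conv2d(in_channels if i == 0 else 32, 32, 3, stride=1), nn.ReLU()]
        self.convnet = nn.Sequential(*convs, nn.Flatten())
        self.proj = nn.Sequential(nn.LazyLinear(feature_dim),
                                  nn.LayerNorm(feature_dim), nn.Tanh())

    def forward(self, imgs, proprio=None):
        z = self.proj(self.convnet(imgs / 255.0 - 0.5))   # pixel normalization is our choice
        if proprio is not None:
            z = torch.cat([z, proprio], dim=-1)           # fuse with end-effector position, etc.
        return z
```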
Actor and Critic Networks: Both the actor and critic networks are parameterized as 4-layer fully-connected networks with 1024 ReLU hidden units in every layer. The actor parameterizes a Gaussian distribution over the actions, where a tanh non-linearity on the output restricts the actions to [−1, 1]. We use an ensemble size of 10 critics.
Discriminators: The discriminators for the forward and backward policies use a similar visual encoder, but with 2 convolutional layers instead of 4. The visual embedding is passed to a fully-connected network with 2 hidden layers of 256 ReLU units. When training the network, we use mixup and spectral-norm regularization [54, 37] for the entire network.
Training Hyperparameters: For all our experiments, K = 20, i.e., the last 20 frames of a demonstration are used as goal frames. The forward policy interacts with the environment for 200 steps, then the backward policy interacts for 200 steps. In real-world experiments, we also reset the arm every 1000 steps to avoid hitting singular positions. Note, this reset does not require any human intervention as the controller just resets the arm to a fixed joint position. We use a batch size of 256 to train the policy and critic networks, out of which 64 transitions are sampled from the demonstrations (oversampling). We use a batch size of 512 to train the discriminators, where 256 of the states come from the expert data and the other 256 come from the online data. Further, the discriminators are updated every 1000 steps collected in the environment. The update-to-data ratio, that is, the number of gradient updates per transition collected in the environment, is 3 for the simulated environments and 1 for the real-robot experiments. We use a linearly decaying schedule for the behavior cloning regularization weight, from 1 to 0.1 over the first 50,000 steps, after which it remains fixed at 0.1 throughout training.
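As a compact summary of the quantities above (our sketch; the constant names are hypothetical), the schedule and batch composition can be written as:

```python
def bc_weight(step, start=1.0, end=0.1, decay_steps=50_000):
    # Linearly decay the BC-regularization weight from 1.0 to 0.1 over the
    # first 50k environment steps, then keep it fixed at 0.1.
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

# Batch composition and update ratios quoted above:
BATCH_SIZE = 256           # policy / critic batch size
DEMO_SAMPLES = 64          # expert transitions oversampled per batch
DISC_BATCH = 512           # discriminator batch: 256 expert + 256 online states
DISC_UPDATE_EVERY = 1000   # environment steps between discriminator updates
UTD_SIM, UTD_REAL = 3, 1   # gradient updates per environment step
```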
For real-world experiments, we use a wrist camera to improve the overall performance [24], and provide only the wrist-camera view to both discriminators. We find that this further regularizes the discriminators. Finally, we provide no proprioceptive information to the VICE discriminator, but we do give the MEDAL discriminator the proprioceptive information, as it needs a stronger notion of the robot's localization to adequately reset to a varied set of initial positions for improved robustness.
Teleoperation: To collect our demonstrations on the real robot, we use an Xbox controller that manipulates the end-effector position, orientation, and the gripper state. Two salient notes: (1) the forward and backward demonstrations are collected together, one after the other, and (2) the initial position for demonstrations is randomized to cover as large a state space as feasible. The increased coverage helps with exploration during autonomous training.