SCIENCE ROBOTICS | RESEARCH ARTICLE

ANIMAL ROBOTS

Learning quadrupedal locomotion over challenging terrain

Joonho Lee1*, Jemin Hwangbo1,2, Lorenz Wellhausen1, Vladlen Koltun3, Marco Hutter1

Copyright © 2020 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.
Legged locomotion can extend the operational domain of robots to some of the most challenging environments
on Earth. However, conventional controllers for legged locomotion are based on elaborate state machines that
explicitly trigger the execution of motion primitives and reflexes. These designs have increased in complexity but
fallen short of the generality and robustness of animal locomotion. Here, we present a robust controller for blind
quadrupedal locomotion in challenging natural environments. Our approach incorporates proprioceptive feed-
back in locomotion control and demonstrates zero-shot generalization from simulation to natural environments. The
controller is trained by reinforcement learning in simulation and is driven by a neural network policy that acts on a stream of proprioceptive signals. It retains its robustness under conditions that were
never encountered during training: deformable terrains such as mud and snow, dynamic footholds such as rubble,
and overground impediments such as thick vegetation and gushing water. The presented work indicates that
robust locomotion in natural environments can be achieved by training in simple domains.
1Robotic Systems Lab, ETH-Zürich, Zürich, Switzerland. 2Robotics & Artificial Intelligence Lab, KAIST, Daejeon, Korea. 3Intelligent Systems Lab, Intel, Santa Clara, CA, USA.
*Corresponding author. Email: [email protected]

INTRODUCTION
Much of the dry landmass on Earth remains impassable to wheeled and tracked machines, the stability of which can be severely compromised on challenging terrains. Quadrupedal animals, on the other hand, can access some of the most remote parts of Earth. They can choose safe footholds within their kinematic reach and rapidly change their kinematic state in response to the environment. Legged robots have the potential to traverse any terrains that their animal counterparts are able to traverse.

Dynamic locomotion in diverse, complex natural environments as shown in Fig. 1 has been a grand challenge in legged robotics. These environments have highly irregular profiles, deformable terrains, slippery surfaces, and overground obstructions. Under such conditions, existing published controllers manifest frequent foot slippage, loss of balance, and ultimately catastrophic failure. The challenge is exacerbated by the inaccessibility of accurate information about the physical properties of the terrain. Exteroceptive sensors such as cameras and LiDAR cannot reliably measure physical characteristics such as friction and compliance; are impeded by obstructions such as vegetation, snow, and water; and may not have the coverage and temporal resolution to capture changes induced by the robot itself, such as the crumbling of loose ground under the robot's feet. Under these conditions, the robot must rely crucially on proprioception—the sensing of its own bodily configuration at high temporal resolution. In response to unforeseen events such as unexpected ground contact, terrain deformation, and foot slippage, the controller must rapidly produce whole-body trajectories subject to multiple objectives: balancing, avoiding self-collision, counteracting external disturbances, and locomotion. Although animals solve this complex control problem instinctively, it remains an open challenge in robotics.

Conventional approaches to legged locomotion on uneven terrain have yielded increasingly complex control architectures. Many rely on elaborate state machines that coordinate the execution of motion primitives and reflex controllers (1–5). To trigger transitions between states or the execution of a reflex, many systems explicitly estimate states such as ground contact and slippage (6–8). Such estimation is commonly based on empirically tuned thresholds and can become erratic in the presence of unmodeled factors such as mud, snow, or vegetation. Other systems use contact sensors at the feet, which can become unreliable under field conditions (9–11). Overall, conventional systems for legged locomotion on rough terrain escalate in complexity as more scenarios are taken into account, have become extremely laborious to develop and maintain, and remain vulnerable to situations that are beyond the design implementation of their controller (corner cases).

Model-free reinforcement learning (RL) has recently emerged as an alternative approach in the development of locomotion controllers for legged robots (12–14). The idea of RL is to tune a controller to optimize a given reward function. The optimization is performed on data acquired by executing the controller itself, which improves with experience. RL has been used to simplify the design of locomotion controllers, automate parts of the design process, and learn behaviors that could not be engineered with prior approaches (12–15). However, application of RL to legged locomotion has largely been confined to laboratory environments and conditions. Our prior work demonstrated end-to-end learning of locomotion and recovery behaviors, but only on flat ground, in the laboratory (12). Other work also developed RL techniques for legged locomotion but likewise focused largely on flat or moderately textured surfaces in laboratory settings (13, 14, 16–19).

Here, we present a robust controller for blind quadrupedal locomotion on challenging terrain. The controller uses only proprioceptive measurements from joint encoders and an inertial measurement unit, which are the most durable and reliable sensors on legged machines. The operation of the controller is shown in Fig. 1 and Movie 1. The controller was used to drive two generations of ANYmal quadrupeds (20) in a variety of environments that are beyond the reach of prior published work in legged robotics. The quadruped reliably trots through mud, sand, rubble, thick vegetation, snow, running water, and a variety of other off-road terrains. The same controller was also used in our entry in the Defense Advanced Research Projects Agency (DARPA) Subterranean Challenge Urban Circuit. In all deployments, robots of the same generation were driven by exactly the same controller under all conditions. No tuning was required to adapt to different environments.

Like a number of prior applications of model-free RL to legged locomotion, we trained the controller in simulation (12, 14, 16). Prior efforts have established a number of practices for successful transfer of legged locomotion controllers from simulation to physical machines. One is realistic modeling of the physical system, including the actuators (12). Another is randomization of physical parameters that vary between simulation and reality, such that the controller becomes robust to a range of conditions that cover those that arise in physical deployment, without the necessity to precisely model these conditions a priori (21).

We used these ideas as well but found that they were not sufficient to achieve robust locomotion on rough terrain. We therefore introduced and validated a number of additional approaches that are crucial to realizing the presented skills. The first is a different policy architecture. Rather than using a multilayer perceptron (MLP) that operates on a snapshot of the robot's current state, as was common in prior work, we used a sequence model, specifically a temporal convolutional network (TCN) (22) that produces actuation based on an extended history of proprioceptive states. We did not use explicit contact and slip estimation modules, which are known to lack robustness in challenging situations; rather, the TCN learns to implicitly reason about contact and slippage events from proprioceptive history as needed.

The second key concept that enables the demonstrated results is privileged learning (23). We found that training a rough-terrain locomotion policy directly via RL was not successful: The supervisory signal was sparse, and the presented network failed to learn locomotion within a reasonable time budget. Instead, we decomposed the training process into two stages. First, we trained a teacher policy that has access to privileged information, namely ground-truth knowledge of the terrain and the robot's contact with it. The privileged information enables the policy to quickly achieve high performance. We then used this privileged teacher to guide the learning of a purely proprioceptive student controller that only uses sensors that are available on the real robot. This privileged learning protocol is enabled by simulation, but the resulting proprioceptive policy is not confined to simulation and is deployed on physical machines.

The third concept that has been important in achieving the presented levels of robustness is an automated curriculum that synthesizes terrains adaptively, based on the controller's performance at different stages of the training process. In essence, terrains were synthesized such that the controller is capable of traversing them while becoming more robust. We evaluated the traversability of parameterized terrains and used particle filtering to maintain a distribution of terrain parameters of medium difficulty (24, 25) that adapt as the neural network learns. The training conditions grew increasingly more challenging, yielding an omnidirectional controller that combines agility with unprecedented resilience.

The result is a legged locomotion controller that can robustly traverse complex natural terrains that are often unreachable by existing methods. The controller is consistently effective in zero-shot generalization settings. That is, it remains robust when tested under conditions that were never encountered during training. Our training in simulation only uses rigid terrains and a small set of procedurally generated terrain profiles, such as hills and steps. However, when deployed on physical quadrupeds, the controller successfully handled deformable terrains (mud, moss, and snow); dynamic footholds (stepping on a rolling board in a cluttered indoor environment or debris in the field); and overground impediments such as thick vegetation, rubble, and gushing water. Our methodology may lead to future developments for legged robotics. Moreover, our results suggest that the extraordinary complexity of the physical world can be tamed without brittle and painstaking modeling or dangerous and expensive trial and error under real-world field conditions.

Fig. 1. Deployment of the presented locomotion controller in a variety of challenging environments.

Movie 1. Robot in the wild. A learning-based locomotion controller enables a quadrupedal ANYmal robot to traverse challenging natural environments.

RESULTS
Movie 1 summarizes the results of the presented work. We have deployed the trained locomotion controller on two generations of ANYmal robots: ANYmal-B (Fig. 2, D to G) and ANYmal-C (Figs. 2, A to C, and 3). The robots have different kinematics, inertia, and actuators.

Natural environments
The presented controller has been deployed in diverse natural environments, as shown in Fig. 1, Movie 1, and movie S1. These include steep mountain trails, creeks with running water, mud, thick vegetation, loose rubble, snow-covered hills, and a damp forest. A number of specific scenarios are further highlighted in Fig. 2 (A to F). These environments have characteristics that the policy did not experience during training. The terrains can deform and crumble, with substantial variation of material properties over the surface. The robot's legs are subjected to frequent disturbances due to vegetation, rubble, and sticky mud. Existing terrain estimation pipelines that use cameras or LiDAR (26) failed in environments with snow (Fig. 2A), water (Fig. 2C), or dense vegetation (Fig. 2F). Our controller does not rely on exteroception and is immune to failures related to exteroceptive sensing. The controller learns omnidirectional locomotion based on a history of proprioceptive observations and is robust in zero-shot deployment on terrains with characteristics that were never experienced during training.

We have compared the presented controller to a state-of-the-art baseline (1, 27) in the forest environment. The baseline could traverse flat and unobstructed patches but failed frequently upon encountering loose branches, thick vegetation, and mud, as shown in movie S1. Our controller never failed in these experiments.

We have quantitatively evaluated the presented controller and the baseline in three conditions: moss, mud, and vegetation (Fig. 2, D to F). We have measured locomotion speed and energy efficiency. The results are reported in Table 1. The presented controller achieves higher locomotion speed under all conditions. We computed the dimensionless cost of transport (COT) to compare the efficiency of the controllers at different speed ranges. We define the mechanical COT as Σ_{12 actuators} [τ θ̇]⁺ / (mgv), where τ denotes joint torque, θ̇ is joint speed, mg is the total weight, and v is the locomotion speed. This quantity represents positive mechanical power exerted by the actuators per unit weight and unit locomotion speed (28). As shown in Table 1, the presented controller is more energy efficient, with a lower COT than the baseline.
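To make the mechanical COT definition above concrete, the following minimal sketch shows how it could be estimated from logged joint torques, joint speeds, and base speed. The function name, array layout, and the illustrative mass and speed values are assumptions of this sketch, not details of the published implementation.

```python
import numpy as np

def mechanical_cot(tau, joint_vel, base_speed, total_mass, g=9.81):
    """Dimensionless mechanical cost of transport.

    tau        : (T, 12) joint torques for the 12 actuators [N*m]
    joint_vel  : (T, 12) joint velocities [rad/s]
    base_speed : (T,)    locomotion speed v [m/s]
    total_mass : robot mass m [kg], so that m*g is the total weight

    Counts only positive mechanical power, i.e., the sum over the 12 actuators
    of [tau * joint_vel]^+ divided by (m * g * v), averaged over time.
    """
    positive_power = np.clip(tau * joint_vel, 0.0, None).sum(axis=1)  # [tau*qdot]^+ per step
    return float(np.mean(positive_power / (total_mass * g * base_speed)))

# Hypothetical usage with synthetic logs (values are illustrative only).
T = 1000
cot = mechanical_cot(tau=np.random.randn(T, 12),
                     joint_vel=np.random.randn(T, 12),
                     base_speed=np.full(T, 0.4),
                     total_mass=44.0)
```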
The quantitative evaluation reported in Table 1 understates the difference between the two controllers because it only measured speed and energetic efficiency of the baseline during successful locomotion. The baseline's catastrophic failures were not factored into these measurements: When the baseline failed, it was reset by a human operator in a more stable configuration. Catastrophic failures of the baseline controller due to thick vegetation and other factors are shown in movie S1. Our controller exhibited no such failures.

DARPA Subterranean Challenge
Our controller was used by the Cerberus team for the DARPA Subterranean Challenge Urban Circuit (Fig. 2G). It replaced a model-based controller that had been used by the team in the past (1, 27). The objective of the competition is to develop robotic systems that rapidly map, navigate, and search complex underground environments, including tunnels, urban underground, and cave networks. The human operators are not allowed to physically assist the robots during the competition; only teleoperation is allowed. Accordingly, the locomotion controller needs to perform without failure over extended mission durations.

The presented controller drove two ANYmal-B robots in four missions of 60 min. The controller exhibited a zero failure rate throughout the competition. A steep staircase that was traversed by one of the robots during the competition is shown in Fig. 2G.

Fig. 2. A number of specific deployments. (A to F) Zero-shot generalization to slippery and deforming terrains. (G) Steep descent during the DARPA Subterranean Challenge. The stair rise is 18 cm, and the slope is ~45°.

Indoor experiments
We further evaluated the robustness of the presented controller in an indoor environment populated by loose debris, as shown in Fig. 3A. Support surfaces are unstable, and the robot's feet frequently slip. Such conditions can be found at disaster sites and construction zones, where legged robots are expected to operate in the future.

Results are shown in Fig. 3A and movie S2. The robot moved omnidirectionally over the area. The presented controller could stably locomote over shifting support surfaces. This level of robustness is beyond the reach of prior controllers for ANYmal robots (1, 27) and is comparable to the state of the art (2, 29).

The learned controller manifests a foot-trapping (FT) reflex, as shown in Fig. 3B and movie S3. The policy identifies the trapping of the foot purely from proprioceptive observations and lifts the foot over the obstacle. Such reflexes were not specified in any way during training: They developed adaptively. This distinguishes the presented approach from conventional controller design methods, which explicitly build in such reflexes and orchestrate their execution by a higher-level state machine (1, 3). The step shown in Fig. 3B is 16.8 cm high, which is higher than the foot clearance of the legs during normal walking on flat terrain. The maximum foot clearance on flat terrain is 12.9 and 13.6 cm for the left-front (LF) and right-front legs, respectively, and increases up to 22.5 and 18.5 cm in the case of FT. Our controller also learns to adapt the hind leg trajectories when stepping up. The maximum foot clearance on flat terrain is 13.5 and 9.06 cm for the left-hind and right-hind legs and increases up to 16.6 and 15.9 cm when the front legs are above the step. Further analysis is provided in Materials and Methods. Note also that the reflexes learned by our controller are more general and are not tied to particular contact events. Figure 3C shows the controller responding to a mid-shin collision during the swing phase. Here, the trapping event was not signaled by foot contact, and scripted controllers that use foot contact events as triggers would not appropriately handle this situation. Our controller, on the other hand, analyzes the proprioceptive stream as a whole and is trained without making assumptions about possible contact locations. Hence, it can learn to react to any obstructions and disturbances that affect the robot's bodily configuration.

We now focus on comparing the presented approach with the baseline (1, 27) in controlled settings. We first compared the robustness of the controllers in the diagnostic setting of a single step, as shown in Fig. 3D. In each trial, the robot was driven straight to a step for 10 s. A trial was a success if the robot traversed the step with both front and hind legs. We conducted 10 trials for each step height and computed the success rate.
Fig. 3. Evaluation in an indoor environment. (A) Locomotion over unstable debris. The robot steps onto loose boards (highlighted in red and blue) that dislodge under
the robot’s feet. (B) The policy exhibits a foot-trapping reflex and overcomes a 16.8-cm step. (C) The policy learns to appropriately handle obstructions irrespective of the
contact location. Here, it is shown reacting to an obstacle that is encountered mid-shin during the swing phase. (D) Controlled experiments with steps and payload. Our
controller and a baseline (1, 27) were commanded to walk over a step with and without the 10-kg payload. (E) Success rates for different step heights. The success rate
was evaluated over 10 trials for each condition. (F) Mean linear speeds for different command directions on flat terrain. 0° refers to the front of the robot. Shaded areas
denote 95% CIs. (G) Mean heading errors for different command directions on flat terrain. Shaded areas denote 95% CIs.
Because the baseline controller takes a desired linear velocity of the base as input, we commanded a forward velocity of 0.2 and 0.6 m/s. The maximum speed of the baseline is 0.6 m/s. The success rates are given in Fig. 3E. The presented controller outperformed the baseline in stepping both up and down. The baseline showed high sensitivity to FT, which often led to a fall, as shown in movie S3.

We also tested the controllers in the presence of substantial model mismatch. We attached a 10-kg payload, as shown in Fig. 3D and movie S4.
This payload was 22.7% of the total weight of the robot and was never simulated during training. As shown in Fig. 3E, the presented controller could still traverse steps up to 13.4 cm despite the model mismatch. The baseline was incapable of traversing any steps under any command speed with the payload.

We then evaluated the tracking performance of the controllers on flat ground with the payload. We commanded each controller in eight directions and measured the locomotion speed and the tracking error. Target speed was fixed to 0.4 m/s for the baseline controller, which is similar to the operating speed of the presented controller. In Fig. 3F, we show the velocity profiles of the controllers. Our controller locomoted at around 0.4 m/s in all directions and performed similarly with the payload. On the other hand, the locomotion speed of the baseline varied with direction, which can be seen in the anisotropic velocity profile, and the velocity profile shifted largely off center with the payload. Figure 3G shows the heading error of the controllers in each commanded direction. The heading error is the angle between the command velocity and the base velocity of the robot. The heading error of the presented controller was consistently smaller than the baseline, both with and without the payload. The baseline's error in the lateral direction reached ~30°, and the baseline failed when a speed of 0.6 m/s was commanded, as shown in movie S4. In contrast, the average heading error of the presented controller stayed within 10° with or without the payload. We conclude that the presented controller is much more robust to model mismatch.

Next, we tested robustness to foot slippage. To introduce slippage, we used a moistened whiteboard (1). The results are shown in movie S5. The baseline quickly lost balance, aggressively swung the legs, and fell. In contrast, the presented controller adapted to the slippery terrain and successfully locomoted in the commanded direction.

Table 1. Comparison of locomotion performance in natural environments. The mechanical COT is computed using positive mechanical power exerted by the actuators.

                                               Terrain
  Quantity                 Controller   Moss      Mud       Vegetation
  Average speed (m/s)      Ours         0.452     0.338     0.248
                           Baseline     0.199     0.197     –
  Average mechanical COT   Ours         0.423     0.692     1.23
                           Baseline     0.625     0.931     –

DISCUSSION
The presented results substantially advance the published state of the art in legged robotics. Beyond the results themselves, the methodology presented in this work can have broad applications. Before our work, a hypothesis could be held that training in simulation is fundamentally constrained by the limitations of simulation environments in representing the complexity of the physical world. Existing technology is severely limited in its ability to simulate compliant contact, slippage, and deformable and crumbling terrains. As a result, phenomena such as mud, snow, thick vegetation, gushing water, and many others are beyond the capabilities of robotics simulation frameworks (30–32). The sample complexity of model-free RL algorithms, which commonly require millions of time steps for training, further exacerbates the challenge by precluding reliance on frameworks that may require seconds of computation per time step.

Our work demonstrates that simulating the abundant variety of the physical world may not be necessary. Our training environment featured only rigid terrain, with no compliance or overground obstructions such as vegetation. Nevertheless, controllers trained in this environment successfully met the diversity of field conditions encountered at deployment.

We see a number of limitations and opportunities for future work. First, the presented controller only exhibited the trot gait. This is narrower than the range of gait patterns found by quadrupeds in nature (33). The gait pattern is constrained in part by the kinematics and dynamics of the robot, but the ANYmal machines are physically capable of multiple gaits (27). We hypothesize that training protocols and objectives that emphasize diversity can elicit these.

Second, the presented controller relies solely on proprioception. This is a notable advantage in that the controller makes few assumptions on the sensor suite and is not susceptible to failure when exteroception breaks down. Existing work has argued that a blind (proprioceptive) controller should form the basis of a legged locomotion stack (3). Nevertheless, blind locomotion is inherently limited. If the machine is commanded to walk off a cliff, it will. Even under less extreme conditions, the robot's gait is fairly conservative because it must, by necessity, feel out the environment with its body as it locomotes. A major opportunity for future work is to use the presented methodology as a starting point in the development of a hybrid proprioceptive-exteroceptive controller that, like many animals, will be able to locomote even when vision and other external senses are disrupted but will use exteroceptive data when provided. This will enable legged machines to autonomously traverse environments that may have fatal elements, such as cliffs, and to raise speed and energetic efficiency under safer conditions. More broadly, the presented results expedite the deployment of legged machines in environments that are beyond the reach of wheeled and tracked robots and are dangerous or inaccessible to humans, whereas the presented methodology opens new frontiers for training complex robotic systems in simulation and deploying them in the full richness and complexity of the physical world.

MATERIALS AND METHODS
Overview
The main objective of the presented controller is to locomote over rough terrain following a command. The command is given either by a human operator or by a higher-level navigation controller. In our formulation, unlike many existing works (12, 14, 16) that focus on tracking a target velocity of the base (v_T), only the direction (v̂_T) is given to the controller. The reason is that the feasible range of target speeds is often unclear on challenging terrain. For example, the robot can walk faster downhill than uphill.

The command vector is defined as 〈(v̂_T)_xy, (ω̂_T)_z〉. The first part is the target horizontal direction in the base frame, (v̂_T)_xy ≔ 〈cos(ψ_T), sin(ψ_T)〉, where ψ_T is the yaw angle to the command direction in the base frame. The stop command is defined as 〈0.0, 0.0〉. The second part is the turning direction (ω̂_T)_z ∈ {−1, 0, 1}; 1 refers to counterclockwise rotation about the base z axis.

An overview of our method is given in Fig. 4. We use a privileged learning strategy inspired by "learning by cheating" (23) (Fig. 4A).
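As a small illustration of the command definition given in the Overview, the sketch below assembles the direction-only command from a target yaw angle and a turning direction. The class and function names are illustrative and not taken from the released code.

```python
import math
from dataclasses import dataclass

@dataclass
class Command:
    direction_xy: tuple  # (cos(psi_T), sin(psi_T)) in the base frame; (0.0, 0.0) is the stop command
    turn_z: int          # turning direction in {-1, 0, +1}; +1 is counterclockwise about the base z axis

def make_command(target_yaw=None, turn=0):
    """Build the direction-only command: target_yaw is psi_T [rad] in the base frame, or None to stop."""
    if target_yaw is None:
        return Command(direction_xy=(0.0, 0.0), turn_z=turn)
    return Command(direction_xy=(math.cos(target_yaw), math.sin(target_yaw)), turn_z=turn)

forward = make_command(target_yaw=0.0)        # walk straight ahead
turn_in_place = make_command(None, turn=+1)   # no translation, rotate counterclockwise
```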
[Fig. 4C block diagram: the neural network policy (50 Hz) outputs leg frequencies and foot position residuals; a foot trajectory generator turns leg phases into target foot positions, which inverse kinematics converts to joint position targets tracked by a joint position PD controller (400 Hz) acting on the robot dynamics; observations are fed back to the policy.]
Fig. 4. Creating a proprioceptive locomotion controller. (A) Two-stage training process. First, a teacher policy is trained using RL in simulation. It has access to privileged
information that is not available in the real world. Next, a proprioceptive student policy learns by imitating the teacher. The student policy acts on a stream of proprioceptive
sensory input and does not use privileged information. (B) An adaptive terrain curriculum synthesizes terrains at an appropriate level of difficulty during the course of
training. Particle filtering was used to maintain a distribution of terrain parameters that are challenging but traversable by the policy. (C) Architecture of the locomotion
controller. The learned proprioceptive policy modulates motion primitives via kinematic residuals. An empirical model of the joint position PD controller facilitates
deployment on physical machines.
We first train a teacher policy that has access to privileged information concerning the terrain. This teacher policy is then distilled into a proprioceptive student policy that does not rely on privileged information. The privileged teacher policy is confined to simulation, but the student policy is deployed on physical machines. One difference of our methodology from that of Chen et al. (23) is that we do not rely on expert demonstrations to train the privileged policy; rather, the teacher policy is trained via RL.

The privileged teacher model is based on MLPs that receive information about the current state of the robot, properties of the terrain, and the robot's contact with the terrain. The model computes a latent embedding l̄_t that represents the current state and an action ā_t. The training objective rewards locomotion in prescribed directions.

After the teacher policy is trained, it is used to supervise a proprioceptive student policy. The student model is a TCN (22) that receives a sequence of N proprioceptive observations as input. The student policy is trained by imitation. The vectors l̄_t and ā_t computed by the teacher policy are used to supervise the student. This is illustrated in Fig. 4A.

Training is conducted on procedurally generated terrains in simulation. The terrains are synthesized adaptively to facilitate learning according to the skill level of the trained policies at any given time. We define a traversability measure for each terrain and develop a sampling-based method to select terrains with the appropriate difficulty during the course of training. We use particle filtering to maintain an appropriate distribution of terrain parameters. This is illustrated in Fig. 4B. The terrain curriculum is applied during both teacher and student training.

Our control architecture is shown in Fig. 4C. We use the Policies Modulating Trajectory Generators (PMTG) architecture (34) to provide priors on motion generation. The neural network policy modulates leg phases and motion primitives by synthesizing residual position commands.

The simulation uses a learned dynamics model of the robot's joint position proportional-derivative (PD) controller (12). This facilitates the transfer of policies from simulation to reality. After training in simulation, the proprioceptive controller is deployed directly on physical legged machines, with no fine-tuning.
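The two-stage protocol described in this Overview can be summarized in code. The sketch below uses toy tensors and plain behavior cloning in place of the full TRPO and DAgger machinery; the network sizes, the privileged-input dimension, and the latent dimension are illustrative assumptions (the 60-dimensional observation, the 16D action, and the 100-step history match values reported elsewhere in the paper).

```python
import torch
import torch.nn as nn

OBS_DIM, PRIV_DIM, LATENT_DIM, ACT_DIM, HIST_LEN = 60, 50, 64, 16, 100  # PRIV_DIM/LATENT_DIM illustrative

class Teacher(nn.Module):
    """Privileged policy: encodes x_t into a latent l_t, then maps (o_t, l_t) to an action."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(PRIV_DIM, 128), nn.Tanh(), nn.Linear(128, LATENT_DIM))
        self.head = nn.Sequential(nn.Linear(OBS_DIM + LATENT_DIM, 256), nn.Tanh(), nn.Linear(256, ACT_DIM))
    def forward(self, o, x):
        l = self.encoder(x)
        return self.head(torch.cat([o, l], dim=-1)), l

class Student(nn.Module):
    """Proprioceptive policy: encodes the observation history H into a latent, then acts.
    The history encoder is a flattening MLP here for brevity; the paper uses a TCN."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(OBS_DIM * HIST_LEN, 256), nn.Tanh(),
                                     nn.Linear(256, LATENT_DIM))
        self.head = nn.Sequential(nn.Linear(OBS_DIM + LATENT_DIM, 256), nn.Tanh(), nn.Linear(256, ACT_DIM))
    def forward(self, o, H):
        l = self.encoder(H)
        return self.head(torch.cat([o, l], dim=-1)), l

teacher, student = Teacher(), Student()
# Stage 1 (not shown): optimize the teacher with RL (TRPO) on the privileged MDP.
# Stage 2: distill the teacher into the student by imitating its action and latent embedding.
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
for _ in range(10):                      # toy loop; real training aggregates student rollouts (DAgger)
    o = torch.randn(32, OBS_DIM)         # stand-ins for states visited by the student policy
    x = torch.randn(32, PRIV_DIM)        # privileged information from the simulator
    H = torch.randn(32, HIST_LEN, OBS_DIM)
    with torch.no_grad():
        a_bar, l_bar = teacher(o, x)     # supervisory targets from the teacher
    a, l = student(o, H)
    loss = ((a - a_bar) ** 2).mean() + ((l - l_bar) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```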
Motion synthesis
We now elaborate on the control architecture that is illustrated in Fig. 4C. It is divided into motion generation and tracking. The input to our controller consists of the command vector and a sequence of proprioceptive measurements including base velocity, orientation, and joint states. The controller does not use any exteroceptive input (e.g., no haptic sensors, cameras, or depth sensors). The input also does not contain any handcrafted features such as foot contact states or estimated terrain geometry. The controller outputs joint position targets.

Our motion generation strategy is based on the periodic leg phase. Previous works commonly leveraged predefined foot contact schedules (2, 27, 35). We define a periodic phase variable φ_i ∈ [0.0, 2π) for each leg, which represents contact phase if φ_i ∈ [0.0, π) and swing phase if φ_i ∈ [π, 2π). At every time step t, φ_i = (φ_{i,0} + (f_0 + f_i)t) mod 2π, where φ_{i,0} is the initial phase, f_0 is a common base frequency, and f_i is the frequency offset for the ith leg. We want the legs to manifest periodic motions when f_0 + f_i ≠ 0 and engage ground contact in contact phase. We set f_0 as 1.25 Hz, which is the value used by a previously developed conventional controller for a trot gait (27).

The target foot positions, which are the output of the motion generation block, are defined in the horizontal frames (35) of the feet (H_i, i ∈ {1,2,3,4}). H_i is a reference frame that is attached below the hip joint of the ith leg; the distance equals the nominal reach of the leg. The z axis of the frame (H_i z) is parallel to e_g, and H_i x is the projection of the base x axis (B x) onto the horizontal plane, i.e., the frame has the same yaw angle as the robot. The roll and pitch angles of H_i are decoupled from the base. This kinematic trick reduces the effect of base attitude on the foot motions (35) and consequently stabilizes training. Defining the output in H_i results in less premature termination at the beginning of the policy training, when the base motion is unstable due to random actions. Another benefit is that we can decompose the action distribution of the stochastic policy in the lateral and vertical directions during policy training. We applied larger noise in the lateral direction to promote exploration along the ground surface.

We use the PMTG architecture (34) to integrate a neural network to regulate the controller. Our implementation consists of four identical foot trajectory generators (FTGs) and a neural network policy. The FTG is a function F(φ) : [0.0, 2π) → ℝ³ that outputs foot position targets for each leg. The FTG drives vertical stepping motion when f_i is nonzero. The definition of F(φ) is given in section S3. The policy outputs the f_i values and target foot position residuals (Δr_{fi,T}), and the target foot position for the ith foot is r_{fi,T} ≔ F(φ_i) + Δr_{fi,T}.

The tracking control is done using analytic inverse kinematics (IK) and joint position control. Each foot position target defined in H_i is first expressed in the robot base frame, and the joint position targets are computed using analytic IK. The joint position targets are then tracked by joint position PD controllers. The main reason for using analytic IK is to maximize computational efficiency and to reuse existing position control actuator models (6, 15) for the sim-to-real transfer.

Teacher policy
We formulate the control problem as a Markov Decision Process (MDP). MDP is a mathematical framework for modeling discrete-time control processes in which the evolution of the state and the outcomes are partly stochastic. An MDP is defined by a state space S, an action space A, a scalar reward function ℛ(s_t, s_{t+1}), and the transition probability P(s_{t+1} ∣ s_t, a_t). A learning agent selects an action a_t from its policy π(a_t ∣ s_t) and receives a reward r_t from the environment. The objective of the RL framework is to find an optimal policy π* that maximizes the discounted sum of rewards over an infinite time horizon.

Assuming that the environment is fully observable to the teacher, we formulate locomotion control as an MDP and use an off-the-shelf RL method (36) to solve it. In this section, we provide the MDP for teacher training, which is defined by a tuple of state space, action space, transition probability, and reward function.

The state is defined as s_t ≔ 〈o_t, x_t〉, where o_t is the observation vector obtainable from the robot, and x_t is the privileged information that is usually not available in the real world. The detailed definitions are given in table S4. o_t contains command, orientation, base twist, joint positions and velocities, φ_i values, f_i values, and previous foot position targets. Joint position errors and velocities measured at −0.01 and −0.02 s are contained in o_t, which is the same as the input to the learned model of the joint position PD controller. This information allows the policy to exploit the actuator dynamics (12). To encode the leg phase, we use 〈cos(φ), sin(φ)〉 instead of φ, which is a smooth and unique representation for the angle. Previous foot position targets are also fed back to the policy and are used to compute the target smoothness reward that is explained in section S4. When the student controller is deployed, the quantities in o_t are replaced with readings from the proprioceptive sensors, and the base velocity and orientation are provided by a state estimator (37). x_t contains noiseless information that we receive directly from a physics engine. x_t mainly consists of information related to foot-ground interactions such as terrain profile, foot contact states and forces, friction coefficients, and external disturbance forces applied during training. Specifically, we represent the terrain profile with the elevation of nine scan points around each foot, which are symmetrically placed along a circle with a 10-cm radius (visualized in Fig. 4).

The action (ā_t) is a 16-dimensional (16D) vector consisting of leg frequencies and foot position residuals. The reward function is defined such that an RL agent receives a higher reward if it advances faster toward the goal. The reward function is specified in detail in section S4.

The policy network is constructed by two MLP blocks as shown in Fig. 4A. The MLP encoder embeds x_t into a latent vector l̄_t. The command and robot states are not included in x_t, so l̄_t contains only the terrain- and contact-related features. We hypothesize that l̄_t drives adaptive behaviors such as changing foot clearance depending on the terrain profile. Then, l̄_t and o_t are provided to the subsequent MLP layers to compute the action.

The Trust Region Policy Optimization (TRPO) (36) algorithm is used for training. The hyperparameters we used are given in table S7.

Student policy
The proprioceptive student policy only has access to o_t. A key hypothesis here is that the latent features l̄_t can be (partially) recovered from a time series of proprioceptive observations, h_t, which is defined as h_t ≔ o_t \ {f_0, joint history, previous foot position targets}. The student policy uses a TCN (22) encoder. The input to the TCN encoder is H = {h_{t−1}, …, h_{t−N−1}}, where N is the history length. The encoder is fully convolutional and consists of three dilated causal convolutional layers, interleaved with strided convolutional layers that reduce dimensionality. The architecture is specified in tables S5 and S6.
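A minimal PyTorch sketch of such a dilated, causal convolutional encoder is shown below. The channel widths, kernel sizes, dilation factors, and the way striding is folded into the causal layers are illustrative choices; the actual layer settings are those listed in tables S5 and S6.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only looks at past samples (left padding)."""
    def __init__(self, c_in, c_out, kernel_size, dilation=1, stride=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, stride=stride, dilation=dilation)
    def forward(self, x):                           # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

class TCNEncoder(nn.Module):
    """Encodes a proprioceptive history (60 channels x N steps) into a latent vector."""
    def __init__(self, in_channels=60, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(in_channels, 64, kernel_size=5, dilation=1), nn.ReLU(),
            CausalConv1d(64, 64, kernel_size=5, dilation=2, stride=2), nn.ReLU(),  # strided layers
            CausalConv1d(64, 64, kernel_size=5, dilation=4, stride=2), nn.ReLU(),  # reduce the time dim
        )
        self.proj = nn.Linear(64, latent_dim)
    def forward(self, H):                            # H: (batch, 60, N)
        return self.proj(self.net(H)[:, :, -1])      # features at the most recent time step

H = torch.randn(8, 60, 100)                          # a batch of 2-s histories (N = 100 at 50 Hz)
latent = TCNEncoder()(H)                             # -> (8, 64)
```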
We use the TCN architecture because it affords transparent control over the input history length, can accommodate long histories, and is known to be robust to hyperparameter settings (22). A comparison with a recurrent neural network architecture is provided in section S8.

The student policy is trained via supervised learning. The loss function is defined as

  ℒ ≔ (ā_t(o_t, x_t) − a_t(o_t, H))² + (l̄_t(o_t, x_t) − l_t(H))²    (1)

Quantities marked by a bar (⋅̄) denote target values generated by the teacher. We use the dataset aggregation strategy (DAgger) (38). Specifically, training data are generated by rolling out trajectories by the student policy. For each visited state, the teacher policy computes its embedding and action vectors (⋅̄). These outputs of the teacher policy are used as supervisory signals associated with the corresponding states. The hyperparameters we used are given in table S8.

Adaptive terrain curriculum
Our method is inspired by automatic curriculum learning (ACL) for RL agents (25, 39). The Paired Open-Ended Trailblazer (POET) approach (25) generates diverse parameterized terrains for a 2D bipedal agent. The method uses minimal criteria (24, 40) and aims to choose environmental parameters that are neither too challenging nor trivial for the agents: This is realized by selecting task parameters that yield midrange rewards. Florensa et al. (39) similarly choose achievable yet difficult goals for RL agents.

Our method likewise realizes a training curriculum that gradually modifies a distribution over environmental parameters such that the policy can continuously improve locomotion skills and generalize to new environments. Our work differs from POET because POET aims for open-ended search in the space of possible problems and evolves a population of specialized agents, whereas we seek to obtain a single generalist agent.

Figure 4B shows the types of terrains used in our training environment. Each terrain is generated by a parameter vector c_T ∈ C. The terrains are described in detail in section S5. Our ACL method approximates a distribution of desirable c_T values using a particle filter.

We first describe how a given c_T is evaluated in simulation. Instead of directly using the reward function to evaluate the learning progress (25, 41–43), we evaluate c_T values by the traversability of generated terrains, which is defined as the success rate of traversing a terrain. We found traversability to be more intuitive than the reward function, which consists of multiple objectives that are often unbounded. We first define a labeling function as

  label(s_t, a_t, s_{t+1}) = { 1 if v_pr(s_{t+1}) > 0.2;  0 if v_pr(s_{t+1}) < 0.2 ∨ termination }    (2)

for a state transition from s_t to s_{t+1}. v_pr(s_{t+1}) stands for the inner product of the base velocity and commanded direction at time step t + 1. If π can locomote in the commanded direction faster than 0.2 m/s, we consider the terrain traversable in this direction. The threshold is a hyperparameter; 0.2 m/s is about one-third of the maximum speed of our robot. Traversability is defined as

  Tr(c_T, π) = 𝔼_{ξ∼π}{ label(s_t, a_t, s_{t+1} ∣ c_T) } ∈ [0.0, 1.0]    (3)

where ξ refers to trajectories generated by π. This follows a definition of empirical traversability in prior work (44).

The objective of our terrain generation method is to find c_T values with midrange traversability (Tr(c_T, π) ∈ [0.5, 0.9]). The rationale is to synthesize terrains that are neither too easy nor too difficult. We define terrain desirability as follows

  Td(c_T, π) ≔ Pr(Tr(c_T, π) ∈ [0.5, 0.9]) = 𝔼_{ξ∼π}{ Tr(c_T, π) ∈ [0.5, 0.9] }    (4)

where 0.5 and 0.9 are fixed thresholds for minimum/maximum traversability.

We use a particle filter to keep track of a distribution of high-desirability c_T values during training. We formulate a particle-filtering problem where we approximate the distribution of terrain parameters that satisfies Tr(c_T, π) ∈ [0.5, 0.9] with a finite set of sampling points (c_T^k ∈ C, k ∈ {1, ⋯, N_particle}). Our algorithm is modeled on the Sequential Importance Resampling particle filter. It is based on the following assumptions.

1) Terrain parameters with similar Tr(⋅, π) are close in Euclidean distance in parameter space.
2) A policy trained over the terrains generated by c_T values in some area of C will learn to interpolate to nearby terrain parameters.
3) c_{T,0}, c_{T,1}, … forms a Markov process, where c_{T,j} = {c_{T,j}^1, c_{T,j}^2, …, c_{T,j}^{N_particle}} at iteration j.

The first assumption comes from the insight that terrain parameters can be interpolated, e.g., the difficulty of a staircase increases as we increase the step height. The second assumption justifies the use of discrete samples from C to train a policy that generalizes over a certain region of C. The last assumption is necessary for formulating a particle filter.

The importance weight w^k is defined for each c_T^k, and the set of tuples 〈c_T^k, w^k〉 approximates the target distribution (c_T values with Tr(c_T, π) ∈ [0.5, 0.9]). We define the measurement variable y_j^k such that y_j^k = 1 if Tr(c_{T,j}^k, π) ∈ [0.5, 0.9]. Then, the terrain desirability defined above becomes the measurement probability

  Pr(y_j^k ∣ c_{T,j}^k) = Pr(Tr(c_{T,j}^k, π) ∈ [0.5, 0.9]) = Td(c_{T,j}^k, π)    (5)

For practical implementation, the measurement probability is computed by the empirical expectation from the samples collected during policy training

  Pr(y_j^k ∣ c_{T,j}^k) ≈ Σ 1(Tr(c_{T,j}^k, π) ∈ [0.5, 0.9]) / N_traj    (6)

where N_traj denotes the number of trajectories generated using c_{T,j}^k. The trajectories are also used for policy training. Our method therefore does not require additional evaluation steps to advance the curriculum of the terrain parameters. Resampling is done such that the probability of choosing the kth sample equals the normalized importance weight w^k / Σ_i w^i ∈ [0, 1], with the sum taken over the N_particle samples.

The transition model is a random walk in C. Each parameter of a sampling point is shifted to its adjacent value by a fixed probability p_transition. It satisfies the third assumption (Markov process) because the evolution of each parameter only relies on the current value and randomly sampled noise.
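The terrain-curriculum particle filter described above can be sketched as follows. The traversability function is mocked up here (in the actual method it is measured from the policy's rollouts on terrains generated by c_T), and the parameter grid, the number of particles, and p_transition = 0.1 are illustrative values; only the 0.5/0.9 thresholds, the resampling rule, and the random-walk transition follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PARTICLE, P_TRANSITION = 100, 0.1         # illustrative values
GRID = np.linspace(0.0, 1.0, 11)            # bounded, discretized values per terrain parameter
N_DIM = 3                                   # e.g., roughness, step height, slope (illustrative)

def traversability(c_T):
    """Stand-in for Tr(c_T, pi): in practice, the fraction of successful transitions
    measured from trajectories rolled out on the terrain generated by c_T."""
    return float(np.clip(1.0 - c_T.mean() + 0.1 * rng.standard_normal(), 0.0, 1.0))

particles = rng.choice(GRID, size=(N_PARTICLE, N_DIM))          # initial samples c_T^k
for iteration in range(50):
    # Measurement update: weight each particle by whether its terrain has medium difficulty.
    tr = np.array([traversability(c) for c in particles])
    weights = ((tr >= 0.5) & (tr <= 0.9)).astype(float) + 1e-6   # empirical Pr(y | c_T^k)
    weights /= weights.sum()
    # (The same rollouts would be reused here to update the policy.)
    # Resample particles in proportion to the normalized importance weights.
    particles = particles[rng.choice(N_PARTICLE, size=N_PARTICLE, p=weights)]
    # Transition model: random walk; each parameter jumps to an adjacent grid value with prob. p.
    for k in range(N_PARTICLE):
        for d in range(N_DIM):
            if rng.random() < P_TRANSITION:
                j = int(np.abs(GRID - particles[k, d]).argmin())
                particles[k, d] = GRID[int(np.clip(j + rng.choice([-1, 1]), 0, GRID.size - 1))]
```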
To improve exploration, we bounded and discretized C to reduce the search space. The initial samples (c_{T,0}^k) are either drawn uniformly from C or concentrated on almost flat terrains. Implementation details and an overview of the training process are provided in section S2 and algorithm S1.

Validation of the method
We present ablation studies to justify each component of our approach: (i) using a sequence model for the student policy, (ii) privileged training, and (iii) adaptive terrain curriculum.

Memory in proprioceptive control
We evaluate the importance of incorporating proprioceptive memory in the controller via the TCN architecture (22). Let TCN-N denote a TCN with a receptive field of N time steps. The network architectures we use are specified in detail in table S5. We test controllers in diagnostic settings designed to focus on specific capabilities. Specifically, we test omnidirectional locomotion on sloped ground, traversal of a discrete step, and robustness to external disturbances (Fig. 5A).

Figure 5 (B to D) summarizes the importance of the memory length N. In these experiments, N is varied from 1 (corresponding to 20 ms of memory) to 100 (2 s of proprioceptive memory). The latter is the default setting used in our deployed controller.

As shown in Fig. 5B, memory length does not have a strong effect in the uniform slope setting. Memory length does have a strong effect on the controller's ability to traverse a step (Fig. 5, B and C). Controllers with longer memory are able to handle higher steps. As shown in Fig. 5C, the failure rate of limited-memory controllers is particularly high when the hind legs encounter the step. Controllers with longer memory also adapt hind-leg trajectories to ensure higher foot clearances.

Figure 5D shows that controllers with longer memory are more robust to external disturbances. We applied an external 50-N force laterally to the base for 5 s during a straight walk and evaluated the resulting deviation from the intended locomotion direction. The deviation of the TCN-100 controller was 35.5% lower than that of TCN-1.

Fig. 5. Ablation studies. We trained each model five times using different random seeds. Error bars denote 95% CIs. (A) Test setups. The robot was commanded to advance for 10 s in the specified direction (black arrow). We conducted 100 trials for each test. On the step test, a trial was considered successful if the robot traversed the step with both front and hind legs. Robots were initialized with random joint configurations. Initial yaw angle was sampled from U(−π, π) for the slope test and from U(−π/6, π/6) for the other tests. The friction coefficients between the feet and the ground were sampled from U(0.4, 1.0). The external force was applied for 5 s in the lateral direction. (B to D) Importance of memory length N in the TCN-N encoder. (E to G) Importance of privileged training. (F) Learning curves for the teacher (gray) and a TCN-20 student trained directly, without privileged training (red). For comparison, the blue line indicates the mean reward of a TCN-20 student trained with privileged training. The reward was computed by running each policy on uniformly sampled terrains. (H to J) Importance of the adaptive curriculum.

Privileged training
We now assess the importance of privileged training. As a baseline, we train a TCN-20 policy directly, without the two-stage privileged training protocol. The policy is trained by TRPO (36) with the same reward and hyperparameters that we use for teacher training. This baseline is compared to the same TCN-20 architecture trained via privileged learning.

The results are summarized in Fig. 5 (E to G). Figure 5E shows that the baseline fails the diagnostic tests: It is incapable of locomoting on a slope or traversing a step. Figure 5F shows that, during training, the baseline does not reach a reward comparable to the teacher MLP architecture with privileged information or to the proprioceptive TCN-20 architecture (same architecture as the baseline, no privileged information) trained via privileged learning. Figure 5G shows the mean episode length during training, which indicates that the baseline fails to learn to balance and locomote.

Adaptive terrain curriculum
We now evaluate the effect of the adaptive terrain curriculum on teacher training. Terrains used for training (specifically, hills, steps, and stairs) are shown in Fig. 4B. As a baseline, we trained a teacher
using randomly generated terrains that are uniformly sampled from C as specified in table S2. The success rates on the testing terrains are substantially lower when trained without the adaptive curriculum, as shown in Fig. 5H. Figure 5I shows that a teacher trained without the adaptive curriculum plateaus at a lower reward level. Throughout the training process, the mean episode length is shorter for the model being trained without the adaptive curriculum (Fig. 5J). This is because uniform sampling is more likely to draw terrains that cannot be successfully traversed by the policy being trained. On these terrains, the policy fails early and receives less training signal as a result. The adaptive curriculum modulates the difficulty of sampled terrains so as to maximize the didactic benefit of each episode. We provide an additional evaluation of the adaptive curriculum in section S6.

Fig. 6. Analysis of the emergent foot-trapping reflex. FT occurs when the LF foot collides with the step. (A) The LF foot hits the step and then manifests higher foot clearance to overcome the step (ii to iv) in the following swing phase. (B) Reconstructed terrain information from TCN embeddings. Red ellipsoids: Estimated terrain shape around the foot. The center of the ellipsoid refers to the estimated terrain elevation, and the vertical length represents uncertainty (SD). Black arrows: Terrain normal at the in-contact foot. Red cone: Uncertainty of normal estimation. Blue spheres: Estimated in-contact feet. (C) Input saliency at different moments. The peaks show that the TCN policy attends to the FT that happened around 2.1 s. The orange curve (flat terrain) shows the saliency value computed on a flat terrain at similar gait phases. (D) Saliency map unrolled across input channels at 3.4 s. Red boxes refer to joint measurements from the LF leg at the moment it collides with the step.

Further analysis of emergent behavior
Here, we provide further analysis on how the proprioceptive policy adapts to different situations. To investigate how the proprioceptive
SCIENCE ROBOTICS | RESEARCH ARTICLE
policy perceives the environment, we trained a decoder network SUPPLEMENTARY MATERIALS
that reconstructs the privileged information xt ∈ X from the output robotics.sciencemag.org/cgi/content/full/5/47/eabc5986/DC1
of an intermediate layer of a trained TCN policy. xt consists of in- Section S1. Nomenclature
Section S2. Implementation details
formation that is not directly observable by the student policy such Section S3. Foot trajectory generator
as contact states, terrain shape, and external disturbances. For clas- Section S4. Reward function for teacher policy training
sification of foot contact states, we use a standard cross-entropy loss Section S5. Parameterized terrains
function. For regression of other states, we predict both mean mi Section S6. Qualitative evaluation of the adaptive terrain curriculum
Section S7. Reconstruction of the privileged information in different situations
and SD i for each component and use a negative Gaussian log- Section S8. Recurrent neural network student policy
likelihood loss to quantify the uncertainty encoded in the TCN Section S9. Ablation of the latent representation loss for student training
representation (45) Fig. S1. Illustration of the adaptive curriculum.
Fig. S2. Reconstructed privileged information in different situations.
Fig. S3. Comparison of neural network architectures for the proprioceptive controller.
2
(m i − mgt i )
Table S1. Computation time for training.
ℒ = ─ + log( i)
(7) Table S2. Parameter spaces Cfor simulated terrains.
i∈dim(X\contactstates) 2 2i Table S3. Hyperparameters for automatic terrain curriculum.
Table S4. State representation for proprioceptive controller and the privileged information.
with added weight decay. The superscript gt refers to the ground Table S5. Neural network architectures.
Table S6. Network parameter settings and the training time for student policies.
truth generated in simulation. Note that the parameters of the policy
Downloaded from http://robotics.sciencemag.org/ by guest on October 21, 2020
Table S7. Hyperparameters for teacher policy training.
network are fixed during decoder training. Therefore, the decoder Table S8. Hyperparameters for student policy training.
network is not used for policy training. It only provides insight into Table S9. Hyperparameters for decoder training.
the information encoded by the TCN policy after training. Algorithm S1. Teacher training with automatic terrain curriculum.
In Fig. 6, we provide snapshots of the FT reflex motion (Fig. 6A) and the reconstructed privileged information. In Fig. 6B, we show the reconstructed terrain geometry and foot contact state. When the LF foot collides with the step, the estimated elevation in front of the front legs increases, and its uncertainty grows (i and ii). The estimated elevations and normal vectors adapt to the step during the FT reflex (iii and iv). After the successful step-up, the terrain uncertainty remains elevated (v), indicating an anticipation of generally rough terrain. In addition, the decoder network can detect foot contacts with horizontal and vertical surfaces while successfully identifying frontal collision as such, as indicated by the estimated terrain normal vector (i and iii). The ability to reconstruct explicit environmental information from the encoding of the proprioceptive history is a strong indicator that the TCN policy learns to build an internal representation of the environment and uses it for decision making. We provide more examples of the reconstructed privileged information in section S7.

We then analyze how the proprioceptive policy leverages past observations. We compute the saliency map of the input H ∈ ℝ^(60×N) and visualize the sensitivity of the policy to each element of the input while overcoming the step (46). Each column of H is a proprioceptive measurement h ∈ ℝ^60, and we stack N measurements (history length = 0.02 s × N). We define the saliency value for the ith measurement (i ∈ [0, N]) as

M_i = \sum_{j \in \text{channels}} \left( \left| \frac{d (r_{f,T})_z}{d H_{i,j}} \right| \cdot \left| H_{i,j} \right| \right) \in \mathbb{R}    (8)

where (r_{f,T})_z refers to the height command for the foot f. We computed the value for (r_{f,T})_z because we are interested in the change in foot clearance. M_i can be interpreted as the sensitivity of the output to the ith measurement. Because we use 1D convolution over time, the output is in ℝ^N, i.e., each row of H is regarded as a channel.
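Eq. 8 is a gradient-times-input saliency (46) summed over channels, which can be evaluated with automatic differentiation. In the sketch below, foot_height_from_history is a hypothetical wrapper that maps the history matrix H to the scalar foot height command (r_{f,T})_z of a fixed foot f; it stands in for the trained TCN policy head and is not part of the published code.

```python
# Minimal sketch of the saliency map in Eq. 8 (illustrative, not the authors' code).
# H has shape (60, N): 60 proprioceptive channels (rows) over N past measurements (columns).
import jax
import jax.numpy as jnp

def saliency_map(foot_height_from_history, H):
    """Per-measurement saliency M in R^N for the foot height command (r_{f,T})_z."""
    grad_H = jax.grad(foot_height_from_history)(H)   # d(r_{f,T})_z / dH_{i,j}, same shape as H
    weighted = jnp.abs(grad_H) * jnp.abs(H)          # gradient magnitude times input magnitude
    return jnp.sum(weighted, axis=0)                 # sum over channels j -> one value per measurement i
```

Plotting the resulting vector over the history window gives a saliency profile of the kind visualized for the step experiment in Fig. 6C.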
In Fig. 6C, we can see that the saliency value at the FT is kept high while stepping up. The policy has direct access to the measurements at the moment of FT and leverages this in the following swing phase. This is highlighted by the red boxes in Fig. 6D. The policy attends to the LF leg joint states measured at the FT.

SUPPLEMENTARY MATERIALS
Fig. S1. Illustration of the adaptive curriculum.
Fig. S2. Reconstructed privileged information in different situations.
Fig. S3. Comparison of neural network architectures for the proprioceptive controller.
Table S1. Computation time for training.
Table S2. Parameter spaces for simulated terrains.
Table S3. Hyperparameters for automatic terrain curriculum.
Table S4. State representation for proprioceptive controller and the privileged information.
Table S5. Neural network architectures.
Table S6. Network parameter settings and the training time for student policies.
Table S7. Hyperparameters for teacher policy training.
Table S8. Hyperparameters for student policy training.
Table S9. Hyperparameters for decoder training.
Algorithm S1. Teacher training with automatic terrain curriculum.
Movie S1. Deployment in a forest.
Movie S2. Locomotion over unstable debris.
Movie S3. Step experiment.
Movie S4. Payload experiment.
Movie S5. Foot slippage experiment.

REFERENCES AND NOTES
1. F. Jenelten, J. Hwangbo, F. Tresoldi, C. D. Bellicoso, M. Hutter, Dynamic locomotion on slippery ground. IEEE Robot. Autom. Lett. 4, 4170–4176 (2019).
2. G. Bledt, P. M. Wensing, S. Ingersoll, S. Kim, Contact model fusion for event-based locomotion in unstructured terrains, in 2018 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2018).
3. M. Focchi, R. Orsolino, M. Camurri, V. Barasuol, C. Mastalli, D. G. Caldwell, C. Semini, Heuristic planning for rough terrain locomotion in presence of external disturbances and variable perception quality, in Advances in Robotics Research: From Lab to Market (Springer, 2020), pp. 165–209.
4. J. Reher, W.-L. Ma, A. D. Ames, Dynamic walking with compliance on a Cassie bipedal robot, in European Control Conference (IEEE, 2019), pp. 2589–2595.
5. Y. Gong, R. Hartley, X. Da, A. Hereid, O. Harib, J.-K. Huang, J. Grizzle, Feedback control of a Cassie bipedal robot: Walking, standing, and riding a Segway, in American Control Conference (IEEE, 2019), pp. 4559–4566.
6. J. Hwangbo, C. D. Bellicoso, P. Fankhauser, M. Hutter, Probabilistic foot contact estimation by fusing information from dynamics and differential/forward kinematics, in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2016), pp. 3872–3878.
7. M. Camurri, M. Fallon, S. Bazeille, A. Radulescu, V. Barasuol, D. G. Caldwell, C. Semini, Probabilistic contact estimation and impact detection for state estimation of quadruped robots. IEEE Robot. Autom. Lett. 2, 1023–1030 (2017).
8. M. Focchi, V. Barasuol, M. Frigerio, D. G. Caldwell, C. Semini, Slip detection and recovery for quadruped robots, in Robotics Research (Springer, 2018), pp. 185–199.
9. M. Bloesch, C. Gehring, P. Fankhauser, M. Hutter, M. A. Hoepflinger, R. Siegwart, State estimation for legged robots on unstable and slippery terrain, in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IEEE, 2013), pp. 6058–6064.
10. C. Gehring, C. D. Bellicoso, S. Coros, M. Bloesch, P. Fankhauser, M. Hutter, R. Siegwart, Dynamic trotting on slopes for quadrupedal robots, in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2015), pp. 5129–5135.
11. R. Hartley, J. Mangelson, L. Gan, M. G. Jadidi, J. M. Walls, R. M. Eustice, J. W. Grizzle, Legged robot state-estimation through combined forward kinematic and preintegrated contact factors, in 2018 IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2018), pp. 1–8.
12. J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, M. Hutter, Learning agile and dynamic motor skills for legged robots. Sci. Robot. 4, eaau5872 (2019).
13. T. Haarnoja, S. Ha, A. Zhou, J. Tan, G. Tucker, S. Levine, Learning to walk via deep reinforcement learning (Robotics: Science and Systems, 2019).
14. Z. Xie, P. Clary, J. Dao, P. Morais, J. Hurst, M. van de Panne, Learning locomotion skills for Cassie: Iterative design and sim-to-real, in Conference on Robot Learning (PMLR, 2019).
15. J. Lee, J. Hwangbo, M. Hutter, Robust recovery controller for a quadrupedal robot using deep reinforcement learning. arXiv:1901.07517 [cs.RO] (22 January 2019).
16. J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, V. Vanhoucke, Sim-to-real: Learning agile locomotion for quadruped robots (Robotics: Science and Systems, 2018).
17. Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, V. Sindhwani, Data efficient reinforcement learning for legged robots, in Conference on Robot Learning (PMLR, 2019).
18. S. Ha, P. Xu, Z. Tan, S. Levine, J. Tan, Learning to walk in the real world with minimal human effort. arXiv:2002.08550 [cs.RO] (20 February 2020).
19. X. B. Peng, E. Coumans, T. Zhang, T.-W. Lee, J. Tan, S. Levine, Learning agile robotic locomotion skills by imitating animals. arXiv:2004.00784 [cs.RO] (2 April 2020).
20. M. Hutter, C. Gehring, D. Jud, A. Lauber, C. D. Bellicoso, V. Tsounis, J. Hwangbo, K. Bodie, P. Fankhauser, M. Bloesch, R. Diethelm, S. Bachmann, A. Melzer, M. A. Höpflinger, ANYmal - a highly mobile and dynamic quadrupedal robot, in IEEE/RSJ International Conference on Intelligent Robots and Systems (IEEE, 2016), pp. 38–44.
21. X. B. Peng, M. Andrychowicz, W. Zaremba, P. Abbeel, Sim-to-real transfer of robotic control with dynamics randomization, in IEEE International Conference on Robotics and Automation (ICRA) (IEEE, 2018).
22. S. Bai, J. Z. Kolter, V. Koltun, An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271 [cs.LG] (4 March 2018).
23. D. Chen, B. Zhou, V. Koltun, P. Krähenbühl, Learning by cheating, in Conference on Robot Learning (2019).
24. J. C. Brant, K. O. Stanley, Minimal criterion coevolution: A new approach to open-ended search, in Genetic and Evolutionary Computation Conference (GECCO, 2017), pp. 67–74.
25. R. Wang, J. Lehman, J. Clune, K. O. Stanley, Paired open-ended trailblazer (POET): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv:1901.01753 [cs.NE] (7 January 2019).
26. P. Fankhauser, M. Bloesch, C. Gehring, M. Hutter, R. Siegwart, Robot-centric elevation mapping with uncertainty estimates, in Mobile Service Robotics (World Scientific, 2014), pp. 433–440.
27. C. D. Bellicoso, F. Jenelten, C. Gehring, M. Hutter, Dynamic locomotion through online nonlinear motion optimization for quadrupedal robots. IEEE Robot. Autom. Lett. 3, 2261–2268 (2018).
28. S. Collins, A. Ruina, R. Tedrake, M. Wisse, Efficient bipedal robots based on passive-dynamic walkers. Science 307, 1082–1085 (2005).
29. Ghost Robotics, Vision 60: Latest blind-mode stress testing of V60 legged robot (2019); www.youtube.com/watch?v=tQsLauQWp8M.
30. J. Hwangbo, J. Lee, M. Hutter, Per-contact iteration method for solving contact dynamics. IEEE Robot. Autom. Lett. 3, 895–902 (2018).
31. E. Coumans, Bullet physics library (2013); pybullet.org.
32. R. Smith, Open dynamics engine (2005); ode.org.
33. R. M. Alexander, Principles of Animal Locomotion (Princeton Univ. Press, 2003).
34. A. Iscen, K. Caluwaerts, J. Tan, T. Zhang, E. Coumans, V. Sindhwani, V. Vanhoucke, Policies modulating trajectory generators, in Conference on Robot Learning (PMLR, 2018), pp. 916–926.
35. V. Barasuol, J. Buchli, C. Semini, M. Frigerio, E. R. De Pieri, D. G. Caldwell, A reactive controller framework for quadrupedal locomotion on challenging terrain, in 2013 IEEE International Conference on Robotics and Automation (IEEE, 2013), pp. 2554–2561.
36. J. Schulman, S. Levine, P. Abbeel, M. Jordan, P. Moritz, Trust region policy optimization, in International Conference on Machine Learning (PMLR, 2015), pp. 1889–1897.
37. M. Bloesch, M. Hutter, M. A. Hoepflinger, S. Leutenegger, C. Gehring, C. D. Remy, R. Siegwart, State estimation for legged robots - consistent fusion of leg kinematics and IMU. Robotics 17, 17–24 (2013).
38. S. Ross, G. Gordon, D. Bagnell, A reduction of imitation learning and structured prediction to no-regret online learning, in International Conference on Artificial Intelligence and Statistics (AISTATS, 2011), pp. 627–635.
39. C. Florensa, D. Held, X. Geng, P. Abbeel, Automatic goal generation for reinforcement learning agents, in International Conference on Machine Learning (PMLR, 2018), pp. 1514–1523.
40. J. Lehman, K. O. Stanley, Revising the evolutionary computation abstraction: Minimal criteria novelty search, in Genetic and Evolutionary Computation Conference (GECCO, 2010), pp. 103–110.
41. T. Matiisen, A. Oliver, T. Cohen, J. Schulman, Teacher-student curriculum learning, in IEEE Transactions on Neural Networks and Learning Systems (IEEE, 2019).
42. W. Yu, G. Turk, C. K. Liu, Learning symmetric and low-energy locomotion. ACM Transactions on Graphics (2018), p. 144.
43. I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, L. Zhang, Solving Rubik's Cube with a robot hand. arXiv:1910.07113 [cs.LG] (16 October 2019).
44. R. O. Chavez-Garcia, J. Guzzi, L. M. Gambardella, A. Giusti, Learning ground traversability from simulations. IEEE Robot. Autom. Lett. 3, 1695–1702 (2018).
45. A. Kendall, Y. Gal, What uncertainties do we need in Bayesian deep learning for computer vision?, in Advances in Neural Information Processing Systems 30 (Neural Information Processing Systems Foundation, 2017), pp. 5574–5584.
46. K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034 [cs.CV] (20 December 2013).

Acknowledgments: A substantial part of the work was carried out during J.H.'s stay at the Robotic Systems Lab, ETH Zürich. Funding: The project was funded, in part, by the Intel Network on Intelligent Systems, the Swiss National Science Foundation (SNF) through the National Centre of Competence in Research Robotics, and the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreements no. 852044 and no. 780883). The work has been conducted as part of ANYmal Research, a community to advance legged robotics. Author contributions: J.L. formulated the main idea of the training and control methods, implemented the controller, set up the simulation, and trained control policies. J.L. performed the indoor experiments. J.H. contributed to setting up the simulation. J.L. and L.W. performed outdoor experiments together. All authors refined ideas, contributed to the experiment design, and analyzed the data. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper or the Supplementary Materials. Other materials can be found at https://github.com/leggedrobotics/learning_locomotion_over_challening_terrain_supplementary.

Submitted 10 May 2020
Accepted 22 September 2020
Published 21 October 2020
10.1126/scirobotics.abc5986

Citation: J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, M. Hutter, Learning quadrupedal locomotion over challenging terrain. Sci. Robot. 5, eabc5986 (2020).