Reinforcement Learning To Adjust Robot Movements To New Situations
Figure 1: This figure illustrates a table tennis task. The situation, described by the state s, corresponds to the positions and velocities of the ball and the robot at the time the ball is above the net. The meta-parameters γ are the joint positions and velocities at which the ball is hit. The policy parameters represent the backward motion and the movement on the arc. The meta-parameter function γ(s), which maps the state to the meta-parameters, is learned.

As it may be hard to realize a parametrized representation for meta-parameter determination, we reformulate the reward-weighted regression [Peters and Schaal, 2007] in order to obtain a Cost-regularized Kernel Regression (CrKR) that is related to Gaussian process regression [Rasmussen and Williams, 2006]. We evaluate the algorithm in the acquisition of flexible motor primitives for dart games such as Around the Clock [Masters Games Ltd., 2011] and for table tennis.

2 Meta-Parameter Learning for DMPs

The goal of this paper is to show that elementary movements can be generalized by modifying only the meta-parameters of the primitives using learned mappings. In Section 2.1, we first review how a single primitive movement can be represented and learned. We discuss how such meta-parameters may be able to adapt the motor primitive spatially and temporally to the new situation. In order to develop algorithms that learn to automatically adjust such motor primitives, we model meta-parameter self-improvement as an episodic reinforcement learning problem in Section 2.2. While this problem could in theory be treated with arbitrary reinforcement learning methods, the availability of only few samples suggests that more efficient, task-appropriate reinforcement learning approaches are needed. To avoid the limitations of parametric function approximation, we aim for a kernel-based approach. When a movement is generalized, new parameter settings need to be explored. Hence, a predictive distribution over the meta-parameters is required to serve as an exploratory policy. These requirements lead to the method which we employ for meta-parameter learning in Section 2.3.

2.1 DMPs with Meta-Parameters

In this section, we review how the dynamical systems motor primitives [Ijspeert et al., 2003; Schaal et al., 2007] can be used for meta-parameter learning. The dynamical systems motor primitives [Ijspeert et al., 2003] are a powerful movement representation that allows ensuring the stability of the movement and choosing between a rhythmic and a discrete movement. One of the biggest advantages of this motor primitive framework is that it is linear in the shape parameters θ. Therefore, these parameters can be obtained efficiently, and the resulting framework is well-suited for imitation [Ijspeert et al., 2003] and reinforcement learning [Kober and Peters, 2010]. The resulting policy is invariant under transformations of the initial position, the goal, the amplitude, and the duration [Ijspeert et al., 2003]. These four modification parameters can be used as the meta-parameters γ of the movement. Obviously, we can make more use of the motor primitive framework by adjusting the meta-parameters γ depending on the current situation or state s according to a meta-parameter function γ(s). The state s can, for example, contain the current position, velocity, and acceleration of the robot and external objects, as well as the target to be achieved. This paper focuses on learning the meta-parameter function γ(s) by episodic reinforcement learning.

Illustration of the Learning Problem: As an illustration of the meta-parameter learning problem, we take a table tennis task which is illustrated in Figure 1 (in Section 3.2, we will expand this example to a robot application). Here, the desired skill is to return a table tennis ball. The motor primitive corresponds to the hitting movement. When modeling a single hitting movement with dynamical-systems motor primitives [Ijspeert et al., 2003], the combination of retracting and hitting motions would be represented by one movement primitive and can be learned by determining the movement parameters θ. These parameters can either be estimated by imitation learning or acquired by reinforcement learning. The return can be adapted by changing the paddle position and velocity at the hitting point. These variables can be influenced by modifying the meta-parameters of the motor primitive, such as the final joint positions and velocities. The state consists of the current positions and velocities of the ball and the robot at the time the ball is directly above the net. The meta-parameter function γ(s) maps the state (the state of the ball and the robot before the return) to the meta-parameters γ (the final positions and velocities of the motor primitive). Its variance corresponds to the uncertainty of the mapping.

In the next sections, we derive and apply an appropriate reinforcement learning algorithm.

2.2 Kernelized Meta-Parameter Self-Improvement

The problem of meta-parameter learning is to find a stochastic policy π(γ|s) = p(γ|s) that maximizes the expected return

    J(π) = ∫_S p(s) ∫_G π(γ|s) R(s, γ) dγ ds,    (1)

where R(s, γ) denotes all the rewards following the selection of the meta-parameter γ according to a situation described by state s. The return of an episode is R(s, γ) = T^{-1} Σ_{t=0}^{T} r_t with number of steps T and rewards r_t. For a parametrized policy π with parameters w, it is natural to first try a policy gradient approach such as finite-difference methods, vanilla policy gradient approaches, and natural gradients.²

² While we will denote the shape parameters by θ, we denote the parameters of the meta-parameter function by w.
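To make Eq. (1) concrete, the expected return can be approximated by Monte Carlo rollouts: sample a state from p(s), draw meta-parameters from the Gaussian exploratory policy, and average the episodic returns R(s, γ) = T^{-1} Σ r_t. The sketch below is a toy instance with a hypothetical linear mean policy and a synthetic reward; none of its constants come from the paper.

```python
import math
import random

def mean_policy(s):
    # Hypothetical deterministic mean policy gamma_bar(s); linear for illustration.
    return 0.5 * s

def sigma(s):
    # State-dependent exploration magnitude (kept constant in this toy).
    return 0.1

def sample_meta_parameter(s, rng):
    # Draw gamma ~ N(gamma_bar(s), sigma(s)^2): the stochastic policy pi(gamma|s).
    return rng.gauss(mean_policy(s), sigma(s))

def episode_return(s, gamma, T=10):
    # R(s, gamma) = T^-1 * sum_t r_t; here every step's reward is a synthetic
    # Gaussian-shaped term that peaks when gamma matches the ideal 0.5 * s.
    rewards = [math.exp(-(gamma - 0.5 * s) ** 2) for _ in range(T)]
    return sum(rewards) / T

def estimate_expected_return(n_episodes=1000, seed=0):
    # Monte Carlo estimate of J(pi): sample states from p(s), meta-parameters
    # from pi(gamma|s), and average the episodic returns.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_episodes):
        s = rng.uniform(-1.0, 1.0)  # state distribution p(s)
        gamma = sample_meta_parameter(s, rng)
        total += episode_return(s, gamma)
    return total / n_episodes
```

Because the synthetic reward peaks exactly at the mean policy, the estimate comes out close to 1; shrinking the exploration magnitude would drive it closer still.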
Algorithm 1: Meta-Parameter Learning

Preparation steps:
    Learn one or more DMPs by imitation and/or reinforcement learning (yields shape parameters θ).
    Determine initial state s_0, meta-parameters γ_0, and cost C_0 corresponding to the initial DMP.
    Initialize the corresponding matrices S, Γ, C.
    Choose a kernel k, K.
    Set a scaling parameter λ.

For all iterations j:
    Determine the state s_j specifying the situation.
    Calculate the meta-parameters γ_j by:
        Determine the mean of each meta-parameter i:
            γ_i(s_j) = k(s_j)^T (K + λC)^{-1} Γ_i,
        Determine the variance:
            σ²(s_j) = k(s_j, s_j) − k(s_j)^T (K + λC)^{-1} k(s_j),
        Draw the meta-parameters from a Gaussian distribution:
            γ_j ∼ N(γ | γ(s_j), σ²(s_j) I).
    Execute the DMP using the new meta-parameters.
    Calculate the cost c_j at the end of the episode.
    Update S, Γ, C according to the achieved result.

Reinforcement learning of the meta-parameter function γ(s) is not straightforward as only few examples can be generated on the real system and trials are often quite expensive. The credit assignment problem is non-trivial as the whole movement is affected by every change in the meta-parameter function. Early attempts using policy gradient approaches resulted in tens of thousands of trials even for simple toy problems, which is not feasible on a real system.

Dayan and Hinton [1997] showed that an immediate reward can be maximized by instead minimizing the Kullback-Leibler divergence D(π(γ|s)R(s, γ) || π′(γ|s)) between the reward-weighted policy π(γ|s) and the new policy π′(γ|s). Williams [1992] suggested to use a particular policy in this context, i.e., the policy

    π(γ|s) = N(γ | γ(s), σ²(s) I),

where we have the deterministic mean policy γ(s) = φ(s)^T w with basis functions φ(s) and parameters w, as well as the variance σ²(s) that determines the exploration ε ∼ N(0, σ²(s) I). The parameters w can then be adapted by reward-weighted regression in an immediate reward [Peters and Schaal, 2007] or episodic reinforcement learning scenario [Kober and Peters, 2010]. The reasoning behind this reward-weighted regression is that the reward can be treated as an improper probability distribution over indicator variables determining whether the action is optimal or not.

Designing good basis functions is challenging; a non-parametric representation is better suited in this context. We can transform the reward-weighted regression into a Cost-regularized Kernel Regression

    γ̄_i = γ_i(s) = k(s)^T (K + λC)^{-1} Γ_i,

where Γ_i is a vector containing the training examples γ_i of the meta-parameter component, C = R^{-1} = diag(R_1^{-1}, …, R_n^{-1}) is a cost matrix, λ is a ridge factor, and k(s) = φ(s)^T Φ^T as well as K = ΦΦ^T correspond to a kernel where the rows of Φ are the basis functions φ(s_i) = Φ_i of the training examples. Please refer to [Kober et al., 2010] for a full derivation. Here, costs correspond to the uncertainty about the training examples: a high cost is incurred for being further away from the desired optimal solution at a point, and a high cost therefore corresponds to a high uncertainty of the prediction at this point. In order to incorporate exploration, we need to have a stochastic policy and, hence, a predictive distribution. This distribution can be obtained by performing the policy update with a Gaussian process regression, and from the kernel ridge regression we obtain the predictive variance

    σ²(s) = k(s, s) + λ − k(s)^T (K + λC)^{-1} k(s),

where k(s, s) = φ(s)^T φ(s) is the distance of a point to itself. We call this algorithm Cost-regularized Kernel Regression.

The algorithm corresponds to a Gaussian process regression where the costs on the diagonal are input-dependent noise priors. If several sets of meta-parameters have similarly low costs, the algorithm's convergence depends on the order of samples. The cost function should be designed to avoid this behavior and to favor a single set. The exploration has to be restricted to safe meta-parameters.

2.3 Reinforcement Learning of Meta-Parameters

As a result of Section 2.2, we have a framework of motor primitives as introduced in Section 2.1 that we can use for reinforcement learning of meta-parameters as outlined in Section 2.2. We have generalized the reward-weighted regression policy update to instead become a Cost-regularized Kernel Regression (CrKR) update where the predictive variance is used for exploration. In Algorithm 1, we show the complete algorithm resulting from these steps.

The algorithm receives three inputs, i.e., (i) a motor primitive that has associated meta-parameters γ, (ii) an initial example containing state s_0, meta-parameter γ_0, and cost C_0, as well as (iii) a scaling parameter λ. The initial motor primitive can be obtained by imitation learning [Ijspeert et al., 2003] and, subsequently, improved by parametrized reinforcement learning algorithms such as policy gradients [Peters and Schaal, 2006] or Policy learning by Weighting Exploration with the Returns (PoWER) [Kober and Peters, 2010]. The demonstration also yields the initial example needed for meta-parameter learning. While the scaling parameter is an open parameter, it is reasonable to choose it as a fraction of the average cost and the output noise parameter (note that output noise and other possible hyper-parameters of the kernel can also be obtained by approximating the unweighted meta-parameter function).

Illustration of the Algorithm: In order to illustrate this algorithm, we will use the example of the table tennis task introduced in Section 2.1. Here, the robot should hit the ball accurately while not destroying its mechanics. Hence, the cost corresponds to the distance between the ball and the paddle,
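The CrKR mean and predictive-variance equations reduce to a few matrix operations. The following sketch implements them for a scalar state and a single meta-parameter component, assuming a Gaussian kernel; the bandwidth, ridge factor, and data below are illustrative choices, not values from the paper.

```python
import math

def gauss_kernel(a, b, bw=1.0):
    # Gaussian kernel k(s, s') = exp(-(s - s')^2 / (2 * bw^2)) for scalar states.
    return math.exp(-((a - b) ** 2) / (2.0 * bw ** 2))

def solve(A, b):
    # Solve A x = b by Gauss-Jordan elimination with partial pivoting; A is
    # small and well-conditioned here thanks to the cost term on the diagonal.
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def crkr_predict(s, states, gammas, costs, lam=0.5, bw=1.0):
    # Cost-regularized Kernel Regression:
    #   mean(s)     = k(s)^T (K + lam*C)^{-1} Gamma
    #   variance(s) = k(s, s) + lam - k(s)^T (K + lam*C)^{-1} k(s)
    # with C = diag(costs): high-cost samples act like noisy observations and
    # pull the prediction less than low-cost ones.
    n = len(states)
    K = [[gauss_kernel(states[i], states[j], bw) +
          (lam * costs[i] if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_vec = [gauss_kernel(s, si, bw) for si in states]
    alpha = solve(K, list(gammas))  # (K + lam*C)^{-1} Gamma
    beta = solve(K, k_vec)          # (K + lam*C)^{-1} k(s)
    mean = sum(kv * a for kv, a in zip(k_vec, alpha))
    var = gauss_kernel(s, s, bw) + lam - sum(kv * b for kv, b in zip(k_vec, beta))
    return mean, var
```

The returned variance is what the exploratory policy samples from: it shrinks near cheap training samples and approaches k(s, s) + λ far away from all data.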
(Figure 2, panels (a)–(d): initial policy based on prior, R = 0; policy after 2 updates, R = 0.1; after 9 updates, R = 0.8; after 12 updates, R = 0.9. Each panel plots the meta-parameter over the state and shows the mean prediction, the variance, the training points/costs, the goal, and standard Gaussian process regression.)
Figure 2: This figure illustrates the meaning of policy improvements with Cost-regularized Kernel Regression. Each sample consists of a state, a meta-parameter, and a cost, where the cost is indicated by the blue error bars. The red line represents the improved mean policy; the dashed green lines indicate the exploration/variance of the new policy. For comparison, the gray lines show standard Gaussian process regression. As the cost of a data point is equivalent to having more noise, pairs of states and meta-parameters with low cost are more likely to be reproduced than others with high costs.
as well as the squared torques. The initial policy is based on a prior, illustrated in Figure 2(a), that has a variance for initial exploration (it often makes sense to start with a uniform prior). This variance is used to enforce exploration. To return a ball, we sample the meta-parameters from the policy based on the current state. After the trial, the cost is determined and, in conjunction with the employed meta-parameters, used to update the policy. If the cost is large (e.g., the ball was far from the racket), the variance of the policy is large as it may still be improved and therefore needs exploration. Furthermore, the mean of the policy is shifted only slightly towards the observed example as we are uncertain about the optimality of this action. If the cost is small, we know that we are close to an optimal policy (e.g., the racket hit the ball off-center) and only have to search in a small region around the observed trial. The effects of the cost on the mean and the variance are illustrated in Figure 2(b). Each additional sample refines the policy and the overall performance improves (see Figure 2(c)). If a state is visited several times and different meta-parameters are sampled, the policy update favors the meta-parameters with the lower costs (see Figure 2(d)).

The meta-parameter learning framework can be used in a variety of settings in robotics. We consider two scenarios here, i.e., (i) dart throwing with a simulated Barrett WAM and the real JST-ICORP/SARCOS humanoid robot CBi, and (ii) table tennis with a simulated robot arm and a real Barrett WAM.

3.1 Dart-Throwing

Now, we turn towards the complete framework, i.e., we intend to learn the meta-parameters for motor primitives in a dart throwing task.

(Figure 3: average cost over the number of rollouts for Cost-regularized Kernel Regression and Reward-weighted Regression.)
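The effect described above, that low-cost samples attract the mean far more strongly than high-cost ones, can be checked with a minimal two-sample instance of the CrKR mean, where the 2x2 matrix (K + λC) is inverted in closed form; all numbers below are illustrative, not from the paper's experiments.

```python
import math

def kernel(a, b):
    # Gaussian kernel over scalar states; bandwidth 1 is an illustrative choice.
    return math.exp(-0.5 * (a - b) ** 2)

def crkr_mean_two_samples(s, s1, g1, c1, s2, g2, c2, lam=1.0):
    # CrKR mean with two training samples: gamma(s) = k(s)^T (K + lam*C)^{-1} Gamma,
    # with the 2x2 matrix (K + lam*C) inverted in closed form.
    a = kernel(s1, s1) + lam * c1
    b = kernel(s1, s2)
    c = kernel(s2, s1)
    d = kernel(s2, s2) + lam * c2
    det = a * d - b * c
    k1, k2 = kernel(s, s1), kernel(s, s2)
    alpha1 = (d * g1 - b * g2) / det    # first row of (K + lam*C)^{-1} Gamma
    alpha2 = (-c * g1 + a * g2) / det   # second row
    return k1 * alpha1 + k2 * alpha2

# Two conflicting samples at the same state: gamma = 1 observed with low cost,
# gamma = -1 observed with high cost.
biased = crkr_mean_two_samples(0.0, 0.0, 1.0, 0.1, 0.0, -1.0, 10.0)
```

With a low-cost sample at γ = 1 and a high-cost sample at γ = -1 for the same state, the mean lands near 1; with equal costs it lands exactly between the two samples.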
Figure 4: This figure shows a dart throw on the real JST-ICORP/SARCOS humanoid robot CBi: (a) the dart is placed in the hand; (b) the arm moves back; (c) the arm moves forward on an arc; (d) the arm continues moving; (e) the dart is released and the arm follows through; (f) the arm stops and the dart hits the board.

The coordinates with respect to the center of the dart board serve as inputs. The parameters for the final position, the duration of the motor primitive, and the angle around the vertical axis are the meta-parameters. The popular dart game Around the Clock requires the player to hit the numbers in ascending order, then the bulls-eye. As energy is lost overcoming the stiction of the launching sled, the darts fly lower, and we placed the dartboard lower than official rules require. The cost function is the sum of ten times the squared error on impact and the velocity of the motion. After approximately 1000 throws the algorithms have converged, but CrKR yields a high performance already much earlier (see Figure 3). We again used a parametric policy with radial basis functions for RWR. Designing a good parametric policy proved very difficult in this setting, as is reflected by the poor performance of RWR.

This experiment is carried out in simulation and on a real, physical robot, i.e., the humanoid robot CBi (JST-ICORP/SARCOS). CBi was developed within the framework of the JST-ICORP Computational Brain Project at ATR Computational Neuroscience Labs. The hardware of the robot was developed by the American robotic development company SARCOS. CBi can open and close its fingers, which allows for more human-like throwing instead of the launcher. See Figure 4 for a throwing movement. The results on the real robot are significantly more noisy but qualitatively comparable to the simulation.

3.2 Table Tennis

In the second evaluation of the complete framework, we use it for hitting a table tennis ball in the air. The setup consists of a ball gun that serves to the forehand of the robot, a Barrett WAM, and a standard-sized table. The movement of the robot has three phases. The robot is in a rest posture and starts to swing back when the ball is launched. During this swing-back phase, the open parameters for the stroke are predicted. The second phase is the hitting phase, which ends with the contact of the ball and racket. In the final phase, the robot gradually ends the stroking motion and returns to the rest posture. See Figure 6 for an illustration of a complete episode. The movements in the three phases are represented by motor primitives obtained by imitation learning.

The meta-parameters are the joint positions and velocities for all seven degrees of freedom at the end of the second phase (the instant of hitting the ball) and a timing parameter that controls when the swing-back phase transitions to the hitting phase. We learn these 15 meta-parameters as a function of the ball positions and velocities when the ball is over the net. We employed a Gaussian kernel and optimized the open parameters according to typical values for the input and output. As cost function, we employ the metric distance between the center of the paddle and the center of the ball at the hitting time. The policy is evaluated every 50 episodes with 25 ball launches picked randomly at the beginning of the learning. We initialize the behavior with five successful strokes observed from another player. After initializing the meta-parameter function with only these five initial examples, the robot misses ca. 95% of the balls, as shown in Figure 5. Trials are only used to update the policy if the robot has successfully hit the ball. Figure 5 illustrates the costs over all episodes. Current results suggest that the resulting policy performs well both in simulation and on the real system.

4 Conclusion & Future Work

Figure 5: This figure shows the cost function of the table tennis task in simulation, averaged over 10 runs, with the error bars indicating standard deviation. The red line represents the percentage of successful hits and the blue line the average cost.

In this paper, we have studied the problem of meta-parameter learning for motor primitives. It is an essential step towards applying motor primitives for learning complex motor skills in robotics more flexibly. We have discussed an appropriate reinforcement learning algorithm for mapping situations to meta-parameters. We show that the necessary mapping from situation to meta-parameter can be learned using a Cost-regularized Kernel Regression (CrKR) while the parameters of the motor primitive can still be acquired through traditional approaches. The predictive variance of CrKR is used for exploration in on-policy meta-parameter reinforcement learning. To demonstrate the system, we have chosen the Around the Clock dart throwing
game and table tennis implemented both on simulated and real robots.

Figure 6: This figure shows a table tennis stroke on the real Barrett WAM: (a) the robot is in the rest posture; (b) the arm swings back; (c) the arm strikes the ball; (d) the arm follows through and decelerates; (e) the arm returns to the rest posture.

Future work will require sequencing different motor primitives by a supervisory layer. In a table tennis task, for example, this supervisory layer would decide between a forehand motor primitive and a backhand motor primitive; the spatial meta-parameters and the timing of the motor primitive would be adapted according to the incoming ball, and the motor primitive would generate the trajectory. This supervisory layer could be learned by a hierarchical reinforcement learning approach [Barto and Mahadevan, 2003] (as introduced in the early work by Huber and Grupen [1998]). In this framework, the motor primitives with meta-parameter functions could be seen as the robotics counterpart of options [McGovern and Barto, 2001].

References

[Barto and Mahadevan, 2003] A. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

[Caruana, 1997] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.

[Dayan and Hinton, 1997] P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.

[Huber and Grupen, 1998] M. Huber and R. A. Grupen. Learning robot control using control policies as abstract actions. In Proceedings of the NIPS'98 Workshop: Abstraction and Hierarchy in Reinforcement Learning, 1998.

[Ijspeert et al., 2003] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2003.

[Kober and Peters, 2010] J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, 2010.

[Kober et al., 2010] J. Kober, E. Oztop, and J. Peters. Reinforcement learning to adjust robot movements to new situations. In Proceedings of the Robotics: Science and Systems Conference (R:SS), 2010.

[Lawrence et al., 2003] G. Lawrence, N. Cowan, and S. Russell. Efficient gradient estimation for motor control learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2003.

[Masters Games Ltd., 2011] Masters Games Ltd. The rules of darts, http://www.mastersgames.com/rules/darts-rules.htm, 2011.

[McGovern and Barto, 2001] A. McGovern and A. G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the International Conference on Machine Learning (ICML), 2001.

[Mülling et al., 2010] K. Mülling, J. Kober, and J. Peters. Learning table tennis with a mixture of motor primitives. In Proceedings of the International Conference on Humanoid Robots (HUMANOIDS), 2010.

[Nakanishi et al., 2004] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47(2-3):79–91, 2004.

[Peters and Schaal, 2006] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the International Conference on Intelligent RObots and Systems (IROS), 2006.

[Peters and Schaal, 2007] J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the International Conference on Machine Learning (ICML), 2007.

[Pongas et al., 2005] D. Pongas, A. Billard, and S. Schaal. Rapid synchronization and accurate phase-locking of rhythmic motor primitives. In Proceedings of the International Conference on Intelligent RObots and Systems (IROS), 2005.

[Rasmussen and Williams, 2006] C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[Schaal et al., 2007] S. Schaal, P. Mohajerian, and A. J. Ijspeert. Dynamics systems vs. optimal control – a unifying view. Progress in Brain Research, 165(1):425–445, 2007.

[Urbanek et al., 2004] H. Urbanek, A. Albu-Schäffer, and P. v.d. Smagt. Learning from demonstration repetitive movements for autonomous service robotics. In Proceedings of the International Conference on Intelligent RObots and Systems (IROS), 2004.

[Williams, 1992] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.