
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Reinforcement Learning to Adjust Robot Movements to New Situations

Jens Kober (MPI Tübingen, Germany), Erhan Oztop (ATR/CMC, Japan), Jan Peters (MPI Tübingen, Germany)

Abstract

Many complex robot motor skills can be represented using elementary movements, and there exist efficient techniques for learning parametrized motor plans using demonstrations and self-improvement. However, with current techniques, the robot in many cases needs to learn a new elementary movement even if a parametrized motor plan exists that covers a related situation. A method is needed that modulates the elementary movement through the meta-parameters of its representation. In this paper, we describe how to learn such mappings from circumstances to meta-parameters using reinforcement learning. In particular, we use a kernelized version of the reward-weighted regression. We show two robot applications of the presented setup in robotic domains: the generalization of throwing movements in darts, and of hitting movements in table tennis. We demonstrate that both tasks can be learned successfully using simulated and real robots.

1 Introduction

In robot learning, motor primitives based on dynamical systems [Ijspeert et al., 2003; Schaal et al., 2007] allow acquiring new behaviors quickly and reliably both by imitation and reinforcement learning. Resulting successes have shown that it is possible to rapidly learn motor primitives for complex behaviors such as tennis-like swings [Ijspeert et al., 2003], T-ball batting [Peters and Schaal, 2006], drumming [Pongas et al., 2005], biped locomotion [Nakanishi et al., 2004], ball-in-a-cup [Kober and Peters, 2010], and even tasks with potential industrial applications [Urbanek et al., 2004]. The dynamical system motor primitives [Ijspeert et al., 2003] can be adapted both spatially and temporally without changing the overall shape of the motion. While the examples are impressive, they do not address how a motor primitive can be generalized to a different behavior by trial and error without re-learning the task. For example, if the string length has been changed in a ball-in-a-cup [Kober and Peters, 2010] movement, the behavior has to be re-learned by modifying the movement's parameters. Given that the behavior will not drastically change due to a string length variation of a few centimeters, it would be better to generalize that learned behavior to the modified task. Such generalization of behaviors can be achieved by adapting the meta-parameters of the movement representation¹.

In reinforcement learning, there have been many attempts to use meta-parameters in order to generalize between tasks [Caruana, 1997]. Particularly, in grid-world domains, significant speed-ups could be achieved by adjusting policies by modifying their meta-parameters (e.g., re-using options with different subgoals) [McGovern and Barto, 2001]. In robotics, such meta-parameter learning could be particularly helpful due to the complexity of reinforcement learning for complex motor skills with high dimensional states and actions. The cost of experience is high as sample generation is time consuming and often requires human interaction (e.g., in cart-pole, for placing the pole back on the robot's hand) or supervision (e.g., for safety during the execution of the trial). Generalizing a teacher's demonstration or a previously learned policy to new situations may reduce both the complexity of the task and the number of required samples. For example, the overall shape of table tennis forehands is very similar when the swing is adapted to varied trajectories of the incoming ball and different targets on the opponent's court. Here, the human player has learned by trial and error how he has to adapt the global parameters of a generic strike to various situations [Mülling et al., 2010]. Hence, a reinforcement learning method for acquiring and refining meta-parameters of pre-structured primitive movements becomes an essential next step, which we address in this paper.

We present current work on automatic meta-parameter acquisition for motor primitives by reinforcement learning. We focus on learning the mapping from situations to meta-parameters and how to employ these in dynamical systems motor primitives. We extend the motor primitives (DMPs) of [Ijspeert et al., 2003] with a learned meta-parameter function and re-frame the problem as an episodic reinforcement learning scenario. In order to obtain an algorithm for fast reinforcement learning of meta-parameters, we view reinforcement learning as a reward-weighted self-imitation [Peters and Schaal, 2007; Kober and Peters, 2010].

¹ Note that the tennis-like swings [Ijspeert et al., 2003] could only hit a static ball at the end of their trajectory, and T-ball batting [Peters and Schaal, 2006] was accomplished by changing the policy's parameters.

As it may be hard to realize a parametrized representation for meta-parameter determination, we reformulate the reward-weighted regression [Peters and Schaal, 2007] in order to obtain a Cost-regularized Kernel Regression (CrKR) that is related to Gaussian process regression [Rasmussen and Williams, 2006]. We evaluate the algorithm in the acquisition of flexible motor primitives for dart games such as Around the Clock [Masters Games Ltd., 2011] and for table tennis.

2 Meta-Parameter Learning for DMPs

The goal of this paper is to show that elementary movements can be generalized by modifying only the meta-parameters of the primitives using learned mappings. In Section 2.1, we first review how a single primitive movement can be represented and learned. We discuss how such meta-parameters may be able to adapt the motor primitive spatially and temporally to the new situation. In order to develop algorithms that learn to automatically adjust such motor primitives, we model meta-parameter self-improvement as an episodic reinforcement learning problem in Section 2.2. While this problem could in theory be treated with arbitrary reinforcement learning methods, the availability of few samples suggests that more efficient, task-appropriate reinforcement learning approaches are needed. To avoid the limitations of parametric function approximation, we aim for a kernel-based approach. When a movement is generalized, new parameter settings need to be explored. Hence, a predictive distribution over the meta-parameters is required to serve as an exploratory policy. These requirements lead to the method which we employ for meta-parameter learning in Section 2.3.

2.1 DMPs with Meta-Parameters

In this section, we review how the dynamical systems motor primitives [Ijspeert et al., 2003; Schaal et al., 2007] can be used for meta-parameter learning. The dynamical system motor primitives [Ijspeert et al., 2003] are a powerful movement representation that allows ensuring the stability of the movement and choosing between a rhythmic and a discrete movement. One of the biggest advantages of this motor primitive framework is that it is linear in the shape parameters θ. Therefore, these parameters can be obtained efficiently, and the resulting framework is well-suited for imitation [Ijspeert et al., 2003] and reinforcement learning [Kober and Peters, 2010]. The resulting policy is invariant under transformations of the initial position, the goal, the amplitude and the duration [Ijspeert et al., 2003]. These four modification parameters can be used as the meta-parameters γ of the movement. Obviously, we can make more use of the motor primitive framework by adjusting the meta-parameters γ depending on the current situation or state s according to a meta-parameter function γ(s). The state s can, for example, contain the current position, velocity and acceleration of the robot and external objects, as well as the target to be achieved. This paper focuses on learning the meta-parameter function γ(s) by episodic reinforcement learning.
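To make the role of the meta-parameters concrete, the following is a minimal one-degree-of-freedom sketch of a discrete dynamical-systems motor primitive in the spirit of [Ijspeert et al., 2003]. It is not the authors' implementation: the specific gains, the basis-function heuristics, and the variable names (dmp_rollout, alpha_y, etc.) are common textbook choices and assumptions made here for illustration. The point it shows is that the shape parameters θ stay fixed while the goal and duration act as meta-parameters that rescale the same movement.

```python
import numpy as np

def dmp_rollout(theta, y0, goal, duration, dt=0.002,
                alpha_y=25.0, beta_y=6.25, alpha_x=8.0):
    """Integrate a discrete dynamical-systems motor primitive (one DoF, sketch).

    theta    -- shape parameters (forcing-term weights), fixed after imitation
    y0, goal -- start and goal position; the goal is a meta-parameter
    duration -- movement duration; also a meta-parameter (time scaling)
    """
    n = len(theta)
    centers = np.exp(-alpha_x * np.linspace(0.0, 1.0, n))  # basis centers in phase space
    widths = n ** 1.5 / centers                             # heuristic basis widths (assumption)
    tau = duration
    y, yd, x = float(y0), 0.0, 1.0                          # position, velocity, phase
    traj = []
    for _ in range(int(duration / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)          # basis activations
        f = (psi @ theta) / (psi.sum() + 1e-10) * x * (goal - y0)
        ydd = (alpha_y * (beta_y * (goal - y) - tau * yd) + f) / tau ** 2
        yd += ydd * dt
        y += yd * dt
        x += -alpha_x * x / tau * dt                        # canonical system
        traj.append(y)
    return np.asarray(traj)

# The same shape parameters theta are reused; only the meta-parameters change.
theta = 50.0 * np.random.randn(10)
stroke_a = dmp_rollout(theta, y0=0.0, goal=0.5, duration=0.5)
stroke_b = dmp_rollout(theta, y0=0.0, goal=0.9, duration=0.8)
```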
Illustration of the Learning Problem: As an illustration of the meta-parameter learning problem, we take a table tennis task which is illustrated in Figure 1 (in Section 3.2, we will expand this example to a robot application). Here, the desired skill is to return a table tennis ball. The motor primitive corresponds to the hitting movement. When modeling a single hitting movement with dynamical-systems motor primitives [Ijspeert et al., 2003], the combination of retracting and hitting motions would be represented by one movement primitive and can be learned by determining the movement parameters θ. These parameters can either be estimated by imitation learning or acquired by reinforcement learning. The return can be adapted by changing the paddle position and velocity at the hitting point. These variables can be influenced by modifying the meta-parameters of the motor primitive such as the final joint positions and velocities. The state consists of the current positions and velocities of the ball and the robot at the time the ball is directly above the net. The meta-parameter function γ(s) maps the state (the state of the ball and the robot before the return) to the meta-parameters γ (the final positions and velocities of the motor primitive). Its variance corresponds to the uncertainty of the mapping.

Figure 1: This figure illustrates a table tennis task. The situation, described by the state s, corresponds to the positions and velocities of the ball and the robot at the time the ball is above the net. The meta-parameters γ are the joint positions and velocity at which the ball is hit. The policy parameters represent the backward motion and the movement on the arc. The meta-parameter function γ(s), which maps the state to the meta-parameters, is learned.

In the next sections, we derive and apply an appropriate reinforcement learning algorithm.

2.2 Kernelized Meta-Parameter Self-Improvement

The problem of meta-parameter learning is to find a stochastic policy π(γ|s) = p(γ|s) that maximizes the expected return

    J(π) = ∫_S p(s) ∫_G π(γ|s) R(s, γ) dγ ds,    (1)

where R(s, γ) denotes all the rewards following the selection of the meta-parameter γ according to a situation described by state s. The return of an episode is R(s, γ) = T⁻¹ ∑_{t=0}^{T} r_t with number of steps T and rewards r_t. For a parametrized policy π with parameters w it is natural to first try a policy gradient approach such as finite-difference methods, vanilla policy gradient approaches and natural gradients².

² While we will denote the shape parameters by θ, we denote the parameters of the meta-parameter function by w.
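As a concrete reading of Eq. (1) and the return definition above, the following is a minimal Monte Carlo sketch of the quantities involved. The callbacks sample_situation, rollout_rewards and the policy functions are hypothetical stand-ins for a task-specific simulator, not part of the presented system.

```python
import numpy as np

def episode_return(rewards):
    """R(s, gamma) = T^{-1} * sum_t r_t for one episode."""
    return float(np.mean(rewards))

def estimate_expected_return(sample_situation, policy_mean, policy_std,
                             rollout_rewards, n_episodes=100, rng=None):
    """Monte Carlo estimate of J(pi) = E_s E_{gamma ~ pi(.|s)} [R(s, gamma)].

    sample_situation() -> s                         draws a situation from p(s)
    policy_mean(s), policy_std(s)                   Gaussian meta-parameter policy
    rollout_rewards(s, gamma) -> [r_0, ..., r_T]    executes the primitive and returns rewards
    All of these are placeholders for the task-specific environment.
    """
    rng = rng or np.random.default_rng()
    returns = []
    for _ in range(n_episodes):
        s = sample_situation()
        mean = policy_mean(s)
        gamma = mean + policy_std(s) * rng.standard_normal(np.shape(mean))
        returns.append(episode_return(rollout_rewards(s, gamma)))
    return float(np.mean(returns))
```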

Reinforcement learning of the meta-parameter function γ(s) is not straightforward as only few examples can be generated on the real system and trials are often quite expensive. The credit assignment problem is non-trivial as the whole movement is affected by every change in the meta-parameter function. Early attempts using policy gradient approaches resulted in tens of thousands of trials even for simple toy problems, which is not feasible on a real system.

Dayan and Hinton [1997] showed that an immediate reward can be maximized by instead minimizing the Kullback-Leibler divergence D(π(γ|s)R(s, γ) || π'(γ|s)) between the reward-weighted policy π(γ|s) and the new policy π'(γ|s). Williams [Williams, 1992] suggested using a particular policy in this context, i.e., the policy

    π(γ|s) = N(γ | γ(s), σ²(s)I),

where we have the deterministic mean policy γ(s) = φ(s)ᵀw with basis functions φ(s) and parameters w, as well as the variance σ²(s) that determines the exploration ε ∼ N(0, σ²(s)I). The parameters w can then be adapted by reward-weighted regression in an immediate reward [Peters and Schaal, 2007] or episodic reinforcement learning scenario [Kober and Peters, 2010]. The reasoning behind this reward-weighted regression is that the reward can be treated as an improper probability distribution over indicator variables determining whether the action is optimal or not.
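For comparison with the kernelized update derived next, here is a hedged sketch of the parametric reward-weighted regression update for the linear mean policy γ(s) = φ(s)ᵀw: a reward-weighted least-squares fit. The ridge term and the function name are illustrative assumptions; the basis functions φ have to be chosen by hand, which is exactly the difficulty addressed below.

```python
import numpy as np

def reward_weighted_regression(Phi, Gamma, rewards, ridge=1e-6):
    """One reward-weighted regression update of the mean-policy parameters w.

    Phi     -- (n, b) matrix of basis functions phi(s_i) for the observed states
    Gamma   -- (n, d) matrix of the sampled meta-parameters gamma_i
    rewards -- (n,)  rewards treated as improper probabilities of optimality
    Returns W with shape (b, d) such that gamma(s) ~= phi(s)^T W.
    """
    R = np.diag(rewards)                                   # reward weighting
    A = Phi.T @ R @ Phi + ridge * np.eye(Phi.shape[1])     # weighted normal equations
    return np.linalg.solve(A, Phi.T @ R @ Gamma)
```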
Designing good basis functions is challenging. A non-parametric representation is better suited in this context. We can transform the reward-weighted regression into a Cost-regularized Kernel Regression

    γ̄_i = γ_i(s) = k(s)ᵀ (K + λC)⁻¹ Γ_i,

where Γ_i is a vector containing the training examples γ_i of the meta-parameter component, C = R⁻¹ = diag(R₁⁻¹, ..., R_n⁻¹) is a cost matrix, λ is a ridge factor, and k(s) = φ(s)ᵀΦᵀ as well as K = ΦΦᵀ correspond to a kernel where the rows of Φ are the basis functions φ(s_i) = Φ_i of the training examples. Please refer to [Kober et al., 2010] for a full derivation. Here, costs correspond to the uncertainty about the training examples. Thus, a high cost is incurred for being further away from the desired optimal solution at a point. In our formulation, a high cost therefore corresponds to a high uncertainty of the prediction at this point. In order to incorporate exploration, we need to have a stochastic policy and, hence, we need a predictive distribution. This distribution can be obtained by performing the policy update with a Gaussian process regression, and we see from the kernel ridge regression

    σ²(s) = k(s, s) + λ − k(s)ᵀ (K + λC)⁻¹ k(s),

where k(s, s) = φ(s)ᵀφ(s) is the distance of a point to itself. We call this algorithm Cost-regularized Kernel Regression.

The algorithm corresponds to a Gaussian process regression where the costs on the diagonal are input-dependent noise priors. If several sets of meta-parameters have similarly low costs, the algorithm's convergence depends on the order of samples. The cost function should be designed to avoid this behavior and to favor a single set. The exploration has to be restricted to safe meta-parameters.
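The two CrKR formulas above translate directly into code. Below is a minimal sketch of the predictor (mean and predictive variance); the Gaussian kernel, its bandwidth and the ridge factor λ are assumptions for illustration, not the values used on the robots.

```python
import numpy as np

def gauss_kernel(A, B, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / bandwidth ** 2)

def crkr_predict(s, S, Gamma, costs, lam=0.5, bandwidth=1.0):
    """Cost-regularized Kernel Regression (sketch).

    s     -- query state, shape (d_s,)
    S     -- (n, d_s) states of previous rollouts
    Gamma -- (n, d_g) meta-parameters of previous rollouts
    costs -- (n,)     costs c_j of previous rollouts (high cost = high uncertainty)
    Returns the mean gamma(s) and the predictive variance sigma^2(s).
    """
    K = gauss_kernel(S, S, bandwidth)                  # K = Phi Phi^T
    k = gauss_kernel(s[None, :], S, bandwidth)[0]      # k(s)
    C = np.diag(costs)                                 # cost matrix C = R^{-1}
    inv = np.linalg.inv(K + lam * C)
    mean = k @ inv @ Gamma                             # gamma_i(s) = k(s)^T (K + lam C)^{-1} Gamma_i
    var = (gauss_kernel(s[None, :], s[None, :], bandwidth)[0, 0]
           + lam - k @ inv @ k)                        # sigma^2(s) = k(s,s) + lam - k(s)^T (K + lam C)^{-1} k(s)
    return mean, var
```

The meta-parameters for the next rollout are then drawn from N(mean, var·I), so the costs stored on the diagonal of C directly control how much exploration remains around each training example.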
2.3 Reinforcement Learning of Meta-Parameters

As a result of Section 2.2, we have a framework of motor primitives as introduced in Section 2.1 that we can use for reinforcement learning of meta-parameters as outlined in Section 2.2. We have generalized the reward-weighted regression policy update to instead become a Cost-regularized Kernel Regression (CrKR) update where the predictive variance is used for exploration. In Algorithm 1, we show the complete algorithm resulting from these steps.

Algorithm 1: Meta-Parameter Learning

Preparation steps:
  Learn one or more DMPs by imitation and/or reinforcement learning (yields shape parameters θ).
  Determine initial state s₀, meta-parameters γ₀, and cost C₀ corresponding to the initial DMP.
  Initialize the corresponding matrices S, Γ, C.
  Choose a kernel k, K.
  Set a scaling parameter λ.

For all iterations j:
  Determine the state s_j specifying the situation.
  Calculate the meta-parameters γ_j by:
    Determine the mean of each meta-parameter i:
      γ_i(s_j) = k(s_j)ᵀ (K + λC)⁻¹ Γ_i,
    Determine the variance:
      σ²(s_j) = k(s_j, s_j) − k(s_j)ᵀ (K + λC)⁻¹ k(s_j),
    Draw the meta-parameters from a Gaussian distribution:
      γ_j ∼ N(γ | γ(s_j), σ²(s_j)I).
  Execute the DMP using the new meta-parameters.
  Calculate the cost c_j at the end of the episode.
  Update S, Γ, C according to the achieved result.
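Using the crkr_predict sketch from Section 2.2, Algorithm 1 can be sketched as the following loop. The environment interface (observe_state, execute_dmp, episode_cost) is hypothetical; the remaining steps follow the listing above.

```python
import numpy as np

def meta_parameter_learning(observe_state, execute_dmp, episode_cost,
                            s0, gamma0, c0, lam=0.5, bandwidth=1.0,
                            n_rollouts=200, rng=None):
    """Episodic meta-parameter learning with CrKR (cf. Algorithm 1, sketch).

    observe_state() -> s_j           current situation (e.g., ball state over the net)
    execute_dmp(gamma_j) -> outcome  runs the motor primitive with meta-parameters gamma_j
    episode_cost(outcome) -> c_j     scalar cost of the rollout
    s0, gamma0, c0                   initial example obtained from the demonstration
    """
    rng = rng or np.random.default_rng()
    S, Gamma, costs = [np.asarray(s0)], [np.asarray(gamma0)], [float(c0)]
    for _ in range(n_rollouts):
        s = np.asarray(observe_state())
        mean, var = crkr_predict(s, np.vstack(S), np.vstack(Gamma),
                                 np.asarray(costs), lam, bandwidth)
        # stochastic policy: exploration scales with the predictive variance
        gamma = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal(mean.shape)
        outcome = execute_dmp(gamma)       # execute the DMP with the new meta-parameters
        c = episode_cost(outcome)          # cost at the end of the episode
        S.append(s); Gamma.append(gamma); costs.append(float(c))
    return np.vstack(S), np.vstack(Gamma), np.asarray(costs)
```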
The algorithm receives three inputs, i.e., (i) a motor primitive that has associated meta-parameters γ, (ii) an initial example containing state s₀, meta-parameter γ₀ and cost C₀, as well as (iii) a scaling parameter λ. The initial motor primitive can be obtained by imitation learning [Ijspeert et al., 2003] and, subsequently, improved by parametrized reinforcement learning algorithms such as policy gradients [Peters and Schaal, 2006] or Policy learning by Weighting Exploration with the Returns (PoWER) [Kober and Peters, 2010]. The demonstration also yields the initial example needed for meta-parameter learning. While the scaling parameter is an open parameter, it is reasonable to choose it as a fraction of the average cost and the output noise parameter (note that output noise and other possible hyper-parameters of the kernel can also be obtained by approximating the unweighted meta-parameter function).

Illustration of the Algorithm: In order to illustrate this algorithm, we will use the example of the table tennis task introduced in Section 2.1. Here, the robot should hit the ball accurately while not destroying its mechanics. Hence, the cost corresponds to the distance between the ball and the paddle, as well as the squared torques.

[Figure 2 shows four panels of an illustrative one-dimensional example, plotting the meta-parameter over the state: (a) initial policy based on the prior (R=0), (b) policy after 2 updates (R=0.1), (c) policy after 9 updates (R=0.8), (d) policy after 12 updates (R=0.9). Each panel shows the mean prediction, the variance, the training points with their costs, and a standard Gaussian process regression for comparison.]

Figure 2: This figure illustrates the meaning of policy improvements with Cost-regularized Kernel Regression. Each sample consists of a state, a meta-parameter and a cost, where the cost is indicated by the blue error bars. The red line represents the improved mean policy, the dashed green lines indicate the exploration/variance of the new policy. For comparison, the gray lines show standard Gaussian process regression. As the cost of a data point is equivalent to having more noise, pairs of states and meta-parameters with low cost are more likely to be reproduced than others with high costs.

The initial policy is based on a prior, illustrated in Figure 2(a), that has a variance for initial exploration (it often makes sense to start with a uniform prior). This variance is used to enforce exploration. To return a ball, we sample the meta-parameters from the policy based on the current state. After the trial, the cost is determined and, in conjunction with the employed meta-parameters, used to update the policy. If the cost is large (e.g., the ball was far from the racket), the variance of the policy is large as it may still be improved and therefore needs exploration. Furthermore, the mean of the policy is shifted only slightly towards the observed example as we are uncertain about the optimality of this action. If the cost is small, we know that we are close to an optimal policy (e.g., the racket hit the ball off-center) and only have to search in a small region around the observed trial. The effects of the cost on the mean and the variance are illustrated in Figure 2(b). Each additional sample refines the policy and the overall performance improves (see Figure 2(c)). If a state is visited several times and different meta-parameters are sampled, the policy update must favor the meta-parameters with lower costs. Algorithm 1 exhibits this behavior as illustrated in Figure 2(d).
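The behavior described above and in Figure 2 can be reproduced with a tiny, self-contained example using the crkr_predict sketch from Section 2.2; the numbers below are arbitrary and only meant to show that low-cost samples pull the mean strongly while high-cost samples barely move it.

```python
import numpy as np

S = np.array([[0.0], [0.0]])        # the same state visited twice
Gamma = np.array([[1.0], [-1.0]])   # two different meta-parameters were sampled there
costs = np.array([0.05, 5.0])       # the first rollout had a much lower cost

mean, var = crkr_predict(np.array([0.0]), S, Gamma, costs, lam=0.5)
# mean ends up close to +1.0 (the low-cost sample dominates), and the predictive
# variance shrinks relative to the prior value k(s, s) + lam, so the next rollout
# explores only a small region around the good sample.
```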
In the dart throwing example (Section 3.1) we have a correspondence between the state and the outcome similar to a regression problem. However, the mapping between the state and the meta-parameter is not unique. The same height can be achieved by different combinations of velocities and angles. Averaging these combinations is likely to generate inconsistent solutions. The regression must hence favor the meta-parameters with the lower costs. CrKR could be employed as a regularized regression method in this case. In the dart setting, we could choose the next target and thus employ CrKR as an active learning approach by picking states with large variances.

3 Evaluation

In Section 2, we have introduced both a framework for meta-parameter self-improvement as well as an appropriate reinforcement learning algorithm used in this framework. In [Kober et al., 2010] we have shown that the presented reinforcement learning algorithm yields higher performance than the preceding reward-weighted regression and an off-the-shelf finite-difference policy gradient approach on a benchmark example. The meta-parameter learning framework can be used in a variety of settings in robotics. We consider two scenarios here, i.e., (i) dart throwing with a simulated Barrett WAM and the real JST-ICORP/SARCOS humanoid robot CBi, and (ii) table tennis with a simulated robot arm and a real Barrett WAM.

3.1 Dart-Throwing

Now, we turn towards the complete framework, i.e., we intend to learn the meta-parameters for motor primitives in discrete movements. We compare the Cost-regularized Kernel Regression (CrKR) algorithm to the reward-weighted regression (RWR). As a sufficiently complex scenario, we chose a robot dart-throwing task inspired by [Lawrence et al., 2003]. However, we take a more complicated scenario and choose dart games such as Around the Clock [Masters Games Ltd., 2011] instead of simple throwing at a fixed location. Hence, it will have an additional parameter in the state depending on the location on the dartboard that should come next in the sequence. The acquisition of a basic motor primitive is achieved using previous work on imitation learning [Ijspeert et al., 2003]. Only the meta-parameter function is learned using CrKR or RWR.

Figure 3: This figure shows the cost function of the dart-throwing task in simulation for a whole game of Around the Clock in each rollout, comparing Cost-regularized Kernel Regression and Reward-weighted Regression. The costs are averaged over 10 runs with the error bars indicating standard deviation.

The dart is placed on a launcher attached to the end-effector and held there by stiction. We use the Barrett WAM robot arm in order to achieve the high accelerations needed to overcome the stiction. The motor primitive is trained by imitation learning with kinesthetic teach-in.

Figure 4: This figure shows a dart throw on the real JST-ICORP/SARCOS humanoid robot CBi. (a) The dart is placed in the hand. (b) The arm moves back. (c) The arm moves forward on an arc. (d) The arm continues moving. (e) The dart is released and the arm follows through. (f) The arm stops and the dart hits the board.

We use the Cartesian coordinates with respect to the center of the dart board as inputs. The parameters for the final position, the duration of the motor primitive and the angle around the vertical axis are the meta-parameters. The popular dart game Around the Clock requires the player to hit the numbers in ascending order, then the bulls-eye. As energy is lost overcoming the stiction of the launching sled, the darts fly lower and we placed the dartboard lower than official rules require. The cost function is the sum of ten times the squared error on impact and the velocity of the motion. After approximately 1000 throws the algorithms have converged, but CrKR yields a high performance already much earlier (see Figure 3). We again used a parametric policy with radial basis functions for RWR. Designing a good parametric policy proved very difficult in this setting, as is reflected by the poor performance of RWR.
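A hedged sketch of the dart cost described above (ten times the squared impact error plus the velocity of the motion) is given below. The exact form of the velocity term, its units and the time discretization are assumptions made for illustration.

```python
import numpy as np

def dart_cost(impact_xy, target_xy, joint_velocities, dt):
    """Cost of one dart throw: 10 * squared impact error + a velocity penalty.

    impact_xy, target_xy -- positions on the dart board plane (Cartesian coordinates
                            with respect to the board center)
    joint_velocities     -- (T, dof) joint velocities over the throwing trajectory
    dt                   -- time step of the trajectory
    """
    impact_error = np.sum((np.asarray(impact_xy) - np.asarray(target_xy)) ** 2)
    velocity_term = np.sum(np.asarray(joint_velocities) ** 2) * dt  # assumed integrated squared velocity
    return 10.0 * impact_error + velocity_term
```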
This experiment is carried out in simulation and on a real, physical robot, i.e., the humanoid robot CBi (JST-ICORP/SARCOS). CBi was developed within the framework of the JST-ICORP Computational Brain Project at ATR Computational Neuroscience Labs. The hardware of the robot was developed by the American robotic development company SARCOS. CBi can open and close the fingers, which helps for more human-like throwing instead of the launcher. See Figure 4 for a throwing movement. The results on the real robot are significantly more noisy but qualitatively comparable to the simulation.

3.2 Table Tennis

In the second evaluation of the complete framework, we use it for hitting a table tennis ball in the air. The setup consists of a ball gun that serves to the forehand of the robot, a Barrett WAM and a standard sized table. The movement of the robot has three phases. The robot is in a rest posture and starts to swing back when the ball is launched. During this swing-back phase, the open parameters for the stroke are predicted. The second phase is the hitting phase, which ends with the contact of the ball and racket. In the final phase, the robot gradually ends the stroking motion and returns to the rest posture. See Figure 6 for an illustration of a complete episode. The movements in the three phases are represented by motor primitives obtained by imitation learning.
The meta-parameters are the joint positions and velocities for all seven degrees of freedom at the end of the second phase (the instant of hitting the ball) and a timing parameter that controls when the swing-back phase transitions to the hitting phase. We learn these 15 meta-parameters as a function of the ball positions and velocities when it is over the net. We employed a Gaussian kernel and optimized the open parameters according to typical values for the input and output. As cost function we employ the metric distance between the center of the paddle and the center of the ball at the hitting time. The policy is evaluated every 50 episodes with 25 ball launches picked randomly at the beginning of the learning. We initialize the behavior with five successful strokes observed from another player. After initializing the meta-parameter function with only these five initial examples, the robot misses ca. 95% of the balls, as shown in Figure 5. Trials are only used to update the policy if the robot has successfully hit the ball. Figure 5 illustrates the costs over all episodes. Current results suggest that the resulting policy performs well both in simulation and for the real system.
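A sketch of how the table-tennis state, meta-parameter layout and cost described above might be encoded follows. The 15-dimensional split (7 joint positions, 7 joint velocities, 1 timing parameter) and the cost definition follow the text, while the data structures and function names are illustrative assumptions.

```python
import numpy as np

N_DOF = 7  # Barrett WAM degrees of freedom

def split_meta_parameters(gamma):
    """gamma has 15 entries: 7 hitting joint positions, 7 joint velocities, 1 timing."""
    gamma = np.asarray(gamma)
    return gamma[:N_DOF], gamma[N_DOF:2 * N_DOF], gamma[2 * N_DOF]

def table_tennis_state(ball_pos_over_net, ball_vel_over_net):
    """State s: ball position and velocity at the time the ball is above the net."""
    return np.concatenate([ball_pos_over_net, ball_vel_over_net])

def table_tennis_cost(paddle_center, ball_center):
    """Metric distance between paddle center and ball center at the hitting time."""
    return float(np.linalg.norm(np.asarray(paddle_center) - np.asarray(ball_center)))
```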
Figure 5: This figure shows the cost function of the table tennis task in simulation averaged over 10 runs with the error bars indicating standard deviation. The red line represents the percentage of successful hits and the blue line the average cost.

4 Conclusion & Future Work

In this paper, we have studied the problem of meta-parameter learning for motor primitives. It is an essential step towards applying motor primitives for learning complex motor skills in robotics more flexibly. We have discussed an appropriate reinforcement learning algorithm for mapping situations to meta-parameters. We show that the necessary mapping from situation to meta-parameter can be learned using a Cost-regularized Kernel Regression (CrKR) while the parameters of the motor primitive can still be acquired through traditional approaches. The predictive variance of CrKR is used for exploration in on-policy meta-parameter reinforcement learning. To demonstrate the system, we have chosen the Around the Clock dart throwing game and table tennis, implemented both on simulated and real robots.

Figure 6: This figure shows a table tennis stroke on the real Barrett WAM. (a) The robot is in the rest posture. (b) The arm swings back. (c) The arm strikes the ball. (d) The arm follows through and decelerates. (e) The arm returns to the rest posture.

Future work will require sequencing different motor primitives by a supervisory layer. This supervisory layer would, for example, in a table tennis task decide between a forehand motor primitive and a backhand motor primitive; the spatial meta-parameters and the timing of the motor primitive would be adapted according to the incoming ball, and the motor primitive would generate the trajectory. This supervisory layer could be learned by a hierarchical reinforcement learning approach [Barto and Mahadevan, 2003] (as introduced in the early work by [Huber and Grupen, 1998]). In this framework, the motor primitives with meta-parameter functions could be seen as the robotics counterpart of options [McGovern and Barto, 2001].
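A purely speculative sketch of how such a supervisory layer might combine primitives with their meta-parameter functions is given below; the selection rule, the class names and the interfaces are assumptions for illustration, not part of the presented system.

```python
import numpy as np

class PrimitiveOption:
    """A motor primitive together with its learned meta-parameter function."""
    def __init__(self, name, meta_parameter_fn, execute_fn):
        self.name = name
        self.meta_parameter_fn = meta_parameter_fn   # s -> (mean, variance), e.g. a CrKR predictor
        self.execute_fn = execute_fn                 # gamma -> rollout of the primitive

def supervisory_step(state, options, value_estimates, rng=None):
    """Pick the primitive (e.g., forehand vs. backhand) with the highest estimated
    value for the current situation, query its meta-parameter function, and execute it."""
    rng = rng or np.random.default_rng()
    best = max(options, key=lambda opt: value_estimates[opt.name](state))
    mean, var = best.meta_parameter_fn(state)
    gamma = rng.normal(mean, np.sqrt(max(var, 0.0)))
    return best.execute_fn(gamma)
```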
References

[Barto and Mahadevan, 2003] A. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

[Caruana, 1997] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.

[Dayan and Hinton, 1997] P. Dayan and G. E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.

[Huber and Grupen, 1998] M. Huber and R. A. Grupen. Learning robot control using control policies as abstract actions. In Proceedings of the NIPS'98 Workshop: Abstraction and Hierarchy in Reinforcement Learning, 1998.

[Ijspeert et al., 2003] A. J. Ijspeert, J. Nakanishi, and S. Schaal. Learning attractor landscapes for learning motor primitives. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2003.

[Kober and Peters, 2010] J. Kober and J. Peters. Policy search for motor primitives in robotics. Machine Learning, 2010.

[Kober et al., 2010] J. Kober, E. Oztop, and J. Peters. Reinforcement learning to adjust robot movements to new situations. In Proceedings of the Robotics: Science and Systems Conference (R:SS), 2010.

[Lawrence et al., 2003] G. Lawrence, N. Cowan, and S. Russell. Efficient gradient estimation for motor control learning. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2003.

[Masters Games Ltd., 2011] Masters Games Ltd. The rules of darts, http://www.mastersgames.com/rules/darts-rules.htm, 2011.

[McGovern and Barto, 2001] A. McGovern and A. G. Barto. Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the International Conference on Machine Learning (ICML), 2001.

[Mülling et al., 2010] K. Mülling, J. Kober, and J. Peters. Learning table tennis with a mixture of motor primitives. In Proceedings of the International Conference on Humanoid Robots (HUMANOIDS), 2010.

[Nakanishi et al., 2004] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47(2-3):79–91, 2004.

[Peters and Schaal, 2006] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the International Conference on Intelligent RObots and Systems (IROS), 2006.

[Peters and Schaal, 2007] J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the International Conference on Machine Learning (ICML), 2007.

[Pongas et al., 2005] D. Pongas, A. Billard, and S. Schaal. Rapid synchronization and accurate phase-locking of rhythmic motor primitives. In Proceedings of the International Conference on Intelligent RObots and Systems (IROS), 2005.

[Rasmussen and Williams, 2006] C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[Schaal et al., 2007] S. Schaal, P. Mohajerian, and A. J. Ijspeert. Dynamics systems vs. optimal control – a unifying view. Progress in Brain Research, 165(1):425–445, 2007.

[Urbanek et al., 2004] H. Urbanek, A. Albu-Schäffer, and P. v.d. Smagt. Learning from demonstration repetitive movements for autonomous service robotics. In Proceedings of the International Conference on Intelligent RObots and Systems (IROS), 2004.

[Williams, 1992] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
