Submitted to AAAI 2013 Spring Symposium on Lifelong Machine Learning, Stanford University, March 2013
Information-Theoretic Objective Functions for Lifelong Learning
Byoung-Tak Zhang
School of Computer Science and Engineering &
Graduate Programs in Cognitive Science and Brain Science
Seoul National University
Seoul 151-742, Korea
E-mail: [email protected]
Abstract

Conventional paradigms of machine learning assume that all the training data are available when learning starts. In lifelong learning, however, the examples are observed sequentially as learning unfolds, and the learner must continually explore the world and reorganize and refine its internal model or knowledge of the world. This leads to a fundamental challenge: how to balance long-term and short-term goals, and how to trade off information gain against model complexity? These questions boil down to "what objective functions can best guide a lifelong learning agent?" Here we develop a sequential Bayesian framework for lifelong learning, build a taxonomy of lifelong-learning paradigms, and examine information-theoretic objective functions for each paradigm, with an emphasis on active learning. The objective functions provide theoretical criteria for designing algorithms and determining effective strategies for selective sampling, representation discovery, knowledge transfer, and continual update over a lifetime of experience.

1. Introduction

Lifelong learning involves long-term interaction with the environment. In this setting, a number of learning processes must be performed continually. These include, among others, discovering representations from raw sensory data and transferring knowledge learned on previous tasks to improve learning on the current task (Eaton & desJardins, 2011). Thus, lifelong learning typically requires sequential, online, and incremental updates.

Here we focus on never-ending exploration and continuous discovery of knowledge. In this regard, lifelong learning can be divided into passive and active learning (Cohn et al., 1990; Zhang & Veenker, 1991a; Thrun & Moeller, 1992). In passive learning the learner merely observes the incoming data, while in active learning the learner chooses what data to learn from. Active learning can be further divided into selective and creative learning (Valiant, 1984; Zhang & Veenker, 1991b; Freund et al., 1993). Selective learning subsamples the incoming examples, while creative learning generates new examples (Cohn et al., 1994; Zhang, 1994).

Lifelong learning also involves sequential revision and transfer of knowledge across samples, tasks, and domains. In terms of knowledge acquisition, model revision typically requires restructuring of models rather than the parameter tuning used in traditional machine learning or neural network algorithms. Combined with the effects of incremental and online change in both data size and model complexity, it becomes fundamentally important how the lifelong learner controls model complexity and data complexity as learning unfolds over a long period, or lifetime, of experience.

We ask the following questions: How can a lifelong learner maximize information gain while minimizing its model complexity and the costs of revising and transferring knowledge about the world? What objective function can best guide the lifelong learning process by trading off long-term and short-term goals? In this paper we focus on information-theoretic objective functions for lifelong learning, with an emphasis on active learning, and develop a taxonomy of lifelong-learning paradigms based on the learning objectives.

In Section 2 we give a Bayesian framework for lifelong learning based on the perception-action cycle model of cognitive systems. Section 3 describes objective functions for lifelong learning with passive observations, such as time series prediction and target tracking. Section 4 describes objective functions for active lifelong learning, i.e. continual learning with actions on the environment but without rewards; we also consider measures for active constructive learning. In Section 5 we discuss objective functions for lifelong learning with explicit rewards from the environment. Section 6 concludes by discussing extensions and further uses of the framework and objective functions.

2. A Framework for Lifelong Learning

Here we develop a general framework for lifelong learning that unifies active learning and constructive learning, as well as passive observational learning, over a lifetime. We start by considering the information flow in the perception-action cycle of an agent interacting with the environment.
Figure 1: A lifelong learning system architecture with the perception-action cycle

2.1 Action-Perception-Learning Cycle

Consider an agent situated in an environment (Figure 1). The agent has a memory to model the lifelong history. We denote the memory state at time t by m_t. The agent observes the environment, measures the sensory state s_t of the environment, and chooses an action a_t. The goal of the learner is to learn about the environment and predict the next world state s_{t+1} as accurately as possible. The ability to predict improves the performance of the learner across a large variety of specific behaviors and is hence quite fundamental, increasing the success rate of many tasks (Still, 2009). The perception-action cycle of the learner is effective for continuous acquisition and refinement of knowledge in a domain or across domains. This paradigm can also be used for time series prediction (Barber et al., 2011), target tracking, and robot motions (Yi et al., 2012). We shall see objective functions for these problems in Sections 3 and 4.

In a different problem setting, the agent is more task-oriented. It has a specific goal, such as reaching a target location or winning a game, and takes actions to achieve the goal. For the actions a_t taken at state s_t, the agent receives rewards r_t from the environment. In this case, the objective is typically formulated as maximizing the expected reward V(s_t). Markov decision problems (Sutton & Barto, 1998) are a representative example of this class of tasks. We shall see variants of objective functions for solving these problems in Section 5.

2.2 Lifelong Learning as Sequential Bayesian Inference

In lifelong learning, the agent starts with an initial knowledge base and continually updates it as it collects more data by observing and interacting with the problem domain or task. This inductive process of evidence-driven refinement of prior knowledge into posterior knowledge can be naturally formulated as Bayesian inference (Zhang et al., 2012).

The prior distribution of the memory state at time t is given as P(m_t^-), where the minus sign in m_t^- denotes the memory state before observing the data. The agent collects experience by acting on the environment by a_t and sensing its world state s_t. This action and perception provide the data for computing the likelihood P(s_t, a_t | m_t^-) of the current model to get the posterior distribution of the memory state P(m_t | s_t, a_t). Formally, the update process can be written as

  P(m_t | s_t, a_t, m_t^-)
    = P(m_t, s_t, a_t, m_t^-) / P(s_t, a_t, m_t^-)
    = [1 / P(s_t, a_t, m_t^-)] P(m_t | s_t, a_t) P(s_t, a_t | m_t^-) P(m_t^-)
    ∝ P(m_t | s_t, a_t) P(s_t, a_t | m_t^-) P(m_t^-)
    = P(m_t | s_t) P(s_t | a_t) P(a_t | m_t^-) P(m_t^-)

where we have used the conditional independence between action and perception given the memory state.

From the statistical computing point of view, a sequential estimation of the memory states would be more efficient. To this end, we formulate the lifelong learning problem as a filtering problem, i.e. estimating the distribution P(m_t | s_{1:t}) of memory states m_t from the lifelong observations s_{1:t} = s_1 s_2 ... s_t up to time t. That is, given the filtering distribution P(m_{t-1} | s_{1:t-1}) at time t-1, the goal is to recursively estimate the filtering distribution P(m_t | s_{1:t}) at time step t:

  P(m_t | s_{1:t}) = P(m_t, s_{1:t}) / P(s_{1:t}) = P(m_t, s_t, s_{1:t-1}) / P(s_{1:t})

  P(m_t | s_{1:t}) ≈ P(s_t | m_t) P(m_t | s_{1:t-1})
    = P(s_t | m_t) Σ_{m_{t-1}} P(m_t, m_{t-1} | s_{1:t-1})
    = P(s_t | m_t) Σ_{m_{t-1}} P(m_t | m_{t-1}) P(m_{t-1} | s_{1:t-1})

If we let α(m_t) = P(m_t | s_{1:t}), we now have a recursive update equation:

  α(m_t) = P(s_t | m_t) Σ_{m_{t-1}} P(m_t | m_{t-1}) α(m_{t-1})

Taking the actions explicitly into account, the recursive lifelong learning update becomes:

  α(m_t) = P(s_t | m_t) Σ_{m_{t-1}} P(m_t | m_{t-1}) α(m_{t-1})
         = Σ_{a_t} P(s_t, a_t | m_t) Σ_{m_{t-1}} P(m_t | m_{t-1}) α(m_{t-1})
         = Σ_{a_t} P(s_t | a_t) P(a_t | m_t) Σ_{m_{t-1}} P(m_t | m_{t-1}) α(m_{t-1})

We note that the factors P(s_t | a_t), P(a_t | m_t), and P(m_t | m_{t-1}) correspond respectively to the perception, action, and prediction steps in Figure 1. These distributions determine how the agent interacts with the environment to model it and attain novel information.
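As a concrete illustration, the following Python sketch implements the α-recursion for a small discrete memory space, assuming hypothetical tabular distributions P(m_t | m_{t-1}), P(a_t | m_t), and P(s_t | a_t); the names, sizes, and random tables are illustrative only, not part of the framework above.

```python
import numpy as np

# Minimal sketch of the recursive memory-state update alpha(m_t), assuming
# small discrete spaces and hypothetical tabular distributions:
#   P_m_given_mprev[m, m_prev] ~ P(m_t | m_{t-1})   (prediction)
#   P_a_given_m[a, m]          ~ P(a_t | m_t)       (action)
#   P_s_given_a[s, a]          ~ P(s_t | a_t)       (perception)
rng = np.random.default_rng(0)

def random_stochastic(rows, cols):
    """Random column-stochastic matrix: each column sums to one."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=0, keepdims=True)

n_m, n_a, n_s = 4, 3, 5
P_m_given_mprev = random_stochastic(n_m, n_m)
P_a_given_m = random_stochastic(n_a, n_m)
P_s_given_a = random_stochastic(n_s, n_a)

def filter_step(alpha_prev, s_t):
    """alpha(m_t) = [sum_a P(s_t|a) P(a|m_t)] * [sum_{m'} P(m_t|m') alpha(m')]."""
    predicted = P_m_given_mprev @ alpha_prev      # sum over m_{t-1}
    evidence = P_s_given_a[s_t] @ P_a_given_m     # sum over a_t, one value per m_t
    alpha = evidence * predicted
    return alpha / alpha.sum()                    # normalize to a distribution

alpha = np.full(n_m, 1.0 / n_m)                   # uniform prior over memory states
for s_t in [0, 2, 1, 4]:                          # a short stream of observations
    alpha = filter_step(alpha, s_t)
print(alpha)
```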
2.3 Lifelong Supervised Learning

The perception-action cycle formulation above emphasizes the sequential nature of lifelong learning. However, nonsequential learning tasks, such as classification and regression, can also be incorporated into this framework as special cases. This is especially true for concept learning in a dynamic environment (Zhang et al., 2012). In lifelong learning of classification, the examples are observed as (x_t, y_t), t = 1, 2, 3, ..., but the examples are independent. The goal is to approximate ŷ_t = f(x_t; m_t) with minimum loss L(y_q, ŷ_q) for an arbitrary query input x_q. Note that classification fits the perception-action formulation by substituting

  s_t := x_t
  a_{t+1} := ŷ_t = f(x_t; m_t)

Likewise, lifelong learning of regression problems can be solved within this framework. The only difference from the classification problem is that in regression the outputs y_t are real values instead of categorical or discrete values.

3. Learning with Observations

3.1 Dynamical Systems and Markov Models

Figure 2: Learning with observations

Dynamical systems involve sequential prediction (Figure 2). For example, time series data consist of s_{1:T} ≡ s_1, ..., s_T. Since the time length T can be very large or infinite, this problem is an example of a lifelong learning problem. In addition, the learner may observe many, possibly indefinitely many, different time series, in which case each time series is called an episode.

Dynamical systems can be represented by Markov models or Markov chains. The joint distribution of the sequence of observations can be expressed as

  P(s_{1:T}) = ∏_{t=1}^T P(s_t | s_{1:t-1}) = ∏_{t=1}^T P(s_t | s_{t-1})

where in the second equality we used the Markov assumption, i.e. the current state depends only on the previous step.

In time series prediction, the learner has no control over the observations; it just passively receives the incoming data. The goal of the learner is to find the model m_t that best predicts the next state s_{t+1} = f(s_t; m_t) given the current state s_t. How do we define the word "best" quantitatively? In the following subsections we examine three measures: prediction error, predictive information, and the information bottleneck. The last two criteria are based on information-theoretic measures.

3.2 Prediction Error

The accuracy of time series prediction can be measured by the prediction error, i.e. the mean squared error (MSE) between the predicted states ŝ_{t+1} and the observed states s_{t+1}:

  MSE(s_{1:T}) = [1 / (T-1)] Σ_{t=1}^{T-1} (s_{t+1} - ŝ_{t+1})²

where the prediction ŝ_{t+1} is made by using the learner's current model, i.e. ŝ_{t+1} = f(s_t; m_t), and T is the length of the series. A natural measure is thus the root of the MSE, or RMSE, and the learner aims to minimize it:

  min_m RMSE(s_{1:T}) = min_m √MSE(s_{1:T})

where m ∈ M denotes the model parameters.
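As a minimal illustration of this criterion, the sketch below computes the one-step RMSE of a predictor over a toy series; the persistence predictor f and the data are stand-ins for a learned model f(s_t; m_t), not part of the formulation above.

```python
import math

# Minimal sketch of evaluating a predictor on a stream, assuming a hypothetical
# one-step model f(s_t) -> s_hat_{t+1}; here f is a trivial persistence model.
def f(s_t):
    return s_t  # illustrative stand-in for f(s_t; m_t)

def rmse_one_step(series, predict):
    """Root mean squared error of one-step-ahead predictions over s_1..s_T."""
    sq_err = [(series[t + 1] - predict(series[t])) ** 2
              for t in range(len(series) - 1)]
    return math.sqrt(sum(sq_err) / len(sq_err))

series = [0.0, 0.4, 0.9, 1.3, 1.2, 1.6]   # toy observations s_1..s_T
print(rmse_one_step(series, f))
```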
3.3 Predictive Information

For the evaluation of a time series of indefinite length, predictive information (Bialek et al., 2001) has been proposed. It is defined as the mutual information (MI) between the future and the past, relative to some instant t:

  I(S_future; S_past) = ⟨ log₂ [ P(s_future, s_past) / (P(s_future) P(s_past)) ] ⟩

where ⟨·⟩ denotes the expectation operator. If S is a Markov chain, the predictive information (PI) is given by the MI between two successive time steps:

  I(S_{t+1}; S_t) = ⟨ log₂ [ P(s_{t+1}, s_t) / (P(s_{t+1}) P(s_t)) ] ⟩

Several authors have studied this measure for self-organized learning and adaptive behavior. Zahedi et al. (2010), for example, found the principle of maximizing the predictive information effective for evolving coordinated behavior of physically connected robots starting with no knowledge of themselves or the world. Friston (2009) argues that self-organizing biological agents resist a tendency to disorder and therefore minimize the entropy of their sensory states; he proposes that the brain uses the free-energy principle for action, perception, and learning.
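To make the Markov-chain case concrete, the following sketch estimates I(S_{t+1}; S_t) from empirical pair counts of a discrete sequence; the sequences are synthetic examples (i.i.d. noise should give PI near zero, a deterministic cycle gives log₂ 3 bits), not data from the work cited above.

```python
import numpy as np

# Minimal sketch: estimate the predictive information I(S_{t+1}; S_t) of a
# discrete (Markov) sequence from empirical counts of successive pairs.
def predictive_information(seq, n_states):
    joint = np.zeros((n_states, n_states))
    for s_t, s_next in zip(seq[:-1], seq[1:]):
        joint[s_t, s_next] += 1.0
    joint /= joint.sum()                     # empirical P(s_t, s_{t+1})
    p_t = joint.sum(axis=1)                  # P(s_t)
    p_next = joint.sum(axis=0)               # P(s_{t+1})
    nz = joint > 0
    return float(np.sum(joint[nz] *
                        np.log2(joint[nz] / np.outer(p_t, p_next)[nz])))

rng = np.random.default_rng(1)
seq = list(rng.integers(0, 3, size=10000))   # i.i.d. noise: PI near 0 bits
print(predictive_information(seq, 3))
seq = [t % 3 for t in range(10000)]          # deterministic cycle: PI = log2(3)
print(predictive_information(seq, 3))
```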
3.4 Information Bottleneck

The information bottleneck method is a technique to compress a random variable X when a joint probability distribution between X and an observed relevant variable Y is given (Tishby et al., 1999). The compressed variable is Z, and the algorithm minimizes the quantity min_{P(z|x)} I(X; Z) − β I(Z; Y), where I(X; Z) is the mutual information between X and Z.

Creutzig et al. (2009) propose to use the information bottleneck to find the properties of the past that are relevant and sufficient for predicting the future in dynamical systems. Adapted to our notation, this past-future information bottleneck is written as

  min_{P(m_t | s_t)} { I(S_t; Ŝ_t) − β I(Ŝ_t; S_{t+1}) }

where S_t, S_{t+1}, and Ŝ_t are respectively the input past, the output future, and the model. Given past signal values, a compressed version of the past is to be formed such that information about the future is preserved. By varying β we obtain the optimal trade-off curve between compression and prediction, also known as the information curve, which is a more complete characterization of the complexity of the process.

Creutzig et al. (2009) show that the past-future information bottleneck method can make the underlying predictive structure of the process explicit and capture it in the states of a dynamical system. From the lifelong learning point of view, this means that from repeated observations of the dynamic environment the measure provides an objective function that the learner can use to identify regularities and extract the underlying structures.
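A minimal sketch of evaluating (not optimizing) this past-future bottleneck objective for a candidate encoder P(m_t | s_t) over a toy joint P(s_t, s_{t+1}) is given below; the tables and the value β = 4 are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: evaluate the past-future bottleneck objective
#   I(S_t; M) - beta * I(M; S_{t+1})
# for a candidate stochastic encoder P(m | s_t) and a small tabular joint
# P(s_t, s_{t+1}). All numbers are illustrative only.
def mutual_information(joint):
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px * py)[nz])))

def ib_objective(p_past_future, encoder, beta):
    """encoder[m, s] ~ P(m | s_t); p_past_future[s, s'] ~ P(s_t, s_{t+1})."""
    p_s = p_past_future.sum(axis=1)            # P(s_t)
    joint_sm = encoder * p_s                   # P(m, s_t) = P(m|s) P(s)
    joint_mf = encoder @ p_past_future         # P(m, s_{t+1})
    return mutual_information(joint_sm) - beta * mutual_information(joint_mf)

p_past_future = np.array([[0.30, 0.10],        # toy joint P(s_t, s_{t+1})
                          [0.05, 0.55]])
encoder = np.array([[0.9, 0.2],                # P(m | s_t), columns sum to 1
                    [0.1, 0.8]])
print(ib_objective(p_past_future, encoder, beta=4.0))
```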
Figure 3: Learning with actions

4. Learning with Actions

4.1 Interactive Learning

We now consider learning agents that perform actions on the environment to change the states of the environment (Figure 3). An example of this paradigm is interactive learning (Still, 2009). Assume that the learner interacts with the environment between consecutive observations. Let one decision epoch consist in mapping the current history h, available to the learner at time t, onto an action (sequence) a that starts at time t and takes time ∆ to be executed. The problem of interactive learning is to choose a model and an action policy which are optimal in that they maximize the learner's ability to predict the world while being minimally complex.

The decision function, or action policy, is given by the conditional probability distribution P(a_t | h_t). Let the model summarize historical information via the probability map P(s_t | h_t). The learner uses the current state s_t together with knowledge of the action a_t to make probabilistic predictions of future observations s_{t+1}:

  P(s_{t+1} | s_t, a_t) = [1 / P(s_t, a_t)] Σ_h P(s_{t+1} | h, a_t) P(a_t | s_t) P(s_t | h) P(h)

The interactive learning problem is solved by maximizing I({S_t, A_t}; S_{t+1}) over P(s_t | h_t) and P(a_t | h_t), under constraints that select for the simplest possible model and the most efficient policy, respectively, in terms of the smallest complexity measured by the coding rate. Less complex models and policies result in less predictive power. This trade-off can be implemented using Lagrange multipliers λ and µ. Thus, the optimization problem for interactive learning (Still, 2009) is given by

  max_{P(s_t | h_t), P(a_t | h_t)} { I({S_t, A_t}; S_{t+1}) − λ I(S_t; H_t) − µ I(A_t; H_t) }
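The sketch below evaluates this objective from a tabular joint over (H_t, S_t, A_t, S_{t+1}); the random joint and the multiplier values are placeholders, since in Still's formulation the joint would be induced by the model P(s_t | h_t) and the policy P(a_t | h_t).

```python
import numpy as np

# Minimal sketch: evaluate the interactive-learning objective
#   I({S_t, A_t}; S_{t+1}) - lambda * I(S_t; H_t) - mu * I(A_t; H_t)
# from a tabular joint P(h, s_t, a_t, s_{t+1}). All numbers are illustrative.
def mi(joint2d):
    px = joint2d.sum(axis=1, keepdims=True)
    py = joint2d.sum(axis=0, keepdims=True)
    nz = joint2d > 0
    return float(np.sum(joint2d[nz] * np.log2(joint2d[nz] / (px * py)[nz])))

def interactive_objective(p_hsas, lam, mu):
    n_h, n_s, n_a, n_next = p_hsas.shape
    p_sa_next = p_hsas.sum(axis=0).reshape(n_s * n_a, n_next)  # P((s,a), s')
    p_h_s = p_hsas.sum(axis=(2, 3))                            # P(h, s_t)
    p_h_a = p_hsas.sum(axis=(1, 3))                            # P(h, a_t)
    return mi(p_sa_next) - lam * mi(p_h_s) - mu * mi(p_h_a)

rng = np.random.default_rng(2)
p_hsas = rng.random((3, 2, 2, 2))
p_hsas /= p_hsas.sum()                                         # normalize to a joint
print(interactive_objective(p_hsas, lam=0.1, mu=0.1))
```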
Note that interactive learning is different from reinforcement learning, which will be discussed in the next section. In contrast to reinforcement learning, a predictive-model approach such as interactive learning asks about behavior that is optimal with respect to learning about the environment rather than with respect to fulfilling a specific task. This approach does not require rewards. Conceptually, the predictive approach can be thought of as "rewarding" information gain and, hence, curiosity. In that sense, it is related to curiosity-driven reinforcement learning (Schmidhuber, 1991; Still & Precup, 2012), where internal rewards are given that correlate with some measure of prediction error. However, the learner's goal is not to predict future rewards, but rather to behave such that the time series it observes as a consequence of its own actions is rich in causal structure. This, in turn, allows the learner to construct a maximally predictive model of its environment.

4.2 Empowerment

Empowerment measures how much influence an agent has on its environment. It is an information-theoretic generalization of joint controllability (influence on the environment) and observability (measurement by sensors) of the environment by the agent, both controllability and observability being usually defined in control theory as the dimensionality of the control/observation spaces (Jung et al., 2011).

Formally, empowerment is defined as the Shannon channel capacity between A_t, the choice of an action sequence, and S_{t+1}, the resulting successor state:

  C(s_t) = max_{P(a)} I(S_{t+1}; A_t | s_t)
         = max_{P(a)} { H(S_{t+1} | s_t) − H(S_{t+1} | A_t, s_t) }

The maximization of the mutual information is with respect to all possible distributions over A_t. Empowerment measures to what extent an agent can influence the environment by its actions. It is zero if, regardless of what the agent does, the outcome will be the same, and it is maximal if every action has a distinct outcome.

It should be noted that empowerment is fully specified by the dynamics of the agent-environment coupling (i.e. the transition probabilities), and a reward does not need to be specified. Empowerment provides a natural utility function which imbues states with an a priori value, without an explicit specification of a reward. This enables the system to keep itself alive indefinitely.
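Since empowerment is a channel capacity, it can be computed for a small discrete system with the standard Blahut-Arimoto iteration. The sketch below does this for a single state, assuming a hypothetical transition table P(s_{t+1} | s, a); the two example channels reproduce the extreme cases described above (one bit and zero bits).

```python
import numpy as np

# Minimal sketch: empowerment C(s) = max_{P(a)} I(S_{t+1}; A_t | s) for one state,
# given a hypothetical tabular channel p_next[a, s'] = P(s_{t+1} | s, a), computed
# with the Blahut-Arimoto iteration for channel capacity (result in bits).
def kl_rows_bits(p_next, q):
    """D( P(s'|a) || q(s') ) in bits, one value per action a (rows of p_next)."""
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(p_next > 0, np.log2(p_next / q), 0.0)
    return np.sum(p_next * log_ratio, axis=1)

def empowerment(p_next, n_iter=200):
    n_a = p_next.shape[0]
    p_a = np.full(n_a, 1.0 / n_a)            # start from a uniform action distribution
    for _ in range(n_iter):
        q = p_a @ p_next                     # induced marginal over successor states
        p_a = p_a * np.exp2(kl_rows_bits(p_next, q))
        p_a /= p_a.sum()                     # Blahut-Arimoto update of P(a)
    return float(p_a @ kl_rows_bits(p_next, p_a @ p_next))

# Two actions with distinct deterministic outcomes: empowerment is 1 bit.
print(empowerment(np.array([[1.0, 0.0],
                            [0.0, 1.0]])))
# Actions whose outcomes are indistinguishable: empowerment is 0 bits.
print(empowerment(np.array([[0.5, 0.5],
                            [0.5, 0.5]])))
```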
Figure 4: Learning with rewards

5. Learning with Rewards

5.1 Markov Decision Processes

In some settings of lifelong learning, the agent receives feedback information from the environment. In this case, the agent's decision process can be modeled as a Markov decision process (MDP). MDPs are a popular approach for modeling sequences of decisions taken by an agent in the face of delayed accumulation of rewards. The structure of the rewards defines the tasks the agent is supposed to achieve.

A standard approach to solving an MDP is reinforcement learning (Sutton & Barto, 1998), which is an approximate dynamic programming method. The learner observes the state s_t of the environment, takes an action a_t on the environment, and gets a reward r_t from it (Figure 4). This occurs sequentially, i.e. the learner observes the next state only after it takes an action. An example of this kind of learner is a mobile robot that sequentially measures its current location, takes motions, and reduces the distance to the destination. Another example is a stock-investment agent that observes the state of the stock market, makes sell/buy decisions, and gets payoffs. It is not difficult to imagine extending this idea to develop a lifelong learning agent that incorporates external guidance and feedback from humans or other agents to accumulate knowledge from experience.

5.2 Value Functions

The goal of reinforcement learning is to maximize the expected value of the cumulated reward. The reward function is defined as R(s_{t+1} | s_t, a_t) or r_{t+1} = r(s_t, a_t). This value is obtained by averaging over the transition probabilities T(s_{t+1} | s_t, a_t) and the policy π(a_t | s_t) or a_t = π(s_t). Given a starting state s and a policy π, the value V^π(s_t) of the state s_t following policy π can be expressed via the recursive Bellman equation (Sutton & Barto, 1998):

  V^π(s_t) = Σ_{a_t ∈ A} π(a_t | s_t) Σ_{s_{t+1} ∈ S} T(s_{t+1} | s_t, a_t) [ R(s_{t+1} | s_t, a_t) + V^π(s_{t+1}) ]

Alternatively, the value function can be defined on state-action pairs:

  Q^π(s_t, a_t) = Σ_{s_{t+1} ∈ S} T(s_{t+1} | s_t, a_t) [ R(s_{t+1} | s_t, a_t) + V^π(s_{t+1}) ]

which is the utility attained if, in state s_t, the agent carries out action a_t and thereafter follows π.
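A minimal sketch of evaluating V^π and Q^π by iterating the Bellman equation on a toy tabular MDP is shown below; the transition and reward tables are illustrative, and a discount factor γ is assumed here for convergence even though the equations above are written without one.

```python
import numpy as np

# Minimal sketch: iterative policy evaluation of V^pi via the Bellman equation
# on a small hypothetical MDP with tabular T[s, a, s'] and R[s, a, s'].
# The discount factor gamma is an assumption added for convergence.
def evaluate_policy(T, R, pi, gamma=0.9, n_iter=500):
    n_s = T.shape[0]
    V = np.zeros(n_s)
    for _ in range(n_iter):
        # Q[s, a] = sum_{s'} T(s'|s,a) [ R(s'|s,a) + gamma V(s') ]
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
        V = np.einsum("sa,sa->s", pi, Q)      # V[s] = sum_a pi(a|s) Q[s, a]
    return V, Q

# Toy two-state, two-action MDP (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])      # T[s, a, s']
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0     # reward for landing in state 1
pi = np.array([[0.5, 0.5], [0.5, 0.5]])       # uniform random policy pi(a|s)
V, Q = evaluate_policy(T, R, pi)
print(V)
print(Q)
```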
5.3 Information Costs

If there are multiple optimal policies, then asking for the information-theoretically cheapest one among these optimal policies becomes interesting. Tishby & Polani (2010) and Polani (2011) propose to introduce an information cost term in policy learning. It is even more interesting if we do not require the solution to be perfectly optimal. Thus, if we only require the expected reward E[V(S)] to be sufficiently large, the information cost of such a suboptimal (but informationally parsimonious) policy will generally be lower.

For a given utility level, we can use the Lagrangian formalism to formulate the unconstrained minimization problem

  min_π { I^π(S_t; A_t) − β E[Q^π(S_t, A_t)] }

where I^π(S_t; A_t) measures the decision cost incurred by the agent:

  I^π(S_t; A_t) = Σ_{s_t} P(s_t) Σ_{a_t} π(a_t | s_t) log [ π(a_t | s_t) / P(a_t) ]

where P(a_t) = Σ_{s_{t+1}} π(a_t | s_{t+1}) P(s_{t+1}). The term I^π(S_t; A_t) denotes the information that the action A_t carries about the state S_t under policy π.
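The decision cost and the resulting trade-off can be evaluated directly for tabular policies, as in the sketch below (computed in bits); the state distribution, the policies, and the Q table are illustrative assumptions, contrasting a state-dependent policy with an informationally free state-independent one.

```python
import numpy as np

# Minimal sketch: the information cost I^pi(S; A) of a policy and the trade-off
# objective I^pi(S; A) - beta * E[Q^pi(S, A)], given a state distribution P(s),
# a policy pi[s, a] = pi(a|s), and a value table Q[s, a]. All numbers are toys.
def information_cost(p_s, pi):
    p_a = p_s @ pi                               # marginal P(a) under pi and P(s)
    ratio = np.where(pi > 0, pi / p_a, 1.0)      # pi(a|s) / P(a), bits via log2
    return float(np.sum(p_s[:, None] * pi * np.log2(ratio)))

def tradeoff_objective(p_s, pi, Q, beta):
    expected_Q = float(np.sum(p_s[:, None] * pi * Q))
    return information_cost(p_s, pi) - beta * expected_Q

p_s = np.array([0.6, 0.4])                       # P(s)
pi_det = np.array([[1.0, 0.0], [0.0, 1.0]])      # state-dependent (costly) policy
pi_rand = np.array([[0.5, 0.5], [0.5, 0.5]])     # state-independent (free) policy
Q = np.array([[1.0, 0.0], [0.0, 1.0]])           # Q[s, a]
for pi in (pi_det, pi_rand):
    print(information_cost(p_s, pi), tradeoff_objective(p_s, pi, Q, beta=1.0))
```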
5.4 Interestingness and Curiosity

An objective function consisting of the value function and the information cost can balance the expected return against a minimum cost. However, it lacks any notion of interestingness (Zhang, 1994) or curiosity (Schmidhuber, 1991). In Section 4 we have seen this aspect reflected in predictive power and empowerment (Jung et al., 2011). The objective function can be extended by the predictive power (Still & Precup, 2012). Using Lagrange multipliers, we can formulate lifelong learning as the optimization problem

  arg max_q { I_q^π({S_t, A_t}; S_{t+1}) + α V_t^π(q) − λ I(S_t; A_t) }

where q(a_t | s_t) is the action policy to be approximated. The ability to predict improves the performance of a learner across a large variety of specific behaviors.

The above objective function, embodying the curiosity term as well as the value and information cost terms, can thus be an ideal guideline for a lifelong learner. The predictive power term I_q^π({S_t, A_t}; S_{t+1}) allows the agent to actively explore the environment to extract interesting knowledge. The information cost term I(S_t; A_t) enables the learner to minimize its interaction with the environment or teacher. This all happens with the goal of maximizing the value or utility V_t^π(q) of the information the agent is acquiring.

6. Summary and Conclusion

We have formulated lifelong learning as a sequential, online, incremental learning process over an extended period of time in a dynamic, changing environment. The hallmark of this lifelong-learning framework is that the training data are observed sequentially and not kept for iterative reuse. This requires instant, online model building and incremental transfer of knowledge acquired from previous learning to future learning, which can be formulated as a Bayesian inference.

The Bayesian framework is general enough to cover the perception-action cycle model of cognitive systems in its various instantiations. We applied the framework to develop a taxonomy of lifelong learning based on the way learning examples are obtained. We distinguished three paradigms: learning with observations, learning with actions, and learning with rewards. For each of the paradigms we examined the objective functions of the lifelong learning styles.

The first paradigm is lifelong learning with passive, continual observations. Typical examples are time series prediction and target tracking (filtering). The objective functions for this setting are prediction errors and predictive information, the latter being defined as the mutual information between the past and future states of the time series. The information bottleneck method can also be modified to measure the predictive information.

The second paradigm is lifelong learning with actions (but without reward feedback). Interactive learning and empowerment are the examples. Here, the learner actively explores the environment to achieve maximal predictive power about the environment at minimal complexity. In this paradigm, the agent takes actions on the environment according to an action policy, but does not receive rewards from the environment for its actions. The goal is mainly to know more about the world. Simultaneous localization and mapping (SLAM) in robotics is an excellent example of the interactive learning problem, though no literature is found on an explicit formulation of SLAM as interactive learning.

The third paradigm is active lifelong learning with explicit rewards. This includes the MDP problem, for which approximate dynamic programming and reinforcement learning have been extensively studied. The conventional objective function for MDPs is the value function, or the expected reward of the agent. As we have reviewed in this paper, there have been several recent proposals to extend the objective function by incorporating information-theoretic factors. These objective functions can be applied to lifelong learning agents, for example, to minimize information costs while maximizing the predictive information or curiosity for a given level of expected reward from the environment. These approaches are motivated by information-theoretic analyses of the perception-action cycle view of cognitive dynamic systems.

In this article, we have focused on the sequential, predictive learning aspects of lifelong learning. The framework is general and thus can incorporate the classes of lifelong classification and regression learning. Since these supervised learning problems do not care about the sequence of observations, the sequential formulations presented in this paper can be reused by ignoring the temporal dependency. We also did not discuss the detailed mechanisms of learning processes for the lifelong learning framework. Future work should relate the information-theoretic objective functions to the representations, to address questions like "how to discover and revise the knowledge structures to represent the internal model of the world or environment" (Zhang, 2008).

As a whole, we believe the general framework and the objective functions for lifelong learning described here provide a baseline for evaluating the representations and strategies of learning algorithms. Specifically, the objective functions can be used for innovating algorithms for discovery, revision, and transfer of knowledge by lifelong learners over an extended period of experience. Our emphasis on information theory-based active and predictive learning, with minimal mechanistic assumptions on model structures, can be especially fruitful for automated knowledge acquisition and sequential knowledge transfer between a wide range of similar but significantly different tasks and domains.

Acknowledgements: This work was supported in part by the National Research Foundation (NRF-2010-0017734) and the AFOSR/AOARD R&D Grant 124087.

References

[Ay et al., 2008] Ay, N., Bertschinger, N., Der, R., Guetter, F., & Olbrich, E., Predictive information and explorative behavior in autonomous robots, European Physical Journal B, 63:329-339, 2008.
[Barber et al., 2011] Barber, D., Cemgil, A. T., & Chiappa, S. (eds.), Bayesian Time Series Models, Cambridge University Press, 2011.
[Bialek et al., 2001] Bialek, W., Nemenman, I., & Tishby, N., Predictability, complexity, and learning, Neural Computation, 13:2409-2463, 2001.
[Cohn et al., 1990] Cohn, D., Atlas, L., & Ladner, R., Training connectionist networks with queries and selective sampling, In: D. Touretzky (ed.), Advances in Neural Information Processing 2, Morgan Kaufmann, 1990.
[Cohn et al., 1994] Cohn, D., Atlas, L., & Ladner, R., Improving generalization with active learning, Machine Learning, 15(2):201-221, 1994.
[Creutzig et al., 2009] Creutzig, F., Globerson, A., & Tishby, N., Past-future information bottleneck in dynamical systems, Physical Review E, 79, 042519, 2009.
[Eaton & desJardins, 2011] Eaton, E. & desJardins, M., Selective transfer between learning tasks using task-based boosting, In: Proc. Twenty-Fifth AAAI Conf. Artificial Intelligence (AAAI-11), pp. 337-342, AAAI Press, 2011.
[Freund et al., 1993] Freund, Y., Seung, H. S., Shamir, E., & Tishby, N., Information, prediction, and query by committee, In: S. Hanson et al. (eds.), Advances in Neural Information Processing 5, Morgan Kaufmann, 1993.
[Friston, 2009] Friston, K., The free-energy principle: a unified brain theory?, Nature Reviews Neuroscience, 11:127-138, 2009.
[Jung et al., 2011] Jung, T., Polani, D., & Stone, P., Empowerment for continuous agent-environment systems, Adaptive Behavior, 19(1):16-39, 2011.
[Polani, 2011] Polani, D., An information perspective on how the embodiment can relieve cognitive burden, In: Proc. IEEE Symposium Series in Computational Intelligence: Artificial Life, IEEE Press, pp. 78-85, 2011.
[Schmidhuber, 1991] Schmidhuber, J., Curious model-building control systems, In: Proc. Int. Joint Conf. Neural Networks, pp. 1458-1463, 1991.
[Still, 2009] Still, S., Information-theoretic approach to interactive learning, European Physical Journal, 85, 2009.
[Still & Precup, 2012] Still, S. & Precup, D., An information-theoretic approach to curiosity-driven reinforcement learning, Theory in Biosciences, 131(3):139-148, 2012.
[Sutton & Barto, 1998] Sutton, R. S. & Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, 1998.
[Tishby et al., 1999] Tishby, N., Pereira, F. C., & Bialek, W.,
The information bottleneck method, In: Proc. 37th Annual
Allerton Conf. Communication, Control and Computing,
1999.
[Tishby & Polani, 2010] Tishby, N. & Polani, D., Information
theory of decisions and actions. In: Perception-Reason-Action
Cycle: Models, Algorithm and Systems. Springer, 2010.
[Thrun & Moeller, 1992] Thrun, S. & Moeller, K., Active
exploration in dynamic environments, In: J. Moody et al.,
(eds.) Advances in Neural Information Processing 4,
Morgan Kaufmann, 1992.
[Valiant, 1984] Valiant, L. G., A theory of the learnable,
Communications of the ACM, 27(11):1134-1142, 1984.
[Yi et al., 2012] Yi, S.-J., Zhang, B.-T., & Lee, D. D., Online
learning of uneven terrain for humanoid bipedal walking, In
Proc. AAAI Conference on Artificial Intelligence (AAAI 2010),
pp. 1639-1644, 2010.
[Zahedi et al., 2010] Zahedi, K., Ay, N., & Der, R., Higher
coordination with less control – A result of information
maximization in the sensorimotor loop, Adaptive Behavior,
18(3-4):338-355, 2010.
[Zhang et al., 2012] Zhang, B.-T., Ha, J.-W., & Kang, M.,
Sparse population code models of word learning in concept
drift, In: Proc. 34th Annual Conference of the Cognitive
Science Society (CogSci 2012), pp. 1221-1226, 2012.
[Zhang, 2008] Zhang, B.-T., Hypernetworks: A molecular
evolutionary architecture for cognitive learning and memory,
IEEE Computational Intelligence Magazine, 3(3):49-63, 2008.
[Zhang, 1994] Zhang, B.-T., Accelerated learning by active
example selection, International Journal of Neural
Systems, 5(1):67-75, 1994.
[Zhang & Veenker, 1991a] Zhang B.-T. & Veenker, G., Focused
incremental learning for improved generalization with
reduced training sets, Proc. Int. Conf. Artificial Neural
Networks (ICANN'91), pp. 227-232, 1991.
[Zhang & Veenker, 1991b] Zhang B.-T. & Veenker, G., Neural
networks that teach themselves through genetic discovery of
novel examples, Proc. 1991 IEEE Int. Joint Conf. Neural
Networks (IJCNN'91), pp. 690-695, 1991.