Vdoc - Pub The Method of Maximum Entropy
Vdoc - Pub The Method of Maximum Entropy
T h e M e t h o d of M a x i m u m E n t r o p y
H. Gzyl
Even though to err is human, some "misprints" are plain dum, if not worse.
Here are some of the worst t h a t I caught a bit too late.
[Hfi/h)dQ
4. P. 82, line 6 from below, I should have written
qi = q2 = q3 = q4 = 1/4.
THE METHOD OF
MAXIMUM
ENTROPY
This page is intentionally left blank
Series on Advances in Mathematics for Applied Sciences - Vol. 29
T H E M E T H O D OF
MAXIMUM
ENTROPY
Henryk Gzyl
Facultad de Ciencias
Universidad Central de Venezuela
World Scientific
Singapore >New Jersey London* Hong Kong
Published by
World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 9128
USA office: Suite IB, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
For photocopying of material in this volume, please pay a copying fee through
the Copyright Clearance Center, Inc., 27 Congress Street, Salem, M A 01970, US A.
This book is an outgrowth of a set of lecture notes on the maximum entropy method delivered in
1988 at the 1st Venezuelan School of Mathematics. This yearly event aims at acquainting graduate
students, and university teachers with trends, techniques and open problems of current interest. It
takes place during the month of September at the Universidad de los Andes, in the city of Merida.
At the same time I was being invited to give lectures, Didier Dacunha-Castelle passed by
and reported on his work on the subject. This happened not long after some astronomers friends
of mine from the CEDA (also in Merida) had asked me to go with them over some methods for
reconstructing images based on a maximum entropy procedure. So what else was left for me to do
but collect material for that course?
The more I looked around, the more applications of the method I found. My original goal
was to organize the material in such a way that the underlying philosophy of the method became
transparent and to try to understand myself why it works. I hope to convey some of that to you,
even though some of the whys are still a mystery (at least to me).
v
This page is intentionally left blank
Table of Contents
PREFACE v
CHAPTER 0
Introduction 1
CHAPTER 1
Basic Concepts from Probability Theory 7
CHAPTER 2
Equilibrium Distributions in Statistical Mechanics 17
CHAPTER 3
Some Heuristics 23
CHAPTER 4
Entropy Functionals 27
1. Basics 27
2. Entropy inequalities 35
3. Axiomatic characterization of entropies 39
CHAPTER 5
The Method of Maximum Entropy 41
1. Kullback's and Jaynes' reconstruction methods 41
2. Czizar's results 44
3. Borwein and Lewis' extensions 49
4. Dacunha-Castelle and Gamboa's approach to level 2 M.E.M. 49
CHAPTER 6
Applications and Extensions 61
1. Entropy maximization under quadratic constraints, or constraint relaxation 61
2. Failure of maximum entropy methods for reconstruction in infinite systems 63
3. Some finite dimensional, linear reconstruction problems 67
4. Maxentropic approach to linear programming 72
5. Entropy as Lyapunov functional. Further comments 75
6. Solving matrix equations 78
7. Estimation of transition probabilities 80
8. Maxentropic reconstruction of velocity profiles 83
vii
viii Contents
INTRODUCTION
The Method of Maximum Entropy is an offspring of the Maximum Entropy Principle introduced
in 1957 in statistical physics by E. Jaynes. That principle has the esthetic appeal of all variational
principles in physics and its basic role is to characterize equilibrium states. It works as follows: a
functional which is a Lyapunov function for the dynamics of the system is defined on the set of
states. It is postulated that the equilibrium states are those yielding a maximum value of the
functional, compatible with a set of given values of some extensive variables.
In chapter 2 we shall explain these things further, here we direct the reader to [0.1] where
the original papers by Jaynes are reprinted together with many other interesting ones.
Actually, the possibility of characterizing probabiUty densities by variational methods was
already noticed by statisticians and information theorists well before 1957. Take a look at [0.2],
especially chapter 3. By now, the list of probabiUty distributions derived via the maximum entropy
method is pretty long. We can go back even further. The germ of the idea is already presented in
Boltzmann's writings. See [0.24] in the commemorative volume dedicated to his life and work.
The germ of the idea is also present in Gibb's work. See the paper on the Camot's principle by
Jaynes in [0.7].
See the volume [0.3] by Kapur which devotes several chapters to characterization of
standard probability densities by maximum entropy methods. Besides, a large variety of
appUcations in which the notion of entropy enters is presented. And, speaking of appUcations, we
bring the collection [0.4]-[0.11] to the reader's attention, where not only many appUcations of the
MEM. are collected, but a lot of space is dedicated to foundational matters, and to explain the
word "Bayesian" in the title of many of the volumes.
To explain the general philosophy of the M.E.M., and the underlying common approach of
the long list of successful applications, let us begin by saying that many inverse and/or direct
problems, aU of which we shall call reconstruction problems, lead to searching for solutions to
equations like
(0.1) Ax=y
where A: V, -> V2 is a linear transformation between two appropriate vector spaces V, and V2 ,
and we may be looking for solutions in some cone (or convex set) C,cV, while the data Ues in
/
2 The Method ofMaximum Entropy
(0.2) S: C,->5R
which will be the entropy functional and instead of solving (0.1) one sets up the following
maximization problem
thus, we see that if x* is such that S(x*) reaches a maximum value, and the constraint is satisfied,
one automatically has a solution to (0.1). The beauty about (0.3) is that many times solving (0.3)
is equivalent tofindinga minimum of a convex functional
(0.4) H(X)=lnZ(X)-KX,y)
where Z(X) will make its appearance below. In general H(V) is defined on a convex D c V\, and
when we are lucky D= V2 H(X) is some sort of dual to S(x) although not quite. Physicists
have nice interpretations for it. In (0.4) the X are the Lagrange multipliers for (0.3).
We shall write (x,y) for the scalar product of vectors x, and y , and when x e V and
A,eV, (X,x)s X(x) as usual.
The value A.* of the X that makes x(A.) a solution x* to (0.3) is obtained by minimizing the
convex functional H(X), which depends on as many variables as there are equations in (0.1).
We have thus transformed solving linear problem with more unknowns than equations into
solving a smaller minimization problem, hopefully without constraints.
Many of the initial applications consisted in looking for positive probability densities
yielding prescribed mean values for a finite collection of functions, taking S as the
Gibbs-Boltzmann entropy associated to a density was natural.
In 1967 Burg proposed another entropy functional which has proven very useful for
reconstructing densities when information about time series is given by a few correlations. We
shall come back to this below.
At what we call a level 2 reconstruction problem, the M.E.M. enters the following way:
On some appropriate measurable space (£1,3), see chapter 1, we consider a class P of probability
Introduction 3
measures, possibly absolutely continuous with respect to some fixed, preassigned a priori
measure, and a family of random variables X: fi->V,. Thus if PeP, the expected value of X
with respect to P, Ep (X), is an element i s V , . Instead of considering equation (0.1) we shall think
of random variables AX with expected value
E/>AX=AEPX=y
Instead of solving (0.1) we will search for measures satisfying EpAX=y. Now this becomes
a level 1 problem on a different space namely, we want to find
The rest of the comments made when describing the level 1 version of the MEM. apply
here as well.
I have been able to trace this approach to reconstruction problems at least to the work of
Rietsch [0.12], where it was applied to reconstruct the earth's density given its mass and moment
of inertia.
After recalling some basic notations and definitions from probability theory, in chapter 1,
we devote chapter 2 to a watered down presentation of equilibrium statistical mechanics. Chapter
3 consists of some heuristic arguments backing up the M.E.M..
In chapter 4 we introduce the most used entropy functionals and examine some of their
properties.
Finally, it is in chapter 5 where the MEM. is explained. There we borrow from the
important work by Cszizar, Dacunha and Gamboa, and make a few comments about the work by
Bowrein and Lewis.
Surely the appeal of the MEM. has to do with its success in a large variety of
applications, in some of which an entropy like concept is natural. But in many cases it is just
something you pull out of your hat and it solves a problem for you. It may be this fact what
prompts much of the work of explaining what the MEM. is about. Besides that, there is the
appeal of the concept of entropy that comes through the second law of thermodynamics in
understanding irreversibility; see [0.13]-[0.14]. To see how entropies help to understand issues
related to self-organization and/or chaotic behavior as explained in [0.14]-[0.16]. Some uses of
the notion of entropy in biology and economics have generated strong and sarcastic criticism. See
[0.17]-[0.18] and the reviews in [0.19]-[0.20].
An interesting collection in which the thermodynamic notion of entropy plays a role is
compiled in [0.21] and, of course, we should not fail to list at least one reference on the use of the
4 The Method ofMaximum Entropy
concept of entropy in information theory, [0.22] and in the theory of dynamical systems [0.23],
connections between entropy, complexity and several quantum issues are reviewed in [0.25],
It will be up to the reader to decide whether there is or there is not a common thread in
this list of references, many not directly related to the M.E.M., which explains its appeal beyond
the mere: it just works.
To conclude, it must be clear that we are citing references by square brackets, numbered
almost always by order of appearance, listed by chapter. Also, formulae, definitions and results
will be cited sequentially in each chapter within round brackets.
I would like to thank my colleague Aldo Tagliani for writing sections (6.19)-(6.21).
And last, but not least, my thanks go to Ms. Leda Calderon, who typed the manuscript
and went along nicely with my changing of mind now and then about a paragraph here and there.
The editorial staff at WSP did a fabulous job weeding out uncountable mispirints
To finish I want to acknowledge the support of the Facultad de Ciencias, U.C.V., and of
CONICIT forfinancialsupport during the preparation of the book.
Two references which I obtained during a brief visit to the CWI in Amsterdam, and added
at the last minute are [0.25]-[0.26].
REFERENCES
[0.1] E. T. Jaynes: Papers on Probability, "Statistics and Statistical Physics". Ed. Rosenkrantz,
E.D. Kluwer Acad. Publi., Dordrecht, 1983.
[0.2] Kullback, S. "Information Theory and Statistics". Dover Publi., New York, 1968.
[0.3] Kapur, J. N. "Maximum Entropy Models in a Science and Engineering' John Wiley,
New York, 1989.
[0.4] Justice, J. H. (Eds) "Maximum Entropy and Bayesian Methods in Applied Statistics"
Cambridge Univ. Press, 1986.
[0.5] Ray Smith C. and Grandy, W. T., Jr. (Eds) "Maximum Entropy and Bayesian Methods in
Inverse Problems". D. Reidel Publi. Co., Dordrecht, 1987.
[0.6] Ray Smith C. and Erickson, G. J. (Eds) "Maximum Entropy and Bayesian Spectral
Analysis and Estimation Problems" D. Reidel Publi. Co., Dordrecht, 1987.
[0.7] Erickson, G. and Ray Smith C. (Eds) "Maximum Entropy Methods and Bayesian Methods
in Science and Engineering": Vol I-Foundations. Kluwer Acad. Publi., Dordrecht, 1988.
[0.8] Erickson, G. and Ray Smith C. (Eds) "Maximum Entropy Methods and Bayesian Methods
in Science and Engineering". Vol II-Applications. Kluwer Acad. Publi., Dordrecht, 1988.
[0.9] Skilling, J. (Eds) "Maximum Entropy and Bayesian Methods" Kluwer Acad. Publi.
Co.,Dordrecht, 1989.
Introduction 5
[0.10] Fougere, P. F. (Eds) "Maximum Entropy and Bayesiam Methods" Kluwer Acad. Publi.
Co., Dordrecht, 1990.
[0.11] Grandy, W. T., Jr. and Schick, L. H. (Eds) "Maximum Entropy and Bayesian Methods"
Kluwer Acad. Publi., Dordrecht, 1991.
[0.12] Rietsch, E. "A Maximum Entropy approach to inverse problems". Joum. of Geophysics.
42, pp. 489-506, 1977.
[0.13] Atkins, P. W. "77K; Second Law" W. H. Freeman. New York, 1984.
[0.14] Prigogine, I. and Stengers, I. "Order out of Chaos". Bantham Books, New York, 1984.
[0.15] Klimontovich, Yu. L. "Turbulent Motion and The Structure of Chaos" Kluwer Acad.
Publi., Dordrecht, 1991.
[0.16] Mackey, M. C. "The Origin of Thermodynamic Behaviour" Springer Verlag, Berlin,
1992.
[0.17] Rifkin, J. "Entropy: Into the Greenhouse World'. Bantham Books, New York, 1989.
[0.18] Brooks, D. R. and Wiley, E. O. "Evolution as Entropy".Univ. of Chicago Press, Chicago,
1988.
[0.19] Morowitz, H. "Entropy Anyone"., in "Mayonnaise and the Origin of Life" Berkeley
Books, New York, 1985.
[0.20] Rothman, T. "Science a la Mode: Physical Fashions and Fictions". Princeton Univ. Press,
Princeton, 1989.
[0.21] Leff, H. S. and Rex, A. F. "Maxwell's Demon. Entropy, Information, Computing"
Princeton Univ. Press, Princeton, 1990.
[0.22] McEliece, R. J. " The Theory of Information and Coding" Vol 3, Encyclop. Math.,
Addison - Wesley, Reading, 1981.
[0.23] Martin, N. F. G. "Mathematical Theory of Entropy". Vol 12, Encyclop. Math.,
Addison-Wesley, Reading, 1981.
[0.24] Klein, M. J. "The development of Boltzmann's statistical ideas". ActaPhys. Aust. Suppl.
X, pp 53-106, 1973.
[0.25] Gelfand, I. M. and Yaglom, A. M. "Calculation of the amount of information about a
random function contained in another such function". Amer. Math. Soc. Transl. Series 2 ,
A.M.S., Providence, 1959, pp. 199-246.
[0.26] "Maximum Entropy and Bayesian Methods" 3 volumes edited by Grandy, W. T. and
Schick, L. H.; Ray-Smith, C. et al; Mohamed-Djafary, A. and Demoment, G. Published by
Kluwer Acad. Publishers respectively in 1991, 1992 and 1993 in Dordrecht, Holland.
Chapter 1
We will recall some basic concepts from measure theory and from probability theory. The
purpose of this chapter is to provide applied and other scientists with some standard vocabulary.
Most of the concepts and results are intuitive and obvious at times, even though the proper names
are not widely known.
A measurable space (E,$) consist of a set E and a a -algebra ^ of subsets of E. In ?are
the sets to which we will assign a measure (or, later on, a probability). It is a collection of subsets
of E closed with respect to:
i) taking complements: if As W then E-A=AC e %
ii) forming denumerable unions: if {A„s ^, n > 1} then uA„e $!
These sets operations when viewed abstractly, correspond with the logical operations not
and or This is what makes a-algebras a convenient realizations of events to which we want to
assign probabilities.
A measure m on a measurable space (E,?) is a function m: % ->[0,oo) satisfying
(1.1) v
m(uAn)=T.m(A„)
I ' i=i
where {A^: n> 1} is any countable collection in ? such that A, nAj=0. We add here that instead of
[0,oo) the range of values of m can be taken to be any space X on which an additive operation and
a notion of convergence are defined such that the right hand side of(1.1) exists. In such cases one
says that m is an X-valued measure.
Consider two measurable spaces (E,W) and (F,<^. We shall say that a function X: E-»F is
^/immeasurable if
X'(A)={x: X(x)eA}={XeA}e? for any A e s F
The a-algebra 5(9?) = B generated by the open (or the closed, or the compact) subsets of
9l=(-oo,oo) is called the Borel a-algebra, since real valued functions appear all the time, one
usually writes Xe ^instead of Xe W/B.
We shall say that f=g a.e(m) (almost everywhere with respect to m) whenever {f >g or
f<g}={x: f(x)>g(x) or f(x)<g(x)} is such that m({f>g or f<g})=0.
It is left for the reader to verify that the basic arithmetic properties, performed on Borel
measurable functions yield Borel measurable functions, that is linear combinations, products,
quotients (whenever defined), infima, suprema of measurable functions are measurable.
7
8 The Method ofMaximum Entropy
For this to make sense it is required that the right hand side converges. This is easily
achieved when all exceptfinitelymany of the A„ are empty.
The second step is to realize that any positive Xe f can be approximated by an increasing
sequence of simple functions, that is X=lim X„ where X,, is simple,
Now define
is finite and write XeL, (E,?,m) (or XeL, whenever E,% and m are understood from the
context). Also we say that XeL p whenever j(X)p dm is finite.
Let m and n be two measures on (E,?). We shall say that m is absolutely continuous with
respect to n, and write m « n , if whenever n(A)=0 for Ae % then m(A)=0. In this case there exists
pe% such that
and note that since X"1: #"—»<?' preserves set operations, n is a well defined measure on (F,?).
Also, using (1.6) one can prove (going stagewise from simple functions onwards) that for any
positive measurable Y: F—»5R
\Y(y)n(dy)= \Y(X{x))m(dx).
Let us proceed to rewrite some of the former in terms of probabilistic language. We shall
say that (fl.d^P) is a probability space if (il,d&) is a measurable space and P is a measure with
range [0,1], such that P(fi)=l.
The points w in Q. are called elementary events, they may or may not be such that {m} is in
d^"The usual interpretation of oo is that it represents an experiment (sequence of measurements in
continuous or discrete times) performed on some system. The elements of d^~are called events
and, questions about experiments are described by the set operations of union, intersection and
complementation.
A (real valued) random variable X is a (Borel) measurable function defined on (Cl,d^.
From now on we shall consider a given (Q.d^P). Let X be a (real valued) random
variable. The distribution function of X is defined to be the function F x : SR ->[0,1]
Fx(x) = P(X<x)
and when X : Cl -» 9?" is such that each component function X^ is measurable, we define
Fx(X) = P(X1<xl,...,Xn<xn).
From now on we shall follow the standard notational convention and write
where X : (£l,d&) -> (E a ,? a ) are (Ea -valued) random variables, I is some set of indices and Aa
sE^for all I. (Warning: the set described above may not be in <#when I is not countable.)
Assume that X is an integrable random variable. We introduce the symbols
10 The Method ofMaximum Entropy
EY^{Y)= lY((o)dP((o)
When m is a measure on 9?", we shall say that then 9?"-valued X has a density p with
respect to X„ if the measure induced on M" by F x is absolutely continuous respect to m and
dFx/dm=p. Then, for any bounded measurable G: 91" -> 9?
£G(X) = jG(x)p(x)rf/n.
(1.9) P(A1r^A1)=P(Ai)P(A2)
for any A(€<rf^, A^ed^ .
This notion extends trivially to any countable family of a-algebras. When X,e^/^„ X;S
d^/<?2 we say that they are independent if
W,)gC*2) = W i ) W»)
for any bounded measurable functions defined on E, ,E2. We leave for the reader to verify that if
X, and Xj are uncorrelated, i.e.
£TiJf2 = £^,£^2
Basics ConceptsfromProbability Theory 11
they may not be necessarily independent. Also, for independence, it suffices to verify that
The proof of the following lemma, asserting some basic properties of conditional
expectations is left for the reader.
Lemma 1.12. The notations are as above.
a)IfX>0,E[X|G]>0.
b) If X, is bounded, G-measurable, E[X,X|G]=X,E[X|G].
c) Let g be a bounded function on SR2 and X, e G, then
ElgiXX^G] = E[G{X,z)\G]^
where z is a constant.
d) E[E[X|G]]=EX.
e) For a,, a, in « , Efa^+ajXJG] = a^fXJGl+aJEfXjIG].
f) If {XJ is either a monotonic sequence of positive functions or a uniformly bounded,
pointwise convergence sequence, E[lim XJG]=lim E[XJG].
g) If the a-algebras a(X) and G are independent, then
E[X\G] = E[X].
E[£L¥lG2]lGi].
is an orthogonal projection onto L2(Q.,G,P). This result is important when dealing with Gaussian
processes and computing predictors optimal in quadratic distances.
Let us now recall some important factorization results.
Lemma 1.13. Let Y: (E,?)->(E',W ') be a W '-measurable and then X: E->91 is
measurable a(Y)/B if and only if there exists g: E'-> 9J, ^'/5-measurable such that X=g(Y).
Therefore, when G in definition (1.10) is o(Y), then there is hx: E'—> 9? such that
E[X\o(Y)] = hx(Y).
Ef/W\y\= \Ax)N(dx,Y)
Comment. Usually life is good with us and there is a measure m(dx) on E with respect to
N(.,y) which is absolutely continuous with a jointly measurable density n(x,y) then
E[f(X)\Y\ = \Ax)n(x\y)m(dx)
E[f{X)\Y=y] = lfix)n(x\y)m(dx)
Basics ConceptsfromProbability Theory 13
is usually employed. See [1.3] for nuances about constructing kernels and [1.1] or [1.4] for the
necessary measure theoretic results.
We shall need in chapter 4 a slight extension of the conditional expectation operator to the
case when (£2,<^jj.) is such that n is a cr-finite measure with a(£2)=+°o.
Let G c«P"be a sub-o-algebra. Let feL,(u,), and assume that (Q,G,n) is a-finite. Let us
state
Definition 1.16. We shall denote by EJf]G] the unique (up to appropriate sets of measure
zero) element of L,(n) such that
\gfd\x. = lgE»[f\G]d
Proof: (i) jfg du = j(g/h,) fdP, = J(g/h,)E,>, [f|G]dP, = jgE/., [f|G]du. Part (ii) follows
from (i) by taking g=(E/>, [f]G]-E/>2 [f]G]) and using the fact that Jg2d|j.=0 implies g=0 a. e. \x..
'Lei us consider some simple examples. To begin with note that if G={0,n} is the trivial
a-algebra, then
E[X\G] = E[X\
for any integrable or positive X. Assume now that G is the a-algebra generated by a partition
{A,.:k>l} of £2. That is, its elements are countable unions of sets of {\: k>l} and any
G-measurable function is of the type Z akhk for appropriate constants c^. In this case E[X|G]
must be something like
E[X\G] = Z.akIAk
and we have to determine the <\. For that, multiply both sides of the identity by I^for some j and
use (1.11) to obtain
/4 The Method ofMaximum Entropy
ak E[X;Aj]
and correspondingly
that is, the function on the right-hand side is constant on the sets {Y^eJ taking the value
E[X|Y=e,] as specified above.
The other very common case is the following. Let X and Y be respectively 5R" and SRm
valued random variables such that the distribution of (X Y) has density p(x,y) with respect to the
product Lebesgue measure dx dy on 9?"*m. From the factorization lemma above, we know that the
bounded random variables, measurable with respect to cr(Y) are of the form g(Y) for g:9?" -> 9J.
Therefore computing both sides of (1.11) we see that for bounded f: 9?" —»9?"
EWftgWl = lAx)g(y)p(x,y)dxdy
where we denoted by hx(y) the function introduced right after Lemma (1.13). Since both sides are
equal for any g(y) we conclude that
(1.18) p(X\Y)=($(,x,Y)dxylp(x,Y).
For the sake of future reference, we shall now present two variations on the theme of
Bayes formula, which are at the basics of both applications and interpretations of the maximum
entropy method.
In what follows, we shall denote the intersection C, oC 2 of two sets C, and C2 by C,C2.
Let {AJ be a countable partition of tl, i.e., a denumerable exhaustive collection of mutually
exclusive events. Then for any events B, C
P(B\Q= Z,P(At\BQP(B\C)
and also
Basics ConceptsfromProbability Theory 15
since
P(A,\BC) = P{B\AtC)^
which is known as Bayes identity. Substituting P(B|C) by the second summation displayed above
we have
In this identity P(AJC) is interpreted as the "a priori" probability of Aj given C, taken to
describe the knowledge we have about the event Aj given the preliminary information in event C.
The left-hand side tells us how much does our knowledge about A change when we collect the
information contained in event B. Theright-handside is the recipe for computing the change. The
left-hand side is called the " a posteriori" probability of \
REFERENCES
[1.1] Bauer, H. "Probability Theory and Elements of Measure Theory". Holt, Rinehart and
Wilson, Inc., New York, 1972.
[1.2] Gihman, I. I. and Skorohod A. V. "The Theory of Stochastic Processes F
Springer-Verlag, New York, 1974.
[1.3] Getoor, R. K. "On the Construction of Kernels" Led. Notes in Math. N°465, pp.
443-463, Springer-Verlag, Berlin, 1977
[1.4] Rudin, W. "Real and Complex Analysis". McGraw-Hill, New York, 1966.
Chapter 2
Statistical physics owes its birth to the inconvenience and impossibility of describing
systems having very large numbers of particles by specifying the behavior of each individual.
Loosely speaking, the aim of statistical physics is to describe the "collective" or ''macroscopic"
properties of a system of particles in terms of appropriate averages of its "microscopic" motions.
The words in quotation marks are the key ones. Macroscopic refers to the ''properties of
the system as a whole", the properties "visible by the naked eye". Microscopic means to the
description based on the exact evolutions laws (classical or quantum) describing the motions of
the particles.
Even though today's supercomputers can follow the individual motions of large numbers
of independent particles, there is yet nothing able to handle 1023 individuals.
To begin with, we will only consider systems to which equilibrium thermodynamics
applies, i.e., systems whose external, macroscopic changes are very slow compared to the
microscopic, internal motions and can be considered at every instant to be in equilibrium.
One of the cornerstones in physics, important for the outlook at the world that it provides,
is the second law of thermodynamics. According to this law, whenever an isolated system evolves
its entropy can only increase and, when an equilibrium state is reached its entropy attains the
highest possible value compatible with the values of the macroscopic extensive parameters of the
system.
I want to emphasize that this is not the standard formulation, see [2.1] or [2.2], for I have
made the characterization of the equilibrium state part of the statement of the second law.
Actually, when the entropy functional happens to be a Lyapunov functional for the evolution law
of the system, the characterization of equilibrium states as maxima of the entropy functional is
obvious. See section (6.5) for some more on these issues.
The connection between the macroscopic and microscopic descriptions can be traced
down to ideas of Maxwell, Boltzmann and Gibbs. The ingredients involved depend on the kind of
system under study: how do we describe its microscopies states and the changes of state, i.e., its
evolution on one hand and, on the other, on the method we choose to average over the
microscopical states. To be more specific, we may consider either classical or quantum
description of the particles making up a system and in the latter case, we may consider the
17
18 The Method ofMaximum Entropy
particles to be distinguishable or not. But regardless of all these subtleties, the basic philosophy is
always the same.
The presentation that follows is essentially due to Jaynes [2.3] see also [2.4].
Consider, to simplify as much as possible, a system whose states can be described by a
countable set E, which are unaccessible to observation. The microscopic observables describing
the properties of the system are described by real valued functions
F: £->5R.
The basic assumption is that the macroscopic values of these observables are obtained as
average values
<f>= \F(t)Pu
ieB
where the Pi are the probabilities offindingthe system in state i e E, or the fraction of time the
system spends at state i when it is in equilibrium. With all these ingredients we state the
The state of equilibrium is characterized by the assignment of probabilities {P,: isE} such
that
attains its maximum possible value among all distributions {Pt -. ie E} such that
(23) P,*=^exp41^,0),
Observe that S({P;}) defined by (2.1) is a concave function defined on the convex set
P={ {P,: ie E, ZPpl}} of all probability measures on E. Also, ZA(X) is only defined on a subset of
5RM specified by
Equilibrium Distributions in Statistical Mechanics 19
DA = {XeW": ZA(X)<oo}.
When E isfinite,DA= 5RM but when E is notfiniteDA is only a convex subset of SRM One
should always know how big the set DA is.
Anyway, when {P^} given by (2.3) is substituted in (2.1) we obtain
which is convex on DA(X). We shall verify these assertions below, in a more general setting. We
have set
M
(A,a) = Z X,a,
That is, the value X' at which H(X) reaches its minimum is that of the Lagrange multiplier
which insures that {P;*|ieE} given by (2.3) satisfy the constraints (2.2). Note that the number M
may be much smaller than the cardinality of E, and H(X) is convex. These are the two facts that lie
behind the appeal of the M.E.M..
We shall remark as well that the condition for X' to be a minimum, namely that
d^H (X' )/dX,dXj be a positive definite form has strong consequences in thermodynamics. See
[2.2] and [2.4]. In terms of covariances of the observables A,, it looks like this: the matrix
is (obviously) positive definite. Actually, in some cases the function H(X) may be very flat near X'
making the numerical search for X* hard (especially for large M).
A similar procedure can be followed to obtain the probability distribution in the (simplest)
quantum case. For that we assume we are studying a system of noninteracting particles, each of
which may be found in any of the states of a denumerable set. To distinguish between the two
classes of identical particles we introduce two possible sets of states for the system.
We shall assume that the bosons have as (denumerable) set of states the set
20 The Method ofMaximum Entropy
In each case we shall assume there are two observables whose average values are
accessible to us, namely
E:H^>% £(v) = E E(i)y(i)
teS
(2.5)
N.H^m, JV(\|/) =E \|/(0
where E: S-» SFt is to be interpreted as a microscopic energy of the individual quantum states.
Now (2.1) looks like
Z(X)= £ exp(-^(v|/)-^(H/))
Z(X) = E exp-i E ^ O M O + S ^ v K O
= I, r i e x p - i p ^ O + ^ M O
K
ye.HbieH
=E £ « P ~ 1(^(0+Xa)
= n ( l - C e x p -(40)/*?))- 1
where in the last line we set C =exp-(A_/k) and A., = 1/T to conform with standard notation in the
physical literature.
Equilibrium Distributions in Statistical Mechanics 21
To compute the partition function for the fermionic case, we proceed in the same way,
except that now instead of summing over all n as in step 3, we sum only over n=0 and n=l
obtaining
exp--iffl/kT)).
Z(X) = n (1 + qc, cxp-£(0/M)).
Z(X).
ieff
I'e//
The point of these exercises was to show how the specification of the set of states and the
choice of the observables determine the final result.
To complete this mock approach to equilibrium statistical mechanics we shall verify that
the functional (2.1) is a Lyapunov functional. We shall assume that the probability distributions
{Pj(t): ieE} evolve in time according to
(2.7) j.iptP(i)=
W = -LPjPj,-P,P,j
XPJPJ,-P,PiJ
i
starting from some initial distribution P,(0). The P,j are given in advance and we assume them to
satisfy the microscopic reversibility condition P^P,,. If we compute the entropy of {Pt(t)}
according to (2.1) then its rate of change satisfies
f =
= -/tz -kXiKW-kZJf,
=. - * I -kI.P IJ{P
-kI.P 1-P,)\nP
IJ{P 1-P,)\nPl
V
V
=
-n
= --1^P.jiPj-P,)
T.PtffJPd \nP,
InP, +
+ ff ZP,j(P,
ZP(,(P, -- Pj) \aPj
InP;
=
-ffvt -^P-^Pj(Pj-P
j(Pj-P)la(P
i i )la(P/P
ii /P )
ii j
where we used the symmetry of Pj and the fact that SP,(t)=l to go from the first line to the
second, a simple symmetrization to go from the second to the third line. It is clear that the last line
is always positive. (Verify that (l-s)lns is always negative for s > 0.)
The interesting fact here is that in equilibrium, a distribution Pe such that the left-hand side
of (2.7) vanishes, provides a local maximum for S({P,}). In his book [2.5] Gibbs somehow
postulates a distribution like (2.3) as an equilibrium distribution for the evolution provided by the
Hamilton (Newton) equations of motion in phase space.
It was Boltzmann who used (2.1) as a Lyapunov function for his equation describing the
time evolution of particle a density function. See the reprint [2.6], And, as we said in the
22 The Method ofMaximum Entropy
REFERENCES
[2.1] Atkins, P. W. "The Second Law" W. H. Freeman and Co., New York, 1984.
[2.2] Callen, H. "Thermodynamics and Introduction to Thermostatistics" John Wiley, New
York, 1985.
[2.3] Jaynes, E. T. "Information theory and statistical mechanics" Phys. Rev. 106. pp. 620 -
630, 1957. See [0.1] for more along related lines.
[2.4] Tribus, M. "Micro and macro-thermodynamics" Am. Scientist 54, No.2, pp. 201-211,
1966.
[2.5] Gibbs J. W. "Elementary Principles in Statistical Mechanics". Dover Books, New York,
1960. Reprint of the 1902 U. of Yale Edition.
[2.6] Boltzmann, L. "Lectures on Gas Theory". California Univ. Press, Berkeley, 1964.
[2.7] Wehrl, A. "General properties of entropy" Rev. Mod. Phys. 50, No.2, pp. 221-260,
1978.
[2.8] Lindblad, G. "Non-Equilibrium Entropy and Irreversibility" D. Reidel Publishing Co.,
Dordrecht, 1983.
[2.9] Gzyl, H. "A unified presentation of equilibrium distributions in classical and quantum
mechanics". Ann Inst. Henri Poincare. 32, 1980.
[2.10] Jaynes, E. T. "The minimum entropy production principle" Ann. Rev. Phys. Chem. 3_1,
pp. 579-601, 1980 (Reprinted in [0.1]).
[2.11] Garcia-Colin, L. S. "Entropy and irreversibility macroscopics issues". Rev. Mex. Fisica.
Supl. 1, pp. 198-201, 1992.
Chapter 3
SOME HEURISTICS
Here we follow Jaynes [3.1] and Papoulis [3.2] in developing some heuristics that sheds
light on the concept of entropy and on the method of maximum entropy.
Let X be a discrete random variable taking n values x, ,Xj .....x,, and let p, be the
probabilities of observing the events A,= {X = x j . We shall write
N, being the number of times the value x, appears in a run of N independent observations
ofX.
A different way of looking at (3.2) is the following. Each of the possible results of
observing X N consecutive times is an element of EN=Ex...xE, E={x„...,x0}. If teE N then
where X,,...,XN are independent copies of X and N, denotes the number of times the value x( is
repeated in the listt,,...,!^.
Now, if N is large enough so that Pj~N,/N then
P(t) =p"i"../„>" = expAr S pi \npi = exp -NH(X).
If we say that the configuration t is typical whenever N,~NP,, it follows that the number
of typical configurations W(typ) given by
23
24 The Method ofMaximum Entropy
H(X)^^\nW(typ).
Below, and in the next chapter, we shall discuss these relations in some detail. For the time
being let us compare the number of typical configurations for two distributions corresponding to a
random variable having six possible outcomes, that is a die.
The first distribution {p,,...,pn} is the distribution that maximizes
6
H(pu...,p6) = -T. pi \np,
subject to the constraint
Z iP, = 4.5
i S ' = 3.5.
and with the aid of a computer one obtains the X such that Sip; = 4.5. It is X= -0.37105 from
which it follows that
also satisfies Sip =4.5 for p=0.7. But this distribution has entropy
(3.7) #=1.4136.
Some Heuristics 25
The quotient of the two numbers of typical configuration corresponding to (3.5) and (3.7)
is given by
eNHm
= eM
e
A H ) r e ee2 0 0 >>1 0 8l u4
^m ~
IVW~ ~gNH ~
^ = 38220.
That is, the number of microscopical configurations (i.e., chains of 50 throws) for which
thefrequenciesN/N correspond to the maximum entropy distribution (3.5) is 38220 times more
frequent than the microscopic configurations corresponding to the distribution (3.6).
This is a statement about how different assignments of probabilities are reflected in the
outcome of an experiment.
In a statistical physics, in systems with 1020 particles the use of asymptotic methods is
quite justified. This is the reason why, in almost the first line of any book in statistical physics, the
following typical phrase appears: the number of microscopical states accessible to the system,
compatible with its macroscopic constraints is '
W=expSlk
which is equivalent to assert how disordered a system is, or equivalently, the higher the entropy,
the higher the more states it can occupy. By the way, according to the third law of
thermodynamics, which asserts that at 0°K the entropy vanishes i.e. S=0. Therefore at 0°K there is
only 1 configuration available to the system.
To read about Boltzmann's ideas about these subjects check with [0.24] and with Jayne's
essay in [0.7].
Later on we shall see how does the entropy concept appear with the theory of large
deviations. The issue is to find the bridge between the two aspects of probability assignments.
To finish, I cannot but help directing the reader to chase back from [3.3] where an
application of the maximum entropy method to find "dishonest" dice is explained.
REFERENCES
[3.1] Jaynes, E. T. "On the rationale of maximum entropy methods". Proc. IEEE 79, No.2, pp.
939-952, 1982.
[3.2] Papoulis, A. "Maximum entropy and spectral estimation". IEEE Trans. Acoust. Speech
and Signal Processes. ASSP- 29, No.6, pp. 1176-1186, 1981.
[3.3] Fougere, P. F. "Maximum entropy calculations on a discrete probability space". See [0.9].
Chapter 4
ENTROPY FUNCTIONALS
1. Basics.
We shall take a look at some properties of several functionals defined on the set of all
positive, a-finite measures and on the set of all probability measures on a measurable space
The definitions and results are variations on the themes developed in [4.1]-[4.2]. We direct
the reader to [0.2] where besides the results, many references to original work and applications to
statistics are developed. We shall also describe briefly some of the results from the review on
entropy inequalities compiled by Dembo, Cover and Thomas, [4.3].
For any measurable space (Cl,d?j, M(fl) and P(fl) will denote the sets
M(f2)={positive a-finite measures on (fl,F)}
P(fl)={probability measures on (Q.,d&)}.
Definitional. Let y eM(fi) and PeP(fi). We define
5 Y (P)=-!flnfrfy
W = -\p<y) in/>(v)rfn(y)
27
28 The Method ofMaximum Entropy
The following two cases are the most frequent. When M is a countable set, ^ m ^ - l (i.e. u
is the counting measure) and P(X = m,) = p,. Then
SI1(X) = -Z/>0)lnp(/).
The other case being M = 9T and u(dy) = dy being the Lebesgue measure. In this case we
shall just write
S(X)= -\p(y)\np(y)dy.
Definition 4.3. Let y,u,v eM(fl) be such that u « y , v « y and u, v are finite. Define
*0(u,v)=j£ln(^>a + v(n)-u(n)
/o(P,0=jflnfe)rfa
again when ln((dP/da)/(dQ/da)) is in L,(dP) and +00 otherwise.
The proof of the following obvious lemma is left for the reader.
Lemma. When u. « v (and P « Q) K(u,,v) (and I(P,Q)) are independent of a and
denoted by K(|x,v) (and I(P,Q) resp).
The functionals K and/or I have many names: Kulback Leibler information number,
information for discrimination, information distance, information gain or entropy gain of u (or P)
with respect to v (or Q). The functionals SV(P) or SV(X) are called u-entropy of P (or X).
Lemma 4.5. With the notation introduced above we have
i) SV(P) is concave in P.
ii) K(u,v) is convex in u. When u(il) = v(fi), K(u,v)>0, the identity holds true when
dyJdo - dv/do.
Proof:
i) The function -x lnx being concave on [0,oo) yields Sv(aP,+bP2)>bSv(P1)+bSv(P2).
ii) The convexity of K(u,v) can be obtained similarly. When |i(fi)=v(ft) and setting
c={dP/da>0} we have
Entropy Functionals 29
-JCQioO- J£h^*Sta{f(^)ifaSlnj42sO.
When K(u,v)=0
0=lnl£fe>^ln 1 S* <0
and the result follows from the strict concavity of lnx and the following lemma. (Nevertheless, see
the simpler proof in Theorem 3.1 of [0.1].)
Lemma 4.6. Let g be a positive function defined on (Ild^PY Then \n\gdP>\\ngdP
with the identity holds and only if g is constant a.s-P.
Proof: The inequality is the obvious concavity of lnx. Recall that lnx is strictly concave,
i.e., lnfSa^HCa; lnxj whenSa=l, if and only if x,=x2=...=x11.
For any a such that P{g<a}>0 and P(g>a)>0 we have, when the identity lnjg dP=Jlng dP
holds, then
r
ln\P(g>a) | gdP/P{gZa) + P{g<a) J gdP/P{g<a)
The assumption implies that the middle term equals the first term and therefore, the strict
concavity of the logarithm function implies that
J gdPIP{g>a}= J gdP/P(g<a)
5¥(P)-Sp(P) = 4 l n S ] = 4 l n ^ J .
where X =au+bv and 0<a,b, a+b=l. When u,v are finite
We leave these as exercises for the reader. We only add that when v = Q is a probability
measure
SV(P)= -X»(P,v).
The next lemma contains the basic behavior relative to changes of variables.
Lemma 4.9. a) Let O: (Q,,d&}->(Q,\d&~) be a measurable mapping. Let a, n, veM(fl), P,
QeP{a) and c', u', v'eM(n') and P'.Q'ePCn) be related by o' = o(*'' ), etc... Then
SW(P') * S»(P); KjQif, vO < K0(ix, v),/ 0 ,(/", Q<) < Ia(P, Q).
Comment. These results are a variation on the theme of sertinn 4 chapter ? r>f [0 9]
fiwg£ •' It is easy to verify that when P is restricted to <D~'(fi^J (see lemma (1.13))
dP/du=(dP7dn>4>. Thus
- I f . n f e ) ^ u + J f l n ^ = ^ l „ ^ _ , 0
Similarly
which is positive
Next we shall introduce some variations on the theme of conditional entropies.
Definition 4.10. In the same setting as in (4.1) assume further that v restricted to the sub
a-algebra G is a-finite. Then we shall define
similarly.
Definition 4.11. Let u.v.o be in M(il) with u,v finite and absolutely continuous with
respect to a. Let G be a sub-a-algebra of #"such that s is a-finite on G. Set
^lv)=lMg-(fi)-,(n).
When u,=P, v=Q are probability measures we obtain
f j p I (dPlda)/E<,[(dPld<s)\G]
K'°{P\Q)-- ' {dQlda)IEc[{(lPld<5)\a\
%(P\Q)=l$(P\Q) + i%°(P\Q)
Also r^G(P|Q) > 0, the identity sign occurs whenever dP/dQ=EQ [dP/dQ|G].
Comment. Similarly, when G, cG, are such that s restricted to G isfinite,then
Sv[X\Y](y) = -jP(x\y)\nP(x\y)\i(dc)
Lemma 4.16. Also Sv (X|Y) < Sv (X). The identity holds when X and Y are independent.
Proof:
= \P(x,y) I n j ^ j V , m v 2 { d y ) > 0
where the last step follows from lemma (4.5). Here P,(x)=fP(x,y)v2(dy). Again, applying lemma
(4.5) to the last term we conclude that the identity Sv (X/Y)=S(X) holds whenever
P(x,y)=P,(x)P2(y), i.e., when X and Y are independent.
A repeated application of this lemma yields.
Lemma 4.17. Let {X, |l<i<n} be random variables with values in (£2, ,M, ,v,) respectively,
etc. Then
for t s 5R* when the integral exists. Whenever (u.O) remain fixed, we will not mention them
explicitly. We shall introduce the notation
The O - Hellinger arc of m is the family of measures, absolutely continuous with respect
to du,, defined by
where
Lemma 4.25. When D(n,<t) has a nonempty interior D°((i,4>), the mapping
D°(n, <!>)-» W given by/-»J<J>rfu< is defined and \®\d\x, = 3 In Z(t)/dti. When the co variance
matrix exists and is of full rank, the mapping is 1:1.
Proof: Let h be any fixed vector in 91* and c > 0 such that t + sh e Z)°(u,0) for
/ s D° (n, <t>) for 0 < |s| < c. Since Z(t+sh) is finite
■sj(/i,0)4i, = {lne*A-'1Vn,<lnJe*<I,>4i,<oo
because of the concavity of lnx. Since the sign of s is arbitrary we conclude that f (h,Q)d\it <<*>
for any h.
It is not hard to see that Z(t+sh) is twice differentiable at s=0 and
£ja Z(t+sh) I = fa, O)2 dp, - (f (h, 4>) ./u,) 2 = (h, Ch),
where C is the covariance matrix of <1>, which happens to be the Jacobian of the map t -> \<bd\x.t.
Even though D° (jo., $ ) is convex, the range of the map described above is not necessarily
convex. For an example take a look at the third page of [4.4] where some properties of
exponential families are investigated. For some other metric properties associated to the Kullback
I-divergence see [4.5] and [4.6]. Below we shall be quoting extensively from the last one.
2. Entropy inequalities.
Let us now describe, somewhat scantily, a few results from [4.3], We commented at the
end of chapter 3 that setting
(which will be called the entropy power of P relative to u) gives us an intuitive way of
understanding how big is the support of P. For example, when Q is a finite set fi={l,2,...,n} and
36 The Method of Maximum Entropy
H is the counting measure and P{i}=l/n then S=lgn and, when a=b=l then N„(P)=n, the number
of states that can be occupied.
When (fJ,(#J=(<R",B), u(dx) is the Lebesgue measure and P has Gaussian density with
covariance Kij=EpXiXj and zero mean, setting a=2/n, b=l/27te, we obtain N(l(P)=|K|"°
When Q is a product space f2, ®fi2 and P is a probability on Q, absolutely continuous
with respect to a product v = Vi ® v2, then Lemma (4.17) asserts that
5 v (P)<S5v/(P,)
therefore, from (4.26) we have
with the identity holds whenever P is actually a product of its projections P, and P2.
Notice that (4.27) does not depend on the nature of the measures involved. In the papers
by Shanon and by Stam quoted in [4.3] the following is proved: let X,Y be two independent 91"
-valued variables having density with respect to Lebesgue measure, such that S(X) and S(Y) exist.
Then, the Shanon's entropy power inequality asserts that
But, notice that when X,Y takefinitelymany values the opposite inequality seems to hold.
For example, let X,Y be independent such that P(X = ±1) = P(Y = ±1) = 1/2. In this case S(X) =
S(Y) = ln2 and S(X+Y) = 3/2 ln2. Using (4.26) with a = b = 1 we obtain
One way to understand this is to consider finite sets £1, and £2j on which probabilities are
defined and a map 0: fi -> Q' such that F is into and |Q'| = |Q, |+|Q 21-1 = number of diagonals
(or antidiagonals). Thus the conjecture here is that if we set P = (P, ® Pj)°0' then
N(.P)<N(Pi)+N{P2) as suggested by (4.29).
When the density of the distribution of the 91" valued random variable is continuously
differentiable and together with its first partial derivatives decays sufficiently rapidly at infinity,
then between the Fisher information of X defined by
(4-30) J{X)=\(%)dx
Theorem 4.31. ( D e Bniijn's Identity) Let X he. defined as shnve and 7 he a TVT(0 1) (R»
valued random variable. Then
(4.32)
(4.32) ±S(X+J£2)=\J{X)
■ysz) == f4*)
2W«
and furthermore, the isopermetric inequality for entropies states that
(4.33) \J(X)N(X)>\
(4-34) J(^=jfe)fefe
Let vj/(x), <I>(y) be conjugate elements in L2 (9t"), i.e. *(y) is the Fourier transform of
V|/(x). Define X, Y to be random variables with densities
D (x) _ l<**>l2
(where 0 for matrices means positive definite). The Cramer-Rao inequality asserts that
(4.36)
m-■*2 >o
JOO -Kr>00
JGO -Ky>
and the combination of these two yields the analogue of the Heisenberg - Weyl uncertainty
relations in quantum mechanics in four possible equivalent statements
167t 2 ^j--A^?>0
16K2Kx -K?>0
38 The Method ofMaximum Entropy
(4.37)
\6%2KfKyKf -I7t0
IttfKfKxKf -I>0.
eNHm
= eMAH)ree200>1084
IV ~ gNH
There are several proofs of this inequality. See [5.5] for references. Here we present
KuUback's version. It appears as a set of exercises at the end of chapter 3 of [0.2] in which
references to the original papers can be found. Let us set f, = dP/du, and fj = dQ/du.. They are
positive, integrable functions (with respect to du.). An easy application of Cauchy-Schwartz
inequality yields
JOV2)" 2 rfu^l
the identity holds only when f, = £,. Rewriting the integrand above as f, (fj /f, )1/2 and using the
concavity of the logarithm functions we obtain
-21n|(/i/2) 1 / 2 rfu</i:u(P,0
and since for x< 1 logx<x-l we obtain, using the normalization \f\ d\i = \fid\x=l , that
K,(P,Q)>2{x -{0V2)1/2rfu) =f ( M 2 ) V
substitute f, = u, ^ = v, take square roots and use Cauchy-Schwartz inequality to obtain the
inequality.
This was just a sample of an interesting class of inequalities. I hope to have wetted your
appetite enough.
Certainly the entropy functionals we introduced in section 1 do not seem to be the obvious
convex (or concave) functionals, having metric-like properties, to be defined on P(f2) or M(Q).
This has prompted many people, to postulate "natural" assumptions on the functionals to be
studied, which would lead to functional equations to be satisfied by the desired functionals. Then
they would prove that either the entropy functional or the Kullback-like directed divergence were
the unique functionals having these properties.
But then again, one is always left wondering why the chosen postulates are natural.
Anyway, a line of research traceable to Shannon's work of 1948 was summarized in the
book by Aczel and Daroczy [4.7]. The results there concern entropy functionals on probability
spaces having finitely many atoms. eNHm
= eMAH)ree200>1084
To obtain S ^ ) or yP,Q) IV ~from
gNH axioms on functionals defined on {Pe P (9?"): P«fi} or
2
{Pe P(9J"): P « n } see the work of Forte and Sastri [4.8]-[4.9] and that of Johnson and/or
Shore.
Assume that a concave functional
F : { F e P ( 9 ? " ) : F « p . } - > 91
i) Is in subadditive with respect to projections, i.e. when P, and P2 are restrictions to 91"' and to
91 "2 (identifying 9J" = 91 "■ ® 9t"») then
eNHm
= eMAH)ree200>1084
IV ~ gNH
F(F(r'))=F(F)
If finiteness is insisted upon, then c = 0 whenever n(5R") = °o. The constant b can be
different from zero in applications where n can vary. For example, in statistical physics the
entropy is an extensive quantity depending on the number of particles.
In [4.10] the postulates of uniqueness, invariance, system independence and subset
independence are manipulated to obtain not only the functional forms of S ^ ) or IM(P,Q) but
Jaynes principle of maximum entropy and KuUback's principle of minimum cross-entropy, as the
uniquely correct methods for inductive inference when information is given in the form of
expected values.
REFERENCES
[4.1] Kullback, S., Keegel, J. and Kullback, J. "Topics in Statistical Information Theory". Led.
Notes in Stat. No.42. Springer - Verlag, Berlin, 1987.
[4.2] Dacunha, D. and Gamboa, F. "Maximum d"entropie et probleme des moments" Ann.
Inst. Poincare. _26, No.9, pp. 576-596, 1990.
[4.3] Dembo, A., Cover, T. and Thomas, J. "Information theoretic inequalities". IEEE
Transac-Info. Theory. 37, No.6, pp. 1501-1518, 1991.
[4.4] Efrom, B. "The geometry of exponential families" Ann. of Statistics. 6, No.2, pp.
362-376, 1978.
[4.5] Rodriguez, C. C. "The metrics induced by the Kullback number". In [0.9].
[4.6] Cizar, I. "I-Divergence geometry of probability distributions and minimization
problems". Ann. of Probability. 3, No.l, pp. 146-158, 1975.
[4.7] Aczel, J. and Daroczy, Z. "On measures of information and their characterization".
Acad. Press, New York, 1975.
[4.8] Forte, B. and Sastri L. "Is there something missing in the Boltzmann entropy". Jour -
Math. Phys. 16, No.7, pp. 1453-1456, 1975.
[4.9] Forte, B. and Sastri L. "Representation of the entropy functional for a grand canonical
ensemble in classical statistical mechanics" J. Math. Phys. 18, No.7, pp. 1299-1302,
1975.
[4.10] Johnson, R. W. "Axiomatic characterization of the directed divergences and their linear
combinations". IEEE Trans. Info. Theory. IT-25, No.6, pp. 709-716, 1979.
[4.11] Shore, J. E. and Johnson, R. W. "Axiomatic derivation of the principle of maximum
entropy...". IEEE Trans. Info. Theory. IT-26, No.l, pp. 26-37, 1980.
[4.12] Borwein, J. M. and Lewis, A. S. "Convergence of Best entropy estimates" SLAM Jour.
Optimizat. I, No.2, pp. 191-205, 1991.
Chapter 5
In this chapter we carry out the program described in the introduction for the
reconstruction problems at levels 1 and 2 of the M.E.M.. For the sake of a presentation that more
or less follows the chronological development of the results, we will be somewhat repetitive.
To begin at the beginning, one of our basic reconstruction problems was to find the
measure P « u, realizing
where $ : fl-»SRk is a given measurable, finite valued, function and cs 9?k, Q«u, is a fixed
measure. Our first result dates back to the fifties. The following theorems or variations on the
themes of theorems 2.1 and 2.2 of chapter 4 of [0.2].
Theorem 5.1. With the notations, and assumptions introduced above, assume that
P |1 (c,*)={PeP(n)|P«n, Ep(4>)=c} is not empty, that ceint(D(c,4>))=int{te 9tk|ZQ(„(t)<oo}.
Then, i)
^ = cxp -(t,©)f/Z^(f)
c = VtlnZe,<p(r)lr
41
42 The Method of Maximum Entropy
where we dropped the variable co and we shall be dropping as many super and subscripts as
possible. L(s) has, for each given a, a maximum at s0()=exp-{(t,O)+t0+l} at which L(s0)=-s0.
Since L"(s)=l/s, there is S,(<D), lying between s0(co) and s such that
/ 2 2
L(s) = L(s
L(s)--= L(s00)) ++(s(s-So)L
- s0)L'(so) (s--s0)s0/2si
(s0) ++(s- ) /2si
2 2
=-s-s 0+(s-so)Jo)
0+(s- /2s/2^i
l ..
Actually, this identity defines %. Set dP/du=f„ dQ/du=fj and substitute f/f, for s in the
identity above. After integrating with respect to dQ=£,dp., we obtain
where we used the fact that J#fj<lu=c, In the lemmas below we prove that
#(t) ==(t,c)
H(t)-- (t,c) + lnZ(t)
is negative, analytic and, whenever the hypothesis of the theorem are met, (ii) is fulfilled.
Before doing the lemmas, we extend Theorem 5.1 as
Theorem 5.4. With the same notations as above, but now we consider P « u and Q « u
with respect to which <S is integrable and
£/.(<£>,
E P(<&, t )) :^
S(t,0)
(t,0)
Lemma 5.5. For teint D(Q,<t>), the complex valued extension of Z(t) obtained by
replacing t by t+iu=Z is analytic in Z and the following hold
the first identity drops out. To obtain (b) start from (a) and invoke differentiability through
analyticity.
Lemma 5.6. When the inequality in (5.5)(b) holds, the function 9(t)= -V,lnZ(t) defined on
intD(Q,<I>) is one-to-one.
Comment. The range of the mapping thus defined may not be convex. See example at the
end of section 2 of [4.4]. It is nevertheless an open set.
Proof: For the reader. It is based on (5.5b).
We shall denote by t(9) the value oft e intD(Q,S>) for which 0= -VlnZ(t). It is also easy
to see that.
Lemma 5.7. £el(T-9)2=Hess(lnZ(t))=J(9). Also J(0)J(t)=J, where J(0) and J(t) denote the
Jacobian matrices of the mappings t->9(t) and 9->t(0).
Lemma 5.8. Assuming that the inequality in (5.5-b) holds, let t(9) be the inverse function
to that defined in Lemma 5.6. Then H(9) = H(t(9)) is negative and has a strict maximum at 9(0)=
JcfcdQ.
Proof: Note that
Ktl(Q?,Q) = -H(t)>0.
Then so is -H(0). Using H(t)= -(9,t)-lnZ(t) with 9= -V,lnZ(t). Solving for t(0), differentiating,
using (5.6) and the chain rule one obtains V0H(9)=t(9). Using Lemma 5.7 and the assumptions, it
follows that the quadratic form associated to Hess H(9) is strictly positive. Therefore H(9(0))=0
is the maximum value of H(9).
To completely finish the proof of theorem 5.1, notice that when the datum c is in the
image of int(D(Q,0)) by - V,lnZ(t), then
44 The Method ofMaximum Entropy
where the meaning of the symbols is as above. Also, now Z(t) stands for Z^ 0 (t). The solution to
(P2) is contained in
Theorem 5.9. For a given c in the image of intD(u.,«I>) by -V,lnZ(t), let t* in intD(n,<f>) be
the point at which H(t)=(c,t)+lnZ(t) reaches its maximum value. Then (P2) is solved by
d\x\- = exp-(f, <)>)rfu/Z(t).
Proof: Flip a few pages back, and check (4.23). It asserts that if t e intD(|j.,<l>) and P « u ,
Ep (<t>)=c, then
KviP,v.t)=mfy-sv.{p),
2. Czizar's results.
Here we present a summary of [5.1], with some obvious changes. In section II of chapter
4 we obtained the lower bound
(5io) §f|g-fU*(W0)i
The Method of Maximum Entropy 45
K/J(P/Q)=inf{K)jiP',Q) /P'z E)
eNHm
(5.12) eMAH)ree200>1084
IV ~ gNHKv.<P,R)-Kli(P,Q)=EPln^
=
E»{in(M)) =K,(Q,R).
Proof: Let pa=a(dP/du.)+(l-a)(dQ/du.) denote the density of P with respect to u,. From
the convexity of plnp it follows that
/a = s ( p » l n P a - ^ l n ^ J
decreases to
eNHm
= eMAH)ree200>1084
IV ~ gNH
from which it follows that if EI,(ln(dQ/dn)/(dR/dn))<Kll(Q,R) then there is 0<a<l such that
The converse is easier. The hypothesis implies that K(l(P,R)>K(l(Q,R) by (5.12), and
therefore K ^ J R ^ K ^ Q . R ) because of the convexity of KM(P,R) in P.
The second half is left for the reader. It is also left for the reader to verify that the lemma
and (5.12) imply the next proposition.
Proposition 5.14. A probability P is the K^-projection of Q on the convex set of
probabilities £ if and only if every P'e £oSM(Q,oo) satisfies
(515) V.fija^P'^+V.g).
If the K -projection P is an algebraic inner point of £ (i.e. if for every P's £, there exists P"
e £ with P=aP'+(l-a)P", 0<a<l) then fcS^Q.oo) and Ep.(ln(dP/dn)/(dQ/du))=K^(P,Q) and (5.15)
holds with the equal sign.
Before we consider the first of the results we want, observe that if £ is any set of
measures, and if there exists Pe£ with u,-density cexpg(x)(dQ/du) where JgdP, =JgdP2forany
P„ P2 in £, then K/P|Q) = inf{K(l(P',Q)|P'e£}. More exactly, in this case
The particular case we are interested in can be rephrased as: Let £={PeP(£2)| P « u ,
Ep(0)=c}, where <T>: fl-> 5Rk is given measurable mapping and ce SRk, then if a Pe £ exist and is
of the form dP/du=c exp-(t,<D)(dQ/du), then it is the K^-projection of Q and (5.16) holds. We are
now ready for
Theorem 5.17. Let {fjae A} be an arbitrary set of real valued, measurable functions
defined on Cl and {cJaeA} a set of real constants. Let £={PeP(fi)|Ep(f1)=ca, aeA}. Then, if a
probability Q « ji has K^-projection P on £, its u,-density is of the form
where N has P'(N)=0 for all P'sEr.S^Q.oo) and g() belongs to the closed subspace of L,(Q)
spanned by the f,'s. Conversely, if a Pe£ has Q-density of the form (5.18) with g belonging to the
linear space spanned by the f,'s, then P is the K^-projection of Q on £ and (5.16) holds.
Proof: It follows from Proposition 5.14 that P is the K-projection of Q on £, then for N=
{dP/du=0} it is necessary to have P'(N) = 0 for any P' e £nS (Q,°o).
Let £t£, the class of P'e£ with dP'/dP<2. If P'e£', there is P"e£' with dP7du=2-dP'/dn
and P=(P'+P")/2. (Given P's £, define P"=2P-P' and verify it is in £■.) Thus P is an algebraic inner
point of ^.Applying Proposition 5.14 to £■ instead of £ we obtain Ej,[ln(dP/du)/(dQ/du)] =
K,(P,Q)or
eNHm
= eMAH)ree200>1084
IV ~ gNH
for all such h, and therefore for all heL(dP) satisfying (5.20).
Therefore, ln((dP/du)/(dQ/du)) belongs to the (closed) subspace of L,(P) spanned by 1
and the fa's. For, were this not the case, the Hann-Banach theorem ([1.4]) would imply the
existence of a bounded linear functional on L,(P) vanishing on the said subspace but not at
ln((dP/du)/(dQ/dn)). Since the dual of L,(P) is L„(P), this is a contradiction.
To prove the second part, suppose that (dP/du.) is of said form. Since g is a finite linear
combination of f,'s, Jg dP is constant on £ and
But for P'e£ both Kpi(P,Q)<oo (by hypothesis) and KJ¥\P)<CD. Therefore
KI1(P',Q)=KJP'F)+KJ?,Q) as desired.
So far we know that if a solution to (PI) exists it has the desired form. Let us see how
Czizar settles the existence question.
48 The Method of Maximum Entropy
Theorem 5.21. Let P^/cO) be as in statement of Theorem 5.1 and A=fcs SR'IPeP^c,*),
K(l(P,Q)<oo}. Assume that D(Q,<X>) is open. Then, the KM-projection of Q on PM(c,<D) exists for
every c in the interior of A.
Comment. The obvious question is: what is the relationship between the interior of A and
the image of D(Q,<J>) by -V,lnZQ4(t)?
During the proof we shall need the following lemma, the proof of which is carried out in
[5.1].
Lemma 5.22. For anv measurable function O such that e(t-<>) is Q-integrable for small |t|,
K|1(P„,P)^-0 implies J<X>dP„ -> J<D dP.
Proof of Theorem 5.21. Since K^fP) is convex in P, the set A is convex, and
Let us verify that teD(c,0). Let Pn eP^(c,$) be such that KM(Pn,Q)->F(a). Then by
(5.10), P„ converges in variation to some P. Set <J>Jn) = */ if -t, <I>l <¥^ and <E>, = 0 elsewhere.
Here K„t oo. Let Pn be probability distribution with
dP„
(5.24) 4>
=JV«fe=A-.«p-(t,*«)fi
Since, 0sD(Q,O), the components of O are Q-integrable, (see Lemma (5.5a)) and Pn is
integrable as well. Therefore, for n large enough, JO(n) dPn is arbitrary close to J«X>dPn = b° say.
Choosing the K,, property, get the JO'-'dP',, close to J<J>dP'n=a. Compare (5.25), (5.12) to
(5.23) with b" in the role of b and obtain K ^ . Q J - ^ O . Thus, on account of (5.10), the Pn with
densities (5.24) converge to P with density N exp-(t,<J>)(dQ/du) with respect to u. And also,
teD(Q,<I>). Setting c=Ep*, similarly to (5.25) we have
^(P,0=^(ln^)-(t,c-a)
The Method of Maximum Entropy 49
from which, using (5.12) and (5.23) we obtain ^ ( P ^ P ) - ^ . Since we are assuming that D(c,*) is
open, lemma stated above implies that J G> dP=lim JdP'n=c. This completes the proof.
The problems we have been dealing with are actually particular cases of linear
programming problems which can roughly be described as follows:
where *P is an appropriate convex function defined on 9?, K is a convex set of functions p defined
on a measure space (Cl,3,\i), <X>: Cl -> SRk is a measurable mapping and ce 9?k.
Even though I am hardly describing the results of this interesting line of work, the reader
should at least take a look at [5.2] and at some of the references there, in particular to the
pioneering work by Rockafellar in [5.3]-[5.4].
As an appetizer, I will only mention a few of the examples described by them.
Let Cl, be [0,1] and 9 its Borel sets. Let ^ u ) ^ " (p>l) or T(u)=l/u for u>0 or 0<u<l.
Or, let xP(u)=u lnu or 4'(u)=-lnu for u>0. Supply with appropriate linear constraints to obtain a
problem as above.
Anyway, their treatment relies heavily on convex analysis, particularly on duality theory.
The general idea is always to go from the original problem (on an infinitely dimensional space) to
a dual (finitely dimensional) problem, and to verify that there are enough conditions under which
both lead to the same solution. Heavy stuff, but quite general, and useful!
Even though the original idea behind the present approach seems to date back to Rietsch's
paper in 1977, see [0.12], the development of the method in its full french generality is due to
Dacunha-Castelle and Gamboa, see [4.2] and, for further extensions and generalizations by
Gamboa and Gassiat see references [5.5]-[5.7]. Before presenting the main results in [4.2], and to
motivate further, consider the following problem.
Suppose you want to solve
(5.26-6) x, e { 0 , l } .
The level 2 MEM. way of solving this problem is the following. On fi={0,l}N with the
obviousCT-algebra. For to s Si, P(to) is the probability of configuration to . We define X,:
Q. ->{0,1} by the obvious thing Xj(co)=to(i). On {0,1} we define a probability m(0)=mo ,m(l)=m,
and on £2 we define 1^= m®... m in the obvious way. This m is some "a priori" measure on fi.
We shall define <J>, on fit by *,=5:^ljX1 and we shall look for P'e P(fi) such that
(5.26 - c ) EP[<b,]=y,
where P(to)=p(to)|x(to) for every to s Q.. Having established the notation, there is no problem in
verifying that
Z(f)=Z e-w"<<'')u(to)=n (m0+m1e-^'",i').
assumes its minimum value there, then p*(co)=exp-(t*,<J>)/Z(t*) and the X; we want are given by
Xj = EpnXj =Z ArJ(to)p*(co)u(a))
for l<j<M. If all goes well, the numbers t*„ l<i<M will be such that Xj is very near zero or very
near one, (in practice it is somewhat hard to go beyond such statements).
The Method of Maximum Entropy 51
In its most general form, the reconstruction problem via the M.E.M. goes like this: Let B
be a locally compact, topological vector space and B* its dual. Let |x be a reference measure on B
and X a B-valued random variable and P its distribution.
Let <t>s(B*)k and cs 9?k The M.E.M. reconstruction consists of finding P that maximizes
S ^ ) subject to Ep(<4>,X>)=c, PeC.
But we shall not aim at such generality here. For us B will be C([0,1]) the class of all
continuous, real valued functions defined on [0.1]. To be specific we shall consider the cases
C2 = {geC([0,l])\lg2dx<l}
I
Also, let O: [0,1 ]-»SRk be continuous, and set (O, g) ={ <f>{x)g{x)ax.
o
Since dealing with measures on infinitely dimensional spaces is hard, the thing to do is to
discretize and verify that a solution to our problem exists in the limit . This explains the reason
behind our regularity assumptions.
We want to solve the problems: For i=l or 2
To discretize, we shall consider the discretization I„={i/n| i=0,...,n-l} of size n of [0,1] and
for hsC([0,l]) the trace h„ of h on L, is defined to be {h(i/n)| i=0,..,n-l}.
Let C(i)„=(a,b)" and for any measure m on (a,b) we set u,n=m®... m. And, when dealing
with C2, C(2)n will be the unit ball Bn in SRn and we shall take u.n to be the uniform measure on Bn.
In any case, Xu=(X(")1,...,X(n,11) will denote the obvious coordinate vector and we will
search for measures Pn on C(i)n such that P„«u n and
(5.27-6) i£ / .„[Z*(r/n)^" ) ] = c
eNHm
= eMAH)ree200>1084
IV ~ gNH
c) \<P{x)gk(x)dx = c.
0
The following lemma asserts that discretization yields feasible solution at every stage. The
proof is in [4.2].
Lemma 5.29. Let C be an open convex of C([0,1]). Assume that the constraint is
realizable, i.e., there exists g such that <0,g> = c. Furthermore, assume * is of rank k, i.e., for
any a e 5Rk such that (a.O(x)) = 0 for all x in [0,1], we must have a = 0. Then, the constraint is
realizable in C„(i) that is, there exists a„ s Cn(i) such that £ S 9(i/n)&„ = c .
The following lemma asserts the existence of solutions to the M E M . problems of size n.
Lemma 5.30. If the hypotheses of Lemma 5.29 hold, and if the convex envelope of
the support of n„ contains Cn, and if
is a non empty open set, the ME-problem of size n admits a unique solution defined by
eNHm
*5(*.)IV= JEJ5«P[-4
~ J («,*(*))*i]dMc)
MAH) 200
gNH
= 84e
■
ree >10
Proof. According to Lemma 5.30, there exists c n such that (l/n)S4>(i/n)ov' = c. Thus we
only have to quote Theorem 5.21 above to obtain a measure P*n such that
eNHm
= eMAH)ree200>1084
IV ~ gNH
Let us now concentrate on problem (5.27-a) for C,. Under the following assumptions, we
shall prove that the sequence of ME-problems yields a solution to our (5.27-a).
Assumptions on the measure m on (a,b)
Al) (a,b) is contained in the convex envelope of the support of m.
A2) The set D(m) = { te 9? | Jexp-ty m(dy) < QO } is a non empty, open set on which we define
C(t)=Jexp-ty m(dy) and >P(t)=lnC(t).
A3) The set V= {u e 5Rk| (u ,4>(x)) eD(m), Vx s [0,1]} is non empty and coincides with V={u
e Mk| *F((u,<P(x)) e L,([0,l]),dx)}.
Denoting the u in Lemma 5.30 by un and setting An = un /n we can restate Lemma 5.30
as
The Method of Maximum Entropy 47
where N has P'(N)=0 for all P'sEr.S^Q.oo) and g() belongs to the closed subspace of L,(Q)
spanned by the f,'s. Conversely, if a Pe£ has Q-density of the form (5.18) with g belonging to the
linear space spanned by the f,'s, then P is the K^-projection of Q on £ and (5.16) holds.
Proof: It follows from Proposition 5.14 that P is the K-projection of Q on £, then for N=
{dP/du=0} it is necessary to have P'(N) = 0 for any P' e £nS (Q,°o).
Let £t£, the class of P'e£ with dP'/dP<2. If P'e£', there is P"e£' with dP7du=2-dP'/dn
and P=(P'+P")/2. (Given P's £, define P"=2P-P' and verify it is in £■.) Thus P is an algebraic inner
point of ^.Applying Proposition 5.14 to £■ instead of £ we obtain Ej,[ln(dP/du)/(dQ/du)] =
K,(P,Q)or
eNHm
= eMAH)ree200>1084
IV ~ gNH
for all such h, and therefore for all heL(dP) satisfying (5.20).
Therefore, ln((dP/du)/(dQ/du)) belongs to the (closed) subspace of L,(P) spanned by 1
and the fa's. For, were this not the case, the Hann-Banach theorem ([1.4]) would imply the
existence of a bounded linear functional on L,(P) vanishing on the said subspace but not at
ln((dP/du)/(dQ/dn)). Since the dual of L,(P) is L„(P), this is a contradiction.
To prove the second part, suppose that (dP/du.) is of said form. Since g is a finite linear
combination of f,'s, Jg dP is constant on £ and
But for P'e£ both Kpi(P,Q)<oo (by hypothesis) and KJ¥\P)<CD. Therefore
KI1(P',Q)=KJP'F)+KJ?,Q) as desired.
So far we know that if a solution to (PI) exists it has the desired form. Let us see how
Czizar settles the existence question.
54 The Method ofMaximum Entropy
Proof: Let i s W . Since W is open, there is x" and 0 <A. <1 such that x=A.x"+(l-X,)x"
From the concavity of K we obtain
Now use assumption (iii), and the fact that x* is fixed to obtain the desired result.
Proof ofLemma 5.32: Set
Certainly Ffn(A,c)->H(A,c) uniformly on compact sets. Thus if we show that H(A,c) has a
minimum at A„ e V, we will be through. For that, let us begin by verifying that H(A,c) satisfies
the assumptions of Lemma 5.33.
For A'sV set c* =J ®(x)x¥'(AmMx))dx.
o
Then A* is the minimum of H(A,c*) and (i) of the lemma holds. Let geC, be such that
c=<*,g>. Set
eNHm
= eMAH)ree200>1084
IV ~ gNH
eNHm
= eMAH)ree200>1084
IV ~ gNH
and also
X(P;)=JffB(A„,C„).
The Method of Maximum Entropy 55
ft e x p f - x i C I " ) - 1 ^ ) ) - « P o C P 0 - , ( * ( i ) ) )
rm(g) = -\ ym(g(x))dx<H(A,c)
o
for any A e SR k. Since Tm(g) isfiniteon ge C„ due to the continuity of ym on (a,b). Therefore, (ii)
holds as well. Fatou's lemma yields (iii), and therefore Lemma 5.32 can be invoked to conclude
that the minimum of H(A,c) cannot be reached at U*
All these scattered lemmas amount to proving
Theorem 5.34. Let O:[0,l]—» 5Sk be a continuous mapping. Let cs SRk and consider
problem (5.27-a) for C,. The following are equivalent
1) (5.27-a) has a solution.
2) These exists ^(t), such that expY is the Laplace transform of a measure m on (a,b)
satisfying Al, A2 and A3 such that (5.27-a) has <P'(A„,<I)(x)) as solution, where A„ verifies
c=\g(x)V>(K„,g(x))dx.
o
56 The Method of Maximum Entropy
3) For any *F such that exp*F is the Laplace transform of a positive measure satisfying Al,
A2 and A3, (5.27-a) has a solution y(A«.g( x )) with A„ satisfying
)g(x)y>(A„,g(x))ax=c
o
And to finish we have
Theorem 5.35. Define T: C, -> 9? by
r(h) = -]ylh(y)]dy
o
where Y(y)=yOP )" (y)-[ F°OI")- ](y). Then g*(x)=,P,(<A».*>) is the unique element at which
1 1 t 1
J exp(t,x")A"=l ex.p[\\t\\y](l-y2)^v^dy
-l
Z„(u) = £/„(±||(u,<I>)||)
from which we obtain that
«sW] = dku». Ml)"1 ( »», *(;) )G.(U.)
where un has to satisfy
The Method of Maximum Entropy 57
id(u„,<b„)\\)-LG„MX ( u . , * ( i ) ) « ( i ) =c
Set now A„= UnG^uJ/IKu,,,^)!!"1 and the n-th maxentropic approximant to the desired g*k
is given by
g;(r) = z(A„,<i»(i))x([^>i])(x)
where we set x(A)(x) equal 1 or 0 depending on whether x is in A or not (i.e. the indicator
function of A in the parlance of measure theorists but not in that of convex theorists).
Anyway, above, An is the unique minimum of the convex functional
tf„(A,c) = £j:(A,<&()))2+(A,c)
\ a>(x)®+(x)dx
o
g*(x) = -(MilcMx))
APPENDIX.
55 The Method ofMaximum Entropy
Before starting to quote results from [5.8], it is convenient to translate the scheme of
section 1 to 5R*. By means of <& : Cl -»SR^ we can associate with each measure P, U, |i on fi a
corresponding measure P°<b~l = <S>(P), etc - o n 91* We shall assume that the range S of 4> is a
Borel set in 9?* and we shall denote by C the closure of the convex set generate by S. And we
shall write 71(x) = x for the identity mapping on SR*
Instead of considering the translates of p(u) = [ P e p(G)\P « u ] by * we shall
consider only the translates of the Hellinger arc
4i2> = ^exp-(A,0>)4i
and if, for short, we denote by u the measure <£>(u,) on 5R* we have the exponential family
H(\i) = {u, : X s .DJwhere
eNHm
= eMAH)ree200>1084
IV ~ gNH
infC< doms<C.
Theorem 5.37. k(X) is steep if and only if d(D°) = infC, where 3: D° -> SR* is given by
3{X)=VxKX) = VxlnZ(A.).
Theorem 5.38. Let t be a boundary point of C. If there is a hyperplane H supporting C at t
and satisfying u(H)=0, then tg dom s in particular, whenever u, is absolutely continuous with
respect to the Lebesgue measure on SK*, we have that dom s= infC.
To explain a bit some of the words, we mention that a closed convex function is just a
convex lower semicontinuous function. That 5(c) is essentially smooth means that
doms= {c e 5Rl.s(c) < oo} is open and s is differentiable on int(dom s). In this case, s(c) comes
out being steep, that is J^(c' + X(c - c')) tends to infinity as Xio, where c' is in the boundary of
dom s and ce int(dom s).
The Method of Maximum Entropy 59
For more about these facts, the reader is directed to Chapter 5 of [5.8] where appropriate
references to the treatise on convexity by Rockafellar are given.
In our situation K(0) is the natural logarithm of a Laplace transform, and therefore it will
be differentiable on the interior of D whenever it is not empty. According to Theorem 5.37 this
will happen whenever int C is not empty. Then, the first thing to examine is the support of p. in
SRl. If it has a nonempty interior we proceed to find D. If S is a finite or countable set, then we
proceed according to
Theorem 5.39. Let S be a finite or countable set. Then conv Seldom S. In particular dom
S=C if Sis finite.
REFERENCES
This chapter is made up of bits and pieces. It is a collection of sections, not related in any
logical order, the contents of which can be considered as either comments on the material of the
preceding chapters, extensions or variations on the theme of some of the topics, or applications
mostly taken from the literature, and presented in no particular order at all, hopefully to break the
monotony.
Since this chapter is very long, results and formulae will be numbered by section.
Many reconstruction problems have inexact data, and instead of wanting to solve for x in
Ax=y one decides to look for x's such that
where
61
62 The Method ofMaximum Entropy
D(M)={Xe<Hk: Z^(X)<oo}.
u) Otherwise dPy.- = - ,,
Then to find the sup {SJP): PsBM(y,y)} it suffices to find sup{Sli(Pxl(y„) ): T| e VM(0,y) ) . The
final step consist in applying the min-max exchange theorem in [6.3].
Applications and Extensions 63
Comments. Actually, instead of VM (y,y) we could have considered any convex set K. The
issue would then be to find the analogue of HIX,y).
There is one very important sense in which relaxation is of real help. Notice that when y
gR(A) the range of A, then there will be no hope of finding a minimum of lnZx(|i)+(X,y). We
have to consider finding x such that Axe VM(y,y) and VM(y,y) oR(A) is not empty. There will be
a critical value of y below which no solution will exist.
Here we present some examples, borrowed from [6.2], in which the maximum entropy
solution to a linear reconstruction problem does not satisfy the associated dual problem. We also
present, without proof, the way around this difficulty proposed by Bowrein and Lewis.
Consider a measure space (£2,^u) and a vector subspace X of Lp(fi,n) 1 <p < °o , on
which a functional S^ is defined by
where (p : 5R -> [-oo, oo] is a closed concave function. The maximum entropy problem consists of
finding
where A: X-»Y is some continuous linear operator. Some examples of <p are
a) Burg entropy
cp(x) = -lnx
b) Boltzmann entropy
(p(x) = -xlnx
c) Fermi-Dirac entropy
cp(x) = -x\nx-(l -x) ln(l -*)
d) Lp norm
(p(x) = -xp/P
64 The Method ofMaximum Entropy
e) Lp entropy
, . f xpIP x>0
<p(*) = 1
[ -00 X < 00 .
V © = I<P'(^))«
where q>*(£)=/«/" {(p(x)-(4,x)} is the Fenchel conjugate of (p. Here we use (4,x) to mean !;(x) for
5eX',xeX.
The conjugates of the functions listed above are respectively
a) cp*(0=l+lnK)
b) <p*(x) = e^-'
c) (p*© = ln(l+e-5)
eNHm
= eMAH)ree200>1084
IV ~ gNH
e) (p^^max^}'/?.
Ax = y with inf dom <p < infx < sup x < sup dom (p.
(6.2.2/ inf{V04*A.) + ( X , y ) : X s r } .
(62.4) x .=_^(i4.r)
Applications and Extensions 65
Let us verify that the range of AA* can be larger than the range of A. Notice that x*n=l/2°
satisfies our problem, but no X" such that A (A*X")=y can be found. Since (A A'X )v =X J42v
and y„=l/8" we would have Xn=2" or X not in L2(Q,n). In other words the maximum entropy
problem cannot be solved using duality theory, a real handicap.
One may consider solving the finite dimensional problems (Ax)0=yn for (Kn<N and then
attempt taking limits. But observe that in this case xN=(l,l/2,...,l/2N,0,...,0...) and
X'N=(1,2,...,2N,0,...,0). Even though the solution to the full primal (6.2.2) is the limit of the x*N,
the Xn cannot converge. This is related to the fact that
is strictly convex, has a unique minimum at zero, but notice that if er is the n-th basis vector,
H(ne„,y)->0 and ||neJ|-» °o , that is H(X,y) is not coercive.
A different, but similar example, is the following: let fl, and Q2 be compact metric spaces
endowed with Borel measures ix and v respectively. We shall assume that fi2 is separable as well.
Let X=C(£2,), Y=C(£22) denote the continuous functions on Q, and fij respectively considered as
vector subspaces of L,(fl, ,u.) and L,(n 2 ,v).
Define A:X->Y by
(Ax)(m2) = \ a(a>2,coi)u(rfcoi).
eNHm
= eMAH)ree200>1084
IV ~ gNH
would have to satisfy Ax*=y, but again, A((d<p*)/(dt)(R(A*))) may be smaller than R(A) and
duality will fail. To use Theorem 6.2.3 and verify that the situation is not circumvented by going
first to the finite dimensional case, take X=L,(D,n) and Y=L,(fi2 ,v) but the rest as above. Let
{(B°2: n> 1} be a dense subset in Qj and consider the problems of size N, i.e. find
If we set
eNHm
N t
= eMAH)ree200>1084
IV ~ gNH
A,// = Z Xsr 5 k
eNHm eNHm
= e M A H ) r e e=2 0e 0
M> A
1 0H
8 4) r e e 2 0 0 > 1 0 8 4
IV ~ gNH IV ~ gNH
x„ = -^x»)^£(,«*r) = x-.
That is, if the xN's are a bounded sequence in L,(Q,|j.), x' would be as in Theorem 6.2.3.
But when A(d/dt q>'(R(A*) )) is smaller than R(A), that could not happen and X^, cannot be
bounded.
Applications and Extensions 67
Borwein proposed in [6.2] two ways of going around these difficulties. We shall cite one
of them and urge the reader to see [6.2] for details and for the method of penalization.
Theorem 6.2.6. (Relaxation). Assume q>* is everywhere finite and differentiable. Take
X=L,(a>, ,u) on a complete measure space and (Y, || ||) to be some normed space. Then the
supremum in (6.2.2) is attained (when finite). In this case consider, for e > 0, the relaxed problem
xt=-&iA'ks)
where \ is any solution to (DE)6. Moreover as 6 ->0, xek converges in mean to the unique
solution x* of (ME)E and
Sv(xt)^S9(x').
Not too long ago I attempted to show algebraists how to use standard maximum entropy
methods to solve linear equations. Much to my surprise, since many years ago a journal like
Linear Algebra and Applications has been publishing papers on the subject. Besides passing down
some more references, missed in [0.3], given to me via R. Brualdi, it will be the gist of this
section to compare the level 1 and level 2 approaches.
Consider for example the problem offinding{Ps: i=l,...,n} such that
We shall generalize shortly. For some applications see chapter 13 of [0.3] for example.
You could think of (6.3.1) as a problem of resource allocation, the Pj being the fraction of the
total resource allocated to mode i, or think of (6.3.1) as the problem of determining how loaded a
68 The Method ofMaximum Entropy
die may be from the knowledge of the mean earnings of a player that has bet (enough) on each
possibility.
Anyway, the standard approach consists of finding the {P,} maximizing
As we have seen many times so far, a candidate for P{ is exp-XCt / Z(X) with
Z(X)=lLexp-XC.. The Lagrange multiplier X is to be determined by minimizing the Hamiltonian
(the name physicists employ for the dual of the Lagrangian) H(X)=lnZ(X)+Xc.
Again, in chapter 13 of [0.3] the analysis on conditions for c to be in the range of
-5Z(X)/8\ is carried out. Certainly, since ECP;=C is a convex combination, we have to have
minC, < C < maxC,. If c is generated by experimental data, as in the second motivational situation
above, that will certainly be the case.
But when the consistency condition is not satisfied, you either drop your towel, or relax
your constraints and look for {P,} such that
£ P , = 1, |2P,-c|<s
■SV(P) = -Jptolnp0c)4iCc)
D
where we set u=Xj+l. Once the values X*, and X*2 that minimize
When n=2, a simple but lengthy computation shows that for c,<c<2, the direct solution or, any of
the two maxentropic methods yields
Actually the same is true for any square, invertible, reconstruction problem.
It may happen that minimizing H (A., ,XJ becomes too difficult for it may be too flat near
the minimum. In this case it may be convenient to use a genetic algorithm to minimize something
like
y ! i \w+ \y —- Aw'
r r
I J ' ' I I I ' ' I
S v=
eNHm eNHm
2 0e0M> 1A0 8H4) r e e 2 0 0 > 1 0 8 4
eMAH)ree=
Pi-~ gNH
IV IV ~ gNH
which is what you'd get solving (6.3.1) by proposing the solution PpX^+Xj and finding the right
eNHm
= eMAH)ree200>1084
IV ~ gNH
Before returning to the mainstream of this section, let us recap what we have done in the
following.
Comments.
i) Level 1 and level 2 approaches to reconstruction problems may yield the same answer
to a reconstruction problem.
ii) When using level 1 approach the choice of SM(P) is arbitrary whereas, when using a
level 2 approach, one agrees up on the Boltzmann-Gibbs-Shannon functional and plays with the a
priori knowledge one has, or can assume, about the range and distribution of the Xj. But, of
course, the choice of the entropy functional and the reference measures is totally arbitrary.
The following problem: find {x,: 0<Xj<l, i=l,...,n} such that
(6.3.3) A\ = \>
where the nxm matrix A and the vector be 95m are given. For a review about work in this
problem see [6.4], and for conditions for the existence of a minimum of the dual problem, i.e., a
minimum of H(X) associated with the standard maximum entropy problem see [6.5].
The way the second level maximum entropy method applies to (6.3.3) consists of
assuming an a priori measure, say du(x)=dx on f2=[0,l]°. By ep . : fi -»SR we denote the
coordinate map <pi(x)=xj. We look for measures P on O (equipped with the obvious Borel
c-algebra <&) having density p(x), that maximize S(l(P)=-Jp(x)lnp(x)dx subject to the constraints (
SAj,<p1(x))p(x)dx=bj, j=l,...,m.
The usual arguments provide us with
m
where C,=Z^ i=l,2,...,n, of course
7=1
Z(X)=\e-lc-*h*=fi(t£L)
(6.3.5) H(X)=£ln(^-)+(X,b).
Whenever life is nice to us and the minimum in (6.3.5) is reached, we know from chapter 5
that
PME(X) = ^ e x p - ( C * , <p)(x)
is the distribution on the set of all images that maximizes S (P). The maxentropic reconstruction is
To finish this section with a calculation that we shall make use of in the next one, assume
that we know or we have to impose the condition that the solution to (6.3.3) belongs to the set
{0,1}". In this case the measure du,(x) on fi=9T is
4i(x)=n{±(5„(<£c) + 5,(<fe))}
(6.3.7) xt = (e? + l)
which suggests one should look for values of X that make the absolute value of C^S \fo very
large when searching for X's that minimize H(X).
72 The Method ofMaximum Entropy
When we put out [6.6] and [6.7] we neglected to look through the published literature for
related work. To patch this up a bit, here we mention some maxentropic approaches at the linear
programming problem. Consider references [6.8]-[6.11] for example. Our approach is
nevertheless quite different. We consider the problem of finding
where D,={xe 91" ttSx^l, i=l,...,n), A0 is a fixed vector in SR", A is an nxk-matrix and c is a
fixed vector in 9?.
We shall assume that the nx(k+l)-matrix
eNHm
= eMAH)ree200>1084
IV ~ gNH
obtained by adding A0 to A as first row is of rank k+1. We shall also assume that A has at least
one row, say the first one, with all entries positive.
0<Xj<Ci/Av
*"(*)■ Ht)
The way we go about solving (6.4.1) is to find a c„ in 9? for which the maxentropic
solution to Ax=c fails to exist. The first such c0 will be the one we need.
Applications and Extensions 73
(6.4.2) Ax = c, x,e{0,l}
and on {0,1} put the measure m=l/2{80+8,}, which induces an obvious Q on fi={0,l}°. The
elements co of {0,1}° are configurations with probabilities P(co). As always, we denote by X,(co)
the i-th element of co and think of (6.4.2) as
where A1 is the vector (X0,A., ,..,XkH) and A, is the vector corresponding to the i-th column of A.
Again, A is to be found by minimizing
o
2) If c e K-K, then there exists a solution xJ; to Ax=c of the form
x* = 1 ieP
x\ = 0 ieN
Xj = Xi 0<Xi<\,iePuN
74 The Method ofMaximum Entropy
and depending on the vectors A;, we may have (A„ ,Aj) converging to a finite limit despite
||A"|| -> oo.These comprise the third case.
If c0 is such that ce K, then H*(c)= -oo. This is due to the fact that H(A,c) is convex on
5R*+I, and if
where cs 5R*, D, is the unit cube in SR" and BM(8)={ye 5R*|(y,My)<8} for a positive definite,
symmetric kxk matrix M.
Here instead of minimizing H„(A,c)=lnZ(A)+(A,c) to find the vector A of Lagrange
multipliers, standard procedure leads us to minimize
This section provides more substance to the way the second law of thermodynamics was
phrased in chapter 3. To make things simple we shall assume that the states of our system are
discrete and that microscopic dynamics are described by an infinitesimal transition matrix (or rate
matrix) W^. We shall denote the set of microscopic states by S.
Thus, if PXO denotes the probability offindingthe system in state i at time t, when at time
t=0 the distribution was known to be P,(0), then
Z.Wj, = -Wj,
and setting A(=W^ (this is the mean holding time at state i) we have Qjj=Wjj/A/ for jump
distribution (see [6.12] for more on this and other stuff).
Although we do not need to assume symmetry Wj=W» we do need to assume the
existence of a measure |iL with respect to which our dynamics satisfies the detailed balance
condition
(6.5.3) £n,»V = 0.
H,0)=EM^(r)
/
76 The Method ofMaximum Entropy
(6.5.4-a) Zrv0)W)=M
j
(6.5.4-6) EWV(/) = 0
and hence the name harmonic. We assume we have at least a few invariant functions. If we set for
any probability distribution {P,} and a given invariant distribution {\i-}
= -XE { £ u ^ - £ u , ^ } m i V u , -
Thus, for any initial value p,(0), S (p) increases until Pj=m for all i. From the point of view
of physical applications, we need a supply of invariant measures u.
Let f,,...,fN be N invariant functions, let m be any invariant measure on the set of states
and, as above, let
Z(X) =E mt exp-(X, fj)
i
where F is an element in W with components F ; , A and f(i) are in SKN with components A;, fj(i),
j=l,...,N. Also, we set <P,f > for the vector with components 2P,f](i)for j=l,2,...,N. If we think of
P as a row vector, then P(t)= PP(t) is also a row vector.
Applications and Extensions 77
Notice that for PsP(F), P(t)eP(F) for <P(t),f>=<P,P(t)f>=<P,fc> since the fj are
invariant.
Consider Sm(P) restricted to P(F). From what we know from before, there is a unique P'
in P(F) such that Sro(P*)=sup{Sm(P)|PsP(F)}. Also
Assume, which is reasonable for physical applications, that for PeP(f) lim Pft^P,^ exists,
and denote by P*(t)=P*P(t) the time evolved of P" Since Sm(P*(t)) is increasing on P(F) and its
smallest value Sm(P") is already the largest value of Sm(°) on P(F) it follows; from the uniqueness
of P*, that
P'(f) = P'F(t) = P*
Theorem 6.5.6. The measure P* yielding a maximum for the entropy Sm(P) over P(F) is an
equilibrium measure for the microscopic dynamics given by P(t).
It is not hard to conceive all sort of extensions of these results.
Let us say a few things about the use of the entropy as a Lyapunov functional.
Assume that {u,,} is an invariant distribution and {P:} is any distribution. We saw in Lem
ma 4.5 that -S (P) > 0 (here we let counting measure on S to play the role of what we denote by
u, there). Above we saw that dSydfeO, when we let P(t)=PP(t). Notice that when P happens to be
invariant, then S (P) is constant in time. We shall set
and we shall call it the attractor of u.. Notice that we exclude n from it.
Theorem 6.5.8. If Pe A(u) then P(t) tends to u as t tends to infinity whenever S ^ f ^ t O .
Proof: Consider first t = inf{t >0\ dS/dt = 0). Note from the computation of dS/dt given
above that the right hand side vanishes if and only if P.(t„) = n,.Therefore if t„<°o, P^t) = u- for
all t>0.
Consider now the case dS^/dtX) for all t>0. From (4.38) we obtain that
i(z|P,(0-n,|) 2 ^-W0).
Since the right hand side goes to zero, by passing to a subsequence if required, we obtain
P^t)-)^ for all i.
78 The Method ofMaximum Entropy
Comment. Note that assuming that P(t)->Peq is not enough, it may happen that S^CPJ^O,
and (S |Pi(t)-M.J)2 may only oscillate in the interval (Q,SJ[P^)).
Let us state a few problems leading to search for solution of the matrix equation
(6.6.1) AX=C
where A, X, C are respectively nxm, mxk and mk matrices. Here A and C are given and we shall
require the unknown matrix X to have its components in a preassiened convex set. We direct the
reader to [6.4] and [6.16] for more on related issues, namely, different problems leading to matrix
equations like (6.6.1) and their solution via the level 1 maximum entropy method.
Example 1. Let Ay denote the intensity of spectral band i, l<i<M of a substance j , l<j<N.
Assume that the intensity Cj in the i-th band for mixture is known and we want to know the
concentration xi of substance j in the mixture. Certainly the normalization 0<x.<l, for l<j<N is
natural in this case.
Example 2. Consider the problem offindingthe generalized inverse X of a matrix A. The
whole thing here is that A may not be a square matrix. The matrix equation defining X is
(6.6.2) AXA=A
For a very fast review and analysis of best solutions in norms other than 12 see [6.16],
Example 3. Consider the extension of (6.6.2) to either of
Example 4. Relating stimuli to responses by means of linear maps. Suppose you encode
stimuli by vectors in certain Wand have m of them, described by {S^ l<i<n, l^j<m}. Assume
that the system under scrutiny responds linearly to the stimuli to produce k different responses
encoded by vectors in 9?m You want to know the mechanism, or transfer matrix such that
(6.6.4) SX=R.
Applications and Extensions 79
You may need different k and m because, say, the independent or different stimuli may
yield common or related responses. Besides that you may know in advance, or need to assume
that
the x^ are to take values in some preassigned set, {-1,1} say.
For the fun of it, we shall look at the problem of finding a lxn-matrix X , the inverse of
the nxl-matrix A, such that (6.6.2) holds and -||A||<X<||A||. On ft=[-l,l]° equipped with the
Borel a-algebra we shall define the measure m(dx) with density 2° with respect to dx=dx,,...,dxn.
Denote by £f the coordinate maps ^i(o))=coi and by 3>((<B) the map
a , IZafcjim)
a, afabo)
i
where ^ denotes the j-th component of A. We shall look for measures P on ft such that
==aja,
Ep$>j --
Ep®j
dP = (Z(X)r'exp-(>., <b)m{dx)
which satisfies (X; ,i$= 1 so that AXA=A (So, we can go to sleep with a feeling of being
consistent. Some may think that this is a dum way of writing x^A/UAH2)!. This would be true if
||X||=1/||A|| which is not apparent from the result found above. Also, try to find X using singular
values decomposition.
80 The Method ofMaximum Entropy
The following is a variation on the theme of a nice paper by Bard, [6-18], in which state
probabilities are estimated, using the MEM., from the knowledge of probabilities of a collection
of sets.
Suppose S is a finite set, with atoms J„..,J n and the a-algebra S consists of the collection
of subsets of S. Any probability P on S is then determined by the P({SJ).
Also, if (XJ denotes a time homogeneous Markov chain, having S as state space, then the
transition matrix
(6.7.2) Pv=P(XleC,j\Xa=:Jl)
where the C,j are a collection of not necessarily exclusive nor exhaustive events. In terms of the
Pj,, the P,j can be rewritten as
(6.7.4) SP,j = l
is a condition satisfied if the chain is conservative (When it does not hold, we throw in a cemetery
state to enforce it).
So our problem becomes that of determining for each i, a collection P. satisfying (6.7.4)
when all that is known is (6.7.3). Dropping any reference to the index i, we are in the situation
discussed by Bard. So, we will follow him.
Applications and Extensions 81
Given sets Cj, j=l,2,...,K; we denote by DJ; l<j<M the partition of S induced by the Cj(
that is, Dj is a mutually exclusive, exhaustive collection such that any set in the a-algebra
generated by {C^} is a uniori of sets from {D^. In particular
eNHm iA
(6.7.6 -a) = eMAH)ree200>1084
IV ~ gNH
From the solution to this problem, the original problem drops out, for the procedure can
be carried out for each starting point i.
Observe as well that once the Q, are known, we can setup the problem of finding Pj;
j=l,...,N such that
(6.7.7) XP, = l
So Bard's technique., applied twice in succession, provides us with the complete collection
of (transition) probabilities. Let us apply the MEM. to solve the set (6.7.6). Again, denoting by
X the Lagrange multiplier oorresponding to (6.7.6-b) we would obtain, after an application of the
level 1 routine
where we set x(A)=l or 0 depending on A being empty or not. Finding the X* such that (6.7.6-b)
are met provides us with the Q*, that maximize the entropy and satisfy (6.7.6-a).
If you compare (6.7.8) with Bard's results you will notice some differences, stemming
from the fact that he does not have condition (6.7.6-a). As a simple minded application, that can
be worked out by hand consider the problem of figuring out the probabilities of the different
outcomes of a die throw when you only know
P,=P(2,3,4) = i ? 2 = (3,4,5) = i
The sets C,={2,3,4} and C2={3,4,5} determine the partition D,={1,6}, D2={2}, D3={3,4} and
D4={5} of the sample space.
According to (6.7.8) the outcomes of the throw fall in these sets with frequencies
e X
~ '=7T e~^ =P2/{\-P2)
Z=l/(l-P,)(l-P2)
9i = i 92 = \, qi = f, q* = j
Pi = i P* = i P^\, P* = i P5=i PS = 1
A larger scale application of this technique could be the following random search
algorithm. Let 1=1,...,N label the points of a grid and let P, denote the probability offindingthe
particle, individual, oil, or water at i.
Assume you have a way of assigning areas to detection procedures and you determine
P , = P ( C , ) = E Pk
keC,
by some experimental procedure. For example, P, is the fraction of successful detections in region
Ck. The C|, i=l,2,...,k are some not necessarily disjoint nor necessarily covering of the whole
domain. The procedure outlined above would yield the P,.
Even though we shall be following [6.19] we urge the reader to take a look at [6.20] on
which it is based. Especially the section devoted to the choice of the a priori profile. Instead of
directly applying the results in section 4 of chapter 5 I shall repeat myself a bit and, restate the
results of [6.19]
At given instants t,,...,^ during an interval [0,T] the following mean square averages are
some how determined
Also, to avoid ridiculous complications we assume that the tj happen to coincide with
points of the partition. And we will want to think of the xk as E ^ and, as usual of (6.8.2) as
Even though somewhat unrealistic from the physical point of view, we shall assume that
the random variables X,^ take values in the interval [L,<») and on f ^ r L , * ) N we shall define an a
priori reference measure dQ(x) with density
with respect to the density dx=dx, ...dxN on rL,°o)N. The x„(i) are chosen so that x„(i)>L and
therefore
5
e(^) = - f I p(x)lnp(x)V(x)ax
where <p(i) is the n-vector with components pit and X is in SRW The maxentropic P will have
density PN(x) given by
wherea
' = ^b+(^(p('))-
Applications and Extensions 85
where d is the n-vector with components dt, j=l,2,...,n. We are through. Setting xjxjr^fjii), L=v,2
and letting N tend to infinity we would have
where of course (p(t) is the n-vector with components Pj(t). To simplify, as in [6.19], we set
V^O, V02(t)=Vu2 constant, which is to be added to Xa.
It is easy to verify that
d} =1 —f = dH + U - trll[vl+l Xk) ) .
i
This is a part of a project, once started with L. Dohnert, based on [6.21]. The problem
studied there is to understand thefragmentationof a heavy nucleus by a fast light nucleus.
The everyday language description of the process consists in supposing that the large
nucleus gets "hot" when it absorbs the kinetic energy of the smaller nucleus. Upon cooling down,
it condenses in globules that fly away. The problem is to find the distribution of the fragments.
To be precise, we specify the outcome of a reaction by giving {n(ij): 0<i<j, ij integers},
where n(ij) is the number of fragments of mass j (measured by the number of nucleons) and
charge i (measured by the number of protons).
The macroscopic constraints on n(ij) are
eNHm
= eMAH)ree200>1084
IV ~ gNH
pi
(69.1) e({«(',y)})=2'»(»j)
pi
M{{n{i,j)})=lMi,j)
pi
The meaning being: F({n}), Q({n}) and M({n}) stand for the number of fragments, the
charge, the mass of the distribution {n} respectively. We want to find the measure P({n}) defined
on the set of all possible configurations, and such that
EP({«}) =1
TP({n}mW) = N0
(6.9.2)
ZP({n})Q«n}) = Z0
ZPan})M({n})=A0
{«)
where the numbers on the right hand side denote the average number of fragments, charge and
mass respectively. The maxentropic procedure would yield a probability
where, as usual
Z(X) = Z e x p - { A . , ^ { « » + X2Q{{n}) + X s M W ) }
00 00
which is what any decent physicist would do. By differentiating we obtain the integral analogues
of (6.9.5) and the X can be found by minimizing
over the set D={X|Z(X)<°o}, which has to be precisely determined. This set seems to be the
positive orthant in 5K3 Can you enlarge it?
88 The Method ofMaximum Entropy
Suppose you want to recover a continuous function fix) defined on [0,co) such that either
f(t) tends to zero as t goes to infinity or, that its growth rate is such that for some a 0 >0
f(t)exp-a0t tends to 0 as t goes to infinity. Suppose that you know
where 0<a0<a, ,...,0^,, and you want to recover f(t). Note that the change of variables t=-ln s
transforms that problem into recovering x(s)=f(-ln s) from
l
(6.10.1) x(a,)=\x(s)sa'-lds i=\,...,M-
o
and B(s): il -»SR, B(s)(<n)=co(s) denotes the standard brownian motion on [0,1]. Here,
I
\x0(s)dB(s)
o
denotes the standard It0 integral of x„(s) with respect to B(s). All these probabilistic constructions
are described in [6.22].
Again, standard reasoning yields
Applications and Extensions 89
(6.10.2) EPo[B(t)]=lx0(s)ds
and it follows that under P0, and if we denote by <D(s) the SR" -valued function on [0,1] with
components ^(s) = s", note that
where x0 is the vector whose components are the Laplace transform of (the initial guess) x/s).
Ours maxentropic problem now is to find a law P on (Cl,dF) such that Sro(P) achieves the
maximum value over the class of measures Q on (£l,d&) such that Q « P0 and
(6.10.4) EQ\^(s)dB(s) =i
where Z(k) can be explicitly computed (fly me in and I'll tell you how, but it is really simple)
from which it follows that to obtain X' that makes P the measure that satisfies (6.10.4) we have to
minimize
HQ,) = f 1 {<X,«I»(s)>V -2(\Ms))xo(.s)}ds + (\,i)
o
= ±{(k,a.)2-2(k,z0)}+(h*y
(6.10.5) r=C-'(x0-x).
Note that when the Laplace transform of xje'1) coincides with that of x, then X' is 0 and P0
is already the maxentropic measure on (£!,<#"). The maxentropic reconstruction of x(s) is
jx(s)<Hs) = x.\
o
The first approach is a variation on the theme developed in the previous section, the second
evolves according to a discretization procedure as in section (5.4). We direct the reader to [6.23]
for a level-1 like approach.
The (inessential) difference with the setup of (6.10) is that we allow for the initial point
B(0) of the brownian motion on [0,1] to be started with a distribution such that the Wiener
measure, W on (fi,<^) satisfies W,'(B(0)s A)=u(A) and j£u(d!;)=0.
The measure P,/1 is similarly defined, i.e., for any measurable functional H
E»[H) =E^[HM0]
Applications and Extensions 91
1 1
with Wo = exp J x0(t)dB(t) - \ j xQ(t)dt
z
lo o
Again, x0(s) is the a priori knowledge we have of x(s). Instead of (6.10.2) we now have
(6.11.4) £?Q*(s)dB(i)] =b
where b is the vector in (6.11.1). Again, mutatis mutandum, everything is as before, with
eNHm
= eMAH)ree200>1084
IV ~ gNH
tf(X) = }(\,CX.)-(A.,b-b°)
achieves its minimum. Now things are simpler due to the orthogonality properties of the
{ek(s):-M<k<M}. Actually C=C"' is a rather simple matrix, to wit.
0 /
1
/ 0
Again, as in the previous section, the maxentropic reconstruction x*(s) of the function x(s)
92 The Method ofMaximum Entropy
where x^Q/N) and ek(i)=ek0/N)- We shall assume that each x; is the mean value of the j-th
element of a collection (X|„...,XNI) defined on Q=[-1,1]N as Xj(x)=xp and the a priori reference
measure Q(dx)=dx/2N is defined on the Borel sets of fi.
We want to find a probability P(dx) having density p(x) with respect to Q(dx), yielding a
maximum value for
^ = -^Jp(x)lnp(x)A
subject to the constraints
where X is the 2M+1 vector with components X_M ,..., X0,..., \ , and e(j) is, for each j=0,...,N-l,
the2M+l vector with components e.M(j),...,e0(j),...,eM(j).
The value of X that makes the usual maxentropic p(x) satisfy (6.11.7) is obtained by
minimizing
H„(X) = jj\nZ(X) + (X,bk)
eNHm
= eMAH)ree200>1084
IV ~ gNH
Applications and Extensions 93
* ; = [<A,e(,)>-l-tanh(X,e(0>]
Here we present a few basic results on the problem of reconstructing a time series, or to
be more precise of reconstructing a second order stationary process, from the observation of this
values at afiniteset of times. In almost any of [0.3]-[0.11] there is at least one paper on this issue,
but what follows is lifted mainly from [4.1] and [6.25]-[6.27]. For even more references and
applications see [6.28].
To establish notation, let us recall a few basic facts, the proof of which appears in chapter
9 of [6.29] and chapter 9 of [1.2].
Let {X^nsZ} be a sequence of random variables. We shall say that it is a weakly
stationary, centered process if for any n
We shall say that {Xn} is Gaussian (or {XJ is a Gaussian process) whenever for any finite
collection {n,,^,...,^} of integers and anyfinitecollection {a,,...,^} of real numbers, the random
variable a,X(nl)+a2X(n2)+...+amX(nra) has a Gaussian distribution. Some authors phrase it like:
{Xn} is a Gaussian process if and only if the vector (X(n,),...^(n.J) is a Gaussian K"1 valued
random variable.
Anyway, in terms of the Fourier transform of the distribution of (X(n,),...,X(nnl)), the
Gaussian property is stated as
£[exp/£a)fcAT/u)] = exp-j(a*,Ca)
The following two basic results show why the correlation function is important for the
reconstruction of the process {X,,}.
Theorem. (Bochner-Herglotz). There exists a positive bounded measure u. on I=r0,27t)
such that
To state the next result, we need the concept of random measure or random kernel. Let
(fi,<#",P) be a probability space, let B(7) denote the Borel a-algebra of subsets of I=[0,27c). Then
Definition. Z: B(7)xQ-»rO.°o) is the random kernel associate with the measure u on I if
and only if
a) A—>Z(A) is afinitelyadditive function and £Z(AT1) converges to Z(A) in L2(fi,dP).
b) EZ(A)=0, E[Z(A)Z(B)]=EZ(AoB)2=u(AoB).
Bochner-Herglotz theorem is to be complemented with the following.
Theorem. Let ufdV) be the spectral measure associated to RfnV Then
(6.12.3-6) Xn=\emaZ(da).
i
(6.12.4) £akX„-k = z„
fc=0
Applications and Extensions 95
where e„ is as in (I). We leave for the reader to verify that n(da)=g(a)da with
where
q(x)=I, akxk
(612.6) gia.)=±Z.R(.n)e^
Then the maxentropic distribution, compatible with the knowledge provided by (6.12.7) is
given by
where of course
Z(X) = iexp-(X,A(x))A, (X,A)=|x*A t .
and the left-hand side of (6.12.7) can be completed from the data as
~ N-k
(6.12.9-6)
5.12.9-6) Ak = j^Nfx^.
A-k=Ak, A-k=Ak
[ \H \i-j\ < M
A# =
1 0 \i-j\ > At.
g(Z)=ZX i Z*
-M
then for N » M , the eigenvalue § tends to g(Z) with Zj=exp(27tj/N+l)i and therefore
The first identity follows from computing the entropy of P,XX) given by (6.12.10), and the
limit is an exercise in Riemann integration theory. The important fact is that gj -> g(Zj), which can
be found in [6.30].
The nonlinear relationship between the R(k) and the \ is contained in
2*
R
(6.12.11) ^ = -W^2{\) = \-^rdQ ~M<k<M
0 «(.«'
The left-hand side of (6.12.11) can be computed from the data, and once the \ are known
for |k|<M, the right-hand side of (6.12.11) determines R(k) for |k|>M. Therefore, the procedure
outlined above can be applied.
Let us now consider two more, equivalent, inductive ways of getting the R(n). The first
approach consists of computing the entropy of the joint distribution of the first N+2 variables
^.....X^,., of a Gaussian process as
Remember that our problem is to find the R(k) for |k|>M. Take N>M to begin with and
assume that R(0),...,R(M) are known. It is clear that the value of R(M+1) that maximizes det A
(M+l) is the same that maximizes S(M+1). We denote also that
Since det A (M+l) is a quadratic function of R(M+1) with a negative derivative, then it
has a unique maximum. The allowed values of R(M+1) fall between the values of y(N+l) that
make the det A (M+l) zero.
Choosing the R(M+1) that maximizes det(A(M+l)) we maximize the entropy S(M+1). It
is clear then that this procedure yields the R(k) for |k|>M.
The other procedure that produces the same result is to assume that our process is an
AR(M), autoregressive process of order M, satisfying
Now, if we know R(0),...,R(M) we could use (6.12.14) to solve for R(M+1). But a simple
computation shows that if £(0)=R(0), «(l)=R(l),...rR(M)=R(M) are known, then
Applications and Extensions 99
R{\) . . R(M-l)
det l<fdelA(MH)| _ -
2 dR(N+l) 'R(M+l) ~ U
VR(M+1) . . R(l)
which, as we saw above, is the condition determining the R(M+1) that maximizes S(N+1).
Observe that the values b„...,bM can be obtained from the first M equations above. Thus if
covanances are our only information, and AR(M) process is the candidate from process having
the given covariances.
We could arrive at the same conclusion by yet another way. To wit, consider the
(differential) entropy rate denned by
S=lim S « p = i b C t « ) + J L ? ln(27rg(a))</a
g(a) = ± S^ R(n)exp-(ina.)
and we have
Theorem. The random process {XJ which maximizes the differential entropy rate S,
subject to the constraints
SN(Y1,...,YN) = -lp(yu...,yN)lnp(yi,...,yN)dy,,~,dyff
<5(Zi,...,Z w )+ Z 5,(ZtIZt-i,...,Zt.w)
Aft-i
eNHm eNHm
= e M A H =) r eeeM
2 0A0H> 1) 0r e8e42 0 0 > 1 0 8 4
IV ~ gNH IV ~ gNH
100 The Method ofMaximum Entropy
= S(XU...,XM)+ S S(Xk\Xk.u...,Xx)
M+l
= SM(X\,...,XN).
Proof: We know that for any positive f,g on 91", Jf ln(f/g)dx >0. To verify the first
inequality let pN(x) stand for fi» and put g(x)=[(27t)Ndet Cfexp-'/2(x,Cx) where C is the
correlation matrix of Y„...,YN computed with their joint density pN(x). The next one is an
application of Lemma 4.15 and the one right after follows from Lemma 4.16 (actually, a simple
variation on the theme thereof).
The following identity is obtained when we exchange the Gaussian families. The next to
the last identity follows from the Markov property and the last step is justified by the same
reasoning that implied the second step. Therefore
which almost completes the proof. It only remains to show that {XJ satisfying (6.12.13) is a
Gauss-Markov process having the correct correlations and its spectral density is obtained
inverting (6.12.3-a) for the appropriate R(n)'s.
(6.13.1) x(t)=flt)-\K(t,s)x{s)ds
a
where, to make things easy we assume a<s, t<b and the regularity assumptions needed on f and K
will become clear below. This set up can be extended in several obvious ways.
Let {M„(t): n> 1} be a collection of linearly independent functions. Multiply both sides of
(6.13.1) by M^t), integrate over (a,b) with respect to dx (or any appropriate m(dt)) and obtain
(613.2) a„=jx(s)G„(.s)ds
Applications and Extensions 101
where
(6.13.3-a) a„=\Mn(t)fls)dt
a
Here, we see what the minimal assumptions on f and K are. The functions f, K, M„ have to
be such that the integrals above exist, that all exchanges of integrals make sense. What else?.
Under the assumption of positivity on x(t), Mead replaced the problem of solving (6.13.1)
by the problem offindinga maxentropic solution x^t) maximizing
- j pit) Inptfdl
a
where the \ , i=l,...,M are such that (6.13.3-a) holds. A few examples are provided in [6.31] as
well.
Here we mix approaches a bit to further illustrate maxentropic reconstruction techniques.
To begin with consider the discretized version of (6.13.2)
(6.13.5) n = A E «(i)x(i-l)
/=i
where A = (b-a)/N, <bji) = G„(i-1), n = 1,...,M. We shall consider on fi-»SRw the Gaussian
density p0(£) = exp-(5°/2)/[27t]W2 which makes the coordinate maps X, : Cl -> 91, X ^ ) = £,
independent, centered Gaussian random variables with covariance EjfXjXj] = A8a.
Given an initial guess x„(i) , i = 0,...,N-1, of x(i) we introduce a new auxiliary measure
P,(d£) on £1 such that
with respect to P„ we have E/Xj) = x„(j-l)A. We now ask for a measure dP(£), having density
p(i;) with respect to dP„(4), yielding a maximum for SPa(P) over the set of P' such that
»(z ©(0*,) = i
And when N tends to infinity and i/N tends to t via an appropriate sequence, we obtain
x(t)=xo(t)-(\>Ht))
b
C„m =| G„(s)Gm(s)ds.
Applications and Extensions 103
Maxentropic Image Reconstruction methods have made it to movies and may have,
perhaps, contributed a lot to popularize maximum entropy. The references [6.32]-[6.37] are to
serve as starting or guide to literature. Below we present a variation on the theme, in which the
set up is taken from the literature, but we apply to it a level 2 reconstruction technique.
The standard formulation of the problem consists of assuming the (compact) domain
containing the picture to be divided into N cells and imagining the intensity Cn in the n-th cell to
be superposition of the impinging unknown intensities x(j) according to a blurring function b„_j. To
make things worse, there is noise contaminating the background in an additive way. Thus Cn is
actually
where the vn describe the noise measured in the n-th cell. The stochastic nature of the vn is part of
the data, or of the assumed a priori knowledge. Here we shall assume the vn to be centered,
Gaussian random variables with variance ak.
We will assume the x, to be the mean values of random variables Xi with respect to a
distribution dP(l;) on 9?w, and we shall consider an a priori distribution dP0(O on SR^ with respect
to which the X, are independent and gamma distributed as
To deal with the random nature of the constraint, notice that (6.14.1) implies that
Now, this is the set up dealt with in section (6.1) there we proved that the maxentropic
dP*(4) was such that
i P ® = (Z(J))- l «iH>.BQ* , o©
where of course n; = SBj, X = Sb^, X.j. Notice that we let the parameters a and p depend on i to
allow for different "illuminations" of the picture. (By the way, perhaps more physically reasonable
candidates for the a priori distribution could be used. Different situations may merit doing so.)
The value of X that makes dP(£) satisfy the constraints can be found by minimizing
2
(6.14.4) H(X) = lnZ(X) + <X,C) + Xo.95(s Xja/j
eNHm
= eMAH)ree200>1084
IV ~ gNH
Just for the fun of it, had we assumed that each X, is uniformly distributed on [0,M], M for
maximum, then, in this case
eNHm
= eMAH)ree200>1084
IV ~ gNH
and, once the corresponding version of (6.14.4) had been minimized for X, the corresponding
maxentropic image is
For any set c we denote by |c| the cardinality of c. Let X(t) denote the state of the system
at time t, we shall define the transition matrix by
(6.15.1)
N
k=\
*)\ + Vk\A(t,*)l) y=*
Q*y = vk y*x, yeB(k,x)
uk y*x, y e A(k,x)
0 otherwise
where the rate function Vk is assumed positive and, it is interpreted as the mean service or
discharge time and Uk is the mean arrival time.
106 The Method ofMaximum Entropy
for all xs S.
To guess a candidate for q(x) we invoke the following
Lemma 6.15.3. The probability distribution on S that maximizes
S(q) = -£q(x)]nq(x)
xeS
subject to the constraint Syk(x)q(x)=mk (the mean number of individuals of type k) is given by
q(x)=n/rtZ(y).
k=l
Proof: Do as usual but denote exp-Xj by y; where X^ is the usual Lagrange multiplier.
If we substitute the q(x) given by the lemma in (6.15.2) we see that the candidate for yk is
(V/UJ thus we arrive at
Lemma 6.15.4. The equilibrium distribution on S satisfying (6.15.2) is
eNHm
= eMAH)ree200>1084
IV ~ gNH
where
z(u,v) = s nf^-)1
For a bunch of nice applications of these ideas the reader is directed to [6.40].
Apparently it was E. Schrodinger in 1931 who first set up the problem offindingthe p(ij)
= P(X = i,Y = j) as "similar or close" to a given P0(i j) such that the marginals
are known before hand. Take a look at [6.41]-[6.42] for some history on this problem and for its
analysis without using max-ent procedures.
Here the closeness between p(ij) and p0(ij) will be measured in terms of
Actually, instead of I, we should put al„ where a; is the (known) fraction of income of
group i spend on goods.
Certainly (l-ajlj is saved or invested and we assume it does not determine the
consumption pattern.
The following two examples were reviewed in [0.3]. The first one consists of assuming P^
to be the number of trips between origin i and destination j , i<M, j<N. The number of trips
originating from i is known to be 0> and the number of trips coming into destination j is known to
108 The Method ofMaximum Entropy
be Dj. The trip pattern is useful for urban planners when deciding where to build roads, gas
stations or whatever.
The second similar situation we describe consists of the problem of determining an
international trade pattern P^ measuring the amount of commerce between country i and country j ,
when the only assumed informations are the total exports of country i and the total imports of
country j .
We direct the reader to [0.3] for original references and for a description of how to
convert the reconstruction problem into the problem of finding a density given its marginals.
Below we will do it as an application of the level 2 procedure.
If we introduce the constraints F_i(n,m) = δ_in, G_j(n,m) = δ_jm, then E_P F_i = f_i, E_P G_j = g_j, and the candidate for maximizing S_{P_0}(P), or minimizing the K(P,P_0) given by (6.17.1), is
P(n,m) = Z(λ,u)^(-1) exp(−λ_n − u_m) P_0(n,m),
where λ_i, i = 1,...,M, are the Lagrange multipliers corresponding to the constraints E_P(F_i) = f_i and the u_j are similarly defined. Also,
Z(λ, u) = Σ_{n,m} e^{−λ_n} e^{−u_m} P_0(n,m).
If we set Φ_n = exp(−λ_n) and ψ_m = exp(−u_m), then the Φ's and ψ's are to be determined by solving
(6.16.3)  Φ_n Σ_m ψ_m P_0(n,m) = f_n Z(Φ,ψ),   ψ_m Σ_n Φ_n P_0(n,m) = g_m Z(Φ,ψ),
or equivalently, by minimizing
H(Φ,ψ) = ln Σ_{n,m} Φ_n ψ_m P_0(n,m) − Σ_n f_n ln Φ_n − Σ_m g_m ln ψ_m.
When P_0(i,j) = P_1(i) P_2(j) this problem has the obvious solution Φ_i = f_i/P_1(i), ψ_j = g_j/P_2(j), which yields the obvious P(i,j) = f_i g_j as an answer. The set (6.16.3) can be "simplified" a bit by setting up a max-ent problem to determine P(i|j) = P(i,j)/g(j) or P(j|i) = P(i,j)/f(i).
To begin with, note that P(m|n) satisfies the constraints Σ_m P(m|n) = 1 and Σ_n f_n P(m|n) = g_m, and the corresponding maxentropic candidate is
P(m|n) = Z_n^(-1) P_0(n,m) exp(−u_m).
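A minimal computational sketch (not from the book) of solving the marginal-matching system (6.16.3) for the scaling factors Φ_n, ψ_m by alternately enforcing the row and column marginals, an iterative proportional fitting / Sinkhorn-type scheme; the prior P_0 and the marginals f, g below are illustrative assumptions:

```python
import numpy as np

P0 = np.array([[0.2, 0.1, 0.1],
               [0.1, 0.3, 0.2]])       # prior guess P_0(n, m) (assumed)
f = np.array([0.5, 0.5])               # target row marginals f_n (assumed)
g = np.array([0.3, 0.4, 0.3])          # target column marginals g_m (assumed)

phi = np.ones(P0.shape[0])
psi = np.ones(P0.shape[1])
for _ in range(200):
    phi = f / (P0 @ psi)               # enforce sum_m P(n,m) = f_n
    psi = g / (phi @ P0)               # enforce sum_n P(n,m) = g_m
P = phi[:, None] * P0 * psi[None, :]   # the normalizing factor Z(Phi, psi) is absorbed here
print("row marginals:", P.sum(axis=1), "column marginals:", P.sum(axis=0))
```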
To treat the problem as a reconstruction, assume that the X in (6.17.2) are positive and let B_ij = min(f_i, g_j). Thus 0 ≤ P_ij ≤ B_ij. We shall consider a collection of random variables X_ij, each taking values in [0, B_ij], uniformly distributed there, and mutually independent relative to a law P_0.
The entropy to be maximized is
S_{P_0}(P) = −∫ (dP/dP_0) ln(dP/dP_0) dP_0,
subject to the constraints
Σ_j E_P X_ij = f_i,   Σ_i E_P X_ij = g_j.
The corresponding partition function factors as
Z(λ, μ) = Π_{i,j} (1 − e^{−(λ_i+μ_j)B_ij}) / ((λ_i+μ_j)B_ij),
and the multipliers are obtained by minimizing the dual function
H(λ, μ) = Σ_{i,j} ln[(1 − e^{−(λ_i+μ_j)B_ij}) / ((λ_i+μ_j)B_ij)] + Σ_i λ_i f_i + Σ_j μ_j g_j.
The problem of reconstructing a measure defined on an interval (a,b) with −∞ < a < b < ∞ has a long history and a lot of mathematics has been devoted to it. See the nice review in [6.45].
Here we will lift some results from [6.47] and [6.48] on the convergence of maxentropic
estimates, and we direct the readers' attention to [6.49]-[6.50] for related results using more
functional analytic techniques and to [5.5]-[5.7] for a more probabilistic approach.
We shall denote by P the set of probability densities on [0,1]. It is known that
Theorem 6.17.1. Given a sequence {μ_n} of positive numbers such that μ_0 = 1, there exists a bounded f ∈ P such that
∫_0^1 f(x) x^n dx = μ_n,  n ≥ 0,
if and only if {μ_n} is completely strictly monotonic and there is a constant M such that
When the strict inequalities are replaced by ≥ 0 we obtain a completely monotonic sequence, and we know that a measure μ(dx) exists on B[0,1] such that
∫_0^1 x^n dμ(x) = μ_n,  n ≥ 0.
For all about this see Widder's [6.51].
For a given f ∈ P having moments {μ_n} we set P_n(f) = {g ∈ P : ∫_0^1 x^k g(x) dx = μ_k, k = 0,1,...,n}, and if we put S(f) = −∫ f(x) ln f(x) dx for f ∈ P, then a sequence f_n ∈ P_n(f) maximizing S over P_n(f) is called a sequence of maximum entropy estimators for the moment problem.
Each f_n(x) is of the form
f_n(x) = exp(−Σ_{k=0}^n λ_k x^k),
where λ_0 = ln Z_n(λ),
Z_n(λ) = ∫_0^1 exp(−Σ_{k=1}^n λ_k x^k) dx,
and the multipliers are obtained by minimizing
H_n(λ) = ln Z_n(λ) + Σ_{k=1}^n λ_k μ_k.
Comment. To be consistent we should have written λ^(n) = (λ_1^(n), ..., λ_n^(n)), but what the heck.
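As a minimal numerical sketch of the construction above (not from [6.47]; the quadrature grid, the optimizer and the test density are assumptions), one can recover f_n by minimizing the dual H_n(λ) directly:

```python
import numpy as np
from scipy.optimize import minimize

grid = np.linspace(0.0, 1.0, 2001)
dx = grid[1] - grid[0]
integrate = lambda y: float(np.sum(y) * dx)          # crude quadrature on [0,1]

def maxent_density(mu):
    """mu = [mu_1, ..., mu_n]: moments of a density on [0,1] (mu_0 = 1 understood)."""
    powers = np.vstack([grid**k for k in range(1, len(mu) + 1)])
    dual = lambda lam: np.log(integrate(np.exp(-lam @ powers))) + lam @ np.asarray(mu)
    lam = minimize(dual, np.zeros(len(mu)), method="Nelder-Mead",
                   options={"maxiter": 5000}).x
    f = np.exp(-lam @ powers)
    return f / integrate(f)

# Reconstruct f(x) = 2x from its first three moments 2/3, 1/2, 2/5 (test case, assumed).
f3 = maxent_density([2/3, 1/2, 2/5])
print("moments of f_3:", [round(integrate(f3 * grid**k), 4) for k in (1, 2, 3)])
```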
The following first Lemma is proved in [6.47]
Lemma 6.17.2. A necessary and sufficient condition for H_n(λ) to have an absolute minimum is that {μ_n} is completely strictly monotonic.
The idea of the proof is to write λ = ρu with ‖u‖ = 1, ρ ≥ 0, and rewrite H_n(λ) as
H_n(λ) = ln ∫_0^1 dx exp(ρ F_n(x))
with
F_n(x) = Σ_{k=1}^n u_k (μ_k − x^k).
Comment. This proof has a drawback: it needs the existence and complete strict
monotonicity of the whole sequence of moments. The reader should go to the references of [6.19]
to see how to proceed when only μ_0, ..., μ_n are known. The next nice result in [6.47] is
Theorem 6.17.3. Let f(x) be a non-negative integrable function on [0,1] having moments μ_0, μ_1, .... (Dividing f(x) by μ_0 we obtain a density.) Now let f_n(x) be the maxentropic densities described above. Then for any bounded function F(x) on [0,1]
lim_{n→∞} ∫_0^1 f_n(x) F(x) dx = ∫_0^1 f(x) F(x) dx.
The proof rests on the functions
ψ_n(x) = ∫_0^x (f_n(t) − f(t)) dt.
Since the (with respect to dx) absolutely continuous measures of finite total variation lie in the dual of the bounded functions on [0,1], and the unit sphere there is weak-*-compact, given the ψ_n(x) there exists a subsequence, denoted by ψ_n(x) again, and a function of finite total variation ψ(x) such that ψ_n(x) → ψ(x). Since for all k
0 = ∫_0^1 x^k dψ_n(x) → ∫_0^1 x^k dψ(x),
and since the right-hand side is therefore zero for all k, the uniqueness of the moment problem asserts that ψ(x) = 0 on [0,1], which amounts to what we want.
Note that, given the special form of f_n(x) and the fact that f_n(x) and f(x) have the same moments up to order n, we have S(f_n) ≥ S(f). Actually, when n_1 > n, we also have S(f_{n_1}) ≤ S(f_n) (which can also be guessed from the fact that P_{n_1}(f) ⊂ P_n(f)). The gist of [6.48] is to prove
Theorem 6.17.4. Let f be a bounded density. Then the maxentropic sequence fn introduced
above satisfies
In view of the results of section 6.2 and of Lemma 6.17.2, if we attempted to reconstruct f(x) by a max-ent procedure we would have to prove that (for example)
H(λ) = ln ∫_0^1 dx exp(−Σ_{n=1}^∞ λ_n x^n) + Σ_{n=1}^∞ λ_n μ_n
achieves a minimum over some appropriate (candidate?) set of infinite dimensional λ's.
Suppose group i consists of n_i individuals with income I_i, each paying the fraction f_i of it as tax, so that the total revenue is
(6.18.1)  T = Σ_i f_i n_i I_i,
with
a_1 I_1 < a_2 I_2 < ... < a_n I_n and a_i I_i < I_i.
In this fashion a maximum tax 1 − a_i is preassigned which, if applied, does not make the richer poorer.
The question is how to raise T within these constraints, that is, how to find f_i such that (6.18.1) holds and the bound 0 ≤ f_i ≤ 1 − a_i is satisfied. This is a level 2 reconstruction problem as described in section 6.3, but let us redo it
from scratch here.
Define on Ω = [0,1−a_1]×...×[0,1−a_n] the reference measure dQ = Π_i dx_i/(1−a_i). The coordinate maps X_i(x) = x_i are independent and each is uniformly distributed in the corresponding interval. The partition function corresponding to the constraint
E_P Σ_i n_i X_i I_i = T
is given by
Z(λ) = Π_i (1 − e^{−λ M_i})/(λ M_i),
where we set M_i = n_i I_i (1 − a_i) for brevity. Correspondingly, since −d ln Z(λ)/dλ = T we conclude that the maxentropic tax fractions are
f_i = E_P[X_i] = 1/(λ n_i I_i) − (1 − a_i)/(e^{λ M_i} − 1).
For such a λ to exist, your representatives will have to find a_i's such that
T < Σ_i n_i I_i (1 − a_i).
Note that for every x > 0, 0 < 1/x − 1/(e^x − 1) < 1; thus 0 ≤ f_i/(1 − a_i) ≤ 1 and the constraints are satisfied.
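A minimal numerical sketch (not from the book): solve −d ln Z(λ)/dλ = T for the single multiplier λ and recover the maxentropic tax fractions f_i = E_P[X_i]. The populations n_i, incomes I_i, fractions a_i and the target T below are assumptions.

```python
import numpy as np
from scipy.optimize import brentq

n = np.array([1000.0, 500.0, 100.0])    # people per income group (assumed)
I = np.array([10.0, 30.0, 100.0])       # income per person (assumed)
a = np.array([0.9, 0.7, 0.5])           # fraction of income spent on goods (assumed)
T = 5000.0                              # desired revenue; must be < sum n_i I_i (1-a_i)
M = n * I * (1.0 - a)

def mean_tax_fraction(lam):
    # E_P[X_i] for the exponentially tilted uniform law on [0, 1-a_i].
    if abs(lam) < 1e-12:
        return (1.0 - a) / 2.0
    t = np.clip(lam * M, -700.0, 700.0)                  # avoid overflow in expm1
    return 1.0 / (lam * n * I) - (1.0 - a) / np.expm1(t)

revenue = lambda lam: float(np.sum(n * I * mean_tax_fraction(lam)))
lam_star = brentq(lambda l: revenue(l) - T, -1.0, 1.0)   # revenue is decreasing in lambda
f = mean_tax_fraction(lam_star)
print("lambda* =", lam_star, " f_i =", f, " revenue =", revenue(lam_star))
```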
Comment. The next three sections were written by Aldo Tagliani, and contain a summary
of a line of research bearing on an important practical issue: are there a priori restrictions to be
satisfied by the moments of a distribution when only a finite number of them are used in the
reconstruction problem?
The moment problem on the semi-infinite interval [0,+∞) consists of producing a positive density p(x) such that
(6.19.1)  ∫_0^∞ x^n p(x) dx = μ_n,  n ≥ 0.
It is known that such a p(x) exists if and only if the Hankel determinants
(6.19.2)  μ_0,  det[μ_0 μ_1; μ_1 μ_2],  det[μ_0 μ_1 μ_2; μ_1 μ_2 μ_3; μ_2 μ_3 μ_4], ...,
          μ_1,  det[μ_1 μ_2; μ_2 μ_3],  det[μ_1 μ_2 μ_3; μ_2 μ_3 μ_4; μ_3 μ_4 μ_5], ...
are positive.
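A minimal sketch (not from the book) of checking the positivity of the Hankel determinants (6.19.2) for a finite moment sequence; the test moments below are those of e^{-x} on [0,+∞) and are only an illustration.

```python
import numpy as np

def hankel_positive(mu):
    """True if all determinants det(mu_{i+j+shift}), shift = 0 and 1, are positive."""
    ok = True
    for shift in (0, 1):                       # the two families displayed in (6.19.2)
        n_max = (len(mu) - 1 - shift) // 2
        for n in range(n_max + 1):
            H = [[mu[i + j + shift] for j in range(n + 1)] for i in range(n + 1)]
            ok = ok and np.linalg.det(np.array(H)) > 0.0
    return ok

mu = [1.0, 1.0, 2.0, 6.0, 24.0, 120.0]        # mu_n = n! for the exponential density
print(hankel_positive(mu))                    # expected: True
```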
If we consider the problem of finding a positive p(x) on [0,+∞) such that (6.19.1) holds only for n = 0,1,...,N, then the standard maxentropic reconstruction procedure suggests that we look for λ_0, λ_1, ..., λ_N such that
P_N(x) = exp(−Σ_{j=0}^N λ_j x^j)
satisfies
∫_0^∞ x^n P_N(x) dx = μ_n,  n = 0,1,...,N,
the μ_n being normalized moments. Introducing the standard adimensional statistical parameters (variation, skewness, kurtosis, ...) γ, ν, κ, ... we can rewrite the μ_n in terms of them, which can be used to obtain the moments μ_j for j > N as functions of the first N moments and the Lagrange multipliers λ.
We leave it to the reader to work out the details for the case N=1. The case N=2 was dealt with by Dowson and Wragg in [6.53]-[6.54], based on previous work by Barrow and Cohen [6.55]. They introduced the Mill's function B(x), which satisfies B′ = B(B − x) and B″ = B′(2B − x) − 1, and they prove that there is an x for which the corresponding ratio of the moments, expressed through B(x) and (B − x)², has to satisfy the condition (6.19.7).
In other words, when N=2 the positivity of the Hankel determinants (6.19.2) is only a
necessary condition for the existence of PN(x). The condition (6.19.7) must be imposed to obtain
the existence of PN(x) satisfying (6.19.3)' and (6.19.4)'.
The cases N=3 and N=4 were discussed by Tagliani in [6.56] extending previous work in
[6.38]-[6.39]. The computations and arguments are intricate. The case N=3 is briefly
summarized and the results for N=4 have just been presented.
The basic philosophy consists of transforming (6.19.4)' or (6.19.5) into a system of differential equations by varying continuously one of the moments, say μ_N, and keeping the others constant. The dependence of λ_0, λ_1, ..., λ_N on μ_N is studied, and of particular interest is the range of μ_N making λ_N positive.
From now on we shall write κ_i, i ≥ 1, for the standard statistical coefficients γ, ν, κ, etc., and we shall denote by D(κ_i, N) the domain of acceptable values of the coefficient κ_i when N moments are preassigned. It was proved in [6.56] that
From these we see that if we let κ_2, κ_3, ..., κ_{N−1} become arbitrarily large, then so does the admissible range of κ_N; this is obviously interpreted as saying that if for a particular value of N, say N*, none of the coefficients κ_1, ..., κ_{N*} admits an upper bound, then for any N > N* the coefficients κ_1, ..., κ_N are unbounded as well.
In other words, if for given μ_0, ..., μ_{N*} a P_{N*}(x) satisfying (6.19.3)' and (6.19.4)' exists, then P_N(x) exists for N > N*, and (6.19.8) represents a necessary and sufficient condition for the existence of a maximum entropy reconstruction of a density with the first N moments preassigned.
Let us now look at the details for N=3. In this case γ and ν are preassigned and we want to determine D(γ,3) and D(ν,3).
It can be seen that D(γ,3) = (0,∞) but D(ν,3) depends on the value of γ. From Schwarz's inequality, written as
μ_j² ≤ μ_{j−1} μ_{j+1},  j ≥ 2,
one obtains γ − 1/γ < ν, so that
D(γ,3) = (0,+∞),
with the final comment that the positivity of the Hankel determinants (6.19.2) represents a necessary and sufficient condition for the existence of P_3(x) if γ > 1, but only a necessary condition for γ < 1.
Let us take a brief look at the case N=4, where γ, ν, κ are preassigned. From (6.19.8) we obtain that
(6.19.11)  D(γ,4) = (0,∞),   D(ν,4) = (γ − 1/γ, +∞)  when γ > 1,
but D(ν,4) for γ < 1 and D(κ,4) are yet to be determined. After a quite cumbersome analysis one arrives at
(6.19.12)  D(ν,4) = (γ − 1/γ, +∞),  γ < 1,   D(κ,4) = (1 + ν², +∞),
where the quantity 1 + ν² is related to the positivity of the Hankel determinant
det[μ_0 μ_1 μ_2; μ_1 μ_2 μ_3; μ_2 μ_3 μ_4].
These results solve completely the case N=4. No upper bound exists for the coefficients γ, ν, κ, and the positivity of the Hankel determinants (6.19.2) represents a necessary and sufficient condition for the existence of P_4(x).
By taking into account (6.19.8) and (6.19.9) the same is valid for N>4. We summarize the
discourse of this section in
Theorem. Given a sequence μ_0, μ_1, ..., μ_N, N ≥ 4, of positive numbers, a necessary and sufficient condition for the existence of P_N(x) is the positivity of the Hankel determinants (6.19.2).
For N=2 or N=3, the positivity of the determinants is only a necessary condition and auxiliary
constraints have to be introduced for the existence of PN(x).
Here we present a brief description of the results in [6.57], and we will be concerned with finding a density P_N(x) on (−∞,∞) whose first moments μ_0, ..., μ_N are assigned. That is, we want
P_N(x) = exp(−Σ_{j=0}^N λ_j x^j)
such that
(6.20.1)  ∫_{−∞}^{∞} x^n P_N(x) dx = μ_n,  n = 0,1,...,N.
To begin with, let us recall the result in [6.51] asserting the existence of a solution to the moment problem.
Theorem. Given a sequence {μ_n : n ≥ 0} of numbers such that μ_0 = 1, there will exist a positive measurable density p(x) on (−∞, ∞) with ∫ x^n p(x) dx = μ_n, n ≥ 0, whenever the Hankel determinants
(6.20.2)  μ_0,  det[μ_0 μ_1; μ_1 μ_2],  det[μ_0 μ_1 μ_2; μ_1 μ_2 μ_3; μ_2 μ_3 μ_4], ...
are positive.
As in the Stieltjes case, the even moments can be expressed in terms of adimensional coefficients γ, ν, κ, ... (also labeled κ_i below). And as above we want to determine what restrictions, besides the positivity of (6.20.2), the finite size of the moment problem imposes.
Begin with N=2. This case was analyzed by Powles and Carranza in [6.58]. After transforming (6.20.2) into a differential equation solved with the aid of Weber functions, they obtain that
1 < μ_2/μ_1² < 3,
or equivalently
(6.20.3)  0 < γ < √2.
(6.20.5)  D(κ_j, N) ⊂ D(κ_j, N+1),
and as before, when κ_1, ..., κ_j are unbounded, then so are κ_{j+1}, κ_{j+2}, and so on.
The results for N=3 and N=4 for the symmetric case are similar to the corresponding
results for the Stieltjes case and are obtained by Tagliani in [6.39].
Let us now look at the general, non-symmetric, case. As usual begin with the case N=2.
Here we have
and hence
Now consider N=4 when γ, ν, κ are preassigned. Again (6.21.1), or its equivalent (6.20.5), is transformed into a system of differential equations. Examination of this solution yields
(6.20.8)  D(ν,4) = (−∞,+∞),   D(κ,4) = (1 + ν², +∞).
In many cases when applying statistical modeling in applied sciences, the analytical
representation of probability distributions is essentially empirical.
Experience suggests that whenever a particular mathematical distribution gives a good fit
to experimental data under limited information, it is also reasonable to base the estimation of
probabilities on the maximum entropy method. See for example the work by Siddall and Diab
[6.59]. From their work one would conclude that almost all well known analytical distributions
can be accurately reconstructed from the knowledge of their first four or five moments.
In other words, the probabilistic nature of the random variable can be reasonably well captured by
these moments. Or, the question whether the first few moments are a good representation of the information supplied by the sample data appears to have a positive answer.
In general, however, the result of the testing program is a set of N measured values {x_1,...,x_N} rather than a set of population moments. From these one could compute N independent sample moments by
μ̂_k = (1/N) Σ_{i=1}^N x_i^k,  k = 1,...,N.
And in applications it is frequently assumed that the unknown population moments μ_k can be replaced by the known sample moments μ̂_k.
By replacing μ_k by μ̂_k it would appear that the entropy, which is presumably a measure of
information, does not depend on the number of tests used to compute the sample moments.
Besides that, it is not clear how many moments should be included as constraints in the maximum
entropy formalism when the available information is a sample of N measured values.
Such questions have been raised by Baker in [6.60] in a vivid paper, and his approach is applied to the case of a random variable taking values in D ⊂ ℝ (typically D = [0,1], as in the Hausdorff moment case).
Making use of Kullback's relative information, we would obtain p*(x) as the corresponding maxentropic density. With this,
a) one solves for the {λ_0,...,λ_M} for different M as usual;
b) the "best" number of moments corresponds to that value of M making (6.21.2) smallest.
REFERENCES
[6.19] Ulrych, T., Bassrei, A. and Lane, M. "Minimum relative entropy inversion of 1D data with applications". Geoph. Prosp. 38, pp. 465-487, 1990.
[6.20] Rietsch, E. "The maximum entropy approach to the inversion of 1D seismograms". Geoph.
Prosp. 36, pp. 365-382, 1988.
[6.21] Aichelin, J. and Huefner, J. "Fragmentation reactions on nuclei: condensation of vapor,
shattering of glass". Phys. Lett. 136 B. pp. 15-17, 1984.
[6.22] Varadhan, S.R.S. "Diffusion problems and partial differential equations". Tata Lect. Notes No. 64, Springer-Verlag, Berlin, 1980.
[6.23] Gassiat, E. "Problème sommatoire par maximum d'entropie". C.R.A.S. Paris, t. 303, Série I, pp. 675-680, 1986.
[6.24] Landau, H. J. "Maximum entropy and the moment problem". Bull. Am. Math. Soc. 16, pp. 47-77, 1987.
[6.25] Choi, B.S and Cover, T. M. "An information theoretic proof of Burg's maximum entropy
spectrum". Proc. IEEE. 72, pp. 1094-1096, 1984.
[6.26] Grandell, J., Hamrud, H. and Toll, P. "A remark on the correspondence between the max-entropy method and the autoregressive model". IEEE Trans. Inf. Th. IT-26, pp. 750-751, 1980.
[6.27] Van den Bos, A. "Alternative interpretation of maximum entropy spectral analysis". IEEE.
Trans. Inf. Th. IT-17. pp. 493-494, 1971.
[6.28] Lin, Dh. and Wong, E.K. "A survey on the maximum entropy method and parameter
spectral estimation". Phys. Reports. North Holland, 193. pp. 41-135, 1990.
[6.29] Karlin, S. and Taylor, H.M. "A first course in stochastic processes". 2nd Ed., Acad. Press, New York, 1975.
[6.30] Grenander, U. and Szegő, G. "Toeplitz forms and their applications". Univ. Calif. Press, Berkeley, 1958.
[6.31] Mead, L. R. "Approximate solution of Fredholm integral equations by the maximum entropy method". Jour. Math. Phys. 27, pp. 2903-2907, 1986.
[6.32] Bryan, L. K. and Skilling, J. "Deconvolution by maximum entropy, as illustrated by application to the jet of M87". Mon. Not. R. Astr. Soc. 191, pp. 69-79, 1980.
[6.33] Birch, S. F., Gull, S. F. and Skilling, J. "Image restoration by a powerful maximum
entropy method" Comp. Vis. Graph and Im. Proc. 23, pp. 113-128, 1983.
[6.34] Wernecke, S. J. and D'Addario, L. R. "Maximum entropy image reconstruction". IEEE Trans. Comp. C-26, pp. 351-369, 1977.
[6.35] Geman, D. and Geman, S. "Bayesian image analysis". NATO ASI Series F20, Disord. Syst. and Biol. Organiz., Springer-Verlag, Berlin, 1986.
[6.36] Zuang, X., Ostelvold, E. and Haralick, R. M. "A differential equation approach to maximum entropy image reconstruction". IEEE ASSP 35, pp. 208-218, 1987.
[6.37] Elfving, T. "An algorithm for maximum entropy image reconstruction from noisy data".
Math. Comp. Modeling, 12, pp. 729-745, 1989.
[6.38] Rosenblueth, E., Karmesh and Hong, H. P. "Maximum entropy and discretization of probability distributions". Probab. Engin. Mech. 2, pp. 58-63, 1987.
[6.39] Tagliani, A. "On the existence of maximum entropy distributions with four or more
assigned moments". Probab. Engin. Mech.
[6.40] Ferdinand, A. E. "A statistical mechanical approach to systems analysis". I.B.M. Jour. Res.
Dev. Sept., pp. 539-547, 1970.
[6.41] Jamison, B. "A Martin boundary interpretation of the maximum entropy method". Zeit. f.
Warsch. 30, pp. 265-272, 1974.
[6.42] Jamison, B. "Reciprocal processes". Zeit f Warsch. 30, pp. 65-86, 1974.
[6.43] Aebi, R. and Nagasawa, M. "Large deviations and the propagation of chaos for
Schroedinger processes". Zeit. f. Warsch. 94, pp. 53-68, 1992.
[6.44] Arnold, G. S. and Kinsey, J. L. "Information theory for marginal distributions applications
to energy disposal in an exothermic reaction". Jour. Chem. Phys. 67, pp. 3530-3532,
1977.
[6.45] Rebick, C., Levine, R. D. and Bernstein, R. B. "Energy requirements and energy disposal".
Jour. Chem. Phys. 60, pp. 4977-4989, 1974.
[6.46] Landau, H. J. "Maximum entropy and the moment problem". Bull. Am. Math. Soc. 16, pp.
47-71, 1987.
[6.47] Mead, L. R. and Papanicolau, N. "Maximum entropy in the problem of moments". Jour.
Math. Phys. 25, pp. 2404-2417, 1984.
[6.48] Forte, B., Hughes, N. and Pales, Z. "Maximum entropy and the problem of moments"
Rendiconti di Matematica, Serie VII, 9, pp. 689-699, 1989.
[6.49] Borwein, J. M. and Lewis, A. S. "Convergence of best entropy estimates". SIAM Jour. Optim. 1, pp. 191-205, 1991.
[6.50] Lewis, A. S. "The convergence of entropic estimates for moment problems". Workshop
on Functional Analysis/Optimization. Fitzpatrick S. and Giles, J. Eds, Centre for Mathem.
Analysis, Australian Nat. Univ. Canberra, pp. 100-115, 1988.
[6.51] Widder, D. V. "The Laplace transform". Princeton Univ. Press, Princeton, 1946.
[6.52] Theil, H. "Economics and information theory". North Holland, Amsterdam, 1967.
[6.53] Dowson, D. C. and Wragg, A. "Maximum entropy distributions having prescribed first and second moments". IEEE IT-19, pp. 689-693, 1973.
[6.54] Wragg, A. and Dowson, D.C. "Fitting continuous probability density functions over [0,∞) using information theory ideas". IEEE IT-16, pp. 220-230, 1970.
[6.55] Barrow, D. F. and Cohen, A. C. "On some functions involving Mill's ratio". Ann. Math.
Statistics, 25, pp. 405-408, 1954.
[6.56] Tagliani, A. "On the application of maximum entropy to the problem of moments". Jour.
Math. Phys. 34, pp. 326-337, 1993.
[6.57] Tagliani, A. "Maximum entropy in the Hamburger moments problem". Submitted to Jour.
Math. Phys. 1993.
[6.58] Powles, J. G. and Carranza, B. "An information theory of nuclear magnetic resonance". In Magnetic Resonance, Coogan, C. K., Ed., pp. 133-161, 1970.
[6.59] Siddall, J. N. and Diab, Y. "The use in probabilistic design of probability curves generated
by maximizing the Shannon entropy function constrained by moments". Jour. of Engin. for Industry, A.S.M.E. 97, pp. 843-852, 1975.
[6.60] Baker, R. "Probability estimation and information principles" Structural Stability. 9, pp.
97-116, 1990.
Chapter 7
The following is taken almost literally from [7.1]. It comprises the very basic results in the
theory of large deviations. In that reference you will find quite a lot about the subject and its
applications to statistical mechanics. Also, check with [7.2] for more.
We shall consider a probability space (Ω, F, P) on which we have defined a family of independent, identically distributed random variables {X_n : n ≥ 1} taking values on a finite set S = {x_1,...,x_N}.
It is clear that any measure ρ on S (equipped with the σ-algebra P(S), the class of all subsets of S) can be written as
ρ(A) = Σ_i p_i δ_{x_i}(A).
Define the empirical frequencies
(7.1)  L_{n,i}(ω) = (1/n) Σ_{j=1}^n δ_{X_j(ω)}({x_i}),
where the ω ∈ Ω is written to emphasize that the L_{n,i} are random variables which count, for each realization X_j(ω) of the process, the frequency with which the sequence X_j(ω) takes the value x_i, and set
L_n(A) = Σ_i L_{n,i} δ_{x_i}(A).
Note that
E X_1 = Σ_i x_i p_i = m_ρ,
and the summands in L_n are independent, identically distributed random variables taking values in the set of all probability measures on S.
According to the law of large numbers, for any ε > 0 the following limits hold true (with S_n = X_1 + ... + X_n):
P(|S_n/n − m_ρ| ≥ ε) → 0   and   P(max_i |L_{n,i} − p_i| ≥ ε) → 0   as n → ∞,
where the vector (p_1,...,p_N) is the limit of the random vector (L_{n,1},...,L_{n,N}).
To begin with we shall consider the fluctuations of L_{n,i} about their means when S = {0,1}, or if you will, for the head and tail game. All the basic results and techniques already appear in this case, in which counting is simpler.
For the time being set S = {0,1}, ρ = (1/2)(δ_0 + δ_1), p_0 = p_1 = 1/2, L_{n,0} = 1 − S_n/n, L_{n,1} = S_n/n.
Therefore |L_{n,0} − p_0| = |S_n/n − m_ρ|. From this
Let Q_n^(1) denote the distribution of S_n/n as an ℝ-valued variable and set
A = {t ∈ ℝ : |t − m_ρ| ≥ ε},  with 0 < ε < 1/2.
Certainly, A ∩ [0,1] ≠ ∅ and Q_n^(1)(A) = P{|S_n/n − m_ρ| ≥ ε} is positive for large enough n. Since m_ρ ∉ A, Q_n^(1)(A) → 0 as n → ∞.
Let us define
I^(1)(z) = z ln z + (1 − z) ln(1 − z) + ln 2,  z ∈ [0,1],
with 0 ln 0 = 0 as usual. Note that I^(1)(z) is symmetric about 1/2 and has its minimum there. The following result relates the decay of Q_n^(1)(A) to I^(1)(z).
Theorem 7.4. With the notations introduced above,
lim_{n→∞} (1/n) ln Q_n^(1)(A) = −min_{z∈A} I^(1)(z).
Comment. Since A is a closed set and m_ρ ∉ A, min_{z∈A} I^(1)(z) is strictly larger than I^(1)(m_ρ) = 0. Therefore Q_n^(1)(A) tends to zero exponentially fast as n → ∞.
Proof: S_n ranges over the set {0,1,...,n} and
P(S_n = k) = C(n,k)/2^n.
Writing A_n = {k : 0 ≤ k ≤ n, k/n ∈ A}, we have
Q_n^(1)(A) = Σ_{k∈A_n} C(n,k)/2^n,
and, since the sum contains at most n+1 terms,
max_{k∈A_n} C(n,k)/2^n ≤ Q_n^(1)(A) ≤ (n+1) max_{k∈A_n} C(n,k)/2^n.
To conclude the proof we need the following.
Lemma 7.5. The following estimate is uniform in k ≤ n:
(1/n) ln C(n,k) = −[(k/n) ln(k/n) + (1 − k/n) ln(1 − k/n)] + O(ln n / n)  as n → ∞.
Proof: For k=0 or k=n it is obviously true. From Stirling's theorem, ln n! = n ln n − n + O(ln n). Therefore
(1/n) ln C(n,k) = ln n − (k/n) ln k − (1 − k/n) ln(n − k) + O(ln n / n)
= −(k/n) ln(k/n) − (1 − k/n) ln(1 − k/n) + O(ln n / n),
so that
(1/n) ln [C(n,k)/2^n] = −I^(1)(k/n) + O(ln n / n).
Back to the theorem. As both ln n/n and ln(n+1)/n are O(ln n/n), we have
lim_{n→∞} (1/n) ln Q_n^(1)(A) = −lim_{n→∞} min_{k∈A_n} I^(1)(k/n) = −min_{z∈A} I^(1)(z).
Let us now consider the general case: S = {x_1,...,x_N} (and let the x_i be real numbers such that x_1 < ... < x_N). As mentioned above, the set P(S) of probability measures on S
is a compact subset of ℝ^N. Recall that the entropy of ν = Σ ν_i δ_{x_i} relative to ρ = Σ p_i δ_{x_i} is
S_ρ(ν) = −Σ_i ν_i ln(ν_i/p_i).
Assume that on (Ω, F) we have a probability P_ρ such that P_ρ(X_n = x_i) = p_i for all n. Let us denote by Q_n^(1) and Q_n^(2) the distributions of S_n/n and L_n with respect to P_ρ.
Let A_1 and A_2 be the Borel sets defined by
A_1 = {t ∈ ℝ : |t − m_ρ| ≥ ε},   A_2 = {ν ∈ P(S) : max_{i=1,...,N} |ν_i − p_i| ≥ ε},   0 < ε < min_i {p_i, 1 − p_i},
and define
(7.8)  S(ρ, z) = max{ S_ρ(ν) : ν ∈ P(S), Σ_i ν_i x_i = z }  for z ∈ [x_1, x_N],  and  S(ρ, z) = −∞  for z ∉ [x_1, x_N].
With these notations, the analogue of Theorem 7.4 reads:
(i)  lim_{n→∞} (1/n) ln Q_n^(1)(A_1) = max_{z∈A_1} S(ρ, z),
(ii) lim_{n→∞} (1/n) ln Q_n^(2)(A_2) = max_{ν∈A_2} S_ρ(ν).
Proof: Let us take care of (ii) to begin with. For each n and ω ∈ Ω fixed, let 1 ≤ i ≤ N and k_i = #{j ≤ n : X_j(ω) = x_i}. Then L_{n,i} = k_i/n, and L_n(·) is in A_2 if and only if k = (k_1,...,k_N) is in the set of integer vectors with Σ_i k_i = n and (k_1/n,...,k_N/n) ∈ A_2.
The multinomial analogue of Lemma 7.5 yields
(1/n) ln C(n; k) = −Σ_i (k_i/n) ln(k_i/n) + o(1),
(1/n) ln [C(n; k) Π_i p_i^{k_i}] = −Σ_i (k_i/n) ln((k_i/n)/p_i) + o(1),
where C(n; k) denotes the multinomial coefficient.
Noticing now that in the sum defining Q_n^(2)(A_2) there are fewer than (n+1)^N terms, we can proceed as in the proof of Theorem 7.4 to obtain
(1/n) ln Q_n^(2)(A_2) = max{ S_ρ(k/n) : k/n ∈ A_2 } + o(1),
and therefore
lim_{n→∞} (1/n) ln Q_n^(2)(A_2) = max_{ν∈A_2} S_ρ(ν).
To get the result we want for (i) we need to compute the right-hand side. Note to begin with that
max{ S_ρ(ν) : ν ∈ P(S), Σ_i ν_i x_i ∈ A_1 } = max_{z ∈ A_1 ∩ [x_1, x_N]} S(ρ, z);
note now that we defined S(ρ, z) = −∞ whenever z is not in [x_1, x_N]. Therefore
lim_{n→∞} (1/n) ln Q_n^(1)(A_1) = max_{z∈A_1} S(ρ, z).
Comments. Since max{S(ρ,z) : z ∈ A_1} is negative, this theorem asserts that for n large, the probability that a microscopic configuration is such that the empirical mean ∫ x L_n(dx) differs a little bit from the actual mean of X_1 is exponentially small. See the corresponding chapter of [7.1] for more on this and [7.3] for a variation on the theme.
REFERENCES
[7.1] Ellis, R. S. "Entropy, large deviations and statistical mechanics". Springer Verlag,
Berlin, 1985.
[7.2] Bucklew, J. A. "Large deviation techniques in decision, simulation, and estimation"
John Wiley & Sons, New York, 1990.
[7.3] Robert, C. "An entropy concentration theorem: applications in artificial intelligence and descriptive statistics". Jour. Appl. Prob. 27, pp. 303-313, 1990.
Chapter 8
Given a positive martingale {M_n} relative to a filtration {F_n}, define a new expectation by
(8.1)  E_M[H] = E[H M_n]
for every bounded H in F_n. The martingale property provides us with consistency for (8.1). To have (8.1) for any bounded H in F, just approximate H by an appropriate sequence H_n.
Consider now a collection {X_n : n ≥ 1} of independent, identically distributed random variables such that
D(λ) = {λ ∈ ℝ : E[exp(−λX_1)] < ∞}
has a nonempty interior. A standard convexity argument implies that D(λ) is an interval containing 0, and the comments in the appendix to chapter 5 tell us that a(λ) = −d ln Z(λ)/dλ is a differentiable bijection between int(D(λ)) and int(conv(range X_1)). That is, for each a ∈ int(conv(range X_1)) there is a λ_a ∈ int(D(λ)) such that a = −d ln Z/dλ(λ_a).
Notice now that M_n = exp(−λS_n)/Z(λ)^n is a positive martingale, where S_n = Σ X_k, and that
E_M(X_1) = E[X_1 e^{−λX_1}]/Z(λ).
In the probabilistic literature M_n is called Wald's martingale and the change of measure in (8.1) is
the discrete time analogue of the Cameron-Martin-Girsanov transformation employed in sections
6.10 and 6.11. This setup can be greatly generalized, but let us not do it here. Let us just prove
Lemma 8.2. Let G be an integrable function, let M_n be any positive martingale and let P_M be defined as in (8.1). Then, for G in F_n,
E_M[G|S_n] = E[G M_n|S_n]/M_n.
In particular, for Wald's martingale M_n = exp(−λS_n)/Z(λ)^n, which is a function of S_n,
E[G|S_n] = E_M[G|S_n].
Since P_M(|S_n/n − a| ≥ ε) behaves as exp(−nK(ε)) for an appropriate K(ε) (see Theorem 7.4), the result we want follows.
This result is extended in
Theorem 8.5. Let {X_n : n ≥ 1} denote a sequence of real valued, independent, identically distributed random variables. Let U: ℝ → ℝ be a bounded measurable function and h: ℝ → ℝ be such that D(λ) = {λ ∈ ℝ : Z(λ) = E[exp(−λh(X_1))] < ∞} has a nonempty interior. Let C denote the closure of the convex set generated by the range of h(X_1). Choose λ ∈ int D(λ) such that E_M[h(X_1)] = a for a ∈ int C, and put S_n = Σ h(X_k). Then
lim_{n→∞} E_M[U(X_1) g(S_n/n)] = E_M[U(X_1)] g(a)
for suitable bounded continuous g. Writing g through its Fourier representation ĝ,
E_M[U(X_1) g(S_n/n)] = ∫ ĝ(ξ) E_M[U(X_1) exp(−iξ S_n/n)] dξ.
Now, the h(X_n) are still independent relative to P_M and the expectation under the integral sign can be further computed as
E_M[U(X_1) exp(−iξ h(X_1)/n)] exp{(n−1)[K(λ + iξ/n) − K(λ)]},
where we are again using K(λ) = ln Z(λ). The way we choose λ provides us with the approximation exp{(n−1)[K(λ + iξ/n) − K(λ)]} ≈ exp(−iξ a), so that, letting n → ∞ and integrating against ĝ, the above
= E_M[U(X_1)] g(a).
To conclude the proof, all you need to know is that when regular conditional probabilities exist, then
E_M[U(X_1)|S_n/n = a] = lim_k E_M[U(X_1) g_k(S_n/n)]/E_M[g_k(S_n/n)]
for an appropriate sequence of functions g_k concentrating at a.
Nice, huh? The natural, and obvious, interpretation is that conditioning with respect to {S_n/n = a} concentrates the probability on the (tilted) distribution giving the value a to the mean of h(X_1).
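A minimal sketch (not from the book) of the exponential change of measure behind Wald's martingale: choose λ so that under the tilted law P_M the mean of h(X_1) equals a. The law of X_1, the choice h(t) = t and the target a below are assumptions.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)     # samples of X_1 (mean 1 under P, assumed)
a = 2.0                                          # target value for E_M[h(X_1)] (assumed)

def tilted_mean(lam):
    w = np.exp(-lam * x)                         # proportional to dP_M/dP = e^{-lam h}/Z(lam)
    return float(np.sum(x * w) / np.sum(w))

lam = brentq(lambda l: tilted_mean(l) - a, -0.9, 0.5)    # here lam is about -0.5
w = np.exp(-lam * x); w /= w.sum()
print("lambda =", round(lam, 4))
print("E_M[h(X_1)] =", round(float(np.sum(x * w)), 4))          # should be close to a
print("E_M[U(X_1)] for U = sin:", round(float(np.sum(np.sin(x) * w)), 4))
```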
REFERENCES
Chapter 9
This chapter consists of sections not bearing a relation to each other, but all related to statistics.
Let Φ_i: ℝ → ℝ, i = 1,...,K, be measurable functions and F_0(x) be the guess we make about the unknown distribution function F(x) of a random variable X. Assume that
Z(λ) = ∫ e^{−⟨λ,Φ(x)⟩} dF_0(x)
is finite on ℝ^K. The density dF/dF_0(x) minimizing K(F,F_0) = −S_{F_0}(F) over the set of distribution functions {F : dF ≪ dF_0, E_F(Φ) = c} is given by
dF/dF_0(x) = exp(−⟨λ, Φ(x)⟩)/Z(λ).
That much is old news. In [9.1], from which we are quoting, Campbell observed the following. Assume there is a measure m(dx) on ℝ with respect to which dF and dF_0 have densities f(x,c) and f_0(x) respectively.
If we put Φ_0(x) = 1 and define λ_0 = ln Z(λ), then, given n measurements x_1,...,x_n of a random variable X distributed according to (9.1.2), i.e. according to
f(x,c) = f_0(x) exp(−[⟨λ(c), Φ(x)⟩ + λ_0]),
we have
Lemma 9.1.4. With the notations introduced above, the maximum likelihood estimators of the c_k are
(9.1.5)  ĉ_k = (1/n) Σ_{i=1}^n Φ_k(x_i).
Proof: Just form ln Π_i f(x_i, c), differentiate (9.1.3) with respect to c_k, equate the derivative to zero, and use (9.1.2) to obtain (9.1.5).
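A minimal sketch (not from the book): for an exponential family f(x,c) = f_0(x) exp(−⟨λ,Φ(x)⟩ − λ_0) with f_0 uniform on [0,1] and Φ(x) = (x, x²) (both assumptions), maximum likelihood reduces to matching the model moments with the sample averages (1/n) Σ Φ_k(x_i), exactly as Lemma 9.1.4 states.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
data = rng.beta(2.0, 5.0, size=5000)                  # observed sample (assumed)
grid = np.linspace(0.0, 1.0, 2001)
dx = grid[1] - grid[0]
Phi = np.vstack([grid, grid**2])
c_hat = np.array([data.mean(), (data**2).mean()])     # (9.1.5): sample means of Phi_k

# Minus the per-sample log-likelihood equals <lambda, c_hat> + lambda_0 up to a constant.
neg_loglik = lambda lam: float(lam @ c_hat + np.log(np.sum(np.exp(-lam @ Phi)) * dx))
lam = minimize(neg_loglik, np.zeros(2), method="Nelder-Mead").x
dens = np.exp(-lam @ Phi); dens /= np.sum(dens) * dx
print("sample moments:", c_hat)
print("model moments at the MLE:", [float(np.sum(dens * grid**k) * dx) for k in (1, 2)])
```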
Before going to the converse, we shall recall (the generalization of) Gauss' method. Again let us assume that the unknown distribution (with respect to a measure m(dx)) of a real valued random variable is f(x,c). The experimenter knows a sample x_1,...,x_n and f_0(x) = f(x,c_0) for some c_0. Gauss' method is based on assuming that
i) the right f(x,c) corresponds to the value of c maximizing the likelihood
(9.1.6)  ln Π_i f(x_i, c);
ii) the estimators of the c_k are given by
(9.1.8)  c_k = (1/n) Σ_{i=1}^n Φ_k(x_i),  k = 1,2,...,K,
as in (9.1.5). If we think of x = (x_1,...,x_n) as the coordinates of a point in ℝ^n, then both (9.1.8) and (9.1.5), rewritten as
determine the same value of c; this is the key point: they determine the same surface in ℝ^n. This means that each normal to (9.1.5) is a linear combination of the normals to (9.1.9). That is, for each 1 ≤ i ≤ n,
(∂/∂x_i) ln f(x_i, c) = Σ_k a_k(c) (∂/∂x_i) (Φ_k(x_i) − c_k).
Since the functions involved depend only on one coordinate x_i, we can give it any generic name, let it be x. Now, integrating both sides of the last identity with respect to x, and noticing that (9.1.8) and (9.1.9) have to hold, it is clear that the integration constants have to be such that
(9.1.10)  ln f(x,c) = Σ_k a_k(c)(Φ_k(x) − c_k).
Differentiating (9.1.10) with respect to the c_i and bringing in the assumption of the independence of the set 1, Φ_1(x), ..., Φ_K(x), we obtain two conditions on the coefficients. The first implies that for each 1 ≤ i ≤ K there is a λ_i(c) such that a_i = −λ_i(c), and the second implies that there is a function V(c) such that λ_i = ∂V/∂c_i. Now, we can rewrite (9.1.10) as
ln f(x,c) = −Σ_i λ_i(c)(Φ_i(x) − c_i) + V.
Integrating both sides along any curve joining c_0 to c, and since f_0(x) = f(x,c_0), we obtain
f(x,c) = f_0(x) exp(−[⟨λ(c), Φ(x)⟩ + λ_0]),
where the identification of λ_0 with all remaining constants in the exponent is clear. Since the left hand side is to be normalized, it is clear that exp λ_0 = Z(λ), as above. This concludes Campbell's proof.
Notice one interesting consequence of the extension of Gauss' method. When c is given according to (9.1.5), when f(x,c) is the right distribution, and when n tends to infinity, then
(1/n) Σ_{i=1}^n Φ_k(x_i) → ∫ Φ_k(x) dF(x).
Thus, maximum likelihood makes the entropy functional easy to accept, at least for
statisticians. And for them, the maximum entropy method for looking for distribution functions
must also be natural. The gist, or the crux, of the problem of characterizing distributions becomes identical with the issue of choosing a family {Φ_k(x), k=1,...,K} to characterize the parameters of
the distribution.
In the next section we shall see how the notion of sufficiency anticipated Campbell's results from a different point of view.
9.2 Sufficiency.
Here we shall refer to the second chapter of [4.1] and to the second and third chapters of [0.2]. Recall that if the probability P on (Ω, F) has a density f with respect to a measure μ on (Ω, F), then the restriction of P to a sub-σ-algebra G has a density E_μ[f|G] with respect to the restriction of μ to G.
We defined, for P, Q ∈ P(Ω) and μ ∈ M(Ω), G to be sufficient whenever
(dP/dμ)/E_μ[dP/dμ|G] = (dQ/dμ)/E_μ[dQ/dμ|G]
holds a.e. μ.
In the setup of Lemma 4.9, G = σ(Φ) for Φ: (Ω, F) → (Ω', F'), P' = P∘Φ^(-1) and μ' = μ∘Φ^(-1); we note that E_μ[dP/dμ|Φ] = (dP'/dμ')∘Φ, etc., therefore the sufficiency condition becomes
Consider now the following setup. Let X: Ω → (E, ℰ) be an E-valued random variable, let Q be a probability on (Ω, F) and m be a measure on ℝ such that
Q(X ∈ A) = ∫_A ρ(ξ) m(dξ),
and assume that Z(λ) = E_Q[exp(−⟨λ, Φ(X)⟩)] is defined for λ in some convex set D ⊂ ℝ^K. Denote by P(λ) the measure on Ω with density dP(λ)/dQ = exp(−⟨λ, Φ(X)⟩)/Z(λ), and let θ(λ) be given by
(9.2.3)  θ(λ) = E_{P(λ)}[Φ(X)] = −∇_λ ln Z(λ).
Then
θ̂ = (1/N) Σ_{i=1}^N Φ(X_i)
is a sufficient statistic for K(P_N, Q_N) = N K(P(λ), Q).
Thus, supplementing this result with the uniqueness of the correspondence λ → θ(λ) contained in (9.2.3) neatly rounds up a bunch of ideas.
We shall present some very elementary results about the Bayesian approach to density
estimation. For more the reader should take a look at [9.2] or [9.3].
When considering a parametric family p(x,θ) it proves convenient to think of θ as the values of a random variable about which we know an a priori distribution G(θ). We want to estimate θ given the values X_1,...,X_n of a random variable X having distribution function p(x|θ).
To produce the Bayesian estimator of θ the following procedure is applied:
i) The posterior distribution of θ given the observations x_1,...,x_n is
h(θ|x_1,...,x_n) = p(x_1,...,x_n|θ) g(θ) / ∫ p(x_1,...,x_n|θ′) g(θ′) ν(dθ′).
ii) Given a loss function L(θ̂,θ), the risk of an estimator θ̂ = θ̂(x_1,...,x_n) is
R(θ̂,θ) = ∫ L(θ̂,θ) p(x_1,...,x_n|θ) dx_1...dx_n,
and the Bayes risk is
R(θ̂) = ∫ R(θ̂,θ) g(θ) ν(dθ).
The Bayes estimator θ_B is the estimator θ̂ which minimizes R(θ̂). When L(θ̂,θ) is convex in the first variable, then R(θ̂) is a convex functional of θ̂ and we can do variational analysis. For full details see [9.2].
It is reasonably easy to verify that θ_B is the estimator at which the a posteriori risk
R(θ̂|x_1,...,x_n) = ∫ L(θ̂,θ) h(θ|x_1,...,x_n) ν(dθ)
reaches its minimum.
The problem is then how to choose g(θ). In [9.2] there are a few examples in which g(θ) is chosen as the density (with respect to ν(dθ)) which maximizes
S_ν(g) = −∫ g(θ) ln g(θ) ν(dθ)
subject to
∫ g(θ) ν(dθ) = 1.
This is a level 1 maxentropic reconstruction problem about which you have heard
enough by now. We present instead an example missing in [9.2] which is a (minor) variation on
the theme of section 9 of [9.3].
Assume that we somehow know that P(θ) is a member of a family of densities p(θ,a) on T×A, A being some countable set, on which we have another a priori density w(θ,a), chosen perhaps according to some invariance principle (or God-given, to cut short the regression). Assume that on the basis of an observation of a we decide to look for the distribution P(θ,a) concentrated on an a maximizing the relative entropy (or Kullback distance)
REFERENCES